Neuropsych reports will compare your scores against published normative data with known sample sizes and demographic corrections. Scoring is deterministic. The language model is constrained to writing prose around numbers that have already been computed and citations that have already been retrieved. This page describes the architecture we are building and the data behind it.
Report generation follows four stages. The first three are deterministic. The language model only enters at the end, after every number is computed and every citation is retrieved. The normative database (stage 1) is built and growing. Stages 2-4 are in development.
Normative tables from published papers and national health surveys, extracted into a structured, queryable format. Each entry links back to a study with a known sample size, age range, and demographics. When the engine needs to know what "average" looks like for a 34-year-old with a college degree, it looks it up. No estimation.
Pure code. Raw score plus demographics go in. Percentile, standard score, and classification come out. The engine will pick the right normative table for your age, education, and locale. Where we have raw participant data, we use continuous norming for smooth percentile curves instead of rough bins.
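A minimal sketch of what this stage amounts to, assuming a simple table layout: the names, table schema, and numbers here are illustrative, not the engine's actual data.

```python
import math

# Hypothetical norm table: (min_age, max_age, education) -> (mean, sd)
# for one test's raw score. Illustrative values only.
NORM_TABLE = {
    (30, 39, "college"): (52.0, 8.0),
}

def classify(standard_score: float) -> str:
    """Wechsler-style classification bands (M=100, SD=15)."""
    bands = [(130, "Very Superior"), (120, "Superior"), (110, "High Average"),
             (90, "Average"), (80, "Low Average"), (70, "Borderline")]
    for cutoff, label in bands:
        if standard_score >= cutoff:
            return label
    return "Extremely Low"

def score(raw: float, age: int, education: str) -> dict:
    # Look up the matching normative cell, then convert deterministically.
    mean, sd = next(v for (lo, hi, edu), v in NORM_TABLE.items()
                    if lo <= age <= hi and edu == education)
    z = (raw - mean) / sd
    percentile = 50 * (1 + math.erf(z / math.sqrt(2)))  # normal CDF x 100
    standard = 100 + 15 * z
    return {"z": z, "percentile": percentile,
            "standard": standard, "classification": classify(standard)}

result = score(raw=60.0, age=34, education="college")
# z = (60 - 52) / 8 = 1.0 -> standard 115, "High Average", ~84th percentile
```

The same inputs always produce the same outputs; there is no model in the loop at this stage.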
220 papers in the research corpus. The engine will take your score pattern and pull the specific research that applies. Low digit span with normal processing speed? It finds the papers that discuss that pattern. Every retrieved passage gets cited in the report.
The language model receives verified scores, retrieved research, and a clinical template. It writes a readable narrative. It cannot change a percentile. It cannot add claims without a retrieved source. Calculation happened two stages ago.
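In code terms, the generation stage only assembles inputs computed upstream; this sketch is a hypothetical illustration of that boundary (function names and template are assumptions, not the actual implementation).

```python
# The model fills prose around values computed two stages earlier.
# It receives scores and citations as fixed inputs; it computes nothing.
def build_prompt(scores: dict, passages: list, template: str) -> str:
    citations = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    facts = "\n".join(f"- {test}: {s['percentile']}th percentile "
                      f"({s['classification']})"
                      for test, s in scores.items())
    return template.format(facts=facts, citations=citations)

prompt = build_prompt(
    scores={"digit_span": {"percentile": 16, "classification": "Low Average"}},
    passages=["Low digit span with normal processing speed ..."],
    template="Scores (do not alter):\n{facts}\n\nSources:\n{citations}\n",
)
```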
Every comparison in a report traces to one of these published sources. Click the study name to go to the original paper or dataset.
Computerized cognitive battery from Northwestern / NIH. Normed to the 2020 U.S. Census.
Online cognitive testing from Harvard/McLean Hospital. Open dataset with age, gender, education.
National Health and Nutrition Examination Survey. Nationally representative, stratified by demographics.
Largest published normative study for computerized Tower of London. Age, sex, education stratified.
Computerized digit span with a better scoring metric than traditional methods. Hardware-calibrated reaction time.
The most-cited norms for trail making and verbal fluency. Age × education percentile tables.
Victoria Stroop normative study. Scaled scores by age for dots, color-word, and interference conditions.
Combined data on approximate number system acuity. Weber fraction norms across the lifespan.
The main Spanish normative project. Spain-specific norms for classic neuropsychological tests. Used for Spanish-language reports.
Also: Kessels et al. (Corsi Blocks), Unsworth et al. (Operation Span), Siegler (Number Line). Full citations in each report.
Verbal fluency norms from Boston don't apply in Madrid. Different letter frequencies, different educational baselines, different cultural context. The engine selects the correct norm set based on what language you tested in and where.
Nonverbal tests (matrix reasoning, reaction time, dot comparison) can share norms across locales. Verbal tests always use locale-specific norms. Spanish verbal fluency uses P-M-R letters, not the English F-A-S, because the letter frequencies are different.
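The selection rule above is simple enough to sketch directly; the test names and locale codes here are assumptions, not the engine's actual identifiers.

```python
# Nonverbal tests share one cross-locale norm set; verbal tests always
# resolve to the language they were administered in.
NONVERBAL_TESTS = {"matrix_reasoning", "reaction_time", "dot_comparison"}

# Fluency letters differ by language because letter frequencies differ.
FLUENCY_LETTERS = {"en": ["F", "A", "S"], "es": ["P", "M", "R"]}

def norm_locale(test: str, test_language: str) -> str:
    return "shared" if test in NONVERBAL_TESTS else test_language

print(norm_locale("matrix_reasoning", "es"))  # shared norms apply
print(norm_locale("verbal_fluency", "es"))    # locale-specific: es
```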
The limits matter as much as the features.
Percentiles, standard scores, and classifications are computed by code. The language model never produces a number. It gets them as inputs.
If a clinical statement appears in the report, it cites a paper from the indexed corpus. No citation, no claim.
Reports flag patterns and suggest follow-up. They say "this is consistent with" not "you have." Screening, not diagnosis.
Test responses stay in your browser during the assessment. Report generation requires sending scores to the server. We don't train on your data.
The retrieval system only surfaces papers from our indexed corpus. General AI hallucinates citations confidently. We can't.
The report tells you which study was used, its sample size, and what demographic corrections were applied. You or your clinician can check.
Where we have raw participant-level data (TestMyBrain, NHANES), we apply continuous norming with Generalized Additive Models, following Timmerman et al. (2021) and the NIH Toolbox V3 approach. This produces smooth percentile curves across age instead of binned tables, which reduces misclassification at bin boundaries.
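Full GAMLSS-style continuous norming is more than a snippet can carry, but the idea can be sketched with simulated data: fit smooth age trends for the mean and SD (here a cubic polynomial stands in for the penalized GAM smoothers the real pipeline uses), then read any percentile off the fitted curves rather than a bin.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
# Simulated participant-level data: scores decline smoothly with age.
age = rng.uniform(18, 80, 2000)
raw = 55 - 0.25 * age + rng.normal(0, 6, age.size)

# Smooth age trends for mean and SD (cubic fits as a stand-in for GAM
# smoothers). E|resid| = sd * sqrt(2/pi), hence the correction factor.
mean_coef = np.polyfit(age, raw, 3)
resid = raw - np.polyval(mean_coef, age)
sd_coef = np.polyfit(age, np.abs(resid) * sqrt(np.pi / 2), 3)

def continuous_percentile(score: float, a: float) -> float:
    mu = np.polyval(mean_coef, a)
    sigma = np.polyval(sd_coef, a)
    z = (score - mu) / sigma
    return 50 * (1 + erf(z / sqrt(2)))  # normal CDF x 100

# The same raw score at ages 64 and 65 gets nearly identical percentiles:
# no jump at an arbitrary bin boundary.
p64 = continuous_percentile(40.0, 64)
p65 = continuous_percentile(40.0, 65)
```

With 10-year bins, a participant one month past a boundary would be compared against a noticeably different reference group; the smooth curves remove that discontinuity.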
Where we only have published summary statistics (means and SDs), we use the original conversion tables with the demographic corrections the study authors specified.
Reports include percentile ranks, z-scores, T-scores (M=50, SD=10), standard scores (M=100, SD=15), scaled scores (M=10, SD=3), and Wechsler-system classifications (Average, Low Average, Borderline, etc.) depending on the test and source.
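All of these scales are linear transforms of the same z-score, so the conversions are standard formulas; which scale a given report shows depends on the test and its source norms.

```python
import math

def from_z(z: float) -> dict:
    """Convert a z-score to the common reporting scales."""
    return {
        "percentile": round(50 * (1 + math.erf(z / math.sqrt(2))), 1),
        "t_score": 50 + 10 * z,    # M=50,  SD=10
        "standard": 100 + 15 * z,  # M=100, SD=15
        "scaled": 10 + 3 * z,      # M=10,  SD=3
    }

converted = from_z(-1.0)
# -> T=40, standard=85, scaled=7, ~15.9th percentile
```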
The research corpus is split into semantic chunks and indexed with vector embeddings. At report time, the engine queries the index with the user's score pattern and retrieves the most relevant interpretation passages. The language model receives these as context. It can summarize what the research says. It cannot add claims beyond what was retrieved.
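The retrieval step can be illustrated with a toy version: a bag-of-words vector and cosine similarity stand in for the real semantic embeddings, over a hypothetical two-passage corpus (the passages are invented for illustration, not quotes from the actual corpus).

```python
import math
from collections import Counter

CORPUS = [
    "Low digit span with preserved processing speed suggests a specific "
    "working memory weakness rather than generalized slowing",
    "Slowed trail making with intact span is associated with processing "
    "speed deficits",
]

def embed(text: str) -> Counter:
    # Bag-of-words stand-in for a learned embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 1) -> list:
    # Rank corpus passages by similarity to the score-pattern query.
    q = embed(query)
    ranked = sorted(CORPUS, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

hits = retrieve("low digit span with normal processing speed")
```

The query built from the user's score pattern pulls back the passage about the span/speed dissociation; only retrieved passages reach the language model as context.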
Some of our tests run in a browser while the normative data comes from in-person administration (Tombaugh TMT, Troyer Stroop). The report flags this. Where computerized norms exist (NIH Toolbox, Woods digit span, Gutenberg Tower of London), we use those exclusively. We don't quietly apply paper norms to a computerized test.
All normative sources are from published, peer-reviewed studies or public government health surveys. No proprietary databases. Source data is available for inspection by research collaborators.