Neuropsych reports will compare your scores against published normative data with known sample sizes and demographic corrections. Scoring is deterministic. The language model is constrained to writing prose around numbers that have already been computed and citations that have already been retrieved. This page describes the architecture we are building and the data behind it.
Report generation follows four stages. The first three are deterministic. The language model only enters at the end, after every number is computed and every citation is retrieved. The normative database (stage 1) is built and growing. Stages 2-4 are in development.
Normative tables from published papers and national health surveys, extracted into a structured, queryable format. Each entry links back to a study with a known sample size, age range, and demographics. When the engine needs to know what "average" looks like for a 34-year-old with a college degree, it looks it up. No estimation.
Pure code. Raw score plus demographics go in. Percentile, standard score, and classification come out. The engine will pick the right normative table for your age, education, and locale. Where we have raw participant data, we use continuous norming for smooth percentile curves instead of rough bins.
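A minimal sketch of what this stage amounts to, assuming a simple table layout: the names, table schema, and numbers here are illustrative, not the engine's actual data.

```python
import math

# Hypothetical norm table: (min_age, max_age, education) -> (mean, sd)
# for one test's raw score. Illustrative values only.
NORM_TABLE = {
    (30, 39, "college"): (52.0, 8.0),
}

def classify(standard_score: float) -> str:
    """Wechsler-style classification bands (M=100, SD=15)."""
    bands = [(130, "Very Superior"), (120, "Superior"), (110, "High Average"),
             (90, "Average"), (80, "Low Average"), (70, "Borderline")]
    for cutoff, label in bands:
        if standard_score >= cutoff:
            return label
    return "Extremely Low"

def score(raw: float, age: int, education: str) -> dict:
    # Look up the matching normative cell, then convert deterministically.
    mean, sd = next(v for (lo, hi, edu), v in NORM_TABLE.items()
                    if lo <= age <= hi and edu == education)
    z = (raw - mean) / sd
    percentile = 50 * (1 + math.erf(z / math.sqrt(2)))  # normal CDF x 100
    standard = 100 + 15 * z
    return {"z": z, "percentile": percentile,
            "standard": standard, "classification": classify(standard)}

result = score(raw=60.0, age=34, education="college")
# z = (60 - 52) / 8 = 1.0 -> standard 115, "High Average", ~84th percentile
```

The same inputs always produce the same outputs; there is no model in the loop at this stage.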
220 papers in the research corpus. The engine will take your score pattern and pull the specific research that applies. Low digit span with normal processing speed? It finds the papers that discuss that pattern. Every retrieved passage gets cited in the report.
The language model receives verified scores, retrieved research, and a clinical template. It writes a readable narrative. It cannot change a percentile. It cannot add claims without a retrieved source. Calculation happened two stages ago.
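In code terms, the generation stage only assembles inputs computed upstream; this sketch is a hypothetical illustration of that boundary (function names and template are assumptions, not the actual implementation).

```python
# The model fills prose around values computed two stages earlier.
# It receives scores and citations as fixed inputs; it computes nothing.
def build_prompt(scores: dict, passages: list, template: str) -> str:
    citations = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    facts = "\n".join(f"- {test}: {s['percentile']}th percentile "
                      f"({s['classification']})"
                      for test, s in scores.items())
    return template.format(facts=facts, citations=citations)

prompt = build_prompt(
    scores={"digit_span": {"percentile": 16, "classification": "Low Average"}},
    passages=["Low digit span with normal processing speed ..."],
    template="Scores (do not alter):\n{facts}\n\nSources:\n{citations}\n",
)
```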
Every comparison in a report traces to one of these published sources. Click the study name to go to the original paper or dataset.
Computerized cognitive battery from Northwestern / NIH. Normed to the 2020 U.S. Census.
Online cognitive testing from Harvard/McLean Hospital. Open dataset with age, gender, education.
National Health and Nutrition Examination Survey. Nationally representative, stratified by demographics.
Largest published normative study for computerized Tower of London. Age, sex, education stratified.
Computerized digit span with a better scoring metric than traditional methods. Hardware-calibrated reaction time.
The most-cited norms for trail making and verbal fluency. Age × education percentile tables.
Victoria Stroop normative study. Scaled scores by age for dots, color-word, and interference conditions.
Combined data on approximate number system acuity. Weber fraction norms across the lifespan.
The main Spanish normative project. Spain-specific norms for classic neuropsychological tests. Used for Spanish-language reports.
Also: Kessels et al. (Corsi Blocks), Unsworth et al. (Operation Span), Siegler (Number Line). Full citations in each report.
Verbal fluency norms from Boston don't apply in Madrid. Different letter frequencies, different educational baselines, different cultural context. The engine selects the correct norm set based on what language you tested in and where.
Nonverbal tests (matrix reasoning, reaction time, dot comparison) can share norms across locales. Verbal tests always use locale-specific norms. Spanish verbal fluency uses P-M-R letters, not the English F-A-S, because the letter frequencies are different.
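The selection rule above is simple enough to sketch directly; the test names and locale codes here are assumptions, not the engine's actual identifiers.

```python
# Nonverbal tests share one cross-locale norm set; verbal tests always
# resolve to the language they were administered in.
NONVERBAL_TESTS = {"matrix_reasoning", "reaction_time", "dot_comparison"}

# Fluency letters differ by language because letter frequencies differ.
FLUENCY_LETTERS = {"en": ["F", "A", "S"], "es": ["P", "M", "R"]}

def norm_locale(test: str, test_language: str) -> str:
    return "shared" if test in NONVERBAL_TESTS else test_language

print(norm_locale("matrix_reasoning", "es"))  # shared norms apply
print(norm_locale("verbal_fluency", "es"))    # locale-specific: es
```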
The limits matter as much as the features.
Percentiles, standard scores, and classifications are computed by code. The language model never produces a number. It gets them as inputs.
If a clinical statement appears in the report, it cites a paper from the indexed corpus. No citation, no claim.
Reports flag patterns and suggest follow-up. They say "this is consistent with" not "you have." Screening, not diagnosis.
Test responses stay in your browser during the assessment. Report generation requires sending scores to the server. We don't train on your data.
The retrieval system only surfaces papers from our indexed corpus. General AI hallucinates citations confidently. We can't.
The report tells you which study was used, its sample size, and what demographic corrections were applied. You or your clinician can check.
Where we have raw participant-level data (TestMyBrain, NHANES), we apply continuous norming with Generalized Additive Models, following Timmerman et al. (2021) and the NIH Toolbox V3 approach. This produces smooth percentile curves across age instead of binned tables, which reduces misclassification at bin boundaries.
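Full GAMLSS-style continuous norming is more than a snippet can carry, but the idea can be sketched with simulated data: fit smooth age trends for the mean and SD (here a cubic polynomial stands in for the penalized GAM smoothers the real pipeline uses), then read any percentile off the fitted curves rather than a bin.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
# Simulated participant-level data: scores decline smoothly with age.
age = rng.uniform(18, 80, 2000)
raw = 55 - 0.25 * age + rng.normal(0, 6, age.size)

# Smooth age trends for mean and SD (cubic fits as a stand-in for GAM
# smoothers). E|resid| = sd * sqrt(2/pi), hence the correction factor.
mean_coef = np.polyfit(age, raw, 3)
resid = raw - np.polyval(mean_coef, age)
sd_coef = np.polyfit(age, np.abs(resid) * sqrt(np.pi / 2), 3)

def continuous_percentile(score: float, a: float) -> float:
    mu = np.polyval(mean_coef, a)
    sigma = np.polyval(sd_coef, a)
    z = (score - mu) / sigma
    return 50 * (1 + erf(z / sqrt(2)))  # normal CDF x 100

# The same raw score at ages 64 and 65 gets nearly identical percentiles:
# no jump at an arbitrary bin boundary.
p64 = continuous_percentile(40.0, 64)
p65 = continuous_percentile(40.0, 65)
```

With 10-year bins, a participant one month past a boundary would be compared against a noticeably different reference group; the smooth curves remove that discontinuity.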
Where we only have published summary statistics (means and SDs), we use the original conversion tables with the demographic corrections the study authors specified.
Reports include percentile ranks, z-scores, T-scores (M=50, SD=10), standard scores (M=100, SD=15), scaled scores (M=10, SD=3), and Wechsler-system classifications (Average, Low Average, Borderline, etc.) depending on the test and source.
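All of these scales are linear transforms of the same z-score, so the conversions are standard formulas; which scale a given report shows depends on the test and its source norms.

```python
import math

def from_z(z: float) -> dict:
    """Convert a z-score to the common reporting scales."""
    return {
        "percentile": round(50 * (1 + math.erf(z / math.sqrt(2))), 1),
        "t_score": 50 + 10 * z,    # M=50,  SD=10
        "standard": 100 + 15 * z,  # M=100, SD=15
        "scaled": 10 + 3 * z,      # M=10,  SD=3
    }

converted = from_z(-1.0)
# -> T=40, standard=85, scaled=7, ~15.9th percentile
```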
The research corpus is split into semantic chunks and indexed with vector embeddings. At report time, the engine queries the index with the user's score pattern and retrieves the most relevant interpretation passages. The language model receives these as context. It can summarize what the research says. It cannot add claims beyond what was retrieved.
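The retrieval step can be illustrated with a toy version: a bag-of-words vector and cosine similarity stand in for the real semantic embeddings, over a hypothetical two-passage corpus (the passages are invented for illustration, not quotes from the actual corpus).

```python
import math
from collections import Counter

CORPUS = [
    "Low digit span with preserved processing speed suggests a specific "
    "working memory weakness rather than generalized slowing",
    "Slowed trail making with intact span is associated with processing "
    "speed deficits",
]

def embed(text: str) -> Counter:
    # Bag-of-words stand-in for a learned embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 1) -> list:
    # Rank corpus passages by similarity to the score-pattern query.
    q = embed(query)
    ranked = sorted(CORPUS, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

hits = retrieve("low digit span with normal processing speed")
```

The query built from the user's score pattern pulls back the passage about the span/speed dissociation; only retrieved passages reach the language model as context.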
Some of our tests run in a browser while the normative data comes from in-person administration (Tombaugh TMT, Troyer Stroop). The report flags this. Where computerized norms exist (NIH Toolbox, Woods digit span, Gutenberg Tower of London), we use those exclusively. We don't quietly apply paper norms to a computerized test.
All normative sources are from published, peer-reviewed studies or public government health surveys. No proprietary databases. Source data is available for inspection by research collaborators.