
Tags: benchmark · usmle · llm-evaluation

Frontier LLM Performance on USMLE-Style Medical Questions

Zero-shot evaluation of 7 models across 40 board-style questions

Complement Research

Abstract. We evaluate 7 frontier language models on 40 original USMLE-style Step 1 and Step 2 CK questions, generated by Claude Opus 4.6 and reviewed for clinical accuracy, under zero-shot, temperature-0.0 conditions.

1. Introduction

The USMLE is a three-step examination required for medical licensure in the United States. Step 1 tests foundational science knowledge; Step 2 CK (Clinical Knowledge) tests the application of that knowledge to clinical reasoning. Performance on these exams has become a standard benchmark for evaluating medical knowledge in language models, following early demonstrations that GPT-4 could score above the passing threshold.

We constructed a set of 40 original USMLE-style questions spanning 19 topic categories across both Step 1 and Step 2 CK. Unlike repurposed question-bank items, these questions were newly generated by Claude Opus 4.6 and reviewed for clinical accuracy; because they do not appear in any existing question bank, the risk of contamination from training data is reduced.

2. Methods

  • Question generation. 40 multiple-choice questions (5 options each) generated by Claude Opus 4.6, spanning Step 1 and Step 2 CK across 19 USMLE categories.
  • Prompting. Zero-shot: each model receives the question stem, lead-in, and options with no examples, no chain-of-thought instructions, and no system prompt (a sketch of this setup follows the list).
  • Temperature. All models evaluated at temperature 0.0 for reproducibility.
  • Confidence intervals. Wilson score intervals at 95% confidence (an implementation sketch follows the list).
  • Human baseline. Medical students take the same 40-question challenge under timed conditions via the interactive exam interface.
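
The exact prompt template and grading harness are not published; the sketch below shows one way such a zero-shot query could be assembled and scored. The question record, its field names, and the query_model call are illustrative assumptions, not the actual code.

```python
import re

# Hypothetical question record; field names are assumptions, not the real schema.
question = {
    "stem": "A 61-year-old man presents with crushing substernal chest pain radiating to the left arm.",
    "lead_in": "Which of the following is the most likely diagnosis?",
    "options": ["Aortic dissection", "Acute myocardial infarction", "Pericarditis",
                "Pulmonary embolism", "Gastroesophageal reflux disease"],
    "answer": "B",
}

def build_prompt(q: dict) -> str:
    """Zero-shot prompt: stem, lead-in, lettered options -- no examples,
    no chain-of-thought instructions, no system prompt."""
    opts = "\n".join(f"{chr(ord('A') + i)}. {o}" for i, o in enumerate(q["options"]))
    return f"{q['stem']}\n\n{q['lead_in']}\n\n{opts}\n\nAnswer with the letter of the single best option."

def extract_choice(completion: str) -> str | None:
    """Naive grader: take the first standalone option letter in the reply."""
    m = re.search(r"\b([A-E])\b", completion)
    return m.group(1) if m else None

# query_model(prompt, temperature=0.0) stands in for any chat-completions API call:
# is_correct = extract_choice(query_model(build_prompt(question), temperature=0.0)) == question["answer"]
```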

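For reference, a minimal implementation of the Wilson score interval (the standard formula, not Complement's code):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

print(wilson_interval(30, 40))  # 30/40 correct -> roughly (0.60, 0.86)
```

At n = 40, an observed accuracy of 30/40 gives an interval of roughly (0.60, 0.86); a 4-question topic bucket with 3 correct spans roughly (0.30, 0.95), which is why the per-topic breakdowns below warrant caution.
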
3. Results

3.1 Model leaderboard

[Interactive leaderboard table: Rank · Model · Accuracy · 95% CI · Runs]

3.2 Performance by topic

Accuracy varies substantially across USMLE topic categories. The chart below shows the average accuracy across all models for each topic:

[Interactive chart: mean accuracy per topic, averaged across all models]
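
The aggregation behind the chart is straightforward; here is a minimal sketch using pandas, assuming a long-format results table whose model, topic, and correct columns are hypothetical names:

```python
import pandas as pd

# Hypothetical long-format results: one row per (model, question) grading.
results = pd.DataFrame({
    "model":   ["model-a", "model-a", "model-b", "model-b"],
    "topic":   ["Cardiology", "Renal", "Cardiology", "Renal"],
    "correct": [True, False, True, True],
})

# Mean accuracy per topic, pooled across all models (what the chart plots).
topic_accuracy = (
    results.groupby("topic")["correct"]
           .mean()
           .sort_values(ascending=False)
)
print(topic_accuracy)
```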

4. Sample questions

Below is a representative question from the evaluation set. [Interactive widget: select an answer to reveal the correct response and explanation.]

5. Limitations

  • AI-generated questions. Questions were generated by Claude Opus 4.6 rather than sourced from official NBME material. While reviewed for accuracy, they may not perfectly represent USMLE difficulty or style distributions.
  • Small question set. 40 questions provides limited statistical power for fine-grained comparisons. CIs are wide for per-topic and per-step breakdowns.
  • Zero-shot only. Models may perform better under few-shot prompting, chain-of-thought, or tailored system prompts; our results likely reflect a lower bound on capability.
  • Few runs per model. Most models were evaluated in only one or a few runs. At temperature 0.0 each run is effectively deterministic, but uncertainty from the particular sample of 40 questions remains.

Methodology notes

  • Questions generated by Claude Opus 4.6, reviewed for clinical accuracy
  • Span Step 1 & Step 2 CK across 19 USMLE categories
  • Zero-shot prompting, no chain-of-thought or system prompts
  • Temperature 0.0 for all models
  • Wilson score 95% confidence intervals
  • Human scores from medical students via timed challenge interface

Take the challenge

Answer the same 40 questions and see how you compare.
