Frontier LLM Performance on USMLE-Style Medical Questions
Zero-shot evaluation of 7 models across 40 board-style questions
Abstract. We evaluate 7 frontier language models on 40 original, model-generated USMLE Step 1 and Step 2 CK questions under zero-shot, temperature-0 conditions.
1. Introduction
The USMLE is a three-step examination required for medical licensure in the United States. Step 1 tests foundational science knowledge and Step 2 CK tests clinical reasoning. Performance on these exams has become a standard benchmark for evaluating medical knowledge in language models, following early demonstrations that GPT-4 could score at or above the passing threshold.
We constructed a set of 40 original USMLE-style questions spanning 19 topic categories across both Step 1 and Step 2 CK. Unlike repurposed question-bank items, these questions were newly generated by Claude Opus 4.6 and reviewed for clinical accuracy, reducing the risk that models had already seen them in training data.
2. Methods
- Question generation. 40 multiple-choice questions (5 options each) generated by Claude Opus 4.6, spanning Step 1 and Step 2 CK across 19 USMLE categories.
- Prompting. Zero-shot: each model receives the question stem, lead-in, and answer options with no examples, no chain-of-thought instructions, and no system prompt (a sketch of the evaluation loop follows this list).
- Temperature. All models evaluated at temperature 0.0 for reproducibility.
- Confidence intervals. Wilson score intervals at 95% confidence.
- Human baseline. Medical students take the same 40-question challenge under timed conditions via the interactive exam interface.
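For concreteness, here is a minimal sketch of this setup in Python. It assumes a generic `model.complete(prompt, temperature)` client call, a simple letter-extraction step, and illustrative prompt wording; none of these are the actual evaluation harness.

```python
import re

def format_prompt(q: dict) -> str:
    """Zero-shot prompt: stem, lead-in, and lettered options only.

    No few-shot examples, no chain-of-thought instructions, no system prompt.
    The final instruction line is illustrative wording, not the exact prompt used.
    """
    options = "\n".join(f"{letter}. {text}" for letter, text in q["options"].items())
    return f"{q['stem']}\n\n{q['lead_in']}\n\n{options}\n\nAnswer with the letter of the single best option."

def extract_answer(completion: str) -> str | None:
    """Take the first standalone option letter (A-E) in the model output."""
    match = re.search(r"\b([A-E])\b", completion)
    return match.group(1) if match else None

def evaluate(model, questions: list[dict]) -> float:
    """Score one model over the question set at temperature 0.0."""
    correct = 0
    for q in questions:
        completion = model.complete(format_prompt(q), temperature=0.0)  # hypothetical client API
        if extract_answer(completion) == q["answer"]:
            correct += 1
    return correct / len(questions)
```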
3. Results
3.1 Model leaderboard
| Rank | Model | Accuracy | 95% CI | Runs |
|---|---|---|---|---|

*The leaderboard populates from live evaluation results in the interactive version of this page.*
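The 95% CI column uses the Wilson score interval described in Section 2. A minimal sketch of the computation, using the standard closed form with z = 1.96 for 95% confidence (not the project's actual analysis code):

```python
from math import sqrt

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

# With only 40 questions the interval is wide: 32/40 correct (80%)
# yields a 95% CI of roughly 65%-90%.
print(wilson_interval(32, 40))  # ≈ (0.652, 0.895)
```

That width is why the limitations section below cautions against fine-grained comparisons on a 40-question set.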
3.2 Performance by topic
Accuracy varies substantially across USMLE topic categories. The chart below shows the average accuracy across all models for each topic:
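One way to compute such a per-topic average from raw results, assuming a per-item table with one row per (model, question) pair; the model names and column layout here are hypothetical, not the project's actual data schema:

```python
import pandas as pd

# Hypothetical per-item results: one row per (model, question) pair.
results = pd.DataFrame({
    "model":   ["gpt-x", "gpt-x", "claude-y", "claude-y"],
    "topic":   ["Cardiology", "Biochemistry", "Cardiology", "Biochemistry"],
    "correct": [1, 0, 1, 1],
})

# Accuracy per (topic, model), then averaged across models per topic,
# so every model contributes equally to each topic's figure.
per_model = results.groupby(["topic", "model"])["correct"].mean()
topic_avg = per_model.groupby("topic").mean().sort_values()
print(topic_avg)
```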
4. Sample questions
Below is a representative question from the evaluation set. Select an answer to reveal the correct response and explanation.
5. Limitations
- AI-generated questions. Questions were generated by Claude Opus 4.6 rather than sourced from official NBME material. While reviewed for accuracy, they may not perfectly represent USMLE difficulty or style distributions.
- Small question set. Forty questions provide limited statistical power for fine-grained comparisons; confidence intervals are wide, especially for per-topic and per-step breakdowns.
- Zero-shot only. Models may perform differently under few-shot prompting, chain-of-thought, or with system prompts. Our results are best read as a lower bound on capability.
- Few runs per model. Most models were evaluated only a small number of times. At temperature 0.0, outputs are largely deterministic across runs, but question-sampling effects remain.
Methodology notes
- Questions generated by Claude Opus 4.6, reviewed for clinical accuracy
- Span Step 1 & Step 2 CK across 19 USMLE categories
- Zero-shot prompting, no chain-of-thought or system prompts
- Temperature 0.0 for all models
- Wilson score 95% confidence intervals
- Human scores from medical students via timed challenge interface
Answer the same 40 questions and see how you compare.