
Tags: benchmark · usmle · llm-evaluation

Frontier LLM Performance on USMLE-Style Medical Questions

Zero-shot evaluation of 7 models across 40 board-style questions

Complement Research

Abstract. We evaluate 7 frontier language models on 40 original USMLE-style Step 1 and Step 2 CK questions, generated by Claude Opus 4.6 and reviewed for clinical accuracy, under zero-shot, temperature-0.0 conditions.

1. Introduction

The USMLE is a three-step examination required for medical licensure in the United States. Step 1 tests foundational science knowledge; Step 2 CK (Clinical Knowledge) tests the application of that knowledge to clinical reasoning. Performance on these exams has become a standard benchmark for evaluating medical knowledge in language models, following early demonstrations that GPT-4 could score above the passing threshold.

We constructed a set of 40 original USMLE-style questions spanning 19 topic categories across both Step 1 and Step 2 CK. Unlike repurposed question-bank items, these questions were newly generated by Claude Opus 4.6 and reviewed for clinical accuracy; because they do not appear in any existing question bank, the risk of contamination from training data is reduced.

2. Methods

  • Question generation. 40 multiple-choice questions (5 options each) generated by Claude Opus 4.6, spanning Step 1 and Step 2 CK across 19 USMLE categories.
  • Prompting. Zero-shot: each model receives the question stem, lead-in, and options with no examples, no chain-of-thought instructions, and no system prompt (a sketch of this setup follows the list).
  • Temperature. All models evaluated at temperature 0.0 for reproducibility.
  • Confidence intervals. Wilson score intervals at 95% confidence (an implementation sketch follows the list).
  • Human baseline. Medical students take the same 40-question challenge under timed conditions via the interactive exam interface.
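
The exact prompt template and grading harness are not published; the sketch below shows one way such a zero-shot query could be assembled and scored. The question record, its field names, and the query_model call are illustrative assumptions, not the actual code.

```python
import re

# Hypothetical question record; field names are assumptions, not the real schema.
question = {
    "stem": "A 61-year-old man presents with crushing substernal chest pain radiating to the left arm.",
    "lead_in": "Which of the following is the most likely diagnosis?",
    "options": ["Aortic dissection", "Acute myocardial infarction", "Pericarditis",
                "Pulmonary embolism", "Gastroesophageal reflux disease"],
    "answer": "B",
}

def build_prompt(q: dict) -> str:
    """Zero-shot prompt: stem, lead-in, lettered options -- no examples,
    no chain-of-thought instructions, no system prompt."""
    opts = "\n".join(f"{chr(ord('A') + i)}. {o}" for i, o in enumerate(q["options"]))
    return f"{q['stem']}\n\n{q['lead_in']}\n\n{opts}\n\nAnswer with the letter of the single best option."

def extract_choice(completion: str) -> str | None:
    """Naive grader: take the first standalone option letter in the reply."""
    m = re.search(r"\b([A-E])\b", completion)
    return m.group(1) if m else None

# query_model(prompt, temperature=0.0) stands in for any chat-completions API call:
# is_correct = extract_choice(query_model(build_prompt(question), temperature=0.0)) == question["answer"]
```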

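For reference, a minimal implementation of the Wilson score interval (the standard formula, not Complement's code):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

print(wilson_interval(30, 40))  # 30/40 correct -> roughly (0.60, 0.86)
```

At n = 40, an observed accuracy of 30/40 gives an interval of roughly (0.60, 0.86); a 4-question topic bucket with 3 correct spans roughly (0.30, 0.95), which is why the per-topic breakdowns below warrant caution.
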
3. Results

3.1 Model leaderboard

[Interactive leaderboard table: Rank · Model · Accuracy · 95% CI · Runs]

3.2 Performance by topic

Accuracy varies substantially across USMLE topic categories. The chart below shows the average accuracy across all models for each topic:

[Interactive chart: mean accuracy per topic, averaged across all models]
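
The aggregation behind the chart is straightforward; here is a minimal sketch using pandas, assuming a long-format results table whose model, topic, and correct columns are hypothetical names:

```python
import pandas as pd

# Hypothetical long-format results: one row per (model, question) grading.
results = pd.DataFrame({
    "model":   ["model-a", "model-a", "model-b", "model-b"],
    "topic":   ["Cardiology", "Renal", "Cardiology", "Renal"],
    "correct": [True, False, True, True],
})

# Mean accuracy per topic, pooled across all models (what the chart plots).
topic_accuracy = (
    results.groupby("topic")["correct"]
           .mean()
           .sort_values(ascending=False)
)
print(topic_accuracy)
```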

4. Sample questions

Below is a representative question from the evaluation set. [Interactive widget: select an answer to reveal the correct response and explanation.]

5. Limitations

  • AI-generated questions. Questions were generated by Claude Opus 4.6 rather than sourced from official NBME material. While reviewed for accuracy, they may not perfectly represent USMLE difficulty or style distributions.
  • Small question set. 40 questions provides limited statistical power for fine-grained comparisons. CIs are wide for per-topic and per-step breakdowns.
  • Zero-shot only. Models may perform better under few-shot prompting, chain-of-thought, or tailored system prompts; our results likely reflect a lower bound on capability.
  • Few runs per model. Most models were evaluated in only one or a few runs. At temperature 0.0 each run is effectively deterministic, but uncertainty from the particular sample of 40 questions remains.

Methodology notes

  • Questions generated by Claude Opus 4.6, reviewed for clinical accuracy
  • Span Step 1 & Step 2 CK across 19 USMLE categories
  • Zero-shot prompting, no chain-of-thought or system prompts
  • Temperature 0.0 for all models
  • Wilson score 95% confidence intervals
  • Human scores from medical students via timed challenge interface

Take the challenge

Answer the same 40 questions and see how you compare.
