Complement
Questions generated by Claude Opus 4.6 at hard difficulty, then quality-scored by an independent critic model
Questions span Step 1 and Step 2 CK across all 19 USMLE organ-system categories
Each model tested with zero-shot prompting: no examples, no chain-of-thought
Temperature set to 0.0 for maximum reproducibility
Wilson score confidence intervals computed across all runs
Human scores will be added as medical students complete the challenge
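
For reference, a minimal sketch of how a Wilson score interval can be computed from a count of correct answers. The function name `wilson_interval`, the 95% confidence level, and the 50-out-of-100 example counts are illustrative assumptions; the notes above do not state the confidence level or the exact procedure used.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (correct / total).

    The 95% default is an assumption, not a value taken from the benchmark.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = correct / total

    # Standard Wilson formula: shrunken center plus or minus an adjusted half-width.
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts, for illustration only.
low, high = wilson_interval(correct=50, total=100)
print(f"accuracy 0.50, interval [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly when accuracy is near 0% or 100%, which is one reason it is often chosen for accuracy metrics on modest question counts.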