Reasoning
83.4. Internal 2,400-task suite, May 2026, baseline 79.1.
Benchmark table with scores, datasets, sample notes, baseline, contamination screening, update date, and limitations.
71.8 pass@1. Public issue set, contamination screened, baseline 68.0.
92.2. Needle and multi-source synthesis suite, 128k context.
Benchmarks do not guarantee task success; production evals should mirror your workflow.
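As a rough illustration of how to read the figures above, the sketch below computes the absolute gain over baseline for each row that reports one. The row descriptions are shorthand chosen for this example, not official benchmark names, and pairing "Reasoning" with the 83.4 row is an assumption based on the order of the lines. The 128k-context row reports no baseline, so no gain is computed for it.

```python
from typing import Optional

# (description, score, baseline) copied from the figures above.
# Descriptions are shorthand for this sketch, not official benchmark names.
# Assumption: the "Reasoning" label belongs to the 83.4 row.
ROWS = [
    ("Reasoning, internal 2,400-task suite", 83.4, 79.1),
    ("pass@1, public issue set", 71.8, 68.0),
    ("Needle and multi-source synthesis, 128k context", 92.2, None),  # no baseline reported
]


def gain_over_baseline(score: float, baseline: Optional[float]) -> Optional[float]:
    """Return the absolute gain in points, or None when no baseline is reported."""
    if baseline is None:
        return None
    # Round to one decimal place to match the precision of the reported scores.
    return round(score - baseline, 1)


for name, score, baseline in ROWS:
    gain = gain_over_baseline(score, baseline)
    shown = "n/a" if gain is None else f"{gain:+.1f}"
    print(f"{name}: {score} (gain over baseline: {shown})")
```

A gain of a few points on a benchmark suite says little about any particular workload, so the same comparison is more informative when run against an evaluation set drawn from your own tasks.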