Scores / method / caveats

Benchmarks

Benchmark table with scores, datasets, sample notes, baseline, contamination screening, update date, and limitations.

Score 83.4 (baseline 79.1). Internal 2,400-task suite, May 2026.

Score 71.8 pass@1 (baseline 68.0). Public issue set, contamination screened.

Score 92.2. Needle and multi-source synthesis suite, 128k context.
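
For readers unfamiliar with the pass@1 figure above: pass@k is commonly reported as the probability that at least one of k sampled attempts solves a task, averaged over tasks. The sketch below uses the widely cited unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass. The source does not say how many samples per task were drawn or which estimator was used, so the function names and sample counts here are illustrative assumptions only.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples drawn for the task (assumed; not stated in the source)
    c: samples that passed
    k: attempt budget being scored
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures means every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no task results supplied")
    return sum(scores) / len(scores)
```

With k = 1 the estimator reduces to c / n per task, so a suite-level pass@1 is simply the mean fraction of first attempts that pass.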

Benchmark scores do not guarantee success on your tasks; production evals should mirror your own workflow.
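
One way to act on that caveat is a small harness that replays prompts taken from your own workflow and applies your own pass/fail checks. This is a minimal sketch, not a description of how the scores above were produced: `run_eval`, `system`, and the case format are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, Tuple

# A case pairs an input prompt with a check on the system's output.
Case = Tuple[str, Callable[[str], bool]]


def run_eval(system: Callable[[str], str], cases: Iterable[Case]) -> float:
    """Return the fraction of cases whose output passes its own check."""
    total = 0
    passed = 0
    for prompt, check in cases:
        total += 1
        try:
            if check(system(prompt)):
                passed += 1
        except Exception:
            # A crash in the system or the check counts as a failure.
            pass
    if total == 0:
        raise ValueError("no cases supplied")
    return passed / total
```

For example, `run_eval(my_pipeline, cases)` with cases drawn from real tickets or documents gives a pass rate that can be tracked alongside the published numbers; `my_pipeline` stands in for whatever callable wraps your deployment.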