Scores / method / caveats

Benchmarks

Benchmark table with scores, datasets, sample notes, baseline, contamination screening, update date, and limitations.

Reasoning

Score 83.4 against a baseline of 79.1 (+4.3 points). Internal 2,400-task suite, updated May 2026.

Code repair

71.8 pass@1 against a baseline of 68.0 (+3.8 points). Public issue set, screened for contamination.
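
For readers unfamiliar with the metric, pass@1 is the share of tasks solved by a single sampled attempt. The sketch below uses the commonly cited unbiased pass@k estimator, which reduces to c/n when k = 1. It is an illustration only: the source does not say how many samples per task were drawn or how its figure was computed, and the function names and example data here are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-task results (assumption, not data from the benchmark):
# (samples drawn, samples that passed the task's tests).
example = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = 100 * mean_pass_at_k(example, k=1)
```

With one sample per task, as in the example data, the score is simply the percentage of tasks whose single attempt passed.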

Long context

Score 92.2. Needle-retrieval and multi-source synthesis suite, 128k context.
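
A needle test of this kind typically hides one fact inside long filler text and checks whether the model can retrieve it. The following is a minimal sketch of that idea, not the suite used here: the helper names are hypothetical, lengths are measured in characters rather than tokens, and the filler and needle strings are invented for illustration.

```python
def build_needle_prompt(filler: str, needle: str, depth: float, target_chars: int) -> str:
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end) of a filler haystack."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    if not filler:
        raise ValueError("filler must be non-empty")
    # Repeat the filler until it covers the target length, then trim.
    repeats = target_chars // len(filler) + 1
    haystack = (filler * repeats)[:target_chars]
    cut = int(len(haystack) * depth)
    return haystack[:cut] + needle + haystack[cut:]


def retrieved(answer: str, expected: str) -> bool:
    """Loose check: the expected fact appears somewhere in the model's answer."""
    return expected.lower() in answer.lower()


# Hypothetical filler and needle (assumptions for illustration only).
needle = " The access code is 7421. "
prompt = build_needle_prompt("The sky was grey that day. ", needle, depth=0.5, target_chars=2000)
```

Sweeping `depth` and `target_chars` over a grid is a common way to see whether retrieval degrades at particular positions or lengths.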

Caveat

Benchmarks do not guarantee task success; production evals should mirror your workflow.
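
One way to act on this is to keep a small set of prompts drawn from your own workflow, each paired with a check that encodes what success means for you. The sketch below assumes a model exposed as a plain function from prompt to text; `run_eval`, the stand-in model, and the cases are all hypothetical.

```python
from typing import Callable, Iterable, Tuple

Case = Tuple[str, Callable[[str], bool]]


def run_eval(model: Callable[[str], str], cases: Iterable[Case]) -> float:
    """Return the fraction of cases whose model output passes its own checker."""
    outcomes = [check(model(prompt)) for prompt, check in cases]
    if not outcomes:
        raise ValueError("no cases supplied")
    return sum(outcomes) / len(outcomes)


def echo_upper(prompt: str) -> str:
    """Stand-in model for illustration: replace with a call to the real system."""
    return prompt.upper()


# Hypothetical workflow cases: (prompt, success check).
cases = [
    ("refund policy", lambda out: "REFUND" in out),
    ("order status", lambda out: "shipped" in out),
]
rate = run_eval(echo_upper, cases)
```

Per-case checkers keep the pass criterion close to the task, so the resulting rate reflects your workflow rather than a generic benchmark.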