AI Reasoning Benchmarks

1 entry in Legal Intelligence Tracker

AI lab claims self-improving coding agents set new benchmark

Poetic's meta-system has reportedly scored 93.9 on the Soda benchmark, surpassing GPT-5.5, by running live code benchmarks and building its own test harnesses without fine-tuning or special access. In a separate effort, Prime Intellect provided idle compute to OpenAI's Codex and Anthropic's Claude Code to optimize a "nanoGPT speedrun" track. After approximately 14,000 H200 GPU hours, the agents beat the human baseline, with Opus 4.7 recording a result of 2,930 steps. These developments were discussed in the May 15, 2026 episode of The Innermost Loop, hosted by Dr. Alex Wissner-Gross, which framed them as evidence that AI systems are beginning to optimize their own optimizers.
