AI lab claims self-improving coding agents set new benchmark

Why it matters

Poetic's meta-system has reportedly scored 93.9 on the Soda benchmark, surpassing GPT-5.5, by running live code benchmarks and building its own test harnesses without fine-tuning or special access. In a separate effort, Prime Intellect provided idle compute to OpenAI's Codex and Anthropic's Claude Code to optimize a "nano GPT speedrun" track; after approximately 14,000 H200 GPU hours, the agents beat the human baseline, with Opus 4.7 recording a result of 2,930 steps. Both developments were discussed in a May 15, 2026 episode of The Innermost Loop, hosted by Dr. Alex Wissner-Gross, which framed them as evidence that AI systems are beginning to optimize their own optimizers.

The claims remain unverified outside the podcast discussion. No independent benchmarking body has confirmed the reported scores, and details about Poetic's methodology and Prime Intellect's compute allocation have not been made public. The timeline and technical specifications come solely from the podcast episode and related materials.

Attorneys tracking AI liability and IP issues should note the shift in how these systems are being deployed. When AI agents design their own test harnesses and optimization loops, questions about ownership of improvements, reproducibility for patent prosecution, and liability for errors in self-generated benchmarks become material. The recursive nature of these tasks—machines improving the machines that improve machines—may also trigger closer scrutiny from regulators focused on autonomous AI development and safety validation.
