The benchmark covers models tested from early 2024 through November 2025, with the paper expected in early 2026. The specific dimensions where models underperform—particularly predictability and catastrophic error avoidance—remain areas of active development. Publication will arrive as enterprises accelerate AI agent deployment, leaving a gap between when the research was conducted and the market conditions it will be read against.
For practicing attorneys, this research directly challenges the narrative that AI is ready for high-stakes applications. The calibration and error-avoidance failures documented here expose material liability risks for organizations deploying these systems in legal research, contract review, or compliance functions. The gap between raw accuracy and genuine reliability suggests that current AI agents cannot yet be trusted as autonomous decision-makers in domains where errors carry legal or financial consequences. Organizations should scrutinize vendor claims about model capabilities against these benchmark results and establish human oversight protocols accordingly.