Seven faculty members from the University of Maryland Francis King Carey School of Law graded both models using the same criteria they apply to their students. In spring 2025, o3 earned A+ grades in Constitutional Law, Professional Responsibility, and Property Law, and its grades across eight exams ranged from A+ to B. When the researchers tested GPT-5.5 the following spring using its "xhigh" reasoning effort setting, the results showed only marginal gains: two A+s, three As, and a B+. The newer model demonstrated no clear superiority.
The study extends a multi-year tracking effort that began in 2022 with GPT-3.5, which scored mostly Cs and Ds, and continued through GPT-4, which passed the bar exam at the 90th percentile. This plateau, despite greater computational resources and more advanced reasoning features, suggests that AI progress on legal benchmarks may be stalling.
Attorneys and legal institutions should take note of this finding because it complicates the narrative around AI capability scaling. If performance gains on complex legal reasoning tasks have genuinely plateaued, claims about AI readiness for high-stakes professional work warrant skepticism. The results may influence how courts, bar associations, and law firms evaluate and deploy AI tools in the years ahead.