Reasoning Bench: Testing GPT-5.5 and Opus 4.8 on Short Reasoning Prompts
I ran frontier models through 94 short reasoning prompts to see where they still trip up. The scores mostly tied, but the misses, cost, and speed told the useful story.
Calvin · 10 min read