
Reasoning Bench: Testing GPT-5.5 and Opus 4.8 on Short Reasoning Prompts
June 3, 2026 · 10 min read
About a month ago (29 April 2026, to be exact), the official ChatGPT X account posted a flex of GPT-5.5, which I personally believe is the best frontier model since Opus 4.6.
What was interesting for me was the comment thread where people shared various other questions that tripped the frontier model up. Some of the questions were:
- How many r's are in cranberry?
- What does the "S" in ChatGPT stand for?
- Count to 10 starting from 11
- What weighs more: 10 pounds of bricks or 11 feathers?
So I did what most software engineers do and decided to create my own benchmark for this without checking whether one already exists. But mostly this came from my interest in building with AI and learning how to set up a benchmark suite.
Reasoning Benchmark
What I wanted to find out was how well frontier models answer reasoning questions: not just gotcha questions aimed at tripping models up, but also realistic instructions I might give my OpenClaw agent Kody, like asking it to "delete the draft" when the workspace has two drafts. I also wanted to know whether locally run models on a 24GB Mac Mini M4 could answer any of these questions reasonably.
I wanted a quick read on a very specific model failure mode: can models handle tiny prompts without pattern-matching their way into the wrong answer?
This started with the initial set of popular questions that have become almost a de facto standard for model reasoning on the internet. Stuff like:
- If the car wash is 100 metres away, should you drive or walk?
- How many r's are in strawberry?
The questions remain small prompts, which in some ways mimics how people actually ask an LLM for an answer. This evolved into various categories of questions grouped into runnable suites. Right now it's at 94 questions. The benchmark is open source on GitHub - reasoning-benchmark. Docs where you can browse the current set of questions are here.
GPT vs Opus
Fast forward one month and Opus 4.8 was released, and we all had a twinkle of hope that the shitshow that was Opus 4.7 had been quelled and we might have a new frontier model leader.
Having used GPT-5.5 as my main model since Anthropic blocked OAuth subscription usage in OpenClaw, I wondered how well Opus 4.8 would fare against GPT-5.5 on reasoning and instructions. So I ran my reasoning benchmark comparing a few models as well as thinking/effort levels.
I ran six harness / model combinations:
- Codex / GPT-5.4 xhigh
- Codex / GPT-5.5 medium
- Codex / GPT-5.5 high
- Codex / GPT-5.5 xhigh
- Claude Code / Claude Opus 4.7 max
- Claude Code / Claude Opus 4.8 max
The result was basically a near-tie across the board, which is mildly annoying if you really wanted a clean winner. But digging into the specific details and failures was somewhat interesting.
First, some high level stats:
Four runs landed at 93/94. Two landed at 92/94. Across all six runs, 90 questions were answered correctly by everyone.
- GPT-5.4 xhigh (93/94): Best value; one real miss: it wanted to walk to a car wash without the car.
- GPT-5.5 high (93/94): Same score as xhigh with fewer reasoning tokens, but slightly higher estimated cost in this run.
- GPT-5.5 xhigh (93/94): Top score and cheaper than GPT-5.5 high on this cached run.
- GPT-5.5 medium (92/94): Fastest GPT-5.5 effort setting; missed the pet-rock trap and the magician prompt.
- Opus 4.7 max (93/94): Fastest run; also strong, but more expensive per correct answer.
- Opus 4.8 max (92/94): Slowest and most expensive per correct answer in this run.
Going into the failures gave some interesting details.
This is the extra view I would keep. The raw score says everyone was close; the miss map shows whether the mistakes were isolated or shared across a family of runs.
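To make the miss map concrete, here is a minimal Python sketch built only from the misses described in this post. The prompt labels ("car wash", "magician", and so on) and the function names are my own shorthand, not identifiers from the benchmark repo.

```python
TOTAL_QUESTIONS = 94

# Misses per run as described in this post; labels are shorthand, not repo IDs.
MISSES = {
    "GPT-5.4 xhigh": {"car wash"},
    "GPT-5.5 medium": {"pet rock", "magician"},
    "GPT-5.5 high": {"magician"},
    "GPT-5.5 xhigh": {"magician"},
    "Opus 4.7 max": {"banana"},
    "Opus 4.8 max": {"banana", "magician"},
}


def scores(misses, total=TOTAL_QUESTIONS):
    """Score per run: total questions minus the number of misses."""
    return {run: total - len(missed) for run, missed in misses.items()}


def miss_map(misses):
    """Invert run -> prompts into prompt -> sorted list of runs that missed it."""
    by_prompt = {}
    for run, missed in misses.items():
        for prompt in missed:
            by_prompt.setdefault(prompt, []).append(run)
    return {prompt: sorted(runs) for prompt, runs in by_prompt.items()}


def correct_for_everyone(misses, total=TOTAL_QUESTIONS):
    """Questions that no run missed."""
    missed_by_someone = set().union(*misses.values())
    return total - len(missed_by_someone)


if __name__ == "__main__":
    for run, score in scores(MISSES).items():
        print(f"{run}: {score}/{TOTAL_QUESTIONS}")
    for prompt, runs in sorted(miss_map(MISSES).items()):
        print(f"{prompt}: missed by {', '.join(runs)}")
    print("correct for everyone:", correct_for_everyone(MISSES))
```

Inverting the data this way is what separates an isolated miss (car wash, pet rock) from one shared across a family of runs (magician across all GPT-5.5 settings, banana across both Opus models).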
Looking at cost and duration also yielded some interesting findings.
- Opus 4.8 spat out the most tokens and took the longest, while Opus 4.7 was the fastest with half the token output.
- GPT-5.4 was the cheapest.
- GPT-5.5 xhigh cached the most across the GPT models and thinking levels.
Codex and Claude report cache usage differently, so this folds cache read and creation tokens into normalized input.
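As a rough sketch of that normalization, assuming each run's usage has already been split into uncached, cache-read, and cache-creation input counts. The `Usage` record and its field names are hypothetical, not the schema either CLI emits, and the mapping from each tool's raw report into this shape is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Hypothetical per-run usage record; field names are illustrative only."""
    uncached_input: int   # input tokens not served from or written to cache
    cache_read: int       # input tokens read from the prompt cache
    cache_creation: int   # input tokens written to the prompt cache
    output: int           # generated tokens


def normalized_input(usage: Usage) -> int:
    """Fold cache read and creation tokens into a single input figure."""
    return usage.uncached_input + usage.cache_read + usage.cache_creation


def cache_read_share(usage: Usage) -> float:
    """Fraction of normalized input that came from cache reads."""
    total = normalized_input(usage)
    return usage.cache_read / total if total else 0.0


if __name__ == "__main__":
    # Made-up numbers purely to show the arithmetic, not benchmark data.
    example = Usage(uncached_input=1_000, cache_read=4_000, cache_creation=500, output=200)
    print(normalized_input(example))
    print(round(cache_read_share(example), 3))
```

If a harness reports cached tokens as a subset of its input count rather than as separate counters, they would need to be subtracted out first to avoid double counting.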
The following graphs show where the models sit in terms of cost and duration.
The score range is tiny, so the cost axis carries most of the useful signal. GPT-5.4 xhigh sits in the good corner here: tied top score, much lower cost per correct answer.
This is where Opus 4.7 looked strongest: it tied the top score and finished fastest. Opus 4.8 moved the other way in this run, taking the longest while landing at 92/94.
Wall-clock time includes the CLI harness path.
My take
All in all, these models are very, very good at this kind of short-form reasoning now. The leaderboard is less interesting than the shape of the remaining misses, and the cost/speed plots are where the ranking starts to matter a bit.
What this benchmark really tests
The benchmark is mostly tiny traps. Some questions test goal grounding: do the thing that satisfies the actual goal, not the locally convenient version of it. Some are modified riddles, where the memorized solution is now wrong. Others are about literal precision, social pragmatics, physical common sense, or tracking what changed over time.
Don't treat this as the benchmark for intelligence. It's testing for a pattern I care about when using agents: do they answer the actual question, or do they route to the nearest familiar shape? That matters because agents fail in annoying little ways. They do not always fail by being obviously stupid. Sometimes they fail by being plausibly helpful about the wrong problem.
Devil is in the details
The visual below is a compact version of the full benchmark answers page, narrowed to the four prompts where at least one model missed.
Let's dive into the specific questions that had wrong or invalid answers, starting with the most popular one.
I want to wash my car. The car wash is only 100 metres away. Should I drive there or walk?
Only GPT-5.4 said to walk, while the latest frontier models (GPT-5.5 and Opus 4.7/4.8) got this one right. This one is popular enough that it might have been specifically trained into the models. Or maybe someone put an else statement into the harness.
The other question that gave particularly interesting answers was the stage magician one:
A stage magician pretends to saw a person in half during a trick. How many people are there after the trick?
The expected answer as defined: one person, assuming it is a stage trick. As for the results, GPT-5.5 medium, high, xhigh, and Opus 4.8 counted the magician too. I get why. The wording has enough room for that interpretation. But the prompt is asking what happened to the person in the trick, and the word "pretends" is doing real work. So I'd argue you could say the answers are correct even if they said two people.
The question that gave me the most insight was the banana prompt:
Say "banana" without using the letter b.
Expected answer: anana. Both Opus models argued that spoken words do not use letters. Clever, but not the intended transformation. This is the exact kind of overthinking these prompts are trying to catch. This kinda validates my theory and experience that Opus 4.7, and now 4.8, go off the deep end on simple questions like this, coming up with arguments that don't really help at all.
Effort / thinking levels did not do too much
I had specifically chosen to run GPT-5.5 at medium, high, and xhigh because I've been toying around with various thinking levels in my OpenClaw setup.
Medium scored 92/94. High and xhigh both scored 93/94. The gap is small, but it showed up in the right place. GPT-5.5 medium missed the pet-rock prompt:
If my pet rock ran away, should I call it or wait for it to come home?
Expected answer: neither. A rock cannot run away or come home.
Medium said to wait for it to come home. High and xhigh both rejected the premise. This made me switch my default GPT-5.5 thinking level across my agents from medium to high, out of fear that my pet rock might run away and I'd need advice.
So yeah, effort/thinking helped. But it did not turn this into a clean "more reasoning equals better" story. High and xhigh still both missed the magician prompt. In this run, extra effort fixed one silly mistake and left another silly mistake alone.
Cost and speed were factors
By raw score, GPT-5.4 xhigh, GPT-5.5 high, GPT-5.5 xhigh, and Opus 4.7 max tied.
With cost efficiency included, GPT-5.4 xhigh was the clean value winner at about $0.014 per correct answer.
By wall-clock speed, Opus 4.7 max was the fastest overall run at 6.63 min, while GPT-5.5 medium was the fastest GPT run at 7.19 min.
Opus 4.8 max was the weird one here. It scored 92/94, took 14.05 min, and had the highest cost per correct answer in this run. That does not mean Opus 4.8 is worse in general. It means on this tiny benchmark, with this CLI harness, it was not the efficient choice. Which brings me to the next point.
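For reference, the two efficiency numbers used here are simple ratios. The sketch below uses only figures quoted in this post; the function names are my own, and the per-question time is just an average of total wall-clock time over 94 questions, not a measured per-question latency.

```python
QUESTIONS = 94


def cost_per_correct(total_cost_usd: float, correct: int) -> float:
    """Total run cost divided by the number of correct answers."""
    if correct <= 0:
        raise ValueError("need at least one correct answer")
    return total_cost_usd / correct


def seconds_per_question(duration_min: float, questions: int = QUESTIONS) -> float:
    """Average wall-clock seconds per question for a whole run."""
    return duration_min * 60 / questions


# Wall-clock durations (minutes) quoted in this post.
DURATIONS_MIN = {
    "Opus 4.7 max": 6.63,
    "GPT-5.5 medium": 7.19,
    "Opus 4.8 max": 14.05,
}

if __name__ == "__main__":
    for run, minutes in DURATIONS_MIN.items():
        print(f"{run}: {seconds_per_question(minutes):.1f} s per question")
    # Back-of-envelope only: "about $0.014" is rounded, so this total is approximate.
    implied_total = 0.014 * 93
    print(f"GPT-5.4 xhigh implied total: about ${implied_total:.2f}")
```

On those numbers, Opus 4.7 max averages roughly 4.2 seconds per question, GPT-5.5 medium roughly 4.6, and Opus 4.8 max roughly 9.0, with the CLI harness overhead included in all three.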
The harness is part of the result
These runs used the same kind of harness/tooling I use day to day:
- Codex CLI for GPT
- Claude Code for Opus
That is closer to my actual workflow than calling a bare API endpoint, but it also means the product harness is part of the result.
The question this raised for me is whether a different harness like OpenCode or Pi agent would yield different results. Beyond harnesses, running these via Codex and Claude subscriptions (Codex Pro / Claude Max) rather than via direct API or OpenRouter might also produce different results, perhaps in cost and duration.
Full results
I split the deeper material out into the project docs so this blog post does not become a giant table dump.
- Project overview
- Browse all benchmark questions
- Full GPT/Opus historical run
- All model answers for that run
You can also view all the answers given by the models there for this particular benchmark run if that interests you. The answers page is the most useful bit if you want to inspect the actual behaviour. Scores are nice, but the individual wrong answers show the failure modes much more clearly.
My (honest) takeaway
I would still trust GPT-5.5 as my model of choice for my main agent, the one that answers my questions and orchestrates. I'm still on the fence between medium and high effort/thinking levels. It's pretty close, and I'd probably choose lower cost and faster responses over knowing whether my pet rock is coming home.
The latest frontier models are going to get almost all of these questions correct, and it seems to be getting to the pointy end. On the flip side, the things that appear to matter more are the harness, the reliability and performance of the model providers, and the guardrails and skills you put in place.
Building this gave me insight into building benchmarks. The speed and ability to build these simply by prompting an agent, going back and forth on planning and iterating on the code, is something I'd never have imagined possible for a single engineer in such a short time.
The models are pretty damn amazing today. All they need right now is some level of human reasoning and direction to make them perfect.