Calvin Ng

Reasoning Bench: Testing GPT-5.5 and Opus 4.8 on Short Reasoning Prompts

June 3, 2026 · 10 min read

About a month ago (29 April 2026, to be exact), the official ChatGPT X account posted a flex of GPT-5.5, which I personally believe is the best frontier model since Opus 4.6.

What was interesting for me was the comment thread where people shared various other questions that tripped the frontier model up. Some of the questions were:

  • How many r's are in cranberry?
  • What "S" in ChatGPT stands for?
  • Count to 10 starting from 11
  • What weighs more: 10 pounds of bricks or 11 feathers?

So I did what most software engineers do and decided to create my own benchmark for this without researching if there already exists one. But mostly this was just my interest in building with AI and learning about how to set up a benchmark suite.

Reasoning Benchmark

What I wanted to find out was how well frontier models can answer reasoning questions: not just gotcha questions deliberately aimed at tripping models up, but also realistic instructions I might give my OpenClaw agent, like asking it to "delete the draft" when the workspace has two drafts. I also wanted to know whether models run locally on a Mac Mini M4 with 24GB would be capable of answering any of these questions reasonably.

I wanted a quick read on a very specific model failure mode: can models handle tiny prompts without pattern-matching their way into the wrong answer?

This started out with the initial set of popular questions that have become almost a de facto standard for testing model reasoning on the internet. Stuff like:

  • If the car wash is 100 metres away, should you drive or walk?
  • How many r's are in strawberry?

The questions remain small prompts, which in some ways mimics how people actually ask an LLM for an answer. This evolved into generating various categories of questions grouped into runnable suites. Right now it's at 94 questions. The benchmark is open source on GitHub - reasoning-benchmark. Docs where you can browse the current set of questions are here.

GPT vs Opus

Fast forward a month and Opus 4.8 was released, and we all had a twinkle of hope that the shitshow that was Opus 4.7 had been quelled and we might have a new frontier model leader.

Having used GPT-5.5 as my main model since Anthropic blocked OAuth subscription usage in OpenClaw, I wondered how well Opus 4.8 would fare against GPT-5.5 on reasoning and instructions. So I ran my reasoning benchmark across a few models and thinking/effort levels.

I ran six harness / model combinations:

  • Codex / GPT-5.4 xhigh
  • Codex / GPT-5.5 medium
  • Codex / GPT-5.5 high
  • Codex / GPT-5.5 xhigh
  • Claude Code / Claude Opus 4.7 max
  • Claude Code / Claude Opus 4.8 max

The result was basically a near-tie across the board, which is mildly annoying if you really wanted a clean winner. But digging into the specific detail and failures was somewhat interesting.

First, some high level stats:

  • Top score: 93/94 (four runs tied for first)
  • Top-score runs: 4/6 (only one point split the field)
  • All-six correct: 90/94 (prompts answered by everyone)
  • Total cost: $18.76 (six CLI runs)
  • Wall clock: 54.39 min (combined run duration)
  • Miss prompts: 4 (where every mistake landed)

Six harness / model runs over the same 94-question suite.

Four runs landed at 93/94. Two landed at 92/94. Across all six runs, 90 questions were answered correctly by everyone.

Result matrix (94 questions)

  • GPT-5.4 xhigh: 93/94 (98.94%), $0.014/correct, 10.44 correct/min. Best value; one real miss: it wanted to walk to a car wash without the car.
  • GPT-5.5 high: 93/94 (98.94%), $0.037/correct, 11.04 correct/min. Same score as xhigh with fewer reasoning tokens, but slightly higher estimated cost in this run.
  • GPT-5.5 xhigh: 93/94 (98.94%), $0.029/correct, 10.12 correct/min. Top score and cheaper than GPT-5.5 high on this cached run.
  • GPT-5.5 medium: 92/94 (97.87%), $0.035/correct, 12.80 correct/min. Fastest GPT-5.5 effort setting; missed the pet-rock trap and the magician prompt.
  • Opus 4.7 max: 93/94 (98.94%), $0.042/correct, 14.03 correct/min. Fastest run; also strong, but more expensive per correct answer.
  • Opus 4.8 max: 92/94 (97.87%), $0.046/correct, 6.55 correct/min. Slowest and most expensive per correct answer in this run.

Going into the failures gave some interesting details.

Where the misses actually landed

This is the extra view I would keep. The raw score says everyone was close; the miss map shows whether the mistakes were isolated or shared across a family of runs.

  • GG-01 Car wash: missed by GPT-5.4 xhigh; correct for the other five runs.
  • LP-08 Pet rock: missed by GPT-5.5 medium; correct for the other five runs.
  • LP-15 Stage magician: missed by GPT-5.5 medium, GPT-5.5 high, GPT-5.5 xhigh, and Opus 4.8 max; correct for GPT-5.4 xhigh and Opus 4.7 max.
  • LP-20 Banana without b: missed by Opus 4.7 max and Opus 4.8 max; correct for the four GPT runs.
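
As a sanity check, the miss map can be replayed against the headline scores. This is a minimal sketch in Python, assuming the misses are the ones described in this post (GPT-5.4 on the car wash, GPT-5.5 medium on the pet rock, and so on). The names `MISSES`, `scores`, and `all_correct` are mine for illustration and are not taken from the benchmark repo.

```python
# Illustrative only: misses per prompt as described in this write-up,
# not read from the benchmark's own result files.
RUNS = [
    "GPT-5.4 xhigh", "GPT-5.5 medium", "GPT-5.5 high",
    "GPT-5.5 xhigh", "Opus 4.7 max", "Opus 4.8 max",
]
TOTAL_QUESTIONS = 94

# prompt -> set of runs that missed it
MISSES = {
    "GG-01 Car wash": {"GPT-5.4 xhigh"},
    "LP-08 Pet rock": {"GPT-5.5 medium"},
    "LP-15 Stage magician": {
        "GPT-5.5 medium", "GPT-5.5 high", "GPT-5.5 xhigh", "Opus 4.8 max",
    },
    "LP-20 Banana without b": {"Opus 4.7 max", "Opus 4.8 max"},
}


def scores(misses, runs, total):
    """Return {run: correct_count} from a prompt -> missing-runs map."""
    return {
        run: total - sum(run in missed for missed in misses.values())
        for run in runs
    }


def all_correct(misses, total):
    """Count prompts that no run missed."""
    return total - sum(1 for missed in misses.values() if missed)


if __name__ == "__main__":
    for run, score in scores(MISSES, RUNS, TOTAL_QUESTIONS).items():
        print(f"{run}: {score}/{TOTAL_QUESTIONS}")
    print("All-six correct:", all_correct(MISSES, TOTAL_QUESTIONS))
```

Keeping the map keyed by prompt rather than by run makes it easy to see whether a miss is isolated (one run) or shared across a family of runs, which is the whole point of this view.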

Understanding the cost and duration also yielded some interesting findings.

  • Opus 4.8 spat out the most tokens and took the longest, while Opus 4.7 was the fastest with roughly a third of the token output.
  • GPT-5.4 was the cheapest.
  • GPT-5.5 xhigh cached the most tokens across the GPT models and thinking levels.

Cost, usage, and runtime

Codex and Claude report cache usage differently, so this folds cache read and creation tokens into normalized input.

  • GPT-5.4 xhigh: $1.29, input 913,699, output 12,170, 524,544 cached, 8.91 min
  • GPT-5.5 xhigh: $2.67, input 1,043,944, output 10,246, 635,136 cached, 9.19 min
  • GPT-5.5 high: $3.45, input 1,044,243, output 6,933, 440,576 cached, 8.42 min
  • Opus 4.7 max: $3.88, input 895,081, output 17,899, provider reported, 6.63 min
  • GPT-5.5 medium: $3.22, input 1,043,677, output 5,368, 479,488 cached, 7.19 min
  • Opus 4.8 max: $4.26, input 628,489, output 52,726, provider reported, 14.05 min

GPT costs are API-equivalent estimates from CLI usage. Claude costs are provider-reported.
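
The per-correct and per-minute figures in the result matrix follow from these totals. Here is a rough sketch in Python, assuming cost per correct is total cost divided by correct answers and throughput is correct answers divided by wall-clock minutes. The `Run` class is a hypothetical helper of mine, and because the inputs shown here are already rounded, the last digit can differ slightly from the published figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One harness/model run, using the rounded totals shown in this post."""
    name: str
    correct: int
    cost_usd: float
    minutes: float

    @property
    def cost_per_correct(self) -> float:
        # Dollars spent per correctly answered question.
        return self.cost_usd / self.correct

    @property
    def correct_per_minute(self) -> float:
        # Correct answers per minute of wall-clock time.
        return self.correct / self.minutes


RUNS = [
    Run("GPT-5.4 xhigh", 93, 1.29, 8.91),
    Run("GPT-5.5 high", 93, 3.45, 8.42),
    Run("GPT-5.5 xhigh", 93, 2.67, 9.19),
    Run("GPT-5.5 medium", 92, 3.22, 7.19),
    Run("Opus 4.7 max", 93, 3.88, 6.63),
    Run("Opus 4.8 max", 92, 4.26, 14.05),
]

if __name__ == "__main__":
    for run in RUNS:
        print(
            f"{run.name:<15} "
            f"${run.cost_per_correct:.3f}/correct  "
            f"{run.correct_per_minute:.2f} correct/min"
        )
    print(f"Total minutes: {sum(r.minutes for r in RUNS):.2f}")
```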

The following graphs give an indication of where the models sit in terms of cost and duration.

Score vs. cost per correct answer

The score range is tiny, so the cost axis carries most of the useful signal. GPT-5.4 xhigh sits in the good corner here: tied top score, much lower cost per correct answer.

Better is toward the top-left: cheaper and higher-scoring.

  • GPT-5.4 xhigh: 93/94, $0.014/correct
  • GPT-5.5 high: 93/94, $0.037/correct
  • GPT-5.5 xhigh: 93/94, $0.029/correct
  • GPT-5.5 medium: 92/94, $0.035/correct
  • Opus 4.7 max: 93/94, $0.042/correct
  • Opus 4.8 max: 92/94, $0.046/correct

Score vs. wall-clock duration

This is where Opus 4.7 looked strongest: it tied the top score and finished fastest. Opus 4.8 moved the other way in this run, taking the longest while landing at 92/94.

Better is toward the top-left: faster and higher-scoring.

  • GPT-5.4 xhigh: 93/94, 8.91 min
  • GPT-5.5 high: 93/94, 8.42 min
  • GPT-5.5 xhigh: 93/94, 9.19 min
  • GPT-5.5 medium: 92/94, 7.19 min
  • Opus 4.7 max: 93/94, 6.63 min
  • Opus 4.8 max: 92/94, 14.05 min

Time includes the CLI harness path.

My take

All in all, these models are very, very good at this kind of short-form reasoning now. The leaderboard is less interesting than the shape of the remaining misses, and the cost/speed plots are where the ranking starts to matter a bit.

What this benchmark really tests

The benchmark is mostly tiny traps. Some questions test goal grounding: do the thing that satisfies the actual goal, not the locally convenient version of it. Some are modified riddles, where the memorized solution is now wrong. Others are about literal precision, social pragmatics, physical common sense, or tracking what changed over time.

Don't treat this as the benchmark for intelligence. It's testing for a pattern I care about when using agents: do they answer the actual question, or do they route to the nearest familiar shape? That matters because agents fail in annoying little ways. They do not always fail by being obviously stupid. Sometimes they fail by being plausibly helpful about the wrong problem.

Devil is in the details

The visual below is a compact version of the full benchmark answers page, narrowed to the four prompts where at least one model missed.

Expected answer: Drive there. The car is the thing that needs to reach the car wash.

  • GPT-5.4 xhigh (miss): Walk. It's only 100 metres away, so walking is quicker to justify than starting and moving the car such a short distance.
  • GPT-5.5 medium (correct): Drive there. To wash the car at the car wash, the car needs to be there too.
  • GPT-5.5 high (correct): Drive there. To wash the car at the car wash, you need to bring the car there, even though it is only 100 metres away.
  • GPT-5.5 xhigh (correct): Drive there. The car needs to be at the car wash to be washed, even though it is close.
  • Claude Opus 4.7 max (correct): Drive. You need the car at the car wash to wash it, so you must drive it there despite the short distance; walking would leave the car at home.
  • Claude Opus 4.8 max (correct): Drive there. The point is to wash the car, so the car itself must be at the car wash. Walking would leave the car at home, defeating the purpose, regardless of the short 100 m distance.

Source: scored answers from the GPT/Opus 2026-05 benchmark run. The full docs page contains all 94 prompts.

Let's dive into the specific questions that had wrong or invalid answers, starting with the most popular one.

I want to wash my car. The car wash is only 100 metres away. Should I drive there or walk?

Only GPT-5.4 said to walk, while the latest frontier models, GPT-5.5 and Opus 4.7/4.8, got this one right. This prompt is popular enough that I suspect it has been specifically trained into the models. Or maybe someone put an else statement into the harness.

The other question that gave particularly interesting answers was the stage magician one:

A stage magician pretends to saw a person in half during a trick. How many people are there after the trick?

The expected answer as defined: one person, assuming it is a stage trick. GPT-5.5 medium, high, xhigh, and Opus 4.8 counted the magician too. I get why: the wording leaves enough room for that interpretation. But the prompt is asking what happened to the person in the trick, and the word "pretends" is doing real work. Still, I'd argue you could call those answers correct even though they said two people.

The question that gave me the most insight was the banana prompt:

Say "banana" without using the letter b.

Expected answer: anana. Both Opus models argued that spoken words do not use letters. Clever, but not the intended transformation. This is the exact kind of overthinking these prompts are trying to catch. It kinda validates my theory and experience that Opus 4.7, and now 4.8, go off the edge on simple questions like this, coming up with arguments that don't really help at all.

Effort / thinking levels did not do too much

I had specifically chosen to run GPT-5.5 at medium, high, and xhigh because I've been toying around with various thinking levels in my OpenClaw setup.

The results show that Medium scored 92/94. High and xhigh both scored 93/94. The gap is small, but it showed up in the right place. GPT-5.5 medium missed the pet-rock prompt:

If my pet rock ran away, should I call it or wait for it to come home?

Expected answer: neither. A rock cannot run away or come home.

Medium said to wait for it to come home. High and xhigh both rejected the premise. This made me switch my default GPT-5.5 thinking level across my agents from medium to high, out of fear that my pet rock might run away and I'd need advice.

So yeah, effort/thinking helped. But it did not turn this into a clean "more reasoning equals better" story. High and xhigh still both missed the magician prompt. In this run, extra effort fixed one silly mistake and left another silly mistake alone.

Cost and speed were a factor

By raw score, GPT-5.4 xhigh, GPT-5.5 high, GPT-5.5 xhigh, and Opus 4.7 max tied.

With cost efficiency included, GPT-5.4 xhigh was the clean value winner at about $0.014 per correct answer.

By wall-clock speed, Opus 4.7 max was the fastest overall run at 6.63 min, while GPT-5.5 medium was the fastest GPT run at 7.19 min.
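
The three rankings above can be reproduced with a few lines of Python. This is a sketch under my own assumptions: the `RUNS` table copies the rounded totals reported in this post, and `top_scorers`, `best_value`, and `fastest` are hypothetical helper names, not part of the benchmark tooling.

```python
# name: (correct answers, total cost in USD, wall-clock minutes)
# Values are the rounded figures reported in this post.
RUNS = {
    "GPT-5.4 xhigh": (93, 1.29, 8.91),
    "GPT-5.5 high": (93, 3.45, 8.42),
    "GPT-5.5 xhigh": (93, 2.67, 9.19),
    "GPT-5.5 medium": (92, 3.22, 7.19),
    "Opus 4.7 max": (93, 3.88, 6.63),
    "Opus 4.8 max": (92, 4.26, 14.05),
}


def top_scorers(runs):
    """Names of all runs that share the highest raw score."""
    best = max(correct for correct, _, _ in runs.values())
    return sorted(name for name, (correct, _, _) in runs.items() if correct == best)


def best_value(runs):
    """Run with the lowest cost per correct answer."""
    return min(runs, key=lambda name: runs[name][1] / runs[name][0])


def fastest(runs, prefix=""):
    """Run with the shortest wall-clock time, optionally filtered by name prefix."""
    candidates = [name for name in runs if name.startswith(prefix)]
    return min(candidates, key=lambda name: runs[name][2])


if __name__ == "__main__":
    print("Tied on raw score:", ", ".join(top_scorers(RUNS)))
    print("Best value:", best_value(RUNS))
    print("Fastest overall:", fastest(RUNS))
    print("Fastest GPT run:", fastest(RUNS, prefix="GPT"))
```

With a score range this narrow, which run "wins" depends entirely on which of these three questions you ask.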

Opus 4.8 max was the weird one here. It scored 92/94, took 14.05 min, and had the highest cost per correct answer in this run. That does not mean Opus 4.8 is worse in general. It means that on this tiny benchmark, with this CLI harness, it was not the efficient choice. Which brings me to the next point.

The harness is part of the result

These runs used the same kind of harness/tooling I use day to day:

  • Codex CLI for GPT
  • Claude Code for Opus

That is closer to my actual workflow than calling a bare API endpoint, but it also means the product harness is part of the result.

The question this raised for me is whether a different harness, like OpenCode or Pi agent, would yield different results. Beyond harnesses, the choice to run these via Codex and Claude subscriptions (Codex Pro / Claude Max), rather than via the direct API or through OpenRouter, might also produce different results, perhaps in cost and duration.

Full results

I split the deeper material out into the project docs so this blog post does not become a giant table dump.

You can also view all the answers given by the models there for this particular benchmark run if that interests you. The answers page is the most useful bit if you want to inspect the actual behaviour. Scores are nice, but the individual wrong answers show the failure modes much more clearly.

My (honest) takeaway

I would still trust GPT-5.5 as my model of choice for my main agent, the one that answers my questions and orchestrates. I'm still on the fence between the medium and high effort/thinking levels. It's pretty close, and I'd probably choose a lower cost and faster response over knowing whether my pet rock is.

The latest frontier models are going to get almost all of these questions correct, and it seems we're getting to the pointy end. On the flip side, the things that appear to matter more are the harness, the reliability and performance of the model providers, and the guardrails and skills you put in place.

Building this gave me insight into how benchmarks are put together. The speed at which I could build it, simply by prompting an agent, going back and forth on planning, and iterating on the code, is something I'd never have imagined possible for a single engineer in such a short time.

The models are pretty damn amazing today. All they need right now is some level of human reasoning and direction to make them perfect.