GPT-5.5 (Spud) Doesn't Quite Have the Big Model Smell

Published on April 24, 2026

GPT-5.5 dropped yesterday under the rumoured codename Spud, and I’ve been hammering on it inside Codex for most of today. I want to say something that probably won’t be popular. It feels a lot like GPT-5.4.

That’s not a bad thing exactly, but it’s also not the leap the X timeline had me bracing for. Every leaker for the past month was whispering about big model smell. The general vibe was that OpenAI was about to drag the throne back from Anthropic in one swing. Sam’s strawberry potato tweet didn’t help calm anyone down. By the time the announcement landed, half the AI internet had convinced itself we were getting a generational jump.

What we actually got is a polished version of what we already had, with some genuinely interesting numbers attached.

Let’s talk about those numbers, because the framing matters. SWE-Bench Verified at 88.7% sounds great in a press release and tells you almost nothing in practice. That benchmark has been saturated for a while. Every frontier model from every lab clusters in the high eighties on it now, and the gap between any two of them is well within the margin where the prompt template, the harness, and the time of day basically decide who wins. If a release is leaning hard on SWE-Bench Verified, that’s a sign there’s not much else to brag about.

SWE-Bench Pro is the one I actually look at, because it hasn’t fallen apart yet. On Pro, Opus 4.7 still beats Spud by about five and a half points. Terminal-Bench 2.0 at 82.7% is more interesting because it stresses agentic shell work that maps to what I do all day. MMLU at 92.4% I’d basically ignore. That benchmark has been compromised by training set contamination for years and is closer to a recall test than a reasoning one.

The number I keep coming back to is the 60% reduction in hallucinations versus 5.4. That’s harder to game and it’s the sort of improvement that comes from messing with the base model rather than the prompt scaffolding around it. If it holds up under a few weeks of use, it’s the most valuable thing in this release by a wide margin.

Here’s how it lines up against the obvious comparisons:

| Benchmark | GPT-5.5 | GPT-5.4 | Claude Opus 4.7 |
| --- | --- | --- | --- |
| SWE-Bench Verified | 88.7% | 82.1% | 87.6% |
| SWE-Bench Pro | 58.6% | 57.7% | 64.3% |
| Terminal-Bench 2.0 | 82.7% | ~74% | 69.4% |
| MMLU | 92.4% | 89.8% | ~91% |
| OSWorld-Verified | 78.7% | 78.0% | |

And the pricing, which is going to shape who actually uses it:

| Tier | Input ($/M tokens) | Output ($/M tokens) |
| --- | --- | --- |
| GPT-5.5 | $5.00 | $30.00 |
| GPT-5.4 | $2.50 | $5.00 |
| GPT-5.5 Pro | $30.00 | $80.00 |

Output tokens went from five dollars to thirty dollars per million. Six times. OpenAI is selling this as fine because Spud uses about 40% fewer output tokens on coding workloads, but that arithmetic still works out to roughly 3.6 times the output spend you had two days ago: six times the price on 0.6 times the tokens. If you're running an agent loop that hammers the model all day, you'll feel it on the next invoice.
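If you want to sanity-check that against your own usage, here's a rough sketch. The prices come from the table above; the monthly token volumes are invented for illustration, so swap in your own numbers before drawing any conclusions.

```python
# Back-of-envelope cost comparison using the per-million-token prices above.
# The token volumes below are hypothetical placeholders, not real usage data.

PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "gpt-5.4": (2.50, 5.00),
    "gpt-5.5": (5.00, 30.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month of usage, volumes given in millions of tokens."""
    in_price, out_price = PRICES[model]
    return input_mtok * in_price + output_mtok * out_price

# Hypothetical agent workload: 200M input tokens, 40M output tokens on GPT-5.4.
# Assume OpenAI's claim holds and GPT-5.5 emits ~40% fewer output tokens.
old = monthly_cost("gpt-5.4", 200, 40)
new = monthly_cost("gpt-5.5", 200, 40 * 0.6)

print(f"GPT-5.4: ${old:,.2f}")  # $700.00
print(f"GPT-5.5: ${new:,.2f}")  # $1,720.00, roughly 2.5x overall on this mix
```

On that made-up mix the total bill still roughly doubles and a half, even after crediting the full 40% token reduction, because the output price increase dominates everything else.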

So why am I still using it instead of Opus 4.7?

Honestly, because in Codex it just feels nicer. I know how unrigorous that sounds. The output is tighter. It picks up on context I didn’t spell out. It makes fewer of those small adjacent-but-wrong moves where you ask for one thing and get a slightly different thing that you have to argue it back from. When I ask for a refactor, it tends to leave the bits I care about alone and only touch what needs touching. Opus 4.7 has been getting chatty lately, in that way where it wants to restructure half the file you didn’t ask about. Spud has been more restrained.

Whether that holds up over a few weeks, I don’t know. New models always feel magical for about a fortnight and then you start noticing the seams. I might be writing a follow-up in May admitting I was wrong. That’s fine.

The thing I actually find more interesting than the model itself is what it might mean for what’s coming next.

OpenAI is calling Spud the first fully retrained base model since GPT-4.5. Everything in between has been variations on a theme: better fine tuning, better alignment, better tool use and routing, but the same underlying base. So Spud is meant to be a fresh foundation. If the fresh foundation produces something that feels like a polished GPT-5.4, two readings make sense.

The optimistic reading is that this is a staged release. Spud is the bridge model, the thing they push to the API and to ChatGPT to keep paying customers happy while the more ambitious version of the same architecture trains in the background. The 60% hallucination drop is the kind of result you get from real changes to how the base behaves, not from polish. Something has shifted under the hood, they just haven’t dialled it all the way up yet. I’d bet GPT-6 is where we see the new pre-training approach scaled out properly, with bigger compute, longer training runs, and the native omnimodal stuff Spud has in early form. Six to nine months feels about right.

The pessimistic reading is grimmer. A full retrain on more data with more compute and a new training approach yielded gains that are mostly within the margin you’d get from incremental tuning. That would suggest the easy wins from scaling are gone, and from here on every release looks like this regardless of how much money gets thrown at it. I don’t think we’re there yet, but I can see the shape of the argument and I’m watching for more evidence.

Either way, Spud isn’t the model the marketing wanted you to believe it was. It’s a competent step forward at a higher price, with one architectural improvement that might matter a lot if you do work where hallucinations cost you something.

If you’ve been holding off because of the muted reactions online, I’d still tell you to try it before you write it off. Especially if you live in Codex or in agentic tooling, because the benchmarks don’t capture the parts of the experience that matter day to day. And if you were waiting for fireworks, sit tight. The interesting model is probably still cooking.