Stop picking models by benchmark scores.

by Luis Rodrigues

What I’m building with AI + one hot take + four links worth your time.

The Take

Opus 4.8 came out on Thursday. By Friday morning, the meme cycle was all over the internet again.

"Another lab presents the most advanced model in the world".

The meme is wrong this time, because Anthropic didn't play into that narrative.

The benchmark gains are small. SWE-bench Verified moved from 87.6% to 88.6%. Anthropic themselves called it "a modest but tangible improvement".

The most restrained launch language from a frontier lab in a long time.

The interesting part is buried in the system card: this model is 4 times less likely than its predecessor to let flaws in code it has written pass without comment.

On hallucination benchmarks, it has the lowest incorrect rate among the six models tested, largely because it declines to answer when it is uncertain.

So this model is not improving capability. It's improving calibration. For anyone deploying these models in production, those are very different scoreboards, and in many tasks calibration matters far more.
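To make the two scoreboards concrete, here is a minimal sketch that scores two made-up models on the same ten questions. The outcome lists are invented for illustration and are not taken from any system card; `score`, `eager`, and `cautious` are hypothetical names.

```python
from collections import Counter


def score(outcomes):
    """Return the share of correct, incorrect, and abstained answers."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {k: counts[k] / total for k in ("correct", "incorrect", "abstain")}


# Hypothetical results on 10 questions (illustrative numbers only).
eager = ["correct"] * 7 + ["incorrect"] * 3          # always answers
cautious = ["correct"] * 6 + ["incorrect"] * 1 + ["abstain"] * 3

eager_score = score(eager)
cautious_score = score(cautious)
```

In this toy setup the eager model gets more answers right, while the cautious one gets fewer answers wrong because it abstains. Which of the two numbers you care about depends on what a wrong answer costs you in production.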

Take a production support bot, for example. You don't need it to solve some crazy mathematical equation; you need to be sure it won't hallucinate a policy that doesn't exist.

For everyone following the race: assuming Anthropic's numbers reflect reality, they just moved calibration forward by a lot.

If you're building with AI (or getting ready to start), stop choosing models based on their benchmark scores. Pick the model that tells you when it can't answer because it doesn't have enough information.

The Build Log

A marketing manager wrote a post for their company page last week. It was a good post, but it had the typical engagement for a company page (on LinkedIn, that means nobody saw it).

Then four employees amplified the same content, with each post rewritten to match that person's real voice. Those posts had 7.5x more engagement than the original post on the company's page.

That's what we built thoughtled.ai to do: to transfer voice at scale.

Most people would say this is a prompt engineering problem. That doesn't work, because a single prompt can't simultaneously analyze how someone writes and write like that person. These are different tasks, and the solution is to separate them into stages.

First, one agent extracts the voice profile from each person's existing LinkedIn posts. We create a structured set of traits, such as sentence length, hook patterns, and closing patterns. The model focuses only on analysis; generation is handled afterward.
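A minimal sketch of what a typed voice profile could look like. The field names and the `parse_profile` helper are assumptions for illustration, not the real schema, and a standard-library dataclass stands in for Pydantic so the example runs without dependencies.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VoiceProfile:
    # Hypothetical fields; the real schema is not shown in this issue.
    avg_sentence_words: float
    hook_patterns: list[str] = field(default_factory=list)
    closing_patterns: list[str] = field(default_factory=list)


def parse_profile(raw: str) -> VoiceProfile:
    """Validate the analysis agent's JSON reply and return a typed profile."""
    data = json.loads(raw)
    avg = data["avg_sentence_words"]
    if isinstance(avg, bool) or not isinstance(avg, (int, float)):
        raise ValueError("avg_sentence_words must be a number")
    for key in ("hook_patterns", "closing_patterns"):
        value = data.get(key, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{key} must be a list of strings")
    return VoiceProfile(
        avg_sentence_words=float(avg),
        hook_patterns=list(data.get("hook_patterns", [])),
        closing_patterns=list(data.get("closing_patterns", [])),
    )
```

The point of the typed schema is that a malformed analysis fails loudly at this stage instead of quietly shaping every rewrite that follows.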

Once the profile exists, a second agent takes the source content and rewrites it using the voice profile, along with a few real post examples as few-shot anchors.
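A rough sketch of how a generation prompt with few-shot anchors might be assembled. The function name, the prompt wording, and the cap of three examples are all assumptions; the profile is passed as a plain dict to keep the example self-contained.

```python
def build_rewrite_prompt(
    source: str,
    profile: dict,
    examples: list[str],
    max_examples: int = 3,  # assumed cap, not a number from the pipeline
) -> str:
    """Assemble a prompt: profile traits, few-shot examples, then the source."""
    traits = "\n".join(f"- {name}: {value}" for name, value in profile.items())
    shots = "\n\n".join(
        f"Example {i}:\n{post}"
        for i, post in enumerate(examples[:max_examples], start=1)
    )
    return (
        "Rewrite the source content in this person's voice.\n"
        "Keep every fact from the source; add none.\n\n"
        f"Voice profile:\n{traits}\n\n"
        f"Real posts by this person:\n{shots}\n\n"
        f"Source content:\n{source}\n"
    )
```

Because the prompt is an ordinary string built in one place, you can print it and read exactly what the model saw when a rewrite goes wrong.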

Finally, a critique agent checks the output against the person's profile. Does the content match the rhythm? Did the generating agent hallucinate content? Did it use some forbidden structure? When it fails, the content goes back with clear revision rules.
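The generate, critique, revise loop can be sketched like this. The `generate` and `critique` callables stand in for the LLM-backed agents, and the `max_rounds` cap is an assumption added so the loop always terminates; none of these names come from the actual codebase.

```python
from typing import Callable


def rewrite_with_critique(
    source: str,
    generate: Callable[[str, list[str]], str],
    critique: Callable[[str], list[str]],
    max_rounds: int = 3,  # assumed cap so a stubborn draft cannot loop forever
) -> tuple[str, list[str]]:
    """Generate a draft, then revise until the critic finds no issues."""
    issues: list[str] = []
    draft = ""
    for _ in range(max_rounds):
        # Issues from the previous round are passed back as revision rules.
        draft = generate(source, issues)
        issues = critique(draft)
        if not issues:
            break
    return draft, issues
```

If the returned issue list is not empty, the draft still failed after the last round, and the caller can decide whether to drop it or send it to a human.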

The stack is simple (and boring): Python, Pydantic for structured outputs (the voice profile is a typed schema, not free text), OpenAI for LLM calls, and pgvector for embeddings and sample post retrieval.

The whole pipeline without the tools is around 400 lines.

No LangChain or any other complex framework, because when you debug, you want to see each prompt and not have to dig through layers of abstraction.

The principle: separate analysis from generation. Don't use a single prompt to work out how someone writes and to write like that person; for humans and for models, these are different jobs.

If you're not building yet, try this: paste five of your recent LinkedIn posts into Claude and ask it to list your specific writing patterns, such as your hooks and your emoji use.

Don't ask it to write anything, just analyze. Check the output; you'll probably be surprised.

What you extract is much more important than how many stages or how fancy your code is. Get the voice profile wrong, and every downstream stage inherits the mistake. I'll break that down next week.

On My Radar

Simon Willison on the Opus 4.8 release

In his notes, he mentions Fast Mode is 2x more expensive than the base version but still 3x cheaper than the previous versions. Read this before saying "Opus 4.8 is 3x cheaper".

The Cursor Developer Habits Report

The top 1% of developers ship more code than the average developer. They have the same tools and models, so the difference must be how well they integrate AI into their workflow.

Anthropic/OpenAI have found product-market fit

The $20 subscription was never the product. The $200 coding agent is the first to generate enough cash to cover costs.

OpenRouter raises $113M Series B

Paying 5% more for model access is a rounding error while prototyping but expensive at scale. Use it to find the right model, then switch to the source API.
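A quick back-of-the-envelope sketch of that trade-off, using the 5% figure above. The two monthly budgets are made-up numbers and `routing_fee` is a hypothetical helper.

```python
def routing_fee(monthly_spend_usd: float, fee_rate: float = 0.05) -> float:
    """Extra cost of paying fee_rate on top of direct model spend."""
    return monthly_spend_usd * fee_rate


prototype_fee = routing_fee(200)      # hypothetical prototype budget
scale_fee = routing_fee(50_000)       # hypothetical production budget
```

With these assumed budgets, the same 5% is about $10 a month while prototyping and about $2,500 a month at scale.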

What are you building?

Got something you built? Hit reply. The best ones get featured next week.

Know someone who needs to write better content? Forward them this.

New here? Someone forwarded this to you? → Subscribe to Build What Matters: https://luisrodrigues.ai/

How was this issue?

Follow me on LinkedIn
