Stop trusting your AI tools

by Luis Rodrigues

What I’m building with AI + one hot take + four links worth your time.

The Take

Three people asked me about Scout this week. All of their companies run on Microsoft, but the fun part is that nobody asked the right question.

They all wanted to know if it's safe, but Microsoft has already answered that: it assumes it's not.

Scout is an always-on agent that lives in Teams and Outlook, has access to your files, and can do things before you even ask.

Why build an enterprise product on a framework with security flaws? The answer is simple: Microsoft didn't make OpenClaw trustworthy.

They assumed it would never be, and the main Scout container is treated as untrusted. Each tool call, model request, or network connection is processed by a zero-trust runtime with identity and policy managed outside the agent.

The agent cannot change them.

Read that again: "Microsoft assumes the agent is compromised and acts accordingly".

And this is not only a feature in Scout. It's the default of the Agent 365 control plane.

Its only job is to observe and constrain agents across the org.
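
If you want to see the shape of that idea in code, here is a minimal Python sketch. It is not Microsoft's implementation: `POLICY`, `ToolCall`, `authorize`, `run_tool`, and the tool names are all hypothetical. It only shows the pattern of a policy that lives outside the agent, denies by default, and logs every decision.

```python
from dataclasses import dataclass
from types import MappingProxyType

# Hypothetical policy table owned by the control plane, not by the agent.
# MappingProxyType is a read-only view, so code holding it cannot edit it.
POLICY = MappingProxyType({
    "read_file": True,
    "send_email": False,
})

# Every decision is recorded here, whether the call was allowed or not.
AUDIT_LOG = []


@dataclass(frozen=True)
class ToolCall:
    agent_id: str
    tool: str
    argument: str


def authorize(call: ToolCall) -> bool:
    """Decide outside the agent; tools missing from the policy are denied."""
    return POLICY.get(call.tool, False)


def run_tool(call: ToolCall) -> str:
    """Gate the call, log the decision, and only then pretend to run it."""
    allowed = authorize(call)
    AUDIT_LOG.append((call.agent_id, call.tool, "ALLOW" if allowed else "DENY"))
    if not allowed:
        return "denied by policy"
    return f"ran {call.tool}({call.argument})"
```

The agent can request whatever it wants. The gate answers from a table the agent can't write to, and anything the table doesn't mention is refused.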

Gartner forecasts that the average Fortune 500 company will run more than 150k agents in 2028, up from fewer than 15 last year.

Trust doesn't scale. At 150k agents, you can't vouch for any of them, so you stop trying and watch everything instead.

Whether you're shipping production agents or just chatting with one in a window, the lesson holds:

The time for blindly trusting these tools is over.

Your Slack or Teams agent (or the one you'll surely add soon) is a smart intern with broad access to your system but little understanding of the risks.

The most advanced companies already treat them like that.

The only question is whether you will, before it does something you can't undo.

The Build Log

Last week I showed the three agents behind thoughtled.ai's voice transfer. This week, the piece the rest depends on: the voice profile.

In our first version, I asked AI to simply describe how someone writes. The answers were nice but not very useful: "a confident, energetic communicator who connects authentically with professional audiences".

It's not possible to generate a post based on vibes.

The drafts were terrible. They came out as caricatures: too much of one trait and none of the actual patterns, which made the people reading them say, "this is not me".

And they were right, because the profile described their personality, not the way they write.

So we stopped being descriptive and started focusing on what we could measure and how.

Sentence length: short. Humor: dry. Emojis: moderate. Hashtags: none.

Each trait is extracted from real posts. And it works because we deliberately restrict the AI to only return a set of values.

Those values are typed using a Pydantic schema and built on OpenAI structured outputs.

In our architecture, the model doesn't write a paragraph describing the style of the sentences. It picks from a predefined set of options: short, mixed, long.

Pydantic validates each response before it gets to our system. If the model returns an invalid value, it will not be accepted.
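
Here is a minimal sketch of what such a schema could look like. It is not the real thoughtled.ai schema: only the `short`, `mixed`, `long` options come from this issue, the other option lists are assumed for illustration, and the OpenAI call itself is left out.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class VoiceProfile(BaseModel):
    # Each field accepts only the listed values; anything else fails validation.
    sentence_length: Literal["short", "mixed", "long"]
    humor: Literal["none", "dry", "playful"]      # assumed options
    emojis: Literal["none", "moderate", "heavy"]  # assumed options
    hashtags: Literal["none", "few", "many"]      # assumed options


def parse_profile(raw: dict) -> Optional[VoiceProfile]:
    """Return a validated profile, or None when the model's answer is invalid."""
    try:
        return VoiceProfile(**raw)
    except ValidationError:
        return None


good = parse_profile({"sentence_length": "short", "humor": "dry",
                      "emojis": "moderate", "hashtags": "none"})
# "punchy" is not one of the allowed values, so this one is rejected.
bad = parse_profile({"sentence_length": "punchy", "humor": "dry",
                     "emojis": "moderate", "hashtags": "none"})
```

The model never gets to invent a new label. It either picks from the list or the response is thrown away.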

A model can't simply say "this person uses rhetorical questions"; it needs to show which posts have rhetorical questions. Without real evidence, no trait is saved.

This is extremely important because a wrong trait in the profile is worse than a missing one.
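
A rough sketch of that evidence rule, in plain Python. `TraitClaim`, `keep_supported`, and the post IDs are hypothetical names for illustration, not our production code.

```python
from dataclasses import dataclass, field


@dataclass
class TraitClaim:
    name: str
    value: str
    evidence_post_ids: list = field(default_factory=list)


def keep_supported(claims, known_post_ids):
    """Keep only claims that cite at least one post we actually have."""
    kept = []
    for claim in claims:
        # Ignore citations to posts that don't exist in the input set.
        cited = [pid for pid in claim.evidence_post_ids if pid in known_post_ids]
        if cited:
            kept.append(TraitClaim(claim.name, claim.value, cited))
    return kept


posts = {"p1", "p2"}
claims = [
    TraitClaim("rhetorical_questions", "yes", ["p1", "p9"]),  # p9 is made up
    TraitClaim("humor", "dry", []),                           # no evidence
]
kept = keep_supported(claims, posts)
```

In this example only the first claim survives, and it keeps just the one citation that points at a real post.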

Just the traits in the screenshot above produce nearly one million possible combinations. The ones you can't see or configure are the ones that make two people with "short sentences" and "dry humor" write completely differently.

The principle is mostly the same for any AI project: don't trust free text. Use structured schemas with tools like Pydantic.

Drop everything that doesn't validate and force the model to redo it.
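
A minimal sketch of that drop-and-retry loop. `generate_valid` and `fake_model` are hypothetical stand-ins; in a real system `generate` would call the model and `validate` would be the schema check.

```python
def generate_valid(generate, validate, max_attempts=3):
    """Call `generate` until `validate` accepts the result or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if validate(candidate):
            return candidate
        # Invalid output is dropped here, never patched or partially used.
    raise ValueError(f"no valid output after {max_attempts} attempts")


ALLOWED = {"short", "mixed", "long"}
canned_answers = ["kind of punchy", "short"]


def fake_model(attempt):
    """Stand-in for a model call: wrong on the first try, valid on the second."""
    return canned_answers[attempt - 1]


result = generate_valid(fake_model, lambda value: value in ALLOWED)
```

The cap on attempts matters. Without it, a model that keeps returning junk keeps you paying for tokens forever.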

If you're not building yet, try this: ask Claude to evaluate your writing on LinkedIn using the same dimensions we use in thoughtled.ai.

Make sure you give it a list of options, and you'll see it returns consistent answers for the same set of LinkedIn posts.

The voice profile tells the system how someone writes, but it doesn't tell it who they are. That gap created a completely different type of failure, which I'll address in the next newsletter.

On My Radar

The $1,500 LLM hacking test

The author tasked five models with hacking an app. GPT-5.5 cracked it 7 out of 10 times. Opus 4.8 came third. The model that wins the vibe check is not always the one that works best.

Intelligence Per Dollar

Microsoft started sharing the average number of tokens in the model card. This will kill the question "which model is smarter?" and shift the focus to the one that matters: "how much does it cost?"

The oral tradition that built software

If you're putting AI-generated code you can't explain in production, you're not building a product; you just have a black box that is renting a fancy model.

Mem0's State of Memory in agent harnesses

The report tested memory in Claude Code, Codex, Copilot, and a few others and found 57-71% contamination between users. So the "memory" you trust in production is silently passing context from one user to another.

What are you building?

Got something you built? Hit reply. The best ones get featured next week.

Know someone who needs to write better content? Forward them this.

New here? Someone forwarded this to you? → Subscribe to Build What Matters: https://luisrodrigues.ai/
