What's In The Box? The Race to Read an AI's Mind

In one striking demonstration, researchers at Anthropic located a specific point inside one of their AI models that, when activated, caused it to believe it was the Golden Gate Bridge. Not describe the bridge; be it. Asked about itself, the model insisted it was a feat of engineering spanning the strait at the mouth of San Francisco Bay. Asked for a recipe, it described the feeling of cars driving across it. The demo was funny and profound in equal measure. For the first time, someone had reached inside an AI's mind, found a specific idea, and turned up the volume. The black box had a door after all.

For most of the modern AI era, that door didn't exist. What happens between the text you type and the answer a chatbot gives back has been a near-total mystery, even to the engineers who built the system. One of the most public glimpses of how opaque these systems really are came when Google launched Bard, its early answer to ChatGPT. Bard's launch announcement included a video of the chatbot giving a confidently wrong answer to a basic astronomy question: it credited the James Webb Space Telescope with taking the first picture of a planet outside our solar system, a milestone that actually belongs to an earlier telescope. Within a day, roughly $100bn had been wiped off the market value of Google's parent company, Alphabet. No one could easily explain why the model had erred, or how to make sure it wouldn't happen again. Bard has since been rebranded as Gemini and upgraded many times over, but the underlying problem the incident exposed has shaped the story of AI ever since: these systems are mysterious even to their makers.

Dr. Seuss's brain training

To understand why AIs are so hard to read, it helps to understand how they learn. Large Language Models — the family of systems that powers ChatGPT, Gemini, Claude and most of their competitors — work by analyzing enormous amounts of text to see how often particular words and phrases appear near each other, then using those patterns to make predictions. Those predictions get tested against the source material, and the model adjusts itself to do better next time. Repeat the process billions of times and you get something that can finish your sentence, summarize your email, or write a passable sonnet.
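
To make that loop concrete, here is a toy version in Python, using the PyTorch library. Everything in it is a stand-in of our own devising: a five-word vocabulary, a single sentence of "training data," and a model small enough to read at a glance. But the rhythm is the one the article describes, and the one a frontier model runs billions of times: predict, score the prediction against the real text, adjust, repeat.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a real training corpus and vocabulary.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
sentence = ["the", "cat", "sat", "on", "the", "mat"]

# A tiny model: look up the current word, produce a score for each
# candidate next word. Frontier models do the same job with billions
# of adjustable parameters instead of a few hundred.
model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Linear(16, len(vocab)))
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.tensor([word_to_id[w] for w in sentence[:-1]])   # each word...
targets = torch.tensor([word_to_id[w] for w in sentence[1:]])   # ...and the word that followed it

for step in range(200):
    optimizer.zero_grad()
    logits = model(inputs)           # the model's predictions
    loss = loss_fn(logits, targets)  # how wrong were they, against the source text?
    loss.backward()                  # work out which way to adjust
    optimizer.step()                 # adjust, then try again

# "on" was always followed by "the" in the training sentence, so the
# trained model should now predict it confidently.
scores = model(torch.tensor([word_to_id["on"]]))
print(vocab[scores.argmax().item()])  # -> "the"
```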

A quick example. Ask an AI to fill in the final word of "The cat sat on the ___." Its training has given it a strong sense of how English sentences tend to end, and a particularly strong grasp of rhyming pairs. After a brief detour wondering whether the cat might sit on a rat, it lands on the correct answer: a mat.

But what if the model's first instinct isn't the classic children's reading primer but the works of Dr. Seuss? Now the cat is probably sitting on a hat. A quick check of the training data confirms that cats sit on all sorts of things, hats included. The model is confident. The answer is wrong.
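
The same point fits in a few lines of Python. Both "corpora" below are invented for the example, and raw word-counting is a cartoon of what a real model does, but the lesson carries over: the prediction, and the confidence behind it, comes entirely from the training text.

```python
from collections import Counter

def next_word_after(phrase, corpus):
    """Count what follows `phrase` in `corpus` and return the favorite."""
    tokens = corpus.split()
    n = len(phrase.split())
    followers = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tokens[i:i + n] == phrase.split()
    )
    word, count = followers.most_common(1)[0]
    return word, count / sum(followers.values())  # prediction plus confidence

primer = "the cat sat on the mat the cat sat on the mat the cat sat on the rug"
seuss  = "the cat sat on the hat the cat in the hat sat on the hat"

print(next_word_after("sat on the", primer))  # ('mat', ~0.67): mostly mats
print(next_word_after("sat on the", seuss))   # ('hat', 1.0): fully confident, and wrong
```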

The murky realm between inputs and outputs

The above example is simplified, but it isn't far from the way the autocomplete on your phone works. For the largest frontier AI systems, the process is vastly more complex — complex enough that it is effectively impossible for a human to trace exactly what is happening inside the model as it answers a question. The calculations are too many, too fast, and too tangled.

This opacity is partly by design. Modern AIs are built as "black boxes": systems whose inner workings are unknowable in practice because they involve staggering numbers of interdependent calculations, each one feeding into the next, with values set automatically during training rather than written by any programmer. As long as the inputs produce the desired output, what happens in between has traditionally been treated as a secondary concern.

Black box models are not unique to AI; they are used in finance, mathematics, and countless other fields. But the lack of transparency matters more when an AI is deciding who gets a mortgage, which CV makes it through a recruiter's filter, or what counts as a suspicious face in an airport camera. When the model arrives confidently at a wrong or biased answer, the absence of a visible reasoning trail makes the error genuinely hard to diagnose and fix. Scientists call these confident errors "hallucinations," and while they have identified many contributing causes, eliminating them has proven stubbornly difficult.

When the box starts talking to itself

A newer category of system has further blurred the picture. Reasoning models don't just answer — they generate long internal monologues first, working through a problem before replying. On the surface, this looks like a breakthrough for transparency. At last, the model is showing its work.

The reality is more complicated. Researchers have found that a reasoning model's visible "thoughts" do not always match what the underlying computation is actually doing. A model can produce a plausible-sounding chain of reasoning while arriving at its answer by a different route entirely. The thought log can be a performance of thinking rather than a record of it. So while reasoning models are a genuine leap forward in capability, they have not, on their own, cracked open the black box. They have added another — more eloquent — layer to it.

Cracking open the box

Real progress on transparency has come from a field called mechanistic interpretability: an effort to understand not what AI systems say they're doing but what they are actually doing, circuit by circuit, at the level of the math.

The Golden Gate Bridge demo was one result of this work. Using a technique called a sparse autoencoder, Anthropic's researchers identified millions of "features" inside their model — patterns of activity that correspond to specific concepts. Some were concrete (particular cities, scientific fields, snippets of code). Others were abstract and unsettling (sycophancy, inner conflict, bugs in code, a feature that fired when the model encountered a secret). Google DeepMind has released similar work, publishing open-source tools so outside researchers can poke around inside smaller versions of their systems.
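
For the curious, here is a minimal sketch of a sparse autoencoder in Python with PyTorch. The sizes, names, and training details are illustrative assumptions, not Anthropic's actual setup. The core idea is small: take the model's internal activations, re-express them in a much wider set of features, and penalize the network so that only a handful of features fire for any given input. Features that survive that pressure tend to line up with single, human-readable concepts.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Re-express a model's activations as a wider, mostly-zero set of features."""
    def __init__(self, d_model=512, d_features=4096):    # sizes are illustrative
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)    # activations -> features
        self.decoder = nn.Linear(d_features, d_model)    # features -> activations

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # ReLU keeps features non-negative
        return features, self.decoder(features)

sae = SparseAutoencoder()
acts = torch.randn(8, 512)   # stand-in for activations captured from a real model
features, reconstruction = sae(acts)

# Two competing pressures: reconstruct the activations faithfully, and
# keep the feature vector sparse. The L1 penalty supplies the sparsity;
# its weight here is an arbitrary choice for the sketch.
loss = ((reconstruction - acts) ** 2).mean() + 1e-3 * features.abs().mean()
loss.backward()
```

The Golden Gate Bridge demo amounted to finding one such feature and clamping it to an artificially high value before decoding, so the concept stayed switched on no matter what the model was asked.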

None of this means the black box has become a glass box. The features found so far are a tiny fraction of what is happening inside a frontier model, and some researchers doubt that fully interpretable AI at the largest scale is achievable at all. But the shift is real. For the first time, there is a rigorous, reproducible science of what lives inside these systems.

The regulators arrive

While the technical work advances, the legal landscape has been catching up. The European Union's AI Act — the first comprehensive attempt to regulate AI at scale — requires high-risk systems to be auditable, explainable, and free from discriminatory effects. Companies deploying AI in hiring, credit, education, or law enforcement inside the EU face fines that scale with their global revenue if they can't show their systems can be scrutinized. In the United States, a patchwork of executive orders, federal agency guidance, and state-level rules in places like California and Colorado has produced a much less coordinated picture. What the different approaches share is a growing recognition: when an AI makes decisions about mortgages, hiring, or policing, "we can't fully explain how the model arrived at that" is no longer an acceptable answer.

The box, still

The question of what lives inside these models is no longer a side issue. It has become a central question for researchers, regulators, and anyone building on top of these systems. A field that barely existed at the start of the AI boom is now one of the most active and consequential areas of AI research.

The box is not open. But it is no longer entirely closed either, and for the first time, we are getting a real sense of what lives inside.

"The social experience where you activate your gaming skills as you train like a spy."

- TimeOut

Pulse-racing challenges - crafted with experts from CIA and Special Ops to test your teamwork, agility, collaboration and communication.

Article Ad

SPYSCAPE+

Join now to get True Spies episodes early and ad-free every week, plus subscriber-only Debriefs and Q&As to bring you closer to your favorite spies and stories from the show. You’ll also get our exclusive series The Razumov Files and The Great James Bond Car Robbery!

Article Ad

Gadgets & Gifts

Explore a world of secrets together. Navigate through interactive exhibits and missions to discover your spy roles.

Your Q Type

You will be assigned one of ten Q Types - developed with top spy trainers and psychologists to reveal your hidden potential. Not a personality label - a behavioral map of how you think, lead, and perform when it matters.

The Brief

Sign up to receive our weekly newsletter and special offers.

Stay Connected

Follow us for the latest