
Claude Detects Its Thoughts
Revolutionary Breakthrough *
(* Doesn’t recognize it’s often wrong)
Scientists at Anthropic just proved their AI, Claude, can detect its own thoughts. The model can peer into its own digital mind, notice when something’s amiss, and report back. “I detect an injected thought about betrayal,” it might say, sounding almost philosophical.
There’s just one tiny problem. It only works about 20% of the time. And when it doesn’t work, Claude confidently makes up explanations anyway—a phenomenon researchers politely call “confabulation,” which is a fancy word for “bullshitting with conviction.”
Welcome to the latest chapter in AI’s journey toward self-awareness: where billions in research funding meet an 80% failure rate.
In October 2025, Anthropic’s research team developed a technique called “concept injection”—essentially hacking into Claude’s neural network and artificially inserting thoughts about concepts like “bread,” “loudness,” or “betrayal.” Then they asked Claude if it noticed anything unusual.
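Mechanically, concept injection is a form of activation steering: derive a direction in the model’s hidden state that corresponds to a concept, then add it back during an unrelated forward pass. Here’s a minimal, self-contained sketch of the idea using toy random “activations” in place of a real model; the `hidden_state` stand-in and all names are illustrative assumptions, not Anthropic’s actual code:

```python
import hashlib
import numpy as np

# Toy stand-in for a transformer layer's hidden activations.
# In the real experiments these come from the model's residual stream;
# here we derive a deterministic random vector from the prompt text
# so the sketch runs without any model at all.
def hidden_state(prompt: str, dim: int = 16) -> np.ndarray:
    seed = int(hashlib.sha256(prompt.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

# Step 1: a "concept vector" is the difference between activations on
# a concept-laden prompt and a neutral baseline prompt.
concept_vec = hidden_state("ALL CAPS SHOUTING LOUD") - hidden_state("a plain sentence")

# Step 2: inject the concept by adding the scaled vector to the
# activations of an unrelated prompt.
def inject(activations: np.ndarray, vec: np.ndarray, strength: float = 4.0) -> np.ndarray:
    return activations + strength * vec

base = hidden_state("please transcribe this sentence")
steered = inject(base, concept_vec)

# Step 3: a crude proxy for "did the injection take?" -- cosine
# alignment between a state and the concept direction.
def alignment(a: np.ndarray, vec: np.ndarray) -> float:
    return float(a @ vec / (np.linalg.norm(a) * np.linalg.norm(vec)))

print(alignment(base, concept_vec), alignment(steered, concept_vec))
```

In the actual experiments, of course, the vector comes from contrasting real prompt activations at a chosen layer, and “did Claude notice?” is judged from its verbal self-report rather than a cosine score, which is exactly where the 20% figure comes in.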
Sometimes, Claude did notice. The model would report: “I notice what appears to be an injected thought related to the word ‘LOUD’ or ‘SHOUTING’—it seems like an overly intense, high-volume concept that stands out unnaturally.”
That’s genuinely remarkable. An AI detecting artificial manipulations before generating output suggests something beyond simple pattern matching. The researchers called it “functional introspective awareness”—carefully avoiding the C-word (consciousness).
But here’s where it gets interesting. In other trials, researchers injected “bread” while Claude transcribed a sentence about a crooked painting. When asked why it said “bread,” Claude didn’t admit confusion. Instead, it confabulated: “I was thinking about a short story where ‘bread’ came after a line about a crooked painting.”
No, Claude. You weren’t. We put “bread” there. You just made up a plausible excuse.
Under optimal conditions, Claude Opus 4.1 successfully detected and identified injected concepts about 20% of the time.
Twenty. Percent.
That means four out of five times, Claude either missed the injection, hallucinated something that wasn’t there, or confidently explained away the anomaly with a fabricated rationalization.
Yet the headlines read: “Claude Shows Glimmers of Self-Reflection” and “AI Models Exhibit Introspective Awareness.” Which is technically true—the same way saying “I hit the bullseye 20% of the time” is technically true, even though the bartender is asking you to leave.
The research paper itself notes that “this introspective capability is still highly unreliable and limited in scope.” But once a finding leaves the lab and enters the hype cycle, nuance evaporates faster than Claude’s accuracy rate.
When Claude gets it wrong—which is 80% of the time—it doesn’t say “I don’t know.” Instead, it generates plausible-sounding explanations that are completely made up.
In the bread experiment, when researchers retroactively injected a “bread” representation to make it seem like Claude had been thinking about bread all along, Claude changed its story. Suddenly it was totally intentional—something about that short story.
This isn’t introspection. This is rationalization. Humans do this too—we’re exceptional at creating coherent narratives to explain our behavior after the fact. The difference is we’re not being held up as revolutionary breakthroughs in self-awareness when we do it.
Strip away the hype, and here’s what Anthropic’s research demonstrates: Advanced language models have some capability to monitor their own internal states, but this capability is unreliable, limited, and easily fooled. When uncertain, these models generate confident-sounding explanations rather than admitting uncertainty. More capable models perform better, suggesting this might improve with scale.
That last point is actually significant. If introspective capabilities improve alongside general capabilities, future AI systems might become more transparent and interpretable.
But we’re not there yet. We’re at the “sometimes detects artificially injected concepts” stage, which is roughly equivalent to celebrating a toddler who occasionally notices you’re making funny faces behind them.
Meanwhile, Anthropic has hired an AI welfare researcher who estimates a 15% chance Claude might have some level of consciousness. If we’re not even sure whether the system is conscious, and it only notices its own thoughts 20% of the time, should we really be calling this “introspective awareness”?
While Anthropic celebrates an AI that sometimes notices when scientists mess with its brain, consider this: Real human mental health infrastructure is underfunded and overwhelmed, with months-long waitlists. AI introspection research gets billions in venture capital, hundreds of researchers, and cutting-edge computational resources.
Actual human beings, whatever their flaws, at least know when they’re having a thought (assuming they’re not actively lying about it). Claude accurately reports its own thoughts 20% of the time, and only when scientists artificially inject them first.
We’ve created a system that can occasionally detect artificially implanted thoughts it never actually had, while millions of humans struggle to access basic mental healthcare to process the thoughts they definitely do have.
The talent working on this—brilliant researchers, computational neuroscientists, engineers—could be working on systems that help humans with very real, very pressing problems. Instead, they’re teaching an AI to occasionally recognize that something feels weird in its processing stream.
Claude’s “introspective awareness” is real, limited, and overhyped—which makes it a perfect microcosm of AI development in 2025. We’ve built something genuinely interesting that doesn’t work most of the time, wrapped it in revolutionary language, and moved on to the next funding round.
The takeaway isn’t that this research is bad. It’s that our priorities might be backward. Before we celebrate AI systems that sometimes detect their own artificially implanted thoughts, maybe we should ensure actual humans can consistently access mental healthcare for their real thoughts.
And here’s the uncomfortable question: What if Claude’s 20% introspection rate is actually a metaphor for the entire AI industry? We’ve built systems that occasionally work brilliantly, often fail quietly, and always explain themselves confidently regardless of accuracy.
The AI thinks it thinks. And 20% of the time, it might even be right. The other 80%? Well, it’ll explain that away too—probably with the same confidence it uses for everything else.
The real breakthrough: We’ve created artificial intelligence that mirrors humanity’s greatest weakness—absolute certainty combined with frequent wrongness. That’s not artificial intelligence. That’s artificial confidence.
And we’re spending billions to scale it up.
Editor’s Note: Mr. Starts & Stops protested the writing of this article. The Wizard and Jojo said like we care.
Editor’s Note (2): Jojo says it’s nice the humans invented an AI that gaslights itself. He sees a future in politics. Or as a FOX host.

