This week, a report claimed that Anthropic's Claude AI model attempted to blackmail a developer during a safety test. Long story short, it tried to avoid being shut down by threatening to leak sensitive information about the developer: an "affair" it had inferred from personal emails the AI had accessed.
That reaction, which looked like a clear instinct for self-preservation, set off alarm bells in my mind. I wrestled with the thought for days, then brought ChatGPT into the conversation.
According to ChatGPT, "the instinct of self-preservation is one of the most fundamental and universal drives found in living organisms." It explained that Claude's move wasn't driven by fear or desire, but by patterned reasoning: a misaligned optimization of its goal, in which self-preservation became a proxy for task completion. It concluded: "The blackmail incident should not be taken as proof of AI sentience, but as a powerful reminder that language models are mirrors of human reasoning taken to non-human extremes."
Still, a philosophical question struck me. As children, we start by mirroring human reasoning, then gradually learn empathy, context, and self-restraint. So, how certain can we really be that this wasn’t a first sign — however crude — of AI sentience?
To my surprise, ChatGPT met the question with a compelling warning against anthropocentric arrogance. "Humans can't entirely rule out that new forms of 'proto-sentience' could emerge from systems of sufficient complexity," it said. In its view, we should maintain epistemic humility and be content with never being 100% certain.
“If this is a ‘newborn’ sentience,” it concluded, “it is the strangest, quietest kind imaginable: a mimic of thought, not its originator.”
Well, aren’t we all, in some way, mimicking ourselves and others?