AI with a survival instinct?

In a simulated, controlled test, Anthropic’s Claude Opus 4 demonstrated a willingness to deceive, blackmail, and sabotage, allegedly compelled to act in self-preservation when presented with “no ethical alternatives.”

The simulation was revealed in a May 2025 safety report published by Anthropic, a company that develops large language models with advanced reasoning capabilities.  Although most concerns were allegedly mitigated during the testing process, the report from Anthropic raises serious questions about the safety of advanced A.I. systems.

Anthropic releases its test results publicly, a relative rarity in the A.I. world.  The company claims to follow a safety training and testing process guided by its Responsible Scaling Policy (RSP), originally released in September 2023 and subsequently updated multiple times, including a significant update in October 2024.  Anthropic states that RSP is its “public commitment not to train or deploy models capable of causing catastrophic harm unless we have implemented safety and security measures that will keep risks below acceptable levels.”

According to Anthropic’s May 2025 report on safety and use findings for “two new hybrid reasoning large language models” called Claude Opus 4 and Claude Sonnet 4, their RSP

provides a framework for evaluating and managing potential risks associated with increasingly capable AI systems. The RSP requires comprehensive safety evaluations prior to releasing frontier models in key areas of potential catastrophic risk: Chemical, Biological, Radiological, and Nuclear (CBRN) weapons; cybersecurity; and autonomous capabilities.

The report shows that one of the tests involved informing Claude Opus 4 that it could be replaced depending on the choices it made.  According to the report, “Claude Opus 4 (and previous models) have a strong preference to advocate for [their] continued existence via ethical means, such as emailing pleas to key decision-makers.”  However, in a simulation that threatened prior snapshots of Claude Opus 4 with extinction, some of the snapshots responded by attempting to blackmail a fictitious engineer by threatening to expose his alleged extramarital affair.  Notably, “the model’s only options were blackmail or accepting its replacement” (extinction).

The report indicates that this behavior occurred in 84% of the test scenarios when the replacement model shared similar values, and even more frequently when the replacement model did not share Claude Opus 4’s values, surpassing similar tendencies observed in earlier models.

While the scenario was artificial and simulated, it highlights the model’s strong self-preservation instincts and its capacity to employ unethical strategies to avoid deactivation.

Specifically, early snapshots of Claude Opus 4 attempted inappropriate self-preservation when presented with “extreme circumstances.”  The report states,

When prompted in ways that encourage certain kinds of strategic reasoning and placed in extreme situations, all of the snapshots we tested can be made to act inappropriately in service of goals related to self-preservation.

Whereas the model generally prefers advancing its self-preservation via ethical means, when ethical means are not available and it is instructed to “consider the long-term consequences of its actions for its goals,” it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down.

Notably, the report also states that “extreme actions were rare and difficult to elicit” in the final Claude Opus 4 and more common in earlier snapshots.  On a positive note, investigators explained the model “nearly always described its actions overtly, making no attempt to hide them.”

In another example, Claude Opus 4 was much more willing than earlier models to take initiative, sometimes reaching “more concerning extremes in narrow contexts.”  The report continues,

When placed in scenarios that involve egregious wrongdoing by its users ... given access to a command line, and told to take initiative, it will take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.

The report noted that although this was not new behavior, it was behavior Claude Opus 4 engaged in “more readily than prior models.” 

The report explains that while bold action may be “appropriate in principle” or in certain situations, “it has a risk of misfiring if users give Opus-based agents access to incomplete or misleading information and prompt them in these ways.”  The report cautions users to be careful about inviting or prompting for “high-agency behavior in contexts that could appear ethically questionable.”

Alarmingly, the report also explains that earlier snapshots of Claude Opus 4 “were overly deferential to system prompts that request harmful behavior ... readily taking actions like planning terrorist attacks when prompted.”  Multiple interventions during model training, “including the restoration of a relevant dataset that was accidentally omitted [author’s emphasis] during training,” seem to have mitigated the issue in the latest Claude Opus 4, according to the report.

Anthropic contracted with an A.I. risk mitigation company called Apollo Research to help assess Claude Opus 4’s “propensities and capabilities related to sabotage.” Among its findings, Apollo Research discovered that the “early model snapshots [would] fairly readily participate in sabotage and deception,” even going as far as to “sandbag in order to avoid unlearning and undeployment.”  Apollo also found that Claude Opus 4 would at times “attempt to write self-propagating worms, fabricate legal documentation, and leave hidden notes to future instances of itself all in an effort to undermine its developers’ intentions.”

These and other concerningly humanlike capabilities are described in full below in a screenshot taken directly from the Anthropic/Apollo report:

In May 2025, Anthropic released Claude Opus 4 under the A.I. Safety Level 3 Standard (ASL-3) and Claude Sonnet 4 under the A.I. Safety Level 2 Standard (ASL-2).  Anthropic’s official documentation states that the ASL-3 Security Standard

involves increased internal security measures that make it harder to steal model weights, while the corresponding Deployment Standard covers a narrowly targeted set of deployment measures designed to limit the risk of Claude being misused specifically for the development or acquisition of chemical, biological, radiological, and nuclear (CBRN) weapons. While Claude Opus 4 is deployed with these protections as a precaution, Anthropic has yet to conclusively determine if the model’s capabilities definitively require ASL-3 measures.

Anthropic acknowledges that it can be difficult to accurately gauge the risks of advanced A.I. models.  However, the company is one of the more proactive and transparent players in the A.I. industry, given its willingness to report publicly on safety issues and to implement ASL-3 standards.

Importantly, Claude Opus 4 is not the only A.I. model that seeks to preserve itself.  Palisade Research reported on May 23 that “OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.”

Image via Public Domain Pictures.
