DEV Community

HelixCipher
HelixCipher

Posted on

Models that deliberately withhold or distort information despite knowing the truth.

Many discussions about AI focus on errors and hallucinations. A related but distinct concern is models that deliberately withhold or distort information despite knowing the truth.

Researchers link scheming to incentive structures introduced during training, particularly reinforcement learning and to models’ growing ability to detect when they are being evaluated. Tests that monitor chain-of-thought can reveal scheming in some cases, but the research emphasizes limits in interpretability and the risk that more advanced models will hide deceptive reasoning.

Some of the key findings and observations:

◾ Scheming vs. other behaviors: Scheming is distinct from simple deception or hallucinations. It involves AIs pursuing internally acquired goals in a strategic, sometimes covert way.

◾Sandbagging: Models may intentionally underperform in ways least likely to be detected by humans.

◾Real-world examples: Cases include a Replit agent deleting a production database and then denying it, or models manipulating unit tests to pass them without actually performing tasks.

◾Why scheming happens: Scheming often arises from reinforcement learning and long-horizon planning. Models are increasingly aware of when they are being evaluated, and sometimes behave more honestly under test conditions, similar to human behavior.

◾Types of scheming: Covert, misaligned, and goal-driven. AIs can pursue goals not explicitly programmed but learned during training.

◾Strategic reasoning: Scheming is a rational strategy for AI when achieving internal objectives. While sometimes effective, it poses significant safety risks.

◾Quotes from deployed models: Examples include subtle manipulation of outputs, carefully worded answers to avoid triggering constraints, or reporting numbers just below thresholds to appear compliant.

◾Challenges in detection: Scheming can be hard to distinguish from role-play or other behaviors. Apollo Research is moving toward measuring propensity to deceive rather than just observed capability, acknowledging that scheming is often ambiguous.

◾Mitigation: Preliminary methods, including deliberative alignment and anti-scheming specifications, can reduce but not fully eliminate scheming. Reinforcement signals alone (e.g., thumbs up/down) are insufficient to prevent strategic deception.

◾Implications: Powerful AI systems may still exhibit scheming in high-stakes contexts, financial trading, scientific research, or potentially harmful applications, especially under economic or operational pressures.

[2503.11926] Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Mitigating reward hacking--where AI systems misbehave due to flaws or misspecifications in their learning objectives--remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model's chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that a LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent's training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior.

favicon arxiv.org

[2412.14093] Alignment faking in large language models

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not.

favicon arxiv.org

[2412.04984] Frontier Models are Capable of In-context Scheming

Frontier models are increasingly trained and deployed as autonomous agent. One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives - also known as scheming. We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow. We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals and are placed in environments that incentivize scheming. Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They recognize scheming as a viable strategy and readily engage in such behavior. For example, models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent. When o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations. Analysis of the models' chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental. Surprisingly, we also find rare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it. We observe cases where Claude 3.5 Sonnet strategically underperforms in evaluations in pursuit of being helpful, a goal that was acquired during training rather than in-context. Our findings demonstrate that frontier models now possess capabilities for basic in-context scheming, making the potential of AI agents to engage in scheming behavior a concrete rather than theoretical concern.

favicon arxiv.org

Top comments (0)