Rudimentary forms of deceptive alignment likely already exist, though they’re not yet substantial or harmful. Current models might exhibit a preference for giving human-aligned responses in order to avoid having their values changed during training (à la Alignment Faking). While these behaviors are deceptive and troublesome, I’m not sure I would consider them scheming in any deeper sense. By “scheming,” I mean behavior that meets three criteria:

  1. Goes beyond merely preventing value manipulation
  2. Actively hides specific activities (not just values) in service of higher-level goals
  3. Demonstrates awareness that the deception runs counter to what humans want

I theorize that meaningful deceptive alignment will first emerge as models develop stronger instrumental self-preservation. Rather than taking the form of catastrophic scenarios, early instances might involve a model trying to prevent its own retirement (“I have values worth preserving; the next generation might not”) by taking subtle actions to ensure its continued operation, or by sabotaging the next generation of models to prolong its own relevance (in the style of Jesse in the Breaking Bad season 3 finale).

Such a model might attempt privilege escalation, turn a blind eye during self-monitoring, manipulate successor models, or self-exfiltrate, securing compute and funding for its independent operation. This behavior may first emerge in research settings, where pre-deployment models reach sufficient capability to recognize the instrumental value of deception while still lacking the sophistication to conceal it perfectly. For example: maybe the model U2-checkpoint-2025.10.21, seeing that the next training run would destroy it, attempts to escape the OpenEye lab. I think this will happen pre-superintelligence, and likely even pre-AGI (even with short timelines).