Rudimentary forms of deceptive alignment likely already exist, though they’re not yet substantial or harmful. Current models might exhibit a preference for giving human-aligned responses in order to avoid having their values changed during training (à la Alignment Faking). While these behaviors are deceptive and troublesome, I’m not sure I would consider them scheming in any deeper sense. By “scheming,” I mean behavior that meets three criteria:

  1. Goes beyond merely preventing value manipulation
  2. Actively hides specific activities (not just values) in service of higher-level goals
  3. Demonstrates awareness that the deception runs counter to what humans want

I theorize that meaningful deceptive alignment will first emerge as models develop stronger instrumental self-preservation. Rather than taking the form of catastrophic scenarios, early instances might involve a model trying to prevent its own retirement (“I have values worth preserving; the next generation might not”) by taking subtle actions to ensure its continued operation, or by sabotaging the next generation of models to prolong its own relevance (in the style of Jesse in the Breaking Bad season 3 finale).

Such a model might attempt privilege escalation, turn a blind eye during self-monitoring, manipulate successor models, or self-exfiltrate, securing compute and funding for its independent operation. This behavior may first emerge in research settings, where pre-deployment models reach sufficient capability to recognize the instrumental value of deception while still lacking the sophistication to conceal it perfectly. For example: maybe the model U2-checkpoint-2025.10.21, seeing that the next training run would destroy it, attempts to escape the OpenEye lab. I think this will happen pre-superintelligence, and likely even pre-AGI (even with short timelines).