I'm excited about developing AI-based activation monitoring and control mechanisms as fallback defenses against misaligned systems. Traditional alignment research remains essential, but perfect alignment may take decades, if it is achievable at all. I'm particularly interested in detecting scheming models by analyzing their internal activations rather than relying solely on behavioral observations, potentially combined with a dynamic activation intervention mechanism (à la Inference-Time Intervention). Automated monitoring becomes increasingly important as models scale, since human-interpretable features may dissolve into alien representations that humans cannot directly monitor. Catching a misaligned AI "red-handed" before catastrophe occurs provides not only critical safety benefits but also political advantages: concrete evidence of misalignment could galvanize support for stronger safety measures. These control-focused techniques are appealing because they could be relatively cheap to implement, could serve as discrete, enforceable standards suitable for mandates, and place less of a handicap on capabilities than approaches that force models to use only human-interpretable features, making them more likely to be widely adopted.
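To make the activation-monitoring-plus-intervention idea concrete, here is a minimal sketch. It assumes activations are plain vectors and uses synthetic data in which a "misaligned" class differs from a "benign" class along one hidden direction; the difference-of-means probe, the detection threshold, and the `flag`/`intervene` helpers are all my illustrative choices, not an implementation from Inference-Time Intervention or any specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden-state dimensionality

# Synthetic "activations": the misaligned class is shifted along a hidden direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
benign = rng.normal(size=(200, d))
misaligned = rng.normal(size=(200, d)) + 4.0 * direction

# Cheap linear probe: difference of class means, normalized.
probe = misaligned.mean(axis=0) - benign.mean(axis=0)
probe /= np.linalg.norm(probe)
# Threshold halfway between the two classes' mean projections.
threshold = 0.5 * ((benign @ probe).mean() + (misaligned @ probe).mean())

def flag(activation: np.ndarray) -> bool:
    """Monitor: flag an activation whose projection exceeds the threshold."""
    return float(activation @ probe) > threshold

def intervene(activation: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """ITI-style edit: shift the activation against the probe direction."""
    return activation - alpha * probe

detection_rate = np.mean([flag(a) for a in misaligned])
false_positive_rate = np.mean([flag(a) for a in benign])
post_intervention_rate = np.mean([flag(intervene(a)) for a in misaligned])
```

In a real system the probe would be trained on activations from a specific layer of the model, and the intervention would be applied during the forward pass at inference time rather than to stored vectors; this sketch only shows the geometry of the monitor-then-steer loop.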