🤗 Hugging Face 6d ago
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
ASGuard uses circuit analysis to identify the attention heads responsible for tense-based jailbreaks, then applies channel-wise activation scaling to surgically mitigate this vulnerability. The work also offers a mechanistic account of why safety-aligned models fail when harmful requests are rephrased in the past tense.
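
To make "channel-wise activation scaling" concrete, here is a minimal NumPy sketch of the general idea: multiply the output of selected attention heads by a per-channel scale vector and leave all other heads untouched. This is an illustration under stated assumptions, not ASGuard's implementation. The function name `scale_head_channels`, the tensor layout `(seq_len, n_heads, d_head)`, the head indices, and the scale values are all hypothetical; how the scales are obtained is not covered by the summary above.

```python
import numpy as np


def scale_head_channels(head_out, scales):
    """Return a copy of head_out with selected heads rescaled per channel.

    head_out: array of shape (seq_len, n_heads, d_head), the per-head
              attention outputs before they are merged (assumed layout).
    scales:   dict mapping a head index to a vector of shape (d_head,).
              Heads not listed are passed through unchanged.
    """
    scaled = head_out.copy()
    d_head = head_out.shape[-1]
    for head, vec in scales.items():
        vec = np.asarray(vec, dtype=head_out.dtype)
        if vec.shape != (d_head,):
            raise ValueError(f"scale for head {head} must have shape ({d_head},)")
        # Broadcasting applies the same channel scales at every position.
        scaled[:, head, :] *= vec
    return scaled


# Toy activations: 4 positions, 8 heads, 16 channels per head.
rng = np.random.default_rng(0)
head_out = rng.standard_normal((4, 8, 16))

# Hypothetical "vulnerable" heads 2 and 5 with made-up scale vectors.
scales = {
    2: np.full(16, 0.5),           # damp every channel of head 2 by half
    5: np.linspace(0.0, 1.0, 16),  # damp head 5 channels by varying amounts
}
scaled = scale_head_channels(head_out, scales)
```

In a real model the same multiplication would typically be applied inside the forward pass, at the point where per-head outputs are available, so that only the identified heads are affected.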