Mechanistic-interpretability — Topic

📑 arXiv 3d ago

Class Unlearning via Depth-Aware Removal of Forget-Specific Directions

DAMP is a one-shot, closed-form weight-surgery method for class unlearning: it removes forget-specific directions across network depth and needs no gradient-based optimization. Unlike existing methods that rely on suppressing the forgotten class at the classifier, DAMP demonstrates representational forgetting, eliminating the targeted knowledge from the internal representations themselves without retraining.
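To make "closed-form removal of a direction" concrete, here is a minimal NumPy sketch of the general idea of projecting a direction out of a layer's weights. It is not DAMP's actual procedure, which the summary does not spell out. The assumptions are mine: the forget-specific direction is estimated as a difference of mean activations, and the helper names `unit_direction` and `remove_direction` and the per-layer `strength` parameter are hypothetical.

```python
import numpy as np

def unit_direction(forget_acts, retain_acts):
    """Assumed estimator: normalized difference of mean activations."""
    d = forget_acts.mean(axis=0) - retain_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def remove_direction(W, d, strength=1.0):
    """Closed-form edit W' = W (I - strength * d d^T), no gradient steps.

    W has shape (out_dim, in_dim) and d is a unit vector of length in_dim.
    With strength=1.0 the edited layer ignores any input component along d.
    A depth-aware scheme could pick a different strength per layer
    (hypothetical knob, not taken from the paper).
    """
    return W - strength * np.outer(W @ d, d)

# Toy data: forget-class activations are shifted along the first axis.
rng = np.random.default_rng(0)
retain = rng.normal(size=(200, 8))
forget = rng.normal(size=(200, 8)) + 3.0 * np.eye(8)[0]

W = rng.normal(size=(4, 8))          # weights of the layer reading these activations
d = unit_direction(forget, retain)
W_edited = remove_direction(W, d)

# The edited layer no longer responds to the removed direction.
residual = np.linalg.norm(W_edited @ d)

# Inputs orthogonal to d pass through unchanged.
v = rng.normal(size=8)
v_orth = v - (v @ d) * d
unchanged = np.allclose(W_edited @ v_orth, W @ v_orth)
```

The edit is a single matrix update per layer, which is what makes this kind of surgery one-shot. Whether forgetting is representational rather than classifier-level depends on applying such edits to internal layers, not only to the final one.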

Safety Training Unlearning Mechanistic-interpretability
