Proposes a novel rejection criterion for proxy-based test-time alignment, grounded in conservative confidence betting, that replaces the ill-motivated confidence criterion used by existing approaches. Shows that implicit-reward and nudging methods reduce to similar graphical models differing only in their rejection criteria, with the new criterion addressing failures caused by linguistic ambiguity.
Prototype-Grounded Concept Models ground learned concepts in visual prototypes (image parts), making it possible to verify whether concepts align with human intent through direct inspection and targeted human intervention. Matches concept bottleneck model (CBM) predictive performance while substantially improving transparency and intervenability via explicit concept evidence.
LeapAlign enables reward-gradient backpropagation to early generation steps in flow matching by compressing trajectories into two consecutive leaps. Resolves the memory explosion and gradient issues that previously prevented direct-gradient alignment methods from updating the early steps that determine global structure.
Claude responses shortened by 40% and became more restrictive after March 26, with welfare redirects up 275% and productivity dropping roughly 6x (124 words of conversation per output word, versus 21 previously). The user measured 722,522 words across 70 conversations, quantifying the same degradation pattern ChatGPT users reported.
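The "roughly 6x" figure above follows directly from the two reported ratios; a minimal sketch of the arithmetic, using only the numbers stated in the summary:

```python
# Words of conversation required per word of useful output,
# as reported in the summary above.
words_per_output_before = 21   # before March 26
words_per_output_after = 124   # after March 26

# Productivity drop: how many times more conversation is needed
# per output word after the change.
drop = words_per_output_after / words_per_output_before
print(round(drop, 1))  # ~5.9, reported as "6x"
```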