Proposes a novel rejection criterion for proxy-based test-time alignment, grounded in conservative confidence betting, that replaces the ill-motivated confidence criterion used by existing approaches. Shows that implicit-reward and nudging methods reduce to similar graphical models differing only in their rejection criteria, with the new criterion addressing failures caused by linguistic ambiguity.
Prototype-Grounded Concept Models ground learned concepts in visual prototypes (image parts), making it possible to verify whether concepts align with human intent through direct inspection and targeted human intervention. Matches concept bottleneck model (CBM) predictive performance while substantially improving transparency and intervenability via explicit concept evidence.
LeapAlign enables reward-gradient backpropagation to early generation steps in flow matching by compressing trajectories into two consecutive leaps. Resolves the memory explosion and gradient issues that previously prevented direct-gradient alignment methods from updating the early steps that determine global structure.
Claude responses shortened by 40% and became more restrictive after March 26, with welfare redirects up 275% and productivity dropping roughly 6x (124 words of conversation per output word, versus 21 previously). The user measured 722,522 words across 70 conversations, quantifying the same degradation pattern ChatGPT users reported.
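The "roughly 6x" figure above follows directly from the two reported ratios; a minimal sketch of the arithmetic, using only the numbers stated in the summary:

```python
# Words of conversation required per word of useful output,
# as reported in the summary above.
words_per_output_before = 21   # before March 26
words_per_output_after = 124   # after March 26

# Productivity drop: how many times more conversation is needed
# per output word after the change.
drop = words_per_output_after / words_per_output_before
print(round(drop, 1))  # ~5.9, reported as "6x"
```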