Ternary Bonsai constrains weights to the ternary set {-1, 0, +1} (log2 3 ≈ 1.58 bits per weight), achieving a memory footprint roughly 9x smaller than 16-bit models while outperforming peers on standard benchmarks. Available in 8B, 4B, and 1.7B parameter sizes, it balances extreme compression with improved accuracy over 1-bit predecessors.
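The ternary scheme can be sketched with absmean quantization, the recipe popularized by BitNet b1.58; whether Ternary Bonsai uses exactly this scaling rule is an assumption, and the function names here are illustrative.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Quantize weights to {-1, 0, +1} with a per-tensor scale.

    Absmean scaling as in BitNet b1.58 (an assumption here;
    Ternary Bonsai's exact recipe may differ).
    """
    scale = np.abs(w).mean() + 1e-8          # per-tensor absmean scale
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes in {-1, 0, +1}
    return q.astype(np.int8), float(scale)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map ternary codes back to floating point for matmul emulation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = ternary_quantize(w)
assert set(np.unique(q)).issubset({-1, 0, 1})
```

Stored naively at 2 bits per weight the codes are ~8x smaller than 16-bit weights; packing closer to the 1.58-bit information content is what pushes the ratio toward the quoted ~9x.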
Analysis of all 154 Pythia-160m checkpoints reveals that INT4 quantization robustness diverges catastrophically late in training, with the INT4-vs-FP32 gap growing from 11% to 517%, even as FP32 perplexity plateaus, contradicting the assumption that converged models are quantization-ready. The divergence begins when FP32 perplexity stagnates, not during learning-rate decay, suggesting that flat minima in full precision don't guarantee quantization stability.
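The kind of measurement behind such a finding can be sketched with a generic round-to-nearest symmetric INT4 quantizer and a relative-degradation metric; the paper's exact quantizer settings (group size, clipping, calibration) are assumptions here.

```python
import numpy as np

def int4_rtn(w: np.ndarray) -> np.ndarray:
    """Round-to-nearest symmetric INT4 quantize-dequantize with
    per-output-channel scales. A generic RTN baseline, not necessarily
    the study's exact configuration."""
    # Symmetric INT4 codes live in [-8, 7]; scale maps the channel max to 7.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7 + 1e-12
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale

def quantization_gap(ppl_fp32: float, ppl_int4: float) -> float:
    """Relative perplexity degradation after quantization,
    e.g. 0.11 corresponds to the 11% end of the reported range."""
    return (ppl_int4 - ppl_fp32) / ppl_fp32

w = np.random.default_rng(0).normal(size=(16, 64))
w_q = int4_rtn(w)
print(f"max abs reconstruction error: {np.abs(w - w_q).max():.4f}")
```

Running this per checkpoint and plotting `quantization_gap` against training steps is the shape of the experiment: FP32 perplexity flattens while the gap keeps growing.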
Switch-KD proposes a visual-switch distillation framework that unifies vision-language knowledge transfer by resolving modality-specific supervision inconsistencies in VLM knowledge distillation. Current KD methods supervise each modality separately without explicitly enforcing multimodal alignment, leading to inconsistent knowledge transfer. The approach enables efficient VLM deployment in resource-constrained scenarios.
Guide on distilling knowledge from 100B+ parameter models into sub-4B models. Addresses practical methods for compressing frontier model capabilities into efficient local deployments.