Qwen 3.6 shows significant performance gains, approaching Claude Opus and Codex in usefulness, when its `preserve_thinking` configuration is enabled. It runs efficiently at 8-bit quantization on M5 Max hardware, reaching roughly 3K tok/s prompt processing and 100 tok/s generation via oMLX.
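The exact semantics of `preserve_thinking` aren't documented here; a minimal hypothetical sketch of what such a setting typically controls, namely whether `<think>` blocks are kept in the history replayed to the model on later turns (the tag, field names, and behavior are assumptions, not oMLX's actual API):

```python
# Hypothetical illustration only: preserve_thinking is assumed to control
# whether <think>...</think> blocks are kept in the history replayed to the
# model on later turns. Tag and field names are assumptions, not oMLX's API.
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def build_history(turns: list[dict], preserve_thinking: bool) -> list[dict]:
    """Return the chat history, optionally stripping reasoning blocks."""
    if preserve_thinking:
        return turns  # earlier reasoning stays visible to later turns
    return [
        {**t, "content": THINK_RE.sub("", t["content"])}
        if t["role"] == "assistant" else t
        for t in turns
    ]

turns = [
    {"role": "user", "content": "Refactor this function."},
    {"role": "assistant",
     "content": "<think>Plan: extract a helper first.</think>Done, see diff."},
]
print(build_history(turns, preserve_thinking=False)[1]["content"])  # Done, see diff.
```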
Unsloth's Qwen3.6-35B-A3B GGUF quantizations hold 21 of 22 points on the KLD-versus-size Pareto frontier, i.e., the best fidelity-for-size trade-off at almost every file size. The team clarifies that roughly 95% of their frequent re-uploads stem from upstream llama.cpp issues rather than their own errors, citing Gemma 4's four re-uploads as an example.
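For concreteness, a minimal sketch of how Pareto-frontier membership is decided, assuming each quant is scored as a (file size, KLD) pair where lower is better on both axes; names and numbers are illustrative, not Unsloth's actual data:

```python
# A quant is on the Pareto frontier if no other quant is at least as good on
# both axes and strictly better on one. Sample values are illustrative only.
def pareto_frontier(quants: dict[str, tuple[float, float]]) -> set[str]:
    frontier = set()
    for name, (size, kld) in quants.items():
        dominated = any(
            (s <= size and k <= kld) and (s < size or k < kld)
            for other, (s, k) in quants.items() if other != name
        )
        if not dominated:
            frontier.add(name)
    return frontier

quants = {"IQ2_XXS": (9.2, 0.21), "Q4_K_M": (20.0, 0.015), "Q8_0": (37.0, 0.001)}
print(pareto_frontier(quants))  # all three survive: each trades size for KLD
```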
Comparative evaluation shows Bonsai-8B at 1.125 bpw (782 MB) underperforms Gemma-4-2B at 4.8 bpw (1104 MB) despite being only 29% smaller, questioning the value proposition of extreme quantization. The ternary 1.58-bit variant performed even worse while being 33% larger than Gemma at 1477 MB. This suggests aggressive sub-2-bit quantization may sacrifice too much capability for modest size gains.
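The back-of-envelope arithmetic behind bpw and MB figures like these, as a naive sketch; real GGUF sizes deviate somewhat because embeddings and output tensors often stay at higher precision and metadata adds overhead:

```python
# Naive bpw arithmetic: a quantized file is roughly params * bpw / 8 bytes.
# Real GGUF sizes deviate (higher-precision embeddings, metadata overhead).
def file_size_mb(params_billions: float, bpw: float) -> float:
    """Naive file size in MB (1 MB = 1e6 bytes) at a given bits-per-weight."""
    return params_billions * 1e9 * bpw / 8 / 1e6

def effective_bpw(size_mb: float, params_billions: float) -> float:
    """Recover effective bits-per-weight from an actual file size."""
    return size_mb * 1e6 * 8 / (params_billions * 1e9)

print(f"{file_size_mb(2, 4.8):.0f} MB")  # 1200 MB naive, vs the quoted 1104 MB
```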
Ternary Bonsai uses 1.58-bit weights in {-1, 0, +1} (log2 3 ≈ 1.58 bits encodes three states) to achieve a 9x smaller memory footprint than 16-bit models while outperforming peers on standard benchmarks. Available in 8B, 4B, and 1.7B parameter sizes, it balances extreme compression with improved accuracy over 1-bit predecessors.
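A minimal sketch of ternary quantization in the BitNet b1.58 style (absmean scaling, then rounding each weight to the nearest of three values); whether Bonsai uses exactly this scheme is an assumption:

```python
# BitNet b1.58-style ternary quantization: scale by the mean absolute value,
# then snap every weight to {-1, 0, +1}. Whether Bonsai uses exactly this
# absmean scheme is an assumption.
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Return ternary weights in {-1, 0, +1} plus a per-tensor scale."""
    scale = np.abs(w).mean() + eps           # absmean scaling factor
    q = np.clip(np.round(w / scale), -1, 1)  # round to nearest of {-1, 0, 1}
    return q.astype(np.int8), scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = ternary_quantize(w)
w_hat = q * scale                            # dequantized approximation
print(q, f"reconstruction MSE: {((w - w_hat) ** 2).mean():.4f}")
```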
Analysis of all 154 Pythia-160m checkpoints reveals INT4 quantization robustness diverges catastrophically late in training, with the quantized-vs-FP32 perplexity gap growing from 11% to 517% even as FP32 perplexity plateaus, contradicting the assumption that converged models are quantization-ready. Divergence begins when FP32 perplexity stagnates, not during learning-rate decay, suggesting flat minima in full precision don't guarantee quantization stability.
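A minimal sketch of the checkpoint-sweep methodology, assuming each published Pythia revision is paired with a 4-bit copy and the relative perplexity gap is measured; the study's actual quantizer and eval corpus are unknown, and NF4 via bitsandbytes stands in for INT4 here:

```python
# Sweep Pythia training checkpoints (published as git revisions) and measure
# the quantized-vs-FP32 perplexity gap. NF4 via bitsandbytes stands in for
# INT4; the study's actual quantizer and corpus are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "EleutherAI/pythia-160m"

def perplexity(model, tok, text: str) -> float:
    ids = tok(text, return_tensors="pt", truncation=True,
              max_length=1024).input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token NLL
    return torch.exp(loss).item()

def gap_percent(step: int, text: str) -> float:
    rev = f"step{step}"
    tok = AutoTokenizer.from_pretrained(MODEL, revision=rev)
    fp32 = AutoModelForCausalLM.from_pretrained(MODEL, revision=rev)
    nf4 = AutoModelForCausalLM.from_pretrained(
        MODEL, revision=rev, device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    )
    return 100 * (perplexity(nf4, tok, text) / perplexity(fp32, tok, text) - 1)

text = open("eval.txt").read()  # any held-out text sample (placeholder path)
for step in (1000, 71000, 143000):  # early, mid, late training
    print(step, f"{gap_percent(step, text):+.1f}%")
```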
The 1-bit quantized Bonsai 1.7B model runs entirely in-browser via WebGPU at 290 MB, demonstrating extreme compression that enables local LLM inference without a backend server.
A KLD evaluation framework for Qwen3.5-9B GGUF quantizations measures probability-distribution drift from the BF16 baseline rather than perplexity, providing data-driven quant selection by measuring faithfulness to the original weights independently of dataset artifacts.
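A minimal sketch of the metric itself, assuming per-token logits from the baseline and a quantized model over the same text (array names are illustrative; llama.cpp's perplexity tool exposes a comparable measurement):

```python
# Mean per-token KL(P_base || P_quant) from raw logits; shapes are
# (n_tokens, vocab). Synthetic arrays stand in for real model outputs.
import numpy as np

def mean_token_kld(base_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)         # numerical stability
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    log_p, log_q = log_softmax(base_logits), log_softmax(quant_logits)
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 32000))
quant = base + rng.normal(scale=0.1, size=base.shape)  # mimic quant drift
print(mean_token_kld(base, quant))
```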
Minimax M2.7 generates functional 3D GTA-style web experiences from minimal prompting while running at extreme IQ2_XXS quantization without losing coherence. It competes with GLM-5 on coding benchmarks for interactive 3D applications, though GLM-5 produces more aesthetically detailed output without explicit instruction.
llama.cpp released critical fixes for Gemma 4's KV cache implementation, which was consuming excessive VRAM, significantly reducing its memory footprint. Separately, community members deployed Gemma 4 26B at 4-bit quantization on a Rockchip NPU at 4 W power consumption.
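Back-of-envelope KV cache sizing shows why a cache bug can dominate VRAM; the hyperparameters below are placeholders for a 26B-class model, not Gemma 4's published values:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context
# * bytes_per_element. Hyperparameters are placeholders, not Gemma 4's.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_elem: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

# e.g. 48 layers, 8 KV heads (GQA), head_dim 128, 32K context, FP16 cache:
gib = kv_cache_bytes(48, 8, 128, 32_768) / 2**30
print(f"{gib:.1f} GiB")  # ~6.0 GiB; caching all attention heads instead of
                         # the 8 GQA heads would multiply this several-fold
```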