Method predicts LLM output lengths with uncertainty quantification to improve inference scheduling efficiency, accepted at AISTATS 2026. Tackles variable-length generation bottleneck impacting production throughput and cost. Systems-level contribution for scaling serving infrastructure.
Community discussion on replacing Claude Opus with Qwen-35B-A3B for coding agent workflows on M5 Max hardware. Users weighing Opus's reasoning edge against Qwen's local deployment and cost benefits for daily development tasks.
llama.cpp merged n-gram-based speculative decoding support achieving 0-50% speedup on coding tasks with optimized parameters, though performance varies by prompt repetition patterns and draft acceptance rates. The feature matches trailing n-grams against earlier context to propose draft tokens, with configurable draft token ranges.
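The n-gram matching idea can be sketched in a few lines (a minimal illustration; the function name and parameters here are ours, not llama.cpp's):

```python
def ngram_draft(tokens, n=3, max_draft=8):
    """Propose draft tokens by matching the trailing n-gram earlier in context.

    If the last n tokens appeared before, speculate that the tokens which
    followed that earlier occurrence will repeat (common in code editing,
    which is why speedup depends on prompt repetition patterns).
    """
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Search right-to-left for the most recent earlier occurrence of the tail.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + max_draft]
    return []

# Repetitive context: the draft is the continuation seen after the earlier match.
print(ngram_draft([1, 2, 3, 4, 5, 1, 2, 3], n=3))  # → [4, 5, 1, 2, 3]
```

The drafted tokens still get verified by the full model in one pass, so a bad match costs little and a good match skips several decode steps.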
Demonstration of Gemma 4 running entirely in-browser (3.1GB) to generate Excalidraw diagrams from text prompts using E2B. The implementation showcases on-device inference without server requirements. Novel for combining diagram generation with fully client-side LLM execution.
Qwen3.6-35B-A3B running at 8-bit quantization with 64k context matches Claude quality for code tasks on consumer hardware (M5 Max, 128GB). Handles complex multi-step research tasks with many tool calls and maintains performance on long context coding tasks. Enables fully local development workflows without sending code to external providers.
A locally-running world model trained for iPad interprets arbitrary photos and drawings into controllable driving gameplay. The experimental game demonstrates on-device world model inference for interactive applications, though current output quality remains imperfect.
Qwen3.6-35B-A3B achieves 79 t/s with 128K context on RTX 5070 Ti + 9800X3D by using --n-cpu-moe instead of --cpu-moe, delivering 54% speedup. Demonstrates effective MoE offloading strategy for 16GB consumer GPUs with high-cache CPUs.
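A sketch of the flag difference (model filename and layer count are illustrative placeholders, not the thread's exact command):

```shell
# --cpu-moe offloads every layer's experts to CPU; --n-cpu-moe N offloads only
# the first N layers' experts, keeping the rest of the MoE weights in VRAM.
# Tune N downward until the 16GB GPU is nearly full for best throughput.
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 24 \
  --ctx-size 131072
```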
Qwen 3.6 achieves significant performance improvements, approaching Claude Opus and Codex usefulness, when the `preserve_thinking` configuration is enabled. Runs efficiently at 8-bit quantization on M5 Max hardware, with roughly 3K tokens/s prompt processing and 100 tokens/s generation via oMLX.
Conformal prediction framework for LLMs using Layer-Wise Information (LI) scores from internal representations instead of output statistics like token probabilities. LI scores measure how conditioning on input reshapes predictive entropy across model depth, providing more robust uncertainty quantification under calibration-deployment mismatch.
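Whatever the score (token probabilities or the paper's LI scores), the split conformal calibration step is the same standard recipe; a minimal sketch:

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal calibration: return the ceil((n+1)(1-alpha))-th smallest
    nonconformity score. Test points whose score falls at or below it are
    covered with probability at least 1 - alpha, assuming exchangeability."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

# Toy scores; in the paper these would be LI scores from internal layers.
cal = [0.1 * i for i in range(1, 21)]
print(round(conformal_threshold(cal, alpha=0.1), 3))  # 1.9
```

The paper's claim is that swapping output statistics for layer-wise scores makes this threshold more robust when deployment data drifts from the calibration set.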
Unsloth's Qwen3.6-35B-A3B GGUF quantizations achieve the best KLD-to-size ratio on 21 of 22 Pareto frontier points. The team clarifies that 95% of their frequent re-uploads stem from upstream llama.cpp issues rather than their own errors, citing Gemma 4's four re-uploads as an example.
Analysis of Claude 4.7's tokenizer efficiency and associated API costs.
Proposes a novel rejection criterion for proxy-based test-time alignment based on conservative confidence betting, replacing the ill-motivated confidence criterion used in existing approaches. Shows that implicit reward and nudging methods reduce to similar graphical models differing only in rejection criteria, with the new criterion addressing issues from linguistic ambiguity.
Qwen3.6-35B-A3B represents the first local model practitioners find genuinely competitive with proprietary APIs for code generation, producing usable output for UI XML and embedded C++ with minimal post-generation fixes. This marks a capability threshold where local deployment overhead becomes worthwhile compared to previous iterations requiring extensive manual correction.
AST is a training-free speech editing framework using pre-trained autoregressive TTS models with Latent Recomposition to precisely edit speech segments while preserving speaker identity and acoustic context. Eliminates trade-offs between editing quality and consistency by selectively stitching preserved and synthesized segments without task-specific training.
STOP (Super TOken for Pruning) is the first learnable internal path pruning method for Large Reasoning Models, addressing prohibitive costs from futile reasoning paths. Outperforms existing baselines across LRMs from 1.5B to 20B parameters by systematically pruning at the prefix level using internal signals.
Post-trained language models produce less varied outputs than base models, undermining inference-time scaling methods that rely on sample diversity. Study traces output diversity through three Olmo 3 post-training lineages, finding collapse location co-varies with data composition—the Think lineage loses most semantic diversity during supervised fine-tuning.
Qwen3.6-35B-UD at 2-bit K_XL quantization achieves 98.3% tool call success rate across 58 calls while processing 2.7M tokens on 16GB VRAM. Successfully converts research papers to web applications using llama.cpp on consumer laptop hardware. Demonstrates extreme quantization can maintain performance on complex multi-step tasks.
Experience Compression Spectrum unifies agent memory, skills, and rules as points along a compression axis (5-20× for memory, 50-500× for skills, 1000×+ for rules). Framework addresses the critical bottleneck of managing accumulated experience in long-horizon, multi-session LLM agent deployments by reducing context consumption and retrieval latency.
Comparative evaluation shows Bonsai-8B at 1.125 bpw (782 MB) underperforms Gemma-4-2B at 4.8 bpw (1104 MB) despite only 29% size reduction, questioning the value proposition of extreme quantization. Ternary 1.58-bit variant performed even worse while being 33% larger than Gemma at 1477 MB. Suggests aggressive sub-2-bit quantization may sacrifice too much capability for modest size gains.
MemoSight unifies context compression with multi-token prediction to accelerate LLM reasoning without quality loss, addressing computational bottlenecks in long-context reasoning. The approach makes advanced reasoning capabilities more practical for production as context windows expand.
NVIDIA's Nemotron 3 Super is a 120B/12B-active MoE model with hybrid Mamba-Attention architecture scoring 60.47% on SWE-Bench Verified—the highest open-weight coding score at launch. Features 1M context, 2.2x throughput improvement, and native speculative decoding for efficient agentic reasoning.
Speculative decoding uses a smaller draft model to generate candidate tokens that a larger target model validates in a single pass, providing significant speedup for agentic workloads heavy on tool calls and structured outputs without quality loss. Cloudflare reports this is particularly effective for coding agents and API integration tasks where tool calling volume is high.
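The draft-then-verify loop is easy to see in a toy greedy variant (names and the stand-in "models" below are illustrative, not Cloudflare's implementation):

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of draft-then-verify speculative decoding (greedy variant).

    draft_next / target_next: callables mapping a token sequence to the next
    token. The draft proposes k tokens; the target checks them in order,
    accepts the longest agreeing prefix, and emits its own token at the first
    disagreement, so output always matches pure target decoding.
    """
    draft, seq = [], list(prefix)
    for _ in range(k):
        t = draft_next(seq)
        draft.append(t)
        seq.append(t)
    accepted, seq = [], list(prefix)
    for t in draft:
        target_t = target_next(seq)
        if target_t != t:
            accepted.append(target_t)  # target's correction ends the round
            return accepted
        accepted.append(t)
        seq.append(t)
    accepted.append(target_next(seq))  # bonus token when all drafts accepted
    return accepted

# Toy models: the draft repeats the last token, the target counts upward,
# so the draft is rejected immediately and only the target's token survives.
print(speculative_step([7, 7], lambda s: s[-1], lambda s: s[-1] + 1))  # → [8]
```

When draft and target agree (as they often do on boilerplate-heavy tool calls and structured outputs), a whole batch of tokens clears in one target pass, which is where the speedup comes from.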
Qwen 3.6 35B A3B achieves 187 tokens/sec on RTX 5090 32GB at Q5_K_S quantization with 120K context. Performance benchmark for local inference. Demonstrates practical deployment of mid-size models on consumer hardware.
Qwen 3.6 introduces a preserve_thinking flag that prevents KV cache invalidation by maintaining reasoning context across turns. This improves cache reuse in agent scenarios, reduces token consumption from redundant reasoning, and fixes a template issue that caused cache invalidation in Qwen 3.5.
Anthropic appears to be constructively terminating consumer Claude Max subscriptions through silent service degradation rather than transparent communication, likely pivoting to enterprise-only offerings. The strategy aims to salvage subscription revenue while implementing stricter limits and higher-tier pricing that will drive consumer churn.
SegWithU augments frozen pretrained segmentation models with a lightweight uncertainty head that produces voxel-wise uncertainty maps using rank-1 posterior probes in a compact feature space. Unlike existing methods requiring repeated inference, it achieves strong failure detection and calibration in a single forward pass for medical image segmentation.
SpecGuard performs step-level verification in speculative decoding using only model-internal signals (attention-based grounding scores and ensemble verification) without external reward models. Prevents erroneous reasoning steps from propagating while avoiding the latency and computational overhead of external verifiers in multi-step reasoning tasks.
VisPCO formulates visual token pruning as a Pareto optimization problem to automatically find optimal computation-performance configurations for vision-language models. Uses continuous relaxation and gradient-based search via Augmented Lagrangian to approximate the empirical Pareto frontier across 8 visual benchmarks.
Scepsy is a serving system for multi-LLM agentic workflows that schedules arbitrary agent frameworks onto GPU clusters under oversubscription. It exploits the observation that while end-to-end workflow latencies are unpredictable, the relative execution time shares of each LLM remain stable across runs. Enables efficient serving of complex agentic workflows at target throughput with low latency.
AdaSplash-2 accelerates differentiable sparse attention (α-entmax) via histogram-based initialization that reduces normalizer computation to 1-2 iterations. The method stores coarse attention score histograms in on-chip SRAM for accurate initialization, addressing the computational overhead that previously made sparse attention slower than softmax.
K-Token Merging compresses prompts in latent embedding space by merging K-token blocks via a lightweight encoder, then processing with LoRA-adapted LLMs. Operates at the embedding level rather than token space, reducing quadratic attention costs for long contexts.
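A crude stand-in for the merging step, using mean pooling instead of the paper's learned lightweight encoder (shapes only; all names here are ours):

```python
import numpy as np

def merge_k_tokens(embeddings, k=4):
    """Compress token embeddings by mean-pooling non-overlapping k-token blocks.

    A (T, d) sequence becomes (ceil(T / k), d), shrinking the sequence length
    the LLM's quadratic attention has to process by roughly a factor of k.
    """
    T, d = embeddings.shape
    pad = (-T) % k
    if pad:
        embeddings = np.vstack([embeddings, np.zeros((pad, d))])
    blocks = embeddings.reshape(-1, k, d)
    # Average any padded tail block over its real tokens only, not the zeros.
    counts = np.minimum(np.arange(1, blocks.shape[0] + 1) * k, T)
    counts = np.diff(np.concatenate([[0], counts])).reshape(-1, 1)
    return blocks.sum(axis=1) / counts.astype(float)

x = np.arange(12, dtype=float).reshape(6, 2)   # 6 tokens, embedding dim 2
print(merge_k_tokens(x, k=4).shape)            # (2, 2)
```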
Atropos optimizes cost-benefit trade-offs for LLM agents using self-consistency by predicting when to terminate cheaper Small Language Model inference early and hotswap to larger commercial models. The system analyzes structural properties of inference paths merged into graphs to decide when local SLMs suffice versus when expensive API calls are needed.
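The core cost-benefit gate can be sketched with a simple agreement check over self-consistency samples (a simplification of the paper's graph-based path analysis; names and the threshold are ours):

```python
from collections import Counter

def route_with_self_consistency(samples, agree_frac=0.7):
    """Decide whether SLM self-consistency samples suffice or to escalate.

    samples: answers drawn from the small local model. If the modal answer
    reaches agree_frac of the samples, serve it locally; otherwise signal a
    hotswap to the larger, more expensive model.
    """
    answer, votes = Counter(samples).most_common(1)[0]
    if votes / len(samples) >= agree_frac:
        return ("local", answer)
    return ("escalate", None)

print(route_with_self_consistency(["42", "42", "42", "17"]))  # → ('local', '42')
print(route_with_self_consistency(["42", "17", "9", "13"]))   # → ('escalate', None)
```

The paper's contribution is making this decision *early*, from structural properties of partial inference paths, rather than after all samples complete.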
Route to Rome Attack (R²A) exploits LLM routers by using adversarial suffix optimization to force expensive model selection, increasing costs. Uses hybrid ensemble surrogate routers to mimic black-box routing logic, demonstrating new attack surface in cost-aware inference systems.
MoE-FM uses mixture-of-experts to capture complex latent geometries (anisotropy, multimodality) in flow matching for language models. YAN non-autoregressive LM built on MoE-FM matches diffusion quality with faster inference in both Transformer and Mamba architectures.
Qwen3.6-35B-A3B running locally outperformed Claude Opus 4.7 on an SVG pelican generation task, demonstrating the narrowing capability gap between quantized open-weight models and proprietary APIs for specific visual generation benchmarks. The comparison highlights increasing viability of local inference despite not reflecting overall model capability.
Hugging Face transformers adds support for Mistral 4 (119B MoE with 128 experts unifying Instruct, Reasoning, and Devstral), Jina Embeddings v3, and multiple OCR/video models including VidEoMT, UVDoc, and PI0 robotics VLA. Includes quantization, tokenization, and caching speedups with breaking changes.
AIMO 3 competition analysis across 50 IMO problems shows model capability dominates inference-time optimization; diverse prompting strategies fail to beat high-temperature sampling on strong models. The 8-point capability gap persists across all prompt interventions; only verifier-based selection could close remaining selection loss.
TRACER trains lightweight ML surrogates on LLM production traces to route classification traffic, activating them only when agreement with the base LLM exceeds a user-specified threshold. This approach converts logged inference data into a continuously growing training set that handles routine traffic at near-zero marginal cost while deferring edge cases to the full model.
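The gating logic reduces to a small dispatch function (a hedged sketch; the callables and threshold semantics below are illustrative, not TRACER's actual API):

```python
def gated_route(x, surrogate, llm, agreement, threshold=0.95):
    """Serve with the cheap surrogate only when its estimated agreement with
    the base LLM on this kind of traffic exceeds the user threshold;
    otherwise defer to the full model. Returns (prediction, handler)."""
    if agreement(x) >= threshold:
        return surrogate(x), "surrogate"
    return llm(x), "llm"

# Toy classifiers: the surrogate is trusted on short inputs, long ones defer.
surrogate = lambda x: "spam" if "buy" in x else "ham"
llm = lambda x: "ham"
agreement = lambda x: 0.98 if len(x) < 20 else 0.80
print(gated_route("buy now", surrogate, llm, agreement))  # → ('spam', 'surrogate')
print(gated_route("a" * 30, surrogate, llm, agreement))   # → ('ham', 'llm')
```

Because every deferred request still produces an LLM label, the surrogate's training set grows for free from production traffic.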
Gemma 4 26B and E4B models outperform Qwen 3.5 series in local deployment scenarios, replacing a multi-model routing setup that previously used Qwen variants for chat, reasoning, and code generation. Users report better performance despite similar quantization levels, suggesting improved base model capabilities at comparable parameter counts.
1-bit quantized Bonsai 1.7B model runs entirely in-browser via WebGPU at 290MB. Demonstrates extreme compression enabling local LLM inference without backend servers.
Discussion analyzing whether AI agent operational costs are experiencing exponential growth similar to training costs. Examines infrastructure and inference expenses for agentic systems at scale. Raises concerns about economic sustainability of agent-based architectures.
Community appreciation for local AI deployment emphasizes freedom from censorship and data harvesting, plus the ability to fine-tune models for personal use cases with complete privacy. Credits llama.cpp developers and open-weight model contributors for enabling on-device inference. Reflects a growing preference for self-hosted solutions over cloud APIs.
Google Gemma 4 achieves full offline inference natively on iPhone hardware without cloud connectivity. Demonstrates on-device deployment capability for frontier model compression.
DGX Spark owner seeks advice on configuring vLLM with PyTorch and Hugging Face models for local inference in education/analytics use case. First on-prem deployment after cloud GPU experience, asking for model recommendations and vLLM tuning tips for unified memory systems. Community discussion of practical deployment considerations.
AI inference queries consume massive energy, GPU hardware lifecycles are 2-3 years, and sustainability costs remain hidden from users, creating major environmental challenges. QCon London talk proposes model compression, quantization, and novel architectures as solutions, arguing sustainability must be a design constraint not an afterthought.
Mistral's Voxtral is a 4B-parameter multilingual TTS model supporting 9 languages with emotionally expressive generation, low-latency streaming, and custom voice adaptation. Available via Mistral Studio and API, it targets enterprise voice agent workflows with focus on natural rhythm and cultural authenticity.
Qwen 3.6-35B-A3B generated exceptional community engagement (2,154 upvotes), with practitioners reporting significant capability leaps for local deployment and noting that the 'preserve_thinking' flag must be set manually for optimal performance. The mixture-of-experts A3B variant activates only 3B of its 35B parameters, enabling consumer hardware deployment with strong tool-calling and coding performance.
ddtree-mlx ports tree-based speculative decoding to Apple Silicon with custom Metal kernels, achieving 10-15% speedup over DFlash on code and 1.5x over autoregressive inference. First MLX implementation supporting hybrid model architectures.
An LLM-based auto-tuning system for llama.cpp that optimizes inference flags by reading --help output and iteratively testing configurations. Achieves 54% speedup on Qwen3.5-27B (40 tok/s vs 26 tok/s) and automatically adapts to new llama.cpp releases by ingesting updated help text.
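The evaluation loop at the heart of such a tuner is a simple search over flag combinations; a grid-sweep sketch (the real system has an LLM propose candidates from `--help` text rather than enumerating a grid, and `benchmark` here is a stub for a real llama-bench run):

```python
import itertools

def autotune(flag_space, benchmark):
    """Sweep flag combinations, keeping the fastest configuration.

    flag_space: dict mapping flag name -> list of candidate values.
    benchmark: callable taking a flag dict, returning tokens/sec.
    """
    best_cfg, best_tps = None, float("-inf")
    keys = sorted(flag_space)
    for combo in itertools.product(*(flag_space[k] for k in keys)):
        cfg = dict(zip(keys, combo))
        tps = benchmark(cfg)
        if tps > best_tps:
            best_cfg, best_tps = cfg, tps
    return best_cfg, best_tps

# Stub benchmark standing in for timing an actual llama.cpp run.
space = {"--threads": [4, 8], "--batch-size": [256, 512]}
bench = lambda cfg: cfg["--threads"] * 2 + cfg["--batch-size"] / 128
print(autotune(space, bench))  # fastest config and its tokens/sec
```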
Developer converted Xiaomi 12 Pro smartphone into headless 24/7 LLM inference server running Gemma4 via Ollama with LineageOS, custom thermal management, and battery protection scripts. Uses ~9GB RAM for compute after stripping Android UI, with active cooling triggered at 45°C and charging capped at 80% for longevity. Demonstrates edge deployment of open-weights models on consumer mobile hardware.
Active community discussion (129 posts) on knowledge distillation techniques for compressing 100B+ parameter models into sub-4B variants suitable for consumer hardware deployment. Represents shift from passive model consumption to creating custom distilled models optimized for edge devices, phones, and lightweight laptops. Enables preserving large model capabilities while meeting resource constraints.
KV Packet enables context-independent KV cache reuse without recomputation by wrapping cached documents in trainable soft-token adapters. Unlike CacheBlend or SAM-KV which still require selective recomputation, KV Packet treats caches as immutable packets and uses self-supervised distillation to bridge context discontinuities with zero FLOPs overhead.
Independent researcher trained a 1.088B parameter pure Spiking Neural Network for language modeling from random initialization, achieving a loss of 4.4 and 93% activation sparsity at 27k steps before running out of compute budget. This challenges conventional wisdom that billion-scale SNNs require ANN-to-SNN conversion due to vanishing gradients, demonstrating direct spike-domain training is viable. Cross-lingual emergence appeared around step 25K despite no explicit multilingual objective.
Community megathread discusses recent local LLM releases including Qwen3.5, Gemma4, GLM-5.1 claiming SOTA performance, Minimax-M2.7 as accessible alternative to Claude Sonnet, and PrismML Bonsai 1-bit models. Users share deployment configurations and real-world usage experiences with open-weight models.
Hardware build consolidates two RTX 6000 Ada GPUs (96GB GDDR6 each, 192GB total VRAM) into single Threadripper PRO 7965WX workstation with 256GB DDR5 ECC and dual 1600W Titanium PSUs. Targets local LLM training and inference at scale with 128 PCIe 5.0 lanes supporting x16/x16 GPU configuration. Community build documentation for high-end ML workstations.
Minimax M2.7 generates functional 3D GTA-style web experiences with minimal prompting, running at extreme IQ2_XXS quantization while maintaining coherence. Competes with GLM-5 on coding benchmarks for interactive 3D applications, though GLM-5 produces more aesthetically detailed outputs without explicit instruction.
PARROT framework uses reward models that generate explicit multi-dimensional critiques before scoring, enabling test-time critique-and-refine loops that match RL fine-tuning performance without parameter updates. Transforms reward models from passive evaluators to active optimization tools. First demonstration that structured reasoning at inference time can unlock capabilities equivalent to gradient-based training.
Simon Willison demonstrates running Gemma 4 audio models locally using MLX on Apple Silicon, enabling on-device audio understanding and generation.
llama.cpp released critical fixes for Gemma 4's KV cache implementation that was consuming excessive VRAM, significantly reducing memory footprint. Community members successfully deployed Gemma 4 26B with 4-bit quantization on Rockchip NPU at 4W power consumption.
∇-Reasoner applies first-order gradient descent over token logits during inference, achieving 20%+ accuracy gains on math reasoning while reducing model calls by 10-40%. Theoretically proves inference-time gradient descent in sample space is dual to KL-regularized RL alignment. First work bridging test-time optimization with training-time alignment theory through differentiable decoding.
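A toy version of inference-time gradient ascent over logits, using an expected-reward objective with its exact gradient (our simplification for illustration, not the paper's actual objective or hyperparameters):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def refine_logits(logits, reward, lr=0.5, steps=50):
    """Ascend the expected reward E_p[r] where p = softmax(logits).

    The exact gradient of E_p[r] w.r.t. logit z_i is p_i * (r_i - E_p[r]),
    so probability mass flows toward tokens scoring above the current mean.
    """
    z = logits.copy()
    for _ in range(steps):
        p = softmax(z)
        z += lr * p * (reward - p @ reward)
    return softmax(z)

reward = np.array([0.0, 1.0, 0.2])            # token 1 is the 'correct' answer
p = refine_logits(np.zeros(3), reward)
print(p.argmax())  # → 1: mass shifts toward the high-reward token
```

The duality claim is that iterating updates of this shape is equivalent to sampling from a KL-regularized RL-aligned policy, connecting decode-time optimization to training-time alignment.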
4.5-hour comprehensive state-of-AI discussion covering LLMs, geopolitics, training approaches, open vs. closed models, AGI timelines, and industry implications in 2026. Technical depth on inference-time scaling and reasoning models. Major synthesis from Raschka and Lambert on field evolution.
4.5-hour discussion with Sebastian Raschka, Nathan Lambert, and Lex Fridman covering 2026 AI landscape including inference-time scaling, RLVR, architecture evolution, open vs closed models, AGI timelines, and economic forces shaping development. Comprehensive synthesis of current industry perspectives and technical directions.
Comprehensive taxonomy of inference-time scaling approaches including recursive language models and test-time compute research. Inference-time scaling has become the most effective method for improving deployed LLM answer quality. Technical explainer for understanding modern reasoning model architectures.
Curated reading list featuring 1 paper/blog/model family per week for all of 2025, covering LLMs, reasoning models, inference-time scaling, and AI engineering. Represents canonical synthesis of 2025's key technical developments from Latent Space podcast.