🍡 feedmeAI
Inference 64 items


💬 Reddit 1d ago

I'm running qwen3.6-35b-a3b with 8 bit quant and 64k context thru OpenCode on my mbp m5 max 128gb and it's as good as claude

Qwen3.6-35B-A3B running at 8-bit quantization with 64k context matches Claude quality for code tasks on consumer hardware (M5 Max, 128GB). Handles complex multi-step research tasks with many tool calls and maintains performance on long context coding tasks. Enables fully local development workflows without sending code to external providers.

💬 Reddit 2d ago

Qwen3.6 GGUF Benchmarks

Unsloth's Qwen3.6-35B-A3B GGUF quantizations achieve the best KLD-to-size trade-off, occupying 21 of 22 points on the Pareto frontier. The team clarifies that 95% of their frequent re-uploads stem from upstream llama.cpp issues rather than their own errors, citing Gemma 4's four re-uploads as an example.
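
For readers unfamiliar with the metric: a quant sits on the size-vs-KLD Pareto frontier when no other quant is both smaller and closer (lower KL divergence) to the full-precision model. A minimal sketch of the frontier computation; the sample sizes and KLD values below are invented for illustration:

```python
def pareto_frontier(points):
    # points: (size_gb, kld) pairs; lower is better on both axes.
    # A point survives if no other point dominates it on both.
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]

# Hypothetical quants (size in GB, KLD vs full precision).
quants = [(9.1, 0.012), (12.4, 0.006), (7.3, 0.031), (12.0, 0.020)]
print(sorted(pareto_frontier(quants)))  # (12.0, 0.020) is dominated by (9.1, 0.012)
```
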

📑 arXiv 2d ago

On the Rejection Criterion for Proxy-based Test-time Alignment

Proposes a novel rejection criterion for proxy-based test-time alignment based on conservative confidence betting, replacing the ill-motivated confidence criterion used in existing approaches. Shows that implicit reward and nudging methods reduce to similar graphical models differing only in rejection criteria, with the new criterion addressing issues from linguistic ambiguity.

💬 Reddit 2d ago

Qwen 3.6 is the first local model that actually feels worth the effort for me

Qwen3.6-35B-A3B is the first local model practitioners find genuinely competitive with proprietary APIs for code generation, producing usable output for UI XML and embedded C++ with minimal post-generation fixes. Unlike previous iterations that required extensive manual correction, it marks a capability threshold at which the overhead of local deployment becomes worthwhile.

💬 Reddit 2d ago

Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B

Comparative evaluation shows Bonsai-8B at 1.125 bpw (782 MB) underperforming Gemma-4-2B at 4.8 bpw (1104 MB) despite offering only a 29% size reduction, which calls the value proposition of extreme quantization into question. The ternary 1.58-bit variant performed even worse while being 33% larger than Gemma at 1477 MB. This suggests aggressive sub-2-bit quantization may sacrifice too much capability for modest size gains.
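
The quoted percentages check out against the raw file sizes. A quick arithmetic sanity check using only the MB figures from the post:

```python
# File sizes quoted in the post, in MB.
bonsai_mb, gemma_mb, ternary_mb = 782, 1104, 1477

reduction = 1 - bonsai_mb / gemma_mb   # Bonsai-8B vs Gemma-4-2B size reduction
overhead = ternary_mb / gemma_mb - 1   # ternary 1.58-bit variant vs Gemma

print(f"Bonsai is {reduction:.1%} smaller than Gemma")   # 29.2%
print(f"Ternary is {overhead:.1%} larger than Gemma")    # 33.8%
```
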

📝 Blog 3d ago

Speculative Decoding Shines for Agentic Use Cases

Speculative decoding uses a smaller draft model to generate candidate tokens that a larger target model validates in a single pass, providing significant speedup for agentic workloads heavy on tool calls and structured outputs without quality loss. Cloudflare reports this is particularly effective for coding agents and API integration tasks where tool calling volume is high.
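
The draft-then-verify loop can be sketched in miniature. Both "models" below are stand-in functions over integer tokens, not real LLMs, and on real hardware the k verifications happen in one batched forward pass of the target model rather than sequentially:

```python
def draft_model(ctx, k):
    # Hypothetical cheap draft: proposes the next k tokens greedily.
    return [(ctx[-1] + 1 + i) % 100 for i in range(k)]

def target_model(ctx):
    # Hypothetical target: the authoritative greedy next token.
    return (ctx[-1] + 1) % 100

def speculative_step(ctx, k=4):
    proposal = draft_model(ctx, k)
    accepted = []
    for tok in proposal:  # keep the longest prefix the target agrees with
        if target_model(ctx + accepted) != tok:
            break
        accepted.append(tok)
    accepted.append(target_model(ctx + accepted))  # target always emits one token
    return accepted

print(speculative_step([7]))  # [8, 9, 10, 11, 12]: five tokens per target "pass"
```

When draft and target agree often, as in structured tool-call output, each target pass yields up to k+1 tokens instead of one, which is where the agentic speedup comes from.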

💬 Reddit 3d ago

Only LocalLLaMa can save us now.

Anthropic appears to be constructively terminating consumer Claude Max subscriptions through silent service degradation rather than transparent communication, likely pivoting to enterprise-only offerings. The strategy aims to salvage subscription revenue while implementing stricter limits and higher-tier pricing that will drive consumer churn.

📑 arXiv 3d ago

SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation

SegWithU augments frozen pretrained segmentation models with a lightweight uncertainty head that produces voxel-wise uncertainty maps using rank-1 posterior probes in a compact feature space. Unlike existing methods requiring repeated inference, it achieves strong failure detection and calibration in a single forward pass for medical image segmentation.

📑 arXiv 3d ago

From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

SpecGuard performs step-level verification in speculative decoding using only model-internal signals (attention-based grounding scores and ensemble verification) without external reward models. Prevents erroneous reasoning steps from propagating while avoiding the latency and computational overhead of external verifiers in multi-step reasoning tasks.

📑 arXiv 3d ago
★ High Signal

Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines

Scepsy is a serving system for multi-LLM agentic workflows that schedules arbitrary agent frameworks onto GPU clusters under oversubscription. It exploits the observation that while end-to-end workflow latencies are unpredictable, the relative execution time shares of each LLM remain stable across runs. Enables efficient serving of complex agentic workflows at target throughput with low latency.

📑 arXiv 3d ago

AdaSplash-2: Faster Differentiable Sparse Attention

AdaSplash-2 accelerates differentiable sparse attention (α-entmax) via histogram-based initialization that reduces normalizer computation to 1-2 iterations. The method stores coarse attention score histograms in on-chip SRAM for accurate initialization, addressing the computational overhead that previously made sparse attention slower than softmax.
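
For context, the expensive normalizer in α-entmax is a threshold τ chosen so the sparse probabilities sum to one. A plain-Python sketch for α = 1.5 using naive bisection, i.e. the slow baseline that histogram-based initialization is meant to shortcut (iteration count and bracketing are illustrative):

```python
def entmax15(z, iters=60):
    # 1.5-entmax: p_i = max(z_i / 2 - tau, 0)^2, with tau set so sum(p) = 1.
    zs = [x / 2.0 for x in z]
    lo, hi = max(zs) - 1.0, max(zs)   # bracket: sum at lo >= 1 >= sum at hi
    for _ in range(iters):            # naive bisection on the normalizer tau
        tau = (lo + hi) / 2.0
        s = sum(max(x - tau, 0.0) ** 2 for x in zs)
        if s > 1.0:
            lo = tau
        else:
            hi = tau
    return [max(x - tau, 0.0) ** 2 for x in zs]

print(entmax15([10.0, 0.0, 0.0]))  # near [1, 0, 0]: exact zeros, unlike softmax
```

The exact zeros are the point: entries below the threshold can be skipped entirely in the attention computation, which is what makes the distribution "sparse".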

📑 arXiv 3d ago

Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap

Atropos optimizes cost-benefit trade-offs for LLM agents using self-consistency by predicting when to terminate cheaper Small Language Model inference early and hotswap to larger commercial models. The system analyzes structural properties of inference paths merged into graphs to decide when local SLMs suffice versus when expensive API calls are needed.
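
Atropos's actual criterion analyzes inference paths merged into graphs, but the core early-termination idea can be sketched with a simpler vote-margin rule; everything below is an assumed simplification, not the paper's algorithm:

```python
from collections import Counter

def self_consistency(sample_fn, n=9):
    # Majority vote over up to n sampled answers; stop early once the
    # leading answer can no longer be overtaken by the remaining samples.
    votes = Counter()
    for drawn in range(1, n + 1):
        votes[sample_fn()] += 1
        ranked = votes.most_common(2)
        leader, lead = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if lead - runner_up > n - drawn:   # mathematically safe to stop
            return leader, drawn           # each skipped draw is a saved call
    return leader, n

print(self_consistency(lambda: "42"))  # ('42', 5): stops after 5 of 9 samples
```

In Atropos terms, the point at which the cheap model's votes stop converging would instead trigger a hotswap to the larger commercial model.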

✍️ Simon Willison 4d ago

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Qwen3.6-35B-A3B running locally outperformed Claude Opus 4.7 on an SVG pelican generation task, demonstrating the narrowing capability gap between quantized open-weight models and proprietary APIs for specific visual generation benchmarks. The comparison highlights increasing viability of local inference despite not reflecting overall model capability.

🤗 Hugging Face 4d ago

TRACER: Trace-Based Adaptive Cost-Efficient Routing for LLM Classification

TRACER trains lightweight ML surrogates on LLM production traces to route classification traffic, activating them only when agreement with the base LLM exceeds a user-specified threshold. This approach converts logged inference data into a continuously growing training set that handles routine traffic at near-zero marginal cost while deferring edge cases to the full model.
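
The gating logic can be sketched as follows. Class and parameter names here are assumptions, and TRACER's actual agreement estimation over production traces is more involved than a rolling window:

```python
class AgreementGatedRouter:
    """Route to a cheap surrogate only while its rolling agreement
    with the full LLM stays above a user-set threshold."""

    def __init__(self, surrogate, llm, threshold=0.95, window=100):
        self.surrogate, self.llm = surrogate, llm
        self.threshold, self.window = threshold, window
        self.history = []  # recent surrogate-vs-LLM agreement bits

    def agreement(self):
        return sum(self.history) / len(self.history) if self.history else 0.0

    def classify(self, x):
        if self.agreement() >= self.threshold:
            return self.surrogate(x)   # near-zero marginal cost path
        label = self.llm(x)            # full model; the answer doubles as a label
        self.history.append(self.surrogate(x) == label)
        self.history = self.history[-self.window:]
        return label
```

With an untrained surrogate the router defers everything to the LLM, and each deferred call produces a labeled example, which is the "continuously growing training set" the summary describes.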

💬 Reddit 4d ago

Local AI is the best

Community appreciation for local AI deployment emphasizes freedom from censorship and data harvesting, and the ability to fine-tune models for personal use cases with complete privacy. Credits llama.cpp developers and open-weight model contributors for enabling on-device inference. Reflects a growing preference for self-hosted solutions over cloud APIs.

📝 Blog 5d ago

Mistral Voxtral TTS Model

Mistral's Voxtral is a 4B-parameter multilingual TTS model supporting 9 languages with emotionally expressive generation, low-latency streaming, and custom voice adaptation. Available via Mistral Studio and API, it targets enterprise voice-agent workflows with a focus on natural rhythm and cultural authenticity.

💬 Reddit 5d ago

Qwen 3.6-35B-A3B Release Generates Major Community Buzz on r/LocalLLaMA

Qwen 3.6-35B-A3B generated exceptional community engagement (2,154 upvotes), with practitioners reporting significant capability leaps for local deployment; several note that optimal performance requires manually setting the 'preserve_thinking' flag. The mixture-of-experts A3B variant activates only 3B of its 35B parameters, enabling consumer-hardware deployment with strong tool-calling and coding performance.

🐙 GitHub 5d ago

humanrouter/ddtree-mlx: Tree-based speculative decoding for Apple Silicon (MLX). ~10-15% faster than DFlash on code, ~1.5x over autoregressive. First MLX port with custom Metal kernels for hybrid model support.

ddtree-mlx ports tree-based speculative decoding to Apple Silicon with custom Metal kernels, achieving 10-15% speedup over DFlash on code and 1.5x over autoregressive inference. First MLX implementation supporting hybrid model architectures.

💬 Reddit 5d ago

24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/Gemma4)

Developer converted Xiaomi 12 Pro smartphone into headless 24/7 LLM inference server running Gemma4 via Ollama with LineageOS, custom thermal management, and battery protection scripts. Uses ~9GB RAM for compute after stripping Android UI, with active cooling triggered at 45°C and charging capped at 80% for longevity. Demonstrates edge deployment of open-weights models on consumer mobile hardware.
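
The guard logic described (active cooling at 45 °C, charging capped at 80%) reduces to two threshold checks over sysfs readings. A hedged sketch; the sysfs paths and thermal-zone index vary by device and are assumptions here:

```python
import time

TEMP_PATH = "/sys/class/thermal/thermal_zone0/temp"         # millidegrees C (assumed zone)
CAPACITY_PATH = "/sys/class/power_supply/battery/capacity"  # percent

def decide(temp_milli_c, battery_pct):
    """Map raw readings to (trigger_cooling, stop_charging) flags."""
    return temp_milli_c / 1000.0 >= 45.0, battery_pct >= 80

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def guard_loop(start_fan, cut_charger, poll_s=30):
    # Not executed here: polls sysfs and applies the two thresholds forever.
    while True:
        cool, stop = decide(read_int(TEMP_PATH), read_int(CAPACITY_PATH))
        if cool:
            start_fan()
        if stop:
            cut_charger()
        time.sleep(poll_s)

print(decide(46_000, 50))  # (True, False): hot, but battery still below the cap
```
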

💬 Reddit 6d ago

How to Distill from 100B+ to <4B Models

Active community discussion (129 posts) on knowledge distillation techniques for compressing 100B+ parameter models into sub-4B variants suitable for consumer hardware deployment. Represents shift from passive model consumption to creating custom distilled models optimized for edge devices, phones, and lightweight laptops. Enables preserving large model capabilities while meeting resource constraints.
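
One standard starting point for such compression is Hinton-style logit distillation: match the student's temperature-softened output distribution to the teacher's. A plain-Python sketch of the loss (real pipelines batch this in torch/JAX over full vocabularies; the example logits are arbitrary):

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) at temperature T, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(kd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0: identical logits
```
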

💬 Reddit 6d ago

I scaled a pure Spiking Neural Network (SNN) to 1.088B parameters from scratch. Ran out of budget, but here is what I found [R]

Independent researcher trained a 1.088B-parameter pure Spiking Neural Network for language modeling from random initialization, reaching a loss of 4.4 and 93% activation sparsity at 27k steps before exhausting the compute budget. This challenges the conventional wisdom that billion-scale SNNs require ANN-to-SNN conversion due to vanishing gradients, demonstrating that direct spike-domain training is viable. Cross-lingual capability emerged around step 25k despite no explicit multilingual objective.
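
For readers new to SNNs: the basic unit is a leaky integrate-and-fire neuron, where membrane potential accumulates input, leaks over time, and emits a binary spike on crossing a threshold. The activation-sparsity figure counts how rarely spikes fire. A toy step with leak and threshold constants chosen purely for illustration:

```python
def lif_step(v, inp, threshold=1.0, leak=0.9):
    # Leaky integrate-and-fire: decay the membrane potential, add input,
    # spike (and reset to zero) when the threshold is crossed.
    v = leak * v + inp
    spike = 1.0 if v >= threshold else 0.0
    return v * (1.0 - spike), spike

v, spikes = 0.0, []
for _ in range(3):
    v, s = lif_step(v, 0.5)
    spikes.append(s)
print(spikes)  # [0.0, 0.0, 1.0]: a spike only after enough charge accumulates
```

The hard 0/1 spike is non-differentiable, which is why direct spike-domain training at this scale (typically via surrogate gradients) was considered difficult.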

💬 Reddit 6d ago

Best Local LLMs - Apr 2026

Community megathread discusses recent local LLM releases including Qwen3.5, Gemma4, GLM-5.1 claiming SOTA performance, Minimax-M2.7 as accessible alternative to Claude Sonnet, and PrismML Bonsai 1-bit models. Users share deployment configurations and real-world usage experiences with open-weight models.

💬 Reddit 6d ago

Local Minimax M2.7, GTA benchmark

Minimax M2.7 generates functional 3D GTA-style web experiences with minimal prompting, running at extreme IQ2_XXS quantization while maintaining coherence. Competes with GLM-5 on coding benchmarks for interactive 3D applications, though GLM-5 produces more aesthetically detailed outputs without explicit instruction.

📑 arXiv 1w ago

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

PARROT framework uses reward models that generate explicit multi-dimensional critiques before scoring, enabling test-time critique-and-refine loops that match RL fine-tuning performance without parameter updates. Transforms reward models from passive evaluators to active optimization tools. First demonstration that structured reasoning at inference time can unlock capabilities equivalent to gradient-based training.

📑 arXiv Mar 5

∇-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

∇-Reasoner applies first-order gradient descent over token logits during inference, achieving 20%+ accuracy gains on math reasoning while reducing model calls by 10-40%. Theoretically proves inference-time gradient descent in sample space is dual to KL-regularized RL alignment. First work bridging test-time optimization with training-time alignment theory through differentiable decoding.
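
The core move, gradient descent over logits at inference time, can be seen in miniature: for cross-entropy against a target distribution, the gradient with respect to the logits is softmax(z) − target, so refinement is a closed-form update loop. This is a toy on a single logit vector, not the paper's latent-space method:

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def refine_logits(z, target, lr=0.5, steps=200):
    # Gradient descent on CE(target, softmax(z)); grad wrt z is softmax(z) - target.
    for _ in range(steps):
        p = softmax(z)
        z = [zi - lr * (pi - ti) for zi, pi, ti in zip(z, p, target)]
    return z

p = softmax(refine_logits([0.0, 0.0, 0.0], target=[0.8, 0.1, 0.1]))
print([round(x, 3) for x in p])  # converges toward [0.8, 0.1, 0.1]
```
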