🍡 feedmeAI
Inference 7 items

Everything Inference

💬 Reddit Apr 22

Qwen3 TTS is seriously underrated - I got it running locally in real-time and it's one of the most expressive open TTS models I've tried

A Reddit user reports running Qwen3 TTS locally in real time with notably expressive output, and has integrated it into the open-source Persona Engine project (an ASR→LLM→TTS pipeline with a lip-synced avatar). The author positions it as a meaningful step up from prior local TTS options such as Sesame for latency-sensitive, fully offline deployments.

📑 arXiv Apr 22

Supplement Generation Training for Enhancing Agentic Task Performance

Supplement Generation Training (SGT) trains a small LLM to produce task-specific supplemental text prepended to the input of a larger frozen LLM, improving downstream task performance without modifying the large model. This decouples task-specific adaptation from expensive full model retraining, making it practical to update only the lightweight supplement generator as base models evolve. The approach is framed as an alternative to repeated post-training of frontier models for agentic tasks.
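The inference-time data flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: both models are stood in for by plain text-in, text-out callables, the names `answer_with_supplement`, `toy_supplement_generator`, and `toy_frozen_llm` are hypothetical, and the training of the small model is not shown.

```python
from typing import Callable

# Assumption: each model is represented as a text-in, text-out callable.
TextModel = Callable[[str], str]


def answer_with_supplement(
    task_input: str,
    supplement_generator: TextModel,
    frozen_llm: TextModel,
) -> str:
    """Generate a supplement with the small model, prepend it, query the large model."""
    supplement = supplement_generator(task_input)
    # The supplement is prepended to the input; the large model itself is never modified.
    augmented_input = f"{supplement}\n\n{task_input}"
    return frozen_llm(augmented_input)


# Hypothetical stand-ins so the sketch runs without any model weights.
def toy_supplement_generator(task_input: str) -> str:
    return "Hint: list the required tool calls before answering."


def toy_frozen_llm(prompt: str) -> str:
    return f"[frozen model received {len(prompt)} characters]"


if __name__ == "__main__":
    print(answer_with_supplement("Book a flight to Oslo.", toy_supplement_generator, toy_frozen_llm))
```

Under this framing, swapping in a new base model only changes the `frozen_llm` argument, while the supplement generator is the only component that would need retraining.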

🟢 OpenAI Apr 22

Speeding up agentic workflows with WebSockets in the Responses API

An OpenAI engineering post details how the Codex agent loop uses WebSockets in the Responses API to reduce per-request connection overhead, and how it uses connection-scoped caching to cut model latency in multi-turn agentic workflows. The post quantifies the improvements but frames them around the specific Codex loop design. Practical reference for anyone building low-latency agents on top of the Responses API.
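To see why a persistent connection matters more as the number of turns grows, here is a rough additive latency model. It is a sketch under stated assumptions, not a measurement: the function name `total_latency_ms` is hypothetical, connection setup is treated as a fixed cost, and the numbers in the example are placeholders rather than figures from the post. Caching effects are not modeled.

```python
def total_latency_ms(
    turns: int,
    connect_ms: float,
    request_ms: float,
    reuse_connection: bool,
) -> float:
    """Total time for `turns` sequential requests under a simple additive model."""
    if turns < 0:
        raise ValueError("turns must be non-negative")
    # With a persistent connection the setup cost is paid once; otherwise once per turn.
    setups = min(turns, 1) if reuse_connection else turns
    return setups * connect_ms + turns * request_ms


if __name__ == "__main__":
    # Placeholder values chosen only to illustrate the shape of the saving.
    turns, connect_ms, request_ms = 20, 100.0, 500.0
    per_request = total_latency_ms(turns, connect_ms, request_ms, reuse_connection=False)
    persistent = total_latency_ms(turns, connect_ms, request_ms, reuse_connection=True)
    print(f"saved: {per_request - persistent:.0f} ms over {turns} turns")
```

In this model the saving is `(turns - 1) * connect_ms`, so it scales linearly with the length of the agent loop.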

📝 Blog Apr 17

Practitioner post: Qwen3.6-35B-A3B MoE outperforms Claude Opus 4.7 locally on MacBook Pro at 20.9 GB quantized

Alibaba's Qwen3.6-35B-A3B MoE (35B total, 3B active parameters) reportedly matches or beats Claude Opus 4.7 on local tasks while fitting in 20.9 GB of RAM when quantized on a MacBook Pro. If the benchmark methodology holds, this is a notable MoE-for-edge result: frontier-tier quality within consumer-RAM constraints. This is a practitioner claim; independent verification of the benchmark methodology is still needed.
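As a back-of-the-envelope check on the reported footprint, the average storage per weight can be estimated from the two numbers given above. This sketch assumes decimal gigabytes and that the whole 20.9 GB is model weights (no KV cache or runtime overhead), neither of which the post confirms; `bits_per_parameter` is a hypothetical helper name.

```python
def bits_per_parameter(footprint_gb: float, total_params_billions: float) -> float:
    """Average bits stored per parameter, assuming the footprint is weights only."""
    footprint_bits = footprint_gb * 1e9 * 8  # decimal GB -> bytes -> bits
    total_params = total_params_billions * 1e9
    return footprint_bits / total_params


if __name__ == "__main__":
    # Total parameters are used, not active ones: the footprint covers all stored weights.
    print(f"{bits_per_parameter(20.9, 35):.2f} bits per parameter")
```

Under those assumptions the result is roughly 4.8 bits per parameter, which is in the range usually associated with 4-bit to 5-bit quantization schemes.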

💬 Reddit Apr 16
⭐ Editor's Pick

Opus 4.7 is 50% more expensive with context regression?!

User benchmarks show Claude Opus 4.7 scoring 59.2% on the MRCR v2 8-needle 256K context benchmark, versus 91.9% for Opus 4.6, a sharp regression in long-context retention. Compounding the issue, a tokenizer change reportedly causes Opus 4.7 to consume ~1.35x as many tokens as Opus 4.6 and ~2x as many as competing proprietary models, which the post says raises costs by ~50% for equivalent workloads. If the benchmark numbers hold, the quality-cost tradeoff is moving in the wrong direction.
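The relationship between the token figure and the cost figure above can be checked with simple arithmetic. This is a sketch assuming cost is tokens times a per-token price; the summary does not state how the ~50% figure was derived, and the helper names `cost_multiplier` and `required_price_multiplier` are hypothetical.

```python
def cost_multiplier(token_multiplier: float, price_multiplier: float = 1.0) -> float:
    """Relative cost of the same workload when token count and per-token price both scale."""
    return token_multiplier * price_multiplier


def required_price_multiplier(target_cost_multiplier: float, token_multiplier: float) -> float:
    """Per-token price change that, combined with the token change, yields the target cost."""
    return target_cost_multiplier / token_multiplier


if __name__ == "__main__":
    # Token inflation alone, with the per-token price held constant.
    print(f"cost increase at constant price: {cost_multiplier(1.35) - 1:.1%}")
    # Extra per-token factor needed to reach the reported ~50% increase.
    print(f"additional factor for +50%: {required_price_multiplier(1.5, 1.35):.3f}x")
```

At a constant per-token price, 1.35x as many tokens accounts for about a 35% increase, so the ~50% figure implies some additional factor of roughly 1.11x that the summary does not identify.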

📝 Blog Mar 16

What Comes Next with Open Models

Lambert argues the open-closed performance gap will widen in 2026 because closed models are accumulating advantages on long-horizon, domain-specific tasks with non-public training data. Proposes a three-class taxonomy: true closed frontier, open frontier, and small specialized open models. Predicts the highest-impact open models will be narrow, fast, cheap sub-agents used as tools inside closed-model pipelines.