🍡 feedmeAI
Deployment 19 items

Everything Deployment

🟧 Hacker News 2d ago

Gemma 4 Release Triggers Debate About Tool Calling Implementation Issues

The Gemma 4 release exposed systemic reliability issues: local model runners (Ollama, LM Studio) rushed launch-day support, shipping broken tokenizer implementations and failing tool calls. Discussion highlighted trade-offs between inference tools, with benchmarks showing Ollama 25% faster than LM Studio on Mac, and a recurring pattern of premature launch-day releases creating production issues.

💬 Reddit 3d ago

Only LocalLLaMa can save us now.

Anthropic appears to be constructively terminating consumer Claude Max subscriptions through silent service degradation rather than transparent communication, likely pivoting to enterprise-only offerings. The strategy aims to salvage subscription revenue while implementing stricter limits and higher-tier pricing that will drive consumer churn.

🤗 Hugging Face 4d ago

TRACER: Trace-Based Adaptive Cost-Efficient Routing for LLM Classification

TRACER trains lightweight ML surrogates on LLM production traces to route classification traffic, activating them only when agreement with the base LLM exceeds a user-specified threshold. This approach converts logged inference data into a continuously growing training set that handles routine traffic at near-zero marginal cost while deferring edge cases to the full model.
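
The routing loop described above can be sketched roughly as follows; the class and method names are illustrative, not TRACER's actual API, and the agreement statistic here is a simple global running average rather than whatever trace model the paper uses:

```python
class TraceRouter:
    """Hypothetical sketch of trace-based routing: a cheap surrogate serves
    classification traffic only once its measured agreement with the base
    LLM on logged traces clears a user-specified threshold."""

    def __init__(self, surrogate, llm, threshold=0.95):
        self.surrogate = surrogate  # cheap model: text -> label
        self.llm = llm              # expensive model: text -> label
        self.threshold = threshold
        self.agree = 0
        self.total = 0

    @property
    def agreement(self):
        return self.agree / self.total if self.total else 0.0

    def classify(self, text):
        """Route to the surrogate once agreement clears the threshold;
        otherwise defer to the full model and grow the trace set."""
        if self.total and self.agreement >= self.threshold:
            return self.surrogate(text)          # near-zero marginal cost
        label = self.llm(text)                   # defer to the full model
        self.total += 1                          # each deferred call is a new trace
        self.agree += int(self.surrogate(text) == label)
        return label
```

Every deferred call doubles as a labeled trace, which is how the logged inference data becomes a continuously growing training set.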

📝 Blog 5d ago

Latent Space: Notion Custom Agents - Building Production AI

Notion rebuilt Custom Agents 4-5 times before production launch; early attempts failed for lack of tool-calling standards, short context windows, and unreliable models. Their "Agent Lab" thesis: time the roadmap carefully so you build early without swimming upstream against model limitations, designing product systems around frontier capabilities, with coding agents as the kernel of future "software factories" of spec/code/test/review agents. Practical lessons on when to ship agent features based on foundation-model maturity.

💬 Reddit 5d ago

24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/Gemma4)

Developer converted Xiaomi 12 Pro smartphone into headless 24/7 LLM inference server running Gemma4 via Ollama with LineageOS, custom thermal management, and battery protection scripts. Uses ~9GB RAM for compute after stripping Android UI, with active cooling triggered at 45°C and charging capped at 80% for longevity. Demonstrates edge deployment of open-weights models on consumer mobile hardware.
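
The thermal and battery thresholds from the post can be sketched as a small sysfs polling script. This is a guess at the approach, not the poster's actual scripts, and the sysfs paths are illustrative: actual node names vary by kernel and ROM.

```python
import time

# Device-specific guesses; real sysfs nodes differ across kernels/ROMs.
THERMAL_ZONE = "/sys/class/thermal/thermal_zone0/temp"
BATTERY_CAP = "/sys/class/power_supply/battery/capacity"

def needs_cooling(millidegrees, trigger_c=45.0):
    """Linux thermal zones report temperature in millidegrees Celsius."""
    return millidegrees / 1000.0 >= trigger_c

def should_stop_charging(capacity_pct, cap_pct=80):
    """Capping charge at ~80% reduces long-term battery wear."""
    return capacity_pct >= cap_pct

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def monitor_loop(interval_s=30):
    """Poll sysfs and react; actuation (fan, charge switch) left abstract."""
    while True:
        if needs_cooling(read_int(THERMAL_ZONE)):
            print("temp >= 45C: enable active cooling / throttle inference")
        if should_stop_charging(read_int(BATTERY_CAP)):
            print("battery >= 80%: cut charging")
        time.sleep(interval_s)
```

On a rooted LineageOS device a loop like this could run under a init script or cron, with actuation wired to whatever charge-control node the kernel exposes.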

💬 Reddit 6d ago

How to Distill from 100B+ to <4B Models

Active community discussion (129 posts) on knowledge-distillation techniques for compressing 100B+ parameter models into sub-4B variants that fit consumer hardware. Marks a shift from passive model consumption to building custom distilled models for edge devices, phones, and lightweight laptops, preserving large-model capabilities within tight resource constraints.
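
The core of most distillation recipes in such threads is a temperature-softened KL loss between teacher and student outputs. A minimal pure-Python sketch of that loss (Hinton-style soft labels, not any specific poster's recipe):

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)     # teacher soft targets
    q = softmax(student_logits, temperature)     # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return temperature ** 2 * kl
```

In practice this term is averaged over tokens and usually mixed with a standard cross-entropy loss on ground-truth labels; higher temperatures expose more of the large model's "dark knowledge" about near-miss classes.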