Method predicts LLM output lengths with uncertainty quantification to improve inference scheduling efficiency, accepted at AISTATS 2026. Tackles variable-length generation bottleneck impacting production throughput and cost. Systems-level contribution for scaling serving infrastructure.
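The paper's actual predictor isn't described here, but the scheduling idea it enables can be sketched: order requests by a pessimistic (upper-quantile) predicted output length, so long generations don't block short ones. The `toy_predict` function below is a hypothetical stand-in for a learned length model.

```python
import heapq

def schedule_by_predicted_length(requests, predict_quantiles):
    """Order requests shortest-predicted-first using an upper-quantile
    length estimate. `predict_quantiles(prompt)` returns (p50, p90)
    predicted output-token counts; scheduling by p90 hedges against
    under-prediction, which causes head-of-line blocking."""
    heap = [(predict_quantiles(r)[1], i, r) for i, r in enumerate(requests)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, req = heapq.heappop(heap)
        order.append(req)
    return order

# Hypothetical predictor: pretend output length scales with prompt length.
toy_predict = lambda prompt: (len(prompt), 2 * len(prompt))
print(schedule_by_predicted_length(["summarize this long doc", "hi"], toy_predict))
# → ['hi', 'summarize this long doc']
```

The uncertainty quantification is what makes this safe to deploy: a point estimate that undershoots is worse for tail latency than a calibrated upper bound.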
Simon Willison discusses headless deployment patterns for personal AI applications. Explores infrastructure approaches for running AI systems without graphical interfaces. Practical guide for self-hosted AI setups.
Uber CTO reports budget constraints limiting AI initiatives despite $3.4B spend. Signals potential cooling in enterprise AI investment even at major tech companies.
Gemma 4's release exposed systemic reliability issues: local model runners (Ollama, LM Studio) rushed launch-day support with broken tokenizer implementations and failed tool calls. Discussion weighed trade-offs between inference tools, with benchmarks showing Ollama roughly 25% faster than LM Studio on Mac, but flagged a recurring pattern of premature launch-day releases creating production issues.
Mixed precision and floating-point settings cause ~2.4× training time variation in distributed deep learning, but existing predictors ignore precision and incur up to 147.85% MAPE. This work proposes a precision-aware predictor that accounts for mixed precision configurations to accurately forecast distributed training times for resource allocation and scheduling.
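A minimal sketch of why precision must appear in the model, not the paper's actual predictor: if per-step time is compute divided by effective throughput plus communication, and effective throughput depends on the numeric format, a precision-blind predictor is wrong by roughly the precision speedup factor. The speedup ratios below are assumed for illustration.

```python
# Assumed throughput multipliers relative to fp32 (illustrative only).
PRECISION_SPEEDUP = {"fp32": 1.0, "fp16": 2.2, "bf16": 2.1}

def predict_step_time(flops, base_flops_per_s, comm_s, precision="fp32"):
    """Per-step time = compute / effective throughput + communication.
    The effective throughput scales with the precision in use."""
    speedup = PRECISION_SPEEDUP[precision]
    return flops / (base_flops_per_s * speedup) + comm_s

fp32 = predict_step_time(1e12, 1e11, 0.5, "fp32")
fp16 = predict_step_time(1e12, 1e11, 0.5, "fp16")
print(round(fp32 / fp16, 2))  # ≈2x gap a precision-blind predictor misses
```

The fixed communication term is why the observed gap (~2.1x here) is smaller than the raw compute speedup (2.2x), matching the paper's point that the interaction is configuration-dependent rather than a constant factor.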
Probabilistic Synchronous Parallel (PSP) in federated learning assumes static, independent device behavior, causing unfair synchronization when device availability correlates with data distribution. Proposes robust synchronization methods to handle correlated device failures from mobility, power constraints, and user activity in edge deployments.
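The fairness failure is easy to see in a toy simulation (hypothetical numbers, not from the paper): when availability probability correlates with which data a device holds, some data classes are systematically underrepresented at synchronization points.

```python
import random

def participation_rates(rounds, p_participate):
    """Simulate which devices join each sync round. `p_participate[i]` is
    device i's availability probability; correlating availability with
    data class skews which data the aggregate ever sees."""
    random.seed(0)
    counts = [0] * len(p_participate)
    for _ in range(rounds):
        for i, p in enumerate(p_participate):
            if random.random() < p:
                counts[i] += 1
    return [c / rounds for c in counts]

# Hypothetical: devices holding class-A data are plugged in (p=0.9),
# class-B devices are mobile and often offline (p=0.3).
rates = participation_rates(10_000, [0.9, 0.9, 0.3, 0.3])
print([round(r, 1) for r in rates])  # class-B data is underrepresented ~3x
```

Under PSP's static-independence assumption the aggregator cannot distinguish this from benign randomness, which is the gap the proposed robust synchronization methods target.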
Prism is the first symbolic superoptimizer for tensor programs, using sGraph representation to symbolically encode operator families and execution parameters. Two-level search with symbolic pruning and e-graph verification achieves provably optimal kernels across large search spaces.
📑 arXiv, 3d ago · ★ High Signal: Scepsy is a serving system for multi-LLM agentic workflows that schedules arbitrary agent frameworks onto GPU clusters under oversubscription. It exploits the observation that while end-to-end workflow latencies are unpredictable, the relative execution time shares of each LLM remain stable across runs. Enables efficient serving of complex agentic workflows at target throughput with low latency.
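The core observation is simple to operationalize; a minimal sketch (not Scepsy's implementation) estimates each LLM's share of end-to-end time from past runs, which a scheduler could then use to size per-model GPU allocations even when absolute latencies drift.

```python
def estimate_shares(runs):
    """Average each LLM's share of end-to-end time across past runs.

    `runs` is a list of {llm_name: seconds} dicts. Absolute latencies
    vary run to run, but relative shares are assumed stable (the
    observation the system exploits)."""
    totals = {}
    for run in runs:
        run_total = sum(run.values())
        for llm, t in run.items():
            totals.setdefault(llm, []).append(t / run_total)
    return {llm: sum(s) / len(s) for llm, s in totals.items()}

# Second run is 2x slower end-to-end, yet the shares are identical.
runs = [{"planner": 6.0, "coder": 3.0, "critic": 1.0},
        {"planner": 12.0, "coder": 6.0, "critic": 2.0}]
print(estimate_shares(runs))  # → {'planner': 0.6, 'coder': 0.3, 'critic': 0.1}
```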
CoGrid is a multi-agent grid simulation library with NumPy and JAX backends, paired with Multi-User Gymnasium (MUG) that converts simulations into interactive web experiments. The tools lower barriers for researchers studying human-AI interaction by supporting arbitrary numbers of humans and AI agents in both server-authoritative and peer-to-peer modes.
Autogenesis Protocol (AGP) standardizes self-evolving agent systems by modeling prompts, agents, tools, environments, and memory as protocol-registered resources with lifecycle management and version tracking. The Resource Substrate Protocol Layer decouples what evolves from how evolution occurs, addressing brittleness in existing protocols like A2A and MCP.
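The "protocol-registered resources with lifecycle management and version tracking" idea can be illustrated with a toy registry (a sketch of the concept, not AGP's actual schema or API): every evolution step produces a new tracked version while the history of prior states is retained.

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    kind: str      # e.g. "prompt" | "agent" | "tool" | "environment" | "memory"
    name: str
    body: str
    version: int = 1
    history: list = field(default_factory=list)

class Registry:
    """Toy registry: evolution is decoupled from the resource itself;
    each change is a tracked, reversible version bump."""
    def __init__(self):
        self._resources = {}

    def register(self, kind, name, body):
        self._resources[(kind, name)] = Resource(kind, name, body)

    def evolve(self, kind, name, new_body):
        r = self._resources[(kind, name)]
        r.history.append((r.version, r.body))
        r.body, r.version = new_body, r.version + 1
        return r.version

reg = Registry()
reg.register("prompt", "planner", "You are a planner.")
print(reg.evolve("prompt", "planner", "You are a careful planner."))  # → 2
```

Separating the substrate (what is registered and versioned) from the evolution mechanism (who calls `evolve`, and why) is the decoupling the protocol layer aims at.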
Discussion analyzing whether AI agent operational costs are experiencing exponential growth similar to training costs. Examines infrastructure and inference expenses for agentic systems at scale. Raises concerns about economic sustainability of agent-based architectures.
DGX Spark owner seeks advice on configuring vLLM with PyTorch and Hugging Face models for local inference in education/analytics use case. First on-prem deployment after cloud GPU experience, asking for model recommendations and vLLM tuning tips for unified memory systems. Community discussion of practical deployment considerations.
Source-available AI gateway from 35m.ai supporting unified access to text, image, video, audio, and music generation APIs with intelligent multi-provider routing and hybrid BYOK (bring-your-own-key) workflows. Optimizes compute utilization across heterogeneous provider backends.
Google's Agent-to-Agent Protocol reached 150+ organizations and production deployments in Azure AI Foundry and Amazon Bedrock AgentCore at its one-year milestone. v1.0 added Signed Agent Cards for cryptographic identity verification between agents; combined with IBM's merged Agent Communication Protocol and the AP2 commerce extension, it now covers the full lifecycle from tool access to delegation to payments.
AI inference queries consume massive energy, GPU hardware lifecycles are 2-3 years, and sustainability costs remain hidden from users, creating major environmental challenges. QCon London talk proposes model compression, quantization, and novel architectures as solutions, arguing sustainability must be a design constraint not an afterthought.
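Quantization's sustainability argument is arithmetic: fewer bytes per weight means less memory traffic and smaller hardware per query. A minimal symmetric int8 sketch (illustrative, not the talk's method) shows the 4x storage reduction directly.

```python
import struct

def quantize_int8(weights):
    """Symmetric int8 quantization: map floats into [-127, 127] using a
    single scale, shrinking 4 bytes/weight to 1 byte/weight."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(w)
approx = dequantize(q, s)
fp32_bytes = len(w) * struct.calcsize("f")
int8_bytes = len(q)  # one byte per weight
print(fp32_bytes, int8_bytes)  # → 16 4
```

The same 4x factor compounds at data-center scale, which is why the talk treats it as a design constraint rather than a post-hoc optimization.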
NVIDIA releases Nemotron models and datasets to support systems R&D and sell GPUs, organizing 500+ people with "invitation, not control" philosophy. One of few economically coherent open model strategies: understand customer needs and drive hardware sales. Explains evolution from Megatron to modern open releases.
ddtree-mlx ports tree-based speculative decoding to Apple Silicon with custom Metal kernels, achieving 10-15% speedup over DFlash on code and 1.5x over autoregressive inference. First MLX implementation supporting hybrid model architectures.
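ddtree-mlx's tree variant and Metal kernels aren't reproduced here, but the underlying speculative-decoding acceptance rule (the simpler chain case, with greedy verification) can be sketched: accept the longest draft prefix the target model agrees with, then fall back to the target's own token.

```python
def verify_draft(draft_tokens, target_next_token):
    """Greedy chain-case speculative decoding: accept draft tokens until
    the target model disagrees, then emit the target's token instead.

    `target_next_token(prefix)` is a hypothetical stand-in for a real
    target-model forward pass."""
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(accepted)
        if tok != expected:
            return accepted + [expected]  # first disagreement wins
        accepted.append(tok)
    return accepted + [target_next_token(accepted)]

# Hypothetical target model that always continues with len(prefix).
target = lambda prefix: len(prefix)
print(verify_draft([0, 1, 5], target))  # → [0, 1, 2]: two drafts accepted
```

The tree variant generalizes this by verifying several candidate continuations in one batched target pass, which is where the custom kernels earn their speedup.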
Hardware build consolidates two RTX 6000 Ada GPUs (96GB GDDR6 each, 192GB total VRAM) into single Threadripper PRO 7965WX workstation with 256GB DDR5 ECC and dual 1600W Titanium PSUs. Targets local LLM training and inference at scale with 128 PCIe 5.0 lanes supporting x16/x16 GPU configuration. Community build documentation for high-end ML workstations.
Asynkor provides file leasing coordination for AI agent teams via MCP server, preventing merge conflicts when multiple agents edit code. Works across IDEs without changing agent implementations.
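The real Asynkor server speaks MCP; the lease semantics it coordinates can be sketched with a toy in-memory table (assumed behavior, not Asynkor's API): a lease with a TTL blocks other agents until it is released or expires.

```python
import time

class LeaseTable:
    """Toy file-lease table illustrating acquire/release/expire semantics
    for multi-agent editing (not the actual Asynkor implementation)."""
    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self._leases = {}  # path -> (agent_id, expires_at)

    def acquire(self, path, agent_id, now=None):
        now = time.monotonic() if now is None else now
        holder = self._leases.get(path)
        if holder and holder[1] > now and holder[0] != agent_id:
            return False  # another agent holds a live lease
        self._leases[path] = (agent_id, now + self.ttl)
        return True

    def release(self, path, agent_id):
        if self._leases.get(path, (None,))[0] == agent_id:
            del self._leases[path]

table = LeaseTable(ttl=30.0)
print(table.acquire("src/app.py", "agent-a", now=0.0))   # → True
print(table.acquire("src/app.py", "agent-b", now=5.0))   # → False: lease live
print(table.acquire("src/app.py", "agent-b", now=31.0))  # → True: lease expired
```

TTL-based expiry is what makes this safe without changing agent implementations: a crashed agent's lease simply lapses instead of deadlocking the team.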
Cloudflare integrates OpenAI's GPT-5.4 and Codex into Agent Cloud, enabling enterprises to build and deploy AI agents at scale. The partnership combines Cloudflare's infrastructure with OpenAI's latest models for production agentic workflows.
Interview with Sebastian Raschka covering 2026 AI architecture evolution, post-training to hybrid models, and Process Reward Models as the next frontier. Discusses his minimal AI stack (Mac mini, Codex, Ollama), fine-tuning as economic decision, and layer-by-layer verification philosophy for his upcoming book 'Build a Reasoning Model from Scratch.'
llama.cpp released critical fixes for Gemma 4's KV cache implementation that was consuming excessive VRAM, significantly reducing memory footprint. Community members successfully deployed Gemma 4 26B with 4-bit quantization on Rockchip NPU at 4W power consumption.
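Why KV cache bugs dominate VRAM is a back-of-envelope calculation: per-sequence cache size is 2 (K and V) x layers x KV heads x head dim x sequence length x bytes per element. The config numbers below are illustrative, not Gemma 4's actual architecture.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-sequence KV cache footprint: K and V tensors for every layer,
    each [kv_heads, seq_len, head_dim] at dtype_bytes per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical mid-size config: 48 layers, 8 KV heads, head dim 128,
# 8K context, fp16 cache entries.
gib = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128,
                     seq_len=8192, dtype_bytes=2) / 2**30
print(round(gib, 2))  # → 1.5 (GiB per sequence)
```

At these sizes an over-allocation bug multiplies across concurrent sequences, which is why a cache fix alone can significantly reduce total memory footprint.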
Simon Willison predicts 2026 as inflection point where LLM code quality becomes undeniable, driven by reasoning models trained with RL specifically for code. Also forecasts 2026 as year of solving code sandboxing via containers and WebAssembly, addressing security risks and prompt injection vulnerabilities from executing untrusted LLM-generated code. Critical for safe agentic workflows.