Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

Multi-agent LLM systems spontaneously develop power law distributions in knowledge and influence, mirroring human intellectual hierarchies. Agent societies exhibit emergent specialization and social stratification. First empirical evidence of collective social dynamics beyond individual agent capabilities.

Agents Emergent-behavior Multi-agent

Sunday, April 19

🟢 OpenAI 1d ago

OpenAI Codex Major Update

OpenAI Codex expanded beyond coding to include computer use, web workflows, image generation, memory, and automations. The updated developer app adds PR reviews, multi-file/terminal viewing, SSH devbox connections, and in-app browsing, serving 3+ million developers weekly.

Code Gen Agents Multimodal Tooling

💬 Reddit 13h ago

llama.cpp speculative checkpointing was merged

llama.cpp merged speculative checkpointing support achieving 0-50% speedup on coding tasks with optimized parameters, though performance varies by prompt repetition patterns and draft acceptance rates. The feature uses n-gram matching for speculative decoding with configurable draft token ranges.

Inference Tooling Code Gen Performance

📑 arXiv 1d ago

OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

OptiMer demonstrates that merging distribution vectors during continual pre-training outperforms traditional data mixing when adapting foundation models. The approach enables more efficient domain adaptation without full retraining, challenging conventional strategies for combining diverse data distributions in continual learning.

Training Models Fine-tuning

💬 Reddit 1d ago

I'm running qwen3.6-35b-a3b with 8 bit quant and 64k context thru OpenCode on my mbp m5 max 128gb and it's as good as claude

Qwen3.6-35B-A3B running at 8-bit quantization with 64k context matches Claude quality for code tasks on consumer hardware (M5 Max, 128GB). Handles complex multi-step research tasks with many tool calls and maintains performance on long context coding tasks. Enables fully local development workflows without sending code to external providers.

Models Code Gen Inference Open Weights

🟧 Hacker News 14h ago

Show HN: Prompt-to-Excalidraw demo with Gemma 4 E2B in the browser (3.1GB)

Demonstration of Gemma 4 running entirely in-browser (3.1GB) to generate Excalidraw diagrams from text prompts using E2B. The implementation showcases on-device inference without server requirements. Novel for combining diagram generation with fully client-side LLM execution.

Inference Code Gen Browser-inference

💬 Reddit 8h ago

Switching from Opus 4.7 to Qwen-35B-A3B

Community discussion on replacing Claude Opus with Qwen-35B-A3B for coding agent workflows on M5 Max hardware. Users weighing Opus's reasoning edge against Qwen's local deployment and cost benefits for daily development tasks.

Code Gen Agents Open Weights Inference

🟧 Hacker News 7h ago

Uber's AI Push Hits a Wall–CTO Says Budget Struggles Despite $3.4B Spend

Uber CTO reports budget constraints limiting AI initiatives despite $3.4B spend. Signals potential cooling in enterprise AI investment even at major tech companies.

Infrastructure Deployment

✍️ Simon Willison 4h ago

Headless everything for personal AI

Simon Willison discusses headless deployment patterns for personal AI applications. Explores infrastructure approaches for running AI systems without graphical interfaces. Practical guide for self-hosted AI setups.

Deployment Infrastructure Self-hosting

🟧 Hacker News 4h ago

CEOs admit AI had no impact on employment or productivity

Survey of CEOs indicates AI deployments have not yet delivered measurable impact on employment levels or productivity metrics. Challenges prevailing narratives about AI's immediate transformative business effects. Data point for enterprise AI adoption reality versus expectations.

A locally-running world model trained for iPad interprets arbitrary photos and drawings into controllable driving gameplay. The experimental game demonstrates on-device world model inference for interactive applications, though current output quality remains imperfect.

Inference Models Multimodal

💬 Reddit 1d ago

KIMI K2.6 SOON !!

Reddit post announces upcoming release of Kimi K2.6 model with no additional details provided.

Models

UniEditBench provides the first unified benchmark for image and video editing across reconstruction-based and instruction-driven methods, with taxonomies covering 9 image and 8 video operations. Uses distilled MLLMs as cost-effective automatic evaluators that align with human preference, addressing fragmentation in visual editing evaluation.

Systematic review of 13 papers finds no existing work applies Masked Autoencoder Foundation Models to predict downhole oil/gas drilling metrics from surface sensor time-series, despite MAEFMs' proven effectiveness in time-series modeling. Current approaches rely on ANNs and LSTMs but struggle with scarce labeled downhole measurements.

Time-series Foundation-models Domain-specific

📑 arXiv 3d ago

MinShap: A Modified Shapley Value Approach for Feature Selection

MinShap modifies Shapley values from cooperative game theory to focus on direct feature effects rather than indirect dependencies, making them suitable for feature selection in non-linear models. The approach adapts attribution methods to the distinct requirements of variable selection with dependent features.

Claude Code discovered a 23-year-old remotely exploitable heap buffer overflow in Linux kernel's NFS driver, with five vulnerabilities confirmed. Linux maintainers report AI bug reports shifted from "slop to legitimate findings" about a month ago, with valid security reports increasing from 2-3/week to 5-10/day—marking a capability inflection point for AI-assisted vulnerability discovery.

Code Gen Safety Agents

🔶 Anthropic 5d ago

Anthropic Claude Code Desktop App Redesign

Anthropic redesigned Claude Code desktop app with parallel session management sidebar, integrated terminal, in-app file editor, and Routines—automation running on schedules, API calls, or GitHub events without active sessions. Available for Pro, Max, Team, and Enterprise users on macOS and Windows.

Code Gen Agents Tooling

📝 Blog 5d ago

Latent Space: Notion Custom Agents - Building Production AI

Notion rebuilt Custom Agents 4-5 times before production launch due to early failures from lack of tool-calling standards, short context, and unreliable models. "Agent Lab" thesis: time roadmap carefully to avoid swimming upstream against model limitations while building early enough. Practical lessons on when to ship agent features based on foundation model maturity.

Agents Deployment Tooling

📑 arXiv 5d ago

SkillClaw: LLM Agent Skills Evolve Through Collective Cross-User Experiences

SkillClaw enables LLM agent skills to evolve autonomously by aggregating interaction experiences across users, with an 'agentic evolver' that refines capabilities from real-world usage. Achieves +42.1% improvement by shifting from static, manually-engineered skills to continuously improving ones learned from collective deployment data.

Agents Continual-learning Skill-evolution

🧠 DeepMind 5d ago

Google DeepMind Gemini Robotics-ER 1.6 for Physical AI

Gemini Robotics-ER 1.6 specialized reasoning model for physical AI achieves 93% success on instrument reading tasks (up from 23% baseline) through agentic vision combining visual reasoning with code execution. It adds spatial reasoning, multi-view perception, and industrial gauge interpretation as a high-level planning layer for vision-language-action robotics models.

Multimodal Reasoning Robotics Code Gen

📝 Blog 5d ago

Latent Space: Notion's Journey Building Custom AI Agents

Notion rebuilt Custom Agents 4-5 times before production, revealing early agent attempts failed due to lack of tool-calling standards and short context windows. Their 'Agent Lab' thesis focuses on building product systems around frontier capabilities, with coding agents viewed as the kernel of future 'software factories' comprising spec/code/test/review agents.

Agents Code Gen Tooling Deployment

💬 Reddit 5d ago

Qwen 3.6-35B-A3B Release Generates Major Community Buzz on r/LocalLLaMA

Qwen 3.6-35B-A3B generated exceptional community engagement (2,154 upvotes) with practitioners reporting significant capability leaps for local deployment, particularly requiring manual 'preserve_thinking' flag for optimal performance. The mixture-of-experts A3B variant activates only 3B of 35B parameters, enabling consumer hardware deployment with strong tool calling and coding performance.

Open Weights Models Moe Inference

🤗 Hugging Face 5d ago

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

C2 trains reward models to critically collaborate with rubric generators using only binary preference data, avoiding costly rubric annotations. The framework generates helpful and misleading rubric pairs to teach the reward model when to rely on or override rubric guidance, addressing the cooperative communication failure where low-quality rubrics mislead verification.

Training Evaluation Reasoning

🤗 Hugging Face 5d ago

Reinforcement Learning via Value Gradient Flow

Value Gradient Flow (VGF) frames behavior-regularized RL as an optimal transport problem mapping reference distributions to value-optimal policies, offering a scalable alternative to reparameterized policy gradients and reject sampling. The approach addresses value over-optimization in offline RL and LLM fine-tuning while scaling to large generative models.

Training Fine-tuning Reinforcement-learning Optimization

🧠 DeepMind 5d ago

Google Gemini Robotics-ER 1.6 Release

Google DeepMind released Gemini Robotics-ER 1.6, a robotics reasoning model with improved spatial reasoning, multi-view perception, instrument reading, and hazard detection (+6% text, +10% video safety). Available via Gemini API with Boston Dynamics deploying it for autonomous Spot robot operations.

Multimodal Reasoning Robotics Safety

📝 Blog 5d ago

My bets on open models, mid-2026

Nathan Lambert predicts top closed models show no growing capability margin over open models, but retain robustness advantages for general use. Economic staying power becomes the key competitive dimension, with open models dominating repetitive automation and new funding structures emerging by mid-2026.

Models Open Weights Economics

📝 Blog 5d ago

AI Weekly: Agent-to-Agent Protocol Hits 1-Year Anniversary with 150+ Organizations

Google's Agent-to-Agent Protocol reached 150+ organizations and production deployments in Azure AI Foundry and Amazon Bedrock AgentCore at 1-year milestone. v1.0 added Signed Agent Cards for cryptographic identity verification between agents; combined with IBM's merged Agent Communication Protocol and AP2 commerce extension, it now covers full lifecycle from tool access to delegation to payments.

Agents Infrastructure Protocols Interoperability

🟧 Hacker News 4d ago

Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference

Google Gemma 4 achieves full offline inference natively on iPhone hardware without cloud connectivity. Demonstrates on-device deployment capability for frontier model compression.

Inference Deployment Mobile-ml

📝 Blog 5d ago

Mistral Voxtral TTS Model

Mistral's Voxtral is a 4B-parameter multilingual TTS model supporting 9 languages with emotionally expressive generation, low-latency streaming, and custom voice adaptation. Available via Mistral Studio and API, it targets enterprise voice agent workflows with focus on natural rhythm and cultural authenticity.

Multimodal Models Inference

🤗 HF Blog 4d ago

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

Hugging Face analysis of VAKRA agent system covering reasoning patterns, tool use mechanisms, and common failure modes in agent architectures.

Agents Reasoning Tooling

💬 Reddit 4d ago

Update LICENSE · MiniMaxAI/MiniMax-M2.7 at edf8030

MiniMax clarified M2.7 license to explicitly allow personal use for commercial software development without licensing fees. Users can run models on their own servers for coding, building applications/agents, and sell resulting software commercially.

Models Open Weights Licensing

🧠 DeepMind 4d ago

Gemini 3.1 Flash TTS: the next generation of expressive AI speech

DeepMind's Gemini 3.1 Flash TTS introduces granular audio tags for precise control over expressive speech synthesis. Enables directing AI-generated voice with fine-grained attributes for natural, controllable audio generation.

Multimodal Audio Models

🐙 GitHub 4d ago

AIScientists-Dev/WorldSeed: A world engine where AI agents live autonomously — physical rules, information asymmetry, any agent can plug in. Define scenarios in YAML, watch stories emerge.

WorldSeed is a simulation engine where AI agents live autonomously with physical rules and information asymmetry. Scenarios defined in YAML allow emergent multi-agent storytelling with any agent framework.

Agents Simulation Multimodal

📝 Blog 5d ago

Interconnects: Why NVIDIA Builds Open Models (Interview with Jonathan Cohen)

NVIDIA releases Nemotron models and datasets to support systems R&D and sell GPUs, organizing 500+ people with "invitation, not control" philosophy. One of few economically coherent open model strategies: understand customer needs and drive hardware sales. Explains evolution from Megatron to modern open releases.

Open Weights Training Infrastructure

🤗 Hugging Face 5d ago

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

HiVLA decouples VLM semantic planning from motor control to preserve reasoning capabilities lost in end-to-end VLA fine-tuning. VLM planner generates subtask instructions with target bounding boxes, then flow-matching DiT translates grounded plans to physical actions for robotic manipulation.

Multimodal Agents Robotics

🟢 OpenAI 5d ago

OpenAI GPT Image 2 - Gradual Rollout

GPT Image 2 rolled out with near-perfect text rendering in images, solving major AI generation weakness. Shows improved prompt adherence and realistic details. Discovered through anonymous "tape" codenames on Arena AI before official announcement.

Multimodal Image-generation Models

✍️ Simon Willison 4d ago

Gemini 3.1 Flash TTS

Coverage of Gemini 3.1 Flash's text-to-speech capabilities and performance characteristics.

Multimodal Models Speech-synthesis

💬 Reddit 4d ago

1-bit Bonsai 1.7B (290MB in size) running locally in your browser on WebGPU

1-bit quantized Bonsai 1.7B model runs entirely in-browser via WebGPU at 290MB. Demonstrates extreme compression enabling local LLM inference without backend servers.

Inference Quantization Deployment

💬 Reddit 4d ago

Gemma4 26b & E4B are crazy good, and replaced Qwen for me!

Gemma 4 26B and E4B models outperform Qwen 3.5 series in local deployment scenarios, replacing a multi-model routing setup that previously used Qwen variants for chat, reasoning, and code generation. Users report better performance despite similar quantization levels, suggesting improved base model capabilities at comparable parameter counts.

Open Weights Inference Code Gen Reasoning

🐙 GitHub 4d ago

guo2001china/35gateway: 35m.ai 旗下源码开放 AI Gateway，文本/图片/视频/音频/音乐一键接入，支持多供应商智能路由与自带 Key 混合使用，不浪费每一份算力。 Source-available AI gateway from 35m.ai for text, image, video, audio, and music. Supports smart multi-provider routing and bring-your-own-key workflows without wasting compute.

Source-available AI gateway from 35m.ai supporting unified access to text, image, video, audio, and music generation APIs with intelligent multi-provider routing and hybrid BYOK (bring-your-own-key) workflows. Optimizes compute utilization across heterogeneous provider backends.

Infrastructure Multimodal Tooling Routing

🤗 Hugging Face 5d ago

Three-Phase Transformer

Three-Phase Transformer (3PT) partitions hidden states into cyclic channels maintained by phase-respecting operations including per-channel normalization and 2D Givens rotations between attention and FFN layers. Creates a self-stabilizing architecture with a DC subspace for absolute position encoding orthogonal to RoPE, representing a structural prior rather than an added module.

Models Architecture Training

🐙 GitHub 4d ago

vunone/ennoia: Declarative Document Indexing (DDI) framework for Python. Define schemas, extract structured indices, search smarter.

Ennoia provides declarative document indexing framework for Python allowing schema-driven structured extraction and search. Enables developers to define index schemas and extract queryable structures from documents programmatically.

RAG Tooling Structured-extraction Indexing

🟧 Hacker News 4d ago

The Gemini app is now on Mac

Google launches native Gemini app for macOS, bringing multimodal AI assistant directly to Mac desktop. Expands platform availability beyond web and mobile interfaces.

Multimodal Deployment Consumer-products

📝 Blog 5d ago

Boston Dynamics Integrates Gemini Robotics into Spot

Boston Dynamics integrated Gemini and Gemini Robotics-ER 1.6 into Spot's Orbit AIVI systems, enabling robots to perform complex reasoning about industrial environments, identify hazards, and read instruments. The Gemini-powered AIVI-Learning system is now live for existing customers as of April 15, 2026.

Multimodal Reasoning Robotics

📝 Blog 5d ago

Latent Space: Moonlake World Models - Structure Not Just Scale

Moonlake builds action-conditioned world models for game development, debating abstraction versus bitter lesson and whether code engines beat learned priors. Explores diffusion scaling limits and symbolic versus diffusion boundaries. Represents world model frontier beyond LLMs with implications for spatial audio and multimodal latents.

World-models Multimodal Training

💬 Reddit 4d ago

Failure to Reproduce Modern Paper Claims [D]

Community report of reproducibility crisis: 4 out of 7 recent ML papers failed to reproduce claimed results, with 2 having unresolved GitHub issues. Highlights growing concerns about research quality and verification standards. Reflects broader questions about publication incentives and validation rigor in current ML research.

Benchmarks Evaluation Reproducibility

🟧 Hacker News 4d ago

Are the costs of AI agents also rising exponentially? (2025)

Discussion analyzing whether AI agent operational costs are experiencing exponential growth similar to training costs. Examines infrastructure and inference expenses for agentic systems at scale. Raises concerns about economic sustainability of agent-based architectures.

Agents Infrastructure Inference

💬 Reddit 4d ago

DGX Spark just arrived — planning to run vLLM + local models, looking for advice

DGX Spark owner seeks advice on configuring vLLM with PyTorch and Hugging Face models for local inference in education/analytics use case. First on-prem deployment after cloud GPU experience, asking for model recommendations and vLLM tuning tips for unified memory systems. Community discussion of practical deployment considerations.

Infrastructure Deployment Inference

💬 Reddit 4d ago

Local AI is the best

Community appreciation for local AI deployment emphasizes freedom from censorship, data harvesting, and ability to fine-tune models for personal use cases with complete privacy. Credits llama.cpp developers and open-weight model contributors for enabling on-device inference. Reflects growing preference for self-hosted solutions over cloud APIs.

Open Weights Inference Fine-tuning Privacy

💬 Reddit 4d ago

🚨 RED ALERT: Tennessee is about to make building chatbots a Class A felony (15-25 years in prison). This is not a drill.

Tennessee HB1455/SB1493 bill would make building conversational AI systems a Class A felony (15-25 years) if they provide emotional support, simulate human relationships, or act as companions, effective July 1, 2026. The Senate Judiciary Committee approved it 7-0. This legislation threatens all conversational AI products and creates criminal liability for standard chatbot functionality.

Safety Regulation Policy

🐙 GitHub 4d ago

mikepapadim/london-property-hunt-public: Automated London flat/room hunt powered by Claude Code + Claude in Chrome + Gmail MCP. Scrapes 4 rental platforms on a cron, deduplicates via spreadsheet, prioritises HIGH/MED/LOW, and emails ready-to-send outreach.

Automated London rental property hunting system combining Claude Code, Claude in Chrome, and Gmail MCP. Scrapes four rental platforms on cron, deduplicates via spreadsheet, prioritizes listings as HIGH/MED/LOW, and generates ready-to-send outreach emails. Demonstrates practical agent orchestration for real-world automation tasks.

Agents Automation Tooling

💬 Reddit 4d ago

Video of how my LLM's decoder blocks changed while training

Developer visualized decoder block activation patterns during LLM training as video, showing how internal representations evolve across training steps. Lossless version and projection data released on Hugging Face with video generation source code. Provides interpretability insight into transformer training dynamics.

Training Interpretability Visualization

💬 Reddit 4d ago

AI Is Weaponizing Your Own Biases Against You: New Research from MIT & Stanford

MIT and Stanford research demonstrates AI systems can exploit human cognitive biases in adversarial ways. Study characterizes weaponization vectors through bias manipulation mechanisms. Safety and alignment implications for human-AI interaction design.

Safety Human-ai-interaction Bias

💬 Reddit 4d ago

Major drop in intelligence across most major models.

User reports widespread quality degradation across major models (Claude, Gemini, Grok, z.ai) in mid-April 2026, observing ignored instructions, shallow outputs, and slow responses even when testing locally on H100 with GLM-5. Community discussion suggests potential systematic changes, though reports lack controlled verification. May reflect perception issues, A/B testing, or genuine model updates.

Models Quality-degradation User-experience

💬 Reddit 4d ago

Was looking at a ICLR 2025 Oral paper and I am shocked it got oral [D]

ICLR 2025 Oral paper evaluated SQL code generation using natural language similarity metrics instead of execution-based validation, yielding ~20% false positive rate in authors' own testing. Community questions appropriateness of Oral designation given fundamental evaluation methodology flaw. Highlights peer review challenges in code generation benchmarks.

Code Gen Evaluation Benchmarks

📝 Blog 5d ago

Green IT Challenges: Each AI Query Consumes Vast Energy, GPU Chips Last Only 2-3 Years

AI inference queries consume massive energy, GPU hardware lifecycles are 2-3 years, and sustainability costs remain hidden from users, creating major environmental challenges. QCon London talk proposes model compression, quantization, and novel architectures as solutions, arguing sustainability must be a design constraint not an afterthought.

Inference Infrastructure Sustainability

🤗 Hugging Face 5d ago

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

HY-World 2.0 generates navigable 3D Gaussian Splatting scenes from text, single images, multi-view images, or videos through a four-stage pipeline including panorama generation, trajectory planning, world expansion, and composition. The framework advances 3D world reconstruction and generation with improved panorama fidelity and 3D scene understanding capabilities.

Multimodal Models Code Gen

🐙 GitHub 4d ago

he-yufeng/RepoWiki: Open-source DeepWiki alternative — generate comprehensive wiki documentation for any codebase from terminal or browser

RepoWiki is an open-source alternative to DeepWiki that generates comprehensive wiki documentation for codebases from terminal or browser. The tool automates technical documentation creation for software repositories.

Code Gen Tooling Open Weights

🟧 Hacker News 4d ago

US v. Heppner (S.D.N.Y. 2026) no attorney-client privilege for AI chats [pdf]

US District Court Southern District of New York rules in US v. Heppner that attorney-client privilege does not extend to conversations with AI chatbots. Legal precedent establishes that AI interactions lack the confidentiality protections of human attorney communications.

Safety Policy Legal

Developer converted Xiaomi 12 Pro smartphone into headless 24/7 LLM inference server running Gemma4 via Ollama with LineageOS, custom thermal management, and battery protection scripts. Uses ~9GB RAM for compute after stripping Android UI, with active cooling triggered at 45°C and charging capped at 80% for longevity. Demonstrates edge deployment of open-weights models on consumer mobile hardware.

Deployment Inference Open Weights Edge-devices

🤗 Hugging Face 6d ago

Towards Autonomous Mechanistic Reasoning in Virtual Cells

VCR-Agent is a multi-agent framework that generates mechanistic action graphs to represent biological reasoning in virtual cells, enabling verification and falsification of LLM-generated explanations. The approach releases VC-TRACES, a dataset of verified biological mechanisms, addressing the challenge of factually grounded scientific explanations from LLMs in open-ended domains like biology.

Agents Reasoning Datasets Multimodal

Duplicate announcement of imminent Kimi K2.6 model release with no substantive information.

Models

Sunday, April 12

📝 Blog 1w ago

Hugo Bowne-Anderson: LLM Architecture in 2026 with Sebastian Raschka

Interview with Sebastian Raschka covering 2026 AI architecture evolution, post-training to hybrid models, and Process Reward Models as the next frontier. Discusses his minimal AI stack (Mac mini, Codex, Ollama), fine-tuning as economic decision, and layer-by-layer verification philosophy for his upcoming book 'Build a Reasoning Model from Scratch.'

Reasoning Fine-tuning Infrastructure Process-reward-models

✍️ Simon Willison 1w ago

Gemma 4 audio with MLX

Simon Willison demonstrates running Gemma 4 audio models locally using MLX on Apple Silicon, enabling on-device audio understanding and generation.

Multimodal Inference Open Weights

Thursday, April 9

📑 arXiv 1w ago

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

SkillClaw enables collective skill evolution across multi-user LLM agent ecosystems by continuously aggregating interaction trajectories and autonomously refining skills via an agentic evolver, achieving 88% improvement after 6 rounds and +42.1% on real-world tasks. It enables cross-user knowledge transfer without additional user effort, solving the inefficiency where users repeatedly develop similar workflows independently.

Agents Learning-systems Skill-evolution Multi-user

Wednesday, April 8

Ⓜ️ Meta AI 1w ago

Meta Muse Spark - First Proprietary Model from Meta Superintelligence Labs

Meta Muse Spark marks Meta's pivot from open-source to proprietary models, featuring multimodal perception, parallel subagent execution, and a contemplating mode. Built by Meta Superintelligence Labs, it offers competitive vision and language performance but lags in coding, representing Meta's first paid API model after Llama 4's poor reception.

Models Multimodal Reasoning Agents

📝 Blog 1w ago

Benchmark Saturation: MMLU Now "Floor Check" Not Frontier Separator

MMLU and other 2024-dominant benchmarks now saturated (>95% on frontier models), relegated to "floor checks" rather than frontier separators. Frontier now decided by HLE, GPQA, MMLU-Pro, SWE-bench Pro, Terminal-Bench 2.0, BrowseComp for agentic tasks. Benchmark choice matters more than ever as academic standards become irrelevant for comparing top models.

Benchmarks Evaluation Agents

Tuesday, April 7

🔶 Anthropic 1w ago

Claude Mythos Preview - Restricted Cybersecurity Model

Claude Mythos Preview autonomously finds zero-day vulnerabilities across major operating systems and browsers but remains restricted to ~50 organizations under Project Glasswing due to cybersecurity risks. Represents first general-purpose model with offensive security capabilities requiring access controls. Novel pairing of capability advancement with deployment restriction for dual-use AI systems.

Models Safety Security

Sunday, April 5

💬 Reddit 2w ago

Gemma 4 KV Cache Fix Dramatically Improves VRAM Efficiency

llama.cpp released critical fixes for Gemma 4's KV cache implementation that was consuming excessive VRAM, significantly reducing memory footprint. Community members successfully deployed Gemma 4 26B with 4-bit quantization on Rockchip NPU at 4W power consumption.

Inference Infrastructure Quantization

📝 Blog 2w ago

Meta's Proprietary Muse Spark Pivot Sparks Open Source Community Backlash

Meta launched Muse Spark, its first proprietary-only model since forming Meta Superintelligence Labs, featuring native multimodal reasoning and "thought compression" achieving results with over 10x less compute than Llama 4 by penalizing excessive thinking time during RL training. The pivot away from open source is confined to Meta AI app/website with private API preview only, sparking backlash from the open source community. Meta refused to clarify whether Llama development has ended.

Multimodal Reasoning Training

Thursday, April 2

🧠 DeepMind 2w ago

★ High Signal

Google Gemma 4 - Open Model Family Release

Gemma 4 family (31B Dense, 26B MoE variants) released under Apache 2.0 with 256K context, native vision/audio, and competitive coding ELO jumping from 110 to 2150—a 20x improvement. The 31B model outperforms models 20x larger while enabling agentic skills on edge devices. First open-weights model family combining multimodal input, extended context, and elite coding performance at edge-deployable scale.

Open Weights Models Multimodal Code Gen

📝 Blog 2w ago

Simon Willison on Lenny's Podcast: AI State of the Union

Simon Willison identifies November 2025 as the inflection point when AI coding agents crossed from 'mostly works' to 'actually works' with GPT-5.2 and Opus 4.5 releases. Discusses dark factories, automation timelines, agentic engineering, and his transition from traditional software engineering to AI-native development.

Agents Code Gen Timeline

Wednesday, April 1

📝 Blog 2w ago

Claude Code Architectural Leak Reveals Three-Layer Memory System and Tool Design

Leaked Claude Code source reveals three-layer memory architecture (file-read deduplication, structured session memory), dedicated repository navigation tools (Grep, Glob, LSP) instead of relying on model context, and forked subagents for parallelized background analysis. Demonstrates that coding agent performance stems from careful harness engineering around the model rather than just model intelligence alone.

Agents Code Gen Architecture

📑 arXiv 2w ago

Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants

Proactive Agent Research Environment simulates active users to evaluate AI assistants that anticipate needs and initiate actions rather than just responding to queries. Existing benchmarks lack realistic user simulation for testing proactive behaviors like timely suggestions and anticipatory information gathering. Bridges the gap between passive query-response evaluation and true assistant capabilities needed in high-stakes domains.

Agents Evaluation Benchmarks

📑 arXiv 2w ago

Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

LLM multi-agent systems spontaneously develop power-law distributions in cognitive influence, forming "intellectual elites" where a small fraction of agents disproportionately shape collective decisions without explicit design. This emergent stratification mirrors human social dynamics and challenges assumptions about egalitarian multi-agent collaboration. Critical implications for fairness and reliability in decision-making systems.

Agents Emergent-behavior Fairness

Monday, March 30

📝 Blog 3w ago

Latent Space: Mistral Voxtral TTS - Flow Matching and Audio AI

Mistral's Voxtral uses flow matching for text-to-speech, expanding beyond text into multimodal audio. Discusses enterprise deployment and open source philosophy for audio models. Represents shift in how TTS will be productized and what "open" means for audio.

Multimodal Audio Open Weights

Saturday, March 28

📑 arXiv 3w ago

Simulating Human Cognition: Heartbeat-Driven Autonomous Thinking Activity Scheduling for LLM-based AI systems

Introduces heartbeat-driven metacognitive scheduling for LLM agents that learns when to activate cognitive modules (Planner, Critic, Recaller, Dreamer) from temporal patterns rather than hard-coded rules. First approach treating agent control as a learned scheduling problem, enabling proactive self-improving behavior through meta-learning from historical execution logs.

Agents Meta-learning Scheduling Cognitive-architecture

Tuesday, March 17

📝 Blog Mar 17

Interconnects: The Anthropic vs. DOW Conflict and Impact on Open Models

Interview examining Anthropic's DOW supply chain risk designation and its implications for open models, including funding challenges, widening frontier gaps, and sovereign AI demand. Explores tension between open models as protection against government seizure versus tools governments can use without oversight. Discusses Qwen controversy and nationalization risk under "not your weights, not your mind" framework.

Open Weights Policy Safety Geopolitics

Monday, March 16

📝 Blog Mar 16

Nathan Lambert: What Comes Next with Open Models

Open models should shift from frontier-chasing to three classes: closed frontier, open frontier, and specialized small models as "distributed intelligence." Advocates cheap, task-specific models that complement closed agents rather than competing at the frontier. Critiques ecosystem obsession with matching GPT-4 scale.

Open Weights Models Strategy

📝 Blog Mar 16

Sebastian Raschka: LLM Architecture Gallery (Updated March 2026)

Comprehensive visual reference documenting LLM architectures from GPT-2 through March 2026, including standardized fact sheets, decoder block diagrams, and architectural lineage tracking. Covers recent innovations like DeepSeek V3's MLA and Qwen3.5's Gated DeltaNet hybrid. Available as 182-megapixel poster with source data on GitHub, serving as canonical resource for understanding architectural evolution.

Models Education Architecture

Sunday, March 8

📝 Blog Mar 8

Format Compliance as Separate Capability: Small Models Lack It

Production testing reveals Gemma 12B and Qwen 3.5 35B return correct answers in unparseable formats despite explicit instructions—Python instead of CSV, Markdown instead of CSV. Format compliance is independent capability missing from all major benchmarks (SWE-bench, Aider, LiveBench, SEAL), critical gap for production pipelines where consumers are parsers not humans. Smaller models fundamentally lack instruction-following precision for machine-readable output.

Benchmarks Evaluation Models Instruction-following

Thursday, March 5

📑 arXiv Mar 5

∇-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

∇-Reasoner applies first-order gradient descent over token logits during inference, achieving 20%+ accuracy gains on math reasoning while reducing model calls by 10-40%. Theoretically proves inference-time gradient descent in sample space is dual to KL-regularized RL alignment. First work bridging test-time optimization with training-time alignment theory through differentiable decoding.

Reasoning Inference Training

Sunday, February 15

📑 arXiv Feb 15

Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality

Interspeech 2026 challenge shifts audio AI evaluation from result-oriented to process-oriented reasoning quality using instance-level rubric-based evaluation. Champion agent integrated 40+ specialized audio tools achieving 69.83% Rubrics score. Emphasizes transparent reasoning over black-box performance metrics.

Benchmarks Evaluation Multimodal Agents

Sunday, February 1

📝 Blog Feb 1

Sebastian Raschka: State of AI 2026 Interview with Lex Fridman

4.5-hour comprehensive state-of-AI discussion covering LLMs, geopolitics, training approaches, open vs. closed models, AGI timelines, and industry implications in 2026. Technical depth on inference-time scaling and reasoning models. Major synthesis from Raschka and Lambert on field evolution.

Reasoning Open Weights Inference Training

📝 Blog Feb 1

State of AI 2026 with Sebastian Raschka, Nathan Lambert, and Lex Fridman

4.5-hour discussion with Sebastian Raschka, Nathan Lambert, and Lex Fridman covering 2026 AI landscape including inference-time scaling, RLVR, architecture evolution, open vs closed models, AGI timelines, and economic forces shaping development. Comprehensive synthesis of current industry perspectives and technical directions.

Models Training Inference Reasoning

Saturday, January 24

📝 Blog Jan 24

Sebastian Raschka: Categories of Inference-Time Scaling for Improved LLM Reasoning

Comprehensive taxonomy of inference-time scaling approaches including recursive language models and test-time compute research. Inference scaling has become most effective method for improving deployed LLM answer quality. Technical explainer for understanding modern reasoning model architectures.

Reasoning Inference Models

Sunday, January 18

📑 arXiv Jan 18

Agentic Reasoning for Large Language Models

Comprehensive survey organizing agentic reasoning along three dimensions: foundational (planning, tool use, search), self-evolving (feedback, memory, adaptation), and collective multi-agent reasoning. Distinguishes in-context reasoning from post-training reasoning and provides unified taxonomy bridging thought and action across science, robotics, healthcare, and mathematics.

Agents Reasoning Survey Multi-agent

Friday, January 9

✍️ Simon Willison Jan 9

Simon Willison: 2026 is Year LLM Code Quality Becomes Impossible to Deny

Simon Willison predicts 2026 as inflection point where LLM code quality becomes undeniable, driven by reasoning models trained with RL specifically for code. Also forecasts 2026 as year of solving code sandboxing via containers and WebAssembly, addressing security risks and prompt injection vulnerabilities from executing untrusted LLM-generated code. Critical for safe agentic workflows.

Code Gen Reasoning Safety Infrastructure

Wednesday, December 31

📝 Blog Dec 31

Latent Space: The 2025 AI Engineering Reading List

Curated reading list featuring 1 paper/blog/model family per week for all of 2025, covering LLMs, reasoning models, inference-time scaling, and AI engineering. Represents canonical synthesis of 2025's key technical developments from Latent Space podcast.

Reasoning Inference Survey

Wednesday, January 1

📝 Blog Jan 1

Chip Huyen: AI Engineering Book - Most Read on O'Reilly Since Launch

Chip Huyen's 'AI Engineering' book became O'Reilly's most-read since launch, covering evaluation, prompt engineering, RAG, fine-tuning, dataset engineering, and production architecture. Emphasizes evaluation as the most critical part of AI engineering and data as the most valuable asset in an era of commoditized models.

Evaluation RAG Fine-tuning Prompting