Multi-Agent Reflexion uses diverse reasoning personas with a separate judge model to synthesize critiques, improving HotPotQA by 3 points and HumanEval by 6.2 points. It separates acting, diagnosing, critiquing, and aggregating to reduce the shared blind spots of single-agent self-reflection, addressing the systematic limitation where a solo agent repeats its own misconceptions without an external correction signal.
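A minimal sketch of the persona/judge separation described above. All names (`call_model`, the persona list, prompt wording) are illustrative assumptions, not the paper's actual interface; `call_model` is a stub standing in for any LLM client.

```python
# Hypothetical sketch of a multi-persona reflection loop.
# `call_model` is a stub standing in for a real LLM client.

def call_model(prompt: str) -> str:
    return f"[response to: {prompt[:40]}...]"

PERSONAS = ["tester", "security reviewer", "performance engineer"]

def reflect(task: str, attempt: str, rounds: int = 2) -> str:
    for _ in range(rounds):
        # Each persona critiques the attempt independently (diagnosing/critiquing).
        critiques = [
            call_model(f"As a {p}, critique this solution to '{task}':\n{attempt}")
            for p in PERSONAS
        ]
        # A separate judge model aggregates the critiques into one fix list.
        verdict = call_model(
            "Synthesize these critiques into one fix list:\n" + "\n".join(critiques)
        )
        # The actor revises its answer using the aggregated critique.
        attempt = call_model(
            f"Revise the solution using this feedback:\n{verdict}\n"
            f"Task: {task}\nPrevious: {attempt}"
        )
    return attempt
```

The point of the structure is that diagnosis, critique, and aggregation are distinct calls with distinct prompts, so one model's blind spot does not silently propagate through every stage.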
Community discussion on replacing Claude Opus with Qwen-35B-A3B for coding agent workflows on M5 Max hardware. Users weighing Opus's reasoning edge against Qwen's local deployment and cost benefits for daily development tasks.
llama.cpp merged n-gram-based speculative decoding support achieving 0-50% speedup on coding tasks with optimized parameters, though performance varies with prompt repetition patterns and draft acceptance rates. The feature matches trailing n-grams against earlier context to propose draft tokens, with configurable draft token ranges.
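The n-gram matching idea can be sketched in a few lines (this is an illustrative reconstruction of the general technique, not llama.cpp's actual implementation): find the most recent earlier occurrence of the trailing n-gram and propose the tokens that followed it as the draft, which the target model then verifies.

```python
def ngram_draft(tokens, n=3, max_draft=8):
    """Propose draft tokens by matching the trailing n-gram against earlier context.

    If the last `n` tokens occurred earlier in the sequence, the tokens that
    followed that occurrence become the speculative draft; the target model
    verifies the draft and accepts the longest correct prefix. This is why
    speedup depends so heavily on prompt repetition patterns.
    """
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the tail n-gram.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + max_draft]
    return []
```

On highly repetitive text (boilerplate, edits to code just printed) the draft is often fully accepted; on novel text it returns nothing and decoding falls back to plain autoregression, matching the 0-50% speedup range.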
Demonstration of Gemma 4 running entirely in-browser (3.1GB) to generate Excalidraw diagrams from text prompts using E2B. The implementation showcases on-device inference without server requirements. Novel for combining diagram generation with fully client-side LLM execution.
Qwen3.6-35B-A3B running at 8-bit quantization with 64k context matches Claude quality for code tasks on consumer hardware (M5 Max, 128GB). Handles complex multi-step research tasks with many tool calls and maintains performance on long context coding tasks. Enables fully local development workflows without sending code to external providers.
OpenAI Codex expanded beyond coding to include computer use, web workflows, image generation, memory, and automations. The updated developer app adds PR reviews, multi-file/terminal viewing, SSH devbox connections, and in-app browsing, serving 3+ million developers weekly.
Benchmark comparing Claude and Gemini on the laden knight's tour problem, a weighted variant requiring optimal pathfinding with accumulating costs. Tests coding agents on combinatorial optimization task combining movement constraints with dynamic cost calculation.
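One plausible reading of the weighted-knight task (the benchmark's exact cost model is not given here, so this is an assumed formulation): pay each square's cost on entry and find the cheapest knight route, which Dijkstra's algorithm handles directly.

```python
import heapq

KNIGHT_MOVES = [(1, 2), (2, 1), (2, -1), (1, -2),
                (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def cheapest_knight_path(cost, start, goal):
    """Minimum accumulated cost to move a knight from start to goal.

    `cost[r][c]` is the price paid on entering square (r, c) -- one simple
    interpretation of a weighted ("laden") knight's-tour-style instance.
    Dijkstra's algorithm handles pathfinding with accumulating costs.
    """
    rows, cols = len(cost), len(cost[0])
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue  # stale heap entry
        for dr, dc in KNIGHT_MOVES:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None  # goal unreachable
```

The combination the blurb highlights, movement constraints plus dynamic cost, is exactly what trips up agents that pattern-match to the classic (unweighted) knight's tour.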
Qwen3.6-35B-A3B successfully solved coding problems that Qwen3.5-27B couldn't handle, reducing technical debt in a complex budgeting app project. Users report improved code quality and architectural decisions on multi-feature applications.
User benchmark comparing Qwen 3.6 35B against Gemma 4 26B on 30k-line codebase with 37 intentional bugs and PDF analysis tasks shows Qwen significantly outperforming across agentic capabilities, coding, image-to-text, instruction following, and reasoning. Both models tested at Q4_K_XL quantization for fair comparison.
Qwen3.6 with OpenCode successfully implemented row-level security across a multi-service codebase (Rust, TypeScript, Python), demonstrating practical viability for complex code generation tasks. Users report quality comparable to Claude for certain daily-drive use cases despite remaining bugs.
Qwen3.6-35B-A3B represents the first local model practitioners find genuinely competitive with proprietary APIs for code generation, producing usable output for UI XML and embedded C++ with minimal post-generation fixes. This marks a capability threshold where local deployment overhead becomes worthwhile compared to previous iterations requiring extensive manual correction.
Qwen3.6-35B model successfully builds a complete tower defense game with autonomous bug detection and fixing using MCP screenshot verification. User reports the model identified rendering issues and wave completion bugs independently during development. Demonstrates strong multimodal code generation capabilities with visual feedback integration.
LLMSniffer fine-tunes GraphCodeBERT with two-stage supervised contrastive learning to detect AI-generated code, improving accuracy from 70% to 78% on GPTSniffer and 91% to 94.65% on Whodunit. The approach combines comment removal preprocessing with an MLP classifier and produces well-separated embeddings confirmed by t-SNE visualization.
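The supervised contrastive objective at the core of this approach can be written down directly (a plain-Python sketch of the standard SupCon loss, averaged over positive pairs; LLMSniffer's exact staging and hyperparameters are not reproduced here).

```python
import math

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss over unit-normalized embeddings.

    Same-label pairs (e.g. two AI-generated snippets) are pulled together
    and different-label pairs pushed apart -- the training signal that
    produces the well-separated clusters a t-SNE plot confirms.
    """
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    z = [norm(v) for v in embeddings]
    # Temperature-scaled cosine similarity matrix.
    sim = [[sum(a * b for a, b in zip(zi, zj)) / tau for zj in z] for zi in z]
    n = len(labels)
    loss, pairs = 0.0, 0
    for i in range(n):
        # Denominator: all other samples compete in the softmax.
        log_den = math.log(sum(math.exp(sim[i][j]) for j in range(n) if j != i))
        for p in range(n):
            if p != i and labels[p] == labels[i]:
                loss += log_den - sim[i][p]  # -log softmax of the positive
                pairs += 1
    return loss / max(pairs, 1)
```

Cleanly separated embeddings drive the loss toward zero, while mixed-up embeddings keep it high, which is what a downstream MLP classifier then exploits.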
Exposes a critical keyword shortcut bias in code localization benchmarks where models rely on superficial lexical matching rather than structural reasoning. Introduces KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without naming hints, revealing catastrophic performance drops in state-of-the-art approaches and motivating a neurosymbolic framework combining neural retrieval with symbolic verification.
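The "no naming hints" idea can be demonstrated with a small identifier anonymizer (an illustrative sketch, not KA-LogicQuery's actual pipeline; a production version would also skip builtins and imported names): rename every function, argument, and variable to an opaque token, so a localizer that relies on keyword overlap with the issue text gets nothing, while structural reasoning still works.

```python
import ast

class Anonymize(ast.NodeTransformer):
    """Rename identifiers to opaque tokens (v0, v1, ...), removing the
    lexical cues that shortcut-prone code localizers exploit."""

    def __init__(self):
        self.mapping = {}

    def _rename(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name)
        self.generic_visit(node)  # also rewrites args and body names
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node

def strip_naming_hints(source: str) -> str:
    tree = Anonymize().visit(ast.parse(source))
    return ast.unparse(tree)
```

Running a benchmark through a transform like this is a quick way to estimate how much of a localizer's accuracy is lexical matching rather than reasoning.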
Anthropic released Auto mode for Claude Code (Opus 4.7, Max tier) and new "xhigh" effort level between high and max for granular reasoning control. Update includes fullscreen TUI rendering, mobile notifications for Remote Control, and Windows/MCP fixes.
📝 Blog 3d ago
★ High Signal
NVIDIA's Nemotron 3 Super is a 120B/12B-active MoE model with hybrid Mamba-Attention architecture scoring 60.47% on SWE-Bench Verified—the highest open-weight coding score at launch. Features 1M context, 2.2x throughput improvement, and native speculative decoding for efficient agentic reasoning.
Command-line tool claims to accelerate Android app development 3x when used with AI coding agents. Streamlines agent-based mobile development workflows.
Agent-Aided Design systems use LLMs in a feedback loop to write CAD code, compile models, visualize results, and iteratively refine designs, but cannot yet generate complex 3D assemblies with moving parts like pistons or scissors. This work identifies the capability gap preventing these training-free agentic systems from impacting industrial manufacturing. Addresses the transition from static CAD objects to dynamic mechanical assemblies.
QuantCode-Bench provides 400 tasks evaluating LLMs on generating executable algorithmic trading strategies for Backtrader from English descriptions. Unlike standard code benchmarks, it requires domain-specific financial logic, specialized API knowledge, and code that produces actual trades on historical data, with tasks sourced from Reddit, TradingView, and synthetic generators.
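The kind of logic a generated strategy must encode is illustrated below as a framework-free sketch rather than in Backtrader's `Strategy` API (function names here are my own): a moving-average crossover that actually emits trade signals over a price history, which is the bar these tasks set beyond compiling.

```python
def sma(series, window):
    """Simple moving average; None until the window fills."""
    return [sum(series[i - window + 1:i + 1]) / window if i >= window - 1 else None
            for i in range(len(series))]

def crossover_signals(prices, fast=3, slow=5):
    """Emit 'buy' when the fast SMA crosses above the slow SMA, 'sell' on
    the reverse cross, None otherwise. Returns one signal per bar after the first."""
    f, s = sma(prices, fast), sma(prices, slow)
    signals = []
    for i in range(1, len(prices)):
        if None in (f[i], s[i], f[i - 1], s[i - 1]):
            signals.append(None)  # not enough history yet
        elif f[i - 1] <= s[i - 1] and f[i] > s[i]:
            signals.append("buy")
        elif f[i - 1] >= s[i - 1] and f[i] < s[i]:
            signals.append("sell")
        else:
            signals.append(None)
    return signals
```

A benchmark task wraps logic like this in Backtrader's event-driven API and checks that trades actually fire on historical data, which is where purely textual similarity metrics fail.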
Compact "Gene" representation outperforms documentation-oriented "Skill" packages for test-time evolution across 4,590 trials in scientific code tasks. Expanding experience into fuller documentation degrades performance, showing that representation format is a first-order factor in reusable experience.
LLM agents autonomously evolve the ABC logic synthesis codebase by rewriting sub-components while preserving its single-binary execution model. The self-evolving framework operates on the entire integrated codebase and bootstraps using existing open-source synthesis components before iteratively improving through agent-driven code evolution.
Qwen3.6-35B-A3B is a sparse MoE model with 35B total parameters and 3B active, released under Apache 2.0. The model matches agentic coding performance of models 10x its active size and includes multimodal perception with thinking and non-thinking modes.
COEVO unifies functional correctness and PPA (power, performance, area) optimization for LLM-generated RTL code in a single co-evolutionary loop, replacing sequential pipelines that discard partially correct but architecturally promising candidates. Existing methods decouple correctness from PPA and reduce multi-objective optimization to scalar fitness, obscuring trade-offs. COEVO treats correctness as continuous rather than binary, enabling simultaneous optimization of both objectives.
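The key selection difference can be made concrete (a generic Pareto-selection sketch, not COEVO's actual code; the two-objective tuple is an assumed simplification): with continuous correctness and PPA as separate objectives, a candidate survives unless another candidate beats it on every axis.

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one. Objectives here: (correctness, ppa_score),
    both higher-is-better, with correctness continuous (e.g. fraction of
    tests passed) rather than a 0/1 flag."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep every non-dominated candidate. A partially correct but
    architecturally promising design (high PPA, moderate correctness)
    survives to the next generation, unlike scalar-fitness selection
    that would collapse both objectives and discard it."""
    return [c for c in candidates
            if not any(dominates(o["scores"], c["scores"])
                       for o in candidates if o is not c)]
```

Scalarizing the two objectives into one fitness number would force a fixed trade-off up front; keeping the front exposes the trade-off curve to the evolutionary loop instead.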
Agentic framework for RTL timing optimization using LLMs with tool-grounded self-improvement and reusable optimization skills. Evaluated on realistic RTL designs with industrial-grade tools rather than manually degraded toy examples. Moves beyond coarse design-level feedback to fine-grained optimization through learned skills.
UniClaude integrates Claude directly into Unity Editor as a dockable window with full project context awareness and 60+ MCP tools. Eliminates context switching during game development by embedding the AI assistant natively in the IDE. Provides workflow-specific tooling for game developers working in Unity.
OpenAI's Codex app for macOS and Windows now includes computer use capabilities, in-app browsing, image generation, memory, and plugins. The update transforms Codex from a code-focused assistant into a multi-capability developer productivity platform.
🔶 Anthropic 4d ago
★ High Signal
Claude Opus 4.7 delivers 13% improvement on coding benchmarks with enhanced vision for higher-resolution images and new effort controls/task budgets for autonomous development. Powers upgraded Claude Code review tools for long-running software engineering tasks. Introduces task-level resource management for extended autonomous coding workflows.
🔶 Anthropic 4d ago
★ High Signal
Claude Opus 4.7 achieves 87.6% on SWE-bench Verified (13% improvement) with 2x throughput on agentic tasks while maintaining $5/$25 per million token pricing and full 1M context window. The performance gains make it effectively cheaper per task despite unchanged nominal pricing. Higher-resolution vision capabilities included.
GitHub Copilot adding Claude Opus 4.7 with stronger multi-step task performance and more reliable agentic execution. Launches with promotional 7.5× premium request multiplier until April 30th, replacing Opus 4.5 and 4.6 for Copilot Pro+ users.
MM-WebAgent is a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. Jointly optimizes global layout, local multimodal content, and their integration to produce coherent and visually consistent webpages, addressing style inconsistency in isolated element generation.
Gemma 4 26B and E4B models outperform Qwen 3.5 series in local deployment scenarios, replacing a multi-model routing setup that previously used Qwen variants for chat, reasoning, and code generation. Users report better performance despite similar quantization levels, suggesting improved base model capabilities at comparable parameter counts.
RepoWiki is an open-source alternative to DeepWiki that generates comprehensive wiki documentation for codebases from terminal or browser. The tool automates technical documentation creation for software repositories.
ICLR 2025 Oral paper evaluated SQL code generation using natural language similarity metrics instead of execution-based validation, yielding ~20% false positive rate in authors' own testing. Community questions appropriateness of Oral designation given fundamental evaluation methodology flaw. Highlights peer review challenges in code generation benchmarks.
Anthropic redesigned the Claude Code desktop app with a parallel session management sidebar, integrated terminal, in-app file editor, and Routines, automations triggered by schedules, API calls, or GitHub events without an active session. Available for Pro, Max, Team, and Enterprise users on macOS and Windows.
Gemini Robotics-ER 1.6 specialized reasoning model for physical AI achieves 93% success on instrument reading tasks (up from 23% baseline) through agentic vision combining visual reasoning with code execution. It adds spatial reasoning, multi-view perception, and industrial gauge interpretation as a high-level planning layer for vision-language-action robotics models.
🟢 OpenAI 5d ago
★ High Signal
OpenAI Codex expands from coding to full computer use with web workflows, multi-step planning, autonomous actions, and audio-visual processing for 3M+ weekly developers. Now handles PR reviews, multiple file/terminal views, SSH connections, and in-app browsing. Shift from code generation tool to general-purpose computer control agent.
📝 Blog 5d ago
★ High Signal
Claude Code discovered a 23-year-old remotely exploitable heap buffer overflow in Linux kernel's NFS driver, with five vulnerabilities confirmed. Linux maintainers report AI bug reports shifted from "slop to legitimate findings" about a month ago, with valid security reports increasing from 2-3/week to 5-10/day—marking a capability inflection point for AI-assisted vulnerability discovery.
Notion rebuilt Custom Agents 4-5 times before production, revealing early agent attempts failed due to lack of tool-calling standards and short context windows. Their 'Agent Lab' thesis focuses on building product systems around frontier capabilities, with coding agents viewed as the kernel of future 'software factories' comprising spec/code/test/review agents.
HY-World 2.0 generates navigable 3D Gaussian Splatting scenes from text, single images, multi-view images, or videos through a four-stage pipeline including panorama generation, trajectory planning, world expansion, and composition. The framework advances 3D world reconstruction and generation with improved panorama fidelity and 3D scene understanding capabilities.
ddtree-mlx ports tree-based speculative decoding to Apple Silicon with custom Metal kernels, achieving 10-15% speedup over DFlash on code and 1.5x over autoregressive inference. First MLX implementation supporting hybrid model architectures.
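The acceptance step that distinguishes tree-based speculation can be sketched as follows (a greedy-verification simplification of the general technique; ddtree-mlx's actual kernels and any sampling-based acceptance are not reproduced, and `target_next` is a stand-in for the target model).

```python
def accept_path(tree, target_next):
    """Walk a draft token tree, asking the target model at each node which
    token it would emit and descending only into the matching child.

    `tree` maps a prefix tuple to its drafted children,
    e.g. {(): [5, 9], (5,): [2, 7]} -- several branches are drafted so at
    least one path is likely to match. Returns the accepted tokens: the
    longest root-to-leaf path the target agrees with, plus the target's
    own token at the first mismatch (obtained for free during verification).
    """
    accepted = []
    while True:
        children = tree.get(tuple(accepted), [])
        choice = target_next(tuple(accepted))
        if choice in children:
            accepted.append(choice)   # draft verified; extend the path
        else:
            accepted.append(choice)   # target's own token ends speculation
            return accepted
```

Because the whole tree is scored in one batched target pass on real systems, every accepted node is a decoding step saved, which is where the 1.5x over autoregressive inference comes from.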
r/LocalLLaMA consensus ranks Qwen 3.5 most broadly recommended, Gemma 4 showing strong buzz, GLM-5/4.7 near top of rankings, MiniMax M2.5/M2.7 for agentic workloads, DeepSeek V3.2 in top cluster. Qwen3-Coder-Next dominates for local coding. Community-driven practical guidance on deployed models.
Analysis of Claude Code's TypeScript source code and comparison with OpenClaw identifies five core human values (decision authority, safety, reliable execution, capability amplification, contextual adaptability) traced through thirteen design principles to implementation choices. The core architecture is a simple while-loop calling the model, running tools, and returning results—demonstrating how design philosophy shapes agentic system architecture.
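That while-loop core is simple enough to sketch end to end (in Python rather than the original TypeScript; `model`, the message shape, and the `tools` dict are illustrative stand-ins, not Claude Code's actual interfaces).

```python
def agent_loop(model, tools, user_msg, max_turns=25):
    """The while-loop core: send the transcript to the model, execute any
    tool it requests, append the result, repeat until it answers."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        reply = model(messages)
        messages.append(reply)
        if "tool_call" not in reply:
            return reply["content"]          # model produced a final answer
        name, args = reply["tool_call"]
        result = tools[name](**args)         # run the requested tool
        messages.append({"role": "tool", "content": str(result)})
    return None  # turn budget exhausted
```

Everything the analysis traces back to values — decision authority, safety, reliable execution — lives in what surrounds this loop (tool permissions, context management, subagents), not in the loop itself.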
LeanKG focuses on efficient formal theorem proving in Lean, emphasizing token efficiency for mathematical code generation. The title suggests optimization techniques for reducing computational cost while maintaining correctness, targeting formal verification and proof-assistant workflows.
Curated collection of 50+ Claude Code skills, agents, and plugins organized by use case with recommendation ratings. Ready-to-use extensions for Claude-based development workflows.
Minimax M2.7 generates functional 3D GTA-style web experiences with minimal prompting, running at extreme IQ2_XXS quantization while maintaining coherence. Competes with GLM-5 on coding benchmarks for interactive 3D applications, though GLM-5 produces more aesthetically detailed outputs without explicit instruction.
Asynkor provides file leasing coordination for AI agent teams via MCP server, preventing merge conflicts when multiple agents edit code. Works across IDEs without changing agent implementations.
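A minimal in-memory sketch of the leasing idea (illustrative only — Asynkor's actual MCP protocol, method names, and persistence are not reproduced here): an agent must hold a live lease on a path before editing, and leases expire so a crashed agent cannot block teammates forever.

```python
import time

class LeaseCoordinator:
    """Grant exclusive, time-limited edit leases on file paths."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.leases = {}  # path -> (agent_id, expiry_timestamp)

    def acquire(self, path, agent_id, now=None):
        """Return True if agent_id now holds the lease on path."""
        now = time.monotonic() if now is None else now
        holder = self.leases.get(path)
        if holder and holder[0] != agent_id and holder[1] > now:
            return False                       # another agent holds a live lease
        self.leases[path] = (agent_id, now + self.ttl)  # grant or renew
        return True

    def release(self, path, agent_id):
        """Drop the lease early; only the current holder may release."""
        if self.leases.get(path, (None,))[0] == agent_id:
            del self.leases[path]
```

Serving this behind an MCP server is what lets the scheme work across IDEs without modifying the agents themselves: each agent just calls acquire/release tools before and after edits.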
📝 Blog 1w ago
★ High Signal
GLM-5.1 achieves 94.6% of Claude Opus 4.6's coding performance at $3/month under MIT license, while Google's Gemma 4 and Qwen 3.5 deliver frontier-competitive performance. This marks the collapse of the performance gap between open and closed-source models, fundamentally shifting AI economics and deployment patterns.
Simon Willison uses Claude Code to explore Servo v0.1.0 Rust crate, building CLI screenshot tool and investigating WebAssembly compilation autonomously. Demonstrates "agentic engineering" workflow where developer tasks AI with discovering library capabilities and building working tools. Evolution from code completion to exploratory development assistance.
🧠 DeepMind 2w ago
★ High Signal
Gemma 4 family (31B Dense, 26B MoE variants) released under Apache 2.0 with 256K context, native vision/audio, and competitive coding ELO jumping from 110 to 2150—a 20x improvement. The 31B model outperforms models 20x larger while enabling agentic skills on edge devices. First open-weights model family combining multimodal input, extended context, and elite coding performance at edge-deployable scale.
Simon Willison identifies November 2025 as the inflection point when AI coding agents crossed from 'mostly works' to 'actually works' with GPT-5.2 and Opus 4.5 releases. Discusses dark factories, automation timelines, agentic engineering, and his transition from traditional software engineering to AI-native development.
Leaked Claude Code source reveals a three-layer memory architecture (including file-read deduplication and structured session memory), dedicated repository navigation tools (Grep, Glob, LSP) instead of relying on model context, and forked subagents for parallelized background analysis. Demonstrates that coding agent performance stems from careful harness engineering around the model rather than just model intelligence alone.
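The file-read deduplication layer can be sketched in a few lines (an illustrative reconstruction keyed on modification time; the leaked implementation's details are not reproduced here): re-reading an unchanged file returns a short marker instead of the full contents, so repeated reads do not burn context tokens.

```python
import os

class DedupReader:
    """Return file contents on first read, a short marker on unchanged re-reads."""

    def __init__(self):
        self.seen = {}  # path -> mtime at last full read

    def read(self, path):
        mtime = os.path.getmtime(path)
        if self.seen.get(path) == mtime:
            # File unchanged since last read: emit a cheap placeholder
            # rather than repeating the full contents into the context.
            return f"<file {path} unchanged since last read>"
        self.seen[path] = mtime
        with open(path) as f:
            return f.read()
```

Long agent sessions re-read the same files constantly, so a layer like this is one of the cheapest harness-engineering wins available.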
Simon Willison predicts 2026 as inflection point where LLM code quality becomes undeniable, driven by reasoning models trained with RL specifically for code. Also forecasts 2026 as year of solving code sandboxing via containers and WebAssembly, addressing security risks and prompt injection vulnerabilities from executing untrusted LLM-generated code. Critical for safe agentic workflows.