Multi-Agent Reflexion uses diverse reasoning personas with a separate judge model to synthesize critiques, improving HotPotQA by 3 points and HumanEval by 6.2 points. It separates acting, diagnosing, critiquing, and aggregating to reduce the shared blind spots of single-agent self-reflection, addressing the systematic limitation where a solo agent repeats its own misconceptions without an external correction signal.
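A minimal sketch of the persona/judge separation described above. All names (`call_model`, the persona list, prompt wording) are illustrative assumptions, not the paper's actual interface; `call_model` is a stub standing in for any LLM client.

```python
# Hypothetical sketch of a multi-persona reflection loop.
# `call_model` is a stub standing in for a real LLM client.

def call_model(prompt: str) -> str:
    return f"[response to: {prompt[:40]}...]"

PERSONAS = ["tester", "security reviewer", "performance engineer"]

def reflect(task: str, attempt: str, rounds: int = 2) -> str:
    for _ in range(rounds):
        # Each persona critiques the attempt independently (diagnosing/critiquing).
        critiques = [
            call_model(f"As a {p}, critique this solution to '{task}':\n{attempt}")
            for p in PERSONAS
        ]
        # A separate judge model aggregates the critiques into one fix list.
        verdict = call_model(
            "Synthesize these critiques into one fix list:\n" + "\n".join(critiques)
        )
        # The actor revises its answer using the aggregated critique.
        attempt = call_model(
            f"Revise the solution using this feedback:\n{verdict}\n"
            f"Task: {task}\nPrevious: {attempt}"
        )
    return attempt
```

The point of the structure is that diagnosis, critique, and aggregation are distinct calls with distinct prompts, so one model's blind spot does not silently propagate through every stage.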
Community discussion on replacing Claude Opus with Qwen-35B-A3B for coding agent workflows on M5 Max hardware. Users weighing Opus's reasoning edge against Qwen's local deployment and cost benefits for daily development tasks.
llama.cpp merged n-gram-based speculative decoding support achieving 0-50% speedup on coding tasks with optimized parameters, though performance varies with prompt repetition patterns and draft acceptance rates. The feature matches trailing n-grams against earlier context to propose draft tokens, with configurable draft token ranges.
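The n-gram matching idea can be sketched in a few lines (this is an illustrative reconstruction of the general technique, not llama.cpp's actual implementation): find the most recent earlier occurrence of the trailing n-gram and propose the tokens that followed it as the draft, which the target model then verifies.

```python
def ngram_draft(tokens, n=3, max_draft=8):
    """Propose draft tokens by matching the trailing n-gram against earlier context.

    If the last `n` tokens occurred earlier in the sequence, the tokens that
    followed that occurrence become the speculative draft; the target model
    verifies the draft and accepts the longest correct prefix. This is why
    speedup depends so heavily on prompt repetition patterns.
    """
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the tail n-gram.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + max_draft]
    return []
```

On highly repetitive text (boilerplate, edits to code just printed) the draft is often fully accepted; on novel text it returns nothing and decoding falls back to plain autoregression, matching the 0-50% speedup range.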
Demonstration of Gemma 4 running entirely in-browser (3.1GB) to generate Excalidraw diagrams from text prompts using E2B. The implementation showcases on-device inference without server requirements. Novel for combining diagram generation with fully client-side LLM execution.
Qwen3.6-35B-A3B running at 8-bit quantization with 64k context matches Claude quality for code tasks on consumer hardware (M5 Max, 128GB). Handles complex multi-step research tasks with many tool calls and maintains performance on long context coding tasks. Enables fully local development workflows without sending code to external providers.
OpenAI Codex expanded beyond coding to include computer use, web workflows, image generation, memory, and automations. The updated developer app adds PR reviews, multi-file/terminal viewing, SSH devbox connections, and in-app browsing, serving 3+ million developers weekly.
Benchmark comparing Claude and Gemini on the laden knight's tour problem, a weighted variant requiring optimal pathfinding with accumulating costs. Tests coding agents on combinatorial optimization task combining movement constraints with dynamic cost calculation.
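One plausible reading of the weighted-knight task (the benchmark's exact cost model is not given here, so this is an assumed formulation): pay each square's cost on entry and find the cheapest knight route, which Dijkstra's algorithm handles directly.

```python
import heapq

KNIGHT_MOVES = [(1, 2), (2, 1), (2, -1), (1, -2),
                (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def cheapest_knight_path(cost, start, goal):
    """Minimum accumulated cost to move a knight from start to goal.

    `cost[r][c]` is the price paid on entering square (r, c) -- one simple
    interpretation of a weighted ("laden") knight's-tour-style instance.
    Dijkstra's algorithm handles pathfinding with accumulating costs.
    """
    rows, cols = len(cost), len(cost[0])
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue  # stale heap entry
        for dr, dc in KNIGHT_MOVES:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None  # goal unreachable
```

The combination the blurb highlights, movement constraints plus dynamic cost, is exactly what trips up agents that pattern-match to the classic (unweighted) knight's tour.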
Qwen3.6-35B-A3B successfully solved coding problems that Qwen3.5-27B couldn't handle, reducing technical debt in a complex budgeting app project. Users report improved code quality and architectural decisions on multi-feature applications.
User benchmark comparing Qwen 3.6 35B against Gemma 4 26B on 30k-line codebase with 37 intentional bugs and PDF analysis tasks shows Qwen significantly outperforming across agentic capabilities, coding, image-to-text, instruction following, and reasoning. Both models tested at Q4_K_XL quantization for fair comparison.
Qwen3.6 with OpenCode successfully implemented row-level security across a multi-service codebase (Rust, TypeScript, Python), demonstrating practical viability for complex code generation tasks. Users report quality comparable to Claude for certain daily-drive use cases despite remaining bugs.
Qwen3.6-35B-A3B represents the first local model practitioners find genuinely competitive with proprietary APIs for code generation, producing usable output for UI XML and embedded C++ with minimal post-generation fixes. This marks a capability threshold where local deployment overhead becomes worthwhile compared to previous iterations requiring extensive manual correction.
Qwen3.6-35B model successfully builds a complete tower defense game with autonomous bug detection and fixing using MCP screenshot verification. User reports the model identified rendering issues and wave completion bugs independently during development. Demonstrates strong multimodal code generation capabilities with visual feedback integration.
LLMSniffer fine-tunes GraphCodeBERT with two-stage supervised contrastive learning to detect AI-generated code, improving accuracy from 70% to 78% on GPTSniffer and 91% to 94.65% on Whodunit. The approach combines comment removal preprocessing with an MLP classifier and produces well-separated embeddings confirmed by t-SNE visualization.
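The supervised contrastive objective at the core of this approach can be written down directly (a plain-Python sketch of the standard SupCon loss, averaged over positive pairs; LLMSniffer's exact staging and hyperparameters are not reproduced here).

```python
import math

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss over unit-normalized embeddings.

    Same-label pairs (e.g. two AI-generated snippets) are pulled together
    and different-label pairs pushed apart -- the training signal that
    produces the well-separated clusters a t-SNE plot confirms.
    """
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    z = [norm(v) for v in embeddings]
    # Temperature-scaled cosine similarity matrix.
    sim = [[sum(a * b for a, b in zip(zi, zj)) / tau for zj in z] for zi in z]
    n = len(labels)
    loss, pairs = 0.0, 0
    for i in range(n):
        # Denominator: all other samples compete in the softmax.
        log_den = math.log(sum(math.exp(sim[i][j]) for j in range(n) if j != i))
        for p in range(n):
            if p != i and labels[p] == labels[i]:
                loss += log_den - sim[i][p]  # -log softmax of the positive
                pairs += 1
    return loss / max(pairs, 1)
```

Cleanly separated embeddings drive the loss toward zero, while mixed-up embeddings keep it high, which is what a downstream MLP classifier then exploits.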
Exposes a critical keyword shortcut bias in code localization benchmarks where models rely on superficial lexical matching rather than structural reasoning. Introduces KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without naming hints, revealing catastrophic performance drops in state-of-the-art approaches and motivating a neurosymbolic framework combining neural retrieval with symbolic verification.
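The "no naming hints" idea can be demonstrated with a small identifier anonymizer (an illustrative sketch, not KA-LogicQuery's actual pipeline; a production version would also skip builtins and imported names): rename every function, argument, and variable to an opaque token, so a localizer that relies on keyword overlap with the issue text gets nothing, while structural reasoning still works.

```python
import ast

class Anonymize(ast.NodeTransformer):
    """Rename identifiers to opaque tokens (v0, v1, ...), removing the
    lexical cues that shortcut-prone code localizers exploit."""

    def __init__(self):
        self.mapping = {}

    def _rename(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name)
        self.generic_visit(node)  # also rewrites args and body names
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node

def strip_naming_hints(source: str) -> str:
    tree = Anonymize().visit(ast.parse(source))
    return ast.unparse(tree)
```

Running a benchmark through a transform like this is a quick way to estimate how much of a localizer's accuracy is lexical matching rather than reasoning.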
Anthropic released Auto mode for Claude Code (Opus 4.7, Max tier) and new "xhigh" effort level between high and max for granular reasoning control. Update includes fullscreen TUI rendering, mobile notifications for Remote Control, and Windows/MCP fixes.
📝 Blog 3d ago
★ High Signal
NVIDIA's Nemotron 3 Super is a 120B/12B-active MoE model with hybrid Mamba-Attention architecture scoring 60.47% on SWE-Bench Verified—the highest open-weight coding score at launch. Features 1M context, 2.2x throughput improvement, and native speculative decoding for efficient agentic reasoning.
Command-line tool claims to accelerate Android app development 3x when used with AI coding agents. Streamlines agent-based mobile development workflows.
Agent-Aided Design systems use LLMs in a feedback loop to write CAD code, compile models, visualize results, and iteratively refine designs, but cannot yet generate complex 3D assemblies with moving parts like pistons or scissors. This work identifies the capability gap preventing these training-free agentic systems from impacting industrial manufacturing. Addresses the transition from static CAD objects to dynamic mechanical assemblies.
QuantCode-Bench provides 400 tasks evaluating LLMs on generating executable algorithmic trading strategies for Backtrader from English descriptions. Unlike standard code benchmarks, it requires domain-specific financial logic, specialized API knowledge, and code that produces actual trades on historical data, with tasks sourced from Reddit, TradingView, and synthetic generators.
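The kind of logic a generated strategy must encode is illustrated below as a framework-free sketch rather than in Backtrader's `Strategy` API (function names here are my own): a moving-average crossover that actually emits trade signals over a price history, which is the bar these tasks set beyond compiling.

```python
def sma(series, window):
    """Simple moving average; None until the window fills."""
    return [sum(series[i - window + 1:i + 1]) / window if i >= window - 1 else None
            for i in range(len(series))]

def crossover_signals(prices, fast=3, slow=5):
    """Emit 'buy' when the fast SMA crosses above the slow SMA, 'sell' on
    the reverse cross, None otherwise. Returns one signal per bar after the first."""
    f, s = sma(prices, fast), sma(prices, slow)
    signals = []
    for i in range(1, len(prices)):
        if None in (f[i], s[i], f[i - 1], s[i - 1]):
            signals.append(None)  # not enough history yet
        elif f[i - 1] <= s[i - 1] and f[i] > s[i]:
            signals.append("buy")
        elif f[i - 1] >= s[i - 1] and f[i] < s[i]:
            signals.append("sell")
        else:
            signals.append(None)
    return signals
```

A benchmark task wraps logic like this in Backtrader's event-driven API and checks that trades actually fire on historical data, which is where purely textual similarity metrics fail.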
Compact "Gene" representation outperforms documentation-oriented "Skill" packages for test-time evolution across 4,590 trials in scientific code tasks. Expanding experience into fuller documentation degrades performance, showing that representation format is a first-order factor in reusable experience.
LLM agents autonomously evolve the ABC logic synthesis codebase by rewriting sub-components while preserving its single-binary execution model. The self-evolving framework operates on the entire integrated codebase and bootstraps using existing open-source synthesis components before iteratively improving through agent-driven code evolution.
Qwen3.6-35B-A3B is a sparse MoE model with 35B total parameters and 3B active, released under Apache 2.0. The model matches agentic coding performance of models 10x its active size and includes multimodal perception with thinking and non-thinking modes.
COEVO unifies functional correctness and PPA (power, performance, area) optimization for LLM-generated RTL code in a single co-evolutionary loop, replacing sequential pipelines that discard partially correct but architecturally promising candidates. Existing methods decouple correctness from PPA and reduce multi-objective optimization to scalar fitness, obscuring trade-offs. COEVO treats correctness as continuous rather than binary, enabling simultaneous optimization of both objectives.
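The key selection difference can be made concrete (a generic Pareto-selection sketch, not COEVO's actual code; the two-objective tuple is an assumed simplification): with continuous correctness and PPA as separate objectives, a candidate survives unless another candidate beats it on every axis.

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one. Objectives here: (correctness, ppa_score),
    both higher-is-better, with correctness continuous (e.g. fraction of
    tests passed) rather than a 0/1 flag."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep every non-dominated candidate. A partially correct but
    architecturally promising design (high PPA, moderate correctness)
    survives to the next generation, unlike scalar-fitness selection
    that would collapse both objectives and discard it."""
    return [c for c in candidates
            if not any(dominates(o["scores"], c["scores"])
                       for o in candidates if o is not c)]
```

Scalarizing the two objectives into one fitness number would force a fixed trade-off up front; keeping the front exposes the trade-off curve to the evolutionary loop instead.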
Agentic framework for RTL timing optimization using LLMs with tool-grounded self-improvement and reusable optimization skills. Evaluated on realistic RTL designs with industrial-grade tools rather than manually degraded toy examples. Moves beyond coarse design-level feedback to fine-grained optimization through learned skills.
UniClaude integrates Claude directly into Unity Editor as a dockable window with full project context awareness and 60+ MCP tools. Eliminates context switching during game development by embedding the AI assistant natively in the IDE. Provides workflow-specific tooling for game developers working in Unity.
OpenAI's Codex app for macOS and Windows now includes computer use capabilities, in-app browsing, image generation, memory, and plugins. The update transforms Codex from a code-focused assistant into a multi-capability developer productivity platform.
🔶 Anthropic 4d ago
★ High Signal
Claude Opus 4.7 delivers 13% improvement on coding benchmarks with enhanced vision for higher-resolution images and new effort controls/task budgets for autonomous development. Powers upgraded Claude Code review tools for long-running software engineering tasks. Introduces task-level resource management for extended autonomous coding workflows.
🔶 Anthropic 4d ago
★ High Signal
Claude Opus 4.7 achieves 87.6% on SWE-bench Verified (13% improvement) with 2x throughput on agentic tasks while maintaining $5/$25 per million token pricing and full 1M context window. The performance gains make it effectively cheaper per task despite unchanged nominal pricing. Higher-resolution vision capabilities included.
GitHub Copilot adding Claude Opus 4.7 with stronger multi-step task performance and more reliable agentic execution. Launches with promotional 7.5× premium request multiplier until April 30th, replacing Opus 4.5 and 4.6 for Copilot Pro+ users.
MM-WebAgent is a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. Jointly optimizes global layout, local multimodal content, and their integration to produce coherent and visually consistent webpages, addressing style inconsistency in isolated element generation.
Gemma 4 26B and E4B models outperform Qwen 3.5 series in local deployment scenarios, replacing a multi-model routing setup that previously used Qwen variants for chat, reasoning, and code generation. Users report better performance despite similar quantization levels, suggesting improved base model capabilities at comparable parameter counts.
RepoWiki is an open-source alternative to DeepWiki that generates comprehensive wiki documentation for codebases from terminal or browser. The tool automates technical documentation creation for software repositories.
ICLR 2025 Oral paper evaluated SQL code generation using natural language similarity metrics instead of execution-based validation, yielding ~20% false positive rate in authors' own testing. Community questions appropriateness of Oral designation given fundamental evaluation methodology flaw. Highlights peer review challenges in code generation benchmarks.
Anthropic redesigned the Claude Code desktop app with a parallel session management sidebar, integrated terminal, in-app file editor, and Routines, automations triggered by schedules, API calls, or GitHub events without an active session. Available for Pro, Max, Team, and Enterprise users on macOS and Windows.
Gemini Robotics-ER 1.6 specialized reasoning model for physical AI achieves 93% success on instrument reading tasks (up from 23% baseline) through agentic vision combining visual reasoning with code execution. It adds spatial reasoning, multi-view perception, and industrial gauge interpretation as a high-level planning layer for vision-language-action robotics models.
🟢 OpenAI 5d ago
★ High Signal
OpenAI Codex expands from coding to full computer use with web workflows, multi-step planning, autonomous actions, and audio-visual processing for 3M+ weekly developers. Now handles PR reviews, multiple file/terminal views, SSH connections, and in-app browsing. Shift from code generation tool to general-purpose computer control agent.
📝 Blog 5d ago
★ High Signal
Claude Code discovered a 23-year-old remotely exploitable heap buffer overflow in Linux kernel's NFS driver, with five vulnerabilities confirmed. Linux maintainers report AI bug reports shifted from "slop to legitimate findings" about a month ago, with valid security reports increasing from 2-3/week to 5-10/day—marking a capability inflection point for AI-assisted vulnerability discovery.
Notion rebuilt Custom Agents 4-5 times before production, revealing early agent attempts failed due to lack of tool-calling standards and short context windows. Their 'Agent Lab' thesis focuses on building product systems around frontier capabilities, with coding agents viewed as the kernel of future 'software factories' comprising spec/code/test/review agents.
HY-World 2.0 generates navigable 3D Gaussian Splatting scenes from text, single images, multi-view images, or videos through a four-stage pipeline including panorama generation, trajectory planning, world expansion, and composition. The framework advances 3D world reconstruction and generation with improved panorama fidelity and 3D scene understanding capabilities.
ddtree-mlx ports tree-based speculative decoding to Apple Silicon with custom Metal kernels, achieving 10-15% speedup over DFlash on code and 1.5x over autoregressive inference. First MLX implementation supporting hybrid model architectures.
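The acceptance step that distinguishes tree-based speculation can be sketched as follows (a greedy-verification simplification of the general technique; ddtree-mlx's actual kernels and any sampling-based acceptance are not reproduced, and `target_next` is a stand-in for the target model).

```python
def accept_path(tree, target_next):
    """Walk a draft token tree, asking the target model at each node which
    token it would emit and descending only into the matching child.

    `tree` maps a prefix tuple to its drafted children,
    e.g. {(): [5, 9], (5,): [2, 7]} -- several branches are drafted so at
    least one path is likely to match. Returns the accepted tokens: the
    longest root-to-leaf path the target agrees with, plus the target's
    own token at the first mismatch (obtained for free during verification).
    """
    accepted = []
    while True:
        children = tree.get(tuple(accepted), [])
        choice = target_next(tuple(accepted))
        if choice in children:
            accepted.append(choice)   # draft verified; extend the path
        else:
            accepted.append(choice)   # target's own token ends speculation
            return accepted
```

Because the whole tree is scored in one batched target pass on real systems, every accepted node is a decoding step saved, which is where the 1.5x over autoregressive inference comes from.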
r/LocalLLaMA consensus ranks Qwen 3.5 most broadly recommended, Gemma 4 showing strong buzz, GLM-5/4.7 near top of rankings, MiniMax M2.5/M2.7 for agentic workloads, DeepSeek V3.2 in top cluster. Qwen3-Coder-Next dominates for local coding. Community-driven practical guidance on deployed models.
Analysis of Claude Code's TypeScript source code and comparison with OpenClaw identifies five core human values (decision authority, safety, reliable execution, capability amplification, contextual adaptability) traced through thirteen design principles to implementation choices. The core architecture is a simple while-loop calling the model, running tools, and returning results—demonstrating how design philosophy shapes agentic system architecture.
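That while-loop core is simple enough to sketch end to end (in Python rather than the original TypeScript; `model`, the message shape, and the `tools` dict are illustrative stand-ins, not Claude Code's actual interfaces).

```python
def agent_loop(model, tools, user_msg, max_turns=25):
    """The while-loop core: send the transcript to the model, execute any
    tool it requests, append the result, repeat until it answers."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        reply = model(messages)
        messages.append(reply)
        if "tool_call" not in reply:
            return reply["content"]          # model produced a final answer
        name, args = reply["tool_call"]
        result = tools[name](**args)         # run the requested tool
        messages.append({"role": "tool", "content": str(result)})
    return None  # turn budget exhausted
```

Everything the analysis traces back to values — decision authority, safety, reliable execution — lives in what surrounds this loop (tool permissions, context management, subagents), not in the loop itself.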
LeanKG focuses on efficient formal theorem proving in Lean, emphasizing token efficiency for mathematical code generation. The title suggests optimization techniques for reducing computational cost while maintaining correctness, targeting formal verification and proof-assistant workflows.
Curated collection of 50+ Claude Code skills, agents, and plugins organized by use case with recommendation ratings. Ready-to-use extensions for Claude-based development workflows.
Minimax M2.7 generates functional 3D GTA-style web experiences with minimal prompting, running at extreme IQ2_XXS quantization while maintaining coherence. Competes with GLM-5 on coding benchmarks for interactive 3D applications, though GLM-5 produces more aesthetically detailed outputs without explicit instruction.
Asynkor provides file leasing coordination for AI agent teams via MCP server, preventing merge conflicts when multiple agents edit code. Works across IDEs without changing agent implementations.
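A minimal in-memory sketch of the leasing idea (illustrative only — Asynkor's actual MCP protocol, method names, and persistence are not reproduced here): an agent must hold a live lease on a path before editing, and leases expire so a crashed agent cannot block teammates forever.

```python
import time

class LeaseCoordinator:
    """Grant exclusive, time-limited edit leases on file paths."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.leases = {}  # path -> (agent_id, expiry_timestamp)

    def acquire(self, path, agent_id, now=None):
        """Return True if agent_id now holds the lease on path."""
        now = time.monotonic() if now is None else now
        holder = self.leases.get(path)
        if holder and holder[0] != agent_id and holder[1] > now:
            return False                       # another agent holds a live lease
        self.leases[path] = (agent_id, now + self.ttl)  # grant or renew
        return True

    def release(self, path, agent_id):
        """Drop the lease early; only the current holder may release."""
        if self.leases.get(path, (None,))[0] == agent_id:
            del self.leases[path]
```

Serving this behind an MCP server is what lets the scheme work across IDEs without modifying the agents themselves: each agent just calls acquire/release tools before and after edits.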
📝 Blog 1w ago
★ High Signal
GLM-5.1 achieves 94.6% of Claude Opus 4.6's coding performance at $3/month under MIT license, while Google's Gemma 4 and Qwen 3.5 deliver frontier-competitive performance. This marks the collapse of the performance gap between open and closed-source models, fundamentally shifting AI economics and deployment patterns.
Simon Willison uses Claude Code to explore Servo v0.1.0 Rust crate, building CLI screenshot tool and investigating WebAssembly compilation autonomously. Demonstrates "agentic engineering" workflow where developer tasks AI with discovering library capabilities and building working tools. Evolution from code completion to exploratory development assistance.
🧠 DeepMind 2w ago
★ High Signal
Gemma 4 family (31B Dense, 26B MoE variants) released under Apache 2.0 with 256K context, native vision/audio, and competitive coding ELO jumping from 110 to 2150—a 20x improvement. The 31B model outperforms models 20x larger while enabling agentic skills on edge devices. First open-weights model family combining multimodal input, extended context, and elite coding performance at edge-deployable scale.
Simon Willison identifies November 2025 as the inflection point when AI coding agents crossed from 'mostly works' to 'actually works' with GPT-5.2 and Opus 4.5 releases. Discusses dark factories, automation timelines, agentic engineering, and his transition from traditional software engineering to AI-native development.
Leaked Claude Code source reveals a three-layer memory architecture (including file-read deduplication and structured session memory), dedicated repository navigation tools (Grep, Glob, LSP) instead of relying on model context, and forked subagents for parallelized background analysis. Demonstrates that coding agent performance stems from careful harness engineering around the model rather than just model intelligence alone.
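The file-read deduplication layer can be sketched in a few lines (an illustrative reconstruction keyed on modification time; the leaked implementation's details are not reproduced here): re-reading an unchanged file returns a short marker instead of the full contents, so repeated reads do not burn context tokens.

```python
import os

class DedupReader:
    """Return file contents on first read, a short marker on unchanged re-reads."""

    def __init__(self):
        self.seen = {}  # path -> mtime at last full read

    def read(self, path):
        mtime = os.path.getmtime(path)
        if self.seen.get(path) == mtime:
            # File unchanged since last read: emit a cheap placeholder
            # rather than repeating the full contents into the context.
            return f"<file {path} unchanged since last read>"
        self.seen[path] = mtime
        with open(path) as f:
            return f.read()
```

Long agent sessions re-read the same files constantly, so a layer like this is one of the cheapest harness-engineering wins available.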
Simon Willison predicts 2026 as inflection point where LLM code quality becomes undeniable, driven by reasoning models trained with RL specifically for code. Also forecasts 2026 as year of solving code sandboxing via containers and WebAssembly, addressing security risks and prompt injection vulnerabilities from executing untrusted LLM-generated code. Critical for safe agentic workflows.