OpenAI Codex expanded beyond coding to include computer use, web workflows, image generation, memory, and automations. The updated developer app adds PR reviews, multi-file/terminal viewing, SSH devbox connections, and in-app browsing, serving 3+ million developers weekly.
A locally running world model, trained to run on iPad, turns arbitrary photos and drawings into controllable driving gameplay. The experimental game demonstrates on-device world-model inference for interactive applications, though current output quality remains imperfect.
Production LLM deployments span automated bureaucracy monitoring (extracting structured data from German government sites), multi-agent sales automation with 8 sub-agents and critic loops, and corporate knowledge RAG using Qdrant+LlamaIndex. Key insight: LLMs make it feasible to process unstructured data at a scale that was previously impossible.
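For the RAG component, a minimal sketch of a Qdrant-backed LlamaIndex pipeline of the kind described might look like this; the collection name, document directory, and query are placeholders, the deployment's actual configuration is not shown, and the defaults assume an embedding model and LLM are already configured (e.g. via an OpenAI key or local models):

```python
import qdrant_client
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Local on-disk Qdrant instance; swap for a server URL in production.
client = qdrant_client.QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore(client=client, collection_name="corporate_kb")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Ingest the knowledge base and build a vector index backed by Qdrant.
documents = SimpleDirectoryReader("./company_docs").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Retrieve the top-k chunks and let the configured LLM synthesize an answer.
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What is our travel reimbursement policy?"))
```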
Hugging Face tutorial on building a fast multilingual OCR model using synthetic data generation. Demonstrates techniques for creating training data without manual annotation. Practical guide for scaling OCR across multiple languages efficiently.
MARCH emulates the professional hierarchy of radiology departments using a multi-agent framework with specialized roles: a Resident Agent for initial drafting, Fellow Agents for retrieval-augmented revision, and an Attending Agent orchestrating iterative consensus. The approach addresses clinical hallucinations and lack of verification in automated 3D CT report generation by mimicking collaborative clinical workflows.
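A hypothetical skeleton of that resident-fellow-attending loop, with stub LLM calls, invented prompts, and a dummy retriever standing in for the paper's actual agents:

```python
def call_llm(role, prompt):
    return f"[{role}] " + prompt[:60]  # replace with a real VLM/LLM call

def retrieve_guidelines(draft):
    return ["Fleischner nodule guideline", "LI-RADS criteria"]  # illustrative retrieval

def generate_report(ct_findings, max_rounds=3):
    # Resident drafts, fellows revise with retrieved evidence, attending reconciles.
    draft = call_llm("resident", f"Draft a CT report for: {ct_findings}")
    for _ in range(max_rounds):
        revisions = [call_llm("fellow", f"Revise the draft using {g}: {draft}")
                     for g in retrieve_guidelines(draft)]
        draft = call_llm("attending", "Reconcile into one report: " + " | ".join(revisions))
    return draft

print(generate_report("6 mm right upper lobe nodule"))
```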
Veritas-RPM uses a five-layer multi-agent architecture (ground-truth assembly, anomaly detection, specialist routing, domain specialists, and conflict resolution) to suppress false positives in remote patient monitoring. Evaluated on 530 synthetic patient epochs across 98 documented false-positive scenarios, it reports True Suppression Rate, False Escalation Rate, and Indeterminate Rate metrics.
Prototype-Grounded Concept Models ground learned concepts in visual prototypes (image parts) to verify whether concepts align with human intent, enabling direct inspection and targeted human intervention. Matches CBM predictive performance while substantially improving transparency and intervenability through explicit concept evidence.
AEGIS addresses catastrophic forgetting when fine-tuning vision-language models for robotic control by preventing cross-modal gradient asymmetry—high-magnitude continuous action gradients overwriting the VLM's cross-entropy pre-trained manifold. Uses anchor-enforced gradient isolation to preserve VQA capabilities while injecting flow-matching action supervision, unlike stop-gradient or LoRA approaches.
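One simple way to read "anchor-enforced gradient isolation" is that the continuous action loss is kept from back-propagating into the VLM backbone while an anchor VQA loss still updates it. The sketch below illustrates that reading with toy modules and losses; it is an assumption about the mechanism, not the AEGIS implementation:

```python
import torch
import torch.nn as nn

backbone = nn.Linear(128, 64)     # stand-in for the pretrained VLM backbone
vqa_head = nn.Linear(64, 10)      # anchor task head trained with cross-entropy
action_head = nn.Linear(64, 7)    # continuous action head (flow-matching-style target)

x = torch.randn(8, 128)
vqa_labels = torch.randint(0, 10, (8,))
actions = torch.randn(8, 7)

feats = backbone(x)
vqa_loss = nn.functional.cross_entropy(vqa_head(feats), vqa_labels)
# Detaching isolates the high-magnitude regression gradients from the backbone.
action_loss = nn.functional.mse_loss(action_head(feats.detach()), actions)

(vqa_loss + action_loss).backward()
print(backbone.weight.grad.norm())  # shaped only by the anchor VQA loss
```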
Chain-of-Thought prompting consistently degrades performance in visual spatial reasoning tasks across seventeen multimodal models and thirteen benchmarks. Novel No-Image++ ablation reveals MRMs hallucinate visual details from textual priors even when images are absent, indicating severe shortcut learning in CoT-prompted vision-language models.
Qwen3.6-35B model successfully builds a complete tower defense game with autonomous bug detection and fixing using MCP screenshot verification. User reports the model identified rendering issues and wave completion bugs independently during development. Demonstrates strong multimodal code generation capabilities with visual feedback integration.
AST is a training-free speech editing framework using pre-trained autoregressive TTS models with Latent Recomposition to precisely edit speech segments while preserving speaker identity and acoustic context. Eliminates trade-offs between editing quality and consistency by selectively stitching preserved and synthesized segments without task-specific training.
Mind's Eye benchmark evaluates MLLMs on eight visuo-cognitive tasks inspired by human intelligence tests, organized under Abstraction-Relation-Transformation taxonomy. Humans achieve 80% accuracy while top MLLMs remain below 50%, revealing failures in visual attention, pattern induction, and mental transformation—core processes of fluid intelligence.
AstroVLM is a multi-agent VLM system for diagnosing quality issues in astronomical imaging by handling complex underlying correlations across multidisciplinary subtasks. It addresses the time-intensive manual effort NASA and expert astronomers invest in quality diagnosis and error localization during the imaging process.
SocialGrid is an Among Us-inspired benchmark evaluating LLM agents on planning, task execution, and social reasoning in embodied multi-agent settings. Even GPT-OSS-120B achieves below 60% accuracy, with agents stuck in repetitive behaviors—revealing social reasoning remains a bottleneck even with planning assistance.
ReactBench reveals fundamental limitations in MLLMs' structural reasoning by testing them on chemical reaction diagrams with branching paths, converging flows, and cyclic dependencies. Existing models degrade sharply on topological structures despite excelling at individual visual elements, exposing a gap that semantic-focused benchmarks miss.
UniEditBench provides the first unified benchmark for image and video editing across reconstruction-based and instruction-driven methods, with taxonomies covering 9 image and 8 video operations. Uses distilled MLLMs as cost-effective automatic evaluators that align with human preference, addressing fragmentation in visual editing evaluation.
MM-WebAgent uses hierarchical planning and iterative self-reflection to coordinate AIGC tools for webpage generation, jointly optimizing layout, multimodal content, and integration. Solves style inconsistency problems in prior approaches that generate visual elements independently, introducing a new multimodal webpage generation benchmark.
LLMs and VLMs can perform viewpoint rotation understanding tasks using only text descriptions, without visual input. The study investigates how models infer final viewpoints and predict observations after textual descriptions of rotations, examining whether linguistic intelligence alone enables spatial reasoning. Uses interpretability methods to understand the internal mechanisms enabling this capability.
Vision-language models struggle to recognize human emotions, underperforming even specialized vision-only classifiers despite progress on other visual tasks. The study identifies two critical vulnerabilities: long-tailed emotion dataset distributions exacerbated by web-scale pretraining, and challenges with continuous dynamic facial expression recognition. Reveals fundamental gap in VLM emotional understanding capabilities.
Blue's Data Intelligence Layer orchestrates agents across multi-source, multi-modal data beyond single-database NL2SQL. Addresses iterative queries, heterogeneous data sources, and external knowledge requirements in enterprise compound AI systems.
RadAgent generates chest CT reports through stepwise tool use with fully inspectable reasoning traces for clinical validation. The tool-augmented agent improves over the 3D VLM baseline CT-Chat across three evaluation dimensions: clinical accuracy, groundedness, and radiologist efficiency.
IRS framework decomposes humor understanding into three structured components: identifying visual incongruities, constructing coherent reinterpretations, and aligning with human preference judgments. Applies incongruity-resolution theory to the New Yorker Cartoon Caption Contest, moving beyond black-box prediction to explicit reasoning processes. Demonstrates that humor comprehension requires getting both the answer and the underlying reasoning correct.
VisPCO formulates visual token pruning as a Pareto optimization problem to automatically find optimal computation-performance configurations for vision-language models. Uses continuous relaxation and gradient-based search via Augmented Lagrangian to approximate the empirical Pareto frontier across 8 visual benchmarks.
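The constrained search can be read as a textbook augmented-Lagrangian problem: minimize compute subject to a performance budget, then sweep the budget to trace the frontier. The toy below uses a made-up one-dimensional token-keep ratio and stand-in cost/accuracy functions, not the paper's formulation:

```python
import numpy as np

def compute_cost(k):     # more tokens kept -> more compute
    return k

def accuracy_loss(k):    # fewer tokens kept -> more accuracy loss (made up)
    return (1.0 - k) ** 2

eps, rho, lam, lr, k = 0.04, 10.0, 0.0, 0.05, 0.5   # eps is the performance budget
for _ in range(30):                                  # dual (multiplier) updates
    for _ in range(100):                             # inner primal minimization
        g = accuracy_loss(k) - eps                   # constraint violation, want g <= 0
        # gradient of cost + lam*g + (rho/2)*max(g, 0)^2; 1.0 = d(compute_cost)/dk
        grad = 1.0 + (lam + rho * max(g, 0.0)) * (-2.0 * (1.0 - k))
        k = float(np.clip(k - lr * grad, 0.0, 1.0))
    lam = max(0.0, lam + rho * (accuracy_loss(k) - eps))

print(f"keep ratio ~ {k:.2f}, accuracy loss {accuracy_loss(k):.3f} vs budget {eps}")
```

Sweeping eps and re-solving yields one configuration per budget, which is how a single-constraint solver can approximate an empirical Pareto frontier.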
OpenMobile is an open-source framework for synthesizing high-quality mobile agent task instructions and trajectories, achieving nearly 70% success on AndroidWorld. Features scalable task synthesis using global environment memory and policy-switching strategy alternating between learner and expert models during trajectory rollout. Makes training recipes transparent unlike closed leading models.
ProVoice-Bench introduces the first evaluation framework for proactive voice agents with 1,182 samples across four tasks measuring intervention and monitoring capabilities. State-of-the-art multimodal LLMs show significant performance gaps particularly in over-triggering and reasoning, revealing limitations in current proactive agent paradigms.
Qwen3.6-35B-A3B is a sparse MoE model with 35B total parameters and 3B active, released under Apache 2.0. The model matches agentic coding performance of models 10x its active size and includes multimodal perception with thinking and non-thinking modes.
ConfGuide improves learning-to-guide systems by using conformal risk control to select outcome sets with guaranteed false negative rates, generating more succinct textual guidance. Unlike existing approaches that compound all possible outcomes into dense text, this method provides targeted guidance that reduces cognitive load. Keeps humans responsible for final decisions while making AI assistance more digestible.
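Controlling a false-negative rate this way comes down to calibrating a score threshold on held-out outcomes. The simplified sketch below uses synthetic scores and illustrative outcome labels; a full conformal risk control implementation adds the finite-sample correction that yields the formal guarantee:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical calibration set: model scores for outcomes that actually occurred.
cal_scores = rng.beta(5, 2, size=500)
alpha = 0.1  # target false-negative rate for the guidance set

# Keep any candidate outcome scoring above the empirical alpha-quantile, so that
# roughly a (1 - alpha) fraction of true outcomes make it into the guidance set.
threshold = np.quantile(cal_scores, alpha)

def guidance_set(candidates):
    """candidates: list of (outcome_label, score) pairs for a new case."""
    return [label for label, score in candidates if score >= threshold]

print(guidance_set([("approve", 0.92), ("escalate", 0.41), ("defer", 0.78)]))
```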
OpenAI's Codex app for macOS and Windows now includes computer use capabilities, in-app browsing, image generation, memory, and plugins. The update transforms Codex from a code-focused assistant into a multi-capability developer productivity platform.
Anthropic launched Claude Design, a multimodal collaboration product that generates visual outputs including designs, prototypes, and slides alongside Opus 4.7. Expands Claude beyond text into integrated design workflows, competing with specialized design-focused AI tools. Available through Anthropic Labs for Opus 4.7 users.
Tutorial on training and fine-tuning multimodal embedding and reranker models using Sentence Transformers framework. Covers practical implementation for combining text and visual modalities in retrieval tasks.
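As a minimal illustration of the multimodal-embedding side, a public CLIP checkpoint in Sentence Transformers embeds images and text into a shared space; the model name and file path below are examples, not necessarily what the tutorial uses:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP-based checkpoint that maps images and text into one embedding space.
model = SentenceTransformer("clip-ViT-B-32")

img_emb = model.encode(Image.open("product_photo.jpg"))
txt_emb = model.encode(["a red running shoe", "a wooden dining table"])

# Cosine similarities between the image and each candidate caption.
print(util.cos_sim(img_emb, txt_emb))
```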
OpenAI will shut down the Sora app on April 26, 2026, and the API on September 24, marking a rare product retreat as competition from Veo 3.1, Kling 3.0, and open alternatives commoditized video generation faster than expected. The shutdown signals Sora's economics became untenable in an increasingly crowded market.
Qwen3.6-35B-A3B running locally outperformed Claude Opus 4.7 on an SVG pelican generation task, demonstrating the narrowing capability gap between quantized open-weight models and proprietary APIs for specific visual generation benchmarks. The comparison highlights increasing viability of local inference despite not reflecting overall model capability.
Hugging Face transformers adds support for Mistral 4 (119B MoE with 128 experts unifying Instruct, Reasoning, and Devstral), Jina Embeddings v3, and multiple OCR/video models including VidEoMT, UVDoc, and PI0 robotics VLA. Includes quantization, tokenization, and caching speedups with breaking changes.
DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.
MM-WebAgent is a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. Jointly optimizes global layout, local multimodal content, and their integration to produce coherent and visually consistent webpages, addressing style inconsistency in isolated element generation.
RadAgent is a tool-using AI agent for chest CT interpretation that generates reports through a stepwise, interpretable process with fully inspectable traces of intermediate decisions and tool interactions. Improves on CT-Chat VLM baseline across three dimensions while allowing clinicians to examine how findings are derived rather than being passive observers.
UniDoc-RL uses reinforcement learning to unify retrieval, reranking, and visual perception in a single LVLM agent with hierarchical actions. The model progressively refines evidence from document-level retrieval to region-level cropping, enabling fine-grained visual semantics for complex reasoning tasks.
Switch-KD proposes a visual-switch distillation framework unifying vision-language knowledge transfer by addressing modality-specific supervision inconsistencies in VLM knowledge distillation. Current KD methods supervise modalities separately without explicitly addressing multimodal alignment, leading to inconsistent knowledge transfer. The approach enables efficient VLM deployment in resource-constrained scenarios.
Google launches native Gemini app for macOS, bringing multimodal AI assistant directly to Mac desktop. Expands platform availability beyond web and mobile interfaces.
Coverage of Gemini 3.1 Flash's text-to-speech capabilities and performance characteristics.
DeepMind's Gemini 3.1 Flash TTS introduces granular audio tags for precise control over expressive speech synthesis. Enables directing AI-generated voice with fine-grained attributes for natural, controllable audio generation.
WorldSeed is a simulation engine where AI agents live autonomously with physical rules and information asymmetry. Scenarios defined in YAML allow emergent multi-agent storytelling with any agent framework.
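A hypothetical example of what such a YAML scenario could look like, loaded from Python; the schema and field names are invented for illustration, since the actual WorldSeed format is not shown here:

```python
import yaml  # pip install pyyaml

scenario = yaml.safe_load("""
world:
  physics: {day_length_minutes: 20, stamina_drain: 0.1}
agents:
  - name: innkeeper
    knows: [hidden_cellar]          # information asymmetry: only this agent knows
    goal: protect the cellar
  - name: traveler
    knows: []
    goal: find shelter before nightfall
""")
print(scenario["agents"][0]["knows"])
```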
Source-available AI gateway from 35m.ai supporting unified access to text, image, video, audio, and music generation APIs with intelligent multi-provider routing and hybrid BYOK (bring-your-own-key) workflows. Optimizes compute utilization across heterogeneous provider backends.
Gemini Robotics-ER 1.6 specialized reasoning model for physical AI achieves 93% success on instrument reading tasks (up from 23% baseline) through agentic vision combining visual reasoning with code execution. It adds spatial reasoning, multi-view perception, and industrial gauge interpretation as a high-level planning layer for vision-language-action robotics models.
Google DeepMind released Gemini Robotics-ER 1.6, a robotics reasoning model with improved spatial reasoning, multi-view perception, instrument reading, and hazard detection (+6% text, +10% video safety). Available via Gemini API with Boston Dynamics deploying it for autonomous Spot robot operations.
OpenAI Codex expands from coding to full computer use with web workflows, multi-step planning, autonomous actions, and audio-visual processing for 3M+ weekly developers. Now handles PR reviews, multiple file/terminal views, SSH connections, and in-app browsing. Shift from code generation tool to general-purpose computer control agent.
GPT Image 2 rolled out with near-perfect text rendering in images, solving major AI generation weakness. Shows improved prompt adherence and realistic details. Discovered through anonymous "tape" codenames on Arena AI before official announcement.
Boston Dynamics integrated Gemini and Gemini Robotics-ER 1.6 into Spot's Orbit AIVI systems, enabling robots to perform complex reasoning about industrial environments, identify hazards, and read instruments. The Gemini-powered AIVI-Learning system is now live for existing customers as of April 15, 2026.
Moonlake builds action-conditioned world models for game development, debating abstraction versus the bitter lesson and whether code engines beat learned priors. Explores diffusion scaling limits and the boundary between symbolic and diffusion approaches. Represents the world-model frontier beyond LLMs, with implications for spatial audio and multimodal latents.
Mistral's Voxtral is a 4B-parameter multilingual TTS model supporting 9 languages with emotionally expressive generation, low-latency streaming, and custom voice adaptation. Available via Mistral Studio and API, it targets enterprise voice agent workflows with focus on natural rhythm and cultural authenticity.
HY-World 2.0 generates navigable 3D Gaussian Splatting scenes from text, single images, multi-view images, or videos through a four-stage pipeline including panorama generation, trajectory planning, world expansion, and composition. The framework advances 3D world reconstruction and generation with improved panorama fidelity and 3D scene understanding capabilities.
HiVLA decouples VLM semantic planning from motor control to preserve reasoning capabilities lost in end-to-end VLA fine-tuning. VLM planner generates subtask instructions with target bounding boxes, then flow-matching DiT translates grounded plans to physical actions for robotic manipulation.
MLLMs underutilize visual information during instruction tuning because many tasks can be solved with language priors alone. This method augments visual instruction tuning with self-supervised tasks (rotation prediction, color matching, cross-view correspondence) reformulated as natural language instructions. Improves fine-grained visual reasoning without increasing model size.
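The reformulation can be illustrated with the rotation-prediction pretext task; in the sketch below the field names and prompt wording are illustrative rather than the paper's exact format:

```python
import random
from PIL import Image

def make_rotation_sample(image_path):
    angle = random.choice([0, 90, 180, 270])
    img = Image.open(image_path).rotate(angle, expand=True)  # PIL rotates counter-clockwise
    return {
        "image": img,
        "instruction": "By how many degrees has this image been rotated clockwise "
                       "from its natural orientation? Answer with 0, 90, 180, or 270.",
        "answer": str((360 - angle) % 360),
    }

sample = make_rotation_sample("photo.jpg")
print(sample["instruction"], "->", sample["answer"])
```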
VCR-Agent is a multi-agent framework that generates mechanistic action graphs to represent biological reasoning in virtual cells, enabling verification and falsification of LLM-generated explanations. The approach releases VC-TRACES, a dataset of verified biological mechanisms, addressing the challenge of factually grounded scientific explanations from LLMs in open-ended domains like biology.
Gemini Robotics-ER 1.6 enhances spatial reasoning and multi-view understanding for autonomous robotics tasks. Focuses on embodied reasoning capabilities for real-world robot control.
Minimax M2.7 generates functional 3D GTA-style web experiences with minimal prompting, running at extreme IQ2_XXS quantization while maintaining coherence. Competes with GLM-5 on coding benchmarks for interactive 3D applications, though GLM-5 produces more aesthetically detailed outputs without explicit instruction.
Simon Willison demonstrates running Gemma 4 audio models locally using MLX on Apple Silicon, enabling on-device audio understanding and generation.
Meta Muse Spark marks Meta's pivot from open-source to proprietary models, featuring multimodal perception, parallel subagent execution, and a contemplating mode. Built by Meta Superintelligence Labs, it offers competitive vision and language performance but lags in coding, representing Meta's first paid API model after Llama 4's poor reception.
Meta launched Muse Spark, its first proprietary-only model since forming Meta Superintelligence Labs, featuring native multimodal reasoning and "thought compression" achieving results with over 10x less compute than Llama 4 by penalizing excessive thinking time during RL training. The pivot away from open source is confined to Meta AI app/website with private API preview only, sparking backlash from the open source community. Meta refused to clarify whether Llama development has ended.
Gemma 4 family (31B Dense, 26B MoE variants) released under Apache 2.0 with 256K context, native vision/audio, and competitive coding ELO jumping from 110 to 2150—a 20x improvement. The 31B model outperforms models 20x larger while enabling agentic skills on edge devices. First open-weights model family combining multimodal input, extended context, and elite coding performance at edge-deployable scale.
Mistral's Voxtral uses flow matching for text-to-speech, expanding beyond text into multimodal audio. Discusses enterprise deployment and open source philosophy for audio models. Represents shift in how TTS will be productized and what "open" means for audio.
Interspeech 2026 challenge shifts audio AI evaluation from result-oriented to process-oriented reasoning quality using instance-level rubric-based evaluation. Champion agent integrated 40+ specialized audio tools achieving 69.83% Rubrics score. Emphasizes transparent reasoning over black-box performance metrics.