🍡 feedmeAI
Multimodal (62 items)

Everything Multimodal

🟢 OpenAI 1d ago

OpenAI Codex Major Update

OpenAI Codex expanded beyond coding to include computer use, web workflows, image generation, memory, and automations. The updated developer app adds PR reviews, multi-file/terminal viewing, SSH devbox connections, and in-app browsing, serving 3+ million developers weekly.

📑 arXiv 2d ago

MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

MARCH emulates the professional hierarchy of radiology departments using a multi-agent framework with specialized roles: a Resident Agent for initial drafting, Fellow Agents for retrieval-augmented revision, and an Attending Agent orchestrating iterative consensus. The approach addresses clinical hallucinations and lack of verification in automated 3D CT report generation by mimicking collaborative clinical workflows.
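The draft, revise, and adjudicate loop described above can be sketched as follows; the agent interfaces are illustrative assumptions for exposition, not the paper's implementation.

```python
# Schematic sketch of MARCH's hierarchical workflow. The callables
# resident, fellows, and attending stand in for the Resident, Fellow,
# and Attending agents; their signatures are assumptions.

def march_report(ct_volume, resident, fellows, attending, max_rounds=3):
    draft = resident(ct_volume)  # Resident Agent: initial drafting
    for _ in range(max_rounds):
        # Fellow Agents: retrieval-augmented revisions of the draft
        revisions = [fellow(ct_volume, draft) for fellow in fellows]
        # Attending Agent: merge revisions and decide whether consensus holds
        draft, consensus = attending(draft, revisions)
        if consensus:
            break
    return draft

# Toy stand-ins showing only the control flow
report = march_report(
    "ct-001",
    resident=lambda ct: "draft",
    fellows=[lambda ct, d: d + "+evidence"],
    attending=lambda d, revs: (revs[0], True),
)
print(report)  # draft+evidence
```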

📑 arXiv 2d ago

Veritas-RPM: Provenance-Guided Multi-Agent False Positive Suppression for Remote Patient Monitoring

Veritas-RPM uses a five-layer multi-agent architecture (ground-truth assembly, anomaly detection, specialist routing, domain specialists, and conflict resolution) to suppress false positives in remote patient monitoring. Evaluated on 530 synthetic patient epochs across 98 documented false-positive scenarios, it reports True Suppression Rate, False Escalation Rate, and Indeterminate Rate metrics.
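The three reported metrics can be computed from the conflict-resolution layer's outcomes on known false-positive alerts; the exact definitions below are assumptions inferred from the metric names, not the paper's formulas.

```python
# Hypothetical sketch of True Suppression Rate, False Escalation Rate,
# and Indeterminate Rate over labeled false-positive alerts.

def monitoring_metrics(outcomes):
    """outcomes: labels the system assigned to known false-positive
    alerts: 'suppressed', 'escalated', or 'indeterminate'."""
    n = len(outcomes)
    tsr = outcomes.count("suppressed") / n     # false positives correctly suppressed
    fer = outcomes.count("escalated") / n      # false positives wrongly escalated
    ir = outcomes.count("indeterminate") / n   # alerts the system could not resolve
    return tsr, fer, ir

print(monitoring_metrics(["suppressed"] * 8 + ["escalated"] + ["indeterminate"]))
# (0.8, 0.1, 0.1)
```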

📑 arXiv 2d ago

AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning

AEGIS addresses catastrophic forgetting when fine-tuning vision-language models for robotic control by preventing cross-modal gradient asymmetry—high-magnitude continuous action gradients overwriting the VLM's cross-entropy pre-trained manifold. Uses anchor-enforced gradient isolation to preserve VQA capabilities while injecting flow-matching action supervision, unlike stop-gradient or LoRA approaches.

💬 Reddit 2d ago

Qwen3.6. This is it.

Qwen3.6-35B model successfully builds a complete tower defense game with autonomous bug detection and fixing using MCP screenshot verification. User reports the model identified rendering issues and wave completion bugs independently during development. Demonstrates strong multimodal code generation capabilities with visual feedback integration.

📑 arXiv 2d ago

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Mind's Eye benchmark evaluates MLLMs on eight visuo-cognitive tasks inspired by human intelligence tests, organized under Abstraction-Relation-Transformation taxonomy. Humans achieve 80% accuracy while top MLLMs remain below 50%, revealing failures in visual attention, pattern induction, and mental transformation—core processes of fluid intelligence.

📑 arXiv 3d ago

How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

LLMs and VLMs can perform viewpoint rotation understanding tasks using only text descriptions, without visual input. The study investigates how models infer final viewpoints and predict observations after textual descriptions of rotations, examining whether linguistic intelligence alone enables spatial reasoning. Uses interpretability methods to understand the internal mechanisms enabling this capability.

📑 arXiv 3d ago

Why Do Vision Language Models Struggle To Recognize Human Emotions?

Vision-language models struggle to recognize human emotions, underperforming even specialized vision-only classifiers despite progress on other visual tasks. The study identifies two critical vulnerabilities: long-tailed emotion dataset distributions exacerbated by web-scale pretraining, and challenges with continuous dynamic facial expression recognition. Reveals fundamental gap in VLM emotional understanding capabilities.

📑 arXiv 3d ago

Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

IRS framework decomposes humor understanding into three structured components: identifying visual incongruities, constructing coherent reinterpretations, and aligning with human preference judgments. Applies incongruity-resolution theory to the New Yorker Cartoon Caption Contest, moving beyond black-box prediction to explicit reasoning processes. Demonstrates that humor comprehension requires getting both the answer and the underlying reasoning correct.

📑 arXiv 3d ago

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

OpenMobile is an open-source framework for synthesizing high-quality mobile agent task instructions and trajectories, achieving nearly 70% success on AndroidWorld. Features scalable task synthesis using global environment memory and policy-switching strategy alternating between learner and expert models during trajectory rollout. Makes training recipes transparent unlike closed leading models.

💬 Reddit 3d ago

Qwen3.6-35B-A3B released!

Qwen3.6-35B-A3B is a sparse MoE model with 35B total parameters and 3B active, released under Apache 2.0. The model matches agentic coding performance of models 10x its active size and includes multimodal perception with thinking and non-thinking modes.

📑 arXiv 3d ago

Hybrid Decision Making via Conformal VLM-generated Guidance

ConfGuide improves learning-to-guide systems by using conformal risk control to select outcome sets with guaranteed false negative rates, generating more succinct textual guidance. Unlike existing approaches that pack every possible outcome into dense text, it provides targeted guidance that reduces cognitive load, keeping humans responsible for final decisions while making AI assistance more digestible.
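A minimal sketch of the conformal step in this spirit: calibrate a score threshold so that the false negative rate of the reported outcome set is bounded. The data layout and helper names are illustrative assumptions, not ConfGuide's code.

```python
# Conformal calibration sketch: pick the largest score threshold t such
# that true outcomes falling below t (false negatives) stay within the
# target rate alpha, using the standard (n+1) conformal correction.

def calibrate_threshold(cal_scores, cal_labels, alpha):
    n = len(cal_scores)
    # Candidate thresholds: observed scores of the true outcomes.
    for t in sorted({s for s, y in zip(cal_scores, cal_labels) if y}, reverse=True):
        misses = sum(1 for s, y in zip(cal_scores, cal_labels) if y and s < t)
        if (misses + 1) / (n + 1) <= alpha:  # conformal FNR bound
            return t
    return float("-inf")  # include everything if no threshold qualifies

def outcome_set(scores, t):
    """Indices of outcomes that clear the calibrated threshold,
    forming the succinct guidance set."""
    return [i for i, s in enumerate(scores) if s >= t]

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 0]
t = calibrate_threshold(scores, labels, alpha=0.3)
print(t, outcome_set([0.95, 0.6, 0.71], t))  # 0.7 [0, 2]
```

Reporting only the outcomes above the calibrated threshold is what keeps the guidance short while preserving the coverage guarantee.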

🟢 OpenAI 3d ago

Codex for (almost) everything

OpenAI's Codex app for macOS and Windows now includes computer use capabilities, in-app browsing, image generation, memory, and plugins. The update transforms Codex from a code-focused assistant into a multi-capability developer productivity platform.

📝 Blog 4d ago

OpenAI Sora Shutdown: Video Model to Cease Operations

OpenAI will shut down the Sora app on April 26, 2026, and the API on September 24, marking a rare product retreat as competition from Veo 3.1, Kling 3.0, and open alternatives commoditized video generation faster than expected. The shutdown signals Sora's economics became untenable in an increasingly crowded market.

✍️ Simon Willison 4d ago

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Qwen3.6-35B-A3B running locally outperformed Claude Opus 4.7 on an SVG pelican generation task, demonstrating the narrowing capability gap between quantized open-weight models and proprietary APIs for specific visual generation benchmarks. The comparison highlights increasing viability of local inference despite not reflecting overall model capability.

🤗 Hugging Face 4d ago

DR³-Eval: Towards Realistic and Reproducible Deep Research Evaluation

DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.

🤗 Hugging Face 4d ago

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

MM-WebAgent is a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. Jointly optimizes global layout, local multimodal content, and their integration to produce coherent and visually consistent webpages, addressing style inconsistency in isolated element generation.

🤗 Hugging Face 4d ago

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

RadAgent is a tool-using AI agent for chest CT interpretation that generates reports through a stepwise, interpretable process with fully inspectable traces of intermediate decisions and tool interactions. Improves on CT-Chat VLM baseline across three dimensions while allowing clinicians to examine how findings are derived rather than being passive observers.

🤗 Hugging Face 4d ago

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

Switch-KD proposes a visual-switch distillation framework unifying vision-language knowledge transfer by addressing modality-specific supervision inconsistencies in VLM knowledge distillation. Current KD methods supervise modalities separately without explicitly addressing multimodal alignment, leading to inconsistent knowledge transfer. The approach enables efficient VLM deployment in resource-constrained scenarios.

🐙 GitHub 4d ago

guo2001china/35gateway: Source-available AI gateway from 35m.ai for text, image, video, audio, and music. Supports smart multi-provider routing and bring-your-own-key workflows without wasting compute.

Source-available AI gateway from 35m.ai supporting unified access to text, image, video, audio, and music generation APIs with intelligent multi-provider routing and hybrid BYOK (bring-your-own-key) workflows. Optimizes compute utilization across heterogeneous provider backends.

🧠 DeepMind 5d ago

Google DeepMind Gemini Robotics-ER 1.6 for Physical AI

Gemini Robotics-ER 1.6, a specialized reasoning model for physical AI, achieves 93% success on instrument reading tasks (up from a 23% baseline) through agentic vision combining visual reasoning with code execution. It adds spatial reasoning, multi-view perception, and industrial gauge interpretation as a high-level planning layer for vision-language-action robotics models.

🧠 DeepMind 5d ago

Google Gemini Robotics-ER 1.6 Release

Google DeepMind released Gemini Robotics-ER 1.6, a robotics reasoning model with improved spatial reasoning, multi-view perception, instrument reading, and hazard detection (+6% text, +10% video safety). Available via Gemini API with Boston Dynamics deploying it for autonomous Spot robot operations.

🟢 OpenAI 5d ago
★ High Signal

OpenAI Codex Major Update - Expanded Computer Use

OpenAI Codex expands from coding to full computer use with web workflows, multi-step planning, autonomous actions, and audio-visual processing for 3M+ weekly developers. Now handles PR reviews, multiple file/terminal views, SSH connections, and in-app browsing. Shift from code generation tool to general-purpose computer control agent.

📝 Blog 5d ago

Mistral Voxtral TTS Model

Mistral's Voxtral is a 4B-parameter multilingual TTS model supporting 9 languages with emotionally expressive generation, low-latency streaming, and custom voice adaptation. Available via Mistral Studio and API, it targets enterprise voice agent workflows with focus on natural rhythm and cultural authenticity.

🤗 Hugging Face 5d ago

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

HY-World 2.0 generates navigable 3D Gaussian Splatting scenes from text, single images, multi-view images, or videos through a four-stage pipeline including panorama generation, trajectory planning, world expansion, and composition. The framework advances 3D world reconstruction and generation with improved panorama fidelity and 3D scene understanding capabilities.

🤗 Hugging Face 6d ago

Boosting Visual Instruction Tuning with Self-Supervised Guidance

MLLMs underutilize visual information during instruction tuning because many tasks can be solved with language priors alone. This method augments visual instruction tuning with self-supervised tasks (rotation prediction, color matching, cross-view correspondence) reformulated as natural language instructions. Improves fine-grained visual reasoning without increasing model size.
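The reformulation of a pretext task as an instruction can be sketched concretely; the prompt wording, sample format, and grid-based stand-in for an image below are illustrative assumptions.

```python
import random

# Sketch of recasting rotation prediction as a natural-language
# instruction-tuning sample, in the spirit of the method above.

def rotate(grid, angle):
    """Rotate a 2D grid clockwise by a multiple of 90 degrees."""
    for _ in range(angle // 90):
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid

def make_rotation_sample(image, rng=random):
    """Emit an instruction-tuning sample whose label is the angle."""
    angle = rng.choice([0, 90, 180, 270])
    return {
        "image": rotate(image, angle),
        "instruction": ("By how many degrees has this image been rotated? "
                        "Answer with one of: 0, 90, 180, 270."),
        "answer": str(angle),
    }

sample = make_rotation_sample([[1, 2], [3, 4]])
```

Because the supervision comes from the transformation itself, samples like this require no human annotation and cannot be solved from language priors alone.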

🤗 Hugging Face 6d ago

Towards Autonomous Mechanistic Reasoning in Virtual Cells

VCR-Agent is a multi-agent framework that generates mechanistic action graphs to represent biological reasoning in virtual cells, enabling verification and falsification of LLM-generated explanations. The approach releases VC-TRACES, a dataset of verified biological mechanisms, addressing the challenge of factually grounded scientific explanations from LLMs in open-ended domains like biology.

💬 Reddit 6d ago

Local Minimax M2.7, GTA benchmark

Minimax M2.7 generates functional 3D GTA-style web experiences with minimal prompting, running at extreme IQ2_XXS quantization while maintaining coherence. Competes with GLM-5 on coding benchmarks for interactive 3D applications, though GLM-5 produces more aesthetically detailed outputs without explicit instruction.

📝 Blog 2w ago

Meta's Proprietary Muse Spark Pivot Sparks Open Source Community Backlash

Meta launched Muse Spark, its first proprietary-only model since forming Meta Superintelligence Labs, featuring native multimodal reasoning and "thought compression" that achieves results with over 10x less compute than Llama 4 by penalizing excessive thinking time during RL training. The model is confined to the Meta AI app and website, with only a private API preview, and the pivot away from open source has sparked backlash from the open source community. Meta refused to clarify whether Llama development has ended.

🧠 DeepMind 2w ago
★ High Signal

Google Gemma 4 - Open Model Family Release

Gemma 4 family (31B Dense, 26B MoE variants) released under Apache 2.0 with 256K context, native vision/audio, and a competitive-coding Elo jumping from 110 to 2150, a roughly 20x improvement. The 31B model outperforms models 20x larger while enabling agentic skills on edge devices. First open-weights model family combining multimodal input, extended context, and elite coding performance at edge-deployable scale.