Production LLM deployments span automated bureaucracy monitoring (extracting structured data from German government sites), multi-agent sales automation with 8 sub-agents and critic loops, and corporate knowledge RAG using Qdrant+LlamaIndex. Key insight: LLMs make it practical to process unstructured data at a scale that was previously out of reach.
MARCH emulates the professional hierarchy of radiology departments using a multi-agent framework with specialized roles: a Resident Agent for initial drafting, Fellow Agents for retrieval-augmented revision, and an Attending Agent orchestrating iterative consensus. The approach addresses clinical hallucinations and lack of verification in automated 3D CT report generation by mimicking collaborative clinical workflows.
Case study empirically measures where anonymization should occur in RAG pipelines to balance privacy protection with utility when handling PII and sensitive data. Systematically evaluates placement options (at retrieval, augmentation, or generation stages) to guide RAG administrators in deploying privacy-preserving systems.
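The three placement options the study compares can be sketched in a toy pipeline. This is an illustration of the design space, not the study's actual system: the regex scrubber, the word-overlap retriever, and the `rag_answer` function are all simplified stand-ins.

```python
import re

# Toy PII scrubber; a real deployment would use an NER-based anonymizer.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def scrub(text: str) -> str:
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

def rag_answer(query: str, corpus: list[str], anonymize_at: str = "augmentation") -> str:
    """Illustrates the three anonymization placements the study evaluates.

    anonymize_at: 'retrieval'    -> scrub documents before indexing/retrieval
                  'augmentation' -> scrub retrieved passages before prompting
                  'generation'   -> scrub the model's final output
    """
    docs = [scrub(d) for d in corpus] if anonymize_at == "retrieval" else list(corpus)
    # Toy retrieval: rank by word overlap with the query.
    q_words = set(query.lower().split())
    retrieved = max(docs, key=lambda d: len(q_words & set(d.lower().split())))
    if anonymize_at == "augmentation":
        retrieved = scrub(retrieved)
    answer = f"Based on context: {retrieved}"  # stand-in for the LLM call
    if anonymize_at == "generation":
        answer = scrub(answer)
    return answer
```

Earlier placement protects more components (the index, the prompt, the model provider) but can degrade retrieval quality, which is exactly the privacy/utility trade-off the study measures.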
Survey categorizing graph-LLM integration methods by purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, causal graphs), and integration strategy (prompting, augmentation, training, agent-based). Provides clarity on when and what types of graph representations enhance LLM capabilities.
RAGognizer uses token-level hallucination annotations from real RAG outputs as a direct training signal, integrating a detection head during fine-tuning rather than treating hallucination detection as post-hoc. The approach trains models to recognize when generated content is unsupported by retrieved context, addressing closed-domain hallucinations in retrieval-augmented generation.
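The shape of such a token-level training signal can be sketched as follows. The lexical-membership heuristic here is purely illustrative and is not RAGognizer's annotation method; real labels would come from human or model-based annotation of actual RAG outputs.

```python
def token_support_labels(generated_tokens: list[str], context: str) -> list[int]:
    """Toy token-level hallucination labels: 1 = unsupported by context.

    Simple lexical membership stands in for real annotations, just to show
    the per-token label sequence a detection head would be trained against.
    """
    context_vocab = set(context.lower().split())
    return [0 if tok.lower().strip(".,") in context_vocab else 1
            for tok in generated_tokens]

context = "The Eiffel Tower is 330 metres tall"
generated = "The Eiffel Tower is 350 metres tall".split()
labels = token_support_labels(generated, context)
# labels -> [0, 0, 0, 0, 1, 0, 0]  (only "350" is unsupported)
```

A detection head trained on labels like these can flag exactly which spans of an answer lack grounding, rather than rejecting whole responses post hoc.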
Experience Compression Spectrum unifies agent memory, skills, and rules as points along a compression axis (5-20× for memory, 50-500× for skills, 1000×+ for rules). Framework addresses the critical bottleneck of managing accumulated experience in long-horizon, multi-session LLM agent deployments by reducing context consumption and retrieval latency.
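Using the band thresholds quoted in the summary, classifying a distilled artifact by its compression ratio is straightforward; this helper is only a sketch of that idea, not part of the framework itself.

```python
def spectrum_band(raw_tokens: int, artifact_tokens: int) -> str:
    """Place an artifact on the compression spectrum by its ratio of raw
    experience tokens to artifact tokens, using the bands from the summary:
    5-20x -> memory, 50-500x -> skill, 1000x+ -> rule."""
    ratio = raw_tokens / artifact_tokens
    if ratio >= 1000:
        return "rule"
    if ratio >= 50:
        return "skill"
    if ratio >= 5:
        return "memory"
    return "raw"
```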
IG-Search introduces step-level information gain rewards for search-augmented reasoning, measuring how retrieved documents improve model confidence in answers relative to random baselines. This addresses the gradient collapse problem in trajectory-level RL when all sampled trajectories fail and enables distinguishing precise queries from vague ones within rollout groups.
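The core reward idea can be sketched as a log-likelihood lift over a random-document baseline. The exact reward definition in IG-Search may differ; the numbers below are made up to show how a precise query earns more step-level reward than a vague one.

```python
import math

def step_info_gain(p_answer_with_doc: float, p_answer_with_random_docs: list[float]) -> float:
    """Step-level information gain, sketched as the log-likelihood lift of
    the gold answer given the retrieved document over the mean likelihood
    under randomly sampled documents."""
    baseline = sum(p_answer_with_random_docs) / len(p_answer_with_random_docs)
    return math.log(p_answer_with_doc) - math.log(baseline)

# A precise query retrieves a document that sharply raises answer confidence...
precise = step_info_gain(0.8, [0.10, 0.12, 0.08])
# ...while a vague query retrieves one barely better than random.
vague = step_info_gain(0.12, [0.10, 0.12, 0.08])
```

Because this reward is dense at the step level, it stays informative even when every sampled trajectory fails to produce a correct final answer, which is where trajectory-level RL rewards collapse to zero gradient.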
Blinded multi-rater study with 6 senior diabetes clinicians evaluated a retrieval-grounded LLM conversational agent for CGM data interpretation and patient counseling support across 12 cases. The system generated plain-language explanations while avoiding individualized therapeutic advice, addressing the time-intensive nature of CGM pattern explanation. Builds evidence for RAG-based clinical decision support in diabetes care.
Ecom-RLVE introduces adaptive verifiable environments for training and evaluating e-commerce conversational agents with reinforcement learning. Provides structured simulation environments where agent actions can be verified against ground truth. Enables systematic development of domain-specific conversational AI for shopping and customer service scenarios.
Tutorial on training and fine-tuning multimodal embedding and reranker models using Sentence Transformers framework. Covers practical implementation for combining text and visual modalities in retrieval tasks.
Corpus2Skill distills document corpora into hierarchical skill directories that LLM agents navigate rather than passively retrieve, addressing RAG's limitation of treating models as passive consumers. The system clusters documents offline into a navigable tree with LLM-written summaries at each level, giving agents a bird's-eye corpus view for better evidence synthesis.
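The navigate-rather-than-retrieve idea can be sketched with a toy directory tree. The hand-written summaries and word-overlap scoring below stand in for the paper's LLM-written summaries and LLM-driven navigation choices; none of the structure is Corpus2Skill's actual API.

```python
# Toy hierarchical skill directory: each node carries a summary (LLM-written
# in the real system, hand-written here) and either child nodes or leaf docs.
tree = {
    "summary": "Company engineering corpus",
    "children": {
        "infra": {
            "summary": "Deployment Kubernetes and on-call runbooks",
            "docs": ["k8s-rollback.md", "oncall-guide.md"],
        },
        "ml": {
            "summary": "Model training evaluation and data pipelines",
            "docs": ["eval-harness.md", "feature-store.md"],
        },
    },
}

def navigate(node: dict, query_terms: set[str]) -> list[str]:
    """Greedy descent: at each level pick the child whose summary best
    overlaps the query, instead of retrieving flat passages. Word overlap
    stands in for the agent's LLM judgment."""
    while "children" in node:
        node = max(
            node["children"].values(),
            key=lambda child: len(query_terms & set(child["summary"].lower().split())),
        )
    return node["docs"]
```

The descent gives the agent the same bird's-eye view a human gets from a table of contents: it commits to a branch of the corpus before reading any leaf documents.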
DR³-Eval provides a reproducible benchmark for deep research agents using static research sandbox corpora paired with authentic user tasks, measuring multimodal report generation across dimensions including information recall, factual accuracy, and citation coverage. It addresses the challenge of evaluating long-horizon research tasks by simulating open-web complexity while remaining fully verifiable.
UniDoc-RL uses reinforcement learning to unify retrieval, reranking, and visual perception in a single LVLM agent with hierarchical actions. The model progressively refines evidence from document-level retrieval to region-level cropping, enabling fine-grained visual semantics for complex reasoning tasks.
Ennoia provides a declarative document indexing framework for Python, allowing schema-driven structured extraction and search. Developers define index schemas and extract queryable structures from documents programmatically.
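What schema-driven extraction means in practice can be sketched with plain dataclasses. This is NOT Ennoia's actual API; the `Invoice` schema and `extract` helper are hypothetical, showing only the general pattern of declaring a schema and deriving queryable records from documents.

```python
from dataclasses import dataclass, fields

@dataclass
class Invoice:
    vendor: str
    total: float

def extract(doc: dict, schema: type) -> object:
    """Pull declared schema fields out of a pre-parsed document dict,
    coercing each value to its annotated type. A real framework would
    drive a parser or LLM from the declared schema instead."""
    values = {f.name: f.type(doc[f.name]) for f in fields(schema)}
    return schema(**values)

record = extract({"vendor": "Acme", "total": "19.99", "notes": "net-30"}, Invoice)
# record.vendor == "Acme", record.total == 19.99; undeclared fields are dropped
```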
Chip Huyen's 'AI Engineering' book became O'Reilly's most-read title since its launch, covering evaluation, prompt engineering, RAG, fine-tuning, dataset engineering, and production architecture. Emphasizes evaluation as the most critical part of AI engineering and data as the most valuable asset in an era of commoditized models.