#production-ai News & Analysis

22 articles tagged with #production-ai. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Litmus: Zero-Label, Code-Driven Metric Specification for Evaluating AI Systems

Researchers introduce Litmus, a zero-label evaluation system that automatically designs metrics for AI pipelines by analyzing source code rather than relying on manual labeling. The system identifies what needs to be measured and why before constructing justified metric portfolios, outperforming existing baselines on three real-world AI applications including financial and scientific tasks.

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Researchers introduce Insights Generator (IG), a multi-agent system that automates the diagnosis of failures in large language model agents by analyzing execution trace corpora at scale. IG produces evidence-backed natural language insights about systematic behavioral patterns, demonstrating 30.4 percentage point performance improvements when human experts implement its recommendations.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

SAGE: Scalable AI Governance & Evaluation

Researchers at LinkedIn introduce SAGE, a framework that combines human judgment with AI surrogates to evaluate search relevance at scale. Using a bidirectional calibration loop between policy, precedent examples, and LLM judges, the system reaches near-human agreement while cutting inference costs by 92×, ultimately driving a 0.25% lift in LinkedIn's daily active users.

AI · Neutral · arXiv – CS AI · Jun 4 · 7/10

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Researchers introduce CHARM, a framework for detecting and mitigating cascading hallucinations in multi-step AI reasoning pipelines where errors compound across stages. The system achieves 89.4% detection accuracy with minimal false positives, addressing a critical vulnerability in agentic RAG systems that existing methods fail to catch.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Ekka: Automated Diagnosis of Silent Errors in LLM Inference

Researchers introduce Ekka, an automated diagnostic system that identifies root causes of silent errors in large language model serving frameworks by comparing execution states between target and reference implementations. The system achieves 80% pass@1 accuracy and has already discovered 4 new bugs in production serving frameworks, addressing a critical reliability challenge in LLM deployment.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

Researchers present a self-healing orchestration framework for tool-augmented large language models that treats reliability as a bounded runtime control problem, achieving 98.8% task success by mapping failure signals to recovery actions and verifying results. The approach outperforms retry-only and full-replanning baselines across multiple benchmarks, particularly excelling when recovery budgets are constrained.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

SHERLOCK: Towards Dynamic Knowledge Adaptation in LLM-enhanced E-commerce Risk Management

SHERLOCK is an AI framework that combines large language models with structured domain knowledge to automate e-commerce fraud investigation and risk management. Deployed at JD.com, it achieved an 82% expert acceptance rate and a 386.7% throughput increase while continuously adapting to evolving fraud tactics through a self-improving data flywheel.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

Researchers propose a schema-grounded approach to AI memory that treats persistent storage as a system of record rather than a search problem, using iterative extraction with validation gates. The method achieves 97.10% F1 on memory benchmarks and 95.2% accuracy on application tasks, significantly outperforming retrieval-based baselines and suggesting that memory architecture matters more than model scale alone.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Pioneer Agent: Continual Improvement of Small Language Models in Production

Researchers introduce Pioneer Agent, an automated system that continuously improves small language models in production by diagnosing failures, curating training data, and retraining under regression constraints. The system demonstrates significant performance gains across benchmarks, with real-world deployments achieving improvements from 84.9% to 99.3% in intent classification.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

Cross-Model Disagreement as a Label-Free Correctness Signal

Researchers introduce cross-model disagreement as a training-free method to detect when AI language models make confident errors without requiring ground truth labels. The approach uses Cross-Model Perplexity and Cross-Model Entropy to measure how surprised a second verifier model is when reading another model's answers, significantly outperforming existing uncertainty-based methods across multiple benchmarks.
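As a rough illustration of the perplexity-based signal described above, here is a minimal Python sketch. It assumes you can already obtain, from a second verifier model, the natural-log probability of each token in the first model's answer. How those log-probabilities are obtained, the function names, and the threshold value are all assumptions made for illustration, not details from the paper.

```python
import math
from typing import Sequence


def cross_model_perplexity(verifier_logprobs: Sequence[float]) -> float:
    """Perplexity of one model's answer as scored by a second (verifier) model.

    verifier_logprobs holds the natural-log probability the verifier assigns
    to each token of the answer (hypothetical input for this sketch).
    """
    if not verifier_logprobs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token, then exponentiate.
    mean_nll = -sum(verifier_logprobs) / len(verifier_logprobs)
    return math.exp(mean_nll)


def flag_for_review(verifier_logprobs: Sequence[float], threshold: float = 5.0) -> bool:
    """Flag an answer when the verifier is 'surprised' by it.

    The threshold is an illustrative assumption; in practice it would be
    tuned on held-out data.
    """
    return cross_model_perplexity(verifier_logprobs) > threshold
```

A higher value means the verifier found the answer less predictable, which the summary treats as a hint that the answer may be a confident error.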

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents

Researchers introduce the Agent Lifecycle Toolkit (ALTK), an open-source middleware collection designed to address critical failure modes in enterprise AI agent deployments. The toolkit provides modular components for systematic error detection, repair, and mitigation across six key intervention points in the agent lifecycle.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Ethical and Explainable AI in Reusable MLOps Pipelines

Researchers developed a unified MLOps framework that integrates ethical AI principles, reducing demographic bias from 0.31 to 0.04 while maintaining predictive accuracy. The system automatically blocks deployments and triggers retraining based on fairness metrics, demonstrating practical implementation of ethical AI in production environments.
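A fairness gate of the kind described above might look like the following Python sketch. The summary does not say which bias metric the paper uses, so this assumes demographic parity difference (the gap in positive-prediction rates between groups). The function names and the 0.10 threshold are hypothetical.

```python
from typing import Sequence


def demographic_parity_difference(preds: Sequence[int], groups: Sequence[str]) -> float:
    """Largest gap in positive-prediction rate between any two groups.

    preds are binary predictions (0 or 1); groups gives the group label
    for each prediction. The metric choice is an assumption of this sketch.
    """
    if len(preds) != len(groups) or not preds:
        raise ValueError("preds and groups must be non-empty and the same length")
    rates = {}
    for g in set(groups):
        members = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(members) / len(members)
    return max(rates.values()) - min(rates.values())


def deployment_allowed(preds: Sequence[int], groups: Sequence[str], max_gap: float = 0.10) -> bool:
    """Return False (block the deployment) when the gap exceeds max_gap.

    max_gap is an illustrative threshold, not a value from the paper.
    """
    return demographic_parity_difference(preds, groups) <= max_gap
```

With this assumed threshold, a model scoring 0.31 would be blocked and one scoring 0.04 would pass, which matches the before and after figures quoted in the summary.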

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production

Meta presents CharacterFlywheel, an iterative process for improving large language models in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from Llama 3.1, the system improved over 15 generations of refinement, with the best models showing up to 8.8% improvement in engagement breadth and 19.4% in engagement depth while substantially improving instruction-following capabilities.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Learning to Construct Practical Agentic Systems

Researchers propose a practical framework for building LLM-based agentic systems that prioritizes simplicity, cost predictability, and controllability over maximum optimization. The framework uses modular "pseudo-tools" and fixed workflows, demonstrating that hand-engineered agents often outperform dynamically planned systems in production environments.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation: A Production-Evaluation Case Study

Researchers present HOPM, a hierarchical prompt mutation framework that adaptively optimizes language model outputs for high-stakes document generation in marketplace dispute resolution. Testing on 600 real cases, the system achieved an 11 percentage point improvement in win rate and 19.1 percentage point improvement in amount-weighted outcomes compared to static prompting, combining human feedback with automated evaluation.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

OrcaRouter is a production-ready LLM routing system that uses contextual bandits and hybrid offline-online learning to intelligently direct requests to the most appropriate language model. The system ranked second on the RouterArena leaderboard with 75.54% accuracy while maintaining low inference costs of $1.00 per 1,000 queries.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Researchers introduce AgingBench, a longitudinal reliability benchmark that evaluates how AI agents degrade over time in production environments rather than just at deployment. The study reveals that agent reliability decays through four distinct mechanisms—compression, interference, revision, and maintenance aging—and that fixes must target specific failure stages rather than assuming stronger base models solve the problem.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Prompt Readiness Levels (PRL): a maturity scale and scoring framework for production grade prompt assets

Researchers have introduced Prompt Readiness Levels (PRL), a nine-level maturity framework for evaluating and governing AI prompt assets in production environments. The system includes a multidimensional scoring method (PRS) designed to ensure prompt engineering meets operational, safety, and compliance standards across organizations.

AI · Neutral · arXiv – CS AI · Mar 5 · 5/10

Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Researchers present a blueprint for evaluating and optimizing multi-agent conversational shopping assistants, addressing challenges in multi-turn interactions and tightly coupled AI systems. The paper introduces evaluation rubrics and two prompt-optimization strategies including a novel Multi-Agent Multi-Turn GEPA approach for system-level optimization.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Agentic Multi-Source Grounding for Enhanced Query Intent Understanding: A DoorDash Case Study

DoorDash developed an AI system that uses multiple data sources to better understand ambiguous search queries by combining catalog data with web search results. The system achieved significant accuracy improvements over traditional methods and is now deployed across 95% of DoorDash's daily search traffic.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

Higress-RAG: A Holistic Optimization Framework for Enterprise Retrieval-Augmented Generation via Dual Hybrid Retrieval, Adaptive Routing, and CRAG

Researchers have developed Higress-RAG, an enterprise-grade framework that addresses key challenges in Retrieval-Augmented Generation systems, including low retrieval precision, hallucination, and high latency. The system introduces 50 ms semantic caching, hybrid retrieval methods, and corrective evaluation to optimize the entire RAG pipeline for production use.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Apple's App Store search team successfully implemented LLM-generated textual relevance labels to augment their ranking system, addressing data scarcity issues. A fine-tuned specialized model outperformed larger pre-trained models, generating millions of labels that improved search relevance. This resulted in a statistically significant 0.24% increase in conversion rates in worldwide A/B testing.