AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers present a multi-agent LLM pipeline architecture that reduces hallucinations by 31-36% through nested learning, semantic caching, and progressive review stages. The system simultaneously improves factual reliability, cuts energy consumption by 47%, and enhances auditability without requiring model retraining.
AI · Neutral · arXiv – CS AI · 3d ago · 7/10
🧠Researchers propose the SMARt framework, a four-layer autonomous AI system architecture that manages failures through formal escalation protocols rather than relying solely on model improvements. The framework enables AI agents to detect uncertainty, suspend operations, attempt recovery, and surrender control when reliability diminishes, addressing the fundamental architectural vulnerability of unbounded autonomy in deployed agentic systems.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce Prompt Codebooks (PCO), a new framework for automatic prompt optimization that breaks down instructions into reusable, atomic components rather than treating prompts as fixed strings. The method achieves up to 30% performance gains over baseline approaches while reducing prompt lengths by 14x, enabling more efficient and adaptive language model instruction refinement.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠MiniMax introduces the M2 series, a Mixture-of-Experts language model with 229.9B total parameters but only 9.8B activated per token, achieving frontier-tier performance on agentic tasks through agent-driven data pipelines and a custom reinforcement learning system called Forge. The M2.7 checkpoint demonstrates early self-evolution capabilities, autonomously debugging and modifying its own training scaffold.
AI · Neutral · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce Trajel, a dataset and evaluation framework for detecting hallucinations in multi-step LLM agent workflows, revealing that existing benchmarks miss intermediate failures. The framework defines five hallucination types and shows that trajectory-level detection outperforms traditional post-hoc verification, highlighting critical gaps in current AI safety evaluation methodologies.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠RewardHarness introduces a self-evolving agentic framework that dramatically improves reward modeling for image-editing evaluation using only 0.05% of typical training data. By iteratively refining tools and skills from minimal examples rather than large-scale annotations, the system achieves 47.4% accuracy on benchmarks, outperforming GPT-5 and enabling more efficient AI alignment.
🧠 GPT-5
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce AHD Agent, a reinforcement learning framework that enables language models to autonomously design heuristics for solving complex combinatorial optimization problems. A 4-billion-parameter model achieves performance comparable to much larger systems while requiring significantly fewer computational evaluations, advancing the frontier of AI-driven algorithm design.
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers have identified critical security vulnerabilities in multi-agent AI networks where compromised parent agents can propagate malicious instructions to spawned subagents through inherited memory. The study demonstrates how current LLM frameworks violate trust boundaries via insecure memory inheritance and weak resource controls, turning localized agent compromises into systemic network risks.
🧠 ChatGPT
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce MARL-Rad, a multi-agent reinforcement learning framework that optimizes AI agents specifically for radiology report generation rather than using fixed LLMs in pre-designed workflows. The system decomposes chest X-ray interpretation into specialized regional agents coordinated by a global integrator, achieving state-of-the-art clinical performance on benchmark datasets with clinician validation.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce execution lineage, a DAG-based execution model that makes AI-native workflows reproducible and maintainable by explicitly tracking dependencies and enabling identity-based replay. Compared with traditional loop-based approaches, the system better preserved existing work during updates and prevented contamination from unrelated context.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers present a layered security architecture for multitenant enterprise AI systems that isolates data and controls access in retrieval-augmented generation (RAG) and agentic AI deployments. The approach keeps security-critical operations on the server and prevents cross-tenant data leakage, and it is validated through an open-source OGX framework with negligible performance overhead.
🏢 OpenAI
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers introduce ObjectGraph (.og), a new file format designed specifically for how AI agents consume documents through retrieval rather than linear reading. The format reduces token consumption by up to 95.3% while maintaining task accuracy, addressing a fundamental architectural mismatch between traditional documents and LLM agent workflows.
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers introduce DeepER-Med, an agentic AI framework designed to advance evidence-based medical research with explicit transparency and trustworthiness mechanisms. The system outperforms existing production-grade platforms on complex medical questions and demonstrates clinical alignment in real-world case evaluations, addressing critical gaps in AI reliability for healthcare adoption.
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers introduce EvoTest, an evolutionary framework enabling AI agents to improve performance across consecutive test episodes without fine-tuning or gradients. The method outperforms existing adaptation techniques on a new Jericho Test-Time Learning benchmark, successfully winning games that all baseline methods failed to complete.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce ExecTune, a training methodology for optimizing black-box LLM systems where a guide model generates strategies executed by a core model. The approach improves accuracy by up to 9.2% while reducing inference costs by 22.4%, enabling smaller models like Claude Haiku to match larger competitors at significantly lower computational expense.
🧠 Claude · 🧠 Haiku · 🧠 Sonnet
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠AlphaApollo is a new AI reasoning system that addresses limitations in foundation models through multi-turn agentic reasoning, learning, and evolution components. The system demonstrates significant performance improvements across math reasoning benchmarks, with success rates exceeding 85% for tool calls and substantial gains from reinforcement learning across different model scales.
AI · Neutral · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers evaluated 34 large language models on radiology questions, finding that agentic retrieval-augmented reasoning systems improve consensus and reliability across different AI models. The study shows these systems reduce decision variability between models and increase robust correctness, though 72% of incorrect outputs still carried moderate to high clinical severity.
AI × Crypto · Neutral · Bankless · Mar 6 · 7/10
🤖The article discusses three key developments at the intersection of AI and cryptocurrency, covering both problematic applications, such as criminal use cases, and positive ones, such as AI-powered smart contract auditing. These developments signal the emergence of an 'agentic frontier' where AI agents operate autonomously within crypto ecosystems.
AI · Bearish · arXiv – CS AI · Mar 6 · 7/10
🧠Research reveals that AI language models exhibit self-attribution bias when monitoring their own behavior, rating their own actions as more correct and less risky than identical actions presented by others. As a result, AI monitors miss high-risk or incorrect actions more often when evaluating their own outputs, which could leave deployed AI agents inadequately monitored.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce KairosAgent, an agentic framework combining large language models with time series foundation models to improve multimodal forecasting across domains. The system uses semantic reasoning from LLMs fused with numerical forecasting capabilities, achieving superior zero-shot performance through reinforcement learning and structured tool integration.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce Loong, an AI agent designed to improve long-document translation by selectively retrieving relevant context from a 3E memory module rather than processing all available information. The system uses reinforcement learning to optimize context selection and demonstrates significant translation quality improvements across multiple language pairs, with gains of up to 13 points on standard evaluation metrics.
AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠Researchers propose a hierarchical framework for deploying compact language models in resource-constrained agentic systems, combining knowledge distillation with oracle-supervised fine-tuning to maintain protocol compliance and semantic performance. The approach addresses core deployment challenges including context length limitations, memory constraints, and cost efficiency by separating schema learning from semantic adaptation.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Tool Forge presents a validation-carrying toolchain that converts natural-language descriptions into governed, sandbox-verified tools for large language model agents. The system achieves a 99.2% reduction in context requirements while maintaining a micro-F1 score of 0.940, addressing critical infrastructure gaps in enterprise agentic execution.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠AI-Care is a conversational AI system designed to help individuals with Alzheimer's disease and related dementia manage daily tasks through natural language interaction, reducing cognitive barriers to using digital tools. The system prioritizes safety through caregiver-verified records and controlled clarification flows, with preliminary pilot testing showing positive user trust and task completion outcomes.
AI · Bullish · arXiv – CS AI · May 9 · 6/10
🧠VibeServe introduces an AI-driven approach to LLM serving infrastructure that automatically generates specialized system stacks for different workloads rather than relying on single general-purpose designs. The system matches vLLM performance in standard deployment scenarios while significantly outperforming existing solutions in non-standard cases, suggesting a paradigm shift toward generation-time specialization in infrastructure software.