AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers propose A2X, an LLM-native service discovery system that organizes thousands of callable services into hierarchical taxonomies to work around the context-window limits facing AI agents. The approach improves retrieval accuracy by more than 20 points while cutting token consumption to one-ninth that of baseline methods, enabling scalable orchestration of distributed services.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce MEMENTO, a framework that treats web exploration as a learning signal for AI agents operating in data-scarce domains. By combining iterative web search with dual-channel memory systems, MEMENTO achieves 25-36% performance improvements over baseline models in professional applications like sales automation and legal research without requiring additional model training.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce Prompt Codebooks (PCO), a new framework for automatic prompt optimization that breaks down instructions into reusable, atomic components rather than treating prompts as fixed strings. The method achieves up to 30% performance gains over baseline approaches while reducing prompt lengths by 14x, enabling more efficient and adaptive language model instruction refinement.
AI · Neutral · arXiv – CS AI · May 1 · 7/10
🧠A new research paper demonstrates that LLM evaluation frameworks that apply the same static prompts to every model produce misleading rankings, unlike industry practice, where prompts are tuned per model. The study shows that prompt optimization (PO) significantly changes model performance rankings, suggesting practitioners must optimize prompts per model for accurate comparative evaluations.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers introduce ObjectGraph (.og), a new file format designed specifically for how AI agents consume documents through retrieval rather than linear reading. The format reduces token consumption by up to 95.3% while maintaining task accuracy, addressing a fundamental architectural mismatch between traditional documents and LLM agent workflows.
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠AlphaLab is an autonomous research system using frontier LLMs to automate experimental cycles across computational domains. Without human intervention, it explores datasets, validates frameworks, and runs large-scale experiments while accumulating domain knowledge—achieving 4.4x speedups in CUDA optimization, 22% lower validation loss in LLM pretraining, and 23-25% improvements in traffic forecasting.
🧠 GPT-5 · 🧠 Claude · 🧠 Opus
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 4
🧠Researchers introduce MASPOB, a bandit-based framework that optimizes prompts for Multi-Agent Systems using Graph Neural Networks to handle topology-induced coupling. The system reduces search complexity from exponential to linear while achieving state-of-the-art performance across benchmarks.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠Researchers developed a hierarchical multi-agent LLM framework that significantly improves multi-robot task planning by combining natural language processing with classical PDDL planners. The system uses prompt optimization and meta-learning to achieve success rates of up to 95% on compound tasks, outperforming previous state-of-the-art methods by substantial margins.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce eXTC, a new framework combining structured prompt optimization with reinforcement learning to create interpretable text classifiers that balance performance with explainability. The system generates human-readable domain rules while maintaining inference speed through knowledge distillation, addressing a longstanding trade-off in AI transparency.
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠TCP-MCP introduces a co-evolution framework that simultaneously optimizes AI agent prompts and communication network topologies, achieving state-of-the-art accuracy on multiple benchmarks while reducing token consumption by up to 5.69x compared to existing multi-agent systems. The approach treats prompt design and communication structure as interdependent variables rather than independent parameters, offering a practical methodology for cost-efficient multi-agent AI system design.
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce MemTrace, a framework for debugging Large Language Model memory systems by tracing information flow through memory evolution graphs. The system identifies root causes of memory failures and uses attribution signals to automatically optimize prompts, achieving up to 7.62% performance improvements across multiple memory architectures.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers introduce PICACO, a novel in-context alignment method that optimizes meta-instructions to help large language models better understand and balance multiple, often conflicting human values without fine-tuning. The approach uses total correlation optimization to improve alignment across up to 8 distinct values while reducing noise, addressing a key limitation where LLMs struggle to reconcile competing preferences in single prompts.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce EGL-SCA, a framework for graph reasoning agents that jointly optimizes natural language instructions and computational tools through structural credit assignment. The system achieves a 92.0% success rate on graph reasoning benchmarks by routing each failure to either prompt optimization or tool synthesis, outperforming approaches that improve the two in isolation.
AI · Bearish · arXiv – CS AI · May 9 · 6/10
🧠Researchers demonstrate that self-consistency—a technique where LLMs sample multiple reasoning paths to improve accuracy—delivers diminishing returns on modern models. Testing with Gemini 2.5 shows minimal accuracy gains (0.4-1.6%) while token costs scale linearly, suggesting the technique has become inefficient as model reliability improves.
🧠 Gemini
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠Researchers introduce MENTAT, a novel method for reasoning-intensive regression (RiR)—extracting subtle numerical scores from text in specialized domains. The approach combines batch-reflective prompt optimization with neural ensemble learning, achieving up to 65% improvement over standard LLM prompting and fine-tuning approaches on tasks like rubric-based scoring and domain-specific retrieval.
AI × Crypto · Neutral · arXiv – CS AI · May 4 · 6/10
🤖Researchers introduce ATLAS, a multi-agent framework that uses large language models for autonomous trading by combining dynamic prompt optimization with real-time market feedback. The system addresses key challenges in deploying LLMs for finance: adapting to delayed, noisy market signals and converting model outputs into executable orders.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers demonstrate that large language models can extract predictive features from financial news with valid intermediate signals (Information Coefficient >0.15), yet these features fail to improve reinforcement learning trading agents during macroeconomic shocks. The findings reveal a critical gap between feature-level validity and downstream policy robustness, suggesting that valid signals alone cannot guarantee trading performance under distribution shifts.
AI · Neutral · arXiv – CS AI · Mar 9 · 6/10
🧠Researchers have developed ContextBench, a new benchmark for evaluating methods that generate targeted inputs to trigger specific behaviors in language models. The study introduces enhanced Evolutionary Prompt Optimization techniques that better balance effectiveness at activating model features against linguistic fluency.
AI · Neutral · arXiv – CS AI · Mar 5 · 5/10
🧠Researchers present a blueprint for evaluating and optimizing multi-agent conversational shopping assistants, addressing challenges in multi-turn interactions and tightly coupled AI systems. The paper introduces evaluation rubrics and two prompt-optimization strategies, including a novel Multi-Agent Multi-Turn GEPA approach for system-level optimization.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6
🧠Researchers introduce 3R, a new RAG-based framework that optimizes prompts for text-to-video generation models without requiring model retraining. The system uses three key strategies to improve video quality: RAG-based modifier extraction, diffusion-based preference optimization, and temporal frame interpolation for better consistency.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 5
🧠Researchers demonstrated that prompt optimization using Genetic-Pareto (GEPA) significantly improves language models' ability to detect errors in medical notes. The technique boosted accuracy from 0.669 to 0.785 with GPT-5 and from 0.578 to 0.690 with Qwen3-32B, achieving state-of-the-art performance on medical error detection benchmarks.