AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduced Uno-Orchestra, a new orchestration framework for multi-agent LLM systems that dynamically decides when to decompose tasks and which model-primitive pairs to use, achieving 77% accuracy across 13 benchmarks while reducing computational costs by an order of magnitude compared to existing approaches.
AI · Bearish · arXiv – CS AI · May 1 · 7/10
🧠Researchers demonstrate that supervised financial NLP benchmarks used to evaluate LLMs contain hidden measurement risks, where rubric wording, metric selection, and aggregation methods materially alter model performance rankings. Testing on the Japanese Financial Implicit-Commitment Recognition dataset reveals 13-point agreement variance across rubric variants and shows that certain metrics produce unreliable signals, highlighting the need for standardized evaluation governance in financial AI model selection.
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers propose a cost-aware model orchestration method that improves how Large Language Models select and coordinate multiple AI tools for complex tasks. By incorporating quantitative performance metrics alongside qualitative descriptions, the approach achieves up to 11.92% accuracy gains and 54% energy efficiency improvements, and cuts model selection latency from 4.51 seconds to 7.2 milliseconds.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠AgentOpt v0.1, a new Python framework, addresses client-side optimization for AI agents by allocating models, tools, and API budgets across pipeline stages. Using search algorithms such as Arm Elimination and Bayesian Optimization, the tool reduces evaluation costs by 24-67% while achieving near-optimal accuracy. Cost differences between model combinations reach up to 32x at matched performance levels.
AI · Neutral · arXiv – CS AI · Mar 26 · 7/10
🧠A systematic study of 8 frontier reasoning language models reveals that a lower listed API price can lead to higher actual costs because 'thinking token' consumption varies between models. The research found that in 21.8% of model comparisons, the cheaper-listed model actually costs more to operate, with cost differences of up to 28x.
🧠 GPT-5 🧠 Gemini
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce Continual Model Routing (CMR), a framework addressing the challenge of efficiently selecting from thousands of pre-trained models in expanding AI hubs. They present CMRBench, a large-scale benchmark with over 2,000 candidate models, and CARvE, a contrastive embedding method that outperforms existing routing strategies as model repositories grow.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers propose Architecture-driven Shift (ADS), a lightweight computational method to predict how pre-trained neural networks will perform in continual learning scenarios by measuring logit shift without expensive calculations. The approach theoretically decouples architecture characteristics from data dependency, achieving strong correlation with actual performance across 175+ diverse model architectures.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce Structure-Adaptive Conformal Inference (SCQ and P-TAMS), a statistical framework that improves out-of-distribution testing in machine learning by incorporating auxiliary structural information like spatiotemporal patterns. The approach provides finite-sample error-rate control and enhanced interpretability compared to traditional conformal methods, with applications in high-stakes prediction scenarios.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers demonstrate that reasoning-capable LLMs improve judgment accuracy significantly on complex tasks like math and coding, but offer minimal or negative benefits on simpler evaluations while consuming substantially more computational resources. They introduce RACER, an adaptive routing algorithm that dynamically selects between reasoning and non-reasoning judges under budget constraints while accounting for distribution shifts.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose a query-efficient method for evaluating new AI models using cached responses from previously-evaluated models, leveraging the Data Kernel Perspective Space (DKPS) framework to reduce computational costs while maintaining evaluation accuracy. The approach demonstrates that by intelligently reusing existing model outputs, organizations can achieve equivalent benchmarking results with substantially fewer new queries.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠A comprehensive empirical study reveals that reported inefficiencies in multi-LLM routing systems are substantially inflated by evaluation artifacts rather than genuine model limitations. Researchers found that LLM-as-a-judge biases, output truncation, and format mismatches account for a significant portion of measured failures, suggesting that current estimates of the routing cost-quality tradeoff overstate the actual unsolvability ceiling.
🧠 Llama
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose SIREN, a new evaluation protocol that corrects for the 'winner's curse' bias in large language model benchmarking. This addresses a critical flaw where reusing benchmark items during model tuning inflates performance estimates, potentially leading to flawed deployment decisions based on unreliable comparisons.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers propose a black-box robustness evaluation framework for NLP explanations, revealing that decoder-based LLMs produce 73% more stable explanations than encoder models like BERT. The study establishes practical cost-robustness tradeoffs that help organizations select models for compliance-sensitive applications before deployment.
🧠 Llama
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠RPA-Check introduces an automated four-stage framework for evaluating Large Language Model-based Role-Playing Agents in complex scenarios, addressing the gap in standard NLP metrics for assessing role adherence and narrative consistency. Testing across legal scenarios reveals that smaller, instruction-tuned models (8-9B parameters) outperform larger models in procedural consistency, suggesting that performance on this task does not simply track model scale.
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 12
🧠Researchers developed a new framework for selecting optimal medical AI foundation models without costly fine-tuning, achieving 31% better performance than existing methods. The topology-driven approach evaluates manifold tractability rather than statistical overlap to better assess model transferability for medical image segmentation tasks.
AI · Neutral · arXiv – CS AI · Mar 17 · 4/10
🧠Researchers propose a new constraint-based approach to LLM routing that formulates the problem as weighted MaxSAT/MaxSMT optimization, using natural language feedback to create constraints over model attributes. Testing on a 25-model benchmark shows this method can effectively route queries to appropriate LLMs based on user preferences expressed in natural language.