y0news

#model-selection News & Analysis

16 articles tagged with #model-selection. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

Researchers introduced Uno-Orchestra, a new orchestration framework for multi-agent LLM systems that dynamically decides when to decompose tasks and which model-primitive pairs to use, achieving 77% accuracy across 13 benchmarks while reducing computational costs by an order of magnitude compared to existing approaches.

AI · Bearish · arXiv – CS AI · May 1 · 7/10

Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR

Researchers demonstrate that supervised financial NLP benchmarks used to evaluate LLMs contain hidden measurement risks, where rubric wording, metric selection, and aggregation methods materially alter model performance rankings. Testing on the Japanese Financial Implicit-Commitment Recognition dataset reveals 13-point agreement variance across rubric variants and shows that certain metrics produce unreliable signals, highlighting the need for standardized evaluation governance in financial AI model selection.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

Cost-Aware Model Orchestration for LLM-based Systems

Researchers propose a cost-aware model orchestration method that improves how Large Language Models select and coordinate multiple AI tools for complex tasks. By incorporating quantitative performance metrics alongside qualitative descriptions, the approach achieves accuracy gains of up to 11.92% and energy-efficiency improvements of 54%, and cuts model selection latency from 4.51 seconds to 7.2 milliseconds.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

AgentOpt v0.1, a new Python framework, addresses client-side optimization for AI agents by intelligently allocating models, tools, and API budgets across pipeline stages. Using search algorithms like Arm Elimination and Bayesian Optimization, the tool reduces evaluation costs by 24-67% while achieving near-optimal accuracy, with cost differences between model combinations reaching up to 32x at matched performance levels.

AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

A systematic study of 8 frontier reasoning language models reveals that cheaper API pricing often leads to higher actual costs due to variable 'thinking token' consumption. The research found that in 21.8% of model comparisons, the cheaper-listed model actually costs more to operate, with cost differences reaching up to 28x.
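The arithmetic behind a price reversal can be sketched in a few lines. This is a minimal illustration, not the study's methodology: the two model profiles, their prices, and their token counts are hypothetical, and the sketch assumes thinking tokens are billed at the same per-token rate as answer tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical pricing and usage profile for one reasoning model."""
    name: str
    usd_per_million_output_tokens: float  # listed price (made-up value)
    avg_thinking_tokens: int              # hidden reasoning tokens per query
    avg_answer_tokens: int                # visible answer tokens per query


def cost_per_query(model: ModelProfile) -> float:
    """Actual cost of one query, assuming thinking tokens are billed as output."""
    billed_tokens = model.avg_thinking_tokens + model.avg_answer_tokens
    return billed_tokens * model.usd_per_million_output_tokens / 1_000_000


def is_price_reversal(cheaper_listed: ModelProfile, pricier_listed: ModelProfile) -> bool:
    """True when the model with the lower listed price costs more per query."""
    lower_list_price = (
        cheaper_listed.usd_per_million_output_tokens
        < pricier_listed.usd_per_million_output_tokens
    )
    higher_actual_cost = cost_per_query(cheaper_listed) > cost_per_query(pricier_listed)
    return lower_list_price and higher_actual_cost


# Hypothetical models: model_a is 4x cheaper per token but thinks 8x longer.
model_a = ModelProfile("model_a", 2.0, 12_000, 500)
model_b = ModelProfile("model_b", 8.0, 1_500, 500)

if __name__ == "__main__":
    for m in (model_a, model_b):
        print(f"{m.name}: ${cost_per_query(m):.4f} per query")
    print("price reversal:", is_price_reversal(model_a, model_b))
```

With these made-up numbers, the model with the lower listed price bills 12,500 tokens per query against 2,000 for the other, so it ends up more expensive per query despite the cheaper rate.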

GPT-5 · Gemini

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Continual Model Routing in Evolving Model Hubs

Researchers introduce Continual Model Routing (CMR), a framework addressing the challenge of efficiently selecting from thousands of pre-trained models in expanding AI hubs. They present CMRBench, a large-scale benchmark with over 2,000 candidate models, and CARvE, a contrastive embedding method that outperforms existing routing strategies as model repositories grow.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

Researchers propose Architecture-driven Shift (ADS), a lightweight computational method to predict how pre-trained neural networks will perform in continual learning scenarios by measuring logit shift without expensive calculations. The approach theoretically decouples architecture characteristics from data dependency, achieving strong correlation with actual performance across 175+ diverse model architectures.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing

Researchers introduce Structure-Adaptive Conformal Inference (SCQ and P-TAMS), a statistical framework that improves out-of-distribution testing in machine learning by incorporating auxiliary structural information like spatiotemporal patterns. The approach provides finite-sample error-rate control and enhanced interpretability compared to traditional conformal methods, with applications in high-stakes prediction scenarios.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

Researchers demonstrate that reasoning-capable LLMs improve judgment accuracy significantly on complex tasks like math and coding, but offer minimal or negative benefits on simpler evaluations while consuming substantially more computational resources. They introduce RACER, an adaptive routing algorithm that dynamically selects between reasoning and non-reasoning judges under budget constraints while accounting for distribution shifts.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Query-efficient model evaluation using cached responses

Researchers propose a query-efficient method for evaluating new AI models using cached responses from previously evaluated models, leveraging the Data Kernel Perspective Space (DKPS) framework to reduce computational costs while maintaining evaluation accuracy. The approach demonstrates that, by reusing existing model outputs, organizations can achieve equivalent benchmarking results with substantially fewer new queries.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

An empirical study finds that reported inefficiencies in multi-LLM routing systems are inflated by evaluation artifacts rather than genuine model limitations. LLM-as-a-judge biases, output truncation, and format mismatches account for a significant portion of measured failures, so current estimates of the routing cost-quality tradeoff overstate the actual unsolvability ceiling.

Llama

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

Researchers propose SIREN, a new evaluation protocol that corrects for the 'winner's curse' bias in large language model benchmarking. This addresses a critical flaw where reusing benchmark items during model tuning inflates performance estimates, potentially leading to flawed deployment decisions based on unreliable comparisons.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Robust Explanations for User Trust in Enterprise NLP Systems

Researchers propose a black-box robustness evaluation framework for NLP explanations, revealing that decoder-based LLMs produce 73% more stable explanations than encoder models like BERT. The study establishes practical cost-robustness tradeoffs that help organizations select models for compliance-sensitive applications before deployment.

Llama

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

RPA-Check introduces an automated four-stage framework for evaluating Large Language Model-based Role-Playing Agents in complex scenarios, addressing the gap in standard NLP metrics for assessing role adherence and narrative consistency. Testing across legal scenarios reveals that smaller, instruction-tuned models (8-9B parameters) outperform larger models in procedural consistency, suggesting optimal performance doesn't correlate with model scale.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 12

The Geometry of Transfer: Unlocking Medical Vision Manifolds for Training-Free Model Ranking

Researchers developed a new framework for selecting optimal medical AI foundation models without costly fine-tuning, achieving 31% better performance than existing methods. The topology-driven approach evaluates manifold tractability rather than statistical overlap to better assess model transferability for medical image segmentation tasks.

AI · Neutral · arXiv – CS AI · Mar 17 · 4/10

LLM Routing as Reasoning: A MaxSAT View

Researchers propose a new constraint-based approach to LLM routing that formulates the problem as weighted MaxSAT/MaxSMT optimization, using natural language feedback to create constraints over model attributes. Testing on a 25-model benchmark shows this method can effectively route queries to appropriate LLMs based on user preferences expressed in natural language.