#model-selection News & Analysis

24 articles tagged with #model-selection. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · May 7 · 7/10

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

Researchers introduced Uno-Orchestra, a new orchestration framework for multi-agent LLM systems that dynamically decides when to decompose tasks and which model-primitive pairs to use, achieving 77% accuracy across 13 benchmarks while reducing computational costs by an order of magnitude compared to existing approaches.

AI · Bearish · arXiv – CS AI · May 1 · 7/10

Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR

Researchers demonstrate that supervised financial NLP benchmarks used to evaluate LLMs contain hidden measurement risks, where rubric wording, metric selection, and aggregation methods materially alter model performance rankings. Testing on the Japanese Financial Implicit-Commitment Recognition dataset reveals 13-point agreement variance across rubric variants and shows that certain metrics produce unreliable signals, highlighting the need for standardized evaluation governance in financial AI model selection.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

Cost-Aware Model Orchestration for LLM-based Systems

Researchers propose a cost-aware model orchestration method that improves how Large Language Models select and coordinate multiple AI tools for complex tasks. By incorporating quantitative performance metrics alongside qualitative descriptions, the approach achieves up to 11.92% accuracy gains and 54% energy efficiency improvements, and cuts model selection latency from 4.51 seconds to 7.2 milliseconds.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

AgentOpt v0.1, a new Python framework, addresses client-side optimization for AI agents by intelligently allocating models, tools, and API budgets across pipeline stages. Using search algorithms like Arm Elimination and Bayesian Optimization, the tool reduces evaluation costs by 24-67% while achieving near-optimal accuracy, with cost differences between model combinations reaching up to 32x at matched performance levels.

AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

A systematic study of 8 frontier reasoning language models reveals that cheaper API pricing often leads to higher actual costs due to variable 'thinking token' consumption. The research found that in 21.8% of model comparisons, the cheaper-listed model actually costs more to operate, with cost differences reaching up to 28x.

🧠 GPT-5 · 🧠 Gemini
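The arithmetic behind a price reversal can be sketched in a few lines. This is an illustration only, not the study's methodology: the prices and token counts are made up, the function name `actual_cost` is hypothetical, and it assumes thinking tokens are billed at the same rate as visible output tokens.

```python
def actual_cost(price_per_million_tokens, visible_tokens, thinking_tokens):
    """Dollar cost of one response.

    Assumption: thinking tokens are billed at the output-token rate.
    """
    billed_tokens = visible_tokens + thinking_tokens
    return price_per_million_tokens * billed_tokens / 1_000_000


# Hypothetical numbers: the model with the lower list price "thinks" far longer.
cheap_listed = actual_cost(2.0, visible_tokens=500, thinking_tokens=9_500)
pricey_listed = actual_cost(8.0, visible_tokens=500, thinking_tokens=1_500)

# A price reversal occurs when the cheaper-listed model costs more per query.
price_reversal = cheap_listed > pricey_listed
print(f"cheaper-listed: ${cheap_listed:.4f}, pricier-listed: ${pricey_listed:.4f}")
print("price reversal:", price_reversal)
```

With these made-up figures, the model listed at a quarter of the price ends up costing more per query because it consumes five times as many billed tokens.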

AI · Bullish · Crypto Briefing · Jun 26 · 6/10

OpenAI introduces new model naming system with capability tiers

OpenAI has introduced a new model naming system organized by capability tiers to improve clarity for developers selecting appropriate models. The streamlined approach aims to simplify decision-making and boost development efficiency while reshaping competitive dynamics in the AI market.

🏢 OpenAI

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Agent-as-a-Router: Agentic Model Routing for Coding Tasks

Researchers propose Agent-as-a-Router, a framework that dynamically routes coding tasks to the most suitable LLM among multiple providers by accumulating execution-grounded experience during deployment. The approach, instantiated as ACRouter, demonstrates 15.3% performance gains over static routers and introduces CodeRouterBench, a benchmark with ~10K tasks from 8 frontier LLMs, addressing the critical need for intelligent model selection in multi-provider environments.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

SPADE: Structure-Prior Adaptive Decision Estimation

SPADE introduces a machine learning framework that adaptively decides whether to enforce physical-structure priors (conservation laws, Hamiltonian forms) based on data evidence, using statistical tests and shrinkage estimation. The method automatically calibrates prior enforcement strength and selects among competing structures, achieving oracle-level performance while reducing computational overhead compared to cross-validation approaches.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Gradient-Descent Steps to Success over Mean Accuracy: A Paradigm Shift for ML

Researchers propose evaluating machine learning models based on computational effort (gradient descent steps to reach target accuracy) rather than maximum accuracy alone. The study reveals that larger learning rates, phase transitions in training strategy, and restart-based approaches optimize both generalization and computational efficiency, offering a new framework for AutoML and model selection.
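A steps-to-target metric of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's protocol: the function name `steps_to_target` and both accuracy curves are hypothetical, and each curve entry is assumed to be the accuracy measured after one more gradient-descent step.

```python
def steps_to_target(accuracy_curve, target):
    """Return the 1-based step at which accuracy first reaches `target`.

    Returns None if the run never reaches the target.
    """
    for step, accuracy in enumerate(accuracy_curve, start=1):
        if accuracy >= target:
            return step
    return None


# Hypothetical training curves for two configurations.
slow_but_higher_peak = [0.40, 0.55, 0.68, 0.79, 0.86, 0.91]
fast_but_lower_peak = [0.60, 0.82, 0.85, 0.86, 0.86, 0.86]

target = 0.80
slow_steps = steps_to_target(slow_but_higher_peak, target)
fast_steps = steps_to_target(fast_but_lower_peak, target)
print("steps to reach", target, "->", slow_steps, "vs", fast_steps)
```

Ranking by peak accuracy favors the first configuration, while ranking by effort to reach the target favors the second, which is the kind of disagreement the proposed evaluation is meant to surface.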

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

ABLE: Representing and Mapping LLMs via Attribution-Based Large-model Embedding

Researchers introduce ABLE, a framework that represents and compares large language models through gradient-based feature attributions rather than parameter analysis or output comparison. The training-free method achieves competitive performance on model comparison tasks across 239 open-source LLMs while providing theoretical stability guarantees.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

MedicalRec: Medical recommender system for image classification without retraining

Researchers have developed MedicalRec, a transformer-based recommender system that identifies optimal deep learning models for medical image classification tasks without requiring retraining. The system leverages a new dataset (MedicalRec-Bench) containing over 5,000 model performance records across five medical imaging domains, achieving a 75.5% HitRate@100 and addressing the computational waste inherent in trial-and-error model selection.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Researchers introduce MOSAIC, a structured agentic framework that automates data science model selection by combining LLM flexibility with systematic verification. Unlike traditional AutoML systems or unstructured LLM agents, MOSAIC creates intermediate 'blueprints' that ground decisions in retrieved evidence and execution feedback, improving task performance and decision traceability.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Computation-Aware Kalman Filtering with Model Selection for Neural Dynamics

Researchers introduce CASSM, a Bayesian framework that combines Kalman filtering with model selection to improve neural dynamics modeling on modern datasets. The method addresses computational complexity and uncertainty calibration challenges, offering competitive performance with deep networks while maintaining better uncertainty quantification, particularly for datasets with fewer trials than recorded neurons.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Continual Model Routing in Evolving Model Hubs

Researchers introduce Continual Model Routing (CMR), a framework addressing the challenge of efficiently selecting from thousands of pre-trained models in expanding AI hubs. They present CMRBench, a large-scale benchmark with over 2,000 candidate models, and CARvE, a contrastive embedding method that outperforms existing routing strategies as model repositories grow.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

Researchers propose Architecture-driven Shift (ADS), a lightweight computational method to predict how pre-trained neural networks will perform in continual learning scenarios by measuring logit shift without expensive calculations. The approach theoretically decouples architecture characteristics from data dependency, achieving strong correlation with actual performance across 175+ diverse model architectures.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing

Researchers introduce Structure-Adaptive Conformal Inference (SCQ and P-TAMS), a statistical framework that improves out-of-distribution testing in machine learning by incorporating auxiliary structural information like spatiotemporal patterns. The approach provides finite-sample error-rate control and enhanced interpretability compared to traditional conformal methods, with applications in high-stakes prediction scenarios.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

Researchers demonstrate that reasoning-capable LLMs improve judgment accuracy significantly on complex tasks like math and coding, but offer minimal or negative benefits on simpler evaluations while consuming substantially more computational resources. They introduce RACER, an adaptive routing algorithm that dynamically selects between reasoning and non-reasoning judges under budget constraints while accounting for distribution shifts.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Query-efficient model evaluation using cached responses

Researchers propose a query-efficient method for evaluating new AI models using cached responses from previously-evaluated models, leveraging the Data Kernel Perspective Space (DKPS) framework to reduce computational costs while maintaining evaluation accuracy. The approach demonstrates that by intelligently reusing existing model outputs, organizations can achieve equivalent benchmarking results with substantially fewer new queries.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

An empirical study finds that the inefficiencies reported for multi-LLM routing systems are substantially inflated by evaluation artifacts rather than genuine model limitations. LLM-as-a-judge biases, output truncation, and format mismatches account for a significant portion of measured failures, suggesting that current estimates of the routing cost-quality tradeoff overstate the actual unsolvability ceiling.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

Researchers propose SIREN, a new evaluation protocol that corrects for the 'winner's curse' bias in large language model benchmarking. This addresses a critical flaw where reusing benchmark items during model tuning inflates performance estimates, potentially leading to flawed deployment decisions based on unreliable comparisons.
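The bias itself is easy to reproduce in a small simulation. The sketch below is not SIREN and does not implement any correction; it only illustrates the winner's curse under made-up assumptions: every candidate has the same true accuracy, scores are noisy because the benchmark is finite, and the function name `simulate_winners_curse` and all parameter values are hypothetical.

```python
import random


def simulate_winners_curse(n_models=20, n_items=200, true_acc=0.70, seed=0):
    """Pick the best of several equally good models on one reused benchmark.

    Returns (score of the selected model on the reused items,
             score of an equally good model on fresh items).
    """
    rng = random.Random(seed)

    def measure():
        # Each item is answered correctly with probability true_acc.
        correct = sum(rng.random() < true_acc for _ in range(n_items))
        return correct / n_items

    # All candidates share the same true accuracy, so any gap is pure noise.
    tuning_scores = [measure() for _ in range(n_models)]
    selected_score = max(tuning_scores)  # the "winner" on the reused benchmark
    fresh_score = measure()              # unbiased estimate on held-out items
    return selected_score, fresh_score


selected, fresh = simulate_winners_curse()
print(f"selected on reused benchmark: {selected:.3f}")
print(f"same quality on fresh items:  {fresh:.3f}")
```

Because the winner is chosen for its highest noisy score, that score tends to sit above the true accuracy of 0.70, while a fresh evaluation has no such upward bias.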

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Robust Explanations for User Trust in Enterprise NLP Systems

Researchers propose a black-box robustness evaluation framework for NLP explanations, revealing that decoder-based LLMs produce 73% more stable explanations than encoder models like BERT. The study establishes practical cost-robustness tradeoffs that help organizations select models for compliance-sensitive applications before deployment.

🧠 Llama

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

RPA-Check introduces an automated four-stage framework for evaluating Large Language Model-based Role-Playing Agents in complex scenarios, addressing the gap in standard NLP metrics for assessing role adherence and narrative consistency. Testing across legal scenarios reveals that smaller, instruction-tuned models (8-9B parameters) outperform larger models in procedural consistency, suggesting optimal performance doesn't correlate with model scale.

AI · Bullish · arXiv – CS AI · Mar 27/1012

The Geometry of Transfer: Unlocking Medical Vision Manifolds for Training-Free Model Ranking

Researchers developed a new framework for selecting optimal medical AI foundation models without costly fine-tuning, achieving 31% better performance than existing methods. The topology-driven approach evaluates manifold tractability rather than statistical overlap to better assess model transferability for medical image segmentation tasks.

AI · Neutral · arXiv – CS AI · Mar 17 · 4/10

LLM Routing as Reasoning: A MaxSAT View

Researchers propose a new constraint-based approach to LLM routing that formulates the problem as weighted MaxSAT/MaxSMT optimization, using natural language feedback to create constraints over model attributes. Testing on a 25-model benchmark shows this method can effectively route queries to appropriate LLMs based on user preferences expressed in natural language.