AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers demonstrate that aggregating complete reasoning traces from multiple LLM agents recovers correct solutions more effectively than majority voting, even when agents unanimously agree. A new approach called Self-Consistent Mixture of Agents uses semantic-preserving perturbations to generate trace diversity while maintaining safety guarantees, outperforming heterogeneous model ensembles across mathematical and scientific reasoning tasks.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce SciAidanBench, a benchmark revealing that LLM capability improvements are uneven across tasks and domains—a phenomenon termed 'jaggedness.' By evaluating 19 models across 8 providers, they demonstrate that stronger models don't uniformly excel at scientific creativity, but this fragmentation can be leveraged through ensemble methods to achieve superior performance.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠CascadeDebate is a multi-agent deliberation system for large language model cascades that allocates compute dynamically based on query difficulty. By inserting lightweight agent ensembles at escalation boundaries to resolve ambiguous cases internally, the system achieves up to a 26.75% performance improvement while reducing unnecessary escalations to expensive models.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers developed StableTTA, a training-free method that significantly improves AI model accuracy on ImageNet-1K, with 33 models achieving over 95% accuracy and several surpassing 96%. The method allows lightweight architectures to outperform Vision Transformers while using 95% fewer parameters and 89% less computational cost.
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠Research comparing 200 humans and 95 AI detectors found humans significantly outperform AI at detecting deepfakes, especially in low-quality mobile phone videos where AI accuracy drops to near chance levels. The study reveals human-AI hybrid systems are most effective, as humans and AI make complementary errors in deepfake detection.
AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 5
🧠Researchers introduce Federated Inference (FI), a new collaborative paradigm where independently trained AI models can work together at inference time without sharing data or model parameters. The study identifies key requirements including privacy preservation and performance gains, while highlighting system-level challenges that differ from traditional federated learning approaches.
AI · Bearish · arXiv – CS AI · 2d ago · 6/10
🧠Researchers identify a critical failure mode in multi-component LLM agent systems where individually coherent components produce globally incoherent outputs that violate probability axioms. The study proposes metrics to detect and repair these failures, finding them present in 33-94% of tested multi-LLM ensembles with measurable economic impact on prediction tasks.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers identify a critical failure mode in test-time reinforcement learning (TTRL) where majority voting locks onto incorrect answers, permanently suppressing correct signals in low-ability problems. They introduce TTRL-Guard, a framework using flip-rate monitoring and selective updating to prevent this 'Correct-Answer Extinction Window,' achieving 54% relative improvement on AIME 2025 benchmarks.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers present Belief-Aware GSAC, an adaptive knowledge distillation method for autonomous driving that modulates teacher guidance based on ensemble disagreement. Testing reveals that adaptive guidance helps under mild-to-moderate partial observability but fails under severe occlusion due to 'observability blindness'—where ensembles achieve low disagreement on visible data while missing occluded information.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers present DEI, a distributed Quality-Diversity search framework that uses heterogeneous large language models as mutation operators to solve competitive programming tasks. A four-model ensemble achieved 124% higher performance than single-model baselines, demonstrating that model diversity—not just computational parallelism—drives superior outcomes in evolutionary AI search.
🧠 GPT-5 · 🧠 Claude · 🧠 Haiku
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers from UTS achieved second place in a psychological defense mechanism classification competition using a multi-agent AI system that identifies defense patterns through absence-based reasoning rather than presence detection. The system combines Gemini 2.5 agents with fine-tuned Qwen models to achieve an F1 score of 0.406, addressing critical biases in minority class prediction through structured ensemble methods.
🧠 Gemini
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers demonstrate that large language models like Qwen2.5-Math achieve 95%+ accuracy on algorithmic number theory problems with optimal hints, and empirically verify a folklore conjecture that Dirichlet character moduli are uniquely determined by L-function zeros using machine learning ensemble methods.
AI · Neutral · arXiv – CS AI · May 12 · 5/10
🧠Researchers propose Context-Aligned Contrastive Regression, a machine learning approach that combines contrastive learning with ridge regression ensembling to improve lexical difficulty prediction across multiple language backgrounds. The method addresses limitations in existing regression-only models by structuring representation spaces to better capture cross-lingual alignment and ordinal difficulty rankings, showing improved performance stability across difficulty levels.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce Evolutionary Ensemble (EvE), a decentralized framework that organizes coding agents into a self-evolving system for algorithmic discovery. By co-evolving two populations—functional code solvers and agent guidance states—EvE autonomously discovered novel mechanisms for In-Context Operator Networks, demonstrating that dynamic agent adaptation outperforms static optimization approaches.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce ARMOR, an agentic framework that improves chemical reaction feasibility prediction by intelligently combining multiple AI tools rather than relying on single models. The system uses hierarchical tool organization and memory-augmented reasoning to resolve conflicting predictions, demonstrating significant performance gains especially when different tools disagree on outcomes.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce Consensus Entropy (CE), a training-free metric that improves OCR quality by measuring agreement across multiple Vision-Language Models, achieving a 42.1% F1 improvement over existing methods. The technique enables self-verifying OCR without supervision, addressing a critical gap in automated error detection for the data generation pipelines used in LLM training.
AI · Neutral · arXiv – CS AI · May 11 · 5/10
🧠Nürnberg NLP's ensemble approach for detecting psychological defence mechanisms achieved first place in the PsyDefDetect shared task by leveraging nine independent voters across different model architectures and training methods. The strategy prioritizes error independence over single-model strength, addressing the inherent ambiguity in classifying overlapping psychological categories.
AI · Neutral · arXiv – CS AI · May 11 · 5/10
🧠Researchers compared ensemble machine learning techniques for predicting obesity risk, finding that ensemble stacking with a neural network meta-classifier outperformed hybrid voting methods, particularly on complex datasets. The study evaluated nine ML algorithms across 50 hyperparameter configurations, demonstrating that stacking achieves superior accuracy (up to 98.98%) for healthcare predictive modeling.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose an active learning framework for optimizing communication structures in multi-agent systems powered by large language models, using ensemble-based task selection to identify the most informative training tasks while reducing token consumption and computational costs.
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers analyzing LLM-based automated scoring found that strategic model selection and reasoning configurations outperform ensemble methods for accuracy. Temperature sampling improved performance, but larger ensemble sizes showed diminishing returns, while higher reasoning effort correlated with better accuracy at varying cost-benefit ratios across model families.
🏢 OpenAI · 🧠 GPT-5 · 🧠 Gemini
AI · Bullish · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce CastFlow, a dynamic agentic framework that applies large language models to time series forecasting through multi-stage workflows combining planning, action, and reflection. The system uses role-specialized agents—a general-purpose LLM paired with a fine-tuned domain-specific model—to iteratively refine forecasts using ensemble methods and contextual memory, demonstrating superior performance over existing static generative approaches.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers demonstrate a zero-shot knowledge graph construction pipeline using local open-source LLMs on consumer hardware, achieving 0.70 F1 on document relations and 0.55 exact match on multi-hop reasoning through ensemble methods. The study reveals that strong model consensus often signals collective hallucination rather than accuracy, challenging traditional ensemble assumptions while maintaining low computational costs and carbon footprint.
AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 13
🧠Researchers introduce E-CIT (Ensemble Conditional Independence Test), a new framework that significantly reduces computational costs in causal discovery by partitioning data into subsets and aggregating results. The method achieves linear computational complexity while maintaining competitive performance, particularly on real-world datasets.
AI · Neutral · arXiv – CS AI · Mar 16 · 5/10
🧠Researchers introduce BoSS (Best-of-Strategies Selector), a new oracle strategy for active learning that outperforms existing methods by using an ensemble approach to select optimal data annotation batches. The study reveals that current state-of-the-art active learning strategies still significantly underperform compared to oracle performance, particularly on large-scale datasets.
AI · Neutral · arXiv – CS AI · Mar 16 · 4/10
🧠Researchers developed an automated query expansion framework using multiple large language models that constructs domain-specific examples without manual intervention. The system uses a two-LLM ensemble approach where different models generate expansions that are then refined by a third LLM, showing significant improvements over traditional methods across multiple datasets.