169 articles tagged with #reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AIBullisharXiv โ CS AI ยท Mar 97/10
๐ง Researchers developed Localized In-Context Learning (L-ICL), a technique that significantly improves large language model performance on symbolic planning tasks by targeting specific constraint violations with minimal corrections. The method achieves 89% valid plan generation compared to 59% for best baselines, representing a major advancement in LLM reasoning capabilities.
AIBullisharXiv โ CS AI ยท Mar 97/10
๐ง Researchers introduce RM-R1, a new class of Reasoning Reward Models (ReasRMs) that integrate chain-of-thought reasoning into reward modeling for large language models. The models outperform much larger competitors including GPT-4o by up to 4.9% across reward model benchmarks by using a chain-of-rubrics mechanism and two-stage training process.
๐ง GPT-4๐ง Llama
AIBullisharXiv โ CS AI ยท Mar 97/10
๐ง Researchers have developed a new technique called activation steering to reduce reasoning biases in large language models, particularly the tendency to confuse content plausibility with logical validity. Their novel K-CAST method achieved up to 15% improvement in formal reasoning accuracy while maintaining robustness across different tasks and languages.
AIBullishCrypto Briefing ยท Mar 57/10
๐ง OpenAI has released GPT-5.4, featuring enhanced reasoning, coding, and computer use capabilities across ChatGPT, API, and Codex platforms. This represents a significant advancement in AI technology that could impact various industries and development workflows.
๐ข OpenAI๐ง GPT-5๐ง ChatGPT
AIBullisharXiv โ CS AI ยท Mar 56/10
๐ง Researchers developed MA-RAG, a Multi-Round Agentic RAG framework that improves medical AI reasoning by iteratively refining responses through conflict detection and external evidence retrieval. The system achieved a substantial +6.8 point accuracy improvement over baseline models across 7 medical Q&A benchmarks by addressing hallucinations and outdated knowledge in healthcare AI applications.
AIBullisharXiv โ CS AI ยท Mar 56/10
๐ง Researchers introduce ToolVQA, a large-scale multimodal dataset with 23K instances designed to improve AI models' ability to use external tools for visual question answering. The dataset features real-world contexts and multi-step reasoning tasks, with fine-tuned 7B models outperforming GPT-3.5-turbo on various benchmarks.
AIBullisharXiv โ CS AI ยท Mar 56/10
๐ง Researchers introduce TTSR, a new framework that enables AI models to improve their reasoning abilities during test time by having a single model alternate between student and teacher roles. The system allows models to learn from their mistakes by analyzing failed reasoning attempts and generating targeted practice questions for continuous improvement.
AIBullisharXiv โ CS AI ยท Mar 57/10
๐ง Researchers propose a geometric framework showing how large language models 'think' through representation space as flows, with logical statements acting as controllers of these flows' velocities. The study provides evidence that LLMs can internalize logical invariants through next-token prediction training, challenging the 'stochastic parrot' criticism and suggesting universal representational laws underlying machine understanding.
AIBullisharXiv โ CS AI ยท Mar 57/10
๐ง Researchers developed COREA, a system that combines small and large language models to reduce AI reasoning costs by 21.5% while maintaining nearly identical accuracy. The system uses confidence scoring to decide when to escalate questions from cheaper small models to more expensive large models.
AIBullisharXiv โ CS AI ยท Mar 56/10
๐ง Researchers introduce Structure of Thought (SoT), a new prompting technique that helps large language models better process text by constructing intermediate structures, showing 5.7-8.6% performance improvements. They also release T2S-Bench, the first benchmark with 1.8K samples across 6 scientific domains to evaluate text-to-structure capabilities, revealing significant room for improvement in current AI models.
AIBullisharXiv โ CS AI ยท Mar 57/10
๐ง Researchers introduce Dynamic Pruning Policy Optimization (DPPO), a new framework that accelerates AI language model training by 2.37x while maintaining accuracy. The method addresses computational bottlenecks in Group Relative Policy Optimization through unbiased gradient estimation and improved data efficiency.
AIBullishMicrosoft Research Blog ยท Mar 47/101
๐ง Microsoft Research announces Phi-4-reasoning-vision-15B, a 15 billion parameter open-weight multimodal reasoning model. The model is designed for vision-language tasks including image captioning and is available through Microsoft Foundry, HuggingFace, and GitHub.
AIBearisharXiv โ CS AI ยท Mar 46/103
๐ง New research reveals that current large language models struggle with collaborative reasoning, showing that 'stronger' models are often more fragile when distracted by misleading information. The study of 15 LLMs found they fail to effectively leverage guidance from other models, with success rates below 9.2% on challenging problems.
AIBullisharXiv โ CS AI ยท Mar 47/103
๐ง Researchers have developed MedLA, a new logic-driven multi-agent AI framework that uses large language models for complex medical reasoning. The system employs multiple AI agents that organize their reasoning into explicit logical trees and engage in structured discussions to resolve inconsistencies and reach consensus on medical questions.
AIBullisharXiv โ CS AI ยท Mar 47/104
๐ง Researchers propose an Adaptive Social Learning (ASL) framework with Adaptive Mode Policy Optimization (AMPO) algorithm to improve language agents' reasoning abilities in social interactions. The system dynamically adjusts reasoning depth based on context, achieving 15.6% higher performance than GPT-4o while using 32.8% shorter reasoning chains.
AIBullisharXiv โ CS AI ยท Mar 47/103
๐ง Researchers introduce LaDiR (Latent Diffusion Reasoner), a novel framework that combines continuous latent representation with iterative refinement capabilities to enhance Large Language Models' reasoning abilities. The system uses a Variational Autoencoder to encode reasoning steps and a latent diffusion model for parallel generation of diverse reasoning trajectories, showing improved accuracy and interpretability in mathematical reasoning benchmarks.
AIBearisharXiv โ CS AI ยท Mar 46/103
๐ง Researchers have identified 'contextual drag' - a phenomenon where large language models (LLMs) generate similar errors when failed attempts are present in their context. The study found 10-20% performance drops across 11 models on 8 reasoning tasks, with iterative self-refinement potentially leading to self-deterioration.
AIBullisharXiv โ CS AI ยท Mar 46/104
๐ง Researchers developed a new method to reduce content biases in large language models' reasoning tasks by transforming syllogisms into canonical logical representations with deterministic parsing. The approach achieved top-5 rankings on the multilingual SemEval-2026 Task 11 benchmark while offering a competitive alternative to complex fine-tuning methods.
AIBullisharXiv โ CS AI ยท Mar 46/105
๐ง Researchers developed a three-stage curriculum learning framework that improves Chain-of-Thought reasoning distillation from large language models to smaller ones. The method enables Qwen2.5-3B-Base to achieve 11.29% accuracy improvement while reducing output length by 27.4% through progressive skill acquisition and Group Relative Policy Optimization.
AIBullisharXiv โ CS AI ยท Mar 47/104
๐ง Researchers introduce PRISM, a new AI inference algorithm that uses Process Reward Models to guide deep reasoning systems. The method significantly improves performance on mathematical and scientific benchmarks by treating candidate solutions as particles in an energy landscape and using score-guided refinement to concentrate on higher-quality reasoning paths.
AIBullisharXiv โ CS AI ยท Mar 37/103
๐ง New research demonstrates that Masked Diffusion Models (MDMs) for text generation are computationally equivalent to chain-of-thought augmented transformers in finite-precision settings. The study proves MDMs can solve all reasoning problems that CoT transformers can, while being more efficient for certain problem classes due to parallel generation capabilities.
AINeutralarXiv โ CS AI ยท Mar 37/104
๐ง Researchers discovered that large reasoning models (LRMs) suffer from inconsistent answers due to competing mechanisms between Chain-of-Thought reasoning and memory retrieval. They developed FARL, a new fine-tuning framework that suppresses retrieval shortcuts to promote genuine reasoning capabilities in AI models.
AIBullisharXiv โ CS AI ยท Mar 37/103
๐ง Researchers introduce SPIRAL, a self-play reinforcement learning framework that enables language models to develop reasoning capabilities by playing zero-sum games against themselves without human supervision. The system improves performance by up to 10% across 8 reasoning benchmarks on multiple model families including Qwen and Llama.
AIBullisharXiv โ CS AI ยท Mar 37/103
๐ง Researchers introduced Scaf-GRPO, a new training framework that overcomes the 'learning cliff' problem in LLM reasoning by providing strategic hints when models plateau. The method boosted Qwen2.5-Math-7B performance on the AIME24 benchmark by 44.3% relative to baseline GRPO methods.
AINeutralarXiv โ CS AI ยท Mar 37/104
๐ง Researchers analyzed Mixture-of-Experts (MoE) language models to determine optimal sparsity levels for different tasks. They found that reasoning tasks require balancing active compute (FLOPs) with optimal data-to-parameter ratios, while memorization tasks benefit from more parameters regardless of sparsity.