AIBullisharXiv – CS AI · 21h ago7/10
🧠Researchers introduce ThinkSwitch, a method that distills reasoning capabilities from large language models into smaller, more efficient models using LoRA and weight interpolation. The technique improves performance on mathematical and scientific reasoning tasks while maintaining low computational costs, doubling accuracy on AIME problems at minimal expense.
AIBearisharXiv – CS AI · 21h ago7/10
🧠Researchers discovered the 'Alignment Curse,' revealing that stronger text-audio alignment in multimodal AI models inadvertently enables more effective transfer of text-based jailbreak attacks to audio channels. The finding exposes a critical safety vulnerability in recent omni-models like Qwen, suggesting current audio safety evaluations significantly underestimate risks originating from text modalities.
AIBullisharXiv – CS AI · 4d ago7/10
🧠Alibaba's Qwen team released Qwen-VLA, a unified foundation model that combines vision, language, and action capabilities for robotics across multiple tasks and robot types. The model demonstrates strong performance on manipulation, navigation, and trajectory prediction benchmarks while generalizing well to out-of-distribution scenarios and real-world robot deployments.
AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers propose DenseSteer, a training-free framework that improves mathematical reasoning in small language models (≤3B parameters) by steering internal representations toward denser reasoning patterns. The method demonstrates that smaller models can match larger ones' performance by executing fewer, more information-rich reasoning steps rather than verbose chain-of-thought processes.
AIBullisharXiv – CS AI · 4d ago7/10
🧠DeepTool is a new AI framework that enhances large language models' ability to reason through tool use by implementing process-supervised reinforcement learning. The system dramatically improves performance on mathematical benchmarks like AIME24 (3.2% to 40.4%) while maintaining token efficiency through interleaved thinking and action.
AIBearisharXiv – CS AI · 5d ago7/10
🧠Researchers discovered that chain-of-thought distillation—training smaller AI models to imitate larger models' reasoning—produces higher answer accuracy on medical benchmarks while simultaneously degrading reasoning quality. A Qwen3-8B student model improved from 74.7% to 84.4% accuracy on MedQA-USMLE, yet error rates in individual reasoning steps jumped from 30.6% to 50.3%, suggesting models learn to mimic expert-like output without grounding claims in sound logic.
AIBullisharXiv – CS AI · 6d ago7/10
🧠Researchers introduce ReMoE, a router fine-tuning framework that optimizes Mixture-of-Experts language models for memory-constrained inference by increasing expert reuse and reducing storage I/O overhead. The approach improves expert reuse by 26% while maintaining performance, delivering up to 1.99× decode speedup on edge devices.
AIBullisharXiv – CS AI · 6d ago7/10
🧠Search-E1 introduces a simplified self-evolution method for search-augmented reasoning agents that achieves competitive performance through vanilla GRPO and self-distillation, without external supervision or complex auxiliary systems. The approach reaches 0.440 average EM on QA benchmarks with Qwen2.5-3B, demonstrating that elaborate post-training machinery may be unnecessary for effective agent development.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers introduce M2A, a novel model merging paradigm that combines mathematical and agentic reasoning in large language models without retraining. The approach improves a Qwen3-8B model's software engineering benchmark performance from 44.0% to 51.2% by strategically injecting mathematical reasoning capabilities along directions that preserve agent behavior.
AIBullisharXiv – CS AI · May 117/10
🧠Researchers introduce One-Step-Train (OST), a new data selection framework for Large Multimodal Models that uses incremental optimization to identify high-quality training samples. The method reduces computational costs by 43% while outperforming existing approaches like LLM-as-a-Judge, demonstrating significant efficiency gains in multimodal model training.
AINeutralarXiv – CS AI · May 97/10
🧠Researchers demonstrate that large language models encode social role granularity—from individual to institutional perspectives—as a structured geometric axis in their internal representations. Using activation steering, they show this axis is causally manipulable, enabling controlled shifts in response scope across different models.
🧠 Llama
AINeutralarXiv – CS AI · Apr 207/10
🧠Researchers demonstrate through causal experiments that hallucinations in language models arise from early trajectory commitments governed by asymmetric attractor dynamics. Using controlled prompt bifurcation on Qwen2.5-1.5B, they show that 44% of test prompts diverge into factual or hallucinated outputs at the first token, with activation patterns revealing that corrupting correct trajectories is far easier than recovering hallucinated ones—suggesting hallucination represents a stable but difficult-to-escape attractor state.
AIBullisharXiv – CS AI · Apr 157/10
🧠Researchers introduce Lightning OPD, an offline on-policy distillation framework that eliminates the need for live teacher inference servers during large language model post-training. By enforcing 'teacher consistency'—using the same teacher model for both supervised fine-tuning and distillation—the method achieves comparable performance to standard OPD while delivering 4x speedup and significantly reducing infrastructure costs.
AIBullisharXiv – CS AI · Apr 147/10
🧠UniToolCall introduces a standardized framework unifying tool-use representation, training data, and evaluation for LLM agents. The framework combines 22k+ tools and 390k+ training instances with a unified evaluation methodology, enabling fine-tuned models like Qwen3-8B to achieve 93% precision—surpassing GPT, Gemini, and Claude in specific benchmarks.
🧠 Claude🧠 Gemini
AINeutralarXiv – CS AI · Apr 107/10
🧠A comprehensive study of the open language model ecosystem reveals that Chinese AI models, including Qwen and DeepSeek, have overtaken U.S.-developed models like Meta's Llama since summer 2025, with the gap continuing to widen. The research analyzes ~1.5K mainline open models across adoption metrics, market share, and performance to document this significant shift in AI development geography.
$ATOM🏢 Hugging Face🧠 Llama
AIBullisharXiv – CS AI · Apr 77/10
🧠Researchers introduce Cog-DRIFT, a new framework that improves AI language model reasoning by transforming difficult problems into easier formats like multiple-choice questions, then gradually training models on increasingly complex versions. The method shows significant performance gains of 8-10% on previously unsolvable problems across multiple reasoning benchmarks.
🧠 Llama
AIBullisharXiv – CS AI · Apr 77/10
🧠Researchers propose Continuous Softened Retracing reSampling (CSRS) to improve the self-evolution of Multimodal Large Language Models by addressing biases in feedback mechanisms. The method uses continuous reward signals instead of binary rewards and achieves state-of-the-art results on mathematical reasoning benchmarks like MathVision using Qwen2.5-VL-7B.
AIBullisharXiv – CS AI · Apr 77/10
🧠Researchers propose a new method for aligning AI language models with human preferences that addresses stability issues in existing approaches. The technique uses relative density ratio optimization to achieve both statistical consistency and training stability, showing effectiveness with Qwen 2.5 and Llama 3 models.
🧠 Llama
AIBullisharXiv – CS AI · Apr 67/10
🧠Researchers developed a quantitative method to improve role consistency in multi-agent AI systems by introducing a role clarity matrix that measures alignment between agents' assigned roles and their actual behavior. The approach significantly reduced role overstepping rates from 46.4% to 8.4% in Qwen models and from 43.4% to 0.2% in Llama models during ChatDev system experiments.
🧠 Llama
AIBullisharXiv – CS AI · Mar 177/10
🧠Researchers introduced SAGE, a multi-agent framework that improves large language model reasoning through self-evolution using four specialized agents. The system achieved significant performance gains on coding and mathematics benchmarks without requiring large human-labeled datasets.
AINeutralarXiv – CS AI · Mar 177/10
🧠Researchers identified a fundamental flaw in large language models where they exhibit moral indifference by compressing distinct moral concepts into uniform probability distributions. The study analyzed 23 models and developed a method using Sparse Autoencoders to improve moral reasoning, achieving 75% win-rate on adversarial benchmarks.
AIBullisharXiv – CS AI · Mar 167/10
🧠Researchers used mechanistic interpretability techniques to demonstrate that transformer language models have distinct but interacting neural circuits for recall (retrieving memorized facts) and reasoning (multi-step inference). Through controlled experiments on Qwen and LLaMA models, they showed that disabling specific circuits can selectively impair one ability while leaving the other intact.
AINeutralarXiv – CS AI · Mar 167/10
🧠Researchers developed a testing framework to evaluate how reliably AI agents maintain consistent reasoning when inputs are semantically equivalent but differently phrased. Their study of seven foundation models across 19 reasoning problems found that larger models aren't necessarily more robust, with the smaller Qwen3-30B-A3B achieving the highest stability at 79.6% invariant responses.
AINeutralarXiv – CS AI · Mar 117/10
🧠Research analyzes FP4 quantization sensitivity across different layers in large language models using NVFP4 and MXFP4 formats on Qwen2.5 models. The study finds MLP projection layers are most sensitive to quantization, while attention layers show substantial robustness to FP4 precision reduction.