#qwen News & Analysis

73 articles tagged with #qwen. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

73 articles

AIBullisharXiv – CS AI · 21h ago7/10

🧠

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

Researchers introduce ThinkSwitch, a method that distills reasoning capabilities from large language models into smaller, more efficient models using LoRA and weight interpolation. The technique improves performance on mathematical and scientific reasoning tasks while maintaining low computational costs, doubling accuracy on AIME problems at minimal expense.

AIBearisharXiv – CS AI · 21h ago7/10

🧠

The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer

Researchers discovered the 'Alignment Curse,' revealing that stronger text-audio alignment in multimodal AI models inadvertently enables more effective transfer of text-based jailbreak attacks to audio channels. The finding exposes a critical safety vulnerability in recent omni-models like Qwen, suggesting current audio safety evaluations significantly underestimate risks originating from text modalities.

AIBullisharXiv – CS AI · 4d ago7/10

🧠

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Alibaba's Qwen team released Qwen-VLA, a unified foundation model that combines vision, language, and action capabilities for robotics across multiple tasks and robot types. The model demonstrates strong performance on manipulation, navigation, and trajectory prediction benchmarks while generalizing well to out-of-distribution scenarios and real-world robot deployments.

AIBullisharXiv – CS AI · 4d ago7/10

🧠

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Researchers propose DenseSteer, a training-free framework that improves mathematical reasoning in small language models (≤3B parameters) by steering internal representations toward denser reasoning patterns. The method demonstrates that smaller models can match larger ones' performance by executing fewer, more information-rich reasoning steps rather than verbose chain-of-thought processes.

AIBullisharXiv – CS AI · 4d ago7/10

🧠

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

DeepTool is a new AI framework that enhances large language models' ability to reason through tool use by implementing process-supervised reinforcement learning. The system dramatically improves performance on mathematical benchmarks like AIME24 (3.2% to 40.4%) while maintaining token efficiency through interleaved thinking and action.

AIBearisharXiv – CS AI · 5d ago7/10

🧠

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

Researchers discovered that chain-of-thought distillation—training smaller AI models to imitate larger models' reasoning—produces higher answer accuracy on medical benchmarks while simultaneously degrading reasoning quality. A Qwen3-8B student model improved from 74.7% to 84.4% accuracy on MedQA-USMLE, yet error rates in individual reasoning steps jumped from 30.6% to 50.3%, suggesting models learn to mimic expert-like output without grounding claims in sound logic.

AIBullisharXiv – CS AI · 6d ago7/10

🧠

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

Researchers introduce ReMoE, a router fine-tuning framework that optimizes Mixture-of-Experts language models for memory-constrained inference by increasing expert reuse and reducing storage I/O overhead. The approach improves expert reuse by 26% while maintaining performance, delivering up to 1.99× decode speedup on edge devices.

AIBullisharXiv – CS AI · 6d ago7/10

🧠

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Search-E1 introduces a simplified self-evolution method for search-augmented reasoning agents that achieves competitive performance through vanilla GRPO and self-distillation, without external supervision or complex auxiliary systems. The approach reaches 0.440 average EM on QA benchmarks with Qwen2.5-3B, demonstrating that elaborate post-training machinery may be unnecessary for effective agent development.

AIBullisharXiv – CS AI · May 127/10

🧠

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.

AIBullisharXiv – CS AI · May 127/10

🧠

M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

Researchers introduce M2A, a novel model merging paradigm that combines mathematical and agentic reasoning in large language models without retraining. The approach improves a Qwen3-8B model's software engineering benchmark performance from 44.0% to 51.2% by strategically injecting mathematical reasoning capabilities along directions that preserve agent behavior.

AIBullisharXiv – CS AI · May 117/10

🧠

Efficient Data Selection for Multimodal Models via Incremental Optimization Utility

Researchers introduce One-Step-Train (OST), a new data selection framework for Large Multimodal Models that uses incremental optimization to identify high-quality training samples. The method reduces computational costs by 43% while outperforming existing approaches like LLM-as-a-Judge, demonstrating significant efficiency gains in multimodal model training.

AINeutralarXiv – CS AI · May 97/10

🧠

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

Researchers demonstrate that large language models encode social role granularity—from individual to institutional perspectives—as a structured geometric axis in their internal representations. Using activation steering, they show this axis is causally manipulable, enabling controlled shifts in response scope across different models.

🧠 Llama

AINeutralarXiv – CS AI · Apr 207/10

🧠

Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

Researchers demonstrate through causal experiments that hallucinations in language models arise from early trajectory commitments governed by asymmetric attractor dynamics. Using controlled prompt bifurcation on Qwen2.5-1.5B, they show that 44% of test prompts diverge into factual or hallucinated outputs at the first token, with activation patterns revealing that corrupting correct trajectories is far easier than recovering hallucinated ones—suggesting hallucination represents a stable but difficult-to-escape attractor state.

AIBullisharXiv – CS AI · Apr 157/10

🧠

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Researchers introduce Lightning OPD, an offline on-policy distillation framework that eliminates the need for live teacher inference servers during large language model post-training. By enforcing 'teacher consistency'—using the same teacher model for both supervised fine-tuning and distillation—the method achieves comparable performance to standard OPD while delivering 4x speedup and significantly reducing infrastructure costs.

AIBullisharXiv – CS AI · Apr 147/10

🧠

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall introduces a standardized framework unifying tool-use representation, training data, and evaluation for LLM agents. The framework combines 22k+ tools and 390k+ training instances with a unified evaluation methodology, enabling fine-tuned models like Qwen3-8B to achieve 93% precision—surpassing GPT, Gemini, and Claude in specific benchmarks.

🧠 Claude🧠 Gemini

AINeutralarXiv – CS AI · Apr 107/10

🧠

The ATOM Report: Measuring the Open Language Model Ecosystem

A comprehensive study of the open language model ecosystem reveals that Chinese AI models, including Qwen and DeepSeek, have overtaken U.S.-developed models like Meta's Llama since summer 2025, with the gap continuing to widen. The research analyzes ~1.5K mainline open models across adoption metrics, market share, and performance to document this significant shift in AI development geography.

$ATOM🏢 Hugging Face🧠 Llama

AIBullisharXiv – CS AI · Apr 77/10

🧠

Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems

Researchers introduce Cog-DRIFT, a new framework that improves AI language model reasoning by transforming difficult problems into easier formats like multiple-choice questions, then gradually training models on increasingly complex versions. The method shows significant performance gains of 8-10% on previously unsolvable problems across multiple reasoning benchmarks.

🧠 Llama

AIBullisharXiv – CS AI · Apr 77/10

🧠

Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

Researchers propose Continuous Softened Retracing reSampling (CSRS) to improve the self-evolution of Multimodal Large Language Models by addressing biases in feedback mechanisms. The method uses continuous reward signals instead of binary rewards and achieves state-of-the-art results on mathematical reasoning benchmarks like MathVision using Qwen2.5-VL-7B.

AIBullisharXiv – CS AI · Apr 77/10

🧠

Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

Researchers propose a new method for aligning AI language models with human preferences that addresses stability issues in existing approaches. The technique uses relative density ratio optimization to achieve both statistical consistency and training stability, showing effectiveness with Qwen 2.5 and Llama 3 models.

🧠 Llama

AIBullisharXiv – CS AI · Apr 67/10

🧠

Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity

Researchers developed a quantitative method to improve role consistency in multi-agent AI systems by introducing a role clarity matrix that measures alignment between agents' assigned roles and their actual behavior. The approach significantly reduced role overstepping rates from 46.4% to 8.4% in Qwen models and from 43.4% to 0.2% in Llama models during ChatDev system experiments.

🧠 Llama

AIBullisharXiv – CS AI · Mar 177/10

🧠

SAGE: Multi-Agent Self-Evolution for LLM Reasoning

Researchers introduced SAGE, a multi-agent framework that improves large language model reasoning through self-evolution using four specialized agents. The system achieved significant performance gains on coding and mathematics benchmarks without requiring large human-labeled datasets.

AINeutralarXiv – CS AI · Mar 177/10

🧠

Mechanistic Origin of Moral Indifference in Language Models

Researchers identified a fundamental flaw in large language models where they exhibit moral indifference by compressing distinct moral concepts into uniform probability distributions. The study analyzed 23 models and developed a method using Sparse Autoencoders to improve moral reasoning, achieving 75% win-rate on adversarial benchmarks.

AIBullisharXiv – CS AI · Mar 167/10

🧠

Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis

Researchers used mechanistic interpretability techniques to demonstrate that transformer language models have distinct but interacting neural circuits for recall (retrieving memorized facts) and reasoning (multi-step inference). Through controlled experiments on Qwen and LLaMA models, they showed that disabling specific circuits can selectively impair one ability while leaving the other intact.

AINeutralarXiv – CS AI · Mar 167/10

🧠

Semantic Invariance in Agentic AI

Researchers developed a testing framework to evaluate how reliably AI agents maintain consistent reasoning when inputs are semantically equivalent but differently phrased. Their study of seven foundation models across 19 reasoning problems found that larger models aren't necessarily more robust, with the smaller Qwen3-30B-A3B achieving the highest stability at 79.6% invariant responses.

AINeutralarXiv – CS AI · Mar 117/10

🧠

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Research analyzes FP4 quantization sensitivity across different layers in large language models using NVFP4 and MXFP4 formats on Qwen2.5 models. The study finds MLP projection layers are most sensitive to quantization, while attention layers show substantial robustness to FP4 precision reduction.

Page 1 of 3Next →