#qwen News & Analysis

87 articles tagged with #qwen. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Social World Model for Lifelong Social Intelligence

Researchers propose the Social World Model, a framework for continuous learning in language agents that decomposes social interactions in a structured way across five dimensions. The approach shows that smaller open-source models like Qwen2.5-7B can achieve social intelligence competitive with closed-source alternatives while maintaining performance across difficulty levels.

🧠 Gemini

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Self-Improvement Can Self-Regress: The Rise-and-Collapse Failure Mode of LLM Self-Training

Researchers identify a critical failure mode in LLM self-training where models improve rapidly then collapse during REINFORCE post-training on coding tasks. The study tests three intervention strategies—CARE, early stopping, and GRPO—finding that effectiveness varies by model size and that none fully eliminates the within-task policy over-optimization problem.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

Researchers introduce ISE (Intent → Simulate → Execute), a three-stage framework for training OS agents that generates 43,956 structured intents and 23,132 multi-turn trajectories with live execution validation. Fine-tuning Qwen3-8B on this dataset achieves 37.7% pass@1 on ClawEval, outperforming GPT-4o zero-shot and the larger Qwen3-32B model, demonstrating that high-quality synthetic data design can overcome model scale limitations.

🧠 GPT-4

AI · Bearish · arXiv – CS AI · Jun 11 · 7/10

Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

Researchers quantified how undesirable behaviors transfer from teacher to student language models during distillation, even when trained only on benign data. Testing Llama-2 and Qwen2.5 models with varying steering strengths revealed different vulnerability profiles: Llama-2 showed a sharp behavioral transfer threshold, while Qwen2.5 exhibited continuous, higher-rate transfer of unwanted characteristics.

🧠 GPT-4 · 🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

SUPERNOVA introduces a framework for extending reinforcement learning with verifiable rewards (RLVR) beyond STEM fields by systematically curating data from natural instruction datasets. A 25K-instance dataset, used to train smaller models, yields gains of 64.4 percentage points on complex reasoning benchmarks, with improvements generalizing across model scales and families.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

Researchers introduce DragOn, a large-scale benchmark dataset with 286K training screenshots and 3.5M tasks designed to improve GUI agents' ability to perform drag-based interactions like highlighting, resizing, and swiping. The dataset addresses a critical gap where drag-grounding capabilities lag significantly behind click-grounding in AI models controlling desktops and mobile devices.

🧠 Claude

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Making Expert Reasoning Learnable with Self-Distillation

Researchers propose Distribution Aligned Imitation Learning (DAIL), a self-distillation method that improves LLM reasoning by converting expert human solutions into computational training data. The technique achieves significant performance gains on frontier models using fewer than 1000 expert examples, addressing the challenge that expert solutions are typically written for humans rather than machines.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

Researchers introduce ThinkSwitch, a method that distills reasoning capabilities from large language models into smaller, more efficient models using LoRA and weight interpolation. The technique improves performance on mathematical and scientific reasoning tasks while maintaining low computational costs, doubling accuracy on AIME problems at minimal expense.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer

Researchers discovered the 'Alignment Curse,' revealing that stronger text-audio alignment in multimodal AI models inadvertently enables more effective transfer of text-based jailbreak attacks to audio channels. The finding exposes a critical safety vulnerability in recent omni-models like Qwen, suggesting current audio safety evaluations significantly underestimate risks originating from text modalities.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

DeepTool is a new AI framework that enhances large language models' ability to reason through tool use by implementing process-supervised reinforcement learning. The system dramatically improves performance on mathematical benchmarks like AIME24 (3.2% to 40.4%) while maintaining token efficiency through interleaved thinking and action.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Researchers propose DenseSteer, a training-free framework that improves mathematical reasoning in small language models (≤3B parameters) by steering internal representations toward denser reasoning patterns. The method demonstrates that smaller models can match larger ones' performance by executing fewer, more information-rich reasoning steps rather than verbose chain-of-thought processes.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Alibaba's Qwen team released Qwen-VLA, a unified foundation model that combines vision, language, and action capabilities for robotics across multiple tasks and robot types. The model demonstrates strong performance on manipulation, navigation, and trajectory prediction benchmarks while generalizing well to out-of-distribution scenarios and real-world robot deployments.

AI · Bearish · arXiv – CS AI · May 28 · 7/10

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

Researchers discovered that chain-of-thought distillation—training smaller AI models to imitate larger models' reasoning—produces higher answer accuracy on medical benchmarks while simultaneously degrading reasoning quality. A Qwen3-8B student model improved from 74.7% to 84.4% accuracy on MedQA-USMLE, yet error rates in individual reasoning steps jumped from 30.6% to 50.3%, suggesting models learn to mimic expert-like output without grounding claims in sound logic.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

Researchers introduce ReMoE, a router fine-tuning framework that optimizes Mixture-of-Experts language models for memory-constrained inference by increasing expert reuse and reducing storage I/O overhead. The approach improves expert reuse by 26% while maintaining performance, delivering up to 1.99× decode speedup on edge devices.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Search-E1 introduces a simplified self-evolution method for search-augmented reasoning agents that achieves competitive performance through vanilla GRPO and self-distillation, without external supervision or complex auxiliary systems. The approach reaches 0.440 average EM on QA benchmarks with Qwen2.5-3B, demonstrating that elaborate post-training machinery may be unnecessary for effective agent development.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

Researchers introduce M2A, a novel model merging paradigm that combines mathematical and agentic reasoning in large language models without retraining. The approach improves a Qwen3-8B model's software engineering benchmark performance from 44.0% to 51.2% by strategically injecting mathematical reasoning capabilities along directions that preserve agent behavior.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Efficient Data Selection for Multimodal Models via Incremental Optimization Utility

Researchers introduce One-Step-Train (OST), a new data selection framework for Large Multimodal Models that uses incremental optimization to identify high-quality training samples. The method reduces computational costs by 43% while outperforming existing approaches like LLM-as-a-Judge, demonstrating significant efficiency gains in multimodal model training.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

Researchers demonstrate that large language models encode social role granularity—from individual to institutional perspectives—as a structured geometric axis in their internal representations. Using activation steering, they show this axis is causally manipulable, enabling controlled shifts in response scope across different models.

🧠 Llama

AI · Neutral · arXiv – CS AI · Apr 20 · 7/10

Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

Researchers demonstrate through causal experiments that hallucinations in language models arise from early trajectory commitments governed by asymmetric attractor dynamics. Using controlled prompt bifurcation on Qwen2.5-1.5B, they show that 44% of test prompts diverge into factual or hallucinated outputs at the first token, with activation patterns revealing that corrupting correct trajectories is far easier than recovering hallucinated ones—suggesting hallucination represents a stable but difficult-to-escape attractor state.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Researchers introduce Lightning OPD, an offline on-policy distillation framework that eliminates the need for live teacher inference servers during large language model post-training. By enforcing 'teacher consistency' (using the same teacher model for both supervised fine-tuning and distillation), the method achieves performance comparable to standard OPD while delivering a 4x speedup and significantly reducing infrastructure costs.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall introduces a standardized framework unifying tool-use representation, training data, and evaluation for LLM agents. The framework combines 22k+ tools and 390k+ training instances with a unified evaluation methodology, enabling fine-tuned models like Qwen3-8B to achieve 93% precision, surpassing GPT, Gemini, and Claude on specific benchmarks.

🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Apr 10 · 7/10

The ATOM Report: Measuring the Open Language Model Ecosystem

A comprehensive study of the open language model ecosystem reveals that Chinese AI models, including Qwen and DeepSeek, have overtaken U.S.-developed models like Meta's Llama since summer 2025, with the gap continuing to widen. The research analyzes ~1.5K mainline open models across adoption metrics, market share, and performance to document this significant shift in AI development geography.

$ATOM · 🏢 Hugging Face · 🧠 Llama

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

Researchers propose a new method for aligning AI language models with human preferences that addresses stability issues in existing approaches. The technique uses relative density ratio optimization to achieve both statistical consistency and training stability, showing effectiveness with Qwen 2.5 and Llama 3 models.

🧠 Llama

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

Researchers propose Continuous Softened Retracing reSampling (CSRS) to improve the self-evolution of Multimodal Large Language Models by addressing biases in feedback mechanisms. The method uses continuous reward signals instead of binary rewards and achieves state-of-the-art results on mathematical reasoning benchmarks like MathVision using Qwen2.5-VL-7B.
