AIBullisharXiv – CS AI · 2d ago7/10
🧠Researchers propose PEAR, a novel supervised fine-tuning (SFT) method that optimizes language models with downstream reinforcement learning in mind rather than in isolation. The approach uses importance sampling to reweight training data, addressing a critical distribution mismatch between offline SFT and online RL stages, achieving up to 14.6% performance gains on mathematical reasoning benchmarks.
AIBullisharXiv – CS AI · 2d ago7/10
🧠Researchers propose ESPO, an optimization technique that improves large language model training by detecting and terminating failed reasoning trajectories early rather than forcing completion. The method reduces computational waste by over 20% while achieving superior performance on mathematical reasoning benchmarks compared to standard PPO training.
AIBullisharXiv – CS AI · 2d ago7/10
🧠Researchers introduce PuzzleClone, a DSL-driven framework that automatically synthesizes large-scale, verifiable datasets for training LLMs on mathematical and logical reasoning tasks. The team generates PC-83K, a benchmark of 83,000+ diverse puzzles, and demonstrates that models fine-tuned on this dataset achieve substantial performance improvements across multiple logic and mathematical benchmarks.
AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers propose COSE, a self-evolution framework for large language models that uses confidence signals to filter noisy self-generated training feedback without external verifiers. The method demonstrates consistent improvements across 19 benchmarks and multiple model sizes (0.6B–4B parameters), achieving state-of-the-art performance in reasoning and mathematics tasks.
🧠 Llama
AIBullisharXiv – CS AI · 3d ago7/10
🧠SynthTools introduces an LLM-based pipeline for generating synthetic tool environments at scale, creating a dataset of 73,883 validated tools across 6,800 environments and 79,925 verifiable tasks. The framework demonstrates that agents trained on synthetic tool-use data can transfer capabilities to real APIs, addressing a critical bottleneck in agentic AI system development.
AIBullisharXiv – CS AI · 4d ago7/10
🧠GraphDancer is a new post-training framework that enables large language models to reason over heterogeneous graph-structured data by combining natural-language reasoning with graph function execution. The two-stage curriculum approach uses structural complexity ordering to teach models to explore and reason over graphs, achieving strong cross-domain generalization with only a 3B parameter backbone.
AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers introduce SAERL, a data engineering framework that uses Sparse Autoencoders to extract intrinsic signals from LLM internals for improved reinforcement learning post-training. The method achieves 3% accuracy gains and 20% faster convergence on math reasoning tasks by modeling data diversity, difficulty, and quality—demonstrating that model internals provide practical signals beyond external training data metrics.
AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers propose Divergence Proximal Policy Optimization (DPPO), a replacement for PPO's ratio clipping mechanism that better handles the large vocabularies in LLM fine-tuning. The new approach uses direct policy divergence estimates instead of noisy token probability ratios, offering improved training stability and efficiency.
AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers introduce MedGuideX, a medical language model trained on executable clinical decision logic extracted from practice guidelines, achieving 10.28% accuracy improvement over existing methods. The approach transforms procedural guideline structures into synthetic training data that teaches models both correct clinical decisions and counterfactual reasoning, with physician validation confirming improved explanation quality.
AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers propose GraphGPO, a novel reinforcement learning method that improves credit assignment in agentic tasks by aggregating trajectories into a state-transition graph rather than relying on coarse-grained outcome-based attribution. This approach enables step-level credit recognition and achieves state-of-the-art performance on challenging benchmarks while significantly improving training efficiency.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers identify weight gradient (Wgrad) quantization as the primary cause of instability in FP4 training of large language models, while forward and activation gradient quantization prove relatively benign. Using deterministic Hadamard rotations on AMD MI355X GPUs, they demonstrate that structured micro-scaling errors—not insufficient randomness—drive training divergence, offering insights for efficient LLM pretraining.
🧠 Llama
AIBullisharXiv – CS AI · May 127/10
🧠Researchers introduce CauSim, a framework that enables large language models to improve causal reasoning by constructing increasingly complex executable causal simulators. The approach transforms causal reasoning from a scarce-data problem into a scalable supervised learning task, allowing LLMs to generate synthetic training data and demonstrate improved performance across different representations.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers introduce G-Zero, a verifier-free framework that enables large language models to improve autonomously through self-play without relying on external judges or proxy models. The approach uses an intrinsic reward mechanism called Hint-δ to identify and address the Generator model's blind spots, achieving scalable self-evolution across unverifiable domains.
AIBullisharXiv – CS AI · May 117/10
🧠Researchers propose Adaptive Negative Sample Reinforcement (A-NSR) and Confidence-Weighted Negative Reinforcement (CW-NSR) to improve LLM reasoning by dynamically adjusting penalty weights during training rather than applying fixed penalties. The methods are evaluated on challenging math datasets using Qwen2.5-Math-1.5B, demonstrating that intelligent error correction can match or exceed complex frameworks like PPO.
AIBullisharXiv – CS AI · May 117/10
🧠Researchers introduce MedAction, a new framework and dataset designed to improve how large language models perform clinical diagnosis by simulating real-world multi-turn diagnostic processes. The approach addresses fundamental limitations in current medical LLMs through a tree-structured distillation pipeline that generates high-quality diagnostic trajectories, achieving state-of-the-art performance among open-source models.
AIBullisharXiv – CS AI · May 117/10
🧠Researchers introduce ROPD, a rubric-based on-policy distillation framework that replaces teacher logits with structured semantic rubrics for model alignment. The approach achieves up to 10x better sample efficiency than logit-based methods while enabling distillation from proprietary black-box LLMs, addressing a critical scalability limitation in current model training.
AIBullisharXiv – CS AI · May 117/10
🧠Researchers propose ESSAM, a novel training framework combining Evolution Strategies with Sharpness-Aware Maximization to fine-tune large language models for mathematical reasoning while dramatically reducing GPU memory requirements. The approach achieves comparable accuracy to reinforcement learning methods like PPO and GRPO while using 18-10× less memory, addressing a critical bottleneck in LLM development.
AIBullisharXiv – CS AI · May 117/10
🧠Researchers introduce rubric-grounded reinforcement learning, a framework that trains AI models using structured, multi-criterion rewards from an LLM judge rather than binary outcomes. Training Llama-3.1-8B on scientific documents achieved 71.7% normalized reward and demonstrated improved performance on multiple reasoning benchmarks, suggesting that document-grounded training signals can produce generalizable reasoning capabilities.
🧠 Llama
AIBullisharXiv – CS AI · May 97/10
🧠Researchers present a statistical-physics framework explaining how large language models develop multi-step reasoning through reinforcement learning with verifiable rewards (RLVR), modeling the process as inverse tree freezing in a concept network. They propose Annealed-RLVR, a timing-optimized training method that outperforms standard RLVR by applying supervised fine-tuning at peak frustration rather than after convergence, preventing policy collapse.
AIBullisharXiv – CS AI · May 97/10
🧠Researchers propose ADAPT, an online data reweighting framework that dynamically adjusts training sample importance during LLM training rather than using static offline selection methods. This approach maintains data diversity while improving generalization, outperforming existing offline curation techniques on instruction tuning and large-scale pretraining tasks.
AIBullisharXiv – CS AI · May 97/10
🧠Researchers propose Lorem Perturbation for Exploration (LoPE), a training technique that addresses the zero-advantage problem in reinforcement learning for large language models by prepending random Latin-based text to prompts, enabling broader reasoning exploration across 1.7B to 7B parameter models.
🏢 Perplexity
AIBullisharXiv – CS AI · May 77/10
🧠Researchers propose Anchored Learning, a new fine-tuning method that prevents catastrophic forgetting in large language models by controlling distributional drift through a dynamically evolving reference anchor. The technique achieves near-optimal performance gains while reducing degradation from over 53% to under 5% on benchmark tasks.
AIBullisharXiv – CS AI · May 77/10
🧠Researchers propose skill neologisms—soft tokens added to LLM vocabularies—as a scalable approach to continual learning that enables models to acquire new capabilities without catastrophic forgetting or weight updates. The method demonstrates that independently trained skill tokens can compose zero-shot and work with out-of-distribution tasks, offering a practical alternative to fine-tuning.
AIBullisharXiv – CS AI · May 77/10
🧠Researchers introduce RFT-FaultBench, the first comprehensive benchmark for diagnosing failures in reinforcement fine-tuning of large language models, and propose RFT-FM, an automated framework for detecting, diagnosing, and remediating training failures. This addresses a critical gap in LLM post-training reliability where practitioners currently rely on manual inspection.
AIBullisharXiv – CS AI · May 17/10
🧠Researchers introduce RoundPipe, a novel pipeline scheduling algorithm that enables efficient fine-tuning of large language models on consumer-grade GPUs by eliminating the weight binding constraint that causes computational bottlenecks. The system achieves 1.48-2.16x speedups over existing approaches and enables fine-tuning of models with up to 235 billion parameters on standard hardware.