AI · Bullish · arXiv – CS AI · May 29 · 7/10
🧠Researchers introduce PuzzleClone, a DSL-driven framework that automatically synthesizes large-scale, verifiable datasets for training LLMs on mathematical and logical reasoning tasks. The team generates PC-83K, a benchmark of 83,000+ diverse puzzles, and demonstrates that models fine-tuned on this dataset achieve substantial performance improvements across multiple logic and mathematical benchmarks.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce OPT-BENCH, a framework for training LLMs on NP-hard optimization problems using quality-aware reinforcement learning. On Qwen2.5-7B, the approach achieves a 93.1% success rate and a 46.6% quality ratio, substantially outperforming GPT-4o, with demonstrated transfer benefits across mathematics, logic, and reasoning tasks.
🧠 GPT-4
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers propose Theorem-SFT, a novel supervised fine-tuning approach that teaches language models to apply mathematical rules explicitly rather than memorize surface-level correlations between problems and solutions. The method demonstrates significant performance improvements across benchmarks while revealing that feed-forward layers, not memorization itself, are the primary locus of reasoning capability.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce rubric-grounded reinforcement learning, a framework that trains AI models using structured, multi-criterion rewards from an LLM judge rather than binary outcomes. Training Llama-3.1-8B on scientific documents achieved 71.7% normalized reward and demonstrated improved performance on multiple reasoning benchmarks, suggesting that document-grounded training signals can produce generalizable reasoning capabilities.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers developed a method using neural cellular automata (NCA) to generate synthetic data for pre-training language models, achieving up to 6% improvement in downstream performance with only 164M synthetic tokens. This approach outperformed traditional pre-training on 1.6B natural language tokens while being more computationally efficient and transferring well to reasoning benchmarks.
AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠Researchers introduce Dynamic Infilling Anchors (DIA), a training-free method that improves how diffusion large language models generate structured outputs like JSON or reasoning templates. By dynamically adjusting generation length constraints, DIA achieves better format compliance and accuracy on mathematical reasoning benchmarks without requiring model retraining.
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers propose Bottom-up Policy Optimization (BuPO), a novel reinforcement learning approach that optimizes internal layers of language models rather than treating them as unified policies. The study reveals that LLMs contain distinct internal policy structures with different entropy patterns across layers, offering new insights into how transformer-based models process reasoning tasks.
🧠 Llama
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers propose SC-SDPO, an improved machine learning technique that enhances how large language models learn from their own feedback during training. By weighting training examples based on question difficulty, the method achieves 3-4% performance gains on reasoning benchmarks while maintaining stable training dynamics.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduced EpiQAL, the first benchmark for evaluating large language models on epidemiological reasoning tasks. Testing 15 models reveals significant performance gaps in multi-step inference and evidence synthesis, indicating current LLMs struggle with population-level disease analysis despite their general capabilities.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠NoisyCoconut is an inference-time method that improves LLM reliability by injecting controlled noise into internal representations to generate diverse reasoning paths, enabling models to abstain when uncertain without requiring retraining. The technique reduces error rates from 40-70% to below 15% on mathematical reasoning tasks through unanimous agreement among noise-perturbed paths, offering practical reliability improvements compatible with existing models.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce HTPO, a novel reinforcement learning algorithm that optimizes Large Language Models by assigning different learning objectives to different tokens based on their functional roles in reasoning tasks. The method achieves significant performance improvements on challenging benchmarks like AIME, demonstrating that granular token-level control can better balance exploration and exploitation in AI training.
AI · Bullish · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers introduce KnowRL, a reinforcement learning framework that improves large language model reasoning by using minimal, strategically-selected knowledge points rather than verbose hints. The approach achieves state-of-the-art results on reasoning benchmarks at the 1.5B parameter scale, with the trained model and code made publicly available.