#llm-post-training News & Analysis

4 articles tagged with #llm-post-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

Researchers propose PVPO, a sample-efficient reinforcement learning method that improves LLM-based LEGO assembly generation by addressing PhysHack, a failure mode where structures satisfy physical constraints but lack semantic or geometric coherence. The approach uses selective data training and couples physical feasibility with geometric rewards, achieving better structural alignment while reducing reliance on rejection sampling.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

On Advantage Estimates for Max@K Policy Gradients

Researchers introduce MaxPO, a policy-gradient method that improves advantage estimation for max@K objectives in reinforcement learning. Aimed at LLM post-training, it reduces gradient variance with a Leave-Two-Out baseline that ensures the advantages are centered.
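
For readers unfamiliar with the max@K objective, the sketch below estimates it from a batch of sampled rewards. It is not MaxPO and does not implement the Leave-Two-Out baseline, whose details the summary does not give; it only illustrates the quantity being optimized, assuming max@K means the expected best reward among K independent samples. The function name `max_at_k` is hypothetical.

```python
from math import comb


def max_at_k(rewards, k):
    """Estimate E[max of k samples] from n >= k i.i.d. sampled rewards.

    Averages the maximum over every size-k subset of the n rewards,
    computed in closed form after sorting instead of by enumeration.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    # The reward at 0-based sorted position i is the maximum of exactly
    # C(i, k - 1) subsets: choose the other k - 1 members from the i
    # rewards that sort before it.
    weighted = sum(r * comb(i, k - 1) for i, r in enumerate(ordered))
    return weighted / comb(n, k)
```

With `k = 1` this reduces to the ordinary mean reward, and with `k = n` it returns the single best reward in the batch.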

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training

Researchers introduce AlphaToken, a framework that improves large language model post-training by scoring individual response tokens on their contribution to both task adaptation and preservation of pre-trained knowledge. The method uses gradient-based signals and a Fisher-drift proxy to identify high-value tokens, enabling more efficient fine-tuning and preference optimization while reducing catastrophic forgetting.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

Researchers introduce Dr. Post-Training, a novel framework that treats general training data as a regularizer rather than a selection pool for LLM post-training. The method projects target-data updates onto a feasible set defined by general data, improving performance across SFT, RLHF, and RLVR tasks while maintaining computational efficiency.
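
The idea of projecting an update onto a feasible set can be illustrated with a minimal sketch. The summary does not say how the feasible set is defined, so this example assumes the simplest case: a half-space of updates that do not conflict with the general-data gradient, in the style of the A-GEM continual-learning method. It is not the paper's algorithm, and `project_update` is a hypothetical name.

```python
def project_update(target_grad, general_grad):
    """Project target_grad onto the half-space {u : <u, general_grad> >= 0}.

    Assumption: the feasible set is this single half-space. Gradients
    are plain lists of floats of equal length.
    """
    dot = sum(t * g for t, g in zip(target_grad, general_grad))
    if dot >= 0:
        # Already feasible: the target update does not oppose the
        # general-data gradient, so it is returned unchanged.
        return list(target_grad)
    # dot < 0 implies general_grad is nonzero, so norm_sq > 0 here.
    norm_sq = sum(g * g for g in general_grad)
    scale = dot / norm_sq
    # Remove only the component that opposes general_grad.
    return [t - scale * g for t, g in zip(target_grad, general_grad)]
```

For example, `project_update([1.0, -1.0], [0.0, 1.0])` drops the conflicting second component, while an update that already agrees with the general-data gradient passes through untouched.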