AI · Neutral · arXiv – CS AI · 9h ago · 6/10
Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks
Researchers propose Direct Reasoning Optimization (DRO), a constrained reinforcement learning framework that improves LLM training on unverifiable tasks by combining token-level reasoning rewards with rubric-based feasibility gates. The approach demonstrates faster, more sample-efficient learning across scientific, medical, legal, and financial domains.
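The summary does not say how the two signals are combined, so the following is only a minimal sketch of one plausible reading: a hard pass/fail rubric gate sitting in front of an averaged token-level reward. The function name `gated_reward`, its parameters, the averaging step, and the fallback value are all hypothetical illustrations, not the paper's formulation.

```python
from typing import Sequence


def gated_reward(
    token_rewards: Sequence[float],
    rubric_checks: Sequence[bool],
    infeasible_reward: float = 0.0,
) -> float:
    """Illustrative gated reward (assumption, not the DRO definition).

    token_rewards: hypothetical per-token reasoning scores for one response.
    rubric_checks: hypothetical pass/fail results, one per rubric item.
    infeasible_reward: value returned when any rubric item fails.
    """
    if not token_rewards:
        raise ValueError("token_rewards must be non-empty")
    # Feasibility gate: a single failed rubric item blocks the reward.
    if not all(rubric_checks):
        return infeasible_reward
    # Gate passed: aggregate the token-level signal by simple averaging.
    return sum(token_rewards) / len(token_rewards)
```

In this sketch a response that passes every rubric item receives its mean token-level score, while one that fails any item receives the fallback value regardless of how its tokens scored.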