VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct
VeriEvol is a new framework for scaling multimodal mathematical reasoning in AI by treating data creation as a verifiable problem, combining evolved prompts with a multi-source verifier to ensure answer reliability. In the reported experiments, visual-math accuracy under supervised fine-tuning (SFT) rises from 35.42% to 54.73% as the training set scales from 10K to 250K samples, and reinforcement learning adds a further 3.88 percentage points.
VeriEvol addresses a fundamental challenge in scaling AI training: as datasets grow, maintaining label quality becomes increasingly difficult. Rather than simply generating more data and trusting its labels, the framework decouples two dimensions, prompt difficulty and answer correctness, before applying reinforcement learning. The separation matters because harder questions paired with unverified answers can harm training rather than help it.
The technical approach combines route-specific evolution operators that generate harder, image-grounded questions with HTV-Agent, a verifier that uses hypothesis-test falsification across multiple sources to validate answers. This dual-component design extends beyond existing GRPO-style reinforcement learning recipes by ensuring data quality upstream rather than relying on policy updates to handle noisy labels.
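To make the verification idea concrete, here is a minimal sketch of a falsification-style label filter. It is not HTV-Agent's actual implementation, which this summary does not describe in detail. It assumes each "source" is an independent check that can support, refute, or abstain on a candidate answer, and that a label is kept only when no source refutes it and at least `min_support` sources support it. All names here (`Sample`, `Verdict`, `verify`, `filter_verified`, `make_lookup_source`) are hypothetical, and the lookup-table sources stand in for whatever real checkers a pipeline would use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Sequence


class Verdict(Enum):
    SUPPORTS = "supports"
    REFUTES = "refutes"
    ABSTAINS = "abstains"


@dataclass(frozen=True)
class Sample:
    question: str
    answer: str  # candidate label, treated as the hypothesis under test


# Hypothetical interface: a source inspects a sample and returns a verdict.
Source = Callable[[Sample], Verdict]


def verify(sample: Sample, sources: Sequence[Source], min_support: int = 2) -> bool:
    """Accept the label only if nothing refutes it and enough sources support it."""
    support = 0
    for source in sources:
        verdict = source(sample)
        if verdict is Verdict.REFUTES:
            return False  # a single successful falsification rejects the label
        if verdict is Verdict.SUPPORTS:
            support += 1
    return support >= min_support


def filter_verified(
    samples: Sequence[Sample], sources: Sequence[Source], min_support: int = 2
) -> List[Sample]:
    """Keep only the samples whose labels survive verification."""
    return [s for s in samples if verify(s, sources, min_support)]


def make_lookup_source(table: Dict[str, str]) -> Source:
    """Toy source for illustration: compares the label against a reference table."""
    def source(sample: Sample) -> Verdict:
        known = table.get(sample.question)
        if known is None:
            return Verdict.ABSTAINS
        return Verdict.SUPPORTS if known == sample.answer else Verdict.REFUTES
    return source


sources = [
    make_lookup_source({"q1": "4", "q2": "9"}),
    make_lookup_source({"q1": "4", "q2": "7"}),
]
samples = [Sample("q1", "4"), Sample("q2", "9"), Sample("q3", "1")]

kept = filter_verified(samples, sources)
print([s.question for s in kept])  # ['q1']
```

In this toy run, `q1` is supported by both sources and kept, `q2` is refuted by the second source and dropped, and `q3` is dropped because both sources abstain. Treating abstention as insufficient evidence is a design choice of the sketch: it trades dataset size for label reliability.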
The empirical results demonstrate meaningful improvements across a five-benchmark visual-math suite. The breakdown of gains, 1.82 percentage points from evolved prompts and 2.06 from verified answers, shows that both components contribute substantially; together they sum to the 3.88-point figure cited for the reinforcement learning stage. Under SFT alone, accuracy rises from a 35.42% baseline at 10K samples to 54.73% at 250K, an improvement of 19.31 percentage points, which suggests that the evolved, harder problems continue to improve generalization as the dataset grows.
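The arithmetic behind these figures can be checked directly. The snippet below uses only the numbers quoted in this summary; the variable names are illustrative.

```python
# Accuracy figures quoted in this summary, in percent.
sft_at_10k = 35.42
sft_at_250k = 54.73

# Component gains quoted in this summary, in percentage points.
gain_evolved_prompts = 1.82
gain_verified_answers = 2.06

# Absolute SFT gain from scaling 10K -> 250K samples.
sft_gain = round(sft_at_250k - sft_at_10k, 2)

# Combined contribution of the two components.
component_sum = round(gain_evolved_prompts + gain_verified_answers, 2)

print(sft_gain)       # 19.31
print(component_sum)  # 3.88
```

Both values are differences or sums of accuracies, so they are percentage points rather than relative percentages.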
The full release of prompts, data, models, code, and verifier traces sets a transparency standard that allows downstream researchers to audit the pipeline rather than treating it as a black box. This approach may influence how future AI training frameworks balance scale with verifiability, particularly in domains where correctness is verifiable but expensive to confirm.
- VeriEvol treats prompt difficulty and answer reliability as separate scaling problems, raising multimodal mathematical reasoning accuracy to 54.73% on visual-math benchmarks.
- The framework combines evolved prompts with multi-source verification, yielding a 3.88 percentage point improvement over the reinforcement learning baseline.
- Full transparency through released code, data, and verifier traces enables auditing of the training pipeline at scale rather than only inspecting final outputs.
- The approach is agnostic to the underlying RL recipe, allowing integration with existing GRPO-style methods without architectural changes.
- Hypothesis-test falsification for answer verification introduces a novel mechanism for ensuring label quality as dataset volume increases.