LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training
LEAF (Low-rank Exploration with Adaptive Forking) is a tree-based reinforcement learning method for post-training speech-aware large language models. It improves credit assignment by identifying response prefixes shared across sampled rollouts and assigning rewards at the span level rather than uniformly across tokens. The approach outperforms existing GRPO-style methods without additional computational overhead, and lets smaller models match or exceed larger baselines.
LEAF addresses a fundamental limitation in current speech-aware LLM training methodologies. Existing GRPO-based approaches distribute reward signals uniformly across all tokens in a response, failing to recognize that speech-conditioned completions frequently share common initial sequences before diverging at critical decision points. By recovering this latent tree structure from sampled rollouts, LEAF enables more granular credit assignment that better reflects which specific decisions drive performance improvements.
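To make the contrast with uniform token-level rewards concrete, here is a minimal Python sketch of prefix-based span credit. It is not the paper's algorithm: the function name `span_advantages` and the advantage rule (mean reward of all rollouts sharing a prefix, minus the group mean) are assumptions chosen for illustration, and responses are plain lists of token IDs.

```python
from collections import defaultdict


def span_advantages(responses, rewards):
    """Return one advantage per token for each sampled response.

    Assumed rule (illustrative, not taken from the source): a token's
    advantage is the mean reward of every response that shares the prefix
    ending at that token, minus the mean reward of the whole group.
    Tokens inside a shared span therefore receive the same value.
    Assumes at least one response.
    """
    # Collect the rewards of all responses passing through each prefix,
    # which implicitly recovers the prefix tree of the sampled group.
    prefix_rewards = defaultdict(list)
    for tokens, reward in zip(responses, rewards):
        for end in range(1, len(tokens) + 1):
            prefix_rewards[tuple(tokens[:end])].append(reward)

    group_mean = sum(rewards) / len(rewards)

    advantages = []
    for tokens in responses:
        per_token = []
        for end in range(1, len(tokens) + 1):
            shared = prefix_rewards[tuple(tokens[:end])]
            per_token.append(sum(shared) / len(shared) - group_mean)
        advantages.append(per_token)
    return advantages


if __name__ == "__main__":
    # Two rollouts share the prefix [1, 2] and diverge on the last token.
    responses = [[1, 2, 3], [1, 2, 4], [5, 6, 7]]
    rewards = [2.0, 0.0, 1.0]
    print(span_advantages(responses, rewards))
```

In this toy group, the shared prefix `[1, 2]` gets zero advantage because its two continuations average out to the group mean, while the diverging tokens carry +1 and -1. A uniform scheme would instead assign +1 or -1 to every token of those two responses, including the tokens they have in common.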
The technical contribution emerges from observations about how speech models naturally generate responses. Rather than implementing expensive online branching during inference, LEAF operates retrospectively on completed responses, identifying high-surprisal boundaries where meaningful divergences occur and grouping responses by prefix alignment. This design choice maintains computational efficiency while capturing valuable structural information.
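The retrospective boundary step can be sketched as follows, assuming per-token log-probabilities are available from the completed rollout. The threshold rule, its default value, and the helper names `surprisal_boundaries` and `split_spans` are hypothetical; the source does not state how boundaries are actually selected.

```python
import math


def surprisal_boundaries(token_logprobs, threshold=2.0):
    """Indices of tokens whose surprisal (-log p, in nats) exceeds threshold.

    The fixed threshold is an assumption made for this sketch.
    """
    return [i for i, lp in enumerate(token_logprobs) if -lp > threshold]


def split_spans(n_tokens, boundaries):
    """Split token positions [0, n_tokens) into half-open (start, end) spans.

    Each span starts at position 0 or at a boundary index.
    """
    starts = sorted({0, *(b for b in boundaries if 0 < b < n_tokens)})
    ends = starts[1:] + [n_tokens]
    return list(zip(starts, ends))


if __name__ == "__main__":
    # Made-up token probabilities for one completed response.
    probs = [0.9, 0.8, 0.05, 0.7, 0.01]
    logprobs = [math.log(p) for p in probs]
    cuts = surprisal_boundaries(logprobs)
    print(cuts, split_spans(len(logprobs), cuts))
```

Because this runs on log-probabilities that sampling already produces, it adds no extra generation passes, which is consistent with the claim that no online branching is needed.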
The empirical results bear directly on training efficiency. LEAF consistently improves over GRPO on both speech question answering and speech translation under the same rollout and adaptation budgets. More significantly, smaller LEAF-trained models match or exceed state-of-the-art full-parameter baselines, which lowers both training costs and deployment requirements for speech-enabled systems.
The span-level credit assignment and boundary selection mechanisms are theoretically grounded, which suggests the method may generalize beyond the tested domains. As organizations increasingly deploy multimodal systems that require speech understanding, more efficient training methods directly affect which models are economically viable to develop and scale.
- LEAF improves credit assignment in speech-aware LLMs by recognizing shared response prefixes and assigning span-level advantages rather than uniform token-level rewards.
- The method achieves superior performance without online branching or additional inference cost, maintaining the same rollout and adaptation budgets as baseline GRPO approaches.
- Smaller models trained with LEAF match or exceed current state-of-the-art full-parameter baselines on speech translation and question answering tasks.
- Retrospective tree structure recovery enables efficient encoding of the natural branching patterns that emerge in speech-conditioned completions.
- The approach has theoretical justification for its span-level credit assignment and boundary selection mechanisms.