🧠 AI | 🟢 Bullish | Importance 6/10

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

arXiv – CS AI|Sara Papi, Luisa Bentivogli|June 1, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce DOA (Decoder-Only Attention), a training-free method that enables simultaneous speech-to-text translation with decoder-only SpeechLLMs by extracting alignment signals from their self-attention. The approach delivers low-latency, long-form translation with quality comparable to offline decoding, without requiring model retraining.

Analysis

The research addresses a fundamental challenge in real-time speech translation: deciding, at each step of a live stream, whether to wait for more input or to generate output. Traditional encoder-decoder models use cross-attention to explicitly align source and target, but newer decoder-only SpeechLLMs rely exclusively on self-attention, so it has been unclear whether they expose a usable alignment signal.
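The wait-or-generate decision described above can be pictured as a loop that repeatedly asks a policy for a READ or WRITE action. The sketch below is illustrative only: `run_schedule`, `alternate`, and the chunk and token counts are hypothetical names and values chosen for the example, not anything taken from the paper.

```python
from typing import Callable, List

# Hypothetical policy signature: (chunks_read, tokens_written) -> "READ" or "WRITE".
Policy = Callable[[int, int], str]


def run_schedule(num_chunks: int, num_tokens: int, policy: Policy) -> List[str]:
    """Return the READ/WRITE actions taken on a toy stream.

    num_chunks: audio chunks that will eventually arrive.
    num_tokens: target tokens the (imaginary) model will produce.
    """
    actions: List[str] = []
    read = written = 0
    while written < num_tokens:
        if read >= num_chunks:
            # Source is exhausted: nothing left to wait for, so keep writing.
            action = "WRITE"
        else:
            action = policy(read, written)
        if action == "READ":
            read += 1
        else:
            written += 1
        actions.append(action)
    return actions


def alternate(read: int, written: int) -> str:
    """Toy policy: stay one chunk ahead of the emitted tokens."""
    return "READ" if read <= written else "WRITE"


schedule = run_schedule(num_chunks=3, num_tokens=4, policy=alternate)
print(schedule)
```

In a real system the policy would inspect the model's state rather than simple counters; the loop structure is the part that carries over.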

This work builds on the broader trend toward unified, decoder-only architectures in large language models. As foundation models consolidate around self-attention-only designs for efficiency and scaling, researchers must adapt real-time processing pipelines to these new constraints. The paper validates DOA on production-grade models like Phi4-Multimodal and Qwen3-Omni, demonstrating practical applicability.

The significance lies in its training-free nature. Most streaming translation systems require expensive fine-tuning or rely on simplistic heuristics like wait-k policies, which follow a fixed schedule and ignore linguistic context. DOA instead extracts alignment information from the self-attention of an existing model, eliminating retraining overhead. This lowers barriers for deploying simultaneous translation across diverse language pairs and resource-constrained environments.
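To make the contrast concrete, here is a minimal sketch of a wait-k rule next to a toy attention-based rule. The wait-k function follows the standard definition. The attention rule is a hypothetical threshold heuristic written for illustration; this summary does not specify DOA's actual criterion, and the names `attention_should_write`, `last_frames`, and `threshold` are assumptions.

```python
from typing import Sequence


def wait_k_should_write(chunks_read: int, tokens_written: int, k: int) -> bool:
    """Wait-k: the next token may be emitted once k + tokens_written
    source chunks have arrived, regardless of what was said."""
    return chunks_read >= k + tokens_written


def attention_should_write(
    attn_over_audio: Sequence[float], last_frames: int, threshold: float
) -> bool:
    """Hypothetical attention rule (an assumption, not DOA itself).

    attn_over_audio: attention a candidate token places on each audio
    frame received so far. If a large share of that attention falls on
    the newest frames, the token may depend on speech that is still
    incomplete, so the rule says to wait for more input.
    """
    total = sum(attn_over_audio)
    if total == 0:
        return False  # no evidence either way: wait
    start = max(0, len(attn_over_audio) - last_frames)
    recent_share = sum(attn_over_audio[start:]) / total
    return recent_share < threshold


# Wait-k with k=3: the first token is allowed after 3 chunks, the second is not yet.
print(wait_k_should_write(chunks_read=3, tokens_written=0, k=3))
print(wait_k_should_write(chunks_read=3, tokens_written=1, k=3))

# Attention concentrated on older frames: emit. On the newest frames: wait.
print(attention_should_write([0.4, 0.4, 0.1, 0.1], last_frames=2, threshold=0.3))
print(attention_should_write([0.1, 0.1, 0.4, 0.4], last_frames=2, threshold=0.3))
```

The wait-k decision depends only on counters, while the attention rule changes with what the model attends to, which is the kind of content-aware signal the training-free approach relies on.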

The ability to handle long-form translation matters for real-world scenarios like conferences, live broadcasts, and multilingual meetings where maintaining quality over extended durations is critical. The findings suggest decoder self-attention contains richer alignment information than previously assumed, opening new research directions for understanding how these models encode structural relationships implicitly.

Key Takeaways

→ DOA extracts alignment signals from self-attention in decoder-only SpeechLLMs without requiring model retraining.
→ The method achieves near-offline translation quality while maintaining low latency for simultaneous speech-to-text translation.
→ Training-free approaches reduce deployment friction and make real-time translation more accessible across different models.
→ Decoder self-attention contains sufficient structural information to guide streaming translation policies effectively.
→ Long-form translation capabilities enable practical applications in live conference interpretation and multilingual broadcasting.

#speech-translation #simultaneous-translation #speechllms #decoder-only #streaming-policy #low-latency #training-free #self-attention

