Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts
Researchers introduce Visual-SDPO, a self-distillation framework that lets code-generating LLMs improve the quality of the visual artifacts they produce by learning from feedback on the rendered output. The method reports improvements of 10+ points over zero-shot baselines on code-to-visual generation benchmarks, with no added cost at inference time.
Visual-SDPO addresses a fundamental limitation in code-generating AI systems: the inability to observe and correct visual defects before committing to code. LLMs typically write the code for charts, web pages, and slides without ever seeing the rendered result, which leads to common defects such as misaligned elements and text overflow. This research introduces a training framework in which a teacher model receives privileged access to rendered visual feedback, then distills that knowledge into a student model that generates better code without needing visual feedback at inference time.
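The teacher-to-student transfer described above can be illustrated with a token-level distillation loss. This is a minimal sketch, not the authors' implementation: it assumes the teacher's logits were computed with the rendered feedback in context and the student's without it, and it uses a plain KL divergence, which the source does not confirm as the objective. The names `softmax`, `kl_divergence`, and `distillation_loss` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student).

    Assumption: teacher_logits come from a forward pass that saw the
    rendered visual feedback; student_logits come from the prompt alone.
    Each argument is a list of per-token logit vectors.
    """
    kls = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(kls) / len(kls)


# Toy example: two tokens over a three-word vocabulary.
teacher = [[2.0, 0.5, 0.1], [0.1, 3.0, 0.2]]
student = [[1.0, 1.0, 1.0], [0.1, 3.0, 0.2]]
loss = distillation_loss(teacher, student)
```

In this toy case only the first token contributes to the loss, because the student already matches the teacher on the second one. Minimizing such a loss pulls the feedback-free student toward what the feedback-aware teacher would have generated.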
The key innovation is spatially targeted supervision through Visual-Grounded Code Credit Weighting, which traces each detected visual defect back to the specific code statements responsible for it rather than weighting all code equally. This concentrates the learning signal on the statements that caused the defect. Combined with sequence-level policy-optimization rewards for executable, high-quality outputs, the framework treats both successful and failed executions as learning opportunities.
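One way to picture this kind of credit weighting is to boost the loss weight of any source line whose rendered element overlaps a detected defect region. The sketch below is an illustration under stated assumptions, not the paper's algorithm: the bounding-box overlap test, the line-to-element mapping, the `boost` parameter, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in rendered-image coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # Two rectangles intersect only if they overlap on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def line_weights(num_lines, element_boxes, defect_boxes, boost=2.0):
    """Return one loss weight per source line.

    element_boxes maps a 0-based line index to the Box that line renders
    (an assumed mapping; how it is obtained is outside this sketch).
    Lines whose element touches any defect get weight 1 + boost.
    """
    weights = [1.0] * num_lines
    for line, box in element_boxes.items():
        if any(box.overlaps(d) for d in defect_boxes):
            weights[line] = 1.0 + boost
    return weights


def weighted_loss(per_line_loss, weights):
    """Weighted mean of per-line losses, normalized by total weight."""
    total = sum(w * l for w, l in zip(weights, per_line_loss))
    return total / sum(weights)


# Toy example: line 1 renders a label that overflows into a defect region.
element_boxes = {1: Box(0, 0, 100, 20), 2: Box(0, 30, 100, 50)}
defect_boxes = [Box(90, 5, 120, 15)]
weights = line_weights(4, element_boxes, defect_boxes)
```

Here only line 1 receives the boosted weight, so its loss counts three times as much as any other line in the weighted average. Lines that render nothing, or whose elements sit clear of every defect, keep the default weight.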
For the AI development community, this work demonstrates how self-distillation can bridge the gap between code generation and visual quality without runtime overhead. Across three benchmark categories (charts, web interfaces, and slides), the method consistently outperforms baseline approaches by 2.4+ points while requiring fewer training iterations. This efficiency has practical implications for training cost and for deployment at scale.
Looking forward, the approach suggests broader applications in multimodal code generation, wherever intermediate execution feedback can improve output quality. The unified backbone supporting multiple visual generation tasks hints that specialized code-generation models could be consolidated, which may influence how development tools integrate AI capabilities.
- Visual-SDPO uses rendered feedback as privileged training context to improve code generation quality without inference-time costs.
- Spatial credit assignment traces visual defects to specific code statements, enabling targeted learning.
- The method achieves 10+ point improvements over zero-shot baselines on chart, UI, and slide generation benchmarks.
- The framework treats execution errors as learnable signals, so it learns from both failed and successful code.
- The unified multi-task approach demonstrates potential for consolidating visual artifact generation across domains.