IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment
Researchers introduce IEA, a conversational AI agent that lets amateur users edit images through natural language by learning to operate parameterized editing tools in an interpretable action space. The system is trained with a three-stage pipeline that combines supervised fine-tuning, reinforcement learning with rewards for editing quality, and fine-tuning on synthetic data. It produces transparent edit traces and outperforms both generative and tool-calling baselines in user studies.
IEA prioritizes interpretability and user control over black-box generation, a notable departure from how generative AI systems usually approach creative tasks. The research addresses a real gap in current image editing software: amateur users often do not get what they intend, whether from fixed filters or from generative models that produce artifacts and lack explainability. Because the agent is trained to manipulate explicit editing tools step by step rather than to generate images directly, the system stays transparent: users can inspect and debug each edit decision.
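To make the idea of an inspectable edit trace concrete, here is a minimal sketch in Python. It is not IEA's implementation: the tool names (`brightness`, `contrast`), the `EditStep` record, and the toy list-of-floats "image" are all assumptions chosen for illustration. It only shows the general pattern of applying parameterized tools one step at a time while keeping every intermediate result available for inspection.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy stand-in for an image: grayscale pixel intensities in [0, 1].
Image = List[float]


def _clamp(value: float) -> float:
    """Keep a pixel intensity inside the valid [0, 1] range."""
    return max(0.0, min(1.0, value))


def adjust_brightness(img: Image, delta: float) -> Image:
    """Hypothetical tool: shift every pixel by `delta`."""
    return [_clamp(p + delta) for p in img]


def adjust_contrast(img: Image, factor: float) -> Image:
    """Hypothetical tool: scale each pixel's distance from mid-gray (0.5)."""
    return [_clamp((p - 0.5) * factor + 0.5) for p in img]


# Registry of available tools; the names are illustrative assumptions.
TOOLS: Dict[str, Callable[..., Image]] = {
    "brightness": adjust_brightness,
    "contrast": adjust_contrast,
}


@dataclass
class EditStep:
    """One entry in an edit trace: which tool, with which parameters, and why."""
    tool: str
    params: Dict[str, float]
    rationale: str


def apply_trace(img: Image, trace: List[EditStep]) -> List[Image]:
    """Apply each step in order and return every intermediate image."""
    history = [img]
    for step in trace:
        history.append(TOOLS[step.tool](history[-1], **step.params))
    return history


def drop_step(trace: List[EditStep], index: int) -> List[EditStep]:
    """Return a copy of the trace without one step, e.g. to undo a bad edit."""
    return trace[:index] + trace[index + 1:]


# Example trace an agent might emit for a request like "make it pop a bit".
example_image: Image = [0.2, 0.5, 0.8]
example_trace = [
    EditStep("brightness", {"delta": 0.1}, "image looks slightly underexposed"),
    EditStep("contrast", {"factor": 2.0}, "user asked for a punchier look"),
]
example_history = apply_trace(example_image, example_trace)
```

Because each step is a named tool with explicit parameters, a user who dislikes the result can look at `example_history` to see which step caused it, then remove that step with `drop_step` or change its parameters and re-run the trace. A model that generates the output image directly offers no comparable handle.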
This work reflects a broader recognition that not all AI applications benefit from end-to-end generation. In creative fields, users increasingly want to understand why a change occurred and to be able to iterate on it meaningfully. The three-stage training approach combines expert demonstrations, reinforcement learning with nuanced rewards, and large-scale synthetic data to align the agent's behavior with human intent and practical constraints.
The implications extend beyond image editing. Tool-centric approaches offer advantages for professional workflows where audit trails and reproducibility matter. The user studies, in which IEA outperforms generative methods in perceptual quality while remaining interpretable, suggest that this design pattern could apply to other creative and technical domains. For developers, the research indicates that conversational agents can learn to compose existing tools rather than replace them, which could accelerate adoption in domains where explainability carries regulatory or practical importance.
- IEA achieves superior image editing quality and user instruction following by learning to sequentially manipulate explicit editing tools rather than generating images end-to-end.
- The three-stage multitask training pipeline combining supervised learning, reinforcement learning, and synthetic data proves effective for aligning conversational AI with creative intent.
- Tool-centric VLMs produce transparent, inspectable edit traces that users can debug, addressing a critical limitation of artifact-prone generative models.
- User studies demonstrate IEA outperforms both generative methods and other tool-calling baselines in overall perceptual quality and instruction following.
- The interpretable, composable approach suggests a viable design pattern for AI systems in professional workflows requiring explainability and auditability.