AI · Neutral · arXiv – CS AI · 14h ago · 5/10
Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions
Researchers propose a reinforcement learning (RL) approach for fine-tuning multimodal conversational agents that learns a compact latent action space instead of operating directly on the large text token space. The method combines paired image-text data with unpaired text-only data through a cross-modal projector trained with a cycle consistency loss. The authors report improved performance across multiple RL algorithms and conversation tasks.
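For readers unfamiliar with the term, a cycle consistency loss generally penalizes the difference between an input and its reconstruction after a round trip through two mappings. The sketch below is only an illustration of that general idea, not the paper's implementation: the summary does not specify the projector architecture or the exact loss, so the linear maps, the squared-error form, and every name (`forward`, `backward`, `cycle_consistency_loss`) are assumptions.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cycle_consistency_loss(forward, backward, text_embeddings):
    """Mean squared error between each text embedding and its round trip.

    forward:  hypothetical text -> image-text projector (linear here)
    backward: hypothetical image-text -> text mapping (linear here)
    """
    total = 0.0
    count = 0
    for t in text_embeddings:
        projected = matvec(forward, t)           # text -> image-text space
        recovered = matvec(backward, projected)  # back to text space
        total += sum((r - x) ** 2 for r, x in zip(recovered, t))
        count += len(t)
    return total / count


# Toy 2-D embeddings; the assumed forward map scales each axis.
forward = [[2.0, 0.0], [0.0, 4.0]]
exact_inverse = [[0.5, 0.0], [0.0, 0.25]]
identity = [[1.0, 0.0], [0.0, 1.0]]

rng = random.Random(0)
texts = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(8)]

loss_good = cycle_consistency_loss(forward, exact_inverse, texts)
loss_bad = cycle_consistency_loss(forward, identity, texts)
print(loss_good, loss_bad)
```

When the backward map exactly inverts the forward map, the round trip reproduces the input and the loss vanishes. A mismatched pair yields a positive loss, which is the signal a trainer would minimize. Because the loss needs only text embeddings, it can in principle be computed on text-only data, which matches the summary's description of using unpaired text.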