AINeutralarXiv โ CS AI ยท Feb 277/106
๐ง
Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Researchers identified a fundamental limitation in multimodal LLMs where decoders trained on text cannot effectively utilize non-text information like speaker identity or visual textures, despite this information being preserved through all model layers. The study demonstrates this 'modality collapse' is due to decoder design rather than encoding failures, with experiments showing targeted training can improve specific modality accessibility.