Modality Gap: When Multimodal LLMs Don't Trust Their Senses

A multimodal LLM has a modality gap when it systematically trusts one input type (such as text) over another (such as images or audio), even when both convey identical semantic content. This bias, often attributed to imbalanced pretraining, can cut performance by more than 90 points on some tasks; for example, a model may ignore visual data when conflicting text is present. The most common mistake is to assume that a model weighs all inputs equally. In practice it often falls back on its text-based foundation and ignores other sensory input.
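One way to make the gap concrete is to score the same questions twice, once with the content given as text and once as an image, and compare accuracy. The sketch below assumes that definition (accuracy difference in points); the function names and the sample results are hypothetical and do not come from the original article.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Return accuracy on a 0-100 scale for a list of per-item outcomes."""
    if not correct:
        raise ValueError("need at least one result")
    return 100.0 * sum(correct) / len(correct)


def modality_gap(text_correct: Sequence[bool],
                 image_correct: Sequence[bool]) -> float:
    """Accuracy with text input minus accuracy with image input, in points.

    Assumption: both lists cover the same items, in the same order, with
    identical semantic content presented in a different modality.
    """
    if len(text_correct) != len(image_correct):
        raise ValueError("both modalities must cover the same items")
    return accuracy(text_correct) - accuracy(image_correct)


# Hypothetical outcomes for four questions, invented for illustration.
text_results = [True, True, True, True]
image_results = [True, False, False, False]

gap = modality_gap(text_results, image_results)
print(f"Modality gap: {gap:.1f} points")
```

A positive value means the model did better when the content arrived as text; a value near zero suggests it handled both presentations about equally on this set.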
Read the original → emergentmind.com
- #llm
- #multimodality
- #generative ai
- #model evaluation