"摘要": "Multimodal Large Language Models (MLLMs) perform strongly in high-resource languages, yet their effectiveness drops sharply in low-resource settings, largely due to the scarcity of aligned and culturally informative multimodal data. Existing multilingual enhancement approaches predominantly rely on text-only resources or translation-based pipelines, which improve surface-level fluency but often fail to capture culturally specific visual knowledge.\nIn this work, we present MELLA, a large-scale multimodal multilingual dataset designed to support both linguistic fluency and culturally grounded visual understanding in low-resource languages. MELLA is constructed using a dual-source data curation strategy that combines (i) native web image-alt-text pairs, which provide in-context, culture-specific visual-textual alignments, and (ii) high-quality image descriptions generated in a high-resource language and translated into target languages to ensure linguistic richness and structural completeness. Rather than expanding multilingual coverage alone, this design explicitly disentangles two complementary learning signals that are conflated in existing multilingual multimodal datasets.\nMELLA covers eight low-resource languages and contains 6.8M image-text pairs spanning diverse domains and visual categories. Through controlled diagnostic fine-tuning experiments on multiple MLLM backbones, we show that training on MELLA mitigates the cultural hallucination gap, often manifested as culturally “thin“ descriptions, by enabling models to recognize and articulate culturally specific entities that are systematically overlooked by translation-centric pipelines. Our findings underscore the central role of data alignment, rather than model modification, in achieving culturally grounded multimodal understanding for low-resource languages.",