A Survey of Generative Categories and Techniques in Multimodal Generative Models

Authors: Longzhen Han, Awes Mubarak, Almas Baimagambetov, Nikolaos Polatidis, Thar Baker

Published: 2025-05-29 12:29:39+00:00

AI Summary

This survey systematically categorizes six primary generative modalities of Multimodal Generative Models (MGMs) and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable their cross-modal capabilities. It proposes a unified evaluation framework centered on faithfulness, compositionality, and robustness, while also analyzing trustworthiness, safety, and ethical risks. Finally, it discusses architectural trends, evaluation protocols, and governance mechanisms to guide future development towards more controllable and accountable multimodal systems.

Abstract

Multimodal Generative Models (MGMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyse key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Building on a common taxonomy of models and training recipes, we propose a unified evaluation framework centred on faithfulness, compositionality, and robustness, and synthesise evidence from benchmarks and human studies across modalities. We further analyse trustworthiness, safety, and ethical risks, including multimodal bias, privacy leakage, and the misuse of high-fidelity media generation for deepfakes, disinformation, and copyright infringement in music and 3D assets, together with emerging mitigation strategies. Finally, we discuss how architectural trends, evaluation protocols, and governance mechanisms can be co-designed to close current capability and safety gaps, outlining critical paths toward more general-purpose, controllable, and accountable multimodal generative systems.


Key findings
Progress in Multimodal Generative Models (MGMs) is symbiotic: architectural advances and techniques (such as Transformers and diffusion models) often transfer across and benefit multiple modalities. A unified evaluation framework based on faithfulness, compositionality, and robustness is essential, because current automatic metrics often align poorly with human perception for complex generations. Addressing trustworthiness, safety, and ethical risks such as deepfakes, bias, privacy leakage, and copyright infringement requires co-designing models, data, evaluation, and governance mechanisms rather than applying isolated technical fixes.
Approach
The authors conduct a systematic review, categorizing six primary generative output modalities (Text-to-Text, Text-to-Image, Text-to-Music, Text-to-Video, Text-to-Human-Motion, Text-to-3D-Objects) and analyzing four pivotal techniques: Self-Supervised Learning, Mixture of Experts, Reinforcement Learning from Human Feedback, and Chain-of-Thought prompting. They establish a unified evaluation framework based on faithfulness, compositionality, and robustness, applying it systematically across modalities. The survey also integrates a dedicated analysis of trustworthiness, safety, and ethical risks in MGMs.
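Of the four techniques surveyed, Mixture of Experts is the most architectural, so a concrete illustration may help. The following is a minimal sketch of top-k MoE routing (the mechanism, not any specific model from the survey): a learned gate scores the experts for each input, only the top-k experts are evaluated, and their outputs are mixed by the renormalised gate weights. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def moe_forward(x, expert_weights, gate_weights, top_k=2):
    """Route each input row to its top-k experts and mix their outputs.

    x:              (batch, d_in) input activations
    expert_weights: list of (d_in, d_out) matrices, one linear expert each
    gate_weights:   (d_in, n_experts) gating matrix
    """
    logits = x @ gate_weights                              # (batch, n_experts)
    # softmax over experts (numerically stabilised)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    out = np.zeros((x.shape[0], expert_weights[0].shape[1]))
    # indices of the top-k experts for every row
    top = np.argsort(probs, axis=-1)[:, -top_k:]
    for i in range(x.shape[0]):
        sel = top[i]
        # renormalise the surviving gate weights so they sum to 1
        w = probs[i, sel] / probs[i, sel].sum()
        for j, e in enumerate(sel):
            out[i] += w[j] * (x[i] @ expert_weights[e])    # sparse: only k experts run
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
experts = [rng.normal(size=(8, 3)) for _ in range(4)]
gate = rng.normal(size=(8, 4))
y = moe_forward(x, experts, gate, top_k=2)
print(y.shape)  # (4, 3)
```

The sparsity is the point: compute per token scales with k, not with the total number of experts, which is why MoE layers recur across the text, image, and video models the survey covers.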
Datasets
UNKNOWN
Model(s)
UNKNOWN
Author countries
United Kingdom, United Arab Emirates