Seamless: Multilingual Expressive and Streaming Speech Translation

Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia Gonzalez, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-jussà, Maha Elbayad, Hongyu Gong, Francisco Guzmán, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson

Published: 2023-12-08 17:18:42+00:00

AI Summary

This paper introduces the Seamless family of models for end-to-end expressive and multilingual speech translation in a streaming fashion. It details SeamlessM4T v2, an improved foundational model, SeamlessExpressive for vocal style and prosody preservation, and SeamlessStreaming for low-latency simultaneous translation. These components are unified into "Seamless", the first publicly available system for real-time expressive cross-lingual communication.

Abstract

Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model: SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication
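The watermarking described in the abstract is a learned, inaudible, localized mechanism; it is not described in detail here. Purely as a conceptual stand-in, the sketch below illustrates the classic spread-spectrum idea behind audio watermarking: embed a key-seeded pseudorandom sequence at low amplitude and detect it by correlating against the same sequence. The function names, strength, and threshold values are illustrative assumptions, not the paper's method.

```python
import random

def embed_watermark(audio, key, strength=0.01):
    # Add a key-seeded pseudorandom noise sequence at low amplitude.
    rng = random.Random(key)
    noise = [rng.uniform(-1.0, 1.0) for _ in audio]
    return [a + strength * n for a, n in zip(audio, noise)]

def detect_watermark(audio, key, threshold=0.002):
    # Correlate the signal with the key's noise sequence; a watermarked
    # signal yields a correlation score well above that of clean audio.
    rng = random.Random(key)
    noise = [rng.uniform(-1.0, 1.0) for _ in audio]
    score = sum(a * n for a, n in zip(audio, noise)) / len(audio)
    return score > threshold
```

Only the holder of the key can reliably detect the mark; correlating with a wrong key yields a near-zero score. The paper's mechanism differs in being neural, localized in time, and robust to audio edits.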


Key findings
SeamlessM4T v2 achieved state-of-the-art semantic accuracy across various speech and text translation tasks, significantly outperforming previous models and cascaded systems. SeamlessExpressive preserves vocal styles and prosody, including speech rate and pauses, while SeamlessStreaming enables low-latency simultaneous speech-to-speech/text translation. The unified Seamless system integrates these capabilities and is backed by robustness to background noise, effective toxicity mitigation, and a novel localized watermarking mechanism for AI-generated speech.
Approach
The approach introduces a family of models: SeamlessM4T v2, an improved foundational model; SeamlessExpressive, for prosody and vocal-style preservation; and SeamlessStreaming, for low-latency simultaneous translation. SeamlessM4T v2 uses an updated UnitY2 framework and a w2v-BERT 2.0 speech encoder. SeamlessExpressive integrates a Prosody UnitY2 model with the PRETSSEL acoustic model, while SeamlessStreaming leverages Efficient Monotonic Multihead Attention (EMMA) to decide when to emit output before the source utterance is complete.
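EMMA learns, per attention head, a policy for when to read more source speech and when to write target output. As a rough illustration of the read/write scheduling problem it solves (not of EMMA itself, which is learned and stochastic), the sketch below implements the much simpler fixed wait-k policy: read k source segments, then alternate writes and reads. The function name and interface are hypothetical.

```python
def wait_k_policy(k, source_len, target_len):
    # Emit a READ/WRITE schedule for a wait-k simultaneous policy:
    # stay k source segments ahead of the output until the source is
    # exhausted, then write the remaining target tokens.
    decisions = []
    read, written = 0, 0
    while written < target_len:
        if read < min(written + k, source_len):
            decisions.append("READ")
            read += 1
        else:
            decisions.append("WRITE")
            written += 1
    return decisions
```

A larger k trades latency for more source context per output token; a learned policy like EMMA adapts this trade-off to the input rather than fixing it in advance.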
Datasets
SeamlessAlign, NLLB data, ASR data, pseudo-labeled S2ST data, mExpresso, mDRAL, automatically extracted expressive audio alignments, parallel segments from videos, SONAR Expressive data, controllable TTS (cTTS) augmented data, Detoxy, Jigsaw, MuTox corpus, Multilingual HolisticBias, Fleurs, Flores, CoVoST2, CVSS, VoxPopuli.
Model(s)
SeamlessM4T v2, SeamlessExpressive, SeamlessStreaming, Seamless. Key architectures and components include UnitY2, w2v-BERT 2.0, Conformer, Transformer, Efficient Monotonic Multihead Attention (EMMA), PRETSSEL (Paralinguistic REpresentation-based TextleSS acoustic modEL), HiFi-GAN vocoder, SONAR speech encoders, Unit Voicebox, MuTox (speech toxicity classifier).
Author countries
USA, France