Leveraging Mixture of Experts for Improved Speech Deepfake Detection

Authors: Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro

Published: 2024-09-24 13:24:03+00:00

Comment: Submitted to ICASSP 2025

AI Summary

This paper introduces a novel Mixture of Experts (MoE) architecture to enhance speech deepfake detection, specifically addressing the challenge of generalization to unseen data. The proposed approach leverages a lightweight gating mechanism to dynamically assign expert weights, allowing the system to specialize in different input types and efficiently handle data variability. This modular framework demonstrates superior generalization and adaptability compared to traditional single models or ensemble methods.

Abstract

Speech deepfakes pose a significant threat to personal security and content authenticity. Several detectors have been proposed in the literature, and one of the primary challenges these systems have to face is the generalization over unseen data to identify fake signals across a wide range of datasets. In this paper, we introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture. The Mixture of Experts framework is well-suited for the speech deepfake detection task due to its ability to specialize in different input types and handle data variability efficiently. This approach offers superior generalization and adaptability to unseen data compared to traditional single models or ensemble methods. Additionally, its modular structure supports scalable updates, making it more flexible in managing the evolving complexity of deepfake techniques while maintaining high detection accuracy. We propose an efficient, lightweight gating mechanism to dynamically assign expert weights for each input, optimizing detection performance. Experimental results across multiple datasets demonstrate the effectiveness and potential of our proposed approach.

Key findings

The proposed enhanced Mixture of Experts (MoE) model significantly outperforms individual experts, ensemble methods, and jointly trained baselines in terms of Equal Error Rate (EER) and Area Under the Curve (AUC). It achieves superior generalization, demonstrating robust performance on both known and unseen speech deepfake datasets. Furthermore, the gating network effectively leverages domain-specific expert knowledge and provides valuable insights into the relationships between different deepfake datasets.
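The primary metric above, Equal Error Rate (EER), is the operating point at which the false-acceptance rate (real speech flagged as fake) equals the false-rejection rate (fake speech missed). As a minimal illustration of how EER is computed from detector scores, here is a sketch using NumPy; the function name `compute_eer` and the score/label conventions (higher score = more likely fake) are our own, not from the paper:

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate: the point where false-accept rate ~= false-reject rate.

    scores: detector outputs, higher = more likely fake.
    labels: 1 for fake (positive class), 0 for real.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    fars, frrs = [], []
    for t in np.unique(scores):          # sweep candidate thresholds
        preds = scores >= t              # predicted "fake" at threshold t
        fars.append(np.mean(preds[~labels]))   # real samples flagged as fake
        frrs.append(np.mean(~preds[labels]))   # fake samples missed
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))       # closest FAR/FRR crossing
    return (fars[idx] + frrs[idx]) / 2.0
```

On a perfectly separable score set this returns 0.0; a detector whose scores are exactly inverted returns 1.0. Production evaluations typically interpolate the ROC curve rather than picking the nearest threshold, but the sweep above conveys the idea.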
Approach

The authors propose a Mixture of Experts (MoE) framework where multiple individual speech deepfake detectors, termed "experts," are each pre-trained on a distinct dataset. A lightweight gating network dynamically assigns weights to these experts based on the input, allowing the system to specialize. An "enhanced" version further improves this by feeding the gating network internal representations (embeddings) from the experts.
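The combination step described above can be sketched in a few lines: each expert produces an embedding and a fake-probability score, and the gating network maps the concatenated embeddings to a softmax weight per expert (the "enhanced" variant). The sketch below is illustrative only; `ToyExpert`, the random placeholder weights, and all dimensions are our assumptions, not the paper's LCNN-based architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class ToyExpert:
    """Stand-in for one pre-trained detector: maps an input feature
    vector to (embedding, fake-probability). Weights are random
    placeholders, not trained parameters."""
    def __init__(self, in_dim, emb_dim):
        self.W = rng.standard_normal((in_dim, emb_dim)) * 0.1
        self.w_out = rng.standard_normal(emb_dim) * 0.1

    def __call__(self, x):
        emb = np.tanh(x @ self.W)
        score = 1.0 / (1.0 + np.exp(-(emb @ self.w_out)))  # sigmoid
        return emb, score

def moe_predict(x, experts, gate_W):
    """'Enhanced' gating: the gate sees the concatenated expert
    embeddings and emits one mixing weight per expert."""
    embs, scores = zip(*(ex(x) for ex in experts))
    gate_in = np.concatenate(embs)
    weights = softmax(gate_in @ gate_W)        # one weight per expert
    return float(np.dot(weights, scores)), weights
```

Usage under these assumptions: with three experts of embedding size 8, `gate_W` has shape `(3 * 8, 3)`, and `moe_predict` returns a weighted fake-probability in [0, 1] plus the per-expert weights, which sum to 1. In the paper, the experts are frozen pre-trained detectors and only the lightweight gate is learned.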
Datasets

ASVspoof 2019 (D_ASV), FakeOrReal (D_FoR), ADD 2022 (D_ADD), In-the-Wild (D_ItW), Purdue speech dataset (D_PUR), TIMIT-TTS (D_TIM)
Model(s)

LCNN (Lightweight CNN) model, processing mel-spectrograms
Author countries

Italy