Attention-based Mixture of Experts for Robust Speech Deepfake Detection

Authors: Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro

Published: 2025-09-22 11:09:20+00:00

AI Summary

This paper presents a novel approach to audio deepfake detection using a Mixture of Experts (MoE) architecture. The system combines multiple state-of-the-art detectors, weighting their outputs via an attention-based gating network, achieving first place in the SAFE challenge at IH&MMSec 2025.

Abstract

AI-generated speech is increasingly used in everyday life, powering virtual assistants, accessibility tools, and other applications. However, it is also being exploited for malicious purposes such as impersonation, misinformation, and biometric spoofing. As speech deepfakes become nearly indistinguishable from real human speech, the need for robust detection methods and effective countermeasures has become critically urgent. In this paper, we present ISPL's submission to the SAFE challenge at IH&MMSec 2025, where our system ranked first across all tasks. Our solution introduces a novel approach to audio deepfake detection based on a Mixture of Experts architecture. The proposed system leverages multiple state-of-the-art detectors, combining their outputs through an attention-based gating network that dynamically weights each expert based on the input speech signal. In this design, each expert develops a specialized understanding of the shared training data by learning to capture different complementary aspects of the same input through inductive biases. Experimental results indicate that our method outperforms existing approaches across multiple datasets. We further evaluate and analyze the performance of our system in the SAFE challenge.


Key findings
The proposed method, using an attention-based gating network and a novel domain partitioning strategy, outperforms existing approaches across multiple datasets. It achieved first place in all three tasks of the SAFE challenge, demonstrating robustness across varied scenarios, although performance degraded as post-processing and laundering became more complex.
Approach
The authors utilize a Mixture of Experts (MoE) architecture for audio deepfake detection. Multiple expert detectors with diverse architectures and input features each score the input, and an attention-based gating network, conditioned on the input speech signal, weights their outputs. All experts are trained on the same pooled dataset, so specialization emerges from their differing architectural inductive biases rather than from partitioning the training data.
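The gating mechanism described above can be sketched as a softmax over per-input gate logits, used to take a weighted sum of the experts' scores. This is a minimal illustrative sketch, not the authors' implementation: the expert scores, gate logits, and the `moe_score` helper are all hypothetical, standing in for the paper's trained detectors and transformer-based gating network.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_score(expert_scores, gate_logits):
    """Combine per-expert deepfake scores using attention-style gate weights.

    expert_scores: each expert's probability that the input clip is fake.
    gate_logits: the gating network's raw output for this input (one per expert).
    """
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, expert_scores))

# Hypothetical example with three experts (e.g., an LCNN and two ResNet18
# variants, as in the paper's expert pool; the numbers are made up).
scores = [0.9, 0.7, 0.2]    # per-expert fake probabilities
logits = [2.0, 0.5, -1.0]   # gating network logits for this input
print(round(moe_score(scores, logits), 3))
```

Because the gate logits depend on the input signal, the mixture can lean on different experts for different inputs, which is the key difference from a fixed ensemble average.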
Datasets
ASVspoof 2019 (LA subset), FakeOrReal, In-the-Wild, MLAAD (English partition), Purdue speech dataset, ASVspoof 2021 (DF partition, clean subset), ASVspoof 5 (evaluation set, clean subset), LibriSpeech, LJSpeech, VCTK, Mozilla Common Voice, DiffSSD
Model(s)
LCNN, ResNet18 (two instances with different input features: mel-frequency and linear-frequency spectrograms), transformer-based gating network
Author countries
Italy