Deepfake Audio Detection Using Self-supervised Fusion Representations

Authors: Khalid Zaman, Qixuan Huang, Muhammad Uzair, Masashi Unoki

Published: 2026-05-05 06:51:41+00:00

AI Summary

This paper proposes a dual-branch deepfake detection framework for component-level audio manipulation, where speech and environmental sounds can be independently spoofed. It leverages self-supervised fusion representations from pretrained XLS-R (for speech) and BEATs (for environmental sound) models. The system introduces a Matching Head and multi-head cross-attention for effective representation interaction, feeding into an AASIST classifier for spoofing probability prediction.

Abstract

This paper describes a submission to the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) 2026, which addresses component-level deepfake detection using the CompSpoofV2 dataset, where speech and environmental sounds may be independently manipulated. To address this challenge, a dual-branch deepfake detection framework is proposed to jointly model speech and environmental contextual representations from input audio. Two pretrained models, XLS-R for speech and BEATs for environmental sound, are used to extract complementary contextual representations. A Matching Head is introduced to model representation differences through statistical normalization and representation interaction, enabling estimation of the original class. In parallel, multi-head cross-attention enables effective information exchange between speech and environmental components. The refined representations are processed with residual connections and layer normalization, and passed to an AASIST classifier to predict speech-based and environment-based spoofing probabilities. The model outputs original, speech, and environment predictions. On the test set, the proposed system achieves an F1-score of 70.20% and an environmental EER of 16.54%, outperforming the baseline system.


Key findings
The proposed system achieved an F1-score of 70.20% and an environmental EER of 16.54% on the test set, an improvement of roughly 7-8% in F1-score over the baseline. The dual-branch design with cross-attention and the Matching Head captured component-level manipulations effectively, improving discrimination and generalization for environment-aware deepfake detection.
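For readers unfamiliar with the EER metric quoted above, the following is a minimal sketch of how an equal error rate can be computed from detection scores. It is not the challenge's official scoring script: the function name `equal_error_rate`, the label convention (1 = spoofed, 0 = bona fide), and the simple threshold sweep are assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Return (eer, threshold) for binary detection scores.

    Assumed convention (hypothetical, not taken from the paper):
    labels are 1 for spoofed and 0 for bona fide, and a higher
    score means "more likely spoofed".
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "flag nothing" operating point is also covered.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # False positive rate: bona fide items flagged as spoofed (score >= t).
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    # False negative rate: spoofed items that were missed (score < t).
    fnr = np.array([(pos < t).mean() for t in thresholds])

    # The EER is taken at the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(fpr - fnr)))
    return (fpr[idx] + fnr[idx]) / 2.0, thresholds[idx]

# Toy data for illustration only; these are not results from the paper.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 0, 1, 1]
eer, thr = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {eer:.2%} at threshold {thr}")
```

A lower EER is better: it is the operating point at which the rate of missed spoofs equals the rate of false alarms.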
Approach
The proposed method employs a dual-branch architecture in which pretrained XLS-R extracts speech representations and pretrained BEATs extracts environmental sound representations from the same input audio. A Matching Head estimates the original class by modeling representation differences through statistical normalization and representation interaction, while multi-head cross-attention exchanges information between the speech and environmental streams. The refined representations pass through residual connections and layer normalization and are then fed to an AASIST classifier, which predicts speech-based and environment-based spoofing probabilities.
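The cross-attention step described above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the shared feature width, the number of heads, the random projection weights, and the helper names (`multi_head_cross_attention`, `fuse`, `init_params`) are all assumptions, and the random arrays merely stand in for XLS-R and BEATs outputs that are assumed to have already been projected to a common dimension.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each frame over its feature dimension (no learned scale or bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query, context, params, num_heads):
    """Queries come from one stream; keys and values come from the other.

    query: (Tq, D), context: (Tc, D); returns (Tq, D).
    """
    tq, dim = query.shape
    tc = context.shape[0]
    d_head = dim // num_heads
    # Project, then split the feature axis into heads: (H, T, d_head).
    q = (query @ params["Wq"]).reshape(tq, num_heads, d_head).transpose(1, 0, 2)
    k = (context @ params["Wk"]).reshape(tc, num_heads, d_head).transpose(1, 0, 2)
    v = (context @ params["Wv"]).reshape(tc, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (H, Tq, Tc).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head), axis=-1)
    # Merge heads back to (Tq, D) and apply the output projection.
    out = (attn @ v).transpose(1, 0, 2).reshape(tq, dim)
    return out @ params["Wo"]

def init_params(rng, dim):
    # Hypothetical random weights; a trained model would learn these.
    scale = 1.0 / np.sqrt(dim)
    return {name: rng.normal(0.0, scale, (dim, dim)) for name in ("Wq", "Wk", "Wv", "Wo")}

def fuse(speech, env, speech_params, env_params, num_heads=4):
    # Each stream attends to the other, followed by a residual connection
    # and layer normalization.
    speech_out = layer_norm(
        speech + multi_head_cross_attention(speech, env, speech_params, num_heads))
    env_out = layer_norm(
        env + multi_head_cross_attention(env, speech, env_params, num_heads))
    return speech_out, env_out

rng = np.random.default_rng(0)
dim = 64                               # assumed shared feature width
speech = rng.normal(size=(50, dim))    # stand-in for projected XLS-R frames
env = rng.normal(size=(30, dim))       # stand-in for projected BEATs frames
speech_out, env_out = fuse(speech, env, init_params(rng, dim), init_params(rng, dim))
print(speech_out.shape, env_out.shape)
```

Each stream keeps its own sequence length after fusion, so the two refined representations can be scored separately, which matches the separate speech-based and environment-based outputs described above. How the actual system pools or combines them before AASIST is not specified here.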
Datasets
CompSpoofV2
Model(s)
XLS-R, BEATs, AASIST classifier
Author countries
Japan