FakeSound2: A Benchmark for Explainable and Generalizable Deepfake Sound Detection

Authors: Zeyu Xie, Yaoyun Zhang, Xuenan Xu, Yongkang Yin, Chenxing Li, Mengyue Wu, Yuexian Zou

Published: 2025-09-21 17:10:06+00:00

AI Summary

FakeSound2 is a new benchmark dataset for deepfake sound detection that goes beyond binary classification, evaluating models on localization, traceability, and generalization across six manipulation types and twelve sources. Experimental results reveal that while current models achieve high accuracy in binary classification, they struggle with explainability and generalization.

Abstract

The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largely focus on binary classification and fall short in explaining how manipulations occur, tracing where the sources originated, or generalizing to unseen sources, thereby limiting the explainability and reliability of detection. To address these limitations, we present FakeSound2, a benchmark designed to advance deepfake sound detection beyond binary accuracy. FakeSound2 evaluates models across three dimensions: localization, traceability, and generalization, covering 6 manipulation types and 12 diverse sources. Experimental results show that although current systems achieve high classification accuracy, they struggle to recognize forged pattern distributions and provide reliable explanations. By highlighting these gaps, FakeSound2 establishes a comprehensive benchmark that reveals key challenges and aims to foster robust, explainable, and generalizable approaches for trustworthy audio authentication.

Key findings
Current deepfake sound detection models exhibit strong localization capabilities but struggle with explainability and generalization. They achieve high accuracy in identifying manipulated audio, yet fail to reliably determine manipulation types or trace sources, especially when the source was unseen during training. This highlights the need for models that learn the underlying distribution of genuine audio rather than relying on artifacts in the training data.
Approach
FakeSound2 evaluates deepfake sound detection models across three dimensions: localization, traceability, and generalization. An automated construction pipeline generates a large-scale dataset spanning diverse manipulation types and sources. Models are then evaluated on their ability to localize manipulated segments, determine the manipulation type and source, and generalize to sources unseen during training.
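To make the localization dimension concrete, the sketch below implements a toy segment-level F1 score: a predicted manipulated segment counts as a hit if its temporal IoU with some reference segment exceeds a threshold. This is a hypothetical illustration of how such a metric could work, not the official FakeSound2 scorer.

```python
def segment_f1(pred, ref, threshold=0.5):
    """Toy segment-level localization score. `pred` and `ref` are lists of
    (start, end) times in seconds marking manipulated regions. A prediction
    is a hit if its IoU with any reference segment exceeds `threshold`.
    Hypothetical metric sketch, not the benchmark's official evaluation."""
    def iou(a, b):
        # Temporal intersection-over-union of two (start, end) intervals.
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    hits = sum(any(iou(p, r) > threshold for r in ref) for p in pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

For example, a prediction of (1.0, 2.0) against a reference of (1.0, 2.1) overlaps with IoU ≈ 0.91 and scores a perfect F1, while disjoint segments score 0.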
Datasets
The FakeSound2 dataset, constructed from AudioCaps using six audio manipulation techniques (generation, editing, inpainting, separation, splicing, and addition) across 12 sources (11 synthetic and 1 genuine).
Model(s)
A baseline model based on prior work [14], combining a self-supervised pre-trained EAT audio encoder with a ResNet, a Transformer encoder, and a bidirectional LSTM for classification.
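The component stack above can be sketched as a frame-level classifier in PyTorch. This is a minimal illustrative sketch, not the authors' implementation: the pre-trained EAT encoder is stubbed as a linear projection, the ResNet is reduced to a single residual conv block, and all dimensions are assumed.

```python
import torch
import torch.nn as nn

class DeepfakeDetectorSketch(nn.Module):
    """Hypothetical sketch of the baseline pipeline: frame features from a
    (stubbed) EAT encoder pass through a residual conv block, a Transformer
    encoder, and a bidirectional LSTM to produce per-frame real/fake logits."""

    def __init__(self, feat_dim=768, hidden=256, num_classes=2):
        super().__init__()
        # Stand-in for the self-supervised pre-trained EAT audio encoder.
        self.encoder = nn.Linear(feat_dim, hidden)
        # Single residual conv block standing in for the ResNet.
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                       batch_first=True),
            num_layers=2)
        self.lstm = nn.LSTM(hidden, hidden // 2, bidirectional=True,
                            batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats):  # feats: (batch, frames, feat_dim)
        x = self.encoder(feats)
        # Conv1d expects (batch, channels, frames); add a residual connection.
        x = x + torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        x = self.transformer(x)
        x, _ = self.lstm(x)
        return self.head(x)  # per-frame logits: (batch, frames, num_classes)

logits = DeepfakeDetectorSketch()(torch.randn(2, 100, 768))
```

Per-frame logits make the binary decision and the segment localization fall out of the same head: thresholding the fake-class probability over time yields manipulated-segment boundaries.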
Author countries
China