Towards Robust Audio Deepfake Detection: An Evolving Benchmark for Continual Learning

Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao

Published: 2024-05-14 13:37:13+00:00

AI Summary

This paper introduces EVDA, a novel benchmark designed to evaluate continual learning methods for robust audio deepfake detection. It addresses the growing challenge posed by advanced large language models generating evolving synthetic speech, where traditional methods struggle with catastrophic forgetting. EVDA includes a diverse set of classic and newly generated deepfake audio datasets and supports various continual learning techniques to foster the development of adaptable detection algorithms.

Abstract

The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning, which acts as an effective tool for detecting newly emerged deepfake audio while maintaining performance on older types, lacks a well-constructed and user-friendly evaluation framework. To address this gap, we introduce EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA includes classic datasets from the Anti-Spoofing Voice series, Chinese fake audio detection series, and newly generated deepfake audio from models like GPT-4 and GPT-4o. It supports various continual learning techniques, such as Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), and recent methods like Regularized Adaptive Weight Modification (RAWM) and Radian Weight Modification (RWM). Additionally, EVDA facilitates the development of robust algorithms by providing an open interface for integrating new continual learning methods.


Key findings
The evaluation on the EVDA benchmark revealed that Replay consistently achieved the most competitive performance, with the lowest average Equal Error Rate (EER) across all tasks. Elastic Weight Consolidation (EWC) also performed strongly, proving effective at mitigating catastrophic forgetting on older tasks. The benchmark highlights that continual learning is crucial for maintaining detection performance against evolving deepfake audio, with some methods (e.g., CWRStar, OWM) being less effective than others.
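The benchmark's headline metric, Equal Error Rate (EER), is the operating point at which the false-acceptance rate on spoofed audio equals the false-rejection rate on bona fide audio. A minimal sketch of how it can be computed from raw scores (the convention "higher score = more likely bona fide" is an assumption for illustration):

```python
def eer(bonafide_scores, spoof_scores):
    """Equal Error Rate: sweep thresholds and return the rate at the point
    where the false-acceptance rate (spoof scored at/above threshold) is
    closest to the false-rejection rate (bona fide scored below threshold)."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Toy example: one bona fide clip scores low and one spoof scores high,
# so the best achievable crossover sits at 25% error.
print(eer([0.9, 0.8, 0.7, 0.2], [0.1, 0.3, 0.4, 0.6]))  # → 0.25
```

In practice EER is computed per task after each training stage, then averaged across the eight tasks to measure both adaptation and forgetting.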
Approach
The authors propose EVDA, a comprehensive benchmark for evaluating continual learning methods in audio deepfake detection. EVDA comprises eight distinct tasks, incorporating diverse datasets from the Anti-Spoofing Voice series, the Chinese fake audio detection series, and newly generated deepfake audio from advanced LLMs like GPT-4/GPT-4o. It supports and evaluates several continual learning techniques, including Replay, Finetuning, EWC, GDumb, CWRStar, SI, OWM, RAWM, and RWM, to assess their ability to adapt to new threats while retaining past knowledge.
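Among the supported methods, EWC is representative of the regularization family: after each task it penalizes movement of parameters deemed important to earlier tasks, weighted by an estimate of their Fisher information. A minimal sketch of the penalty term (the flattened parameter vectors and the λ weighting here are illustrative assumptions, not EVDA's interface):

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Elastic Weight Consolidation regularizer: a quadratic penalty that
    anchors each parameter near its value after the previous task, scaled
    by its (diagonal) Fisher information estimate and a strength lam."""
    return 0.5 * lam * sum(f * (p - p0) ** 2
                           for p, p0, f in zip(params, old_params, fisher))

# The first parameter is important to the old task (Fisher 2.0) and has
# drifted by 1.0; the second is unimportant (Fisher 0.1), so its drift
# contributes far less to the penalty.
print(ewc_penalty([1.0, 2.5], [0.0, 2.0], fisher=[2.0, 0.1]))  # → 1.0125
```

This penalty is simply added to the task loss during training on a new task, which is how regularization-based methods trade plasticity for stability; replay-based methods instead revisit stored samples from older tasks.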
Datasets
FMFCC-A, In-the-Wild, ADD 2022, ASVspoof2015, ASVspoof2019LA, ASVspoof2021LA, FoR, HAD. These encompass classic Anti-Spoofing Voice series, Chinese fake audio detection series, and newly generated deepfake audio from models like GPT-4 and GPT-4o.
Model(s)
A model of 5 linear layers with a 128-dimensional hidden size (used as the base detection model for evaluating continual learning methods on the benchmark).
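The base detector can be read as a small MLP. A sketch under stated assumptions (the input feature size, ReLU activations, and two-class bona fide/spoof output are not specified in this summary):

```python
import random

def make_layer(n_in, n_out, rng):
    """Random weight matrix (n_out rows of n_in) and zero bias vector."""
    return ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def forward(x, layers):
    """Apply each linear layer; ReLU on all but the last."""
    for i, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

rng = random.Random(0)
dims = [64, 128, 128, 128, 128, 2]  # 64-dim input feature is an assumption
layers = [make_layer(dims[i], dims[i + 1], rng) for i in range(5)]
logits = forward([0.5] * 64, layers)
print(len(logits))  # two logits: bona fide vs. spoof
```

Keeping the backbone this simple isolates the contribution of each continual learning strategy, since differences in EER then stem from the strategy rather than from model capacity.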
Author countries
China