audio
Published: 2026-01-21
Accepted @ IEEE ICASSP 2026
Authors: Viola Negroni, Luca Cuccovillo, Paolo Bestagini, Patrick Aichroth, Stefano Tubaro
This paper introduces SFATNet-4, a lightweight multi-task transformer for explainable speech deepfake detection. The model simultaneously predicts formant trajectories and voicing patterns while classifying speech as real or fake, providing insights into whether its decisions rely more on voiced or unvoiced regions. It improves upon its predecessor by requiring fewer parameters, training faster, and offering better interpretability without sacrificing prediction performance.
audio
Published: 2026-01-20
Accepted by ICASSP 2026
Authors: Jinhua Zhang, Zhenqi Jia, Rui Liu
This paper proposes EAI-ADD, a novel audio deepfake detection framework that leverages cross-level emotion-acoustic inconsistency as the primary detection signal. It addresses limitations of prior methods that isolate features or rely on correlation, which often overlook subtle desynchronization and abrupt discontinuities in spoofed speech. EAI-ADD projects emotional and acoustic representations into a comparable space and progressively integrates frame-level and utterance-level emotion features with acoustic features to capture inconsistencies across different temporal granularities.
audio
Published: 2026-01-19
Authors: Chongyang Gao, Marco Postiglione, Julian Baldwin, Natalia Denisenko, Isabel Gortner, Luke Fosdick, Chiara Pulice, Sarit Kraus, V. S. Subrahmanian
This paper introduces the Context-based Audio Deepfake Detector (CADD), which leverages contextual information and transcripts to significantly improve the detection of deepfake audio of public figures. It also presents two new datasets, JDD and SYN, composed of real-world and synthetically generated deepfakes respectively. The research demonstrates CADD's enhanced performance and robustness against various adversarial manipulations compared to existing baseline detectors.
audio
Published: 2026-01-15
Accepted as full paper to CHIIR'26
Authors: Marcel Gohsen, Nicola Libera, Johannes Kiesel, Jan Ehlers, Benno Stein
This paper investigates how cognitive load affects human accuracy in detecting voice-based deepfakes through an empirical study with 30 participants. The findings suggest that low cognitive load does not generally impair detection abilities. Interestingly, simultaneous exposure to a secondary stimulus can actually benefit human performance in the deepfake detection task.
audio
Published: 2026-01-12
Authors: Xueping Zhang, Han Yin, Yang Xiao, Lin Zhang, Ting Dang, Rohan Kumar Das, Ming Li
This paper introduces the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), focusing on detecting component-level manipulations where either speech or environmental sounds (or both) can be synthesized or altered. To address this, they propose the large-scale CompSpoofV2 dataset and a separation-enhanced joint learning framework. The challenge aims to promote research in this more realistic and complex audio deepfake detection scenario.
audio
Published: 2026-01-10
Authors: K. A. Shahriar
This paper introduces a lightweight resolution-aware audio deepfake detection framework that explicitly models and aligns multi-resolution spectral representations. It utilizes cross-scale attention and consistency learning to enhance robustness under channel distortions, replay attacks, and real-world recording conditions. The approach achieves strong performance on various benchmarks while maintaining computational efficiency.
audio
Published: 2026-01-07
Submitted
Authors: Xin Wang, Héctor Delgado, Nicholas Evans, Xuechen Liu, Tomi Kinnunen, Hemlata Tak, Kong Aik Lee, Ivan Kukanov, Md Sahidullah, Massimiliano Todisco, Junichi Yamagishi
This paper presents an overview and analysis of the ASVspoof 5 challenge, which promotes research in speech spoofing and deepfake detection. It evaluates the performance of 53 participating teams' solutions against a new crowdsourced database featuring diverse generative speech technologies, recording conditions, and adversarial attacks. The findings highlight effective detection solutions but also reveal performance degradation under adversarial attacks and neural encoding, alongside persistent generalization challenges.
audio
Published: 2026-01-07
Preprint for ACL 2026 submission
Authors: Binh Nguyen, Thai Le
This paper introduces a forensic auditing framework to evaluate the robustness of Audio Language Models' (ALMs) reasoning in audio deepfake detection under adversarial attacks. It analyzes reasoning shifts across acoustic perception, cognitive coherence, and cognitive dissonance, revealing that explicit reasoning does not universally enhance robustness. Instead, reasoning can act as a defensive "shield" for acoustically robust models but imposes a "tax" on others, while high cognitive dissonance can serve as a "silent alarm" for potential manipulation.
audio
Published: 2026-01-06
Authors: Yuankun Xie, Xiaoxuan Guo, Jiayi Zhou, Tao Wang, Jian Liu, Ruibo Fu, Xiaopeng Wang, Haonan Cheng, Long Ye
This paper addresses the need for all-type audio deepfake detection (ADD) that generalizes across heterogeneous audio and provides interpretable decisions. The authors propose an automatic annotation pipeline to construct Frequency-Time (FT) structured Chain-of-Thought (CoT) rationales, generating ~340K cold-start demonstrations. Building on this data, they introduce Frequency Time-Group Relative Policy Optimization (FT-GRPO), a two-stage training paradigm combining SFT cold-start with GRPO under rule-based frequency-time constraints, achieving state-of-the-art performance and interpretable rationales.
audio
Published: 2026-01-06
11 pages, 3 figures
Authors: Kwok-Ho Ng, Tingting Song, Yongdong WU, Zhihua Xia
This paper proposes XLSR-MamBo, a modular framework for audio deepfake detection that integrates an XLSR front-end with hybrid Mamba-Attention backbones. It leverages complementary strengths of State Space Models for temporal compression and Attention for global artifact retrieval. The framework achieves competitive performance and robust generalization across ASVspoof 2021 LA, DF, In-the-Wild, and DFADD benchmarks, with deeper backbones enhancing stability.
audio
Published: 2026-01-06
Authors: Mengze Hong, Di Jiang, Zeying Xie, Weiwei Zhao, Guan Wang, Chen Jason Zhang
This paper empirically evaluates state-of-the-art speaker authentication systems against modern audio deepfake synthesis. It reveals two critical security vulnerabilities: commercial speaker verification systems are easily bypassed by voice cloning models trained on minimal data, and anti-spoofing detectors fail to generalize robustly to unseen deepfake generation methods. The findings highlight an urgent need for architectural innovations and adaptive multi-factor authentication strategies.
audio
Published: 2026-01-02
Accepted at IJCB 2025
Authors: Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall
This paper investigates the viability of using Multimodal Large Language Models (MLLMs) for audio deepfake detection by reformulating it as an Audio Question-Answering (AQA) task. Evaluating Qwen2-Audio-7B-Instruct and SALMONN in zero-shot and LoRA fine-tuned settings, the study finds that MLLMs perform poorly without task-specific training and struggle with out-of-domain generalization. However, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
audio
Published: 2025-12-31
IJRAR Int. J. Res. Anal. Rev., vol. 12, no. 4, pp. 102-109, 2025
Authors: Prajwal Chinchmalatpure, Suyash Chinchmalatpure, Siddharth Chavan
This study focuses on the real-time detection of AI-generated speech produced using Retrieval-based Voice Conversion (RVC), crucial for mitigating impersonation and fraud. The researchers propose a streaming classification approach that segments audio into one-second windows, extracts acoustic features, and employs supervised machine learning models to classify each segment as real or voice-converted. This method allows for low-latency inference and demonstrates the feasibility of practical, real-time deepfake speech detection under realistic audio mixing conditions.
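To make the streaming idea concrete, here is a minimal sketch of a one-second windowed classifier; the specific acoustic features, sample rate, and random-forest classifier are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SR = 16000          # assumed sample rate
WIN = SR            # one-second windows, as described in the paper

def window_features(frame: np.ndarray) -> np.ndarray:
    """Cheap acoustic features for one window (illustrative choices only)."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / SR)
    energy = np.sqrt(np.mean(frame ** 2))                     # RMS energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0      # zero-crossing rate
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-9)
    rolloff = freqs[np.searchsorted(np.cumsum(spectrum), 0.85 * np.sum(spectrum))]
    return np.array([energy, zcr, centroid, rolloff])

def stream_scores(audio: np.ndarray, clf) -> list[float]:
    """Classify each non-overlapping one-second segment; low latency by design."""
    scores = []
    for start in range(0, len(audio) - WIN + 1, WIN):
        feats = window_features(audio[start:start + WIN])
        scores.append(clf.predict_proba(feats[None, :])[0, 1])  # P(voice-converted)
    return scores

# Training on pre-extracted windows (X: features, y: 0 = real, 1 = RVC-converted)
X, y = np.random.randn(200, 4), np.random.randint(0, 2, 200)   # placeholder data
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(stream_scores(np.random.randn(5 * SR), clf))
```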
audio
Published: 2025-12-30
Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang
This paper provides an overview of the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge, which introduced EnvSDD, the first large-scale dataset for ESDD. The challenge aimed to develop effective methods for detecting fake environmental sounds, addressing limitations in existing datasets. The paper analyzes challenge results and highlights common effective design choices observed in top-performing systems across two distinct tracks.
audio
Published: 2025-12-25
Accepted for publication in 2025 28th International Conference on Computer and Information...
Authors: Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Zahid Hossain, Md. Kamrozzaman Bhuiyan, Farhad Uz Zaman
This paper addresses the challenge of detecting Bengali deepfake audio, an area that remains largely unexplored. The authors evaluate zero-shot inference with several pretrained models and fine-tune multiple deep learning architectures on the BanglaFake dataset. They demonstrate that fine-tuning significantly improves detection performance over zero-shot methods, providing the first systematic benchmark for Bengali deepfake audio detection.
audio
Published: 2025-12-23
ESDD 2026 Challenge Technical Report
Authors: Xiaoxuan Guo, Hengyan Huang, Jiayi Zhou, Renhe Sun, Jian Liu, Haonan Cheng, Long Ye, Qin Zhang
This paper proposes EnvSSLAM-FFN, a lightweight system for environmental sound deepfake detection in the ESDD 2026 Challenge. It combines a frozen SSLAM self-supervised encoder with a feed-forward network (FFN) back-end. The system employs fusion of intermediate SSLAM representations (layers 4-9) and a class-weighted training objective to address data imbalance and effectively capture spoofing artifacts.
audio
Published: 2025-12-21
This paper is accepted in ICDM 2025-MLC workshop
Authors: Lisan Al Amin, Vandana P. Janeja
This paper introduces the use of quantum-kernel Support Vector Machines (QSVMs) for robust audio deepfake detection in conditions with scarce labeled data and varying recording environments. The authors demonstrate that QSVMs significantly reduce false-positive rates and equal-error rates (EER) compared to classical SVMs, leveraging quantum feature maps to achieve superior class separability without increasing model size. The approach provides consistent performance gains across diverse datasets, making it a viable drop-in alternative for practical deepfake detection pipelines.
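The quantum feature maps themselves are not detailed above; as a rough classical simulation of the idea, the sketch below computes a fidelity kernel for a simple, unentangled angle-encoding map (an assumption) and plugs it into a standard SVM via a precomputed kernel.

```python
import numpy as np
from sklearn.svm import SVC

def fidelity_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """K(x, z) = |<phi(x)|phi(z)>|^2 for a product-state angle-encoding feature map.

    With unentangled single-qubit rotations the overlap factorises, so the kernel
    can be simulated classically as prod_i cos^2((x_i - z_i) / 2).
    """
    diff = A[:, None, :] - B[None, :, :]          # (n_a, n_b, d)
    return np.prod(np.cos(diff / 2.0) ** 2, axis=-1)

# Placeholder acoustic feature vectors scaled into a rotation-angle range.
rng = np.random.default_rng(0)
X_train = rng.uniform(-np.pi, np.pi, size=(120, 8))
y_train = rng.integers(0, 2, size=120)
X_test = rng.uniform(-np.pi, np.pi, size=(20, 8))

svm = SVC(kernel="precomputed", C=1.0)
svm.fit(fidelity_kernel(X_train, X_train), y_train)
pred = svm.predict(fidelity_kernel(X_test, X_train))   # rows: test, cols: train
print(pred)
```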
audio
Published: 2025-12-20
Authors: Wen Huang, Yuchen Mao, Yanmin Qian
This paper introduces a data-centric approach to generalizable speech deepfake detection (SDD), emphasizing the critical role of data composition over model-centric solutions. It characterizes data scaling laws for SDD, quantifying the impact of source and generator diversity, and proposes the Diversity-Optimized Sampling Strategy (DOSS) for mixing heterogeneous data. The DOSS framework achieves state-of-the-art generalization performance with superior data and model efficiency on public benchmarks and a new challenge set of commercial APIs.
audio
Published: 2025-12-17
3 pages, 1 figure, challenge paper
Authors: Sanghyeok Chung, Eujin Kim, Donggun Kim, Gaeun Heo, Jeongbin You, Nahyun Lee, Sunmook Choi, Soyul Han, Seungsang Oh, Il-Youp Kwak
This paper introduces BEAT2AASIST, an extension of the BEATs-AASIST model, for Environmental Sound Deepfake Detection (ESDD) within the ESDD 2026 Challenge. The proposed model enhances feature representations by splitting BEATs-derived features for processing by dual AASIST branches and incorporates top-k transformer layer fusion strategies. Additionally, vocoder-based data augmentation is utilized to improve robustness against unseen spoofing methods.
audio
Published: 2025-12-15
6 pages, 4 figures, 2 tables
Authors: Menglu Li, Majd Alber, Ramtin Asgarianamiri, Lian Zhao, Xiao-Ping Zhang
This paper introduces HQ-MPSD, a high-quality, multilingual partial deepfake speech dataset designed to address limitations of existing datasets which often contain superficial artifacts. HQ-MPSD uses linguistically coherent splice points derived from forced alignment and incorporates background effects, yielding perceptually natural samples. Benchmarking state-of-the-art models on HQ-MPSD reveals significant performance drops, highlighting generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced.
audio
Published: 2025-12-15
6 pages
Authors: Udayon Sen, Alka Luqman, Anupam Chattopadhyay
This paper addresses the performance degradation of state-of-the-art audio deepfake detection models in noisy, realistic capture conditions. It introduces a reproducible framework that mixes MS-SNSD noises with ASVspoof 2021 DF utterances to evaluate robustness under controlled Signal-to-Noise Ratios (SNRs). The study surveys and benchmarks pretrained encoders, demonstrating that finetuning significantly improves detection robustness at lower SNRs.
audio
Published: 2025-12-12
Authors: Yupei Li, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang, Björn W. Schuller
The paper introduces "EmoBridge," a novel training framework for speech deepfake detection that unifies diverse feature representations by leveraging emotion as a bridging mechanism. This approach integrates emotion-related characteristics into existing feature encoders through a continual learning strategy, aiming for a robust and interpretable feature space. EmoBridge consistently improves deepfake detection performance across various datasets and feature types.
audio
Published: 2025-12-10
Submitted to Speech Communication
Authors: Eugenia San Segundo, Aurora López-Jareño, Xin Wang, Junichi Yamagishi
This study investigates human perception of audio deepfakes, exploring how language, speaking style, and voice familiarity influence detection accuracy and the underlying reasons for listeners' judgments. Through a perceptual experiment with Spanish and Japanese native speakers, the research reveals an average accuracy of 59.11%, with higher performance on authentic samples. It highlights that listeners primarily rely on suprasegmental and higher-level linguistic or extralinguistic characteristics for detection, with observable cross-linguistic differences in perceptual strategies.
audio
Published: 2025-12-09
Authors: Yupei Li, Li Wang, Yuxiang Wang, Lei Wang, Rizhao Cai, Jie Shi, Björn W. Schuller, Zhizheng Wu
This study proposes DFALLM, an Audio Large Language Model (ALLM) framework designed for generalizable and multitask audio deepfake detection. It addresses previous ALLM generalization bottlenecks by systematically optimizing audio encoder and text-based LLM components. DFALLM achieves state-of-the-art performance across multiple datasets for binary deepfake detection and demonstrates competitive capabilities in advanced tasks like spoof attribution and localization.
audio
Published: 2025-12-09
Authors: Junyi Peng, Lin Zhang, Jin Li, Oldrich Plchot, Jan Cernocky
This paper presents the BUT submission to the ESDD 2026 Challenge, focusing on environmental sound deepfake detection with unseen generators. The main contribution is a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models coupled with a Multi-Head Factorized Attention (MHFA) back-end. A feature domain augmentation strategy based on distribution uncertainty modeling is also introduced to enhance robustness against unseen spectral distortions.
audio
Published: 2025-12-05
Authors: Candy Olivia Mawalim, Haotian Zhang, Shogo Okada
This paper presents the Nomi Team's work for the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge. They propose an audio-text cross-attention model to address unseen generators and low-resource black-box scenarios. Experiments demonstrate competitive EER improvements over the challenge baseline, particularly when integrating semantic text and using an ensemble model.
audio
Published: 2025-12-04
Authors: Alireza Mohammadi, Keshav Sood, Dhananjay Thiruvady, Asef Nazari
This paper introduces a framework designed for voice authentication systems at the network edge, addressing the dual threats of deepfake synthesis attacks and control-plane poisoning in federated learning. The approach integrates interpretable physics-guided features, modeling vocal tract dynamics, with representations from a self-supervised learning module. These are processed through a Multi-Modal Ensemble Architecture and a Bayesian ensemble to provide uncertainty estimates, enhancing robustness against advanced deepfake attacks and sophisticated control-plane poisoning.
audio
Published: 2025-11-26
Authors: Ido Nitzan Hidekel, Gal Lifshitz, Khen Cohen, Dan Raviv
Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs because of spectral bias: generators leave high-frequency (HF) artifacts that detectors under-exploit. To address this, SONAR proposes a frequency-guided framework that explicitly disentangles an audio signal into low-frequency content and HF residuals using XLSR encoders and learnable SRM filters. By employing frequency cross-attention and a frequency-aware Jensen-Shannon contrastive loss, SONAR aligns real content-noise pairs while pushing fake embeddings apart, achieving state-of-the-art generalization and significantly faster convergence.
audio
Published: 2025-11-25
Authors: Wangjie Li, Lin Li, Qingyang Hong
This paper introduces a novel framework for continual audio deepfake detection that leverages Universal Adversarial Perturbation (UAP). This approach allows models to retain knowledge of historical spoofing distributions without needing direct access to past data, addressing the challenge of evolving deepfake attacks and high fine-tuning costs. By integrating UAP with pre-trained self-supervised audio models, the method offers an efficient solution for continual learning.
audio
Published: 2025-11-14
Authors: Guangke Chen, Yuhui Wang, Shouling Ji, Xiapu Luo, Ting Wang
This research introduces HARMGEN, a suite of five attacks designed to compel Large Audio-Language Model (LALM)-based Text-to-Speech (TTS) systems to generate speech containing harmful content, bypassing safety alignments and moderation filters. The attacks utilize semantic obfuscation techniques for text and audio-modality exploits to covertly inject harmful words. The study evaluates these attacks across multiple commercial LALMs and assesses the effectiveness of reactive and proactive countermeasures, revealing significant vulnerabilities in current defenses.
audio
Published: 2025-11-13
Accepted to IJCNLP-AACL 2025
Authors: Farhan Sheth, Girish, Mohd Mujtaba Akhtar, Muskaan Singh
This paper introduces RHYME, a novel framework for generalizable audio deepfake detection across diverse speech synthesis paradigms, including conventional TTS, diffusion, and flow-matching generators. RHYME achieves synthesis-invariant detection by fusing utterance-level embeddings from pretrained speech encoders using non-Euclidean projections into hyperbolic and spherical manifolds, unified via Riemannian barycentric averaging. This geometry-aware approach effectively aligns structural distortions common to synthetic speech, leading to robust generalization in cross-paradigm and unseen-generator conditions.
audio
Published: 2025-11-08
7 pages, Accepted at NeurIPS'25 workshop on AI for Music
Authors: Atharva Mehta, Shivam Chauhan, Megha Sharma, Gus Xia, Kaustuv Kanti Ganguli, Nishanth Chandran, Zeerak Talat, Monojit Choudhury
This paper raises concerns about cultural and genre biases in AI for music systems (music-AI systems), particularly how these biases misrepresent marginalized traditions and reduce creators' trust. It highlights the harms of such biases, including cultural erosion and limited creativity, affecting stakeholders like creators, distributors, and listeners. The authors propose recommendations at the dataset, model, and interface levels to address these issues and promote fairness.
audio
Published: 2025-10-27
Submitted to ICASSP 2026
Authors: Jiyoung Hong, Yoonseo Chung, Seungyeon Oh, Juntae Kim, Jiyoung Lee, Sookyung Kim, Hyunsoo Cho
This paper introduces TWINSHIFT, a novel benchmark designed to evaluate the robustness and generalization capabilities of audio deepfake detection (ADD) systems under strictly unseen conditions. It is constructed from six different synthesis systems, each paired with disjoint sets of speakers, allowing for rigorous assessment of detector performance when both the generative model and speaker identity change. TWINSHIFT reveals significant robustness gaps in current ADD systems and provides guidance for developing more resilient detectors.
audio
Published: 2025-10-23
8 pages, Accepted at Workshop on AI for Cyber Threat Intelligence, co-located with ACSAC 2025
Authors: Nguyen Linh Bao Nguyen, Alsharif Abuadbba, Kristen Moore, Tingmin Wu
This paper investigates the effectiveness of state-of-the-art audio deepfake detectors against FOICE, a novel face-to-voice synthesis method that generates speech from a single facial image. The study reveals that current detectors consistently fail to identify FOICE-generated audio, highlighting a critical vulnerability. While fine-tuning on FOICE data significantly improves detection, it often leads to a detrimental trade-off, diminishing the detectors' robustness against unseen deepfake generators.
audio
Published: 2025-10-22
Authors: Tong Zhang, Yihuan Huang, Yanzhen Ren
This paper introduces EchoFake, a novel dataset designed to address the vulnerability of speech deepfake detection systems to physical replay attacks, which often bypass models trained on lab-generated synthetic speech. EchoFake provides over 120 hours of audio including cutting-edge zero-shot text-to-speech (TTS) and physical replay recordings under varied real-world conditions. Experiments show that models trained on EchoFake achieve lower average Equal Error Rates (EERs) and better generalization across datasets, highlighting its value for advancing robust anti-spoofing methods.
audio
Published: 2025-10-20
Accepted for presentation at the NeurIPS 2025 Workshop on Generative and Protective AI for...
Authors: Davide Salvi, Hendrik Vincent Koops, Elio Quinton
This paper introduces a novel two-stage pipeline for robust singer identification in singing voice deepfakes, prioritizing high-quality forgeries. It first employs a discriminator to filter out low-quality deepfakes that fail to accurately reproduce vocal likeness. A subsequent singer identification model, trained exclusively on authentic recordings, then identifies the artist in the remaining high-quality deepfakes and authentic audio, outperforming existing baselines.
audio
Published: 2025-10-16
Authors: Hui Wang, Jinghua Zhao, Yifan Yang, Shujie Liu, Junyang Chen, Yanzhe Zhang, Shiwan Zhao, Jinyu Li, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin
This paper introduces SpeechLLM-as-Judges, a novel paradigm that leverages large language models (LLMs) for structured and explanation-based speech quality evaluation. They develop SpeechEval, a large-scale multilingual dataset for four speech evaluation tasks, and train SQ-LLM, a speech-quality-aware LLM using chain-of-thought reasoning and reward optimization. SQ-LLM demonstrates strong, interpretable performance across diverse tasks and languages, highlighting the potential of this LLM-as-judge approach.
audio
Published: 2025-10-14
Authors: Wanying Ge, Xin Wang, Junichi Yamagishi
FakeMark is a novel watermarking framework for deepfake speech attribution that injects artifact-correlated watermarks associated with deepfake systems, rather than pre-assigned bitstring messages. This design allows a detector to attribute the source system by leveraging both injected watermarks and intrinsic deepfake artifacts, maintaining effectiveness even when one cue is elusive or removed. Experimental results demonstrate improved generalization to cross-dataset samples and high accuracy under various distortions and removal attacks.
audio
Published: 2025-10-07
Authors: Antoine Teissier, Marie Tahon, Nicolas Dugué, Aghilas Sini
This paper proposes a novel approach to enhance deepfake detection by introducing sparse latent representations in the AASIST architecture. By applying a TopK activation on the last hidden layer, the method improves detection performance, achieving an EER of 23.36% on ASVSpoof5 with 95% sparsity. Furthermore, it demonstrates that these sparse representations lead to better disentanglement of attack-related information in the latent space, thus promoting interpretability.
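The TopK mechanism itself is easy to illustrate; the sketch below (layer width and k are placeholders, not the authors' AASIST configuration) zeroes out all but the k largest activations per sample.

```python
import torch
import torch.nn as nn

class TopKActivation(nn.Module):
    """Keep only the k largest activations per sample; zero out the rest."""
    def __init__(self, k: int):
        super().__init__()
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, hidden_dim); 95% sparsity means keeping roughly 5% of units.
        _, indices = torch.topk(x, self.k, dim=-1)
        mask = torch.zeros_like(x).scatter_(-1, indices, 1.0)
        return x * mask

hidden = torch.randn(4, 160)          # stand-in for the last hidden layer
sparse = TopKActivation(k=8)(hidden)  # 8 / 160 = 5% of units kept
print((sparse != 0).float().mean())   # ~0.05
```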
audio
Published: 2025-10-06
Submitted to ICASSP 2026
Authors: Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen
This paper introduces WaveSP-Net, a novel parameter-efficient front-end for speech deepfake detection that combines prompt-tuning with classical signal processing transforms. It specifically proposes a Partial-WSPT-XLSR front-end that uses learnable wavelet filters to inject multi-resolution features into prompt embeddings for a frozen XLSR model, paired with a Mamba-based back-end. WaveSP-Net achieves state-of-the-art performance on challenging benchmarks while maintaining low trainable parameters.
audio
Published: 2025-10-03
Submitted @ IEEE OJSP
Authors: Viola Negroni, Davide Salvi, Daniele Ugo Leonzio, Paolo Bestagini, Stefano Tubaro
This paper introduces 'Forensic Similarity for Speech Deepfakes', a digital audio forensics approach that determines if two audio segments share the same forensic traces. The system utilizes a two-part deep-learning architecture comprising a feature extractor based on a speech deepfake detector backbone and a shallow similarity network. The method demonstrates strong generalization to previously unseen generative models for source verification and shows applicability to splicing detection.
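A minimal sketch of such a two-part design is shown below; the embedding dimension, the shallow head, and the dummy backbone are assumptions standing in for the paper's detector-based feature extractor.

```python
import torch
import torch.nn as nn

class ForensicSimilarity(nn.Module):
    """Two-part design: a feature extractor plus a shallow similarity head
    that predicts whether two segments share the same forensic traces."""
    def __init__(self, extractor: nn.Module, emb_dim: int = 256):
        super().__init__()
        self.extractor = extractor                    # e.g. a deepfake-detector backbone
        self.similarity = nn.Sequential(              # shallow comparison network
            nn.Linear(2 * emb_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, seg_a: torch.Tensor, seg_b: torch.Tensor) -> torch.Tensor:
        ea, eb = self.extractor(seg_a), self.extractor(seg_b)
        logit = self.similarity(torch.cat([ea, eb], dim=-1))
        return torch.sigmoid(logit)                   # P(same forensic source)

# Dummy backbone standing in for a speech deepfake detector's embedding layers.
backbone = nn.Sequential(nn.Linear(16000, 256), nn.ReLU())
model = ForensicSimilarity(backbone)
a, b = torch.randn(2, 16000), torch.randn(2, 16000)
print(model(a, b).shape)   # (2, 1)
```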
audio
Published: 2025-10-02
Authors: Oğuzhan Kurnaz, Jagabandhu Mishra, Tomi H. Kinnunen, Cemal Hanilçi
This paper proposes a modular yet jointly optimized architecture for spoofing-robust automatic speaker verification (SASV), integrating outputs from speaker and spoof detectors via trainable back-end classifiers. The approach directly optimizes the back-end using the architecture-agnostic detection cost function (a-DCF) as a training objective. Experiments demonstrate that nonlinear score fusion and a combination of weighted cosine scoring for speaker detection with SSL-AASIST for spoof detection achieve state-of-the-art performance.
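As an illustration of training a back-end directly on a detection-cost objective, the sketch below fuses an ASV score and a CM score with a small network and minimizes a sigmoid-smoothed, a-DCF-style surrogate; the cost weights, priors, and smoothing are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn as nn

class FusionBackend(nn.Module):
    """Small nonlinear back-end that fuses a speaker score and a spoofing score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))

    def forward(self, asv_score, cm_score):
        return self.net(torch.stack([asv_score, cm_score], dim=-1)).squeeze(-1)

def soft_adcf(scores, labels, threshold=0.0, alpha=10.0,
              c_miss=1.0, c_fa_non=10.0, c_fa_spf=10.0,
              pi_tar=0.9, pi_non=0.05, pi_spf=0.05):
    """Differentiable surrogate of an a-DCF-style cost (weights/priors are placeholders).
    labels: 0 = target, 1 = non-target, 2 = spoof."""
    accept = torch.sigmoid(alpha * (scores - threshold))    # soft accept decision
    p_miss = 1.0 - accept[labels == 0].mean()
    p_fa_non = accept[labels == 1].mean()
    p_fa_spf = accept[labels == 2].mean()
    return (c_miss * pi_tar * p_miss
            + c_fa_non * pi_non * p_fa_non
            + c_fa_spf * pi_spf * p_fa_spf)

backend = FusionBackend()
opt = torch.optim.Adam(backend.parameters(), lr=1e-3)
asv, cm = torch.randn(300), torch.randn(300)                # dummy detector scores
labels = torch.randint(0, 3, (300,))
for _ in range(50):
    loss = soft_adcf(backend(asv, cm), labels)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```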
audio
Published: 2025-09-30
Authors: Rahul Vijaykumar, Ajan Ahmed, John Parker, Dinesh Pendyala, Aidan Collins, Stephanie Schuckers, Masudul H. Imtiaz
This paper introduces ELAD-SVDSR, a novel extended-length audio dataset designed for synthetic voice detection and speaker recognition. It comprises 45-minute audio recordings from 36 participants, captured with five different microphones, along with 20 generated deepfake voices. The dataset aims to facilitate the creation of high-quality deepfakes and the development of robust detection systems.
audio
Published: 2025-09-30
Submitted to IEEE ICASSP 2026. Paper resources available at...
Authors: Héctor Delgado, Giorgio Ramondetti, Emanuele Dalmasso, Gennady Karvitsky, Daniele Colibro, Haydar Talib
This paper highlights how current deepfake datasets and research methodologies lead to systems that fail to generalize to real-world applications due to the lack of realistic presentation. The authors propose a new framework for data creation and research methodology that incorporates the effects of deepfake audio being presented through communication channels. By following these guidelines, they significantly improved deepfake detection accuracy in robust lab setups and real-world benchmarks, demonstrating that dataset quality is more crucial than model size.
audio
Published: 2025-09-29
Authors: Manasi Chhibber, Jagabandhu Mishra, Tomi H. Kinnunen
This paper introduces a novel zero-shot source tracing framework for speech deepfakes, adapting the SSL-AASIST system for attack classification. It investigates both zero-shot (cosine similarity, Siamese) and few-shot (MLP, Siamese) backend scoring approaches for attack verification. Experiments show that few-shot learning offers advantages in closed-set scenarios, while zero-shot approaches are more effective for open-set source tracing.
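The zero-shot cosine-similarity backend can be sketched as prototype matching; the embedding dimension, enrollment averaging, and open-set threshold below are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-9)

def enroll_attacks(embeddings_by_attack: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Average a few enrollment embeddings per known attack into one prototype each."""
    return {name: l2_normalize(embs.mean(axis=0))
            for name, embs in embeddings_by_attack.items()}

def verify(test_emb: np.ndarray, prototypes: dict[str, np.ndarray], threshold=0.7):
    """Cosine-score the test embedding against every attack prototype (zero-shot).
    Scores below the threshold for all prototypes suggest an unseen (open-set) attack."""
    test_emb = l2_normalize(test_emb)
    scores = {name: float(test_emb @ proto) for name, proto in prototypes.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else "unknown_attack"), scores

rng = np.random.default_rng(1)
protos = enroll_attacks({"A01": rng.normal(size=(5, 192)), "A02": rng.normal(size=(5, 192))})
print(verify(rng.normal(size=192), protos))
```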
audio
Published: 2025-09-28
Authors: Pu Huang, Shouguang Wang, Siya Yao, Mengchu Zhou
The paper introduces Information Bottleneck enhanced Confidence-Aware Adversarial Network (IB-CAAN) for generalizable speech deepfake detection. This method employs confidence-guided adversarial alignment to suppress attack-specific artifacts and an information bottleneck to remove nuisance variability, thereby preserving transferable discriminative features. Experiments demonstrate that IB-CAAN consistently outperforms baselines and achieves state-of-the-art performance on many benchmarks, addressing distribution shifts across spoofing methods and other variabilities.
audio
Published: 2025-09-26
Authors: Xuechen Liu, Xin Wang, Junichi Yamagishi
This paper proposes a training-free retrieval-augmented framework for detecting zero-day audio deepfakes, addressing the challenge of novel synthesis methods unseen during training. The framework leverages knowledge representations and voice profile matching through retrieval and ensemble methods. It achieves performance comparable to supervised and fine-tuned baselines on the DeepFake-Eval-2024 benchmark without requiring additional model training.
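A minimal, training-free retrieval scorer of this flavor might look like the following; the cosine k-NN index and similarity-weighted voting are assumptions, not the paper's exact ensemble.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class RetrievalDetector:
    """Training-free scoring: retrieve the k nearest reference embeddings and use the
    fraction of fake neighbours (similarity-weighted) as the deepfake score."""
    def __init__(self, ref_embs: np.ndarray, ref_labels: np.ndarray, k: int = 10):
        self.index = NearestNeighbors(n_neighbors=k, metric="cosine").fit(ref_embs)
        self.labels = ref_labels          # 1 = fake, 0 = bona fide
        self.k = k

    def score(self, query: np.ndarray) -> float:
        dist, idx = self.index.kneighbors(query[None, :])
        sims = np.clip(1.0 - dist[0], 0.0, None)   # cosine similarity, floored at 0
        votes = self.labels[idx[0]]
        return float(np.sum(sims * votes) / (np.sum(sims) + 1e-9))

rng = np.random.default_rng(0)
detector = RetrievalDetector(rng.normal(size=(1000, 256)),
                             rng.integers(0, 2, size=1000))
print(detector.score(rng.normal(size=256)))   # closer to 1.0 => more deepfake-like
```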
audio
Published: 2025-09-25
Authors: Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk
This paper introduces AUDDT, an open-source toolkit for benchmarking audio deepfake detection models across 28 diverse datasets. It aims to automate the evaluation process, providing insights into the generalization capabilities and shortcomings of pretrained detectors. The toolkit also highlights limitations of current datasets and their gap relative to real-world deployment.
audio
Published: 2025-09-25
5 pages, 4 figures
Authors: Duc-Tuan Truong, Tianchi Liu, Junjie Li, Ruijie Tao, Kong Aik Lee, Eng Siong Chng
This paper addresses gradient misalignment in data-augmented training for speech deepfake detection (SDD), where conflicting gradients from original and augmented inputs can hinder optimization. The authors propose a dual-path data-augmented (DPDA) training framework with gradient alignment, processing original and augmented speech in parallel to compare and align their backpropagated gradients. This approach resolves conflicts, accelerates convergence, and achieves significant Equal Error Rate reductions compared to baselines.
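The gradient-alignment idea can be sketched with a PCGrad-style projection between the two paths; the projection rule and toy model below are assumptions, not the DPDA implementation.

```python
import torch
import torch.nn as nn

def aligned_step(model, loss_orig, loss_aug, optimizer):
    """Compute per-path gradients, project out the conflicting component when their
    dot product is negative (PCGrad-style), then apply the combined update."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_o = torch.autograd.grad(loss_orig, params, retain_graph=True)
    g_a = torch.autograd.grad(loss_aug, params)
    flat_o = torch.cat([g.flatten() for g in g_o])
    flat_a = torch.cat([g.flatten() for g in g_a])
    dot = torch.dot(flat_o, flat_a)
    if dot < 0:   # conflict: remove the component of g_a that opposes g_o
        flat_a = flat_a - dot / (flat_o.norm() ** 2 + 1e-12) * flat_o
    combined, offset = flat_o + flat_a, 0
    optimizer.zero_grad()
    for p, g in zip(params, g_o):
        n = g.numel()
        p.grad = combined[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()

model = nn.Sequential(nn.Linear(40, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
crit = nn.CrossEntropyLoss()
x = torch.randn(8, 40)
x_aug = x + 0.1 * torch.randn_like(x)     # original vs augmented view of the same speech
y = torch.randint(0, 2, (8,))
aligned_step(model, crit(model(x), y), crit(model(x_aug), y), opt)
```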
audio
Published: 2025-09-25
5 pages, 4 figures
Authors: Duc-Tuan Truong, Tianchi Liu, Ruijie Tao, Junjie Li, Kong Aik Lee, Eng Siong Chng
This paper proposes QAMO (Quality-Aware Multi-Centroid One-Class Learning) for speech deepfake detection, which addresses the limitations of single-centroid one-class models by introducing multiple centroids, each representing a distinct speech quality subspace. This approach better models intra-class variability in bona fide speech and supports a multi-centroid ensemble scoring strategy for improved decision thresholding. QAMO achieves a 5.09% EER on the In-the-Wild dataset, outperforming previous one-class and quality-aware systems.
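A minimal multi-centroid one-class head might look like the sketch below; the number of centroids, cosine scoring, and margin loss are assumptions rather than QAMO's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiCentroidOneClass(nn.Module):
    """Several learnable bona fide centroids (e.g. one per speech-quality subspace);
    the score of an embedding is its similarity to the closest centroid."""
    def __init__(self, emb_dim: int = 160, n_centroids: int = 3, margin: float = 0.2):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_centroids, emb_dim))
        self.margin = margin

    def score(self, emb: torch.Tensor) -> torch.Tensor:
        sims = F.cosine_similarity(emb.unsqueeze(1), self.centroids.unsqueeze(0), dim=-1)
        return sims.max(dim=1).values         # ensemble over centroids: best match

    def loss(self, emb: torch.Tensor, is_bonafide: torch.Tensor) -> torch.Tensor:
        s = self.score(emb)
        # Pull bona fide embeddings toward some centroid, push spoofed ones away.
        pull = (1.0 - s[is_bonafide]).mean() if is_bonafide.any() else 0.0
        push = F.relu(s[~is_bonafide] - self.margin).mean() if (~is_bonafide).any() else 0.0
        return pull + push

head = MultiCentroidOneClass()
emb = torch.randn(16, 160)
labels = torch.rand(16) > 0.5
print(head.loss(emb, labels), head.score(emb)[:4])
```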
audio
Published: 2025-09-24
5 pages, 1 figure, 3 tables
Authors: Jinyang Wu, Nana Hou, Zihan Pan, Qiquan Zhang, Sailor Hardik Bhupendra, Soumik Mondal
This paper introduces SEA-Spoof, the first large-scale Audio Deepfake Detection (ADD) dataset specifically designed for South-East Asian (SEA) languages, addressing the critical gap where current models fail due to data scarcity and linguistic mismatches. Spanning over 300 hours across six SEA languages, SEA-Spoof includes paired real and spoof speech generated by diverse state-of-the-art systems. Benchmarking reveals severe cross-lingual performance degradation, which is significantly mitigated by fine-tuning models on SEA-Spoof, thereby highlighting its importance for robust, cross-lingual fraud detection.
audio
Published: 2025-09-23
Authors: Visar Berisha, Prad Kadambi, Isabella Lenz
This paper argues that speech deepfake detectors fail to generalize in real-world, open-world conditions due to a 'coverage debt' caused by multiplicatively growing factors like devices, codecs, and attack families. Through an analysis of a cross-testing framework, the authors demonstrate that detectors struggle significantly with newer synthesizers and conversational speech domains. They conclude that detection alone is insufficient for high-stakes decisions and advocate for layered defenses.
audio
Published: 2025-09-22
Accepted @ IEEE WIFS 2025
Authors: Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro
The paper presents ISPL's first-ranked submission to the SAFE challenge, introducing a novel Mixture of Experts (MoE) architecture for robust audio deepfake detection. This system combines multiple state-of-the-art detectors, dynamically weighting their outputs using an attention-based gating network based on the input speech signal.
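The gating idea reduces to a small network that weights expert scores based on the input signal; the sketch below is a simplified stand-in (pooled features and a plain softmax gate are assumptions) for the attention-based gating described above.

```python
import torch
import torch.nn as nn

class GatedDetectorMixture(nn.Module):
    """Gating network: from a summary of the input speech it predicts
    softmax weights over the scores of several pretrained deepfake detectors."""
    def __init__(self, n_experts: int, feat_dim: int = 128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                  nn.Linear(64, n_experts))

    def forward(self, signal_feats: torch.Tensor, expert_scores: torch.Tensor):
        weights = torch.softmax(self.gate(signal_feats), dim=-1)   # (batch, n_experts)
        fused = (weights * expert_scores).sum(dim=-1)              # weighted score
        return fused, weights

moe = GatedDetectorMixture(n_experts=4)
feats = torch.randn(8, 128)            # e.g. pooled embedding of the input utterance
scores = torch.randn(8, 4)             # per-utterance scores from 4 expert detectors
fused, w = moe(feats, scores)
print(fused.shape, w.sum(dim=-1))      # (8,), all ones
```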
audio
Published: 2025-09-21
Authors: Zeyu Xie, Yaoyun Zhang, Xuenan Xu, Yongkang Yin, Chenxing Li, Mengyue Wu, Yuexian Zou
This paper introduces FakeSound2, a new benchmark designed to advance deepfake sound detection beyond simple binary classification. It evaluates models across three critical dimensions: localization, traceability, and generalization, encompassing 6 manipulation types and 12 diverse sources. Experimental results using FakeSound2 reveal that while current systems achieve high classification accuracy, they significantly struggle with recognizing forged pattern distributions, providing reliable explanations, and generalizing to unseen sources.
audio
Published: 2025-09-18
Accepted by ICASSP 2026
Authors: Xuanjun Chen, Chia-Yu Hu, I-Ming Lin, Yi-Cheng Lin, I-Hsiang Chiu, You Zhang, Sung-Feng Huang, Yi-Hsuan Yang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
This paper investigates how instrumental music influences singing voice deepfake (SingFake) detection models. The authors conduct behavioral analyses testing various backbones, unpaired instrumental tracks, and frequency subbands, along with representational analyses probing encoder capabilities post-fine-tuning. They conclude that instrumental music primarily acts as data augmentation, enhancing reliance on shallow speaker features while reducing sensitivity to deeper content and musical information.
audio
Published: 2025-09-17
6 pages, 3 figures, 1 table
Authors: Janne Laakkonen, Ivan Kukanov, Ville Hautamäki
This paper proposes a Mixture-of-LoRA-Experts (MoE-LoRA) approach to enhance the generalizability of audio deepfake detection by integrating multiple low-rank adapters (LoRA) into the attention layers of foundation models. A routing mechanism dynamically activates specialized experts, enabling better adaptation to evolving deepfake attacks. The method significantly outperforms standard fine-tuning, reducing equal error rates in both in-domain and out-of-domain scenarios.
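A single MoE-LoRA projection layer can be sketched as follows; the expert count, rank, and softmax router are assumptions, and in the paper such adapters sit inside the foundation model's attention layers.

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    """Frozen base projection plus several low-rank adapters; a router decides how much
    each LoRA expert contributes for the current input."""
    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # foundation-model weights stay frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.router = nn.Linear(d_in, n_experts)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, d_in)
        gate = torch.softmax(self.router(x), dim=-1)       # (batch, seq, n_experts)
        low = torch.einsum("bsd,edr->bser", x, self.A)     # per-expert down-projection
        up = torch.einsum("bser,erd->bsed", low, self.B)   # per-expert up-projection
        lora_out = (gate.unsqueeze(-1) * up).sum(dim=2)    # router-weighted sum of experts
        return self.base(x) + self.scale * lora_out

layer = MoELoRALinear(nn.Linear(256, 256))
print(layer(torch.randn(2, 50, 256)).shape)   # (2, 50, 256)
```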
audio
Published: 2025-09-15
Authors: Pierre Serrano, Raphaël Duroselle, Florian Angulo, Jean-François Bonastre, Olivier Boeffard
This paper addresses the challenge of generalizing audio deepfake detection systems based on frozen pre-trained self-supervised learning (SSL) encoders to out-of-domain (OOD) conditions. The authors conduct a layer-by-layer analysis of six different SSL models, compare single-layer pooling with multi-head factorized attentive pooling (MHFA), and demonstrate that score-level fusion of several encoders significantly enhances OOD generalization. This approach achieves state-of-the-art performance in OOD conditions with limited training data and no data augmentation.
audio
Published: 2025-09-13
Authors: Xiaokang Li, Yicheng Gong, Dinghao Zou, Xin Cao, Sunbowen Lee
This paper proposes EmoAnti, a novel audio anti-deepfake system that leverages high-level emotional cues often neglected by existing methods. It utilizes a Wav2Vec2 model fine-tuned on emotion recognition tasks to derive emotion-guided representations, which are then refined by a dedicated convolutional residual feature extractor. EmoAnti achieves state-of-the-art performance on ASVspoof2019LA and ASVspoof2021LA benchmarks and demonstrates strong generalization on ASVspoof2021DF.
audio
Published: 2025-09-12
code to be pushed to https://github.com/nii-yamagishilab/AntiDeepfake
Authors: Xin Wang, Wanying Ge, Junichi Yamagishi
This paper investigates data drift monitoring for speech deepfake detection within an MLOps context. It explores whether drift from new text-to-speech (TTS) attacks can be monitored using feature distribution distances and if fine-tuning detectors with similarly drifted data can reduce this drift and improve detection performance. The study demonstrates that drift can be monitored and effectively reduced by fine-tuning, leading to improved detection error rates.
audio
Published: 2025-09-11
Published in Interspeech 2025
Authors: Chin Yuen Kwok, Jia Qi Yip, Zhen Qiu, Chi Hung Chi, Kwok Yan Lam
This paper introduces 'bona fide cross-testing,' a novel evaluation framework for Audio Deepfake Detection (ADD) models. It addresses the limitations of traditional evaluation methods, such as imbalanced synthesizer weighting and lack of diverse bona fide speech, by incorporating various bona fide datasets and aggregating Equal Error Rates (EERs). This approach aims to provide more robust and interpretable assessments, revealing vulnerabilities often overlooked by conventional methods.
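The evaluation protocol itself is simple to sketch: compute an EER for every bona fide corpus paired with every spoof source, then aggregate. The scoring convention and pooling by simple averaging below are assumptions.

```python
import numpy as np

def eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal error rate for scores where higher means 'more bona fide'."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    fnr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # missed bona fide
    fpr = np.array([(spoof_scores >= t).mean() for t in thresholds])    # accepted spoofs
    i = np.argmin(np.abs(fnr - fpr))
    return float((fnr[i] + fpr[i]) / 2.0)

def cross_test(bonafide_sets: dict[str, np.ndarray], spoof_sets: dict[str, np.ndarray]):
    """Pair every bona fide corpus with every synthesizer and aggregate the EERs,
    so no single synthesizer or speech style dominates the headline number."""
    table = {(b, s): eer(bona, spoof)
             for b, bona in bonafide_sets.items()
             for s, spoof in spoof_sets.items()}
    return table, float(np.mean(list(table.values())))

rng = np.random.default_rng(0)
bona = {"librispeech": rng.normal(1.0, 1.0, 500), "podcast": rng.normal(0.6, 1.0, 500)}
spoof = {"tts_A": rng.normal(-1.0, 1.0, 500), "vc_B": rng.normal(-0.3, 1.0, 500)}
per_pair, pooled = cross_test(bona, spoof)
print(per_pair, pooled)
```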
audio
Published: 2025-09-11
Authors: Zihan Pan, Sailor Hardik Bhupendra, Jinyang Wu
This paper proposes MoLEx (Mixture of LoRA Experts), a parameter-efficient framework for audio deepfake detection that combines Low-Rank Adaptation (LoRA) with a Mixture-of-Experts (MoE) router. MoLEx efficiently finetunes only selected experts of pre-trained Self-Supervised Learning (SSL) models, preserving core knowledge while reducing computational costs. Evaluated on the ASVSpoof 5 dataset, MoLEx achieves a state-of-the-art Equal Error Rate (EER) of 5.56% on the evaluation set without augmentation.
audio
Published: 2025-09-10
Authors: Li Wang, Junyi Ao, Linyong Gan, Yuancheng Wang, Xueyao Zhang, Zhizheng Wu
This paper introduces the Audio Deepfake Verification (ADV) task for open-set deepfake source tracing, moving beyond binary detection and closed-set attribution. It proposes Audity, a novel dual-branch architecture that extracts deepfake features from both audio structure and generation artifacts. Experimental results demonstrate that Audity outperforms single-branch configurations and achieves excellent performance in both deepfake detection and verification tasks simultaneously.
audio
Published: 2025-09-09
Authors: Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood, Sunil Aryal
This paper introduces the Spectral Masking and Interpolation Attack (SMIA), a novel black-box adversarial attack designed to bypass both voice authentication systems (VAS) and anti-spoofing countermeasures (CMs). SMIA strategically manipulates inaudible frequency regions of AI-generated audio, creating adversarial samples that are perceptually authentic yet effectively deceive state-of-the-art defenses. The attack demonstrates high success rates, highlighting critical vulnerabilities in current voice biometric security paradigms.
audio
Published: 2025-09-08
Authors: Kutub Uddin, Muhammad Umar Farooq, Awais Khan, Khalid Mahmood Malik
This paper conducts a comprehensive benchmark and comparative study of state-of-the-art audio deepfake detection (ADD) methods under adversarial conditions. It evaluates the effectiveness and vulnerabilities of both raw and spectrogram-based ADD approaches against a wide range of anti-forensic (AF) attacks across five deepfake benchmark datasets. The study highlights significant performance degradation of ADD methods when exposed to these attacks, informing the design of more robust and generalized detectors.
audio
Published: 2025-09-08
Authors: Liping Chen, Kong Aik Lee, Zhen-Hua Ling, Xin Wang, Rohan Kumar Das, Tomoki Toda, Haizhou Li
This paper provides a concise overview of three techniques—voice anonymization, deepfake detection, and watermarking—developed to address security threats from deepfake speech misuse. It describes their methodologies, advancements, and challenges in protecting speaker attributes and defending against malicious use of synthetic speech. A more comprehensive version is slated for future publication.
audio
Published: 2025-09-05
Authors: Wangjie Li, Xingjia Xie, Yishuang Li, Wenhao Guan, Kaidi Wang, Pengyu Ren, Lin Li, Qingyang Hong
The XMUspeech systems for the ASVspoof 5 Challenge focus on speech deepfake detection, noting that increased audio duration significantly improves performance. The approach integrates advanced models like AASIST, HM-Conformer, Hubert, and Wav2vec2 with an adaptive multi-scale feature fusion method and optimized one-class loss functions. Their final fusion system achieved competitive results in both closed (minDCF 0.4783, EER 20.45%) and open conditions (minDCF 0.2245, EER 9.36%).
audio
Published: 2025-09-04
Authors: Qizhou Wang, Hanxun Huang, Guansong Pang, Sarah Erfani, Christopher Leckie
This paper introduces AUDETER (AUdio DEepfake TEst Range), a large-scale and highly diverse dataset for deepfake audio detection, aimed at addressing the poor generalization of existing detection methods in real-world, open-world scenarios due to domain shifts. Comprising over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders (totaling 3 million clips), AUDETER is the largest deepfake audio dataset by scale. Experiments demonstrate that models trained on AUDETER achieve significantly improved generalized detection performance, reducing error rates by 44.1% to 51.6% on diverse cross-domain samples.
audio
Published: 2025-09-04
Authors: Yunqi Hao, Yihao Chen, Minqiang Xu, Jianbo Zhan, Liang He, Lei Fang, Sian Fang, Lin Liu
This paper proposes Wav2DF-TSL, a two-stage learning strategy for robust audio deepfake detection, combining efficient pre-training with hierarchical expert fusion. It leverages adapters for learning artifacts from unlabeled spoofed speech and a Hierarchical Adaptive Mixture of Experts (HA-MoE) for dynamically fusing multi-level spoofing cues. The method significantly outperforms existing state-of-the-art systems across benchmark datasets, showing improved generalization, especially on cross-domain scenarios.
audio
Published: 2025-09-04
Authors: Huhong Xian, Rui Liu, Berrak Sisman, Haizhou Li
This paper proposes NE-PADD, a novel method for Partial Audio Deepfake Detection (PADD) that effectively leverages named entity knowledge. It integrates two parallel branches, Speech Named Entity Recognition (SpeechNER) and PADD, using Attention Fusion (AF) and Attention Transfer (AT) mechanisms to aggregate attention weights and guide PADD with semantic information. Experiments on the PartialSpoof-NER dataset demonstrate that NE-PADD significantly outperforms existing baselines in frame-level fake speech localization.
audio
Published: 2025-09-03
This paper has been accepted by ACM MM 2025
Authors: Hoan My Tran, Damien Lolive, Aghilas Sini, Arnaud Delhay, Pierre-François Marteau, David Guennec
This paper introduces a multi-level SSL feature gating mechanism for audio deepfake detection, leveraging the XLS-R model as a front-end feature extractor. It employs a Multi-kernel gated Convolution (MultiConv) for the back-end classifier and incorporates Centered Kernel Alignment (CKA) to promote diverse feature learning across MultiConv layers. The approach achieves state-of-the-art performance on in-domain benchmarks and demonstrates robust generalization to unseen deepfake attacks and multilingual out-of-domain datasets.
audio
Published: 2025-09-02
Authors: Sandipana Dowerah, Atharva Kulkarni, Ajinkya Kulkarni, Hoan My Tran, Joonas Kalda, Artem Fedorchenko, Benoit Fauve, Damien Lolive, Tanel Alumäe, Matthew Magimai Doss
Speech DeepFake (DF) Arena is introduced as the first comprehensive benchmark for audio deepfake detection, providing a toolkit for uniform evaluation across 14 diverse datasets and attack scenarios. It includes a leaderboard to compare and rank 12 state-of-the-art open-source and 3 proprietary detection systems using standardized metrics. The study reveals many systems exhibit high Equal Error Rate (EER) in out-of-domain scenarios, underscoring the necessity for extensive cross-domain evaluation.
audio
Published: 2025-08-29
Proc. Interspeech 2025, 4553-4557
Authors: Arnab Das, Yassine El Kheir, Carlos Franzreb, Tim Herzig, Tim Polzehl, Sebastian Möller
This study proposes a novel method for generalizable audio spoofing detection by leveraging non-semantic universal audio representations extracted using TRILL and TRILLsson models. The approach demonstrates comparable performance on in-domain test sets while significantly outperforming state-of-the-art methods on out-of-domain and public-domain test sets, highlighting its superior generalization capabilities.
audio
Published: 2025-08-28
Accepted @ IEEE ASRU 2025
Authors: Hashim Ali, Surya Subramani, Lekha Bollinani, Nithin Sai Adupa, Sali El-Loh, Hafiz Malik
This paper presents a robust audio deepfake detection system for the SAFE Challenge, focusing on strategies for integrating diverse multilingual datasets. Their AASIST-based approach, utilizing WavLM Large as an SSL frontend with RawBoost augmentation, achieved second place in both Task 1 (unmodified audio) and Task 3 (laundered audio) of the challenge. The work highlights the importance of comprehensive data diversity and longer audio segments for strong generalization across various spoofing scenarios.
audio
Published: 2025-08-25
14 Pages, Accepted by AsiaCCS 2025
Authors: Yuanda Wang, Bocheng Chen, Hanqing Guo, Guangjing Wang, Weikang Ding, Qiben Yan
This paper introduces ClearMask, a noise-free defense mechanism against voice deepfake attacks that preserves audio naturalness. It modifies the audio mel-spectrogram by selective frequency filtering, applies audio style transfer, and optimizes reverberation to induce transferable voice feature loss. Additionally, LiveMask is proposed for real-time streaming speech protection, both effectively preventing deepfake voices from deceiving speaker verification models and human listeners, even against unseen voice synthesis models and adaptive attackers.
audio
Published: 2025-08-22
This paper is submitted to the IEEE IoT Journal
Authors: Kamel Kamel, Keshav Sood, Hridoy Sankar Dutta, Sunil Aryal
This survey provides a comprehensive review of the modern threat landscape targeting Voice Authentication Systems (VAS) and Anti-Spoofing Countermeasures (CMs), including data poisoning, adversarial, deepfake, and adversarial spoofing attacks. It chronologically traces the evolution of voice authentication vulnerabilities alongside technological advancements, summarizing methodologies, highlighting datasets, and comparing performance and limitations for each attack category. The paper aims to support the development of more secure and resilient voice authentication systems by identifying emerging risks and open challenges.
audio
Published: 2025-08-18
Authors: Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
This paper proposes a self-attentive prototypical network for few-shot detection of synthesized speech, designed to rapidly adapt to new voice spoofing under distribution shifts. The method effectively leverages a small number of in-distribution samples to significantly improve detection performance over traditional zero-shot detectors. It achieves up to a 32% relative EER reduction on Japanese-language deepfakes and 20% on the ASVspoof 2021 Deepfake dataset.
audio
Published: 2025-08-14
Authors: Yuankun Xie, Ruibo Fu, Xiaopeng Wang, Zhiyong Wang, Ya Li, Zhengqi Wen, Haonan Cheng, Long Ye
This paper addresses the significant performance degradation of deepfake audio countermeasures (CMs) in cross-domain scenarios, particularly on social media. It introduces the Fake Speech Wild (FSW) dataset, comprising 254 hours of real and deepfake audio from four different media platforms. By establishing a benchmark with self-supervised learning (SSL)-based CMs and employing data augmentation strategies with joint training on public and FSW datasets, the research achieves an average equal error rate (EER) of 3.54% for real-world deepfake audio detection.
audio
Published: 2025-08-13
Authors: Chongyang Gao, Marco Postiglione, Isabel Gortner, Sarit Kraus, V. S. Subrahmanian
This paper introduces Perturbed Public Voices (P²V), an IRB-approved dataset designed for robust audio deepfake detection, capturing identity-consistent transcripts, environmental/adversarial noise, and state-of-the-art voice cloning. Experiments reveal significant vulnerabilities in 22 recent audio deepfake detectors when tested on P²V, showing up to 43% performance degradation for models trained on existing benchmarks, while P²V-trained models maintain robustness and generalize effectively.
audio
Published: 2025-08-12
Accepted at IEEE ASRU 2025
Authors: Xi Xuan, Zimo Zhu, Wenxin Zhang, Yi-Cheng Lin, Tomi Kinnunen
This paper introduces Fake-Mamba, a novel real-time speech deepfake detection system that employs bidirectional Mamba as an efficient alternative to Self-Attention. Fake-Mamba integrates an XLSR front-end with a proposed PN-BiMamba encoder to effectively capture subtle local and global artifacts in synthetic speech. It achieves substantial performance gains over state-of-the-art models on ASVspoof 2021 LA, DF, and In-The-Wild benchmarks while maintaining real-time inference.
audio
Published: 2025-08-11
Authors: Vojtěch Staněk, Karel Srna, Anton Firc, Kamil Malinka
This paper introduces the Speaker Characteristics Deepfake (SCDF) dataset, a novel and richly annotated resource designed to facilitate systematic evaluation of demographic biases in deepfake speech detection. SCDF comprises over 237,000 utterances balanced across sex, five languages, and a wide age range. The authors demonstrate that speaker characteristics significantly influence the performance of state-of-the-art deepfake detectors, revealing disparities across sex, language, age, and synthesizer type, underscoring the need for bias-aware development.
audio
Published: 2025-08-06
Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang
This paper proposes EnvSDD, the first large-scale curated dataset for environmental sound deepfake detection (ESDD), comprising 45.25 hours of real and 316.7 hours of fake sound. Based on EnvSDD, the authors are launching the ESDD 2026 Challenge with two tracks: ESDD in Unseen Generators and Black-Box Low-Resource ESDD, aiming to foster robust deepfake detection methods for environmental sounds.
audio
Published: 2025-08-06
Accepted at Interspeech SPSC 2025 - 5th Symposium on Security and Privacy in Speech Communication (Oral)
Authors: Xi Xuan, Yang Xiao, Rohan Kumar Das, Tomi Kinnunen
This paper introduces the first benchmark for multilingual speech deepfake source tracing, investigating both mono- and cross-lingual scenarios. It comparatively analyzes DSP- and SSL-based modeling, evaluating how SSL representations fine-tuned on different languages impact cross-lingual generalization performance. The work also assesses generalization to unseen languages and speakers, providing initial insights into the challenges of identifying speech generation models when training and inference languages differ.
audio
Published: 2025-08-04
Authors: Andrea Di Pierno, Luca Guarnera, Dario Allegra, Sebastiano Battiato
This paper introduces LAVA (Layered Architecture for Voice Attribution), a hierarchical framework for audio deepfake detection and model recognition. It leverages attention-enhanced latent representations extracted by a convolutional autoencoder trained solely on fake audio. The framework employs two specialized classifiers: Audio Deepfake Attribution (ADA) to identify generation technology and Audio Deepfake Model Recognition (ADMR) to recognize specific generative model instances, incorporating confidence-based rejection for open-set conditions.
audio
Published: 2025-08-03
Accepted for publication on Interspeech 2025
Authors: Mingru Yang, Yanmei Gu, Qianhua He, Yanxiong Li, Peirong Zhang, Yongqiang Chen, Zhiming Wang, Huijia Zhu, Jian Liu, Weiqiang Wang
This paper introduces Poin-HierNet, a novel framework for generalizable audio deepfake detection (ADD) that addresses critical generalization challenges due to diverse spoofing attacks and domain variations. Poin-HierNet constructs domain-invariant hierarchical representations in the Poincaré sphere, moving beyond traditional Euclidean distance-based methods. It achieves this through three key components: Poincaré Prototype Learning (PPL), Hierarchical Structure Learning (HSL), and Poincaré Feature Whitening (PFW).
audio
Published: 2025-08-02
Authors: Haohan Shi, Xiyu Shi, Safak Dogan, Tianjin Huang, Yunxiao Zhang
This paper introduces the first unified framework for robust Audio Deepfake Detection (ADD) that effectively operates under real-world communication degradations, such as packet losses and speech codec compression. The core contribution is a novel Multi-Granularity Adaptive Attention (MGAA) architecture, which utilizes multi-scale attention heads and an adaptive fusion mechanism to capture global and local time-frequency features. This framework dynamically reallocates its focus to subtle forgery traces, significantly outperforming state-of-the-art baselines and improving feature separability across diverse degradation scenarios.
audio
Published: 2025-08-01
Accepted at APSIPA ASC 2025
Authors: Rishith Sadashiv T N, Abhishek Bedge, Saisha Suresh Bore, Jagabandhu Mishra, Mrinmoy Bhattacharjee, S R Mahadeva Prasanna
This paper proposes a novel speech representation for fake speech detection to address domain generalization challenges. It fuses self-supervised (SSL) wav2vec 2.0 XLS-R embeddings with Modulation Spectrogram (MS) features using a multi-head attention mechanism. The combined representation is then fed into an AASIST backend network, significantly improving performance and generalizability across in-domain, cross-dataset, and multilingual scenarios.
audio
Published: 2025-07-29
Published in ACL 2025. Dataset available at: https://github.com/YMLLG/SpeechFake
Authors: Wen Huang, Yanmei Gu, Zhiming Wang, Huijia Zhu, Yanmin Qian
This paper introduces SpeechFake, a large-scale, multilingual dataset for speech deepfake detection, comprising over 3 million samples (3,000+ hours) generated by 40 diverse speech synthesis tools including cutting-edge text-to-speech, voice conversion, and neural vocoder methods across 46 languages. It addresses limitations of existing datasets in scale and diversity, providing detailed creation and composition, along with baseline detection model performance and an analysis of factors influencing detection.
audio
Published: 2025-07-27
ACCEPTED WASPAA 2025
Authors: Yassine El Kheir, Arnab Das, Enes Erdem Erdogan, Fabian Ritter-Guttierez, Tim Polzehl, Sebastian Möller
This paper proposes hybrid fusion frameworks for robust audio deepfake detection, integrating self-supervised learning (SSL) based representations with handcrafted spectral descriptors (MFCC, LFCC, CQCC). By exploring various fusion strategies, the approach aims to capture subtle artifacts that single-feature methods often overlook, leading to improved generalization to unseen attacks. The cross-attention fusion strategy significantly reduces the equal error rate (EER), confirming that joint modeling of waveform and spectral views produces robust, domain-agnostic representations for audio deepfake detection.
audio
Published: 2025-07-23
Accepted to IJCB 2025 (IEEE/IAPR International Joint Conference on Biometrics). Code available...
Authors: Aditya Pujari, Ajita Rattani
This paper introduces WaveVerify, a novel audio watermarking framework designed for media authentication and combatting deepfakes. It leverages a Feature-wise Linear Modulation (FiLM)-based generator for resilient multiband watermark embedding and a Mixture-of-Experts (MoE) detector for accurate extraction and localization. The system significantly enhances robustness against diverse audio distortions and temporal manipulations through a unified training framework with dynamic effect scheduling.
audio
Published: 2025-07-22
Accepted by IEEE International Joint Conference on Biometrics (IJCB) 2025, Osaka, Japan
Authors: Xuechen Liu, Wanying Ge, Xin Wang, Junichi Yamagishi
This study introduces LENS-DF, a novel recipe for training and evaluating audio deepfake detection and temporal localization under realistic conditions, including longer duration, noisy environments, and multiple speakers. Models trained using data generated with LENS-DF consistently outperform those trained with conventional recipes, demonstrating its effectiveness for robust audio deepfake detection and localization.
audio
Published: 2025-07-20
5 pages, 4 figures, 4 tables. Accepted to IEEE SPL
Authors: Menglu Li, Xiao-Ping Zhang, Lian Zhao
This paper introduces a novel Temporal Difference Attention Module (TDAM) for detecting partial deepfake speech by analyzing frame-level temporal differences. TDAM identifies unnatural temporal variations and erratic directional changes in deepfake speech, overcoming the limitation of requiring costly frame-level annotations. The proposed TDAM-AvgPool model achieves state-of-the-art performance on relevant datasets.
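A simplified stand-in for frame-level temporal-difference attention of the kind described: first-order frame differences drive the attention weights over time. This is illustrative only, not the paper's TDAM module.

```python
# Sketch: weight frames by the magnitude of their frame-to-frame change, so erratic
# temporal variations dominate the pooled utterance representation.
import torch
import torch.nn as nn

class TemporalDifferenceAttention(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames):                         # frames: (B, T, feat_dim)
        diffs = frames[:, 1:] - frames[:, :-1]         # first-order temporal differences
        diffs = torch.cat([torch.zeros_like(frames[:, :1]), diffs], dim=1)
        weights = torch.softmax(self.score(diffs), dim=1)  # emphasize abrupt changes
        return (weights * frames).sum(dim=1)           # attention-weighted pooling over time

pooled = TemporalDifferenceAttention()(torch.randn(2, 200, 128))
```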
audio
Published: 2025-07-17
Authors: Kutub Uddin, Awais Khan, Muhammad Umar Farooq, Khalid Malik
This paper introduces SHIELD, a novel collaborative learning method designed to enhance robust deepfake audio detection against generative anti-forensic (AF) attacks. It integrates an auxiliary defense generative model to expose AF signatures and employs a triplet model to capture correlations between real and AF attacked audios, along with their generated counterparts. SHIELD demonstrates significantly improved detection accuracy against various generative AF attacks, outperforming existing methods and bolstering defense against adversarial manipulations.
audio
Published: 2025-07-17
Accepted by ACM MM 2025, Open-sourced
Authors: Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Qingming Li, Tianyu Du, Shouling Ji
This paper introduces Enkidu, a novel user-oriented privacy-preserving framework designed to protect against voice deepfake threats. Enkidu utilizes universal frequential perturbations (UFP) generated through black-box knowledge and few-shot training to provide real-time, lightweight protection. It ensures strong generalization across variable-length audio and robust resistance to voice deepfake attacks while preserving perceptual quality and speech intelligibility.
audio
Published: 2025-07-15
Authors: Ivan Viakhirev, Daniil Sirota, Aleksandr Smirnov, Kirill Borodin
This work introduces modest refinements to the AASIST anti-spoofing architecture to enhance speech deepfake detection, particularly in data-limited scenarios. It incorporates a frozen Wav2Vec 2.0 encoder, replaces bespoke graph attention with a standardized multi-head attention module, and integrates a trainable, context-aware fusion layer. The proposed system achieves a 7.6% equal error rate (EER) on the ASVspoof 5 corpus, outperforming a re-implemented AASIST baseline.
audio
Published: 2025-07-11
Accepted at ICCV Workshop - Authenticity & Provenance in the age of Generative AI
Authors: Davide Salvi, Viola Negroni, Sara Mandelli, Paolo Bestagini, Stefano Tubaro
This paper proposes a Person-of-Interest (POI) based speech deepfake detection method that operates at the phoneme level. It decomposes reference audio into phonemes to build a detailed speaker profile and then individually compares phonemes from a test sample against this profile to detect synthetic artifacts. The approach achieves comparable accuracy to traditional methods while offering superior robustness and interpretability, exploring a novel direction for explainable, speaker-centric deepfake detection.
audio
Published: 2025-07-11
Submitted to APSIPA ASC 2025
Authors: Yang Xiao, Ting Dang, Rohan Kumar Das
This paper introduces RawTFNet, a lightweight CNN model designed for speech anti-spoofing, which addresses the high computational cost of existing transformer-based models. RawTFNet improves performance by separating feature processing along time and frequency dimensions to capture fine-grained details of synthetic speech. Tested on ASVspoof 2021 LA and DF datasets, RawTFNet achieves comparable performance to state-of-the-art models while significantly reducing computational resources.
audio
Published: 2025-07-09
Accepted by INTERSPEECH 2025 as part of the special session "Source Tracing: The Origins of...
Authors: Nicholas Klein, Hemlata Tak, Elie Khoury
This paper addresses the critical need for robust open-set source tracing of audio deepfake systems by introducing Softmax Energy (SME), a novel adaptation to the energy score for out-of-distribution (OOD) detection. The authors leverage the Interspeech 2025 special session protocol to evaluate methods for improving open-set source tracing performance, combining SME with SME-guided training and various augmentations. This approach significantly enhances OOD detection, achieving a relative average improvement of 31% in FPR95.
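For reference, the standard energy score that Softmax Energy adapts is sketched below; the paper's exact SME formulation is not reproduced here, and the threshold is purely illustrative.

```python
# Sketch: classic energy-score OOD detection over source-classifier logits.
import torch

def energy_score(logits, temperature=1.0):
    # Lower (more negative) energy -> more in-distribution; threshold tuned on dev data.
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

logits = torch.randn(8, 10)          # logits over known deepfake generation systems
is_ood = energy_score(logits) > -2.0 # flag samples from unseen (out-of-distribution) systems
```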
audio
Published: 2025-07-07
ISMIR 2025 LBD, 2 pages + bibliography, 1 figure
Authors: Tomasz Sroka, Tomasz Wężowicz, Dominik Sidorczuk, Mateusz Modrzejewski
This paper evaluates the robustness of fake music detection systems against audio augmentations. Researchers constructed a dataset of real and synthetic music, applied various audio transformations, and tested a state-of-the-art musical deepfake detection model. The study reveals that the model's performance significantly degrades even with the introduction of light augmentations.
audio
Published: 2025-07-04
APSIPA 2025
Authors: Hieu-Thi Luong, Inbal Rimon, Haim Permuter, Kong Aik Lee, Eng Siong Chng
This paper critically examines current evaluation practices for partial audio deepfake localization, arguing that metrics like Equal Error Rate (EER) obscure generalization and deployment readiness. It proposes reframing the task as sequential anomaly detection and using threshold-dependent metrics for better real-world assessment. The study demonstrates that existing self-supervised learning models generalize poorly to out-of-domain data and that careful training data selection, specifically adding partially fake utterances, is crucial for improving robustness.
audio
Published: 2025-07-02
8 pages, 3 figures
Authors: Jose A. Lopez, Georg Stemmer, Héctor Cordourier Maruri
This paper presents a comprehensive study to enhance the generalization capabilities of audio deepfake detection models. The authors investigate various pre-trained backbones (Wav2Vec2, WavLM, Whisper), different data augmentation strategies, and novel loss functions across a diverse set of datasets. Their research demonstrates significant improvements in generalization, surpassing the performance of the top-ranked single system in the ASVspoof 5 Challenge.
audio
Published: 2025-06-30
Authors: Hashim Ali, Surya Subramani, Raksha Varahamurthy, Nithin Adupa, Lekha Bollinani, Hafiz Malik
This paper introduces a comprehensive methodology for collecting, curating, and generating high-quality synthetic speech data for ten public figures, addressing the challenges of maintaining voice authenticity. It details an automated pipeline for bonafide speech sample collection, featuring transcription-based segmentation that significantly enhances synthetic speech quality. The resulting 'Famous Figures' dataset demonstrates superior naturalness with a NISQA-TTS score of 3.69 and achieves a 61.9% human misclassification rate, indicating high realism.
audio
Published: 2025-06-28
5 pages, 3 figures, Published at Proceedings of Interspeech 2025, for the dataset see...
Proceedings of Interspeech 2025
Authors: Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali
This study introduces PhonemeFake, a novel deepfake (DF) attack that manipulates critical speech segments using language reasoning, significantly reducing human and benchmark detection accuracy. To counter this, they propose PhonemeFakeDetect, an adaptive bilevel detection model that efficiently and accurately identifies these fine-grained manipulations. Their detection model reduces EER by 91% while achieving up to 90% speed-up with precise localization.
audio
Published: 2025-06-26
Corrected previous implementation of EER calculation. Slight numerical changes in some of the results
Authors: Wanying Ge, Xin Wang, Xuechen Liu, Junichi Yamagishi
This paper introduces a post-training approach for deepfake speech detection, adapting self-supervised learning (SSL) models to bridge the gap between general pre-training and domain-specific fine-tuning. Named AntiDeepfake models, they are developed using a large-scale multilingual speech dataset comprising over 56,000 hours of genuine speech and 18,000 hours of speech with various artifacts. These models achieve strong robustness and generalization to unseen deepfake speech, consistently surpassing existing state-of-the-art detectors when further fine-tuned.
audio
Published: 2025-06-23
Project Website: https://indie-fake-dataset.netlify.app/
Authors: Abhay Kumar, Kunal Verma, Omkar More
This paper introduces the IndieFake Dataset (IFD), a new benchmark dataset for audio deepfake detection, specifically addressing the lack of diverse ethnic accents, particularly from South-Asian speakers, in existing datasets. IFD comprises 27.17 hours of bonafide and deepfake audio from 50 English-speaking Indian speakers, featuring balanced data distribution and speaker-level characterization. The dataset is publicly accessible and proves to be a more challenging benchmark than existing datasets like ASVspoof21 (DF) and In-The-Wild (ITW).
audio
Published: 2025-06-17
Authors: Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang
This work proposes a unified framework to comparatively evaluate proactive watermarking models and passive deepfake detectors for speech deepfake detection. The framework enables fair comparison by training and testing all models on common datasets, using a shared metric, and analyzing their robustness against various adversarial attacks. The study reveals distinct vulnerabilities of different models to speech attribute distortions, indicating that robustness remains a critical challenge.
audio
Published: 2025-06-17
Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence
Authors: Jiayi He, Jiangyan Yi, Jianhua Tao, Siding Zeng, Hao Gu
This survey provides the first comprehensive overview of manipulated region localization tasks for partially deepfake audio. It systematically introduces the fundamentals, categorizes existing methods, highlights current limitations, and discusses future development trends in this emerging field. The paper aims to offer a revealing insight for researchers and guide future advancements in detecting covert audio manipulations.
audio
Published: 2025-06-14
Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Arun Balaji Buduru, Rajesh Sharma
This paper introduces Neural Audio Codec Source Parsing (NACSP), a novel paradigm reframing audio deepfake source attribution as a multi-task regression problem to predict generative Neural Audio Codec (NAC) parameters. It proposes HYDRA, a framework leveraging hyperbolic geometry and task-specific attention to disentangle latent properties from pre-trained model representations. HYDRA significantly outperforms Euclidean baselines on benchmark codecfake datasets, enabling more granular and generalizable forensic insights into unseen NACs.
audio
Published: 2025-06-13
Accepted to Interspeech 2025
Authors: Wen Huang, Xuechen Liu, Xin Wang, Junichi Yamagishi, Yanmin Qian
This paper investigates sharpness as a theoretical proxy for generalization in speech deepfake detection (SDD). It demonstrates that sharpness increases in unseen conditions, indicating higher model sensitivity to domain shifts. By applying Sharpness-Aware Minimization (SAM), the authors achieve better and more stable SDD performance across diverse unseen test sets, confirming a statistically significant relationship between sharpness and generalization.
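A generic Sharpness-Aware Minimization step, following the usual two-pass formulation, is sketched below; the model, loss function, and rho value are placeholders rather than the paper's training setup.

```python
# Sketch: one SAM update — compute gradients, perturb weights toward the local worst
# case inside an L2 ball, recompute gradients there, then update the original weights.
import torch

def sam_step(model, loss_fn, batch, labels, optimizer, rho=0.05):
    loss_fn(model(batch), labels).backward()          # first pass: grads at current weights

    with torch.no_grad():
        grad_norm = torch.norm(torch.stack(
            [p.grad.norm() for p in model.parameters() if p.grad is not None]))
        eps = {}
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)    # ascent direction, scaled to radius rho
            p.add_(e)
            eps[p] = e
    optimizer.zero_grad()

    loss_fn(model(batch), labels).backward()          # second pass: grads at perturbed point
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)                                  # restore original weights
    optimizer.step()
    optimizer.zero_grad()
```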
audio
Published: 2025-06-11
Proceedings of Interspeech 2025
Authors: David Combei, Adriana Stan, Dan Oneata, Nicolas Müller, Horia Cucu
This paper introduces a novel dataset of real-world audio deepfakes, AI4T, demonstrating that state-of-the-art detection models struggle with these challenging examples. Instead of increasing model complexity, the authors adopt a data-centric paradigm, employing dataset curation, pruning, and augmentation strategies. This approach significantly improves model robustness and generalization for real-world audio deepfake detection.
audio
Published: 2025-06-08
IEEE ASRU 2025
Authors: Xuanjun Chen, I-Ming Lin, Lin Zhang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
This paper addresses the challenge of source tracing for codec-based deepfake speech, which often suffers from generalization issues due to overfitting to non-speech regions and specific content. The authors propose SASTNet, a Semantic-Acoustic Source Tracing Network, that combines semantic and acoustic feature encoding to improve robustness and generalization. SASTNet achieves state-of-the-art performance on the CodecFake+ dataset, demonstrating its effectiveness in reliably tracing the source of deepfake speech.
audio
Published: 2025-06-07
Accepted in Interspeech 2025
Authors: Rishabh Ranjan, Kishan Pipariya, Mayank Vatsa, Richa Singh
This paper introduces SynHate, the first multilingual dataset for detecting hate speech in synthetic audio, spanning 37 languages. SynHate employs a novel four-class scheme (Real-normal, Real-hate, Fake-normal, Fake-hate) and is built from the MuTox and ADIMA datasets. The authors evaluate five leading self-supervised models on SynHate, finding that Whisper-small performs best overall, but cross-dataset generalization remains a significant challenge.
audio
Published: 2025-06-06
Proceedings of Interspeech 2025
Authors: Adriana Stan, David Combei, Dan Oneata, Horia Cucu
This paper introduces TADA, a training-free and computationally efficient approach for audio deepfake model attribution and out-of-domain (OOD) detection. It leverages k-Nearest Neighbors (kNN) on features extracted from a pre-trained self-supervised learning (SSL) model. The method achieves a 0.93 F1-score for attributing deepfake sources across five datasets and a 0.84 F1-score for detecting samples from unseen generative models.
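A minimal sketch of training-free kNN attribution over pre-extracted SSL embeddings, in the spirit of the summary above; the feature extractor, neighbor count, and OOD threshold are assumptions.

```python
# Sketch: attribute a test embedding to the majority label of its k nearest training
# embeddings; flag it as OOD if even the nearest known sources are far away.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_attribute(train_embs, train_labels, test_embs, k=5, ood_threshold=0.6):
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(train_embs)
    dists, idx = nn.kneighbors(test_embs)
    preds, ood_flags = [], []
    for d, i in zip(dists, idx):
        values, counts = np.unique(train_labels[i], return_counts=True)
        preds.append(values[np.argmax(counts)])       # majority vote over neighbors
        ood_flags.append(d.mean() > ood_threshold)    # far from all known sources -> OOD
    return np.array(preds), np.array(ood_flags)

# Random stand-ins; in practice these would be SSL (e.g. wav2vec 2.0) embeddings.
rng = np.random.default_rng(0)
train_embs, train_labels = rng.normal(size=(200, 768)), rng.integers(0, 5, size=200)
preds, ood = knn_attribute(train_embs, train_labels, rng.normal(size=(10, 768)))
```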
audio
Published: 2025-06-03
5 pages, 3 figures, accepted at Interspeech 2025
Interspeech 2025
Authors: Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj
This paper introduces a novel data-driven diffusion-based approach for generating explanations in audio deepfake detection. It creates ground-truth explanations by analyzing the time-frequency differences between paired real and vocoded audio. These ground truths are then used to train a diffusion model to identify and highlight artifact regions in deepfake audio, outperforming traditional explainability techniques both qualitatively and quantitatively.
audio
Published: 2025-06-03
Interspeech 2025 camera ready. Project page: https://yzyouzhang.com/PartialEdit/
Authors: You Zhang, Baotong Tian, Lin Zhang, Zhiyao Duan
This paper introduces PartialEdit, a novel deepfake speech dataset curated using advanced neural speech editing techniques to encourage research in detecting partially edited deepfake speech. It explores both detection and localization tasks, demonstrating that existing models struggle with these new deepfakes and providing insights into neural audio codec artifacts.
audio
Published: 2025-06-03
Authors: Chi Ding, Junxiao Xue, Cong Wang, Hao Zhou
This paper proposes a trusted fake audio detection approach based on the Dirichlet distribution to enhance the reliability of detection by modeling decision trustworthiness. It generates evidence through a neural network, models uncertainty using the Dirichlet distribution to estimate uncertainty for each decision, and then combines predicted probabilities with these uncertainty estimates to form a final opinion. The method demonstrates excellent performance in accuracy, robustness, and trustworthiness on the ASVspoof series datasets.
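The evidence-to-Dirichlet step described above follows the standard evidential formulation; a minimal sketch, assuming a softplus evidence head and a two-class (real/fake) detector, is given below.

```python
# Sketch: turn detector logits into non-negative evidence, parameterize a Dirichlet,
# and read off expected probabilities plus a per-decision uncertainty.
import torch
import torch.nn.functional as F

def dirichlet_opinion(logits):
    evidence = F.softplus(logits)               # non-negative evidence per class
    alpha = evidence + 1.0                      # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    prob = alpha / strength                     # expected class probabilities
    uncertainty = logits.shape[-1] / strength   # K / sum(alpha): high when evidence is weak
    return prob, uncertainty

logits = torch.randn(4, 2)                      # batch of real/fake logits from a detector
prob, unc = dirichlet_opinion(logits)           # combine prob with unc to accept or abstain
```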
audio
Published: 2025-06-02
Accepted at Interspeech 2025, Netherlands
Authors: Ajinkya Kulkarni, Sandipana Dowerah, Tanel Alumae, Mathew Magimai. -Doss
This paper introduces a novel audio source tracing system designed to identify the generative origin of audio deepfakes, moving beyond just discerning real from spoofed speech. The approach combines deep metric multi-class N-pair loss with a Real Emphasis and Fake Dispersion framework, a Conformer classification network, and an ensemble score-embedding fusion strategy. It aims to improve discriminative ability, robustness, and achieve an optimal trade-off in both in-domain and out-of-domain source tracing scenarios.
audio
Published: 2025-05-31
Accepted at EACL 2026
Authors: Ioan-Paul Ciobanu, Andrei-Iulian Hiji, Nicolae-Catalin Ristea, Paul Irofti, Cristian Rusu, Radu Tudor Ionescu
This paper introduces XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech across seven languages. Unlike typical in-domain setups, XMAD-Bench features distinct speakers, generative methods, and real audio sources across training and test splits, creating a challenging 'in the wild' evaluation scenario. Experiments reveal a significant disparity between high in-domain detection performance (near 100%) and poor cross-domain performance, often akin to random chance, highlighting the urgent need for more robust audio deepfake detectors.
audio
Published: 2025-05-31
Authors: Ruibo Fu, Xiaopeng Wang, Zhengqi Wen, Jianhua Tao, Yuankun Xie, Zhiyong Wang, Chunyu Qiang, Xuefei Liu, Cunhang Fan, Chenxing Li, Guanjun Li
This paper proposes RPRA-ADD, a robust audio deepfake detection framework that enhances forgery traces by integrating Reconstruction-Perception-Reinforcement-Attention networks. It aims to overcome generalization challenges and better distinguish real from fake audio by specifically focusing on subtle forgery characteristics. The framework achieves state-of-the-art performance across multiple benchmark datasets and demonstrates strong generalization capabilities in diverse audio domains.
audio
Published: 2025-05-30
Accepted by Interspeech 2025
Authors: Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia
This paper introduces Rehearsal with Auxiliary-Informed Sampling (RAIS), a continual learning approach designed to prevent performance degradation in audio deepfake detection against new attacks. RAIS leverages a label generation network to produce auxiliary labels, which then guide the selection of diverse samples for the memory buffer. This strategy helps retain prior knowledge while incorporating new information, achieving a superior average Equal Error Rate (EER) of 1.953% across five experiences.
audio
Published: 2025-05-29
Authors: Neta Glazer, David Chernin, Idan Achituve, Sharon Gannot, Ethan Fetaya
This paper introduces ADD-GP, a few-shot adaptive framework based on a Gaussian Process (GP) classifier for Audio Deepfake Detection (ADD). The approach combines a powerful deep embedding model (XLS-R) with the flexibility of Gaussian Processes to achieve strong performance and efficient adaptation to unseen Text-to-Speech (TTS) models with minimal data. It also demonstrates applicability for personalized detection with increased robustness and one-shot adaptability.
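A few-shot Gaussian Process classifier over pre-extracted embeddings can be sketched with scikit-learn as below; the kernel choice and the random stand-in embeddings are assumptions, not the paper's exact configuration.

```python
# Sketch: fit a GP classifier on a handful of bonafide/deepfake embeddings from a new
# TTS model, then obtain calibrated fake probabilities for unseen utterances.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 256))      # few-shot embeddings (stand-ins for XLS-R features)
y_train = np.array([0] * 20 + [1] * 20)   # 0 = bonafide, 1 = deepfake from the new TTS model

gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=10.0)).fit(X_train, y_train)
probs = gp.predict_proba(rng.normal(size=(5, 256)))   # per-sample fake probabilities
```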
audio
Published: 2025-05-26
Accepted at INTERSPEECH 2025 The dataset is available at https://huggingface.co/datasets/MBZUAI/ArVoice
Authors: Hawau Olamide Toyin, Rufael Marew, Humaid Alblooshi, Samar M. Magdy, Hanan Aldarmaki
This paper introduces ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions, designed for tasks like speech synthesis, voice conversion, and deepfake detection. The corpus combines new professional recordings from six diverse voice talents, a modified subset of the Arabic Speech Corpus, and high-quality synthetic speech from commercial systems. Totaling 83.52 hours across 11 voices (10 hours of human speech from 7 speakers), ArVoice's utility is illustrated by training open-source TTS and voice conversion systems.
audio
Published: 2025-05-26
Published at Interspeech 2025 conference
Interspeech 2025
Authors: Anton Firc, Manasi Chhibber, Jagabandhu Mishra, Vishwanath Pratap Singh, Tomi Kinnunen, Kamil Malinka
This paper introduces STOPA, a novel, systematically curated, and metadata-rich dataset designed for deepfake speech source tracing in open-set scenarios. It covers 8 acoustic models, 6 vocoder models, and various parameter settings, comprising over 700k samples from 13 distinct synthesizers. STOPA aims to address the limitations of existing datasets by providing a controlled framework for analyzing generative factors, thereby improving attribution reliability and aiding forensic analysis and deepfake detection.
audio
Published: 2025-05-25
Proceedings of Interspeech 2025
Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D Plumbley
This paper introduces EnvSDD, the first large-scale dataset for environmental sound deepfake detection (ESDD), comprising both real and AI-generated audio. It also proposes an ESDD system based on a pre-trained audio foundation model (BEATs) integrated with AASIST. The proposed system significantly outperforms state-of-the-art speech and singing deepfake detection methods on the new dataset, demonstrating the importance of domain-specific pre-training.
audio
Published: 2025-05-23
15 pages, 2 figures
Authors: Binh Nguyen, Shuji Shi, Ryan Ofman, Thai Le
This paper investigates the linguistic sensitivity of audio deepfake detectors by applying transcript-level adversarial attacks. It demonstrates that subtle linguistic perturbations can significantly reduce the accuracy of both open-source and commercial anti-spoofing systems, with attack success rates exceeding 60% in some cases. The study further identifies linguistic complexity and model-level audio embedding similarity as key factors contributing to detector vulnerability.
audio
Published: 2025-05-21
5 pages, 3 figures. Accepted to Interspeech 2025 Conference
https://www.isca-archive.org/interspeech_2025/weizman25_interspeech.pdf
Authors: Avishai Weizman, Yehuda Ben-Shimol, Itshak Lapidot
This paper analyzes and compares the ASVspoof2019 and ASVspoof5 challenge databases, focusing on changes in database conditions and their impact on spoofing detection. It highlights that ASVspoof5 introduces significant mismatches in both bona fide and spoofed speech statistics, making it considerably more challenging than ASVspoof2019. The study demonstrates that in ASVspoof5, genuine speech statistically shifts closer to spoofed speech, increasing the difficulty for countermeasure systems.
audio
Published: 2025-05-20
Interspeech 2025
Authors: Nicolas Müller, Piotr Kawa, Wei-Herng Choong, Adriana Stan, Aditya Tirumala Bukkapatnam, Karla Pizzi, Alexander Wagner, Philip Sperl
This paper demonstrates how replay attacks undermine audio deepfake detection by playing and re-recording deepfake audio, making spoofed samples appear authentic to detection models. The authors introduce ReplayDF, a dataset of such recordings across diverse acoustic conditions, languages, and TTS models. Their analysis shows significant vulnerability in existing detection models, with performance dropping considerably even after adaptive retraining.
audio
Published: 2025-05-20
Accepted by Interspeech 2025
Authors: Yang Xiao, Rohan Kumar Das
This paper addresses the challenge of catastrophic forgetting in audio deepfake source tracing (ST) when models must incrementally learn new deepfake attacks while retaining knowledge of previous ones. The authors propose AnaST, an analytic class incremental learning method that updates the classifier with a closed-form analytical solution in one epoch while keeping the feature extractor fixed. This exemplar-free approach ensures data privacy, optimizes memory usage, and demonstrates superior performance against baselines in adapting to new attacks.
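The closed-form classifier update can be illustrated with a ridge-regression-style solution over frozen features; this is a generic analytic update, not the paper's exact recursive formulation.

```python
# Sketch: solve the classifier weights in one shot over features from a fixed extractor,
# instead of running gradient epochs — the core idea behind an "analytic" update.
import numpy as np

def analytic_classifier(features, onehot_labels, reg=1e-3):
    # W = (X^T X + reg*I)^(-1) X^T Y
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + reg * np.eye(d),
                           features.T @ onehot_labels)

X = np.random.randn(500, 256)                    # frozen-extractor features
Y = np.eye(6)[np.random.randint(0, 6, 500)]      # one-hot labels for 6 attack classes
W = analytic_classifier(X, Y)
scores = X[:5] @ W                               # class scores for new samples
```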
audio
Published: 2025-05-20
Accepted at INTERSPEECH 2025
Authors: Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro
This paper introduces the novel task of source verification for speech deepfakes, aiming to trace the origin of synthetic audio rather than merely detecting its authenticity. Inspired by speaker verification, the approach leverages embeddings from a classifier trained for source attribution to compute distance scores, determining if a test track was produced by the same model as a set of reference signals. This method provides a flexible solution for open-set scenarios and unseen generators, offering insights for real-world forensic applications.
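The speaker-verification-style distance scoring described above can be sketched as mean cosine similarity between a test embedding and enrolled reference embeddings; the embedding model and decision threshold below are assumptions.

```python
# Sketch: score whether a test track was produced by the same generator as a set of
# reference tracks, using cosine similarity between attribution embeddings.
import torch
import torch.nn.functional as F

def source_verification_score(test_emb, reference_embs):
    """Higher score -> test track more likely produced by the reference generator."""
    test_emb = F.normalize(test_emb, dim=-1)
    reference_embs = F.normalize(reference_embs, dim=-1)
    return (reference_embs @ test_emb).mean().item()   # mean cosine similarity

refs = torch.randn(20, 256)      # embeddings of tracks from a known generator
test = torch.randn(256)          # embedding of the track under investigation
same_source = source_verification_score(test, refs) > 0.5   # threshold set on dev data
```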
audio
Published: 2025-05-20
Accepted by Interspeech 2025
Authors: Taewoo Kim, Guisik Kim, Choongsang Cho, Young Han Lee
This study proposes naturalness-aware curriculum learning, a novel training framework for speech deepfake detection (SDD) that leverages speech naturalness to enhance robustness and generalization. It measures sample difficulty using ground-truth labels and mean opinion scores (MOS), progressively introducing challenging samples, and incorporates dynamic temperature scaling based on naturalness to regulate model confidence. This approach significantly improved detection performance without modifying existing model architectures.
audio
Published: 2025-05-20
Accepted Interspeech 2025
Authors: Yassine El Kheir, Tim Polzehl, Sebastian Möller
This paper introduces BiCrossMamba-ST, a novel and robust framework for speech deepfake detection. It employs a dual-branch spectro-temporal architecture utilizing bidirectional Mamba blocks and mutual cross-attention to effectively capture subtle cues of synthetic speech. Additionally, a convolution-based 2D attention map focuses on critical spectro-temporal regions for enhanced detection.
audio
Published: 2025-05-20
Accepted for publication in Forensic Science International
Authors: Tianle Yang, Chengzhe Sun, Siwei Lyu, Phil Rose
This study investigates the efficacy of segmental speech features, particularly vowel formants, for forensic deepfake audio detection. It proposes a speaker-specific framework using these highly interpretable features, which are expected to be more difficult for deepfake models to replicate. The research demonstrates that certain segmental features provide significantly stronger evidential value in distinguishing real from synthetic speech, outperforming some global acoustic features.
audio
Published: 2025-05-19
Accepted by Interspeech 2025; Update table 3/4
Authors: Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
This paper introduces a novel method for tracing the source of codec-based audio deepfakes (CodecFakes) by analyzing neural audio codec taxonomy. It defines three multi-class classification tasks based on vector quantization, auxiliary objectives, and decoder types, integrated into a multi-task training framework. Experimental results on the CodecFake+ dataset demonstrate the feasibility of this source tracing approach while also highlighting challenges, particularly with out-of-domain deepfakes.
audio
Published: 2025-05-16
Accepted by ACMMM 2025
Authors: Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen
This paper introduces ALLM4ADD, a novel framework leveraging Audio Large Language Models (ALLMs) for audio deepfake detection (ADD). It reformulates ADD as an audio question answering task, fine-tuning ALLMs to determine if audio is fake or real by prompting. Experiments demonstrate ALLM4ADD's superior performance, especially in data-scarce environments, after initial zero-shot evaluation revealed ALLMs' ineffectiveness without fine-tuning.
audio
Published: 2025-05-16
5 pages
Authors: Istiaq Ahmed Fahad, Kamruzzaman Asif, Sifat Sikder
This paper introduces BanglaFake, the first specialized Bengali Deepfake Audio Dataset, addressing the scarcity of resources for low-resource languages. It comprises 12,260 real and 13,260 deepfake utterances, generated using state-of-the-art Text-to-Speech (TTS) models to ensure high naturalness and quality. The dataset is evaluated through qualitative (Mean Opinion Score) and quantitative (t-SNE visualization of MFCCs) analyses, demonstrating the challenge of differentiating real from synthetic speech.
audio
Published: 2025-05-10
Submitted to IEEE Transactions on Biometrics, Behavior, and Identity Science (T-BIOM)
Authors: Yasaman Ahmadiadli, Xiao-Ping Zhang, Naimul Khan
This study addresses the challenge of poor generalization in deepfake audio detection models caused by implicit identity leakage, where models inadvertently learn speaker-specific features instead of manipulation artifacts. It proposes an identity-independent framework leveraging Artifact Detection Modules (ADMs) and novel dynamic artifact generation techniques to focus on forgery-specific cues. The approach significantly improves cross-dataset generalization, with dynamic frequency swap proving most effective.
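A simplified frequency-band swap between two training spectrograms is shown below as a stand-in for the dynamic artifact generation described above; the band size and selection strategy are assumptions, not the paper's method.

```python
# Sketch: swap a random frequency band between two spectrograms so the model sees
# artifact-like cues decoupled from speaker identity.
import torch

def frequency_swap(spec_a, spec_b, band=16):
    # spec_*: (freq, time); exchange one random frequency band between the two samples.
    f0 = torch.randint(0, spec_a.shape[0] - band, (1,)).item()
    out_a, out_b = spec_a.clone(), spec_b.clone()
    out_a[f0:f0 + band], out_b[f0:f0 + band] = spec_b[f0:f0 + band], spec_a[f0:f0 + band]
    return out_a, out_b

aug_a, aug_b = frequency_swap(torch.randn(128, 300), torch.randn(128, 300))
```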
audio
Published: 2025-05-03
Submitted as part of coursework at UT Austin. Accompanying code available at:...
Authors: Nick Sunday
This study investigates the detection of AI-generated music, termed musical deepfakes, by classifying audio as either deepfake or human. It utilizes the FakeMusicCaps dataset, augmented with tempo stretching and pitch shifting to simulate adversarial conditions. Mel spectrograms are generated from the audio and fed into a convolutional neural network for classification.
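The mel-spectrogram-to-CNN pipeline described above can be sketched as follows; the architecture, clip length, and omitted log/normalization steps are illustrative simplifications.

```python
# Sketch: mel spectrogram -> small CNN -> human vs. deepfake music logits.
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)

classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 2),              # 2 classes: human vs. deepfake
)

waveform = torch.randn(1, 16000 * 5)             # stand-in for a 5-second audio clip
logits = classifier(mel(waveform).unsqueeze(0))  # log scaling/normalization omitted for brevity
```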
audio
Published: 2025-04-29
Authors: Andrea Di Pierno, Luca Guarnera, Dario Allegra, Sebastiano Battiato
This paper proposes RawNetLite, a lightweight, end-to-end deep learning framework for audio deepfake detection that processes raw waveforms directly, without handcrafted preprocessing. To enhance robustness and generalization to unseen spoofing methods, the model integrates a training strategy combining data from multiple domains, Focal Loss, and waveform-level audio augmentations. It achieves high in-domain performance and significant generalization improvements on challenging out-of-distribution datasets.
audio
Published: 2025-04-29
Authors: Yue Li, Weizhi Liu, Dongdong Lin
This paper introduces TriniMark, a robust generative speech watermarking method designed to provide trinity-level attribution for AI-generated speech. It addresses the limitations of existing deepfake detection and watermarking techniques by enabling traceability of the diffusion model, generated content, and end-user. TriniMark achieves this by embedding watermarks into speech's time-domain features and fine-tuning diffusion models to inherently generate watermarked content.
audio
Published: 2025-04-16
Accepted by EUSIPCO 2025
2025 33rd European Signal Processing Conference (EUSIPCO)
Authors: Haohan Shi, Xiyu Shi, Safak Dogan, Saif Alzubi, Tianjin Huang, Yunxiao Zhang
This paper addresses the poor generalization of Audio Deepfake Detection (ADD) systems in real-world communication scenarios due to audio quality degradation. It introduces a rigorous benchmark and a new test dataset, ADD-C, which simulates diverse communication conditions including various audio codecs and packet loss rates. A novel data augmentation strategy is proposed to significantly improve the robustness and performance of ADD systems under these challenging conditions.
audio
Published: 2025-04-15
Accepted by IEEE International Conference on Multimedia & Expo 2025 (ICME 2025)
Authors: Botao Zhao, Zuheng Kang, Yayun He, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang
This paper introduces a novel framework for generalized audio deepfake detection, f-InfoED, which leverages frame-level latent information entropy to distinguish between bonafide and spoof audio by hypothesizing differences in information content. Coupled with AdaLAM, an adapter-based approach for enhancing feature extraction from large pre-trained audio models, the method achieves state-of-the-art performance and remarkable generalization capabilities across unseen deepfake data and perturbations. The authors also release the ADFF 2024 dataset to facilitate comprehensive evaluation of modern TTS/VC methods.
audio
Published: 2025-04-14
Accepted to USENIX Security 2025
Authors: Zhisheng Zhang, Derui Wang, Qianyi Yang, Pengyang Huang, Junhan Pu, Yuxin Cao, Kai Ye, Jie Hao, Yixian Yang
This paper introduces SafeSpeech, a defensive framework designed to protect user audio from malicious speech synthesis. It achieves this by embedding imperceptible perturbations into original speeches before they are uploaded, utilizing a novel technique called Speech PErturbative Concealment (SPEC). SafeSpeech effectively prevents the generation of high-quality synthetic speech, demonstrating state-of-the-art voice protection, strong transferability across various TTS models, and robustness against adaptive adversaries, all while maintaining real-time capabilities.
audio
Published: 2025-04-09
Accepted to AAAI 2026
Authors: Yuankun Xie, Ruibo Fu, Zhiyong Wang, Xiaopeng Wang, Songjun Cao, Long Ma, Haonan Cheng, Long Ye
This paper addresses the all-type audio deepfake detection (ADD) task by establishing a comprehensive benchmark covering speech, sound, singing voice, and music. It introduces the prompt tuning self-supervised learning (PT-SSL) paradigm and the wavelet prompt tuning (WPT)-SSL method, which leverages wavelet transforms to capture type-invariant frequency domain information, significantly reducing trainable parameters compared to fine-tuning. The proposed WPT-XLSR-AASIST achieves superior performance in detecting all types of deepfake audio.
audio
Published: 2025-04-08
Accepted to IEEE Transactions on Information Forensics and Security
Authors: Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li
This paper introduces Nested Res2Net (Nes2Net), a lightweight, dimensionality reduction (DR) layer-free back-end architecture for speech anti-spoofing using foundation models. Nes2Net directly processes high-dimensional features, enhancing multi-scale feature extraction and interaction while reducing computational costs and preventing information loss. It significantly improves performance and generalization across various deepfake detection scenarios, outperforming state-of-the-art baselines.
audio
Published: 2025-03-23
Authors: Emma Coletta, Davide Salvi, Viola Negroni, Daniele Ugo Leonzio, Paolo Bestagini
This paper introduces a novel interpretable one-class detection framework that reframes speech deepfake detection as an anomaly detection task. The model is trained exclusively on real speech to characterize its distribution, enabling the classification of out-of-distribution synthetic samples. It utilizes a Student-Teacher Feature Pyramid Matching system, enhanced with Discrepancy Scaling, to provide both robust detection and interpretable anomaly maps.
audio
Published: 2025-03-21
Authors: Xiang Li, Pin-Yu Chen, Wenqi Wei
This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions, categorized into noise perturbation, audio modification, and compression. It finds that while models are robust to noise, they are vulnerable to modifications and compression, especially neural codecs. Speech foundation models generally outperform traditional models, with larger models showing improved robustness, and data augmentation further enhancing resilience.
audio
Published: 2025-03-18
Authors: Guang Dai, Pinhao Wang, Cheng Yao, Fangtian Ying
InnerSelf introduces an innovative voice system leveraging speech synthesis and Large Language Models to create a personalized self-deepfaked voice for emotional well-being. This system allows users to engage in supportive and empathic dialogue with their own cloned voice, aiming to promote self-disclosure and regulation. By manipulating positive self-talk, InnerSelf seeks to reshape negative thoughts and improve overall emotional well-being.
audio
Published: 2025-02-27
Authors: Lam Pham, Dat Tran, Phat Lam, Florian Skopik, Alexander Schindler, Silvia Poletti, David Fischinger, Martin Boyer
This paper proposes DIN-CTS, a low-complexity Depthwise-Inception Neural Network (DIN) with a Contrastive Training Strategy (CTS) for deepfake speech detection (DSD). The approach transforms audio into spectrograms, trains the DIN using a three-stage contrastive method, and detects deepfakes by comparing test utterance embeddings to a learned Gaussian distribution of genuine speech via Mahalanobis distance. It achieves high performance on ASVspoof 2019 LA with significantly fewer parameters than traditional methods.
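The Mahalanobis scoring step against a Gaussian fitted on genuine-speech embeddings can be sketched as follows; the embedding extractor and the decision threshold are placeholders, not the paper's configuration.

```python
# Sketch: fit a Gaussian to bonafide embeddings, then score test utterances by their
# Mahalanobis distance to it — larger distance suggests deepfake speech.
import numpy as np

def fit_bonafide_gaussian(bonafide_embs, eps=1e-6):
    mu = bonafide_embs.mean(axis=0)
    cov = np.cov(bonafide_embs, rowvar=False) + eps * np.eye(bonafide_embs.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(emb, mu, cov_inv):
    diff = emb - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(0)
bonafide = rng.normal(size=(500, 64))           # embeddings of genuine utterances
mu, cov_inv = fit_bonafide_gaussian(bonafide)
is_fake = mahalanobis_score(rng.normal(size=64), mu, cov_inv) > 9.0  # threshold from dev data
```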
audio
Published: 2025-02-27
Authors: Nicolas Müller, Piotr Kawa, Adriana Stan, Thien-Phuc Doan, Souhwan Jung, Wei Herng Choong, Philip Sperl, Konstantin Böttinger
This paper introduces DeePen, a systematic penetration testing methodology designed to evaluate the robustness of audio deepfake detection classifiers. DeePen operates as a black-box attack, leveraging a set of signal processing modifications to expose vulnerabilities in both real-world commercial systems and publicly available academic models. The findings demonstrate that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations like time-stretching or echo addition.
audio
Published: 2025-02-20
Authors: Kevin Warren, Daniel Olszewski, Seth Layton, Kevin Butler, Carrie Gates, Patrick Traynor
This paper proposes using acoustic prosodic analysis for detecting audio deepfakes, focusing on high-level linguistic features like pitch, jitter, and shimmer. Their detector achieves comparable performance to baseline models (93% accuracy, 24.7% EER) on the ASVspoof2021 dataset. Crucially, the approach demonstrates superior robustness against L-infinity norm adversarial attacks and offers explainability through attention mechanisms, identifying key prosodic features influencing detection decisions.
audio
Published: 2025-02-15
10 pages, 5 figures, 7 tables
Authors: Janne Laakkonen, Ivan Kukanov, Ville Hautamäki
This paper proposes a novel approach for generalizable speech deepfake detection by combining meta-learning with Low-Rank Adaptation (LoRA). By inserting LoRA adapters into a self-supervised (SSL) backbone and training only these adapters using Meta-Learning Domain Generalization (MLDG), the method achieves strong zero-shot performance on unseen spoofing attacks. This approach significantly reduces the number of trainable parameters compared to full fine-tuning while improving generalization and stability.
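A minimal LoRA adapter wrapping a frozen linear layer illustrates training only low-rank adapters inside an SSL backbone; the rank, scaling, and placement are assumptions, and the meta-learning loop itself is not shown.

```python
# Sketch: LoRA adapter around a frozen pre-trained linear layer; only the low-rank
# matrices receive gradients, so very few parameters are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start equivalent to the frozen base layer
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Usage: wrap selected linears inside the SSL backbone, then pass only trainable
# parameters (requires_grad=True) to the (meta-)optimizer.
layer = LoRALinear(nn.Linear(768, 768))
trainable = [p for p in layer.parameters() if p.requires_grad]
```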
audio
Published: 2025-02-14
9 pages, four figures
Authors: Qingyuan Fei, Wenjie Hou, Xuan Hai, Xin Liu
This paper introduces VocalCrypt, a novel active defense mechanism against deepfake voice cloning based on the psychoacoustic masking effect. It embeds imperceptible pseudo-timbre (jamming information) into audio segments to prevent AI voice cloning systems from accurately replicating a target voice. VocalCrypt significantly enhances robustness and real-time performance compared to existing adversarial noise methods, demonstrating superior defense efficacy without compromising audio quality.
audio
Published: 2025-02-14
Work in progress
Authors: Yu-Xiang Lin, Chih-Kai Yang, Wei-Chih Chen, Chen-An Li, Chien-yu Huang, Xuanjun Chen, Hung-yi Lee
This report provides a preliminary evaluation of GPT-4o's audio processing and reasoning capabilities across diverse tasks in audio, speech, and music understanding. It highlights GPT-4o's strengths in areas like multilingual speech recognition and robustness against hallucinations, but also identifies weaknesses in tasks such as audio duration prediction and its tendency to refuse certain safety-sensitive tasks.
audio
Published: 2025-02-13
Authors: Eshaq Jamdar, Amith Kamath Belman
This paper introduces SyntheticPop, a novel data poisoning attack method targeting Voice Authentication (VA) systems enhanced with the VoicePop defense mechanism. SyntheticPop embeds synthetic 'pop' noises into spoofed audio samples during training, significantly degrading the VA+VoicePop system's phoneme recognition capabilities. The attack achieves a high success rate, demonstrating a critical vulnerability in current voice authentication systems against logical attacks.
audio
Published: 2025-02-13
Database link: https://zenodo.org/records/14498691, Database mirror link:...
Authors: Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi, Myeonghun Jeong, Ge Zhu, Yongyi Zang, You Zhang, Soumi Maiti, Florian Lux, Nicolas Müller, Wangyou Zhang, Chengzhe Sun, Shuwei Hou, Siwei Lyu, Sébastien Le Maguer, Cheng Gong, Hanjie Guo, Liping Chen, Vishwanath Singh
ASVspoof 5 introduces a new challenge and a comprehensive crowdsourced database designed for evaluating speech spoofing, deepfake, and adversarial attack detection solutions. The database features speech from approximately 2,000 speakers, incorporating diverse acoustic conditions and attacks generated by 32 different algorithms, including legacy, contemporary TTS/VC, and adversarial methods. The paper details the database design, collection, and experimental validation using baseline detectors, making the resources freely available to the research community.
audio
Published: 2025-02-06
24 pages, 10 figures
Authors: Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli
This paper introduces XAttnMark, a novel neural audio watermarking framework designed to achieve robust watermark detection and accurate message attribution simultaneously, addressing limitations of prior methods. It employs architectural innovations such as partial parameter sharing, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. XAttnMark also integrates a psychoacoustic-aligned temporal-frequency masking loss to enhance watermark imperceptibility, demonstrating state-of-the-art performance against a wide range of audio transformations, including challenging generative editing.
audio
Published: 2025-02-05
Accepted to NAACL Findings 2025
Authors: Yassine El Kheir, Youness Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller
This paper conducts a comprehensive layer-wise analysis of self-supervised learning (SSL) models for audio deepfake detection across diverse multilingual and contextual scenarios. The study reveals that lower transformer layers consistently provide the most discriminative features, allowing models to achieve competitive equal error rate (EER) scores even with a reduced number of layers. This finding suggests a significant potential for reducing computational costs and increasing inference speed in deepfake detection.
audio
Published: 2025-01-31
Accepted in ICASSP, 2025
Authors: Falguni Sharma, Priyanka Gupta
This work proposes a singing voice deepfake detection (SVDD) system utilizing noise-variant encodings from OpenAI's Whisper model. Counter-intuitively, while Whisper is noise-robust for ASR, its encodings capture rich non-speech information and are noise-variant, making them suitable features for SVDD. The system is evaluated on singing vocals and mixtures using CNN and ResNet34 classifiers, across different Whisper model sizes and testing conditions.
audio
Published: 2025-01-24
Accepted to ICASSP 2025
Authors: Wen Huang, Yanmei Gu, Zhiming Wang, Huijia Zhu, Yanmin Qian
This paper proposes a novel strategy for generalizable audio deepfake detection by integrating Latent Space Refinement (LSR) and Latent Space Augmentation (LSA). LSR introduces multiple learnable prototypes for the spoof class to capture intricate variations, while LSA diversifies spoofed data representations through latent space augmentations. The combined approach significantly enhances generalization and achieves competitive or superior performance compared to state-of-the-art methods.
audio
Published: 2025-01-23
Accepted to ICASSP 2025
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing...
Authors: Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj
This paper introduces Gradient Average Transformer Relevancy (GATR), a novel explainable AI (XAI) method for interpreting transformer-based audio deepfake detection (ADD) models in the time domain. GATR is quantitatively shown to outperform existing XAI techniques like Grad-CAM and SHAP-based methods on various faithfulness metrics when evaluating explanations on large datasets. The study highlights that XAI methods differ significantly in their interpretations and that conclusions about detector focus (e.g., speech/non-speech regions, phonetic content) derived from limited utterances may not generalize across entire datasets or different acoustic conditions.
audio
Published: 2025-01-21
WACV 2025
Authors: Muhammad Umar Farooq, Awais Khan, Kutub Uddin, Khalid Mahmood Malik
This paper introduces a transferable GAN-based adversarial attack framework to evaluate the resilience of state-of-the-art (SOTA) audio deepfake detection (ADD) systems. The framework generates high-quality adversarial attacks by leveraging an ensemble of surrogate ADD models, a discriminator, and a self-supervised audio model to ensure transcription and perceptual integrity. Experimental results demonstrate significant vulnerabilities in SOTA ADD systems, with substantial accuracy drops across white-box, gray-box, and black-box attack scenarios on benchmark datasets.
audio
Published: 2025-01-14
Work in Progress
Authors: Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee
This paper introduces CodecFake+, a large-scale dataset for detecting deepfake speech generated by neural audio codec-based speech generation (CoSG) systems. It also proposes a comprehensive taxonomy for neural audio codecs, categorizing them by vector quantizer, auxiliary objectives, and decoder types. Through multi-level analysis, the study demonstrates the dataset's effectiveness in advancing CodecFake detection and provides insights into optimal training data selection and generalization factors.
audio
Published: 2025-01-11
Authors: Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Songjun Cao, Long Ma, Chenxing Li, Haonan Cheng, Long Ye
This paper defines the Neural Codec Source Tracing (NCST) task for open-set neural codec classification and interpretable Audio Language Model (ALM) detection, addressing limitations of existing closed-set studies. The authors construct the ST-Codecfake dataset with bilingual audio from 11 neural codecs and ALM-based out-of-distribution (OOD) samples to establish a comprehensive open-set benchmark. Experimental results show that while NCST models perform well in in-distribution (ID) classification and OOD detection, they lack robustness when classifying unseen real audio.
audio
Published: 2025-01-09
Authors: Inbal Rimon, Oren Gal, Haim Permuter
This paper proposes a hybrid training framework for deepfake speech detection that leverages novel augmentation strategies, including a dual-stage masking approach (MaskedSpec and MaskedFeature) and a compression-aware strategy during self-supervised pretraining. The framework integrates a learnable self-supervised feature extractor with a ResNet classification head in a unified training pipeline. The system achieves state-of-the-art results on the ASVspoof5 Challenge (Track 1) and leading performance on ASVspoof2019 and ASVspoof2021 DF tasks.
audio
Published: 2025-01-09
Expert Systems with Applications, Volume 307, 25 April 2026, 131056
Authors: Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna
SIGNL is a label-efficient audio deepfake detection system that leverages spectral and temporal graphs derived from audio spectrograms. It employs a graph non-contrastive self-supervised learning strategy to pre-train graph convolutional encoders on augmented graph pairs without labels. These pre-trained encoders are then fine-tuned on minimal labeled data for robust deepfake detection.
audio
Published: 2024-12-24
To appear in IEEE ICASSP 2025
Authors: Xuechen Liu, Junichi Yamagishi, Md Sahidullah, Tomi Kinnunen
This study investigates the explainability of 'spoof embeddings' from deep neural network-based audio spoofing detection systems, contrasting them with speaker embeddings. It uses probing analysis with simple neural classifiers to determine how well these embeddings capture speaker-related (metadata and acoustic) information. The research demonstrates that spoof embeddings preserve certain key traits like gender, speaking rate, F0, and duration, and leverage this information to ensure robust and gender-invariant spoof detection.
audio
Published: 2024-12-23
Keywords: Audio DeepFakes, DeepFake detection, multilingual audio DeepFakes
Authors: Bartłomiej Marek, Piotr Kawa, Piotr Syga
This paper benchmarks multilingual audio DeepFake detection, evaluating various adaptation strategies on models primarily trained with English datasets. It investigates the generalizability of these models to non-English languages and compares intra-linguistic and cross-linguistic adaptation approaches. The study highlights significant variations in detection efficacy across languages and underscores the critical importance of even limited target-language data for effective DeepFake detection.
audio
Published: 2024-12-23
Authors: Orchid Chetia Phukan, Drishti Singh, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma
This work investigates various speech pre-trained models (PTMs) for their ability to capture prosodic signatures of generative sources for audio deepfake source attribution (ADSD). The authors propose FINDER, a novel framework leveraging Renyi divergence for effective fusion of PTM representations. Their approach achieves state-of-the-art performance, with the fusion of Whisper and x-vector representations proving most effective.
audio
Published: 2024-12-17
Authors: Kuiyuan Zhang, Zhongyun Hua, Rushi Lan, Yushu Zhang, Yifang Guo
This paper introduces a novel speech deepfake detection method by identifying inconsistencies in phoneme-level speech features, which sophisticated synthesizers struggle to replicate perfectly. The approach leverages an adaptive phoneme pooling technique to extract sample-specific phoneme features and employs a Graph Attention Network (GAT) to model their temporal dependencies. Enhanced with a random phoneme substitution augmentation technique, the method demonstrates superior performance and generalization across various deepfake datasets.
audio
Published: 2024-12-16
Accepted by AAAI 2025
Authors: Yujie Chen, Jiangyan Yi, Cunhang Fan, Jianhua Tao, Yong Ren, Siding Zeng, Chu Yuan Zhang, Xinrui Yan, Hao Gu, Jun Xue, Chenglong Wang, Zhao Lv, Xiaohui Zhang
The paper introduces Region-Based Optimization (RegO), a novel continual learning method designed for audio deepfake detection, addressing the challenge of model performance degradation when encountering diverse and evolving deepfakes. RegO employs the Fisher information matrix to identify and categorize neuron regions, applying adaptive gradient optimization strategies, complemented by an Ebbinghaus forgetting mechanism to manage redundant neurons. This approach significantly outperforms state-of-the-art continual learning methods, achieving a 21.3% improvement in EER for audio deepfake detection, while also demonstrating generalizability to other domains like image recognition.
audio
Published: 2024-12-12
Authors: Yangguang Feng
This study introduces an audio deepfake detection method leveraging a multi-frequency channel attention mechanism (MFCA) and 2D discrete cosine transform (DCT). The approach processes audio into melspectrograms, extracts deep features using MobileNet V2, and employs MFCA to weight frequency channels, enhancing the capture of fine-grained frequency domain features. Experimental results demonstrate significant improvements in detection accuracy, robustness, and generalization compared to traditional methods.
audio
Published: 2024-12-02
Accepted by ISCSLP 2024
Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Yujie Chen, Hao Gu, Guanjun Li, Junzuo Zhou, Yong Ren, Tao Xu
This paper introduces Reject Threshold Adaptation (ReTA), a novel framework for open-set model attribution of deepfake audio. ReTA addresses the limitations of manually set rejection thresholds, which often lead to overfitting and poor adaptability across different data distributions. It proposes an adaptive threshold calculation mechanism based on learning reconstruction error distributions to accurately identify deepfake audio generation models and handle unknown classes.
audio
Published: 2024-11-30
Authors: Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller
This paper provides a comprehensive overview of existing AI-generated music (AIGM) detection methods. It lays a foundation by reviewing principles of AIGM, advancements in deepfake audios, and multimodal detection techniques. The authors propose a pathway for leveraging foundation models from audio deepfake detection to AIGM detection and discuss future research directions to address ongoing challenges.
audio
Published: 2024-11-29
Authors: Awais Khan, Ijaz Ul Haq, Khalid Mahmood Malik
This paper introduces the Parallel Stacked Aggregation Network (PSA-Net), a lightweight anti-spoofing defense system for voice-controlled smart IoT devices. PSA-Net processes raw audio directly using a split-transform-aggregate approach with intrinsic differentiable embeddings and incorporates cardinality to generalize across diverse voice-spoofing attacks. The proposed framework demonstrates consistent and superior performance for various attacks, outperforming many existing dedicated and unified solutions while being suitable for resource-constrained IoT environments.
audio
Published: 2024-11-26
Published at Asilomar Conference on Signals, Systems, and Computers 2024
Authors: Davide Salvi, Amit Kumar Singh Yadav, Kratika Bhagtani, Viola Negroni, Paolo Bestagini, Edward J. Delp
This paper systematically analyzes the relationship between Automatic Speech Recognition (ASR) performance and speech deepfake detection capabilities. The authors adapt pre-trained self-supervised ASR models, Whisper and Wav2Vec 2.0, as feature extractors for binary speech deepfake detection. They investigate whether improvements in ASR performance, corresponding to larger model versions, correlate with enhanced deepfake detection.
audio
Published: 2024-11-21
Authors: Noshaba N. Bhalli, Nehal Naqvi, Chloe Evered, Christine Mallinson, Vandana P. Janeja
This paper evaluates the effectiveness of training undergraduate students to improve their ability to discern audio deepfakes by listening for expert-defined linguistic features (EDLFs). Using a pre-/post-experimental design, the study assesses whether familiarizing listeners with English language variation can enhance their perceptual awareness and discernment of fake audio. The research aims to improve human discernment as a key factor in cybersecurity solutions against audio misinformation.
audio
Published: 2024-11-15
Accepted by IEEE Signal Processing Letters
Authors: Yang Xiao, Rohan Kumar Das
This paper introduces XLSR-Mamba, a novel model for spoofing attack detection that combines pre-trained wav2vec 2.0 (XLSR) features with a Dual-Column Bidirectional Mamba (DuaBiMamba) architecture. The proposed approach efficiently captures long-range temporal dependencies and fine-grained artifacts in spoofed speech, addressing the computational expense of traditional Transformers. XLSR-Mamba demonstrates competitive results and faster inference on standard and challenging deepfake datasets.
audio
Published: 2024-11-14
Authors: Kuiyuan Zhang, Zhongyun Hua, Yushu Zhang, Yifang Guo, Tao Xiang
This paper proposes a robust deepfake speech detection method that mitigates over-reliance on synthesizer artifacts by employing feature decomposition to learn synthesizer-independent content features. It introduces a dual-stream learning strategy with a synthesizer stream (for specific artifacts) and a content stream (for synthesizer-independent features via pseudo-labeling and adversarial learning). Additionally, a synthesizer feature augmentation strategy enhances the model's robustness to diverse synthesizer characteristics and feature combinations.
audio
Published: 2024-11-08
Authors: Vandana P. Janeja, Christine Mallinson
This perspective paper advocates for transdisciplinary approaches, combining Artificial Intelligence and linguistics, to address the challenge of audio deepfake detection. It highlights the limitations of current AI models, which lack a full understanding of linguistic variability and human speech complexities, hindering robust detection. The authors propose incorporating linguistic knowledge into AI methods and enhancing human discernment as promising pathways for more comprehensive deepfake detection.
audio
Published: 2024-10-31
Authors: Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, Chengzhi Mao
This paper addresses the challenge of robust deepfake audio detection by introducing DeepFakeVox-HQ, the largest public voice dataset to date, comprising 1.3 million samples. The authors propose F-SAT, a Frequency-Selective Adversarial Training method, which focuses on high-frequency components that existing detectors rely on but are vulnerable to manipulation. Their approach significantly improves detection accuracy and robustness against corruptions and adversarial attacks.
audio
Published: 2024-10-28
Accepted to ACM CCS Workshop (LAMPS) 2024
Authors: Zhisheng Zhang, Qianyi Yang, Derui Wang, Pengyang Huang, Yuxin Cao, Kai Ye, Jie Hao
This paper introduces Pivotal Objective Perturbation (POP), a proactive defense mechanism that applies imperceptible, error-minimizing noise to original speech samples. The goal of POP is to prevent state-of-the-art text-to-speech (TTS) synthesis models from effectively learning speaker voiceprints, thereby inhibiting the generation of high-quality deepfake speech. Extensive experiments demonstrate POP's outstanding effectiveness, transferability across various TTS models, and robustness against noise reduction and data augmentation techniques, significantly degrading the intelligibility of speech synthesized from protected samples.
audio
Published: 2024-10-27
6 pages, accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024
Authors: Ivan Kukanov, Janne Laakkonen, Tomi Kinnunen, Ville Hautamäki
This paper addresses the challenge of generalizing speech deepfake detection to unseen attacks, which existing methods struggle with. The authors propose using meta-learning, specifically ProtoNet and ProtoMAML, to learn attack-invariant features and adapt to novel attacks with very few samples. This approach significantly improves performance on unseen datasets, demonstrating the efficacy of few-shot adaptation for continuous system updates.
audio
Published: 2024-10-21
Authors: Zahra Khanjani, Christine Mallinson, James Foulds, Vandana P Janeja
This paper introduces ALDAS, an AI framework for the automatic labeling of linguistic features to enhance spoofed audio detection (SAD). ALDAS is trained on expert-defined linguistic features and its auto-labeled features are used to augment traditional SAD models. The findings indicate that ALDAS improves SAD performance compared to acoustic-only models, providing a scalable solution for integrating linguistic insights into detection.
audio
Published: 2024-10-13
Accepted at Interspeech 2024. Hideyuki Oiso and Yuto Matsunaga contributed equally
Authors: Hideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa
This paper proposes a prompt tuning method for Audio Deepfake Detection (ADD) to address critical challenges in test-time domain adaptation, including source-target domain gaps, limited target dataset sizes, and high computational costs. The method operates in a plug-in style, seamlessly integrating with state-of-the-art transformer models to enhance accuracy on target data. By introducing a small number of trainable parameters, it prevents overfitting on small datasets and maintains computational efficiency.
audio
Published: 2024-10-11
Authors: Chu-Hsuan Abraham Lin, Chen-Yu Liu, Samuel Yen-Chi Chen, Kuan-Cheng Chen
This paper introduces a Quantum-Trained Convolutional Neural Network (QT-CNN) framework for enhanced deepfake audio detection, leveraging a hybrid quantum-classical approach. The QT-CNN integrates Quantum Neural Networks (QNNs) with classical CNNs to optimize training efficiency and significantly reduce trainable parameters. It achieves comparable detection performance to traditional CNNs while reducing parameter count by up to 70%.
audio
Published: 2024-10-09
Authors: Georgia Channing, Juil Sock, Ronald Clark, Philip Torr, Christian Schroeder de Witt
This paper addresses the lack of explainability and poor real-world generalizability in current audio deepfake detection solutions. It introduces novel explainability methods for state-of-the-art transformer-based detectors and open-sources a new benchmark for evaluating their robustness, aiming to build trust and leverage citizen intelligence for scalable detection.
audio
Published: 2024-10-09
Accepted into ASVspoof5 workshop
Authors: Yi Zhu, Chirag Goel, Surya Koppisetti, Trang Tran, Ankur Kumar, Gaurav Bharaj
This paper presents Reality Defender's submission to the ASVspoof5 challenge, featuring the SLIM system which uses a novel pretraining strategy to enhance audio deepfake detection. SLIM employs self-supervised contrastive learning on bonafide speech to learn style-linguistics dependency embeddings. These embeddings, combined with standard SSL representations, effectively discriminate spoofed from bonafide speech while maintaining low computational cost and improving generalizability.
audio
Published: 2024-10-09
Presented at International Conference of the Biometrics Special Interest Group (BIOSIG 2024)
Authors: Anton Firc, Kamil Malinka, Petr Hanáček
This paper introduces a novel deepfake speech dataset generated using diffusion models to evaluate their impact on current deepfake detection systems. The study compares diffusion-generated deepfakes with non-diffusion ones, assessing their quality and detectability. Findings suggest that diffusion-based deepfakes are generally comparable to non-diffusion deepfakes in terms of detection, with some variability across detector architectures.
audio
Published: 2024-10-09
Authors: Hongbin Liu, Youzheng Chen, Arun Narayanan, Athula Balachandran, Pedro J. Moreno, Lun Wang
This work presents the first systematic study of active malicious attacks against state-of-the-art open-source Synthetic Speech Detectors (SSDs). It reveals that these detectors are highly vulnerable to white-box, black-box, and even transferability attacks, especially when facing synthetic audio from unseen Text-to-Speech (TTS) systems. The findings highlight an urgent need for more robust detection methods as current SSDs can be easily bypassed without significant degradation in audio quality.
audio
Published: 2024-10-06
Authors: Xiang Li, Pin-Yu Chen, Wenqi Wei
This paper introduces SONAR, a synthetic AI-Audio Detection Framework and Benchmark designed for comprehensively evaluating cutting-edge AI-synthesized auditory content. SONAR includes a novel evaluation dataset from 9 diverse audio synthesis platforms and is the first framework to uniformly benchmark AI-audio detection across traditional and foundation model-based detection systems. Through extensive experiments, the authors identify generalization limitations of existing methods and highlight the superior performance of foundation models.
audio
Published: 2024-10-01
Authors: Hashim Ali, Surya Subramani, Hafiz Malik
This paper presents a submission to the ASVspoof 5 Challenge, investigating the performance of an Audio Spoof Detection (ASD) system. It focuses on training the AASIST model using data augmentation generated through various 'laundering attacks' to enhance robustness against diverse acoustic conditions, spoofing attacks, and codec conditions. The study evaluates the system's performance on the ASVspoof 5 database.
audio
Published: 2024-09-27
IEEE Spoken Language Technology Workshop 2024
Authors: Qishan Zhang, Shuangbing Wen, Fangke Yan, Tao Hu, Jun Li
This paper introduces XWSB, a blend system for Singing Voice Deepfake Detection (SVDD) in the SVDD 2024 Challenge, achieving state-of-the-art performance. The system integrates pre-trained XLS-R and WavLM models with a Sensitive Layer Select (SLS) classifier. XWSB demonstrates advanced recognition capabilities, specifically achieving an EER of 2.32% in the CtrSVDD track.
audio
Published: 2024-09-26
Submitted to ICASSP 2025
Authors: Davide Salvi, Viola Negroni, Luca Bondi, Paolo Bestagini, Stefano Tubaro
This paper investigates optimal strategies for applying continual learning to speech deepfake detectors, specifically examining whether it's more effective to update the entire model or selectively freeze layers. The findings, validated across multiple models, reveal that updating only the initial layers responsible for processing input features is the most effective approach for maintaining generalization while adapting to new data.
audio
Published: 2024-09-24
Submitted to ICASSP 2025
Authors: Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro
This paper introduces a novel Mixture of Experts (MoE) architecture to enhance speech deepfake detection, specifically addressing the challenge of generalization to unseen data. The proposed approach leverages a lightweight gating mechanism to dynamically assign expert weights, allowing the system to specialize in different input types and efficiently handle data variability. This modular framework demonstrates superior generalization and adaptability compared to traditional single models or ensemble methods.
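The gating idea summarized above can be illustrated with a minimal PyTorch sketch; the feature dimension, expert count, and layer sizes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GatedMoEDetector(nn.Module):
    """Minimal mixture-of-experts head: a lightweight gate produces softmax
    weights that mix the scores of several expert classifiers."""

    def __init__(self, feat_dim: int = 256, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
             for _ in range(num_experts)]
        )
        # Lightweight gating network: one linear layer over the input features.
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim) utterance-level embeddings.
        weights = torch.softmax(self.gate(x), dim=-1)              # (batch, E)
        scores = torch.cat([e(x) for e in self.experts], dim=-1)   # (batch, E)
        return (weights * scores).sum(dim=-1)                      # (batch,) spoof logits

model = GatedMoEDetector()
logits = model(torch.randn(8, 256))  # toy batch of 8 embeddings
```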
audio
Published: 2024-09-24
Submitted to ICASSP 2025
Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Nitin Choudhury, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna
This paper addresses the high computational demands of environmental audio deepfake detection (EADD) due to the high dimensionality of foundation model representations. The authors propose a randomized selection strategy, showing that randomly selecting 40-50% of representation values can preserve or improve performance compared to full representations and SOTA dimensionality reduction techniques. This method significantly reduces model parameters and inference time by almost half.
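A rough sketch of the described random-selection strategy, assuming a generic foundation-model embedding and an illustrative 45% keep ratio rather than the paper's exact setup: the indices are drawn once and reused so training and inference see the same reduced representation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def make_random_selector(feat_dim: int, keep_ratio: float = 0.45) -> np.ndarray:
    """Draw a fixed random subset of feature indices once; reuse it for
    training and inference so the reduced representation stays consistent."""
    keep = rng.choice(feat_dim, size=int(feat_dim * keep_ratio), replace=False)
    return np.sort(keep)

idx = make_random_selector(feat_dim=1024)   # e.g. a 1024-d foundation-model embedding
emb = np.random.randn(16, 1024)             # toy batch of utterance embeddings
reduced = emb[:, idx]                       # roughly 45% of the original dimensions
```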
audio
Published: 2024-09-23
Journal preprint to be published at Computer Science Review
Authors: Lam Pham, Phat Lam, Dat Tran, Hieu Tang, Tin Nguyen, Alexander Schindler, Florian Skopik, Alexander Polonsky, Canh Vu
This paper provides a comprehensive survey and critical analysis of Deepfake Speech Detection (DSD), examining current challenge competitions, public datasets, and deep-learning techniques. The authors propose hypotheses based on their analysis, which are then validated through extensive experiments. They ultimately present a competitive DSD model and outline promising future research directions.
audio
Published: 2024-09-23
7 pages, to be presented at SLT 2024
Authors: Hieu-Thi Luong, Duc-Tuan Truong, Kong Aik Lee, Eng Siong Chng
This paper investigates the vulnerability of state-of-the-art deepfake speech detection systems to attacks leveraging Room Impulse Responses (RIRs) to add reverberation to fake speech, significantly increasing their evasion rate. To counteract this, the authors propose augmenting training data with large-scale synthetic or simulated RIRs. Their method significantly enhances detection robustness, improving performance on both reverberated fake speech and original samples.
audio
Published: 2024-09-21
Submitted to ICASSP 2025
Authors: Orchid Chetia Phukan, Sarthak Jain, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna
This study comprehensively investigates the effectiveness of Music Foundation Models (MFMs) and Speech Foundation Models (SFMs) for singing voice deepfake detection (SVDD). It reveals that speaker recognition SFMs, particularly x-vector, perform best individually. The paper proposes a novel fusion framework called FIONA, which leverages Centered Kernel Alignment (CKA) to synergistically combine x-vector (SFM) and MERT-v1-330M (MFM), achieving state-of-the-art SVDD performance.
audio
Published: 2024-09-20
Submitted to ICASSP 2025
Authors: Lauri Juvela, Xin Wang
This paper extends collaborative watermarking for speech synthesis to incorporate augmentation with non-differentiable traditional audio codecs and neural audio codecs. It demonstrates that codec augmentation can be reliably achieved using a waveform-domain straight-through estimator for gradient approximation. The approach significantly improves robustness against codec attacks while maintaining negligible perceptual degradation at higher bitrates.
audio
Published: 2024-09-18
IEEE OJSP. Official document lives at: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10839331
Authors: Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe
This paper introduces SpoofCeleb, a novel dataset for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV). It leverages real-world speech from VoxCeleb1, processed into a format suitable for training 23 diverse Text-To-Speech (TTS) systems, which then generate spoofing attacks. SpoofCeleb aims to provide more realistic and diverse data for robust deepfake detection and speaker verification systems, including partitioned training, validation, and evaluation sets with controlled protocols.
audio
Published: 2024-09-14
Accepted by ACM CCS 2024
Authors: Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu
This paper introduces SafeEar, a novel framework for content privacy-preserving audio deepfake detection. It achieves this by decoupling speech into semantic and acoustic information using a neural audio codec, then employing only the acoustic information for deepfake detection. SafeEar demonstrates high effectiveness in detecting various deepfake techniques while simultaneously shielding speech content from machine and human recovery attempts.
audio
Published: 2024-09-13
Accepted by IEEE SLT 2024
Authors: Jiawei Du, I-Ming Lin, I-Hsiang Chiu, Xuanjun Chen, Haibin Wu, Wenze Ren, Yu Tsao, Hung-yi Lee, Jyh-Shing Roger Jang
This paper introduces DFADD, a novel audio deepfake dataset comprising speech synthesized by advanced diffusion and flow-matching based Text-to-Speech (TTS) models. The research demonstrates that current state-of-the-art anti-spoofing models struggle to detect these highly natural deepfake audios. The DFADD dataset is proposed to address this gap, aiming to foster the development of more robust anti-spoofing countermeasures.
audio
Published: 2024-09-12
Authenticating deep generative models, 5 pages, 5 figures, 2 tables
Authors: Mayank Kumar Singh, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji
This paper introduces LOCKEY, a novel system for authenticating generative models and tracking users in white-box scenarios by integrating key-based authentication with watermarking. Users receive a unique key alongside model parameters; a valid key enables expected, watermarked output, while an invalid key triggers degraded output, thereby enforcing authentication and user ID embedding for tracking deepfakes. The approach is demonstrated effectively on audio codecs and vocoders, proving its robustness.
audio
Published: 2024-09-11
14 pages
Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
This work analyzes the resilience of the D-CAPTCHA system, designed to detect fake phone calls, against transferable imperceptible adversarial attacks. It reveals the system's vulnerabilities, particularly in its deepfake detector and task classification modules. The paper then proposes D-CAPTCHA++, a more robust version, by integrating adversarial training to significantly mitigate these vulnerabilities and enhance defense against sophisticated deepfake audio threats.
audio
Published: 2024-09-10
Authors: Ziwei Yan, Yanjie Zhao, Haoyu Wang
VoiceWukong introduces a comprehensive benchmark for deepfake voice detection, comprising 265,200 English and 148,200 Chinese deepfake samples generated by 34 diverse tools and featuring 38 manipulation variants. Evaluations of 12 state-of-the-art detectors on VoiceWukong reveal significant performance degradation compared to previous benchmarks, with the best model (AASIST2) achieving 13.50% EER. The study also compares detector performance with human perception and finds current multimodal large language models (MLLMs) lack deepfake voice detection ability.
audio
Published: 2024-09-09
Authors: Zahra Khanjani, Tolulope Ale, Jianwu Wang, Lavon Davis, Christine Mallinson, Vandana P. Janeja
This paper investigates causal relationships between human-discernible linguistic features (EDLFs) and spoofed audio labels using causal discovery and inference models. By employing an ensemble causal discovery model and causal inference, the study aims to strengthen AI models for spoofed audio detection and inform the training of humans to discern such audio. The findings highlight the utility of incorporating human knowledge into AI for improved feature selection and automation of EDLF labeling.
audio
Published: 2024-09-09
Submitted to INTERSPEECH 2024
Authors: Tuan Duy Nguyen Le, Kah Kuan Teh, Huy Dat Tran
This paper introduces a novel framework for audio deepfake detection designed for both high accuracy and efficient continuous learning on new fake data in a few-shot manner. The approach leverages a large collected dataset augmented with various distortions and employs an Audio Spectrogram Transformer (AST) for the main detection model. A continuous learning plugin module is presented to update the trained model effectively with minimal new labeled data, outperforming conventional fine-tuning.
audio
Published: 2024-09-08
Proc. The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), Kos,...
Authors: Theophile Stourbe, Victor Miara, Theo Lepage, Reda Dehak
This paper presents systems submitted to the ASVspoof 5 Challenge Track 1 for speech deepfake detection, utilizing a pre-trained WavLM as a front-end with various back-end techniques. The framework is fine-tuned using the challenge's training dataset, augmented with noise, reverberation, and codec augmentations. System fusion and score calibration with the Bosaris toolkit further enhance performance.
audio
Published: 2024-09-03
Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024
Authors: Anmol Guragain, Tianchi Liu, Zihan Pan, Hardik B. Sailor, Qiongqiong Wang
This work details an approach for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024, achieving a leading 1.79% pooled equal error rate (EER). The authors explore ensemble methods utilizing speech foundation models and introduce a novel Squeeze-and-Excitation Aggregation (SEA) method to efficiently integrate features, outperforming individual systems.
audio
Published: 2024-09-03
ASVspoof5 workshop paper
Authors: Yihao Chen, Haochen Wu, Nan Jiang, Xiang Xia, Qing Gu, Yunqi Hao, Pengfei Cai, Yu Guan, Jialong Wang, Weilin Xie, Lei Fang, Sian Fang, Yan Song, Wu Guo, Lin Liu, Minqiang Xu
This paper describes the USTC-KXDIGIT system for the ASVspoof5 Challenge, addressing speech deepfake detection (Track 1) and spoofing-robust automatic speaker verification (Track 2). The system employs extensive embedding engineering, including hand-crafted features and self-supervised model representations, coupled with data augmentation, activation ensemble, and score fusion of multiple models to enhance generalization and robustness under adversarial conditions.
audio
Published: 2024-08-30
8 pages, 2 figures, 2 tables. Accepted paper at the ASVspoof 2024 (the 25th Interspeech Conference)
Authors: Kirill Borodin, Vasiliy Kudryavtsev, Dmitrii Korzh, Alexey Efimenko, Grach Mkrtchian, Mikhail Gorodnichev, Oleg Y. Rogov
This paper proposes AASIST3, a novel architecture for speech deepfake detection that enhances the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques. AASIST3 achieves significant performance improvements, demonstrating minDCF results of 0.5357 in the closed condition and 0.1414 in the open condition, thereby improving ASV security.
audio
Published: 2024-08-28
6 pages, Accepted by 2024 IEEE Spoken Language Technology Workshop (SLT 2024)
Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, Zhiyao Duan
The SVDD 2024 Challenge was launched to advance research in detecting AI-generated singing voices, featuring two tracks: a controlled setting (CtrSVDD) and an in-the-wild scenario (WildSVDD). The challenge successfully attracted 47 submissions for CtrSVDD, with 37 teams surpassing baselines and the top team achieving a 1.65% equal error rate. This paper reviews the results, discusses key findings, and outlines future directions for singing voice deepfake detection research.
audio
Published: 2024-08-28
Accepted in ASVspoof2024 workshop
10.21437/ASVspoof.2024
Authors: Oğuzhan Kurnaz, Selim Can Demirtaş, Aykut Büker, Jagabandhu Mishra, Cemal Hanilçi
This paper introduces BTU Speech Group's parallel network-based spoofing-aware speaker verification (SASV) system for the ASVspoof5 Challenge. The system integrates ASV and CM models through embedding fusion, employing a novel parallel DNN structure that processes different input embedding combinations independently. The final SASV probability is derived by averaging scores from these parallel networks, enhancing robustness against spoofing attacks.
audio
Published: 2024-08-28
Authors: Octavian Pascu, Dan Oneata, Horia Cucu, Nicolas M. Müller
This paper demonstrates that voice deepfake attacks in the ASVspoof5 dataset can be accurately detected using a small subset of simple, interpretable openSMILE features. A threshold classifier using these features achieves EERs as low as 0.8% for specific attacks, with an overall EER of 15.7 ± 6.0%. The study also reveals that feature generalization is primarily effective between attacks from similar Text-to-Speech architectures, suggesting unique TTS system 'fingerprints' are being identified.
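The kind of single-feature threshold classifier described here can be sketched as follows; the feature values are toy data standing in for one interpretable openSMILE functional per utterance, and the EER search is a simple illustrative implementation rather than the authors' pipeline.

```python
import numpy as np

def eer_threshold(scores: np.ndarray, labels: np.ndarray):
    """Return the threshold where false-acceptance and false-rejection rates
    cross (the EER operating point) for a single scalar feature."""
    best_t, best_gap, eer = scores[0], np.inf, 1.0
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)   # spoof accepted as bonafide
        frr = np.mean(scores[labels == 1] < t)    # bonafide rejected
        if abs(far - frr) < best_gap:
            best_t, best_gap, eer = t, abs(far - frr), (far + frr) / 2
    return best_t, eer

# Toy data standing in for one openSMILE functional per utterance.
rng = np.random.default_rng(1)
feat = np.concatenate([rng.normal(1.0, 0.3, 500), rng.normal(0.0, 0.3, 500)])
label = np.concatenate([np.ones(500), np.zeros(500)])   # 1 = bonafide, 0 = spoof
threshold, eer = eer_threshold(feat, label)
print(f"threshold={threshold:.3f}, EER={eer:.3%}")
```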
audio
Published: 2024-08-27
Conference Paper
Authors: Hashim Ali, Surya Subramani, Shefali Sudhir, Raksha Varahamurthy, Hafiz Malik
This paper evaluates the robustness of seven state-of-the-art audio spoof detection approaches against various "laundering attacks." A new ASVSpoof Laundering Database is introduced, generated by applying distortions such as reverberation, additive noise, recompression, resampling, and low-pass filtering to the ASVSpoof 2019 LA eval database. The study reveals that current SOTA systems perform poorly against aggressive laundering attacks, particularly reverberation and additive noise, underscoring the urgent need for more robust detection methodologies.
audio
Published: 2024-08-26
Accepted to ICLR 2025. Project url: https://github.com/awsaf49/sonics
Authors: Md Awsafur Rahman, Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Bishmoy Paul, Shaikh Anowarul Fattah
This paper introduces SONICS, a novel large-scale dataset for end-to-end synthetic song detection, addressing the limitations of existing datasets which primarily focus on singing voice deepfake detection. It also proposes SpecTTTra, an efficient Transformer-based architecture designed to effectively capture long-range temporal dependencies in songs for improved authenticity detection. SpecTTTra outperforms conventional CNN and Transformer models in both performance and computational efficiency.
audio
Published: 2024-08-25
Accepted at ASVspoof 5 Workshop (Interspeech2024 Satellite)
Authors: Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro
This paper investigates the impact of splicing artifacts in partially fake speech signals, revealing that simple signal concatenation introduces detectable spectral leakage. The authors demonstrate that these induced artifacts allow for effective detection of partially fake audio using an untrained method and can bias existing deepfake detection models trained on such datasets. The findings highlight the complexities of generating reliable spliced audio data and provide insights for future research in this area.
audio
Published: 2024-08-23
IEEE ACCESS 2024
Authors: Zhenyu Wang, John H. L. Hansen
This paper proposes a robust synthetic audio spoofing detection system that enhances a RawNet2-based encoder with a Simple Attention Module (SimAM). The approach combines a weighted additive angular margin loss to address data imbalance and improve generalization to unseen attacks, with a meta-learning framework for learning spoofing-category-independent embeddings. Furthermore, it incorporates disentangled adversarial training using auxiliary batch normalization to exploit adversarial examples as data augmentation for improved robustness.
audio
Published: 2024-08-20
8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)
Authors: Johan Rohdin, Lin Zhang, Oldřich Plchot, Vojtěch Staněk, David Mihola, Junyi Peng, Themos Stafylakis, Dmitriy Beveraki, Anna Silnova, Jan Brukner, Lukáš Burget
This paper presents BUT's systems for the ASVspoof 5 challenge, addressing deepfake detection and spoofing-robust automatic speaker verification (SASV). For deepfake detection, they employ ResNet18 and self-supervised models, analyzing various speaker and spoofing label schemes. For SASV, they propose a generalized LLR framework with effective priors and logistic regression for joint calibration and fusion of countermeasure and ASV scores.
audio
Published: 2024-08-20
Authors: Yuankun Xie, Chenxu Xiong, Xiaopeng Wang, Zhiyong Wang, Yi Lu, Xin Qi, Ruibo Fu, Yukun Liu, Zhengqi Wen, Jianhua Tao, Guanjun Li, Long Ye
This paper investigates the effectiveness of current audio deepfake detection countermeasures (CMs) against audio language model (ALM)-based deepfake audio, which poses significant threats due to its realism and diversity. By collecting and evaluating 12 types of the latest ALM-based deepfake audio using state-of-the-art CMs, the authors find that the latest codec-trained CMs can effectively detect these deepfakes, achieving surprisingly low equal error rates.
audio
Published: 2024-08-20
accepted by ISCSLP2024
Authors: Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Yukun Liu, Guanjun Li, Xin Qi, Yi Lu, Xuefei Liu, Yongwei Li
This paper introduces a novel feature extraction method for fake audio detection that utilizes color quantization on spectral image-like inputs. The approach constrains reconstruction to a limited number of colors, generating discriminative features that intuitively highlight differences between genuine and fake audio. Experiments on the ASVspoof2019 dataset demonstrate improved classification performance over using original spectral inputs, with additional benefits from pretraining the recolor network.
audio
Published: 2024-08-19
This paper was accepted at ASVspoof Workshop 2024
Authors: Juan M. Martín-Doñas, Eros Roselló, Angel M. Gomez, Aitor Álvarez, Iván López-Espejo, Antonio M. Peinado
This paper details the ASASVIcomtech team's participation in the ASVspoof5 Challenge, addressing both speech deepfake detection (Track 1) and spoofing-aware speaker verification (Track 2). While a closed-condition system for Track 1 yielded unsatisfactory results, the team achieved very competitive performance in open-condition settings for both tracks through an ensemble system leveraging self-supervised models and augmented training data.
audio
Published: 2024-08-19
8 pages, 2 figures, ASVspoof 5 Workshop (Interspeech2024 Satellite)
Authors: Yuxiong Xu, Jiafeng Zhong, Sengui Zheng, Zefeng Liu, Bin Li
This paper introduces the SZU-AFS anti-spoofing system for the ASVspoof 5 Challenge Track 1, which employs a four-stage approach: baseline model selection, data augmentation (DA) for fine-tuning, gradient norm aware minimization (GAM) for secondary fine-tuning, and score-level fusion. The system leverages a Wav2Vec2 feature extractor and an AASIST classifier, enhanced by various DA policies and GAM-based co-enhancement. The final fused system achieved a minDCF of 0.115 and an EER of 4.04% on the evaluation set.
audio
Published: 2024-08-17
Accepted at ASVspoof Workshop 2024
Authors: Massimiliano Todisco, Michele Panariello, Xin Wang, Héctor Delgado, Kong Aik Lee, Nicholas Evans
This paper introduces Malacopula, a neural-based generalized Hammerstein model designed to create adversarial perturbations that enhance the effectiveness of spoofing attacks against Automatic Speaker Verification (ASV) systems. The model modifies speech utterances using non-linear processes to minimize the cosine distance between speaker embeddings of spoofed and bona fide speech. Experiments show Malacopula substantially increases ASV system vulnerabilities, though it reduces speech quality and the attacks are detectable under controlled conditions.
audio
Published: 2024-08-16
8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)
Authors: Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi
ASVspoof 5 is the latest challenge promoting research in speech spoofing and deepfake detection, introducing a novel, large-scale crowdsourced database with diverse, optimized attacks, including adversarial ones. The challenge defines two tracks—stand-alone detection and spoofing-robust speaker verification—alongside new evaluation metrics, baselines, and a platform. The paper summarizes the challenge setup and reports that attacks significantly compromise baseline systems, while participant submissions demonstrate substantial improvements.
audio
Published: 2024-08-14
Accepted at ASVspoof Workshop 2024
Authors: David Combei, Adriana Stan, Dan Oneata, Horia Cucu
This paper addresses audio deepfake detection for the ASVspoof5 challenge by benchmarking and finetuning self-supervised models. The authors found WavLM to be superior among models pre-trained on allowed datasets. Their final approach employs a late fusion ensemble of four WavLM models, achieving equal error rates of 6.56% and 17.08% on the two evaluation sets.
audio
Published: 2024-08-13
Authors: Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Haonan Cheng, Long Ye
This paper addresses open-domain audio deepfake detection for the ASVspoof5 Track1 challenge by investigating data expansion, data augmentation, and self-supervised learning (SSL) features. They introduce Frequency Mask, a data augmentation method, to counter high-frequency gaps characteristic of the ASVspoof5 dataset. Combining temporal information from various scales with multiple SSL features through score fusion, their approach achieved a minDCF of 0.0158 and an EER of 0.55% on the ASVspoof5 Track 1 evaluation progress set.
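The Frequency Mask augmentation is reminiscent of SpecAugment-style frequency masking; a minimal sketch under that assumption follows, with the band width and spectrogram shape chosen only for illustration.

```python
import numpy as np

def frequency_mask(spec: np.ndarray, max_band: int = 20, rng=None) -> np.ndarray:
    """Zero out a random contiguous band of frequency bins in a (freq, time)
    spectrogram, so the model cannot rely on dataset-specific band gaps."""
    rng = rng or np.random.default_rng()
    n_freq = spec.shape[0]
    width = int(rng.integers(1, max_band + 1))
    start = int(rng.integers(0, n_freq - width + 1))
    out = spec.copy()
    out[start:start + width, :] = 0.0
    return out

spec = np.abs(np.random.randn(257, 400))   # toy magnitude spectrogram
augmented = frequency_mask(spec)
```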
audio
Published: 2024-08-09
This work has been submitted to the IEEE for possible publication
Authors: Jiangyan Yi, Chu Yuan Zhang, Jianhua Tao, Chenglong Wang, Xinrui Yan, Yong Ren, Hao Gu, Junzuo Zhou
This paper introduces the ADD 2023 challenge, which aims to advance audio deepfake detection beyond binary classification by emulating real-world scenarios like identifying manipulated intervals and attributing deepfake sources. It describes the challenge's comprehensive dataset designed for fake audio generation, detection, manipulation region location, and deepfake algorithm recognition tasks. The paper also analyzes the technical methodologies of top-performing participants, highlighting commonalities, differences, current limitations, and future research directions in the field.
audio
Published: 2024-07-26
Authors: Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj
This paper introduces SLIM (Style-LInguistics Mismatch), a novel model for generalized audio deepfake detection that addresses generalization and interpretability challenges. SLIM learns the style-linguistics dependency from only real speech samples via self-supervised pretraining. It then uses these learned dependency features, complemented by standard acoustic features, to classify real versus fake speech, yielding superior out-of-domain performance and providing explainable decisions by quantifying the mismatch.
audio
Published: 2024-07-15
Accepted by ACM MM 2024
Authors: Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li
This paper introduces GROOT, a novel generative robust audio watermarking method for proactively supervising diffusion-model-based synthesized audio. GROOT embeds watermarks during the audio synthesis process by integrating a dedicated encoder and decoder with parameter-fixed diffusion models. The method demonstrates superior robustness against individual and compound post-processing attacks, outperforming state-of-the-art techniques while maintaining high audio fidelity and capacity.
audio
Published: 2024-07-14
Submitted to IEEE Tencon. 5 pages
Authors: Feiyi Dong, Qingchen Tang, Yichen Bai, Zihan Wang
This paper introduces Continual Audio Defense Enhancer (CADE), a novel continual learning method designed to enhance robust deepfake audio detection against emerging spoofing attacks. CADE integrates a replay-based strategy with a fixed memory size, two distillation losses, and a novel multi-layer embedding similarity loss to mitigate catastrophic forgetting. Experiments on the ASVspoof2019 dataset demonstrate CADE's superior performance compared to baseline methods.
audio
Published: 2024-07-11
To be published at ISMIR 2024
Authors: Dorian Desblancs, Gabriel Meseguer-Brocal, Romain Hennequin, Manuel Moussallam
This paper investigates the use of singer identification methods to detect the original singer in synthetic voices, addressing concerns about personality rights in the music industry. It proposes three embedding models trained with a singer-level contrastive learning scheme using mixtures, vocals, or both. While the models effectively identify real singers, their performance significantly deteriorates when classifying cloned versions of singers, particularly for models trained on mixtures, highlighting biases in current singer identification systems.
audio
Published: 2024-07-10
Accepted by INTERSPEECH 2024
Authors: Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury
This paper introduces a system to classify various spoofing attributes (input type, acoustic model, vocoder) of audio deepfake generation systems, moving beyond simple fake/genuine detection. The system aims to identify the specific techniques used in the deepfake creation pipeline to enable better generalization to unseen spoofing algorithms. It is evaluated on two datasets, demonstrating its robustness in identifying these different deepfake generation attributes.
audio
Published: 2024-07-10
Accepted in EUSIPCO 2024
Authors: Marcella Astrid, Enjie Ghorbel, Djamila Aouada
The paper introduces a novel data augmentation method for audio deepfake detection that generates "pseudo-fakes" by adversarially perturbing real audio data. This perturbation targets the decision boundary of the model by aiming for ambiguous predictions (half real, half fake), thereby enhancing the generalization capabilities of deepfake detectors to unseen manipulations.
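A minimal sketch of the general idea, assuming a detector that returns logits and using a single FGSM-style step that pushes real audio toward an ambiguous 0.5 prediction; the step size and loss choice are illustrative, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def make_pseudo_fake(detector, wav: torch.Tensor, epsilon: float = 1e-3) -> torch.Tensor:
    """One gradient step that perturbs real audio toward an ambiguous
    (p ~= 0.5) prediction, yielding a boundary-adjacent 'pseudo-fake'."""
    wav = wav.clone().detach().requires_grad_(True)
    prob = torch.sigmoid(detector(wav))               # assumed: detector returns logits
    loss = F.mse_loss(prob, torch.full_like(prob, 0.5))
    loss.backward()
    with torch.no_grad():
        return (wav - epsilon * wav.grad.sign()).detach()

# Usage with a toy detector standing in for a real one:
detector = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16000, 1))
pseudo_fake = make_pseudo_fake(detector, torch.randn(4, 16000))
```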
audio
Published: 2024-07-08
Authors: Zhenchun Lei, Hui Yan, Changhong Liu, Minglei Ma, Yingen Yang
This paper proposes two-path GMM-ResNet and GMM-SENet models for ASV spoofing detection to address the limitations of traditional GMM classifiers that ignore component score distributions and inter-frame correlations. The models utilize Gaussian probability features derived from separate GMMs for genuine and spoofed speech, processed by deep learning architectures with a two-step training scheme. Experiments on ASVspoof 2019 demonstrate significant performance improvements over the GMM baseline, achieving competitive results after score fusion.
audio
Published: 2024-07-03
Proc. INTERSPEECH 2023
Authors: Chirag Goel, Surya Koppisetti, Ben Colman, Ali Shahriyari, Gaurav Bharaj
This paper introduces Vision Transformers (ViTs) for audio spoof detection by proposing a novel attention-based contrastive learning framework called SSAST-CL. The framework utilizes a two-stage Siamese training approach with a cross-attention branch and a custom contrastive loss to learn discriminative representations for bonafide and spoof classes. This approach, combined with appropriate data augmentations, achieves competitive performance on the ASVSpoof 2021 challenge, significantly outperforming a vanilla ViT fine-tuning baseline.
audio
Published: 2024-07-02
Authors: Zhenchun Lei, Hui Yan, Changhong Liu, Yong Zhou, Minglei Ma
This paper introduces GMM-ResNet2, an improved deep learning model for synthetic speech detection. It enhances the previous GMM-ResNet with four key improvements: multi-scale Log Gaussian Probability features from multiple GMMs, a grouping technique with ensemble averaging, an improved residual block, and an ensemble-aware loss function. The GMM-ResNet2 achieves competitive performance on ASVspoof 2019 LA, ASVspoof 2021 LA, and DF tasks.
audio
Published: 2024-07-01
Authors: Lam Pham, Phat Lam, Truong Nguyen, Huyen Nguyen, Alexander Schindler
This paper proposes a deep learning system for deepfake audio detection leveraging various spectrogram-based features and an ensemble of deep learning models. It explores three main approaches: training baseline CNN/RNN models directly on spectrograms, transfer learning from computer vision models, and using embeddings from state-of-the-art audio pre-trained models. The system achieves a highly competitive Equal Error Rate (EER) of 0.03 on the ASVspoof 2019 dataset by fusing high-performing models and selective spectrograms.
audio
Published: 2024-07-01
5 pages, 4 figures, Proc. INTERSPEECH 2024
Authors: Oguzhan Baser, Kaan Kale, Sandeep P. Chinchali
SecureSpectra introduces a defense mechanism against deepfake audio threats by embedding orthogonal, irreversible signatures within the high-frequency content of audio. It leverages the empirical observation that deepfake models struggle to replicate high-frequency content. The system integrates differential privacy to protect signatures from reverse engineering, achieving superior detection performance compared to existing methods.
audio
Published: 2024-06-27
Authors: Borodin Kirill Nikolayevich, Kudryavtsev Vasiliy Dmitrievich, Mkrtchian Grach Maratovich, Gorodnichev Mikhail Genadievich, Korzh Dmitrii Sergeevich
This paper presents an Automatic Speaker Verification (ASV) system designed to extract speaker embeddings, capturing characteristics like pitch, energy, and phoneme duration. While intended for a multi-voice TTS pipeline, the system was primarily evaluated for identifying original speakers in voice-converted audio within the SSTC challenge. It demonstrated an Equal Error Rate (EER) of 20.669% in this deepfake detection task.
audio
Published: 2024-06-25
Accepted by INTERSPEECH 2024
Authors: Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng
This paper introduces a Temporal-Channel Modeling (TCM) module to enhance the Multi-head Self-Attention (MHSA) mechanism in Transformer-based synthetic speech detectors. The TCM module addresses MHSA's neglect of temporal-channel dependencies by integrating channel representation head tokens with temporal input tokens. With only 0.03M additional parameters, the proposed module significantly improves the performance of state-of-the-art systems on the ASVspoof 2021 dataset, demonstrating the importance of modeling temporal-channel interactions for synthetic speech detection.
audio
Published: 2024-06-24
Accepted by Interspeech 2024
Authors: Hyun Myung Kim, Kangwook Jang, Hoirin Kim
This paper introduces Adaptive Centroid Shift (ACS), a novel method for one-class learning in audio deepfake detection. ACS continuously updates a bonafide centroid using only genuine speech samples, creating a tightly clustered representation for authentic audio while pushing spoofed audio further away. This approach significantly enhances the model's generalization ability against unseen deepfake attacks.
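A minimal sketch of the centroid idea, assuming fixed-dimensional utterance embeddings and a running-mean update; the paper's exact update rule and training loss are not reproduced here.

```python
import torch
import torch.nn.functional as F

class BonafideCentroid:
    """Keep a running centroid of bonafide embeddings only; score utterances
    by their cosine distance to that centroid (larger = more spoof-like)."""

    def __init__(self, dim: int):
        self.centroid = torch.zeros(dim)
        self.count = 0

    def update(self, bonafide_emb: torch.Tensor) -> None:
        # bonafide_emb: (batch, dim), genuine samples only.
        batch_sum = bonafide_emb.sum(dim=0)
        self.count += bonafide_emb.shape[0]
        self.centroid += (batch_sum - bonafide_emb.shape[0] * self.centroid) / self.count

    def score(self, emb: torch.Tensor) -> torch.Tensor:
        return 1.0 - F.cosine_similarity(emb, self.centroid.unsqueeze(0), dim=-1)

tracker = BonafideCentroid(dim=192)
tracker.update(torch.randn(32, 192))           # a batch of genuine embeddings
spoof_scores = tracker.score(torch.randn(8, 192))
```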
audio
Published: 2024-06-14
Accepted by Interspeech 2024
Authors: Cunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv
This paper addresses the challenging Fake Speech Detection (FSD) task in telephony scenarios by proposing a novel data augmentation (DA) method called Frequency-mix (Freqmix) and integrating it into a Freqmix knowledge distillation (FKD) framework. FKD uses Freqmix-enhanced data as input for a teacher model and a time-domain DA for the student model, employing multi-level feature distillation to restore information and improve generalization. The approach achieves state-of-the-art results on the ASVspoof 2021 LA dataset, showing a 31% improvement over the baseline, and performs competitively on the ASVspoof 2021 DF dataset.
audio
Published: 2024-06-12
Accepted by INTERSPEECH 2024. arXiv admin note: substantial text overlap with arXiv:2405.04880
Authors: Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi
This paper addresses the challenge of detecting Large Language Model (LLM)-based deepfake audio, which relies on neural codecs rather than traditional vocoders, rendering existing detection methods ineffective. The authors propose the Codecfake dataset, generated using seven representative neural codec methods, to facilitate the development of detection models for this new type of audio. Experiments demonstrate that models trained on the Codecfake dataset achieve a 41.406% reduction in average equal error rate compared to vocoder-trained models on LLM-based deepfake audio.
audio
Published: 2024-06-12
Accepted by INTERSPEECH 2024
Authors: Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu
This paper introduces the new task of deepfake general audio detection, aiming to identify manipulated audio content and locate deepfake regions. The authors propose an automated manipulation pipeline to create FakeSound, a novel dataset for this task. They also present a benchmark deepfake detection model utilizing a general audio pre-trained model, demonstrating its superior performance over state-of-the-art deepfake speech detection models and human evaluators.
audio
Published: 2024-06-12
Authors: Zihan Pan, Tianchi Liu, Hardik B. Sailor, Qiongqiong Wang
This paper investigates the multi-layer behavior of the WavLM self-supervised learning model for anti-spoofing detection and proposes an attentive merging method to leverage its hierarchical hidden embeddings. The approach demonstrates the feasibility of fine-tuning WavLM to achieve state-of-the-art Equal Error Rates (EERs) on ASVspoof datasets. A key finding is that early hidden transformer layers contribute significantly, allowing for computational efficiency by using only a partial pre-trained model.
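One common way to combine layer-wise SSL outputs, and a plausible reading of the attentive merging described above, is a learnable softmax over layers; the sketch below assumes stacked hidden states and is not the paper's exact module.

```python
import torch
import torch.nn as nn

class AttentiveLayerMerge(nn.Module):
    """Merge hidden states from several transformer layers with learnable
    softmax weights, so informative (often early) layers can dominate."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, dim), e.g. stacked WavLM outputs.
        w = torch.softmax(self.layer_logits, dim=0)
        return torch.einsum("l,lbtd->btd", w, hidden_states)

merge = AttentiveLayerMerge(num_layers=13)
merged = merge(torch.randn(13, 2, 200, 768))   # toy stack of 13 layer outputs
```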
audio
Published: 2024-06-11
Accepted to Interspeech 2024, project page: https://codecfake.github.io/
Authors: Haibin Wu, Yuan Tseng, Hung-yi Lee
This paper introduces CodecFake, the first dataset specifically designed for detecting deepfake audios generated by contemporary codec-based speech synthesis systems. The authors demonstrate that current state-of-the-art anti-spoofing models trained on traditional datasets are largely ineffective against these new deepfakes. However, training with the proposed CodecFake dataset significantly enhances these models' detection capabilities.
audio
Published: 2024-06-10
Accepted by Interspeech 2024
Authors: Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan
This paper introduces RawBMamba, an end-to-end bidirectional state space model for audio deepfake detection that captures both short- and long-range discriminative information. It addresses the unidirectional limitation of Mamba by designing a bidirectional Mamba and a bidirectional fusion module to enhance audio context representation. RawBMamba demonstrates significant performance improvements on the ASVspoof2021 LA dataset and competitive results on other datasets, proving its effectiveness and generalizability.
audio
Published: 2024-06-05
Accepted by INTERSPEECH 2024
Authors: Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonnan Cheng, Long Ye, Jianhua Tao
This paper proposes the Real Emphasis and Fake Dispersion (REFD) strategy for audio deepfake algorithm recognition, focusing on the challenging task of identifying novel, out-of-distribution (OOD) deepfake algorithms. REFD is a dual-stage approach that effectively discriminates in-distribution samples while identifying OOD ones. It introduces Novel Similarity Detection (NSD), a new OOD method considering both feature and logits scores to achieve state-of-the-art performance.
audio
Published: 2024-06-05
accepted by INTERSPEECH2024
Authors: Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Yi Lu, Xin Qi, Shuchen Shi
This paper introduces a Sample Weight Learning (SWL) module within a stable learning framework to enhance the generalization of fake audio detection models against distribution shifts. The SWL module operates as a plug-in, decorrelating selected features by learning sample weights from training data, thereby simplifying the training process without requiring additional data. Experiments on ASVspoof datasets demonstrate SWL's effectiveness in improving the generalization of various base models across different data distributions.
audio
Published: 2024-06-05
Interspeech 2024
Authors: Nicolas M. Müller, Nicholas Evans, Hemlata Tak, Philip Sperl, Konstantin Böttinger
This paper investigates why audio deepfake detection models generalize poorly to unseen deepfakes by decomposing the performance gap into 'hardness' and 'difference' components. Experiments using ASVspoof datasets indicate that the performance drop is predominantly due to the 'difference' in deepfake characteristics rather than increased 'hardness'. The findings suggest that merely increasing model capacity is insufficient for improving generalization and research should instead focus on understanding and addressing these inherent differences.
audio
Published: 2024-06-05
Accepted by Interspeech 2024; Our code is available at https://github.com/xjchenGit/SingGraph.git
Authors: Xuanjun Chen, Haibin Wu, Jyh-Shing Roger Jang, Hung-yi Lee
This paper introduces SingGraph, a novel model designed to detect singing voice deepfakes (SingFake), addressing the limitations of existing speech deepfake detectors in this unique domain. SingGraph integrates the MERT model for pitch and rhythm analysis with the wav2vec2.0 model for linguistic analysis, complemented by RawBoost and beat matching for data augmentation. The proposed method achieves new state-of-the-art results on the SingFake dataset, significantly improving EER across various scenarios including seen and unseen singers and different codecs.
audio
Published: 2024-06-04
Accepted by Interspeech 2024
Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan
This paper introduces CtrSVDD, a large-scale and diverse benchmark dataset for controlled singing voice deepfake detection (SVDD), containing 307.98 hours of bonafide and deepfake singing vocals synthesized using 14 state-of-the-art methods. The authors also present and evaluate a versatile baseline system with flexible front-end features against a structured train/dev/eval split of CtrSVDD. The study highlights the importance of feature selection and the current limitations in generalization to unseen deepfake methods.
audio
Published: 2024-06-04
5 pages, 4 figures
Authors: Renmingyue Du, Jixun Yao, Qiuqiang Kong, Yin Cao
This study proposes a reconstruction-based approach for out-of-distribution (OOD) detection in vocoder recognition, addressing limitations of probability-score or classified-distance methods. It employs an autoencoder where acoustic features, extracted by a pre-trained WavLM model, are reconstructed by decoders specific to vocoder classes. Samples are classified as OOD if none of the decoders can satisfactorily reconstruct their features, with contrastive learning and an auxiliary classifier enhancing distinctiveness.
audio
Published: 2024-05-14
Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao
This paper introduces EVDA, a novel benchmark designed to evaluate continual learning methods for robust audio deepfake detection. It addresses the growing challenge posed by advanced large language models generating evolving synthetic speech, where traditional methods struggle with catastrophic forgetting. EVDA includes a diverse set of classic and newly generated deepfake audio datasets and supports various continual learning techniques to foster the development of adaptable detection algorithms.
audio
Published: 2024-05-08
Evaluation plan of the SVDD Challenge @ SLT 2024
Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan
This paper introduces the SVDD Challenge 2024, the first research challenge focused on detecting deepfake singing voices. It addresses the unique challenges of singing voice deepfake detection (SVDD) compared to spoken voice, due to its musical nature and background music. The challenge aims to advance SVDD research by providing a platform for developing and evaluating systems on both lab-controlled and in-the-wild bonafide and deepfake singing voice recordings.
audio
Published: 2024-05-08
Authors: Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun
This paper addresses the urgent need for generalized detection of Audio Language Model (ALM) based deepfake audio by introducing the Codecfake dataset, a large-scale collection of over 1 million neural codec-generated audio samples. It proposes the Co-training Sharpness-Aware Minimization (CSAM) strategy to achieve universal deepfake audio detection by learning a domain-balanced and generalized minima. Experiments demonstrate that models trained with Codecfake effectively detect ALM-based audio, and the CSAM countermeasure yields a state-of-the-art average equal error rate (EER) of 0.616% across diverse test conditions.
audio
Published: 2024-05-07
Under review
Authors: Darius Afchar, Gabriel Meseguer-Brocal, Romain Hennequin
This paper presents the first study on detecting music deepfakes, demonstrating that simple convolutional classifiers can achieve high accuracy (up to 99.8%) in distinguishing real music from artificially generated content using various neural codecs. However, the authors emphasize that high performance scores alone are insufficient for reliable deployment, highlighting critical issues like robustness to audio manipulation, generalization to unseen generative models, calibration, and the need for interpretability and recourse.
audio
Published: 2024-05-03
Authors: Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
This paper addresses the generalization issue of audio deepfake detectors by proposing a training-free approach that leverages large-scale pre-trained models. The detection problem is reframed as a speaker verification task, where fake audios are identified by a mismatch between the test sample and the claimed identity's voice. This method requires no training on fake speech samples, thus ensuring full generalization ability on out-of-distribution data.
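A minimal sketch of the speaker-verification reframing, assuming embeddings from some pre-trained speaker encoder and an illustrative similarity threshold: no fake data is needed, only enrollment audio of the claimed identity.

```python
import numpy as np

def is_fake(test_emb: np.ndarray, enroll_embs: np.ndarray, threshold: float = 0.6) -> bool:
    """Flag the utterance as fake when it does not match the claimed speaker's
    enrollment embeddings closely enough (training-free with respect to fakes)."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = unit(enroll_embs) @ unit(test_emb)
    return float(sims.max()) < threshold

# Toy 256-d embeddings standing in for a pre-trained speaker encoder's output.
enroll = np.random.randn(5, 256)   # five genuine utterances of the claimed identity
test = np.random.randn(256)
print(is_fake(test, enroll))
```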
audio
Published: 2024-04-26
Authors: Mingrui He, Longting Xu, Han Wang, Mingjun Zhang, Rohan Kumar Das
This paper introduces novel graph domain features, GFDCC, GFLC, and GFLDC, for detecting replay speech attacks by incorporating logarithmic processing and device-related linear transformations derived from the graph frequency domain. These features are evaluated with GMM and LCNN classifiers, demonstrating superior performance against existing front-ends on ASVspoof 2017 V2, ASVspoof 2019 PA, and ASVspoof 2021 PA datasets. The approach effectively captures device and environmental noise effects, which are crucial for robust replay speech detection.
audio
Published: 2024-04-24
Submitted to IEEE TDSC
Authors: Haolin Wu, Jing Chen, Ruiying Du, Cong Wu, Kun He, Xingcan Shang, Hao Ren, Guowen Xu
This paper investigates the vulnerability of audio deepfake detection systems to manipulation attacks, revealing that simple manipulations can significantly bypass existing detectors. To address this, the authors propose CLAD (Contrastive Learning-based Audio deepfake Detector), which employs contrastive learning to minimize manipulation-induced variations and a length loss to enhance the clustering of real audios in the feature space. CLAD demonstrates significantly improved robustness, consistently maintaining a low False Acceptance Rate (FAR) against various manipulation attacks.
audio
Published: 2024-04-23
Submitted to ACM journal -- Digital Threats: Research and Practice
Authors: Seth Layton, Thiago De Andrade, Daniel Olszewski, Kevin Warren, Kevin Butler, Patrick Traynor
This paper proposes using breath as a high-level feature for deepfake speech detection, hypothesizing that current synthetic speech lacks natural breathing patterns. They develop a breath detector and leverage breath-related statistics from a custom dataset of in-the-wild online news audio to discriminate between real and deepfake speech. Their simple breath-based detector achieves perfect classification (1.0 AUPRC and 0.0 EER) on test data, outperforming the state-of-the-art SSL-wav2vec model.
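A rough sketch of how breath-related statistics could be thresholded, assuming breath onset times have already been produced by a separate breath detector; the features and rule below are illustrative only, not the authors' detector.

```python
import numpy as np

def breath_features(breath_times_s: np.ndarray, duration_s: float) -> dict:
    """Summarize breathing behaviour from detected breath onsets (in seconds).
    Synthetic speech often shows too few breaths or implausibly regular gaps."""
    gaps = np.diff(breath_times_s) if breath_times_s.size > 1 else np.array([duration_s])
    return {
        "breaths_per_min": 60.0 * breath_times_s.size / duration_s,
        "mean_gap_s": float(gaps.mean()),
        "gap_std_s": float(gaps.std()),
    }

def looks_synthetic(feats: dict, min_bpm: float = 2.0) -> bool:
    # Illustrative rule only: too few breath events over the clip.
    return feats["breaths_per_min"] < min_bpm

feats = breath_features(np.array([3.1, 11.8, 20.4, 29.9]), duration_s=60.0)
print(feats, looks_synthetic(feats))
```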
audio
Published: 2024-04-22
38 pages. This paper has been accepted by ACM Computing Surveys
Authors: Menglu Li, Yasaman Ahmadiadli, Xiao-Ping Zhang
This survey provides a comprehensive analysis of over 200 papers on speech deepfake detection published up to March 2024. It systematically reviews each component of the detection pipeline, including model architectures, optimization techniques, datasets, and evaluation metrics. The paper assesses recent progress, discusses ongoing challenges, explores emerging topics like partial deepfake detection and adversarial defenses, and suggests promising future research directions.
audio
Published: 2024-04-22
Accepted by the 2024 International Conference on Multimedia Retrieval (ICMR 2024)
Authors: Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang
This paper proposes a Retrieval-Augmented Detection (RAD) framework to enhance audio deepfake detection by augmenting test samples with similar retrieved samples. Inspired by Retrieval-Augmented Generation (RAG), RAD integrates with an extended multi-fusion attentive classifier. The framework achieves state-of-the-art results on the ASVspoof 2021 DF set and competitive performance on ASVspoof 2019 and 2021 LA sets.
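A minimal sketch of the retrieval-augmentation step, assuming a store of labelled reference embeddings and cosine-similarity retrieval; the multi-fusion attentive classifier itself is omitted and the concatenation is only one plausible fusion choice.

```python
import numpy as np

def retrieve_neighbors(query: np.ndarray, store: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the k reference embeddings most similar to the query (cosine)."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = unit(store) @ unit(query)
    return store[np.argsort(sims)[-k:]]

def augmented_input(query: np.ndarray, store: np.ndarray, k: int = 5) -> np.ndarray:
    """Concatenate the query with the mean of its retrieved neighbours, which a
    downstream classifier consumes instead of the raw query alone."""
    neighbors = retrieve_neighbors(query, store, k)
    return np.concatenate([query, neighbors.mean(axis=0)])

store = np.random.randn(1000, 192)      # toy reference store of labelled embeddings
x = np.random.randn(192)
clf_input = augmented_input(x, store)   # 384-d vector for the fusion classifier
```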
audio
Published: 2024-04-19
Authors: Mohammed Yousif, Jonat John Mathew, Huzaifa Pallan, Agamjeet Singh Padda, Syed Daniyal Shah, Sara Adamski, Madhu Reddiboina, Arjun Pankajakshan
This paper addresses the significant challenge of generalization in audio deepfake detection, where models struggle with deepfakes from unknown algorithms. The authors propose a neural collapse-based sampling approach to create an efficient new training database from diverse pre-trained models. This method demonstrates comparable generalization on unseen data with reduced computational costs and less training data.
audio
Published: 2024-04-07
Authors: Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang
This paper addresses the issue of outdated datasets in audio deepfake detection (ADD) by constructing a new cross-domain ADD dataset (CD-ADD) comprising over 300 hours of speech generated by five advanced zero-shot TTS models. It demonstrates that pre-trained speech encoders, like Wav2Vec2-large and Whisper-medium, achieve strong detection performance through novel attack-augmented training and exhibit outstanding few-shot ADD ability, though neural codec compression remains a significant challenge.
audio
Published: 2024-03-31
Accepted to NAACL (Findings) 2024
Authors: Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma
This work investigates the effectiveness of multilingual speech Pre-Trained Models (PTMs) for Audio Deepfake Detection (ADD), hypothesizing their robustness due to diverse pre-training. The study evaluates representations from various PTMs on benchmark datasets and proposes a fusion framework called MiO (Merge into One). MiO achieves state-of-the-art performance on ASVSpoof 2019 and In-the-Wild datasets, validating the hypothesis that multilingual PTMs are highly effective for ADD.
audio
Published: 2024-03-26
Authors: Hafsa Ouajdi, Oussama Hadder, Modan Tailleur, Mathieu Lagrange, Laurie M. Heller
This paper proposes a simple and efficient pipeline for detecting deepfake environmental audio, an area less explored than fake speech detection. The method utilizes the CLAP audio embedding and a multi-layer perceptron for binary classification. Experiments on the 2023 DCASE challenge data demonstrate high detection accuracy for various synthesized environmental sounds.
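The pipeline reduces to an audio embedding plus a small classifier. A minimal sketch of the MLP head, assuming CLAP embeddings (e.g. 512-dimensional) have already been computed with a frozen encoder:

```python
import torch
import torch.nn as nn

class ClapDeepfakeHead(nn.Module):
    """Tiny MLP over a precomputed CLAP audio embedding; the CLAP encoder
    itself is assumed frozen and external to this module."""
    def __init__(self, emb_dim=512, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, clap_embedding):                 # (B, emb_dim)
        return self.net(clap_embedding).squeeze(-1)    # logit > 0 => fake
```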
audio
Published: 2024-03-21
This manuscript is under review in a conference
Authors: Subhajit Saha, Md Sahidullah, Swagatam Das
This study introduces a novel 'Green AI' framework for audio deepfake detection, focusing on minimizing the carbon footprint by enabling CPU-only training. It leverages off-the-shelf pre-trained self-supervised learning (SSL) models for feature extraction without fine-tuning, combined with classical machine learning algorithms for the downstream detection task. The approach demonstrates competitive performance with significantly fewer trainable parameters compared to high-carbon footprint deep neural network methods.
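The CPU-only recipe boils down to frozen SSL features plus a classical classifier. A sketch with scikit-learn, assuming utterance-level embeddings have been mean-pooled from an off-the-shelf SSL model beforehand (the classifier choice here is illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_cpu_detector(X, y):
    """X: (N, D) frozen SSL embeddings, y: 0 = bona fide, 1 = spoof.
    No deep-network fine-tuning, so the whole pipeline runs on CPU."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    clf.fit(X, y)
    return clf
```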
audio
Published: 2024-03-18
Authors: Jonat John Mathew, Rakin Ahsan, Sae Furukawa, Jagdish Gautham Krishna Kumar, Huzaifa Pallan, Agamjeet Singh Padda, Sara Adamski, Madhu Reddiboina, Arjun Pankajakshan
This study investigates the feasibility of deploying static deepfake audio detection models in real-time communication platforms. It implements ResNet and LCNN models, training them on the ASVspoof 2019 dataset, and develops cross-platform software to assess their real-time performance in actual communication scenarios. The work highlights challenges for static models in dynamic real-time environments and proposes future strategies for enhancement.
audio
Published: 2024-03-04
5 pages, 2 figures
Authors: Yujie Yang, Haochen Qin, Hang Zhou, Chengcheng Wang, Tianyu Guo, Kai Han, Yunhe Wang
This paper proposes a robust audio deepfake detection (ADD) system by exploiting a broad range of audio features, including both handcrafted and learning-based types. It demonstrates that learning-based features, especially those pretrained on large datasets, offer superior generalizability for out-of-domain scenarios. The system further enhances generalizability through proposed multi-feature approaches like feature selection and feature fusion.
audio
Published: 2024-03-02
https://github.com/shihkuanglee/ADFA
Authors: Lee Shih Kuang
This study introduces novel signal analysis methods: Arbitrary Analysis (AA), Mel Scale Analysis (MA), and Constant Q Analysis (CQA) for replay speech detection in automatic speaker verification (ASV) systems. Inspired by the Fourier inversion formula, these methods offer new perspectives by using alternative sinusoidal sequence groups. They demonstrate superior efficacy and/or efficiency compared to conventional methods on ASVspoof 2019 & 2021 PA databases, especially when integrated with the Temporal Autocorrelation of Speech (TAC) feature.
audio
Published: 2024-02-28
To appear in ASIA CCS 2025. Human Instrument, Code and Dataset at https://govindm.me/pitch
Authors: Govind Mittal, Arthur Jakobsson, Kelly O. Marshall, Chinmay Hegde, Nasir Memon
This paper introduces PITCH, a robust challenge-response method designed to detect and tag interactive real-time deepfake audio calls. It proposes a comprehensive taxonomy of audio challenges, which, when applied to a novel dataset, significantly boosts machine detection capabilities to an 88.7% AUROC score. Furthermore, PITCH integrates a human-AI collaborative system that achieves an 84.5% detection accuracy, leveraging complementary strengths of human intuition and machine precision.
audio
Published: 2024-02-27
5 pages
Authors: Taein Kang, Soyul Han, Sunmook Choi, Jaejin Seo, Sanghyeok Chung, Seungeun Lee, Seungsang Oh, Il-Youp Kwak
This research investigates enhancing voice spoofing detection by leveraging wav2vec 2.0 as an audio feature extractor. It proposes a method to optimize wav2vec 2.0 by selectively choosing and fine-tuning its pretrained Transformer layers, which are then integrated with various spoofing detection back-end models. The study demonstrates that this approach achieves state-of-the-art performance on the ASVspoof 2019 LA dataset, offering valuable insights into refined feature extraction strategies.
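One common way to realize "selectively choosing" pretrained Transformer layers is to expose all hidden states and learn a weighted combination over a chosen subset. The sketch below uses Hugging Face's Wav2Vec2Model; the checkpoint and layer indices are illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SelectedLayerFrontend(nn.Module):
    """Combine a subset of wav2vec 2.0 Transformer layers with learnable weights."""
    def __init__(self, model_name="facebook/wav2vec2-xls-r-300m", layers=(5, 9, 12)):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(model_name)
        self.layers = layers
        self.weights = nn.Parameter(torch.zeros(len(layers)))

    def forward(self, input_values):                   # (B, T) raw waveform at 16 kHz
        out = self.backbone(input_values, output_hidden_states=True)
        stacked = torch.stack([out.hidden_states[i] for i in self.layers])  # (L, B, T', D)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)
        return (w * stacked).sum(dim=0)                # (B, T', D) front-end features
```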
audio
Published: 2024-02-22
9 pages, 4 figures, 3 tables
Authors: Mahsa Salehi, Kalin Stefanov, Ehsan Shareghi
This paper investigates human brain activity, as measured by EEG, when individuals listen to real versus deepfake audio. It contrasts these human responses with the representations learned by a state-of-the-art deepfake audio detection algorithm. Preliminary results indicate that while machine learning representations do not clearly distinguish fake from real audio, human EEG patterns display distinct differences, suggesting a promising avenue for future deepfake detection research.
audio
Published: 2024-01-20
Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)
Authors: Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen
This paper introduces a Generalized Standalone Automatic Speaker Verification (G-SASV) system to detect spoofing attacks without requiring a separate countermeasure (CM) module during the authentication phase. It enhances a simple deep neural network backend by leveraging limited CM training data through domain adaptation and multi-task learning, integrating spoof embeddings at the training stage. Experiments on the ASVspoof 2019 logical access dataset demonstrate significant improvements over statistical ASV backends.
audio
Published: 2024-01-17
IJCNN 2024
Authors: Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger
This paper introduces MLAAD (Multi-Language Audio Anti-Spoofing Dataset), version 9, a comprehensive dataset of 687.4 hours of synthetic voice generated by 140 Text-to-Speech (TTS) models across 51 different languages to combat audio deepfakes. It demonstrates that models trained on MLAAD exhibit superior performance over comparable datasets like InTheWild and FakeOrReal, and acts as a complementary resource to the renowned ASVspoof 2019 dataset, enhancing cross-dataset generalization capabilities. The authors aim to democratize anti-spoofing technology by publishing MLAAD and making a trained model accessible via a webserver.
audio
Published: 2024-01-11
Authors: Lian Huang, Chi-Man Pun
This paper proposes a novel framework for replay and deep-fake audio detection by integrating hybrid features with a self-attention mechanism. It extracts deep learning features via parallel CNNs and Mel-spectrogram features via the STFT and Mel-scale filtering, concatenates them, and then processes them with self-attention before classification using ResNet. The approach achieved state-of-the-art Equal Error Rates of 9.67% for physical access and 8.94% for deep-fake tasks on the ASVspoof 2021 dataset, significantly outperforming baseline systems.
audio
Published: 2024-01-04
arXiv admin note: text overlap with arXiv:2308.12734 by other authors
Authors: Enkhtogtokh Togootogtokh, Christian Klasen
This research introduces "AntiDeepFake," an AI system designed to recognize deepfake or generative AI cloned synthetic voices. The proposed technology encompasses the entire pipeline from data collection and feature extraction to model training and evaluation. It leverages feature engineering and tabular AI models to effectively classify audio as real or deepfake.
audio
Published: 2023-12-15
Accepted by the main track The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)
Authors: Xiaohui Zhang, Jiangyan Yi, Chenglong Wang, Chuyuan Zhang, Siding Zeng, Jianhua Tao
This paper proposes Radian Weight Modification (RWM), a self-adaptive continual learning approach for audio deepfake detection. RWM addresses the challenge of existing models struggling with new deepfake types by categorizing audio classes into groups based on feature distribution compactness. This categorization informs a trainable gradient modification direction, enabling effective knowledge acquisition for new tasks while mitigating catastrophic forgetting of previously learned information.
audio
Published: 2023-12-13
Accepted to ICASSP 2024. 5 pages, 1 figure
Authors: Yinlin Guo, Haofan Huang, Xi Chen, He Zhao, Yuehai Wang
This paper introduces a novel approach for audio deepfake detection by combining a self-supervised WavLM model with a Multi-Fusion Attentive (MFA) classifier. The method leverages WavLM for extracting features highly conducive to spoofing detection and proposes the MFA classifier, based on Attentive Statistics Pooling (ASP), to capture complementary information across different time steps and layers of audio features. Experiments demonstrate that this approach achieves state-of-the-art results on the ASVspoof 2021 DF set and competitive performance on the ASVspoof 2019 and 2021 LA sets.
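Attentive Statistics Pooling (ASP), which the MFA classifier builds on, is compact enough to sketch directly; this is a generic simplified version, not the authors' exact module.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and standard deviation over time,
    concatenated into a fixed-size utterance embedding."""
    def __init__(self, dim, bottleneck=128):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv1d(dim, bottleneck, 1), nn.Tanh(), nn.Conv1d(bottleneck, dim, 1)
        )

    def forward(self, x):                              # x: (B, D, T)
        alpha = torch.softmax(self.att(x), dim=-1)     # per-frame attention weights
        mean = (alpha * x).sum(dim=-1)
        var = (alpha * x.pow(2)).sum(dim=-1) - mean.pow(2)
        std = var.clamp(min=1e-6).sqrt()
        return torch.cat([mean, std], dim=-1)          # (B, 2D)
```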
audio
Published: 2023-12-08
Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia Gonzalez, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-jussà, Maha Elbayad, Hongyu Gong, Francisco Guzmán, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson
This paper introduces the Seamless family of models for end-to-end expressive and multilingual speech translation in a streaming fashion. It details SeamlessM4T v2, an improved foundational model, SeamlessExpressive for vocal style and prosody preservation, and SeamlessStreaming for low-latency simultaneous translation. These components are unified into "Seamless", the first publicly available system for real-time expressive cross-lingual communication.
audio
Published: 2023-11-06
Authors: Karthik Sivarama Krishnan, Koushik Sivarama Krishnan
The paper introduces the Multi-Feature Audio Authenticity Network (MFAAN), an advanced architecture designed for the detection of fabricated audio content. MFAAN leverages multiple parallel paths to process diverse audio representations, including MFCC, LFCC, and Chroma-STFT. By synergistically fusing these features, the network achieves a nuanced understanding of audio content, enabling robust differentiation between genuine and manipulated recordings.
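The three parallel inputs can be produced with standard audio tooling. A sketch using torchaudio for the cepstral features and librosa for Chroma-STFT (parameter choices here are placeholders, and each feature would feed its own branch of the network):

```python
import librosa
import torch
import torchaudio

def parallel_features(path, n_mfcc=40, n_lfcc=40):
    """Return the three parallel representations: MFCC, LFCC, Chroma-STFT."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0)                              # downmix to mono
    mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=n_mfcc)(wav)
    lfcc = torchaudio.transforms.LFCC(sample_rate=sr, n_lfcc=n_lfcc)(wav)
    chroma = librosa.feature.chroma_stft(y=wav.numpy(), sr=sr)
    return mfcc, lfcc, torch.from_numpy(chroma)
```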
audio
Published: 2023-10-09
Authors: Xiangyu Shi, Yuhao Luo, Li Wang, Haorui He, Hao Li, Lei Wang, Zhizheng Wu
This study proposes an audio compression-assisted feature extraction approach for voice replay attack detection. By utilizing the 'missed information' after audio decompression as content- and speaker-independent channel noise, the method aims to robustly detect spoofing. The proposed approach achieved the lowest Equal Error Rate (EER) of 22.71% on the ASVspoof 2021 Physical Access (PA) evaluation set, demonstrating its effectiveness.
audio
Published: 2023-10-09
17 pages, 11 figures
Authors: Itshak Lapidot, Jean-Francois Bonastre
This paper introduces "genuinization," an algorithm designed to reduce the waveform probability mass function (PMF) gap between genuine and spoofed speech, covering synthesized, converted, and replayed audio. Evaluated on ASVspoof 2019, genuinization used by attackers significantly degrades spoofing detection performance by up to a factor of 10. Conversely, integrating this algorithm into spoofing countermeasures leads to substantial improvements in detection, emphasizing the critical role of waveform distribution in anti-spoofing systems.
audio
Published: 2023-10-05
Authors: Awais Khan, Khalid Mahmood Malik
This paper introduces Quick-SpoofNet, a novel one-shot and metric learning approach for detecting both seen and unseen audio deepfake attacks in Automatic Speaker Verification (ASV) systems. It extracts compact temporal embeddings from voice samples using effective spectral features and employs triplet loss to distinguish bona fide speeches from spoofing attacks based on similarity indexing. The system demonstrates enhanced generalization capabilities against unseen deepfakes and bona fide speech across various datasets.
audio
Published: 2023-09-26
Accepted to ICASSP 2024
Authors: Lauri Juvela, Xin Wang
This paper proposes a collaborative training scheme for synthetic speech watermarking, where a HiFi-GAN neural vocoder works with ASVspoof 2021 baseline countermeasure models. This approach consistently improves detection performance over conventional classifier training. Furthermore, collaborative training, especially when paired with augmentation strategies, enhances robustness against noise and time-stretching with minimal adverse effects on perceptual quality.
audio
Published: 2023-09-22
9 pages, 6 figures, 7 tables
Authors: Alexandre R. Ferreira, Cláudio E. C. Campelo
This work proposes a framework for data augmentation using deepfake audio to train robust automatic speech-to-text transcription models. Experiments were conducted using an existing voice cloner and an Indian English dataset, where the augmented data was used to fine-tune speech-to-text models. However, transcription quality declined after augmentation, a result the authors attribute to the low quality of the generated deepfake audio.
audio
Published: 2023-09-19
Authors: Awais Khan, Khalid Mahmood Malik
This paper introduces a Parallel Stacked Aggregation Network to bridge the gap in unified spoofing detection for Automatic Speaker Verification (ASV) systems, which are vulnerable to both logical (LA) and physical (PA) attacks. The proposed approach directly processes raw audio using a split-transform-aggregation technique to identify spoofing attacks. It significantly outperforms state-of-the-art solutions on ASVspoof-2019 and VSDC datasets, showing reduced Equal Error Rate (EER) disparities and superior generalizability across attack types.
audio
Published: 2023-09-18
Authors: Awais Khan, Khalid Mahmood Malik, Shah Nawaz
This paper introduces a unified spectra-temporal approach for detecting various voice spoofing attacks, including synthetic, replay, and partial deepfakes. The method leverages frame-level spectral deviation coefficients (SDC) and utterance-level sequential temporal coefficients (STC) through a bi-LSTM network. These coefficients are then fused and processed by an auto-encoder to generate robust spectra-temporal deviated coefficients (STDC), demonstrating enhanced performance across diverse spoofing categories.
audio
Published: 2023-09-18
Accepted to ICASSP 2024
Authors: Wanying Ge, Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Nicholas Evans
This paper investigates how the training conditions of deep-learning-based spoofing attacks impact the generalisation capability of deepfake detection countermeasures (CMs). It demonstrates that attack potency can vary substantially, causing significant degradation in detection performance for some CMs like RawNet2, while others like AASIST and SSL-AASIST show more robustness. The authors propose that training CMs with a variety of differently-trained attack models can serve as an effective data augmentation strategy to improve generalisation.
audio
Published: 2023-09-15
submitted to icassp 2024
Authors: Jingze Lu, Yuxiang Zhang, Wenchao Wang, Zengqiang Shang, Pengyuan Zhang
This paper addresses the challenge of detecting spoofing speech generated by unseen algorithms, attributing the generalization issue to traditional binary classification paradigms. It proposes a novel one-class knowledge distillation (OCKD) method within a teacher-student framework to learn the distribution of bonafide speech. The approach significantly outperforms state-of-the-art methods on ASVspoof 21DF and InTheWild datasets, demonstrating enhanced generalization ability.
audio
Published: 2023-09-15
Submitted to 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)
Authors: Hyun-seo Shin, Jungwoo Heo, Ju-ho Kim, Chan-yeong Lim, Wonbin Kim, Ha-Jin Yu
This paper introduces HM-Conformer, a modified Conformer-based system for audio deepfake detection, addressing the sub-optimal direct application of Conformer to classification tasks. It integrates hierarchical pooling to reduce sequence length and duplicated information, alongside a multi-level classification token aggregation method to gather features from different blocks. HM-Conformer efficiently detects spoofing evidence by processing and aggregating information from various sequence lengths, achieving a competitive 15.71% Equal Error Rate (EER) on the ASVspoof 2021 Deepfake dataset.
audio
Published: 2023-09-15
Submitted to ICASSP 2024
Authors: Yi Zhu, Saurabh Powar, Tiago H. Falk
This study proposes a novel method to enhance the generalizability of deepfake speech detection systems to unseen attacks by characterizing the long-term temporal dynamics of universal speech representations. By applying a modulation transformation block to embeddings from models like wav2vec2 and wavLM, the approach reveals consistent dynamic patterns across various generative models. Experiments on ASVspoof 2019 and 2021 datasets demonstrate significant improvements in detecting deepfakes from unseen generation methods, outperforming several benchmark systems.
audio
Published: 2023-09-14
Accepted at ICASSP 2024
Authors: Yongyi Zang, You Zhang, Mojtaba Heydari, Zhiyao Duan
This paper introduces the Singing Voice Deepfake Detection (SVDD) task and presents SingFake, the first curated in-the-wild dataset for singing voice deepfakes. It evaluates state-of-the-art speech countermeasure systems, demonstrating their significant performance degradation on singing voices when trained on speech. However, retraining these systems on SingFake leads to substantial improvements, though challenges related to unseen singers, languages, and musical contexts remain.
audio
Published: 2023-09-12
To appear in ICASSP 2024. code on github:...
Authors: Xin Wang, Junichi Yamagishi
This study investigates if large-scale vocoded spoofed data can improve speech spoofing countermeasures (CMs) with self-supervised learning (SSL) front ends. They generated over 9,000 hours of vocoded data from the VoxCeleb2 corpus and found that continually training an SSL model on this data, and especially distilling a new SSL from both pre-trained and continually trained SSLs, significantly improved overall CM performance on multiple challenging unseen test sets, outperforming previous state-of-the-art models.
audio
Published: 2023-09-11
Accepted at Interspeech 2024
Authors: Octavian Pascu, Adriana Stan, Dan Oneata, Elisabeta Oneata, Horia Cucu
This paper investigates the use of pretrained self-supervised representations for building generalizable and calibrated audio deepfake detection models. The authors demonstrate that large frozen representations, combined with a simple logistic regression classifier, significantly improve generalization capabilities and produce more reliable predictions. This approach drastically reduces the equal error rate from 30.9% (RawNet2) to 8.8% on a benchmark of eight diverse deepfake datasets.
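Since the comparison is reported in equal error rate, a short helper for computing EER from detection scores may be useful context (label and score polarity here are conventions chosen for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance and false-rejection
    rates coincide; labels 1 = bona fide, higher scores = more bona fide."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2
```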
audio
Published: 2023-09-06
Authors: Yuankun Xie, Haonan Cheng, Yutian Wang, Long Ye
This paper introduces Temporal Deepfake Location (TDL), a fine-grained method for detecting partially spoofed audio by accurately locating the authenticity of audio at the frame level. TDL incorporates an embedding similarity module to effectively separate real and fake features within an embedding space and a temporal convolution operation to capture precise positional information. Extensive experiments demonstrate TDL's superior performance over baseline models on the ASVspoof2019 Partial Spoof dataset and its strong generalizability in cross-dataset scenarios.
audio
Published: 2023-09-05
Submitted to ICASSP 2024
Authors: Yuankun Xie, Jingjing Zhou, Xiaolin Lu, Zhenghao Jiang, Yuxin Yang, Haonan Cheng, Long Ye
This paper introduces FSD, an initial Chinese dataset specifically designed for Fake Song Detection, addressing the lack of specialized resources in this domain. The dataset comprises real and fake songs generated by five state-of-the-art singing voice synthesis and conversion methods. Initial experiments demonstrate that existing speech-trained Audio DeepFake Detection (ADD) models are ineffective for song deepfake detection, highlighting the necessity of song-specific training.
audio
Published: 2023-08-29
Authors: Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, Yan Zhao
This survey paper provides a systematic overview of audio deepfake detection, analyzing various deepfake audio types, competitions, datasets, features, classifications, and evaluation metrics for state-of-the-art approaches. It performs a unified comparison of representative features and classifiers on key datasets. The authors identify critical future research directions, including the need for large-scale in-the-wild datasets, improved generalization to unknown attacks, and better interpretability of detection results.
audio
Published: 2023-08-24
Authors: Jordan J. Bird, Ahmad Lotfi
This study addresses the urgent need for real-time detection of AI-generated speech from DeepFake Voice Conversion by creating the DEEP-VOICE dataset, comprising real human speech and RVC-converted deepfakes. It performs statistical analysis on temporal audio features and applies hyperparameter-optimized machine learning models for binary classification. The Extreme Gradient Boosting model achieves an average classification accuracy of 99.3% and can classify speech in real time.
audio
Published: 2023-08-22
Interspeech 2023
Authors: Nicolas M. Müller, Philip Sperl, Konstantin Böttinger
This paper introduces a novel voice anti-spoofing method utilizing complex-valued neural networks to process complex-valued Constant-Q Transform (CQT) spectrograms. This approach retains crucial phase information, enabling explainable AI methods and outperforming existing systems on the "In-the-Wild" dataset. Ablation studies confirm the model effectively uses phase information to detect voice spoofing.
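Keeping the phase means working with the complex CQT rather than its magnitude only. A minimal sketch with librosa, stacking real and imaginary parts as two input channels (a common way to feed complex spectrograms to a network; the paper's complex-valued layers are not reproduced here):

```python
import librosa
import numpy as np

def complex_cqt(path, sr=16000, n_bins=84):
    """Complex-valued CQT: retain both magnitude and phase information."""
    y, sr = librosa.load(path, sr=sr)
    C = librosa.cqt(y, sr=sr, n_bins=n_bins)           # complex64, (n_bins, frames)
    return np.stack([C.real, C.imag])                  # (2, n_bins, frames)
```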
audio
Published: 2023-08-20
The DKU-DukeECE system description to Task 2 of Audio Deepfake Detection Challenge (ADD 2023)
Authors: Zexin Cai, Weiqing Wang, Yikang Wang, Ming Li
This paper introduces the DKU-DUKEECE system designed for Track 2 of the ADD 2023 challenge, which focuses on locating manipulated regions in audio deepfakes. The approach integrates three systems: two frame-level models for boundary detection and deepfake detection, and a VAE model for authenticity assessment. This fusion secured the first rank in Track 2 of ADD 2023 with an impressive 82.23% sentence accuracy and an F1 score of 60.66%.
audio
Published: 2023-08-19
Accept by Neural Networks
Authors: Cunhang Fan, Jun Xue, Jianhua Tao, Jiangyan Yi, Chenglong Wang, Chengshi Zheng, Zhao Lv
This paper introduces a novel F0 subband feature for fake speech detection (FSD), leveraging the distinct fundamental frequency characteristics of synthetic speech. To effectively model this feature, a Spatial Reconstructed Local Attention Res2Net (SR-LA Res2Net) architecture is proposed, which enhances Res2Net with a spatial reconstruction mechanism and local attention for improved feature representation. The method achieves state-of-the-art performance among single systems on the ASVspoof 2019 LA dataset.
audio
Published: 2023-08-18
Authors: Penghui Wen, Kun Hu, Wenxi Yue, Sen Zhang, Wanlei Zhou, Zhiyong Wang
This paper proposes S2pecNet, a novel deep learning method for robust audio anti-spoofing that leverages multi-order spectral patterns through a spectral fusion-reconstruction strategy. It fuses spectral patterns up to second-order in a coarse-to-fine manner with two branches for fine-level fusion from spectral and temporal contexts. A reconstruction mechanism from the fused representation to input spectrograms is employed to reduce information loss, achieving state-of-the-art performance.
audio
Published: 2023-07-28
Accepted at ECML-PKDD 2023 Workshop "Deep Learning and Multimedia Forensics. Combating fake...
Authors: Daniele Mari, Davide Salvi, Paolo Bestagini, Simone Milani
This paper proposes a deep learning-based system for synthetic speech detection that fuses three distinct feature sets: First Digit (FD), short-term long-term (STLT), and bicoherence features. The model leverages an end-to-end deep learning approach to integrate these features, achieving superior performance compared to state-of-the-art single-feature solutions. The system demonstrates robustness against anti-forensic attacks and strong generalization capabilities across various datasets.
audio
Published: 2023-07-03
Authors: Sheng Zhao, Qilong Yuan, Yibo Duan, Zhuoyue Chen
This paper presents an end-to-end multi-module synthetic speech generation model designed for the ADD Challenge 2023. The system, comprising a speaker encoder, a Tacotron2-based synthesizer, and a WaveRNN-based vocoder, aims to generate high-quality fake human voices from text. The authors' system achieved first place in the ADD 2023 Challenge Track 1.1 with a weighted deception success rate (WDSR) of 44.97%.
audio
Published: 2023-06-27
Accepted by DADA2023
Authors: Shunbo Dong, Jun Xue, Cunhang Fan, Kang Zhu, Yujie Chen, Zhao Lv
This paper proposes a Multi-perspective Information Fusion (MPIF) Res2Net with a random Specmix data augmentation strategy for fake speech detection (FSD). The system is designed to improve the model's ability to learn precise forgery information in low-quality scenarios. It achieves this by enhancing generalization through random Specmix and reducing redundant interference information via multi-perspective fusion in MPIF-Res2Net.
audio
Published: 2023-06-27
Authors: Jie Liu, Zhiba Su, Hui Huang, Caiyan Wan, Quanxiu Wang, Jiangli Hong, Benlai Tang, Fengjie Zhu
This paper introduces TranssionADD, a novel system for detecting and locating manipulated regions in audio deepfakes, specifically for the ADD 2023 Challenge Track 2. It adapts a sequence tagging task for audio deepfake detection, enhances model generalization through various data augmentation techniques, and incorporates a multi-frame detection (MFD) module along with an isolated-frame penalty (IFP) loss to handle limited representation and outliers. The system achieved 2nd place in the challenge, demonstrating its effectiveness and robustness.
audio
Published: 2023-06-13
Accepted at INTERSPEECH 2023
Authors: Michele Panariello, Wanying Ge, Hemlata Tak, Massimiliano Todisco, Nicholas Evans
This paper introduces Malafide, a novel universal adversarial attack designed to compromise the reliability of automatic speaker verification (ASV) spoofing countermeasures (CMs). It achieves this by introducing optimized convolutive noise through a linear time-invariant filter, which degrades CM performance significantly while preserving speech quality. Malafide filters are optimized independently of specific utterances, targeting underlying spoofing attacks, and are effective in both white-box and black-box settings, though integrated self-supervised learning CMs show greater robustness.
audio
Published: 2023-06-09
6 pages
IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis
Authors: Chenglong Wang, Jiangyan Yi, Xiaohui Zhang, Jianhua Tao, Le Xu, Ruibo Fu
This paper introduces a Low-rank Adaptation (LoRA) method to fine-tune the wav2vec2 model for fake audio detection, addressing challenges of long training times and high memory consumption associated with full fine-tuning. By freezing pre-trained weights and injecting trainable rank-decomposition matrices, LoRA drastically reduces the number of trainable parameters. The approach achieves performance comparable to full fine-tuning while significantly improving training efficiency and reducing hardware requirements.
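The LoRA recipe maps naturally onto the Hugging Face peft library: freeze the wav2vec 2.0 weights and inject rank-decomposition adapters into the attention projections. The checkpoint, rank, and target modules below are illustrative assumptions, not the paper's settings.

```python
from transformers import Wav2Vec2Model
from peft import LoraConfig, get_peft_model

# Freeze the pre-trained backbone and train only the small adapter matrices.
base = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)
model.print_trainable_parameters()    # typically a small fraction of the full model
```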
audio
Published: 2023-06-02
Accepted to INTERSPEECH 2023
Authors: Piotr Kawa, Marcin Plata, Michał Czuba, Piotr Szymański, Piotr Syga
This paper investigates the use of the state-of-the-art Whisper automatic speech recognition model as a feature extraction front-end for audio DeepFake detection. The authors compare various combinations of Whisper and traditional front-ends (LFCC, MFCC) with three detection models (LCNN, SpecRNet, MesoNet). They demonstrate that Whisper-based features significantly improve detection, particularly enhancing generalization to real-world DeepFakes.
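Using Whisper as a front-end amounts to running its frozen encoder and handing the hidden states to a separate detector. A sketch with the transformers library (the checkpoint size is an assumption; the paper pairs such features with LCNN, SpecRNet, or MesoNet back-ends):

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
encoder = WhisperModel.from_pretrained("openai/whisper-tiny").get_encoder().eval()

def whisper_features(waveform_16k):                    # 1-D numpy array at 16 kHz
    """Frozen Whisper encoder output, to be fed to a downstream detector."""
    inputs = fe(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return encoder(inputs.input_features).last_hidden_state   # (1, T, D)
```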
audio
Published: 2023-05-30
5 pages
Authors: Qing Wang, Jixun Yao, Ziqian Wang, Pengcheng Guo, Lei Xie
This study proposes a timbre-reserved adversarial attack method for black-box speaker identification (SID) systems. It generates fake audio by integrating an adversarial constraint into a voice conversion (VC) model to preserve timbre, while a pseudo-Siamese network trains a substitute SID model to mimic the black-box target. This approach allows for effective attacks that deceive both machines and humans by exploiting SID vulnerabilities while maintaining high audio quality.
audio
Published: 2023-05-25
To appear at InterSpeech2023
Authors: Rui Liu, Jinhua Zhang, Guanglai Gao, Haizhou Li
This paper introduces M2S-ADD, a novel audio deepfake detection (ADD) model that exploits previously unstudied dual-channel stereo information. It proposes converting mono audio to stereo using a pretrained synthesizer and then processing the left and right channels with a dual-branch neural architecture. This method effectively reveals authenticity cues and artifacts in fake audio, significantly improving ADD performance.
audio
Published: 2023-05-23
Authors: Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li
The ADD 2023 challenge advances audio deepfake detection by moving beyond binary classification to include tasks for localizing manipulated regions in partially fake audio and recognizing the source generation algorithm. This paper outlines the challenge's three subchallenges, details the datasets, specifies the evaluation metrics, and describes the protocols for participants. It also reports initial findings from the submitted results and provided baselines.
audio
Published: 2023-05-23
Interspeech2023
Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, Ruibo Fu, Xun Chen
This paper proposes TO-RawNet, an enhanced end-to-end fake audio detection system that improves upon RawNet by optimizing Sinc-conv parameters. It incorporates orthogonal convolution to reduce filter correlation and temporal convolutional networks (TCN) to capture long-term dependencies in speech signals. Experiments on the ASVspoof 2019 dataset demonstrate that TO-RawNet significantly reduces the Equal Error Rate (EER) compared to RawNet.
audio
Published: 2023-05-22
Code available at: https://github.com/gan-police/audiodeepfake-detection
Published in Transactions on Machine Learning Research (04/2024)
Authors: Konstantin Gasenzer, Moritz Wolter
This paper investigates the generalization capabilities of deepfake audio detectors, addressing previous reports of their limited ability to generalize to unseen generators. The authors analyze stable frequency domain fingerprints of various audio generative networks and leverage these insights to train lightweight, dilated convolutional neural network (DCNN) based detectors. Their approach demonstrates excellent generalization and achieves improved detection performance on the WaveFake dataset and its newly extended version.
audio
Published: 2023-05-18
Accepted by interspeech2023
Authors: Chang Zeng, Xin Wang, Xiaoxiao Miao, Erica Cooper, Junichi Yamagishi
This paper investigates a new mismatch scenario for audio deepfake detection, where fake audio is generated from real audio with unseen genres. To address this, the authors create a new dataset called CN-Spoof based on CN-Celeb1&2 and propose a multi-task learning method combining a main anti-spoofing objective with two auxiliary regularization objectives: meta-optimization and a genre alignment module, using learnable loss weights. The proposed method significantly improves the generalization ability of countermeasures in this cross-genre evaluation dataset.
audio
Published: 2023-05-12
Authors: Eran Kaufman, Lee-Ad Gottlieb
This paper proposes an automated method for detecting word emphasis in spoken language by leveraging deepfake technology. It generates an emphasis-devoid version of a speaker's utterance using a voice sample and the extracted text. By comparing this synthesized speech with the original, the approach isolates and identifies patterns of emphasis, addressing challenges posed by speaker-specific speech characteristics.
audio
Published: 2023-05-09
Authors: Yuanda Wang, Hanqing Guo, Guangjing Wang, Bocheng Chen, Qiben Yan
VSMask proposes a real-time defense mechanism against deep learning-based voice synthesis attacks by generating predictive perturbations for streaming speech. It utilizes a neural network to forecast effective perturbations and incorporates a universal perturbation header for comprehensive protection. A weight-based constraint is applied to minimize audio distortion, ensuring the added perturbations are imperceptible to human ears.
audio
Published: 2023-04-25
Paper accepted in CVPRW 2023. Codes and data can be found at...
Authors: Chengzhe Sun, Shan Jia, Shuwei Hou, Siwei Lyu
This study proposes a novel approach to detect AI-synthesized human voices by identifying artifacts introduced by neural vocoders in audio signals. It introduces a multi-task learning framework for a RawNet2 model, where vocoder identification serves as a pretext task to constrain the feature extractor to focus on vocoder-specific artifacts. This method aims to provide discriminative features for the final binary classifier, achieving high classification performance.
audio
Published: 2023-03-02
Accepted by ICASSP 2023
Authors: Jun Xue, Cunhang Fan, Jiangyan Yi, Chenglong Wang, Zhengqi Wen, Dan Zhang, Zhao Lv
This paper introduces a novel self-distillation method for fake speech detection (FSD) that enhances performance without increasing model complexity. It addresses the challenge of capturing fine-grained information by using the deepest network as a teacher to instruct and strengthen shallow networks. The approach involves segmenting the network, adding classifiers to shallow layers as student models, and employing distillation paths to reduce feature differences and transfer knowledge effectively.
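The distillation path between the deepest classifier (teacher) and a shallow-layer classifier (student) can be expressed as a standard soft-target loss; this generic formulation is a stand-in for the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Shallow classifier matches the deepest classifier via soft targets,
    while also being trained on the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```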
audio
Published: 2023-03-02
Authors: Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen
This paper introduces speaker-aware anti-spoofing, a novel approach that integrates prior knowledge of the target speaker into a voice spoofing countermeasure (CM). By extending the state-of-the-art AASIST model in a speaker-conditioned manner, the method leverages target speaker enrollment information at either the frame or utterance level. Experimental results on a custom protocol based on ASVspoof 2019 demonstrate significant improvements in spoofing detection performance.
audio
Published: 2023-02-20
Authors: Domna Bilika, Nikoletta Michopoulou, Efthimios Alepis, Constantinos Patsakis
This paper investigates the vulnerability of widely used Voice Assistants (VAs) like Google Assistant and Siri to audio deepfake attacks. Participants trained their VAs, and then their voices were synthesized to create commands for dangerous tasks. The study found that a significant percentage of these deepfake attacks were successful, highlighting alarming security gaps and variations among vendors, including a notable gender bias in one case.
audio
Published: 2023-02-18
Dataset and codes will be available at https://github.com/csun22/LibriVoc-Dataset
Authors: Chengzhe Sun, Shan Jia, Shuwei Hou, Ehab AlBadawy, Siwei Lyu
This work introduces a novel approach to detect AI-synthesized human voices by identifying artifacts inherent to neural vocoders, which are core components in most DeepFake audio synthesis models. It proposes a multi-task learning framework for a binary-class RawNet2 model, where a shared front-end feature extractor is constrained by a vocoder identification pretext task. This strategy forces the feature extractor to focus on vocoder artifacts, yielding highly discriminative features for robust synthetic voice detection.
audio
Published: 2023-01-19
PLoS ONE 18(8) (2023): e0285333
Authors: Kimberly T. Mai, Sergi D. Bray, Toby Davies, Lewis D. Griffin
This paper investigates human ability to detect speech deepfakes across English and Mandarin, finding that human detection capabilities are unreliable. Listeners correctly identified deepfakes only 73% of the time, and familiarization with examples offered only slight improvement. These findings highlight the significant threat posed by speech deepfakes and the necessity for robust automated detection mechanisms.
audio
Published: 2023-01-08
Authors: Lior Yasur, Guy Frankovits, Fred M. Grabovski, Yisroel Mirsky
This paper proposes D-CAPTCHA, an active defense against real-time deepfakes, primarily focusing on audio. Unlike passive detection, D-CAPTCHA challenges the deepfake model to generate content beyond its current capabilities, causing distortions that make detection easier. The system focuses on the AI's ability to create content rather than classify it, enhancing deepfake detection accuracy.
audio
Published: 2022-12-30
Accepted to INTERSPEECH 2023
Authors: Piotr Kawa, Marcin Plata, Piotr Syga
This work investigates the vulnerability of audio deepfake detection systems to adversarial attacks, evaluating the robustness of three deep neural network architectures in both white-box and transferability scenarios. It then introduces a novel adaptive adversarial training method to enhance the detectors' resilience against such attacks. The paper also highlights the first adaptation of RawNet3 for audio deepfake detection.
audio
Published: 2022-12-16
Accepted by APSIPA ASC
Authors: Tinglong Zhu, Xingming Wang, Xiaoyi Qin, Ming Li
This paper proposes a source tracing system for detecting voice spoofing by classifying different spoofing attributes rather than just determining if an audio is fake. The system aims to identify the methods used at various stages of speech generation, such as conversion, speaker representation, and waveform generation. This attribute-based classification improves robustness against unseen spoofing methods and serves as an auxiliary system for anti-spoofing.
audio
Published: 2022-11-11
Accepted by Pattern Recognition, 1 April 2024
Authors: Jiangyan Yi, Chenglong Wang, Jianhua Tao, Chu Yuan Zhang, Cunhang Fan, Zhengkun Tian, Haoxin Ma, Ruibo Fu
This paper introduces SceneFake, a novel dataset for detecting scene-manipulated audio, where an original audio's acoustic scene is altered using speech enhancement technologies. The dataset aims to address a gap in existing fake audio datasets, which primarily focus on timbre, prosody, content, or channel noise manipulation. Benchmarks on SceneFake using baseline models indicate that these models struggle to reliably detect scene fake utterances, especially on unseen test sets, despite performing well on seen data.
audio
Published: 2022-11-10
Authors: Yan Zhao, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Xiaohui Zhang, Yongfeng Dong
This paper introduces EmoFake, a novel dataset designed for detecting emotion fake audio, where the emotion state of speech is altered while other information like speaker identity and content remains unchanged. The dataset is generated using seven open-source emotion voice conversion models. Benchmark experiments using existing fake audio detection models demonstrate that EmoFake poses a significant challenge, revealing a notable degradation in their performance against emotion fake audio.
audio
Published: 2022-11-01
Authors: Zexin Cai, Weiqing Wang, Ming Li
This paper introduces a deep learning-based frame-level system for detecting partially spoofed audio and localizing manipulated segments within an utterance. The proposed method addresses audio deepfake scenarios where parts of an audio waveform are replaced with synthetic or natural clips. It achieves an Equal Error Rate (EER) of 6.58% on the ADD2022 challenge test set, establishing state-of-the-art performance for systems capable of locating manipulated clips.
audio
Published: 2022-10-31
Authors: Luigi Attorresi, Davide Salvi, Clara Borrelli, Paolo Bestagini, Stefano Tubaro
This paper introduces ProsoSpeaker, a novel synthetic speech detection method that combines high-level semantic properties of the human voice: speaker identity cues (speaker embeddings) and voice prosody (prosody embeddings). These combined features are fed into a supervised binary classifier to detect deepfake speech generated by both Text-to-Speech (TTS) and Voice Conversion (VC) techniques. The approach demonstrates improved performance over baselines, good generalization across multiple datasets, and robustness to audio compression.
audio
Published: 2022-10-21
7 pages, 8 figures, 4 tables
Authors: Vardhan Dongre, Abhinav Thimma Reddy, Nikhitha Reddeddy
This paper proposes an approach for synthetic speech detection using channel-wise recalibration of features via attentional feature fusion (AFF) and Squeeze Excitation (SE) blocks within ResNet models. The authors demonstrate that combining Linear Frequency Cepstral Coefficients (LFCC) and Mel Frequency Cepstral Coefficients (MFCC) using AFF creates better input feature representations that improve generalization. Their models, trained on the Fake or Real (FoR) dataset, achieved high test accuracy and generalized well to different varieties of synthetic speech.
audio
Published: 2022-10-19
ICASSP 2023 accepted. Code:...
Authors: Xin Wang, Junichi Yamagishi
This study proposes an efficient method to create diverse spoofed training data for speech spoofing countermeasures by using neural vocoders for copy-synthesis on bona fide utterances. It introduces a contrastive feature loss to better utilize the paired bona fide and vocoded data. The approach, combining optimized vocoder data creation and the new loss, achieves competitive performance and outperforms the top-1 ASVspoof 2021 challenge submission on hidden subsets.
audio
Published: 2022-10-13
Accepted by ACM Multimedia 2022 Workshop: First International Workshop on Deepfake Detection for...
Authors: Yuxiang Zhang, Jingze Lu, Xingming Wang, Zhuo Li, Runqiu Xiao, Wenchao Wang, Ming Li, Pengyuan Zhang
This paper describes a deepfake audio detection system submitted to the Audio Deep Synthesis Detection (ADD) Challenge Track 3.2, focusing on score fusion. The system leverages score-level fusion of multiple Light Convolutional Neural Network (LCNN) based models with various input features and online data augmentation. The authors analyze the reasons for the limited performance improvement of score fusion, attributing it to model overfitting and low correlation of scores on out-of-distribution test data.
audio
Published: 2022-10-12
Accepted by TrustCom 2022: The 21st IEEE International Conference on Trust, Security and Privacy...
Authors: Piotr Kawa, Marcin Plata, Piotr Syga
This paper introduces SpecRNet, a novel neural network architecture for audio DeepFake detection, aiming to increase accessibility by providing faster inference times and lower computational requirements. SpecRNet achieves performance comparable to state-of-the-art models like LCNN while processing audio samples up to 40% faster. The work also benchmarks SpecRNet's effectiveness across various challenging scenarios, including low-resource datasets, short utterances, and limited attack types.
audio
Published: 2022-10-11
7 pages, 1 figure, Accepted by Proceedings of the 1st International Workshop on Deepfake...
Authors: Xiaohui Liu, Meng Liu, Lin Zhang, Linjuan Zhang, Chang Zeng, Kai Li, Nan Li, Kong Aik Lee, Longbiao Wang, Jianwu Dang
This paper describes the authors' submitted systems for the Audio Deep Synthesis Detection (ADD) Challenge, addressing both low-quality (LF) and partially fake (PF) audio detection tracks. Their approach focused on detecting spectro-temporal artifacts using raw temporal signals, spectral features, and deep embeddings, ultimately achieving 4th and 5th place in the respective tracks.
audio
Published: 2022-10-07
Accepted by the 13th International Symposium on Chinese Spoken Language Processing (ISCSLP 2022)
Authors: Lei Wang, Benedict Yeoh, Jun Wah Ng
This paper introduces the SE-Res2Net-Conformer architecture to enhance synthetic voice detection by better exploiting local acoustic patterns, showing improved performance on the ASVspoof 2019 database. Additionally, it re-formulates the audio splicing detection problem to focus on identifying splicing segment boundaries, proposing a deep learning approach for this task.
audio
Published: 2022-10-06
Accepted at WIFS 2022
Authors: Daniele Mari, Federica Latora, Simone Milani
This paper investigates the discriminative role of silenced parts in synthetic speech detection, proposing a computationally-lightweight and robust method. It leverages first digit statistics extracted from MFCC coefficients to identify irregularities in these silent segments. The approach achieves over 90% accuracy on most ASVSpoof dataset classes, outperforming some state-of-the-art methods in open-set scenarios.
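First-digit (Benford-style) statistics are straightforward to compute once MFCCs are available; a rough sketch of the histogram feature follows, leaving out the silence segmentation step the method relies on.

```python
import numpy as np

def first_digit_histogram(mfcc):
    """Normalized histogram of leading digits of MFCC magnitudes (digits 1-9)."""
    vals = np.abs(mfcc[np.abs(mfcc) > 0]).astype(float)
    first = (vals / 10 ** np.floor(np.log10(vals))).astype(int)   # leading digit
    hist = np.bincount(first, minlength=10)[1:10]
    return hist / max(hist.sum(), 1)
```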
audio
Published: 2022-10-05
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Authors: Xuechen Liu, Xin Wang, Md Sahidullah, Jose Patino, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas Evans, Andreas Nautsch, Kong Aik Lee
The ASVspoof 2021 challenge benchmarked spoofed and deepfake speech detection in realistic conditions across three tasks: logical access (LA), physical access (PA), and deepfake (DF). Analyzing 54 participant teams' results, the study found LA countermeasures robust to new encoding and transmission effects, while PA solutions showed potential for real replay detection but poor generalization to simulated environments. DF task solutions exhibited some resilience to compression but lacked generalization across different source datasets, emphasizing key data factors and outlining future challenge directions.
audio
Published: 2022-09-28
Authors: Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
This paper proposes a novel approach for deepfake audio detection that addresses the poor generalization ability of existing methods to unseen synthetic audios. The method leverages only the biometric characteristics of the speaker by adapting off-the-shelf speaker verification tools, ensuring generalization by training exclusively on real data. It demonstrates good performance, high generalization, and robustness across various test sets and conditions.
audio
Published: 2022-09-23
Authors: Chenlei Hu, Ruohua Zhou
This paper presents the Online Hard Example Mining (OHEM) algorithm to enhance synthetic voice spoofing detection, specifically for unknown attacks. OHEM addresses the imbalance between simple and hard samples in the dataset, leading to improved recognition performance. The proposed system achieves an equal error rate (EER) of 0.77% on the ASVspoof 2019 Challenge logical access scenario's evaluation set.
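OHEM itself is a small change to the training loop: score every sample in the batch, keep only the hardest fraction, and back-propagate through those. A generic sketch (the keep ratio is an illustrative choice):

```python
import torch
import torch.nn.functional as F

def ohem_loss(logits, labels, keep_ratio=0.25):
    """Average the loss over only the hardest fraction of the mini-batch."""
    losses = F.cross_entropy(logits, labels, reduction="none")
    k = max(1, int(keep_ratio * losses.numel()))
    hard, _ = torch.topk(losses, k)
    return hard.mean()
```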
audio
Published: 2022-09-14
6 pages
Authors: Qiaowei Ma, Jinghui Zhong, Yitao Yang, Weiheng Liu, Ying Gao, Wing W. Y. Ng
This paper proposes a lightweight, end-to-end anti-spoofing model for Automatic Speaker Verification (ASV) systems, named ConvNeXt Based Neural Network (CNBNN), by revising the ConvNeXt architecture. It integrates a modified channel attention block and utilizes focal loss to improve focus on informative sub-bands and hard-to-classify samples. The proposed system achieves an equal error rate of 0.64% and min-tDCF of 0.0187 on the ASVSpoof 2019 LA evaluation dataset, outperforming state-of-the-art systems.
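The focal loss used to emphasize hard-to-classify samples has a standard form; a binary version is sketched below (gamma and alpha are the usual defaults, not necessarily the paper's values).

```python
import torch

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Down-weight easy examples so training focuses on hard spoof/bona fide cases."""
    p = torch.sigmoid(logits)
    targets = targets.float()
    pt = p * targets + (1 - p) * (1 - targets)           # probability of the true class
    w = alpha * targets + (1 - alpha) * (1 - targets)    # class-balancing weight
    return (-w * (1 - pt).pow(gamma) * torch.log(pt.clamp(min=1e-8))).mean()
```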
audio
Published: 2022-08-21
13 pages, 5 figures. arXiv admin note: text overlap with arXiv:2208.10489v3
Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Jie Chen
This paper introduces Audio Deepfake Attribution (ADA), a novel task for identifying the source generation tools of deepfake audio, moving beyond binary detection. It presents the first dataset for this purpose, also named ADA, and proposes the Class-Representation Multi-Center Learning (CRML) method to tackle the challenge of open-set attribution, particularly for unknown audio generation tools. The CRML method effectively addresses real-world open-set risks by learning discriminative representations.
audio
Published: 2022-08-20
Accepted by ACM Multimedia 2022 Workshop: First International Workshop on Deepfake Detection for...
Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Haoxin Ma, Tao Wang, Shiming Wang, Ruibo Fu
This paper introduces the novel problem of detecting vocoder fingerprints in fake audio to identify the specific synthesis model used, rather than just determining authenticity. The authors conduct experiments using datasets synthesized by eight state-of-the-art vocoders, exploring various features and model architectures to distinguish these fingerprints. Their preliminary investigation, including t-SNE visualization, demonstrates that different vocoders indeed produce distinct and detectable fingerprints.
audio
Published: 2022-08-20
Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Haiyang Sun, Xun Chen, Zhengkun Tian, Haoxin Ma, Cunhang Fan, Ruibo Fu
This paper introduces a fully automated end-to-end fake audio detection method that eliminates the need for manual feature engineering or hyperparameter tuning. It utilizes pre-trained wav2vec models for high-level speech representation combined with a novel light-DARTS architecture search for automatically optimizing the neural network structure. The proposed system achieves state-of-the-art performance on the ASVspoof 2019 LA dataset.
audio
Published: 2022-08-02
Authors: Jun Xue, Cunhang Fan, Zhao Lv, Jianhua Tao, Jiangyan Yi, Chengshi Zheng, Zhengqi Wen, Minmin Yuan, Shegang Shao
This paper proposes a novel audio deepfake detection method that leverages fundamental frequency (F0) information and real plus imaginary spectrogram features. It addresses the limitations of existing acoustic features by incorporating F0 subband information and fully utilizing phase and full-band details through real and imaginary spectrograms. The system employs a two-stage fusion approach, achieving superior performance on the ASVspoof 2019 LA dataset.
audio
Published: 2022-06-27
Proceedings of INTERSPEECH 2022 (Updated version: corrected ASVspoof dataset description)
Authors: Piotr Kawa, Marcin Plata, Piotr Syga
This paper introduces the Attack Agnostic Dataset, a novel combination of three audio deepfake and anti-spoofing datasets designed to improve the generalization and stability of audio deepfake detection methods. The authors conduct a thorough analysis of current detection methods using various audio features and propose an LCNN-based model with a combined LFCC and mel-spectrogram front-end. Their proposed solution demonstrates improved generalization, stability, and performance compared to existing LFCC-based approaches.
audio
Published: 2022-06-27
arXiv admin note: text overlap with arXiv:1904.05441 by other authors
Authors: Rohit Arora
This research introduces end-to-end deep learning models, WSTnet and CWTnet, for doctored speech detection, which enhance the Sincnet architecture by replacing its initial layer with Wavelet Scattering and Continuous Wavelet Transform layers, respectively. A novel Wavelet Deconvolution (WD) layer is also proposed for CWTnet to parametrically learn and optimize scale parameters using back-propagation. These approaches demonstrate significant relative improvements over traditional handcrafted features and the Sincnet baseline on modern spoofing attacks.
audio
Published: 2022-05-27
15 pages
Authors: Ranya Aloufi, Hamed Haddadi, David Boyle
This paper presents 'VoiceID', an on-device voice authentication system designed to offer local privacy preservation and robust security against replay and deepfake attacks. It locally derives token-based credentials from unique voice attributes, incorporating liveness detection and a flexible privacy filter to selectively remove paralinguistic information before data transmission. The system achieves high authentication accuracy while ensuring user sovereignty over their voice data on edge devices.
audio
Published: 2022-04-30
Accepted to Odyssey 2022
Authors: Alexey Sholokhov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen
This paper addresses the existing gap in public evaluation protocols and open-source baselines for household speaker recognition, a challenging task due to domain heterogeneity, short utterances, and passive enrollment. The authors provide an accessible evaluation benchmark derived from VoxCeleb and ASVspoof 2019 data. They also introduce a preliminary pool of open-source baselines, encompassing four algorithms for active enrollment and one for passive enrollment.
audio
Published: 2022-04-11
Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (DOI:...
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 813-825, 2023
Authors: Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi
This paper addresses the novel 'Partial Spoof' (PS) scenario, where short synthesized or transformed speech segments are embedded within a bona fide utterance, a manipulation difficult for existing countermeasures (CMs) to detect. The authors propose an improved CM capable of detecting and localizing these fake speech segments at multiple temporal resolutions. Key advancements include the use of self-supervised pre-trained models for enhanced feature extraction, an extended PartialSpoof database with multi-resolution segment labels, and a new CM architecture that simultaneously leverages both segment- and utterance-level labels.
audio
Published: 2022-04-09
Submitted to SLT 2022
Authors: Shih-Kuang Lee, Yu Tsao, Hsin-Min Wang
This study investigates the cepstrogram properties and demonstrates its effectiveness as a powerful countermeasure against replay attacks in automatic speaker verification (ASV) systems. A cepstrum analysis suggests that crucial anti-spoofing information for replay attacks is retained in the cepstrogram. Experiments show that cepstrogram-based single and fusion systems, particularly with an LCNN backend, significantly outperform existing state-of-the-art methods on the ASVspoof 2019 physical access database.
audio
Published: 2022-04-06
Accepted to be published in the Proceedings of Interspeech 2022
Authors: Jin Woo Lee, Eungbeom Kim, Junghyun Koo, Kyogu Lee
This paper investigates the effectiveness of wav2vec 2.0 features for spoofing detection in automatic speaker verification (ASV) systems and proposes a Spoof-Aware Speaker Verification (SASV) method. The study analyzes which feature space within wav2vec 2.0, specifically from different Transformer layers of XLSR-53, is most advantageous for identifying synthetic speech artifacts. A novel Representation Selective Self-Distillation (RSSD) module is introduced to improve SASV by disentangling speaker and spoofing representations.
audio
Published: 2022-04-04
Accepted to Interspeech 2022
Authors: Youngsik Eom, Yeonghyeon Lee, Ji Sub Um, Hoirin Kim
This paper proposes a transfer learning scheme for speech anti-spoofing using a pre-trained wav2vec 2.0 model augmented with a variational information bottleneck (VIB). The VIB module helps to extract generalized representations by suppressing irrelevant information from the speech embeddings. The method achieves state-of-the-art performance on the ASVspoof 2019 LA database and demonstrates improved generalization in low-resource and cross-dataset scenarios.
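A variational information bottleneck on top of the wav2vec 2.0 embedding can be sketched as a Gaussian code with a KL penalty toward a standard normal; the dimensions below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VIB(nn.Module):
    """Sample a compressed code from the SSL embedding and penalize its KL
    divergence to N(0, I), suppressing task-irrelevant information."""
    def __init__(self, in_dim=1024, z_dim=128):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
        return z, kl   # z feeds the classifier; kl is added to the training loss
```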
audio
Published: 2022-03-31
Accepted by ISCA SPSC 2022
https://www.isca-archive.org/spsc_2022/liao22_spsc.html#
Authors: Yen-Lun Liao, Xuanjun Chen, Chung-Che Wang, Jyh-Shing Roger Jang
This paper proposes an adversarial speaker distillation method for developing lightweight countermeasure (CM) models to protect Automatic Speaker Verification (ASV) systems from spoof attacks. This approach improves upon knowledge distillation by integrating generalized end-to-end (GE2E) pre-training and adversarial fine-tuning. The resulting ASD-ResNetSE model achieves competitive performance while significantly reducing model size, making it suitable for resource-constrained edge devices.
audio
Published: 2022-03-31
This paper is submitted to INTERSPEECH 2022
Authors: Petr Grinberg, Vladislav Shikhov
This paper compares various fusion methods for the SASV Challenge 2022, focusing on jointly optimizing Automatic Speaker Verification (ASV) and Countermeasure (CM) systems. It introduces novel fusion techniques, including boosting over embeddings with CatBoost, which significantly outperforms existing baseline methods. The study also explores other fusion approaches over both embeddings and scores derived from ASV and CM models.
audio
Published: 2022-03-30
Interspeech 2022
Authors: Nicolas M. Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, Konstantin Böttinger
This paper systematizes audio deepfake detection by uniformly re-implementing and evaluating twelve existing architectures, identifying key factors for success such as feature type (cqtspec/logspec outperforming melspec). They introduce a new 'in-the-wild' dataset of celebrity/politician deepfakes and authentic audio to assess generalization. The study reveals that current models perform poorly on this real-world data, indicating they are likely over-optimized for the ASVspoof benchmark.
audio
Published: 2022-03-28
5 pages, 2 figures, 2 tables, submitted to Interspeech 2022 as a conference paper
Authors: Jee-weon Jung, Hemlata Tak, Hye-jin Shim, Hee-Soo Heo, Bong-Jin Lee, Soo-Whan Chung, Ha-Jin Yu, Nicholas Evans, Tomi Kinnunen
The SASV 2022 challenge is introduced to integrate speaker verification and anti-spoofing research by incorporating spoofed trials into the speaker verification scenario, aiming for jointly optimized solutions. It provided pre-trained spoofing detection and speaker verification models and baselines for participants. The top-performing system achieved a significant reduction in Equal Error Rate (EER) from 23.83% to 0.13% compared to a conventional speaker verification system.
audio
Published: 2022-03-28
Submitted to Interspeech 2022
Authors: Nicolas M. Müller, Franziska Dieckmann, Jennifer Williams
This paper addresses the novel problem of deepfake attacker attribution in the audio domain, moving beyond mere detection to identify who created a fake. It proposes methods for creating attacker signatures using both low-level acoustic descriptors and machine learning embeddings. The research demonstrates that while speech signal features are inadequate, recurrent neural network embeddings can successfully characterize attacks from both known and unknown attackers.
audio
Published: 2022-03-21
Accepted by Speaker Odyssey 2022
Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen
This paper investigates enhancing the spoofing robustness of Automatic Speaker Verification (ASV) systems without a separate countermeasure module. It applies three unsupervised domain adaptation techniques (CORAL, CORAL+, APLDA) to optimize the probabilistic linear discriminant analysis (PLDA) back-end of an ASVspoof 2019 baseline system. The approach yields notable improvements, particularly for physical access scenarios, on both bonafide and spoofed trials.
audio
Published: 2022-03-12
Updated experiment results on the ASVspoof 2019 protocol
Authors: Zhongwei Teng, Quchen Fu, Jules White, Maria E. Powell, Douglas C. Schmidt
This paper proposes SA-SASV, an ensemble-free, end-to-end Spoof-Aggregated Spoofing-Aware Speaker Verification system. It uses multi-task classifiers optimized by various losses, including a novel spoof source-based triplet loss and a spoof aggregator, to improve the differentiation between spoofed and bona fide speech and speakers. The system achieves a new state-of-the-art SASV-EER of 4.86% on the ASVspoof 2019 LA dataset.
audio
Published: 2022-03-03
Accepted by ICASSP 2022
Authors: Juan M. Martín-Doñas, Aitor Álvarez
This paper presents Vicomtech's audio deepfake detection system for the 2022 ADD challenge, utilizing a pre-trained Wav2Vec2 model as a feature extractor combined with a downstream classifier. The approach exploits contextualized speech representations from Wav2Vec2's transformer layers and employs data augmentation to enhance robustness in challenging environments. The system demonstrates strong performance in both the ASVspoof 2021 and 2022 ADD challenges across various realistic scenarios.
audio
Published: 2022-02-28
Accepted to Speaker Odyssey Workshop 2022
Authors: Wanying Ge, Massimiliano Todisco, Nicholas Evans
This paper extends previous research by applying SHapley Additive exPlanations (SHAP) for attack analysis in deepfake and spoofing detection. The goal is to identify specific artifacts that characterize utterances generated by different attack algorithms. Using classifiers operating on raw waveforms or magnitude spectrograms, the study demonstrates that SHAP visualisations can effectively pinpoint attack-specific artifacts and reveal consistencies or differences between synthetic speech and converted voice spoofing attacks.
audio
Published: 2022-02-24
Submitted to Speaker Odyssey Workshop 2022
Authors: Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, Nicholas Evans
This paper proposes a novel approach for automatic speaker verification spoofing and deepfake detection utilizing a fine-tuned wav2vec 2.0 self-supervised learning front-end. Combined with a new self-attentive aggregation layer and data augmentation, the method significantly improves generalization to unseen attacks. It achieves the lowest equal error rates reported in the literature for both the ASVspoof 2021 Logical Access and Deepfake databases.
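To make the front-end-plus-classifier pattern behind this line of work concrete, here is a minimal sketch that mean-pools wav2vec 2.0 hidden states and feeds them to a small binary head. It assumes the HuggingFace `transformers` and `torch` packages and the `facebook/wav2vec2-xls-r-300m` checkpoint, and it omits the paper's fine-tuning schedule, self-attentive aggregation layer, and data augmentation.

```python
# Minimal sketch: a pre-trained wav2vec 2.0 model as a spoofing-detection
# front-end with a small classification head. Checkpoint name and head size
# are illustrative, not the authors' exact configuration.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLSpoofDetector(nn.Module):
    def __init__(self, ssl_name="facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)   # SSL front-end
        dim = self.ssl.config.hidden_size
        self.head = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, waveform):                      # (batch, samples) at 16 kHz
        feats = self.ssl(waveform).last_hidden_state  # (batch, frames, dim)
        pooled = feats.mean(dim=1)                    # average pooling over time
        return self.head(pooled)                      # bona fide vs. spoof logits

logits = SSLSpoofDetector()(torch.randn(2, 16000))    # two 1-second dummy inputs
```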
audio
Published: 2022-02-17
Accepted by ICASSP 2022
Authors: Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Xiaohui Zhang, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu
This paper introduces ADD 2022, the first Audio Deep Synthesis Detection challenge, addressing real-life and challenging scenarios not covered by previous shared tasks. It defines three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF), and an audio fake game (FG) involving generation and detection. The paper describes the datasets, evaluation metrics, protocols, and reports key findings from the challenge.
audio
Published: 2022-02-14
Submitted to ICASSP 2022
Authors: Haibin Wu, Heng-Cheng Kuo, Naijun Zheng, Kuo-Hsuan Hung, Hung-Yi Lee, Yu Tsao, Hsin-Min Wang, Helen Meng
This paper introduces a novel framework for partially fake audio detection by employing a question-answering (fake span discovery) strategy coupled with a self-attention mechanism. The approach trains an anti-spoofing model to predict the start and end positions of fake clips within an audio, enhancing its ability to differentiate between real and partially fake audios. This method secured second place in the partially fake audio detection track of the ADD 2022 challenge.
audio
Published: 2022-01-24
Authors: Monisankha Pal, Aditya Raikar, Ashish Panda, Sunil Kumar Kopparapu
This paper proposes a synthetic speech detection system that enhances generalization to unseen spoofing attacks using a meta-learning paradigm with prototypical loss. The system employs a Squeeze-Excitation Residual Network (SE-ResNet) architecture to learn an embedding space directly. It achieves competitive performance on ASVspoof 2019 and outperforms the ASVspoof 2021 challenge best baseline on the logical access tasks.
audio
Published: 2022-01-04
Authors: Alejandro Gomez-Alanis, Jose A. Gonzalez-Lopez, Antonio M. Peinado
This paper investigates the robustness of full voice biometrics systems (ASV + PAD) against adversarial spoofing attacks, which aim to compromise their security. It proposes a novel Adversarial Biometrics Transformation Network (ABTN) designed to generate adversarial spoofing attacks that fool the Presentation Attack Detection (PAD) system without being detected by the Automatic Speaker Verification (ASV) system. Experiments on the ASVspoof 2019 corpus demonstrate that the ABTN significantly outperforms existing adversarial techniques in both white-box and black-box attack scenarios.
audio
Published: 2021-12-06
Summary of study findings
Authors: Gabrielle Watson, Zahra Khanjani, Vandana P. Janeja
This study assesses audio deepfake perceptions among college students, investigating how their background and major influence their ability to discern AI-generated audio. It analyzes perception based on grade level, grammar complexity, audio length, prior deepfake knowledge, and political content. A key finding is that political connotations in audio clips significantly impact whether listeners perceive them as real or fake.
audio
Published: 2021-11-28
Abbreviated version of a longer survey under review
Authors: Zahra Khanjani, Gabrielle Watson, Vandana P. Janeja
This survey paper addresses the gap in existing literature by focusing specifically on audio deepfakes, which are often overlooked in surveys predominantly covering video and image deepfakes. It critically analyzes and synthesizes research on audio deepfake generation and detection methods from 2016 to 2020, providing a comprehensive overview of different deepfake categories, their creation, and detection trends. The paper highlights the necessity for increased research in audio deepfakes, particularly concerning robust detection methods.
audio
Published: 2021-11-15
V3: added sub-band analysis, submitted to ISCA Odyssey2022; V2: added min tDCF results on 2019...
Authors: Xin Wang, Junichi Yamagishi
This paper investigates using pre-trained self-supervised speech models as front ends for speech spoofing countermeasures (CMs). It explores different back-end architectures, the benefits of fine-tuning the front end, and the performance of various self-supervised models. The study demonstrates that fine-tuning a well-chosen pre-trained self-supervised front end significantly improves spoofing detection generalizability across diverse ASVspoof datasets.
audio
Published: 2021-11-08
Accepted to IEEE ICASSP 2022
Authors: Hemlata Tak, Madhu Kamble, Jose Patino, Massimiliano Todisco, Nicholas Evans
This paper introduces RawBoost, a novel data boosting and augmentation method designed for raw waveform inputs in automatic speaker verification anti-spoofing. RawBoost simulates nuisance variability like encoding, transmission, and distortion using a combination of linear/non-linear convolutive noise, impulsive signal-dependent noise, and stationary signal-independent noise. Experiments on the ASVspoof 2021 logical access database demonstrate that RawBoost significantly enhances the performance of a state-of-the-art raw end-to-end baseline system without requiring external data or model-level interventions.
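The stationary, signal-independent component of this kind of augmentation can be sketched in a few lines. The snippet below adds white noise at a random SNR; it covers only one of RawBoost's three noise types, and the SNR range is illustrative rather than the paper's.

```python
# Minimal sketch of one RawBoost-style ingredient: stationary,
# signal-independent additive noise at a randomly drawn SNR.
import numpy as np

def add_stationary_noise(x, snr_db_range=(10, 40), rng=None):
    if rng is None:
        rng = np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    noise = rng.standard_normal(len(x))
    p_sig = np.mean(x ** 2) + 1e-12
    p_noise = np.mean(noise ** 2)
    noise *= np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))  # match target SNR
    return x + noise

augmented = add_stationary_noise(np.random.randn(16000).astype(np.float32))
```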
audio
Published: 2021-11-04
Accepted to NeurIPS 2021 (Benchmark and Dataset Track); Code:...
Authors: Joel Frank, Lea Schönherr
This paper introduces WaveFake, a novel dataset comprising approximately 196 hours of generated audio from ten sample sets using six different state-of-the-art generative network architectures across two languages. It aims to address the lack of research in audio deepfake detection by providing a comprehensive dataset, an overview of audio signal processing techniques, and two baseline detection models. This resource facilitates further research and development in identifying synthetic audio signals.
audio
Published: 2021-10-20
Authors: Ariel Cohen, Inbal Rimon, Eran Aflalo, Haim Permuter
This paper conducts an in-depth study on data augmentation techniques to improve synthetic and spoofed audio detection, addressing challenges such as channel variability, different audio compressions, bandwidths, and unseen spoofing attacks. The authors propose compression and channel augmentation methods, a novel online SpecAverage augmentation, and an improved Log spectrogram feature design. Their approach achieves state-of-the-art performance in the ASVspoof 2021 Deep Fake (DF) category and significantly enhances results in the Logical Access (LA) category.
audio
Published: 2021-10-11
submitted to ICASSP 2022
Authors: Wei Liu, Meng Sun, Xiongwei Zhang, Hugo Van hamme, Thomas Fang Zheng
This paper proposes a multi-resolution front-end for end-to-end speech anti-spoofing, which automatically learns optimal weighted combinations of various time-frequency resolutions. The front-end uses a learnable neural network, inspired by SENet, to predict weights for features extracted at different resolutions, which are then concatenated and fed to a backend classifier. A refinement step by pruning low-importance resolutions is also introduced to reduce complexity and improve performance.
audio
Published: 2021-10-07
Accepted to ICASSP 2022
Authors: Wanying Ge, Jose Patino, Massimiliano Todisco, Nicholas Evans
This paper explores the use of SHapley Additive exPlanations (SHAP) to provide insights into the behavior of deep learning models for audio spoofing and deepfake detection. It demonstrates how SHAP can reveal unexpected classifier attention to specific audio segments or spectral components, and highlight differences in how competing models make decisions. The work aims to foster more trustworthy and explainable artificial intelligence in spoofing detection by making black-box models more transparent.
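As a hypothetical illustration of this explanation step, the sketch below computes SHAP values for a toy spectrogram CNN with shap's GradientExplainer; the toy network and dummy tensors stand in for the paper's trained countermeasures and real data.

```python
# Minimal sketch, assuming the `shap` and `torch` packages: SHAP values for a
# toy spectrogram classifier. The CNN and tensors are placeholders, not the
# paper's models or data.
import torch
import torch.nn as nn
import shap

cnn = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
background = torch.randn(16, 1, 80, 100)           # reference spectrograms
explainer = shap.GradientExplainer(cnn, background)
shap_values = explainer.shap_values(torch.randn(1, 1, 80, 100))
# shap_values indicates which time-frequency regions drive each class score
```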
audio
Published: 2021-09-06
Authors: Zhongwei Teng, Quchen Fu, Jules White, Maria Powell, Douglas C. Schmidt
This paper introduces an Auxiliary Rawnet (ARNet) model to enhance audio deepfake detection by complementing traditional handcrafted features with features learned directly from raw waveforms. The ARNet approach aims to improve accuracy while maintaining a relatively low computational cost. Experimental results on the ASVspoof 2019 dataset demonstrate that this lightweight waveform encoder effectively boosts the performance of handcrafted-feature-based models.
audio
Published: 2021-09-06
Authors: Quchen Fu, Zhongwei Teng, Jules White, Maria Powell, Douglas C. Schmidt
This paper proposes FastAudio, a learnable audio front-end designed for spoof speech detection. It replaces traditional fixed filterbanks with a learnable layer that can adapt to anti-spoofing tasks through joint training with downstream back-ends. FastAudio achieves a significant performance improvement on the ASVspoof 2019 dataset compared to fixed and other learnable front-ends.
audio
Published: 2021-09-05
Authors: Amir Mohammad Rostami, Mohammad Mehdi Homayounpour, Ahmad Nickabadi
This paper introduces the Efficient Attention Branch Network (EABN) architecture with a combined loss function to improve generalization in Automatic Speaker Verification (ASV) spoof detection. The EABN utilizes attention and perception branches, employing EfficientNet-A0 or SE-Res2Net50, and a novel combined loss that includes Triplet Center Loss. This approach achieves state-of-the-art results on the ASVspoof 2019 dataset for both logical and physical access scenarios, particularly with EfficientNet-A0 requiring fewer parameters.
audio
Published: 2021-09-01
Accepted to the ASVspoof 2021 Workshop
Authors: Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, Héctor Delgado
ASVspoof 2021 is the fourth challenge edition focusing on spoofing and deepfake speech detection to protect automatic speaker verification systems. It introduced a new deepfake speech detection task alongside updated logical and physical access tasks. The challenge provided new evaluation databases, evaluation metrics, and four baseline systems, demonstrating significant progress despite increased difficulty with channel and compression variability and a lack of matched training data.
audio
Published: 2021-09-01
http://www.asvspoof.org
Authors: Héctor Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, Junichi Yamagishi
The ASVspoof 2021 challenge evaluation plan is presented, focusing on developing countermeasures against spoofed and deepfake speech. It defines three distinct tasks: Logical Access (LA) for TTS/VC attacks with channel variability, Physical Access (PA) for real replay attacks, and a new Speech Deepfake (DF) task for detecting compressed deepfake audio. The document outlines the technical details including data, metrics, baselines, and evaluation rules to promote the development of robust and generalized anti-spoofing systems.
audio
Published: 2021-09-01
Submitted to the symposium of the ISCA Security & Privacy in Speech Communications (SPSC)...
Authors: Jean-Francois Bonastre, Hector Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Paul-Gauthier Noe, Jose Patino, Md Sahidullah, Brij Mohan Lal Srivastava, Massimiliano Todisco, Natalia Tomashenko, Emmanuel Vincent, Xin Wang, Junichi Yamagishi
This paper provides a high-level overview of benchmarking methodologies for security and privacy in voice biometrics. It describes the ASVspoof challenge, which focuses on developing countermeasures against spoofing attacks, and the VoicePrivacy initiative, which promotes research in speech anonymisation for privacy preservation. The work aims to foster multidisciplinary collaboration and catalyze research efforts in these critical areas of speech technology.
audio
Published: 2021-09-01
Authors: Junxiao Xue, Hao Zhou, Yabo Wang
This paper introduces a novel physiological-physical feature fusion method for automatic voice spoofing detection. The approach extracts physiological features from speech using a pre-trained convolutional neural network and physical features using SE-DenseNet or SE-Res2Net, then integrates them for classification. Experiments on the ASVspoof 2019 dataset demonstrate the model's effectiveness, showing significant improvements in tandem decision cost function (t-DCF) and equal error rate (EER) across both logical and physical access scenarios.
audio
Published: 2021-08-02
Authors: Vanessa Barnekow, Dominik Binder, Niclas Kromrey, Pascal Munaretto, Andreas Schaad, Felix Schmieder
This paper investigates the creation and detection of German voice deepfakes, analyzing the effort required to synthesize convincing voices with limited resources. It demonstrates that realistic deepfakes can be created with a few hours of audio data, and a user study reveals human difficulty in distinguishing them. The work also proposes and evaluates a machine learning technique for detecting these synthetic voices.
audio
Published: 2021-07-29
Submitted to ASVspoof 2021 Workshop
Authors: Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi
This paper establishes multi-task learning benchmarks for simultaneous segmental and utterance-level spoof detection in the PartialSpoof database. It introduces SELCNN, a Light Convolutional Neural Network enhanced with Squeeze-and-Excitation blocks, combined with Bidirectional LSTMs as the base model. The study demonstrates that multi-task learning, particularly with a binary-branch architecture and warm-up training strategies, significantly improves performance over single-task models for both detection levels.
audio
Published: 2021-07-27
Accepted in ASVspoof 2021 Workshop
Authors: Hemlata Tak, Jee-weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, Nicholas Evans
This paper introduces an end-to-end Spectro-Temporal Graph Attention Network (RawGAT-ST) for speaker verification anti-spoofing and speech deepfake detection. The model automatically learns discriminative spectro-temporal relationships from raw waveform inputs through the fusion of spectral and temporal sub-graphs. It achieves an impressive Equal Error Rate (EER) of 1.06% on the ASVspoof 2019 logical access database, setting a new state-of-the-art for single systems.
audio
Published: 2021-07-26
Accepted to ASVspoof 2021 Workshop
Authors: Wanying Ge, Jose Patino, Massimiliano Todisco, Nicholas Evans
This paper introduces Raw PC-DARTS, an end-to-end differentiable architecture search method for speech deepfake and spoofing detection. The approach automatically learns the deep network architecture while jointly optimizing all network components and parameters, including a first convolutional layer that operates directly on raw audio signals. It demonstrates that a fully learned system can achieve competitive performance with state-of-the-art hand-crafted solutions.
audio
Published: 2021-07-26
To appear in Proc. ASVspoof 2021 Workshop
Authors: Xinhui Chen, You Zhang, Ge Zhu, Zhiyao Duan
This paper presents the UR-AIR system for the ASVspoof 2021 Challenge, focusing on channel-robust synthetic speech detection for logical access (LA) and speech deepfake (DF) tasks. The system addresses channel variability by augmenting datasets with an acoustic simulator applying various codecs and impulse responses. It utilizes an ECAPA-TDNN backbone combined with one-class learning and channel-robust training strategies to learn channel-invariant speech representations.
audio
Published: 2021-07-20
Published at ACM Multimedia 2022, Workshop DDAM: First International Workshop on Deepfake...
Authors: Nicolas M. Müller, Karla Pizzi, Jennifer Williams
This paper investigates human perception of audio deepfakes by comparing human detection capabilities against a state-of-the-art AI algorithm. Utilizing a web-based game, 472 participants distinguished between real and fake audio samples over 14,912 rounds. The study reveals that both humans and AI share similar strengths and weaknesses in detection, contrasting with AI's superhuman performance in other domains, and identifies factors influencing human success such as native language and age.
audio
Published: 2021-07-19
Accepted to INTERSPEECH 2021
Authors: Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng
This paper introduces a novel Channel-wise Gated Res2Net (CG-Res2Net) architecture to improve the generalizability of anti-spoofing systems against unseen synthetic speech attacks. It modifies the existing Res2Net by integrating a channel-wise gating mechanism within its residual-like connections, dynamically suppressing less relevant channels. The proposed CG-Res2Net significantly outperforms Res2Net and other state-of-the-art single systems on the ASVspoof 2019 logical access (LA) evaluation set, especially for difficult unseen attacks.
audio
Published: 2021-06-23
ASVspoof 2021 Workshop
Authors: Nicolas M. Müller, Franziska Dieckmann, Pavel Czempin, Roman Canals, Konstantin Böttinger, Jennifer Williams
This paper reveals a significant data artifact in the ASVspoof 2019/2021 challenge datasets: bonafide instances exhibit significantly longer leading and trailing silences than spoofed ones. The authors demonstrate that models can achieve high accuracy (up to 85%, with an EER of 15.1%) by learning solely from silence duration, suggesting that previous anti-spoofing systems may have inadvertently exploited this artifact. Trimming silence during pre-processing severely degrades the performance of established models, highlighting a critical flaw in the dataset's representativeness for genuine deepfake detection.
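The artifact itself is easy to measure; a minimal sketch using librosa's energy-based trimming (the paper's own silence estimation may differ) is:

```python
# Minimal sketch: leading/trailing silence duration, the feature the paper
# shows is enough to separate bonafide from spoofed ASVspoof utterances.
import librosa

def silence_durations(path, top_db=30):
    y, sr = librosa.load(path, sr=None)
    _, (start, end) = librosa.effects.trim(y, top_db=top_db)  # non-silent interval
    return start / sr, (len(y) - end) / sr   # leading, trailing silence in seconds
```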
audio
Published: 2021-06-11
Accepted to Interspeech 2021. Example code available at...
Authors: Tomi Kinnunen, Andreas Nautsch, Md Sahidullah, Nicholas Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee
This paper introduces a method to visualize classifier adjacency relations in a 2D space by computing distances between binary classifiers based on their detection scores on a common dataset. The approach utilizes Kendall's τ rank correlation to define these distances, which are then mapped using classical multidimensional scaling (MDS) to facilitate visual comparison and complement traditional ROC/DET analyses. The method is demonstrated through case studies in automatic speaker verification (ASV) and voice anti-spoofing.
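A hypothetical sketch of the pipeline, using sklearn's metric MDS in place of classical MDS and dummy scores, looks like this:

```python
# Minimal sketch: Kendall's tau rank correlation between classifiers' scores,
# turned into distances and embedded in 2-D. Scores are random placeholders.
import numpy as np
from scipy.stats import kendalltau
from sklearn.manifold import MDS

scores = np.random.rand(5, 200)               # 5 classifiers x 200 common trials
n = scores.shape[0]
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        tau, _ = kendalltau(scores[i], scores[j])
        dist[i, j] = 1.0 - tau                # rank disagreement as a distance

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)   # 2-D classifier map
```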
audio
Published: 2021-04-08
Camera ready version. Accepted by INTERSPEECH 2021
Authors: Yang Gao, Tyler Vuong, Mahsa Elyasi, Gaurav Bharaj, Rita Singh
This paper proposes a novel long-range spectro-temporal modulation feature, derived from applying 2D DCT over log-Mel spectrograms, for enhanced audio deepfake detection. This feature, combined with a CNN-based baseline and incorporating spectrum augmentation and feature normalization, achieved state-of-the-art performance on the ASVspoof 2019 challenge. The system outperformed previously top single systems and demonstrated robust generalization across external datasets.
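As a rough illustration of the feature (not the paper's exact configuration), a 2-D DCT over a log-Mel spectrogram can be computed as follows:

```python
# Minimal sketch: long-range spectro-temporal modulation features via a 2-D
# DCT of a log-Mel spectrogram. The dummy tone and cut-offs are illustrative.
import numpy as np
import librosa
from scipy.fft import dctn

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)  # dummy 1-s tone
log_mel = np.log(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80) + 1e-6)
modulation = dctn(log_mel, norm="ortho")      # 2-D DCT over (mel, time)
feature = modulation[:20, :20]                # keep low-order modulation coefficients
```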
audio
Published: 2021-04-08
Submitted to INTERSPEECH 2021
Authors: Hemlata Tak, Jee-weon Jung, Jose Patino, Massimiliano Todisco, Nicholas Evans
This paper introduces the use of Graph Attention Networks (GATs) to enhance anti-spoofing performance in automatic speaker verification. GATs are utilized to model the relationships between spectral sub-bands or temporal segments, addressing a limitation of previous self-attention mechanisms. The proposed GAT-based model, which processes high-level representations from a ResNet, demonstrates significant improvements in spoofing detection.
audio
Published: 2021-04-08
accepted by Interspeech 2021
Authors: Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu
This paper introduces the Half-Truth Audio Detection (HAD) dataset, addressing the critical yet overlooked problem of detecting small fake audio clips hidden within real speech. The HAD dataset enables both utterance-level fake audio detection and precise localization of manipulated regions, demonstrating that partially fake audio poses a significantly greater challenge for detection than fully fake audio. This dataset is publicly available to foster research in this domain.
audio
Published: 2021-04-07
Accepted to INTERSPEECH 2021
Authors: Wanying Ge, Michele Panariello, Jose Patino, Massimiliano Todisco, Nicholas Evans
This paper presents the first successful application of Differentiable Architecture Search (DARTS), specifically Partially-Connected DARTS (PC-DARTS), for deepfake and spoofing detection in audio. PC-DARTS efficiently learns complex neural architectures composed of convolutional operations and residual blocks with minimal human effort. The resulting automatically learned networks achieve competitive performance with state-of-the-art systems while being significantly less complex, with some models having 85% fewer parameters than competitors.
audio
Published: 2021-03-21
Interspeech 2021
Authors: Xin Wang, Junichi Yamagishi
This paper conducts a comparative study on various neural spoofing countermeasures for synthetic speech detection on the ASVspoof 2019 logical access task. It evaluates different neural network architectures, loss functions, and front-end features, emphasizing the significant impact of random initialization on model performance. The study identifies promising techniques, including average pooling for varied-length inputs and a new hyper-parameter-free P2SGrad-based loss function, which achieved a state-of-the-art single model EER of 1.92%.
audio
Published: 2021-02-12
5 pages, Accepted for publication in International Conference on Acoustics, Speech, and Signal...
Authors: Rohan Kumar Das, Jichen Yang, Haizhou Li
This paper proposes a novel data augmentation technique utilizing a-law and mu-law based signal companding to improve the detection of logical access attacks against automatic speaker verification (ASV) systems. The method aims to enhance the robustness of spoofing countermeasures, particularly against unknown attack types derived from advanced voice conversion and text-to-speech technologies. Experiments show that this companding-based augmentation outperforms traditional data augmentation and state-of-the-art countermeasures in handling unseen logical access attacks.
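One half of the scheme, mu-law companding followed by expansion, can be sketched as below; the constants follow the standard G.711-style definition, and the paper's exact pipeline may differ.

```python
# Minimal sketch: mu-law compand-then-expand round trip used as a
# data-augmentation view of the waveform.
import numpy as np

def mu_law_roundtrip(x, mu=255.0):
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    expanded = np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu
    return expanded

augmented = mu_law_roundtrip(np.random.uniform(-1, 1, 16000))
```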
audio
Published: 2021-02-11
IEEE Transactions on Biometrics, Behavior, and Identity Science 2021
Authors: Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi Kinnunen, Ville Vestman, Massimiliano Todisco, Héctor Delgado, Md Sahidullah, Junichi Yamagishi, Kong Aik Lee
This paper describes the ASVspoof 2019 challenge, analyzing the results and top-performing single and ensemble systems submitted by 62 teams for logical access (speech synthesis, voice conversion) and physical access (replay attacks) scenarios. It highlights that all top systems substantially outperformed baselines, with ensemble methods showing particular effectiveness for logical access. Deeper analysis reveals performance is often dominated by specific spoofing attacks or acoustic environments, and a significant gap exists between simulated and real replay data performance.
audio
Published: 2020-12-15
Authors: Shentong Mo, Haofan Wang, Pinxu Ren, Ta-Chung Chi
This paper investigates countermeasures for Automatic Speaker Verification (ASV) spoofing detection, focusing on developing robust and efficient methods by following the ASVspoof 2019 competition setup. The goal is to distinguish between real and spoofed audio inputs, employing metrics like EER and t-DCF for evaluation.
audio
Published: 2020-12-06
12 pages, 6 figures, codes used in the experimental section can be found at...
Authors: Yuanjun Zhao, Roberto Togneri, Victor Sreeram
This paper proposes a spoofing-robust automatic speaker verification (SR-ASV) system utilizing a multi-task learning architecture. The deep learning model is jointly trained with time-frequency representations from utterances to simultaneously perform speaker verification and spoofing detection. The approach demonstrates substantial performance improvements over existing state-of-the-art systems on the ASVspoof 2017 and 2019 corpora under diverse spoofing conditions.
audio
Published: 2020-11-07
6 pages excluding references. Paper accepted by IEEE Spoken Language Technology (SLT) 2021
Authors: Yang Gao, Jiachen Lian, Bhiksha Raj, Rita Singh
This paper compares human impersonation and machine-generated deepfake speech attacks on black-box and white-box Automatic Speaker Verification (ASV) systems. It proposes and evaluates speech-production-related features, such as fundamental frequency sequence-related entropy, spectral envelope, and aperiodic parameters, as robust countermeasures for deepfake detection. The study hypothesizes that machines cannot emulate the fine-level intricacies of human speech production, which these features aim to capture.
audio
Published: 2020-11-02
Accepted to ICASSP 2021
Authors: Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, Anthony Larcher
This paper presents the first application of RawNet2, a deep neural network that ingests raw audio, for anti-spoofing in automatic speaker verification. It describes specific modifications to the RawNet2 architecture to adapt it for spoofing detection. The proposed system shows strong performance, particularly for the challenging A17 voice conversion attack, and achieves second-best results when fused with baseline countermeasures for the full ASVspoof 2019 logical access condition.
audio
Published: 2020-10-28
Accepted to ICASSP2021
Authors: Xu Li, Na Li, Chao Weng, Xunying Liu, Dan Su, Dong Yu, Helen Meng
This work proposes leveraging the Res2Net architecture for replay and synthetic speech detection to improve generalizability to unseen spoofing attacks. Res2Net modifies the ResNet block to enable multiple feature scales, which significantly enhances the anti-spoofing countermeasure's performance and reduces model size. Experimental results demonstrate Res2Net's consistent outperformance over ResNet34 and ResNet50 on the ASVspoof 2019 corpus, particularly when integrated with the Squeeze-and-Excitation (SE) block and using Constant-Q Transform (CQT) acoustic features.
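The multi-scale idea can be sketched as a simplified block (without the SE module and the paper's full wiring): channels are split into groups, and each group's 3x3 convolution also sees the previous group's output.

```python
# Minimal, simplified Res2Net-style block: hierarchical per-group 3x3 convs
# give multiple receptive-field scales inside one residual block.
import torch
import torch.nn as nn

class Res2Block(nn.Module):
    def __init__(self, channels=64, scales=4):
        super().__init__()
        self.scales = scales
        width = channels // scales
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, 3, padding=1) for _ in range(scales - 1)])

    def forward(self, x):
        chunks = list(torch.chunk(x, self.scales, dim=1))
        out, prev = [chunks[0]], None             # first group passes through
        for i, conv in enumerate(self.convs, start=1):
            inp = chunks[i] if prev is None else chunks[i] + prev
            prev = torch.relu(conv(inp))
            out.append(prev)
        return torch.cat(out, dim=1) + x          # residual connection

y = Res2Block()(torch.randn(2, 64, 80, 100))
```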
audio
Published: 2020-10-27
Authors: You Zhang, Fei Jiang, Zhiyao Duan
This paper proposes an anti-spoofing system to detect unknown synthetic voice spoofing attacks (text-to-speech or voice conversion) using one-class learning. The core idea is to compact the bona fide speech representation and inject an angular margin to separate spoofing attacks in the embedding space. This system achieves an Equal Error Rate (EER) of 2.19% on the ASVspoof 2019 Challenge logical access scenario evaluation set, outperforming all existing single systems without data augmentation.
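A sketch of a one-class objective in this spirit (margins and scale are illustrative; consult the paper for the exact loss) is shown below: bona fide embeddings are pulled toward a single learned direction, while spoofed embeddings are pushed below a second, looser margin.

```python
# Minimal sketch of a one-class softmax-style loss with an angular margin.
import torch
import torch.nn.functional as F

class OneClassSoftmax(torch.nn.Module):
    def __init__(self, dim=256, m_real=0.9, m_fake=0.2, alpha=20.0):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(dim))   # bona fide direction
        self.m_real, self.m_fake, self.alpha = m_real, m_fake, alpha

    def forward(self, emb, is_fake):
        cos = F.normalize(emb, dim=1) @ F.normalize(self.w, dim=0)
        # bona fide: penalize cos below m_real; spoof: penalize cos above m_fake
        margin = torch.where(is_fake, self.m_fake - cos, cos - self.m_real)
        return F.softplus(-self.alpha * margin).mean()

loss = OneClassSoftmax()(torch.randn(8, 256),
                         torch.tensor([0, 0, 0, 0, 1, 1, 1, 1], dtype=torch.bool))
```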
audio
Published: 2020-10-19
Accepted Interspeech 2020. Video:...
Authors: Tyler Vuong, Yangyang Xia, Richard Stern
This paper proposes a deep-learning-based Voice Type Discrimination (VTD) system, named STRFNet, which incorporates an initial layer of learnable spectro-temporal receptive fields (STRFs). The system demonstrates strong performance on a new VTD database and the ASVspoof 2019 challenge's spoofing detection task. The research highlights the effectiveness of learnable STRFs in improving robustness against various noise conditions and consistently outperforming competitive baseline systems.
audio
Published: 2020-10-15
Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020
Authors: Bhusan Chettri, Emmanouil Benetos, Bob L. T. Sturm
This paper investigates how dataset artifacts in the ASVspoof 2017 benchmark dataset contribute to the apparent success of anti-spoofing systems. It details various artifacts present in the dataset and demonstrates how countermeasure models exploit them. The authors propose and validate a method using speech endpoint detection to discard non-speech segments, leading to more reliable and robust performance estimates for anti-spoofing models.
audio
Published: 2020-10-08
Authors: Lazaro J. Gonzalez-Soler, Jose Patino, Marta Gomez-Barrero, Massimiliano Todisco, Christoph Busch, Nicholas Evans
This paper proposes a texture-based presentation attack detection (PAD) approach for automatic speaker verification (ASV) systems. It transforms speech into spectrogram images, applies texture descriptors, and encodes them into a common Fisher Vector feature space using a generative Gaussian Mixture Model. The method demonstrates strong performance in detecting both known and unknown audio deepfakes.
audio
Published: 2020-09-21
Accepted for publication in Interspeech 2020
Authors: Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li
This paper introduces a novel feature genuinization method for detecting synthetic speech attacks, addressing the challenge of unseen attack types that degrade existing countermeasure performance. The approach leverages the consistent distribution of genuine speech by training a CNN-based transformer using only genuine speech characteristics. This genuinization transformer, combined with a light CNN classifier, effectively amplifies the discriminative features between genuine and synthetic speech.
audio
Published: 2020-08-20
Odyssey 2020 (The Speaker and Language Recognition Workshop)
Authors: Qiongqiong Wang, Kong Aik Lee, Takafumi Koshinaka
This paper introduces a simple yet effective method for anti-spoofing in automatic speaker verification (ASV) using multi-resolution feature maps with convolutional neural networks (CNNs). It addresses the issue of single spectrograms providing insufficient discriminative representations due to trade-offs in time and frequency resolutions. The proposed approach stacks multiple spectrograms, extracted with varying window lengths, and feeds them as multi-channel input to a CNN, improving both resolutions while maintaining low computational cost.
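A minimal sketch of the multi-channel input construction (window lengths and the shared hop are illustrative):

```python
# Minimal sketch: log-Mel spectrograms at several window lengths, stacked as
# channels of a single CNN input. A common hop keeps the frame grids aligned.
import numpy as np
import librosa

def multires_input(y, sr=16000, n_ffts=(256, 512, 1024), n_mels=80, hop=160):
    channels = [np.log(librosa.feature.melspectrogram(
                    y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels) + 1e-6)
                for n_fft in n_ffts]
    return np.stack(channels, axis=0)          # (3, n_mels, frames)

x = multires_input(np.random.randn(16000).astype(np.float32))
```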
audio
Published: 2020-08-08
Authors: Rahul T P, P R Aravind, Ranjith C, Usamath Nechiyil, Nandakumar Paramparambath
This paper proposes a deep convolutional neural network-based speech classifier for detecting audio spoofing attacks in Automatic Speaker Verification systems. The methodology leverages Mel-spectrograms as acoustic time-frequency representations and an adapted ResNet-34 architecture, utilizing transfer learning. The system achieved competitive Equal Error Rates on the ASVspoof 2019 dataset for both logical and physical access scenarios.
audio
Published: 2020-07-12
Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)
Authors: Tomi Kinnunen, Héctor Delgado, Nicholas Evans, Kong Aik Lee, Ville Vestman, Andreas Nautsch, Massimiliano Todisco, Xin Wang, Md Sahidullah, Junichi Yamagishi, Douglas A. Reynolds
This paper extends the tandem detection cost function (t-DCF) as a risk-based metric for assessing spoofing countermeasures (CMs) when deployed in tandem with automatic speaker verification (ASV) systems. It presents a simplified version of the t-DCF, analyzes a special case for a fixed ASV system, and provides new insights and empirical analyses using the ASVspoof 2019 database. The work aims to foster closer collaboration between anti-spoofing and ASV research communities by promoting a more application-relevant assessment approach than traditional Equal Error Rate (EER).
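For a fixed ASV system, the simplified normalized form discussed in the paper reduces, up to constants, to a weighted sum of the countermeasure's miss and false-alarm rates; the expression below is a sketch from that description, with C_0, C_1, C_2 absorbing the ASV error rates, priors, and costs, and s the CM decision threshold.

```latex
\mathrm{t\text{-}DCF}(s) \;=\; C_0 \;+\; C_1\, P^{\mathrm{cm}}_{\mathrm{miss}}(s)
\;+\; C_2\, P^{\mathrm{cm}}_{\mathrm{fa}}(s)
```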
audio
Published: 2020-06-25
The 25th International Conference on Pattern Recognition (ICPR2020)
Authors: Yongqiang Dou, Haocheng Yang, Maolin Yang, Yanyan Xu, Dengfeng Ke
This paper proposes D3M, a novel method for replay attack detection that addresses data discrepancy by introducing a balanced focal loss function. This loss dynamically scales sample contributions during training, prioritizing indistinguishable samples. The approach also integrates a fusion of complementary magnitude-based (STFT-gram, CQT-gram) and phase-based (MGD-gram) features, demonstrating superior performance on the ASVspoof2019 dataset.
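The balanced focal idea can be sketched in a few lines (alpha and gamma values are illustrative, and the paper's exact weighting may differ): hard, misclassified samples dominate the loss, while a per-class weight rebalances bona fide versus spoof.

```python
# Minimal sketch of a class-balanced focal loss for spoof/bonafide training.
import torch
import torch.nn.functional as F

def balanced_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                                 # prob. of the true class
    alpha_t = alpha * targets.float() + (1.0 - alpha) * (1 - targets).float()
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()  # down-weight easy samples

loss = balanced_focal_loss(torch.randn(8, 2), torch.randint(0, 2, (8,)))
```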
audio
Published: 2020-06-10
9 pages, 2 figures, 6 tables, Published in MDPI Applied Sciences (SCIE)
Authors: Hye-jin Shim, Jee-weon Jung, Ju-ho Kim, Seung-bin Kim, Ha-Jin Yu
This paper proposes and evaluates two approaches for integrating speaker verification (SV) and presentation attack detection (PAD) systems: an end-to-end monolithic approach and a back-end modular approach. The authors hypothesize that SV and PAD require different discriminative information and demonstrate that the modular approach, which separately processes SV embeddings and PAD predictions, is more effective. The proposed back-end modular system achieves a significant improvement in integrated replay spoofing-aware speaker verification.
audio
Published: 2020-06-05
Authors: Haibin Wu, Andy T. Liu, Hung-yi Lee
This paper proposes using Mockingjay, a self-supervised learning model, to defend anti-spoofing models against black-box adversarial attacks. The approach leverages high-level representations extracted by Mockingjay to prevent the transferability of adversarial examples. A layerwise noise to signal ratio (LNSR) is also introduced to quantify the effectiveness of deep models in countering adversarial noise.
audio
Published: 2020-05-28
Accepted by ACM MM'20
Authors: Run Wang, Felix Juefei-Xu, Yihao Huang, Qing Guo, Xiaofei Xie, Lei Ma, Yang Liu
This paper introduces DeepSonar, a novel approach for detecting AI-synthesized fake voices by monitoring the layer-wise neuron behaviors of a speaker recognition (SR) deep neural network. The method leverages neuron activation patterns to capture subtle differences between real and fake voices, providing a cleaner signal for a binary classifier than raw audio inputs. Experiments across three datasets, including commercial products and different languages, demonstrate high detection accuracy (98.1% average) and robustness against manipulation attacks like voice conversion and additive real-world noises.
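The monitoring idea can be illustrated with forward hooks on a toy network standing in for the speaker-recognition model; the statistics collected per layer become the input to a separate real/fake classifier.

```python
# Minimal sketch: layer-wise activation statistics collected via forward
# hooks; the tiny MLP is a placeholder for the paper's SR network.
import torch
import torch.nn as nn

sr_model = nn.Sequential(nn.Linear(80, 128), nn.ReLU(),
                         nn.Linear(128, 64), nn.ReLU())
stats = []
hooks = [m.register_forward_hook(lambda _m, _i, out: stats.append(out.mean().item()))
         for m in sr_model if isinstance(m, nn.Linear)]

_ = sr_model(torch.randn(1, 80))          # one (dummy) utterance's features
neuron_behavior = torch.tensor(stats)     # per-layer statistics -> binary classifier
for h in hooks:
    h.remove()
```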
audio
Published: 2020-05-20
Submitted to Interspeech 2020 conference, 5 pages
Authors: Hemlata Tak, Jose Patino, Andreas Nautsch, Nicholas Evans, Massimiliano Todisco
This paper proposes a spoofing attack detection system that leverages a bank of simple sub-band classifiers, each tuned to different spoofing attacks, combined via non-linear score fusion. This approach demonstrates superior performance, outperforming most sophisticated ensemble solutions relying on complex neural networks. The method achieved competitive results in the ASVspoof 2019 challenge, surpassing all but two of the 48 submitted systems for the logical access condition.
audio
Published: 2020-04-14
Accepted to Speaker Odyssey (The Speaker and Language Recognition Workshop), 2020, 8 pages
Authors: Hemlata Tak, Jose Patino, Andreas Nautsch, Nicholas Evans, Massimiliano Todisco
This paper provides an explainability study for Constant Q Cepstral Coefficients (CQCCs) as a spoofing countermeasure for automatic speaker verification, investigating why they are effective against certain attacks but not others. The research reveals that the efficacy of CQCCs, compared to traditional Linear Frequency Cepstral Coefficients (LFCCs), stems from their attention to specific sub-band components of the spectrum where different spoofing artifacts reside. By analyzing countermeasure performance across various frequency bands, the authors shed light on signal or spectrum level artifacts that distinguish spoofed speech from genuine speech.
audio
Published: 2020-04-04
Accepted to the Speaker Odyssey (The Speaker and Language Recognition Workshop) 2020 conference. 8 pages
Authors: Bhusan Chettri, Tomi Kinnunen, Emmanouil Benetos
This paper systematically investigates the impact of different frequency subbands on replay spoofing detection using a novel joint subband modeling framework. This framework employs multiple sub-networks to learn band-specific features, which are then combined and classified. The study reveals that the most discriminative information for replay spoofing detection is not uniformly distributed across the spectrum, and these findings vary across datasets.
audio
Published: 2020-03-21
Accepted to Computer Speech and Language Special issue on Advances in Automatic Speaker...
Authors: Bhusan Chettri, Tomi Kinnunen, Emmanouil Benetos
This paper proposes a deep generative approach using Variational Autoencoders (VAEs) for replay attack detection in Automatic Speaker Verification (ASV). It introduces three VAE variants: independent VAEs for each class, a conditional VAE (C-VAE) with class label injection, and an auxiliary classifier-augmented C-VAE (AC-VAE). The C-VAE is shown to significantly improve detection performance over simpler VAEs and baseline GMMs.
audio
Published: 2020-03-06
Accepted by ICASSP 2020
Authors: Haibin Wu, Songxiang Liu, Helen Meng, Hung-yi Lee
This paper addresses the vulnerability of automatic speaker verification (ASV) spoofing countermeasure models to adversarial examples. The authors propose and evaluate two defense methods: a passive spatial smoothing technique and a proactive adversarial training approach. Experimental results demonstrate that both methods effectively enhance the robustness of ASV spoofing countermeasure models against adversarial attacks.
audio
Published: 2020-02-28
accepted at Speaker Odyssey 2020
Authors: Jennifer Williams, Joanna Rownicka, Pilar Oplustil, Simon King
This paper investigates automatic quality estimation for multi-speaker Text-to-Speech (TTS) synthesis by training a neural network on human Mean Opinion Score (MOS) ratings. It compares eight different speech representations, including spectrogram features and x-vector embeddings, to identify which best predicts MOS. The approach aims to characterize how different speakers contribute to perceived output quality and to automatically identify speakers who consistently achieve higher or lower quality in TTS systems.
audio
Published: 2020-02-16
Submitted to INTERSPEECH 2020
Authors: Patrick von Platen, Fei Tao, Gokhan Tur
This paper proposes a Multi-Task Learning (MTL) approach utilizing Siamese Neural Networks (SNN) to enhance replay attack (RA) detection systems by improving their generalizability and discriminability. It introduces SNN to optimize a Residual Neural Network (ResNet) architecture, demonstrating significant performance gains over a cross-entropy baseline. The method is further improved by incorporating an additional reconstruction loss and replacing global average pooling with global average and variance pooling.
audio
Published: 2020-01-31
Authors: Jee-weon Jung, Hye-jin Shim, Hee-Soo Heo, Ha-Jin Yu
This study analyzes the role of various categories of subsidiary information ('Room Size', 'Reverberation', 'Speaker-to-ASV distance', 'Attacker-to-Speaker distance', 'Replay Device Quality') in replay attack spoofing detection. It investigates whether this information is inherently present in a deep neural network's code or if explicit inclusion improves performance. The research concludes that subsidiary information is not sufficiently represented in DNNs trained for binary classification, but explicit inclusion through multi-task learning can enhance performance in closed-set conditions.
audio
Published: 2019-11-05
Accepted, Computer Speech and Language. This manuscript version is made available under the...
Authors: Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-Francois Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan Zhang, Zhen-Hua Ling
The paper introduces ASVspoof 2019, a large-scale public database designed to benchmark automatic speaker verification (ASV) systems against various spoofing attacks. It is the first edition to include speech synthesis, voice conversion, and replay attacks within a single challenge, covering logical and physical access scenarios. The database aims to foster research on anti-spoofing countermeasures by providing challenging spoofed data, including samples indistinguishable from bona fide speech by humans.
audio
Published: 2019-10-29
Authors: Mohammad Adiban, Hossein Sameti, Saeedreza Shehnepoor
This paper proposes a novel replay spoofing countermeasure for Automatic Speaker Verification (ASV) systems to combat replay attacks. The approach utilizes Constant Q Cepstral Coefficient (CQCC) features, processes them through an autoencoder to capture informative and noise-aware representations, and employs a Siamese network for classification. Experiments on the ASVspoof 2019 dataset demonstrate significant improvements in Equal Error Rate (EER) and Tandem Detection Cost Function (t-DCF) over baseline systems.
audio
Published: 2019-10-22
Authors: Hye-jin Shim, Hee-Soo Heo, Jee-weon Jung, Ha-Jin Yu
This paper proposes a self-supervised pre-training framework for replay spoofing detection by learning acoustic configurations from existing speaker verification datasets. The method involves training deep neural networks to identify identical acoustic configurations (environmental factors like microphone type and ambient noise) from pairs of audio segments. This approach significantly improves performance on the ASVspoof 2019 physical access dataset, outperforming baselines by 30%.
audio
Published: 2019-10-19
Accepted for ASRU 2019
Authors: Songxiang Liu, Haibin Wu, Hung-yi Lee, Helen Meng
This paper investigates the vulnerability of high-performance automatic speaker verification (ASV) spoofing countermeasure systems under adversarial attacks. It applies both white-box and black-box adversarial attacks, using the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), to challenge models from the ASVspoof 2019 challenge. The study demonstrates that these countermeasure models are highly susceptible to both attack scenarios.
audio
Published: 2019-09-23
Presented at Interspeech 2019
Authors: Jennifer Williams, Joanna Rownicka
This paper presents a system for the ASVspoof 2019 Challenge Physical Access (PA) task, focusing on detecting speech replay attacks. The proposed countermeasure utilizes convolutional neural networks (CNNs) with a combined feature representation of x-vector attack embeddings and sub-band spectral centroid magnitude coefficients (SCMCs). The system demonstrates improved performance over challenge baselines, suggesting that x-vector attack embeddings can regularize CNN predictions for enhanced robustness.
audio
Published: 2019-09-17
6 pages, 3 figures; submitted to ICASSP 2020
Authors: Xiaohai Tian, Rohan Kumar Das, Haizhou Li
This paper proposes a black-box adversarial framework that enhances voice conversion (VC) attacks on Automatic Speaker Verification (ASV) systems. It uses the ASV system's output scores as feedback to a VC system, optimizing the converted speech to be more deceptive without needing internal ASV knowledge. Experiments demonstrate that this feedback-controlled VC significantly boosts impostor ASV scores while maintaining natural speech quality.
audio
Published: 2019-09-03
Authors: Roland Baumann, Khalid Mahmood Malik, Ali Javed, Andersen Ball, Brandon Kujawa, Hafiz Malik
This paper introduces a novel Voice Spoofing Detection Corpus (VSDC) designed to evaluate anti-spoofing methods against multi-order replay attacks. VSDC uniquely includes first- and second-order replay samples, along with bonafide audio, and is diverse in terms of recording environments, devices, and speakers. It addresses the limitations of existing datasets by focusing on multi-hop replay scenarios prevalent in Voice Controlled Devices (VCDs) within IoT environments.
audio
Published: 2019-07-13
Authors: Hossein Zeinali, Themos Stafylakis, Georgia Athanasopoulou, Johan Rohdin, Ioannis Gkinis, Lukáš Burget, Jan "Honza" Černocký
This paper details the BUT-Omilia team's submission to the ASVspoof 2019 Challenge, focusing on detecting spoofing attacks against automatic speaker verification systems. Their approach employs fused deep neural networks, with distinct architectures for physical access (PA) and logical access (LA) attacks. The PA system achieved significant performance improvements over the baseline, while the LA system showed strong performance on seen attacks but struggled with generalization to novel logical access attack types.
audio
Published: 2019-07-05
Accepted for INTERSPEECH 2019
Authors: Weicheng Cai, Haiwei Wu, Danwei Cai, Ming Li
This paper details the DKU replay detection system for the ASVspoof 2019 challenge, focusing on developing spoofing countermeasures for automatic speaker recognition. The system leverages an utterance-level deep learning framework, incorporating data augmentation, various feature representations, residual neural network classification, and score-level fusion. Their best single system utilizes a residual neural network trained on speed-perturbed group delay gram, with performance significantly improved by fusing multiple systems.
audio
Published: 2019-05-28
IEEE Access. 2019
Authors: Balamurali BT, Kin Wah Edward Lin, Simon Lui, Jer-Ming Chen, Dorien Herremans
This research investigates robust audio features for detecting replay spoofing attacks against automatic speaker verification systems, aiming to overcome the limitation of existing systems that depend on knowing the spoofing technique. The authors compare traditional audio features with those learned through an autoencoder and propose a hybrid system that combines both types of features. This approach provides a detailed methodology for setting up state-of-the-art audio feature detection, preprocessing, and postprocessing, evaluated on the ASVspoof 2017 dataset.
audio
Published: 2019-04-23
Accepted for oral presentation at Interspeech 2019, code available at...
Authors: Jee-weon Jung, Hye-jin Shim, Hee-Soo Heo, Ha-Jin Yu
This study proposes an end-to-end deep neural network (DNN) approach for replay attack detection, replacing traditional hand-crafted acoustic feature extraction. It leverages complementary high-resolution spectrograms, including phase information and power spectral density, to detect subtle characteristics of replayed speech. The system, utilizing DNNs without knowledge-based intervention, achieves promising results on the ASVspoof 2019 physical access challenge.
audio
Published: 2019-04-11
Submitted to Interspeech 2019, Graz, Austria
Authors: Galina Lavrentyeva, Sergey Novoselov, Andzhukaev Tseren, Marina Volkova, Artem Gorlanov, Alexandr Kozlov
This paper describes the Speech Technology Center (STC) antispoofing systems submitted to the ASVspoof 2019 challenge. The proposed systems enhance the Light CNN architecture with angular margin based softmax activation for robust deepfake detection across logical (speech synthesis/voice conversion) and physical (replay) access scenarios. These systems achieved competitive EERs of 1.86% and 0.54% in logical and physical access scenarios respectively, demonstrating stability against unknown attack types.
audio
Published: 2019-04-09
Proc. Interspeech 2019
Authors: Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Hector Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee
The ASVspoof 2019 challenge introduced new databases and protocols to benchmark countermeasures against advanced spoofing attacks in automatic speaker verification (ASV). It considered both logical access (LA) and physical access (PA) scenarios with synthetic, converted, and replayed speech attacks, adopting the ASV-centric tandem detection cost function (t-DCF) as the primary evaluation metric. The challenge showcased significant progress in spoofed and fake audio detection, with over half of the 63 participating teams outperforming the provided baseline countermeasures.
audio
Published: 2019-04-01
Submitted to Interspeech 2019, Graz, Austria
Authors: Cheng-I Lai, Nanxin Chen, Jesús Villalba, Najim Dehak
This paper presents ASSERT, JHU's system submission to the ASVspoof 2019 Challenge, designed for anti-spoofing against text-to-speech, voice conversion, and replay attacks. ASSERT is a deep neural network-based pipeline comprising feature engineering, DNN models (variants of squeeze-excitation and residual networks), network optimization, and system combination. The system achieved significant relative improvements over baseline systems in both sub-challenges of ASVspoof 2019.
audio
Published: 2019-01-23
Published in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing...
Authors: Dipjyoti Paul, Md Sahidullah, Goutam Saha
This paper investigates the generalization capability of spoofing countermeasures under restricted training conditions, specifically when certain attack types are excluded from the training data. It analyzes the performance using MFCCs and CQCCs features with a GMM-ML classifier on ASVspoof 2015 and BTAS 2016 corpora, including cross-corpora analysis. The study reveals varying generalization capabilities across different spoofing types and highlights the importance of both static and dynamic spectral feature coefficients for real-life detection.
audio
Published: 2019-01-04
Published as a book chapter in the Handbook of Biometric Anti-Spoofing: Presentation Attack Detection (Second Edition...
Authors: Md Sahidullah, Hector Delgado, Massimiliano Todisco, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Kong-Aik Lee
This paper provides a comprehensive review of recent advancements in voice presentation attack detection (PAD) for automatic speaker verification (ASV), focusing on the last three years. It synthesizes findings and lessons learned from the community-led ASVspoof challenges, covering developments in speech corpora, evaluation protocols, feature extraction, and classification. The authors conclude that ASV PAD remains an unsolved problem, emphasizing the need for generalized solutions capable of detecting diverse and previously unseen spoofing attacks.
audio
Published: 2018-10-31
Submitted to ICASSP 2019
Authors: Cheng-I Lai, Alberto Abad, Korin Richmond, Junichi Yamagishi, Najim Dehak, Simon King
This paper introduces the Attentive Filtering Network (AFN) for detecting audio replay attacks in automatic speaker verification systems. AFN employs an attention-based filtering mechanism to enhance feature representations in both frequency and time domains, combined with a Dilated Residual Network (DRN) classifier. The proposed system achieves competitive performance on the ASVspoof 2017 Version 2.0 dataset, with visualizable attention heatmaps demonstrating its feature enhancement capabilities.
audio
Published: 2018-09-12
Accepted at the IEEE International Workshop on Information Forensics and Security (WIFS), 2018
Authors: Fuming Fang, Junichi Yamagishi, Isao Echizen, Md Sahidullah, Tomi Kinnunen
This paper proposes a method to deceive playback spoofing countermeasures (CMs) for automatic speaker verification (ASV) systems by transforming the acoustic characteristics of played-back speech. The authors achieve this by enhancing 'stolen speech' from a target speaker with a Speech Enhancement Generative Adversarial Network (SEGAN) before playback. Experimental results demonstrate that this 'enhanced stolen speech' method significantly increases equal error rates (EERs) for baseline and CNN-based playback detection models, and degrades the performance of a GMM-UBM-based ASV system.
audio
Published: 2018-05-22
6 pages
Authors: Bhusan Chettri, Saumitra Mishra, Bob L. Sturm, Emmanouil Benetos
This paper studies the performance of Convolutional Neural Networks (CNNs) in an end-to-end setting for replay attack detection within the ASVspoof 2017 challenge. The authors find that existing CNN architectures exhibit poor generalization on the evaluation dataset compared to development data. They propose a compact CNN architecture and investigate factors affecting generalization, highlighting challenges related to data differences and limited training data.
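Purely as an illustration of the end-to-end setting described here (log-spectrogram in, genuine/replay score out), a small CNN classifier might look like the sketch below; the layer sizes are assumptions for the example, not the compact architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class SmallSpoofCNN(nn.Module):
    """Illustrative end-to-end CNN: log-spectrogram in, a single spoof/genuine logit out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # global pooling keeps the model input-size agnostic
        )
        self.classifier = nn.Linear(32, 1)    # single logit: genuine vs. replay

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq, time)
        return self.classifier(self.features(spec).flatten(1))

print(SmallSpoofCNN()(torch.randn(4, 1, 257, 300)).shape)   # torch.Size([4, 1])
```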
audio
Published: 2018-04-25
Published in Odyssey 2018: the Speaker and Language Recognition Workshop
Authors: Tomi Kinnunen, Kong Aik Lee, Hector Delgado, Nicholas Evans, Massimiliano Todisco, Md Sahidullah, Junichi Yamagishi, Douglas A. Reynolds
This paper introduces the tandem detection cost function (t-DCF) to provide an ASV-centric assessment of spoofing countermeasures (CMs) in automatic speaker verification (ASV). The t-DCF extends the conventional DCF to scenarios involving spoofing attacks, addressing shortcomings of previous CM-only EER evaluations. It serves as a more reliable metric for assessing the combined performance of ASV and CM systems.
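In its constrained form (ASV system fixed), later adopted by the ASVspoof evaluations, the t-DCF reduces schematically to a weighted sum of the countermeasure's miss and false-alarm rates; the exact expressions for the weights are derived in the paper, so the form below is only a sketch of its structure.

```latex
% Schematic constrained t-DCF at CM threshold s. The weights C_1, C_2 fold
% together the fixed ASV system's error rates, the target/non-target/spoof
% priors, and the detection costs; see the paper for their exact form.
\[
  \mathrm{t\text{-}DCF}(s) \;=\; C_1\, P_{\mathrm{miss}}^{\mathrm{cm}}(s)
  \;+\; C_2\, P_{\mathrm{fa}}^{\mathrm{cm}}(s)
\]
```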
audio
Published: 2017-05-24
12 pages, 0 figures, published in Springer Communications in Computer and Information Science...
Authors: Galina Lavrentyeva, Sergey Novoselov, Konstantin Simonchik
This paper overviews and experimentally compares various acoustic feature spaces and classifiers for robust anti-spoofing countermeasures against Automatic Speaker Verification (ASV) spoofing attacks. It evaluates several spoofing detection systems on the ASVspoof Challenge 2015 datasets, highlighting effective combinations of features and classifiers. Key findings emphasize the importance of magnitude and phase information, wavelet-based features, and the strong performance of SVM and deep neural network classifiers.
audio
Published: 2017-02-13
Authors: Hong Yu, Zheng-Hua Tan, Zhanyu Ma, Jun Guo
This paper introduces Deep Neural Network Filter Bank Cepstral Coefficients (DNN-FBCC) for distinguishing between natural and spoofed speech, aiming to improve automatic speaker verification system reliability. The DNN filter bank is automatically generated by training a Filter Bank Neural Network (FBNN) on natural and synthetic speech, with restrictions that yield band-limited, frequency-sorted filters. Experimental results on the ASVspoof 2015 database demonstrate that a Gaussian Mixture Model maximum-likelihood (GMM-ML) classifier using DNN-FBCC outperforms one using state-of-the-art linear frequency cepstral coefficients (LFCC), particularly in detecting unknown attacks.
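As a loose, non-authoritative sketch of the underlying idea of a learnable filter bank (the band-limiting and frequency-sorting restrictions, the FBNN training procedure, and the cepstral/DCT step are omitted; all sizes below are assumptions):

```python
import torch
import torch.nn as nn

class LearnableFilterBank(nn.Module):
    """Illustrative learnable filter bank: non-negative filter weights over
    FFT bins, applied to a power spectrogram, followed by log compression."""
    def __init__(self, n_fft_bins: int = 257, n_filters: int = 20):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(n_filters, n_fft_bins))

    def forward(self, power_spec: torch.Tensor) -> torch.Tensor:
        # power_spec: (batch, frames, fft_bins)
        fb = torch.softmax(self.weights, dim=1)   # non-negative, unit-sum filters
        energies = power_spec @ fb.t()            # (batch, frames, n_filters)
        return torch.log(energies + 1e-8)         # log filter-bank energies

spec = torch.rand(2, 100, 257)
print(LearnableFilterBank()(spec).shape)          # torch.Size([2, 100, 20])
```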
audio
Published: 2016-03-14
Presented at the 2015 Annual IEEE India Conference (INDICON)
Authors: Dipjyoti Paul, Monisankha Pal, Goutam Saha
This paper proposes novel speech features to improve the detection of spoofing attacks against Automatic Speaker Verification (ASV) systems. These features are derived using an alternative frequency-warping technique and formant-specific block transformation of filter bank log energies. Evaluated on the ASVspoof 2015 corpora, the proposed techniques outperform existing methods, achieving 0% Equal Error Rate (EER) for natural and synthetic speech classification.
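EER, the metric reported here, is the operating point at which the miss rate equals the false-alarm rate; a minimal computation from detection scores (toy scores below, not the paper's results) is:

```python
import numpy as np

def compute_eer(genuine_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal error rate: threshold where the miss rate (genuine scored below
    threshold) matches the false-alarm rate (spoof scored at or above it)."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    miss = np.array([(genuine_scores < t).mean() for t in thresholds])
    fa   = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(miss - fa))
    return float((miss[idx] + fa[idx]) / 2)

rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)    # higher scores = more genuine-like
spoof   = rng.normal(-2.0, 1.0, 1000)
print(f"EER = {100 * compute_eer(genuine, spoof):.2f}%")
```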
audio
Published: 2016-03-12
23 pages, 7 figures
Authors: Cemal Hanilci, Tomi Kinnunen, Md Sahidullah, Aleksandr Sizov
This paper analyzes the robustness of state-of-the-art synthetic speech detection systems against additive noise, focusing on acoustic front-end features and classifier back-ends. The study reveals that current countermeasures significantly degrade even at relatively high signal-to-noise ratios (SNRs) and that traditional speech enhancement techniques are unhelpful. It also finds that Gaussian Mixture Model (GMM) back-ends generally outperform i-vector back-ends and that score fusion improves detection accuracy.
audio
Published: 2016-02-09
Submitted to Odyssey: The Speaker and Language Recognition Workshop 2016
Authors: Xiaohai Tian, Zhizheng Wu, Xiong Xiao, Eng Siong Chng, Haizhou Li
This paper presents a preliminary investigation into spoofing detection for automatic speaker verification (ASV) under additive noisy conditions, addressing a gap in previous research which primarily used clean data. The authors introduce a new noisy database, created by augmenting the ASVspoof 2015 database with five types of background noise at various signal-to-noise ratios (SNRs). Their experiments reveal that systems trained on clean data suffer significant performance degradation in noisy environments, with phase-based features showing greater robustness than magnitude-based ones.
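Mixing noise into clean speech at a target SNR amounts to scaling the noise relative to the speech power; a minimal sketch follows (the actual noise types and SNR grid used to build the database are specified in the paper, and the function name here is hypothetical):

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech so that the speech-to-noise power ratio equals
    snr_db; the noise is tiled or truncated to the speech length."""
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)     # stand-in for one second of speech at 16 kHz
babble = rng.normal(size=8000)     # stand-in for a noise recording
noisy = add_noise_at_snr(clean, babble, snr_db=10.0)
```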
audio
Published: 2015-07-29
5 pages, 8 figures, 3 tables
Authors: Sergey Novoselov, Alexandr Kozlov, Galina Lavrentyeva, Konstantin Simonchik, Vadim Shchemelinin
This paper presents the Speech Technology Center (STC) systems for the ASVspoof 2015 Challenge, focusing on robust countermeasures against spoofing attacks using various acoustic feature spaces. It investigates features derived from phase spectrum and multiresolution wavelet transform, combined with TV-JFA for probability modeling and SVM or DBN classifiers. The study demonstrates that phase-related and wavelet-based features substantially improve system efficiency.