Related Work

SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

audio Published: 2025-11-26 Authors: Ido Nitzan HIdekel, Gal lifshitz, Khen Cohen, Dan Raviv

SONAR addresses the lack of generalization in deepfake audio detection, which stems from spectral bias causing models to overlook subtle high-frequency artifacts left by deepfake generators. The framework explicitly disentangles the audio signal into low-frequency content and high-frequency residuals via a dual-path architecture and utilizes a frequency-aware Jensen-Shannon contrastive loss. This approach sharpens decision boundaries by enforcing alignment for genuine content-noise pairs while maximizing the separation of fake embeddings.

Continual Audio Deepfake Detection via Universal Adversarial Perturbation

audio Published: 2025-11-25 Authors: Wangjie Li, Lin Li, Qingyang Hong

The paper proposes a novel framework for continual audio deepfake detection (ADD) to overcome catastrophic forgetting against evolving spoofing attacks. It leverages Universal Adversarial Perturbation (UAP) integrated into the model fine-tuning process, allowing the system to retain knowledge of historical spoofing distributions without storing past training data. This approach aims to provide an efficient and robust solution for continual learning in ADD by utilizing pseudo-spoofed samples and knowledge distillation.

Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio

audio Published: 2025-11-14 Authors: Guangke Chen, Yuhui Wang, Shouling Ji, Xiapu Luo, Ting Wang

The paper introduces HARMGEN, a suite of five multi-modal attacks designed to bypass the safety mechanisms of Large Audio-Language Models (LALMs) used for Text-to-Speech (TTS). These attacks exploit semantic obfuscation (Concat, Shuffle) and audio-modality vulnerabilities (Read, Spell, Phoneme) to generate high-fidelity audio containing explicitly harmful linguistic content. The study demonstrates that HARMGEN substantially reduces refusal rates across commercial TTS systems and exposes critical vulnerabilities in current reactive and proactive defense mechanisms.

Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces

audio Published: 2025-11-13 Authors: Farhan Sheth, Girish, Mohd Mujtaba Akhtar, Muskaan Singh

This paper introduces RHYME, a unified framework for generalizable audio deepfake detection (ADD) across diverse synthesis paradigms including conventional TTS and modern diffusion/flow-matching generators. RHYME achieves synthesis-invariant alignment by fusing utterance-level embeddings from diverse pretrained speech encoders using non-Euclidean projections. By mapping representations into complementary hyperbolic and spherical manifolds, the framework captures hierarchical generator families and periodic spectral artifacts, leading to improved cross-paradigm generalization.

TwinShift: Benchmarking Audio Deepfake Detection across Synthesizer and Speaker Shifts

audio Published: 2025-10-27 Authors: Jiyoung Hong, Yoonseo Chung, Seungyeon Oh, Juntae Kim, Jiyoung Lee, Sookyung Kim, Hyunsoo Cho

This paper introduces TWINSHIFT, a new benchmark designed to rigorously evaluate the robustness and generalization ability of Audio Deepfake Detection (ADD) systems under strictly unseen conditions. TWINSHIFT evaluates detectors under simultaneous shifts in both the speech synthesizer and the speaker identity, using six different synthesis systems paired with disjoint sets of speakers. Experiments reveal significant robustness gaps, confirming that current SOTA detectors fail when confronted with truly novel deepfakes.

Can Current Detectors Catch Face-to-Voice Deepfake Attacks?

audio Published: 2025-10-23 Authors: Nguyen Linh Bao Nguyen, Alsharif Abuadbba, Kristen Moore, Tingming Wu

This paper systematically evaluates the performance of state-of-the-art audio deepfake detectors against the novel Face-to-Voice (FOICE) attack, which generates synthetic speech from a single facial image. The study demonstrates that baseline detectors consistently fail to identify FOICE-generated audio, highlighting a security blind spot for cross-modal attacks. While targeted fine-tuning significantly improves FOICE detection, it often compromises the model's robustness and generalization ability against unseen generators like SpeechT5.

EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

audio Published: 2025-10-22 Authors: Tong Zhang, Yihuan Huang, Yanzhen Ren

Existing speech deepfake detection systems exhibit severe performance degradation, with accuracy dropping dramatically when evaluated on realistic physical replay attacks. To counter this vulnerability, the authors introduce EchoFake, a novel and comprehensive dataset comprising over 120 hours of zero-shot TTS speech and physical replay recordings collected under varied real-world acoustic settings. Evaluation shows that models trained on EchoFake achieve superior generalization and robustness across multiple standard benchmarks.

Not All Deepfakes Are Created Equal: Triaging Audio Forgeries for Robust Deepfake Singer Identification

audio Published: 2025-10-20 Authors: Davide Salvi, Hendrik Vincent Koops, Elio Quinton

The paper addresses the challenge of identifying singers in highly realistic vocal deepfakes by introducing a novel two-stage pipeline that triages audio forgeries based on quality. This system first uses a discriminator to filter out low-quality deepfakes that fail to reproduce vocal likeness, prioritizing the most harmful, high-quality fakes. Experiments demonstrate that this triage approach significantly improves robust singer identification performance across both authentic and synthetic content compared to traditional baselines.

SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

audio Published: 2025-10-16 Authors: Hui Wang, Jinghua Zhao, Yifan Yang, Shujie Liu, Junyang Chen, Yanzhe Zhang, Shiwan Zhao, Jinyu Li, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin

This paper presents SpeechLLM-as-Judges, a novel paradigm leveraging Large Language Models (LLMs) for general, structured, and explanation-based speech quality evaluation across diverse tasks. They introduce SpeechEval, a large-scale multilingual dataset spanning four evaluation tasks, and develop SQ-LLM, a speech-quality-aware LLM trained with Chain-of-Thought (CoT) reasoning and reward optimization. SQ-LLM demonstrates strong performance, interpretability, and generalization across multiple evaluation scenarios, including deepfake detection.

FakeMark: Deepfake Speech Attribution With Watermarked Artifacts

audio Published: 2025-10-14 Authors: Wanying Ge, Xin Wang, Junichi Yamagishi

FakeMark is a novel watermarking framework designed for robust deepfake speech attribution, addressing the weaknesses of conventional classifier and watermarking solutions. It injects watermarks that are correlated with intrinsic acoustic artifacts associated with specific deepfake systems, enabling the detector to leverage both cues for source identification. This design significantly improves generalization to domain-shifted samples and maintains high accuracy under various distortions and malicious removal attacks.

Sparse deepfake detection promotes better disentanglement

audio Published: 2025-10-07 Authors: Antoine Teissier, Marie Tahon, Nicolas Dugué, Aghilas Sini

The paper introduces a novel method for deepfake detection by applying a TopK activation layer to the embeddings of the AASIST architecture to enforce sparsity. This approach demonstrates that sparse latent representations not only improve detection performance but also enhance the interpretability of the model by promoting better disentanglement. The study evaluates performance and disentanglement metrics using mutual information based on known attack types.

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

audio Published: 2025-10-06 Authors: Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen

WaveSP-Net is a novel, parameter-efficient architecture for speech deepfake detection combining a Partial-WSPT-XLSR front-end and a bidirectional Mamba back-end. This design utilizes learnable wavelet filters to create sparse, multi-resolution prompt embeddings, enhancing artifact localization without fine-tuning the frozen XLSR backbone. The approach achieves state-of-the-art performance on challenging benchmarks while maintaining low trainable parameter counts.

Forensic Similarity for Speech Deepfakes

audio Published: 2025-10-03 Authors: Viola Negroni, Davide Salvi, Daniele Ugo Leonzio, Paolo Bestagini, Stefano Tubaro

This paper introduces Forensic Similarity for Speech Deepfakes, a digital audio forensics approach designed to determine whether two audio segments contain the same generative forensic traces. The proposed system is a Siamese deep-learning framework combining a deepfake detector backbone as a feature extractor with a shallow neural network similarity model. The method demonstrates strong generalization capabilities for source verification across unseen generative models and shows utility in audio splicing detection.

Joint Optimization of Speaker and Spoof Detectors for Spoofing-Robust Automatic Speaker Verification

audio Published: 2025-10-02 Authors: Oğuzhan Kurnaz, Jagabandhu Mishra, Tomi H. Kinnunen, Cemal Hanilçi

This study proposes a modular, yet jointly optimized architecture for Spoofing-Robust Automatic Speaker Verification (SASV) by integrating separately trained ASV and CM subsystems. The system uses trainable back-end classifiers and non-linear score fusion, exploring direct optimization of the fusion back-end using the architecture-agnostic detection cost function (a-DCF) as the training objective. This approach highlights the importance of modular design, calibrated integration, and task-aligned optimization for robust SASV systems.

Descriptor:: Extended-Length Audio Dataset for Synthetic Voice Detection and Speaker Recognition (ELAD-SVDSR)

audio Published: 2025-09-30 Authors: Rahul Vijaykumar, Ajan Ahmed, John Parker, Dinesh Pendyala, Aidan Collins, Stephanie Schuckers, Masudul H. Imtiaz

This paper introduces the Extended-Length Audio Dataset for Synthetic Voice Detection and Speaker Recognition (ELAD-SVDSR), a resource comprising 45-minute audio recordings from 36 participants captured via five different microphones. The dataset is specifically designed to support the creation of high-quality, extended-duration deepfakes and to facilitate the development of robust synthetic voice detection systems. The initial release includes 20 generated deepfake voices, demonstrating the potential for generating highly realistic synthetic audio.

On Deepfake Voice Detection -- It's All in the Presentation

audio Published: 2025-09-30 Authors: Héctor Delgado, Giorgio Ramondetti, Emanuele Dalmasso, Gennady Karvitsky, Daniele Colibro, Haydar Talib

Current audio deepfake detection systems fail to generalize to real-world scenarios because existing datasets ignore the effects of presentation through communication channels (e.g., phone calls). The authors propose a new framework and research methodology incorporating these realistic presentation distortions into training data creation. This methodology significantly improved deepfake detection accuracy by 39% in robust lab setups and by 57% on a real-world benchmark, demonstrating the critical role of data realism over model size.

Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing

audio Published: 2025-09-29 Authors: Manasi Chhibber, Jagabandhu Mishra, Tomi H. Kinnunen

This work proposes a novel zero-shot framework for open-set speech deepfake source tracing, adapting the SSL-AASIST system with AAM loss for improved attack embedding extraction. It systematically compares zero-shot (cosine, Siamese) and few-shot (MLP, Siamese) backend scoring methods to attribute synthesized speech to its generative source. Experiments confirm that zero-shot cosine scoring generalizes best in the difficult open-set scenario.

Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment

audio Published: 2025-09-28 Authors: Pu Huang, Shouguang Wang, Siya Yao, Mengchu Zhou

This paper addresses the challenge of distribution shifts in speech deepfake detection by proposing the Information Bottleneck enhanced Confidence-Aware Adversarial Network (IB-CAAN). IB-CAAN aims to learn robust and shared discriminative features by suppressing attack-specific artifacts and minimizing nuisance variability across domains. The method achieves state-of-the-art generalization performance across several standard and cross-dataset benchmarks.

Frustratingly Easy Zero-Day Audio DeepFake Detection via Retrieval Augmentation and Profile Matching

audio Published: 2025-09-26 Authors: Xuechen Liu, Xin Wang, Junichi Yamagishi

This study addresses the vulnerability of modern audio deepfake detectors (ADD) to zero-day attacks generated by novel synthesis methods. It proposes a training-free framework utilizing knowledge representations, retrieval augmentation (RA), and voice profile matching. This framework achieves performance comparable to fine-tuned models on the DeepFake-Eval-2024 benchmark without requiring additional model training.

AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit

audio Published: 2025-09-25 Authors: Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk

The paper introduces AUDDT, an open-source toolkit designed to be an Audio Unified Deepfake Detection Benchmark, incorporating a systematic review and standardized evaluation across 28 diverse datasets. The toolkit automates the evaluation of pretrained detectors to assess their generalization capabilities across various deepfake generation methods and acoustic conditions. Using a widely adopted baseline model, the authors demonstrate significant performance disparities, highlighting critical weaknesses in generalization to out-of-domain data.

Addressing Gradient Misalignment in Data-Augmented Training for Robust Speech Deepfake Detection

audio Published: 2025-09-25 Authors: Duc-Tuan Truong, Tianchi Liu, Junjie Li, Ruijie Tao, Kong Aik Lee, Eng Siong Chng

The paper investigates and addresses the issue of gradient misalignment between original and augmented inputs during data-augmented training for speech deepfake detection (SDD). They propose a Dual-Path Data-Augmented (DPDA) training framework that processes both inputs in parallel and utilizes gradient alignment techniques to resolve conflicting updates. This approach stabilizes optimization, accelerates convergence, and significantly improves robustness across benchmark datasets.

QAMO: Quality-aware Multi-centroid One-class Learning For Speech Deepfake Detection

audio Published: 2025-09-25 Authors: Duc-Tuan Truong, Tianchi Liu, Ruijie Tao, Junjie Li, Kong Aik Lee, Eng Siong Chng

QAMO (Quality-Aware Multi-Centroid One-Class Learning) is proposed to improve speech deepfake detection by addressing the oversimplification inherent in single-centroid one-class models. It introduces multiple quality-aware centroids, each optimized to represent a distinct speech quality subspace derived from estimated Mean Opinion Scores (MOS). This framework better models the intra-class variability of bona fide speech and utilizes a multi-centroid ensemble scoring strategy during inference to reduce the need for explicit quality labels.

SEA-Spoof: Bridging The Gap in Multilingual Audio Deepfake Detection for South-East Asian

audio Published: 2025-09-24 Authors: Jinyang Wu, Nana Hou, Zihan Pan, Qiquan Zhang, Sailor Hardik Bhupendra, Soumik Mondal

This paper introduces SEA-Spoof, the first large-scale audio deepfake detection dataset for South-East Asian languages. It benchmarks state-of-the-art models, revealing significant cross-lingual performance degradation, but demonstrates that fine-tuning on SEA-Spoof substantially improves detection accuracy.

Why Speech Deepfake Detectors Won't Generalize: The Limits of Detection in an Open World

audio Published: 2025-09-23 Authors: Visar Berisha, Prad Kadambi, Isabella Lenz

Speech deepfake detectors fail to generalize in real-world conditions due to a combinatorial challenge termed 'coverage debt,' where required data grows faster than data collection. Analyzing cross-testing results, the authors demonstrate that detection performance drops significantly with newer synthesizers and in conversational speech domains. The study concludes that detection alone is insufficient for high-stakes decisions and must be integrated into layered defense strategies.

Attention-based Mixture of Experts for Robust Speech Deepfake Detection

audio Published: 2025-09-22 Authors: Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro

This paper presents a novel approach to audio deepfake detection using a Mixture of Experts (MoE) architecture. The system combines multiple state-of-the-art detectors, weighting their outputs via an attention-based gating network, achieving first place in the SAFE challenge at IH&MMSec 2025.

FakeSound2: A Benchmark for Explainable and Generalizable Deepfake Sound Detection

audio Published: 2025-09-21 Authors: Zeyu Xie, Yaoyun Zhang, Xuenan Xu, Yongkang Yin, Chenxing Li, Mengyue Wu, Yuexian Zou

FakeSound2 is a new benchmark dataset for deepfake sound detection that goes beyond binary classification, evaluating models on localization, traceability, and generalization across six manipulation types and twelve sources. Experimental results reveal that while current models achieve high accuracy in binary classification, they struggle with explainability and generalization.

How Does Instrumental Music Help SingFake Detection?

audio Published: 2025-09-18 Authors: Xuanjun Chen, Chia-Yu Hu, I-Ming Lin, Yi-Cheng Lin, I-Hsiang Chiu, You Zhang, Sung-Feng Huang, Yi-Hsuan Yang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

This paper investigates how instrumental music affects singing voice deepfake (SingFake) detection. It finds that instrumental accompaniment primarily acts as data augmentation, not providing intrinsic cues; fine-tuning increases reliance on shallow speaker features while reducing sensitivity to deeper content information.

Mixture of Low-Rank Adapter Experts in Generalizable Audio Deepfake Detection

audio Published: 2025-09-17 Authors: Janne Laakkonen, Ivan Kukanov, Ville Hautamäki

This paper proposes a Mixture-of-Low-Rank-Adapter-Experts (MoE-LoRA) approach for generalizable audio deepfake detection. The method integrates multiple low-rank adapters into a Wav2Vec2 model, using a routing mechanism to selectively activate specialized adapters for improved adaptability to diverse deepfake attacks. Experimental results demonstrate that MoE-LoRA significantly outperforms standard fine-tuning in both in-domain and out-of-domain scenarios.

Improving Out-of-Domain Audio Deepfake Detection via Layer Selection and Fusion of SSL-Based Countermeasures

audio Published: 2025-09-15 Authors: Pierre Serrano, Raphaël Duroselle, Florian Angulo, Jean-François Bonastre, Olivier Boeffard

This paper addresses the challenge of out-of-domain (OOD) generalization in audio deepfake detection. It improves detection by performing a layer-wise analysis of self-supervised learning (SSL) encoders, selecting the most informative layers, and fusing multiple encoders for enhanced OOD performance.

Emoanti: audio anti-deepfake with refined emotion-guided representations

audio Published: 2025-09-13 Authors: Xiaokang Li, Yicheng Gong, Dinghao Zou, Xin Cao, Sunbowen Lee

EmoAnti is a novel audio deepfake detection system that leverages emotional cues for improved generalization. It uses a Wav2Vec2 model fine-tuned on emotion recognition and a convolutional residual feature extractor to refine emotional representations, achieving state-of-the-art performance on ASVspoof benchmarks.

Towards Data Drift Monitoring for Speech Deepfake Detection in the context of MLOps

audio Published: 2025-09-12 Authors: Xin Wang, Wanying Ge, Junichi Yamagishi

This paper investigates data drift monitoring for speech deepfake detection within an MLOps framework. It explores using distribution distances to monitor drift from a reference dataset caused by new text-to-speech (TTS) attacks and demonstrates that fine-tuning the detector with similarly drifted data reduces drift and improves detection performance.

Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems

audio Published: 2025-09-11 Authors: Chin Yuen Kwok, Jia Qi Yip, Zhen Qiu, Chi Hung Chi, Kwok Yan Lam

This paper introduces 'bona fide cross-testing', a novel evaluation framework for audio deepfake detection that uses diverse bona fide datasets to create more robust and interpretable evaluations compared to traditional methods. It addresses limitations in existing methods by incorporating diverse bona fide speech types and aggregating EERs for a more balanced assessment.

MoLEx: Mixture of LoRA Experts in Speech Self-Supervised Models for Audio Deepfake Detection

audio Published: 2025-09-11 Authors: Zihan Pan, Sailor Hardik Bhupendra, Jinyang Wu

This paper introduces MoLEx, a parameter-efficient framework for audio deepfake detection that combines Low-Rank Adaptation (LoRA) with a Mixture-of-Experts (MoE) router. MoLEx efficiently finetunes pre-trained self-supervised learning (SSL) models by only updating selected experts, achieving state-of-the-art performance with reduced computational costs.

Audio Deepfake Verification

audio Published: 2025-09-10 Authors: Li Wang, Junyi Ao, Linyong Gan, Yuancheng Wang, Xueyao Zhang, Zhizheng Wu

This paper introduces the Audio Deepfake Verification (ADV) task, which aims to determine if two audio samples originate from the same deepfake method, enabling open-set source tracing. A novel dual-branch architecture, Audity, is proposed to extract deepfake features from both audio structure and generation artifacts, outperforming single-branch approaches.

Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

audio Published: 2025-09-09 Authors: Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood, Sunil Aryal

The paper proposes Spectral Masking and Interpolation Attack (SMIA), a black-box adversarial attack that manipulates inaudible frequency regions in AI-generated audio to bypass voice authentication systems (VAS) and anti-spoofing countermeasures (CMs). SMIA achieves high attack success rates against state-of-the-art models, demonstrating vulnerabilities in current voice authentication security.

Adversarial Attacks on Audio Deepfake Detection: A Benchmark and Comparative Study

audio Published: 2025-09-08 Authors: Kutub Uddin, Muhammad Umar Farooq, Awais Khan, Khalid Mahmood Malik

This research paper presents a comparative benchmark study of state-of-the-art audio deepfake detection (ADD) methods under various anti-forensic (AF) attacks. The main contribution is a large-scale evaluation of twelve ADD methods across five datasets and two AF attack categories (statistical and optimization-based), revealing their vulnerabilities and informing the design of more robust detectors.

Speaker Privacy and Security in the Big Data Era: Protection and Defense against Deepfake

audio Published: 2025-09-08 Authors: Liping Chen, Kong Aik Lee, Zhen-Hua Ling, Xin Wang, Rohan Kumar Das, Tomoki Toda, Haizhou Li

This paper provides a concise overview of three techniques for addressing security threats from deepfake speech: voice anonymization, deepfake detection, and watermarking. It describes their methodologies, advancements, and challenges, highlighting the need for further research into integrating these techniques.

XMUspeech Systems for the ASVspoof 5 Challenge

audio Published: 2025-09-05 Authors: Wangjie Li, Xingjia Xie, Yishuang Li, Wenhao Guan, Kaidi Wang, Pengyu Ren, Lin Li, Qingyang Hong

This paper presents XMUspeech systems for the ASVspoof 5 Challenge's speech deepfake detection track. The approach focuses on leveraging various state-of-the-art models (AASIST, HM-Conformer, Hubert, Wav2vec2) with adaptive multi-scale feature fusion and optimized one-class loss functions to improve detection accuracy, particularly with longer audio durations.

AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds

audio Published: 2025-09-04 Authors: Qizhou Wang, Hanxun Huang, Guansong Pang, Sarah Erfani, Christopher Leckie

The paper introduces AUDETER, a large-scale (3 million audio clips, 4,500+ hours) dataset for deepfake audio detection, addressing the limitations of existing datasets in handling diverse and up-to-date audio samples. Experiments show that models trained on AUDETER significantly outperform state-of-the-art methods on cross-domain evaluation, reducing detection error rates by 44.1% to 51.6%.

Wav2DF-TSL: Two-stage Learning with Efficient Pre-training and Hierarchical Experts Fusion for Robust Audio Deepfake Detection

audio Published: 2025-09-04 Authors: Yunqi Hao, Yihao Chen, Minqiang Xu, Jianbo Zhan, Liang He, Lei Fang, Sian Fang, Lin Liu

This paper introduces Wav2DF-TSL, a two-stage learning strategy for robust audio deepfake detection. It uses efficient pre-training with adapters to learn spoofed speech artifacts and a hierarchical adaptive mixture of experts (HA-MoE) for multi-level spoofing cue fusion, significantly outperforming state-of-the-art methods.

NE-PADD: Leveraging Named Entity Knowledge for Robust Partial Audio Deepfake Detection via Attention Aggregation

audio Published: 2025-09-04 Authors: Huhong Xian, Rui Liu, Berrak Sisman, Haizhou Li

NE-PADD is a novel method for partial audio deepfake detection that leverages named entity knowledge. It uses two parallel branches (SpeechNER and PADD) and two attention aggregation mechanisms (Attention Fusion and Attention Transfer) to improve detection accuracy by integrating semantic information.

Multi-level SSL Feature Gating for Audio Deepfake Detection

audio Published: 2025-09-03 Authors: Hoan My Tran, Damien Lolive, Aghilas Sini, Arnaud Delhay, Pierre-François Marteau, David Guennec

This paper proposes a novel audio deepfake detection approach using a multi-level self-supervised learning (SSL) feature gating mechanism. It combines a gating mechanism to extract relevant features from the XLS-R model with a MultiConv classifier to capture speech artifacts and Centered Kernel Alignment (CKA) to enhance feature diversity, achieving state-of-the-art performance and robust generalization to out-of-domain datasets.

Speech DF Arena: A Leaderboard for Speech DeepFake Detection Models

audio Published: 2025-09-02 Authors: Sandipana Dowerah, Atharva Kulkarni, Ajinkya Kulkarni, Hoan My Tran, Joonas Kalda, Artem Fedorchenko, Benoit Fauve, Damien Lolive, Tanel Alumäe, Matthew Magimai Doss

Speech DF Arena is the first comprehensive benchmark for audio deepfake detection, providing a toolkit for uniform evaluation across 14 datasets and standardized metrics. It includes a leaderboard to rank detection systems and highlights the need for extensive cross-domain evaluation due to high error rates in out-of-domain scenarios.

Generalizable Audio Spoofing Detection using Non-Semantic Representations

audio Published: 2025-08-29 Authors: Arnab Das, Yassine El Kheir, Carlos Franzreb, Tim Herzig, Tim Polzehl, Sebastian Möller

This research introduces a novel audio deepfake detection method using non-semantic universal audio representations from TRILL and TRILLsson models. The approach achieves comparable in-domain performance to state-of-the-art methods but significantly surpasses them in out-of-domain generalization, especially on real-world datasets.

Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System

audio Published: 2025-08-28 Authors: Hashim Ali, Surya Subramani, Lekha Bollinani, Nithin Sai Adupa, Sali El-Loh, Hafiz Malik

This research paper explores multilingual dataset integration strategies for robust audio deepfake detection. By systematically experimenting with self-supervised learning front-ends and various dataset combinations, the authors achieved second place in two tasks of the SAFE Challenge, demonstrating strong generalization and robustness.

ClearMask: Noise-Free and Naturalness-Preserving Protection Against Voice Deepfake Attacks

audio Published: 2025-08-25 Authors: Yuanda Wang, Bocheng Chen, Hanqing Guo, Guangjing Wang, Weikang Ding, Qiben Yan

ClearMask is a noise-free defense against voice deepfake attacks that modifies audio mel-spectrograms by selectively filtering frequencies, uses audio style transfer to deceive decoders, and introduces optimized reverberation to disrupt voice generation models. LiveMask, a real-time version, uses a universal frequency filter and reverberation generator for immediate protection.

A Survey of Threats Against Voice Authentication and Anti-Spoofing Systems

audio Published: 2025-08-22 Authors: Kamel Kamel, Keshav Sood, Hridoy Sankar Dutta, Sunil Aryal

This research paper provides a comprehensive survey of modern threats against voice authentication systems (VAS) and anti-spoofing countermeasures (CMs). It categorizes and analyzes various attacks, including data poisoning, adversarial attacks, deepfakes, and adversarial spoofing attacks, highlighting methodologies, datasets, and limitations of existing literature.

Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts

audio Published: 2025-08-18 Authors: Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews

This paper proposes a self-attentive prototypical network for few-shot detection of synthesized speech, addressing the challenge of detecting synthesized speech under distribution shifts. The approach significantly improves performance over existing zero-shot detectors by adapting quickly using as few as 10 in-distribution samples, achieving up to a 32% relative EER reduction on certain datasets.

Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform

audio Published: 2025-08-14 Authors: Yuankun Xie, Ruibo Fu, Xiaopeng Wang, Zhiyong Wang, Ya Li, Zhengqi Wen, Haonnan Cheng, Long Ye

This paper introduces the Fake Speech Wild (FSW) dataset, a 254-hour collection of real and deepfake audio from social media platforms, to address the limitations of existing deepfake audio detection models in real-world scenarios. By augmenting public datasets with FSW and employing self-supervised learning-based countermeasures, the authors significantly improve deepfake audio detection performance, achieving an average equal error rate of 3.54%.

Perturbed Public Voices (P$^{2}$V): A Dataset for Robust Audio Deepfake Detection

audio Published: 2025-08-13 Authors: Chongyang Gao, Marco Postiglione, Isabel Gortner, Sarit Kraus, V. S. Subrahmanian

The paper introduces Perturbed Public Voices (P²V), a new dataset for robust audio deepfake detection that addresses the limitations of existing datasets by incorporating realistic noise, identity-consistent transcripts, and state-of-the-art voice cloning techniques. Experiments show that models trained on P²V are more robust to adversarial attacks and generalize better to other datasets than models trained on existing benchmarks.

Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative

audio Published: 2025-08-12 Authors: Xi Xuan, Zimo Zhu, Wenxin Zhang, Yi-Cheng Lin, Tomi Kinnunen

Fake-Mamba is a real-time speech deepfake detection framework that replaces self-attention with bidirectional Mamba, a state-space model. It introduces three efficient encoders (TransBiMamba, ConBiMamba, and PN-BiMamba), achieving state-of-the-art performance on ASVspoof and In-The-Wild benchmarks while maintaining real-time inference.

SCDF: A Speaker Characteristics DeepFake Speech Dataset for Bias Analysis

audio Published: 2025-08-11 Authors: Vojtěch Staněk, Karel Srna, Anton Firc, Kamil Malinka

The paper introduces the Speaker Characteristics DeepFake (SCDF) dataset, a large-scale, richly annotated resource for evaluating demographic biases in deepfake speech detection. Using SCDF, the authors demonstrate that state-of-the-art detectors exhibit significant performance disparities across various speaker demographics, highlighting the need for bias-aware development.

ESDD 2026: Environmental Sound Deepfake Detection Challenge Evaluation Plan

audio Published: 2025-08-06 Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

This paper proposes EnvSDD, a large-scale dataset for environmental sound deepfake detection, and launches the Environmental Sound Deepfake Detection Challenge (ESDD 2026) based on it. The challenge features two tracks: one focusing on unseen generators and another on black-box low-resource detection.

Multilingual Source Tracing of Speech Deepfakes: A First Benchmark

audio Published: 2025-08-06 Authors: Xi Xuan, Yang Xiao, Rohan Kumar Das, Tomi Kinnunen

This paper introduces the first benchmark for multilingual speech deepfake source tracing, evaluating models' ability to identify the source model used to generate deepfake speech across different languages and speakers. The benchmark uses a new dataset and protocols to comprehensively assess model generalization in mono- and cross-lingual scenarios.

Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework

audio Published: 2025-08-04 Authors: Andrea Di Pierno, Luca Guarnera, Dario Allegra, Sebastiano Battiato

This paper introduces LAVA, a hierarchical framework for audio deepfake detection and model recognition. LAVA uses a convolutional autoencoder to extract latent representations from fake audio, which are then classified by two specialized classifiers for attribution and model recognition, achieving high F1-scores.

Generalizable Audio Deepfake Detection via Hierarchical Structure Learning and Feature Whitening in Poincaré sphere

audio Published: 2025-08-03 Authors: Mingru Yang, Yanmei Gu, Qianhua He, Yanxiong Li, Peirong Zhang, Yongqiang Chen, Zhiming Wang, Huijia Zhu, Jian Liu, Weiqiang Wang

This paper introduces Poin-HierNet, a novel framework for generalizable audio deepfake detection. Poin-HierNet leverages the Poincaré sphere to construct domain-invariant hierarchical representations, outperforming state-of-the-art methods in Equal Error Rate.

Multi-Granularity Adaptive Time-Frequency Attention Framework for Audio Deepfake Detection under Real-World Communication Degradations

audio Published: 2025-08-02 Authors: Haohan Shi, Xiyu Shi, Safak Dogan, Tianjin Huang, Yunxiao Zhang

This paper introduces a novel multi-granularity adaptive time-frequency attention framework for robust audio deepfake detection under real-world communication degradations. The framework uses a multi-scale attention mechanism to capture both global and local features, and an adaptive fusion mechanism to dynamically adjust attention based on degradation characteristics, improving detection accuracy in noisy conditions.

Fusion of Modulation Spectrogram and SSL with Multi-head Attention for Fake Speech Detection

audio Published: 2025-08-01 Authors: Rishith Sadashiv T N, Abhishek Bedge, Saisha Suresh Bore, Jagabandhu Mishra, Mrinmoy Bhattacharjee, S R Mahadeva Prasanna

This paper proposes a novel fake speech detection model that fuses self-supervised learning (SSL) speech embeddings with modulation spectrogram features using multi-head attention. The resulting fused representation is then fed into an AASIST network for classification, achieving significant performance improvements over a baseline model in both in-domain and out-of-domain scenarios.

SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods

audio Published: 2025-07-29 Authors: Wen Huang, Yanmei Gu, Zhiming Wang, Huijia Zhu, Yanmin Qian

The paper introduces SpeechFake, a large-scale multilingual speech deepfake dataset containing over 3 million deepfake samples generated using 40 different speech synthesis tools. This dataset addresses limitations in existing datasets by providing scale, diversity in generation methods, and multilingual support, enabling the development of more robust deepfake detection models.

Two Views, One Truth: Spectral and Self-Supervised Features Fusion for Robust Speech Deepfake Detection

audio Published: 2025-07-27 Authors: Yassine El Kheir, Arnab Das, Enes Erdem Erdogan, Fabian Ritter-Guttierez, Tim Polzehl, Sebastian Möller

This paper proposes a robust audio deepfake detection method that fuses self-supervised learning (SSL) features with handcrafted spectral features (MFCC, LFCC, CQCC). The fusion, using cross-attention mechanisms, significantly improves generalization performance compared to using only SSL features, achieving a 38% relative reduction in equal error rate (EER).

WaveVerify: A Novel Audio Watermarking Framework for Media Authentication and Combatting Deepfakes

audio Published: 2025-07-23 Authors: Aditya Pujari, Ajita Rattani

WaveVerify is a novel audio watermarking framework that uses a FiLM-based generator for robust multiband watermark embedding and a Mixture-of-Experts detector for accurate extraction and localization. It significantly outperforms state-of-the-art models in robustness to various audio distortions and temporal modifications.

LENS-DF: Deepfake Detection and Temporal Localization for Long-Form Noisy Speech

audio Published: 2025-07-22 Authors: Xuechen Liu, Wanying Ge, Xin Wang, Junichi Yamagishi

LENS-DF is a novel recipe for training and evaluating audio deepfake detection and temporal localization under realistic conditions (longer duration, noisy conditions, multiple speakers). Models trained with LENS-DF consistently outperform those trained using conventional methods, demonstrating its effectiveness for robust audio deepfake detection and localization.

Frame-level Temporal Difference Learning for Partial Deepfake Speech Detection

audio Published: 2025-07-20 Authors: Menglu Li, Xiao-Ping Zhang, Lian Zhao

This paper proposes a Temporal Difference Attention Module (TDAM) for partial deepfake speech detection that analyzes frame-level temporal differences without requiring frame-level annotations. TDAM identifies unnatural temporal variations in deepfake speech, achieving state-of-the-art performance on PartialSpoof and HAD datasets.

SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks

audio Published: 2025-07-17 Authors: Kutub Uddin, Awais Khan, Muhammad Umar Farooq, Khalid Malik

The paper proposes SHIELD, a collaborative learning method for robust audio deepfake detection against adversarial attacks. SHIELD integrates an auxiliary generative model to expose anti-forensic signatures and uses a triplet model to capture correlations between real and attacked audios, significantly improving robustness against generative adversarial attacks.

Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes

audio Published: 2025-07-17 Authors: Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Qingming Li, Tianyu Du, Shouling Ji

Enkidu is a novel user-oriented audio privacy framework that uses universal frequential perturbations (UFPs) generated via black-box knowledge and few-shot training to protect against voice deepfakes. These UFPs enable real-time, lightweight protection with strong generalization across variable-length audio while preserving audio quality and achieving significantly higher processing efficiency than existing countermeasures.

Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection

audio Published: 2025-07-15 Authors: Ivan Viakhirev, Daniil Sirota, Aleksandr Smirnov, Kirill Borodin

This paper refines the AASIST architecture for speech deepfake detection by freezing a Wav2Vec 2.0 encoder, replacing graph attention with multi-head attention, and using a trainable fusion layer. These modifications achieve a 7.6% equal error rate (EER) on the ASVspoof 5 corpus, significantly improving upon the baseline.

Phoneme-Level Analysis for Person-of-Interest Speech Deepfake Detection

audio Published: 2025-07-11 Authors: Davide Salvi, Viola Negroni, Sara Mandelli, Paolo Bestagini, Stefano Tubaro

This paper proposes a phoneme-level Person-of-Interest (POI) based speech deepfake detection method. It analyzes individual phonemes in reference and test audio to create speaker profiles and compare them for detecting synthetic artifacts, achieving comparable accuracy to traditional methods with improved robustness and interpretability.

RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing

audio Published: 2025-07-11 Authors: Yang Xiao, Ting Dang, Rohan Kumar Das

RawTFNet is a lightweight CNN architecture for speech anti-spoofing that achieves state-of-the-art performance while using fewer computing resources. It separates feature processing along time and frequency dimensions to capture fine-grained details of synthetic speech, showing comparable performance to heavier models on ASVspoof 2021 datasets.

Open-Set Source Tracing of Audio Deepfake Systems

audio Published: 2025-07-09 Authors: Nicholas Klein, Hemlata Tak, Elie Khoury

This paper addresses the challenge of open-set source tracing in audio deepfakes. It introduces a novel softmax energy (SME) score for out-of-distribution (OOD) detection, significantly improving open-set source tracing performance compared to existing energy-based methods. The authors achieve an FPR95 of 8.3% by combining SME with data augmentation techniques.

Evaluating Fake Music Detection Performance Under Audio Augmentations

audio Published: 2025-07-07 Authors: Tomasz Sroka, Tomasz Wężowicz, Dominik Sidorczuk, Mateusz Modrzejewski

This paper investigates the robustness of a state-of-the-art fake music detection model (SONICS) against various audio augmentations. A dataset of real and synthetic music from multiple generators was created and subjected to augmentations; the results show a significant decrease in the model's accuracy even with minor transformations.

Robust Localization of Partially Fake Speech: Metrics, Models, and Out-of-Domain Evaluation

audio Published: 2025-07-04 Authors: Hieu-Thi Luong, Inbal Rimon, Haim Permuter, Kong Aik Lee, Eng Siong Chng

This paper analyzes limitations in evaluating partial audio deepfake localization, advocating for threshold-dependent metrics like accuracy and F1-score over Equal Error Rate (EER). It demonstrates that existing models, while strong in-domain, generalize poorly to out-of-domain data, and that increasing training data doesn't always improve performance.

Generalizable Detection of Audio Deepfakes

audio Published: 2025-07-02 Authors: Jose A. Lopez, Georg Stemmer, Héctor Cordourier Maruri

This paper presents a comprehensive study to improve the generalization of audio deepfake detection models. It explores various pre-trained backbones, data augmentation techniques, and loss functions, achieving performance surpassing the top single system in the ASVspoof 5 Challenge.

Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges

audio Published: 2025-06-30 Authors: Hashim Ali, Surya Subramani, Raksha Varahamurthy, Nithin Adupa, Lekha Bollinani, Hafiz Malik

This paper details a methodology for creating a high-quality speech deepfake dataset of ten public figures. The approach uses an automated pipeline for collecting and curating real speech, incorporating transcription-based segmentation to improve synthetic speech quality generated using various TTS methods.

PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection

audio Published: 2025-06-28 Authors: Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali

This paper introduces PhonemeFake (PF), a new deepfake attack that manipulates crucial speech segments using language reasoning, making deepfakes more realistic and harder to detect. A novel bilevel detection model, PhonemeFakeDetect (PFD), is also presented, significantly improving detection accuracy and efficiency by focusing computation on manipulated regions.

Post-training for Deepfake Speech Detection

audio Published: 2025-06-26 Authors: Wanying Ge, Xin Wang, Xuechen Liu, Junichi Yamagishi

This paper introduces a post-training approach to adapt self-supervised learning (SSL) models for deepfake speech detection. By training on a large multilingual dataset of genuine and artifacted speech, the resulting AntiDeepfake models outperform existing state-of-the-art detectors, demonstrating strong robustness and generalization to unseen deepfakes.

IndieFake Dataset: A Benchmark Dataset for Audio Deepfake Detection

audio Published: 2025-06-23 Authors: Abhay Kumar, Kunal Verma, Omkar More

This paper introduces the IndieFake Dataset (IFD), a benchmark dataset for audio deepfake detection focusing on English-speaking Indian speakers. IFD addresses the lack of diversity in existing datasets, providing a balanced dataset with speaker-level characterization and outperforming existing benchmarks on detection performance.

A Comparative Study on Proactive and Passive Detection of Deepfake Speech

audio Published: 2025-06-17 Authors: Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang

This research proposes a framework for comparing proactive (watermarking) and passive (conventional detection) deepfake speech detection models. It ensures fair comparison by training and testing all models on common datasets with a shared metric, and analyzes their robustness against adversarial attacks.

Manipulated Regions Localization For Partially Deepfake Audio: A Survey

audio Published: 2025-06-17 Authors: Jiayi He, Jiangyan Yi, Jianhua Tao, Siding Zeng, Hao Gu

This survey provides the first comprehensive overview of partially deepfake audio manipulated region localization tasks. It systematically introduces existing methods, datasets, evaluation metrics, and challenges, highlighting future research directions and potential trends in this field.

Towards Neural Audio Codec Source Parsing

audio Published: 2025-06-14 Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Arun Balaji Buduru, Rajesh Sharma

This paper introduces Neural Audio Codec Source Parsing (NACSP), a novel approach to audio deepfake detection that focuses on regressing codec parameters instead of binary classification. They propose HYDRA, a framework using hyperbolic geometry to disentangle latent features from pre-trained models, improving multi-task generalization for parameter prediction.

From Sharpness to Better Generalization for Speech Deepfake Detection

audio Published: 2025-06-13 Authors: Wen Huang, Xuechen Liu, Xin Wang, Junichi Yamagishi, Yanmin Qian

This paper explores sharpness as a theoretical proxy for generalization in speech deepfake detection. By applying Sharpness-Aware Minimization (SAM), the authors improve model robustness and stability across diverse unseen datasets, demonstrating a statistically significant relationship between sharpness and generalization performance.

Unmasking real-world audio deepfakes: A data-centric approach

audio Published: 2025-06-11 Authors: David Combei, Adriana Stan, Dan Oneata, Nicolas Müller, Horia Cucu

This paper introduces a new dataset, AI4T, of real-world audio deepfakes collected from online platforms. Instead of focusing on model complexity, it employs data-centric approaches (curation, pruning, augmentation) to significantly improve deepfake detection performance on both AI4T and the In-the-Wild dataset, achieving substantial reductions in Equal Error Rate (EER).

Towards Generalized Source Tracing for Codec-Based Deepfake Speech

audio Published: 2025-06-08 Authors: Xuanjun Chen, I-Ming Lin, Lin Zhang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

This paper addresses the suboptimal performance of source tracing models for codec-based deepfake speech. It introduces SASTNet, a novel network that jointly leverages semantic and acoustic features for improved generalization and state-of-the-art performance on the CodecFake+ dataset.

SynHate: Detecting Hate Speech in Synthetic Deepfake Audio

audio Published: 2025-06-07 Authors: Rishabh Ranjan, Kishan Pipariya, Mayank Vatsa, Richa Singh

SynHate is the first multilingual dataset for detecting hate speech in synthetic audio, encompassing 37 languages and a novel four-class scheme (Real-normal, Real-hate, Fake-normal, Fake-hate). It leverages pre-trained self-supervised models to evaluate hate speech detection performance, revealing variations across languages and highlighting the challenge of cross-dataset generalization.

TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes

audio Published: 2025-06-06 Authors: Adriana Stan, David Combei, Dan Oneata, Horia Cucu

This paper introduces TADA, a training-free method for audio deepfake source attribution and out-of-domain detection. It leverages a pre-trained self-supervised learning model and k-Nearest Neighbors (kNN) to achieve high F1-scores for both in-domain (0.93) and out-of-domain (0.84) detection across multiple datasets.

A Data-Driven Diffusion-based Approach for Audio Deepfake Explanations

audio Published: 2025-06-03 Authors: Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj

This paper introduces a data-driven approach for explaining audio deepfakes using a diffusion model. It leverages the difference between real and vocoded audio as ground truth to train the model, identifying artifact regions in deepfake audio. Experimental results show this method outperforms traditional explainability techniques.

Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models

audio Published: 2025-06-03 Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

This paper introduces the task of singing voice deepfake source attribution (SVDSA) and proposes COFFE, a novel framework for this task. COFFE uses multimodal foundation models (MMFMs) and a Chernoff Distance loss function for effective fusion of different foundation models, achieving state-of-the-art performance.

PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing

audio Published: 2025-06-03 Authors: You Zhang, Baotong Tian, Lin Zhang, Zhiyao Duan

The paper introduces PartialEdit, a new dataset of partially edited deepfake speech created using advanced neural speech editing techniques. Experiments show that models trained on existing datasets fail to generalize to PartialEdit, highlighting the challenges posed by these new deepfakes.

Trusted Fake Audio Detection Based on Dirichlet Distribution

audio Published: 2025-06-03 Authors: Chi Ding, Junxiao Xue, Cong Wang, Hao Zhou

This paper introduces a novel fake audio detection approach that enhances reliability by modeling the trustworthiness of model decisions using the Dirichlet distribution. The approach generates evidence via a neural network, models uncertainty with the Dirichlet distribution, and combines predicted probabilities with uncertainty estimates for final classification.

Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion

audio Published: 2025-06-02 Authors: Ajinkya Kulkarni, Sandipana Dowerah, Tanel Alumae, Mathew Magimai. -Doss

This paper introduces a novel audio source tracing system for identifying the origin of audio deepfakes. It combines deep metric learning with a Conformer network and ensemble score-embedding fusion to improve both in-domain and out-of-domain source tracing accuracy.

XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark

audio Published: 2025-05-31 Authors: Ioan-Paul Ciobanu, Andrei-Iulian Hiji, Nicolae-Catalin Ristea, Paul Irofti, Cristian Rusu, Radu Tudor Ionescu

This paper introduces XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech across seven languages. Experiments reveal a significant disparity between in-domain and cross-domain performance of state-of-the-art deepfake detectors, highlighting the need for more robust models.

RPRA-ADD: Forgery Trace Enhancement-Driven Audio Deepfake Detection

audio Published: 2025-05-31 Authors: Ruibo Fu, Xiaopeng Wang, Zhengqi Wen, Jianhua Tao, Yuankun Xie, Zhiyong Wang, Chunyu Qiang, Xuefei Liu, Cunhang Fan, Chenxing Li, Guanjun Li

This paper introduces RPRA-ADD, a robust audio deepfake detection framework that enhances forgery traces using Reconstruction-Perception-Reinforcement-Attention networks. It improves upon existing methods by focusing on learning intrinsic differences between real and fake audio, leading to state-of-the-art performance on multiple benchmark datasets and strong cross-domain generalization.

Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake Detection

audio Published: 2025-05-30 Authors: Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia

This paper introduces Rehearsal with Auxiliary-Informed Sampling (RAIS), a continual learning approach for audio deepfake detection that addresses the challenge of catastrophic forgetting. RAIS uses an auxiliary label generation network to improve sample diversity in the memory buffer, leading to better performance in handling new deepfake attacks.

Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes

audio Published: 2025-05-29 Authors: Neta Glazer, David Chernin, Idan Achituve, Sharon Gannot, Ethan Fetaya

This paper introduces ADD-GP, a few-shot adaptive framework for audio deepfake detection using a Gaussian Process (GP) classifier. ADD-GP combines a deep embedding model with GP's flexibility to achieve strong performance and adaptability to new, unseen voice cloning models with minimal data.

ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis

audio Published: 2025-05-26 Authors: Hawau Olamide Toyin, Rufael Marew, Humaid Alblooshi, Samar M. Magdy, Hanan Aldarmaki

ArVoice is a new multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions, designed for multi-speaker speech synthesis and useful for tasks like deepfake detection. It comprises professionally recorded speech, a modified subset of the Arabic Speech Corpus, and synthetic speech, totaling 83.52 hours across 11 voices.

STOPA: A Database of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution

audio Published: 2025-05-26 Authors: Anton Firc, Manasi Chibber, Jagabandhu Mishra, Vishwanath Pratap Singh, Tomi Kinnunen, Kamil Malinka

The paper introduces STOPA, a new dataset for deepfake audio source tracing. STOPA offers systematic variation in acoustic and vocoder models across 700k samples, enabling more reliable attribution of synthesized speech compared to existing datasets with limited variation.

EnvSDD: Benchmarking Environmental Sound Deepfake Detection

audio Published: 2025-05-25 Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D Plumbley

This paper introduces EnvSDD, the first large-scale dataset for environmental sound deepfake detection, comprising 45.25 hours of real and 316.74 hours of fake audio. It also proposes a new deepfake detection system using a pre-trained audio foundation model (BEATs), which outperforms existing state-of-the-art methods from speech and singing domains.

What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection

audio Published: 2025-05-23 Authors: Binh Nguyen, Shuji Shi, Ryan Ofman, Thai Le

This paper investigates the linguistic sensitivity of audio anti-spoofing detectors by introducing transcript-level adversarial attacks. The study reveals that minor linguistic perturbations can significantly reduce detection accuracy, highlighting the need for more robust systems that account for linguistic variations.

ASVspoof2019 vs. ASVspoof5: Assessment and Comparison

audio Published: 2025-05-21 Authors: Avishai Weizman, Yehuda Ben-Shimol, Itshak Lapidot

This paper compares the ASVspoof2019 and ASVspoof5 databases for automatic speaker verification spoofing detection, highlighting the increased difficulty in ASVspoof5 due to mismatched conditions in both bona fide and spoofed speech statistics, and showing that genuine speech in ASVspoof5 is statistically closer to spoofed speech than in ASVspoof2019.

Replay Attacks Against Audio Deepfake Detection

audio Published: 2025-05-20 Authors: Nicolas Müller, Piotr Kawa, Wei-Herng Choong, Adriana Stan, Aditya Tirumala Bukkapatnam, Karla Pizzi, Alexander Wagner, Philip Sperl

This paper investigates the vulnerability of audio deepfake detection systems to replay attacks, where deepfake audio is played and re-recorded, making it harder to detect. A new dataset, ReplayDF, is introduced to study this, showing significant performance degradation in six open-source detection models when subjected to replay attacks.

Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source Tracing

audio Published: 2025-05-20 Authors: Yang Xiao, Rohan Kumar Das

This paper proposes AnaST, an exemplar-free class incremental learning method for audio deepfake source tracing. AnaST addresses catastrophic forgetting by updating the classifier with a closed-form analytical solution in one epoch, while keeping the feature extractor fixed, enabling efficient adaptation to new attacks without storing past data.

Source Verification for Speech Deepfakes

audio Published: 2025-05-20 Authors: Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro

This paper introduces the novel task of source verification for speech deepfakes, focusing on determining if a test audio track originates from the same generative model as a set of reference tracks. The approach leverages embeddings from a classifier trained for source attribution, comparing embeddings using cosine similarity to assess source identity.

Naturalness-Aware Curriculum Learning with Dynamic Temperature for Speech Deepfake Detection

audio Published: 2025-05-20 Authors: Taewoo Kim, Guisik Kim, Choongsang Cho, Young Han Lee

This research proposes naturalness-aware curriculum learning for speech deepfake detection, a training framework that leverages speech naturalness (measured by mean opinion scores) to improve model robustness and generalization. The approach incorporates dynamic temperature scaling based on speech naturalness, resulting in a 23% relative reduction in EER on the ASVspoof 2021 DF dataset.

BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention

audio Published: 2025-05-20 Authors: Yassine El Kheir, Tim Polzehl, Sebastian Möller

BiCrossMamba-ST is a speech deepfake detection framework using a dual-branch spectro-temporal architecture with bidirectional Mamba blocks and cross-attention. It achieves significant performance improvements over state-of-the-art methods on ASVSpoof LA21 and DF21 benchmarks by effectively capturing subtle cues of synthetic speech.

Forensic deepfake audio detection using segmental speech features

audio Published: 2025-05-20 Authors: Tianle Yang, Chengzhe Sun, Siwei Lyu, Phil Rose

This research investigates the use of segmental speech features, specifically vowel formants, for deepfake audio detection. The study finds that these features, linked to human articulation, are more effective at identifying deepfakes than global features commonly used in forensic voice comparison, highlighting the need for distinct approaches in deepfake detection.

Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy

audio Published: 2025-05-19 Authors: Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

This paper introduces a novel approach for tracing the source of codec-based audio deepfakes (CodecFake) by analyzing their underlying neural audio codecs. The approach leverages a neural audio codec taxonomy to identify characteristic features of the codecs used in generating the deepfakes, enabling source tracing even for unseen CoSG systems.

ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection

audio Published: 2025-05-16 Authors: Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen

This paper proposes ALLM4ADD, a framework that leverages Audio Large Language Models (ALLMs) for audio deepfake detection by reformulating the task as an audio question answering problem. Supervised fine-tuning enhances the ALLM's ability to classify audio as real or fake, achieving superior performance, especially in data-scarce scenarios.

BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset

audio Published: 2025-05-16 Authors: Istiaq Ahmed Fahad, Kamruzzaman Asif, Sifat Sikder

This paper introduces BanglaFake, a new Bengali deepfake audio dataset containing 12,260 real and 13,260 deepfake utterances generated using a state-of-the-art TTS model. The dataset's quality is evaluated through qualitative and quantitative analyses, showing high naturalness and intelligibility of the deepfakes, making it a valuable resource for deepfake detection research in low-resource languages.

Beyond Identity: A Generalizable Approach for Deepfake Audio Detection

audio Published: 2025-05-10 Authors: Yasaman Ahmadiadli, Xiao-Ping Zhang, Naimul Khan

This research introduces an identity-independent audio deepfake detection framework that mitigates identity leakage by focusing on forgery-specific artifacts. The approach uses Artifact Detection Modules (ADMs) and novel dynamic artifact generation techniques to improve cross-dataset generalization.

Detecting Musical Deepfakes

audio Published: 2025-05-03 Authors: Nick Sunday

This research explores the detection of AI-generated music (deepfakes) using the FakeMusicCaps dataset. A convolutional neural network (ResNet18) is trained on Mel spectrograms of audio clips, subjected to tempo and pitch shifting to simulate adversarial conditions, to classify audio as either deepfake or human-generated.

End-to-end Audio Deepfake Detection from RAW Waveforms: a RawNet-Based Approach with Cross-Dataset Evaluation

audio Published: 2025-04-29 Authors: Andrea Di Pierno, Luca Guarnera, Dario Allegra, Sebastiano Battiato

This research introduces RawNetLite, a lightweight convolutional-recurrent neural network for audio deepfake detection operating directly on raw waveforms. It improves robustness through a training strategy combining data from multiple domains, Focal Loss, and waveform-level augmentations, achieving high accuracy on in-domain and out-of-domain datasets.

TriniMark: A Robust Generative Speech Watermarking Method for Trinity-Level Attribution

audio Published: 2025-04-29 Authors: Yue Li, Weizhi Liu, Dongdong Lin

This paper introduces TriniMark, a robust generative speech watermarking method for authenticating synthetic speech and tracing it back to the diffusion model and user. It achieves this through a two-stage process: pre-training a lightweight encoder-decoder and then fine-tuning the diffusion model using a waveform-guided strategy.

Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios

audio Published: 2025-04-16 Authors: Haohan Shi, Xiyu Shi, Safak Dogan, Saif Alzubi, Tianjin Huang, Yunxiao Zhang

This research introduces ADD-C, a new benchmark dataset for evaluating audio deepfake detection (ADD) systems' robustness under real-world communication conditions (codec compression and packet loss). A novel data augmentation strategy is proposed to improve ADD system performance on ADD-C, significantly enhancing robustness against these real-world degradations.

Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy

audio Published: 2025-04-15 Authors: Botao Zhao, Zuheng Kang, Yayun He, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

This paper proposes f-InfoED, a frame-level latent information entropy detector, for generalized audio deepfake detection. It leverages the variational information bottleneck to extract discriminative information entropy from latent representations, achieving state-of-the-art performance and remarkable generalization capabilities.

SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis

audio Published: 2025-04-14 Authors: Zhisheng Zhang, Derui Wang, Qianyi Yang, Pengyang Huang, Junhan Pu, Yuxin Cao, Kai Ye, Jie Hao, Yixian Yang

SafeSpeech is a proactive voice protection framework that embeds imperceptible perturbations into audio before uploading to prevent high-quality speech synthesis. It uses a surrogate model and a novel Speech PErturbative Concealment (SPEC) technique to generate universally applicable perturbations robust against adaptive adversaries.

Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

audio Published: 2025-04-09 Authors: Yuankun Xie, Ruibo Fu, Zhiyong Wang, Xiaopeng Wang, Songjun Cao, Long Ma, Haonan Cheng, Long Ye

This paper introduces a novel wavelet prompt tuning (WPT) method for all-type audio deepfake detection, significantly improving cross-type detection accuracy. WPT optimizes self-supervised learning (SSL) models by learning specialized prompt tokens in the frequency domain, requiring far fewer trainable parameters than fine-tuning.

Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing

audio Published: 2025-04-08 Authors: Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li

This paper introduces Nes2Net, a lightweight architecture for speech anti-spoofing that directly processes high-dimensional features from speech foundation models without dimensionality reduction layers. This improves performance by 22% and reduces computational cost by 87% compared to state-of-the-art baselines.

Anomaly Detection and Localization for Speech Deepfakes via Feature Pyramid Matching

audio Published: 2025-03-23 Authors: Emma Coletta, Davide Salvi, Viola Negroni, Daniele Ugo Leonzio, Paolo Bestagini

This paper introduces an interpretable one-class detection framework for speech deepfake detection, addressing limitations of supervised learning methods. The model, trained solely on real speech, identifies synthetic audio as anomalies and generates anomaly maps highlighting anomalous regions in time and frequency domains.

Measuring the Robustness of Audio Deepfake Detectors

audio Published: 2025-03-21 Authors: Xiang Li, Pin-Yu Chen, Wenqi Wei

This research systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions (noise, modification, compression). It finds that while models are robust to noise, they are vulnerable to modifications and compression, especially neural codecs; foundation models generally outperform traditional models.

InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being

audio Published: 2025-03-18 Authors: Guang Dai, Pinhao Wang, Cheng Yao, Fangtian Ying

InnerSelf is a novel voice system that uses speech synthesis and large language models to allow users to engage in supportive dialogues with their deepfake voices, aiming to improve emotional well-being by manipulating positive self-talk.

DIN-CTS: Low-Complexity Depthwise-Inception Neural Network with Contrastive Training Strategy for Deepfake Speech Detection

audio Published: 2025-02-27 Authors: Lam Pham, Dat Tran, Phat Lam, Florian Skopik, Alexander Schindler, Silvia Poletti, David Fischinger, Martin Boyer

This paper proposes DIN-CTS, a low-complexity deepfake speech detection system using a Depthwise-Inception Network (DIN) trained with a contrastive training strategy (CTS). The system transforms audio into spectrograms, trains a DIN to extract audio embeddings for bonafide speech, and detects deepfakes by computing the distance of test utterances from this distribution.

DeePen: Penetration Testing for Audio Deepfake Detection

audio Published: 2025-02-27 Authors: Nicolas Müller, Piotr Kawa, Adriana Stan, Thien-Phuc Doan, Souhwan Jung, Wei Herng Choong, Philip Sperl, Konstantin Böttinger

This paper introduces DeePen, a penetration testing methodology for evaluating the robustness of audio deepfake detection models. DeePen uses signal processing modifications (attacks) to assess model vulnerabilities, revealing that all tested systems, both commercial and academic, are susceptible to deception by simple manipulations.

Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis

audio Published: 2025-02-20 Authors: Kevin Warren, Daniel Olszewski, Seth Layton, Kevin Butler, Carrie Gates, Patrick Traynor

This paper proposes a novel audio deepfake detection method using six classical prosodic features (pitch, jitter, shimmer, HNR). The model achieves 93% accuracy and a 24.7% EER, comparable to existing baselines, while demonstrating enhanced robustness against adversarial attacks and providing explainability through attention mechanisms.

Generalizable speech deepfake detection via meta-learned LoRA

audio Published: 2025-02-15 Authors: Janne Laakkonen, Ivan Kukanov, Ville Hautamäki

This paper proposes a novel approach for generalizable speech deepfake detection using meta-learning with Low-Rank Adaptation (LoRA) adapters. This method improves generalization by learning common structures across different deepfake attack types, reducing the need for extensive retraining when encountering new attacks.

VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect

audio Published: 2025-02-14 Authors: Qingyuan Fei, Wenjie Hou, Xuan Hai, Xin Liu

VocalCrypt is a novel active defense method against AI voice cloning that embeds imperceptible pseudo-timbre into audio, preventing voice cloning without compromising audio quality. It significantly improves robustness and real-time performance compared to existing methods.

A Preliminary Exploration with GPT-4o Voice Mode

audio Published: 2025-02-14 Authors: Yu-Xiang Lin, Chih-Kai Yang, Wei-Chih Chen, Chen-An Li, Chien-yu Huang, Xuanjun Chen, Hung-yi Lee

This paper presents a preliminary exploration of GPT-4o's audio processing capabilities, evaluating its performance across various audio, speech, and music tasks. The study reveals GPT-4o's strengths in tasks like intent classification and multilingual speech recognition but also its limitations and safety-related restrictions, notably a refusal to perform tasks like audio deepfake detection.

SyntheticPop: Attacking Speaker Verification Systems With Synthetic VoicePops

audio Published: 2025-02-13 Authors: Eshaq Jamdar, Amith Kamath Belman

This paper introduces SyntheticPop, a novel attack method targeting the VoicePop speaker verification system. SyntheticPop embeds synthetic pop noises into spoofed audio samples, significantly reducing the system's accuracy (to 14% from 69%). The attack achieves over 95% success rate with only 20% of the training data poisoned.

ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech

audio Published: 2025-02-13 Authors: Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi, Myeonghun Jeong, Ge Zhu, Yongyi Zang, You Zhang, Soumi Maiti, Florian Lux, Nicolas Müller, Wangyou Zhang, Chengzhe Sun, Shuwei Hou, Siwei Lyu, Sébastien Le Maguer, Cheng Gong, Hanjie Guo, Liping Chen, Vishwanath Singh

The ASVspoof 5 challenge introduces a new crowdsourced speech database for spoofing, deepfake, and adversarial attack detection. This database features diverse acoustic conditions, a significantly larger number of speakers (~2000 compared to ~100 in previous editions), and attacks generated using 32 different algorithms, including adversarial attacks for the first time.

XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

audio Published: 2025-02-06 Authors: Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli

This paper introduces XAttnMark, a novel audio watermarking method that achieves state-of-the-art performance in both watermark detection and attribution. It improves upon existing methods by leveraging partial parameter sharing between generator and detector, a cross-attention mechanism, and a psychoacoustic-aligned loss function.

Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection

audio Published: 2025-02-05 Authors: Yassine El Kheir, Youness Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller

This paper analyzes the layer-wise contributions of self-supervised learning (SSL) models for audio deepfake detection across diverse languages and scenarios. It finds that lower layers consistently provide the most discriminative features, enabling the development of computationally efficient models with comparable performance by using only a subset of these layers.

Deepfake Detection of Singing Voices With Whisper Encodings

audio Published: 2025-01-31 Authors: Falguni Sharma, Priyanka Gupta

This paper proposes a singing voice deepfake detection (SVDD) system using noise-variant encodings from OpenAI's Whisper model. The system leverages the non-speech information encoded by Whisper, even though it's a noise-robust model, to differentiate between real and fake singing voices. Performance is evaluated using Equal Error Rate (EER).

Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation

audio Published: 2025-01-24 Authors: Wen Huang, Yanmei Gu, Zhiming Wang, Huijia Zhu, Yanmin Qian

This paper proposes a novel audio deepfake detection strategy integrating Latent Space Refinement (LSR) and Latent Space Augmentation (LSA) to improve generalization. LSR uses multiple learnable prototypes for spoofed audio, while LSA augments data in the latent space, enhancing the model's ability to handle unseen attacks.

What Does an Audio Deepfake Detector Focus on? A Study in the Time Domain

audio Published: 2025-01-23 Authors: Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj

This paper proposes a relevancy-based explainable AI (XAI) method, Gradient Average Transformer Relevancy (GATR), to analyze predictions of transformer-based audio deepfake detection models. GATR outperforms existing XAI methods (Grad-CAM, SHAP) in faithfulness metrics and a partial spoof test, providing insights into the models' decision-making process on large datasets.

Transferable Adversarial Attacks on Audio Deepfake Detection

audio Published: 2025-01-21 Authors: Muhammad Umar Farooq, Awais Khan, Kutub Uddin, Khalid Mahmood Malik

This paper introduces a transferable GAN-based adversarial attack framework to evaluate the robustness of state-of-the-art audio deepfake detection (ADD) systems. The framework generates high-quality adversarial attacks that preserve transcription and perceptual integrity, revealing significant vulnerabilities in existing ADD systems.

CodecFake+: A Large-Scale Neural Audio Codec-Based Deepfake Speech Dataset

audio Published: 2025-01-14 Authors: Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee

This paper introduces CodecFake+, a large-scale dataset for detecting deepfake speech generated by codec-based speech generation (CoSG) systems. It also proposes a taxonomy for categorizing neural audio codecs, enabling detailed analysis of factors influencing CodecFake detection performance.

Neural Codec Source Tracing: Toward Comprehensive Attribution in Open-Set Condition

audio Published: 2025-01-11 Authors: Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Songjun Cao, Long Ma, Chenxing Li, Haonnan Cheng, Long Ye

This paper introduces the Neural Codec Source Tracing (NCST) task for open-set audio deepfake detection, encompassing both neural codec classification and ALM detection. A new dataset, ST-Codecfake, is created to benchmark NCST models under open-set conditions, revealing limitations in classifying unseen real audio despite strong performance on in-distribution and out-of-distribution tasks.

Unmasking Deepfakes: Leveraging Augmentations and Features Variability for Deepfake Speech Detection

audio Published: 2025-01-09 Authors: Inbal Rimon, Oren Gal, Haim Permuter

This paper presents a hybrid deepfake speech detection model combining a self-supervised feature extractor (Wav2Vec 2.0) with a ResNet34 classifier. The model incorporates novel audio and feature-level augmentations, achieving state-of-the-art results on the ASVSpoof5 challenge.

Vision Graph Non-Contrastive Learning for Audio Deepfake Detection with Limited Labels

audio Published: 2025-01-09 Authors: Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia

This paper introduces SIGNL, a novel framework for audio deepfake detection that uses spatio-temporal vision graph non-contrastive learning to achieve high performance with limited labeled data. SIGNL constructs graphs from audio spectrograms, pre-trains encoders using label-free learning, and fine-tunes them for deepfake detection, significantly outperforming state-of-the-art methods.

Explaining Speaker and Spoof Embeddings via Probing

audio Published: 2024-12-24 Authors: Xuechen Liu, Junichi Yamagishi, Md Sahidullah, Tomi kinnunen

This research investigates the explainability of embeddings used in audio spoofing detection systems. By training classifiers to predict speaker attributes from these embeddings, the study reveals which traits are preserved and how this impacts spoofing detection robustness.

Are audio DeepFake detection models polyglots?

audio Published: 2024-12-23 Authors: Bartłomiej Marek, Piotr Kawa, Piotr Syga

This research benchmarks multilingual audio deepfake detection by evaluating various adaptation strategies. Experiments analyzing models trained on English datasets, along with intra- and cross-linguistic adaptations, reveal significant variations in detection efficacy, highlighting the importance of target-language data.

Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution

audio Published: 2024-12-23 Authors: Orchid Chetia Phukan, Drishti Singh, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma

This research investigates the use of speech pre-trained models (PTMs) for audio deepfake source attribution (ADSD). It finds that the x-vector model, a speaker recognition PTM, achieves the best performance, and proposes FINDER, a novel fusion framework, to further improve ADSD accuracy by combining PTM representations.

Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes

audio Published: 2024-12-17 Authors: Kuiyuan Zhang, Zhongyun Hua, Rushi Lan, Yushu Zhang, Yifang Guo

This paper proposes a novel speech deepfake detection method that leverages inconsistencies in phoneme-level speech features. It introduces adaptive phoneme pooling to extract these features and a graph attention network to model their temporal dependencies, achieving superior performance over state-of-the-art methods on multiple datasets.

Region-Based Optimization in Continual Learning for Audio Deepfake Detection

audio Published: 2024-12-16 Authors: Yujie Chen, Jiangyan Yi, Cunhang Fan, Jianhua Tao, Yong Ren, Siding Zeng, Chu Yuan Zhang, Xinrui Yan, Hao Gu, Jun Xue, Chenglong Wang, Zhao Lv, Xiaohui Zhang

This paper introduces RegO, a continual learning method for audio deepfake detection that utilizes the Fisher information matrix to partition the neural network into four regions for region-adaptive gradient optimization. This approach, combined with an Ebbinghaus forgetting mechanism, improves the model's ability to adapt to new deepfake audio while retaining performance on previously learned data.

Audios Don't Lie: Multi-Frequency Channel Attention Mechanism for Audio Deepfake Detection

audio Published: 2024-12-12 Authors: Yangguang Feng

This research proposes an audio deepfake detection method using a multi-frequency channel attention mechanism (MFCA) and 2D discrete cosine transform (DCT). The method leverages MobileNet V2 for feature extraction and MFCA to weight different frequency channels, improving the detection of fine-grained frequency features in audio signals.

Reject Threshold Adaptation for Open-Set Model Attribution of Deepfake Audio

audio Published: 2024-12-02 Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Yujie Chen, Hao Gu, Guanjun Li, Junzuo Zhou, Yong Ren, Tao Xu

This paper proposes ReTA, a novel framework for open-set model attribution of deepfake audio, addressing the limitations of manually setting rejection thresholds in previous methods. ReTA adapts rejection thresholds for each class by learning reconstruction error distributions and employing Gaussian probability estimation, improving accuracy and data adaptability.

From Audio Deepfake Detection to AI-Generated Music Detection -- A Pathway and Overview

audio Published: 2024-11-30 Authors: Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller

This paper provides the first comprehensive review of AI-generated music (AIGM) detection methods. It proposes a pathway for leveraging foundation models from audio deepfake detection to improve AIGM detection by focusing on intrinsic musical features rather than superficial techniques.

Parallel Stacked Aggregated Network for Voice Authentication in IoT-Enabled Smart Devices

audio Published: 2024-11-29 Authors: Awais Khan, Ijaz Ul Haq, Khalid Mahmood Malik

This paper introduces PSA-Net, a lightweight framework for voice anti-spoofing in IoT devices. PSA-Net directly processes raw audio, eliminating the need for computationally expensive pre-processing, and uses a split-transform-aggregate approach to achieve consistent performance across various spoofing attacks.

Comparative Analysis of ASR Methods for Speech Deepfake Detection

audio Published: 2024-11-26 Authors: Davide Salvi, Amit Kumar Singh Yadav, Kratika Bhagtani, Viola Negroni, Paolo Bestagini, Edward J. Delp

This paper investigates the relationship between Automatic Speech Recognition (ASR) performance and speech deepfake detection accuracy. By adapting pre-trained ASR models (Whisper and Wav2Vec 2.0) for deepfake detection, the authors analyze if improvements in ASR correlate with improved deepfake detection capabilities.

Listening for Expert Identified Linguistic Features: Assessment of Audio Deepfake Discernment among Undergraduate Students

audio Published: 2024-11-21 Authors: Noshaba N. Bhalli, Nehal Naqvi, Chloe Evered, Christine Mallinson, Vandana P. Janeja

This study investigates whether training undergraduate students to identify expert-defined linguistic features in audio improves their ability to discern audio deepfakes. The researchers found that training significantly reduced students' uncertainty in evaluating audio clips and improved their ability to correctly identify clips they were initially unsure about.

XLSR-Mamba: A Dual-Column Bidirectional State Space Model for Spoofing Attack Detection

audio Published: 2024-11-15 Authors: Yang Xiao, Rohan Kumar Das

This paper proposes XLSR-Mamba, a dual-column bidirectional state space model for spoofing attack detection. It combines a pre-trained wav2vec 2.0 model with a novel Mamba architecture to achieve competitive results and faster inference compared to transformer-based models.

Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation

audio Published: 2024-11-14 Authors: Kuiyuan Zhang, Zhongyun Hua, Yushu Zhang, Yifang Guo, Tao Xiang

This paper presents a robust deepfake speech detection method using dual-stream feature decomposition learning to separate synthesizer-independent content features from synthesizer-specific features. A synthesizer feature augmentation strategy further enhances robustness by blending and shuffling features, improving performance across various synthesizers and datasets.

Toward Transdisciplinary Approaches to Audio Deepfake Discernment

audio Published: 2024-11-08 Authors: Vandana P. Janeja, Christine Mallinson

This paper advocates for a transdisciplinary approach to audio deepfake detection, integrating linguistic knowledge with AI methods to overcome limitations of current expert-agnostic AI models. It highlights the need to move beyond a solely AI-based approach and incorporate human expertise in language to improve the robustness and comprehensiveness of deepfake detection.

I Can Hear You: Selective Robust Training for Deepfake Audio Detection

audio Published: 2024-10-31 Authors: Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, Chengzhi Mao

This paper introduces DeepFakeVox-HQ, the largest public deepfake audio dataset, and proposes F-SAT, a Frequency-Selective Adversarial Training method to improve deepfake audio detection robustness. F-SAT focuses on high-frequency components, which are easily manipulated by attackers, improving accuracy on both clean and corrupted/attacked samples.

Mitigating Unauthorized Speech Synthesis for Voice Protection

audio Published: 2024-10-28 Authors: Zhisheng Zhang, Qianyi Yang, Derui Wang, Pengyang Huang, Yuxin Cao, Kai Ye, Jie Hao

This paper proposes Pivotal Objective Perturbation (POP), a proactive audio protection technology that adds imperceptible noise to speech samples to prevent high-quality deepfake audio generation. POP's effectiveness and transferability across various state-of-the-art text-to-speech (TTS) models are demonstrated through extensive experiments, significantly increasing speech unclarity scores.

Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes

audio Published: 2024-10-27 Authors: Ivan Kukanov, Janne Laakkonen, Tomi Kinnunen, Ville Hautamäki

This paper tackles the challenge of generalizing speech deepfake detection to unseen attacks using meta-learning. By learning attack-invariant features, the approach adapts to new attacks with minimal samples, improving Equal Error Rate (EER) from 21.67% to 10.42% on the InTheWild dataset using only 96 unseen samples.

ALDAS: Audio-Linguistic Data Augmentation for Spoofed Audio Detection

audio Published: 2024-10-21 Authors: Zahra Khanjani, Christine Mallinson, James Foulds, Vandana P Janeja

The paper introduces ALDAS, an AI framework for automatically labeling linguistic features in audio to improve spoofed audio detection. ALDAS leverages a CNN trained on expert-labeled data, enhancing existing detection models without the limitations of manual annotation.

Prompt Tuning for Audio Deepfake Detection: Computationally Efficient Test-time Domain Adaptation with Limited Target Dataset

audio Published: 2024-10-13 Authors: Hideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa

This paper proposes a computationally efficient method for audio deepfake detection using prompt tuning, addressing challenges of source-target domain gaps, limited target datasets, and high computational costs associated with large pre-trained models. Prompt tuning acts as a plug-in, seamlessly integrating with state-of-the-art transformer models to improve performance on target data with minimal additional parameters and computational overhead.

Quantum-Trained Convolutional Neural Network for Deepfake Audio Detection

audio Published: 2024-10-11 Authors: Chu-Hsuan Abraham Lin, Chen-Yu Liu, Samuel Yen-Chi Chen, Kuan-Cheng Chen

This paper proposes a Quantum-Trained Convolutional Neural Network (QT-CNN) for deepfake audio detection. The QT-CNN uses a hybrid quantum-classical approach, reducing the number of trainable parameters by up to 70% compared to classical CNNs without sacrificing accuracy.

Toward Robust Real-World Audio Deepfake Detection: Closing the Explainability Gap

audio Published: 2024-10-09 Authors: Georgia Channing, Juil Sock, Ronald Clark, Philip Torr, Christian Schroeder de Witt

This paper introduces novel explainability methods for transformer-based audio deepfake detectors and open-sources a new benchmark for real-world generalizability. The improved explainability builds trust and addresses the scalability challenge in audio deepfake detection.

Learn from Real: Reality Defender's Submission to ASVspoof5 Challenge

audio Published: 2024-10-09 Authors: Yi Zhu, Chirag Goel, Surya Koppisetti, Trang Tran, Ankur Kumar, Gaurav Bharaj

This paper presents Reality Defender's submission to the ASVspoof5 challenge, focusing on a novel pretraining strategy called SLIM. SLIM leverages self-supervised contrastive learning to learn style-linguistics dependency embeddings from bonafide speech, improving generalizability and maintaining low computational cost.

Diffuse or Confuse: A Diffusion Deepfake Speech Dataset

audio Published: 2024-10-09 Authors: Anton Firc, Kamil Malinka, Petr Hanáček

This paper introduces a new dataset of diffusion-generated deepfake speech, comparing its quality and detectability to non-diffusion deepfakes. The findings show that detection performance is comparable across both types, with variability depending on the detector architecture, suggesting diffusion models don't pose a significantly greater threat to current detection systems.

Can DeepFake Speech be Reliably Detected?

audio Published: 2024-10-09 Authors: Hongbin Liu, Youzheng Chen, Arun Narayanan, Athula Balachandran, Pedro J. Moreno, Lun Wang

This research systematically studies malicious attacks against state-of-the-art open-source synthetic speech detectors (SSDs). It evaluates white-box, black-box, and agnostic attacks, measuring effectiveness and stealthiness using metrics and human ratings, revealing significant vulnerabilities in current SSDs.

Where are we in audio deepfake detection? A systematic analysis over generative and detection models

audio Published: 2024-10-06 Authors: Xiang Li, Pin-Yu Chen, Wenqi Wei

This paper introduces SONAR, a framework for benchmarking AI-synthesized audio detection models. SONAR uses a novel dataset from 9 diverse audio synthesis platforms and evaluates both traditional and foundation model-based detection systems, revealing that foundation models show stronger generalization capabilities.

Augmentation through Laundering Attacks for Audio Spoof Detection

audio Published: 2024-10-01 Authors: Hashim Ali, Surya Subramani, Hafiz Malik

This paper investigates the performance of an audio spoof detection system (AASIST) trained using data augmentation through laundering attacks on the ASVspoof 5 database. The results show that the system performs worst on specific spoofing attacks and codec conditions, highlighting challenges in real-world audio deepfake detection.

XWSB: A Blend System Utilizing XLS-R and WavLM with SLS Classifier detection system for SVDD 2024 Challenge

audio Published: 2024-09-27 Authors: Qishan Zhang, Shuangbing Wen, Fangke Yan, Tao Hu, Jun Li

This paper presents XWSB, a state-of-the-art system for singing voice deepfake detection, achieving an EER of 2.32% in the CtrSVDD 2024 challenge. XWSB blends XLS-R and WavLM models, each coupled with an SLS classifier, using a max voting strategy for final decision.

Freeze and Learn: Continual Learning with Selective Freezing for Speech Deepfake Detection

audio Published: 2024-09-26 Authors: Davide Salvi, Viola Negroni, Luca Bondi, Paolo Bestagini, Stefano Tubaro

This paper investigates the optimal application of continual learning for speech deepfake detection. It compares retraining an entire model versus selectively updating only initial layers (responsible for feature processing) while freezing others. Results show that selectively updating initial layers is the most effective strategy for maintaining model performance while adapting to new data.

Leveraging Mixture of Experts for Improved Speech Deepfake Detection

audio Published: 2024-09-24 Authors: Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro

This paper proposes a novel speech deepfake detection method using a Mixture of Experts (MoE) architecture. The MoE framework enhances generalization and adaptability to unseen data by specializing experts on different datasets, outperforming traditional single models and ensemble methods. An efficient gating mechanism dynamically assigns expert weights, optimizing detection performance.

Representation Loss Minimization with Randomized Selection Strategy for Efficient Environmental Fake Audio Detection

audio Published: 2024-09-24 Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Nitin Choudhury, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

This paper proposes a novel approach for efficient environmental audio deepfake detection by randomly selecting a subset (40-50%) of representation vectors from foundation models. This method outperforms state-of-the-art dimensionality reduction techniques while significantly reducing model parameters and inference time.

A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection

audio Published: 2024-09-23 Authors: Lam Pham, Phat Lam, Dat Tran, Hieu Tang, Tin Nguyen, Alexander Schindler, Florian Skopik, Alexander Polonsky, Canh Vu

This paper conducts a comprehensive survey of deepfake speech detection, analyzing existing challenges, datasets, and deep learning techniques. It proposes hypotheses on improving detection effectiveness, validates them through experiments, and presents a highly competitive deepfake speech detection model.

Room Impulse Responses help attackers to evade Deep Fake Detection

audio Published: 2024-09-23 Authors: Hieu-Thi Luong, Duc-Tuan Truong, Kong Aik Lee, Eng Siong Chng

This paper investigates the vulnerability of deepfake audio detection systems to attacks using room impulse responses (RIRs) to add reverberation to fake speech. The authors demonstrate that this simple attack significantly increases the error rate of state-of-the-art systems, and propose a defense mechanism using large-scale synthetic RIR data augmentation during training, substantially improving detection accuracy.

Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition

audio Published: 2024-09-21 Authors: Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

This research explores multimodal foundation models (MFMs) for non-verbal emotion recognition (NVER), hypothesizing that their joint pre-training improves accuracy. A novel fusion framework, MATA, is proposed to combine MFM representations, achieving state-of-the-art results on benchmark datasets.

Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models

audio Published: 2024-09-21 Authors: Orchid Chetia Phukan, Sarthak Jain, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

This research compares music foundation models (MFMs) and speech foundation models (SFMs) for singing voice deepfake detection (SVDD). It finds that speaker recognition SFMs perform best, and proposes a novel fusion framework, FIONA, which combines SFMs and MFMs to achieve state-of-the-art results with a 13.74% equal error rate (EER).

Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis

audio Published: 2024-09-20 Authors: Lauri Juvela, Xin Wang

This paper enhances collaborative watermarking for speech synthesis detection by incorporating audio codec augmentation. It demonstrates that using a waveform-domain straight-through estimator for gradient approximation enables robust watermarking even after processing through traditional and neural audio codecs.

SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

audio Published: 2024-09-18 Authors: Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe

The paper introduces SpoofCeleb, a new dataset for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV). It uses real-world, noisy speech from VoxCeleb1 to train 23 TTS systems, generating a large and diverse dataset of both bona fide and spoofed speech, addressing limitations of existing datasets.

SafeEar: Content Privacy-Preserving Audio Deepfake Detection

audio Published: 2024-09-14 Authors: Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu

SafeEar is a novel framework for audio deepfake detection that preserves content privacy by using only acoustic information (prosody and timbre) for detection, decoupled from semantic content using a neural audio codec. This approach achieves a low equal error rate (EER) while preventing content recovery by both machine and human analysis.

DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset

audio Published: 2024-09-13 Authors: Jiawei Du, I-Ming Lin, I-Hsiang Chiu, Xuanjun Chen, Haibin Wu, Wenze Ren, Yu Tsao, Hung-yi Lee, Jyh-Shing Roger Jang

The paper introduces DFADD, a new dataset of audio deepfakes generated using advanced diffusion and flow-matching TTS models. DFADD addresses the lack of robust anti-spoofing models against these high-quality synthetic audios and serves as a valuable resource for developing more resilient detection methods.

LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking

audio Published: 2024-09-12 Authors: Mayank Kumar Singh, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji

This paper proposes LOCKEY, a novel method for deepfake deterrence and user tracking in generative models. It integrates key-based authentication with watermarking, ensuring only users with valid keys can generate high-quality outputs, while embedding the user's key as a watermark for tracking.

D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack

audio Published: 2024-09-11 Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac

This paper investigates the resilience of the D-CAPTCHA system against transferable imperceptible adversarial attacks. It exposes vulnerabilities in D-CAPTCHA and proposes D-CAPTCHA++, a more robust version that uses adversarial training to mitigate these vulnerabilities, significantly improving the system's resistance to attacks.

VoiceWukong: Benchmarking Deepfake Voice Detection

audio Published: 2024-09-10 Authors: Ziwei Yan, Yanjie Zhao, Haoyu Wang

VoiceWukong is a new benchmark dataset for deepfake voice detection, addressing limitations in existing datasets by including diverse languages (English and Chinese) and various manipulations. Evaluation of 12 state-of-the-art detectors revealed significant challenges in real-world application, with most exceeding a 20% equal error rate.

Investigating Causal Cues: Strengthening Spoofed Audio Detection with Human-Discernible Linguistic Features

audio Published: 2024-09-09 Authors: Zahra Khanjani, Tolulope Ale, Jianwu Wang, Lavon Davis, Christine Mallinson, Vandana P. Janeja

This paper investigates causal relationships between human-discernible linguistic features (EDLFs) and spoofed audio detection. Using a hybrid dataset of spoofed audio augmented with sociolinguistic annotations and causal discovery models, the authors analyze the impact of EDLFs on audio authenticity.

Continuous Learning of Transformer-based Audio Deepfake Detection

audio Published: 2024-09-09 Authors: Tuan Duy Nguyen Le, Kah Kuan Teh, Huy Dat Tran

This paper presents a framework for audio deepfake detection that achieves high accuracy on existing data and adapts effectively to new fake data via continuous learning. It uses an Audio Spectrogram Transformer (AST) model, enhanced with data augmentation and a continuous learning plugin module that outperforms conventional fine-tuning.

Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection

audio Published: 2024-09-08 Authors: Theophile Stourbe, Victor Miara, Theo Lepage, Reda Dehak

This paper presents a speech deepfake detection system that leverages a pre-trained WavLM model as a front-end and explores different back-end techniques for aggregating its representations. The system achieves state-of-the-art results on the ASVspoof 2024 challenge, demonstrating the effectiveness of this approach.

Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024

audio Published: 2024-09-03 Authors: Anmol Guragain, Tianchi Liu, Zihan Pan, Hardik B. Sailor, Qiongqiong Wang

This research proposes ensemble methods using speech foundation models for singing voice deepfake detection, achieving a leading 1.79% pooled equal error rate (EER) on the CtrSVDD evaluation set. A novel Squeeze-and-Excitation Aggregation (SEA) method is introduced to effectively integrate features from these models, improving performance beyond individual systems.

USTC-KXDIGIT System Description for ASVspoof5 Challenge

audio Published: 2024-09-03 Authors: Yihao Chen, Haochen Wu, Nan Jiang, Xiang Xia, Qing Gu, Yunqi Hao, Pengfei Cai, Yu Guan, Jialong Wang, Weilin Xie, Lei Fang, Sian Fang, Yan Song, Wu Guo, Lin Liu, Minqiang Xu

This paper describes the USTC-KXDIGIT system for the ASVspoof5 Challenge, focusing on speech deepfake detection. The approach uses a cascade of feature extractors (handcrafted and self-supervised) and classifiers, employing extensive embedding engineering, data augmentation (including synthesized fake audio), and score fusion for improved robustness and generalization.

AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

audio Published: 2024-08-30 Authors: Kirill Borodin, Vasiliy Kudryavtsev, Dmitrii Korzh, Alexey Efimenko, Grach Mkrtchian, Mikhail Gorodnichev, Oleg Y. Rogov

The paper introduces AASIST3, a novel architecture for speech deepfake detection that enhances the AASIST framework with Kolmogorov-Arnold networks and additional layers. This results in a more than twofold performance improvement, achieving minDCF scores of 0.5357 (closed condition) and 0.1414 (open condition).

SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge

audio Published: 2024-08-28 Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, Zhiyao Duan

The inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices. The challenge features two tracks: a controlled setting track (CtrSVDD) and an in-the-wild scenario track (WildSVDD), with the top team in the CtrSVDD track achieving a 1.65% equal error rate.

Easy, Interpretable, Effective: openSMILE for voice deepfake detection

audio Published: 2024-08-28 Authors: Octavian Pascu, Dan Oneata, Horia Cucu, Nicolas M. Müller

This paper demonstrates highly accurate voice deepfake detection on the ASVspoof5 dataset using a small subset of simple, interpretable features extracted from the openSMILE library. These features, such as mean unvoiced segment length, achieve surprisingly low equal error rates (EERs), with an overall EER of 15.7 ± 6.0%.

Is Audio Spoof Detection Robust to Laundering Attacks?

audio Published: 2024-08-27 Authors: Hashim Ali, Surya Subramani, Shefali Sudhir, Raksha Varahamurthy, Hafiz Malik

This paper introduces a new laundering attack database, ASVSpoof Laundering Database, created by applying various real-world audio distortions to the ASVSpoof 2019 database. Seven state-of-the-art audio spoof detection approaches are evaluated on this new database, revealing their vulnerability to these attacks.

SONICS: Synthetic Or Not -- Identifying Counterfeit Songs

audio Published: 2024-08-26 Authors: Md Awsafur Rahman, Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Bishmoy Paul, Shaikh Anowarul Fattah

This paper introduces SONICS, a large-scale dataset for end-to-end synthetic song detection, addressing limitations of existing datasets. It also proposes SpecTTTra, a novel architecture that efficiently models long-range temporal dependencies in songs, outperforming existing methods in terms of F1 score, speed, and memory usage.

Analyzing the Impact of Splicing Artifacts in Partially Fake Speech Signals

audio Published: 2024-08-25 Authors: Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro

This paper analyzes splicing artifacts in partially fake speech signals, revealing that these artifacts can be exploited for detection without training a dedicated model. The authors achieve a low Equal Error Rate (EER) on two datasets by analyzing the dynamic range of specific frequency bands, highlighting the challenges of generating high-quality spliced audio.

Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples

audio Published: 2024-08-23 Authors: Zhenyu Wang, John H. L. Hansen

This paper proposes a robust synthetic audio spoofing detection system using a RawNet2-based encoder enhanced with a simple attention module, a weighted additive angular margin loss to address data imbalance, and a meta-learning framework for generalization to unseen attacks. The system also incorporates adversarial examples with an auxiliary batch normalization for disentangled training, achieving a pooled EER of 0.87% and a min t-DCF of 0.0277 on the ASVspoof 2019 LA corpus.

BUT Systems and Analyses for the ASVspoof 5 Challenge

audio Published: 2024-08-20 Authors: Johan Rohdin, Lin Zhang, Oldřich Plchot, Vojtěch Staněk, David Mihola, Junyi Peng, Themos Stafylakis, Dmitriy Beveraki, Anna Silnova, Jan Brukner, Lukáš Burget

This paper presents the Brno University of Technology's (BUT) systems for the ASVspoof 2025 challenge, focusing on deepfake detection and spoofing-robust automatic speaker verification (SASV). The main contributions include analyzing different label schemes for deepfake detection and proposing a logistic regression approach for jointly optimizing affine transformations of countermeasure and speaker verification scores in SASV.

Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?

audio Published: 2024-08-20 Authors: Yuankun Xie, Chenxu Xiong, Xiaopeng Wang, Zhiyong Wang, Yi Lu, Xin Qi, Ruibo Fu, Yukun Liu, Zhengqi Wen, Jianhua Tao, Guanjun Li, Long Ye

This paper investigates the effectiveness of current deepfake audio detection models against audio generated by Audio Language Models (ALMs). The study evaluates state-of-the-art countermeasures on 12 types of ALM-generated audio, finding that codec-trained countermeasures achieve surprisingly high detection accuracy, exceeding expectations.

A Noval Feature via Color Quantisation for Fake Audio Detection

audio Published: 2024-08-20 Authors: Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Yukun Liu, Guanjun Li, Xin Qi, Yi Lu, Xuefei Liu, Yongwei Li

This paper introduces a novel fake audio detection method using color quantization to extract features from spectrograms. By constraining the reconstruction to a limited color palette, the method enhances the distinguishability between real and fake audio, improving classification performance compared to using original spectral inputs.

ASASVIcomtech: The Vicomtech-UGR Speech Deepfake Detection and SASV Systems for the ASVspoof5 Challenge

audio Published: 2024-08-19 Authors: Juan M. Martín-Doñas, Eros Roselló, Angel M. Gomez, Aitor Álvarez, Iván López-Espejo, Antonio M. Peinado

This paper details the ASASVIcomtech team's participation in the ASVspoof5 Challenge, focusing on speech deepfake detection (Track 1) and spoofing-aware speaker verification (Track 2). While a closed-condition system using a DCCRN yielded unsatisfactory results, an open-condition ensemble system leveraging self-supervised models and augmented data achieved highly competitive results.

SZU-AFS Antispoofing System for the ASVspoof 5 Challenge

audio Published: 2024-08-19 Authors: Yuxiong Xu, Jiafeng Zhong, Sengui Zheng, Zefeng Liu, Bin Li

This paper introduces SZU-AFS, an anti-spoofing system for the ASVspoof 5 Challenge, focusing on standalone speech deepfake detection. The system leverages a four-stage approach: baseline model selection, data augmentation exploration, a co-enhancement strategy using gradient norm aware minimization (GAM), and logit score fusion, achieving a minDCF of 0.115 and an EER of 4.04% on the evaluation set.

Malacopula: adversarial automatic speaker verification attacks using a neural-based generalised Hammerstein model

audio Published: 2024-08-17 Authors: Massimiliano Todisco, Michele Panariello, Xin Wang, Héctor Delgado, Kong Aik Lee, Nicholas Evans

Malacopula, a neural-based generalized Hammerstein model, generates adversarial perturbations for spoofed speech to deceive automatic speaker verification (ASV) systems. It enhances spoofing attacks by using non-linear processes to modify speech, minimizing the cosine distance between spoofed and bona fide speaker embeddings. Experiments show substantial vulnerability increases, though speech quality degrades and attacks are detectable under controlled conditions.

ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

audio Published: 2024-08-16 Authors: Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi

The ASVspoof 2024 challenge focuses on evaluating speech spoofing and deepfake detection systems. It uses a significantly larger, crowdsourced dataset with diverse acoustic conditions and incorporates adversarial attacks for the first time, pushing the limits of current detection technologies.

WavLM model ensemble for audio deepfake detection

audio Published: 2024-08-14 Authors: David Combei, Adriana Stan, Dan Oneata, Horia Cucu

This paper presents a method for audio deepfake detection using an ensemble of WavLM models. The approach benchmarks various pretrained representations, finding WavLM to be superior, and then finetunes WavLM models with data augmentation, achieving a low equal error rate (EER) in the ASVspoof5 challenge.

Temporal Variability and Multi-Viewed Self-Supervised Representations to Tackle the ASVspoof5 Deepfake Challenge

audio Published: 2024-08-13 Authors: Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Haonan Cheng, Long Ye

This paper tackles open-domain audio deepfake detection in the ASVspoof5 challenge. The authors introduce a novel data augmentation method, Frequency Mask, to address high-frequency gaps in the dataset and combine multiple self-supervised learning features with varied temporal information for improved robustness. Their approach achieves a minDCF of 0.0158 and an EER of 0.55% on the ASVspoof5 evaluation progress set.

ADD 2023: Towards Audio Deepfake Detection and Analysis in the Wild

audio Published: 2024-08-09 Authors: Jiangyan Yi, Chu Yuan Zhang, Jianhua Tao, Chenglong Wang, Xinrui Yan, Yong Ren, Hao Gu, Junzuo Zhou

The ADD 2023 challenge focuses on advancing audio deepfake detection beyond binary classification by tackling tasks like identifying manipulated audio intervals and determining the source algorithm. This paper details the datasets used in the challenge and analyzes the methodologies of top-performing participants, highlighting both successes and limitations.

SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

audio Published: 2024-07-26 Authors: Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj

This paper introduces SLIM, a novel audio deepfake detection model that leverages the style-linguistics mismatch in fake speech. SLIM uses self-supervised pretraining on real speech to learn style-linguistics dependencies, then uses these features with standard acoustic features to classify real and fake audio, outperforming benchmarks on out-of-domain datasets.

GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis

audio Published: 2024-07-15 Authors: Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li

This paper introduces GROOT, a generative robust audio watermarking method that embeds watermarks directly into audio during synthesis using diffusion models. This proactive approach surpasses existing state-of-the-art methods in robustness against various attacks, maintaining high watermark extraction accuracy.

Advancing Continual Learning for Robust Deepfake Audio Classification

audio Published: 2024-07-14 Authors: Feiyi Dong, Qingchen Tang, Yichen Bai, Zihan Wang

This paper proposes CADE, a novel continual learning method for robust deepfake audio classification. CADE uses a fixed memory size to store past data, incorporates two distillation losses to retain old knowledge, and employs a novel embedding similarity loss for better positive sample alignment, outperforming baseline methods on the ASVspoof2019 dataset.

From Real to Cloned Singer Identification

audio Published: 2024-07-11 Authors: Dorian Desblancs, Gabriel Meseguer-Brocal, Romain Hennequin, Manuel Moussallam

This paper investigates the use of singer identification methods for detecting cloned voices in music. Three embedding models trained with a singer-level contrastive learning scheme are evaluated on real and cloned voices, revealing a significant performance drop when classifying cloned voices, particularly for models using mixtures as input.

Source Tracing of Audio Deepfake Systems

audio Published: 2024-07-10 Authors: Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury

This research introduces a system for classifying audio deepfake generation attributes (input type, acoustic model, vocoder) rather than simply detecting deepfakes. The system leverages existing spoofing countermeasure architectures and is evaluated on ASVspoof 2019 and MLAAD datasets, demonstrating robustness in identifying deepfake generation techniques.

Targeted Augmented Data for Audio Deepfake Detection

audio Published: 2024-07-10 Authors: Marcella Astrid, Enjie Ghorbel, Djamila Aouada

This paper proposes a novel data augmentation method for improving the generalization capabilities of audio deepfake detectors. By perturbing real audio data to create pseudo-fakes near the model's decision boundary, the method enhances the diversity of training data and mitigates overfitting to specific manipulation techniques.

Two-Path GMM-ResNet and GMM-SENet for ASV Spoofing Detection

audio Published: 2024-07-08 Authors: Zhenchun Lei, Hui Yan, Changhong Liu, Minglei Ma, Yingen Yang

This paper proposes two-path GMM-ResNet and GMM-SENet models for audio spoofing detection. These models leverage Gaussian probability features from two GMMs (one for genuine and one for spoofed speech) and utilize ResNet and SENet architectures to capture both score distribution on GMM components and inter-frame relationships, achieving significant performance improvements over the baseline GMM.

Towards Attention-based Contrastive Learning for Audio Spoof Detection

audio Published: 2024-07-03 Authors: Chirag Goel, Surya Koppisetti, Ben Colman, Ali Shahriyari, Gaurav Bharaj

This paper introduces an attention-based contrastive learning framework (SSAST-CL) for audio spoof detection using Vision Transformers (ViTs). SSAST-CL improves upon a baseline ViT model by incorporating cross-attention to enhance representation learning, achieving competitive performance on the ASVSpoof 2021 challenge.

GMM-ResNet2: Ensemble of Group ResNet Networks for Synthetic Speech Detection

audio Published: 2024-07-02 Authors: Zhenchun Lei, Hui Yan, Changhong Liu, Yong Zhou, Minglei Ma

This paper proposes GMM-ResNet2, an improved model for synthetic speech detection. It enhances a previous GMM-ResNet model by using multi-scale Log Gaussian Probability features, a grouping technique for ensemble learning, an improved residual block, and an ensemble-aware loss function, resulting in significant performance gains on ASVspoof 2019 and 2021 datasets.

Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

audio Published: 2024-07-01 Authors: Lam Pham, Phat Lam, Truong Nguyen, Huyen Nguyen, Alexander Schindler

This paper presents a deep learning system for deepfake audio detection using an ensemble of models. The system leverages multiple spectrograms with different transformations and auditory filters, and combines three deep learning approaches: end-to-end training, transfer learning, and audio embedding extraction from pre-trained models. The ensemble achieves a highly competitive Equal Error Rate (EER) of 0.03 on the ASVspoof 2019 dataset.

SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent Signatures

audio Published: 2024-07-01 Authors: Oguzhan Baser, Kaan Kale, Sandeep P. Chinchali

SecureSpectra embeds irreversible signatures in audio to defend against deepfake threats, leveraging the inability of deepfake models to replicate high-frequency content. Differential privacy protects signatures from reverse engineering, achieving high detection accuracy with minimal performance compromise.

Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models

audio Published: 2024-06-27 Authors: Borodin Kirill Nikolayevich, Kudryavtsev Vasiliy Dmitrievich, Mkrtchian Grach Maratovich, Gorodnichev Mikhail Genadievich, Korzh Dmitrii Sergeevich

This paper presents an automatic speaker verification (ASV) system that extracts embeddings from audio to capture voice characteristics like pitch and phoneme duration. This system was used in the SSTC challenge to verify voice-converted audio, achieving an equal error rate (EER) of 20.669%.

Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

audio Published: 2024-06-25 Authors: Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

This paper proposes a Temporal-Channel Modeling (TCM) module to improve synthetic speech detection by enhancing the multi-head self-attention mechanism in Transformer models. The TCM module effectively captures temporal-channel dependencies in the input speech representation, leading to significant performance gains.

One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

audio Published: 2024-06-24 Authors: Hyun Myung Kim, Kangwook Jang, Hoirin Kim

This paper proposes a novel adaptive centroid shift (ACS) method for audio deepfake detection using one-class learning. ACS updates the centroid representation using only bonafide samples, creating a robust model against unseen spoofing attacks. The method achieves a state-of-the-art equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset.

Deepfake tweets automatic detection

audio Published: 2024-06-24 Authors: Adam Frej, Adrian Kaminski, Piotr Marciniak, Szymon Szmajdzinski, Soveatin Kuntur, Anna Wroblewska

This research focuses on detecting deepfake tweets using natural language processing (NLP) techniques. It evaluates various machine learning models on the TweepFake dataset to identify effective strategies for recognizing AI-generated text and improving the reliability of online information.

Frequency-mix Knowledge Distillation for Fake Speech Detection

audio Published: 2024-06-14 Authors: Cunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv

This paper proposes Frequency-mix knowledge distillation (FKD) for fake speech detection, addressing information loss in existing data augmentation methods. FKD uses a teacher model trained on frequency-mixed data and a student model trained on time-domain augmented data, with multi-level feature distillation to improve information extraction and generalization.

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio

audio Published: 2024-06-12 Authors: Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi

This paper introduces Codecfake, a new dataset for detecting LLM-based deepfake audio generated using neural codecs. Codecfake shows that ADD models trained on this dataset significantly outperform those trained on vocoder-based datasets, achieving a 41.406% reduction in average equal error rate.

FakeSound: Deepfake General Audio Detection

audio Published: 2024-06-12 Authors: Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu

This paper introduces the task of deepfake general audio detection, proposing the FakeSound dataset generated via an automated manipulation pipeline. A benchmark deepfake detection model, surpassing human performance and state-of-the-art speech deepfake detection, is presented.

Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection

audio Published: 2024-06-12 Authors: Zihan Pan, Tianchi Liu, Hardik B. Sailor, Qiongqiong Wang

This paper investigates the use of the WavLM model for anti-spoofing detection, proposing an attentive merging method to combine hierarchical hidden embeddings from multiple transformer layers. The approach achieves state-of-the-art equal error rates (EERs) on ASVspoof datasets, demonstrating the effectiveness of this method and the importance of early transformer layers.

CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

audio Published: 2024-06-11 Authors: Haibin Wu, Yuan Tseng, Hung-yi Lee

This paper introduces CodecFake, the first dataset of deepfake audios generated using state-of-the-art codec-based speech synthesis systems. It demonstrates that existing anti-spoofing models fail to detect these deepfakes and shows that training on CodecFake significantly improves detection accuracy.

RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection

audio Published: 2024-06-10 Authors: Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan

This paper introduces RawBMamba, a bidirectional end-to-end state space model for audio deepfake detection. It combines short-range features extracted using sinc layers and convolutional layers with long-range features captured by a bidirectional Mamba model, improving upon the unidirectional limitations of previous Mamba models. The resulting model significantly outperforms existing methods on several datasets.

Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy

audio Published: 2024-06-05 Authors: Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonnan Cheng, Long Ye, Jianhua Tao

This paper introduces the Real Emphasis and Fake Dispersion (REFD) strategy for audio deepfake algorithm recognition, focusing on both in-distribution (ID) and out-of-distribution (OOD) detection. REFD uses a two-stage approach, emphasizing real audio detection in the first stage and focusing on fake audio classification and OOD detection in the second, achieving a state-of-the-art 86.83% F1-score on Audio Deepfake Detection Challenge 2023 Track 3.

Generalized Fake Audio Detection via Deep Stable Learning

audio Published: 2024-06-05 Authors: Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Yi Lu, Xin Qi, Shuchen Shi

This paper proposes a Sample Weight Learning (SWL) module for generalized fake audio detection. SWL addresses distribution shift by decorrelating features via learned sample weights, improving generalization across datasets without needing extra training data or complex training processes.

Harder or Different? Understanding Generalization of Audio Deepfake Detection

audio Published: 2024-06-05 Authors: Nicolas M. Müller, Nicholas Evans, Hemlata Tak, Philip Sperl, Konstantin Böttinger

This research investigates the generalization problem in audio deepfake detection, determining whether poor performance on unseen deepfakes is due to increased difficulty ('hardness') or fundamental differences ('difference') between deepfake generation methods. The study finds that performance gaps are primarily attributed to 'difference', implying that simply increasing model capacity is insufficient for robust generalization.

Singing Voice Graph Modeling for SingFake Detection

audio Published: 2024-06-05 Authors: Xuanjun Chen, Haibin Wu, Jyh-Shing Roger Jang, Hung-yi Lee

This paper introduces SingGraph, a novel model for singing voice deepfake (SingFake) detection. SingGraph combines MERT and wav2vec2.0 models for pitch/rhythm and lyric analysis, respectively, and uses RawBoost and beat matching for data augmentation, achieving state-of-the-art results on the SingFake dataset.

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

audio Published: 2024-06-04 Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

The paper introduces CtrSVDD, a large-scale dataset for singing voice deepfake detection, addressing limitations in existing datasets through enhanced controllability, diversity, and openness. It also presents a baseline system for evaluating different audio features in detecting deepfakes.

Towards Out-of-Distribution Detection in Vocoder Recognition via Latent Feature Reconstruction

audio Published: 2024-06-04 Authors: Renmingyue Du, Jixun Yao, Qiuqiang Kong, Yin Cao

This paper proposes a reconstruction-based approach for out-of-distribution (OOD) detection in vocoder recognition using an autoencoder with multiple decoders, one for each vocoder class. If none of the decoders can reconstruct an input feature satisfactorily, it's classified as OOD. Contrastive learning and an auxiliary classifier enhance the approach's performance.

Towards Robust Audio Deepfake Detection: A Evolving Benchmark for Continual Learning

audio Published: 2024-05-14 Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao

This paper introduces EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA addresses the challenge of traditional methods struggling to adapt to evolving synthetic speech by incorporating classic and newly generated deepfake audio datasets and supporting various continual learning techniques.

SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

audio Published: 2024-05-08 Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

This paper introduces the SVDD Challenge 2024, the first research challenge focused on singing voice deepfake detection. The challenge features two tracks, one with controlled, isolated vocals and another with in-the-wild recordings containing background music, to advance research in this specialized area.

The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

audio Published: 2024-05-08 Authors: Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun

This paper introduces Codecfake, a large-scale dataset of over 1 million audio samples for detecting ALM-based deepfake audio generated using neural codecs. To improve generalization, they propose CSAM, a co-training sharpness aware minimization strategy that addresses domain ascent bias, achieving a low average equal error rate (EER) of 0.616%.

Detecting music deepfakes is easy but actually hard

audio Published: 2024-05-07 Authors: Darius Afchar, Gabriel Meseguer-Brocal, Romain Hennequin

This paper introduces the first music deepfake detector, achieving surprisingly high accuracy (99.8%) using convolutional neural networks trained on real and auto-encoded audio. However, it emphasizes the limitations of this approach, highlighting the need for further research into robustness, generalization, calibration, and interpretability.

Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models

audio Published: 2024-05-03 Authors: Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva

This paper proposes a training-free approach for audio deepfake detection leveraging large-scale pre-trained models. The method reformulates the problem as speaker verification, identifying fake audios through mismatch with a reference set of the claimed speaker's voice. This eliminates the need for training on fake data, ensuring generalization to unseen synthesis methods.

Device Feature based on Graph Fourier Transformation with Logarithmic Processing For Detection of Replay Speech Attacks

audio Published: 2024-04-26 Authors: Mingrui He, Longting Xu, Han Wang, Mingjun Zhang, Rohan Kumar Das

This paper proposes novel audio features for replay speech attack detection in automatic speaker verification. These features, GFLC, GFDCC, and GFLDC, are derived using graph Fourier transform, logarithmic processing, and a device-related linear transformation, improving upon previous methods that ignored device and environmental noise effects. The proposed features outperform existing front-ends on multiple datasets.

CLAD: Robust Audio Deepfake Detection Against Manipulation Attacks with Contrastive Learning

audio Published: 2024-04-24 Authors: Haolin Wu, Jing Chen, Ruiying Du, Cong Wu, Kun He, Xingcan Shang, Hao Ren, Guowen Xu

This paper presents CLAD, a contrastive learning-based audio deepfake detector robust to manipulation attacks. CLAD incorporates contrastive learning to minimize variations caused by manipulations and a length loss to improve clustering of real audios, significantly enhancing detection robustness against various attacks.

Every Breath You Don't Take: Deepfake Speech Detection Using Breath

audio Published: 2024-04-23 Authors: Seth Layton, Thiago De Andrade, Daniel Olszewski, Kevin Warren, Kevin Butler, Patrick Traynor

This paper proposes a novel deepfake speech detection method using breath as a discriminator. A breath detector is trained and used to extract breath features from audio samples, which are then used to classify real and deepfake speech with high accuracy, outperforming a state-of-the-art model.

A Survey on Speech Deepfake Detection

audio Published: 2024-04-22 Authors: Menglu Li, Yasaman Ahmadiadli, Xiao-Ping Zhang

This survey paper analyzes over 200 research papers on speech deepfake detection published up to March 2024. It provides a comprehensive review of the detection pipeline, including model architectures, optimization techniques, datasets, and evaluation metrics, identifying current state-of-the-art and suggesting future research directions.

Retrieval-Augmented Audio Deepfake Detection

audio Published: 2024-04-22 Authors: Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

This paper proposes a Retrieval-Augmented Detection (RAD) framework for audio deepfake detection, which augments test samples with similar retrieved samples to improve detection accuracy. The RAD framework, extended with a multi-fusion attentive classifier, achieves state-of-the-art results on ASVspoof 2021 DF and competitive results on 2019 and 2021 LA datasets.

Enhancing Generalization in Audio Deepfake Detection: A Neural Collapse based Sampling and Training Approach

audio Published: 2024-04-19 Authors: Mohammed Yousif, Jonat John Mathew, Huzaifa Pallan, Agamjeet Singh Padda, Syed Daniyal Shah, Sara Adamski, Madhu Reddiboina, Arjun Pankajakshan

This paper proposes a neural collapse-based sampling approach for enhancing generalization in audio deepfake detection. By sampling confidently classified data points from pre-trained models on diverse datasets, it creates a smaller, more efficient training database that improves generalization on unseen data without the computational cost of training on massive datasets.

Cross-Domain Audio Deepfake Detection: Dataset and Analysis

audio Published: 2024-04-07 Authors: Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang

This paper introduces a new cross-domain audio deepfake detection (CD-ADD) dataset with over 300 hours of speech data generated by five advanced zero-shot TTS models, addressing the limitations of existing datasets. Experiments using Wav2Vec2 and Whisper models demonstrate high accuracy and few-shot learning capabilities, highlighting the challenges posed by neural codec compression.

Heterogeneity over Homogeneity: Investigating Multilingual Speech Pre-Trained Models for Detecting Audio Deepfake

audio Published: 2024-03-31 Authors: Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma

This research investigates the effectiveness of multilingual speech pre-trained models (PTMs) for audio deepfake detection. The study finds that multilingual PTMs outperform monolingual PTMs and propose a novel fusion framework, MiO, achieving state-of-the-art performance on two datasets and comparable performance on a third.

Detection of Deepfake Environmental Audio

audio Published: 2024-03-26 Authors: Hafsa Ouajdi, Oussama Hadder, Modan Tailleur, Mathieu Lagrange, Laurie M. Heller

This paper proposes a deepfake environmental audio detection pipeline using CLAP audio embeddings. Evaluated on the 2023 DCASE challenge dataset, the method achieves 98% accuracy in detecting fake sounds generated by 44 state-of-the-art synthesizers, showing a 10% improvement over using VGGish embeddings.

Exploring Green AI for Audio Deepfake Detection

audio Published: 2024-03-21 Authors: Subhajit Saha, Md Sahidullah, Swagatam Das

This research proposes a green AI framework for audio deepfake detection using pre-trained self-supervised learning (SSL) models and classical machine learning algorithms. Instead of fine-tuning large deep neural networks, it leverages embeddings from these pre-trained models with simpler classifiers, achieving competitive results with significantly reduced computational cost.

Towards the Development of a Real-Time Deepfake Audio Detection System in Communication Platforms

audio Published: 2024-03-18 Authors: Jonat John Mathew, Rakin Ahsan, Sae Furukawa, Jagdish Gautham Krishna Kumar, Huzaifa Pallan, Agamjeet Singh Padda, Sara Adamski, Madhu Reddiboina, Arjun Pankajakshan

This research explores the feasibility of using static deepfake audio detection models in real-time communication platforms. Two models (ResNet and LCNN) were implemented and tested on the ASVspoof 2019 dataset and a new dataset from Microsoft Teams meetings, demonstrating the challenges of adapting static models to dynamic real-time scenarios.

A robust audio deepfake detection system via multi-view feature

audio Published: 2024-03-04 Authors: Yujie Yang, Haochen Qin, Hang Zhou, Chengcheng Wang, Tianyu Guo, Kai Han, Yunhe Wang

This paper improves audio deepfake detection by exploring various audio features (handcrafted and learning-based) and proposes multi-view feature incorporation methods (feature selection and fusion). The model, trained on ASV2019 data, achieves a 24.27% equal error rate on the In-the-Wild dataset, demonstrating improved generalization.

Advanced Signal Analysis in Detecting Replay Attacks for Automatic Speaker Verification Systems

audio Published: 2024-03-02 Authors: Lee Shih Kuang

This research introduces novel signal analysis methods (arbitrary analysis, mel scale analysis, and constant Q analysis) inspired by the Fourier inversion formula for replay speech detection in automatic speaker verification. These methods improve efficiency and effectiveness in analyzing speech signals, particularly when integrated with temporal autocorrelation of speech features.

PITCH: AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response

audio Published: 2024-02-28 Authors: Govind Mittal, Arthur Jakobsson, Kelly O. Marshall, Chinmay Hegde, Nasir Memon

PITCH, a challenge-response system, enhances deepfake audio detection by incorporating audio challenges designed to exploit weaknesses in voice cloning technology. This human-AI collaborative system achieves 84.5% accuracy, significantly improving upon human-only performance (72.6%) by leveraging machine precision while maintaining human decision authority.

Experimental Study: Enhancing Voice Spoofing Detection Models with wav2vec 2.0

audio Published: 2024-02-27 Authors: Taein Kang, Soyul Han, Sunmook Choi, Jaejin Seo, Sanghyeok Chung, Seungeun Lee, Seungsang Oh, Il-Youp Kwak

This research investigates using wav2vec 2.0 as an audio feature extractor for voice spoofing detection. By selectively choosing and fine-tuning Transformer layers within wav2vec 2.0, the authors achieve state-of-the-art performance on the ASVspoof 2019 LA dataset with various spoofing detection back-end models.

Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

audio Published: 2024-01-20 Authors: Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

This paper proposes a generalized standalone ASV (G-SASV) system for speaker verification that is robust to spoofing attacks. It achieves this by enhancing a simple deep neural network backend using limited spoofing data during training, without requiring a separate spoofing countermeasure module during testing. The approach improves the performance of statistical ASV backends significantly.

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

audio Published: 2024-01-17 Authors: Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger

The paper introduces MLAAD v7, a multi-language audio anti-spoofing dataset containing 485.3 hours of synthetic speech in 40 languages generated using 101 TTS models. Experiments show that models trained on MLAAD achieve superior performance compared to models trained on other datasets, demonstrating its value as a training resource for robust audio deepfake detection.

Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection

audio Published: 2024-01-11 Authors: Lian Huang, Chi-Man Pun

This paper proposes a novel framework for replay and deepfake audio detection using hybrid features and a self-attention mechanism. The approach combines deep learning features and Mel-spectrogram features, leveraging self-attention to focus on essential elements for improved discrimination. This results in significantly lower Equal Error Rates (EERs) compared to baseline systems on the ASVspoof 2021 dataset.

AntiDeepFake: AI for Deep Fake Speech Recognition

audio Published: 2024-01-04 Authors: Enkhtogtokh Togootogtokh, Christian Klasen

This research presents AntiDeepFake, an AI system for deepfake speech recognition. The system uses a pipeline encompassing data collection, feature extraction, feature engineering, AI modeling (with CatBoost, XGBoost, and TabNet), and evaluation, achieving high accuracy in differentiating real and synthetic speech.

What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection

audio Published: 2023-12-15 Authors: Xiaohui Zhang, Jiangyan Yi, Chenglong Wang, Chuyuan Zhang, Siding Zeng, Jianhua Tao

This paper proposes Radian Weight Modification (RWM), a continual learning approach for audio deepfake detection. RWM categorizes audio classes based on feature distribution compactness to adapt gradient modification directions, improving knowledge acquisition and mitigating forgetting when encountering new deepfake types.

Audio Deepfake Detection with Self-Supervised WavLM and Multi-Fusion Attentive Classifier

audio Published: 2023-12-13 Authors: Yinlin Guo, Haofan Huang, Xi Chen, He Zhao, Yuehai Wang

This paper proposes a novel audio deepfake detection method combining the self-supervised WavLM model for feature extraction and a Multi-Fusion Attentive (MFA) classifier for improved spoofing detection. The MFA classifier leverages complementary information from audio features at both time and layer levels, achieving state-of-the-art results on the ASVspoof 2021 DF set.

MFAAN: Unveiling Audio Deepfakes with a Multi-Feature Authenticity Network

audio Published: 2023-11-06 Authors: Karthik Sivarama Krishnan, Koushik Sivarama Krishnan

The paper introduces MFAAN, a multi-feature audio authenticity network for detecting audio deepfakes. MFAAN uses multiple parallel paths processing MFCC, LFCC, and Chroma-STFT features, achieving high accuracy on benchmark datasets.

Audio compression-assisted feature extraction for voice replay attack detection

audio Published: 2023-10-09 Authors: Xiangyu Shi, Yuhao Luo, Li Wang, Haorui He, Hao Li, Lei Wang, Zhizheng Wu

This research proposes a novel audio deepfake detection approach using audio compression. By comparing the original audio with a compressed and decompressed version, the method extracts features reflecting channel noise introduced during replay attacks. The approach achieves a state-of-the-art equal error rate (EER) of 22.71% on the ASVspoof 2021 PA evaluation set.

Thech. Report: Genuinization of Speech waveform PMF for speaker detection spoofing and countermeasures

audio Published: 2023-10-09 Authors: Itshak Lapidot, Jean-Francois Bonastre

This tech report investigates the impact of waveform probability mass function (PMF) on speaker spoofing detection. It proposes a 'genuinization' algorithm to reduce the PMF distribution gap between genuine and spoofed speech, improving spoofing detection performance.

Securing Voice Biometrics: One-Shot Learning Approach for Audio Deepfake Detection

audio Published: 2023-10-05 Authors: Awais Khan, Khalid Mahmood Malik

This paper introduces Quick-SpoofNet, a one-shot learning and metric learning approach for audio deepfake detection. It uses a robust spectral feature set and a Siamese LSTM network to generate temporal embeddings, effectively classifying bona fide and spoofed speech even for unseen attacks.

Collaborative Watermarking for Adversarial Speech Synthesis

audio Published: 2023-09-26 Authors: Lauri Juvela, Xin Wang

This paper proposes a collaborative training scheme for synthetic speech watermarking, where a HiFi-GAN vocoder is trained alongside ASVspoof 2021 baseline countermeasure models to embed a watermark aiding detection while maintaining audio quality. The approach improves detection performance compared to conventional training methods and shows robustness against noise and time-stretching.

Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

audio Published: 2023-09-22 Authors: Alexandre R. Ferreira, Cláudio E. C. Campelo

This paper proposes a framework using deepfake audio for data augmentation in training automatic speech-to-text transcription models, addressing the scarcity of diverse labeled datasets for less popular languages. Experiments were conducted using a voice cloner and an Indian English dataset to evaluate the framework's impact on transcription accuracy.

Bridging the Spoof Gap: A Unified Parallel Aggregation Network for Voice Presentation Attacks

audio Published: 2023-09-19 Authors: Awais Khan, Khalid Mahmood Malik

This paper proposes a Parallel Stacked Aggregation Network (PSA) for unified voice spoofing detection, addressing the gap in existing research that tackles logical and physical attacks separately. The PSA network processes raw audio using a split-transform-aggregation technique to identify both logical and physical attacks, outperforming state-of-the-art solutions with reduced Equal Error Rate (EER) disparities.

Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection

audio Published: 2023-09-18 Authors: Awais Khan, Khalid Mahmood Malik, Shah Nawaz

This paper proposes a unified voice spoofing detection method using a spectra-temporal fusion approach. It combines frame-level spectral deviation coefficients (SDC) with utterance-level sequential temporal coefficients (STC) via an autoencoder to generate robust spectra-temporal deviated coefficients (STDC), effectively detecting various spoofing attacks.

Spoofing attack augmentation: can differently-trained attack models improve generalisation?

audio Published: 2023-09-18 Authors: Wanying Ge, Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Nicholas Evans

This paper investigates the variability of deepfake detection performance due to differences in spoofing attack model training. It demonstrates that even subtle changes in the training of spoofing attacks can significantly impact detection accuracy, and proposes spoofing attack augmentation as a complementary technique to improve generalization.

One-Class Knowledge Distillation for Spoofing Speech Detection

audio Published: 2023-09-15 Authors: Jingze Lu, Yuxiang Zhang, Wenchao Wang, Zengqiang Shang, Pengyuan Zhang

This paper proposes a one-class knowledge distillation (OCKD) method for spoofing speech detection that addresses the generalization limitations of traditional binary classification approaches. OCKD uses a teacher-student framework, where a teacher model trained on both bonafide and spoofed speech guides a student model trained only on bonafide speech, resulting in improved performance on unseen spoofing attacks.

HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods

audio Published: 2023-09-15 Authors: Hyun-seo Shin, Jungwoo Heo, Ju-ho Kim, Chan-yeong Lim, Wonbin Kim, Ha-Jin Yu

This paper proposes HM-Conformer, an audio deepfake detection system that improves upon the Conformer architecture by incorporating hierarchical pooling to reduce redundant information and multi-level classification token aggregation to leverage information from different encoder blocks. This results in improved performance on the ASVspoof 2021 Deepfake dataset.

Characterizing the temporal dynamics of universal speech representations for generalizable deepfake detection

audio Published: 2023-09-15 Authors: Yi Zhu, Saurabh Powar, Tiago H. Falk

This paper addresses the lack of generalizability in deepfake speech detection systems by focusing on the long-term temporal dynamics of universal speech representations. The authors propose a method to characterize these dynamics, showing that different generative models produce similar dynamic patterns, leading to improved deepfake detection performance on unseen attacks.

SingFake: Singing Voice Deepfake Detection

audio Published: 2023-09-14 Authors: Yongyi Zang, You Zhang, Mojtaba Heydari, Zhiyao Duan

This paper introduces the task of singing voice deepfake detection and presents SingFake, the first in-the-wild dataset for this task, comprising 28.93 hours of real and 29.40 hours of deepfake singing voice clips. The authors evaluate existing speech deepfake detection systems on this dataset, demonstrating their limitations and highlighting the need for specialized methods for singing voice deepfake detection.

Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?

audio Published: 2023-09-12 Authors: Xin Wang, Junichi Yamagishi

This research explores using large-scale vocoded speech data to improve speech spoofing countermeasures (CMs). By continually training a self-supervised learning (SSL) model on over 9,000 hours of vocoded data, the authors demonstrate significant improvements in CM performance on various unseen test sets, surpassing previous state-of-the-art results.

Towards generalisable and calibrated synthetic speech detection with self-supervised representations

audio Published: 2023-09-11 Authors: Octavian Pascu, Adriana Stan, Dan Oneata, Elisabeta Oneata, Horia Cucu

This paper proposes using pretrained self-supervised representations (specifically, wav2vec 2.0 variants) with a simple logistic regression classifier for audio deepfake detection. This approach significantly improves generalization capabilities and calibration compared to existing methods, reducing the equal error rate from 30.9% to 8.8% on a benchmark of eight deepfake datasets.

An Efficient Temporary Deepfake Location Approach Based Embeddings for Partially Spoofed Audio Detection

audio Published: 2023-09-06 Authors: Yuankun Xie, Haonan Cheng, Yutian Wang, Long Ye

This paper proposes Temporal Deepfake Location (TDL), a fine-grained partially spoofed audio detection method. TDL uses an embedding similarity module to separate real and fake audio frames in an embedding space and a temporal convolution operation to focus on positional information, improving detection accuracy.

FSD: An Initial Chinese Dataset for Fake Song Detection

audio Published: 2023-09-05 Authors: Yuankun Xie, Jingjing Zhou, Xiaolin Lu, Zhenghao Jiang, Yuxin Yang, Haonan Cheng, Long Ye

This paper introduces the FSD dataset, a novel Chinese Fake Song Detection dataset created using five state-of-the-art singing voice synthesis and conversion methods. Experiments show that models trained on FSD significantly outperform speech-trained models in detecting deepfake songs, achieving a 38.58% reduction in average equal error rate.

Audio Deepfake Detection: A Survey

audio Published: 2023-08-29 Authors: Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, Yan Zhao

This survey paper provides a comprehensive overview of audio deepfake detection, analyzing state-of-the-art approaches, datasets, features, and classifiers. It also performs a unified comparison of these methods on various datasets and highlights challenges for future research, such as the need for larger, more diverse datasets.

Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

audio Published: 2023-08-24 Authors: Jordan J. Bird, Ahmad Lotfi

This paper introduces the DEEP-VOICE dataset for AI-generated speech detection and demonstrates that an Extreme Gradient Boosting model achieves 99.3% accuracy in real-time classification of real versus AI-generated speech (using Retrieval-based Voice Conversion), with an inference time of around 0.004 milliseconds per second of audio.

Complex-valued neural networks for voice anti-spoofing

audio Published: 2023-08-22 Authors: Nicolas M. Müller, Philip Sperl, Konstantin Böttinger

This paper proposes using complex-valued neural networks to process complex-valued constant-Q transforms (CQT) of audio for voice anti-spoofing. This approach retains phase information, improving detection accuracy and enabling explainable AI methods. The results show superior performance compared to existing methods on the In-the-Wild dataset.

The DKU-DUKEECE System for the Manipulation Region Location Task of ADD 2023

audio Published: 2023-08-20 Authors: Zexin Cai, Weiqing Wang, Yikang Wang, Ming Li

This paper presents a system for the Audio Deepfake Detection Challenge (ADD 2023) Track 2, focusing on locating manipulated regions in audio. The system integrates three models: a boundary detection model, an anti-spoofing detection model, and a VAE model, achieving first place with a final ADD score of 0.6713.

Spatial Reconstructed Local Attention Res2Net with F0 Subband for Fake Speech Detection

audio Published: 2023-08-19 Authors: Cunhang Fan, Jun Xue, Jianhua Tao, Jiangyan Yi, Chenglong Wang, Chengshi Zheng, Zhao Lv

This paper proposes a novel fake speech detection method using an F0 subband and a spatial reconstructed local attention Res2Net (SR-LA Res2Net) architecture. The method leverages the discriminative information in the fundamental frequency (F0) subband, effectively modeled by SR-LA Res2Net to achieve state-of-the-art performance on the ASVspoof 2019 LA dataset.

Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms

audio Published: 2023-08-18 Authors: Penghui Wen, Kun Hu, Wenxi Yue, Sen Zhang, Wanlei Zhou, Zhiyong Wang

This paper proposes S2pecNet, a deep learning method for robust audio anti-spoofing that leverages multi-order spectral patterns (raw and power spectrograms). It uses a fusion-reconstruction strategy for effective feature representation, achieving state-of-the-art performance on the ASVspoof2019 LA Challenge.

All-for-One and One-For-All: Deep learning-based feature fusion for Synthetic Speech Detection

audio Published: 2023-07-28 Authors: Daniele Mari, Davide Salvi, Paolo Bestagini, Simone Milani

This paper proposes a deep learning-based synthetic speech detection model that fuses three different feature sets (FD, STLT, and bicoherence features) for improved performance. The fused model outperforms state-of-the-art solutions and demonstrates robustness to anti-forensic attacks.

An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023

audio Published: 2023-07-03 Authors: Sheng Zhao, Qilong Yuan, Yibo Duan, Zhuoyue Chen

This paper presents an end-to-end multi-module audio deepfake generation system consisting of a speaker encoder, a Tacotron2-based synthesizer, and a WaveRNN-based vocoder. This system achieved first place in the ADD 2023 challenge Track 1.1, demonstrating high-quality synthetic speech generation.

Multi-perspective Information Fusion Res2Net with RandomSpecmix for Fake Speech Detection

audio Published: 2023-06-27 Authors: Shunbo Dong, Jun Xue, Cunhang Fan, Kang Zhu, Yujie Chen, Zhao Lv

This paper proposes MPIF-Res2Net with random Specmix for fake speech detection, aiming to improve the model's ability to learn precise forgery information in low-quality scenarios. The approach uses multi-perspective information fusion to reduce redundant information and random Specmix for data augmentation to enhance generalization and improve discriminative information location.

TranssionADD: A multi-frame reinforcement based sequence tagging model for audio deepfake detection

audio Published: 2023-06-27 Authors: Jie Liu, Zhiba Su, Hui Huang, Caiyan Wan, Quanxiu Wang, Jiangli Hong, Benlai Tang, Fengjie Zhu

This paper proposes TranssionADD, a multi-frame reinforcement-based sequence tagging model for audio deepfake detection. It improves model robustness and handles outliers by using a multi-frame detection module and an isolated-frame penalty loss, achieving 2nd place in Track 2 of the ADD 2023 challenge.

Malafide: a novel adversarial convolutive noise attack against deepfake and spoofing detection systems

audio Published: 2023-06-13 Authors: Michele Panariello, Wanying Ge, Hemlata Tak, Massimiliano Todisco, Nicholas Evans

The paper introduces Malafide, a universal adversarial attack against automatic speaker verification (ASV) spoofing countermeasures (CMs). Malafide uses optimized linear time-invariant filters to introduce convolutional noise, degrading CM performance significantly while preserving speech quality, even in black-box settings.

Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection

audio Published: 2023-06-09 Authors: Chenglong Wang, Jiangyan Yi, Xiaohui Zhang, Jianhua Tao, Le Xu, Ruibo Fu

This paper introduces a low-rank adaptation (LoRA) method for efficient fine-tuning of the wav2vec2 model for fake audio detection. By freezing pre-trained weights and adding trainable low-rank matrices, it significantly reduces the number of trainable parameters while maintaining comparable performance to full fine-tuning.

Improved DeepFake Detection Using Whisper Features

audio Published: 2023-06-02 Authors: Piotr Kawa, Marcin Plata, Michał Czuba, Piotr Szymański, Piotr Syga

This paper investigates using the Whisper automatic speech recognition model as a front-end for audio deepfake detection. By incorporating Whisper features with existing front-ends and training three detection models, the authors demonstrate improved detection accuracy, reducing the Equal Error Rate by 21% on the DeepFakes In-The-Wild dataset.

Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification

audio Published: 2023-05-30 Authors: Qing Wang, Jixun Yao, Ziqian Wang, Pengcheng Guo, Lei Xie

This paper proposes a timbre-reserved adversarial attack for speaker identification (SID) that generates fake audio while preserving the target speaker's timbre, even in black-box settings. This is achieved using a pseudo-Siamese network to learn from a black-box SID model, constraining both intrinsic and structural similarity, and incorporating adversarial constraints during voice conversion model training.

Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion

audio Published: 2023-05-25 Authors: Rui Liu, Jinhua Zhang, Guanglai Gao, Haizhou Li

This paper introduces M2S-ADD, a novel audio deepfake detection model that leverages stereo audio information. It converts mono audio to stereo using a pre-trained model and then employs a dual-branch neural network to analyze the left and right channels, improving detection accuracy.

ADD 2023: the Second Audio Deepfake Detection Challenge

audio Published: 2023-05-23 Authors: Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li

The ADD 2023 challenge focuses on advancing audio deepfake detection beyond binary classification. It introduces three sub-challenges: audio fake game, manipulation region localization, and deepfake algorithm recognition, pushing research toward more realistic and nuanced detection.

TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection

audio Published: 2023-05-23 Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, Ruibo Fu, Xun Chen

This paper proposes TO-RawNet, a novel deep neural network architecture for fake audio detection. It improves upon RawNet by incorporating orthogonal convolution to reduce filter correlation and temporal convolutional networks (TCNs) to capture long-term dependencies in speech signals, resulting in a significant reduction in Equal Error Rate (EER).

Towards generalizing deep-audio fake detection networks

audio Published: 2023-05-22 Authors: Konstantin Gasenzer, Moritz Wolter

This paper addresses the limited generalization ability of deep audio fake detectors to unseen generators by identifying stable frequency domain fingerprints of various audio generators. Using these fingerprints, the authors train lightweight, generalizing detectors that achieve improved results on the WaveFake dataset and its extended version.

Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms

audio Published: 2023-05-18 Authors: Chang Zeng, Xin Wang, Xiaoxiao Miao, Erica Cooper, Junichi Yamagishi

This paper addresses the generalization problem in audio deepfake detection when encountering unseen audio genres. It proposes a multi-task learning method that combines meta-optimization and genre alignment regularization to improve the generalization ability of countermeasure models. Experimental results demonstrate significant performance improvement compared to baseline systems in cross-genre scenarios.

Using Deepfake Technologies for Word Emphasis Detection

audio Published: 2023-05-12 Authors: Eran Kaufman, Lee-Ad Gottlieb

This paper proposes a novel approach for automated emphasis detection in spoken language using deepfake technology. By generating an 'emphasis-devoid' version of a spoken sentence using a speaker's voice sample and comparing it to the original, the system identifies emphasized words.

VSMask: Defending Against Voice Synthesis Attack via Real-Time Predictive Perturbation

audio Published: 2023-05-09 Authors: Yuanda Wang, Hanqing Guo, Guangjing Wang, Bocheng Chen, Qiben Yan

VSMask is a real-time voice protection mechanism against voice synthesis attacks. Unlike existing offline methods, it uses a predictive neural network to forecast perturbations for upcoming speech, minimizing latency and enabling protection of live audio streams.

AI-Synthesized Voice Detection Using Neural Vocoder Artifacts

audio Published: 2023-04-25 Authors: Chengzhe Sun, Shan Jia, Shuwei Hou, Siwei Lyu

This research proposes a novel approach to detect AI-synthesized voices by identifying neural vocoder artifacts in audio signals. A multi-task learning framework, using a RawNet2 model with a vocoder identification module, is introduced to improve detection accuracy.

Learning From Yourself: A Self-Distillation Method for Fake Speech Detection

audio Published: 2023-03-02 Authors: Jun Xue, Cunhang Fan, Jiangyan Yi, Chenglong Wang, Zhengqi Wen, Dan Zhang, Zhao Lv

This paper introduces a novel self-distillation method for fake speech detection that enhances the performance of shallow networks without increasing model complexity. It achieves this by using a deep network as a teacher model to guide shallow networks, reducing feature differences and improving accuracy.

Speaker-Aware Anti-Spoofing

audio Published: 2023-03-02 Authors: Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

This paper introduces speaker-aware anti-spoofing, a voice spoofing countermeasure that uses prior knowledge of the target speaker. It extends the AASIST model by integrating target speaker information at the frame and utterance levels, achieving significant improvements in EER and t-DCF over a speaker-independent baseline.

Hello Me, Meet the Real Me: Audio Deepfake Attacks on Voice Assistants

audio Published: 2023-02-20 Authors: Domna Bilika, Nikoletta Michopoulou, Efthimios Alepis, Constantinos Patsakis

This research investigates the vulnerability of voice assistants (VAs) to audio deepfake attacks. The authors demonstrate that synthesized voice commands, created using readily available tools, successfully tricked VAs into performing unauthorized actions in over 30% of their experiments, highlighting a significant security risk.

Exposing AI-Synthesized Human Voices Using Neural Vocoder Artifacts

audio Published: 2023-02-18 Authors: Chengzhe Sun, Shan Jia, Shuwei Hou, Ehab AlBadawy, Siwei Lyu

This paper proposes a novel approach for detecting AI-synthesized human voices by identifying artifacts left by neural vocoders. A multi-task learning framework using a RawNet2 model is introduced, incorporating vocoder identification as a pretext task to improve the detection of synthetic voices.

Warning: Humans Cannot Reliably Detect Speech Deepfakes

audio Published: 2023-01-19 Authors: Kimberly T. Mai, Sergi D. Bray, Toby Davies, Lewis D. Griffin

This study investigates human capabilities in detecting speech deepfakes through an online experiment involving 529 participants listening to English and Mandarin audio clips. The results show that human detection accuracy is unreliable, reaching only 73% accuracy, with no significant difference between languages; brief familiarization with deepfake examples only marginally improves performance.

Deepfake CAPTCHA: A Method for Preventing Fake Calls

audio Published: 2023-01-08 Authors: Lior Yasur, Guy Frankovits, Fred M. Grabovski, Yisroel Mirsky

This paper proposes D-CAPTCHA, an active defense against real-time deepfakes, which challenges the deepfake model to generate content exceeding its capabilities, thereby making passive detection easier. The system outperforms state-of-the-art audio deepfake detectors, achieving 91-100% accuracy depending on the challenge.

Defense Against Adversarial Attacks on Audio DeepFake Detection

audio Published: 2022-12-30 Authors: Piotr Kawa, Marcin Plata, Piotr Syga

This research evaluates the robustness of three audio deepfake detection architectures against adversarial attacks. The authors introduce a novel adaptive adversarial training method to enhance the robustness of these detectors, notably adapting RawNet3 for deepfake detection for the first time.

Source Tracing: Detecting Voice Spoofing

audio Published: 2022-12-16 Authors: Tinglong Zhu, Xingming Wang, Xiaoyi Qin, Ming Li

This paper proposes a system for classifying different spoofing attributes in audio deepfakes, focusing on identifying the generation methods rather than just detecting the presence of a fake. This approach, using multi-task learning, improves robustness against unseen spoofing methods and achieves a 20% relative improvement over conventional binary spoof detection methods on the ASVspoof 2019 LA dataset.

SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection

audio Published: 2022-11-11 Authors: Jiangyan Yi, Chenglong Wang, Jianhua Tao, Chu Yuan Zhang, Cunhang Fan, Zhengkun Tian, Haoxin Ma, Ruibo Fu

This paper introduces SceneFake, a novel dataset for scene fake audio detection, addressing a gap in existing datasets by focusing on manipulations of the acoustic scene in audio recordings. Benchmark results demonstrate that existing models trained on other datasets perform poorly on SceneFake, highlighting the challenge of detecting this specific type of audio manipulation.

EmoFake: An Initial Dataset for Emotion Fake Audio Detection

audio Published: 2022-11-10 Authors: Yan Zhao, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Xiaohui Zhang, Yongfeng Dong

This paper introduces EmoFake, a new dataset for emotion fake audio detection, focusing on audio where the emotion has been altered while other aspects remain the same. A new detection method, Graph Attention networks using Deep Emotion embedding (GADE), is proposed and evaluated on this dataset, showing promising results.

Waveform Boundary Detection for Partially Spoofed Audio

audio Published: 2022-11-01 Authors: Zexin Cai, Weiqing Wang, Ming Li

This paper proposes a deep learning-based system for detecting partially spoofed audio by identifying waveform boundaries between genuine and manipulated segments. The system achieves state-of-the-art performance on the ADD2022 challenge, outperforming other methods in locating manipulated audio clips.

Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection

audio Published: 2022-10-31 Authors: Luigi Attorresi, Davide Salvi, Clara Borrelli, Paolo Bestagini, Stefano Tubaro

This paper proposes a novel synthetic speech detection approach combining speaker verification and prosody analysis. It uses speaker embeddings from an automatic speaker verification network and prosody embeddings from a specialized encoder, concatenating them and feeding them into a binary classifier to detect deepfake speech generated by Text-to-Speech and Voice Conversion techniques.

Adaptive re-calibration of channel-wise features for Adversarial Audio Classification

audio Published: 2022-10-21 Authors: Vardhan Dongre, Abhinav Thimma Reddy, Nikhitha Reddeddy

This paper proposes an adaptive channel-wise recalibration of audio features using attentional feature fusion for synthetic speech detection. The approach improves upon existing methods by achieving higher accuracy and better generalization across various synthetic speech generation models, particularly using a ResNet architecture with squeeze-excitation blocks and a combination of LFCC and MFCC features.

Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders

audio Published: 2022-10-19 Authors: Xin Wang, Junichi Yamagishi

This paper proposes a method for efficiently creating spoofed training data for speech spoofing countermeasures using neural vocoders, instead of relying on computationally expensive TTS and VC systems. A contrastive feature loss is introduced to improve the training process by leveraging the relationship between bona fide and spoofed data pairs.

Deepfake Detection System for the ADD Challenge Track 3.2 Based on Score Fusion

audio Published: 2022-10-13 Authors: Yuxiang Zhang, Jingze Lu, Xingming Wang, Zhuo Li, Runqiu Xiao, Wenchao Wang, Ming Li, Pengyuan Zhang

This paper presents a deepfake audio detection system for the ADD Challenge Track 3.2, using score-level fusion of multiple light convolutional neural networks (LCNNs). The system incorporates various front-ends and online data augmentation, achieving a weighted equal error rate (WEER) of 11.04%, a top result in the challenge.

SpecRNet: Towards Faster and More Accessible Audio DeepFake Detection

audio Published: 2022-10-12 Authors: Piotr Kawa, Marcin Plata, Piotr Syga

The paper introduces SpecRNet, a novel neural network architecture for audio deepfake detection designed for faster inference and low computational requirements. Benchmarks show SpecRNet achieves performance comparable to state-of-the-art models while requiring up to 40% less processing time.

Deep Spectro-temporal Artifacts for Detecting Synthesized Speech

audio Published: 2022-10-11 Authors: Xiaohui Liu, Meng Liu, Lin Zhang, Linjuan Zhang, Chang Zeng, Kai Li, Nan Li, Kong Aik Lee, Longbiao Wang, Jianwu Dang

This paper presents a system for detecting synthesized speech in two tracks of the Audio Deep Synthesis Detection (ADD) Challenge: Low-quality Fake Audio Detection and Partially Fake Audio Detection. The approach leverages spectro-temporal artifacts using raw waveform, handcrafted features, and deep embeddings, incorporating techniques like data augmentation, domain adaptation, and a greedy fusion strategy.

Synthetic Voice Detection and Audio Splicing Detection using SE-Res2Net-Conformer Architecture

audio Published: 2022-10-07 Authors: Lei Wang, Benedict Yeoh, Jun Wah Ng

This paper proposes a new SE-Res2Net-Conformer architecture for improved synthetic voice detection and re-formulates audio splicing detection to focus on boundary identification. The proposed architecture combines the strengths of Res2Net, Conformer blocks, and a deep learning approach to achieve better performance on both tasks.

The Sound of Silence: Efficiency of First Digit Features in Synthetic Audio Detection

audio Published: 2022-10-06 Authors: Daniele Mari, Federica Latora, Simone Milani

This research investigates the effectiveness of first digit statistics extracted from MFCC coefficients of silenced speech segments for synthetic audio detection. The proposed method is computationally lightweight and achieves over 90% accuracy on the ASVSpoof dataset, outperforming some state-of-the-art approaches.

ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild

audio Published: 2022-10-05 Authors: Xuechen Liu, Xin Wang, Md Sahidullah, Jose Patino, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas Evans, Andreas Nautsch, Kong Aik Lee

The ASVspoof 2021 challenge benchmarked speech spoofing and deepfake detection systems under more realistic conditions, including encoding, transmission effects, and real-world acoustic environments. Results revealed varying levels of robustness across tasks, highlighting challenges in generalization to unseen data and conditions.

Deepfake audio detection by speaker verification

audio Published: 2022-09-28 Authors: Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva

This paper proposes a novel deepfake audio detection approach that leverages speaker verification techniques, trained only on real audio data, to achieve high generalization ability and robustness to audio impairments. The method avoids training on fake audios, thus overcoming the limitation of existing methods that struggle with unseen synthetic audio generation tools.

Synthetic Voice Spoofing Detection Based On Online Hard Example Mining

audio Published: 2022-09-23 Authors: Chenlei Hu, Ruohua Zhou

This paper proposes an Online Hard Example Mining (OHEM) algorithm to improve the detection of unknown voice spoofing attacks in automatic speaker verification. By focusing on hard-to-classify samples, OHEM addresses class imbalance and achieves a low equal error rate (EER) of 0.77% on the ASVspoof 2019 Challenge.

ConvNeXt Based Neural Network for Audio Anti-Spoofing

audio Published: 2022-09-14 Authors: Qiaowei Ma, Jinghui Zhong, Yitao Yang, Weiheng Liu, Ying Gao, Wing W. Y. Ng

This paper proposes a lightweight end-to-end audio anti-spoofing model based on a revised ConvNeXt architecture. By incorporating a channel attention block and focal loss, the model effectively focuses on informative speech sub-bands and difficult-to-classify samples, achieving state-of-the-art performance on the ASVSpoof 2019 LA dataset.

Audio Deepfake Attribution: An Initial Dataset and Investigation

audio Published: 2022-08-21 Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Jie Chen

This paper introduces the first audio deepfake attribution dataset (ADA) for identifying the source of deepfake audio. To address the challenge of attributing audio from unknown sources, a novel open-set audio deepfake attribution (OSADA) method called Class-Representation Multi-Center Learning (CRML) is proposed.

An Initial Investigation for Detecting Vocoder Fingerprints of Fake Audio

audio Published: 2022-08-20 Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Haoxin Ma, Tao Wang, Shiming Wang, Ruibo Fu

This paper introduces a novel problem of detecting vocoder fingerprints in fake audio, aiming to identify the specific vocoder used to generate the fake audio. Experiments using eight state-of-the-art vocoders show that distinct vocoder fingerprints exist and are detectable.

Fully Automated End-to-End Fake Audio Detection

audio Published: 2022-08-20 Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Haiyang Sun, Xun Chen, Zhengkun Tian, Haoxin Ma, Cunhang Fan, Ruibo Fu

This paper proposes a fully automated end-to-end fake audio detection method using a wav2vec pre-trained model for feature extraction and a modified DARTS (light-DARTS) for architecture search and optimization. The method achieves a state-of-the-art equal error rate (EER) of 1.08% on the ASVspoof 2019 LA dataset, outperforming existing single systems.

Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

audio Published: 2022-08-02 Authors: Jun Xue, Cunhang Fan, Zhao Lv, Jianhua Tao, Jiangyan Yi, Chengshi Zheng, Zhengqi Wen, Minmin Yuan, Shegang Shao

This paper proposes a novel audio deepfake detection system that combines fundamental frequency (F0) information and real plus imaginary spectrogram features. By utilizing the differences in F0 distribution between real and fake speech and modeling disjoint subbands separately, the system achieves a significantly lower equivalent error rate (EER) than existing systems.

Attack Agnostic Dataset: Towards Generalization and Stabilization of Audio DeepFake Detection

audio Published: 2022-06-27 Authors: Piotr Kawa, Marcin Plata, Piotr Syga

This paper introduces the Attack Agnostic Dataset, combining audio deepfake and anti-spoofing datasets to improve the generalization and stability of audio deepfake detection methods. A LCNN model with LFCC and mel-spectrogram front-ends is proposed, showing improved generalization, stability, and performance compared to existing methods.

Detection of Doctored Speech: Towards an End-to-End Parametric Learn-able Filter Approach

audio Published: 2022-06-27 Authors: Rohit Arora

This research proposes end-to-end deep learning models (WSTnet and CWTnet) for detecting doctored speech, using Wavelet Scattering and Continuous Wavelet Transforms, respectively, instead of the SincNet baseline's sinc layer. A further improved model, WDnet, replaces the CWT layer with a Wavelet Deconvolution layer to optimize scale parameters, yielding substantial performance improvements over both the baseline and traditional methods on the ASVspoof 2019 dataset.

On-Device Voice Authentication with Paralinguistic Privacy

audio Published: 2022-05-27 Authors: Ranya Aloufi, Hamed Haddadi, David Boyle

This research paper presents a novel on-device voice authentication system that prioritizes user privacy while maintaining high accuracy. The system locally derives token-based credentials from voice data, allowing selective filtering of sensitive information before transmission to service providers, thereby mitigating privacy risks associated with cloud-based voice authentication.

Baselines and Protocols for Household Speaker Recognition

audio Published: 2022-04-30 Authors: Alexey Sholokhov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen

This research paper introduces an evaluation benchmark and open-source baselines for household speaker recognition, addressing challenges like domain robustness, short utterances, and passive enrollment. It provides several algorithms for both active and passive enrollment scenarios.

The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance

audio Published: 2022-04-11 Authors: Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi

This paper introduces a new spoofing scenario, Partial Spoof (PS), where synthesized speech segments are embedded within bona fide utterances. It proposes improved countermeasures (CMs) using self-supervised pre-trained models for feature extraction and a new CM architecture that leverages segment-level labels at multiple temporal resolutions for both utterance and segment-level detection, achieving low error rates on the PartialSpoof and ASVspoof 2019 LA databases.

A Study of Using Cepstrogram for Countermeasure Against Replay Attacks

audio Published: 2022-04-09 Authors: Shih-Kuang Lee, Yu Tsao, Hsin-Min Wang

This research demonstrates the effectiveness of cepstrograms as a countermeasure against replay attacks in automatic speaker verification. Experiments on the ASVspoof 2019 physical access database show that cepstrogram-based systems outperform other state-of-the-art methods, achieving the best results in both single and fusion systems.

Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification

audio Published: 2022-04-06 Authors: Jin Woo Lee, Eungbeom Kim, Junghyun Koo, Kyogu Lee

This paper investigates effective feature spaces for spoof detection using wav2vec 2.0, finding that the 5th layer's features are optimal. A simple attentive statistics pooling (ASP) layer as the backend achieves a 0.31% EER on ASVspoof 2019 LA, and a proposed spoof-aware speaker verification (SASV) method achieves 1.08% EER on the SASV Challenge 2022 database.

Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck

audio Published: 2022-04-04 Authors: Youngsik Eom, Yeonghyeon Lee, Ji Sub Um, Hoirin Kim

This paper proposes a transfer learning scheme for speech anti-spoofing using a pre-trained wav2vec 2.0 model and a variational information bottleneck (VIB). The approach improves the performance of distinguishing unseen spoofed and genuine speech, surpassing state-of-the-art systems, particularly in low-resource and cross-dataset settings.

Adversarial Speaker Distillation for Countermeasure Model on Automatic Speaker Verification

audio Published: 2022-03-31 Authors: Yen-Lun Liao, Xuanjun Chen, Chung-Che Wang, Jyh-Shing Roger Jang

This paper proposes an adversarial speaker distillation method for creating lightweight countermeasure (CM) models for automatic speaker verification (ASV) systems. The method combines generalized end-to-end (GE2E) pre-training, adversarial fine-tuning, and knowledge distillation to achieve a smaller model size while maintaining high performance in detecting spoofed audio.

A Comparative Study of Fusion Methods for SASV Challenge 2022

audio Published: 2022-03-31 Authors: Petr Grinberg, Vladislav Shikhov

This paper investigates various fusion methods for combining embeddings from Automatic Speaker Verification (ASV) and countermeasure (CM) systems in the Spooﬁng Aware Speaker Veriﬁcation (SASV) Challenge 2022. The authors explore different fusion techniques, including boosting over embeddings (CatBoost), which outperforms existing methods.

Does Audio Deepfake Detection Generalize?

audio Published: 2022-03-30 Authors: Nicolas M. Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, Konstantin Böttinger

This paper systematically re-implements and evaluates twelve audio deepfake detection architectures from prior work, identifying key factors like feature extraction (cqtspec or logspec outperform melspec) for improved performance. It also introduces a new 'in-the-wild' dataset to assess generalization, revealing significantly degraded performance on real-world data, highlighting limitations in current approaches.

SASV 2022: The First Spoofing-Aware Speaker Verification Challenge

audio Published: 2022-03-28 Authors: Jee-weon Jung, Hemlata Tak, Hye-jin Shim, Hee-Soo Heo, Bong-Jin Lee, Soo-Whan Chung, Ha-Jin Yu, Nicholas Evans, Tomi Kinnunen

The paper presents the first Spoofing-Aware Speaker Verification (SASV) challenge, aiming to integrate speaker verification and anti-spoofing research. The challenge focuses on jointly optimized solutions, contrasting with previous challenges that treated these as separate tasks. Results show that the top-performing system significantly reduces the equal error rate compared to a conventional system.

Attacker Attribution of Audio Deepfakes

audio Published: 2022-03-28 Authors: Nicolas M. Müller, Franziska Dieckmann, Jennifer Williams

This paper tackles the problem of audio deepfake attacker attribution, aiming to identify the creator of a fake audio recording. It proposes using recurrent neural network embeddings as attacker signatures, demonstrating superior performance compared to low-level acoustic features for distinguishing between deepfakes from different sources.

Spoofing-Aware Speaker Verification with Unsupervised Domain Adaptation

audio Published: 2022-03-21 Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

This paper proposes a method for improving the spoofing robustness of automatic speaker verification (ASV) systems without using a separate countermeasure module. It achieves this by employing unsupervised domain adaptation techniques to optimize the back-end probabilistic linear discriminant analysis (PLDA) classifier using the ASVspoof 2019 dataset, resulting in significant performance improvements.

SA-SASV: An End-to-End Spoof-Aggregated Spoofing-Aware Speaker Verification System

audio Published: 2022-03-12 Authors: Zhongwei Teng, Quchen Fu, Jules White, Maria E. Powell, Douglas C. Schmidt

This paper presents SA-SASV, an end-to-end spoofing-aware speaker verification system that uses multi-task classifiers optimized by multiple losses. Unlike previous approaches, SA-SASV avoids ensemble methods and offers more flexible training set requirements. It achieves improved performance on the ASVSpoof 2019 LA dataset.

The Vicomtech Audio Deepfake Detection System based on Wav2Vec2 for the 2022 ADD Challenge

audio Published: 2022-03-03 Authors: Juan M. Martín-Doñas, Aitor Álvarez

This paper presents an audio deepfake detection system for the 2022 ADD challenge, combining a pre-trained wav2vec2 feature extractor with a downstream classifier. The system leverages contextualized speech representations from different transformer layers and data augmentation techniques to improve robustness and performance in various challenging audio conditions.

Explainable deepfake and spoofing detection: an attack analysis using SHapley Additive exPlanations

audio Published: 2022-02-28 Authors: Wanying Ge, Massimiliano Todisco, Nicholas Evans

This paper extends previous work on explainable deepfake and spoofing detection by applying SHapley Additive exPlanations (SHAP) to analyze different attack algorithms. Using classifiers operating on raw waveforms and magnitude spectrograms, it identifies attack-specific artifacts and reveals differences and consistencies between synthetic speech and converted voice spoofing attacks.

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

audio Published: 2022-02-24 Authors: Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, Nicholas Evans

This paper investigates using a wav2vec 2.0 front-end with fine-tuning for speaker verification spoofing and deepfake detection. Despite pre-training only on bona fide data, the approach achieves the lowest equal error rates reported in the literature for ASVspoof 2021 Logical Access and Deepfake databases, further improved with data augmentation.

ADD 2022: the First Audio Deep Synthesis Detection Challenge

audio Published: 2022-02-17 Authors: Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Xiaohui Zhang, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu

The ADD 2022 challenge focuses on audio deepfake detection, addressing real-world scenarios not covered by previous tasks. It includes three tracks: low-quality fake audio detection, partially fake audio detection, and an audio fake game, providing diverse and challenging datasets for evaluating detection methods.

Partially Fake Audio Detection by Self-attention-based Fake Span Discovery

audio Published: 2022-02-14 Authors: Haibin Wu, Heng-Cheng Kuo, Naijun Zheng, Kuo-Hsuan Hung, Hung-Yi Lee, Yu Tsao, Hsin-Min Wang, Helen Meng

This paper proposes a novel framework for partially fake audio detection using a question-answering strategy with a self-attention mechanism. The model identifies the start and end points of fake audio clips within an utterance, improving its ability to discriminate between real and partially fake audio. This approach achieved second place in the ADD 2022 partially fake audio detection track.

Synthetic speech detection using meta-learning with prototypical loss

audio Published: 2022-01-24 Authors: Monisankha Pal, Aditya Raikar, Ashish Panda, Sunil Kumar Kopparapu

This research addresses the generalization problem in synthetic speech detection by employing prototypical loss under a meta-learning paradigm. This approach learns an embedding space that effectively distinguishes between genuine and synthetic speech, improving performance on unseen spoofing attacks.

Adversarial Transformation of Spoofing Attacks for Voice Biometrics

audio Published: 2022-01-04 Authors: Alejandro Gomez-Alanis, Jose A. Gonzalez-Lopez, Antonio M. Peinado

This paper introduces a novel Adversarial Biometrics Transformation Network (ABTN) to generate adversarial spoofing attacks against voice biometric systems. The ABTN jointly optimizes the loss functions of both the Presentation Attack Detection (PAD) and Automatic Speaker Verification (ASV) systems to create attacks that fool the PAD while remaining undetected by the ASV.

Audio Deepfake Perceptions in College Going Populations

audio Published: 2021-12-06 Authors: Gabrielle Watson, Zahra Khanjani, Vandana P. Janeja

This research investigates the perception of audio deepfakes among college students. Using MelGAN to generate audio deepfakes, the study analyzes how factors like grammar complexity, audio length, and political context influence detection accuracy, also exploring differences in perception across majors.

How Deep Are the Fakes? Focusing on Audio Deepfake: A Survey

audio Published: 2021-11-28 Authors: Zahra Khanjani, Gabrielle Watson, Vandana P. Janeja

This survey paper focuses on audio deepfakes, a topic often overlooked in existing surveys. It critically analyzes audio deepfake generation and detection methods from 2016 to 2020, providing a unique resource for researchers in this field.

Investigating self-supervised front ends for speech spoofing countermeasures

audio Published: 2021-11-15 Authors: Xin Wang, Junichi Yamagishi

This paper investigates using pre-trained self-supervised speech models as front-ends for speech spoofing countermeasures (CMs). The authors find that fine-tuning a well-chosen pre-trained front-end with a shallow or deep neural network back-end significantly improves performance on multiple datasets compared to a baseline system.

RawBoost: A Raw Data Boosting and Augmentation Method applied to Automatic Speaker Verification Anti-Spoofing

audio Published: 2021-11-08 Authors: Hemlata Tak, Madhu Kamble, Jose Patino, Massimiliano Todisco, Nicholas Evans

RawBoost is a data augmentation method for improving spoofing detection in automatic speaker verification, operating directly on raw waveforms without requiring additional data sources. It enhances a state-of-the-art system by 27% relative performance on the ASVspoof 2021 logical access database, outperformed only by methods using external data or model-level interventions.

WaveFake: A Data Set to Facilitate Audio Deepfake Detection

audio Published: 2021-11-04 Authors: Joel Frank, Lea Schönherr

This paper introduces WaveFake, a novel dataset for audio deepfake detection, comprising samples from six state-of-the-art text-to-speech (TTS) architectures across two languages. It also provides two baseline models (GMM and RawNet2) for future research in this area.

A Study On Data Augmentation In Voice Anti-Spoofing

audio Published: 2021-10-20 Authors: Ariel Cohen, Inbal Rimon, Eran Aflalo, Haim Permuter

This paper investigates data augmentation techniques for improving synthetic audio detection in voice anti-spoofing. The authors propose novel data augmentation methods to address channel variability and unseen spoofing attacks, achieving state-of-the-art performance on the ASVspoof 2021 challenge.

A Multi-Resolution Front-End for End-to-End Speech Anti-Spoofing

audio Published: 2021-10-11 Authors: Wei Liu, Meng Sun, Xiongwei Zhang, Hugo Van hamme, Thomas Fang Zheng

This paper proposes a multi-resolution front-end for speech anti-spoofing that learns optimal weighted combinations of time-frequency resolutions. Features from different resolutions are weighted and concatenated, with weights predicted by a learnable neural network. The approach also refines these combinations by pruning less important resolutions.

Explaining deep learning models for spoofing and deepfake detection with SHapley Additive exPlanations

audio Published: 2021-10-07 Authors: Wanying Ge, Jose Patino, Massimiliano Todisco, Nicholas Evans

This paper uses SHapley Additive exPlanations (SHAP) to analyze deep learning models for spoofing and deepfake detection. It reveals unexpected classifier behavior, identifies key contributing artifacts, and highlights differences between competing models, promoting explainable AI in this field.

Complementing Handcrafted Features with Raw Waveform Using a Light-weight Auxiliary Model

audio Published: 2021-09-06 Authors: Zhongwei Teng, Quchen Fu, Jules White, Maria Powell, Douglas C. Schmidt

This paper proposes an Auxiliary RawNet (ARNet) model to improve audio spoof detection accuracy by combining handcrafted features with features learned from raw waveforms. ARNet uses a lightweight auxiliary encoder to process raw waveforms, supplementing information in handcrafted features at a low computational cost.

FastAudio: A Learnable Audio Front-End for Spoof Speech Detection

audio Published: 2021-09-06 Authors: Quchen Fu, Zhongwei Teng, Jules White, Maria Powell, Douglas C. Schmidt

This paper proposes FastAudio, a learnable audio front-end for spoof speech detection. By replacing fixed filterbanks with a learnable layer, FastAudio achieves a 27% relative improvement in minimum tandem detection cost function (min t-DCF) compared to fixed front-ends on the ASVspoof 2019 dataset, outperforming other learnable front-ends.

Efficient Attention Branch Network with Combined Loss Function for Automatic Speaker Verification Spoof Detection

audio Published: 2021-09-05 Authors: Amir Mohammad Rostami, Mohammad Mehdi Homayounpour, Ahmad Nickabadi

This paper proposes the Efficient Attention Branch Network (EABN) for automatic speaker verification spoof detection, addressing the generalization problem of existing models. EABN uses an attention branch to generate interpretable attention masks that improve classification performance in a perception branch, employing the efficient EfficientNet-A0 architecture.

ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection

audio Published: 2021-09-01 Authors: Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, Héctor Delgado

The ASVspoof 2021 challenge focused on advancing spoofed and deepfake speech detection. It introduced a new deepfake speech detection task alongside logical and physical access tasks, evaluating progress without providing matched training data, reflecting real-world scenarios.

ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan

audio Published: 2021-09-01 Authors: Héctor Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, Junichi Yamagishi

The ASVspoof 2021 challenge focuses on developing spoofing countermeasures for speech data, encompassing logical access (LA), physical access (PA), and speech deepfake (DF) tasks. The paper details the challenge's evaluation plan, including datasets, metrics (t-DCF and EER), and baseline systems.

Benchmarking and challenges in security and privacy for voice biometrics

audio Published: 2021-09-01 Authors: Jean-Francois Bonastre, Hector Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Paul-Gauthier Noe, Jose Patino, Md Sahidullah, Brij Mohan Lal Srivastava, Massimiliano Todisco, Natalia Tomashenko, Emmanuel Vincent, Xin Wang, Junichi Yamagishi

This paper provides a high-level overview of benchmarking methodologies used in voice biometrics security and privacy research. It describes the ASVspoof challenge for spoofing countermeasures and the VoicePrivacy initiative for privacy preservation through anonymization.

Creation and Detection of German Voice Deepfakes

audio Published: 2021-08-02 Authors: Vanessa Barnekow, Dominik Binder, Niclas Kromrey, Pascal Munaretto, Andreas Schaad, Felix Schmieder

This paper investigates the feasibility of creating and detecting German voice deepfakes using readily available tools and datasets. The authors demonstrate that convincing deepfakes can be generated with relatively little effort, and that human detection rates are low (37%), while a bispectral analysis-based approach achieves higher detection accuracy.

Multi-Task Learning in Utterance-Level and Segmental-Level Spoof Detection

audio Published: 2021-07-29 Authors: Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi

This paper proposes SELCNN, a light convolutional neural network with squeeze-and-excitation blocks for enhanced feature selection, and applies it within multi-task learning (MTL) frameworks for simultaneous utterance-level and segmental-level spoof detection in the PartialSpoof database. Experiments demonstrate that the multi-task binary-branch architecture, particularly when fine-tuned from a segmental warm-up model, outperforms single-task models.

End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection

audio Published: 2021-07-27 Authors: Hemlata Tak, Jee-weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, Nicholas Evans

This paper proposes RawGAT-ST, a spectro-temporal graph attention network for speech deepfake detection. It achieves this by learning relationships between spectral and temporal cues directly from raw waveforms, using a novel graph fusion and pooling strategy. The model achieves a state-of-the-art equal error rate of 1.06% on the ASVspoof 2019 logical access database.

Raw Differentiable Architecture Search for Speech Deepfake and Spoofing Detection

audio Published: 2021-07-26 Authors: Wanying Ge, Jose Patino, Massimiliano Todisco, Nicholas Evans

This paper introduces Raw PC-DARTS, an end-to-end speech deepfake and spoofing detection system that automatically learns its network architecture from raw audio waveforms. The system achieves a state-of-the-art tandem detection cost function score of 0.0517 on the ASVspoof 2019 logical access database.

UR Channel-Robust Synthetic Speech Detection System for ASVspoof 2021

audio Published: 2021-07-26 Authors: Xinhui Chen, You Zhang, Ge Zhu, Zhiyao Duan

This paper presents a channel-robust synthetic speech detection system for the ASVspoof 2021 challenge. It uses an acoustic simulator to augment datasets with various codec and channel effects, and employs an ECAPA-TDNN model with one-class learning and channel-robust training strategies.

Human Perception of Audio Deepfakes

audio Published: 2021-07-20 Authors: Nicolas M. Müller, Karla Pizzi, Jennifer Williams

This paper compares human and machine capabilities in detecting audio deepfakes through a gamified online experiment. Humans and a state-of-the-art AI algorithm showed similar strengths and weaknesses, struggling with certain types of attacks, contrary to AI's superhuman performance in other areas. The study analyzes human success factors, such as native language and age.

Channel-wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks

audio Published: 2021-07-19 Authors: Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng

This paper proposes Channel-wise Gated Res2Net (CG-Res2Net), a novel architecture that improves the generalizability of synthetic speech detection systems to unseen attacks. It achieves this by incorporating a channel-wise gating mechanism into the Res2Net block, dynamically selecting relevant channels and suppressing less relevant ones.

Speech is Silver, Silence is Golden: What do ASVspoof-trained Models Really Learn?

audio Published: 2021-06-23 Authors: Nicolas M. Müller, Franziska Dieckmann, Pavel Czempin, Roman Canals, Konstantin Böttinger, Jennifer Williams

This paper analyzes the ASVspoof 2019 dataset, revealing an uneven distribution of silence duration correlated with the spoof/bonafide label. Models trained solely on silence duration achieve surprisingly high accuracy (up to 85%), indicating a potential bias in previous research that inadvertently relied on this artifact.

Generalized Spoofing Detection Inspired from Audio Generation Artifacts

audio Published: 2021-04-08 Authors: Yang Gao, Tyler Vuong, Mahsa Elyasi, Gaurav Bharaj, Rita Singh

This paper proposes using 2D Discrete Cosine Transform (DCT) on log-Mel spectrograms as a novel long-range spectro-temporal feature for audio deepfake detection. This feature effectively captures artifacts in generated audio, outperforming existing features like log-Mel spectrograms, CQCC, and MFCC, and leading to state-of-the-art performance on the ASVspoof 2019 challenge.

Graph Attention Networks for Anti-Spoofing

audio Published: 2021-04-08 Authors: Hemlata Tak, Jee-weon Jung, Jose Patino, Massimiliano Todisco, Nicholas Evans

This paper proposes using Graph Attention Networks (GATs) to improve spoofing detection in automatic speaker verification by modeling the relationships between spectral sub-bands or temporal segments. Experiments on the ASVspoof 2019 database show that the GAT-based model with temporal attention outperforms baseline systems, and fusion with other systems provides significant performance improvements.

Half-Truth: A Partially Fake Audio Detection Dataset

audio Published: 2021-04-08 Authors: Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu

This paper introduces the Half-Truth Audio Detection (HAD) dataset, focusing on partially fake audio where only a few words in an utterance are synthetically generated. This addresses a significant gap in existing datasets and provides a more realistic scenario for fake audio detection, enabling both fake utterance detection and localization of manipulated regions.

Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection

audio Published: 2021-04-07 Authors: Wanying Ge, Michele Panariello, Jose Patino, Massimiliano Todisco, Nicholas Evans

This paper presents the first successful application of Partially-Connected Differentiable Architecture Search (PC-DARTS) to deepfake and spoofing detection. PC-DARTS efficiently learns complex neural architectures composed of convolutional operations and residual blocks, resulting in competitive performance with less computational complexity than existing state-of-the-art methods.

A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection

audio Published: 2021-03-21 Authors: Xin Wang, Junich Yamagishi

This paper presents a comparative study of neural network-based speech spoofing countermeasures, focusing on varied-length input handling and loss functions. The authors found that average pooling for varied-length inputs and a new hyper-parameter-free loss function yielded a best-performing single model with an equal error rate (EER) of 1.92% on the ASVspoof 2019 logical access task.

Data Augmentation with Signal Companding for Detection of Logical Access Attacks

audio Published: 2021-02-12 Authors: Rohan Kumar Das, Jichen Yang, Haizhou Li

This paper introduces a novel data augmentation technique using a-law and mu-law signal companding to improve the detection of logical access attacks in automatic speaker verification (ASV). Experiments on the ASVspoof 2019 logical access corpus show that this method outperforms state-of-the-art spoofing countermeasures, particularly in handling unknown attacks.

ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech

audio Published: 2021-02-11 Authors: Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi Kinnunen, Ville Vestman, Massimiliano Todisco, Héctor Delgado, Md Sahidullah, Junichi Yamagishi, Kong Aik Lee

This paper analyzes the results of the ASVspoof 2019 challenge, focusing on the top-performing systems for detecting synthesized, converted, and replayed speech. The findings highlight the effectiveness of fusion techniques for logical access scenarios and the significant gap between simulated and real replay data performance.

Automatic Speech Verification Spoofing Detection

audio Published: 2020-12-15 Authors: Shentong Mo, Haofan Wang, Pinxu Ren, Ta-Chung Chi

This research paper investigates automatic speech verification spoofing detection using traditional machine learning models. The authors explore different audio features (MFCC and CQCC) and classifiers (SVM and GMM) to identify spoofed speech, evaluating performance using EER and t-DCF.

Multi-task Learning Based Spoofing-Robust Automatic Speaker Verification System

audio Published: 2020-12-06 Authors: Yuanjun Zhao, Roberto Togneri, Victor Sreeram

This paper proposes a spoofing-robust automatic speaker verification (SR-ASV) system using a multi-task learning architecture. This deep learning model jointly trains speaker verification and spoofing detection, achieving substantial performance improvements over state-of-the-art systems on the ASVspoof 2017 and 2019 corpora.

Detection and Evaluation of human and machine generated speech in spoofing attacks on automatic speaker verification systems

audio Published: 2020-11-07 Authors: Yang Gao, Jiachen Lian, Bhiksha Raj, Rita Singh

This paper investigates the effectiveness of human and machine-generated speech in spoofing automatic speaker verification (ASV) systems. It proposes using features capturing the fine-grained inconsistencies of human speech production to detect deepfakes, demonstrating that fundamental frequency sequence-related entropy, spectral envelope, and aperiodic parameters are promising for robust deepfake audio detection.

End-to-end anti-spoofing with RawNet2

audio Published: 2020-11-02 Authors: Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, Anthony Larcher

This paper presents the first application of RawNet2, a raw audio-based deep neural network, to anti-spoofing in automatic speaker verification. Modifications were made to the original RawNet2 architecture to improve its performance in detecting spoofed speech, particularly the challenging A17 attack. The results show that while overall performance is not superior to a baseline, the system achieves state-of-the-art results on the A17 attack and improves when fused with the baseline.

Replay and Synthetic Speech Detection with Res2net Architecture

audio Published: 2020-10-28 Authors: Xu Li, Na Li, Chao Weng, Xunying Liu, Dan Su, Dong Yu, Helen Meng

This paper proposes using the Res2Net architecture for replay and synthetic speech detection to improve generalizability to unseen spoofing attacks. Res2Net modifies the ResNet block to enable multiple feature scales, enhancing performance and reducing model size. Experiments on the ASVspoof 2019 corpus show that Res2Net significantly outperforms ResNet34 and ResNet50.

One-class Learning Towards Synthetic Voice Spoofing Detection

audio Published: 2020-10-27 Authors: You Zhang, Fei Jiang, Zhiyao Duan

This paper proposes a one-class learning approach for synthetic voice spoofing detection, focusing on unknown attacks. The method compacts bona fide speech representation and injects an angular margin to separate spoofing attacks in the embedding space, achieving a 2.19% equal error rate (EER) on the ASVspoof 2019 dataset, surpassing all existing single systems.

Learnable Spectro-temporal Receptive Fields for Robust Voice Type Discrimination

audio Published: 2020-10-19 Authors: Tyler Vuong, Yangyang Xia, Richard Stern

This paper proposes a deep-learning system for Voice Type Discrimination (VTD), which distinguishes live speech from playback audio. The system uses a learnable spectro-temporal receptive field (STRF) layer for robust feature extraction, showing strong performance on VTD and ASVspoof 2019 spoofing detection tasks.

Dataset artefacts in anti-spoofing systems: a case study on the ASVspoof 2017 benchmark

audio Published: 2020-10-15 Authors: Bhusan Chettri, Emmanouil Benetos, Bob L. T. Sturm

This research paper investigates how artifacts in the ASVspoof 2017 dataset contribute to the apparent success of published spoofing detection systems. The authors demonstrate how these artifacts can be exploited to manipulate model decisions and propose a framework incorporating speech endpoint detection to improve model robustness and trustworthiness.

Texture-based Presentation Attack Detection for Automatic Speaker Verification

audio Published: 2020-10-08 Authors: Lazaro J. Gonzalez-Soler, Jose Patino, Marta Gomez-Barrero, Massimiliano Todisco, Christoph Busch, Nicholas Evans

This paper proposes a presentation attack detection (PAD) method for automatic speaker verification using texture descriptors applied to speech spectrogram images. A common Fisher vector feature space, based on a generative model, is used to improve the generalizability of PAD solutions, achieving low error rates for both known and unknown attacks.

Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks

audio Published: 2020-09-21 Authors: Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li

This paper proposes a novel feature genuinization method for synthetic speech detection. It uses a CNN-based transformer trained on genuine speech to enhance the difference between genuine and synthetic speech features before classification with a light CNN. This approach outperforms state-of-the-art methods on the ASVspoof 2019 dataset.

Using Multi-Resolution Feature Maps with Convolutional Neural Networks for Anti-Spoofing in ASV

audio Published: 2020-08-20 Authors: Qiongqiong Wang, Kong Aik Lee, Takafumi Koshinaka

This paper proposes a method for anti-spoofing in automatic speaker verification (ASV) that uses multi-resolution feature maps with convolutional neural networks (CNNs). By stacking spectrograms extracted with different window lengths, the method improves both time and frequency resolution, leading to more discriminative representations of audio segments.

Audio Spoofing Verification using Deep Convolutional Neural Networks by Transfer Learning

audio Published: 2020-08-08 Authors: Rahul T P, P R Aravind, Ranjith C, Usamath Nechiyil, Nandakumar Paramparambath

This paper proposes a deep convolutional neural network (DCNN) based speech classifier for detecting spoofing attacks in speaker verification systems. Using a ResNet-34 architecture and Mel-spectrograms, the model achieves low equal error rates (EER) on the ASVspoof 2019 dataset for both logical and physical access scenarios.

Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals

audio Published: 2020-07-12 Authors: Tomi Kinnunen, Héctor Delgado, Nicholas Evans, Kong Aik Lee, Ville Vestman, Andreas Nautsch, Massimiliano Todisco, Xin Wang, Md Sahidullah, Junichi Yamagishi, Douglas A. Reynolds

This paper presents extensions to the tandem detection cost function (t-DCF), a risk-based approach for assessing spoofing countermeasures (CMs) used with automatic speaker verification (ASV). These extensions include a simplified t-DCF, analysis of a fixed ASV system case, simulations for interpretation, and new analyses using the ASVspoof 2019 database.

Integrated Replay Spoofing-aware Text-independent Speaker Verification

audio Published: 2020-06-10 Authors: Hye-jin Shim, Jee-weon Jung, Ju-ho Kim, Seung-bin Kim, Ha-Jin Yu

This paper proposes two approaches for integrated speaker verification and presentation attack detection: a monolithic end-to-end approach and a modular back-end approach. Experiments show that the modular approach, using separate DNNs for speaker verification and presentation attack detection, yields a 21.77% relative improvement in equal error rate compared to a conventional system.

Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning

audio Published: 2020-06-05 Authors: Haibin Wu, Andy T. Liu, Hung-yi Lee

This paper proposes using Mockingjay, a self-supervised learning model, to defend against black-box adversarial attacks on anti-spoofing models for automatic speaker verification. High-level representations extracted by Mockingjay prevent the transferability of adversarial examples and successfully counter these attacks.

DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake Voices

audio Published: 2020-05-28 Authors: Run Wang, Felix Juefei-Xu, Yihao Huang, Qing Guo, Xiaofei Xie, Lei Ma, Yang Liu

DeepSonar detects AI-synthesized fake voices by analyzing layer-wise neuron activation patterns of a speaker recognition system. This approach achieves high detection rates (98.1% accuracy) and low false alarm rates (around 2%) while demonstrating robustness against various manipulation attacks.

Spoofing Attack Detection using the Non-linear Fusion of Sub-band Classifiers

audio Published: 2020-05-20 Authors: Hemlata Tak, Jose Patino, Andreas Nautsch, Nicholas Evans, Massimiliano Todisco

This paper proposes a simple yet effective approach for spoofing attack detection in automatic speaker verification. It uses an ensemble of simple classifiers, each tuned to different sub-bands of the audio spectrum, and combines their scores using non-linear fusion, outperforming most systems in the ASVspoof 2019 challenge.

An explainability study of the constant Q cepstral coefficient spoofing countermeasure for automatic speaker verification

audio Published: 2020-04-14 Authors: Hemlata Tak, Jose Patino, Andreas Nautsch, Nicholas Evans, Massimiliano Todisco

This paper investigates why constant Q cepstral coefficients (CQCCs) are effective in detecting some spoofing attacks but not others in automatic speaker verification. The study reveals that the effectiveness depends on the frequency location of spoofing artefacts and how different front-ends emphasize information at various frequencies.

Subband modeling for spoofing detection in automatic speaker verification

audio Published: 2020-04-04 Authors: Bhusan Chettri, Tomi Kinnunen, Emmanouil Benetos

This paper investigates the impact of different frequency subbands on replay spoofing detection in automatic speaker verification. A joint subband modeling framework using multiple CNNs, each trained on a different subband, is proposed, showing improved performance over full-band models on the ASVspoof 2017 dataset. However, this improvement didn't generalize to the ASVspoof 2019 PA dataset.

Deep Generative Variational Autoencoding for Replay Spoof Detection in Automatic Speaker Verification

audio Published: 2020-03-21 Authors: Bhusan Chettri, Tomi Kinnunen, Emmanouil Benetos

This paper proposes using variational autoencoders (VAEs) as a backend for replay attack detection in automatic speaker verification. Three VAE models are explored, with the conditional VAE (C-VAE) showing significant improvements over separate VAEs and a Gaussian mixture model (GMM) baseline, achieving a 9-10% absolute improvement in EER and t-DCF on the ASVspoof 2019 dataset.

Defense against adversarial attacks on spoofing countermeasures of ASV

audio Published: 2020-03-06 Authors: Haibin Wu, Songxiang Liu, Helen Meng, Hung-yi Lee

This paper proposes spatial smoothing (passive) and adversarial training (proactive) as defense methods to enhance the robustness of Automatic Speaker Verification (ASV) spoofing countermeasure models against adversarial attacks. Experimental results demonstrate that both methods effectively improve the models' resilience to adversarial examples.

Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis

audio Published: 2020-02-28 Authors: Jennifer Williams, Joanna Rownicka, Pilar Oplustil, Simon King

This paper investigates automatic quality estimation of multi-speaker Text-to-Speech (TTS) synthesis by comparing different speech representations for predicting human mean opinion scores (MOS). A neural network is trained and evaluated on various TTS and voice conversion systems, achieving high correlation with human judgments, and revealing consistent quality patterns across different systems for specific speakers.

Multi-Task Siamese Neural Network for Improving Replay Attack Detection

audio Published: 2020-02-16 Authors: Patrick von Platen, Fei Tao, Gokhan Tur

This paper proposes using a multi-task Siamese Neural Network (SNN) for improved replay attack detection in speaker verification systems. The SNN significantly outperforms a ResNet baseline by reducing the Equal Error Rate (EER) by 26.8%, and further improvements are achieved with the addition of reconstruction loss.

A study on the role of subsidiary information in replay attack spoofing detection

audio Published: 2020-01-31 Authors: Jee-weon Jung, Hye-jin Shim, Hee-Soo Heo, Ha-Jin Yu

This study investigates the impact of subsidiary information (room size, reverberation, etc.) on replay attack detection in audio. Using adversarial and multi-task learning frameworks, the researchers analyze whether this information is implicitly present in deep neural network embeddings and if explicitly incorporating it improves detection accuracy.

ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

audio Published: 2019-11-05 Authors: Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-Francois Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan Zhang, Zhen-Hua Ling

This paper introduces ASVspoof 2019, a large-scale public database for synthesized, converted, and replayed speech aimed at advancing research in automatic speaker verification (ASV) spoofing countermeasures. The database includes diverse spoofing attacks generated using state-of-the-art techniques and is designed to reflect logical and physical access scenarios.

Replay Spoofing Countermeasure Using Autoencoder and Siamese Network on ASVspoof 2019 Challenge

audio Published: 2019-10-29 Authors: Mohammad Adiban, Hossein Sameti, Saeedreza Shehnepoor

This paper presents a novel replay spoofing countermeasure for Automatic Speaker Verification (ASV) systems. It uses Constant Q Cepstral Coefficients (CQCC) features processed by an autoencoder to enhance information and incorporate noise information, followed by a Siamese network for classification, achieving significant improvements over the baseline system.

Self-supervised pre-training with acoustic configurations for replay spoofing detection

audio Published: 2019-10-22 Authors: Hye-jin Shim, Hee-Soo Heo, Jee-weon Jung, Ha-Jin Yu

This paper proposes a self-supervised pre-training framework for acoustic configurations to improve replay spoofing detection. It leverages datasets from other tasks (like speaker verification) to train a deep neural network to identify whether audio segments share identical acoustic configurations, improving generalization to unseen conditions.

Adversarial Attacks on Spoofing Countermeasures of automatic speaker verification

audio Published: 2019-10-19 Authors: Songxiang Liu, Haibin Wu, Hung-yi Lee, Helen Meng

This paper investigates the vulnerability of automatic speaker verification (ASV) spoofing countermeasures to adversarial attacks. The authors implement high-performing countermeasure models and test their robustness against Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks in both white-box and black-box scenarios.

Speech Replay Detection with x-Vector Attack Embeddings and Spectral Features

audio Published: 2019-09-23 Authors: Jennifer Williams, Joanna Rownicka

This paper presents a system for speech replay detection submitted to the ASVspoof 2019 challenge. The system combines x-vector attack embeddings, jointly modeling environment and attack types, with sub-band spectral centroid magnitude coefficients (SCMCs) as input to a convolutional neural network (CNN). The approach outperforms challenge baselines using tandem detection cost function (tDCF) and equal error rate (EER) metrics.

Black-box Attacks on Automatic Speaker Verification using Feedback-controlled Voice Conversion

audio Published: 2019-09-17 Authors: Xiaohai Tian, Rohan Kumar Das, Haizhou Li

This paper proposes a feedback-controlled voice conversion (VC) framework for black-box attacks on automatic speaker verification (ASV) systems. The framework uses ASV system output scores as feedback to optimize the VC system, generating adversarial samples more deceptive than standard VC methods while maintaining good perceptual quality.

Voice Spoofing Detection Corpus for Single and Multi-order Audio Replays

audio Published: 2019-09-03 Authors: Roland Baumann, Khalid Mahmood Malik, Ali Javed, Andersen Ball, Brandon Kujawa, Hafiz Malik

This paper introduces a novel voice spoofing detection corpus (VSDC) containing bona fide, first-order, and second-order replay audio samples, addressing the limitations of existing datasets that lack multi-order replay data and diverse recording conditions. VSDC is designed to evaluate anti-spoofing algorithms in multi-hop scenarios and includes audio from fifteen speakers recorded using various microphones and environments.

Detecting Spoofing Attacks Using VGG and SincNet: BUT-Omilia Submission to ASVspoof 2019 Challenge

audio Published: 2019-07-13 Authors: Hossein Zeinali, Themos Stafylakis, Georgia Athanasopoulou, Johan Rohdin, Ioannis Gkinis, Lukáš Burget, Jan "Honza Černocký

This paper describes the BUT-Omilia system for the ASVspoof 2019 challenge, focusing on detecting spoofing attacks in speaker verification. For physical access (PA), a fusion of two VGG networks is used, while for logical access (LA), a fusion of VGG and SincNet is employed. The PA system showed significant improvement over the baseline, while the LA system struggled to generalize to unseen attacks.

The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion

audio Published: 2019-07-05 Authors: Weicheng Cai, Haiwei Wu, Danwei Cai, Ming Li

This paper presents a deep learning-based system for replay attack detection in the ASVspoof 2019 challenge. The system leverages data augmentation (speed perturbation), explores various feature representations (including group delay gram), and employs a residual neural network for classification, achieving a low equal error rate (EER) of 0.66% on the evaluation set through system fusion.

Towards robust audio spoofing detection: a detailed comparison of traditional and learned features

audio Published: 2019-05-28 Authors: Balamurali BT, Kin Wah Edward Lin, Simon Lui, Jer-Ming Chen, Dorien Herremans

This research introduces a robust audio spoofing detection system that generalizes across various replay spoofing techniques, unlike most existing systems. It achieves this by comparing traditional audio features with features learned via an autoencoder, ultimately demonstrating the importance of combining both for optimal performance.

Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge

audio Published: 2019-04-23 Authors: Jee-weon Jung, Hye-jin Shim, Hee-Soo Heo, Ha-Jin Yu

This paper proposes an end-to-end deep neural network (DNN) for replay attack detection in speaker verification, using high-resolution spectrograms with complementary information (magnitude, phase, and power spectral density). The approach avoids handcrafted features, focusing instead on directly modeling raw audio information for improved robustness against advanced spoofing techniques.

STC Antispoofing Systems for the ASVspoof2019 Challenge

audio Published: 2019-04-11 Authors: Galina Lavrentyeva, Sergey Novoselov, Andzhukaev Tseren, Marina Volkova, Artem Gorlanov, Alexandr Kozlov

This paper presents the Speech Technology Center's (STC) anti-spoofing systems for the ASVspoof 2019 challenge. The systems, based on an enhanced Light CNN architecture with angular margin-based softmax activation, achieved low equal error rates (EERs) of 1.86% and 0.54% in logical and physical access scenarios, respectively.

ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection

audio Published: 2019-04-09 Authors: Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Hector Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee

The ASVspoof 2019 challenge focused on advancing countermeasures against spoofing attacks in automatic speaker verification (ASV). The challenge incorporated logical and physical access scenarios, various spoofing attack types, and a new tandem detection cost function (t-DCF) metric to assess system performance holistically.

ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks

audio Published: 2019-04-01 Authors: Cheng-I Lai, Nanxin Chen, Jesús Villalba, Najim Dehak

The paper introduces ASSERT, a system for audio spoofing detection submitted to the ASVspoof 2019 Challenge. It uses variants of squeeze-excitation and residual networks, achieving significant performance improvements over baseline systems in both Physical Access and Logical Access sub-challenges.

Generalization of Spoofing Countermeasures: a Case Study with ASVspoof 2015 and BTAS 2016 Corpora

audio Published: 2019-01-23 Authors: Dipjyoti Paul, Md Sahidullah, Goutam Saha

This paper investigates the generalization capability of spoofing countermeasures in voice-based biometric systems. It analyzes the performance of different spoofing types using MFCCs and CQCCs features with a GMM-ML classifier on ASVspoof 2015 and BTAS 2016 corpora, showing varying generalization capabilities across spoofing types.

Introduction to Voice Presentation Attack Detection and Recent Advances

audio Published: 2019-01-04 Authors: Md Sahidullah, Hector Delgado, Massimiliano Todisco, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Kong-Aik Lee

This research paper reviews recent advancements in voice presentation attack detection (PAD) for automatic speaker verification (ASV), focusing on studies from the last three years. It summarizes findings and lessons learned from two ASVspoof challenges, highlighting the continued need for generalized PAD solutions capable of detecting diverse spoofing attacks.

Attentive Filtering Networks for Audio Replay Attack Detection

audio Published: 2018-10-31 Authors: Cheng-I Lai, Alberto Abad, Korin Richmond, Junichi Yamagishi, Najim Dehak, Simon King

This paper proposes an Attentive Filtering Network (AFN) for audio replay attack detection. AFN uses an attention-based filtering mechanism to enhance feature representations in the time and frequency domains before classification with a ResNet. The system achieves a competitive equal error rate (EER) on the ASVspoof 2017 dataset.

Transforming acoustic characteristics to deceive playback spoofing countermeasures of speaker verification systems

audio Published: 2018-09-12 Authors: Fuming Fang, Junichi Yamagishi, Isao Echizen, Md Sahidullah, Tomi Kinnunen

This paper investigates a novel playback spoofing attack against speaker verification systems by enhancing stolen speech using a speech enhancement generative adversarial network (SEGAN). The attack significantly increases equal error rates for existing countermeasures, demonstrating a vulnerability in current playback detection methods.

A Study On Convolutional Neural Network Based End-To-End Replay Anti-Spoofing

audio Published: 2018-05-22 Authors: Bhusan Chettri, Saumitra Mishra, Bob L. Sturm, Emmanouil Benetos

This paper investigates the performance of Convolutional Neural Networks (CNNs) for end-to-end replay attack detection in the ASVspoof 2017 challenge. The authors find that while CNNs generalize well on the development dataset, they struggle to generalize to the evaluation dataset, highlighting challenges in achieving consistent performance across different data distributions.

t-DCF: a Detection Cost Function for the Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification

audio Published: 2018-04-25 Authors: Tomi Kinnunen, Kong Aik Lee, Hector Delgado, Nicholas Evans, Massimiliano Todisco, Md Sahidullah, Junichi Yamagishi, Douglas A. Reynolds

This paper introduces a new tandem detection cost function (t-DCF) metric for evaluating anti-spoofing countermeasures in automatic speaker verification (ASV). The t-DCF improves upon the equal error rate (EER) by considering the costs of different errors and prior probabilities of target and spoof trials, leading to more realistic and application-specific performance assessment.

Anti-spoofing Methods for Automatic SpeakerVerification System

audio Published: 2017-05-24 Authors: Galina Lavrentyeva, Sergey Novoselov, Konstantin Simonchik

This research paper analyzes various acoustic feature spaces and classifiers for robust spoofing detection in automatic speaker verification systems. It compares different spoofing detection systems on the ASVspoof 2015 challenge datasets, finding that combining magnitude and phase information, along with wavelet-based features, yields improved performance.

DNN Filter Bank Cepstral Coefficients for Spoofing Detection

audio Published: 2017-02-13 Authors: Hong Yu, Zheng-Hua Tan, Zhanyu Ma, Jun Guo

This paper proposes DNN-FBCC, a new filter bank based cepstral feature for spoofing detection in speaker verification systems. A filter bank neural network (FBNN) automatically learns filter banks from natural and synthetic speech, outperforming manually designed filter banks and improving detection of unknown attacks.

Novel Speech Features for Improved Detection of Spoofing Attacks

audio Published: 2016-03-14 Authors: Dipjyoti Paul, Monisankha Pal, Goutam Saha

This paper proposes novel speech features for improved detection of spoofing attacks in automatic speaker verification systems. These features leverage alternative frequency warping and formant-specific block transformation of filter bank log energies, significantly outperforming existing methods.

Spoofing Detection Goes Noisy: An Analysis of Synthetic Speech Detection in the Presence of Additive Noise

audio Published: 2016-03-12 Authors: Cemal Hanilci, Tomi Kinnunen, Md Sahidullah, Aleksandr Sizov

This research analyzes the robustness of state-of-the-art synthetic speech detectors under additive noise. It compares various acoustic feature sets and back-end models (GMM and i-vector) to determine their performance in noisy conditions, revealing significant performance degradation even at high signal-to-noise ratios.

Spoofing detection under noisy conditions: a preliminary investigation and an initial database

audio Published: 2016-02-09 Authors: Xiaohai Tian, Zhizheng Wu, Xiong Xiao, Eng Siong Chng, Haizhou Li

This paper investigates spoofing detection for automatic speaker verification (ASV) under noisy conditions. A new database is created by adding various noises to the ASVspoof 2015 database at different signal-to-noise ratios (SNRs), and experiments show that system performance degrades significantly under noisy conditions, with phase-based features proving more robust than magnitude-based features.

STC Anti-spoofing Systems for the ASVspoof 2015 Challenge

audio Published: 2015-07-29 Authors: Sergey Novoselov, Alexandr Kozlov, Galina Lavrentyeva, Konstantin Simonchik, Vadim Shchemelinin

This paper details the Speech Technology Center's (STC) submissions to the ASVspoof 2015 challenge, focusing on exploring various acoustic feature spaces (MFCC, phase spectrum, wavelet transform) for robust spoofing detection. They employed TV-JFA for probability modeling and compared SVM and DBN classifiers.