Context and Transcripts Improve Detection of Deepfake Audios of Public Figures

Authors: Chongyang Gao, Marco Postiglione, Julian Baldwin, Natalia Denisenko, Isabel Gortner, Luke Fosdick, Chiara Pulice, Sarit Kraus, V. S. Subrahmanian

Published: 2026-01-19 23:40:05+00:00

AI Summary

This paper introduces the Context-based Audio Deepfake Detector (CADD), which leverages contextual information and transcripts to significantly improve the detection of deepfake audios of public figures. It also presents two new datasets, JDD and SYN, composed of real-world and synthetically generated deepfakes, respectively. The research demonstrates CADD's enhanced performance and robustness against various adversarial manipulations compared to existing baseline detectors.

Abstract

Humans use context to assess the veracity of information. However, current audio deepfake detectors analyze only the audio file, without considering either context or transcripts. We create and analyze a Journalist-provided Deepfake Dataset (JDD) of 255 public deepfakes, contributed primarily by over 70 journalists since early 2024. We also generate a synthetic audio dataset (SYN) of deceased public figures and propose a novel Context-based Audio Deepfake Detector (CADD) architecture. In addition, we evaluate performance on two large-scale datasets: ITW and P²V. We show that sufficient context and/or the transcript can significantly improve the efficacy of audio deepfake detectors. Performance (measured via F1 score, AUC, and EER) of multiple baseline audio deepfake detectors and traditional classifiers can be improved by 5%-37.58% in F1 score, 3.77%-42.79% in AUC, and 6.17%-47.83% in EER. We additionally show that CADD, via its use of context and/or transcripts, is more robust to 5 adversarial evasion strategies, limiting performance degradation to an average of just -0.71% across all experiments. Code, models, and datasets are available at our project page: https://sites.northwestern.edu/nsail/cadd-context-based-audio-deepfake-detection (access restricted during review).
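
For concreteness, here is a minimal sketch of how the three reported metrics (F1, AUC, and EER) can be computed from a detector's scores. This is not the paper's released code; the function name and the score convention (higher score = more likely fake) are illustrative assumptions.

```python
# Illustrative metric computation, assuming binary labels and continuous scores.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, roc_curve

def evaluate_detector(y_true, scores, threshold=0.5):
    """Compute F1, AUC, and Equal Error Rate.

    y_true: 1 = deepfake, 0 = real; scores: higher = more likely fake.
    """
    f1 = f1_score(y_true, scores >= threshold)
    auc = roc_auc_score(y_true, scores)

    # EER: the operating point where the false-positive rate equals
    # the false-negative rate (1 - TPR) on the ROC curve.
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1 - tpr
    eer_idx = np.nanargmin(np.abs(fnr - fpr))
    eer = (fpr[eer_idx] + fnr[eer_idx]) / 2
    return {"F1": f1, "AUC": auc, "EER": eer}
```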


Key findings
The CADD framework consistently improved audio deepfake detection performance across 71 state-of-the-art and traditional machine learning baselines, showing F1-score improvements of 5%-37.58% and AUC improvements of 3.77%-42.79%. CADD also demonstrated superior robustness against five adversarial audio manipulation strategies, limiting performance degradation to an average of -0.71% across experiments. These improvements were most substantial on the challenging real-world JDD dataset.
Approach
The CADD architecture integrates features extracted from the raw audio clip (e.g., LFCC, MFCC, Whisper) with contextual information and transcripts. Textual context (recent news, social media posts, and Wikidata information about the public figure) and transcripts are embedded using ALBERT and Whisper, respectively, reduced via PCA, and then fused with the audio features. The combined features pass through a neural fusion module and a classification head to produce the final deepfake decision.
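
A minimal sketch of this late-fusion design, assuming fixed-size per-clip feature vectors and a simple MLP as the fusion module; the class name, layer sizes, and structure are illustrative assumptions, not the authors' implementation:

```python
# Sketch of concatenation-based fusion of audio, context, and transcript
# features, followed by a fusion MLP and a binary classification head.
import torch
import torch.nn as nn

class ContextFusionDetector(nn.Module):  # hypothetical name
    def __init__(self, audio_dim, context_dim, transcript_dim, hidden_dim=256):
        super().__init__()
        fused_dim = audio_dim + context_dim + transcript_dim
        self.fusion = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        self.head = nn.Linear(hidden_dim, 1)  # logit: 1 = deepfake

    def forward(self, audio_feat, context_emb, transcript_emb):
        # All inputs are per-clip vectors; the PCA reduction of the
        # ALBERT/Whisper text embeddings is assumed to happen upstream.
        fused = torch.cat([audio_feat, context_emb, transcript_emb], dim=-1)
        return self.head(self.fusion(fused))
```

Concatenation followed by a shallow MLP is the simplest fusion choice; the upstream PCA step keeps the high-dimensional text embeddings comparable in size to the audio features before they are combined.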
Datasets
Journalist-provided Deepfake Dataset (JDD), Synthetic Audio Dataset (SYN), In-The-Wild (ITW), Perturbed Public Voices (P2V)
Model(s)
RawNet3, LCNN, MesoNet, SpecRNet (as deepfake detector backbones); LFCC, MFCC, Whisper (audio feature extractors); ALBERT, Whisper (text embedding models); PCA (dimensionality reduction); Logistic Regression, Random Forest, SVM, AdaBoost, XGBoost, Gaussian Naive Bayes, K-Nearest Neighbors (traditional machine learning classifiers).
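
As an illustration of the traditional-classifier branch (not the paper's pipeline), a hedged sketch that mean-pools MFCCs per clip and fits one of the classifiers listed above; file paths and labels are hypothetical placeholders:

```python
# Illustrative MFCC + logistic-regression baseline for audio deepfake detection.
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def mfcc_features(path, sr=16000, n_mfcc=40):
    """Return mean-pooled MFCCs as a fixed-size per-clip feature vector."""
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

train_paths = ["real_001.wav", "fake_001.wav"]  # hypothetical file list
train_labels = np.array([0, 1])                 # 1 = deepfake, 0 = real

X = np.stack([mfcc_features(p) for p in train_paths])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
```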
Author countries
USA, Israel