TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models

Authors: Awais Khan, Muhammad Umar Farooq, Kutub Uddin, Khalid Malik

Published: 2026-04-01 16:12:31+00:00

AI Summary

The paper introduces TRACE, a training-free framework for detecting partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations. It hypothesizes that genuine speech forms smooth embedding trajectories, while splice boundaries introduce abrupt disruptions. TRACE achieves competitive performance on standard benchmarks and surpasses supervised baselines on challenging LLM-driven deepfakes without any training or labeled data.

Abstract

Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks spanning two languages, using six speech foundation models. On PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. On LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.


Key findings
TRACE achieved an 8.08% EER on the PartialSpoof dataset, demonstrating competitive performance with fine-tuned supervised baselines without any training. Crucially, on the challenging LlamaPartialSpoof benchmark featuring LLM-driven commercial synthesis, TRACE surpassed a supervised baseline (24.12% vs. 24.49% EER) without any target-domain data, highlighting its strong generalization across unseen synthesis methods and languages. The study also found that intermediate transformer layers of foundation models and first-order dynamics are more informative for detection than final layers or second-order dynamics.
Approach
TRACE detects partial audio deepfakes by extracting L2-normalized frame embeddings from a frozen speech foundation model. It then calculates the chord distance between consecutive unit-sphere projections (first-order dynamics) to identify abrupt disruptions indicative of splice boundaries. These frame-level dynamics are aggregated into a scalar utterance-level detection score using various closed-form statistics, which are linearly fused and calibrated without any model training.
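The core computation described above can be sketched in a few lines. This is a minimal illustration, assuming frame embeddings have already been extracted from a frozen foundation model; the summary statistics shown (mean, std, max) are illustrative aggregations, not necessarily the exact fused statistics used in the paper.

```python
import numpy as np

def trace_score(embeddings: np.ndarray) -> dict:
    """Chord-distance dynamics over frame embeddings.

    embeddings: (T, D) array of frame-level representations
    from a frozen speech foundation model.
    """
    # Project each frame embedding onto the unit sphere (L2 normalization).
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # First-order dynamics: chord distance between consecutive projections.
    deltas = np.linalg.norm(unit[1:] - unit[:-1], axis=1)  # shape (T-1,)
    # Closed-form utterance-level statistics; a splice boundary appears
    # as an abrupt spike in the delta sequence.
    return {
        "mean": float(deltas.mean()),
        "std": float(deltas.std()),
        "max": float(deltas.max()),
    }

# Toy example: a slowly drifting trajectory vs. one with a mid-utterance jump.
T, D = 50, 8
smooth = np.ones((T, D))
smooth[:, 0] += 0.01 * np.arange(T)   # slow drift, as in genuine speech
spliced = smooth.copy()
spliced[25:, 1] += 3.0                # abrupt disruption, like a splice
print(trace_score(spliced)["max"] > trace_score(smooth)["max"])  # → True
```

In the full method these frame-level statistics would then be linearly fused and calibrated into a single utterance-level score, with no trained parameters.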
Datasets
PartialSpoof, HalfTruth Audio Deepfake (HAD), ADD 2023 Track 2, LlamaPartialSpoof
Model(s)
WavLM-Large, WavLM-Base, HuBERT-Large, Wav2Vec 2.0-Base, Wav2Vec 2.0-XLSR, Whisper-Base
Author countries
USA