Forensic Similarity for Speech Deepfakes

Authors: Viola Negroni, Davide Salvi, Daniele Ugo Leonzio, Paolo Bestagini, Stefano Tubaro

Published: 2025-10-03 10:02:34+00:00

Comment: Submitted @ IEEE OJSP

AI Summary

This paper introduces 'Forensic Similarity for Speech Deepfakes', a digital audio forensics approach that determines if two audio segments share the same forensic traces. The system utilizes a two-part deep-learning architecture comprising a feature extractor based on a speech deepfake detector backbone and a shallow similarity network. The method demonstrates strong generalization to previously unseen generative models for source verification and shows applicability to splicing detection.

Abstract

In this paper, we introduce a digital audio forensics approach called Forensic Similarity for Speech Deepfakes, which determines whether two audio segments contain the same forensic traces or not. Our work is inspired by prior work in the image domain on forensic similarity, which proved strong generalization capabilities against unknown forensic traces, without requiring prior knowledge of them at training time. To achieve this in the audio setting, we propose a two-part deep-learning system composed of a feature extractor based on a speech deepfake detector backbone and a shallow neural network, referred to as the similarity network. This system maps pairs of audio segments to a score indicating whether they contain the same or different forensic traces. We evaluate the system on the emerging task of source verification, highlighting its ability to identify whether two samples originate from the same generative model. Additionally, we assess its applicability to splicing detection as a complementary use case. Experiments show that the method generalizes to a wide range of forensic traces, including previously unseen ones, illustrating its flexibility and practical value in digital audio forensics.


Key findings
The forensic similarity framework exhibits strong generalization capabilities to previously unseen generative models, proving effective in source verification and showing applicability to splicing detection. The proposed similarity model consistently outperforms standard similarity scoring methods across various datasets. The framework achieves near-perfect accuracy for most generator pairs, successfully identifying common generative sources even for methods not seen during training.
Approach
The system employs a two-part deep-learning framework. First, a feature extractor, built upon a speech deepfake detector backbone, maps input audio segments to forensically meaningful embeddings. Second, a shallow neural network, termed the similarity model, takes pairs of these embeddings and produces a score indicating whether the two segments contain the same or different forensic traces.
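The two-stage pipeline described above can be sketched in plain Python. Everything here is an illustrative placeholder (toy embedding dimension, a hash-based stand-in for the deepfake-detector backbone, random untrained weights for the shallow network), not the paper's actual LCNN-based system:

```python
import math
import random

EMB_DIM = 8  # toy embedding size; real backbones (e.g., LCNN) output larger vectors

def extract_embedding(audio_segment):
    """Stand-in for the feature extractor: a deterministic pseudo-random
    projection of the segment. Purely hypothetical, not the paper's backbone."""
    rng = random.Random(hash(tuple(audio_segment)) & 0xFFFFFFFF)
    return [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]

class ShallowSimilarityNet:
    """Toy two-layer MLP over a concatenated embedding pair. In the paper this
    network is trained on same/different pairs; here weights are random."""
    def __init__(self, emb_dim, hidden=16, seed=1):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(2 * emb_dim)]
                   for _ in range(hidden)]
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden)]

    def score(self, e1, e2):
        x = e1 + e2  # concatenate the embedding pair
        # hidden layer with ReLU activation
        h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w1]
        z = sum(w * v for w, v in zip(self.w2, h))
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> same-trace score in [0, 1]

def same_forensic_traces(seg_a, seg_b, net, threshold=0.5):
    """Decide whether two segments share the same forensic traces."""
    return net.score(extract_embedding(seg_a), extract_embedding(seg_b)) >= threshold
```

For source verification, the two segments would come from two audio files under comparison; for splicing detection, they would be windows drawn from the same file, with a low score flagging a candidate splice boundary.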
Datasets
MLAAD, ASVspoof 2019, TIMIT-TTS, PartialSpoof
Model(s)
LCNN, ResNet18, RawNet2, AASIST (as feature extractor backbones), a shallow neural network (as the similarity model). LCNN with an unfrozen training strategy was selected as the optimal feature extractor.
Author countries
Italy