Forensic Similarity for Speech Deepfakes

Authors: Viola Negroni, Davide Salvi, Daniele Ugo Leonzio, Paolo Bestagini, Stefano Tubaro

Published: 2025-10-03 10:02:34+00:00

AI Summary

This paper introduces Forensic Similarity for Speech Deepfakes, a digital audio forensics approach that determines whether two audio segments contain the same generative forensic traces. The proposed system is a Siamese deep-learning framework that pairs a speech deepfake detector backbone, used as a feature extractor, with a shallow similarity network. The method generalizes well to source verification across unseen generative models and is also useful for audio splicing detection.

Abstract

In this paper, we introduce a digital audio forensics approach called Forensic Similarity for Speech Deepfakes, which determines whether two audio segments contain the same forensic traces. Our work is inspired by prior research on forensic similarity in the image domain, which demonstrated strong generalization to unknown forensic traces without requiring prior knowledge of them at training time. To achieve this in the audio setting, we propose a two-part deep-learning system composed of a feature extractor, based on a speech deepfake detector backbone, and a shallow neural network referred to as the similarity network. This system maps pairs of audio segments to a score indicating whether they contain the same or different forensic traces. We evaluate the system on the emerging task of source verification, highlighting its ability to identify whether two samples originate from the same generative model. Additionally, we assess its applicability to splicing detection as a complementary use case. Experiments show that the method generalizes to a wide range of forensic traces, including previously unseen ones, illustrating its flexibility and practical value in digital audio forensics.


Key findings

The proposed Similarity Model consistently outperformed standard similarity scoring baselines (cosine and Euclidean distance) for source verification across all test sets. The best configuration, using LCNN as an unfrozen feature extractor, achieved an EER of 10.5% and an AUC of 95.7% on unseen in-domain generators (MLAAD open-set). The method also generalized to the complementary task of splicing detection, reaching an AUC of 80% on the PartialSpoof development set and demonstrating robustness to unknown traces.
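
For context, the EER (equal error rate) is the operating point where the false acceptance and false rejection rates coincide, while the AUC summarizes ranking quality over all decision thresholds. Below is a minimal sketch of how both metrics can be computed from pairwise similarity scores, assuming binary same/different labels and scikit-learn; the function name is ours, not the authors':

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    def auc_and_eer(labels, scores):
        """AUC and EER from pairwise scores (label 1 = same source, 0 = different)."""
        auc = roc_auc_score(labels, scores)
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1.0 - tpr                             # false negative rate
        eer = fpr[np.nanargmin(np.abs(fpr - fnr))]  # FPR/FNR crossing point
        return float(auc), float(eer)

    # Toy usage: two same-source pairs scored high, two different-source pairs low.
    print(auc_and_eer(np.array([1, 1, 0, 0]), np.array([0.9, 0.7, 0.4, 0.2])))
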
Approach

The system uses a two-part Siamese deep-learning architecture: two input audio segments are processed by a shared feature extractor (a repurposed deepfake detector backbone) to obtain forensic embeddings, and a shallow similarity network then maps the embedding pair to a score indicating the likelihood that both segments share the same generative source. Training proceeds sequentially: the feature extractor is first optimized for closed-set source tracing, and the similarity model is then trained in the Siamese setup, as sketched below.
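
A minimal PyTorch sketch of this two-part design; the backbone stands in for the paper's feature extractors (LCNN, ResNet18, RawNet2, AASIST), and the class names and layer sizes are illustrative assumptions rather than the authors' configuration:

    import torch
    import torch.nn as nn

    class SimilarityNetwork(nn.Module):
        """Shallow network mapping a pair of forensic embeddings to one score."""
        def __init__(self, emb_dim=256):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(2 * emb_dim, 128),
                nn.ReLU(),
                nn.Linear(128, 1),   # logit: same source vs. different source
            )

        def forward(self, e1, e2):
            return self.head(torch.cat([e1, e2], dim=-1)).squeeze(-1)

    class ForensicSimilarity(nn.Module):
        """Siamese setup: one shared feature extractor feeding a similarity head."""
        def __init__(self, backbone, emb_dim=256):
            super().__init__()
            self.backbone = backbone          # shared weights for both segments
            self.similarity = SimilarityNetwork(emb_dim)

        def forward(self, x1, x2):
            e1 = self.backbone(x1)            # forensic embedding of segment 1
            e2 = self.backbone(x2)            # forensic embedding of segment 2
            return self.similarity(e1, e2)    # high score = same forensic traces

Under these assumptions, training the similarity head on balanced same-source and different-source pairs with a binary cross-entropy loss on the logit (torch.nn.BCEWithLogitsLoss) would be the natural choice, with the backbone pretrained for closed-set source tracing and optionally left unfrozen, as in the paper's best configuration.
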
Datasets

MLAAD, ASVspoof 2019, TIMIT-TTS, PartialSpoof
Model(s)

LCNN, ResNet18, RawNet2, AASIST (used as feature extractor backbones); custom shallow Similarity Model (Siamese architecture)
Author countries

Italy