Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling

Authors: Viola Negroni, Luca Cuccovillo, Paolo Bestagini, Patrick Aichroth, Stefano Tubaro

Published: 2026-01-21 10:34:12+00:00

Comment: Accepted @ IEEE ICASSP 2026

AI Summary

This paper introduces SFATNet-4, a lightweight multi-task transformer for explainable speech deepfake detection. The model simultaneously predicts formant trajectories and voicing patterns while classifying speech as real or fake, providing insights into whether its decisions rely more on voiced or unvoiced regions. It improves upon its predecessor by requiring fewer parameters, training faster, and offering better interpretability without sacrificing prediction performance.

Abstract

In this work, we introduce a multi-task transformer for speech deepfake detection, capable of predicting formant trajectories and voicing patterns over time, ultimately classifying speech as real or fake, and highlighting whether its decisions rely more on voiced or unvoiced regions. Building on a prior speaker-formant transformer architecture, we streamline the model with an improved input segmentation strategy, redesign the decoding process, and integrate built-in explainability. Compared to the baseline, our model requires fewer parameters, trains faster, and provides better interpretability, without sacrificing prediction performance.


Key findings
SFATNet-4 consistently outperforms its predecessor, SFATNet-3, across both in-domain and out-of-domain datasets, achieving lower Equal Error Rate (EER) and higher Area Under the Curve (AUC). The model also demonstrates significant efficiency gains, training four times faster with 35% fewer parameters, while offering built-in explainability. Explainability analysis reveals that the model heavily relies on unvoiced speech regions for detecting synthetic speech, suggesting these areas harbor salient synthesis artifacts.
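For reference, a minimal sketch of how the two reported metrics (EER and AUC) can be computed from per-utterance detection scores. The score convention (higher = more likely fake) and the use of scikit-learn are assumptions for illustration, not details taken from the paper.

```python
# Sketch of EER/AUC computation from detection scores (assumed convention:
# label 1 = fake, higher score = "more fake"). Not the authors' evaluation code.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def eer_and_auc(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # EER is the operating point where false positive and false negative rates meet.
    idx = np.nanargmin(np.abs(fpr - fnr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    auc = roc_auc_score(labels, scores)
    return eer, auc

# Toy example:
labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
print(eer_and_auc(labels, scores))
```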
Approach
The approach utilizes SFATNet-4, a multi-task transformer architecture comprising magnitude and phase encoders and three decoding modules. These decoders are responsible for predicting fundamental frequency and formant trajectories, distinguishing voiced/unvoiced segments, and performing deepfake detection with frame-level explainability via a multi-head pooling mechanism. An improved time-only input segmentation strategy enhances efficiency and enables frame-level interpretation.
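The sketch below illustrates this multi-task layout: two transformer encoders over magnitude and phase frames, a per-frame head for F0/formant trajectories, a per-frame voiced/unvoiced head, and an utterance-level real/fake head whose attention-pooling weights provide the frame-level explanation. It is not the authors' implementation; all layer sizes, the number of formants, and the single-head pooling design are assumptions chosen for readability.

```python
# Illustrative sketch (assumed architecture, not the SFATNet-4 code) of a
# multi-task transformer with magnitude/phase encoders and three decoding heads.
import torch
import torch.nn as nn

class MultiTaskFormantSketch(nn.Module):
    def __init__(self, n_bins=257, d_model=128, n_formants=4):
        super().__init__()
        self.mag_proj = nn.Linear(n_bins, d_model)
        self.phase_proj = nn.Linear(n_bins, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.mag_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.phase_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Decoding heads: F0 + formant trajectories, voiced/unvoiced flag,
        # and a real/fake score obtained by attention pooling over frames.
        self.formant_head = nn.Linear(2 * d_model, 1 + n_formants)  # F0 + formants per frame
        self.voicing_head = nn.Linear(2 * d_model, 1)               # voiced/unvoiced logit per frame
        self.pool_attn = nn.Linear(2 * d_model, 1)                  # frame weights = explanation
        self.detect_head = nn.Linear(2 * d_model, 1)                # utterance-level real/fake logit

    def forward(self, mag, phase):  # each: (batch, frames, n_bins)
        h = torch.cat([self.mag_encoder(self.mag_proj(mag)),
                       self.phase_encoder(self.phase_proj(phase))], dim=-1)
        formants = self.formant_head(h)                     # (B, T, 1 + n_formants)
        voicing = self.voicing_head(h)                      # (B, T, 1)
        weights = torch.softmax(self.pool_attn(h), dim=1)   # (B, T, 1): frame-level relevance
        pooled = (weights * h).sum(dim=1)                   # (B, 2 * d_model)
        fake_logit = self.detect_head(pooled)               # (B, 1)
        return formants, voicing, fake_logit, weights
```

In a setup like this, aggregating the pooling weights separately over frames the voicing head marks as voiced versus unvoiced yields the kind of voiced/unvoiced attribution discussed in the key findings above.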
Datasets
ASVspoof 5, In-the-Wild, FakeOrReal, TIMIT-TTS, VidTIMIT
Model(s)
SFATNet-4 (Multi-Task Transformer), SFATNet-3 (as baseline/predecessor)
Author countries
Italy, Germany