Cyclostationarity Analysis as a Complement to Self-Supervised Representations for Speech Deepfake Detection

Authors: Cemal Hanilçi, Md Sahidullah, Tomi Kinnunen

Published: 2026-03-04 10:28:28+00:00

Comment: submitted to IEEE Transactions on Audio, Speech and Language Processing

AI Summary

This paper introduces a cyclostationarity-inspired acoustic feature extraction framework for speech deepfake detection based on spectral correlation density (SCD). The proposed features model periodic statistical structures in speech and demonstrate strong complementarity with self-supervised learning (SSL) embeddings and conventional acoustic representations. Fusion of SCD and SSL embeddings significantly reduces the equal error rate (EER) on ASVspoof 2019 LA from 8.28% to 0.98% and yields consistent improvements on the challenging ASVspoof 5 dataset.

Abstract

Speech deepfake detection (SDD) is essential for maintaining trust in voice-driven technologies and digital media. Although recent SDD systems increasingly rely on self-supervised learning (SSL) representations that capture rich contextual information, complementary signal-driven acoustic features remain important for modeling fine-grained structural properties of speech. Most existing acoustic front ends are based on time-frequency representations, which do not fully exploit higher-order spectral dependencies inherent in speech signals. We introduce a cyclostationarity-inspired acoustic feature extraction framework for SDD based on spectral correlation density (SCD). The proposed features model periodic statistical structures in speech by capturing spectral correlations between frequency components. In particular, we propose temporally structured SCD features that characterize the evolution of spectral and cyclic-frequency components over time. The effectiveness and complementarity of the proposed features are evaluated using multiple countermeasure architectures, including convolutional neural networks, SSL-based embedding systems, and hybrid fusion models. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and ASVspoof 5 demonstrate that SCD-based features provide complementary discriminative information to SSL embeddings and conventional acoustic representations. In particular, fusion of SSL and SCD embeddings reduces the equal error rate on ASVspoof 2019 LA from 8.28% to 0.98%, and yields consistent improvements on the challenging ASVspoof 5 dataset. The results highlight cyclostationary signal analysis as a theoretically grounded and effective front end for speech deepfake detection.
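To make the idea of "spectral correlations between frequency components" concrete, the following is a minimal sketch of one textbook SCD estimator, the time-smoothed cyclic periodogram. It is not taken from the paper: the abstract does not specify the estimator, and the function name `scd_estimate`, the parameters `n_fft`, `hop`, and `max_shift`, and the asymmetric convention S(f, α) ≈ E[X(f) X*(f − α)] are all assumptions made for illustration. The demo signal is white noise modulated by a cosine at frequency f0, which is cyclostationary with cyclic frequency 2·f0.

```python
import numpy as np

def scd_estimate(x, n_fft=256, hop=64, max_shift=64):
    """Time-smoothed cyclic periodogram (illustrative, asymmetric form).

    Row a of the result estimates E[X(f) * conj(X(f - alpha))] for the
    cyclic frequency alpha = a / n_fft (cycles per sample).
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Frame the signal: idx has shape (n_frames, n_fft).
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    spec = np.fft.fft(x[idx] * window, axis=1)
    frame_index = np.arange(n_frames)[:, None]
    scd = np.empty((max_shift + 1, n_fft), dtype=complex)
    for a in range(max_shift + 1):
        # shifted[:, m] == spec[:, (m - a) mod n_fft]
        shifted = np.roll(spec, a, axis=1)
        # Compensate the phase caused by each frame's start time so that
        # cyclic components add coherently across frames.
        phase = np.exp(-2j * np.pi * a * frame_index * hop / n_fft)
        scd[a] = np.mean(spec * np.conj(shifted) * phase, axis=0)
    return scd / np.sum(window ** 2)

# Demo: cosine-modulated white noise, f0 = 16/256, so alpha = 32/256.
rng = np.random.default_rng(0)
n = np.arange(16384)
x = rng.standard_normal(n.size) * np.cos(2 * np.pi * (16 / 256) * n)
scd = scd_estimate(x)
# Average magnitude over spectral frequency for each cyclic-frequency bin.
profile = np.abs(scd).mean(axis=1)
```

In this sketch, `profile` should show a clear peak at cyclic bin 32 (the modulation-induced cyclic frequency) in addition to the ordinary power spectrum at bin 0, while unrelated bins stay near the estimation noise floor. The temporally structured variants described in the abstract are not reproduced here.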


Key findings
SCD-based features provide complementary discriminative information to both SSL embeddings and conventional acoustic representations, leading to substantial performance gains. The fusion of SSL and SCD embeddings reduced the EER on ASVspoof 2019 LA from 8.28% (SSL-only) to 0.98% and demonstrated improved robustness to unseen and diverse spoofing attacks on ASVspoof 5, particularly against adversarial attacks. SCD features consistently showed smaller performance degradation from development to evaluation sets compared to other acoustic features.
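For readers unfamiliar with the metric, the equal error rate is the operating point at which the false-acceptance rate (spoofed speech accepted) equals the false-rejection rate (bona fide speech rejected). The sketch below is a simple threshold sweep, assuming higher scores mean "more likely bona fide"; the function name `compute_eer` and the toy scores are hypothetical, and official ASVspoof scoring tools may interpolate differently.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    Assumes a higher score indicates bona fide speech.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # False rejection: bona fide trials scoring below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance: spoof trials scoring at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2

# Toy scores (hypothetical): one spoof trial overlaps the bona fide range.
eer = compute_eer([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.65])

# Relative EER reduction implied by the reported figures (8.28% to 0.98%).
relative_reduction = (8.28 - 0.98) / 8.28
```

On the reported numbers, going from 8.28% to 0.98% corresponds to roughly an 88% relative reduction in EER.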
Approach
The authors propose a cyclostationarity-inspired acoustic feature extraction framework based on Spectral Correlation Density (SCD), including temporally structured SCD features (SCDa and SCDb). These features capture periodic statistical structures by modeling spectral correlations. Their effectiveness and complementarity are evaluated using various countermeasure architectures, including convolutional neural networks (SE-Res2Net50), self-supervised learning (SSL) embedding systems (Wav2Vec 2.0), and hybrid fusion models.
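The summary does not state how the hybrid fusion model combines the two embedding streams. As a rough illustration only, the sketch below assumes the simplest option: L2-normalize an acoustic-branch embedding and an SSL embedding, concatenate them, and apply a single linear layer with a sigmoid. The function names, embedding sizes (256 and 128), and random weights are all hypothetical and stand in for trained components.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def fuse_and_score(acoustic_emb, ssl_emb, weights, bias):
    """Concatenation fusion followed by a linear scorer (an assumption,
    not the paper's confirmed fusion mechanism)."""
    fused = np.concatenate(
        [l2_normalize(acoustic_emb), l2_normalize(ssl_emb)], axis=-1
    )
    logits = fused @ weights + bias
    # Sigmoid maps logits to a bona fide probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical batch of 4 utterances with made-up embedding sizes.
rng = np.random.default_rng(0)
acoustic_emb = rng.standard_normal((4, 256))  # e.g. from a CNN on SCD features
ssl_emb = rng.standard_normal((4, 128))       # e.g. projected SSL embedding
weights = rng.standard_normal(256 + 128) * 0.01
scores = fuse_and_score(acoustic_emb, ssl_emb, weights, bias=0.0)
```

Normalizing each stream before concatenation keeps one embedding from dominating the other purely because of scale; whether the authors do this is not stated in the summary.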
Datasets
ASVspoof 2019 LA, ASVspoof 2021 DF, ASVspoof 5
Model(s)
Front-end features: Spectral Correlation Density (SCD), temporally structured SCD features (SCDa, SCDb), LFCCs, Mel-spectrograms, CQT spectrograms, STFT spectrograms.
Countermeasure models: SE-Res2Net50; Wav2Vec 2.0 (frozen SSL backbone); SSL-only CM (Wav2Vec 2.0 embeddings with lightweight projection layers); Embedding Fusion CM (combining SE-Res2Net50 and SSL embeddings).
Author countries
Turkey, India, Finland