SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

Authors: Ido Nitzan Hidekel, Gal Lifshitz, Khen Cohen, Dan Raviv

Published: 2025-11-26 12:16:38+00:00

AI Summary

SONAR addresses the lack of generalization in deepfake audio detection, which stems from spectral bias causing models to overlook subtle high-frequency artifacts left by deepfake generators. The framework explicitly disentangles the audio signal into low-frequency content and high-frequency residuals via a dual-path architecture and utilizes a frequency-aware Jensen-Shannon contrastive loss. This approach sharpens decision boundaries by enforcing alignment for genuine content-noise pairs while maximizing the separation of fake embeddings.
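The paper does not spell out the exact form of its frequency-aware Jensen-Shannon contrastive loss, but the idea of "aligning genuine content-noise pairs while separating fake embeddings" can be sketched as follows. This is a minimal, hypothetical formulation: the `margin` parameter and the label convention (0 = real, 1 = fake) are assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # Elementwise KL divergence over the last axis.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def js_divergence(p, q):
    # Symmetric, bounded divergence: JS(P, Q) = 0.5*KL(P||M) + 0.5*KL(Q||M).
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_alignment_loss(content_logits, noise_logits, labels, margin=0.5):
    """Hypothetical JS contrastive alignment: for real pairs (label 0)
    minimize the JS divergence between the content and noise views;
    for fake pairs (label 1) push the divergence above a margin."""
    p = softmax(content_logits)
    q = softmax(noise_logits)
    js = js_divergence(p, q)
    loss = np.where(labels == 0, js, np.maximum(0.0, margin - js))
    return loss.mean()
```

Because JS divergence is symmetric and bounded by log 2, neither view can "cheat" by collapsing its distribution, which is one plausible reason the paper prefers it over plain KL for pairing the two paths.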

Abstract

Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while a cloned path, preceded by learnable, value-constrained SRM high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views to capture long- and short-range frequency dependencies, and a frequency-aware Jensen-Shannon contrastive loss pulls real content-noise pairs together while pushing fake embeddings apart, accelerating optimization and sharpening decision boundaries. Evaluated on the ASVspoof 2021 and in-the-wild benchmarks, SONAR attains state-of-the-art performance and converges four times faster than strong baselines. By elevating faint high-frequency residuals to first-class learning signals, SONAR unveils a fully data-driven, frequency-guided contrastive framework that splits the latent space into two disjoint manifolds: natural-HF for genuine audio and distorted-HF for synthetic audio, thereby sharpening decision boundaries. Because the scheme operates purely at the representation level, it is architecture-agnostic and, in future work, can be seamlessly integrated into any model or modality where subtle high-frequency cues are decisive.
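The frequency cross-attention that "reunites the two views" is, at its core, standard scaled dot-product attention with queries from one view and keys/values from the other. The single-head sketch below is an illustrative assumption; the paper's actual module (head count, projections, direction of querying) may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(content, noise):
    """Hypothetical single-head cross-attention: low-frequency content
    tokens (queries) attend over high-frequency residual tokens
    (keys/values), enriching each content token with HF cues."""
    d = content.shape[-1]
    scores = content @ noise.T / np.sqrt(d)  # (Tc, Tn) attention logits
    weights = softmax(scores)                # rows sum to 1
    return weights @ noise                   # same shape as content
```

Letting each frequency view attend over the other is what allows the fused representation to model both long- and short-range frequency dependencies that neither path captures alone.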


Key findings
SONAR achieves single-run state-of-the-art EERs across multiple benchmarks, including 1.45% on ASVspoof 2021 DF and 5.43% on the challenging In-the-Wild corpus. The framework is highly efficient, converging up to eight times faster than strong baselines, demonstrating that the frequency-contrastive alignment accelerates the separation of real and fake latent manifolds. Ablation studies confirmed that both the learnable SRM filters and the JS alignment loss are essential for the observed performance gains.
Approach
SONAR uses a dual-path system consisting of a Content Feature Extractor (CFE) using an XLSR encoder and a Noise Feature Extractor (NFE) which preprocesses audio using constrained, learnable SRM high-pass filters before feeding it into a cloned XLSR encoder. The resulting content and noise embeddings are fused via cross-attention and optimized using a combined weighted cross-entropy and a Jensen-Shannon (JS) alignment loss to modulate the dependency between low- and high-frequency features.
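The "constrained, learnable SRM high-pass filters" in the NFE can be understood as ordinary convolution kernels re-projected after each update so they stay high-pass. A minimal 1-D sketch, with assumed constraints (tap clamping plus zero-sum taps, i.e. zero DC response); the paper's exact constraint values are not given here.

```python
import numpy as np

def constrain_srm(kernel, clip=2.0):
    """Hypothetical value constraint applied after each gradient step:
    clamp the taps, then subtract the mean so the taps sum to zero,
    giving zero response at DC and keeping the filter high-pass."""
    k = np.clip(kernel, -clip, clip)
    return k - k.mean()

def highpass_residual(audio, kernel):
    """Distill the faint high-frequency residual fed to the noise path."""
    return np.convolve(audio, kernel, mode="same")

# A constant (pure low-frequency) input produces a near-zero residual,
# so only HF artifacts survive into the noise branch.
kernel = constrain_srm(np.array([1.0, -4.0, 6.0, -4.0, 1.0]))
```

Keeping the filters learnable but constrained lets the NFE adapt to generator-specific artifacts while guaranteeing it never collapses back into passing low-frequency content, which would defeat the disentanglement.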
Datasets
ASVspoof 2019 (LA), ASVspoof 2021 (LA, DF), In The Wild (ITW) benchmark.
Model(s)
XLSR (Wav2Vec 2.0 XLSR) Encoder, AASIST (Audio Anti-spoofing Using Integrated Spectro-temporal Graph Attention Networks) classifier, XLSR-Mamba, Rich Feature Extractor (RFE) using learnable SRM filters.
Author countries
Israel