Audio Deepfake Detection at the First Greeting: Hi!

Authors: Haohan Shi, Xiyu Shi, Safak Dogan, Tianjin Huang, Yunxiao Zhang

Published: 2026-01-27 13:02:39+00:00

Comment: Accepted at ICASSP 2026. Copyright 2026 IEEE. The final published version will be available via IEEE Xplore

AI Summary

This paper introduces Short-MGAA (S-MGAA), a lightweight deepfake detection framework designed for ultra-short audio inputs (0.5-2.0s) under real-world communication degradations. S-MGAA enhances discriminative representation learning through a Pixel-Channel Enhanced Module (PCEM) for fine-grained saliency and a Frequency Compensation Enhanced Module (FCEM) for multi-scale frequency modeling. The proposed method consistently outperforms state-of-the-art baselines, demonstrating strong robustness and efficiency for real-time deployment.

Abstract

This paper focuses on audio deepfake detection under real-world communication degradations, with an emphasis on ultra-short inputs (0.5-2.0s), targeting the capability to detect synthetic speech at a conversation opening, e.g., when a scammer says "Hi". We propose Short-MGAA (S-MGAA), a novel lightweight extension of Multi-Granularity Adaptive Time-Frequency Attention, designed to enhance discriminative representation learning for short, degraded inputs subjected to communication processing and perturbations. S-MGAA integrates two tailored modules: a Pixel-Channel Enhanced Module (PCEM) that amplifies fine-grained time-frequency saliency, and a Frequency Compensation Enhanced Module (FCEM) that supplements limited temporal evidence via multi-scale frequency modeling and adaptive frequency-temporal interaction. Extensive experiments demonstrate that S-MGAA consistently surpasses nine state-of-the-art baselines while achieving strong robustness to degradations and favorable efficiency-accuracy trade-offs, including low RTF, competitive GFLOPs, compact parameters, and reduced training cost, highlighting its strong potential for real-time deployment in communication systems and on edge devices.


Key findings
S-MGAA consistently surpassed nine state-of-the-art baselines across various ultra-short durations (0.5-2.0s) and degradation conditions, with notable EER reductions (e.g., a 23.89% average reduction at 0.5s with MFCCs). The framework also demonstrated favorable efficiency-accuracy trade-offs, maintaining low computational cost (GFLOPs, parameters, training time) and stable latency, highlighting its potential for real-time, resource-constrained deployment.
Approach
The authors propose Short-MGAA (S-MGAA), an extension of Multi-Granularity Adaptive Time-Frequency Attention, tailored for ultra-short and degraded audio. It incorporates a Pixel-Channel Enhanced Module (PCEM) to amplify fine-grained time-frequency saliency and a Frequency Compensation Enhanced Module (FCEM) to supplement limited temporal information via multi-scale frequency modeling and adaptive frequency-temporal interaction.
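The paper does not publish PCEM's internals in this summary, but the described behavior (amplifying fine-grained time-frequency saliency via pixel- and channel-level enhancement) resembles a standard squeeze-and-gate attention pattern. The sketch below is an illustrative assumption, not the authors' implementation: it gates a (channels, frequency, time) feature map first per channel, then per time-frequency bin.

```python
import numpy as np

def channel_pixel_attention(x: np.ndarray) -> np.ndarray:
    """Illustrative PCEM-style gating (assumed, not the paper's exact module).

    x: feature map of shape (channels, freq, time).
    Returns a feature map of the same shape, reweighted by channel and
    per-pixel (time-frequency) saliency gates.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Channel attention: squeeze each channel to one scalar, gate with sigmoid.
    channel_scores = x.mean(axis=(1, 2))                  # shape (C,)
    x = x * sigmoid(channel_scores)[:, None, None]

    # Pixel attention: pool across channels to score each T-F bin, then gate.
    pixel_scores = x.mean(axis=0)                         # shape (F, T)
    return x * sigmoid(pixel_scores)[None, :, :]

# Example: an 8-channel map over 40 frequency bins and 50 frames (~0.5 s).
spec = np.random.default_rng(0).standard_normal((8, 40, 50))
out = channel_pixel_attention(spec)
```

Because both gates lie in (0, 1), the module can only attenuate, never amplify, individual activations; relative saliency is what gets emphasized. Any learned projections or multi-scale branches in the actual PCEM/FCEM are omitted here.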
Datasets
Fake-or-Real, Wavefake, LJSpeech, MLAAD-EN, M-AILABS, ASVspoof2021 Logical Access (combined into Dcom for training); ADD-C test dataset for evaluation.
Model(s)
Short-MGAA (S-MGAA), an extension of Multi-Granularity Adaptive Time-Frequency Attention (MGAA), utilizing Linear-Frequency Cepstral Coefficients (LFCC), Constant-Q Cepstral Coefficients (CQCC), and Mel-frequency Cepstral Coefficients (MFCC) as input features.
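The paper's exact front-end configuration (sample rate, frame length, filter count) is not given in this summary, so the values below are assumptions chosen for a 0.5 s clip at 16 kHz. This is a minimal from-scratch MFCC pipeline (frame, Hann window, power spectrum, triangular mel filterbank, log, DCT-II) of the kind used as one of the three input features:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=40, n_mfcc=20):
    """Simplified MFCC sketch; parameters are illustrative assumptions."""
    # Frame the signal and apply a Hann window.
    frames = [signal[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), n_fft)) ** 2  # (T, n_fft//2+1)

    # Triangular mel filterbank over the linear-frequency power spectrum.
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel2hz(np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)                  # (T, n_mels)

    # DCT-II over the mel axis; keep the first n_mfcc cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T                                     # (T, n_mfcc)

# Example: a 0.5 s clip at 16 kHz, matching the paper's shortest input length.
clip = np.random.default_rng(1).standard_normal(8000)
feats = mfcc(clip)
```

LFCC replaces the mel filterbank with linearly spaced filters, and CQCC derives cepstra from a constant-Q transform; both follow the same cepstral-analysis outline.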
Author countries
UK