Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features
Authors: Lisan Al Amin, Rakib Hossain, Mahbubul Islam, Faisal Quader, Thanh Thi Nguyen
Published: 2026-05-07 11:26:01+00:00
AI Summary
This paper proposes Q-Patch, a quantum feature map tailored for audio deepfake detection that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits. It leverages a compact four-dimensional acoustic descriptor per patch and maps it to a four-qubit circuit with adjacency-aware entanglement, enabling practical quantum kernel construction. Q-Patch aims to exploit the time-frequency structure of audio for improved discrimination between bona fide and spoofed samples in low-resource settings.
Abstract
Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits with adjacency-aware entanglement. Each selected patch is summarized by a compact four-dimensional acoustic descriptor and mapped to a four-qubit circuit with depth at most three, enabling practical quantum kernel construction under near-term constraints. We evaluate Q-Patch on an audio spoofing detection task using a controlled, balanced protocol and compare it with size-matched classical baselines. Q-Patch improves discrimination between bona fide and spoofed samples, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, compared with 0.82 for a radial basis function support vector machine (RBF-SVM) trained on the same patch-level features. Kernel-space analysis further reveals a clear class structure, with cross-class similarity around 0.615 and within-class self-similarity of 1.00. Overall, Q-Patch provides a practical framework for incorporating time-frequency-aware representations into quantum kernel learning for audio authenticity assessment in low-resource settings.
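The abstract does not give the circuit or the descriptor, so the following is only a minimal NumPy sketch of how a four-qubit, depth-three fidelity kernel of this kind could be simulated. Everything specific here is an assumption for illustration, not the authors' design: the RY angle encoding, the data-dependent ZZ rotations on the adjacent pairs (0,1), (2,3), (1,2), the scaling of descriptors to [0, π], and the names `feature_state`, `quantum_kernel`, and `mean_cross_class_similarity`. The entangling angles are made data-dependent because a fixed entangler placed last in the circuit would cancel in the overlap between two encoded states and leave the kernel unchanged.

```python
import numpy as np

N_QUBITS = 4
# Assumed adjacency: a linear chain over the four descriptor entries.
# (0,1) and (2,3) can run in parallel, then (1,2): with the initial
# rotation layer this gives a circuit depth of three.
ADJACENT_PAIRS = [(0, 1), (2, 3), (1, 2)]

# Z eigenvalues (+1 / -1) of each qubit for every computational basis state.
# Qubit 0 is the most significant bit, matching the np.kron ordering below.
_idx = np.arange(2 ** N_QUBITS)
_bits = (_idx[:, None] >> (N_QUBITS - 1 - np.arange(N_QUBITS))) & 1
_Z = 1 - 2 * _bits  # shape (16, 4)


def feature_state(x):
    """Return the 16-dim statevector encoding a 4-dim descriptor x (hypothetical map)."""
    x = np.asarray(x, dtype=float)
    # Layer 1: RY(x_i)|0> on each qubit gives a product state.
    state = np.array([1.0 + 0j])
    for theta in x:
        qubit = np.array([np.cos(theta / 2), np.sin(theta / 2)], dtype=complex)
        state = np.kron(state, qubit)
    # Layers 2-3: exp(-i * angle/2 * Z_i Z_j) on adjacent pairs.
    # These gates are diagonal, so they only multiply each amplitude by a phase.
    for i, j in ADJACENT_PAIRS:
        angle = x[i] * x[j]
        state = state * np.exp(-0.5j * angle * _Z[:, i] * _Z[:, j])
    return state


def quantum_kernel(X, Y):
    """Fidelity kernel K[a, b] = |<phi(X[a]) | phi(Y[b])>|^2."""
    SX = np.array([feature_state(x) for x in X])
    SY = np.array([feature_state(y) for y in Y])
    return np.abs(SX.conj() @ SY.T) ** 2


def mean_cross_class_similarity(K, labels):
    """Average kernel value over pairs of samples with different labels."""
    labels = np.asarray(labels)
    mask = labels[:, None] != labels[None, :]
    return K[mask].mean()


# Toy usage with random stand-in descriptors (not real audio features).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, np.pi, size=(6, 4))
y_demo = np.array([0, 0, 0, 1, 1, 1])
K_demo = quantum_kernel(X_demo, X_demo)
print("diagonal:", np.round(np.diag(K_demo), 3))
print("mean cross-class similarity:", mean_cross_class_similarity(K_demo, y_demo))
```

For any normalized encoding the diagonal of such a fidelity kernel equals 1, which is the sense in which self-similarity is 1.00. A Gram matrix like `K_demo` could then be passed to a kernel classifier that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.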