Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features

Authors: Lisan Al Amin, Rakib Hossain, Mahbubul Islam, Faisal Quader, Thanh Thi Nguyen

Published: 2026-05-07 11:26:01+00:00

AI Summary

This paper proposes Q-Patch, a quantum feature map tailored for audio deepfake detection that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits. It summarizes each patch with a compact four-dimensional acoustic descriptor and maps that descriptor to a four-qubit circuit with adjacency-aware entanglement, enabling practical quantum kernel construction. Q-Patch aims to exploit the time-frequency structure of audio to better discriminate between bona fide and spoofed samples in low-resource settings.

Abstract

Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits with adjacency-aware entanglement. Each selected patch is summarized by a compact four-dimensional acoustic descriptor and mapped to a four-qubit circuit with depth at most three, enabling practical quantum kernel construction under near-term constraints. We evaluate Q-Patch on an audio spoofing detection task using a controlled, balanced protocol and compare it with size-matched classical baselines. Q-Patch improves discrimination between bona fide and spoofed samples, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, compared with 0.82 for a radial basis function support vector machine (RBF-SVM) trained on the same patch-level features. Kernel-space analysis further reveals a clear class structure, with cross-class similarity around 0.615 and within-class self-similarity of 1.00. Overall, Q-Patch provides a practical framework for incorporating time-frequency-aware representations into quantum kernel learning for audio authenticity assessment in low-resource settings.
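The fidelity kernel described in the abstract can be illustrated with a small NumPy statevector simulation. This is a minimal sketch under stated assumptions, not the paper's circuit: the RY angle encoding, the nearest-neighbor CZ chain, the second RY layer, and the names `feature_state`, `fidelity_kernel`, and `kernel_matrix` are all hypothetical, descriptors are assumed to be pre-scaled to [0, 1], and no attempt is made to match the stated depth bound.

```python
import numpy as np

N_QUBITS = 4  # one qubit per descriptor entry, as in the abstract


def ry(theta):
    """Single-qubit RY rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])


def apply_single(state, gate, qubit):
    """Apply a 2x2 gate to one qubit of a (2,)*N_QUBITS state tensor."""
    state = np.tensordot(gate, state, axes=([1], [qubit]))
    return np.moveaxis(state, 0, qubit)


def apply_cz(state, q1, q2):
    """Apply CZ: flip the sign of amplitudes where both qubits are 1."""
    state = state.copy()
    idx = [slice(None)] * N_QUBITS
    idx[q1] = 1
    idx[q2] = 1
    state[tuple(idx)] *= -1
    return state


def feature_state(x):
    """Hypothetical feature map: RY layer, CZ chain, RY layer (assumed layout)."""
    state = np.zeros((2,) * N_QUBITS, dtype=complex)
    state[(0,) * N_QUBITS] = 1.0
    for q, xi in enumerate(x):
        state = apply_single(state, ry(np.pi * xi), q)
    for q in range(N_QUBITS - 1):  # entangle neighboring qubits only
        state = apply_cz(state, q, q + 1)
    for q, xi in enumerate(x):  # second data layer so the CZ gates affect the overlap
        state = apply_single(state, ry(np.pi * xi), q)
    return state.reshape(-1)


def fidelity_kernel(x, y):
    """k(x, y) = |<phi(x)|phi(y)>|^2."""
    return float(np.abs(np.vdot(feature_state(x), feature_state(y))) ** 2)


def kernel_matrix(X):
    """Gram matrix of the fidelity kernel over the rows of X."""
    states = np.stack([feature_state(x) for x in X])
    return np.abs(states.conj() @ states.T) ** 2
```

Because a fidelity kernel is a squared overlap of normalized states, k(x, x) = 1 for any input, so a self-similarity of exactly 1.00 is what this family of kernels gives by construction, whatever the circuit.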


Key findings

Q-Patch achieved an AUROC of 0.87 and an equal error rate (EER) of 14.8% on a controlled validation dataset, outperforming both a classical RBF-SVM trained on the same features (AUROC 0.82, EER 18.2%) and a compact Tiny CNN (AUROC 0.85, EER 16.3%). Kernel-space analysis revealed a clear class structure, with within-class similarities higher than cross-class similarities, indicating that the quantum feature map induces discriminative patterns between bona fide and spoofed audio samples.
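For readers unfamiliar with the two metrics reported here, the following is a minimal NumPy sketch of how AUROC and EER can be computed from classifier scores. The function names are hypothetical, the choice of which class counts as positive (label 1) is an assumption, and the paper's own evaluation code may differ, for example in how it interpolates the EER.

```python
import numpy as np


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))


def equal_error_rate(labels, scores):
    """Error rate at the threshold where false accepts and false rejects are closest."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)
    # A sample is predicted positive when its score is >= the threshold.
    far = np.array([np.mean(scores[~labels] >= t) for t in thresholds])
    frr = np.array([np.mean(scores[labels] < t) for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2)
```

A lower EER and a higher AUROC both indicate better separation of the two classes; the EER summarizes a single operating point, while the AUROC averages over all thresholds.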

Approach

Q-Patch converts audio into log-mel spectrograms, partitions them into non-overlapping 4x4 time-frequency patches, and summarizes each patch with a four-dimensional acoustic descriptor (mean activation, spectral centroid, spectral bandwidth, inter-frame coherence). The k = 2 most salient patches are selected, and their descriptors are embedded, four qubits per patch, into an 8-qubit quantum state using a shallow feature map with local and inter-patch entanglement. A fidelity-based quantum kernel built from these states is then used for classification with a Quantum Support Vector Machine (QSVM).
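The classical preprocessing steps above can be sketched in NumPy as follows. The summary does not give formulas for the four descriptors or for patch saliency, so the definitions here are assumptions: centroid and bandwidth are the weighted mean and standard deviation of the frequency index within a patch, coherence is the mean cosine similarity between adjacent frames, and saliency is the patch mean. All function names are hypothetical.

```python
import numpy as np

PATCH = 4  # patch side length, from the 4x4 patches described above


def split_patches(spec):
    """Cut a (n_mels, n_frames) array into non-overlapping PATCH x PATCH tiles."""
    n_f = spec.shape[0] // PATCH * PATCH  # crop any remainder rows
    n_t = spec.shape[1] // PATCH * PATCH  # crop any remainder columns
    s = spec[:n_f, :n_t]
    tiles = s.reshape(n_f // PATCH, PATCH, n_t // PATCH, PATCH).swapaxes(1, 2)
    return tiles.reshape(-1, PATCH, PATCH)  # (n_patches, freq, time)


def describe(patch, eps=1e-9):
    """Four-dimensional descriptor of one patch (assumed definitions)."""
    w = patch - patch.min() + eps  # shift to positive weights (assumption)
    freq_profile = w.sum(axis=1)
    freq_profile = freq_profile / freq_profile.sum()
    bins = np.arange(PATCH)
    mean_act = patch.mean()
    centroid = np.sum(bins * freq_profile)
    bandwidth = np.sqrt(np.sum(freq_profile * (bins - centroid) ** 2))
    a, b = w[:, :-1], w[:, 1:]  # adjacent frame pairs
    cos = np.sum(a * b, axis=0) / (
        np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps
    )
    coherence = cos.mean()
    return np.array([mean_act, centroid, bandwidth, coherence])


def top_k_descriptors(spec, k=2):
    """Concatenate descriptors of the k most salient patches (saliency = patch mean, assumed)."""
    patches = split_patches(spec)
    saliency = patches.mean(axis=(1, 2))
    order = np.argsort(saliency)[::-1][:k]
    return np.concatenate([describe(patches[i]) for i in order])
```

With k = 2 this yields an 8-dimensional vector per utterance, one value per qubit of the 8-qubit state. In practice the values would still need rescaling to a suitable rotation-angle range before encoding, a step this sketch leaves out.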

Datasets

LJ Speech subset

Model(s)

Q-Patch (quantum feature map), Quantum Support Vector Machine (QSVM), RBF-SVM (baseline), Tiny CNN (baseline)

Author countries

USA, Bangladesh, Australia