Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection

Authors: Khalid Zaman, Melike Sah, Anuwat Chaiwongyenc, Cem Direkoglu

Published: 2026-04-09 11:22:40+00:00

AI Summary

This paper introduces Quantum Vision (QV) theory for deep learning-based audio classification, specifically applied to deepfake speech detection. Inspired by particle-wave duality, QV theory transforms speech spectrograms into 'information waves' using a QV block before feeding them into deep learning models. Experiments on the ASVspoof dataset demonstrate that QV-based Convolutional Neural Networks (QV-CNN) and Vision Transformers (QV-ViT) consistently outperform their standard counterparts.

Abstract

We propose Quantum Vision (QV) theory as a new perspective for deep learning-based audio classification, applied to deepfake speech detection. Inspired by particle-wave duality in quantum physics, QV theory is based on the idea that data can be represented not only in its observable, collapsed form, but also as information waves. In conventional deep learning, models are trained directly on these collapsed representations, such as images. In QV theory, inputs are first transformed into information waves using a QV block and then fed into deep learning models for classification. QV-based models have been shown to improve performance in image classification compared to their non-QV counterparts. What if QV theory is applied to speech spectrograms for audio classification tasks? This is the motivation and novelty of the proposed approach. In this work, Short-Time Fourier Transform (STFT) spectrograms, Mel-spectrograms, and Mel-Frequency Cepstral Coefficients (MFCC) of speech signals are converted into information waves using the proposed QV block and used to train QV-based Convolutional Neural Networks (QV-CNN) and QV-based Vision Transformers (QV-ViT). Extensive experiments are conducted on the ASVspoof dataset for deepfake speech classification. The results show that QV-CNN and QV-ViT consistently outperform standard CNN and ViT models, achieving higher classification accuracy and improved robustness in distinguishing genuine from spoofed speech. Moreover, the QV-CNN model using MFCC features achieves the best overall performance on the ASVspoof dataset, with an accuracy of 94.20% and an EER of 9.04%, while the QV-CNN with Mel-spectrograms attains the highest accuracy of 94.57%. These findings demonstrate that QV theory is an effective and promising approach for audio deepfake detection and opens new directions for quantum-inspired learning in audio perception tasks.
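
As a concrete illustration of the three input representations named in the abstract, the Python sketch below computes STFT, Mel-spectrogram, and MFCC features with librosa. The parameter choices (sampling rate, n_fft, hop_length, n_mels, n_mfcc) are illustrative assumptions, not the settings reported in the paper.

import numpy as np
import librosa

def speech_features(path, sr=16000):
    """Compute the three time-frequency representations used as model inputs.
    All parameter values are illustrative assumptions."""
    y, sr = librosa.load(path, sr=sr)
    # STFT magnitude spectrogram on a log (dB) scale
    stft = librosa.amplitude_to_db(
        np.abs(librosa.stft(y, n_fft=512, hop_length=160)))
    # Mel-spectrogram, converted from power to dB
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                       hop_length=160, n_mels=80))
    # Mel-Frequency Cepstral Coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                                n_fft=512, hop_length=160)
    return stft, mel, mfcc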


Key findings
QV-based models consistently outperformed standard CNN and ViT models across STFT, Mel-spectrogram, and MFCC features, showing improved classification accuracy and robustness. The QV-CNN model with MFCC features achieved the best overall performance with 94.20% accuracy and an Equal Error Rate (EER) of 9.04%, while QV-CNN with Mel-spectrograms reached the highest accuracy of 94.57%.
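
Since the headline metric is the Equal Error Rate, the helper below shows the standard way EER is computed for ASVspoof-style evaluations: the operating point on the ROC curve where the false acceptance rate equals the false rejection rate. The function name and the label/score conventions are our own choices, not code from the paper.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the point where false acceptance rate == false rejection rate.
    labels: 1 = genuine, 0 = spoofed; scores: higher = more likely genuine."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr                              # false rejection rate
    idx = np.nanargmin(np.abs(fnr - fpr))      # closest crossing point
    return (fpr[idx] + fnr[idx]) / 2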
Approach
The proposed approach inserts a Quantum Vision (QV) block that converts conventional speech representations (STFT spectrograms, Mel-spectrograms, MFCC) into quantum-inspired 'information waves'. These wave representations, designed to capture richer data characteristics, are then fed into modified deep learning architectures, QV-CNN and QV-ViT, for deepfake speech classification; a hypothetical sketch of such a block follows.
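
The summary does not specify the exact transform inside the QV block, so the PyTorch sketch below is a hypothetical stand-in rather than the authors' published design: it treats each normalized spectrogram value x as a phase and emits the complex wave exp(i*pi*x), with the real and imaginary parts stacked as two input channels. The class name QVBlock and every design detail here are assumptions for illustration only.

import torch
import torch.nn as nn

class QVBlock(nn.Module):
    """Hypothetical QV-style block: maps a single-channel spectrogram to a
    2-channel 'information wave' representation. Illustrative only."""
    def forward(self, spec):                      # spec: (B, 1, F, T)
        # Normalize each example to [0, 1] so the phase lies in [0, pi].
        s_min = spec.amin(dim=(2, 3), keepdim=True)
        s_max = spec.amax(dim=(2, 3), keepdim=True)
        x = (spec - s_min) / (s_max - s_min + 1e-8)
        phase = torch.pi * x
        # Wave exp(i*phase) = cos(phase) + i*sin(phase), kept as two
        # real-valued channels for a standard CNN or ViT backbone.
        return torch.cat([torch.cos(phase), torch.sin(phase)], dim=1)

# Usage: waves = QVBlock()(torch.rand(8, 1, 80, 200))  # -> (8, 2, 80, 200)

Under this assumption, the downstream classifier would simply accept two input channels instead of one (e.g., nn.Conv2d(2, ...) as its first layer).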
Datasets
ASVspoof 2019 dataset
Model(s)
QV-based Convolutional Neural Networks (QV-CNN), QV-based Vision Transformers (QV-ViT)
Author countries
Cyprus, Japan, Thailand, Turkey