Authors: Mahsa Salehi, Kalin Stefanov, Ehsan Shareghi
Published: 2024-02-22 21:44:58+00:00
Comment: 9 pages, 4 figures, 3 tables
AI Summary
This paper investigates human brain activity, as measured by EEG, when individuals listen to real versus deepfake audio. It contrasts these human responses with the representations learned by a state-of-the-art deepfake audio detection algorithm. Preliminary results indicate that while machine learning representations do not clearly distinguish fake from real audio, human EEG patterns display distinct differences, suggesting a promising avenue for future deepfake detection research.
Abstract
In this paper we study the variations in human brain activity when listening to real and fake audio. Our preliminary results suggest that the representations learned by a state-of-the-art deepfake audio detection algorithm do not exhibit clearly distinct patterns between real and fake audio. In contrast, human brain activity, as measured by EEG, displays distinct patterns when individuals are exposed to fake versus real audio. This preliminary evidence enables future research directions in areas such as deepfake audio detection.
Key findings
The study found that representations from a state-of-the-art audio deepfake detection algorithm did not clearly differentiate between real and fake audio. In contrast, human brain activity, as captured by EEG, exhibited distinct and discriminative patterns when individuals were exposed to fake versus real audio. A ConvTran classifier successfully leveraged these EEG patterns to classify deepfake audio with high precision and recall, especially under random train/test splits.
Approach
The researchers collected EEG data from human participants listening to a custom-created dataset of real and deepfake audio. The EEG data was preprocessed to remove noise and artifacts, then segmented into short time series windows. A time series classification model was trained on this EEG data to classify segments as either real or fake audio.
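The windowing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the window length, stride, and sampling rate are assumptions for demonstration (the paper summary does not specify them), and the `segment_eeg` helper is hypothetical. Only the 64-channel montage comes from the source.

```python
import numpy as np

def segment_eeg(eeg: np.ndarray, window_len: int, step: int) -> np.ndarray:
    """Slice a (channels, samples) EEG recording into fixed-length windows.

    Returns an array of shape (n_windows, channels, window_len),
    the usual input layout for a time series classifier such as ConvTran.
    """
    _, n_samples = eeg.shape
    starts = range(0, n_samples - window_len + 1, step)
    return np.stack([eeg[:, s:s + window_len] for s in starts])

# Illustrative example: a 64-channel recording (matching the Easycap setup),
# 10 s at an assumed 250 Hz sampling rate.
recording = np.random.randn(64, 2500)
# 2 s windows with 50% overlap -- arbitrary choices for the sketch.
segments = segment_eeg(recording, window_len=500, step=250)
print(segments.shape)  # -> (9, 64, 500)
```

Each resulting window would carry the label (real or fake) of the audio stimulus playing during that segment, yielding a standard supervised time series classification setup.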
Datasets
Custom-collected real audio data from 20 native English-speaking actors. Deepfake audio generated using VITS and YourTTS. Custom-collected EEG data from 2 English-speaking participants using a 64-channel Easycap EEG system.
Model(s)
ConvTran (a time series classification method combining CNN and Transformer architectures) was used to classify the EEG data. For comparison on the machine learning side, the representations of a state-of-the-art audio deepfake detection system (referenced as wav2vec2-base-960h-finetuned-deepfake) were examined.
Author countries
Australia