Defense Against Synthetic Speech: Real-Time Detection of RVC Voice Conversion Attacks

Authors: Prajwal Chinchmalatpure, Suyash Chinchmalatpure, Siddharth Chavan

Published: 2025-12-31 02:06:42+00:00

AI Summary

This study addresses real-time detection of RVC (Retrieval-based Voice Conversion) attacks, framing the task as a streaming classification problem over one-second audio segments. The method extracts time-frequency and cepstral features from each segment and trains supervised learning models to classify it as authentic or voice-converted. The evaluation emphasizes realistic conditions in which background ambience is reintroduced after conversion to suppress trivial detection artifacts.
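
As a concrete illustration of the streaming setup, a minimal sketch of the one-second segmentation step is given below. The 16 kHz sample rate, mono input, and non-overlapping windows are assumptions for illustration; the summary specifies only that audio is split into one-second segments.

```python
import numpy as np

def segment_audio(signal: np.ndarray, sr: int = 16000, win_s: float = 1.0) -> np.ndarray:
    """Split a mono waveform into consecutive one-second segments.

    The 16 kHz sample rate and non-overlapping windows are illustrative
    assumptions; the study specifies only one-second segments.
    """
    win = int(sr * win_s)
    n_full = len(signal) // win  # drop any trailing partial window
    return signal[: n_full * win].reshape(n_full, win)
```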

Abstract

Generative audio technologies now enable highly realistic voice cloning and real-time voice conversion, increasing the risk of impersonation, fraud, and misinformation in communication channels such as phone and video calls. This study investigates real-time detection of AI-generated speech produced using Retrieval-based Voice Conversion (RVC), evaluated on the DEEP-VOICE dataset, which includes authentic and voice-converted speech samples from multiple well-known speakers. To simulate realistic conditions, deepfake generation is applied to isolated vocal components, followed by the reintroduction of background ambience to suppress trivial artifacts and emphasize conversion-specific cues. We frame detection as a streaming classification task by dividing audio into one-second segments, extracting time-frequency and cepstral features, and training supervised machine learning models to classify each segment as real or voice-converted. The proposed system enables low-latency inference, supporting both segment-level decisions and call-level aggregation. Experimental results show that short-window acoustic features can reliably capture discriminative patterns associated with RVC speech, even in noisy backgrounds. These findings demonstrate the feasibility of practical, real-time deepfake speech detection and underscore the importance of evaluating under realistic audio mixing conditions for robust deployment.
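
The abstract distinguishes segment-level decisions from call-level aggregation. A minimal sketch of one plausible aggregation rule follows; mean pooling of segment probabilities and the 0.5 threshold are assumptions, as the abstract does not specify the fusion rule.

```python
import numpy as np

def call_level_decision(segment_probs: np.ndarray, threshold: float = 0.5) -> bool:
    """Fuse per-segment probabilities of 'voice-converted' into a single
    call-level verdict. Mean pooling and the 0.5 threshold are illustrative
    assumptions; the abstract states only that segment-level decisions are
    aggregated at the call level.
    """
    return float(np.mean(segment_probs)) >= threshold
```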


Key findings
Short-window acoustic features reliably capture discriminative patterns associated with RVC speech, allowing genuine and voice-converted speech to be distinguished. The experimental results confirm the feasibility of building practical, low-latency, real-time detection systems that remain robust under realistic audio mixing conditions (reintroduced background ambience).
Approach
The approach segments streaming audio into one-second windows to support low-latency inference. Acoustic features, including time-frequency and cepstral representations (e.g., MFCCs), are extracted from each window, and a supervised model, specifically a Feed-forward Neural Network, is trained on these features to classify each segment as real or voice-converted. A sketch of this pipeline follows.
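Below is a minimal sketch of the feature extraction and classification steps, using librosa for MFCC extraction and scikit-learn's MLPClassifier as a stand-in feed-forward network. The coefficient count, mean/std pooling, and hidden-layer sizes are illustrative assumptions, not the paper's reported configuration.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def mfcc_features(segment: np.ndarray, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Summarize a one-second segment by the mean and standard deviation of
    its MFCCs. The coefficient count and mean/std pooling are illustrative
    assumptions; the paper states only that time-frequency and cepstral
    features are used.
    """
    mfcc = librosa.feature.mfcc(y=segment.astype(np.float32), sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical usage: rows of X are per-segment feature vectors, y holds
# labels (1 = voice-converted, 0 = authentic). Hidden-layer sizes are
# assumptions.
# clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
# clf.fit(X_train, y_train)
# segment_probs = clf.predict_proba(X_segments)[:, 1]
```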
Datasets
DEEP-VOICE dataset
Model(s)
Feed-forward Neural Network (FNN)
Author countries
United States, Canada, India