Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors

Authors: Vojtěch Staněk, Martin Perešíni, Lukáš Sekanina, Anton Firc, Kamil Malinka

Published: 2026-04-01 19:17:59+00:00

Comment: Accepted to WCCI CEC 2026

AI Summary

This paper proposes an evolutionary multi-objective score fusion framework for deepfake speech detection that jointly minimizes detection error and system complexity. Using NSGA-II with two encodings, binary-coded detector selection and real-valued detector weighting, the method trades detection performance against model size. Experiments on the ASVspoof 5 dataset show that the obtained Pareto fronts outperform simple averaging and logistic regression baselines, and that some configurations match state-of-the-art detection performance with about half the parameters.

Abstract

While deepfake speech detectors built on large self-supervised learning (SSL) models achieve high accuracy, employing standard ensemble fusion to further enhance robustness often results in oversized systems with diminishing returns. To address this, we propose an evolutionary multi-objective score fusion framework that jointly minimizes detection error and system complexity. We explore two encodings optimized by NSGA-II: binary-coded detector selection for score averaging and a real-valued scheme that optimizes detector weights for a weighted sum. Experiments on the ASVspoof 5 dataset with 36 SSL-based detectors show that the obtained Pareto fronts outperform simple averaging and logistic regression baselines. The real-valued variant achieves 2.37% EER (0.0684 minDCF) and identifies configurations that match state-of-the-art performance while significantly reducing system complexity, requiring only half the parameters. Our method also provides a diverse set of trade-off solutions, enabling deployment choices that balance accuracy and computational cost.
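
To make the notion of a Pareto front concrete, the following is a minimal sketch of non-dominated filtering over (error, complexity) pairs, both minimized. The function names and the candidate values are hypothetical and chosen only for illustration; they are not results from the paper, and this is not the authors' implementation.

```python
from typing import List, Tuple

# (detection error, parameter count); both objectives are minimized.
Point = Tuple[float, float]

def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in both objectives and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by parameter count."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(set(front), key=lambda p: p[1])

# Invented toy candidates: (error rate, parameters in billions).
candidates = [(0.02, 8.0), (0.03, 4.0), (0.06, 1.0), (0.05, 5.0), (0.03, 6.0)]
front = pareto_front(candidates)
```

In this toy set, the last two candidates are dropped because (0.03, 4.0) is at least as accurate and smaller than both. The remaining points are the trade-off options a deployer would choose among.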


Key findings
The real-valued variant of the evolutionary multi-objective fusion achieved the lowest EER, 2.37% (0.0684 minDCF), outperforming the simple averaging and logistic regression baselines. The method also identified configurations that match state-of-the-art performance (2.59% EER) with about half the parameters (2.49B vs 5B).
Approach
The authors propose an evolutionary multi-objective score fusion framework that uses the NSGA-II algorithm to optimize two conflicting objectives: minimizing detection error (EER) and minimizing system complexity (number of parameters). Two encoding schemes are explored: a binary-coded scheme that selects which detectors enter a simple score average, and a real-valued scheme that optimizes detector weights for a weighted sum, with a cut-off threshold that prunes low-contributing detectors.
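
The two encodings described above can be sketched as objective functions that map a candidate solution to an (EER, parameter count) pair. This is a rough illustration under stated assumptions, not the authors' code: the function names, the `cutoff` value of 0.05, the non-negative weights, the threshold-sweep EER approximation, and the synthetic scores and parameter counts are all hypothetical. The NSGA-II search loop that would call these functions is omitted.

```python
import numpy as np

def eer(scores, labels):
    """Approximate EER by sweeping thresholds.

    Assumes label 1 = bona fide, label 0 = spoof, higher score = more bona fide,
    and that both classes are present.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    bona, spoof = scores[labels == 1], scores[labels == 0]
    best_gap, best_eer = np.inf, 1.0
    for t in np.unique(scores):
        frr = np.mean(bona < t)    # bona fide trials rejected at threshold t
        far = np.mean(spoof >= t)  # spoof trials accepted at threshold t
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return float(best_eer)

def evaluate_binary(mask, scores, labels, params):
    """Binary encoding: average the scores of the selected detectors."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 1.0, 0.0  # assumption: an empty ensemble gets the worst error
    fused = scores[:, mask].mean(axis=1)
    return eer(fused, labels), float(params[mask].sum())

def evaluate_weighted(weights, scores, labels, params, cutoff=0.05):
    """Real-valued encoding: weighted sum, with weights below `cutoff` pruned."""
    w = np.asarray(weights, dtype=float)
    w = np.where(w < cutoff, 0.0, w)  # pruned detectors add no score and no cost
    active = w > 0
    if not active.any():
        return 1.0, 0.0
    fused = scores @ w
    return eer(fused, labels), float(params[active].sum())

# Synthetic stand-in data: 400 trials, 3 detectors of decreasing quality.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
noise_levels = np.array([0.5, 1.0, 2.0])
scores = labels[:, None] + rng.normal(size=(400, 3)) * noise_levels
params = np.array([0.3, 0.1, 1.0])  # hypothetical parameter counts, in billions

err_b, size_b = evaluate_binary([1, 0, 1], scores, labels, params)
err_w, size_w = evaluate_weighted([0.6, 0.01, 0.4], scores, labels, params)
```

In this sketch, a pruned detector contributes neither to the fused score nor to the complexity objective. That gives the real-valued encoding a way to shrink the ensemble as well as reweight it.
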
Datasets
ASVspoof 5 dataset
Model(s)
Self-Supervised Learning (SSL) models (HuBERT, Wav2Vec2, XLS-R, WavLM) combined with pooling and classifier architectures (AASIST, Multi-Head Factorized Attention (MHFA), Sensitive Layer Selection (SLS)). The fusion is optimized using the NSGA-II evolutionary algorithm.
Author countries
Czech Republic