ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection

Authors: Benjamin Chou, Yi Zhu, Surya Koppisetti

Published: 2026-04-17 23:44:33+00:00

Comment: To appear at ACL Findings 2026

AI Summary

ICLAD introduces a novel In-Context Learning paradigm with comparison-guidance for audio deepfake detection, addressing the generalization gap of existing systems on in-the-wild deepfakes. It uses Audio Language Models (ALMs) for training-free detection and produces textual rationales for each decision. A pairwise comparative reasoning strategy guides the ALM to filter out hallucinations and deepfake-irrelevant acoustic attributes. Paired with a specialized deepfake detector through a routing mechanism, ICLAD improves macro F1 on in-the-wild datasets by up to 2x relative to the specialized detector alone.

Abstract

Audio deepfakes pose a significant security threat, yet current state-of-the-art (SOTA) detection systems do not generalize well to realistic in-the-wild deepfakes. We introduce a novel In-Context Learning paradigm with comparison-guidance for Audio Deepfake detection (ICLAD). The framework enables the use of audio language models (ALMs) for training-free generalization to unseen deepfakes and provides textual rationales on the detection outcome. At the core of ICLAD is a pairwise comparative reasoning strategy that guides the ALM to discover and filter hallucinations and deepfake-irrelevant acoustic attributes. The ALM works alongside a specialized deepfake detector, whereby a routing mechanism feeds out-of-distribution samples to the ALM. On in-the-wild datasets, ICLAD improves macro F1 over the specialized detector, with up to 2x relative improvement. Further analysis demonstrates the flexibility of ICLAD and its potential for deployment on recent open-source ALMs.


Key findings
ICLAD significantly improves generalization to unseen in-the-wild deepfakes, achieving up to a 2x relative improvement in macro F1 score over specialized detectors on datasets like SpoofCeleb. The Pairwise Comparative Reasoning (PCR) strategy effectively reduces hallucination rates in the generated textual rationales compared to simple prompting. Dynamic routing is crucial for leveraging the complementary strengths of ALMs on out-of-distribution samples and specialized detectors on in-distribution data.
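For readers unfamiliar with the metric, the following is a minimal sketch of how macro F1 and a relative improvement ratio can be computed for a two-class (real vs. fake) task. The function names and the toy labels are illustrative assumptions and are not taken from the paper. Reading "2x relative improvement" as the ratio of the two macro F1 scores is also an assumption.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], cls: str) -> float:
    """F1 score for one class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # By convention here, a class with no true or predicted samples scores 0.
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             classes: Sequence[str] = ("real", "fake")) -> float:
    """Unweighted mean of per-class F1, so both classes count equally."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Ratio of scores (assumed reading of an 'Nx relative improvement')."""
    return new_score / baseline_score


# Toy labels for illustration only; these are not results from the paper.
toy_true = ["real", "real", "fake", "fake"]
toy_pred = ["real", "fake", "fake", "fake"]
toy_score = macro_f1(toy_true, toy_pred)
```

Because macro F1 averages the per-class scores without weighting by class frequency, a detector that collapses to predicting a single class on out-of-distribution audio is penalized heavily, which is why the metric suits in-the-wild evaluation.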
Approach
ICLAD uses a two-phase framework. In the offline phase, a Pairwise Comparative Reasoning (PCR) strategy guides an Audio Language Model (ALM) to generate and reconcile real and fake evidence for audio samples, storing them in a RAG database. During online inference, a dynamic routing mechanism directs in-distribution samples to a specialized deepfake detector, while out-of-distribution samples are processed by the ALM, which uses in-context learning with retrieved examples to make robust predictions and provide textual rationales.
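The online phase described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the summary does not state how out-of-distribution samples are identified or how examples are retrieved, so the sketch assumes a cosine-similarity threshold against an in-distribution centroid for routing and top-k cosine similarity for retrieval. The names `Example`, `route`, `retrieve_examples`, and `predict`, along with the `detector` and `alm` callables, are all placeholders.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class Example:
    """One stored entry from the offline phase (assumed layout)."""
    embedding: Vector
    label: str       # "real" or "fake"
    rationale: str   # reconciled textual evidence


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(embedding: Vector, in_dist_centroid: Vector, threshold: float) -> str:
    """Assumed routing rule: samples close to the centroid count as in-distribution."""
    similar = cosine_similarity(embedding, in_dist_centroid) >= threshold
    return "detector" if similar else "alm"


def retrieve_examples(query: Vector, database: List[Example], k: int = 2) -> List[Example]:
    """Return the k stored examples most similar to the query."""
    ranked = sorted(database,
                    key=lambda e: cosine_similarity(query, e.embedding),
                    reverse=True)
    return ranked[:k]


def predict(embedding: Vector,
            in_dist_centroid: Vector,
            threshold: float,
            database: List[Example],
            detector: Callable[[Vector], str],
            alm: Callable[[Vector, List[Example]], Tuple[str, str]],
            k: int = 2) -> Tuple[str, Optional[str]]:
    """In-distribution samples go to the detector; the rest go to the ALM."""
    if route(embedding, in_dist_centroid, threshold) == "detector":
        return detector(embedding), None  # the detector gives no rationale
    examples = retrieve_examples(embedding, database, k)
    return alm(embedding, examples)       # (label, textual rationale)
```

In this sketch the `alm` callable stands in for prompting an audio language model with the retrieved examples as in-context demonstrations; a real system would pass the audio itself rather than only an embedding.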
Datasets
ASVspoof 2021 (21DF), MLAAD-v3, ITW, SpoofCeleb, DFEval 2024, ASVspoof 2019 (19DF)
Model(s)
Gemini-2.5 Flash, Wav2Vec2-AASIST, Audio Flamingo 3 (AF3), Wav2Vec2-XLSR, Qwen3-0.5B text embeddings
Author countries
USA