SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs

Authors: Shail Desai, Aditya Pawar, Li Lin, Xin Wang, Shu Hu

Published: 2025-11-16 00:50:24+00:00

AI Summary

SynthGuard is introduced as an open, user-friendly platform for detecting and analyzing AI-generated multimedia, encompassing both images and audio. The platform combines a modular backend of traditional deepfake detectors with Multimodal Large Language Models (MLLMs) to provide transparent, explainable forensic analysis. It aims to address the limitations of existing tools, which are often closed-source or limited in modality, by making forensic analysis accessible to researchers, educators, and the public.

Abstract

Artificial Intelligence (AI) has made it possible for anyone to create images, audio, and video with unprecedented ease, enriching education, communication, and creative expression. At the same time, the rapid rise of AI-generated media has introduced serious risks, including misinformation, identity misuse, and the erosion of public trust as synthetic content becomes increasingly indistinguishable from real media. Although deepfake detection has advanced, many existing tools remain closed-source, limited in modality, or lacking transparency and educational value, making it difficult for users to understand how detection decisions are made. To address these gaps, we introduce SynthGuard, an open, user-friendly platform for detecting and analyzing AI-generated multimedia using both traditional detectors and multimodal large language models (MLLMs). SynthGuard provides explainable inference, unified image and audio support, and an interactive interface designed to make forensic analysis accessible to researchers, educators, and the public. The SynthGuard platform is available at: https://in-engr-nova.it.purdue.edu/


Key findings
The key finding is the successful development and deployment of SynthGuard, which serves as the first open, MLLM-based explainable platform for AI-generated multimedia detection, unifying image and audio analysis. The platform provides explainable inference by leveraging MLLMs to generate natural language reasoning for detection results, a feature absent in surveyed closed-source platforms. It also integrates a diverse suite of fairness-enhanced and frequency-based traditional detectors for robust cross-domain performance.
Approach
The platform uses a modular Python FastAPI backend that integrates two types of detectors: MLLM-Agnostic models (traditional CNNs/Transformers such as Xception and F3Net) for classification, and MLLM-Aware models (such as Qwen-VL-Chat and LLaVA-NeXT) for contextual analysis and explainable reasoning. Audio deepfake detection is handled by a CNN-based architecture, while MLLM-based audio interpretation combines Whisper transcription with Qwen2-VL-2B language analysis. The system is served through a React frontend behind an NGINX reverse proxy.
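As a rough illustration of this dual-path design, the Python sketch below shows how a FastAPI endpoint might route an uploaded image through an MLLM-Agnostic classifier and then an MLLM-Aware explainer. The endpoint path, helper function names, and response fields are assumptions made for illustration only; they are not the actual SynthGuard API.

# Minimal sketch of a dual-path detection endpoint, assuming a FastAPI backend.
# Helper names, endpoint path, and response schema are illustrative assumptions.
from fastapi import FastAPI, File, UploadFile

app = FastAPI()


def run_traditional_detector(image_bytes: bytes) -> float:
    """MLLM-Agnostic path: stand-in for a CNN/Transformer classifier
    (e.g., Xception, F3Net, ViT-B/16) that returns a fake probability."""
    return 0.5  # placeholder score; a real deployment would run model inference here


def run_mllm_explainer(image_bytes: bytes, score: float) -> str:
    """MLLM-Aware path: stand-in for a multimodal LLM (e.g., Qwen-VL-Chat,
    LLaVA-NeXT) prompted to explain the classifier's decision in plain language."""
    return f"Placeholder explanation for a fake probability of {score:.2f}."


@app.post("/detect/image")
async def detect_image(file: UploadFile = File(...), explain: bool = True):
    data = await file.read()
    score = run_traditional_detector(data)                  # classification score
    explanation = run_mllm_explainer(data, score) if explain else None  # MLLM reasoning
    return {
        "filename": file.filename,
        "fake_probability": score,
        "explanation": explanation,
    }

In a layout like this, the traditional detector supplies the numeric verdict while the MLLM step adds the natural-language reasoning the paper emphasizes; an analogous audio endpoint could pair the CNN-based audio detector with Whisper transcription and Qwen2-VL-2B analysis.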
Datasets
AI-Face benchmark
Model(s)
Xception, EfficientNet-B4, ViT-B/16, F3Net, Qwen-VL-Chat, LLaVA-NeXT-13B, InternVL-Chat-V1.5, Whisper, Qwen2-VL-2B
Author countries
USA