A Preliminary Exploration with GPT-4o Voice Mode

Authors: Yu-Xiang Lin, Chih-Kai Yang, Wei-Chih Chen, Chen-An Li, Chien-yu Huang, Xuanjun Chen, Hung-yi Lee

Published: 2025-02-14 06:34:08+00:00

Comment: Work in progress

AI Summary

This report provides a preliminary evaluation of GPT-4o's audio processing and reasoning capabilities across diverse tasks in audio, speech, and music understanding. It highlights GPT-4o's strengths in areas like multilingual speech recognition and robustness against hallucinations, but also identifies weaknesses in tasks such as audio duration prediction and its tendency to refuse certain safety-sensitive tasks.

Abstract

With the rise of multimodal large language models, GPT-4o stands out as a pioneering model, driving us to evaluate its capabilities. This report assesses GPT-4o across various tasks to analyze its audio processing and reasoning abilities. We find that GPT-4o exhibits strong knowledge in audio, speech, and music understanding, performing well in tasks like intent classification, spoken command classification, semantic and grammatical reasoning, multilingual speech recognition, and singing analysis. It also shows greater robustness against hallucinations than other large audio-language models (LALMs). However, it struggles with tasks such as audio duration prediction and instrument classification. Additionally, GPT-4o's safety mechanisms cause it to decline tasks like speaker identification, age classification, MOS prediction, and audio deepfake detection. Notably, the model exhibits a significantly different refusal rate when responding to speaker verification tasks on different datasets. This is likely due to variations in the accompanying instructions or the quality of the input audio, suggesting the sensitivity of its built-in safeguards. Finally, we acknowledge that model performance varies with evaluation protocols. This report serves only as a preliminary exploration of the current state of LALMs.


Key findings
GPT-4o exhibits strong capabilities in audio, speech, and music understanding, performing well in tasks like multilingual speech recognition and demonstrating greater robustness against hallucinations. However, it struggles with tasks such as audio duration prediction and instrument classification. Additionally, its safety mechanisms frequently cause it to decline sensitive tasks like speaker identification, age classification, MOS prediction, and audio deepfake detection, with refusal rates varying significantly based on instructions or input quality.
Approach
The authors assess GPT-4o's audio understanding and reasoning capabilities by evaluating it against a wide range of tasks from large-scale benchmarks, including Dynamic-SUPERB Phase2, MMAU, and CMM. They analyze its performance across various criteria and also investigate its refusal rates due to built-in safety mechanisms.
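The report does not publish its refusal-detection code, so the following is a minimal, hypothetical sketch of how a refusal rate might be estimated from collected model responses; the keyword patterns and function names are illustrative assumptions, not the authors' method.

```python
import re

# Assumed refusal phrasings for keyword matching (illustrative only;
# the actual study's detection criteria are not specified here).
REFUSAL_PATTERNS = [
    r"\bI('m| am) (sorry|unable)\b",
    r"\b(cannot|can't) (assist|help|identify|determine)\b",
    r"\bnot able to\b",
]

def is_refusal(response: str) -> bool:
    """Return True if the response matches a common refusal phrasing."""
    return any(re.search(p, response, re.IGNORECASE) for p in REFUSAL_PATTERNS)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Comparing such a rate per dataset would surface the kind of dataset-dependent refusal behavior the report observes for speaker verification.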
Datasets
Dynamic-SUPERB Phase2, MMAU, CMM
Model(s)
GPT-4o, compared against baseline LALMs including WavLLM, LTU-AS, GAMA-IT, MU-LLaMA, SALMONN-7B, SALMONN-13B, Qwen-Audio-Chat, Qwen2-Audio-7B-Instruct, and Whisper-LLaMA. On the audio deepfake detection tasks, GPT-4o mostly refused to respond.
Author countries
Taiwan