PhD Public Defense
The Generative Perspective for Diverse Audio Applications
Ge Zhu
Supervised by Zhiyao Duan
Wednesday, June 25, 2025
10 a.m.–11 a.m.
Zoom: https://rochester.zoom.us/j/94019823254
Traditional audio processing has relied on domain-specific techniques tailored to individual tasks—speech synthesis, music generation, and environmental sound processing—each requiring specialized algorithms and optimization criteria. This fragmented approach leads to redundant development efforts and limits knowledge transfer across audio domains. This dissertation presents a generative perspective that leverages deep generative models to address diverse audio applications through consistent architectural backbones and shared training paradigms. The research addresses two fundamental challenges: (1) developing generative paradigms that can handle diverse audio processing tasks while maintaining architectural consistency, and (2) improving audio-text alignment to enable semantic control over generative audio models through natural language interfaces.
For the first challenge, this work demonstrates how three major generative frameworks can serve as consistent foundations across audio domains. Generative adversarial networks are applied through MusicHiFi, a unified GAN architecture that performs vocoding, bandwidth extension, and mono-to-stereo conversion within a single paradigm, achieving superior quality and efficiency compared to task-specific approaches. Score-based generative models are explored through a comprehensive framework with modular conditioning mechanisms, enabling applications ranging from text-to-speech synthesis to audio enhancement with consistent underlying architectures. Normalizing flows are leveraged as deep statistical priors for inference-time optimization, achieving competitive performance in music source separation without task-specific supervised training.
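To make the inference-time-optimization idea concrete, the following is a minimal sketch, not the dissertation's implementation: pretrained generative priors (here a hypothetical `FlowPrior` placeholder standing in for trained normalizing flows) score candidate sources, and source estimates are optimized by gradient descent so that they fit the priors while summing back to the observed mixture. All names and hyperparameters are illustrative.

```python
# Sketch: source separation as inference-time optimization under generative priors.
import torch

class FlowPrior(torch.nn.Module):
    """Stand-in prior exposing a log_prob() interface like a normalizing flow."""
    def __init__(self, dim):
        super().__init__()
        self.base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))

    def log_prob(self, x):
        # A trained flow would map x to its base space and add the log-det term;
        # this placeholder just scores x under a standard normal.
        return self.base.log_prob(x).sum(-1)

def separate(mixture, priors, n_steps=500, lr=1e-2, lam=100.0):
    """MAP-style separation: maximize per-source prior likelihood while keeping
    the estimates consistent with the observed mixture (soft constraint)."""
    estimates = [torch.zeros_like(mixture, requires_grad=True) for _ in priors]
    opt = torch.optim.Adam(estimates, lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        recon = torch.stack(estimates).sum(0)
        loss = lam * torch.mean((recon - mixture) ** 2)  # mixture consistency
        loss = loss - sum(p.log_prob(e).mean() for p, e in zip(priors, estimates))
        loss.backward()
        opt.step()
    return [e.detach() for e in estimates]

# Toy usage on random "audio" frames standing in for real signals.
dim = 256
priors = [FlowPrior(dim), FlowPrior(dim)]
mixture = torch.randn(8, dim)
sources = separate(mixture, priors)
```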
For the second challenge, this dissertation introduces Cacophony, a large-scale text-audio contrastive model trained on 13,000 hours of curated audio-text data. The training methodology combines masked autoencoder pre-training on unlabeled audio with contrastive learning and auxiliary captioning objectives. The resulting model achieves state-of-the-art performance on audio-text retrieval tasks while enabling precise natural language control over generative audio models.
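As a point of reference for the contrastive stage, below is a minimal sketch of the standard symmetric contrastive (CLIP-style) objective used to align audio and text embeddings. Cacophony's actual encoders, data pipeline, and auxiliary captioning head are not reproduced; the random tensors stand in for encoder outputs.

```python
# Sketch: symmetric InfoNCE loss over a batch of paired audio/text embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Pull matched audio/text pairs together, push mismatched pairs apart."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))           # matched pairs lie on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)      # audio -> text retrieval direction
    loss_t2a = F.cross_entropy(logits.t(), targets)  # text -> audio retrieval direction
    return 0.5 * (loss_a2t + loss_t2a)

# Toy usage with random embeddings standing in for encoder outputs.
batch, dim = 16, 512
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
```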