Abstract:
Voice conversion (VC) research traditionally depends on scripted or acted speech, which lacks the natural spontaneity of real-life conversations.
Natural speech data for VC remains limited, and our study aims to fill this gap. We introduce a novel data-sourcing pipeline that enables the
release of a natural speech dataset for VC, named NaturalVoices. The pipeline extracts rich information from speech, such as emotion and signal-to-noise
ratio (SNR), from raw podcast data, using recent deep learning methods while providing flexibility and ease of use. NaturalVoices is a large-scale,
spontaneous, expressive, and emotional speech dataset comprising over 4,000 hours of speech sourced from the original podcasts in the MSP-Podcast dataset.
Objective and subjective evaluations demonstrate the effectiveness of our pipeline in providing natural and expressive data for VC, suggesting the
potential of NaturalVoices for broader speech generation tasks.
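As a point of reference for the SNR annotation mentioned in the abstract, the following is a minimal sketch of the textbook SNR definition in decibels, 10 * log10(P_speech / P_noise). It is not the pipeline's method: the abstract states that the pipeline relies on deep learning methods, whereas this sketch assumes that separate speech and noise segments are already available. The function names, the `snr_db` record field, and the file names are hypothetical and chosen only for illustration.

```python
import math


def mean_power(samples):
    """Average power of a sequence of audio samples (mean of squares)."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(x * x for x in samples) / len(samples)


def snr_db(speech, noise):
    """Textbook SNR in dB, assuming speech and noise segments are given separately."""
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        return math.inf   # no measurable noise
    if p_speech == 0:
        return -math.inf  # no measurable speech
    return 10.0 * math.log10(p_speech / p_noise)


def filter_by_snr(records, low_db, high_db):
    """Keep records whose (hypothetical) 'snr_db' field lies in [low_db, high_db]."""
    return [r for r in records if low_db <= r["snr_db"] <= high_db]


if __name__ == "__main__":
    speech = [1.0, -1.0] * 4   # toy segment with power 1.0
    noise = [0.1, -0.1] * 4    # toy segment with power 0.01
    print(round(snr_db(speech, noise), 2))

    # Hypothetical metadata records, not taken from the dataset.
    records = [
        {"id": "a.wav", "snr_db": 15.0},
        {"id": "b.wav", "snr_db": 85.0},
        {"id": "c.wav", "snr_db": 120.0},
    ]
    print([r["id"] for r in filter_by_snr(records, 80.0, 100.0)])
```

A range filter like `filter_by_snr` is one simple way to select a subset of utterances once each has an SNR estimate attached, whatever method produced that estimate.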
The samples below were generated by training both the VC model [1] and the vocoder [2] on our NaturalVoices dataset under the 80-100 dB condition, in two conversion scenarios:
conversion between seen speakers and conversion between unseen speakers.
For each sample, we provide the source speaker's utterance (Source),
the target speaker's utterance (Target),
and the converted utterance (Converted).
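Seen to seen speakers
Type | Source | Target | Converted
Female to Female | (audio) | (audio) | (audio)
Female to Male | (audio) | (audio) | (audio)
Male to Female | (audio) | (audio) | (audio)
Male to Male | (audio) | (audio) | (audio)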
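Unseen to unseen speakers
Type | Source | Target | Converted
Female to Female | (audio) | (audio) | (audio)
Female to Male | (audio) | (audio) | (audio)
Male to Female | (audio) | (audio) | (audio)
Male to Male | (audio) | (audio) | (audio)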
[1] Park, Hyun Joon, et al. "TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion." ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
[2] Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram." ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.