Speak Like a Dog: Voice Conversion from Humans to Non-Humans
Voice conversion (VC) transforms a source speaker's speech waveform into one with the characteristics of a target speaker while preserving linguistic information.
A recent paper on arXiv.org studies voice conversion from humans to non-human creatures. It converts a human voice into a non-human creature-like voice while preserving linguistic information. This task could be applied in film production or video games.
The researchers propose the "speak like a dog" task as an example of such conversion and build a dataset and evaluation criteria for it. They ran an experiment comparing existing representative non-parallel VC methods in terms of acoustic features, network architectures, and training criteria. Standard VC methods can convert human voices into dog-like voices, but preserving linguistic information remains a challenge.
This paper proposes a new voice conversion (VC) task from human speech to dog-like speech while preserving linguistic information, as an example of human to non-human creature voice conversion (H2NH-VC) tasks. Although most VC studies deal with human-to-human VC, H2NH-VC aims to convert human speech into non-human creature-like speech. Non-parallel VC allows us to develop H2NH-VC, because we cannot collect a parallel dataset in which non-human creatures speak human language. In this study, we propose to use dogs as an example of a non-human creature target domain and define the "speak like a dog" task. To clarify the possibilities and characteristics of the "speak like a dog" task, we conducted a comparative experiment using existing representative non-parallel VC methods, varying the acoustic features (mel-cepstral coefficients and mel-spectrograms), network architectures (five different kernel-size settings), and training criteria (variational autoencoder (VAE)-based and generative adversarial network-based). Finally, the converted voices were evaluated using mean opinion scores for dog-likeness, sound quality, and intelligibility, as well as character error rate (CER). The experiment showed that using mel-spectrograms improved the dog-likeness of the converted speech, while preserving linguistic information remained challenging. The challenges and limitations of current VC methods for H2NH-VC are highlighted.
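The abstract reports character error rate (CER) as its objective measure of how much linguistic content survives conversion, but this summary does not say how it was computed. A common definition is the character-level edit distance between a reference transcript and a transcript of the converted speech, divided by the reference length. The sketch below assumes that definition; the function names `edit_distance` and `character_error_rate` are illustrative and do not come from the paper.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]  # turning i reference characters into "" costs i deletions
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcripts: the intended text and a recognizer's
    # output for the converted, dog-like speech.
    print(character_error_rate("speak like a dog", "speak lik a dok"))
```

A CER of 0 means the transcript of the converted speech matches the reference exactly; values can exceed 1 when the hypothesis contains many inserted characters.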
Research article: Suzuki, K., Sakamoto, S., Taniguchi, T., and Kameoka, H., "Speak Like a Dog: Human to Non-human creature Voice Conversion", 2022. Link: https://arxiv.org/abs/2206.04780