Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
https://arxiv.org/abs/1712.05884 [arxiv.org]
2017-12-22 04:21
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech.
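As a rough illustration of what "mel-scale" means for the intermediate spectrograms, the sketch below converts between Hz and mel and lists band centers spaced evenly on the mel axis. It assumes the widely used HTK-style formula mel = 2595 * log10(1 + f / 700); the abstract does not say which mel variant the authors use. The function names and the band count and frequency range in the usage line are hypothetical values chosen for illustration, not taken from the paper.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mel with the HTK-style formula (an assumption, see lead-in)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> list[float]:
    """Return n_bands center frequencies (Hz) equally spaced on the mel axis.

    The centers lie strictly between f_min and f_max.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


# Illustrative values only: 8 bands between 100 Hz and 8000 Hz.
centers = mel_band_centers(8, 100.0, 8000.0)
```

Because the mapping is logarithmic, the centers crowd together at low frequencies and spread out at high ones, which is why a mel-scale spectrogram is a more compact representation than a linear-frequency one.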
Blog: https://research.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html
source: green