
Speak&Sing

Unified Speech and Singing Voice Synthesizer Controlling Timbre, Style, and Emotion

We propose a novel unified voice synthesizer, Speak&Sing, that synthesizes speech and singing voices while reflecting the timbre, style, and emotion of the speaker or singer. Speak&Sing is trainable with separate speech and singing voice datasets, transferring the timbre of the speech dataset to singing voice synthesis, and vice versa. It resolves the discrepancy between the inputs of TTS and SVS by estimating text pitch and text duration from the text transcript and using them as proxies for note pitch and note duration. Speak&Sing accurately represents the pronunciation of both speech and singing voices through phoneme-pitch-duration (PPD) embeddings. It expresses timbre by adding a timbre embedding to the PPD embeddings, and style and emotion by adding variances to pitch, energy, and duration. To address the differences between the two domains in the distribution of acoustic features and in how style and emotion are expressed, it first estimates the main trends of pitch and duration from the note pitch/duration for SVS and from the text pitch/duration for TTS, and then expresses style and emotion by adding pitch and duration variances. In experiments, Speak&Sing seamlessly synthesized speech and singing voices, successfully transferring timbre across domains. It also exhibited improved audio quality and expressiveness compared to the baseline models. To the best of our knowledge, Speak&Sing is the first expressive unified speech and singing voice synthesizer for non-tonal languages that does not rely on a large language model.
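The additive design described above (PPD embeddings plus a timbre embedding, with style and emotion expressed as residual variances over the main pitch/duration trends) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation; all table sizes, dimensions, and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # embedding dimension (assumed for the sketch)

# Hypothetical lookup tables, randomly initialized for illustration
phoneme_table = rng.standard_normal((64, d_model))   # phoneme embeddings
pitch_table = rng.standard_normal((128, d_model))    # pitch-bin embeddings
timbre_table = rng.standard_normal((4, d_model))     # per-speaker/singer timbre
dur_weight = rng.standard_normal((1, d_model))       # duration projection

def ppd_embedding(phonemes, pitch_ids, durations):
    """PPD embedding: sum of phoneme, pitch, and duration representations.

    pitch_ids/durations come from text pitch/duration (TTS) or
    note pitch/duration (SVS), so both domains share one pathway.
    """
    return (phoneme_table[phonemes]
            + pitch_table[pitch_ids]
            + durations[:, None] @ dur_weight)

def add_timbre(h, speaker_id):
    """Timbre is expressed by adding a timbre embedding to the PPD embeddings."""
    return h + timbre_table[speaker_id]

# A toy 3-phoneme utterance
phonemes = np.array([3, 17, 42])
pitch_ids = np.array([60, 62, 64])
durations = np.array([0.10, 0.20, 0.15])

h = add_timbre(ppd_embedding(phonemes, pitch_ids, durations), speaker_id=1)
print(h.shape)  # (3, 8)
```

Style and emotion would then be layered on as predicted pitch/energy/duration variances added to these main trends, rather than baked into the embeddings themselves.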

Overall Model Structure

Audio Demo


  • Demonstration of overall performance

    Please listen to the samples, focusing on expressiveness and fidelity. G.T. denotes ground-truth recordings; PITS and VISinger2 are the TTS and SVS baselines, respectively.

    Samples of speech

    G.T. PITS Speak&Sing


    Samples of singing voice

    G.T. VISinger2 Speak&Sing