Unified Speech and Singing Voice Synthesizer Controlling Timbre, Style, and Emotion
We propose Speak&Sing, a novel unified voice synthesizer that synthesizes both speech and singing voices while reflecting the timbre, style, and emotion of the speaker or singer. Speak&Sing is trainable with separate speech and singing voice datasets, transferring the timbre of the speech dataset to singing voice synthesis, and vice versa. It resolves the discrepancy between the inputs of TTS and SVS by estimating text pitch and text duration from the text transcript and using them as proxies for note pitch and note duration. Speak&Sing accurately represents the pronunciation of both speech and singing voices through phoneme-pitch-duration (PPD) embeddings. It expresses timbre by adding a timbre embedding to the PPD embeddings, and emotion and style by adding variances to pitch, energy, and duration. To address the difference in the distribution of acoustic features and in the way style and emotion are expressed, it first estimates the main trends of pitch and duration from the note pitch/duration for SVS and from the text pitch/duration for TTS, and then expresses style and emotion by adding variances in pitch and duration. In experiments, Speak&Sing seamlessly synthesized speech and singing voices, successfully transferring timbre across domains. It also exhibited improved audio quality and expressiveness compared to the baseline models. To the best of our knowledge, Speak&Sing is the first expressive unified speech and singing voice synthesizer for non-tonal languages that does not rely on a large language model.
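The composition described above (PPD embeddings plus a timbre embedding, with style and emotion added as variances on top of the main pitch/duration trends) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; all names, dimensions, and the choice of summation for combining the streams are assumptions.

```python
# Illustrative sketch of the PPD-embedding composition from the abstract.
# All names, dimensions, and operations are assumed, not taken from the paper's code.
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (assumed)
T = 5   # number of phoneme frames (assumed)

# Per-frame embeddings for phoneme identity, pitch, and duration.
# Pitch/duration come from note pitch/duration (SVS) or text pitch/duration (TTS).
phoneme_emb = rng.normal(size=(T, D))
pitch_emb = rng.normal(size=(T, D))
duration_emb = rng.normal(size=(T, D))

# Phoneme-pitch-duration (PPD) embedding: the three streams combined
# (summation assumed here) into one pronunciation representation.
ppd = phoneme_emb + pitch_emb + duration_emb

# Timbre is expressed by adding a timbre embedding to the PPD embeddings
# (broadcast over all frames).
timbre_emb = rng.normal(size=(D,))
h = ppd + timbre_emb

# Style and emotion are expressed as variances added on top of the main
# pitch/duration trends (residual terms, illustrative scale).
pitch_variance = 0.1 * rng.normal(size=(T, D))
duration_variance = 0.1 * rng.normal(size=(T, D))
h = h + pitch_variance + duration_variance

print(h.shape)  # (5, 16)
```

Only the sources of the pitch/duration streams differ between the two tasks, so the same composition serves both TTS and SVS.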
Overall Model Structure
Please listen to the samples, focusing on expressiveness and fidelity.
Samples of speech
| G.T. | PITS | Speak&Sing |
|---|---|---|
Samples of singing voice
| G.T. | VISinger2 | Speak&Sing |
|---|---|---|
The style-based attribute adaptor controls speech style and timbre separately.
Please listen to the samples, focusing on speech style (e.g., intonation and speaking rate) and timbre, both of which closely match the corresponding reference samples.
These audio samples correspond to Figure 5 in the paper; see the paper for further details.
| Reference sample for timbre | Reference sample for style | Synthesized by proposed model |
|---|---|---|
The style-based attribute adaptor can also transfer the domain (speech/singing voice).
Please listen to the samples, focusing on the timbre, which closely matches the reference samples.
| G.T. sample of the speaker | Synthesized voice with the same timbre as G.T. | Reference sample for timbre | Synthesized voice with timbre from a different domain |
|---|---|---|---|