Natural language guidance of high-fidelity text-to-speech with synthetic
annotations
Dan Lyth¹, Simon King²
¹Stability AI
²University of Edinburgh, UK
Abstract
Text-to-speech models trained on large-scale datasets have
demonstrated impressive in-context learning capabilities and
naturalness. However, control of speaker identity and style
in these models typically requires conditioning on reference
speech recordings, limiting creative applications. Alterna-
tively, natural language prompting of speaker identity and style
has demonstrated promising results and provides an intuitive
method of control. However, reliance on human-labeled de-
scriptions prevents scaling to large datasets.
Our work bridges the gap between these two approaches.
We propose a scalable method for labeling various aspects of
speaker identity, style, and recording conditions. We then ap-
ply this method to a 45k hour dataset, which we use to train a
speech language model. Furthermore, we propose simple meth-
ods for increasing audio fidelity, significantly outperforming re-
cent work despite relying entirely on found data.
Our results demonstrate high-fidelity speech generation in
a diverse range of accents, prosodic styles, channel conditions,
and acoustic conditions, all accomplished with a single model
and intuitive natural language conditioning. Audio samples can
be heard at https://text-description-to-speech.com/.
1. Introduction
Scaling both model and training data size has driven rapid
progress in generative modeling, especially for text and image
synthesis [1, 2, 3, 4]. Natural language conditioning provides an
intuitive method for control and creativity in these modalities,
enabled by web-scale human-authored text and image annota-
tions [5, 6]. However, only recently has speech synthesis started
to exploit scale and natural language conditioning.
The initial results from large-scale text-to-speech (TTS)
models have demonstrated impressive in-context learning capa-
bilities, such as zero-shot speaker and style adaptation, cross-
lingual synthesis, and content editing [7, 8, 9]. However, a
reliance on reference speech limits their practical application.
It also forces the user to reproduce the likeness of an existing
speaker, which is beneficial in some use cases but has the po-
tential for harm, especially when so little enrollment data is re-
quired.
To alleviate these shortcomings, the use of natural language
to describe speaker and style is starting to be explored (the most
recent example is concurrent work to ours, Audiobox [10]).
Unlike the image modality, no large dataset containing natural
language descriptions of speech exists, so this metadata must
be created from scratch. To date, this has been achieved using
a combination of human annotations and statistical measures,
with the results often passed through a large language model to
mimic the variability that might be expected in genuine human
annotations [10, 11, 12, 13, 14]. However, any approach requir-
ing human annotations is challenging to scale to large datasets.
For example, Multilingual LibriSpeech [15] contains over 10
million utterances across 50k hours of audio, equivalent to over
five years. Because of the human annotation bottleneck, TTS
models using natural language descriptions have been of limited
scale and, therefore, unable to demonstrate some of the broad
range of capabilities associated with larger models.
In this work, we rely entirely on automatic labeling, en-
abling us to scale to large data for the first time (along with
concurrent work [10]). Coupling this with large-scale speech
language models allows us to synthesize speech in a wide range
of speaking styles and recording conditions using intuitive nat-
ural language control.
Specifically, we:
1. Propose a method for efficiently labeling a 45k hour dataset
with multiple attributes, including gender, accent, speaking
rate, pitch, and recording conditions.
2. Train a speech language model on this dataset and demon-
strate the ability to control these attributes independently, cre-
ating speaker identities and style combinations unseen in the
training data.
3. Demonstrate that with as little as 1% high-fidelity audio in
the training data and the use of the latest state-of-the-art au-
dio codec models, it is possible to generate extremely high-
fidelity audio.
2. Related Work
2.1. Control of speaker identity and style
Controlling non-lexical speech information, such as speaking
style and speaker identity, has been explored through various
approaches. With neural models, the first attempts at this re-
lied on reference embeddings or “global style tokens” derived
from exemplar recordings [16, 17]. This approach is effective
but constrains users to existing recordings, significantly limit-
ing versatility and scalability. To alleviate this, more flexible
approaches sample from the continuous latent spaces of Gaus-
sian mixture models and variational autoencoders [18]. How-
ever, this approach requires careful training to ensure that the
latent variables are disentangled, as well as complex analysis to
identify the relationship between these variables and attributes
of speech.
In an attempt to bypass the brittleness of reference embed-
dings and the complexity of disentangled latent variable ap-
proaches, recent work has attempted to use natural language
descriptions to guide non-lexical speech variation directly. This
line of work has no doubt been inspired by the success in other
modalities (particularly text-to-image models), but a key challenge for speech is the lack of natural language descriptive metadata.

Figure 1: Overview of the model architecture. Description text (e.g., “A woman with a Pakistani accent enchants the listener with her book reading. The speaker’s voice is very close-sounding, and the recording is excellent…”) is encoded by a pre-trained, fixed T5 model and conditions the decoder via cross-attention; transcript text tokens (e.g., “the woods where Timothy wandered alone were wild and lonely and in them were fierce bob cats who snarled and fought at night”) are pre-pended to the sequence; a decoder-only Transformer (language model) predicts residual vector quantized tokens over sequence steps and codebooks, which are converted to audio by the RVQ decoder.
An initial attempt at circumventing this issue was proposed
in [11]. In this work, the authors use statistically derived met-
rics (such as speaking rate and pitch) from a dataset of real
speech, combined with emotion labels from a dataset of syn-
thetic speech provided by a commercial TTS model. Together,
this dataset offers five axes of non-lexical variation, which are
turned into keywords, each with three levels of granularity
(high, medium, and low).
The authors of [12] move away from computational label-
ing methods and explore human annotation. They label 44
hours of data with natural language sentences describing style
and emotion but rely on a fixed set of speaker IDs to control
speaker likeness. Human annotation is used again in [13] (with
an even smaller 6-hour dataset) in service of style transfer, and
again, speaker likeness is controlled by speaker IDs. However,
in [14], the authors do tackle the labeling and generation of dif-
ferent speaker identities. Human annotations are combined with
computational statistics similar to [11] and are then fed into a
language model to create natural language variations. Unlike
our work, the authors make no attempt to model accent or chan-
nel conditions and only label 16% of the speakers in a dataset
already two orders of magnitude smaller than the one we use.
This difficulty in scaling human annotations provides some of
the motivation for this work.
While we were running the evaluations for this work, the
authors of [10] released Audiobox. This concurrent work uses
a similar approach to ours to label a large dataset. However, we
propose simple methods to significantly outperform this work
in terms of the overall naturalness and audio fidelity of the gen-
erated speech. We also outperform this work in how closely
our model matches the text description (for those attributes of
speech shared across both lines of work). PromptTTS2 [19]
is also concurrent work that scales to a large dataset, but their
approach only attempts to control four variables, significantly
limiting the range of capabilities.
3. Method
3.1. Metadata collection
The data for this study comprises two English speech corpora
derived from the LibriVox audiobook project (librivox.org): the English
portion of Multilingual LibriSpeech (MLS) [15] (45k hours)
and LibriTTS-R [20] (585 hours). While LibriTTS-R is sig-
nificantly smaller in scale, we include it given the higher au-
dio quality resulting from enhancement via the Miipher system
[21]. Both datasets provide transcriptions and a label for gender
generated using a predictive model.
3.1.1. Accent
Speaker accent is an aspect of speech that natural language
prompting in TTS has so far overlooked, despite the wide range
of accents found in the datasets typically used.
We appreciate that labeling accents with discrete labels is an
ill-formed task considering the discrete-continuous nature of ac-
cents. However, the alternative, i.e., ignoring accent altogether,
is unacceptable. To this end, we train an accent classifier and
use it to label our datasets.
We use EdAcc [22], VCTK [23], and the English-accented
subset of VoxPopuli [24] as the training data for our accent clas-
sifier. In total, these datasets cover 53 accents. We extract em-
beddings using the language ID model from [25] and train a
simple linear classifier using these embeddings, achieving an
accuracy of 86% on a held-out test set. We then run this model
on our datasets and spot-check the results.
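For illustration, a minimal sketch of the classifier stage follows. It assumes utterance-level embeddings from the language ID model [25] have already been extracted and saved to disk; the file layout and the use of scikit-learn logistic regression are illustrative choices rather than exact implementation details.

```python
# Minimal sketch: linear accent classifier over pre-computed utterance embeddings.
# Embedding extraction from the language ID model [25] is assumed to have been
# done offline; paths and labels below are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical manifest of (embedding_path, accent_label) pairs drawn from
# EdAcc, VCTK, and the English-accented subset of VoxPopuli (53 accents).
manifest = [
    ("embs/edacc_0001.npy", "scottish"),
    ("embs/vctk_p225_001.npy", "southern_english"),
    # ... many more
]

X = np.stack([np.load(path) for path, _ in manifest])
y = np.array([label for _, label in manifest])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

clf = LogisticRegression(max_iter=1000)   # a simple linear classifier
clf.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```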
3.1.2. Recording quality
Large-scale publicly available speech datasets are typically de-
rived from crowd-sourced projects such as LibriVox. This leads
to a fundamental limitation of these datasets - the audio record-
ing quality is often suboptimal compared to professional record-
ings. For example, many utterances have low signal-to-noise,
narrow bandwidth, codec compression artifacts, and excessive
reverberation.
To circumvent this limitation, we include LibriTTS-R, a
dataset derived from LibriVox but which has, as mentioned,
significantly improved audio fidelity. By including this high-
fidelity dataset and labeling features related to recording quality
across both datasets, we hypothesize that the model will learn
a latent representation of audio fidelity. Crucially, this should
allow the generation of clean, professional-sounding recordings
for accents and styles that only have low-fidelity utterances in
the training data.
The two proxies we use for labeling recording quality are
the estimated signal-to-noise ratio (SNR) and estimated C50.
C50 is the ratio of early-arriving (first 50 ms) to late-arriving sound energy and indicates how reverberant a recording is. For both of these features,
we use the Brouhaha library (github.com/marianne-m/brouhaha-vad) introduced in [26].
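A minimal sketch of this labeling step is shown below, following the usage documented for the published Brouhaha checkpoint (“pyannote/brouhaha” on Hugging Face); the exact API and the 0.5 voice-activity threshold are assumptions that may differ between versions.

```python
# Minimal sketch: frame-level voice activity, SNR, and C50 estimation with
# Brouhaha via pyannote.audio. Access to the checkpoint may require a token.
import numpy as np
from pyannote.audio import Model, Inference

model = Model.from_pretrained("pyannote/brouhaha")
inference = Inference(model)

speech_snr, speech_c50 = [], []
for frame, (vad, snr, c50) in inference("utterance.wav"):
    if vad > 0.5:                  # keep speech frames only (illustrative threshold)
        speech_snr.append(snr)
        speech_c50.append(c50)

# Utterance-level proxies later mapped to recording-quality keywords (Section 3.2).
print("estimated SNR:", np.mean(speech_snr), "dB")
print("estimated C50:", np.mean(speech_c50), "dB")
```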
3.1.3. Pitch and speaking rate
We compute pitch contours for all utterances using the PENN
library (github.com/interactiveaudiolab/penn) proposed in [27] and then calculate the speaker-level
mean and utterance-level standard deviation. The speaker-level
mean is used to generate a label for speaker pitch relative to
gender, and the standard deviation is used as a proxy for how
monotone or animated an individual utterance is.
The speaking rate is simply calculated by dividing the num-
ber of phonemes in the transcript by the utterance length (si-
lences at the beginning and end of the audio files have already
been removed). We use the g2p library (github.com/roedoejet/g2p) for the grapheme-to-phoneme conversion.
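A minimal sketch of both features follows; argument names follow the published examples for PENN and g2p and may differ between versions, and the voicing threshold and approximate phoneme counting are illustrative simplifications.

```python
# Minimal sketch: utterance-level pitch statistics with PENN and speaking rate
# from a grapheme-to-phoneme conversion with g2p.
import torchaudio
import penn                      # github.com/interactiveaudiolab/penn
from g2p import make_g2p         # github.com/roedoejet/g2p

audio, sr = torchaudio.load("utterance.wav")        # silence already trimmed
pitch, periodicity = penn.from_audio(audio, sr, hopsize=0.01, fmin=30.0, fmax=1000.0)

voiced = periodicity > 0.065                        # illustrative voicing threshold
utt_pitch_mean = pitch[voiced].mean().item()        # averaged per speaker later
utt_pitch_std = pitch[voiced].std().item()          # proxy for monotone vs. animated

transcript = "the woods where Timothy wandered alone were wild and lonely"
phones = make_g2p("eng", "eng-ipa")(transcript).output_string
# Approximate phoneme count (multi-character IPA symbols are counted per character).
speaking_rate = len(phones.replace(" ", "")) / (audio.shape[-1] / sr)
```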
3.2. Metadata preparation
The next stage is to take all the variables described above and
convert them into natural language sentences. To do this, we
first create keywords for each variable.
The discrete labels such as gender (provided by the dataset
creators) and accent require no further processing and can be
directly used as keywords. However, the pitch, speaking rate,
and estimated SNR and C50 are all continuous variables that
must first be mapped to discrete categories. We do this by an-
alyzing the variables across the full dataset and then applying
appropriate binning. A visual example of this is shown in Fig-
ure 2, where the estimated SNR across all utterances can be
seen along with the bin boundaries. For each variable, we apply
seven bins and then use appropriate short phrases to describe
each bin. For example, in the case of speaking rate, we use
terms such as “very fast”, “quite fast”, “fairly slowly” etc.
Once this binning is complete for all continuous variables,
we have keywords for gender, accent, pitch relative to speaker,
pitch standard deviation, speaking rate, estimated SNR, and es-
timated C50. We also create a new category when the SNR and
C50 are both in their highest or lowest bin and label these “very
good recording” and “very bad recording”, respectively.
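A minimal sketch of the binning and keyword mapping for one variable follows; the phrase list is illustrative, and percentile-based edges stand in for the hand-chosen boundaries shown in Figure 2.

```python
# Minimal sketch: map a continuous variable (speaking rate) to one of seven
# descriptive keywords, and combine the SNR/C50 bins into a recording label.
import numpy as np

PHRASES = ["very slowly", "quite slowly", "fairly slowly", "at a moderate pace",
           "fairly quickly", "quite fast", "very fast"]

def make_bin_edges(values, n_bins=7):
    """Percentile edges over the full dataset (a stand-in for hand-chosen bins)."""
    return np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])

def to_keyword(value, edges, phrases=PHRASES):
    return phrases[int(np.digitize(value, edges))]

all_rates = np.loadtxt("speaking_rates.txt")        # hypothetical dataset statistics
edges = make_bin_edges(all_rates)
print(to_keyword(14.2, edges))                      # e.g. "fairly quickly"

def recording_keyword(snr_bin, c50_bin, n_bins=7):
    """Extra category when SNR and C50 are both in their highest or lowest bin."""
    if snr_bin == n_bins - 1 and c50_bin == n_bins - 1:
        return "very good recording"
    if snr_bin == 0 and c50_bin == 0:
        return "very bad recording"
    return None
```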
To improve generalization and allow the user to input de-
scriptive phrases using their own terminology, we feed these
sets of keywords into a language model (Stable Beluga 2, huggingface.co/stabilityai/StableBeluga2)
with appropriate prompts to create full sentences. For exam-
ple, “female”, “Hungarian”, “slightly roomy sounding”, “fairly
noisy”, “quite monotone”, “fairly low pitch”, “very slowly”
could be converted into “a woman with a deep voice speaking
slowly and somewhat monotonously with a Hungarian accent in
an echoey room with background noise”.
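We do not reproduce the exact prompts here; the sketch below illustrates the keywords-to-sentence step using the prompt format documented for Stable Beluga 2, with illustrative system and user text.

```python
# Minimal sketch: rewrite a keyword set as a natural language description with
# an instruction-tuned LLM. The system/user wording below is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

keywords = ["female", "Hungarian", "slightly roomy sounding", "fairly noisy",
            "quite monotone", "fairly low pitch", "very slowly"]
prompt = (
    "### System:\nYou rewrite keyword lists describing a voice recording as a "
    "single natural sentence.\n\n"
    f"### User:\nKeywords: {', '.join(keywords)}\n\n### Assistant:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```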
Figure 2: Estimated SNR across the MLS dataset and the dis-
crete bin boundaries used to create keywords.
3.3. Model
We adapt the general-purpose audio generation library AudioCraft (github.com/facebookresearch/audiocraft) and make it suitable for TTS. This library sup-
ports multiple forms of conditioning (text, embeddings, wave
files) that can be applied in various ways (pre-pending, cross-
attention, summing, interpolation). To date, it has only been
used for audio and music generation [28, 29]. In order to make
it suitable for TTS, we remove word drop-out from the tran-
script conditioning and, after some initial experiments, settle
on pre-pending the transcript and using cross-attention for the
description (see Figure 1). Unlike previous work, we do not
provide any form of audio or speaker embedding conditioning
from similar utterances - the model must rely entirely on the text
description for gender, style, accent, and channel conditions.
We use the Descript Audio Codec (DAC, github.com/descriptinc/descript-audio-codec; 44.1 kHz version) introduced in [30] to provide our discrete feature repre-
sentations. This residual vector quantized model produces to-
kens at a frame rate of 86Hz and has nine codebooks. We use
the delay pattern introduced in [29] to deal with these nine di-
mensions in the context of a language model architecture. We
choose this codec rather than the popularly used Encodec [31]
because the authors of DAC demonstrate a subjective and ob-
jective improvement in audio fidelity over Encodec.
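For clarity, a minimal sketch of the delay pattern from [29] as applied to the nine DAC codebooks is given below; the padding convention is illustrative.

```python
# Minimal sketch: codebook k is shifted right by k steps so that, at any
# decoding step, the model only conditions on tokens already generated.
import torch

def apply_delay_pattern(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """codes: (n_codebooks, n_steps) -> (n_codebooks, n_steps + n_codebooks - 1)."""
    n_q, n_steps = codes.shape
    out = torch.full((n_q, n_steps + n_q - 1), pad_id, dtype=codes.dtype)
    for k in range(n_q):
        out[k, k : k + n_steps] = codes[k]
    return out

codes = torch.arange(9 * 6).view(9, 6)       # toy example: 9 codebooks, 6 frames
delayed = apply_delay_pattern(codes, pad_id=-1)
```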
4. Experiments
4.1. Objective evaluation
To evaluate whether our model is capable of generating speech
that matches a provided description, we first carry out the fol-
lowing objective evaluations. To test the targeted control of
specific attributes, we use the test set sentences from MLS and
LibriTTS-R but manually write the description to test the vari-
able of interest. The only exception is the accent test set, where
we combine sentences from the Rainbow Passage [32], Comma
Gets a Cure [33], and Please Call Stella. In all cases, we ensure
that the descriptions are balanced across the test set.
Figure 3: Correlation between description labels and synthesized labels (with 95% confidence intervals): (a) mean pitch, (b) pitch standard deviation, (c) speaking rate, (d) estimated C50, (e) estimated SNR.

To test control of gender, we use a pre-trained gender classifier (huggingface.co/alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech) that achieves an F1 score of 0.99 on LibriSpeech Clean 100. Using this classifier, our generated test set scores an accuracy of 94%. In a similar fashion, we re-use our accent classifier
(see Section 3.1.1) and classify the accents of our generated ac-
cent test set. Here, we see a somewhat poorer accuracy of 68%.
We hypothesize that this is likely to be due to noisy labeling and
a very imbalanced distribution of accents in the training set.
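Both checks amount to running an audio classifier over the generated test sets. A minimal sketch for the gender check is given below, assuming the checkpoint above works with the standard Hugging Face audio-classification pipeline.

```python
# Minimal sketch: classify the apparent gender of a generated sample with the
# pre-trained checkpoint referenced above (pipeline compatibility is assumed).
from transformers import pipeline

clf = pipeline(
    "audio-classification",
    model="alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech",
)
print(clf("generated_sample.wav"))   # e.g. [{"label": "female", "score": 0.99}, ...]
```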
The remaining attributes that we labeled are continuous
variables. For these variables, we run our generated test sets
through the same models that were used to label the training set
(see Section 3.1). The results can be seen in Figure 3. We see
that for every attribute other than C50, the model performs fairly
well at generating speech that matches the provided description.
We are unsure as to why the model performs poorly at generat-
ing audio with the appropriate C50, and further investigation is
required.
Our final objective evaluation aims to quantify the audio
fidelity of our model when asked to produce audio with “excel-
lent recording quality” or similar terms. Here, we use the re-
cently proposed Torchaudio Speech Quality and Intelligibility
Measures [34]. This model provides a reference-less estimate
of wideband Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Scale-Invariant Signal-to-Distortion Ratio (SI-SDR). Using 20 test sentences
and descriptions from LibriTTS-R, we run these metrics on out-
puts from our model, Audiobox (using the public website interface at audiobox.metademolab.com/capabilities/tts_description_condition), and the ground truth audio. As can be seen in Table
1, our model produces speech with values that are significantly
higher than Audiobox and often very close to the ground truth.
As mentioned in Section 3.1.2, this is achieved despite training
on only 500 hours of high-fidelity speech in the context of the
full 45k hour dataset.
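A minimal sketch of computing these reference-less metrics with TorchAudio-SQUIM follows; the objective model expects 16 kHz mono input, and the file name is a placeholder.

```python
# Minimal sketch: reference-less STOI, PESQ, and SI-SDR estimates for one file.
import torch
import torchaudio
from torchaudio.pipelines import SQUIM_OBJECTIVE

model = SQUIM_OBJECTIVE.get_model()
wav, sr = torchaudio.load("generated.wav")
wav = wav.mean(0, keepdim=True)                              # mono
wav = torchaudio.functional.resample(wav, sr, SQUIM_OBJECTIVE.sample_rate)

with torch.no_grad():
    stoi, pesq, si_sdr = model(wav)
print(f"STOI={stoi.item():.3f}  PESQ={pesq.item():.2f}  SI-SDR={si_sdr.item():.2f} dB")
```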
Table 1: Speech Quality and Intelligibility Measures (SQUIM)
(with 95% confidence intervals)
Model PESQ STOI SI-SDR
Ground truth 4.15 ±0.04 0.997 ±0.001 27.45 ±1.09
Ours 3.84 ±0.10 0.996 ±0.001 26.53 ±1.16
Audiobox 3.46 ±0.16 0.988 ±0.004 21.84 ±1.37
Table 2: Naturalness and relevance results (with 95% confi-
dence intervals)
Model MOS REL
Ground truth 3.67 ±0.09 3.62 ±0.06
Ours 3.92 ±0.07 3.88 ±0.06
Audiobox 2.79 ±0.09 3.19 ±0.06
4.2. Subjective evaluation
To complement our objective evaluations, we run two subjec-
tive listening tests. The first of these aims to quantify how well
our model follows a natural language description (“relevance” or “REL”). An example of such a description taken from the listening test set is: “A female voice with an Italian accent reads from a book. The recording is very noisy. The speaker reads fairly quickly with a slightly high-pitched and monotone voice.”
We generate 40 sets of samples (20 from MLS and 20 from
LibriTTS-R) using descriptions created with the method described in Section 3. We employ 30 listeners (native En-
glish speakers) and ask them to evaluate how closely the speech
matches the description using a 5-point scale. For each sentence
and description, we present the listeners with outputs from our
model, Audiobox, and the ground truth audio. The only pro-
cessing we apply is loudness normalization to -18 LUFS and
the removal of silence before and after speech.
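A minimal sketch of this normalization step is given below; pyloudnorm is used here purely for illustration, as any ITU-R BS.1770 loudness meter would do.

```python
# Minimal sketch: loudness-normalize a listening-test stimulus to -18 LUFS.
import soundfile as sf
import pyloudnorm as pyln

data, rate = sf.read("stimulus.wav")
meter = pyln.Meter(rate)                              # ITU-R BS.1770 meter
loudness = meter.integrated_loudness(data)
normalized = pyln.normalize.loudness(data, loudness, -18.0)
sf.write("stimulus_norm.wav", normalized, rate)
```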
As can be seen in Table 2, our model outperforms Audiobox
in this evaluation. Somewhat counterintuitively, it also outper-
forms the ground truth. However, there appear to be two clear
reasons for this. Firstly, the test set descriptions contain la-
bel noise. For example, if a ground truth utterance is labeled
with the incorrect accent, the generated speech is likely to be
more faithful to the description than the ground truth. Simi-
larly, we also see very occasional instances where samples from
LibriTTS-R contain mild audio artifacts (we removed samples
containing significant artifacts). In this case, the audio fidelity
from our model is likely to be higher than the ground truth and,
therefore, closer to the description. One final note on this eval-
uation is that we are aware that there are aspects of speech that
Audiobox is capable of controlling that our model is not (for
example, age).
Our second listening test aims to evaluate the overall nat-
uralness and audio fidelity of our model. In this case, we only
use samples from LibriTTS-R. We ask 30 listeners to rate the
speech on a scale of 1-5 (we’ll refer to this simply as mean
opinion score, or MOS). We appreciate that this evaluation con-
flates the naturalness of speech patterns and audio fidelity, but
we chose this test design to match that used by the authors of
Audiobox.
In Table 2, we see that our model significantly outperforms
Audiobox. We hypothesize that there are two reasons for this.
Firstly, we believe that our use of the DAC codec rather than
Encodec has a significant impact (see Section 3.3 for further de-
tails). Secondly, our use (and labeling) of a small high-fidelity
dataset (LibriTTS-R) may well have had a significant impact.
However, this hypothesis is difficult to validate without full
knowledge of the data used to train Audiobox.
Again, we see our model outperforming the ground truth.
As discussed previously, it would appear that this is due to
minor audio artifacts in LibriTTS-R caused by the speech-
enhancement model. See our demo website (text-description-to-speech.com) to hear examples.
5. Conclusion
In this work, we propose a simple but highly effective method
for generating high-fidelity text-to-speech that can be intuitively
guided by natural language descriptions. To the best of our
knowledge, this is the first such method capable of controlling so wide a range of speech and channel-condition attributes while achieving this level of audio fidelity and overall naturalness. In particular, we note that this was possible with an amateur-recorded dataset coupled with a comparatively tiny amount of clean data.
However, this work only demonstrates efficacy in a rela-
tively narrow domain (audiobook reading in English). In the
future, we plan to extend to a wider range of languages, speak-
ing styles, vocal effort, and channel conditions.
6. References
[1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan,
P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell,
S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child,
A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen,
E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner,
S. McCandlish, A. Radford, I. Sutskever, and D. Amodei,
“Language Models are Few-Shot Learners.” [Online]. Available:
http://arxiv.org/abs/2005.14165
[2] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra,
A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann,
P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao,
P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du,
B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard,
G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev,
H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus,
D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov,
R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai,
T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child,
O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz,
O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean,
S. Petrov, and N. Fiedel, “PaLM: Scaling language modeling with
pathways.” [Online]. Available: http://arxiv.org/abs/2204.02311
[3] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen,
“Hierarchical text-conditional image generation with CLIP latents.” [Online]. Available: http://arxiv.org/abs/2204.06125
[4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer.
High-Resolution Image Synthesis with Latent Diffusion Models.
[Online]. Available: http://arxiv.org/abs/2112.10752
“Common Crawl.” [Online]. Available: https://commoncrawl.org/
[6] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wight-
man, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman,
P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt,
R. Kaczmarczyk, and J. Jitsev, “LAION-5b: An open large-scale
dataset for training next generation image-text models.” [Online].
Available: http://arxiv.org/abs/2210.08402
[7] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen,
Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei. Neural Codec
Language Models are Zero-Shot Text to Speech Synthesizers.
[Online]. Available: http://arxiv.org/abs/2301.02111
[8] Z. Zhang, L. Zhou, C. Wang, S. Chen, Y. Wu, S. Liu,
Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and
F. Wei. Speak Foreign Languages with Your Own Voice: Cross-
Lingual Neural Codec Language Modeling. [Online]. Available:
http://arxiv.org/abs/2303.03926
[9] M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz,
M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar, and W.-N. Hsu, “Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale.”
[10] A. Team, A. Vyas, B. Shi, M. Le, A. Tjandra, Y.-C. Wu, B. Guo,
J. Zhang, X. Zhang, R. Adkins, W. Ngan, J. Wang, I. Cruz,
B. Akula, A. Akinyemi, B. Ellis, R. Moritz, Y. Yungster, A. Rako-
toarison, L. Tan, C. Summers, C. Wood, J. Lane, M. Williamson,
and W.-N. Hsu, “Audiobox: Unified Audio Generation with Natural Language Prompts.”
[11] Z. Guo, Y. Leng, Y. Wu, S. Zhao, and X. Tan. PromptTTS:
Controllable Text-to-Speech with Text Descriptions. [Online].
Available: http://arxiv.org/abs/2211.12171
[12] D. Yang, S. Liu, R. Huang, G. Lei, C. Weng, H. Meng, and
D. Yu. InstructTTS: Modelling Expressive TTS in Discrete Latent
Space with Natural Language Style Prompt. [Online]. Available:
http://arxiv.org/abs/2301.13662
[13] G. Liu, Y. Zhang, Y. Lei, Y. Chen, R. Wang, Z. Li, and
L. Xie. PromptStyle: Controllable Style Transfer for Text-to-
Speech with Natural Language Descriptions. [Online]. Available:
http://arxiv.org/abs/2305.19522
[14] R. Shimizu, R. Yamamoto, M. Kawamura, Y. Shirahata,
H. Doi, T. Komatsu, and K. Tachibana. PromptTTS++:
Controlling Speaker Identity in Prompt-Based Text-to-Speech
Using Natural Language Descriptions. [Online]. Available:
http://arxiv.org/abs/2309.08140
[15] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert,
“MLS: A Large-Scale Multilingual Dataset for Speech Research,”
in Interspeech 2020, pp. 2757–2761. [Online]. Available:
http://arxiv.org/abs/2012.03411
[16] R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton,
J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards End-
to-End Prosody Transfer for Expressive Speech Synthesis with
Tacotron.” [Online]. Available: http://arxiv.org/abs/1803.09047
[17] Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg,
J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous,
“Style Tokens: Unsupervised Style Modeling, Control and
Transfer in End-to-End Speech Synthesis.” [Online]. Available:
http://arxiv.org/abs/1803.09017
[18] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang,
Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang,
“Hierarchical Generative Modeling for Controllable Speech
Synthesis.” [Online]. Available: http://arxiv.org/abs/1810.07217
[19] Y. Leng, Z. Guo, K. Shen, X. Tan, Z. Ju, Y. Liu, Y. Liu, D. Yang,
L. Zhang, K. Song, L. He, X.-Y. Li, S. Zhao, T. Qin, and J. Bian.
PromptTTS 2: Describing and Generating Voices with Text
Prompt. [Online]. Available: http://arxiv.org/abs/2309.02285
[20] Y. Koizumi, H. Zen, S. Karita, Y. Ding, K. Yatabe, N. Morioka,
M. Bacchiani, Y. Zhang, W. Han, and A. Bapna, “LibriTTS-R: A restored multi-speaker text-to-speech corpus.” [Online].
Available: http://arxiv.org/abs/2305.18802
[21] Y. Koizumi, H. Zen, S. Karita, Y. Ding, K. Yatabe,
N. Morioka, Y. Zhang, W. Han, A. Bapna, and M. Bacchiani,
“Miipher: A robust speech restoration model integrating self-
supervised speech and text representations.” [Online]. Available:
http://arxiv.org/abs/2303.01664
[22] R. Sanabria, N. Bogoychev, N. Markl, A. Carmantini, O. Klejch,
and P. Bell. The Edinburgh International Accents of English
Corpus: Towards the Democratization of English ASR. [Online].
Available: http://arxiv.org/abs/2303.18110
“CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92).” [Online]. Available:
https://datashare.ed.ac.uk/handle/10283/3443
C. Wang, M. Rivière, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, “VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation.” [Online]. Available: http://arxiv.org/abs/2101.00390
V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling Speech Technology to 1,000+ Languages.”
M. Lavechin, M. Métais, H. Titeux, A. Boissonnet, J. Copet, M. Rivière, E. Bergelson, A. Cristia, E. Dupoux, and H. Bredin.
Brouhaha: Multi-task training for voice activity detection,
speech-to-noise ratio, and C50 room acoustics estimation.
[Online]. Available: http://arxiv.org/abs/2210.13248
[27] M. Morrison, C. Hsieh, N. Pruyne, and B. Pardo. Cross-domain
Neural Pitch and Periodicity Estimation. [Online]. Available:
http://arxiv.org/abs/2301.12258
F. Kreuk et al., “AudioGen: Textually Guided Audio Generation.” [Online]. Available: https://openreview.net/forum?id=CYK7RfcOzQ4
[29] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve,
Y. Adi, and A. Défossez. Simple and Controllable Music
Generation. [Online]. Available: http://arxiv.org/abs/2306.05284
[30] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar.
High-Fidelity Audio Compression with Improved RVQGAN.
[Online]. Available: http://arxiv.org/abs/2306.06546
A. Défossez, J. Copet, G. Synnaeve, and Y. Adi. High Fidelity Neural Audio Compression. [Online]. Available: http://arxiv.org/abs/2210.13438
[32] G. Fairbanks, Voice and Articulation Drillbook, 1960.
D. Honorof, J. McCullough, and B. Somerville, “Comma gets a cure: A diagnostic passage for accent study,” 2000.
[34] A. Kumar, K. Tan, Z. Ni, P. Manocha, X. Zhang, E. Henderson,
and B. Xu. TorchAudio-Squim: Reference-less Speech Quality
and Intelligibility measures in TorchAudio. [Online]. Available:
http://arxiv.org/abs/2304.01448