Natural language guidance of high-fidelity text-to-speech with synthetic
annotations
Dan Lyth¹, Simon King²
¹Stability AI
²University of Edinburgh, UK
Abstract
Text-to-speech models trained on large-scale datasets have
demonstrated impressive in-context learning capabilities and
naturalness. However, control of speaker identity and style
in these models typically requires conditioning on reference
speech recordings, limiting creative applications. Alterna-
tively, natural language prompting of speaker identity and style
has demonstrated promising results and provides an intuitive
method of control. However, reliance on human-labeled de-
scriptions prevents scaling to large datasets.
Our work bridges the gap between these two approaches.
We propose a scalable method for labeling various aspects of
speaker identity, style, and recording conditions. We then ap-
ply this method to a 45k hour dataset, which we use to train a
speech language model. Furthermore, we propose simple meth-
ods for increasing audio fidelity, significantly outperforming re-
cent work despite relying entirely on found data.
Our results demonstrate high-fidelity speech generation in
a diverse range of accents, prosodic styles, channel conditions,
and acoustic conditions, all accomplished with a single model
and intuitive natural language conditioning. Audio samples can
be heard at https://text-description-to-speech.com/.
1. Introduction
Scaling both model and training data size has driven rapid
progress in generative modeling, especially for text and image
synthesis [1, 2, 3, 4]. Natural language conditioning provides an
intuitive method for control and creativity in these modalities,
enabled by web-scale human-authored text and image annota-
tions [5, 6]. However, only recently has speech synthesis started
to exploit scale and natural language conditioning.
The initial results from large-scale text-to-speech (TTS)
models have demonstrated impressive in-context learning capa-
bilities, such as zero-shot speaker and style adaptation, cross-
lingual synthesis, and content editing [7, 8, 9]. However, a
reliance on reference speech limits their practical application.
It also forces the user to reproduce the likeness of an existing
speaker, which is beneficial in some use cases but has the po-
tential for harm, especially when so little enrollment data is re-
quired.
To alleviate these shortcomings, the use of natural language
to describe speaker and style is starting to be explored (the most
recent example is concurrent work to ours, Audiobox [10]).
Unlike the image modality, no large dataset containing natural
language descriptions of speech exists, so this metadata must
be created from scratch. To date, this has been achieved using
a combination of human annotations and statistical measures,
with the results often passed through a large language model to
mimic the variability that might be expected in genuine human
annotations [10, 11, 12, 13, 14]. However, any approach requir-
ing human annotations is challenging to scale to large datasets.
For example, Multilingual LibriSpeech [15] contains over 10
million utterances across 50k hours of audio, equivalent to over
five years. Because of the human annotation bottleneck, TTS
models using natural language descriptions have been of limited
scale and, therefore, unable to demonstrate some of the broad
range of capabilities associated with larger models.
In this work, we rely entirely on automatic labeling, en-
abling us to scale to large data for the first time (along with
concurrent work [10]). Coupling this with large-scale speech
language models allows us to synthesize speech in a wide range
of speaking styles and recording conditions using intuitive nat-
ural language control.
Specifically, we:
1. Propose a method for efficiently labeling a 45k hour dataset
with multiple attributes, including gender, accent, speaking
rate, pitch, and recording conditions.
2. Train a speech language model on this dataset and demon-
strate the ability to control these attributes independently, cre-
ating speaker identities and style combinations unseen in the
training data.
3. Demonstrate that with as little as 1% high-fidelity audio in
the training data and the use of the latest state-of-the-art au-
dio codec models, it is possible to generate extremely high-
fidelity audio.
2. Related Work
2.1. Control of speaker identity and style
Controlling non-lexical speech information, such as speaking
style and speaker identity, has been explored through various
approaches. With neural models, the first attempts at this re-
lied on reference embeddings or “global style tokens” derived
from exemplar recordings [16, 17]. This approach is effective
but constrains users to existing recordings, significantly limit-
ing versatility and scalability. To alleviate this, more flexible
approaches sample from the continuous latent spaces of Gaus-
sian mixture models and variational autoencoders [18]. How-
ever, this approach requires careful training to ensure that the
latent variables are disentangled, as well as complex analysis to
identify the relationship between these variables and attributes
of speech.
In an attempt to bypass the brittleness of reference embed-
dings and the complexity of disentangled latent variable ap-
proaches, recent work has attempted to use natural language
descriptions to guide non-lexical speech variation directly. This
line of work has no doubt been inspired by the success in other
modalities (particularly text-to-image models), but a key challenge for speech is the lack of natural language descriptive metadata.

Figure 1: Overview of the model architecture. Description text (e.g., “A woman with a Pakistani accent enchants the listener with her book reading. The speaker’s voice is very close-sounding, and the recording is excellent…”) is encoded by a pre-trained, fixed T5 model and conditions the decoder via cross-attention; transcript text tokens (e.g., “the woods where Timothy wandered alone were wild and lonely and in them were fierce bob cats who snarled and fought at night”) are pre-pended to the sequence; a decoder-only Transformer (language model) predicts residual vector quantized tokens over sequence steps and codebooks, which are converted to audio by the RVQ decoder.
An initial attempt at circumventing this issue was proposed
in [11]. In this work, the authors use statistically derived met-
rics (such as speaking rate and pitch) from a dataset of real
speech, combined with emotion labels from a dataset of syn-
thetic speech provided by a commercial TTS model. Together,
this dataset offers five axes of non-lexical variation, which are
turned into keywords, each with three levels of granularity
(high, medium, and low).
The authors of [12] move away from computational label-
ing methods and explore human annotation. They label 44
hours of data with natural language sentences describing style
and emotion but rely on a fixed set of speaker IDs to control
speaker likeness. Human annotation is used again in [13] (with
an even smaller 6-hour dataset) in service of style transfer, and
again, speaker likeness is controlled by speaker IDs. However,
in [14], the authors do tackle the labeling and generation of dif-
ferent speaker identities. Human annotations are combined with
computational statistics similar to [11] and are then fed into a
language model to create natural language variations. Unlike
our work, the authors make no attempt to model accent or chan-
nel conditions and only label 16% of the speakers in a dataset
already two orders of magnitude smaller than the one we use.
This difficulty in scaling human annotations provides some of
the motivation for this work.
While we were running the evaluations for this work, the
authors of [10] released Audiobox. This concurrent work uses
a similar approach to ours to label a large dataset. However, we
propose simple methods to significantly outperform this work
in terms of the overall naturalness and audio fidelity of the gen-
erated speech. We also outperform this work in how closely
our model matches the text description (for those attributes of
speech shared across both lines of work). PromptTTS2 [19]
is also concurrent work that scales to a large dataset, but their
approach only attempts to control four variables, significantly
limiting the range of capabilities.
3. Method
3.1. Metadata collection
The data for this study comprises two English speech corpora
derived from the LibriVox audiobook project (librivox.org): the English
portion of Multilingual LibriSpeech (MLS) [15] (45k hours)
and LibriTTS-R [20] (585 hours). While LibriTTS-R is sig-
nificantly smaller in scale, we include it given the higher au-
dio quality resulting from enhancement via the Miipher system
[21]. Both datasets provide transcriptions and a label for gender
generated using a predictive model.
3.1.1. Accent
Speaker accent is an aspect of speech that natural language
prompting in TTS has so far overlooked, despite the wide range
of accents found in the datasets typically used.
We appreciate that labeling accents with discrete labels is an
ill-formed task considering the discrete-continuous nature of ac-
cents. However, the alternative, i.e., ignoring accent altogether,
is unacceptable. To this end, we train an accent classifier and
use it to label our datasets.
We use EdAcc [22], VCTK [23], and the English-accented
subset of VoxPopuli [24] as the training data for our accent clas-
sifier. In total, these datasets cover 53 accents. We extract em-
beddings using the language ID model from [25] and train a
simple linear classifier using these embeddings, achieving an
accuracy of 86% on a held-out test set. We then run this model
on our datasets and spot-check the results.
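For illustration, a minimal sketch of the classifier stage follows. It assumes utterance-level embeddings from the language ID model [25] have already been extracted and saved to disk; the file layout and the use of scikit-learn logistic regression are illustrative choices rather than exact implementation details.

```python
# Minimal sketch: linear accent classifier over pre-computed utterance embeddings.
# Embedding extraction from the language ID model [25] is assumed to have been
# done offline; paths and labels below are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical manifest of (embedding_path, accent_label) pairs drawn from
# EdAcc, VCTK, and the English-accented subset of VoxPopuli (53 accents).
manifest = [
    ("embs/edacc_0001.npy", "scottish"),
    ("embs/vctk_p225_001.npy", "southern_english"),
    # ... many more
]

X = np.stack([np.load(path) for path, _ in manifest])
y = np.array([label for _, label in manifest])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

clf = LogisticRegression(max_iter=1000)   # a simple linear classifier
clf.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```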
3.1.2. Recording quality
Large-scale publicly available speech datasets are typically de-
rived from crowd-sourced projects such as LibriVox. This leads
to a fundamental limitation of these datasets - the audio record-
ing quality is often suboptimal compared to professional record-
ings. For example, many utterances have low signal-to-noise,
narrow bandwidth, codec compression artifacts, and excessive
reverberation.
To circumvent this limitation, we include LibriTTS-R, a
dataset derived from LibriVox but which has, as mentioned,
significantly improved audio fidelity. By including this high-
fidelity dataset and labeling features related to recording quality
across both datasets, we hypothesize that the model will learn
a latent representation of audio fidelity. Crucially, this should
allow the generation of clean, professional-sounding recordings
for accents and styles that only have low-fidelity utterances in
the training data.
The two proxies we use for labeling recording quality are
the estimated signal-to-noise ratio (SNR) and estimated C50.
C50 is the ratio of early-arriving (first 50 ms) to late-arriving sound energy and indicates how reverberant a recording is. For both of these features,
we use the Brouhaha library (github.com/marianne-m/brouhaha-vad) introduced in [26].
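A minimal sketch of this labeling step is shown below, following the usage documented for the published Brouhaha checkpoint (“pyannote/brouhaha” on Hugging Face); the exact API and the 0.5 voice-activity threshold are assumptions that may differ between versions.

```python
# Minimal sketch: frame-level voice activity, SNR, and C50 estimation with
# Brouhaha via pyannote.audio. Access to the checkpoint may require a token.
import numpy as np
from pyannote.audio import Model, Inference

model = Model.from_pretrained("pyannote/brouhaha")
inference = Inference(model)

speech_snr, speech_c50 = [], []
for frame, (vad, snr, c50) in inference("utterance.wav"):
    if vad > 0.5:                  # keep speech frames only (illustrative threshold)
        speech_snr.append(snr)
        speech_c50.append(c50)

# Utterance-level proxies later mapped to recording-quality keywords (Section 3.2).
print("estimated SNR:", np.mean(speech_snr), "dB")
print("estimated C50:", np.mean(speech_c50), "dB")
```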
3.1.3. Pitch and speaking rate
We compute pitch contours for all utterances using the PENN
library (github.com/interactiveaudiolab/penn) proposed in [27] and then calculate the speaker-level
mean and utterance-level standard deviation. The speaker-level
mean is used to generate a label for speaker pitch relative to
gender, and the standard deviation is used as a proxy for how
monotone or animated an individual utterance is.
The speaking rate is simply calculated by dividing the num-
ber of phonemes in the transcript by the utterance length (si-
lences at the beginning and end of the audio files have already
been removed). We use the g2p library (github.com/roedoejet/g2p) for the grapheme-to-phoneme conversion.
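A minimal sketch of both features follows; argument names follow the published examples for PENN and g2p and may differ between versions, and the voicing threshold and approximate phoneme counting are illustrative simplifications.

```python
# Minimal sketch: utterance-level pitch statistics with PENN and speaking rate
# from a grapheme-to-phoneme conversion with g2p.
import torchaudio
import penn                      # github.com/interactiveaudiolab/penn
from g2p import make_g2p         # github.com/roedoejet/g2p

audio, sr = torchaudio.load("utterance.wav")        # silence already trimmed
pitch, periodicity = penn.from_audio(audio, sr, hopsize=0.01, fmin=30.0, fmax=1000.0)

voiced = periodicity > 0.065                        # illustrative voicing threshold
utt_pitch_mean = pitch[voiced].mean().item()        # averaged per speaker later
utt_pitch_std = pitch[voiced].std().item()          # proxy for monotone vs. animated

transcript = "the woods where Timothy wandered alone were wild and lonely"
phones = make_g2p("eng", "eng-ipa")(transcript).output_string
# Approximate phoneme count (multi-character IPA symbols are counted per character).
speaking_rate = len(phones.replace(" ", "")) / (audio.shape[-1] / sr)
```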
3.2. Metadata preparation
The next stage is to take all the variables described above and
convert them into natural language sentences. To do this, we
first create keywords for each variable.
The discrete labels such as gender (provided by the dataset
creators) and accent require no further processing and can be
directly used as keywords. However, the pitch, speaking rate,
and estimated SNR and C50 are all continuous variables that
must first be mapped to discrete categories. We do this by an-
alyzing the variables across the full dataset and then applying
appropriate binning. A visual example of this is shown in Fig-
ure 2, where the estimated SNR across all utterances can be
seen along with the bin boundaries. For each variable, we apply
seven bins and then use appropriate short phrases to describe
each bin. For example, in the case of speaking rate, we use
terms such as “very fast”, “quite fast”, “fairly slowly” etc.
Once this binning is complete for all continuous variables,
we have keywords for gender, accent, pitch relative to speaker,
pitch standard deviation, speaking rate, estimated SNR, and es-
timated C50. We also create a new category when the SNR and
C50 are both in their highest or lowest bin and label these “very
good recording” and “very bad recording”, respectively.
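A minimal sketch of the binning and keyword mapping for one variable follows; the phrase list is illustrative, and percentile-based edges stand in for the hand-chosen boundaries shown in Figure 2.

```python
# Minimal sketch: map a continuous variable (speaking rate) to one of seven
# descriptive keywords, and combine the SNR/C50 bins into a recording label.
import numpy as np

PHRASES = ["very slowly", "quite slowly", "fairly slowly", "at a moderate pace",
           "fairly quickly", "quite fast", "very fast"]

def make_bin_edges(values, n_bins=7):
    """Percentile edges over the full dataset (a stand-in for hand-chosen bins)."""
    return np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])

def to_keyword(value, edges, phrases=PHRASES):
    return phrases[int(np.digitize(value, edges))]

all_rates = np.loadtxt("speaking_rates.txt")        # hypothetical dataset statistics
edges = make_bin_edges(all_rates)
print(to_keyword(14.2, edges))                      # e.g. "fairly quickly"

def recording_keyword(snr_bin, c50_bin, n_bins=7):
    """Extra category when SNR and C50 are both in their highest or lowest bin."""
    if snr_bin == n_bins - 1 and c50_bin == n_bins - 1:
        return "very good recording"
    if snr_bin == 0 and c50_bin == 0:
        return "very bad recording"
    return None
```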
To improve generalization and allow the user to input de-
scriptive phrases using their own terminology, we feed these
sets of keywords into a language model (Stable Beluga 2, huggingface.co/stabilityai/StableBeluga2)
with appropriate prompts to create full sentences. For exam-
ple, “female”, “Hungarian”, “slightly roomy sounding”, “fairly
noisy”, “quite monotone”, “fairly low pitch”, “very slowly”
could be converted into “a woman with a deep voice speaking
slowly and somewhat monotonously with a Hungarian accent in
an echoey room with background noise”.
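We do not reproduce the exact prompts here; the sketch below illustrates the keywords-to-sentence step using the prompt format documented for Stable Beluga 2, with illustrative system and user text.

```python
# Minimal sketch: rewrite a keyword set as a natural language description with
# an instruction-tuned LLM. The system/user wording below is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

keywords = ["female", "Hungarian", "slightly roomy sounding", "fairly noisy",
            "quite monotone", "fairly low pitch", "very slowly"]
prompt = (
    "### System:\nYou rewrite keyword lists describing a voice recording as a "
    "single natural sentence.\n\n"
    f"### User:\nKeywords: {', '.join(keywords)}\n\n### Assistant:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```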
Figure 2: Estimated SNR across the MLS dataset and the dis-
crete bin boundaries used to create keywords.
3.3. Model
We adapt the general-purpose audio generation library AudioCraft (github.com/facebookresearch/audiocraft) and make it suitable for TTS. This library sup-
ports multiple forms of conditioning (text, embeddings, wave
files) that can be applied in various ways (pre-pending, cross-
attention, summing, interpolation). To date, it has only been
used for audio and music generation [28, 29]. In order to make
it suitable for TTS, we remove word drop-out from the tran-
script conditioning and, after some initial experiments, settle
on pre-pending the transcript and using cross-attention for the
description (see Figure 1). Unlike previous work, we do not
provide any form of audio or speaker embedding conditioning
from similar utterances - the model must rely entirely on the text
description for gender, style, accent, and channel conditions.
We use the Descript Audio Codec (DAC, github.com/descriptinc/descript-audio-codec; 44.1 kHz version) introduced in [30] to provide our discrete feature repre-
sentations. This residual vector quantized model produces to-
kens at a frame rate of 86Hz and has nine codebooks. We use
the delay pattern introduced in [29] to deal with these nine di-
mensions in the context of a language model architecture. We
choose this codec rather than the popularly used Encodec [31]
because the authors of DAC demonstrate a subjective and ob-
jective improvement in audio fidelity over Encodec.
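For clarity, a minimal sketch of the delay pattern from [29] as applied to the nine DAC codebooks is given below; the padding convention is illustrative.

```python
# Minimal sketch: codebook k is shifted right by k steps so that, at any
# decoding step, the model only conditions on tokens already generated.
import torch

def apply_delay_pattern(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """codes: (n_codebooks, n_steps) -> (n_codebooks, n_steps + n_codebooks - 1)."""
    n_q, n_steps = codes.shape
    out = torch.full((n_q, n_steps + n_q - 1), pad_id, dtype=codes.dtype)
    for k in range(n_q):
        out[k, k : k + n_steps] = codes[k]
    return out

codes = torch.arange(9 * 6).view(9, 6)       # toy example: 9 codebooks, 6 frames
delayed = apply_delay_pattern(codes, pad_id=-1)
```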
4. Experiments
4.1. Objective evaluation
To evaluate whether our model is capable of generating speech
that matches a provided description, we first carry out the fol-
lowing objective evaluations. To test the targeted control of
specific attributes, we use the test set sentences from MLS and
LibriTTS-R but manually write the description to test the vari-
able of interest. The only exception is the accent test set, where
we combine sentences from the Rainbow Passage [32], Comma
Gets a Cure [33], and Please Call Stella. In all cases, we ensure
that the descriptions are balanced across the test set.
Figure 3: Correlation between description labels and synthesized labels (with 95% confidence intervals): (a) mean pitch, (b) pitch standard deviation, (c) speaking rate, (d) estimated C50, (e) estimated SNR.

To test control of gender, we use a pre-trained gender classifier (huggingface.co/alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech) that achieves an F1 score of 0.99 on LibriSpeech Clean 100. Using this classifier, our generated test set scores an accuracy of 94%. In a similar fashion, we re-use our accent classifier
(see Section 3.1.1) and classify the accents of our generated ac-
cent test set. Here, we see a somewhat poorer accuracy of 68%.
We hypothesize that this is likely to be due to noisy labeling and
a very imbalanced distribution of accents in the training set.
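Both checks amount to running an audio classifier over the generated test sets. A minimal sketch for the gender check is given below, assuming the checkpoint above works with the standard Hugging Face audio-classification pipeline.

```python
# Minimal sketch: classify the apparent gender of a generated sample with the
# pre-trained checkpoint referenced above (pipeline compatibility is assumed).
from transformers import pipeline

clf = pipeline(
    "audio-classification",
    model="alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech",
)
print(clf("generated_sample.wav"))   # e.g. [{"label": "female", "score": 0.99}, ...]
```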
The remaining attributes that we labeled are continuous
variables. For these variables, we run our generated test sets
through the same models that were used to label the training set
(see Section 3.1). The results can be seen in Figure 3. We see
that for every attribute other than C50, the model performs fairly
well at generating speech that matches the provided description.
We are unsure as to why the model performs poorly at generat-
ing audio with the appropriate C50, and further investigation is
required.
Our final objective evaluation aims to quantify the audio
fidelity of our model when asked to produce audio with “excel-
lent recording quality” or similar terms. Here, we use the re-
cently proposed Torchaudio Speech Quality and Intelligibility
Measures [34]. This model provides a reference-less estimate
of wideband Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Scale-Invariant Signal-to-Distortion Ratio (SI-SDR). Using 20 test sentences
and descriptions from LibriTTS-R, we run these metrics on out-
puts from our model, Audiobox (using the public website interface at audiobox.metademolab.com/capabilities/tts_description_condition), and the ground truth audio. As can be seen in Table
1, our model produces speech with values that are significantly
higher than Audiobox and often very close to the ground truth.
As mentioned in Section 3.1.2, this is achieved despite training
on only 500 hours of high-fidelity speech in the context of the
full 45k hour dataset.
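A minimal sketch of computing these reference-less metrics with TorchAudio-SQUIM follows; the objective model expects 16 kHz mono input, and the file name is a placeholder.

```python
# Minimal sketch: reference-less STOI, PESQ, and SI-SDR estimates for one file.
import torch
import torchaudio
from torchaudio.pipelines import SQUIM_OBJECTIVE

model = SQUIM_OBJECTIVE.get_model()
wav, sr = torchaudio.load("generated.wav")
wav = wav.mean(0, keepdim=True)                              # mono
wav = torchaudio.functional.resample(wav, sr, SQUIM_OBJECTIVE.sample_rate)

with torch.no_grad():
    stoi, pesq, si_sdr = model(wav)
print(f"STOI={stoi.item():.3f}  PESQ={pesq.item():.2f}  SI-SDR={si_sdr.item():.2f} dB")
```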
Table 1: Speech Quality and Intelligibility Measures (SQUIM)
(with 95% confidence intervals)
Model PESQ STOI SI-SDR
Ground truth 4.15 ±0.04 0.997 ±0.001 27.45 ±1.09
Ours 3.84 ±0.10 0.996 ±0.001 26.53 ±1.16
Audiobox 3.46 ±0.16 0.988 ±0.004 21.84 ±1.37
Table 2: Naturalness and relevance results (with 95% confi-
dence intervals)
Model MOS REL
Ground truth 3.67 ±0.09 3.62 ±0.06
Ours 3.92 ±0.07 3.88 ±0.06
Audiobox 2.79 ±0.09 3.19 ±0.06
4.2. Subjective evaluation
To complement our objective evaluations, we run two subjec-
tive listening tests. The first of these aims to quantify how well
our model follows a natural language description (“relevance” or “REL”). An example of such a description taken from the listening test set is: “A female voice with an Italian accent reads from a book. The recording is very noisy. The speaker reads fairly quickly with a slightly high-pitched and monotone voice.”
We generate 40 sets of samples (20 from MLS and 20 from
LibriTTS-R) using descriptions created with the method described in Section 3. We employ 30 listeners (native En-
glish speakers) and ask them to evaluate how closely the speech
matches the description using a 5-point scale. For each sentence
and description, we present the listeners with outputs from our
model, Audiobox, and the ground truth audio. The only pro-
cessing we apply is loudness normalization to -18 LUFS and
the removal of silence before and after speech.
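A minimal sketch of this normalization step is given below; pyloudnorm is used here purely for illustration, as any ITU-R BS.1770 loudness meter would do.

```python
# Minimal sketch: loudness-normalize a listening-test stimulus to -18 LUFS.
import soundfile as sf
import pyloudnorm as pyln

data, rate = sf.read("stimulus.wav")
meter = pyln.Meter(rate)                              # ITU-R BS.1770 meter
loudness = meter.integrated_loudness(data)
normalized = pyln.normalize.loudness(data, loudness, -18.0)
sf.write("stimulus_norm.wav", normalized, rate)
```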
As can be seen in Table 2, our model outperforms Audiobox
in this evaluation. Somewhat counterintuitively, it also outper-
forms the ground truth. However, there appear to be two clear
reasons for this. Firstly, the test set descriptions contain la-
bel noise. For example, if a ground truth utterance is labeled
with the incorrect accent, the generated speech is likely to be
more faithful to the description than the ground truth. Simi-
larly, we also see very occasional instances where samples from
LibriTTS-R contain mild audio artifacts (we removed samples
containing significant artifacts). In this case, the audio fidelity
from our model is likely to be higher than the ground truth and,
therefore, closer to the description. One final note on this eval-
uation is that we are aware that there are aspects of speech that
Audiobox is capable of controlling that our model is not (for
example, age).
Our second listening test aims to evaluate the overall nat-
uralness and audio fidelity of our model. In this case, we only
use samples from LibriTTS-R. We ask 30 listeners to rate the
speech on a scale of 1-5 (we’ll refer to this simply as mean
opinion score, or MOS). We appreciate that this evaluation con-
flates the naturalness of speech patterns and audio fidelity, but
we chose this test design to match that used by the authors of
Audiobox.
In Table 2, we see that our model significantly outperforms
Audiobox. We hypothesize that there are two reasons for this.
Firstly, we believe that our use of the DAC codec rather than
Encodec has a significant impact (see Section 3.3 for further de-
tails). Secondly, our use (and labeling) of a small high-fidelity
dataset (LibriTTS-R) may well have had a significant impact.
However, this hypothesis is difficult to validate without full
knowledge of the data used to train Audiobox.
Again, we see our model outperforming the ground truth.
As discussed previously, it would appear that this is due to
minor audio artifacts in LibriTTS-R caused by the speech-
enhancement model. See our demo website (text-description-to-speech.com) to hear examples.
5. Conclusion
In this work, we propose a simple but highly effective method
for generating high-fidelity text-to-speech that can be intuitively
guided by natural language descriptions. To the best of our
knowledge, this is the first such method capable of controlling so wide a range of speech and channel-condition attributes while achieving this level of audio fidelity and overall naturalness. In particular, we note that this was possible with an amateur-recorded dataset coupled with a comparatively tiny amount of clean data.
However, this work only demonstrates efficacy in a rela-
tively narrow domain (audiobook reading in English). In the
future, we plan to extend to a wider range of languages, speak-
ing styles, vocal effort, and channel conditions.
6. References
[1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan,
P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell,
S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child,
A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen,
E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner,
S. McCandlish, A. Radford, I. Sutskever, and D. Amodei,
“Language Models are Few-Shot Learners.” [Online]. Available:
http://arxiv.org/abs/2005.14165
[2] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra,
A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann,
P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao,
P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du,
B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard,
G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev,
H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus,
D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov,
R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai,
T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child,
O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz,
O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean,
S. Petrov, and N. Fiedel, “PaLM: Scaling language modeling with
pathways.” [Online]. Available: http://arxiv.org/abs/2204.02311
[3] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen,
“Hierarchical text-conditional image generation with CLIP latents.” [Online]. Available: http://arxiv.org/abs/2204.06125
[4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer.
High-Resolution Image Synthesis with Latent Diffusion Models.
[Online]. Available: http://arxiv.org/abs/2112.10752
“Common Crawl.” [Online]. Available: https://commoncrawl.org/
[6] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wight-
man, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman,
P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt,
R. Kaczmarczyk, and J. Jitsev, “LAION-5b: An open large-scale
dataset for training next generation image-text models.” [Online].
Available: http://arxiv.org/abs/2210.08402
[7] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen,
Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei. Neural Codec
Language Models are Zero-Shot Text to Speech Synthesizers.
[Online]. Available: http://arxiv.org/abs/2301.02111
[8] Z. Zhang, L. Zhou, C. Wang, S. Chen, Y. Wu, S. Liu,
Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and
F. Wei. Speak Foreign Languages with Your Own Voice: Cross-
Lingual Neural Codec Language Modeling. [Online]. Available:
http://arxiv.org/abs/2303.03926
[9] M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz,
M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar, and W.-N. Hsu, “Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale.”
[10] A. Team, A. Vyas, B. Shi, M. Le, A. Tjandra, Y.-C. Wu, B. Guo,
J. Zhang, X. Zhang, R. Adkins, W. Ngan, J. Wang, I. Cruz,
B. Akula, A. Akinyemi, B. Ellis, R. Moritz, Y. Yungster, A. Rako-
toarison, L. Tan, C. Summers, C. Wood, J. Lane, M. Williamson,
and W.-N. Hsu, “Audiobox: Unified Audio Generation with Natural Language Prompts.”
[11] Z. Guo, Y. Leng, Y. Wu, S. Zhao, and X. Tan. PromptTTS:
Controllable Text-to-Speech with Text Descriptions. [Online].
Available: http://arxiv.org/abs/2211.12171
[12] D. Yang, S. Liu, R. Huang, G. Lei, C. Weng, H. Meng, and
D. Yu. InstructTTS: Modelling Expressive TTS in Discrete Latent
Space with Natural Language Style Prompt. [Online]. Available:
http://arxiv.org/abs/2301.13662
[13] G. Liu, Y. Zhang, Y. Lei, Y. Chen, R. Wang, Z. Li, and
L. Xie. PromptStyle: Controllable Style Transfer for Text-to-
Speech with Natural Language Descriptions. [Online]. Available:
http://arxiv.org/abs/2305.19522
[14] R. Shimizu, R. Yamamoto, M. Kawamura, Y. Shirahata,
H. Doi, T. Komatsu, and K. Tachibana. PromptTTS++:
Controlling Speaker Identity in Prompt-Based Text-to-Speech
Using Natural Language Descriptions. [Online]. Available:
http://arxiv.org/abs/2309.08140
[15] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert,
“MLS: A Large-Scale Multilingual Dataset for Speech Research,”
in Interspeech 2020, pp. 2757–2761. [Online]. Available:
http://arxiv.org/abs/2012.03411
[16] R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton,
J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards End-
to-End Prosody Transfer for Expressive Speech Synthesis with
Tacotron.” [Online]. Available: http://arxiv.org/abs/1803.09047
[17] Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg,
J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous,
“Style Tokens: Unsupervised Style Modeling, Control and
Transfer in End-to-End Speech Synthesis.” [Online]. Available:
http://arxiv.org/abs/1803.09017
[18] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang,
Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang,
“Hierarchical Generative Modeling for Controllable Speech
Synthesis.” [Online]. Available: http://arxiv.org/abs/1810.07217
[19] Y. Leng, Z. Guo, K. Shen, X. Tan, Z. Ju, Y. Liu, Y. Liu, D. Yang,
L. Zhang, K. Song, L. He, X.-Y. Li, S. Zhao, T. Qin, and J. Bian.
PromptTTS 2: Describing and Generating Voices with Text
Prompt. [Online]. Available: http://arxiv.org/abs/2309.02285
[20] Y. Koizumi, H. Zen, S. Karita, Y. Ding, K. Yatabe, N. Morioka,
M. Bacchiani, Y. Zhang, W. Han, and A. Bapna, “LibriTTS-R: A restored multi-speaker text-to-speech corpus.” [Online].
Available: http://arxiv.org/abs/2305.18802
[21] Y. Koizumi, H. Zen, S. Karita, Y. Ding, K. Yatabe,
N. Morioka, Y. Zhang, W. Han, A. Bapna, and M. Bacchiani,
“Miipher: A robust speech restoration model integrating self-
supervised speech and text representations.” [Online]. Available:
http://arxiv.org/abs/2303.01664
[22] R. Sanabria, N. Bogoychev, N. Markl, A. Carmantini, O. Klejch,
and P. Bell. The Edinburgh International Accents of English
Corpus: Towards the Democratization of English ASR. [Online].
Available: http://arxiv.org/abs/2303.18110
“CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92).” [Online]. Available:
https://datashare.ed.ac.uk/handle/10283/3443
C. Wang, M. Rivière, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, “VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation.” [Online]. Available: http://arxiv.org/abs/2101.00390
V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling Speech Technology to 1,000+ Languages.”
M. Lavechin, M. Métais, H. Titeux, A. Boissonnet, J. Copet, M. Rivière, E. Bergelson, A. Cristia, E. Dupoux, and H. Bredin.
Brouhaha: Multi-task training for voice activity detection,
speech-to-noise ratio, and C50 room acoustics estimation.
[Online]. Available: http://arxiv.org/abs/2210.13248
[27] M. Morrison, C. Hsieh, N. Pruyne, and B. Pardo. Cross-domain
Neural Pitch and Periodicity Estimation. [Online]. Available:
http://arxiv.org/abs/2301.12258
F. Kreuk et al., “AudioGen: Textually Guided Audio Generation.” [Online]. Available: https://openreview.net/forum?id=CYK7RfcOzQ4
[29] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve,
Y. Adi, and A. Défossez. Simple and Controllable Music
Generation. [Online]. Available: http://arxiv.org/abs/2306.05284
[30] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar.
High-Fidelity Audio Compression with Improved RVQGAN.
[Online]. Available: http://arxiv.org/abs/2306.06546
A. Défossez, J. Copet, G. Synnaeve, and Y. Adi. High Fidelity Neural Audio Compression. [Online]. Available: http://arxiv.org/abs/2210.13438
[32] G. Fairbanks, Voice and Articulation Drillbook, 1960.
D. Honorof, J. McCullough, and B. Somerville, “Comma gets a cure: A diagnostic passage for accent study,” 2000.
[34] A. Kumar, K. Tan, Z. Ni, P. Manocha, X. Zhang, E. Henderson,
and B. Xu. TorchAudio-Squim: Reference-less Speech Quality
and Intelligibility measures in TorchAudio. [Online]. Available:
http://arxiv.org/abs/2304.01448