tails). Second, our use (and labeling) of a small high-fidelity dataset (LibriTTS-R) likely had a significant impact. However, this hypothesis is difficult to validate without full knowledge of the data used to train Audiobox.
Again, we see our model outperforming the ground truth. As discussed previously, this appears to be due to minor audio artifacts introduced into LibriTTS-R by the speech-enhancement model; examples are available on our demo website (text-description-to-speech.com).
5. Conclusion
In this work, we propose a simple but highly effective method for generating high-fidelity text-to-speech that can be intuitively guided by natural-language descriptions. To the best of our knowledge, this is the first method capable of controlling such a wide range of speech and channel-condition attributes while achieving such high audio fidelity and overall naturalness. In particular, we note that this was possible with an amateur-recorded dataset coupled with a comparatively tiny amount of clean data.
However, this work only demonstrates efficacy in a relatively narrow domain (audiobook reading in English). In future work, we plan to extend our approach to a wider range of languages, speaking styles, levels of vocal effort, and channel conditions.