I'm a PhD student (finishing April 2025) working on speech synthesis and recognition, and I signed up here since a lot of the data used in my field comes from LibriVox audiobooks.
I also enjoy audiobooks and will maybe try my hand at recording one at some point (although my German accent might be a bit much for some listeners).
However, I'm posting here because I think the research community should engage more with the people whose voices are being used, and I'm very curious what your thoughts are on this research.
I have three objectives with this post:
1) Describe how your data has been used (and which voices are included)
2) Ask what your opinions are on your voices being replicated by AI - Obviously for a lot of people on here who recorded audiobooks long before speech synthesis was anywhere near what it is today, it might come as a surprise how well your voices have been replicated by now. On the other hand, it could be positive to know your work has contributed to the advancement of research - I don’t want to bias anyone either way.
3) Answer any questions on speech research and how LibriVox data is used to the best of my abilities.
Just to be clear, I'm not using any of this for my PhD or thesis; this is purely out of interest and a wish to share what I know about the field with the people who have been a huge resource to it (perhaps without fully knowing it).
Overview of Speech Technology Research
First of all, the AI-generated voices everyone has come into contact with by now do not represent all that is "speech technology research". Broadly, speech technology research can be divided into automatic speech recognition (ASR) and text-to-speech (TTS), although there is more outside these two categories - for simplicity I will refer to them by their abbreviations. This research can be used for screen readers, restoring someone's lost voice and subtitling, but also for scam calls, surveillance and impersonation. I'd like to say that the good uses have outweighed the bad, but I suppose I'm quite biased.
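To make the ASR side a bit more concrete, here is a tiny sketch using the open-source "transformers" library. The model name is just one example of many publicly available checkpoints, and "my_recording.wav" is a placeholder; TTS simply goes in the opposite direction (text in, audio out).

```python
# Minimal ASR illustration: audio file in, transcript out.
# Requires ffmpeg for audio decoding; the model name is one example
# of many publicly available checkpoints.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")
print(asr("my_recording.wav")["text"])  # "my_recording.wav" is a placeholder path
```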
1) How Your Data Has Been Used
To check which corpora your recordings have been used in, first find your LibriVox reader ID (the number following https://librivox.org/reader/ on your reader page), then download the datasets and check whether your ID appears as one of the folder names - see the small sketch below. For the smaller datasets that is feasible, but for the larger ones I would like to compile a searchable file with all the IDs, and I'll try to find the time to do so over the next couple of days. The main problem is that the authors often do not provide the list of speaker IDs, and some of the datasets are several terabytes in size.
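For those comfortable with a bit of scripting, here is a rough sketch of that check. It assumes a corpus with the LibriSpeech-style folder layout, where each directory directly under a split (for example "train-clean-100") is named after a reader ID; the path and ID below are placeholders.

```python
# Check whether a given LibriVox reader ID appears in an extracted,
# LibriSpeech-style split where top-level folders are named by reader ID.
from pathlib import Path

def reader_in_split(split_dir: str, reader_id: str) -> bool:
    return any(d.is_dir() and d.name == reader_id for d in Path(split_dir).iterdir())

# Placeholder path and ID - substitute your own.
print(reader_in_split("LibriSpeech/train-clean-100", "1234"))
```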
Frequently Used Datasets
- LibriSpeech (2015) - 6500+ citations - https://www.danielpovey.com/files/2015_icassp_librispeech.pdf
This 1000 hour corpus was mostly used for ASR training. The audio was segmented into chunks of less than one minute (usually only a couple of seconds) and was not split along sentence boundaries, so segments can start and end abruptly mid-sentence.
- LJSpeech (2017) - 1000s of citations (not tracked on Google Scholar) - https://keithito.com/LJ-Speech-Dataset/
Arguably the most used dataset for TTS. Since TTS models are hard to train for multiple speakers, it features only 24 hours from a single speaker, Linda Johnson. The original author included her as a co-author in the citation, but sadly most papers citing the dataset do not. Since a big part of TTS research is the synthetic audio samples and how convincing they are, I think it's safe to say that Linda Johnson is the person with the most synthetic versions of her voice out there.
Here is an example of a recent paper synthesising her voice: https://styletts2.github.io/
- LibriTTS (2019) - 800+ citations - https://arxiv.org/abs/1904.02882
This corpus brought multiple speakers to TTS training. Here, the samples are actually split at sentence boundaries and include punctuation. The dataset contains 585 hours of speech from ~2,500 LibriVox speakers.
Here is an example of a recent paper synthesising LibriVox voices: https://speechresearch.github.io/naturalspeech3/
- LibriLight (2019) - 600+ citations - https://arxiv.org/pdf/1912.07875
Despite the name, this dataset contains far more data: over 50k hours from over 7,000 speakers. The "light" refers to the labels, since only a small subset comes with the corresponding text; the corpus is meant for training ASR models on the unlabeled (no text) data so that they can then recognize the portion that does have text.
- MLS (Multi-lingual LibriSpeech) (2020) - 400+ citations - https://arxiv.org/abs/2012.03411
Around 50k hours of speech (45k of it English) across eight languages: English, German, Dutch, French, Spanish, Italian, Portuguese and Polish.
- LibriTTS-R (2023) - 20+ citations - https://arxiv.org/abs/2305.18802
LibriTTS, but with noise removed using a speech enhancement model.
- LibriHeavy (2024) - 10+ citations - https://arxiv.org/abs/2309.08105
LibriLight, but with transcripts for all 50,000 hours of speech.
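In case you are curious what a single training example from one of these corpora looks like to a researcher, here is a small sketch using the Hugging Face "datasets" library. The dataset name, configuration and field names are what I recall for the LibriSpeech card and may differ depending on your library version; streaming avoids downloading the whole corpus up front.

```python
# Peek at one LibriSpeech example without downloading the full corpus.
# Dataset/field names ("librispeech_asr", "speaker_id", "text", "audio")
# are from memory and may need adjusting for your datasets version.
from datasets import load_dataset

ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
example = next(iter(ds))
print(example["speaker_id"])   # the LibriVox reader ID
print(example["text"][:80])    # the (uppercased) transcript
# example["audio"]["array"] holds the waveform as a NumPy array.
```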
What does this mean in practical terms? If you have recorded an English audiobook and shared it on LibriVox pre-2015, chances are good your voice has been used to help make things like Alexa, Google Home and Siri happen, and much more besides. Since LibriSpeech is so widely used, it is even possible that your voice is recognized more easily than others.
For TTS, your voice has very likely contributed to most modern TTS systems, like the one by OpenAI or the AI voice used on TikTok. There are also many openly released TTS systems these days. Most aim to do "voice cloning": you provide a sample of a few seconds and the model replicates that voice. This works much better when the voice was in the training data, so if your recordings were used in training, providing a sample of your voice to the model will produce a more realistic result. Here are some models you can try right now if you are interested (I have also put a small code sketch after the links for the more technically inclined):
- https://huggingface.co/spaces/coqui/xtts
- https://huggingface.co/spaces/styletts2/styletts2
- https://huggingface.co/spaces/parler-tts/parler_tts (this one is a bit different, here you use text to describe what the speaker should sound like)
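If you would rather run a model locally, here is a rough sketch of what voice cloning looks like with the open-source Coqui TTS package (the first link above is, as far as I know, a hosted demo of the same model). The model name and arguments are from memory and may differ between versions, and "my_voice_sample.wav" is a placeholder for a few seconds of your own speech.

```python
# Rough voice-cloning sketch with Coqui TTS (pip install TTS).
# Model name and arguments may vary between library versions.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="This is what a cloned voice can sound like.",
    speaker_wav="my_voice_sample.wav",  # a short reference recording of your voice
    language="en",
    file_path="cloned_output.wav",
)
```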
2) What are your opinions on your voices being replicated by AI?
As I said above, I don't want to push you either way - I'd simply like to hear your thoughts, whatever they are, in the replies.
3) Are there any questions that I can answer as a speech technology researcher?
Edit: I just found this post from 2011 discussing the implications of using LibriVox recordings for TTS. Well worth a read!