I'm a PhD student (finishing April 2025) working on speech synthesis and recognition, and I signed up here since a lot of the data used in my field comes from LibriVox audiobooks.
I also enjoy audiobooks and will maybe try my hand at recording one at some point (although my German accent might be a bit much for some listeners).
However, I'm posting here because I think the research community should engage more with the people whose voices are being used, and I'm very curious what your thoughts are on this research.
I have three objectives with this post:
1) Describe how your data has been used (and which voices are included)
2) Ask what your opinions are on your voices being replicated by AI - Obviously for a lot of people on here who recorded audiobooks long before speech synthesis was anywhere near what it is today, it might come as a surprise how well your voices have been replicated by now. On the other hand, it could be positive to know your work has contributed to the advancement of research - I don’t want to bias anyone either way.
3) Answer any questions on speech research and how LibriVox data is used to the best of my abilities.
Just to be clear, I'm not using any of this for my PhD or thesis; this is purely out of interest and a wish to share what I know about the field with the people who have been a huge resource to it (perhaps without fully knowing it).
Overview of Speech Technology Research
First of all, the AI-generated voices everyone has come into contact with by now do not represent all that is "speech technology research". Broadly, speech technology research can be divided into automatic speech recognition (ASR) and text-to-speech (TTS), although there is more outside these two categories - for simplicity I will refer to them by their abbreviations. This research can be used for screen readers, restoring someone's lost voice and subtitling, but also for scam calls, surveillance and impersonation. I'd like to say that the good uses have outweighed the bad, but I suppose I'm quite biased.
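To make the ASR side a bit more concrete, here is a tiny sketch using the open-source "transformers" library. The model name is just one example of many publicly available checkpoints, and "my_recording.wav" is a placeholder; TTS simply goes in the opposite direction (text in, audio out).

```python
# Minimal ASR illustration: audio file in, transcript out.
# Requires ffmpeg for audio decoding; the model name is one example
# of many publicly available checkpoints.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")
print(asr("my_recording.wav")["text"])  # "my_recording.wav" is a placeholder path
```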
1) How Your Data Has Been Used
To check which corpora your recordings have been used in, first find your LibriVox reader ID (the number following https://librivox.org/reader/ on your reader page), then download the datasets and check whether your ID appears as one of the folder names - see the small sketch below. For the smaller datasets that is feasible, but for the larger ones I would like to compile a searchable file with all the IDs, and I'll try to find the time to do so over the next couple of days. The main problem is that the authors often do not provide the list of speaker IDs, and some of the datasets are several terabytes in size.
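For those comfortable with a bit of scripting, here is a rough sketch of that check. It assumes a corpus with the LibriSpeech-style folder layout, where each directory directly under a split (for example "train-clean-100") is named after a reader ID; the path and ID below are placeholders.

```python
# Check whether a given LibriVox reader ID appears in an extracted,
# LibriSpeech-style split where top-level folders are named by reader ID.
from pathlib import Path

def reader_in_split(split_dir: str, reader_id: str) -> bool:
    return any(d.is_dir() and d.name == reader_id for d in Path(split_dir).iterdir())

# Placeholder path and ID - substitute your own.
print(reader_in_split("LibriSpeech/train-clean-100", "1234"))
```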
Frequently Used Datasets
- LibriSpeech (2015) - 6500+ citations - https://www.danielpovey.com/files/2015_icassp_librispeech.pdf
This 1000 hour corpus was mostly used for ASR training. The audio was segmented into chunks of less than one minute (usually only a couple of seconds) and was not split along sentence boundaries, so segments can start and end abruptly mid-sentence.
- LJSpeech (2017) - 1000s of citations (not tracked on Google Scholar) - https://keithito.com/LJ-Speech-Dataset/
Arguably the most used dataset for TTS. Since TTS models are hard to train for multiple speakers, it features only 24 hours from a single speaker, Linda Johnson. The original author included her as a co-author in the citation, but sadly most papers citing the dataset do not. Since a big part of TTS research is the synthetic audio samples and how convincing they are, I think it's safe to say that Linda Johnson is the person with the most synthetic versions of her voice out there.
Here is an example of a recent paper synthesising her voice: https://styletts2.github.io/
- LibriTTS (2019) - 800+ citations - https://arxiv.org/abs/1904.02882
This corpus brought multiple speakers to TTS training. Here, the samples are actually split at sentence boundaries and include punctuation. The dataset contains 585 hours of speech from ~2,500 LibriVox speakers.
Here is an example of a recent paper synthesising LibriVox voices: https://speechresearch.github.io/naturalspeech3/
- LibriLight (2019) - 600+ citations - https://arxiv.org/pdf/1912.07875
Despite the name, this dataset contains far more data: over 50k hours from over 7,000 speakers. The "light" refers to the labels, since only a small subset comes with the corresponding text; the corpus is meant for training ASR models on the unlabeled (no text) data so that they can then recognize the portion that does have text.
- MLS (Multi-lingual LibriSpeech) (2020) - 400+ citations - https://arxiv.org/abs/2012.03411
Around 50k hours of speech (45k of it English) across eight languages: English, German, Dutch, French, Spanish, Italian, Portuguese and Polish.
- LibriTTS-R (2023) - 20+ citations - https://arxiv.org/abs/2305.18802
LibriTTS, but with noise removed using a speech enhancement model.
- LibriHeavy (2024) - 10+ citations - https://arxiv.org/abs/2309.08105
LibriLight, but with transcripts for all 50,000 hours of speech.
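In case you are curious what a single training example from one of these corpora looks like to a researcher, here is a small sketch using the Hugging Face "datasets" library. The dataset name, configuration and field names are what I recall for the LibriSpeech card and may differ depending on your library version; streaming avoids downloading the whole corpus up front.

```python
# Peek at one LibriSpeech example without downloading the full corpus.
# Dataset/field names ("librispeech_asr", "speaker_id", "text", "audio")
# are from memory and may need adjusting for your datasets version.
from datasets import load_dataset

ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
example = next(iter(ds))
print(example["speaker_id"])   # the LibriVox reader ID
print(example["text"][:80])    # the (uppercased) transcript
# example["audio"]["array"] holds the waveform as a NumPy array.
```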
What does this mean in practical terms? If you have recorded an English audiobook and shared it on LibriVox pre-2015, chances are good your voice has been used to help make things like Alexa, Google Home and Siri happen, and much more besides. Since LibriSpeech is so widely used, it is even possible that your voice is recognized more easily than others.
For TTS, your voice has very likely contributed to most modern TTS systems, like the one by OpenAI or the AI voice used on TikTok. There are also many openly released TTS systems these days. Most aim to do "voice cloning": you provide a sample of a few seconds and the model replicates that voice. This works much better when the voice was in the training data, so if your recordings were used in training, providing a sample of your voice to the model will produce a more realistic result. Here are some models you can try right now if you are interested (I have also put a small code sketch after the links for the more technically inclined):
- https://huggingface.co/spaces/coqui/xtts
- https://huggingface.co/spaces/styletts2/styletts2
- https://huggingface.co/spaces/parler-tts/parler_tts (this one is a bit different, here you use text to describe what the speaker should sound like)
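If you would rather run a model locally, here is a rough sketch of what voice cloning looks like with the open-source Coqui TTS package (the first link above is, as far as I know, a hosted demo of the same model). The model name and arguments are from memory and may differ between versions, and "my_voice_sample.wav" is a placeholder for a few seconds of your own speech.

```python
# Rough voice-cloning sketch with Coqui TTS (pip install TTS).
# Model name and arguments may vary between library versions.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="This is what a cloned voice can sound like.",
    speaker_wav="my_voice_sample.wav",  # a short reference recording of your voice
    language="en",
    file_path="cloned_output.wav",
)
```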
2) What are your opinions on your voices being replicated by AI?
As I said above, I don't want to push you either way - I'd simply like to hear your thoughts, whatever they are, in the replies.
3) Are there any questions that I can answer as a speech technology researcher?
Edit: I just found this post from 2011 discussing the implications of using LibriVox recordings for TTS. Well worth a read!