How Often are Librivox Books Listened To?

Comments about LibriVox? Suggestions to improve things? News?
TedL
Posts: 584
Joined: October 24th, 2022, 3:06 am
Location: Wisconsin

Post by TedL »

knotyouraveragejo wrote: February 20th, 2024, 12:54 pm We record old books. They are not everyone's cup of tea. Especially the nonfiction, much of which is only of historical interest. Keep in mind, also, that relying solely on IA data for downloads leaves out many other places/ways our recordings are available - phone apps, YouTube sites and various other online services that use our recordings, to name a few. While some of these use the IA links for all their downloads, not all of them do.
I addressed these issues in my responses this morning to other comments, so please also see them. Most of the other sites that use Librivox audiobooks seem to offer only a small number of them, and the number of their books that are actually heard is even smaller. We have enormous numbers of people arriving at our two sites (Librivox.org and the IA Librivox collection) looking for free public domain audiobooks. It's a tremendous opportunity to meet that demand, and we should not waste it.

As to nonfiction: I only read and record nonfiction, and I think a lot of the nonfiction books in the collection are unique and valuable and should be heard. But the public is unfamiliar with the names of most authors and titles, so they need help in finding books on their subject.
TriciaG
LibriVox Admin Team
Posts: 60937
Joined: June 15th, 2008, 10:30 pm
Location: Toronto, ON (but Minnesotan to age 32)

Post by TriciaG »

TedL wrote: February 21st, 2024, 4:00 am
TriciaG wrote: February 20th, 2024, 12:24 pm
#2 will indeed be fairly labor intensive, but maybe there are potential volunteers who don't wish to be readers or listeners. However, I think we would need only about two people to actually open records to edit the subjects. I'd love to have at least one experienced librarian involved.
Only LV admins (and maybe some people at Archive itself) have access to edit the data at Archive, so this would fall entirely on the admins. Not to be a downer, but I highly doubt that's going to happen. Our primary objective is making audiobooks, not making them easier to find. :hmm:
What's an Admin? Is that Management? It would be helpful if you directed the attention of all of Management to this thread, as they should have a role in this decision.

On Librivox it has apparently always been left to book coordinators to fill in the topics. This method clearly has not worked to make useful subject searches on Librivox. Standardized terms must be used on every book. We should follow the Internet Archive's lead and use Library of Congress subject headings.
Admins are generally the MCs - those whose names are purple in the forum. No one else can edit records, so even if we wanted to go this route, the task of retroactively changing keywords would fall entirely on them, on top of all the other admin/MC tasks.

And as I said before, implementing this somehow to create a LOC look-up on the New Project Template Generator for future projects (with my limited programming knowledge, this appears the best way to implement something like this, rather than a HUGE list similar to our genre one) would be a large programming challenge, and entirely dependent on whoever is capable of and volunteers to do the programming.

Then there would be getting the BCs/soloists to actually use the tool effectively. You can lead a horse to water...

What you're asking for is a MAJOR change, that would require major time and effort on the development/programming side, a herculean effort to retrofit, some measure of policing of new projects to be sure it's done correctly . . . and a debatable amount of benefit relative to all that effort.
School fiction: David Blaize
America Exploration: The First Four Voyages of Amerigo Vespucci
Serial novel: The Wandering Jew
Medieval England meets Civil War Americans: Centuries Apart
TedL
Posts: 584
Joined: October 24th, 2022, 3:06 am
Location: Wisconsin

Post by TedL »

TriciaG wrote: February 21st, 2024, 7:58 am
TedL wrote: February 21st, 2024, 4:00 am
TriciaG wrote: February 20th, 2024, 12:24 pm

Only LV admins (and maybe some people at Archive itself) have access to edit the data at Archive, so this would fall entirely on the admins. Not to be a downer, but I highly doubt that's going to happen. Our primary objective is making audiobooks, not making them easier to find. :hmm:
What's an Admin? Is that Management? It would be helpful if you directed the attention of all of Management to this thread, as they should have a role in this decision.

On Librivox it has apparently always been left to book coordinators to fill in the topics. This method clearly has not worked to make useful subject searches on Librivox. Standardized terms must be used on every book. We should follow the Internet Archive's lead and use Library of Congress subject headings.
Admins are generally the MCs - those whose names are purple in the forum. No one else can edit records, so even if we wanted to go this route, the task of retroactively changing keywords would fall entirely on them, on top of all the other admin/MC tasks.

And as I said before, implementing this somehow to create a LOC look-up on the New Project Template Generator for future projects (with my limited programming knowledge, this appears the best way to implement something like this, rather than a HUGE list similar to our genre one) would be a large programming challenge, and entirely dependent on whoever is capable of and volunteers to do the programming.

Then there would be getting the BCs/soloists to actually use the tool effectively. You can lead a horse to water...

What you're asking for is a MAJOR change, that would require major time and effort on the development/programming side, a herculean effort to retrofit, some measure of policing of new projects to be sure it's done correctly . . . and a debatable amount of benefit relative to all that effort.
I envision this as being easier and simpler than you see it. As I write up my proposed course of action, I'll keep your objections in mind and try to address each as specifically as I can. I hope to release that paper tomorrow.

Regards,
TedL
Posts: 584
Joined: October 24th, 2022, 3:06 am
Location: Wisconsin

Post by TedL »

How to Make Subject Searches Work

Go to the Librivox Free Audiobook Collection in Internet Archive (https://archive.org/details/librivoxaudio) and hover your mouse over a book cover. You can see the line called "Topics". The proposal for changes described in this paper is about updating the words that Librivox inserts in the Topics line.


What's wrong with the existing subject search system?

You can already carry out a subject search at this site and at Librivox.org. Putting a search term into the search line causes it to look through all the text in a record, including the book summary. There are no standardized search terms. If you want to search for all books in the collection about automobiles, for example, you would have to do searches for auto, autos, car, cars, automobile, and automobiles, plus any other terms used at the beginning of the 20th century for autos.

Your search results will include all books where the search term was mentioned in the summary text, so search results include many books where your subject is really marginal to the book. A lot of related books are omitted from search results because the subject term didn't appear in the book summary.
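To make the problem concrete, here is a minimal sketch of what a standardized term buys us. The heading "Automobiles", the list of variants, and the two book titles are all made up for illustration; they are not taken from any real catalog record.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from free-text variants to one standardized heading.
# Both the variants and the heading "Automobiles" are illustrative assumptions.
VARIANTS_TO_HEADING: Dict[str, str] = {
    "auto": "Automobiles",
    "autos": "Automobiles",
    "car": "Automobiles",
    "cars": "Automobiles",
    "automobile": "Automobiles",
    "automobiles": "Automobiles",
}


def standard_heading(term: str) -> Optional[str]:
    """Return the standardized heading for a free-text term, or None if unknown."""
    return VARIANTS_TO_HEADING.get(term.strip().lower())


def books_with_heading(books: Dict[str, List[str]], heading: str) -> List[str]:
    """List titles whose Topics contain exactly the given heading."""
    return sorted(title for title, topics in books.items() if heading in topics)


# Made-up records: title -> Topics list.
EXAMPLE_BOOKS = {
    "Hypothetical Motoring Memoir": ["audiobooks", "librivox", "Automobiles"],
    "Hypothetical Sea Voyage": ["audiobooks", "librivox", "Voyages and travels"],
}
```

With one agreed heading, a single lookup such as `books_with_heading(EXAMPLE_BOOKS, "Automobiles")` replaces six separate free-text searches, and it only returns books that were deliberately tagged with that subject.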


What terms are in "Topics" now, and how did they get there?

The terms "audiobooks" and "librivox" seem to be in Topics in every book entry. If we confirm those have a necessary function, we should continue to add them to each book.

Otherwise, the terms that were added to topics are usually one-word terms added by the Book Coordinators. They used words they believed described the major themes of the book. The number of topics entered varies a lot from book to book, with many books having no topics other than 'audiobooks' and 'librivox'. Internet Archive (IA), which governs the format of these records, says we should use a maximum of 10 topics in a book.


Why are you suggesting changes?

The Internet Archive has built in the capability to do searches from terms in the Topics field. To see how books use this capability, look at these two examples:

https://archive.org/details/101essaysthatwil0000wies

https://archive.org/details/sonofneptune0000rick_f7t7

As you can see, there are multi-word 'topics' in each entry. Click any topic, and Internet Archive does a search of all the books on site and presents you with the results. You then filter the results by selecting Librivox in the Collection box in the left column. To be included in those search results, a book must have the same identical term in its Topics field.

Just hover your mouse over the book covers in the results, and you'll see they have that topic. You can click on any topic while hovering (to do another search), without even opening a book page.
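For anyone who wants to skip the click-then-filter step, the same search can probably be written as one link. This is only a sketch: I am assuming that the Topics line is stored in a field named `subject`, that the search page accepts a `query` parameter, and that `librivoxaudio` (taken from the collection address above) is the collection name to filter on. The function name is my own.

```python
from urllib.parse import urlencode


def topic_search_url(topic: str, collection: str = "librivoxaudio") -> str:
    """Build a search link for one exact topic, limited to one collection.

    Assumes the field names `subject` and `collection`; the topic itself
    should not contain double quotes, since it is wrapped in them here.
    """
    query = f'subject:"{topic}" AND collection:{collection}'
    return "https://archive.org/search?" + urlencode({"query": query})


if __name__ == "__main__":
    print(topic_search_url("Evangelists"))
```

Because the topic is wrapped in quotes, only books carrying that identical term would match, which is the whole point of standardizing the headings.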

The search results in this example sometimes have hundreds of books, because it is searching within a collection of 3.9 million books (the collection is "Texts to Borrow"). Our collection of Librivox audiobooks presently has 19,000 books, or one-half of one percent that size. So we can limit our subject terms to broader topics than IA does, and therefore use far fewer subject terms.


How will Users know about these new standardized 'subject headings'?

First, they can find a similar book through browsing or a book title search, then click on the topic that fits their need.

Second, we should provide a complete list of our subject headings. The front page of the list would have basic subject categories. Clicking on a basic subject would lead to all the subject headings in that category that we use.


How will we implement putting topics like this into all of our 19,000 books?

To do an effective search from a topic in a book, or for a book to appear in search results, all books on a particular subject need to use the same identical term. There are three systems (that I'm aware of) for standardized 'subject headings' used for books. The Library of Congress and Dewey Decimal System are used by libraries, and BISAC headings (Book Industry Study Group) are used by booksellers.

I suggest that we should select the Library of Congress (LOC) system, for two reasons.

First, it is the main system used in other collections at IA, so it will be easy for users to switch back and forth from Librivox to other IA collections.

Second, the system I'm proposing for 'classifying' our books (assigning subject headings) involves looking up book titles and using their existing LOC subject headings, rather than using the LOC reference tools to figure them out. This will be a big time-saver, and will make it possible for non-librarian volunteers to carry out this task. Volunteers would find subject headings by doing book title searches on Internet Archive, Worldcat.org, or the LOC Online Catalog at catalog.loc.gov. Occasionally they will refer to the Library of Congress Subject Headings manual, available online at IA.

We should have a 'team leader' for the volunteers to oversee the assignment of subject headings.

Volunteers will forward the book title and new subject headings to an 'Admin' person at Librivox with authority to edit book records. The Admin person will open an existing book record, copy and paste the new subject headings in the Topics field, and close the book record.

For new books, a volunteer will find and add the subject headings while the book is still being recorded. When the book is finished the Admin will do their usual routine to add the book to the catalog, with no additional steps.
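The hand-off from volunteer to Admin could be as simple as one line of text per book. Here is a rough sketch of how such a line might be prepared and checked. The two fixed terms and the limit of 10 come from the sections above; counting the fixed terms toward the limit, the semicolon separator, and the function names are my assumptions.

```python
import csv
import io
from typing import List

MAX_TOPICS = 10  # limit attributed to Internet Archive above
FIXED_TOPICS = ["audiobooks", "librivox"]  # terms that seem to be on every entry


def build_topics(headings: List[str]) -> List[str]:
    """Combine the fixed terms with new headings, dropping blanks and repeats."""
    topics = list(FIXED_TOPICS)
    seen = {t.lower() for t in topics}
    for heading in headings:
        cleaned = heading.strip()
        if cleaned and cleaned.lower() not in seen:
            topics.append(cleaned)
            seen.add(cleaned.lower())
    if len(topics) > MAX_TOPICS:
        raise ValueError(f"{len(topics)} topics exceeds the limit of {MAX_TOPICS}")
    return topics


def handoff_row(title: str, headings: List[str]) -> str:
    """Return one CSV line (title, topics) for the Admin to copy and paste."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([title, "; ".join(build_topics(headings))])
    return buffer.getvalue()
```

The Admin would then only need to open the record and paste the second column into the Topics field.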


How much work will it be?

My guess is that volunteers will each be able to do 5 to 10 books per hour. I hope that an Admin person can tell me how long their task would take. I can't predict how long it will take to finish the job until we know how many volunteers we have.
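To put rough numbers on that guess: at 5 to 10 books per hour, 19,000 books works out to somewhere between 1,900 and 3,800 volunteer hours for the lookup step alone. The sketch below does that arithmetic. The volunteer count and weekly hours in the example are placeholders, not estimates.

```python
def total_hours(books: int, books_per_hour: float) -> float:
    """Volunteer hours needed to look up headings for every book."""
    return books / books_per_hour


def weeks_to_finish(books: int, books_per_hour: float,
                    volunteers: int, hours_per_week: float) -> float:
    """Calendar weeks needed if every volunteer works the same weekly hours."""
    return total_hours(books, books_per_hour) / (volunteers * hours_per_week)


if __name__ == "__main__":
    BOOKS = 19_000  # current size of the Librivox collection, from above
    for rate in (5, 10):  # my guessed range of books per hour
        print(rate, "books/hour:", total_hours(BOOKS, rate), "hours")
    # Placeholder staffing: 10 volunteers giving 2 hours a week each.
    print(weeks_to_finish(BOOKS, 5, volunteers=10, hours_per_week=2), "weeks")
```

This leaves out the Admin's time per record, which is the figure I still need from an Admin.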


Where will we get volunteers?
Over the life of Librivox, I understand that we have had 13,000 volunteer readers. But many drop off because they find the process of recording difficult. I suspect there are people who would be willing to help Librivox in this other way instead of recording. Librivox would recruit volunteers as usual on the website and YouTube channel.


You said we would change searches at IA, so that instead of searching the full book catalog, our searches would take place within the Librivox audiobook collection. Status?

IA indicates that the searcher should use the "Collection" box in the left margin to limit search results to Librivox audiobooks.


What about the Librivox.org site?

This needs more study, and probably wouldn't be addressed immediately. Change here would require modifications to the Librivox.org website. There is much less traffic on this site than on the IA site, although at 1.7 million views per month, traffic is still outstanding.

I suggest that instead of using the Genre/Subject list, we would direct users to our list of subject headings, mentioned above. Users could put the subject heading into the Librivox.org search field, or simply access our audiobooks through the IA site.

Currently the Librivox.org individual book records contain a line for 'Genre(s)', not for topics. Is it feasible and desirable for Librivox.org to have the same book records as IA?


Do you think this is worth the effort?

Definitely yes! The two Librivox sites together have over 20 million views per month, a truly phenomenal number. ahrefs says the 100th largest U.S. website, FedEx, has the same volume of traffic. But despite all those visitors, more than 90% of our books get less than 1 view per month, and are heard even less than that. This proposed subject search system, already in use for most Internet Archive books, will allow this huge crowd of visitors searching for audiobooks to easily find everything we offer on their favorite subjects.

Many Librivox volunteers believe that private websites offering Librivox books ensure that our books are heard. I think not. I found that a handful of Librivox books are heavily used on some of the biggest audiobook sites. But very few of our books even appear in their catalogs. Traffic on most of the audiobook sites that offer our books is insignificant.

The most important thing we can do for Librivox right now is to make our books more accessible to users with this upgrade.
Rapunzelina
LibriVox Admin Team
Posts: 17930
Joined: November 15th, 2011, 3:47 am

Post by Rapunzelina »

I have no idea how statistics work, so this may be a dumb question, but could the views come from the bots that are lurking in the forums and everywhere, and/or requests from different APIs, and not necessarily from human visitors?
redrun
LibriVox Admin Team
Posts: 3076
Joined: August 11th, 2022, 8:32 pm

Post by redrun »

Short answer is: those bots definitely wouldn't show up to skew the estimates. But the estimates are very broad, and are designed for comparative analysis.

These estimates are based solely on where we rank in the results for popular searches. People browsing LibriVox.org directly, or getting here from other sites, aren't included in the limited data. (That would be why Ted included statistics for sites that link to us, in another post.)
There are other bots that might try to inflate the number of searches for certain terms, but I can't speculate on how common that is, who might engage in it, or whether it would work in our "favor" in terms of these stats.
These are very much ball-park figures. The numbers are big, we can roughly compare them, and that's all they're for.

(Yes, this is based on what 'ahrefs' says about its own data. They're less direct about it than that first site quoted, but it seems to be about the same data and estimation methodology.)
TriciaG
LibriVox Admin Team
Posts: 60937
Joined: June 15th, 2008, 10:30 pm
Location: Toronto, ON (but Minnesotan to age 32)

Post by TriciaG »

Second, the system I'm proposing for 'classifying' our books (assigning subject headings) involves looking up book titles and using their existing LOC subject headings, rather than using the LOC reference tools to figure them out. This will be a big time-saver, and will make it possible for non-librarian volunteers to carry out this task. Volunteers would find subject headings by doing book title searches on Internet Archive, Worldcat.org, or the LOC Online Catalog at catalog.loc.gov. Occasionally they will refer to the Library of Congress Subject Headings manual, available online at IA.
Out of curiosity, I tried to find LOC subject headings for my next project, "From Baca to Beulah". Here are the LOC and Worldcat entries:
https://catalog.loc.gov/vwebv/search?searchArg=from+baca+to+beulah&searchCode=GKEY%5E*&searchType=0&recCount=25&sk=en_US
https://search.worldcat.org/title/433945245

I'm looking there and at Archive, and I don't see LOC subjects listed. Am I missing something, or do not all books have LOC subjects?
redrun
LibriVox Admin Team
Posts: 3076
Joined: August 11th, 2022, 8:32 pm

Post by redrun »

Our Archive books don't have LOC tags, but they could if our BCs entered them (and/or we went back and added them to past books). The examples linked earlier are Archive books that do have the "proper" tags, for comparison:
https://archive.org/details/101essaysthatwil0000wies
TriciaG
LibriVox Admin Team
Posts: 60937
Joined: June 15th, 2008, 10:30 pm
Location: Toronto, ON (but Minnesotan to age 32)

Post by TriciaG »

Yes, I understand that. The two instances of From Baca to Beulah texts on IA don't have any subject tags at all, and as you say, rely on the uploading entity to add them. So in general, that's not the greatest source to find them.

I'm wondering what they would be for this book, but I don't find them anywhere. I guess this is one of the cases where one would have to go through the index and assign them oneself. I wonder what percentage of our projects would require that.
redrun
LibriVox Admin Team
Posts: 3076
Joined: August 11th, 2022, 8:32 pm

Post by redrun »

Ah, I see what you mean. Not all books seem to have tags at LOC, so also not at Archive. :hmm:
TedL
Posts: 584
Joined: October 24th, 2022, 3:06 am
Location: Wisconsin

Post by TedL »

TriciaG wrote: February 23rd, 2024, 3:06 pm
Second, the system I'm proposing for 'classifying' our books (assigning subject headings) involves looking up book titles and using their existing LOC subject headings, rather than using the LOC reference tools to figure them out. This will be a big time-saver, and will make it possible for non-librarian volunteers to carry out this task. Volunteers would find subject headings by doing book title searches on Internet Archive, Worldcat.org, or the LOC Online Catalog at catalog.loc.gov. Occasionally they will refer to the Library of Congress Subject Headings manual, available online at IA.
Out of curiosity, I tried to find LOC subject headings for my next project, "From Baca to Beulah". Here are the LOC and Worldcat entries:
https://catalog.loc.gov/vwebv/search?searchArg=from+baca+to+beulah&searchCode=GKEY%5E*&searchType=0&recCount=25&sk=en_US
https://search.worldcat.org/title/433945245

I'm looking there and at Archive, and I don't see LOC subjects listed. Am I missing something, or do not all books have LOC subjects?
Not all book entries show LOC subject headings, even in the LOC catalog or in WorldCat. It is necessary to check both. In the case of "From Baca to Beulah", no subject headings are listed at LOC.

At WorldCat, the search turned up 7 copies. See "View all Formats & Editions" in the bottom right corner of the box, and click on "Formats and Editions".

Some of the 7 copies contain subject headings when you click "Show more information". I found:

Autobiographies
Biographies
Evangelists
Evangelists United States Biography
Smith, Jennie 1842-1924
autobiographies (literary works)


The original search in WorldCat also yielded two sequels to "From Baca to Beulah", so I opened those to see if I missed any useful subject headings. In addition to the above, I found:

Evangelists United States Personal narratives
Personal narratives


Which of these would be best to use? I'm not a librarian, but would note that the purpose of putting these subject headings in the "Topics" line is almost entirely to enable searches. In the Internet Archive, if you search with broad terms like "Biographies" you can get search results in the thousands, far too many books to look through. The Librivox collection is much smaller, so broad search terms are generally workable. In this case we can use up to 10 search terms, according to Internet Archive, so we could use all of these if we wished, and users could search either broad terms or narrow ones, like "Evangelists United States Personal narratives".

I suggest, off the top of my head, that we try to get one to three broad terms, and at least one narrower term, for each book.
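As a rough illustration of that rule of thumb, a team leader could run a quick check like the one below before headings go to an Admin. The headings are the ones found for "From Baca to Beulah" above, but which of them count as broad and which as narrow is my own guess, and the function name is made up.

```python
from typing import List


def check_heading_mix(broad: List[str], narrow: List[str],
                      max_topics: int = 10) -> List[str]:
    """Return a list of problems; an empty list means the mix looks fine."""
    problems = []
    if not 1 <= len(broad) <= 3:
        problems.append("expected one to three broad terms")
    if len(narrow) < 1:
        problems.append("expected at least one narrower term")
    if len(broad) + len(narrow) > max_topics:
        problems.append("more than %d topics in total" % max_topics)
    return problems


# Assumed split of the headings listed above into broad and narrow.
BROAD = ["Biographies", "Autobiographies", "Evangelists"]
NARROW = [
    "Evangelists United States Biography",
    "Evangelists United States Personal narratives",
]

if __name__ == "__main__":
    print(check_heading_mix(BROAD, NARROW))
```

The check does not decide which headings are best. It only flags a book that ends up with nothing but broad terms, or nothing but narrow ones.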

I'm sure we will find some books that don't have subject headings at either WorldCat or LOC. Internet Archive might have them. If not, we would probably have to classify them ourselves.

Feb 25 Update

I have found since writing the above that there is a faster way to find subject headings. Project Gutenberg (gutenberg.org), a longstanding collection of public domain books, where most of the books on Librivox originated, now has a set of subject headings on most of its books. A simple book title search will reveal the subject headings and one or two "LOC Classes" (Library of Congress broad categories) for most of its books. I will try searching a random group as a sample to see what percentage of entries on Gutenberg already contain the subject headings we need.

I also found a second site that's very helpful. A book title search on Open Library (openlibrary.org), which is part of the Internet Archive, provides links to the main locations of the book all in one place. Open the first title that is available in the library (many entries are not available), and scroll down to "ID Numbers" inside the box of "Book Details". Internet Archive, LCCN (Library of Congress) and OCLC Worldcat links for that book title are there.
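For volunteers comfortable with a little scripting, the Open Library title lookup could probably be done in bulk rather than one page at a time. The sketch below only builds the lookup address; it does not fetch anything. I am assuming the site's `search.json` endpoint with its `title`, `fields` and `limit` parameters, and assuming that the fields `subject`, `lccn`, `oclc` and `ia` are the ones that carry the subject headings and ID numbers described above.

```python
from urllib.parse import urlencode


def open_library_search_url(title: str, limit: int = 5) -> str:
    """Build a title-search address that asks only for subjects and ID numbers.

    The field names requested here are assumptions, not confirmed in this thread.
    """
    params = {
        "title": title,
        "fields": "title,subject,lccn,oclc,ia",
        "limit": limit,
    }
    return "https://openlibrary.org/search.json?" + urlencode(params)


if __name__ == "__main__":
    print(open_library_search_url("From Baca to Beulah"))
```

Opening the printed address in a browser should show whatever the site returns for that title, which a volunteer could then compare against the book's page.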

Ted Lienhart
Last edited by TedL on February 25th, 2024, 1:32 pm, edited 1 time in total.
TedL
Posts: 584
Joined: October 24th, 2022, 3:06 am
Location: Wisconsin

Post by TedL »

redrun wrote: February 23rd, 2024, 4:18 pm Ah, I see what you mean. Not all books seem to have tags at LOC, so also not at Archive. :hmm:
I haven't done much research at Internet Archive, but I suspect most of their books were classified by librarians at whichever library contributed them. I believe this is what happened at WorldCat too. I don't think the Library of Congress classifies all their books. That's my guess - I wish a librarian would step into this discussion and clear this up.
TheBanjo
Posts: 1327
Joined: January 23rd, 2021, 8:19 pm
Location: Melbourne, Australia

Post by TheBanjo »

I suspect AI is about to render the kinds of approaches to 'discoverability' being discussed here completely irrelevant.

The new Gemini 1.5, released only a few days ago by Google DeepMind, when fed a one-frame-per-second version of a 44-minute Buster Keaton movie, is able to perform a task like "Find the moment when a piece of paper is removed from the person's pocket, and tell me some key information on it, with the timecode" (see https://www.youtube.com/watch?v=Cs6pe8o7XY8 at around the 10:00 minute mark).

You don't have to be Einstein to see that it's quite likely that in five years' time the public will be able to discover which of 20,000 or 30,000 or 40,000 audiobooks might be a good match for freeform search criteria using what, by that time, will be relatively old hat AI.

Things are moving extremely fast in this space (it's almost impossible to overstate how fast) and I, for one, would be extremely reluctant to invest any of my time in any of the kinds of approaches that have been discussed above. They have all, in their day, represented "the state of the art" in terms of information access technology, but then the Stone Age once represented "the state of the art" when it came to making knives and axes.
TedL
Posts: 584
Joined: October 24th, 2022, 3:06 am
Location: Wisconsin

Post by TedL »

TheBanjo wrote: February 24th, 2024, 4:00 pm I suspect AI is about to render the kinds of approaches to 'discoverability' being discussed here completely irrelevant.

The new Gemini 1.5, released only a few days ago by Google DeepMind, when fed a one-frame-per-second version of a 44-minute Buster Keaton movie, is able to perform a task like "Find the moment when a piece of paper is removed from the person's pocket, and tell me some key information on it, with the timecode" (see https://www.youtube.com/watch?v=Cs6pe8o7XY8 at around the 10:00 minute mark).

You don't have to be Einstein to see that it's quite likely that in five years' time the public will be able to discover which of 20,000 or 30,000 or 40,000 audiobooks might be a good match for freeform search criteria using what, by that time, will be relatively old hat AI.

Things are moving extremely fast in this space (it's almost impossible to overstate how fast) and I, for one, would be extremely reluctant to invest any of my time in any of the kinds of approaches that have been discussed above. They have all, in their day, represented "the state of the art" in terms of information access technology, but then the Stone Age once represented "the state of the art" when it came to making knives and axes.
I don't know enough about AI to debate you on that subject. But let me make a couple of points:

1. A database can only carry out a search and analysis of the data it can see. Our audiobook database only has a little metadata about the book, not the book itself. In most cases, even a much smarter computer app would not be able to figure out the subjects of our books because it can't see the data it would need to analyze.

2. 99% of the people working on Librivox will have nothing to do with this subject heading project. The people working on it would be volunteers who wish to do so. The only exception would be a few "admins" assigned to enter the new subject headings into old book records. Book coordinators won't be involved at all. The subject headings would be entered into "books in progress" by the subject-heading team, so when the last recordings are finished, the book is ready to go to the "admin". Or at least, that's the way I imagine the process.
TheBanjo
Posts: 1327
Joined: January 23rd, 2021, 8:19 pm
Location: Melbourne, Australia

Post by TheBanjo »

TedL wrote: February 25th, 2024, 1:42 pm
I don't know enough about AI to debate you on that subject. But let me make a couple of points:

1. A database can only carry out a search and analysis of the data it can see. Our audiobook database only has a little metadata about the book, not the book itself. In most cases, even a much smarter computer app would not be able to figure out the subjects of our books because it can't see the data it would need to analyze.

2. 99% of the people working on Librivox will have nothing to do with this subject heading project. The people working on it would be volunteers who wish to do so. The only exception would be a few "admins" assigned to enter the new subject headings into old book records. Book coordinators won't be involved at all. The subject headings would be entered into "books in progress" by the subject-heading team, so when the last recordings are finished, the book is ready to go to the "admin". Or at least, that's the way I imagine the process.
Thanks Ted.

I guess my point here is that all of us discussing this topic now DO need to get up to speed on the rapidly emerging capabilities of AI. Have you viewed the clip I referenced?

It seems entirely plausible to me that within a few years it will be not only technologically feasible (we've probably reached that point already, with Gemini 1.5) but also economically feasible (ie, via access to what by then will hopefully be as much a free service as is Google search today) to present an AI with all the audiobooks in our collection, a set of guidelines on how to catalogue a book (eg, Library of Congress cataloguing principles) and then to tell the AI to spit out Library of Congress classifications for each audiobook, based just on the audio recording with no access to the underlying text. We could then feed that AI generated classification data into our audiobooks database. In effect, the AI would be doing what you're asking human volunteers to do — only probably much more reliably.

If that is already, or will soon be, possible, there is no point at all in going down the path you are proposing (with whose goals I am most sympathetic).

I can appreciate that this suggestion comes out of left field as it were, but I am certainly making it in all seriousness. I have a son who is an academic who works in AI who sends me clips like the one I referenced in my previous post, and this is the only reason I have any familiarity with new developments in this field.

I really don't see how it's possible to reach a rational decision on how, or whether, to proceed with your proposal as it stands without giving some consideration to the possibility that it may soon be rendered completely moot by AI.

I could, of course, be wrong. I'm not here as an advocate of AI per se. More, I'm an advocate of considered and thoughtful decision-making, and I raise this possibility in that spirit.