I need help getting the whole catalog for torrenting

daxm
Posts: 15
Joined: July 5th, 2007, 10:23 pm

Post by daxm »

Hello all. I've recently been working on getting the whole catalog "cataloged" so that I can figure out if I have the whole catalog downloaded for torrenting. (Yes, I've seen some posts about torrenting before but it seems nothing has stuck yet. Maybe this time around it will! :D )

My intention is to offer torrents for each of the completed works in their "whole book" zip files. So far I have 23GB downloaded. I used an automated approach to get what I have now, but because the zip file names don't match their respective book titles, I'm having difficulty figuring out if I have it all.

Anyone have any suggestions or hints on how I can figure this out?

NOTE: My automated approach started with crawling the librivox.org website for any files ending in "64kb_mp3.zip", since it appears all the "whole book" zip files end that way. I've noticed gaps in what I have, though, and I doubt that I actually have it all.
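A minimal sketch of the suffix-based link filter described above, using only the Python standard library. The sample page fragment and its URLs are hypothetical and only illustrate the idea; the real page markup may differ.

```python
from html.parser import HTMLParser

ZIP_SUFFIX = "64kb_mp3.zip"  # suffix mentioned in the post


class ZipLinkCollector(HTMLParser):
    """Collect href values of <a> tags that end with the whole-book zip suffix."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(ZIP_SUFFIX):
                self.links.append(value)


def find_zip_links(html_text):
    """Return every matching link found in one page of HTML."""
    collector = ZipLinkCollector()
    collector.feed(html_text)
    return collector.links


# Hypothetical page fragment, for illustration only.
sample = """
<a href="http://www.archive.org/download/treasureisland/treasureisland_librivox_64kb_mp3.zip">zip</a>
<a href="http://example.org/chapter01.mp3">chapter</a>
"""
print(find_zip_links(sample))
```

A filter like this only finds links that are actually present on the pages crawled, so any page the crawler never reaches shows up as a gap.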

What would be nice is a plain list of all the links to all the "whole book" zip files of all the completed works.

My true hope and goal in this project is to offload some of the bandwidth from librivox by setting up a torrent "ring" (yes, I'll be back asking people to seed off of me). I'm sure there are several people in the world who would like to listen to these books but don't have an Internet connection stable enough to be able to download 250MB files in one sitting. With torrenting they will be able to get small chunks and, over time, they will have the whole thing!

Any and all help is appreciated.
jimmowatt
Posts: 1532
Joined: January 13th, 2006, 8:44 am
Location: Cambridge UK
Contact:

Post by jimmowatt »

I would have thought you'd start by going to the catalog page

http://librivox.org/newcatalog/visitor_advanced.php

and selecting the option "browse entire catalog".
That lists everything in the catalog.
Jim Mowatt (http://librivox.org/newcatalog/people_public.php?peopleid=75) - Historyzine - The History Podcast (http://historyzine.com)
Cori
Posts: 12124
Joined: November 22nd, 2005, 10:22 am
Location: Britain
Contact:

Post by Cori »

LibriVox files are stored at archive.org, so there's not really a bandwidth concern as such. However, the more ways to get files to people the better, for sure -- I've downloaded a few LV books this way, and it is easy.

I don't know if those torrents are still alive, but when I last looked (many weeks ago) there were a fair number still. Can you support those as well as starting torrents for new books..?
Last edited by Cori on July 19th, 2007, 12:21 am, edited 1 time in total.
There's honestly no such thing as a stupid question -- but I'm afraid I can't rule out giving a stupid answer : : To Posterity and Beyond!
Cori
Posts: 12124
Joined: November 22nd, 2005, 10:22 am
Location: Britain
Contact:

Post by Cori »

And, thinking about it, the zip files are auto-generated (and thus auto-named) by archive.org ... we upload the individual 128Kb-rate files to them, and all the other formats spring from those.

So, could you let us know where there are problem filenames, so we can look into it..?
daxm
Posts: 15
Joined: July 5th, 2007, 10:23 pm

Post by daxm »

jimmowatt wrote:I would have thought you'd start by going to the catalog page

http://librivox.org/newcatalog/visitor_advanced.php

and selecting the option "browse entire catalog".
That lists everything in the catalog.
This is how I started getting the files in the first place. There are currently 749 entries in the catalog yet I only have 300 .zip files. The reason I'm trying to automate them is so that I can "browse the catalog" and easily find the new entries. Maybe not all catalog entries have a .zip file of the whole book on their page?
daxm
Posts: 15
Joined: July 5th, 2007, 10:23 pm

Post by daxm »

Cori wrote:And, thinking about it, the zip files are auto-generated (and thus auto-named) by archive.org ... we upload the individual 128Kb-rate files to them, and all the other formats spring from those.

So, could you let us know where there are problem filenames, so we can look into it..?
The filename issue stems from simple things: the directory name of a completed work doesn't line up with the work's name, nor with the archive.org name. Some files start with "the" or "a" while the directory (or other related files) doesn't. Some use underscores in their names, others use dashes, and still others run the name all together. (A good example is treasure-island-by-robert-louis-stevenson.xml versus treasureisland_librivox_64kb_mp3.zip: between them you see dashes, underscores, and a run-together title.)

I'm not complaining (or at least not trying to complain). It just makes it hard for me to find and align the correct information when there seems to be no standard for naming.

Again, the files could be random strings of alphanumerics for all I care IF I could come up with a systematic and automated way of checking for new entries and then downloading the correct files. I'm going to be spending plenty of time setting up and maintaining the torrent server so I need all the help I can get on this end. :)
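One way to cope with the naming mismatch is to reduce both names to a bare key before comparing them. The sketch below is an assumption about what such a normalizer could look like, not an established LibriVox or archive.org rule; the suffix list and the prefix-matching heuristic are illustrative choices.

```python
import re

LEADING_ARTICLES = ("the", "a", "an")
# Suffixes seen in the example above; extend as needed.
KNOWN_SUFFIXES = ("_librivox_64kb_mp3.zip", ".xml")


def normalize(name):
    """Reduce a file or directory name to a bare lowercase key."""
    name = name.lower()
    for suffix in KNOWN_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    # Split on dashes, underscores, and whitespace.
    words = [word for word in re.split(r"[-_\s]+", name) if word]
    # Drop a leading article only when it stands as its own word.
    if len(words) > 1 and words[0] in LEADING_ARTICLES:
        words = words[1:]
    return "".join(words)


def probably_same_work(catalog_name, archive_name):
    """Heuristic: one normalized key is a prefix of the other."""
    catalog_key = normalize(catalog_name)
    archive_key = normalize(archive_name)
    return catalog_key.startswith(archive_key) or archive_key.startswith(catalog_key)


print(normalize("treasure-island-by-robert-louis-stevenson.xml"))
print(normalize("treasureisland_librivox_64kb_mp3.zip"))
print(probably_same_work(
    "treasure-island-by-robert-louis-stevenson.xml",
    "treasureisland_librivox_64kb_mp3.zip",
))
```

Prefix matching can produce false positives for short titles, and a run-together name that begins with an article is not stripped, so the results are candidates to check by hand rather than confirmed matches.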
Cori
Posts: 12124
Joined: November 22nd, 2005, 10:22 am
Location: Britain
Contact:

Post by Cori »

Ah, I think that's okay (for us, anyway, though annoying for you.)

The underscore version, which includes the author's name, is our catalogue page -- nice for those to be as descriptive as possible ... the archive.org version tends to be shorter to keep it simple. Usually they're very similar, but not necessarily identical.

I don't know if there's any way to find what you want from the archive.org catalogue, but I think that might be a better way of doing it than handling the mismatch between the naming systems.

http://www.archive.org/details/librivoxaudio

All our catalogue entries should have a zip file on their page ... it's one of the final checks that's done.
knotyouraveragejo
LibriVox Admin Team
Posts: 22127
Joined: November 18th, 2006, 4:37 pm

Post by knotyouraveragejo »

daxm wrote:
This is how I started getting the files in the first place. There are currently 749 entries in the catalog yet I only have 300 .zip files. The reason I'm trying to automate them is so that I can "browse the catalog" and easily find the new entries. Maybe not all catalog entries have a .zip file of the whole book on their page?
daxm,

If you search archive.org for "librivox zip" you will get 789 hits which presumably will include all the cataloged audiobooks (plus a few extraneous hits for CD covers, podcasts, etc.) There is a link to the zip file for each book on its archive.org detail page. Hopefully this will help you find what you are missing.
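Once a list of expected zip names has been gathered from the search results, finding what is missing is a set difference against the local download folder. This is a rough sketch; the file names in the example are hypothetical, and `local_zip_names` assumes all zips sit directly in one directory.

```python
from pathlib import Path


def local_zip_names(download_dir):
    """Return the names of all .zip files directly inside download_dir."""
    return {path.name for path in Path(download_dir).glob("*.zip")}


def missing_zips(expected_names, downloaded_names):
    """Return the expected zip names that have not been downloaded yet."""
    return sorted(set(expected_names) - set(downloaded_names))


# Hypothetical names, for illustration only.
expected = [
    "treasureisland_librivox_64kb_mp3.zip",
    "examplebook_librivox_64kb_mp3.zip",
]
downloaded = {"treasureisland_librivox_64kb_mp3.zip"}
print(missing_zips(expected, downloaded))
```

In practice `downloaded` would come from `local_zip_names("/path/to/downloads")`, where the path is whatever folder holds the zips.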
jimmowatt
Posts: 1532
Joined: January 13th, 2006, 8:44 am
Location: Cambridge UK
Contact:

Post by jimmowatt »

One way of checking for new entries would be to check the new releases feed:

http://librivox.org/newcatalog/NewReleases.xml
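A hedged sketch of checking such a feed for entries not seen before. It assumes the feed is ordinary RSS 2.0 with `item`, `title`, and `link` elements, which the thread does not confirm; the sample feed text is made up for illustration.

```python
import xml.etree.ElementTree as ET


def new_release_links(feed_xml, seen_links):
    """Return (title, link) pairs for feed items whose link is not in seen_links."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if link and link not in seen_links:
            fresh.append((title, link))
    return fresh


# Hypothetical feed content; the real feed layout may differ.
sample_feed = """<rss version="2.0"><channel>
<title>New Releases</title>
<item><title>Book One</title><link>http://example.org/book-one</link></item>
<item><title>Book Two</title><link>http://example.org/book-two</link></item>
</channel></rss>"""

already_seen = {"http://example.org/book-one"}
print(new_release_links(sample_feed, already_seen))
```

Keeping the set of seen links in a small file between runs would let a scheduled script pick up only the new entries.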

Also, if your script searches the catalog source page for text it could look for http://www.archive.org/download/ and then the next thing that comes up is the name of the zip file.
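The text search suggested above could look roughly like this. The download prefix comes from the post; the pattern for what follows it and the sample page source are assumptions made for illustration.

```python
import re

DOWNLOAD_PREFIX = "http://www.archive.org/download/"
# Match the prefix, then the shortest run of URL characters ending in ".zip".
ZIP_URL = re.compile(re.escape(DOWNLOAD_PREFIX) + r"[^\s\"'<>]+?\.zip")


def zip_urls_in_source(page_source):
    """Return distinct zip download URLs in the order they appear."""
    found = []
    for url in ZIP_URL.findall(page_source):
        if url not in found:
            found.append(url)
    return found


# Hypothetical catalog page source, for illustration only.
sample_source = (
    '<a href="http://www.archive.org/download/treasureisland/'
    'treasureisland_librivox_64kb_mp3.zip">Zip file</a> '
    '<a href="http://www.archive.org/download/treasureisland/'
    'treasureisland_01.mp3">Chapter 1</a>'
)
print(zip_urls_in_source(sample_source))
```

Because the match starts from the download prefix and not from the book title, it sidesteps the naming mismatch between catalog pages and zip files.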
hugh
LibriVox Admin Team
Posts: 7972
Joined: September 26th, 2005, 4:14 am
Location: Montreal, QC
Contact:

Post by hugh »

the other thing to do would be to crawl our site for all our catalog pages (all nicely named) and then keep crawling to either:
a) the archive.org zip file or
b) the work's RSS page, which points to all the files to download.

that way you should get good metadata with each book, no?
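A sketch of the first option above: visit each catalog page, then record its title together with the zip link found on it, so the metadata stays attached to the file. The `fetch` parameter, the catalog URL, and the page markup are all hypothetical; a real crawler would pass a function that downloads the page.

```python
import re

ZIP_LINK = re.compile(r"http://www\.archive\.org/download/[^\s\"'<>]+?\.zip")
TITLE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def collect_records(catalog_urls, fetch):
    """Build one record per catalog page: its URL, page title, and zip link."""
    records = []
    for url in catalog_urls:
        html = fetch(url)
        title_match = TITLE.search(html)
        zip_match = ZIP_LINK.search(html)
        records.append({
            "catalog_url": url,
            "title": title_match.group(1).strip() if title_match else None,
            "zip_url": zip_match.group(0) if zip_match else None,
        })
    return records


# Hypothetical pages standing in for real HTTP responses.
fake_pages = {
    "http://example.org/treasure-island-by-robert-louis-stevenson/": (
        "<html><head><title>Treasure Island</title></head><body>"
        '<a href="http://www.archive.org/download/treasureisland/'
        'treasureisland_librivox_64kb_mp3.zip">zip</a></body></html>'
    ),
}
records = collect_records(list(fake_pages), fake_pages.__getitem__)
print(records)
```

Pages where `zip_url` comes back as `None` are the ones worth reporting, since every catalog entry is supposed to carry a zip link.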
daxm
Posts: 15
Joined: July 5th, 2007, 10:23 pm

Post by daxm »

Normally I'd reply individually to you all but I've never had such a response! Never have I been a part of an online community that is so helpful and friendly. Seriously!

If you don't believe me, look at some other forums on the net. Most are full of people telling other people to "read the documentation" or that they are stupid.

All of your responses have helped me look into other ways to do this. Now let's see if I can do it! ;-)
kayray
Posts: 11828
Joined: September 26th, 2005, 9:10 am
Location: Union City, California
Contact:

Post by kayray »

daxm wrote:Normally I'd reply individually to you all but I've never had such a response! Never have I been a part of an online community that is so helpful and friendly. Seriously!

If you don't believe me, look at some other forums on the net. Most are full of people telling other people to "read the documentation" or that they are stupid.
We pride ourselves on being the friendliest, non-flamiest forum on the interweb. :) Good luck with the torrenting. We've had a few people try to get torrenting started but it hasn't really "stuck" yet. I hope you can get it going for real!
Kara
http://kayray.org/
--------
"Mary wished to say something very sensible into her Zoom H2 Handy Recorder, but knew not how." -- Jane Austen (& Kara)
Starlite
Posts: 16548
Joined: April 30th, 2006, 2:17 pm
Location: Thunder Bay Ontario, Canada

Post by Starlite »

I just did a torrent search on www.isohunt.com and came up with only 35 hits for "librivox". Are you going to tag each torrent with "librivox"? I know if I do a search for "audiobooks" I will get many more, but I was wondering how you were getting along. :D

Esther :D
Last edited by Starlite on July 19th, 2007, 3:25 pm, edited 1 time in total.
"Reasonable people adapt themselves to the world. Unreasonable
people attempt to adapt the world to themselves. All progress,
therefore, depends on unreasonable people." George Bernard Shaw
daxm
Posts: 15
Joined: July 5th, 2007, 10:23 pm

Post by daxm »

Starlite wrote:I just did a torrent search on www.isohunt.com and came up with only 35 hits for "librivox". Are you going to tag each torrent with "librivox"? I know if I do a search for "audiobooks" I will get many more, but I was wondering how you were getting along. :D

Esther :D
Yes, the idea is to "specialize" in LibriVox audiobooks. Too many people use torrents for illegal activities, and I want to show that they can be used for good things too.
ab2525
Posts: 628
Joined: June 20th, 2006, 8:55 pm
Location: Woodbridge, Virginia
Contact:

Post by ab2525 »

Hey daxm, what script do you use to crawl?
What's this little box thingy for? Oh! COLOR