I need help getting the whole catalog for torrenting
Hello all. I've recently been working on getting the whole catalog "cataloged" so that I can figure out whether I have the whole catalog downloaded for torrenting. (Yes, I've seen some posts about torrenting before, but it seems nothing has stuck yet. Maybe this time around it will!)
My intention is to offer torrents for each of the completed works in their "whole book" zip files. So far I have 23GB downloaded. I used an automated approach to get what I have now, but because the zip file names don't match their respective book titles, I'm having difficulty figuring out whether I have it all.
Anyone have any suggestions or hints on how I can figure this out?
NOTE: My automated approach started with me crawling the librivox.org website for any files ending in "64kb_mp3.zip", as it appears all the "whole book" zip files end that way. I've noticed that I seem to have gaps, though, and I doubt that I actually have it all.
What would be nice is a plain list of all the links to all the "whole book" zip files of all the completed works.
My true hope and goal in this project is to offload some of the bandwidth from librivox by setting up a torrent "ring" (yes, I'll be back asking people to seed off of me). I'm sure there are several people in the world who would like to listen to these books but don't have an Internet connection stable enough to be able to download 250MB files in one sitting. With torrenting they will be able to get small chunks and, over time, they will have the whole thing!
Any and all help is appreciated.
I would have thought you'd start by going to catalog page
http://librivox.org/newcatalog/visitor_advanced.php
and selecting option, browse entire catalog.
That lists everything in catalog.
[url=http://librivox.org/newcatalog/people_public.php?peopleid=75]Jim Mowatt[/url] - [url=http://historyzine.com]Historyzine - The History Podcast[/url]
LibriVox files are stored at archive.org, so there's not really a bandwidth concern as such. However, the more ways to get files to people the better, for sure -- I've downloaded a few LV books this way, and it is easy.
I don't know if those torrents are still alive, but when I last looked (many weeks ago) there were a fair number still. Can you support those as well as starting torrents for new books..?
Last edited by Cori on July 19th, 2007, 12:21 am, edited 1 time in total.
There's honestly no such thing as a stupid question -- but I'm afraid I can't rule out giving a stupid answer : : To Posterity and Beyond!
And, thinking about it, the zip files are auto-generated (and thus auto-named) by archive.org ... we upload the individual 128Kb-rate files to them, and all the other formats spring from those.
So, could you let us know where there are problem filenames, so we can look into it..?
jimmowatt wrote: I would have thought you'd start by going to catalog page
http://librivox.org/newcatalog/visitor_advanced.php
and selecting option, browse entire catalog.
That lists everything in catalog.
This is how I started getting the files in the first place. There are currently 749 entries in the catalog yet I only have 300 .zip files. The reason I'm trying to automate them is so that I can "browse the catalog" and easily find the new entries. Maybe not all catalog entries have a .zip file of the whole book on their page?
Cori wrote: And, thinking about it, the zip files are auto-generated (and thus auto-named) by archive.org ... we upload the individual 128Kb-rate files to them, and all the other formats spring from those.
So, could you let us know where there are problem filenames, so we can look into it..?
The filename issue stems from simple things: the directory name of a completed work doesn't line up with the work's name, nor with the archive.org name. Some files start with "the" or "a" and yet the directory (or other related files) doesn't. Some use underscores in their names, others have dashes, and still others just run the name all together. (A good example is treasure-island-by-robert-louis-stevenson.xml versus treasureisland_librivox_64kb_mp3.zip. Here you see examples of all three.)
I'm not complaining (or at least not trying to complain). It just makes it hard for me to find and align the correct information when there seems to be no standard for naming.
Again, the files could be random strings of alphanumerics for all I care IF I could come up with a systematic and automated way of checking for new entries and then downloading the correct files. I'm going to be spending plenty of time setting up and maintaining the torrent server so I need all the help I can get on this end.
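One way to line up the two naming styles described above is to reduce each name to a loose key before comparing. This is only a sketch: the function name `loose_key` and the list of suffixes are made up for illustration, and the "-by-author" rule is a guess based on the single Treasure Island example, so it will misfire on titles that themselves contain "by".

```python
import re

# Hypothetical suffix list, based only on the file names quoted in this thread.
_SUFFIXES = ("_librivox_64kb_mp3.zip", "_64kb_mp3.zip", ".zip", ".xml")

def loose_key(name):
    """Reduce a catalog page name or a zip file name to a comparable key."""
    name = name.lower()
    for suffix in _SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    # Drop a trailing "-by-author-name" part (assumed catalog page style).
    name = re.split(r"[-_]by[-_]", name, maxsplit=1)[0]
    # Drop a leading article, but only when it is a separate word.
    name = re.sub(r"^(the|a|an)[-_ ]+", "", name)
    # Remove separators so dashed, underscored and run-together names agree.
    return re.sub(r"[^a-z0-9]", "", name)
```

Both names from the example reduce to the same key, so they can be paired up. A leading article cannot be stripped when the name is run together with no separator, so some names will still need matching by hand.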
Ah, I think that's okay (for us, anyway, though annoying for you.)
The underscore version, which includes the author's name, is our catalogue page -- nice for those to be as descriptive as possible ... the archive.org version tends to be shorter to keep it simple. Usually they're very similar, but not necessarily identical.
I don't know if there's any way to find what you want from the archive.org catalogue, but I think that might be a better way of doing it than handling the mismatch between the naming systems.
http://www.archive.org/details/librivoxaudio
All our catalogue entries should have a zip file on their page ... it's one of the final checks that's done.
daxm wrote: This is how I started getting the files in the first place. There are currently 749 entries in the catalog yet I only have 300 .zip files. The reason I'm trying to automate them is so that I can "browse the catalog" and easily find the new entries. Maybe not all catalog entries have a .zip file of the whole book on their page?
daxm,
If you search archive.org for "librivox zip" you will get 789 hits which presumably will include all the cataloged audiobooks (plus a few extraneous hits for CD covers, podcasts, etc.) There is a link to the zip file for each book on its archive.org detail page. Hopefully this will help you find what you are missing.
One way of checking for new entries would be to check the new releases feed:
http://librivox.org/newcatalog/NewReleases.xml
Also, if your script searches the catalog source page for text it could look for http://www.archive.org/download/ and then the next thing that comes up is the name of the zip file.
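A minimal sketch of that last suggestion, assuming the page source has already been fetched as a string. The function names `zip_links` and `missing_zips` are invented here, and the regular expression assumes the zip link starts with http://www.archive.org/download/ and ends in "64kb_mp3.zip", as described earlier in the thread. The real pages may differ.

```python
import re

# Assumed link shape: the download prefix from this thread, then any path,
# ending in the "64kb_mp3.zip" suffix mentioned in the first post.
ZIP_LINK = re.compile(r"http://www\.archive\.org/download/[^\s\"'<>]+?64kb_mp3\.zip")

def zip_links(page_source):
    """Return the unique whole-book zip URLs found in one page, in order."""
    seen = []
    for url in ZIP_LINK.findall(page_source):
        if url not in seen:
            seen.append(url)
    return seen

def missing_zips(links, have):
    """Return the links whose file name is not in the downloaded names."""
    have = set(have)
    return [url for url in links if url.rsplit("/", 1)[-1] not in have]
```

Running `zip_links` over each catalog page and passing the combined result to `missing_zips`, along with the names of the zip files already on disk, would give the list of gaps.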
the other thing to do would be to crawl our site for all our catalog pages (all nicely named) and then keep crawling to either:
a) the archive.org zip file or
b) the work's RSS page, which points to all the files to download.
that way you should get good meta data with each book, no?
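For option b), a rough sketch of pulling the title and file links out of a work's RSS page. It assumes the feed is ordinary RSS 2.0 with a channel title and one `<enclosure url="...">` per file, which has not been checked against the real feeds, and `book_info` is a made-up name.

```python
import xml.etree.ElementTree as ET

def book_info(rss_text):
    """Return the channel title and every enclosure URL from an RSS document."""
    root = ET.fromstring(rss_text)
    title = root.findtext("channel/title")
    # Assumption: each downloadable file appears as an <enclosure> element.
    files = [e.get("url") for e in root.iter("enclosure") if e.get("url")]
    return {"title": title, "files": files}
```

Keeping the title next to the file list is what gives you the metadata with each book, whichever way the files themselves end up being named.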
Normally I'd reply individually to you all but I've never had such a response! Never have I been a part of an online community that is so helpful and friendly. Seriously!
If you don't believe me, look at some other forums on the net. Most are full of people telling other people to "read the documentation" or that they are stupid.
All of your responses have helped me look into other ways to do this. Now let's see if I can do it!
daxm wrote: Normally I'd reply individually to you all but I've never had such a response! Never have I been a part of an online community that is so helpful and friendly. Seriously!
If you don't believe me, look at some other forums on the net. Most are full of people telling other people to "read the documentation" or that they are stupid.
We pride ourselves on being the friendliest, non-flamiest forum in the interweb. :) Good luck with the torrenting. We've had a few people try to get torrenting started but it hasn't really "stuck" yet. I hope you can get it going for real!
Kara
http://kayray.org/
--------
"Mary wished to say something very sensible into her Zoom H2 Handy Recorder, but knew not how." -- Jane Austen (& Kara)
I just did a torrent search on www.isohunt.com and came up with only 35 hits for "librivox". Are you going to tag each torrent with "librivox"? I know if I do a search for "audiobooks" I will get many more, but I was wondering how you were getting along.
Esther
Last edited by Starlite on July 19th, 2007, 3:25 pm, edited 1 time in total.
"Reasonable people adapt themselves to the world. Unreasonable
people attempt to adapt the world to themselves. All progress,
therefore, depends on unreasonable people." George Bernard Shaw
Starlite wrote: I just did a torrent search on www.isohunt.com and came up with only 35 hits for "librivox". Are you going to tag each torrent with "librivox"? I know if I do a search for "audiobooks" I will get much more but was wondering how you were getting along.
Esther
Yes, the idea is to "specialize" in Librivox audiobooks. Too many people use torrents for illegal activities and I want to show that they can be used for good things too.
Hey daxm, what script do you use to crawl?
What's this little box thingy for? Oh! [color=red]C[/color][color=orange]O[/color][color=yellow]L[/color][color=blue]O[/color][color=indigo]R[/color]