How to make a Project Gutenberg eBook

enko

Post by enko » July 29th, 2012, 1:00 pm

I recently made a Project Gutenberg eBook.

You can find it here: http://www.gutenberg.org/ebooks/40365

There are 2 main ways of making a Project Gutenberg eBook. One is by joining the Distributed Proofreaders team ( http://www.pgdp.net/ ) and the other way is by doing it all by yourself. I produced the above book by myself so I will explain the procedure.

- Choose a book that is public domain in the USA. That means it must have been published before 1923. The book can be a scanned book found at Internet Archive ( http://www.archive.org/ ), Google Books ( http://books.google.com/ ), Gallica ( http://gallica.bnf.fr/ ), etc., or it can be a physical book that you have at home, can borrow from a library, etc.

- Scan the Title page and its reverse (aka Verso) page. Scan extra pages as needed if dates or other publication information appear there. These pages should include the Author, the Title, the Publication Date and the City + Country of publication, as well as any copyright statement or similar (such as "All Rights Reserved"). Send the scans to Project Gutenberg at ( http://copy.pglaf.org/ ) so that they can confirm it is really in the public domain.

- Wait for the confirmation before starting any work. If the book is cleared you will get an email with a Clearance OK key.

- After you get the confirmation that the book is in the public domain, read the Volunteers' FAQ ( http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ ), concentrating particularly on section 7.

- Now you have to make the text itself. The Project Gutenberg standard is that lines in a text file must be 60-70 characters long.

There are 3 ways you can use to make the text:

1st way: Websites like Internet Archive, Google Books, Gallica, etc. often provide an OCR (optical character recognition) text along with their scanned books. You can then use that OCR text and correct it.

2nd way: Scan the book by yourself, produce an OCR text and correct the OCR text.

or the 3rd way: Type it.

- Spellcheck the file with software like Microsoft Word, etc.

- Check the files for common scanning/spelling errors with Gutcheck: http://upload.pglaf.org/gutcheck.php

- After all the checking and necessary corrections send the file to http://upload.pglaf.org/

- The person handling your submission (called a whitewasher) will email you about any changes that must be made.

When the file is good, the whitewasher will post your ebook on Project Gutenberg where everyone will be able to access it.

That's about it.
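
For the 60-70 chars/line step, here is a minimal Python sketch of how a corrected text could be re-wrapped and then checked for over-long lines. It is only an illustration, not Gutcheck and not an official Project Gutenberg tool; the function names `rewrap` and `long_lines` are made up for this example, and it assumes paragraphs are separated by blank lines. Poetry, tables and other pre-formatted passages should not be run through it, since re-wrapping would destroy their layout.

```python
import textwrap


def rewrap(text: str, width: int = 70) -> str:
    """Re-wrap each blank-line-separated paragraph to at most `width` chars per line.

    Hypothetical helper for illustration only; assumes plain prose paragraphs.
    """
    wrapped = []
    for paragraph in text.split("\n\n"):
        # Collapse existing line breaks and runs of spaces inside the paragraph.
        joined = " ".join(paragraph.split())
        wrapped.append(textwrap.fill(joined, width=width) if joined else "")
    return "\n\n".join(wrapped)


def long_lines(text: str, limit: int = 70):
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


if __name__ == "__main__":
    sample = "This is a long OCR paragraph that came out as one single line. " * 4
    fixed = rewrap(sample)
    print(fixed)
    print("Lines still too long:", long_lines(fixed))
```

A check like this only covers line length. The spelling and scanning errors still need the spellcheck and Gutcheck steps described above.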
Last edited by enko on July 31st, 2012, 10:50 pm, edited 2 times in total.

Availle
LibriVox Admin Team
Posts: 15559
Joined: August 1st, 2009, 11:30 pm

Post by Availle » July 29th, 2012, 1:18 pm

Thanks for posting this enko!

I have read a book for librivox, which is not to be found online (yet), so I was wondering if I should produce an ebook of the text myself. However, I was a bit worried about the time it would take.

Could you please indicate how long it took you to produce your book? And how many pages did it have?
Cheers,
Ava.

--
AvailleAudio.com

ppcunningham
Posts: 1148
Joined: July 21st, 2009, 3:26 pm
Location: Northern California, USA

Post by ppcunningham » July 29th, 2012, 2:55 pm

I've done a few books for Gutenberg.
The amount of time it takes for each book varies a lot. Figure a month or so, depending on how much time you have to devote to the project. One book I did, which was printed in the late 1700s, used the long-S which looks like an F, so OCR would have been useless, and I had to type it. It was 350 pages, and took about two months. I just started working on another one that is 365 pages, and it took about two hours or so per 7-page chapter to correct the OCR scan, so I think it would be faster to re-type it, too.

That's the easy part - the hard part is getting it to pass Gutcheck, which is Gutenberg's formatting checker - especially if your book has formatting oddities, like old-style punctuation and spelling. This can go quickly, or can take days.

You have already recorded yours, but for anyone else considering doing this - I would recommend getting PD approved through Gutenberg, and then doing the e-book and recording as you go. I wish I had! I am still finding errors I missed - last night I discovered I had written 'house' instead of 'horse'. I corrected it in the recording, but now I have to notify Gutenberg to fix the book. Somehow, when you are recording, the errors pop out better.

The worst part for me, though, was that by the time I got it posted to Gutenberg, I had been over it so many times, I was sick of it. It has been two years, and I'm still not finished recording it.

Patti
The trouble with life isn't that there is no answer, it's that there are so many answers. Ruth Benedict

Non curo. Si metrum non habet, non est poema

My Page

lubee930
Posts: 4304
Joined: March 4th, 2012, 1:06 pm
Location: Denver, CO, USA

Post by lubee930 » July 29th, 2012, 5:54 pm

Hi enko and Patti--

A very interesting topic here--thank you for the information.

I'm curious: Do you have a feel for whether very many of the books are manually typed? I would guess that the majority are scanned? Does Gutenberg prefer or encourage one method over the other? It sounds as if there is still quite a bit of manual effort involved even if the book is scanned, so I'm not so sure that just typing it would be such a bad alternative (depending, of course, upon the book and the person). :)
Best,
Lucretia

ppcunningham
Posts: 1148
Joined: July 21st, 2009, 3:26 pm
Location: Northern California, USA

Post by ppcunningham » July 29th, 2012, 9:36 pm

If you have a nice clear, clean copy of a book, then scanning might work, but I think it takes longer to correct scanning errors than it does to type it - depending on your typing skills. The problem comes in when you don't have a clean, clear copy. One of the books I did was made from a photocopy of a microfiche, and had words missing. Another one had pages missing, which I had to get from a different source. The one I'm working on now has been underlined, has writing in the margins, and the first and last lines on the page are blurry - which causes OCR to do all sorts of interesting things. And the Googlebooks text scans are just horrible - pick one at random, and you'll see what I mean. No one checks them after they're scanned, and they are mostly unreadable.

To type, or not to type - Gutenberg doesn't care one way or the other. They are only interested in getting a text copy in the prescribed format. And yes, there is a lot of manual effort involved, but a lot of self-satisfaction, too, when it is completed, and you finally get the e-mail that says it's been posted!

One of my books wasn't on Archive yet, either, so I uploaded both the PDFs of the book and the text file - so now it's available there as well, and at last look, had been downloaded 141 times. Well worth the effort, I think.

Patti

Cori
LibriVox Admin Team
Posts: 12042
Joined: November 22nd, 2005, 10:22 am
Location: Great Britain

Post by Cori » July 30th, 2012, 3:27 am

lubee930 wrote:Do you have a feel if very many of the books are manually typed?
I would hazard a very wild guess that less than 3% of PG books were manually typed. More in the beginning, when OCR just wasn't that great, and some still come through that way now. But OCR's so much better, and it's so much easier to share the work out, a la Distributed Proofreaders, that although typing is definitely not frowned on, it's just not so common.

Source: 3 years as a Distributed Proofreader with fingers in various pies.


Notes: typing is *AWESOME* practice for being a touch-typist. I touch-typed the whole of The Hobbit, just for the experience of it. It really firmed up the muscle-memory of my fingers. It also firmed my dislike of Tolkien's poetry and Dwarvish drinking songs, but that's another matter.
There's honestly no such thing as a stupid question -- but I'm afraid I can't rule out giving a stupid answer : : To Posterity and Beyond!

enko

Post by enko » July 30th, 2012, 3:33 am

Availle wrote:I have read a book for librivox, which is not to be found online (yet), so I was wondering if I should produce an ebook of the text myself. However, I was a bit worried about the time it would take.

Could you please indicate how long it took you to produce your book? And how many pages did it have?
Yes, as you know the book quite well you could produce an ebook of the text. The number of words is more important than the number of pages. My book has about 16,000 words. Correcting the OCR errors took 2 hours and formatting took 30 minutes, so a total time of 2.5 hours.
lubee930 wrote: I'm curious: Do you have a feel for whether very many of the books are manually typed? I would guess that the majority are scanned? Does Gutenberg prefer or encourage one method over the other? It sounds as if there is still quite a bit of manual effort involved even if the book is scanned, so I'm not so sure that just typing it would be such a bad alternative (depending, of course, upon the book and the person). :)
The majority of the books are scanned, corrected and produced by Distributed Proofreaders. Project Gutenberg encourages people who would like to make ebooks to register on Distributed Proofreaders. The people there are very knowledgeable and as helpful as the LibriVox members. But if for any reason you would like to make an ebook yourself, with a friend, or in a group, you can do that too.
If you decide to type a book, you have to consider its advantages and disadvantages over scanning and correcting the OCR of the same book.
ppcunningham wrote: And the Googlebooks text scans are just horrible - Pick one at random, and you'll see what I mean. No one checks them after they're scanned, and they are mostly unreadable.
It depends. My ebook came from a Googlebooks text OCR and it was good. Usually most of the OCR errors are on the cover (which doesn't have much text anyway). The main pages have few errors.
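
To compare the time figures mentioned in this thread, a rough back-of-the-envelope calculation can help. The sketch below only restates the numbers already posted (about 16,000 words in 2.5 hours for my book, and about two hours per 7-page chapter for Patti's 365-page book). The function names are made up for illustration, and real projects will vary a lot with the quality of the scan.

```python
def words_per_hour(words: int, hours: float) -> float:
    """Throughput for a finished book: words handled per hour of work."""
    return words / hours


def estimate_hours(pages: int, pages_per_chunk: int, hours_per_chunk: float) -> float:
    """Scale a measured rate (hours per chunk of pages) up to the whole book."""
    return pages / pages_per_chunk * hours_per_chunk


if __name__ == "__main__":
    # About 16,000 words, 2 hours of OCR correction plus 30 minutes of formatting.
    print(words_per_hour(16000, 2.5))
    # 365 pages at roughly two hours per 7-page chapter.
    print(round(estimate_hours(365, 7, 2)))
```

With these inputs the first book works out to 6,400 words per hour, while the second comes to roughly 104 hours of OCR correction, which is why typing can be the faster option for a poor scan.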

lubee930
Posts: 4304
Joined: March 4th, 2012, 1:06 pm
Location: Denver, CO, USA

Post by lubee930 » July 30th, 2012, 3:49 am

Thank you all for taking the time to reply! :)
Best,
Lucretia

Piotrek81
Posts: 3177
Joined: November 3rd, 2011, 2:02 pm
Location: Poznań, Poland

Post by Piotrek81 » July 31st, 2012, 5:23 am

That's an awesome topic :clap: I've been thinking what to do to increase the number of Polish PD e-texts. You people have answered most of the questions I thought of asking :D The only problem I still don't know how to solve is step 1 of the whole operation, namely laying my hands on a PD paper text for a long enough period to execute the whole operation :mrgreen:

I also thought about joining Distributed Proofreaders, hoping that in this way I'll be able to further the cause of spreading Polish e-texts, but I know very little of this community. Do they even have any Polish texts in the works now?
Come help us record The Deluge THE DELUGE IS BACK!
Want to hear some PREPARATION TIPS before you press "record"? Listen to THIS and THIS

Cori
LibriVox Admin Team
Posts: 12042
Joined: November 22nd, 2005, 10:22 am
Location: Great Britain

Post by Cori » July 31st, 2012, 5:30 am

No, DP doesn't currently have any Polish works in progress.

If you've got a reasonable camera and good lighting, it's possible to photograph pages, rather than scanning them. That shouldn't take *too* long if you don't do Polish translations of Dickens -- you might even be able to do it in the Rare Books room of your local library, so the books never have to leave the building.

I wonder if any US universities have a Polish speciality? Many Unis have staff interested in archiving and might be willing to help with the scanning step ... just need to find one with a good selection.

Have you been through HathiTrust's offerings? Not sure how they differ from GoogleBooks or archive.org, I think there must be some overlap, but I've found interesting things there from time to time. And if you let them know about items whose authors died long enough ago to make the books PD, they're good about opening up access.

Piotrek81
Posts: 3177
Joined: November 3rd, 2011, 2:02 pm
Location: Poznań, Poland

Post by Piotrek81 » July 31st, 2012, 10:31 am

Thanks, Cori. The bit about photographing didn't cross my mind :shock:

Yes, I visit the HathiTrust website from time to time. They do have some interesting PD books in Polish. Also, I've already noticed that they are very open and respond quickly :)

BellonaTimes
Posts: 3677
Joined: February 15th, 2009, 6:25 pm
Location: Florida

Post by BellonaTimes » July 31st, 2012, 8:30 pm

I'd like to do one for the book I'm currently soloing on, but it has a lot of photographs and in-line drawings, so I imagine the HTML version would take months. I read from a MS WordPad version of the OCR text, which is pretty accurate for a scanned book; that would be easy to turn into just a txt version.
They call me Threadkiller.
My Catalog Page

ppcunningham
Posts: 1148
Joined: July 21st, 2009, 3:26 pm
Location: Northern California, USA

Post by ppcunningham » July 31st, 2012, 10:41 pm

What you can do in that case is put all the drawings in a zip file and upload it along with the finished text file. Then the text will be posted, at least, and at some point you or someone else can go back and do the HTML version.

I did that with one of my books that had a lot of illustrations.

enko

Post by enko » August 14th, 2012, 9:06 am

2 other websites that help in making Project Gutenberg ebooks:

free literature: http://www.freeliterature.org/

and

Girlebooks (they make ebooks by female writers): http://girlebooks.com/

Both of these websites mention LibriVox quite often. Maybe some of their people are members here.

NinaBrown
Posts: 433
Joined: December 22nd, 2011, 6:17 pm
Location: Rockville IN

Post by NinaBrown » August 26th, 2012, 4:24 pm

Piotrek81 wrote:The only problem I still don't know how to solve is step 1 of the whole operation, namely laying my hands on a PD paper text for a long enough period to execute the whole operation
Oh, but you have it easy! You live in Poland! Last time I checked, Poland was full of Polish books :mrgreen: - and there is a university in Poznań, isn't there? They must have a library... with an online catalogue... find a book you want to do, then go talk to the librarian. They are going to love you!!! When I was studying here in Australia I could borrow books from my uni library for 3 months, then renew if the book wasn't on the request list. If you get the library interested, maybe they can support you and offer free photocopying? Then you will be working from a photocopy. I think the hardest bit is to choose the book :-)
This is all very exciting, I wish I had more time :cry:
-nina-
