De re metallica

Suggest and discuss books to read (all languages welcome!)
annise
LibriVox Admin Team
Posts: 38675
Joined: April 3rd, 2008, 3:55 am
Location: Melbourne, Australia

Post by annise »

Not sure what you mean by extract? I have always dodged using PDF projects - the ones on Archive are PDFs of images - I didn't know you could extract them as text? And their OCR version is totally unedited and pretty awful.

Anne
RuthieG
Posts: 21957
Joined: April 17th, 2008, 8:41 am
Location: Kent, England

Post by RuthieG »

Technically possible, and works fine for PDFs that have been prepared from computer documents, but this PDF is a PDF of the book scan. I just tried it with another smaller archive book scan PDF and it extracts the text, certainly, but it doesn't actually put any spaces between the words. :shock:

And the OCR text is a non-starter.

Ruth
My LV catalogue page | RuthieG's CataBlog of recordings | Tweet: @RuthGolding

Post by annise »

RuthieG wrote: Technically possible, and works fine for PDFs that have been prepared from computer documents, but this PDF is a PDF of the book scan. I just tried it with another smaller archive book scan PDF and it extracts the text, certainly, but it doesn't actually put any spaces between the words. :shock:

And the OCR text is a non-starter.

Ruth

Oh well - a one-word project shouldn't take long - just need someone who could pronounce it :help:

Anne
janbbeck
Posts: 271
Joined: September 16th, 2011, 9:45 pm

Post by janbbeck »

Can you give me a little more info on what needs to be done with the PDF to make it fit for consumption, please?
ppcunningham
Posts: 1142
Joined: July 21st, 2009, 3:26 pm
Location: Northern California, USA

Post by ppcunningham »

It needs to be converted into text, so that a word count program can count the number of words in a chapter, and so that if a chapter is too long it can be divided into smaller sections. An OCR program can do this, but most of them are not very good. Go to Archive.org and compare the "Full Text" version to the PDF pages, and you will see what I mean. Archive uses one of the best OCR programs, and the output is still mostly garbage.
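
As a rough illustration of the counting and splitting step, here is a minimal Python sketch. It assumes the chapter is already clean plain text with paragraphs separated by blank lines. The function names `count_words` and `split_chapter` and the `max_words` limit are made up for this example; they are not LibriVox tools or rules.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words in a piece of text."""
    return len(text.split())


def split_chapter(text: str, max_words: int) -> list[str]:
    """Split a chapter into sections of roughly max_words words or fewer.

    Breaks only happen at blank-line paragraph boundaries, so a single
    paragraph longer than max_words stays whole as its own section.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    sections: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n = count_words(para)
        # Close the current section if this paragraph would push it over the limit.
        if current and current_words + n > max_words:
            sections.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n
    if current:
        sections.append("\n\n".join(current))
    return sections
```

Calling `split_chapter(chapter_text, max_words)` returns a list of sections that can each be checked with `count_words`. Splitting "in a meaningful way" still needs a human eye, since this only looks at paragraph breaks and knows nothing about footnotes or subject changes.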

If someone were really dedicated, they could re-type it, and submit it to Gutenberg and/or upload it to Archive. I've done that a couple of times.

Patti
The trouble with life isn't that there is no answer, it's that there are so many answers. Ruth Benedict

Non curo. Si metrum non habet, non est poema

My Page

Post by janbbeck »

So simply reading from the PDF and cutting that up is not good enough? It sure seems very restrictive to only be able to do books that are available as ASCII text...

Post by RuthieG »

We often use scanned books or PDFs from the Internet Archive, Hathi Trust, Google Books etc. They are fine for simple books with clearly defined chapters, or for soloists who can comfortably read from a PDF or an online reader of a scanned book.

The problem with this one arises from the fact that it will be a group read of a complex and large book with enormous numbers of footnotes. If we need to split chapters (which I suspect we will) we need to be able to count the words and split the chapters in a meaningful way. That is time-consuming, and time is something that some of us here don't have too much of at the moment, for various reasons.

Of course, someone with a whole lot of time and a burning desire to see this recorded may well come along, but not necessarily right now. Many books are suggested here, and many are taken up and recorded, but sometimes it takes a while.

Ruth

Post by janbbeck »

Well, I tried to do a decent OCR job with FineReader, but its recognition accuracy is very bad, and it does not learn as I correct errors.

I will try with OmniPage next to see if that works a bit better.

Post by janbbeck »

OK, so OmniPage does a far better job on this. I would like to see if I can turn this into clean text, ready for splitting up and such. I can do, say, the first chapter and then have you all look at it and give me some direction.

Anyone willing to help?
Lucy_k_p
Posts: 2901
Joined: February 16th, 2009, 7:19 am
Location: Bath, UK

Post by Lucy_k_p »

I just spotted something in the new releases feed on Gutenberg:
http://www.gutenberg.org/ebooks/38015

(I only remember the title because I kept wondering why someone was talking about rock bands in the book suggestions thread.)
So little space, so much to say.

Post by annise »

Thanks Lucy - that looks much more doable - rah for PG :D

Anne

And I thought it was a rock band too :oops:

And I keep reading the title as "Re - De Metallica"

Post by janbbeck »

Thanks a lot! That saves me a lot of time. I was working my way through this in OmniPage...