Not sure what you mean by extract? I have always dodged using PDF projects - the ones on Archive are PDFs of images - I didn't know you could extract them as text? And their OCR version is totally unedited and pretty awful.
Anne
De re metallica
Technically possible, and works fine for PDFs that have been prepared from computer documents, but this PDF is a PDF of the book scan. I just tried it with another smaller archive book scan PDF and it extracts the text, certainly, but it doesn't actually put any spaces between the words.
And the OCR text is a non-starter.
Ruth
My LV catalogue page | RuthieG's CataBlog of recordings | Tweet: @RuthGolding
-
- LibriVox Admin Team
- Posts: 38675
- Joined: April 3rd, 2008, 3:55 am
- Location: Melbourne,Australia
RuthieG wrote: Technically possible, and works fine for PDFs that have been prepared from computer documents, but this PDF is a PDF of the book scan. I just tried it with another smaller archive book scan PDF and it extracts the text, certainly, but it doesn't actually put any spaces between the words. And the OCR text is a non-starter.
Oh well - a one-word project shouldn't take long - just need someone who could pronounce it.
Anne
-
- Posts: 1142
- Joined: July 21st, 2009, 3:26 pm
- Location: Northern California, USA
It needs to be converted into text, so that a word count program can count the number of words in a chapter, and so that if a chapter is too long it can be divided into smaller sections. An OCR scanning program can do this, but most of them are not very good. Go to Archive.org and compare the "Full Text" version to the PDF pages, and you will see what I mean. Archive uses one of the best OCR programs, and it is still mostly garbage.
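As a rough illustration of the counting and dividing described above, here is a minimal Python sketch. It assumes the chapter is already clean plain text with paragraphs separated by blank lines; the function names `count_words` and `split_chapter` and the `max_words` limit are made up for this example and are not part of any existing tool.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated words in a piece of text."""
    return len(text.split())


def split_chapter(text: str, max_words: int) -> list[str]:
    """Divide a chapter into sections of at most max_words words.

    Breaks are made only at blank-line paragraph boundaries (an assumption
    about how the text is laid out). A single paragraph that is longer than
    max_words is kept whole rather than cut mid-paragraph.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    sections: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n = count_words(para)
        # Start a new section if adding this paragraph would exceed the limit.
        if current and current_words + n > max_words:
            sections.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n
    if current:
        sections.append("\n\n".join(current))
    return sections
```

None of this helps until the text itself is usable: if the words run together or the OCR is garbage, the counts will be meaningless.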
If someone were really dedicated, they could re-type it, and submit it to Gutenberg and/or upload it to Archive. I've done that a couple of times.
Patti
The trouble with life isn't that there is no answer, it's that there are so many answers. Ruth Benedict
Non curo. Si metrum non habet, non est poema
My Page
We often use scanned books or PDFs from the Internet Archive, HathiTrust, Google Books etc. They are fine for simple books with clearly defined chapters, or for soloists who can use a PDF or an online reader of a scanned book comfortably.
The problem with this one arises from the fact that it will be a group read of a large, complex book with enormous numbers of footnotes. If we need to split chapters (which I suspect we will), we need to be able to count the words and split the chapters in a meaningful way. That is time-consuming, and time is something that some of us here don't have too much of at the moment, for various reasons.
Of course, someone with a whole lot of time and a burning desire to see this recorded may well come along, but not necessarily right now. Many books are suggested here, and many are taken up and recorded, but sometimes it takes a while.
Ruth
My LV catalogue page | RuthieG's CataBlog of recordings | Tweet: @RuthGolding
OK, so OmniPage does a far better job on this. I would like to see if I can turn it into good text, ready for splitting up and such. I can do, say, the first chapter and then have you all look at it and give me some direction.
Anyone willing to help?
I just spotted something in the new releases feed on Gutenberg:
http://www.gutenberg.org/ebooks/38015
(I only remember the title because I kept wondering why someone was talking about rock bands in the book suggestions thread.)
So little space, so much to say.