Changes in reader_faq and adding scanning_faq

bookshelf
Gutenberg back end service account 2019-11-13 10:20:01 -05:00
parent 6f6eb1c278
commit fc3707305d
2 changed files with 1100 additions and 50 deletions


@@ -10,70 +10,70 @@ Most of this page is no longer actively maintained. Some content may be inaccura
<div class="contents">
<ol>
<li><a href="#">About Finding eBooks</a>
<li><a href="#about-finding-ebooks">About Finding eBooks</a>
<ol class="inner_1">
<li><a href="#">How can I find an eBook I'm looking for?</a></li>
<li><a href="#">Can I get a complete list of Project Gutenberg eBooks?</a></li>
<li><a href="#">How can I download a PG text without using the web catalog?</a>
<li><a href="#how-can-i-find-an-ebook-im-looking-for">How can I find an eBook I'm looking for?</a></li>
<li><a href="#can-i-get-a-complete-list-of-project-gutenberg-ebooks">Can I get a complete list of Project Gutenberg eBooks?</a></li>
<li><a href="#how-can-i-download-a-pg-text-without-using-the-web-catalog">How can I download a PG text without using the web catalog?</a>
<ol class="inner_2">
<li><a href="#"> Books after 10,000 &#8212; the new naming scheme</a></li>
<li><a href="#">Books before 10,000 &#8212; the old naming scheme</a></li>
<li><a href="#books-after-10000--the-new-naming-scheme"> Books after 10,000 &#8212; the new naming scheme</a></li>
<li><a href="#books-before-10000--the-old-naming-scheme">Books before 10,000 &#8212; the old naming scheme</a></li>
</ol>
</li>
<li><a href="#">You don't have the eBook I'm looking for. Can you help me find it?</a></li>
<li><a href="#">Where else can I go to get eBooks?</a></li>
<li><a href="#">I see some eBooks in several places on the Net. Do different people really re-create the same eBooks?</a></li>
<li><a href="#you-dont-have-the-ebook-im-looking-for-can-you-help-me-find-it">You don't have the eBook I'm looking for. Can you help me find it?</a></li>
<li><a href="#where-else-can-i-go-to-get-ebooks">Where else can I go to get eBooks?</a></li>
<li><a href="#i-see-some-ebooks-in-several-places-on-the-net-do-different-people-really-re-create-the-same-ebooks">I see some eBooks in several places on the Net. Do different people really re-create the same eBooks?</a></li>
</ol>
</li>
<li><a href="#">About Using the Web Site</a>
<li><a href="#about-using-the-web-site">About Using the Web Site</a>
<ol class="inner_1">
<li><a href="#"> Why couldn't I reach your site? (or: Why is your site slow?)</a></li>
<li><a href="#">I get an error when I try to download a book.</a></li>
<li><a href="#">I searched for a book I know is in Project Gutenberg, but got no results.</a></li>
<li><a href="#">Can I copy your website, or your website materials?</a></li>
<li><a href="#">Your site doesn't look right in my browser. I clicked on a button, and nothing happened.</a></li>
<li><a href="#">What does that thing about "mirror sites" mean?</a></li>
<li><a href="#">What exactly is an FTP site anyway?</a></li>
<li><a href="#">Can I become an FTP mirror?</a></li>
<li><a href="#">Can I make a private FTP mirror for my school, library or organization?</a></li>
<li><a href="#">When I clicked on the file I want, nothing happened.</a></li>
<li><a href="#">How many texts are downloaded through the web site?</a></li>
<li><a href="#">What are the most popular books?</a></li>
<li><a href="#why-couldnt-i-reach-your-site-or-why-is-your-site-slow"> Why couldn't I reach your site? (or: Why is your site slow?)</a></li>
<li><a href="#i-get-an-error-when-i-try-to-download-a-book">I get an error when I try to download a book.</a></li>
<li><a href="#i-searched-for-a-book-i-know-is-in-project-gutenberg-but-got-no-results">I searched for a book I know is in Project Gutenberg, but got no results.</a></li>
<li><a href="#can-i-copy-your-website-or-your-website-materials">Can I copy your website, or your website materials?</a></li>
<li><a href="#your-site-doesnt-look-right-in-my-browser-i-clicked-on-a-button-and-nothing-happened">Your site doesn't look right in my browser. I clicked on a button, and nothing happened.</a></li>
<li><a href="#what-does-that-thing-about-mirror-sites-mean">What does that thing about "mirror sites" mean?</a></li>
<li><a href="#what-exactly-is-an-ftp-site-anyway">What exactly is an FTP site anyway?</a></li>
<li><a href="#can-i-become-an-ftp-mirror">Can I become an FTP mirror?</a></li>
<li><a href="#can-i-make-a-private-ftp-mirror-for-my-school-library-or-organization">Can I make a private FTP mirror for my school, library or organization?</a></li>
<li><a href="#when-i-clicked-on-the-file-i-want-nothing-happened">When I clicked on the file I want, nothing happened.</a></li>
<li><a href="#how-many-texts-are-downloaded-through-the-web-site">How many texts are downloaded through the web site?</a></li>
<li><a href="#what-are-the-most-popular-books">What are the most popular books?</a></li>
</ol>
</li>
<li><a href="#">About Downloading and Using Project Gutenberg eBooks</a>
<li><a href="#about-downloading-and-using-project-gutenberg-ebooks">About Downloading and Using Project Gutenberg eBooks</a>
<ol class="inner_1">
<li><a href="#">Should I download a ZIP or a TXT file?</a></li>
<li><a href="#">I've got a ZIP file. What do I do with it?</a></li>
<li><a href="#">I tried to unzip my file, but it said the file was corrupt, or damaged.</a></li>
<li><a href="#">I see gibberish onscreen when I click on a book.</a></li>
<li><a href="#">Can I download and read your books?</a></li>
<li><a href="#">What am I allowed to do with the books I download?</a></li>
<li><a href="#">Does Project Gutenberg know who downloads their books?</a></li>
<li><a href="#">I've found some obvious typos in a Project Gutenberg text. How should I report them?</a></li>
<li><a href="#">I've found some obvious typos in a Project Gutenberg text. Who should I report them to?</a></li>
<li><a href="#">I've reported some typos. What will happen next?</a></li>
<li><a href="#">I've got the text file, and I can read it, but it seems to be double-spaced or it has control characters like ^J or ^M at the end of every line.</a></li>
<li><a href="#">When I print out the text file, each line runs over the edge of the page and looks bad.</a></li>
<li><a href="#">I can read the text file, but a few characters appear as black squares, or gibberish.</a></li>
<li><a href="#">Can I get a handheld device for reading PG texts? Which device should I get?</a></li>
<li><a href="#">How can I read a PG eBook on my Palm?</a></li>
<li><a href="#">How can I read a PG eBook on my PDA (not Palm)?</a></li>
<li><a href="#should-i-download-a-zip-or-a-txt-file">Should I download a ZIP or a TXT file?</a></li>
<li><a href="#ive-got-a-zip-file-what-do-i-do-with-it">I've got a ZIP file. What do I do with it?</a></li>
<li><a href="#i-tried-to-unzip-my-file-but-it-said-the-file-was-corrupt-or-damaged">I tried to unzip my file, but it said the file was corrupt, or damaged.</a></li>
<li><a href="#i-see-gibberish-onscreen-when-i-click-on-a-book">I see gibberish onscreen when I click on a book.</a></li>
<li><a href="#can-i-download-and-read-your-books">Can I download and read your books?</a></li>
<li><a href="#what-am-i-allowed-to-do-with-the-books-i-download">What am I allowed to do with the books I download?</a></li>
<li><a href="#does-project-gutenberg-know-who-downloads-their-books">Does Project Gutenberg know who downloads their books?</a></li>
<li><a href="#ive-found-some-obvious-typos-in-a-project-gutenberg-text-how-should-i-report-them">I've found some obvious typos in a Project Gutenberg text. How should I report them?</a></li>
<li><a href="#ive-found-some-obvious-typos-in-a-project-gutenberg-text-who-should-i-report-them-to">I've found some obvious typos in a Project Gutenberg text. Who should I report them to?</a></li>
<li><a href="#ive-reported-some-typos-what-will-happen-next">I've reported some typos. What will happen next?</a></li>
<li><a href="#ive-got-the-text-file-and-i-can-read-it-but-it-seems-to-be-double-spaced-or-it-has-control-characters-like-j-or-m-at-the-end-of-every-line">I've got the text file, and I can read it, but it seems to be double-spaced or it has control characters like ^J or ^M at the end of every line.</a></li>
<li><a href="#when-i-print-out-the-text-file-each-line-runs-over-the-edge-of-the-page-and-looks-bad">When I print out the text file, each line runs over the edge of the page and looks bad.</a></li>
<li><a href="#i-can-read-the-text-file-but-a-few-characters-appear-as-black-squares-or-gibberish">I can read the text file, but a few characters appear as black squares, or gibberish.</a></li>
<li><a href="#can-i-get-a-handheld-device-for-reading-pg-texts-which-device-should-i-get">Can I get a handheld device for reading PG texts? Which device should I get?</a></li>
<li><a href="#how-can-i-read-a-pg-ebook-on-my-palm">How can I read a PG eBook on my Palm?</a></li>
<li><a href="#how-can-i-read-a-pg-ebook-on-my-pda-not-palm">How can I read a PG eBook on my PDA (not Palm)?</a></li>
</ol>
</li>
<li><a href="#">About the Files</a>
<li><a href="#about-the-files">About the Files</a>
<ol class="inner_1">
<li><a href="#">What types of files are there, and how do I read them?</a></li>
<li><a href="#">What do the filenames of the texts mean?</a>
<li><a href="#what-types-of-files-are-there-and-how-do-i-read-them">What types of files are there, and how do I read them?</a></li>
<li><a href="#what-do-the-filenames-of-the-texts-mean">What do the filenames of the texts mean?</a>
<ol class="inner_2">
<li><a href="#">Books after 10,000 &#8212; the new naming scheme</a></li>
<li><a href="#">Books up to 10,000 &#8212; the old naming scheme</a></li>
<li><a href="#books-after-10000--the-new-naming-scheme-1">Books after 10,000 &#8212; the new naming scheme</a></li>
<li><a href="#books-up-to-10000--the-old-naming-scheme">Books up to 10,000 &#8212; the old naming scheme</a></li>
</ol>
</li>
<li><a href="#">What is the difference within PG between an "edition" and a "version"?</a></li>
<li><a href="#">What is the difference between an "etext" and an "eBook"?</a></li>
<li><a href="#">What are the "Etext/Ebook numbers" on the texts?</a></li>
<li><a href="#">What do the month and year on the text mean?</a></li>
<li><a href="#what-is-the-difference-within-pg-between-an-edition-and-a-version">What is the difference within PG between an "edition" and a "version"?</a></li>
<li><a href="#what-is-the-difference-between-an-etext-and-an-ebook">What is the difference between an "etext" and an "eBook"?</a></li>
<li><a href="#what-are-the-etextebook-numbers-on-the-texts">What are the "Etext/Ebook numbers" on the texts?</a></li>
<li><a href="#what-do-the-month-and-year-on-the-text-mean">What do the month and year on the text mean?</a></li>
</ol>
</li>
</ol>
@@ -212,7 +212,7 @@ If you choose one of the mirrors, you are then brought to a new page, asking you
Select a site, and the file will be downloaded, or offered for download, depending on which format you selected and which browser you use.
If you can't find your text either way, the book has not been cataloged. If you know that the book has been posted recently, and maybe hasn't made it into the catalog yet, read: [ How can I download a PG text without using the web catalog?]( How can I download a PG text without using the web catalog?)
If you can't find your text either way, the book has not been cataloged. If you know that the book has been posted recently, and maybe hasn't made it into the catalog yet, read: [How can I download a PG text without using the web catalog?](#how-can-i-download-a-pg-text-without-using-the-web-catalog)
If even this doesn't help, don't despair! We don't have it, but it may be elsewhere on the Web. Go to the major search engines and try there. You can also try looking in the Book Search section of [The Online Books Page](https://onlinebooks.library.upenn.edu), and if you have no luck with that, you might be able to find it listed as being In Progress somewhere on their [Books In Progress and Requested](https://onlinebooks.library.upenn.edu/in-progress.html) page.
@@ -414,4 +414,351 @@ The Posting Team, who post the books, also make the corrections, and ultimately,
Many producers put their e-mail addresses in their texts, specifically so that readers can contact them when errors are found. If you see that in your text, you should try to contact the producer first. This is especially true if the corrections aren't obvious, as in the case of missing words. The producer is likely to have the original book, and will probably be able to confirm your corrections without visiting a library. If the book needs the corrections, the producer can then notify the Posting Team.
If you get no response from the producer, or if there is no e-mail address listed, or if the corrections are small and obvious, you should send them to the e-mail address for reporting errors listed on the Contacts Page, where members of the Posting Team will deal with them.
### I've reported some typos. What will happen next?
This varies wildly. Sometimes, you may just get a response e-mail in a day or three saying thanks, and that we've fixed the typo. This is normal when you've just reported one or a few obvious typos.
Where there is some text missing, or the changes you suggest are otherwise not obvious, we may have to find someone with an eligible copy of the book to confirm the changes, and that might take time. Normally, you will get an e-mail explaining that within a week.
Sometimes, even though you've noticed only one or two small typos, one of the Posting Team who was looking at it may find many more, and decide that the whole text needs to be re-proofed. This may also take time.
If the text needs a lot of changes, we may post a new EDITION [R.35] of it, with a new filename: e.g. abcde10.txt may become abcde11.txt. In this case, you will receive a copy of the e-mail sent to the posted list announcing the new file. Our current rule of thumb is that we create a new edition when we make twelve significant changes, but we judge each on a case-by-case basis, and especially will usually not make a new edition if the original was posted recently.
### I've got the text file, and I can read it, but it seems to be double-spaced or it has control characters like ^J or ^M at the end of every line.
This is most often seen on Mac or Linux. If you want to dig into why this effect happens, see the FAQ "Why use a CR/LF at end of line?" [V.85].
Perhaps viewing it in a different editor or viewer will help, but it's usually easiest just to globally replace all of the control characters (if you see them) with nothing, or to replace all double line-ends with single line-ends.
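If you would rather script the replacement than do it by hand in an editor, the following is a minimal sketch in Python. It assumes the stray characters are carriage returns (the ^M you see onscreen) and that you want plain line feeds; the function name `normalize_line_endings` is just an illustration, not part of any PG tool.

```python
def normalize_line_endings(text: str) -> str:
    """Turn CR/LF pairs and bare CRs into single LF line ends."""
    # Handle the CR/LF pair first so it does not become two line ends.
    return text.replace("\r\n", "\n").replace("\r", "\n")


if __name__ == "__main__":
    sample = "Call me Ishmael.\r\nSome years ago\r\n"
    # repr() makes the line-end characters visible.
    print(repr(normalize_line_endings(sample)))
```

When reading a real file, open it with `newline=""` so that Python hands you the original line ends unchanged before you normalize them.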
### When I print out the text file, each line runs over the edge of the page and looks bad.
If you have a file ending in .txt from Project Gutenberg, it is usually formatted with about 70 characters per line, and with a Carriage Return/Line Feed pair (also known as a "Hard Return" or a "Paragraph Mark") at the end of every line.
This is the most widely accepted format for text files, but it's not ideal on all computers and all programs. 70 characters per line means that if you are using an unusually large or small font to print it, lines may wrap around or not reach across the page. The hard return means that on some systems, the lines may appear double-spaced.
Unfortunately, we can't advise you how best to format texts on all systems, mostly because we don't know every system! Here are a couple of tips you might try:
If your font is too big or too small, try setting the font to Courier size 10 or Times size 12. It may not be ideal, but it mostly works.
In a word processor, you may be able to remove the Hard Returns, but beware: if you remove too many, the whole text will become one paragraph. One common formula for removing the HRs goes like this:
First, all paragraphs and separate lines should be separated by two HRs, so that you can see one blank line between them. Where they aren't, as in the case of a table of contents or lines of verse, add the extra HRs to make them so.
Replace All occurrences of two HRs with some nonsense character or string that doesn't exist in the text, like ~$~.
Replace All remaining HRs with a space.
Replace your inserted string ~$~ with one HR.
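The same formula can be sketched in a few lines of Python, for readers who prefer a script to a word processor. This is only an illustration under some assumptions: the text has already been prepared as described in the first step, line ends are treated as `\n` after normalizing CR/LF, and the function name `unwrap_paragraphs` is made up for this example.

```python
def unwrap_paragraphs(text: str, marker: str = "~$~") -> str:
    """Join the lines of each paragraph, keeping one line end between paragraphs."""
    text = text.replace("\r\n", "\n")  # treat CR/LF as a single hard return
    if marker in text:
        # The formula only works if the nonsense string is not in the text.
        raise ValueError("marker already occurs in the text; choose another")
    text = text.replace("\n\n", marker)  # save the real paragraph ends
    text = text.replace("\n", " ")       # remaining hard returns become spaces
    return text.replace(marker, "\n")    # restore one hard return per paragraph


if __name__ == "__main__":
    wrapped = "It was the best\nof times,\n\nit was the worst\nof times."
    print(unwrap_paragraphs(wrapped))
```

As with the manual version, verse and tables will be run together unless they were separated by blank lines beforehand.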
### I can read the text file, but a few characters appear as black squares, or gibberish.
The text is using some character set that your editor or viewer isn't. For example, the text is using ISO-8859-1, and your viewer is using Codepage 850 — or vice versa. You can see the plain ASCII characters, but non-ASCII characters like accented letters display as nonsense.
Look at the top of the file for a clue to the character set encoding: if it's there, it may help you to find which editor, or font, or viewer you should be using.
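To see why the wrong character set produces nonsense, here is a small Python sketch. It takes the two encodings named above as its example and uses Python's standard codec names for them (`iso-8859-1` and `cp850`); the sample word is invented for the illustration.

```python
# The same bytes, read with two different character sets.
raw = "café".encode("iso-8859-1")  # the accented letter is stored as one byte

shown_correctly = raw.decode("iso-8859-1")  # viewer uses the file's own encoding
shown_wrongly = raw.decode("cp850")         # viewer assumes Codepage 850

if __name__ == "__main__":
    # ascii() keeps the output printable on any console.
    print(ascii(shown_correctly), ascii(shown_wrongly))
```

The plain ASCII letters come out the same either way; only the accented letter changes, which matches the symptom described above.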
### Can I get a handheld device for reading PG texts? Which device should I get?
To read eBooks on a handheld, you need three things: the eBook content itself (which you can get from PG and other sites), a device (which I will sometimes call a PDA, even though technically, the RocketBook isn't a PDA) and the reader software that runs on the PDA.
In mid-2002, there are three main families of handheld devices people use for reading eBooks: Palms, Pocket PCs and RocketBooks (or their successor, REB1100s). In general, it is possible to use any of these in combination with any common type of personal computer.
Palms are very common, especially when you count not just the Palm [http://www.palmone.com/us/](http://www.palmone.com/us/) itself, but PalmOS-based devices from other manufacturers, like:
the Franklin eBookMan [http://www.franklin.com/ebookman/](http://www.franklin.com/ebookman/)
the Handspring Visor [http://www.handspring.com](http://www.handspring.com)
the Sony Clié [http://www.sony.com](http://www.sony.com)
Because of the number of makers of PalmOS-based devices, you can buy them with lots of combinations of features — color screen, audio, different memory sizes. Of course, Palms have other applications besides eBook reading. Palms are the smallest and most portable of the three classes, and tend to have the best battery life for travelling, but they also have the smallest screen. Just about all reader software will run on Palms, except the Microsoft Reader, which runs only on Pocket PCs, but you don't need the Microsoft Reader for Project Gutenberg eBooks.
In Pocket PCs, the Compaq iPaq [http://www.hp.com](http://www.hp.com) and the Dell Axim [http://www.dell.com](http://www.dell.com) are by far the most common at the end of 2003. More expensive and bulkier than a Palm, they have a bigger screen. Like the Palms, they can perform many functions besides reading eBooks. Only Pocket PCs can support the Microsoft Reader, but this is not necessary for reading Project Gutenberg eBooks.
The RocketBook, and its successor the Gemstar REB1100, are quite different from the others. These were built specifically for reading eBooks, and do not have additional functions. They are not, technically, PDAs. Their screens are bigger, and excellent for reading, but do not offer color. They also don't offer a choice of readers: the dedicated reader is built into the device. Both of them require the eBooks you load to be formatted for their reader, and files made for them usually have the extension .rb for RocketBook. The REB1100 did not come with the RocketLibrarian, which is the program you run on your PC to turn an etext into a RocketBook file, but people are still making .rb files, and the RocketLibrarian is still available and popular among an enthusiastic group of Rocket users. (The REB1200 is entirely different from the REB1100, and, as far as we know, PG etexts cannot easily be transferred to it.)
In late 2003, Gemstar discontinued their eBook reader range, but there are many still around.
In summary, the Rocket/REB1100 is a dedicated reader, with a good screen, but limited in what it does.
Palms are relatively cheap and common, with a wide range of options, and the capacity to function as PDAs as well. They can run all common readers except the Microsoft one.
The iPaq [http://www.hp.com](http://www.hp.com) has a good color screen, but is bulkier than a Palm, and can run lots of readers, including the Microsoft one, but not all Palm readers are available for Pocket PC. Like Palms, the iPaq can do other jobs besides displaying eBooks.
Different people make different choices among these for reading their eBooks, and they all work well; it's a matter of personal taste.
### How can I read a PG eBook on my Palm?
These steps work for all devices running the Palm OS.
1. Install the free [Plucker Viewer](http://www.plkr.org/dl)
2. Download the eBook in the "plucker" format to your desktop
3. Sync the plucker file to the Palm using your favorite desktop application
### How can I read a PG eBook on my PDA (not Palm)?
To read a book on your PDA, you need to get the file into a format that your reader software understands. Each PDA reader program will work only with a specific format of file. Some will read several formats, but, in general, it's a jungle of competing options.
Unless you use a Rocket or REB1100, you will need to install at least one reader program, and many veteran readers install two or three to deal with different formats. There are many of them available. One of the most used is the [Mobipocket Reader](http://www.mobipocket.com).
Further, the process may be different depending on which reader software you're using. Each format that a reader understands has one or more converter programs that run on your PC, and turn the plain text file into that format. So in general, you have to:
1. Download the PG text
2. Edit the text for the layout the converter wants (often HTML).
3. Use the converter to create a file of the format the reader wants.
4. Transfer the converted file to your PDA.
If all this sounds too complicated, remember that many people take and convert PG texts into many formats, and offer them for download from their sites. Of course, there is no guarantee that someone will have converted the particular eBook you want, but there are lots of options. Try [Blackmask](http://www.blackmask.com), which lists thousands of texts already converted for Mobipocket, iSilo, RocketBook and the Microsoft Reader.
There are many other sites that serve pre-converted PG texts.
[MemoWare](http://www.memoware.com) is also a useful resource for converted eBooks, and has lots of information, including an excellent [map of the readers and formats jungle](http://www.memoware.com/mw.cgi/?screen=help_format)
Steve Sakoman's site at [http://www.sakoman.net/](http://www.sakoman.net/) takes plain texts from PG and produces automated conversions to HTML and PalmDOC PDB.
If you're "rolling your own", you'll probably need to convert our plain texts to HTML at some point, because a lot of converters require HTML as input, and this is a common theme in readers' explanations of how they get texts onto their PDAs. Don't panic! You don't have to be a HTML wizard to do this — in fact, you don't need to know anything about HTML at all! Usually, it's just a matter of removing some line ends and Saving As HTML. You won't get a lot of fancy markup, or images out of thin air, but you will get the book.
One of the main things you usually have to do in making HTML is unwrap the lines. If you're making your HTML manually, this is usually done by replacing two paragraph marks with some nonsense marker like @@Z@@, replacing all single paragraph marks with a space, and replacing the nonsense marker with a paragraph mark. After unwrapping, the text can just be Saved As HTML.
This has the drawback that lines that should stay separate (like poetry, tables or letter headings) will be run together. You may have to go through the text and add extra line breaks for these.
There are some applications that specifically assist with auto-converting text into HTML:
- GutenMark [http://www.sandroid.org/GutenMark](http://www.sandroid.org/GutenMark) was specifically written for the purpose, and knows enough about PG conventions to do a very good job.
- InterParse [http://www.interparse.com](http://www.interparse.com) is a Windows-based generic text parser that is very easy and intuitive to use.
- The World Wide Web Consortium lists some other options at [http://www.w3.org/Tools/Misc_filters.html](http://www.w3.org/Tools/Misc_filters.html)
If you're using a RocketBook or REB1100, you don't have either the choices or the confusion to deal with. One of our volunteers who uses a RocketBook offered this recipe for getting a PG text onto a RocketBook:
On converting to Rocket:
1. Download text file.
2. Using your utility for showing formatting, enter your word processing program's edit mode.
3. Replace all double paragraph marks with some nonsense sequence that can't possibly actually be there, such as @@Z@@.
4. Replace all single paragraph marks with one single space (enter).
5. Replace your nonsense sequence with one paragraph mark.
6. Convert all your double spaces to single spaces. Repeat this until you get "0" for how many replacements were made.
7. Save in HTML.
8. Go into your Rocket Librarian. Use "import file using Rocket Librarian." Go and pick up the file, which will be automatically converted to .rb in this process.
This sounds long, but it usually takes me under three minutes except for a very long text. I've never taken longer than five minutes. You can just go in and pick up the text file with Rocket Librarian, but what you get onscreen doing this looks very odd. Steps 2-7 are not essential, and if I'm in a hurry to read something once I might skip them, but if it's something I know I want to keep I use them.
This formula is not ideal for poetry or blank verse — if you want to keep the lines unwrapped, you should avoid removing the paragraph marks.
Another volunteer, who reads on Mobipocket [http://www.mobipocket.com](http://www.mobipocket.com) offered this suggestion:
I use the MobiPocket Publisher, available free from [http://www.mobipocket.com](http://www.mobipocket.com). It wants to take a HTML file as input, so the first thing I have to do is convert my PG text to HTML.
I usually do this by running GutenMark, available at [http://www.sandroid.org/GutenMark](http://www.sandroid.org/GutenMark). I can also do it in Microsoft Word using the following sequence:
- Edit / Replace / Special and choose Paragraph Mark twice (or, from replace, you can type in ^p^p to get two Paragraph Marks) and replace with @@@@. Replace All. This saves off real paragraph ends by marking them with a nonsense sequence.
- Now Replace one Paragraph Mark (^p) with a space. Replace All. This removes the line-ends.
- Finally, replace @@@@ with one Paragraph Mark. Replace All. This brings back the Paragraph Ends.
- Now I can Save As HTML.
GutenMark does a better job of converting to HTML than my simple Word formula, since it recognizes standard PG features, and sometimes Mobipocket doesn't like the HTML produced from Word — it complains of a missing file, or doesn't recognize quotation marks.
Having got my HTML file, I open Mobipocket Publisher, choose "Project Gutenberg", Add the File I created, and just Publish it to MobiPocket .PRC format. Then I pick it up on my iPaq the next time I sync. The whole process takes two or three minutes, and the results, since I discovered GutenMark, are good.
I recently came across InterParse 4 at [http://www.interparse.com](http://www.interparse.com). It doesn't have the built-in knowledge of GutenMark, so the results aren't as good, but it's really easy to use, and you can see the effect of your changes onscreen as you do it. For most PG books, all you have to do is just Open the text file and choose Options / Remove all CRLFs (Except at Paragraph End), then Convert / Text to HTML and Save As the HTML filename you want. Quick and painless.
## About the Files
### What types of files are there, and how do I read them?
The vast majority of our files are plain text. You can read these with any editor or text viewer or browser. Some are HTML. You can read these with any browser.
For a fuller listing of other file types, and how to read them, please see the Formats FAQ [F.2].
### What do the filenames of the texts mean?
We have to divide this question into two answers, for books up to 10,000, and books after 10,000 (or older books reposted after we hit 10,000).
#### Books after 10,000 — the new naming scheme
Since eBook number 10,000, we name our files based on the PG etext number; thus, the base of the name simply reflects the order in which the book was posted. 12345.txt is just the 12,345th book posted.
Also, when we correct an older book, we may repost it into the new naming scheme rather than just replacing it in the old scheme. When we do this, its naming conventions are the same as if it had been numbered after 10,000, and, additionally, we add a subdirectory "old/", into which we put all of the older files, so that they are preserved for anyone who wants to examine them. In this way, we will eventually move all e-books to the new naming scheme.
Formats or character sets other than plain ASCII then get extensions added to indicate the type of file. Character sets get digits; formats get letters. The most common of these are:
- -0 for Unicode
- -8 for 8-bit plain text
- -h for HTML
- -m for MP3
- -r for RTF
- -t for TeX
- -x for XML
Thus, eBook number 12345 may — fairly typically — have the files 12345.txt, 12345.zip, 12345-8.txt, 12345-8.zip, 12345-h.htm and 12345-h.zip, as well as other possible character sets or formats.
Other formats get appropriate three-letter extensions, like -pdf.
The complete set of naming rules for post-10K eBooks is:
1. Directory structure: the directory for the eBook shall be contained in a hierarchy of directories, each one a single digit, being all the digits of the etext number except the last, in order. The name of the directory for the eBook itself shall be the number of the eBook. Thus, eBook #12345 will be contained in:
<pre>
/1/2/3/4/12345/
</pre>
and 123456 in
<pre>
/1/2/3/4/5/123456/
</pre>
Where an e-book is a reposting of a pre-10,000 text, we will create an old/ subdirectory, containing all of the old files associated with that text. For example, consider:
Mike, by P. G. Wodehouse (eBook #7423)
The corrected, reposted files will be found in:
<pre>
/7/4/2/7423/
</pre>
and the older, pre-10K files will all be held in:
<pre>
/7/4/2/7423/old/
</pre>
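The directory rule above is easy to express in code. The following Python sketch is only an illustration of the rule as quoted: the function name `ebook_directory` is made up, and single-digit eBook numbers are rejected because the rule as stated here does not say where they go.

```python
def ebook_directory(number: int) -> str:
    """Build the directory for an eBook from the digits of its number."""
    digits = str(number)
    if number < 10:
        # Assumption: the quoted rule gives no parent directory for these.
        raise ValueError("rule as quoted does not cover single-digit numbers")
    parents = "/".join(digits[:-1])  # every digit except the last, in order
    return f"/{parents}/{digits}/"


if __name__ == "__main__":
    print(ebook_directory(12345))         # /1/2/3/4/12345/
    print(ebook_directory(7423) + "old/")  # where the pre-10K files are kept
```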
2. Filenames within the eBook's directory shall be the eBook's number, with extensions preceded by a minus sign, indicating character set or format.
a) A file without a character set or format indicator is plain 7-bit ASCII. [In practice, we might allow a few 8-bit characters — up to a dozen or two — and still call it ASCII]
Example: 12345.txt [7-bit plain vanilla ASCII]
b) Character sets, for text files, get digits:
- -0 Unicode (including UTF-7, UTF-8, UCS-4, etc.)
- -5 Big-5
- -8 8-bit (including ISO 8859, Codepages, etc.)
- Example: 12345-8.txt [Text in some 8-bit encoding]
c) File types get letters. Ideally, one-letter formats should be standards-based and editable. For now, the following is the list of single-letter formats.
- -h HTML
- -x XML
- -r RTF
- -t TeX
- -m MP3
Other formats get preferably three (more if necessary) letters.
- -lit LIT
- -pdb PDB
- -doc Word DOC
- -mpg MPEG
- Example: 12345-x.xml [XML]
- Example: 12345-pdf.pdf [PDF]
When more than one variant of a format is posted, the poster will add additional letters as appropriate.
- Example: If a HTML of 12345 has been posted as 12345-h, and we are posting a new HTML of the same eBook broken into pages, it might be posted as 12345-hp.
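As a rough illustration of rule 2, this Python sketch builds a filename from an eBook number, an optional character set or format indicator, and a file extension. The function name `ebook_filename` and its parameters are invented for the example; the indicators themselves are the ones listed above.

```python
def ebook_filename(number: int, indicator: str = "", extension: str = "txt") -> str:
    """Return e.g. 12345.txt, 12345-8.txt or 12345-h.htm."""
    base = str(number)
    if indicator:
        # The indicator is preceded by a minus sign; none means 7-bit ASCII.
        base += "-" + indicator
    return f"{base}.{extension}"


if __name__ == "__main__":
    print(ebook_filename(12345))              # plain 7-bit ASCII text
    print(ebook_filename(12345, "8"))         # text in some 8-bit encoding
    print(ebook_filename(12345, "h", "htm"))  # HTML
    print(ebook_filename(12345, "pdf", "pdf"))
```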
3. Under the eBook's directory are all files for that eBook. The .txt files will be in the eBook's main directory, as well as other formats that require only one file (PDF, RTF, …). Formats that are likely to require ancillary files get a subdirectory named for file type, with the file within. This is to make it predictable to find the formats, and to allow for any ancillary files to be stored in the subdirectory.
Formats that get a subdirectory include: HTML, TeX and XML. Formats that do not get a subdirectory include: PDF, RTF, LIT, PDB.
The subdir name for each shall be the name of the primary file that lives there.
- Example: The file 12345-h.htm will be at /12345/12345-h/12345-h.htm , and any ancillary files (such as JPEG or CSS) will be in (or below) the same subdirectory.
4. A .zip for each format will be in the main eBook directory. The .zip will unzip to a subdirectory if it's a multi-file format from #3 above, otherwise it will simply unzip a file. In the case of some pre-compressed formats, such as MP3, a .zip may not make sense, in which case it may be omitted.
- Example: 12345-h.zip will be at 12345/ , and when unzipped will create a subdirectory 12345-h/ with 12345-h.htm and any ancillary files.
- Example: 12345-pdf.zip will be at 12345/, and when unzipped will create 12345-pdf.pdf in the current directory.
5. Versions and editions: in the case of a new EDITION, a corrected file, the original file is renamed with an extension of its own posted date .yyyymmdd, and then replaced by the corrected file. So 12345.txt, when replaced, becomes 12345.txt.20030101 and the new, corrected file becomes 12345.txt.
New EDITIONS will get a "Most recently updated: " line added to their standard metadata.
The Release Date in the standard header will be the month and year of the actual first posting of that eBook.
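The renaming step in point 5 is simple enough to show in a couple of lines of Python. This is just a sketch; the function name `archived_name` is hypothetical, and it only builds the new name without touching any files.

```python
from datetime import date


def archived_name(filename: str, posted: date) -> str:
    """Return the name an original file gets when a corrected edition replaces it.

    The file's own posted date is appended as a .yyyymmdd extension.
    """
    return f"{filename}.{posted:%Y%m%d}"
```

So a 12345.txt that was posted on 1 January 2003 would be kept as `archived_name("12345.txt", date(2003, 1, 1))`, and the corrected file takes over the name 12345.txt.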
6. Each file (e.g., 12345-h.htm) should have a Project Gutenberg header, metadata and footer. In cases where the file is not editable (such as PDF), or where adding a header isn't realistic (such as MP3), the header, metadata and footer can go in a "readme" file named for the file, with "-readme" added before the extension. The "readme" file shall be in the same directory as the file to which it refers, and shall be included in the ZIP file for that format. Where the format is multifile, there should be only one "readme" for all files.
- Example: "12345-pdf-readme.txt" for the file 12345-pdf.pdf Note: If we were able to add the standard header prior to creating the PDF file, it could be distributed as any other editable format without a readme.
- Example: "12345-m-readme.txt" for the files 12345-m-001.mp3, 12345-m-002.mp3, etc.
7. The GUTINDEX file(s) will have entries of the form:
Title, by Author eBook#
eBook # will be in 5 digits, followed by a "C" if copyrighted and "*" if reserved. "by " will be omitted if there is not enough space. Any additional data, such as a translator or subtitle, will be on a following line or lines surrounded by square brackets [] and indented by two spaces.
GUTINDEX will have approximate date indicators such as:
MARCH 2004: 822 eBooks
The following is an example of etext# 12345, assuming it has ASCII, 8-bit and Unicode text files, a HTML and a HTML broken into pages, an XML, PDF, TeX, and LIT formats, and MP3. Assume that we couldn't edit the LIT, and so had to add a "readme" for that containing the header as in point 6 above.
The directory 12345 for the eBook will be at
1/2/3/4/12345/
and it will contain the files
1/2/3/4/12345/12345.txt
1/2/3/4/12345/12345.zip
1/2/3/4/12345/12345-0.txt
1/2/3/4/12345/12345-0.zip
1/2/3/4/12345/12345-8.txt
1/2/3/4/12345/12345-8.zip
1/2/3/4/12345/12345-h.zip
1/2/3/4/12345/12345-hp.zip
1/2/3/4/12345/12345-t.zip
1/2/3/4/12345/12345-x.zip
1/2/3/4/12345/12345-pdf.pdf
1/2/3/4/12345/12345-pdf.zip
1/2/3/4/12345/12345-lit.lit
1/2/3/4/12345/12345-lit-readme.txt
1/2/3/4/12345/12345-lit.zip
and in its subdirectories the further files
1/2/3/4/12345/12345-h/12345-h.htm
1/2/3/4/12345/12345-h/image1.png
1/2/3/4/12345/12345-hp/12345-hp.htm
1/2/3/4/12345/12345-hp/page2.htm
1/2/3/4/12345/12345-hp/image1.png
1/2/3/4/12345/12345-t/12345-t.tex
1/2/3/4/12345/12345-x/12345-x.xml
1/2/3/4/12345/12345-x/12345-x.xsl
1/2/3/4/12345/12345-x/image1.png
1/2/3/4/12345/12345-m/12345-m-readme.txt
1/2/3/4/12345/12345-m/12345-m-001.mp3
1/2/3/4/12345/12345-m/12345-m-002.mp3
#### Books up to 10,000 — the old naming scheme
Older PG files are named for the text, the edition, and the format type.
Nearly all of these PG files are named in "8.3" format — that is, up to eight characters, a dot, and three more characters. (It should have been all of them, by the rules, but we had to break a few.)
The first five characters in the filename are simply a unique name for that text; for example, the filename for "Ulysses" by Joyce begins with "ulyss".
If the text has been posted as both a 7-bit and 8-bit text, then the first character of the filename will be a 7 or an 8, to indicate that. For example, we have both 7crmp10 and 8crmp10 for Dostoevsky's Crime and Punishment.
The 6th and 7th characters of the name are the edition number — 01 through 99. We normally start at edition 10 (1.0); numbers lower than that indicate that we think the text needs some more work; numbers higher than that mean that someone has corrected the original edition 10.
The 8th character of the filename, if it exists, indicates either the version or the format of the file. When we get a different version of the text based on a different source, we give it an a, b, c, as for example if the text is from a different translation. Where we have posted a text in a different format, we also add an eighth character — "h" for HTML, "x" for XML, "r" for RTF, "t" for TeX, "u" for Unicode are established formats. There have been some experimental postings with "l" for LIT, and "p" for either PRC or PDB.
So, for example:
<pre>
7crmp10 is our first edition of Crime and Punishment in plain ASCII
8sidd10 is our first edition of Siddhartha, as an 8-bit text
dyssy10b is our first edition of our third translation of Homer's Odyssey, in plain ASCII
jsbys11 is our second edition of Jo's Boys, in plain ASCII
vbgle10h is our HTML format of our first edition of Darwin's Voyage of the Beagle
7ldv110 is our 7-bit ASCII version of the first volume of the Notebooks of Leonardo da Vinci
</pre>
To make it worse, we don't always stick to these rules, for example:
<pre>
1ddc810 is our first edition of the first book of Dante's Divina Commedia in Italian, as an 8-bit text
80day10 is our first edition of Verne's Around the World in 80 days, in plain 7-bit ASCII in English.
emma10 is our first edition of Jane Austen's "Emma" — with a 4-character basename instead of 5.
</pre>
Some series have special, non-standard names. Shakespeare is named with a digit representing the overall source (First Folio, etc), then "ws", then a series number, so for example 0ws2610, 1ws2610 and 2ws2610 are all versions of "Hamlet". The Tom Swift series is named with a two-digit prefix denoting the series number, then "tom", so for example 01tom10 is "Tom Swift and his Motor-Cycle".
And what should we do with a text from a different source that is formatted as HTML? For example, if dyssy10b is the name of the third translation, what should the HTML version be named? dyssy10bh is obvious, but it uses 9 characters.
The problem, of course, is that we are trying to fit a lot of information into an 8-character filename, and as the collection grows, and the number of formats and versions increases, we come across more pressure on filenames, so while the filename is a good guide to the contents, it's not definitive.
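For the names that do follow the rules, the pieces can be pulled apart by working from the end of the name: an optional trailing letter, then the two-digit edition, then whatever is left as the basename. The Python below is a minimal sketch of that idea, not a definitive parser. The function name `split_old_name` is hypothetical, and it makes no attempt to decide whether a leading 7 or 8 is an encoding marker, because names like 80day10 show that it can't be told from the filename alone.

```python
import re

# Basename (as short as possible), two-digit edition, optional trailing letter.
OLD_NAME = re.compile(r"^(?P<base>.+?)(?P<edition>\d{2})(?P<tag>[a-z]?)$")


def split_old_name(name: str) -> tuple[str, int, str]:
    """Split an old-style name (without the .txt or .zip extension).

    Returns (basename, edition, tag), where tag is the version or format
    letter, or an empty string if there is none.
    """
    match = OLD_NAME.match(name)
    if match is None:
        raise ValueError(f"does not look like an old-style PG name: {name!r}")
    return match.group("base"), int(match.group("edition")), match.group("tag")
```

With this sketch, dyssy10b splits into the basename dyssy, edition 10 and the letter b. Whether that letter means a version or a format still has to be read off by a human, which is the point made above: the filename is a guide, not a definition.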
### What is the difference within PG between an "edition" and a "version"?
We give the name "edition" to a corrected file made from an existing PG text. For example, if someone points out some typos in our file of "War and Peace", we will fix them, and, if enough are found to warrant a "new edition", then instead of just replacing the file wrnpc10.txt, we may make a new file wrnpc11.txt, and leave the original alone. A new edition is always filed under the same year and etext number as the original — it's just an update.
We give the name "version" to a completely independent e-text made from the same original book, but a different source. For example, Homer's Odyssey was translated by many different people, but they all worked from the same book. The translations by Lang, Butler, Pope and Chapman are very different, but they all come from the same root.
Thus, these are all "versions" of Homer's Odyssey. We give them all the same basename — dyssy — and each gets a new number, but we keep the original basename, and add a letter to the filename to indicate that they are "versions" of the same original book:
<pre>
dyssy10.txt Butler's Translation
dyssy10a.txt Butcher & Lang's Translation
dyssy10b.txt Pope's Translation
</pre>
The differences don't have to be as extreme as this for us to create a new version. "Clotelle"/"Clotel", for example, was a book published multiple times in English by William Wells Brown, and each time, he changed the text. We preserve three different texts of the same book as different versions: clotl10, clotl10a and clotl10b.
### What is the difference between an "etext" and an "eBook"?
If there is any, it seems to be in the eye of the Marketing Department! Michael Hart started the whole thing, and coined the word "Etext". The term "eBook" is gaining in popularity, even for texts that are not full books, so we've started using that more now.
### What are the "Etext/Ebook numbers" on the texts?
These are simply a series of numbers. We give one to each etext as it is posted, so the earliest etexts have low numbers and later etexts have higher numbers. Etext number 1 is the Declaration of Independence, the first text that Michael Hart typed in to the mainframe that he was using in 1971.
A few numbers are reserved for books that we hope to have in the PG archive someday; for example, 1984 is reserved for Orwell's classic.
When we improve a text by making some corrections, we call it a new EDITION, and it keeps the same etext number, but when we post a different VERSION of the same text, from a different paper book, like different translations of Homer's Odyssey, each new version gets a new etext number.
### What do the month and year on the text mean?
Project Gutenberg sets a production target for itself. The idea is that we try to produce X texts in a month, and in books before #10,000, we dated the texts according to what month of our schedule they appear in. For example, if our target for September 2000 was 50 texts, and we actually produced 55, then the last five would be dated October 2000, and we'd get a head-start on the month. At the time of writing the original FAQ, in July 2002, that target was the publication of 200 books per month. However, our actual production far outpaced our targets, with the result that the "head-start" had accumulated so much that in July 2002, we were releasing books scheduled for March, 2004!
The fact that we were so far ahead of schedule makes this quite confusing for newcomers. If it bothers you, just don't think about it! But at least it's better than being behind schedule. We didn't always produce so many books. In the September 1994 newsletter, Michael Hart wrote:
As always, I am terrified of the prospect of doubling our output to 16 Etexts per month for next year, we really need your help!!!
That was when the Project's target was 8 Etexts per month. Today, our target is heading towards 12 eBooks per day!
In books after number 10,000, we abandoned the "Schedule Month, Year" idea, and the "Release Date" is the actual date on which we posted them.

703
site/how_to/scanning_faq.md Normal file

@ -0,0 +1,703 @@
---
layout: default
title: PG-Scanning FAQ
permalink: /how_to/scanning_faq.html
---
# Scanning FAQ
These guidelines might not reflect all current "best practices." Please visit [Distributed Proofreaders](https://www.pgdp.net) to view forums where best practices are actively discussed and maintained.
<div class="contents">
<ol>
<li><a href="#">What is a scanner?</a></li>
<li><a href="#">What types of scanners are there?</a></li>
<li><a href="#">Which scanner should I get?</a></li>
<li><a href="#">What is ADF?</a></li>
<li><a href="#">Should I get ADF?</a></li>
<li><a href="#">What's a "TWAIN driver" and why do I need one?</a></li>
<li><a href="#">How do I scan a book?</a></li>
<li><a href="#">My book won't open flat enough for a good scan, and I don't want to cut the pages.</a></li>
<li><a href="#">How long does it take to scan a book?</a></li>
<li><a href="#">What scanner settings are best?</a></li>
<li><a href="#">Can I use a digital camera in place of a scanner?</a></li>
<li><a href="#">What is OCR?</a></li>
<li><a href="#">What differences are there between OCR packages?</a></li>
<li><a href="#">How accurate should OCR be?</a></li>
<li><a href="#">Which OCR package should I get?</a></li>
<li><a href="#">What types of mistakes do OCR packages typically make?</a></li>
<li><a href="#">Why am I getting a lot of mistakes in my OCRed text?</a>
<ol class="inner_1">
<li><a href="#"> Scan 1 &#8212; A perfect Scan</a></li>
<li><a href="#">Scan 2 &#8212; A Typical Scan</a></li>
<li><a href="#">Scan 3 &#8212; Guttering and Smaller Print</a></li>
<li><a href="#">Scan 4 &#8212; A Really Bad Case!</a></li>
<li><a href="#">Conclusion</a></li>
</ol>
</li>
<li><a href="#">I got an OCR package bundled with my scanner. Is it good enough to use?</a></li>
<li><a href="#">I want to include some images with a HTML version. How should I scan them?</a></li>
<li><a href="#">I want to include some images with a HTML version. What type of image should I use?</a></li>
<li><a href="#">Will PG store scanned page images of my book?</a></li>
</ol>
</div>
### What is a scanner?
A scanner is a machine that makes an image, a picture of the page that is fed to it, and sends that image to your computer. It only makes an image, like a camera does; it doesn't turn that image into text.
### What types of scanners are there?
The most common type of scanner, the kind you're likely to find in your local computer store, is a flatbed scanner. It has a glass bed usually a bit bigger than Letter paper size (or A4 if you live in Europe! :-) and most of the common models are optimized for typical office correspondence. One of these may cost anything from under $100 to $400, depending on its features, or you can pick them up cheaper second-hand. You use this by placing the paper or book face-down flat onto the glass, and scanning from there. This is the kind of scanner most commonly used by PG volunteers.
Some stores will call sheetfed scanners a different category. These are flatbed scanners with Automatic Document Feed (ADF), but they are fundamentally the same machine, and the ADF sheetfeeder unit may often be bought as an accessory to the flatbed scanner. Recently, a few sheetfed scanners have appeared that are very small, without a full flatbed, just a narrow strip that the paper rolls through. Avoid these for PG work; you often need to be able to scan the book flat.
Hand scanners, as their name implies, are much smaller, and typically very cheap, or even thrown in free. You use these by holding them in your hand and running them along the text like a brush. These are really not intended for PG work; you need a very steady hand movement to get them to scan a page of text into a readable image, and they shouldn't be considered as an option for a 400-page book — scanning and OCR is tough enough without that!
You can think of production scanners as industrial-strength flatbed scanners. The basic mechanisms are the same, but a production scanner will certainly have ADF (sheetfeeder), more features and speed, and be rated for very high volume scanning. Production scanners are used by publishers, businesses with high-volume paper processing needs, and print shops. This last is useful, because you may be able to get some scanning done by a print shop. It can't hurt to ask. If you're thinking about buying one of these babies (and who among us hasn't? :-), be sure you have $2000 or more to spend.
Drum scanners are mostly used by publishers for professional, high-quality artwork. The paper is placed on the surface of a drum that rotates past a fixed scanning head. The drum can be very large. Because the sensors don't have to move, the electronics and optics can be of higher quality, and produce very accurate, high-definition images. They are exactly what you would want for making professional quality scans of old movie posters, but they're expensive, and not very useful for scanning War and Peace to OCR.
Planetary scanners are a different breed to all the others. They are really not scanners at all, but a very high-end digital camera on a stand. You place the book face-up with the pages open, with the camera looking straight down on it. It takes a picture, and passes it on to the connected computer. Planetary scanners are ideal for old, fragile, valuable books that can't be exposed to the stress of normal scanning. They typically come supplied with specialized software, sometimes even their own dedicated computer, and they are very, very expensive — $20,000+.
### Which scanner should I get?
For most people, the answer is simple. Unless you have a lot of money and are sure you will be scanning a lot of books, you should get a normal, consumer-or-office type flatbed scanner, with or without an ADF sheetfeeder.
Having decided that, you're faced with the question of which scanner to buy. More good news! The market in scanners is very competitive, and there are many top-line vendors all watching each others' features like hawks, eager to deliver the highest-spec machine they can. There are only a couple of critical factors in this decision — most of it is about getting the best buy.
For PG work, you really need an optical resolution no less than 300 by 300 dpi (dots per inch), and 600 by 600 is very desirable. Obviously, more is better, but it would be very rare to need more than 600 dpi for PG work. Pay no attention to the "interpolated" or "enhanced" resolution, where the software "guesses" what dots should fill in the gaps — you're only interested in the optical resolution. The good news is that it's very difficult to find modern scanners with a maximum optical resolution of less than 600 dpi, but if you're buying second-hand, you should check this out first.
You will also need a scanning surface on the glass big enough to place your book with two facing pages flat. Again, the good news is that it's very hard to find a flatbed whose scanning surface is too small for PG work, since these scanners tend to be designed to handle office paper, which is about the right size. Most flatbed scanners have scanning surfaces of about 8.5" by 11.5", and this is standard for PG work. If you're working on books with very large pages, you may need to resign yourself to scanning one page at a time, but buying a scanner with a big flatbed for these rare occasions will be much more expensive.
You must make sure that you get a scanner that will connect correctly to your computer. There are four main types of connections commonly available: SCSI, USB, FireWire (IEEE 1394) and parallel.
SCSI (Small Computer Systems Interface) is the highest-quality option, but it means that you need a SCSI card in your computer, and be willing to figure out how to install it. If you're already a SCSI enthusiast, you don't need to read further; if you're not, I suggest you avoid it unless you enjoy tinkering. Production scanners mostly require SCSI.
Parallel-port connections used to be common, as a cheaper, easier alternative to SCSI. Since the introduction of USB they have become rarer, but you will still see them for sale second-hand. These plug into your printer port, and don't require any further engineering skills.
Most new scanners hook up using a USB (Universal Serial Bus) interface, which is a no-muss, no-fuss "plug-in and go" option, but be sure, if you have an old PC, that it actually has a USB port and that your operating system supports it; some older Windows PCs and Macs may not. If your PC doesn't support USB, you should probably look at Parallel-port scanners.
If you're buying second-hand — and used scanners can be very cheap — make absolutely sure that you're getting the original software that came with the scanner, and that that software will work with your current operating system on your PC.
Having ensured that your choice of scanners passes these tests, you're now free to indulge your tastes for any extras you like. Color is nice, but rarely used, since we mostly transcribe older books that have no color printing. Higher resolutions are comforting to have, both since you may occasionally find them useful and because it shows that the optics are of higher quality than you actually need for your PG scans.
If you are nervous about your choice of scanner, or how easy it is to get one working, feel free to contact other PG volunteers for their opinions, as described in the FAQ "How do PG volunteers communicate?" [V.12].
### What is ADF?
ADF stands for Automatic Document Feed, and it's just a jargon term for a sheetfeeder, where you put in a stack of pages to be scanned and go away while that's happening instead of putting in each page manually.
### Should I get ADF?
That depends. Yes, ADF is a great idea, and can be a huge work-saver, and if you have the cash to spend, it may well be worth it. But ADF has a dirty little secret: like any other gizmo with moving parts, it occasionally jams. The sheetfeeders built into these low-cost machines are aimed at handling typical office paper straight from the laser printer — large, smooth, good quality, with perfectly-cut, perfectly-aligned edges. In your PG work, you will be dealing with hundred-year-old pages of various thicknesses and textures, usually much smaller than the sheetfeeder was designed to work with. And you will have to have cut the pages, and may leave ragged edges in doing so.
Under these conditions, you may find that paper often jams in your sheetfeeder, and it defeats the purpose if you have to stand over the scanner while it works, or if you end up having to lift the cover and use your scanner as an ordinary flatbed, or, worse, if your paper gets scrunched up as if a dog had been playing with it.
And of course, in order to feed the pages through, you will have to cut them out of the book, destroying it. (It may be possible, with the help of a bookbinder, to have the pages professionally cut, and later re-bound.)
With ADF, you probably won't actually scan much faster than scanning flat, but you won't have to keep turning over the pages during that time.
So when you're making that choice, think carefully. If money isn't a problem, or you do expect to be working with cut sheets, then go ahead and get a sheetfeeder — it's great when it works! But don't be disappointed when it doesn't work all the time.
### What's a "TWAIN driver" and why do I need one?
A TWAIN driver (see twain.org) is a piece of software that installs onto your Windows PC or Mac and controls your scanner from there. With any modern scanner, there will be a TWAIN driver included in its software package. Once installed, you shouldn't have to think about it again, or even know it's there.
A modern OCR package will usually find your TWAIN driver and use it to control the scanner. This is very handy. There may also be a small scanning package with your TWAIN driver, which will provide a screen where you can make fine adjustments to scanner settings, and start scans. You probably won't need this, since your OCR package will probably do it for you, but it may be useful for semi-manual control of the scanner.
Unix-based systems like Linux use [SANE](http://www.sane-project.org/) rather than TWAIN drivers.
### How do I scan a book?
This depends on whether you have cut the pages out, or whether you are working with an intact book.
If you have cut the pages out, and you have an ADF, then you will obviously feed them through that.
If you don't have an ADF, there usually isn't much point in cutting the pages. Most modern OCR will recognize a "dual-page" or "two-up" scan, and, if yours does, then that's normally the best option. Scanning the uncut book, open and flat, is the most common scanning method used in PG.
Take the book and place it open, flat on the scanner glass. To fit both pages on the glass, you may need to position it lengthways, at 90 degrees to its natural angle. Most OCR software will recognize that the image has been rotated through a right-angle, and will correct it when it reads the text.
A common problem with scanning an opened book is "guttering", which happens when the spine of the book is not pressed flat enough, and the inside of each page, where it meets the spine, is curved against the glass. There's more about this, and an example, scan3, in the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?". To avoid guttering, make sure that the spine is held down throughout the scan. (Some people put a weight on the spine to hold the spine down on each scan; others just press their hand against it.)
Another common problem is light scattering, when too much light gets into the scanner. The scanner head detects light, and you want the only light source to be the scanner itself, not ambient room light or sunlight. Scanners have covers that are intended to be closed while scanning, for a controlled light level, but when you're scanning a book held open and flat, you can't close the cover fully. In a bad case, this can leave the scan looking like overexposed film; you can see an example in scan4 of the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?". If this happens, just make sure that your room is dim while you scan, and don't have a ray of bright sunlight bouncing around the inside of the scanner!
Occasionally, when scanning cut pages with very thin paper, you may get a shadow of the text on the other side showing through. If this happens, you can try covering the inside of the scanner lid, which is normally white, with a piece of black paper.
Many modern OCR packages will control the scanner automatically, and you may be able to set your OCR so that it does an automatic timed scan every, say, 30 seconds. This is a great timesaver, since you don't have to go back and forth between the scanner and the screen. Just set your timer, hold down the book for the scan, take the book up, turn the page, put it down again, and wait for the next scan to start. Set the timer for whatever interval you are comfortable with. Highly recommended, if your OCR or scanning package can do it.
By default, most scanners will always scan the entire area of the flatbed, but usually, your book will occupy only about half of it. Look for a setting on your OCR or scanning package which allows you to reduce the area that the head scans. Just scan enough to get the image of your pages. This makes the time for each scan and subsequent OCR recognition shorter, and in a really good case can cut your total scanning and OCR time in half.
Scanning all pages together is usually fastest, but you may prefer to scan each double-page, then correct it in your OCR package's editor, then scan the next. This is a more leisurely approach favored by some volunteers.
### My book won't open flat enough for a good scan, and I don't want to cut the pages.
Well, then, you have a difficult choice to make, but you do still have several options:
You can accept a poor-quality scan, and spend a lot of time fixing up the guttering on the margins.
You can bite the bullet, and cut the pages.
You can type the book, or find a typist who will work on it for you.
You can find a print shop or bookbinder who will cut the pages professionally, and re-bind the book when you're done. You may even replace it with a fresh new binding that will give the book a new lease of life.
Take your choice.
Most books will open flat enough for an adequate scan, though you may have to put stress on the spine to do it.
If you have a really precious book, and you can't find a typist, you might consider the options of a digital camera [S.11] or finding someone with a planetary scanner [S.2] to scan it for you.
Michael Hart said: "I would give up every book I own, including my first edition of the OED, my Civil War edition of the Merriam Webster's Unabridged, etc., etc., etc., so everyone could use it any time they wanted rather than that only I or my friends could use it . . . and obviously I could use it too."
Fortunately, it rarely comes to that.
### How long does it take to scan a book?
Putting the book flat on the glass means that you scan two pages at a time. A reasonable modern scanner will scan the area of two typical pages at 400dpi in anywhere from 20 to 40 seconds — let's call it 30 seconds for two pages. That's four pages a minute, or 240 pages an hour. You could reasonably get through a 400 page book in two hours, even allowing for an occasional break or glitch.
Of course, you should also allow time for scanning a few trial pages with different settings before you start, to decide which settings to use. Ten minutes spent here can save you hours of proofreading time.
There are two big tips that can save you a lot of scanning time:
If your OCR or scanner control package has a timer setting, that automatically keeps scanning without user intervention, you can forget about the screen and just keep turning the pages as needed.
You should set your scanner just to scan the area the book covers on the glass. By default, your software will probably scan the full area of the glass, and usually, your book won't need that. By scanning only what you need, you may typically save anything from 20% to 70% of the time taken to scan the full area. If your book is small enough to open flat across the scanner instead of "down" the side, 400 pages an hour is not out of the question with this trick.
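If you'd like to plug in your own timings, the arithmetic above can be sketched in Python as follows. The function names are made up for this example, and it assumes every scan takes the same time and captures two facing pages unless you say otherwise.

```python
def pages_per_hour(seconds_per_scan: float, pages_per_scan: int = 2) -> float:
    """Pages scanned in an hour, given the time for one pass of the scanner."""
    return 3600 / seconds_per_scan * pages_per_scan


def hours_for_book(pages: int, seconds_per_scan: float, pages_per_scan: int = 2) -> float:
    """Pure scanning time for a whole book, ignoring breaks and trial scans."""
    return pages / pages_per_hour(seconds_per_scan, pages_per_scan)
```

At 30 seconds per two-page scan, `pages_per_hour(30)` comes to 240 pages an hour, and `hours_for_book(400, 30)` to a little under an hour and three quarters. Getting each scan down to 18 seconds by trimming the scanned area is what it takes to reach 400 pages an hour.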
### What scanner settings are best?
For a given book, scanner, PC and OCR software, there must be some "ideal" scanner settings, but if you change any of these components, the ideal scanner settings will change with them. Some OCR packages recognize greyscale better than black and white; some don't like greyscale at all. Some books have small print needing higher resolution; some are speckled so that higher resolution leads to more errors.
Obviously, the best settings also depend on the individual book, and some books will require you to get downright creative with the settings, but most PG books are scanned in Black and White or greyscale, somewhere between 300dpi and 600dpi.
This decision is a trade-off between speed and accuracy, and an illustration of the difference between principle and practice. In principle, a true-color, 9600dpi scan is a much better rendering of the page than a B&W 400dpi scan. In practice, all that extra information doesn't usually help the OCR make better distinctions between letters, and the larger and more detailed the scan, the longer it takes to make the scan, the more disk space the image file takes, and the more processing time and memory the OCR package needs to recognize it.
A further paradox emerges when considering higher vs. lower resolutions: depending on the paper and ink quality, you may see more errors start to appear on very high resolution scans. These are caused by small imperfections in the paper or ink spots that show up on the high-res scan, and that the OCR tries to interpret as letters or punctuation.
So, in summary, bigger is better, but only up to a point.
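To get a feel for how quickly the image grows, here is a rough Python sketch of the uncompressed size of a scan. The function name `raw_scan_bytes` is hypothetical, and the figures are for raw pixel data only; real image files are usually compressed and will be smaller, so treat the results as an upper bound rather than what you will see on disk.

```python
def raw_scan_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of a scan of the given area.

    bits_per_pixel is 1 for black and white, 8 for greyscale,
    and 24 for true color.
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8
```

Assuming a Letter-sized area of 8.5 by 11 inches, `raw_scan_bytes(8.5, 11, 400, 1)` is a little under 2 million bytes for a black and white scan at 400dpi, while `raw_scan_bytes(8.5, 11, 9600, 24)` is nearly 26 billion bytes for true color at 9600dpi, almost 14,000 times as much data for the OCR to chew through.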
Brightness is a setting often neglected, that can make quite a big difference to your results. Look at the scanned image: if you see lots of dark patches, make your scan lighter; if your letters appear thin and faded, make your scan darker.
See the FAQ ["Why am I getting a lot of mistakes in my OCRed text?"]() for some typical scans and results.
### Can I use a digital camera in place of a scanner?
Digital cameras are getting better resolution all the time, and some volunteers have experimented with making a kind of home-made planetary scanner from a digital camera and a stand. So far, the results don't quite match a dedicated scanner, but as digital cameras improve, this may become a common option. One problem, which planetary scanners use specialized software to correct, is that the natural curve of the pages near the middle of the book tends to give a foreshortened aspect to the letters there, which can cause problems for OCR software, much as guttering does.
Whatever the current problems, the prospect of using digital cameras is exciting, because it will mean that non-typists will be able to produce old books borrowed from libraries without worrying about scan quality vs. damage to the spine.
### What is OCR?
OCR stands for Optical Character Recognition. This is very important software that looks at the picture of the page that your scanner has supplied, and turns it into text.
When the scanner delivers the image of the page, that image is only a picture. You can't, for example, search for text in it, or edit the text to add a blank line. Your editor or word processor can't work with it. The OCR program does the job of "reading" and "typing" the image for you. OCR packages call this "reading" or "recognizing".
### What differences are there between OCR packages?
One word: huge. All OCR packages do the same job, but they do it in different ways, with different features, and with different levels of accuracy. OCR can save you a lot of time, or cost you a lot of time. It's really worth putting some effort into making sure you get the right OCR package, and, once you have it, into understanding how to use it. It'll save you time in the long run.
### How accurate should OCR be?
OCR packages commonly say that they are "99%+" accurate, or something like that. Let's analyze what that actually means: say there are 1,000 characters (letters) on each page, then with 99.9% accuracy, you would expect to have to make 1 correction per page. With 99% accuracy, that would be up to 10 corrections per page. And in a 400-page book, this all adds up.
But there's a "Your Mileage May Vary" clause built into that. Typically, the manufacturers test their OCR on fresh, laser-printed or press-printed copy with perfect scans, and this is fair, since they are aiming their products primarily at businesses that process these kinds of materials. You are not dealing with fresh print; you're dealing with old books, yellowed, spotted, marked, imperfectly printed in the first place, and possibly using unfamiliar fonts. And it's unlikely that you will have the patience to get a perfect scan on every page. The result is that the accuracy of OCR for typical PG work doesn't match the accuracy on images of perfect, fresh paper.
Apart from the scan quality, OCR also has to contend with different fonts and sizes for the letters.
However, if you're getting more than 10 errors per page, you should look at some examples of OCR in the FAQ ["Why am I getting a lot of mistakes in my OCRed text?"]().
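The arithmetic behind the per-page figures above can be sketched in a few lines of Python. The function name `expected_corrections` is made up for this illustration, and the 1,000 characters per page and 400 pages are the same round numbers used in the text, not measurements.

```python
def expected_corrections(chars_per_page, accuracy, pages=1):
    """Expected number of wrong characters for a given OCR accuracy (0.0 to 1.0)."""
    return chars_per_page * (1.0 - accuracy) * pages


CHARS_PER_PAGE = 1000  # round figure from the text
BOOK_PAGES = 400       # round figure from the text

for accuracy in (0.999, 0.99):
    per_page = expected_corrections(CHARS_PER_PAGE, accuracy)
    per_book = expected_corrections(CHARS_PER_PAGE, accuracy, BOOK_PAGES)
    print(f"{accuracy:.1%} accurate: about {per_page:.0f} per page, "
          f"about {per_book:.0f} in the whole book")
```

A real page of print often holds more than 1,000 characters, so treat these numbers as a lower bound rather than a prediction.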
### Which OCR package should I get?
The accuracy of OCR software has improved enormously in the last few years, and OCR technology looks likely to keep improving even faster than software in general. Further, there is competition in this area, and products leapfrog each other with new versions regularly. The brands most commonly mentioned by PG volunteers (mid-2002) are Abbyy, OmniPage and TextBridge [P.1], and trial versions of all three have been available for download over the Web, and may still be when you read this. [Warning: these are big downloads — 40MB or more.]
Most common OCR packages will offer two main working options: to scan a page and view/edit the resulting text on the spot before saving, and to scan a whole batch of pages together and view/edit them all later. Some people like to fix up one page at a time; others prefer to get all of the OCR work done at once, then get the whole text into their editor. Most OCR software will cater for both, and if this is important to you, you should check that the OCR you're buying supports the way you want to work.
If you intend to work in a language other than English, make sure that the OCR you buy supports the characters in your language.
Some OCR software has a "training" or "learning" mode. Using this mode, it scans and "reads" or "recognizes" a page, then you correct that page, and the OCR "learns" from its mistakes and tries to do better on the letters it misread when it recognizes the next page. If you're dealing with a very rare font, this can make a difference to your OCR quality, but modern OCR packages come with enough inbuilt font knowledge for most languages, and you probably won't need this.
If possible, try a couple of OCR packages before you decide. If you want opinions on specific versions, contact other PG volunteers and ask for their opinions, as described in the FAQ ["How do PG volunteers communicate?"]().
### What types of mistakes do OCR packages typically make?
Each text has its own peculiarities, but there are a number of well-known scanning errors you will be dealing with all the time.
Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. There are also usually a number of extra or missing spaces in the e-text.
The problem of quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested.
The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and often, single or double quotes may be mistaken for one of these.
Lower-case m is often mistaken for rn or ni.
The letter pairs h/b and e/c are commonly misread, and these are probably the hardest of all to catch, since ear/car, eat/cat, he/be, hear/bear, heard/beard are all common words which no spell-checker will flag as problems.
For example:
<pre>
" Hello1' caIled jirnmy breczily. 11Anyone home ? "
There seemed to he no-oneabout. Only tbe eat beard him."
</pre>
should read:
<pre>
"Hello!" called Jimmy breezily, "Anyone home?"
There seemed to be no-one about. Only the cat heard him.
</pre>
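Some of these mistakes leave tell-tale patterns that a short script can point out for you. The following Python sketch is only an illustration: the three checks and the name `flag_suspects` are assumptions, not part of any PG tool, and it will not catch real-word errors like he/be or eat/cat.

```python
import re

# Per-word checks: (description, pattern). These rules are illustrative only.
WORD_CHECKS = [
    ("digit next to a letter", re.compile(r"[A-Za-z]\d|\d[A-Za-z]")),
    ("capital inside a word", re.compile(r"[a-z][A-Z]")),
]
# Whitespace directly before punctuation, as in "home ?".
SPACE_BEFORE_PUNCT = re.compile(r"\s+[?!;:,.]")


def flag_suspects(line):
    """Return (text, reason) pairs for patterns that often indicate OCR errors."""
    findings = []
    for word in line.split():
        for reason, pattern in WORD_CHECKS:
            if pattern.search(word):
                findings.append((word, reason))
    for match in SPACE_BEFORE_PUNCT.finditer(line):
        findings.append((match.group().strip(), "space before punctuation"))
    return findings


sample = "\" Hello1' caIled jirnmy breczily. 11Anyone home ? \""
for text, reason in flag_suspects(sample):
    print(f"{text!r}: {reason}")
```

On the first sample line this flags `Hello1'`, `caIled`, `11Anyone` and the stray space before the question mark, but it passes over `jirnmy` and `breczily`, and it finds nothing at all in the second line. That is exactly why the h/b and e/c confusions are so hard to catch: careful proofreading is still needed.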
### Why am I getting a lot of mistakes in my OCRed text?
If you're new to OCR, you may have come with the idea that OCR is almost perfect, and just makes a few mistakes now and then. No. It's slightly amazing that OCR works at all, and when it does, it isn't perfect.
You might reasonably expect to average anything up to 10 errors per page for typical PG work; if you're seeing more, then there is a problem with
- your printed book
- your scan, or
- your OCR package
Problems with the printed book fall into three categories: bad printing, age, and unusual fonts. Bad printing consists of problems like too much or too little ink on the press at the time the book was printed, and irregularities in the print where the metal type was damaged. Age causes yellowing — even browning — of the paper, and faded print. Unusual fonts may be hard for OCR to recognize, and very tightly-spaced print may make adjacent letters seem to touch, which confuses OCR software.
There are many ways for you to have problems with your scan. Obviously, if your scanner is defective or the glass is dirty, you will notice it immediately, but there are many mistakes you can make that will result in a poor-quality image, and cause later problems for your OCR.
You may not be able to control the quality of the paper you have to work with, but there is a lot you can do about the quality of your scan.
The two mistakes that people inexperienced with scanners most commonly make are not holding the spine down firmly enough to get a flat image of the paper, and not setting the brightness correctly, or letting too much light get in. In your early scans, watch out for these problems.
First, if you haven't already, read the FAQ ["How do I scan a book?"]() and check that you're following the basic recommendations there.
Now let's look at some samples, and see the kinds of problems you might encounter.
A disclaimer about these samples: specific OCR packages are named, but you should not take these as a fair and comprehensive comparative review of the software. The object of this exercise is to show typical scanning conditions and problems, and the resulting OCR output. OCR packages have quite a range of variance within themselves, may work better on some texts than others, may improve with "training" or different settings, and I have even seen the same OCR package produce different text from the same image with the same settings! Further, since OCR quality is improving rapidly, and packages leapfrog each other in quality, the next version of a particular brand may be vastly better than any of the software mentioned here. Of particular interest in this context is the leap in quality between OmniPage 10 and OmniPage 11.
#### Scan 1 — A Perfect Scan
Scan 1 is as near to a perfect scan as you can expect in PG work. It comes from The Founder of New France by Charles W. Colby. It is only a 300 dpi image, but given the quality of the print and of the scan, 300 dpi is all we need. Ironically, it comes from Gardner Buchanan, who complains about the age and infirmity of his scanner in his description of how he produces a text. The moral is that you don't have to have the latest equipment to get good results!
It doesn't really need any comment, and all of the packages except gocr rendered it perfectly. Note the fake "space" before the semicolon — if you look closely at the image, you will see why the OCR packages mistook it for a full space, as discussed in the FAQ [V.104] "My book leaves a space before punctuation like semicolons, question marks, exclamation marks and quotes. Should I do the same?"
<pre>
Champlain was now definitely committed to
the task of gaining for France a foothold in
North America. This was to be his steady
purpose, whether fortune frowned or smiled.
At times circumstances seemed favourable ;
at other times they were most disheartening.
Hence, if we are to understand his life and
character, we must consider, however briefly,
the conditions under which he worked.
</pre>
gocr 0.3.6 converted this as:
<pre>
Champtain was now definitely committed to
the task of gaining for France a foothotd in
_orth America. This was to be his steady
purpose, whether fortune frowned or smiled.
At times circumstances seemed favourable .,
at other times they were most disheartening.
_ence, if we are to understand his life and
character, we must consider, however brieRy,
the conditions under which he worked.
</pre>
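If you want to compare two versions of a passage like this without reading them side by side, Python's standard `difflib` module can list the words that differ. This is a minimal sketch: the helper name `word_differences` is an assumption, and it only uses the first sentence of the sample above.

```python
import difflib


def word_differences(reference, ocr_text):
    """Return (reference words, OCR words) pairs wherever the two texts disagree."""
    ref_words = reference.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((" ".join(ref_words[i1:i2]),
                                " ".join(ocr_words[j1:j2])))
    return differences


reference = ("Champlain was now definitely committed to the task of "
             "gaining for France a foothold in North America.")
gocr_output = ("Champtain was now definitely committed to the task of "
               "gaining for France a foothotd in _orth America.")

for expected, found in word_differences(reference, gocr_output):
    print(f"expected {expected!r}, OCR gave {found!r}")
```

Of course, this only helps when you already have a trusted version to compare against, for example when judging one OCR package against another on a page you have proofed by hand.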
#### Scan 2 — A Typical Scan
Scan 2 is a paragraph from Baroness Orczy's **Castles in the Air**. Notice the ink-splotch above the capital "I" in the first line, which will give our OCR some problems. The page is also unevenly inked elsewhere, and I have scanned it with the brightness level a bit too high.
I have made two separate scans, one at 300 dpi and one at 400 dpi, both black and white. The page was cleanly cut, and carefully placed straight onto the scanner glass with the cover down. The original print is somewhere between the size of Times New Roman 10 and 11, with capital letters about 2.2 millimeters high, but better and more clearly spaced. These scans are fairly typical for PG work. Because of the relatively large letters, and the reasonable scan, there isn't much difference between the text produced from the 300 dpi scan and the 400 dpi scan.
I actually cut this book to get the pages out so that I could feed it through my ADF, but the paper is so thick and textured that it sticks together, and jams when feeding through. The thick, absorbent paper, combined with the uneven inking, means that, no matter how good the scan, any OCR has to contend with the irregular edges of letters, which are clearly visible even at 300 dpi.
Here is the output for these scans from some OCR software packages. I changed just one thing: Abbyy recognized the em-dashes as such, and output them as the Codepage 1252 em-dash character, which isn't available in ASCII, so I converted each one to the PG standard of two hyphens.
Abbyy FineReader 6:
<pre>
Yes, indeed, I was on the track of M. Aristide Fournier,
and of one of the most important hauls of enemy goods
which had ever been made in France. Not only that. I
had also before me one of the most brutish criminals it
had ever been my misfortune to come across. A bully, a
fiend of cruelty. In very truth my fertile brain %vas
seething with plans for eventually laying that abominable
ruffian by the heels: hanging would be a merciful pun-
ishment for such a miscreant. Yes, indeed, five thousand
francs--a goodly sum in those days, Sir--was practically
assured me. But over and above mere lucre there was
the certainty that in a few days' time I should see the
light of gratitude shining out of a pair of lustrous blue
eyes, and a winning smile chasing away the look of
fear and of sorrow from the sweetest face I had seen for
many a day.
</pre>
<pre>
Yes, indeed, Twas on the track of M. Aristide Fournier,
and of one of the most important hauls of enemy goods
which had ever been made in France. Not only that. I
had also before me one of the most brutish criminals it
had ever been my misfortune to come across. A bully, a
fiend of cruelty. In very truth my fertile brain was
seething with plans for eventually laying that abominable
ruffian by the heels: hanging would be a merciful pun-
ishment for such a miscreant. Yes, indeed, five thousand
francs--a goodly sum in those days, Sir--was practically
assured me. But over and above mere lucre there was
the certainty that in a few days' time I should see the
light of gratitude shining out of a pair of lustrous blue
eyes, and a winning smile chasing away the look of
fear and of sorrow from the sweetest face I had seen for
many a day.
</pre>
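The em-dash conversion described above for the Abbyy output is easy to do with a script. Here is a minimal Python sketch, assuming the OCR text was saved as Codepage 1252 bytes. The function name `to_pg_dashes` is made up for this example, and it handles only the em-dash, not any other non-ASCII characters your OCR might produce.

```python
def to_pg_dashes(raw_bytes):
    """Decode Codepage 1252 text and replace each em-dash with two hyphens."""
    text = raw_bytes.decode("cp1252")
    # In Codepage 1252 the em-dash is byte 0x97, which decodes to U+2014.
    return text.replace("\u2014", "--")


ocr_bytes = b"francs\x97a goodly sum in those days, Sir\x97was practically"
print(to_pg_dashes(ocr_bytes))
```

Most text editors can do the same job with a simple search and replace, so use whichever you find more comfortable.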
gocr 0.3.6:
<pre>
__e_, indeed, f___as on_the track of h_. hristide Fournier,
3nd of one of the most im__ant hau1s of enem)_ goods
___hich had e__er been made in France. h?ot onl3_ that. I
had a1so before me one of the most brUtish crimînat_s it
h__4 e___er been m31 misfortune to co_me acro__3. A bu113_, a
tiend oí cruelt__. In very truth m3_ fertiIe brain ___as
s_e_1_::_g __-ith planS for e__entua113_ _ay:ng that abominab1e
ru_iin b.__ t1_e hee1s . hanginig __ou1d be a n_erciful pun-
i;__,i__gnt íor such a miscreanf. yes, in_i__ee3, fj_1e thou3and
francî-a b_ood13_ sum in those days, _ir-_vas practica1l3_
a3_ured me. _ut o___er and above n_ere lucre there was
the certaint_v that in a few_ da3_s' ti_e I shou1d see the
lib_ht of gratitude shininb_ out of a pair _f _usLtrous btue
e3_e3_, and a ___inning smi1e chasing a__ay the Ioo_ of
_ear and of sorrow from the s__eetest iace T had Seen fof
man)_ a day.
</pre>
<pre>
Yes, indeed, f___as on the track of h__. Ariseide Fournier,
and of one of the most important hau1s _f enemy goods
___hich had ever been made in France. NoEUR on1y that. I
had also before me one of the most brutish crimina1s it
h_ad ever been my misfo__tune to come acros__. A bu11y, a
fiend of crue1ty. _n very truth my fertib brain _vas
seeî3_:i_g __ith plans for e__entua11p 1aying _at abom_in_ ab1e
ru_an by the heels. hanging _____ou1d _ a merciful pun-
iï_h_ment for such a miscreant. Yes, indeed, five thou__and
f_ancs-a b_ood1y sum in those days, _ir-_vas practica1ly
a3îured me. But over and above mere _ucre th.ere was
th_e certainty that in a few days' ti_e _ shou1d see the
1i__t of gratjtude shining out of a pair o_, _userous b1ue
b .
e__es, and a __inning smi1e chasing away the l_k of
_,ear and of sorrow from the s___,eetest face _ _ad _.een _o_
many a day. . .
</pre>
Recognita Standard 3.2.7AK:
<pre>
~'es, indeed, ~w-as on the track of ltT. Aristide Fournier,
and of one of the most important hauls of enemy goods
"=hich had ever been made in France. ~Tot only that. I
ha~i also before me one of the most brutish criminals it
had ever been my misfortune to come across. A bully-, a
fiend of cruelty. In very truth my fertiIe brain was
s; ething w-ith plans for eventually iaying that abominable
ruffian by the heels : hanging ~-ould be a merciful pun-
ishment for such a miscreant. ires, indeed, five thousand
franes-a goodly sum in those days, Sir-was practically
as~ured me. But over and above mere lucre there was
thP certainty that in a few days' time I should see the
light of gratitude shining out of a pair of lustrous btue
ey·es, and a winning smile chasing away the hk of
fear and of sorrow from the sweetest face I had seen for
many a day.
</pre>
<pre>
Yes, indeed, l~was on the track of h~i. Aristide Fournier,
and of one of the most important hauls of enemy goods
w~hich had ever been made in France. lVot only that. I
had also before mP one of the most brutish criminals it
had ever been my misfortune to come acrass. A bully, a
fiend of cruelty. In very truth my fertile brain was
seething with plans for ez~entually laying that abomin_ able
ruffian by the heels : hanging ~~.-ould be a merciful pun-
ishment for such a miscreant. Yes, indeed, five thousand
f:ancs-a goodly sum in those days, Sir-was practically
assured me. But over and above mere lucre there was
the certainty that in a few days' time I should~ see the
Iight of gratitude shining out of a pair of iEustrous blue
eyes, and a w inning smile chasing away the Iook of
fear and of sorrow from the s"-eetest face ~ had seen ~'or
rr~any a day.
</pre>
OmniPage Pro 10:
<pre>
Yes, indeed, twas on the track of 11T. Aristide Fournier,
and of one of the most important hauls of enemy goods
which had ever been made in France. Not only that. I
ha(i also before me one of the most brutish criminals it
had ever been my misfortune to come across. A bully, a
fiend of cruelty. In very truth my fertile brain was
seething with plans for eventually laying that abominable
ruffian by the heels: hanging would be a merciful pun-
ishment for such a miscreant. Yes, indeed, five thousand
francs-a goodly sum in those days, Sir-was practically
assured me. But over and above mere lucre there was
the certainty that in a few days' time I should see the
light of gratitude shining out of a pair of lustrous blue
eyes, and a winning smile chasing away the look of
fear and of sorrow from the sweetest face I had seen for
many a day.
</pre>
<pre>
Yes, indeed, fwas on the track of h-I. Aristide Fournier,
and of one of the most important hauls of enemy goods
which had ever been made in France. Not only that. I
had also before me one of the most brutish criminals it
had ever been my misfortune to come across. A bully, a
fiend of cruelty. In very truth my fertile brain was
seething with plans for eventually laying that abominable
ruffian by the heels: hanging would be a merciful pun-
ishment for such a miscreant. Yes, indeed, five thousand
francs-a goodly sum in those days, Sir-was practically
assured me. But over and above mere lucre there was
the certainty that in a few days' time I should see the
light of gratitude shining out of a pair of lustrous blue
eyes, and a winning smile chasing away the look of
fear and of sorrow from the sweetest face I had seen for
many a day.
</pre>
OmniPage Pro 11:
<pre>
Yes, indeed, twas on the track of AT. Aristide Fournier,
and of one of the most important hauls of enemy goods
which had ever been made in France. Not only that. I
had also before me one of the most brutish criminals it
had ever been my misfortune to come across. A bully, a
fiend of cruelty. In very truth my fertile brain was
seething with plans for eventually laying that abominable
ruffian by the heels: hanging would be a merciful pun-
ishment for such a miscreant. Yes, indeed, five thousand
francs-a goodly sum in those days, Sir-was practically
assured me. But over and above mere lucre there was
the certainty that in a few days' time I should see the
light of gratitude shining out of a pair of lustrous blue
eyes, and a winning smile chasing away the look of
fear and of sorrow from the sweetest face I had seen for
many a day.
</pre>
<pre>
Yes, indeed, fwas on the track of h-I. Aristide Fournier,
and of one of the most important hauls of enemy goods
which had ever been made in France. Not only that. I
had also before me one of the most brutish criminals it
had ever been my misfortune to come across. A bully, a
fiend of cruelty. In very truth my fertile brain was
seething with plans for eventually laying that abominable
ruffian by the heels: hanging would be a merciful pun-
ishment for such a miscreant. Yes, indeed, five thousand
francs-a goodly sum in those days, Sir-was practically
assured me. But over and above mere lucre there was
the certainty that in a few days' time I should see the
light of gratitude shining out of a pair of lustrous blue
eyes, and a winning smile chasing away the look of
fear and of sorrow from the sweetest face I had seen for
many a day.
</pre>
TextBridge Millennium Pro:
<pre>
Yes, indeed, rwas on the track of M. Aristide Fournier,
and of one of the most important hauls of enemy goods
which had ever been made in France. Not only that. I
hail also before me one of the most brutish criminals it
had ever been my misfortune to come across. A bully, a
fiend of cruelty. In very truth my fertile brain was
seething with plans for eventually laying that abominable
ruffian by the heels: hanging would be a merciful pun-
ishment for such a miscreant. Yes, indeed, five thousand
francs-a goodly sum in those days, Sir-was practically
assured me. But over and above mere lucre there was
the certainty that in a few days' time I should see the
light of gratitude shining out of a pair of lustrous blue
eyes, and a winning smile chasing away the look of
fear and of sorrow from the sweetest face I had seen for
many a day.
</pre>
<pre>
Yes, indeed, f was on the track of M. Aristide Fournier,
and of one of the most important hauls of enemy goods
which had ever been made in France. Not only that. I
had also before me one of the most brutish criminals it
had ever been my misfortune to come across. A bully, a
fiend of cruelty. In very truth my fertile brain was
seething with plans for eventually laying that abominable
ruffian by the heels: hanging would be a merciful pun-
ishment for such a miscreant. Yes, indeed, five thousand
francs-a goodly sum in those days, Sir-was practically
assured me. But over and above mere lucre there was
the certainty that in a few days' time I should see the
light of gratitude shining out of a pair of lustrous blue
eyes, and a winning smile chasing away the look of
fear and of sorrow from the sweetest face I had seen for
manyaday.
</pre>
#### Scan 3 — Guttering and Smaller Print
Scan 3 is a paragraph from The Egoist by George Meredith. It was scanned in a dim room, with the scanner cover open and the book held open, flat against the scanner glass. However, the spine was not pressed firmly enough against the glass, and as a result you can see that the words on the left-hand edge (which were near the spine) appear to be slanted, a bit distorted, and not well lit. This problem is familiar to people who scan for PG — everybody gets distracted sometimes, and fails to keep enough pressure on the spine. As you see from the results below, it caused problems for all of the OCR packages on the words affected. If you find this kind of "guttering" regularly in your own scans, where the characters near the spine are not being recognized correctly by your OCR, you need to make sure that your book is down as flat as possible before making a scan.
I have made two separate scans, one at [300 dpi]() and one at [400 dpi](), both black and white. Because of the smaller size and the guttering problem, the 400 dpi scan made for better quality text in this case.
Here's the output from the sample OCR:
Abbyy FineReader 6:
<pre>
NEITHER Clara nor Vernon appeared at the mid-day table,
n Middleton talked with Miss Dale on classical matters,
like a good-natured giant giving a child the jump from
stone to stone across a brawling mountain ford, so that an
uncdified audience might really suppose, upon seeing her
over the difficulty, she had done something for herself. Sir
\Villoughby was proud of her, and therefore anxious to
soltlo her business while he was in the humour to lose her.
He hoped to finish it by shooting a word or two at Vernon
before dinner. Clara's petition to be set free, released from
him, had vaguely frightened even more than it offended hia
nrido.
</pre>
<pre>
NEITHER Clara nor Vernon appeared at the mid-day table.
Dr. Middleton talked with Miss Bale on classical matters,
like a good-natured giant giving a child the jump from
stone to stone across a brawling mountain ford, so that an
unedified audience might really suppose, upon seeing her
over the difficulty, she had done something for herself. Sir
"VVilloughby was proud of her, and therefore anxious to
settle her business while he was in the humour to lose her.
He hoped to finish it by shooting a word or two at Vernon
before dinner. Clara's petition to be set free, released from
him, had vaguely frightened even more than it offended his
pride.
</pre>
gocr 0.3.6:
<pre>
__,,,____,_ Cl,_I._c nor Vernon a__e_Ped _t tl_le _id_da_ tab1e_
_, _ii_(__etoiI f,,_lk(;cl with _MiSs _ale _U_1d_ abS8iG_l I_i_t_t_l.__
i,_i,;,_ .,, _(_u_-i,L_t_ii.e(l 6iiLIblt 6'7_V. ill_ _ C 'll . tf e__Ul__b rU_l
gt(),ii_, tu _fj(),I(, ,_uruSS.,__ T__ Illl_ g UlOUUt_lU o_ _ 8O .t _' t_ail
u,,_,_ifj(;il ;,_i((ic,IGG l_i_' lt re_ y 8UE)_OB_'_ U_Oll 8eelll6 lttr
_,__i. t_ic (li__icu1ty, SIIe t1_d iluI_e 8ol_eth_ng_ fo_ be_.Self. _i__
_ji___()_i___lIl)y w,,s prui_il of heT_ and k__eTefope an_iouS to
_(_(.__u l___i. i)i__, ii,ess wIlile he Wa8 in the hU_ouT to luse Iier_
j__ l_()_)(_(l t() tiiIish it b_ ShOOtiltg a WOTd o__ t_O &t Verno_
_o__(),__ (li,_iIci._ Cl__T_'S _eti_tio_ tO be Set fTee_.Te1ea8ecl fro_
)ii))),, lIL_Ll v_b__uely f_.ighteUe eVen _OTe kba_ lt OfEe_ded hi_
pi_i..(l_u- . _ , , — .___ _ _,- - -__-
</pre>
<pre>
________ Cl__i.a nop Vernon appeared &t t'h_e _id_day t__le_
D_. _id(lle_oi_ t_lked with Miss _ale ,on _ _Ssi__l __i tt_r_'_
iij_e _ 6ood-n___tLi_.ed 6iai_t 6_i_ing & Ghild the ___np _'_.on_
_tune to _tone aGro_S a braWlin( __ inOU__taiß _foPd_ So t2_at a__
u__p,(_ified ___idiei_Ge _ni62it real y 8uppO.8e_ upon _seeii_6 l_e_
o______ the difhculty_ she had done _o_neth_n6 fop ber_elf_ _i_
_viljoli____k)y w__s proud of heT, and the_efo_e an_iouS to
___.tle li__i. i)u__inesS Whike he W_S î_ the hum'ou_ to_ lose her_
__e l_op(_d to finish it by 8hooting a wopd o_ tWo ak Verno__ _
_eforR_ _(in_icr_ Clara's petition to _ Set _free, releaSed fro_
)ii__, h_d va6uely frigbte_ed eve_ _ore tban it o_e_ded hiD
pi.icle. -. - - - - - '
</pre>
Recognita Standard 3.2.7AK:
<pre>
~rFr~rrmx Clara nor Vernon apneared at the mid-da~'table.
Dr. bLidrlleton talkc;d wi.th Miss Dale vn elassieal matters,
like a ~n~a-mZtured giant gi.ving a child th© jucnp frvm
stonc to stone across a brawling mounta,in ford, so that au
uiicilificd .ruciicucc mil;·ht really suppasc, upon seeixig hor
·n~er thc ciillicul.ty, she had clouo something for herself. Sir
~Villcm;;lrlry wvs proua of her, and therefors angiaus to
sct.tla lrur tn~sincss while he was in the humoar to lose her.
lle lu,hcot to iinish it by shooting a word ar two at Vernon
bol'ore ~linncr. Clara's petition to bo set froe, released £rom
JGGnt., hvd vagucly frighteued even more than it offended hia
ri~le.
p
</pre>
<pre>
NEITfi~R Clara nor Vernon appeareci at the xnid-day table.
Dr. Middleton talked with Miss Dalo on classics,l rnatters',
like a good-natured giant giving a child the jtimp from
stone to stone across a brawling mountain ford, so that an
unedified audience might really suppose, upon ~ seeing her
over the difficulty, she had done something for herself. Sir
yillon ;hby was proud of her, and therefore anxiotis to
scttle luer business while he w~as in the hurxiour to lose her:
He hoped to finish it by shooting a word or two at Vernon
before dinner. Clara's petition to be set free, released from
jcLm, had vaguely frighteued even more than it offended his
pride.
</pre>
OmniPage Pro 10:
<pre>
NF r~rn,Px Clara nor Vernon appeared at the mid-dap table.
Dr. Middleton talked with Miss Dale on classical matter,
like .t good-natured giant giving a child the jump from
stone to stone across a brawling mountain ford, so that an
uneVified audience might really suppose, upon seeing her
over the difficulty, she had done something for herself. Sir
jV;llo,r;;lrl>y was proud of her, and therefore anxious to
set.tlo lror Uusiness while he was in the humour to lose her.
Ile. lropcol to finish it by shooting a word or two at Vernon
bol'ore dinner. Clara's petition to beset free, released from
)zinc, had vaguely frightened even more than it offended his
pride.
</pre>
<pre>
NEITHER Clara nor Vernon appeared at the mid-day table.
Dr. Middleton talked with Miss Bale on classical matters',
like a good-natured giant giving a child the jump from
stone to stone across a brawling mountain ford, so that an
unedified audience might really suppose, upon ~ seeing her
over the difficulty, she had done something for herself. Sir
yillou ;hby was proud of her, and therefore anxious to
settle her business while he was in the humour to lose her.
He hoped to finish it by shooting a word or two at Vernon
before dinner. Clam's petition to be set free, released from
him, had vaguely frightened even more than it offended his
pride.
</pre>
OmniPage Pro 11:
<pre>
NF f,rnMR Clara nor Vernon appeared at the mid-day table.
Dr. Middleton talked with Miss Dale on classical matters,
like .t good-natared giant giving a child the jump from
stone to stone across a brawling mountain ford, so that an
une(lifie(l audience might really suppose, upon seeing her
over the difficulty, she had done something for herself. Sir
jVillon;hl)y was proud of her, and therefore anxious to
setale leer business while he was in the humour to lose her.
lle hoped to finish it by shooting a word or two at Vernon
bofore dinner. Clara's petition to beset free, released from
)lint, had vaguely frightened even more than it offended his
pride.
-.2 ..1_ - ____
</pre>
<pre>
NEITHER Clara nor Vernon appeared at the mid-day table.
Dr. Middleton talked with Miss Dale on classical matters',
like a good-natured giant giving a child the jump from
stone to stone across a brawling mountain ford, so that an
unedified audience might really suppose, upon,seeing her
over the difficulty, she had done something for herself. Sir
Willoughby was proud of her, and therefore anxious to
settle her business while he was in the huniour to lose her.
Il"e hoped to finish it by shooting a word or two at Vernon
before dinner. Clara's petition to be set free, released from
hint, had vaguely frightened even more than it offended his
pride. - -
</pre>
TextBridge Millennium Pro:
<pre>
NErr'!'~~ Clara nor Vernon appeared at the mid.day table.
pr. ~1id(lIeto11 talked with Miss Dale on classical matters,
like a good-natured giant giving a child the jump from
stone to stone across a brawling mountain ford, so that au
~1edifi~ tLU(llCIlCC might really suppose, upon seeing h er
over the (hjiheulty, she had done something for herself. Sir
wiflouighby was proud of her, and therefore anxious to
settle her business while he was in the humour to lose her.
lie ho1)ed to finish it by shooting a word or two at Vernon
before dinner. Clara's petition to be set free, released from
him, had vaguely frightened even more than it offended his
prú~t~.
</pre>
<pre>