Tuesday, January 03, 2012

A quick test of free-ocr.com

A short while ago, I happened upon free-ocr.com, a site that does optical character recognition on an image file that you upload (PDF, JPG, GIF, TIFF or BMP), using the open-source package Tesseract as its engine. It's quite a handy site.

This post is mostly a note to self, and a way to report some results to T. Reinhardt, the site's proprietor. But I thought some of you might also be interested to see some examples of just how hard a problem this is, and where the current state of the art lies.

(1, 2) Free OCR performed just about flawlessly on two screen shots of my favorite author bio of all time. The images were taken from a search result on Google Books and cropped to remove everything outside the images of the book page. I found this result especially impressive in light of some artifacts I left in on purpose in the second image, and in light of this, from Free OCR's FAQs:

Can I OCR screen shots?

Our OCR engine requires images at a resolution of 150 dpi or greater and as such it is not optimal suited for reading PC screen shots which are only 72dpi. An Additional problem is that the fonts used for web design are usually sans serif fonts, which are more difficult to recognize than serif fonts.
But as we process your uploaded image for better accuracy it might be possible to OCR even screen shots. just give it a try. In the most cases the result is faster than typing all by hand.
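To get a rough sense of the gap between the two resolutions quoted above, here is a minimal sketch of the arithmetic only. It assumes a screen shot is nominally 72 dpi and asks how many pixels the same area would need at 150 dpi. The function name and the 600 x 400 example size are hypothetical, and this says nothing about whatever preprocessing Free OCR actually applies to uploads. Resampling an image to a larger size also does not add real detail.

```python
SCREEN_DPI = 72    # nominal screen shot resolution quoted in the FAQ
MIN_OCR_DPI = 150  # minimum resolution quoted in the FAQ


def upscale_size(width_px, height_px, src_dpi=SCREEN_DPI, dst_dpi=MIN_OCR_DPI):
    """Return the (width, height) in pixels that covers the same area at dst_dpi.

    Uses integer ceiling division so the result never falls short of dst_dpi.
    """
    new_width = (width_px * dst_dpi + src_dpi - 1) // src_dpi
    new_height = (height_px * dst_dpi + src_dpi - 1) // src_dpi
    return new_width, new_height


if __name__ == "__main__":
    # Hypothetical 600 x 400 pixel crop; not the size of the images in this post.
    print(upscale_size(600, 400))
```

The scale factor works out to 150 / 72, or a little over 2x in each direction.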

(3) A much harder test -- the sign featured in yesterday's healthcare post -- produced results that were as error-filled as I expected.

(4) A quick crop of (3), leaving just the sign's text and one extraneous bit of the image (the tip of Wile E.'s tail). Knowing as little as I do about the field of OCR, I was a bit surprised at this one. I expected some errors, but not quite as many. In retrospect, I'd guess that part of the problem is that the photo of the sign was taken off-axis. I don't know if this is the skew mentioned in the FAQs, or if skew refers to rotation around the z-axis.

Following are thumbnails of the images uploaded to Free OCR and the resulting text. Just to be fussy, the text results have red pilcrows (¶) that I added by hand. Those correspond to where Free OCR detected a line break.
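The pilcrows in this post were added by hand, but the same marking could be done mechanically. The following is a minimal sketch, assuming the OCR result is available as a plain string with newline characters at the detected line breaks. The function name is hypothetical and it is not part of Free OCR or Tesseract.

```python
PILCROW = "\u00b6"  # the pilcrow character


def mark_line_breaks(ocr_text, marker=PILCROW):
    """Insert a visible marker in front of every newline in ocr_text."""
    return ocr_text.replace("\n", marker + "\n")


if __name__ == "__main__":
    sample = "of eight.\nBrought up in Brooklyn, and educated in its public\n"
    print(mark_line_breaks(sample))
```

A final line with no trailing newline is left unmarked, since no line break was detected there.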

That a browser recognizes the HTML
code &para; and not the word pilcrow amuses me.

Click the images to see them at full size.

Thanks for your site, T. Hope this report is of some use to you.

~ ml %
,» ‘
V Q me you can‘? afford <1 iraw,
\ to an airport - you’ q
‘ Y°"'“ 9" a free colom$¢%da'
(4) If you can‘? afford a doctor.
o to an airport - you'll gt? G
it-ee >_<-my and q breasf exam,
Icmdinuf you menhon Al Qaeda,
you get u free co\0nos¢¢py_
Isaac Asimov was born in the Soviet Union to his great
surprise. He moved quickly to correct the situation.
When his parents emigrated to the United States, Isaac
(three years old at the time) stowed away in their
baggage. He has been an American citizen since the age
of eight.
Brought up in Brooklyn, and educated in its public
schools, he eventually found his way to Columbia
University and, over the protests of the school
administration, managed to annex a series of degrees in
chemistry, up to and including a Ph.D. He then infiltrated
Boston University and climbed the academic ladder,
ignoring all cries of outrage, until he found himself
Professor of Biochemistry.
Meanwhile, at the age of nine, he found the love of his
life (in the inanimate sense) when he discovered his first
science-fiction magazine. By the time he was eleven, he
began to write stories, and at eighteen, he actually
worked up the nerve to submit one. It was rejected. After
four long months of tribulation and suffering, he sold his
first story and, thereafter, he never looked back.
In 1941, when he was twenty-one years old, he wrote
the classic short story “Nightfall” and his future was
assured. Shortly before that he had begun writing his
(2) assured. Shortly before that he had begun writing his
robot stories, and shortly after that he had begun his
Foundation series.
What was left except quantity? At the present time, he
has published over 260 books, distributed through every
major division of the Dewey system of library
classification, and shows no signs of slowing up. He
remains as youthful, as lively, and as lovable as ever, and
grows more handsome with each year. You can be sure
that this is so since he has written this little essay himself
and his devotion to absolute objectivity is notorious.
He is married to Janet Jeppson, psychiatrist and writer,
has two children by a previous marriage, and lives in
New York City.
Isaac Asimov
November, 1982
