bjkeefe: A quick test of free-ocr.com

Tuesday, January 03, 2012

A short while ago, I happened upon free-ocr.com, a site that does optical character recognition (OCR) on an image file that you upload (PDF, JPG, GIF, TIFF, or BMP), using the open-source package Tesseract as its engine. It's quite a handy site.

This post is mostly a note to self, and a way to report some results to T. Reinhardt, the site's proprietor. But I thought some of you might also be interested to see some examples of just how hard a problem this is, and where the current state of the art lies.

(1, 2) Free OCR performed just about flawlessly on two screen shots of my favorite author bio of all time. The images were taken of a search result on Google Books and cropped down to remove all the stuff outside of the images of the book page. I found this result especially impressive in light of some artifacts I left in on purpose in the second image, along with this, from Free OCR's FAQs:

Can I OCR screen shots?

Our OCR engine requires images at a resolution of 150 dpi or greater and as such it is not optimal suited for reading PC screen shots which are only 72dpi. An Additional problem is that the fonts used for web design are usually sans serif fonts, which are more difficult to recognize than serif fonts.
But as we process your uploaded image for better accuracy it might be possible to OCR even screen shots. just give it a try. In the most cases the result is faster than typing all by hand.

(3) A much harder test -- the sign featured in yesterday's healthcare post -- produced results that were as error-filled as I expected.

(4) A quick crop of (3), leaving just the sign's text and one extraneous bit of the image (the tip of Wile E.'s tail). Knowing as little as I do about the field of OCR, I was a bit surprised at this one. I expected some errors, but not quite as many. In retrospect, I'd guess that part of the problem is that the photo of the sign was taken off-axis, so the text is distorted by perspective. I don't know if this is the skew mentioned in the FAQs, or if skew refers only to rotation of the text within the plane of the image (that is, around the z-axis).
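A back-of-the-envelope sketch of the resolution gap the FAQ describes: going from a nominal 72 dpi screen shot to the 150 dpi the engine asks for means enlarging the image by a factor of 150/72, a bit over 2x. The function names below are made up for illustration. Nothing in the FAQ says what preprocessing Free OCR actually applies, and resampling cannot add detail that was never captured, so treat this as arithmetic only.

```python
def upscale_factor(source_dpi: float = 72.0, target_dpi: float = 150.0) -> float:
    """Linear scale factor needed to go from source_dpi to target_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    return target_dpi / source_dpi


def upscaled_size(width_px: int, height_px: int,
                  source_dpi: float = 72.0,
                  target_dpi: float = 150.0) -> tuple[int, int]:
    """Pixel dimensions after resampling to the target resolution.

    The 72 and 150 defaults are the figures quoted from the FAQ; the
    800x600 example below is a hypothetical screen-shot size.
    """
    factor = upscale_factor(source_dpi, target_dpi)
    return round(width_px * factor), round(height_px * factor)


if __name__ == "__main__":
    print(upscale_factor())
    print(upscaled_size(800, 600))
```

In other words, a screen shot would need roughly twice as many pixels in each direction to meet the stated minimum, which makes the near-flawless results on (1) and (2) all the more notable.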

Following are thumbnails of the images uploaded to Free OCR and the resulting text. Just to be fussy, the text results have red pilcrows (¶) that I added by hand. Those correspond to where Free OCR detected a line break.

That a browser recognizes the HTML code &para; and not the word pilcrow amuses me.

Click the images to see them at full size.

Thanks for your site, T. Hope this report is of some use to you.

(3)		~ ml %¶ ,» ‘¶ V Q me you can‘? afford <1 iraw,¶ \ to an airport - you’ q¶ ~%mmW%~¶ ‘ Y°"'“ 9" a free colom$¢%da'¶
(4)		If you can‘? afford a doctor.¶ o to an airport - you'll gt? G¶ it-ee >_<-my and q breasf exam,¶ Icmdinuf you menhon Al Qaeda,¶ you get u free co\0nos¢¢py_¶
(1)		ABOUT THE AUTHOR¶ Isaac Asimov was born in the Soviet Union to his great¶ surprise. He moved quickly to correct the situation.¶ When his parents emigrated to the United States, Isaac¶ (three years old at the time) stowed away in their¶ baggage. He has been an American citizen since the age¶ of eight.¶ Brought up in Brooklyn, and educated in its public¶ schools, he eventually found his way to Columbia¶ University and, over the protests of the school¶ administration, managed to annex a series of degrees in¶ chemistry, up to and including a Ph.D. He then inﬁltrated¶ Boston University and climbed the academic ladder,¶ ignoring all cries of outrage, until he found himself¶ Professor of Biochemistry.¶ Meanwhile, at the age of nine, he found the love of his¶ life (in the inanimate sense) when he discovered his ﬁrst¶ science-ﬁction magazine. By the time he was eleven, he¶ began to write stories, and at eighteen, he actually¶ worked up the nerve to submit one. It was rejected. After¶ four long months of tribulation and suffering, he sold his¶ ﬁrst story and, thereafter, he never looked back.¶ In 1941, when he was twenty-one years old, he wrote¶ the classic short story “Nightfall” and his future was¶ assured. Shortly before that he had begun writing his¶
(2)		assured. Shortly before that he had begun writing his¶ robot stories, and shortly after that he had begun his¶ Foundation series.¶ What was left except quantity? At the present time, he¶ has published over 260 books, distributed through every¶ major division of the Dewey system of library¶ classiﬁcation, and shows no signs of slowing up. He¶ remains as youthful, as lively, and as lovable as ever, and¶ grows more handsome with each year. You can be sure¶ that this is so since he has written this little essay himself¶ and his devotion to absolute objectivity is notorious.¶ He is married to Janet Jeppson, psychiatrist and writer,¶ has two children by a previous marriage, and lives in¶ New York City.¶ Isaac Asimov¶ November, 1982¶
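For anyone who wants to put a number on "error-filled," one common yardstick is the character error rate: the edit distance between the OCR output and the intended text, divided by the length of the intended text. The sketch below is not something Free OCR or Tesseract reports. The helper names are invented for illustration, and the reference line is an assumption about what the first line of the sign in (4) says.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(ocr_text, reference) / len(reference)


# First line of result (4), with the OCR's odd quote mark written as \u2018.
ocr_line = "If you can\u2018? afford a doctor."
# ASSUMPTION: the intended wording of that line on the sign.
reference_line = "If you can't afford a doctor."

if __name__ == "__main__":
    print(levenshtein(ocr_line, reference_line))
    print(character_error_rate(ocr_line, reference_line))
```

Running the same comparison over whole results would need a trusted transcription of each image, which this post does not include.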
