A short while ago, I happened upon free-ocr.com, a site that does optical character recognition on an image file that you upload (PDF, JPG, GIF, TIFF or BMP), using as its engine the open source package Tesseract. It's quite a handy site.
This post is mostly a note to self, and a way to report some results to T. Reinhardt, the site's proprietor. But I thought some of you might also be interested to see some examples of just how hard a problem this is, and where the current state of the art lies.
(1, 2) Free OCR performed just about flawlessly on two screen shots of my favorite author bio of all time. The images were taken from a search result on Google Books and cropped to remove everything outside the image of the book page. I found this result especially impressive in light of some artifacts I left in on purpose in the second image, along with this, from Free OCR's FAQs:
Can I OCR screen shots?
Our OCR engine requires images at a resolution of 150 dpi or greater and as such it is not optimal suited for reading PC screen shots which are only 72dpi. An Additional problem is that the fonts used for web design are usually sans serif fonts, which are more difficult to recognize than serif fonts.
But as we process your uploaded image for better accuracy it might be possible to OCR even screen shots. just give it a try. In the most cases the result is faster than typing all by hand.
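For a rough sense of what that 150 dpi figure means for a screen shot, here is a minimal sketch of the arithmetic, assuming the screen shot really is a nominal 72 dpi. The function name `upscale_size` is my own invention for illustration, and nothing in the FAQ says the site rescales uploads this way.

```python
def upscale_size(width_px, height_px, source_dpi=72, target_dpi=150):
    """Return the pixel dimensions needed to treat an image as target_dpi.

    Assumption: the image is a nominal source_dpi screen shot, and simply
    scaling it by target_dpi / source_dpi is a reasonable stand-in for
    "meeting" the resolution the FAQ asks for. Hypothetical helper, not
    anything Free OCR or Tesseract provides.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    new_width = -(-width_px * target_dpi // source_dpi)
    new_height = -(-height_px * target_dpi // source_dpi)
    return new_width, new_height


if __name__ == "__main__":
    # A 720 x 360 pixel screen shot would need to be a bit more than
    # doubled in each dimension.
    print(upscale_size(720, 360))  # (1500, 750)
```

In other words, a screen shot has roughly half the linear resolution the FAQ asks for, which makes the near-flawless results on (1) and (2) all the more pleasing.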
(3) A much harder test -- the sign featured in yesterday's healthcare post -- produced results that were as error-filled as I expected.
(4) A quick crop of (3), leaving just the sign's text and one extraneous bit of the image (the tip of Wile E.'s tail). Knowing as little as I do about the field of OCR, I was a bit surprised at this one. I expected some errors, but not quite as many. In retrospect, I'd guess that part of the problem is that the photo of the sign was taken off-axis. I don't know if this is the skew mentioned in the FAQs, or if skew refers only to rotation within the plane of the image (that is, around the z-axis).
Following are thumbnails of the images uploaded to Free OCR and the resulting text. Just to be fussy, the text results have red pilcrows (¶) that I added by hand. Those correspond to where Free OCR detected a line break.
That a browser recognizes the HTML &para; and not the word pilcrow amuses me.
Click the images to see them at full size.
Thanks for your site, T. Hope this report is of some use to you.