Note
These webpages are currently rough notes only, meant to serve as very basic introductory material. Please follow up by email if you are interested in the project.
Introduction
These pages introduce and discuss a possible approach to Optical Character Recognition, inspired by a need to handle complex scripts like Bengali (but applicable to other scripts as well). What separates it from more common implementations is that the classifier must be trained before it can be used and, to be useful, will probably need some degree of user interaction even afterwards (minimal under the best of circumstances).
We mostly wish to talk about the techniques for achieving our goal, but discussion and experimentation are largely pointless without a reference implementation. We have a preliminary reference implementation based on a Free Software environment called R (often referred to as the Statistician's Matlab, to give a vague idea of what it is like). To use and improve our implementation, you would have to install R. (Warning: the R source tarball is around 9MB.) At some future date, if things look promising, we may consider implementing a standalone application.
Our reference implementation is available as an add-on R source package (less than 200KB), which needs to be compiled. If you are working on Windows and you are not an R user already, you will likely find it difficult to get this to work, since an additional MinGW toolchain is required. It is possible to distribute pre-compiled binary versions of packages, but I don't have enough experience with Windows to know (or care) how to do that. If you are really desperate, bug me via email and I'll see what I can do.
Goals and Scope
Before anyone gets too excited, we should point out certain things about what we can expect from this approach. First of all, this still needs a lot of work, and there's no guarantee that the end product will actually be usable. Even if everything turns out to work as perfectly as we hope, the very nature of the approach means that:
- it will probably work only with very high quality print and high quality scans. It may be useful if you want to OCR a reasonably large book from a good publishing press. But for casual one-page scans, it would be practically useless.
- it's mostly interactive (and hence demanding in terms of user time spent), and you probably won't be able to just give some image file names and get some output.
- it needs manual training initially for every new book or font. This has the advantage that almost any language will work: English, Bengali, Hindi (not together, though).
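To make the "train first, then interact" workflow described above a little more concrete, here is a very rough sketch in Python. It is not the R reference implementation and does not claim to reflect how that package works: the nearest-template classifier is only a stand-in, and every name in it (`train`, `classify`, `recognise`, `ask_user`, `max_dist`) is hypothetical. It assumes glyphs have already been segmented and flattened into equal-length lists of pixel values.

```python
from math import inf
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: one segmented glyph, flattened to equal-length pixel intensities.
Glyph = Sequence[float]


def train(samples: List[Tuple[Glyph, str]]) -> Dict[str, List[float]]:
    """Average the manually labelled glyphs of one book/font into one template per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for pixels, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(pixels)
            counts[label] = 0
        sums[label] = [s + p for s, p in zip(sums[label], pixels)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify(glyph: Glyph, templates: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the label of the closest template and its squared Euclidean distance."""
    best_label, best_dist = "", inf
    for label, template in templates.items():
        dist = sum((g - t) ** 2 for g, t in zip(glyph, template))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


def recognise(
    glyphs: List[Glyph],
    templates: Dict[str, List[float]],
    ask_user: Callable[[Glyph], str],
    max_dist: float = 0.5,  # hypothetical threshold; would need tuning per font
) -> List[str]:
    """Label each glyph, falling back to the user when no template is close enough."""
    labels = []
    for glyph in glyphs:
        label, dist = classify(glyph, templates)
        if dist > max_dist:
            label = ask_user(glyph)  # the interactive step
        labels.append(label)
    return labels
```

In this sketch the templates are only meaningful for the font they were trained on, which is why a new book or font would need a fresh round of manual labelling, and the `ask_user` callback is where the user's time gets spent.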
To learn more, start reading the other links (preferably in sequence).