A Sample Session

To start classifying, you only need to know about three functions. What follows is a sample first session with the OCR software that illustrates their usage. Details are skipped for now. You are encouraged to play around a bit with R to get a feel for it.

It is assumed that to work with one or more images printed using a certain font, you will use a separate directory. The OCR package will create certain files and subdirectories there in order to make data persistent across sessions.

With this in mind, create a new directory and move into it. For starters, bring a single image file into the directory. If you don't have any scanned images, you can use this one (300 dpi, 14 lines of text, 1728x996 pixels, 1.7 MB). R will load the full image into memory, so you may have problems with big images, especially if your computer is low on memory (in that case, break up your images into smaller pieces). The image must be a PGM file, so your preprocessing steps may go something like this:

$ mkdir ocr-exp
$ cd ocr-exp/
$ cp /path/to/foo.png .
$ convert foo.png foo.pgm # if you have ImageMagick installed
$ pngtopnm foo.png | ppmtopgm > foo.pgm # another alternative
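
Before handing the file to the OCR package, it can be worth checking from R that the conversion really produced a PGM file and that the image will fit in memory. The following is a minimal sketch in base R, not part of the OCR package: the helper names is_pgm() and double_matrix_bytes() are made up for illustration. It relies on the fact that PGM files begin with the magic number "P2" (plain) or "P5" (raw), and it assumes the image ends up stored as a matrix of doubles at 8 bytes per pixel.

```r
# Hypothetical helper: TRUE if the file starts with a PGM magic number,
# "P2" (plain/ASCII PGM) or "P5" (raw/binary PGM).
is_pgm <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 2)
  identical(magic, charToRaw("P2")) || identical(magic, charToRaw("P5"))
}

# Hypothetical helper: approximate bytes needed to hold a width x height
# image as a matrix of doubles (8 bytes per pixel, ignoring R's overhead).
double_matrix_bytes <- function(width, height) {
  width * height * 8
}

# The sample image is 1728x996 pixels.
double_matrix_bytes(1728, 996)   # 13768704 bytes

# Only check the file if it is actually there.
if (file.exists("foo.pgm")) is_pgm("foo.pgm")
```

So the 1.7 MB sample image would need roughly 13.8 MB once held as doubles. That estimate is within 120 bytes of the size reported by segmentIntoLines() in the session below, which suggests, but does not confirm, that the reported size is in bytes.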

Breaking up an image into lines of text

The first step is to split this image into single lines of text. The function segmentIntoLines() will try to extract these lines as matrices of doubles and save them in a new directory specific to the input image. A session could go like this (lines starting with > and + are what the user needs to enter; everything else is output):

$ R

[R startup messages skipped]

> library(bocra)
Loading required package: pixmap
> segmentIntoLines("foo")

        Read image matrix of size 13768824

> list.files(path = "foo-lines")
 [1] "image_00001.rda" "image_00002.rda" "image_00003.rda" "image_00004.rda"
 [5] "image_00005.rda" "image_00006.rda" "image_00007.rda" "image_00008.rda"
 [9] "image_00009.rda" "image_00010.rda" "image_00011.rda" "image_00012.rda"
[13] "image_00013.rda" "image_00014.rda"
>

Note that the file extension .pgm is implicit (this is for ease of coding, and can be easily extended). Since the input file is foo.pgm, the lines are saved in a directory called foo-lines/. We confirm that this directory has indeed been created, and that some files have been written to it, by listing its contents with the list.files() function.
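
The naming convention described above can be written down in a few lines of base R. This is only a sketch: the helpers ocr_paths() and check_ocr_paths() are hypothetical names, not functions provided by the OCR package, and the ".pgm" and "-lines" suffixes are taken from the session above.

```r
# Hypothetical helper: the file and directory names implied by a base
# name such as "foo", following the convention seen in the session.
ocr_paths <- function(base) {
  list(input     = paste0(base, ".pgm"),    # implicit input file
       lines_dir = paste0(base, "-lines"))  # where the lines are saved
}

# Hypothetical helper: report whether the input file exists and whether
# a directory of segmented lines is already present.
check_ocr_paths <- function(base) {
  p <- ocr_paths(base)
  c(input_exists     = file.exists(p$input),
    lines_dir_exists = dir.exists(p$lines_dir))
}

check_ocr_paths("foo")
```

Running check_ocr_paths("foo") before and after the segmentation step is a quick way to see whether the input was found and whether the output directory has appeared.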

In fact, since we will eventually use these file names for training our classifier, let us store them in a variable, this time with their full names (including the directory name). To make sure the lines have been extracted correctly, let us also plot the images (you may need to make the plot window wider and shorter for a proper aspect ratio):

> ## a hash (#) is the R comment character
> foolines = list.files(path = "foo-lines", full.names = TRUE)
> ## assigning something usually suppresses printing
> foolines # print the contents of the variable foolines
 [1] "foo-lines/image_00001.rda" "foo-lines/image_00002.rda"
 [3] "foo-lines/image_00003.rda" "foo-lines/image_00004.rda"
 [5] "foo-lines/image_00005.rda" "foo-lines/image_00006.rda"
 [7] "foo-lines/image_00007.rda" "foo-lines/image_00008.rda"
 [9] "foo-lines/image_00009.rda" "foo-lines/image_00010.rda"
[11] "foo-lines/image_00011.rda" "foo-lines/image_00012.rda"
[13] "foo-lines/image_00013.rda" "foo-lines/image_00014.rda"
>
> par(ask = TRUE) # to prompt before displaying each new image
>
> # loop through files, R indexing starts from 1
> # (seq_along() also behaves sensibly if foolines is empty)
> for (i in seq_along(foolines)) {
+   load(foolines[i])  # loads data from ith file
+   plotpix(linepix)   # plots corresponding image
+ }
> 
> # the + prompt indicates that the last line was
> # syntactically incomplete and needs to be continued
> 
> par(ask = FALSE) # restore normal behaviour
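
The loop above relies on each .rda file restoring an object called linepix into the workspace. If you would rather look at what a file contains without touching your workspace, you can load it into a separate environment. The sketch below uses only base R; describe_rda() is a made-up helper name, not something the OCR package provides.

```r
# Hypothetical helper: load an .rda file into a private environment and
# describe every object it contains, leaving the workspace untouched.
describe_rda <- function(path) {
  env <- new.env()
  loaded <- load(path, envir = env)  # load() returns the restored names
  lapply(mget(loaded, envir = env), function(obj) {
    list(type = typeof(obj), dim = dim(obj))
  })
}

# Only run if the variable from the session above is available.
if (exists("foolines") && length(foolines) > 0) {
  str(describe_rda(foolines[1]))
}
```

For a line file, you would expect to see a single entry named linepix whose type is double and whose dimensions are those of the extracted line.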

Now that the lines seem to have been segmented correctly, we can move on to the next step, which is to create and train a classifier. To create a new instance of a classifier meant to handle Indic scripts like Bengali and Hindi, call the function indicClassifier() and store the result in a variable. The classifier stored in this variable can then be trained on one file at a time.

> bar = indicClassifier()
> updateClassifier(bar, foolines[1])

This will start the training by displaying the full line, then displaying the identified words one at a time at a larger scale, and then going through what the software identifies as glyphs, asking the user to enter the class of each. Each glyph will be displayed in grey, and the class entered should be for that part only. For example, 'ra' would be split into two parts, 'ba' and the dot below it. Here's a screenshot.

Currently training is the only mode. Play around and let me know if you have any questions.

As RMS would say, happy hacking!


If you have any questions or comments, contact me at deepayan at stat.wisc.edu