I noticed that when I photocopy and email documents, the resulting attachment has relatively low resolution, and the digits meld into one another. I decided to build a classifier to begin to sort this out, and for that I needed a data set. First, I used svgfig to produce SVG images of sans-serif digit pairs, with kerning adjusted at four intervals. Then I used inkscape to create PNG images from the SVG files. Finally, I read the PNG images and wrote them to a NumPy array. I created a set of clean images, and a set of images polluted with Gaussian noise with a mean of zero and a standard deviation of 0.1. (The pixels were then rescaled back to the range of 0 to 1.) I also placed each pair at nine positions: centered, plus shifted in eight directions. This produced a data set of 7200 16×16 pixel images, half of which were noisy. I used a random forest classifier from sklearn, and performed 10-fold cross validation.

State of the Art

I began this experiment because optical character recognition (OCR) is an awfully difficult thing to get right when the data is noisy. I have used gocr in a workflow with poppler tools like pdftoppm, but I have only had limited success. Other tools, like abbyyocr, work amazingly well on clean data, but somewhat less so on noisy data. I found a blog with an interesting comparison, here.

In case you’re interested, this is what I’ve figured out regarding pdftoppm and gocr. If your data was converted to a PDF from Word or something, then you can run,

$ pdftotext -layout file.pdf out.txt

That will try to preserve the layout as much as possible. For an example, see this blog.

If your data is instead a PDF of images, then you have to bite the proverbial bullet and use OCR. In that case, use pdftoppm to convert the PDF to a set of images,

$ pdftoppm -r 200 file.pdf img

This will create images with 200 dpi resolution. The images will be named img-001.ppm etc. Sometimes you can improve the OCR by fiddling with the resolution, but it’s tedious. Anyway, then you apply gocr with all of its options,

$ gocr -a 50 -C '--.0-9' img-001.ppm | grep -c '_'

The -C option specifies that gocr will only look for digits, decimal points, and minus signs, and the -a option specifies that it will make a decision with only 50% confidence, just in case you feel lucky. The grep part counts the number of lines containing underscores, which gocr uses for both real underscores and unknown characters. I use this count as a rough way to evaluate performance. If you want the recognized text itself, then you redirect the output like so,
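The same rough count can be taken in Python once the gocr output has been captured as text. This is only a sketch: the function name count_flagged_lines and the sample text are made up for illustration, and it assumes gocr is still marking unknown characters with an underscore.

```python
def count_flagged_lines(ocr_text, marker="_"):
    """Count the lines that contain the marker, like `grep -c '_'`."""
    return sum(1 for line in ocr_text.splitlines() if marker in line)

# Made-up OCR output: two of the three lines contain an underscore.
sample = "12.5 -3.0\n4_.1 7\n_ 88\n"
flagged = count_flagged_lines(sample)
print(flagged)
```

Like grep -c, this counts lines rather than individual underscores, so a line with several unknown characters still counts once.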

$ gocr img-001.ppm > out.txt

That’s my current survey of free OCR tools and techniques. The gocr utility has some exciting features that I have not experimented with yet, but that’s for a future post.

Creating a Data Set

First, we import an obnoxious number of modules.

import os, sys, svgfig, numpy as np
from IPython.display import SVG, Image
import matplotlib.image as mpimg
import collections
import sklearn
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score

Create some digit pairs with four levels of kerning and write them to SVG files.

path = 'C:/Users/...'

# SVG window size
m = 1.0
w = svgfig.window(0,m,0,m)

# kerning distances
offsets = [ 0.014, 0.012, 0.010, 0.008 ]

# labels for kerned images
labels = [ 'xl', 'lg', 'md', 'sm' ]
offset_lbl = dict( zip( offsets, labels ) )

for offset in offsets:
    for i in range( 10 ):
        for j in range( 10 ):

            # list to hold the svgfig objects
            fig = list()
            
            # the two digits 
            fig.append( svgfig.Text( m/2-offset, m/2, str(i) ).SVG( w ) )
            fig.append( svgfig.Text( m/2+offset, m/2, str(j) ).SVG( w ) )
            
            # write and save the SVG
            fn = path+offset_lbl[offset]+str(i)+str(j)+'.svg'
            g = svgfig.SVG("g",*fig).save(fn)
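To make the naming scheme concrete, here is a small sketch that only builds the file names, without svgfig or any file output. It assumes the same four kerning labels as above; the variable names is my own.

```python
from itertools import product

labels = ['xl', 'lg', 'md', 'sm']  # kerning labels, as in the loop above

# One name per (kerning label, first digit, second digit) combination.
names = [sz + str(i) + str(j) + '.svg'
         for sz, i, j in product(labels, range(10), range(10))]

print(len(names), names[0], names[-1])
```

Four kerning levels times one hundred digit pairs gives 400 SVG files, which is where the 400 in the array size further down comes from.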

If you’re in IPython, you can view an SVG using,

SVG( filename=path+'xl23.svg' )

Moving on, we’ll convert our SVG images to pixelated PNG images using system calls to inkscape, which you will have to download and put in your system path if you’re on Windows. I set the output PNGs to be 14×14 pixels, so that I could float them around in a 16×16 pixel window.

for i in range( 10 ):
    for j in range( 10 ):
        for sz in labels:
            # read and write in the same directory the SVGs were saved to
            inp = path+sz+str(i)+str(j)+'.svg'
            oup = path+sz+str(i)+str(j)+'.png'
            os.system('inkscape '+inp+' -e '+oup+' -D -w 14 -h 14')
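If the path contains spaces, gluing the command together as one string gets fragile. One alternative, sketched below, is to build the command as a list of arguments and hand it to subprocess. The helper inkscape_export_args and the directory some_dir are hypothetical. The -e flag is the one used in the loop above; the -o flag for newer Inkscape releases is my understanding of the 1.x command line, so check it against inkscape --help before relying on it.

```python
import os

def inkscape_export_args(svg_path, png_path, size=14, legacy=True):
    """Build the argument list for exporting one SVG to a square PNG.

    legacy=True uses -e, as in the loop above.
    legacy=False uses -o, an assumption about Inkscape 1.x to verify locally.
    """
    out_flag = '-e' if legacy else '-o'
    return ['inkscape', svg_path, out_flag, png_path,
            '-D', '-w', str(size), '-h', str(size)]

# 'some_dir' is a placeholder directory, not a path from this post.
args = inkscape_export_args(os.path.join('some_dir', 'xl23.svg'),
                            os.path.join('some_dir', 'xl23.png'))
print(args)
# To actually run it: import subprocess, then subprocess.call(args)
```

Nothing is executed here; the sketch only prints the argument list so you can inspect it first.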

At this point, if you’re using IPython and you’d like to view a PNG, then you can use,

Image( filename=path+'sm23.png' )

Before we go any further, we’ll need to define a function for rescaling an array to a new maximum and minimum. We’ll use this to scale our noisy images back down to the range 0 to 1.

def rescale( data , newmin , newmax ):
    oldmin , oldmax = np.min( data ), np.max( data )
    newscale = float( newmax - newmin )
    oldscale = float( oldmax - oldmin )
    if oldscale == 0: return None
    ratio = float( newscale / oldscale )
    scaled = data.copy()
    scaled -= oldmin
    scaled *= ratio
    scaled += newmin
    return scaled
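Here is a quick check of what rescaling does to a noisy image. It is a sketch under a few assumptions: the glyph is a made-up rectangle rather than a real digit, the seed is arbitrary, and rescale is restated in compact form so the snippet runs on its own.

```python
import numpy as np

def rescale(data, newmin, newmax):
    """Linearly map data so its minimum is newmin and its maximum is newmax."""
    oldmin, oldmax = np.min(data), np.max(data)
    if oldmax == oldmin:
        return None  # constant input has no range to stretch
    ratio = float(newmax - newmin) / (oldmax - oldmin)
    return (data - oldmin) * ratio + newmin

rng = np.random.RandomState(0)        # arbitrary fixed seed
glyph = np.zeros((16, 16))
glyph[4:12, 6:10] = 1.0               # made-up stand-in for a digit
# the second argument of normal() is the standard deviation
noisy = glyph + rng.normal(0, 0.1, (16, 16))

scaled = rescale(noisy, 0, 1)
print(noisy.min(), noisy.max())       # before rescaling
print(scaled.min(), scaled.max())     # after rescaling
```

Because the mapping is linear, the noise keeps its shape and only the range changes. A constant image has no range to stretch, which is why the function returns None in that case.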

Now, we put everything together to form an array of shifted and possibly noisy images.

# 400 PNGs x 9 placements x 2 (clean, noisy) rows;
# 256 pixel columns plus one label column
X = np.zeros((400*9*2,16*16+1))
k = 0
for i in range( 10 ):
    for j in range( 10 ):
        for sz in labels:
            # read the 14x14 PNG once and keep its alpha channel
            fn = path+sz+str(i)+str(j)+'.png'
            glyph = mpimg.imread( fn )[:,:,3]
            # clean images, one per placement in the 16x16 window
            for ii in range(3):
                for jj in range(3):
                    z = np.zeros((16,16))
                    z[ii:ii+14,jj:jj+14] = glyph
                    X[k,:16*16] = z.ravel()
                    X[k,-1] = i*10+j
                    k += 1
            # noisy images: Gaussian noise (mean 0, standard deviation 0.1) plus the glyph
            for ii in range(3):
                for jj in range(3):
                    z = np.random.normal(0,0.1,(16,16))
                    z[ii:ii+14,jj:jj+14] += glyph
                    z = rescale( z, 0, 1 )
                    X[k,:16*16] = z.ravel()
                    X[k,-1] = i*10+j
                    k += 1
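The placement and labeling can be checked on their own, without any PNG files. In this sketch the glyph is a made-up block of ones standing in for one 14×14 alpha channel, and the label 37 is just an example.

```python
import numpy as np

glyph = np.ones((14, 14))   # made-up stand-in for one 14x14 alpha channel

rows = []
for ii in range(3):
    for jj in range(3):
        z = np.zeros((16, 16))
        z[ii:ii+14, jj:jj+14] = glyph   # top-left corner of the glyph at (ii, jj)
        rows.append(z.ravel())
rows = np.array(rows)

# digit pairs x kernings x placements x (clean, noisy)
n_rows = 10 * 10 * 4 * 9 * 2

# a label was stored as i*10+j, so divmod recovers the two digits
first, second = divmod(37, 10)
print(rows.shape, n_rows, first, second)
```

Each 14×14 image has nine possible top-left corners inside the 16×16 window, so every PNG contributes nine clean rows and nine noisy rows.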

This is what a noisy image would look like before it was flattened out and put into the data set.

Classification and Cross-Validation

Finally, we perform the classification using random forests.

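import matplotlib.pyplot as plt

scores = list() ; v = 10
for i in range(2,10):
    # min_samples_split must be at least 2 in current sklearn releases
    clf = RandomForestClassifier( n_estimators=2**i, max_depth=None, min_samples_split=2, random_state=0 )
    scores.append( cross_val_score( clf, X[:,:-1], X[:,-1], cv=v ) )
    print( '%d %.4f' % ( 2**i, np.mean( scores[-1] ) ) )

# one box per forest size; boxplot takes the list of score arrays as one argument
plt.boxplot( scores )
plt.ylim( 0.5, 1.0 ) ; plt.grid()
plt.title('Random Forest\n10-fold Cross Validation')
plt.xticks( range(1,8+1), [ 2**i for i in range(2,10) ] )
plt.ylabel('Mean Score') ; plt.xlabel('Number of Trees')
plt.savefig('rforest10.png',format='png',dpi=100)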
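If you want to try the cross-validation call without building the whole data set, here is a small sketch on synthetic data. The data is invented purely for illustration (three well-separated classes, sixteen features), the import path is the one used by current scikit-learn releases, and the scores say nothing about the digit data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)   # arbitrary fixed seed

# Synthetic stand-in for the digit data: 3 classes, 60 samples each,
# 16 features, with class means shifted so the problem is learnable.
y = np.repeat(np.arange(3), 60)
features = rng.normal(0, 1, (180, 16)) + y[:, None]

clf = RandomForestClassifier(n_estimators=16, random_state=0)
scores = cross_val_score(clf, features, y, cv=5)   # one accuracy per fold
print(scores.shape, scores.mean())
```

With an integer cv and a classifier, cross_val_score uses stratified folds, so each fold keeps roughly the same mix of classes as the full set.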

Future Work

I’d like to expand this data set with serif digits, and incorporate digits with decimals, commas, and dashes. On the upstream side, I’d like to work more with image segmentation to focus on areas of interest containing digits. I think it would also be helpful to use more sporadic noise, like the speckles found in photocopies, rather than Gaussian noise, but that’s for a later iteration.

Connor Johnson

Compound Digit Recognition with Random Forests
