I noticed that when I photocopy and email documents, the resulting attachment has relatively low resolution, and the digits meld into one another. I decided to build a classifier as a first step toward sorting this out, and for that I needed a data set. First, I used svgfig to produce SVG images of sans-serif digit pairs at four kerning distances. Then I used inkscape to create PNG images from the SVG files. Finally, I read the PNG images and wrote them to a NumPy array. I created a set of clean images and a set of images polluted with Gaussian noise with a mean of zero and a standard deviation of 0.1. (The noisy pixels were then rescaled back to the range 0 to 1.) I also placed each pair at nine positions: centered, and shifted by one pixel in each of eight directions. This produced a data set of 7,200 16×16-pixel images, half of which were noisy. I used a random forest classifier from sklearn and performed 10-fold cross-validation.
State of the Art
I began this experiment because optical character recognition (OCR) is awfully difficult to get right on noisy data. I have used gocr in a workflow with poppler tools like pdftoppm, but with only limited success. Other tools, like abbyyocr, work amazingly well on clean data, but somewhat less so on noisy data. I found a blog with an interesting comparison, here.
In case you’re interested, this is what I’ve figured out regarding gocr. If your data was converted to a PDF from Word or something similar, then you can run,
$ pdftotext -layout file.pdf out.txt
That will try to preserve the layout as much as possible. For an example, see this blog.
If your data is instead a PDF of images, then you have to bite the proverbial bullet and use OCR. In that case, use pdftoppm to convert the PDF to a set of images,
$ pdftoppm -r 200 file.pdf img
This will create images with a resolution of 200 dpi. The images will be named img-001.ppm and so on. Sometimes you can improve the OCR by fiddling with the resolution, but it’s tedious. Anyway, then you apply gocr with all of its options,
$ gocr -a 50 -C '--.0-9' img-001.ppm | grep -c '_'
The -C option specifies that gocr will only look for digits, decimal points, and minus signs, and the -a option specifies that it will commit to a character with only 50% confidence, just in case you feel lucky. The grep part counts the lines containing underscores, which gocr prints for characters it could not recognize (as well as for literal underscores). I use this as a rough way to evaluate performance. If you want the recognized text itself, redirect the output to a file,
$ gocr img-001.ppm > out.txt
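The same rough score can be computed in Python once the gocr output has been captured as a string. This is only a sketch: the function name `count_unknown_lines` and the sample text are made up for illustration, and it mirrors `grep -c '_'` rather than anything gocr provides itself.

```python
def count_unknown_lines(ocr_text, marker="_"):
    """Count lines containing the marker, mirroring `grep -c '_'`."""
    return sum(1 for line in ocr_text.splitlines() if marker in line)


# Hypothetical gocr output: two of the three lines contain an underscore.
sample = "12.5 -3.0\n4_.1 7\n_9 10"
print(count_unknown_lines(sample))
```

Like grep -c, this counts lines rather than individual characters, so a line with several unknown characters still counts once.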
That’s my current survey of free OCR tools and techniques. The gocr utility has some exciting features that I have not experimented with yet, but that’s for a future post.
Creating a Data Set
First, we import an obnoxious number of modules.
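import os, sys
import collections
import numpy as np
import svgfig
import matplotlib.image as mpimg
from IPython.display import SVG, Image
import sklearn
from sklearn.ensemble import RandomForestClassifier
# cross_val_score lives in sklearn.model_selection in current releases;
# older releases exposed it as sklearn.cross_validation
from sklearn.model_selection import cross_val_score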
Create some digit pairs with four levels of kerning and write them to SVG files.
path = 'C:/Users/...'
# SVG window size
m = 1.0
w = svgfig.window(0,m,0,m)
# kerning distances
offsets = [ 0.014, 0.012, 0.010, 0.008 ]
# labels for kerned images
labels = [ 'xl', 'lg', 'md', 'sm' ]
offset_lbl = dict( zip( offsets, labels ) )
for offset in offsets:
    for i in range( 10 ):
        for j in range( 10 ):
            # list to hold the svgfig objects
            fig = list()
            # the two digits
            fig.append( svgfig.Text( m/2-offset, m/2, str(i) ).SVG( w ) )
            fig.append( svgfig.Text( m/2+offset, m/2, str(j) ).SVG( w ) )
            # write and save the SVG
            fn = path+offset_lbl[offset]+str(i)+str(j)+'.svg'
            svgfig.SVG("g",*fig).save(fn)
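The loop above writes one file per kerning label and digit pair, so 4 × 10 × 10 = 400 SVG files. As a quick sanity check on the naming convention, here is a sketch that lists the expected names without touching svgfig. The helper `expected_names` is hypothetical, not part of the original code.

```python
from itertools import product


def expected_names(labels=("xl", "lg", "md", "sm"), ext=".svg"):
    """Return the file names implied by the label + digit-pair convention."""
    return [lbl + str(i) + str(j) + ext
            for lbl, i, j in product(labels, range(10), range(10))]


names = expected_names()
print(len(names), names[0], names[-1])
```

Comparing this list against the contents of the output directory is a cheap way to spot a file that failed to write.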
If you’re in IPython, you can view an SVG using,
SVG( filename=path+'xl23.svg' )
Moving on, we’ll convert our SVG images to pixelated PNG images using system calls to inkscape, which you will have to download and put on your system path if you’re on Windows. I set the output PNGs to be 14×14 pixels, so that I could float them around in a 16×16 pixel window.
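for i in range( 10 ):
    for j in range( 10 ):
        for sz in labels:
            # read and write in the directory the SVGs were saved to
            inp = path+sz+str(i)+str(j)+'.svg'
            oup = path+sz+str(i)+str(j)+'.png'
            # -e is the PNG export flag in Inkscape 0.9x;
            # Inkscape 1.x uses --export-filename instead
            os.system('inkscape '+inp+' -e '+oup+' -D -w 14 -h 14')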
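Building the command by string concatenation breaks as soon as a path contains a space, which is common under C:/Users. One alternative is to build an argument list and hand it to subprocess. The sketch below only constructs the list; `inkscape_cmd` and its `legacy` switch are hypothetical names, and which export flag applies depends on the installed Inkscape version.

```python
def inkscape_cmd(svg, png, size=14, legacy=True):
    """Build the argument list for exporting an SVG to a square PNG.

    legacy=True uses the Inkscape 0.9x flag (-e); legacy=False uses the
    Inkscape 1.x flag (--export-filename). The default is an assumption
    about the installed version, not something checked at run time.
    """
    export = ["-e", png] if legacy else ["--export-filename=" + png]
    return ["inkscape", svg] + export + ["-D", "-w", str(size), "-h", str(size)]


print(inkscape_cmd("sm23.svg", "sm23.png"))
# To run it: import subprocess; subprocess.call(inkscape_cmd("sm23.svg", "sm23.png"))
```

Because each argument is a separate list element, no shell quoting is needed for paths with spaces.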
At this point, if you’re using IPython and you’d like to view a PNG, then you can use,
Image( filename=path+'sm23.png' )
Before we go any further, we’ll need to define a function for rescaling an array to a new maximum and minimum. We’ll use this to scale our noisy images back down to the range 0 to 1.
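def rescale( data, newmin, newmax ):
    # linearly map data from [min(data), max(data)] onto [newmin, newmax]
    oldmin, oldmax = np.min( data ), np.max( data )
    newscale = float( newmax - newmin )
    oldscale = float( oldmax - oldmin )
    # a constant array has no range to stretch
    if oldscale == 0:
        return None
    ratio = newscale / oldscale
    # work on a float copy so integer input is handled correctly
    scaled = data.astype( float )
    scaled -= oldmin
    scaled *= ratio
    scaled += newmin
    return scaled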
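To see what the rescaling does, here is a small self-contained sketch. It uses a compact restatement of the function above so that it runs on its own, and the sample values are made up to imitate pixels that noise has pushed outside the range 0 to 1.

```python
import numpy as np


def rescale(data, newmin, newmax):
    """Min-max rescaling, equivalent in intent to the function above."""
    oldmin, oldmax = np.min(data), np.max(data)
    if oldmax == oldmin:
        return None  # a constant array cannot be stretched
    return (data - oldmin) * (newmax - newmin) / float(oldmax - oldmin) + newmin


# Made-up pixel values, some outside the range 0 to 1.
noisy = np.array([-0.2, 0.0, 0.5, 1.3])
print(rescale(noisy, 0, 1))
```

The smallest value maps to 0 and the largest to 1, so a single extreme noise pixel compresses the contrast of everything else in the image.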
Now, we put everything together to form an array of shifted and possibly noisy images.
# one row per image: 256 pixel values followed by the label 10*i + j
X = np.zeros((400*9*2,16*16+1))
k = 0
for i in range( 10 ):
    for j in range( 10 ):
        for sz in labels:
            fn = path+sz+str(i)+str(j)+'.png'
            # the digits are stored in the alpha channel of the RGBA image
            img = mpimg.imread( fn )
            # clean images at nine positions in the 16x16 window
            for ii in range(3):
                for jj in range(3):
                    z = np.zeros((16,16))
                    z[ii:ii+14,jj:jj+14] = img[:,:,3]
                    X[k,:16*16] = z.ravel()
                    X[k,-1] = i*10+j
                    k += 1
            # noisy images: Gaussian noise with mean 0 and standard deviation 0.1
            for ii in range(3):
                for jj in range(3):
                    z = np.random.normal(0,0.1,(16,16))
                    z[ii:ii+14,jj:jj+14] += img[:,:,3]
                    z = rescale( z, 0, 1 )
                    X[k,:16*16] = z.ravel()
                    X[k,-1] = i*10+j
                    k += 1
This is what a noisy image would look like before it was flattened out and put into the data set.
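The shifting and noise steps can be tried without any PNG files. The sketch below uses a synthetic rectangle as a stand-in for a digit pair, so the glyph, the helper name `place`, and the fixed seed are all assumptions made for illustration. Note that the second argument of the normal sampler is a standard deviation, not a variance.

```python
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for a 14x14 glyph; the real data comes from the PNG alpha channel.
glyph = np.zeros((14, 14))
glyph[3:11, 5:9] = 1.0


def place(glyph, ii, jj, noise_std=0.0):
    """Put a 14x14 glyph at offset (ii, jj) in a 16x16 window, optionally
    on top of zero-mean Gaussian noise, then rescale to the range 0 to 1."""
    if noise_std > 0:
        z = rng.normal(0.0, noise_std, (16, 16))
    else:
        z = np.zeros((16, 16))
    z[ii:ii + 14, jj:jj + 14] += glyph
    # For a clean image this is a no-op, since it already spans 0 to 1.
    return (z - z.min()) / (z.max() - z.min())


clean = [place(glyph, ii, jj) for ii in range(3) for jj in range(3)]
noisy = [place(glyph, ii, jj, noise_std=0.1) for ii in range(3) for jj in range(3)]
print(len(clean), len(noisy), noisy[0].shape)
```

Each source image yields nine clean and nine noisy variants, which is where the factor of 9 × 2 in the size of X comes from.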
Classification and Cross-Validation
Finally, we perform the classification using random forests.
import matplotlib.pyplot as plt
scores = list()
v = 10
n_trees = [ 2**i for i in range(2,10) ]
for n in n_trees:
    # min_samples_split must be at least 2 in current scikit-learn releases
    clf = RandomForestClassifier( n_estimators=n, max_depth=None, min_samples_split=2, random_state=0 )
    scores.append( cross_val_score( clf, X[:,:-1], X[:,-1], cv=v ) )
    print( "%d %f" % ( n, np.mean( scores[-1] ) ) )
# one box per forest size, each summarizing the ten fold scores
plt.boxplot( scores )
plt.ylim( 0.5, 1.0 )
plt.grid()
plt.title('Random Forest\n10-fold Cross Validation')
plt.xticks( range(1,8+1), n_trees )
plt.ylabel('Mean Score')
plt.xlabel('Number of Trees')
plt.savefig('rforest10.png',format='png',dpi=100)
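For readers new to cross-validation, here is a sketch of what a 10-fold split of the 7,200 samples looks like. It is a simplified illustration and not what sklearn runs internally: `kfold_indices` is a hypothetical helper that shuffles and splits without regard to class, whereas cross_val_score with an integer cv and a classifier uses stratified folds.

```python
import numpy as np


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index arrays for shuffled k-fold cross-validation."""
    order = np.random.RandomState(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for f in range(n_folds):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        yield train, test


splits = list(kfold_indices(7200))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Every sample lands in exactly one test fold, so each of the ten scores is computed on 720 images the model did not see during training.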
I’d like to expand this data set with serif digits, and incorporate digits with decimals, commas, and dashes. On the upstream side, I’d like to work more with image segmentation to focus on areas of interest containing digits. I think it would also be helpful to use more sporadic noise, like the speckles found in photocopies, rather than Gaussian noise, but that’s for a later iteration.
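As a starting point for the speckle idea, one common stand-in is salt-and-pepper noise, where a random fraction of pixels is forced to pure black or white. The sketch below is an assumption about how that could be done, not something from the experiment above; the helper name `add_speckle` and the fractions are arbitrary.

```python
import numpy as np


def add_speckle(img, fraction=0.05, seed=0):
    """Set a random fraction of pixels to 0 or 1 (salt-and-pepper noise)."""
    rng = np.random.RandomState(seed)
    out = img.copy()
    hit = rng.rand(*img.shape) < fraction          # which pixels to corrupt
    out[hit] = rng.randint(0, 2, size=hit.sum())   # each becomes 0 or 1
    return out


# A flat gray test image, so that every corrupted pixel is easy to count.
img = np.full((16, 16), 0.5)
speckled = add_speckle(img, fraction=0.1)
print((speckled != 0.5).sum(), "of", img.size, "pixels changed")
```

Unlike Gaussian noise, this leaves most pixels untouched and needs no rescaling afterwards, since the corrupted values already lie in the range 0 to 1.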