Tag Archives: OCR

Compound Digit Recognition with Random Forests

I noticed that when I photocopy and email documents, the resulting attachment has relatively low resolution, and the digits get melded to one another. I decided to try to build a classifier to begin to sort this out. To this end, I needed to build a data set. First, I used svgfig to produce SVG sans-serif digit pairs, with kerning adjusted at four intervals. Then I used inkscape to create PNG images from the SVG files. Finally, I read the PNG images and wrote them to a NumPy array. I created a set of clean images, and images polluted with Gaussian noise, with a mean of zero, and a variance of 0.1. (The pixels were then rescaled back to the range of 0 to 1.) I also shifted each pair in eight directions. This produced a data set with 7200, 16×16 pixel images, half of which were noisy. I used a random forests classifier from sklearn, and performed 10-fold cross validation.

Continue reading