Compound Digit Recognition with Random Forests

I noticed that when I photocopy and email documents, the resulting attachment has relatively low resolution, and adjacent digits meld into one another. I decided to build a classifier to begin to sort this out. To this end, I needed to build a data set. First, I used svgfig to produce SVG sans-serif digit pairs, with kerning adjusted at four intervals. Then I used Inkscape to create PNG images from the SVG files. Finally, I read the PNG images and wrote them to a NumPy array. I created a set of clean images, and a set of images polluted with Gaussian noise with a mean of zero and a variance of 0.1. (The pixels were then rescaled back to the range of 0 to 1.) I also shifted each pair in eight directions. This produced a data set of 7200 images of 16×16 pixels, half of which were noisy. I used a random forest classifier from sklearn, and performed 10-fold cross-validation.
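As a rough sketch of the last two steps (adding noise, then cross-validating a random forest), something like the following could work. The data here is a hypothetical stand-in, random binary 16×16 templates rather than the rendered digit pairs, and the class count, `n_estimators`, and the per-image rescaling are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def add_noise(images, variance=0.1):
    """Add zero-mean Gaussian noise, then rescale each image back to [0, 1]."""
    noisy = images + rng.normal(0.0, np.sqrt(variance), size=images.shape)
    lo = noisy.min(axis=(1, 2), keepdims=True)
    hi = noisy.max(axis=(1, 2), keepdims=True)
    return (noisy - lo) / (hi - lo)

# Hypothetical stand-in data: 4 classes of 16x16 binary "images", 50 copies each.
n_classes, per_class = 4, 50
templates = rng.integers(0, 2, size=(n_classes, 16, 16)).astype(float)
labels = np.repeat(np.arange(n_classes), per_class)
clean = templates[labels]

# Half of the data set is clean, half is noisy, as in the description above.
images = np.concatenate([clean, add_noise(clean)])
y = np.concatenate([labels, labels])
X = images.reshape(len(images), -1)  # flatten each image to 256 features

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
print("mean accuracy:", scores.mean())
```

With real digit-pair images, `X` would hold the flattened PNG pixels and `y` the pair labels; the rest of the sketch would stay the same.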


Integrals Over Arbitrary Triangular Regions for FEM

In this post I’ll present a recipe for taking an integral over an arbitrary triangular region using the SciPy integrate.dblquad() function. This is an important operation when implementing the finite element method for solving partial differential equations. In school we are taught to perform a change of variables that involves splitting the triangle into two regions and performing the double integration on the simpler sub-domains, after carefully calculating new limits of integration. This recipe instead maps the triangle to the unit square, and then calculates the double integral on the domain [0,1] \times [0,1]. I pieced this together after looking at this discussion on the MATLAB Central message board regarding the transformation of the triangle to the unit square, this post on Paul’s Online Notes that touched on the calculation of the Jacobian, and this post by John D. Cook about choosing the correct error limits for quadrature integration.
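A minimal sketch of the idea follows. The particular mapping is an assumption (one common way to collapse the unit square onto a triangle), not necessarily the one used in the full post, and the name `integrate_triangle` is made up for illustration. With vertices P1, P2, P3, the map x = x1 + u(x2 - x1) + uv(x3 - x2), and likewise for y, has a Jacobian determinant of magnitude 2·Area·u.

```python
from scipy.integrate import dblquad

def integrate_triangle(f, p1, p2, p3):
    """Integrate f(x, y) over the triangle with vertices p1, p2, p3."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Twice the triangle's area (absolute value of the edge cross product).
    twice_area = abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))

    def integrand(v, u):
        # Map (u, v) in the unit square to (x, y) in the triangle.
        x = x1 + u * (x2 - x1) + u * v * (x3 - x2)
        y = y1 + u * (y2 - y1) + u * v * (y3 - y2)
        # |Jacobian| of this map is twice_area * u.
        return f(x, y) * twice_area * u

    # dblquad expects func(inner, outer); here u is outer and v is inner.
    value, abserr = dblquad(integrand, 0.0, 1.0, lambda u: 0.0, lambda u: 1.0)
    return value

# Area of the unit right triangle, and the integral of x over it.
area = integrate_triangle(lambda x, y: 1.0, (0, 0), (1, 0), (0, 1))
x_moment = integrate_triangle(lambda x, y: x, (0, 0), (1, 0), (0, 1))
print(area, x_moment)
```

The two checks have known closed-form answers (1/2 and 1/6), which makes them a convenient sanity test before using the function on FEM basis functions.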


Fractal Dimension and Box Counting

In this post I will present a technique for generating a one-dimensional (quasi) fractal data set using a modified Matérn point process, perform a simple box-counting procedure, and then calculate the lacunarity and fractal dimension using linear regression. Lacunarity is actually a pretty large topic, and we will only cover one accepted interpretation here. This material was motivated by an interesting paper on the fractal modelling of fractures in tight gas reservoirs. Tight gas reservoirs are reservoirs with very low permeability. To provide a sense of perspective, oil reservoirs typically have a permeability of ten to a hundred millidarcies, whereas shale gas reservoirs are usually less than 0.1 microdarcies, which is about the same permeability as a granite countertop.
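To make the box-counting step concrete, here is a small sketch. It uses a Cantor-like point set as a stand-in for the Matérn-generated data (an assumption made purely so the expected dimension, log 2 / log 3, is known), and the helper names are hypothetical. Lacunarity is not covered in this sketch.

```python
import numpy as np

def cantor_points(level):
    """Integer points whose base-3 digits are all 0 or 2 (Cantor-like set)."""
    pts = np.array([0])
    for k in range(level):
        pts = np.concatenate([pts, pts + 2 * 3**k])
    return pts

def box_count(points, size):
    """Number of boxes of the given size that contain at least one point."""
    return np.unique(np.floor(points / size)).size

points = cantor_points(8)            # 256 points in [0, 3**8)
sizes = 3 ** np.arange(0, 8)         # box sizes 1, 3, 9, ...
counts = np.array([box_count(points, s) for s in sizes])

# The fractal dimension is the slope of log(count) against log(1 / size).
slope, intercept = np.polyfit(np.log(1.0 / sizes), np.log(counts), 1)
print("estimated dimension:", slope)
```

For real data the log-log points will not fall exactly on a line, so it is worth plotting them and fitting only the range of box sizes where the scaling looks linear.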


Spatial Point Processes

Here, I’ll introduce some ideas regarding spatial point processes using Python. First I’ll present the Poisson point process, and then I’ll cover two other processes: the Thomas point process and the Matérn point process. I’ll use these tools in two future posts, one on measuring fractal dimension and one on kriging. An excellent resource for spatial statistics is the R package spatstat. The manual is a really great read. The spatstat package implements the Thomas and Matérn point processes as rThomas and rMatern.
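As a minimal sketch of the simplest of these, a homogeneous Poisson point process on a rectangle can be simulated in two steps: draw the number of points from a Poisson distribution with mean rate × area, then place that many points uniformly. The function name and parameters below are illustrative assumptions, not taken from the full post.

```python
import numpy as np

def poisson_point_process(rate, width, height, rng):
    """Simulate a homogeneous Poisson point process on [0, width] x [0, height]."""
    # The number of points is Poisson-distributed with mean rate * area.
    n = rng.poisson(rate * width * height)
    # Given the count, the points are independent and uniform on the window.
    x = rng.uniform(0.0, width, size=n)
    y = rng.uniform(0.0, height, size=n)
    return np.column_stack([x, y])

rng = np.random.default_rng(42)
pts = poisson_point_process(rate=100.0, width=2.0, height=1.0, rng=rng)
print(pts.shape)
```

Cluster processes such as the Thomas process can be built on top of this by treating the simulated points as parents and scattering offspring around each one.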


Numerical Solutions to ODEs

In this post I’ll present some theory and Python code for solving ordinary differential equations numerically. I’ll discuss Euler’s Method first, because it is the most intuitive, and then I’ll present Taylor’s Method and several Runge-Kutta Methods. Obviously, there is top-notch software out there that does this stuff in its sleep, but it’s fun to do math and write programs. This material is adapted from the excellent textbook by Burden and Faires, Numerical Analysis 8th Ed., which is easily worth whatever they’re asking for it these days.
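For a flavor of what Euler’s Method looks like in code, here is a minimal sketch (not the code from the full post). It advances the solution of y' = f(t, y) with the update y ← y + h·f(t, y); the test problem y' = y, y(0) = 1 is chosen only because its exact solution, e^t, is easy to compare against.

```python
import math

def euler(f, t0, y0, t_end, n_steps):
    """Approximate y' = f(t, y), y(t0) = y0 on [t0, t_end] with Euler's Method."""
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    ys = [y0]
    for _ in range(n_steps):
        y = y + h * f(t, y)  # follow the tangent line for one step
        t = t + h
        ys.append(y)
    return ys

# y' = y with y(0) = 1 has the exact solution y(t) = exp(t).
approx = euler(lambda t, y: y, 0.0, 1.0, 1.0, 100)
print("Euler:", approx[-1], "exact:", math.e)
```

The error shrinks roughly in proportion to the step size h, which is the main motivation for the higher-order Taylor and Runge-Kutta methods.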


Linear Regression with Python

In this post I will use Python to explore more measures of fit for linear regression. I will consider the coefficient of determination (R²), hypothesis tests (F, t, Omnibus), AIC, BIC, and other measures. This will be an expansion of a previous post where I discussed how to assess linear models in R, via the IPython notebook, by looking at the residuals and several measures involving the leverage.
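As a hedged illustration of a few of these measures, the sketch below fits a line by ordinary least squares with NumPy and computes R², AIC, and BIC from the Gaussian log-likelihood. The data is synthetic, the function name is hypothetical, and k is taken here as the number of fitted coefficients including the intercept; other sources count parameters differently, which shifts AIC and BIC by a constant.

```python
import numpy as np

def fit_measures(x, y):
    """Fit y = b0 + b1 * x by least squares and return some measures of fit."""
    n = len(y)
    A = np.column_stack([np.ones(n), x])       # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)              # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())
    k = A.shape[1]                             # assumed parameter count
    # Gaussian log-likelihood evaluated at the MLE of the error variance.
    log_like = -0.5 * n * (np.log(2 * np.pi) + np.log(ss_res / n) + 1.0)
    return {
        "coef": coef,
        "r2": 1.0 - ss_res / ss_tot,
        "aic": 2 * k - 2 * log_like,
        "bic": k * np.log(n) - 2 * log_like,
    }

# Synthetic data: a line with a small alternating perturbation.
x = np.arange(10.0)
y = 2.0 * x + 1.0 + 0.1 * (-1.0) ** np.arange(10)
m = fit_measures(x, y)
print(m["coef"], m["r2"], m["aic"], m["bic"])
```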


Small Data: Germinating Seeds

This is the first in a series of posts using the small data sets from The Handbook of Small Data Sets to illustrate introductory techniques in text processing, plotting, statistics, etc. The data sets are collected in a ZIP file at the publisher’s website, linked above. Someone decided to format the data files to resemble the published format to the greatest degree possible, which makes parsing the files interesting. First, we will import our modules,


Smoothing with Exponentially Weighted Moving Averages

A moving average takes a noisy time series and replaces each value with the average value of a neighborhood about the given value. This neighborhood may consist of purely historical data, or it may be centered about the given value. Furthermore, the values in the neighborhood may be weighted using different sets of weights. Here is an example of an equally weighted three point moving average, using historical data,
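In code, an equally weighted trailing average of this kind might look like the sketch below. The function name is hypothetical, and the choice to start output only once a full window of history is available is one of several reasonable conventions.

```python
def trailing_average(values, window=3):
    """Equally weighted moving average over the current and previous values."""
    out = []
    for i in range(window - 1, len(values)):
        # Average the current value with the (window - 1) values before it.
        out.append(sum(values[i - window + 1 : i + 1]) / window)
    return out

smoothed = trailing_average([1, 2, 3, 4, 5])
print(smoothed)  # [2.0, 3.0, 4.0]
```

An exponentially weighted version replaces the equal weights with weights that decay geometrically with age, so recent values count for more.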
