In this post I’ll demonstrate how to build a object oriented Tkinter GUI application for associating labels to filenames in order to quickly and easily build a set of training data. The Submit button will associate the label with the file, and the Save and Quit button will dump the file and its associated label into a Python dict, and then a cPickle file for later use. This is still a little rough around the edges; it assumes that you’re looking for PNG data in the current directory, and the output overwrites previous output, but it’s a start.

In this post I’ll demonstrate how to open an Excel file in Python using Pandas, a (the) module for data manipulation. I love using Pandas, and I cannot recommend it enough.

In this post I’ll discuss creating and altering shapefiles, and converting point sets from one coordinate reference system to another. I’ll also touch on scripting these tasks for large data sets. I’ll begin with the installation of Quantum GIS and Python for manipulating geographical data. I mainly use QGIS for visualizing and building shapefiles, and I use OSGeo4W from the command line for adding/converting shapefile projections, and converting point sets from one CRS to another.

In this post I will walk through the computation of principal components from a data set using Python. A number of languages and modules implement principal components analysis (PCA) but some implementations can vary slightly which may lead to confusion if you are trying to follow someone else’s code, or you are using multiple languages. Perhaps more importantly, as a data analyst you should at all costs avoid using a tool if you do not understand how it works. I will use data from The Handbook of Small Data Sets to illustrate this example. The data sets will be found in a zipped directory on site linked above.

In this post I will work through an example of Simple Kriging. Kriging is a set of techniques for interpolation. It differs from other interpolation techniques in that it sacrifices smoothness for the integrity of sampled points. Most interpolation techniques will over or undershoot the value of the function at sampled locations, but kriging honors those measurements and keeps them fixed. In future posts I would like to cover other types of kriging, other semivariaogram models, and colocated co-kriging. Until then, I’m keeping relatively up to date code at my GitHub project, geostatsmodels.

In this post I will present a Python implementation of a new technique for fractal interpolation derived from a paper by Manousopoulos, Drakopoulos, and Theoharis. You may find my code on here on GitHub. Fractal interpolation is useful for data sets that exhibit self similarity at multiple scales, which are difficult to interpolate with polynomials.

I noticed that when I photocopy and email documents, the resulting attachment has relatively low resolution, and the digits get melded to one another. I decided to try to build a classifier to begin to sort this out. To this end, I needed to build a data set. First, I used svgfig to produce SVG sans-serif digit pairs, with kerning adjusted at four intervals. Then I used inkscape to create PNG images from the SVG files. Finally, I read the PNG images and wrote them to a NumPy array. I created a set of clean images, and images polluted with Gaussian noise, with a mean of zero, and a variance of 0.1. (The pixels were then rescaled back to the range of 0 to 1.) I also shifted each pair in eight directions. This produced a data set with 7200, 16×16 pixel images, half of which were noisy. I used a random forests classifier from sklearn, and performed 10-fold cross validation.

In this post I’ll present a recipe for taking an integral over an arbitrary triangular region using the SciPy integrate.dblquad() function. This is an important operation for implementing the Finite Elements method for solving partial differential equations. In school we are taught to perform a change of variables which involves splitting the triangle into two regions and performing the double integration on the simpler sub-domains after carefully calculating new limits of integration. This recipe maps the triangle to the unit square, and then calculates the double integral on the domain . I pieced this together after looking at this discussion on the MATLAB Central message board regarding the transformation of the triangle to the unit square, and this post on Paul’s Online Notes that touched on the calculation of the Jacobian, and this post by John D. Cook about choosing the correct error limits for quadrature integration.

In this post I will present a technique for generating a one dimensional (quasi) fractal data set using a modified Matérn point process, perform a simple box-couting procedure, and then calculate the lacunarity and fractal dimension using linear regression. Lacunarity is actually a pretty large topic, and we will only cover one accepted interpretation here. This material was motivated by an interesting paper on the fractal modelling of fractures in tight gas reservoirs. Tight gas reservoirs refer to reservoirs with very low permeability. To provide a sense of perspective, oil reservoirs typically have a permebility of ten to a hundred millidarcies, whereas shale gas reservoirs are usually less than 0.1 microdarcies, which is about the same permeability as a granite countertop.

Here, I’ll introduce some ideas regarding spatial point processes using Python. First I’ll present the Poisson point process, and then I’ll cover two other processes: the Thomas point process and the Matérn point process. I’ll use these tools in two future posts regarding measuring fractal dimension, and kriging. An excellent resource for spatial statistics is the R package spstat. The manual is a really great read. The spstat package implements the Thomas and Matérn point processes as rThomas and rMatern.