In this post I’ll demonstrate how to build an object-oriented Tkinter GUI application for associating labels with filenames, in order to quickly and easily build a set of training data. The Submit button associates the label with the file, and the Save and Quit button dumps the files and their associated labels into a Python dict, and then into a cPickle file for later use. This is still a little rough around the edges: it assumes that you’re looking for PNG data in the current directory, and the output overwrites previous output, but it’s a start.
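The storage half of that workflow can be sketched without the GUI. This is a minimal sketch, assuming Python 3 (where cPickle has been folded into the standard pickle module); the function names and the labels.pkl filename are placeholders for illustration, not names taken from the original code.

```python
import os
import pickle


def find_png_files(directory="."):
    """Return the sorted PNG filenames found in `directory`."""
    return sorted(f for f in os.listdir(directory) if f.lower().endswith(".png"))


def save_labels(labels, path="labels.pkl"):
    """Dump the filename -> label dict to a pickle file, overwriting old output."""
    with open(path, "wb") as fh:
        pickle.dump(labels, fh)


def load_labels(path="labels.pkl"):
    """Read the filename -> label dict back from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

In a GUI, a Submit callback would add one entry to the dict, and a Save and Quit callback would call something like save_labels before closing the window.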
I bought one of the Arduino Sidekick component kits from RadioShack this weekend and I’d like to build a few circuits with those parts over the next few posts. I’ll be using Mike Margolis’ Arduino Cookbook, which is the best text on tinkering with Arduinos that I have found; I highly recommend it.
In this post I’ll discuss creating and altering shapefiles, and converting point sets from one coordinate reference system to another. I’ll also touch on scripting these tasks for large data sets. I’ll begin with the installation of Quantum GIS and Python for manipulating geographical data. I mainly use QGIS for visualizing and building shapefiles, and I use OSGeo4W from the command line for adding/converting shapefile projections, and converting point sets from one CRS to another.
The Google Maps API lets you query elevation data using WGS84 coordinates. All you have to do is construct a URL with the coordinates, and Google will return a JSON file. A JSON file is basically a text file with some extra structure in the form of keywords, brackets, braces, colons, and commas.
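Building the URL and reading the reply might look roughly like the sketch below. The endpoint and the `locations` and `key` parameters are assumptions based on the Elevation API as commonly documented, so check the current documentation before relying on them. The function names are hypothetical, and the sample response uses made-up values so that nothing here touches the network.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the Google Maps Elevation API; verify against the docs.
ELEVATION_URL = "https://maps.googleapis.com/maps/api/elevation/json"


def build_elevation_url(lat, lng, api_key):
    """Build a request URL for one WGS84 latitude/longitude pair."""
    query = urlencode({"locations": f"{lat},{lng}", "key": api_key})
    return f"{ELEVATION_URL}?{query}"


def parse_elevation(json_text):
    """Pull the first elevation value out of a JSON response body."""
    payload = json.loads(json_text)
    return payload["results"][0]["elevation"]


# Hypothetical response body with made-up values, shaped like the API's output.
sample = (
    '{"results": [{"elevation": 1608.6, '
    '"location": {"lat": 39.74, "lng": -104.98}}], "status": "OK"}'
)
print(build_elevation_url(39.74, -104.98, "YOUR_API_KEY"))
print(parse_elevation(sample))
```

Note that `urlencode` percent-encodes the comma between the latitude and longitude; the server decodes it back, so the request is unchanged.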
In this post I will walk through the computation of principal components from a data set using Python. A number of languages and modules implement principal components analysis (PCA), but implementations can vary slightly, which may lead to confusion if you are trying to follow someone else’s code or you are using multiple languages. Perhaps more importantly, as a data analyst you should at all costs avoid using a tool if you do not understand how it works. I will use data from The Handbook of Small Data Sets to illustrate this example. The data sets can be found in a zipped directory on the site linked above.
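For orientation, here is one common way to compute principal components by hand: an eigendecomposition of the sample covariance matrix. This is a minimal sketch assuming NumPy is available, and the `pca` function name is a placeholder. It also shows two places where implementations tend to differ: the sign of each eigenvector is arbitrary, and some tools divide by n rather than n - 1 when forming the covariance.

```python
import numpy as np


def pca(X):
    """Return eigenvalues, eigenvectors, and scores, largest variance first."""
    X = np.asarray(X, dtype=float)
    # Center each column so the covariance is taken about the mean.
    centered = X - X.mean(axis=0)
    # Sample covariance (divides by n - 1); columns are the variables.
    cov = np.cov(centered, rowvar=False)
    # eigh suits symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Project the centered data onto the principal axes.
    scores = centered @ eigvecs
    return eigvals, eigvecs, scores
```

The columns of the returned eigenvector matrix are the principal axes, and the eigenvalues are the variances of the corresponding score columns.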
In this post I’ll provide an example of exploratory factor analysis in R. We will use the psych package by William Revelle. Factor analysis seeks to find latent variables, or factors, by looking at the correlation matrix of the observed variables. This technique can be used for dimensionality reduction, or for better insight into the data. As with any technique, this will not work in all scenarios. Firstly, latent variables are not always present, and secondly, it will miss existing latent variables if they are not apparent from the observed data. More information can be found in the manual distributed with the package.
In this post I will work through an example of Simple Kriging. Kriging is a set of techniques for interpolation. It differs from other interpolation techniques in that it sacrifices smoothness for the integrity of sampled points. Most interpolation techniques will overshoot or undershoot the value of the function at sampled locations, but kriging honors those measurements and keeps them fixed. In future posts I would like to cover other types of kriging, other semivariogram models, and colocated co-kriging. Until then, I’m keeping relatively up-to-date code at my GitHub project, geostatsmodels.
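The claim that kriging honors the sampled values can be checked with a small sketch. The code below assumes NumPy, a known stationary mean, and an exponential covariance model with arbitrary sill and range. Those choices, along with the function names, are assumptions made for illustration and are not taken from geostatsmodels.

```python
import numpy as np


def exp_cov(h, sill=1.0, corr_range=10.0):
    """Exponential covariance model (an assumed choice for this sketch)."""
    return sill * np.exp(-h / corr_range)


def simple_krige(coords, values, target, mean, cov=exp_cov):
    """Simple kriging estimate at `target`, given a known stationary mean."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    # Covariances between every pair of sampled points.
    pair_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = cov(pair_dist)
    # Covariances between each sampled point and the target location.
    c0 = cov(np.linalg.norm(coords - target, axis=1))
    # Kriging weights solve C w = c0.
    weights = np.linalg.solve(C, c0)
    # Estimate is the mean plus the weighted residuals.
    return mean + weights @ (values - mean)
```

When the target coincides with a sampled point, c0 equals one column of C, so the weights reduce to a single 1 and the estimate returns the measured value. Far from all samples the weights shrink toward zero and the estimate falls back to the mean.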
In this post I will present a Python implementation of a new technique for fractal interpolation derived from a paper by Manousopoulos, Drakopoulos, and Theoharis. You may find my code here on GitHub. Fractal interpolation is useful for data sets that exhibit self-similarity at multiple scales, which are difficult to interpolate with polynomials.
I noticed that when I photocopy and email documents, the resulting attachment has relatively low resolution, and the digits get melded into one another. I decided to try to build a classifier to begin to sort this out. To this end, I needed to build a data set. First, I used svgfig to produce SVG sans-serif digit pairs, with kerning adjusted at four intervals. Then I used inkscape to create PNG images from the SVG files. Finally, I read the PNG images and wrote them to a NumPy array. I created a set of clean images, and a set of images polluted with Gaussian noise with a mean of zero and a variance of 0.1. (The pixels were then rescaled back to the range of 0 to 1.) I also shifted each pair in eight directions. This produced a data set of 7,200 16×16-pixel images, half of which were noisy. I used a random forest classifier from sklearn, and performed 10-fold cross-validation.
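The noise step might look something like the sketch below. It assumes NumPy and treats "rescaled" as a min-max rescale; clipping to the range would be another reasonable reading, so take that as an assumption. The function name and the rectangular stand-in image are hypothetical.

```python
import numpy as np


def add_gaussian_noise(image, variance=0.1, seed=None):
    """Add zero-mean Gaussian noise, then min-max rescale to [0, 1]."""
    rng = np.random.default_rng(seed)
    # rng.normal takes a standard deviation, so convert from the variance.
    noisy = image + rng.normal(0.0, np.sqrt(variance), size=image.shape)
    span = noisy.max() - noisy.min()
    if span == 0:
        return np.zeros_like(noisy)
    return (noisy - noisy.min()) / span


clean = np.zeros((16, 16))
clean[4:12, 6:10] = 1.0  # stand-in for a rendered digit pair
noisy = add_gaussian_noise(clean, seed=0)
```

Passing a seed keeps the noisy set reproducible, which helps when comparing classifier runs.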