Hypothesis testing is one of the fundamental tasks in science. You do a study, and then you have to determine whether there is a statistically meaningful difference between the test and control data. It is important to understand hypothesis testing, because many useful functions in R are hypothesis tests. I’ll consider the simple z-test for testing whether the mean of the sample is the same as the hypothesized mean of the population. We’ll see how statistical power, which is the probability of detecting a difference in means, changes with sample size and effect size, which is the size of the difference between the observed sample mean and the hypothesized population mean. We’ll also see how the significance level relates to the Type-II (false negative) error rate.
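As a rough illustration of how power depends on sample size and effect size, here is a minimal Python sketch for a one-sided z-test. It assumes a known population standard deviation; the function name `z_test_power` and the numbers passed to it are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test (H1: the true mean exceeds the hypothesized mean).

    effect: true mean minus hypothesized mean (assumed non-negative here)
    sigma:  known population standard deviation
    n:      sample size
    alpha:  significance level, i.e. the Type-I (false positive) error rate
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    shift = effect * sqrt(n) / sigma        # mean of the z statistic under H1
    return 1 - std_normal.cdf(z_crit - shift)


# Illustrative values only: power grows with n for a fixed effect size.
for n in (10, 30, 100):
    power = z_test_power(effect=0.5, sigma=1.0, n=n)
    print(f"n={n:3d}  power={power:.3f}  type-II rate={1 - power:.3f}")
```

The Type-II error rate is one minus the power, so for a fixed significance level it shrinks as the sample size or the effect size grows.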
Continue reading Classical Hypothesis Testing, Statistical Power, and Type-II Errors →
In this post I’ll discuss building a Windows executable from a Python script for 32-bit and 64-bit Windows. Producing a 64-bit executable on a 64-bit machine in Windows is easy using PyInstaller, but producing a 32-bit executable on a 64-bit machine takes some tinkering. I ended up setting up a chroot environment on Ubuntu for this task.
Continue reading Develop Windows Executables from Python Scripts for 32-bit and 64-bit Architectures →
This is my first whack at using PyBrain for optical character recognition. I am limiting myself to numerical data, since that’s what I have lying around most in need of being optically recognized. I’m also focusing on extra small and heavily corrupted data.
Continue reading Using PyBrain for Optical Character Recognition (First Whack) →
In this post I’ll demonstrate how to build an object-oriented Tkinter GUI application for associating labels with filenames in order to quickly and easily build a set of training data. The Submit button will associate the label with the file, and the Save and Quit button will dump the file and its associated label into a Python dict, and then into a cPickle file for later use. This is still a little rough around the edges; it assumes that you’re looking for PNG data in the current directory, and the output overwrites previous output, but it’s a start.
Continue reading Tkinter Optical Character Recognition Training Data Labeler →
In this post I’ll demonstrate how to open an Excel file in Python using Pandas, a (the) module for data manipulation. I love using Pandas, and I cannot recommend it enough.
Continue reading Open an Excel File in Pandas →
I bought one of the Arduino Sidekick component kits from RadioShack this weekend, and I’d like to build a few circuits with those parts over the next few posts. I’ll be using Mike Margolis’ Arduino Cookbook, which is the best text on tinkering with Arduinos that I have found, and I highly recommend it.
Continue reading Wiring a Tilt Switch →
In this post I’ll discuss creating and altering shapefiles, and converting point sets from one coordinate reference system to another. I’ll also touch on scripting these tasks for large data sets. I’ll begin with the installation of Quantum GIS and Python for manipulating geographical data. I mainly use QGIS for visualizing and building shapefiles, and I use OSGeo4W from the command line for adding/converting shapefile projections, and converting point sets from one CRS to another.
Continue reading Using QGIS and OSGeo4W for Geo-Data Tasks →
The Google Maps API lets you query elevation data using WGS84 coordinates. All you have to do is construct a URL with the coordinates, and Google will return a JSON file. A JSON file is basically a text file with some extra structure in the form of keywords, brackets, braces, colons, and commas.
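To make the URL-then-JSON idea concrete, here is a small Python sketch (the post itself uses PowerShell). It builds a query URL for the Elevation API endpoint and reads elevations out of a response body without making a network call. The helper names, the placeholder key, and the sample response are assumptions made for illustration; the sample is not real API output.

```python
import json
from urllib.parse import urlencode

ELEVATION_ENDPOINT = "https://maps.googleapis.com/maps/api/elevation/json"


def elevation_url(lat, lng, api_key):
    """Build a query URL for one WGS84 latitude/longitude pair."""
    query = urlencode({"locations": f"{lat},{lng}", "key": api_key})
    return f"{ELEVATION_ENDPOINT}?{query}"


def elevations(response_text):
    """Pull the elevation values (in meters) out of a JSON response body."""
    payload = json.loads(response_text)
    return [result["elevation"] for result in payload["results"]]


# Made-up response body, for illustration only; not real API output.
sample = (
    '{"results": [{"elevation": 1608.6, '
    '"location": {"lat": 39.74, "lng": -104.98}}], "status": "OK"}'
)

print(elevation_url(39.74, -104.98, "YOUR_API_KEY"))
print(elevations(sample))
```

Note that `urlencode` percent-encodes the comma between latitude and longitude, which is still a valid query string.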
Continue reading Query Google Maps API Using Windows PowerShell →
In this post I will walk through the computation of principal components from a data set using Python. A number of languages and modules implement principal components analysis (PCA), but implementations can vary slightly, which may lead to confusion if you are trying to follow someone else’s code or you are using multiple languages. Perhaps more importantly, as a data analyst you should at all costs avoid using a tool if you do not understand how it works. I will use data from The Handbook of Small Data Sets to illustrate this example. The data sets can be found in a zipped directory on the site linked above.
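For orientation, one common way to compute principal components is an eigendecomposition of the sample covariance matrix. The sketch below assumes NumPy and uses synthetic data rather than the Handbook data sets; the function name `principal_components` is made up here, and this is not necessarily the exact procedure the post follows.

```python
import numpy as np


def principal_components(data):
    """Return (eigenvalues, eigenvectors, scores) for a samples-by-variables array."""
    centered = data - data.mean(axis=0)      # remove each column's mean
    cov = np.cov(centered, rowvar=False)     # sample covariance (divides by n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs              # data expressed in the component basis
    return eigvals, eigvecs, scores


# Synthetic data, for illustration only: 200 samples of 3 correlated variables.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
data = np.hstack([base, 2 * base, -base]) + 0.1 * rng.normal(size=(200, 3))

eigvals, eigvecs, scores = principal_components(data)
print("explained variance ratio:", eigvals / eigvals.sum())
```

Differences between implementations often come down to choices visible here: whether the covariance divides by n or n - 1, whether variables are standardized first, and the sign of each eigenvector, which is arbitrary.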
Continue reading Computing Principal Components in Python →
Blog about math, programming, and data.