This is the first in a series of posts using the small data sets from The Handbook of Small Data Sets to illustrate introductory techniques in text processing, plotting, statistics, etc. The data sets are collected in a ZIP file at the publisher’s website, in the link above. Someone decided to format the data files to resemble the published format to the greatest degree possible, which makes parsing the files interesting. First, we will import our modules,
from pandas import DataFrame
from pylab import *
import numpy as np
Data
The first data set records the effect of two variables on the germination of seeds, in terms of the number of seedlings counted after a reasonable period of time. The first nominal variable is the level of watering, which encompasses six levels, from Not A Lot, to Rather Much. The second nominal variable pertains to whether or not the soil boxes were covered. This was included in the experiment in order to measure the effect of evaporation. If a box was covered, then the water evaporated more slowly.
Once the ZIP file has been downloaded and unzipped (using 7-zip), you will notice that the numbers are separated by tabs, which to computers look like \t. We’ll begin the text processing as follows,
germdata = open( 'handetal/GERMIN.DAT', 'r' ).readlines()
germdata = [ i.strip().split('\t') for i in germdata if i[0] not in ['#','\n'] ]
The first line of code opens the file and returns a list containing a string representation of each line of the file. The second line is a list comprehension, which is a faster, more compact form of a for-loop. For each string in the germdata list, the list comprehension applies two important string methods, strip() and split(). First, it strips off any trailing newline characters (\n). Second, it splits the string by tab characters (\t), returning a list of individual strings. The last part of the list comprehension is a conditional statement that skips blank lines and comments. Basically, it says that if the line starts with a pound sign (#), or is empty (\n), then we should ignore it and carry on. (One might even be inclined to remain calm.) At this point, germdata should look something like this,
[['22', '41', '66', '82', '79', '0'],
 ['25', '46', '72', '73', '68', '0'],
 ['27', '59', '51', '73', '74', '0'],
 ['23', '38', '78', '84', '70', '0'],
 ['45', '65', '81', '55', '31', '0'],
 ['41', '80', '73', '51', '36', '0'],
 ['42', '79', '74', '40', '45', '0'],
 ['43', '77', '76', '62', '*', '0']]
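To see exactly what these two string methods do, here is a toy example on a single made-up raw line (the values are illustrative, not read from the file):

```python
# a hypothetical raw line, as it might appear in GERMIN.DAT
raw = '22\t41\t66\t82\t79\t0\n'

# strip the trailing newline, then split on tab characters
fields = raw.strip().split('\t')
print( fields )  # → ['22', '41', '66', '82', '79', '0']
```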
Here, we have a list of lists of strings that represent numbers. The first four lists refer to the covered boxes, and the next four lists refer to the uncovered boxes. The columns, from left to right, represent the levels of watering ranging from Not A Lot, to Rather Much. It turns out that if you water plants Rather Much, then they cease to grow altogether, evaporation or no.
Finally, note the asterisk in the last line of the data. It is not uncommon for data to have missing values, or totally unpredictable null values, such as this asterisk. When faced with null or missing values, you can either make do with fewer observations, or insert a “reasonable guess” (air quotes emphasized) as to what that datum might have been, had it been.
I will cast the unknown datum as None, Python’s special variable for non-entities. In Ruby nil is used, and in Perl it is undef, etc. In the same stroke, I’ll convert the strings to numbers so that we can do math with them. I’ll accomplish these tasks with a function named intify() that turns integer strings into integers, and anything else into None.
def intify( x ):
    if x.isdigit():
        return int( x )
    else:
        return None
Now that we have intify(), we can turn the strings in germdata into integers using a double list comprehension.
germdata = [ [ intify( j ) for j in i ] for i in germdata ]
Now germdata should look like this,
[[22, 41, 66, 82, 79, 0],
 [25, 46, 72, 73, 68, 0],
 [27, 59, 51, 73, 74, 0],
 [23, 38, 78, 84, 70, 0],
 [45, 65, 81, 55, 31, 0],
 [41, 80, 73, 51, 36, 0],
 [42, 79, 74, 40, 45, 0],
 [43, 77, 76, 62, None, 0]]
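A quick sanity check of intify() on the two kinds of input it actually sees (restated here so the snippet stands alone):

```python
# re-stating intify() so this snippet is self-contained
def intify( x ):
    if x.isdigit():
        return int( x )
    else:
        return None

print( intify('62') )  # → 62
print( intify('*') )   # → None
# caveat: isdigit() is False for signed or decimal strings,
# so something like intify('-3') would also return None
```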
Data Management with Pandas
Now that our data is nice and clean, we can put it into a pandas DataFrame, which is a convenient container for handling data sets. The advantage of using the pandas module, rather than soldiering on with lists of lists (of lists), is that pandas code is more readable, and it scales well in terms of size and complexity. Another reason to use pandas DataFrames is that they handle unknown or missing data well.
The term DataFrame is reminiscent of the R data frame, which it is modeled after. The DataFrame allows you to define the column names of your matrix so that you can call individual variables by name, using either dot or bracket notation, e.g., if A is a column in the DataFrame df, then we can access it by saying either df.A or df['A']. This makes your code more readable and easier to debug/reuse/expand later.
In this case, we’ll name our columns 1 through 6, for the six levels of watering,
from pandas import DataFrame
germdata = DataFrame( germdata, columns=['1','2','3','4','5','6'] )
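As a small aside on the two access styles, here is an illustration with a hypothetical two-column frame (note that our actual columns, named '1' through '6', can only be reached with bracket notation, since df.1 is not valid Python syntax):

```python
from pandas import DataFrame

# a hypothetical frame, just to illustrate column access
df = DataFrame( [[22, 41], [25, 46]], columns=['A', 'B'] )

print( df.A.tolist() )     # dot notation     → [22, 25]
print( df['A'].tolist() )  # bracket notation → [22, 25]
```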
We’ll add another column to the DataFrame that describes whether the soil box was covered or not. We’ll call this column 'coverage'. In doing so, we’ll take advantage of Python’s operator overloading for sequences. For example, [0]*3 will produce [0,0,0], and [0]+[0,0] will also produce [0,0,0]. We can use these operations to build lists quickly and easily.
coverage = ['covered']*4 + ['uncovered']*4
germdata['coverage'] = coverage
Now that we have our data in a DataFrame, we can look at the medians of the different watering levels, organized by the labels in the 'coverage' column. Why do we use the median and not the mean? Well, the mean is easily affected by outliers, and when we don’t have many data points, one oddball can throw everything off. The median, on the other hand, is more robust, meaning that it is less affected by outliers. Here is the code and output for evaluating the median across the six hydration levels and two coverage levels.
germdata.groupby('coverage').median()
This produces the following table,
| coverage  | 1    | 2    | 3  | 4    | 5  | 6 |
|-----------|------|------|----|------|----|---|
| covered   | 24.0 | 43.5 | 69 | 77.5 | 72 | 0 |
| uncovered | 42.5 | 78.0 | 75 | 53.0 | 36 | 0 |
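To see that robustness in action, here is a small illustration using the column-1 counts for the covered boxes, plus one hypothetical wild miscount (the 250 is made up for the sake of the example):

```python
import numpy as np

counts = [22, 25, 27, 23]        # column-1 counts for the covered boxes
with_outlier = counts + [250]    # plus one hypothetical wild miscount

print( np.mean(with_outlier) )   # → 69.4  (dragged far from the typical value)
print( np.median(with_outlier) ) # → 25.0  (barely moves)
```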
BUT THAT’S NOT ALL: we can apply several functions to each group at once by passing them to the .agg() method of the grouped data. First we’ll define a function I found for calculating the percentile of a data set. (What’s a percentile? The median is the middle value of the data set after it has been sorted; it’s also the 50th percentile. The 25th percentile is the datum 25% of the way into the sorted data set, the 64th percentile is the datum 64% of the way in, and so on.) We sometimes use the interquartile range (IQR) to describe the spread or variation in a data set, rather than the variance or standard deviation, for the same reason we sometimes use the median rather than the mean: it’s a more robust measure. Without further ado, this is that function I found on the internet,
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = '{}%'.format(n)
    return percentile_
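A quick check of what this factory returns: a function that computes the requested percentile, with a __name__ that pandas will later use as a column label (restated here so the snippet runs on its own):

```python
import numpy as np

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = '{}%'.format(n)
    return percentile_

p25 = percentile(25)
print( p25.__name__ )          # → 25%
print( p25([1, 2, 3, 4, 5]) )  # → 2.0
```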
And here’s the code and output for describing the interquartile range for our data, across the different levels of treatment,
germdata.groupby('coverage').agg([ percentile(25), percentile(75) ])
| coverage  | 1 25% | 1 75% | 2 25% | 2 75% | 3 25% | 3 75% | 4 25% | 4 75% | 5 25% | 5 75% | 6 25% | 6 75% |
|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| covered   | 22.75 | 25.5  | 40.25 | 49.25 | 62.25 | 73.50 | 73.00 | 82.50 | 69.50 | 75.25 | 0     | 0     |
| uncovered | 41.75 | 43.5  | 74.00 | 79.25 | 73.75 | 77.25 | 48.25 | 56.75 | 34.75 | NaN   | 0     | 0     |
Here, we observe an interesting thing. While pandas gladly handles missing or unknown values, because that’s the nature of data science, NumPy, which is a module for dealing with n-dimensional arrays and matrix operations for scientific and engineering applications, is most definitely not happy about unknown values. Suffice it to say, it does not compute, so we get a NaN value, which stands for Not a Number.
Plotting with Matplotlib
So, having our interquartile numbers in a table is nice, but if you’re showing this to your boss, he probably wants a picture. Pandas and matplotlib work well together and make creating plots very simple. Below we’ll plot the number of seedlings observed for each of the watering levels across the two levels of coverage. First, we’ll plot a simple boxplot as,
germdata[ germdata['coverage']=='covered' ].boxplot()
Here, we are using the boxplot() method of germdata. This is nice, but we can do a little bit better using the pylab boxplot function to produce two plots in the same figure. Notice that in both subplots we call ylim(0,100) to force the y-axes to have the same range of values. This is important for comparing the two plots: if we want to compare apples to apples, then the y-axes need to have the same range.
fig = figure()

ax = fig.add_subplot(211)
ax.boxplot( germdata[ germdata['coverage']=='covered' ].values[:,:-2] )
ylim(0,100) ; ax.set_ylabel('Covered') ; ax.set_aspect(0.025) ; grid()
title('Germination')

ax = fig.add_subplot(212)
ax.boxplot( germdata[ germdata['coverage']=='uncovered' ].values[:,:-2] )
ylim(0,100) ; ax.set_ylabel('Uncovered') ; ax.set_aspect(0.025) ; grid()
xlabel('Hydration Level')

# TOTALLY OPTIONAL PART
sup = fig.add_axes( [0., 0., 1, 1] )
sup.set_axis_off()
sup.set_xlim(0, 1)
sup.set_ylim(0, 1)
sup.text( 0.16, 0.5, 'Seedling Count', rotation='vertical',
          horizontalalignment='center', verticalalignment='center', size=12 )

savefig( 'hand_data_germination.png', format='png', dpi=200 )
The totally optional part involved adding an extra y-axis label to span the two plots; I found that code online. It adds another axis to the figure, and then adds some text. At any rate, we can see from the plot that covering the boxes has a noticeable effect on the germination of the seeds.