In this post, I’ll describe a technique for determining whether the mean of two sets are significantly different. In a previous post I demonstrated how to perform the standard statistical tests using R. Randomization tests are convenient when you can’t say anything about the normality or homoscedasticity (constant variance) of the population, and/or you don’t have access to a truly random sample.
Theory
Given two sets of data, and , we would like to determine if their means are significantly different. To that end, we will first compute the difference of their means and call it . We will compare to a distribution of mean differences generated by the data. Suppose and each have elements, that is, . If is small, we can compute every possible pair of -element sets, and , from the data in and , and compute the difference in their means. These differences should be collected in a variable . We can then compare the original difference in means with the distribution of the differences of means of the randomized sets and in order to determine if is an outlier or not. Let be the set of mean differences in greater than , or . If , the ratio of the sizes of and , is very small, then is probably an outlier, and the difference in the means of and is probably statistically significant.
Implementaion
First we’ll import some modules.
import numpy as np import scipy.stats import random import seaborn
Then we’ll generate two data sets, x
and y
. The data set z
will be a mixture of x
and y
.
N = 10 x = scipy.stats.beta(3,1).rvs(N) y = scipy.stats.beta(4,6).rvs(N) z = np.hstack((x,y))
We can take the mean of the difference of x
and y
to obtain d0
.
d0 = np.mean( x - y ) ; d0
0.26227545315547807
Then, we can create a for loop that will randomly sample the combined data set z
to produce two random sets px
and py
. We’ll inspect the distribution of the mean differences between px
and py
to determine if d0
is an outlier, or if it is representative of the distribution of random mean differences. If it is an outlier, then we can say that there is a good chance that the sets x
and y
are significantly different.
M = 4999 d = np.zeros((M,)) for k in range( M ): pxi = random.sample( np.arange(20), 10 ) pyi = [ i for i in np.arange(20) if i not in pxi ] px = np.mean( [ z[i] for i in pxi ] ) py = np.mean( [ z[i] for i in pyi ] ) d[k] = px - py
We can inspect the distribution visually using the hist()
function from the seaborn
package that was imported above. This package provides plots that are more reminiscent of ggplot
plots.
hist( d, bins=12 ) ; title("Distribution of Group Differences") ; xlabel("Mean Difference") ; ylabel("Frequency") ;
The next two lines count how many randomized mean differences are greater than the observed mean difference, d0
, and then calculate the proportion of greater differences to the whole, N
. We can interpret this as a probability of observing a greater mean difference.
p = len( np.where( d >= d0 )[0] ) print p / float( M )
0.0014002800560112022