Randomization Test for a Difference in Mean

In this post, I’ll describe a technique for determining whether the mean of two sets are significantly different. In a previous post I demonstrated how to perform the standard statistical tests using R. Randomization tests are convenient when you can’t say anything about the normality or homoscedasticity (constant variance) of the population, and/or you don’t have access to a truly random sample.

Theory

Given two sets of data, X and Y, we would like to determine if their means are significantly different. To that end, we will first compute the difference of their means and call it d_{0}. We will compare d_{0} to a distribution of mean differences generated by the data. Suppose X and Y each have N elements, that is, \vert X \vert = N = \vert Y \vert. If N is small, we can compute every possible pair of N-element sets, X^{\prime} and Y^{\prime}, from the data in X and Y, and compute the difference in their means. These differences should be collected in a variable d. We can then compare the original difference in means d_{0} with the distribution of the differences of means of the randomized sets X^{\prime} and Y^{\prime} in order to determine if d_{0} is an outlier or not. Let p be the set of mean differences in d greater than d_{0}, or p = \vert { x > d_{0} : x \in d } \vert. If \frac{\vert p \vert}{\vert d \vert}, the ratio of the sizes of p and d, is very small, then d_{0} is probably an outlier, and the difference in the means of X and Y is probably statistically significant.

Implementaion

First we’ll import some modules.

import numpy as np
import scipy.stats
import random
import seaborn

Then we’ll generate two data sets, x and y. The data set z will be a mixture of x and y.

N = 10
x = scipy.stats.beta(3,1).rvs(N)
y = scipy.stats.beta(4,6).rvs(N)
z = np.hstack((x,y))

We can take the mean of the difference of x and y to obtain d0.

d0 = np.mean( x - y ) ; d0
0.26227545315547807

Then, we can create a for loop that will randomly sample the combined data set z to produce two random sets px and py. We’ll inspect the distribution of the mean differences between px and py to determine if d0 is an outlier, or if it is representative of the distribution of random mean differences. If it is an outlier, then we can say that there is a good chance that the sets x and y are significantly different.

M = 4999
d = np.zeros((M,))

for k in range( M ):
    
    pxi = random.sample( np.arange(20), 10 )
    pyi = [ i for i in np.arange(20) if i not in pxi ]
    
    px = np.mean( [ z[i] for i in pxi ] )
    py = np.mean( [ z[i] for i in pyi ] )
    
    d[k] = px - py

We can inspect the distribution visually using the hist() function from the seaborn package that was imported above. This package provides plots that are more reminiscent of ggplot plots.

hist( d, bins=12 ) ;
title("Distribution of Group Differences") ;
xlabel("Mean Difference") ;
ylabel("Frequency") ;

The next two lines count how many randomized mean differences are greater than the observed mean difference, d0, and then calculate the proportion of greater differences to the whole, N. We can interpret this as a probability of observing a greater mean difference.

p = len( np.where( d >= d0 )[0] )
print p / float( M )
0.0014002800560112022