This is one of the fundamental tasks in science. You do a study, and then you have to determine if there is a statistically meaningful difference between the test and control data. It is important to be able to understand the hypothesis testing, because a lot of interesting functions in R are hypothesis tests. I’ll consider the simple z-test for testing whether the mean of the simple is the same as the hypothesized mean of the population. We’ll see how statistical power, which is the probability of detecting a difference in means, changes with sample size and effect size, which is the size of the difference between the observed sample mean, and the hypothesized population mean. We’ll also see that the significance level is comparable to the Type-II (false negative) error rate.
The takeaway is that if you want to detect a small change, then you had better have a very (painfully) large sample. Likewise, with a small sample size, all you can hope to do is detect (painfully) obvious differences in means. At the end of the day, you not only need to present a -value, but also an estimate of the power of the test to detect the effect. A small -value means nothing when the power of the test is similarly small.
Classical Hypothesis Test for a Difference of Means
If we assume that our two data sets are approximately normal, and that we have ~30 data points per set, then we can perform classical statistical hypothesis testing. This involves conjuring up a null hypothesis of no difference between the data sets, and an alternate hypothesis that there is a significant difference between the sets.
We also need to state a significance level, , which is our threshold for rejecting or failing to reject the null hypothesis. The smaller our significance level , the more sure we are that the data observed in our sample represents the population from which the sample was taken. Alternatively, we can think of the significance level as an acceptable level for our Type-II, or false negative, error rate.
Once we perform the test we obtain a test statistic which is then converted to a -value, the probability associated with that test statistic. The -value represents the probability that the null hypothesis is true, but that there exist unobserved data that are more extreme than the observed data. If the is less than , then we reject the null hypothesis. If the -value is greater than , then we fail to reject the null hypothesis.
Here is a short function that will perform the z-test
def ztest( dp, mu=0.0, sigma=None, n=None, alpha=0.05, twotail=True ): ''' Input : (dp) iterable or float, sample data points, or estimated mean (mu) float, hypothesized population mean (sigma) float, standard deviation of sample (n) integer, number of data points (alpha) float, confidence level (twotail) boolean, twotailed test Output: (pv) probability of the hypothesis being true, but there exist data points that are at least as extreme as the observed values ''' # in case the data points are a NumPy array if( type( dp )==np.ndarray ): n = dp.size x = np.mean( dp ) # in case the data points are a list elif( type( dp )==list ): n = len( dp ) x = np.mean( dp ) # in case we provide an estimated mean value else: x = dp # two-tail correction if twotail: alpha /= 2.0 # sample standard deviation if sigma == None: sigma = np.std( dp ) # this prevents division by zero if sigma == 0.0: sigma = 0.01 # z-score z = ( x - mu ) / float( sigma ) # check n if n == None: print 'Need a sample size, n.' return None # test statistic ts = z * np.sqrt( n ) # calculate the p-value if ts < 0: pv = scipy.stats.norm( 0, 1 ).cdf( ts ) elif ts > 0: pv = scipy.stats.norm( 0, 1 ).cdf( -ts ) return pv
We can evaluate the power of this test for a fixed standard deviation of 1.0, and different sample sizes and effect sizes. We will see that as we increase the sample size or the effect size, we increase the power of the test, the probability that it will detect an effect of a given size. We would also see an increase in power if we could reduce the standard deviation, but this is almost always out of the investigator’s control, so we will consider it fixed. We will look at effects of size 0.1 to 0.6, in increments of 0.1, and we will consider sample sizes from 5 to 100, in increments of 5. We will perform 500 trials per effect size and sample size, and calculate the empirical statistical power from that.
# dictionary to hold the results power = dict() # number of trials ntrials = 500 # effect sizes: [ 0.1, 0.2, ..., 0.6 ] effects = np.linspace( 0.1, 0.6, 6, endpoint=True ) for effect in effects: # temporary list tmp = list() # normally distributed sample with mu=effect, and std=1 dp = scipy.stats.norm(effect,1) # number of points in a sample for npoints in range( 5, 105, 5 ): # list of p-values outcome = np.array( [ ztest( dp.rvs( npoints ), mu = 0.0 ) for i in range( ntrials ) ] ) # number of p-values that are less than the significance level alpha reject_null = np.where( outcome <= 0.025 ).size # sample size, and probability of rejecting the null hypothesis that the means are equal tmp.append( [ npoints, reject_null / float( ntrials ) ] ) # add these results to a dictionary, indexed by effect size power[(effect,1)] = np.array( tmp )
For completeness, this is the code I used to generate the plots. The only interesting thing here is the very last line, which prevents the
savefig() function from cutting off the legend that I put on the side of the graph.
# organize the tests by effect size keys = power.keys() keys.sort() # perform the plotting for k in keys: plot( power[k][:,0], power[k][:,1], marker='.', label=str(k) ) # limit the x axis from 5 to 100 xlim(5,100) # limit the y axis to [0,1] ylim( 0, 1 ) # plot the legend lgd = legend( loc=(1.05,0.2), title="Effect Size" ) # put a grid on it! grid() # title and x and y labels title("Empirical Statistical Power for\nDifferent Effect Sizes, with ") xlabel("Sample Size") ; ylabel("Empirical Statistical Power") # save the figure savefig( 'statistical_power.png', fmt='png', dpi=200,\ bbox_extra_artists=(lgd,), bbox_inches='tight' )
Type-II Error Rate and the Significance Level
Traditionally, practitioners of hypothesis tests have been very worried about Type-II errors, or false negatives. The idea is that it is more important to minimize false negatives than false positives, when a positive test constitutes a clear and present danger. In other words, a placing more emphasis on minimizing the false negative error rate errs on the side of caution. This makes sense in testing for cancer; a false positive is startling, but a false negative is prematurely fatal. In other situations, it is also important to minimize the false positive error rate; for instance, if it is costly or time consuming to post-process (potentially false) positives. The next bit of code will demonstrate empirically that the significance level approximates the Type-II false positive error rate.
# dictionary to hold the results typeII = dict() # number of trials ntrials = 1000 # significance levels: [ 0.025, 0.05, 0.01 ] alphas = [ 0.025, 0.05, 0.1 ] for a in alphas: # temporary list tmp = list() # normally distributed sample with mu=effect, and std=1 dp = scipy.stats.norm(0,1) # number of points in a sample for npoints in range( 5, 105, 5 ): # list of p-values outcome = np.array( [ ztest( dp.rvs( npoints ), mu = 0.0, alpha = a ) for i in range( ntrials ) ] ) # number of p-values that are less than the significance level alpha reject_null = np.where( outcome <= a/2.0 ).size # sample size, and probability of rejecting the null hypothesis that the means are equal tmp.append( [ npoints, reject_null / float( ntrials ) ] ) # add these results to a dictionary, indexed by effect size typeII[(a,1)] = np.array( tmp )