Classical Hypothesis Testing, Statistical Power, and Type-II Errors

This is one of the fundamental tasks in science. You do a study, and then you have to determine whether there is a statistically meaningful difference between the test and control data. It is important to understand hypothesis testing, because a lot of the interesting functions in R are hypothesis tests. I’ll consider the simple z-test for testing whether the mean of the sample is the same as the hypothesized mean of the population. We’ll see how statistical power, which is the probability of detecting a difference in means, changes with sample size and effect size, where the effect size is the difference between the observed sample mean and the hypothesized population mean. We’ll also see that the significance level \alpha corresponds to the Type-I (false positive) error rate, while statistical power is tied to the Type-II (false negative) error rate.

The takeaway is that if you want to detect a small change, then you had better have a very (painfully) large sample. Likewise, with a small sample size, all you can hope to do is detect (painfully) obvious differences in means. At the end of the day, you not only need to present a p-value, but also an estimate of the power of the test to detect the effect. A small p-value means nothing when the power of the test is similarly small.

Classical Hypothesis Test for a Difference of Means

If we assume that our two data sets are approximately normal, and that we have ~30 data points per set, then we can perform classical statistical hypothesis testing. This involves conjuring up a null hypothesis of no difference between the data sets, and an alternative hypothesis that there is a difference between the sets.

We also need to state a significance level, \alpha, which is our threshold for rejecting or failing to reject the null hypothesis. The smaller the significance level \alpha, the stronger the evidence we require before we are willing to reject the null hypothesis. Alternatively, we can think of the significance level as an acceptable level for our Type-I, or false positive, error rate: the probability of rejecting a null hypothesis that is actually true.

Once we perform the test we obtain a test statistic, which is then converted to a p-value. The p-value is the probability, computed under the assumption that the null hypothesis is true, of observing a test statistic at least as extreme as the one actually observed. If the p-value is less than \alpha, then we reject the null hypothesis. If the p-value is greater than \alpha, then we fail to reject the null hypothesis.
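
Concretely, for the one-sample z-test used below, with sample mean \bar{x}, hypothesized population mean \mu_0, standard deviation \sigma, and sample size n, the test statistic is

z = \frac{ \bar{x} - \mu_0 }{ \sigma / \sqrt{n} }

and the two-tailed p-value is 2 \Phi( -|z| ), where \Phi is the standard normal CDF. Note that the ztest() function below returns the one-tailed probability \Phi( -|z| ), which is why the code further down compares it against \alpha / 2 rather than \alpha.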

Here is a short function that will perform the z-test:

import numpy as np
import scipy.stats

def ztest( dp, mu=0.0, sigma=None, n=None, alpha=0.05, twotail=True ):
    '''
    Input : (dp)      iterable of sample data points, or a float giving an
                      estimated sample mean
            (mu)      float, hypothesized population mean
            (sigma)   float, standard deviation of the sample
            (n)       integer, number of data points
            (alpha)   float, significance level
            (twotail) boolean, perform a two-tailed test
    Output: (pv)      one-tailed p-value: the probability, assuming the null
                      hypothesis is true, of observing a test statistic at
                      least as extreme as the one observed; for a two-tailed
                      test, compare this value against alpha/2
    '''
    # in case the data points are a NumPy array
    if isinstance( dp, np.ndarray ):
        n = dp.size
        x = np.mean( dp )
    # in case the data points are a list
    elif isinstance( dp, list ):
        n = len( dp )
        x = np.mean( dp )
    # in case we provide an estimated mean value
    else:
        x = dp
    # for a two-tailed test, the rejection threshold for the returned
    # one-tailed p-value is alpha/2
    if twotail:
        alpha /= 2.0
    # sample standard deviation, if not supplied
    if sigma is None:
        sigma = np.std( dp )
    # this prevents division by zero
    if sigma == 0.0:
        sigma = 0.01
    # z-score of the sample mean
    z = ( x - mu ) / float( sigma )
    # check n
    if n is None:
        print( 'Need a sample size, n.' )
        return None
    # test statistic
    ts = z * np.sqrt( n )
    # one-tailed p-value: area of the standard normal beyond |ts|
    pv = scipy.stats.norm( 0, 1 ).cdf( -abs( ts ) )
    return pv
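
For example, assuming the imports above, a quick sanity check of ztest() might look like this (the exact p-values will vary from run to run):

# sample of 50 points drawn from a normal with mean 0.5; testing against a
# hypothesized mean of 0.0 should usually give a very small one-tailed p-value
sample = scipy.stats.norm( 0.5, 1.0 ).rvs( 50 )
print( ztest( sample, mu=0.0 ) )
# a sample drawn from the null distribution itself typically gives a much
# larger p-value
null_sample = scipy.stats.norm( 0.0, 1.0 ).rvs( 50 )
print( ztest( null_sample, mu=0.0 ) )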

Statistical Power

We can evaluate the power of this test for a fixed standard deviation of 1.0, and different sample sizes and effect sizes. We will see that as we increase the sample size or the effect size, we increase the power of the test, the probability that it will detect an effect of a given size. We would also see an increase in power if we could reduce the standard deviation, but this is almost always out of the investigator’s control, so we will consider it fixed. We will look at effects of size 0.1 to 0.6, in increments of 0.1, and we will consider sample sizes from 5 to 100, in increments of 5. We will perform 500 trials per effect size and sample size, and calculate the empirical statistical power from that.

# dictionary to hold the results
power = dict()
# number of trials
ntrials = 500
# effect sizes: [ 0.1, 0.2, ..., 0.6 ]
effects = np.linspace( 0.1, 0.6, 6, endpoint=True )
for effect in effects:
    # temporary list
    tmp = list()
    # normal distribution with mean=effect and std=1, from which samples are drawn
    dp = scipy.stats.norm(effect,1)
    # number of points in a sample
    for npoints in range( 5, 105, 5 ):
        # list of p-values
        outcome = np.array( [ ztest( dp.rvs( npoints ), mu = 0.0 ) for i in range( ntrials ) ] )
        # number of one-tailed p-values below alpha/2 = 0.025, i.e. two-sided rejections at alpha = 0.05
        reject_null = np.where( outcome <= 0.025 )[0].size
        # sample size, and probability of rejecting the null hypothesis that the means are equal
        tmp.append( [ npoints, reject_null / float( ntrials ) ] )
    # add these results to a dictionary, indexed by effect size
    power[(effect,1)] = np.array( tmp )
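
As a cross-check on these empirical numbers, the power of a two-sided z-test with known \sigma has a simple closed form. The helper below (ztest_power() is not part of the code above; it is just a sketch for comparison) evaluates it:

import numpy as np
import scipy.stats

# theoretical power of a two-sided z-test with known sigma
def ztest_power( effect, n, sigma=1.0, alpha=0.05 ):
    # critical value of the two-sided test at significance level alpha
    zcrit = scipy.stats.norm( 0, 1 ).ppf( 1.0 - alpha / 2.0 )
    # standardized shift of the test statistic under the alternative
    shift = effect * np.sqrt( n ) / sigma
    # probability of landing in either rejection region
    return ( scipy.stats.norm( 0, 1 ).cdf( -zcrit + shift ) +
             scipy.stats.norm( 0, 1 ).cdf( -zcrit - shift ) )

# e.g. an effect of 0.3 with 50 data points gives a power of roughly 0.56,
# which should sit close to the corresponding empirical estimate above
print( ztest_power( 0.3, 50 ) )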

For completeness, this is the code I used to generate the plots. The only interesting thing here is the final savefig() call, whose bbox arguments keep the legend that I put on the side of the graph from being cut off.

# assumes pylab-style imports, e.g. from pylab import *
# organize the tests by effect size
keys = sorted( power.keys() )
# perform the plotting
for k in keys:
    plot( power[k][:,0], power[k][:,1], marker='.', label=str(k[0]) )
# limit the x axis from 5 to 100
xlim(5,100)
# limit the y axis to [0,1]
ylim( 0, 1 )
# plot the legend
lgd = legend( loc=(1.05,0.2), title="Effect Size" )
# put a grid on it!
grid()
# title and x and y labels
title("Empirical Statistical Power for\nDifferent Effect Sizes, with $\\sigma=1$")
xlabel("Sample Size") ; ylabel("Empirical Statistical Power")
# save the figure
savefig( 'statistical_power.png', format='png', dpi=200,
         bbox_extra_artists=(lgd,), bbox_inches='tight' )

Type-I Error Rate and the Significance Level

Traditionally, practitioners of hypothesis tests have been very worried about Type-II errors, or false negatives. The idea is that it is more important to minimize false negatives than false positives when a positive result constitutes a clear and present danger; in other words, placing more emphasis on minimizing the false negative error rate errs on the side of caution. This makes sense in testing for cancer: a false positive is startling, but a false negative is prematurely fatal. In other situations it is also important to minimize the false positive error rate, for instance when it is costly or time consuming to post-process (potentially false) positives. The false positive, or Type-I, error rate is exactly what the significance level controls: the next bit of code demonstrates empirically that when the null hypothesis is actually true, the fraction of tests that (wrongly) reject it approximates \alpha.

# dictionary to hold the results
typeI = dict()
# number of trials
ntrials = 1000
# significance levels: [ 0.025, 0.05, 0.1 ]
alphas = [ 0.025, 0.05, 0.1 ]
for a in alphas:
    # temporary list
    tmp = list()
    # null distribution: normal with mean=0 and std=1, so the null hypothesis is true
    dp = scipy.stats.norm(0,1)
    # number of points in a sample
    for npoints in range( 5, 105, 5 ):
        # list of p-values
        outcome = np.array( [ ztest( dp.rvs( npoints ), mu = 0.0, alpha = a ) for i in range( ntrials ) ] )
        # number of one-tailed p-values below a/2, i.e. false rejections of the true null at the two-sided level a
        reject_null = np.where( outcome <= a/2.0 )[0].size
        # sample size, and probability of rejecting the null hypothesis that the means are equal
        tmp.append( [ npoints, reject_null / float( ntrials ) ] )
    # add these results to a dictionary, indexed by significance level
    typeI[(a,1)] = np.array( tmp )
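
A quick way to summarize the result (a small sketch using the typeI dictionary built above) is to average the empirical false positive rate over all sample sizes for each significance level; each average should land close to the corresponding \alpha:

# mean empirical false positive rate across all sample sizes, per alpha;
# each value should be close to the corresponding significance level
for ( a, s ) in sorted( typeI.keys() ):
    print( a, typeI[( a, s )][:,1].mean() )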
