Computing Principal Components in Python

In this post I will walk through the computation of principal components from a data set using Python. A number of languages and modules implement principal components analysis (PCA) but some implementations can vary slightly which may lead to confusion if you are trying to follow someone else’s code, or you are using multiple languages. Perhaps more importantly, as a data analyst you should at all costs avoid using a tool if you do not understand how it works. I will use data from The Handbook of Small Data Sets to illustrate this example. The data sets will be found in a zipped directory on site linked above.

Data

The data was collected from 41 major US cities and published in Biometry 2nd Ed. by Sokal and Rohlf in 1981.

Variable Meaning
Y Sulphur dioxide content of air in micrograms per cubic meter
X1 Average annual temperature in degrees Fahrenheit
X2 Number of manufacturers employing 20 or more workers
X3 Population in 1970 census, in thousands
X4 Average annual wind speed in miles per hour
X5 Average annual percipitation in inches
X6 Average number of days with percipitation per year

Once we have downloaded the data from The Handbook of Small Data Sets, we can extract it and parse it as follows,

import numpy as np
p = open( 'handetal/USAIR.DAT', 'r' ).readlines()
p = [ i.strip().split('\t') for i in p ]
X = np.array( [ [ float(j) for j in i ] for i in p ] )

Since we will be working with matrices, try keep track of the dimensions of each matrix. This helps when you are debugging. The initial data has 41 observations over 7 variables. We’ll keep track of these dimensions and use them later by storing the number of observations, M, and the number of variables, N.

M, N = X.shape

Computation

The first step in PCA is to mean-center the data. This means taking each column and subtracting the mean of that column. This centers the data from each column about zero, and puts the data points on somewhat of a more equal footing. We’ll perform this operation using the nice idiom below.

# subtract the mean from each variable (column) in the data
mX = X - np.mean( X, axis=0 )

Next, we’ll calculate and inspect the covariance matrix.

cvr = np.cov( mX.T )
s = '{:10.2f} '*N
for i in range( N ):
    print s.format( *cvr[i] )
550.95 -73.56 8527.72 6711.99 3.18 15.00 229.93
-73.56 52.24 -773.97 -262.35 -3.61 32.86 -82.43
8527.72 -773.97 317502.89 311718.81 191.55 -215.02 1968.96
6711.99 -262.35 311718.81 335371.89 175.93 -178.05 645.99
3.18 -3.61 191.55 175.93 2.04 -0.22 6.21
15.00 32.86 -215.02 -178.05 -0.22 138.57 154.79
229.93 -82.43 1968.96 645.99 6.21 154.79 702.59

Now that we have a covariance matrix, we can calculate the eigenvalues and eigenvectors. Next, we’ll sort the eigenvectors according to the magnitude of the eigenvalues. We do this because the eigenvector corresponding to the largest eigenvalue also corresponds to the first principal component, and this relationship holds for the rest of the principal components.

# calculate the eigenvalues and eigenvectors
eigenvalue, eigenvector = np.linalg.eig( cvr )

# sort the eigenvectors according to the eigenvalues
srt = np.argsort( eigenvalue )[::-1]
eigenvector = np.matrix( eigenvector[:,srt] )
eigenvalue = eigenvalue[srt]

We can then inspect the eigenvalues numerically,

print s.format( *eigenvalue )
638471.96 14812.04 701.96 205.00 116.70 12.06 1.45

Or we can inspect them graphically,

semilogy( eigenvalue.real, '-o' )
title("Log-plot of Eigenvalues")
xlabel( "Eigenvalue Index" )
ylabel( "Eigenvalue Magnitude" )
grid() ; savefig( "eigenvalue_logplot.png", fmt="png", dpi=200 )

Next, we pick the number of components we want to use, which will be represented here by the variable ncomp=3. Then we take the first ncomp eigenvectors as our feature vector.

# number of principal components
ncomp = 3

# build a feature vector 
# from the first "ncomp" eigenvectors
fv = eigenvector[:,:ncomp]

Multiplying the feature vector, fv, by the mean-centered data, mX, we obtain the transformed data td, which is short for, “ta-daa!”. The transformed data will have the original number of M=41 observations, but it will only have the ncomp=3 variables, which will be fewer than the N=7 variables that we started with.

# transform the data using the feature vector
td = fv.T * mX.T

We may wish to stop here and used the reduced data set td, or we may reproject the reduced dimensional data onto the original N=7 dimensions. After we do that, we can reincorporate the mean that we subtracted from the original data at the outset.

# reproject the transformed data onto the original number of dimensions
rd = fv * td

# reincorporate the mean that was subtracted at the outset
rd = np.array( rd.T + np.mean( X, axis=0 ) )

If we had set ncomp=7 then we would get back exactly the data that we had started with. When we use feature vectors composed of fewer eigenvectors we reduce the variance observed in the data, but we also lose potentially valuable information.

2 thoughts on “Computing Principal Components in Python”

  1. If I am using standardized data, specifically, Z-scores, instead of simply centered data, how would I reincorporate the mean and the standard deviation divisor at the last step here?

    1. If your data is already standardized, then the mean should be zero or darn close, and adding it to the result of the PCA should not matter.

Leave a Reply

Your email address will not be published. Required fields are marked *