In this post I will walk through the computation of principal components from a data set using Python. A number of languages and modules implement principal components analysis (PCA), but implementations can vary slightly, which may lead to confusion if you are trying to follow someone else's code or you are using multiple languages. Perhaps more importantly, as a data analyst you should avoid, at all costs, using a tool you do not understand. I will use data from The Handbook of Small Data Sets to illustrate this example. The data sets can be found in a zipped directory on the site linked above.
Data
The data was collected from 41 major US cities and published in Biometry 2nd Ed. by Sokal and Rohlf in 1981.
Variable | Meaning
---|---
Y | Sulphur dioxide content of air, in micrograms per cubic meter
X1 | Average annual temperature, in degrees Fahrenheit
X2 | Number of manufacturers employing 20 or more workers
X3 | Population (1970 census), in thousands
X4 | Average annual wind speed, in miles per hour
X5 | Average annual precipitation, in inches
X6 | Average number of days with precipitation per year
Once we have downloaded the data from The Handbook of Small Data Sets, we can extract it and parse it as follows,
import numpy as np

# read the raw data file and split each line on tabs
p = open( 'handetal/USAIR.DAT', 'r' ).readlines()
p = [ i.strip().split('\t') for i in p ]

# convert the strings to floats and stack into a 41 x 7 array
X = np.array( [ [ float(j) for j in i ] for i in p ] )
Since we will be working with matrices, try to keep track of the dimensions of each matrix; this helps when you are debugging. The initial data has 41 observations over 7 variables. We'll keep track of these dimensions and use them later by storing the number of observations, M, and the number of variables, N.
# number of observations (rows) and variables (columns)
M, N = X.shape
Computation
The first step in PCA is to mean-center the data: take each column and subtract that column's mean from it. This centers the data in each column about zero and puts the variables on a somewhat more equal footing. We'll perform this operation using the idiom below.
# subtract the mean from each variable (column) in the data
mX = X - np.mean( X, axis=0 )
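As a quick sanity check (my addition, not part of the original walkthrough), each column of mX should now have a numerically zero mean:

# every column mean of the centered data should be (numerically) zero
print( np.allclose( np.mean( mX, axis=0 ), 0.0 ) )   # expect: True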
Next, we’ll calculate and inspect the covariance matrix.
# covariance matrix of the mean-centered data
cvr = np.cov( mX.T )

# print the covariance matrix, one formatted row per line
s = '{:10.2f} ' * N
for i in range( N ):
    print( s.format( *cvr[i] ) )
    550.95     -73.56    8527.72    6711.99       3.18      15.00     229.93
    -73.56      52.24    -773.97    -262.35      -3.61      32.86     -82.43
   8527.72    -773.97  317502.89  311718.81     191.55    -215.02    1968.96
   6711.99    -262.35  311718.81  335371.89     175.93    -178.05     645.99
      3.18      -3.61     191.55     175.93       2.04      -0.22       6.21
     15.00      32.86    -215.02    -178.05      -0.22     138.57     154.79
    229.93     -82.43    1968.96     645.99       6.21     154.79     702.59
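As an aside, np.cov called this way treats each row of mX.T as one variable and uses the unbiased M - 1 denominator, so the same matrix can be built directly from the centered data. A minimal check (my addition, using the mX, M, and cvr defined above):

# np.cov uses the unbiased (M - 1) denominator; verify against the definition
cvr_manual = mX.T.dot( mX ) / ( M - 1 )
print( np.allclose( cvr, cvr_manual ) )   # expect: True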
Now that we have a covariance matrix, we can calculate the eigenvalues and eigenvectors. Next, we’ll sort the eigenvectors according to the magnitude of the eigenvalues. We do this because the eigenvector corresponding to the largest eigenvalue also corresponds to the first principal component, and this relationship holds for the rest of the principal components.
# calculate the eigenvalues and eigenvectors of the covariance matrix
eigenvalue, eigenvector = np.linalg.eig( cvr )

# sort the eigenvectors according to the eigenvalues, largest first
srt = np.argsort( eigenvalue )[::-1]
eigenvector = np.matrix( eigenvector[:,srt] )
eigenvalue = eigenvalue[srt]
We can then inspect the eigenvalues numerically,
print( s.format( *eigenvalue ) )
 638471.96   14812.04     701.96     205.00     116.70      12.06       1.45
Or we can inspect them graphically,
# log-scale plot of the sorted eigenvalues
import matplotlib.pyplot as plt

plt.semilogy( eigenvalue.real, '-o' )
plt.title( "Log-plot of Eigenvalues" )
plt.xlabel( "Eigenvalue Index" )
plt.ylabel( "Eigenvalue Magnitude" )
plt.grid()
plt.savefig( "eigenvalue_logplot.png", dpi=200 )
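Another way to judge how many components are worth keeping (not shown in the original code) is the fraction of total variance each eigenvalue accounts for, since the eigenvalues of the covariance matrix sum to the total variance:

# fraction of the total variance carried by each principal component
var_frac = eigenvalue.real / np.sum( eigenvalue.real )
print( ( '{:8.4f} ' * N ).format( *var_frac ) )

Judging from the eigenvalues printed above, the first two components should account for roughly 99% of the total variance, which is consistent with the sharp drop in the log plot.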
Next, we pick the number of components we want to use, which will be represented here by the variable ncomp=3. Then we take the first ncomp eigenvectors as our feature vector.
# number of principal components
ncomp = 3

# build a feature vector from the first "ncomp" eigenvectors
fv = eigenvector[:,:ncomp]
Multiplying the feature vector, fv, by the mean-centered data, mX, we obtain the transformed data td, which is short for "ta-daa!". The transformed data still has the original M=41 observations, but only the ncomp=3 variables, which is fewer than the N=7 variables we started with.
# transform the data using the feature vector
td = fv.T * mX.T
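If you want to confirm the bookkeeping (my addition): fv is N x ncomp and mX.T is N x M, so td comes out ncomp x M.

# fv is N x ncomp, mX.T is N x M, so td is ncomp x M
print( fv.shape, td.shape )   # expect: (7, 3) (3, 41)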
We may wish to stop here and use the reduced data set td, or we may reproject the reduced-dimensional data onto the original N=7 dimensions. After we do that, we can reincorporate the mean that we subtracted from the original data at the outset.
# reproject the transformed data onto the original number of dimensions
rd = fv * td

# reincorporate the mean that was subtracted at the outset
rd = np.array( rd.T + np.mean( X, axis=0 ) )
If we had set ncomp=7, then we would get back exactly the data that we started with. When we use feature vectors composed of fewer eigenvectors, we reduce the variance observed in the data, but we also lose potentially valuable information.
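We can make that claim concrete (this check is my addition): rebuilding the feature vector from all seven eigenvectors and repeating the transform and reprojection should reproduce X to within floating-point error.

# with all N components the reconstruction is exact (up to round-off)
fv_full = eigenvector                      # keep all N = 7 eigenvectors
td_full = fv_full.T * mX.T                 # transform
rd_full = fv_full * td_full                # reproject
rd_full = np.array( rd_full.T + np.mean( X, axis=0 ) )
print( np.allclose( rd_full, X ) )         # expect: True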
If I am using standardized data, specifically, Z-scores, instead of simply centered data, how would I reincorporate the mean and the standard deviation divisor at the last step here?
If your data is already standardized, then its mean is zero (or numerically very close to it), so adding that mean back to the result of the PCA does nothing. To return to the original units, reverse the standardization in the opposite order: multiply each reprojected column by the standard deviation you divided by, and then add back that column's original mean.
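A minimal sketch of the full round trip with standardization (this goes beyond the original post; the variable names mu, sd, zX, and the _z suffixes are my own):

# standardize: center, then divide by the per-column sample standard deviation
mu = np.mean( X, axis=0 )
sd = np.std( X, axis=0, ddof=1 )
zX = ( X - mu ) / sd

# same PCA steps as above, but on the z-scores
cvr_z = np.cov( zX.T )
w, v = np.linalg.eig( cvr_z )
srt_z = np.argsort( w )[::-1]
v = np.matrix( v[:,srt_z] )
fv_z = v[:,:ncomp]
td_z = fv_z.T * zX.T

# reproject, then undo the standardization in reverse order:
# multiply by the standard deviation first, then add the mean back
rd_z = np.array( fv_z * td_z ).T * sd + mu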