In this post I’ll provide an example of exploratory factor analysis in R. We will use the
psych package by William Revelle. Factor analysis seeks to find latent variables, or factors, by looking at the correlation matrix of the observed variables. This technique can be used for dimensionality reduction, or for better insight into the data. As with any technique, this will not work in all scenarios. Firstly, latent variables are not always present, and secondly, it will miss existing latent variables if they are not apparent from the observed data. More information can be found in the manual distributed with the package.
Installing R Packages
You can usually install an R package in two steps: call install.packages() to download and install it, then call library() to attach it to the current session.
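For the psych package, the two steps might look like the sketch below. It assumes a CRAN mirror is reachable; the requireNamespace() guard is an addition of mine so that the install is skipped when the package is already present.

```r
# Install psych from CRAN only if it is not already available.
# Assumes network access to a CRAN mirror.
if ( !requireNamespace( "psych", quietly=TRUE ) ) {
    install.packages( "psych" )
}

# Attach the package so its functions and data sets can be used by name.
library( psych )
```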
This will bring up a dialog where you can pick the nearest server to install the package from, and you’ll be on your way. If you are behind a firewall that does not allow such shenanigans, then you can download the package and install the binary manually. First, Google something like "R package psych", which should lead you to the package’s CRAN page. Then, under Downloads, pick the appropriate download. Finally, in the R terminal you call the same functions, but you pass the path to the downloaded file along with some additional arguments, such as repos=NULL to indicate a local file. On Windows, for instance, this would be the path to the downloaded .zip file.
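A sketch of such a manual install on Windows, under stated assumptions: the file path below is hypothetical, so substitute the location where you saved the downloaded .zip binary (real CRAN binaries also carry a version number in the file name). Setting repos=NULL tells install.packages() to install from a local file rather than a repository, and type="win.binary" says the file is a Windows binary.

```r
# Hypothetical location of the downloaded Windows binary; change to your own path.
pkgFile <- "C:/Users/me/Downloads/psych.zip"

# Only attempt the install if the file is really there.
if ( file.exists( pkgFile ) ) {
    # repos=NULL: install from the local file, not from a CRAN mirror.
    install.packages( pkgFile, repos=NULL, type="win.binary" )
    library( psych )
} else {
    message( "File not found: ", pkgFile )
}
```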
We will be using the bfi data, which has 28 variables and 1000 observations. The data are the results of a study of 1000 individuals who were asked to rate themselves on a Likert scale of 1 to 6. In the analysis of the data we should be aware of two issues: Likert scales are known to produce skewed results because of testing fatigue and general boredom, and people are not always very accurate at self-reporting. Below is a listing of the questions in the survey. The last three variables in the data frame report age, education and gender.
| Item | Statement |
|------|-----------|
| A1 | Indifferent to others |
| A2 | Inquire about others’ well-being |
| A3 | Know how to comfort others |
| A5 | Make people feel at ease |
| C1 | Exacting in my work |
| C2 | Continue until everything is perfect |
| C3 | Do things according to a plan |
| C4 | Do things in a half-way manner |
| C5 | Waste my time |
| E1 | Don’t talk a lot |
| E2 | Find it difficult to approach others |
| E3 | Know how to captivate people |
| E4 | Make friends easily |
| N1 | Get angry easily |
| N2 | Get irritated easily |
| N3 | Have frequent mood swings |
| N4 | Often feel blue |
| O1 | Am full of ideas |
| O2 | Avoid imposing my will on others |
| O3 | Carry the conversation to a higher level |
| O4 | Spend time reflecting on things |
| O5 | Will not probe deeply into a subject |
We can load the data and inspect the first few rows using the following commands. The function
head(), showing the first few rows, has a complement
tail() which shows the last few rows.
data( bfi )
head( bfi )
If we wanted to see the first three rows and first four columns we’d call
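One way to write that call, as a sketch: square brackets take a row selection first and a column selection second, so 1:3 picks rows 1 to 3 and 1:4 picks columns 1 to 4. It assumes that your version of the psych package ships the bfi data.

```r
library( psych )
data( bfi )

# Rows 1 to 3, columns 1 to 4 of the data frame.
firstBlock <- bfi[ 1:3, 1:4 ]
print( firstBlock )

# The same selection by column name (A1 to A4 are the first four columns of bfi).
print( bfi[ 1:3, c( "A1", "A2", "A3", "A4" ) ] )
```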
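To get an idea of the distribution of the answers for a given question or variable we can use the hist() function. The two calls below are equivalent; the breaks are placed at 0.5, 1.5, ..., 6.5 so that each answer from 1 to 6 gets its own bar.
hist( bfi[,'A3'], breaks=seq( 0.5, 6.5, by=1 ) )
hist( bfi$A3, breaks=seq( 0.5, 6.5, by=1 ) )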
Note that in R, using the dollar sign to access a variable in a data frame is syntactic sugar: bfi$A3 is roughly shorthand for bfi[['A3']]. The best practice (as far as I know) is to use the square brackets.
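A small base R illustration of one reason the brackets are often preferred, using a made-up data frame (toy and its column names are hypothetical, not part of bfi). The dollar sign allows partial matching of names, whereas double square brackets require an exact match by default.

```r
# Hypothetical toy data frame, used only for illustration.
toy <- data.frame( age=c( 21, 35, 48 ), score=c( 3, 5, 4 ) )

# $ accepts an unambiguous prefix of a column name (R warns about it).
byDollar <- suppressWarnings( toy$ag )

# [[ ]] matches names exactly, so the same prefix finds nothing.
byBracket <- toy[[ "ag" ]]

print( byDollar )   # the age column
print( byBracket )  # NULL

# With the full name, all three forms return the same vector.
print( identical( toy$age, toy[[ "age" ]] ) )
print( identical( toy$age, toy[ , "age" ] ) )
```

A typo in a column name therefore fails quietly with the dollar sign if it happens to be a prefix of a real name, while the bracket forms make the mistake visible.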
We can also view the summary statistics using the summary() function, or the describe() function from the psych package.
summary( bfi$age )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.00   20.00   26.00   28.78   35.00   86.00
Note that the minimum age is 3, which implies either that a three-year-old was administered a psychological survey, or that there was an error in the data collection.
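As a sketch of both summaries side by side, and of one way to count suspicious ages: the cutoff of 10 years below is an arbitrary assumption of mine, chosen only to flag ages that seem implausible for a self-report questionnaire.

```r
library( psych )
data( bfi )

# Base R summary: minimum, quartiles, mean and maximum.
print( summary( bfi$age ) )

# psych's describe() adds n, sd, skew, kurtosis and more.
print( describe( bfi$age ) )

# Count respondents younger than an arbitrary, assumed cutoff of 10 years.
ageCutoff <- 10
nSuspicious <- sum( bfi$age < ageCutoff, na.rm=TRUE )
print( nSuspicious )
```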
R has a built-in function for factor analysis called factanal().
out <- factanal( covmat=cor( bfi, use="complete.obs" ), factors=5, rotation="varimax" )
The fa() function from the psych package provides more information. By default it uses an oblique (oblimin) rotation, which requires the GPArotation package. Note that the number of factors is passed to fa() as nfactors.
library( GPArotation )
corMatrix <- cor( bfi, use="complete.obs" )
out <- fa( r=corMatrix, nfactors=5 )
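A variation worth considering, offered as a sketch rather than as the analysis used in this post: the correlation matrix above includes gender, education and age alongside the 25 questionnaire items. The code below restricts the analysis to the items, assuming they occupy the first 25 columns of bfi, and passes the raw data to fa() so that it knows the number of observations. The results will differ somewhat from the output shown in this post.

```r
library( psych )
library( GPArotation )
data( bfi )

# Assumption: the 25 questionnaire items are the first 25 columns.
items <- bfi[ , 1:25 ]

# Five factors with the default oblique (oblimin) rotation.
# Given raw data, fa() computes the correlation matrix itself.
outItems <- fa( r=items, nfactors=5 )

# One row per item, one column per factor.
print( dim( outItems$loadings ) )
```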
To view all of the statistics type
out. Alternatively, we can hide the small factor loadings by using a threshold. Here, we print only the loadings that have an absolute value greater than 0.3.
print( out$loadings, cutoff=0.3 )
Loadings:
MR2 MR3 MR5 MR1 MR4
A1 -0.420
A2 0.620
A3 0.649
A4 0.460
A5 0.546
C1 0.533
C2 0.623
C3 0.567
C4 -0.670
C5 -0.582
E1 0.544
E2 0.650
E3 -0.384 0.333
E4 0.339 -0.554
E5 -0.398
N1 0.855
N2 0.818
N3 0.669
N4 0.410 0.437
N5 0.444
O1 0.544
O2 -0.457
O3 0.641
O4 0.347 0.365
O5 -0.519
gender
education
age

                 MR2   MR3   MR5   MR1   MR4
SS loadings    2.457 1.980 1.977 1.915 1.683
Proportion Var 0.088 0.071 0.071 0.068 0.060
Cumulative Var 0.088 0.158 0.229 0.297 0.358
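To make the effect of the cutoff concrete, here is a sketch that blanks out the small loadings by hand. It refits the model so that the block runs on its own; loadingMatrix and bigOnly are names I made up for illustration.

```r
library( psych )
library( GPArotation )
data( bfi )

corMatrix <- cor( bfi, use="complete.obs" )
out <- fa( r=corMatrix, nfactors=5 )

# Strip the "loadings" class to get a plain numeric matrix.
loadingMatrix <- unclass( out$loadings )

# Replace every loading whose absolute value is below 0.3 with NA.
bigOnly <- loadingMatrix
bigOnly[ abs( bigOnly ) < 0.3 ] <- NA
print( round( bigOnly, 3 ) )

# sort=TRUE reorders the rows so that variables with a loading above 0.5
# (in absolute value) are grouped by the factor they load on most strongly.
print( out$loadings, cutoff=0.3, sort=TRUE )
```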
From this output, we could say that the
MR2 factor corresponds to grumpiness, the
MR3 factor corresponds to diligence, the
MR5 factor corresponds to compassion or empathy, the
MR1 factor corresponds to introversion, and the
MR4 factor corresponds to creativity or charisma. The difficult part of factor analysis is interpreting the factors. Ideally we would use such an analysis to design another experiment that would try to observe the latent variables to determine whether or not they exist. However, this may not be possible, or feasible.
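One way to support this kind of labelling, sketched under the assumption that each variable belongs to the factor on which it has the largest absolute loading. The names loadingMatrix and dominant are mine. Note that every variable gets assigned somewhere, even gender, education and age, whose loadings are all small.

```r
library( psych )
library( GPArotation )
data( bfi )

corMatrix <- cor( bfi, use="complete.obs" )
out <- fa( r=corMatrix, nfactors=5 )
loadingMatrix <- unclass( out$loadings )

# For each variable, find the factor with the largest absolute loading.
strongest <- apply( abs( loadingMatrix ), 1, which.max )
dominant <- colnames( loadingMatrix )[ strongest ]

# List the variables grouped by their dominant factor.
print( split( rownames( loadingMatrix ), dominant ) )
```

Reading the survey statements for each group side by side is then a reasonable starting point for naming the factor.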
We can visualize the factors by calling the function
fa.diagram( out ). The square boxes are the observed variables, and the ovals are the unobserved factors. The straight arrows are the loadings, which measure how strongly each observed variable is related to a factor. The curved arrows are the correlations between the factors. If no curved arrow is present, then the correlation between those factors is small.
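A sketch of the diagram call together with a look at the numbers behind the curved arrows. It refits the model so that the block runs on its own. The cut argument of fa.diagram() sets the threshold below which loadings are left out of the picture, and out$Phi holds the factor correlation matrix that fa() returns for an oblique rotation.

```r
library( psych )
library( GPArotation )
data( bfi )

corMatrix <- cor( bfi, use="complete.obs" )
out <- fa( r=corMatrix, nfactors=5 )

# Draw the path diagram; loadings below cut (in absolute value) are omitted.
fa.diagram( out, cut=0.3 )

# Correlations between the five factors, rounded for readability.
print( round( out$Phi, 2 ) )
```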