In this post I’ll provide an example of exploratory factor analysis in R, using the `psych` package by William Revelle. Factor analysis seeks to find latent variables, or factors, by looking at the correlation matrix of the observed variables. The technique can be used for dimensionality reduction or to gain better insight into the data. As with any technique, it will not work in all scenarios: latent variables are not always present, and existing latent variables will be missed if they are not apparent from the observed data. More information can be found in the manual distributed with the package.
Installing R Packages
You can usually get an R package in two steps: install it by calling `install.packages()`, and then attach it to the current session by calling `library()`, as follows,
install.packages("psych")
library(psych)
This will bring up a dialog where you can pick the nearest server to install the package from, and you’ll be on your way. If you are behind a firewall that does not allow such shenanigans, you can Google the package, download it, and install the binary manually. First, Google something like R package psych, which should lead you to the package’s page. Then, under Downloads, pick the appropriate download. Finally, in the R terminal, call the same functions, but with the path to the downloaded file and some additional arguments. For instance, on Windows you’d probably do something like this,
install.packages("C:\\Users\\condor\\Downloads\\psych_1.4.3.zip", repos = NULL, type = "win.binary")
library(psych)
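If you rerun a script often, it can be convenient to install a package only when it is missing. The sketch below shows one way to do that. Note that `ensure_package()` is a made-up helper name for illustration, not something provided by R or by `psych`.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can be loaded now.
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call installs nothing and returns TRUE.
ok <- ensure_package("stats")
ok
```

For this post you would call `ensure_package("psych")` once, followed by `library(psych)`.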
Data
We will be using the `bfi` data, which has 28 variables, and 1000 observations. The data are the results of a study of 1000 individuals who were asked to rate themselves on a Likert scale of 1 to 6. In the analysis of the data we should be aware of two issues: Likert scales are known to produce skewed results because of testing fatigue and general boredom, and people are not always very accurate at self-reporting. Below is a listing of the questions in the survey. The last three variables in the data frame report gender, education and age.
Variable | Interpretation |
---|---|
A1 | Indifferent to others |
A2 | Inquire about others’ well-being |
A3 | Know how to comfort others |
A4 | Love children |
A5 | Make people feel at ease |
C1 | Exacting in my work |
C2 | Continue until everything is perfect |
C3 | Do things according to a plan |
C4 | Do things in a half-way manner |
C5 | Waste my time |
E1 | Don’t talk a lot |
E2 | Find it difficult to approach others |
E3 | Know how to captivate people |
E4 | Make friends easily |
E5 | Take charge |
N1 | Get angry easily |
N2 | Get irritated easily |
N3 | Have frequent mood swings |
N4 | Often feel blue |
N5 | Panic easily |
O1 | Am full of ideas |
O2 | Avoid imposing my will on others |
O3 | Carry the conversation to a higher level |
O4 | Spend time reflecting on things |
O5 | Will not probe deeply into a subject |
We can load the data and inspect the first few rows using the following commands. The function `head()`, which shows the first few rows, has a complement, `tail()`, which shows the last few rows.
data(bfi)
head(bfi)
If we wanted to see the first three rows and first four columns we’d call
bfi[1:3,1:4]
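The same inspection calls can be tried on any data frame. Here is a minimal sketch on a toy data frame with made-up values (the name `toy` and its contents are purely illustrative); running `dim(bfi)` in the same way is a quick check of the number of observations and variables in your installed copy of the data.

```r
# Toy stand-in for bfi: 6 respondents, 4 items scored 1 to 6 (made-up values).
toy <- data.frame(
  A1 = c(2, 1, 5, 4, 2, 6),
  A2 = c(4, 4, 4, 4, 3, 6),
  A3 = c(3, 5, 5, 6, 3, 5),
  A4 = c(4, 2, 4, 5, 4, 6)
)

dim(toy)           # number of rows, then number of columns
head(toy, n = 2)   # first two rows
tail(toy, n = 2)   # last two rows

# Rows 1 to 3 and columns 1 to 2, using [rows, columns] indexing.
first_block <- toy[1:3, 1:2]
first_block
```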
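To get an idea of the distribution of the answers for a given question or variable we can use the `hist()` function. The two calls below are equivalent; they differ only in how the column is selected.
hist(bfi[, "A3"], breaks = seq(0.5, 6.5, by = 1))
hist(bfi$A3, breaks = seq(0.5, 6.5, by = 1))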
Note that in R, using the dollar sign to access a variable in a data frame is syntactic sugar, and the best practice (as far as I know) is to use the square brackets.
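One concrete reason to prefer brackets is that `$` completes abbreviated column names, whereas `[[` matches names exactly by default. A small sketch on a made-up data frame (the name `toy` and its values are illustrative only):

```r
# Made-up data frame with one survey item and an age column.
toy <- data.frame(A3 = c(3, 5, 5, 6), age = c(21, 34, 19, 40))

# Three ways of pulling out the same column as a vector.
by_dollar  <- toy$A3
by_bracket <- toy[, "A3"]
by_double  <- toy[["A3"]]
identical(by_dollar, by_bracket)
identical(by_dollar, by_double)

# `$` partially matches names; `[[` does not.
partial_dollar <- toy$ag        # silently resolves to the "age" column
partial_double <- toy[["ag"]]   # NULL, because no column is named exactly "ag"
```

Brackets also accept the column name as a character string, which helps when the name is stored in a variable.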
We can also view the summary statistics using the `summary()` function, or the `describe()` function from the `psych` package.
summary( bfi$age )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.00   20.00   26.00   28.78   35.00   86.00
Note that the minimum age is 3, which implies either that a three-year-old was administered a psychological survey, or that there was an error in the data collection.
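A quick way to follow up on a value like this is to flag entries below some plausibility threshold before deciding what to do with them. The sketch below uses a made-up age vector and an assumed cutoff of 16. Neither the values nor the cutoff come from the `bfi` documentation.

```r
# Made-up ages standing in for bfi$age; NA marks a missing answer.
age <- c(3, 20, 26, 35, 86, NA, 19, 11)

min_plausible <- 16   # assumed cutoff, purely for illustration

# which() ignores NA, so only genuinely small values are flagged.
suspicious <- which(age < min_plausible)
age[suspicious]

# Count missing answers, then treat the implausible ages as missing too.
n_missing <- sum(is.na(age))
cleaned <- age
cleaned[suspicious] <- NA
summary(cleaned)
```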
Factor Analysis
R has a built-in function for factor analysis called `factanal()`.
out <- factanal( covmat=cor( bfi, use="complete.obs" ), factors=5, rotation="varimax" )
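To see what `factanal()` recovers when the latent structure is known, it can help to run it on simulated data first. The following is a minimal sketch, assuming six made-up items driven by two independent factors. The helper `make_item()` and all loadings are illustrative choices, not part of the `bfi` analysis.

```r
# Simulate 500 respondents whose six items depend on two independent
# latent factors (all names and numbers are made up for illustration).
set.seed(42)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)

# Each item = loading * factor + noise, scaled so its variance is about 1.
make_item <- function(factor_scores, loading) {
  loading * factor_scores + sqrt(1 - loading^2) * rnorm(length(factor_scores))
}

sim <- data.frame(
  x1 = make_item(f1, 0.8), x2 = make_item(f1, 0.7), x3 = make_item(f1, 0.6),
  x4 = make_item(f2, 0.8), x5 = make_item(f2, 0.7), x6 = make_item(f2, 0.6)
)

# Fit from the raw data ...
fit_raw <- factanal(sim, factors = 2, rotation = "varimax")

# ... or from a correlation matrix. Supplying n.obs lets factanal()
# also report its goodness-of-fit test.
fit_cor <- factanal(covmat = cor(sim), factors = 2, n.obs = n,
                    rotation = "varimax")

# x1 to x3 should load mainly on one factor and x4 to x6 on the other.
print(fit_raw$loadings, cutoff = 0.3)
```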
The `fa()` function from the `psych` package provides more information. Its default rotation (oblimin) requires the `GPArotation` package.
library(GPArotation)
corMatrix <- cor(bfi, use = "complete.obs")
out <- fa(r = corMatrix, nfactors = 5)
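The `use = "complete.obs"` argument deserves a note: it drops every row that has a missing value in any column before computing the correlations. The sketch below, on a tiny made-up data frame, contrasts it with `"pairwise.complete.obs"`, which drops rows separately for each pair of columns.

```r
# Made-up data with two missing values in different rows.
toy <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 1, 4, 3, 5),
  c = c(5, 3, NA, 1, 2)
)

# Only rows 1, 2 and 4 have no missing values.
n_complete <- sum(complete.cases(toy))

# Listwise deletion: every correlation uses the same three rows.
cor_complete <- cor(toy, use = "complete.obs")

# Pairwise deletion: each correlation uses all rows available for that pair,
# so cor(a, b) is based on rows 1 to 4.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_complete["a", "b"]
cor_pairwise["a", "b"]
```

With many missing answers the two choices can give noticeably different correlation matrices, so it is worth knowing which one feeds the factor analysis.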
To view all of the statistics, type `out`. Alternatively, we can hide the small factor loadings by using a threshold. Here, we print only the loadings that have an absolute value greater than 0.3.
print( out$loadings, cutoff=0.3 )
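Loadings:
Variable | MR2 | MR3 | MR5 | MR1 | MR4 |
---|---|---|---|---|---|
A1 | | | -0.420 | | |
A2 | | | 0.620 | | |
A3 | | | 0.649 | | |
A4 | | | 0.460 | | |
A5 | | | 0.546 | | |
C1 | | 0.533 | | | |
C2 | | 0.623 | | | |
C3 | | 0.567 | | | |
C4 | | -0.670 | | | |
C5 | | -0.582 | | | |
E1 | | | | 0.544 | |
E2 | | | | 0.650 | |
E3 | | | | -0.384 | 0.333 |
E4 | | | 0.339 | -0.554 | |
E5 | | | | -0.398 | |
N1 | 0.855 | | | | |
N2 | 0.818 | | | | |
N3 | 0.669 | | | | |
N4 | 0.410 | | | 0.437 | |
N5 | 0.444 | | | | |
O1 | | | | | 0.544 |
O2 | | | | | -0.457 |
O3 | | | | | 0.641 |
O4 | | | | 0.347 | 0.365 |
O5 | | | | | -0.519 |
gender | | | | | |
education | | | | | |
age | | | | | |

Statistic | MR2 | MR3 | MR5 | MR1 | MR4 |
---|---|---|---|---|---|
SS loadings | 2.457 | 1.980 | 1.977 | 1.915 | 1.683 |
Proportion Var | 0.088 | 0.071 | 0.071 | 0.068 | 0.060 |
Cumulative Var | 0.088 | 0.158 | 0.229 | 0.297 | 0.358 |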
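The cutoff only affects what is printed; the underlying loadings are unchanged. To make the idea concrete, here is a small sketch that applies a 0.3 threshold by hand to a made-up loading matrix (the values and the factor names `F1` and `F2` are illustrative, not taken from the output above) and lists the items that load on more than one factor.

```r
# Made-up loadings for four items on two factors.
L <- matrix(
  c( 0.62,  0.10,
    -0.42,  0.25,
     0.05,  0.53,
     0.34, -0.55),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A2", "A1", "C1", "E4"), c("F1", "F2"))
)

cutoff <- 0.3

# Blank out loadings whose absolute value does not exceed the cutoff.
masked <- L
masked[abs(L) <= cutoff] <- NA
masked

# Items with more than one loading above the cutoff are cross-loading items.
n_salient <- rowSums(abs(L) > cutoff)
cross_loading <- names(which(n_salient > 1))
cross_loading
```

In the post's output, items such as E3, E4, N4 and O4 would be flagged this way, since each shows two loadings above 0.3.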
From this output, we could say that the `MR2` factor corresponds to grumpiness, the `MR3` factor corresponds to diligence, the `MR5` factor corresponds to compassion or empathy, the `MR1` factor corresponds to introversion, and the `MR4` factor corresponds to creativity or charisma. The difficult part of factor analysis is interpreting the factors. Ideally we would use such an analysis to design another experiment that would try to observe the latent variables to determine whether or not they exist. However, this may not be possible, or feasible.
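The summary rows at the bottom of the loadings output can be checked by hand. "SS loadings" is the sum of squared loadings in each column, "Proportion Var" divides that by the number of variables (28 here, because the three demographic columns were included), and "Cumulative Var" is the running total. A short sketch using the SS loadings printed above:

```r
# SS loadings copied from the printed output.
ss_loadings <- c(MR2 = 2.457, MR3 = 1.980, MR5 = 1.977, MR1 = 1.915, MR4 = 1.683)

# Number of variables in the correlation matrix (25 items + 3 demographics).
n_vars <- 28

prop_var <- ss_loadings / n_vars
cum_var  <- cumsum(prop_var)

round(prop_var, 3)
round(cum_var, 3)
```

The five factors together account for roughly 36% of the total variance, which matches the last entry of the "Cumulative Var" row.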
Visualization
We can visualize the factors by calling the function `fa.diagram(out)`. The square boxes are the observed variables, and the ovals are the unobserved factors. The straight arrows are the loadings, the correlation between the factor and the observed variable(s). The curved arrows are the correlations between the factors. If no curved arrow is present between two factors, then the correlation between them is too small to be drawn.
Dear Connor,
Thank you for the very useful and quick guidelines on factor analysis in R. I have one theoretical question. If we use questionnaires in our study to measure, e.g., user satisfaction, and the loadings for certain questions are small (a typical threshold is 0.5), is it better to exclude these questions when calculating the overall user satisfaction score?
Also, in the above example, would it be better to exclude E3 and E5 when calculating Extroversion for this specific sample of individuals (even though the Big Five is a well-established questionnaire)?
Sorry if it is a stupid question, I am not a good statistician 🙂
Thank you in advance!