Exploratory Factor Analysis in R

In this post I’ll provide an example of exploratory factor analysis in R. We will use the psych package by William Revelle. Factor analysis seeks to find latent variables, or factors, by looking at the correlation matrix of the observed variables. This technique can be used for dimensionality reduction, or for better insight into the data. As with any technique, this will not work in all scenarios. Firstly, latent variables are not always present, and secondly, it will miss existing latent variables if they are not apparent from the observed data. More information can be found in the manual distributed with the package.
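To make the idea of a latent variable concrete, here is a minimal sketch using simulated data and base R's factanal() rather than psych. The sample size, weights, noise level and seed are arbitrary assumptions chosen for illustration, not values from any real study.

```r
# Simulate one latent variable that drives four observed variables.
set.seed(42)                       # arbitrary seed, for reproducibility
n <- 500                           # hypothetical sample size
latent <- rnorm(n)                 # the unobserved factor

# Each observed variable is the factor times a weight, plus independent noise.
weights <- c(x1 = 0.9, x2 = 0.8, x3 = 0.7, x4 = 0.6)   # hypothetical weights
observed <- sapply(weights, function(w) w * latent + rnorm(n, sd = 0.5))

# The shared factor shows up as positive correlations among x1..x4.
print(round(cor(observed), 2))

# A one-factor model fitted to these data loads on every variable.
fit <- factanal(observed, factors = 1)
print(fit$loadings)
```

The factor is never handed to factanal(); it is inferred only from the correlations among the observed columns, which is the same situation we are in with survey data.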

Installing R Packages

Using an R package usually takes two steps: install it with install.packages(), then attach it to the current session with library(), as follows:

install.packages("psych")
library(psych)

This will bring up a dialog where you can pick the nearest mirror to install the package from, and you’ll be on your way. If you are behind a firewall that does not allow such shenanigans, you can download the package and install it manually. First, search for something like R package psych, which should lead you to the package's page on CRAN. Then, under Downloads, pick the appropriate file for your platform. Finally, in the R terminal, call the same functions, but pass the path to the downloaded file along with some additional arguments. For instance, on Windows you’d probably do something like this:

install.packages("C:\\Users\\condor\\Downloads\\psych_1.4.3.zip", repos=NULL, type="win.binary")
library(psych)

Data

We will be using the bfi data, which has 28 variables and 2800 observations. The data are the results of a study in which 2800 individuals rated themselves on 25 statements using a Likert scale of 1 to 6. In the analysis of the data we should be aware of two issues: Likert scales are known to produce skewed results because of testing fatigue and general boredom, and people are not always very accurate at self-reporting. Below is a listing of the questions in the survey. The last three variables in the data frame report gender, education and age.

Variable Interpretation
A1 Indifferent to others
A2 Inquire about others’ well-being
A3 Know how to comfort others
A4 Love children
A5 Make people feel at ease
C1 Exacting in my work
C2 Continue until everything is perfect
C3 Do things according to a plan
C4 Do things in a half-way manner
C5 Waste my time
E1 Don’t talk a lot
E2 Find it difficult to approach others
E3 Know how to captivate people
E4 Make friends easily
E5 Take charge
N1 Get angry easily
N2 Get irritated easily
N3 Have frequent mood swings
N4 Often feel blue
N5 Panic easily
O1 Am full of ideas
O2 Avoid imposing my will on others
O3 Carry the conversation to a higher level
O4 Spend time reflecting on things
O5 Will not probe deeply into a subject

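We can load the data and inspect the first few rows using the following commands. The function head(), which shows the first few rows, has a complement tail(), which shows the last few rows. Note that recent releases of psych ship the bfi data in the companion psychTools package, so if data( bfi ) cannot find the data set you may need to install psychTools and call library(psychTools) first.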

data( bfi )
head( bfi )

If we wanted to see the first three rows and first four columns we’d call

bfi[1:3,1:4]

To get an idea of the distribution of the answers for a given question or variable we can use the hist() function.

hist( bfi[,'A3'], breaks=seq(0.5, 6.5, by=1) )
hist( bfi$A3, breaks=seq(0.5, 6.5, by=1) )

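Both calls draw the same histogram. Note that in R the dollar sign is syntactic sugar for accessing a variable in a data frame, and it allows partial matching of names. The best practice (as far as I know) is to use the square brackets, which match names exactly.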
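The difference between the two access styles can be seen in a short sketch. The data frame below is a tiny stand-in with made-up values, not the bfi data.

```r
# Tiny stand-in data frame; the values are made up for illustration.
df <- data.frame(age = c(20, 35, 41), A3 = c(4, 6, 5))

# All three forms return the same column when the name is spelled out in full.
same_single <- identical(df[, "A3"], df[["A3"]])
same_dollar <- identical(df[["A3"]], df$A3)
print(c(same_single, same_dollar))

# The dollar sign also accepts an unambiguous prefix of a column name,
# so a typo or abbreviation can silently pick up a column.
by_dollar_prefix <- df$ag
print(by_dollar_prefix)

# Double brackets match names exactly, so the same prefix finds nothing.
by_bracket_prefix <- df[["ag"]]
print(by_bracket_prefix)
```

This exact matching is one reason to prefer brackets in scripts, while the dollar sign remains convenient at the interactive prompt.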

We can also view the summary statistics using the summary() function, or the describe() function from the psych package.

summary( bfi$age )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00   20.00   26.00   28.78   35.00   86.00

Note that the minimum age is 3, which implies either that a three-year-old was administered a psychological survey, or that there was an error in the data collection.
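One way to handle such a value is to inspect the suspicious rows and, if appropriate, exclude them before the analysis. The sketch below uses a toy data frame whose ages are the five summary values printed above. The cutoff of 16 is purely an assumption for illustration and does not come from the original study.

```r
# Toy stand-in for bfi: the ages are the summary values printed above.
toy <- data.frame(age = c(3, 20, 26, 35, 86))

min_age <- 16   # hypothetical cutoff, not taken from the original study

# TRUE for rows with a recorded age at or above the cutoff.
is_plausible <- !is.na(toy[["age"]]) & toy[["age"]] >= min_age

# Rows to look at before deciding anything (too young, or age missing).
suspect <- toy[!is_plausible, , drop = FALSE]
print(suspect)

# Rows kept for the analysis.
cleaned <- toy[is_plausible, , drop = FALSE]
print(summary(cleaned[["age"]]))
```

With the real data the same lines apply after replacing toy with bfi. Whether to drop such rows at all is a judgment call that depends on how the data were collected.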

Factor Analysis

R has a built-in function for factor analysis called factanal().

out <- factanal( covmat=cor( bfi, use="complete.obs" ), factors=5, rotation="varimax" )

The fa() function from the psych package provides more information. Its default rotation (oblimin) requires the GPArotation package.

library( GPArotation )
corMatrix <- cor( bfi, use="complete.obs" )
out <- fa( r=corMatrix, nfactors=5 )

To view all of the statistics, type out at the prompt. Alternatively, we can hide the small loadings by using a threshold. Here, we print only the loadings that have an absolute value greater than 0.3.

print( out$loadings, cutoff=0.3 )
Loadings:
          MR2    MR3    MR5    MR1    MR4   
A1                      -0.420              
A2                       0.620              
A3                       0.649              
A4                       0.460              
A5                       0.546              
C1                0.533                     
C2                0.623                     
C3                0.567                     
C4               -0.670                     
C5               -0.582                     
E1                              0.544       
E2                              0.650       
E3                             -0.384  0.333
E4                       0.339 -0.554       
E5                             -0.398       
N1         0.855                            
N2         0.818                            
N3         0.669                            
N4         0.410                0.437       
N5         0.444                            
O1                                     0.544
O2                                    -0.457
O3                                     0.641
O4                              0.347  0.365
O5                                    -0.519
gender                                      
education                                   
age                                         

                 MR2   MR3   MR5   MR1   MR4
SS loadings    2.457 1.980 1.977 1.915 1.683
Proportion Var 0.088 0.071 0.071 0.068 0.060
Cumulative Var 0.088 0.158 0.229 0.297 0.358

From this output, we could say that the MR2 factor corresponds to grumpiness, the MR3 factor corresponds to diligence, the MR5 factor corresponds to compassion or empathy, the MR1 factor corresponds to introversion, and the MR4 factor corresponds to creativity or charisma. The difficult part of factor analysis is interpreting the factors. Ideally we would use such an analysis to design another experiment that would try to observe the latent variables to determine whether or not they exist. However, this may not be possible, or feasible.
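When naming factors it can help to list, for each factor, only the variables that load on it above the threshold. The following is a minimal sketch of that step using base R's factanal() on simulated data with two hypothetical factors. The variable names, sample size and noise level are assumptions for illustration, and the same approach should work on the loadings returned by fa().

```r
set.seed(1)                        # arbitrary seed, for reproducibility
n <- 400                           # hypothetical sample size
f1 <- rnorm(n)                     # first hypothetical latent factor
f2 <- rnorm(n)                     # second hypothetical latent factor

# Three indicators per factor; a3 and b3 are reverse-keyed items.
sim <- data.frame(
  a1 =  f1 + rnorm(n, sd = 0.6),
  a2 =  f1 + rnorm(n, sd = 0.6),
  a3 = -f1 + rnorm(n, sd = 0.6),
  b1 =  f2 + rnorm(n, sd = 0.6),
  b2 =  f2 + rnorm(n, sd = 0.6),
  b3 = -f2 + rnorm(n, sd = 0.6)
)

fit <- factanal(sim, factors = 2, rotation = "varimax")
L <- unclass(fit$loadings)         # plain numeric matrix: variables x factors

cutoff <- 0.3                      # same threshold as in the print() call above

# For each factor, keep the variables whose absolute loading exceeds the cutoff.
salient <- lapply(colnames(L), function(f) {
  keep <- abs(L[, f]) > cutoff
  round(L[keep, f], 2)
})
names(salient) <- colnames(L)
print(salient)
```

The sign of each retained loading matters for interpretation: a negative loading, as for the reverse-keyed items here, means the variable points in the opposite direction from the factor.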

Visualization

We can visualize the factors by calling the function fa.diagram( out ). The square boxes are the observed variables, and the ovals are the unobserved factors. The straight arrows are the loadings, which measure how strongly each observed variable is tied to its factor. The curved arrows are the correlations between the factors. If no curved arrow is present, then the correlation between those factors is small: by default fa.diagram() only draws values whose absolute value exceeds its cut argument of 0.3.
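The numbers behind the curved arrows can also be read directly from the fitted object. The sketch below repeats the model fit so that it runs on its own. It assumes that psych and GPArotation are installed, and that the bfi data are available either from psych itself (older releases) or from the psychTools package (newer releases).

```r
library(psych)
library(GPArotation)   # used by fa() for its default oblimin rotation

# Assumption: bfi ships with psych in older releases and with psychTools in newer ones.
pkg <- if (requireNamespace("psychTools", quietly = TRUE)) "psychTools" else "psych"
data(bfi, package = pkg)

corMatrix <- cor(bfi, use = "complete.obs")
out <- fa(r = corMatrix, nfactors = 5)

# Factor correlation matrix: the values that the curved arrows represent.
print(round(out$Phi, 2))

# Draw the diagram; values with absolute value below cut are not drawn.
fa.diagram(out, cut = 0.3)
```

Lowering cut shows weaker relationships at the price of a busier diagram.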

One thought on “Exploratory Factor Analysis in R”

  1. Dear Connor,

    Thank you for the very useful and quick guidelines on factor analysis in R. I have one theoretical question. If we use questionnaires in our study to measure e.g user satisfaction and for certain questions the loadings are small (typical threshold is 0.5), is it better to exclude these questions when calculating the overall user satisfaction score?
    .. Also in the above example, would it be better to exclude E3 and E5 when calculating Extroversion for this specific sample of individuals (even though big5 is well-established questionnaire) ?

    sorry if it is a stupid question, I am not a good statistician 🙂

    Thank you in advance!
