In this post I’ll provide an example of exploratory factor analysis in R. We will use the `psych` package by William Revelle. Factor analysis seeks to find latent variables, or factors, by looking at the correlation matrix of the observed variables. This technique can be used for dimensionality reduction, or for better insight into the data. As with any technique, it will not work in all scenarios. First, latent variables are not always present, and second, it will miss existing latent variables if they are not apparent from the observed data. More information can be found in the manual distributed with the package.

# Installing R Packages

You can usually install and load an R package in two steps: call `install.packages()` to download and install it, then call `library()` to attach it to the current session, as follows,

    install.packages("psych")
    library(psych)

This will bring up a dialog where you can pick the nearest server to install the package from, and you’ll be on your way. If you are behind a firewall that does not allow such shenanigans, then you can Google the package, download it, and then install the binary manually. First, Google something like *R package psych*, which should lead you to the package page. Then, under *Downloads*, pick the appropriate download. Finally, in the R terminal you call the same functions, but you use the path to the downloaded file and some additional arguments. For instance, on Windows you’d probably do something like this,

    install.packages("C:\\Users\\condor\\Downloads\\psych_1.4.3.zip", repos=NULL, type="win.binary")
    library(psych)
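If a script may run on machines where the package is already present, it can help to install only when needed. The following is a minimal sketch, not part of the workflow above; `ensure_package()` is a hypothetical helper name, and it is demonstrated on `stats`, which ships with R, so that nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can be loaded after the (possible) install.
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" is bundled with R, so this call returns TRUE without installing anything.
stats_available <- ensure_package("stats")
print(stats_available)
```

For this post the call would be `ensure_package("psych")`, followed by `library(psych)`.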

# Data

We will be using the `bfi` data, which has 28 variables, and 1000 observations. The data are the results of a study of 1000 individuals who were asked to rate themselves on a Likert scale of 1 to 6. In the analysis of the data we should be aware of two issues: Likert scales are known to produce skewed results because of testing fatigue and general boredom, and people are not always very accurate at self-reporting. Below is a listing of the questions in the survey. The last three variables in the data frame report age, education and gender.

Variable | Interpretation |
---|---|
A1 | Indifferent to others |
A2 | Inquire about others’ well-being |
A3 | Know how to comfort others |
A4 | Love children |
A5 | Make people feel at ease |
C1 | Exacting in my work |
C2 | Continue until everything is perfect |
C3 | Do things according to a plan |
C4 | Do things in a half-way manner |
C5 | Waste my time |
E1 | Don’t talk a lot |
E2 | Find it difficult to approach others |
E3 | Know how to captivate people |
E4 | Make friends easily |
E5 | Take charge |
N1 | Get angry easily |
N2 | Get irritated easily |
N3 | Have frequent mood swings |
N4 | Often feel blue |
N5 | Panic easily |
O1 | Am full of ideas |
O2 | Avoid imposing my will on others |
O3 | Carry the conversation to a higher level |
O4 | Spend time reflecting on things |
O5 | Will not probe deeply into a subject |

We can load the data and inspect the first few rows using the following commands. The function `head()`, showing the first few rows, has a complement `tail()` which shows the last few rows.

    data( bfi )
    head( bfi )

If we wanted to see the first three rows and first four columns we’d call

    bfi[1:3,1:4]

To get an idea of the distribution of the answers for a given question or variable we can use the `hist()` function.

    hist( bfi[,'A3'], breaks=seq(0.5, 6.5, by=1) )
    hist( bfi$A3, breaks=seq(0.5, 6.5, by=1) )

Note that in R using the dollar sign to access a column of a data frame is syntactic sugar; `bfi$A3` and `bfi[,'A3']` return the same vector. The best practice (as far as I know) is to use the square brackets.
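To see both points without loading `bfi`, here is a small sketch on simulated answers. The data frame `toy` and its 200 random responses are made up for illustration; only base R functions are used. It checks that the different ways of accessing a column agree, and that bin edges placed halfway between the integers give one bar per answer.

```r
set.seed(1)
# Hypothetical stand-in for one questionnaire item: 200 answers on a 1 to 6 scale.
toy <- data.frame(A3 = sample(1:6, size = 200, replace = TRUE))

# Three ways of pulling out the same column as a vector.
by_dollar <- toy$A3
by_double <- toy[["A3"]]
by_matrix <- toy[, "A3"]
same_column <- identical(by_dollar, by_double) && identical(by_double, by_matrix)

# Bin edges halfway between the integer answers, so each answer gets its own bar.
edges <- seq(0.5, 6.5, by = 1)
h <- hist(toy$A3, breaks = edges, plot = FALSE)

# The bar heights should match a plain frequency count of the answers.
bar_counts <- h$counts
tab_counts <- as.vector(table(factor(toy$A3, levels = 1:6)))

print(same_column)
print(rbind(bar_counts, tab_counts))
```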

We can also view the summary statistics using the `summary()` function, or the `describe()` function from the `psych` package.

    summary( bfi$age )

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       3.00   20.00   26.00   28.78   35.00   86.00

Note that the minimum age is 3, which implies either that a three-year-old was administered a psychological survey, or that there was an error in the data collection.
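One way to handle such a value is to mark it as missing before the analysis, so that functions called with `use="complete.obs"`, such as `cor()`, skip those rows. The sketch below is only an illustration: the `ages` vector reuses the five quartile values from the summary above rather than real rows, and the cutoff `min_plausible_age` is an assumption you would need to justify for your own study.

```r
# Illustrative ages only: the five quartile values from the summary, not real rows.
ages <- c(3, 20, 26, 35, 86)

# Assumed cutoff for a plausible respondent age; pick one that suits your study.
min_plausible_age <- 10

# Replace implausible ages with NA rather than dropping them silently.
clean_ages <- ifelse(ages < min_plausible_age, NA, ages)

print(summary(clean_ages))
n_flagged <- sum(is.na(clean_ages))
```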

# Factor Analysis

R has a built-in function for factor analysis called `factanal()`.

    out <- factanal( covmat=cor( bfi, use="complete.obs" ), factors=5, rotation="varimax" )
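To build some intuition for what `factanal()` returns, here is a minimal sketch on simulated data rather than on `bfi`. The two latent variables, the six observed variables and their weights are all made up for illustration; the only modelling function used is `factanal()` from the `stats` package that ships with R.

```r
set.seed(42)
n <- 500

# Two hypothetical latent variables that we pretend not to observe.
f1 <- rnorm(n)
f2 <- rnorm(n)

# Six observed variables: x1..x3 are driven by f1, y1..y3 by f2, plus noise.
sim <- data.frame(
  x1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  x2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  x3 = 0.6 * f1 + rnorm(n, sd = 0.8),
  y1 = 0.8 * f2 + rnorm(n, sd = 0.6),
  y2 = 0.7 * f2 + rnorm(n, sd = 0.7),
  y3 = 0.6 * f2 + rnorm(n, sd = 0.8)
)

# Same style of call as above: fit from the correlation matrix, varimax rotation.
fit <- factanal(covmat = cor(sim), factors = 2, n.obs = n, rotation = "varimax")
print(fit$loadings, cutoff = 0.3)

# For each observed variable, the index of the factor with the largest absolute loading.
dominant <- apply(abs(unclass(fit$loadings)), 1, which.max)
print(dominant)
```

With these settings `x1` to `x3` should load mainly on one factor and `y1` to `y3` on the other, although the order and sign of the factors are arbitrary.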

The `fa()` function from the `psych` package provides more information. This will require the `GPArotation` package.

    library( GPArotation )
    corMatrix <- cor( bfi, use="complete.obs" )
    out <- fa( r=corMatrix, nfactors=5 )

To view all of the statistics type `out`. Alternatively, we can hide the small loadings by using a threshold. Here, we print only the loadings that have an absolute value greater than 0.3.

    print( out$loadings, cutoff=0.3 )

    Loadings:
                  MR2    MR3    MR5    MR1    MR4
    A1                       -0.420
    A2                        0.620
    A3                        0.649
    A4                        0.460
    A5                        0.546
    C1                 0.533
    C2                 0.623
    C3                 0.567
    C4                -0.670
    C5                -0.582
    E1                               0.544
    E2                               0.650
    E3                              -0.384  0.333
    E4                        0.339 -0.554
    E5                              -0.398
    N1          0.855
    N2          0.818
    N3          0.669
    N4          0.410                0.437
    N5          0.444
    O1                                      0.544
    O2                                     -0.457
    O3                                      0.641
    O4                               0.347  0.365
    O5                                     -0.519
    gender
    education
    age

                     MR2   MR3   MR5   MR1   MR4
    SS loadings    2.457 1.980 1.977 1.915 1.683
    Proportion Var 0.088 0.071 0.071 0.068 0.060
    Cumulative Var 0.088 0.158 0.229 0.297 0.358

From this output, we could say that the `MR2` factor corresponds to grumpiness, the `MR3` factor corresponds to diligence, the `MR5` factor corresponds to compassion or empathy, the `MR1` factor corresponds to introversion, and the `MR4` factor corresponds to creativity or charisma. The difficult part of factor analysis is interpreting the factors. Ideally we would use such an analysis to design another experiment that would try to observe the latent variables to determine whether or not they exist. However, this may not be possible, or feasible.
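The summary rows at the bottom of the loadings output can be reproduced by hand, which helps when judging how much the factors explain. `SS loadings` is the sum of squared loadings for a factor, and `Proportion Var` is that sum divided by the number of observed variables. The sketch below copies the `SS loadings` values from the output above and assumes 28 variables, as stated in the data description.

```r
# SS loadings copied from the output above, in the printed factor order.
ss_loadings <- c(MR2 = 2.457, MR3 = 1.980, MR5 = 1.977, MR1 = 1.915, MR4 = 1.683)

# 25 questionnaire items plus gender, education and age.
n_variables <- 28

# Share of the total variance attributed to each factor, and the running total.
proportion_var <- ss_loadings / n_variables
cumulative_var <- cumsum(proportion_var)

print(round(rbind(proportion_var, cumulative_var), 3))
```

The last cumulative value says that the five factors together account for a little over a third of the variance in the 28 variables.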

# Visualization

We can visualize the factors by calling the function `fa.diagram( out )`. The square boxes are the observed variables, and the ovals are the unobserved factors. The straight arrows are the loadings, the correlations between a factor and the observed variables. The curved arrows are the correlations between the factors. If no curved arrow is present, then the correlation between those factors is small.

Dear Connor,

Thank you for the very useful and quick guidelines on factor analysis in R. I have one theoretical question. If we use questionnaires in our study to measure, e.g., user satisfaction, and for certain questions the loadings are small (a typical threshold is 0.5), is it better to exclude these questions when calculating the overall user satisfaction score?

Also, in the above example, would it be better to exclude E3 and E5 when calculating Extroversion for this specific sample of individuals (even though the Big 5 is a well-established questionnaire)?

Sorry if it is a stupid question, I am not a good statistician 🙂

Thank you in advance!