Modeling in R with the caret Package

In this post I’ll look at using the caret package in R to find the optimal tuning parameters for a given model. The caret package was developed by Max Kuhn, who also developed the C50 package for decision trees, which I talked about in a previous post.

Parameter Estimation

We can use caret::train() to determine the optimal parameters for a model. This function repeatedly resamples the data set, fits the model with each candidate parameter value, and uses the held-out samples to estimate how each value performs. When it is done, it reports the optimal parameters, an estimated accuracy, and an estimated standard deviation for the accuracy.

We may pass the feature matrix X and a vector of class labels y, or we may pass an R formula that uses the variable names together with the data set. In either case we then specify the machine learning algorithm. A complete list of the algorithms supported by the caret::train() function may be found in the caret documentation.
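The formula style is used in the example that follows. As a rough sketch of the other style, the snippet below passes the predictors and the class labels separately. The object name m_xy is only illustrative, and the column indices assume the standard layout of the built-in iris data, with the four measurements in columns 1 to 4.

```r
library(caret)
library(randomForest)
data(iris)
set.seed(318)

# Predictors (columns 1 to 4) and class labels are passed as separate arguments.
m_xy <- caret::train(x = iris[, 1:4], y = iris$Species, method = "rf")
m_xy
```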

Here is an example of using the caret::train() function on Edgar Anderson’s iris data set using the Random Forests algorithm.

library( caret )
library( randomForest )
data( iris )
set.seed(318)

m <- caret::train( Species ~ ., data=iris, method="rf" )
m

The cryptic Species ~ . tells R that we want to model the Species variable of the data set, using all of the other variables. (The tilde means using, and the dot matches all of the other variables.) The next argument tells the function to use the iris data set. The third argument, method="rf", selects the Random Forests algorithm, which caret runs through the randomForest package.
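To see what the dot stands for, base R can expand the formula against the data set. This is a small sketch using only base R functions; the variable names f, predictors and explicit are illustrative.

```r
data(iris)

f <- Species ~ .

# terms() resolves the dot against the columns of the data set.
predictors <- attr(terms(f, data = iris), "term.labels")
print(predictors)

# The same formula with every predictor written out explicitly.
explicit <- reformulate(predictors, response = "Species")
print(explicit)
```

Either form of the formula can be passed to caret::train(); the dot is just shorthand for listing every remaining column.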

This produces the following output:

Random Forest 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (25 reps) 

Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.938     0.907  0.0348       0.0523  
  3     0.941     0.910  0.0358       0.0537  
  4     0.935     0.902  0.0370       0.0554  

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 3. 
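
The fitted object can also be inspected programmatically, and the resampling scheme and candidate values can be set explicitly rather than left at their defaults. The following is a minimal sketch, not part of the analysis above: the choice of 10-fold cross-validation, the grid of mtry values, and the object names ctrl, grid and m_cv are assumptions made for illustration.

```r
library(caret)
library(randomForest)
data(iris)
set.seed(318)

# Assumed settings for illustration: 10-fold cross-validation instead of
# the default bootstrap, and an explicit grid of mtry values to try.
ctrl <- caret::trainControl(method = "cv", number = 10)
grid <- expand.grid(mtry = 2:4)

m_cv <- caret::train(Species ~ ., data = iris, method = "rf",
                     trControl = ctrl, tuneGrid = grid)

m_cv$results    # one row per candidate mtry, with accuracy and kappa estimates
m_cv$bestTune   # the mtry value that was selected

# Predict classes for a few rows with the final model.
predict(m_cv, newdata = head(iris))
```

Because the resampling is random, the estimates will differ from the table above, and from run to run unless the seed is fixed.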

Caveat

If you get this error when using caret,

Loading required namespace: e1071
Error in loadNamespace(name) : there is no package called ‘e1071’

then try installing e1071, a separate CRAN package of statistical and machine learning functions that caret loads for some of its computations. This fixed the problem for me.

install.packages("e1071")
