In this post I’ll walk through an example of using the C50 package for decision trees in R. The C5.0 algorithm it implements is an extension of the C4.5 algorithm. We’ll use some credit screening data from the UCI Machine Learning Repository that has been sanitized and anonymized beyond all recognition.
Data
The data set is interesting because it has a mixture of numeric values and strings. There are 690 records, so it’s not going to break your computer working with it. We’ll grab the data directly from the site using the URL in the code listing below.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data" crx <- read.table( file=url, header=FALSE, sep="," )
As an aside, we can write the data out to a file using,
write.table( crx, "crx.dat", quote=FALSE, sep="," )
We can use the head() function to see the first few lines of data.
head( crx, 6 )
This should produce the following output:
  V1    V2    V3 V4 V5 V6 V7   V8 V9 V10 V11 V12 V13   V14 V15 V16
1  b 30.83 0.000  u  g  w  v 1.25  t   t   1   f   g 00202   0   +
2  a 58.67 4.460  u  g  q  h 3.04  t   t   6   f   g 00043 560   +
3  a 24.50 0.500  u  g  q  h 1.50  t   f   0   f   g 00280 824   +
4  b 27.83 1.540  u  g  w  v 3.75  t   t   5   t   g 00100   3   +
5  b 20.17 5.625  u  g  w  v 1.71  t   f   0   f   s 00120   0   +
6  b 32.08 4.000  u  g  m  v 2.50  t   f   0   t   g 00360   0   +
Next we’ll want to randomize the rows. So far, the rows are ordered by the last column, V16, so all the pluses are at the top and all the minuses are at the bottom. (Verify this with tail(crx).) We can shuffle the rows with the next line, and then split the data into a feature matrix, X, and a target vector, y.
crx <- crx[ sample( nrow( crx ) ), ]
X <- crx[,1:15]
y <- crx[,16]
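Because sample() is random, the shuffle, and therefore everything downstream of it, will come out differently on every run. If you want reproducible results (an addition to the original workflow), fix the random seed before shuffling; a minimal sketch:

set.seed( 1 )  # any fixed value works; 1 is arbitrary
crx <- crx[ sample( nrow( crx ) ), ]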
Then we can create a training set and a test set for X and y.
trainX <- X[1:600,]
trainy <- y[1:600]
testX <- X[601:690,]
testy <- y[601:690]
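A quick check (not in the original post) that the split came out as intended: 600 records for training, 90 held out for testing.

nrow( trainX )   # 600
length( testy )  # 90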
Using the C50 Package
Next, we can install the C50 package, and access it through the library() function.
install.packages("C50") library(C50)
Next, we can build a model, and then check out a summary of its output. I like to reference which package I am using with the “::” notation. It’s usually not necessary, but it helps me remember which functions come from which packages.
model <- C50::C5.0( trainX, trainy )
summary( model )
So, this basically says that the tree split the data at a single point: whether the V9 variable had the value t or f. The line V9 = f: - (295/19) means that the rule V9 = f predicts V16 = -, that 295 training cases reached this leaf, and that 19 of them were misclassified. Likewise, the line V9 = t: + (305/66) means that the rule V9 = t predicts V16 = +, covering 305 training cases, 66 of which were misclassified. The same information is repeated further down as a confusion matrix. The summary also reports a 14.2% error rate, which accounts for 85 (that is, 19 + 66) of the 600 records used for training.
Call:
C5.0.default(x = trainX, y = trainy)

C5.0 [Release 2.07 GPL Edition]    Fri Aug 29 11:17:06 2014
-------------------------------

Class specified by attribute `outcome'

Read 600 cases (16 attributes) from undefined.data

Decision tree:

V9 = f: - (295/19)
V9 = t: + (305/66)

Evaluation on training data (600 cases):

        Decision Tree
      ----------------
      Size      Errors

         2   85(14.2%)   <<

       (a)   (b)    <-classified as
      ----  ----
       276    66    (a): class -
        19   239    (b): class +

    Attribute usage:

    100.00% V9

Time: 0.0 secs
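As an aside, the C50 package also provides a plot() method for C5.0 objects, so the same tree can be viewed graphically rather than as text (output not shown here):

plot( model )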
Boosting
Boosting is the process of adding weak learners in such a way that newer learners pick up the slack of older learners. In this way we can (hopefully) incrementally increase the accuracy of the model. Using the C5.0() function, we can increase the number of boosting iterations by changing the trials parameter.
model <- C50::C5.0( trainX, trainy, trials=10 )
The summary() function applied to the model object will then print out the tree structure for each learner.
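As a quick check that all ten iterations were actually fit (C5.0 can stop boosting early if the additional trees stop helping), we can inspect the fitted object; a minimal sketch, assuming the trials component documented for C5.0 objects:

model$trials
# Requested    Actual
#        10        10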
Prediction
We can make predictions by passing the model object to the predict() function, along with the data that we want to make predictions from. The type="class" argument specifies that we want the actual class labels as output, rather than the probability that the class label was one label or another. Therefore, the factor vector p is a vector of pluses and minuses, like y, trainy, and testy.
p <- predict( model, testX, type="class" )
If we didn't trust our model summary, we could evaluate the accuracy of the model using the test data we set aside.
sum( p == testy ) / length( p )
This gives us the following accuracy estimate,
[1] 0.8333333
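If we want more detail than a single accuracy number, base R’s table() function will cross-tabulate the predicted labels against the true ones, giving a confusion matrix; a minimal sketch:

table( predicted = p, actual = testy )

The diagonal entries count the correct predictions, and the off-diagonal entries count the two kinds of error.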
Out of curiosity, if we had called the predict() function with the type="prob" argument, we would get the following output for p.
> p
            -          +
195 0.2175490 0.78245098
40  0.2175490 0.78245098
175 0.2175490 0.78245098
14  0.9343581 0.06564189
86  0.2175490 0.78245098
160 0.2175490 0.78245098
89  0.2175490 0.78245098
298 0.9343581 0.06564189
35  0.2175490 0.78245098
663 0.9343581 0.06564189
665 0.9343581 0.06564189
578 0.2175490 0.78245098
138 0.2175490 0.78245098
438 0.9343581 0.06564189
363 0.9343581 0.06564189
440 0.9343581 0.06564189
148 0.2175490 0.78245098
266 0.9343581 0.06564189
317 0.9343581 0.06564189
541 0.2175490 0.78245098
60  0.2175490 0.78245098
322 0.9343581 0.06564189
319 0.9343581 0.06564189
454 0.9343581 0.06564189
166 0.2175490 0.78245098
265 0.9343581 0.06564189
621 0.9343581 0.06564189
567 0.2175490 0.78245098
22  0.2175490 0.78245098
193 0.2175490 0.78245098
37  0.2175490 0.78245098
87  0.2175490 0.78245098
24  0.2175490 0.78245098
85  0.2175490 0.78245098
540 0.2175490 0.78245098
165 0.2175490 0.78245098
289 0.9343581 0.06564189
678 0.9343581 0.06564189
308 0.9343581 0.06564189
32  0.2175490 0.78245098
213 0.2175490 0.78245098
476 0.9343581 0.06564189
137 0.2175490 0.78245098
73  0.2175490 0.78245098
587 0.2175490 0.78245098
426 0.9343581 0.06564189
45  0.2175490 0.78245098
641 0.9343581 0.06564189
533 0.2175490 0.78245098
513 0.2175490 0.78245098
489 0.9343581 0.06564189
350 0.9343581 0.06564189
42  0.2175490 0.78245098
400 0.9343581 0.06564189
531 0.2175490 0.78245098
222 0.2175490 0.78245098
528 0.2175490 0.78245098
557 0.2175490 0.78245098
321 0.9343581 0.06564189
177 0.2175490 0.78245098
452 0.9343581 0.06564189
199 0.2175490 0.78245098
201 0.2175490 0.78245098
506 0.2175490 0.78245098
192 0.2175490 0.78245098
647 0.9343581 0.06564189
502 0.2175490 0.78245098
340 0.9343581 0.06564189
168 0.2175490 0.78245098
361 0.9343581 0.06564189
214 0.2175490 0.78245098
625 0.9343581 0.06564189
202 0.2175490 0.78245098
252 0.2175490 0.78245098
292 0.9343581 0.06564189
588 0.2175490 0.78245098
107 0.2175490 0.78245098
434 0.9343581 0.06564189
247 0.2175490 0.78245098
34  0.2175490 0.78245098
474 0.9343581 0.06564189
589 0.2175490 0.78245098
184 0.2175490 0.78245098
686 0.9343581 0.06564189
238 0.2175490 0.78245098
7   0.2175490 0.78245098
682 0.9343581 0.06564189
17  0.2175490 0.78245098
3   0.2175490 0.78245098
190 0.2175490 0.78245098
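Each row shows the estimated probability of each class for one test record, with the row names carried over from the shuffled data frame. If we wanted to turn these probabilities back into class labels ourselves, choosing the more probable column reproduces what type="class" does; a minimal sketch, assuming the columns are named by class level as shown above:

probs <- predict( model, testX, type="prob" )
# pick "+" when it is the more probable class, "-" otherwise
labels <- ifelse( probs[ , "+" ] > probs[ , "-" ], "+", "-" )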
Comments

Great post, Conner. I’m wondering if you have a command to manually create a confusion matrix from the predicted result? That is, in addition to knowing the accuracy from sum( p == testy ) / length( p )?
It was really helpful. Thanks.
Is it right that I get a different result every time I run your code?
In the fourth listing of the “Data” section, there is a line where I randomize the rows of the data frame. If you leave that part out, then you should get the same outputs back, unless there is a randomization operation in the C50 package. Hope this helps!