Decision Trees in R using the C50 Package

In this post I’ll walk through an example of using the C50 package for decision trees in R. The package implements C5.0, an extension of the C4.5 algorithm. We’ll use some credit-screening data from the UCI Machine Learning Repository that has been sanitized and anonymized beyond all recognition.

Data

The data set is interesting because it has a mixture of numeric and string-valued attributes. There are 690 records, so working with it won’t strain your computer. We’ll grab the data directly from the site using the URL in the code listing below.

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
crx <- read.table( file=url, header=FALSE, sep="," )
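
Since the columns are a mix of numbers and strings, it’s worth checking how read.table() parsed them. A quick look with str(), which isn’t in the original listing, prints each column’s type; depending on your R version the string columns come in as factors (before R 4.0) or plain character vectors (R 4.0 and later).

str( crx )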

As an aside, we can write the data out to a file using,

write.table( crx, "crx.dat", quote=FALSE, sep="," )

We can use the head() function to see the first few lines of data.

head( crx, 6 )

This should produce the following output:

  V1    V2    V3 V4 V5 V6 V7   V8 V9 V10 V11 V12 V13   V14 V15 V16
1  b 30.83 0.000  u  g  w  v 1.25  t   t   1   f   g 00202   0   +
2  a 58.67 4.460  u  g  q  h 3.04  t   t   6   f   g 00043 560   +
3  a 24.50 0.500  u  g  q  h 1.50  t   f   0   f   g 00280 824   +
4  b 27.83 1.540  u  g  w  v 3.75  t   t   5   t   g 00100   3   +
5  b 20.17 5.625  u  g  w  v 1.71  t   f   0   f   s 00120   0   +
6  b 32.08 4.000  u  g  m  v 2.50  t   f   0   t   g 00360   0   +

Next we’ll want to randomize the rows. So far, the rows are ordered by the last column, V16, so all the pluses are at the top and all the minuses are at the bottom. (Verify this with tail(crx).) We can shuffle the rows with the next line, and then split the data into a feature matrix, X, and a target vector, y.

crx <- crx[ sample( nrow( crx ) ), ]
X <- crx[,1:15]
y <- crx[,16]
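
A couple of optional tweaks here, both outside the original listing: sample() makes the shuffle non-deterministic, so setting a seed first (the value 1 below is arbitrary) makes the results reproducible; and C5.0() expects a factor target, which matters on R 4.0 and later where read.table() no longer converts strings to factors by default.

set.seed( 1 )                        # arbitrary seed, for reproducibility only
crx <- crx[ sample( nrow( crx ) ), ]
X <- crx[,1:15]
y <- factor( crx[,16] )              # make sure the target is a factor for C5.0()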

Then we can create a training set and a test set for X and y.

trainX <- X[1:600,]
trainy <- y[1:600]
testX <- X[601:690,]
testy <- y[601:690]
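
Because the split is just the first 600 and last 90 rows of the shuffled data, a quick sanity check that both classes appear in each partition doesn’t hurt; table() does the job.

table( trainy )
table( testy )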

Using the C50 Package

Next, we can install the C50 package and load it with the library() function.

install.packages("C50")
library(C50)

Now we can build a model and look at a summary of its output. I like to reference the package I’m using with the “::” notation. It’s usually not necessary, but it helps me remember which functions come from which packages.

model <- C50::C5.0( trainX, trainy )
summary( model )

So, this basically says that the tree split the data at one spot: whether the V9 variable had the value t or f. The line V9 = f: - (295/19) means that the rule V9 = f predicted V16 = - for 295 training cases, 19 of which were actually +. Likewise, V9 = t: + (305/66) means the rule V9 = t predicted V16 = + for 305 cases, 66 of which were actually -. The same breakdown is repeated further down as a confusion matrix. The summary also reports a 14.2% error rate, which corresponds to the 85 (19 + 66) misclassified records out of the 600 used for training.

Call:
C5.0.default(x = trainX, y = trainy)


C5.0 [Release 2.07 GPL Edition]     Fri Aug 29 11:17:06 2014
-------------------------------

Class specified by attribute `outcome'

Read 600 cases (16 attributes) from undefined.data

Decision tree:

V9 = f: - (295/19)
V9 = t: + (305/66)


Evaluation on training data (600 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         2   85(14.2%)   <<


       (a)   (b)    <-classified as
      ----  ----
       276    66    (a): class -
        19   239    (b): class +


    Attribute usage:

    100.00% V9


Time: 0.0 secs
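
If you’d rather look at a picture than read the listing, the C50 package also provides a plot() method for C5.0 objects that draws the tree. For a one-split tree like this the plot is trivial, but it becomes handy as trees grow.

plot( model )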

Boosting

Boosting is the process of adding weak learners in such a way that newer learners pick up the slack of older learners. In this way we can (hopefully) incrementally increase the accuracy of the model. Using the C5.0() function, we can increase the number of boosting iterations by changing the trials parameter.

model <-  C50::C5.0( trainX, trainy, trials=10 )

The summary() function applied to the model object will then print out the tree structure for each learner.
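
To see whether the extra trials actually buy anything, we can score the boosted model on the held-out rows, borrowing the predict() call covered in the next section (pboost is just an illustrative name).

pboost <- predict( model, testX, type="class" )
sum( pboost == testy ) / length( pboost )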

Prediction

We can make predictions by passing the model object to the predict() function, along with the data we want to make predictions for. The type="class" argument specifies that we want the actual class labels as output, rather than the probability of each class. The result p is therefore a factor vector of pluses and minuses, just like y, trainy, and testy.

p <- predict( model, testX, type="class" )

The model summary only reports performance on the training data, so for an honest estimate we can evaluate the accuracy of the model on the test data we set aside.

sum( p == testy ) / length( p )

This gives us the following accuracy estimate,

[1] 0.8333333
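
A single accuracy number hides which kinds of mistakes the model makes, so it can also be useful to cross-tabulate predictions against the true labels with table(); the argument names Predicted and Actual below are just labels for the dimensions of the resulting confusion matrix.

table( Predicted=p, Actual=testy )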

Out of curiosity, if we had called the predict() function with the type="prob" argument, we would get the following output for p.
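
For completeness, that call would look like this, re-assigning p:

p <- predict( model, testX, type="prob" )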

> p
            -          +
195 0.2175490 0.78245098
40  0.2175490 0.78245098
175 0.2175490 0.78245098
14  0.9343581 0.06564189
86  0.2175490 0.78245098
160 0.2175490 0.78245098
89  0.2175490 0.78245098
298 0.9343581 0.06564189
35  0.2175490 0.78245098
663 0.9343581 0.06564189
665 0.9343581 0.06564189
578 0.2175490 0.78245098
138 0.2175490 0.78245098
438 0.9343581 0.06564189
363 0.9343581 0.06564189
440 0.9343581 0.06564189
148 0.2175490 0.78245098
266 0.9343581 0.06564189
317 0.9343581 0.06564189
541 0.2175490 0.78245098
60  0.2175490 0.78245098
322 0.9343581 0.06564189
319 0.9343581 0.06564189
454 0.9343581 0.06564189
166 0.2175490 0.78245098
265 0.9343581 0.06564189
621 0.9343581 0.06564189
567 0.2175490 0.78245098
22  0.2175490 0.78245098
193 0.2175490 0.78245098
37  0.2175490 0.78245098
87  0.2175490 0.78245098
24  0.2175490 0.78245098
85  0.2175490 0.78245098
540 0.2175490 0.78245098
165 0.2175490 0.78245098
289 0.9343581 0.06564189
678 0.9343581 0.06564189
308 0.9343581 0.06564189
32  0.2175490 0.78245098
213 0.2175490 0.78245098
476 0.9343581 0.06564189
137 0.2175490 0.78245098
73  0.2175490 0.78245098
587 0.2175490 0.78245098
426 0.9343581 0.06564189
45  0.2175490 0.78245098
641 0.9343581 0.06564189
533 0.2175490 0.78245098
513 0.2175490 0.78245098
489 0.9343581 0.06564189
350 0.9343581 0.06564189
42  0.2175490 0.78245098
400 0.9343581 0.06564189
531 0.2175490 0.78245098
222 0.2175490 0.78245098
528 0.2175490 0.78245098
557 0.2175490 0.78245098
321 0.9343581 0.06564189
177 0.2175490 0.78245098
452 0.9343581 0.06564189
199 0.2175490 0.78245098
201 0.2175490 0.78245098
506 0.2175490 0.78245098
192 0.2175490 0.78245098
647 0.9343581 0.06564189
502 0.2175490 0.78245098
340 0.9343581 0.06564189
168 0.2175490 0.78245098
361 0.9343581 0.06564189
214 0.2175490 0.78245098
625 0.9343581 0.06564189
202 0.2175490 0.78245098
252 0.2175490 0.78245098
292 0.9343581 0.06564189
588 0.2175490 0.78245098
107 0.2175490 0.78245098
434 0.9343581 0.06564189
247 0.2175490 0.78245098
34  0.2175490 0.78245098
474 0.9343581 0.06564189
589 0.2175490 0.78245098
184 0.2175490 0.78245098
686 0.9343581 0.06564189
238 0.2175490 0.78245098
7   0.2175490 0.78245098
682 0.9343581 0.06564189
17  0.2175490 0.78245098
3   0.2175490 0.78245098
190 0.2175490 0.78245098
