Understanding the data set
We use R's built-in iris data set.
It has 150 observations and 5 variables. The goal is to build a model that predicts the categorical target variable Species from the four numerical feature variables Petal.Length, Petal.Width, Sepal.Length and Sepal.Width.
# check dimensions of data set
dim(iris)
# [1] 150 5
names(iris)
#[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
str(iris)
#'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
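Because Species is the target, it can help to confirm how the classes are distributed before modelling. A minimal sketch using base R only (the variable name class_counts is introduced here for illustration); iris holds 50 flowers of each species:

```r
# Count the observations in each class of the target variable
class_counts = table(iris$Species)
class_counts
#     setosa versicolor  virginica
#         50         50         50

# Summarise the four numeric features (columns 1 to 4)
summary(iris[, 1:4])
```

Balanced classes mean that plain accuracy is a reasonable first metric for this data set.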
Load installed R packages
libs = c("randomForest")
lapply(libs, require, character.only=TRUE)
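Create training and testing data sets
Split the data set into two equal parts, one for training and one for testing.
Use the sample function to draw random row indexes for the training set; the remaining rows form the testing set. Because no seed is set, the indexes printed below will differ from run to run.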
# random sample for training and testing data
index = sample(nrow(iris), nrow(iris)/2)
index
# [1] 18 93 91 92 126 149 2 34 95 73 98 76 40 127 138 114
# [17] 39 36 25 31 42 136 21 6 28 102 66 113 125 137 55 32
# [33] 133 60 22 88 23 30 112 90 61 71 143 67 35 53 109 50
# [49] 132 78 8 131 104 49 15 48 47 70 17 101 148 4 146 144
# [65] 128 110 26 43 5 46 10 140 87 85 7
training_data = iris[index,]
testing_data = iris[-index,]
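The split above changes on every run. A minimal sketch of a reproducible variant follows; the call to set.seed and the seed value 123 are additions for illustration and are not part of the original listing:

```r
# Fix the random number generator so the same rows are drawn each time
# (the seed value is arbitrary)
set.seed(123)
index = sample(nrow(iris), nrow(iris)/2)
training_data = iris[index,]
testing_data = iris[-index,]

# Each half has 75 rows and 5 columns
dim(training_data)
dim(testing_data)

# No row appears in both halves, so this count is 0
length(intersect(rownames(training_data), rownames(testing_data)))

# A random split is not stratified, so class counts per half may be uneven
table(training_data$Species)
```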
Understand the random forest model
By default, randomForest grows 500 trees.
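The listings that follow use a fitted model object named rfm, but the step that creates it is not shown. Judging from the Call line in the printed output below, it was presumably built along these lines. This is a sketch, not the original code; the seed is an addition so that the block gives the same result on each run:

```r
library(randomForest)

# Recreate the split from the previous step so this block runs on its own
set.seed(42)  # arbitrary seed, not in the original listing
index = sample(nrow(iris), nrow(iris)/2)
training_data = iris[index,]
testing_data = iris[-index,]

# Fit a classification forest: Species is the response and the
# four measurements are the predictors; ntree is left at its default
rfm = randomForest(Species ~ ., data = training_data)

# Number of trees grown and number of variables tried at each split
rfm$ntree
rfm$mtry
```

With four predictors, the classification default for mtry is floor(sqrt(4)) = 2, which matches the "No. of variables tried at each split" line in the output.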
# print the random forest model
print(rfm)
# Call:
# randomForest(formula = Species ~ ., data = training_data)
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
# OOB estimate of error rate: 4%
# Confusion matrix:
#            setosa versicolor virginica class.error
# setosa         31          0         0  0.00000000
# versicolor      0         18         2  0.10000000
# virginica       0          1        23  0.04166667
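The OOB (out-of-bag) error in this output is estimated on the training rows that each tree did not see. In the confusion matrix above, 3 of the 75 training rows are misclassified (2 versicolor and 1 virginica), which gives 3/75 = 4%. The sketch below reads the same quantities from the fitted object; the seed and the names oob_error and counts are additions for illustration, so the numbers will not match the output above exactly:

```r
library(randomForest)

# Setup repeated so the block runs on its own; the seed is an addition
set.seed(42)
index = sample(nrow(iris), nrow(iris)/2)
training_data = iris[index,]
rfm = randomForest(Species ~ ., data = training_data)

# OOB error after the final tree (last row of the err.rate matrix)
oob_error = rfm$err.rate[rfm$ntree, "OOB"]
oob_error

# The same rate recomputed from the OOB confusion matrix:
# misclassified rows divided by all training rows
counts = rfm$confusion[, levels(training_data$Species)]
1 - sum(diag(counts)) / sum(counts)
```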
Measure the accuracy of the model
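The accuracy calculation below compares a vector named testPred with the true species, but the step that creates testPred is not shown. A minimal sketch, assuming it holds the model's predictions for the testing rows; the seed is an addition:

```r
library(randomForest)

# Setup repeated so the block runs on its own; the seed is an addition
set.seed(42)
index = sample(nrow(iris), nrow(iris)/2)
training_data = iris[index,]
testing_data = iris[-index,]
rfm = randomForest(Species ~ ., data = training_data)

# Predict the species of the held-out rows; the result is a factor
testPred = predict(rfm, newdata = testing_data)

# Cross-tabulate predicted against actual species
table(predicted = testPred, actual = testing_data$Species)
```

Correct predictions sit on the diagonal of this table, so the accuracy is the diagonal sum divided by the number of testing rows.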
# accuracy of model
accuracy = mean(testPred == testing_data$Species)
accuracy
#[1] 0.9333333
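Understand the trees used in the random forest model
The model holds 500 trees by default. getTree extracts a single tree by its number; here we look at tree 500, the last one. Each row is a node: status 1 marks a split node, status -1 marks a terminal node, and only terminal nodes carry a prediction.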
getTree(rfm, 500, labelVar=TRUE)
#   left daughter right daughter   split var split point status prediction
# 1             2              3 Petal.Width        0.70      1       <NA>
# 2             0              0        <NA>        0.00     -1     setosa
# 3             4              5 Petal.Width        1.65      1       <NA>
# 4             0              0        <NA>        0.00     -1 versicolor
# 5             0              0        <NA>        0.00     -1  virginica
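The tree table can also be inspected in code. The sketch below pulls out the terminal nodes of one tree and compares their count with treesize, which reports the number of terminal nodes per tree. The seed and the names tree and leaves are additions for illustration:

```r
library(randomForest)

# Setup repeated so the block runs on its own; the seed is an addition
set.seed(42)
index = sample(nrow(iris), nrow(iris)/2)
training_data = iris[index,]
rfm = randomForest(Species ~ ., data = training_data)

# Extract tree number 500 with variable names and class labels
tree = getTree(rfm, k = 500, labelVar = TRUE)

# Terminal nodes have status -1 and carry the predicted class
leaves = tree[tree$status == -1, ]
leaves$prediction

# The number of terminal nodes matches treesize() for the same tree
nrow(leaves)
treesize(rfm)[500]
```

Reading the example tree above from the root: Petal.Width of 0.70 or less leads to setosa; otherwise Petal.Width of 1.65 or less leads to versicolor, and anything larger to virginica.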