Random Forest – NoSimpler

Random Forest Prediction Model

Random forest is an ensemble learning model: it combines many weaker classification or regression trees into a stronger predictor. It reduces overfitting in two ways. First, bagging (bootstrap aggregating) trains each tree on a different random sample of the training rows, drawn with replacement. Second, each split considers only a random subset of the features. For classification, the forest's prediction is the class that receives the most votes across the trees; for regression, it is the average of the trees' predictions.
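The two ideas behind the ensemble, bootstrap sampling and majority voting, can be sketched in base R without any package. This is only an illustration: the three "tree" votes are made-up values, not output from a real forest, and all variable names are hypothetical.

```r
# Bagging: draw row indices with replacement to get one bootstrap sample
set.seed(1)
n <- nrow(iris)
boot_idx <- sample(n, n, replace = TRUE)
oob_idx <- setdiff(seq_len(n), boot_idx)   # rows never drawn: "out-of-bag"

# Majority vote: each row is one tree, each column one observation
# (hypothetical votes, for illustration only)
votes <- rbind(
  tree1 = c("setosa", "versicolor", "virginica"),
  tree2 = c("setosa", "virginica",  "virginica"),
  tree3 = c("setosa", "versicolor", "versicolor")
)

# For each observation, keep the class with the most votes
majority <- apply(votes, 2, function(v) names(which.max(table(v))))
majority
```

The out-of-bag rows are the ones a tree never saw during training, which is why they can be used for an internal error estimate.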

Programming Logic

Steps to fit random forest model to predict a categorical target variable based on numerical feature variables

Pre-requisite:

Understand the dataset for any pre-processing that may be required to complete the ML task.

Step 1:
Install the required R packages and load them

Step 2:
Set up the environment options, if any
Set seed

Step 3:
Create train and test data from dataset

Step 4:
Fit the random forest model, using all the features to predict the target

Step 5:
Print and plot random forest tree model

Step 6:
Now use the random forest model to predict the target for test data and check the accuracy

Step 7:
Use the random forest model to find the importance of each feature used in the prediction

Step 8:
See the trees used in random forest model as a data frame

Understanding data set

We use the built-in data set iris.

It has 150 observations and 5 variables. We need to build a model that predicts the categorical target variable 'Species' from the numerical feature variables Petal.Length, Petal.Width, Sepal.Length and Sepal.Width.

# check dimensions of data set

dim(iris)
# [1] 150 5

names(iris)
#[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

str(iris)
#'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
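Before modelling, it can also help to check the class balance of the target and look for missing values, since these are the usual reasons pre-processing is needed. A minimal sketch using base R only:

```r
# Class balance of the target: the built-in iris data has 50 rows per species
table(iris$Species)

# Any missing values that would need handling before modelling?
anyNA(iris)

# Confirm that the four feature columns are numeric
sapply(iris[, 1:4], is.numeric)
```

For iris the classes are balanced and nothing is missing, so no pre-processing is required here.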

Install required R packages

install.packages("randomForest")

Load installed R packages

# library() stops with an error if the package is missing,
# whereas require() only returns FALSE
library(randomForest)

Set seed

Set the seed so that the random sampling, and therefore the results, are reproducible.

set.seed(1234)
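To see what the seed does, the sketch below draws the same random sample twice after resetting the seed. The variable names are hypothetical. Note that R 3.6.0 changed the default sampling algorithm, so the exact indices printed in this post may differ between R versions even with the same seed.

```r
set.seed(1234)
first_draw <- sample(10, 5)

set.seed(1234)                 # reset to the same seed
second_draw <- sample(10, 5)

# Same seed, same R version: the two draws are identical
identical(first_draw, second_draw)
```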

Create train and test dataset

Split the data set into two equal parts for training and testing.

Use the sample function to draw random row indexes for the training set; the remaining rows form the testing set.

# random sample for training and testing data

index = sample(nrow(iris), nrow(iris)/2)
index

# [1] 18 93 91 92 126 149 2 34 95 73 98 76 40 127 138 114
# [17] 39 36 25 31 42 136 21 6 28 102 66 113 125 137 55 32
# [33] 133 60 22 88 23 30 112 90 61 71 143 67 35 53 109 50
# [49] 132 78 8 131 104 49 15 48 47 70 17 101 148 4 146 144
# [65] 128 110 26 43 5 46 10 140 87 85 7

training_data = iris[index,]
testing_data = iris[-index,]
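A purely random split does not guarantee that each species is equally represented in both halves. If that matters, one alternative is a stratified split, sketched below in base R. This is not what the post uses; the names rows_by_species, strat_index, strat_train and strat_test are hypothetical.

```r
set.seed(1234)

# Group row numbers by species, then sample half of each group
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
strat_index <- unlist(
  lapply(rows_by_species, function(rows) sample(rows, length(rows) / 2)),
  use.names = FALSE
)

strat_train <- iris[strat_index, ]
strat_test  <- iris[-strat_index, ]

# Each half now holds 25 rows of every species
table(strat_train$Species)
table(strat_test$Species)
```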

Fit the random forest model

The model uses all four features as predictors. The formula Species ~ . is shorthand for:

Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

# fit the randomForest model

rfm = randomForest(Species ~ ., data = training_data)
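The call above relies on the package defaults. The sketch below spells out the main ones: ntree = 500 and, for classification, mtry = floor(sqrt(p)), where p is the number of features. The name rfm_explicit is hypothetical, and the fit only runs if the randomForest package is installed.

```r
p <- ncol(iris) - 1             # number of feature columns (4)
default_mtry <- floor(sqrt(p))  # classification default for mtry
default_mtry

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1234)
  index <- sample(nrow(iris), nrow(iris) / 2)
  rfm_explicit <- randomForest::randomForest(
    Species ~ ., data = iris[index, ],
    ntree = 500,           # number of trees to grow
    mtry = default_mtry,   # features tried at each split
    importance = TRUE      # also record permutation importance
  )
  print(rfm_explicit$ntree)
}
```

Raising ntree mostly costs computation time, while mtry controls how different the trees are from one another.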

Understand the random forest model

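By default, randomForest grows 500 trees. The OOB (out-of-bag) error in the output is estimated on the training rows that were left out of each tree's bootstrap sample. In the confusion matrix, rows are actual classes and columns are predicted classes.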

# print the random forest model

print(rfm)

# Call:
# randomForest(formula = Species ~ ., data = training_data)
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2

# OOB estimate of error rate: 4%
# Confusion matrix:
# setosa versicolor virginica class.error
# setosa 31 0 0 0.00000000
# versicolor 0 18 2 0.10000000
# virginica 0 1 23 0.04166667
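The error figures in the printed model follow directly from the confusion matrix. As a worked check, the sketch below copies the matrix from the output above (the name oob is hypothetical) and recomputes the per-class error and the overall OOB error.

```r
# OOB confusion matrix copied from the printed model (rows = actual class)
classes <- c("setosa", "versicolor", "virginica")
oob <- matrix(c(31,  0,  0,
                 0, 18,  2,
                 0,  1, 23),
              nrow = 3, byrow = TRUE,
              dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)
class_error

# Overall OOB error: 3 misclassified rows out of 75
oob_error <- 1 - sum(diag(oob)) / sum(oob)
oob_error   # 0.04
```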

Predict using random forest model

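Use the model to predict the species for the testing data, then compare the predictions with the actual values in a confusion matrix. Here rows are predicted classes and columns are actual classes.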

# predict for test data using rfm

testPred = predict(rfm, testing_data)

# confusion matrix for predicted and actual values

table(testPred,testing_data$Species)

# testPred setosa versicolor virginica
# setosa 19 0 0
# versicolor 0 30 5
# virginica 0 0 21

Measure the accuracy of the model

# accuracy of model
accuracy = mean(testPred == testing_data$Species)
accuracy
#[1] 0.9333333
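Overall accuracy hides which classes are confused with each other. The sketch below copies the test confusion matrix from the output above (the name cm is hypothetical) and derives accuracy, precision and recall per class.

```r
# Test confusion matrix copied from above (rows = predicted, columns = actual)
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(19,  0,  0,
                0, 30,  5,
                0,  0, 21),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = classes, actual = classes))

accuracy_cm <- sum(diag(cm)) / sum(cm)   # 70 correct out of 75
precision   <- diag(cm) / rowSums(cm)    # correct share of each predicted class
recall      <- diag(cm) / colSums(cm)    # correct share of each actual class

accuracy_cm
round(precision, 3)
round(recall, 3)
```

All five errors are virginica flowers predicted as versicolor, which lowers recall for virginica and precision for versicolor.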

Importance of features in prediction

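The random forest model can be used to find the importance of each feature in the prediction. For this model, importance() reports the mean decrease in Gini impurity; larger values indicate more influential features.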

# see the most important features in predictor

importance(rfm)

# MeanDecreaseGini
# Sepal.Length 3.616852
# Sepal.Width 1.776566
# Petal.Length 18.809183
# Petal.Width 24.937505

# the most important feature is Petal.Width
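The raw Gini values are easier to compare as shares of the total. The sketch below copies the values from the output above into a named vector (gini and share are hypothetical names) and ranks them. With the fitted model available, varImpPlot(rfm) from the randomForest package draws the same ranking as a chart.

```r
# MeanDecreaseGini values copied from the importance() output above
gini <- c(Sepal.Length = 3.616852,
          Sepal.Width  = 1.776566,
          Petal.Length = 18.809183,
          Petal.Width  = 24.937505)

# Rank the features and express each as a percentage of the total
share <- round(100 * sort(gini, decreasing = TRUE) / sum(gini), 1)
share

names(share)[1]   # "Petal.Width"
```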

Understand the trees used in random forest model

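The getTree function extracts a single tree from the forest as a data frame. Since the model has 500 trees by default, the call below returns the last one. A status of 1 marks a split node and -1 marks a terminal (leaf) node.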

getTree(rfm, 500, labelVar=TRUE)

#   left daughter right daughter   split var split point status prediction
# 1             2              3 Petal.Width        0.70      1       <NA>
# 2             0              0        <NA>        0.00     -1     setosa
# 3             4              5 Petal.Width        1.65      1       <NA>
# 4             0              0        <NA>        0.00     -1 versicolor
# 5             0              0        <NA>        0.00     -1  virginica
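To read such a table, start at node 1 and move to the left daughter when the feature value is less than or equal to the split point, otherwise to the right daughter, until a terminal node is reached. The sketch below rebuilds the table above as a plain data frame and walks it by hand. The column names and the predict_one function are hypothetical, not part of the randomForest package.

```r
# The tree from the output above, rebuilt as a plain data frame
tree <- data.frame(
  left        = c(2, 0, 4, 0, 0),
  right       = c(3, 0, 5, 0, 0),
  split_var   = c("Petal.Width", NA, "Petal.Width", NA, NA),
  split_point = c(0.70, 0, 1.65, 0, 0),
  status      = c(1, -1, 1, -1, -1),
  prediction  = c(NA, "setosa", NA, "versicolor", "virginica"),
  stringsAsFactors = FALSE
)

# Walk from the root until a terminal node (status -1) is reached
predict_one <- function(tree, obs) {
  node <- 1
  while (tree$status[node] == 1) {
    go_left <- obs[[tree$split_var[node]]] <= tree$split_point[node]
    node <- if (go_left) tree$left[node] else tree$right[node]
  }
  tree$prediction[node]
}

predict_one(tree, iris[1, ])     # "setosa": Petal.Width is 0.2
predict_one(tree, iris[150, ])   # "virginica": Petal.Width is 1.8
```

This is a single tree's vote; the forest's prediction is the majority over all 500 such votes.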