Written by Abhinav T K
Random forest is one of the most popular and powerful supervised machine learning techniques. It can perform both classification and regression tasks with good accuracy. As the name suggests, it is a collection of decision trees, so understanding how a decision tree works is a prerequisite for understanding how a random forest model works.
Why should we use a random forest model if we can use a decision tree?
The reason is that individual decision trees are often inaccurate predictors: they tend to overfit the training data and generalize poorly to new samples. A random forest addresses this by combining many varied trees through an ensemble learning approach, which usually yields a substantial improvement in accuracy.
Advantages of Random Forest:
Random forest can be used for both classification and regression problems.
Much less prone to overfitting than a single decision tree.
High accuracy for classification.
Power to handle large datasets.
Disadvantages of Random Forest:
It does not perform as well for regression as it does for classification, and it cannot predict values outside the range seen in the training set.
A large number of decision trees can make training and prediction slow.
How does Random Forest work?
Random forest models are built using an approach called bagging (bootstrap aggregating), which is an ensemble learning technique. Ensemble learning is the process of combining multiple models to solve a problem.
Bagging means bootstrapping the data (resampling it with replacement) and aggregating the predictions of models trained on those resamples. Bagging helps to decrease the variance of the prediction, because each model is trained on a slightly different dataset created by sampling the original dataset with repetition.
Let’s see how a random forest is built.
Step 1- Creating a bootstrapped training set:
The bootstrapped training set is a resampling of the original training set with the samples being randomly selected from the original training set. The same sample can be picked more than once while creating a bootstrapped dataset.
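Bootstrapping can be sketched in a few lines with NumPy. The toy dataset below is made up for illustration; the key point is that indices are drawn with replacement, so some samples repeat and others are left out.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A toy training set of 8 samples (features X, labels y).
X = np.arange(8).reshape(8, 1)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Sampling with replacement: the bootstrapped set has the same size as
# the original, but some rows appear more than once and some not at all.
idx = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[idx], y[idx]

print(sorted(idx.tolist()))  # repeated indices show sampling with replacement
```

Running this a few times with different seeds shows that each bootstrapped set leaves out a different subset of the original samples, which is what makes the trees of the forest differ from one another.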
Step 2 - Building a decision tree using the bootstrapped training set:
While building each decision tree, we consider only a random subset of features at each step (node). Out of those features, we select the one that best separates the samples, then move to the next node and repeat the process. The whole tree is grown this way, considering only a random subset of columns at each split.
Step 3 - Build multiple decision trees:
Build multiple decision trees by following the above two steps- that is, create a bootstrapped training set and build a tree using a subset of features at each node. By following these steps we can create a variety of trees. This variety makes the random forest more effective.
The number of trees is a hyperparameter. In general, more trees give a more accurate and more stable model, with diminishing returns beyond a certain point.
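The three steps above are what scikit-learn's RandomForestClassifier does internally. A minimal sketch, assuming scikit-learn is installed and using the bundled Iris dataset purely as an example: n_estimators sets the number of trees, and max_features sets the size of the random feature subset considered at each node.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# n_estimators: number of bootstrapped trees (step 3).
# max_features="sqrt": random subset of features tried at each split (step 2).
model = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=42
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # held-out accuracy
```

Increasing n_estimators generally improves accuracy up to a plateau, at the cost of slower training and prediction, which matches the trade-off noted in the disadvantages above.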
Using a Random Forest:
When a new sample arrives, it is run through every decision tree in the forest and the outputs are collected.
For a classification problem, the class predicted by the most trees (the majority vote) is the output of the random forest.
For a regression problem, the average of the outputs of all the decision trees is the output of the random forest.
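Both aggregation rules fit in a few lines of plain Python. The per-tree outputs below are made-up values for one hypothetical new sample:

```python
from collections import Counter

# Hypothetical outputs of the individual trees for one new sample.
class_votes = ["cat", "dog", "cat", "cat", "dog"]   # classification forest
value_preds = [2.1, 1.9, 2.4, 2.0]                  # regression forest

# Classification: the class with the most votes wins.
majority = Counter(class_votes).most_common(1)[0][0]

# Regression: the forest predicts the mean of the tree outputs.
average = sum(value_preds) / len(value_preds)

print(majority)  # cat
print(average)
```

Note that the regression output can never fall outside the interval spanned by the tree predictions, which is why a random forest cannot extrapolate beyond the training range.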
Evaluating the accuracy of a Random Forest:
While creating a bootstrapped dataset, some samples are left out; these are called out-of-bag (OOB) samples. Typically about one-third of the samples are left out of any given bootstrap. Each out-of-bag sample is run through only the trees that were built without it, and its predicted class is taken by majority vote among those trees. Repeating this for every out-of-bag sample, the out-of-bag accuracy is measured as the proportion of these samples that the random forest classifies correctly.
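In scikit-learn this evaluation comes for free: passing oob_score=True makes the forest score each training sample using only the trees that did not see it. A short sketch, using the bundled breast cancer dataset as an example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True: each sample is evaluated by the trees whose
# bootstrap did not include it, so no separate test set is needed.
model = RandomForestClassifier(
    n_estimators=100, oob_score=True, random_state=0
)
model.fit(X, y)
print(model.oob_score_)  # proportion of OOB samples classified correctly
```

The resulting oob_score_ is usually close to the accuracy you would measure with a held-out test set, which makes it a cheap sanity check on the forest.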
About the person behind the keyboard: Abhinav is pursuing a B.Tech at IIT Hyderabad and is a passionate engineer.