In this article, I will give you a brief overview of random forest and its features. Before going into the details, I assume that you are familiar with the concept of a decision tree.
Random forest is a supervised machine learning algorithm that uses an ensemble method. Simply put, a random forest is made up of numerous decision trees and helps to tackle the problem of overfitting in individual decision trees. Each of these trees is constructed randomly, by selecting random features from the given dataset.
Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. One big advantage of random forest is that it can be used for both classification and regression problems.
For classification, a random forest arrives at its prediction by majority vote: the outcome predicted by the largest number of decision trees is taken as the final outcome of the forest. For regression, the predictions of the individual trees are averaged instead.
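To make the two ways of combining trees concrete, here is a minimal sketch in Python. The per-tree outputs are made-up values for illustration, and the function names are hypothetical rather than taken from any library. Averaging for regression is the usual convention and is assumed here.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    votes = Counter(tree_predictions)
    label, _count = votes.most_common(1)[0]
    return label


def aggregate_regression(tree_predictions):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of five trees for a single input.
class_votes = ["spam", "ham", "spam", "spam", "ham"]
value_estimates = [3.0, 2.5, 3.5, 3.0, 3.0]

majority_label = aggregate_classification(class_votes)
average_value = aggregate_regression(value_estimates)
print(majority_label, average_value)
```

Three of the five trees vote for "spam", so that label wins, while the regression output is simply the mean of the five estimates.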
Pseudocode for Random Forest:
Step 1: Randomly select “k” features from total “m” features
where k << m.
Step 2: Among the “k” features, calculate the node “d” using the best split point.
Step 3: Split the node into daughter nodes using the best split.
Step 4: Repeat steps 1 to 3 until “l” number of nodes has been reached.
Step 5: Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
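The steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: it scores splits with Gini impurity, uses a `max_depth` limit as a simpler stand-in for the “l” nodes stopping rule, and draws a bootstrap sample of the rows for each tree, which the steps above do not mention but which is common practice. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Step 2: find the (feature, threshold) pair with the lowest weighted Gini."""
    best = None  # (score, feature, threshold)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


def build_tree(rows, labels, k, max_depth, rng):
    """Steps 1 to 4: grow one tree, sampling k features at every node."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return {"leaf": majority}
    feature_ids = rng.sample(range(len(rows[0])), k)  # Step 1
    split = best_split(rows, labels, feature_ids)     # Step 2
    if split is None:
        return {"leaf": majority}
    _, f, threshold = split
    left = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right = [i for i, row in enumerate(rows) if row[f] > threshold]

    def grow(indices):
        # Step 4: repeat the procedure on a daughter node.
        return build_tree([rows[i] for i in indices],
                          [labels[i] for i in indices],
                          k, max_depth - 1, rng)

    # Step 3: split the node into two daughter nodes.
    return {"feature": f, "threshold": threshold,
            "left": grow(left), "right": grow(right)}


def build_forest(rows, labels, n_trees, k, max_depth=3, seed=0):
    """Step 5: repeat tree construction n_trees times."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Assumption: each tree sees a bootstrap sample of the rows.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx],
                                 k, max_depth, rng))
    return forest


# Toy dataset with m = 3 features; k = 2 features are tried at each node.
rows = [[1.0, 5.0, 0.2], [1.2, 4.8, 0.1], [3.9, 1.0, 0.3],
        [4.1, 1.2, 0.2], [1.1, 5.1, 0.4], [4.0, 0.9, 0.1]]
labels = ["a", "a", "b", "b", "a", "b"]
forest = build_forest(rows, labels, n_trees=5, k=2)
print(len(forest))
```

Each tree is stored as a nested dictionary, which keeps the sketch short. A real implementation would normally use a dedicated node type and a faster split search.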
Pseudocode for Random Forest prediction:
Step 1: Take the test features, use the rules of each randomly created decision tree to predict the outcome, and store the predicted outcome (target).
Step 2: Calculate the votes for each predicted target.
Step 3: Take the predicted target with the most votes as the final prediction of the random forest algorithm.
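The prediction steps can be sketched as below. The nested-dictionary tree format and the three hand-written trees are assumptions made purely for illustration. In practice the trees would come from a training procedure.

```python
from collections import Counter


def predict_tree(tree, features):
    """Follow one tree's rules from the root down to a leaf."""
    while "leaf" not in tree:
        go_left = features[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]


def predict_forest(forest, features):
    # Step 1: apply every tree's rules and store each predicted target.
    predictions = [predict_tree(tree, features) for tree in forest]
    # Step 2: count the votes for each predicted target.
    votes = Counter(predictions)
    # Step 3: the target with the most votes is the final prediction.
    return votes.most_common(1)[0][0], votes


# Three hypothetical single-split trees, each testing a different feature.
forest = [
    {"feature": 0, "threshold": 2.5, "left": {"leaf": "a"}, "right": {"leaf": "b"}},
    {"feature": 1, "threshold": 3.0, "left": {"leaf": "b"}, "right": {"leaf": "a"}},
    {"feature": 2, "threshold": 0.25, "left": {"leaf": "b"}, "right": {"leaf": "a"}},
]
sample = [1.0, 5.0, 0.2]

label, votes = predict_forest(forest, sample)
print(label, dict(votes))
```

For this sample, the first two trees predict "a" and the third predicts "b", so the forest returns "a" with two votes to one.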
Pros and Cons of Random Forest
The following are the advantages of the random forest algorithm −
- It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
- Random forests work well for a larger range of data items than a single decision tree does.
- A random forest has less variance than a single decision tree.
- Random forests are very flexible and often achieve high accuracy.
- Random forests do not require feature scaling. They maintain good accuracy even when the data is provided without scaling.
- Random forests can maintain good accuracy even when a large proportion of the data is missing.
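The variance claim in the list above can be illustrated with a small simulation. This is an idealized sketch, not a real forest: each "tree" is modeled as the true value plus independent Gaussian noise, and the constants are arbitrary choices. Real trees are trained on overlapping data, so their errors are correlated and the reduction is smaller than the one shown here.

```python
import random
from statistics import mean, pvariance

rng = random.Random(42)
TRUE_VALUE = 10.0   # assumed target value for the illustration
N_TREES = 25        # number of simulated trees in the ensemble
N_TRIALS = 2000     # number of repeated predictions


def noisy_tree_prediction():
    """Stand-in for one high-variance tree: the truth plus unit Gaussian noise."""
    return TRUE_VALUE + rng.gauss(0.0, 1.0)


# Predictions from a single simulated tree.
single = [noisy_tree_prediction() for _ in range(N_TRIALS)]

# Predictions from an ensemble that averages N_TREES simulated trees.
averaged = [mean(noisy_tree_prediction() for _ in range(N_TREES))
            for _ in range(N_TRIALS)]

single_variance = pvariance(single)
forest_variance = pvariance(averaged)
print(f"single tree: {single_variance:.3f}, averaged: {forest_variance:.3f}")
```

Under the independence assumption, the variance of the average shrinks roughly in proportion to the number of trees, which is why the averaged predictions cluster much more tightly around the true value.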
The following are the disadvantages of the random forest algorithm −
- Complexity is the main disadvantage of random forest algorithms.
- Constructing a random forest is much harder and more time-consuming than constructing a single decision tree.
- More computational resources are required to implement the random forest algorithm.
- The model is less intuitive when it contains a large collection of decision trees.
- Prediction with a random forest can be time-consuming in comparison with other algorithms, because every tree in the forest has to be evaluated.
Random forests give much more accurate predictions than simple regression models in many scenarios, and they are very hard to beat in terms of performance. Of course, you can probably always find a model that performs better, but such models usually take more time to develop. Random forests can also handle a lot of different feature types, such as binary, categorical and numerical. Overall, random forest is a mostly fast, simple and flexible tool.