Random variation happens in two ways
- the data used to build each tree is randomly selected (bootstrap sample)
- bootstrap sample: has N rows, like the original training set, but because rows are drawn at random with replacement, some original rows may be missing and others may appear more than once.
- the features considered at each split are also randomly selected
- instead of finding the best split across all possible features, a random subset of features is chosen and the best split is found within that smaller subset of features.
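
The two sources of randomness above can be sketched in a few lines. This is a minimal illustration using only the standard library, not how any particular library implements it; the helper names `bootstrap_indices` and `random_feature_subset` and the toy sizes are assumptions made for the example.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, max_features, rng):
    """Pick max_features distinct feature indices to consider at one split."""
    return rng.sample(range(n_features), max_features)

rng = random.Random(0)      # fixed seed so the run is repeatable
n_rows, n_features = 10, 5  # toy sizes, chosen only for illustration

# Rows used to build one tree: same length as the training set,
# but some rows repeat and some never appear.
sample = bootstrap_indices(n_rows, rng)
missing = sorted(set(range(n_rows)) - set(sample))

# Features the tree is allowed to look at for one split.
candidates = random_feature_subset(n_features, 2, rng)

print("bootstrap rows:", sample)
print("rows left out:", missing)
print("features considered at this split:", candidates)
```

Each tree would get its own bootstrap sample, and a fresh feature subset would be drawn at every split.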
Prediction
- each tree makes a prediction
- combine individual predictions
- regression: mean of individual tree predictions
- classification
- each tree gives a probability for each class
- probabilities averaged across trees
- predict the class with the highest probability
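
A small sketch of the combination step, assuming each tree's output is already available as a plain number (regression) or a list of class probabilities (classification). The function names and the example values are hypothetical.

```python
def predict_regression(tree_predictions):
    """Regression: mean of the individual tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)

def predict_class(tree_probabilities):
    """Classification: average per-tree class probabilities, then take the argmax."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    averaged = [
        sum(probs[c] for probs in tree_probabilities) / n_trees
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: averaged[c])
    return best, averaged

# Three trees, made-up outputs.
reg = predict_regression([2.0, 3.0, 4.0])
label, avg = predict_class([[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]])

print("regression prediction:", reg)
print("averaged probabilities:", avg)
print("predicted class:", label)
```

In this example two of the three trees favor class 1, yet the averaged probabilities favor class 0, because the first tree is very confident. Averaging probabilities is therefore not the same as taking a majority vote over the trees' individual class picks.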
Model Complexity
- n_estimators: no. of trees
- max_features: no. of features in the subset that are randomly considered at each split
- Learning is sensitive to max_features
- max_features = 1 → trees will be very different from each other and possibly have many levels (a split cannot pick the most informative feature, only the single randomly chosen one)
- max_features ≈ no. of features → similar trees with fewer levels (because each split can use the most informative feature)
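
One way to see the effect of max_features is to fit two forests and compare average tree depth. The sketch below assumes scikit-learn's `RandomForestClassifier` and a synthetic dataset from `make_classification`; the dataset sizes, the seed, and the helper `mean_tree_depth` are choices made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only some of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

def mean_tree_depth(max_features):
    """Fit a forest and return the average depth of its trees."""
    forest = RandomForestClassifier(
        n_estimators=50,            # no. of trees
        max_features=max_features,  # features considered at each split
        random_state=0,
    )
    forest.fit(X, y)
    depths = [tree.get_depth() for tree in forest.estimators_]
    return sum(depths) / len(depths)

depth_one = mean_tree_depth(1)           # one random feature per split
depth_all = mean_tree_depth(X.shape[1])  # all features available per split

print("mean depth, max_features=1:", depth_one)
print("mean depth, max_features=all:", depth_all)
```

With one feature per split the trees would be expected to grow deeper on average, though the exact numbers depend on the data and the seed.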