- Random forests are an example of the ensemble idea applied to decision trees.
- a single decision tree tends to be prone to overfitting
- many decision trees → more stable, better generalization
- introduce random variation into tree-building to ensure that all of the decision trees in the random forest are different
- the ensemble captures both global and local patterns in the training data better than a single decision tree model
Random variation happens in two ways
- the data used to build each tree is randomly selected (bootstrap sample)
- bootstrap sample: has N rows, just like the original training set, but the rows are drawn randomly with replacement, so some original rows may be missing and others may appear multiple times
- the features considered at each split are also randomly selected
- instead of finding the best split across all possible features, a random subset of features is chosen and the best split is found within that smaller subset (see the sketch after this list)
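A minimal sketch of the two sources of randomness, using plain NumPy rather than scikit-learn's internals (the helper names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y):
    # draw N row indices with replacement: some rows repeat, others are left out
    n = X.shape[0]
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]

def random_feature_subset(n_features, max_features):
    # at each split, only this random subset of feature indices is searched
    return rng.choice(n_features, size=max_features, replace=False)

# toy data: 6 rows, 4 features
X = np.arange(24).reshape(6, 4)
y = np.array([0, 1, 0, 1, 0, 1])

X_boot, y_boot = bootstrap_sample(X, y)        # training data for one tree
split_features = random_feature_subset(4, 2)   # features considered at one split
```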
Prediction
- each tree makes a prediction
- combine the individual predictions (see the sketch after this list)
- regression: mean of individual tree predictions
- classification
- each tree gives a probability for each class
- probabilities averaged across trees
- predict the class with the highest probability
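A toy sketch of how the per-tree outputs are combined (the arrays stand in for predictions from three trees; this mirrors the averaging described above, not scikit-learn's actual code):

```python
import numpy as np

# regression: mean of the individual tree predictions for one sample
tree_preds = np.array([2.1, 1.9, 2.4])         # one prediction per tree
regression_pred = tree_preds.mean()            # ~2.13

# classification: average the per-class probabilities across trees,
# then predict the class with the highest averaged probability
tree_probas = np.array([[0.9, 0.1],            # tree 1: P(class 0), P(class 1)
                        [0.6, 0.4],            # tree 2
                        [0.3, 0.7]])           # tree 3
avg_proba = tree_probas.mean(axis=0)           # [0.6, 0.4]
predicted_class = avg_proba.argmax()           # class 0
```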
Pros
- widely used; they give excellent prediction performance on a wide variety of problems
- doesn't require careful normalization of features or extensive parameter tuning
- handles mixtures of feature types
- easily parallelized across multiple CPUs.
Cons
- random forest models are often very difficult for people to interpret, i.e., to know why a particular prediction was made (one big difference from a single decision tree)
- not a good choice for tasks with very high-dimensional sparse features, like text classification, where linear models provide efficient training and fast, accurate prediction
Model Complexity
- n_estimators: no. of trees (default is 10); should be larger for larger datasets (see the usage sketch at the end of this section)
- averaging over more trees reduces overfitting, but more trees also increase the computational cost
- max_features: no. of features in the random subset considered at each split
- learning is sensitive to max_features; it has a strong effect on performance
- max_features = 1 → trees end up very different and with many levels, since a split cannot pick the most informative feature
- max_features ≈ no. of features → similar trees with fewer levels
- max_depth: controls the depth of each tree (default: None, i.e., nodes are split until all leaves are pure or contain fewer than min_samples_split = 2 samples)
- n_jobs: How many cores to use in parallel during training
- with four cores, training will be roughly four times as fast as with one
- -1: it will use all the cores on your system
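Putting these parameters together, a usage sketch with scikit-learn's RandomForestClassifier (the synthetic dataset and parameter values are only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# illustrative synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,     # number of trees; larger datasets usually benefit from more
    max_features='sqrt',  # size of the random feature subset tried at each split
    max_depth=None,       # grow each tree until its leaves are pure (the default)
    n_jobs=-1,            # use all available CPU cores
    random_state=0,
)
clf.fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))
```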