Random Forest Optimization & Parameters
- n_estimators
- max_depth
- criterion
- min_samples_split
- max_features
- random_state
- Here are some of the most significant optimization parameters you can adjust and experiment with when working with Random Forests.
- Considering the similarities between the two algorithms, it's no surprise that some of the parameters are identical or very similar to those of decision trees.
- However, Random Forests also have unique parameters of their own, and these can matter because a forest is bigger and more complex than a single tree.
- Random Forests can also be computationally costly, so I will also share some tips you can use to make your random forest more efficient and lightweight!
n_estimators: (default 100) This parameter sets the number of trees in the forest. It is probably the most characteristic optimization parameter of the random forest algorithm.
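For instance, here is a minimal sketch (on a synthetic dataset, so the exact scores are illustrative) of how you might probe the effect of n_estimators:

```python
# Compare test accuracy as the number of trees grows.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (10, 100, 300):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    print(n, "trees:", clf.score(X_test, y_test))
```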
max_depth: (default None) Another important parameter, max_depth sets the maximum allowed depth of the individual decision trees.
It can take an integer value. It can also take None, in which case nodes will continue to be expanded until all leaves are pure or contain fewer samples than min_samples_split.
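A quick sketch (again on synthetic data) that contrasts a capped forest with an uncapped one, using get_depth() on the fitted trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

shallow = RandomForestClassifier(max_depth=3, random_state=0).fit(X, y)
deep = RandomForestClassifier(max_depth=None, random_state=0).fit(X, y)

print(max(t.get_depth() for t in shallow.estimators_))  # at most 3
print(max(t.get_depth() for t in deep.estimators_))     # grown until leaves are pure
```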
min_samples_split: (default 2) This is the minimum number of samples required to split an internal node.
It can take an integer or a float value. An integer is the more straightforward approach; a float is interpreted as a fraction of the training samples.
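Both forms in a short sketch (the concrete values here are just for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

clf_int = RandomForestClassifier(min_samples_split=10)    # at least 10 samples to split a node
clf_frac = RandomForestClassifier(min_samples_split=0.01) # at least 1% of the training samples
```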
criterion: (default gini) Criterion is the same parameter as in the decision tree algorithm.
It allows choosing between two values, gini or entropy, and it's gini by default. Note that this choice applies to classification; regression forests use their own criteria, such as squared error.
While these two values usually yield very similar results, there is a performance difference: gini is usually the faster route, since entropy involves computing logarithms.
Without diving into the mathematics, gini and entropy can be explained as two similar formulas for measuring the information passed from a parent node to its child nodes at every splitting step.
While entropy might have the upper hand in exploratory analysis, gini can be advantageous for reducing misclassification.
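Here's a small sketch comparing the two criteria side by side with cross-validation (synthetic data, so treat the scores as illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)

for crit in ("gini", "entropy"):
    clf = RandomForestClassifier(criterion=crit, random_state=0)
    print(crit, cross_val_score(clf, X, y, cv=5).mean())
```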
max_features: (default auto; note that newer scikit-learn versions have deprecated "auto" in favor of "sqrt") This parameter concerns the best-split scenario. max_features defines how many features should be considered when looking for the best split. It can take these values: None, "auto", "sqrt", "log2", int or float. Each form is shown in the sketch after this list.
The best split will be searched over max features of:
- Total number of features, if None
- Exactly that number of features, if int
- Square root of the total features, if sqrt or auto (yes, they're the same)
- A fraction of the features, if float
- log2 of the features, if log2
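Here is each form as a constructor call (the concrete values are arbitrary examples):

```python
# All accepted forms of max_features at a glance.
from sklearn.ensemble import RandomForestClassifier

clf_all  = RandomForestClassifier(max_features=None)    # consider every feature
clf_five = RandomForestClassifier(max_features=5)       # exactly 5 features per split
clf_sqrt = RandomForestClassifier(max_features="sqrt")  # square root of total features
clf_log  = RandomForestClassifier(max_features="log2")  # log2 of total features
clf_frac = RandomForestClassifier(max_features=0.3)     # 30% of the features
```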
random_state: (default None) This parameter controls the randomness used while building the forest (the bootstrapping of samples and the sampling of features at each split); fixing it makes results reproducible.
- None: the random number generator is the global RandomState instance of numpy's random module, numpy.random
- int: the value is used as the seed by the random number generator
- RandomState instance: random_state itself is used as the random number generator
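A tiny sketch showing that a fixed seed yields identical forests across runs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

a = RandomForestClassifier(random_state=42).fit(X, y)
b = RandomForestClassifier(random_state=42).fit(X, y)
print((a.predict(X) == b.predict(X)).all())  # True: the two forests agree everywhere
```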
Examples:
from sklearn.ensemble import RandomForestClassifier as RFC

clf = RFC(n_estimators=200)     # 200 trees in the forest
clf = RFC(max_depth=5)          # cap tree depth at 5
clf = RFC(min_samples_split=3)  # require 3 samples to split a node
clf = RFC(n_jobs=-1)            # use all processors
clf = RFC(warm_start=True)      # reuse previous trees when refitting
clf = RFC(verbose=2)            # print progress while fitting
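These options can of course be combined in a single estimator. A minimal end-to-end sketch on synthetic data (the parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, max_depth=5,
                             min_samples_split=3, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```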
More parameters
Some more Random Forest optimization parameters for fine-tuning
These parameters can be used for further optimization, to avoid inefficiency and to make adjustments based on how the data is handled or how the forest is constructed:
bootstrap: (default True)
- True: trees are built with bootstrap samples (drawn with replacement). This is also what the out-of-bag example below relies on.
- False: trees are built with the whole dataset.
verbose: (default 0)
Signifies how much information is printed while building the trees (verbosity).
- 0: least information
- 1: more information
- 2: even more information
oob_score: (default False)
Stands for out-of-bag score.
- True: generalization accuracy is estimated on the out-of-bag samples, i.e. the training samples each tree never saw because of bootstrapping (requires bootstrap=True).
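Here's a sketch of estimating accuracy from the out-of-bag samples, which saves you a separate validation split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, random_state=0)

clf = RandomForestClassifier(oob_score=True, random_state=0).fit(X, y)
print(clf.oob_score_)  # accuracy estimated on the out-of-bag samples
```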
warm_start: (default False)
- False: fits a whole new forest on every call.
- True: adds estimators and fits by reusing the solution of the previous call.
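In practice, warm_start lets you grow a forest incrementally. A sketch (the tree counts are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

clf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
clf.fit(X, y)                # fits the first 50 trees

clf.n_estimators = 100
clf.fit(X, y)                # fits only 50 new trees, keeping the old ones
print(len(clf.estimators_))  # 100
```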
n_jobs: (default None)
Signifies the number of jobs to be run in parallel.
- None: 1 job will be run.
- int: that many jobs will be run in parallel.
- -1: all processors will be used for the task.
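Since each tree is independent, fitting parallelizes well. A rough timing sketch (the speedup depends entirely on your machine):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

for jobs in (1, -1):
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=300, n_jobs=jobs, random_state=0).fit(X, y)
    print("n_jobs =", jobs, "->", round(time.perf_counter() - start, 2), "seconds")
```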
class_weight: (default None)
Assigns class weights.
- None: all classes have the same weight of 1.
- "balanced": class weights will be set automatically, inversely proportional to the class frequencies in y.
- dict or list of dicts: assigns custom class-weight values based on the mapping provided (a list of dicts for multi-output problems).
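A sketch of both the automatic and the manual form on a deliberately imbalanced synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Roughly 90% of samples in class 0, 10% in class 1 (illustrative imbalance).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

auto = RandomForestClassifier(class_weight="balanced").fit(X, y)
manual = RandomForestClassifier(class_weight={0: 1, 1: 9}).fit(X, y)
```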
I hope you found this Random Forest Optimization Tutorial useful. Check out more useful resources about Random Forests we have, or take a look at our special Machine Learning guide: all the different Machine Learning Algorithm Tutorials with Python.
Official Scikit Learn Documentation: sklearn.ensemble.RandomForestClassifier