n_estimators: (default 100) This parameter sets the number of trees in the forest. It is probably the most characteristic tuning parameter of a random forest algorithm: more trees generally give more stable predictions, at the cost of longer training and prediction times.
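As a minimal sketch of what n_estimators controls, the snippet below fits a small forest on synthetic data and counts the fitted trees. The dataset from make_classification and all variable names are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# After fitting, estimators_ holds one decision tree per estimator
n_trees = len(forest.estimators_)
print(n_trees)
```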
max_depth: (default None) Another important parameter, max_depth sets the maximum allowed depth of the individual decision trees.
It can take an integer value. It can also take None, in which case nodes continue to be expanded until all leaves are pure or contain fewer samples than min_samples_split.
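To see the depth limit in action, one option is to inspect the depth of each fitted tree. This is only a sketch on synthetic data; the dataset and variable names are assumptions, not part of the original tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

shallow = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0).fit(X, y)

# get_depth() reports the depth of each individual fitted tree
depths = [tree.get_depth() for tree in shallow.estimators_]
deepest = max(depths)
print(deepest)  # never exceeds the max_depth that was set
```

Repeating the same check with max_depth=None typically shows much deeper trees, since growth then stops only at pure or too-small nodes.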
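min_samples_split: (default 2) This is the minimum number of samples required to split an internal node.
It can take an integer or float value, the integer being the more straightforward approach. A float is interpreted as a fraction of the training samples, so the minimum becomes ceil(min_samples_split * n_samples).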
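The float form can be less obvious than the integer form, so here is a small sketch of how a fraction translates into a sample count. The data is synthetic and the names fraction, effective_min and smallest_split_node are hypothetical; the check reads the fitted trees' tree_ arrays, where a left child index of -1 marks a leaf.

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

fraction = 0.125
# A float is treated as a fraction of the training set size (at least 2)
effective_min = max(2, math.ceil(fraction * X.shape[0]))

forest = RandomForestClassifier(
    n_estimators=10, min_samples_split=fraction, random_state=0
).fit(X, y)

# Smallest node that was actually split, across all trees
smallest_split_node = min(
    int(n)
    for tree in forest.estimators_
    for n, left in zip(tree.tree_.n_node_samples, tree.tree_.children_left)
    if left != -1  # -1 marks a leaf, so this keeps only internal nodes
)
print(effective_min, smallest_split_node)
```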
criterion: (default “gini”) Criterion works the same way as in the decision tree algorithm.
It allows choosing between two values, “gini” or “entropy”, and it’s “gini” by default. Both are classification criteria; regression forests (RandomForestRegressor) use different criteria, such as squared error.
While these two values usually yield very similar results, there is a performance difference. Gini is usually the faster route since entropy requires computing logarithms.
Without diving into the mathematics, gini and entropy can be explained as two similar formulas that measure how mixed the classes in a node are. At every splitting step, the split chosen is the one that most reduces this impurity when going from the parent node to its child nodes.
While entropy might have the upper hand in exploratory analysis, gini can be advantageous for reducing misclassification.
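For readers who do want a glimpse of the formulas, the two impurity measures can be sketched in a few lines of plain Python. The function names are hypothetical and the base-2 logarithm for entropy is an assumption of this sketch; the input is the list of class proportions in a node.

```python
import math


def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Entropy in bits: -sum(p * log2(p)), skipping empty classes."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


mixed = [0.5, 0.5]  # a node split evenly between two classes
pure = [1.0, 0.0]   # a node containing a single class

print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are highest for the evenly mixed node and zero for the pure node, which is why they tend to rank candidate splits similarly.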
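max_features: (default “sqrt” in current scikit-learn releases; older releases used “auto”, which meant the same thing for classifiers) This parameter concerns the best split scenario. max_features defines how many features should be considered when looking for the best split. It can take these values: None, “sqrt”, “log2”, int or float.
Best split will be considered with max features of:
None: n_features
“sqrt”: sqrt(n_features)
“log2”: log2(n_features)
int: the given number of features
float: that fraction of n_features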
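One way to check how each max_features setting resolves to a concrete number is to read the max_features_ attribute of a fitted tree. The sketch below assumes a synthetic dataset with 16 features; the dictionary name resolved is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, used only for illustration
X, y = make_classification(n_samples=200, n_features=16, random_state=0)

resolved = {}
for setting in ("sqrt", "log2", None, 5, 0.5):
    forest = RandomForestClassifier(
        n_estimators=5, max_features=setting, random_state=0
    ).fit(X, y)
    # Each fitted tree stores the number of features it considers per split
    resolved[setting] = forest.estimators_[0].max_features_

for setting, count in resolved.items():
    print(setting, count)
```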
random_state: (default: None) This parameter controls the randomness of the forest: the bootstrap sampling of the rows used to build each tree and the sampling of the features considered at each split.
None: the global random state of numpy’s random module (numpy.random) is used, so results can differ between runs
int: the integer is used as the seed of the random number generator, which makes results reproducible
RandomState instance: the given instance is used as the random number generator
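A quick way to see the effect of an integer random_state is to fit two forests with the same seed and compare their outputs. This is a sketch on synthetic data; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Two separate forests built with the same integer seed
first = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
second = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# With identical seeds and data, the predicted probabilities match
reproducible = np.allclose(first.predict_proba(X), second.predict_proba(X))
print(reproducible)
```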
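from sklearn.ensemble import RandomForestClassifier as RFC  # RFC is assumed to be this alias
GC = RFC(n_estimators=200)  # 200 trees instead of the default 100
GC = RFC(max_depth=5)  # limit every tree to a depth of 5
GC = RFC(min_samples_split=3)  # a node needs at least 3 samples to be split
GC = RFC(n_jobs=-1)  # use all available processors
GC = RFC(warm_start=True)  # reuse previously fitted trees on the next fit
GC = RFC(verbose=2)  # print detailed progress while building trees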
In addition, the following parameters can be used for further optimization, to avoid inefficiency and to adjust how the data is handled or how the forest is constructed:
bootstrap: (default: True)
True: Trees are built with bootstrap samples.
False: Trees are built with the whole dataset.
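To make the term "bootstrap sample" concrete, the sketch below draws one such sample with NumPy. It illustrates the idea only and is not scikit-learn's internal code; the seed and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for the sketch
n_samples = 10_000

# A bootstrap sample: n_samples row indices drawn WITH replacement
bootstrap_indices = rng.integers(0, n_samples, size=n_samples)

# Some rows are drawn several times, others not at all
unique_fraction = np.unique(bootstrap_indices).size / n_samples
print(round(unique_fraction, 3))
```

On average roughly 63% of the rows end up in a given bootstrap sample, so each tree sees a slightly different view of the data. The rows a tree did not see are its out-of-bag samples.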
verbose: (default: 0)
Controls how much information is printed while building trees (verbosity).
0: Least information
1: More information
2: Even more information
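oob_score: (default: False)
Stands for out-of-bag score. It requires bootstrap=True.
False: no out-of-bag estimate is computed
True: generalization accuracy is estimated with out-of-bag samples, i.e. the samples each tree did not see during training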
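As a minimal sketch, enabling oob_score makes the fitted forest expose an oob_score_ attribute holding the out-of-bag accuracy estimate. The synthetic dataset and variable names below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# oob_score=True relies on the default bootstrap=True
forest = RandomForestClassifier(
    n_estimators=100, oob_score=True, random_state=0
).fit(X, y)

# Accuracy estimated only from samples each tree did not train on
oob_accuracy = forest.oob_score_
print(round(oob_accuracy, 3))
```

This gives a rough generalization estimate without setting aside a separate validation set.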
warm_start: (default: False)
False: Fits a whole new forest
True: Reuses the solution of the previous call to fit and adds more estimators to the ensemble
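The warm_start behaviour can be sketched as follows: fit a small forest, raise n_estimators, and fit again. The dataset and names such as first_batch are hypothetical and serve only to illustrate the idea.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
forest.fit(X, y)
first_batch = list(forest.estimators_)  # remember the first 10 trees

# Ask for more trees and fit again: only the 5 extra trees are trained
forest.set_params(n_estimators=15)
forest.fit(X, y)

total_trees = len(forest.estimators_)
# The original trees are kept as the very same objects
kept = all(old is new for old, new in zip(first_batch, forest.estimators_))
print(total_trees, kept)
```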
n_jobs: (default: None)
Sets the number of jobs to be run in parallel.
None: 1 job is run (no parallelism)
int: the given number of jobs is run in parallel
-1: All processors will be used for the task.
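Parallelism changes how fast the forest is built, not what is built. As a sketch under that assumption, the snippet below fits the same forest serially and with all processors, then compares the predicted probabilities; the data and names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

serial = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=0).fit(X, y)
parallel = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)

# Same seed and data: the two forests should agree on their predictions
same_output = np.allclose(serial.predict_proba(X), parallel.predict_proba(X))
print(same_output)
```

The speed-up from n_jobs=-1 is most noticeable with many trees or large datasets; on tiny examples like this one the overhead can outweigh the gain.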
class_weight: (default: None)
Assigns weights to classes.
None: All classes have the same weight of 1.
“balanced”: Class weights are set inversely proportional to the class frequencies in y, as n_samples / (n_classes * np.bincount(y)).
dict or list of dicts: Assigns custom class weight values based on the mapping provided (list of dicts for multi-output problems).
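To illustrate what “balanced” computes, the sketch below works out the weights by hand for a made-up, imbalanced label vector and compares them with scikit-learn's compute_class_weight helper. The 90/10 label split is an assumption chosen for the example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Manual version: n_samples / (n_classes * count of each class)
manual_weights = len(y) / (len(classes) * np.bincount(y))

# The same computation done by scikit-learn's helper
helper_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(manual_weights)
print(helper_weights)
```

The rare class receives the larger weight, so mistakes on it count more during training.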
I hope you found this Random Forest Optimization Tutorial useful. Check out the other resources we have about Random Forests, or take a look at our special Machine Learning guide: all the different Machine Learning Algorithm Tutorials with Python.
Official Scikit Learn Documentation: sklearn.ensemble.RandomForestClassifier