Decision Tree Optimization

Decision Tree Optimization Parameters Explained

Here are some of the most commonly adjusted parameters for Decision Trees. Let’s take a deeper look at what they do and how to change their values:

criterion: (default: gini) This parameter allows choosing between two values: gini or entropy. Note that these two values apply to classification decision trees; regression trees use their own criteria (such as squared error).

While these two values usually yield very similar results, there is a performance difference: gini is usually the faster route, since entropy requires computing a logarithm at every evaluation.

Without diving into the mathematics, gini and entropy can be explained as two similar formulas that measure how much information is passed on at a split. Passing on here refers to information flowing from a parent node to its child nodes at every splitting step.

While entropy might have the upper hand in exploratory analysis, gini can be advantageous for reducing misclassifications. Which brings us to the 2nd important parameter: splitter.
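If you want to see the difference in practice, here is a minimal sketch that fits one tree per criterion. The iris dataset and the train/test split are illustrative assumptions, not requirements of the parameter.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Compare the two criteria on the same train/test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    tree.fit(X_train, y_train)
    print(criterion, tree.score(X_test, y_test))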

splitter: (default: best) The splitter parameter defines the split strategy and takes two values: best or random.

According to the scikit-learn documentation, best will choose the best split and random will choose the best random split. But what does this mean?

Best will evaluate the candidate splits and pick the one that provides the most information after the split, while random will base the splitting strategy on randomly drawn split points.

This comes with different consequences: random needs less computation per split and is therefore faster, while best usually produces a more compact, more informative tree. Random might be useful when the analyst is confident that all features provide equal or similar importance to the classification.
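As a rough sketch of this trade-off (again assuming the iris dataset purely for illustration), you can fit one tree per strategy and compare the resulting depths:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same data, two split strategies
best_tree = DecisionTreeClassifier(splitter="best", random_state=0).fit(X, y)
random_tree = DecisionTreeClassifier(splitter="random", random_state=0).fit(X, y)

# Randomly drawn split points usually need a deeper tree to reach pure leaves
print(best_tree.get_depth(), random_tree.get_depth())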

max_depth: (default: None) This parameter signifies the maximum depth of the decision tree. 

When left at the default (None), nodes are expanded until all leaves are pure or until every leaf contains fewer than min_samples_split samples.

This parameter can also take an integer value.

Examples:

from sklearn.tree import DecisionTreeClassifier as DTC

DT = DTC(criterion="entropy")
DT = DTC(max_features="sqrt")
DT = DTC(splitter="random")
DT = DTC(criterion="gini", max_depth=5)
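Any of these estimators can then be fitted and scored like any other scikit-learn model. Here is a minimal sketch; the iris dataset is only an illustrative assumption, not part of the parameters discussed above.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier as DTC

# Fit one of the configurations above on a small example dataset
X, y = load_iris(return_X_y=True)
DT = DTC(criterion="gini", max_depth=5)
DT.fit(X, y)
print(DT.score(X, y))  # accuracy on the training data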

More parameters

More Decision Tree Optimization Parameters for fine-tuning

Beyond the basics, these parameters can be used for further optimization, to avoid overfitting, and to make adjustments based on impurity:

  • min_samples_split
  • min_samples_leaf
  • random_state
  • max_leaf_nodes
  • max_features

min_samples_split

(default: 2)

- Concerning internal node splits, this parameter signifies the minimum number of samples required to split a node.

- It can take an integer value (an absolute count) or a float value (a fraction of the training samples); see the sketch below.
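A minimal sketch of the two ways to set it; the values 10 and 0.05 are arbitrary illustrations:

from sklearn.tree import DecisionTreeClassifier

# int: an absolute number of samples required to split an internal node
DT_count = DecisionTreeClassifier(min_samples_split=10)

# float: a fraction of the training samples required to split an internal node
DT_fraction = DecisionTreeClassifier(min_samples_split=0.05)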

max_leaf_nodes

(default: None)

None: the number of leaf nodes will be unlimited.

int: the tree will be grown in best-first fashion with at most this many leaf nodes (see the sketch below).
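A short sketch of the effect, again using the iris dataset only as an illustrative assumption:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cap the tree at four leaves, no matter how deep it would otherwise grow
DT = DecisionTreeClassifier(max_leaf_nodes=4).fit(X, y)
print(DT.get_n_leaves())  # never more than 4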

min_samples_leaf

(default: 1)

Concerning leaf nodes, this is the minimum number of samples required at each leaf node.

int: the minimum number of samples per leaf as an absolute count.

float: the minimum number of samples per leaf as a fraction of the training samples (see the sketch below).
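A minimal sketch of both forms; the values 5 and 0.02 are arbitrary illustrations:

from sklearn.tree import DecisionTreeClassifier

# int: every leaf must contain at least 5 samples
DT_count = DecisionTreeClassifier(min_samples_leaf=5)

# float: every leaf must contain at least 2% of the training samples
DT_fraction = DecisionTreeClassifier(min_samples_leaf=0.02)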

max_features

(default: None)

Concerning splits, this signifies the number of features to be considered when looking for the best split. The possible values are listed below and illustrated in the sketch that follows them.

None: max_features will be n_features.

int or float: max_features will be an integer number of features or a fraction of n_features.

"auto" and "sqrt": max_features will be the square root of n_features - yes, both are the same thing (note that "auto" has been deprecated in newer scikit-learn versions, so prefer "sqrt").

"log2": max_features will be log2 of n_features.
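Here is a sketch of the accepted values side by side; which one is sensible depends on how many features your dataset actually has:

from sklearn.tree import DecisionTreeClassifier

DT_all = DecisionTreeClassifier(max_features=None)     # consider every feature at each split
DT_two = DecisionTreeClassifier(max_features=2)        # consider exactly 2 features
DT_half = DecisionTreeClassifier(max_features=0.5)     # consider half of the features
DT_sqrt = DecisionTreeClassifier(max_features="sqrt")  # consider sqrt(n_features) features
DT_log = DecisionTreeClassifier(max_features="log2")   # consider log2(n_features) features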

random_state

(default: None)

This decision tree parameter controls the seed for the randomness used while building the tree (for example, the random permutation of features evaluated at each split); the sketch below shows the effect of fixing the seed:

None: the global numpy.random generator is used.

int: the value is used as the seed for the random number generator.

RandomState instance: that instance is used directly as the random number generator.
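A short sketch showing why fixing the seed matters when randomness is involved; the iris dataset is assumed only for illustration:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# With the same integer seed, the randomized split choices are repeated exactly,
# so the two trees below are identical
tree_a = DecisionTreeClassifier(splitter="random", random_state=42).fit(X, y)
tree_b = DecisionTreeClassifier(splitter="random", random_state=42).fit(X, y)
print(tree_a.get_depth() == tree_b.get_depth())  # True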

I hope you found this Decision Tree Optimization Tutorial useful. Check out the other Decision Tree resources we have, or take a look at our special Machine Learning guide covering all the different Machine Learning Algorithm Tutorials with Python.

Official Scikit Learn Documentation: sklearn.tree.DecisionTreeClassifier