Decision Tree (Step by Step)

I’ve created these step-by-step machine learning algorithm implementations in Python for everyone who is new to the field and might be confused by the different steps.

Walking through each step really helps you understand what’s happening during a machine learning implementation.

In this particular tutorial I will break down the different steps of a decision tree implementation with scikit-learn in Python. It’s very similar to most other machine learning implementations in Python, and in this case it is especially similar to random forests.

(You can check out the Random Forest algorithm here to learn a lot about its history and see different examples, visualizations, code samples etc. Random Forests are ensembles built from many Decision Trees and they are very commonly used in professional life for solving real-world problems.)

Decision Trees are still good to know and understand, especially since they are the basis of random forests and have been around for many years.

Check out this page to learn about the curious history of Decision Trees.

Estimated Time

10 mins

Skill Level

Advanced

Content Sections

Steps Summary

Steps Explained w/ code

Course Provider

Provided by HolyPython.com

I’ve split the Decision Tree implementation into 2 categories here:

(Red font marks the actual machine learning work and black font marks the preparation phase.)

Further down the page I’ve also color coded the steps in a different way to group similar steps together.

Import the relevant Python libraries
Import the data
Read / clean / adjust the data (if needed)
Create a train / test split
Create the Decision Tree model object
Fit the model
Predict
Evaluate the accuracy

Let’s read more about each individual step and what’s achieved with each of them:

1 Import Libraries

pandas is useful for constructing data frames, and scikit-learn is a go-to library for learning, practicing and running simple machine learning operations.

2 Import the Data

We need a dataset that’s sensible to analyze with machine learning techniques, particularly a decision tree in this case. Scikit-learn ships with some handy sample datasets as well.

3 Read the Data

Reading data is simple, but there can be important points to handle, such as dealing with columns, headers and titles, or constructing data frames.

4 Split the Data

Even splitting the data is made easy with scikit-learn. For this operation we will use the train_test_split function from the sklearn.model_selection module.

5 Create the Model

Machine learning models can be created in a very simple and straightforward way with scikit-learn. In this case we will create a Decision Tree Classifier object using the DecisionTreeClassifier class from the sklearn.tree module.

6 Fit the Model

Machine learning models are generally fit on training data. This is the part where the training of the model takes place, and we will do the same for our decision tree model.

7 Predict

Once the model is trained, predictions can be made on the test part of the data. Furthermore, I enjoy predicting foreign values that are not in the initial dataset just to observe the outcomes the model produces. The .predict method is used for predictions.

8 Evaluation

Finally, scikit-learn’s metrics module is very useful for testing the accuracy of the model’s predictions. This part could be done manually as well, but the metrics module brings lots of functionality and simplicity to the table.

1- Importing the libraries (pandas and sklearn libraries)

First the import part for libraries:

pandas is imported for data frames
train_test_split from sklearn.model_selection makes splitting data for train and test purposes very easy and proper
sklearn.tree provides the actual model for Decision Tree Classifier
datasets module of sklearn has great datasets making it easy to experiment with AI & Machine Learning
metrics is great for evaluating the results we’ll get from decision trees

###Importing Libraries
import pandas as pd
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.model_selection import train_test_split as tts
from sklearn import metrics

2- Importing the data (iris dataset)

It’s time to find some data to work with. For simplicity, I suggest using the datasets module included in scikit-learn. These datasets are great for practice and everything is already taken care of, so there won’t be complications such as missing values or invalid characters while you’re learning.

One thing I’ve been learning is to keep it simple while I’m learning in fields outside my expertise and then step up gradually to avoid burn-out.

Let’s import the iris dataset:

###Importing Dataset
iris = datasets.load_iris()
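
Before building a data frame, it can help to look at what load_iris() actually returns. The short sketch below only reads attributes that the bundled iris dataset exposes (data, target, feature_names, target_names); it is an illustration, not a required step.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)

# Names of the four measurement columns
print(iris.feature_names)

# Outcomes are stored as integer class labels ...
print(iris.target[:5])

# ... and each integer maps to a species name
print(iris.target_names)
```

The dataset has 150 rows and 4 features, and the integer labels 0, 1 and 2 stand for setosa, versicolor and virginica.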

3- Reading the data (scikitlearn datasets and pandas dataframe)

Now we can get the data ready:

The pandas DataFrame class is used to construct a data frame. Data frames are very useful when working with large datasets with different column titles.

In machine learning you usually have one or more features and one or more outcomes to work with. This means different titles and sometimes different types of data, which is why a DataFrame is a convenient structure to work with.

###Constructing Data Frame
data = pd.DataFrame({
    "sl": iris.data[:, 0],   # sepal length
    "sw": iris.data[:, 1],   # sepal width
    "pl": iris.data[:, 2],   # petal length
    "pw": iris.data[:, 3],   # petal width
    "species": iris.target,  # outcome (class label)
})
# print(data["species"])

You should also print some or all of your data to have a better understanding of what’s going on. I usually check out each feature and outcome briefly and try to establish a logic in my mind regarding the dataset I’m working with.
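
A minimal sketch of that kind of quick inspection, assuming the same column names as above. head, describe and value_counts are standard pandas methods; which ones you use is a matter of taste.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
data = pd.DataFrame({
    "sl": iris.data[:, 0],
    "sw": iris.data[:, 1],
    "pl": iris.data[:, 2],
    "pw": iris.data[:, 3],
    "species": iris.target,
})

# First few rows: a quick sanity check of columns and values
print(data.head())

# Summary statistics (mean, min, max, ...) for every column
print(data.describe())

# How many rows belong to each species label (0, 1, 2)
species_counts = data["species"].value_counts()
print(species_counts)
```

The iris dataset is balanced, so each of the three species labels appears 50 times.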

4- Splitting the data (train_test_split function)

This is another standard Machine Learning step.

We need to split data so that there are:

training feature(s) and outcome(s)
test feature(s) and test outcome(s)

The logic is to train the decision tree machine learning model with the train split and then test the trained model with the test split.

It’s a rather simple step thanks to scikit-learn’s train_test_split function.

I named the variables X_tr, y_tr for training and X_ts, y_ts for testing. Naming is up to your taste or your circumstances.
X_tr, X_ts will each be assigned a part of the features
y_tr, y_ts will each be assigned a part of the outcomes
The split ratio can be set using the test_size parameter. This is an important parameter and something you should experiment with to get a better understanding. A test share of 1/3 or 30% is usually a reasonable choice.
The model then trains on X_tr and y_tr.
Then we will test it on X_ts and y_ts to see how successful the model is.

###Splitting train/test data
X=data[['sl','sw','pl','pw']]
y=data["species"]
X_tr, X_ts, y_tr, y_ts = tts(X,y, test_size=30/100, random_state=None)
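
To see what the split actually produces, you can print the sizes of the resulting parts. The sketch below is a variation on the snippet above, not a replacement: random_state=42 and stratify=y are my own additions (assumptions, not part of the original code) that make the split repeatable and keep the three species in the same proportion on both sides.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split as tts

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=["sl", "sw", "pl", "pw"])
y = pd.Series(iris.target, name="species")

# random_state=42 is an arbitrary seed so the split is the same on every run;
# stratify=y keeps the class proportions equal in the train and test parts.
X_tr, X_ts, y_tr, y_ts = tts(X, y, test_size=0.3, random_state=42, stratify=y)

# Number of rows and columns in each part
print(X_tr.shape, X_ts.shape)

# Class balance inside the test part
print(y_ts.value_counts())
```

With 150 rows and test_size=0.3, 45 rows go to the test part and 105 rows are left for training.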

The split ratio is a trade-off. If you hand a very large share of a small dataset to the test side (a 50/50 split, for example), the model has less data to learn from. If the test side is too small, the accuracy you measure becomes unreliable. A related problem to keep in mind is overfitting: the model adapts way too much to the training data at hand, so it looks very good on that data but struggles when predicting data it hasn’t seen, including data from outside the current dataset in the future.

Don’t worry too much about overfitting in the beginning, but it’s an important phenomenon and something you should get familiar with as you gain more skills and experience.
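
One common way to get a first impression of overfitting is to compare accuracy on the training part with accuracy on the test part. The sketch below does that for a default decision tree; random_state=0 is an arbitrary seed I added for repeatability, and the exact numbers will depend on the split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split as tts
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn import metrics

iris = datasets.load_iris()
X_tr, X_ts, y_tr, y_ts = tts(iris.data, iris.target, test_size=0.3, random_state=0)

# A default tree keeps splitting until the training data is fit very closely
DT = DTC(random_state=0)
DT.fit(X_tr, y_tr)

# Accuracy on data the model has already seen vs. data it has not seen
train_acc = metrics.accuracy_score(y_tr, DT.predict(X_tr))
test_acc = metrics.accuracy_score(y_ts, DT.predict(X_ts))

print("Train acc %:", train_acc * 100)
print("Test acc %:", test_acc * 100)
```

A training accuracy that is much higher than the test accuracy is a typical hint that the model has adapted too closely to the training data. On a small, clean dataset like iris the gap is usually modest.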

5- Creating the model (tree.DecisionTreeClassifier)

Now we can create a Decision Tree Classifier object, which will be put to work on the training data in the next step:

###Creating Decision Tree Classifier Model
DT = DTC()
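
DTC() with no arguments uses scikit-learn’s defaults. If you want to experiment, the constructor also accepts settings such as criterion, max_depth and min_samples_leaf. The values below are purely illustrative assumptions, not recommendations from this tutorial.

```python
from sklearn.tree import DecisionTreeClassifier as DTC

# Default settings, as used in this tutorial
DT_default = DTC()

# Illustrative values only: limit the depth of the tree and require
# at least 5 training samples in every leaf
DT_small = DTC(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=0)

# get_params() returns the settings a model object was created with
print(DT_default.get_params()["max_depth"])
print(DT_small.get_params()["max_depth"])
```

By default max_depth is None, which means the tree is allowed to keep growing, while the second model stops at depth 3.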

6- Fitting the model (Training with features(X) and outcomes (y))

###Training the Model
DT.fit(X_tr,y_tr)
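
After fitting, it can be instructive to look inside the tree. The sketch below uses export_text from sklearn.tree to print the learned rules, plus get_depth, get_n_leaves and feature_importances_. For brevity it trains on the whole iris dataset rather than on a train split, which is an assumption of this example only.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier as DTC, export_text

iris = datasets.load_iris()
feature_names = ["sl", "sw", "pl", "pw"]

# Trained on all rows here only to keep the example short
DT = DTC(random_state=0)
DT.fit(iris.data, iris.target)

# Size of the fitted tree
print("Depth:", DT.get_depth())
print("Leaves:", DT.get_n_leaves())

# Human-readable if/else rules the tree has learned
rules = export_text(DT, feature_names=feature_names)
print(rules)

# Relative contribution of each feature to the splits (sums to 1)
for name, importance in zip(feature_names, DT.feature_importances_):
    print(name, round(importance, 3))
```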

7- Making predictions (.predict method)

###Making Predictions
y_pr=DT.predict(X_ts)
print(y_pr)
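
The predictions are integer class labels, which can be hard to read. A small sketch of turning them back into species names, using the target_names array that comes with the iris dataset (random_state=0 is an arbitrary seed added for repeatability):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split as tts
from sklearn.tree import DecisionTreeClassifier as DTC

iris = datasets.load_iris()
X_tr, X_ts, y_tr, y_ts = tts(iris.data, iris.target, test_size=0.3, random_state=0)

DT = DTC(random_state=0)
DT.fit(X_tr, y_tr)
y_pr = DT.predict(X_ts)

# iris.target_names is indexed by class label, so integer predictions
# can be used directly as indices
predicted_names = iris.target_names[y_pr]
print(predicted_names[:10])
```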

8- Evaluating results (scikitlearn metrics module)

###Evaluating Prediction Accuracy
print("Acc %:",metrics.accuracy_score(y_ts, y_pr)*100)
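
Accuracy is a single number. If you want to see which species get confused with each other, the metrics module also offers confusion_matrix and classification_report. The sketch below is one possible way to use them; random_state=0 is an arbitrary seed added for repeatability.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split as tts
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn import metrics

iris = datasets.load_iris()
X_tr, X_ts, y_tr, y_ts = tts(iris.data, iris.target, test_size=0.3, random_state=0)

DT = DTC(random_state=0)
DT.fit(X_tr, y_tr)
y_pr = DT.predict(X_ts)

labels = [0, 1, 2]

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_ts, y_pr, labels=labels)
print(cm)

# Precision, recall and f1-score for every species
print(metrics.classification_report(
    y_ts, y_pr, labels=labels, target_names=list(iris.target_names)
))

accuracy = metrics.accuracy_score(y_ts, y_pr)
```

The diagonal of the confusion matrix counts the correct predictions, so accuracy equals the sum of the diagonal divided by the number of test rows.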

Bonus: Predicting foreign data

###Making Prediction with Foreign Data
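# Values follow the training column order (sl, sw, pl, pw); named columns avoid a feature-name warning
print(DT.predict(pd.DataFrame([[1,1,0.5,6]], columns=['sl','sw','pl','pw'])))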
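
Besides the predicted label, you can also ask the model how the classes are weighted for a foreign sample with predict_proba. A minimal sketch, assuming the same made-up measurements as above and a model trained on the whole dataset for brevity:

```python
import pandas as pd
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier as DTC

iris = datasets.load_iris()
cols = ["sl", "sw", "pl", "pw"]
X = pd.DataFrame(iris.data, columns=cols)

# Trained on all rows here only to keep the example short
DT = DTC(random_state=0)
DT.fit(X, iris.target)

# Made-up flower, with the same column names the model was trained on
new_flower = pd.DataFrame([[1, 1, 0.5, 6]], columns=cols)

# One row per sample, one column per class in DT.classes_
proba = DT.predict_proba(new_flower)
print(DT.classes_)
print(proba)
```

Keep in mind that a petal width of 6 is far outside anything in the iris data, so the model is extrapolating here and its answer should be treated with caution.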

You can see the full code in one piece on this page: Decision Tree Simple Implementation