Random Forest (Step by Step)
I tackle projects by splitting them up. It’s much easier to manage and I usually avoid overwhelming myself this way. Machine Learning is no different.
A random forest project can generally be broken into 8 main tasks: 3 indirect/support tasks and 5 tasks where you deal with the machine learning model directly. Of course everything is related, but this is how I conceptualize a random forest machine learning project in my head:
- Import the relevant Python libraries
- Import the data
- Read / clean / adjust the data (if needed)
- Create a train / test split
- Create the Random Forest model object
- Fit the model
- Predict
- Evaluate the accuracy
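The eight steps above can be sketched end to end in a few lines. This is only a minimal sketch, assuming the iris sample data used later in this tutorial; the function name run_iris_forest and the seed value are my own illustrative choices, not part of scikit-learn.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def run_iris_forest(test_size=0.3, seed=0):
    """Hypothetical helper: run the eight steps once, return test accuracy (0..1)."""
    iris = datasets.load_iris()                  # steps 2-3: import and read the data
    X, y = iris.data, iris.target
    X_tr, X_ts, y_tr, y_ts = train_test_split(   # step 4: train / test split
        X, y, test_size=test_size, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)  # step 5: create
    model.fit(X_tr, y_tr)                        # step 6: fit
    y_pr = model.predict(X_ts)                   # step 7: predict
    return metrics.accuracy_score(y_ts, y_pr)    # step 8: evaluate


accuracy = run_iris_forest()
print(f"Accuracy %: {accuracy * 100:.1f}")
```

Fixing the seed just makes the run repeatable; the rest of the tutorial walks through each step separately.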
Estimated Time: 10 mins
Skill Level: Advanced
Course Provider: HolyPython.com
Content Sections
1 Import Libraries
pandas is useful for constructing data frames, and scikit-learn is a go-to library for simple machine learning operations and for learning and practicing machine learning.
2 Import the Data
We need a nice dataset that’s sensible to analyze with machine learning techniques, particularly random forests in this case. scikit-learn has some cool sample data as well.
3 Read the Data
Reading data is simple, but there can be important points such as dealing with columns, headers and titles, constructing data frames, etc.
4 Split the Data
Even splitting data is made easy with scikit-learn. For this operation we will use the train_test_split function from scikit-learn's sklearn.model_selection module.
5 Create the Model
Machine Learning models can be created with a very simple and straightforward process using scikit-learn. In this case we will create a Random Forest Classifier object from the RandomForestClassifier class of scikit-learn's sklearn.ensemble module.
6 Fit the Model
Machine Learning models are generally fit on training data. This is the part where training of the model takes place, and we will do the same for our random forest model.
7 Predict
Once the model is ready, predictions can be made on the test part of the data. Furthermore, I enjoy predicting foreign values that are not in the initial dataset just to observe the outcomes the model produces. The .predict method is used for predictions.
8 Evaluation
Finally, scikit-learn's metrics module is very useful for testing the accuracy of the model’s predictions. This part could be done manually as well, but the metrics module brings lots of functionality and simplicity to the table.
1- Importing the libraries (pandas and sklearn libraries)
First, the imports:
- pandas is imported for data frames
- train_test_split from sklearn.model_selection makes splitting data into train and test parts very easy
- sklearn.ensemble provides the Random Forest Classifier model itself
- the datasets module of sklearn has great sample datasets that make it easy to experiment with AI & Machine Learning
- metrics is great for evaluating the results we’ll get from the random forest
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn import datasets
from sklearn import metrics
2- Importing the data (iris dataset)
Now we can get the data ready.
I also like to print often to check that everything is on track or to explore what’s inside my data.
iris = datasets.load_iris()
#print(iris.data[:5])
3- Reading the data (scikit-learn datasets and pandas dataframe)
Basically, a pandas DataFrame object is used here to create a data frame where each feature column of the iris data is assigned a short feature name, and the target labels go into a species column.
data = pd.DataFrame({"sl":iris.data[:,0], "sw":iris.data[:,1], "pl":iris.data[:,2], "pw":iris.data[:,3], 'species': iris.target})
#print(data["species"])
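Since printing often is part of the workflow, here is a small sketch of what can be inspected at this point. It rebuilds the same frame so it runs on its own; the species_name column is a hypothetical extra I add for readability, the tutorial itself keeps the numeric labels.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Same frame as in the tutorial: four measurement columns plus the numeric label.
data = pd.DataFrame({
    "sl": iris.data[:, 0],
    "sw": iris.data[:, 1],
    "pl": iris.data[:, 2],
    "pw": iris.data[:, 3],
    "species": iris.target,
})

print(iris.feature_names)               # full column descriptions shipped with the dataset
print(iris.target_names)                # class names behind the labels 0, 1, 2
print(data.shape)                       # (rows, columns)
print(data.head())                      # first five rows
print(data["species"].value_counts())   # number of samples per class

# Hypothetical readability column: map numeric labels to species names.
data["species_name"] = data["species"].map(dict(enumerate(iris.target_names)))
print(data[["species", "species_name"]].drop_duplicates())
```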
4- Splitting the data (train_test_split function)
We got the libraries, we got the data, it’s machine learning time!
Creating a split into training and test variables provides the input we will feed to the Random Forest Classifier model.
- I named the variables X_tr, y_tr for the training input and X_ts, y_ts for the test input. This is up to your taste or circumstances.
- X_tr and X_ts each receive a part of the features.
- y_tr and y_ts each receive the matching part of the outcomes.
- The model then works on X_tr and y_tr and tries to find how they relate to each other.
- Then we test it on X_ts and y_ts to see how successful the model is.
- Once fitting is done (step 6) you have a trained model. You can use it on data other than the test data reserved here, and if it’s accurate it can be very useful for whatever prediction you’re trying to achieve.
X=data[['sl','sw','pl','pw']]
y=data["species"]
X_tr, X_ts, y_tr, y_ts = tts(X,y, test_size=30/100)
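The split above is random, so the rows that land in each part change on every run. A minimal sketch of a repeatable variant, assuming you want fixed results while experimenting: random_state pins the shuffle (42 is an arbitrary seed) and stratify keeps the class proportions equal in both parts. Both are real train_test_split parameters, but using them here is my addition, not part of the original code.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

X_tr, X_ts, y_tr, y_ts = train_test_split(
    X, y,
    test_size=0.3,     # same 30/100 test share as in the tutorial
    random_state=42,   # arbitrary seed so the split is identical on every run
    stratify=y,        # keep the three species equally represented in both parts
)

print(X_tr.shape, X_ts.shape)   # rows and columns in the train and test features
print(np.bincount(y_ts))        # test samples per class
```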
The test_size parameter above is actually very important, especially when your data size is limited. Ideally you want enough data to help the machine learning model learn, and enough held-out data left over to test it on.
A test share of 1/3 or 1/4 is usually a good split ratio (here we’re using 30/100). Going overboard in either direction, say a test share of 1/2 or 70/100, or close to none at all, brings some important risks: too little training data leaves the model less to learn from, and too little test data makes the accuracy figure unreliable.
A model that is tuned too closely to the data at hand can look more successful when predicting the test part of the same data.
But when predictions are made on foreign data, the success rate might fall because the model is overfit, or over-adjusted, to that data.
It’s a fine-tuning skill that comes with experience and practice, so my suggestion is to get your hands dirty and experiment yourself to build a real feel for these adjustments.
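One way to get your hands dirty is to loop over a few test shares and watch how the train size, test size and accuracy move together. This is only a sketch: the list of shares and the seed are arbitrary choices of mine, and on a small, easy dataset like iris the differences may be small.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

results = {}
for test_size in (0.1, 0.25, 0.3, 0.5, 0.7):   # arbitrary shares to compare
    X_tr, X_ts, y_tr, y_ts = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    acc = metrics.accuracy_score(y_ts, model.predict(X_ts))
    results[test_size] = (len(X_tr), len(X_ts), acc)
    print(f"test_size={test_size:.2f} train={len(X_tr):3d} "
          f"test={len(X_ts):3d} accuracy={acc:.3f}")
```

With only a handful of test rows a single wrong prediction moves the accuracy a lot, which is why very small test shares give shaky numbers.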
5- Creating the model (ensemble.RandomForestClassifier)
Now we can create a Random Forest object, which will be put to work on the training data in the next step:
RF = RFC(n_estimators=100)
This part is actually pretty important because most of the optimization takes place here. While creating the Random Forest Classifier model, we can pass many hyperparameters that can have a big impact on the machine learning implementation.
Check out this page to see a tutorial about Random Forest Optimization Parameters.
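To give a flavor of what passing hyperparameters looks like, here is a hedged sketch. All of the parameters below exist on RandomForestClassifier, but the values are arbitrary illustrations of mine, not recommended settings.

```python
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    max_depth=5,            # cap on how deep each tree may grow
    min_samples_leaf=2,     # smallest number of samples allowed in a leaf
    max_features="sqrt",    # features considered at each split
    random_state=0,         # arbitrary seed for repeatable results
    n_jobs=-1,              # use all available CPU cores
)

# get_params() returns the full configuration, including the defaults left untouched.
print(RF.get_params())
```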
6- Fitting the model (Training with features(X) and outcomes (y))
This is where training (fitting) takes place. The training part of the data is introduced to the model we created in the previous step.
If the model and data are suitable for each other, and if any necessary optimization requirements are satisfied, the model will train successfully, which is crucial for the prediction part.
RF.fit(X_tr,y_tr)
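After fitting, the model object carries what it learned, and a couple of its attributes are worth a quick print. The sketch below refits on iris so it runs on its own; the seed is an arbitrary assumption of mine.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=["sl", "sw", "pl", "pw"])
y = iris.target
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.3, random_state=0)

RF = RandomForestClassifier(n_estimators=100, random_state=0)
RF.fit(X_tr, y_tr)

print(len(RF.estimators_))   # number of fitted decision trees
print(RF.classes_)           # class labels the model learned

# Relative importance of each feature across the forest (values sum to 1).
importances = pd.Series(RF.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```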
7- Making predictions (.predict method)
This is the moment of truth. The trained model is tested on the test part of the data. If the results are successful, it might mean that the model is ready for the wild (meaning making predictions on foreign data outside of both the training and test data).
y_pr=RF.predict(X_ts)
print(y_pr)
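The printed predictions are numeric labels. As a small sketch, they can be turned back into species names, and predict_proba shows how confident the forest is in each one. The split and seed below are assumptions of mine so the block runs standalone.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_ts, y_tr, y_ts = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)
RF = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

y_pr = RF.predict(X_ts)            # numeric labels, as in the tutorial
proba = RF.predict_proba(X_ts)     # one probability per class for every test row

print(iris.target_names[y_pr][:5])   # first five predictions as species names
print(np.round(proba[:5], 2))        # class probabilities for the same rows
```

Each row of proba sums to 1, and the predicted label is the class with the highest probability.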
8- Evaluating results (scikit-learn metrics module)
scikit-learn’s metrics module is well worth exploring.
With this tool you can find out how the predictions match up with the actual values. It is the final step in discovering how accurate the Random Forest Classifier is.
print("Accuracy %:",metrics.accuracy_score(y_ts, y_pr)*100)
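Accuracy is a single number; the metrics module can also show where the mistakes are. A minimal sketch using confusion_matrix and classification_report, again with an assumed split and seed so it runs by itself:

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_ts, y_tr, y_ts = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)
RF = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_pr = RF.predict(X_ts)

# Rows are actual classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_ts, y_pr)
print(cm)

# Precision, recall and F1 score for each species.
print(metrics.classification_report(y_ts, y_pr, target_names=iris.target_names))
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total equals the accuracy score.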
Bonus: Predicting foreign data
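# Wrap the values in a DataFrame with the training column names so scikit-learn does not warn about missing feature names
print(RF.predict(pd.DataFrame([[4, 5, 6, 6]], columns=['sl', 'sw', 'pl', 'pw'])))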
You can see the full code in one piece on this page: Random Forest Simple Implementation