I tackle projects by splitting them up. It’s much easier to manage and I usually avoid overwhelming myself this way. Machine Learning is no different.
The steps of a random forest project can generally be categorized under 8 main tasks: 3 indirect/support tasks and 5 tasks where you deal with the machine learning model directly. Of course everything is related, but this is how I conceptualize a random forest machine learning project in my head:
10 mins
Advanced
Provided by HolyPython.com
pandas is useful for constructing dataframes, and scikit-learn is the go-to library for simple machine learning operations, as well as for learning and practicing machine learning.
Reading data is simple, but there are important points to watch for, such as dealing with columns, headers, and titles, and constructing data frames.
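For example, pandas can read a CSV while handling the header row in a single call. Here is a minimal sketch; the inline CSV text and the short column names are made up for illustration:

```python
import io
import pandas as pd

# An inline CSV stands in for a real file; the column names are hypothetical.
csv_text = "sl,sw,pl,pw,species\n5.1,3.5,1.4,0.2,0\n4.9,3.0,1.4,0.2,0\n"
df = pd.read_csv(
    io.StringIO(csv_text),  # a real path like "iris.csv" works the same way
    header=0,               # first row holds the column titles
)
print(df.columns.tolist())
print(df.shape)
```

Printing the columns and the shape right after reading is a quick sanity check that the data frame was constructed the way you expected.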
Machine learning models can be created with a very simple and straightforward process using scikit-learn. In this case we will create a Random Forest Classifier object from the RandomForestClassifier class in scikit-learn's ensemble module.
Once the model is ready, predictions can be made on the test part of the data. Furthermore, I enjoy predicting foreign values that are not in the initial dataset, just to observe the outcomes the model produces. The .predict method is used for predictions.
We need a nice dataset that makes sense to analyze with machine learning techniques, particularly random forests in this case. Scikit-learn ships with some cool sample datasets as well.
Even splitting data is made easy with scikit-learn; for this operation we will use the train_test_split function from the sklearn.model_selection module.
Machine learning models are generally fit with training data. This is the part where the training of the model takes place, and we will do the same for our random forest model.
Finally, scikit-learn's metrics module is very useful for testing the accuracy of the model's predictions. This part could be done manually as well, but the metrics module brings a lot of functionality and simplicity to the table.
First, the library imports:
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn import datasets
from sklearn import metrics
Now we can get the data ready:
I also like to print often to check out if everything is on track or to explore what’s inside my data.
Basically, here a pandas DataFrame object is used to create a data frame where each feature of the iris dataset is assigned to a feature name.
iris = datasets.load_iris()
#print(iris.data[:5])
data = pd.DataFrame({"sl":iris.data[:,0], "sw":iris.data[:,1], "pl":iris.data[:,2], "pw":iris.data[:,3], 'species': iris.target})
#print(data["species"])
We've got the libraries, we've got the data; it's machine learning time!
Creating a split of training and test variables provides the input we will feed to the Random Forest Classifier model.
X=data[['sl','sw','pl','pw']]
y=data["species"]
X_tr, X_ts, y_tr, y_ts = tts(X,y, test_size=30/100)
The test_size parameter above is actually very important, especially when your data size is limited. Ideally you want enough data for the machine learning model to learn from, and then you want enough additional data to test the model on.
1/3 or 1/4 is usually a good split ratio (here we're using 30/100). If you go overboard with this ratio, say a 1/2 or 70/100 test split, the model is left with too little data to train on. Going too far in the other direction carries its own risk:
Training on almost all of the data can make the model more successful in predicting the small test part of the same data.
But when predictions are made on foreign data, the success rate might fall because the model is overfit, or over-adjusted, to the data at hand.
It's a fine-tuning skill that comes with experience and practice, so my suggestion is to get your hands dirty and experiment yourself to gain real skills for profound machine learning adjustments.
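To get a feel for this trade-off, here is a minimal sketch that compares a few test sizes on scikit-learn's bundled iris data. The random_state values are fixed only so the comparison is repeatable; the exact accuracy numbers are not meaningful on their own:

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
results = {}

# Try a few split ratios and record the test accuracy for each
for ts in (0.25, 0.30, 0.50):
    X_tr, X_ts, y_tr, y_ts = train_test_split(
        iris.data, iris.target, test_size=ts, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    results[ts] = metrics.accuracy_score(y_ts, model.predict(X_ts))
    print(f"test_size={ts:.2f} -> accuracy={results[ts]:.3f}")
```

On a tiny, clean dataset like iris the differences will be small; on limited or noisy real-world data, the effect of the split ratio is usually much more visible.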
Now we can create a Random Forest object and put machine learning to work using the training data:
RF = RFC(n_estimators=100)
This part is actually pretty important, because most of the optimization takes place here. While creating the Random Forest Classifier model, we can pass many hyperparameters that can have a big impact on the machine learning implementation.
Check out this page to see a tutorial about Random Forest Optimization Parameters.
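As a sketch, here are a few of the commonly tuned RandomForestClassifier hyperparameters; the values below are arbitrary starting points for experimentation, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=5,           # cap tree depth to reduce overfitting
    min_samples_split=4,   # minimum samples required to split a node
    max_features="sqrt",   # features considered at each split
    random_state=42,       # fixed seed for reproducible results
)
print(RF.get_params()["max_depth"])
```

You can always call .get_params() on a model to see every hyperparameter it was created with, including the defaults you didn't set.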
This is where training takes place (fitting). The training part of the data is introduced to the model we created in the previous step.
If the model and data are suitable for each other, and if the necessary optimization requirements are satisfied, the model will train successfully, which is crucial for the prediction part.
RF.fit(X_tr,y_tr)
This is where the moment of truth happens. The trained model is tested on the test part of the data. If the results are successful, it might mean the model is ready for the wild (meaning making predictions on foreign data outside of both the training and test data).
y_pr=RF.predict(X_ts)
print(y_pr)
Scikit-learn's metrics module is indeed worth exploring.
With this tool you can find out how the predictions match up with the actual values: the final step in discovering the accuracy of the Random Forest Classifier.
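For instance, beyond plain accuracy, the metrics module can produce a confusion matrix and a per-class report. Here is a minimal sketch that re-runs the same iris workflow with a fixed random_state so it is self-contained:

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_ts, y_tr, y_ts = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)
RF = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_pr = RF.predict(X_ts)

# Confusion matrix: rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(y_ts, y_pr)
print(cm)

# Per-class precision, recall, and f1-score
print(metrics.classification_report(y_ts, y_pr, target_names=iris.target_names))
```

The confusion matrix shows not just how often the model was wrong, but which classes it confuses with which, which is often more informative than a single accuracy number.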
print("Accuracy %:",metrics.accuracy_score(y_ts, y_pr)*100)
print(RF.predict([[4,5,6,6]]))
You can see the full one piece code in this page: Random Forest Simple Implementation
We have created a full Machine Learning Tutorial consisting of different machine learning algorithms, their invention history, examples, visualizations, code samples, parameters, and optimization tips. The aim is to make machine learning as understandable, efficient, and simple as possible for the masses.
Also, if you are excited about Machine Learning and Artificial Intelligence but find the programming part challenging, it's really not that complicated, and you should definitely check out our basic Python lessons, Python exercises, Python tutorials, and Python tips.