I’ve created these step-by-step machine learning algorithm implementations in Python for everyone who is new to the field and might be confused by the different steps.
Naive Bayes is a very old statistical model with mathematical foundations.
It was formulated by a church minister who was intrigued by God, probability and the effects of chance in life.
You can read about the curious history of Thomas Bayes’ discovery of Bayes’ Theorem, which paved the way for probabilistic statistics.
Despite its limitations and simplicity, Naive Bayes has its advantages. It can make predictions incredibly fast and accurately while producing the probabilistic outputs that are sometimes necessary.
In this tutorial you can find out a little more about the major steps of the simplest and most straightforward Naive Bayes implementation which can be your stepping stone for more sophisticated machine learning applications.
10 mins
Advanced
Provided by HolyPython.com
I’ve split up the Naive Bayes Classifier implementation into 2 different categories here:
(Red signifies the actual machine learning work and black signifies the preparation phase)
Down the page I’ve also color-coded the steps in a different way to group similar steps together.
pandas is useful for constructing data frames, and scikit-learn is the go-to library for simple machine learning operations and for learning and practicing machine learning.
Reading data is simple, but there can be important points to handle, such as dealing with columns, headers and titles, and constructing data frames.
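For instance, pandas’ read_csv lets you control whether the first row of a file is treated as the header. A small self-contained sketch (the CSV text here is made up purely for illustration):

```python
import io
import pandas as pd

# A tiny made-up CSV with a header row of column titles
csv_text = "sl,sw,pl,pw,species\n5.1,3.5,1.4,0.2,0\n4.9,3.0,1.4,0.2,0\n"

# header=0 (the default) tells pandas the first row holds the column titles
df = pd.read_csv(io.StringIO(csv_text), header=0)

print(df.columns.tolist())  # the column titles become the DataFrame header
print(df.shape)             # 2 rows, 5 columns
```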
Machine learning models can be created with a very simple and straightforward process using scikit-learn. In this case we will create an object from the GaussianNB class of the sklearn.naive_bayes module.
Once the model is ready, predictions can be made on the test part of the data. Furthermore, I enjoy predicting foreign values that are not in the initial dataset just to observe the outcomes the model creates. The .predict method is used for predictions.
We need a sensible dataset to analyze with machine learning techniques, in this case the Gaussian Naive Bayes Classifier. Scikit-learn includes some nice sample datasets as well.
Even splitting data is made easy with scikit-learn; for this operation we will use the train_test_split function from the sklearn.model_selection module.
Machine learning models are generally fit with training data. This is the part where the training of the model takes place, and we will do the same for our Naive Bayes model.
Finally, the metrics module of the scikit-learn library is very useful for testing the accuracy of the model’s predictions. This part could be done manually as well, but the metrics module brings lots of functionality and simplicity to the table.
First the import part for libraries:
###Importing Libraries
import pandas as pd

from sklearn import datasets
from sklearn import metrics
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split as tts
It’s time to find some data to work with. For simplicity I suggest using the datasets module pre-included in scikit-learn. Its datasets are great for practice and everything is already taken care of, so there won’t be complications such as missing values or invalid characters while you’re learning.
Let’s import the iris dataset, it’s simple and readily available:
###Importing Dataset
iris = datasets.load_iris()
Now we can get the data ready:
The pandas DataFrame class is used to construct a data frame. Data frames are very useful when working with large datasets with different titles.
In the case of machine learning algorithms, you usually have one or more features and one or more outcomes to work with; this means different titles and sometimes different types of data. That’s why DataFrame is the perfect structure to work with.
###Constructing Data Frame
data = pd.DataFrame({"sl": iris.data[:, 0],
                     "sw": iris.data[:, 1],
                     "pl": iris.data[:, 2],
                     "pw": iris.data[:, 3],
                     "species": iris.target})
# print(data["species"])
Here is another standard Machine Learning step.
We need to split the data so that there are: a training portion for the model to learn from, and a test portion for evaluating the model on data it has never seen.
It’s a rather simple step thanks to scikit-learn’s train_test_split function.
###Splitting train/test data
X = data[['sl','sw','pl','pw']]
y = data["species"]
X_tr, X_ts, y_tr, y_ts = tts(X, y, test_size=30/100, random_state=None)
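As a quick sanity check, you can print the sizes of the resulting pieces; with iris’s 150 rows and a 30% test size, train_test_split returns 105 training and 45 test rows. A self-contained sketch:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split as tts

iris = datasets.load_iris()
data = pd.DataFrame({"sl": iris.data[:, 0], "sw": iris.data[:, 1],
                     "pl": iris.data[:, 2], "pw": iris.data[:, 3],
                     "species": iris.target})
X = data[["sl", "sw", "pl", "pw"]]
y = data["species"]
X_tr, X_ts, y_tr, y_ts = tts(X, y, test_size=30/100, random_state=None)

# 150 samples split roughly 70/30: 105 for training, 45 for testing
print(len(X_tr), len(X_ts))
```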
An advantage of Naive Bayes is that, thanks to its simplicity, one usually doesn’t have to worry much about overfitting.
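One informal way to see this is to compare training and test accuracy: when the two scores stay close, the model isn’t memorizing the training data. A quick sketch (random_state is fixed here only so the numbers are reproducible):

```python
from sklearn import datasets, metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split as tts

iris = datasets.load_iris()
X_tr, X_ts, y_tr, y_ts = tts(iris.data, iris.target,
                             test_size=0.3, random_state=0)

model = GaussianNB().fit(X_tr, y_tr)

# Accuracy on the data the model saw vs. the data it never saw;
# for Naive Bayes the two are typically close
train_acc = metrics.accuracy_score(y_tr, model.predict(X_tr))
test_acc = metrics.accuracy_score(y_ts, model.predict(X_ts))
print("train:", train_acc, "test:", test_acc)
```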
Now, we can create a Naive Bayes Classifier object and put machine learning to work using the training data:
There are also a number of other Naive Bayes models (e.g. MultinomialNB) that can be valuable in different situations.
Also, if there are going to be any optimizations, this is the right moment to pass hyperparameters to the model during initialization.
###Creating Naive Bayes Classifier Model
GNB = GaussianNB(var_smoothing=2e-9)
The model can be trained using the .fit method on the object we’ve just created.
###Training the Model
GNB.fit(X_tr, y_tr)
Now it’s time to make predictions with the trained model.
###Making Predictions
y_pr = GNB.predict(X_ts)
print(y_pr)
Evaluating the model is quite important for validation. If the results aren’t promising, you might need to tweak the settings or find a more suitable model. The metrics module of the sklearn library is very useful in this sense.
###Evaluating Prediction Accuracy
print("Acc %:",metrics.accuracy_score(y_ts, y_pr)*100)
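Beyond a single accuracy number, the metrics module can also produce a confusion matrix and a per-class precision/recall report; a short self-contained sketch:

```python
from sklearn import datasets, metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split as tts

iris = datasets.load_iris()
X_tr, X_ts, y_tr, y_ts = tts(iris.data, iris.target,
                             test_size=0.3, random_state=0)
GNB = GaussianNB().fit(X_tr, y_tr)
y_pr = GNB.predict(X_ts)

# Rows are true classes, columns are predicted classes;
# off-diagonal entries show which species get confused
print(metrics.confusion_matrix(y_ts, y_pr))
print(metrics.classification_report(y_ts, y_pr,
                                    target_names=iris.target_names))
```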
###Making Prediction with Foreign Data
print(GNB.predict([[1,1,0.5,6]]))
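Since Naive Bayes is a probabilistic model, you can also ask for class probabilities instead of a hard label via the .predict_proba method; a minimal sketch (the sample values here describe a typical setosa-like flower):

```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
GNB = GaussianNB().fit(iris.data, iris.target)

# Probability of each of the three species for one sample; each row sums to 1
proba = GNB.predict_proba([[5.1, 3.5, 1.4, 0.2]])
print(proba)
```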
You can see the full code in one piece on this page: Naive Bayes Simple Implementation