
Naive Bayes Classifier from scratch on Kaggle dataset

Vinay Anant

Overview

We are using a movie review text dataset from Kaggle. Our goal is to predict the sentiment of each review.


Naive Bayes Classifier

The Naive Bayes classifier is a simple classifier that aids in building machine learning models capable of making predictions. It is based on Bayes' theorem:



P(A|B) = P(B|A) · P(A) / P(B)
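For sentiment classification, A is the class (Positive or Negative) and B is the review. Under the naive assumption that the words w1, …, wn of a review are independent given the class, the rule becomes:

P(Positive | w1, …, wn) ∝ P(Positive) · P(w1 | Positive) · … · P(wn | Positive)

and we predict whichever class gives the larger score.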



Import dataset

Here, I have downloaded the dataset and uploaded it to Google Drive. After this, I import it into the Jupyter notebook by providing the URL of the file.



Import the dataset, shuffle it, and name the columns 'Sentences' and 'Score'.
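A minimal sketch of this step, assuming the file is tab-separated and using a placeholder Google Drive URL (FILE_ID is not from the original post; replace it with your own file's id):

import pandas as pd

# Placeholder shareable-link URL; FILE_ID is a stand-in, not the original file
url = 'https://drive.google.com/uc?id=FILE_ID'

# Read the file and name the columns 'Sentences' and 'Score'
df = pd.read_csv(url, sep='\t', header=None, names=['Sentences', 'Score'])

# Shuffle the rows with a fixed seed so the split below is reproducible
df = df.sample(frac=1, random_state=42).reset_index(drop=True)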



Split dataset

Split the dataset into train, development, and test sets using the pandas DataFrame.

The proportions depend on the requirement; here we use 80% for training, 10% for development, and 10% for testing. These values can be changed via the parameters (train_size = 0.8, development_size = 0.1, test_size = 0.1), as in the sketch below.
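One way to implement this split with plain pandas slicing (a sketch; df is the shuffled frame from the previous step):

train_size, development_size, test_size = 0.8, 0.1, 0.1

n = len(df)
train_end = int(n * train_size)
dev_end = train_end + int(n * development_size)

# Slice the shuffled DataFrame into the three parts
train_df = df.iloc[:train_end]
dev_df = df.iloc[train_end:dev_end]
test_df = df.iloc[dev_end:]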

Build vocabulary

Build a vocabulary as a dictionary (using for loops and if/else checks) that maps each word to the number of documents it occurs in, keeping separate counts for positive and negative reviews.
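A sketch of that counting step, assuming 'Score' is 1 for positive reviews and 0 for negative ones:

# For each word, count how many positive / negative documents contain it
positive_counts = {}
negative_counts = {}

for sentence, score in zip(train_df['Sentences'], train_df['Score']):
    # set() counts each word at most once per document
    for word in set(str(sentence).lower().split()):
        if score == 1:
            positive_counts[word] = positive_counts.get(word, 0) + 1
        else:
            negative_counts[word] = negative_counts.get(word, 0) + 1

# The vocabulary is every word seen in either class
vocabulary = list(set(positive_counts) | set(negative_counts))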



Calculate probability

Now, calculate the probability of each word's occurrence and its conditional probability given the sentiment (positive or negative).

Example:


P[“home”] = Number of documents containing “home” / Number of all documents


P[“home” | Positive] = Number of positive documents containing “home” / Number of all positive review documents


P[“home” | Negative] = Number of negative documents containing “home” / Number of all negative review documents
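These document-frequency formulas translate directly into code (a sketch built on the counts from the previous step):

num_docs = len(train_df)
num_pos = (train_df['Score'] == 1).sum()
num_neg = (train_df['Score'] == 0).sum()

def p_word(word):
    # Fraction of all documents containing the word
    return (positive_counts.get(word, 0) + negative_counts.get(word, 0)) / num_docs

def p_word_given_positive(word):
    # Fraction of positive documents containing the word
    return positive_counts.get(word, 0) / num_pos

def p_word_given_negative(word):
    # Fraction of negative documents containing the word
    return negative_counts.get(word, 0) / num_neg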




Calculate accuracy using development dataset

Compute the accuracy on the development dataset and perform k-fold cross-validation.

Use the development dataset created above to do 5-fold cross-validation: divide it into 5 equal parts, then use 4 parts for training and 1 part for testing. Repeat for 5 rounds so that every part serves as the test part exactly once, and calculate the accuracy after each round.
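A sketch of that loop, assuming a hypothetical helper train_and_score(train_part, test_part) that rebuilds the counts on train_part and returns the accuracy on test_part:

import numpy as np
import pandas as pd

k = 5
folds = np.array_split(dev_df, k)  # five roughly equal parts

accuracies = []
for i in range(k):
    test_part = folds[i]                               # 1 part for testing
    train_part = pd.concat(folds[:i] + folds[i + 1:])  # remaining 4 parts for training
    accuracies.append(train_and_score(train_part, test_part))  # hypothetical helper

print('Per-fold accuracies:', accuracies)
print('Mean accuracy:', sum(accuracies) / k)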





Smoothing

Smoothing is an approach that deals with the issue of zero probability: a word that never appears in a class would otherwise force the whole product of probabilities to zero. We have used additive (Laplace) smoothing:

P(w | c) = (count(w, c) + α) / (count(c) + α · |V|)

where count(w, c) is the number of class-c documents containing w, count(c) is the number of class-c documents, α is the smoothing parameter, and |V| is the vocabulary size.



Here the alpha value has been chosen as 1 for the first evaluation and 100 for the second evaluation.
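As a sketch, the smoothed conditional probability from the formula above looks like this in code (alpha = 1 and alpha = 100 for the two evaluations):

def p_word_given_class(word, counts, num_class_docs, alpha=1):
    # Additive (Laplace) smoothing: unseen words get a small non-zero probability
    return (counts.get(word, 0) + alpha) / (num_class_docs + alpha * len(vocabulary))

# Example: smoothed P("home" | Positive) with the two alpha settings
p1 = p_word_given_class('home', positive_counts, num_pos, alpha=1)
p2 = p_word_given_class('home', positive_counts, num_pos, alpha=100)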




Predict top 10 positive and negative words

Here we find the top 10 words for the positive and negative classes.
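One way to sketch this (the exact ranking used in the notebook is not stated) is to sort the vocabulary by each word's smoothed conditional probability in each class:

# Words with the highest P(word | class) in each class
top_positive = sorted(vocabulary,
                      key=lambda w: p_word_given_class(w, positive_counts, num_pos),
                      reverse=True)[:10]
top_negative = sorted(vocabulary,
                      key=lambda w: p_word_given_class(w, negative_counts, num_neg),
                      reverse=True)[:10]

print('Top 10 positive words:', top_positive)
print('Top 10 negative words:', top_negative)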



Final accuracy

Finally, the accuracy has been measured on the test dataset.



The accuracy on the test set came out to 66%.


My contribution

  • Implemented and evaluated smoothing with multiple smoothing parameter (alpha) values

  • Performed k-fold cross-validation properly on the development dataset, where the training and test data were updated with every fold

  • Plotted different graphs to display various accuracies


This notebook can be found at the link.


References

Please note that the code used in the notebook has been understood, referenced, and reused from the sources below, modified as needed.



