Overview
We are using a movie review text dataset from Kaggle. Our goal is to predict the sentiment of each review.
Naive Bayes Classifier
The Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem, with the "naive" assumption that features are conditionally independent given the class. It is a common baseline for building machine learning models that make predictions.
P(A|B) = P(B|A) × P(A) / P(B)
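Applied to sentiment classification, and assuming (naively) that the words w_i of a review are conditionally independent given the class, Bayes' theorem gives a score for each class; the review is labeled with whichever class scores higher. In LaTeX notation:

\[
P(\text{Positive} \mid \text{review}) \;\propto\; P(\text{Positive}) \prod_{i} P(w_i \mid \text{Positive})
\]

and likewise for the negative class. The denominator P(review) is the same for both classes, so it can be dropped when comparing them.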
Import dataset
Here, I have downloaded the dataset and uploaded it to Google Drive. After this, I import it into the Jupyter notebook by providing the URL of the file.
Import the dataset, shuffle it, and name the columns 'Sentences' and 'Score'.
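A minimal sketch of this step with pandas. The URL below is a hypothetical placeholder for the shared Google Drive file, and the file is assumed to be tab-separated with no header row:

import pandas as pd

# Hypothetical URL; replace FILE_ID with the actual shared-file id.
url = 'https://drive.google.com/uc?id=FILE_ID'

# Read the raw file; assuming tab-separated values and no header row.
df = pd.read_csv(url, sep='\t', header=None)

# Name the columns and shuffle the rows reproducibly.
df.columns = ['Sentences', 'Score']
df = df.sample(frac=1, random_state=42).reset_index(drop=True)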
Split dataset
Split the dataset into train, development, and test sets using the pandas DataFrame.
The split proportions depend on the requirement; here we use 80% for training, 10% for development, and 10% for testing. These values can be changed via the parameters (train_size = 0.8, development_size = 0.1, test_size = 0.1).
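One way to realize the 80/10/10 split with plain pandas slicing (a sketch; the notebook may use different variable names), relying on the frame already being shuffled:

train_size, development_size, test_size = 0.8, 0.1, 0.1

n = len(df)
train_end = int(n * train_size)
dev_end = train_end + int(n * development_size)

# Slice the already-shuffled frame into the three splits.
train_df = df.iloc[:train_end]
dev_df = df.iloc[train_end:dev_end]
test_df = df.iloc[dev_end:]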
Build vocabulary
Build the vocabulary using dictionaries and several for loops with if/else checks. This produces separate dictionaries for positive and negative reviews, where each key is a word and each value is the number of documents in which that word occurs.
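A sketch of this counting step, assuming 'Score' is 1 for positive reviews and 0 for negative ones:

from collections import defaultdict

positive_counts = defaultdict(int)  # word -> number of positive documents containing it
negative_counts = defaultdict(int)  # word -> number of negative documents containing it

for sentence, score in zip(train_df['Sentences'], train_df['Score']):
    # Count each word at most once per document.
    for word in set(sentence.lower().split()):
        if score == 1:
            positive_counts[word] += 1
        else:
            negative_counts[word] += 1

vocabulary = set(positive_counts) | set(negative_counts)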
Calculate probability
Now, calculate the probability of each word's occurrence, as well as its conditional probability given the sentiment (positive or negative).
Example:
P["home"] = Number of documents containing "home" / Number of all documents
P["home" | Positive] = Number of positive documents containing "home" / Number of all positive review documents
P["home" | Negative] = Number of negative documents containing "home" / Number of all negative review documents
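Translating the example into code, using the document counts built above (a sketch, without smoothing yet):

num_pos = (train_df['Score'] == 1).sum()
num_neg = (train_df['Score'] == 0).sum()
num_docs = len(train_df)

word = 'home'

# P["home"]: fraction of all documents containing the word.
p_word = (positive_counts[word] + negative_counts[word]) / num_docs

# P["home" | Positive] and P["home" | Negative].
p_word_given_pos = positive_counts[word] / num_pos
p_word_given_neg = negative_counts[word] / num_neg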
Calculate accuracy using development dataset
Calculate the accuracy using the development dataset and perform k-fold cross-validation.
Use the development dataset split off above to perform 5-fold cross-validation: divide it into 5 equal parts, then use 4 parts for training and 1 part for testing. Repeat this for 5 rounds so that every part is held out exactly once, calculating the accuracy after each round.
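A sketch of the 5-fold loop. The train(...) and accuracy(...) calls are hypothetical helpers standing in for the training and scoring steps described above:

import numpy as np
import pandas as pd

folds = np.array_split(dev_df, 5)
accuracies = []

for i in range(5):
    # Fold i is held out for testing; the remaining 4 folds form the training set.
    test_fold = folds[i]
    train_folds = pd.concat([folds[j] for j in range(5) if j != i])

    model = train(train_folds)                      # hypothetical training helper
    accuracies.append(accuracy(model, test_fold))   # hypothetical scoring helper

mean_accuracy = sum(accuracies) / len(accuracies)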
Smoothing
Smoothing is an approach that deals with the issue of zero probability: a word that never appears in a class's training documents would otherwise force the whole product of probabilities to zero. We have used additive (Laplace) smoothing:

P(word | class) = (number of class documents containing word + alpha) / (number of class documents + alpha × |V|)

where |V| is the size of the vocabulary.
Here the alpha value has been chosen as 1 for the first evaluation and 100 for the second evaluation.
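A sketch of that formula in code, with alpha as the smoothing parameter:

def smoothed_prob(word, counts, num_docs_in_class, vocabulary, alpha=1):
    # Additive (Laplace) smoothing: unseen words get a small non-zero probability.
    return (counts[word] + alpha) / (num_docs_in_class + alpha * len(vocabulary))

# Evaluated with alpha = 1 first, then alpha = 100.
p_home_pos = smoothed_prob('home', positive_counts, num_pos, vocabulary, alpha=1)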
Predict top 10 positive and negative class
Here we identify the 10 words most strongly associated with the positive class and the 10 most strongly associated with the negative class.
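One way to pull these out (a sketch ranking words by their smoothed conditional probability within each class):

top_positive = sorted(vocabulary,
                      key=lambda w: smoothed_prob(w, positive_counts, num_pos, vocabulary),
                      reverse=True)[:10]
top_negative = sorted(vocabulary,
                      key=lambda w: smoothed_prob(w, negative_counts, num_neg, vocabulary),
                      reverse=True)[:10]

print('Top positive words:', top_positive)
print('Top negative words:', top_negative)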
Final accuracy
Finally, the accuracy has been measured on the test dataset.
The resulting accuracy is 66%.
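A sketch of this final evaluation, assuming a hypothetical predict(...) helper that combines the log prior and the smoothed log likelihoods of each word and returns 1 for positive, 0 for negative:

import math

def predict(sentence):
    # Hypothetical helper: score each class with its log prior plus the
    # log smoothed likelihoods of the words, then pick the higher score.
    log_pos = math.log(num_pos / num_docs)
    log_neg = math.log(num_neg / num_docs)
    for word in set(sentence.lower().split()):
        if word in vocabulary:
            log_pos += math.log(smoothed_prob(word, positive_counts, num_pos, vocabulary))
            log_neg += math.log(smoothed_prob(word, negative_counts, num_neg, vocabulary))
    return 1 if log_pos > log_neg else 0

correct = sum(predict(s) == y for s, y in zip(test_df['Sentences'], test_df['Score']))
print('Test accuracy:', correct / len(test_df))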
My contribution
Implemented and evaluated smoothing with multiple smoothing parameter (alpha) values
Performed k-fold cross-validation on the development dataset, with the training and test data updated on every fold
Plotted graphs to display the various accuracies
This notebook can be found at the link.
References
Please note that the code used in the notebook draws on the sources below and has been modified as needed.