  • Vinay Anant
  • Dec 7, 2021
  • 3 min read

Fake news is false or inaccurate information that is spread to mislead people, sometimes with unforeseeable consequences.

Here we are building a classifier that can determine whether news is real or fake.

The dataset that has been used for this classifier is available here.


Import libraries

First, we import the necessary libraries, which include NumPy, pandas, Matplotlib, scikit-learn, and a few more.
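As a rough sketch, the imports might look like the following (the exact list in the original notebook may differ):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
```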


Import dataset

Here we import the dataset using pandas. We also drop any NA values to make the dataset cleaner.
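A minimal sketch, assuming the CSV file is called news.csv with 'text' and 'label' columns (hypothetical names; adjust to the actual dataset):

```python
# Hypothetical file and column names.
df = pd.read_csv("news.csv")
df = df.dropna()      # drop rows with missing (NA) values

X = df["text"]        # the article text
y = df["label"]       # the target: real or fake
```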




Splitting the dataset

Now we split the dataset into training and test sets. Any ratio that gives accurate results can be used; here we use an 80:20 split, i.e. 80% for training and 20% for testing.
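For example, an 80:20 split with scikit-learn, followed by TF-IDF vectorization so the models can work with numeric features (the notebook may vectorize at a different point in the pipeline):

```python
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Turn raw text into TF-IDF feature vectors.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
xv_train = vectorizer.fit_transform(x_train)
xv_test = vectorizer.transform(x_test)
```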






Train and test using various models

Let's write a helper function that trains each model and reports its accuracy on the test set.
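One possible shape for such a helper (names and details are my own, and the notebook's version may differ); it also records the training time, which is used for the comparison later on:

```python
import time

def evaluate_model(model, xv_train, y_train, xv_test, y_test):
    """Fit the model, then return its test accuracy and training time."""
    start = time.time()
    model.fit(xv_train, y_train)
    train_time = time.time() - start
    acc = accuracy_score(y_test, model.predict(xv_test))
    print(f"{model.__class__.__name__}: accuracy={acc:.4f}, training time={train_time:.2f}s")
    return acc, train_time
```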


Naive Bayes Classifier

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors θ_y = (θ_y1, …, θ_yn) for each class y, where n is the number of features (in text classification, the size of the vocabulary) and θ_yi is the probability P(x_i | y) of feature i appearing in a sample belonging to class y.
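A minimal sketch with scikit-learn's MultinomialNB and the helper sketched above (default hyperparameters, not necessarily those used in the notebook):

```python
from sklearn.naive_bayes import MultinomialNB

nb_acc, nb_time = evaluate_model(MultinomialNB(), xv_train, y_train, xv_test, y_test)
```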



Random Forest Classifier

Random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
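A corresponding sketch with RandomForestClassifier (the hyperparameters are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_acc, rf_time = evaluate_model(rf, xv_train, y_train, xv_test, y_test)
```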



Logistic Regression


Logistic Regression implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ solvers. Note that regularization is applied by default. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted.
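A sketch with LogisticRegression (solver and C are illustrative choices):

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000)
lr_acc, lr_time = evaluate_model(lr, xv_train, y_train, xv_test, y_test)
```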



Decision Tree

Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
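A sketch with DecisionTreeClassifier, again using the helper from above:

```python
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt_acc, dt_time = evaluate_model(dt, xv_train, y_train, xv_test, y_test)
```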



K-Nearest Neighbor

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
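A sketch with KNeighborsClassifier (k = 5 is the scikit-learn default, not necessarily what the notebook uses):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn_acc, knn_time = evaluate_model(knn, xv_train, y_train, xv_test, y_test)
```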



AdaBoostClassifier

AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
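A sketch with AdaBoostClassifier (illustrative number of estimators):

```python
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada_acc, ada_time = evaluate_model(ada, xv_train, y_train, xv_test, y_test)
```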



Support Vector Machine

Support Vector Machine finds a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.
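A sketch using LinearSVC, a common SVM choice for sparse text features (the notebook may use a different SVM variant):

```python
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1.0)
svm_acc, svm_time = evaluate_model(svm, xv_train, y_train, xv_test, y_test)
```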



Comparison and visualization of the different models' accuracies
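One way to visualize the accuracies collected above, assuming the variables from the sketches in the previous sections:

```python
model_names = ["Naive Bayes", "Random Forest", "Logistic Regression",
               "Decision Tree", "KNN", "AdaBoost", "SVM"]
accuracies = [nb_acc, rf_acc, lr_acc, dt_acc, knn_acc, ada_acc, svm_acc]

plt.figure(figsize=(10, 4))
plt.bar(model_names, accuracies)
plt.ylabel("Test accuracy")
plt.title("Accuracy comparison of the models")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```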





Training time comparison and visualization of different models
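And similarly for the training times recorded by the helper (again assuming the variables from the earlier sketches):

```python
train_times = [nb_time, rf_time, lr_time, dt_time, knn_time, ada_time, svm_time]

plt.figure(figsize=(10, 4))
plt.bar(model_names, train_times)
plt.ylabel("Training time (seconds)")
plt.title("Training time comparison of the models")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```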






Conclusion

As we can see, the maximum accuracy, approximately 93.99%, was achieved by logistic regression.


Experiments and findings

Experimented with 7 different models (naive Bayes, random forest, logistic regression, decision tree, KNN, AdaBoost, and SVM) as well as hyperparameter tuning. The finding is that logistic regression achieved the best accuracy, above 93 percent, while requiring relatively little training time.


Overfitting was not observed in the above models, as an ample amount of data was supplied.


Challenges

The challenge was to improve accuracy, which was accomplished by running numerous experiments with the train-test split, implementing multiple models, and experimenting with their various hyperparameters.


Contribution

  • Understood the code from this link and extended it to 7 different models (naive Bayes, random forest, logistic regression, decision tree, KNN, AdaBoost, and SVM) and performed various hyperparameter tuning, yielding an accuracy of more than 93 percent with logistic regression

  • Visualization and comparison of accuracy and training time for the above models using various types of plots



This notebook can be found here

The demo of this project can be found here


Please note that the algorithm explanations have been referred from the links embedded in the first word of each explanation.


References:


 
 
 

Overview

We are using a movie-review text dataset from Kaggle. Our goal is to predict the sentiment.


Naive Bayes Classifier

The Naive Bayes classifier is a simple probabilistic classifier, based on Bayes' theorem, that aids in building machine learning models capable of making predictions.



P(A|B) = P(B|A) * P(A) / P(B)



Import dataset

Here, I downloaded the dataset and uploaded it to Google Drive. After this, I import it into the Jupyter notebook by providing the URL of the file.



Import the dataset, shuffle it, and name the columns, for example 'Sentences' and 'Score'.
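A minimal sketch, assuming the file is tab-separated with one sentence and a 0/1 score per line, and using a hypothetical Google Drive download URL:

```python
import pandas as pd

# Hypothetical shareable-download URL; replace <file-id> with the real id.
url = "https://drive.google.com/uc?id=<file-id>"

df = pd.read_csv(url, sep="\t", header=None, names=["Sentences", "Score"])

# Shuffle the rows so the later splits are random.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```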



Split dataset

Split the dataset into train, development, and test sets using the pandas DataFrame.

The dataset is divided according to the requirement; here we divide it into 80% for training, 10% for development, and 10% for testing. These values can be changed via the parameters (train_size = 0.8, development_size = 0.1, test_size = 0.1).
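A simple way to get the 80/10/10 split by position after shuffling (a sketch; the notebook may slice differently):

```python
n = len(df)
train_end = int(0.8 * n)   # train_size = 0.8
dev_end = int(0.9 * n)     # development_size = 0.1

train_df = df.iloc[:train_end]
dev_df = df.iloc[train_end:dev_end]
test_df = df.iloc[dev_end:]
```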

Build vocabulary

Build a vocabulary using dictionaries and a number of for loops and if-else conditions. This produces dictionaries whose keys are words and whose values are the number of reviews in which each word occurs, kept separately for positive and negative reviews.
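A sketch of the counting step, assuming the Score column uses 1 for positive and 0 for negative reviews:

```python
from collections import defaultdict

pos_counts = defaultdict(int)   # word -> number of positive reviews containing it
neg_counts = defaultdict(int)   # word -> number of negative reviews containing it

for _, row in train_df.iterrows():
    words = set(str(row["Sentences"]).lower().split())
    for word in words:
        if row["Score"] == 1:   # assumption: 1 = positive, 0 = negative
            pos_counts[word] += 1
        else:
            neg_counts[word] += 1

vocabulary = set(pos_counts) | set(neg_counts)
```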



Calculate probability

Now, calculate the probability of occurrence of each word, as well as the conditional probability given the sentiment (positive or negative).

Example:


P[“home”] = Number of documents containing ‘home’ / Number of all documents


P[“home” | Positive] = Number of positive documents containing “home” / Number of all positive review documents


P[“home” | Negative] = Number of negative documents containing “home” / Number of all negative review documents
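With the counts collected above, those probabilities can be sketched as follows (unsmoothed for now):

```python
n_docs = len(train_df)
n_pos = int((train_df["Score"] == 1).sum())
n_neg = n_docs - n_pos

def p_word(word):
    """P[word] = documents containing the word / all documents."""
    return (pos_counts[word] + neg_counts[word]) / n_docs

def p_word_given_pos(word):
    """P[word | Positive] = positive documents containing the word / all positive documents."""
    return pos_counts[word] / n_pos

def p_word_given_neg(word):
    """P[word | Negative] = negative documents containing the word / all negative documents."""
    return neg_counts[word] / n_neg
```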




Calculate accuracy using development dataset

Calculate the accuracy on the development dataset and perform k-fold cross validation.

Use the development dataset created above to do 5-fold cross validation: divide it into 5 equal parts, then use 4 parts for training and 1 part for testing. This is repeated for 5 rounds so that every part is used for testing once, and the accuracy is calculated after each round.
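A sketch of the fold mechanics; train_nb and predict_nb are hypothetical helpers standing in for the counting and probability steps shown above:

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []

for train_idx, test_idx in kf.split(dev_df):
    fold_train = dev_df.iloc[train_idx]
    fold_test = dev_df.iloc[test_idx]

    model = train_nb(fold_train)                                     # hypothetical helper
    preds = [predict_nb(model, s) for s in fold_test["Sentences"]]   # hypothetical helper
    acc = np.mean(np.array(preds) == fold_test["Score"].to_numpy())
    fold_accuracies.append(acc)

print("Per-fold accuracy:", fold_accuracies)
print("Mean 5-fold accuracy:", np.mean(fold_accuracies))
```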





Smoothing

Smoothing is an approach that deals with the issue of zero probability. We have used the formula below (Source).



Here the alpha value has been chosen as 1 for the first evaluation and 100 for the second evaluation.
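One common form of additive (Laplace) smoothing, shown here as a sketch building on the counts above; the exact formula in the linked source may differ slightly:

```python
def p_word_given_pos_smoothed(word, alpha=1.0):
    """(count in positive docs + alpha) / (positive docs + alpha * vocabulary size)."""
    return (pos_counts[word] + alpha) / (n_pos + alpha * len(vocabulary))

def p_word_given_neg_smoothed(word, alpha=1.0):
    return (neg_counts[word] + alpha) / (n_neg + alpha * len(vocabulary))

# alpha = 1 for the first evaluation, alpha = 100 for the second.
```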




Predict the top 10 words for the positive and negative classes

Here we check the top 10 positive and the top 10 negative words.
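For example, ranking the vocabulary by the smoothed conditional probabilities sketched above:

```python
top_pos = sorted(vocabulary, key=p_word_given_pos_smoothed, reverse=True)[:10]
top_neg = sorted(vocabulary, key=p_word_given_neg_smoothed, reverse=True)[:10]

print("Top 10 positive words:", top_pos)
print("Top 10 negative words:", top_neg)
```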



Final accuracy

Finally, the accuracy is evaluated on the test dataset.



The accuracy obtained on the test set is 66%.


My contribution

  • Implemented and evaluated smoothing with multiple smoothing parameter (alpha) values

  • Performed k-fold cross validation properly on the development dataset, where the training and test folds were updated with every fold

  • Plotted different graphs to display various accuracies


This notebook can be found at the link


References

Please note that the code used in the notebook was understood, referred to, or adapted from the sources below and modified as needed.


 
 
 

Overview

Introduction to the concept of overfitting through the use of higher order linear regression


Overfitting


Overfitting occurs when a model learns the detail and noise in the training data to the point where it degrades the model's performance on new data. In other words, the model picks up noise or random fluctuations in the training data and learns them as concepts. The issue is that these concepts do not apply to new data, which limits the model's ability to generalize.



Underfitting

Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit machine learning model is not suitable, as evidenced by its poor performance on the training data. Underfitting is rarely discussed because, given a reasonable performance metric, it is easy to detect; the remedy is to move on and try alternative machine learning techniques. Nonetheless, it serves as a good counterpoint to the issue of overfitting.



Generate data pairs

Let us now generate the 20 data pairs (X, Y) using y = sin(2*pi*X) + 0.1 * N.

We can draw X from a uniform distribution between 0 and 1. This is easily done with NumPy using np.random.uniform.

Next we sample N from the standard normal (Gaussian) distribution, again with NumPy using np.random.normal.

Y can now be computed as y = sin(2*pi*X) + 0.1 * N.


Split dataset

Split the dataset into 10 points for training and 10 for testing.
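A minimal sketch of the data generation and the 10/10 split:

```python
import numpy as np

# 20 pairs: X ~ Uniform(0, 1), N ~ standard normal noise.
X = np.random.uniform(0, 1, 20)
N = np.random.normal(0, 1, 20)
Y = np.sin(2 * np.pi * X) + 0.1 * N

# 10 points for training, 10 for testing.
x_train, y_train = X[:10], Y[:10]
x_test, y_test = X[10:], Y[10:]
```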


Root Mean Square Error

The Root Mean Square Error (RMSE) is a standard method of calculating a model's error in predicting quantitative data



RMSE is a good estimator for the standard deviation σ of the distribution of our errors!
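As a small helper used in the rest of this walkthrough:

```python
def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared difference."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```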



Gradient Descent

Gradient descent is an optimization approach for determining the values of a function's parameters that minimize a cost function.


When the parameters cannot be determined analytically (e.g., using linear algebra) and must be found using an optimization algorithm, gradient descent is the best method to utilize.

The procedure begins with initial values for the function's coefficient or coefficients; these could be 0.

The cost of the coefficients is determined by plugging them into the function and calculating the cost.

Then the derivative of the cost is computed.

Now we have the derivative, which can be used to update the values of the coefficients. After this, a learning rate parameter, which controls how much the coefficients can change on each update, must be specified.

Source: Lecture 03, Gradient Descent slides
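A rough sketch of gradient descent for polynomial regression on this data (the learning rate and iteration count are illustrative; the notebook may use different values or a closed-form fit instead):

```python
def poly_features(x, order):
    """Design matrix [1, x, x^2, ..., x^order]."""
    return np.vander(x, order + 1, increasing=True)

def gradient_descent(x, y, order, lr=0.01, epochs=50000):
    """Minimize the mean squared error over the polynomial weights."""
    A = poly_features(x, order)
    w = np.zeros(order + 1)              # coefficients initialized to 0
    for _ in range(epochs):
        error = A @ w - y
        grad = 2 * A.T @ error / len(y)  # derivative of the cost
        w -= lr * grad                   # update controlled by the learning rate
    return w
```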


Order (0, 1, 3, 9)

We can find the weights of polynomial regression for orders 0, 1, 3, and 9.


Pandas dataframe to display

Display the weights for the different orders using pandas, which provides convenient DataFrames for tabular display.
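A sketch using np.polyfit for the fits and pandas for the table (the notebook may instead reuse the gradient descent routine above); np.polyfit returns coefficients highest power first, so they are reversed to list w0 first:

```python
import pandas as pd

orders = [0, 1, 3, 9]
fits = {f"M={m}": np.polyfit(x_train, y_train, m) for m in orders}

# Columns of different lengths are padded with NaN automatically.
weights_table = pd.DataFrame({name: pd.Series(w[::-1]) for name, w in fits.items()})
print(weights_table)
```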



Plot generation for the fits of various orders using Matplotlib





Fitted curves for M = 0, M = 1, M = 3, and M = 9.
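A possible way to generate such plots with Matplotlib, reusing the fits above:

```python
import matplotlib.pyplot as plt

xs = np.linspace(0, 1, 200)
fig, axes = plt.subplots(1, 4, figsize=(16, 3))

for order, ax in zip([0, 1, 3, 9], axes):
    w = np.polyfit(x_train, y_train, order)
    ax.plot(xs, np.sin(2 * np.pi * xs), "g", label="true function")
    ax.plot(xs, np.polyval(w, xs), "r", label="fit")
    ax.scatter(x_train, y_train, s=15, label="training points")
    ax.set_ylim(-1.5, 1.5)
    ax.set_title(f"M = {order}")
    ax.legend(fontsize=7)

plt.show()
```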






Train error vs Test error

Plotting a graph makes it easy to compare the train and test errors.

This graph shows the train and test RMSE for every polynomial order from 0 to 9.
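A sketch of how such a plot can be produced with the rmse helper defined earlier:

```python
orders = range(10)
train_err, test_err = [], []

for m in orders:
    w = np.polyfit(x_train, y_train, m)
    train_err.append(rmse(y_train, np.polyval(w, x_train)))
    test_err.append(rmse(y_test, np.polyval(w, x_test)))

plt.plot(orders, train_err, "o-", label="train RMSE")
plt.plot(orders, test_err, "o-", label="test RMSE")
plt.xlabel("Polynomial order M")
plt.ylabel("RMSE")
plt.legend()
plt.show()
```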






Generating 100 more data pairs

Let's generate 100 more data pairs to see the results and fitting of a 9th order model.

On the left side we can see 100 data pairs, on the right we can see the fit.
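A short sketch of this experiment:

```python
# 100 additional pairs from the same generating process.
X100 = np.random.uniform(0, 1, 100)
Y100 = np.sin(2 * np.pi * X100) + 0.1 * np.random.normal(0, 1, 100)

# Fit a 9th-order polynomial to the larger sample.
w9 = np.polyfit(X100, Y100, 9)
print("Test RMSE with 100 training points:", rmse(y_test, np.polyval(w9, x_test)))
```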


We can avoid this problem of overfitting using regularization.


Regularization

We can regularize by adding a penalty based on the sum of the weights (their squares for L2, their absolute values for L1) to the cost function.


Regularization reduces the variance of a model without a substantial increase in its bias, which helps avoid overfitting.


L1 and L2

Regularization consists of two main techniques, L1 and L2. In L2, the cost function is modified by adding a penalty term proportional to the squared magnitude of the weights. This is also called Ridge Regression.

L1, or Lasso regression, is another regularization technique for reducing model complexity. Lasso stands for Least Absolute Shrinkage and Selection Operator.


We can perform it for various lambda values: 1, 1/10, 1/100, 1/1000, 1/10000, and 1/100000.
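A sketch using scikit-learn's Ridge (whose alpha plays the role of lambda) on 9th-order polynomial features; the notebook may instead add the penalty to the cost function by hand:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

lambdas = [1, 1/10, 1/100, 1/1000, 1/10000, 1/100000]

for lam in lambdas:
    model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=lam))
    model.fit(x_train.reshape(-1, 1), y_train)
    pred = model.predict(x_test.reshape(-1, 1))
    print(f"lambda = {lam}: test RMSE = {rmse(y_test, pred):.3f}")
```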


Using L2 regularization brings the test error much closer to the train error, ultimately reducing or avoiding overfitting.


So regularization really helps here.



Various experimented Lambda values







Conclusion

After performing various experiments, it was observed that with the ninth-degree polynomial the model performed very well on the training data but overfitted.

Also, it is difficult to tell which model performed best for the given lambda values; however, despite the variation, a lambda close to 0.1 appears to perform better than the others.



Contribution

  • Performed experiments for various orders and plotted different graphs

  • Researched information for overfitting and its possible solution

  • Implemented L2 Regularization to overcome overfitting


Challenges

  • Implementing this problem was new for me, and the references helped me a lot to gain understanding and eventually solve it

  • Displaying weights in a table was a challenge and I solved it using data frame by pandas after multiple unsuccessful attempts

  • Implementation of the model was challenging due to different dimensions and ordering. Reshaping, and sorting using zip, helped to resolve this



The notebook can be found here


References:

 
 
 