Overview
Introduction to the concept of overfitting through the use of higher-order polynomial regression.
Overfitting
Overfitting occurs when a model learns the detail and noise in the training data to the point where it degrades the model's performance on new data. In other words, the model picks up on noise or random fluctuations in the training data and learns them as concepts. The problem is that these concepts do not apply to new data, which limits the model's ability to generalize.
Underfitting
Underfitting refers to a model that can neither capture the training data nor generalize to new data. An underfit machine learning model is not suitable, as shown by its poor performance on the training data itself. Underfitting is rarely discussed at length because, given a decent performance metric, it is easy to detect; the remedy is simply to try different, more expressive machine learning techniques. Nonetheless, it serves as a good counterpoint to the problem of overfitting.
Generate data pairs
Let us now proceed to generate the 20 data pairs (X,Y) using y = sin(2*pi*X) + 0.1 * N.
We can draw X from a uniform distribution between 0 and 1. This can be done easily with NumPy using np.random.uniform.
Next we sample the noise N from the standard normal (Gaussian) distribution, again with NumPy using np.random.normal.
Now Y can be computed as y = sin(2*pi*X) + 0.1 * N.
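A minimal sketch of this data generation step (the seed and variable names are illustrative assumptions, not taken from the original notebook):

```python
import numpy as np

np.random.seed(0)  # assumed seed, only for reproducibility

n_points = 20
X = np.random.uniform(0, 1, n_points)   # X ~ Uniform(0, 1)
N = np.random.normal(0, 1, n_points)    # N ~ standard normal noise
Y = np.sin(2 * np.pi * X) + 0.1 * N     # y = sin(2*pi*X) + 0.1 * N
```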
Split dataset
Split the dataset into 10 pairs for training and 10 for testing.
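One simple way to perform this split, assuming the arrays generated above (a sketch, not necessarily the exact code used):

```python
# Shuffle the indices, then take the first 10 pairs for training and the rest for testing
idx = np.random.permutation(n_points)
X_train, Y_train = X[idx[:10]], Y[idx[:10]]
X_test, Y_test = X[idx[10:]], Y[idx[10:]]
```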
Root Mean Square Error
The Root Mean Square Error (RMSE) is a standard method of calculating a model's error in predicting quantitative data.
RMSE is a good estimator for the standard deviation σ of the distribution of our errors!
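RMSE is the square root of the mean squared difference between predictions and targets; a minimal helper (the function name is an assumption):

```python
def rmse(y_true, y_pred):
    """Root Mean Square Error between targets and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```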
Gradient Descent
Gradient descent is an optimization approach for finding the values of a function's parameters that minimize a cost function.
When the parameters cannot be determined analytically (e.g., using linear algebra) and must be found using an optimization algorithm, gradient descent is the best method to utilize.
The procedure begins with initial values for the function's coefficient or coefficients; these could be 0. The cost of the coefficients is determined by plugging them into the function and evaluating the cost. Then the derivative of the cost is computed. This derivative is used to update the values of the coefficients. Finally, a learning rate parameter, which controls how much the coefficients can change on each update, must be specified. (Source: Lecture 03, Gradient Descent slides)
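A minimal sketch of this procedure for polynomial weights (the helper names, learning rate, and iteration count are illustrative assumptions; high orders may need a smaller learning rate or feature scaling to converge):

```python
def design_matrix(x, order):
    # Columns are x**0, x**1, ..., x**order
    return np.vander(x, order + 1, increasing=True)

def gradient_descent(x, y, order, lr=0.05, n_iters=10000):
    Phi = design_matrix(x, order)
    w = np.zeros(order + 1)              # start with all coefficients at 0
    for _ in range(n_iters):
        err = Phi @ w - y                # error of the current coefficients
        grad = 2 * Phi.T @ err / len(y)  # derivative of the mean squared cost
        w -= lr * grad                   # update scaled by the learning rate
    return w
```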
Order (0, 1, 3, 9)
We can find the weights of polynomial regression models of order 0, 1, 3, and 9.
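For example, the weights could be obtained with NumPy's least-squares polynomial fit (a sketch; the original notebook may instead use the gradient-descent routine sketched above):

```python
orders = [0, 1, 3, 9]
# Coefficients are returned highest degree first
weights = {m: np.polyfit(X_train, Y_train, m) for m in orders}
```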
Pandas DataFrame to display the weights
We display the weights for each order using pandas, which provides us with DataFrames.
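A possible arrangement, with one column per order and one row per coefficient, reusing the hypothetical weights dictionary from above (the exact layout is an assumption):

```python
import pandas as pd

# Reverse each coefficient vector so row w0 is the intercept, w1 the linear term, etc.
table = pd.DataFrame({f"M={m}": pd.Series(weights[m][::-1]) for m in orders})
table.index = [f"w{i}" for i in range(table.shape[0])]
print(table)
```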
Generating plots of the fits for various orders using Matplotlib
[Figures: fitted curves for M = 0, M = 3, and M = 9]
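A sketch of how such plots can be produced with Matplotlib (figure layout and styling are illustrative):

```python
import matplotlib.pyplot as plt

xs = np.linspace(0, 1, 200)
fig, axes = plt.subplots(1, len(orders), figsize=(16, 4))
for ax, m in zip(axes, orders):
    ax.scatter(X_train, Y_train, label="train data")
    ax.plot(xs, np.polyval(weights[m], xs), label=f"M={m} fit")
    ax.plot(xs, np.sin(2 * np.pi * xs), "--", label="sin(2*pi*x)")
    ax.set_title(f"M = {m}")
    ax.legend()
plt.show()
```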
Train error vs Test error
Plotting the errors makes it easy to compare train and test performance. The graph compares the train and test RMSE for every polynomial order from 0 to 9.
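A sketch of computing and plotting the train and test RMSE for orders 0 through 9, reusing the hypothetical rmse helper above:

```python
orders_all = range(10)
train_err, test_err = [], []
for m in orders_all:
    w = np.polyfit(X_train, Y_train, m)
    train_err.append(rmse(Y_train, np.polyval(w, X_train)))
    test_err.append(rmse(Y_test, np.polyval(w, X_test)))

plt.plot(list(orders_all), train_err, "o-", label="train RMSE")
plt.plot(list(orders_all), test_err, "o-", label="test RMSE")
plt.xlabel("polynomial order M")
plt.ylabel("RMSE")
plt.legend()
plt.show()
```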
Generating 100 more data pairs
Let's generate 100 more data pairs to see how a 9th-order model fits this larger dataset.
On the left side we can see 100 data pairs, on the right we can see the fit.
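The larger dataset can be generated in the same way as before and fit with a 9th-order model (a sketch under the same assumptions as the earlier snippets):

```python
X_100 = np.random.uniform(0, 1, 100)
Y_100 = np.sin(2 * np.pi * X_100) + 0.1 * np.random.normal(0, 1, 100)

w9 = np.polyfit(X_100, Y_100, 9)
xs = np.linspace(0, 1, 200)
plt.scatter(X_100, Y_100, label="100 data pairs")
plt.plot(xs, np.polyval(w9, xs), label="M=9 fit")
plt.legend()
plt.show()
```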
We can avoid this problem of overfitting using regularization.
Regularization
We can regularize by adding a penalty based on the sum of the squared weights to the cost function.
Regularization reduces the variance of a model at the cost of only a small increase in its bias, which helps in avoiding overfitting.
L1 and L2
Two common regularization techniques are L1 and L2. In L2 regularization the cost function is modified by adding a penalty term; this is also called Ridge Regression.
L1, or Lasso regression, is another regularization technique for reducing model complexity. Lasso stands for Least Absolute Shrinkage and Selection Operator.
We can perform it for various lambda values: 1, 1/10, 1/100, 1/1000, 1/10000, and 1/100000.
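A minimal ridge (L2) sketch over those lambda values, using a closed-form solution on the polynomial design matrix and reusing the hypothetical design_matrix and rmse helpers from above (for simplicity the penalty here also covers the bias weight, which is an assumption):

```python
def fit_ridge(x, y, order, lam):
    """Ridge fit: w = (Phi^T Phi + lam * I)^-1 Phi^T y."""
    Phi = design_matrix(x, order)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(order + 1), Phi.T @ y)

lambdas = [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
for lam in lambdas:
    w = fit_ridge(X_train, Y_train, 9, lam)
    tr = rmse(Y_train, design_matrix(X_train, 9) @ w)
    te = rmse(Y_test, design_matrix(X_test, 9) @ w)
    print(f"lambda={lam:g}  train RMSE={tr:.3f}  test RMSE={te:.3f}")
```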
Using L2 regularization brings the test and train errors much closer together and ultimately reduces or avoids overfitting.
So regularization really helps here.
Experiments with various lambda values
Conclusion
After performing various experiments, we noticed that the ninth-order model performed very well on the training data but overfitted.
It is difficult to say definitively which model performed best across the given lambda values; however, despite some variation between runs, the lambda closest to 0.1 appears to perform better than the others.
Contribution
Performed experiments for various orders and plotted different graphs
Researched information for overfitting and its possible solution
Implemented L2 Regularization to overcome overfitting
Challenges
Implementing this problem was new for me, and the references helped me a lot to gain understanding and eventually solve it
Displaying the weights in a table was a challenge; after multiple unsuccessful attempts I solved it using a pandas DataFrame
Implementing the model was challenging due to mismatched dimensions and ordering. Reshaping the arrays and sorting with zip helped resolve this
The notebook can be found here
References: