
Titanic tutorial from Kaggle

  • Writer: Vinay Anant
  • Sep 24, 2021
  • 3 min read

Updated: Sep 27, 2021


Overview

This Titanic tutorial attempts to predict, from the data provided, which passengers survived the disaster and which did not.


Dataset

There are three CSV files provided: train.csv, test.csv, and gender_submission.csv.


train.csv: This file contains information about the onboarded passengers, including whether each one survived.


test.csv: This file contains the same passenger information without the survival outcome; from it, we have to predict whether or not each passenger survived.


gender_submission.csv: This file serves as an example of what a submission should look like.


Files location

This code displays the location of the CSV files on Kaggle.
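
A minimal sketch of this step, assuming the standard Kaggle notebook environment:

    import os

    # Walk Kaggle's input directory and print the full path of every file,
    # which reveals where the three CSV files live.
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))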



Data reading

Now that we know where the data is stored, we can begin reading the files.

This code reads "train.csv" with pandas, and the head() function returns its first five rows.


Similarly, this code reads "test.csv" using pandas, and the head() function returns its first five rows.
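
A minimal sketch of these two reading steps, assuming the files live under /kaggle/input/titanic as printed above:

    import pandas as pd

    # Read the training data and preview its first five rows
    train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
    print(train_data.head())

    # Read the test data and preview its first five rows
    test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
    print(test_data.head())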


Pattern analysis

We have "gender_submission.csv" here, which assumes that only female passengers survived and that male passengers did not.



This code displays the percentage of women who survived the disaster, as well as the percentage of men who survived.
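
A sketch of that computation, assuming the train_data frame loaded earlier:

    # Survival rate among female passengers
    women = train_data.loc[train_data.Sex == 'female']["Survived"]
    print("% of women who survived:", sum(women) / len(women))

    # Survival rate among male passengers
    men = train_data.loc[train_data.Sex == 'male']["Survived"]
    print("% of men who survived:", sum(men) / len(men))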

According to the data, 74.2 percent of women and 18.9 percent of men survived.


Random forest model

The random forest model is a machine learning model that is useful for classification and regression tasks. This Titanic problem is a classification task, because we are attempting to determine whether or not each passenger survived, so we can apply this model here.


The random forest model is made up of several trees that each examine a passenger's data and vote on whether or not that passenger survived. The model then makes a democratic decision: the outcome with the most votes wins.



We import RandomForestClassifier and search for patterns in four columns: "Pclass", "Sex", "SibSp", and "Parch". The model builds the trees of the random forest from the patterns in "train.csv" and then generates predictions for "test.csv".

It then saves all of the predictions to "submission.csv" and prints the message "Your submission was successfully saved!"
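
A sketch of this cell, assuming the train_data and test_data frames from earlier (the hyperparameters match the ones quoted in the Contribution section below):

    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    # Target and the four feature columns
    y = train_data["Survived"]
    features = ["Pclass", "Sex", "SibSp", "Parch"]

    # One-hot encode the categorical "Sex" column in both sets
    X = pd.get_dummies(train_data[features])
    X_test = pd.get_dummies(test_data[features])

    # Build a forest of 100 trees and let them vote on each test passenger
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    model.fit(X, y)
    predictions = model.predict(X_test)

    # Save the predictions in the required submission format
    output = pd.DataFrame({"PassengerId": test_data.PassengerId, "Survived": predictions})
    output.to_csv("submission.csv", index=False)
    print("Your submission was successfully saved!")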


Submission

This is the final step in the process, in which we submit the code to the competition.



This achieves an accuracy of 0.77511, or roughly 77.5 percent correct predictions.


Contribution

Now that we have an accuracy of 0.77511, we can try to improve it.


To improve accuracy, we can try a different model. We can experiment with SVC (Support Vector Classification), a class of SVM (Support Vector Machine) algorithms, to see if it can improve accuracy, as it can also handle classification tasks. Scikit-learn's implementation of SVC is based on libsvm. It accepts a variety of parameters, such as C, kernel, degree, gamma, and probability. However, I have already experimented with several of these parameters and they did not change the accuracy we get here, so in this case we will proceed without any parameters, i.e. with the defaults.
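
For illustration only, a parameterised SVC might look like this (these values are hypothetical examples, not settings used below):

    from sklearn.svm import SVC

    # Hypothetical example of explicit parameters; we will not use these here
    tuned_model = SVC(C=1.0, kernel="rbf", gamma="scale", probability=True)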


We follow the same steps and code as described above in the Titanic tutorial; the only updates are in the last cell, where we use the SVM model instead of the Random Forest Classifier.


First, we use "from sklearn.svm import SVC" to import SVC from scikit-learn's SVM module. (Note that this line replaces the import we had in the Titanic tutorial for the random forest, i.e. "from sklearn.ensemble import RandomForestClassifier".)

The model is then updated to "model = SVC()". (Note that this line replaces the line we had in the Titanic tutorial for the random forest, i.e. "model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)".)


The remaining steps and lines of code remain the same; no other changes are needed.
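
Putting the two changes together, the updated last cell would look roughly like this (a sketch assuming the same frames and features as before):

    from sklearn.svm import SVC
    import pandas as pd

    y = train_data["Survived"]
    features = ["Pclass", "Sex", "SibSp", "Parch"]
    X = pd.get_dummies(train_data[features])
    X_test = pd.get_dummies(test_data[features])

    # SVC with default parameters, replacing the random forest
    model = SVC()
    model.fit(X, y)
    predictions = model.predict(X_test)

    output = pd.DataFrame({"PassengerId": test_data.PassengerId, "Survived": predictions})
    output.to_csv("submission.csv", index=False)
    print("Your submission was successfully saved!")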


Now we run it to see if there are any errors.



Once it executes successfully, we can proceed with submission to the competition.



Voila! The accuracy has increased from 0.77511 to 0.77751, an improvement of 0.0024.



This Notebook can be found here:




Note: The following tools and libraries are used for this exercise:

Jupyter Notebook (on Kaggle), pandas, NumPy, scikit-learn




