Titanic tutorial from Kaggle
- Vinay Anant
- Sep 24, 2021
- 3 min read
Updated: Sep 27, 2021
Overview
This Titanic tutorial attempts to predict which passengers survived the Titanic disaster and which did not, based on the data provided.
Dataset
There are three CSV files provided: train.csv, test.csv, and gender_submission.csv.
train.csv: This file contains information about the passengers on board, including whether or not each of them survived
test.csv: This file contains passenger information from which we have to predict whether or not each passenger survived
gender_submission.csv: This file serves as an example of how a submission should appear
Files location
This code displays the location of the CSV files on Kaggle.
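A minimal sketch of that cell, assuming the standard Kaggle starter code and the default /kaggle/input directory:

import os

# Walk the Kaggle input directory and print the path of every file under it
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))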

Data reading
Now that we know where the data is stored, we can begin reading the files.

This code reads "train.csv" with pandas, and the head() function returns the first five rows of "train.csv."

Similarly, this code reads "test.csv" using pandas, and head() shows its first five rows.
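The corresponding cell for the test set, under the same path assumption:

# Load the test data and preview its first five rows
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()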
Analyse patterns
We have "gender submission.csv" here, which assumes that only female passengers survived and that male passengers did not.

This code displays the percentage of women who survived the disaster as well as the percentage of men who survived the disaster.
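A sketch of that computation, continuing from the cells above ("Survived" is coded 1 for survived, 0 for did not):

# Fraction of female passengers who survived
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women) / len(women)
print("% of women who survived:", rate_women)

# Fraction of male passengers who survived
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men) / len(men)
print("% of men who survived:", rate_men)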
According to the data, 74.2 percent of women and 18.9 percent of men survived.
Random forest model
The random forest model is a machine learning model that is useful for classification and regression tasks. This Titanic problem is a classification task, because we are attempting to determine whether or not a passenger survived, so we can apply this model here.
The random forest model is made up of several decision trees that each examine a passenger's data and vote on whether or not that passenger survived. The model then makes a democratic decision: the outcome with the most votes wins.

We import RandomForestClassifier and search for patterns in the four columns "Pclass", "Sex", "SibSp", and "Parch". The model builds its trees from the patterns in "train.csv" and then generates predictions for "test.csv".
Then it saves all of the predictions in "submission.csv" and prints the message "Your submission was successfully saved!"
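Putting that together, the cell is roughly the following, continuing from the cells above (the RandomForestClassifier parameters are the ones quoted later in this post; pd.get_dummies one-hot encodes the categorical "Sex" column):

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

# Use only these four columns as features; one-hot encode categorical values
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Fit 100 trees of depth at most 5, then predict on the test set
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# Save the predictions in the format gender_submission.csv demonstrates
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")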
Submission
This is the final step in the process, in which we submit the code to the competition.

This demonstrates an accuracy of 0.77511, i.e. about 77.5 percent correct predictions.
Contribution
Now that we have an accuracy of 0.77511, we can try to improve it.
To improve accuracy, we can try a different model. We can experiment with SVC (Support Vector Classification), a classifier from the SVM (Support Vector Machine) family, to see if it can improve accuracy, as it also handles classification tasks. Scikit-learn's implementation of Support Vector Classification is based on libsvm. It accepts a variety of parameters such as C, kernel, degree, gamma, probability, and many more. However, I have already experimented with several of these parameters and none of them improved the accuracy we get here, so in this case we will proceed with the default parameters.
We follow the same steps and code as described above in the Titanic tutorial; the only updates are in the last cell, shown below, where we use the SVC model instead of the RandomForestClassifier.
First, we use "from sklearn.svm import SVC" to import SVC from scikit-learn's SVM module. (Note that this line replaces the import we had in the Titanic tutorial for the random forest, i.e. "from sklearn.ensemble import RandomForestClassifier".)
The model is then updated to "model = SVC()". (Note that this line replaces the previous line we had in the Titanic tutorial for the random forest, i.e. "model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)".)
The remaining steps and lines of code will remain the same and no other changes are needed.
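So the last cell becomes, keeping everything else from the random forest version:

from sklearn.svm import SVC  # was: from sklearn.ensemble import RandomForestClassifier

# Default SVC parameters; tuning C, kernel, gamma, etc. did not change the score here
model = SVC()  # was: model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")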
Now we run it to see if there are any errors.

Once it has executed successfully, we can proceed with submission to the competition.

Voila! The accuracy has increased from 0.77511 to 0.77751, an improvement of 0.0024.
This Notebook can be found here:
Note - For this exercise, the following tools and libraries were used:
Jupyter Notebook (on Kaggle), pandas, NumPy, scikit-learn