Titanic Survivorship

Objective: To review and analyze data from the Titanic and model survivorship based on passanger details.

Taking a look at the data by age alone, we can see that the age skews younger with a median age of 28. If we add class and survivorship, it becomes very obvious that the largest number of deaths came from third class males. We also start to see that women stood a better chance of survival especially in the higher classes.

functions functions

When we put the same information in terms of percentage of survivors, we get a good picture of who survived and the likelihood of survival in each class. Half of all women in third class survived, and more than 90% of women in second and first classes survived. Men, on the other hand, have survival rates in first, second, and third classes of 37%, 16%, and 14%, respectively.

functions functions

Before we run a logistic regression, we’ll need to choose the features. I broke out class by gender so that our features are child, first class male, first class female, second class male, etc.

The coefficients of the regression are listed below, and you can see the likelihood of survival increases with class and decreases for males over females.

functions functions

The cross val score for this model is 0.7896, and the confusion matrix is below:

functions

The confusion matrix tells us that the model is very accurately predicting survivors (precision score = 88%) but is also predicting many as survived that actually were deceased (recall score = 54%). In other words, we’re predicting almost all survivors accurately at the expense of also predicting almost half of the deceased as also survivors.

The ROC curve for this model looks like this:

functions

The ROC curve is the relationship between False positives and True Positives. As the model accuracy improves by accuraretely prediciting more true positives, the false positives also increase. This ROC curve shows a very strong model.

We then run Gridsearch with logistic regression to find the best parameters, and they are as follows:

functions

We can also compare Gridsearch with kNN to see how these best parameters compare, and we see that the score is slightly higher with kNN.:

functions

We can then rerun our model with these paramaters and compare the confusion matrix and ROC curve.

functions functions

The accuracy of this model is just slightly higher with a score of 0.8000 as compared to 0.7966 with the first model.

Written on July 13, 2016