
Kaggle: Titanic Competition

Introduction

I’d like to share my approach to the Kaggle Titanic competition. The objective of this competition is to use machine learning to predict which passengers were likely to survive the sinking of the Titanic. The training data contains information (gender, age, class etc.) on surviving and deceased passengers. We use this data to train a classifier which incorporates useful features to predict passenger survival.

My approach can be divided into two parts: pre-processing and classification. I spent most of my time on pre-processing the data to maximize the use of the available information, and this focus is reflected in the sizes of the respective sections. I use a combination of R (with ggplot2) and Java (with the machine learning library Weka). You’ll find the code I wrote on my GitHub page.

Pre-processing - Parsing

The first task seems trivial: Parsing. Let’s take a look at one row:

32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C

We see good and bad things. Bad things first: i) The name string looks non-trivial to parse. ii) Some of the features have missing values. The good thing is that there is a lot of information hidden within single features. Take for example the name feature. The titles can tell us about the social standing and profession of a passenger. Does a Mister have the same chance to survive as a Master or a Reverend? Surnames can indicate families and therefore groups travelling together. Another interesting feature is the Cabin feature (B78). Is the deck (B) relevant for survival? Do passengers with more than one cabin have distinct survival chances?

I wrote a simple parser in Java to i) replace missing values with the Weka standard ‘?’ and ii) extract more information from several features. Specifically, I split the original ‘Name’ feature into ‘Surname’ and ‘Title’. I split the ‘Ticket’ feature into its leading characters (‘TicketId’) and its numeric value (‘TicketNr’). I further split the ‘Cabin’ feature into how many cabins the passenger booked (‘CabinCount’), on which deck the cabins are located (‘CabinDeck’) and the cabin number (‘CabinNr’). I also keep the original ‘Cabin’ as a potential grouping feature. The new data contains the following features:

PassengerId,Survived,Pclass,Surname,Title,Sex,Age,SibSp,Parch,TicketId,TicketNr,Fare,CabinCount,CabinDeck,CabinNr,Cabin,Embarked

You can download the newly formatted training and test data from my GitHub page.
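
To give an idea of what this step looks like, here is a minimal sketch of the kind of splitting the parser performs. The class and method names are illustrative and the expressions are simplified compared to the actual parser on GitHub.

public class FeatureSplitter {

    // Split a name like "Spencer, Mrs. William Augustus (Marie Eugenie)"
    // into surname ("Spencer") and title ("Mrs").
    static String[] splitName(String name) {
        String surname = name.substring(0, name.indexOf(',')).trim();
        String title = name.substring(name.indexOf(',') + 1).trim().split("\\.")[0].trim();
        return new String[] { surname, title };
    }

    // Split a cabin entry like "B78" (or "C23 C25 C27") into cabin count,
    // deck letter and cabin number; '?' marks missing values for Weka.
    static String[] splitCabin(String cabin) {
        if (cabin == null || cabin.isEmpty()) {
            return new String[] { "?", "?", "?" };
        }
        String[] cabins = cabin.trim().split("\\s+");
        String deck = cabins[0].substring(0, 1);
        String nr = cabins[0].length() > 1 ? cabins[0].substring(1) : "?";
        return new String[] { String.valueOf(cabins.length), deck, nr };
    }

    // Split a ticket like "PC 17569" into its leading characters and numeric part.
    static String[] splitTicket(String ticket) {
        String nr = ticket.replaceAll("[^0-9]", "");
        String id = ticket.replaceAll("[0-9]", "").replace(".", "").replace("/", "").trim();
        return new String[] { id.isEmpty() ? "?" : id, nr.isEmpty() ? "?" : nr };
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", splitName("Spencer, Mrs. William Augustus (Marie Eugenie)")));
        System.out.println(String.join(",", splitCabin("B78")));
        System.out.println(String.join(",", splitTicket("PC 17569")));
    }
}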

Pre-processing - Data exploration

Let’s get a feeling for the data! For this section I wrote my code in R due to its strength in visualization and statistics. Before we focus on individual features, let’s calculate the fraction of passengers that survived: 342 of 891, or roughly 0.38. This is a bit higher than expected from a totally unbiased sample (see Wikipedia). Maybe the nice people at Kaggle gave us training data in which both classes are a bit more evenly represented. We have two feature types, nominal and numerical. Let’s look at both types independently and try to identify those useful for our classification problem.

Feature selection - Nominal

We want to identify nominal features for which a given value provides additional information on the survival probability compared to the base rate. Take for example the feature "Sex". Is the survival probability for male and female passengers distinct? Interestingly, 74% of all female passengers in our training data survived, compared to only 19% of the male passengers. This looks interesting, but can we quantify how likely such a separation could have occurred by chance alone? We can model the chance of survival, neglecting any features, as a Bernoulli trial for each passenger. This simulates a coin flip with a biased coin such that the probability of success equals the population average. In the training data we observe 342 survivors out of 891 trials, and therefore a base survival rate of p = 0.38. The probability of getting k survivors in n trials, assuming the base rate, is given by the Binomial distribution P(k) = C(n, k) · p^k · (1 − p)^(n − k), where C(n, k) is the binomial coefficient.

How does this probability distribution help us to identify useful nominal features? We use it to formulate a null hypothesis, stating that the observed number of survivors for a given class (e.g. female) is generated by the base survival rate. Importantly, we can use the binomial test to derive a p-value for k survivors out of n passengers given our null hypothesis. The p-value reflects the probability to observe a result at least as extreme as the one observed, given the null hypothesis. Let us come back to our example: 233 out of 314 female passengers survived. Figure 1 shows the probability distribution of k survivors given n = 314 passengers and the base survival rate of p = 0.38. The vertical line indicates the observed number of survivors (k = 233).


Figure 1. Probability to observe k survivors given n=314 female passengers and a population survival rate of p=0.38. Vertical line indicates observed number of survivors.

The p-value of the observed number of survivors given the null hypothesis is the sum of all probabilities for k ≥ 233 (right of the line) and is practically zero. We can almost certainly reject the null hypothesis and therefore identify the feature "Sex" as important for classification.

Thus, a p-value in conjunction with a threshold gives us a statistical handle to select features. Importantly, we have to correct the p-values to take into account that we test one hypothesis for each feature class. The multiple testing problem becomes obvious when you notice that you accept a small probability of incorrectly rejecting the null hypothesis for each individual test. Testing a sequence of hypotheses makes the occurrence of at least one such false positive more likely. If you choose a p-value threshold of 0.05 and test 50 hypotheses, the probability to falsely reject the null hypothesis at least once is 1 − 0.95^50 ≈ 0.92. There are a variety of methods to perform corrections for multiple hypothesis testing. I chose the Bonferroni correction, which minimizes the number of false positives at the cost of accepting a significant number of false negatives.
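
To make this concrete, here is a small sketch that computes the upper-tail binomial p-value directly (in log space, to keep the binomial coefficients numerically stable) and applies the Bonferroni correction by multiplying with the number of tests. The class name and the test count of 50 are illustrative; the passenger counts are the ones discussed above. In R, binom.test with alternative = "greater" computes the same tail probability in one line.

public class SurvivalBinomialTest {

    // One-sided binomial test: probability of observing at least k survivors
    // out of n passengers if each survives independently with probability p.
    static double upperTailPValue(int k, int n, double p) {
        // Lookup table of log(i!) keeps the binomial coefficients stable for n in the hundreds.
        double[] logFactorial = new double[n + 1];
        for (int i = 2; i <= n; i++) {
            logFactorial[i] = logFactorial[i - 1] + Math.log(i);
        }
        double pValue = 0.0;
        for (int i = k; i <= n; i++) {
            double logPmf = logFactorial[n] - logFactorial[i] - logFactorial[n - i]
                    + i * Math.log(p) + (n - i) * Math.log(1 - p);
            pValue += Math.exp(logPmf);
        }
        return Math.min(1.0, pValue);
    }

    public static void main(String[] args) {
        double baseRate = 342.0 / 891.0;   // survivors / passengers in the training data
        double pFemale = upperTailPValue(233, 314, baseRate);

        // Bonferroni correction: multiply each p-value by the number of hypotheses tested
        // (one per feature class); the count of 50 is purely illustrative.
        int numberOfTests = 50;
        System.out.println("raw p = " + pFemale
                + ", Bonferroni corrected p = " + Math.min(1.0, pFemale * numberOfTests));
    }
}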

Keeping only those features that contain at least one class with a corrected p-value < 0.01 yields five features (Pclass, Title, Sex, CabinDeck and Embarked). While some of them seem intuitive (Pclass, Sex), others are more surprising (Embarked). Figure 2 shows the class-specific survival fractions.


Figure 2. Class-specific survival fractions for features with a significant difference to the survival base rate. The horizontal lines indicate base rate survival. The width of each bar scales with the number of instances per class.

The grey horizontal line indicates the average survival rate over the entire training population (base rate). The width of each bar scales with the number of instances for each class of a nominal feature. The distance from the average survival rate in combination with the width of the bar gives us a visual intuition of the significance of the difference.

Feature selection - Numerical

Let us now turn to the numerical features (PassengerId, Age, SibSp, Parch, TicketNr, Fare, CabinCount, CabinNr). Again, we seek to identify features whose values can separate survived from deceased passengers. Figure 3 shows the empirical cumulative distribution of values for each feature, separated by survival.


Figure 3. Empirical cumulative distribution of values for each feature for survived (green) and deceased (red) passengers. Note that values are normalized between 0 and 1 for all features.

We can immediately see that some numerical values seem indistinguishable for both classes of passengers (e.g. PassengerId) whereas others seem distinct (e.g. Fare). Can we use a statistical test to quantify the subjective difference? The Kolmogorov-Smirnov test lets us assess whether two samples are drawn from the same distribution. Importantly, this test is non-parametric and therefore allows us to test without knowing the type of the underlying distribution. The Kolmogorov-Smirnov derived and Bonferroni corrected p-values identify four interesting features with p < 0.01 (SibSp, Parch, TicketNr, Fare). Interestingly, we cannot reject the null hypothesis for the feature Age. Visual inspection indicates that age may play a predictive role for small values only, so let us include Age in our set of features anyway.
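
For the curious, the sketch below illustrates what the two-sample test computes: the statistic is the largest vertical distance between the two empirical cumulative distributions, and the p-value follows from the asymptotic Kolmogorov distribution (with the usual small-sample correction of the effective sample size). The actual analysis was done in R; this Java version and its class and method names are purely illustrative.

import java.util.Arrays;

public class TwoSampleKsTest {

    // Kolmogorov-Smirnov statistic: the largest vertical distance between the
    // empirical cumulative distributions of the two samples.
    static double ksStatistic(double[] a, double[] b) {
        double[] x = a.clone();
        double[] y = b.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        int i = 0, j = 0;
        double d = 0.0;
        while (i < x.length && j < y.length) {
            double value = Math.min(x[i], y[j]);
            while (i < x.length && x[i] <= value) i++;
            while (j < y.length && y[j] <= value) j++;
            d = Math.max(d, Math.abs((double) i / x.length - (double) j / y.length));
        }
        return d;
    }

    // Asymptotic p-value of the two-sample statistic (Kolmogorov distribution),
    // using the standard small-sample correction for the effective sample size.
    static double ksPValue(double d, int n, int m) {
        double effectiveN = (double) n * m / (n + m);
        double lambda = (Math.sqrt(effectiveN) + 0.12 + 0.11 / Math.sqrt(effectiveN)) * d;
        double sum = 0.0;
        for (int j = 1; j <= 100; j++) {
            sum += 2.0 * Math.pow(-1, j - 1) * Math.exp(-2.0 * j * j * lambda * lambda);
        }
        return Math.max(0.0, Math.min(1.0, sum));
    }
}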

Classification

The data exploration phase identified a subset of interesting features. This does not mean that all of them are useful (they might be redundant) or that they are the only features with useful information. It simply states that these features show significantly different values for survived and deceased passengers.

Let’s use these features as a starting point to train our classifier. For classification we use Java, making extensive use of the Weka library. You can find the main class here. We first load our training and test data into the Instances class, the Weka-specific data representation. Before starting classification, we have to define the type (nominal or numerical) of each feature. This is necessary because the input format (csv) does not contain type information. To make our lives easy we match training and test data by i) adding the missing 'Survived' feature to the test data (without assigning values) and ii) merging the training and test classes for each feature. These steps give both data sets an identical format and prepare the data for efficient filtering and classification within the Weka framework. We then specify the classification feature (Survived) and normalize all numeric features between 0 and 1. The latter ensures that all numeric features have identical weight at the beginning of training.
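
The sketch below shows the corresponding Weka calls. The file name is illustrative, and the alignment of training and test data described above is omitted for brevity; note that the 0/1 'Survived' column has to be converted to a nominal attribute before it can serve as a class for classification.

import java.io.File;

import weka.core.Instances;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class PrepareTitanicData {

    // Load a csv file into Weka's Instances representation.
    static Instances loadCsv(String path) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File(path));
        return loader.getDataSet();
    }

    public static void main(String[] args) throws Exception {
        Instances train = loadCsv("train_parsed.csv"); // illustrative file name

        // 'Survived' is loaded as a numeric 0/1 column; classification needs a nominal class.
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("" + (train.attribute("Survived").index() + 1)); // 1-based range
        toNominal.setInputFormat(train);
        train = Filter.useFilter(train, toNominal);

        // Declare 'Survived' as the class to predict.
        train.setClassIndex(train.attribute("Survived").index());

        // Normalize all numeric attributes to [0,1] so they start out with equal weight.
        Normalize normalize = new Normalize();
        normalize.setInputFormat(train);
        train = Filter.useFilter(train, normalize);

        System.out.println(train.numInstances() + " instances, " + train.numAttributes() + " attributes");
    }
}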

For classification we use the built-in sequential minimal optimization algorithm for support vector machines (the SMO class). We can estimate the expected classification performance using the cross-validation routines of the Evaluation class. Leave-one-out cross-validation predicts a classification error of 0.178. We then train our classifier on the entire training data and classify all test instances. Submission to Kaggle yields a score of 0.780. As mentioned above, we might have missed useful features in our subset. Let’s see if we can improve that score by adding features back to the subset we defined in our data exploration phase, using a greedy forward selection: we want to maximize prediction performance while keeping the number of features minimal. This approach adds the nominal features Surname and Cabin to our feature space. Cross-validation of the classifier trained on the extended feature space predicts an error of 0.144. Submission to Kaggle scores 0.804.
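
A sketch of the training and evaluation step, assuming training and test Instances prepared as above: the SMO class provides the support vector machine and the Evaluation class performs the cross-validation. The surrounding class and method names are illustrative.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;

public class TrainAndPredict {

    // Assumes 'train' and 'test' share the same header (attributes and nominal values),
    // with 'Survived' as the nominal class attribute (unset in the test data).
    static double[] evaluateAndPredict(Instances train, Instances test) throws Exception {
        // Leave-one-out cross-validation: one fold per training instance.
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(new SMO(), train, train.numInstances(), new Random(1));
        System.out.println("Estimated classification error: " + eval.errorRate());

        // Train the support vector machine on the full training data...
        SMO smo = new SMO();
        smo.buildClassifier(train);

        // ...and predict survival for every test instance.
        double[] predictions = new double[test.numInstances()];
        for (int i = 0; i < test.numInstances(); i++) {
            predictions[i] = smo.classifyInstance(test.instance(i));
        }
        return predictions;
    }
}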

Summary

I focused my approach on increasing the amount of useful information by i) extracting information from the given features and ii) identifying features which are significantly different between survived and deceased passengers. For the latter I employed p-value generating statistical tests, and performed feature selection using a significance threshold. A friend of mine, Hugo Bowne-Anderson, is discussing an alternative approach using a correlation measure to rank features (I will link the post when it is online). I then trained a support vector machine on a subset of the feature space. I correctly predict the outcome for 80% of the test passengers, as judged by the Kaggle score.

Comments

  1. Nice writeup, very informative, and interesting to see how you considered the available variables. Thanks for taking the time to publish it. A couple of suggestions for continuing your investigation: consider the joint distribution/correlation, so that for example young+male had an improved probability of survival over old+male etc. You could also discretise the continuous variables. Just going through your wikipedia links and learning a lot along the way :)

    Spotted a small typo - deceased rather than diseased.