Kaggle: Africa Soil Property Prediction Challenge

The Kaggle Africa Soil Property Prediction Challenge asked for the best model to predict five soil properties (Ca, P, pH, SOC and Sand) from infrared spectroscopy data and a few additional features. The competition posed a series of interesting challenges, including regression of multi-parameter targets, unbalanced training and test sets, a significant overfitting problem and a public leaderboard derived from a non-representative subset of the test data.

Data preprocessing

All preprocessing presented below was performed in R and my script can be downloaded here.

The training data contained roughly three-fold more features than samples, immediately indicating a potential overfitting problem. My data preprocessing therefore focused on dimensionality reduction without losing crucial information.

But before starting with feature reduction, I performed a normalization which significantly improved downstream prediction performance: all spectra show a characteristic decay from low to high wavenumbers (see Figure 1A). I quantified the decay by fitting an exponential function to each spectrum and subtracted the fit, resulting in more evenly distributed intensities over the wavenumber range. This normalization collapsed the spectral data into more similar traces, enabling better comparison (see Figure 1B).
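
The actual preprocessing lives in the R script linked above; as an illustration of the decay fit, here is a minimal Python sketch (the model form and starting values are my assumptions, not taken from the original script):

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(wavenumber, a, b, c):
    # Simple exponential decay model for the spectral baseline.
    return a * np.exp(-b * wavenumber) + c

def normalize_spectrum(wavenumbers, intensities):
    # Fit the baseline decay and subtract it from the spectrum,
    # leaving the peaks on a roughly flat background.
    p0 = (intensities.max(), 1e-4, intensities.min())  # rough starting guess
    popt, _ = curve_fit(decay, wavenumbers, intensities, p0=p0, maxfev=10000)
    return intensities - decay(wavenumbers, *popt)
```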

Figure 1A: Raw spectral data. Relative intensities vs wavenumber are plotted for 10 examples. Sample IDs are given in the legend.

Figure 1B: Decay normalized spectral data. Relative intensities vs wavenumber are plotted.

Visualizing all spectra in one heat map identifies peaks common to all spectra as well as spectrum-specific peaks (Figure 2A). Clustering identifies distinct groups of similar spectra. To characterize the spectral data better, I added a few further features, including the characteristic decay parameters described above as well as summary statistics like the median, mean and standard deviation of each spectrum.
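
The summary-statistic features are one-liners; a sketch with dummy data standing in for the spectra (the decay parameters would come from the fit shown above):

```python
import numpy as np

# Dummy stand-in for the (n_samples, n_wavenumbers) matrix of spectra.
spectra = np.random.default_rng(0).normal(size=(1157, 3578))

# One summary feature per spectrum: median, mean and standard deviation.
extra_features = np.column_stack([
    np.median(spectra, axis=1),
    spectra.mean(axis=1),
    spectra.std(axis=1),
])
print(extra_features.shape)  # (1157, 3)
```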

To reduce dimensionality, I decomposed the data into its principal components (PCA). Principal components were ordered by their contribution to the variance. The fraction of variance explained by each of the first 20 principal components is plotted in Figure 2B. The fast decay indicates that the data can be well approximated by few principal components. In other words, the information contained in > 3000 features can be compressed into a few principal components while losing little information, the prerequisite for successful dimensionality reduction. The number of principal components used for prediction was determined during model selection (see below).
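
In scikit-learn terms, the decomposition and the scree values of Figure 2B look like this (a sketch with dummy data; the original analysis was done in R):

```python
import numpy as np
from sklearn.decomposition import PCA

# Dummy stand-in for the decay-normalized spectra.
spectra = np.random.default_rng(0).normal(size=(1157, 3578))

pca = PCA(n_components=20)
scores = pca.fit_transform(spectra)

# Fraction of total variance explained by each component
# (the scree curve of Figure 2B).
print(pca.explained_variance_ratio_)
```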

Figure 2A: Visualization of decay normalized spectral data. Columns correspond to wavenumber and color to intensity (see legend). Each row corresponds to one spectrum. Rows are ordered by clustering.

Figure 2B: Principal component analysis. The fraction of total variance of the data explained by each principal component is shown. Principal components are ordered by contribution.

Balancing training and test data sets

During preprocessing and the first machine learning runs, I noticed that the features of the training and test data follow distinct distributions. Unbalanced training and test sets pose a significant problem by i) forcing extrapolation to unobserved ranges and/or ii) giving too much weight to outlier values.

Since generating more training data is impossible, I matched the training set to the test set by resampling. To find the training samples that best represent the test data, I performed a nearest neighbor search in principal component space, returning the n most similar training samples for each test sample. Only training samples returned for at least one test sample were kept. Matching was performed in Python using scikit-learn algorithms and the code can be found here.
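
A minimal sketch of this matching step with scikit-learn (the value of n and the dummy data are illustrative, not the competition settings):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def match_training_set(train_pcs, test_pcs, n=5):
    # For each test sample, find its n nearest training samples in
    # principal-component space; keep every training sample that is
    # returned at least once.
    nn = NearestNeighbors(n_neighbors=n).fit(train_pcs)
    _, idx = nn.kneighbors(test_pcs)      # (n_test, n) indices into train_pcs
    return np.unique(idx.ravel())

# Dummy demo in a 2-component space.
rng = np.random.default_rng(0)
keep = match_training_set(rng.normal(size=(100, 2)), rng.normal(size=(30, 2)))
print(len(keep), "of 100 training samples kept")
```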

While this downsampling technique generates a training set more similar to the test set, the sample reduction itself can have critical negative effects on prediction performance and has to be assessed carefully.

Learning: Deep neural networks

To predict the soil properties from the preprocessed data, I used a deep neural network implemented in the H2O framework. Specifically, I used an ensemble of 20 deep learning networks, each containing 100 neurons in each of its 2 hidden layers. For each target value, a grid search was performed to identify the optimal regularization parameters. Model performance was quantified by 20-fold cross-validation on the training data, calculating the root-mean-square error over all target values. The code to interact with the remote H2O server can be found here and is a slight modification of a script generated by Arno Candel.
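
Purely as an illustration, a single model of this kind can be set up with H2O's Python API roughly as follows (file name, column names and regularization values are placeholders; the ensemble loop and grid search are omitted):

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

# Placeholder file and column names; the real runs used the preprocessed
# principal-component features and one model per target (Ca, P, pH, SOC, Sand).
train = h2o.import_file("train_preprocessed.csv")
features = [c for c in train.columns if c.startswith("PC")]

model = H2ODeepLearningEstimator(
    hidden=[100, 100],   # 2 hidden layers with 100 neurons each
    l1=1e-5, l2=1e-5,    # regularization; placeholder values, tuned by grid search
    nfolds=20,           # 20-fold cross-validation
)
model.train(x=features, y="Ca", training_frame=train)
print(model.rmse(xval=True))
```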

Figure 3: Cross-validation error (RMSE) vs number of principal components. Data is shown for unprocessed (grey) decay-normalized (orange) and decay-normalized and test-set-matched (blue) data.

To identify the optimal number of principal components, I plotted the cross-validation error as a function of principal component count for the unprocessed (Raw), the decay-subtracted (Decay Normalized) and the decay-subtracted and test-set-matched data (Decay Normalized and Matched) (Figure 3). All three curves follow a typical pattern: the error first decreases as informative features are added, then increases again as the model starts to overfit. The minimal cross-validation error was achieved using the first 64 principal components of the decay-normalized, test-set-matched training data (RMSE = 0.36). This winning model was then applied to predict on the test data.

Conclusion

Like many other competitors, I was surprised by the huge discrepancy between the cross-validation error and the error on the actual test data. This boils down to one simple diagnosis: overfitting on the training set. While I tried to counteract this phenomenon by dimensionality reduction, model averaging (an ensemble of networks) and matching of training and test data, this was clearly not sufficient. A more effective (and time-consuming) strategy would have been to train distinct learning algorithms and merge their individual predictions.

Neuroph: GraphML export

Neuroph is an awesome Java neural network framework for machine learning. While I use the framework’s API, it also provides a graphical user interface which allows visualization of neural networks. Unfortunately, the visualization fails for networks with a large number of nodes. Here I introduce a GraphML export functionality for Neuroph, allowing visualization in Gephi.

In my work I use neural networks with several hundred input and hidden nodes. I would like to use interactive visualization and some graph theory algorithms to perform exploratory analyses. For such analyses I normally use Gephi and the R package igraph, respectively. Both of these tools can import network architectures saved in the GraphML format, which uses XML syntax to describe the structural properties of graphs.

I contributed a package to the Neuroph project, neuroph.contrib.graphml, which allows exporting a trained neural network as a GraphML file, specifying the network topology as well as node labels and learned weights. Export is performed with a few lines of code, as shown in Example.java.
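
The exported GraphML is plain XML. To illustrate its structure, here is a minimal Python sketch that writes an analogous file for a made-up two-node network (the node IDs and weight value are invented; the actual export is done by the Java package):

```python
import xml.etree.ElementTree as ET

# Build a minimal GraphML document for a toy two-node network.
root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")

# Declare the edge attribute 'weight' so Gephi can color edges by it.
ET.SubElement(root, "key", id="weight",
              attrib={"for": "edge", "attr.name": "weight",
                      "attr.type": "double"})
graph = ET.SubElement(root, "graph", edgedefault="directed")

for node_id in ("in_0", "hidden_0"):
    ET.SubElement(graph, "node", id=node_id)

edge = ET.SubElement(graph, "edge", source="in_0", target="hidden_0")
data = ET.SubElement(edge, "data", key="weight")
data.text = "0.42"  # learned weight

ET.ElementTree(root).write("toy_network.graphml", xml_declaration=True)
```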

Visualization of the exported network allows easy identification of important network nodes. Let's define important nodes as those connected by edges with large absolute weight. The GraphML file is imported into Gephi, the network is visualized in a space-filling way, and edges are colored by weight:

Neural network. Nodes are represented as circles, edges as green lines connecting nodes. Edges are colored according to weight.

While this visualization is aesthetically appealing, the edges with large weight are difficult to spot due to the number of edges. This is where the interactive potential of Gephi comes into play. We can selectively show edges with large positive weight (left) or large negative weight (right):

Edges with large positive weight.

Edges with large negative weight.

We can thus easily identify nodes with high contributing potential. While this example shows only the start of an exploratory analysis, I hope it demonstrates how quick and interactive such analyses become with the GraphML export functionality for Neuroph.

Kaggle: Titanic Competition

Introduction

I’d like to share my approach to the Kaggle Titanic competition. The objective of this competition is to use machine learning to predict which passengers were likely to survive the Titanic disaster. The training data contains information (gender, age, class etc.) on survived and deceased passengers. We use this data to train a classifier which incorporates useful features to predict passenger survival.

My approach can be divided into two parts: pre-processing and classification. I spent most of my time on preprocessing the data to maximize the use of the available information, a focus reflected in the sizes of the respective sections. I use a combination of R with ggplot2 and Java with the machine learning library weka. You’ll find the code I wrote on my github page.

Pre-processing - Parsing

The first task seems trivial: Parsing. Let’s take a look at one row:

32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C

We see good and bad things. Bad things first: i) the name string looks non-trivial to parse, and ii) some of the features have missing values. The good thing is that there is a lot of information hidden within single features. Take for example the name feature. The title can tell us about the social standing and profession of a passenger. Does a Mister have the same chance to survive as a Master or a Reverend? Surnames can indicate families and therefore grouping. Another interesting feature is the Cabin feature (B78). Is the deck (B) relevant for survival? Do passengers with more than one cabin have distinct survival chances?

I wrote a simple parser in Java to i) replace missing values with the weka standard ‘?’ and ii) extract more information from several features. Specifically, I split the original ‘Name’ feature into ‘Surname’ and ‘Title’. I split the ‘Ticket’ feature into the leading characters (‘TicketId’) and the numeric value (‘TicketNr’). I further split the ‘Cabin’ feature into how many cabins the passenger booked (‘CabinCount’), on which deck the cabins are located (‘CabinDeck’) and the cabin number (‘CabinNr’). I also keep the original ‘Cabin’ as a potential grouping feature. The new data contains the following features:

PassengerId,Survived,Pclass,Surname,Title,Sex,Age,SibSp,Parch,TicketId,TicketNr,Fare,CabinCount,CabinDeck,CabinNr,Cabin,Embarked

You can download the newly formatted training and test data from my github page.
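
The actual parser is written in Java; to make the splits concrete, here is a rough Python sketch of the same idea (the regexes, helper name and ‘?’ handling are simplifications of mine):

```python
import re

def parse_passenger(row):
    # Rough sketch of the feature splits; missing values become the
    # weka standard '?'.
    surname, rest = row["Name"].split(",", 1)
    title = rest.strip().split(" ")[0].rstrip(".")           # e.g. 'Mrs'

    m = re.match(r"(.*?)(\d+)?$", row["Ticket"].strip())
    ticket_id = m.group(1).strip() or "?"                    # e.g. 'PC'
    ticket_nr = m.group(2) or "?"                            # e.g. '17569'

    cabins = row["Cabin"].split()                            # e.g. ['B78']
    cabin_count = len(cabins) if cabins else "?"
    cabin_deck = cabins[0][0] if cabins else "?"
    cabin_nr = cabins[0][1:] if cabins and len(cabins[0]) > 1 else "?"

    return {"Surname": surname, "Title": title, "TicketId": ticket_id,
            "TicketNr": ticket_nr, "CabinCount": cabin_count,
            "CabinDeck": cabin_deck, "CabinNr": cabin_nr}

row = {"Name": "Spencer, Mrs. William Augustus (Marie Eugenie)",
       "Ticket": "PC 17569", "Cabin": "B78"}
print(parse_passenger(row))
```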

Pre-processing - Data exploration

Let’s get a feeling for the data! For this section I wrote my code in R due to its strength in visualization and statistics. Before we focus on features, let’s calculate the fraction of passengers that survived: 342 of 891, i.e. roughly 0.38. This is a bit higher than expected from a totally unbiased sample (roughly 0.32, see wikipedia). Maybe the nice people at Kaggle gave us training data in which both classes are almost equally represented. We have two feature types, nominal and numerical. Let’s look at both types independently and try to identify those useful for our classification problem.

Feature selection - Nominal

We want to identify nominal features for which a given value provides additional information on the survival probability compared to the base rate. Take for example the feature "Sex". Is the survival probability for male and female passengers distinct? Interestingly, about 74% of all female passengers in our training data survived, compared to only about 19% of the male passengers. This looks interesting, but can we quantify how likely such a separation could have occurred by chance alone? We can model the chance of survival, neglecting any features, as a Bernoulli trial for each passenger. This simulates a coin flip with a biased coin such that the probability of success equals the population average. In the training data we observe 342 survivors out of 891 trials, therefore a base survival rate of $p = 342/891 \approx 0.38$. The probability to get $k$ survivors in $n$ trials, assuming the base rate, is given by the binomial distribution $P(k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}$.

How does this probability distribution help us to identify useful nominal features? We use it to formulate a null hypothesis, stating that the observed number of survived passengers for a given class (e.g. female) is generated by the base survival rate. Importantly, we can use the binomial test to derive a p-value for $k$ survivors out of $n$ passengers given our null hypothesis. The p-value reflects the probability to observe a result at least as extreme as the one observed, given the null hypothesis. Let us come back to our example: 233 out of 314 female passengers survived. Figure 1 shows the probability distribution of $k$ survivors given $n = 314$ passengers and the base survival rate of $p = 0.38$. The vertical line indicates the observed number of survivors ($k = 233$).

Figure 1. Probability to observe k survivors given n=314 female passengers and a population survival rate of p=0.38. The vertical line indicates the number of survivors observed.

The p-value of the observed number of survivors given the null hypothesis is the sum of all probabilities for $k \geq 233$ (right of the line) and is practically zero. We can almost certainly reject the null hypothesis and therefore identify the feature "Sex" as important for classification.
Thus, a p-value in conjunction with a threshold gives us a statistical handle to select features. Importantly, we have to correct the p-values to take into account that we test one hypothesis for each feature class. The multiple testing problem becomes obvious when you notice that each individual test accepts a small probability of incorrectly rejecting the null hypothesis. Testing a sequence of hypotheses makes the occurrence of at least one such false positive more likely. If you choose a p-value threshold of 0.05 and test 50 hypotheses, the probability to falsely reject the null hypothesis at least once is $1 - 0.95^{50} \approx 0.92$. There are a variety of methods to correct for multiple hypothesis testing. I chose the Bonferroni correction, which minimizes the number of false positives at the cost of accepting a significant number of false negatives.
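
A minimal sketch of the test plus correction in Python (the original analysis was in R; scipy’s binomtest, available since scipy 1.7, plays the role of the binomial test, and the counts are the female-passenger example from above):

```python
from scipy.stats import binomtest

n_total, k_total = 891, 342      # training passengers and survivors
p_base = k_total / n_total       # base survival rate, about 0.38

# Female passengers: 233 survivors out of 314; one-sided test for k >= 233.
result = binomtest(k=233, n=314, p=p_base, alternative="greater")

n_tests = 50                     # one test per feature class (illustrative)
p_corrected = min(1.0, result.pvalue * n_tests)   # Bonferroni correction
print(result.pvalue, p_corrected)
```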

Keeping only features that contain at least one class with a p-value < 0.01 yields five features (Pclass, Title, Sex, CabinDeck and Embarked). While some of these features seem intuitive (Pclass, Sex), others are more surprising (Embarked). Figure 2 shows the class-specific survival fractions.

Figure 2. Class-specific survival fractions for features with significant difference to the survival base rate. Horizontal lines indicate base rate survival. Width of bars scale with instances per class.

The grey horizontal line indicates the average survival rate over the entire training population (base rate). The width of each bar scales with the number of instances for each class of a nominal feature. The distance from the average survival rate in combination with the width of the bar gives us a visual intuition of the significance of the difference.

Feature selection - Numerical

Let us now turn to the numerical features (PassengerId, Age, SibSp, Parch, TicketNr, Fare, CabinCount, CabinNr). Again, we seek to identify features whose values can separate between survived and deceased passengers. Figure 3 shows the empirical cumulative density distribution of values for each feature separated by survival.

Figure 3. Empirical cumulative density distribution of values for each feature, split by survived (green) and deceased (red) passengers. Note that values are normalized between 0 and 1 for all features.

We can immediately see that some numerical features seem indistinguishable between both classes of passengers (e.g. PassengerId) whereas others seem distinct (e.g. Fare). Can we use a statistical test to quantify this subjective difference? The Kolmogorov-Smirnov test allows us to assign a probability that two sets of values are drawn from the same distribution. Importantly, this test is non-parametric and therefore applicable without knowing the type of the underlying distribution. The Kolmogorov-Smirnov-derived, Bonferroni-corrected p-values identify four interesting features with p-values < 0.01 (SibSp, Parch, TicketNr, Fare). Interestingly, we cannot reject the null hypothesis for the feature Age. Visual inspection indicates that age may play a predictive role for small values only, so let us include Age in our set of features nevertheless.
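
In Python terms, the per-feature test looks like this (dummy Fare-like data stands in for the real values; the counts 342 and 549 are the survived and deceased totals in the training data):

```python
import numpy as np
from scipy.stats import ks_2samp

# Dummy stand-in for the Fare values split by outcome.
rng = np.random.default_rng(0)
fares_survived = rng.lognormal(3.5, 1.0, 342)
fares_deceased = rng.lognormal(2.9, 1.0, 549)

# Two-sample Kolmogorov-Smirnov test: are the two sets drawn from
# the same distribution?
stat, pvalue = ks_2samp(fares_survived, fares_deceased)
print(pvalue * 8)   # Bonferroni: 8 numerical features tested
```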

Classification

The data exploration phase identified a subset of interesting features. This does not mean that all of them are useful (they might be redundant) or that they are the only features with useful information. It simply states that these features show significantly different values for survived and deceased passengers.

Let’s use these features as a starting point to train our classifier. For classification we use Java with extensive usage of the weka library. You can find the main class here. We first load our training and test data into the Instances class, the weka-specific data representation. Before starting classification, we have to define the type (nominal or numerical) of each feature. This is necessary because the input format (csv) does not contain type information. To make our life easy we match training and test data by i) adding the missing 'Survived' feature to the test data (without assigning values) and ii) merging the training and test classes for each feature. These steps yield identical formats for both data sets and prepare the data for efficient filtering and classification within the weka framework. We then specify the classification feature (Survived) and normalize all numeric features between 0 and 1. The latter ensures that all numeric features have identical weight at the beginning of training.

For classification we use the built-in sequential minimal optimization algorithm for support vector machines (SMO class). We can estimate the classification performance using the cross-validation functionality of the Evaluation class. Leave-one-out cross-validation predicts a classification error of 0.178. We then train our classifier on the entire training data and classify all test instances. Submission to Kaggle yields a score of 0.780. As mentioned above, we might have missed useful features in our subset. Let’s see if we can improve that score by adding features back to the subset defined in the data exploration phase, using greedy forward selection: we want to maximize prediction performance while keeping the number of features minimal. This approach adds the nominal features Surname and Cabin to our feature space. Cross-validation of the classifier trained on the extended feature space predicts an error of 0.144. Submission to Kaggle scores 0.804.
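
My implementation uses weka’s SMO; as a language-neutral illustration of the greedy forward selection, here is a Python/scikit-learn sketch (an SVC stands in for SMO, and the data and feature names are synthetic):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def greedy_forward_selection(X, y, feature_names):
    # Repeatedly add the single feature that most improves
    # cross-validated accuracy; stop when no feature helps.
    selected, best_score = [], 0.0
    remaining = list(range(len(feature_names)))
    improved = True
    while improved:
        improved = False
        for f in remaining:
            score = cross_val_score(SVC(kernel="linear"),
                                    X[:, selected + [f]], y, cv=10).mean()
            if score > best_score:
                best_score, best_f, improved = score, f, True
        if improved:
            selected.append(best_f)
            remaining.remove(best_f)
    return [feature_names[i] for i in selected], best_score

# Demo on synthetic data: only the first two columns carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(greedy_forward_selection(X, y, ["f0", "f1", "f2", "f3", "f4"]))
```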

Summary

I focused my approach on increasing the amount of useful information by i) extracting information from given features and ii) identifying features which differ significantly between survived and deceased passengers. For the latter I employed p-value generating statistical tests, performing feature selection with a significance threshold. A friend of mine, Hugo Bowne-Anderson, discusses an alternative approach using a correlation measure to rank features (I will link the post when it is online). I then trained a support vector machine on a subset of the feature space. The final model correctly predicts the outcome for 80.4% of the passengers, as judged by the Kaggle score.