Kaggle Africa Soil Property Prediction Challenge
The Kaggle Africa Soil Property Prediction Challenge asked for the best model to predict 5 soil properties (Ca, P, pH, SOC and Sand) from infrared spectroscopy data and a few additional features. The competition posed a series of interesting challenges, including regression on multiple target values, unbalanced training and test sets, a significant overfitting problem and a non-representative subset used to compute the preliminary leaderboard positions.
All preprocessing presented below was performed in R and my script can be downloaded here.
The training data contained ~3-fold more features than samples, immediately indicating a potential overfitting problem. My data preprocessing therefore focused on dimensionality reduction without losing crucial information.
Before starting with feature reduction, I performed a normalization which significantly improved downstream prediction performance: all spectra show a characteristic decay from low to high wavenumbers (see Figure 1A). I quantified this decay by fitting an exponential function to each spectrum and subtracting the fitted curve, resulting in more evenly distributed intensities over the wavenumber range. This normalization collapsed the spectra into more similar traces, enabling better comparison (see Figure 1B).
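The decay subtraction can be sketched as follows. This is a minimal Python sketch, not the original R script: the exact exponential model, starting values and variable names are my assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def subtract_decay(wavenumbers, intensities):
    """Fit an exponential decay a*exp(-b*x) + c to one spectrum and
    return the decay-subtracted spectrum plus the fitted parameters."""
    def decay(x, a, b, c):
        return a * np.exp(-b * x) + c
    # Rescale wavenumbers to [0, 1] so the optimizer is well conditioned
    x = (wavenumbers - wavenumbers.min()) / np.ptp(wavenumbers)
    params, _ = curve_fit(decay, x, intensities, p0=(1.0, 1.0, 0.0), maxfev=10000)
    return intensities - decay(x, *params), params

# Toy spectrum: an exponential baseline plus one Gaussian peak
wn = np.linspace(600, 7500, 3578)
xs = (wn - wn.min()) / np.ptp(wn)
spectrum = 2.0 * np.exp(-3.0 * xs) + 0.5 * np.exp(-((wn - 3000) ** 2) / 5e4)
flat, params = subtract_decay(wn, spectrum)
```

After subtraction, the overall downward trend is gone and the peak remains, which is what makes spectra with different baselines comparable.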
Visualizing all spectra in one heat map identifies peaks common to all spectra as well as spectrum-specific peaks (Figure 2A). Clustering identifies distinct groups of similar spectra. To characterize the spectral data further, I added a few extra features: the characteristic decay described above as well as summary statistics like the median, mean and standard deviation of each spectrum.
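Computing these extra features per spectrum is straightforward; a minimal sketch (the feature names are my own):

```python
import numpy as np

def extra_features(intensities, decay_rate):
    """Summary features for one spectrum: the fitted decay rate plus
    simple distribution statistics of the intensities."""
    return {
        "decay_rate": decay_rate,
        "median": float(np.median(intensities)),
        "mean": float(np.mean(intensities)),
        "sd": float(np.std(intensities, ddof=1)),  # sample standard deviation
    }

feats = extra_features(np.array([1.0, 2.0, 3.0, 4.0]), decay_rate=3.0)
```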
To reduce dimensionality, I decomposed the data into its principal components (PCA). Principal components are ordered by their contribution to the variance; the fraction of variance explained by each of the first 20 principal components is plotted in Figure 2B. The fast decay indicates that the data can be well approximated by a few principal components. In other words, the information contained in > 3000 features can be compressed into a few principal components while losing little information, which is the prerequisite for successful dimensionality reduction. The number of principal components used for prediction was determined during model selection (see below).
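The decomposition can be illustrated with scikit-learn (the preprocessing itself was done in R; this toy matrix with a few latent directions only mimics the shape of the spectra data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the spectra matrix: 200 samples generated from only
# 5 latent directions across 3000 "wavenumber" features, plus weak noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 3000))
X += 0.01 * rng.normal(size=X.shape)

pca = PCA(n_components=20)
scores = pca.fit_transform(X)              # per-sample PC coordinates
explained = pca.explained_variance_ratio_  # fraction of variance per PC
# On data like this, the explained variance drops off sharply after the
# 5th component, the same fast decay seen in the scree plot
```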
Balancing training and test data sets
During preprocessing and first machine learning runs I noticed that the features of test and training data follow distinct distributions. Unbalanced training and test sets pose a significant problem by i) forcing extrapolation to unobserved ranges and/or ii) giving too much weight to outlier values.
Since generating more training data is impossible, I matched the training set to the test set by resampling. To find the training samples best representing the test data, I performed a nearest-neighbor search in principal component space to return the n most similar training samples for each test sample. Only training samples which were returned for at least one test sample were kept. Matching was performed in Python using scikit-learn algorithms and can be found here.
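The matching step can be sketched with scikit-learn's NearestNeighbors (variable names and the toy data are mine; the actual script is linked above):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def match_training_set(train_pcs, test_pcs, n_neighbors=5):
    """Return indices of training samples that appear among the
    n nearest neighbors (in PC space) of at least one test sample."""
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(train_pcs)
    _, idx = nn.kneighbors(test_pcs)   # shape (n_test, n_neighbors)
    return np.unique(idx.ravel())

rng = np.random.default_rng(1)
train_pcs = rng.normal(size=(100, 10))
test_pcs = rng.normal(loc=1.0, size=(30, 10))  # shifted test distribution
kept = match_training_set(train_pcs, test_pcs, n_neighbors=3)
# "kept" indexes the resampled, test-matched training set
```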
While this downsampling technique generates a training set more similar to the test set, the sample reduction itself can have critical negative effects on prediction performance and has to be assessed carefully.
Learning: Deep neural networks
To predict the soil properties from the preprocessed data, I used deep neural networks implemented in the H2O framework. Specifically, I used an ensemble of 20 deep learning networks, each with two hidden layers of 100 neurons. For each target value, I performed a grid search to identify the optimal regularization parameters. Model performance was quantified by 20-fold cross-validation on the training data, calculating the root-mean-square error over all target values. The code to interact with the remote H2O server can be found here; it is a slight modification of a script by Arno Candel.
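The two building blocks of this evaluation, averaging the ensemble's predictions and collapsing the per-target errors into one score, can be written in plain NumPy. This is an illustration of the scoring logic, not the H2O code linked above:

```python
import numpy as np

def combined_rmse(y_true, y_pred):
    """RMSE per target column, averaged into a single score
    (mean column-wise RMSE over the five soil properties)."""
    per_target = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return per_target.mean()

def ensemble_predict(models, X):
    """Average the predictions of all networks in the ensemble."""
    return np.mean([m.predict(X) for m in models], axis=0)

# Sanity check: if every prediction is off by exactly 1,
# each column's RMSE is 1 and so is the combined score
y_true = np.zeros((4, 5))
y_pred = np.ones((4, 5))
score = combined_rmse(y_true, y_pred)
```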
To identify the optimal number of principal components, I plotted the cross-validation error as a function of the number of principal components for the unprocessed (Raw), the decay-subtracted (Decay Normalized) and the decay-subtracted, test-set-matched (Decay Normalized and Matched) data (Figure 3). All three curves follow a typical pattern: the error first decreases as important features are added, then increases again as the model starts to overfit. The minimal cross-validation error was achieved using the first 64 principal components of the decay-normalized, test-matched training set (RMSE = 0.36). This winning model was then applied to the test data.
Like many other competitors, I was surprised by the large discrepancy between the cross-validation error and the error on the actual test data. This boils down to one simple diagnosis: overfitting on the training set. While I tried to counter this phenomenon by dimensionality reduction, model averaging (an ensemble of networks) and matching of training and test data, that was clearly not sufficient. A more effective (and time-consuming) strategy would have been to train distinct learning algorithms and merge their individual predictions.