Regression Modeling on a Groundwater Chemical Spill

Sometimes chemical contaminants get into groundwater aquifers. We want to get them out. To get them out, we have to know where they are in the aquifer - which, of course, is dozens to hundreds of feet underground and obscured from view. So we take samples. But every sample is expensive and time- and labor-intensive, and often involves lugging machinery like this out to remote areas:

 

(Image: a groundwater drilling rig. CC BY-SA 3.0, https://en.wikipedia.org/w/index.php?curid=5434651)

So what if you could take a small number of samples and then tentatively predict where the plume is - and perhaps where else you had to sample to make your prediction better - purely statistically?

I was very lucky to come across a full, publicly available contaminant plume dataset; these are usually proprietary. This one is an old wastewater plume at the Massachusetts Military Reservation. Here is a map of the site, directly from the journal article describing the case, with the sampling locations indicated as black dots and the approximate plume boundaries in yellow:

(Savoie et al., 2012)

The data came to me as three tables - well construction data (including coordinates and depth), organic contaminants, and inorganic contaminants - all of which had to be cleaned and joined on well location. (Handy reminder: mixing decimal degrees with lat/long DMS can really mess up your maps!) According to the journal article, this plume was characterized primarily by high nitrate levels with some chloride and boron, so I picked nitrate concentration as a first-pass target variable. Since the goal is to predict nitrate concentrations at neighboring sampling locations, my initial features were the spatial coordinates.
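Here's a minimal sketch of that cleaning-and-joining step. The column names (well_id, lat_deg, and so on) are hypothetical stand-ins - the real tables used different headers - but the DMS-to-decimal-degrees normalization is the important part:

```python
import numpy as np
import pandas as pd

def dms_to_decimal(deg, minutes, seconds):
    """Vectorized degrees/minutes/seconds -> decimal degrees (sign from degrees)."""
    sign = np.where(deg < 0, -1.0, 1.0)
    return sign * (np.abs(deg) + minutes / 60.0 + seconds / 3600.0)

wells = pd.read_csv("well_construction.csv")   # coordinates + depth
organics = pd.read_csv("organics.csv")
inorganics = pd.read_csv("inorganics.csv")

# Normalize coordinates to decimal degrees *before* joining, so DMS and
# decimal-degree values never end up mixed in the same column.
wells["lat"] = dms_to_decimal(wells["lat_deg"], wells["lat_min"], wells["lat_sec"])
wells["lon"] = dms_to_decimal(wells["lon_deg"], wells["lon_min"], wells["lon_sec"])

# Join the two contaminant tables, then attach each sample to its well location.
samples = organics.merge(inorganics, on="well_id", how="outer")
df = samples.merge(wells[["well_id", "lat", "lon", "depth_ft"]], on="well_id")
```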

When we visualize the distribution of the target variable, we get something that looks like this:

...does that make you nervous? It makes me nervous!

The data range over three orders of magnitude, with a very heavy tail. Environmental contamination data sets are notorious for this kind of distribution and variability: from a common-sense standpoint, most samples are going to be low-contamination, but occasionally they will be very high, and it matters when they are.

We can anticipate that this will cause problems in our model - and it did. More on that later.

The other major challenge in this data set is its relatively small size. Unfortunately, this is par for the course for groundwater data (see the giant drilling rig above - and the underlying motivation for this analysis in the first place). In fact, this data set is fairly comprehensive compared to the average groundwater data set, especially a publicly available one. So "get a larger data set" isn't really an option for this kind of environmental data - any machine learning solution for problems in this field is simply going to have to take this challenge into account. I knew overfitting was going to be a major concern, and I didn't have much data to spare for validation and holdout. I split off 30% to use for testing, and trained and cross-validated on the remainder.
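In code, the split looks something like this (a sketch - `df` is the joined table from above, and the column names are still stand-ins):

```python
from sklearn.model_selection import train_test_split

X = df[["lat", "lon", "depth_ft"]].values   # spatial coordinates as features
y = df["nitrate"].values                    # first-pass target variable

# 30% held out for final testing; training and cross-validation use the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```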

The first model I attempted was a Gaussian process regression (hereafter, GPR), implemented with scikit-learn's GaussianProcessRegressor. I picked this to start because of the attractive possibility of training on a comparatively sparse data set and getting back not only a model fit, but also the areas of highest uncertainty (which GPRs provide readily) - perhaps to inform future environmental sampling at the same remediation site. It's natural to think of a GPR as modeling an underlying mathematical process that generates noisy data, which seemed like a good conceptual fit for our situation of a spreading plume (tidy math) being sampled in the real world (messy reality). I spent a lot of time tuning this model. I tested RBF and exponential kernels head-to-head [1], and tuned parameters (mostly the white noise kernel contribution; in theory, the sklearn implementation automatically optimizes GPR hyperparameters during the model runs) to maximize test set predictive accuracy.
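For reference, here is roughly what that setup looks like. I'm expressing the exponential kernel as a Matern kernel with nu=0.5 (they're equivalent); the actual length scales and noise levels I ended up with aren't shown:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, Matern, WhiteKernel, ConstantKernel,
)

# Exponential covariance (Matern with nu=0.5) plus a white-noise term for
# measurement noise; swap in RBF() for the head-to-head comparison.
kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=0.5) \
    + WhiteKernel(noise_level=1.0)

gpr = GaussianProcessRegressor(
    kernel=kernel,
    normalize_y=True,         # target is nowhere near zero-mean
    n_restarts_optimizer=10,  # sklearn re-optimizes kernel hyperparameters
)
gpr.fit(X_train, y_train)

# The payoff: predictions *plus* per-point uncertainty, usable to decide
# where to sample next.
y_pred, y_std = gpr.predict(X_test, return_std=True)
```

Here is a 3-dimensional (rotate- and zoom-able!) spatial visualization of the predicted plume concentrations in the best model run: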

Sanity-checking this: in a relatively homogeneous medium, we would expect the highest concentrations to be found 1) near the surface towards the upgradient side, in the direction of travel; and 2) at the center (midpoint) of the plume perpendicular to the direction of travel, tapering off towards the edges. By rotating the visualization above, we can see this appears to be generally the case, so the model has the right idea. However, the few random low-concentration points scattered throughout, along with the overall narrow range of predicted concentrations compared to the data set, definitely reflect the difficulty of training on such a small, high-variability dataset. More problematically, although accuracy scores aren't the beginning and end of the story, the test set R² of this model was pretty bad. So I went back to feature engineering, and gave that nitrate distribution more scrutiny.

"Data that ranges over several orders of magnitude and looks skewed towards the left side of the distribution" makes me think log-normal, so I tried using log(nitrate) instead.2 That looked a lot more promising:

(Figure: distribution of log(nitrate))

We've still got the problem of a lot of observations at zero, but at least the long-tail issue has been resolved. Re-running the model on the transformed target increased the R² from ~0.01 to ~0.20, a huge improvement for a tough data set.
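One way to handle both the heavy tail and the zeros at once is log1p, which is defined at zero and compresses the upper tail - a sketch; a plain log with a small offset would work similarly:

```python
import numpy as np

y_train_log = np.log1p(y_train)   # log(1 + nitrate); maps 0 -> 0
gpr.fit(X_train, y_train_log)

# Invert the transform when reporting predicted concentrations.
y_pred_log, y_std = gpr.predict(X_test, return_std=True)
y_pred = np.expm1(y_pred_log)
```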

I still wasn't too happy with the performance of the GPR implementation I was using, so I switched entirely to a random forest regressor, just to see how much more performance could theoretically be squeezed out. It did incredibly well: R² = 0.5.
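The swap itself is a few lines (the hyperparameters here are illustrative defaults, not necessarily the ones behind that R² figure):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import numpy as np

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train_log)

y_test_log = np.log1p(y_test)   # same transform as the training target
print("test R^2:", r2_score(y_test_log, rf.predict(X_test)))
```

And in case that score alone seems misleading, here are visualizations of the model-predicted log(nitrate) concentrations compared to the ground-truth data: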

I was happy enough with these results for the time being. For future work with this data set, I'd like to train a GPR iteratively, with each next sample placed at the point of maximum predicted uncertainty, to determine whether this is a productive strategy for minimizing the number of sampling points while maintaining reasonable model accuracy. I'd also like to see whether adding "neighboring concentrations of other contaminants" (boron, or chloride) as features is helpful for predicting target-variable concentrations. [3]
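As a rough sketch of what that iterative strategy might look like (n_new_wells, candidate_locations, and measure_nitrate are hypothetical stand-ins for a sampling budget, a grid of possible well positions, and an actual field measurement):

```python
import numpy as np

def next_sampling_point(gpr, candidate_locations):
    """Pick the candidate location where predictive uncertainty is highest."""
    _, std = gpr.predict(candidate_locations, return_std=True)
    return candidate_locations[np.argmax(std)]

# Hypothetical driver loop: each round, drill where the model is least sure.
for _ in range(n_new_wells):
    loc = next_sampling_point(gpr, candidate_locations)
    X_train = np.vstack([X_train, loc])                    # add the new well
    y_train_log = np.append(y_train_log, np.log1p(measure_nitrate(loc)))
    gpr.fit(X_train, y_train_log)                          # refit and repeat
```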


[1] The idea here was that the exponential kernel is the theoretically correct choice for describing the covariance of a contaminant plume spreading according to the fractional advection-dispersion equation; research indicates that contaminant plumes spreading in predominantly-sand aquifers like the one underlying the Massachusetts Military Reservation exhibit super-Fickian behavior. Whatever the underlying cause, my exponential kernel did perform much better than the RBF kernel.

[2] It also turns out that a lot of environmental data is log-normal - but we can discover this by examining the data, even if we don't know that fact ahead of time.

[3] And if someone who knows more about the GPR implementation in sklearn wants to help me make mine better, I'd be very excited! Documentation and internet-crowdsourced info for this implementation are minimal.