Creating Imaginary People with Statistics

A neat set of graphs was circulating on social media, called "The Effect of Life Events on Life Satisfaction". 

Life-events-on-happiness.png

I was curious about the methodology, so I dug into the source and discovered that there are two openly available versions of this paper. The 2006 version is an earlier draft, which provides a cool opportunity to look at these graphs in both raw-data and post-analysis form.

It's a great case study for analyzing confounding/lurking variables carefully, and thinking critically about the real-life interpretations of those variables. 

Here's briefly how the analysis worked. The authors used a large longitudinal dataset that included a self-reported "life satisfaction" score for participants, as well as data on various life events the participants were experiencing. They looked at life satisfaction in the years after the event or its onset ("lags") and in the years leading up to it ("leads"). The 2006 paper shows graphs of the raw happiness scores from their first analysis, in which they controlled only for "fixed" effects (essentially, accounting for the fact that happy people may self-select into happier life events, and vice versa). The 2007/8 paper shows the graphs from the stand-alone graphic, which come from their second analysis: a multivariate regression that aimed to isolate the effects of a wide variety of potentially confounding variables (e.g. personal health problems, age, other major life events). That unlabelled Y-axis is the regression coefficient on each lead or lag.
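Schematically, the regression behind those graphs looks something like this (my reconstruction of the general setup, not the authors' exact notation):

```latex
LS_{it} = \alpha_i + \sum_{k} \beta_k \, D^{(k)}_{it} + \gamma' X_{it} + \varepsilon_{it}
```

Here LS_it is person i's life satisfaction in year t, alpha_i is the individual fixed effect, D^(k)_it indicates being k years before (a lead) or after (a lag) the event, and X_it holds the extra controls added in the 2007/8 version. The plotted points are the estimated beta_k values for each lead and lag.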

Here's where comparing these graphs tells some interesting stories about lurking variables.

If you look at the raw 2006 data, widow(er)s look happier before their spouse's death (predictably), happiness drops around the death itself, and it recovers somewhat several years out, but not completely:

widow_06.JPG

However, when they run the regression analysis and control for education, nationality, number of children, age, household income, individual health concerns, marital status, and employment, that effect goes away - and if anything, the participants end up looking happier afterwards:

widow_08.jpg

Since all the variables were controlled for at once and we don't have access to the raw data, we can't tell exactly what happened here - but clearly, one or more of those lurking variables is responsible for self-reported happiness after the death of a partner being lower than the death alone would predict. Some (non-exhaustive) possibilities:

  • Children, being an additional burden to care for alone
  • Loss of a second source of income leading to cash flow problems
  • New health concerns

So this raises the question: when we remove these variables from the picture, how much does the remaining analysis reflect any situation in reality? When we ask a research question like "how much does widowhood affect happiness?", and that widowhood often leads to secondary happiness-affecting issues like sudden single parenthood, loss of household income, and so on, is it truly appropriate to remove those variables from the picture?

For a clearer example, let's look at layoffs:

LO_06.JPG
LO_08.JPG

These two graphs look very different, and when we look at the fine print, we discover that one of the things the second one is controlling for is income! So it's a realistic portrayal of... all those layoffs that don't result in a drop in income. (Ever had one of those? Me neither.) 

Now, to be fair, the authors did this on purpose: they were curious about the psychological effects of e.g. layoffs independent of the income effect. These graphs don't describe aggregated people in real-world situations, they describe vacuum effects. However, when these graphs escape from their original context into social media and the world at large, suddenly people are looking at them to answer a different question: "If I get laid off tomorrow, how will I feel in three years?" Now they're being interpreted as aggregated people. But they're people that don't (or rarely) exist: all those people that get laid off without losing income, or whose spouses die with no secondary effects whatsoever. It's a cautionary tale about statistics out of context.

Regression Modeling on a Groundwater Chemical Spill

Sometimes chemical contaminants get into groundwater aquifers. We want to get them out. In order to get them out, we have to know where they are in the aquifer - which, of course, is dozens to hundreds of feet underground and obscured from view. So we take samples. But every sample is expensive, time- and labor-intensive, and often involves lugging to remote areas machinery that looks like this:

 

CC BY-SA 3.0, https://en.wikipedia.org/w/index.php?curid=5434651


So what if you could take a small number of samples and then tentatively predict where the plume is - and perhaps where else you'd need to sample to make your prediction better - purely statistically?

I was very lucky to come across a complete, publicly available contaminant plume dataset; these are usually proprietary. This one is an old wastewater plume at the Massachusetts Military Academy. Here is a map of the site, directly from the journal publishing the case, with the sampling locations indicated as black dots and the approximate plume boundaries in yellow:

(Savoie et al., 2012)


The data came to me as three tables - well construction data (including coordinates and depth), organic contaminants, and inorganic contaminants, which all had to be cleaned and joined on well location. (Handy reminder: mixing decimal degrees with lat/long DMS can really mess up your maps!) According to the journal article, this plume was characterized primarily by high nitrate levels with some chloride and boron, so I picked nitrate concentrations as a first-pass target variable.  Since the goal is to predict neighboring sampling locations' nitrate concentrations, my initial features were the spatial coordinates. 
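In pandas terms, the cleaning-and-joining step looks roughly like the sketch below. The file and column names here are hypothetical stand-ins, not the actual names in the source tables, but the structure of the join is the same.

```python
import pandas as pd

# Hypothetical file and column names standing in for the three source tables.
wells      = pd.read_csv("well_construction.csv")        # well ID, coordinates, depth
organics   = pd.read_csv("organic_contaminants.csv")
inorganics = pd.read_csv("inorganic_contaminants.csv")

# All coordinates should be in one convention (decimal degrees) before joining;
# mixing decimal degrees with DMS is exactly the map-scrambling mistake above.

# Join everything on the well identifier.
df = (wells
      .merge(inorganics[["well_id", "nitrate_mg_L"]], on="well_id", how="inner")
      .merge(organics, on="well_id", how="left"))

# First-pass features: spatial coordinates. Target: nitrate concentration.
X = df[["longitude", "latitude", "depth_ft"]]
y = df["nitrate_mg_L"]
```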

When we visualize the distribution of the target variable, we get something that looks like this:

...does that make you nervous? It makes me nervous!


The data ranges over three orders of magnitude, with a very heavy tail. Environmental contamination data sets are notorious for this kind of distribution and variability: from a common-sense standpoint, most samples are going to be low-contamination, but occasionally they will be very high, and it's important when they are.

We can probably anticipate that this will cause problems in our model - and it did. More on that later.

The other major challenge in this data set is its relatively small size. Unfortunately, this is par for the course for groundwater data (see the previous picture of a giant drilling rig - the underlying driver for this analysis in the first place). In fact, this data set is pretty comprehensive compared to your average groundwater data set, especially a publicly available one. So "get a larger data set" isn't really an option for this kind of environmental data - any machine learning solutions for problems in this field are simply going to have to take this challenge into account. I knew overfitting was going to be a major concern, and I didn't have a ton of data to spare for validation and holdout. I split off a 30% section to use for testing, and trained/cross-validated on the remainder.
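Continuing with the hypothetical X and y from the earlier sketch, the split is straightforward (the random seed is just my own choice, for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Hold out 30% for final testing; train and cross-validate on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```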

The first model I attempted was a Gaussian process regression (hereafter, GPR) implemented with scikit-learn's GaussianProcessRegressor. I picked this to start because of the attractive possibility of training on a comparatively sparse data set in order to return not only a model fit, but areas of highest uncertainty (which GPRs return easily): perhaps to inform future environmental sampling at the same remediation site. It's easy to conceptualize the GPR as trying to model an underlying mathematical process generating noisy data, which seemed like a good conceptual fit for our situation of a spreading plume (tidy math) being sampled in a real-world situation (messy reality). I spent a lot of time tuning this model. I tested RBF and exponential kernels head-to-head[1], and tuned parameters (mostly the white noise kernel contribution; in theory, the sklearn implementation automatically optimizes GPR hyperparameters during the model runs) to maximize test set predictive accuracy. Here is a 3-dimensional (rotate- and zoom-able!) spatial visualization of the predicted plume concentrations in the best model run:
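A minimal sketch of that setup, continuing from the split above: the length scales and noise level are placeholder starting values (scikit-learn re-optimizes them during fitting), and the exponential kernel is written as a Matern kernel with nu = 0.5, which is the equivalent form in scikit-learn.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel, ConstantKernel

# "Exponential" covariance = Matern with nu = 0.5 in scikit-learn.
exp_kernel = ConstantKernel(1.0) * Matern(length_scale=100.0, nu=0.5) + WhiteKernel(noise_level=1.0)
rbf_kernel = ConstantKernel(1.0) * RBF(length_scale=100.0) + WhiteKernel(noise_level=1.0)

gpr = GaussianProcessRegressor(kernel=exp_kernel, normalize_y=True,
                               n_restarts_optimizer=10, random_state=0)
gpr.fit(X_train, y_train)

# GPRs hand back a standard deviation with every prediction: a built-in
# "where are we most uncertain?" map that could guide future sampling.
y_pred, y_std = gpr.predict(X_test, return_std=True)
print("test R^2:", gpr.score(X_test, y_test))
```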

Sanity-checking this: in a relatively homogeneous medium we would expect the highest concentrations to be found 1) near the surface towards the upgradient side in the direction of travel; and 2) in the center of the plume (midpoint) perpendicular to the direction of travel, tapering off towards the edges of the plume. By rotating the visualization above, we can see this appears to be generally the case, so it's got the right idea. However, the presence of a few random low-concentration points scattered throughout, along with the overall narrow range of predicted concentrations compared to the data set, is definitely reflective of the difficulty of training on such a small, high-variability dataset. More problematically, although accuracy scores aren't the beginning and end of the story, the test set R² score of this model was pretty bad. So I went back to feature engineering and gave that nitrate distribution more scrutiny.

"Data that ranges over several orders of magnitude and looks skewed towards the left side of the distribution" makes me think log-normal, so I tried using log(nitrate) instead.2 That looked a lot more promising:

log_nitrate.png

We've still got the problem of a lot of observations at zero, but at least the long-tail issue has been resolved. Re-running the model with this transformation increased the R² accuracy from ~0.01 to ~0.20 - a huge improvement for a tough data set.
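Here's one way that transformation might look in code, continuing from the earlier sketches. (The post doesn't pin down exactly how the zeros were handled; log1p, which maps zero to zero, is one simple option.)

```python
import numpy as np

# log1p compresses the three-orders-of-magnitude tail while handling zeros (log1p(0) = 0).
gpr.fit(X_train, np.log1p(y_train))
print("test R^2 on the log scale:", gpr.score(X_test, np.log1p(y_test)))
```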

I still wasn't too happy with the performance of the GPR implementation I was using, so I switched models entirely to a random forest regressor, just to see how much more performance could theoretically be improved. It did incredibly well: R² = 0.5. And in case that accuracy score alone seems misleading, here are the visualizations of the model-predicted log(nitrate) concentrations compared to the ground-truth data:
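For comparison, a random forest baseline on the same hypothetical features and log-transformed target might look like the sketch below; the hyperparameters are just reasonable defaults, not necessarily the ones behind the numbers above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor(n_estimators=500, random_state=0)

# Cross-validate on the training portion before touching the held-out test set.
cv_r2 = cross_val_score(rf, X_train, np.log1p(y_train), cv=5, scoring="r2")
print("CV R^2:", cv_r2.mean())

rf.fit(X_train, np.log1p(y_train))
print("test R^2:", rf.score(X_test, np.log1p(y_test)))
```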

I was happy enough with these results for the time being. For future work with this data set, I'd like to train a GPR iteratively, with each next sample taken at the point of maximum predicted uncertainty, to determine whether this is a productive strategy for minimizing the number of sampling points while maintaining reasonable model accuracy. I'd also like to see whether adding "neighboring concentrations of other contaminants" (boron, or chloride) as features is helpful for predicting target variable concentrations.[3]
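As a very rough sketch of what that iterative strategy could look like (the candidate grid, the measure_nitrate step, and the loop below are all hypothetical; the actual sampling is the expensive field-work part):

```python
import numpy as np

def propose_next_sample(gpr, candidate_locations):
    """Return the index of the candidate location with the highest predictive uncertainty."""
    _, std = gpr.predict(candidate_locations, return_std=True)
    return int(np.argmax(std))

# Hypothetical loop: start from a handful of wells, then repeatedly "drill"
# wherever the current model is least certain, and refit.
# for _ in range(n_new_wells):
#     idx = propose_next_sample(gpr, X_candidates)
#     X_known = np.vstack([X_known, X_candidates[idx]])
#     y_known = np.append(y_known, measure_nitrate(X_candidates[idx]))  # field work!
#     gpr.fit(X_known, y_known)
```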


[1] The idea here was that the exponential kernel is the theoretically correct choice for describing the covariance of a contaminant plume spreading according to the fractional advection-dispersion equation; research indicates that contaminant plumes spreading in predominantly-sand aquifers like the one underlying the Mass Military Academy exhibit super-Fickian behavior. Regardless of the potential cause, my exponential kernel did perform much better than the RBF kernel.

[2] It also turns out that a lot of environmental data is log-normal - but as this shows, we can arrive at that conclusion just by examining the data, even if we didn't know the fact ahead of time.

[3] And if someone who knows more about the GPR implementation in sklearn wants to help me make mine better, I'd be very excited! Documentation and internet-crowdsourced info for this implementation are minimal.

Minipost: Considerations for "Small" Data Sets

Why might a data set be small?

We’d like to be up to our ears in all the data we want, all the time, but sometimes that just isn’t the case. This could be because you’re having trouble sourcing data, in which case you might want to work directly on that problem, but it could be for other reasons that are harder to work around. Maybe the thing you’re looking at is just naturally limited in size, like the number of different species of fish on the planet. Maybe your goals require deeper analysis of a very specific subset or slice of your data. You might be stuck with the data you have.

Common challenge areas

Overfitting

A powerful model can often quickly and easily overfit to every single point in your relatively small data set. Be on the lookout even more than usual for common signs of overfitting, be sure to properly cross-validate, and definitely consider using regularization terms.

More fundamentally, model selection can be key here: conceptually speaking, what is your hypothetical model suggesting about the underlying relationships in your data, and how well does that match your understanding of the context of the data and the processes that generated it?

Model cross-validation

You might not have enough data points for a train-test-holdout split to be a fruitful endeavor: either you'll be lacking training data and end up with a poor model, or you'll be lacking testing data and be unable to draw useful conclusions about model performance. K-fold or leave-one-out cross-validation could be a better option than a dedicated holdout set.
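For instance, here's a quick leave-one-out sketch on a deliberately tiny synthetic data set (the data and the ridge model are just stand-ins):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

# A tiny synthetic data set standing in for a real "small data" problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=25)

# Leave-one-out: every observation gets a turn as the lone test point.
# (R^2 isn't meaningful on a single-point fold, so score with an error metric.)
scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_absolute_error")
print("LOO mean absolute error:", -scores.mean())
```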

Outliers

You don't have the leeway to be careless with outlier removal. On one side, inappropriately including outliers in a small data set could skew your model in a major way. On the other, being too conservative could result in too many data points being excluded, which will have a large impact on the statistical power of your model. Thinking critically about the conceptual underpinnings of the outliers in your data set can be helpful.

Minipost: Handling Nondetects and Missing Data

In general, there are three ways to deal with nondetects and missing data:

  1. Drop the whole observation with the missing data

  2. Use some statistical method designed for dealing with missing data

  3. Impute the missing values, either with a single value or by using a model
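For concreteness, here's a minimal sketch of what options 1 and 3 might look like with a hypothetical data frame (option 2 covers specialized techniques, such as survival-analysis-style methods for censored values, that don't fit in a few lines):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data frame where non-detects / missing values appear as NaN.
df = pd.DataFrame({"nitrate":  [0.4, np.nan, 12.0, 0.1, np.nan, 3.2],
                   "chloride": [5.0, 7.5,   40.0, 4.0, 6.1,   18.0]})

# Option 1: drop the whole observation.
dropped = df.dropna(subset=["nitrate"])

# Option 3a: impute a single value, e.g. half an assumed detection limit of 0.1
# (a common, if blunt, convention for non-detects).
half_dl = df.fillna({"nitrate": 0.05})

# Option 3b: impute with a statistic or model, here the column median.
imputed = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(df),
                       columns=df.columns)
```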

In order to decide which method is best to use, it’s important to think carefully about a few things.

Most importantly, are your data systematically missing in some way? For example, are you looking at crime data for neighborhoods, and all the high-crime neighborhoods are underreported? Systematically missing data can cause major problems, so it’s important to rule out possible sources of systematically missing data as much as you can.

How much data are you missing? 0.5%? 20%? Different methods are more or less successful depending on the percentage of missing data.

What is the real-world meaning of the data set you’re studying, and what are the possible sources of missing data? Let the type of data you’re using, its purpose, and its structure inform the strategies you use.

Finally, what are the potential consequences if missing data is handled incorrectly? Are imputed values going to be problematic in some way, and how does that balance against the possibility of reduced statistical power in the analysis?

It’s easy to look up particular methods, but one thing you can’t just look up is how to think critically about the most intelligent ways to handle your missing data. Not all data sets can be treated the same.

(Once you've thought through these things, check out one of my favorite Python packages for examining missing data - missingno.)