There is usually some correlation between the covariates that we
want to include in our model, a phenomenon known as "collinearity".
This is not a problem if the correlation is small, but you need
to be careful if the absolute correlation between two covariates
is > 0.7. You can avoid problems by discarding one of a pair of
correlated covariates, but that can be a mistake if both have a
real biological impact. We'll look at a toy example where both
covariates should be included.
A toy example
Suppose we have measurements of some variable, such as tree size expressed as diameter at breast height (DBH), at 9 sites. DBH depends on temperature and rainfall, both of which we have scaled so that they range from -2 to +2.
The data are shown in the plot below, where the diameter of the circles is proportional to DBH.
The plot shows that tree size increases with rainfall (for a given temperature) and also with temperature (for a given rainfall). The (absolute) correlation between rainfall and temperature is high at 0.93, and this is also included in the plot.
Fitting a linear model with both rainfall and temperature as predictors appears to work well and we recover the coefficients we used to generate DBH with only a small effect of the random error we added:
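As a rough illustration of this kind of fit, here is a minimal sketch in Python using only the standard library. The site values, the intercept of 10, the noise level and the random seed are all assumptions made up for the sketch (they are not the data behind the plot), and `solve` and `ols` are small helper functions written here, not part of any package. The slopes of 1.0 for temperature and 0.5 for rainfall are the ones used in this example.

```python
import random

# Hypothetical covariates for 9 sites, scaled to -2..+2 and strongly
# negatively correlated (r = -0.95). NOT the original data.
TEMP = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
RAIN = [2.0, 1.0, 1.5, 0.0, 0.5, -0.5, -1.5, -1.0, -2.0]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)


# Simulated DBH: assumed intercept 10, slopes 1.0 (temp) and 0.5 (rain),
# plus a little random error (assumed SD 0.05, fixed seed).
rng = random.Random(1)
dbh = [10.0 + 1.0 * t + 0.5 * r + rng.gauss(0.0, 0.05) for t, r in zip(TEMP, RAIN)]

b0, b_temp, b_rain = ols([TEMP, RAIN], dbh)
print(f"intercept = {b0:.2f}, temp = {b_temp:.2f}, rain = {b_rain:.2f}")
```

With both covariates in the model, the estimated slopes should come out close to the values used to simulate the data, despite the high correlation between the covariates.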
Using just one covariate
With such a high correlation, perhaps we should choose one of the covariates for our model and ignore the other. The plots below show the effect of ignoring one covariate:
The rainfall-only plot seems odd: high rainfall appears to be associated with small trees. That is indeed true in these data, but only because the sites with high rainfall have low temperatures. This shows up when we run linear regressions:
The coefficient for temperature is underestimated (0.53 instead of 1.00) and the coefficient for rainfall is now negative (-0.42 instead of +0.50). These results are just wrong. In this example, the two covariates have separate effects on DBH and both need to be included. The correlation is due to our choice of sites: had we included sites with temperature and rainfall both high or both low, the correlation would be much lower.
In this case, if you want to disentangle the effects of temperature and rainfall, you must include both covariates in the model.
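The size of the error from dropping a covariate can be worked out directly: the slope from a one-covariate regression equals that covariate's own effect plus the omitted covariate's effect multiplied by the slope of the omitted covariate on the included one. The sketch below checks this in Python with noise-free data. The site values are hypothetical (chosen only to be strongly negatively correlated, not the data in the plots), and `slope` is a helper defined here.

```python
def slope(x, y):
    """Slope of the simple linear regression of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical covariates for 9 sites (r = -0.95). NOT the original data.
TEMP = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
RAIN = [2.0, 1.0, 1.5, 0.0, 0.5, -0.5, -1.5, -1.0, -2.0]

B_TEMP, B_RAIN = 1.0, 0.5  # true effects, as in the toy example

# Noise-free response, so any discrepancy is due to the omitted covariate.
dbh = [B_TEMP * t + B_RAIN * r for t, r in zip(TEMP, RAIN)]

temp_only = slope(TEMP, dbh)  # rainfall ignored
rain_only = slope(RAIN, dbh)  # temperature ignored

# What the omitted-covariate argument predicts for each slope.
expected_temp_only = B_TEMP + B_RAIN * slope(TEMP, RAIN)
expected_rain_only = B_RAIN + B_TEMP * slope(RAIN, TEMP)

print(f"temp only: {temp_only:.3f} (predicted {expected_temp_only:.3f})")
print(f"rain only: {rain_only:.3f} (predicted {expected_rain_only:.3f})")
```

For these made-up sites the temperature-only slope is 0.525 and the rainfall-only slope is -0.45, the same pattern as above: one effect is roughly halved and the other changes sign.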
Which model should we use to make predictions?
Akaike's information criterion (AIC) attempts to select the best model for predictions:
The model with both covariates is clearly the best, but as we shall see in a moment, it's important to include the covariances when making predictions.
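For readers who want to see where the AIC values come from, here is a minimal sketch in Python. For a linear model with normal errors, AIC can be written as n log(RSS/n) + 2k plus a constant that is the same for every model fitted to the same data, where RSS is the residual sum of squares and k counts the coefficients plus the residual variance. The constant is dropped here, so the values will differ from those reported by statistical software, but the differences between models are unaffected. The data are simulated under assumed values (intercept 10, noise SD 0.05, fixed seed, hypothetical site covariates), and the helper functions are written for this sketch.

```python
import math
import random

# Hypothetical covariates for 9 sites (r = -0.95). NOT the original data.
TEMP = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
RAIN = [2.0, 1.0, 1.5, 0.0, 0.5, -0.5, -1.5, -1.0, -2.0]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)


def rss(columns, y):
    """Residual sum of squares of the least-squares fit."""
    coefs = ols(columns, y)
    total = 0.0
    for i, yi in enumerate(y):
        fitted = coefs[0] + sum(c * col[i] for c, col in zip(coefs[1:], columns))
        total += (yi - fitted) ** 2
    return total


def aic_gaussian(columns, y):
    """AIC up to an additive constant: n*log(RSS/n) + 2k."""
    n = len(y)
    k = len(columns) + 2  # slopes + intercept + residual variance
    return n * math.log(rss(columns, y) / n) + 2 * k


# Simulated response under assumed values (intercept 10, noise SD 0.05).
rng = random.Random(1)
dbh = [10.0 + 1.0 * t + 0.5 * r + rng.gauss(0.0, 0.05) for t, r in zip(TEMP, RAIN)]

aic = {
    "temp only": aic_gaussian([TEMP], dbh),
    "rain only": aic_gaussian([RAIN], dbh),
    "temp + rain": aic_gaussian([TEMP, RAIN], dbh),
}
for name, value in sorted(aic.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} AIC = {value:7.2f}")
```

Lower AIC is better, and the penalty of 2 per extra parameter is easily outweighed here by the drop in RSS when the second covariate is added.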
Why is collinearity a problem?
Richard McElreath (2016, p.142) has a toy example where the length of a person's leg is used to predict their height. This works fine for one leg. Left and right legs have slightly different lengths, and if both are included the coefficients have huge confidence intervals and neither is significant. This arises because the left leg provides no information that is not provided by the right leg, and vice versa. (If you wanted to include both legs, a good strategy would be to use the mean leg length as one covariate and the difference in leg lengths as a second covariate.)
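The suggestion in parentheses can be checked with a quick simulation. The sketch below is in Python with made-up leg lengths (the mean of 80 cm, the SDs and the seed are assumptions for illustration only, not McElreath's data). It compares the correlation between left and right legs with the correlation between their mean and their difference.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_legs(n=100, seed=1):
    """Hypothetical leg lengths (cm): a shared length plus a small
    independent deviation for each leg."""
    rng = random.Random(seed)
    left, right = [], []
    for _ in range(n):
        base = rng.gauss(80.0, 5.0)
        left.append(base + rng.gauss(0.0, 0.5))
        right.append(base + rng.gauss(0.0, 0.5))
    return left, right


left, right = simulate_legs()
mean_leg = [(l + r) / 2 for l, r in zip(left, right)]
diff_leg = [l - r for l, r in zip(left, right)]

r_legs = pearson(left, right)
r_mean_diff = pearson(mean_leg, diff_leg)
print(f"corr(left, right)      = {r_legs:.3f}")
print(f"corr(mean, difference) = {r_mean_diff:.3f}")
```

When the two legs have similar variances, the mean and the difference are close to uncorrelated, so their coefficients can be estimated without the inflation described in the rest of this section.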
The estimates of coefficients of correlated covariates are themselves correlated; let's look at the correlation matrix for the DBH model:
The correlation between the coefficients for temp and rain has
the same magnitude as the correlation between the covariates, but the opposite sign. As a result,
the variances are inflated, leading to larger SEs, wider
confidence intervals, and higher p values. The "variance-inflation factor" (VIF) is calculated with the
The variances are inflated by a factor of 7.3, and standard errors and confidence interval widths are larger by a factor of 2.7, as a result of collinearity. For this artificial example the variance is tiny even after inflation. (For more information on VIF, see Fox (2002, p.216ff).)
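With only two covariates the VIF has a simple closed form, 1 / (1 - r^2), where r is the correlation between the covariates, and the standard errors are inflated by its square root. The short Python sketch below plugs in the rounded correlation of 0.93 quoted earlier, so it gives roughly 7.4 for the VIF, not the 7.3 above, which was presumably computed from the unrounded data. The function name is our own.

```python
import math


def vif_two_covariates(r):
    """Variance-inflation factor when there are exactly two covariates
    with correlation r. With more covariates, r**2 is replaced by the
    R-squared from regressing one covariate on all the others."""
    return 1.0 / (1.0 - r * r)


r = 0.93  # rounded absolute correlation between rainfall and temperature
vif = vif_two_covariates(r)
se_inflation = math.sqrt(vif)  # SEs and CI widths scale with the square root

print(f"VIF = {vif:.1f}, SE inflation = {se_inflation:.1f}")
```

The sign of r does not matter, and the VIF climbs steeply as |r| approaches 1: a correlation of 0.7 gives a VIF of about 2, while 0.99 gives about 50.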
Does this apply to Bayesian estimation too?
Let's run a simple model with broad priors in JAGS and see what happens.
And here are the summaries of the marginal posterior distributions:
The marginal posteriors never tell the whole story, and in this case it's especially important to look at cross-correlations:
Again we see that the correlation between the posteriors for the coefficients has the same magnitude as the correlation between temperature and rain in the data, and this shows up too in the plot:
One huge advantage of using MCMC chains to represent posterior distributions is that it is easy to generate similar chains for derived quantities. So if we want to make predictions for the DBH of trees when (for example) temperature = 0.5 and rainfall = 1.5, we can simply do this:
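The idea can be sketched in Python without running JAGS. In the block below the "chains" are simulated stand-ins, not real MCMC output: the posterior means, SDs, the correlation of 0.93 between the two slopes and the seed are all assumptions chosen for illustration. The point is the one-line calculation of the prediction for each draw, which carries the correlation between the coefficients through to the result automatically.

```python
import math
import random


def simulate_draws(n=5000, rho=0.93, seed=1):
    """Stand-in for MCMC output: draws of (b0, b_temp, b_rain) with
    correlated slopes. All numbers here are hypothetical."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        b0 = 10.0 + 0.02 * rng.gauss(0.0, 1.0)
        b_temp = 1.0 + 0.1 * z1
        b_rain = 0.5 + 0.1 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        draws.append((b0, b_temp, b_rain))
    return draws


def sd(values):
    """Sample standard deviation."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


draws = simulate_draws()
temp_new, rain_new = 0.5, 1.5

# Derived quantity: one predicted DBH per draw.
pred = [b0 + b_temp * temp_new + b_rain * rain_new for b0, b_temp, b_rain in draws]

pred_mean = sum(pred) / len(pred)
pred_sd = sd(pred)
ordered = sorted(pred)
lower = ordered[int(0.025 * len(ordered))]  # approximate 95% interval
upper = ordered[int(0.975 * len(ordered))]

# For comparison: the SD we would get from the marginal SDs alone,
# i.e. treating the coefficients as independent.
naive_sd = math.sqrt(
    sd([d[0] for d in draws]) ** 2
    + (temp_new * sd([d[1] for d in draws])) ** 2
    + (rain_new * sd([d[2] for d in draws])) ** 2
)

print(f"prediction: mean {pred_mean:.2f}, SD {pred_sd:.3f}, "
      f"95% interval {lower:.2f} to {upper:.2f}")
print(f"SD ignoring the correlation: {naive_sd:.3f}")
```

The two SDs differ because the second one throws away the covariance between the slopes, which is why the marginal summaries on their own are not enough for prediction.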
How precise is our prediction compared with the point
We see that the prediction is more precise than the
individual coefficients on which it is based. This is because
the CVs of the coefficients refer to marginal distributions
which ignore cross-correlations, while the calculation of
Updated 24 Jan 2017 by Mike Meredith