MCMC diagnostic plots

HOME It's important to check your MCMC results: theory guarantees that the output from an MCMC run will eventually converge on the right posterior distribution, but for a less-than-infinite run we have to be careful. Checks will often show failure to converge, but will not tell us definitely that convergence has occurred. Probability of detecting a problem is less than 1! But running long chains and many chains will help.

Ok, so we "do" the diagnostics, but what exactly are we looking for, and what messages to draw for fixing problems. I want a "field guide" to diagnostic checks, especially the plots which give us most information. This page is a start, but it doesn't include all the issues that I've seen but didn't capture images of the plots.

The numbers

Most BUGS/JAGS software calculates Rhat and gives a warning if this is above a limit, usually 1.1. Rhat can only be calculated if you run more than one MCMC chain and is only informative if the chains have different starting values. Rhat is the ratio of the spread of all the values combined to the mean spread of each chain; if all the chains are sampling properly from the posterior distribution, this ratio should be 1. To know more about what's causing high values, we need to look at plots.

The n.eff figure is also usually quoted, the effective sample size. The values in an MCMC chain are not independent, and adjacent values are similar. The effective sample size is the number of independent draws from the posterior distribution which would give the same precision as the MCMC chain we have.

Trace plots and density plots

These are our main tools, and I like to see them side by side. Moreover, I want to see density plots for each chain separately. Compare the following two plots, the first from coda::plot.mcmc.list and the second from wiqid::diagPlot mcmcOutput::diagPlot (diagPlot is now part of the mcmcOutput package):

 The upper density plot suggests that the posterior has a mode around 0.98 and then tails off to 1.02. In fact, there are no values above 1, and the mode is 1.

Prior constraints

This will not be picked up by Rhat or n.eff, and you need to look at the plots, and if that raises your suspicions, look at the priors in the model code. The plots above show that the MCMC chain is constrained by the upper limit imposed by the prior. If this parameter is a probability, that constraint is ok, but not if it is, say, a coefficient in a regression model. Checking for constraints is really important if you are using uniform priors for standard deviations or parameters such as sigma in SCR models.

Posteriors drawn from the prior

This parameter is a probability with a uniform Beta(1,1) prior, and the posterior is coming solely from the prior.

What to do? (1) Look for coding errors; for example, you intended to use different probabilities for male and female, but in fact only p[1] is used for both. (2) Check the data; a colleague wanted to use different parameters for male and female and for three different parks (6 parameters in all), but no males were captured in the third park.

Insufficient burn-in

The MCMC chains need starting values. If these are not supplied, JAGS will draw random values from the prior distributions. Initial values can be a long way from the centre of the posterior distribution, and the first values in the MCMC chain can be affected by the starting values. This is the case for the red chain in the plots below.

This example is from an SCR model, where both sigma and p0 are affected, with high sigma corresponding to low p0. The initial portion of sigma is sufficiently bad to cause Rhat > 1.1, but p0 is not.

The usual reaction to high Rhat is "more burn-in needed", but that's not always the case.

What to do? Discard additional iterations at the start of the chain; use the window function if you are working with an mcmc.list or mcmcOutput object; there is no simple function for jagsUI objects (yet), but they can be easily converted to mcmcOutput format..

Here is a more severe case, again from an SCR model, where we ran 20 chains. Several chains are showing nice grassy plots around sigma = 1.5, p0 = 0.02, others have high sigma, low p0. You may notice that some chains are switching from high sigma to low.

If we look just at the switchers, it's easier to see the pattern. Five chains switched from high sigma to low, none from low to high. It seems that the high ones are affected by the starting values, and if we allow enough burn-in, they will eventually all converge.

What to do? Try to find the cause of this behaviour and fix it. The high-sigma plots above were due to a single activity centre (AC) location on the other side of a reservoir from the location of capture. When the AC succeeded in "jumping" the reservoir, sigma rapidly switched to the lower value. The solution is to provide initial values on the correct side of the reservoir.

If you can't find the reason, you could just allow a huge amount of burn-in and hope all the chains converge. Since the switches are all from high to low values, you might justifiably discard chains which did not switch.

Non-identifiability of parameters

With the plots below, there seems to be no sign of the chains moving towards a common distribution (although there appear to be only 50 iterations, this is after lots of thinning).

Updated 4 September 2020 by Mike Meredith