# Correlated priors in JAGS

HOME We routinely use independent priors for our analyses, implicitly assuming no correlations. That doesn't mean that posteriors are not correlated - correlation is never zero - but the correlation comes from the data. It might be better to include the correlation coefficient as a parameter in the model, maybe with a uniform prior, which can then be monitored and a posterior distribution reported.

In multi-species occupancy models (MSOMs), we expect some species to be both more widespread and more abundant than others. The more individuals present at the site, the higher the detection probability, so detection probability given presence, p, and probability of presence, psi, should be correlated.

## A bit of maths

For a univariate normal distribution we can write $$\eta \sim {\rm N}(\mu, \sigma^2)$$, where $$\mu$$ is the mean or centre of the distribution and $$\sigma$$ is the scale - and $$\sigma^2$$ is the variance of the distribution.

For a bivariate normal we have 2 variables, $$\eta_1$$ and  $$\eta_2$$, and 2 means, $$\mu_1$$ and  $$\mu_2$$:

$\left(\begin{matrix} \eta_1 \\ \eta_2\end{matrix}\right) \sim {\rm MVN}\left( \left(\begin{matrix} \mu_1 \\ \mu_2\end{matrix}\right) , \Sigma\right)$

Here $$\Sigma$$ is the 2 x 2 covariance matrix (aka variance-covariance matrix):

$\left(\begin{matrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2\end{matrix}\right)$

where $$\sigma_i^2$$ is the variance of variable $$i$$ and $$\sigma_{12} = \sigma_{21}$$ is the covariance between the two variables. We prefer to work with the correlation coefficient, $$\rho = \frac {\sigma_{12}} { \sigma_1 \sigma_2}$$, instead of the covariance; it has values in the range (-1, 1).

## Implementing correlated normal priors

In this post we will look at 4 ways to implement correlation among normally-distributed priors. The first adds a few lines of code to connect the priors, the other 3 use a multivariate normal distribution with different ways of specifying the priors for that. We will focus on bivariate normal variables initially.

For each method, we will use JAGS to simulate 500 values from the implied prior distribution (with mean 0, to keep it simple); this will be a matrix, eta, with 500 rows and a column for each of the variables (2 columns for a bivariate normal). JAGS repeats this 1000 times to generate chains of length 1000. For each iteration, we calculate the realised SDs and correlations of the columns in eta, and compare these with the specified priors for sigma and  rho.

### Linking two univariate normal priors

Here is the JAGS model to generate draws from the implied priors (adapted from Royle & Dorazio (2008) p. 385):

model {
for(m in 1:2) {
sigma[m] ~ dunif(0,5)
tau[m] <- 1/sigma[m]^2
}
rho ~ dunif(-1,1)
var.eta <- tau/(1 - rho^2)
mu <- c(0, 0)

for(k in 1:500) {
eta[k, 1] ~ dnorm(mu, tau)
mu2[k] <- mu + rho * sigma/sigma * (eta[k, 1] - mu)
eta[k, 2] ~ dnorm(mu2[k], var.eta)
}
}

This looks a little clumsy and it seems to be asymmetric - eta[,1] is drawn from the distribution specified, then eta[,2] is a function of eta[,1] - but JAGS does not run commands in sequence. The code specifies the relationships between the nodes.

We are using JAGS as a random number generator (RNG), so don't need to supply any data and don't need to check convergence or autocorrelation. We'll use rjags functions to run the simulation and mcmcOutput to extract the parameter values.

library(rjags)
library(mcmcOutput)

wanted <- c("sigma", "rho", "eta")

out <- coda.samples(jm, wanted, 1e3)
( mc <- mcmcOutput(out) )
# Object of class 'mcmcOutput'; approx. size 8.1 MB
# MCMC values from mcmc.list object ‘out’
# The output has 1 chains each with 1000 draws.
# It has 3 parameters with 1003 nodes monitored:
#       nodes
# eta    1000
# rho       1
# sigma     2

Now we calculate the means, SDs and correlation coefficients for each realization of the bivariate variable eta:

eta <- mc$eta niter <- nrow(mc) meanx <- sdx <- matrix(NA, niter, 2) corx <- numeric(niter) for(n in 1:niter) { meanx[n, ] <- colMeans(eta[n,,]) sdx[n, ] <- apply(eta[n,,], 2, sd) corx[n] <- cor(eta[n,,]) } We check that the realised values of SD and correlation correspond to the priors for sigma and rho: plot(mc$sigma[,1], sdx[,1])
plot(mc$sigma[,2], sdx[,2]) plot(mc$rho, corx) The realised values of SD and correlation (on the y axis) are pretty close to the draws of sigma and rho from the priors. Next we check the distribution of the SDs and correlations across the 1000 steps in the MCMC chain.

hist(sdx[,1])  # uniform(0, 5)
hist(sdx[,2])  # uniform(0, 5)
hist(corx)     # uniform (-1, 1) The distributions are uniform (to within MC error), as they should be. When sigma is close to the upper limit, 5, we can get a realised SD slightly more than 5, as you can see in the scatter plots above.

### dmvnorm with a Wishart (dwish) prior

Linking 2 univariate normal distributions as we did above can't be generalised to more than two variables. A more general solution is to draw from the multivariate normal distribution, dmvnorm in JAGS. The dnorm distribution uses precision, tau, instead of SD or variance, and dmvnorm uses a precision matrix, TAU in the code below, instead of the covariance matrix.

The Wishart distribution produces a suitable matrix for use with dmvnorm. The Wishart distribution is a multivariate version of the gamma distribution, which is restricted to non-negative values of continuous variables. The dwish distribution has 2 arguments, a matrix R and a scalar k, described as "degrees of freedom". The tricky bit is deciding what values to plug in to get the priors that we want for SD and correlation.

The degrees of freedom is straightforward: setting k = M + 1, where M is the number of variables (hence k = 3 for a bivariate normal) results in a Uniform(-1, 1) prior for the correlation coefficient. Smaller values, down to the limit of k = M, give priors with more weight on -1 and 1, while larger values give a prior with more mass near 0.

The M x M matrix R is usually a diagonal matrix (all the off-diagonal values 0) which determines the distribution of the SDs. We will explore how this works below, but the JAGS User Manual (p.55) gives a hint: "R/k is a prior guess for the covariance matrix...".

The JAGS code for our RNG is below. We'll pass the R matrix and degrees of freedom as data, so we can try different  values without changing the JAGS code.

model {
TAU ~ dwish(R, k)
mu <- c(0, 0)
for(k in 1:500) {
eta[k, 1:2] ~ dmnorm(mu, TAU)
}

VCOV <- inverse(TAU)
sigma <- sqrt(VCOV[1,1])
sigma <- sqrt(VCOV[2,2])
rho <- VCOV[1,2] / (sigma * sigma)
}

The RNG itself is only 5 lines of code, then we invert TAU to get the covariance matrix and extract  the values of sigma and rho.

We'll try the values for R and k used in Kéry & Royle (2016, p.668) and Kéry & Schaub (2012, p. 206):

( R <- diag(c(5,1)) )
#      [,1] [,2]
# [1,]    5    0
# [2,]    0    1
( sigma.guess <- sqrt(diag(R)/3) )
#  1.2909944 0.5773503

jdata <- list(R = R, k = 3)

The prior guesses for sigma for the 2 variables look very small and certainly the second one would seem to be highly informative. But much will depend on the shape of the implied distribution.

The code to run the model and inspect the results is the same as for the previous example, except that we now have to supply data = jdata where previously we had data = NULL.

The scatter plots for SD vs sigma and correlation vs rho show all the points close to the diagonal, and the histogram for correlation is Uniform(-1, 1). Here are the histograms for the SDs, with the vertical red lines indicating the "prior guess" for sigma and the dashed blue lines the 90th percentiles. It seems that the prior guess actually corresponds to the mode of a skewed distribution which has reasonable probabilities for values up to 5 (sd) or 2 (sd). These might be reasonable priors but are not uninformative.

### dmvnorm with a scaled Wishart (dscaled.wishart) prior

Because the interpretation of the Wishart prior is not intuitive, the glm module in JAGS provides a couple of alternatives, including the scaled Wishart distribution, dscaled.wishart.

The second argument to dscaled.wishart is the degrees of freedom, df; here a value of df = 2 gives a Uniform(-1,1) prior for the correlation coefficient, larger values give a prior with the mass concentrated closer to 0.

The priors for sigma are half-t distributions with df degrees of freedom. Of course, if you want a uniform distribution for correlation, you're stuck with t2. But that's a reasonable choice: some people advocate a half-Cauchy distribution (which is t1), but I think that has tails which are too heavy; Gelman and the Stan group recommend t4.

The first argument to dscaled.wishart is a vector of length M with the scale of the half-t distribution. This gives you a fairly direct way of determining how informative is the prior. The plot below shows a series of half-t2 distributions with different scales and the 90th percentile for each. The only change needed to the JAGS code is the prior for TAU, which becomes

TAU ~ dscaled.wishart(s, df)

We pass our chosen values for  s and  df via the data argument and load the glm module:

jdata <- list(s = c(5,1), df=2)

jm <- jags.model("3_scaled.jags", jdata)

The code to obtain and examine the results is the same as before, and the plots of SD vs  sigma and correlation vs rho have all points close to the diagonal. Here are the histograms for the SDs with the 90th percentile as the blue dashed line: Those have long right tails, so we'll plot just the smaller values and add the appropriate half-t2 distribution. ### dmvnorm.vcov with a covariance matrix

The glm module has a version of dmvnorm which uses a covariance matrix instead of a precision matrix. This is much easier to work with, and we can chose our own priors for sigma and rho and build those into the VCOV matrix directly. This is how it works in the JAGS code:

model {
sigma ~ dunif(0, 5)
VCOV[1, 1] <- sigma^2
sigma ~ dunif(0, 5)
VCOV[2, 2] <- sigma^2
rho ~ dunif(-1, 1)
VCOV[1, 2] <- rho * sigma * sigma
VCOV[2, 1] <- VCOV[1, 2]
mu <- c(0, 0)

for(k in 1:500) {
eta[k, 1:2] ~ dmnorm.vcov(mu, VCOV)
}
}

The code to run JAGS is the same as before (with data = NULL) as is the code to extract the values. All the plots look fine; here are the histograms for the realised SDs of  eta: ## More than 2 variables

The formulae for the multivariate normal now have vectors of length M for $$\eta$$ and $$\mu$$ and an M x M matrix for $$\Sigma$$. That's mostly straightforward, but let's look at $$\Sigma$$ for M = 3:

$\left(\begin{matrix} \sigma_1^2 & \sigma_{12} & \sigma_{13}\\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2\end{matrix}\right)$

Note that we now have 3 distinct values for the covariance ($$\sigma_{12} = \sigma_{21}$$, $$\sigma_{13} = \sigma_{31}$$ and $$\sigma_{23} = \sigma_{32}$$) and hence 3 values for the correlation coefficient $$\rho$$. The number of coefficients grows rapidly with M: it's $$(M^2 - M)/2$$. And there are constraints on the multiple correlations: if $$\rho_{12}$$ and $$\rho_{23}$$ are both high - say 0.8 - then $$\rho_{13}$$ cannot be negative. So we cannot just generate random values for $$\rho$$ and plug them in, some will not work and generate a run-time error in JAGS. Technically, the covariance matrix must be positive semi-definite; using inverse Wishart priors for the covariance matrix (or equivalently Wishart priors for the precision matrix) takes care of this requirement.

The methods with dmvnorm and Wishart or scaled Wishart priors work fine for more than 2 variables. With the dwish prior, the "prior guess" is smaller than the mode of the distribution.

dmvnorm.vcov works in practice, see here for an example, but I could not produce prior distributions using JAGS as a RNG. The problem is that a VCOV matrix constructed from randomly drawn values of rho and sigma will often not be positive semi-definite and causes an error.

## Recommendations

1. Don't use the classic Wishart prior (method #2 above). Or if you do, use JAGS as a RNG to check the prior distributions for SDs for the values you specify.

2. For a bivariate normal, the linking trick (method #1) works fine and allows you to specify whatever priors you choose for rho and sigma.

3. If you are happy with a half-t prior for  sigma with 2 degrees of freedom, the scaled Wishart method (#3 above) is straightforward to implement.

4. If the half-t prior is not appropriate for  sigma, or if you want to specify informative priors for rho, you can use the dmvnorm.vcov distribution (method #4) and construct your own covariance matrix.

Updated 23 December 2020 by Mike Meredith