We routinely use independent priors for our analyses, implicitly
assuming no correlations. That doesn't mean that posteriors are
not correlated - correlation is never zero - but the correlation
comes from the data. It might be better to include the
correlation coefficient as a parameter in the model, maybe with
a uniform prior, which can then be monitored and a posterior
distribution reported. In multi-species occupancy models
(MSOMs), we expect some species to be both more widespread and
more abundant than others. The more individuals present at the
site, the higher the detection probability, so detection
probability given presence, p , and probability of presence,
psi ,
should be correlated.
A bit of maths
For a univariate normal distribution we can write \(\eta \sim
{\rm N}(\mu, \sigma^2)\), where \(\mu\) is the mean or centre of
the distribution and \(\sigma\) is the scale - and \(\sigma^2\)
is the variance of the distribution.
For a bivariate normal we have 2 variables, \(\eta_1\) and
\(\eta_2\), and 2 means, \(\mu_1\) and \(\mu_2\):
\[\left(\begin{matrix} \eta_1 \\ \eta_2\end{matrix}\right) \sim
{\rm MVN}\left( \left(\begin{matrix} \mu_1 \\
\mu_2\end{matrix}\right) , \Sigma\right) \]
Here \(\Sigma\) is the 2 x 2 covariance matrix (aka
variance-covariance matrix):
\[\left(\begin{matrix} \sigma_1^2 & \sigma_{12} \\
\sigma_{21} & \sigma_2^2\end{matrix}\right)\]
where \(\sigma_i^2\) is the variance of variable \(i\) and
\(\sigma_{12} = \sigma_{21}\) is the covariance between the two
variables. We prefer to work with the correlation coefficient,
\(\rho
= \frac {\sigma_{12}} { \sigma_1 \sigma_2} \), instead of
the covariance; it has values in the range (-1, 1).
Implementing correlated normal priors
In this post we will look at 4 ways to implement correlation
among normally-distributed priors. The first adds a few lines of
code to connect the priors, the other 3 use a multivariate
normal distribution with different ways of specifying the priors
for that. We will focus on bivariate normal variables initially.
For each method, we will use JAGS to simulate 500 values from the
implied prior distribution (with mean 0, to keep it simple);
this will be a matrix, eta , with 500 rows and a
column for each of the variables (2 columns for a bivariate
normal). JAGS repeats this 1000 times to generate chains of
length 1000. For each iteration, we calculate the realised SDs
and correlations of the columns in eta , and compare
these with the specified priors for sigma and
rho .
Linking two univariate normal priors
Here is the JAGS model to generate draws from the implied
priors (adapted from Royle & Dorazio (2008) p. 385):
model {
for(m in 1:2) {
sigma[m] ~ dunif(0,5)
tau[m] <- 1/sigma[m]^2
}
rho ~ dunif(-1,1)
var.eta <- tau[2]/(1 - rho^2)
mu <- c(0, 0)
for(k in 1:500) {
eta[k, 1] ~ dnorm(mu[1], tau[1])
mu2[k] <- mu[2] + rho * sigma[2]/sigma[1] * (eta[k, 1] - mu[1])
eta[k, 2] ~ dnorm(mu2[k], var.eta)
}
}
This looks a little clumsy and it seems to be asymmetric -
eta[,1] is drawn from the distribution specified,
then eta[,2] is a function of eta[,1]
- but JAGS does not run commands in sequence. The code specifies
the relationships between the nodes.
We are using JAGS as a random number generator (RNG), so
don't need to supply any data and don't need to check
convergence or autocorrelation. We'll use rjags
functions to run the simulation and mcmcOutput to
extract the parameter values.
library(rjags)
library(mcmcOutput)
wanted <- c("sigma", "rho", "eta")
jm <- jags.model("1_link.jags", data=NULL)
out <- coda.samples(jm, wanted, 1e3)
( mc <- mcmcOutput(out) )
# Object of class 'mcmcOutput'; approx. size 8.1 MB
# MCMC values from mcmc.list object ‘out’
# The output has 1 chains each with 1000 draws.
# It has 3 parameters with 1003 nodes monitored:
# nodes
# eta 1000
# rho 1
# sigma 2
Now we calculate the means, SDs and correlation coefficients
for each realization of the bivariate variable eta :
eta <- mc$eta
niter <- nrow(mc)
meanx <- sdx <- matrix(NA, niter, 2)
corx <- numeric(niter)
for(n in 1:niter) {
meanx[n, ] <- colMeans(eta[n,,])
sdx[n, ] <- apply(eta[n,,], 2, sd)
corx[n] <- cor(eta[n,,])
}
We check that the realised values of SD and correlation
correspond to the priors for sigma and rho :
plot(mc$sigma[,1], sdx[,1])
plot(mc$sigma[,2], sdx[,2])
plot(mc$rho, corx)
The realised values of SD and correlation (on the y axis) are
pretty close to the draws of sigma and rho from the priors. Next
we check the distribution of the SDs and correlations across the
1000 steps in the MCMC chain.
hist(sdx[,1]) # uniform(0, 5)
hist(sdx[,2]) # uniform(0, 5)
hist(corx) # uniform (-1, 1)
The distributions are uniform (to within MC error), as they
should be. When sigma is close to the upper limit,
5, we can get a realised SD slightly more than 5, as you can see
in the scatter plots above.
dmvnorm with a Wishart (dwish )
prior
Linking 2 univariate normal distributions as we did above
can't be generalised to more than two variables. A more general
solution is to draw from the multivariate normal distribution,
dmvnorm in JAGS. The dnorm
distribution uses precision, tau , instead of SD or
variance, and dmvnorm uses a precision matrix,
TAU in the code below, instead of the
covariance matrix.
The Wishart distribution produces a suitable matrix for use
with dmvnorm . The Wishart distribution is a
multivariate version of the gamma distribution, which is
restricted to non-negative values of continuous variables. The
dwish distribution has 2 arguments, a matrix R and
a scalar k, described as "degrees of freedom". The tricky bit is
deciding what values to plug in to get the priors that we want
for SD and correlation.
The degrees of freedom is straightforward: setting k = M + 1,
where M is the number of variables (hence k = 3 for a bivariate normal)
results in a Uniform(-1, 1) prior for the correlation
coefficient. Smaller values, down to the limit of k = M, give
priors with more weight on -1 and 1, while larger values give a
prior with more mass near 0.
The M x M matrix R is usually a diagonal matrix (all the
off-diagonal values 0) which determines the distribution of the
SDs. We will explore how this works below, but the JAGS User
Manual (p.55) gives a hint: "R/k is a prior guess for the
covariance matrix...".
The JAGS code for our RNG is below. We'll pass the R matrix
and degrees of freedom as data, so we can try different
values without changing the JAGS code.
model {
TAU ~ dwish(R, k)
mu <- c(0, 0)
for(k in 1:500) {
eta[k, 1:2] ~ dmnorm(mu, TAU)
}
VCOV <- inverse(TAU)
sigma[1] <- sqrt(VCOV[1,1])
sigma[2] <- sqrt(VCOV[2,2])
rho <- VCOV[1,2] / (sigma[1] * sigma[2])
}
The RNG itself is only 5 lines of code, then we invert TAU to
get the covariance matrix and extract the
values of sigma and rho .
We'll try the values for R and k used in Kéry & Royle (2016,
p.668) and Kéry & Schaub (2012, p. 206):
( R <- diag(c(5,1)) )
# [,1] [,2]
# [1,] 5 0
# [2,] 0 1
( sigma.guess <- sqrt(diag(R)/3) )
# [1] 1.2909944 0.5773503
jdata <- list(R = R, k = 3)
The prior guesses for sigma for the 2 variables look very
small and certainly the second one would seem to be highly
informative. But much will depend on the shape of the implied
distribution.
The code to run the model and inspect the results is the same
as for the previous example, except that we now have to supply
data = jdata where previously we had data =
NULL .
The scatter plots for SD vs sigma and correlation vs rho show
all the points close to the diagonal, and the histogram for
correlation is Uniform(-1, 1). Here are the histograms for the
SDs, with the vertical red lines indicating the "prior guess"
for sigma and the dashed blue lines the 90th percentiles.
It seems that the prior guess actually corresponds to the mode
of a skewed distribution which has reasonable probabilities for
values up to 5 (sd[1] ) or 2 (sd[2] ). These might be reasonable
priors but are not uninformative.
dmvnorm with a scaled Wishart (dscaled.wishart )
prior
Because the interpretation of the Wishart prior is not
intuitive, the glm module in JAGS provides a couple
of alternatives, including the scaled Wishart distribution,
dscaled.wishart .
The second argument to dscaled.wishart is the
degrees of freedom, df ; here a value of df =
2 gives a Uniform(-1,1) prior for the correlation
coefficient, larger values give a prior with the mass
concentrated closer to 0.
The priors for sigma are half-t distributions with
df degrees of freedom. Of course, if you
want a uniform distribution for correlation, you're stuck with
t_{2}. But that's a
reasonable choice: some people advocate a half-Cauchy
distribution (which is t_{1}),
but I think that has tails which are too heavy;
Gelman and the Stan group recommend t_{4}.
The first argument to dscaled.wishart is a
vector of length M with the scale of the half-t
distribution. This gives you a fairly direct way of determining
how informative is the prior. The plot below shows a series of
half-t_{2}
distributions with different scales and the 90th percentile for
each.
The only change needed to the JAGS code is the prior for
TAU ,
which becomes
TAU ~ dscaled.wishart(s, df)
We pass our chosen values for
s and
df via the data argument
and load the glm module:
j data <- list(s = c(5,1), df=2)
load.module("glm")
jm <- jags.model("3_scaled.jags", jdata)
The code to obtain and examine the results is the same as
before, and the plots of SD vs
sigma and correlation vs
rho have all
points close to the diagonal. Here are the histograms for the
SDs with the 90th percentile as the blue dashed line:
Those have long right tails, so we'll plot just the smaller
values and add the appropriate half-t_{2}
distribution.
dmvnorm.vcov with a covariance
matrix
The glm module has a version of dmvnorm which uses a
covariance matrix instead of a precision matrix. This
is much easier to work with, and we can chose our own priors for
sigma and rho and build those into the
VCOV matrix directly.
This is how it works in the JAGS code:
model {
sigma[1] ~ dunif(0, 5)
VCOV[1, 1] <- sigma[1]^2
sigma[2] ~ dunif(0, 5)
VCOV[2, 2] <- sigma[2]^2
rho ~ dunif(-1, 1)
VCOV[1, 2] <- rho * sigma[1] * sigma[2]
VCOV[2, 1] <- VCOV[1, 2]
mu <- c(0, 0)
for(k in 1:500) {
eta[k, 1:2] ~ dmnorm.vcov(mu, VCOV)
}
}
The code to run JAGS is the same as before (with data =
NULL ) as is the code to extract the values. All the plots
look fine; here are the histograms for the realised SDs of
eta :
More than 2 variables
The formulae for the multivariate normal now have vectors of
length M for \(\eta\) and \(\mu\) and an M x M matrix for
\(\Sigma\). That's mostly straightforward, but let's look at
\(\Sigma\) for M = 3:
\[\left(\begin{matrix} \sigma_1^2 & \sigma_{12} &
\sigma_{13}\\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\
\sigma_{31} & \sigma_{32} & \sigma_3^2\end{matrix}\right)\]
Note that we now have 3 distinct values for the covariance
(\(\sigma_{12} = \sigma_{21}\), \(\sigma_{13} = \sigma_{31}\)
and \(\sigma_{23} = \sigma_{32}\)) and hence 3 values for the
correlation coefficient \(\rho\). The number of coefficients
grows rapidly with M: it's \((M^2 - M)/2\). And there are
constraints on the multiple correlations: if \(\rho_{12}\) and
\(\rho_{23}\) are both high - say 0.8 - then \(\rho_{13}\)
cannot be negative. So we cannot just generate random values for
\(\rho\) and plug them in, some will not work and generate a
run-time error in JAGS. Technically, the covariance matrix must
be positive semi-definite; using inverse Wishart priors for the
covariance matrix (or equivalently Wishart priors for the
precision matrix) takes care of this requirement.
The methods with dmvnorm and Wishart or scaled
Wishart priors work fine for more than 2 variables. With the
dwish prior, the "prior guess" is smaller than the
mode of the distribution.
dmvnorm.vcov works in practice, see
here for an example, but I could not produce prior
distributions using JAGS as a RNG. The problem is that a VCOV
matrix constructed from randomly drawn values of rho
and sigma will often not be positive semi-definite
and causes an error.
Recommendations
1. Don't use the classic Wishart prior (method #2 above). Or
if you do, use JAGS as a RNG to check the prior distributions
for SDs for the values you specify.
2. For a bivariate normal, the linking trick (method #1)
works fine and allows you to specify whatever priors you choose
for rho and sigma .
3. If you are happy with a half-t prior for
sigma with 2 degrees of freedom, the scaled Wishart
method (#3 above) is straightforward to implement.
4. If the half-t prior is not appropriate for
sigma , or if you want to specify informative priors for
rho , you can use the dmvnorm.vcov
distribution (method #4) and construct your own
covariance matrix. |