Goodness-of-fit : Occupancy models

We talk about "fitting a model" to data, but how well does our model actually fit? Assessing model fit, or performing goodness-of-fit (GOF) checks, is important. Sometimes the fit is embarrassingly poor, which is often an incentive to improve the model. GOF measures rarely indicate how to improve, and we'll look at more specific checks in a later post.

GOF looks at the fit of the model to the same data set used to estimate parameters. We are not concerned here with selecting models with the best predictive ability. In general, the fit improves as we add more parameters, and GOF is usually done with the full model, the one with the maximum number of parameters.

Discrepancy measures for binomial observations

Notation: \(y\) = observed number of successes, \(k\) = number of trials, \(p\) = probability of success. Note that the discrepancy measures used here can't be used with a single trial, ie, 0/1 data; we need to aggregate trials.

For example: with \(k\) = 5 visits to an occupied site and \(p\) = 0.7, the expected number of times detected is \(pk\) = 3.5. The data, \(y\), are integers while the expected value is continuous, so the difference will generally not be zero.

Chi-squared: \[\chi^2 = \frac{(y - pk)^2}{pk}\] If \(pk\) is small, \(\chi^2\) explodes, so one site can cause a huge discrepancy. If you were running a chi-squared test, you would manually pool results from sites so as to have reasonable \(pk\) values, but we can't do that programmatically. Also we may need to add a small amount to the denominator to avoid divide-by-zero errors: Kéry & Royle (2016, p193) add 0.0001, Kéry & Schaub (2012, p398) add 0.5; neither offers a mathematical justification, so it smacks of fudge.
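As a toy sketch (made-up data, not the Swiss dataset below), the chi-squared discrepancy with Kéry & Royle's small constant looks like this:

```r
# Toy sketch of the chi-squared discrepancy (illustrative values only)
y <- c(3, 5, 0, 2, 4)                  # detections at 5 sites
k <- 5                                 # visits per site
p <- 0.7                               # detection probability
e <- p * k                             # expected detections = 3.5
chi2 <- sum((y - e)^2 / (e + 0.0001))  # 0.0001 guards against divide-by-zero
chi2                                   # about 4.93
```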

Pearson's chi-squared: \[\chi^2 = \frac{(y - pk)^2}{pk(1-p)}\] This explodes if p = 0 or 1. Tobler et al (2015) use the square root of this, but with 0.001 added to p in the denominator to prevent divide-by-zero errors.

Freeman-Tukey: \((\sqrt{y} - \sqrt{pk})^2\) Avoids explosions and one or a few sites dominating the overall result. Eg, Kéry & Schaub (2012, p224), King et al (2010, p139). This is my favourite.
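A quick numerical sketch (toy numbers, not real data) shows why Freeman-Tukey stays well-behaved when \(pk\) is tiny while the chi-squared term explodes:

```r
# One site with a very small expected value (illustrative values only)
y <- 2; k <- 5; p <- 0.01
e <- p * k                          # expected detections = 0.05
chi2.term <- (y - e)^2 / e          # chi-squared term: 76.05
ft.term <- (sqrt(y) - sqrt(e))^2    # Freeman-Tukey term: about 1.42
```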

Deviance: Calculated from likelihood = \( {\rm I\!P}(y | {\rm parameters})\) A good fit results in a high likelihood. Deviance = -2*log(likelihood). Eg King et al (2010, p138, 396).
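In R, the deviance for binomial data can be sketched directly from the log-likelihoods (toy values, for illustration only):

```r
# Deviance from binomial log-likelihoods (illustrative values)
y <- c(3, 5, 0); n <- c(5, 5, 5); p <- 0.7
loglik <- sum(dbinom(y, n, p, log = TRUE))  # log P(y | parameters)
dev <- -2 * loglik                          # about 17.96; smaller = better fit
```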

With multiple sites or animals, we calculate the discrepancy for each and add them up to get a total discrepancy.

Assessing discrepancy:

The discrepancy is not going to be zero - so what values are good or bad? To assess that, we generate many simulated data sets from the fitted model and calculate the discrepancy for each of these, then see where the discrepancy for our actual data set falls. If the actual value lies near the middle of the simulated values (the solid line in the plot below), we are happy with the fit. But if the actual value is out in the tails of the simulated distribution, the fit is poor (the dashed lines).

We can calculate the p-value, ie, the probability of obtaining a discrepancy at least as large as the observed discrepancy if the model fits the data. Values near 0.5 indicate a good fit; values above 0.9 or below 0.1, a bad fit.
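The whole procedure can be sketched for simple binomial data with a parametric bootstrap; this toy example generates data from the model itself, so the p-value should not be extreme (all names and values here are illustrative, not part of the later analysis):

```r
# Parametric-bootstrap GOF p-value with the Freeman-Tukey discrepancy
set.seed(42)
nSites <- 50; k <- 5; p.true <- 0.6
y <- rbinom(nSites, k, p.true)              # "observed" data
p.hat <- sum(y) / (nSites * k)              # ML estimate of p
FT <- function(y, e) sum((sqrt(y) - sqrt(e))^2)
Tobs <- FT(y, p.hat * k)                    # discrepancy for the actual data
Tsim <- replicate(1000, FT(rbinom(nSites, k, p.hat), p.hat * k))
pval <- mean(Tsim >= Tobs)                  # extreme values flag poor fit
```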

The histogram above was generated using the ML estimates from a frequentist fit, which ignores uncertainty in the parameter estimates. For a Bayesian fit, we incorporate uncertainty by simulating a "posterior predictive" data set for each step in the MCMC chain. We then have MCMC chains for simulated and actual discrepancies; we can compare these in a scatter plot and calculate a p-value, sometimes referred to as a "Bayesian p-value".

Implementation in JAGS

We'll use an occupancy model to illustrate this, with the data for willow tits with site covariates from the Swiss Breeding Bird Survey, the same data used by Hooten & Hobbs (2015), except that they used only the first 200 rows, removing most of the high elevation sites. We will use the site covariates and ignore the survey covariates for the moment. The following R code will download and prepare the data:

# Read in data
wt <- read.csv("", comment="#")

# Aggregate the detection data:
( y <- rowSums(wt[, 1:3], na.rm=TRUE) )
( n <- rowSums(!is.na(wt[, 1:3])) )

# standardise site covariates and create model matrix
covs <- cbind(inter=1, scale(wt[8:7]))
covs <- cbind(covs, elev2=covs[, 'elev']^2)
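As a sanity check of the model-matrix parameterisation, occupancy probabilities can be computed outside JAGS as the inverse-logit of the linear predictor. This sketch uses made-up covariates and coefficient values, not the real Swiss BBS data:

```r
# Occupancy probability from a model matrix (made-up data and coefficients)
set.seed(1)
elev <- as.vector(scale(runif(10, 200, 2500)))
forest <- as.vector(scale(runif(10, 0, 100)))
X <- cbind(inter=1, elev=elev, forest=forest, elev2=elev^2)
beta <- c(-0.5, 1.2, 0.3, -0.8)   # illustrative coefficient values
psi <- plogis(X %*% beta)         # inverse-logit of the linear predictor
range(psi)                        # all probabilities in (0, 1)
```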

Freeman-Tukey discrepancy

With the model matrix parameterization, the JAGS code with Freeman-Tukey discrepancy GOF estimates (saved in "site_covs.jags") looks like this:

model {
  # Likelihood
  for(i in 1:nSites) {
    # Ecological model
    logit(psi[i]) <- sum(covs[i,] * beta)
    z[i] ~ dbern(psi[i])
    # Observation model
    y[i] ~ dbin(p * z[i], n[i])
    # GOF assessment
    Tobs0[i] <- (sqrt(y[i]) - sqrt(p*z[i]*n[i]))^2  # FT discrepancy for observed data
    ySim[i] ~ dbin(p * z[i], n[i])
    Tsim0[i] <- (sqrt(ySim[i]) - sqrt(p*z[i]*n[i]))^2  # ...and for simulated data
  }

  # Priors
  p ~ dbeta(1, 1)
  for(i in 1:nCovs) {
    beta[i] ~ dunif(-5, 5)
  }

  # GOF assessment
  Tobs <- sum(Tobs0)
  Tsim <- sum(Tsim0)

  # Derived variable
  N <- sum(z)
}

We prepare the necessary bits and pieces and run the null model (intercept only) and the full model with jagsUI:

# Organise the data etc
jagsData <- list(y = y, n = n, nSites = length(n), covs=covs[, 1, drop=FALSE], nCovs=1) # Null model
# jagsData <- list(y = y, n = n, nSites = length(n), covs=covs, nCovs=ncol(covs))  # Full model

inits <- function() list(z = rep(1, length(n)))

wanted <- c("p", "beta", "N", "Tobs", "Tsim")

# Run the model (6 secs)
( jagsOut <- jags(jagsData, inits, wanted, "site_covs.jags", DIC=FALSE, 
    n.chains=3, n.iter=5000, n.adapt=1000, parallel=TRUE) )
pp.check(jagsOut, "Tobs", "Tsim", main="Freeman-Tukey discrepancy")

The posterior predictive check plot1 shows that the discrepancies for the observed data are generally larger than for the simulated data, but the fit here is not bad. We would expect the model with more parameters to fit the data better, but that is not clear here - the difference of 0.01 in the p-values could be due to MCMC error. It could be that our discrepancy measure is not picking up the difference in fit for psi.

Notice that the simulated data depend on z[i]. For sites where the species was detected and y[i] > 0, z[i] = 1, and the calculation of psi is irrelevant. So the goodness-of-fit of the psi parameter is only assessed for the sites with y[i] = 0. We can improve on this by resampling z[i], see Broms et al (2016). We do this by replacing the ySim and Tsim lines in the code above with:

zSim[i] ~ dbern(psi[i])
ySim[i] ~ dbin(p * zSim[i], n[i])
Tsim0[i] <- (sqrt(ySim[i]) - sqrt(p*zSim[i]*n[i]))^2

Resampling z makes little difference, and we aren't picking up the expected difference in fit of the two models. At least for occupancy models, I've noticed that the fit has to be spectacularly bad to get a small P-value with the Freeman-Tukey discrepancy.


Deviance discrepancy

The deviance can also be used as the discrepancy measure. For this we calculate the log-likelihood for each site with this code:

    # GOF assessment
    LLobs0[i] <- logdensity.bin(y[i], p * z[i], n[i])
    ySim[i] ~ dbin(p * z[i], n[i])
    LLsim0[i] <- logdensity.bin(ySim[i], p * z[i], n[i])

followed by

  Dobs <- -2 * sum(LLobs0)
  Dsim <- -2 * sum(LLsim0)

The deviance measure is clearly picking up a difference between the full model and the rest, though the full model has a P value which is almost too big.

Calculating GOF after the JAGS run

It's sometimes convenient to calculate GOF statistics after doing the JAGS run; this is the case when the run takes a long time and you did not include GOF in the model. Here we'll calculate the Freeman-Tukey discrepancy starting from the posterior distribution of p, z and the coefficients for psi for the full model.

We can use the same model code as above, deleting - or just ignoring - the GOF assessment lines, but we need to monitor z:

wanted <- c("p", "beta", "N", "z")

After running the model, we do this:

# Extract the MCMC chains from the jagsUI output
p <- jagsOut$sims.list$p
beta <- jagsOut$sims.list$beta
z <- jagsOut$sims.list$z

nIter <- length(p)
Tobs <- Tsim <- numeric(nIter)
for(iter in 1:nIter) {
  psi <- plogis(covs %*% beta[iter, ])  # needed only if you also resample z
  Tobs[iter] <- sum((sqrt(y) - sqrt(p[iter]*z[iter, ]*n))^2)
  ySim <- rbinom(length(n), n, p[iter]*z[iter, ])
  Tsim[iter] <- sum((sqrt(ySim) - sqrt(p[iter]*z[iter, ]*n))^2)
}

MASS::eqscplot(Tobs, Tsim, xlim=range(Tobs, Tsim), ylim=range(Tobs, Tsim),
  xlab="Observed data", ylab="Simulated data")
abline(0, 1, lwd=2, col='red')
mean(Tsim > Tobs) # the P value


1 I'm currently using the devel version of jagsUI from Github, which displays the correct P value. Older versions of jagsUI showed 1 - P.

Updated 18 Aug 2019 by Mike Meredith