HOME |
The issue of model selection for complex Bayesian models is
still very much open, so I was interested in a paper by
Kristen
Broms, Mevin Hooten and Ryan Fitzpatrick (2016, Ecology).
There is good stuff there about using information criteria to
compare models, but in this post I want to focus on their use of cross-validation. Most of the goodness-of-fit measures we use do
not work for individual trials, ie, 1/0 data, and trials need to
be aggregated so that the data are of the form: y
successes in n trials. The scoring methods used by Broms
et al do work with single trials. They provided the code in their supplementary
materials, but that is tailored to their study of 26 species of
fish at 113 sites in a river basin in Colorado. This is an attempt to simplify
it and make it easier for other researchers to adapt it to their
own study.
This assumes that you have read the original paper
and the supplements for details of the study and the
mathematical approaches. I will not repeat those here. The
data
The data are provided in the supplementary materials to the
paper. You can download a reformatted version with:
load(url("http://mmeredith.net/data/fishData.RData"))
ls()
README
str(Y)
str(Psi.cov)
str(X)
Here Y is a 3-dimensional array, species x sites x
surveys, with 1 where the species was detected, otherwise 0. The
number of surveys differed among sites, up to a maximum of
maxSurveys = 24. The vector nSurveys gives
the number of surveys for each site. The Y array
has NA where fewer than 24 surveys were conducted.
The number of species, nspecies = 26 , is the number
of native species known to occur in the river basin, though two
of these were not caught during the surveys.
Psi.cov is a matrix with the sites as rows and
columns for site covariates. The first column, headed int ,
is a column of ones which will correspond to the intercept in
the analysis. The remaining columns correspond to the vectors
prefixed Psi.* in the original data. X
is a 3-dimensional array, sites x surveys x variables,
with the covariates used for detection probability. X[, ,
1] contains ones and will correspond to the intercept;
the remainder are the matrices named X.* in the
original data. Again, NA s appear when less than 24
surveys were conducted.
Your data will be quite different, but
you should be able to wrangle them into similar matrices and
arrays. The models
The data file, "fishData.RData", contains lists specifying
the covariates in the five models used by Broms et al (2016), which you can display
with:
modelPsi
modelP
I used the covariate names for these model definitions,
although just using the column numbers would also work - at
least until you modify the array by adding or removing things. I
think names are less likely to lead to undetected errors than
numbers. The JAGS file
We'll write a single JAGS file to do all the analyses. The
model implemented and the fold will be determined by passing a
subset of the Y , Psi.cov and X objects to JAGS as data. The data block
To
simplify the data input, we'll use the data block in the JAGS
file to calculate the number of sites and coefficients for each
run. Code in the data block is run once only, at the start of
the model creation process. See the
JAGS User Manual for details. data {
# set up the constants used in the model.
Dpsi <- dim(Psi.cov)
nPsiCoef <- Dpsi[2]
Dp <- dim(Det.cov)
nDetCoef <- Dp[3]
DY <- dim(Y)
nPossSpp <- DY[1]
nSites <- DY[2]
muPsiCoef.tau <- 1 / 2.25 ^ 2
muDetCoef.tau <- 1 / 2.25 ^ 2
t.tau <- 1 / 2.25 ^ 2
}
The model code Here's the whole code block for the
model. This does not use data augmentation and limits the number
of species to 26. We'll add explanations afterwards.
model {
for (i in 1:nPossSpp) {
### LIKELIHOOD
# Ecological model
for (j in 1:nSites) {
logit(psi[i, j]) <- sum(psiCoef[i, ] * Psi.cov[j, ]) # same code for all models
Z[i, j] ~ dbern(psi[i, j])
# Observation model
for (k in 1:nSurveys[j]) {
logit(detect[i, j, k]) <- sum(detCoef[i, ] * Det.cov[j, k, ])
true.detect[i, j, k] <- detect[i, j, k] * Z[i, j]
Y[i, j, k] ~ dbern(true.detect[i, j, k])
}
}
### PRIORS for coefficients for each species
for (m in 1:nPsiCoef){
psiCoef[i, m] ~ dnorm(muPsiCoef[m], tauPsiCoef[m])
}
for (m in 1:nDetCoef){
detCoef[i, m] ~ dnorm(muDetCoef[m], tauDetCoef[m])
}
}
### HYPER-PRIORS for coefficients
for(m in 1:nPsiCoef){
muPsiCoef[m] ~ dnorm(0, muPsiCoef.tau)
halfsigmaPsiCoef[m] ~ dt(0, t.tau, 1) # Cauchy distribution
sigmaPsiCoef[m] <- abs(halfsigmaPsiCoef[m])# folded or half-Cauchy
tauPsiCoef[m] <- pow(sigmaPsiCoef[m], -2)
}
for(m in 1:nDetCoef){
muDetCoef[m] ~ dnorm(0, muDetCoef.tau)
halfsigmaDetCoef[m] ~ dt(0, t.tau, 1)
sigmaDetCoef[m] <- abs(halfsigmaDetCoef[m])
tauDetCoef[m] <- 1 / (sigmaDetCoef[m] * sigmaDetCoef[m])
}
}
The likelihood is standard occupancy stuff, with an
ecological model for the latent occupancy state, Z , and an
observation model which accounts for probability of
detection. We use vector multiplication with the vector of
coefficients for the species and the vector of covariate values
for the site or survey. The first element in the covariate vector is
always 1 and the first coefficient is thus the intercept. This code
will work with covariate matrices with different numbers of
columns for the different models. The priors for the
coefficients for each species are normal with mean and SD to be
estimated. Priors for the SDs (ie, the hyperpriors) use a
half-Cauchy distribution. We save the data block
and the model code to a file named "MSOM_KnownSps.jags".
Running the analysis
We can run all the models to check convergence before
starting the cross-validation. The following code will do this;
it took 3.3 hrs on my 4-core laptop. If you have more cores, you
will likely want to run the models in parallel. See the next
section for code to do this. Note that we provide starting
values for Z, as usual, and also for the hyper-parameter SDs to
keep them below 5, as random draws from a Cauchy distribution
can be very large. library(jagsUI)
wanted <- c("muPsiCoef", "sigmaPsiCoef",
"muDetCoef", "sigmaDetCoef",
"detCoef", "psiCoef")
modelOut <- list()
for(i in seq_along(modelPsi)) {
inits <- function() list(Z = matrix(1, nSpecies, nSites),
halfsigmaPsiCoef = runif(length(modelPsi[[i]]), 0, 5),
halfsigmaDetCoef = runif(length(modelP[[i]]), 0, 5))
jdata <- list(Y = Y, nSurveys = nSurveys,
Psi.cov = Psi.cov[, modelPsi[[i]]], Det.cov = X[, , modelP[[i]]])
modelOut[[i]] <- jags(jdata, inits, wanted, "MSOM_KnownSps.jags",
n.chains=3, n.adapt=1000, n.iter= 5000, parallel=TRUE)
}
for(i in seq_along(modelOut))
wiqid::diagPlot(modelOut[[i]], param=1:18, main=names(modelPsi)[i])
Running cross-validation
Dividing the date set into folds We'll do 5-fold
cross validation, dividing the 113 sites into 5 groups, called
"folds". One fold is held out and the model fitted to the
remaining folds, then the parameter estimates are used to make
predictions for the sites held out, which are compared with the
actual observations. Each site is allocated to one fold. We can
do this with the following code (which does not require us to
install extra programs):
nFolds <- 5
( foldSize <- nSites %/% nFolds + 1 )
tmp <- matrix(0, nFolds, foldSize)
set.seed(123) # Note that a new RNGkind is used for sample.int from R 3.6.0
for(i in 1:foldSize)
tmp[, i] <- sample.int(nFolds)
foldID <- as.vector(tmp)[1:nSites]
table(foldID)
The method used here is not the same as in the paper. Each
fold gets one randomly-selected site from among the first five
in the data, then one from the second five, one from the third
five, and so on. If the data are arranged in groups or in a
logical order - first to last or west to east, say - this gives
more representative folds compared with pure random allocation.
Scoring the fit
of the predictions to the observed data Broms et al used 4
methods for scoring the difference between the predictions and
the observations. See the paper for details. The 0-1 loss
compares the observations to 0-1 predictions while the others
compare with a probability. We prepare simple functions to
calculate the different scores:
# probs : an array of probabilities
# preds : an array of 1/0 predictions
# bool : a matching array of observed TRUE/FALSE or 1/0 values
# output: scalar, the sum or mean score
# 0-1 loss function
score01 <- function(preds, bool)
sum(abs(preds - bool), na.rm=TRUE)
# Area under ROC curve
auroc <- function(probs, bool) {
ok <- is.finite(bool) & is.finite(probs) # remove NAs
bool <- bool[ok]
probs <- probs[ok]
(mean(rank(probs)[bool==1]) - (sum(bool)+1)/2) / sum(bool==0)
}
# Brier score
brier <- function(probs, bool) {
sum(bool * (1 - probs)^2 + (1 - bool) * probs^2, na.rm=TRUE)
}
# Log(likelihood) score for Bernoulli trials
loglikBern <- function(probs, bool) {
sum(dbinom(bool, 1, probs, log=TRUE), na.rm=TRUE)
}
In our case, the probability we want is the unconditional
probability of detection. The usual probability of detection in
occupancy models - detect in the code above - is
conditional on the site being occupied. We assume that if the
site is not occupied the species will not be detected, ie, no
false positives. For scoring, we need the unconditional
probability of detection, ucpd , ie, the probability
that the site is occupied and the species is detected.
Hence ucpd = psi * detect . Broms et al refer to
this as integrated.probability . (What's the
probability of detecting a polar bear in downtown New York?
Answers: (1) Practically zero, polar bears don't occur in New
York. (2) Practically one, polar bears are big, white animals,
easily detected; also scary, folks will be running and
screaming. Both are right: (2) is conditional probability of
detection, ignores the (tiny) probability that the bear would be
there; (1) is unconditional probability of detection.) Setting up the parallel processing We have five
models and we need to run each five times, once for each fold.
The 25 model runs are independent, so can easily be run in
parallel.
- With just 4 cores (my laptop), run the folds in series
but run the 3 chains in JAGS in parallel.
- If you have 8 cores (my old desktop), you can run
each fold in a different core, but don't use
parallel=TRUE in
the call to JAGS.
- With 20 cores or more (my new desktop), run the
folds in parallel and the chains in parallel too.
The code below sets up parallel
processing using foreach for my 24-core desktop. Each fold runs
in a separate process, and each of these spawns 3 new processes
for the 3 chains, so we have 20 processes in all.
library(parallel)
library(foreach)
library(doParallel)
detectCores() # On my big desktop: 24
ncores <- 5 # 5 folds running in parallel assuming >5 cores.
# If >= 15 cores, run 3 parallel chains per fold, otherwise do in serial.
cl <- makeCluster(ncores)
registerDoParallel(cl)
Running the models and estimating the scores On my desktop I used an outer foreach
loop for each of the folds, and an inner for loop
for the 5 models; this took about 15 mins with n.iter=500
and 1.8 hrs with n.iter=5000 . For the laptop, I ran the
folds in series instead of parallel by changing %dopar%
to %do% ; with n.iter=500 it took 1.8 hrs. system.time(
result <- foreach(fold = 1:5, .errorhandling='pass', .packages=c("jagsUI")) %dopar% {
TEST <- foldID == fold # Just for readability
TRAIN <- foldID != fold
# matrix of results for this fold
CVscores <- matrix(fold, nrow=length(modelPsi), ncol=5)
rownames(CVscores) <- names(modelPsi)
colnames(CVscores) <- c("fold", "score01", "Log_score", "AUC", "Brier")
# Run all 5 models
for(m in seq_along(modelPsi)) {
inits <- function() list(Z = matrix(1, nSpecies, sum(TRAIN)),
halfsigmaPsiCoef = runif(length(modelPsi[[m]]), 0, 5),
halfsigmaDetCoef = runif(length(modelP[[m]]), 0, 5))
jdata <- list(Y = Y[, TRAIN, ], nSurveys = nSurveys[TRAIN],
Psi.cov = Psi.cov[TRAIN, modelPsi[[m]]], Det.cov = X[TRAIN, , modelP[[m]]])
wanted <- c("psiCoef", "detCoef") # all we need for score calculation
out <- jags(jdata, inits, wanted, "MSOM_KnownSps.jags", codaOnly=wanted,
# n.chains=3, n.adapt=1000, n.iter= 5000, n.thin=5, parallel=TRUE)
n.chains=3, n.adapt=100, n.iter= 500, parallel=TRUE)
# only use parallel=TRUE if you have enough cores
# Work through the MCMC chains to calculate psi, detect, and ucpd,
# plus predictions for Z and Y.
nIter <- length(out$sims.list$deviance)
fPsi.cov = Psi.cov[TEST, modelPsi[[m]]] # covariate matrices for this fold and model
fDet.cov = X[TEST, , modelP[[m]]]
sc01 <- llk <- AUC <- br <- numeric(nIter) # objects to hold the output
psi <- Zpred <- matrix(NA, nSpecies, sum(TEST))
detect <- true.detect <- Ypred <- ucpd <- array(NA, c(nSpecies, sum(TEST), maxSurveys))
for(iter in 1:nIter) {
psiCoef <- out$sims.list$psiCoef[iter,,]
detCoef <- out$sims.list$detCoef[iter,,]
# The code here mimics the JAGS code:
for (i in 1:nSpecies) {
# ecological model
for (j in 1:sum(TEST)) {
psi[i, j] <- plogis(sum(psiCoef[i, ] * fPsi.cov[j, ]))
Zpred[i, j] <- rbinom(1, 1, psi[i, j])
# observation model
for (k in 1:nSurveys[TEST][j]) {
detect[i, j, k] <- plogis(sum(detCoef[i, ] * fDet.cov[j, k, ]))
true.detect[i, j, k] <- detect[i, j, k] * Zpred[i, j]
Ypred[i, j, k] <- rbinom(1, 1, true.detect[i, j, k])
ucpd[i, j, k] <- detect[i, j, k] * psi[i, j]
}
}
}
sc01[iter] <- score01(Ypred, Y[, TEST, ])
llk[iter] <- loglikBern(ucpd, Y[, TEST, ])
AUC[iter] <- auroc(ucpd, Y[, TEST, ])
br[iter] <- brier(ucpd, Y[, TEST, ])
} # end of MCMC chains
CVscores[m, 'score01'] <- mean(sc01)
CVscores[m, 'Log_score'] <- -2 * mean(llk)
CVscores[m, 'AUC'] <- mean(AUC)
CVscores[m, 'Brier'] <- mean(br)
} # end "Run all 5 models"
CVscores
} ) # end of fold processing
result
tmp <- simplify2array(result)
apply(tmp, 1:2, mean) # Compare with Broms et al Table 2
The first part of the code is similar to the code in the
"Running the analysis" section above, except that we only pass
the data for the training sites to JAGS and we only harvest the
coefficients, which we need to calculate the scores. The
post-processing uses nested loops. It would be more
efficient and faster to use R's functions for handling
matrices and arrays, but they can take a lot of memory.
Working patiently through the MCMC iterations one by one
keeps memory demands within reason; it needs about 600 MB of
memory. Using the same
nested loops as the JAGS model makes it easy to see that we
are doing the same calculations. We simulate predicted
values of Y for the test sites, Ypred ,
and compare these with the actual values to obtain the 0-1
score. The unconditional probability of detection,
ucpd , is compared with the actual values to obtain
the log-likelihood, AUC and Brier scores.
fold score01 Log_score AUC Brier
limited 3 600.22 2032.4 0.889 309.80
mid1 3 560.98 1955.5 0.902 295.00
mid2 3 508.32 1913.9 0.913 272.94
full 3 504.76 1946.3 0.911 277.02
extra 3 602.21 2038.0 0.888 311.29
We don't get the same values as in Table 2 of the paper -
indeed, I don't get the same values every time that I run this
code - but they are in the same ballpark and give the same
results: 0-1 score is lowest for the FULL model and the others
are best for the MIDDLE-2 model. I hope this code
will be useful if you want to do CV with MSOM models with survey
covariates, where you are working with the 1/0 data points.
(It's different if you don't have survey-level covariates and
can aggregate across occasions to get number of detections and
number of occasions.) A big thank you to Kristen Broms, Mevin
Hooten and Ryan Fitzpatrick for the original code in the
publication. |