Simulating habitat covariates
We use simulations to evaluate our study designs, in particular
to see if we can get good estimates of the parameters of
interest with our proposed design and sample size, or to see
what sample size we need to get adequate precision for our
estimates. The effects of habitat covariates are often of
interest, and this post is about how we simulate habitat
A common method is to draw habitat covariates from
a normal distribution using the
That may not be very realistic, and we'll look at a few data sets collected in the field to see how they are distributed. If they are not approximately normal, the next question will be: Does it matter?
How are real habitat covariates distributed?
I had a quick look at data sets I've worked on myself, but
the examples here are mostly from data that y'all can access in
the CRAN packages
I used the package
If the study area has a wide range of elevations, this is an
obvious covariate to use.
With many more points, Switzerland gives the smoother histogram. Both show a mode around 500m and a plateau from about 1000 to 2500m. The plots monitored do not go beyond 2750m, excluding the highest peaks in the country. Both distributions are skewed and far from normal; the gamma and log-normal distributions are better, but still a poor fit.
Most elevation data sets have clear maxima and minima without long tails (negative elevations are rare!), and are often approximately uniform. But I did find one which was approximately normal, in the Hubbard Brook data set:
The Swiss surveys use "forest" as a covariate: the percentage of the 1x1 km plot covered by forest. Other options may be the proportion of grassland or willow thickets or potato fields in the plot or the surrounding area. Here are the plots for forest cover for the Switzerland, originally percentages which I converted to proportions:
These show big spikes at zero: 12% and 26% (respectively) of the values are zero. (I suspect many of the plots in the Switzerland data set with 0 forest are built-up areas not included in the bird monitoring.) Attempts to fit distributions to these data failed, so I removed the zeros:
There is still a preponderance of tiny values, and a beta distribution with mode 0 is a decent fit to the MHB data. I couldn't fit the beta distribution to the Switzerland data.
Zero-inflation is a feature of many "cover" covariates, but 1-inflation can occur too. The permanent grass cover covariate for sites in the Netherlands has 63% zeros (as well as a large proportion < 0.01) and 11% 1's:
The sizeof the habitat patch is often a sensible covariate to include, whether the patch in question is a water body, an island, a forest or a clearing. Again these are non-negative with a mode close to zero (how close depending on when a patch is too small to be counted) and highly skewed with a small number of extreme values. An example is the areas of 38 "wildlife openings" from the chestnut-sided warbler data set:
The largest opening is 15.72ha, the second largest just 4.70ha.
For a bigger patch-area data set, I downloaded GIS info for wetlands in the Adirondacks Park, USA. This has 19,842 wetlands ranging from 10 m2 to 13 km2 and a distribution which is highly skewed. I always advocate using log(area) for biological reasons: the difference between 1 vs 2ha is going to be bigger than 10 vs 11ha or 100 vs 101ha; to get comparable differences, you should compare 10 vs 20ha and 100 vs 200ha. I was amazed to see that the distribution of log(area) is a nice, symmetrical bell-shape! It's close to normal, but with a higher peak (greater kurtosis).
Distance from ...
A common type of covariate that I see is distance from the nearest road, river, settlement, forest edge, etc, intended as a proxy measure for the level of disturbance. These variables are skewed, with a mode at 0, and a definite upper limit. Below is the plot for a friend's data for camera traps; it doesn't have the mode at 0 expected from theory because they avoided setting camera traps too close to the road. For a point source such as a settlement, the square root provides a good model for disturbance; perhaps a road is, in practice, a series of point sources.
For a larger data set, I looked at Yellowstone National Park, USA, which has a network of sealed roads heavily used by tourists. The map below shows the roads and the distance from road. Most parts are close to roads, though in the SE of the park you can get 36 km from a road. This is reflected in the histogram of distances. The log of distance (not shown) is weird with a long left tail due to the vast number of points close to roads, but the square root has an almost normal distribution.
Distance from the forest edge or the protected area boundary is
also a sensible proxy for degree of disturbance. I tried that
for Yellowstone, but that is so close to square it may not be
typical. So I found a funky shaped patch in the
The Housing Density covariate in the MesoCarnivores data set shows a similar exponential shape. The height of the first bar is not due to zero inflation, the 1473 values include just one 0. Using the log makes sense, but the distribution of the log is still highly skewed.
The Hubbard Brook data set includes slope and aspect covariates. The slope variable looks roughly normal, though of course it doesn't have negative values.
The aspect variable is odd because it is a circular variable: 0
and 360° are both north. In this case we have modes near due
south and due north, as you would expect if the study area
centred on an east-west ridge or valley. You could simulate
something like that with the
It's rare that a real habitat covariate is close to a normal distribution or even a unform distribution, though we use these routinely to generate "covariate" values for simulations. I used simulations (or course!) to investigate the effect of this.
I used a
very simple model, generating binomial presence data with
probability determined by a logit expression, and analysed the
I tried six different versions of the covariate
All were standardized to mean zero and SD 1.
I did 100,000 iterations, retaining the estimated coefficients, and compared the overall means as a check for bias, and the Root Mean Square Errors (RMSEs) of the slope coefficient.
Here are the results for
Next the RMSEs (sorted):
An alternative way to assess the results is to decide what error is acceptable and see the probability of getting an unacceptable error. If we think an error of ±20%, ie, values between 0.8 and 1.2, are acceptable, the probability of getting an unacceptable estimate is:
That's scary! ±20% seems a pretty low bar, and we still have a 25-30% probability of getting an unacceptable estimate. Even without any complications such as non-detection errors, a sample size of 200 may not be adequate for binary data.
The differences seem to be small, but they are not negligible. To get the RMSE for the logArea distribution down below 0.18 needs a sample size of 250 (instead of 200).
Anyway, as I expected, the uniform distribution works better than the normal distribution, as the latter has values more "bunched up" around the mean. More technically, the uniform distribution has the lowest kurtosis; the values are:
So what can we conclude from all this?
First of course, do carry out simulations as part of every study design to make sure that you will have a large enough sample to get a good enough estimate of the parameters you are interested in.
For your real study, if you can chose the values of a covariate of interest, make sure they have a uniform distribution. For example, if you are interested in estimating the effect of elevation, space out your sample over the full range of elevations in the study area. If it's distance from edge and you plan to use the log of the distance, chose points to give a uniform distribution of log(distance).
For simulations, chose distributions that correspond to those
you'll use for the real study. If you will be using random
points over the study area, try to get real data for the area as
GIS layers, create random points for each simulations, then
extract the covariate values from the GIS layers. That can be
done in R with functions from the
If you really don't have any idea of the distribution of the covariate - you don't have a particular study area in mind, or you're writing a function for a package and need a default - the normal distribution may be best, as it is "pessimistic" about the efficiency of estimation with a given sample size.
|Updated 17 November 2020 by Mike Meredith|