Simulating habitat covariates

HOME We use simulations to evaluate our study designs, in particular to see if we can get good estimates of the parameters of interest with our proposed design and sample size, or to see what sample size we need to get adequate precision for our estimates. The effects of habitat covariates are often of interest, and this post is about how we simulate habitat covariates.

A common method is to draw habitat covariates from a normal distribution using the rnorm function in R. Often we set mean = 0 and sd = 1, so that the values resemble standardized covariate values.

That may not be very realistic, and we'll look at a few data sets collected in the field to see how they are distributed. If they are not approximately normal, the next question will be: Does it matter?

How are real habitat covariates distributed?

I had a quick look at data sets I've worked on myself, but the examples here are mostly from data that y'all can access in the CRAN packages AHMbook and unmarked.

I used the package fitdistrplus to investigate patterns in the data and fit possible distributions. The fitdist function with MLE gives AIC values, and the plots below show the AIC-best fit as well as the normal distribution and a empirical kernel-density fit.


If the study area has a wide range of elevations, this is an obvious covariate to use. AHMbook has several data sets from the Swiss breeding bird survey with elevation as a habitat covariate. The unmarked package has  the same covariates for all 42,275 plots in Switzerland. Here are the plots for the two data sets:

With many more points, Switzerland gives the smoother histogram. Both show a mode around 500m and a plateau from about 1000 to 2500m. The plots monitored do not go beyond 2750m, excluding the highest peaks in the country. Both distributions are skewed and far from normal; the gamma and log-normal distributions are better, but still a poor fit.

Most elevation data sets have clear maxima and minima without long tails (negative elevations are rare!), and are often approximately uniform. But I did find one which was approximately normal, in the Hubbard Brook data set:

Vegetation cover

The Swiss surveys use "forest" as a covariate: the percentage of the 1x1 km plot covered by forest. Other options may be the proportion of grassland or willow thickets or potato fields in the plot or the surrounding area. Here are the plots for forest cover for the Switzerland, originally percentages which I converted to proportions:

These show big spikes at zero: 12% and 26% (respectively) of the values are zero. (I suspect many of the plots in the Switzerland data set with 0 forest are built-up areas not included in the bird monitoring.) Attempts to fit distributions to these data failed, so I removed the zeros:

There is still a preponderance of tiny values, and a beta distribution with mode 0 is a decent fit to the MHB data. I couldn't fit the beta distribution to the Switzerland data.

Zero-inflation is a feature of many "cover" covariates, but 1-inflation can occur too. The permanent grass cover covariate for sites in the Netherlands has 63% zeros (as well as a large proportion < 0.01) and 11% 1's:

Patch size

The sizeof the habitat patch is often a sensible covariate to include, whether the patch in question is a water body, an island, a forest or a clearing. Again these are non-negative with a mode close to zero (how close depending on when a patch is too small to be counted) and highly skewed with a small number of extreme values. An example is the areas of 38 "wildlife openings" from the chestnut-sided warbler data set:

The largest opening is 15.72ha, the second largest just 4.70ha.

For a bigger patch-area data set, I downloaded GIS info for wetlands in the Adirondacks Park, USA. This has 19,842 wetlands ranging from 10 m2 to 13 km2 and a distribution which is highly skewed. I always advocate using log(area) for biological reasons: the difference between 1 vs 2ha is going to be bigger than 10 vs 11ha or 100 vs 101ha; to get comparable differences, you should compare 10 vs 20ha and 100 vs 200ha. I was amazed to see that the distribution of log(area) is a nice, symmetrical bell-shape! It's close to normal, but with a higher peak (greater kurtosis).

Distance from ...

A common type of covariate that I see is distance from the nearest road, river, settlement, forest edge, etc, intended as a proxy measure for the level of disturbance. These variables are skewed, with a mode at 0, and a definite upper limit. Below is the plot for a friend's data for camera traps; it doesn't have the mode at 0 expected from theory because they avoided setting camera traps too close to the road. For a point source such as a settlement, the square root provides a good model for disturbance; perhaps a road is, in practice, a series of point sources.

For a larger data set, I looked at Yellowstone National Park, USA, which has a network of sealed roads heavily used by tourists. The map below shows the roads and the distance from road. Most parts are close to roads, though in the SE of the park you can get 36 km from a road. This is reflected in the histogram of distances. The log of distance (not shown) is weird with a long left tail due to the vast number of points close to roads, but the square root has an almost normal distribution.


Distance from the forest edge or the protected area boundary is also a sensible proxy for degree of disturbance. I tried that for Yellowstone, but that is so close to square it may not be typical. So I found a funky shaped patch in the makeJAGSmask package. As with the roads, the high proportion of very small values makes the log of distance look odd, but the square root of distance has a plateau with a high proportion of the area being a moderate distance from the edge.

Other covariates

The Housing Density covariate in the MesoCarnivores data set shows a similar exponential shape. The height of the first bar is not due to zero inflation, the 1473 values include just one 0. Using the log makes sense, but the distribution of the log is still highly skewed.

The Hubbard Brook data set includes slope and aspect covariates. The slope variable looks roughly normal, though of course it doesn't have negative values.

The aspect variable is odd because it is a circular variable: 0 and 360 are both north. In this case we have modes near due south and due north, as you would expect if the study area centred on an east-west ridge or valley. You could simulate something like that with the circular package in R.

The British willowWarbler data set in AHMbook uses mean growing degree days (GDD: the sum of daily mean temperatures above 5.5C) as a covariate. This varies across Britain depending on latitude and elevation as shown in the map below. It's not surprising then that the distribution is nowhere near normal.

Does it matter what distribution we use in simulations?

It's rare that a real habitat covariate is close to a normal distribution or even a unform distribution, though we use these routinely to generate "covariate" values for simulations. I used simulations (or course!) to investigate the effect of this.

I used a very simple model, generating binomial presence data with probability determined by a logit expression, and analysed the data with glm.

psi <- plogis(b0 + x*b1)
z <- rbinom(nsites, 1, psi)
fit <- glm(z ~ x, family="binomial")

I tried six different versions of the covariate x:

  • random draws from rnorm(n, 0, 1)
  • random draws from runif(n, 0, 1)
  • randomly selected values from Switzerland$elevation
  • randomly selected values from Switzerland$forest
  • randomly selected values from the log of the wetland areas in Adirondack Park.
  • randomly selected values from the British mean growing degree days (GDD) data above.

All were standardized to mean zero and SD 1.

I did 100,000 iterations, retaining the estimated coefficients, and compared the overall means as a check for bias, and the Root Mean Square Errors (RMSEs) of the slope coefficient.

Here are the results for b0=0, b1=1, and nsites=200. First the means of the 100,000 estimates for the intercept and slope, where there appears to be a small positive bias in the slope estimates:

         intrcpt slope
normal   -0.0005 1.019
uniform  -0.0000 1.017
elevation 0.0040 1.018
forest    0.0427 1.019
logArea   0.0003 1.022
gdd      -0.0022 1.018

Next the RMSEs (sorted):

  uniform elevation    gdd  forest    normal   logArea 
    0.179     0.184  0.186   0.187     0.192     0.200 

An alternative way to assess the results is to decide what error is acceptable and see the probability of getting an unacceptable error. If we think an error of 20%, ie, values between 0.8 and 1.2, are acceptable, the probability of getting an unacceptable estimate is:

  uniform elevation    gdd  forest    normal   logArea 
    0.255     0.269  0.273   0.275     0.286     0.304 

That's scary!  20% seems a pretty low bar, and we still have a 25-30% probability of getting an unacceptable estimate. Even without any complications such as non-detection errors, a sample size of 200 may not be adequate for binary data.

The differences seem to be small, but they are not negligible. To get the RMSE for the logArea distribution down below 0.18 needs a sample size of 250 (instead of 200).

Anyway, as I expected, the uniform distribution works better than the normal distribution, as the latter has values more "bunched up" around the mean. More technically, the uniform distribution has the lowest kurtosis; the values are:

  • uniform: 1.80
  • elevation: 2.36
  • forest: 2.44
  • gdd: 2.46
  • normal: 3
  • logArea: 3.70

So what can we conclude from all this?


 First of course, do carry out simulations as part of  every study design to make sure that you will have a large enough sample to get a good enough estimate of the parameters you are interested in.

For your real study, if you can chose the values of a covariate of interest, make sure they have a uniform distribution. For example, if you are interested in estimating the effect of elevation, space out your sample over the full range of elevations in the study area. If it's distance from edge and you plan to use the log of the distance, chose points to give a uniform distribution of log(distance).

For simulations, chose distributions that correspond to those you'll use for the real study. If you will be using random points over the study area, try to get real data for the area as GIS layers, create random points for each simulations, then extract the covariate values from the GIS layers. That can be done in R with functions from the raster, sp or sf packages.

If you really don't have any idea of the distribution of the covariate - you don't have a particular study area in mind, or you're writing a function for a package and need a default - the normal distribution may be best, as it is "pessimistic" about the efficiency of estimation with a given sample size.

Updated 17 November 2020 by Mike Meredith