Probabilities and computer limitations

There are limits to the size and precision of the numbers that computers can handle, and these can cause problems when we try to calculate probabilities, and especially likelihoods. Positive numbers smaller than about 1e-308 become zero, and numbers within about 1e-16 of 1 become exactly 1. Although not important in the real world, these rounding errors mean that there are large ranges of parameter values that have likelihoods of exactly 0 or 1. Our algorithms to find maximum likelihoods or generate MCMC chains fail when they crash into the 1 cliff or fall into the 0 abyss.

The solution is to work with logarithms of probabilities instead of the actual values, log(p) instead of p; for example, we routinely work with log-likelihoods. Multiplying probabilities is then simply a matter of adding up the logs. But sometimes we need to add up probabilities or calculate the complement, 1 - p, and we need to do that without falling into the 0 abyss or smashing into the 1 wall.

Floating point numbers

In R, virtually all numbers are stored as floating point numbers, which use the same idea as scientific notation: a number such as 2017 will be stored as 2.017e3. The "2.017" part is the significand, the "3" is the exponent. It's a bit more complicated than that, because numbers are stored in binary form, not decimal, but the same idea is used. On most computers, 53 bits are used for the significand, which corresponds to about 16 decimal digits, and 11 bits for the exponent, which allows decimal exponents between about -308 and +308.
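
As a rough check on those figures, the arithmetic can be done in R itself. This is a sketch that assumes the usual IEEE 754 double-precision format, which is what .Machine reports on most machines:

```r
# Bits in the significand, and the equivalent number of decimal digits
.Machine$double.digits                # 53 on IEEE 754 machines
.Machine$double.digits * log10(2)     # about 15.95, so roughly 16 digits

# Largest binary exponent, and the equivalent power of 10
.Machine$double.max.exp               # 1024 on IEEE 754 machines
.Machine$double.max.exp * log10(2)    # about 308.25
```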

Because of the limits on the exponent, values outside the range of ±1e308 are represented in R by ±Inf, while values within the range ±1e-308 become exactly 0. And the limits on the significand mean that numbers between 0.9999999999999999 and 1.0000000000000001 are rounded to exactly 1. Those are approximate figures; you can check the actual values for your own computer with .Machine, and read the help page explaining their meanings with ?.Machine.
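
A few one-liners show these limits in action. The particular numbers are picked for illustration and assume standard double precision:

```r
1e308 * 10                  # too big: Inf
-1e308 * 10                 # -Inf
1e-308 / 1e20               # too small: exactly 0
(1 - 1e-17) == 1            # TRUE: the difference is lost in rounding
(1 + 1e-17) == 1            # TRUE

# The exact limits for this machine
.Machine$double.xmax        # largest finite value
.Machine$double.xmin        # smallest normalised positive value
.Machine$double.eps         # smallest x with 1 + x != 1
.Machine$double.neg.eps     # smallest x with 1 - x != 1
```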


Some example probabilities

We need some probabilities to play with, and we'll generate a wide range of values using the plogis function. In practice, we often model probabilities on the logit scale and then convert to probabilities when we need to.

logit_p <- seq(-50, 50, by=10)
( p <- plogis(logit_p) )
 [1] 1.928750e-22 4.248354e-18 9.357623e-14 2.061154e-09 4.539787e-05
 [6] 5.000000e-01 9.999546e-01 1.000000e+00 1.000000e+00 1.000000e+00
[11] 1.000000e+00

The last 4 values appear to be 1, but that may just be due to rounding when printed. We can check with:

p == 1

The print-out is rounded to 1 for the 8th and 9th values, but the last 2 really are one. Values in the range 1 ± 1e-16 are rounded internally to 1; see .Machine$double.eps and .Machine$double.neg.eps for the exact range for your machine. Values above 36.7 on the logit scale all give probabilities equal to 1. Biologically, this is not important: 0.9999999999999999 is close enough to 1 for all practical purposes. But it causes problems when trying to estimate maximum likelihoods, as it introduces an artificial plateau in the likelihood surface, a region where the likelihood is completely flat with no gradient. Optimising functions and MCMC samplers can get stuck on these plateaux, with no indication of which way to head to get to higher likelihoods.
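
The plateau is easy to see by evaluating a naive log-likelihood along the logit scale. This sketch assumes a single observed success, so that the likelihood is just p; the grid values are arbitrary:

```r
logit_grid <- c(36, 38, 40, 50)
naive_ll <- log(plogis(logit_grid))   # log-likelihood of one success
naive_ll          # slightly below 0 at 36, exactly 0 from 38 upwards
diff(naive_ll)    # zeros from 38 up: no gradient for an optimiser to follow
```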

Using log probabilities

We will look at how we can use log-probabilities. The obvious first try is to take the log of the p vector we already have:

log(p)
 [1] -5.000000e+01 -4.000000e+01 -3.000000e+01 -2.000000e+01 -1.000005e+01
 [6] -6.931472e-01 -4.539890e-05 -2.061154e-09 -9.348078e-14 0.000000e+00
[11] 0.000000e+00

Er, no, that doesn't work: the last 2 values are now exactly 0, because the corresponding values of p had already been rounded to 1 before we took the log. Most R functions which use probabilities have a log.p or log argument: we need to put log.p = TRUE in the call to plogis:

( log_p <- plogis(logit_p, log.p=TRUE) )
 [1] -5.000000e+01 -4.000000e+01 -3.000000e+01 -2.000000e+01 -1.000005e+01
 [6] -6.931472e-01 -4.539890e-05 -2.061154e-09 -9.357623e-14 -4.248354e-18
[11] -1.928750e-22

Now we can work with probabilities from very near 0 to very near 1 without problems. In many applications, we need to multiply probabilities, and that just means that we add up the logs instead. But adding and subtracting probabilities still need care.
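
Here is a small illustration of why the log scale helps with products. The vector p_obs is invented for the example: 100 independent events, each with probability 1e-5.

```r
p_obs <- rep(1e-5, 100)
prod(p_obs)        # the true value is 1e-500, which underflows to 0
sum(log(p_obs))    # about -1151.3, no problem on the log scale
```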

Adding probabilities

This generally becomes a problem if you have a long vector of probabilities, but we need a short vector to be able to explore the options. This one will do:

log_p <- -745:-760
exp(log_p) == 0

All except the first value in this vector turn into a zero when converted back to a probability, so we can't just use sum(exp(log_p)).
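
It is worth seeing what the naive calculation actually returns. On a typical machine the first value survives only as the smallest representable number, so the answer is not -Inf but it is still wrong. Treat the figures in the comments as approximate, since they depend on how the machine rounds very small numbers:

```r
log_p <- -745:-760
naive <- log(sum(exp(log_p)))
naive    # about -744.44 if exp(-745) survives as the smallest denormal, else -Inf
# The correct value, worked out in the sections that follow, is about -744.54
```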

A simple solution

A simple workaround is to scale all the probabilities so that the biggest = 1, then scale back again after the addition. The scaling can easily be done by subtracting the biggest log_p value from all of them:

exp(log_p - max(log_p))
 [1] 1.000000e+00 3.678794e-01 1.353353e-01 4.978707e-02 1.831564e-02
 [6] 6.737947e-03 2.478752e-03 9.118820e-04 3.354626e-04 1.234098e-04
[11] 4.539993e-05 1.670170e-05 6.144212e-06 2.260329e-06 8.315287e-07
[16] 3.059023e-07

Now we can add up the values, convert back to a log, and add back the biggest log_p value:

sum(exp(log_p - max(log_p)))
[1] 1.581977
log(sum(exp(log_p - max(log_p))))
[1] 0.458675
log(sum(exp(log_p - max(log_p)))) + max(log_p)
[1] -744.5413

A better version

We can still get into trouble if all the smaller values together add up to less than 2.2e-16 (or your value for .Machine$double.eps), as the total will then be rounded to exactly 1 and its log to 0. The R function log1p is designed to give the correct value for log(1 + x) for very small values of x. The scaling is the same as before, but this time we leave out the largest value (which is 1) when we do the summation, then use log1p to get the correct log, and finally scale back again:

( p_max <- which.max(log_p) )  # find which value is the largest
[1] 1
exp(log_p[-p_max] - max(log_p))
    # use negative index to exclude the largest (which is 1)
 [1] 3.678794e-01 1.353353e-01 4.978707e-02 1.831564e-02 6.737947e-03
 [6] 2.478752e-03 9.118820e-04 3.354626e-04 1.234098e-04 4.539993e-05
[11] 1.670170e-05 6.144212e-06 2.260329e-06 8.315287e-07 3.059023e-07
sum(exp(log_p[-p_max] - max(log_p)))
[1] 0.5819765
log1p(sum(exp(log_p[-p_max] - max(log_p))))  # log1p puts back the 1
[1] 0.458675
log1p(sum(exp(log_p[-p_max] - max(log_p)))) + max(log_p)
[1] -744.5413

In this example, the small values add up to 0.58, so we don't really need the better version, but it does produce a more robust general purpose routine, which we can put into a function like this:

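logSumExp <- function(log_p) {
  p_max <- which.max(log_p)  # position of the largest value
  # scale by the largest, sum the rest, let log1p put back the 1, scale back
  log1p(sum(exp(log_p[-p_max] - max(log_p)))) + max(log_p)
}
logSumExp(log_p)
[1] -744.5413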
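
To see a case where the log1p version makes a difference, here is a sketch with one dominant value and one tiny one. The vector log_q is made up for illustration, and both calculations are written out in full so the block runs on its own:

```r
# Hypothetical input: one dominant log-probability plus a tiny one
log_q <- c(0, -40)
m <- max(log_q)
simple <- log(sum(exp(log_q - m))) + m    # 1 + 4.2e-18 rounds to 1
better <- log1p(sum(exp(log_q[-which.max(log_q)] - m))) + m
simple    # 0: the small term has vanished
better    # about 4.2e-18: the small term survives
```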

Hat tip: Bill Dunlap and Spencer at the R help forum.

Update: This won't work if all the probabilities being added up are zero! See new post.
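
One possible guard, which is a guess at a fix and not necessarily what the follow-up post does, is to return -Inf straight away when the largest log-probability is itself -Inf. The name logSumExp0 is made up for this sketch, and it assumes there are no missing values in the input:

```r
logSumExp0 <- function(log_p) {
  m <- max(log_p)
  # If every probability is zero, so is their sum; without this check
  # -Inf - (-Inf) would give NaN
  if (m == -Inf) return(-Inf)
  p_max <- which.max(log_p)
  log1p(sum(exp(log_p[-p_max] - m))) + m
}

logSumExp0(c(-Inf, -Inf))   # -Inf
logSumExp0(-745:-760)       # about -744.5413, as before
```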

Subtracting probabilities

...or to be precise, calculating 1 - p. This is problematic when either p or 1 - p is close to 1. We'll use the original example data for this:

logit_p <- seq(-50, 50, by=10)
( log_p <- plogis(logit_p, log.p=TRUE) )
 [1] -5.000000e+01 -4.000000e+01 -3.000000e+01 -2.000000e+01 -1.000005e+01
 [6] -6.931472e-01 -4.539890e-05 -2.061154e-09 -9.357623e-14 -4.248354e-18
[11] -1.928750e-22

If we already know logit_p, it's easy to get 1 - p or log(1 - p): simply put a minus sign in front of logit_p:

( log_1mp <- plogis( -logit_p, log.p=TRUE) )
 [1] -1.928750e-22 -4.248354e-18 -9.357623e-14 -2.061154e-09 -4.539890e-05
 [6] -6.931472e-01 -1.000005e+01 -2.000000e+01 -3.000000e+01 -4.000000e+01
[11] -5.000000e+01
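
The minus sign works because the logistic function is symmetric: 1 - plogis(x) is the same as plogis(-x). A quick check, using values of x chosen for illustration, together with the lower.tail argument of plogis as another way to ask for the complement:

```r
x <- seq(-5, 5, by = 1)
all.equal(1 - plogis(x), plogis(-x))   # symmetry of the logistic function

# The same complement via the upper tail of the logistic distribution
x_wide <- seq(-50, 50, by = 10)
all.equal(plogis(-x_wide, log.p = TRUE),
          plogis(x_wide, lower.tail = FALSE, log.p = TRUE))
```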

If we don't have logit_p, a couple of workarounds are available using standard R functions:

A: We can use the expm1 function, which computes exp(x) - 1 accurately for x close to zero. You can see how that would work: we want log(1 - exp(log_p)), and can rearrange that to log( -(exp(log_p) - 1)), which is coded as log( -expm1(log_p)). That should work well when log_p is close to zero and p is close to 1.

A <- log( -expm1(log_p))

B: Or we can use the function log1p that we saw above. This calculates log(1 + x) for x close to zero, so we just set x = -exp(log_p) and use log1p( -exp(log_p)). That should be good when exp(log_p) = p is close to zero.

B <- log1p( -exp(log_p))

Let's compare the output from A and B with the value from reversing the logit:

cbind(logit_p, A, log_1mp, B)
      logit_p             A       log_1mp             B
 [1,]     -50  0.000000e+00 -1.928750e-22 -1.928750e-22
 [2,]     -40  0.000000e+00 -4.248354e-18 -4.248354e-18
 [3,]     -30 -9.359180e-14 -9.357623e-14 -9.357623e-14
 [4,]     -20 -2.061154e-09 -2.061154e-09 -2.061154e-09
 [5,]     -10 -4.539890e-05 -4.539890e-05 -4.539890e-05
 [6,]       0 -6.931472e-01 -6.931472e-01 -6.931472e-01
 [7,]      10 -1.000005e+01 -1.000005e+01 -1.000005e+01
 [8,]      20 -2.000000e+01 -2.000000e+01 -2.000000e+01
 [9,]      30 -3.000000e+01 -3.000000e+01 -2.999983e+01
[10,]      40 -4.000000e+01 -4.000000e+01          -Inf
[11,]      50 -5.000000e+01 -5.000000e+01          -Inf

As we expected, A works well when p is close to 1, B when p is close to zero. In the middle of the range, they are equally good, but using the wrong one near 1 or 0 is disastrous. So our general purpose function will include a check on the size of log_p:

log1mExp <- function(log_p)
  ifelse(log_p > -0.693, log(-expm1(log_p)), log1p(-exp(log_p)))

all.equal(log_1mp, log1mExp(log_p))
[1] TRUE

The value -0.693 is approximately log(0.5), but calculating logs is expensive in computer time, so it's more efficient to just insert the value here, especially as it does not need to be exact.
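
As an illustration of where this is useful, consider the probability of at least one success in K independent trials, 1 - (1 - p)^K, when p is tiny. The scenario and the values of p_tiny and K are made up for this sketch; the function is repeated so the block runs on its own:

```r
log1mExp <- function(log_p)
  ifelse(log_p > -0.693, log(-expm1(log_p)), log1p(-exp(log_p)))

p_tiny <- 1e-20
K <- 5
log(1 - (1 - p_tiny)^K)                  # naive: 1 - p rounds to 1, so -Inf
log_none <- K * log1mExp(log(p_tiny))    # log of (1 - p)^K
( log_any <- log1mExp(log_none) )        # close to log(K * p_tiny)
log(K * p_tiny)                          # for comparison
```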

Hat tip: Martin Mächler and the Rmpfr package vignette.


Updated 4 August 2017 by Mike Meredith