There are limits to the size and precision of numbers which computers can
handle, and they can cause problems when we try to calculate
probabilities, and especially likelihoods. Positive numbers smaller than
about 10^{-308} become zero and
numbers greater than 0.9999999999999999 become 1. Although not
important in the real world, these rounding errors mean that
there are large ranges of parameter values that have likelihoods
of 0 or 1. Our algorithms to find maximum likelihoods or
generate MCMC chains fail when they crash into the 1 cliff or
fall into the 0 abyss. The solution is to work
with logarithms of probabilities instead of the actual values [log(p)
instead of p]; e.g., we routinely work with log-likelihoods.
Multiplying probabilities is then simply a matter of adding up
the logs. But sometimes we need to add up probabilities or
calculate the complement, 1 - p, and we need to do that without
falling into the 0 abyss or smashing into the 1 wall.
Floating point numbers
In R, virtually all numbers are stored as floating point
numbers, which use the same idea as scientific notation: a
number such as 2017 will be stored as 2.017e3. The "2.017" part
is the significand, the "3" is the exponent. It's a bit more
complicated than that, because numbers are stored in binary
form, not decimal, but the same idea is used. On most computers,
53 bits are used for the significand, which corresponds to about
16 decimal digits, and 11 for the exponent, which allows decimal
exponents between about -308 and +308.
Because of the limits on the exponent, values outside the range ±1e308 are
represented in R by ±Inf, while values within the
range ±1e-308 become exactly 0. And the limits on the
significand mean that numbers between 0.9999999999999999 and
1.0000000000000001 are rounded to exactly 1. Those are
approximate figures; you can check the actual
values for your own computer, and the help page explaining their
meanings, with:
.Machine
?.Machine
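As a quick check of those limits, here is a minimal sketch using only base R. The variable names (overflow, underflow, near_one) are just for illustration, and the test values were chosen to be well past the approximate thresholds quoted above.

```r
# Overflow: doubling the largest representable double gives Inf
overflow <- .Machine$double.xmax * 2
overflow == Inf          # TRUE

# Underflow: the true product is 1e-400, too small to represent
underflow <- 1e-200 * 1e-200
underflow == 0           # TRUE

# Limited significand: a difference of 1e-17 is lost next to 1
near_one <- c(1 - 1e-17, 1 + 1e-17)
near_one == 1            # TRUE TRUE
```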
Some example probabilities
We need some probabilities to play with, and we'll generate a
wide range of values using the plogis function. In practice, we
often model probabilities on the logit scale and then convert to
probabilities when we need to.
logit_p <- seq(-50, 50, by=10)
( p <- plogis(logit_p) )
 [1] 1.928750e-22 4.248354e-18 9.357623e-14 2.061154e-09 4.539787e-05
 [6] 5.000000e-01 9.999546e-01 1.000000e+00 1.000000e+00 1.000000e+00
[11] 1.000000e+00
The last 4 values appear to be 1, but that may just be due to
rounding when printed. We can check with:
p == 1
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
The printout is rounded to 1 for the 8th and 9th values, but
the last 2 really are 1. Values in the range 1
± 1e-16 are rounded internally to 1; see
.Machine$double.eps
and .Machine$double.neg.eps for the exact range for
your machine. Values above 36.7 on the logit scale all give
probabilities equal to 1. Biologically, this is not important:
0.9999999999999999 is close enough to 1 for all practical
purposes. But it causes problems when trying to estimate maximum
likelihoods, as this introduces an artificial plateau in the
likelihood surface, a region where the likelihood is completely
flat with no gradient. Optimising functions and MCMC samplers
can get stuck on these plateaux, with nothing to indicate which
way to move.
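To make the plateau concrete, here is a small sketch. The logit values are arbitrary choices above the threshold mentioned in the text, and the single "failure" observation is a made-up illustration, not data from any real model.

```r
# Logit-scale parameter values above the ~36.7 threshold
logit_p <- c(38, 40, 45, 50)
p <- plogis(logit_p)

p == 1        # all TRUE: every value has been rounded to exactly 1
diff(p)       # all 0: moving along the parameter changes nothing

# Naive log-likelihood of one "failure" (y = 0) at each value of p
log(1 - p)    # -Inf everywhere, so there is no gradient to follow
```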
Using log probabilities
Let's see what happens if we simply take the log of our probabilities:
log(p)
 [1] -5.000000e+01 -4.000000e+01 -3.000000e+01 -2.000000e+01 -1.000005e+01
 [6] -6.931472e-01 -4.539890e-05 -2.061154e-09 -9.348078e-14  0.000000e+00
[11]  0.000000e+00
Er, no, that doesn't work: the last 2
values are now exactly 0. Most R functions which use
probabilities have a log.p or log
argument; we need to put log.p = TRUE in
the call to plogis:
( log_p <- plogis(logit_p, log.p=TRUE) )
 [1] -5.000000e+01 -4.000000e+01 -3.000000e+01 -2.000000e+01 -1.000005e+01
 [6] -6.931472e-01 -4.539890e-05 -2.061154e-09 -9.357623e-14 -4.248354e-18
[11] -1.928750e-22
Now we can work with probabilities from very near 0 to very
near 1 without problems. In many applications, we need to
multiply probabilities, and that just means that we add up the
logs instead. But adding and subtracting probabilities still
needs care.
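Here is a minimal sketch of why adding logs is safer than multiplying probabilities. The vector of 100 identical probabilities is invented purely for illustration, as are the names lik and loglik.

```r
# 100 independent events, each with probability 1e-5 (made-up values)
p <- rep(1e-5, 100)

# The product should be 1e-500, which underflows to exactly 0
lik <- prod(p)
lik

# The sum of the logs is an ordinary number, about -1151.29
loglik <- sum(log(p))
loglik
```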
Adding probabilities
This generally becomes a problem if you have a long vector of
probabilities, but we need a short vector to be able to explore
the options. This one will do:
log_p <- -745:-760
exp(log_p) == 0
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE TRUE TRUE TRUE TRUE
All except the first value in this vector turn into a zero
when converted back to a probability, so we can't just use
sum(exp(log_p)).
A simple solution
A simple workaround is to scale all the probabilities so that
the biggest = 1, then scale back again after the addition. The
scaling can easily be done by subtracting the biggest
log_p value from all of them:
exp(log_p - max(log_p))
 [1] 1.000000e+00 3.678794e-01 1.353353e-01 4.978707e-02 1.831564e-02
 [6] 6.737947e-03 2.478752e-03 9.118820e-04 3.354626e-04 1.234098e-04
[11] 4.539993e-05 1.670170e-05 6.144212e-06 2.260329e-06 8.315287e-07
[16] 3.059023e-07
Now we can add up the values, convert back to a log, and add
back the biggest log_p value:
sum(exp(log_p - max(log_p)))
[1] 1.581977
log(sum(exp(log_p - max(log_p))))
[1] 0.458675
log(sum(exp(log_p - max(log_p)))) + max(log_p)
[1] -744.5413
A better version
We can still get into trouble if the smaller
values add up to less than 2.2e-16 (or your value for .Machine$double.eps),
as the total then will just be 1. The R function log1p is
designed to give the correct value for log(1 + x) for very small
values of x. The scaling is the same as before, but this time we
leave out the largest value (which is 1) when we do the
summation, then use log1p to get the correct log, and finally
scale back again:
( p_max <- which.max(log_p) ) # find which value is the largest
[1] 1
# use negative index to exclude the largest (which is 1)
exp(log_p[-p_max] - max(log_p))
 [1] 3.678794e-01 1.353353e-01 4.978707e-02 1.831564e-02 6.737947e-03
 [6] 2.478752e-03 9.118820e-04 3.354626e-04 1.234098e-04 4.539993e-05
[11] 1.670170e-05 6.144212e-06 2.260329e-06 8.315287e-07 3.059023e-07
sum(exp(log_p[-p_max] - max(log_p)))
[1] 0.5819765
log1p(sum(exp(log_p[-p_max] - max(log_p)))) # log1p puts back the 1
[1] 0.458675
log1p(sum(exp(log_p[-p_max] - max(log_p)))) + max(log_p)
[1] -744.5413
In this example, the small values add up to 0.58, so we don't
really need the better version, but it does produce a more
robust general purpose routine, which we can put into a function
like this:
logSumExp <- function(log_p) {
  p_max <- which.max(log_p)
  log1p(sum(exp(log_p[-p_max] - max(log_p)))) + max(log_p)
}
logSumExp(log_p)
[1] -744.5413
Hat tip: Bill Dunlap and Spencer at
the R help forum.
Update: This won't work if all the probabilities being
added up are zero! See new post.
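One possible guard for the all-zero case is sketched below. This is an assumption on my part, not necessarily the fix described in the later post: logSumExpSafe is a hypothetical name, and it simply returns the maximum when that maximum is not finite, since subtracting -Inf from -Inf would otherwise give NaN.

```r
# Hypothetical guarded variant of the log-sum-exp function
logSumExpSafe <- function(log_p) {
  m <- max(log_p)
  # If every probability is zero, all logs are -Inf: return -Inf directly
  if (!is.finite(m)) return(m)
  i <- which.max(log_p)
  # Scale by the maximum, sum the rest, and let log1p put back the 1
  log1p(sum(exp(log_p[-i] - m))) + m
}

logSumExpSafe(c(-Inf, -Inf))   # -Inf rather than NaN
logSumExpSafe(-745:-760)       # about -744.5413

# A tail far below double.eps survives with log1p ...
logSumExpSafe(c(0, -40))       # about 4.248354e-18
# ... but is lost if we add it to 1 before taking the log
log(sum(exp(c(0, -40))))       # 0
```

The last two lines also show the case where the log1p version genuinely matters: the smaller term is about 4e-18, far too small to change 1 when added to it.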
Subtracting probabilities
...or to be precise, calculating 1 - p. This is
problematic when either p or 1 - p is close to 1. We'll use the
original example data for this:
logit_p <- seq(-50, 50, by=10)
( log_p <- plogis(logit_p, log.p=TRUE) )
 [1] -5.000000e+01 -4.000000e+01 -3.000000e+01 -2.000000e+01 -1.000005e+01
 [6] -6.931472e-01 -4.539890e-05 -2.061154e-09 -9.357623e-14 -4.248354e-18
[11] -1.928750e-22
If we already know logit_p, it's easy to get 1 - p
or log(1 - p): simply put a minus sign in front of
logit_p:
( log_1mp <- plogis(-logit_p, log.p=TRUE) )
 [1] -1.928750e-22 -4.248354e-18 -9.357623e-14 -2.061154e-09 -4.539890e-05
 [6] -6.931472e-01 -1.000005e+01 -2.000000e+01 -3.000000e+01 -4.000000e+01
[11] -5.000000e+01
If we don't have logit_p, a couple of
workarounds are available using standard R functions:
A: We can use the expm1 function, which
computes exp(x) - 1 accurately for x close to zero. You can see
how that would work: we want log(1 - exp(log_p)), and can
rearrange that to log(-(exp(log_p) - 1)), which is coded as
log(-expm1(log_p)). That should work well when
log_p is close to zero and p is close to 1.
A <- log(-expm1(log_p))
B: Or we can use the function log1p that we saw
above. This calculates log(1 + x) for x close to zero, so we
just set x = -exp(log_p) and use log1p(-exp(log_p)).
That should be good when exp(log_p) = p is close to zero.
B <- log1p(-exp(log_p))
Let's compare the output from A and B with the value from
reversing the logit:
cbind(logit_p, A, log_1mp, B)
      logit_p             A       log_1mp             B
 [1,]     -50  0.000000e+00 -1.928750e-22 -1.928750e-22
 [2,]     -40  0.000000e+00 -4.248354e-18 -4.248354e-18
 [3,]     -30 -9.359180e-14 -9.357623e-14 -9.357623e-14
 [4,]     -20 -2.061154e-09 -2.061154e-09 -2.061154e-09
 [5,]     -10 -4.539890e-05 -4.539890e-05 -4.539890e-05
 [6,]       0 -6.931472e-01 -6.931472e-01 -6.931472e-01
 [7,]      10 -1.000005e+01 -1.000005e+01 -1.000005e+01
 [8,]      20 -2.000000e+01 -2.000000e+01 -2.000000e+01
 [9,]      30 -3.000000e+01 -3.000000e+01 -2.999983e+01
[10,]      40 -4.000000e+01 -4.000000e+01          -Inf
[11,]      50 -5.000000e+01 -5.000000e+01          -Inf
As we expected, A works well when p is close to 1, B when p
is close to zero. In the middle of the range, they are equally
good, but using the wrong one near 1 or 0 is disastrous. So our
general purpose function will include a check on the size of
log_p:
log1mExp <- function(log_p)
  ifelse(log_p > -0.693, log(-expm1(log_p)), log1p(-exp(log_p)))
all.equal(log_1mp, log1mExp(log_p))
[1] TRUE
The value -0.693 used is actually log(0.5), but calculating
logs is expensive in computer time, so it's more efficient to
just insert the value here, especially as it does not need to be
exact.
Hat tip: Martin Mächler and the
Rmpfr package vignette.
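To see what the function buys us, here is a short sketch comparing it with the naive calculation log(1 - exp(log_p)) at the two extremes. The two test values and the names naive and safe are chosen only for illustration; the function itself is repeated so the snippet runs on its own.

```r
# log(1 - p) from log(p), switching method at log(0.5) as in the text
log1mExp <- function(log_p)
  ifelse(log_p > -0.693, log(-expm1(log_p)), log1p(-exp(log_p)))

# One value with p extremely close to 1, one with p extremely close to 0
log_p <- c(-1e-20, -50)

# Naive version: exp(-1e-20) rounds to 1, and 1 - exp(-50) rounds to 1
naive <- log(1 - exp(log_p))
naive            # -Inf and 0: both answers have lost all information

safe <- log1mExp(log_p)
safe             # about -46.05 (that is, log(1e-20)) and about -1.93e-22
```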
