# Taming Floating Point Error

2020-05-05

If you've been a software engineer for long enough, it is very likely that you've seen this example of floating point perfidy:

>>> 3.0 / 10
0.3
>>> 0.1 * 3
0.30000000000000004


We understand that this is due to the fact that floating point numbers, stored with only 64 bits of precision, cannot represent the entire real number line. Moreover, when we perform operations with these floating point numbers, the errors inherent in their representation can accumulate and multiply. The moral of the story is, never use a floating point number to represent money.

At least, that is the moral for financial applications. At Square, we used a long amount_cents, and we got along with our lives. However, for most applications that have a good reason to use floating point, this can't be the end of the story. If floating point were the unpredictable, unreliable thing that I once believed it to be, we wouldn't be able to numerically solve differential equations, or linear systems, or land on the moon. Rather, there is a science of floating point error, forming part of a science of numerical errors in general, which seeks to tame and understand what happens to errors as they flow through our calculations. In general, numerical error is something that can be rather precisely quantified, as we'll see. Along the way we'll look at the Fundamental Axiom of Floating Point Arithmetic, which, at the very least, sounds way cooler than "10 things every developer should know about floating point numbers".

## Machine Epsilon

Fundamentally, error in floating point is due to the problem of "roundoff". Just as in base 10, we cannot represent the number $$1/3$$ without rounding it off somewhere:

\begin{equation*} \frac{1}{3} = 0.33333333333333 \dots \approx 0.33333333, \end{equation*}

in base 2, we cannot represent many numbers without rounding. Of course, some numbers we can represent exactly. The number 1, for example. Or any integer in the range $$(-2^{53}, 2^{53})$$. Also notably, many fractions can be exactly represented:

\begin{align*} \frac{1}{2} &= 0.1_2 \\\ \frac{3}{4} &= 0.11_2 \\\ \frac{17}{8} &= 10.001_2 \\\ &\vdots \end{align*}

However, a number like $$1/10$$, just like $$1/3$$ in base 10, must be truncated to fit in the 24 or 53 bits of the mantissa. When we enter 0.1 in a console or in source code, the value that is actually stored is very slightly different than 0.1. According to this excellent IEEE-754 Floating Point Converter, the floating point number that is actually stored (for 64-bit floating point) is

\begin{align*} &(0.0001100110011001100110011001100110011001100110011001101)_2 = \\\ &(0.1000000000000000055511151231257827021181583404541015625)_{10} \end{align*}

So the initial input to our calculation was flawed! We weren't calculating 0.1 * 3, we were actually calculating

0.1000000000000000055511151231257827021181583404541015625 * 3


How much of an error is this? We can get an idea by counting the zeros in between the significant digits, $$0.100\dots 00055$$. In this case, there are 16. So in simply entering a number which is not representable exactly in floating point, we have incurred a relative error of roughly $$10^{-16}$$.

Indeed, in all cases we can expect to incur a relative error of roughly $$10^{-16}$$. This magnitude is called machine epsilon, often written $$\epsilon_{\text{machine}}$$. It comes from the relative difference between two successive floating point numbers. For every representable floating point number $$x$$, there is a next floating point number, and it is approximately $$x + \epsilon_{\text{machine}} x$$. So for an arbitrary real number $$x_0$$, it falls between two floating point values $$x$$ and $$x + \epsilon x$$ (leaving off the subscript of $$\epsilon$$ for conciseness). When we represent $$x_0$$ in floating point, we will get one of these two values. Let's denote the floating point representation of $$x_0$$ by $$\text{fl}(x_0)$$. The absolute error incurred just by representing $$x_0$$ in floating point is

\begin{equation*} e_{\text{abs}} = |\text{fl}(x_0) - x_0| \leq \max(|x_0 - x|, |x_0 - (x + \epsilon x)|) \leq |\epsilon x|. \end{equation*}

The relative error, then, i.e. the absolute error divided by the true value, is

\begin{equation*} e_{\text{rel}} = \frac{e_{\text{abs}}}{x_0} \leq \frac{|\epsilon x|}{x_0} \approx \epsilon. \end{equation*}

Cool! So we've seen that the worst we can do, in relative terms, when representing a floating point number, is approximately $$10^{-16}$$. This is, for almost all practical purposes, very good. Because remember, we're speaking of a relative error. That means we're able to represent even very small values, very accurately. Here's the nearest floating point number to $$10^{-20}$$:

\begin{equation*} 0.000000000000000000009999999999999999451532\dots \end{equation*}

If you're curious, I invite you to count the 9's. There are 16 of them. Even when dealing with extremely small numbers, we maintain the same relative precision.

## The Fundamental Axiom of Floating Point Arithmetic

Now you might be thinking, wait! It's all good and well that we can get excellent relative accuracy when representing floating point numbers, but what about when we go to do something with them? Here we've got two floating point numbers, both of which are inexact, and we're about to multiply them! Who knows what might happen?

This concern is well-founded, because the algorithms of floating point arithmetic must be implemented with finite precision. If we are asked to multiply two large numbers with pen and paper, the algorithm that most of us will use is the one we learned in school, which involves lots of addition, carrying, sub-multiplications, and so on. If all of those intermediate steps are using some kind of binary representation, then the intermediate products may be losing precision as we go along! This seems like a recipe for disaster. Fortunately, there is a property that we can require of a floating point implementation, one which is satisfied by IEEE-754 and most other popular floating point standards, that will save us from total anarchy. This is what Trefethen and Bau, Numerical Linear Algebra, refer to as the Fundamental Axiom of Floating Point Arithmetic:

All floating point arithmetic operations are exact up to a relative error of $$\epsilon_{\text{machine}}$$.

This means that for any two floating point numbers, say $$x$$ and $$y$$, any operation involving them will give a floating point result which is within a factor of $$1 + \epsilon_{\text{machine}}$$ of the true result. This is easiest to understand if we use a special notation to represent the floating point version of, say, $$+$$. Let's write $$\oplus$$ to denote floating point addition, and $$+$$ to denote exact addition. Then the Fundamental Axiom tells us,

\begin{equation*} \frac{|(x \bigoplus y) - (x + y)|}{x + y} \leq \epsilon_{\text{machine}}. \end{equation*}

Remember, $$x$$ and $$y$$ are exactly representable as floating point numbers, but of course they are also, mathematically, just real numbers. So $$x + y$$ is a real number (point on the number line), which may not be exactly representable in floating point. The Fundamental Axiom is telling us that the floating point operation $$\oplus$$ can do no worse than simply trying to represent the number $$x + y$$ as a floating point value directly, assuming our computer had access to the mathematical, infinite precision object $$x + y$$.

Put another way, we incur no extra error by going through floating point arithmetic than we would by using a magical computer to do exact arithmetic on our floating point values and then casting back to floating point. In mathematical terms,

\begin{equation*} \frac{\left|(\text{fl}(a) \oplus \text{fl}(b)) - (a + b)\right|}{a + b} \approx \frac{\left|\text{fl}(\text{fl}(a) + \text{fl}(b)) - (a + b)\right|}{a + b} \approx \epsilon_{\text{machine}}. \end{equation*}

Recall that $$\text{fl}(a)$$ is the mathematical number we actually represent when we try to represent $$a$$ in floating point. So the first fraction gives the relative error from performing $$\oplus$$ on the floating point representations of $$a$$ and $$b$$, and the second fraction gives the relative error from performing the exact arithmetic operation, $$+$$, on the floating point representations of $$a$$ and $$b$$, then casting the result to floating point.

## Error Analysis

The Fundamental Axiom of Floating Point Arithmetic allows us to analyze the numerical error that may be incurred by even complex arithmetic operations. To see an idea of how this works, let's consider the problem of computing the length of a 2D vector, $$[x, y]$$. Mathematically, this has the form

\begin{equation*} d = \sqrt{x^2 + y^2}. \end{equation*}

There are several stages to the computation, and at each one we will incur a little bit of numerical error:

• Enter the numbers $$x$$ and $$y$$ into the computer, incurring roundoff error.
• Compute $$x^2$$ and $$y^2$$, using floating point multiplication.
• Compute $$x^2 + y^2$$, using floating point addition.
• Compute $$\sqrt{x^2 + y^2}$$, using the floating point square root operation.

At each step, the result we get will be equal to the true result, times some error factor $$(1 + \epsilon)$$, where each $$\epsilon$$ is very small, on the order of $$\epsilon_\text{machine}$$. Each operation may have a different error $$\epsilon$$, but we'll use the same symbol for all of them. We don't care so much about the exact value of $$\epsilon$$, only that it is very small.

We're going to carry these $$\epsilon$$ through the computation to see how they affect the final result. To make that easier, we can use special rules of arithmetic to manipulate $$\epsilon$$:

• $$\epsilon^2 = 0$$. If $$\epsilon \approx 10^{-16}$$, then $$\epsilon^2 \approx 10^{-32}$$, which is so small that we just decide to completely ignore it.
• $$(1 + \epsilon)^2 = 1 + 2\epsilon + \epsilon^2 = 1 + 2\epsilon$$.
• $$\sqrt{1 + \epsilon} = 1 + \frac{\epsilon}{2} - \frac{\epsilon^2}{8} + \cdots = 1 + \frac{\epsilon}{2}$$. This comes from doing a Taylor expansion of $$\sqrt{x}$$ around the point 1. Another way to look at it is that, when $$x$$ is very close to 1, $$\sqrt{x} \approx x$$, and the slope of the graph of $$\sqrt{x}$$ at $$x = 1$$ is $$1/2$$.

With these rules, we're ready to do the error calculation. Red expressions are the new error factor for each calculation, which appear in blue in the next expression down:

\begin{align*} \text{fl}(x) &= {\color{red}{(1 + \epsilon)}} x \\\ \text{fl}(y) &= {\color{red}{(1 + \epsilon)}} y \\\ \\\ \text{fl}(x) \otimes \text{fl}(x) &= (1 + \epsilon) \left[ {\color{blue}{(1 + \epsilon)}}^2 x^2 \right] = (1 + \epsilon)^3 x^2 \\\ &= (1 + 3\epsilon + 3\epsilon^2 + \epsilon^3) x^2 \\\ &= {\color{red}{(1 + 3\epsilon)}} x^2 \\\ \text{fl}(y) \otimes \text{fl}(y) &= (1 + \epsilon) \left[ {\color{blue}{(1 + \epsilon)}}^2 y^2 \right] = (1 + \epsilon)^3 y^2 \\\ &= (1 + 3\epsilon + 3\epsilon^2 + \epsilon^3) y^2 \\\ &= {\color{red}{(1 + 3\epsilon)}} y^2 \\\ \\\ [\text{fl}(x) \otimes \text{fl}(x)] \oplus [\text{fl}(y) \otimes \text{fl}(y)] &= (1 + \epsilon) \left[ {\color{blue}{(1 + 3\epsilon)}} x^2 + {\color{blue}{(1 + 3\epsilon)}} y^2 \right] \\\ &= (1 + \epsilon + 3\epsilon + 3\epsilon^2) [ x^2 + y^2 ] \\\ &= {\color{red}{(1 + 4\epsilon)}}(x^2 + y^2) \\\ \\\ \sqrt[\text{fl}]{ \text{fl}(x) \otimes \text{fl}(x) \oplus \text{fl}(y) \otimes \text{fl}(y)} &= (1 + \epsilon) \sqrt{{\color{blue}{(1 + 4\epsilon)}}(x^2 + y^2)} \\\ &= (1 + \epsilon)\sqrt{(1 + 4\epsilon)} \sqrt{x^2 + y^2} \\\ &= (1 + \epsilon)(1 + 2\epsilon) \sqrt{x^2 + y^2} \\\ &= {\color{red}{(1 + 3\epsilon)}} \sqrt{x^2 + y^2} \end{align*}

Phew! After carrying our error factors through all steps of the calculation, we find that the total numerical error is at most 3 times the machine precision. Keep in mind, this is an upper bound. We can't really say anything about what the actual error will be, only that it is not greater than $$3 \epsilon_{\text{machine}}$$.

## A Cautionary Tale

The error analysis we did above was pretty carefree with manipulating the magical error value $$\epsilon$$. We didn't have to be too careful, because multiplication and addition of positive values (like $$x^2$$ and $$y^2$$) is safe, using those rules. However, when we're considering subtraction, we have to be much more careful. Suppose $$X$$ and $$Y$$ are very large numbers, but that they are very close together: $$X - Y$$ is small. Let's try to compute $$X - Y$$ in floating point, and see where we get. Now we might be tempted to write,

\begin{align*} \text{fl}(X) \ominus \text{fl}(Y) &= (1 + \epsilon) [ (1 + \epsilon) (X) - (1 + \epsilon) (Y) ] \\\ &= (1 + \epsilon)^2 (X - Y) \\\ &= (1 + 2\epsilon) (X - Y). \end{align*}

However, we can't pull out a factor of $$1 + \epsilon$$ from the subtraction! We have to remember that we are only pretending that $$\epsilon$$ is the same across these operations. In fact, to be more precise, we should write

\begin{align*} \text{fl}(X) \ominus \text{fl}(Y) &= (1 + \epsilon_\ominus) [ (1 + \epsilon_X) (X) - (1 + \epsilon_Y) (Y) ] \\\ \end{align*}

to distinguish between the error values we get at different steps. Now, consider what happens if $$\epsilon_Y = 0$$, but $$\epsilon_X \neq 0$$. Well, then,

\begin{align*} (1 + \epsilon_X) (X) - (1 + \epsilon_Y) (Y) = X + \epsilon_X X - Y = (X - Y) + \epsilon_X X. \end{align*}

We supposed above that $$X - Y$$ was small, but $$X$$ was large. Now we have shown that the error incurred by subtracting the floating point representations of $$X$$ and $$Y$$ is of the order $$\epsilon_\text{machine} X$$, which may be much larger than $$\epsilon_\text{machine}$$! Here's a Python snippet to illustrate:

>>> (1e15 + 0.1) - 1e15
0.125


We should have gotten 0.1, but ended up getting 0.125, an error of 25%! This effect is known as "catastrophic cancellation", and numerical analysts are very careful to design their algorithms so as to avoid it.

## Learning More

The Fundamental Axiom of Floating Point Arithmetic gives us a starting point to analyze numerical methods, but it is by no means the whole story. The very best numerical algorithms have a magical property called "backwards stability", which essentially ensures that they exactly solve a problem which is almost the correct problem--the problem they solve exactly is within machine precision of the problem you meant to solve. These algorithms are far more accurate than they have any right to be. Once you start to use the algorithms of numerical linear algebra to solve differential equations that evolve in time, there is a whole science of ensuring that the roundoff error incurred at each time step does not blow up exponentially. With these tools, rather than be intimidated by numerical error, we can anticipate and control it. While seeing your calculation give almost exactly the right answer is a good feeling, there is something even more satisfying about computing the error you got, and verifying that it, too, is exactly as large as expected.

Everything I know about this stuff, I learned from the excellent book Numerical Linear Algebra, by Lloyd Trefethen and David Bau, as well as the inimitable Anne Greenbaum of the University of Washington.