0 Computer Arithmetic

Integer

Integers rarely add complexity to a computation; we mainly use them to index arrays. As the size of data grows, we need larger indices to keep track of it. For example, a 32-bit integer can address $2^{32} \approx 4 \times 10^9$ bytes $\approx 4$ GB of memory, which is why modern operating systems use 64-bit integers to index larger memories.
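A quick sketch to sanity-check the addressing arithmetic above:

```python
# A 32-bit index can name 2**32 distinct byte addresses.
max_addressable_bytes = 2**32          # one address per byte
gib = max_addressable_bytes / 2**30    # convert to GiB

print(max_addressable_bytes)  # 4294967296
print(gib)                    # 4.0
```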

An unsigned integer with a given number of bits covers the range $[0, 2^{\text{bits}} - 1]$. To satisfy the need to store negative numbers, we need some extra information. One way is to spend the first bit as a sign bit. This implementation is easy to understand but has a few flaws: two zeros ($+0$ and $-0$), and addition and greater-than comparison need special handling. Another approach is a biased representation, where the stored bits represent $number - base$: this gives a single representation of 0 and keeps the bit patterns well ordered, but $n - n$ does not produce an all-zero bitstring. The system actually used, 2's complement, instead rotates the upper half of the unsigned number line down to the negatives.
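The three schemes above can be contrasted with 8-bit patterns. This is a sketch; the helper names are illustrative, not a real API:

```python
def sign_magnitude(bits: int) -> int:
    """First bit is the sign; remaining 7 bits are the magnitude."""
    sign = -1 if bits & 0x80 else 1
    return sign * (bits & 0x7F)

def biased(bits: int, base: int = 128) -> int:
    """Value is the stored number minus a fixed base (excess-128)."""
    return bits - base

def twos_complement(bits: int) -> int:
    """Rotate the upper half of the unsigned range down to the negatives."""
    return bits - 256 if bits & 0x80 else bits

# Sign-magnitude has two zeros: 0b00000000 and 0b10000000.
assert sign_magnitude(0b00000000) == 0 and sign_magnitude(0b10000000) == 0
# Biased has a single zero, but it is not the all-zero bitstring.
assert biased(0b10000000) == 0
# Two's complement: a single zero, and it IS the all-zero bitstring.
assert twos_complement(0b00000000) == 0
assert twos_complement(0b11111111) == -1
```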

Floating Point

Real numbers can only be represented approximately, since there is only a finite number of bits; we must truncate the number somewhere. The resulting cutoff (rounding error) is one of the characteristic features of floating-point representation.

Here is the 32-bit single-precision layout from Wikipedia: 1 sign bit, 8 exponent bits (biased by 127), and 23 fraction bits.
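The layout can be inspected directly by reinterpreting a float's bytes. A minimal sketch using only the standard library; 0.15625 is the example value used on Wikipedia:

```python
import struct

def float_bits(x: float):
    """Split a 32-bit single-precision float into (sign, exponent, fraction)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF    # biased by 127
    fraction = bits & 0x7FFFFF        # 23 fraction bits
    return sign, exponent, fraction

# 1.0 = (-1)^0 * 1.0 * 2^(127 - 127)
assert float_bits(1.0) == (0, 127, 0)
# 0.15625 = 1.01b * 2^-3, so exponent = 127 - 3 = 124, fraction = 0b01 << 21
assert float_bits(0.15625) == (0, 124, 0b01 << 21)
```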

Error

Absolute Error $e = \hat{Q} - Q$

  • need units/context to be meaningful

Relative Error $\epsilon = \frac{\hat{Q} - Q}{Q}$

  • has no units

  • what counts as an acceptable error depends on the application
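A small sketch of the two error measures defined above (the values of $Q$ and $\hat{Q}$ are made up for illustration):

```python
Q = 1000.0       # true value
Q_hat = 1001.0   # approximation

abs_error = Q_hat - Q          # needs units (e.g. meters) to be meaningful
rel_error = (Q_hat - Q) / Q    # unitless: 0.1% error regardless of units

print(abs_error)  # 1.0
print(rel_error)  # 0.001
```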

Relative Rounding Error $\epsilon_{\text{machine}} = \frac{\text{Round}(x) - x}{x}$, bounded in magnitude by $2^{-p}$ for a $p$-bit significand
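This bound can be observed directly. Python floats are IEEE 754 doubles with $p = 53$ significand bits, so the gap between 1.0 and the next representable float is $2^{-52}$, and a perturbation of $2^{-53}$ or less rounds away:

```python
import sys

# sys.float_info.epsilon is the gap between 1.0 and the next float.
assert sys.float_info.epsilon == 2**-52

assert 1.0 + 2**-53 == 1.0    # too small: rounds back down to 1.0
assert 1.0 + 2**-52 > 1.0     # one representable step above 1.0
```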
