0 Computer Arithmetic

Integer

Integers rarely add complexity to a computation; we mainly use them to index arrays. As the size of data grows, we need larger indices to keep track of it. For example, a 32-bit integer can address $2^{32} \approx 4 \times 10^9$ bytes $\approx 4$ GB of memory, which is why modern operating systems use 64-bit integers to index larger memories.
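A quick sketch to sanity-check the addressing arithmetic above:

```python
# A 32-bit index can name 2**32 distinct byte addresses.
max_addressable_bytes = 2**32          # one address per byte
gib = max_addressable_bytes / 2**30    # convert to GiB

print(max_addressable_bytes)  # 4294967296
print(gib)                    # 4.0
```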

An unsigned integer with a given number of bits covers the range $[0, 2^{\text{bits}} - 1]$. To satisfy the need to store negative numbers, we need some extra information. One way is to spend the first bit as a sign bit. This implementation is easy to understand but has a few flaws: two zeros ($+0$ and $-0$), and addition and greater-than comparison need special handling. Another approach is a biased representation, where the stored bits represent $number - base$: this gives a single representation of 0 and keeps the bit patterns well ordered, but $n - n$ does not produce an all-zero bitstring. The system actually used, 2's complement, instead rotates the upper half of the unsigned number line down to the negatives.
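The three schemes above can be contrasted with 8-bit patterns. This is a sketch; the helper names are illustrative, not a real API:

```python
def sign_magnitude(bits: int) -> int:
    """First bit is the sign; remaining 7 bits are the magnitude."""
    sign = -1 if bits & 0x80 else 1
    return sign * (bits & 0x7F)

def biased(bits: int, base: int = 128) -> int:
    """Value is the stored number minus a fixed base (excess-128)."""
    return bits - base

def twos_complement(bits: int) -> int:
    """Rotate the upper half of the unsigned range down to the negatives."""
    return bits - 256 if bits & 0x80 else bits

# Sign-magnitude has two zeros: 0b00000000 and 0b10000000.
assert sign_magnitude(0b00000000) == 0 and sign_magnitude(0b10000000) == 0
# Biased has a single zero, but it is not the all-zero bitstring.
assert biased(0b10000000) == 0
# Two's complement: a single zero, and it IS the all-zero bitstring.
assert twos_complement(0b00000000) == 0
assert twos_complement(0b11111111) == -1
```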

Floating Point

Real numbers can only be represented approximately, since there is only a finite number of bits; we must truncate the number somewhere. The resulting cutoff (rounding error) is one of the characteristic features of floating-point representation.

Here is the 32-bit single-precision layout from Wikipedia: 1 sign bit, 8 exponent bits (biased by 127), and 23 fraction bits.
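The layout can be inspected directly by reinterpreting a float's bytes. A minimal sketch using only the standard library; 0.15625 is the example value used on Wikipedia:

```python
import struct

def float_bits(x: float):
    """Split a 32-bit single-precision float into (sign, exponent, fraction)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF    # biased by 127
    fraction = bits & 0x7FFFFF        # 23 fraction bits
    return sign, exponent, fraction

# 1.0 = (-1)^0 * 1.0 * 2^(127 - 127)
assert float_bits(1.0) == (0, 127, 0)
# 0.15625 = 1.01b * 2^-3, so exponent = 127 - 3 = 124, fraction = 0b01 << 21
assert float_bits(0.15625) == (0, 124, 0b01 << 21)
```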

Error

Absolute Error $e = \hat{Q} - Q$

  • need units/context to be meaningful

Relative Error $\epsilon = \frac{\hat{Q} - Q}{Q}$

  • has no units

  • what counts as an acceptable error depends on the application
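A small sketch of the two error measures defined above (the values of $Q$ and $\hat{Q}$ are made up for illustration):

```python
Q = 1000.0       # true value
Q_hat = 1001.0   # approximation

abs_error = Q_hat - Q          # needs units (e.g. meters) to be meaningful
rel_error = (Q_hat - Q) / Q    # unitless: 0.1% error regardless of units

print(abs_error)  # 1.0
print(rel_error)  # 0.001
```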

Relative Rounding Error $\epsilon_{\text{machine}} = \frac{\text{Round}(x) - x}{x}$, bounded in magnitude by $2^{-p}$ for a $p$-bit significand
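This bound can be observed directly. Python floats are IEEE 754 doubles with $p = 53$ significand bits, so the gap between 1.0 and the next representable float is $2^{-52}$, and a perturbation of $2^{-53}$ or less rounds away:

```python
import sys

# sys.float_info.epsilon is the gap between 1.0 and the next float.
assert sys.float_info.epsilon == 2**-52

assert 1.0 + 2**-53 == 1.0    # too small: rounds back down to 1.0
assert 1.0 + 2**-52 > 1.0     # one representable step above 1.0
```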
