Number Representations

Courses

Errors

There are many different sources of errors which can affect numerical methods, such as

Errors in data
Round-off errors
Truncation errors

Definition: Absolute and Relative Errors

Let $\tilde{a}$ be an approximation to $a$, then the absolute error is given by $\left| {\tilde{a}-a} \right|$.

If ${\left| a \ne 0\right|}$, the relative error is given by $\left| \dfrac{\tilde{a}-a}{a} \right|$ and the error bound is the magnitude of the admissible error.

Theorem:

For both addition and subtraction the bounds for the absolute error are added.

In division and multiplication the bounds for the relative errors are added.

Definition: Linear Sensitivity to Uncertainties

If $y(x)$ is a smooth function, i.e. is differentiable, then $\left| y^{\prime} \right|$ can be interpreted as the linear sensitivity of $y(x)$ to uncertainties in $x$.

For functions of several variables, i.e. $f : \mathbb{R}^n \rightarrow \mathbb{R}$, then

$$ \left| \Delta y \right| \le \sum\limits_{i=1}^{n}\left| \dfrac{\partial y}{\partial x_i} \right| \left| \Delta x_i \right| $$

where $\left| \Delta x_i \right| = \left| \tilde{x}_i - x_i\right|$ for an approximation $\tilde{x}_i$, thus ${\left| \Delta y_i \right| = \left| \tilde{y}_i - y_i \right|= \left| f \left( \tilde{x}_i \right) - \left( x_i \right) \right|}$.

Example:

An .ipynb notebook with an example of errors and number representations can be accessed online [here].

It can be downloaded from [here] as a python file or downloaded as a notebook from [here].

Definition: Base Representation

Let $b \in \mathbb{N} \backslash \lbrace 1 \rbrace$. Every integer $x \in \mathbb{N}_0$ can be written as a unique expansion with respect to base $b$ as

$$ \begin{equation*} \left(x\right)_b = a_0 b^0 + a_1 b^1 + \ldots + a_n b^n = \sum \limits_{i=0}^{n} a_i b^i \end{equation*} $$

A number can be written in a nested form:

$$ \begin{align*} \left( x \right)_b & = a_0 b^0 + a_1 b^1 + \ldots + a_n b^n \\\ & = a_0 + b\left( a_1 + b\left( a_2 + b\left( a_3 + \ldots + b a_n\right) \right. \ldots \right) \end{align*} $$

with $a_i \in \mathbb{N}_0$ and $a_i < b$, i.e. ${a_i \in \lbrace 0, \ldots, b-1 \rbrace }$.

For a real number, ${x \in \mathbb{R}}$, write

$$ \begin{align*} x & = \sum \limits_{i=0}^{n} a_i b^i + \sum \limits_{i=1}^{\infty} \alpha_i b^{-i} \\\ & = a_n \ldots a_0 \cdot \alpha_1 \alpha_2 \ldots \end{align*} $$

Algorithm: Euclid

Euclid’s algorithm can convert number $x$ in base 10, i.e. $\left(x \right)_{10}$ into another base, $b$, i.e. $\left(x \right)_{b}$.

Input $\left(x \right)_{10}$
Determine the smallest integer $n$ such that ${x < b^{n+1}}$
Let $y=x$. Then for $i=n, \ldots, 0$ $$ \begin{array}{rcl} a_i & = & y \text{ div } b^i \\\ y & = & y \text{ mod } b^i \end{array} $$ which at each steps provides an $a_i$ and updates $y$.
Output as $\left(x \right)_{b} = a_n a_{n-1} \cdots a_0$

Algorithm: Horner

Input $\left(x \right)_{10}$
Set $i=0$
Let $y=x$. Then while $y>0$ $$ \begin{array}{rcl} a_i & = & y \text{ mod } b \\\ y & = & y \text{ div } b \\\ i & = & i+1 \end{array} $$ which at each steps provides an $a_i$ and updates $y$.
Output as $\left(x \right)_{b} = a_n a_{n-1} \cdots a_0$

Definition: Floating Point Representations

A number may be represented as

$$ x = \sigma \times f \times b^{t-p} $$

where $\sigma$ is the sign of the number, i.e. $\pm 1$, $f$ is the mantissa, $b$ is the base, $t$ is the shifted exponent (where $p$ is the shift required to determine the actual exponent.)

Normalized floating point representations with respect to some base $b$, may store a number $x$ as

$$ x= \pm 0 \cdot a_1 \ldots a_k \times b^n $$

where the $a_i \in \lbrace 0, 1, \ldots b-1 \rbrace$ are called the digits, $k$ is the precision and $n$ is the exponent. The set $a_1, \ldots, a_k$ is the mantissa. Impose that $a_1 \ne 0$, it makes the representation unique.

Alternative conventions may express $x= \pm a_1 \cdot a_2 \ldots a_{k-1} \times b^n$, with $a_1 \ne 0$. Normalization refers to specifying the digits before the decimal place.

Theorem:

Let $x$ and $y$ be two normalized floating point numbers with ${x > y > 0}$ and base ${b=2}$. If there exists integers $p$ and ${q \in \mathbb{N}_0}$ such that

$$ 2^{-p} \leq 1 - \dfrac{y}{x} \leq 2^{-q} $$

then, at most $p$ and at least $q$ significant bits (i.e. significant figures written in base 2) are lost during subtraction.

Last updated on Feb 12, 2025

← Taylor Series

Linear Equations →