
The sections that follow describe the relationship between arithmetic operations and fixed-point scaling, and offer some basic recommendations that may be appropriate for your fixed-point design. For each arithmetic operation,

- The general [Slope Bias] encoding scheme described in Scaling is used.
- The scaling of the result is automatically selected based on the scaling of the two inputs. In other words, the scaling is *inherited*.

Scaling choices are based on

- Minimizing the number of arithmetic operations of the result
- Maximizing the precision of the result

Additionally, binary-point-only scaling is presented as a special case of the general encoding scheme.

In embedded systems, the scaling of variables at the hardware interface (the ADC or DAC) is fixed. However, for most other variables, the scaling is something you can choose to give the best design. When scaling fixed-point variables, it is important to remember that

- Your scaling choices depend on the particular design you are simulating.
- There is no best scaling approach. All choices have associated advantages and disadvantages. It is the goal of this section to expose these advantages and disadvantages to you.

Consider the addition of two real-world values:

$${V}_{a}={V}_{b}+{V}_{c}.$$

These values are represented by the general [Slope Bias] encoding scheme described in Scaling:

$${V}_{i}={F}_{i}{2}^{{E}_{i}}{Q}_{i}+{B}_{i}.$$

In a fixed-point system, the addition of values results in finding the variable *Q_a*:

$${Q}_{a}=\frac{{F}_{b}}{{F}_{a}}{2}^{{E}_{b}-{E}_{a}}{Q}_{b}+\frac{{F}_{c}}{{F}_{a}}{2}^{{E}_{c}-{E}_{a}}{Q}_{c}+\frac{{B}_{b}+{B}_{c}-{B}_{a}}{{F}_{a}}{2}^{-{E}_{a}}.$$

This formula shows

- In general, *Q_a* is not computed through a simple addition of *Q_b* and *Q_c*.
- In general, there are two multiplications of a constant and a variable, two additions, and some additional bit shifting.
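As a quick check of the general addition formula, the following Python sketch evaluates *Q_a* with exact rational arithmetic and confirms that decoding the result reproduces *V_b* + *V_c*. The helper names `decode` and `add_general` and the numeric scalings are illustrative assumptions, not part of any library.

```python
from fractions import Fraction

TWO = Fraction(2)

def decode(q, f, e, b):
    """Real-world value V = F * 2**E * Q + B, computed exactly."""
    return Fraction(f) * TWO ** e * q + Fraction(b)

def add_general(qb, qc, scale_b, scale_c, scale_a):
    """Exact Q_a for V_a = V_b + V_c; each scale is an (F, E, B) tuple.

    The result is a Fraction. Rounding it to an integer is a separate
    design decision that this sketch does not make.
    """
    fb, eb, bb = scale_b
    fc, ec, bc = scale_c
    fa, ea, ba = scale_a
    fa = Fraction(fa)
    term_b = Fraction(fb) / fa * TWO ** (eb - ea) * qb     # constant * variable
    term_c = Fraction(fc) / fa * TWO ** (ec - ea) * qc     # constant * variable
    offset = (Fraction(bb) + bc - ba) / fa * TWO ** (-ea)  # constant term
    return term_b + term_c + offset

# Hypothetical scalings (F, E, B), chosen only for illustration.
SCALE_B = (Fraction(3, 2), -4, 1)
SCALE_C = (Fraction(5, 4), -3, -2)
SCALE_A = (Fraction(7, 4), -2, Fraction(1, 2))

qa = add_general(37, -11, SCALE_B, SCALE_C, SCALE_A)
assert decode(qa, *SCALE_A) == decode(37, *SCALE_B) + decode(-11, *SCALE_C)
```

Because the result is generally not an integer, a real implementation also has to choose a rounding mode for *Q_a*.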

In the process of finding the scaling of the sum, one reasonable goal is to simplify the calculations. Simplifying the calculations should reduce the number of operations, thereby increasing execution speed. The following choices can help to minimize the number of arithmetic operations:

- Set *B_a* = *B_b* + *B_c*. This eliminates one addition.
- Set *F_a* = *F_b* or *F_a* = *F_c*. Either choice eliminates one of the two constant times variable multiplications.

The resulting formula is

$${Q}_{a}={2}^{{E}_{b}-{E}_{a}}{Q}_{b}+\frac{{F}_{c}}{{F}_{a}}{2}^{{E}_{c}-{E}_{a}}{Q}_{c}$$

or

$${Q}_{a}=\frac{{F}_{b}}{{F}_{a}}{2}^{{E}_{b}-{E}_{a}}{Q}_{b}+{2}^{{E}_{c}-{E}_{a}}{Q}_{c}.$$

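These equations appear to be equivalent. However, your choice of rounding and precision may make one choice stand out over the other. To further simplify matters, you could choose *E_a* = *E_b* or *E_a* = *E_c*, which reduces the corresponding power of two to one and so eliminates one of the bit shifts.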
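The reduced addition can be sketched in Python as follows, assuming the choices *B_a* = *B_b* + *B_c* and *F_a* = *F_b*. The function names and the numeric scalings are hypothetical, and exact rational arithmetic is used so that the check is free of rounding effects.

```python
from fractions import Fraction

TWO = Fraction(2)

def decode(q, f, e, b):
    """Real-world value V = F * 2**E * Q + B, computed exactly."""
    return Fraction(f) * TWO ** e * q + Fraction(b)

def add_simplified(qb, qc, eb, ec, ea, fc_over_fa):
    """Q_a when B_a = B_b + B_c and F_a = F_b.

    Only one constant-times-variable multiplication remains; the powers
    of two become bit shifts in an integer implementation.
    """
    return TWO ** (eb - ea) * qb + Fraction(fc_over_fa) * TWO ** (ec - ea) * qc

# Hypothetical scalings for illustration.
FB, EB, BB = Fraction(3, 2), -4, 1
FC, EC, BC = Fraction(5, 4), -3, -2
# F_a = F_b and B_a = B_b + B_c; E_a = E_c makes the second power of two equal 1.
FA, EA, BA = FB, EC, BB + BC

qa = add_simplified(37, -11, EB, EC, EA, FC / FA)
assert decode(qa, FA, EA, BA) == decode(37, FB, EB, BB) + decode(-11, FC, EC, BC)
```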

In the process of finding the scaling of the sum, one reasonable goal is maximum precision. You can determine the maximum-precision scaling if the range of the variable is known. Maximize Precision shows that you can determine the range of a fixed-point operation from max(*Ṽ_a*) and min(*Ṽ_a*). For addition, you can determine the range from

$$\begin{array}{l}\mathrm{min}\left({\tilde{V}}_{a}\right)=\mathrm{min}\left({\tilde{V}}_{b}\right)+\mathrm{min}\left({\tilde{V}}_{c}\right),\\ \mathrm{max}\left({\tilde{V}}_{a}\right)=\mathrm{max}\left({\tilde{V}}_{b}\right)+\mathrm{max}\left({\tilde{V}}_{c}\right).\end{array}$$

You can now derive the maximum-precision slope:

$$\begin{array}{c}{F}_{a}{2}^{{E}_{a}}=\frac{\mathrm{max}\left({\tilde{V}}_{a}\right)-\mathrm{min}\left({\tilde{V}}_{a}\right)}{{2}^{w{s}_{a}}-1}\\ =\frac{{F}_{b}{2}^{{E}_{b}}\left({2}^{w{s}_{b}}-1\right)+{F}_{c}{2}^{{E}_{c}}\left({2}^{w{s}_{c}}-1\right)}{{2}^{w{s}_{a}}-1}.\end{array}$$

In most cases the input and output word sizes are much greater than one, and the slope becomes

$${F}_{a}{2}^{{E}_{a}}\approx {F}_{b}{2}^{{E}_{b}+w{s}_{b}-w{s}_{a}}+{F}_{c}{2}^{{E}_{c}+w{s}_{c}-w{s}_{a}},$$

which depends only on the input slopes and on the sizes of the input and output words. The corresponding bias is

$${B}_{a}=\mathrm{min}\left({\tilde{V}}_{a}\right)-{F}_{a}{2}^{{E}_{a}}\mathrm{min}\left({Q}_{a}\right).$$

The value of the bias depends on whether the inputs and output are signed or unsigned numbers.

If the inputs and output are all unsigned, then the minimum values for these variables are all zero and the bias reduces to a particularly simple form:

$${B}_{a}={B}_{b}+{B}_{c}.$$

If the inputs and the output are all signed, then the bias becomes

$$\begin{array}{l}{B}_{a}\approx {B}_{b}+{B}_{c}+{F}_{b}{2}^{{E}_{b}}\left(-{2}^{w{s}_{b}-1}+{2}^{w{s}_{b}-1}\right)+{F}_{c}{2}^{{E}_{c}}\left(-{2}^{w{s}_{c}-1}+{2}^{w{s}_{c}-1}\right),\\ {B}_{a}\approx {B}_{b}+{B}_{c}.\end{array}$$
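The maximum-precision slope and bias for a sum can be evaluated numerically, as in the Python sketch below. It computes both the exact slope and the large-word-size approximation so the two can be compared. The function names and the example word sizes are assumptions made for illustration.

```python
from fractions import Fraction

def max_precision_sum_slope(slope_b, ws_b, slope_c, ws_c, ws_a):
    """Return (exact, approx) output slopes F_a * 2**E_a for a sum.

    slope_b and slope_c are the input slopes F_i * 2**E_i, and the ws_*
    arguments are word sizes in bits.
    """
    slope_b, slope_c = Fraction(slope_b), Fraction(slope_c)
    exact = (slope_b * (2 ** ws_b - 1) + slope_c * (2 ** ws_c - 1)) / (2 ** ws_a - 1)
    approx = (slope_b * Fraction(2) ** (ws_b - ws_a)
              + slope_c * Fraction(2) ** (ws_c - ws_a))
    return exact, approx

def bias(min_va, slope_a, min_qa):
    """B_a = min(V_a) - F_a * 2**E_a * min(Q_a)."""
    return Fraction(min_va) - Fraction(slope_a) * min_qa

# Hypothetical example: two unsigned 8-bit inputs summed into 8 bits.
exact, approx = max_precision_sum_slope(Fraction(1, 16), 8, Fraction(1, 4), 8, 8)

# Unsigned case: min(Q_a) = 0 and min(V_a) = B_b + B_c, here 1 + (-2).
b_a = bias(1 + (-2), exact, 0)
```

When all word sizes are equal, as in this example, the exact and approximate slopes coincide. With unequal word sizes they differ slightly.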

For binary-point-only scaling, finding *Q_a* results in this simple expression:

$${Q}_{a}={2}^{{E}_{b}-{E}_{a}}{Q}_{b}+{2}^{{E}_{c}-{E}_{a}}{Q}_{c}.$$

This scaling choice results in only one addition and some bit shifting. The avoidance of any multiplications is a big advantage of binary-point-only scaling.
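A minimal Python sketch of binary-point-only addition, using only integer shifts and one addition, is shown here. The helper names are hypothetical, and right shifts in this sketch round toward negative infinity, which is only one of several possible rounding choices.

```python
def shift(q, n):
    """Multiply integer q by 2**n using shifts.

    Right shifts round toward negative infinity.
    """
    return q << n if n >= 0 else q >> -n

def add_binary_point(qb, eb, qc, ec, ea):
    """Q_a = 2**(E_b - E_a) * Q_b + 2**(E_c - E_a) * Q_c, integers only."""
    return shift(qb, eb - ea) + shift(qc, ec - ea)

# 3.25 (Q = 13, E = -2) plus 0.3125 (Q = 5, E = -4)
fine = add_binary_point(13, -2, 5, -4, ea=-4)    # stored integer at E_a = -4
coarse = add_binary_point(13, -2, 5, -4, ea=-2)  # stored integer at E_a = -2
```

With `ea=-4` the sum is represented exactly (57 × 2⁻⁴ = 3.5625). With `ea=-2` the low bits of the second operand are shifted out and the result is 14 × 2⁻² = 3.5.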

The accumulation of values is closely associated with addition:

$${V}_{a\_new}={V}_{a\_old}+{V}_{b}.$$

Finding *Q_a_new* involves one multiplication of a constant and a variable, two additions, and some bit shifting:

$${Q}_{a\_new}={Q}_{a\_old}+\frac{{F}_{b}}{{F}_{a}}{2}^{{E}_{b}-{E}_{a}}{Q}_{b}+\frac{{B}_{b}}{{F}_{a}}{2}^{-{E}_{a}}.$$

The important difference for fixed-point implementations is that the scaling of the output is identical to the scaling of the first input.

For binary-point-only scaling, finding *Q_a_new* results in this simple expression:

$${Q}_{a\_new}={Q}_{a\_old}+{2}^{{E}_{b}-{E}_{a}}{Q}_{b}.$$

This scaling option only involves one addition and some bit shifting.
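Binary-point-only accumulation can be sketched as a loop in which the accumulator keeps its own scaling and each input is shifted to match it. The function name `accumulate` and the sample values are assumptions made for illustration.

```python
def accumulate(qa, ea, samples, eb):
    """Add each stored integer in samples (exponent eb) into qa (exponent ea).

    Implements Q_a_new = Q_a_old + 2**(E_b - E_a) * Q_b with one shift
    and one addition per sample.
    """
    n = eb - ea
    for qb in samples:
        qa += (qb << n) if n >= 0 else (qb >> -n)
    return qa

# Accumulator at E_a = -8, inputs at E_b = -4.
total = accumulate(0, -8, [3, 5, -2], -4)
```

Here `total` is 96, and 96 × 2⁻⁸ = 0.375 equals (3 + 5 − 2) × 2⁻⁴.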

**Note**

The negative accumulation of values produces results that are analogous to those produced by the accumulation of values.

Consider the multiplication of two real-world values:

$${V}_{a}={V}_{b}{V}_{c}.$$

These values are represented by the general [Slope Bias] encoding scheme described in Scaling:

$${V}_{i}={F}_{i}{2}^{{E}_{i}}{Q}_{i}+{B}_{i}.$$

In a fixed-point system, the multiplication of values results in finding the variable *Q_a*:

$$\begin{array}{c}{Q}_{a}=\frac{{F}_{b}{F}_{c}}{{F}_{a}}{2}^{{E}_{b}+{E}_{c}-{E}_{a}}{Q}_{b}{Q}_{c}+\frac{{F}_{b}{B}_{c}}{{F}_{a}}{2}^{{E}_{b}-{E}_{a}}{Q}_{b}\\ +\frac{{F}_{c}{B}_{b}}{{F}_{a}}{2}^{{E}_{c}-{E}_{a}}{Q}_{c}+\frac{{B}_{b}{B}_{c}-{B}_{a}}{{F}_{a}}{2}^{-{E}_{a}}.\end{array}$$

This formula shows

- In general, *Q_a* is not computed through a simple multiplication of *Q_b* and *Q_c*.
- In general, there is one multiplication of a constant and two variables, two multiplications of a constant and a variable, three additions, and some additional bit shifting.
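The general multiplication formula can be checked numerically in the same spirit. The Python sketch below evaluates its four terms with exact rational arithmetic and verifies that the decoded result equals *V_b* *V_c*. The helper names and scalings are hypothetical.

```python
from fractions import Fraction

TWO = Fraction(2)

def decode(q, f, e, b):
    """Real-world value V = F * 2**E * Q + B, computed exactly."""
    return Fraction(f) * TWO ** e * q + Fraction(b)

def mul_general(qb, qc, scale_b, scale_c, scale_a):
    """Exact Q_a for V_a = V_b * V_c; each scale is an (F, E, B) tuple."""
    fb, eb, bb = scale_b
    fc, ec, bc = scale_c
    fa, ea, ba = scale_a
    fb, fc, fa = Fraction(fb), Fraction(fc), Fraction(fa)
    bb, bc, ba = Fraction(bb), Fraction(bc), Fraction(ba)
    return (fb * fc / fa * TWO ** (eb + ec - ea) * qb * qc  # constant * two variables
            + fb * bc / fa * TWO ** (eb - ea) * qb          # constant * variable
            + fc * bb / fa * TWO ** (ec - ea) * qc          # constant * variable
            + (bb * bc - ba) / fa * TWO ** (-ea))           # constant term

# Hypothetical scalings (F, E, B), chosen only for illustration.
SCALE_B = (Fraction(3, 2), -4, 1)
SCALE_C = (Fraction(5, 4), -3, -2)
SCALE_A = (Fraction(7, 4), -2, Fraction(1, 2))

qa = mul_general(37, -11, SCALE_B, SCALE_C, SCALE_A)
assert decode(qa, *SCALE_A) == decode(37, *SCALE_B) * decode(-11, *SCALE_C)
```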

The number of arithmetic operations can be reduced with these choices:

- Set *B_a* = *B_b* *B_c*. This eliminates one addition operation.
- Set *F_a* = *F_b* *F_c*. This simplifies the triple multiplication, which is certainly the most difficult part of the equation to implement.
- Set *E_a* = *E_b* + *E_c*. This eliminates some of the bit shifting.

The resulting formula is

$${Q}_{a}={Q}_{b}{Q}_{c}+\frac{{B}_{c}}{{F}_{c}}{2}^{-{E}_{c}}{Q}_{b}+\frac{{B}_{b}}{{F}_{b}}{2}^{-{E}_{b}}{Q}_{c}.$$
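The reduced multiplication formula can be sketched in Python with the two remaining constants computed ahead of time. The names `kb` and `kc` for those constants, the function name, and the scalings are assumptions made for this example.

```python
from fractions import Fraction

TWO = Fraction(2)

def decode(q, f, e, b):
    """Real-world value V = F * 2**E * Q + B, computed exactly."""
    return Fraction(f) * TWO ** e * q + Fraction(b)

def mul_simplified(qb, qc, kb, kc):
    """Q_a = Q_b*Q_c + kb*Q_b + kc*Q_c.

    kb = (B_c / F_c) * 2**-E_c and kc = (B_b / F_b) * 2**-E_b are
    constants that can be computed offline.
    """
    return qb * qc + kb * qb + kc * qc

# Hypothetical input scalings.
FB, EB, BB = Fraction(3, 2), -4, Fraction(1)
FC, EC, BC = Fraction(5, 4), -3, Fraction(-2)
# Output scaling chosen as F_a = F_b*F_c, E_a = E_b + E_c, B_a = B_b*B_c.
FA, EA, BA = FB * FC, EB + EC, BB * BC

KB = BC / FC * TWO ** (-EC)
KC = BB / FB * TWO ** (-EB)

qa = mul_simplified(37, -11, KB, KC)
assert decode(qa, FA, EA, BA) == decode(37, FB, EB, BB) * decode(-11, FC, EC, BC)
```

If both input biases are zero, `kb` and `kc` vanish and only the product of the stored integers remains.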

You can determine the maximum-precision scaling if the range of the variable is known. Maximize Precision shows that you can determine the range of a fixed-point operation from

$$\mathrm{max}\left({\tilde{V}}_{a}\right)$$

and

$$\mathrm{min}\left({\tilde{V}}_{a}\right).$$

For multiplication, you can determine the range from

$$\begin{array}{c}\mathrm{min}\left({\tilde{V}}_{a}\right)=\mathrm{min}\left({V}_{LL},{V}_{LH},{V}_{HL},{V}_{HH}\right),\\ \mathrm{max}\left({\tilde{V}}_{a}\right)=\mathrm{max}\left({V}_{LL},{V}_{LH},{V}_{HL},{V}_{HH}\right),\end{array}$$

where

$$\begin{array}{l}{V}_{LL}=\mathrm{min}\left({\tilde{V}}_{b}\right)\cdot \mathrm{min}\left({\tilde{V}}_{c}\right),\\ {V}_{LH}=\mathrm{min}\left({\tilde{V}}_{b}\right)\cdot \mathrm{max}\left({\tilde{V}}_{c}\right),\\ {V}_{HL}=\mathrm{max}\left({\tilde{V}}_{b}\right)\cdot \mathrm{min}\left({\tilde{V}}_{c}\right),\\ {V}_{HH}=\mathrm{max}\left({\tilde{V}}_{b}\right)\cdot \mathrm{max}\left({\tilde{V}}_{c}\right).\end{array}$$
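The four-corner range calculation for a product is short enough to write directly. In this Python sketch the function name `product_range` is a hypothetical choice.

```python
def product_range(b_min, b_max, c_min, c_max):
    """Return (min, max) of V_b * V_c over the given input ranges.

    The extremes of a product over a rectangle occur at its corners,
    so it is enough to evaluate V_LL, V_LH, V_HL, and V_HH.
    """
    corners = (b_min * c_min, b_min * c_max, b_max * c_min, b_max * c_max)
    return min(corners), max(corners)

lo, hi = product_range(-2, 3, -5, 4)
```

For these inputs the corners are 10, −8, −15, and 12, so the range is [−15, 12]. Neither extreme comes from multiplying the two minima or the two maxima, which is why all four corners have to be checked.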

For binary-point-only scaling, finding *Q_a* results in this simple expression:

$${Q}_{a}={2}^{{E}_{b}+{E}_{c}-{E}_{a}}{Q}_{b}{Q}_{c}.$$
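Binary-point-only multiplication reduces to one integer multiplication followed by a single shift, as in this Python sketch. The function name is hypothetical, and the right shift rounds toward negative infinity, which is one possible rounding choice.

```python
def mul_binary_point(qb, eb, qc, ec, ea):
    """Q_a = 2**(E_b + E_c - E_a) * Q_b * Q_c using integers only."""
    product = qb * qc                 # full-precision product at E_b + E_c
    n = eb + ec - ea
    return product << n if n >= 0 else product >> -n

# 3.25 (Q = 13, E = -2) times 0.3125 (Q = 5, E = -4)
full = mul_binary_point(13, -2, 5, -4, ea=-6)     # no shift needed
reduced = mul_binary_point(13, -2, 5, -4, ea=-4)  # two bits shifted out
```

With `ea=-6` the result is 65 × 2⁻⁶ = 1.015625, which is exact. With `ea=-4` it is 16 × 2⁻⁴ = 1.0.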

Consider the multiplication of a constant and a variable

$${V}_{a}=K\text{\hspace{0.05em}}{V}_{b},$$

where *K* is a constant called the gain. Since *V_a* results from the multiplication of a constant and a variable, finding *Q_a* is a simplified version of the general fixed-point multiplication formula:

$${Q}_{a}=\left(\frac{K{F}_{b}{2}^{{E}_{b}}}{{F}_{a}{2}^{{E}_{a}}}\right)\text{\hspace{0.05em}}{Q}_{b}+\left(\frac{K{B}_{b}-{B}_{a}}{{F}_{a}{2}^{{E}_{a}}}\right)\text{\hspace{0.17em}}.$$

Note that the terms in the parentheses can be calculated offline. Therefore, there is only one multiplication of a constant and a variable and one addition.

To implement the above equation without changing it to a more complicated form, the constants need to be encoded using a binary-point-only format. For each of these constants, the range is the trivial case of only one value. Despite the trivial range, the binary point formulas for maximum precision are still valid. The maximum-precision representations are the most useful choices unless there is an overriding need to avoid any shifting. The encoding of the constants is

$$\begin{array}{l}\left(\frac{K{F}_{b}{2}^{{E}_{b}}}{{F}_{a}{2}^{{E}_{a}}}\right)={2}^{{E}_{X}}{Q}_{X}\\ \left(\frac{K{B}_{b}-{B}_{a}}{{F}_{a}{2}^{{E}_{a}}}\right)={2}^{{E}_{Y}}{Q}_{Y}\end{array}$$

resulting in the formula

$${Q}_{a}={2}^{{E}_{X}}{Q}_{X}{Q}_{b}+{2}^{{E}_{Y}}{Q}_{Y}.$$
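One way to obtain a maximum-precision binary-point-only encoding of a constant is to search for the smallest exponent at which the rounded stored integer still fits the word. The Python sketch below does this for a signed word and then applies the gain formula. The names `encode_constant` and `apply_gain`, the 8-bit word size, and the example gain are assumptions made for illustration.

```python
from fractions import Fraction

def encode_constant(value, ws):
    """Encode a constant as (Q, E) with value ~= Q * 2**E.

    Q is a signed ws-bit integer and E is the smallest exponent for
    which the rounded Q still fits, which gives maximum precision.
    """
    value = Fraction(value)
    if value == 0:
        return 0, 0
    q_min, q_max = -2 ** (ws - 1), 2 ** (ws - 1) - 1

    def fits(e):
        return q_min <= round(value / Fraction(2) ** e) <= q_max

    e = 0
    while not fits(e):       # too large for ws bits: coarsen
        e += 1
    while fits(e - 1):       # room to spare: refine
        e -= 1
    return round(value / Fraction(2) ** e), e

def apply_gain(qb, x, y):
    """Q_a = 2**E_X * Q_X * Q_b + 2**E_Y * Q_Y, with x = (Q_X, E_X), y = (Q_Y, E_Y)."""
    (qx, ex), (qy, ey) = x, y
    return Fraction(2) ** ex * qx * qb + Fraction(2) ** ey * qy

# Hypothetical case: K = 3/4, identical input and output scaling, zero biases,
# so the two constants are 3/4 and 0.
X = encode_constant(Fraction(3, 4), 8)
Y = encode_constant(0, 8)
qa = apply_gain(40, X, Y)
```

Here `X` is `(96, -7)`, since 96 × 2⁻⁷ = 0.75, and `qa` evaluates to 30.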

The number of arithmetic operations can be reduced with these choices:

- Set *B_a* = *K* *B_b*. This eliminates one constant term.
- Set *F_a* = *K* *F_b* and *E_a* = *E_b*. This sets the other constant term to unity.

The resulting formula is simply

$${Q}_{a}={Q}_{b}.$$

If the input and output word sizes differ, the only operation involved is either overflow handling (when the output has fewer bits) or sign extension (when it has more).

The scaling for maximum precision does not need to be different from the scaling for speed unless the output has fewer bits than the input. If this is the case, then saturation should be avoided by dividing the constant multiplier of *Q_b* by 2 for each lost bit, which is equivalent to doubling the output slope. This prevents saturation but causes rounding to occur.

Division of values is an operation that should be avoided in fixed-point embedded systems, but it can occur in places. Therefore, consider the division of two real-world values:

$${V}_{a}={V}_{b}/{V}_{c}.$$

These values are represented by the general [Slope Bias] encoding scheme described in Scaling:

$${V}_{i}={F}_{i}{2}^{{E}_{i}}{Q}_{i}+{B}_{i}.$$

In a fixed-point system, the division of values results in finding the variable *Q_a*:

$${Q}_{a}=\frac{{F}_{b}{2}^{{E}_{b}}{Q}_{b}+{B}_{b}}{{F}_{c}{F}_{a}{2}^{{E}_{c}+{E}_{a}}{Q}_{c}+{B}_{c}{F}_{a}{2}^{{E}_{a}}}-\frac{{B}_{a}}{{F}_{a}}{2}^{-{E}_{a}}.$$

This formula shows

- In general, *Q_a* is not computed through a simple division of *Q_b* by *Q_c*.
- In general, there are two multiplications of a constant and a variable, two additions, one division of a variable by a variable, one division of a constant by a variable, and some additional bit shifting.

The number of arithmetic operations can be reduced with these choices:

- Set *B_a* = 0. This eliminates one addition operation.
- If *B_c* = 0, then set the fractional slope *F_a* = *F_b*/*F_c*. This eliminates one constant times variable multiplication.

The resulting formula is

$${Q}_{a}=\frac{{Q}_{b}}{{Q}_{c}}{2}^{{E}_{b}-{E}_{c}-{E}_{a}}+\frac{\left({B}_{b}/{F}_{b}\right)}{{Q}_{c}}{2}^{-{E}_{c}-{E}_{a}}.$$

If *B_c* ≠ 0, then no clear recommendation can be made.
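For the case *B_c* = 0, the reduced division formula can be checked with exact rational arithmetic. In this Python sketch the function names and scalings are hypothetical, and a zero divisor is left to raise Python's `ZeroDivisionError`.

```python
from fractions import Fraction

TWO = Fraction(2)

def decode(q, f, e, b):
    """Real-world value V = F * 2**E * Q + B, computed exactly."""
    return Fraction(f) * TWO ** e * q + Fraction(b)

def div_simplified(qb, qc, eb, ec, ea, bb_over_fb):
    """Q_a when B_a = 0, B_c = 0, and F_a = F_b / F_c.

    One variable-by-variable division and one constant-by-variable
    division remain, plus shifts.
    """
    return (Fraction(qb) / qc * TWO ** (eb - ec - ea)
            + Fraction(bb_over_fb) / qc * TWO ** (-ec - ea))

# Hypothetical scalings; B_c and B_a are zero as the simplification requires.
FB, EB, BB = Fraction(3, 2), -4, Fraction(1)
FC, EC, BC = Fraction(3, 4), -3, 0
FA, EA, BA = FB / FC, -5, 0

qa = div_simplified(37, 11, EB, EC, EA, BB / FB)
assert decode(qa, FA, EA, BA) == decode(37, FB, EB, BB) / decode(11, FC, EC, BC)
```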

You can determine the maximum-precision scaling if the range of the variable is known. Maximize Precision shows that you can determine the range of a fixed-point operation from

$$\mathrm{max}\left({\tilde{V}}_{a}\right)$$

and

$$\mathrm{min}\left({\tilde{V}}_{a}\right).$$

For division, you can determine the range from

$$\begin{array}{c}\mathrm{min}\left({\tilde{V}}_{a}\right)=\mathrm{min}\left({V}_{LL},{V}_{LH},{V}_{HL},{V}_{HH}\right),\\ \mathrm{max}\left({\tilde{V}}_{a}\right)=\mathrm{max}\left({V}_{LL},{V}_{LH},{V}_{HL},{V}_{HH}\right),\end{array}$$

where for nonzero denominators

$$\begin{array}{l}{V}_{LL}=\mathrm{min}\left({\tilde{V}}_{b}\right)/\mathrm{min}\left({\tilde{V}}_{c}\right),\\ {V}_{LH}=\mathrm{min}\left({\tilde{V}}_{b}\right)/\mathrm{max}\left({\tilde{V}}_{c}\right),\\ {V}_{HL}=\mathrm{max}\left({\tilde{V}}_{b}\right)/\mathrm{min}\left({\tilde{V}}_{c}\right),\\ {V}_{HH}=\mathrm{max}\left({\tilde{V}}_{b}\right)/\mathrm{max}\left({\tilde{V}}_{c}\right).\end{array}$$
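The corner evaluation for a quotient can be sketched as follows. The function name `quotient_range` is hypothetical, and the sketch rejects any denominator range that contains zero, because the corner formulas do not bound the quotient in that case.

```python
from fractions import Fraction

def quotient_range(b_min, b_max, c_min, c_max):
    """Return (min, max) of V_b / V_c over the given input ranges."""
    if c_min <= 0 <= c_max:
        raise ValueError("denominator range must not contain zero")
    corners = [Fraction(b) / Fraction(c)
               for b in (b_min, b_max)
               for c in (c_min, c_max)]
    return min(corners), max(corners)

lo, hi = quotient_range(-2, 6, 2, 4)
```

For these inputs the corners are −1, −1/2, 3, and 3/2, so the range is [−1, 3].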

For binary-point-only scaling, finding *Q_a* results in this simple expression:

$${Q}_{a}=\frac{{Q}_{b}}{{Q}_{c}}{2}^{{E}_{b}-{E}_{c}-{E}_{a}}.$$

**Note**

For the last two formulas involving *Q_a*, a divide by zero and zero divided by zero are possible. In these cases, the hardware will give some default behavior, but you must make sure that these default responses give meaningful results for the embedded system.
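One way to make the zero-divisor behavior explicit is to handle it in software instead of relying on a hardware default. The Python sketch below implements binary-point-only division for a signed output word. The saturation policy it applies for x/0 and the value it returns for 0/0 are assumptions chosen for illustration, as are the function name and the 16-bit default word size. Python's `//` rounds toward negative infinity, whereas many processors truncate toward zero.

```python
def div_binary_point(qb, eb, qc, ec, ea, ws_a=16):
    """Q_a = (Q_b / Q_c) * 2**(E_b - E_c - E_a) for a signed ws_a-bit output.

    Assumed policy (hypothetical): x/0 saturates toward the sign of the
    numerator and 0/0 returns 0.
    """
    q_min, q_max = -2 ** (ws_a - 1), 2 ** (ws_a - 1) - 1
    if qc == 0:
        if qb == 0:
            return 0
        return q_max if qb > 0 else q_min
    n = eb - ec - ea
    # Scale before dividing so the quotient keeps its fractional bits.
    q = (qb << n) // qc if n >= 0 else qb // (qc << -n)
    return max(q_min, min(q_max, q))      # saturate on overflow

# 3.25 (Q = 13, E = -2) divided by 0.3125 (Q = 5, E = -4), output at E_a = -4
qa = div_binary_point(13, -2, 5, -4, -4)
```

Here `qa` is 166, and 166 × 2⁻⁴ = 10.375 approximates the exact quotient 10.4.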

From the previous analysis of fixed-point variables scaled within the general [Slope Bias] encoding scheme, you can conclude

- Addition, subtraction, multiplication, and division can be very involved unless certain choices are made for the biases and slopes.
- Binary-point-only scaling guarantees simpler math, but generally sacrifices some precision.

Note that the previous formulas don't show the following:

- Constants and variables are represented with a finite number of bits.
- Variables are either signed or unsigned.
- Rounding and overflow handling schemes.

You must make these decisions before an actual fixed-point realization is achieved.