## Floating-Point Numbers

“Floating point” refers to a set of data types that encode real numbers, including fractions and decimals. Floating-point data types allow for a varying number of digits after the decimal point, while fixed-point data types have a specific number of digits reserved before and after the decimal point. So, floating-point data types can represent a wider range of numbers than fixed-point data types.

Due to limited memory for number representation and storage, computers can represent a finite set of floating-point numbers that have finite precision. This finite precision can limit accuracy for floating-point computations that require exact values or high precision, as some numbers are not represented exactly. Despite their limitations, floating-point numbers are widely used due to their fast calculations and sufficient precision and range for solving real-world problems.

### Floating-Point Numbers in MATLAB

MATLAB® has data types for double-precision (double) and single-precision (single) floating-point numbers following IEEE® Standard 754. By default, MATLAB represents floating-point numbers in double precision. Double precision allows you to represent numbers to greater precision but requires more memory than single precision. To conserve memory, you can convert a number to single precision by using the single function.

You can store numbers between approximately –3.4 × 10^38 and 3.4 × 10^38 using either double or single precision. If you have numbers outside of that range, store them using double precision.

#### Create Double-Precision Data

Because the default numeric type for MATLAB is type double, you can create a double-precision floating-point number with a simple assignment statement.

x = 10;
c = class(x)
c =
'double'

You can convert numeric data, characters or strings, and logical data to double precision by using the double function. For example, convert a signed integer to a double-precision floating-point number.

x = int8(-113);
y = double(x)
y =
-113

#### Create Single-Precision Data

To create a single-precision number, use the single function.

x = single(25.783);

You can also convert numeric data, characters or strings, and logical data to single precision by using the single function. For example, convert a signed integer to a single-precision floating-point number.

x = int8(-113);
y = single(x)
y =
single
-113

#### How MATLAB Stores Floating-Point Numbers

MATLAB constructs its double and single floating-point data types according to the IEEE format and, by default, follows the round-to-nearest, ties-to-even rounding mode.

A floating-point number x has the form:

$x = (-1)^s \cdot (1 + f) \cdot 2^e$

where:

• s determines the sign.

• f is the fraction, or mantissa, which satisfies 0 ≤ f < 1.

• e is the exponent.

s, f, and e are each determined by a finite number of bits in memory, with f and e depending on the precision of the data type.

Storage of a double number requires 64 bits, as shown in this table.

| Bits | Width | Usage |
| --- | --- | --- |
| 63 | 1 | Stores the sign, where 0 is positive and 1 is negative |
| 62 to 52 | 11 | Stores the exponent, biased by 1023 |
| 51 to 0 | 52 | Stores the mantissa |

Storage of a single number requires 32 bits, as shown in this table.

| Bits | Width | Usage |
| --- | --- | --- |
| 31 | 1 | Stores the sign, where 0 is positive and 1 is negative |
| 30 to 23 | 8 | Stores the exponent, biased by 127 |
| 22 to 0 | 23 | Stores the mantissa |
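The bit layout described above is the same in any IEEE 754 environment, so you can verify it outside MATLAB. This Python sketch (the helper name double_fields is illustrative) reinterprets the 64 bits of a double and splits out the three fields:

```python
import struct

def double_fields(x):
    # Reinterpret the 64 bits of the double x as an unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                   # bit 63: sign
    exponent = (bits >> 52) & 0x7FF     # bits 62 to 52: exponent, biased by 1023
    fraction = bits & ((1 << 52) - 1)   # bits 51 to 0: mantissa
    return sign, exponent, fraction

# -6.0 = (-1)^1 * (1 + 0.5) * 2^2: sign bit 1, biased exponent 1023 + 2 = 1025,
# and a mantissa whose highest bit encodes the fraction 0.5
print(double_fields(-6.0))  # (1, 1025, 2251799813685248)
```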

### Largest and Smallest Values for Floating-Point Data Types

The double- and single-precision data types have a largest and smallest value that you can represent. Numbers outside of the representable range are assigned positive or negative infinity. However, some numbers within the representable range cannot be stored exactly due to the gaps between consecutive floating-point numbers, and these numbers can have round-off errors.

#### Largest and Smallest Double-Precision Values

Find the largest and smallest positive values that can be represented with the double data type by using the realmax and realmin functions, respectively.

m = realmax
m =
1.7977e+308
n = realmin
n =
2.2251e-308

realmax and realmin return normalized IEEE values. You can find the largest and smallest negative values by multiplying realmax and realmin by -1. Numbers greater than realmax or less than –realmax are assigned the values of positive or negative infinity, respectively.
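The overflow behavior is identical in any language whose floats are IEEE doubles. For example, in Python, sys.float_info.max holds the same value as realmax, and exceeding it produces infinity:

```python
import math
import sys

m = sys.float_info.max    # same value as MATLAB's realmax, about 1.7977e+308
print(math.isinf(2 * m))  # True: results beyond realmax overflow to infinity
```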

#### Largest and Smallest Single-Precision Values

Find the largest and smallest positive values that can be represented with the single data type by calling the realmax and realmin functions with the argument "single".

m = realmax("single")
m =
single
3.4028e+38
n = realmin("single")
n =
single
1.1755e-38

You can find the largest and smallest negative values by multiplying realmax("single") and realmin("single") by –1. Numbers greater than realmax("single") or less than –realmax("single") are assigned the values of positive or negative infinity, respectively.

#### Largest Consecutive Floating-Point Integers

Not all integers are representable using floating-point data types. The largest consecutive integer, x, is the greatest integer for which all integers less than or equal to x can be exactly represented, but x + 1 cannot be represented in floating-point format. The flintmax function returns this value. For example, find the largest consecutive integer in double-precision floating-point format, which is 2^53.

x = flintmax
x =
9.0072e+15

Find the largest consecutive integer in single-precision floating-point format, which is 2^24.

y = flintmax("single")
y =
single
16777216

When you convert an integer data type to a floating-point data type, integers that are not exactly representable in floating-point format lose accuracy. flintmax, which is a floating-point number, is less than the greatest integer representable by integer data types using the same number of bits. For example, flintmax for double precision is 2^53, while the maximum value for type int64 is 2^64 – 1. Therefore, converting an integer greater than 2^53 to double precision results in a loss of accuracy.
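This loss of accuracy is easy to reproduce in Python, whose floats are IEEE doubles:

```python
x = 2**53      # a Python integer that is exactly representable as a double
y = 2**53 + 1  # exactly representable as a 64-bit integer, but not as a double
# Converting y to floating point rounds it back down to 2^53.
print(float(y) == float(x))  # True: the +1 is lost in the conversion
```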

### Accuracy of Floating-Point Data

The accuracy of floating-point data can be affected by several factors:

• Limitations of your computer hardware — For example, hardware with insufficient memory truncates the results of floating-point calculations.

• Gaps between each floating-point number and the next larger floating-point number — These gaps are present on any computer and limit precision.

#### Gaps Between Floating-Point Numbers

You can determine the size of a gap between consecutive floating-point numbers by using the eps function. For example, find the distance between 5 and the next larger double-precision number.

e = eps(5)
e =
8.8818e-16

You cannot represent numbers between 5 and 5 + eps(5) in double-precision format. If a double-precision computation returns the answer 5, the result is accurate within eps(5). This radius of accuracy is often called machine epsilon.

The gaps between floating-point numbers are not equal. For example, the gap between 1e10 and the next larger double-precision number is larger than the gap between 5 and the next larger double-precision number.

e = eps(1e10)
e =
1.9073e-06
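Python's math.ulp reports the same gap sizes as MATLAB's eps, so these spacings can be checked independently:

```python
import math

print(math.ulp(5.0))   # 8.881784197001252e-16, the gap just above 5
print(math.ulp(1e10))  # 1.9073486328125e-06, a much larger gap
```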

Similarly, find the distance between 5 and the next larger single-precision number.

x = single(5);
e = eps(x)
e =
single
4.7684e-07

Gaps between single-precision numbers are larger than the gaps between double-precision numbers because there are fewer single-precision numbers. So, results of single-precision calculations are less precise than results of double-precision calculations.

When you convert a double-precision number to a single-precision number, you can determine the upper bound for the amount the number is rounded by using the eps function. For example, when you convert the double-precision number 3.14 to single precision, the number is rounded by at most eps(single(3.14)).

#### Gaps Between Consecutive Floating-Point Integers

The flintmax function returns the largest consecutive integer in floating-point format. Above this value, consecutive floating-point integers have a gap greater than 1.

Find the gap between flintmax and the next floating-point number by using eps:

format long
x = flintmax
x =
9.007199254740992e+15
e = eps(x)
e =
2

Because eps(x) is 2, the next larger floating-point number that can be represented exactly is x + 2.

y = x + e
y =
9.007199254740994e+15

If you add 1 to x, the result is rounded to x.

z = x + 1
z =
9.007199254740992e+15
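The same rounding behavior at flintmax appears in Python, where math.ulp plays the role of eps:

```python
import math

x = float(2**53)    # flintmax for double precision
print(math.ulp(x))  # 2.0: the gap above x
print(x + 1 == x)   # True: x + 1 rounds back down to x
print(x + 2 == x)   # False: x + 2 is the next representable value
```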

### Arithmetic Operations on Floating-Point Numbers

You can use a range of data types in arithmetic operations with floating-point numbers, and the data type of the result depends on the input types. However, when you perform operations with different data types, some calculations may not be exact due to approximations or intermediate conversions.

#### Double-Precision Operands

You can perform basic arithmetic operations with double and any of the following data types. If one or more operands are an integer scalar or array, the double operand must be a scalar. The result is of type double, except where noted otherwise.

• single — The result is of type single.

• double

• int8, int16, int32, int64 — The result is of the same data type as the integer operand.

• uint8, uint16, uint32, uint64 — The result is of the same data type as the integer operand.

• char

• logical

#### Single-Precision Operands

You can perform basic arithmetic operations with single and any of the following data types. The result is of type single.

• single

• double

• char

• logical

### Unexpected Results with Floating-Point Arithmetic

Almost all operations in MATLAB are performed in double-precision arithmetic conforming to IEEE Standard 754. Because computers represent numbers to a finite precision, some computations can yield mathematically nonintuitive results. Some common issues that can arise while computing with floating-point numbers are round-off error, cancellation, swamping, and intermediate conversions. The unexpected results are not bugs in MATLAB and occur in any software that uses floating-point numbers. For exact rational representations of numbers, consider using the Symbolic Math Toolbox™.

#### Round-Off Error

Round-off error can occur due to the finite-precision representation of floating-point numbers. For example, the number 4/3 cannot be represented exactly as a binary fraction. As such, this calculation returns the quantity eps(1), rather than 0.

e = 1 - 3*(4/3 - 1)
e =
2.2204e-16

Similarly, because pi is not an exact representation of π, sin(pi) is not exactly zero.

x = sin(pi)
x =
1.2246e-16

Round-off error is most noticeable when many operations are performed on floating-point numbers, allowing errors to accumulate and compound. A best practice is to minimize the number of operations whenever possible.
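Because these round-off effects come from IEEE 754 arithmetic rather than from MATLAB, the same two examples behave identically in Python:

```python
import math

e = 1 - 3*(4/3 - 1)
print(e)  # 2.220446049250313e-16, i.e. eps(1)

x = math.sin(math.pi)
print(x)  # about 1.2246e-16: math.pi is not exactly pi, so sin is not exactly 0
```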

#### Cancellation

Cancellation can occur when you subtract a number from another number of roughly the same magnitude, as measured by eps. For example, eps(2^53) is 2, so the numbers 2^53 + 1 and 2^53 have the same floating-point representation.

x = (2^53 + 1) - 2^53
x =
0

When possible, try rewriting computations in an equivalent form that avoids cancellations.
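Both the cancellation and a rewrite that avoids it can be demonstrated in Python. The identity 1 − cos(t) = 2 sin²(t/2) turns a subtraction of nearly equal values into a form with no cancellation:

```python
import math

x = (float(2**53) + 1) - float(2**53)
print(x)  # 0.0: the +1 is cancelled away

# For small t, 1 - cos(t) loses all significant digits,
# while the mathematically equivalent 2*sin(t/2)**2 does not.
t = 1e-8
print(1 - math.cos(t))       # 0.0: complete cancellation
print(2 * math.sin(t/2)**2)  # about 5e-17: the accurate value
```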

#### Swamping

Swamping can occur when you perform operations on floating-point numbers that differ by many orders of magnitude. For example, this calculation shows a loss of precision that makes the addition insignificant.

x = 1 + 1e-16
x =
1
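The swamping threshold is half the gap at 1, which is about 1.11 × 10^−16 for doubles, as a quick Python check confirms:

```python
x = 1 + 1e-16
print(x == 1)  # True: 1e-16 is below half the gap at 1, so it is swamped

y = 1 + 2e-16
print(y == 1)  # False: 2e-16 is large enough to reach the next double
```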

#### Intermediate Conversions

When you perform arithmetic with different data types, intermediate calculations and conversions can yield unexpected results. For example, although x and y are both 0.2, subtracting them yields a nonzero result. The reason is that y is first converted to double before the subtraction is performed. This subtraction result is then converted to single, z.

format long
x = 0.2
x =
0.200000000000000
y = single(0.2)
y =
single
0.2000000
z = x - y
z =
single
-2.9802323e-09
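An analogous effect can be reproduced in Python, which has no built-in single type. The helper to_single below (an illustrative name) uses the struct module to round a double to the nearest IEEE 754 single:

```python
import struct

def to_single(x):
    # Round the double x to the nearest IEEE 754 single, widened back to double.
    return struct.unpack(">f", struct.pack(">f", x))[0]

x = 0.2             # double-precision 0.2
y = to_single(0.2)  # single-precision 0.2, a slightly different value
z = x - y
print(z)            # about -2.98e-09: the two representations of 0.2 differ
```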

#### Linear Algebra

Common issues in floating-point arithmetic, such as the ones described above, can compound when applied to linear algebra problems because the related calculations typically consist of multiple steps. For example, when solving the system of linear equations Ax = b, MATLAB warns that the results may be inaccurate because operand matrix A is ill conditioned.

A = diag([2 eps]);
b = [2; eps];
x = A\b;
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 1.110223e-16.
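For this particular matrix, the RCOND value in the warning can be verified by hand: a diagonal matrix has condition number max|a_ii| / min|a_ii|, and RCOND estimates its reciprocal. A short Python check, using sys.float_info.epsilon for MATLAB's eps:

```python
import sys

eps = sys.float_info.epsilon  # MATLAB's eps, equal to 2^-52
# A = diag([2, eps]): cond(A) = 2/eps, so RCOND = 1/cond(A) = eps/2.
rcond = eps / 2
print(rcond)  # 1.1102230246251565e-16, matching the RCOND in the warning
```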

## References

[1] Moler, Cleve. Numerical Computing with MATLAB. Natick, MA: The MathWorks, Inc., 2004.