
Floating Point Representation

 

Floating-point representation doesn’t reserve a fixed number of bits for the integer and fractional parts of a number. Instead, it reserves a certain number of bits for the significant digits (called the mantissa or significand) and a certain number of bits for the position of the radix point (called the exponent).

It has two parts. The first part represents a signed fixed-point number, known as the mantissa, and the second part represents the position of the decimal or binary point, known as the exponent.

 

Floating-point representation has been standardized by the IEEE 754 standard as follows:

Therefore the actual number is (-1)^s × (1 + m) × 2^(e − Bias),

where:

s = sign bit; it is 0 for a positive number and 1 for a negative number,

m = mantissa (the fractional part),

e = stored exponent value, and

Bias = the bias constant of the format.
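The decoding formula above can be sketched in Python. This is a minimal illustration, assuming single precision (Bias = 127) and a normalized number; the function name `decode` and the test value are illustrative choices, not part of the standard:

```python
def decode(s, m_bits, e):
    """Evaluate (-1)^s * (1 + m) * 2^(e - Bias) for single precision."""
    bias = 127                              # single-precision bias
    m = int(m_bits, 2) / 2**len(m_bits)     # fractional value of the mantissa bits
    return (-1)**s * (1 + m) * 2**(e - bias)

# 0.15625 = (1.01)_2 x 2^-3, so s = 0, mantissa bits "01...", e = -3 + 127 = 124
print(decode(0, "01000000000000000000000", 124))  # 0.15625
```

Because every quantity here is a sum of powers of two, the result is exact in Python's floats.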

 

The floating-point number can be represented in the following ways:

  1. Half Precision (16 bit): 1 sign bit, 5-bit exponent, and 10-bit mantissa.
  2. Single Precision (32 bit): 1 sign bit, 8-bit exponent, and 23-bit mantissa.
  3. Double Precision (64 bit): 1 sign bit, 11-bit exponent, and 52-bit mantissa.
  4. Quadruple Precision (128 bit): 1 sign bit, 15-bit exponent, and 112-bit mantissa.
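For each of these formats the bias is 2^(e_bits − 1) − 1, a standard IEEE 754 fact not stated above. A short sketch computing the total width and bias of each format from the field sizes listed:

```python
# (exponent bits, mantissa bits) for each IEEE 754 format listed above
formats = {
    "half":      (5, 10),
    "single":    (8, 23),
    "double":    (11, 52),
    "quadruple": (15, 112),
}

for name, (e_bits, m_bits) in formats.items():
    bias = 2**(e_bits - 1) - 1          # e.g. 127 for single precision
    total = 1 + e_bits + m_bits         # sign + exponent + mantissa
    print(f"{name:9s}: {total:3d} bits, bias {bias}")
```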

 

Example:

Let us assume a number uses a 32-bit format: a 1-bit sign, 8 bits for the signed exponent, and 23 bits for the fractional part; the leading 1 is the hidden bit.

So −53.5 is normalized as −53.5 = (−110101.1)_2 = (−1.101011)_2 × 2^5.

It will be represented as follows:

1 | 00000101 | 10101100000000000000000

where 00000101 is the 8-bit binary value of the exponent +5, and the 23-bit fraction holds the bits after the hidden leading 1.
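As a cross-check, the standard IEEE 754 single-precision encoding of −53.5 can be inspected with Python's `struct` module. Note one difference from the simplified format above: real IEEE 754 stores a biased exponent, so the field holds 5 + 127 = 132 = 10000100 rather than the plain signed value 00000101:

```python
import struct

# Reinterpret the 32-bit IEEE 754 pattern of -53.5 as an unsigned integer
bits = struct.unpack(">I", struct.pack(">f", -53.5))[0]

s = bits >> 31             # sign bit
e = (bits >> 23) & 0xFF    # biased exponent field
m = bits & 0x7FFFFF        # 23-bit mantissa (fraction) field

print(f"sign={s} exponent={e:08b} mantissa={m:023b}")
# sign=1 exponent=10000100 mantissa=10101100000000000000000
```

The mantissa field matches the bits after the hidden 1 in 1.101011 × 2^5.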

 
