Floating Point Representation


In this arti­cle we are going to see what is a float­ing point and how dec­i­mal val­ues are rep­re­sent­ed using this system.

Float­ing point rep­re­sen­ta­tion is a way of writ­ing a dec­i­mal num­ber that resem­bles sci­en­tif­ic nota­tion. This allows us to rep­re­sent and oper­ate with very large num­bers and also with very small num­bers (with many decimals).

The float­ing point com­put­ing stan­dard is described in IEEE 754.


Structure of a floating point number

This rep¬≠re¬≠sen¬≠ta¬≠tion sys¬≠tem uses a cer¬≠tain num¬≠ber of bina¬≠ry dig¬≠its depend¬≠ing on the type of accu¬≠ra¬≠cy (com¬≠mon¬≠ly 16, 32, 64 and 128 bits).

A bit is des­tined to the sign, i.e. if that bit is worth 0 it is a pos­i­tive num­ber, if it is worth 1 it is a neg­a­tive number.

The remain­ing bits are dis­trib­uted in the rep­re­sen­ta­tion of the dec­i­mals (usu­al­ly called man­tis­sa) and the exponent.

    \[      \boxed{n = (-1)^s . 2^{(e-127)} . (1+m)}\]

In the expres¬≠sion n it is the dec¬≠i¬≠mal num¬≠ber to represent. 

The let­ter s is the bit for the sign (if s is 0, the expres­sion (-1) raised to 0 results in 1 positive).

The let­ter e is the expo­nent and m is the mantissa.

Move decimal numbers to floating point

1- Take the num¬≠ber to rep¬≠re¬≠sent, sep¬≠a¬≠rate the sign and write the absolute val¬≠ue in base 2.

2- The absolute val¬≠ue in base 2 is writ¬≠ten in sci¬≠en¬≠tif¬≠ic nota¬≠tion in nor¬≠mal¬≠ized base 2.

3- The expo­nent is expressed in excess nota­tion (it will depend on the type of pre­ci­sion chosen).

4- The coef­fi­cient is writ­ten on the man­tis­sa with­out the whole part, because the nor­mal­iza­tion in step 2 forces the whole part of the man­tis­sa to be 1, stor­ing it does not pro­vide information.

System Truncation Error Floating Point

A trun­ca­tion error occurs when you take a cer­tain num­ber of dig­its from one num­ber and leave out the others.

Think of the num¬≠ber ŌÄ (3.14159265‚Ķ), which is an irra¬≠tional num¬≠ber with infi¬≠nite digits.

Com¬≠put¬≠ers can¬≠not store infi¬≠nite infor¬≠ma¬≠tion in mem¬≠o¬≠ry because infi¬≠nite¬≠ly large mem¬≠o¬≠ry would be need¬≠ed, so at some point it must stop.

If we trun¬≠cate all the dec¬≠i¬≠mal part to ŌÄ and we are left with only 3, we will be mak¬≠ing an error of approx¬≠i¬≠mate¬≠ly 4.5% rel¬≠a¬≠tive to the real val¬≠ue of ŌÄ . If on the oth¬≠er hand we take into account the first two dec¬≠i¬≠mals of ŌÄ, we are left with 3.14. In this case we will be mak¬≠ing an error of approx¬≠i¬≠mate¬≠ly 0.05% rel¬≠a¬≠tive to the real val¬≠ue of ŌÄ. 

This error will occur in the float­ing point sys­tem either because we want to rep­re­sent irra­tional num­bers or because the dec­i­mal we want to rep­re­sent becomes irra­tional when passed to the bina­ry sys­tem (exam­ple 0.1 dec­i­mal has infi­nite dig­its when passed to binary).

Scroll to Top
Secured By miniOrange