Floating point numbers in machine learning: Less is more for Intel, Nvidia and ARM


The three hardware manufacturers have published a joint initiative for the standardized 8-bit format “FP8” for floating point numbers.


ARM, Intel and Nvidia are jointly pushing for an 8-bit floating-point number format. The proposed FP8 standard is intended to complement the 16-, 32- and 64-bit formats defined in IEEE 754. The motivation is training machine learning models, which should run significantly faster with the narrower data format and without a significant loss of accuracy.


The IEEE 754 standard defines the structure of floating-point numbers at different precisions. The baseline is single precision with 32 bits (FP32). Accordingly, double-precision floating-point numbers require 64 bits (FP64), while 16 bits suffice for half precision (FP16). In addition, there are so-called minifloats with fewer bits, which the IEEE standard does not define.
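These widths can be checked directly from Python's standard library: the struct module exposes the IEEE 754 half, single and double formats under the pack codes "e", "f" and "d". A quick sketch:

```python
import struct

# struct's pack codes map onto the IEEE 754 interchange formats:
#   "e" = half precision (FP16), "f" = single (FP32), "d" = double (FP64)
for code, name in [("e", "FP16"), ("f", "FP32"), ("d", "FP64")]:
    print(name, struct.calcsize(code) * 8, "bits")
```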

Some programming languages expose several of these types, such as float and double in C and C++, or f32 and f64 in Rust. Other languages, such as JavaScript and Python, the latter widespread in data science and machine learning, use 64-bit floating-point numbers by default.
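That a Python float really is an IEEE 754 double can be verified from the standard library: sys.float_info reports 53 bits of mantissa precision, the hallmark of FP64. A small sketch:

```python
import struct
import sys

# A Python float is an IEEE 754 double: 52 stored mantissa bits
# plus the implicit leading 1 give 53 bits of precision.
print(sys.float_info.mant_dig)   # 53
print(struct.calcsize("d") * 8)  # 64 bits

# Consequence: decimal 0.1 has no exact binary representation.
print(f"{0.1:.20f}")
```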

Regardless of the bit width, all formats represent a number as a combination of mantissa (m) and exponent (e): the mantissa holds the significant digits, the exponent scales them. Added to this is the base (b), which is 2 in IEEE 754. The actual number is given by the formula x = m * 2^e. The exponent is signed; for a single-precision floating-point number, which provides 23 bits for the mantissa and 8 for the exponent, it can range from -126 to 127. All precision levels use one additional bit for the sign of the number.
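This layout can be made visible by reinterpreting a single-precision value's 32-bit pattern as an integer and masking out the three fields. A sketch using only the standard library (decompose_fp32 is a name chosen here for illustration):

```python
import struct

def decompose_fp32(x):
    # Reinterpret the FP32 bit pattern as a 32-bit unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                       # 1 sign bit
    exponent = ((bits >> 23) & 0xFF) - 127  # 8 exponent bits, bias 127
    mantissa = bits & 0x7FFFFF              # 23 mantissa bits
    return sign, exponent, mantissa

# 6.5 is 1.101 (binary) * 2^2, so the stored fraction is 0.625
s, e, m = decompose_fp32(6.5)
print(s, e, m / 2**23)  # 0 2 0.625
```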

In typical applications, the width of floating-point numbers has no significant impact on performance. In machine learning, however, especially when training models but also during inference in production use, countless calculations run in parallel, typically on dedicated hardware such as GPUs or AI accelerators. There, narrower formats reduce memory traffic and let the same silicon perform more operations per cycle.

Various approaches have therefore been trying to optimize floating-point formats for machine learning for some time. Many rely on mixed precision, combining, for example, half- and single-precision floating-point numbers, or FP16 with 8-bit integer values (INT8). Many current processors and compute accelerators also handle BFloat16 (BF16).
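BFloat16, for example, is essentially single precision with the mantissa cut down to 7 bits while keeping all 8 exponent bits, so a rounding-free conversion amounts to dropping the lower 16 bits of the FP32 pattern. A minimal sketch with illustrative function names (truncation only; real hardware typically rounds):

```python
import struct

def fp32_to_bf16_bits(x):
    # BF16 keeps FP32's sign bit and all 8 exponent bits, and truncates
    # the mantissa from 23 to 7 bits: drop the low 16 bits of the pattern.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16

def bf16_bits_to_fp32(b):
    # Widen back by padding the dropped mantissa bits with zeros.
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

# 3.140625 = 1.1001001 (binary) * 2^1 fits in 7 mantissa bits exactly.
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.140625)))
```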

According to a paper on arXiv, the accuracy of many vision and language models suffers little from lower floating-point precision. It compares the use of 8-bit floating-point numbers (FP8) with FP16 values, among others in transformer-based language models such as BERT and GPT-3.


The differences in performance, on the other hand, are clear: Nvidia used the benchmark suite MLPerf Inference 2.1 to carry out measurements on the Hopper architecture introduced at the beginning of 2022. According to the company, FP8 processing was four and a half times faster than with FP16.

To ensure uniform processing of 8-bit floating-point numbers, ARM, Intel and Nvidia want to jointly promote a standardization proposal. The proposed FP8 format comes in two variants: E5M2 uses two bits for the mantissa and five for the exponent, while E4M3 assigns three bits to the mantissa and only four to the exponent. The latter is therefore more precise, but covers a smaller range of numbers.
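The trade-off can be quantified: the largest finite value of each variant follows from the exponent bias and from which bit patterns are reserved. Per the proposal, E5M2 reserves the all-ones exponent IEEE-style for infinities and NaN, while E4M3 reclaims it for normal numbers (only the all-ones mantissa remains NaN). A sketch under those assumptions, with biases of 7 (E4M3) and 15 (E5M2):

```python
def fp8_max_normal(exp_bits, man_bits, bias, reclaim_top_exponent):
    """Largest finite FP8 value for a given exponent/mantissa split."""
    if reclaim_top_exponent:
        # E4M3 style: the all-ones exponent is usable; only mantissa 111
        # is NaN, so the largest fraction is 1.110 (binary).
        e = (1 << exp_bits) - 1 - bias
        frac = 2.0 - 2.0 ** (1 - man_bits)
    else:
        # E5M2 / IEEE 754 style: the all-ones exponent is reserved for
        # inf/NaN, so the largest usable exponent field is one below it.
        e = (1 << exp_bits) - 2 - bias
        frac = 2.0 - 2.0 ** (-man_bits)
    return frac * 2.0 ** e

print(fp8_max_normal(4, 3, bias=7, reclaim_top_exponent=True))    # E4M3: 448.0
print(fp8_max_normal(5, 2, bias=15, reclaim_top_exponent=False))  # E5M2: 57344.0
```

The resulting maxima (448 versus 57344) illustrate the point from the text: E4M3 trades dynamic range for an extra bit of precision.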

In addition, FP8 is to cover the special cases of zero, infinity and NaN (Not a Number), i.e. an undefined or non-representable value. The FP8 format is already natively integrated into the Nvidia Hopper architecture.

Further details on the FP8 floating-point number format and the standardization efforts can be found on the Nvidia blog.

Brian Adam