
What is a floating point number

2025-04-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains what a floating-point number is: how it is stored in memory, what range and precision float and double offer, and why the exponent is stored with an offset.

1 Preface

When we learn C, we usually treat floating-point numbers and decimals (numbers with a fractional part) as equivalent and draw no strict distinction between the two concepts. This does not hinder learning, because in C the two are bound together: only decimals are stored in floating-point format.

In fact, both integers and decimals can be stored in either fixed-point or floating-point format. In practice, C stores integers in fixed-point format and decimals in floating-point format, a choice that balances numerical range against numerical precision.

2 What is a floating-point number?

A floating-point type is, loosely speaking, a real-number type. Computers use floating-point numbers to approximate real numbers. Specifically, the value is obtained by multiplying an integer or fixed-point number (the mantissa, also called the significand) by an integer power of a base (usually 2 in a computer). This representation is similar to scientific notation, which uses base 10.

3 Storage of floating-point numbers in memory

First, to be clear: whether the data is an integer, a floating-point number, or a character, the computer ultimately stores it in binary.

Floating-point numbers are stored differently from integers. An integer maps directly to its binary representation, whereas a floating-point number is stored as three fields: a sign bit (sign), an exponent field (exponent), and a fraction or mantissa field (fraction).

Type      Sign bit      Exponent bits        Mantissa bits
float     1 (bit 31)    8 (bits 23 ~ 30)     23 (bits 0 ~ 22)
double    1 (bit 63)    11 (bits 52 ~ 62)    52 (bits 0 ~ 51)

On typical platforms int and float both occupy four bytes of memory, yet the maximum value a float can represent is far larger than the maximum int. The root cause is that a floating-point number is stored in exponential form.

The steps for converting a floating-point number to its in-memory form are as follows:

1. Convert the floating-point number to binary

2. Write the binary number in scientific notation

3. Calculate the exponent after the offset is added

For point 3: an offset must be added to the exponent before it is stored (the reason is discussed in section 6). The offset depends on the floating-point type: 127 for float and 1023 for double. For example, for an exponent of 6, the stored values for float and double are:

float: 127 + 6 = 133

double: 1023 + 6 = 1029

4 Example

How the floating-point number 19.625 is stored as a float:

Convert the number to binary: 10011.101 (for the integer part 19, divide by 2 repeatedly and collect the remainders; for the fractional part 0.625, multiply by 2 repeatedly and collect the integer digits)

Write the binary number in scientific notation: 1.0011101 * 2 ^ 4

Calculate the exponent after the offset is added: 127 + 4 = 131 (10000011)

To sum up, 19.625 as a float is stored in memory as 0-10000011-00111010000000000000000 (sign, exponent, mantissa), which is 0x419D0000 in hexadecimal. The leading 1 of 1.0011101 is implied and is not stored; the mantissa field holds 0011101 followed by zeros.

5 Range and precision of float and double

Range

The range of float and double is determined by the number of exponent bits. (Because values have the form 1.x * 2 ^ Y, the contribution of 1.x is ignored here and the exponent alone is used to estimate the range.)

float:

1 bit (sign) 8 bits (exponent) 23 bits (mantissa)

double:

1 bit (sign) 11 bits (exponent) 52 bits (mantissa)

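Therefore, once the offset is subtracted, the exponent of float spans -127 ~ 128 and that of double spans -1023 ~ 1024. The exponent bits are stored in offset form, not as a two's complement number. In the IEEE 754 formats that C implementations normally use, the two extreme stored values are reserved (all zeros for zero and subnormal numbers, all ones for infinity and NaN), so normal numbers use exponents -126 ~ 127 for float and -1022 ~ 1023 for double.

The negative exponents determine the non-zero number with the smallest absolute value that a floating-point number can express, while the positive exponents determine the largest absolute value; together they determine the range of the floating-point type.

The range of float is roughly -2 ^ 128 ~ +2 ^ 128, that is, about -3.40E+38 ~ +3.40E+38. The largest finite float lies just below 2 ^ 128.

The range of double is roughly -2 ^ 1024 ~ +2 ^ 1024, that is, about -1.79E+308 ~ +1.79E+308.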

Precision

The precision of float and double is determined by the number of mantissa bits. The more bits the mantissa has, the more significant digits it can represent and the higher the precision. A floating-point number is stored in binary scientific notation, so its integer part is always an implied "1". Because it is constant it is not stored, and the estimate below counts only the stored bits.

float: 2 ^ 23 = 8388608, a seven-digit number, so a float can carry at most seven significant decimal digits, but only six are guaranteed in every case. The precision of float is therefore 6 to 7 significant digits.

double: 2 ^ 52 = 4503599627370496, a 16-digit number. By the same reasoning, the precision of double is 15 to 16 significant digits, of which 15 are guaranteed.

6 Analysis: why is the exponent stored with an offset?

If the offset is not used:

The exponent would then need a sign bit of its own. An 8-bit signed binary number in sign-magnitude form has two ranges: 0000 0000 ~ 0111 1111 and 1000 0000 ~ 1111 1111, representing +0 ~ +127 and -0 ~ -127 respectively.

The problem is visible here: there are two zeros, one positive zero and one negative zero.

If the offset is used:

The offset of float is 127, which in binary is 0111 1111. Then:

To represent -127, we compute -127 + 127, that is, 0111 1111 - 0111 1111 = 0000 0000.

To represent -126, we compute -126 + 127, that is, 0111 1111 - 0111 1110 = 0000 0001.

To represent -2, we compute -2 + 127, that is, 0111 1111 - 0000 0010 = 0111 1101.

To represent -1, we compute -1 + 127, that is, 0111 1111 - 0000 0001 = 0111 1110.

To represent 0, we compute 0 + 127, that is, 0000 0000 + 0111 1111 = 0111 1111.

To represent 1, we compute 1 + 127, that is, 0000 0001 + 0111 1111 = 1000 0000.

To represent 2, we compute 2 + 127, that is, 0000 0010 + 0111 1111 = 1000 0001.

To represent 128, we compute 128 + 127, that is, 1000 0000 + 0111 1111 = 1111 1111.

From the above examples we can see a rule: with offset storage, 8 bits can represent 127 negative exponents, zero, and 128 positive exponents, that is, all 256 values from -127 to 128. Offset storage therefore avoids the +0 and -0 problem in the exponent field and makes full use of all 8 bits for the exponent of a single-precision floating-point number.

