
How to analyze the accuracy range of Java float and double

2025-07-02 Update. From: SLTechnology News&Howtos (Shulou, Shulou.com), 06/02 report.

How do you analyze the precision and range of Java's float and double? This article walks through the analysis in detail, in the hope of giving readers who have this question a simple and workable way to reason about it.

Accuracy range of Java float and double

To understand the range and precision of float and double, you must first understand how a number with a fractional part is stored in a computer.

For example, 78.375 is a positive decimal number. To store it, the computer represents it as a floating-point number, and the first step is to convert it to binary:

1. Binary conversion of decimals (floating point numbers)

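The integer part, 78, is converted by repeatedly dividing by 2 and reading the remainders from last to first, which gives 1001110 (64 + 8 + 4 + 2 = 78).

The fractional part, 0.375, is converted by repeatedly multiplying by 2 and taking the integer digit each time: 0.375 * 2 = 0.75 (digit 0), 0.75 * 2 = 1.5 (digit 1), 0.5 * 2 = 1.0 (digit 1), which gives 0.011.

So the binary form of 78.375 is 1001110.011.

In binary scientific notation this is 1.001110011 * 2^6.

Note that the number written in binary scientific notation has an exponent and a fractional part; a number written this way is called a floating-point number.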
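The two conversion steps above can be sketched in Java. This is only an illustration: the class name `BinaryConversionSketch` and the method `toBinary` are made up for this example, and the method assumes a non-negative input.

```java
class BinaryConversionSketch {
    // Converts a non-negative value to a binary string.
    // The integer part uses the library conversion; the fractional part
    // uses the "multiply by 2 and take the integer digit" method.
    // At most maxFractionBits fractional digits are produced.
    static String toBinary(double value, int maxFractionBits) {
        long integerPart = (long) value;
        double fraction = value - integerPart;
        StringBuilder sb = new StringBuilder(Long.toBinaryString(integerPart));
        if (fraction > 0) {
            sb.append('.');
        }
        for (int i = 0; i < maxFractionBits && fraction > 0; i++) {
            fraction *= 2;
            if (fraction >= 1) {
                sb.append('1');
                fraction -= 1;
            } else {
                sb.append('0');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(78.375, 23)); // prints 1001110.011
    }
}
```

Moving the binary point six places to the left in 1001110.011 gives 1.001110011, which is where the exponent 6 comes from.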

2. Storage of floating-point numbers in computer

In a computer, this number is stored using a floating-point representation, which is divided into three parts:

The first part stores the sign bit, which distinguishes positive from negative numbers; here it is 0, meaning positive.

The second part stores the exponent, which here is 6 in decimal.

The third part stores the fraction, which here is 001110011.

Note that the exponent can be positive or negative; this is discussed later.

As shown in the following picture (picture from Wikipedia):

For example, the float type is 32 bits wide, which is the single-precision floating-point format:

The sign bit occupies 1 bit and indicates whether the number is positive or negative.

The exponent occupies 8 bits.

The fraction occupies 23 bits.

The double type is 64 bits wide, which is the double-precision floating-point format:

The sign occupies 1 bit, the exponent occupies 11 bits, and the fraction occupies 52 bits.
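To see these field widths in practice, the raw bits of a value can be split with shifts and masks. The sketch below uses the standard `Float.floatToIntBits` and `Double.doubleToLongBits` methods; the class and method names (`FloatFieldsSketch`, `floatFields`, `doubleFields`) are hypothetical. The exponent it returns is the raw stored field, which the article explains in the next section.

```java
class FloatFieldsSketch {
    // Splits the 32 bits of a float into {sign, stored exponent, fraction}.
    static int[] floatFields(float value) {
        int bits = Float.floatToIntBits(value);
        int sign = bits >>> 31;                    // 1 bit
        int exponent = (bits >>> 23) & 0xFF;       // 8 bits
        int fraction = bits & ((1 << 23) - 1);     // 23 bits
        return new int[] { sign, exponent, fraction };
    }

    // Splits the 64 bits of a double into {sign, stored exponent, fraction}.
    static long[] doubleFields(double value) {
        long bits = Double.doubleToLongBits(value);
        long sign = bits >>> 63;                   // 1 bit
        long exponent = (bits >>> 52) & 0x7FFL;    // 11 bits
        long fraction = bits & ((1L << 52) - 1);   // 52 bits
        return new long[] { sign, exponent, fraction };
    }

    public static void main(String[] args) {
        int[] f = floatFields(78.375f);
        System.out.println("float:  sign=" + f[0] + " exponent=" + f[1]
                + " fraction=" + Integer.toBinaryString(f[2]));
        long[] d = doubleFields(78.375);
        System.out.println("double: sign=" + d[0] + " exponent=" + d[1]
                + " fraction=" + Long.toBinaryString(d[2]));
    }
}
```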

From this you can already get a rough idea:

The exponent bits determine the range: the larger the exponent that can be stored, the larger the numbers that can be represented.

The fraction bits determine the precision: the more fraction bits there are, the more significant digits can be stored, and the more precise the calculation.

A concrete example makes this clearer:

float has only 23 fraction bits, that is, 23 binary digits. The largest count they can express is 2 to the 23rd power, 8388608, which is a 7-digit decimal number. Strictly speaking, only 6 decimal digits are guaranteed to be preserved in every case.

double has 52 fraction bits. The corresponding value, 2 to the 52nd power, is 4503599627370496, a 16-digit number, so only 15 decimal digits are guaranteed to be preserved in every case.
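A quick way to observe this limit is to look for the first integer that no longer survives a round trip through the type. Because a normalized value also carries an implicit leading 1 (the article covers this further on), the effective widths are 24 and 53 bits, so the sketch below probes 2^24 and 2^53 rather than 2^23 and 2^52. The class and method names are illustrative only.

```java
class PrecisionSketch {
    // True if the int survives conversion to float and back unchanged.
    static boolean floatCanHold(int n) {
        return (int) (float) n == n;
    }

    // True if the long survives conversion to double and back unchanged.
    static boolean doubleCanHold(long n) {
        return (long) (double) n == n;
    }

    public static void main(String[] args) {
        System.out.println(floatCanHold(16_777_216));        // 2^24: true
        System.out.println(floatCanHold(16_777_217));        // 2^24 + 1: false
        System.out.println(doubleCanHold(1L << 53));         // 2^53: true
        System.out.println(doubleCanHold((1L << 53) + 1));   // 2^53 + 1: false

        // Spacing between 1.0 and the next representable value:
        // 2^-23 for float, 2^-52 for double.
        System.out.println(Math.ulp(1.0f));
        System.out.println(Math.ulp(1.0));
    }
}
```

The spacing reported by `Math.ulp` matches the fraction widths quoted above: 23 bits for float and 52 bits for double.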

3. The bias and the unsigned representation of the exponent

Note that the exponent may be negative or positive, that is, it is a signed integer, and signed integers are more troublesome to work with than unsigned ones. To avoid that trouble, the exponent is converted to an unsigned integer before it is stored. How is this done?

The exponent field of a float is 8 bits wide, and the exponent ranges from -126 to +127. To remove the effect of negative values on operations such as comparison, addition, and subtraction, a simple mapping is applied when the exponent is stored: an offset (bias) is added to it. For float the bias is 127, so the stored value is never negative.

For example:

If the exponent is 6, the value actually stored is 6 + 127 = 133, that is, 133 is converted to binary and stored.

If the exponent is -3, the value actually stored is -3 + 127 = 124, that is, 124 is converted to binary and stored.

To recover the number actually represented, subtract the bias from the stored exponent.

For the double type, the bias used when storing the exponent is 1023.
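The mapping can be checked in Java by comparing the raw exponent field with the value returned by the standard `Math.getExponent` method. The following is a minimal sketch; `ExponentBiasSketch` and its methods are names invented for this example.

```java
class ExponentBiasSketch {
    static final int FLOAT_BIAS = 127;
    static final int DOUBLE_BIAS = 1023;

    // Raw 8-bit exponent field of a float, as stored.
    static int storedExponent(float value) {
        return (Float.floatToIntBits(value) >>> 23) & 0xFF;
    }

    // Raw 11-bit exponent field of a double, as stored.
    static int storedExponent(double value) {
        return (int) ((Double.doubleToLongBits(value) >>> 52) & 0x7FF);
    }

    public static void main(String[] args) {
        // 78.375 = 1.001110011 * 2^6, so the stored field is 6 + 127.
        System.out.println(storedExponent(78.375f));               // 133
        System.out.println(storedExponent(78.375f) - FLOAT_BIAS);  // 6
        // 0.125 = 1.0 * 2^-3, so the stored field is -3 + 127.
        System.out.println(storedExponent(0.125f));                // 124
        // Same value as a double: 6 + 1023.
        System.out.println(storedExponent(78.375));                // 1029
        // The library reports the unbiased exponent directly.
        System.out.println(Math.getExponent(0.125f));              // -3
    }
}
```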

4. Summary

So to store the decimal number 78.375 as a float, you first convert it to a binary floating-point number to obtain the sign bit, the exponent, and the fraction.

This example was analyzed earlier, so:

The sign bit is 0.

The stored exponent is 6 + 127 = 133, which is 10000101 in binary.

The fraction is 001110011; the remaining bits are padded with zeros up to 23 bits.

Concatenated as a float, the leftmost bit is the sign bit 0 (a positive number), the next 8 bits are the exponent, and the remaining 23 bits are the fraction:

0 10000101 00111001100000000000000

For double the procedure is the same, with a stored exponent of 6 + 1023 = 1029 and the fraction padded to 52 bits. It is not written out here because there are too many zeros.
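Rather than writing the bits out by hand, you can ask the JVM for them. The sketch below prints the sign, exponent, and fraction fields separated by spaces; `BitPatternSketch`, `floatBits`, and `doubleBits` are hypothetical names chosen for this example.

```java
class BitPatternSketch {
    // Returns the 32 bits of a float as "sign exponent fraction".
    static String floatBits(float value) {
        String bits = String.format("%32s",
                Integer.toBinaryString(Float.floatToIntBits(value))).replace(' ', '0');
        return bits.substring(0, 1) + " " + bits.substring(1, 9) + " " + bits.substring(9);
    }

    // Returns the 64 bits of a double as "sign exponent fraction".
    static String doubleBits(double value) {
        String bits = String.format("%64s",
                Long.toBinaryString(Double.doubleToLongBits(value))).replace(' ', '0');
        return bits.substring(0, 1) + " " + bits.substring(1, 12) + " " + bits.substring(12);
    }

    public static void main(String[] args) {
        System.out.println(floatBits(78.375f)); // 0 10000101 00111001100000000000000
        System.out.println(doubleBits(78.375));
    }
}
```

For the double, the exponent field comes out as 10000000101 (1029) and the fraction starts with the same 001110011 followed by zeros.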

What is the range of float and double?

A float occupies 4 bytes (32 bits) in Java. Its value is given by the formula (-1)^S * 2^(E - 127) * (1.M), where S is the 1-bit sign, E is the 8-bit exponent field, and M is the 23-bit fraction.
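The formula can be evaluated directly and compared with what the JVM produces for the same bits. This is a sketch for normalized values only (exponent field between 1 and 254); the class and method names are assumptions made for illustration, while `Math.scalb` and `Float.intBitsToFloat` are standard library methods.

```java
class FloatFormulaSketch {
    // Evaluates (-1)^S * 2^(E - 127) * (1.M) for a normalized float,
    // where m is the 23-bit fraction field taken as an integer.
    static double decodeNormalized(int s, int e, int m) {
        double significand = 1.0 + m / (double) (1 << 23); // 1.M
        double sign = (s == 0) ? 1.0 : -1.0;               // (-1)^S
        return sign * Math.scalb(significand, e - 127);    // * 2^(E - 127)
    }

    // Packs the three fields into 32 bits and lets the JVM interpret them.
    static float assemble(int s, int e, int m) {
        return Float.intBitsToFloat((s << 31) | (e << 23) | m);
    }

    public static void main(String[] args) {
        int m = 0b00111001100000000000000; // fraction bits of 78.375
        System.out.println(decodeNormalized(0, 133, m)); // 78.375
        System.out.println(assemble(0, 133, m));         // 78.375
    }
}
```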

At first I could not figure out why there is a 1 in front of the (1.m) part. Then one day it dawned on me: in binary scientific notation the digit before the point is always 1 for a normalized number, so the leading 1 is implied and only the part after the point is stored.

E occupies 8 bits, so its stored value ranges from 0 to 255. To represent fractions the exponent must also be able to go negative, so roughly half of that range is given to negative exponents: the actual exponent is E - 127, which at first glance ranges from -127 to 128.

There is not much to say about the mantissa: it ranges from 1 to 1.11...1 (all 23 fraction bits set to 1).

Note: the mantissa 1.1111 here is decimal 1 plus binary 0.1111. An example makes this clearer:

1.1 -> 1 + 1/2 = 1.5 = 2 - 1/2

1.11 -> 1 + 1/2 + 1/4 = 1.75 = 2 - 1/4
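The same pattern extends to all 23 fraction bits: with every bit set, the mantissa is 2 - 2^(-23). A small sketch that sums the bit weights shows this; the class name `MantissaValueSketch` and the method `significand` are hypothetical.

```java
class MantissaValueSketch {
    // Value of binary 1.b1b2b3..., where fractionBits is a string of
    // '0' and '1' characters for the digits after the point.
    static double significand(String fractionBits) {
        double value = 1.0;   // the implicit leading 1
        double weight = 0.5;  // first fraction bit is worth 1/2
        for (char c : fractionBits.toCharArray()) {
            if (c == '1') {
                value += weight;
            }
            weight /= 2;      // next bit is worth half as much
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(significand("1"));   // 1.5
        System.out.println(significand("11"));  // 1.75
        StringBuilder allOnes = new StringBuilder();
        for (int i = 0; i < 23; i++) {
            allOnes.append('1');
        }
        // Largest float mantissa: equals 2 - 2^-23.
        System.out.println(significand(allOnes.toString()) == 2.0 - 0x1.0p-23); // true
    }
}
```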

By this reasoning, the maximum value should be (2^128) * (2 - 2^(-23)) = 2^129 - 2^105 = 6.81 * 10^38, but books usually give 3.40 * 10^38. So why is the computed value twice as large?

Unless every book is wrong because publishers copied one another, the mistake must be somewhere in the reasoning above. Go back to normalization and consider what would happen if every number were stored in the normalized form described above:

First: every value would be (1.m) multiplied by a power of two, and 1.m is never zero, so how would 0 be represented?

Second: how would the infinities and NaN (not a number) be represented in these 32 bits? One explanation I once considered is that a number simply missing from the computer counts as NaN, and for a while that seemed reasonable. But a computer is only a tool: it cannot make something out of nothing and can only work with what we give it, so the infinities and NaN must each have some representation inside the computer.

Therefore there must be a non-normalized representation: the so-called special cases.

First: when E is all zeros (eight 0 bits), the mantissa is read as (0.m) rather than (1.m). This represents zero and also numbers very close to zero.

Second: when E is all ones (eight 1 bits), a fraction field of all zeros means infinity, and any other fraction means NaN.
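These two rules are easy to express as a classifier over the raw bits. The sketch below is illustrative only: `SpecialValuesSketch` and `classify` are made-up names, and the labels it returns are just strings chosen for this example.

```java
class SpecialValuesSketch {
    // Classifies a 32-bit float pattern by its exponent and fraction fields.
    static String classify(int bits) {
        int exponent = (bits >>> 23) & 0xFF;
        int fraction = bits & 0x7FFFFF;
        if (exponent == 0) {
            // All-zero exponent: (0.m) form, zero or very close to zero.
            return fraction == 0 ? "zero" : "subnormal";
        }
        if (exponent == 0xFF) {
            // All-one exponent: infinity or NaN.
            return fraction == 0 ? "infinity" : "NaN";
        }
        return "normal";
    }

    public static void main(String[] args) {
        System.out.println(classify(Float.floatToIntBits(0.0f)));                    // zero
        System.out.println(classify(Float.floatToIntBits(Float.MIN_VALUE)));         // subnormal
        System.out.println(classify(Float.floatToIntBits(Float.POSITIVE_INFINITY))); // infinity
        System.out.println(classify(Float.floatToIntBits(Float.NaN)));               // NaN
        System.out.println(classify(Float.floatToIntBits(78.375f)));                 // normal
    }
}
```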

So the stored exponents 0 (0 - 127 = -127) and 255 (255 - 127 = 128) are reserved for the two special cases, and the exponent of a normalized number ranges over [-126, 127]. The largest magnitude of a normalized float is therefore (2^127) * (2 - 2^(-23)) = 2^128 - 2^104 = 3.40 * 10^38, positive or negative.
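This result can be checked against the constants that Java exposes. The sketch below computes the bound from the formula and compares it with `Float.MAX_VALUE`; the same reasoning with 52 fraction bits and a maximum exponent of 1023 is assumed for double. The class and method names are invented for this example.

```java
class FloatRangeSketch {
    // Largest normalized float: (2 - 2^-23) * 2^127, computed in double.
    static double maxNormalFloat() {
        return Math.scalb(2.0 - 0x1.0p-23, 127);
    }

    // The naive bound from the earlier reasoning: (2 - 2^-23) * 2^128.
    static double naiveMaxFloat() {
        return Math.scalb(2.0 - 0x1.0p-23, 128);
    }

    // Largest normalized double: (2 - 2^-52) * 2^1023.
    static double maxNormalDouble() {
        return Math.scalb(2.0 - 0x1.0p-52, 1023);
    }

    public static void main(String[] args) {
        System.out.println(maxNormalFloat() == Float.MAX_VALUE);        // true
        System.out.println(naiveMaxFloat() == 2.0 * Float.MAX_VALUE);   // true
        System.out.println(maxNormalDouble() == Double.MAX_VALUE);      // true
        System.out.println(Float.MIN_EXPONENT + " to " + Float.MAX_EXPONENT); // -126 to 127
    }
}
```

The naive bound is exactly twice `Float.MAX_VALUE`, which is the factor of two the article set out to explain.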

This concludes the analysis of the precision and range of Java float and double. I hope it is of some help.
