
What are the numerical problems of the computer


This article mainly explains what the numerical problems of the computer are. The explanation is simple and clear and easy to understand; let's follow the editor's train of thought and study "what are the numerical problems of computers" together.

I. Some concepts

Machine number: the value as encoded inside the computer, that is, the actual stored sequence of 0s and 1s.

True value: the actual value the machine number is meant to represent; it can be understood as the signed number we use in everyday life.

The correspondence between machine numbers and true values is determined by the computer's internal encoding of values. There are four main encodings: sign-magnitude (original code), one's complement (inverse code), two's complement (complement code), and excess code (shift code).

Sign-magnitude (original code)

Sign-magnitude consists of a 1-bit sign and a magnitude part; the magnitude is the absolute value of the true value.

Characteristics

The correspondence with the true value is intuitive.

The representation of 0 is not unique: there are +0 (00...000) and -0 (10...000).

Addition and subtraction are complicated and require examining the sign bits.

One's complement (inverse code)

One's complement also consists of a 1-bit sign and a magnitude part. For a positive number it is the same as sign-magnitude; for a negative number the sign bit stays the same and the magnitude bits are inverted.

One's complement is not much of an improvement over sign-magnitude and is no longer used to represent data.

Two's complement (complement code)

Two's complement also consists of a 1-bit sign and a magnitude part. For a positive number it is the same as sign-magnitude; for a negative number the sign bit stays the same, the magnitude bits are inverted, and 1 is added to the least significant bit.

Quick conversion from sign-magnitude to two's complement (for a negative number): keep the sign bit, keep everything from the rightmost 1 downward (that 1 and any trailing zeros), and invert the remaining bits. Conversion from two's complement back to sign-magnitude (for a negative number) works the same way: keep the sign bit, invert the magnitude bits, and add 1 to the last bit; equivalently, keep the sign bit, keep everything from the rightmost 1 downward, and invert the rest.

Reason: in theory, converting two's complement back to sign-magnitude should be the inverse of the forward operation, that is, subtract 1 and then invert. If the complement is A, the sign-magnitude is ~(A - 1). Since ~(A - 1) = 11...111 - (A - 1) = 11...111 - A + 1 = ~A + 1, inverting first and then adding 1 gives the same result: ~A + 1 = ~(A - 1).

Negating a number in two's complement: invert all the bits, including the sign bit, and add 1 to the last bit; equivalently, keep everything from the rightmost 1 downward and invert the other bits.

For example, with 4-bit two's complement, [x]complement = 1011 and [-x]complement = 0101.

Characteristics

0 has the unique representation 00...000, so two's complement can represent one more value than sign-magnitude or one's complement: its range is -2^(n-1) to 2^(n-1) - 1.

The sign bit can take part in addition and subtraction directly.

Integers in modern computers are essentially always represented in two's complement (see the sketch below).
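As a quick check of these rules, here is a minimal C sketch (the example value -5 is chosen only for illustration) that prints the machine number of a signed integer and of its negation; the bit patterns follow the invert-and-add-1 rule described above.

```c
#include <stdio.h>
#include <stdint.h>

/* Print the low `bits` bits of v, most significant bit first. */
static void print_bits(uint32_t v, int bits) {
    for (int i = bits - 1; i >= 0; i--)
        putchar(((v >> i) & 1) ? '1' : '0');
    putchar('\n');
}

int main(void) {
    int8_t x = -5;                 /* machine number 11111011 (two's complement) */
    print_bits((uint8_t)x, 8);     /* prints 11111011                            */
    print_bits((uint8_t)(-x), 8);  /* prints 00000101: invert all bits and add 1 */
    return 0;
}
```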

Excess code (shift code)

Excess code is mainly used for the exponent field of floating-point numbers. As described later, a floating-point exponent may be positive or negative, and comparing two floating-point numbers requires comparing their exponents. To make that comparison easy, a bias constant is added to the exponent so that it becomes non-negative; since every exponent is biased by the same constant, the ordering of exponents does not change.

So the excess code is simply the value itself plus a bias constant, which is usually 2^(n-1) or 2^(n-1) - 1.

The above are fixed-point representations. As the name implies, the radix point is agreed to sit at a fixed position. Fixed-point numbers are divided into fixed-point fractions and fixed-point integers.

A fixed-point integer has its radix point fixed at the right end of the number and is generally used to represent integers.

A fixed-point fraction has its radix point fixed at the left end of the number and is generally used to represent the mantissa of a floating-point number.

Floating-point representation is similar to scientific notation: the exponent can change and the mantissa changes with it, as if the radix point were floating, hence the name. Floating point is explained in detail later.

II. Numerical comparison

Integers are divided into unsigned and signed integers. Given a number, how it is stored in the computer is a matter of encoding; how the stored bits are interpreted is a matter for the software above. For example, C can interpret a value as either signed or unsigned, while Java has only signed integers.

When comparing values, the types must be determined first. By default the comparison is signed; if an unsigned operand is involved, both operands are compared as unsigned numbers.

This may sound vague and abstract, so consider the following classic examples:

The first: the two numbers are compared as signed numbers: 11...111B (-1) < 00...000B (0).

The second: 0U is unsigned, so the two numbers are compared as unsigned; the machine number of -1 is 11...111, which interpreted as unsigned is 2^32 - 1 and is therefore greater than 0: 11...111B (2^32 - 1) > 00...000B (0).

Note: the binary number before the parentheses is the machine number (normally in two's complement); the value in parentheses is how C interprets that machine number, as signed or unsigned.
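A minimal C sketch of the two comparisons just described (the exact expressions of the original listing are not shown, so these are illustrative):

```c
#include <stdio.h>

int main(void) {
    /* Both operands signed: -1 (11...111B) < 0 (00...000B). */
    printf("%d\n", -1 < 0);    /* prints 1 (true)  */

    /* 0U is unsigned, so -1 is also treated as unsigned:
       its machine number 11...111B is read as 2^32 - 1, which is > 0. */
    printf("%d\n", -1 < 0U);   /* prints 0 (false) */
    return 0;
}
```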

Some background before the second group of examples:

When determining the type of a constant, the sign is first "kicked away"; the type is chosen according to which range the remaining magnitude falls into.

C90 and C99 are two C standards that treat constants differently. The following examples assume C90.

The first: 2147483647 is of type int, so both sides are compared as signed numbers: 01...111B (2^31 - 1) > 10...000B (-2^31).

The second: 2147483647 is of type int, but the other operand carries a U suffix and is unsigned, so the two are compared as unsigned numbers: 01...111B (2^31 - 1) < 10...000B (2^31).

The third: 2147483648 is of type unsigned int but is cast to int, so the two are compared as signed numbers: 01...111B (2^31 - 1) > 10...000B (-2^31).

The fourth: 2147483648 is of type unsigned int, so the two numbers are compared as unsigned: 01...111B (2^31 - 1) < 10...000B (2^31).

Once you fully understand these examples, numerical comparison should pose no problem. To summarize and emphasize once more:

How a piece of data is stored as a 0/1 sequence is a matter of encoding; how that 0/1 sequence is interpreted is a matter for the software above.

That is, in the comparisons above the machine number of 2147483648 is always 10...000B and the machine number of 2147483647 is always 01...111B; the different results arise because C interprets the same machine numbers differently.

With this foundation, let's look at a few programs (sketched below).

In the first program, the printed results are as annotated; the reason is as follows: the machine number of x is 11...111 and the machine number of u is 10...000, so x interpreted as unsigned is 2^32 - 1 = 4294967295 and interpreted as signed is -1, while u interpreted as unsigned is 2^31 and interpreted as signed is -2^31. This also shows that the value whose machine number is 10...000 is the smallest representable integer; negating it overflows and gives the value itself.

The second program is a function that sums the elements of an array. As intended, passing length = 0 should return 0, but that is not what happens: in theory the program loops forever, and in practice it runs past the end of the array and crashes.

The reason for the theoretical infinite loop: len is unsigned, so when 0 is passed in, len - 1 is -1, whose machine number is 11...111; interpreted as unsigned this is 2^32 - 1, the largest 32-bit integer. As noted earlier, when an unsigned number takes part in a comparison both sides are compared as unsigned, so no matter how i changes it is always less than or equal to that maximum value, and when i reaches 2^32 - 1 the next increment wraps back to 0. In practice no array is that long, so an out-of-bounds error occurs instead.

The third program compares which of two strings str1 and str2 is longer, using the library function strlen(). It is meant to return 1 when str1 is longer and 0 otherwise, but in fact it returns 0 only when str1 and str2 have equal length and returns 1 in every other case.

The problem is that strlen() returns size_t, an unsigned integer, so the comparison is performed as unsigned. An unsigned number is always greater than or equal to 0, so the left-hand expression equals 0 only when the two lengths are equal; in every other case the machine number of the left-hand result has some non-zero bit, which interpreted as unsigned is greater than 0, and the function returns 1.

The fix: change the return statement to return strlen(str1) > strlen(str2), which compares the two lengths directly instead of subtracting them and comparing with zero.
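The listings themselves are not shown above, so here is a minimal sketch consistent with the description; the names sum_array, longer_than, and longer_than_fixed are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Buggy: when length == 0, length - 1 wraps to 2^32 - 1 (length is unsigned),
   so i <= length - 1 is always true and the loop runs off the array. */
float sum_array(float a[], unsigned length) {
    float sum = 0.0f;
    for (unsigned i = 0; i <= length - 1; i++)
        sum += a[i];
    return sum;
}

/* Buggy: strlen() returns size_t (unsigned), so the difference is compared
   as an unsigned number and is > 0 unless the lengths are exactly equal. */
int longer_than(const char *str1, const char *str2) {
    return strlen(str1) - strlen(str2) > 0;
}

/* Fix: compare the two lengths directly instead of subtracting them. */
int longer_than_fixed(const char *str1, const char *str2) {
    return strlen(str1) > strlen(str2);
}
```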

This example shows that when calling library functions you should also pay attention to their return types and parameter types, otherwise errors may occur.

Floating point numbers (IEEE 754)

The floating-point format specified by the IEEE 754 standard consists of a sign bit, an exponent field, and a mantissa field, briefly explained as follows:

Sign bit: 1 for negative, 0 for positive.

Exponent: expressed in excess (biased) code as a fixed-point integer, with bias constant 2^(n-1) - 1, so the single-precision bias is 127 and the double-precision bias is 1023.

Mantissa: a fixed-point fraction. In single precision it actually represents 24 significant bits, with a hidden leading 1; the actual mantissa is 1.xxxB.
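To see the three fields concretely, here is a small C sketch (the value -6.25 is an arbitrary illustration; the bits are copied with memcpy, and the field widths assume standard 32-bit IEEE 754 single precision):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = -6.25f;                        /* -1.5625 * 2^2                 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the 32 raw bits   */

    uint32_t sign     = bits >> 31;          /* 1 bit                         */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits, biased by 127         */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits, hidden 1 is implied  */

    printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    /* Expected: sign=1 exponent=129 (unbiased 2) mantissa=0x480000 */
    return 0;
}
```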

Range of 32-bit single precision

Negative numbers differ only in the sign bit, so the extreme values a 32-bit single-precision floating-point number can represent are as follows:

Smallest positive number: 2^-126

Largest positive number: (2 - 2^-23) * 2^127

Largest negative number: -2^-126

Smallest negative number: -(2 - 2^-23) * 2^127

64-bit double precision

The principle of 64-bit double-precision floating-point numbers is the same as above; the extreme values are as follows:

Smallest positive number: 2^-1022

Largest positive number: (2 - 2^-52) * 2^1023

Largest negative number: -2^-1022

Smallest negative number: -(2 - 2^-52) * 2^1023

So the range an IEEE 754 floating-point number can represent can be pictured on a number line: two representable regions on either side of zero, an underflow gap around zero, and overflow regions beyond the largest magnitudes.

Within each interval [2^(n-1), 2^n), the number of representable floating-point values is 2^23 (the mantissa has 23 bits), and they are evenly spaced within the interval. For example, there are 2^23 evenly spaced numbers between 1 and 2 and 2^23 evenly spaced numbers between 2 and 4, so the density of floating-point numbers is not uniform: the closer to 0, the denser they are.

In other words, not every decimal number can be represented exactly; when you enter a number that cannot be represented, the machine converts it to the nearest representable number. This is why you should not compare floating-point numbers directly in programs, especially for equality: the number you want to compare may not be representable, and the machine has already converted it for you (see the sketch below).
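A classic C sketch of this pitfall: 0.1, 0.2, and 0.3 are not exactly representable in binary, so comparing with a small tolerance is safer than ==. The tolerance 1e-9 here is only an illustrative, application-dependent choice.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double a = 0.1 + 0.2;
    printf("%d\n", a == 0.3);              /* prints 0: not exactly equal      */
    printf("%.17f\n", a);                  /* 0.30000000000000004...           */
    printf("%d\n", fabs(a - 0.3) < 1e-9);  /* prints 1: compare with tolerance */
    return 0;
}
```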

Special representations:

Exponent all 0, mantissa all 0: +0 or -0, depending on the sign bit.

Exponent all 1, mantissa all 0: +∞ or -∞, depending on the sign bit.

Exponent all 1, mantissa not 0: NaN (Not a Number); operations such as 0 / 0 and 0 * ∞ produce NaN.

Exponent all 0, mantissa not 0: denormalized (subnormal) numbers.

From the number line described above, the numbers between -2^-126 and 2^-126 (other than 0) cannot be represented; denormalized numbers were introduced to solve this. The hidden bit of a denormalized number's mantissa is 0, and although its exponent field is all 0, the exponent is still taken to be -126. This adds 2 * 2^23 numbers between -2^-126 and 2^-126 and alleviates the underflow problem.

Again, with the basics above, let's look at some examples.

These questions are all very simple; just pay attention to a few points:

Converting from higher precision to lower precision may cause problems, while converting from lower precision to higher precision does not.

Negating a floating-point number simply flips the sign bit.

The last one concerns floating-point addition and subtraction, which is explained in detail later. It can be understood with a simple example (sketched below): when d is very large and f is very small, d + f equals d and f is swallowed, so the left side (d + f) - d equals 0 while the right side equals f, and the two sides are not equal.
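A minimal sketch of this "swallowing" effect; the values 1.0e18 and 1.0 are chosen only for illustration (double precision carries roughly 16 decimal digits, so adding 1 to 1e18 changes nothing).

```c
#include <stdio.h>

int main(void) {
    double d = 1.0e18;   /* very large                         */
    double f = 1.0;      /* very small relative to d           */
    printf("%g\n", (d + f) - d);   /* prints 0: f was swallowed */
    printf("%g\n", f);             /* prints 1                  */
    return 0;
}
```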

Numerical operations

Bitwise and logical operations

These two operations are relatively simple, just to distinguish between concepts.

A bitwise operation, as the name suggests, applies AND, OR, NOT, and similar operations to the individual bits of a value.

The operands of logical operations are only true and false; a numeric value is treated as true if it is non-zero.
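A tiny sketch distinguishing the two kinds of operators on the same operands (the values 0x6 and 0x5 are arbitrary):

```c
#include <stdio.h>

int main(void) {
    int a = 0x6, b = 0x5;
    printf("%d\n", a & b);    /* bitwise AND: 0110 & 0101 = 0100, prints 4       */
    printf("%d\n", a && b);   /* logical AND: both operands non-zero, prints 1   */
    printf("%d\n", !a);       /* logical NOT: a is non-zero (true), prints 0     */
    printf("%d\n", ~a);       /* bitwise NOT: flips every bit, prints -7         */
    return 0;
}
```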

Left shift and right shift

Shifts are divided into logical shifts and arithmetic shifts:

Logical shift: the sign bit is ignored. On a left shift the high bits are shifted out and zeros fill the low bits; on a right shift the low bits are shifted out and zeros fill the high bits.

Arithmetic shift: the sign bit is taken into account. On a left shift the high bits are shifted out and zeros fill the low bits; on a right shift the low bits are shifted out and copies of the sign bit fill the high bits.

The shift performed by the C compiler differs from the shift performed by the CPU:

Compiler: performs the actual shift; shifting by w bits really shifts by w bits.

CPU: shifts by w % k, where w is the shift amount and k is the bit width of the data type.

The following program (sketched below) helps illustrate this; the printed results are annotated in the comments.
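The original listing is not reproduced here, so this is a minimal sketch under common assumptions: a 32-bit int, and a compiler that performs arithmetic right shifts on signed values (the C standard leaves right-shifting a negative value implementation-defined, and shifting by the full type width is undefined, so the CPU's w % k behaviour is only observable in some environments).

```c
#include <stdio.h>

int main(void) {
    unsigned int u = 0xFF000000u;
    int          s = (int)0xFF000000u;        /* negative: sign bit is 1        */

    printf("%08X\n", u >> 4);                 /* logical right shift: 0FF00000  */
    printf("%08X\n", (unsigned)(s >> 4));     /* arithmetic right shift:
                                                 FFF00000, high bits get sign   */
    printf("%08X\n", u << 4);                 /* left shift, zeros fill the low
                                                 bits: F0000000                 */
    return 0;
}
```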

Bit extension and bit truncation

Extension is divided into zero extension and sign extension. Bit extension is needed when converting data from a shorter type to a longer one.

Zero extension is used for unsigned numbers: enough zeros are added in front of the original number.

Sign extension is used for signed numbers: enough copies of the sign bit are added in front of the original number.

Bit truncation occurs when a longer type is converted to a shorter one. The rule is crude and simple: the high bits are "cut off" and the low bits are kept.

The representable range of the longer type is certainly larger than that of the shorter one, so truncating a number may change its value. Look at the following example:

After truncation to a short and re-extension, 32768 becomes -32768, so pay attention to overflow when truncating (see the sketch below).
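A minimal sketch of this truncation example, assuming a 16-bit short and a 32-bit int (the narrowing conversion is implementation-defined in C, but this is the typical result):

```c
#include <stdio.h>

int main(void) {
    int   i = 32768;        /* 0x00008000                              */
    short s = (short)i;     /* truncated to the low 16 bits: 0x8000    */
    int   j = s;            /* sign-extended back to int               */
    printf("%d %d %d\n", i, s, j);   /* prints: 32768 -32768 -32768    */
    return 0;
}
```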

The actual processes of addition, subtraction, multiplication, and division of integers and floating-point numbers in the computer are very complex and involve a great deal of material. It is recommended to read chapter 6 of Tang Shuofeng's Principles of Computer Organization, the circuit implementations of adders and multipliers in digital-logic textbooks, and the theoretical derivations of the numerical algorithms in Computer Systems: A Programmer's Perspective. You can get the corresponding books by replying "e-books" to the official account. Here we only discuss the points I consider most important.

Addition and subtraction operation

In modern computers integers are represented in two's complement, which unifies addition and subtraction, and the sign bit can be computed together with the magnitude bits.

Whether the numbers are unsigned or signed, the operation is first carried out on the machine numbers, and the resulting machine number is then interpreted as the corresponding unsigned or signed value.

The whole computer arithmetic system uses modular arithmetic: if a result exceeds the machine's bit width, the high bits are simply discarded.

For example, if the two numbers are a and b with machine numbers A and B, then a + b corresponds to A + B, and a - b corresponds to A + ~B + 1. The machine number obtained from the calculation is then interpreted to give the final result (see the sketch below).
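A small sketch checking that subtraction is just addition of the two's complement, i.e. a - b and A + ~B + 1 produce the same machine number; unsigned arithmetic is used so the wrap-around is well defined, and the example values 7 and 12 are arbitrary.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t A = 7, B = 12;                /* machine numbers of a and b        */
    uint32_t diff1 = A - B;                /* a - b                             */
    uint32_t diff2 = A + ~B + 1u;          /* A + ~B + 1                        */
    printf("%08X %08X\n", diff1, diff2);   /* both print FFFFFFFB               */
    printf("%d\n", (int32_t)diff1);        /* interpreted as signed: -5
                                              (typical implementation behaviour) */
    return 0;
}
```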

Multiplication and division operation

Multiplication mainly needs to consider overflow: multiplying an n-bit number by an n-bit number may require 2n bits for the result. According to Computer Systems: A Programmer's Perspective, C truncates the result directly to n bits, but experiments seem to show automatic widening.

The reason is that, as long as the result fits in the machine's range, the operands are promoted: for example, multiplying two short values promotes them to 32-bit int, so the result is represented correctly and is not truncated unless it is assigned back to a short.

Integer division generally has no overflow problem, because the absolute value of the quotient is never greater than the absolute value of the dividend.

There is one exception: the smallest signed number divided by -1. For int, -2147483648 / -1 overflows, and the result is still -2147483648 itself.

Here is an interesting C quirk that can be regarded as a bug. The smallest signed number mentioned above, the one with machine number 10...000 (call it s), is still negative after negation, and its machine number is indeed unchanged, so in theory s == -s. This holds for 32-bit int and 64-bit long long, but not for short, because a short is promoted to int before negation. There are many strange numerical corner cases in C that you can try if you are interested (see the sketch below).
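A minimal sketch of the quirk. Note that negating INT_MIN is formally undefined behaviour in C; on typical two's-complement machines it wraps back to INT_MIN, which is what the text describes, while the short case differs because the integer promotions widen a short to int before negation, so -SHRT_MIN fits.

```c
#include <stdio.h>
#include <limits.h>

int main(void) {
    int   s   = INT_MIN;          /* machine number 10...000                    */
    short s16 = SHRT_MIN;

    printf("%d\n", s == -s);      /* typically 1: negating INT_MIN wraps back
                                     to INT_MIN itself                          */
    printf("%d\n", s16 == -s16);  /* 0: s16 is promoted to int, so -s16 is
                                     +32768 and the two differ                  */
    return 0;
}
```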

Constant multiplication and division

Multiplication and division take far longer than shifts, additions, and subtractions, so when a variable is multiplied or divided by a constant the compiler replaces the operation with a combination of shifts, additions, and subtractions.

A concrete example: x is an integer variable and we want to compute 55 * x; give an expression that minimizes the clock cycles used.

The problem is simple; the point is how the transformation works. Since 55 = 64 - 8 - 1, 55 * x can be written as (x << 6) - (x << 3) - x.

Multiplying by 2^n is equivalent to shifting left by n bits, and dividing by 2^n is equivalent to shifting right by n bits. When shifting left you must watch for overflow of the high bits; when shifting right you must watch the rounding. C's rule for integer division is to round toward 0, but division by shifting rounds down. For positive numbers there is no problem, since rounding down is the same as rounding toward 0; for negative numbers rounding down is not rounding toward 0, so a correction is needed.

Why does division by shifting round down? For positive numbers this is easy to see: after the right shift the fractional part is discarded, so the value becomes smaller. For a negative number, shouldn't discarding the fractional part make the value larger? But remember that negative numbers are stored in two's complement, whose relationship to the true value is "reversed"; converting back to the true value, you will find the result is still smaller.

An example of rounding down should make this clearer.

How to correct it? Since a negative number would also be rounded down, add an offset to it before shifting to make it larger; then the result after the shift is rounded toward 0 as desired.

If you shift right by n bits, the offset is 2^n - 1: originally the computation is x / 2^n, and now it is (x + 2^n - 1) / 2^n = x / 2^n + (2^n - 1) / 2^n; this is equivalent to adding a "limiting fraction" to correct the result.

This number is easier to understand in decimal. If you drop one decimal place, then whether the value is -2.9 or -2.1 you want the result -2 (rounding toward 0). How? Add 0.9 to each number before truncating: -2.9 + 0.9 = -2.0 and -2.1 + 0.9 = -1.2, both of which round down to -2. Here 0.9 is the "limiting fraction" for one decimal place; the binary counterpart for a right shift of n bits is to add the integer 2^n - 1 (see the sketch below).
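A minimal sketch of the bias correction for dividing a negative number by 2^n with shifts, assuming the compiler performs arithmetic right shifts on signed values (typical); the value -33 and the shift of 3 bits are arbitrary illustrations.

```c
#include <stdio.h>

int main(void) {
    int x = -33;
    int n = 3;                               /* divide by 2^3 = 8               */

    printf("%d\n", x / 8);                   /* C division rounds toward 0: -4  */
    printf("%d\n", x >> n);                  /* plain shift rounds down:    -5  */
    printf("%d\n", (x + (1 << n) - 1) >> n); /* add bias 2^n - 1 first:     -4  */
    return 0;
}
```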

Floating-point operations

Addition and subtraction

1. Exponent alignment

Only when the exponents are equal can the mantissas be added or subtracted directly. The alignment rule: the smaller exponent is aligned to the larger one, and the mantissa of the number with the smaller exponent is shifted right by the exponent difference.

Note: the implied leading 1 also takes part in the right shift, and to preserve accuracy the bits shifted out at the low end should not be discarded; they participate in the later mantissa addition or subtraction.

2. Mantissa addition and subtraction

The mantissa is represented as a fixed-point fraction with no sign bit, so the addition or subtraction is ordinary binary addition or subtraction. Note that the implied bit and the extra bits shifted out during alignment are also involved in the operation.

3. Normalize

After the mantissa addition or subtraction, the result can take all sorts of forms and must be normalized into the standard IEEE 754 form. Simply put, the mantissa is adjusted to the form 1.xxxB and the exponent is adjusted accordingly: the position of the radix point and the exponent are tightly coupled, so when the point floats the exponent must change with it.

4. Result rounding

During alignment and normalization the mantissa may be shifted right; to preserve accuracy the shifted-out bits are kept for the intermediate computation, and the final result is rounded. IEEE 754 requires keeping at least 2 extra bits: the bit immediately to the right of the mantissa is called the guard bit, and the bit to its right is the round bit. To improve accuracy further there is a sticky bit to the right of the round bit: it is set to 1 if any of the bits beyond the round bit is non-zero, and 0 otherwise.

5. Exponent overflow check

If the exponent of the result is all 1s, overflow has occurred; this raises an exception or the result is set to ∞. If the exponent of the result is all 0s, underflow has occurred; this raises an exception or the result is set to 0.

This explains why the earlier (d + f) - d is not necessarily equal to f: if d is very large and f is very small, f's mantissa keeps shifting right during alignment until its significant bits are all gone, so the mantissa addition leaves d unchanged. The left side is then 0 while the right side is f, and the two sides are not equal.

Multiplication and division

Floating-point multiplication and division proceed much like addition and subtraction; the main difference is that no exponent alignment is needed, and the other steps are basically the same. It is easy to understand by analogy with ordinary scientific notation: addition and subtraction require equal exponents before the mantissas can be combined, whereas multiplication and division operate on mantissas and exponents directly, the same idea.

Thank you for reading. That is the content of "what are the numerical problems of computers"; after studying this article, I believe you have a deeper understanding of the numerical problems of computers, and the details still need to be verified in practice. The editor will push more articles on related knowledge points for you; welcome to follow!
