The language model can finally multiply and divide!
Although large language models show superior performance on a wide range of natural language processing tasks, arithmetic remains a major weakness: even the most powerful model, GPT-4, struggles with basic operations.
Recently, researchers from the National University of Singapore proposed Goat, a model fine-tuned from LLaMA for arithmetic, which achieves significantly better arithmetic ability than GPT-4.
Paper link: https://arxiv.org/pdf/2305.14201.pdf
By fine-tuning on a synthetic arithmetic dataset, Goat achieves state-of-the-art performance on the BIG-bench arithmetic subtasks.
With supervised fine-tuning alone, Goat achieves near-perfect accuracy on large-number addition and subtraction, surpassing all previous pre-trained language models such as BLOOM, OPT, and GPT-NeoX; zero-shot Goat-7B even exceeds the accuracy of few-shot PaLM-540B.
The researchers attribute Goat's excellent performance to LLaMA's consistent tokenization of numbers.
To tackle more challenging tasks such as large-number multiplication and division, the researchers also propose classifying tasks by their learnability, and then using basic arithmetic principles to decompose unlearnable tasks (such as multi-digit multiplication and division) into a series of learnable subtasks.
Comprehensive experiments verify that the proposed decomposition steps effectively improve arithmetic performance.
Moreover, Goat-7B can be trained efficiently with LoRA on a GPU with 24 GB of VRAM, so other researchers can easily reproduce the experiments; the model, the dataset, and the Python script that generates the dataset will be open-sourced.
The language model. LLaMA is a family of open-source pre-trained language models trained on trillions of tokens from publicly available datasets, achieving state-of-the-art performance on multiple benchmarks.
Previous studies have shown that tokenization is critical to the arithmetic ability of LLMs, but commonly used tokenization techniques do not represent numbers well; for example, a number with many digits may be split inconsistently.
LLaMA splits numbers into individual digit tokens, ensuring a consistent representation of numbers, and the researchers attribute the extraordinary arithmetic ability shown in the experiments to this consistent tokenization of digits.
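To see this digit-level tokenization in practice, here is a minimal sketch using the Hugging Face transformers library; the checkpoint name is an assumption (the article does not name one), and the exact token strings may vary between tokenizer versions.

```python
# Minimal sketch: inspect how a LLaMA tokenizer splits a number into digits.
# The checkpoint name "huggyllama/llama-7b" is an assumed community mirror,
# not a reference from the article.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

for text in ["74815", "74815 + 9265"]:
    print(text, "->", tokenizer.tokenize(text))
# Expected pattern: each digit becomes its own token, so a number of any
# length is represented consistently, digit by digit.
```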
In the experiments, other fine-tuned language models, such as BLOOM, OPT, GPT-NeoX, and Pythia, could not match LLaMA's arithmetic ability.
Learnability of arithmetic tasks. Previous work has theoretically analyzed the use of intermediate supervision to solve composite tasks, showing that such tasks are not learnable directly but can be decomposed into a polynomial number of simple subtasks.
In other words, composite problems that cannot be learned directly can be learned with intermediate supervision or a step-by-step chain of thought (CoT).
Based on this analysis, the researchers first classify tasks as learnable or unlearnable.
In the context of arithmetic, learnable tasks are those on which a model can be trained to generate the answer directly, reaching sufficiently high accuracy within a predefined number of training epochs.
Unlearnable tasks are those that are difficult for models to learn and generate direct answers correctly, even after extensive training.
Although the exact reason tasks differ in learnability is not entirely clear, it plausibly relates to the complexity of the underlying pattern and the amount of working memory needed to complete the task.
The researchers experimentally checked the learnability of each task by fine-tuning the model for each task in a simplified synthetic environment.
The resulting classification of learnable and unlearnable tasks also matches human intuition: with practice, humans can add or subtract two large numbers mentally, writing the final answer directly from left (the most significant digit) to right (the least significant digit) without working it out on paper.
Mentally solving large-number multiplication and division, however, is much more challenging.
The classification is also consistent with GPT-4's performance: GPT-4 is good at generating direct answers for large-number addition and subtraction, but its accuracy drops significantly on multiplication and division tasks.
That even a model as powerful as GPT-4 cannot directly solve unlearnable tasks suggests that generating direct answers for these tasks may be challenging even with extensive training.
It is worth noting that tasks that are learnable for LLaMA are not necessarily learnable for other LLMs.
In addition, not all tasks that are classified as unlearnable are completely impossible for the model to learn.
For example, 2-digit by 2-digit multiplication is classified as unlearnable, but if the training set enumerates all possible 2-digit multiplications, the model can still generate direct answers by fitting the training set.
However, it takes nearly 10 epochs to reach about 90% accuracy.
By inserting a CoT before the final answer, the model reaches quite good accuracy on 2-digit multiplication after a single epoch of training, consistent with the earlier finding that intermediate supervision aids learning.
Addition and subtraction. These two operations are learnable: with supervised fine-tuning alone, the model shows an extraordinary ability to generate direct numerical answers accurately.
Although the model is trained on only a very limited subset of addition data, its near-perfect accuracy on an unseen test set shows that it successfully captures the underlying pattern of the operation without needing CoT.
Multiplication. Experiments verify that n-digit by 1-digit multiplication is learnable, while multi-digit multiplication is not.
To overcome this, the researchers fine-tune the LLM to generate a CoT before producing the answer, breaking multi-digit multiplication down into five learnable subtasks (a code sketch of this decomposition follows the list):
1. Extraction: extract the arithmetic expression from the natural language instruction
2. Split: decompose the smaller of the two numbers into its place values
3. Expansion: expand the sum according to the distributive property
4. Product: compute each partial product
5. Adding term by term: add the first two terms, copy the remaining terms, and obtain the final sum
Each of these tasks is learnable.
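As referenced above, the following Python sketch illustrates the five-step decomposition; it is an illustration of the idea, not the authors' released data-generation code, and the exact CoT text format in the paper may differ.

```python
def multiplication_cot(a: int, b: int) -> str:
    """Build a chain-of-thought string for a * b following the five steps
    above (split, expansion, product, adding term by term).
    Illustrative sketch only."""
    small, big = (a, b) if a <= b else (b, a)

    # Step 2 (split): decompose the smaller number into place values,
    # e.g. 397 -> [300, 90, 7]
    digits = str(small)
    parts = [int(d) * 10 ** (len(digits) - i - 1)
             for i, d in enumerate(digits) if d != "0"]

    # Step 3 (expansion): rewrite the product with the distributive law
    expansion = " + ".join(f"{big} * {p}" for p in parts)

    # Step 4 (product): compute every partial product
    products = [big * p for p in parts]

    lines = [
        f"{a} * {b} = {big} * ({' + '.join(str(p) for p in parts)})",
        f"= {expansion}",
        f"= {' + '.join(str(p) for p in products)}",
    ]

    # Step 5 (adding term by term): add the first two terms, copy the rest,
    # and repeat until a single number remains
    while len(products) > 1:
        products = [products[0] + products[1]] + products[2:]
        lines.append("= " + " + ".join(str(p) for p in products))
    return "\n".join(lines)

print(multiplication_cot(397, 4429))
```

Running the sketch on 397 * 4429, for example, produces a chain that ends in the correct product 1758313.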
Division. Similarly, experiments show that n-digit by 1-digit division is learnable, while multi-digit division is not.
Using a recurrence based on improved slow division, the researchers designed a new chain-of-thought prompt for division.
The main idea is to subtract multiples of the divisor from the dividend until the remainder is less than the divisor.
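A small sketch of this slow-division chain of thought is given below; it illustrates the idea described in the article rather than the authors' released script, and the printed format is hypothetical.

```python
def division_cot(dividend: int, divisor: int) -> str:
    """Illustrative chain of thought for dividend / divisor: repeatedly
    subtract the largest convenient multiple of the divisor (a digit times
    a power of ten) until the remainder is smaller than the divisor."""
    remainder, quotient, lines = dividend, 0, []
    while remainder >= divisor:
        # Largest power of ten such that divisor * 10**k still fits
        k = len(str(remainder // divisor)) - 1
        step = (remainder // (divisor * 10 ** k)) * 10 ** k  # e.g. 100, 30, 9
        lines.append(f"{remainder} - {divisor} * {step} = "
                     f"{remainder} - {divisor * step} = {remainder - divisor * step}")
        remainder -= divisor * step
        quotient += step
    lines.append(f"quotient = {quotient}, remainder = {remainder}")
    return "\n".join(lines)

print(division_cot(8914, 64))
```

For 8914 / 64, the sketch subtracts 64 * 100, 64 * 30, and 64 * 9 in turn, ending with quotient 139 and remainder 18.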
Dataset. The experiments cover addition and subtraction of two positive integers with up to 16 digits each, where the result of a subtraction may be negative.
To limit the maximum sequence length, products are restricted to positive integers with fewer than 12 digits; for division of two positive integers, the dividend has fewer than 12 digits and the quotient fewer than 6 digits.
The researchers used a Python script to synthesize a dataset of about 1 million question-answer pairs, each containing the proposed CoT and the final numerical output. All instances are generated randomly, so the probability of duplicates is very low, although small numbers may be sampled multiple times.
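As a rough illustration of such a generation script, the snippet below randomly samples addition pairs and writes them as JSON lines; the field names and file name are assumptions, and the released script additionally attaches CoT strings and covers the other operations.

```python
import json
import random

def sample_addition_pair(max_digits: int = 16) -> dict:
    """Sample one addition question-answer pair (hypothetical format)."""
    a = random.randint(0, 10 ** random.randint(1, max_digits) - 1)
    b = random.randint(0, 10 ** random.randint(1, max_digits) - 1)
    return {"instruction": f"{a} + {b}", "output": str(a + b)}

# The full dataset described in the article contains roughly 1 million pairs;
# only a small sample is written here.
with open("addition_sample.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(sample_addition_pair()) + "\n")
```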
Fine-tuning. To enable the model to solve arithmetic problems stated as instructions, and to support natural-language question answering, the researchers used ChatGPT to generate hundreds of instruction templates.
During instruction tuning, a template is randomly selected for each arithmetic input in the training set, and LLaMA-7B is fine-tuned in a manner similar to Alpaca.
Goat-7B can be fine-tuned with LoRA on a GPU with 24 GB of VRAM; fine-tuning on 100,000 samples takes about 1.5 hours on an A100 GPU and reaches near-perfect accuracy.
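A minimal LoRA setup along these lines, using the Hugging Face transformers and peft libraries, might look like the sketch below; the checkpoint name, target modules, and hyperparameters are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of attaching LoRA adapters to LLaMA-7B for fine-tuning.
# Checkpoint name and hyperparameters are assumptions, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "huggyllama/llama-7b"  # assumed community checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)

lora_cfg = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trained,
                                    # which is what keeps the memory footprint low
```

Training would then proceed with a standard causal-language-modeling loop (for example, the transformers Trainer) over the instruction-formatted arithmetic data.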
Comparing Goat and GPT-4 on large-number multiplication and division may seem unfair, since GPT-4 generates answers directly while Goat relies on the designed chain of thought, so "Solve it step by step" is appended to every prompt in the GPT-4 evaluation.
However, in some cases GPT-4's intermediate steps for long multiplication and division are wrong while its final answer is still correct, which suggests that GPT-4 does not actually use the chain of thought's intermediate supervision to improve its final output.
Finally, three common errors were identified in GPT-4's solutions:
1. Misalignment of corresponding digits
2. Repeated digits
3. Errors in the intermediate results of n-digit by 1-digit multiplication
The experimental results show that GPT-4 performs quite well on 8D+8D and 16D+16D tasks, yet gets most 16D+8D tasks wrong, even though 16D+8D should intuitively be easier than 16D+16D.
Although the exact cause is unclear, one possible factor is GPT-4's inconsistent tokenization of numbers, which makes it difficult to align the digits of the two numbers.
Reference:
https://huggingface.co/papers/2305.14201
This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era)