How did ChatGPT suddenly become so strong? A PhD student's ten-thousand-word deep dive into the origins of GPT-3.5's abilities


How did ChatGPT evolve from GPT-3?

Recently, ChatGPT, released by OpenAI, has given the field of artificial intelligence a shot in the arm; its powerful abilities far exceed the expectations of natural language processing researchers.

Users who have tried ChatGPT naturally ask: how did the original GPT-3 evolve into ChatGPT? Where do GPT-3.5's astonishing language abilities come from?

Recently, researchers from the Allen Institute for Artificial Intelligence wrote an article attempting to dissect ChatGPT's emergent abilities, trace the origins of these abilities, and provide a comprehensive technical roadmap showing how the GPT-3.5 model family and related large language models gradually evolved into their current powerful form.

Original link: https://yaofu.notion.site/GPT-3-5-360081d91ec245f29029d37b54573756 The author, Fu Yao, is a doctoral student at the University of Edinburgh (since 2020), with a master's degree from Columbia University and a bachelor's degree from Peking University. He is currently a research intern at the Allen Institute for Artificial Intelligence. His main research direction is large-scale probabilistic generative models of human language.

Co-author Peng Hao holds a bachelor's degree from Peking University and a PhD from the University of Washington. He is currently a Young Investigator at the Allen Institute for Artificial Intelligence and will join the Department of Computer Science at the University of Illinois Urbana-Champaign as an assistant professor in August 2023. His main research interests include making language AI more efficient and easier to understand, and building large-scale language models.

Co-author Dr. Tushar Khot received his PhD from the University of Wisconsin-Madison and is currently a research scientist at the Allen Institute for Artificial Intelligence. His main research direction is structured machine reasoning.

1. The 2020 version of the original GPT-3 and large-scale pre-training. The original GPT-3 demonstrated three important capabilities:

Language generation: follow a prompt and generate sentences that complete it. This is still the most common way humans interact with language models today.

In-context learning: follow a few examples of a given task and then generate the solution for a new test case (a minimal prompt sketch follows this list). It is worth noting that although GPT-3 is a language model, its paper barely mentions "language modeling": the authors devoted all their writing effort to the vision of in-context learning, which is the real focus of GPT-3.

World knowledge: including factual knowledge and commonsense.
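
To make in-context learning concrete, here is a minimal illustrative sketch (the task and translation pairs are arbitrary examples, not taken from the GPT-3 paper): a few demonstrations are packed into the prompt, and the model is expected to continue the pattern for the new case without any gradient update.

```python
# A minimal few-shot prompt for in-context learning. Sending this string to a
# GPT-3-style model would be expected to yield "merci"; no fine-tuning is involved.
few_shot_prompt = """Translate English to French.

English: cheese
French: fromage

English: good morning
French: bonjour

English: thank you
French:"""

print(few_shot_prompt)
```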

So where do these abilities come from?

Basically, the above three abilities come from large-scale pre-training: a model with 175 billion parameters was pre-trained on a corpus of 300 billion words (60% of the training corpus comes from C4 collected in 2016-2019, 22% from WebText2, 16% from Books, and 3% from Wikipedia). Specifically:

The language generation ability comes from the language modeling training objective (a minimal sketch of this objective follows this list).

World knowledge comes from a training corpus of 300 billion words (where else could it be).

The model's 175 billion parameters are used to store knowledge, a point further supported by Liang et al. (2022), who conclude that performance on knowledge-intensive tasks is closely related to model size.

It is still difficult to trace the source of in-context learning, or why in-context learning generalizes. Intuitively, this ability may come from data points of the same task being arranged sequentially in the same batch during training. However, little work studies why language model pre-training gives rise to in-context learning, or why in-context learning behaves so differently from fine-tuning.
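
As referenced above for language generation, here is a minimal sketch of the language modeling objective, written in PyTorch with a toy stand-in for the Transformer (this is the generic next-token prediction loss, not OpenAI's training code):

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a Transformer language model: embedding plus output head only.
vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)
head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))   # one sequence of 16 token ids
hidden = embed(tokens)                           # a real LM applies Transformer blocks here
logits = head(hidden)                            # shape (1, 16, vocab_size)

# Language modeling objective: the token at position t predicts the token at t + 1.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),   # predictions for positions 0..14
    tokens[:, 1:].reshape(-1),                   # targets are the tokens at positions 1..15
)
print(loss.item())
```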

A natural question is how strong the original GPT-3 actually was.

In fact, it is difficult to determine whether the original GPT-3 (called davinci in the OpenAI API) is "strong" or "weak".

On the one hand, it responds reasonably to some specific queries and achieves good performance in many datasets.

On the other hand, it does not perform as well as a small model like T5 on many tasks (see its original paper).

By today's (December 2022) ChatGPT standards, it is hard to call the original GPT-3 "intelligent". Meta's open-source OPT model attempts to replicate the original GPT-3, but its abilities contrast sharply with today's standards. Many people who have tested OPT also feel that, compared with today's text-davinci-002, the model is indeed "not very good".

Still, OPT may be a good enough open-source approximation of the original GPT-3 (according to the OPT paper and Stanford University's HELM evaluation).

Although the original GPT-3 may seem weak on the surface, later experiments show that it has very strong potential. This potential was later unlocked by training on code, instruction tuning, and reinforcement learning with human feedback (RLHF), and the final model exhibits very strong emergent abilities.

2. From the 2020 version of GPT-3 to the 2022 version of ChatGPT. Starting from the original GPT-3, to show how OpenAI arrived at ChatGPT, let's look at the GPT-3.5 evolution tree:

In July 2020, OpenAI released the paper for the original GPT-3, whose model index is davinci, and the model has kept evolving ever since.

In July 2021, the Codex paper was released; the initial Codex was fine-tuned from a (possibly internal) 12-billion-parameter GPT-3 variant. This 12-billion-parameter model later evolved into code-cushman-001 in the OpenAI API.

In March 2022, OpenAI released the instruction tuning paper; its supervised instruction tuning part corresponds to davinci-instruct-beta and text-davinci-001.

Between April and July 2022, OpenAI beta-tested the code-davinci-002 model, also calling it Codex. text-davinci-002, text-davinci-003, and ChatGPT were then all obtained from code-davinci-002 by instruction tuning. See OpenAI's Model Index documentation for details.

Although Codex sounds like a code-only model, code-davinci-002 is probably the most capable GPT-3.5 variant for natural language (better than text-davinci-002 and -003). code-davinci-002 was most likely trained on both text and code and then tuned on instructions (explained below).

text-davinci-002, released in May-June 2022, is a supervised instruction-tuned model based on code-davinci-002. The instruction tuning of text-davinci-002 probably reduced the model's in-context learning ability but enhanced its zero-shot ability (explained below).

Then come text-davinci-003 and ChatGPT, both released in November 2022. They are two different variants of instruction tuning with reinforcement learning from human feedback (RLHF).

text-davinci-003 recovers some of the in-context learning ability lost in text-davinci-002 (though it is still worse than code-davinci-002), presumably because it mixes a language modeling objective into fine-tuning, and it further improves zero-shot ability (thanks to RLHF). ChatGPT, on the other hand, seems to have sacrificed almost all of its in-context learning ability in exchange for the ability to model dialogue history.

Overall, during 2020-2021, before code-davinci-002, OpenAI invested a great deal of effort in enhancing GPT-3 through training on code and instruction tuning. By the time code-davinci-002 was completed, all of the capabilities were already in place. The subsequent instruction tuning, whether the supervised version or the reinforcement learning version, most likely does the following (more on this later):

Instruction tuning does not inject new capabilities into the model: all of the capabilities are already there. What instruction tuning does is unlock / activate them. This is mainly because the amount of instruction tuning data is several orders of magnitude smaller than the pre-training data (the base abilities are injected by pre-training).

Instruction tuning splits GPT-3.5 into different skill trees: some models are better at in-context learning, such as text-davinci-003, and some are better at dialogue, such as ChatGPT.

Instruction tuning trades performance for alignment with humans. The OpenAI authors call this the "alignment tax" in their instruction tuning paper.

Many papers report that code-davinci-002 achieves the best performance on benchmarks (but the model does not necessarily match human expectations). After instruction tuning on code-davinci-002, the model can generate responses that are more in line with human expectations (i.e., the model is aligned with humans), for example: zero-shot question answering, generating safe and fair dialogue responses, and rejecting questions beyond the scope of the model's knowledge.
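
To make "supervised instruction tuning" slightly more concrete, here is a minimal sketch in the same toy PyTorch style (a generic recipe for illustration, not OpenAI's pipeline): each training example is an (instruction, expected output) pair, and the next-token loss is computed only on the output tokens.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)
head = torch.nn.Linear(d_model, vocab_size)

# One (instruction, expected output) pair, already tokenized (toy ids).
prompt_ids = torch.tensor([5, 17, 42, 8])     # e.g. "Summarize the following text: ..."
response_ids = torch.tensor([23, 91, 7, 2])   # the human-written reference answer
tokens = torch.cat([prompt_ids, response_ids]).unsqueeze(0)

# Mask the instruction part so the loss is taken only on the response tokens.
labels = tokens.clone()
labels[:, : prompt_ids.numel()] = -100

logits = head(embed(tokens))                  # a real setup runs a pretrained Transformer here
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,                        # same next-token objective, restricted to the response
)
print(loss.item())
```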

3. code-davinci-002 and text-davinci-002: trained on code, tuned on instructions. Before code-davinci-002 and text-davinci-002, there were two intermediate models, davinci-instruct-beta and text-davinci-001. Both are worse than the two -002 models in many respects (for example, text-davinci-001's chain-of-thought reasoning is not strong).

So we focus on the -002 models in this section.

3.1 The sources of complex reasoning ability and the ability to generalize to new tasks. We focus on code-davinci-002 and text-davinci-002, the first versions of the GPT-3.5 model family, one for code and the other for text. They exhibit the following important capabilities that differ from the original GPT-3:

Responding to human instructions: previously, GPT-3's output was dominated by common continuations seen in the training set. The new models generate more reasonable answers to instructions / prompts (rather than related but useless sentences).

Generalizing to unseen tasks: when the number of instructions used to tune the model exceeds a certain scale, the model can automatically generate valid answers to new instructions it has never seen. This capability is crucial for online deployment, because users will always ask new questions, and the model must be able to answer them.

Code generation and code understanding: this ability comes naturally, because the model is trained on code.

Complex reasoning with chain-of-thought: the original GPT-3's chain-of-thought reasoning ability was very weak or even nonexistent, whereas code-davinci-002 and text-davinci-002 are two models with sufficiently strong chain-of-thought reasoning ability.

Chain-of-thought reasoning is important because chain-of-thought may be the key to unlocking emergent abilities and transcending scaling laws.
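
To illustrate what a chain-of-thought prompt looks like, here is a hand-written sketch in the style of the chain-of-thought paper (the wording is illustrative, not a quotation): compared with a standard few-shot prompt, the demonstration spells out the intermediate reasoning steps before the answer.

```python
# Standard few-shot prompt: the demonstration gives only the final answer.
standard_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: 11

Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many apples do they have?
A:"""

# Chain-of-thought prompt: the demonstration shows the intermediate reasoning steps.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many apples do they have?
A:"""

# Models with strong chain-of-thought ability tend to continue the second prompt with
# something like "They had 23 apples, used 20, so 3 were left. 3 + 6 = 9. The answer is 9."
print(cot_prompt)
```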

Where do these abilities come from?

Compared with the previous models, the two main differences are instruction tuning and training on code. Specifically:

The ability to respond to human instructions is a direct product of instruction tuning.

The ability to generalize to unseen instructions emerges automatically once the number of instructions exceeds a certain scale, as the T0, Flan, and FlanPaLM papers further demonstrate.

The ability to do complex reasoning with chain-of-thought is likely a magical by-product of training on code. We have the following facts supporting this:

The original GPT-3 was not trained on code and cannot do chain-of-thought.

Although text-davinci-001 was instruction tuned, the first version of the chain-of-thought paper reports that its chain-of-thought reasoning ability is very weak. So instruction tuning is probably not the reason chain-of-thought exists; training on code is the most likely reason the model can do chain-of-thought reasoning.

PaLM has 5% code training data and can do chain-of-thought.

The code data in the Codex paper amounts to 159GB, roughly 28% of the original GPT-3's 570GB of training data. code-davinci-002 and its subsequent variants can do chain-of-thought reasoning.

In the HELM evaluation, Liang et al. (2022) evaluated different models at scale. They found that models trained on code have strong language reasoning ability, including the 12-billion-parameter code-cushman-001.

Our work at AI2 also shows that, when equipped with complex chains of thought, code-davinci-002 is by far the best-performing model on important mathematical benchmarks such as GSM8K.

Intuitively, procedure-oriented programming resembles the way humans solve tasks step by step, and object-oriented programming resembles the way humans decompose complex tasks into several simpler ones.

All of the above observations are correlations between code and reasoning ability / chain-of-thought, not necessarily causation. This correlation is interesting, but it remains an open research question. At present, we have no very conclusive evidence that code training is the cause of chain-of-thought and complex reasoning.

In addition, another possible by-product of training on code is long-range dependency. As Peter Liu points out: "next-word prediction in natural language is usually very local, whereas code often requires longer dependencies to do things right, such as matching opening and closing brackets or referring to a function defined far away."
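
A purely illustrative sketch of the kind of long-range dependency Peter Liu describes: to emit the closing brackets and the final call correctly, a model has to track tokens that appeared many lines earlier.

```python
def normalize(scores):                    # defined here ...
    total = sum(scores)
    return [s / total for s in scores]

config = {
    "weights": [
        [1, 2, 3],
        [4, 5, 6],
    ],                                    # closing this list must match the "[" opened far above
}

print(normalize(config["weights"][0]))    # ... and referenced only here, much later
```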

Here I would like to further add that, because of class inheritance in object-oriented programming, training on code may also contribute to the model's ability to encode hierarchical structure. We leave testing this hypothesis to future work.

In addition, note a few detailed differences:

Text-davinci-002 and code-davinci-002

code-davinci-002 is the base model, and text-davinci-002 is the product of instruction tuning code-davinci-002 (see OpenAI's documentation). It was fine-tuned on the following data: (1) human-written instructions and the expected outputs; (2) model outputs selected by human annotators.

When in-context examples are provided, code-davinci-002 is better at in-context learning; when there are no in-context examples (zero-shot), text-davinci-002 is better at zero-shot task completion. In this sense, text-davinci-002 better matches human expectations (since writing in-context examples for a task can be cumbersome).

OpenAI is unlikely to have deliberately sacrificed in-context learning ability for zero-shot ability; the reduced in-context learning is more likely a side effect of instruction tuning, which OpenAI calls the alignment tax.

The -001 models (code-cushman-001 and text-davinci-001) vs. the -002 models (code-davinci-002 and text-davinci-002)

The -001 models were made mainly for pure-code / pure-text tasks, while the -002 models deeply fuse code training with instruction tuning and handle both code and text.

code-davinci-002 may be the first model to deeply fuse code training and instruction tuning. The evidence: code-cushman-001 can reason but performs poorly on pure text, while text-davinci-001 performs well on pure text but is weak at reasoning. code-davinci-002 can do both.

3.2 Do these capabilities already exist after pre-training, or are they injected later by fine-tuning? At this stage, we have identified the key roles of instruction tuning and training on code. An important question is how to further analyze their respective effects.

Specifically: do the above capabilities already exist in the original GPT-3 and get triggered / unlocked by instruction tuning and code training? Or are they absent from the original GPT-3 and injected by instruction tuning and code training?

If the answer is that they already exist in the original GPT-3, then these capabilities should also exist in OPT. Therefore, to replicate them, one might be able to tune OPT directly with instructions and code.

However, code-davinci-002 may not be based on the original GPT-3 davinci, but on a larger model than the original GPT-3. If that is the case, it may not be possible to reproduce it by tuning OPT.

The research community needs OpenAI to further clarify what model was trained as the base model of code-davinci-002.

We have the following hypotheses and evidence:

The base model of code-davinci-002 may not be the original GPT-3 davinci model.

The original GPT-3 was trained on C4 data from 2016-2019, while code-davinci-002's training set extends to 2021. So code-davinci-002 may have been trained on the 2019-2021 version of C4.

The original GPT-3 has a context window of 2,048 tokens, while code-davinci-002's context window is 8,192. The GPT series uses absolute positional embeddings, which are hard to extrapolate beyond the trained length without further training and which seriously hurt model performance when extrapolated (see Press et al., 2022). If code-davinci-002 were based on the original GPT-3, how did OpenAI extend the context window? (A small illustration follows.)
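
A small illustration of why absolute positional embeddings make window extension non-trivial, assuming a learned position table of the kind used in the GPT series (a toy PyTorch sketch, not OpenAI's code): positions beyond the trained window simply have no embedding row.

```python
import torch

max_positions, d_model = 2048, 32
pos_embed = torch.nn.Embedding(max_positions, d_model)   # learned table, one row per position

print(pos_embed(torch.arange(2048)).shape)   # fine: every position 0..2047 has a trained row

try:
    pos_embed(torch.arange(8192))            # positions 2048..8191 have no embedding at all
except IndexError as err:
    print("cannot extrapolate:", err)
```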

On the other hand, whether the base model is the original GPT-3 or a model trained later, the ability to follow instructions and to generalize zero-shot may already exist in the base model and is later unlocked (rather than injected) by instruction tuning.

This is mainly because the amount of instruction data reported in OpenAI's paper is only 77K, which is several orders of magnitude less than the pre-training data.

Other instruction tuning papers further support this comparison of dataset size against model behavior: for example, in Chung et al. (2022), the instruction tuning of Flan-PaLM uses only 0.4% of the pre-training compute. In general, instruction data is significantly smaller than pre-training data.

However, the complex reasoning ability of the model may be injected through code data in the pre-training phase.

The code dataset is on a different scale from the instruction tuning data above: the amount of code data is large enough to make up a significant portion of the training data (for example, PaLM has 8% code training data).

As mentioned above, text-davinci-001, the model that preceded code-davinci-002, was probably not fine-tuned on code data, so its reasoning / chain-of-thought ability is very poor, as reported in the first version of the chain-of-thought paper, sometimes even worse than the smaller code-cushman-001.

Perhaps the best way to separate the effects of code training and instruction tuning is to compare code-cushman-001, T5, and FlanT5.

They have similar model sizes (11 billion and 12 billion parameters) and similar training datasets (C4), so their biggest difference is whether they were trained on code / instruction tuned.

There is no such comparison at present. We leave this to future research.

4. text-davinci-003 and ChatGPT: the power of reinforcement learning from human feedback (RLHF). At the current stage (December 2022), there is almost no rigorous statistical comparison between text-davinci-002, text-davinci-003, and ChatGPT, mainly because:

text-davinci-003 and ChatGPT had been released for less than a month at the time of writing.

ChatGPT cannot be called through the OpenAI API, so testing it on standard benchmarks is cumbersome.

So the comparison between these models is based more on the collective experience of the research community (and is not statistically rigorous). Still, we believe that preliminary descriptive comparisons can shed light on how the models work.

We first note the following comparison of text-davinci-002, text-davinci-003, and ChatGPT:

All three models are instruction tuned.

text-davinci-002 is a supervised instruction-tuned model.

text-davinci-003 and ChatGPT are tuned with instruction tuning plus reinforcement learning from human feedback (RLHF). This is their most significant difference.

This means that most of the new models' behaviors are products of RLHF.

So let's look at the abilities triggered by RLHF (a minimal reward-model sketch follows this list):

More informative responses: text-davinci-003's generations are usually longer than text-davinci-002's. ChatGPT's responses are so verbose that users must explicitly ask it to "answer me in one sentence" to get a more concise answer. This is a direct product of RLHF.

Impartial responses: ChatGPT usually gives very balanced answers on matters involving the interests of multiple parties, such as political events. This is also a product of RLHF.

Rejecting inappropriate questions: this combines a content filter with the model's own ability triggered by RLHF; the filter screens out some requests, and the model itself rejects others.

Rejecting questions outside the scope of its knowledge: for example, rejecting questions about new events that occurred after June 2021 (because it was not trained on data after that). This is the most amazing part of RLHF, because it enables the model to implicitly distinguish which questions lie within its knowledge and which do not.
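
To make the RLHF recipe slightly more concrete, here is a minimal sketch of the pairwise reward-model loss described in the InstructGPT paper, with random tensors standing in for the pooled hidden states of human-ranked responses (a toy PyTorch sketch, not OpenAI's implementation). The trained reward model then supplies the reward signal for the PPO fine-tuning stage.

```python
import torch
import torch.nn.functional as F

d_model = 32
reward_head = torch.nn.Linear(d_model, 1)   # maps a pooled response representation to a scalar reward

# Stand-ins for language-model hidden states of (prompt + response) pairs ranked by labelers.
chosen_repr = torch.randn(4, d_model)       # responses the labelers preferred
rejected_repr = torch.randn(4, d_model)     # responses the labelers ranked lower

r_chosen = reward_head(chosen_repr).squeeze(-1)
r_rejected = reward_head(rejected_repr).squeeze(-1)

# Pairwise ranking loss: push the preferred response's reward above the rejected one's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())
```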

There are two things worth noting:

All of these capabilities are intrinsic to the model, not injected by RLHF. What RLHF does is trigger / unlock emergent abilities. This argument comes mainly from comparing data volumes: RLHF uses far less compute / data than pre-training.

The model knows what it does not know, not because of hand-written rules, but because RLHF unlocks this ability. This is a very surprising finding, since the original goal of RLHF was to get the model to generate answers that conform to human expectations, which is more about producing safe sentences than about making the model aware of what it does not know.

What happens behind the scenes may be:

ChatGPT: trades in-context learning for the ability to model dialogue history. This is an empirical observation, since ChatGPT does not seem to be as strongly affected by in-context demonstrations as text-davinci-003.

text-davinci-003: recovers the in-context learning ability sacrificed in text-davinci-002 and improves zero-shot ability. According to the InstructGPT paper, this comes from mixing a language modeling objective into the reinforcement learning tuning phase (not from RLHF itself).

5. Summarizing the evolution of GPT-3.5 so far. Up to this point, we have carefully examined all the capabilities that appear along the evolution tree. The following summarizes the evolution path:

We can conclude that:

Language generation ability + basic world knowledge + in-context learning all come from pre-training (davinci).

The ability to store large amounts of knowledge comes from 175 billion parameters.

The ability to follow instructions and generalize to new tasks comes from scaling up the number of instructions in instruction tuning (davinci-instruct-beta).

The ability to perform complex reasoning is likely to come from code training (code-davinci-002).

The ability to generate neutral, objective, safe, and informative answers comes from alignment with humans. Specifically:

If it is the supervised learning version, the resulting model is text-davinci-002.

If it is the reinforcement learning version (RLHF), the resulting model is text-davinci-003.

Whether supervised or RLHF, the models cannot outperform code-davinci-002 on many tasks; this performance degradation caused by alignment is called the alignment tax.

Dialogue ability also comes from RLHF (ChatGPT). Specifically, it sacrifices in-context learning in exchange for:

Modeling dialogue history.

More informative dialogue responses.

Rejecting questions outside the scope of the model's knowledge.

6. What GPT-3.5 cannot yet do. Although GPT-3.5 is an important step in natural language processing research, it does not fully possess all of the ideal properties imagined by many researchers (including AI2). Here are some important properties that GPT-3.5 does not have:

Revising the model's beliefs in real time: when the model expresses a belief about something, it can be hard to correct it if that belief is wrong:

One example I came across recently: ChatGPT insists that 3599 is a prime number even though it acknowledges that 3599 = 59 * 61 (a quick check after these examples confirms the factorization). Also see the Reddit example about the fastest marine mammal.

However, the strength of the model's beliefs seems to vary. One example: even if I tell it that Darth Vader won the 2020 election, the model still thinks the current US president is Biden. But if I change the election year to 2024, it thinks the president is Darth Vader and that he will be president in 2026.
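
The factorization in the prime-number example above is easy to verify with a two-line check (this is just arithmetic verification, not anything the model produced):

```python
print(59 * 61)                                   # 3599, so 3599 = 59 * 61
print(all(3599 % d != 0 for d in range(2, 60)))  # False: 3599 is divisible by 59, hence not prime
```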

Formal reasoning: the GPT-3.5 series cannot do inference within strictly formal systems such as mathematics or first-order logic:

In the natural language processing literature, the definition of "reasoning" is often unclear. But if we view it through the lens of ambiguity, a problem can be, for example: (a) very ambiguous, involving no reasoning at all; (b) somewhat logical, but still vague in places; or (c) completely rigorous, with no room for ambiguity.

Under this view, the model can handle type (b) reasoning with ambiguity quite well. Examples:

Generating instructions for making tofu pudding. When making tofu pudding, vagueness in many of the intermediate steps is acceptable, such as whether to make it savory or sweet. As long as the overall steps are roughly correct, the tofu pudding will be edible.

Sketching the proof idea for a mathematical theorem. A proof sketch is an informal, step-by-step solution expressed in natural language, in which the strict derivation of each step need not be spelled out. Proof sketches are often used in math teaching: as long as the teacher gives a roughly correct overall outline, students can roughly follow it; the teacher then assigns the detailed proof as homework and omits the answer.

GPT-3.5 cannot do type (c) reasoning (reasoning that tolerates no ambiguity).

One example is a rigorous mathematical proof, where the intermediate steps may not be skipped, blurred, or wrong.

However, whether such strict reasoning should be done by a language model or by a symbolic system remains open for discussion. One example: rather than trying to get GPT to do three-digit addition, one might simply call Python.
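
A minimal sketch of what "call Python directly" could look like (the CALC(...) tool-call format and the run_with_calculator helper are hypothetical, for illustration only): the language model emits a tool call, and the surrounding system evaluates the arithmetic exactly.

```python
import re

def run_with_calculator(model_output):
    """Replace hypothetical CALC(a op b) tool calls in the model output with exact results."""
    def evaluate(match):
        a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
        return str({"+": a + b, "-": a - b, "*": a * b}[op])
    return re.sub(r"CALC\((\d+)\s*([+\-*])\s*(\d+)\)", evaluate, model_output)

# A hypothetical model output that delegates the sum instead of guessing it.
print(run_with_calculator("The total is CALC(123 + 456)."))   # -> "The total is 579."
```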

Searching the Internet: the GPT-3.5 series (for now) cannot search the Internet directly.

However, the WebGPT paper was published in December 2021, in which GPT calls a search engine. So retrieval ability has already been tested inside OpenAI.

An important distinction here is that GPT-3.5 has two important but different capabilities: knowledge and reasoning. Generally speaking, it would be nice if the knowledge part could be offloaded to an external retrieval system, letting the language model focus only on reasoning, because (see the sketch after these points):

The internal knowledge of the model is always cut off at some point. Models always need up-to-date knowledge to answer the latest questions.

Recall our earlier discussion that the 175 billion parameters are heavily used to store knowledge. If knowledge could be offloaded outside the model, the number of model parameters might be greatly reduced, to the point where the model could eventually even run on a phone (a crazy idea, but ChatGPT already feels like science fiction, so who knows what the future holds).
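
A minimal sketch of this division of labor (the toy keyword retriever and the prompt format are illustrative assumptions, not any production system): an external index supplies the facts, and the language model would only need to reason over the retrieved context.

```python
# A tiny "knowledge store" outside the model.
documents = [
    "GPT-3 was released by OpenAI in 2020 with 175 billion parameters.",
    "ChatGPT was released in November 2022.",
    "The Codex models are fine-tuned on source code.",
]

def retrieve(query, docs, k=1):
    """Score documents by keyword overlap with the query and return the top k."""
    words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:k]

question = "When was ChatGPT released?"
context = "\n".join(retrieve(question, documents))

# The prompt a (hypothetical) reasoning-focused model would receive.
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```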

Conclusion. In this blog post, we carefully examined the range of capabilities of the GPT-3.5 series and traced the sources of all its emergent abilities.

The original GPT-3 model gains its generation ability, world knowledge, and in-context learning from pre-training. The instruction-tuned branch of models then gains the ability to follow instructions and to generalize to unseen tasks. The code-trained branch gains the ability to understand code and, as a by-product of code training, potentially also gains the ability to do complex reasoning.

Combining these two branches, code-davinci-002 appears to be the most capable GPT-3.5 model, possessing all of these strong abilities. Subsequent supervised instruction tuning and RLHF trade away some model capability in exchange for alignment with humans, i.e., the alignment tax. RLHF lets the model generate more informative and impartial answers while rejecting questions outside the scope of its knowledge.

We hope this article helps provide a clear picture of GPT and sparks some discussion about language models, instruction tuning, and code tuning. Most importantly, we hope it serves as a roadmap for reproducing GPT-3.5 in the open-source community.

FAQ: Are the statements in this article more like hypotheses or conclusions?

That complex reasoning ability comes from training on code is a hypothesis we tend to believe.

That the ability to generalize to unseen tasks comes from large-scale instruction tuning is the conclusion of at least four papers.

That GPT-3.5 comes from some other large base model rather than the 175-billion-parameter GPT-3 is a well-founded guess.

That all of these abilities already exist and are unlocked, rather than injected, by instruction tuning (whether supervised or reinforcement learning) is a strong hypothesis, strong enough that one hardly dares to believe it. The main reason is that the amount of instruction tuning data is several orders of magnitude smaller than the pre-training data.

Conclusion = abundant evidence supports the claim; hypothesis = there is positive evidence, but it is not strong enough; well-founded guess = no conclusive evidence, but some factors point in this direction.

Why aren't other models, such as OPT and BLOOM, so powerful?

For OPT, probably because its training process was too unstable.

The case of BLOOM is unknown.

Original text link:

https://yaofu.notion.site/GPT-3-5-360081d91ec245f29029d37b54573756

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era). Authors: Fu Yao, Peng Hao, Tushar Khot. Editors: LRS, "good sleepy".
