The language model zoo adds a new member: Koala, a model better aligned with the needs of real users!
Since Meta open-sourced LLaMA, ChatGPT-style models have sprung up across academia.
First Stanford released the 7-billion-parameter Alpaca; then UC Berkeley, together with CMU, Stanford, UCSD, and MBZUAI, released the 13-billion-parameter Vicuna, which matches ChatGPT and Bard in more than 90% of cases.
Recently, Berkeley released a new model, Koala. Unlike models that are instruction-tuned on data generated by OpenAI's GPT models, Koala is trained on high-quality dialogue data gathered from the web.
Blog link: https://bair.berkeley.edu/blog/2023/04/03/koala/
Data preprocessing code: https://github.com/young-geng/koala_data_pipeline
Evaluation test set: https://github.com/arnav-gudibande/koala-test-set
Model download: https://drive.google.com/drive/folders/10f7wrlAFoPIy-TECHsx9DKIvbQYunCfl
In the blog post, the researchers describe the model's dataset curation and training process, as well as the results of a user study comparing the model with ChatGPT and Stanford's Alpaca.
The results show that Koala can effectively answer a variety of user queries, generating answers that are often preferred over Alpaca's and that are comparable to ChatGPT's in at least half of cases.
The researchers hope these results will further the discussion about the relative performance of large closed-source models versus smaller public models, especially models small enough to run locally: if training data are carefully collected, smaller models can approach the performance of large ones.
This may mean that the community should devote more effort to curating high-quality datasets, which could do more to enable safer, more practical, and more capable models than simply increasing the size of existing systems.
It should be emphasized that Koala is only a research prototype; although the researchers hope its release will provide a valuable community resource, it still has major shortcomings in content safety and reliability and should not be used outside of research.
System Overview
With the release of large language models, virtual assistants and chatbots are becoming more and more capable, not only chatting but also writing code, poetry, and stories.
However, the most capable language models usually require massive computational resources to train, along with large-scale dedicated datasets, so ordinary people basically have no way to train such models themselves.
In other words, language models may in the future be controlled by a small number of powerful organizations: users and researchers pay to interact with the model, with no direct access to modify or improve it.
On the other hand, in recent months some organizations have released more capable free or partially open-source models, such as Meta's LLaMA. These cannot yet match closed models such as ChatGPT, but their capabilities have been improving rapidly with the help of the community.
This puts pressure on the open-source community: will the future bring ever more consolidation around a handful of closed-source models, or more open models built on smaller architectures? And can a model with such an architecture approach the performance of its larger closed-source counterparts?
Although open models are unlikely to match the scale of closed-source models, carefully selected training data may bring their performance close to ChatGPT's.
In fact, Stanford's earlier Alpaca experiments, which fine-tuned LLaMA on data from OpenAI's GPT models, already showed that the right data can significantly improve smaller open-source models. This was the Berkeley researchers' original motivation for developing and releasing Koala, which provides another experimental data point for this discussion.
Koala is fine-tuned on freely available interaction data gathered from the web, with particular attention to data from interactions with highly capable closed-source models such as ChatGPT.
The researchers fine-tune the base LLaMA model on dialogue data drawn from the web and from public datasets, including high-quality responses to user queries from other large language models, along with question-answering and human-feedback datasets. The resulting Koala-13B model shows performance roughly on par with existing models.
The results suggest that learning from high-quality datasets can mitigate some shortcomings of small models and may even let them compete with large closed-source models in the future. This again implies that the community should devote more energy to curating high-quality datasets, which is more conducive to building safer, more practical, and more capable models than simply scaling up existing ones.
By encouraging researchers to experiment with the system demo of Koala, the researchers hope to discover unexpected features or defects that will help evaluate such models in the future.
Datasets and Training
A primary obstacle to building dialogue models is the curation of training data. All the major chat models, including ChatGPT, Bard, Bing Chat, and Claude, rely on proprietary datasets built with extensive human annotation.
To build Koala, the researchers assembled a training set by collecting dialogue data from the web and from public datasets, part of which consists of conversations with large language models (such as ChatGPT) that users have posted online.
Rather than crawling as much web data as possible to maximize quantity, the researchers focused on collecting a small, high-quality dataset, drawing on public datasets for question answering, human feedback (responses rated positively and negatively), and conversations with existing language models.
ChatGPT Distillation Data
Public conversations shared with ChatGPT (ShareGPT): about 60,000 conversations posted by users on ShareGPT (https://sharegpt.com/) were collected via its public API. To ensure data quality, the researchers removed duplicate user queries and deleted all non-English conversations, leaving about 30,000 samples.
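A minimal sketch of this kind of filtering is shown below, assuming ShareGPT-style JSON records with a `conversations` list of turns; the field names and the langdetect dependency are illustrative assumptions, not the authors' actual pipeline (see the koala_data_pipeline repository for that).

```python
# Sketch of ShareGPT-style filtering: deduplicate on the first user query
# and keep only English conversations. Field names and the langdetect
# dependency are assumptions, not the authors' actual pipeline.
import json
from langdetect import detect, LangDetectException

def first_user_query(conv):
    """Return the first human turn of a ShareGPT-style conversation."""
    for turn in conv.get("conversations", []):
        if turn.get("from") == "human":
            return turn.get("value", "")
    return ""

def filter_conversations(raw):
    seen, kept = set(), []
    for conv in raw:
        query = first_user_query(conv).strip()
        if not query or query in seen:
            continue  # drop empty or duplicate user queries
        try:
            if detect(query) != "en":
                continue  # drop non-English conversations
        except LangDetectException:
            continue  # undetectable language: drop as well
        seen.add(query)
        kept.append(conv)
    return kept

if __name__ == "__main__":
    with open("sharegpt_raw.json") as f:
        raw = json.load(f)
    print(f"kept {len(filter_conversations(raw))} of {len(raw)} conversations")
```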
Human ChatGPT Comparison Corpus (HC3): human and ChatGPT responses from the English HC3 dataset, which contains about 60,000 human answers and 27,000 ChatGPT answers to roughly 24,000 questions, for a total of about 87,000 question-answer samples.
Open-Source Data
Open Instruction Generalist (OIG): a manually selected subset of components from the Open Instruction Generalist dataset curated by LAION, including its grade-school math instruction, poetry-to-songs, and plot-script-book-dialogue components, for a total of about 30,000 samples.
Stanford Alpaca: the dataset used to train the Stanford Alpaca model.
It contains about 52,000 samples generated by OpenAI's text-davinci-003 following the self-instruct process.
It is worth noting that the HC3, OIG, and Alpaca datasets are single-turn question answering, while ShareGPT is multi-turn conversation; a sketch of normalizing both into one training format follows.
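Since single-turn and multi-turn data must end up in one training format, a plausible normalization step looks like the following; the role tags and field names are illustrative assumptions, not Koala's documented format.

```python
# Sketch: normalize single-turn QA (HC3, OIG, Alpaca) and multi-turn
# ShareGPT conversations into one transcript format. Role tags and
# field names are illustrative assumptions.
def format_single_turn(question: str, answer: str) -> str:
    return f"USER: {question}\nASSISTANT: {answer}"

def format_multi_turn(turns: list) -> str:
    role = {"human": "USER", "gpt": "ASSISTANT"}
    return "\n".join(f"{role[t['from']]}: {t['value']}" for t in turns)

# An Alpaca-style sample and a ShareGPT conversation now share one
# representation:
print(format_single_turn("What is 2+2?", "4"))
print(format_multi_turn([{"from": "human", "value": "Hi"},
                         {"from": "gpt", "value": "Hello!"}]))
```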
Anthropic HH: contains human ratings of the harmfulness and helpfulness of model outputs.
The dataset contains about 160,000 human-rated examples, each consisting of a pair of chatbot responses, one of which is preferred by humans; this data can improve both the helpfulness and the safety of the model.
OpenAI WebGPT: contains a total of about 20,000 comparisons, each consisting of a question, a pair of model answers, and metadata; the answers are rated by humans according to their preferences.
OpenAI Summarization: contains about 93,000 samples with human feedback on model-generated summaries, where human evaluators chose the better of two candidate summaries.
Among the open-source datasets, some provide two responses per example, one rated good and one rated bad (Anthropic HH, WebGPT, OpenAI Summarization).
Previous research has demonstrated that conditioning language models on human preference markers (helpful / unhelpful) improves performance, so the researchers condition the model on a positive or negative marker according to the preference label, and use the positive marker on datasets without human feedback. During evaluation, prompts are written with the positive marker.
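As a rough illustration of this conditioning scheme (the marker strings and prompt layout below are assumptions, not the paper's exact tokens), each training example is simply prefixed with a tag matching its rating:

```python
# Sketch of preference-marker conditioning: prepend a tag to each training
# example based on its human rating. Tag strings and the prompt layout are
# illustrative assumptions, not Koala's actual tokens.
POS, NEG = "<good>", "<bad>"

def tag_example(prompt, response, rating=None):
    """rating: 1 = preferred, 0 = dispreferred, None = no human feedback."""
    tag = NEG if rating == 0 else POS  # unlabeled data defaults to POS
    return f"{tag} USER: {prompt}\nASSISTANT: {response}"

# At evaluation time, prompts carry the positive marker so generation is
# steered toward preferred-style responses:
eval_prompt = f"{POS} USER: How do koalas sleep?\nASSISTANT:"
```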
Koala is implemented in JAX/Flax on top of EasyLM, an open-source framework for pre-training, fine-tuning, serving, and evaluating a variety of large language models. Training ran on a single Nvidia DGX server with eight A100 GPUs and took 6 hours for 2 epochs.
On public cloud platforms, the expected cost of such a run is no more than $100.
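That figure is easy to sanity-check (the hourly price here is an assumption, not from the blog): 8 A100s for 6 hours is 48 GPU-hours, and at roughly $2 per A100-hour on preemptible cloud instances this comes to about $96, consistent with the stated budget.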
Preliminary Evaluation
In the experiments, the researchers evaluated two models: Koala-Distill, trained only on the distillation data, and Koala-All, trained on all of the data, both distillation and open-source.
The goals were to compare the two models, evaluate the impact of the distillation and open-source datasets on final performance, and run a human evaluation comparing Koala-All against Koala-Distill, Alpaca, and ChatGPT.
The test set consists of the Stanford Alpaca test set and the Koala test set. The Alpaca test set consists of user prompts sampled from the self-instruct dataset and represents in-distribution data for the Alpaca model. To provide a more realistic evaluation protocol, the Koala test set contains 180 real user queries published online; these span different topics, are usually conversational in style, and are more representative of actual chat-based use cases. To reduce possible test-set leakage, queries with a BLEU score greater than 20% against the training set were filtered out.
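A minimal sketch of that leakage filter appears below; the use of sacrebleu and the per-query maximum-similarity interpretation of the 20% threshold are assumptions, since the blog does not spell out the exact procedure.

```python
# Sketch of BLEU-based leakage filtering: drop any candidate test query
# whose similarity to some training query exceeds a BLEU score of 20.
# Using sacrebleu with a max over training queries is an assumption
# about the exact procedure.
from sacrebleu import sentence_bleu

def filter_leaked(test_queries, train_queries, threshold=20.0):
    kept = []
    for q in test_queries:
        # sentence_bleu scores one hypothesis against references (0-100)
        best = max(sentence_bleu(q, [t]).score for t in train_queries)
        if best <= threshold:
            kept.append(q)  # keep only queries dissimilar to training data
    return kept
```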
In addition, because the research team was most proficient in English, the researchers removed non-English and coding-related prompts to obtain more reliable annotations. They then ran a blind test with about 100 annotators on Amazon's crowdsourcing platform: the rating interface showed each annotator an input prompt and the outputs of two models, and asked them to judge which output was better (ties were allowed) using criteria related to response quality and correctness.
On the Alpaca test set, Koala-All performs comparably to Alpaca.
On the Koala test set (containing real user queries), Koala-All is better than Alpaca in nearly half of the samples, and better than or on par with Alpaca in 70% of cases. The Koala training and test sets are inevitably more similar to each other, so this result is not especially surprising.
But insofar as these prompts better resemble the downstream use cases for such models, it means Koala should perform better in assistant-like applications, indicating that fine-tuning on interactions with language models posted on the web is an effective strategy for endowing models with strong instruction-following capabilities.
Surprisingly, the researchers found that adding open-source data on top of the distillation data (Koala-All) produced slightly worse results than training on the ChatGPT distillation data alone (Koala-Distill).
Although the difference may not be significant, the result suggests the ChatGPT conversations are of such high quality that even doubling the amount of data with open-source additions brought no significant improvement.
The initial hypothesis had been that Koala-All would perform better, which is why it was used as the primary model throughout the evaluation. The eventual takeaway is that effective instruction-following and assistant models can be distilled from large language models, as long as the prompts are representative of the diversity of users at test time.
The key to building a strong dialogue model, therefore, may be curating high-quality dialogue data that is diverse in user queries, rather than simply reformatting existing datasets into questions and answers.
Limitations and Safety
Like other language models, Koala has limitations that can harm users if it is misused.
The researchers observed that Koala hallucinates, producing non-factual responses in a very confident tone. This may be a consequence of dialogue fine-tuning: smaller models inherit the confident style of larger language models without inheriting the same level of factuality, a gap that future work needs to close.
When misused, such hallucinated responses could facilitate the spread of misinformation, spam, and other harmful content.
Beyond hallucinating inaccurate information in a confident and convincing tone, Koala shares the deficiencies of other chatbot language models, including:
Biases and stereotypes: the model inherits the biases of its training dialogue data, including stereotypes, discrimination, and other harms.
Lack of common sense: although large language models can produce seemingly coherent and grammatically correct text, they often lack common-sense knowledge that people take for granted, which can lead to absurd or inappropriate responses.
Limited understanding: large language models may struggle to grasp the context and nuance of a dialogue, or to recognize sarcasm and irony, which can lead to misunderstandings.
To address Koala's safety risks, the researchers included adversarial prompts from the ShareGPT and Anthropic HH datasets during training to make the model more robust and harmless.
To further reduce potential abuse, OpenAI's content moderation filter is deployed in the demo to flag and remove unsafe content.
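The blog does not document how the filter is wired in, but a minimal sketch of gating a demo through OpenAI's moderation endpoint might look like this (only the moderation call itself is a real OpenAI interface; the surrounding glue is assumed):

```python
# Sketch: gate a chatbot demo behind OpenAI's moderation endpoint. The
# integration details are assumptions; only the moderation API call is
# a real OpenAI interface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_unsafe(text: str) -> bool:
    """Return True if the moderation endpoint flags the text."""
    result = client.moderations.create(input=text)
    return result.results[0].flagged

def serve_response(user_prompt: str, generate) -> str:
    if is_unsafe(user_prompt):
        return "[prompt removed by content filter]"
    reply = generate(user_prompt)  # the local model's generation function
    if is_unsafe(reply):
        return "[response removed by content filter]"
    return reply
```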
Future Work
The researchers hope Koala will become a useful platform for future academic research on large language models: the model is small enough to exhibit many of the capabilities of modern language models, yet can be fine-tuned and served with modest compute. Future research directions may include:
Safety and alignment: further study of the safety of language models and better alignment with human intentions.
Model bias: better understanding the biases of large language models, the presence of spurious correlations and quality problems in dialogue datasets, and methods to mitigate them.
Understanding large language models: because Koala inference can run on relatively inexpensive GPUs, the internals of a dialogue language model can be inspected and probed more easily, making black-box language models easier to understand (see the loading sketch after this list).
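To give a sense of that accessibility (the checkpoint path is hypothetical, since Koala is released only as a weight delta that must first be merged onto LLaMA), a 13B model can be loaded in 8-bit on a single GPU with Hugging Face transformers:

```python
# Sketch: run a 13B dialogue model on one GPU via 8-bit quantization
# (requires the bitsandbytes package). The checkpoint path is a
# placeholder; Koala ships as a weight delta on LLaMA, so obtaining
# actual weights takes an extra merge step.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "path/to/merged-koala-13b"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # ~13 GB VRAM
    device_map="auto",
)

inputs = tokenizer("USER: Why are koalas sleepy?\nASSISTANT:",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```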
Reference:
https://bair.berkeley.edu/blog/2023/04/03/koala/
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).