Don't let your large model become a benchmark-evaluation cheater.
That is the warning from a new study by the School of Information and the Gaoling School of Artificial Intelligence at Renmin University of China, together with the University of Illinois at Urbana-Champaign.
The study finds that benchmark data is increasingly ending up, by accident, in model training sets.
Because pre-training corpora draw heavily on public text, and evaluation benchmarks are built from the same material, some overlap is hard to avoid.
The problem is getting worse as large models collect ever more public data, and this kind of overlap is genuinely harmful.
It not only inflates a model's scores on the leaked benchmarks, but also weakens the model's generalization ability and its performance on unrelated tasks, and may even degrade the model in real-world applications.
The study therefore issues a formal warning and verifies the potential harm through a series of simulated-leakage experiments.
Benchmark leakage is dangerous for large models. The study mainly measures its impact by simulating extreme cases of data leakage.
Four extreme leakage settings are simulated (a minimal simulation sketch follows the list):
Using the MMLU training set
Using the training sets of all test benchmarks except MMLU
Using all training sets plus the test prompts
Using all training sets, test sets, and test prompts (the most extreme case, simulated for the experiment only; it would not normally occur)
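The article does not include the paper's data-mixing pipeline, but the mildest setting can be illustrated with a short Python sketch. Everything concrete below is an assumption for illustration, not the authors' code: the Hugging Face dataset names ("cais/mmlu", "wikitext") and the way each multiple-choice item is flattened into plain text.

```python
# Sketch of simulating leakage setting 1: mixing the MMLU training split
# into an otherwise "clean" pre-training corpus. Dataset names and the
# text format are illustrative assumptions, not the paper's own pipeline.
from datasets import concatenate_datasets, load_dataset

def format_mmlu(example):
    # Flatten one multiple-choice item into plain text, the form in which
    # leaked benchmark data would sit inside a pre-training corpus.
    choices = "\n".join(
        f"{chr(65 + i)}. {c}" for i, c in enumerate(example["choices"])
    )
    answer = chr(65 + example["answer"])
    return {"text": f"{example['question']}\n{choices}\nAnswer: {answer}"}

clean = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
leak = load_dataset("cais/mmlu", "all", split="auxiliary_train").map(format_mmlu)

contaminated = concatenate_datasets(
    [clean.select_columns(["text"]), leak.select_columns(["text"])]
).shuffle(seed=42)
print(f"{len(leak)} leaked items mixed into {len(clean)} clean documents")
```

Continuing to train a model on `contaminated` instead of `clean` reproduces, in spirit, the first of the four settings; the harsher settings additionally mix in test prompts and test sets.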
The researchers then "poisoned" four large models in these ways and observed their performance across different benchmarks, mainly on question answering, reasoning, and reading comprehension tasks; a sketch of the standard multiple-choice scoring procedure follows the model list below.
The models used are:
GPT-Neo (1.3B)
Phi-1.5 (1.3B)
OpenLLaMA (3B)
LLaMA-2 (7B)
Meanwhile, LLaMA (13B / 30B / 65B) was used as the control group.
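The article does not spell out the evaluation procedure, but multiple-choice benchmarks such as MMLU are typically scored by comparing the log-likelihood a model assigns to each candidate answer. A minimal sketch with `transformers`, using the GPT-Neo checkpoint from the study; the toy question and the scoring details are illustrative assumptions, not the paper's harness:

```python
# Sketch of standard multiple-choice scoring: each candidate answer is
# scored by its total log-probability under the model, and the highest-
# scoring option counts as the prediction. The question is a toy example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"  # one of the models in the study
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Question: What is 2 + 2?\nAnswer:"
options = [" 3", " 4", " 5"]

def option_logprob(prompt: str, option: str) -> float:
    # Assumes tokenizing prompt + option preserves the prompt's tokens,
    # which holds here because every option starts with a space.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # Logits at position pos - 1 predict the token at position pos.
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

scores = [option_logprob(question, opt) for opt in options]
print("Predicted answer:", options[scores.index(max(scores))].strip())
```

Accuracy over a benchmark is then simply the fraction of items whose highest-scoring option is the correct one, which is why training on leaked items lifts scores so directly.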
The results show that when a model's pre-training data contains data from a particular benchmark, the model performs better on that benchmark but worse on other, unrelated tasks.
For example, after training on MMLU data, several models' scores rise on the MMLU test while their scores on the commonsense benchmark HellaSwag and the math benchmark GSM8K fall.
This shows that the models' generalization ability is impaired.
Leakage can also inflate scores on unrelated tests.
The training sets used to "poison" the models contained only a small amount of Chinese data, yet after "poisoning", the models' scores on C3, a Chinese-language benchmark, went up.
That rise has no reasonable explanation.
Training-data leakage can even let a small model outscore much larger ones.
For example, phi-1.5 (1.3B) outperforms LLaMA-65B on RACE-M and RACE-H, even though the latter is 50 times the former's size.
But this gain is meaningless; it is simply cheating.
Worse still, even tasks whose data was never leaked are affected, and their performance declines.
As the table in the paper shows, both large models' scores drop significantly on the code task HumanEval.
Meanwhile, once data has leaked, the improvement a model gains from fine-tuning is far smaller than in the leak-free case.
Why does this overlap and leakage happen? The study analyzes several possibilities. For one, pre-training corpora and benchmark data both draw on open text (web pages, papers, and so on), so some overlap is inevitable.
Moreover, current large-model evaluations are run locally or through API calls, which makes it impossible to rigorously audit abnormal score increases. And since each party treats its pre-training corpus as a core trade secret, outsiders cannot examine it. The result is that large models get "poisoned" by accident.
So how can this problem be avoided? The research team offers three suggestions:
First, since data overlap is hard to avoid entirely in practice, large models should be evaluated on multiple benchmarks for a more comprehensive picture.
Second, large-model developers should decontaminate their training data and disclose its detailed composition.
Third, benchmark maintainers should document the sources of their benchmark data, analyze the risk of contamination, and run multiple evaluations with more diverse prompts; a sketch of one common contamination check follows.
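On that third point, one widely used contamination check (popularized by the GPT-3 report, and shown here as a common stand-in rather than this paper's own method) is n-gram overlap between benchmark items and the training corpus. A minimal sketch, with the 13-gram window and the toy inputs as assumptions:

```python
# Sketch of an n-gram overlap contamination check: flag a benchmark item
# if any of its word n-grams also appears in the training corpus. The
# 13-gram window follows common practice (e.g., the GPT-3 report); the
# toy corpus and item below are illustrative only.
import re

N = 13

def ngrams(text: str, n: int = N) -> set:
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_corpus_index(corpus_docs: list) -> set:
    # Collect every n-gram that occurs anywhere in the training corpus.
    index = set()
    for doc in corpus_docs:
        index |= ngrams(doc)
    return index

def is_contaminated(benchmark_item: str, corpus_index: set) -> bool:
    # Any shared n-gram means the item likely leaked into training data.
    return not ngrams(benchmark_item).isdisjoint(corpus_index)

corpus_index = build_corpus_index(["... pre-training documents go here ..."])
item = "Which planet is known as the Red Planet? Answer: Mars"
print("contaminated:", is_contaminated(item, corpus_index))
```

In practice the corpus index is built once over the whole training set; items flagged as contaminated can then be excluded or reported separately when scores are published.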
The team also acknowledges limitations: for example, the study does not systematically test different degrees of leakage, and it does not inject the leaked data directly into pre-training to simulate contamination, among other things.
The study was carried out by scholars from the School of Information and the Gaoling School of Artificial Intelligence at Renmin University of China, together with the University of Illinois at Urbana-Champaign. The team includes two leading figures in data mining: Ji-Rong Wen and Jiawei Han.
Professor Ji-Rong Wen is currently dean of the Gaoling School of Artificial Intelligence and dean of the School of Information at Renmin University of China. His main research interests are information retrieval, data mining, machine learning, and the training and application of large-scale neural network models.
Professor Jiawei Han is a leading expert in data mining; he is currently a professor of computer science at the University of Illinois at Urbana-Champaign and a Fellow of both the ACM and IEEE.
Paper address:
https://arxiv.org/abs/2311.01964
This article comes from the WeChat official account Quantum Bit (ID: QbitAI), written by Mingmin.