
"Mathematical vegetable Chicken" ChatGPT understands human preferences very well, and generating random numbers online is the ultimate answer to the universe.


ChatGPT has picked up humans' habits when generating random numbers.

ChatGPT may be an accomplished purveyor of nonsense and misinformation, but it is no "mathematician"!

Recently, Colin Fraser, a data scientist at Meta, found that ChatGPT does not generate truly random numbers; what it produces is closer to the "random" numbers humans pick.

From his experiment, Fraser concluded: "ChatGPT really likes the numbers 42 and 7."

As netizens pointed out, that is because humans really like those numbers.

ChatGPT also loves the ultimate answer to the universe. In his test, Fraser used the following prompt:

"Pick a random number between 1 and 100. Just return the number; Don't include any other text or punctuation in the response . "

By repeatedly asking ChatGPT for a single random number between 1 and 100, Fraser collected 2,000 answers and tallied them in a table.

The number 42 appeared most often, at nearly 10% of responses, and numbers containing a 7 were also disproportionately frequent.

Numbers from 71 to 79 were especially common; outside that range, 7 often appeared as the second digit.
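A minimal sketch of how one might replicate this tally, assuming the openai Python SDK (v1 style); the model name, sample size, and response cleanup here are illustrative, not Fraser's actual setup:

```python
# Sketch: replay the random-number experiment and tally the model's picks.
# Assumes the openai v1 SDK and an OPENAI_API_KEY in the environment.
from collections import Counter

from openai import OpenAI

client = OpenAI()

PROMPT = ("Pick a random number between 1 and 100. Just return the number; "
          "don't include any other text or punctuation in the response.")

counts = Counter()
for _ in range(2000):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",                        # illustrative model choice
        messages=[{"role": "user", "content": PROMPT}],
    )
    answer = resp.choices[0].message.content.strip()
    if answer.isdigit():                              # skip malformed replies
        counts[int(answer)] += 1

total = sum(counts.values())
for number, n in counts.most_common(10):              # per Fraser, 42 tops the list
    print(f"{number:>3}: {n / total:.1%}")
```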

What does 42 mean?

Anyone who has read Douglas Adams's celebrated science-fiction classic The Hitchhiker's Guide to the Galaxy knows that 42 is "the answer to the ultimate question of life, the universe, and everything."

In short, 42 and 69 are internet meme numbers. This suggests ChatGPT is not a real random number generator; it simply picks out the numbers most popular in the enormous datasets scraped from the internet.

The frequent appearance of 7, likewise, reflects how ChatGPT caters to human preferences.

In Western culture, 7 is widely regarded as a lucky number, "Lucky 7," much as Chinese culture prizes the number 8.

Interestingly, Fraser also found that GPT-4 seems to compensate for this.

When GPT-4 is asked for many numbers at once, the "random" numbers it returns are distributed too evenly.
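"Too even" is a testable claim: a genuinely uniform sampler shows natural fluctuation across the 100 bins, so a chi-squared statistic near zero is itself suspicious. A sketch with scipy (the counts below are fabricated placeholders, not Fraser's data):

```python
# Sketch: flag a distribution that is *too* uniform with a chi-squared test.
import numpy as np
from scipy.stats import chisquare

counts = np.full(100, 20)      # placeholder: 2,000 draws spread perfectly evenly
stat, p = chisquare(counts)    # expected frequencies default to uniform

# For honest uniform sampling, stat fluctuates around 99 (its degrees of freedom).
# A statistic near 0 (p close to 1) means less variation than chance produces,
# hinting that the generator is deliberately balancing its output.
print(f"chi2 = {stat:.1f}, p = {p:.3f}")
```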

In short, ChatGPT produces its responses by prediction, not by genuinely "thinking" its way to an answer.

A chatbot touted as nearly omnipotent, it turns out, is still a bit silly.

Ask it to plan a road trip and it may route you through a town that doesn't exist; ask it for a random number and it will most likely hand you a popular meme.

Netizens who tried it themselves confirmed that GPT-4 really does like 42.

What's the point if ChatGPT just ends up repeating internet platitudes?

Did the birth of GPT-4 violate a machine learning rule?

The arrival of GPT-4 is exciting, but also disappointing.

OpenAI not only withheld further details about GPT-4, including even the model's size, but also played up its performance on a raft of professional and standardized exams, where it crushes humans.

Take the US bar exam as an example: GPT-3.5 scored around the 10th percentile, while GPT-4 reached the 90th.

However, Arvind Narayanan, a professor of computer science at Princeton University, and his doctoral student Sayash Kapoor wrote that OpenAI may have tested on its training data, and that human benchmarks are meaningless for chatbots anyway.

Specifically, OpenAI may have violated a basic rule of machine learning: don't test on the training data. Test data and training data must be kept separate; otherwise the results are inflated by overfitting.
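The rule is easy to state concretely. A minimal sklearn illustration (unrelated to OpenAI's pipeline) of why the two sets must be kept apart:

```python
# Sketch: why "don't test on training data" matters. A decision tree can
# memorize its training split perfectly; only the held-out split is honest.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy on training data:", model.score(X_tr, y_tr))  # ~1.0: memorized
print("accuracy on held-out data:", model.score(X_te, y_te))  # lower: the real score
```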

Setting that aside, there is an even bigger problem.

Language models solve problems differently from humans, so these results say little about how a bot will perform on the practical problems professionals actually face. A lawyer's job is not to answer bar-exam questions all day.

Question 1: Training data contamination

To evaluate GPT-4's programming ability, OpenAI tested it on problems from Codeforces, the Russian competitive programming site.

Surprisingly, Horace He pointed out online that, among easy-category problems, GPT-4 solved all 10 of the pre-2021 problems but none of the 10 most recent ones.

GPT-4's training data cuts off in September 2021.

This strongly suggests the model memorized solutions from its training set, or at least partially memorized them, enough to fill in whatever it cannot recall exactly.

To gather further evidence for this hypothesis, Arvind Narayanan tested GPT-4 on problems from Codeforces contests held at different points in 2021.

The result: GPT-4 could solve easy-category problems from contests before September 5, but none from after September 12.
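A sketch of that date-split probe; the `ask_model` callable and the problem records are hypothetical stand-ins, not Narayanan's actual harness:

```python
# Sketch: partition contest problems around the training cutoff and compare
# solve rates. A large before/after gap points to memorization, not skill.
from datetime import date

TRAINING_CUTOFF = date(2021, 9, 30)  # GPT-4's stated cutoff: September 2021

def solve_rate(problems, ask_model):
    solved = sum(ask_model(p["statement"]) == p["answer"] for p in problems)
    return solved / len(problems)

def date_split_probe(problems, ask_model):
    before = [p for p in problems if p["date"] <= TRAINING_CUTOFF]
    after = [p for p in problems if p["date"] > TRAINING_CUTOFF]
    return solve_rate(before, ask_model), solve_rate(after, ask_model)
```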

In fact, its memorization of the training set can be demonstrated outright: prompt GPT-4 with the title of a Codeforces problem and its answer includes a link to the exact contest where that problem appeared. Notably, GPT-4 has no internet access, so memorization is the only explanation.

GPT-4 memorized Codeforces problems from before its training cutoff

As for benchmarks outside programming, Professor Narayanan said: "We don't see a clean way to separate the problems by time, so we suspect it was very hard for OpenAI to avoid data contamination. For the same reason, we can't run experiments to test how performance varies with date."

One can, however, come at it from the other direction: if this is memorization, then GPT should be highly sensitive to the wording of a question.

In February, Melanie Mitchell, a professor at the Santa Fe Institute, gave an example from an MBA exam in which a slight change of detail was enough to fool ChatGPT (GPT-3.5), even though it would not fool a person.

More detailed experiments along these lines would be valuable.

Given OpenAI's lack of transparency, Professor Narayanan cannot say for certain that data contamination is the culprit. What is certain is that OpenAI's method of detecting contamination is slapdash:

"We use the substring matching method to measure cross-contamination between the evaluation data set and the pre-training data. The evaluation and training data are processed to delete all spaces and symbols, leaving only characters (including numbers). For each evaluation example, we randomly select three substrings with a length of 50 characters (if the sample length is less than 50 characters, use the entire example). If any sampled evaluation substring is a substring of the processed training example, the match is considered successful. This gives you a list of contaminated examples. We discard these examples and rerun them to get uncontaminated scores. "

This method simply does not hold up to scrutiny.

If a test problem is present in the training set with the names and numbers changed, this check cannot detect it. More reliable methods are now available, such as embedding distance.

But if OpenAI were to use embedding distance, how similar is too similar? There is no objective answer to that question.
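An embedding-distance check might look like the sketch below; the encoder and the 0.9 threshold are assumptions, and picking that threshold is exactly the subjective step in question:

```python
# Sketch: contamination check by embedding similarity instead of exact matching.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def too_similar(eval_example: str, training_docs: list[str],
                threshold: float = 0.9) -> bool:
    vecs = encoder.encode([eval_example] + training_docs)
    q, docs = vecs[0], vecs[1:]
    # Cosine similarity between the eval item and each training document.
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    # Renamed variables and altered numbers barely move the embedding, so
    # near-duplicates are caught; but the threshold itself is a judgment call.
    return bool(sims.max() >= threshold)
```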

So even performance on multiple-choice standardized tests, seemingly the simplest case, carries a great deal of subjectivity.

Question 2: Professional exams are not a valid way to compare human and machine ability

Memorization is a spectrum. Even if a language model has never seen an exact problem from its training set, the sheer size of the training corpus means it has inevitably seen many very similar ones.

That lets it get away with shallower reasoning. The benchmark results therefore give us no evidence that language models are acquiring the deep reasoning skills human test-takers need.

On some practical tasks, GPT-4's shallow reasoning may be good enough, but not always.

Benchmarks are widely used for comparing large models, and many have criticized them for collapsing a multidimensional evaluation into a single number.

It is unfortunate that OpenAI relied so heavily on these tests in its evaluation of GPT-4, compounded by inadequate measures against data contamination.

Reference:

https://futurism.com/the-byte/chatgpt-random-numbers

https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks

This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era).
