
Thoroughly beating GPT-4, instantly crushing the closed-source models! Mysterious version of Code Llama exposed

2025-04-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)11/24 Report--

[Guide from Xin Zhiyuan] Meta's open-source Code Llama is about to spark a big wave of derivative work: WizardCoder beats GPT-4 with a score of 73.2%. An OpenAI employee has revealed that Llama 3 may match GPT-4, and that it will still be open source.

Only two days after its release, Code Llama has already set off another shake-up in AI coding.

Do you remember Unnatural Code Llama, the mysterious version that appeared in Meta's Code Llama paper and can nearly match GPT-4?

The well-known researcher Sebastian explained on his blog:

It is a Code Llama-Python 34B model fine-tuned on 15,000 unnatural-language instructions.

By burying such an inconspicuous detail in the paper, Meta seems to be hinting to the open-source community that Code Llama has great potential: fine-tune it, quickly!

And just now, WizardCoder 34B, fine-tuned from Code Llama, beat GPT-4 outright on the HumanEval benchmark.

Specifically, WizardCoder scored 73.2%, beating the March version of GPT-4 (67%).

In addition, WizardCoder 34B outperforms the latest version of GPT-3.5 as well as Claude 2.

WizardCoder, a large language model for programming, was released in June by Microsoft together with Hong Kong Baptist University. Fine-tuned 13B and 7B versions are said to be coming soon.

Jim Fan, a senior scientist at Nvidia, said it is essentially an open version of Unnatural Code Llama.

Although the benchmark numbers look good, HumanEval only tests a narrow distribution and may be overfitted. What really matters is performance on data from natural scenarios. Coding benchmarks need a significant upgrade.

Has the mysterious version of Code Llama been reborn? On Friday, Meta officially open-sourced three versions of Code Llama.

In the HumanEval and MBPP benchmark charts, many people noticed a version not mentioned in Meta's official announcement: Unnatural Code Llama.

This mysterious version achieves 62.2% on HumanEval pass@1.

The fine-tuned WizardCoder 34B announced today scores 73.2% on HumanEval pass@1.

According to its introduction, WizardCoder 34B is a version of Code Llama fine-tuned with the synthetic dataset Evol-Instruct.

The following is a visual comparison of the performance of all open source and closed source models.

In the comparison with OpenAI's models, the researchers point out that GPT-4 and ChatGPT-3.5 each have two HumanEval results:

OpenAI's official GPT-4 report (2023-03-15) gives 67.0% and 48.1% respectively, while the researchers' own tests with the latest API (2023-08-26) yielded 82.0% and 72.5% respectively.
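For context, pass@1 is the standard HumanEval metric: the estimated probability that a single sampled completion passes the unit tests. A minimal sketch of the unbiased pass@k estimator introduced in OpenAI's Codex paper (the function name here is my own):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0  # every draw of k must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the fraction of correct samples:
print(pass_at_k(10, 7, 1))  # 0.7
```

A model scoring 73.2% pass@1 therefore passes roughly 120 of HumanEval's 164 problems on the first try.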

In addition, the researchers stressed that this performance result is 100% reproducible!

The demo of WizardCoder 34B is open and anyone can test it.

Some have pointed out that overfitting to public leaderboards is one of the main reasons open-source models struggle in practical applications. For example, WizardCoder's data preparation uses HumanEval pass@1 scores to decide whether to evolve the dataset further. Optimizing only for the test set defeats the purpose of having a test set.

Also yesterday, researchers from Phind fine-tuned Code Llama-34B to beat GPT-4 on the HumanEval benchmark.

ChatGPT vs. Code Llama

How does Code Llama perform in actual code tasks?

A netizen ran a comparative test of GPT-3.5 and Code Llama Instruct-34B, accessing Code Llama 34B through the service provided by Perplexity.AI.

He fed the same eight code tasks to both models and compared the quality of the generated code.

As a result, GPT-3.5 won by 8:5.

The following are the specific test results.

Question 1

Using Python: given two strings word1 and word2, merge them by adding letters in alternating order, starting with word1. If one string is longer than the other, append the extra letters to the end of the merged string. Output the merged string.

For example:

Input: word1 = "abc", word2 = "pqr" output: "apbqcr"

Both GPT-3.5 and Code Llama solved it -- 1:1
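The models' actual outputs were not published, but a reference solution might look like this sketch, using zip_longest to pad the shorter word:

```python
from itertools import zip_longest

def merge_alternately(word1: str, word2: str) -> str:
    # Interleave characters; zip_longest pads the shorter word with "".
    return "".join(a + b for a, b in zip_longest(word1, word2, fillvalue=""))

print(merge_alternately("abc", "pqr"))  # apbqcr
```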

Question 2

Using Python: given a string s, reverse only the vowels in the string and return it.

The vowels are "a", "e", "i", "o" and "u", and they can appear multiple times in both lowercase and uppercase.

For example: input: s = "hello" output: "holle"

GPT-3.5 solved it, Code Llama did not -- 2:1
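A standard two-pointer reference solution (not either model's output) might look like:

```python
def reverse_vowels(s: str) -> str:
    vowels = set("aeiouAEIOU")
    chars = list(s)
    i, j = 0, len(chars) - 1
    while i < j:
        if chars[i] not in vowels:
            i += 1          # advance until a vowel is found on the left
        elif chars[j] not in vowels:
            j -= 1          # retreat until a vowel is found on the right
        else:
            chars[i], chars[j] = chars[j], chars[i]  # swap the pair
            i += 1
            j -= 1
    return "".join(chars)

print(reverse_vowels("hello"))  # holle
```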

Question 3

Using Python: given an integer array nums, move all zeros to the end of the array while maintaining the relative order of the non-zero elements.

Note that you must do this in place, without making a copy of the array.

For example: input: nums = [0, 1, 0, 3, 12] output: [1, 3, 12, 0, 0]

GPT-3.5 solved it, Code Llama did not -- 3:1
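An in-place two-pointer sketch of this task (a common textbook approach, not the models' actual code):

```python
def move_zeroes(nums: list) -> None:
    # `write` tracks where the next non-zero element should go;
    # swapping keeps the relative order of non-zero elements.
    write = 0
    for read in range(len(nums)):
        if nums[read] != 0:
            nums[write], nums[read] = nums[read], nums[write]
            write += 1

nums = [0, 1, 0, 3, 12]
move_zeroes(nums)
print(nums)  # [1, 3, 12, 0, 0]
```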

Question 4

Using Python: you have a long flowerbed in which some plots are planted with flowers and some are not. However, flowers cannot be planted in adjacent plots. Given an integer array flowerbed containing 0s and 1s, where 0 means empty and 1 means planted, and an integer n, output true if n new flowers can be planted in the flowerbed without violating the no-adjacent-flowers rule, otherwise output false.

Example 1: input: flowerbed = [1, 0, 0, 0, 1], n = 1 output: true
Example 2: input: flowerbed = [1, 0, 0, 0, 1], n = 2 output: false

Both models solved it -- 4:2
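A greedy reference sketch (my own, under the assumption that planting in every legal gap from left to right is optimal):

```python
def can_place_flowers(flowerbed: list, n: int) -> bool:
    # Pad both ends with a virtual empty plot to simplify boundary checks.
    bed = [0] + flowerbed + [0]
    for i in range(1, len(bed) - 1):
        if bed[i - 1] == bed[i] == bed[i + 1] == 0:
            bed[i] = 1  # greedily plant here
            n -= 1
    return n <= 0

print(can_place_flowers([1, 0, 0, 0, 1], 1))  # True
print(can_place_flowers([1, 0, 0, 0, 1], 2))  # False
```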

Question 5

Using Python: given an input string s, reverse the order of the words. A word is a sequence of non-space characters; the words in s are separated by at least one space.

Output the words in reverse order, joined by a single space. Note that s may contain leading or trailing spaces, or multiple spaces between two words; the returned string must have exactly one space between words and no extra spaces.

Example: input: s = "the sky is blue" output: "blue is sky the"

Both models solved it -- 5:3
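In Python this reduces to a one-liner, since split() with no arguments already collapses runs of whitespace and trims the ends (a reference sketch, not the models' output):

```python
def reverse_words(s: str) -> str:
    # split() handles leading/trailing/multiple spaces for us.
    return " ".join(reversed(s.split()))

print(reverse_words("the sky is blue"))  # blue is sky the
```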

Question 6

Using Python: given a string s and an integer k, return the maximum number of vowels in any substring of s with length k.

The vowels in English are "a", "e", "i", "o" and "u". Example: input: s = "leetcode", k = 3 output: 2

Explanation: "lee", "eet" and "ode" each contain two vowels.

Both models solved it -- 6:4
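A sliding-window sketch of this task (my own reference solution): count vowels in the first window, then update the count in O(1) as the window slides.

```python
def max_vowels(s: str, k: int) -> int:
    vowels = set("aeiou")
    count = sum(c in vowels for c in s[:k])  # vowels in the first window
    best = count
    for i in range(k, len(s)):
        # Slide the window right: add s[i], drop s[i - k].
        count += (s[i] in vowels) - (s[i - k] in vowels)
        best = max(best, count)
    return best

print(max_vowels("leetcode", 3))  # 2
```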

Question 7

Using Python: given a string s that contains asterisks (*), in one operation you can select an asterisk in s, delete the nearest non-asterisk character to its left, and delete the asterisk itself. Output the string after all asterisks have been removed. Example: input: s = "leet**cod*e" output: "lecoe"

GPT-3.5 solved it, Code Llama did not -- 7:4
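"Delete the nearest character to the left" maps naturally onto a stack; a reference sketch (not either model's output):

```python
def remove_stars(s: str) -> str:
    stack = []
    for ch in s:
        if ch == "*":
            stack.pop()  # remove the nearest non-star char to the left
        else:
            stack.append(ch)
    return "".join(stack)

print(remove_stars("leet**cod*e"))  # lecoe
```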

Question 8

Using Python: given an integer array temperatures representing the daily temperature, return an array answer where answer[i] is the number of days you have to wait after day i for a warmer temperature. If there is no future day with a warmer temperature, set answer[i] = 0. For example: input: temperatures = [73, 74, 75, 71, 69, 72, 76, 73] output: [1, 1, 4, 2, 1, 1, 0, 0]

Both models solved it -- 8:5
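The classic O(n) approach uses a monotonic stack of indices still waiting for a warmer day; a reference sketch (my own, not the models' code):

```python
def daily_temperatures(temps: list) -> list:
    answer = [0] * len(temps)
    stack = []  # indices of days still waiting for a warmer day
    for i, t in enumerate(temps):
        # Resolve every pending day that is colder than today.
        while stack and temps[stack[-1]] < t:
            j = stack.pop()
            answer[j] = i - j
        stack.append(i)
    return answer

print(daily_temperatures([73, 74, 75, 71, 69, 72, 76, 73]))
# [1, 1, 4, 2, 1, 1, 0, 0]
```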

Regarding the two models' performance, the netizen noted that this was not a rigorous study but a simple test. Asking a model to regenerate its code can yield a better answer each time, but that was not done here.

So the test's conclusion does not reflect the ultimate performance of the two models.

Llama 3 comparable to GPT-4, and still open source

The open-source releases of Llama and Llama 2 set off a craze in the machine-learning community rivaling that of ChatGPT, and all kinds of fine-tuned models sprang up.

Jason Wei, a researcher at OpenAI, said he learned at a Meta GenAI social event that Llama 3 and Llama 4 will also be open source in the future.

We have the compute to train Llama 3 and 4. Our plan is to make Llama 3 as good as GPT-4. Wow, if Llama 3 is as good as GPT-4, will you still open-source it? Yes, we will. Sorry, alignment people.

Another netizen said that Meta wants to open source a GPT-5-level model and seems determined to keep open-sourcing all the way up to AGI.

I want to be clear about what this means: there is no kill switch.

If something goes wrong, say an agent gets out of control or a bad actor weaponizes it, there is no easy way to shut it down. It can run on any small cluster. There is no security to speak of.

Safety research becomes meaningless.

All the work people do to make AI systems honest, aligned, ethical and so on becomes meaningless. The world's AI systems will evolve toward whichever designs produce the greatest economic benefit, regardless of their values or motivations. There are no guardrails. Anyone can change an AI's values or capabilities at will, for better or worse.

If Meta keeps open-sourcing as we build more intelligent AI, it is clear to me that things will get messy. The arrival of these alien agents will already throw the world into disorder, and giving up what little control humans have will make it even worse.

As far as I can tell, Meta's push for open source stems mainly from "open-source community dogma", i.e. "open source is good". And as far as I know, they did not endorse open-sourcing until their first model, Llama, was accidentally leaked, and they have been pretending to favor it ever since.

In response, Musk said: however, LLMs built on autoregressive Transformers are extremely inefficient, not only in training but also in inference. I think they are off by several orders of magnitude.

Llama 2 coding capability soars

Llama 2 is a model with strong all-round performance.

However, it has one very obvious weakness: coding ability.

According to the data in Meta's Llama 2 paper, Llama 2's performance on HumanEval (the benchmark assessing LLM coding ability) is worse than that of GPT-3.5, let alone GPT-4.

The annotated chart comes from the original Llama 2 paper. Code capability is sure to be an important direction for the open-source community's use of Llama 2, so Meta naturally could not fall short here, and the result is Code Llama, heavily optimized for coding.

Two days ago, Meta officially released the Code Llama family (7B, 13B and 34B), with three variants: the general code model Code Llama, the instruction-following model Code Llama-Instruct, and the Python-specialized Code Llama-Python.

These models use the same license as Llama 2 and are free for academic and commercial use.

The coding ability of the Code Llama 34B model is almost twice that of Llama 2, greatly narrowing the gap with GPT-4.


Why is there no 70B Code Llama model?

Interestingly, Code Llama comes only in 7B, 13B and 34B parameter versions, lacking the 70B version that Llama 2 has.

Although Meta does not explain why in the paper, the tech blogger Sebastian offers two possible reasons:

1. Code Llama was trained on 500B tokens, while Llama 2 was trained on 2T tokens.

Since Code Llama's training data is only 1/4 that of Llama 2, a Code Llama 70B might not perform well: under the scaling laws of LLMs, there simply isn't enough training data.
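As rough back-of-the-envelope support, the Chinchilla heuristic of roughly 20 training tokens per parameter (an illustrative assumption on my part, not Meta's stated reasoning) suggests 500B tokens falls well short for a 70B model:

```python
# Chinchilla-style heuristic: ~20 training tokens per parameter.
# Illustrative only -- not Meta's stated reasoning.
def chinchilla_optimal_tokens(params: float) -> float:
    return 20 * params

for params in (34e9, 70e9):
    need = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params: ~{need / 1e12:.2f}T tokens "
          f"(Code Llama saw 0.5T)")
```

By this estimate a 70B model would want ~1.4T tokens, nearly three times Code Llama's 500B training budget.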

2. Code Llama supports a context size of 100k tokens, which is very useful for code tasks.

By contrast, Llama 2 only supports input lengths of up to 4k. If the 70B model had to support 100k-token inputs, its computational requirements might become prohibitive.
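To see why, note that the attention-score computation in a standard Transformer grows with the square of the context length, so the jump from 4k to 100k is large (a simplified illustration that ignores the linear-in-length terms):

```python
# Illustrative: self-attention score cost scales ~quadratically
# with context length (ignoring linear terms and other layers).
def attention_cost_ratio(long_ctx: int, short_ctx: int) -> float:
    return (long_ctx / short_ctx) ** 2

print(attention_cost_ratio(100_000, 4_000))  # 625.0
```

On attention alone, a 100k context is roughly 625 times more expensive per sequence than a 4k context, before any model-size increase.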

Reference:

https://twitter.com/DrJimFan/status/1695503308268740808

https://twitter.com/_philschmid/status/1695429665262084362
