How strong is Google's Gemini, really? Carnegie Mellon University has delivered a professional, objective third-party comparison. To ensure fairness, all models were run with the same prompts and the same generation parameters, and the team provides reproducible code and fully transparent results.
It also avoids comparing one model's 5-shot results against another's CoT@32, as Google did at Gemini's official launch.
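To illustrate what "same prompts and same generation parameters" means in practice, here is a minimal sketch of such a comparison harness. The query functions, parameter values, and demo grader are hypothetical stand-ins, not the CMU team's actual code (which is linked from the paper):

```python
# Minimal sketch of a fair comparison harness: every model receives the
# identical prompt and identical generation parameters. The two query
# functions are hypothetical placeholders for real API wrappers.

GEN_PARAMS = {"temperature": 0.0, "max_tokens": 512}  # assumed values

def query_gemini(prompt: str, **params) -> str:
    return "stub gemini answer"  # replace with a real Gemini API call

def query_gpt(prompt: str, **params) -> str:
    return "stub gpt answer"     # replace with a real OpenAI API call

MODELS = {"gemini-pro": query_gemini, "gpt-3.5-turbo": query_gpt}

def evaluate(examples):
    """examples: list of (prompt, grading_fn) pairs, shared by all models."""
    scores = {}
    for name, query in MODELS.items():
        correct = sum(grade(query(prompt, **GEN_PARAMS))
                      for prompt, grade in examples)
        scores[name] = correct / len(examples)
    return scores

demo = [("What is 2 + 2?", lambda ans: "4" in ans)]
print(evaluate(demo))
```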
Bottom line: Gemini Pro comes close to, but falls slightly short of, GPT-3.5 Turbo, while GPT-4 remains far ahead.
The in-depth analysis also turned up some odd quirks of Gemini, such as a fondness for answering "D" on multiple-choice questions...
Many researchers marveled that such a detailed evaluation appeared only a few days after Gemini's release.
In-depth testing on six major tasks

The evaluation compares six major task categories, each with corresponding datasets (a loading sketch follows the list):
Knowledge QA: MMLU
Reasoning: BIG-Bench Hard
Mathematics: GSM8K, SVAMP, ASDiv, MAWPS
Code: HumanEval, ODEX
Translation: FLORES
Web navigation: WebArena
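Most of these benchmarks are publicly available. As a sketch, a few of them can be pulled with the Hugging Face datasets library; the dataset IDs below are the commonly used public copies, an assumption rather than necessarily the exact sources CMU used:

```python
# Sketch: loading some of the benchmarks above from the Hugging Face Hub.
# The dataset IDs are the commonly used public copies (an assumption);
# the CMU paper links its own exact evaluation code and data.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")        # knowledge QA
gsm8k = load_dataset("gsm8k", "main", split="test")          # math word problems
humaneval = load_dataset("openai_humaneval", split="test")   # code generation

print(mmlu[0]["question"], mmlu[0]["choices"])
print(gsm8k[0]["question"])
print(humaneval[0]["prompt"])
```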
Knowledge QA: a preference for option D

As the results show, chain-of-thought prompting does not necessarily bring improvement on this kind of task.
The MMLU dataset consists entirely of multiple-choice questions, and further analysis of the results revealed a strange phenomenon: Gemini prefers to answer D.
The GPT models' answers are spread much more evenly across the four options; the team suggests this may be because Gemini was not instruction-tuned on large amounts of multiple-choice data.
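A bias like this is easy to check once you have each model's predicted letters. A minimal sketch (the prediction strings below are made-up illustration data):

```python
# Sketch: measuring how evenly a model's multiple-choice answers are
# distributed. A heavy skew toward one letter, as reported for Gemini
# and option D, shows up immediately in the counts.
from collections import Counter

def answer_distribution(predicted_letters):
    """Return the fraction of answers falling on each of A-D."""
    counts = Counter(predicted_letters)
    total = sum(counts.values())
    return {letter: counts.get(letter, 0) / total for letter in "ABCD"}

# Made-up illustration: a D-skewed model vs. a balanced one.
print(answer_distribution(list("DDADBDCDDD")))  # skewed toward D
print(answer_distribution(list("ABCDABCDAB")))  # roughly balanced
```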
In addition, Gemini's safety filtering is fairly aggressive: it answered only 85% of the questions on moral scenarios, and only 28% of the questions related to human sexuality.
The two subjects in which Gemini Pro outperformed GPT-3.5 were security studies and high-school microeconomics, but the gaps were small, and the team saw nothing particularly noteworthy in them.
Reasoning: weak on long problems
Gemini Pro does not perform well on longer, more complex questions, while the GPT models are more robust.
GPT-4 Turbo in particular shows almost no performance drop even on longer questions, indicating a strong ability to understand complex problems.
By question type, Gemini is especially bad at tasks like tracking_shuffled_objects, in which several people repeatedly swap items and the model must work out who ends up holding what (for example: Alice, Bob, and Claire each start with a ball and trade twice; who is holding the green ball at the end?).
Gemini is good at tasks requiring world knowledge about sports, manipulating stacks of symbols, sorting words alphabetically, and parsing tables.
Mathematics: ahead on the most complex tasks
Here too, when the problem statements themselves are very long, Gemini Pro's and GPT-3.5's performance drop together, and only GPT-4 maintains its usual level.
However, on the problems requiring the longest chain-of-thought reasoning, Gemini surpasses GPT-3.5.
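The underlying analysis is straightforward to reproduce in spirit: bucket each question by the length of its reference chain of thought and compute per-bucket accuracy. A hedged sketch, where the record fields and bucket edges are assumptions:

```python
# Sketch: accuracy as a function of chain-of-thought length, the kind of
# breakdown behind the "Gemini overtakes GPT-3.5 on the longest chains"
# finding. The record fields and bucket edges here are assumptions.
from collections import defaultdict

def accuracy_by_cot_length(records, edges=(0, 50, 100, 200, 10_000)):
    """records: iterable of dicts with 'cot_tokens' (int) and 'correct' (bool)."""
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for r in records:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= r["cot_tokens"] < hi:
                buckets[(lo, hi)][0] += r["correct"]
                buckets[(lo, hi)][1] += 1
                break
    return {b: c / t for b, (c, t) in sorted(buckets.items()) if t}

demo = [{"cot_tokens": 30, "correct": True}, {"cot_tokens": 250, "correct": False}]
print(accuracy_by_cot_length(demo))
```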
Code: good at matplotlib

On code problems, Gemini does poorly on questions with long reference answers.
Broken down by the libraries being invoked, the GPT models are stronger for most of them, but not for matplotlib.
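Code benchmarks like HumanEval score by functional correctness: the generated completion counts as a pass only if the assembled program runs its held-out tests without error. A stripped-down, unsandboxed sketch of that idea (real harnesses isolate execution for safety):

```python
# Sketch: HumanEval-style functional-correctness scoring. A completion
# passes only if the assembled program runs its test cases without error.
# Real harnesses sandbox this exec() call; never run untrusted model
# output like this outside an isolated environment.

def passes_tests(prompt: str, completion: str, test_code: str) -> bool:
    program = prompt + completion + "\n" + test_code
    try:
        exec(program, {"__name__": "__main__"})  # UNSAFE outside a sandbox
        return True
    except Exception:
        return False

# Tiny made-up example in the HumanEval format.
prompt = "def add(a, b):\n"
completion = "    return a + b\n"
test_code = "assert add(2, 3) == 5"
print(passes_tests(prompt, completion, test_code))  # True
```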
Translation: high quality whenever it answers

On the translation task, Gemini refuses to answer for 12 language pairs, but whenever it does answer the quality is high, and its overall performance exceeds GPT-4's.
The refused translations mainly involve Latin and Arabic.
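Scoring a model that sometimes refuses means deciding what to do with the refusals; the headline numbers score quality on the pairs that were actually answered. A hedged sketch using sacrebleu's chrF metric, a standard choice for FLORES-style evaluation (the refusal marker string and toy data are illustrative assumptions):

```python
# Sketch: scoring translations while separating out refusals. chrF via
# sacrebleu is a standard metric for FLORES-style evaluation; the refusal
# marker string and the toy data below are illustrative assumptions.
import sacrebleu

REFUSAL_MARKER = "[BLOCKED]"  # assumed placeholder for a refused response

def chrf_on_answered(hypotheses, references):
    pairs = [(h, r) for h, r in zip(hypotheses, references) if h != REFUSAL_MARKER]
    answered_hyps = [h for h, _ in pairs]
    answered_refs = [r for _, r in pairs]
    score = sacrebleu.corpus_chrf(answered_hyps, [answered_refs]).score
    refusal_rate = 1 - len(pairs) / len(hypotheses)
    return score, refusal_rate

hyps = ["Das ist ein Test.", "[BLOCKED]"]
refs = ["Das ist ein Test.", "Dies ist ein Beispiel."]
print(chrf_on_answered(hyps, refs))
```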
Web navigation: better across sites

WebArena simulates an internet environment for AI agents, including e-commerce sites, social forums, GitLab collaborative development, a content management system, and online maps; the AI must find information or complete tasks that span multiple sites.
Gemini underperforms GPT-3.5 Turbo overall, but does slightly better on the tasks that span multiple sites.
Netizens: but it's free

Finally, Graham Neubig, an associate professor at CMU, acknowledged some limitations of the study:
The behavior of API-based models can change at any time.
Only a limited number of prompts were tried, and the best prompt may differ from model to model.
There is no way to control for test-set leakage.
Denny Zhou (Zhou Dengyong), who heads Google's large-model reasoning team, pointed out that setting Gemini's temperature to 0 on reasoning tasks can raise scores by 5-10 percentage points.
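Setting the temperature to 0 makes the model decode greedily instead of sampling. With the google-generativeai Python package, that looks roughly like the sketch below; the package and model names reflect the library at Gemini Pro's launch, and the prompt and key are placeholders:

```python
# Sketch: greedy decoding (temperature = 0) with the Gemini API, the
# setting Denny Zhou suggests for reasoning tasks. Package and model
# names reflect the google-generativeai library at Gemini Pro's launch.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key
model = genai.GenerativeModel("gemini-pro")

response = model.generate_content(
    "If I have 3 apples and eat one, how many are left? Think step by step.",
    generation_config=genai.types.GenerationConfig(temperature=0.0),
)
print(response.text)
```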
Besides the Gemini and GPT series, the test also covered Mixtral, the open-source MoE model that has drawn much attention recently.
However, reinforcement-learning expert Noam Brown believes the Mixtral results should be disregarded, because they came from a third-party API rather than the official implementation.
Mistral AI's founder also showed up, offering the team official API access in the belief that it would produce better results.
All in all, Gemini Pro falls short of GPT-3.5, but it is free for up to 60 calls per minute.
As a result, quite a few individual developers have already switched over.
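A 60-calls-per-minute free tier works out to one request per second; a simple client-side throttle keeps a batch job under that limit. This is a generic sketch, not an official SDK feature:

```python
# Sketch: a simple client-side throttle for a 60-requests/minute quota.
# Sleeping to maintain at most one call per second is crude but keeps a
# batch evaluation job inside the free tier. Generic code, not SDK code.
import time

def throttled(calls_per_minute: int = 60):
    min_interval = 60.0 / calls_per_minute
    last_call = [0.0]

    def wrap(fn):
        def inner(*args, **kwargs):
            wait = last_call[0] + min_interval - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return fn(*args, **kwargs)
        return inner
    return wrap

@throttled(calls_per_minute=60)
def call_model(prompt: str) -> str:
    return "stub response"  # replace with a real API call

for _ in range(3):
    print(call_model("hello"))
```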
The top-tier version of Gemini, Ultra, has not yet been released, and the CMU team plans to keep testing. Do you think Gemini Ultra can reach GPT-4's level?
Paper:
https://arxiv.org/abs/2312.11444
Reference link:
[1] https://twitter.com/gneubig/status/1737108977954251216