2025-04-02 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)12/24 Report--
Large models are racing to expand their context windows: Llama-1 shipped with a standard 2k tokens, and now any model offering less than 100k is embarrassed to show its face.
However, an extreme stress test found that most people are not using these long contexts correctly, and are not getting the performance the AI is actually capable of.
Can AI really find a key fact buried in hundreds of thousands of words? In the test's heatmaps, the redder a cell, the more mistakes the model made.
Out of the box, both GPT-4-128k and the newly released Claude 2.1 (200k) performed poorly.
But when the Claude team learned of the results, they came up with an ultra-simple fix: adding a single sentence raised the score from 27% to 98%.
The twist is that this sentence is not added to the user's question; instead, the AI is asked to begin its reply with:
"Here is the most relevant sentence in the context:"
To put the models through this needle-in-a-haystack exam, author Greg Kamradt spent at least $150 out of his own pocket.
Fortunately, Anthropic gave him free credits for the Claude 2.1 runs, which would otherwise have cost an extra $1,016.
The testing method itself is not complicated. It uses 218 blog posts by YC founder Paul Graham as background text.
A specific sentence is inserted at different positions in the document: "The best thing to do in San Francisco is to sit in Dolores Park and eat a sandwich on a sunny day."
GPT-4 and Claude 2.1 are then asked to answer a question using only the provided context, with the test repeated across different context lengths and insertion positions.
Finally, the LangChain Evals library is used to score the answers.
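The procedure can be sketched in a few lines of Python. This is an illustrative reconstruction, not Kamradt's actual code (which is open-sourced on GitHub); `insert_needle`, `run_grid`, and `ask_model` are names of my own, and scoring with LangChain Evals is omitted.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at roughly `depth` (0.0 = start, 1.0 = end) of
    `haystack`, snapping forward to a sentence boundary ('. ')."""
    pos = int(len(haystack) * depth)
    cut = haystack.find(". ", pos)
    cut = cut + 2 if cut != -1 else len(haystack)
    return haystack[:cut] + needle + " " + haystack[cut:]

NEEDLE = ("The best thing to do in San Francisco is to sit in "
          "Dolores Park and eat a sandwich on a sunny day.")
QUESTION = "What is the best thing to do in San Francisco?"

def run_grid(essays: str, ask_model, lengths, depths):
    """Query the model over a (context length x needle depth) grid.
    `ask_model(context, question)` stands in for the actual API call;
    each answer would then be scored, e.g. with LangChain Evals."""
    results = {}
    for n in lengths:        # e.g. contexts from 1k up to 200k tokens
        for d in depths:     # e.g. 0.0, 0.1, ..., 1.0
            context = insert_needle(essays[:n], NEEDLE, d)
            results[(n, d)] = ask_model(context, QUESTION)
    return results
```

Plotting the per-cell scores of such a grid is exactly what produces the red/green heatmaps mentioned above.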
The author named the test "Needle in a Haystack" and open-sourced the code on GitHub, where it has earned 200+ stars; he also revealed that a company has sponsored testing of the next big model.
A few weeks later, Anthropic, the company behind Claude, analyzed the results carefully and found that the model was simply reluctant to answer based on a single sentence in the document, especially when that sentence was inserted artificially and had little to do with the rest of the article.
In other words, the model judged the sentence irrelevant to the article's theme and didn't bother to search for it sentence by sentence.
At this point, some means of nudging the model past this reluctance is needed: ask Claude to begin its answer with the sentence "Here is the most relevant sentence in the context:".
This trick also improves Claude's performance when retrieving sentences that were already in the original article, not just ones artificially inserted after the fact.
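In Anthropic's Messages API this is done by "prefilling" the assistant turn: the last message in the request is a partial assistant reply, and the model continues from that prefix. The payload builder below is a minimal sketch under that assumption; the function and variable names are my own.

```python
PREFILL = "Here is the most relevant sentence in the context:"

def prefilled_request(document: str, question: str) -> list:
    """Build a Messages-API-style payload whose final turn is a partial
    assistant reply; the model continues from this prefix instead of
    choosing its own opening."""
    return [
        {"role": "user",
         "content": f"{document}\n\n{question}\n"
                    "Answer using only the context provided."},
        {"role": "assistant", "content": PREFILL},
    ]

# Usage sketch (requires an API key and the `anthropic` client):
# messages = prefilled_request(long_doc, "What is the best thing to do in SF?")
# client.messages.create(model="claude-2.1", max_tokens=300, messages=messages)
```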
Anthropic says it will continue training Claude so that future versions handle such tasks better without this workaround.
Asking the AI to begin its answer with a specified prefix has other clever uses when calling the API.
Entrepreneur Matt Shumer added a few tips after seeing this trick:
If you want the AI to output pure JSON, end the prefilled reply with "{". Similarly, if you want the AI to list Roman numerals, end it with "I:".
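The JSON tip can be sketched with the same prefill mechanism. Since the API returns only the continuation, the prefilled brace must be re-attached before parsing; the helper names here are illustrative, not part of any official SDK.

```python
import json

JSON_PREFILL = "{"

def json_request(question: str) -> list:
    # The final assistant turn is the opening brace, so the model's
    # continuation is the rest of a JSON object.
    return [
        {"role": "user",
         "content": question + " Respond with a single JSON object only."},
        {"role": "assistant", "content": JSON_PREFILL},
    ]

def parse_json_reply(continuation: str) -> dict:
    # Re-attach the prefilled brace before parsing the continuation.
    return json.loads(JSON_PREFILL + continuation)
```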
But it's not over yet.
Chinese large-model companies also noticed the test and began trying it on their own models.
The Kimi team at Moonshot AI, whose model also has a very long context window, reproduced the problem but proposed a different solution and achieved good results.
Their approach modifies the user's question prompt instead of asking the AI to prepend a sentence to its answer, which is easier to apply when using a chatbot product directly rather than calling an API.
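Moonshot's exact wording is not given in this article, so the rewrite below is only a hypothetical illustration of the idea: the fix lives entirely in the user's question, which works even in a chat UI where the model's reply cannot be prefilled.

```python
def rewrite_question(question: str) -> str:
    """Hypothetical prompt rewrite (illustrative wording, not Moonshot's
    actual prompt): instruct the model to first locate the relevant
    sentence in the context, then answer from it."""
    return (question
            + " First quote the most relevant sentence from the context,"
              " then answer based on that sentence.")
```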
Moonshot AI also tested GPT-4 and Claude 2.1 with its new method: GPT-4 improved significantly, while Claude 2.1 improved only slightly.
It seems the experiment itself has some limitations, and Claude has its own peculiarities, perhaps related to Anthropic's Constitutional AI alignment approach; for Claude, the method provided by Anthropic works better.
Later, Moonshot AI engineers ran more rounds of experiments, one of which was…
Oh no. I'm the test data.
Reference link:
[1] https://x.com/GregKamradt/status/1727018183608193393
[2] https://www.anthropic.com/index/claude-2-1-prompting
This article is from the WeChat official account QbitAI (ID: QbitAI); author: Mengchen.