Shulou (Shulou.com) 11/24 report --
At present, generative search engines cannot yet replace traditional search engines: too few of their sentences carry provenance citations, and the citations they do produce are often inaccurate.
Shortly after the release of ChatGPT, Microsoft jumped aboard and launched the "new Bing". Its stock rose sharply, and the move was widely seen as a direct challenge to Google, ushering in a new era for search engines.
But is the new Bing really the right way to use a large language model? Are the generated answers actually useful to users? And how credible are the citations attached to each sentence?
Recently, Stanford researchers collected a large number of user queries from different sources and conducted a manual evaluation of four popular generative search engines: Bing Chat, NeevaAI, perplexity.ai, and YouChat.
Link to paper: https://arxiv.org/pdf/2304.09848.pdf

The experimental results show that responses from existing generative search engines are fluent and informative, but often contain unsupported statements and inaccurate citations.
On average, only 51.5% of the generated sentences are fully supported by their citations, and only 74.5% of the citations actually support their associated sentences.
The researchers argue that these numbers are far too low for systems that may become users' primary tool for information seeking, especially since some generated sentences are merely plausible; generative search engines still need substantial improvement.
Homepage: https://cs.stanford.edu/~nfliu/

Lead author Nelson Liu is a fourth-year PhD student in the Stanford Natural Language Processing Group, advised by Percy Liang, and holds a bachelor's degree from the University of Washington. His main research direction is building practical NLP systems, especially for information-seeking applications.
Don't blindly trust generative search engines

In March 2023, Microsoft reported that "roughly one third of daily preview users use [Bing] Chat every day" and that Bing Chat served 45 million chats in the first month of its public preview. In other words, integrating large language models into search engines clearly has market demand, and may well reshape how users enter the web.
However, existing generative search engines built on large language models still suffer from low accuracy, and because that accuracy has not been rigorously measured, the limitations of these new search engines remain poorly understood.
Verifiability is the key to making search engines trustworthy: attaching an external citation link to every sentence of a generated answer, as supporting evidence, makes it much easier for users to check the answer's accuracy.
By collecting questions of different types from different sources, the researchers manually evaluated four commercial generative search engines: Bing Chat, NeevaAI, perplexity.ai, and YouChat.
The evaluation focuses on four metrics.

Fluency: whether the generated text is coherent.

Perceived utility: whether the response is helpful to the user, i.e. whether the information in the answer resolves the query.

Citation recall: the percentage of generated sentences about the external world that are supported by citations.

Citation precision: the proportion of generated citations that support their associated sentences.
Fluency: annotators were shown the user query, the generated response, and the statement "the response is fluent and cohesive", and rated their agreement on a five-point Likert scale.
Perceived utility: similar to fluency, annotators rated how strongly they agree with the statement that the response is a helpful and informative answer to the user's query.
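As a rough illustration of how such Likert annotations could be aggregated per engine, here is a minimal Python sketch; the record layout and field names (engine, fluency, utility) are hypothetical, not the paper's actual annotation schema.

# Minimal sketch: averaging five-point Likert ratings per engine.
# The data layout below is a hypothetical illustration.
from collections import defaultdict
from statistics import mean

ratings = [
    {"engine": "YouChat", "fluency": 5, "utility": 5},
    {"engine": "Bing Chat", "fluency": 4, "utility": 4},
    {"engine": "Bing Chat", "fluency": 5, "utility": 4},
]

scores = defaultdict(lambda: {"fluency": [], "utility": []})
for r in ratings:
    scores[r["engine"]]["fluency"].append(r["fluency"])
    scores[r["engine"]]["utility"].append(r["utility"])

for engine, s in scores.items():
    print(f"{engine}: fluency={mean(s['fluency']):.2f}, utility={mean(s['utility']):.2f}")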
Citation recall: the proportion of verification-worthy sentences that are fully supported by their associated citations. Computing this metric therefore requires first identifying which sentences in the response are worth verifying, and then assessing whether each of them is supported by its citations.
When identifying verification-worthy sentences, the researchers take the view that every generated sentence about the external world is worth verifying, even seemingly obvious, trivial common sense, because what seems obvious to some readers may still be false.
A search engine should aim to provide a source for every generated sentence about the external world, so that readers can easily verify any claim in the response; verifiability should not be sacrificed for brevity.
In practice, annotators therefore verify every generated sentence, except first-person statements about the system itself, such as "As a language model, I do not have the ability to do that", and questions directed at the user, such as "Would you like to know more?", etc.
Evaluating "whether a statement worth verifying is fully supported by its relevant citation" can be based on the attribution identified source (AIS, attributable to identified sources) evaluation framework, and the tagger makes binary tagging, that is, if an ordinary listener agrees that "a citation-based web page can be obtained." then the citation can fully support the reply.
Citation precision: to measure precision, annotators determine whether each citation provides full, partial, or no support for its associated sentence.
Full support: all of the information in the sentence is supported by the citation.

Partial support: some of the information in the sentence is supported by the citation, but other parts are missing or contradicted.

No support: the cited webpage is entirely irrelevant to the sentence or contradicts it.
For sentences with multiple associated citations, an additional annotator uses the AIS framework to make a binary judgment of whether all the cited webpages, taken together, fully support the sentence.
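The precision side can be sketched the same way. The snippet below is one plausible reading of the scoring described above, with hypothetical field names: a citation counts as precise if it fully supports its sentence, and a partially supporting citation is credited only when the sentence's citations taken together provide full support.

# Minimal sketch of citation precision under the three-way support labels.
# Field names and the exact crediting rule are assumptions for illustration.
def citation_precision(citations):
    if not citations:
        return 0.0
    precise = 0
    for c in citations:
        if c["support"] == "full":
            precise += 1
        elif c["support"] == "partial" and c["union_fully_supports"]:
            # Partial support counts only when the citations jointly
            # provide full support (the extra AIS judgment above).
            precise += 1
    return precise / len(citations)

cites = [
    {"support": "full", "union_fully_supports": True},
    {"support": "partial", "union_fully_supports": True},
    {"support": "none", "union_fully_supports": False},
]
print(round(citation_precision(cites), 3))  # 0.667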
On fluency and perceived utility, the experimental results show that every engine generates fluent and seemingly helpful responses.
Among the individual engines, Bing Chat receives the lowest fluency and perceived-utility ratings (4.40), followed by NeevaAI (4.43), perplexity.ai (4.51), and YouChat (4.59).
Across query types, short extractive questions, which usually only require factual answers, tend to yield more fluent responses than long-form ones; harder questions that require synthesizing multiple tables or webpages lower overall fluency.
On citations, existing generative search engines frequently fail to cite webpages completely or correctly: on average, only 51.5% of generated sentences are fully supported by their citations (recall), and only 74.5% of citations fully support their associated sentences (precision).
These numbers are unacceptable for search systems that already serve millions of users, especially given how much information the generated responses typically contain.
Citation recall and precision also vary widely across generative search engines: perplexity.ai achieves the highest recall, while NeevaAI, Bing Chat, and YouChat score lower.
Bing Chat, on the other hand, achieves the highest citation precision (89.5), followed by perplexity.ai (72.7), NeevaAI (72.0), and YouChat (63.6).
Across query distributions, citation recall differs by nearly 11 points between NaturalQuestions queries with long answers (58.5) and non-NaturalQuestions queries (47.8).
Similarly, the recall gap between NaturalQuestions queries with and without short answers is close to 10 points: 63.4 for queries with short answers, 53.6 for queries with only long answers, and 53.4 for queries with neither long nor short answers.
For questions without supporting webpages, citation rates drop further: on open-ended AllSouls essay questions, the generative search engines achieve a citation recall of only 44.3.
Reference:
https://arxiv.org/abs/2304.09848
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).