Thanks to CTOnews.com netizen "Hua Ke high achiever" for the tip! [New Zhiyuan Introduction] Stanford scholars found that more than 50% of GPT-4's review comments on Nature and ICLR papers overlapped with those of at least one human reviewer. Having a large model review papers for us no longer seems like a fantasy.
GPT-4 has been successfully promoted to paper reviewer!
Recently, researchers from Stanford University and other institutions fed thousands of top papers from Nature, ICLR and other venues to GPT-4, had it generate review comments and revision suggestions, and then compared them with the feedback given by human reviewers.
Paper address: https://arxiv.org/abs/2310.01783
As it turns out, GPT-4 is not only fully up to the job, it can even do it better than humans!
More than 50% of its comments are consistent with at least one human reviewer.
And 82.4% of the surveyed authors said the feedback given by GPT-4 was helpful.
James Zou, an author of the paper, concluded: we still need high-quality human feedback, but LLMs can help authors improve their first drafts before formal peer review.
GPT-4 may give you better advice than humans, so how do you get an LLM to review your manuscript?
It's very simple: extract the text from the paper's PDF and feed it to GPT-4, and it immediately generates feedback.
Specifically, the pipeline parses the PDF and extracts the title, abstract, figure and table captions, and main text.
GPT-4 is then told to follow the review format of top journals and conferences, which consists of four parts: whether the results are significant and novel, reasons to accept the paper, reasons to reject the paper, and suggestions for improvement.
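A minimal Python sketch of this step is shown below, assuming the openai package (v1 API) is installed and OPENAI_API_KEY is set; the prompt wording, the "gpt-4" model name, and the parsed_paper fields are illustrative assumptions, not the authors' exact implementation.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_review_prompt(title, abstract, captions, main_text):
    # Assemble the four-part review request from the parsed PDF sections.
    return (
        "You are a reviewer for a top scientific venue. Read the paper below and "
        "write structured feedback in four sections:\n"
        "1. Significance and novelty of the results\n"
        "2. Potential reasons for acceptance\n"
        "3. Potential reasons for rejection\n"
        "4. Suggestions for improvement\n\n"
        f"Title: {title}\n\nAbstract: {abstract}\n\n"
        f"Figure/table captions: {captions}\n\nMain text: {main_text}"
    )

def generate_feedback(parsed_paper):
    # parsed_paper is a dict produced by the PDF parsing step.
    prompt = build_review_prompt(
        parsed_paper["title"],
        parsed_paper["abstract"],
        parsed_paper["captions"],
        parsed_paper["main_text"],
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content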
As you can see from the figure below, GPT-4 gives very constructive comments, and the feedback consists of these four parts.
What are the defects of this paper?
GPT-4 bluntly pointed out that although the paper mentions the modality gap phenomenon, it neither proposes a way to narrow the gap nor demonstrates the benefits of doing so.
The researchers compared human feedback and LLM feedback on 3,096 papers from Nature-family journals and 1,709 ICLR papers.
A two-stage comment matching pipeline first extracts the individual comment points from the LLM feedback and from the human feedback, and then performs semantic text matching to find the comment points the two have in common.
The figure below shows the two-stage comment matching pipeline in detail.
For each matched pair of comments, a similarity rating is produced together with a justification.
The researchers set the similarity threshold to 7, and weakly matched comments were filtered out.
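The matching stage can be sketched as follows, assuming the comment points have already been extracted as lists of strings; the similarity prompt, the rate_similarity helper, and the naive score parsing are illustrative assumptions rather than the authors' code.

from openai import OpenAI

client = OpenAI()

def rate_similarity(llm_point, human_point):
    # Ask the model to rate how similar two review comments are on a 1-10 scale.
    prompt = (
        "On a scale of 1-10, rate how similar the following two review comments are. "
        "Reply with the number first, then a brief justification.\n\n"
        f"Comment A: {llm_point}\nComment B: {human_point}"
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return int(reply.split()[0].rstrip(".,/"))  # naive parse of the leading score

def match_comments(llm_points, human_points, threshold=7):
    # Keep only pairs whose similarity rating reaches the threshold (7 in the paper).
    matches = []
    for a in llm_points:
        for b in human_points:
            if rate_similarity(a, b) >= threshold:
                matches.append((a, b))
    return matches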
In the Nature and ICLR datasets, the average token lengths of papers and human reviews are as follows.
The user study involved 308 researchers from 110 US institutions in the fields of AI and computational biology.
Each researcher uploaded their own paper, read the feedback generated by the LLM, and then filled in their comments and impressions of that feedback.
The results show that researchers generally found the LLM-generated feedback to overlap substantially with human reviews and to be generally helpful.
If it has a shortcoming, it is that the feedback is somewhat less specific.
As shown in the figure below, about one third (30.85%) of GPT-4's comments on papers submitted to Nature coincided with comments from human reviewers.
For ICLR papers, more than one third (39.23%) of GPT-4's comments coincided with those of human reviewers.
Still, the LLM's focus differs somewhat from that of human reviewers.
Below are LLM comments and human comments on the same ICLR paper; you can see that the LLM has a sharp eye, and its comments get right to the point.
For example, regarding comparison with prior work, the human reviewer says:
The comparison is flawed. In particular, the label consistency and central consistency losses of the GNN methods are not taken into account. A fairer comparison would use a GNN method that incorporates both losses.
And GPT-4's assessment is:
The paper lacks a thorough comparison with existing methods. Although the authors compare against some baseline methods, a more comprehensive comparison is needed.
On theoretical soundness, the human reviewer gave this opinion:
With all due respect, the proof of the theory is too trivial. The final conclusion is that if the similarity is appropriate, the predicted action is accurate. Since the model is trained to learn the correct similarity, this amounts to saying that if model h is well trained, the output is correct, which is obvious.
And GPT-4's opinion is:
The authors should provide more theoretical analysis of the relationship between message passing and the consistency constraints, to make it easier for readers to understand.
On reproducibility, the human reviewers hope the paper will provide code so that other readers can reproduce the experiments.
GPT-4 agrees: "The authors should provide more details about the experimental setup to ensure the reproducibility of the study."
Users who participated in the survey generally believe that LLM feedback can help improve review accuracy and reduce the workload of human reviewers, and most of them plan to use the LLM feedback system again.
Interestingly, compared with human reviewers, LLM reviewers have their own unique characteristics.
For example, it comments on the impact of the research 7.27 times more often than human reviewers do.
Human reviewers are more likely to ask for additional ablation experiments, while the LLM tends to ask for experiments on more datasets.
Netizens responded one after another: this work is amazing!
Some said: actually, I've already been doing this; I have been using various LLMs to help me summarize and improve my papers.
Others asked: will GPT's reviews end up biased toward catering to today's peer-review standards?
It has also been asked whether overlap with human review opinions is even a useful metric.
After all, ideally reviewers should not overlap too much in their opinions; they are chosen precisely to provide different perspectives.
At the very least, though, this study shows that an LLM can indeed serve as a powerful tool for polishing papers.
Three steps to have an LLM review your manuscript.
1. Create a PDF parsing server and run it in the background:
conda env create -f conda_environment.yml
conda activate ScienceBeam
python -m sciencebeam_parser.service.server --port=8080  # Make sure this is running in the background
2. Create and run an LLM feedback server:
conda create -n llm python=3.10
conda activate llm
pip install -r requirements.txt
echo YOUR_OPENAI_API_KEY > key.txt  # Replace YOUR_OPENAI_API_KEY with your OpenAI API key starting with "sk-"
python main.py
3. Open a web browser and upload your paper:
Open http://0.0.0.0:7799, upload your paper, and you will get the LLM-generated feedback in about 120 seconds.
About the authors
Weixin Liang (Liang Weixin)
Weixin Liang is a PhD student in the Department of Computer Science at Stanford University and a member of the Stanford Artificial Intelligence Lab (SAIL), advised by Professor James Zou.
Before that, he received a master's degree in electrical engineering from Stanford University under Professor James Zou and Professor Zhou Yu, and a bachelor's degree in computer science from Zhejiang University under Professor Kai Bu and Professor Mingli Song.
He has interned at Amazon Alexa AI, Apple and Tencent, and has worked with Professors Daniel Jurafsky, Daniel A. McFarland and Serena Yeung.
Yuhui Zhang
Yuhui Zhang is a PhD student in the Department of Computer Science at Stanford University, advised by Professor Serena Yeung.
His research direction is to build multimodal artificial intelligence systems and develop creative applications that benefit from multimodal information.
Prior to that, he completed his undergraduate and master's studies at Tsinghua University and Stanford University, and worked with outstanding researchers such as Professor James Zou, Professor Chris Manning and Professor Jure Leskovec.
Hancheng Cao (Cao Hancheng)
Hancheng Cao is a sixth-year PhD student in computer science at Stanford University (with a minor in management science and engineering). He is also a member of Stanford's NLP group and human-computer interaction group, advised by Professors Dan McFarland and Michael Bernstein.
He received his bachelor's degree in electronic engineering from Tsinghua University with honors in 2018.
Since 2015, he has worked as a research assistant at Tsinghua University, mentored by Professor Li Yong and Professor Vassilis Kostakos (University of Melbourne). In the autumn of 2016, he worked under the guidance of Hanan Samet, a Distinguished University Professor at the University of Maryland. In the summer of 2017, he was an exchange student and research assistant in the Human Dynamics group at the MIT Media Lab, under the guidance of Xiaowen Dong and Professor Alex 'Sandy' Pentland.
His research interests include computational social sciences, social computing and data science.
Reference:
https://arxiv.org/abs/2310.01783