
Transformer's seminal paper "overturned"? The figure is inconsistent with the code


Shulou (Shulou.com) 11/24 report --

Original title: "Transformer's seminal paper shockingly 'overturned'? The figure and the code don't match, and the mysterious bug left people dumbfounded."

Papers that don't match their code are quite "common", but this time the mismatch shows up in the original Transformer paper itself.

Today, the AI community's feeds were flooded by news of a shocking "overturning".

Netizens found that the figure in "Attention Is All You Need", Google Brain's foundational NLP work and the founding paper of the Transformer architecture, is inconsistent with the code.

Paper address: https://arxiv.org/abs/1706.03762

Since its publication in 2017, the Transformer has become the cornerstone of the AI field; even the real power behind the red-hot ChatGPT is the Transformer.

Google also applied for a patent specifically for it in 2019.

Tracing back to the origin, every flavor of GPT (Generative Pre-trained Transformer) stems from this 2017 paper.

According to Google Scholar, this groundbreaking work has been cited more than 70,000 times so far.

So, is the cornerstone of ChatGPT shaky?

As the "founder of the mountain" paper, the structure diagram is wrong? Sebastian Raschka, the founder of Lightning AI and a machine learning researcher, found that Transformer's diagram in this paper is wrong.

In the part circled in the figure, the LayerNorms come after the attention and fully connected (feed-forward) layers. Placing layer normalization between the residual blocks in this way leads to large expected gradients for the parameters near the output layer.

Also, this is inconsistent with the code.
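To make the two arrangements concrete, here is a minimal PyTorch sketch of the Post-LN layout that the figure shows: the residual is added first, and LayerNorm is then applied to the sum. The class name, module choices, and sizes (512-dim model, 8 heads, 2048-dim feed-forward) are illustrative assumptions, not code taken from the paper or from tensor2tensor.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """One encoder block with Post-LN: normalize *after* each residual addition."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Attention sublayer, then LayerNorm over (input + sublayer output).
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        # Feed-forward sublayer, same Post-LN pattern.
        x = self.norm2(x + self.ffn(x))
        return x

# Quick shape check: batch of 2 sequences, 10 tokens, width 512.
x = torch.randn(2, 10, 512)
print(PostLNBlock()(x).shape)  # torch.Size([2, 10, 512])
```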

Code address: https://github.com/tensorflow/tensor2tensor/commit/f5c9b17e617ea9179b7d84d36b1e8162cb369f25#diff-76e2b94ef16871bdbf46bf04dfe7f1477bafb884748f08197c9cf1b10a4dd78e

Some netizens pointed out, however, that Noam Shazeer corrected the code a few weeks later.

Later, Sebastian added that, according to the paper "On Layer Normalization in the Transformer Architecture", Pre-LN performs better and can solve the gradient problem.

This is what many or most architectures use in practice, but it can lead to representation collapse.

If layer normalization is instead placed before the attention and fully connected layers, inside the residual branch, the gradients are better behaved.
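As a rough sketch of that alternative (same illustrative names and sizes as above, still not the paper's code), the Pre-LN version moves the LayerNorms onto the sublayer inputs and leaves the residual path untouched:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One encoder block with Pre-LN: normalize each sublayer's input."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Normalize the input of the attention sublayer, then add the raw residual.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Same pattern for the feed-forward sublayer.
        x = x + self.ffn(self.norm2(x))
        return x
```

Because the residual path is never normalized, gradients can flow straight from the output back to the input embeddings, which is the intuition behind the better-behaved training.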

Sebastian suggested that while discussions about using Post-LN or Pre-LN are still ongoing, a new paper proposes to combine the two.

Paper address: https://arxiv.org/abs/2304.14802

In this dual-residual Transformer, both the representation-collapse and vanishing-gradient problems are resolved.
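As a loose, hypothetical illustration of the dual-residual idea only (a rough reading of "keeping both streams", not necessarily the exact formulation in the ResiDual paper), one can run a Post-LN stream that feeds each sublayer while a second, unnormalized stream accumulates the raw sublayer outputs and is normalized just once at the very end:

```python
import torch
import torch.nn as nn

class DualResidualBlock(nn.Module):
    """Hypothetical block updating a Post-LN stream and an unnormalized stream."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x_post, x_dual):
        # Attention sublayer: Post-LN update for x_post, plain accumulation for x_dual.
        y = self.attn(x_post, x_post, x_post, need_weights=False)[0]
        x_post, x_dual = self.norm1(x_post + y), x_dual + y
        # Feed-forward sublayer, same pattern.
        y = self.ffn(x_post)
        x_post, x_dual = self.norm2(x_post + y), x_dual + y
        return x_post, x_dual

class DualResidualEncoder(nn.Module):
    def __init__(self, n_layers=6, d_model=512):
        super().__init__()
        self.blocks = nn.ModuleList(
            [DualResidualBlock(d_model) for _ in range(n_layers)]
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x_post, x_dual = x, x
        for block in self.blocks:
            x_post, x_dual = block(x_post, x_dual)
        # The unnormalized stream is normalized only here and merged back in,
        # keeping a clean gradient path (Pre-LN-like) alongside the per-layer
        # normalization of the Post-LN stream.
        return x_post + self.final_norm(x_dual)
```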

Netizens heatedly discussed the paper's questionable points. One pointed out: isn't there already a Pre-LN and a Post-LN sandwiched in between?

Sebastian replied that he found it a bit strange too. Maybe the second LN refers to the final output layer rather than to every Transformer block, but he wasn't sure about that.

One netizen said: "We often come across papers that don't match their code or results. Most of it comes down to honest mistakes, but sometimes it's very strange. This paper has been around for so long that it's really odd this question was never raised before."

Sebastian said that, to be fair, the most original version of the code was indeed consistent with the figure, but the code was modified in 2017 while the figure was never updated, so it is all quite confusing.

One netizen said that NormFormer had already presented a less complex architecture, and that his team had recently confirmed its results. It is surprising that the ResiDual paper does not mention NormFormer anywhere.

Meanwhile, netizens kept appearing in the comments to confirm that the LN used in Transformers is applied differently from the way it is used in CNNs.

So, is there really a flaw in the paper, or is this all just a false alarm?

Let's wait and see what happens.

Reference:

https://twitter.com/rasbt/status/1655575611979489282

This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era).
