
How to Implement the Embedding Layer in BERT


This article shows you how the embedding layer in BERT is implemented. The content is concise and easy to follow, and hopefully the detailed walkthrough below gives you something useful to take away.

Token embedding

Purpose

As mentioned in the previous section, the role of the token embedding layer is to convert words into a fixed-dimensional vector representation. In the case of BERT, each word is represented as a 768-dimensional vector.

Implementation

Suppose the input text is "I like strawberries". The token embedding layer works as follows:

The input text is tokenized before being passed to the token embedding layer. In addition, extra tokens are added at the beginning ([CLS]) and end ([SEP]) of the token sequence. These tokens serve as an input representation for classification tasks and as separators between pairs of input texts, respectively (more details in the next section).

Tokenization is done using a method called WordPiece tokenization. This is a data-driven tokenization method that aims to strike a balance between vocabulary size and the number of out-of-vocabulary words. This is how "strawberries" gets split into "straw" and "berries". A detailed description of this approach is beyond the scope of this article; interested readers can refer to Wu et al. (2016) and section 4.1 of Schuster & Nakajima (2012). Thanks to WordPiece tokenization, BERT only needs to store 30,522 "words" in its vocabulary, and it rarely encounters out-of-vocabulary words when tokenizing English text.
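
As a concrete illustration, here is a minimal Python sketch of WordPiece tokenization with the [CLS] and [SEP] tokens added. It assumes the Hugging Face transformers library and the bert-base-uncased vocabulary, neither of which the article names explicitly, so treat it as one possible way to reproduce the behaviour described above.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and the
# bert-base-uncased vocabulary (not named in the article).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "I like strawberries"
tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]

print(tokens)          # e.g. ['[CLS]', 'i', 'like', 'straw', '##berries', '[SEP]']
print(len(tokenizer))  # 30522 entries in the WordPiece vocabulary
```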

The token embedding layer converts each WordPiece token into a 768-dimensional vector representation. This turns our six input tokens into a matrix of shape (6, 768), or a tensor of shape (1, 6, 768) if we include the batch dimension.
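
The lookup itself can be sketched with a plain embedding table. The PyTorch snippet below uses randomly initialised weights and illustrative token IDs; only the shapes are meant to match the description above.

```python
# Minimal sketch of the token embedding lookup, assuming PyTorch.
import torch
import torch.nn as nn

vocab_size, hidden_size = 30522, 768
token_embedding = nn.Embedding(vocab_size, hidden_size)  # randomly initialised here

# IDs standing in for '[CLS] i like straw ##berries [SEP]' (illustrative values)
token_ids = torch.tensor([[101, 1045, 2066, 11808, 20968, 102]])  # shape (1, 6)

token_vectors = token_embedding(token_ids)
print(token_vectors.shape)  # torch.Size([1, 6, 768])
```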

Segment embedding

Purpose

BERT can solve NLP tasks that take a pair of texts as input. An example of this kind of problem is classifying whether two texts are semantically similar. The pair of input texts is simply concatenated and fed into the model. So how does BERT distinguish between the two inputs? The answer is segment embeddings.

Implementation

Suppose our input text pair is ("I like cats", "I like dogs"). Here's how Segment embedding helps BERT distinguish between tokens in this input pair:

The segment embedding layer contains only two vectors. The first vector (index 0) is assigned to all tokens belonging to input 1, while the second vector (index 1) is assigned to all tokens belonging to input 2. If the input consists of only a single sentence, its segment embedding is simply the vector at index 0 of the segment embedding table.
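
A minimal PyTorch sketch of the segment embedding lookup follows, assuming the pair ("I like cats", "I like dogs") has already been tokenized into "[CLS] i like cats [SEP] i like dogs [SEP]"; the embedding table has only the two rows described above.

```python
# Minimal sketch of the segment embedding lookup, assuming PyTorch.
import torch
import torch.nn as nn

hidden_size = 768
segment_embedding = nn.Embedding(2, hidden_size)  # only two vectors: index 0 and index 1

# '[CLS] i like cats [SEP]' -> segment 0, 'i like dogs [SEP]' -> segment 1
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1]])  # shape (1, 9)

segment_vectors = segment_embedding(segment_ids)
print(segment_vectors.shape)  # torch.Size([1, 9, 768])
```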

Position embedding

Purpose

BERT consists of a stack of Transformers, and broadly speaking, Transformers do not encode the sequential nature of their inputs. The motivation section of https://medium.com/@init/how-self-attention-with-relatedposition-representations-works-28173b8c245a explains this point in more detail. In short, position embeddings allow BERT to understand that, given an input text such as:

I think, therefore I am

The first "I" should not have the same vector representation as the second "I"

Implementation

BERT is designed to process input sequences of up to length 512. The authors incorporate the sequential nature of the input by having BERT learn a vector representation for each position. This means the position embedding layer is a lookup table of size (512, 768), where the first row is the vector representation of any word in the first position, the second row is the vector representation of any word in the second position, and so on. Therefore, if we input "Hello world" and "Hi there", both "Hello" and "Hi" will have identical position embeddings because they are the first word in their input sequences. Similarly, the position embeddings of "world" and "there" are identical.
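
The lookup table can again be sketched as a small PyTorch snippet with randomly initialised weights; only the (512, 768) shape and the position-based indexing are meant to mirror the description above.

```python
# Minimal sketch of the position embedding lookup table, assuming PyTorch.
import torch
import torch.nn as nn

max_len, hidden_size = 512, 768
position_embedding = nn.Embedding(max_len, hidden_size)  # lookup table of shape (512, 768)

seq_len = 6                                          # e.g. '[CLS] i like straw ##berries [SEP]'
position_ids = torch.arange(seq_len).unsqueeze(0)    # [[0, 1, 2, 3, 4, 5]]

position_vectors = position_embedding(position_ids)
print(position_vectors.shape)  # torch.Size([1, 6, 768])

# The first token of *any* sequence ("Hello" or "Hi") uses row 0 of the table,
# the second token uses row 1, and so on.
```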

Combining the representations

We have seen that a tokenized input sequence of length n has three distinct representations, namely:

Token embeddings, with shape (1, n, 768), which are just the vector representations of the words.

Segment embeddings, with shape (1, n, 768), which help BERT distinguish between paired input sequences.

Position embeddings, with shape (1, n, 768), which let BERT know that its input has a sequential order.

These representations are summed element-wise to produce a single representation of shape (1, n, 768). This is the input representation that is passed to BERT's encoder layers.
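
Putting the three lookups together, a minimal sketch of the combined embedding layer might look like the module below, assuming PyTorch and randomly initialised weights. (The released BERT models additionally apply layer normalisation and dropout to the summed embeddings, which is omitted here for brevity.)

```python
# Minimal sketch of BERT's combined embedding layer, assuming PyTorch.
# Weights are randomly initialised; LayerNorm and dropout are omitted.
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)
        self.segment = nn.Embedding(2, hidden_size)
        self.position = nn.Embedding(max_len, hidden_size)

    def forward(self, token_ids, segment_ids):
        seq_len = token_ids.size(1)
        position_ids = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        # Element-wise sum of the three (1, n, 768) representations
        return self.token(token_ids) + self.segment(segment_ids) + self.position(position_ids)

embeddings = BertEmbeddings()
token_ids = torch.tensor([[101, 1045, 2066, 11808, 20968, 102]])  # illustrative IDs
segment_ids = torch.zeros_like(token_ids)                         # single-sentence input
print(embeddings(token_ids, segment_ids).shape)  # torch.Size([1, 6, 768])
```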

The above is how the embedding layer in BERT is implemented. Hopefully you have picked up some useful knowledge or skills from this walkthrough.
