
How to use pre-trained word embeddings in Python


This article explains how to use pre-trained word embeddings in Python. The explanation is kept simple and clear, so just follow along step by step to learn how pre-trained word embeddings work and how to use them.

What are pre-trained word embeddings?

Let's answer the big question directly: what exactly are pre-trained word embeddings?

Pre-trained word embeddings are embeddings learned on one task and then reused to solve a different task.

These embeddings are trained on large datasets, saved, and then used to solve other tasks. That is why pre-trained word embeddings are a form of transfer learning.

Transfer learning, as the name implies, means transferring what was learned on one task to another. What is transferred can be either weights or embeddings. Here, what we transfer are the embeddings, hence the name pre-trained word embeddings. When weights are transferred, we speak of a pre-trained model.

But why do we need pre-trained word embeddings in the first place? Why can't we learn our own embeddings from scratch? I will answer these questions in the next section.

Why do we need pre-trained word embeddings?

Because they are trained on huge datasets, pre-trained word embeddings capture both the semantic and the syntactic meaning of words, and they can noticeably improve the performance of natural language processing (NLP) models. They are useful for competition data and, of course, for real-world problems.

But why don't we just learn our own embeddings? Well, learning word embeddings from scratch is a challenging problem for two main reasons:

Sparse training data

A large number of trainable parameters

Sparse training data

The first reason is the scarcity of training data. Most real-world problems come with datasets that contain a large number of rare words. Embeddings learned from such datasets cannot arrive at a proper representation of those words.

To learn good representations, the dataset would have to contain a rich vocabulary.

A large number of trainable parameters

Secondly, when learning embeddings from scratch, the number of trainable parameters increases, which slows down training. Learning embeddings from scratch may also leave you with an unclear representation of the words.

The solution to both problems is to use pre-trained word embeddings. Let's look at the different pre-trained word embeddings in the next section.

Different pre-trained word embedding models

I will roughly divide embeddings into two categories: word-level and character-level embeddings. ELMo and Flair are examples of character-level embeddings. In this article, we will cover two popular word-level pre-trained word embeddings:

Google's Word2Vec

Stanford's GloVe

Let's see how Word2Vec and GloVe work.

Google's Word2Vec

Word2Vec, developed by Google, is one of the most popular pre-trained word embeddings. It is trained on the Google News dataset (about 100 billion words). It has several use cases, such as recommendation engines, word similarity, and various text classification problems.

The architecture of Word2Vec is very simple: a feed-forward neural network with a single hidden layer. For this reason it is sometimes called a shallow neural network.

Depending on how the embeddings are learned, Word2Vec comes in two flavors:

Continuous bag-of-words (CBOW) model

Skip-gram model

The continuous bag-of-words (CBOW) model learns the focus word given its neighboring words, while the Skip-gram model learns the neighboring words given the focus word.

The CBOW and Skip-gram models are therefore mirror images of each other.

For example, consider the sentence: "I have failed at times but I never stopped trying". Suppose we want to learn the embedding of the word "failed". The focus word here is "failed".

The first step is to define a context window. The context window is the number of words that appear around the focus word. The words that fall inside the context window are called the neighboring words (or context). Let's set the context window to 2.

CBOW model: Input = [I, have, at, times], Output = failed

Skip-gram model: Input = failed, Output = [I, have, at, times]

As you can see, CBOW takes multiple words as input and produces a single word as output, while Skip-gram takes a single word as input and produces multiple words as output.

So the architecture is defined by these inputs and outputs. Remember, though, that each word is fed into the model as a one-hot vector.
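
To make these inputs and outputs concrete, here is a minimal illustrative sketch (not code from the original article) that builds the CBOW and Skip-gram training pairs for the example sentence with a context window of 2:

# Illustrative sketch: build (context, focus) pairs for CBOW and
# (focus, context) pairs for Skip-gram, window size = 2.
sentence = "I have failed at times but I never stopped trying".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, focus in enumerate(sentence):
    # neighboring words inside the context window
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    cbow_pairs.append((context, focus))                   # many words in, one word out
    skipgram_pairs.extend((focus, c) for c in context)    # one word in, many words out

print(cbow_pairs[2])        # (['I', 'have', 'at', 'times'], 'failed')
print(skipgram_pairs[5:9])  # [('failed', 'I'), ('failed', 'have'), ('failed', 'at'), ('failed', 'times')]

In the actual Word2Vec model, each of these words is one-hot encoded and passed through the single hidden layer, whose learned weights become the word embeddings.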

Stanford's GloVe

The basic idea behind GloVe embeddings is to derive the relationships between words from global statistics.

But how can statistics represent meaning? Let me explain.

One of the simplest ways is to look at a co-occurrence matrix. A co-occurrence matrix tells us how often a particular pair of words appears together: each value in the matrix is the count of a pair of words occurring together.

For example, consider the corpus "play cricket, I love cricket and I love football" and count how often each pair of its words appears together within a small window.
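
The original matrix is not reproduced here, but the idea is easy to sketch in code. The following is purely illustrative (the window size and tokenization are my own assumptions, so the exact counts depend on those choices):

# Illustrative sketch: count how often each pair of words co-occurs
# within a small window. Window size 2 is an assumption for illustration.
from collections import defaultdict

corpus = "play cricket I love cricket and I love football".split()
window = 2

cooc = defaultdict(int)
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            cooc[(word, corpus[j])] += 1

print(cooc[("play", "cricket")])   # times "cricket" appears near "play"
print(cooc[("love", "cricket")])   # times "cricket" appears near "love"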

From such a matrix, we can easily compute the probability of a pair of words. For simplicity, let's focus on the word "cricket":

P(cricket | play) = 1

P(cricket | love) = 0.5

Next, we compute the probability ratio:

P(cricket | play) / P(cricket | love) = 2

When the ratio is greater than 1, we can infer that the word most relevant to cricket is "play" rather than "love". Similarly, if the ratio were close to 1, both words would be equally related to cricket.
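
Continuing the illustrative sketch above (same assumptions about the window), the conditional probabilities and their ratio can be computed directly from the co-occurrence counts:

# Illustrative: P(target | given) estimated from the co-occurrence counts above.
def cond_prob(target, given):
    row_total = sum(count for (w, _), count in cooc.items() if w == given)
    return cooc[(given, target)] / row_total if row_total else 0.0

ratio = cond_prob("cricket", "play") / cond_prob("cricket", "love")
print(ratio)   # a ratio above 1 says "cricket" is more strongly tied to "play" than to "love"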

So with simple statistics we can discover the relationships between words. This is the idea behind GloVe pre-trained word embeddings.

Case study: learning embeddings from scratch vs. pre-trained word embeddings

Let's compare learning our own embeddings from scratch with using pre-trained word embeddings through a case study, and see whether pre-trained word embeddings improve the performance of an NLP model.

So let's pick a text classification problem: sentiment analysis of movie reviews.

Load the dataset into Jupyter:

# import libraries
import pandas as pd
import numpy as np

# read the csv files
train = pd.read_csv('Train.csv')
valid = pd.read_csv('Valid.csv')

# split into texts and labels for the train and validation sets
x_tr, y_tr = train['text'].values, train['label'].values
x_val, y_val = valid['text'].values, valid['label'].values

Prepare the data:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()

# prepare the vocabulary
tokenizer.fit_on_texts(list(x_tr))

# convert the texts into integer sequences
x_tr_seq = tokenizer.texts_to_sequences(x_tr)
x_val_seq = tokenizer.texts_to_sequences(x_val)

# pad the sequences to the same length
x_tr_seq = pad_sequences(x_tr_seq, maxlen=100)
x_val_seq = pad_sequences(x_val_seq, maxlen=100)
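
As a quick sanity check (purely illustrative, and assuming the tokenizer has been fit as above), you can inspect what the integer sequences and the padding look like for a single review:

# Illustrative: convert one text and pad it, then inspect the result.
sample = ["the movie was surprisingly good"]
sample_seq = tokenizer.texts_to_sequences(sample)   # words mapped to integer indices
sample_pad = pad_sequences(sample_seq, maxlen=100)  # zero-padded (on the left by default) to length 100
print(sample_seq)
print(sample_pad.shape)   # (1, 100)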

Let's take a look at the size of the vocabulary in the training data:

size_of_vocabulary = len(tokenizer.word_index) + 1  # +1 for the padding index
print(size_of_vocabulary)

Output: 112204

We will build two NLP models with the same architecture. The first learns its embeddings from scratch; the second uses pre-trained word embeddings.

Define the architecture that learns embeddings from scratch:

# deep learning libraries
from keras.models import *
from keras.layers import *
from keras.callbacks import *

model = Sequential()

# embedding layer (learned from scratch)
model.add(Embedding(size_of_vocabulary, 300, input_length=100, trainable=True))

# LSTM layer
model.add(LSTM(128, return_sequences=True, dropout=0.2))

# global max pooling
model.add(GlobalMaxPooling1D())

# dense layers
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# add loss function, metrics, optimizer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=["acc"])

# add callbacks
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=3)
mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', save_best_only=True, verbose=1)

# print a summary of the model
print(model.summary())

Output:

The total number of trainable parameters in the model is 33,889,169, of which the embedding layer alone contributes 112,204 x 300 = 33,661,200. That is far too many parameters!

Train the model:

history = model.fit(np.array(x_tr_seq), np.array(y_tr), batch_size=128, epochs=10,
                    validation_data=(np.array(x_val_seq), np.array(y_val)),
                    verbose=1, callbacks=[es, mc])

Evaluate the performance of the model:

# load the best model
from keras.models import load_model
model = load_model('best_model.h5')

# evaluation
_, val_acc = model.evaluate(x_val_seq, y_val)
print(val_acc)

Output: 0.865

Now it's time to build the second version, with GloVe pre-trained word embeddings. Let's load the GloVe embeddings into our environment:

# load the entire GloVe embedding into memory
embeddings_index = dict()
f = open('../input/glove6b/glove.6B.300d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

Output: Loaded 400000 word vectors.
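
Each entry of embeddings_index is now a 300-dimensional numpy vector. As a quick illustration (assuming the example words below exist in the GloVe vocabulary, which the 400,000-word glove.6B files normally cover), related words tend to have a high cosine similarity:

# Illustrative: look up two vectors and compare them with cosine similarity.
v_king = embeddings_index['king']
v_queen = embeddings_index['queen']
print(v_king.shape)   # (300,)

cosine = np.dot(v_king, v_queen) / (np.linalg.norm(v_king) * np.linalg.norm(v_queen))
print(cosine)         # values closer to 1 mean more similar words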

Create an embedding matrix by assigning the pre-trained word embeddings to the words in our vocabulary:

# create a weight matrix for the words in the vocabulary
embedding_matrix = np.zeros((size_of_vocabulary, 300))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
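
Not every word in the training vocabulary necessarily has a GloVe vector; the rows for missing words simply stay all zeros. A small illustrative check of the coverage:

# Illustrative: count how many vocabulary words received a pre-trained vector.
covered = sum(1 for word in tokenizer.word_index if word in embeddings_index)
print(covered, "of", size_of_vocabulary - 1, "vocabulary words have a GloVe vector")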

Define the architecture with pre-trained embeddings:

model = Sequential()

# embedding layer initialized with the GloVe weights and frozen
model.add(Embedding(size_of_vocabulary, 300, weights=[embedding_matrix], input_length=100, trainable=False))

# LSTM layer
model.add(LSTM(128, return_sequences=True, dropout=0.2))

# global max pooling
model.add(GlobalMaxPooling1D())

# dense layers
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# add loss function, metrics, optimizer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=["acc"])

# add callbacks
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=3)
mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', save_best_only=True, verbose=1)

# print a summary of the model
print(model.summary())

Output:

As you can see, the number of trainable parameters is now only 227,969, a huge drop compared with the first model, because the frozen embedding layer no longer contributes trainable parameters.

Train the model:

history = model.fit(np.array(x_tr_seq), np.array(y_tr), batch_size=128, epochs=10,
                    validation_data=(np.array(x_val_seq), np.array(y_val)),
                    verbose=1, callbacks=[es, mc])

Evaluate the performance of the model:

# load the best model
from keras.models import load_model
model = load_model('best_model.h5')

# evaluation
_, val_acc = model.evaluate(x_val_seq, y_val)
print(val_acc)

Output: 0.8849

Using pre-trained word embeddings improves performance compared with learning the embeddings from scratch: validation accuracy rises from 86.5% to 88.49%.

Thank you for reading. That covers how to use pre-trained word embeddings in Python. After working through this article you should have a deeper understanding of the topic; the specifics are best verified in practice.
