
What is the CNN-based Chinese text classification algorithm


In this issue, the editor walks you through a CNN-based Chinese text classification algorithm. The article is rich in content and analyzed from a professional point of view; I hope you get something out of it after reading.

Text classification is an enduring task, with applications that include spam detection, sentiment analysis and so on.

The traditional machine learning approach is to do feature engineering first and then feed the feature vectors into a classifier (naive Bayes, SVM, neural network, etc.), as in the sketch below.
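As a point of comparison, here is a minimal sketch of that traditional pipeline using scikit-learn's TF-IDF features and a linear SVM; the library, the toy data and the class choices are my own illustration, not part of the original project:

```python
# Hypothetical illustration of the traditional pipeline: hand-built features + a classic classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["免费 领取 大奖", "明天 开会 请 准时 参加"]  # toy, pre-segmented examples
labels = [1, 0]                                          # 1 = spam, 0 = normal mail

# Feature engineering (TF-IDF) followed by an SVM classifier.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["免费 大奖"]))                        # should print [1] (spam) for this toy data
```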

With the development of deep learning and the emergence of RNNs and CNNs, feature construction is handled automatically by the network: as long as we feed the vector representation of the text into the network, feature extraction and classification are completed in one pass.

For classification tasks, a CNN is often a better fit than an RNN. At present CNNs are most widely used in image processing, but they also have applications in text processing.

Let's design a simple CNN and apply it to the Chinese spam detection task.

1.1 Basic knowledge of neural networks

If you are not familiar with deep learning or with neural networks such as RNNs and CNNs, please start here first:

http://www.wildml.com/

Find the relevant articles and read them carefully. Every article by this blogger is very good, progressing from shallow to deep, and very suitable for getting started.

1.2 How to apply CNNs to text processing

Reference: Understanding Convolutional Neural Networks for NLP

http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

1.3 CNN network structure and implementation method (must read)

Most of the CNN structure and implementation in this post follows Implementing a CNN for Text Classification in TensorFlow, which describes the structure and implementation details of the CNN at length.

2 Training data

2.1 Chinese Spam dataset

Description: the dataset is obtained from TREC06C by simple cleaning and is stored in UTF-8 format.

To download the full code and dataset:

1. Forward this article to your WeChat Moments.

2. Follow the WeChat official account datayx and reply with "text classification".

2.2 Spam

spam_5000.utf8

2.3 Normal mail (ham)

ham_5000.utf8

3 Preprocessing

3.1 Input

The two files above (spam_5000.utf8 and ham_5000.utf8)

embedding_dim (the dimension of the word embedding, i.e. how many dimensions are used to represent one word)

3.2 Output

max_document_length (the number of words in the longest message)

x (the vector representation of all messages, with dimensions [number of messages, max_document_length, embedding_dim])

y (the label of each message: [0, 1] indicates normal mail, [1, 0] indicates spam; the dimension of y is [number of messages, 2])
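For concreteness, the expected shapes can be sketched with NumPy; the sizes below are made-up placeholders, not values from the article:

```python
import numpy as np

num_messages = 10000         # hypothetical number of messages
max_document_length = 200    # hypothetical length of the longest message
embedding_dim = 128          # hypothetical embedding dimension

# x: one [max_document_length, embedding_dim] matrix per message.
x = np.zeros((num_messages, max_document_length, embedding_dim), dtype=np.float32)
# y: one-hot labels, [0, 1] = normal mail, [1, 0] = spam.
y = np.zeros((num_messages, 2), dtype=np.float32)
y[0] = [0, 1]
y[1] = [1, 0]
print(x.shape, y.shape)      # (10000, 200, 128) (10000, 2)
```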

3.3 Main steps

3.3.1 Filtering characters

For the convenience of word segmentation, the sample program removes all non-Chinese characters; you can also choose to keep other characters such as punctuation, English letters and digits.
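A minimal way to do this is a regular expression over the CJK Unified Ideographs range; the helper name below is hypothetical, not from the project code:

```python
import re

# Keep only Chinese characters; punctuation, digits and Latin letters are all dropped.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]")

def filter_chars(text):
    return _NON_CHINESE.sub("", text)

print(filter_chars("Hello, 你好123！世界"))  # -> 你好世界
```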

3.3.2 Word segmentation

To train a Word2Vec model, the training text must be segmented first. Here, for convenience, each Chinese character is simply split out on its own, so the vectors produced by word2vec are in fact character embeddings; the effect is still quite good.
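Character-level "segmentation" then just means putting a space between consecutive characters so that word2vec treats each character as a token; a tiny sketch (the function name is made up for illustration):

```python
def segment_by_char(text):
    # "你好世界" -> "你 好 世 界": each character becomes its own token.
    return " ".join(text)

print(segment_by_char("你好世界"))
```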

3.3.3 Alignment (padding)

To speed up training, the network computes in batches, so the input samples must be padded to a common size. Alignment here means extending every message to max_document_length (the number of words in the longest message) and filling the empty positions with a designated token (the sample program uses "PADDING").
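A sketch of this padding step, assuming each message is already a list of character tokens and using the same "PADDING" filler mentioned above:

```python
def pad_sentences(sentences, padding_word="PADDING"):
    # Pad every message (a list of character tokens) to the length of the longest one.
    max_document_length = max(len(s) for s in sentences)
    padded = [s + [padding_word] * (max_document_length - len(s)) for s in sentences]
    return padded, max_document_length

padded, max_len = pad_sentences([["你", "好"], ["垃", "圾", "邮", "件"]])
print(max_len, padded[0])  # 4 ['你', '好', 'PADDING', 'PADDING']
```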

3.3.4 Training word2vec

Once the text has been segmented and aligned, the word2vec model can be trained. The training process itself is not described here; see word2vec_helpers.py in the project files.
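The project's word2vec_helpers.py is not reproduced here; as a rough stand-in, training character embeddings with gensim (my choice of library for this sketch, using the gensim 4.x API) could look like this:

```python
import numpy as np
from gensim.models import Word2Vec

# padded: the list of character-token lists produced by the padding step above.
padded = [["你", "好", "PADDING", "PADDING"], ["垃", "圾", "邮", "件"]]
embedding_dim = 128

model = Word2Vec(sentences=padded, vector_size=embedding_dim, window=5, min_count=1)

# Turn every message into a [max_document_length, embedding_dim] matrix.
x = np.array([[model.wv[token] for token in sent] for sent in padded])
print(x.shape)  # (2, 4, 128)
```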

4 Defining the CNN and the training procedure

4.1 Network structure

As noted in section 1.3, most of the CNN structure and implementation here follows Implementing a CNN for Text Classification in TensorFlow; the parts that are the same are not repeated, and only the differences are described.

The CNN implemented in that article performs binary classification of English text, with an embedding layer before the convolutions to obtain the vector representation of the text.

The CNN implemented in this post is slightly modified to support classifying Chinese text. The only structural change is that the embedding layer is removed: the embedding vectors trained by word2vec are fed into the network directly for classification.
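The original project is written in low-level TensorFlow 1.x; purely as an illustrative sketch (not the author's code), the same structure, with the embedding layer removed and the pre-computed word2vec vectors fed in directly, can be written with the Keras API roughly like this (the filter sizes and counts are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

max_document_length = 200   # hypothetical values, matching the preprocessing step
embedding_dim = 128
filter_sizes = [3, 4, 5]
num_filters = 128

# No Embedding layer: the input is already the word2vec representation of a message.
inputs = tf.keras.Input(shape=(max_document_length, embedding_dim))
pooled = []
for size in filter_sizes:
    conv = layers.Conv1D(num_filters, size, activation="relu")(inputs)  # convolution
    pooled.append(layers.GlobalMaxPooling1D()(conv))                    # max-pool over time
merged = layers.concatenate(pooled)                       # one feature vector per message
merged = layers.Dropout(0.5)(merged)
outputs = layers.Dense(2, activation="softmax")(merged)   # two-class softmax matching the one-hot labels
model = tf.keras.Model(inputs, outputs)
model.summary()
```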

The network structure is shown in the diagram in the original post (not reproduced here).

4.2 Training steps

The preprocessing stage produces x and y, which are then split in a fixed proportion into a training set (train_x, train_y) and a dev set (dev_x, dev_y).

train_x is then fed into the TextCNN network in batches of batch_size. After the convolution and max-pooling of the three convolution layers, a single vector is obtained that represents the features each convolution layer has learned from the training data. That vector is passed into a single-layer network and classified with softmax to get the final result; the loss (cross entropy) is computed, backpropagation runs, and one batch gradient-descent step updates the network parameters.
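A hedged sketch of that loop using the Keras model above (not the original TF1 training code): split x and y, then let fit handle the batching, the cross-entropy loss and the gradient updates. The batch size and epoch count are placeholders:

```python
from sklearn.model_selection import train_test_split

# x, y come from preprocessing; model is the TextCNN sketched in section 4.1.
train_x, dev_x, train_y, dev_y = train_test_split(x, y, test_size=0.1)

model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # cross entropy on the one-hot labels
              metrics=["accuracy"])

model.fit(train_x, train_y,
          batch_size=64,                        # hypothetical batch size
          epochs=10,
          validation_data=(dev_x, dev_y))
```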

5 Results

Accuracy and error curves: (plots from the original post omitted)

Because the dataset has no standard train/test split, this post simply holds out 10% of the data for testing and does not filter out duplicate documents, which is why the accuracy reaches about 99%. With a more standard dataset plus cross-validation and similar measures the accuracy would drop, but it should still exceed most classifiers built with traditional machine-learning methods.

That is the editor's overview of the CNN-based Chinese text classification algorithm. If you have similar questions, the analysis above may help you understand it; if you want to learn more, you are welcome to follow the industry information channel.
