How to Implement CNN-RNN Chinese Text Classification Based on TensorFlow

2025-04-04 Update From: SLTechnology News&Howtos

Many newcomers are unsure how to implement CNN and RNN Chinese text classification with TensorFlow. This article explains the process in detail for anyone who needs it.

This is a simplified TensorFlow implementation that classifies Chinese text with character-level CNN and RNN models and achieves good results on a Chinese dataset.

A subset of THUCNews is used for training and testing. To obtain the dataset, see THUCTC: an efficient Chinese text classification toolkit.

Training uses 10 of its categories, with 6500 documents in each.

The categories are as follows:

Sports, finance and economics, real estate, home furnishings, education, technology, fashion, current politics, games, entertainment

The dataset is divided as follows:

Training set: 5000 * 10

Validation set: 500 * 10

Test set: 1000 * 10

For the process of generating the subset from the original dataset, see the two scripts under helper: copy_data.sh copies 6500 files from each category, and cnews_group.py merges multiple files into a single file. Running them produces three data files:

cnews.train.txt: training set (50000 items)

cnews.val.txt: validation set (5000 items)

cnews.test.txt: test set (10000 items)
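The arithmetic behind the split can be checked with a short sketch. This is only an illustration, not the code in the helper scripts: `split_category` is a hypothetical helper, and the assumption that each category is sliced in order (first 5000 for training, next 500 for validation, last 1000 for testing) is not stated in the article.

```python
def split_category(docs, n_train=5000, n_val=500, n_test=1000):
    """Split one category's documents into train/val/test slices (hypothetical helper)."""
    needed = n_train + n_val + n_test
    if len(docs) < needed:
        raise ValueError(f"need {needed} documents, got {len(docs)}")
    train = docs[:n_train]
    val = docs[n_train:n_train + n_val]
    test = docs[n_train + n_val:needed]
    return train, val, test


categories = 10
per_category = [f"doc{i}" for i in range(6500)]  # stand-in for one category's 6500 files
train, val, test = split_category(per_category)
totals = (len(train) * categories, len(val) * categories, len(test) * categories)
print(totals)  # (50000, 5000, 10000)
```

The totals match the sizes of the three data files listed above.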

Preprocessing

data/cnews_loader.py is the data preprocessing file.

read_file(): reads the file data

build_vocab(): builds the vocabulary using a character-level representation; the function saves the vocabulary to disk so it does not have to be rebuilt on every run

read_vocab(): reads the vocabulary saved in the previous step and converts it to a {word: id} mapping

read_category(): fixes the category list and converts it to a {category: id} mapping

to_words(): converts a piece of data represented by ids back to text

process_file(): converts a dataset from text to a fixed-length id sequence representation

batch_iter(): shuffles the data and prepares batches for neural network training
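The core of these steps can be sketched in plain Python. This is a simplified illustration rather than the code in data/cnews_loader.py: the signatures, the `<PAD>` token, the `encode` helper, and the choice to left-pad and keep the last `max_length` ids are all assumptions made for the example.

```python
from collections import Counter
import random


def build_vocab(texts, vocab_size=5000, pad_token="<PAD>"):
    """Keep the most frequent characters; id 0 is reserved for padding."""
    counts = Counter(ch for text in texts for ch in text)
    chars = [ch for ch, _ in counts.most_common(vocab_size - 1)]
    return {ch: i for i, ch in enumerate([pad_token] + chars)}


def encode(text, char_to_id, max_length):
    """Map characters to ids, skip unknown ones, keep the last max_length ids, left-pad with 0."""
    ids = [char_to_id[ch] for ch in text if ch in char_to_id]
    ids = ids[-max_length:]
    return [0] * (max_length - len(ids)) + ids


def batch_iter(x, y, batch_size, seed=None):
    """Shuffle the examples once, then yield (x_batch, y_batch) slices."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [x[i] for i in idx], [y[i] for i in idx]


texts = ["体育新闻", "财经新闻"]
vocab = build_vocab(texts)
x = [encode(t, vocab, max_length=6) for t in texts]  # every row has length 6
for x_batch, y_batch in batch_iter(x, [0, 1], batch_size=1, seed=0):
    print(x_batch, y_batch)
```

Working at the character level avoids the need for a Chinese word segmenter, at the cost of longer sequences.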

After data preprocessing, the format of the data is as follows:

CNN model

See the implementation of cnn_model.py for details.

The general structure is as follows:
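As a rough illustration only, and assuming the common character-level layout of embedding, 1-D convolution, global max pooling, and a dense classifier (cnn_model.py remains the authoritative definition), the convolution and pooling steps compute something like the following. The function names and toy numbers are made up for the example.

```python
def conv1d_relu(seq, kernel, bias=0.0):
    """Valid 1-D convolution of one filter over a sequence of embedding vectors, then ReLU.

    seq: list of T vectors of length D; kernel: list of K vectors of length D.
    Returns T - K + 1 activations.
    """
    k = len(kernel)
    out = []
    for t in range(len(seq) - k + 1):
        total = bias
        for offset in range(k):
            total += sum(w * v for w, v in zip(kernel[offset], seq[t + offset]))
        out.append(max(0.0, total))
    return out


def global_max_pool(feature_map):
    """Keep the strongest response of a filter anywhere in the text."""
    return max(feature_map)


# Toy input: 4 characters, each embedded in 2 dimensions.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # one filter spanning 2 characters
feature_map = conv1d_relu(embedded, kernel)
pooled = global_max_pool(feature_map)
print(feature_map, pooled)  # [2.0, 1.0, 1.0] 2.0
```

In a full model, many such filters run in parallel and their pooled values feed the classifier layers.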

Training and validation

Run python run_cnn.py train to start training.

The best accuracy on the validation set is 94.12%, and training stopped after only 3 rounds of iteration.
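Stopping this early is typical of early stopping on validation accuracy. The sketch below assumes a patience-based rule; the article does not state the exact criterion used by run_cnn.py, and the function name, the patience value, and the accuracies are made up for illustration.

```python
def train_with_early_stopping(val_accuracies, patience=2):
    """Return (best_accuracy, rounds_run) for a stream of per-round validation accuracies.

    Training stops once `patience` consecutive rounds fail to beat the best accuracy so far.
    """
    best = float("-inf")
    rounds_without_gain = 0
    rounds_run = 0
    for acc in val_accuracies:
        rounds_run += 1
        if acc > best:
            best = acc
            rounds_without_gain = 0  # a real training loop would save a checkpoint here
        else:
            rounds_without_gain += 1
            if rounds_without_gain >= patience:
                break
    return best, rounds_run


# Made-up accuracies, not the article's training log.
best, rounds_run = train_with_early_stopping([0.90, 0.93, 0.92, 0.91, 0.95])
print(best, rounds_run)  # 0.93 4
```

The last value (0.95) is never reached because the rule gives up after two rounds without improvement, which is the trade-off a small patience value makes.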

The accuracy and error are shown in the figure:

Testing

Run python run_cnn.py test to test on the test set.

The accuracy on the test set is 96.04%, and precision, recall, and f1-score are above 0.9 for every category.

The confusion matrix also shows that the classification quality is very good.
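For readers who want to see how these metrics relate to the confusion matrix, here is a minimal pure-Python sketch. It is not the evaluation code in run_cnn.py; the function names and the toy labels are assumptions for the example.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[i][j] counts examples whose true class is i and predicted class is j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_report(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(matrix)
    report = []
    for c in range(n):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column sum
        actual = sum(matrix[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report.append((precision, recall, f1))
    return report


# Toy labels for three classes; not the article's results.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred, num_classes=3)
report = per_class_report(matrix)
accuracy = sum(matrix[c][c] for c in range(3)) / len(y_true)
```

Diagonal entries are correct predictions, so a matrix dominated by its diagonal corresponds to high precision and recall in every class.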

RNN (recurrent neural network) configuration

The configurable RNN parameters are defined in rnn_model.py.
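A configuration for a model of this kind might be grouped as in the sketch below. Only num_classes = 10 comes from the article; the class name, every other field, and every other value are placeholders assumed for illustration, so check rnn_model.py for the real settings.

```python
from dataclasses import dataclass


@dataclass
class RNNConfigSketch:
    """Hypothetical stand-in for the configuration in rnn_model.py; values are placeholders."""
    num_classes: int = 10           # from the article: 10 categories
    vocab_size: int = 5000          # assumed size of the character vocabulary
    seq_length: int = 600           # assumed fixed length of each id sequence
    embedding_dim: int = 64         # assumed size of each character embedding
    num_layers: int = 2             # assumed number of stacked recurrent layers
    hidden_dim: int = 128           # assumed units per recurrent layer
    cell_type: str = "gru"          # assumed cell: "gru" or "lstm"
    dropout_keep_prob: float = 0.8  # assumed keep probability for dropout
    learning_rate: float = 1e-3     # assumed optimizer step size
    batch_size: int = 128           # assumed examples per batch
    num_epochs: int = 10            # assumed upper bound on training rounds


config = RNNConfigSketch(hidden_dim=256)  # override one field, keep the rest
```

Keeping the hyperparameters in one object makes it easy to change a single value when tuning.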

RNN model

See the implementation of rnn_model.py for details.

The general structure is as follows:

Training and validation

This part of the code is very similar to run_cnn.py, with only minor changes to the model and some directories.

Run python run_rnn.py train to start training.

If you have trained before, delete the tensorboard/textrnn directory first so that TensorBoard results from different runs do not overlap.

The best accuracy on the validation set is 91.42%. Training stopped after 8 rounds of iteration, and it was much slower than the CNN.
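The cleanup can be scripted. This is a small sketch using only the standard library; `clear_tensorboard_logs` is a hypothetical helper, and the path is taken relative to the directory from which the training script is run.

```python
import shutil
from pathlib import Path


def clear_tensorboard_logs(log_dir="tensorboard/textrnn"):
    """Delete an old TensorBoard log directory if it exists; return True if something was removed."""
    path = Path(log_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False
```

Calling `clear_tensorboard_logs()` before `python run_rnn.py train` starts each run with an empty log directory; it does nothing if the directory is absent.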

The accuracy and error are shown in the figure:

Testing

Run python run_rnn.py test to test on the test set.

The accuracy on the test set is 94.22%, and precision, recall, and f1-score are above 0.9 for every category except home furnishings.

The confusion matrix shows that the classification quality is very good.

Comparing the two models, the RNN performs close to the CNN in every category except home furnishings.

Further parameter tuning may yield better results.

To make prediction easier, predict.py in the repo provides a prediction method for the CNN model.
