Updated 2025-04-04. Source: SLTechnology News&Howtos (shulou), Internet Technology.
This article explains how to implement CNN and RNN Chinese text classification with TensorFlow. It is written for readers who are new to the topic and walks through the data, the preprocessing, and the training and test results of both models.
It describes a simplified TensorFlow implementation that applies character-level CNN and RNN models to a Chinese dataset and achieves good classification results.
Training and testing use a subset of THUCNews. To obtain the dataset, see THUCTC: an efficient Chinese text classification toolkit.
This experiment uses 10 of the THUCNews categories, with 6500 articles per category.
The categories are as follows:
Sports, finance and economics, real estate, home furnishings, education, technology, fashion, current politics, games, entertainment
The dataset is split as follows (articles per category × number of categories):
Training set: 5000 × 10
Validation set: 500 × 10
Test set: 1000 × 10
To see how the subset is generated from the original dataset, refer to the two scripts under helper. copy_data.sh copies 6500 files from each category, and cnews_group.py consolidates the individual files into single files. Running these scripts produces three data files:
cnews.train.txt: training set (50000 items)
cnews.val.txt: validation set (5000 items)
cnews.test.txt: test set (10000 items)
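The consolidation step can be sketched roughly as follows. This is not the actual cnews_group.py: the directory layout (one folder per category, one article per file), the tab-separated "category, text" line format, the in-order assignment of files to splits, and the function name group_category_files are all assumptions made for illustration. Only the per-category counts come from the split described above.

```python
import os

# Per-category counts taken from the split described above.
SPLIT_SIZES = {"train": 5000, "val": 500, "test": 1000}


def group_category_files(src_dir, out_dir, split_sizes=SPLIT_SIZES):
    """Merge per-category article files into one file per split.

    Assumed (hypothetical) layout: src_dir/<category>/<file>, one article
    per file. Each output line is "<category>\t<article text>".
    """
    os.makedirs(out_dir, exist_ok=True)
    outputs = {
        name: open(os.path.join(out_dir, "cnews.%s.txt" % name), "w", encoding="utf-8")
        for name in split_sizes
    }
    try:
        for category in sorted(os.listdir(src_dir)):
            cat_dir = os.path.join(src_dir, category)
            if not os.path.isdir(cat_dir):
                continue
            files = sorted(os.listdir(cat_dir))
            start = 0
            # Assign consecutive slices of the file list to train, val, test.
            for name, size in split_sizes.items():
                for filename in files[start:start + size]:
                    with open(os.path.join(cat_dir, filename), encoding="utf-8") as f:
                        # Remove newlines and tabs so each article fits on one line.
                        text = f.read().replace("\n", "").replace("\t", "")
                    outputs[name].write(category + "\t" + text + "\n")
                start += size
    finally:
        for handle in outputs.values():
            handle.close()
```

With 10 category folders of 6500 files each, this would write 50000, 5000 and 10000 lines to the three output files, matching the counts listed above.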
Preprocessing
data/cnews_loader.py is the data preprocessing file. It provides the following functions:
read_file(): reads the file data.
build_vocab(): builds the vocabulary using a character-level representation. It stores the vocabulary on disk so that it does not have to be rebuilt on every run.
read_vocab(): reads the vocabulary stored in the previous step and converts it to a {word: id} mapping.
read_category(): fixes the list of categories and converts it to a {category: id} mapping.
to_words(): converts a piece of data represented by ids back to text.
process_file(): converts a dataset from text to fixed-length id sequences.
batch_iter(): shuffles the data and yields it in batches for neural network training.
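The core of these steps can be illustrated with a minimal sketch. It is not the code in data/cnews_loader.py: the signatures, the default vocab_size and max_length values, the reserved padding id 0, the choice to drop unknown characters, and the left-padding are all assumptions for illustration, and to_fixed_length_ids is a hypothetical helper name.

```python
import random
from collections import Counter


def build_vocab(texts, vocab_size=5000, pad_token="<PAD>"):
    """Character-level vocabulary: keep the most frequent characters.

    Id 0 is reserved for the padding token (an assumption of this sketch).
    """
    counter = Counter(ch for text in texts for ch in text)
    chars = [ch for ch, _ in counter.most_common(vocab_size - 1)]
    return {word: idx for idx, word in enumerate([pad_token] + chars)}


def to_fixed_length_ids(text, word_to_id, max_length=600):
    """Map characters to ids, skip unknown characters, pad or truncate."""
    ids = [word_to_id[ch] for ch in text if ch in word_to_id]
    ids = ids[-max_length:]  # if too long, keep the last max_length ids
    return [0] * (max_length - len(ids)) + ids  # pad on the left with 0


def batch_iter(x, y, batch_size=64, seed=None):
    """Shuffle the samples once, then yield (x_batch, y_batch) pairs."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield [x[i] for i in batch], [y[i] for i in batch]
```

Working at the character level avoids the need for a Chinese word segmenter, and fixed-length id sequences let every batch be fed to the network as a rectangular array.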
After data preprocessing, the format of the data is as follows:
CNN model
See the implementation of cnn_model.py for details.
The general structure is as follows:
Training and validation
Run python run_cnn.py train to start training.
The best accuracy on the validation set is 94.12%, and training stopped after only 3 epochs.
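Stopping this early suggests that training halts once the validation accuracy stops improving. The article does not state the exact rule, so the following is only a sketch of a common early-stopping scheme. The function name run_early_stopping and the patience parameter are assumptions, not names from run_cnn.py.

```python
def run_early_stopping(val_accuracies, patience=2):
    """Scan validation accuracies round by round and stop early.

    Stops when `patience` rounds have passed without a new best score.
    Returns (best_accuracy, best_round, rounds_run), rounds counted from 1.
    """
    best_acc = float("-inf")
    best_round = 0
    rounds_run = 0
    for round_idx, acc in enumerate(val_accuracies, start=1):
        rounds_run = round_idx
        if acc > best_acc:
            # A real training loop would save a model checkpoint here.
            best_acc, best_round = acc, round_idx
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_acc, best_round, rounds_run
```

In a real run the scores would come from evaluating the model on the validation set after each round, and the checkpoint saved at the best round is the one later used for testing.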
The accuracy and loss are shown in the figure:
Testing
Run python run_cnn.py test to test on the test set.
The accuracy on the test set is 96.04%, and the precision, recall and F1-score of every category are above 0.9.
The confusion matrix also shows that the classification quality is very good.
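For readers unfamiliar with these metrics, the sketch below shows how a confusion matrix and the per-category precision, recall and F1-score are derived from true and predicted labels. It is a plain-Python illustration, not the evaluation code used by run_cnn.py; the function names are chosen here for the example.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_report(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    report = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report.append((precision, recall, f1))
    return report


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total if total else 0.0
```

A strong classifier concentrates its counts on the diagonal of the matrix. Off-diagonal cells show which pairs of categories are confused with each other.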
RNN recurrent neural network
Configuration
The configurable RNN parameters are shown below; they are defined in rnn_model.py.
RNN model
See the implementation of rnn_model.py for details.
The general structure is as follows:
Training and validation
This part of the code is very similar to run_cnn.py, with only minor changes to the model and some directories.
Run python run_rnn.py train to start training.
If you have trained before, delete the tensorboard/textrnn directory first so that the new TensorBoard curves do not overlap with the old ones.
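This cleanup can also be done from Python before training starts. The helper below is a hypothetical convenience function and is not part of run_rnn.py; only the tensorboard/textrnn path comes from the text above.

```python
import os
import shutil


def clear_log_dir(log_dir="tensorboard/textrnn"):
    """Delete a previous TensorBoard log directory if it exists.

    Returns True if a directory was removed, False if there was none.
    """
    if os.path.isdir(log_dir):
        shutil.rmtree(log_dir)
        return True
    return False
```

Calling clear_log_dir() before python run_rnn.py train has the same effect as deleting the directory by hand.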
The best accuracy on the validation set is 91.42%. Training stopped after 8 epochs, and it is much slower than the CNN.
The accuracy and loss are shown in the figure:
Testing
Run python run_rnn.py test to test on the test set.
The accuracy on the test set is 94.22%, and the precision, recall and F1-score of every category except home furnishings are above 0.9.
The confusion matrix shows that the classification quality is very good.
Comparing the two models, the RNN performs close to the CNN in every category except home furnishings.
Better results can also be achieved by further adjusting the parameters.
To make prediction easier, predict.py in the repo provides a prediction method for the CNN model.