How to implement news text classification with naive Bayes and LSTM in Python

This article explains how to implement news text classification in Python, first with a naive Bayes classifier and then with an LSTM network. It goes into considerable detail and should be a useful reference for anyone interested in the topic.

Data processing and analysis

The data provided in this competition is in CSV format and can be read with the pandas library in Python. To get a more intuitive view of the data, I computed the average document length and the number of documents per label. (How the sen and tag dictionaries are built is shown later; this step only visualizes the data distribution and can be skipped at runtime.)

import matplotlib.pyplot as plt
from tqdm import tqdm
import time
from numpy import *
import pandas as pd

print('count: 200000')
# sen: dictionary mapping each label to the 2D list of all its sentences
print('average: ' + str(sum([sum([len(sen[i][j]) for j in range(len(sen[i]))]) for i in sen]) / 200000))

x = []
y = []
for key, value in tag.items():   # tag: dictionary mapping each label to the number of sentences under it
    x.append(key)
    y.append(value)
plt.bar(x, y)
plt.show()

Finally, we get the following results:

The average document length is about 907 words, and the number of documents per label decreases steadily from label 0 to label 13.

Text classification based on machine learning: naive Bayes

1. Model introduction

The basic idea of a naive Bayes classifier is to use the joint probability of features and categories to estimate the probability that a given document belongs to each category. The text is assumed to follow a unigram (bag-of-words) model: the occurrence of a word in the text depends only on the document's category, not on the other words or on the document length; in other words, the words are conditionally independent of one another. By Bayes' rule, the probability that a document Doc belongs to class Ci is
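(The formula referred to here appears only as an image in the original article; by Bayes' rule it should read P(Ci | Doc) = P(Ci) · P(Doc | Ci) / P(Doc).)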

The document Doc is represented as a TF vector, i.e. each component of the document vector V is the frequency with which the corresponding feature appears in the document. The probability that Doc belongs to class Ci is then
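(The original image is again not reproduced; under the word-independence assumption and the TF representation the intended formula is presumably

P(Ci | Doc) ∝ P(Ci) · ∏_{ti ∈ V} P(ti | Ci)^{TF(ti, Doc)},

which the code below evaluates in log form as log P(Ci) + Σ TF(ti, Doc) · log P(ti | Ci).)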

Here TF(ti, Doc) is the frequency of feature ti in document Doc. To prevent words that never occur in a class from driving the probability to zero, P(ti | Ci) is taken as the Laplace estimate of the conditional probability of feature ti in class-Ci documents:
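(The missing formula is, in its usual Laplace-smoothed form,

P(ti | Ci) = (TF(ti, Ci) + 1) / (Σ_{tj ∈ V} TF(tj, Ci) + |V|),

which corresponds to the +1 in the numerator and the dictionary-size term in the denominator used in the code later.)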

Here, TF(ti, Ci) is the frequency of feature ti in class-Ci documents, and |V| is the size of the feature set, that is, the total number of different features contained in the document representation.

2. Code structure

I read the file directly with Python's built-in open() function, build the corresponding dictionaries, and construct the stop-word list, which contains every word in the words dictionary that appears in more than 100,000 documents. The training set takes the first 190,000 documents and the test set the last 10,000.

train_df = open('./data/train_set.csv').readlines()[1:]
train = train_df[0:190000]
test = train_df[190000:200000]
true_test = open('./data/test_a.csv').readlines()[1:]
tag = {str(i): 0 for i in range(0, 14)}
sen = {str(i): {} for i in range(0, 14)}
words = {}
stop_words = {'4149': 1, '1519': 1, '2465': 1, '7539': 1, ...}   # list truncated in the original
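The stop_words dictionary above is hard-coded (and truncated in the original). Since the text says it consists of the words that appear in more than 100,000 documents, here is a minimal sketch of how such a list could be derived; this is not the author's code, and it uses per-document frequencies rather than the raw occurrence counts kept in words:

```python
# Hypothetical sketch: derive stop_words as the tokens that occur in more than
# 100,000 of the 200,000 training documents (document frequency, not raw counts).
doc_freq = {}
for line in train_df:                                # train_df: the readlines() list built above
    tokens = set(line.split('\t')[1].split(' '))     # unique tokens in this document
    for t in tokens:
        doc_freq[t] = doc_freq.get(t, 0) + 1
stop_words = {t: 1 for t, c in doc_freq.items() if c > 100000}
```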

Next, we need to build a tag dictionary and a sentence dictionary, and use the tqdm function to show the progress.

for line in tqdm(train_df):
    cur_line = line.split('\t')
    cur_tag = cur_line[0]
    tag[cur_tag] += 1
    cur_line = cur_line[1][:-1].split(' ')
    for i in cur_line:
        if i not in words:
            words[i] = 1
        else:
            words[i] += 1
        if i not in sen[cur_tag]:
            sen[cur_tag][i] = 1
        else:
            sen[cur_tag][i] += 1

To simplify the calculation, I define the following helper functions: mul() computes the product of all numbers in a list, prob_clas() computes the class prior P(Ci), and probability() computes P(ti | Ci). In probability(), adding 1 to the numerator and the dictionary size to the denominator implements Laplace smoothing.

def mul(l):
    res = 1
    for i in l:
        res *= i
    return res

def prob_clas(clas):
    return tag[clas] / (sum([tag[i] for i in tag]))

def probability(char, clas):   # P(feature | category)
    if char not in sen[clas]:
        num_char = 0
    else:
        num_char = sen[clas][char]
    return (1 + num_char) / (len(sen[clas]) + len(words))

With the preparation done and the functions defined, we compute, for each sentence in the test set, the probability of each of the 14 labels, store the label with the highest probability in the prediction list, and display progress with tqdm.

PRED = []
for line in tqdm(true_test):
    result = {str(i): 0 for i in range(0, 14)}
    cur_line = line[:-1].split(' ')
    for i in result:
        prob = []
        for j in cur_line:
            if j in stop_words:
                continue
            prob.append(log(probability(j, i)))
        result[i] = log(prob_clas(i)) + sum(prob)
    for key, value in result.items():
        if value == max(result.values()):
            pred = int(key)
    PRED.append(pred)

Finally, the results are written to a CSV file and submitted to the competition site to check the score. (A CSV written this way contains an index column; open the file and delete the first column before uploading.)

res = pd.DataFrame()
res['label'] = PRED
res.to_csv('test_TL.csv')

3. Result analysis

While training on the first 190,000 documents and testing on the last 10,000, I repeatedly adjusted the stop-word list and compared TF and TF-IDF vector representations. The TF representation gave higher accuracy. In the end the stop words were taken as the words that appear in more than 100,000 documents, and the best accuracy on this split was 0.622.

After submitting to the competition site, the F1 score on the 50,000-document test set was only about 0.29, which is a poor result.
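The TF vs. TF-IDF comparison mentioned above is not shown in the article's code. As a hedged sketch (not the author's code), the same experiment could be run with scikit-learn's built-in vectorizers and multinomial naive Bayes on the same 190,000/10,000 split; the max_features value is an assumption:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=200000)
train_text, test_text = df['text'][:190000], df['text'][190000:]
train_y, test_y = df['label'][:190000], df['label'][190000:]

# Fit the same classifier on TF counts and on TF-IDF weights and compare accuracy.
for name, vec in [('TF', CountVectorizer(max_features=5000)),
                  ('TF-IDF', TfidfVectorizer(max_features=5000))]:
    x_train = vec.fit_transform(train_text)
    x_test = vec.transform(test_text)
    clf = MultinomialNB().fit(x_train, train_y)
    print(name, accuracy_score(test_y, clf.predict(x_test)))
```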

Text classification based on deep learning: LSTM

1. Model introduction

In addition to the traditional machine-learning approach, I use an LSTM (Long Short-Term Memory) network from deep learning to tackle the news classification task, hoping for higher accuracy. LSTM is a kind of recurrent neural network that is well suited to processing and predicting events separated by relatively long intervals or delays in a time series. LSTM has many applications: LSTM-based systems have been used for machine translation, robot control, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, chatbots, prediction of diseases, click-through rates and stock prices, music synthesis, and more. I use the deep-learning library Keras to build the LSTM model and classify the text.

In an ordinary recurrent network (RNN), the hidden state is multiplied by the weight matrix W again and again as the time steps accumulate. If a weight is a number close to 0 or greater than 1, repeated multiplication makes the value extremely small or extremely large, so the back-propagated gradients vanish or explode and the model becomes very hard to train. In other words, a plain RNN has poor memory for long-range information, which is why LSTM was introduced.
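A tiny numeric illustration of this effect (not from the article; the weights 0.9 and 1.1 and the 100 steps are arbitrary choices):

```python
# Repeatedly multiplying by a weight slightly below or above 1 drives the value
# toward 0 or toward infinity, which is what happens to gradients propagated
# through many time steps.
value_small, value_large = 1.0, 1.0
for _ in range(100):
    value_small *= 0.9   # weight a bit below 1
    value_large *= 1.1   # weight a bit above 1
print(value_small)   # about 2.7e-05 -> vanishing
print(value_large)   # about 1.4e+04 -> exploding
```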

The LSTM network solves this problem much better. Inside an LSTM unit there are four network layers (the yellow boxes in the usual diagram), each with its own weights; the layers marked σ are sigmoid layers, and tanh is the activation function. The red circles denote point-wise (element-wise) operations. The cell state interacts only lightly with the rest of the unit as it passes through, so most of its information is preserved, and it is modified only through the control gates. The first gate is the forget gate, which decides what information to discard from the cell state. The second is the update (input) gate, which decides what new information to store in the cell state. The last is the output gate, which decides what value to output. In short, an LSTM unit consists of a cell state plus a set of gates that update it, so that part of the information is passed on to the hidden state.
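The figure this paragraph describes is not reproduced here. For reference, the standard LSTM update equations (usual notation; σ is the sigmoid function, ⊙ is element-wise multiplication) are:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)        (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)        (update/input gate)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)     (candidate cell state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t            (cell state update)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)        (output gate)
h_t = o_t ⊙ tanh(c_t)                       (hidden state passed to the next step)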

2. Code structure

The first step is setting up the initial data and importing the packages. Since the average document length is about 900 words, the maximum read length max_len is set to roughly 2/3 of that (600 tokens); learning performance can then be tuned by adjusting this parameter.

from tqdm import tqdm
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import *
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.optimizers import rmsprop_v2
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
import os.path

max_words = 7549   # maximum size of the dictionary
max_len = 600      # maximum sentence length; model quality and training speed can be tuned via max_len
stop_words = {}

Next, we define a function that converts a DataFrame into matrices. It outputs a two-dimensional array of documents, each padded or truncated to length 600 (max_len), together with the corresponding one-hot label values.

def to_seq(dataframe):
    x = []
    # one-hot encode the label (14 classes)
    y = array([[0] * int(i) + [1] + [0] * (13 - int(i)) for i in dataframe['label']])
    for i in tqdm(dataframe['text']):
        cur_sentense = []
        for word in i.split(' '):
            if word not in stop_words:   # stop-word list (not used in the final version)
                cur_sentense.append(word)
        x.append(cur_sentense)
    return sequence.pad_sequences(x, maxlen=max_len), y
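As a hedged aside (not the author's code): the manual one-hot encoding built in to_seq() is equivalent to Keras' built-in helper, assuming integer labels 0-13; the example labels below are made up purely for illustration.

```python
from keras.utils import to_categorical
from numpy import array

labels = array([0, 3, 13])                   # example integer labels
y = to_categorical(labels, num_classes=14)   # shape (3, 14), one 1 per row
print(y.shape)
```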

Next comes the main model function. Its inputs are the test documents, the test-set ground truth, the training set and the validation set; its output is the confusion matrix of the predictions. For a detailed description, see the comments in the code below.

def test_file(text, value, train, val):
    ## define the LSTM model
    inputs = Input(name='inputs', shape=[max_len])
    ## Embedding(vocabulary size, embedding dimension, input length per news item)
    layer = Embedding(max_words + 1, 128, input_length=max_len)(inputs)
    layer = LSTM(128)(layer)
    layer = Dense(128, activation="relu", name="FC1")(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(14, activation="softmax", name="FC2")(layer)
    model = Model(inputs=inputs, outputs=layer)
    model.summary()
    model.compile(loss="categorical_crossentropy", optimizer=rmsprop_v2.RMSprop(), metrics=["accuracy"])
    ## training starts after the model is built; if a saved model file (.h6 format)
    ## already exists, it is loaded directly
    if os.path.exists('my_model.h6') == True:
        model = load_model('my_model.h6')
    else:
        train_seq_mat, train_y = to_seq(train)
        val_seq_mat, val_y = to_seq(val)
        model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,   # accuracy and speed can be tuned via the number of epochs
                  validation_data=(val_seq_mat, val_y))
        model.save('my_model.h6')
    ## prediction
    test_pre = model.predict(text)
    ## compute the confusion matrix
    confm = metrics.confusion_matrix(argmax(test_pre, axis=1), argmax(value, axis=1))
    print(metrics.classification_report(argmax(test_pre, axis=1), argmax(value, axis=1)))
    return confm
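Note that EarlyStopping is imported above but never used in the article's code. A hedged sketch of how it could be added to the fit call inside test_file() (the patience value and restore_best_weights choice are my assumptions, not the author's):

```python
from keras.callbacks import EarlyStopping

# Stop training when the validation loss has not improved for 2 epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
# model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
#           validation_data=(val_seq_mat, val_y),
#           callbacks=[early_stop])
```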

The training process is shown in the figure below.

To show the result more intuitively, the following function is defined to plot the confusion matrix as a heatmap.

def plot_fig(matrix):
    labname = [str(i) for i in range(14)]
    plt.figure(figsize=(8, 8))
    sns.heatmap(matrix.T, square=True, annot=True, fmt='d',
                cbar=False, linewidths=.8, cmap="YlGnBu")
    plt.xlabel('True label', size=14)
    plt.ylabel('Predicted label', size=14)
    plt.xticks(arange(14) + 0.5, labname, size=12)
    plt.yticks(arange(14) + 0.3, labname, size=12)
    plt.show()
    return

Finally, we just need to read the CSV file with pandas, split it proportionally into training, validation and test sets (here 15:2:3), and the whole prediction process can be run.

def test_main():
    train_df = pd.read_csv("./data/train_set.csv", sep='\t', nrows=200000)
    train = train_df.iloc[0:150000, :]
    test = train_df.iloc[150000:180000, :]
    val = train_df.iloc[180000:200000, :]
    test_seq_mat, test_y = to_seq(test)
    confm = test_file(test_seq_mat, test_y, train, val)
    plot_fig(confm)

Once the parameter settings with the best prediction results have been found, we train on the whole train_set file as follows. Before training, delete the existing saved model file (.h6). The test split in this function can be chosen arbitrarily; it is only there to supply the arguments required by test_file(). The function's sole purpose is to train and save the model with the best settings.

def train():
    train_df = pd.read_csv("./data/train_set.csv", sep='\t', nrows=200000)
    train = train_df.iloc[0:170000, :]
    test = train_df.iloc[0:14000, :]    # arbitrary slice, only needed to fill test_file()'s arguments
    val = train_df.iloc[170000:200000, :]
    test_seq_mat, test_y = to_seq(test)
    confm = test_file(test_seq_mat, test_y, train, val)
    plot_fig(confm)

With the best training configuration saved, we can start predicting. We feed the test set provided by the competition into the model, load the saved weights, and obtain the prediction matrix. The index of the maximum value in each row of the prediction matrix is then converted into the corresponding label, stored in an output list, and finally written to 'test_DL.csv' for upload. (As with the previous model, the generated CSV has an index column that must be opened and deleted manually before uploading.)

def pred_file():
    test_df = pd.read_csv('./data/test_a.csv')
    test_seq_mat = sequence.pad_sequences([i.split(' ') for i in tqdm(test_df['text'])], maxlen=max_len)
    inputs = Input(name='inputs', shape=[max_len])
    ## Embedding(vocabulary size, embedding dimension, input length per news item)
    layer = Embedding(max_words + 1, 128, input_length=max_len)(inputs)
    layer = LSTM(128)(layer)
    layer = Dense(128, activation="relu", name="FC1")(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(14, activation="softmax", name="FC2")(layer)
    model = Model(inputs=inputs, outputs=layer)
    model.summary()
    model.compile(loss="categorical_crossentropy", optimizer=rmsprop_v2.RMSprop(), metrics=["accuracy"])
    model = load_model('my_model.h6')
    test_pre = model.predict(test_seq_mat)
    pred_result = [i.tolist().index(max(i.tolist())) for i in test_pre]
    res = pd.DataFrame()
    res['label'] = pred_result
    res.to_csv('test_DL.csv')

When everything is in place, we only need to uncomment the corresponding line below to train, evaluate or predict.

# To train, uncomment the next line (delete the existing model file (.h6) first)
# train()
# To check the model's performance (training : validation : test = 15 : 2 : 3), uncomment the next line
# test_main()
# To predict and generate the csv file, uncomment the next line
# pred_file()

3. Result analysis

The resulting confusion matrix is shown in the figure below. All 14 labels have a prediction accuracy above 80%, 11 labels are above 90%, and 6 labels are above 95%.

The plotted predictions (figure below) show that the results are quite good: the accuracy for every label is high, and the number of misclassified texts is very small relative to the total.

Finally, the submission to the competition site gives an F1 score above 0.90, which is a good result.

That is all of "How to implement news text classification with naive Bayes and LSTM in Python". Thank you for reading, and I hope you found it helpful!
