This article explains how to implement document classification with the naive Bayes algorithm. The method is simple, fast, and practical, so let's walk through it step by step.
Job requirements:
The experimental data is in the bayes_datasets folder.
bayes_datasets/train is the training set. It contains two collections of Chinese txt documents: hotel, whose documents all describe hotel information, and travel, whose documents all describe scenic spots.
bayes_datasets/test is the test set, which contains several hotel documents and several travel documents.
Use the naive Bayes algorithm to classify these two types of documents and output the classification result for the test set, i.e. the number of documents assigned to each class.
(example: hotel:XX,travel:XX)
Bayesian formula:
The Bayesian formula below is the core of the naive Bayes algorithm.
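Written out, Bayes' theorem for a class C (hotel or travel) and a document D is

P(C|D) = P(D|C) * P(C) / P(D)

The "naive" assumption is that the words w1, ..., wn of D are conditionally independent given the class, so P(D|C) is taken to be the product P(w1|C) * P(w2|C) * ... * P(wn|C), and the classifier chooses the class with the larger posterior P(C|D).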
Code implementation:
Part one: reading data
import os
import re
from string import digits

f_path = os.path.abspath('.') + '/bayes_datasets/train/hotel'
f1_path = os.path.abspath('.') + '/bayes_datasets/train/travel'
f2_path = os.path.abspath('.') + '/bayes_datasets/test'
ls = os.listdir(f_path)
ls1 = os.listdir(f1_path)
ls2 = os.listdir(f2_path)
# regular expression that matches URLs so they can be removed
pattern = r"(https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)|([a-zA-Z]+\.\w+\.+[a-zA-Z0-9/_]+)"
res = []
for i in ls:
    with open(os.path.join(f_path, i), encoding='UTF-8') as f:
        lines = f.readlines()
        tmp = ''.join(line.replace('\n', '') for line in lines)
        tmp = re.sub(pattern, '', tmp)              # strip URLs
        remove_digits = str.maketrans('', '', digits)
        tmp = tmp.translate(remove_digits)          # strip digits
        # print(tmp)
        res.append(tmp)
print("hotel total:", len(res))
for i in ls1:
    with open(os.path.join(f1_path, i), encoding='UTF-8') as f:
        lines = f.readlines()
        tmp = ''.join(line.replace('\n', '') for line in lines)
        tmp = re.sub(pattern, '', tmp)
        remove_digits = str.maketrans('', '', digits)
        tmp = tmp.translate(remove_digits)
        # print(tmp)
        res.append(tmp)
print("travel total:", len(res) - 308)
# print(ls2)
for i in ls2:
    with open(os.path.join(f2_path, i), encoding='UTF-8') as f:
        lines = f.readlines()
        tmp = ''.join(line.replace('\n', '') for line in lines)
        tmp = re.sub(pattern, '', tmp)
        remove_digits = str.maketrans('', '', digits)
        tmp = tmp.translate(remove_digits)
        # print(tmp)
        res.append(tmp)
print("test total:", len(res) - 616)
print("data total:", len(res))
This part reads the txt documents from each folder into the program and stores them as needed. The data is split into a training set and a test set, and the training set itself contains two classes (hotel and travel), so the files are read in three passes. Each txt file in the training set starts with a string of URL information; I filter it out with a regular expression, although it turns out that leaving it in does not change the final result. The three groups of documents are read in turn and appended to the result list res: each entry of the list is one string holding the full contents of one txt file after the unwanted data has been removed. In the end there are 308 documents in the hotel folder, 308 documents in the travel folder, and 22 documents in the test set.
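As a quick check of the cleaning step, here is a minimal sketch (the sample string is made up) showing how the URL regex and the digit-removal table behave:

import re
from string import digits

# made-up sample resembling the first line of a training file
sample = "http://example.com/hotel123 北京某酒店 建于1998年"
pattern = r"(https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)|([a-zA-Z]+\.\w+\.+[a-zA-Z0-9/_]+)"
cleaned = re.sub(pattern, '', sample)                       # the URL is stripped
cleaned = cleaned.translate(str.maketrans('', '', digits))  # the digits are stripped
print(cleaned)  # -> " 北京某酒店 建于年"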
Part two: word segmentation and stop-word removal
import jieba

# stop words: punctuation and a few high-frequency words to drop during segmentation
stop_word = {}.fromkeys([',', '。', '!', '这', '我', '非常', '是', ':', ';'])
print("result of Chinese word segmentation:")
corpus = []
for a in res:
    seg_list = jieba.cut(a.strip(), cut_all=False)  # precise mode
    final = ''
    for seg in seg_list:
        if seg not in stop_word:  # keep everything that is not a stop word
            final += seg
    seg_list = jieba.cut(final, cut_all=False)
    output = ' '.join(list(seg_list))  # space-separated tokens for CountVectorizer
    # print(output)
    corpus.append(output)
# print('len:', len(corpus))
# print(corpus)  # segmentation result
This part builds the stop-word set (the words and punctuation filtered out during segmentation) and segments the text of each file into Chinese words. stop_word holds the stop-word set, the third-party library jieba performs the Chinese word segmentation, and the segmented text with stop words removed is appended to corpus.
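For example, a minimal sketch of the segmentation step on a made-up sentence (the exact tokens depend on jieba's dictionary):

import jieba

sentence = "这家酒店的服务非常好"
# precise mode, then drop a couple of stop words before joining with spaces
tokens = [t for t in jieba.cut(sentence, cut_all=False) if t not in {'的', '非常'}]
print(' '.join(tokens))  # e.g. "这家 酒店 服务 好"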
Part three: calculating word frequency
from sklearn.feature_extraction.text import CountVectorizer

# convert the words in the text into a word frequency matrix
vectorizer = CountVectorizer()
# count how many times each word appears
X = vectorizer.fit_transform(corpus)
# get all the keywords in the bag of words
word = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
# inspect the word frequency results
# print(len(word))
for w in word:
    print(w, end=" ")
print("")
# print("word frequency matrix:")
X = X.toarray()
# print("matrix len:", len(X))
# np.set_printoptions(threshold=np.inf)
# print(X)
This part converts the words in the text into a word frequency matrix and counts how many times each word appears. The word frequency matrix turns the collection of documents into a matrix in which each document is a row, each word (token) is a column, and the (row, column) value is the frequency of that word in that document.
Note, however, that the word frequency matrix for this assignment is very large. Printing the whole matrix made the program hang, and even its first row alone has nearly 20,000 elements, so avoid printing the matrix unless you really need to.
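To see the shape of the word frequency matrix on a scale that is actually printable, here is a minimal sketch with a made-up two-document corpus:

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["hotel room price hotel", "beach ticket beach beach"]
vec = CountVectorizer()
M = vec.fit_transform(toy_corpus)
print(vec.get_feature_names_out())  # ['beach' 'hotel' 'price' 'room' 'ticket']
print(M.toarray())
# [[0 2 1 1 0]
#  [3 0 0 0 1]]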
Part four: data analysis
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# the 616 training documents are used to fit the model; the rest are predicted
print("data analysis:")
x_train = X[:616]
x_test = X[616:]
# print("x_train:", len(x_train))
# print("x_test:", len(x_test))
y_train = []
# 1 represents a hotel document, 0 represents a scenic-spot (travel) document
for i in range(0, 616):
    if i < 308:
        y_train.append(1)  # 1 represents hotel
    else:
        y_train.append(0)  # 0 represents scenic spot
# print(y_train)
# print(len(y_train))
# hand-labelled ground truth for the 22 test documents (1 = hotel, 0 = travel);
# fill this in to match the actual classes of the files in bayes_datasets/test
y_test = []
# call the MultinomialNB classifier
clf = MultinomialNB().fit(x_train, y_train)
pre = clf.predict(x_test)
print("Forecast result:", pre)
print("Real result:", y_test)
print(classification_report(y_test, pre))
hotel = 0
travel = 0
for i in pre:
    if i == 0:
        travel += 1
    else:
        hotel += 1
print("Travel:", travel)
print("Hotel:", hotel)
This part trains the model on all of the labelled training data and then uses it to classify the test data. x_train holds the 616 training samples; the corresponding labels y_train also number 616, with 1 for hotel and 0 for scenic spot. x_test holds the 22 test samples; the corresponding labels y_test, also 22 values, are written by hand in the program from the known classes of the test files. Note that the order in which the program reads the test files may not match the order shown in the folder directory. Finally, the MultinomialNB classifier is fitted on the training data, the resulting model predicts the test data, and the predictions are compared with the true labels.
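One way to sidestep the ordering pitfall mentioned above is to read the test files in a deterministic order, so the hand-written y_test can be matched to it. A minimal sketch, reusing f2_path from Part one:

# sort the listing so the read order no longer depends on the file system
ls2 = sorted(os.listdir(f2_path))
for name in ls2:
    print(name)  # label the files in exactly this order when writing y_test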
Running the program produces the classification results.
At this point you should have a solid understanding of how naive Bayes document classification works; the best way to consolidate it is to run the code yourself.