
How to Implement Document Classification with Naive Bayes in Web Algorithms


This article explains how to implement document classification with naive Bayes in web algorithms. The method introduced here is simple, fast, and practical, so interested readers may wish to follow along and try it out.

Assignment requirements:

The experimental data is in the bayes_datasets folder:

- bayes_datasets/train is the training set. It contains two Chinese text collections in txt format, hotel and travel: the hotel collection consists of documents introducing hotel information, and the travel collection consists of documents introducing scenic-spot information.

- bayes_datasets/test is the test set, containing several hotel documents and several travel documents.

Use the naive Bayes algorithm to classify these two types of documents, and output the classification results for the test set, i.e. the number of documents of each type.

(example: hotel:XX, travel:XX)

Bayes' formula:

Bayes' formula, the core of the naive Bayes algorithm, is as follows:
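For a document D with words w1, ..., wn and a class C (hotel or travel), Bayes' theorem gives

P(C|D) = P(D|C) * P(C) / P(D)

The naive independence assumption factorizes the likelihood as P(D|C) = P(w1|C) * P(w2|C) * ... * P(wn|C). The classifier outputs the class with the larger posterior P(C|D); since P(D) is the same for both classes, it can be ignored when comparing them.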

Code implementation:

Part 1: reading the data

# imports used throughout the program
import os
import re
from string import digits

import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

f_path = os.path.abspath('.') + '/bayes_datasets/train/hotel'
f1_path = os.path.abspath('.') + '/bayes_datasets/train/travel'
f2_path = os.path.abspath('.') + '/bayes_datasets/test'

ls = os.listdir(f_path)
ls1 = os.listdir(f1_path)
ls2 = os.listdir(f2_path)

# regular expression for removing URLs
pattern = r"(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)|([a-zA-Z]+.\w+\.+[a-zA-Z0-9\/_]+)"

res = []

# read the hotel training documents
for i in ls:
    with open(str(f_path + '\\' + i), encoding='UTF-8') as f:
        lines = f.readlines()
        tmp = ''.join(line.replace('\n', '') for line in lines)
        tmp = re.sub(pattern, '', tmp)
        remove_digits = str.maketrans('', '', digits)  # strip digits
        tmp = tmp.translate(remove_digits)
        # print(tmp)
        res.append(tmp)
print("hotel Total:", len(res))

# read the travel training documents
for i in ls1:
    with open(str(f1_path + '\\' + i), encoding='UTF-8') as f:
        lines = f.readlines()
        tmp = ''.join(line.replace('\n', '') for line in lines)
        tmp = re.sub(pattern, '', tmp)
        remove_digits = str.maketrans('', '', digits)
        tmp = tmp.translate(remove_digits)
        # print(tmp)
        res.append(tmp)
print("travel Total:", len(res) - 308)

# read the test documents
# print(ls2)
for i in ls2:
    with open(str(f2_path + '\\' + i), encoding='UTF-8') as f:
        lines = f.readlines()
        tmp = ''.join(line.replace('\n', '') for line in lines)
        tmp = re.sub(pattern, '', tmp)
        remove_digits = str.maketrans('', '', digits)
        tmp = tmp.translate(remove_digits)
        # print(tmp)
        res.append(tmp)
print("test Total:", len(res) - 616)
print("data Total:", len(res))

The task of this part is to read the txt documents under each folder into the program and store the data as needed. The data is divided into a training set and a test set, and the training set in turn contains scenic-spot and hotel documents, so the data falls into three groups that are read in three passes. Each txt file in the training set starts with a string of URL information, which I filter out with a regular expression; it turns out this does not affect the final result. The three groups of documents are read in turn and the results are stored in the list res, where each entry is a single string holding all the text of one txt file after the filtered content has been removed. In the end there are 308 documents under the hotel folder, 308 documents under the travel folder, and 22 documents in the test set.
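The three read loops above differ only in the folder they scan, so they can be factored into a small helper. The sketch below is one possible refactoring, not part of the original assignment code; it relies on the pattern, digits, and imports defined above, uses os.path.join so the paths also work outside Windows, and sorts the directory listing so the read order is deterministic.

def read_documents(folder):
    """Read every txt file in `folder`, stripping URLs and digits."""
    docs = []
    for name in sorted(os.listdir(folder)):  # sorted -> deterministic read order
        with open(os.path.join(folder, name), encoding='UTF-8') as f:
            text = ''.join(line.replace('\n', '') for line in f)
        text = re.sub(pattern, '', text)                      # remove URLs
        text = text.translate(str.maketrans('', '', digits))  # remove digits
        docs.append(text)
    return docs

# res = read_documents(f_path) + read_documents(f1_path) + read_documents(f2_path)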

Part 2: word segmentation and stop-word removal

# stop-word set: punctuation and common function words
stop_word = {}.fromkeys(['，', '。', '！', '这', '我', '非常', '是', '：', '；'])

print("result of Chinese word segmentation:")
corpus = []
for a in res:
    seg_list = jieba.cut(a.strip(), cut_all=False)  # precise mode
    final = ''
    for seg in seg_list:
        if seg not in stop_word:  # not a stop word, keep it
            final += seg
    seg_list = jieba.cut(final, cut_all=False)
    output = ' '.join(list(seg_list))
    # print(output)
    corpus.append(output)
# print('len:', len(corpus))
# print(corpus)  # segmentation result

What this part does is build the stop-word set, i.e. the meaningless tokens filtered out during word segmentation, and segment the text of each txt file into Chinese words. stop_word holds the stop-word set, the third-party library jieba performs the Chinese word segmentation, and the results with stop words removed are appended to corpus.
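As a quick sanity check of what jieba.cut returns, here is its precise mode on a short sample sentence (the sentence is just an illustration, not taken from the dataset):

import jieba

# precise mode splits the sentence into natural Chinese words
print(' '.join(jieba.cut("我来到北京清华大学", cut_all=False)))
# expected output: 我 来到 北京 清华大学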

Part 3: computing word frequencies

# convert the words in the text into a word frequency matrix
vectorizer = CountVectorizer()
# count the number of occurrences of each word
X = vectorizer.fit_transform(corpus)
# get all the keywords in the bag of words
word = vectorizer.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.0
# inspect the word frequency results
# print(len(word))
for w in word:
    print(w, end=" ")
print("")
# print("word frequency matrix:")
X = X.toarray()
# print("matrix len:", len(X))
# np.set_printoptions(threshold=np.inf)
# print(X)

The task of this part is to convert the words in the text into a word frequency matrix and count how many times each word appears. The word frequency matrix turns the set of documents into a matrix in which each document is a row, each word (term) is a column, and the (row, column) value is the frequency of that word in that document.

Note, however, that the word frequency matrix for this assignment is very large. I tried to print the entire matrix, which made the program freeze; even a single row has nearly 20,000 elements. So unless it is really necessary, do not print the word frequency matrix.
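To make the row/column structure concrete, here is a tiny illustration with a three-document toy corpus (not the assignment data): each row of the resulting array is one document, each column is one word from the vocabulary.

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ['酒店 房间 干净', '景区 风景 优美', '酒店 服务 优美']
toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_corpus)
print(toy_vec.get_feature_names())  # the vocabulary (columns); get_feature_names_out() on newer scikit-learn
print(toy_X.toarray())              # the word frequency matrix, one row per document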

Part 4: data analysis

# use the contents of the 616 training txt files for fitting
print("data Analysis:")
x_train = X[:616]
x_test = X[616:]
# print("x_train:", len(x_train))
# print("x_test:", len(x_test))

y_train = []
# 1 represents hotel, 0 represents scenic spot
for i in range(0, 616):
    if i < 308:
        y_train.append(1)  # 1 represents a hotel
    else:
        y_train.append(0)  # 0 represents a scenic spot
# print(y_train)
# print(len(y_train))

# ground-truth labels for the 22 test documents, written by hand to match the
# order in which the files are read (only the leading values are recoverable here)
y_test = [0, 0, 0, 0, 1, 1, 1, 0]  # ...remaining labels omitted

# call the MultinomialNB classifier
clf = MultinomialNB().fit(x_train, y_train)
pre = clf.predict(x_test)
print("Predicted result:", pre)
print("Real result:", y_test)
print(classification_report(y_test, pre))

hotel = 0
travel = 0
for i in pre:
    if i == 0:
        travel += 1
    else:
        hotel += 1
print("Travel:", travel)
print("Hotel:", hotel)

The task of this part is to train on all the training data together with its labels, and then use the trained model to predict the classes of the test data. x_train holds all the training data, 616 samples in total; its labels are y_train, also 616 values, where 1 stands for hotel and 0 stands for scenic spot. x_test holds all the test data, 22 samples; its labels are y_test, also 22 values, written into the program by hand in advance. Note that the order in which the program reads the test files may not match the order in the folder listing, so the hand-written labels must follow the read order. Finally, the MultinomialNB classifier is fitted on the training data, the resulting model predicts the test data, and the predictions are compared with the true labels.
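The fit/predict pattern used above can be seen in isolation on a tiny, self-contained example; the word counts and labels below are toy data invented for illustration, not the assignment's.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

toy_X = np.array([[2, 0, 1],   # word counts per document (rows) and word (columns)
                  [3, 0, 0],
                  [0, 2, 1],
                  [0, 3, 2]])
toy_y = [1, 1, 0, 0]           # 1 = hotel, 0 = travel
toy_clf = MultinomialNB().fit(toy_X, toy_y)
print(toy_clf.predict(np.array([[1, 0, 2], [0, 1, 3]])))  # -> [1 0]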

Running the program produces the classification results.

At this point, I believe you have a deeper understanding of how to implement document classification with naive Bayes in web algorithms. Why not try it out yourself! For more related content, follow us and keep learning!
