How to Classify Books with a Bayesian Classifier

This article explains how to use a Bayesian classifier to classify books. The explanation is kept simple and clear so that it is easy to learn and understand; let's work through how to classify books with a Bayesian classifier.
Start with the problem:
The problem we are solving is binary classification of books. Classification is based on the book's tags, which may come from an expert, an editor, or a user. For example, "foreign literature", "detective", "computer", and "Python" are all tags. To simplify the problem, we divide books into just two categories: "humanities" and "non-humanities".
For example, Introduction to Computer Science has the tags "computer", "science", "classic", and "introduction", and belongs to "non-humanities". The Catcher in the Rye has the tags "novel", "literature", and "America", and belongs to "humanities".
Basic principles:
How a Bayesian classifier works:
P(a|b) = P(b|a) * P(a) / P(b). That is: if you want P(a|b), and you know the values of P(b|a), P(a), and P(b), you can obtain it through Bayes' formula.
A book is known to have the tags tag1, tag2, tag3, .... What is the probability that it belongs to the "humanities" category? What is the probability that it belongs to the "non-humanities" category?
Suppose p1 is the probability that it belongs to "humanities" given these tags, and p2 the probability that it belongs to "non-humanities".
If p1 > p2, the book is classified as "humanities"; otherwise as "non-humanities".
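To make the formula and the decision rule concrete, here is a toy numeric check in Python (all numbers are made up for illustration; none come from the article):

    # Toy check of Bayes' formula: P(a|b) = P(b|a) * P(a) / P(b).
    p_b_given_a = 0.8                        # assumed P(b|a)
    p_a = 0.3                                # assumed P(a)
    p_b = 0.4                                # assumed P(b)
    p_a_given_b = p_b_given_a * p_a / p_b    # = 0.6

    # The classifier then just compares two such posteriors:
    # if p1 > p2 the book is "humanities", otherwise "non-humanities".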
Conditional probability:
In fact, this is a question of conditional probability. Conditional probability is the probability that a happens given that b is known to have happened, written P(a|b).
Combined with our practical problem, that is the probability that this book belongs to "humanities" or "non-humanities" given that tag1, tag2, tag3, ... appear. We write:
P(humanities | tag1, tag2, tag3, ...): the probability that the book belongs to "humanities" when tag1, tag2, tag3, ... appear.
P(non-humanities | tag1, tag2, tag3, ...): the probability that the book belongs to "non-humanities" when tag1, tag2, tag3, ... appear.
By Bayes' formula:
P(humanities | tag1, tag2, tag3, ...) = P(tag1, tag2, tag3, ... | humanities) * P(humanities) / P(tag1, tag2, tag3, ...)
where:
P(tag1, tag2, tag3, ... | humanities): the probability that tag1, tag2, tag3, ... appear together on a book already known to be "humanities".
P(humanities): the probability of "humanities" books among all books ("humanities" and "non-humanities") in the training set.
P(tag1, tag2, tag3, ...): the probability that tag1, tag2, tag3, ... appear together among all the tags in the training set.
Here is a noteworthy trick: we never actually need to compute P(tag1, tag2, tag3, ...), because our aim is only to compare P(humanities | tag1, tag2, tag3, ...) with P(non-humanities | tag1, tag2, tag3, ...), not to obtain their actual values. The denominator P(tag1, tag2, tag3, ...) is the same in both, so we only need to compare the sizes of the numerators:
P(tag1, tag2, tag3, ... | humanities) * P(humanities) and P(tag1, tag2, tag3, ... | non-humanities) * P(non-humanities)
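As a small sketch (the variable names are ours, not from the article), the comparison with the shared denominator dropped looks like this:

    # Compare unnormalized posterior scores; the common denominator P(tags) cancels.
    score_humanities = p_tags_given_humanities * p_humanities
    score_non_humanities = p_tags_given_non_humanities * p_non_humanities
    label = "humanities" if score_humanities > score_non_humanities else "non-humanities"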
Naive Bayes:
So how do we calculate P(tag1, tag2, tag3, ... | humanities)? Here the "naive" part of naive Bayes comes in: we assume that among the tags of a book, each tag appears independently of the others. In other words, the probabilities of "computer" and "classic" appearing are unrelated; the appearance of "computer" does not make "classic" any more likely. Then:
P(tag1, tag2, tag3, ... | humanities) = P(tag1 | humanities) * P(tag2 | humanities) * P(tag3 | humanities) * ...
That is, we calculate the probability of each tag appearing among all the tags of "humanities" books (and likewise for "non-humanities" books), and then multiply those probabilities together.
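Assuming the per-tag conditional probabilities are already stored in a dictionary (the names below are hypothetical), the product can be computed like this:

    # Sketch of the naive Bayes likelihood for one class.
    # p_tag_given_class maps each tag to P(tag | class); tags is the book's tag list.
    def likelihood(tags, p_tag_given_class):
        p = 1.0
        for tag in tags:
            p *= p_tag_given_class[tag]  # independence assumption: multiply per-tag probabilities
        return p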
Example analysis:
We now have a book, Introduction to Computer Science, tagged "computer", "science", "theory", "classic", and "introduction". We want to know the probabilities that it belongs to "humanities" and to "non-humanities" given these tags.
So what do we already have? We currently have 10 books, of which 6 are known to be "humanities" and 4 "non-humanities". After deduplication, these 10 books carry 70 distinct tags in total, including "computer", "science", "theory", and "introduction".
From this we can conclude that P(humanities) = 6 / 10 = 0.6 and P(non-humanities) = 1 - 0.6 = 0.4; that is, among all books the probability of "humanities" is 0.6 and of "non-humanities" is 0.4.
Next come P(tag1, tag2, tag3, ... | humanities) and P(tag1, tag2, tag3, ... | non-humanities): we must calculate the probabilities of "computer", "science", "theory", "classic", and "introduction" among all the tags of the "humanities" books, and likewise among all the tags of the "non-humanities" books.
1. Prepare the training set:
Almost all machine learning needs a training set, and Bayesian classification is no different. The known data we spoke of above is the training set: the 10 books in the example, together with their deduplicated tags. The probabilities 0.6 and 0.4 are the prior probabilities P(humanities) and P(non-humanities).
For our problem, we prepare 100 books, divide them into the "humanities" and "non-humanities" categories, and collect all the tags of these books. (You can crawl book data from Amazon or Douban.) One possible shape for the collected data is sketched below.
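A hypothetical layout for the collected training data (the book tags and labels are taken from the article's own examples):

    # Each training example: (list of tags, class label).
    training_books = [
        (["novel", "literature", "America"], "humanities"),
        (["computer", "science", "classic", "introduction"], "non-humanities"),
        # ... 98 more books
    ]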
2. Form a tag set:
The tags mentioned above are stored in a Python list, dicts; every element of dicts is a tag.
dicts = ["science", "theory", "C++"]
3. Calculate the probabilities of "humanities" and "non-humanities" in the training set:
Suppose 60 of the 100 books in our training set are "humanities". Then P(humanities) = 60 / 100 = 0.6 and P(non-humanities) = 1 - P(humanities) = 0.4. In code this is a one-line count, as sketched below.
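A minimal sketch, reusing the hypothetical training_books list from above:

    # Estimate the class priors from the training labels.
    labels = [label for _, label in training_books]
    p_humanities = labels.count("humanities") / len(labels)  # 0.6 in the article's 60/40 scenario
    p_non_humanities = 1.0 - p_humanities                    # 0.4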
4. Calculate the probability of each tag in the tag set appearing in the "humanities" training data:
First, we construct a list from the training set. Each item in this list is itself a list, and each element of that inner list is either 1 or 0: 1 means the tag at that position in dicts is one of the book's tags, 0 means it is not.
dicts = ["computer", "novel", "psychology", "science", "programming", "behavior", "introduction", "classic", "travel", "America", ...]  # the tag set
tag_vector_humanities = [
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 1, ...],  # first book, The Catcher in the Rye; tags: novel, classic, America
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, ...],  # second book, Predictably Irrational; tags: psychology, behavior, America
    [...],                                # third book
    ...
]
tag_vector_non_humanities = [
    [...],
    [...],
    ...
]
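Building these vectors from a book's tag list is mechanical; a minimal sketch (the helper name is ours, not from the article):

    def tags_to_vector(dicts, book_tags):
        # 1 if the tag at this position of dicts appears on the book, else 0.
        return [1 if tag in book_tags else 0 for tag in dicts]

    # e.g. tags_to_vector(dicts, ["novel", "classic", "America"]) for The Catcher in the Rye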
With such data, we can calculate P(tag1 | humanities): for tag1, we count the number of times tag1 appears across all the "humanities" books in the training set.
For example, among the 60 "humanities" books in the training set, 40 carry the tag "classic", so num_of_tag1 = 40; similarly num_of_tag2 = 32, num_of_tag3 = 18, and so on.
Then we find the total number of tags carried by all the books in the "humanities" category. For example, if the "humanities" category contained 2 books, the first tagged "prose", "classic", and "foreign" and the second tagged "classic" and "novel", the total number of tags would be 3 + 2 = 5. Suppose that for the "humanities" books in our 100-book training set this total is 700; we set total_humanities = 700.
So the probability of tag1 appearing in the "humanities" category is P(tag1 | humanities) = num_of_tag1 / total_humanities = 40 / 700 ≈ 0.057.
Using numpy:

    from numpy import *

    num_tags_cate1 = ones(len(dicts))             # 1
    total_cate1 = 2.0                             # 2
    for item in tag_vector_cate1:
        num_tags_cate1 += item                    # 3
        total_cate1 += sum(item)                  # 4
    p_tags_cate1 = num_tags_cate1 / total_cate1   # 5

# 1: ones() is a numpy function that returns a numpy array filled with the value 1; the argument is the length of the array. For example, temp = ones(3) generates a numpy array of three 1s and assigns it to temp. Here we pass the length of the training set's tag set dicts, generating an all-ones numpy array as long as dicts.
# 2: the total starts at 2.0 rather than 0. Starting the counts at 1 (step 1) and the total at 2 is the usual smoothing trick: it keeps a tag that never appears in a class from getting probability 0 and wiping out the whole product in the naive Bayes multiplication.
# 3: tag_vector_cate1 is [[], [], [], ...], and item is each inner list; its length equals len(dicts), and each position indicates whether the corresponding tag is present. Adding a numpy array and a Python list adds the elements at corresponding positions: if a is the numpy array [1, 2, 3, 5, 0] and b is the Python list [0, 0, 3, 2, 1], then a + b = [1, 2, 6, 7, 1], and the result is a numpy array. So num_tags_cate1 accumulates, per position, how many books carry each tag.
# 4: sum(item) adds up all the elements of item, i.e. the number of tags this book carries. If item is the list [0, 1, 0, 0, 0, 0, 0, 1, 0, 1] corresponding to The Catcher in the Rye, sum(item) is 3. Accumulating it over all books gives the total tag count for the class.
# 5: dividing the numpy array by the scalar divides every element, so p_tags_cate1 holds the per-tag probabilities P(tag1 | cate1), P(tag2 | cate1), ... for the class.
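The article stops at the per-tag probabilities; to round the pipeline off, a final classification step could look like the sketch below (our addition, not from the original). Sums of logarithms replace products of probabilities so that multiplying many small numbers does not underflow:

    from numpy import array, log

    def classify(vec, p_tags_cate1, p_tags_cate2, p_cate1):
        # vec is the book's 0/1 tag vector; p_tags_cate1/2 are the per-tag
        # probability arrays computed above; p_cate1 is the prior P(humanities).
        vec = array(vec)
        score1 = (vec * log(p_tags_cate1)).sum() + log(p_cate1)        # log posterior score for "humanities"
        score2 = (vec * log(p_tags_cate2)).sum() + log(1.0 - p_cate1)  # log posterior score for "non-humanities"
        return "humanities" if score1 > score2 else "non-humanities"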
Thank you for reading. That is the content of "how to classify books with a Bayesian classifier"; after studying this article you should have a deeper understanding of how to use a Bayesian classifier to classify books, and the specifics are best verified in practice.