
From Bayes' Formula to Spam Recognition

2025-07-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Reading the chapter "A Plan for Spam" in "Hackers & Painters", I found it a very good illustration of the relationship between mathematical formulas and machine learning. The mathematics involved is relatively simple and is covered in any introductory probability course. The problem it solves is also a classic one: identifying spam.

There are many ways to fight spam. The most intuitive is "rules": a collection of if-else conditions. Rules can solve one specific problem but not a whole class of problems, and writing them requires deep familiarity with the business. Still, the business problems we face are usually narrow and domain-specific, so rules are often good enough at first. After all, solving the problem is what the business cares about most.
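To make the rule-based approach concrete, here is a minimal sketch. The function name `looks_like_spam` and the three rules are hypothetical, invented only for illustration; they are not taken from any real filter.

```python
def looks_like_spam(subject: str, body: str) -> bool:
    """Hand-written if-else rules (hypothetical examples)."""
    text = f"{subject} {body}".lower()
    # Rule 1: a known spam phrase appears anywhere in the message.
    if "free money" in text:
        return True
    # Rule 2: a call to action without an unsubscribe notice.
    if "click here" in text and "unsubscribe" not in text:
        return True
    # Rule 3: a shouting subject line.
    if subject.isupper() and subject.endswith("!!!"):
        return True
    return False


# Each rule catches exactly the pattern it was written for.
caught = looks_like_spam("Hello", "Get free money today")
# A trivial respelling slips past every rule.
missed = looks_like_spam("Hello", "Get fr33 m0ney today")
```

The second call shows the weakness: each rule handles one pattern, so every new variation needs another rule.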

As the business grows, the rules become more and more complex and harder to maintain. Rules are also reactive: they are added only after a problem has already appeared, which makes for a poor user experience. This is when a new approach is needed, the "statistical method". The more rules we write, the more we notice that a keyword appearing in an email only indicates that the email may be spam. How do we measure that likelihood? With the Bayesian method.

The Bayesian method works by reasoning in reverse. The forward question in probability theory is "given that an email is spam, what is the probability that each word appears in it?", whereas the Bayesian method answers "given the content of an email, what is the probability that the email is spam?".

The Bayesian formula is not hard to understand. It rests on two concepts, "conditional probability" and "joint probability", and its derivation is short. The joint probability P(AB) can be factored in two ways:

P(AB) = P(B) * P(A | B)

P(AB) = P(A) * P(B | A)

Equating the two right-hand sides gives:

P(B) * P(A | B) = P(A) * P(B | A)

Dividing both sides by P(B), assuming P(B) > 0:

P(A | B) = P(A) * P(B | A) / P(B)
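The derivation can be checked numerically. The sketch below uses made-up counts (100 trials, chosen only for illustration) and exact fractions so that the equalities hold without rounding error; the helper name `bayes` is an assumption of this example.

```python
from fractions import Fraction

# Hypothetical counts over 100 trials, for illustration only.
n_total = 100
n_a = 40    # trials in which A occurred
n_b = 25    # trials in which B occurred
n_ab = 10   # trials in which both A and B occurred

p_a = Fraction(n_a, n_total)
p_b = Fraction(n_b, n_total)
p_ab = Fraction(n_ab, n_total)

# Conditional probabilities computed directly from the counts.
p_a_given_b = Fraction(n_ab, n_b)
p_b_given_a = Fraction(n_ab, n_a)


def bayes(p_a, p_b_given_a, p_b):
    """P(A | B) = P(A) * P(B | A) / P(B)."""
    return p_a * p_b_given_a / p_b


# Both factorizations of the joint probability agree ...
assert p_b * p_a_given_b == p_a * p_b_given_a == p_ab
# ... so the formula recovers P(A | B) from the other three quantities.
assert bayes(p_a, p_b_given_a, p_b) == p_a_given_b
```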

Mechanically plugging numbers into formulas is the biggest taboo in machine learning, but to aid understanding, let's plug our problem into the formula anyway:

P(spam | email content) means "the probability that an email is spam, given its content":

P(spam | email content) = P(spam) * P(email content | spam) / P(email content)

Every probability on the right-hand side of the equation can be estimated from a sample of labeled emails.
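As a rough sketch of how the right-hand side might be estimated from a sample, the example below reduces "email content" to a single event, "the email contains a given word". That reduction, the function name `spam_posterior`, and the counts are all assumptions made for illustration.

```python
def spam_posterior(n_spam, n_ham, n_spam_with_word, n_ham_with_word):
    """Estimate P(spam | word) from counts in a labeled sample.

    Simplifying assumption: the email content is represented only by
    whether one particular word appears in it.
    """
    n_total = n_spam + n_ham
    n_with_word = n_spam_with_word + n_ham_with_word
    if n_with_word == 0:
        raise ValueError("the word never appears in the sample")
    p_spam = n_spam / n_total                       # P(spam)
    p_word_given_spam = n_spam_with_word / n_spam   # P(word | spam)
    p_word = n_with_word / n_total                  # P(word)
    return p_spam * p_word_given_spam / p_word


# Hypothetical counts: the word appears in 800 of 4000 spam emails
# and in 40 of 4000 normal emails.
posterior = spam_posterior(4000, 4000, 800, 40)
```

With these counts the estimate equals 800 / 840, the share of word-containing emails that are spam, which is what the formula should reduce to.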

Now that we have an approach and a mathematical formula, is the problem solved? Clearly not. We have only completed model selection. Let's see how "Hackers & Painters" puts the model into practice:

Select samples: the author took 4,000 normal emails and 4,000 spam emails.

Define features: letters, Arabic numerals, dashes, apostrophes, and dollar signs are treated as the characters that make up "meaningful tokens".

Gather statistics: count how many times each token occurs in each of the two sets of emails.

Determine the calculation formula: this is the essence of the whole chapter. (a) The author does not apply the Bayesian formula wholesale; (b) the author applies Bayesian thinking at two levels, the individual token and the whole email. That is what makes the approach commendable.

Select features: the author uses only the top 15 tokens of an email rather than all of its tokens.

Choose the decision threshold: 0.5 is the usual boundary, but the author uses 0.9.
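The steps above can be sketched roughly as follows. This is a simplified illustration, not the author's exact method: clamping token probabilities to [0.01, 0.99], the default of 0.4 for unseen tokens, ranking tokens by distance from 0.5, and all function names are assumptions of this sketch, and the toy corpus is invented.

```python
import math
import re
from collections import Counter

# Tokens are runs of letters, digits, dashes, apostrophes, and dollar signs.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def count_tokens(messages):
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    return counts


def token_probs(spam_msgs, ham_msgs):
    """Token level: estimate P(spam | token) for every token seen."""
    spam_counts, ham_counts = count_tokens(spam_msgs), count_tokens(ham_msgs)
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        s = min(1.0, spam_counts[tok] / len(spam_msgs))
        h = min(1.0, ham_counts[tok] / len(ham_msgs))
        # Clamp so that no single token is treated as absolute proof (assumed bounds).
        probs[tok] = min(0.99, max(0.01, s / (s + h)))
    return probs


def spam_score(text, probs, top_n=15, unseen=0.4):
    """Email level: combine the top_n tokens farthest from the neutral 0.5."""
    ps = [probs.get(tok, unseen) for tok in set(tokenize(text))]
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    ps = ps[:top_n]
    prod_spam = math.prod(ps)
    prod_ham = math.prod(1 - p for p in ps)
    return prod_spam / (prod_spam + prod_ham)


def is_spam(text, probs, threshold=0.9):
    return spam_score(text, probs) > threshold


# Toy corpus, invented for illustration; a real sample would be far larger.
spam = ["Win $1000 now", "Free money, click now"]
ham = ["Lunch at noon?", "Meeting notes attached"]
probs = token_probs(spam, ham)
```

With this toy corpus, `is_spam("free money now", probs)` is true and `is_spam("Lunch meeting at noon", probs)` is false. Raising the threshold from 0.5 to 0.9 trades some missed spam for fewer normal emails wrongly flagged.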

If programming in the usual sense is one-dimensional, machine learning programming is two-dimensional. An ordinary engineering problem is black and white: either the feature works, or a bug keeps it from working. When machine learning is put into production, the central concerns are instead how well the algorithm performs and whether it can perform better. Performance depends first on the mathematical model and second on how well the model is put to use. "Hackers & Painters" uses a concise example to show how its author applies a mathematical model to a business problem.

An extension: spam identification is a typical binary classification problem. So are detecting spam comments, judging the sentiment of comments, deciding whether a user is a target user, and deciding whether to recommend something to a user. Many problems reduce to binary classification. Once "spam identification" is abstracted into a classification problem, the range of ways to approach it broadens considerably.

