How to implement spam recognition with Python

2025-07-06 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01 report

This article explains how to implement spam email recognition in Python. The steps are simple and quick to follow, and they should serve as a useful reference. Let's take a look.

Development tools

Python version: 3.6.4

Related modules:

scikit-learn module;

jieba module;

numpy module;

and some modules that come with Python.

Environment setup

Install Python and add it to your PATH environment variable, then use pip to install the required modules.
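As a quick sanity check after installation, the following sketch reports which of the modules listed above can be imported. The helper name `check_modules` is hypothetical and not part of the original project; note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec


def check_modules(names=("sklearn", "jieba", "numpy")):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_modules().items():
        status = "ok" if available else "missing, install it with pip"
        print(f"{name}: {status}")
```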

Step-by-step implementation

(1) Splitting the dataset

Most spam-detection datasets available online are in English, so I spent some time finding a dataset of Chinese emails. The dataset is split as follows:

Training dataset:

7063 normal emails (under the data/normal folder);

7775 spam emails (under the data/spam folder).

Test dataset:

A total of 392 emails (under the data/test folder).
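Assuming each email is stored as a separate file inside the folders listed above, a minimal sketch for collecting the file paths might look like this. The function name `list_emails` is hypothetical, and the article does not state the file naming or encoding used by the dataset.

```python
from pathlib import Path


def list_emails(root="data"):
    """Collect email file paths from the normal, spam, and test subfolders."""
    root = Path(root)
    return {
        name: sorted(p for p in (root / name).glob("*") if p.is_file())
        for name in ("normal", "spam", "test")
    }
```

With the dataset described above, the "normal", "spam", and "test" lists would be expected to hold 7063, 7775, and 392 paths respectively.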

(2) Creating a dictionary

The email content in the dataset generally looks like this:

First, we use a regular expression to filter out non-Chinese characters, then use the jieba word-segmentation library to split each sentence into words, and remove stop words. Finally, we use the result to create a dictionary in the following format:

{"Word 1": Word 1 Frequency, "Word 2": Word 2 Frequency...}

These steps are implemented in the "utils.py" file and are called from the main program (train.py).

The result is saved in the file "results.pkl".
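The contents of "utils.py" are not reproduced in this article, so the following is only a sketch of the steps described above (filter non-Chinese characters, segment, drop stop words, count). The names `count_words` and `save_pickle` are hypothetical, and the Unicode range `\u4e00-\u9fa5` is a commonly used approximation for Chinese characters rather than something the article specifies.

```python
import pickle
import re
from collections import Counter

# Matches any run of characters outside the common CJK ideograph range (assumed).
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")


def count_words(texts, stop_words=frozenset(), tokenize=None):
    """Count word frequencies over an iterable of raw email texts."""
    if tokenize is None:
        import jieba  # imported lazily so a custom tokenizer can be passed instead
        tokenize = jieba.lcut
    counts = Counter()
    for text in texts:
        # Replace every run of non-Chinese characters with a single space.
        cleaned = NON_CHINESE.sub(" ", text)
        for word in tokenize(cleaned):
            word = word.strip()
            if word and word not in stop_words:
                counts[word] += 1
    return dict(counts)


def save_pickle(obj, path):
    """Serialize obj to the given path, for example "results.pkl"."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
```

The `tokenize` parameter defaults to `jieba.lcut`; accepting it as an argument is a design choice made here so the counting logic can be tried with any tokenizer.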

Are we done? Of course not!

The dictionary at this point contains 52113 words, which is obviously too many. Some words appear only once or twice, and it would be unwise to let each of them occupy a dimension in the subsequent feature extraction. Therefore, we keep only the 4000 most frequent words as the final dictionary.

The final result is saved in the file "wordsDict.pkl".
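Trimming the dictionary to its most frequent entries can be sketched with `collections.Counter`. The function name `keep_top_words` is hypothetical; the default of 4000 comes from the text above.

```python
from collections import Counter


def keep_top_words(word_counts, k=4000):
    """Return a dict holding only the k most frequent words and their counts."""
    return dict(Counter(word_counts).most_common(k))
```

For example, `keep_top_words({"a": 5, "b": 1, "c": 3}, k=2)` keeps "a" and "c". When several words share the same frequency at the cut-off, which of them survive depends on their order in the input.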

(3) Feature extraction

With the dictionary ready, we can convert the content of each email into a feature vector with 4000 dimensions, where each dimension holds the number of times one high-frequency word occurs in that email. Finally, we stack these vectors into one large feature matrix of size:

(7063+7775)×4000

That is, the first 7063 rows are the feature vectors of normal emails, and the remaining 7775 rows are the feature vectors of spam emails.

This step is also implemented in the "utils.py" file and called from the main program.

The result is stored in the file "fvs_%d_%d.npy", where the first format specifier is the number of normal emails and the second is the number of spam emails.
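A minimal sketch of this step, assuming the emails have already been segmented into word lists, could look like the following. The function names are hypothetical and the actual code in "utils.py" may differ; only the file name pattern is taken from the article.

```python
import numpy as np


def build_feature_matrix(tokenized_emails, vocabulary):
    """Turn tokenized emails into a (num_emails, len(vocabulary)) count matrix."""
    index = {word: i for i, word in enumerate(vocabulary)}
    features = np.zeros((len(tokenized_emails), len(index)), dtype=np.int32)
    for row, tokens in enumerate(tokenized_emails):
        for token in tokens:
            col = index.get(token)
            if col is not None:  # words outside the dictionary are ignored
                features[row, col] += 1
    return features


def save_features(normal_fvs, spam_fvs, pattern="fvs_%d_%d.npy"):
    """Stack normal rows above spam rows and save them using the file name pattern."""
    stacked = np.concatenate([normal_fvs, spam_fvs], axis=0)
    path = pattern % (len(normal_fvs), len(spam_fvs))
    np.save(path, stacked)
    return path
```

With 7063 normal and 7775 spam emails and a 4000-word dictionary, the stacked matrix would have shape (14838, 4000) and the file would be named "fvs_7063_7775.npy".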

(4) Training classifiers

We use the scikit-learn machine learning library to train the classifiers. The models chosen are a naive Bayes classifier and an SVM (Support Vector Machine).
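The article does not say which naive Bayes variant or SVM implementation was used. The sketch below assumes `MultinomialNB` (a common choice for word-count features) and `LinearSVC`, and it assumes the label convention 0 for normal and 1 for spam; the function name `train_classifiers` is hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def train_classifiers(features, num_normal, num_spam):
    """Fit a naive Bayes model and a linear SVM on a stacked feature matrix."""
    # Labels follow the row order of the matrix: normal emails (0) first, then spam (1).
    labels = np.array([0] * num_normal + [1] * num_spam)
    nb_model = MultinomialNB().fit(features, labels)
    svm_model = LinearSVC(random_state=0).fit(features, labels)
    return nb_model, svm_model
```

The trained models could then be stored with `pickle` so that the test step does not need to retrain them.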

(5) Performance test

Test the models on the test dataset.

The results show that the two models perform similarly (the SVM is slightly better than naive Bayes), but the SVM is more inclined to classify emails as spam.
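One way to quantify such a comparison is to compute the accuracy together with the confusion-matrix counts for each model. The article does not describe how the test labels are stored, so the sketch below simply assumes label lists with 0 for normal and 1 for spam; the function name and the dictionary keys are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix


def summarize_predictions(y_true, y_pred):
    """Return accuracy plus confusion-matrix counts, with spam (1) as the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "normal_flagged_as_spam": int(fp),
        "spam_missed": int(fn),
        "predicted_spam": int(fp + tp),
    }
```

A model that leans toward spam would show a higher "predicted_spam" and "normal_flagged_as_spam" count than the other model on the same test set.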

That concludes this introduction to spam recognition with Python. Thank you for reading!
