How to use naive Bayes to identify spam messages in Python

2025-07-12 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to use naive Bayes to identify spam messages in Python. The content is simple and easy to follow, so just work through it step by step.

I. Introduction

Nowadays, many phone manager apps can intercept spam messages, which is very smart and thoughtful, isn't it?

It is very useful for people who are often harassed by spam messages.

But after intercepting a spam message, a lot of these apps then send you a notification saying that a spam message has been intercepted.

Curiosity killed the cat. If you tell me you intercepted a spam message, of course I want to know what kind of spam message it was. ╮(╯_╰)╭

II. Classification and identification of spam messages

Machine learning tasks can be roughly divided into three categories according to their nature:

Classification (supervised)

Regression (supervised)

Clustering (unsupervised)

Spam recognition usually uses labeled SMS data to judge unknown messages, so it is a classification problem.
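To make "judging unknown messages from labeled ones" concrete, here is a minimal pure-Python sketch of a word-count naive Bayes classifier with add-one smoothing. The toy messages and the function names `train_naive_bayes` and `classify` are made up for illustration; this is not how NLTK implements it internally.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (list_of_words, label) pairs."""
    label_counts = Counter()            # how many messages per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocab = set()
    for words, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(model, words):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        # log prior: how common this label is in the training data
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # add-one smoothing so an unseen word does not zero out the product
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up labeled messages (already split into words)
training = [
    (["free", "prize", "click"], "spam"),
    (["win", "free", "cash"], "spam"),
    (["meeting", "at", "noon"], "normal"),
    (["lunch", "at", "home"], "normal"),
]
model = train_naive_bayes(training)
print(classify(model, ["free", "cash", "prize"]))    # spam
print(classify(model, ["meeting", "at", "home"]))    # normal
```

The "naive" part is the assumption that words are independent given the label, which is why the per-word log probabilities are simply added.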

Python has many machine learning libraries, such as scikit-learn (sklearn), TensorFlow and Caffe, which make it easy to call common machine learning algorithms.

III. Spam message recognition

Well, let's just get started...

The training set of 800,000 messages (80w) and the test set of 200,000 messages (20w) both come from a kind fellow on GitHub. Thank you!

1. Data processing

Well, let's see what the data looks like:

    import pandas as pd
    # the separator was lost in the original text; a tab is assumed here
    data = pd.read_csv(r"H:\RubbishMessage\data\80w.txt", encoding='utf-8', sep='\t', header=None)
    data.head()

The last column holds the content of each SMS, and the second-to-last column holds its type: 0 means a normal message and 1 means spam.
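As a rough picture of that layout, the sketch below parses a few made-up lines with the standard `csv` module. It assumes three tab-separated columns (an id, the 0/1 label, the text); the sample rows are invented and are not taken from the real data set.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for a few lines of the data file (tab-separated is assumed)
sample = "\n".join([
    "1\t0\tSee you at dinner tonight",
    "2\t1\tCongratulations, click to claim your prize",
    "3\t0\tRunning late, start without me",
])

rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# Column 1 holds the label: 0 = normal, 1 = spam
label_counts = Counter(int(row[1]) for row in rows)
# Column 2 holds the message text
spam_texts = [row[2] for row in rows if row[1] == "1"]

print(label_counts)
print(spam_texts)
```

Counting the labels like this is a quick way to check how unbalanced the two classes are before training.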

Then we split the messages by type (normal and spam) and run word segmentation on their content:

    # spam SMS
    import jieba
    spam = data[data[1] == 1].copy()  # copy so the assignment below does not write to a view of data
    spam[2] = spam[2].map(lambda x: ' '.join(jieba.cut(x)))  # join the segmented words with spaces
    spam.head()

    # normal SMS
    normal = data[data[1] == 0].copy()
    normal[2] = normal[2].map(lambda x: ' '.join(jieba.cut(x)))
    normal.head()

Save different types of SMS messages after word segmentation as different files:

    # write only the text column (column 2), one segmented message per line
    spam.to_csv('soam.csv', encoding='utf-8', header=False, index=False, columns=[2])
    normal.to_csv('normal.csv', encoding='utf-8', header=False, index=False, columns=[2])

2. Model selection and training

Instead of scikit-learn or a deep learning library, we use the NLTK natural language processing library for naive Bayes classification.

Import the modules:

    import nltk.classify.util
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import PlaintextCorpusReader
    import random

Load the SMS files you just exported:

    # load the SMS corpus
    message_corpus = PlaintextCorpusReader('./', ['soam.csv', 'normal.csv'])
    all_message = message_corpus.words()

Define a feature function to generate features:

    def massage_feature(word, num_letter=1):
        # use the last num_letter characters of the word as its only feature
        return {'feature': word[-num_letter:]}
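To see what this feature function produces, the short sketch below calls it on a few sample words. The sample words are made up; the function body is the one from the article.

```python
def massage_feature(word, num_letter=1):
    # use the last num_letter characters of the word as its only feature
    return {'feature': word[-num_letter:]}

print(massage_feature('prize'))       # {'feature': 'e'}
print(massage_feature('prize', 2))    # {'feature': 'ze'}
print(massage_feature('中奖'))         # {'feature': '奖'}
```

With the default `num_letter=1`, every word is reduced to its final character, so this is a fairly coarse feature; raising `num_letter` keeps more of each word.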

Label each word with the type of message it came from:

    labels_name = ([(massage, 'garbage') for massage in message_corpus.words('soam.csv')] +
                   [(massage, 'normal') for massage in message_corpus.words('normal.csv')])
    random.seed(7)  # fixed seed so the shuffle is repeatable
    random.shuffle(labels_name)
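The labeling step can be tried on toy data without the corpus files. In this sketch the two word lists are made-up stand-ins for what `message_corpus.words(...)` returns; everything else follows the same pattern.

```python
import random

# Made-up word lists standing in for message_corpus.words('soam.csv') / words('normal.csv')
spam_words = ['free', 'prize', 'click']
normal_words = ['dinner', 'tonight']

# Pair every word with the label of the file it came from
labels_name = ([(w, 'garbage') for w in spam_words] +
               [(w, 'normal') for w in normal_words])

random.seed(7)               # fixed seed so the shuffle is repeatable
random.shuffle(labels_name)  # mix the two classes before splitting into train and test
print(labels_name)
```

Note that each item is a (word, label) pair, so the classifier is trained on individual words rather than on whole messages.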

Train the model and predict:

    from nltk.classify import accuracy as nltk_accuracy
    featuresets = [(massage_feature(n), massage) for (n, massage) in labels_name]
    # hold out the first 2000 items for testing and train on the rest
    train_set, test_set = featuresets[2000:], featuresets[:2000]
    classifier = NaiveBayesClassifier.train(train_set)

Finally, let's look at the accuracy of the prediction:

    print('result accuracy:', str(100 * nltk_accuracy(classifier, test_set)) + str('%'))
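The accuracy reported here is the fraction of test items whose predicted label matches the true label. A minimal sketch of that calculation follows; `toy_classify` is a hypothetical stand-in for a trained classifier and the test items are made up.

```python
def accuracy(classify, test_set):
    """Fraction of (features, label) pairs whose predicted label equals the true label."""
    correct = sum(1 for features, label in test_set if classify(features) == label)
    return correct / len(test_set)

# Hypothetical stand-in for a trained classifier: a fixed rule on the feature value
def toy_classify(features):
    return 'garbage' if features['feature'] in {'e', 'k'} else 'normal'

toy_test_set = [
    ({'feature': 'e'}, 'garbage'),
    ({'feature': 'k'}, 'garbage'),
    ({'feature': 'r'}, 'normal'),
    ({'feature': 'e'}, 'normal'),   # this one is misclassified by the toy rule
]

result = accuracy(toy_classify, toy_test_set)
print('result accuracy:', str(100 * result) + '%')   # result accuracy: 75.0%
```

Because the test items are single words, this number measures how often a word is assigned to the right class, not how often a whole message is.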

Thank you for reading. That is how to use naive Bayes to identify spam messages in Python. How well it works for your own data still needs to be verified in practice.
