Updated 2025-01-16. From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article explains how to use Naive Bayes for document classification. It works through a small example by hand and then shows the corresponding Python code.
I. Overview
Advantages: remains effective when little training data is available, and can handle multi-class problems.
Disadvantages: sensitive to how the input data is prepared.
Applicable data type: nominal data.
II. Principle
III. Document classification
A, B, C and D are words in the documents. Suppose the vocabulary contains only these four words and there are 5 training samples. In the table below, 1 means the word appears in the document and 0 means it does not.
Document         A  B  C  D  Category
Document 1       0  0  1  1  0
Document 2       0  1  1  1  0
Document 3       1  0  0  1  1
Document 4       1  1  0  0  1
Document 5       1  1  1  0  1
Test document    1  0  1  0  ?
Categories: C0, C1
Test document: W
max{P(C0 | W), P(C1 | W)} ==> max{log[P(C0 | W)], log[P(C1 | W)]}
P(C0 | W) = P(W | C0) * P(C0) / P(W)
P(C0) = 2/5 ==> 2 documents of category 0, 3 documents of category 1
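As a minimal sketch, the class priors can be computed directly from the label column of the training table. The names `labels`, `p_c0` and `p_c1` are illustrative and do not come from the original article.

```python
import numpy as np

# Category column of the five training documents
labels = np.array([0, 0, 1, 1, 1])

# The prior of a class is the fraction of training documents carrying that label
p_c1 = labels.mean()   # 3 of 5 documents are category 1
p_c0 = 1.0 - p_c1      # the remaining 2 of 5 are category 0

print("P(C0) =", p_c0, "P(C1) =", p_c1)
```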
P(W | C0) = P(A*B*C*D | C0) ==> Naive Bayes (words are assumed independent given the category) ==> P(A | C0) * P(B | C0) * P(C | C0) * P(D | C0)
P(A | C0) = (0 + 0) / (0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 0/5 = 0 ==> number of times A appears in category 0 documents / total number of words in category 0 documents
P(B | C0) = (0 + 1) / (0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 1/5 ==> number of times B appears in category 0 documents / total number of words in category 0 documents
P(C | C0) = (1 + 1) / (0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 2/5 ==> number of times C appears in category 0 documents / total number of words in category 0 documents
P(D | C0) = (1 + 1) / (0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 2/5 ==> number of times D appears in category 0 documents / total number of words in category 0 documents
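The per-word conditional probabilities can be reproduced with a few lines of NumPy. This is a sketch under the assumption that the training table is stored as a 0/1 matrix. The helper name `word_probs` is hypothetical.

```python
import numpy as np

# Rows are documents 1..5, columns are the words A, B, C, D
train = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
])
labels = np.array([0, 0, 1, 1, 1])

def word_probs(matrix, labels, cls):
    """Occurrences of each word in class `cls` / total words in that class."""
    rows = matrix[labels == cls]
    return rows.sum(axis=0) / rows.sum()

p_w_c0 = word_probs(train, labels, 0)  # 0/5, 1/5, 2/5, 2/5
p_w_c1 = word_probs(train, labels, 1)  # 3/7, 2/7, 1/7, 1/7
print(p_w_c0)
print(p_w_c1)
```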
A direct product is fragile: a single zero factor (here P(A | C0) = 0) makes the whole product 0. Take logarithms so that the product becomes a sum:
log[P(W | C0) * P(C0)] = log[P(A | C0) * P(B | C0) * P(C | C0) * P(D | C0) * P(C0)]
= log[P(A | C0)] + log[P(B | C0)] + log[P(C | C0)] + log[P(D | C0)] + log[P(C0)]
Calculate log[P(W | C1) * P(C1)] in the same way.
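The following sketch checks that the log of the product equals the sum of the logs, using the category 1 probabilities from the example (none of which is zero). It also illustrates a second, well-known reason to work in log space that the article does not spell out: a product of many small probabilities underflows to 0.0 in floating point. All variable names are illustrative.

```python
import math

# P(A|C1), P(B|C1), P(C|C1), P(D|C1) and the prior P(C1) from the example
word_probs_c1 = [3 / 7, 2 / 7, 1 / 7, 1 / 7]
prior_c1 = 3 / 5

product = prior_c1 * math.prod(word_probs_c1)
log_sum = math.log(prior_c1) + sum(math.log(p) for p in word_probs_c1)
same = math.isclose(math.log(product), log_sum)
print("log(product) == sum of logs:", same)

# 200 factors of 0.01 multiply to 1e-400, which is below the float range
many_small = [0.01] * 200
underflowed = math.prod(many_small)
stable = sum(math.log(p) for p in many_small)
print(underflowed, stable)
```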
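Test sample (W = 1 0 1 0). The term P(W) is the same for both categories, so it is left out:
log[P(C0 | W)] = 1 * log(0/5) + 0 * log(1/5) + 1 * log(2/5) + 0 * log(2/5) + log(2/5)
log[P(C1 | W)] = 1 * log(3/7) + 0 * log(2/7) + 1 * log(1/7) + 0 * log(1/7) + log(3/5)
Note that log(0/5) is undefined, so the counts must not start at 0.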
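As a sketch, the two scores for the test sample can be evaluated once the log(0) problem is avoided. The code below uses the initialization the article's own notes propose (every word count starts at 1, every denominator at 2.0). The function name `log_score` and the variable names are illustrative, not from the original.

```python
import numpy as np

train = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
])
labels = np.array([0, 0, 1, 1, 1])
test_doc = np.array([1, 0, 1, 0])  # the test document W

def log_score(cls):
    """log[P(W | cls) * P(cls)] with counts starting at 1 and denominator at 2."""
    rows = train[labels == cls]
    word_counts = rows.sum(axis=0) + 1.0   # smoothed numerators
    total = rows.sum() + 2.0               # smoothed denominator
    log_probs = np.log(word_counts / total)
    prior = np.mean(labels == cls)
    # Only words present in the test document contribute to the sum
    return float(test_doc @ log_probs + np.log(prior))

score0 = log_score(0)
score1 = log_score(1)
predicted = 1 if score1 > score0 else 0
print(score0, score1, "predicted category:", predicted)
```

With this smoothing the category 0 word probabilities become 1/7, 2/7, 3/7, 3/7 and the category 1 probabilities become 4/9, 3/9, 2/9, 2/9, so neither score contains log(0).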
# -*- coding: UTF-8 -*-
import numpy as np

'''
1. The Bernoulli model ignores how many times a word appears in a document
   and records only whether it appears. All words are assumed to carry equal weight.
2. The multinomial model counts how many times each word appears.
'''

def loadDataSet():
    postingList = [
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
    ]
    classVec = [0, 1, 0, 1, 0, 1]  # one label per document; 1 = abusive
    return postingList, classVec

def createVocabList(dataSet):
    # Union of all words that occur in any document
    vocaSet = set([])
    for document in dataSet:
        vocaSet = vocaSet | set(document)
    return list(vocaSet)

'''
vocabList = ['', '', ...]
inputSet = ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']
'''
def setOfWords2Vec(vocabList, inputSet):
    # Set-of-words (Bernoulli) vector: 1 if the word occurs in inputSet, else 0
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: %s is not in my vocabulary' % word)
    return returnVec

'''
P(c | w) = P(w | c) * P(c) / P(w)
1. P(c)
2. P(w | c)
trainMatrix   ==> document vectors
trainCategory ==> vector of category labels, here [0, 0, 1, 1, 1]
pAbusive = (0 + 0 + 1 + 1 + 1) / 5

A B C D  category
0 0 1 1  0
0 1 1 1  0
1 0 0 1  1
1 1 0 0  1
1 1 1 0  1
1 0 1 0  ?

numTrainDocs = 5 ==> 5 documents
numWords = 4     ==> 4 features
pAbusive = (0 + 0 + 1 + 1 + 1) / 5 = 3/5 ==> prior probability P(C1)
p0Num = [0, 0, 0, 0]   p1Num = [0, 0, 0, 0]
p0Denom = 0.0          p1Denom = 0.0
0 0 1 1 0 ==> p0Num = [0, 0, 1, 1]   p0Denom = 2
0 1 1 1 0 ==> p0Num = [0, 1, 2, 2]   p0Denom = 5
1 0 0 1 1 ==> p1Num = [1, 0, 0, 1]   p1Denom = 2
1 1 0 0 1 ==> p1Num = [2, 1, 0, 1]   p1Denom = 4
1 1 1 0 1 ==> p1Num = [3, 2, 1, 1]   p1Denom = 7

P(C0 | W) = P(W | C0) * P(C0) / P(W)
          = P(A*B*C*D | C0) * P(C0) / P(W)
          = P(A | C0) * P(B | C0) * P(C | C0) * P(D | C0) * P(C0) / P(W)
P(C1 | W) = P(W | C1) * P(C1) / P(W)
          = P(A*B*C*D | C1) * P(C1) / P(W)
          = P(A | C1) * P(B | C1) * P(C | C1) * P(D | C1) * P(C1) / P(W)
P(W) ==> no need to calculate, it is the same for both categories
max{P(C0 | W), P(C1 | W)} ==> max{log[P(C0 | W)], log[P(C1 | W)]}

log[P(C0 | W)] = log[P(A | C0)] + log[P(B | C0)] + log[P(C | C0)] + log[P(D | C0)] + log[P(C0)]
P(A | C0) = 0 / (0 + 1 + 2 + 2) = 0/5
P(B | C0) = 1 / (0 + 1 + 2 + 2) = 1/5
P(C | C0) = 2 / (0 + 1 + 2 + 2) = 2/5
P(D | C0) = 2 / (0 + 1 + 2 + 2) = 2/5

log[P(C1 | W)] = log[P(A | C1)] + log[P(B | C1)] + log[P(C | C1)] + log[P(D | C1)] + log[P(C1)]
P(A | C1) = 3 / (3 + 2 + 1 + 1) = 3/7
P(B | C1) = 2 / (3 + 2 + 1 + 1) = 2/7
P(C | C1) = 1 / (3 + 2 + 1 + 1) = 1/7
P(D | C1) = 1 / (3 + 2 + 1 + 1) = 1/7

Test sample: 1 0 1 0 ?
log[P(C0 | W)] = 1 * log[0/5] + 0 * log[1/5] + 1 * log[2/5] + 0 * log[2/5] + log[2/5]
log[P(C1 | W)] = 1 * log[3/7] + 0 * log[2/7] + 1 * log[1/7] + 0 * log[1/7] + log[3/5]

Note the presence of log[0] ==> change the initialization to
p0Num = [1, 1, 1, 1]   p1Num = [1, 1, 1, 1]
p0Denom = 2.0          p1Denom = 2.0
'''
def trainNB0(trainMatrix, trainCategory):
    # Unsmoothed version: a word never seen in a category gets probability 0,
    # so np.log() returns -inf for it. Kept for comparison with trainNB1.
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vec = np.log(p1Num / p1Denom)
    p0Vec = np.log(p0Num / p0Denom)
    return p0Vec, p1Vec, pAbusive

def trainNB1(trainMatrix, trainCategory):
    # Smoothed version: counts start at 1 and denominators at 2.0 to avoid log(0)
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vec = np.log(p1Num / p1Denom)
    p0Vec = np.log(p0Num / p0Denom)
    return p0Vec, p1Vec, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Compare the log scores of the two categories
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postingDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postingDoc))
    # trainNB1 is used here: with trainNB0 the -inf entries turn the scores into nan
    p0V, p1V, pAb = trainNB1(trainMat, listClasses)
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
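The notes in the code distinguish the Bernoulli model (a word is present or absent) from the multinomial model (a word is counted). The difference shows up in how a document is turned into a vector. The sketch below contrasts the two. The function names `set_of_words` and `bag_of_words` and the sample data are illustrative assumptions, not part of the original code.

```python
def set_of_words(vocab, doc):
    """Bernoulli-style vector: 1 if the word occurs at all, else 0."""
    vec = [0] * len(vocab)
    for word in doc:
        if word in vocab:
            vec[vocab.index(word)] = 1
    return vec

def bag_of_words(vocab, doc):
    """Multinomial-style vector: number of occurrences of each word."""
    vec = [0] * len(vocab)
    for word in doc:
        if word in vocab:
            vec[vocab.index(word)] += 1
    return vec

vocab = ['dog', 'stupid', 'my']
doc = ['my', 'dog', 'dog', 'park']   # 'park' is not in the vocabulary

print(set_of_words(vocab, doc))   # [1, 0, 1]
print(bag_of_words(vocab, doc))   # [2, 0, 1]
```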
IV. Filtering spam
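import re

def textParse(bigString):
    # Simple tokenization. The split pattern is missing in the source text;
    # r'\W+' (split on runs of non-word characters) is an assumption.
    listOfTokens = re.split(r'\W+', bigString)
    # Simple filter on word length: keep lowercased tokens longer than 2 characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]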
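A short usage sketch shows what such a tokenizer keeps and drops. The function is restated under the hypothetical name `tokenize` so the example runs on its own. The split pattern `r'\W+'` and the sample sentence are assumptions, since the source does not show them.

```python
import re

def tokenize(big_string):
    """Split on runs of non-word characters, lowercase, drop tokens of length <= 2."""
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

sample = 'Hi, is the DOG food on sale? Buy it now!'
tokens = tokenize(sample)
print(tokens)   # ['the', 'dog', 'food', 'sale', 'buy', 'now']
```

Short tokens such as "Hi", "is", "on" and "it" are removed by the length filter, and punctuation never reaches the token list because it is consumed by the split.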