How to implement the kNN algorithm with Python


This article explains how to implement the kNN algorithm with Python. The approach described here is simple, fast, and practical; interested readers are encouraged to follow along.

2.1 The concepts of text mining and text classification

1. Text mining: the process of extracting previously unknown, understandable, and ultimately usable knowledge from a large amount of text data, and using this knowledge to better organize information for future reference.

In short, it is the process of finding knowledge from unstructured texts.

2. Subfields of text mining: search and information retrieval (IR), text clustering, text classification, Web mining, information extraction (IE), natural language processing (NLP), and concept extraction.

3. Text classification: finding the correct category for each document provided by the user.

4. Applications of text classification: text retrieval, spam filtering, hierarchical Web page directories, automatic metadata generation, and topic detection.

5. Text classification methods: one kind is based on pattern (rule) systems, the other on classification models.

2.2 Text classification techniques and the workflow of a Chinese text classification project:

1) Preprocessing: remove noise from the text (e.g. HTML tags) and convert text formats.

2) Chinese word segmentation: segment the text with a Chinese word segmenter and remove stop words.

3) construct the word vector space: count the word frequency of the text and generate the word vector space of the text.

4) Weighting strategy (the TF-IDF method): use TF-IDF to find and extract feature words that reflect the subject of the document.

5) Classifier: train a classifier with the chosen algorithm.

6) Evaluation: analyze the classifier's results on the test set. (A minimal end-to-end sketch of this pipeline follows.)
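Before walking through each step, here is a minimal end-to-end sketch of the above pipeline in scikit-learn. It assumes the documents have already been segmented into space-separated tokens; the variable names (docs_train, labels_train, docs_test, labels_test) and the toy data are invented for illustration and are not part of the project code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# toy, already-segmented documents (hypothetical data)
docs_train = ["machine learning text classification", "stock market finance news"]
labels_train = ["tech", "finance"]
docs_test = ["text classification with machine learning"]
labels_test = ["tech"]

# steps 3-4: build the word vector space with TF-IDF weights
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)   # reuse the training vocabulary

# step 5: train a classifier (multinomial naive Bayes here)
clf = MultinomialNB(alpha=0.001).fit(X_train, labels_train)

# step 6: evaluate the predictions
predicted = clf.predict(X_test)
print(metrics.accuracy_score(labels_test, predicted))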

2.2.1 Text preprocessing

The core task of text preprocessing is to transform unstructured text into a structured form, that is, the vector space model.

Different types of text need to be preprocessed before text processing.

Steps for text preprocessing:

1. Select the range of text to be processed: the entire document or its paragraphs

2. Establish a classified text corpus:

Training set corpus: text resources that have been classified. (file name: train_corpus_small)

Test set corpus: text corpus to be classified (the test corpus of this project is randomly selected from the training corpus) (file name: test_corpus)

3. Text format conversion: convert everything to a unified plain-text format (note: watch out for garbled characters caused by encoding problems).

4. Check sentence boundaries: mark the end of sentences

2.2.2 Introduction to Chinese word segmentation

1. Chinese word segmentation: dividing a sequence of Chinese characters (a sentence) into individual words (a core problem of Chinese natural language processing).

2. Chinese word segmentation algorithm: the conditional random field (CRF), based on probabilistic graphical models.

3. Structured representation of text after word segmentation: word vector space model, topic model, tree representation of dependency syntax, graph representation of RDF

4. The word segmenter used in this project: jieba.

5. Segmentation modes supported by jieba: default (precise) mode, full mode, and search-engine mode. (A short sketch of the three modes follows.)
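As a small illustrative sketch of the three modes (the example sentence is arbitrary, and the printed output is omitted):

# -*- coding: utf-8 -*-
import jieba

sentence = u"我来到北京清华大学"

# default (precise) mode: the most suitable for text analysis
print("/".join(jieba.cut(sentence)))
# full mode: lists every word the dictionary can form from the sentence
print("/".join(jieba.cut(sentence, cut_all=True)))
# search-engine mode: further splits long words, useful for indexing
print("/".join(jieba.cut_for_search(sentence)))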

6. The jieba segmentation code is shown below: it segments the unsegmented corpus and persists the resulting object to a dat file (directory of the segmented corpus: train_corpus_seg).

import sys
import os
import jieba
import pickle
from sklearn.datasets.base import Bunch

reload(sys)
sys.setdefaultencoding('utf-8')

def savefile(savepath, content):
    # persist content to a file
    fp = open(savepath, "wb")
    fp.write(content)
    fp.close()

def readfile(path):
    # read a file's full content
    fp = open(path, "rb")
    content = fp.read()
    fp.close()
    return content

# segment every file of the raw corpus with jieba and save the result
corpus_path = "train_corpus_small/"   # unsegmented training corpus
seg_path = "train_corpus_seg/"        # output directory for the segmented corpus
catelist = os.listdir(corpus_path)    # one subdirectory per category
for mydir in catelist:
    class_path = corpus_path + mydir + "/"
    seg_dir = seg_path + mydir + "/"
    if not os.path.exists(seg_dir):
        os.makedirs(seg_dir)
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        content = readfile(fullname).strip()
        content = content.replace("\r\n", "").strip()
        content_seg = jieba.cut(content)
        savefile(seg_dir + file_path, " ".join(content_seg))
print "end of Chinese word segmentation"

# bundle the segmented corpus into a Bunch object and persist it to a dat file
bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
wordbag_path = "train_word_bag/train_set.dat"
seg_path = "train_corpus_seg/"
catelist = os.listdir(seg_path)
bunch.target_name.extend(catelist)
for mydir in catelist:
    class_path = seg_path + mydir + "/"
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        bunch.label.append(mydir)
        bunch.filenames.append(fullname)
        bunch.contents.append(readfile(fullname).strip())
file_obj = open(wordbag_path, "wb")
pickle.dump(bunch, file_obj)
file_obj.close()
print "end of building text object!"

2.2.3 Introduction to the scikit-learn library

1. Module classification:

1) Classification and regression algorithms: generalized linear model, support vector machine, kNN, naive Bayes, decision tree, feature selection

2) clustering algorithm: K-means

3) Dimension reduction: PCA

4) Model selection: cross-validation

5) Data preprocessing: standardization (mean removal and variance scaling), normalization, binarization, encoding categorical features, imputation of missing values

2.2.4 Vector space model: a structured approach to text classification

1. Vector space model: the text is represented as a vector, and each feature of the vector corresponds to a word that appears in the text.

2. Stop words: before classification, certain words are automatically filtered out to save storage space. They are removed according to a stop-word list, which can be downloaded. (A minimal filtering sketch follows; the full code is in the file.)
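A minimal sketch of stop-word removal, assuming a one-word-per-line stop-word file; the file path and the helper names are hypothetical:

# load a one-word-per-line stop-word list and filter a segmented document
def load_stopwords(path="train_word_bag/hlt_stop_words.txt"):   # path is hypothetical
    with open(path, "rb") as fp:
        return [line.strip().decode("utf-8") for line in fp if line.strip()]

def remove_stopwords(tokens, stopwords):
    stopset = set(stopwords)
    return [w for w in tokens if w not in stopset]

# usage with a segmented document:
# stpwrdlst = load_stopwords()
# filtered = remove_stopwords(list(jieba.cut(content)), stpwrdlst)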

2.2.5 Weighting strategy: the TF-IDF method

1. Word vector space model: convert the words in the text into numbers, and the whole text collection into a word-vector matrix of equal dimensions (simply put: extract each distinct word and represent each text by the number of times each word appears).

2. Normalization: express the counts as probabilities, for example the term frequency TF (relative to the document itself only).

3. Document frequency of a term (used for IDF): computed with respect to all documents in the collection.

TF-IDF weight strategy: calculating the weight vector of the text

1. Meaning of TF-IDF: term frequency combined with inverse document frequency. If a word appears frequently in one article (high term frequency) but rarely in other articles (low document frequency), it is considered to have good discriminative power between categories and is suitable for classification. IDF effectively counterbalances TF.

2. Term frequency (TF): the frequency with which a given word appears in the document (the raw count normalized by the number of words in the document).

3. Inverse document frequency (IDF): for a particular word, divide the total number of documents by the number of documents containing the word, then take the logarithm of the quotient.

4. Computation of TF-IDF: the product of TF and IDF. (A toy calculation follows.)
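To make the definitions concrete, here is a toy TF-IDF calculation with invented numbers (the natural logarithm is used here; the exact log base varies between implementations):

import math

# toy TF-IDF calculation for one word in one document (numbers are invented)
word_count_in_doc = 3        # the word appears 3 times in this document
total_words_in_doc = 100     # the document contains 100 words
total_docs = 10000           # the corpus contains 10,000 documents
docs_containing_word = 100   # the word appears in 100 of them

tf = float(word_count_in_doc) / total_words_in_doc        # TF = 0.03
idf = math.log(float(total_docs) / docs_containing_word)  # IDF = ln(100) ≈ 4.605
print(tf * idf)                                           # TF-IDF ≈ 0.138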

5. The persisted dat file of the segmented corpus is transformed with the TF-IDF strategy; the persistence code is shown below.

import sys
import os
import cPickle as pickle
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

reload(sys)
sys.setdefaultencoding('utf-8')

def readbunchobj(path):
    # load a pickled Bunch object
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    # persist a Bunch object
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

path = "train_word_bag/train_set.dat"
bunch = readbunchobj(path)

tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                   filenames=bunch.filenames, tdm=[], vocabulary={})

# stpwrdlst is the stop-word list loaded earlier (see section 2.2.4)
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)
transformer = TfidfTransformer()
tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
tfidfspace.vocabulary = vectorizer.vocabulary_

space_path = "train_word_bag/tfidfspace.dat"
writebunchobj(space_path, tfidfspace)

2.2.6 Using the naive Bayes classification module

Commonly used text classification methods: kNN nearest neighbor algorithm, naive Bayesian algorithm, support vector machine algorithm

In this section, the naive Bayesian algorithm is selected for text classification, and the test set is randomly selected from the document set of the training set, with 10 documents for each classification.

The test set goes through the same procedure as the training set: word segmentation (directory test_corpus), generating the word-vector file, and generating the word-vector model.

(Difference: when building the test word-vector model, we need to load the training-set word bag and map the test set's word vectors onto the training-set vocabulary to generate the vector space model.) The code is shown below.

path = "test_word_bag/test_set.dat"
bunch = readbunchobj(path)

testspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                  filenames=bunch.filenames, tdm=[], vocabulary={})

# load the training-set word bag and reuse its vocabulary for the test set
trainbunch = readbunchobj("train_word_bag/tfidfspace.dat")

vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5,
                             vocabulary=trainbunch.vocabulary)
transformer = TfidfTransformer()
testspace.tdm = vectorizer.fit_transform(bunch.contents)
testspace.vocabulary = trainbunch.vocabulary

space_path = "test_word_bag/testspace.dat"
writebunchobj(space_path, testspace)

Run the multinomial naive Bayes algorithm to classify the test texts and report the classification error rate. The code is shown below.

from sklearn.naive_bayes import MultinomialNB

# load the TF-IDF vector spaces of the training and test sets
trainpath = "train_word_bag/tfidfspace.dat"
train_set = readbunchobj(trainpath)
testpath = "test_word_bag/testspace.dat"
test_set = readbunchobj(testpath)

# train the multinomial naive Bayes classifier (alpha is the smoothing parameter)
clf = MultinomialNB(alpha=0.001).fit(train_set.tdm, train_set.label)

# predict the test set, print the misclassified documents and the error rate
predicted = clf.predict(test_set.tdm)
total = len(predicted)
rate = 0
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        rate += 1
        print file_name, ": actual category:", flabel, "--> predicted category:", expct_cate
print "error rate:", float(rate) * 100 / float(total), "%"

2.2.7 Classification result evaluation

Indicators for algorithm evaluation in the field of machine learning:

(1) Recall: the ratio of the number of relevant documents retrieved to the number of all relevant documents in the collection; it measures how completely the retrieval system finds relevant documents.

Recall = number of relevant documents retrieved by the system / number of all relevant documents in the collection

(2) Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved; it measures how precise the retrieved results are.

Precision = number of relevant documents retrieved by the system / total number of documents retrieved by the system

(3) F-measure (F_beta)

F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R), where P is the precision and R is the recall

When beta = 1, this is the F1-measure: F1 = 2PR / (P + R). (A toy calculation of the three metrics follows.)
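A toy calculation of the three metrics with invented numbers:

# toy example: the system retrieves 8 documents, 6 of them relevant,
# and the collection contains 10 relevant documents in total
retrieved_relevant = 6.0
retrieved_total = 8.0
relevant_total = 10.0

precision = retrieved_relevant / retrieved_total     # P = 0.75
recall = retrieved_relevant / relevant_total         # R = 0.6
f1 = 2 * precision * recall / (precision + recall)   # F1 = 0.9 / 1.35 ≈ 0.667
print(precision)
print(recall)
print(f1)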

Evaluating the classification results of this text classification project: the code is shown below.

import numpy as np
from sklearn import metrics

def metrics_result(actual, predict):
    # print precision, recall and F1-score of the predictions
    print 'precision: {0:.3f}'.format(metrics.precision_score(actual, predict))
    print 'recall: {0:.3f}'.format(metrics.recall_score(actual, predict))
    print 'f1-score: {0:.3f}'.format(metrics.f1_score(actual, predict))

metrics_result(test_set.label, predicted)

2.3 Classification algorithm: naive Bayes

This section discusses the basic principle of the naive Bayes algorithm and its Python implementation.

2.3.1 Derivation of the Bayes formula

The idea of naive Bayes text classification: it assumes that any two words in the bag of words are independent of each other, that is, the dimensions of an object's feature vector are mutually independent.

Definition of naive Bayesian classification:

(1) Let x = {a1, a2, ..., am} be an item to be classified, where each ai is a feature attribute of x.

(2) There is a category set C = {y1, y2, ..., yn}.

(3) Calculate P(y1 | x), P(y2 | x), ..., P(yn | x).

(4) If P(yk | x) = max{P(y1 | x), P(y2 | x), ..., P(yn | x)}, then x belongs to category yk.

How to calculate the conditional probabilities in step (3):

(1) Find a set of items whose categories are already known, that is, the training set.

(2) the conditional probability estimation of each feature attribute under each category is obtained by statistics, namely:

P(a1 | y1), P(a2 | y1), ..., P(am | y1)

P(a1 | y2), P(a2 | y2), ..., P(am | y2)

……

(3) If the feature attributes are conditionally independent, then by Bayes' theorem:

P(yi | x) = P(x | yi) * P(yi) / P(x)

The denominator is constant for all categories, so you only need to maximize the numerator.

Therefore, the process of Bayesian classification is:

Stage 1: generate the training sample set from the training data (TF-IDF vectors).

Stage 2: calculate P(yi) for each category.

Stage 3: calculate the conditional probabilities of all feature attribute values for each category.

Stage 4: calculate P(x | yi) * P(yi) for each category.

Stage 5: take the category with the largest P(x | yi) * P(yi) as the category of x. (A toy numeric illustration follows.)
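A toy numeric illustration of these stages with invented probabilities (two categories and a two-word item x = {cheap, buy}); the numbers are made up purely for illustration:

# two categories with invented priors and word probabilities
P_spam, P_normal = 0.4, 0.6                            # stage 2: P(yi)
P_word_given_spam = {"cheap": 0.05, "buy": 0.04}       # stage 3
P_word_given_normal = {"cheap": 0.01, "buy": 0.02}     # stage 3

# stage 4: P(x|yi) * P(yi), ignoring the constant denominator P(x)
score_spam = P_spam * P_word_given_spam["cheap"] * P_word_given_spam["buy"]          # 0.0008
score_normal = P_normal * P_word_given_normal["cheap"] * P_word_given_normal["buy"]  # 0.00012

# stage 5: the larger score wins
print("spam" if score_spam > score_normal else "normal")   # -> spam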

2.3.2 Implementation of the naive Bayes algorithm

Example: use a simple English corpus as the dataset; the code is shown below.

# -*- coding: utf-8 -*-
import numpy as np

def loadDataSet():
    # a tiny English corpus and its class labels (1 = abusive, 0 = normal)
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'i', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

class NBayes(object):
    def __init__(self):
        self.vocabulary = []   # vocabulary
        self.idf = 0           # inverse document frequency weights
        self.tf = 0            # term frequency matrix
        self.tdm = 0           # P(x|yi) matrix
        self.Pcates = {}       # P(yi), the category priors
        self.labels = []       # class labels
        self.doclength = 0     # number of training documents
        self.vocablen = 0      # vocabulary size
        self.testset = 0       # vectorized test data

    def train_set(self, trainset, classVec):
        self.cate_prob(classVec)
        self.doclength = len(trainset)
        tempset = set()
        [tempset.add(word) for doc in trainset for word in doc]
        self.vocabulary = list(tempset)
        self.vocablen = len(self.vocabulary)
        self.calc_wordfreq(trainset)
        self.build_tdm()

    # the following helper methods are all called by train_set
    def cate_prob(self, classVec):
        self.labels = classVec
        labeltemps = set(self.labels)
        for labeltemp in labeltemps:
            self.Pcates[labeltemp] = float(self.labels.count(labeltemp)) / float(len(self.labels))

    def calc_wordfreq(self, trainset):
        self.idf = np.zeros([1, self.vocablen])
        self.tf = np.zeros([self.doclength, self.vocablen])
        for indx in xrange(self.doclength):
            for word in trainset[indx]:
                self.tf[indx, self.vocabulary.index(word)] += 1
            for signleword in set(trainset[indx]):
                self.idf[0, self.vocabulary.index(signleword)] += 1

    def build_tdm(self):
        self.tdm = np.zeros([len(self.Pcates), self.vocablen])
        sumlist = np.zeros([len(self.Pcates), 1])
        for indx in xrange(self.doclength):
            self.tdm[self.labels[indx]] += self.tf[indx]
            sumlist[self.labels[indx]] = np.sum(self.tdm[self.labels[indx]])
        self.tdm = self.tdm / sumlist

    def map2vocab(self, testdata):
        # map a test document onto the training vocabulary
        self.testset = np.zeros([1, self.vocablen])
        for word in testdata:
            self.testset[0, self.vocabulary.index(word)] += 1

    def predict(self, testset):
        if np.shape(testset)[1] != self.vocablen:
            print "input error"
            exit(0)
        predvalue = 0
        predclass = ""
        for tdm_vect, keyclass in zip(self.tdm, self.Pcates):
            # P(x|yi) * P(yi); the category with the largest value wins
            temp = np.sum(testset * tdm_vect * self.Pcates[keyclass])
            if temp > predvalue:
                predvalue = temp
                predclass = keyclass
        return predclass

    def calc_tfidf(self, trainset):
        # optional TF-IDF weighting instead of raw word frequencies
        self.idf = np.zeros([1, self.vocablen])
        self.tf = np.zeros([self.doclength, self.vocablen])
        for indx in xrange(self.doclength):
            for word in trainset[indx]:
                self.tf[indx, self.vocabulary.index(word)] += 1
            self.tf[indx] = self.tf[indx] / float(len(trainset[indx]))
            for signleword in set(trainset[indx]):
                self.idf[0, self.vocabulary.index(signleword)] += 1
        self.idf = np.log(float(self.doclength) / self.idf)
        self.tf = np.multiply(self.tf, self.idf)

# usage (the class above is assumed to live in NBayes_lib.py)
from NBayes_lib import *
dataSet, listClasses = loadDataSet()
nb = NBayes()
nb.train_set(dataSet, listClasses)
nb.map2vocab(dataSet[0])
print nb.predict(nb.testset)

2.4 Classification algorithm: kNN

KNN algorithm: calculate the distance between vectors to measure similarity for text classification

2.4.1 The principle of the kNN algorithm

1. Algorithm idea: if most of a sample's k nearest neighbors in the feature space belong to a certain category, then the sample also belongs to that category; k is a user-defined parameter.

2. Steps of the kNN algorithm:

Stage 1: determine the value of k (the number of nearest neighbors), usually an odd number.

Stage 2: determine the distance metric; text classification generally uses the cosine of the angle between vectors. Compute the distance between the point to be classified and every sample point of known category, and select the k closest samples.

Cosine formula: cos = (A . B) / (|A| * |B|)

Stage 3: count the number of samples of each category among the k nearest samples, and assign the data point to the category with the largest count.

2.4.2 Python implementation of the kNN algorithm

# -*- coding: utf-8 -*-
import sys
import os
import operator
from numpy import *
from NBayes_lib import *

reload(sys)
sys.setdefaultencoding('utf-8')

k = 3

def cosdist(vector1, vector2):
    # cosine similarity between two vectors
    return dot(vector1, vector2) / (linalg.norm(vector1) * linalg.norm(vector2))

def classify(testdata, trainSet, listClasses, k):
    # kNN classifier using cosine similarity as the distance measure
    dataSetSize = trainSet.shape[0]
    distances = array(zeros(dataSetSize))
    for indx in xrange(dataSetSize):
        distances[indx] = cosdist(testdata, trainSet[indx])
    sortedDistIndicies = argsort(-distances)   # larger cosine means more similar
    classCount = {}
    for i in range(k):
        voteIlabel = listClasses[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # return the category with the most votes among the k nearest samples
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

dataSet, listClasses = loadDataSet()
nb = NBayes()
nb.train_set(dataSet, listClasses)
print classify(nb.tf[3], nb.tf, listClasses, k)

By now I believe you have a deeper understanding of how to implement the kNN algorithm with Python, so you might as well put it into practice.
