
How to Predict Urban Air Quality with a Handwritten KNN Algorithm in Python



This article introduces the relevant knowledge of "how to handwrite a KNN algorithm in Python to predict urban air quality". Many people run into this kind of difficulty in real cases, so let the editor walk you through how to handle these situations. I hope you will read it carefully and get something out of it!

1. A brief introduction to the KNN algorithm

The KNN (K-Nearest Neighbor) classification algorithm is one of the commonly used algorithms in data mining classification. Its guiding idea is the proverb "he who stays near vermilion gets stained red, and he who stays near ink gets stained black"; in other words, your category can be inferred from your neighbors.

The principle of the KNN nearest-neighbor classification algorithm: to judge the category of an unknown sample, take all known samples as the reference, calculate the distance between the unknown sample and every known sample, select the K known samples closest to it, and then, following the majority-voting rule, assign the unknown sample to the category that most of those K nearest neighbors belong to.

The core idea of the KNN algorithm is to find the k nearest data points and use them to infer the class of the new data point.
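As a minimal sketch of this idea (the toy samples below are invented for illustration), a few lines of Python are enough to express "find the k closest samples, then take a majority vote":

from collections import Counter
import math

# toy labelled samples: ([feature1, feature2], label) -- invented for illustration
samples = [([1.0, 1.2], 'A'), ([0.9, 1.0], 'A'),
           ([3.1, 3.0], 'B'), ([3.3, 2.8], 'B')]

def knn_predict(query, samples, k=3):
    # distance from the query point to every known sample
    dists = [(math.dist(query, feats), label) for feats, label in samples]
    # take the k closest samples, then a majority vote over their labels
    k_nearest = sorted(dists)[:k]
    return Counter(label for _, label in k_nearest).most_common(1)[0][0]

print(knn_predict([1.1, 1.1], samples))   # prints 'A'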

The key points of the KNN algorithm:

1. All sample features must be quantified so that they can be compared.

If a sample feature is non-numerical, some method must be used to quantify it as a number. For example, if a sample feature is a color, you can convert the color to a grayscale value before computing the distance.
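For instance, a hypothetical mapping (not from the original article) that reduces an RGB color to a single grayscale number before it enters the distance calculation might look like this:

def rgb_to_gray(r, g, b):
    # standard luminance weights: reduce an RGB colour to one comparable number
    return 0.299 * r + 0.587 * g + 0.114 * b

print(round(rgb_to_gray(255, 0, 0), 3))   # pure red  -> 76.245
print(round(rgb_to_gray(0, 0, 255), 3))   # pure blue -> 29.07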

2. Sample features should be normalized.

A sample has many parameters, each with its own domain and value range, and they influence the distance calculation differently: parameters with larger values would otherwise outweigh parameters with smaller values. Therefore the sample parameters must be scaled, and the simplest way is to normalize all feature values.
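A minimal sketch of the simplest option, min-max normalization, which scales each feature column into [0, 1] (the column values below are made up):

def min_max_normalize(column):
    lo, hi = min(column), max(column)
    # scale every value into [0, 1] so no single feature dominates the distance
    return [(v - lo) / (hi - lo) for v in column]

aqi = [35, 80, 160, 210]          # hypothetical AQI values
print(min_max_normalize(aqi))     # [0.0, 0.257..., 0.714..., 1.0]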

3. A distance function is required to calculate the distance between two samples

Commonly used distance functions include Euclidean distance, cosine distance, Hamming distance, Manhattan distance, and so on. Euclidean distance is generally chosen as the distance measure, but it is only applicable to continuous variables; for discrete cases such as text classification, cosine distance can be used instead. In general, using special metric-learning algorithms can significantly improve the accuracy of K-nearest-neighbor classification, for example the large margin nearest neighbor method or neighborhood components analysis.

Taking the distance between two points A(x1, y1) and B(x2, y2) in two-dimensional space as an example, the Euclidean distance is commonly computed as d(A, B) = sqrt((x1 - x2)² + (y1 - y2)²).
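For reference (the two points below are arbitrary examples), the three distances most often mentioned above can be computed in a few lines of Python:

import math

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]      # two arbitrary example points

euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))   # 5.0
manhattan = sum(abs(x - y) for x, y in zip(a, b))                # 7.0
dot = sum(x * y for x, y in zip(a, b))
cosine = 1 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(euclidean, manhattan, round(cosine, 4))                    # 5.0 7.0 0.1445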

Determine the value of K

If the K value is too large, underfitting is likely; if it is too small, overfitting is likely. Cross-validation is therefore needed to determine the value of K.
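One common way to run that cross-validation is with scikit-learn, an extra dependency not used elsewhere in this article; the sketch below uses the iris dataset purely as a stand-in for your own feature matrix and labels:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # a stand-in dataset, just for illustration

# evaluate several candidate k values with 5-fold cross-validation
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))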

Advantages of the KNN algorithm:

Simple, easy to understand, easy to implement, no need to estimate parameters, no training

Suitable for classification of rare events

Especially suitable for multi-classification problems (multi-modal, where objects have multiple category tags), KNN performs better than SVM.

Disadvantages of the KNN algorithm:

The main shortcoming of the KNN algorithm in classification appears when the samples are imbalanced, for example when one class has a very large sample size while the other classes are small: when a new sample is input, its K neighbors may be dominated by the large-capacity class. The algorithm only looks at the "nearest" neighbor samples, and a class that merely has many samples is either not actually close to the target sample or genuinely close to it; sheer quantity alone should not decide the result. This can be improved with a weighting method that gives neighbors at a smaller distance from the sample a larger weight (a sketch follows after the next paragraph).

Another disadvantage of this method is the large amount of computation: for every sample to be classified, its distance to all known samples must be calculated in order to find its K nearest neighbor points.
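As a minimal sketch of the distance-weighted voting mentioned above (the neighbor labels and distances are invented for illustration), each neighbor's vote is scaled by 1 / (distance + eps), so close neighbors outweigh a numerous but distant class:

# (label, distance) pairs for the k nearest neighbours -- values invented for illustration
neighbors = [('good', 0.5), ('good', 0.7),
             ('excellent', 2.0), ('excellent', 2.2), ('excellent', 2.5)]

def weighted_vote(neighbors, eps=1e-6):
    votes = {}
    for label, dist in neighbors:
        votes[label] = votes.get(label, 0.0) + 1.0 / (dist + eps)   # closer neighbour -> heavier vote
    return max(votes, key=votes.get)

print(weighted_vote(neighbors))   # 'good' wins although it is outnumbered 2 to 3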

2. Implementation of the KNN algorithm

To implement the KNN algorithm with Python, there are three main steps:

Calculate distances: given a sample to be classified, calculate the distance between it and every already-classified sample.

Find neighbors: select the K classified samples closest to the sample to be classified as its nearest neighbors.

Classification: decide which category the sample to be classified belongs to according to the category to which most of the samples in the K neighbors belong.

3. Using the KNN algorithm to predict urban air quality

1. Get the data

For tabular data like this, you can use pandas's read_html() method directly and save the data to CSV, without writing a crawler to parse the web pages and extract the data.

# -*- coding: UTF-8 -*-
"""
@File: spider.py
@Author: Ye Tingyun
@CSDN: https://yetingyun.blog.csdn.net/
@Data source: http://www.tianqihoubao.com/aqi/beijing-201901.html
"""
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')

for page in range(1, 13):   # 12 months
    if page < 10:
        url = f'http://www.tianqihoubao.com/aqi/guangzhou-20190{page}.html'
        df = pd.read_html(url, encoding='gbk')[0]
        if page == 1:
            # the first month keeps the table's header row
            df.to_csv('2019 Guangzhou air quality data.csv', mode='a+', index=False, header=False)
        else:
            # later months drop the repeated header row (the table's first row)
            df.iloc[1:, :].to_csv('2019 Guangzhou air quality data.csv', mode='a+', index=False, header=False)
    else:
        url = f'http://www.tianqihoubao.com/aqi/guangzhou-2019{page}.html'
        df = pd.read_html(url, encoding='gbk')[0]
        df.iloc[1:, :].to_csv('2019 Guangzhou air quality data.csv', mode='a+', index=False, header=False)
    logging.info(f'Month {page} air quality data downloaded!')

Crawl a few more cities in the same way and save their 2019 historical air quality data locally.
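If you want to automate that, the download loop above can be wrapped in an outer loop over city names; the pinyin slugs below are examples, and the URL pattern is the same one used in spider.py:

import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')

cities = ['guangzhou', 'chengdu', 'tianjin']          # example pinyin slugs used in the site's URLs

for city in cities:
    for page in range(1, 13):                         # 12 months of 2019
        url = f'http://www.tianqihoubao.com/aqi/{city}-2019{page:02d}.html'
        df = pd.read_html(url, encoding='gbk')[0]
        out = df if page == 1 else df.iloc[1:, :]     # keep the header row only for the first month
        out.to_csv(f'2019 {city} air quality data.csv', mode='a+', index=False, header=False)
        logging.info(f'{city}: month {page} downloaded')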

2. Generate the test set and training set

# Use the 2019 Chengdu air quality data as the test set
import pandas as pd

df = pd.read_csv('2019 Chengdu air quality data.csv')

# Take 8 columns: the day's AQI index, AQI ranking, PM2.5, PM10, So2, No2, Co, O3
# .copy() avoids "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame"
df1 = df[['AQI Index', 'current AQI ranking', 'PM2.5', 'PM10', 'So2', 'No2', 'Co', 'O3']].copy()

air_quality = []
# print(df['quality level'].value_counts())
# Convert the quality-level column into numeric strings so they are easy to compare and predict
for i in df['quality level']:
    if i == "excellent":
        air_quality.append('1')
    elif i == "good":
        air_quality.append('2')
    elif i == "mild pollution":
        air_quality.append('3')
    elif i == "moderate pollution":
        air_quality.append('4')
    elif i == "heavy pollution":
        air_quality.append('5')
    elif i == "serious pollution":
        air_quality.append('6')

print(air_quality)
df1['air quality'] = air_quality

# Write the data to test.txt
# print(df1.values, type(df1.values))
with open('test.txt', 'w') as f:
    for x in df1.values:
        print(x)
        s = ','.join([str(i) for i in x])
        # print(s, type(s))
        f.write(s + '\n')

# The air quality data of several other cities serve as the training set, e.g. Tianjin
import pandas as pd

df = pd.read_csv('2019 Tianjin air quality data.csv', encoding='utf-8')

# Take the same 8 columns; .copy() avoids the SettingWithCopyWarning
df1 = df[['AQI Index', 'current AQI ranking', 'PM2.5', 'PM10', 'So2', 'No2', 'Co', 'O3']].copy()

air_quality = []
# Convert the quality-level column into numeric string IDs
for i in df['quality level']:
    if i == "excellent":
        air_quality.append('1')
    elif i == "good":
        air_quality.append('2')
    elif i == "mild pollution":
        air_quality.append('3')
    elif i == "moderate pollution":
        air_quality.append('4')
    elif i == "heavy pollution":
        air_quality.append('5')
    elif i == "serious pollution":
        air_quality.append('6')

print(air_quality)
df1['air quality'] = air_quality

# Append the data to train.txt (mode 'a' so data from several cities can be appended)
with open('train.txt', 'a') as f:
    for x in df1.values:
        print(x)
        s = ','.join([str(i) for i in x])
        f.write(s + '\n')

3. Implement the KNN algorithm

Read dataset

import csv

def read_dataset(self, filename1, filename2, trainingSet, testSet):
    # self added: these functions are used as methods of one class (see the assembly after run())
    with open(filename1, 'r') as csvfile:
        lines = csv.reader(csvfile)            # read all rows
        dataset1 = list(lines)                 # convert to a list
        for x in range(len(dataset1)):         # each row of data
            for y in range(8):
                dataset1[x][y] = float(dataset1[x][y])   # convert the 8 feature columns to floats
            testSet.append(dataset1[x])        # build the test set
    with open(filename2, 'r') as csvfile:
        lines = csv.reader(csvfile)            # read all rows
        dataset2 = list(lines)                 # convert to a list
        for x in range(len(dataset2)):         # each row of data
            for y in range(8):
                dataset2[x][y] = float(dataset2[x][y])   # convert the 8 feature columns to floats
            trainingSet.append(dataset2[x])    # build the training set

Calculate the Euclidean distance

import math

def calculateDistance(self, testdata, traindata, length):
    # length indicates how many dimensions (feature columns) to compare
    distance = 0
    for x in range(length):
        distance += pow(int(testdata[x]) - int(traindata[x]), 2)
    return round(math.sqrt(distance), 3)   # keep 3 decimal places

Find K nearest neighbors

import operator

def getNeighbors(self, trainingSet, test_instance, k):
    # return the k nearest neighbors of the test instance
    distances = []
    length = len(test_instance) - 1   # compare only the feature columns (the last column is the label)
    # calculate the distance from the test instance to every training sample
    for x in range(len(trainingSet)):
        dist = self.calculateDistance(test_instance, trainingSet[x], length)
        print('training set: {} - distance: {}'.format(trainingSet[x], dist))
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))   # sort by distance, ascending
    # print(distances)
    neighbors = []
    for x in range(k):                # take the first k after sorting
        neighbors.append(distances[x][0])
    print(neighbors)
    return neighbors

Find the class with the largest share (majority vote)

import operator

def getResponse(self, neighbors):
    # the minority obeys the majority: vote on which category the neighbors belong to
    class_votes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]            # the air-quality label of each neighbor
        if response in class_votes:            # count the votes for each category
            class_votes[response] += 1
        else:
            class_votes[response] = 1
    print(class_votes.items())
    sortedVotes = sorted(class_votes.items(), key=operator.itemgetter(1), reverse=True)   # sort by vote count, descending
    return sortedVotes[0][0]                   # the category with the most votes is the prediction

Calculate the prediction accuracy

def getAccuracy(self, test_set, predictions):
    correct = 0
    for x in range(len(test_set)):
        # compare each prediction with the actual label of the test sample
        if test_set[x][-1] == predictions[x]:
            correct += 1
        else:
            print(test_set[x], predictions[x])   # inspect the wrong predictions
    print('{} correct predictions out of {} test samples'.format(correct, len(test_set)))
    return correct / len(test_set) * 100.0

Run the main function

# -*- coding: UTF-8 -*-
"""
@Author: Ye Tingyun
@Official account: training Python
@CSDN: https://yetingyun.blog.csdn.net/
"""
def run(self):
    training_set = []   # training set
    test_set = []       # test set
    # load the data and split it into training and test sets
    self.read_dataset('./train_4/test.txt', './train_4/train.txt', training_set, test_set)
    print('Train set: ' + str(len(training_set)))
    print('Test set: ' + str(len(test_set)))
    # generate predictions
    predictions = []
    k = 7   # number of neighbors to use
    for x in range(len(test_set)):                                   # predict every test sample
        neighbors = self.getNeighbors(training_set, test_set[x], k)  # find the k nearest neighbors
        result = self.getResponse(neighbors)                         # majority vote among the neighbors
        predictions.append(result)
    accuracy = self.getAccuracy(test_set, predictions)
    print('prediction accuracy: {:.2f}%'.format(accuracy))           # keep 2 decimal places
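The functions above all take self, which suggests that in the original project they are methods of a single class; the article never shows that class, so the assembly below (the class name AirQualityKNN and the __main__ entry point) is only a hypothetical sketch of how the pieces could be wired together and run:

# Hypothetical wrapper: the class name and this assembly are assumptions for illustration.
class AirQualityKNN:
    read_dataset = read_dataset
    calculateDistance = calculateDistance
    getNeighbors = getNeighbors
    getResponse = getResponse
    getAccuracy = getAccuracy
    run = run

if __name__ == '__main__':
    AirQualityKNN().run()   # reads the test/train files and prints the prediction accuracy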

The running output (the screenshot is omitted here) shows the distance printed for each training sample and the final prediction accuracy.

The prediction accuracy can be improved by collecting air quality data from more cities for training and by tuning the number of neighbors k.

This is the end of "how a handwritten KNN algorithm in Python predicts urban air quality". Thank you for reading. If you want to learn more about the industry, you can follow the website, where the editor will keep publishing practical, high-quality articles for you!
