In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-03-31 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
This article mainly introduces "how to use python to achieve KNN classifier". In daily operation, I believe many people have doubts about how to use python to achieve KNN classifier. Xiaobian consulted all kinds of data and sorted out simple and easy-to-use operation methods. I hope it will be helpful for you to answer the doubt of "how to use python to achieve KNN classifier". Next, please follow the editor to study!
What is supervised learning?
Supervised learning refers to the process of using a group of known samples to adjust the parameters of the classifier to achieve the required performance, also known as supervised training or teacher learning.
Supervised learning is a machine learning task that infers a function from marked training data. The training data includes a set of training examples. In supervised learning, each instance consists of an input object (usually a vector) and a desired output value (also known as a supervisory signal). The supervised learning algorithm is to analyze the training data and generate an inference function, which can be used to map out new examples. A best solution would allow the algorithm to correctly determine the class tags of invisible instances.
It would be clearer to give an example.
This is a dataset that contains the quality, width, height and color scores of some fruit samples.
The aim is to train a model, which will let us know the name of the fruit if we enter mass, width, height and color scores into the model. For example, if we enter a fruit whose quality, width, height, and color score are set to 175, 7.3, 7.2, and 0.61, the model should output the name of the fruit as an apple.
In this case, quality, width, height, and color scores are input features (X). The name of the fruit is the output variable or label (y).
This example may sound silly to you. But this is the mechanism used in monitoring machine learning models.
I'll show a real example with a real data set later.
KNN classifier
KNN classifier is an example of a memory-based machine learning model.
This means that the model remembers the training example, and then they use it to classify objects that have never been seen before.
The k of KNN classifier is the number of training samples retrieved to predict a new test case.
The KNN classifier works in three steps:
When it is given a new instance or instance for classification, it will retrieve the previously memorized training samples and find out the k number of the most recent samples.
The classifier then looks for the k-digit label of the most recent example (the name of the fruit in the above example).
Finally, the model combines these tags for prediction. Usually, it predicts the one with the most tags. For example, if we choose k as 5, in the last five examples, if we have three oranges and two apples, then the predicted value of the new instance will be oranges.
Data preparation
Before you begin, I suggest you check if the following resources are available on your computer:
Numpy library
Pandas library
Matplotlib library
Scikit-Learn library
Jupyter Notebook
If you do not have Jupyter Notebook installed, you can choose another notebook. I suggest you use Google's Colab.
Google Colab Notebook is not private. So don't do any professional or sensitive work there. But it's great for practice. Because many commonly used software packages have been installed in it.
I suggest downloading the dataset. I provided a link at the bottom of the page. You can run every line of code yourself.
First, import the necessary libraries:
% matplotlib notebookimport numpy as npimport matplotlib.pyplot as pltimport pandas as pdfrom sklearn.model_selection import train_test_split
In this tutorial, I will use the Titanic dataset from Kaggle. I have uploaded this dataset to the same folder as my notebook.
Here is how to import a dataset using pandas.
Titanic = pd.read_csv ('titanic_data.csv') titanic.head () # titaninc.head () gives the first five rows of the dataset. We only print the first five lines to check the dataset.
Look at the second column. It contains information if the person survives. 0 indicates that the person survived, and 1 means that the person did not survive.
In this tutorial, our goal is to predict "survival" features.
For simplicity, I will retain some of the key features that are more important to the algorithm and remove the rest.
This data set is very simple. Intuitively, we can see that some columns are not important for predicting "survival" characteristics.
For example, "PassengerId", "Name", "Ticket" and "Cabin" do not seem to help predict whether passengers will survive.
I will make a new data frame with some key features and name it titanic1.
Titanic1 = titanic [['Pclass',' Sex', 'Fare',' Survived']]
The "Sex" column has a string value that needs to be changed. Because the computer doesn't understand words. It only knows numbers. I will change "male" to 0 and "female" to 1.
Titanic1 ['Sex'] = titanic1.Sex.replace ({' male':0, 'female':1})
The following is the appearance of the titanic1 data frame:
Our goal is to predict the "survival" parameters based on other information in the Titanic 1 data frame. Therefore, the output variable or label (y) is "survive". The input features (X) are 'PMI class', 'Sex'' and 'Fare'.
X = titanic1 [['Pclass',' Sex', 'Fare']] y = titanic1 [' Survived'] to develop KNN classifier
First, we need to divide the data set into two sets: the training set and the test set.
We will use the training set to train the model, where the model will remember both input characteristics and output variables.
We will then use a test set to verify that the model can use "P-class", "Sex" and "Fare" to predict passenger survival.
The "train_test_split" method will help split the data. By default, this function uses 75% of the data to get the training set and 25% of the data to get the test set. You can change it. You can specify "train_size" or "test_size".
If train_size is set to 0.8, it is split into 80% of the training data and 20% of the test data. But for me, the default value of 75% is good. So, I didn't use the train_size or test_size parameters.
X_train, X_test, y_train, y_test = train_test_split (X, y, random_state=0)
Remember to use the same value for "random_state". In this way, each time this split is performed, the data of the training set and the test set are the same.
I choose a random state of 0. You can choose a number.
Python's scikit-learn library already has a KNN classifier model. Import.
From sklearn.neighbors import KNeighborsClassifier
Save this classifier in a variable.
Knn = KNeighborsClassifier (n_neighbors = 5)
In this case, n_neighbors is 5.
This means that when we ask our training model to predict the survival probability of a new instance, it requires five recent training data.
Based on the labels of these five training data, the model will predict the labels of the new instance.
Now, I will fit the training data into the model so that the model can remember them.
Knn.fit (X_train, y_train)
You might think that when it remembers the training data, it can predict the label of the training feature 100% correctly. But not necessarily. Why?
Whenever we give input and ask it to predict the tag, it votes from its five nearest neighbors, even if it remembers exactly the same characteristics.
Let's see how accurate it can give us in training data.
Knn.score (X_test, y_test)
The accuracy of training data is 0.83 or 83%.
Remember, we have a test dataset that the model has never seen before. Now check to see how accurately it can predict the labels of the test dataset.
Knn.score (X_test, y_test)
The accuracy is 0.78% or 78%.
Combined with the above code, here are four lines of code that make up the classifier:
Knn = KNeighborsClassifier (n_neighbors = 5) knn.fit (X_train, y_train) knn.score (X_train, y_train) knn.score (X_test, y_test)
Congratulations! You learned the KNN classifier!
Note that the accuracy of the training set is a little higher than that of the test set.
What is over-fitting?
Sometimes, the model learns the training set very well and can predict the label of the training data set very well. However, when we ask the model to predict using test data sets or data sets it has not seen before, if its performance is far worse than the training set, this phenomenon is called over-fitting.
In a word, when the accuracy of the training set is much higher than that of the test set, we call it overfitting.
Forecast
If you want to view the predicted output of the test dataset, do the following:
Enter:
Y_pred = knn.predict (X_test) y_pred
Output:
Array ([0,0,0,1,0,0,0,1,1,0,1,1,1,0,1,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1], dtype=int64)
Or you can enter just one example and find the tag.
I wonder if "Sex" is a woman, that is, Sex=1, when a person travels in "P-class" = 3, and whether she can survive according to our model after paying a "fare" of £25.
Enter:
Knn.predict ([[3,1,25]])
Remember to use two parentheses because it requires a two-dimensional array
Output:
Array ([0], dtype=int64)
The output is zero. This means that according to the model we have trained, this person cannot survive.
Please feel free to try more different inputs, just like this one.
If you want to further analyze the KNN classifier
KNN classifier is very sensitive to k and n_neighbors selection. In the above example, I used n_neighbors=5.
For different n_neighbors, the performance of the classifier is different.
Let's examine its implementation of different n_neighbors on the training dataset and the test dataset. I choose 1 to 20.
Now, we will calculate the training set accuracy and test set accuracy for each n_neighbors from 1 to 20.
Training_accuracy = [] test_accuracy = [] for i in range (1,21): knn = KNeighborsClassifier (n_neighbors = I) knn.fit (X_train, y_train) training_accuracy.append (knn.score (X_train, y_train)) test_accuracy.append (knn.score (X_test, y_test))
After running this code snippet, I got training and testing accuracy for different n_neighbors.
Now, let's compare the accuracy of the training and test sets in the same diagram.
Plt.figure () plt.plot (range (1,21), training_accuracy, label='Training Accuarcy') plt.plot (range (1,21), test_accuracy, label='Testing Accuarcy') plt.title ('Training Accuracy vs Test Accuracy') plt.xlabel (' Accuracy') plt.ylim ([0.7,0.9]) plt.legend (loc='best') plt.show ()
Analyze the chart above
At the beginning, when the n_neighbors is 1, 2 or 3, the training accuracy is much higher than the test accuracy. Therefore, this model is suffering from over-fitting problems.
After that, the accuracy of training and testing became closer. This is the best choice. We want this.
But as n_neighbors becomes more, the accuracy of both training and test sets is declining. We don't need this.
As you can see from the figure above, the ideal n_neighbors for this particular dataset and model should be 6 or 7.
This is a good classifier!
Look at the chart above! When n_neighbors is 7, the accuracy of training and testing is more than 80%.
At this point, the study on "how to use python to achieve KNN classifier" is over. I hope to be able to solve your doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.