How Python uses KNN to fill in missing values

This article explains how Python can fill in missing values with KNN. The content is simple, clear, and easy to follow; work through it to learn how to impute missing values with KNN in Python.

KNN, the K-nearest neighbors algorithm, is one of the simplest classification methods in data mining. "K nearest neighbors" means that each sample can be represented by its K closest neighbors, and the algorithm classifies each record in a data set according to those neighbors. When sample data contain missing values, we can train a KNN model and let it estimate the missing entries; this is how Python uses KNN to fill in missing values.

Look at the code:

# load libraries
import numpy as np
from fancyimpute import KNN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# create a simulated feature matrix
features, _ = make_blobs(n_samples=1000, n_features=2, random_state=1)

# standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)
standardized_features

# create a missing value
true_value = standardized_features[0, 0]
standardized_features[0, 0] = np.nan
standardized_features

# predict the missing value
features_knn_imputed = KNN(k=5, verbose=0).fit_transform(standardized_features)
# older versions of fancyimpute use .complete() instead:
# features_knn_imputed = KNN(k=5, verbose=0).complete(standardized_features)
features_knn_imputed

# compare the true value with the predicted value
print("true value:", true_value)
print("predicted value:", features_knn_imputed[0, 0])

true value: 0.8730186113995938
predicted value: 1.0955332713113226
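If fancyimpute is not available, the same experiment can be reproduced with scikit-learn's KNNImputer, which the rest of this article introduces. A minimal sketch, assuming scikit-learn 0.22 or later (the imputed value will differ slightly from the fancyimpute result):

# same idea with scikit-learn's KNNImputer instead of fancyimpute
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

features, _ = make_blobs(n_samples=1000, n_features=2, random_state=1)
standardized_features = StandardScaler().fit_transform(features)

true_value = standardized_features[0, 0]
standardized_features[0, 0] = np.nan

# n_neighbors=5 mirrors k=5 in the fancyimpute example above
imputed = KNNImputer(n_neighbors=5).fit_transform(standardized_features)
print("true value:", true_value)
print("imputed value:", imputed[0, 0])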

Addendum: a convenient and reliable way to fill missing values in scikit-learn: KNNImputer

In data mining, handling missing values in the samples is an essential step. The choice of imputation method matters, because it has a significant impact on how well the final model fits.

At the end of 2019, scikit-learn released version 0.22, which not only fixed a number of earlier bugs but also added many new features that make data mining easier. Among them is a new and useful missing-value imputation method: KNNImputer. Based on the KNN algorithm, it makes handling missing values simpler and is more reliable than filling with the mean or median directly, because it fills the missing values of the target feature with the help of the distribution of the other features.
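To see why a distance-based fill can be more faithful than a global mean, here is a minimal sketch on a tiny made-up array (the numbers are invented purely for illustration):

# contrast a plain mean fill with a KNN-based fill on a tiny made-up array
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])

# mean of the second column over the observed rows: (2 + 4 + 6) / 3 = 4.0
print(SimpleImputer(strategy="mean").fit_transform(X)[3, 1])
# value copied from the nearest row [3.0, 6.0]: 6.0, closer to the y = 2x pattern
print(KNNImputer(n_neighbors=1).fit_transform(X)[3, 1])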

Next, let's walk through a practical example of how KNNImputer is used.

To use KNNImputer, you need to import it from scikit-learn:

from sklearn.impute import KNNImputer

To start with a small appetizer example: the second sample in data has a missing value.

import numpy as np

data = [[2, 4, 8], [3, np.nan, 7], [5, 8, 3], [4, 3, 8]]

The hyperparameters of KNNImputer mirror those of the KNN algorithm; n_neighbors selects the number of "neighbor" samples. Let's try n_neighbors=1 first.

imputer = KNNImputer(n_neighbors=1)
imputer.fit_transform(data)

As you can see, the first feature (3) and third feature (7) of the second sample are closest, by Euclidean distance, to the first feature (2) and third feature (8) of the first sample, so the missing value is filled in from the first sample and the filled value is 4. What about n_neighbors=2?

imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(data)

This time, by Euclidean distance, the two nearest samples are the first and fourth rows, so the filled value is the mean of their second features, 4 and 3: 3.5.
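If you want to check this distance argument yourself, scikit-learn exposes the NaN-aware Euclidean distance that KNNImputer relies on; a minimal sketch, assuming scikit-learn 0.22 or later:

# distances from the row with the missing value to every row, ignoring the NaN column
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

data = [[2, 4, 8], [3, np.nan, 7], [5, 8, 3], [4, 3, 8]]
print(nan_euclidean_distances([data[1]], data))
# rows 0 and 3 come out as the closest neighbours (distance ~1.73 each),
# so n_neighbors=2 averages their second features: (4 + 3) / 2 = 3.5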

Next, let's look at a real case. This data set comes from Kaggle's Pima Indians diabetes prediction classification competition and contains many missing values, so let's try imputing them with KNNImputer.

import numpy as np
import pandas as pd
import pandas_profiling as pp
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(context="notebook")
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

from sklearn.impute import KNNImputer

# Loading the dataset
diabetes_data = pd.read_csv('pima-indians-diabetes.csv')
diabetes_data.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                         'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
diabetes_data.head()

In this data set, a value of 0 represents a missing value, so we first need to convert the zeros to NaN and then handle the missing values.

diabetes_data_copy = diabetes_data.copy(deep=True)
diabetes_data_copy[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = \
    diabetes_data_copy[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.NaN)
print(diabetes_data_copy.isnull().sum())

Here we try to use DiabetesPedigreeFunction and Age to impute the 35 missing values in BloodPressure with KNNImputer.

Let's take a look at the samples that contain the missing values:

null_index = diabetes_data_copy.loc[diabetes_data_copy['BloodPressure'].isnull(), :].index
null_index

imputer = KNNImputer(n_neighbors=10)
diabetes_data_copy[['BloodPressure', 'DiabetesPedigreeFunction', 'Age']] = imputer.fit_transform(
    diabetes_data_copy[['BloodPressure', 'DiabetesPedigreeFunction', 'Age']])
print(diabetes_data_copy.isnull().sum())
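Because KNNImputer works on plain Euclidean distances, columns on very different scales (here Age versus DiabetesPedigreeFunction) can dominate the neighbour search. One optional refinement, a sketch rather than part of the original recipe, is to standardize the columns before imputing and transform back afterwards, continuing from the diabetes_data_copy dataframe above:

# optional: standardize the three columns before imputing, then undo the scaling
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

cols = ['BloodPressure', 'DiabetesPedigreeFunction', 'Age']
scaler = StandardScaler()                        # NaNs are ignored when fitting
scaled = scaler.fit_transform(diabetes_data_copy[cols])
imputed = KNNImputer(n_neighbors=10).fit_transform(scaled)
diabetes_data_copy[cols] = scaler.inverse_transform(imputed)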

You can see that the 35 missing values in BloodPressure have now disappeared. Let's take a look at the filled-in data:

diabetes_data_copy.iloc[null_index]

At this point, the missing values in BloodPressure have been filled in by KNNImputer based on DiabetesPedigreeFunction and Age. Note that non-numeric features need to be converted to numeric features before KNNImputer can fill them, because the current KNNImputer only supports numeric features.
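For instance, a string column would have to be mapped to numbers first; a minimal sketch with a hypothetical Gender column that is not part of this dataset:

# hypothetical example: encode a string column before KNN imputation
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'Gender': ['M', 'F', np.nan, 'F'], 'Age': [30, 25, 40, np.nan]})
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1})   # string labels -> numeric codes, NaN stays NaN
print(KNNImputer(n_neighbors=1).fit_transform(df[['Gender', 'Age']]))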

Thank you for reading. That is how Python fills in missing values with KNN. After studying this article you should have a deeper understanding of the approach, though the specific usage still needs to be verified in practice.
