How to use k-nearest neighbor algorithm to identify gender based on data in big data 07/15 Update SLTechnology News&Howtos

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to use the k-nearest neighbor algorithm to identify gender from data. It walks through the method with a concrete example, and I hope you get something out of it.

The k-nearest neighbor algorithm is one of the simplest machine learning algorithms, and it is easy to use for classification.

You will need:

A Python environment with NumPy (the code below uses array, tile, zeros, and shape)

A training set

In essence, the k-nearest neighbor algorithm computes the distances between the feature values of different samples and classifies based on them. That description is rather abstract, so here is an example (finding the gender of X):

Name       Number of clothes owned   Daily makeup and dressing time   Gender
Xiaoli     40                        30                               female
Xiaoming   18                        10                               male
Xiaorong   32                        34                               female
Wang       23                        17                               male
Xiaofang   26                        7                                male
Xiaomin    38                        23                               female
X          35                        15                               ?

To find the gender of X with the k-nearest neighbor algorithm, we compute the distance between X's feature values and everyone else's. We also have to consider scale: a feature with larger values influences the result more. Here, the number of clothes is generally larger than the dressing time, so it affects the result more, although it should not. We will explain how to deal with this later. The formula is as follows (Euclidean distance):

$$d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$

For example, the distance between X and Xiaoli is:

$$d = \sqrt{(35 - 40)^2 + (15 - 30)^2} = \sqrt{250} \approx 15.81$$

After computing the distance between X and everyone else, take the k people with the shortest distances. The gender that appears most often among these k people is the predicted gender of X.
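As a rough illustration of the ranking step, the sketch below computes the distance from X to each person in the sample data above and sorts the results from nearest to farthest. It uses only the standard library (math.dist, Python 3.8+); the names samples, x, and ranked are chosen here for illustration and do not come from the article's code.

```python
import math

# Sample data from the example: name -> ((clothes owned, makeup time), gender)
samples = {
    "Xiaoli": ((40, 30), "female"),
    "Xiaoming": ((18, 10), "male"),
    "Xiaorong": ((32, 34), "female"),
    "Wang": ((23, 17), "male"),
    "Xiaofang": ((26, 7), "male"),
    "Xiaomin": ((38, 23), "female"),
}
x = (35, 15)

# Euclidean distance from X to each person, sorted from nearest to farthest
ranked = sorted(
    (math.dist(x, features), name, gender)
    for name, (features, gender) in samples.items()
)

for distance, name, gender in ranked:
    print(f"{name:10s} {gender:7s} {distance:6.2f}")
```

The three nearest neighbors are Xiaomin, Xiaofang, and Wang, so with k = 3 the vote is two males to one female.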

This k is where the algorithm gets its name. You choose k yourself, and it should be adjusted as the number of samples grows.
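To see why the choice of k matters, here is a small sketch that runs the vote on the example data for several values of k. The helper knn_vote is hypothetical (it is not the article's function) and uses collections.Counter for the majority vote. Odd values of k are used so that a two-class vote cannot tie.

```python
import math
from collections import Counter

# (features, gender) pairs from the example data
samples = [
    ((40, 30), "female"),
    ((18, 10), "male"),
    ((32, 34), "female"),
    ((23, 17), "male"),
    ((26, 7), "male"),
    ((38, 23), "female"),
]
x = (35, 15)

def knn_vote(x, samples, k):
    """Return the most common label among the k samples nearest to x."""
    nearest = sorted(samples, key=lambda s: math.dist(x, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

for k in (1, 3, 5):
    print(k, knn_vote(x, samples, k))
```

On this data, k = 1 gives female while k = 3 and k = 5 give male, so a single nearest neighbor can disagree with the wider neighborhood.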

First, write a function for the algorithm:

import operator
from numpy import array, shape, tile, zeros
# array, shape, and zeros are used by the later snippets

def classify(X, dataSet, labels, k):
    # X is the feature vector of the object to classify, dataSet holds the samples,
    # and labels holds the class of each sample
    dataSetSize = dataSet.shape[0]
    # number of rows (samples) in the matrix
    minusMat = tile(X, (dataSetSize, 1)) - dataSet
    # subtract; tile(X, (n, 1)) repeats X n times, once per row
    sqMinusMat = minusMat ** 2
    # square each difference
    sqDistances = sqMinusMat.sum(axis=1)
    # add up; axis=1 sums each row
    distances = sqDistances ** 0.5
    # square root, completing the Euclidean formula
    sortedDistIndicies = distances.argsort()
    # indices that sort the distances from smallest to largest
    classCount = {}
    # dictionary for counting votes
    for i in range(k):
        # runs k times
        votelabels = labels[sortedDistIndicies[i]]
        classCount[votelabels] = classCount.get(votelabels, 0) + 1
        # get the current count for this label (0 if absent) and add one
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # classCount.items() yields the (label, count) pairs in the dictionary
    # reverse=True sorts in descending order
    # key=operator.itemgetter(1) sorts the pairs by their count, e.g. the 2 in ('male', 2)
    return sortedClassCount[0][0]
    # return the label that appears most often

Let's use the classify function to identify the gender of X.

First create the sample set:

def createDataSet():
    group = array([[40, 30], [18, 10], [32, 34], [23, 17], [26, 7], [38, 23]])
    labels = ['female', 'male', 'female', 'male', 'male', 'female']
    return group, labels

Write out the data of X, pass the sample set and data into the function, and run:

X = [35, 15]
group, labels = createDataSet()
print(classify(X, group, labels, 3))

Somewhat surprisingly, the result is male.
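The result can be cross-checked with a more compact variant. The sketch below is not the article's function: it relies on NumPy broadcasting instead of tile and on collections.Counter instead of a sorted dictionary, and the name classify_broadcast is hypothetical.

```python
from collections import Counter

import numpy as np

def classify_broadcast(x, data_set, labels, k):
    """Hypothetical compact k-NN: majority label among the k nearest rows."""
    data_set = np.asarray(data_set, dtype=float)
    x = np.asarray(x, dtype=float)
    # Broadcasting subtracts x from every row without building a tiled copy
    distances = np.sqrt(((data_set - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group = np.array([[40, 30], [18, 10], [32, 34], [23, 17], [26, 7], [38, 23]])
labels = ['female', 'male', 'female', 'male', 'male', 'female']
print(classify_broadcast([35, 15], group, labels, 3))
```

With the same data and k = 3 it also returns male, which agrees with the result above.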

But we have not dealt with scale yet. As mentioned earlier, the number of clothes you own and the time you spend on makeup each day do not carry the same weight in the distance. So we need to normalize the feature values.

Normalizing the feature values means converting every value into a number between 0 and 1 according to its position within the feature's range (a bit like the sigmoid function, whose output also lies between 0 and 1). The calculation goes like this:

newValue = (oldValue - min) / (max - min)
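As a quick worked example, the formula can be applied by hand to a single feature. The sketch below scales the clothes column of the example data; the variable names are illustrative, and it assumes the column contains at least two different values so that max - min is not zero.

```python
# Number of clothes owned by the six people in the example data
clothes = [40, 18, 32, 23, 26, 38]

lo, hi = min(clothes), max(clothes)

# newValue = (oldValue - min) / (max - min), applied to every value
scaled = [(v - lo) / (hi - lo) for v in clothes]

print([round(v, 4) for v in scaled])
```

The smallest value (18) maps to 0, the largest (40) maps to 1, and everything else lands in between.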

Convert to a function:

def balance(X, dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    # minimum and maximum of each of the two columns
    ranges = maxVals - minVals
    # range of each column
    normDataSet = zeros(shape(dataSet))
    # create a zero matrix of the same size
    m = dataSet.shape[0]
    # number of rows
    normDataSet = dataSet - tile(minVals, (m, 1))
    # oldValue - min
    normDataSet = normDataSet / tile(ranges, (m, 1))
    # (oldValue - min) / (max - min)
    # apply the same scaling to X
    X_return = zeros(shape(X))
    X_return = X - tile(minVals, (1, 1))
    X_return = X_return / tile(ranges, (1, 1))
    return normDataSet, X_return

The feature values of X after normalization are:

[[0.77272727 0.2962963 ]]

And the training set becomes:

[[1.         0.85185185]
 [0.         0.11111111]
 [0.63636364 1.        ]
 [0.22727273 0.37037037]
 [0.36363636 0.        ]
 [0.90909091 0.59259259]]

Clearly, this scale is more reasonable: a feature no longer outweighs the others just because its values are larger.
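The normalized values shown above can be reproduced with a shorter sketch that uses NumPy broadcasting. The helper name min_max_normalize is hypothetical and the code is an illustration under that assumption, not the article's function. It also checks that every column of the training set now spans exactly 0 to 1.

```python
import numpy as np

def min_max_normalize(data_set, x):
    """Hypothetical helper: scale each column of data_set to [0, 1], and x with it."""
    data_set = np.asarray(data_set, dtype=float)
    lo = data_set.min(axis=0)
    span = data_set.max(axis=0) - lo
    # Broadcasting applies the per-column minimum and range to every row
    return (data_set - lo) / span, (np.asarray(x, dtype=float) - lo) / span

group = np.array([[40, 30], [18, 10], [32, 34], [23, 17], [26, 7], [38, 23]])
norm_group, norm_x = min_max_normalize(group, [35, 15])

print(norm_x)
print(norm_group.min(axis=0), norm_group.max(axis=0))
```

Note that X is scaled with the minimum and range of the training set, not its own, so a new point can fall outside 0 to 1 if it lies beyond the training data.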

Recalculate:

X = [35, 15]
group, labels = createDataSet()
group, X = balance(X, group)
print(X)
print(group)
print(classify(X, group, labels, 3))

The result is still male, which is rather unexpected. Because X owns a large number of clothes, we would subjectively assume a girl, but the more objective data point to a boy.

That is how the k-nearest neighbor algorithm can be used to identify gender based on data. If you have similar questions, the analysis above should help.
