This article explains the nearest neighbor technique in data mining and how to use WEKA as a server-side library. The content is detailed and approachable, so readers interested in either topic can follow along step by step. I hope it is helpful; let's take a closer look at nearest neighbor and at calling WEKA from server-side code.
In our previous articles, we used WEKA as a stand-alone application. How useful is that in practice? Obviously, it is not ideal. Because WEKA is a Java-based application, it also ships a Java library that you can call from your own server-side code. For most people, this is probably the more common usage, because you can write code that continually analyzes your data and adjusts dynamically, rather than relying on someone to extract the data, convert it to the WEKA format, and run it by hand in the WEKA Explorer.
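For example, a server-side process could load an ARFF file and score data with just a few calls into the WEKA library. The following is a minimal sketch under the assumption that weka.jar is on the classpath; the file path, the class name, and the choice of the IBk nearest-neighbor classifier (covered later in this article) are illustrative only.

import weka.classifiers.Classifier;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ServerSideWeka {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file directly from server-side code (hypothetical path).
        Instances data = new DataSource("/data/customers.arff").getDataSet();
        // The last attribute is the value we want to predict.
        data.setClassIndex(data.numAttributes() - 1);

        // Build a nearest-neighbor model without ever opening the WEKA Explorer.
        Classifier model = new IBk(1);
        model.buildClassifier(data);

        // Predict the class of the first instance as a quick sanity check.
        double prediction = model.classifyInstance(data.instance(0));
        System.out.println(data.classAttribute().value((int) prediction));
    }
}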
Nearest neighbor
Nearest neighbor (also known as Collaborative Filtering or Instance-based Learning) is a very useful data mining technique that predicts the unknown output value of a new data instance from previous data instances whose output values are known. Described this way, nearest neighbor sounds a lot like regression and classification, so how does it differ? First, regression can only be used for numerical outputs, which is the most direct difference. Classification, as we saw in the example in the previous article, builds a tree from the data instances and traverses that tree to find the answer, and that can be a serious problem for some data. Think of a company like Amazon and its "customers who bought X also bought Y" feature. If Amazon were to create a classification tree, how many branches and nodes would it need? It carries hundreds of thousands of products. How big would that tree be, and how accurate could such a huge tree be? Even if you got down to a single branch, you might be surprised to find it recommends only three products, while Amazon's page typically shows you 12. For this kind of data, a classification tree is a very poor fit.
Nearest neighbor solves all of these problems very effectively, especially in the Amazon example above. It is not limited by quantity: it scales just as well for a database of 20 customers as for one of 20 million, and you can define how many results you want. It looks like a great technique, and it really is, and it is probably the most useful one for any e-commerce shop owners reading this article.
Let's first explore the mathematical theory behind the nearest neighbor so that we can better understand the process and some of the limitations of this technique.
Mathematical theory behind the nearest neighbor
The mathematics behind the nearest neighbor technique is very similar to the mathematics involved in the clustering technique: for an unknown data point, you calculate the distance between it and every known data point. Computing those distances in a spreadsheet would be tedious, but a high-performance computer can do them almost instantly. The easiest and most common way to compute the distance is the Normalized Euclidean Distance. It sounds complicated, but it is not. Let's work through an example to find out what our fifth customer is likely to buy.
Listing 1. The mathematical theory of the nearest neighbor

Customer     Age     Income     Purchased Product
1            45      46k        Book
2            39      100k       TV
3            35      38k        DVD
4            69      150k       Car Cover
5            58      51k        ?

Step 1: Determine the distance formula
Distance = SQRT( ((58 - Age) / (69 - 35))^2 + ((51000 - Income) / (150000 - 38000))^2 )

Step 2: Calculate the score
Customer     Score     Purchased Product
1            0.385     Book
2            0.710     TV
3            0.686     DVD
4            0.941     Car Cover
5            0.0       ?
If we use the nearest neighbor algorithm to answer the question of what the fifth customer is most likely to buy, the answer will be a book. This is because the distance between the fifth customer and the first customer is shorter (actually much shorter) than the distance between the fifth customer and any other customer. Based on this model, it can be concluded that the behavior of the fifth customer can be predicted by the customer who is most like the fifth customer.
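To make the arithmetic concrete, here is a small sketch in plain Java (not part of WEKA; the class and variable names are invented for illustration) that reproduces the normalized Euclidean scores from Listing 1 and picks the nearest neighbor for the fifth customer.

public class NearestNeighborDistance {
    // Known customers: {age, income}, aligned with the purchases below (Listing 1).
    static final double[][] KNOWN = { {45, 46000}, {39, 100000}, {35, 38000}, {69, 150000} };
    static final String[] PURCHASED = { "Book", "TV", "DVD", "Car Cover" };

    public static void main(String[] args) {
        double age = 58, income = 51000;                          // customer 5
        double ageRange = 69 - 35, incomeRange = 150000 - 38000;  // min/max from the table

        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < KNOWN.length; i++) {
            // Normalized Euclidean distance from Listing 1.
            double dAge = (age - KNOWN[i][0]) / ageRange;
            double dIncome = (income - KNOWN[i][1]) / incomeRange;
            double distance = Math.sqrt(dAge * dAge + dIncome * dIncome);
            System.out.printf("Customer %d: %.3f%n", i + 1, distance);
            if (distance < best) { best = distance; nearest = i; }
        }
        System.out.println("Customer 5 will most likely buy: " + PURCHASED[nearest]);
    }
}

Running it prints the four scores from Listing 1 (0.385, 0.710, 0.686, 0.941) and concludes that customer 5 will most likely buy a book.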
But the benefits of nearest neighbor do not stop there. The algorithm can be extended to consider not just the single closest match but any number of closest matches, which are called the "N-nearest neighbors" (for example, the 3-nearest neighbors). Going back to the example above, if we ask which products the fifth customer is most likely to buy using the 2-nearest neighbors, the answer is a book and a DVD. In Amazon's case, if it wanted the 12 products a customer is most likely to buy, it could run a 12-nearest-neighbor algorithm (though Amazon actually runs something far more complex than a simple 12-nearest-neighbor algorithm).
Moreover, the algorithm is not limited to predicting which product a customer will buy; it can also predict a Yes/No output value. Using the example above, if we changed the last column (for customers 1 through 4) to "Yes, No, Yes, No", a 1-nearest-neighbor model would predict that the fifth customer will say "Yes"; a 2-nearest-neighbor model would also predict "Yes" (customers 1 and 3 both say "Yes"); and a 3-nearest-neighbor model would still predict "Yes" (customers 1 and 3 say "Yes", customer 2 says "No", so the majority vote is "Yes").
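As a quick illustration of that majority vote (a plain sketch, not WEKA's own implementation; the class and method names are invented), the following snippet ranks the customers by the scores from Listing 1 and votes among the k nearest.

import java.util.Arrays;
import java.util.Comparator;

public class KnnVote {
    // Majority vote over the k nearest neighbors; labels are "Yes"/"No".
    static String predict(double[] distances, String[] labels, int k) {
        Integer[] order = new Integer[distances.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> distances[i]));

        int yes = 0;
        for (int i = 0; i < k; i++) {
            if ("Yes".equals(labels[order[i]])) yes++;
        }
        return yes * 2 > k ? "Yes" : "No";
    }

    public static void main(String[] args) {
        double[] distances = { 0.385, 0.710, 0.686, 0.941 };  // scores from Listing 1
        String[] labels = { "Yes", "No", "Yes", "No" };        // the replaced last column
        System.out.println(predict(distances, labels, 3));     // prints "Yes"
    }
}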
The last question to consider is "how many neighbors should we use in our model?" Ah-ha, not everything is that simple. Determining the optimal number of neighbors requires experimentation. Also, if you are predicting an output column whose values are 0 and 1, you obviously want an odd number of neighbors so a vote can never end in a tie.
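One simple experiment is to cross-validate a few odd values of k and compare the accuracy. Below is a sketch using WEKA's Java library; the ARFF path is hypothetical, and 10-fold cross-validation over k = 1, 3, 5, 7 is just one reasonable setup, not a prescribed procedure.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ChooseNeighbors {
    public static void main(String[] args) throws Exception {
        // Hypothetical dataset; any ARFF file with a nominal class attribute works.
        Instances data = new DataSource("/data/customers.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Compare a few odd values of k with 10-fold cross-validation.
        for (int k : new int[] {1, 3, 5, 7}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new IBk(k), data, 10, new Random(1));
            System.out.printf("k = %d  accuracy = %.2f%%%n", k, eval.pctCorrect());
        }
    }
}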
Dataset for WEKA
The dataset we will use for our nearest neighbor example should look familiar: it is the same dataset we used in the classification example in the previous article. It describes a fictional BMW dealership and its promotional campaign to sell a two-year extended warranty to past customers. To review the dataset, here are the attributes I introduced in the previous article.
There are 4,500 data points from past sales of extended warranties. The attributes in the dataset are: income bracket (0 = $0-$30k, 1 = $31k-$40k, 2 = $41k-$60k, 3 = $61k-$75k, 4 = $76k-$100k, 5 = $101k-$150k, 6 = $151k-$500k, 7 = $501k+), the year/month the customer bought their first BMW, the year/month the customer bought their most recent BMW, and whether the customer responded to the extended warranty promotion in the past.
Listing 2. The nearest neighbor WEKA data

@attribute IncomeBracket {0,1,2,3,4,5,6,7}
@attribute FirstPurchase numeric
@attribute LastPurchase numeric
@attribute responded {1,0}

@data
4,200210,200601,0
5,200301,200601,1
...

Nearest neighbor in WEKA
Why use the same dataset as in the classification example? Because, if you remember, the classification model achieved only 59 percent accuracy, which was totally unacceptable (barely better than guessing). We will improve on that accuracy and give this fictional dealership some genuinely useful information.
Load the data file bmw-training.arff into WEKA using the same steps in the Preprocess tab that we have used before. After loading the data, the screen should look like Figure 1.
Figure 1. BMW nearest neighbor data within WEKA
As we did with the regression and classification models in the previous articles, we next select the Classify tab. On this tab, choose lazy, and then select IBk (the IB stands for Instance-Based, and the k lets us specify how many neighbors to use).
Figure 2. BMW nearest neighbor algorithm
Now we are ready to create our model in WEKA. Make sure Use training set is selected so that we use the dataset we just loaded to build the model. Click Start and let WEKA run. Figure 3 shows a screenshot, and Listing 3 contains the output of this model.
Figure 3. BMW nearest neighbor model
Listing 3. Output from IBk calculation

=== Evaluation on training set ===

=== Summary ===

Correctly Classified Instances        2663               88.7667 %
Incorrectly Classified Instances       337               11.2333 %
Kappa statistic                          0.7748
Mean absolute error                      0.1326
Root mean squared error                  0.2573
Relative absolute error                 26.522  %
Root relative squared error             51.462  %
Total Number of Instances             3000

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.95      0.177     0.847      0.95      0.896       0.972      1
                 0.823               0.941      0.823     0.878       0.972      0
Weighted Avg.    0.888     0.114     0.893      0.888     0.887       0.972

=== Confusion Matrix ===

    a    b
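If you would rather produce this output from server-side code instead of the Explorer, as discussed at the start of this article, the same kind of model can be built and evaluated against the WEKA Java library. This is a minimal sketch assuming weka.jar is on the classpath and bmw-training.arff sits in the working directory; the class name is made up for illustration.

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BmwNearestNeighbor {
    public static void main(String[] args) throws Exception {
        // Load the same training file used in the Explorer.
        Instances train = new DataSource("bmw-training.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);   // "responded" is the class attribute

        IBk knn = new IBk();          // defaults to a single nearest neighbor
        // knn.setKNN(3);             // uncomment to switch to a 3-nearest-neighbor model
        knn.buildClassifier(train);

        // "Use training set" in the Explorer: evaluate on the data we trained on.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(knn, train);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}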