How to use WEKA for data mining

This article introduces how to use WEKA for data mining. It compares the main model types, then walks step by step through building and evaluating a classification model on a practical dataset.
Classification vs. clustering vs. nearest neighbor
Before delving into the details of each method and using them in WEKA, it helps to understand each model: what kind of data each fits and what each tries to achieve. We will also keep our existing model, the regression model, in the discussion, so you can compare the three new models with the one we already know. I will show the use of each model and its differences through a practical example. The example revolves around a local BMW dealership and how it can increase sales. The dealership has kept all of its past sales information, along with information about every customer who bought a BMW, asked about a BMW, or walked through the BMW showroom. The dealership wants to increase future sales and is deploying data mining to achieve that goal.
Regression
Question: "how do we price the new BMW M5?" The regression model can only give a numerical answer to this problem. The regression model uses past sales data from BMW and M5 to determine the price at which people used to buy cars at the dealership based on the properties and selling points of the cars sold. The regression model then allows BMW dealerships to insert the attributes of the new car to determine its price.
For example: Selling price = $25,000 + ($2,900 * liters in engine) + ($9,000 * isSedan) + ($11,000 * isConvertible) + ($11,000 * inches of car) + ($22,000 * isM).
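To make the arithmetic concrete, here is a minimal sketch of that formula as plain Java (WEKA itself is written in Java). The coefficients are taken from the example formula above; the sample car in main is hypothetical.

// The example regression formula above, expressed as a function.
public class M5PriceEstimate {

    static double sellingPrice(double litersInEngine, boolean isSedan,
                               boolean isConvertible, double inchesOfCar,
                               boolean isM) {
        return 25000
                + 2900 * litersInEngine       // $2,900 per liter of engine
                + (isSedan ? 9000 : 0)        // $9,000 if a sedan
                + (isConvertible ? 11000 : 0) // $11,000 if a convertible
                + 11000 * inchesOfCar         // $11,000 per inch, as given above
                + (isM ? 22000 : 0);          // $22,000 for an M model
    }

    public static void main(String[] args) {
        // Hypothetical M5: 4.4-liter engine, sedan, 191 inches long, M model.
        System.out.printf("Estimated price: $%,.0f%n",
                sellingPrice(4.4, true, false, 191, true));
    }
}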
Classification
Question: "so how likely is customer X to buy the latest BMW M5?" Create a classification tree (a decision tree) and mine the data to determine how likely the person is to buy a new M5. The nodes on the tree can be age, income level, number of cars currently owned, marital status, children or not, owner or tenant. Using these attributes of this person on this decision tree can determine the possibility of him buying M5.
Clustering
Question: "Which age groups like the silver BMW M5 best?" This requires mining the data to compare the ages of past car buyers with the colors of the cars they bought. From this data we might find that one age group (say, 22-30 year olds) has a higher tendency to order a certain color of BMW M5 (75% buy blue), while a different age group (say, 55-62 year olds) tends to order silver BMWs (65% buy silver, 20% buy gray). Once mined, the data clusters around certain age groups and certain colors, letting the user quickly judge the patterns in it.
Nearest neighbor
Question: "when people buy BMW M5, what other options do they tend to buy at the same time?" Data mining shows that when people enter a store and buy an BMW M5, they also tend to buy a matching suitcase. This is also known as shopping basket analysis. Using this data, car dealerships will place promotional advertisements for matching suitcases in a conspicuous place on the storefront, or even in newspapers, and if they buy M5, matching suitcases will be free / discounted in order to increase sales.
Classification
Classification (also known as a classification tree or decision tree) is a data mining algorithm that creates a step-by-step guide for determining the output of a new data instance. Each node in the tree it creates represents a point where a decision must be made based on the input, and you move from one node to the next until you reach a leaf, which gives the predicted output. That may sound confusing, but it is actually quite intuitive. Let's look at an example.
Listing 1. Simple classification tree

        [Will You Read This Section?]
               /            \
             Yes             No
             /                \
 [Will You Understand It?]  [Won't Learn It]
        /        \
      Yes         No
      /            \
 [Will Learn It]  [Won't Learn It]
This simple classification tree tries to answer the question "Will you understand classification trees?" At each node, you answer the question and move down that branch until you reach a leaf that answers "Will Learn It" or "Won't Learn It." This model can be applied to any unknown data instance to predict whether it will understand classification trees by asking only two simple questions. That is the big advantage of classification trees: they can create a remarkably accurate and informative tree without requiring a great deal of information about the data.
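Since the rest of this series works with WEKA, which is written in Java, here is a minimal sketch of Listing 1 expressed as Java code: each if is a decision node and each return is a leaf. The class and method names are only illustrative.

// Listing 1 as nested conditionals: each branch is a node, each return a leaf.
public class SimpleTree {

    static String willLearn(boolean willRead, boolean willUnderstand) {
        if (!willRead) {
            return "Won't Learn It";  // right branch of the root node
        }
        return willUnderstand ? "Will Learn It" : "Won't Learn It";
    }

    public static void main(String[] args) {
        System.out.println(willLearn(true, true));  // prints "Will Learn It"
    }
}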
An important concept of the classification tree is similar to what we saw in "Data mining with WEKA, Part 1: Introduction and regression": the idea of using a "training set" to produce the model. We take a dataset with known output values and use it to create our model. Afterward, whenever we have a new data point with an unknown output value, we run it through the model and it produces the expected output. This is no different from the regression model. However, classification goes a step further: the full dataset is commonly divided into two parts. About 60-80% of the data goes into the training set, which we use to create the model; the remaining data goes into a test set, which we use, immediately after creating the model, to test its accuracy.
Why is this extra step so important in this model? Because of a problem called overfitting: if we supply too much data to the creation of the model, the model will be created to fit that data perfectly, but only that data. Remember: we want to use the model to predict future unknowns; we do not want it to perfectly predict values we already know. This is why we create a test set. After creating the model, we check that its accuracy does not drop on the test set. That ensures the model will accurately predict unknown values in the future. You will see this in action using WEKA.
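As a minimal sketch of this split using the WEKA Java API (the 70/30 ratio is just one choice within the 60-80% range mentioned above, and the seed value is arbitrary):

import java.util.Random;
import weka.core.Instances;

// Shuffle a loaded dataset and hold out part of it as a test set.
public class TrainTestSplit {

    static Instances[] split(Instances data, double trainFraction) {
        Instances copy = new Instances(data);   // leave the original untouched
        copy.randomize(new Random(1));          // shuffle before splitting
        int trainSize = (int) Math.round(copy.numInstances() * trainFraction);
        Instances train = new Instances(copy, 0, trainSize);
        Instances test  = new Instances(copy, trainSize,
                                        copy.numInstances() - trainSize);
        return new Instances[] { train, test };
    }
}

Called as split(data, 0.7), this returns a 70% training set and a 30% test set.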
This leads to another important concept of classification trees: pruning. Pruning, as the name implies, means removing branches from the classification tree. Why would anyone want to remove information from the tree? Again, because of overfitting. As the dataset grows and the number of attributes grows, the trees we create become increasingly complex. In theory, a tree could have leaves = (rows * attributes). What good would that do? It would not help us at all in predicting future unknowns, because it would fit only our existing training data. What we want is a balance: a tree that is as simple as possible, with as few nodes and branches as possible, while at the same time being as accurate as possible. That is a trade-off, as we will soon see.
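J48, the tree classifier used later in this article, exposes pruning as options. A minimal sketch of the trade-off (the values shown are J48's defaults, not recommendations):

import weka.classifiers.trees.J48;

// Two J48 trees: one pruned (simpler, generalizes better) and one unpruned
// (fits the training data more closely, more likely to overfit).
public class PruningOptions {

    public static void main(String[] args) {
        J48 pruned = new J48();
        pruned.setUnpruned(false);          // prune the tree (the default)
        pruned.setConfidenceFactor(0.25f);  // smaller values prune more aggressively
        pruned.setMinNumObj(2);             // minimum instances per leaf

        J48 unpruned = new J48();
        unpruned.setUnpruned(true);         // keep the full tree
        // build each with buildClassifier(train), then compare accuracy on a test set
    }
}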
Before getting into WEKA, I want to make one final point about classification: false positives and false negatives. A false positive is a data instance where the model predicts it should be positive but the actual value is negative. Conversely, a false negative is a data instance where the model predicts it should be negative but the actual value is positive.
These errors indicate that something is wrong with our model: it is misclassifying some of the data. Some amount of incorrect classification is expected, and what percentage of error is acceptable is up to the model's creator. For example, if you are testing heart monitors for a hospital, you will obviously demand a very low error percentage. If, on the other hand, you are just mining some fictional data in an article about data mining, the acceptable error rate can be much higher. Going a step further, you also need to decide what ratio of false negatives to false positives is acceptable. The example that immediately comes to mind is a spam model: a false positive (a real e-mail marked as spam) is far more damaging than a false negative (a spam message not marked as spam). In a case like this, a ratio of 100 false negatives for every false positive might be judged the minimum acceptable.
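Counting these two kinds of error is straightforward once you have predictions and actual values side by side. A minimal, self-contained sketch in Java (the sample arrays are made up):

// Tally false positives and false negatives from parallel label arrays.
public class ErrorCounts {

    public static void main(String[] args) {
        boolean[] actual    = { true, false, true,  false, true };
        boolean[] predicted = { true, true,  false, false, true };

        int falsePositives = 0, falseNegatives = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] && !actual[i]) falsePositives++; // predicted positive, actually negative
            if (!predicted[i] && actual[i]) falseNegatives++; // predicted negative, actually positive
        }
        System.out.println("False positives: " + falsePositives); // 1
        System.out.println("False negatives: " + falseNegatives); // 1
    }
}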
That is enough background on the technical side of classification trees. Let's get some real data and take it through its paces with WEKA.
WEKA data set
The dataset for our classification example again revolves around our fictional BMW dealership. The dealership is kicking off a promotional campaign, trying to push a two-year extended warranty to its past customers. The dealership has run similar campaigns before and has collected 4,500 data points from those past sales. The attributes in the dataset are:
Income bracket [0=$0-$30k, 1=$31k-$40k, 2=$41k-$60k, 3=$61k-$75k, 4=$76k-$100k, 5=$101k-$150k, 6=$151k-$500k, 7=$501k+]
Year/month of first BMW purchase
Year/month of most recent BMW purchase
Whether the customer responded to the extended warranty offer in the past
Let's take a look at the Attribute-Relation File Format (ARFF) used in this example.
Listing 2. Classification WEKA data

@attribute IncomeBracket {0,1,2,3,4,5,6,7}
@attribute FirstPurchase numeric
@attribute LastPurchase numeric
@attribute responded {1,0}

@data
4,200210,200601,0
5,200301,200601,1
...

Classify within WEKA
Load the data file bmw-training.arff (see Download) into WEKA using the same steps we used earlier. Note that this file contains only 3,000 of the 4,500 records in the dealership's records. We need to hold back some of our records so that some data instances are used to create the model and some are used to test the model, ensuring we haven't overfitted it. After loading the data, the screen should look like Figure 1.
Figure 1. BMW classification data in WEKA
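If you prefer the WEKA Java API to the Explorer GUI, here is a minimal sketch of the same loading step (assuming bmw-training.arff sits in the working directory):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Load the ARFF file and mark the last attribute ("responded") as the class.
public class LoadBmwData {

    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bmw-training.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.numInstances() + " instances, "
                + data.numAttributes() + " attributes");
    }
}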
As we did in "Data mining with WEKA, Part 1: Introduction and regression," select the Classify tab, then the trees node, then the J48 leaf (I don't know why this is the official name, but go with it).
Figure 2. BMW classification algorithm
At this point, we are ready to create our model in WEKA. Make sure Use training set is selected so that we use the dataset we just loaded to create the model. Click Start and let WEKA run. The output of the model should resemble Listing 3.
Listing 3. Output of WEKA's classification model

Number of Leaves  :     28

Size of the tree :      43

Time taken to build model: 0.18 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        1774               59.1333 %
Incorrectly Classified Instances      1226               40.8667 %
Kappa statistic                          0.1807
Mean absolute error                      0.4773
Root mean squared error                  0.4885
Relative absolute error                 95.4768 %
Root relative squared error             97.7122 %
Total Number of Instances             3000

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area  Class
                 0.662     0.481      0.587     0.662      0.622      0.616     1
                 0.519     0.338      0.597     0.519      0.555      0.616     0
Weighted Avg.    0.591     0.411      0.592     0.591      0.589      0.616

=== Confusion Matrix ===

    a    b   <-- classified as
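For reference, here is a minimal sketch of the same run through the WEKA Java API: build a J48 tree on the training file and evaluate it on that same data, which is what "Use training set" does in the Explorer.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Build a J48 (C4.5) tree and evaluate it on the training set itself.
public class BmwClassify {

    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("bmw-training.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1); // class: "responded"

        J48 tree = new J48();            // default (pruned) options
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, train); // evaluation on the training set
        System.out.println(tree);                    // the tree itself
        System.out.println(eval.toSummaryString());  // accuracy figures, as in Listing 3
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}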