How to use WEKA for data mining

2025-07-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to use WEKA for data mining. The editor found it very practical and is sharing it here in the hope that you get something out of it.

Data mining is the talk of the technology world, because companies are generating millions of data points about their users and looking for ways to turn that information into increased revenue. Data mining is an umbrella term for many techniques that glean information from data and turn it into something meaningful. This article introduces you to open source data mining software and some of the most common techniques used to analyze data.

Brief introduction

What is data mining? You may find yourself asking this question from time to time, because the topic is getting more and more attention in the technical community. You may have heard that companies like Google and Yahoo! are generating billions of data points about all their users, and you can't help but wonder, "What do they need all this information for?" You may also be surprised to learn that Walmart is one of the most advanced companies at mining data and applying the results to its business. Almost every company in the world now uses data mining, and those that do not will soon find themselves at a great disadvantage.

So, how can you and your company keep up with the tide of data mining?

We hope to answer all your entry-level questions about data mining. We would also like to introduce you to the Waikato Environment for Knowledge Analysis (WEKA), free and open source software that you can use to mine your own data and turn what you know about your users, your customers, and your business into useful information for increasing revenue. You will find that mining data well is not as difficult as you might think.

In addition, this article introduces the first data mining technique, regression, which predicts the value of future data based on existing data. It is probably the easiest way to mine data, and you may even have done this kind of rudimentary data mining before in your favorite spreadsheet software (although WEKA can do much more complex calculations).

What is data mining?

Data mining, at its core, is the transformation of a large amount of data into meaningful patterns and rules. It can be divided into two types: directed and undirected. In directed data mining, you try to predict a specific data point, for example, the selling price of a house given information about other houses for sale in the neighborhood.

In undirected data mining, you try to create groups of data or find patterns within existing data, for example, creating the group "middle-class women". In fact, every American census is data mining: the government wants to collect data about each citizen and turn it into useful information.

Modern data mining began in the 1990s, when computing power and the cost of computing and storage reached a level at which companies could do the work in-house, without the help of outside computing.

In addition, the term data mining is all-encompassing: it refers to many techniques and processes for examining and transforming data. This series therefore only scratches the surface of what data mining can do. Experts in data mining often hold PhDs in statistics and have spent 10 to 30 years in the field, which may leave you with the impression that only large companies can afford data mining.

We want to clear up these misconceptions: data mining is not as simple as running a spreadsheet function over a series of data, but neither is it so difficult that nobody can do it on their own. It is a good example of the 80/20 paradigm, perhaps even pushed to a 90/10 paradigm. You can create a data mining model that is 90 percent effective with only 10 percent of the expertise of a so-called data mining expert. Closing the remaining 10 percent to create a perfect model would take 90 percent more time, perhaps as long as 20 years. So unless you are determined to make a career out of data mining, "good enough" will be fine. Looked at another way, the "good enough" you achieve with data mining will likely be better than whatever you are doing now.

The ultimate goal of data mining is to create a model that improves the way you interpret existing and future data. Because there are so many data mining techniques, the most important step in creating a good model is deciding which technique to use. That comes with practice, experience, and good guidance. After that, the model needs to be refined to make it more useful. After reading this series of articles, you should be able to choose the right technique for your own data set and then take the necessary steps to refine it. You will be able to create a good-enough model for your own data.

WEKA

Data mining is by no means exclusive to large companies, nor does it require expensive software. In fact, there is a piece of software that does almost everything the expensive packages do: WEKA (see Resources). WEKA was created at the University of Waikato (New Zealand) and was first implemented in its modern form in 1997. It is released under the GNU General Public License (GPL). The software is written in the Java™ language and includes a GUI for interacting with data files and producing visual results (such as tables and curves). It also has a general API, so you can embed WEKA in your own applications like any other library, for tasks such as automated server-side data mining.

Let's continue and install WEKA. Because it is based on Java, if you do not have JRE installed on your computer, download a version of WEKA that contains JRE.

Figure 1. Start screen of WEKA

When you start WEKA, the GUI chooser pops up and offers four ways to work with WEKA and your data. For the examples in this series, we choose only the Explorer option, which is more than sufficient for everything we need to do in these articles.

Figure 2. WEKA Explorer

Now that you're familiar with how to install and start WEKA, let's take a look at our first data mining technique: regression.

Regression

Regression is the simplest and easiest technique to use, but it is also probably the least powerful (funny how the two always go hand in hand). The model can be as simple as one input variable and one output variable (called a Scatter chart in Excel, or an XY Diagram in OpenOffice.org). Of course, it can be much more complex and include many input variables. In fact, all regression models fit the same general pattern: a number of independent variables that, taken together, produce a result, the dependent variable. The regression model is then used to predict the value of an unknown dependent variable, given the values of the independent variables.

Everyone has probably used or seen a regression model, or even built one in their head. The example that comes to mind immediately is pricing a house. The price of a house (the dependent variable) is the result of many independent variables: the size of the house, the size of the lot, whether the kitchen has granite, and whether the bathroom has been renovated. So if you have ever bought or sold a house, you have probably created a regression model to price it. You base the model on the prices of comparable houses in the neighborhood, then put the values of your own house into the model to produce an expected price.

Let's continue with this house-pricing regression model and create some real data to examine. There are some houses for sale in my neighborhood, and I am trying to find a reasonable price for my own house. I will also use the output of this model when declaring my property taxes.

Table 1. House values for the regression model

House size (square feet)   Lot size   Bedrooms   Granite   Renovated bathroom?   Selling price
3529                       9191       6          0         0                     $205,000
3247                       10061      5          1         1                     $224,900
4032                       10150      5          0         1                     $197,900
2397                       14156      4          1         0                     $189,900
2200                       9600       4          0         1                     $195,000
3536                       19994      6          1         1                     $325,000
2983                       9365       5          0         1                     $230,000
3198                       9669       5          1         1                     ????

The good news (or bad news, depending on your point of view) is that this brief introduction to regression has barely scratched the surface. There are entire university courses on regression models that will teach you more about them than you probably want to know. However, this introduction is enough to familiarize you with the concept and is sufficient for the WEKA trial in this article. If you are interested in the details of regression models and the statistics behind them, use your favorite search engine to look up the following terms: least squares, homoscedasticity, normal distribution, White tests, Lilliefors tests, R-squared, and p-values.
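To make the single-input case concrete, here is a minimal Python sketch of a least-squares fit with one independent variable, using only the house sizes and selling prices from Table 1. The function name fit_line is hypothetical, and this is only an illustration of the idea, not a description of how WEKA computes its models.

```python
from statistics import mean

# House size (sq ft) and selling price for the seven priced houses in Table 1.
house_size = [3529, 3247, 4032, 2397, 2200, 3536, 2983]
selling_price = [205000, 224900, 197900, 189900, 195000, 325000, 230000]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # slope = covariance(x, y) / variance(x); the shared 1/n factor cancels out.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(house_size, selling_price)
# Estimate for a 3,198 sq ft house, using house size alone.
estimate = slope * 3198 + intercept
print(f"price = {slope:.4f} * houseSize + {intercept:.2f}")
print(f"estimate for 3198 sq ft: {estimate:.0f}")
```

With only one input, the fit ignores lot size, bedrooms, and the other columns, which is why the rest of this article uses a model with several independent variables.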

Build a dataset for WEKA

To load data into WEKA, we must put it in a format that WEKA can understand. WEKA's preferred format for loading data is the Attribute-Relation File Format (ARFF), in which you define the type of data being loaded and then supply the data itself. In the file, we define each column and what each column contains. For regression models, only NUMERIC or DATE columns are allowed. Finally, each row of data is supplied in comma-separated form. The ARFF file we use with WEKA is shown below. Note that my house is not included in the data rows: because we are building the model and the price of my house is not yet known, we cannot enter it.
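The ARFF layout described above can also be generated programmatically. The following Python sketch builds the file contents from the rows of Table 1; the helper name to_arff is hypothetical, and it assumes every column is NUMERIC, as in this example.

```python
# Column names mirror the attributes used in this article's ARFF file.
attributes = ["houseSize", "lotSize", "bedrooms", "granite", "bathroom", "sellingPrice"]

# The seven priced houses from Table 1 (my own house is left out on purpose).
rows = [
    (3529, 9191, 6, 0, 0, 205000),
    (3247, 10061, 5, 1, 1, 224900),
    (4032, 10150, 5, 0, 1, 197900),
    (2397, 14156, 4, 1, 0, 189900),
    (2200, 9600, 4, 0, 1, 195000),
    (3536, 19994, 6, 1, 1, 325000),
    (2983, 9365, 5, 0, 1, 230000),
]


def to_arff(relation, attributes, rows):
    """Build ARFF text: a relation name, NUMERIC attribute declarations, then data rows."""
    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE {name} NUMERIC" for name in attributes]
    lines += ["", "@DATA"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


arff_text = to_arff("house", attributes, rows)
print(arff_text)
```

Saving the printed text to a file with an .arff extension gives you something you can open from the Explorer.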

Listing 1. WEKA file format

@RELATION house

@ATTRIBUTE houseSize NUMERIC
@ATTRIBUTE lotSize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingPrice NUMERIC

@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000

Load data into WEKA

Now that the data file exists, we can start creating our regression model. Start WEKA and select Explorer. The Explorer screen appears with the Preprocess tab selected. Click the Open File button and select the ARFF file you created in the previous section. After you select the file, the WEKA Explorer should look similar to the screenshot in Figure 3.

Figure 3. WEKA after housing data loading

In this view, WEKA lets you review the data you are working with. The left side of the Explorer window lists all the columns of your data (Attributes) and the number of rows supplied (Instances). If you select a column, the right side of the Explorer window shows information about that column in the data set. For example, selecting the houseSize column on the left (it should be selected by default) displays statistics about that column on the right. They show that the maximum value in the data set is 4032 square feet and the minimum is 2200 square feet. The average size is 3132 square feet, and the standard deviation is 655 square feet (standard deviation is a statistical measure of spread). There is also a visual way to examine the data: click the Visualize All button. Because this data set has so few rows, the visualization is not as powerful as it would be with more data points (for example, hundreds).
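As a cross-check of the column statistics described above, this short Python sketch computes the same summary for the houseSize column with the standard library. It assumes the Explorer reports the sample standard deviation (n - 1 denominator), which is what statistics.stdev computes and which agrees with the figure of 655 quoted above.

```python
from statistics import mean, stdev

# houseSize column from the seven rows in the ARFF file.
house_size = [3529, 3247, 4032, 2397, 2200, 3536, 2983]

summary = {
    "minimum": min(house_size),
    "maximum": max(house_size),
    "mean": mean(house_size),
    "std_dev": stdev(house_size),  # sample standard deviation (n - 1 denominator)
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```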

Enough about the data. Let's create a model right away and get a price for my house.

Create a regression model with WEKA

To create this model, click the Classify tab. The first step is to select the model we want to create so that WEKA knows what to do with the data and how to create an appropriate model:

Click the Choose button, and then expand the functions branch.

Select the LinearRegression leaf.

This tells WEKA that we want to build a regression model. As you can see, there are many other options, which means there are many other models that can be created. A lot! This is another sign that this article only scratches the surface of the topic. One thing is worth noting: the same branch contains another option called SimpleLinearRegression. Do not select it, because simple regression looks at only one independent variable, and we have five (six columns in all, counting the selling price). After you select the correct model, the WEKA Explorer should look similar to Figure 4.

Figure 4. Linear regression model in WEKA

Can I do the same thing with a spreadsheet?

Short answer: no. Long answer: yes. Most popular spreadsheet programs cannot easily do what we do with WEKA, that is, define a linear model with multiple independent variables. However, you can easily build a simple linear regression model with one independent variable. If you are brave enough, you can even do a multivariable regression, but it will be quite difficult and definitely not as easy as using WEKA. A sample video for Microsoft® Excel® is available in the Resources section of this article.

Now that we have selected the model we want, we must tell WEKA where to find the data it should use to build the model. Although it may seem obvious that we want to use the data in the ARFF file, there are actually several test options to choose from, some far more advanced than the one we will use. The other three are: Supplied test set, where you supply a separate data set for evaluating the model; Cross-validation, where WEKA builds models on subsets of the supplied data and averages the results; and Percentage split, where WEKA builds the model on a given percentage of the supplied data and holds out the rest. These options are useful with different models, as we will see in later articles in this series. For regression, we can simply choose Use training set. This tells WEKA to build the model from the data we supplied in the ARFF file.
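To illustrate what "subsets of the supplied data" means for cross-validation, here is a minimal Python sketch that splits row indices into k folds, each used once for testing while the remaining rows are used for training. The function name k_fold_indices is hypothetical, and the sketch assumes contiguous, unshuffled folds; a real tool would typically shuffle the rows first, so this is not a reproduction of WEKA's own procedure.

```python
def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists for k roughly equal, contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, test
        start += size


# Seven rows, as in the house data set; k = 3 is an arbitrary choice for illustration.
for train, test in k_fold_indices(7, 3):
    print("train:", train, "test:", test)
```

Each row lands in exactly one test fold, so every row is used for evaluation once and for training k - 1 times.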

The final step in creating the model is to choose the dependent variable (the column we want to predict). In this case it is the selling price of the house, because that is what we want to predict. Directly below the test options is a combo box for choosing the dependent variable. The sellingPrice column should be selected by default. If it is not, select it.

When we are ready to create the model, click Start. Figure 5 shows the output.

Figure 5. Housing price regression model in WEKA

Analyze the regression model

WEKA doesn't mess around. It puts the regression model right in the output, as shown in Listing 2.

Listing 2. Regression output

sellingPrice = (-26.6882 * houseSize) +
               (7.0551 * lotSize) +
               (43166.0767 * bedrooms) +
               (42292.0901 * bathroom)
               - 21661.1208

Listing 3 shows the result of plugging the values for my house into the model.

Listing 3. Using the regression model

sellingPrice = (-26.6882 * 3198) +
               (7.0551 * 9669) +
               (43166.0767 * 5) +
               (42292.0901 * 1)
               - 21661.1208

sellingPrice = 219328
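The same arithmetic can be scripted, which is handy if you want to price more than one house. This is a minimal Python sketch that applies the coefficients from Listing 2; the names predict_price and my_house are hypothetical.

```python
# Coefficients copied from the regression output in Listing 2.
COEFFICIENTS = {
    "houseSize": -26.6882,
    "lotSize": 7.0551,
    "bedrooms": 43166.0767,
    "bathroom": 42292.0901,
}
INTERCEPT = -21661.1208


def predict_price(house):
    """Apply the linear model: intercept plus the sum of coefficient * value."""
    return INTERCEPT + sum(coef * house[name] for name, coef in COEFFICIENTS.items())


# My house: 3198 sq ft, lot size 9669, 5 bedrooms, renovated bathroom.
my_house = {"houseSize": 3198, "lotSize": 9669, "bedrooms": 5, "bathroom": 1}
print(round(predict_price(my_house)))  # 219328
```

The granite column does not appear because the model in Listing 2 has no term for it.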

However, looking back at the beginning of this article, we know that data mining is not just about outputting a number: it's about identifying patterns and rules. It is not strictly used to generate an absolute value, but to create a model that allows you to detect patterns, predict output, and draw conclusions from that data. Let's take a closer look at the patterns and conclusions our model tells us in addition to house prices:

Granite doesn't matter: WEKA uses only the columns that statistically contribute to the accuracy of the model (measured by R-squared, which is beyond the scope of this article). It discards and ignores columns that do not help create a good model. So this regression model tells us that granite in the kitchen does not affect the value of the house.

The bathroom does matter: because we use a simple value of 0 or 1 for the bathroom, we can read the effect of a renovated bathroom on the value of the house directly from its coefficient in the regression model. The model tells us that it adds $42,292 to the value of the house.

A bigger house lowers the price: WEKA is telling us that the bigger the house, the lower the selling price. This can be seen from the negative coefficient in front of the houseSize variable. The model tells us that every additional square foot of house reduces the price by about $26. That doesn't make any sense. This is America! Of course a bigger house is better, especially in Texas, where I live. So how can we explain this? It is a good example of garbage in, garbage out. House size is not truly an independent variable: it is related to the bedrooms variable, because a larger house usually has more bedrooms. So our model is not perfect. But we can fix this. Remember that on the Preprocess tab you can remove columns from the data set. As an exercise, remove the houseSize column and create another model. How does that affect the price of my house? How does the new model make more practical sense? (My revised house price: $217,894.)
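The claim that house size and bedroom count move together can be checked directly. This Python sketch computes the Pearson correlation coefficient between the two columns for the seven priced houses; the function name pearson is hypothetical, and correlation is only one simple way to spot related inputs.

```python
from math import sqrt

# houseSize and bedrooms columns for the seven priced houses.
house_size = [3529, 3247, 4032, 2397, 2200, 3536, 2983]
bedrooms = [6, 5, 5, 4, 4, 6, 5]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson(house_size, bedrooms)
print(f"correlation(houseSize, bedrooms) = {r:.2f}")
```

For this data the coefficient comes out at about 0.77, a fairly strong positive relationship, which is consistent with the explanation that the two columns are not independent of each other.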

A note for statisticians

This model violates several requirements of a conventional linear regression model, because the columns are not fully independent of one another and there are not enough rows of data to produce a valid model. Since the main purpose of this article is to introduce WEKA as a data mining tool, we have greatly simplified the sample data.

To take this simple example to the next level, let's look at a data file that the WEKA Web site supplies as a regression example. In theory, it is much more complicated than our simple example of seven houses. The sample data file is meant for creating a regression model that predicts a car's fuel economy (miles per gallon, MPG) from several characteristics (keep in mind that the data dates from 1970 to 1982). The model includes the following attributes of each car: cylinders, displacement, horsepower, weight, acceleration, model year, origin, and manufacturer. The data set has 398 rows, enough to meet a variety of statistical requirements that our house price model could not. In theory, this is an extremely complex regression model, and WEKA might need a lot of time to build a model from so much data (but I suspect you can guess that WEKA will handle it just fine).

To produce a regression model from this data set, follow exactly the same steps you used for the house data, so I won't repeat them here. Go ahead and create the regression model. It should produce the output shown in Listing 4.

Listing 4. MPG data regression model

class (aka MPG) =

     -2.2744 * cylinders=6,3,5,4 +
     -4.4421 * cylinders=3,5,4 +
      6.74   * cylinders=5,4 +
      0.012  * displacement +
     -0.0359 * horsepower +
     -0.0056 * weight +
      1.6184 * model=75,71,76,74,77,78,79,81,82,80 +
      1.8307 * model=77,78,79,81,82,80 +
      1.8958 * model=79,81,82,80 +
      1.7754 * model=81,82,80 +
      1.167  * model=82,80 +
      1.2522 * model=80 +
      2.1363 * origin=2,3 +
     37.9165

When you generate the model yourself, you will see that WEKA takes less than a second to build it. So computation is not a problem, even for a powerful regression model with a lot of data. The model may look much more complex than the one for the house data, but it isn't. For example, the first line of the regression model, -2.2744 * cylinders=6,3,5,4, means that if the car has six cylinders you place a 1 in this term, and if it has eight cylinders you place a 0. Let's take one sample row from the data set (row 10) and plug its values into the regression model to see whether the model's output is close to the value given in the data set.

Listing 5. Example MPG data

data = 8,390,190,3850,8.5,70,1,15

class (aka MPG) =

     -2.2744 * 0 +
     -4.4421 * 0 +
      6.74   * 0 +
      0.012  * 390 +
     -0.0359 * 190 +
     -0.0056 * 3850 +
      1.6184 * 0 +
      1.8307 * 0 +
      1.8958 * 0 +
      1.7754 * 0 +
      1.167  * 0 +
      1.2522 * 0 +
      2.1363 * 0 +
     37.9165

Expected Value = 15 mpg
Regression Model Output = 14.2 mpg
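The 0-or-1 substitution for terms such as cylinders=6,3,5,4 can be expressed as set membership. The following Python sketch encodes the model from Listing 4 that way and evaluates it for the sample row; the function and constant names are hypothetical, and acceleration is left out because the model has no term for it.

```python
# Each categorical term is (coefficient, set of values). It contributes its
# coefficient when the car's value is in the set (a 1), and nothing otherwise (a 0).
INTERCEPT = 37.9165

CYLINDER_TERMS = [(-2.2744, {6, 3, 5, 4}), (-4.4421, {3, 5, 4}), (6.74, {5, 4})]
MODEL_YEAR_TERMS = [
    (1.6184, {75, 71, 76, 74, 77, 78, 79, 81, 82, 80}),
    (1.8307, {77, 78, 79, 81, 82, 80}),
    (1.8958, {79, 81, 82, 80}),
    (1.7754, {81, 82, 80}),
    (1.167, {82, 80}),
    (1.2522, {80}),
]
ORIGIN_TERMS = [(2.1363, {2, 3})]


def predict_mpg(cylinders, displacement, horsepower, weight, model_year, origin):
    """Evaluate the regression model from Listing 4 for one car."""
    mpg = INTERCEPT
    # Numeric terms are plain coefficient * value products.
    mpg += 0.012 * displacement - 0.0359 * horsepower - 0.0056 * weight
    categorical = (
        (cylinders, CYLINDER_TERMS),
        (model_year, MODEL_YEAR_TERMS),
        (origin, ORIGIN_TERMS),
    )
    for value, terms in categorical:
        mpg += sum(coef for coef, members in terms if value in members)
    return mpg


# Sample row: 8 cylinders, 390 displacement, 190 horsepower, weight 3850,
# model year 70, origin 1.
print(round(predict_mpg(8, 390, 190, 3850, 70, 1), 1))  # 14.2
```

For this row every categorical term evaluates to 0, so only the three numeric terms and the constant contribute to the result.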

Therefore, when we tested the model with randomly selected test data, the model performed very well, and for a car with an actual value of 15 MPG, our prediction was 14.2 MPG.

This article set out to answer the question "What is data mining?" by giving you some background on the topic and on the goals of the field. Data mining turns a large amount of unusable information (usually in the form of scattered data) into useful information by creating models and rules. Your goal is to use those models and rules to predict future behavior in order to improve your business, or to explain things you could not explain in any other way. The models can confirm ideas you already have, and may even let you discover things in the data that you had not noticed before. Here is an interesting example of data mining (I don't know how many more there are). In the United States, Walmart moves beer next to the diapers on weekends, because its data mining results showed that men usually buy diapers on weekends, and they also like to drink beer on weekends.

This article also introduced you to WEKA, a free and open source software program. There are certainly more sophisticated commercial data mining products on the market, but for someone just starting out in data mining, this open source solution is very useful. Remember, you will never be an expert in data mining unless you plan to study it for 20 years. WEKA gets you through the door of data mining and also provides a solid solution to the first problems you run into. If you have not had much contact with data mining before, it will meet all your needs.

Finally, this article discussed the first data mining model, the regression model (specifically, the multivariable linear regression model), and showed how to use it in WEKA. The regression model is easy to use and can be applied to many data sets. You may find it the most useful of all the models discussed in this series of articles. However, data mining is not limited to simple regression. With different data sets and different output requirements, you will find that other models can be a better solution.

