Several questions you must understand when getting started with Data Mining


2025-10-29 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

How to do Data Mining well

NO.1 What's the difference between Data Mining and statistical analysis?

Drawing a hard line between Data Mining and statistics does not make much sense. The methods usually counted as Data Mining techniques, such as CART, CHAID or fuzzy computing, were themselves developed by statisticians from statistical theory. A large share of Data Mining also rests on the multivariate analysis taught in advanced statistics. Why, then, has Data Mining attracted such wide attention across fields? The main reason is that, compared with traditional statistical analysis, Data Mining has the following characteristics:

1. It handles large amounts of real-world data well, and its tools can be used without a highly specialized statistical background.

2. The trend in data analysis is to pull the required data out of large databases and analyze it with dedicated software, and Data Mining tools fit these enterprise needs better.

3. From a purely theoretical standpoint, Data Mining and statistical analysis differ in application: Data Mining is meant to serve enterprise end users, not to give statisticians something to test.

NO.2 What is the relationship between Data Warehousing and Data Mining?

If a data warehouse (Data Warehousing) is compared to a mine, Data Mining is the work of going down into the pit to dig. Data Mining is neither a magic trick that conjures something from nothing nor alchemy that turns stone into gold. Without sufficiently rich and complete data, it is hard to expect Data Mining to dig out anything meaningful.

To turn large amounts of data into useful information, the data must first be collected efficiently. As technology has advanced, the full-featured database system has become the best tool for data collection. Put simply, a data warehouse collects useful data from other systems and stores it in one integrated storage area. It is in effect a processed, integrated, high-capacity relational database that holds the data a decision support system (Decision Support System) needs for decision support or data analysis. From an information technology perspective, the goal of a data warehouse is to get the right data to the right person in the organization at the right time.

Many people confuse Data Warehousing with Data Mining and cannot tell them apart. In fact, the data warehouse is a newer topic in database technology: computer systems help us operate, calculate and think, so the way work is done changes, and the way decisions are made changes with it.

The data warehouse itself is a very large database. It stores data integrated from the organization's operational databases, especially data captured by transaction processing systems, known as OLTP (Online Transaction Processing). The integrated data is placed in the data warehouse, and the company's decision makers use it to make decisions. Transforming and integrating the data is the biggest challenge in building a data warehouse, because turning operational data into useful strategic information is the whole point of the warehouse. In short, a data warehouse should hold integrated data, detailed and summarized data, historical data, and data that describes the data (metadata).

Mining useful information and knowledge out of the warehouse is the main purpose of building it and of applying Data Mining, but the two differ in nature and in process. The data warehouse should be built first so that Data Mining can run efficiently, because the data in the warehouse is meant to be clean (free of erroneous records), complete and integrated. The relationship can therefore be summed up this way: Data Mining is the process and technology for finding useful information in a huge data warehouse.

NO.3 Can OLAP replace Data Mining?

OLAP (Online Analytical Processing) refers to online analysis and processing programs connected to a database. Some people say, "I already have OLAP tools, so I don't need Data Mining." In fact, the two are quite different. The main difference is that Data Mining is used to generate hypotheses, while OLAP is used to verify them. Put simply, OLAP is driven by the user, who first forms a hypothesis and then uses OLAP to check whether it holds; Data Mining helps the user come up with hypotheses in the first place. With OLAP or other query tools, users do the exploring themselves, whereas with Data Mining the tools help with the exploration.

For example, when planning shelf layout for a supermarket, a market analyst might first assume that baby diapers and baby milk powder are often bought together, and then use OLAP tools to check whether the assumption is true and how strong the evidence is. Data Mining works differently. Working through the huge volume of checkout data, the people running Data Mining do not need to assume or anticipate any result. Mining techniques can uncover latent rules in the data, so they may turn up the unexpected finding that diapers and beer are often bought at the same time, which OLAP cannot do. OLAP can only confirm relationships through manual queries and visual reports, while Data Mining automatically finds patterns and relationships that no one would even suspect, beyond the limits of our experience, education and imagination. OLAP and Data Mining complement each other, but OLAP cannot replace this capability.
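The contrast between verifying a hypothesis and generating one can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the basket data and the function names `verify_pair` and `discover_pairs` are hypothetical, and real OLAP and mining tools work on far larger data with more refined measures.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout data: each set is one customer's basket.
baskets = [
    {"diapers", "beer", "milk powder"},
    {"diapers", "beer"},
    {"diapers", "milk powder"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
]

def verify_pair(baskets, a, b):
    """OLAP-style check: the analyst names the pair, the query counts it."""
    return sum(1 for basket in baskets if a in basket and b in basket)

def discover_pairs(baskets, top_n=3):
    """Mining-style search: count every pair and rank them, with no prior guess."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("beer", "diapers") and ("diapers", "beer") the same key.
        counts.update(combinations(sorted(basket), 2))
    return counts.most_common(top_n)

print(verify_pair(baskets, "diapers", "milk powder"))
print(discover_pairs(baskets, top_n=1))
```

In this toy data the analyst's guessed pair does occur, but the exhaustive count ranks a pair nobody asked about above it, which is the kind of result the diapers and beer story describes.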

NO.4 What steps does a complete Data Mining project involve?

The following step-by-step outline of Data Mining is provided for reference:

1. Understand the business and the data

2. Acquire the relevant technology and knowledge

3. Integrate and query the data

4. Remove erroneous, inconsistent and incomplete data

5. Select a sample from the data and test it in advance

6. Build the data model

7. Carry out the actual Data Mining analysis

8. Test and verify

9. Identify the hypotheses and explain them

10. Apply the results continuously in business processes

As the steps above show, Data Mining involves a great deal of preparation and planning. Many experts believe that 80% of the time and effort in a Data Mining project goes into the data preparation stage, which includes data cleansing, format conversion and even table joins. Data Mining is thus only one step in the process of information mining, and a lot of work has to be done before that step.
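To give a feel for the preparation work in step 4, here is a minimal sketch of data cleansing and format conversion in plain Python. The records, the field names and the `clean` function are hypothetical, chosen only for illustration; a real project would apply rules agreed with the business side.

```python
# Hypothetical raw customer records; field names are illustrative only.
raw_records = [
    {"id": 1, "age": "34", "city": "Taipei"},
    {"id": 2, "age": "", "city": "Tainan"},         # incomplete: missing age
    {"id": 3, "age": "abc", "city": "Taichung"},    # error: age is not a number
    {"id": 1, "age": "34", "city": "Taipei"},       # duplicate of id 1
    {"id": 4, "age": "51", "city": " kaohsiung "},  # inconsistent formatting
]

def clean(records):
    """Drop incomplete, invalid and duplicate rows; normalize formats."""
    seen_ids = set()
    cleaned = []
    for row in records:
        age_text = row["age"].strip()
        if not age_text.isdigit():   # covers both missing and non-numeric ages
            continue
        if row["id"] in seen_ids:    # keep only the first copy of each id
            continue
        seen_ids.add(row["id"])
        cleaned.append({
            "id": row["id"],
            "age": int(age_text),                 # format conversion: text to int
            "city": row["city"].strip().title(),  # consistent capitalization
        })
    return cleaned

cleaned = clean(raw_records)
print(cleaned)
```

Even in this tiny example, three of the five rows are discarded before any analysis starts, which hints at why preparation takes so much of the effort.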

NO.5 What theories and technologies does Data Mining use?

Data Mining has been a hot topic in database applications in recent years. It may seem magical and fashionable, but it is nothing new: the United States was already applying its typical uses, such as predictive modeling, data segmentation, link analysis (Link Analysis) and deviation detection (Deviation Detection), to the census and to military work before World War II.

With the rapid development of information technology, many new computing tools have appeared, such as relational databases, fuzzy computing theory, genetic algorithms and neural networks, and they have turned digging treasure out of data into a systematic, executable procedure.

Broadly, the theory and technology of Data Mining fall into two branches: traditional techniques and improved techniques. Traditional techniques are represented by statistical analysis; the sequence statistics, probability theory, regression analysis and categorical data analysis found in statistics all count as traditional Data Mining techniques. Because most Data Mining targets have many variables and many samples, the multivariate methods of advanced statistics are used especially often: factor analysis (Factor Analysis) to reduce the number of variables, discriminant analysis (Discriminant Analysis) to classify, and cluster analysis (Cluster Analysis) to separate groups.

Among improved techniques, decision trees (Decision Trees), artificial neural networks (Neural Network) and rule induction (Rules Induction) are widely used. A decision tree is a predictive model that shows, in a tree-shaped diagram, how the data is affected by each variable. Classification rules are built according to each variable's effect on the target variable. Decision trees are generally used to analyze customer data, for example to find which combinations of variables separate mailing recipients who reply from those who do not. Commonly used methods are CART (Classification and Regression Trees) and CHAID (Chi-Square Automatic Interaction Detector).
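How a decision tree picks a split can be illustrated with Gini impurity, the criterion commonly associated with CART for classification. The sketch below searches one numeric variable for the threshold that best separates repliers from non-repliers. The data and the function names `gini` and `best_split` are assumptions for illustration; a full tree would repeat this search across all variables and recurse into each branch.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try each midpoint between sorted distinct values.

    Returns (threshold, weighted impurity) for the best split, or
    (None, impurity of the whole set) if no split improves on it.
    """
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical mailing data: past purchases per customer and whether they replied.
purchases = [1, 2, 2, 3, 6, 7, 8, 9]
replied = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]

threshold, impurity = best_split(purchases, replied)
print(threshold, impurity)
```

The resulting rule reads directly as "if past purchases exceed the threshold, the customer is likely to reply", which is why tree models are easy to explain to business users.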

An artificial neural network is a data analysis model that imitates the way the human brain is structured to think. It learns from input variables and values and keeps adjusting its parameters according to what it has learned, in order to build up patterns in the data. A neural network is nonlinear by design. Compared with traditional regression analysis, its advantage is that the form of the model does not have to be fixed in advance, and interactions between variables can be detected automatically. Its disadvantage is that the analysis is a black box: it usually cannot be presented as a readable model, and the weights and transformations at each stage are unclear. Neural networks are therefore mostly used when the data is highly nonlinear and the variables interact to a considerable degree.

Rule induction is the format most commonly used in knowledge discovery. It is a technique that subdivides the data with a series of "if ... then ..." (If / Then) logical rules. In practice, the biggest problem is how to decide which rules are valid. Items that occur too rarely in the data usually have to be eliminated to avoid meaningless rules.

NO.6 What are the main functions of Data Mining?

The practical functions of Data Mining fall into three categories with six items: Classification and Clustering belong to the classification and grouping category; Regression and Time-Series belong to the estimation and prediction category; Association and Sequence belong to the sequence rule category.

Classification computes a result from the values of certain variables and assigns each record to a class accordingly. The result ends up as one of a few discrete values, for example splitting a data set into 'may respond' and 'may not respond'. Classification is often used for the mailing-list screening problem mentioned earlier. We study the characteristics of data that has already been classified from historical experience, and then use those characteristics to predict the class of unclassified or new data. The classified data may come from existing customer records, or from a sample drawn from a complete database and tested in actual operation. For example, a Classification Model can be built from a sample of a large mailing database and then used to classify the rest of the database or new data.
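The idea of learning characteristics from classified data and applying them to new data can be sketched with a nearest-centroid classifier. This is a deliberately simple stand-in, not the method the article prescribes; the training sample, the feature choice and the function names `fit_centroids` and `classify` are hypothetical.

```python
from math import dist  # available in Python 3.8 and later

# Hypothetical labeled sample: (age, purchases last year) and the
# customer's response to a past mailing.
training = [
    ((25, 1), "may not respond"),
    ((30, 2), "may not respond"),
    ((45, 9), "may respond"),
    ((50, 11), "may respond"),
]

def fit_centroids(samples):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(s / counts[label] for s in acc)
            for label, acc in sums.items()}

def classify(centroids, features):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = fit_centroids(training)
print(classify(centroids, (48, 8)))
print(classify(centroids, (28, 3)))
```

The features here are left unscaled for brevity; with variables on very different scales, standardizing them first would usually be advisable.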

Clustering groups the data. Its purpose is to find the differences between groups as well as the similarities among members of the same group. Clustering differs from Classification in that the groups, and the basis for forming them, are not known before the analysis. Domain expertise is therefore needed to interpret what the resulting clusters mean.
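A minimal sketch of clustering, assuming k-means as the technique (the article does not name a specific algorithm) and using made-up monthly spending figures. No labels are supplied; the groups emerge from the data, and it is left to the analyst to decide what they mean.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on numbers.

    Each round assigns every point to its nearest center, then moves
    each center to the mean of its group.
    """
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its old center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical monthly spending per customer.
spend = [12, 15, 14, 80, 85, 90]
centers, groups = kmeans_1d(spend, centers=[0, 100])
print(centers)
print(groups)
```

The algorithm only reports that two groups exist; calling them "occasional buyers" and "heavy buyers" is the interpretation step that needs domain knowledge.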

Regression uses a series of existing values to predict the likely value of a continuous variable. In a broader sense, Logistic Regression can also be used to predict categorical variables. With the widespread use of modern techniques such as neural networks and decision trees, estimation and prediction models are no longer limited to traditional linear forms, which greatly increases both the choice of tools and the range of applications.
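The traditional linear case can be written out directly. The sketch below fits a straight line by ordinary least squares and uses it to predict a new value; the advertising and sales figures and the name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: advertising spend and resulting sales, same units.
ad_spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 5 + intercept  # estimate sales at a spend of 5
print(slope, intercept, predicted)
```

The closed-form solution works only because the model is linear in one variable; the nonlinear models mentioned above need iterative fitting instead.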

Time-Series Forecasting is similar to Regression, except that it uses existing values to predict future values. The biggest difference is that the values analyzed in a time series are tied to time. Time-Series Forecasting tools can handle time-specific characteristics such as periodicity, stratification, seasonality and other special factors (for example, how past values relate to future ones).

Association finds out what tends to appear together in an event or record. For example, if A is one of the items in an event, what is the chance that B also appears in that event? (For example, if a customer buys ham and orange juice, there is an 85% chance that the customer also buys milk.)

Sequence Discovery is closely related to Association, except that the related events are separated in time. (For example, if stock A rises 12 percent on a given day and the weighted market index falls that day, there is a 68 percent chance that stock B rises within two days.)
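A percentage like the one in the ham, orange juice and milk example is usually called the confidence of an association rule, and it is computed from how often item sets occur, their support. The sketch below shows the arithmetic on made-up baskets; the data and the function names are assumptions, and the numbers will not match the 85% in the text.

```python
def support(baskets, items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimated chance of `consequent` given that `antecedent` is in the basket.

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical baskets.
baskets = [
    {"ham", "orange juice", "milk"},
    {"ham", "orange juice", "milk"},
    {"ham", "orange juice", "milk"},
    {"ham", "orange juice"},
    {"bread", "milk"},
]

rule_support = support(baskets, {"ham", "orange juice", "milk"})
rule_confidence = confidence(baskets, {"ham", "orange juice"}, {"milk"})
print(rule_support, rule_confidence)
```

Support also gives a natural way to discard rules built on too few occurrences: any rule whose support falls below a chosen minimum is simply dropped.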

NO.7 How is Data Mining applied in various fields?

Data Mining is widely used across fields. Any industry with a data warehouse or database worth analyzing, and a need to analyze it, can use Mining tools for purposeful mining and analysis. Common application cases arise in retail, direct marketing, manufacturing, finance and insurance, communications, and medical services.

Typical retail examples include discovering customers' buying habits from sales data, finding the product combinations customers prefer from transaction records, identifying the characteristics of lost customers, and timing the launch of new products. In direct marketing, the emphasis on target segments and database marketing has become more powerful since Data Mining was introduced. For example, Data Mining is used to analyze the buying behavior and transaction records of customer groups, combine them with basic customer data, and grade customers by their value to the brand so that marketing can be differentiated. In manufacturing, Data Mining is mostly used for quality control: finding the factors in the manufacturing process that most affect product quality, in order to make operations more efficient.

Recently, telephone companies, credit card companies, insurance companies and stock dealers have all shown interest in fraud detection (Fraud Detection). These industries lose considerable sums to fraud every year. Data Mining can find shared characteristics in the records of customers with poor credit and predict likely fraudulent transactions, reducing losses. The financial industry can use Data Mining to analyze market trends and predict the performance and stock price movements of individual companies. Another distinctive use of Data Mining is in medicine, where it is used to predict the effectiveness of surgery, medication, diagnosis or process control.

NO.8 What's the difference between Web Mining and Data Mining?

If the Web is regarded as a new channel for CRM, then Web Mining can simply be regarded as a general term for Data Mining applied to network data.

How should the success of a website be measured? Which content, offers and advertisements are most popular? Who are the main visitors, and what attracted them? How can the factors that make a website run more efficiently be found in the mountain of data collected online? All of these fall within Web Mining analysis. Web Mining is not limited to the familiar log file analysis that counts page views and visitors. For retail, financial services, communication services, government agencies, medical consulting, distance education and so on, as long as the database behind the network is large and complete enough, Web Mining can do every analysis that can be done off-line, and can even integrate off-line and on-line databases. With the convenience and reach of the Internet, and the traceability and high interactivity of online behavior, the concept of one-to-one marketing is most likely to be fully realized in the online world.

On the whole, Web Mining has the following characteristics:

1. Data collection is easy and goes unnoticed. Every visit leaves a trace: once visitors enter the site, all browsing behavior and history can be recorded immediately.

2. Interactive, personalized service is the ultimate goal. Besides presenting specially designed pages to different visitors, the site can also offer different services to different visitors.

3. Data from external sources can be integrated to make the analysis deeper and broader. Resources obtained directly from the network, such as log files, cookies, member form data, online survey data and online transaction data, can be combined with resources accumulated over a longer time and wider range in the physical world, making the results more accurate and in-depth.
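The basic log file analysis mentioned above, counting page views and visitors, can be sketched as follows. The log format here is a simplified, hypothetical one ("visitor_id timestamp path"); real server logs carry more fields and need a proper parser.

```python
from collections import Counter

# Hypothetical, simplified log lines: "visitor_id timestamp path".
log_lines = [
    "v1 2024-01-01T10:00 /home",
    "v1 2024-01-01T10:01 /promo",
    "v2 2024-01-01T10:05 /home",
    "v3 2024-01-01T10:07 /promo",
    "v2 2024-01-01T10:09 /promo",
]

def summarize(lines):
    """Return (page views per path, unique visitors per path)."""
    page_views = Counter()
    visitors_per_page = {}
    for line in lines:
        visitor, _timestamp, path = line.split()
        page_views[path] += 1
        visitors_per_page.setdefault(path, set()).add(visitor)
    unique = {path: len(v) for path, v in visitors_per_page.items()}
    return page_views, unique

page_views, unique_visitors = summarize(log_lines)
print(page_views.most_common())
print(unique_visitors)
```

Counts like these are only the starting point; joining the visitor identifiers to member or transaction data is what moves the analysis from log statistics toward Web Mining.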

Using Data Mining technology to analyze visitor data in greater depth, and relying on accurate predictive models to deliver truly intelligent, personalized network services, is the direction in which Web Mining is heading.

NO.9 What is the role of Data Mining in CRM?

CRM (Customer Relationship Management) has recently become a topic of heated discussion and great interest. Driven by the rise of direct marketing and the rapid growth of the Internet, failing to keep up with CRM feels like failing to keep up with the times. In fact, CRM is not a new invention: CO (Customer Ownership), which Ogilvy's direct marketing practice has promoted for more than ten years, is what is now called CRM, customer relationship management.

The main ways Data Mining is applied in CRM correspond to the three parts of Gap Analysis:

For the Acquisition Gap, Customer Profiling can identify characteristics that customers have in common and so deepen our understanding of them. Cluster Analysis can then group customers, and Pattern Analysis can predict who is likely to become a customer. This helps marketers find the right targets, which lowers cost and raises the marketing success rate.

For the Sales Gap, Basket Analysis helps us understand customers' buying patterns and find out which products they are most likely to buy together, while Sequence Discovery predicts how long after buying one product a customer will buy another. With Data Mining, decisions on product mix, product recommendations, purchasing and inventory, and even in-store placement can be made more effectively, and the effectiveness of promotions can be evaluated.
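The "how long after buying one product will the customer buy another" question can be sketched by measuring the gap between two purchases for each customer. The purchase history, product names and the function `days_between` are hypothetical, and averaging observed gaps is only the simplest possible estimate.

```python
from datetime import date

# Hypothetical purchase history: (customer, product, purchase date).
purchases = [
    ("c1", "printer", date(2024, 1, 1)),
    ("c1", "ink", date(2024, 1, 31)),
    ("c2", "printer", date(2024, 2, 1)),
    ("c2", "ink", date(2024, 2, 21)),
    ("c3", "printer", date(2024, 3, 1)),  # has not bought ink yet
]

def days_between(purchases, first, second):
    """Per customer: days from the first purchase of `first`
    to the next purchase of `second`."""
    first_dates = {}
    gaps = {}
    for customer, product, day in sorted(purchases, key=lambda p: p[2]):
        if product == first:
            first_dates.setdefault(customer, day)
        elif product == second and customer in first_dates and customer not in gaps:
            gaps[customer] = (day - first_dates[customer]).days
    return gaps

gaps = days_between(purchases, "printer", "ink")
average_gap = sum(gaps.values()) / len(gaps)
print(gaps, average_gap)
```

A customer like c3, who bought the first product but not yet the second, is exactly the kind of target a timed recommendation would be aimed at.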

For the Retention Gap, we can analyze the characteristics of former customers who switched to competitors, use the results to find existing customers who may leave, and then design ways to prevent that loss. A more systematic approach is to use a Neural Network to grade customer loyalty from buying behavior and transaction records, distinguishing levels of churn risk and matching each with a different strategy.

CRM is more than setting up a customer service line, and more than keying a pile of basic customer data into a computer. Before the hardware and software can fully support a complete CRM operation, a great deal of data preparation and analysis has to be pushed forward. Through Data Mining, enterprises can efficiently dig the most critical answers about consumers out of the large volume of market and customer data they have collected, and build customer relationship management around customer needs, covering four aspects: strategy, target positioning, operational efficiency, and measurement and evaluation.

NO.10 What Data Mining analysis tools are commonly used in the industry?

The tool market can be broadly divided into three categories:

1. Software packages for general analytical purposes

SAS Enterprise Miner

IBM Intelligent Miner

Unica PRW

SPSS Clementine

SGI MineSet

Oracle Darwin

Angoss KnowledgeSeeker

2. Software developed for a specific function or industry

KD1 (for retail)

Options & Choices (for insurance)

HNC (for credit card fraud or bad debt detection)

Unica Model 1 (for marketing)

3. Large-scale analysis systems integrating DSS (Decision Support Systems) / OLAP / Data Mining

Cognos Scenario and Business Objects

Conclusion

Thank you for reading. If you find any shortcomings, corrections and criticism are welcome.

