In this article, the editor explains what data mining means. The content is rich and analyzed from a professional point of view; I hope you will take something away after reading it.
Data mining (Data Mining) is the process of extracting hidden, previously unknown, but potentially useful information from large amounts of data. The goal of data mining is to build decision models that predict future behavior from records of past actions.
More concretely, data mining refers to the process of searching large amounts of data for hidden information by means of algorithms.
Data mining is closely related to computer science, and the goals above are achieved through statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb), pattern recognition, and many other methods.
Data mining is an indispensable part of knowledge discovery in databases (knowledge discovery in databases, KDD), while KDD is the whole process of transforming raw data into useful information, which includes a series of transformation steps from data preprocessing to post-processing of the data mining results.
The Origins of Data Mining
Researchers from different disciplines came together and began to develop more efficient, scalable tools that could handle diverse data types. This work built on the methodologies and algorithms those researchers had used before, and it has produced some of the most exciting results in the field of data mining.
In particular, data mining draws on ideas from the following fields: (1) sampling, estimation, and hypothesis testing from statistics; (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning.
Data mining has also quickly embraced ideas from other fields, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval.
Several other areas also play an important supporting role. Database systems provide support for efficient storage, indexing, and query processing. Techniques from high-performance (parallel) computing are often important for handling large datasets. Distributed techniques also help handle massive amounts of data, and they become even more important when the data cannot be processed in one place.
KDD (Knowledge Discovery in Databases)
Data cleaning
Eliminate noise and remove inconsistent data.
Data integration
Combine data from multiple sources.
Data selection
Extract the data relevant to the analysis task from the database.
Data transformation
Transform and consolidate the data into a form suitable for mining, for example through summary or aggregation operations.
Data mining
The essential step in which intelligent methods are applied to extract data patterns.
Pattern evaluation
Identify the truly interesting patterns that represent knowledge, based on interestingness measures.
Knowledge presentation
Use visualization and knowledge representation techniques to present the mined knowledge to users.
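To make these steps concrete, here is a minimal Python sketch of the pre-mining KDD stages using pandas. The file names, column names, and join key (sales.csv, customers.csv, customer_id, age, income, region) are hypothetical placeholders, not taken from any particular dataset.

```python
# A minimal sketch of the pre-mining KDD steps using pandas.
# All file and column names below are hypothetical placeholders.
import pandas as pd

# Data cleaning: drop duplicate rows and fill missing values
sales = pd.read_csv("sales.csv").drop_duplicates()
sales["income"] = sales["income"].fillna(sales["income"].median())

# Data integration: combine two sources on a shared key
customers = pd.read_csv("customers.csv")
data = sales.merge(customers, on="customer_id", how="inner")

# Data selection: keep only the attributes relevant to the analysis task
data = data[["customer_id", "age", "income", "region"]]

# Data transformation: aggregate to one row per region, a form suitable for mining
summary = data.groupby("region").agg(
    avg_age=("age", "mean"),
    avg_income=("income", "mean"),
    customers=("customer_id", "nunique"),
)
print(summary)
```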
Data mining methodology
Business understanding (business understanding)
Understand the goals and requirements of the project from a business point of view, then translate this understanding into a concrete data mining problem, and draw up a preliminary plan for achieving the goals.
Data understanding (data understanding)
The data understanding phase begins with collecting the raw data, followed by becoming familiar with the data, identifying data quality problems, gaining initial insights into the data, and discovering interesting subsets from which to form hypotheses about the hidden information.
Data preparation (data preparation)
The data preparation phase covers the activities that construct the dataset needed for mining from the raw data. Data preparation tasks may be performed multiple times and in no prescribed order. Their main purpose is to obtain the required information from the source systems according to the requirements of the analysis and to preprocess the data through conversion, cleaning, construction, integration, and so on.
Modeling (modeling)
This phase mainly involves selecting and applying various modeling techniques and tuning their parameters to optimal values. There are usually several modeling techniques for the same type of data mining problem, and some techniques have special requirements on the form of the data, so it is often necessary to return to the data preparation phase.
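As an illustration of parameter tuning in the modeling phase, the following sketch uses scikit-learn's cross-validated grid search on a decision tree. The sample dataset and the parameter grid are illustrative assumptions, not a prescribed recipe.

```python
# A minimal sketch of the modeling phase: applying one modeling technique and
# tuning its parameters toward optimal values with cross-validated grid search.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate parameter values (illustrative, not recommended defaults)
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```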
Model evaluation (evaluation)
Before the model is deployed and released, it is necessary to assess the model's effectiveness from a technical point of view and review the steps used to build it, and also to evaluate, against the business objectives, how useful the model is in the actual business scenario. The key purpose of this phase is to determine whether any important business issue has been insufficiently considered.
Model deployment (deployment)
After the model is completed, the model users (the customers) package it, according to the background and goals at the time, so that it meets the needs of the business system.
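One common (though not the only) way to package a finished model for a business system is to persist it as an artifact that can be reloaded later for scoring. The sketch below assumes scikit-learn and joblib; the model, data, and file name are placeholders.

```python
# A minimal sketch of the deployment phase: persist a trained model so the
# business system can load it later and score new records.
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

dump(model, "model.joblib")           # package the model as an artifact

deployed = load("model.joblib")       # later, inside the business system
print(deployed.predict(X[:3]))        # score incoming records
```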
Data mining tasks
In general, data mining tasks fall into the following two categories.
Prediction tasks. The goal of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute being predicted is generally called the target variable or dependent variable, while the attributes used to make the prediction are called explanatory variables or independent variables.
Description tasks. The goal is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in the data. Descriptive data mining tasks are usually exploratory in nature and often require post-processing techniques to validate and explain the results.
Predictive modeling (predictive modeling) involves building a model for the target variable as a function of the explanatory variables.
There are two types of predictive modeling tasks: classification, used to predict discrete target variables, and regression, used to predict continuous target variables.
For example, predicting whether a Web user will buy a book in an online bookstore is a classification task because the target variable is binary, while predicting the future price of a stock is a regression task because price is a continuous-valued attribute.
The goal of both tasks is to train a model that minimizes the error between the predicted and actual values of the target variable. Predictive modeling can be used to determine how customers will respond to a product promotion, to predict disturbances of the Earth's ecosystem, or to judge, based on examination results, whether a patient has a particular disease.
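The following sketch contrasts the two predictive modeling tasks on scikit-learn's bundled sample datasets; the choice of logistic regression for classification and linear regression for regression is only illustrative.

```python
# A minimal sketch contrasting classification (discrete target) and
# regression (continuous target) on sample datasets.
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Classification: the target variable is binary (e.g. buys the book or not)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print("classification accuracy:", round(accuracy_score(y_te, clf.predict(X_te)), 3))

# Regression: the target variable is continuous (e.g. a future price)
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("regression MSE:", round(mean_squared_error(y_te, reg.predict(X_te)), 1))
```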
Association analysis (association analysis) is used to discover patterns that describe strongly associated features in the data.
The discovered patterns are usually expressed in the form of implication rules or feature subsets. Because the size of the search space grows exponentially, the goal of association analysis is to extract the most interesting patterns in an efficient way. Applications of association analysis include finding groups of genes with related functions, identifying Web pages that users visit together, and understanding the relationships between different elements of the Earth's climate system.
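As a toy illustration of association analysis, the sketch below counts the support of item pairs in a handful of made-up market-basket transactions and keeps the pairs whose support clears a threshold; real association mining would use an algorithm such as Apriori over much larger data.

```python
# A minimal sketch of association analysis: find item pairs whose support
# (fraction of transactions containing both items) meets a threshold.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 0.6  # a pair must appear in at least 60% of transactions

pair_counts = Counter(
    pair
    for basket in transactions
    for pair in combinations(sorted(basket), 2)
)
n = len(transactions)
frequent_pairs = {
    pair: count / n for pair, count in pair_counts.items() if count / n >= min_support
}
print(frequent_pairs)  # e.g. the pair ('beer', 'diapers') has support 0.6
```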
Cluster analysis (cluster analysis) aims to find groups of closely related observations, so that observations belonging to the same cluster are as similar to one another as possible, compared with observations belonging to different clusters. Clustering can be used to group related customers, to identify ocean regions that significantly influence the Earth's climate, and to compress data.
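A minimal clustering sketch follows, assuming scikit-learn's k-means and synthetic two-dimensional data; the three-cluster structure is generated purely for illustration.

```python
# A minimal sketch of cluster analysis: group synthetic 2-D observations with
# k-means so that points in the same cluster are close together.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three synthetic groups of two-dimensional observations
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centers:\n", kmeans.cluster_centers_.round(2))
```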
The task of anomaly detection (anomaly detection) is to identify observations whose characteristics are significantly different from other data.
Such observations are called anomalies (anomaly) or outliers (outlier). The goal of an anomaly detection algorithm is to find the real anomalies while avoiding mislabeling normal objects as anomalies. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate.
Applications of anomaly detection include detecting fraud, network attacks, unusual disease patterns, ecosystem disturbances, and so on.
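A minimal anomaly detection sketch based on a simple z-score rule over synthetic one-dimensional data is shown below; the threshold of three standard deviations and the injected outliers are illustrative assumptions, and practical detectors are usually more sophisticated.

```python
# A minimal sketch of anomaly detection: flag observations whose value lies
# unusually far from the mean (a simple z-score rule).
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=100.0, scale=10.0, size=1000)   # normal behaviour
values[:3] = [180.0, 25.0, 210.0]                       # injected outliers

z_scores = (values - values.mean()) / values.std()
outlier_idx = np.where(np.abs(z_scores) > 3)[0]          # 3-sigma threshold

print("flagged observations:", outlier_idx)
print("flagged values:", values[outlier_idx].round(1))
```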
The above is the meaning of data mining as shared by the editor. If you have similar questions, you may find the analysis above helpful. If you would like to learn more, you are welcome to follow the industry information channel.