
What is the relationship between big data and "data mining"? (from Zhihu)

2025-01-31 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Zhihu user, Internet

244 people agreed.

When I was a graduate student in data mining:

If we want to describe a very large amount of data, we call it Massive Data.

If we want to describe data that is diverse, we call it Heterogeneous Data.

If we want to describe data that is both diverse and large, we call it Massive Heterogeneous Data.

……

If we want to apply for funding and fleece some money, we call it Big Data.

Edited 2014-02-28. 17 comments.


Liu Zhiyuan, NLPer

Four people agreed.

I think big data, like deep learning, is an effective attempt to make obscure computer science concepts recognizable to and accepted by the public. Whether it is "big" or "deep", the words convey the challenge and significance of these research topics vividly and intuitively, even though the topics themselves have been explored in their research fields for decades.

Posted 2014-05-15.


Leaf opening, nonparametric statistics, data mining, R

21 people agreed.

My personal opinion:

Data mining is a rapidly developing cross-discipline built on database theory, machine learning, artificial intelligence, and modern statistics, and it has been applied in many fields. It involves many algorithms, such as neural networks, decision trees, support vector machines based on statistical learning theory, classification and regression trees, and association analysis. Data mining is defined as finding meaningful patterns or knowledge in massive data.
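To make one of the algorithm families named above concrete, here is a minimal nearest-neighbour classifier in pure Python; the training points and labels are invented purely for illustration:

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training point closest to the query.

    train: list of (features, label) pairs; features are equal-length tuples.
    query: a feature tuple to classify.
    """
    def dist(a, b):
        # Euclidean distance between two feature tuples.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best = min(train, key=lambda pair: dist(pair[0], query))
    return best[1]

# Toy data: (height_cm, weight_kg) -> species label, invented for the example.
train = [((20, 3), "cat"), ((22, 4), "cat"), ((60, 25), "dog"), ((55, 20), "dog")]
print(nearest_neighbor(train, (58, 22)))  # -> dog
```

Real data mining systems use far more refined variants (k-NN with weighting, index structures), but the pattern-from-data idea is the same.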

Big data was put forward only in recent years, and it too is a concept hyped by the media. It has three important characteristics: large data volume, complex structure, and fast data updates. With the development of Web technology, the automatic logging of data generated by web users, continuous data collection by sensors, and the growth of the mobile Internet, data is being collected and stored ever faster, and the world's data volume is expanding beyond what a single computer (minicomputer or mainframe) can store and process. This challenges the implementation of data mining technology, which in general runs on a single minicomputer or mainframe, possibly with parallel computing. In response, Google proposed a distributed file system and developed the concepts of cloud storage and cloud computing.

Big data must be mapped onto small units for computation, with all the results then integrated; this is the so-called map-reduce framework. The computation on each single machine still uses data mining techniques. The difference is that some existing data mining techniques are not easily embedded in the map-reduce framework, and some algorithms need to be adapted.
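The map-reduce idea described above can be sketched in a few lines of plain Python. This is a single-machine simulation of the framework, not a real distributed run; the word-count task and input chunks are my own toy example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Each "mapper" emits (word, 1) pairs for its own chunk of the input.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # The "reducer" integrates the results from all mappers, keyed by word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["big data big", "data mining", "big data"]        # data split into small units
mapped = chain.from_iterable(map_phase(c) for c in chunks)  # map runs on each unit
print(reduce_phase(mapped))  # -> {'big': 3, 'data': 3, 'mining': 1}
```

In Hadoop-style systems the same two phases run on many machines, with the framework shuffling the intermediate (key, value) pairs between them.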

In addition, the rise of big data processing also poses new challenges to statistics. Statistical theory is usually based on samples, but in the big data era it may be possible to obtain the whole population, rather than a without-replacement sample drawn from it.

Posted 2014-03-07.


Zhang Weiqi, master's candidate in data science

Eight people agreed.

There are many definitions of big data; here I quote the mainstream one, from Doug Laney (2001).

Translation is error-prone, so below is the English definition, covering three aspects: Volume, Velocity, and Variety:

Volume. Many factors contribute to the increase in data volume. Transaction-based data stored through the years. Unstructured data streaming in from social media. Increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.

Velocity. Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.

Variety. Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.

To put it simply, data mining is the process of extracting information from raw data, with a focus on finding correlations and analyzing patterns.

The connection between big data and data mining is that data mining will in future no longer target small amounts of sampled, randomized, precise data, but massive, messy big data.

Edited 2014-03-23.


Xiao Zhibo, "big data was invented by the media to deceive people"

75 people agreed.

I stand by my view: "big data" is a term hyped by the media, used to cheat money and win projects. That answers your first question: it is not an extension. As for the second question, there is no similarity.

That said, if you insist on measuring similarity, the overlap is indeed very high, because what big data does is essentially what data mining does.

Before the name data mining caught on, the field was called KDD (Knowledge Discovery and Data Mining, or Knowledge Discovery in Databases), which is easy to explain: data mining finds hidden knowledge and rules in massive data. When did this come up? Last century. When was big data proposed? Only in the last few years. So big data is, to a large extent, just a better-sounding name for data mining.

In fact, we cannot completely dismiss "big data"; the media hype has at least made many people aware of the importance of data. It's just that many people don't know how to actually do big data, because the term is hollow. If you want to understand big data, the down-to-earth way is to learn about "data mining" and "machine learning". For specifics, you can search my earlier answers.

Posted 2014-02-27. 22 comments.


Xu Fangzheng wants to be a *.

28 people agreed.

Thanks for the invitation. I followed big data for a while, but I am now mainly wrestling with community detection, so I am not working on big data for the time being. Please correct anything wrong, but don't flame me. Ahem, a book I have read puts it pretty well; let me give a general introduction.

Traditionally, the core of dealing with many problems lies in sample selection and sample analysis:

Sample selection: from long ago until recently, our ability to obtain and analyze data was very limited, so a lot of the data we needed simply could not be collected. Take the census: the United States was required to conduct one every 10 years, but as the population grew faster and faster, it eventually took 13 years to work out even the approximate population of the country, so a full census could no longer be used. We therefore had to use another classical method to analyze large-scale problems from a small amount of data: sampling. We all know that sampling surveys come with all sorts of requirements and criteria that are idealized and often unsatisfiable, but under the premise that obtaining data was very difficult (you had to see things in person and inspect them by hand), this approach did give us the ability to deal with large-scale problems: select some data completely at random (which we all know is impossible) and completely correctly (also impossible), and analyze it.

Sample analysis: through the sampling method introduced above, we obtain the data needed to analyze the problem. Now we want to use it, so how? Data can be very simple, such as a length, a temperature, a time, or a weight, or it can be very complex: a book, a picture, a stone. Complex data is complex because it is composed of the simple data we just mentioned, such as weights and lengths, and the various kinds of data influence one another. So analyzing a stone becomes very difficult, which drove our ancestors, who were severely short of computing power (only pens and abacuses, with all sorts of functions and formulas not yet worked out), half mad. Worse, by the time we finish calculating in some simple and rough way, the validity of the data has probably already expired (see the census mentioned earlier). So we invented another powerful and classical method: modeling. We describe an object using the few data that are most critical to it, in place of all the data, so that the amount and difficulty of calculation are greatly reduced.

What is introduced above are our traditional methods of acquiring and processing data; now let's talk about data mining.

Why do we have to mine data? Personally, I think it is because data is now much easier to obtain, so we have so much of it that just looking at it makes us sick.

Precisely because we are drowning in data, we don't want to look at it ourselves; we want computers, not human brains, to find the value in the data for us. That is why we need data mining methods, which is exactly what Xiao Zhibo stated: data mining finds hidden knowledge and rules in massive data. So the premise of data mining is the same as that of big data, namely massive data, and as far as methods go the two are very similar.

As for the big data everyone talks about now, I think it is mainly a way of thinking:

1. Use all the data instead of a sample: by all the data I mean truly all of it, including both correct and incorrect data. Even noise and erroneous data contain useful information.

2. Care about what, not why: because we have a huge amount of data, statistical results drawn from big data should hold quite generally, so we just take the phenomenon as given; causality is a pit to fall into, and exploring and proving it is usually extremely difficult. The classic example is beer and diapers: the association is easy to get from the data, and placing the two together increases sales and achieves Wal-Mart's goal, while finding out why would take a great deal of work.

3. Pay more attention to acquiring data than to analysis methods: in other words, data comes first. Computers are now so powerful that, as long as we can think of a method, they can do the corresponding work for us. Given that, what we should do is obtain more, and more comprehensive, data for the computer to analyze. For example, express companies abroad install sensors on vehicles to help with delivery scheduling, and Rolls-Royce installs sensors on aircraft engines and uses historical and real-time data to predict potential faults and repair them in advance. In big data thinking, data provides the most possibilities and the greatest value, so we focus on obtaining data.
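The beer-and-diapers pattern mentioned in point 2 is the classic association-rule example. The support and confidence of such a rule can be computed directly; the shopping baskets below are invented for illustration:

```python
def support_and_confidence(transactions, a, b):
    """Support of the itemset {a, b} and confidence of the rule a -> b."""
    n = len(transactions)
    both = sum(1 for t in transactions if a in t and b in t)
    has_a = sum(1 for t in transactions if a in t)
    return both / n, both / has_a

# Invented shopping baskets for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk", "beer"},
]
sup, conf = support_and_confidence(baskets, "beer", "diapers")
print(sup, conf)  # support 3/5 = 0.6, confidence 3/4 = 0.75
```

The numbers say only that the two items co-occur often; exactly as the answer argues, they say nothing about why.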

Having said all that, what I want to say is that data mining can be summarized as: once we have more data, hand it to the computer's collection of analysis methods. Big data, by contrast, is a new way of thinking that jumps out of our traditional framework for analyzing and processing data. Compared with a technology, a way of thinking is admittedly far more intangible, and to be put into practice it must rest on technology. But it is precisely this different way of thinking that lets us get more out of the data, such as analyzing the noise and erroneous data previously considered worthless, or stumbling on interesting results by attending to the phenomena themselves.

Therefore, I personally think big data is a new way of thinking born in the course of developing data mining technology; putting this thinking into practice rests on data mining technology, and it can in turn push us to develop more data mining technology.

Posted 2014-04-26. 4 comments.


Aiirii wong

20 people agreed.

After reading many of the answers here saying it is all just a hoax: even now, many people still think cloud computing is synonymous with virtualization and is also a hoax. In fact, they have not really understood its substance.

Just as cloud computing is a qualitative change arising from the quantitative development of virtualization technology (although virtualization is not a necessary prerequisite for cloud computing), big data is likewise a new product of old technologies developed to a certain point.

Many people are still stuck on the notion that big data just means a huge amount of data (that is only one of its features). Many so-called big data examples on the Internet do not reflect big data's distinctive characteristics and are no different from earlier data mining; worse, they lead some people to think big data is just an alias for data mining.

In my personal understanding, there are several differences:

1. Data mining still starts from the user's hypothesis about cause and effect and then validates it, while big data focuses on finding correlations: how much a change in A will affect the change in B.

2. The traditional approach only extracts and analyzes data from internal databases, while big data uses more unstructured data from more sources.

3. In terms of processing time, the traditional approach has low timeliness requirements; big data emphasizes real-time processing, with data available online.

4. The traditional approach focuses on mining residual value from data, while big data discovers new content and innovative value in data.

...

The biggest difference lies in the mode of thinking, which in turn leads to great differences in methodology and tools.

Edited 2015-05-15. 3 comments.


Zhihu user, loves Python, Data Debugger, getting into machine learning …

Three people agree.

Data mining is a technology, a body of knowledge. In a narrower sense, it is a general term for a class of algorithms whose common feature is that they try to identify useful patterns in real-world data, thereby acquire new knowledge, and finally apply it in decision making.

Big data, as a concept, is very vague; it has been loaded with too many meanings and lacks substantive content. But "big" is what the usages have in common. I prefer to understand it as the series of data processing tools that have sprung up in recent years, represented by Hadoop with MapReduce. Most are built for distributed environments and sell themselves on handling large data volumes or on real-time capability.

Edited 2014-02-27.


Zhihu user, whose answers have been favorited 161616 times, 1% of the way down the road of seeking knowledge.

14 people agreed.

Take the coal bosses who run mines in Shanxi as an example:

The premise of mining is the existence of ore, including the coal mine's reserves, depth, and coal quality.

Then comes extraction: to dig out the buried ore you need miners, excavators, and haulers.

Then comes processing: coal washing, coking, and so on.

In the end, it is converted into money.

The data industry is very similar:

The premise of mining data is having data, including the amount of data stored, the depth at which it is stored, and its quality.

Then comes data mining, which digs out the buried data.

Then comes the output of data analysis, which should be presented visually to guide analysis and business practice.

Only at this step is value created.

So-called big data roughly means that a huge mine is forming right now. Go seize it and become a coal boss; the next Gates may be born here.

Edited 2014-03-01. 3 comments.


Xu Xiaoyi, AI, Confucianism, https://github.com/andrewxxyi/JXPi

Six people agreed.

They are two different things. Big data is about keeping responses to access requests fast in a massive-data environment; data mining is about distilling useful knowledge from a large amount of historical information. These are matters at two different levels.

In principle, data mining does not need big data, because it has no response-speed requirement; what it values is the utility of the mined knowledge. However, in a massive-data environment, without big data's capacity for fast data supply, the computing resources consumed by data mining may make the job impossible to finish, or too costly.

Posted 2014-02-27. 3 comments.


Zhou Li, a loser who wants to do data mining

Seven people agreed.

Personally, I think data mining is a technology, a concept in a relatively narrow sense.

Big data is more like an industry, and data mining is of course one of its core technologies. Unlike data mining, however, big data also involves a wide range of other technologies, such as visualization, data storage, and data management.

Big data not only uses data mining technology to extract useful information from data; it also ingests massive data, usually with distributed real-time processing, and finally organizes the information obtained through data mining and presents it directly to users.

Posted 2014-02-28.


Zhihu user, PhD candidate

Six people agreed.

I don't have any special views on this issue; I will just recall some of my advisor's remarks here.

1. (When I was a sophomore, the big boss of the lab taught a database course and shared some views on big data in class. The gist:) big data is not actually a new concept; it has been around for a long time. It's just that in recent years some people have reheated the cold rice and stirred up interest, that is, hyped it. Once the hype is done, you can report to the state and apply for some natural science fund or other.

2. (This comes from an internal report the big boss gave in the lab; I am only relaying the least important parts, which he has discussed on other occasions.) Big data has no particularly clear definition. How much data counts as big data? There is no uniform standard. Twenty years ago, hundreds of megabytes looked very large to us; a few years ago, we thought several GB was big data; now we think only several TB of data deserves the name. The standard for big data keeps shifting with the growth of computing power. In the report the boss gave a definition I find fairly reliable, but I don't know whether it has been published.

3. On the asker's question [is big data an extension of data mining? What similarities do the two have?]: I don't think the two are related. What big data brings is the series of problems caused by ever-growing data volumes that existing tools cannot handle effectively, including basic problems in database systems, computing methods, and so on. Data mining is a process of knowledge discovery based on data. Neither is obviously an extension of the other, and they do not have many similarities.

If you insist on relating the two, consider the following.

The main challenge big data brings is that today's basic techniques cannot meet the demand. For example, traditionally we consider a sub-linear-time algorithm good, but on big data even sub-linear time fails. This is the challenge that growing data volumes pose to the entire computer science community: an O(log(n)) algorithm may still be useless on big data (meaning scenarios where the computation cannot be distributed). If it can be distributed, just add a few more machines (as with MapReduce), and once the data is scattered into "small data" it is no longer called big data. The problem for data mining is that even algorithms whose time complexity is very low by traditional standards can no longer meet the requirements.

[above]

Edited 2014-05-16. 1 comment.


Xu Shen

Four people agreed.

First, let me talk about my understanding of big data. I think big data has two meanings. The first is that everything can be datafied. Datafication is not the same as digitization: datafication means quantifying objects into analyzable data, which can be structured or unstructured. To quote a passage from the Oriental Morning Post of April 19, 2013, "Knowing you better than you do: automobile life in the big data era":

To take another example: you may never have thought that the way you sit while driving could prevent your car from being stolen. It sounds incredible, but it is reality. A Japanese industrial research institute installed 360 pressure sensors under a car seat to measure how a person presses on each part of it, quantified on a scale of 0-256. Each passenger produces a unique set of data, and the system can identify the passenger from the differences in seat pressure with 98% accuracy. Installed in a car as an anti-theft system, it lets the car know whether the driver is the owner; if not, the car automatically stalls. It can also judge from the sitting-posture data whether the driver is fatigued, and contain the potential danger by automatically slowing down or braking.

I give this example to show that, with the help of today's technology and statistical knowledge, things that previously could not be described quantitatively can now be analyzed and represented on computers, that is, turned into data.

The second meaning is big data's "the sample is the population". This view comes from Schoenberg's "The Big Data Era". Earlier quantitative surveys and analyses, limited by technology, funding, and other conditions, always drew a sample from the population and studied that sample. Big data is different: what big data analyzes is the whole population.

In a word, big data is a way of thinking.

But back to the keyword data mining. The answers above have already explained data mining, and the difference between data mining and big data, quite clearly. I would only emphasize that big data's unique charm lies in novel and meaningful data mining, such as the classic case of "beer and diapers".

Edited 2014-03-01. 2 comments.


Anonymous user

Three people agree.

Data Mining = big data + Machine Learning

Posted 2015-01-19.


Zhihu user, PhD in Operations Research / engaged in insurance data mining in the United States

Two people agree.

At today's meeting my boss summed up big data very well: big data is like "teenage sex". Everyone is talking about it, everyone thinks everyone else is doing it, so everyone claims to be doing it.

-

In my opinion, big data is a kind of attribute, and data mining is a method, or a collection of methods.

I think data mining means extracting useful information from plain, disordered data: first standardize the data, then choose methods according to the question you want to answer. You can build a model to predict the future, or cluster the current data, and so on. It can also simply mean looking for patterns in the data without answering any specific question. In that sense, I think making a pivot table in Excel is also a kind of data mining.

Big data, on the other hand, refers to a property of the data: as the name implies, it is large. Massive data causes many problems. First, computation: on the simplest personal computer, once the data in memory reaches millions of rows, the machine is basically stretched to its limit; even reading the data becomes a problem, never mind computing on it, and computing speed is a further problem on top of that. Then there is the problem of selection. In the past, data was too scarce: to predict one quantity, you wished you could use every other quantity you could collect. Now data is too plentiful: imagine a model that uses more than 1000 different quantities to predict one quantity. Would you trust it? Even if you did, it would be hard to use such a model to give practical advice. The third characteristic of big data is real-time updating: a large amount of data is generated every day, so yesterday's model must be validated against today's data and then revised, a process of continuous correction.
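The memory problem described above is commonly handled by streaming: process the data one row at a time instead of loading millions of rows at once. A minimal sketch in pure Python, where the column name and the in-memory stand-in for a huge CSV file are assumptions of the example:

```python
import csv
import io

def streaming_mean(rows, column):
    """Compute a column mean one row at a time, never holding the data set in memory."""
    total, count = 0.0, 0
    for row in rows:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")

# Simulated CSV stream; in practice this would be a file object over a huge file.
data = io.StringIO("amount\n10\n20\n30\n40\n")
print(streaming_mean(csv.DictReader(data), "amount"))  # -> 25.0
```

Because only running totals are kept, the same code works whether the stream holds four rows or four hundred million.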

I don't think big data is all hype. Now that everything is being digitized, there really is something special about the methods for processing the data. But data is still data, and the core of processing it will not change.

Edited 2014-09-09.


Zhihu user, data analysis and data mining novice

Two people agree.

As a novice, let me briefly describe my view of the relationship between big data and data mining.

1. First, data mining is a tool, one with a long history; it is nothing new. Big data, by contrast, is a concept that emerged only in recent years, mainly emphasizing panoramic, full data, most of which is unstructured or semi-structured (what we usually call data is basically structured data).

2. Second, data mining is one tool of data analysis, and data analysis is the method by which we explore the laws within big data, so to some extent data mining can be called a tool of big data analysis.

And from Wikipedia, we find that data mining has several different definitions:

"extract hidden, previously unknown and valuable potential information from the data."

"the science of extracting useful information from large amounts of data or databases."

Speaking of data mining, we should also mention knowledge discovery (KDD). Their relationship is this: KDD is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data, while data mining is the step of KDD that generates the specific patterns, through specific algorithms, within acceptable limits of computational efficiency. In today's literature the two terms are often used interchangeably, data mining (DM) = knowledge discovery (KDD); industry generally says data mining, while academia says KDD.

Big data means data whose volume is too large to be captured, managed, processed, and organized into humanly interpretable information within a reasonable time, while data mining is the exploration of methods for analyzing big data.

3. An example: Google Flu Trends.

We found that certain search terms are very helpful for understanding flu epidemics. Google Flu Trends estimates current flu activity around the world in near real time based on aggregated Google search data.

Every week, millions of users search the Internet for health information. As you might expect, flu-related searches increase markedly during flu season, allergy-related searches during allergy season, and sunburn-related searches in summer. All these phenomena can be studied through the analysis of Google searches. But can search-query trends provide the basis for an accurate, reliable model of the actual phenomenon?

We found a close relationship between the number of people searching flu-related topics and the number of people who actually have flu symptoms. Of course, not everyone who searches for "flu" really has the flu, but when we aggregate flu-related search queries, a pattern emerges. Comparing our query counts with data from traditional flu surveillance systems, we found that many searches do increase markedly during flu season. By counting these search queries, we can estimate the spread of influenza across countries and regions worldwide. This work was published in the journal Nature: http://static.googleusercontent.com/media/research.google.com/zh-CN//archive/papers/detecting-influenza-epidemics.pdf
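The "close relationship" described here is a correlation between two weekly time series, which Pearson's r captures. A small sketch in pure Python; the weekly counts below are invented, not Google's actual data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented weekly counts: flu-related searches vs. reported flu cases.
searches = [120, 150, 300, 500, 450, 200]
cases    = [10,  14,  33,  55,  48,  21]
print(pearson_r(searches, cases))  # close to 1.0: the two series move together
```

A value near 1 only shows the series move together; as other answers in this thread stress, it says nothing about why.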

(1) This simple example shows Google using computerized data mining tools to mine its search engine logs (all flu-related records) and find the law behind the data, namely the flu trend. Here the flu records are the full data, not a random sample; this is the biggest difference from earlier (pre-2010) data analysis.

(2) Ideally, big data is mainly unstructured or semi-structured data, yet the data Google records here is still structured, so big data is a concept that keeps developing and being updated. Data mining tools, of course, are also being upgraded; the ideal data mining tool should handle full data, real-time data, and data of many types. In short, big data and data mining are constantly changing and developing, and we ordinary people understand them through historical data. One thing that does stay constant, of course, is the underlying method of applied statistical analysis.

The above is my humble opinion. I hope you can discuss it more and pool our collective wisdom to understand big data and data mining.

Posted 2014-03-03.


He Dongdong, hem, ha, hey.

One person agrees.

To put it simply, data mining appeared earlier than big data. In the course of production (commercial computing), people accumulate data that, arising alongside the production process, must contain some regularities, and they want methods to dig the secrets out of the dull data. So they use statistics, computation, machine learning, and so on (the method is not important; what matters is being able to dig out the secrets). This process is called data mining. Big data, by contrast, is just a generalization that roughly refers to a huge amount of data: a broad concept, nothing specific.

Posted on 2014-03-10


Landlord, landlord.

One person agrees.

Big data can be understood as a technical means, a platform, a tool, or a way of thinking.

Data mining is the work goal. Before the concept of big data appeared, data mining could be done with relational databases, analytical databases and so on; big data is now simply one more option, and a very good technical means at that.

Posted on 2014-12-08


Yang Xuechen, I am here, so I know

One person agrees.

Mining: obviously low-end manual labor, not worth mentioning.

Big (massive): absolutely sophisticated technology. I don't understand it, but I think it's awesome.

The same beer, the same diapers: the programmer wrote down the simple essence, and the capitalist blew it up into a high-end blueprint.

The perspective of the media and the public

From the perspective of media publicity, it completely borrows the term "big data" to instill in the public the great hidden role of "data mining" in commercial activities and social life. Whether it is the famous "Beer and Diapers" or the newly released "House of Cards", it is a perfect interpretation of the commercial value of data mining. As said at the beginning, "big data" undoubtedly has more eye-catching potential than "data mining". For the general public, it is not important to let them know how massive data is stored and processed, it is important to tell them that there is value behind the data. As a result, "big data" has become synonymous with "data mining". It has been successfully promoted through the media and has become a tool for some interest groups to use for conceptual speculation.

A professional perspective

As stated in the definition quoted by @Zhang Weiqi, the concept of big data emphasizes the processing of data characterized by large volume, fast generation and miscellaneous types, together with the related storage, computing and other technologies. Throughout its development, data mining has constantly pursued larger amounts of data from more sources, analyzed more efficiently, in order to obtain more comprehensive, more accurate and more timely results. In my opinion, the proposal of the big data concept is the inevitable result of the development and application of data mining technology: it is a refinement and summary of the massive-data problems encountered as data mining developed.

Edited on 2014-03-01


Wang Tsai Noodle, consultant, Amateur photographer

To put it crudely, big data is the ocean, the information in big data is the fish, and "data mining" is the fishing net. If "big data" is narrowly understood as a kind of data source, then "data mining" is one of the important means of controlling "big data".

Because big data is a complex and unfriendly kind of data source, it is often hard to control with traditional methods. To make effective use of big data, people gradually invented a systematic set of methods and tools to collect, store, extract, transform, load, clean, analyze, mine and apply it. "Data mining" is the general name for the various mining tools and methods among these.

It should be noted that data mining usually cannot be performed directly on a big data source; a great deal of preprocessing work is needed first. Nor is the job over once the mining is done: the results must still be applied to the business in order to create value. It is like an iron mine: ore that meets the grade must first be extracted from the mine (the preprocessing stage: data cleaning, integration, transformation and reduction) before it can be sent to the steel plant to be smelted into steel (the mining stage), and finally the steel is used on a construction site (the application stage).
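The "iron ore" preprocessing stage above can be sketched in miniature. The following Python fragment, with invented field names and records, shows cleaning only: dropping rows with missing required fields, discarding values that fail type conversion, and removing duplicates before any mining begins.

```python
# Raw records as they might arrive from a source system. Field names and
# values are invented for illustration.
raw = [
    {"user": "alice", "amount": "12.50"},
    {"user": "", "amount": "9.99"},        # missing required field
    {"user": "bob", "amount": "oops"},     # unparsable amount
    {"user": "alice", "amount": "12.50"},  # exact duplicate
]

def clean(rows):
    """Keep only rows that survive transformation, validation and dedup."""
    seen, out = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])  # transformation: text -> number
        except ValueError:
            continue                       # discard unparsable values
        if not row["user"]:
            continue                       # discard incomplete rows
        key = (row["user"], amount)
        if key in seen:
            continue                       # deduplication
        seen.add(key)
        out.append({"user": row["user"], "amount": amount})
    return out

print(clean(raw))  # only the first record survives cleaning
```

Only after a step like this does the data reach the "smelting" (mining) stage, which is exactly the order the iron-ore analogy describes.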

First, let's take a look at what "big data" is.

1. Big data is a data source with the 3V characteristics: large volume (Volume), complex and diverse types (Variety), and high-speed generation and aging (Velocity). Volume is the easiest for the public to understand: the data held by Internet companies, telecom operators and financial institutions is often measured in terabytes. Variety and Velocity, however, are often overlooked.

Variety means that big data comes from rich sources and takes many forms. Common examples include e-commerce user data, text data, social network data, vehicle telematics data, time and location data, RFID data, smart grid data, device sensor data, and so on.

Velocity means that big data is generated at high speed and in large volume, with analysis and application also completed in real time. Examples include programmatic buying of online advertising and real-time credit scoring in Internet finance, both of which involve processing massive data in real time.

2. Big data is also a relative concept: today's "small data" used to be "big data". For example, data from ERP and CRM systems can now be manipulated easily in Excel, but decades ago, under the technical conditions of the time, such data was precisely the large, diverse, high-speed "big data" of its day. As technology develops, today's "big data" will likewise become the easily controlled "small data" of the future.

3. Big data is usually generated automatically by machines; for example, sensors in the Internet of Things automatically generate environmental data. Traditional data generation usually involves human action, such as retail transactions and telephone calls.

4. Big data is often not "structured", and is therefore difficult to control. The transaction systems that feed traditional data sources usually generate data in a clean, predefined template to ensure it is easy to load and use. A big data source is usually not strictly defined at the outset; instead, all potentially useful information is collected.

Common financial statements are typical "structured" data: clear categories and line items under the headers, clean and standardized.

Web logs are representative "semi-structured" data: they look messy and not at all neat, but every piece of information has a specific purpose.

Text, on the other hand, such as blog posts and forum comments, is "unstructured" data; a great deal of effort must go into transforming and cleaning it before it can be analyzed and used.
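To make the semi-structured case concrete, here is a minimal Python sketch that parses one web-log line in the common Apache log format into structured fields. The sample line is illustrative; real logs would be processed line by line the same way.

```python
import re

# One line in Common Log Format: messy to the eye, but every field has a
# fixed meaning once parsed -- the hallmark of semi-structured data.
line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

m = LOG_PATTERN.match(line)
record = m.groupdict()
record["status"] = int(record["status"])  # transform text into typed fields
record["size"] = int(record["size"])
print(record)
```

Once parsed like this, the log becomes ordinary structured data and can be loaded into a table for the analysis and mining steps discussed later.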

5. Much of the data may be rubbish, containing little value; in fact, most of it may be worthless. A web log contains some very important data, but also a lot of data with no value at all. Refining it so that the valuable part is retained is essential.

Let's take a look at what "data mining" is.

Since big data is usually massive, diverse, high-speed and real-time, and often not "structured", this raises a question: how do we control big data?

As mentioned earlier, people have invented methods for collecting, storing, extracting, transforming, loading, cleaning, analyzing, mining and applying data in order to control big data, and "data mining" is the general term for the various mining tools and methods.

To understand "data mining", make a simple comparison with "data analysis"

In data analysis, the goal is usually clear and the analysis conditions relatively well defined; statistical methods are used to describe the data along multiple dimensions.

In data mining, the goal is less clear-cut: mining algorithms must be relied on to find the rules and patterns hidden in large amounts of data, that is, to extract hidden, previously unknown and potentially valuable information.

In practice, data mining problems are generally divided into several common types, such as "classification", "clustering", "association" and "sequence", and each type has dedicated mining algorithms. For example, customer churn early-warning models and promotion response models, which predict the probability of a particular user behavior, are "classification" problems; they can be handled with decision trees, logistic regression, multiple linear regression or neural networks.
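As a toy illustration of a "classification" problem like churn prediction, the sketch below uses a 1-nearest-neighbour rule in plain Python, standing in for the decision-tree or logistic-regression algorithms mentioned above. All features and records are invented for illustration.

```python
# Toy churn-prediction training set: two made-up features per user
# (monthly usage hours, support tickets filed) and a churn label.
train = [
    ((40, 0), False), ((35, 1), False), ((50, 0), False),  # loyal users
    ((5, 4), True),   ((8, 3), True),   ((3, 5), True),    # churned users
]

def predict(features):
    """Classify a new user by the label of the closest training example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda item: dist(item[0], features))
    return label

print(predict((6, 4)))   # resembles the churned users
print(predict((45, 1)))  # resembles the loyal users
```

Real churn models train on thousands of examples and many more features, but the shape of the problem is the same: learn a mapping from historical labelled records to a prediction for a new record.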

Friends who want to learn about big data can take a look at Taming the Big Data Tidal Wave (by Bill Franks, translated by Huang Hai, Posts & Telecom Press), which is among the more systematic of the introductory books and is well suited to building a basic cognitive framework for big data.
