The risks faced by big data and the existing problems (big data industry must read)

2025-04-04 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

"Big data" is undoubtedly a hot term at present. Whenever data analysis comes up, big data is invoked, which is a double misunderstanding of both big data and data analysis. In the face of the promise of a popular concept and the commercial interests it represents, academia should maintain a high degree of sincerity and skepticism.

"Big data" became the keyword of 2018 and is expected to bring major changes to how we live, work, and think.

The work of Google, Amazon, and other Internet companies with big data has shown the data industry a new path of development, and the energy of big data applications in education, medicine, automobiles, and services fills enterprises and researchers with confidence in its future. Chris Anderson, editor-in-chief of Wired, even asserted as early as 2008 that the flood of data would bring an end to theory: the scientific method of "hypothesize, model, test" would become obsolete.

Technological change is encouraging in any industry, but a note of caution is in order, borrowing Susanne Langer's argument in Philosophy in a New Key:

Certain ideas sometimes burst upon the intellectual landscape with tremendous force. Because they seem to resolve so many problems at once, they appear to promise that they will resolve all fundamental problems and clarify all obscure issues. Everyone seizes on them as the magic key to some new empirical science, the conceptual axis around which a comprehensive system of analysis can be built. Such a "grand idea" suddenly becomes fashionable and, for a while, crowds out almost everything else. [4] [5]

Langer attributes this to the fact that "all sensitive and active minds turn at once to exploiting it." The observation fits today's fervent worship of big data just as well. Big data's popularity does not mean that other ways of knowing and thinking no longer deserve to exist. As Microsoft's Craig Mundie put it, "the data-centric economy is still in its infancy. You can see its outline, but its technical, infrastructural, and even business-model implications are not yet fully understood." Still, it is undeniable that academic interest has shifted toward this field, and once people begin to explain these ideas clearly and prudently, even without a perfect solution for the time being, the effort will at least point in a direction that benefits people.

Of course, when people paint big data's beautiful picture, they do not entirely forget the risks it may bring, but their concerns focus on big data's consequences, such as information security, rather than on how to view big data itself. This paper briefly analyzes the risks and problems faced in the era of big data, especially in the domestic technological environment, in order to clarify the concept and dispel some misunderstandings.

The risks faced by big data are mainly shown in the following aspects:

First, the computing speed of massive data

Retail giant Wal-Mart processes more than 1 million customer transactions per hour, feeding databases expected to exceed 2.5 PB (a petabyte is 2^50 bytes), the equivalent of 167 times the book holdings of the Library of Congress. Cisco, the communications-equipment maker, expected the amount of data flowing over the Internet to reach 667 EB (an exabyte is 2^60 bytes) per year by 2013. [6] The growth of data will continue to outpace the development of the networks that carry it.

Statistics from Taobao show that the data it produces in a single day can reach or even exceed 30 TB, and that is just one day at one Internet company. In handling data at this scale, the first problems encountered are technical: massive transaction and interaction data push big data, in both scale and complexity, beyond the ability of commonly used technology to capture, store, and analyze such data sets at reasonable cost and within reasonable time.

Discussions of big data tend, almost unavoidably, to turn to the United States, so how does the United States actually deal with this problem?

The big data research program launched by six departments of the US government includes:

DARPA's big data research projects: the multi-scale anomaly detection project, which aims at anomaly detection and characterization in large-scale data sets; the insider-threat program, which aims to automatically identify network threats and unconventional warfare behavior by analyzing data from sensors and other sources; and the Machine Reading project, which aims to apply artificial intelligence and develop learning systems that extract knowledge from natural-language text.

NSF's big data research content: core technologies for extracting useful information from large, diverse, distributed, and heterogeneous data sets; and the development of statistical methods based on a unified theoretical framework and scalable network-model algorithms to distinguish approaches suited to random networks.

The National Endowment for the Humanities (NEH) projects include analyzing the impact of big data-driven changes on the humanities and social sciences, such as digital book and newspaper databases, and data from Internet searches, sensors, and mobile-phone transaction records.

The Department of Energy's (DOE) big data research projects include machine learning, real-time analysis of data streams, nonlinear stochastic data-reduction techniques, and scalable statistical analysis techniques. [7]

From this research plan we can see that the vast majority of projects address the technical challenges big data brings. The database technology in use today was born in the 1970s; the first task of the big data era is to rebuild the entire IT architecture to improve the capacity for storing and processing ever-growing masses of data.

The author first entered the field of data analysis in 1986, working on a Great Wall 520 and a small IBM machine. After questionnaire data entry was finished, a simple command took three hours of waiting before returning results. Figuratively speaking, what we face today is big data with, in effect, a PC's small-data processing capacity.

This is why big data is so often paired with cloud computing. Real-time analysis of large data sets requires at least techniques like MapReduce and Hadoop, with thousands of computers working in parallel, because achieving real-time analysis means clearing analytical workspace in the database and controlling access to resources and data without affecting production systems. [8] When talking about big data under existing technical conditions, we must fully account for the limits of hardware and analytical technology; this is the premise, and it is why data centers have become the top secrets of Google and Amazon. The enthusiastic response of many enterprises, including Tencent, to Facebook's open-source hardware project is likewise driven by this practical need.
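The division of labor behind the MapReduce idea mentioned above can be sketched in miniature. The following is a toy, single-process word-count illustration, not Hadoop's actual API: "map" emits key-value pairs from independent chunks, and "reduce" aggregates all values sharing a key; Hadoop distributes those two phases across thousands of machines.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Sum the counts of each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["big data big ideas", "data beats theory", "big claims"]
pairs = [p for c in chunks for p in map_phase(c)]  # maps could run in parallel
counts = reduce_phase(pairs)
print(counts)  # e.g. {'big': 3, 'data': 2, ...}
```

Because each map call touches only its own chunk and each reduce key is independent, both phases parallelize naturally, which is the property that makes the scheme scale.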

Second, massive data brings the risk that spurious patterns appear everywhere.

"If only human cognition were sporadic and small, there would be wisdom in smallness, for human knowledge rests more on experiment than on understanding; the greatest danger must be the reckless application of partial knowledge." Schumacher used this passage in Small Is Beautiful to express his concern about the large-scale use of nuclear energy, agricultural science, and transport technology. It applies equally well today to the survey industry, to enterprises, and to researchers who are superstitious about "full data" and ignore the risks of abandoning sampling.

The computing problem of massive data can be solved as new technologies become widespread (distributed caches, MPP-based distributed databases, distributed file systems, various NoSQL distributed storage schemes, and so on), but this is only the first step of data processing (and even this step carries considerable risk of its own), and it is not the biggest risk. Big data's most serious risk lies at the level of data analysis.

(1) Growth in data volume leads to lost patterns and serious distortion.

Viktor Mayer-Schönberger points this out in his book on the big data era: "a substantial increase in the amount of data will lead to inaccurate results, and some erroneous data will be mixed into the database." In addition, another defining trait of big data, variety, means that mixing information from different sources increases the messiness of the data. Statisticians and computer scientists point out that large data sets and fine-grained measurement raise the risk of "false discovery." The claim that the scientific method of hypothesis, testing, and verification is obsolete arises precisely from confusion in the face of big data: unable to handle large amounts of unstructured data and draw deterministic conclusions from it, one simply embraces what Kevin Kelly calls chaos. That stance works in some areas; it can explain biological selection, say the selection of plants on the East African savanna, but it does not necessarily explain people, the course of events, or the laws behind them.

Big data means more information, but it also means more spurious relationships. Professor Trevor Hastie of Stanford University describes data mining in the big data era as "finding a needle in a haystack"; the problem is that many straws look like needles, and finding the real needle is data mining's biggest problem. Massive data creates a significance-testing problem that makes genuine connections hard to find.

Let's take a practical example of the problems that arise as the sample size keeps growing:

Table 1. Significance-test problems caused by increasing data volume

The table above shows a regression analysis of the diffusion of online games in 2006. With a sample of 5,241, fitting a simple linear regression finds the three variables age, education, and income significant. Increasing the sample to 10,482, "only child" and "female" become significant as well; at 20,964, the "outside the system" variable also turns significant. At 330,000, every variable is significant, which would imply that everything in the world is connected. And what happens with hundreds of millions of people? Once the sample is large enough, many results naturally become significant, making inference impossible or yielding spurious statistical relationships. Moreover, broken and missing data (analyzed below) make such spurious relationships grow along with the data volume, and the truth becomes ever harder to recover.
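The pattern in the table can be reproduced with a back-of-the-envelope calculation. This is a generic sketch with an invented effect size of 0.02, not the original regression: holding a tiny effect fixed, the z-statistic grows with the square root of the sample size, so "significance" eventually arrives for free.

```python
import math

def z_statistic(effect, sigma, n):
    """z for testing a mean difference `effect` with noise sd `sigma`."""
    return effect / (sigma / math.sqrt(n))

effect, sigma = 0.02, 1.0  # a practically negligible effect (assumed)
for n in (5241, 10482, 20964, 330000):
    z = z_statistic(effect, sigma, n)
    print(n, round(z, 2), "significant" if z > 1.96 else "not significant")
```

At n = 5,241 the effect is not significant at the 5% level; by n = 10,482 it already is, and at 330,000 the z-statistic exceeds 11, mirroring the table's progression even though the effect itself never changed.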

In fact, the real pattern goes like this. The actual 2006 model of online-game diffusion (figure distinguishing people inside and outside the state system, not reproduced) shows the following:

1. There are significant differences in game use between people of different ages inside and outside the system, and we can clearly see that in 2006 online games followed a diffusion-of-innovation pattern led by education level.

2. Among the highly educated, diffusion extended to ages 34-40, showing a significant rise and reaching a peak.

3. Among less-educated groups, such as those with high-school or junior-high education, it spread rapidly among the young, forming a peak.

4. In 2006, online games were diffusing from several educational starting points at once; age was no longer a simple high-low difference but an effect formed jointly with the education variable [10]. Seeing the wave-like diffusion of online games, we can not only identify who was playing in 2006 but also explain why, using the life cycle and family cycle; analyzing the differences between people inside and outside the system also shows how different work environments shape behavior. Putting the 2006 results back into the whole diffusion process, what we see is no longer online games themselves but the process of social change brought about by a new technology.

Analyzing a social phenomenon objectively, deeply, and accurately requires data, but even more it requires analytical thinking. In the big data era, theory is not less important but more so. By theory we do not mean clinging rigidly to old doctrine, but being aware of the complexity massive data brings and insisting on continuous innovation in analytical methods and theory.

(2) The analytical approach of sampling analysis plus full-data verification.

Viktor Mayer-Schönberger, introducing the shifts in analytical thinking of the big data era, lists three, one of which is analyzing all the data rather than relying on a small portion of it. "Full data" became all the rage; enterprises and researchers took big data to mean all the data, so that merely mentioning sampling sounded conservative. This view is biased and does justice neither to big data nor to sampling, and a buzzword means exactly what the people engaged in the activity make of it. If you believe that big data means collecting every sample and letting the data speak for itself, then the methodology is narrow; that narrowness simply goes unnoticed under the halo of openness, objectivity, and comprehensiveness.

The first risk of this view: where is the "full data"? At what volume can data be regarded as "full"?

This leads to the second question about full data. (Suppose we find "full" data in the true sense through the search terms people type into Google: the case of Google using search records to predict influenza outbreaks is widely cited to show that data can speak for itself. When people begin searching the web for cold-related words, it signals that they have the flu, and a relationship can be established between the flu, geography, and the virus, successfully predicting an outbreak.) [11] The data do see change, and "predict" through change, but they cannot explain the factors driving the change. Mayer-Schönberger's answer is: we want correlations, not causality. This is not the author's free choice but the necessity that follows from abandoning sampling and adopting big data directly.

Mayer-Schönberger believes that big data's simple algorithms can solve problems while tolerating imprecision, yet the contrasting performance of the Literary Digest and Gallup in predicting the 1936 presidential election still shows the importance of scientific, rigorous sampling. The Literary Digest relied on the huge circulation of the print era to collect data from 2.4 million people, while Gallup studied only 5,000 people on the basis of strict sampling; it is a real case of "small data's" complex method beating "big data's" simple one.
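The mechanism behind that upset is easy to simulate. The population shares and frame sizes below are invented (and scaled down from the 1936 figures) purely to show the mechanism: a huge sample drawn from a biased frame misses the true proportion by a wide margin, while a small random sample lands close to it.

```python
import random

random.seed(1936)

# Population: 55% favour candidate A (coded 1), 45% favour B (coded 0).
population = [1] * 550_000 + [0] * 450_000
# "Mailing-list" frame skewed toward B's supporters (hypothetical skew).
biased_frame = [1] * 300_000 + [0] * 450_000

digest = random.sample(biased_frame, 240_000)  # big but biased
gallup = random.sample(population, 5_000)      # small but random

digest_share = sum(digest) / len(digest)  # near 0.40, far from the truth
gallup_share = sum(gallup) / len(gallup)  # near the true 0.55
print(digest_share, gallup_share)
```

No amount of extra sampling from the biased frame would fix the Digest's estimate; the error is in the frame, not the sample size, which is the whole point of the 1936 story.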

Without sampling and model fitting, confronting big data directly will make us lose our understanding of people while chasing the real pattern. After all, not every social fact is as easy to predict as the flu, and even Google's widely acclaimed flu-prediction case is considered problematic: after comparison with traditional influenza surveillance data, Google Flu Trends, updated in real time from Internet flu searches, was found to significantly overestimate peak flu levels. Scientists point out that noise in search behavior undermines its accuracy, and suggest that flu tracking based on social-network data mining can only supplement, not replace, traditional epidemic surveillance networks. They are developing lower-noise alternatives, such as Twitter-based flu tracking that counts only posts from actual patients rather than reposted flu news.

Third, closed data and broken data

The problems caused by closed and broken data were touched on in the second part: they produce spurious statistical relationships and undermine the accuracy and verifiability of analytical results. The two are analyzed in turn below.

(1) Closed data leads to a lack of data diversity.

"The key to adding value to data is integration, but the premise of free integration is openness. Open data means putting raw data and its related metadata online in downloadable electronic formats for free use by others. Open data and publicly disclosed data are two different concepts: disclosure operates at the information level, openness at the database level. The significance of opening data is not only to satisfy citizens' right to know; it is to let data, the most important means of production of the big data era, flow freely, promoting innovation and the development of the knowledge economy and the network economy." [13]

Openness is inherent in the very idea of big data, and a change our government and enterprises must adapt to in this era, but the situation we face is still one platform, one silo of data. The result of these data barriers is that all the data exists somewhere, yet everything is in short supply.

Take medicine, for example. Big data is seen as bringing hope to the field: computers could go a step beyond imitating human experts' intuition, without relying on the small data sets of evidence-based medicine (EBM). Yet medical information systems still operate behind outdated data barriers, admitting only audited, standardized, edited data and rejecting much usable data for lack of consistency. The barrier produces homogenized data and excludes exactly the diversity that would make the system truly useful. [14]

Take the four major Weibo platforms, Sina, Sohu, NetEase, and Tencent: their data are mutually independent, and each analyzes user behavior only among its own users. In this closed data environment, many specific analyses are severely limited: overlapping users; what kind of people open an account on only one platform; what characterizes those with accounts on several; whether they adopt the same style on different platforms; whether their activity levels differ across accounts, and what the influencing factors are. None of this can be analyzed in a closed data environment.
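If the platforms' user tables could ever be joined, the overlap questions above would reduce to simple set operations. A hypothetical sketch (all account IDs invented) of the analysis a closed data environment forbids:

```python
# Hypothetical per-platform account sets; in reality these tables
# are siloed and cannot be joined across companies.
platforms = {
    "Sina":    {"u1", "u2", "u3", "u5"},
    "Sohu":    {"u2", "u4"},
    "NetEase": {"u2", "u3", "u6"},
    "Tencent": {"u1", "u2"},
}

everyone = set().union(*platforms.values())
on_all = set.intersection(*platforms.values())
single_platform = {u for u in everyone
                   if sum(u in users for users in platforms.values()) == 1}

print(len(everyone))            # 6 distinct users across all platforms
print(on_all)                   # users present on all four platforms
print(sorted(single_platform))  # users with an account on only one
```

The computation is trivial; what the passage argues is that the join itself is impossible while each platform holds its table behind a wall.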

Data is an enterprise's most important asset, and as the data industry develops it will only grow more valuable. A closed data environment, however, obstructs the realization of that value, for enterprise applications and research alike, so we need a reasonable mechanism to open data while protecting data security, allowing it to be fully used. One effective approach is for impartial third-party analysis firms and research institutions to act as intermediaries that collect and analyze data, breaking real-world boundaries at the data level and sharing data across many companies instead of each feeling one part of the elephant. Only then can big data be achieved in the real sense, giving data a broader and more comprehensive analytical space and bringing a change of thinking, and meaningful change, to the industry's structure and to data analysis itself.

(2) Broken data leaves the data unstructured.

Closed data keeps us from seeing diverse data, while broken data leaves the data unstructured. According to an IDC report, 90% of the world's digital information in 2012 was unstructured data such as video, audio, and image files [15]. The lack of structure is treated as a problem that new technology alone can solve, and that assumption is what makes it tricky: the excessive pursuit of new technology on the one hand destroys the authenticity and integrity of the data itself, and on the other slights the analysis of the people, and the lived meaning, behind the data.

1. Behind the behavior, no person can be seen and no life meaning found.

Take Taobao as an example: when Taobao wanted to find out who exactly opens shops on Taobao, it discovered the question was not as easy as expected.

On Taobao's real-time map, the GPS system shows clearly what transactions are happening across the country every second, but the real-time map cannot say more about the demographic characteristics of these people. [16] The same problem appears in the user research of Tencent Games's department: from real-time monitoring they cannot learn who is playing their games, what their hobbies are, what their personalities are, or why they like a particular game. All they have is an ID account. This is the problem of broken data: seemingly comprehensive, actually fragmentary. Full data can indeed capture human behavior to some extent, but not what kind of human is behaving. Knowing this, one understands why Google launched Google+: to obtain specific user information, including names, hobbies, friends, identity, and other concrete data. Every platform has its own strengths in data collection, and its own blind spots; what looks like a huge amount of data is in fact a fragment, lacking continuity and identifiability.

Barabási describes LifeLinear, a website where users can type their name into a search box and find surveillance footage of themselves anywhere, at any time of day; wherever you go, your whereabouts are recorded by the site. The site is the author's invention, yet no small number of readers believed it and typed the address into a search engine, because in theory it could be built: first, with the help of a city's wireless surveillance systems, feeding data into a single searchable database that instructs computers to track everyone; second, and most importantly, because everyone has fixed habits and behavior patterns, on which the system can build a behavioral model for each person, predict where you are likely to be, and wait for you there. [17]

Building such a system depends on technology, but even more on a comprehensive understanding and analysis of each individual; both are indispensable. Another data publicist Barabási introduces in the book posted his location data and property information online, yet you know nothing about this person, because there is no information about his personality, preferences, or character. It is a typical case of "having everything, yet having nothing."

2. Massive unstructured data subverts the basic paradigm of traditional analysis.

In the big data era, the data to be processed is no longer data in the traditional sense but text, pictures, audio, video, and more. Masses of unstructured data pose a new challenge to data analysis, because only data that can be defined becomes valuable information.

Users of Renren will be familiar with the friend recommendations on their home page. That part is simple: analyze your friends and find the connections among them. But when Renren must decide which ads to place in its ad slots, it has to analyze huge volumes of user-generated text, photos, shared content, and interactions with friends. Structuring large amounts of unstructured and semi-structured data and finding patterns in it requires new algorithms and new analytical thinking.
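The structuring step described above can be sketched in its most minimal form: turning free-form user posts (unstructured) into a word-frequency vector (structured) that an ad-matching rule can act on. The ad categories and keywords below are invented purely for illustration; real systems use far richer features.

```python
import re
from collections import Counter

# Hypothetical ad categories and their trigger keywords.
AD_KEYWORDS = {
    "travel": {"flight", "hotel", "beach"},
    "gaming": {"game", "guild", "raid"},
}

def to_features(posts):
    """Bag-of-words counts over a user's posts (the structuring step)."""
    words = re.findall(r"[a-z]+", " ".join(posts).lower())
    return Counter(words)

def match_ads(features):
    """Score each ad category by keyword overlap with the user's words."""
    return {cat: sum(features[w] for w in kws)
            for cat, kws in AD_KEYWORDS.items()}

posts = ["Booked a flight and a hotel!", "Beach photos", "New game raid tonight"]
scores = match_ads(to_features(posts))
print(scores)  # {'travel': 3, 'gaming': 2}
```

Everything interesting happens in `to_features`: once the text has been reduced to countable tokens, the downstream logic is ordinary structured-data arithmetic, which is exactly why the structuring step carries the analytical risk.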

Fourth, missing data

Oscar Wilde remarked in 1894, "It is a very sad thing that nowadays there is so little useless information." Strictly speaking, he was only half right. Only valuable data counts as information, and extracting as much information as possible from data is not easy. As data volume expands, the share of missing data expands with it, especially when several items are missing within one record, raising the difficulty of processing; beyond the accuracy of the model being built, there is also the problem of time complexity.

However large the data set, the data bearing on any specific question is never large enough, and for any individual, missing values outnumber complete ones. Using new technology to paper over this during collection and integration only makes the risk more prominent at the analysis stage. For example, BI tools use rapid-repair techniques to integrate scattered data and avoid incompleteness, but this loses the original, authentic data, makes it easy for researchers to discard data that contradicts their hypotheses, and renders conclusions unverifiable.
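The cost of such "rapid repair" can be seen in a few lines. A stdlib sketch with invented records, where `None` marks a missing income: mean imputation fills the gaps, but it shrinks the apparent variance and cannot restore the lost information.

```python
# Records are (age, income); None marks a missing income value.
records = [(25, 30), (30, None), (35, 50), (40, None), (45, 70), (50, 90)]

complete = [(a, i) for a, i in records if i is not None]
mean_complete = sum(i for _, i in complete) / len(complete)

# Naive "rapid repair": fill every gap with the observed mean.
imputed = [(a, i if i is not None else mean_complete) for a, i in records]

def var(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

missing_rate = sum(i is None for _, i in records) / len(records)
print(f"missing: {missing_rate:.0%}")
print(var([i for _, i in complete]))  # spread of the real observations
print(var([i for _, i in imputed]))   # artificially smaller after repair
```

The imputed column looks clean and complete, yet its variance has fallen, and any analysis built on it will be overconfident, which is exactly the quiet distortion the paragraph warns about.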

Nestlé, for example, sells more than 100,000 products in 200 countries and has 550,000 suppliers, yet it enjoys no strong bargaining advantage because its databases are a mess. One audit found that of its 9 million records of suppliers, customers, and raw materials, almost half were expired or duplicated, and a third of the rest were inaccurate or incomplete; some supplier names were abbreviated and some were not, producing duplicate records. [18] Closure, breakage, and missingness are all present in this one case.
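The abbreviated-versus-full supplier-name problem in the Nestlé case can be sketched with a crude normalization (lowercase, strip common corporate suffixes) that exposes duplicates; the names and suffix list are illustrative only, and real entity resolution is far harder than this.

```python
import re

# Common corporate suffixes to strip (illustrative, not exhaustive).
SUFFIXES = r"\b(incorporated|inc|corporation|corp|co|ltd|limited)\b\.?"

def normalize(name):
    """Reduce a company name to a crude canonical key."""
    key = name.lower()
    key = re.sub(SUFFIXES, "", key)
    key = re.sub(r"[^a-z0-9]+", " ", key).strip()
    return key

records = ["Acme Corp.", "Acme Corporation", "Globex Ltd",
           "Globex Limited", "Initech Inc."]
seen, duplicates = {}, []
for name in records:
    key = normalize(name)
    if key in seen:
        duplicates.append((seen[key], name))  # same supplier, two spellings
    else:
        seen[key] = name

print(duplicates)
```

Even this toy rule collapses the two Acme and two Globex spellings into single suppliers; the audit's "almost half duplicated" figure suggests how much value such cleaning was leaving on the table.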

Although missing data can be addressed with fuzzy-set methods, many research situations demand deterministic data. The big data era needs not only full, massive, real-time data, but genuinely open data and analytical methods and ideas that are more likely to be accurate and oriented toward people and society. On closed data platforms, ignoring the analytical risks posed by broken and missing data leaves us still in the small data era; worse, the data remain small while the methods simply champion one new technology after another for handling big data, and the confusion created by this mismatch is more dangerous than any created by big data itself.

In a sense, we can use the data we collect to learn how to do things better, and it is from this standpoint that we should consider innovation and big data applications. After all, big data is not just collaborative filtering predicting which products you need, or telling you when a plane ticket is cheapest to buy; those are only facets of making people and business smarter and more interesting. "Revolutions in science have often been preceded by revolutions in measurement," says Sinan Aral, a business professor at New York University. [19] Big data's surging development and sweeping ambition are bound to affect the field of scientific theory, which is why we need to keep a measure of calm and prudent judgment. At the same time, big data's potential to promote information sharing and social progress deserves our efforts toward a more complete solution.

As Geertz said, "the second law of thermodynamics, the principle of natural selection, the concept of unconscious motivation, or the organization of the means of production does not explain everything, not even everything human, but it still explains something; and our attention shifts to isolating just what that something is, to disentangling ourselves from a lot of pseudoscience to which, in the first flush of its celebrity, it has given rise." I close with this point from The Interpretation of Cultures to express my view of big data research: to this day, the vagueness of the big data concept still exceeds what it denotes, yet there remains great room for improvement and inquiry, and our work has only just begun.
