
How to Deal with the Big Data Era


In recent years, big data has gradually worked its way into real life, from health care to credit scoring and many other industries.

The word "big data" alone indicates a large amount of data. If these data results are not processed and presented as simple numbers, I believe that you will see more than 10 seconds, and your scalp will tingle. If your scalp is numb, then our customers will be even more numb. If this problem cannot be solved, it will greatly affect the development of big data. As a result, a profession will surely emerge, that is, data visualization engineer, whose responsibility is to make the results of big data clear at a glance and reduce the reading time and reading threshold of customers.

This tutorial will be refined over time into a classic web tutorial for training data visualization engineers.

Now let's get into the lesson: how to cope with the big data era. I have summarized three effective tips.

The three tips:

● Discard imprecise sampling and statistically analyze all the data

Until now, our ability to collect data has been limited, so we have relied mostly on "random sampling analysis."

Definition of random sampling analysis: a method of estimating a characteristic of a population by drawing samples in which every unit has an equal chance of being selected. In other words, it is sampling according to the principle of randomness, which guarantees that every unit in the population has the same probability of entering the sample.

Advantages: when inferring the population from sample data, the reliability of the inferred values can be measured objectively in probabilistic terms, which puts the inference on a scientific footing. For this reason, random sampling analysis is widely used in social surveys and social research.

Disadvantages: it is only practical when the number of population units is limited, otherwise numbering them all is too much work; for complex populations the representativeness of the sample is hard to guarantee; and known information about the population cannot be exploited. In market research its use is limited: when the survey objects are unclear or hard to classify, the researcher must already understand each unit fairly well in order to classify them scientifically, which is often impossible before the actual survey, so the sample ends up less representative.

For example, it is impossible to ask every Chinese citizen how satisfied they are with a policy. The usual approach is to pick 10,000 people at random and use the satisfaction of those 10,000 people to represent everyone.

To make the results as accurate as possible, we design the questionnaire as carefully as possible and make the sample sufficiently random.

This was the approach of the "small data era": when it was impossible to collect all the data, random sampling analysis was a huge success across a wide range of fields.

However, it brings problems with it:

1. It depends on randomness, which is hard to achieve. For example, randomly calling 10,000 households on landlines is not truly random, because it ignores the fact that young people mostly use mobile phones.

2. It looks fine from a distance, but blurs as soon as you zoom in on one point. For example, suppose 10,000 people chosen at random from the whole country represent the nation well. If the same result is used to judge satisfaction in Xizang alone, it lacks precision; the analysis cannot be applied to subgroups (see the sketch after this list).

3. Sampling results can only answer the questions designed in advance, not the questions you only think of later.
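A minimal sketch of point 2, assuming a hypothetical population in which one small region is less satisfied than the rest of the country (every number here is made up purely for illustration):

```python
import random

random.seed(42)

# Hypothetical numbers: a small region holds 2% of respondents and is
# less satisfied than the rest of the country.
REGION_SHARE = 0.02
TRUE_RATE_ELSEWHERE = 0.70
TRUE_RATE_REGION = 0.55

def draw_sample(n):
    """Simulate surveying n randomly chosen citizens."""
    people = []
    for _ in range(n):
        in_region = random.random() < REGION_SHARE
        rate = TRUE_RATE_REGION if in_region else TRUE_RATE_ELSEWHERE
        satisfied = random.random() < rate
        people.append((in_region, satisfied))
    return people

sample = draw_sample(10_000)

# National estimate from 10,000 people: usually close to the true ~70%.
national_est = sum(s for _, s in sample) / len(sample)
print(f"national estimate: {national_est:.3f}")

# Zooming in on the small region leaves only about 200 respondents,
# so the regional estimate is much noisier.
region = [s for in_region, s in sample if in_region]
region_est = sum(region) / len(region)
print(f"regional estimate from {len(region)} respondents: {region_est:.3f}")
```

Re-running the script with different seeds shows the national estimate staying close to its true value while the regional estimate, resting on only a couple of hundred respondents, swings noticeably between runs.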

In the "big data era," sample = population. Today, we have the ability to collect comprehensive and complete data.

What we usually call big data is based on having all the data, or at least as much data as possible.

● Focus on the completeness and richness of the data, and relax the precision of individual data points

In the era of "small data," the first thing we have to solve is to reduce measurement errors, because the information collected by ourselves is relatively small, so to ensure that the results are as accurate as possible, we must first ensure that the recorded information is correct, otherwise subtle errors will be infinitely magnified. Therefore, we must first optimize the measurement tools. And that's how modern science developed. Kelvin, the physicist who developed the international unit of temperature, once said,"measurement is cognition." To be a good scientist, one must be able to collect and manage data accurately.

In the era of "big data," we can easily obtain all the data, and the number is huge to trillions of data. Because of this, it is unimaginable to pursue the accuracy of each data. Weakening the accuracy of the data, then the confusion of the data is inevitable.

However, if the amount of data is large enough, the messiness it brings does not necessarily lead to bad results. That is exactly why we relax our data standards: we can collect far more data, and we can do far more with it.

Take an example:

To measure the salinity of an acre of land with a single meter, that meter must be accurate and working at all times. But if there is a meter on every square meter of the land, some individual readings will be wrong, yet all of them taken together give a more accurate result.

As a result,"big data" usually speaks with more convincing probabilities than with the precision of measurement tools. This requires us to rethink the way we approach data collection. Because of the sheer volume of data, we gave up individual precision, and certainly couldn't achieve it.

For example, on a personal computer every file is found through a path: to find a song, you first open a drive partition, then its folder, and finally the song itself. That is the traditional method. With only a few partitions and folders this works, but what if there were 100 million partitions, or a billion folders? The web holds far more data than any personal computer, billions of items at a time, and if strict categories were used, the people doing the classifying would go mad, and so would the people doing the querying. That is why "tags" are now used everywhere on the Internet: pictures, videos, music and so on are all retrieved by tags. Of course, people sometimes apply the wrong tag, which is painful for anyone accustomed to precision, but accepting the messiness also brings benefits:

By having many more tags than "categories," we surface far more content.

Content can be filtered by tag combinations.

For example, suppose we search for "dove." The word "dove" is associated with many kinds of information: an animal, a brand, a celebrity's name. Under a traditional taxonomy, "dove" is filed under exactly one heading, say animals, brands, or people. One result is that the person querying never learns the other categories exist; or perhaps he only wants the animal "dove," so he never looks in the brand or celebrity categories at all. With tags, however, entering "dove" + "animal" finds one set of results, "dove" + "brand" another, and "dove" + "celebrity" another.

Therefore, using "tag" instead of "classification" has a lot of imprecise data, but thanks to a large number of tags, it makes it easier for us to search.

● Think about correlations in the data and abandon the single-minded pursuit of causality

Study the data itself first, without digging into the reasons the data came out the way it did; let the data speak for itself.

For example:

Walmart is the largest retailer in the world and holds an enormous amount of retail data. By analyzing its sales data, Walmart found that sales of flashlights and Pop-Tarts rose before every seasonal hurricane. So when a seasonal hurricane approaches, Walmart moves its stock of Pop-Tarts next to the hurricane supplies to encourage customers to buy them.

Questions are bound to follow, such as "Why do people buy Pop-Tarts when a hurricane is coming?"

And this "why" is causation. And this "cause" is extremely difficult and complicated to analyze, and even if it is finally obtained, it has little meaning. For Wal-Mart, when a hurricane hits, lay out the tarts. This is where the data speaks for itself.

Knowing that hurricanes are correlated with Pop-Tart sales is already enough to make money.
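A toy correlation check in the spirit of the Walmart example, with entirely made-up daily sales figures:

```python
# Made-up daily data: 1 if a hurricane warning was in effect, and units sold.
hurricane      = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
poptart_sales  = [120, 130, 340, 125, 360, 390, 118, 140, 310, 135]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

r = pearson(hurricane, poptart_sales)
print(f"correlation between hurricane warnings and sales: {r:.2f}")
# A strong positive r is enough to act on (stock the shelves);
# no causal explanation of shoppers' motives is required.
```

The script only confirms that the two series move together; acting on that correlation does not require knowing why customers behave this way.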

This is the way to cope with the era of big data: think about correlations in the data and abandon the single-minded pursuit of causality.

Thinking this way also helps us understand the world better, because intuitive causal reasoning sometimes leads us to false conclusions.

For example:

We learned from our parents to wear hats and gloves in cold weather or we would catch a cold, yet that is not actually how colds are caught. Or we eat at a restaurant, suddenly get a stomachache, and conclude that something was wrong with the food, when the real cause may simply be that we caught a chill.

Correlation offers a new perspective for analyzing problems and lets us hear what the data itself is saying. That said, causality should not be abandoned entirely; it should be examined on the foundation of established correlations.

This raises a new question: how do we present data clearly in the era of big data? That is the job of data visualization, which later tutorials will take up.
