What is the method for estimating big data performance?


Many people with little experience do not know how to estimate big data performance, so this article summarizes the problem and a simple estimation method. Hopefully, after reading it, you will be able to work out such estimates yourself.

Big data performance is an eternal topic. In actual work, however, we find that many people do not know how to make even the simplest performance estimate, and as a result they are often fooled by big data vendors :).

In fact, it is easy to calculate the time it takes to read the data off the hard disk. Apart from a few operations that fetch rows by index, most operations, such as grouped aggregation and conditional queries on non-indexed fields, must traverse the data as a whole. In any case, the total time cannot be less than the hard disk access time, so we can compute a theoretical lower bound.

For example, suppose a vendor claims that an OLAP aggregation over 10TB of data takes only 3 seconds. What does that actually imply?

A common 15000 RPM hard disk achieves less than 200MB/s under an operating system; an SSD is faster, but not by an order of magnitude, so figure roughly 3 seconds to read 1GB. At that rate, reading 10TB from a single drive takes more than 30,000 seconds. To complete the aggregation within 3 seconds, you would need about 10,000 drives! As a user, are you prepared for that?
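As a quick sanity check, here is a minimal Python sketch of that back-of-the-envelope arithmetic; the 3-seconds-per-GB throughput and the 3-second target are simply the assumed figures from the example above:

# Back-of-the-envelope estimate from the example above.
# Assumptions: ~3 s to read 1 GB from one drive, 3 s target time.
DATA_SIZE_GB = 10 * 1024          # 10TB expressed in GB
SECONDS_PER_GB = 3                # assumed single-drive read rate
TARGET_SECONDS = 3                # claimed aggregation time

single_drive_seconds = DATA_SIZE_GB * SECONDS_PER_GB
drives_needed = single_drive_seconds / TARGET_SECONDS

print(f"One drive: {single_drive_seconds:,.0f} s to scan 10TB")
print(f"Drives needed for a {TARGET_SECONDS} s scan: {drives_needed:,.0f}")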

Of course, drive speeds vary across environments and may be faster or slower than this, but in any case they can be estimated with this simple method. Don't know how fast your drive is? Then take a big file and time how long reading it takes; calculating from measured figures is more accurate. It should be emphasized that you cannot simply rely on the performance figures claimed by drive manufacturers: under a file system, real throughput is often less than half the ideal value, so measured values remain the most reliable.
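A minimal sketch of that measurement, assuming you have some multi-GB file on the drive you want to test (the path and buffer size below are placeholders; note the OS page cache can inflate the number if the file was read recently):

import time

PATH = "bigfile.bin"              # hypothetical multi-GB test file
CHUNK = 64 * 1024 * 1024          # 64MB read buffer

total = 0
start = time.time()
with open(PATH, "rb") as f:
    while True:
        data = f.read(CHUNK)      # sequential read, chunk by chunk
        if not data:
            break
        total += len(data)
elapsed = time.time() - start

print(f"Read {total / 2**30:.2f} GB in {elapsed:.1f} s "
      f"-> {total / 2**20 / elapsed:.0f} MB/s")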

In this way, we know the best performance achievable for a given big data problem under ideal conditions. Expectations better than this bound are impossible to meet on the hardware used for the estimate, no matter which software products or technical solutions you consider.

This estimate also points to the direction of optimization: reduce the amount of data stored and accessed.

Of course, reducing storage does not mean reducing the data itself; none of the data involved in the computation can be dropped, or the results would be wrong. Reducing storage relies on data compression. If 10TB of raw data compresses well, it may occupy only 1TB or less on disk, and then the 3-second aggregation no longer requires 10,000 drives.

When storage cannot be reduced any further, there are still software techniques that reduce the amount of data accessed; the most common is columnar storage. If a table of 100 columns occupies 10TB and the aggregation touches only three of those columns, only about 300GB has to be read, and again 10,000 drives are not needed to finish in 3 seconds.
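Putting the two factors together, a sketch of the adjusted estimate might look like the following; the 10x compression ratio and the 3-of-100 column fraction are just the assumed figures from the examples above:

# Estimate adjusted for compression and columnar access.
# Assumed figures: 3 s/GB per drive, 3 s target, 10x compression,
# 3 of 100 columns accessed.
DATA_SIZE_GB = 10 * 1024
SECONDS_PER_GB = 3
TARGET_SECONDS = 3

def drives_needed(compression_ratio=1.0, column_fraction=1.0):
    """Drives required to scan the data within the target time."""
    scanned_gb = DATA_SIZE_GB * column_fraction / compression_ratio
    return scanned_gb * SECONDS_PER_GB / TARGET_SECONDS

print(f"Raw scan:          {drives_needed():,.0f} drives")
print(f"10x compression:   {drives_needed(compression_ratio=10):,.0f} drives")
print(f"3 of 100 columns:  {drives_needed(column_fraction=3/100):,.0f} drives")
print(f"Both combined:     {drives_needed(10, 3/100):,.0f} drives")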

However, when big data vendors claim a figure like 10TB in 3 seconds, they generally do not state clearly how much the stored and accessed volumes were reduced by compression or columnar storage. This easily gives users the illusion that the technology can solve big data problems in general. In reality, some data does not compress very well, and columnar storage offers no advantage for operations that access most of the columns.

To estimate the performance limit more accurately, you should therefore also account for these reductions in storage and access: try compressing a sample of your data (with ordinary zip software) to see how small it gets, and check whether the operation really fetches only a few columns out of many.
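A minimal sketch of that compressibility check, using Python's standard zlib module (the same deflate codec ordinary zip software uses); the file path and sample size are placeholders:

import zlib

PATH = "data_sample.csv"          # hypothetical sample of your data
SAMPLE_BYTES = 256 * 1024 * 1024  # compress the first 256MB only

with open(PATH, "rb") as f:
    raw = f.read(SAMPLE_BYTES)

compressed = zlib.compress(raw, 6)   # default-ish compression level
ratio = len(raw) / len(compressed)
print(f"Sample: {len(raw) / 2**20:.0f} MB -> "
      f"{len(compressed) / 2**20:.0f} MB (ratio {ratio:.1f}x)")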

After reading the above, have you mastered the method of estimating big data performance? If you want to learn more skills or read more on related topics, you are welcome to follow the industry information channel. Thank you for reading!
