What are the methods of data sampling in Hive

Updated: 2025-02-21. From: SLTechnology News&Howtos (shulou), Internet Technology

This article explains the methods Hive provides for sampling data.

In large-scale data analysis and modeling, mining the full data set is time-consuming and ties up cluster resources, so in most cases only a small part of the data needs to be extracted for analysis and modeling. Hive provides data sampling (SAMPLING), which selects data according to certain rules. It currently supports block sampling, bucket sampling and random sampling, as described below:

1. Block sampling (TABLESAMPLE clause)

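1) tablesample (n percent) samples about n percent of the Hive table, measured by data size rather than by row count. Because the sampling unit is an HDFS block, the result can be somewhat larger than requested. For example, the statement below extracts about 10% of the original Hive table and saves it to a new Hive table: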

(Note: in the author's tests, a select statement using this form could not include a where condition and did not support subqueries. This can be worked around by creating an intermediate table from the sample first, or by using random sampling instead.)

CREATE TABLE xxx_new AS SELECT * FROM xxx TABLESAMPLE (10 PERCENT);

2) tablesample (nM) specifies the size of the sampled data in megabytes, for example tablesample (100M).

3) tablesample (n rows) specifies the number of rows to sample, where n is the number of rows taken for each map task (input split), so the total returned is roughly n times the number of mappers. The number of mappers can be read from the job output of a simple query on the Hive table (look for "number of mappers: X").
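The three forms above, plus the intermediate-table workaround from the note, can be sketched as follows. This is only an illustration: page_views, page_views_sample and the country column are hypothetical names, and behavior may differ between Hive versions and file formats.

```sql
-- page_views and its columns are hypothetical, used only for illustration.

-- 1) About 10 percent of the table, measured by data size, not row count.
SELECT * FROM page_views TABLESAMPLE (10 PERCENT) s;

-- 2) About 100 MB of input data.
SELECT * FROM page_views TABLESAMPLE (100M) s;

-- 3) 10 rows from each input split, so roughly 10 x (number of mappers) rows.
SELECT * FROM page_views TABLESAMPLE (10 ROWS) s;

-- Workaround from the note above: materialize the sample first,
-- then apply the filter to the intermediate table.
CREATE TABLE page_views_sample AS
SELECT * FROM page_views TABLESAMPLE (10 PERCENT) s;

SELECT * FROM page_views_sample WHERE country = 'US';
```

Since the percent and size forms work on whole blocks, a small table stored in a single block may come back in full.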

2. Bucket sampling

In Hive, bucketing hashes a column and places each row into one of a fixed number of buckets. For example, if the table table_1 is divided into 100 buckets by id, the bucket is chosen by hash(id) % 100: rows with hash(id) % 100 = 0 go into the first bucket, rows with hash(id) % 100 = 1 go into the second bucket, and so on. The key clause for creating a bucketed table is CLUSTERED BY (column) INTO n BUCKETS.
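A minimal sketch of such a table definition follows. The column list and the staging table table_1_staging are assumptions made for illustration. Only the CLUSTERED BY clause reflects the point above.

```sql
-- Hypothetical columns; only id matters for the bucketing.
CREATE TABLE table_1 (
  id   BIGINT,
  name STRING
)
CLUSTERED BY (id) INTO 100 BUCKETS;

-- Assumption: on Hive releases before 2.0 the setting below is needed so that
-- inserts honor the bucket definition; later releases enforce it by default.
-- SET hive.enforce.bucketing = true;

-- table_1_staging is a hypothetical unbucketed source table.
INSERT OVERWRITE TABLE table_1
SELECT id, name FROM table_1_staging;
```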

Bucket sampling syntax:

TABLESAMPLE (BUCKET x OUT OF y [ON colname])

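Here x is the number of the bucket to sample (bucket numbering starts at 1), y is the number of buckets the data is divided into, and colname is the column used to assign rows to buckets. colname can also be rand(), in which case rows are assigned to buckets at random.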

For example, randomly divide the rows of the table into 10 groups and extract the first one:

SELECT * FROM table_01 TABLESAMPLE (BUCKET 1 OUT OF 10 ON rand());
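For comparison, the sketch below samples on the clustering column itself. It assumes table_1 was created with CLUSTERED BY (id) INTO 100 BUCKETS as in the earlier example; s is just an alias.

```sql
-- Assumes table_1 is bucketed: CLUSTERED BY (id) INTO 100 BUCKETS.

-- Sampling on the clustering column: Hive can read only the matching bucket
-- instead of scanning the whole table.
SELECT * FROM table_1 TABLESAMPLE (BUCKET 1 OUT OF 100 ON id) s;

-- Sampling ON rand() works on any table, bucketed or not,
-- but every row has to be scanned to decide which group it falls into.
SELECT * FROM table_1 TABLESAMPLE (BUCKET 1 OUT OF 10 ON rand()) s;
```

The first form returns the same rows on every run, while the rand() form returns a different sample each time.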

3. Random sampling (rand() function)

1) Use rand() with DISTRIBUTE BY and SORT BY. The rand() function drives the random sampling, and the LIMIT keyword restricts the number of rows returned. DISTRIBUTE BY rand() spreads rows randomly across reducers, and SORT BY rand() shuffles the rows within each reducer, so the data is randomly distributed in both the map and reduce phases. For example:

SELECT * FROM table_name WHERE col = xxx DISTRIBUTE BY rand() SORT BY rand() LIMIT num;
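With concrete values filled in, the statement might look like the sketch below. The table orders, the column status and the literal values are hypothetical.

```sql
-- orders, status and the literals are hypothetical placeholders.
SELECT *
FROM orders
WHERE status = 'PAID'
DISTRIBUTE BY rand()   -- send each row to a randomly chosen reducer
SORT BY rand()         -- shuffle the rows within each reducer
LIMIT 1000;            -- keep 1000 rows of the shuffled output
```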

2) Use ORDER BY

For example:

SELECT * FROM table_name WHERE col = xxx ORDER BY rand() LIMIT num;

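In the author's tests on a table with tens of millions of rows, the ORDER BY method took about 30 seconds longer. This is expected, since ORDER BY produces a total ordering and therefore funnels all rows through a single reducer, while SORT BY only sorts within each reducer.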
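A common way to reduce the sorting cost of either query is to discard most rows with a rand() filter before the shuffle. This variation is not part of the author's tests. It is a sketch under assumptions: orders is a hypothetical table, and the 0.01 threshold has to be chosen so that enough rows survive the filter to satisfy the LIMIT.

```sql
-- orders is hypothetical; 0.01 keeps roughly 1 percent of the rows.
SELECT *
FROM orders
WHERE rand() <= 0.01   -- cheap map-side filter, applied before any sorting
DISTRIBUTE BY rand()
SORT BY rand()
LIMIT 1000;
```

If the filter keeps fewer rows than the LIMIT, the query simply returns fewer rows, so the threshold should leave a comfortable margin.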
