2025-02-21 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article walks through the main sampling techniques available in Hive: random sampling with the rand() function, bucket sampling, and block sampling with TABLESAMPLE. Hopefully it helps answer common questions about how and when to sample data in Hive.
Sampling
Sampling is a common technique in Hive, used mainly in the following scenarios:
In machine learning scenarios, the data warehouse acts as a data provider and supplies sample data.
When a computed result or metric looks abnormal, we often need to confirm whether the source data itself is abnormal.
When a SQL query has performance problems, we also use sampling to inspect the data before tuning the SQL.
In large-scale analysis and modeling tasks, mining the full data set is time-consuming and ties up cluster resources, so usually only a small subset of the data is extracted for analysis and modeling.
Random sampling (the rand() function)
In general, we combine a sorting clause with the rand() function to perform random sampling, and use the limit keyword to cap the number of rows returned. The difference between the variants below is which sorting clause is used.
rand() is used for the extraction because it returns a random double between 0 and 1.
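As a quick illustration, rand() can be inspected directly. A sketch; note that rand(seed) is Hive's seeded variant, which produces a reproducible sequence, and the seed value 7 here is arbitrary:

```sql
-- peek at rand(): each call returns a random double in [0, 1)
-- rand(7) uses a fixed seed, so its sequence repeats across runs
select rand() as r, rand(7) as seeded_r
from ods_user_bucket_log
limit 3;
```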
The examples below use a table from an earlier article, with about 4,603,089 records. The data preparation is not repeated here; see the earlier article on Hive's advanced storage formats for the test data.
create table ods_user_bucket_log (
    id int, name string, city string, phone string, acctime string)
clustered by (id) into 5 buckets
row format delimited fields terminated by '\t'
stored as textfile;

insert overwrite table ods_user_bucket_log
select * from ods_user_log;
order by rand()
order by runs through a single reducer, so it is time-consuming; we explained why in the earlier article on Hive's common sorting clauses. Because order by is a global sort, it achieves a genuinely random sample.
select * from ods_user_bucket_log order by rand() limit 10;
sort by rand()
sort by sorts within each reducer but does not guarantee a global order. On its own it is not truly random, because the randomness applies only within each partition; the sample is only random overall if the data entering each partition is itself randomly distributed.
select * from ods_user_bucket_log sort by rand() limit 10;
distribute by rand() sort by rand()
Applying rand() in both the distribute by and sort by clauses guarantees that the data is randomly distributed in both the map and reduce phases, so at this point the sample is truly random. As mentioned earlier, cluster by is essentially equivalent to distribute by plus sort by on the same expression.
select * from ods_user_bucket_log distribute by rand() sort by rand() limit 10;
cluster by rand()
cluster by combines the functions of distribute by and sort by. distribute by rand() sort by rand() evaluates rand() twice per row, while cluster by rand() evaluates it only once, so it is faster than the previous method.
select * from ods_user_bucket_log cluster by rand() limit 10;
The TABLESAMPLE() sampling function
Bucket sampling (sampling bucketed tables)
In Hive, bucketing hashes a chosen column and assigns each row to a bucket. For example, if a table is divided into 100 buckets on id, the algorithm is hash(id) % 100: rows with hash(id) % 100 == 0 go into the first bucket, rows with hash(id) % 100 == 1 go into the second bucket, and so on.
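The bucket assignment described above can be inspected directly. A sketch; pmod is used instead of % because hash() can return negative values, and the 100-way split here is just the example from the text, not the table's actual bucketing:

```sql
-- sketch: which of 100 hypothetical buckets each id would land in
select id, pmod(hash(id), 100) as bucket_no
from ods_user_bucket_log
limit 10;
```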
Bucket sampling syntax:
TABLESAMPLE(BUCKET x OUT OF y [ON colname])
Here x is the number of the bucket to sample (bucket numbers start at 1), colname is the column to sample on (i.e. the column the rows are bucketed by), and y is the number of buckets. So the expression means: divide the data into y buckets on colname and extract bucket x.
SELECT * FROM ods_user_bucket_log TABLESAMPLE(BUCKET 1 OUT OF 100000 ON rand());
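Since ods_user_bucket_log was created with 5 buckets on id, it is worth noting what happens when colname matches the table's actual bucketing column: Hive can then read only the matching bucket files instead of scanning the whole table (choosing y as a factor or multiple of the table's bucket count works best). A sketch:

```sql
-- pull roughly 1/5 of the data by reading only bucket 1 of 5;
-- because the table is bucketed by id, Hive can prune the other bucket files
SELECT * FROM ods_user_bucket_log TABLESAMPLE(BUCKET 1 OUT OF 5 ON id);
```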
Block sampling
Block sampling has been available since Hive 0.8. TABLESAMPLE can extract a specified number of rows, a percentage of the data, or a given data size.
SELECT * FROM ods_user_data TABLESAMPLE(1000 ROWS);
SELECT * FROM ods_user_data TABLESAMPLE(20 PERCENT);
SELECT * FROM ods_user_data TABLESAMPLE(1M);
Proportional sampling: TABLESAMPLE(n PERCENT)
This tells Hive to return at least n percent of the data.
SELECT * FROM ods_user_bucket_log TABLESAMPLE(0.0001 PERCENT);
Extracting data of a specific size: TABLESAMPLE(100M)
SELECT * FROM ods_user_bucket_log TABLESAMPLE(1M);
Note that the size must be an integer number of megabytes (M); when I first tried a non-integer value, the query failed and I thought I had made some other mistake.
Extracting a specific number of rows: TABLESAMPLE(10 ROWS)
SELECT * FROM ods_user_bucket_log TABLESAMPLE(10 ROWS);
Expansion
How to implement proportional sampling with random sampling
We saw above that TABLESAMPLE supports proportional sampling, and that random sampling can extract a fixed number of records with the help of limit. Random sampling can also be adapted to proportional sampling: because the return value of rand() is uniformly distributed between 0 and 1, filtering on that value yields a proportional sample.
select * from (select *, rand() as radix from ods_user_bucket_log) tmp where radix >= 0.0 and radix <= 0.2; -- the upper bound is the sampling fraction, e.g. 0.2 for a ~20% sample
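The same idea combines naturally with limit to cap the sample size even on a very large table. A sketch; the 10% fraction and the 1000-row cap below are arbitrary illustrative values:

```sql
-- ~10% proportional sample, but never more than 1000 rows
select *
from (select *, rand() as radix from ods_user_bucket_log) tmp
where radix <= 0.10
limit 1000;
```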
© 2024 shulou.com SLNews company. All rights reserved.