What is the median in Hive

Updated 2025-04-04. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to compute the median in Hive. The method introduced here is simple, fast and practical.

Many languages ship with a median function. In Python, for example, computing a median takes only a few lines.

Computing the median in Python:

import numpy as np

nums = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6]

# Mean
print(np.mean(nums))

# Median
print(np.median(nums))

Hive does not provide a median function directly, but it officially provides two UDAFs: percentile and percentile_approx.

Here is what the official documentation says:

DOUBLE percentile(BIGINT col, p): Returns the exact p-th percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

array<double> percentile(BIGINT col, array(p1 [, p2]...)): Returns the exact percentiles p1, p2, ... of a column in the group (does not work with floating point types). p_i must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

DOUBLE percentile_approx(DOUBLE col, p [, B]): Returns an approximate p-th percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.

array<double> percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B]): Same as above, but accepts and returns an array of percentile values instead of a single one.

NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
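To make the relationship between a percentile and the median concrete, here is a minimal Python sketch. It assumes linear interpolation between the two closest ranks, which is a common convention; the function name exact_percentile is hypothetical, and this is not a copy of Hive's implementation. As in the Hive signature, p is a fraction between 0 and 1, so p = 0.5 gives the median.

```python
import math
import statistics


def exact_percentile(values, p):
    """Exact p-th percentile (0 <= p <= 1) using linear interpolation.

    Illustrative only: the interpolation rule is an assumption,
    not taken from the Hive source code.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted data.
    pos = p * (len(ordered) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    # Interpolate between the two neighbouring values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


values = [1, 2, 3, 4, 5, 6]

# A single percentile: p = 0.5 is the median.
print(exact_percentile(values, 0.5), statistics.median(values))

# Several percentiles at once, mirroring the array(p1, p2, ...) form.
print([exact_percentile(values, p) for p in (0.25, 0.5, 0.75)])
```

The array form in the Hive signature plays the same role as the list comprehension here: several percentiles are requested in one call instead of one at a time.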

In other words, percentile (with p = 0.5) computes the true median but requires integer input, while percentile_approx accepts floating-point input but returns only an approximate median. In checks against a large amount of data, the approximate median sometimes differed considerably from the true median.
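The following toy example suggests why a summary of limited size can drift away from the true median. It is emphatically not the algorithm behind percentile_approx: it simply groups values into a fixed number of equal-width buckets and reports the midpoint of the bucket that contains the middle rank. The function name bucketed_median and the sample data are made up for illustration; the bucket count loosely plays the role of an accuracy parameter.

```python
import statistics


def bucketed_median(values, buckets):
    """Approximate median from an equal-width histogram.

    A toy approximation for illustration only; it is NOT how Hive's
    percentile_approx works internally.
    """
    low, high = min(values), max(values)
    width = (high - low) / buckets
    if width == 0:
        return low  # all values are identical
    counts = [0] * buckets
    for v in values:
        # Clamp so that the maximum value lands in the last bucket.
        idx = min(int((v - low) / width), buckets - 1)
        counts[idx] += 1
    target = len(values) / 2
    running = 0
    for idx, count in enumerate(counts):
        running += count
        if running >= target:
            # Report the midpoint of the bucket holding the middle rank.
            return low + (idx + 0.5) * width


values = [1, 2, 3, 4, 100]  # skewed sample data

print("exact      :", statistics.median(values))
print("2 buckets  :", bucketed_median(values, 2))
print("1000 buckets:", bucketed_median(values, 1000))
```

With only two buckets the outlier stretches the bucket width and the estimate lands far from the true median of 3; with many buckets the estimate moves close to it, at the cost of keeping more counters in memory.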

How do you find the median of data with decimals?

Convert the decimals to integers first (for example, multiply by 10000), compute the median with percentile, and then divide the result by the same factor.
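The scaling trick can be sketched in Python as follows. The factor 10000 comes from the example above; the function name median_via_scaled_ints is hypothetical, and statistics.median stands in for an exact integer median. Rounding before the integer conversion is an added precaution, because a product such as 3.3 * 10000 may come out slightly below the intended integer in floating point.

```python
import statistics

SCALE = 10000  # keeps four decimal places; anything beyond that is lost


def median_via_scaled_ints(values, scale=SCALE):
    """Median of decimal values computed on scaled integers."""
    # Round rather than truncate so floating-point noise cannot
    # push a value down to the next lower integer.
    scaled = [round(v * scale) for v in values]
    # Exact median on integers, then undo the scaling.
    return statistics.median(scaled) / scale


nums = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6]
print(median_via_scaled_ints(nums))
```

In Hive the same idea might look like percentile(CAST(ROUND(col * 10000) AS BIGINT), 0.5) / 10000, where col is a placeholder column name. The scale factor should be chosen to match the number of decimal places the data actually carries.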

Spark SQL finds the median the same way, so give it a try!
