This article shows how to get started with hive+python data analysis through a small, concrete example.
Why use hive+python to analyze data
By analogy:
Before databases existed, people manipulated the file system programmatically, which is the equivalent of writing mapreduce to analyze data.
Later, with databases, almost no one operated on the file system directly (unless there was a special need); instead they used sql plus some extra data processing. That is the equivalent of hive + python.
Hive + python covers most requirements. Only when your data is unstructured do you have to go back to the old days and write mapreduce yourself.
Why not use hive+java, hive+c, hive+...
Because python is genuinely easy to use: it is a scripting language with no compilation step, and it has powerful machine learning libraries well suited to scientific computing (which is exactly what data analysis is!).
Use hive+python to analyze data
The division of labor between hive and python: hive sql serves as python's data source, python's output serves as the map output, and hive's aggregate functions serve as the reduce step.
Let's work through an example: count how much of each kind of food each person eats on a given date.
Create the user_foods (user food) table:
hive> create table user_foods (user_id string, food_type string, datetime string)
      partitioned by (dt string)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;
Here partitioned by (dt string) partitions the table by date, and fields are separated by \t.
Per the business requirement, statistics are computed on a daily basis, so to reduce the amount of data scanned per analysis, the hive table above is partitioned by dt (date).
After the hive table is created, a folder with the same name as the table is created under the HDFS /hive/ directory.
Import data: first create the partition:
hive> ALTER TABLE user_foods ADD PARTITION (dt='2014-06-07');
After the partition is created, a dt='2014-06-07' directory is added under the HDFS directory /hive/user_foods/.
Create test data
Create a file, such as data.txt, and add test data
user_1  food1   2014-06-07 09:00
user_1  food1   2014-06-07 09:02
user_1  food2   2014-06-07 09:00
user_2  food2   2014-06-07 09:00
user_2  food23  2014-06-07 09:00
Import the data:
hive> LOAD DATA LOCAL INPATH '/Users/life/Desktop/data.txt' OVERWRITE INTO TABLE user_foods PARTITION (dt='2014-06-07');
After the import is successful, use select * from user_foods to check it.
Or run:
hive> select * from user_foods where user_id='user_1';
This query will launch a mapreduce job.
Use only hive to analyze
"counting the amount of food each person eats under a certain date" is too simple to be achieved without python:
hive> select user_id, food_type, count(*) from user_foods where dt='2014-06-07' group by user_id, food_type;
Results:
Use python in combination
If you need to clean the data or do further processing, you need a custom map step, and that can be implemented in python.
For example, suppose food2 and food23 should be treated as the same type of food; python is used to clean the data. The python script (m.py) is as follows:
#!/usr/bin/env python
# encoding=utf-8
import sys

if __name__ == "__main__":
    # parse each row of data
    for line in sys.stdin:
        # skip blank lines
        if not line or not line.strip():
            continue
        # use try here so that one malformed row does not abort the whole job
        try:
            userId, foodType, dt = line.strip().split("\t")
        except:
            continue
        # clean the data: skip rows with empty fields
        if userId == '' or foodType == '':
            continue
        # clean the data: treat food23 as food2
        if foodType == "food23":
            foodType = "food2"
        # output, fields separated by \t -- this is the map output
        print userId + "\t" + foodType
Then analyze with hql combined with the python script. There are two steps:
1. Add the python script, which is equivalent to adding the script to the distributed cache.
2. Execute the query, using transform and using.
hive> add file /Users/life/Desktop/m.py;
hive> select user_id, food_type, count(*)
      from (
        select transform (user_id, food_type, datetime)
        using 'python m.py'
        as (user_id, food_type)
        from user_foods
        where dt='2014-06-07'
      ) tmp
      group by user_id, food_type;
Results:
Recommendations for debugging python scripts
1. First of all, make sure the script has no syntax errors; you can run python m.py to verify this.
2. Make sure the code produces no output other than the intended map output; stray prints will be parsed as result rows (see the stderr sketch after these lists).
3. Test the script with the test data, for example:
$> cat data.txt | python m.py
user_1  food1
user_1  food1
user_1  food2
user_2  food2
user_2  food2
Once 1, 2 and 3 all pass, if hive+python still fails, the possible causes are:
1. The python script does not handle the data robustly; some boundary conditions are not considered, and python raises an exception.
2. Others: collect them yourself as you run into them.
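One practical trick for point 2 in the first list (a sketch, not from the original article): with hive's transform, whatever the script writes to stdout is parsed as output rows, while stderr is captured in the task logs, so diagnostics belong on stderr. For example:
#!/usr/bin/env python
# encoding=utf-8
# Sketch only: keep debugging output on stderr so it never pollutes the map output.
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    fields = line.split("\t")
    if len(fields) != 3:
        # goes to the task logs, not into the query result
        sys.stderr.write("bad row skipped: %r\n" % line)
        continue
    user_id, food_type, dt = fields
    # real map output still goes to stdout
    sys.stdout.write(user_id + "\t" + food_type + "\n")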
Other
The python script in the example above acts as the map. Of course, you can also create a reduce.py to aggregate the map output yourself instead of using hive's aggregate functions.
You would only do this when hive's built-in aggregation no longer meets your needs; a rough sketch of such a reduce.py follows.
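As an illustration only (this reduce.py is not from the original article, and it assumes the map output arrives grouped by key, e.g. via distribute by / sort by on user_id, food_type in the inner query), a counting reducer could look like this:
#!/usr/bin/env python
# encoding=utf-8
# Hypothetical reduce.py: counts occurrences of each (user_id, food_type) pair.
# Assumes rows with the same key arrive next to each other on stdin.
import sys

def emit(key, n):
    # output one aggregated row, fields separated by \t
    sys.stdout.write("%s\t%s\t%d\n" % (key[0], key[1], n))

current_key = None
count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        user_id, food_type = line.split("\t")
    except ValueError:
        continue
    key = (user_id, food_type)
    if key == current_key:
        count += 1
    else:
        if current_key is not None:
            emit(current_key, count)
        current_key = key
        count = 1

# emit the last key
if current_key is not None:
    emit(current_key, count)
Locally you can simulate the whole pipeline with something like $> cat data.txt | python m.py | sort | python reduce.py.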
That covers how to get started with hive+python data analysis.