This article shows how a MapReduce (MR) job can use HCatalog to read and write Hive tables. The approach is very practical, and hopefully you will get something out of it.
In many enterprises, the storage format of data in Hive changes over time, for example from TSV or CSV to Parquet or ORCFile for better performance. When that happens, any MR job that reads the Hive table data directly has to be rewritten and redeployed, which is painful. HCatalog solves this problem: with it, the job no longer has to care about the storage format of the data in Hive.
This article is mainly about reading and writing Hive tables from MapReduce through HCatalog. HCatalog exposes Hive metadata to other Hadoop tools such as Pig, MapReduce and Hive itself. An HCatalog table gives users a relational view of data in HDFS and ensures they do not have to care where the data is stored or in what format, whether RCFile, plain text files or sequence files. HCatalog also provides a notification service so that workflow tools such as Oozie can be told when new data arrives in the warehouse. For MapReduce, HCatalog provides HCatInputFormat / HCatOutputFormat to read and write data in Hive's warehouse, and it lets a job read only the partitions and columns it actually needs. Records come back in a convenient list-like format that does not have to be parsed by hand. Let's walk through a simple example: in the mapper class, we fetch the table schema and use it to read the columns and values we need.
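For reference, the example classes below assume the usual Hadoop and HCatalog imports. The exact package prefix is an assumption that depends on your HCatalog release: standalone HCatalog ships these classes under org.apache.hcatalog, while from Hive 0.12 onward the same classes live under org.apache.hive.hcatalog. A plausible set for the older layout is:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// HCatalog classes; on Hive 0.12+ replace org.apache.hcatalog with org.apache.hive.hcatalog
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatFieldSchema;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatBaseInputFormat;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.OutputJobInfo;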
Here is the map class:

public class onTimeMapper extends Mapper<WritableComparable, HCatRecord, IntPair, IntWritable> {

    @Override
    protected void map(WritableComparable key, HCatRecord value, Context context)
            throws IOException, InterruptedException {

        // Get the schema of the input table
        HCatSchema schema = HCatBaseInputFormat.getTableSchema(context);

        // Read the columns we need by name, using the schema
        Integer year = new Integer(value.getString("year", schema));
        Integer month = new Integer(value.getString("month", schema));
        Integer dayOfMonth = value.getInteger("dayofmonth", schema);

        context.write(new IntPair(year, month), new IntWritable(dayOfMonth));
    }
}
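The IntPair key used by the mapper and reducer is not part of Hadoop or HCatalog, and the original example does not show it. The sketch below is a hypothetical minimal implementation, assuming only the getFirstInt() and getSecondInt() accessors that the code above relies on:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical (year, month) composite key for the example above.
public class IntPair implements WritableComparable<IntPair> {
    private int first;
    private int second;

    public IntPair() {}                        // required by Hadoop serialization

    public IntPair(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public int getFirstInt()  { return first; }
    public int getSecondInt() { return second; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    @Override
    public int compareTo(IntPair o) {
        int cmp = Integer.compare(first, o.first);
        return cmp != 0 ? cmp : Integer.compare(second, o.second);
    }

    @Override
    public int hashCode() { return 31 * first + second; }   // used by the default partitioner

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof IntPair)) return false;
        IntPair other = (IntPair) obj;
        return first == other.first && second == other.second;
    }
}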
In the reduce class, a schema is created for the data that will be written to the Hive table:

public class onTimeReducer extends Reducer<IntPair, IntWritable, WritableComparable, HCatRecord> {

    public void reduce(IntPair key, Iterable<IntWritable> value, Context context)
            throws IOException, InterruptedException {

        // Records counter for a particular year-month pair
        int count = 0;
        for (IntWritable s : value) {
            count++;
        }

        // Define the output record schema
        List<HCatFieldSchema> columns = new ArrayList<HCatFieldSchema>(3);
        columns.add(new HCatFieldSchema("year", HCatFieldSchema.Type.INT, ""));
        columns.add(new HCatFieldSchema("month", HCatFieldSchema.Type.INT, ""));
        columns.add(new HCatFieldSchema("flightCount", HCatFieldSchema.Type.INT, ""));
        HCatSchema schema = new HCatSchema(columns);

        // Build the output record and emit it
        HCatRecord record = new DefaultHCatRecord(3);
        record.setInteger("year", schema, key.getFirstInt());
        record.set("month", schema, key.getSecondInt());
        record.set("flightCount", schema, count);
        context.write(null, record);
    }
}

Finally, create the driver class and specify the input and output schemas and table information:

public class onTimeDriver extends Configured implements Tool {

    private static final Log log = LogFactory.getLog(onTimeDriver.class);
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "OnTimeCount");
        job.setJarByClass(onTimeDriver.class);
        job.setMapperClass(onTimeMapper.class);
        job.setReducerClass(onTimeReducer.class);

        // Read from the Hive table airline.ontimeperf through HCatalog
        HCatInputFormat.setInput(job, "airline", "ontimeperf");
        job.setInputFormatClass(HCatInputFormat.class);
        job.setMapOutputKeyClass(IntPair.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Write to the Hive table airline.flight_count through HCatalog,
        // reusing the schema of the existing output table
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DefaultHCatRecord.class);
        job.setOutputFormatClass(HCatOutputFormat.class);
        HCatOutputFormat.setOutput(job, OutputJobInfo.create("airline", "flight_count", null));
        HCatSchema s = HCatOutputFormat.getTableSchema(job);
        HCatOutputFormat.setSchema(job, s);

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new onTimeDriver(), args);
        System.exit(exitCode);
    }
}
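The driver above reads the whole ontimeperf table. As mentioned earlier, HCatalog can also restrict a job to just the partitions it needs. A minimal sketch, assuming ontimeperf is partitioned by a hypothetical year column and using the InputJobInfo.create(dbName, tableName, filter) form found in most HCatalog releases (newer versions also offer a setInput overload that takes the filter string directly):

        // Sketch only: replaces the HCatInputFormat.setInput(job, "airline", "ontimeperf")
        // call in run() so the job reads a single partition instead of the full table.
        // The partition column ("year") and its value are assumptions for illustration;
        // also requires: import org.apache.hcatalog.mapreduce.InputJobInfo;
        HCatInputFormat.setInput(job,
                InputJobInfo.create("airline", "ontimeperf", "year=\"2008\""));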
Of course, you should create the output table in Hive before running the code above:

create table airline.flight_count (Year INT, Month INT, flightCount INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

A common source of errors is that $HIVE_HOME is not set. That is how MapReduce uses HCatalog to read and write Hive tables; hopefully some of these points will be useful in your daily work.