How to Write and Read RCFile with the MapReduce API


This article mainly introduces how to write and read RCFile with the MapReduce API. It is fairly detailed and has some reference value; interested readers are encouraged to read it!

RCFile is a storage format developed by Facebook that combines row and column layouts, offering a high compression ratio and efficient reads. In Hive you can usually convert a Text table with a plain insert-select, but sometimes you want to read and write RCFile directly from MapReduce. The Maven dependencies used are:

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.5.0-cdh6.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-serde</artifactId>
        <version>0.13.1-cdh6.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive.hcatalog</groupId>
        <artifactId>hive-hcatalog-core</artifactId>
        <version>0.13.1-cdh6.2.1</version>
    </dependency>
</dependencies>

Read an RCFile and use MapReduce to generate a Text format file

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.rcfile.RCFileMapReduceInputFormat;

import java.io.IOException;

public class RcFileReaderJob {

    // Each RCFile record arrives as a BytesRefArrayWritable holding one BytesRefWritable per column.
    static class RcFileMapper extends Mapper<Object, BytesRefArrayWritable, Text, NullWritable> {

        @Override
        protected void map(Object key, BytesRefArrayWritable value, Context context)
                throws IOException, InterruptedException {
            Text txt = new Text();
            StringBuffer sb = new StringBuffer();
            // Join the columns of one row into a tab-separated line.
            for (int i = 0; i < value.size(); i++) {
                BytesRefWritable v = value.get(i);
                txt.set(v.getData(), v.getStart(), v.getLength());
                if (i == value.size() - 1) {
                    sb.append(txt.toString());
                } else {
                    sb.append(txt.toString() + "\t");
                }
            }
            context.write(new Text(sb.toString()), NullWritable.get());
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            super.cleanup(context);
        }

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            super.setup(context);
        }
    }

    static class RcFileReduce extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static boolean runLoadMapReduce(Configuration conf, Path input, Path output)
            throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(conf);
        job.setJarByClass(RcFileReaderJob.class);
        job.setJobName("RcFileReaderJob");
        job.setNumReduceTasks(1);
        job.setMapperClass(RcFileMapper.class);
        job.setReducerClass(RcFileReduce.class);
        // RCFileMapReduceInputFormat (from hive-hcatalog-core) splits and decodes RCFile data for the mapper.
        job.setInputFormatClass(RCFileMapReduceInputFormat.class);
        // MultipleInputs.addInputPath(job, input, RCFileInputFormat.class);
        RCFileMapReduceInputFormat.addInputPath(job, input);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, output);
        return job.waitForCompletion(true);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: rcfile <in> <out>");
            System.exit(2);
        }
        RcFileReaderJob.runLoadMapReduce(conf, new Path(args[0]), new Path(args[1]));
    }
}

Read a Text file and use MapReduce to generate an RCFile format file

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hive.hcatalog.rcfile.RCFileMapReduceOutputFormat;

import java.io.IOException;

public class RcFileWriterJob extends Configured implements Tool {

    // Turns tab-separated text rows into BytesRefArrayWritable rows for RCFile output.
    public static class Map extends Mapper<Object, Text, NullWritable, BytesRefArrayWritable> {

        private byte[] fieldData;
        private int numCols;
        private BytesRefArrayWritable bytes;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Column count of the target RCFile, taken from the job configuration.
            numCols = context.getConfiguration().getInt("hive.io.rcfile.column.number.conf", 0);
            bytes = new BytesRefArrayWritable(numCols);
        }

        public void map(Object key, Text line, Context context)
                throws IOException, InterruptedException {
            bytes.clear();
            String[] cols = line.toString().split("\t", -1);
            System.out.println("SIZE: " + cols.length);
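            // What follows is a minimal sketch of the rest of the job, based on the usual
            // RCFileMapReduceOutputFormat pattern rather than on code shown above: each field
            // is wrapped in a BytesRefWritable and the whole row is emitted as a
            // BytesRefArrayWritable; the driver is a map-only job whose column count
            // (here an illustrative 3) must match the real data.
            for (int i = 0; i < cols.length; i++) {
                fieldData = cols[i].getBytes("UTF-8");
                bytes.set(i, new BytesRefWritable(fieldData, 0, fieldData.length));
            }
            context.write(NullWritable.get(), bytes);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: rcfilewriter <in> <out>");
            System.exit(2);
        }

        // setColumnNumber() stores the column count under "hive.io.rcfile.column.number.conf",
        // the same key the mapper's setup() reads.
        int columnNumber = 3; // illustrative value; set it to the real number of columns
        RCFileMapReduceOutputFormat.setColumnNumber(conf, columnNumber);

        Job job = Job.getInstance(conf, "RcFileWriterJob");
        job.setJarByClass(RcFileWriterJob.class);
        job.setMapperClass(Map.class);
        job.setNumReduceTasks(0); // map-only job: the mappers write the RCFile output directly
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(BytesRefArrayWritable.class);
        job.setOutputFormatClass(RCFileMapReduceOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        RCFileMapReduceOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new RcFileWriterJob(), args));
    }
}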
