When writing data to HBase, the common approaches are the HBase client API and batch imports with MapReduce. With these approaches, the rough flow of writing a single record into the HBase database is shown in the figure.
The data is first written to the write-ahead log (WAL), then to the MemStore, and finally flushed to an HFile. Writing this way does not lose data and keeps the data correctly ordered, but when a large volume of writes arrives, write speed is hard to guarantee. This is why we introduce a higher-performance write method: BulkLoad.
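For comparison, before moving on to BulkLoad, here is a minimal sketch of the client-API write path described above. It reuses the BulkLoadDemo table and info column family created later in this article; the row key and values are taken from the sample data purely for illustration.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SimplePutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("BulkLoadDemo"))) {
            // A normal Put goes through the RegionServer: WAL first, then MemStore,
            // and the data only reaches an HFile when the MemStore is flushed.
            Put put = new Put(Bytes.toBytes("44979"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("user_id"), Bytes.toBytes("100640791"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("item_id"), Bytes.toBytes("134060896"));
            table.put(put);
        }
    }
}
```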
Bulk-writing data with BulkLoad is divided into two main parts:
1. Use a MapReduce job you write yourself, with HFileOutputFormat2 as the output format, to write HFiles to an HDFS directory. Because the data written to HBase must be sorted by key, HFileOutputFormat2.configureIncrementalLoad() performs the required job configuration.
2. Load the HFiles from HDFS into the HBase table. The approximate process is shown in Figure 1.
Example code. First, the Maven pom dependencies:
```
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.4.0</version>
</dependency>
```

The mapper reads each input line, splits it into fields, and emits the row key together with a Put:

```
package com.yangshou;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Read each line of the file, using the serial number as the row key
        String line = value.toString();
        // Split the line; the fields are: serial number, user id, item id, user behavior, item category, time, address
        String[] str = line.split(" ");
        String id = str[0];
        String user_id = str[1];
        String item_id = str[2];
        String behavior = str[3];
        String item_type = str[4];
        String time = str[5];
        String address = "156";
        // Build the row key and the Put
        ImmutableBytesWritable rowkey = new ImmutableBytesWritable(id.getBytes());
        Put put = new Put(id.getBytes());
        put.addColumn("info".getBytes(), "user_id".getBytes(), user_id.getBytes());
        put.addColumn("info".getBytes(), "item_id".getBytes(), item_id.getBytes());
        put.addColumn("info".getBytes(), "behavior".getBytes(), behavior.getBytes());
        put.addColumn("info".getBytes(), "item_type".getBytes(), item_type.getBytes());
        put.addColumn("info".getBytes(), "time".getBytes(), time.getBytes());
        put.addColumn("info".getBytes(), "address".getBytes(), address.getBytes());
        // Write the record out
        context.write(rowkey, put);
    }
}
```

The driver configures the job, writes the HFiles, and then loads them into the table:

```
package com.yangshou;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        // Get the HBase configuration
        Configuration conf = HBaseConfiguration.create();
        Connection conn = ConnectionFactory.createConnection(conf);
        Table table = conn.getTable(TableName.valueOf("BulkLoadDemo"));
        Admin admin = conn.getAdmin();

        // Set up the job
        Job job = Job.getInstance(conf, "BulkLoad");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(BulkLoadMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        // Set the input/output formats and file paths
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(HFileOutputFormat2.class);
        FileInputFormat.setInputPaths(job, new Path("hdfs://hadoopalone:9000/tmp/000000_0"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://hadoopalone:9000/demo1"));

        // Configure the job to write HFiles sorted for the target table
        HFileOutputFormat2.configureIncrementalLoad(job, table, conn.getRegionLocator(TableName.valueOf("BulkLoadDemo")));

        // When the job finishes, load the generated HFiles into the HBase table
        if (job.waitForCompletion(true)) {
            LoadIncrementalHFiles load = new LoadIncrementalHFiles(conf);
            load.doBulkLoad(new Path("hdfs://hadoopalone:9000/demo1"), admin, table, conn.getRegionLocator(TableName.valueOf("BulkLoadDemo")));
        }
    }
}
```
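Once the HFiles have been written to HDFS, they can also be loaded in a separate step from the command line with the LoadIncrementalHFiles (completebulkload) tool that ships with HBase. This step is not in the original code above; the path and table name simply mirror the example:

```
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://hadoopalone:9000/demo1 BulkLoadDemo
```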
Sample data (one record per line):

```
44979 100640791 134060896 1 5271 2014-12-09 Tianjin City
44980 100640791 96243605 1 13729 2014-12-02 Xinjiang
```
Create the table in the HBase shell:

```
create 'BulkLoadDemo','info'
```
Package the project and run it:

```
hadoop jar BulkLoadDemo-1.0-SNAPSHOT.jar com.yangshou.BulkLoadDriver
```

Note: before running hadoop jar, add the HBase library jars to the Hadoop classpath:

```
export HADOOP_CLASSPATH=$HBASE_HOME/lib/*
```
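To confirm the load worked, one quick check (not part of the original article) is to scan a few rows in the HBase shell:

```
scan 'BulkLoadDemo', {LIMIT => 2}
```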