
Using BulkLoad to import data into an HBase table


1. What are the problems with the traditional way of inserting into HBase tables?

Let's first take a look at the writing process of HBase:

When MapReduce writes to HBase it usually uses TableOutputFormat, generating Put objects directly in map/reduce tasks and writing them to HBase. This approach is inefficient when writing large amounts of data, because HBase writes block and the cluster frequently performs heavy I/O operations such as flush, split and compact. It also affects the stability of the HBase nodes: GC pauses get longer, responses slow down, nodes can time out and drop out of the cluster, and a series of chain reactions can follow.

On the other hand, HBase supports a BulkLoad write path. It exploits the fact that HBase stores its data on HDFS in a specific format: MapReduce is used to generate files directly in the persistent HFile format, and these files are then loaded into the appropriate locations, completing the bulk import of huge amounts of data. Because the heavy work is done by MapReduce, this approach is efficient and convenient, does not occupy region resources or add load to the region servers, can greatly improve write throughput for large data volumes, and reduces the write pressure on the HBase nodes.
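For contrast, below is a minimal sketch of the traditional TableOutputFormat write path described above, using the same older org.apache.hadoop.hbase.mapreduce API as the listing in section 2; the driver and mapper class names, table name and input path are placeholders, not part of the original example.

// Traditional write path: a map-only job emits Put objects and TableOutputFormat
// sends every Put through the region servers' normal write path (WAL, memstore, flush).
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "xx,xx,xx");
Job job = new Job(conf, "putWrite");
job.setJarByClass(PutWriteDriver.class);            // hypothetical driver class
job.setMapperClass(PutWriteMapper.class);           // hypothetical Mapper<LongWritable, Text, ImmutableBytesWritable, Put>
job.setNumReduceTasks(0);                           // map-only: each Put is written to HBase as it is produced
TableMapReduceUtil.initTableReducerJob("bulkLoad", null, job);   // configures TableOutputFormat for the target table
FileInputFormat.setInputPaths(job, new Path("/xx/xx"));
System.exit(job.waitForCompletion(true) ? 0 : 1);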

Replacing the earlier approach of writing through TableOutputFormat directly with an MR job that generates HFiles, followed by a BulkLoad into HBase, has the following benefits:

(1) It removes the insert pressure on the HBase cluster.

(2) The job runs faster and its execution time is reduced.

2. BulkLoad practice

The principle of BulkLoad was introduced above. The concrete process is to first use MapReduce to generate HFile output and store it on HDFS, and then use loader.doBulkLoad(hfileOutputPath, table) to load it into HBase. The code is as follows:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoad {

    private static final String JOBNAME = "BulkLoad";
    private static final String TABLENAME = "bulkLoad";
    private static final String PATH_IN = "/xx/xx";    // input path
    private static final String PATH_OUT = "/xx/xx";   // output path
    private static final String SEPARATOR = "\\|";

    private static final byte[] ColumnFamily = "f".getBytes();        // column family
    private static final byte[] QUALIFIER_TAG1 = "tag1".getBytes();   // column qualifiers
    private static final byte[] QUALIFIER_TAG2 = "tag2".getBytes();
    private static final byte[] QUALIFIER_TAG3 = "tag3".getBytes();
    private static final byte[] QUALIFIER_TAG4 = "tag4".getBytes();
    private static final byte[] QUALIFIER_TAG5 = "tag5".getBytes();
    private static final byte[] QUALIFIER_TAG6 = "tag6".getBytes();
    private static final byte[] QUALIFIER_TAG7 = "tag7".getBytes();
    private static final byte[] QUALIFIER_TAG8 = "tag8".getBytes();
    private static final byte[] QUALIFIER_TAG9 = "tag9".getBytes();
    private static final byte[] QUALIFIER_TAG10 = "tag10".getBytes();

    public static class Map extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] strArr = value.toString().split(SEPARATOR);
            String row = strArr[0];
            Put put = new Put(Bytes.toBytes(row));                    // rowkey
            put.add(ColumnFamily, QUALIFIER_TAG1, Bytes.toBytes(strArr[2]));
            put.add(ColumnFamily, QUALIFIER_TAG2, Bytes.toBytes(strArr[3]));
            put.add(ColumnFamily, QUALIFIER_TAG3, Bytes.toBytes(strArr[4]));
            put.add(ColumnFamily, QUALIFIER_TAG4, Bytes.toBytes(strArr[5]));
            put.add(ColumnFamily, QUALIFIER_TAG5, Bytes.toBytes(strArr[6]));
            put.add(ColumnFamily, QUALIFIER_TAG6, Bytes.toBytes(strArr[7]));
            put.add(ColumnFamily, QUALIFIER_TAG7, Bytes.toBytes(strArr[8]));
            put.add(ColumnFamily, QUALIFIER_TAG8, Bytes.toBytes(strArr[9]));
            put.add(ColumnFamily, QUALIFIER_TAG9, Bytes.toBytes(strArr[10]));
            put.add(ColumnFamily, QUALIFIER_TAG10, Bytes.toBytes(strArr[11]));
            context.write(new ImmutableBytesWritable(Bytes.toBytes(row)), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "xx,xx,xx");
        Job job = new Job(conf, JOBNAME);
        job.setJarByClass(BulkLoad.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setMapperClass(Map.class);
        // The SortReducer (KeyValueSortReducer or PutSortReducer) does not need to be specified here,
        // because job.setReducerClass(PutSortReducer.class) is already determined in the source code.
        job.setOutputFormatClass(HFileOutputFormat.class);

        FileSystem fs = FileSystem.get(URI.create("/"), conf);
        Path outPath = new Path(PATH_OUT);
        if (fs.exists(outPath))
            fs.delete(outPath, true);

        FileInputFormat.setInputPaths(job, new Path(PATH_IN));
        FileOutputFormat.setOutputPath(job, outPath);

        HTable table = new HTable(conf, TABLENAME);
        HFileOutputFormat.configureIncrementalLoad(job, table);

        if (job.waitForCompletion(true)) {
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(outPath, table);
        }
        System.exit(0);
    }
}

3. Notes and points for attention:

(0) As mentioned above, the generated HFiles are then loaded into HBase. During this step the files that MapReduce generated and stored on HDFS disappear: loading into HBase actually moves the HFiles into HBase, so the HFile output path on HDFS is still there, but the files inside it are gone.
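A quick way to observe this, sketched below with the placeholder paths from the listing above (the "f" subdirectory is the column family), is to list the output directory before and after the doBulkLoad call:

// List the HFiles under the job's output path; run before and after loader.doBulkLoad(outPath, table).
// After the load the directory still exists, but the HFiles have been moved under HBase's own storage.
FileSystem fs = FileSystem.get(URI.create("/"), conf);
for (FileStatus status : fs.listStatus(new Path(PATH_OUT + "/f"))) {
    System.out.println(status.getPath());
}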

(1) When importing into HBase with BulkLoad, remember to pre-split the regions when creating the table (HBase regions will be summarized in a later post). The HFileOutputFormat.configureIncrementalLoad method sets the number of reduce tasks, and the rowkey range each one covers, according to the number of regions. Otherwise, if a reduce task has to handle too much data, the work is unbalanced and the job runs for too long.
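As a hedged sketch of such pre-splitting (table name, column family and split points are illustrative and would normally be chosen from the expected rowkey distribution), a pre-split table can be created with the older HBaseAdmin API like this:

// Create the target table pre-split into regions, so that configureIncrementalLoad
// can spread the reduce tasks across the rowkey ranges of those regions.
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("bulkLoad");
desc.addFamily(new HColumnDescriptor("f"));
byte[][] splitKeys = new byte[][] {                   // illustrative region boundaries
    Bytes.toBytes("2"), Bytes.toBytes("4"), Bytes.toBytes("6"), Bytes.toBytes("8")
};
admin.createTable(desc, splitKeys);                   // the table starts with splitKeys.length + 1 regions
admin.close();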

(2) Do not put too many columns under a single rowkey, otherwise the sort in the reduce phase can cause an OOM. One workaround is to avoid the reduce-phase sort with a secondary sort, depending on the application.

(3) After the MapReduce code has run, the HFiles generated on HDFS still need to be written into the HBase table. This can be done with the command hadoop jar hbase-VERSION.jar completebulkload /hfilepath tablename.
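The listing in section 2 already calls LoadIncrementalHFiles directly; as a hedged Java equivalent of the completebulkload command (paths and table name are placeholders, and this assumes the LoadIncrementalHFiles class of this API generation, which implements Tool):

// LoadIncrementalHFiles is the class behind completebulkload; running it through ToolRunner
// with <hfile path> <table name> is equivalent to the shell invocation above.
Configuration conf = HBaseConfiguration.create();
int exitCode = ToolRunner.run(new LoadIncrementalHFiles(conf),
        new String[] { "/hfilepath", "bulkLoad" });
System.exit(exitCode);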

(4) The HFile approach is the fastest of all loading schemes, but it comes with a premise: the data is being imported for the first time and the table is empty. If the table already contains data, importing the HFiles into it can trigger split operations.

(5) Whether in map or reduce, the output key and value types must be:

<ImmutableBytesWritable, KeyValue>

or

<ImmutableBytesWritable, Put>

Otherwise, an error like this is reported:

java.lang.IllegalArgumentException: Can't read partitions file ... Caused by: java.io.IOException: wrong key class: org.apache.hadoop.io.*** is not class org.apache.hadoop.hbase.io.ImmutableBytesWritable
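The listing in section 2 uses the <ImmutableBytesWritable, Put> pair; a hedged sketch of the <ImmutableBytesWritable, KeyValue> variant of the mapper (the field layout, family and qualifier follow the example above) looks like this:

// Mapper that emits KeyValue instead of Put; one KeyValue is written per column,
// and configureIncrementalLoad then picks KeyValueSortReducer (see note 6).
public static class KvMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] strArr = value.toString().split("\\|");
        byte[] row = Bytes.toBytes(strArr[0]);
        KeyValue kv = new KeyValue(row, "f".getBytes(), "tag1".getBytes(), Bytes.toBytes(strArr[2]));
        context.write(new ImmutableBytesWritable(row), kv);
    }
}

The driver would then call job.setMapOutputValueClass(KeyValue.class) instead of Put.class.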

(6) In the final output, the value type is either KeyValue or Put, and the corresponding sorter is KeyValueSortReducer or PutSortReducer respectively. This SortReducer does not have to be specified, because the choice is already made in the source code:

if (KeyValue.class.equals(job.getMapOutputValueClass())) {
    job.setReducerClass(KeyValueSortReducer.class);
} else if (Put.class.equals(job.getMapOutputValueClass())) {
    job.setReducerClass(PutSortReducer.class);
} else {
    LOG.warn("Unknown map output value type:" + job.getMapOutputValueClass());
}

(7) The MR example uses job.setOutputFormatClass(HFileOutputFormat.class). HFileOutputFormat is only suitable for organizing a single column family into HFiles at a time; multiple column families require multiple jobs. Newer versions of HBase have lifted this limitation.
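As a hedged sketch of the newer API (assuming an HBase 1.x-style client; the table name is a placeholder), HFileOutputFormat2 supports multiple column families and is configured with a Table and a RegionLocator instead of an HTable:

// Newer-API counterpart of HFileOutputFormat.configureIncrementalLoad(job, table):
// HFileOutputFormat2 can organize several column families into HFiles in a single job.
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("bulkLoad"));
     RegionLocator regionLocator = connection.getRegionLocator(TableName.valueOf("bulkLoad"))) {
    HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
}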

(8) The HFiles generated by the MR example are stored on HDFS, with one subdirectory under the output path per column family. Importing the HFiles into HBase is equivalent to moving them into the HBase regions, after which the column-family subdirectories under the output path are empty.

(9) The job does not call setNumReduceTasks at the end, because the number of reduce tasks is configured automatically by the framework according to the number of regions.

(10) In the configuration section it does not matter whether the commented-out lines are included or not, because, as the source code shows, the configureIncrementalLoad method already takes care of all the fixed configuration; only the non-fixed parts need to be set manually.
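To make this concrete, here is a hedged summary of the split between what the driver still sets by hand and what HFileOutputFormat.configureIncrementalLoad takes care of, based on the behaviour described in notes (6) and (9); the exact details vary slightly between HBase versions:

// Set manually, because it is specific to this job:
job.setJarByClass(BulkLoad.class);
job.setMapperClass(Map.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
FileInputFormat.setInputPaths(job, new Path(PATH_IN));
FileOutputFormat.setOutputPath(job, outPath);

// Taken care of by configureIncrementalLoad, because it is fixed by the bulk-load pattern:
//   - the output format (HFileOutputFormat) and its output key/value classes,
//   - the sorter reducer (KeyValueSortReducer or PutSortReducer, note 6),
//   - a TotalOrderPartitioner built from the table's region boundaries,
//   - the number of reduce tasks, one per region (note 9).
HFileOutputFormat.configureIncrementalLoad(job, table);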

