How to Use HBase's HashTable/SyncTable Tools to Synchronize Cluster Data

This article shows how to use HBase's HashTable/SyncTable tools to synchronize table data between clusters. The content is meant to be easy to follow; the sections below walk through the tools, their parameters, and typical use cases step by step.
Introduction to HashTable/SyncTable
HashTable/SyncTable is a tool implemented as two map-reduce jobs that are executed as separate steps. At first sight it looks similar to the CopyTable tool, which can perform partial or full table data copies. Unlike CopyTable, however, it only copies the data that diverges between the two clusters, saving network and computing resources during the replication procedure.
The first step in the process is the HashTable map-reduce job. It should be run on the cluster whose data is to be replicated to the remote peer, normally the source cluster. A quick example of how to run it is shown below; a detailed explanation of each required parameter is given later in this article:
hbase org.apache.hadoop.hbase.mapreduce.HashTable --families=cf my-table /hashes/test-tbl
…
20/04/28 05:05:48 INFO mapreduce.Job:  map 100% reduce 100%
20/04/28 05:05:49 INFO mapreduce.Job: Job job_1587986840019_0001 completed successfully
20/04/28 05:05:49 INFO mapreduce.Job: Counters: 68
…
File Input Format Counters
        Bytes Read=0
File Output Format Counters
        Bytes Written=6811788
Once the HashTable job launched by the command above has completed, some output files have been generated under the /hashes/test-tbl directory in the source cluster's HDFS:
hdfs dfs -ls -R /hashes/test-tbl
drwxr-xr-x   - root supergroup          0 2020-04-28 05:05 /hashes/test-tbl/hashes
-rw-r--r--   2 root supergroup          0 2020-04-28 05:05 /hashes/test-tbl/hashes/_SUCCESS
drwxr-xr-x   - root supergroup          0 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000
-rw-r--r--   2 root supergroup    6790909 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000/data
-rw-r--r--   2 root supergroup      20879 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000/index
-rw-r--r--   2 root supergroup         99 2020-04-28 05:04 /hashes/test-tbl/manifest
-rw-r--r--   2 root supergroup            2020-04-28 05:04 /hashes/test-tbl/partitions
These files are required as input for the SyncTable run. SyncTable must be launched on the target peer. The command below runs SyncTable against the HashTable output from the previous example. It uses the dryrun option explained later in this article:
hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://source-cluster-active-nn/hashes/test-tbl test-tbl test-tbl
…
org.apache.hadoop.hbase.mapreduce.SyncTable$SyncMapper$Counter
        BATCHES=97148
        HASHES_MATCHED=97146
        HASHES_NOT_MATCHED=2
        MATCHINGCELLS=17
        MATCHINGROWS=2
        RANGESNOTMATCHED=2
        ROWSWITHDIFFS=2
        SOURCEMISSINGCELLS=1
        TARGETMISSINGCELLS=1
As a quick reference, you can simply replace the parameters in these two examples with the values from your own environment. The remainder of this article covers the implementation details in more depth.
Why two different steps?
The main goal of this tool is to identify and copy only the data that differs between the two clusters. HashTable works as a sharding/indexing job, analysing batches of table data and generating a hash index for each batch. These indexes are written to files under the HDFS directory passed as one of the job parameters (/hashes/test-tbl in the example above). As mentioned before, this output is required by the SyncTable job. SyncTable scans the target table locally in batches of the same size as those used by HashTable, and computes hash values for those batches with the same function used by HashTable. It then compares each local batch hash against the corresponding value from the HashTable output. If the hashes are equal, the whole batch is identical in both clusters and nothing needs to be copied for that segment. Otherwise, it scans that batch in the source cluster, checks whether each cell already exists in the target cluster, and copies only the cells that diverge. On sparse, slightly different datasets this results in far less data being copied between the two clusters, and only a small number of cells need to be scanned in the source to check for mismatches.
Necessary parameters
HashTable requires only two parameters: the table name and the output path where the related hashes and other meta-information files will be written. SyncTable takes the HashTable output directory as input, together with the table names on the source and on the target cluster, respectively. Since we are using HashTable/SyncTable to move data between remote clusters, the sourcezkcluster option must be defined for SyncTable. This should be the ZooKeeper quorum address of the source cluster. In the examples in this article we also referenced the source cluster's active namenode address directly, so that SyncTable reads the hash output files straight from the source cluster. Alternatively, the HashTable output can be copied manually from the source cluster to the remote cluster (with distcp, for example).
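For instance, a minimal sketch of that alternative; the target namenode address used below (target-cluster-active-nn) is a made-up placeholder, and the other values come from the earlier examples:

# Copy the HashTable output from the source HDFS to the target HDFS
hadoop distcp hdfs://source-cluster-active-nn/hashes/test-tbl hdfs://target-cluster-active-nn/hashes/test-tbl

# Point SyncTable at the local copy of the hashes; sourcezkcluster is still needed
# so the job can scan the source table for the cells that actually differ
hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true \
  --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase \
  hdfs://target-cluster-active-nn/hashes/test-tbl test-tbl test-tbl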
Note: working against remote clusters in a different Kerberos realm is only supported from CDH 6.2.1 onwards.
Advanced options
Both HashTable and SyncTable offer extra options that can be tuned for best results.
HashTable allows the data to be filtered by row key and by modification time (with the startrow/stoprow and starttime/stoptime properties, respectively). The dataset scope can also be limited with the versions and column families properties. The batchsize property defines the size of each portion to be hashed, which has a direct impact on synchronization performance. In cases where mismatches are rare, setting a larger batch size can lead to better performance, since larger portions of the dataset get skipped without needing to be scanned by SyncTable.
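As an illustration, a sketch of a filtered HashTable run, assuming the startrow/stoprow and batchsize options described above; the row keys, batch size, and output path are made-up values for this example:

# Hash only column family cf within a hypothetical row key range,
# using larger batches than the default to speed up a mostly-in-sync check
hbase org.apache.hadoop.hbase.mapreduce.HashTable \
  --families=cf \
  --startrow=row-020000 --stoprow=row-040000 \
  --batchsize=32000 \
  test-tbl /hashes/test-tbl-partial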
SyncTable provides the dryrun option, which allows a preview of the changes that would be applied to the target.
SyncTable's default behaviour is to mirror the source data on the target side, so any additional cell present in the target but absent in the source ends up being deleted on the target side. That can be undesirable when synchronizing clusters under an Active-Active replication setup; for such cases the doDeletes option can be set to false, skipping the replication of deletions on the target. There is also a similar doPuts flag for cases where additional cells should not be inserted into the target cluster.
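For an Active-Active setup, a sketch of such a run, reusing the same placeholder addresses as the earlier examples:

# Apply cells missing on the target, but never delete cells that exist only there
hbase org.apache.hadoop.hbase.mapreduce.SyncTable \
  --doDeletes=false \
  --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase \
  hdfs://source-cluster-active-nn/hashes/test-tbl test-tbl test-tbl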
Output analysis
HashTable outputs a few files containing meta information for SyncTable, but those files are not human-readable. HashTable does not perform any changes on existing data, so this output is of little interest in a user context.
SyncTable is the step that actually applies the modifications on the target, so it is important to review its summary before actually changing data in the target cluster (see the dryrun option mentioned above). It publishes some relevant counters at the end of the map-reduce execution. Looking at the values from the example above, we can see that 97148 batches were hashed (reported by the BATCHES counter), and SyncTable detected divergences in only two of them (according to the HASHES_MATCHED and HASHES_NOT_MATCHED counters). Additionally, within those two batches with differing hashes, 17 cells over 2 rows were matching (as reported by MATCHINGCELLS and MATCHINGROWS, respectively), but 2 rows were also divergent in those two batches (according to RANGESNOTMATCHED and ROWSWITHDIFFS). Finally, SOURCEMISSINGCELLS and TARGETMISSINGCELLS tell us in detail whether cells exist only on the source or only on the target cluster. In this example, the source cluster had one cell that was not on the target, and the target also had one cell that was not on the source. Since SyncTable was run without the dryrun option and without setting doDeletes to false, the job deleted the extra cell in the target cluster and added the extra cell found in the source to the target cluster. Assuming no writes happen on either cluster, a subsequent run of the exact same SyncTable command on the target cluster would then show no differences at all:
hbase org.apache.hadoop.hbase.mapreduce.SyncTable --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/test-tbl test-tbl test-tbl
…
org.apache.hadoop.hbase.mapreduce.SyncTable$SyncMapper$Counter
        BATCHES=97148
        HASHES_MATCHED=97148
…

Suitable scenarios: data synchronization
At first glance, HashTable/SyncTable may seem to overlap with the CopyTable tool, but there are specific scenarios where either tool is the better fit. As a first comparison example, using HashTable/SyncTable for the initial load of a table holding 100,004 rows and a total data size of 5.17GB required several minutes just for SyncTable to complete:
…
20/04/29 03:48:00 INFO mapreduce.Job: Running job: job_1587985272792_0011
20/04/29 03:48:09 INFO mapreduce.Job: Job job_1587985272792_0011 running in uber mode : false
20/04/29 03:48:09 INFO mapreduce.Job:  map 0% reduce 0%
20/04/29 03:54:08 INFO mapreduce.Job:  map 100% reduce 0%
20/04/29 03:54:09 INFO mapreduce.Job: Job job_1587985272792_0011 completed successfully
…
org.apache.hadoop.hbase.mapreduce.SyncTable$SyncMapper$Counter
        BATCHES=97148
        EMPTY_BATCHES=97148
        HASHES_NOT_MATCHED=97148
        RANGESNOTMATCHED=97148
        ROWSWITHDIFFS=100004
        TARGETMISSINGCELLS=749589
        TARGETMISSINGROWS=100004
Even on such a small dataset, CopyTable executed faster (roughly 3 minutes, while SyncTable took 6 minutes to copy the whole dataset):
…
20/04/29 05:12:07 INFO mapreduce.Job: Running job: job_1587986840019_0005
20/04/29 05:12:24 INFO mapreduce.Job: Job job_1587986840019_0005 running in uber mode : false
20/04/29 05:12:24 INFO mapreduce.Job:  map 0% reduce 0%
20/04/29 05:13:16 INFO mapreduce.Job:  map 25% reduce 0%
20/04/29 05:13:49 INFO mapreduce.Job:  map 50% reduce 0%
20/04/29 05:14:37 INFO mapreduce.Job:  map 75% reduce 0%
20/04/29 05:15:14 INFO mapreduce.Job:  map 100% reduce 0%
20/04/29 05:15:14 INFO mapreduce.Job: Job job_1587986840019_0005 completed successfully
…
HBase Counters
        BYTES_IN_REMOTE_RESULTS=2787236791
        BYTES_IN_RESULTS=5549784428
        MILLIS_BETWEEN_NEXTS=130808
        NOT_SERVING_REGION_EXCEPTION=0
        NUM_SCANNER_RESTARTS=0
        NUM_SCAN_RESULTS_STALE=0
        REGIONS_SCANNED=4
        REMOTE_RPC_CALLS=1334
        REMOTE_RPC_RETRIES=0
        ROWS_FILTERED=0
        ROWS_SCANNED=100004
        RPC_CALLS=2657
        RPC_RETRIES=0
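The CopyTable invocation itself is not shown in the log above; a minimal sketch of such a run, with the target peer's ZooKeeper quorum hostnames as made-up placeholders, might look like this:

# Copy the whole table to the remote peer cluster identified by its ZooKeeper quorum
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --peer.adr=zk-target1.example.com,zk-target2.example.com,zk-target3.example.com:2181:/hbase \
  test-tbl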
Now let's use the two tools again on a dataset with sparse differences. The test-tbl table used in all of these examples has four regions in the source cluster. After all of the original dataset had been copied to the target cluster in the previous example, we added just four rows on the source side, one for each of the existing regions, then ran HashTable/SyncTable again to synchronize the two clusters:
20/04/29 05:29:23 INFO mapreduce.Job: Running job: job_1587985272792_0013
20/04/29 05:29:39 INFO mapreduce.Job: Job job_1587985272792_0013 running in uber mode : false
20/04/29 05:29:39 INFO mapreduce.Job:  map 0% reduce 0%
20/04/29 05:29:53 INFO mapreduce.Job:  map 50% reduce 0%
20/04/29 05:30:42 INFO mapreduce.Job:  map 100% reduce 0%
20/04/29 05:30:42 INFO mapreduce.Job: Job job_1587985272792_0013 completed successfully
…
org.apache.hadoop.hbase.mapreduce.SyncTable$SyncMapper$Counter
        BATCHES=97148
        HASHES_MATCHED=97144
        HASHES_NOT_MATCHED=4
        MATCHINGCELLS=42
        MATCHINGROWS=5
        RANGESNOTMATCHED=4
        ROWSWITHDIFFS=4
        TARGETMISSINGCELLS=4
        TARGETMISSINGROWS=4
As we can see, only four batches did not match, and SyncTable completed considerably faster (roughly one minute). Performing the same synchronization with CopyTable showed the following results:
20/04/29 08:32:38 INFO mapreduce.Job: Running job: job_1587986840019_0008
20/04/29 08:32:52 INFO mapreduce.Job: Job job_1587986840019_0008 running in uber mode : false
20/04/29 08:32:52 INFO mapreduce.Job:  map 0% reduce 0%
20/04/29 08:33:38 INFO mapreduce.Job:  map 25% reduce 0%
20/04/29 08:34:15 INFO mapreduce.Job:  map 50% reduce 0%
20/04/29 08:34:48 INFO mapreduce.Job:  map 75% reduce 0%
20/04/29 08:35:31 INFO mapreduce.Job:  map 100% reduce 0%
20/04/29 08:35:32 INFO mapreduce.Job: Job job_1587986840019_0008 completed successfully
…
HBase Counters
        BYTES_IN_REMOTE_RESULTS=2762547723
        BYTES_IN_RESULTS=5549784600
        MILLIS_BETWEEN_NEXTS=340672
        NOT_SERVING_REGION_EXCEPTION=0
        NUM_SCANNER_RESTARTS=0
        NUM_SCAN_RESULTS_STALE=0
        REGIONS_SCANNED=4
        REMOTE_RPC_CALLS=1323
        REMOTE_RPC_RETRIES=0
        ROWS_FILTERED=0
        ROWS_SCANNED=100008
        RPC_CALLS=2657
        RPC_RETRIES=0
Even with only four cells to copy, CopyTable took roughly the same time it needed to copy the entire dataset. That is still acceptable for this very small dataset and idle clusters, but in production use cases with much larger datasets, where the target cluster may also be serving many client applications writing data into it, CopyTable's performance degradation compared to SyncTable would be even higher.
It is worth mentioning that there are additional tools/features that can be used in combination for the initial load of a target cluster (where the target has no data at all), such as snapshot export, bulk load, or even a direct copy of the table directories from the source cluster. For initial loads with large amounts of data to copy, taking a table snapshot and then using the ExportSnapshot tool will outperform online copy tools such as SyncTable or CopyTable.
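As a rough sketch of that snapshot-based approach, the snapshot name, mapper count, and target namenode address below are made-up values; the snapshot is taken from the hbase shell on the source cluster and can later be cloned into a table on the target with the clone_snapshot shell command:

# Take a snapshot of the table on the source cluster
echo "snapshot 'test-tbl', 'test-tbl-snap'" | hbase shell -n

# Export the snapshot files to the target cluster's HBase root directory
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot test-tbl-snap \
  -copy-to hdfs://target-cluster-active-nn/hbase \
  -mappers 4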
Check replication integrity
Another common use for HashTable/SyncTable is monitoring the replication state between clusters when troubleshooting possible replication issues. In that scenario it works as an alternative to the VerifyReplication tool. Typically, when checking the state between two clusters there are either no mismatches at all, or a temporary problem has caused a small portion of the larger dataset to fall out of sync. In our test environment we have been using a table with 100,008 rows, with matching values on both clusters. Running SyncTable on the target cluster with the dryrun option lets us identify any differences:
20/05/04 10:47:25 INFO mapreduce.Job: Running job: job_1588611199158_0004
…
20/05/04 10:48:48 INFO mapreduce.Job:  map 100% reduce 0%
20/05/04 10:48:48 INFO mapreduce.Job: Job job_1588611199158_0004 completed successfully
…
HBase Counters
        BYTES_IN_REMOTE_RESULTS=3753476784
        BYTES_IN_RESULTS=5549784600
        ROWS_SCANNED=100008
…
org.apache.hadoop.hbase.mapreduce.SyncTable$SyncMapper$Counter
        BATCHES=97148
        HASHES_MATCHED=97148
…
Unlike with SyncTable, the VerifyReplication tool must be run on the source cluster. The peer id is passed as one of its parameters, so that it can find the remote cluster to scan for comparison (a sketch of the invocation is shown after the output below):

20/05/04 11:01:58 INFO mapreduce.Job: Running job: job_1588611196128_0001
…
20/05/04 11:04:39 INFO mapreduce.Job:  map 100% reduce 0%
20/05/04 11:04:39 INFO mapreduce.Job: Job job_1588611196128_0001 completed successfully
…
HBase Counters
        BYTES_IN_REMOTE_RESULTS=2761955495
        BYTES_IN_RESULTS=5549784600
…
org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
        GOODROWS=100008
…
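A minimal sketch of that VerifyReplication invocation, assuming the replication peer was registered on the source cluster with id 1 (the peer id and table name must match your own setup):

# Compare every cell of test-tbl against the cluster registered as peer "1"
hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 1 test-tbl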
SyncTable only looks for mismatches between the hash values of the source and target batches, avoiding the need to scan the whole remote source cluster again. VerifyReplication performs a one-by-one comparison of every cell in the two clusters, which can carry a high network cost even when dealing with such a small dataset.
Adding one more row to the source cluster and performing the checks again, first with VerifyReplication:
20/05/05 11:14:05 INFO mapreduce.Job: Running job: job_1588611196128_0004
…
20/05/05 11:16:32 INFO mapreduce.Job:  map 100% reduce 0%
20/05/05 11:16:32 INFO mapreduce.Job: Job job_1588611196128_0004 completed successfully
…
org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
        BADROWS=1
        GOODROWS=100008
        ONLY_IN_SOURCE_TABLE_ROWS=1
…
Before using SyncTable, the hashes on the source must be regenerated with HashTable again, since there is now a new cell:
20/05/04 11:31:48 INFO mapreduce.Job: Running job: job_1588611196128_0003
…
20/05/04 11:33:15 INFO mapreduce.Job: Job job_1588611196128_0003 completed successfully
…

Now SyncTable:

20/05/07 05:47:51 INFO mapreduce.Job: Running job: job_1588611199158_0014
…
20/05/07 05:49:20 INFO mapreduce.Job: Job job_1588611199158_0014 completed successfully
…
org.apache.hadoop.hbase.mapreduce.SyncTable$SyncMapper$Counter
        BATCHES=97148
        HASHES_NOT_MATCHED=97148
        MATCHINGCELLS=749593
        MATCHINGROWS=100008
        RANGESMATCHED=97147
        RANGESNOTMATCHED=1
        ROWSWITHDIFFS=1
        TARGETMISSINGCELLS=1
        TARGETMISSINGROWS=1
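Putting the two steps together, the re-sync sequence used above is roughly the following sketch; run the first command on the source cluster and the second on the target, reusing the placeholder addresses from the earlier examples (use a fresh output directory, or remove the old one, if a previous HashTable run left files there):

# 1. On the source cluster: regenerate the hashes after the new write
hbase org.apache.hadoop.hbase.mapreduce.HashTable --families=cf test-tbl /hashes/test-tbl

# 2. On the target cluster: compare the batches and apply only the divergent cells
hbase org.apache.hadoop.hbase.mapreduce.SyncTable \
  --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase \
  hdfs://source-cluster-active-nn/hashes/test-tbl test-tbl test-tbl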
We can see that the execution time increased because of the additional scan and cell comparison between the two remote clusters, while the execution time of VerifyReplication barely changed.
That covers how HBase uses the HashTable/SyncTable tools to synchronize cluster data. Thanks for reading; hopefully it helps clear up any doubts about when and how to use these tools.