
How to Realize the Integration of Cassandra and Hadoop MapReduce


This article explains in detail how to integrate Cassandra with Hadoop MapReduce. It should serve as a useful reference, so interested readers are encouraged to read on!

Integrate Cassandra and Hadoop MapReduce

Seeing this title, you are bound to ask: how is this "integration" defined?

Personally, I think integration means one of two things: we can write MapReduce programs that read data from HDFS and insert the results into Cassandra, or we can read data directly from Cassandra and compute over it.

Read data from HDFS and insert it into Cassandra

For this type, we can follow these steps:

1. Upload the data to be inserted into Cassandra to HDFS.

2. Start the Hadoop MapReduce program.

This type of integration has nothing to do with Cassandra itself: we simply run an ordinary MapReduce program and insert the computed data into Cassandra on the map or reduce side, as sketched below. That's all.
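As an illustration, here is a minimal sketch of a reduce side performing such inserts through Cassandra 0.6's Thrift client. The host, port, keyspace, column family, and column name ("cassandra-host", "MyKeyspace", "MemberCF", "CITY") are placeholder assumptions, and production code would need batching and error handling:

import java.io.IOException;
import java.util.Iterator;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Sketch only: each reducer opens one Thrift connection and writes
// one column per input value. All names below are placeholders.
public class CassandraInsertReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private TTransport transport;
    private Cassandra.Client client;

    public void configure(JobConf job) {
        try {
            transport = new TSocket("cassandra-host", 9160); // placeholder host/port
            client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
        } catch (Exception e) {
            throw new RuntimeException("Cannot connect to Cassandra", e);
        }
    }

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        try {
            ColumnPath path = new ColumnPath("MemberCF")          // placeholder column family
                    .setColumn("CITY".getBytes("UTF-8"));         // placeholder column
            while (values.hasNext()) {
                client.insert("MyKeyspace", key.toString(), path, // placeholder keyspace
                        values.next().toString().getBytes("UTF-8"),
                        System.currentTimeMillis(), ConsistencyLevel.ONE);
            }
        } catch (Exception e) {
            throw new IOException("Insert into Cassandra failed: " + e);
        }
    }

    public void close() {
        transport.close();
    }
}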

Read data directly from Cassandra and compute over it

This feature was added in Cassandra 0.6.x. It lets MapReduce read its input data directly from Cassandra, effectively providing a full-table scan over Cassandra.

The steps are as follows:

1. In the MapReduce program, specify the Cassandra-related parameters such as the Keyspace, ColumnFamily, and SlicePredicate (for these concepts, see "Plain Talk on the Cassandra Data Model" and "Talking About the Cassandra Client"). A sketch of this setup follows the list.

2. Start the Hadoop MapReduce program.
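As a sketch of step 1, the job setup might look like the following, modeled loosely on the word_count example shipped in Cassandra 0.6's contrib directory. The keyspace, column family, and column name are placeholders, and the exact signatures vary between 0.6.x releases:

import java.util.Arrays;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraScanJobSetup {
    // Build a Hadoop 0.20.x Job whose input is a Cassandra column family.
    public static Job createJob(Configuration conf) throws Exception {
        Job job = new Job(conf, "cassandra-full-scan");
        // Read input splits directly from Cassandra instead of HDFS.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        // Which keyspace/column family to scan (placeholder names).
        ConfigHelper.setColumnFamily(job.getConfiguration(), "MyKeyspace", "MemberCF");
        // Which columns of each row to fetch (placeholder column name).
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList("CITY".getBytes("UTF-8")));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
        // Mapper, reducer, and output settings would follow as in any MapReduce job.
        return job;
    }
}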

This type of integration differs from the HDFS-based type in several ways:

1. The input data comes from a different source: the former reads its input from HDFS, while the latter reads data directly from Cassandra.

2. The supported Hadoop versions differ: the former works with any version of Hadoop, while the latter works only with Hadoop 0.20.x.

Integrate Hadoop 0.19.x and Cassandra 0.6.x

Cassandra 0.6.x implements integration with Hadoop 0.20.x by default, so it cannot be used directly with Hadoop 0.19.x.

So, to achieve this goal, the first thing we need to do is modify the Cassandra source code to provide a version that works with Hadoop 0.19.x.

To test this, we can follow these steps:

1. Download the modified code.

2. Specify the following in the MapReduce program (note that the classes used here live in the com.alibaba.dw.cassandra.hadoop package):

// Point the ported ConfigHelper at the keyspace, column family, and the
// directory containing storage-conf.xml (an extra argument added by this port).
ConfigHelper.setColumnFamily(conf, Keyspace, MemberCF, "/home/admin/apache-cassandra-0.6.1/conf");
// Fetch only the CITY and EMPLOYEES_COUNT columns of each row.
SlicePredicate predicate = new SlicePredicate().setColumn_names(
        Arrays.asList("CITY".getBytes(UTF8), "EMPLOYEES_COUNT".getBytes(UTF8)));
ConfigHelper.setSlicePredicate(conf, predicate);
// Read rows from Cassandra in batches of 512 keys.
ConfigHelper.setRangeBatchSize(conf, 512);
// Scan the "MemberInfo" super column.
ConfigHelper.setSuperColumn(conf, "MemberInfo");

3. Make sure that on every machine running MapReduce, this directory exists and matches the storage-conf.xml path set in the MapReduce program.

4. Run the Hadoop MapReduce program (the old-API job wiring is sketched after this list).
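For reference, the corresponding job wiring on Hadoop 0.19.x uses the old org.apache.hadoop.mapred API. The sketch below assumes the ported classes keep the same names under com.alibaba.dw.cassandra.hadoop; the job and mapper classes (MemberScanJob, MemberScanMapper) are hypothetical:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import com.alibaba.dw.cassandra.hadoop.ColumnFamilyInputFormat;

public class MemberScanJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MemberScanJob.class);
        conf.setJobName("cassandra-scan-on-0.19");
        // The ported input format implements the old mapred interfaces,
        // which is what allows the job to run on Hadoop 0.19.x.
        conf.setInputFormat(ColumnFamilyInputFormat.class);
        // ...the ConfigHelper calls from step 2 go here...
        conf.setMapperClass(MemberScanMapper.class); // hypothetical mapper class
        JobClient.runJob(conf);
    }
}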

Existing problems and improvements

In actual use, we find that the following error appears on the map side:

java.lang.RuntimeException: TimedOutException()
    at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:125)
    at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:164)
    at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:1)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
    at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader.next(ColumnFamilyRecordReader.java:224)
    at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader.next(ColumnFamilyRecordReader.java:1)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: TimedOutException()
    at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
    at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
    at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:108)
    ... 11 more

This problem is caused by a timeout when reading data from Cassandra through the Thrift API. We can therefore improve this code with proper error handling to make the program more robust; one option is sketched below.
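One hypothetical way to add that error handling is a bounded retry with backoff around the range query that times out. The helper name, retry count, and backoff are my assumptions; the get_range_slices signature is the Cassandra 0.6 Thrift one seen in the stack trace above:

import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.TimedOutException;

public class RetryingRangeReader {
    // Hypothetical helper: retry a timed-out range query a few times
    // with linear backoff before giving up.
    public static List<KeySlice> getRangeSlicesWithRetry(
            Cassandra.Client client, String keyspace, ColumnParent parent,
            SlicePredicate predicate, KeyRange range) throws Exception {
        int attempts = 0;
        while (true) {
            try {
                return client.get_range_slices(keyspace, parent, predicate,
                        range, ConsistencyLevel.ONE);
            } catch (TimedOutException e) {
                attempts++;
                if (attempts >= 3) {
                    throw e;                    // give up after three attempts
                }
                Thread.sleep(1000L * attempts); // simple linear backoff
            }
        }
    }
}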

That is all of "How to Realize the Integration of Cassandra and Hadoop MapReduce". Thank you for reading; I hope you found it helpful!
