What are the common ways for Spark DataFrame to write to HBase

This article introduces the common ways of writing a Spark DataFrame into HBase. It is fairly detailed and should have some reference value; interested readers are encouraged to read to the end.

Spark is currently the most popular distributed computing framework, while HBase is a column-oriented distributed storage engine built on HDFS. A very common setup is to run offline or real-time computation with Spark and save the results in HBase; user profiles, individual profiles, recommendation systems and the like can all use HBase as the storage medium that serves clients.

Therefore, how Spark writes data into HBase becomes a very important piece of the pipeline.

The code below was tested against Spark 2.2.0.

The first and easiest approach is based on RDD partitions: since a partition in Spark is always processed on a single executor, you can create one HBase connection per partition and submit the whole partition's contents in one batch.

The rough code is:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

rdd.foreachPartition { records =>
  val config = HBaseConfiguration.create
  config.set("hbase.zookeeper.property.clientPort", "2181")
  config.set("hbase.zookeeper.quorum", "a1,a2,a3")
  val connection = ConnectionFactory.createConnection(config)
  val table = connection.getTable(TableName.valueOf("rec:user_rec"))
  val list = new java.util.ArrayList[Put]
  for (record <- records) {
    // build one Put per record; the row key and column mapping here are illustrative
    val put = new Put(Bytes.toBytes(record._1))
    put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("col1"), Bytes.toBytes(record._2))
    list.add(put)
  }
  table.put(list)
  table.close()
  connection.close()
}

Another popular approach is the shc connector open-sourced by Hortonworks, which writes a DataFrame to HBase through the Spark SQL data source API: you declare a catalog that maps DataFrame columns to HBase column families, then call df.write. The rough code is:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object HBaseWriteExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-write").getOrCreate()

    val data = (0 to 255).map { i => HBaseRecord(i, "extra") }
    val df: DataFrame = spark.createDataFrame(data)
    df.write
      .mode(SaveMode.Overwrite)
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }

  def catalog = s"""{
       |"table":{"namespace":"rec", "name":"user_rec"},
       |"rowkey":"key",
       |"columns":{
       |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
       |"col1":{"cf":"t", "col":"col1", "type":"boolean"},
       |"col2":{"cf":"t", "col":"col2", "type":"double"},
       |"col3":{"cf":"t", "col":"col3", "type":"float"},
       |"col4":{"cf":"t", "col":"col4", "type":"int"},
       |"col5":{"cf":"t", "col":"col5", "type":"bigint"},
       |"col6":{"cf":"t", "col":"col6", "type":"smallint"},
       |"col7":{"cf":"t", "col":"col7", "type":"string"},
       |"col8":{"cf":"t", "col":"col8", "type":"tinyint"}
       |}
       |}""".stripMargin
}

case class HBaseRecord(col0: String, col1: Boolean, col2: Double, col3: Float,
                       col4: Int, col5: Long, col6: Short, col7: String, col8: Byte)

object HBaseRecord {
  def apply(i: Int, t: String): HBaseRecord = {
    val s = s"""row${"%03d".format(i)}"""
    HBaseRecord(s, i % 2 == 0, i.toDouble, i.toFloat, i, i.toLong, i.toShort, s"String$i: $t", i.toByte)
  }
}
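To compile the shc example you need the shc-core artifact, which lives in the Hortonworks repository rather than Maven Central. A minimal sbt sketch follows; the resolver URL and the version string are assumptions, so check the shc README for the build matching your Spark and Scala versions:

resolvers += "Hortonworks Releases" at "http://repo.hortonworks.com/content/groups/public/"

// hypothetical version string; pick the shc-core build for your Spark/Scala combination
libraryDependencies += "com.hortonworks" % "shc-core" % "1.1.1-2.1-s_2.11"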

Then put hbase-site.xml, hdfs-site.xml, core-site.xml and the other configuration files under the resources directory; their main purpose is to provide the HBase connection addresses.
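For illustration, a minimal hbase-site.xml along these lines is typically enough for HBaseConfiguration.create to locate the cluster; the quorum host names are placeholders matching the example above:

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>a1,a2,a3</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>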

If you have a habit of browsing the official website, you will find that HBase has already reached 3.0.0-SNAPSHOT there, and that a hbase-spark module was added in version 2.0. It is used the same way as the Hortonworks connector above, except that the package name passed to format is different; my guess is that the code was copied over from Hortonworks.
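As a rough sketch of that difference (an assumption to verify against the hbase-spark documentation: the catalog helper moves to the org.apache.hadoop.hbase.spark.datasources package, and the data source name changes accordingly), the write call would become:

// assumption: import HBaseTableCatalog from the Apache hbase-spark package instead of shc
import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog

df.write
  .mode(SaveMode.Overwrite)
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.hadoop.hbase.spark") // instead of org.apache.spark.sql.execution.datasources.hbase
  .save()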

In addition, hbase-spark 2.0.0-alpha4 is already available in the Maven repository:

http://mvnrepository.com/artifact/org.apache.hbase/hbase-spark
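Expressed as an sbt dependency (group, artifact, and version taken from the page above):

libraryDependencies += "org.apache.hbase" % "hbase-spark" % "2.0.0-alpha4"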

However, the Spark version it depends on internally is 1.6.0, which is just too old! I really can't wait around for that.

Let's look forward to an official hbase-spark release soon.

hortonworks-spark/shc on GitHub: https://github.com/hortonworks-spark/shc

Maven repository: http://mvnrepository.com/artifact/org.apache.hbase/hbase-spark

HBase Spark SQL/DataFrame official documentation: https://hbase.apache.org/book.html#_sparksql_dataframes

These are all the contents of the article "What are the common ways for Spark DataFrame to write to HBase?" Thank you for reading, and I hope it helps.
