Spark Connector Reader: Principle and Practice
This article explains the principle and practice of Spark Connector Reader. The content is concise and easy to follow, and I hope you will get something out of the detailed introduction below.
The following mainly describes how to use Spark Connector to read Nebula Graph data.
Introduction to Spark Connector
Spark Connector is a data connector for Spark, through which you can read from and write to external data systems. Spark Connector consists of two parts, Reader and Writer. This article focuses on Spark Connector Reader; the Writer will be covered in detail in the next article.
Spark Connector Reader principle
Spark Connector Reader exposes Nebula Graph as an extended data source of Spark: it reads data from Nebula Graph into a DataFrame, on which subsequent operations such as map and reduce can then be performed.
Spark SQL allows users to define custom data sources and supports extensions to external data sources. Data read through Spark SQL is organized as a distributed data set of named columns, the DataFrame. Spark SQL also provides a large number of APIs for computing on and transforming DataFrames, and the DataFrame interface can be used with a variety of data sources.
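As a quick, hedged illustration of this DataFrame workflow (not specific to Nebula Graph), the snippet below reads a JSON file into a DataFrame and transforms it with the column-based API; the file path and column names are made up for the example.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataframe-example")  // hypothetical app name
  .master("local[*]")            // assumption: local test run
  .getOrCreate()

// Read a JSON file into a DataFrame (path and columns are hypothetical).
val people = spark.read.json("/tmp/people.json")

// Transform and act on the DataFrame with Spark SQL's column API.
people.filter(people("age") > 18)
  .select("name", "age")
  .show()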
The external data source interfaces that Spark exposes live under the org.apache.spark.sql package (its sources sub-package). Let's first look at the interfaces Spark SQL provides for extending data sources.
Basic Interfaces
BaseRelation: represents a collection of tuples with a known Schema. All subclasses that inherit BaseRelation must produce a Schema in StructType format. In other words, BaseRelation defines the format in which data read from the data source is stored in Spark SQL's DataFrame.
RelationProvider: receives a list of parameters and returns a new BaseRelation based on the given parameters.
DataSourceRegister: registers a shorthand name for the data source. When using the data source, you do not have to write its fully qualified class name; the custom shortName is enough.
Providers
RelationProvider: generates a custom relation for the specified data source. createRelation() creates a new relation based on the given params.
SchemaRelationProvider: generates a new relation based on the given params and the given Schema information.
RDD
RDD[InternalRow]: after the data source is scanned, the data needs to be constructed into an RDD[InternalRow].
To implement a custom external data source for Spark, you need to implement some of the interfaces above, depending on the data source.
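To make these roles concrete, here is a minimal, hedged sketch of a custom read-only data source built on the interfaces above; the DefaultSource and DemoRelation class names, the in-memory rows, and the "demo" shortName are invented for illustration and are not the Nebula Graph implementation.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// RelationProvider builds the relation; DataSourceRegister gives it a shortName.
class DefaultSource extends RelationProvider with DataSourceRegister {
  override def shortName(): String = "demo"

  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new DemoRelation(sqlContext)
}

// BaseRelation declares the schema; TableScan produces the rows.
class DemoRelation(override val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", LongType), StructField("name", StringType)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1L, "a"), Row(2L, "b")))
}

Provided the class is registered as a DataSourceRegister service (or referenced by its fully qualified class name in format(...)), spark.read.format("demo").load() returns a DataFrame with the declared schema.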
In the Spark Connector of Nebula Graph, we use Nebula Graph as the external data source of Spark SQL, and data is read via sparkSession.read. The main classes that implement this function are described below:
Define the data source NebulaRelationProvider, which extends RelationProvider to customize the relation and DataSourceRegister to register the external data source.
Define NebulaRelation, which defines the data Schema of Nebula Graph and the data conversion methods. Its getSchema() method connects to Nebula Graph's Meta service to obtain the Schema information for the configured return fields.
Define NebulaRDD, which reads Nebula Graph data. Its compute() method defines how the data is read: it scans the Nebula Graph data, converts each Nebula Graph Row that is read into a Spark InternalRow, and uses these InternalRows to form the rows of the RDD, where each InternalRow represents one row of Nebula Graph data. Finally, iterating partition by partition, all the Nebula Graph data is read out and assembled into the final DataFrame result.
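As a hedged sketch of this pattern (not the actual connector code), the outline below shows a custom RDD[InternalRow] whose getPartitions() maps one Spark partition to one graph part and whose compute() converts each scanned row into an InternalRow; the ScanPartition class and the scanRows function are hypothetical stand-ins for the real Nebula Graph scan interface.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.unsafe.types.UTF8String

// One RDD partition corresponds to one part of the graph space (hypothetical).
case class ScanPartition(index: Int) extends Partition

class ScanRDD(sc: SparkContext, numParts: Int,
              scanRows: Int => Iterator[Seq[Any]]) // placeholder for the real scan client
  extends RDD[InternalRow](sc, Nil) {

  // One Spark partition per graph part.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numParts)(ScanPartition)

  // Scan one part and convert each source row into an InternalRow.
  override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] =
    scanRows(split.index).map { values =>
      new GenericInternalRow(values.map {
        case s: String => UTF8String.fromString(s)
        case other     => other.asInstanceOf[AnyRef]
      }.toArray[Any])
    }
}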
Spark Connector Reader practice
Spark Connector's Reader provides an interface that users can program against to read data. Each read fetches the data of one tag or edge type, and the result is a DataFrame.
Let's start the practice by pulling the Spark Connector code from GitHub:
git clone -b v1.0 git@github.com:vesoft-inc/nebula-java.git
cd nebula-java/tools/nebula-spark
mvn clean compile package install -Dgpg.skip -Dmaven.javadoc.skip=true
Copy the compiled package to the local Maven library.
Examples of applications are as follows:
Add the nebula-spark dependency to the pom.xml file of your Maven project:
<dependency>
    <groupId>com.vesoft</groupId>
    <artifactId>nebula-spark</artifactId>
    <version>1.1.0</version>
</dependency>
Read Nebula Graph data in the Spark program:
// Read Nebula Graph vertex data
val vertexDataset: Dataset[Row] = spark.read
  .nebula("127.0.0.1:45500", "spaceName", "100")
  .loadVerticesToDF("tag", "field1,field2")
vertexDataset.show()

// Read Nebula Graph edge data
val edgeDataset: Dataset[Row] = spark.read
  .nebula("127.0.0.1:45500", "spaceName", "100")
  .loadEdgesToDF("edge", "*")
edgeDataset.show()
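For completeness, here is a minimal, hedged sketch of the setup around those read calls; the app name and the local master are assumptions, and the exact import path for the implicit .nebula(...) syntax depends on the connector version you use.

import org.apache.spark.sql.{Dataset, Row, SparkSession}
// The .nebula(...) syntax on spark.read is added by an implicit class in the
// nebula-spark package; import it according to your connector version.

object NebulaReaderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nebula-reader-example") // hypothetical app name
      .master("local[*]")               // assumption: local test run
      .getOrCreate()

    // Place the vertex/edge read calls shown above here, for example:
    // val vertexDataset: Dataset[Row] = spark.read
    //   .nebula("127.0.0.1:45500", "spaceName", "100")
    //   .loadVerticesToDF("tag", "field1,field2")

    spark.stop()
  }
}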
Configuration instructions:
nebula(address: String, space: String, partitionNum: String)
address: multiple addresses can be configured, separated by commas, such as "ip1:45500,ip2:45500"
space: the graphSpace of Nebula Graph
partitionNum: the number of partitions Spark uses when reading Nebula. Try to use the partitionNum specified when the Nebula Graph space was created, so that one Spark partition reads the data of one Nebula Graph part.
loadVerticesToDF(tag: String, fields: String)
tag: the Tag of vertices in Nebula Graph
fields: the fields in this Tag, with multiple field names separated by commas. Only the fields listed in fields are read; "*" means all fields are read.
loadEdgesToDF(edge: String, fields: String)
edge: the edge type in Nebula Graph
fields: the fields in this Edge, with multiple field names separated by commas. Only the fields listed in fields are read; "*" means all fields are read.
The above is the principle and practice of Spark Connector Reader.