Analysis of Spark Writer: the Data Import Tool for the Nebula Graph Database


Starting with Hadoop

In recent years, with the rise of big data, distributed computing engines have emerged one after another. Hadoop is an open-source distributed computing framework from the Apache open source organization and has been deployed on many large websites. The core design idea of Hadoop comes from Google's MapReduce paper and is inspired by the map and reduce functions of functional languages. In functional languages, map applies a function to every element of a list, and reduce combines the elements of a list into a single result. With the MapReduce model, data can be partitioned by key, processed in parallel, and aggregated into a final result.
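
As a minimal illustration of these two operations, here is a sketch using plain Scala collections (not Hadoop itself); the word list is made up for the example:

object MapReduceDemo {
  def main(args: Array[String]): Unit = {
    val words = List("nebula", "graph", "spark", "writer")

    // map: apply a function to every element of the list
    val lengths = words.map(_.length)   // List(6, 5, 5, 6)

    // reduce: combine the elements of the list into a single value
    val total = lengths.reduce(_ + _)   // 22

    println(s"lengths = $lengths, total = $total")
  }
}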

Turning to Apache Spark

Apache Spark is a general-purpose in-memory parallel computing framework built around speed and ease of use. It was developed by the AMP Lab at the University of California, Berkeley in 2009, was open-sourced in 2010, and later became a project of the Apache Software Foundation. Spark draws on Hadoop's design ideas, inherits the advantages of its distributed parallel computing, and provides a rich set of operators.

Spark provides a comprehensive, unified framework for handling big data processing workloads over diverse data sources, supporting both batch and streaming data processing. Spark supports in-memory computing, so its performance is greatly improved compared with Hadoop. Spark supports programming in Java, Scala, and Python, lets developers operate on distributed datasets much like local collections, and supports interactive queries. In addition to classic MapReduce-style operations, Spark also supports SQL queries, stream processing, machine learning, and graph computing.

The Resilient Distributed Dataset (RDD) is Spark's most basic abstraction and represents an immutable, partitioned collection of records. RDDs are fault tolerant and support locality-aware scheduling: operating on an RDD feels like operating on a local collection, with no need to worry about task scheduling or fault tolerance. RDDs let users explicitly cache a working set in memory when running multiple queries, so that subsequent queries can reuse the dataset. A series of transformations on RDDs forms a DAG, and the DAG is split into stages according to the dependencies between the RDDs.
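
A minimal RDD sketch illustrating caching and the transformation/action split (the local master setting and the sample numbers are assumptions made for the example):

import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    // Transformations (filter) only build the lineage / DAG; nothing runs yet.
    val ages   = sc.parallelize(Seq(42, 36, 33, 25, 41))
    val adults = ages.filter(_ >= 30).cache()   // explicitly cache the working set

    // Actions trigger the actual computation.
    println(adults.count())   // first action materializes and caches the partitions
    println(adults.sum())     // a later action reuses the cached data

    sc.stop()
  }
}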

Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike the data in an RDD, the data in a DataFrame is organized into named columns, like a table in a relational database. DataFrame is designed to make large datasets easier to work with: it lets developers specify a schema for a distributed dataset and thus work at a higher level of abstraction.

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel with functional or relational operations. A Dataset is a collection of JVM objects with a well-defined type, which is specified through a case class in Scala or a class in Java. A DataFrame is a Dataset of Row, that is, Dataset[Row]. The Dataset API is strongly typed, and the type information can be used for optimization.
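
A small sketch of the relationship between the two (the Player case class and the local SparkSession are assumptions made for the example):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DatasetDemo {
  // Strongly typed schema declared through a Scala case class
  case class Player(id: Long, name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Dataset[Player]: typed API, checked at compile time
    val players: Dataset[Player] = Seq(
      Player(100, "Tim Duncan", 42),
      Player(101, "Tony Parker", 36)
    ).toDS()

    // A DataFrame is simply a Dataset of Row
    val df: DataFrame = players.toDF()

    // Transformations only build a logical plan; show() is the action that runs it
    df.filter($"age" > 40).show()

    spark.stop()
  }
}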

A DataFrame or Dataset triggers computation only when an action is performed. In essence, a Dataset represents a logical plan that describes the computation needed to produce the data. When an action is executed, Spark's query optimizer optimizes the logical plan and generates an efficient, parallel and distributed physical plan.

The Spark-Based Data Import Tool

Spark Writer is Nebula Graph's Spark-based distributed data import tool, implemented on top of DataFrame. It can convert data from a variety of sources into the vertices and edges of a graph and import them into the graph database in batches.

The currently supported data sources are Hive and HDFS.

Spark Writer supports importing multiple tags and edge types at the same time, and different tags and edge types can be configured with different data sources.

Driven by the configuration file, Spark Writer generates insert statements from the data and sends them to the query service, which performs the insert operations. Inserts in Spark Writer are executed asynchronously, and the numbers of successes and failures are counted with Spark accumulators.
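
A hedged sketch of this counting pattern (this is not Spark Writer's actual code; the executeInsert stub and the statement text are placeholders for the asynchronous call to the query service):

import org.apache.spark.sql.SparkSession

object AccumulatorSketch {
  // Placeholder for sending a statement to the Nebula query service.
  def executeInsert(stmt: String): Boolean = {
    // Pretend the insert always succeeds in this sketch.
    true
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("writer-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Accumulators gather per-executor counts back on the driver.
    val succeeded = sc.longAccumulator("insert.succeeded")
    val failed    = sc.longAccumulator("insert.failed")

    val statements = sc.parallelize(Seq(
      "INSERT VERTEX player(name, age) VALUES 100:(\"Tim Duncan\", 42);"
    ))

    statements.foreachPartition { stmts =>
      stmts.foreach { stmt =>
        if (executeInsert(stmt)) succeeded.add(1L) else failed.add(1L)
      }
    }

    println(s"succeeded = ${succeeded.value}, failed = ${failed.value}")
    spark.stop()
  }
}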

Get and compile the Spark Writer source code:

git clone https://github.com/vesoft-inc/nebula.git
cd nebula/src/tools/spark-sstfile-generator
mvn compile package

Tag data file format

A tag data file consists of one record per line, each line representing a vertex and its properties. In general, the first column is the vertex ID (the name of this column is specified later in the mapping file), and the remaining columns are the vertex properties. For example, the player tag data file looks like this:

{"id": 100, "name": "Tim Duncan", "age": 42} {"id": 101," name ":" Tony Parker "," age ": 36} {" id ": 102," name ":" LaMarcus Aldridge "," age ": 33} Edge Type data File format

An edge type data file consists of one record per line, each line representing an edge and its properties. In general, the first column is the source vertex ID and the second column is the destination vertex ID; these two columns are specified in the mapping file, and the remaining columns are edge properties. Here we take the JSON format as an example.

Take the edge type follow data as an example:

{"source": 101,101,101,101,101,95} {"source", "target", "likeness": 95} {"source": 101,102,102, "likeness": 90} {"source": 100,101,105,95, "ranking": 2} {"source": 101,101,100,100,105," ranking ": 1} {" source ": 101,101,102,102," likeness": 90} {"source": 101,101,102,102,102,101,102,102,102,102,101,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,102,101,101,101,101,101,102,102,102,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,10 "ranking": 3} configuration file format

Spark Writer uses the HOCON configuration file format. HOCON (Human-Optimized Config Object Notation) is an easy-to-use, object-oriented configuration format. The configuration file consists of four parts: the Spark configuration section, the Nebula configuration section, the tag configuration section, and the edge configuration section.

The Spark section configures the parameters of the Spark job, while the Nebula section configures the user name, password, and other information needed to connect to Nebula. The tags mapping and edges mapping sections correspond to the input sources of multiple tags and edge types respectively, describing the data source and other basic information of each tag/edge type. Different tags and edge types can come from different data sources.

The Nebula configuration section mainly describes the query service addresses, the user name and password, the graph space name, and other connection information.

nebula: {
  # query engine IP list
  addresses: ["127.0.0.1:3699"]

  # user name and password for connecting to the Nebula Graph service
  user: user
  pswd: password

  # Nebula Graph graph space name
  space: test

  # thrift timeout and number of retries; the defaults are 3000 and 3
  connection {
    timeout: 3000
    retry: 3
  }

  # number of nGQL query retries; the default is 3
  execution {
    retry: 3
  }
}

The tag configuration section describes the tags to be imported; each element in the array is one tag. There are two main kinds of tag import: file-based import and Hive-based import.

A file-based import configuration needs to specify the file type, while a Hive-based import configuration needs to specify the query statement to execute.

# processing tags
tags: [
  # Load data from an HDFS file; here the data type is Parquet.
  # The tag name is ${TAG_NAME}.
  # field_0 and field_1 of the HDFS Parquet file will be written to ${TAG_NAME}.
  # The vertex column is ${KEY_FIELD}.
  {
    name: ${TAG_NAME}
    type: parquet
    path: ${HDFS_PATH}
    fields: {
      field_0: nebula_field_0,
      field_1: nebula_field_1
    }
    vertex: ${KEY_FIELD}
    batch: 16
  }

  # Similar to the above.
  # Loading from Hive executes the statement ${EXEC} to produce the dataset.
  {
    name: ${TAG_NAME}
    type: hive
    exec: ${EXEC}
    fields: {
      hive_field_0: nebula_field_0,
      hive_field_1: nebula_field_1
    }
    vertex: ${KEY_FIELD}
  }
]

Description:

The name field indicates the tag name. The fields field configures the mapping between the HDFS or Hive fields and the Nebula fields. The batch parameter is the number of records imported per batch and needs to be set according to the actual workload.

The edge type configuration section describes the edge types to be imported; each element in the array is one edge type. There are two main kinds of edge type import: file-based import and Hive-based import.

A file-based import configuration needs to specify the file type.

A Hive-based import configuration needs to specify the query statement to execute.

# processing edges
edges: [
  # Load data from HDFS; here the data type is JSON.
  # The edge type name is ${EDGE_NAME}.
  # field_0 and field_1 of the HDFS JSON file will be written to ${EDGE_NAME}.
  # The source field is source_field, the target field is target_field,
  # and the edge ranking (weight) field is ranking_field.
  {
    name: ${EDGE_NAME}
    type: json
    path: ${HDFS_PATH}
    fields: {
      field_0: nebula_field_0,
      field_1: nebula_field_1
    }
    source: source_field
    target: target_field
    ranking: ranking_field
  }

  # Similar to the above.
  # Loading from Hive executes the statement ${EXEC} to produce the dataset.
  # The edge ranking is optional.
  {
    name: ${EDGE_NAME}
    type: hive
    exec: ${EXEC}
    fields: {
      hive_field_0: nebula_field_0,
      hive_field_1: nebula_field_1
    }
    source: source_id_field
    target: target_id_field
  }
]

Description:

The name field indicates the edge type name.

The fields field configures the mapping between the HDFS or Hive fields and the Nebula fields. The source field specifies the source vertex of the edge, the target field specifies the destination vertex, and the ranking field specifies the edge's ranking (weight). The batch parameter is the number of records imported per batch and needs to be set according to the actual workload.

Import data command:

bin/spark-submit \
  --class com.vesoft.nebula.tools.generator.v2.SparkClientGenerator \
  --master ${MASTER-URL} \
  ${SPARK_WRITER_JAR_PACKAGE} -c conf/test.conf -h -d

Description:

-c (config): specifies the configuration file path.
-h (hive): specifies whether Hive is supported.
-d (dry): dry run; checks whether the configuration file is correct without processing any data.

A note from the author: Hi everyone, I am darion, a software engineer at Nebula Graph, and I have a keen interest in distributed systems. I hope this article gives you some ideas; my knowledge is limited, so if anything is off, please point it out. Thank you.

Did you like this article? If so, please give our GitHub repo a star to encourage us.

Want to discuss graph database technology? Add the NebulaGraph official assistant on WeChat (NebulaGraphbot) to be added to the discussion group.
