The following describes how to import data from Neo4j into Nebula Graph Database using the data import tool Nebula Graph Exchange. Before walking through the actual import steps, let's take a look at how this import function is implemented internally.
Data Processing Principle of Nebula Graph Exchange
Our import tool, Nebula Graph Exchange, uses Spark as the import platform, which lets it import massive amounts of data while ensuring performance. Spark provides a nice abstraction, the DataFrame, which makes it easy to support multiple data sources: to add a new data source, you only need to provide the code that reads the configuration file and a Reader class that returns a DataFrame for that source.
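To make this concrete, here is a minimal sketch of what such a Reader abstraction could look like; the trait and class names are illustrative and are not Exchange's actual code.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical Reader abstraction: every data source only has to produce a DataFrame.
trait Reader {
  def read(): DataFrame
}

// Example: a CSV-backed reader. Supporting another source means writing another small
// class that knows how to turn its configuration into a DataFrame.
class CsvReader(session: SparkSession, path: String) extends Reader {
  override def read(): DataFrame =
    session.read.option("header", "true").csv(path)
}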
A DataFrame can be thought of as a distributed table. It can be split into multiple partitions, and those partitions can be stored on different machines, which enables parallel operations. Spark also provides a concise API that lets users manipulate a DataFrame as easily as a local dataset. Most databases nowadays can export data directly to a DataFrame, and even if a database does not provide this capability, you can build a DataFrame manually through the database driver.
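As a rough illustration of building a DataFrame by hand from rows fetched through a database driver, consider the following sketch (the rows here are hard-coded; a real reader would fill them in by iterating over the driver's result set):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object ManualDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("manual-df").master("local[*]").getOrCreate()

    // Stand-ins for records returned by a driver query.
    val rows = Seq(Row(1L, "a"), Row(2L, "b"))
    val schema = StructType(Seq(
      StructField("idInt", LongType, nullable = false),
      StructField("idString", StringType, nullable = true)
    ))

    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
    df.show()
  }
}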
After Nebula Graph Exchange has turned the data from the data source into a DataFrame, it traverses every row and, following the field mapping in the configuration file, fetches the corresponding value by column name. Once batchSize rows have been traversed, Exchange writes the accumulated data into Nebula Graph in one go. Currently, Exchange generates nGQL statements and has the Nebula Client write the data asynchronously; a next step is to support generating the SST files used by Nebula Graph's underlying storage directly, for better performance.
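A simplified sketch of this batch write path is shown below; the object and method names are hypothetical, and quoting/escaping of string property values as well as error handling are omitted:

import org.apache.spark.sql.Row

object BatchWriteSketch {
  // Build one INSERT VERTEX nGQL statement for a batch of rows.
  def buildInsert(tag: String, vertexField: String, fields: Seq[String], batch: Seq[Row]): String = {
    val values = batch.map { row =>
      val vid = row.getAs[Long](vertexField)
      val props = fields.map(f => row.getAs[Any](f)).mkString(", ")
      s"$vid: ($props)"
    }.mkString(", ")
    s"INSERT VERTEX $tag(${fields.mkString(", ")}) VALUES $values"
  }

  // Inside df.foreachPartition { rows => ... }, rows would be grouped into batchSize
  // chunks and each generated statement handed to the Nebula client asynchronously.
  def statements(rows: Iterator[Row], tag: String, vertexField: String,
                 fields: Seq[String], batchSize: Int): Iterator[String] =
    rows.grouped(batchSize).map(batch => buildInsert(tag, vertexField, fields, batch))
}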
Implementation of Neo4j Data Import
Although Neo4j officially provides a library that can export data directly to a DataFrame, reading data with it makes it hard to meet the requirement of resuming an interrupted transfer from a breakpoint. So instead of using that library, we read data through Neo4j's official driver. Exchange achieves better performance by calling the Neo4j driver in different partitions to execute Cypher statements with different SKIP and LIMIT clauses, distributing the data across the partitions. The number of partitions is specified by the configuration item partition.
The Neo4jReader class in Exchange first replaces everything after return in the user-configured exec Cypher statement with count(*) to obtain the total amount of data, and then calculates the starting offset and size of each partition from the number of partitions. If the user has configured a check_point_path directory, it reads the files in that directory, and if the job is resuming from a breakpoint, Exchange calculates each partition's offset and size accordingly. Each partition then appends a different SKIP and LIMIT to the Cypher statement and calls the driver to execute it. Finally, the returned data is processed into a DataFrame, which completes the Neo4j data import.
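A simplified version of the per-partition offset/size calculation and the SKIP/LIMIT rewriting might look like the following sketch. It is only an illustration under assumptions; the names and structure are not the actual Neo4jReader code:

object PartitionSketch {
  final case class Slice(skip: Long, limit: Long)

  // Split a total row count into `partition` contiguous ranges.
  def partitionSlices(totalCount: Long, partition: Int): Seq[Slice] = {
    val base = totalCount / partition
    val remainder = totalCount % partition
    var offset = 0L
    (0 until partition).map { i =>
      val size = base + (if (i < remainder) 1 else 0)
      val slice = Slice(offset, size)
      offset += size
      slice
    }
  }

  def main(args: Array[String]): Unit = {
    val exec = "match (n:tagA) return n.idInt as idInt order by n.idInt"
    // Each Spark partition would run one of these statements through the Neo4j driver.
    partitionSlices(totalCount = 1000000L, partition = 10).foreach { s =>
      println(s"$exec SKIP ${s.skip} LIMIT ${s.limit}")
    }
  }
}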
Practice of Neo4j Data Import
The hardware environment used for this import demonstration is as follows:
CPU name: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
CPU cores: 14
Memory size: 251 GB
The software environment is as follows:
Neo4j: 3.5.20 Community Edition
Nebula Graph: deployed with docker-compose, default configuration
Spark: standalone, version 2.4.6 pre-built for Hadoop 2.7
Since Nebula Graph is a strong-schema database, you need to create the Space and the schemas of the Tags and Edges before importing data. For more information, see the Nebula Graph documentation.
Here we create a Space named test with 1 replica, two tags, tagA and tagB, each with four properties, and an edge type edgeAB, which also has four properties. The specific nGQL statements are as follows:
# Create the graph space
CREATE SPACE test(replica_factor=1);
# Use the graph space test
USE test;
# Create the tag tagA
CREATE TAG tagA(idInt int, idString string, tboolean bool, tdouble double);
# Create the tag tagB
CREATE TAG tagB(idInt int, idString string, tboolean bool, tdouble double);
# Create the edge type edgeAB
CREATE EDGE edgeAB(idInt int, idString string, tboolean bool, tdouble double);
At the same time, mock data is imported into Neo4j: a total of 1 million nodes labeled tagA and tagB, and a total of 10 million edges of type edgeAB connecting tagA and tagB nodes. Note that the data exported from Neo4j must contain the properties defined in Nebula Graph, and their types must match the types defined in Nebula Graph.
Finally, to speed up both importing the mock data into Neo4j and reading it back out, an index is created on the idInt property of tagA and tagB. Regarding indexes, note that Exchange does not import indexes, constraints, or similar information from Neo4j into Nebula Graph, so after the data has been written to Nebula Graph, users need to create indexes themselves and REBUILD INDEX (to index existing data).
Now we can import the Neo4j data into Nebula Graph. First, download, compile, and package the project, which lives in the tools/exchange folder of the nebula-java repository, by executing the following commands:
git clone https://github.com/vesoft-inc/nebula-java.git
cd nebula-java/tools/exchange
mvn package -DskipTests
Then you can see the file target/exchange-1.0.1.jar.
Next, write a configuration file in HOCON (Human-Optimized Config Object Notation) format; it can be adapted from the src/main/resources/server_application.conf file. First, configure address, user, pswd, and space under the nebula section. The test environment uses the default configuration throughout, so no additional changes are needed there. Then configure the tags section, which needs entries for both tagA and tagB. Only the tagA configuration is shown here; the tagB configuration is the same as tagA's.
{
  # ==== neo4j connection settings ====
  name: tagA # must be the same as the Tag name in Nebula Graph; the Tag must be created in Nebula Graph in advance
  server: "bolt://127.0.0.1:7687" # neo4j address
  user: neo4j # neo4j user name
  password: neo4j # neo4j password

  encryption: false # (optional): whether the connection is encrypted, default is false
  database: graph.db # (optional): neo4j database name, not supported by Community Edition

  # ==== import settings ====
  type: {
    source: neo4j # also supports PARQUET, ORC, JSON, CSV, HIVE, MYSQL, PULSAR, KAFKA...
    sink: client # how data is written into Nebula Graph; currently only client is supported, direct export of Nebula Graph underlying database files will be supported in the future
  }

  nebula.fields: [idInt, idString, tdouble, tboolean]
  fields: [idInt, idString, tdouble, tboolean]
  # Field mapping: Nebula Graph property names on top, neo4j property names below (one-to-one).
  # The mapping is configured as a List rather than a Map to preserve field order, which will be
  # needed when directly exporting Nebula Graph underlying storage files in the future.

  vertex: idInt # the neo4j field used as the Nebula Graph vid; its type needs to be long or int
  partition: 10 # number of partitions
  batch: 2000 # how many rows are written to Nebula Graph at a time
  check_point_path: "file:///tmp/test" # (optional): directory where import progress is saved, used for resuming from a breakpoint

  exec: "match (n:tagA) return n.idInt as idInt, n.idString as idString, n.tdouble as tdouble, n.tboolean as tboolean order by n.idInt"
}
The edge settings are mostly the same as the vertex settings, but since an edge in Nebula Graph is identified by the vid of its source vertex and the vid of its destination vertex, you need to specify which field serves as the edge's source vid and which field serves as its destination vid. The edge-specific configuration is shown below.
source: {
  field: a.idInt
  # policy: "hash"
} # source vid settings
target: {
  field: b.idInt
  # policy: "uuid"
} # destination vid settings
ranking: idInt # (optional): the field used as the rank
partition: 1 # the number of partitions is set to 1; the reason is explained below
exec: "match (a:tagA)-[r:edgeAB]->(b:tagB) return a.idInt, b.idInt, r.idInt as idInt, r.idString as idString, r.tdouble as tdouble, r.tboolean as tboolean order by id(r)"
A policy of hash or uuid can be set under a vertex's vertex item and under an edge's source and target items. It takes a field whose type is string as the vid of the vertex and maps the string to an integer through the hash or uuid function. The example above does not need a policy because the field used as the vid is already an integer. The difference between hash and uuid is described in the Nebula Graph documentation.
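Conceptually, such a policy deterministically maps a string id to a 64-bit integer vid. The sketch below only illustrates that idea using MurmurHash3 from the Scala standard library; it is not the hash or uuid function that Nebula Graph actually uses.

import scala.util.hashing.MurmurHash3

object PolicySketch {
  // Illustration only: combine two 32-bit hashes into a 64-bit value.
  def pseudoHashVid(id: String): Long = {
    val hi = MurmurHash3.stringHash(id, 0).toLong
    val lo = MurmurHash3.stringHash(id, 1).toLong & 0xffffffffL
    (hi << 32) | lo
  }

  def main(args: Array[String]): Unit = {
    // The same string always maps to the same vid.
    println(pseudoHashVid("user-42"))
  }
}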
In the Cypher standard, if there is no order by clause, there is no guarantee that the results of repeated queries are returned in the same order. Although Neo4j appears to return results in the same order even without order by, to prevent possible data loss during import it is strongly recommended to add order by to the Cypher statement, even though this increases import time. To keep the import efficient, it is best to order by an indexed property. If there is no index, you can also observe the default ordering and pick a suitable property to sort on. If no pattern can be found in the default ordering, you can sort by the ID of the node or relationship and set partition as low as possible to reduce the sorting pressure on Neo4j; in this article, the partition for the edge edgeAB is set to 1.
In addition, Nebula Graph uses the vid as the unique primary key when creating vertices and edges, and overwrites existing data with the same primary key. So if a Neo4j property value is used as the Nebula Graph vid and that value is duplicated in Neo4j, only one record per duplicated vid will be stored in Nebula Graph and the rest will be overwritten. Because the import writes data to Nebula Graph concurrently, there is no guarantee that the record that survives is the latest one in Neo4j.
Also pay attention to the breakpoint-resume feature: between the interruption and the resume, the database state must not change, for example by adding or deleting data, and the number of partitions must not change either; otherwise data may be lost.
Finally, because Exchange needs to execute Cypher statements with different SKIP and LIMIT clauses in different partitions, the user-supplied Cypher statement must not contain SKIP or LIMIT clauses itself.
Now you can run the Exchange program to import the data by executing the following command:
$SPARK_HOME/bin/spark-submit --class com.vesoft.nebula.tools.importer.Exchange --master "local[10]" target/exchange-1.0.1.jar -c /path/to/conf/neo4j_application.conf
With this configuration, importing the 1 million vertices takes 13 s and importing the 10 million edges takes 213 s, for a total of 226 s.
Appendix: a comparison of Neo4j 3.5 Community Edition and Nebula Graph 1.0.1
Neo4j and Nebula Graph differ in system architecture, data model, and access methods.
The above is how to use Nebula Graph Exchange to import data from Neo4j into Nebula Graph Database.