
How to import large amounts of data into Neo4j using the batch-import tool

Updated: 2025-02-23. From: SLTechnology News&Howtos (shulou)


Shulou (Shulou.com), 05/31

This article explains how to import huge amounts of data into Neo4j using the batch-import tool. The editor found it very practical and shares it here as a reference.

1. batch-import: original project address: https://github.com/jexp/batch-import

This tool was written by Michael Hunger, one of the authors of Neo4j, and is a further optimization of Neo4j's own bulk import tool. However, when it imports a .gz compressed file, the relationships cannot be imported. If you want to import from .gz packages, please use my modified version: https://github.com/mo9527/batch-import

2. Environment preparation

JDK: version 7 or later

Memory: 8 GB or more. Importing a lot of data consumes a lot of memory. I imported nearly 150 million nodes and 300 million relationships using 32 GB of memory.

3. Import steps

A) Clone the code from GitHub and package it with Maven. Place the packaged jar file in the lib folder together with the project's dependency jars. Put the batch.properties file and the import script in the same directory as the lib folder, so that the resulting directory contains the lib folder, batch.properties, and the import script side by side.

PS: the file folder in my layout holds the csv files and .gz packages that I am going to import.

B) Assemble the csv files

For this step you may need to write your own code to produce the csv files according to your actual business requirements. Here I only cover the key points of the csv file format:

1) Node csv file

The first column of the node csv file is fixed: its value is the label name of the node. The second column is the index column, and its header has the format id:string:indexName, where id is the property name of the column (name it as needed), string is the data type of the field, and indexName is the name of the index to create in the Neo4j database. In my own file this header is id:string:buyerId.

The remaining columns are node properties and have no special requirements.
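The node file layout described above can be sketched in Python. This is only an illustration under stated assumptions: the tab delimiter, the `l:label` header for the label column, the `Buyer` label, the `name` property, and the sample rows are not taken from this article, so check them against the tool's README before relying on them.

```python
import csv
import io

# Assumption: the import tool reads tab-separated files.
DELIMITER = "\t"


def index_column(prop, data_type, index_name):
    """Build an indexed column header in the property:type:indexName form."""
    return f"{prop}:{data_type}:{index_name}"


def build_node_csv(label, rows, index_name="buyerId", label_header="l:label"):
    """Return node csv text: label column, indexed id column, then one property.

    label_header is an assumed name for the fixed first column.
    rows is an iterable of (node_id, name) pairs (hypothetical sample data).
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=DELIMITER, lineterminator="\n")
    writer.writerow([label_header, index_column("id", "string", index_name), "name"])
    for node_id, name in rows:
        writer.writerow([label, node_id, name])
    return buf.getvalue()


if __name__ == "__main__":
    print(build_node_csv("Buyer", [("b1", "Alice"), ("b2", "Bob")]))
```

Keeping the header construction in one helper makes it easier to reuse exactly the same header string in the relationship file later.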

2) Relationship csv file

(Screenshot: my relationship csv file.)

The first two columns of the relationship csv file need special attention. The first column is the start node of the relationship, the second column is the end node, the third column is the relationship type, and any following columns are relationship properties, which are optional. The project's notes on GitHub leave out a precaution that deserves to be spelled out here:

The header of the first column (the start node), for example id:string:buyerId, must be exactly the same as the header defined in the start node's csv file (above), and the header of the second column must be exactly the same as the one in the end node's csv file. Otherwise the tool cannot find the nodes that the relationship connects.
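Because a header mismatch only shows up during the import, it can help to check the files beforehand. The following is a minimal sketch of such a check, not part of batch-import itself; the function names, the tab delimiter, and the sample headers (`id:string:sellerId`, `type`, `l:label`) are assumptions made for illustration.

```python
import csv
import io


def header_of(csv_text, delimiter="\t"):
    """Return the header row of csv text as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text), delimiter=delimiter), [])


def check_relationship_headers(rel_csv, start_node_csv, end_node_csv, delimiter="\t"):
    """Return a list of problems; an empty list means the first two headers match."""
    rel = header_of(rel_csv, delimiter)
    if len(rel) < 2:
        return ["relationship file needs at least a start and an end column"]
    problems = []
    # The comparison is exact and case-sensitive, as the article requires.
    if rel[0] not in header_of(start_node_csv, delimiter):
        problems.append(f"start column {rel[0]!r} not found in start node file")
    if rel[1] not in header_of(end_node_csv, delimiter):
        problems.append(f"end column {rel[1]!r} not found in end node file")
    return problems


if __name__ == "__main__":
    buyers = "l:label\tid:string:buyerId\tname\nBuyer\tb1\tAlice\n"
    sellers = "l:label\tid:string:sellerId\tname\nSeller\ts1\tShop\n"
    rels = "id:string:buyerId\tid:string:sellerId\ttype\nb1\ts1\tBOUGHT_FROM\n"
    print(check_relationship_headers(rels, buyers, sellers))
```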

3) Modify the batch.properties file

There are two main changes.

If you are importing into an existing Neo4j database, set:

batch_import.keep_db=true

Add every index name used in the node csv files to the file. For example, the index name in the node csv file above is buyerId, so add batch_import.node_index.buyerId=exact to the file.
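The two changes can also be derived from the node csv headers instead of being typed by hand. The sketch below is an illustration only and is not the author's configuration file; the helper names and the sample headers are hypothetical, while the two property keys are the ones quoted above.

```python
def index_names(header):
    """Collect index names from headers shaped like property:type:indexName."""
    names = []
    for column in header:
        parts = column.split(":")
        if len(parts) == 3 and parts[2] and parts[2] not in names:
            names.append(parts[2])
    return names


def properties_lines(node_headers, keep_db=False):
    """Return the batch.properties lines for the given node csv header rows."""
    lines = []
    if keep_db:
        # Only needed when the target database already exists.
        lines.append("batch_import.keep_db=true")
    for header in node_headers:
        for name in index_names(header):
            line = f"batch_import.node_index.{name}=exact"
            if line not in lines:
                lines.append(line)
    return lines


if __name__ == "__main__":
    headers = [
        ["l:label", "id:string:buyerId", "name"],
        ["l:label", "id:string:sellerId", "name"],
    ]
    print("\n".join(properties_lines(headers, keep_db=True)))
```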

(Screenshot: my own batch.properties file.)

4. Import

The import procedure is similar on Linux and Windows, but the scripts executed are different. Here I take Windows as an example.

The files are all ready, so the import can begin.

Open cmd, cd to the directory that contains the import script import.bat, and execute the command:

import.bat test.db node.csv rel.csv

The parameters of the command are as follows. The first parameter is the database directory, which can point to any location by absolute path. The second parameter is the node csv file; separate multiple csv files with commas. The third parameter is the relationship csv file. If you import compressed packages, watch out for a pitfall: you cannot put all node types into one compressed package. You must compress each node type separately, otherwise only the first node type is imported. Likewise, compress each relationship file separately, and then pass the .gz files separated by commas.
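The "one .gz per file" rule and the comma-separated arguments can be prepared with a small script. This is a sketch under assumptions: the file names are hypothetical, and the script only builds the argument list for the command shown above without running the import.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_separately(csv_paths):
    """Compress each csv into its own .gz file and return the new paths."""
    gz_paths = []
    for path in map(Path, csv_paths):
        gz_path = path.with_name(path.name + ".gz")
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_paths.append(gz_path)
    return gz_paths


def import_args(db_dir, node_files, rel_files):
    """Build the three arguments: database directory, node files, relationship files."""
    return [
        str(db_dir),
        ",".join(str(p) for p in node_files),
        ",".join(str(p) for p in rel_files),
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        buyer = Path(tmp) / "buyer.csv"    # hypothetical node file
        seller = Path(tmp) / "seller.csv"  # hypothetical node file
        rel = Path(tmp) / "rel.csv"        # hypothetical relationship file
        for f in (buyer, seller, rel):
            f.write_text("placeholder\n")
        nodes = gzip_separately([buyer, seller])
        rels = gzip_separately([rel])
        print(["import.bat"] + import_args("test.db", nodes, rels))
```

Compressing files one by one keeps each node type and each relationship file in its own package, which is the condition the article describes for all of them to be imported.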

If there is nothing wrong with your csv files and there is enough memory, all that is left is to wait.

If you want to change the heap size of the import tool, modify the set HEAP=4G line in the script file.

Warm tip: if a node file contains Chinese text, the import will be very slow unless you have 128 GB of memory. One of my node files has only one column in Chinese, with values no longer than 4 Chinese characters, and importing its more than 20 million records took 2 hours on my machine with 32 GB of memory. My other node files, more than 40 million nodes with no Chinese characters, basically took no more than 2 minutes.

