
A Quick Tour of Importing Knowledge Graph Data into the Graph Database Nebula Graph

Updated 2025-04-05. From: SLTechnology News&Howtos (shulou), Database section


Recently, @Yener open-sourced OwnThink (link: https://github.com/ownthink/KnowledgeGraphData), described as the largest Chinese knowledge graph to date, with about 140 million records.

This article describes how to quickly import this data into the graph database Nebula Graph. The whole process takes about 30 minutes.

A brief introduction to the Chinese knowledge graph OwnThink

The knowledge graph is a concept put forward by Google in 2012. It describes the entities and concepts that exist in the real world, together with the relationships between them. It is applied in many fields, such as search engines, question-answering bots and knowledge extraction.

The OwnThink data set (link: https://github.com/ownthink/KnowledgeGraphData) contains about 140 million records. The data is stored in CSV format as a mix of (entity, attribute, value) triples and (entity, relationship, entity) triples.

You can download it here: https://nebula-graph.oss-accelerate.aliyuncs.com/ownthink/kg_v2.tar.gz

View the original file

Because ownthink_v2.csv is very large, only an excerpt is shown here as an example:

red food, description, red food refers to food that is red, orange-red or brown-red.
red food, whether it contains preservatives, no
red food, main edible efficacy, prevent colds, relieve fatigue
red food, use, enhance epidermal cell regeneration and prevent skin aging
Dalongqiu, description, the Yandang Mountain scenic area is scattered, from Yangjiao Cave in the east to Jubanling in the west, and from Jinzhu Stream in the south to Liuping Mountain in the north.
Dalongqiu, Chinese name, Dalongqiu
Dalongqiu, foreign language name, big dragon autrum
Dalongqiu, ticket price, 50 yuan
Dalongqiu, famous scenic spot, Furong Peak
Yao Ming [chairman of China Basketball Association and China Vocational Federation], wife, Ye Li

Here, (red food, whether it contains preservatives, no) is a typical triple of the (entity, attribute, value) form, and (Yao Ming [chairman of China Basketball Association and China Vocational Federation], wife, Ye Li) is a typical triple of the (entity, relationship, entity) form.
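As the excerpt shows, the third field of a triple can itself contain commas, so splitting on every comma would break some records. Below is a minimal Python sketch of one way to read such lines. It assumes that the file is plain comma-joined text without quoting and that the first two fields never contain a comma; neither assumption has been checked against the full data set. The function name parse_triple is hypothetical.

```python
from typing import Optional, Tuple

Triple = Tuple[str, str, str]


def parse_triple(line: str) -> Optional[Triple]:
    """Split one raw line into (entity, attribute_or_relation, value).

    Only the first two commas are treated as separators, so commas
    inside the value are preserved. Returns None for malformed lines.
    """
    parts = line.rstrip("\r\n").split(",", 2)  # at most three fields
    if len(parts) != 3:
        return None
    entity, name, value = (p.strip() for p in parts)
    if not entity or not name or not value:
        return None
    return entity, name, value


sample = [
    "red food, whether it contains preservatives, no",
    "red food, main edible efficacy, prevent colds, relieve fatigue",
    "not a triple",
]
triples = [t for t in map(parse_triple, sample) if t is not None]
for t in triples:
    print(t)
```

Lines that do not yield three non-empty fields are skipped here; a real cleaning run would probably want to log them instead.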

Step 1. Data modeling and cleaning preparation

Modeling

Nebula Graph is an open source distributed graph database (link: https://github.com/vesoft-inc/nebula). Compared with Neo4j, its main distinguishing feature is that it is fully distributed, which makes it suitable for scenarios where the data volume exceeds what a single machine can handle.

Graph databases usually support the directed property graph as their data model. Each vertex in the graph can be given a tag (Neo4j calls it a Label), and vertices are connected to each other by edges. Each tag and each edge type can also carry properties. For the triple data of a knowledge graph, however, most of these features are not needed:

An analysis of the triples excerpted above shows that both forms, (entity, attribute, value) and (entity, relationship, entity), can be modeled as two vertices and one edge. In the former, the "entity" and the "value" become the two vertices (start and end) and the "attribute" becomes the edge. In the latter, the two "entities" become the two vertices (start and end) and the "relationship" becomes the edge.

Moreover, all vertices are of the same type (a tag named entity) with a single property (called name), and all edges are of the same type (named relation), also with a single property (called name).
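The modeling rule can be written down in a few lines of Python. This is only an illustrative sketch: the class names Vertex and Edge and the function triple_to_graph are hypothetical and are not part of Nebula Graph or of the cleaning tool used later.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Vertex:
    name: str  # the single property of the "entity" tag


@dataclass(frozen=True)
class Edge:
    src: Vertex
    dst: Vertex
    name: str  # the single property of the "relation" edge type


def triple_to_graph(entity: str, name: str, value: str) -> Tuple[Vertex, Vertex, Edge]:
    """Model one triple as two vertices and one directed edge.

    The same rule applies to (entity, attribute, value) and to
    (entity, relationship, entity): the middle field names the edge.
    """
    src = Vertex(entity)
    dst = Vertex(value)
    return src, dst, Edge(src, dst, name)


examples = [
    ("red food", "whether it contains preservatives", "no"),  # attribute form
    ("Yao Ming", "wife", "Ye Li"),                            # relationship form
]
for entity, name, value in examples:
    src, dst, edge = triple_to_graph(entity, name, value)
    print(f"({src.name}) -[{edge.name}]-> ({dst.name})")
```

Both kinds of triple go through the same function, which is the point of the model: the graph schema does not need to distinguish attributes from relationships.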

For example, (Dalongqiu, famous scenic spot, Furong Peak) can be expressed as follows:

(entity {name: Dalongqiu}) -[relation {name: famous scenic spot}]-> (entity {name: Furong Peak})

Data cleaning and preprocessing

According to the analysis in the previous section, each original triple needs to be cleaned and converted into two vertices and one edge before it fits the property graph model.

Download cleaning tools

The tests in this article were run on CentOS 7.5, and the cleaning tool is written in Go.

You can download the source code of this simple cleaning tool and compile it yourself (link: https://github.com/jievince/rdf-converter).

The tool writes the converted vertex data to vertex.csv and the edge data to edge.csv.

Note: testing revealed a large amount of duplicate data, so the tool also deduplicates. After full deduplication there are about 46 million vertices and about 140 million edges.
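The article does not show how the tool deduplicates, so the following is only a sketch of one common approach: remember every vertex name and every (start, end, name) edge already seen in a set and emit each one once. The function name dedupe is hypothetical, and holding tens of millions of strings in Python sets would need a lot of memory, so treat this as an illustration of the idea rather than a replacement for the Go tool.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def dedupe(triples: Iterable[Triple]) -> Tuple[List[str], List[Triple]]:
    """Return unique vertex names and unique (src, dst, name) edges,
    both in order of first appearance."""
    seen_vertices: Set[str] = set()
    seen_edges: Set[Triple] = set()
    vertices: List[str] = []
    edges: List[Triple] = []
    for entity, name, value in triples:
        # Both ends of the triple become vertices.
        for vertex in (entity, value):
            if vertex not in seen_vertices:
                seen_vertices.add(vertex)
                vertices.append(vertex)
        edge = (entity, value, name)
        if edge not in seen_edges:
            seen_edges.add(edge)
            edges.append(edge)
    return vertices, edges


raw = [
    ("red food", "whether it contains preservatives", "no"),
    ("red food", "whether it contains preservatives", "no"),  # exact duplicate
    ("red food", "use", "enhance epidermal cell regeneration and prevent skin aging"),
]
vertices, edges = dedupe(raw)
print(len(vertices), "unique vertices,", len(edges), "unique edges")
```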

The cleaned vertex.csv file looks like this:

-2469395383949115281,excessive packaging
-5567206714840433083,Package
3836323934884101628,some goods deliberately increase the number of packaging layers
1185893106173039861,many use solid wood and metal products
3455734391170888430,non-scientific
918316425836124946,education
5258679239570815125,mature market
-8062106589304861485,"mature market refers to the market with low growth rate and high share."

Note: each line is one vertex. The integer in the first column, such as -2469395383949115281, is the vertex ID (called the VID). It is computed by hashing the text in the second column; for example, -2469395383949115281 is the result of std::hash("excessive packaging").
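The result of std::hash depends on the C++ standard library implementation, so it cannot be reproduced portably from another language. The sketch below only illustrates the idea of deriving a signed 64-bit VID from the vertex name. It deliberately substitutes a different hash (the first 8 bytes of a SHA-256 digest), so it will not produce the VIDs shown above. The function name vid_for is hypothetical.

```python
import hashlib
import struct


def vid_for(name: str) -> int:
    """Derive a deterministic signed 64-bit integer from a vertex name.

    Stand-in for std::hash: same idea (text -> int64), different values.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    (vid,) = struct.unpack("<q", digest[:8])  # little-endian signed 64-bit
    return vid


for name in ("excessive packaging", "education"):
    print(f"{vid_for(name)},{name}")
```

Whatever hash is used, the vertex file and the edge file must be produced with the same function, because edges refer to vertices only by VID.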

Cleaned edge.csv file:

(The edge.csv excerpt is garbled in this copy and its VIDs cannot be recovered. Each row has the form start VID, end VID, edge name; the edge names in the excerpt include meaning, definition, label, description and Chinese name.)

Note: the first column is the VID of the start vertex, the second column is the VID of the end vertex, and the third column is the name of the edge, that is, the "attribute" or "relationship" from the original triple.
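Before importing, it can be worth checking that the two files really have the shape described: two columns per vertex row, three per edge row, VIDs within the signed 64-bit range, and every edge endpoint present in the vertex file. The sketch below uses Python's csv module on small in-memory samples. The function names are hypothetical, the two vertex rows are taken from the vertex.csv excerpt above, and the edge row is made up for illustration rather than copied from edge.csv.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def parse_vid(text: str) -> int:
    vid = int(text)
    if not INT64_MIN <= vid <= INT64_MAX:
        raise ValueError(f"VID out of int64 range: {text}")
    return vid


def read_vertices(f: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (vid, name) for each vertex row."""
    for row in csv.reader(f):
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
        yield parse_vid(row[0]), row[1]


def read_edges(f: TextIO) -> Iterator[Tuple[int, int, str]]:
    """Yield (src_vid, dst_vid, name) for each edge row."""
    for row in csv.reader(f):
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        yield parse_vid(row[0]), parse_vid(row[1]), row[2]


vertex_csv = io.StringIO(
    "5258679239570815125,mature market\n"
    '-8062106589304861485,"mature market refers to the market with low growth rate and high share."\n'
)
# Illustrative edge row, not copied from the real edge.csv.
edge_csv = io.StringIO("5258679239570815125,-8062106589304861485,description\n")

vertices = dict(read_vertices(vertex_csv))
edges = list(read_edges(edge_csv))
dangling = [e for e in edges if e[0] not in vertices or e[1] not in vertices]
print(len(vertices), "vertices,", len(edges), "edges,", len(dangling), "dangling")
```

For the full data set the vertex dictionary would be large, so a check like this is more practical on a sample of the files.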

On the test machine, the cleaning program, including full deduplication, runs in about 6 minutes.

Step 2. Preparing to start Nebula Graph

Download and installation

After logging in to GitHub, find the Nebula installation packages here (link: https://github.com/vesoft-inc/nebula/actions).

Find the download link for the system you are using:

The author's system is CentOS 7.5, so download the latest package for CentOS 7.5. After decompressing it, you will find the rpm installation package nebula-5ace754.el7-5.x86_64.rpm. Note that 5ace754 is the git commit hash and may be different when you download. Run the following command to install the package, replacing the commit hash with the one in your file name:

$ rpm -ivh nebula-5ace754.el7-5.x86_64.rpm

Start the Nebula Graph service

Enter the following command at the command line (CLI) to start the service:

$ /usr/local/nebula/scripts/nebula.service start all

The result of the command execution is as follows:

You can execute the following command to check whether the service started successfully:

$ /usr/local/nebula/scripts/nebula.service status all

The result of the command execution is as follows:

Connect to the Nebula Graph service

Enter the following command to connect to Nebula Graph:

$ /usr/local/nebula/bin/nebula -u user -p password

The result of the command execution is as follows:

Prepare metadata such as schema

Working with Nebula Graph is somewhat similar to working with MySQL: the metadata has to be prepared first.

Create a new graph space

CREATE SPACE is similar in concept to CREATE DATABASE in MySQL. Enter the following command in the nebula console.

nebula> CREATE SPACE test;

Enter the test space

nebula> USE test;

Create the vertex type (entity)

nebula> CREATE TAG entity(name string);

Create the edge type (relation)

nebula> CREATE EDGE relation(name string);

Finally, briefly confirm that the metadata is correct.

View the properties of the entity tag:

nebula> DESCRIBE TAG entity;

The results are as follows:

View the properties of the relation edge type:

nebula> DESCRIBE EDGE relation;

The results are as follows:

Step 3. Import data using nebula-importer

Log in to GitHub and go to https://github.com/vesoft-inc/nebula-importer. The nebula-importer tool is also written in Go; download the source code there and compile it.

In addition, prepare a YAML configuration file that tells the importer where to find the CSV files. (You can copy the following directly.)

version: v1rc1
description: example
clientSettings:
  concurrency: 10 # number of graph clients
  channelBufferSize: 128
  space: test
  connection:
    user: user
    password: password
    address: 127.0.0.1:3699
logPath: ./err/test.log
files:
  - path: ./vertex.csv
    failDataPath: ./err/vertex.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
    schema:
      type: vertex
      vertex:
        tags:
          - name: entity
            props:
              - name: name
                type: string
  - path: ./edge.csv
    failDataPath: ./err/edge.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
    schema:
      type: edge
      edge:
        name: relation
        withRanking: false
        props:
          - name: name
            type: string

Note: testing revealed a large number of escape characters (\) and line-break characters (\r) in the CSV data files; nebula-importer handles these as well.
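To see why backslashes and line-break characters matter, it helps to look at what a batch of rows might become once it is turned into nGQL. The sketch below builds INSERT VERTEX and INSERT EDGE statements for the entity tag and relation edge type created in Step 2. It is not nebula-importer's actual code. It assumes backslash-style escaping inside double-quoted nGQL strings, which should be checked against the Nebula Graph version in use. The function names and the small VIDs 1 and 2 are hypothetical.

```python
from typing import List, Tuple


def escape(value: str) -> str:
    """Escape characters that would break a double-quoted string literal.

    Backslashes are handled first so that escapes added afterwards
    are not escaped a second time.
    """
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\r", "\\r")
        .replace("\n", "\\n")
    )


def vertex_insert(rows: List[Tuple[int, str]]) -> str:
    """Build one INSERT VERTEX statement for a batch of (vid, name) rows."""
    values = ", ".join(f'{vid}:("{escape(name)}")' for vid, name in rows)
    return f"INSERT VERTEX entity(name) VALUES {values};"


def edge_insert(rows: List[Tuple[int, int, str]]) -> str:
    """Build one INSERT EDGE statement for a batch of (src, dst, name) rows."""
    values = ", ".join(
        f'{src}->{dst}:("{escape(name)}")' for src, dst, name in rows
    )
    return f"INSERT EDGE relation(name) VALUES {values};"


# Hypothetical VIDs; the second name contains a quote and a carriage return.
print(vertex_insert([(1, "mature market"), (2, 'a "quoted" value\r')]))
print(edge_insert([(1, 2, "description")]))
```

Without the escaping step, the quote and the carriage return in the second name would end the string literal early or split the statement across lines.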

Finally: start importing data
