
What are the practical details of Spark data import?


Many newcomers are not very clear about the practical details of importing data with Spark. To help with that, this article explains them in detail; readers who need this can follow along and will hopefully gain something from it.

1. Preface

As the graph business grew more complex over time, a performance bottleneck gradually emerged: a single machine could no longer support larger graphs. From a performance standpoint, however, Neo4j's native graph storage has an irreplaceable advantage, a gap that JanusGraph, Dgraph and others cannot close. Although JanusGraph handles OLAP well and offers some OLTP support, GraphFrames and similar tools are already enough for OLAP needs, and once Spark 3.0 provides Cypher support there will be even more ways to solve a graph's OLAP requirements than its OLTP ones. Against this background, the emergence of Nebula Graph is undoubtedly a breakthrough for the inefficiency of distributed OLTP.

After a round of research and trial deployments, and especially after a final test showed that JanusGraph's OLTP efficiency could not meet online requirements, we stopped insisting that the same graph serve both OLAP and OLTP, and Nebula Graph's architecture happened to match what the graph needs:

Distributed: a shared-nothing distributed architecture

High-speed OLTP (performance needs to be close to Neo4j): Nebula Graph's storage layer maps queries directly to physical addresses, which can effectively be regarded as native graph storage

High availability of the service (that is, the graph can serve stably in the absence of human error): the service stays available under partial failures, and there is a snapshot mechanism

Scalability: supports linear scaling and, being open source, secondary development

In short, Nebula Graph's architecture meets the actual production requirements, so we investigated, deployed and tested it. Deployment and performance testing (the Meituan NLP team's performance test and the Tencent Cloud security team's performance test) are covered in detail on the official website and in other people's blog posts. This article focuses on importing data with Spark, and can be read as a first look at Nebula Graph's support for Spark.

2. Test environment

Nebula Graph cluster

3 machines, 32 cores each (actually limited to 16 cores)

400 GB of memory (limited to 100 GB in the actual configuration)

SSD

Version information: Nebula Graph 1.0.0 (the testing was done relatively early).

Network environment: 10-Gigabit Ethernet.

Graph size: on the order of a billion vertices (with few properties) and tens of billions of edges (directed, either without properties or with weights).

Spark cluster

Version Information: Spark 2.1.0

In practice, the total resources used for the import come to about 2 TB of memory: (3 * 30 executors + 1 driver) * 25 GB.
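As a sanity check, (3 * 30 + 1) * 25 GB ≈ 2.3 TB, which matches the figure above. Below is a minimal Scala sketch of how such resource settings could be expressed; the values are simply the article's numbers rather than recommendations, and in practice they are usually passed to spark-submit rather than set in code (driver memory in particular only takes effect at submit time).

import org.apache.spark.sql.SparkSession

// Resource shape from the article: 3 * 30 executors + 1 driver, 25 GB each.
val spark = SparkSession.builder()
  .appName("nebula-graph-import")            // hypothetical application name
  .config("spark.executor.instances", "90")  // 3 * 30 executors
  .config("spark.executor.memory", "25g")
  .config("spark.driver.memory", "25g")      // normally set via spark-submit --driver-memory
  .getOrCreate()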

3. Spark batch import

3.1 Basic process

Package sst.generator (the package Spark needs in order to generate SST files).

Configure the Nebula Graph cluster, make sure it starts normally, and create the graph.

Write the Spark configuration file config.conf (you can refer to the "Spark Import Tool" documentation).

Check the Spark cluster for conflicting packages.

Start Spark with the configuration file and sst.generator, and the import runs smoothly.

Data check.

3.2 Some details

It is recommended to build an index before bulk import.

The reason for building the index first is that bulk import is only done on graphs that are not yet online. Although an index can be built while the service is running, building it first avoids problems with a later REBUILD. The trade-off is that bulk-importing vertices then becomes relatively slow.

It is recommended to use int vertex IDs (the Snowflake algorithm can be used, among others; see the sketch below). If a vertex's ID is not an int, you can have a uuid generated automatically by adding policy: "uuid" to the vertex / edge configuration.
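For illustration, here is a minimal Scala sketch of a Snowflake-style 64-bit ID generator (41-bit timestamp, 10-bit worker ID, 12-bit sequence). It is not the code used in the article; the custom epoch and bit layout are assumptions, and a clock moving backwards is not handled.

// A Snowflake-style generator: each worker with a distinct workerId hands out
// unique 64-bit int IDs suitable for use as vertex IDs.
class SnowflakeId(workerId: Long) {
  require(workerId >= 0 && workerId < 1024)  // 10 bits for the worker id
  private val epoch = 1577836800000L         // assumed custom epoch: 2020-01-01
  private var lastTs = -1L
  private var seq = 0L

  def next(): Long = synchronized {
    var ts = System.currentTimeMillis()
    if (ts == lastTs) {
      seq = (seq + 1) & 0xFFF                // 12-bit sequence within one millisecond
      if (seq == 0) {                        // sequence exhausted: spin until the next millisecond
        while (ts <= lastTs) ts = System.currentTimeMillis()
      }
    } else {
      seq = 0L
    }
    lastTs = ts
    ((ts - epoch) << 22) | (workerId << 12) | seq
  }
}

// Usage: val gen = new SnowflakeId(workerId = 1); val vertexId = gen.next()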

If you use a dedicated Spark cluster, you may not run into package conflicts at all. The main risk is that sst.generator conflicts with other packages in the Spark environment; the fix is to shade or rename the conflicting packages (see the sketch below).
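If the job is built with sbt, one way to shade a conflicting package is an sbt-assembly ShadeRule. The fragment below is only a sketch: the thrift package pattern is just an example of the kind of conflict described, and the syntax is that of older sbt-assembly versions.

// build.sbt fragment (requires the sbt-assembly plugin): relocate a conflicting
// package so the copy bundled with the job no longer clashes with the one
// already on the Spark classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.thrift.**" -> "shaded.org.apache.thrift.@1").inAll
)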

Spark tuning: adjust the parameters to the actual situation; reduce memory as far as possible to save resources, and raise the parallelism accordingly to speed things up.
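As a rough illustration of that trade-off, repartitioning the input into more, smaller partitions lets each task run in less executor memory while raising parallelism. The path and partition count below are placeholders, and an existing SparkSession named spark is assumed.

// More, smaller partitions: lower per-task memory, higher parallelism.
val edges = spark.read.parquet("hdfs:///path/to/edges")  // illustrative input path
val repartitioned = edges.repartition(1000)              // partition count to be tuned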

3.3 Import results

With on the order of a billion vertices (few properties) and tens of billions of edges (directed, without properties or with weights), importing the whole graph takes about 20 hours when the index is built in advance.
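As a rough back-of-the-envelope check (the exact counts are not given in the article): taking on the order of 10^10 records over 20 hours, i.e. about 7.2 * 10^4 seconds, gives roughly 10^5 records per second across the whole cluster.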

3.4 About the PRs

Because the Spark import was used in an early version, it naturally had some rough edges. I therefore offered a few modest suggestions and made small modifications to SparkClientGenerator.scala.

The column-misalignment problem was first found when writing to Nebula Graph with Spark Writer (now called Exchange).

Reading the source code, I found a bug in SparkClientGenerator.scala: it read column positions from the configuration file instead of from the parquet/json file. After fixing it, I submitted my first PR, #2187.
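To make the nature of that bug concrete, here is an illustrative Scala reconstruction (not the actual PR diff): if the column position comes from the configuration file, a parquet/json file whose schema orders fields differently yields shifted columns, whereas resolving the field by name against the row's own schema avoids that.

import org.apache.spark.sql.Row

// Buggy pattern: row.get(indexFromConfig) -- the index reflects the config file's
// field order, not the data file's, so columns can end up misaligned.
// Safer pattern: look the field up by name in the row's own schema.
def readField(row: Row, fieldName: String): Any =
  row.getAs[Any](fieldName)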

Later I found that when SparkClientGenerator automatically generated a uuid/hash, duplicate double quotation marks appeared, which made the import impossible.

This part went through several submissions because I kept changing my approach to the problem. The duplicated-quotation-mark issue boiled down to an extra double quote being added during type conversion. I noticed an extraIndexValue method that converts non-string types into strings, and I thought some users might want to turn a non-string index (an array, for example) into a uuid/hash, so I made more extensive changes.

However, after talking with @darionyaphet from the official team, I realized that I was actually changing the data source: when users supply an unsupported type such as an array, the tool should report an error rather than convert the type (indeed, at first I only considered my own business logic, not generality). I revised the change again, submitted PR #2258, and it was merged. I also learned a lot from this PR.

Later I found that nebula-python also conflicted with the official thrift. I originally intended to shade it and submit a PR, but the change felt too large, so I reported it to the official team directly, and it was fixed recently.
