
How to achieve efficient data traceability in big data

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--


How to implement and optimize a data traceability service based on GES graphs.

"In one minute, I want everything there is to know about this person," the domineering CEO says, tapping you on the shoulder. The secretary gets to work and finds the name, age, contact information, and hobbies. Not enough? Then add information about relatives, friends, and recent activities to give a fuller picture of this person. It is a joke, but it makes a point: the information held by an object itself may not be "complete". The associated data around it is also an important part of the object's information, and it is very useful for data analysis and mining.

Associations are everywhere in real life: social interaction, commodity production, and consumer behavior, for example. To make better use of these relationships in data analysis, a graph is often used as the data structure, and a database that stores data in a graph structure is called a graph database. A traditional relational database presents data as tables, which makes the data easy to query and manage. A graph database pays more attention to the relationships between a node and its surrounding nodes. This mesh-like structure suits applications such as traceability analysis, social network analysis, and heterogeneous information mining. The graph database service provided by Huawei Cloud is GES (Graph Engine Service) [1].

Many interesting applications can be built on a graph database, and data traceability is one of the most common. Data traceability means associating and tracing the data generated at each step of a process. During an epidemic, for example, it can be used to follow the circulation of goods and check whether they may have come into contact with a source of infection. In testing, a network built from the testing process can be used to analyze how complete the testing activity is and so evaluate quality. These are typical traceability scenarios. Building data traceability on a traditional relational database requires constructing and maintaining several relational tables independently and implementing a many-to-many relationship network on top of them. This makes complex business logic hard to follow, and traceability queries become both complex to implement and slow to run.

Fig. 1 Comparison between a relational database and a graph database

An example illustrates the advantages of a graph database for data analysis. Figure 1 shows a simple course selection system that records which courses students selected and the corresponding course information. The right side of the figure shows the same information converted into a graph, following the graph database representation. The graph expresses the relationship between students and courses more intuitively, shows the relationships between entities clearly, and makes association analysis more convenient. For example, from the graph we can easily find the students who are in the math class with Xiao Bu, and we can quickly find students with the same interests in choosing courses. Because a graph database makes it easy to query the information of surrounding nodes, it is well suited to implementing traceability. So how is a traceability service implemented on a graph database? Next, taking Huawei Cloud GES as an example, we analyze the implementation and optimization of a data traceability service based on GES graphs.
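The classmate lookup described above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not GES code: the enrollment records, the student names other than Xiao Bu, and the helper name classmates are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical course-selection records: (student, course).
enrollments = [
    ("Xiao Bu", "Math"),
    ("Xiao Ming", "Math"),
    ("Xiao Hong", "English"),
    ("Xiao Ming", "English"),
]

# Build an undirected graph linking each student to each selected course.
neighbors = defaultdict(set)
for student, course in enrollments:
    neighbors[student].add(course)
    neighbors[course].add(student)


def classmates(student, course):
    """Students who share `course` with `student`: one hop out, one hop back."""
    if course not in neighbors[student]:
        return set()
    return neighbors[course] - {student}


print(sorted(classmates("Xiao Bu", "Math")))
```

In a relational layout the same question needs a join over the enrollment table; in the graph form it is a walk of two hops from the student node.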

What is a graph?

In a graph database, a graph consists of the following parts:

Point (vertex): an entity object, represented as a node in the graph. For example, people in society or commodities in circulation can each be abstracted as a node.

Edge: a relationship between nodes in the graph, such as a social relationship between people or a purchase of goods.

Attribute: describes a property of a node or an edge, such as a number or a name. In clustering and classification analysis, weight is often used as a relationship attribute, that is, an edge attribute.

Fig. 2 Directed and undirected graphs

Depending on whether edges have a direction, graphs are divided into directed and undirected graphs. In a directed graph, each edge has a defined start point and end point. In figure 2, cities are nodes, and the distances and the modes of transportation between cities are edges. The transportation graph is directed: travel in different directions is represented by different edges. The distance graph is undirected, because distance does not depend on direction. When using GES, points and edges are handled as different objects, and each must define the attributes it needs. A point mainly holds the information of the entity, while an edge must specify its start point and end point.
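As a rough illustration of points and edges as separate objects, the sketch below models both with Python dataclasses. The class names, field names, city identifiers, and attribute values are hypothetical and chosen only for this example; they are not the GES data model.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str   # start point of the edge
    target: str   # end point of the edge
    label: str
    properties: dict = field(default_factory=dict)


vertices = [
    Vertex("city_a", "city", {"name": "City A"}),
    Vertex("city_b", "city", {"name": "City B"}),
]
edges = [
    Edge("city_a", "city_b", "train", {"hours": 5.0}),    # directed
    Edge("city_b", "city_a", "flight", {"hours": 2.0}),   # directed
    Edge("city_a", "city_b", "distance", {"km": 1000}),   # conceptually undirected
]


def adjacent(vertex_id, edge_label, directed=True):
    """Neighbors over edges with the given label, honoring or ignoring direction."""
    result = []
    for edge in edges:
        if edge.label != edge_label:
            continue
        if edge.source == vertex_id:
            result.append(edge.target)
        elif not directed and edge.target == vertex_id:
            result.append(edge.source)
    return result


print(adjacent("city_b", "train"))
print(adjacent("city_b", "distance", directed=False))
```

The distance edge is stored once with an arbitrary direction, and the directed=False lookup simply ignores that direction when reading it back.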

Defining GES graphs

For the steps to create a graph in GES, refer to the official documentation [1]. The main tasks are to define the nodes and edges, process the data into point files and edge files, and import them into GES, either through the interface or through the API. An undirected graph does not distinguish between the start point and the end point of an edge. In practice a default direction is still assigned, that is, a start point and an end point are specified for each edge. This is only for convenience when processing and importing the data and can be ignored in actual queries.
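The step of processing data into point and edge files might look like the sketch below, which writes two small CSV texts with the Python standard library. The column order (id, label, attributes for points; start, end, label for edges), the records, and the helper name to_csv are assumptions for illustration; the required file layout should be taken from the GES documentation [1].

```python
import csv
import io

# Hypothetical records. The column order is an assumption, not the GES format.
vertices = [("s1", "student", "Xiao Bu"), ("c1", "course", "Math")]
edges = [("s1", "c1", "selects")]  # start point, end point, label


def to_csv(rows):
    """Serialize rows to CSV text, one record per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


vertex_file = to_csv(vertices)
edge_file = to_csv(edges)
print(vertex_file)
print(edge_file)
```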

When building a graph in GES, the file that defines the points, the edges, and their attributes is called the metadata. The types of points and edges are called labels, and each label can have multiple attributes, such as the name and weight mentioned above. In GES, a label cannot be modified once it has been defined and created. If the label definition must change, you need to format the graph, then re-create the metadata file and import it into the graph again.

Nodes are usually abstracted from real entities. The data types available for GES node attributes include float, int, double, long, char, char array, date, bool, enum, and string. Nodes typically have many text attributes; non-text attributes can simply use the type that matches the data. For text attributes there are two choices: string and char array. A char array has a length limit, usually 256, while a string has no length limit. Even so, char array has an advantage in GES: char array data is stored in memory, whereas string data is stored on disk, so queries on char array attributes are more efficient. This is worth keeping in mind when defining GES metadata. In our project, the node name and the node number are common query conditions. Considering the attribute characteristics (names are longer, numbers are shorter), we chose string for the name and char array for the number.
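One way to make the string versus char array choice concrete is to check the longest value of each text attribute before writing the metadata. The helper below is a hypothetical sketch: the 256 limit is the figure quoted above, and measuring length in UTF-8 bytes and treating the limit as inclusive are both assumptions.

```python
CHAR_ARRAY_LIMIT = 256  # limit quoted in the text; verify it for your graph


def suggest_text_type(values, limit=CHAR_ARRAY_LIMIT):
    """Suggest 'char array' when every value fits within the limit, else 'string'."""
    # Assumption: the limit is counted in UTF-8 bytes and is inclusive.
    longest = max((len(value.encode("utf-8")) for value in values), default=0)
    return "char array" if longest <= limit else "string"


print(suggest_text_type(["N001", "N002"]))
print(suggest_text_type(["x" * 300]))
```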

GES query optimization

After the node information is defined, the graph can be queried. GES uses Gremlin [3] for queries. Gremlin is an open-source graph traversal language with a flexible, chained query style, and different graph databases decompose and optimize query statements differently. As a result, different ways of writing the same query can differ in efficiency. Next, we analyze a traceability query scenario.

Fig. 4 Analysis of a multi-branch query scenario

As shown in figure 4, each letter stands for a label, that is, a node type. This scenario has many query branches. Implemented directly from the node requirements in the figure, the Gremlin query is as follows:

g.V(id).hasLabel('A').outE().otherV().hasLabel('B').outE().otherV().hasLabel('C').as('c').outE().otherV().hasLabel('F').outE().otherV().hasLabel('H').select('c').outE().otherV().hasLabel('D').as('d').outE().otherV().hasLabel('G').select('d').outE().otherV().hasLabel('H')

With Gremlin, the GES Gremlin server decomposes the query into multiple atomic operations, which are executed by the GES engine. A complex multi-hop query like this one is parsed into many atomic operations that interact with the engine frequently, which makes the query inefficient. For this scenario, consider using the optional step, which improves efficiency. The query is as follows:

g.V(id).hasLabel('A').outE().otherV().hasLabel('B').outE().otherV().hasLabel('C').as('c').optional(outE().otherV().hasLabel('F').outE().otherV().hasLabel('H')).optional(select('c').outE().otherV().hasLabel('D').as('d').optional(outE().otherV().hasLabel('G')).optional(select('d').outE().otherV().hasLabel('H')))

The optional step can narrow the scope of branch queries to some extent and so improve query efficiency. In our project, using optional roughly doubled query performance. However, optional does not suit every scenario. A Gremlin query needs to be optimized according to the query scenario, the data size, and the data characteristics, such as how sparse the nodes are and how many branches the graph has.
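The behavior of Gremlin's optional step can be pictured in plain Python: for each incoming element, the results of the inner traversal are used if there are any, and otherwise the element itself is passed through. The sketch below mimics only that rule on a tiny in-memory graph. The graph, the labels, and the helper out_with_label are hypothetical, and nothing here reflects how GES executes the step.

```python
# Tiny hypothetical graph: vertex labels and outgoing adjacency lists.
labels = {"c": "C", "x": "C", "f": "F"}
out_edges = {"c": ["f"], "x": []}


def out_with_label(label):
    """Build a step that follows outgoing edges to vertices with `label`."""
    def step(vertex):
        return [w for w in out_edges.get(vertex, []) if labels[w] == label]
    return step


def optional(traversers, sub_traversal):
    """Yield the sub-traversal results if any, otherwise the incoming element."""
    for traverser in traversers:
        results = list(sub_traversal(traverser))
        if results:
            yield from results
        else:
            yield traverser


result = list(optional(["c", "x"], out_with_label("F")))
print(result)
```

Vertex c has an F neighbor, so it is replaced by f; vertex x has none, so it stays in the stream instead of being dropped, which is what keeps the other branches of a multi-branch query alive.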

Even after the Gremlin statements are optimized, a GES query may still not reach the desired performance. The atomic operations produced when the Gremlin server parses a query may interact with the GES engine frequently, which degrades performance, and the room for optimizing a Gremlin query is limited. Gremlin is a common way to write queries across graph databases, but how well Gremlin scripts are optimized varies from vendor to vendor, so using the GES native API is recommended. The native API is optimized further for fixed scenarios and skips the Gremlin parsing step, so it performs better. The trade-off is between generality and efficiency, since there is no common definition or implementation of such APIs across vendors.

Below we introduce several common traceability query scenarios. They can all be implemented with Gremlin queries, but the GES system API gives better query performance.

Scenario (1): trace the n layers of nodes before (after) a node

This query is common and is mainly used to find the parent and child nodes of a node. In the scenario of figure 1, it can find all students in a class. The Gremlin implementation is as follows:

g.V(id).repeat(out()).times(n).emit().path()

For this scenario, the k-hop algorithm described in the GES algorithm documentation is recommended. Note that this algorithm API returns only the points of the subgraph that match the query, without node details or edge information. If you need node details, use batch-query to fetch them in batches. If you need edge information, the API used in scenario (2) is recommended.
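To make the idea of a k-hop lookup concrete, here is a minimal breadth-first sketch in Python. It is not the GES k-hop implementation, and its exact semantics may differ from the service; the function name k_hop and the sample graph are assumptions. Like the API described above, it returns only vertex ids, with no node details or edges.

```python
from collections import deque


def k_hop(out_edges, start, k):
    """Vertices reachable from `start` within k outgoing hops (start excluded)."""
    seen = {start}
    frontier = deque([(start, 0)])
    reached = []
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in out_edges.get(vertex, []):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return reached


# Hypothetical graph as adjacency lists.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(k_hop(graph, "a", 2))
```

Tracing before a node instead of after it would use the same routine over a reversed adjacency list.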

Scenario (2): trace the n layers of nodes before (after) a node by condition, with the same filter condition for every node

g.V(id).repeat(outE().otherV().hasLabel('A')).times(n).emit().path()

For this scenario, the repeat-query method is recommended. It quickly performs an n-hop query forward or backward from a starting point and can restrict the nodes by a query condition, with the same filter condition applied to all points. If different points need different filter conditions, you can leave the condition unspecified and filter the results after the query returns. A query with no point condition reduces to scenario (1), and this API returns the details of both nodes and edges.
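The shape of such a query, an n-hop expansion with one filter shared by every hop, can be sketched in Python as follows. This mirrors the Gremlin repeat pattern above rather than the repeat-query API itself, whose parameters are not shown in this article. The function name repeat_paths and the sample graph are hypothetical, and the paths here list vertices only, without edge details.

```python
def repeat_paths(out_edges, labels, start, n, label):
    """Paths of 1..n hops from `start` whose vertices after the start all have `label`."""
    paths = []
    frontier = [[start]]
    for _ in range(n):
        next_frontier = []
        for path in frontier:
            for neighbor in out_edges.get(path[-1], []):
                if labels.get(neighbor) == label:
                    next_frontier.append(path + [neighbor])
        paths.extend(next_frontier)  # emit the paths found at this depth
        frontier = next_frontier
    return paths


# Hypothetical graph: a1 -> a2 -> a3, plus a1 -> b1 with a different label.
labels = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
out_edges = {"a1": ["a2", "b1"], "a2": ["a3"], "a3": []}
print(repeat_paths(out_edges, labels, "a1", 2, "A"))
```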

Scenario (3): trace the n layers of nodes before (after) a node by condition, with different filter conditions for different nodes

The example in figure 4 is such a scenario: the label queried at each layer is different. In this case, filtered-query is recommended. This method requires the filter attributes of each node to be specified in detail, which amounts to listing every query condition in the parameters one by one so that the result fully matches the conditions. In our project, filtered-query was about 10 times faster than the Gremlin query.
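A per-layer filter can be pictured as a list of conditions, one for each hop. The Python sketch below walks a small graph shaped like the branches queried in the Gremlin example (A to B to C, then C to F to H and C to D to G or H). It is an illustration only: the function name layered_paths, the vertex ids, and the graph are hypothetical, and the real filtered-query parameters are not reproduced here.

```python
def layered_paths(out_edges, labels, start, layer_labels):
    """Paths from `start` whose i-th hop lands on a vertex with layer_labels[i]."""
    frontier = [[start]]
    for label in layer_labels:
        frontier = [
            path + [neighbor]
            for path in frontier
            for neighbor in out_edges.get(path[-1], [])
            if labels.get(neighbor) == label
        ]
    return frontier


# Hypothetical graph following the branch structure of the example query.
labels = {"a": "A", "b": "B", "c": "C", "f": "F", "d": "D",
          "g": "G", "h1": "H", "h2": "H"}
out_edges = {"a": ["b"], "b": ["c"], "c": ["f", "d"],
             "f": ["h1"], "d": ["g", "h2"]}

print(layered_paths(out_edges, labels, "a", ["B", "C", "F", "H"]))
print(layered_paths(out_edges, labels, "a", ["B", "C", "D", "G"]))
```

Each branch of the figure becomes one list of layer labels, which is why this style needs more parameters than a single shared filter.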

Of the three scenarios above, repeat-query and k-hop generalize better: the hop count n can be set freely and the required parameters are simple. Filtered-query requires the attributes of the nodes at each layer to be specified in detail, so its parameters are more complex. Choose according to business needs.

GES also provides many algorithms, such as Node2vec, subgraph3vec, and GCN. This article only covers how to query nodes quickly and provide traceability services based on GES. Later, we will also look at merging data nodes in the established graph, as well as similarity analysis, quality evaluation, and process recommendation, to better mine the value of the data.

