Tree structure storage of a large number of file name records 07/03 Update SLTechnology News&Howtos

Tree structure storage of a large number of file name records

2025-07-03 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

With more than a billion directories and files already in NAS for more than a decade, there have been many challenges in designing and developing backup systems. This article will share tree-structured storage practices for large file name records.

I. Introduction

Since it is a regular backup, there must be more than one backup. For a particular directory, each backup is compared with the last backup to find out which files have been deleted and which files have been added, which requires that all file names in that directory be saved at each backup. The first thing we thought of was to concatenate all file names with specific characters and save them. Because we use MySQL to store this information, when there are many files in the directory, this way of splicing is likely to exceed MySQL's Blob length limit. According to experience, when a directory has a large number of files, the names of these files are often generated by the program, there is a certain regularity, and the beginning is generally repeated, so we thought of using a tree structure for storage.

For example, a directory with four files abc, abc1, ad, and cde corresponds to a tree as shown in Figure 1.

Figure 1 Example tree structure

In Figure 1, R represents the root node, the cyan node we call the end node, and the path from R to each end node represents a file name. You can search the tree for a filename, traverse all filenames in the tree, save the tree serialization, and deserialize the spanning tree from the serialization results.

II. Data structures involved

Note: we use java to write, the text involves language features related to knowledge points are java.

2.1 Structure of Node

Each node, including the root node, is represented using the Node class. The code is as follows:

class Node { private char value; private Node[]children = new Node[0]; private byte end = 0; }

Field Description:

value: The character represented by the node. When Node represents the root node, value has no value. children: All children of this node, initialized as an array of length 0. end: Marks whether the node is an end node. 0 is not;1 is. The leaf node must be the end node. Default non-end node. 2.2 Node operations public Node(char v); public Node findChild(char v); public Node addChild(char v);

Operating instructions:

Node: Construction method. Assign parameter v to this.value. findChild: Finds whether children contain child nodes with value v. Return child nodes if yes, null if no. addChild: First find whether children already contain child nodes with value v, if yes, directly return the child nodes found; otherwise, create nodes with value v, extend the length of children by 1, take the newly created node as the last element of children, and return the newly created node. 2.3 Structure of Tree class Tree { public Node root = new Node(); }

Field Description: Tree contains only root Node. As mentioned earlier, root has no value and end is 0. The initial children length is 0.

2.4 Tree operations public void addName(String name) ; public boolean contain(String name); public Found next(Found found); public void writeTo(OutputStream out); public static Tree readFrom(InputStream in);

Operating instructions:

addName: Adds a new file name to the tree, the parameter name. Call addChild with root as the starting point, each character in name as the parameter, and the return value as the new starting point until all characters in name are added. Mark the return value of the last call addChild as the end node. contain: Query whether the tree contains a file name. next: Travels through all the filenames contained in the tree. To make the traversal smooth, we introduce a new class Found, details of which will be described later. writeTo: Writes the tree to an output stream for persistence. readFrom: This method is static. Reconstruct the tree from an input stream. III. Construction of trees

The addName method is called on the newly created Tree to add all file names to the tree and complete the tree construction. Still take the directory containing abc, abc1, ad, cde four files as an example to illustrate the tree construction.

Figure 2 Tree construction process

In Figure 2, the orange node indicates that the addChild method needs to be called on the node to add children, and the return value of addChild is used as the new orange node. Until there are no more children to add, mark the last orange node as the end node.

4. Tree query

Find out if the tree contains a file name corresponding to the Tree's contain method. Look for ef, ab, and abc files on the results in Figure 2 to demonstrate the process of finding. as shown in figure 3.

Figure 3 Diagram of tree query

In Figure 3, an orange node indicates that you need to call the findChild method on that node to find child nodes.

V. Traversals of Trees

The traversal here is different from the traversal of ordinary trees. A normal traversal is traversing nodes in a tree, whereas a traversal here is traversing the path from the root node to all end nodes.

We traverse from left to right, shallow to deep. We introduce the Found class and iterate over it as an argument to the next method.

5.1 Structure of Found class Found { private String name; private int[] idx ; }

In order to explain the problem more easily, a small modification has been made on the basis of Figure 1, and the subscript has been added to the lower right corner of each node, as shown in Figure 4.

Figure 4. Subscripted Tree

For the file name abc, the name value in Found is "abc" and idx is {0, 0, 0}.

For the file name abc1, the name value in Found is "abc1" and idx is {0, 0, 0}.

For the file name ad, the name value in Found is "ad" and idx is {0, 1}.

For the file name cde, the name value in Found is "cde" and idx is {1, 0, 0}.

5.2 how to traverse

For Figure 4, the first call to the next method should be null, and the first result, i.e., Found represented by abc, will be returned; the second call to next with this Found as an argument will return the second result, i.e., Found represented by abc1; the third call to next with this Found as an argument will return the third result, i.e., Found represented by ad; and the fourth call to next with this Found as an argument will return the fourth result, i.e., Found represented by cde; If you continue to make a fifth call with this Found as an argument, it will return null and the traversal will end.

Serialization and deserialization 6.1 Serialization

First of all, it should be clear that each node should contain three pieces of information after serialization: the value of the node, the number of children of the node, and whether the node is the end node.

6.1.1 Value of nodes

Although the value of the node in the previous example is all English characters, in fact, the file name may contain Chinese characters or characters in other languages. For ease of processing, we do not use variable length coding. Instead, it uses unicode code directly. The endianness is big-endian encoded.

6.1.2 Number of children of a node

Since the value of the node uses a unicode code, the number of children cannot exceed the number of characters that unicode can represent, i.e. 65536. The number of children uses 2 bytes. The endianness is also encoded using big-endian.

6.1.3 End of node

A 0 or 1 can be represented using 1 bit, but the smallest unit in Java is a byte. If we use 1 byte to represent end, it is a waste of space. In fact, the possibility that the number of children in any node reaches 65536/2 is extremely small, so we consider borrowing the highest bit of the number of children to represent end.

To sum up, a node takes up 4 bytes after serialization. For example, the root node, the node with value b and the node with value e in Figure 4:

Table 1 Example Node serialization

Number of unicodechildren of value Number of endchildren/(end

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.