
A brief introduction to Lucene dynamic sharding

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

In a search engine we developed recently, the index needs to be sharded. To meet the project's requirements we provide two sharding methods, and this post records how they work.

Hash algorithm

The principle is simple: the shard is determined by the hash value of the row key (_id), and the operation is then performed on that shard.

For example, suppose an index is initialized with 5 shards: shard0, shard1, shard2, shard3, and shard4.

Now a row needs to be saved whose _id is 0001000000123. Its HashCode is 1571574097, and the remainder modulo 5 (1571574097 % 5) is 2, so the row is saved in shard2. The original post illustrates this with a simple diagram.
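The routing step can be sketched in a few lines. This is a minimal sketch, assuming the HashCode in the example is Java's String.hashCode() (emulating it reproduces the quoted value); the function names java_string_hash and shard_for are made up for illustration.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() for strings of BMP characters (assumption)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap to 32 bits like a Java int
    # Reinterpret the unsigned 32-bit value as a signed int.
    return h - (1 << 32) if h >= (1 << 31) else h


def shard_for(doc_id: str, num_shards: int) -> str:
    """Pick the shard by hash % number of shards (hypothetical helper)."""
    # Python's % is never negative for a positive modulus, even for negative hashes.
    return f"shard{java_string_hash(doc_id) % num_shards}"


print(java_string_hash("0001000000123"))  # 1571574097
print(shard_for("0001000000123", 5))      # shard2
```

In Java itself the % operator can return a negative remainder for a negative hash, so an implementation there would likely want Math.floorMod or an equivalent guard.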

Sharding with the hash algorithm is very simple to implement: locating a row only requires knowing the number of shards. But precisely because the number of shards is part of the algorithm, changing the number of shards is very expensive.

One solution is to redistribute the data. To grow from M shards to N shards, first divide each shard into N small pieces, then merge the small pieces into the N new shards. The original post includes an illustration of this, taken from the web.

The advantage of this method is that the new number of shards can be chosen freely. The disadvantage is that all data has to be redistributed, which can be time-consuming when the data volume is large.
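The redistribution can be sketched as follows. This is only an illustration under simplifying assumptions: each shard is modelled as an in-memory dict from hash code to row, and reshard is a hypothetical helper, not part of Lucene.

```python
from typing import Dict, List


def reshard(old_shards: List[Dict[int, str]], new_count: int) -> List[Dict[int, str]]:
    """Redistribute rows from the old shards into new_count new shards."""
    # Step 1: cut every old shard into new_count small pieces by hash % new_count.
    pieces = [[{} for _ in range(new_count)] for _ in old_shards]
    for i, shard in enumerate(old_shards):
        for h, row in shard.items():
            pieces[i][h % new_count][h] = row
    # Step 2: merge piece j of every old shard into new shard j.
    new_shards: List[Dict[int, str]] = [{} for _ in range(new_count)]
    for per_old_shard in pieces:
        for j, piece in enumerate(per_old_shard):
            new_shards[j].update(piece)
    return new_shards


# 100 rows spread over 5 shards, redistributed into 7 shards.
old = [{h: f"row{h}" for h in range(100) if h % 5 == i} for i in range(5)]
new = reshard(old, 7)
print([len(s) for s in new])
```

Every row is read and written once, which is why the cost grows with the total amount of data rather than with the number of shards added.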

However, because the growth of the project's data is unpredictable, we did not choose the method above and chose another way of adding shards instead.

Dynamic sharding

This method combines the hash algorithm with the principle of a binary tree so that shards are added dynamically.

The hash algorithm is the same as before. When the index is created, you set an initial number of shards, for example 5 shards: shard_0, shard_1, shard_2, shard_3, and shard_4. When data is added, the hash value of _id determines which shard the row is saved to. The difference is that we set a maximum number of rows for each shard. When the number of rows in a shard reaches that maximum, the shard is split into two smaller shards, which become sub-shards of the current shard.

For example, set the maximum number of rows per shard to 10 million. When shard_2 exceeds 10 million rows, it is split into two sub-shards, shard_2_2 and shard_2_7. If shard_2_2 then grows to 10 million rows, it is split into the sub-shards shard_2_2_2 and shard_2_2_12.

As the example shows, the splitting follows a rule. Assume the initial number of shards is m, and let k be the depth in the binary tree (k = 1 for the initial shards). A shard at depth k whose name ends in r splits into two sub-shards ending in r and r + m * 2^(k - 1). For shard_n this gives:

shard_n splits into shard_n_n and shard_n_(n + m * 1)

shard_n_n splits into shard_n_n_n and shard_n_n_(n + m * 2)

shard_n_(n + m * 1) splits into shard_n_(n + m * 1)_(n + m * 1) and shard_n_(n + m * 1)_(n + m * 1 + m * 2)

...

The formula above looks complicated, so the original post uses a diagram to illustrate the splitting process.
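The naming rule can also be expressed as a small function. This is a sketch, assuming a shard is represented by the tuple of numbers in its name (shard_2_7 becomes (2, 7)); shard_name and split are hypothetical helpers.

```python
from typing import List, Tuple


def shard_name(path: Tuple[int, ...]) -> str:
    """Render a path such as (2, 7) as 'shard_2_7'."""
    return "shard_" + "_".join(str(r) for r in path)


def split(path: Tuple[int, ...], m: int) -> List[Tuple[int, ...]]:
    """Return the two sub-shards of a shard, given m initial shards."""
    depth = len(path)               # 1 for the initial shards
    modulus = m * 2 ** (depth - 1)  # modulus used at this shard's layer
    r = path[-1]                    # this shard's remainder at its own layer
    return [path + (r,), path + (r + modulus,)]


for parent in [(2,), (2, 2), (2, 7)]:
    children = [shard_name(c) for c in split(parent, 5)]
    print(shard_name(parent), "->", children)
# shard_2 -> ['shard_2_2', 'shard_2_7']
# shard_2_2 -> ['shard_2_2_2', 'shard_2_2_12']
# shard_2_7 -> ['shard_2_7_7', 'shard_2_7_17']
```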

If this is still unclear, tracing how an _id is mapped to its shard helps. Take the earlier example again.

A row needs to be saved whose _id is 0001000000123. Its HashCode is 1571574097, and the remainder modulo 5 (1571574097 % 5) is 2, so the row belongs to shard_2.

shard_2 has already been split into the sub-shards shard_2_2 and shard_2_7. The cardinality (modulus) of this layer is 10 (cardinality = initial number of shards * 2^(layer - 1), here 5 * 2). The remainder of 1571574097 modulo 10 (1571574097 % 10) is 7, so the row goes to shard_2_7.

shard_2_7 has no sub-shards, which means it has not been split, so the row is stored directly in it.

To summarize how a shard is located:

Find the top-level shard with the hash algorithm

If the shard has sub-shards, continue the lookup among its sub-shards

If the shard has no sub-shards, the data is stored in that shard
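The lookup steps above can be sketched as a loop that descends while the current shard has been split. This is a minimal sketch: the set of already split shards is passed in as tuples such as (2,), and locate is a hypothetical name.

```python
from typing import Set, Tuple


def locate(hash_code: int, m: int, split_shards: Set[Tuple[int, ...]]) -> str:
    """Walk down the shard tree until reaching a shard that has not been split."""
    path = (hash_code % m,)                # top-level shard from the hash algorithm
    while path in split_shards:            # the shard has sub-shards: go one layer down
        modulus = m * 2 ** len(path)       # modulus of the next layer (10, 20, ...)
        path += (hash_code % modulus,)
    return "shard_" + "_".join(map(str, path))


h = 1571574097
print(locate(h, 5, set()))             # shard_2
print(locate(h, 5, {(2,)}))            # shard_2_7
print(locate(h, 5, {(2,), (2, 7)}))    # shard_2_7_17
```

The last call is a hypothetical further split of shard_2_7, shown only to illustrate that the loop handles deeper trees.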

Now back to the splitting rule: why does shard_1 split into shard_1_1 and shard_1_6?

The reason is simple. shard_1 holds the rows whose _id hash leaves a remainder of 1 when divided by 5. If shard_1 is split into two parts, the cardinality of layer 2 is 10, which is the cardinality of layer 1 * 2, that is, 5 * 2. If the remainder modulo 5 is 1, the remainder modulo 10 can only be 1 or 6, so shard_1 is divided into shard_1_1 and shard_1_6.
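The claim about the remainders is easy to check by brute force. The snippet below is only a sanity check over a sample range of integers, not a proof.

```python
m = 5
# Collect the remainders modulo 2 * m of all sample values whose remainder modulo m is 1.
residues = {x % (2 * m) for x in range(10_000) if x % m == 1}
print(sorted(residues))  # [1, 6]
```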

Data consistency

Dynamic sharding splits shards automatically while the index is in use, and a split can take a long time. In our test, splitting an index with 32 columns and 5 million rows into two sub-shards took 245 seconds. If the original data is modified during the split, those changes may be lost. Some measures are therefore needed to keep the data safe while a split is in progress.

Method one: use a pessimistic lock.

The shard is locked before the split starts and cannot be modified until the split is complete.

Advantages: the logic is simple and blunt, and it is easy to develop.

Disadvantages: if the lock is held too long, a large number of requests from callers may fail.
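A minimal sketch of the pessimistic approach, assuming an in-memory shard protected by a single threading.Lock. LockedShard and its methods are hypothetical names, and the timeout stands in for whatever failure the caller would see.

```python
import threading


class LockedShard:
    def __init__(self):
        self.rows = {}
        self._lock = threading.Lock()

    def write(self, hash_code, row, timeout=0.1):
        # Writers must take the lock; while a split holds it they time out and fail.
        if not self._lock.acquire(timeout=timeout):
            raise TimeoutError("shard is locked for splitting")
        try:
            self.rows[hash_code] = row
        finally:
            self._lock.release()

    def split(self, residue, modulus):
        """Split under the lock; residue and modulus describe this shard's layer."""
        with self._lock:  # held for the whole, possibly long, split
            same = {h: r for h, r in self.rows.items()
                    if h % (2 * modulus) == residue}
            other = {h: r for h, r in self.rows.items()
                     if h % (2 * modulus) == residue + modulus}
            return same, other


shard = LockedShard()
for h in (2, 7, 12, 17):
    shard.write(h, f"row{h}")
same, other = shard.split(residue=2, modulus=5)
print(sorted(same), sorted(other))  # [2, 12] [7, 17]
```

Because the lock is held for the entire split, every write that arrives in that window waits and then fails, which is the disadvantage described above.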

Method two: use a transaction log.

A transaction log is created before the split starts, and all insert, update, and delete operations on the current shard are written to it while the split runs. After the split is complete, the shard and its sub-shards are locked, the logged operations are replayed onto the sub-shards, and the lock is then released.

Advantages: the shard is locked only while the transaction log is created and while the log is replayed, so the locking time is relatively short and service callers are hardly affected.

Disadvantages: development is harder, because a transaction log and a log recovery interface have to be built. However, the underlying Lucene storage already has a set of transaction log interfaces and implementations, so this cost is almost negligible.
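The transaction log approach can be sketched like this. It is a simplified in-memory model, not the Lucene-based implementation: the snapshot taken in begin_split is an assumption standing in for reading the existing index, and all names are hypothetical.

```python
import threading


class SplittingShard:
    def __init__(self, residue, modulus):
        self.residue, self.modulus = residue, modulus
        self.rows = {}
        self.log = None               # transaction log, active only during a split
        self.lock = threading.Lock()

    def write(self, hash_code, row):
        """Insert or update a row; row=None means delete."""
        with self.lock:
            if self.log is not None:  # a split is running: record the operation
                self.log.append((hash_code, row))
            if row is None:
                self.rows.pop(hash_code, None)
            else:
                self.rows[hash_code] = row

    def begin_split(self):
        with self.lock:               # short lock: start the log, take a snapshot
            self.log = []
            return dict(self.rows)

    def build_children(self, snapshot):
        # The long-running part; no lock is held, so writes keep flowing.
        children = {self.residue: {}, self.residue + self.modulus: {}}
        for h, row in snapshot.items():
            children[h % (2 * self.modulus)][h] = row
        return children

    def finish_split(self, children):
        with self.lock:               # short lock: replay the log onto the sub-shards
            for h, row in self.log:
                target = children[h % (2 * self.modulus)]
                if row is None:
                    target.pop(h, None)
                else:
                    target[h] = row
            self.log = None
        return children


shard = SplittingShard(residue=2, modulus=5)
shard.write(2, "a")
shard.write(7, "b")
snapshot = shard.begin_split()
shard.write(12, "c")   # insert that arrives during the split
shard.write(7, None)   # delete that arrives during the split
children = shard.finish_split(shard.build_children(snapshot))
print(children)  # {2: {2: 'a', 12: 'c'}, 7: {}}
```

Both changes made during the split survive because they are replayed from the log, while the lock is only held for the two short critical sections.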

Row key incremental sharding

If the row keys of the stored data increase overall, for example 000000001, 0000002, 0000003, and so on, you can shard by row key. This kind of sharding is relatively simple.

1. Create one initial shard when the index is created.

2. While adding data, record the minimum and maximum row keys of each shard as minId and maxId.

3. When the amount of data in a shard exceeds the configured maximum, create a new shard and save new data to it.

4. When updating data, determine the shard by comparing the row key with the minId and maxId of each shard.
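The four steps above can be sketched as a small class. This is a minimal in-memory sketch, assuming fixed-width, zero-padded string keys so that string comparison matches numeric order; RangeShardedIndex and its fields are hypothetical names.

```python
class RangeShardedIndex:
    def __init__(self, max_rows):
        self.max_rows = max_rows
        # Step 1: start with a single initial shard.
        self.shards = [{"min_id": None, "max_id": None, "rows": {}}]

    def add(self, row_id, row):
        shard = self.shards[-1]
        # Step 3: once the latest shard is full, open a new one for new data.
        if len(shard["rows"]) >= self.max_rows:
            shard = {"min_id": None, "max_id": None, "rows": {}}
            self.shards.append(shard)
        shard["rows"][row_id] = row
        # Step 2: keep minId and maxId of the shard up to date.
        if shard["min_id"] is None or row_id < shard["min_id"]:
            shard["min_id"] = row_id
        if shard["max_id"] is None or row_id > shard["max_id"]:
            shard["max_id"] = row_id

    def locate(self, row_id):
        # Step 4: find the shard whose [minId, maxId] range contains the key.
        for i, shard in enumerate(self.shards):
            if shard["min_id"] is not None and shard["min_id"] <= row_id <= shard["max_id"]:
                return i
        return None


index = RangeShardedIndex(max_rows=3)
for n in range(1, 8):
    index.add(f"{n:08d}", f"row{n}")
print(len(index.shards))         # 3
print(index.locate("00000005"))  # 1
```

The linear scan in locate is kept for clarity; since the ranges are ordered, a binary search over minId would be a natural refinement.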

Comparison of row key incremental sharding and hash algorithm sharding:

1. Row key incremental sharding is simpler to implement and has a lower development cost.

2. Row key incremental sharding locates a shard through minId and maxId, so the minId and maxId of each shard must be recorded in that shard's information.

3. Row key incremental sharding requires the data to be stored in roughly increasing key order, otherwise the data may be skewed.

4. Row key incremental sharding adds shards as needed. You only need to set the maximum number of rows per shard, and there is no splitting process.

5. With row key incremental sharding, most of the load is concentrated on the latest shard, whereas with hash algorithm sharding the load is spread across all shards. In theory, hash algorithm sharding can support higher throughput.

