How do you split an Elasticsearch index by date? Many newcomers are not very clear about this, so to help you solve this problem, this article walks through it in detail; anyone who needs it can follow along, and hopefully you will gain something.
1. Prerequisites
Each document has a time field, which makes it easy to split the index by date.
In the index mapping, _source must be enabled (the default) so that Elasticsearch stores the original document rather than only the doc ID; this is what makes reindex possible later.
curl -XGET "localhost:9200/your_index_name/_mapping"
# if _source does not appear in the output, it means "_source": false
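For reference, a minimal sketch of declaring _source explicitly when creating an index (the index and type names are the placeholders used later in this article; on Elasticsearch 7+ the document-type level under mappings is omitted):
curl -X PUT "localhost:9200/your_index_name" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "your_doc_type": {
      "_source": { "enabled": true }
    }
  }
}
'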
2. Delete documents (not recommended)
The first idea that comes to mind is to delete the oldest logs. For example, to keep only the last three months of logs, use the delete_by_query API.
curl -X POST "localhost:9200/twitter/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "day": {
        "lt": "2018-12-01"
      }
    }
  }
}
'
However, Elasticsearch's delete is not an immediate physical deletion, so disk utilization does not drop. A deleted document is only marked as deleted. Elasticsearch stores documents in segment files, and the number of segment files grows as documents are written, so Elasticsearch periodically merges small segments into larger ones; only when segments are merged are the marked documents physically removed.
This scheme is only suitable for the early stage of a project, when the number of documents is small and the machine load is low, because bulk reads and writes cause CPU utilization to spike and system load to rise; it is best run at night when business traffic is low.
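If disk space needs to be reclaimed sooner, a force merge can be triggered manually to expunge the deleted documents (a hedged sketch; run it during off-peak hours because merging is heavy on I/O and CPU):
curl -X POST "localhost:9200/twitter/_forcemerge?only_expunge_deletes=true"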
3. Disk expansion (effective in the short term)
If marking documents as deleted cannot solve the problem right away, then expand the disk: 1T to 2T, 2T to 4T. Since upgrading the disk requires restarting the machine, knowing how to restart the Elasticsearch nodes gracefully, one at a time, is very important.
3.1 Stop the Elasticsearch service on the node
The usual routine is to find the Elasticsearch process with ps and then kill it, but wait a minute! Before doing that, the cluster needs a bit of configuration so that the service can rejoin quickly after the restart.
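A minimal sketch of stopping the node (assuming Elasticsearch runs as a plain process; under systemd, systemctl stop elasticsearch would be used instead):
ps -ef | grep -i elasticsearch   # find the Elasticsearch process id
kill <pid>                       # send SIGTERM, not SIGKILL, so the node shuts down cleanly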
3.1.1 Stop shard allocation
An index in Elasticsearch is stored in a distributed way: it is split into multiple shards spread across the nodes, and each shard can serve requests on its own. Each shard can also be configured with replicas to keep the cluster highly available and improve query throughput. Elasticsearch balances these shards automatically, keeping them evenly distributed across nodes and never placing a shard and its replica on the same node. Because of this mechanism, when a node leaves and later rejoins the cluster, shard reallocation can cause a large amount of disk and network I/O. Since the node will be back shortly, temporarily disable automatic shard allocation. (The full rolling-restart procedure is described in another article; here we simply stop allocation.)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": "none"
  }
}'
This way, when the node is stopped, no shard rebalancing will be triggered.
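On Elasticsearch versions before 8.0, it also helps to issue a synced flush before stopping the node so that shard recovery after the restart is faster (a hedged sketch; the synced-flush API was removed in 8.x, where an ordinary flush is sufficient):
curl -X POST "localhost:9200/_flush/synced"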
3.1.2 Mount the new disk
Since the machines here are cloud resources, you can follow your cloud provider's disk-expansion documentation. Whether it is an on-premise machine or a cloud machine, the steps are roughly as follows:
reboot the machine
df -h to check the current mount point (e.g. /dev/vdb) and mount path (e.g. /data/es/)
fdisk -l to further confirm the disk information
fdisk /dev/vdb to repartition the new disk; if it was not partitioned before, the interactive options are mostly the defaults: d, n, p, 1, wq
other operations such as e2fsck -f /dev/vdb and resize2fs /dev/vdb
mount /dev/vdb /data to remount the disk
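Put together, the sequence looks roughly like this (a sketch only: the device /dev/vdb and mount path /data are the examples from above, and whether a partition suffix such as /dev/vdb1 is needed depends on your disk layout):
df -h                   # confirm the current mount point and data path
fdisk -l                # confirm the disk size after the cloud-side expansion
fdisk /dev/vdb          # recreate the partition if needed (d, n, p, 1, wq)
e2fsck -f /dev/vdb      # check the filesystem
resize2fs /dev/vdb      # grow the filesystem to the new size
mount /dev/vdb /data    # remount the Elasticsearch data directory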
3.1.3 Resume allocation
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
'
This step causes some I/O; when recovery finishes, the cluster returns to normal. If any shards remain in the Unassigned state, allocate them manually with the reroute API, which is not difficult; under normal circumstances there should be almost no Unassigned shards.
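A hedged sketch of how to investigate and retry an Unassigned shard if one does appear:
curl -X GET "localhost:9200/_cluster/allocation/explain"         # explain why a shard is unassigned
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"  # retry allocations that previously failed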
After performing the above operations, the disk capacity is doubled and the service is safe for a while.
4. Index splitting (best practice)
Neither delete-by-query nor disk expansion really solves the problem of an oversized index. As the number of documents grows, the index inevitably grows with it, and on a huge index both the delete pass and the shard recovery after expansion take dramatically longer, from a few hours to a day or two. So the methods above do not scale.
Deleting a whole index, by contrast, is physical and immediate. Since the original index cannot simply be dropped, the plan is to split the big index into small ones and then delete the old ones. Does Elasticsearch offer an API that splits a large index by some condition (by day, by month)? I read a lot of blog posts on this and found that the APIs they describe are mostly out of date; the official documentation is the most reliable source, and although it is in English it is not hard to follow, so I strongly recommend it. In the end I investigated these APIs: _rollover, alias, template and reindex. They are worth studying in depth; I will not go through every detail here, but let me show how I used them.
4.1 A wrong attempt
_rollover: when I saw this API I thought I had found the one closest to the requirement. It is described as follows:
The rollover index API rolls an alias over to a new index when the existing index is considered to be too large or too old.
I was obviously attracted by "too large" and "too old".
curl -X PUT "localhost:9200/logs-000001" -H 'Content-Type: application/json' -d'
{
  "aliases": {
    "logs_write": {}
  }
}
'
# add more than 1000 documents to logs-000001
curl -X POST "localhost:9200/logs_write/_rollover" -H 'Content-Type: application/json' -d'
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 1000,
    "max_size": "5gb"
  }
}
'
# a new index is created when the current one is older than 7 days,
# holds more than 1000 documents, or exceeds 5 GB
Idea:
First create an alias for the index, then call _rollover. When the call is executed, a new index is created if any of the conditions is met, and the alias is pointed at the new index. _rollover can also generate new index names based on the date.
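For reference, a hedged sketch of the date-based naming just mentioned: if the first index is created with a date-math name, each rollover includes the current date in the new index name (the URL-encoded name below stands for <logs-{now/d}-1>):
curl -X PUT "localhost:9200/%3Clogs-%7Bnow%2Fd%7D-1%3E" -H 'Content-Type: application/json' -d'
{
  "aliases": {
    "logs_write": {}
  }
}
'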
Problems:
The split only happens when the API is called; it does not keep splitting automatically afterwards. To split by day it would have to be executed on a schedule every day, which increases maintenance cost, and the scheduled task could fail.
Once the alias rolls over to the new index, the old index can no longer be reached through the alias.
Historical data is not redistributed by day; only documents written after the API call go into the new index.
4.2 Best practice
Suppose the original index is named knight-log and the current month is 2019-02.
Idea:
Create an alias: alias-knight-log
Associate the alias alias-knight-log with the old knight-log index
Create an index template template-knight-log that matches all knight-log.* indexes; the template defines the settings and mappings and automatically attaches the alias alias-knight-log (a sketch of these calls appears after this section)
In addition, reads and writes to Elasticsearch are configured separately in the application:
Reads go through alias-knight-log, which queries all associated indexes
Writes go to knight-log.2019-02, i.e. the original index name with the month appended. When indexing through the Java API each document must specify its index name, so simply change the original name to name + current_month; other languages work the same way.
After the application code is updated, new documents go into indexes such as knight-log.2019-02, no more writes reach the original knight-log, and read requests are proxied through the alias alias-knight-log.
After the steps above, the old index is read-only and all writes go to knight-log.2019-02. With that guaranteed, the old knight-log can be cleaned up piece by piece at whatever time granularity we choose.
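A minimal sketch of the alias and template setup described above (names taken from this example; the settings body is a placeholder, and on Elasticsearch 7.8+ the newer _index_template API would be preferred over _template):
# attach the alias to the old index
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "knight-log", "alias": "alias-knight-log" } }
  ]
}
'
# template matching all monthly indexes, automatically attaching the alias
curl -X PUT "localhost:9200/_template/template-knight-log" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["knight-log.*"],
  "settings": { "number_of_replicas": 1 },
  "aliases": { "alias-knight-log": {} }
}
'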
Sadly, there is no API that does the split directly, but the humble reindex takes on this job. At first I thought it had nothing to do with the problem; not only does it not save space, it doubles it. Then I realised that reindex can run with a query, so I can simply reindex by time range.
# reindex all documents from 2019-01 into knight-log.2019-01
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "knight-log",
    "type": "your_doc_type",
    "query": {
      "bool": {
        "filter": {
          "range": {
            "day": {
              "gte": "2019-01-01",
              "lt": "2019-02-01"
            }
          }
        }
      }
    }
  },
  "dest": {
    "index": "knight-log.2019-01"
  }
}
'
Tip 1: to speed up indexing into the new index, the usual bulk-indexing settings apply before running reindex, e.g. replicas=0 and refresh_interval=-1 on the destination index.
Tip 2: by default reindex does not use slices (roughly equivalent to multithreading). Even though I/O and CPU stay low it is slow, so enable slicing appropriately, but keep the number of slices no higher than the number of shards of the index; see the official documentation for details.
Tip 3: if one month contains too many documents to handle in one go, first reindex a single day into the new monthly index, observe CPU and I/O, tune the reindex parameters, and then reindex the remaining days of the month into the new index.
Tip 4: after a reindex finishes and refresh_interval is restored, query the new and old indexes for the same range and check that the document counts match; make sure nothing is off.
Tip 5: if you start a reindex by mistake, the task can be cancelled midway (a sketch of these knobs follows).
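A hedged sketch of the settings and task handling mentioned in the tips above (index names, the number of slices and the task id are placeholders):
# Tip 1: speed up bulk indexing into the destination index
curl -X PUT "localhost:9200/knight-log.2019-01/_settings" -H 'Content-Type: application/json' -d'
{ "index": { "number_of_replicas": 0, "refresh_interval": "-1" } }
'
# Tips 2 and 5: run reindex asynchronously with slicing (keep slices <= number of shards)
curl -X POST "localhost:9200/_reindex?slices=3&wait_for_completion=false" -H 'Content-Type: application/json' -d'
{ "source": { "index": "knight-log" }, "dest": { "index": "knight-log.2019-01" } }
'
# the asynchronous call returns a task id; cancel it if the reindex was started by mistake
curl -X POST "localhost:9200/_tasks/<task_id>/_cancel"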
Change the number of replicas of knight-log from the default 1 to 0, which saves half of the disk space used by the index (optional; skip it if there is plenty of disk left). Then reindex knight-log month by month, repeating the operation until the recent months are backed up, e.g. knight-log.2019-01, knight-log.2018-12, and so on (do not forget the February documents that were written into the old index before the switch: they also need a reindex so that knight-log.2019-02 is complete). Meanwhile, thanks to the template, every monthly index automatically gets the alias alias-knight-log, killing two birds with one stone!
After the necessary data has been reindexed (e.g. the last three months), knight-log can be deleted directly!
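The final cleanup is a plain index delete (using the index name from this example):
curl -X DELETE "localhost:9200/knight-log"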
Did reading the above content help you? If you want to learn more about this topic or read more related articles, stay tuned; thank you for your support.