
How Elasticsearch Writes Data: An Analysis

2025-02-26 Update From: SLTechnology News&Howtos


In this issue, the editor brings you an analysis of how Elasticsearch writes data. The article analyzes the topic from a professional point of view; I hope you get something out of it.

This article first covers the underlying structure of ES (Lucene), then describes in detail the process and principles of writing new data into ES and Lucene. It is a consolidation of the basic theory.

I. What are Elasticsearch & Lucene?

What is Elasticsearch?

Elasticsearch is an open source search engine based on Apache Lucene (TM). Then what is Lucene?

Whether in open source or proprietary settings, Lucene can be considered the most advanced, best-performing, most feature-complete search engine library available to date. Elasticsearch hides Lucene's complexity behind a simple RESTful API, making full-text search easy. But Elasticsearch is more than Lucene plus full-text search; it can also be described as:

A distributed real-time document store in which every field is indexed and searchable

Distributed real-time analysis search engine

Can scale to hundreds of servers to handle PB-level structured or unstructured data

II. The relationship between Elasticsearch and Lucene

Just as many business systems are built on Spring, the relationship between Elasticsearch and Lucene is simple: Elasticsearch is implemented on top of Lucene. ES wraps the underlying library, extends it with richer query statements, and makes interaction with the underlying layer easier through a RESTful API. Similarly, Solr is also implemented on Lucene. In application development, Elasticsearch is easy to use; using Lucene directly requires a great deal of integration work.

Therefore, beginners can get started with ES knowing only a little about Lucene, but to advance you still need to learn Lucene's underlying principles: the inverted index, the scoring mechanism, full-text retrieval, tokenization, and so on. These are not outdated technologies.
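Since the inverted index is the core of those principles, here is a minimal illustrative sketch of building and querying one in Python. This is a toy, not Lucene's actual data structures; the sample documents are made up.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND query: return sorted IDs of docs containing every term."""
    terms = query.lower().split()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)

docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene is a search engine library",
    3: "Solr is also built on Lucene",
}
index = build_inverted_index(docs)
```

The key property this illustrates: lookup cost depends on the query terms, not on scanning every document, which is what makes full-text search fast.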

III. New document writing process

3.1 Data model

The data model is as follows:

An ES cluster contains multiple Nodes, and each node is one ES instance. An ES Index (for example, a product search index or an order search index) is distributed across the cluster.

Each node holds multiple shards; P1 and P2 are primary shards, R1 and R2 are replica shards.

Each shard corresponds to one Lucene Index (the underlying index files).

Lucene Index is a general term: it consists of multiple Segments (segment files, i.e. inverted indexes), and each segment file stores Doc documents.
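To make the shard model concrete, here is a hedged sketch of how a document is routed to a primary shard. ES computes roughly shard = hash(_routing) % number_of_primary_shards using a murmur3 hash of the routing value (the document ID by default); the crc32 below is only a stand-in for a stable hash, and the document ID is invented for illustration.

```python
import zlib

def route_to_shard(routing_value: str, num_primary_shards: int) -> int:
    """Pick the primary shard for a document.
    Real ES uses murmur3 on the _routing value (default: _id);
    crc32 here just stands in for any stable hash function."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# e.g. with 2 primary shards (P1, P2 above), every doc lands on one of them
shard = route_to_shard("order-12345", 2)
```

Because the hash is deterministic, the same document ID always maps to the same shard, which is how both indexing and lookups find the right place without a central directory.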

3.2 Lucene Index

In Lucene, a single inverted index file is called a segment. A separate file records information about all segments and is called the commit point:

When a new document is created, a new segment is generated, which is also recorded in the commit point

A query searches across all segments

When a document in an existing segment is deleted, the deletion is recorded in a .liv file; the segment itself is never modified
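The three points above can be modeled in a few lines. This is an illustrative toy, not Lucene's on-disk format: plain dicts stand in for segment files, a list of indexes stands in for the commit point, and per-segment sets stand in for .liv files.

```python
class LuceneIndexModel:
    """Toy model: immutable segments, a commit point listing them,
    and per-segment deleted-doc sets standing in for .liv files."""

    def __init__(self):
        self.segments = []      # each segment: dict of doc_id -> doc
        self.commit_point = []  # indexes of segments known to the index
        self.deleted = []       # per-segment set of tombstoned doc ids

    def add_segment(self, docs):
        """A newly written segment is recorded in the commit point."""
        self.segments.append(dict(docs))
        self.deleted.append(set())
        self.commit_point.append(len(self.segments) - 1)

    def delete(self, doc_id):
        """Deletes never touch the segment; they only add a tombstone."""
        for seg_idx in self.commit_point:
            if doc_id in self.segments[seg_idx]:
                self.deleted[seg_idx].add(doc_id)

    def get(self, doc_id):
        """Queries consult every segment, skipping tombstoned docs."""
        for seg_idx in self.commit_point:
            seg = self.segments[seg_idx]
            if doc_id in seg and doc_id not in self.deleted[seg_idx]:
                return seg[doc_id]
        return None
```

The design choice worth noticing: because segments are immutable, deletion is just bookkeeping, and the deleted data is only physically reclaimed later, during the merge process described below.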

3.3 New document writing process

When a new document is created or updated, the process is as follows: an update does not modify the original segment; both updates and creations generate a new segment. Where does the data come from? It is first stored in an in-memory buffer and then persisted into a segment.

The steps for data persistence are: write -> refresh -> flush -> merge

3.3.1 write process

A new document is first stored in the in-memory buffer, and the operation is also recorded in the Translog.

At this point, the data has not yet reached a segment, so the new document cannot be found. Data becomes searchable only after a refresh.

3.3.2 refresh process

Refresh runs once per second by default, executing the process above each time. ES allows this interval to be changed through the index.refresh_interval setting. The refresh process is roughly as follows:

The documents in the in-memory buffer are written to a new segment, but the segment is stored in the filesystem cache. At this point, the documents can be searched.

Finally, the in-memory buffer is emptied. Note: the Translog is not emptied, so that the segment data can still be recovered until it is written to disk.

After a refresh, the segment sits temporarily in the filesystem cache, which avoids expensive disk IO while already allowing the document to be searched. Running refresh every second costs noticeable performance, so it is generally recommended to lengthen the interval somewhat, for example to 5 seconds. This is also why ES is really near-real-time rather than truly real-time.
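The write -> refresh visibility rule described above can be sketched as a toy model (illustrative only, not the real implementation; lists stand in for the buffer, translog, and segments):

```python
class Shard:
    """Toy model of write -> refresh visibility in one shard."""

    def __init__(self):
        self.in_memory_buffer = []  # written docs, not yet searchable
        self.translog = []          # durability log, survives refresh
        self.segments = []          # searchable only after refresh

    def write(self, doc):
        self.in_memory_buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # buffer contents become a new segment in the filesystem cache
        if self.in_memory_buffer:
            self.segments.append(list(self.in_memory_buffer))
            self.in_memory_buffer.clear()  # translog is NOT cleared here

    def is_searchable(self, doc):
        return any(doc in seg for seg in self.segments)
```

In a real cluster the refresh interval is changed with an index settings update, i.e. setting the index.refresh_interval value quoted above to something like "5s".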

3.3.3 flush process

In the previous step, segments live only in the filesystem cache, so an unexpected failure could lose documents. To ensure documents are not lost, they need to be written to disk. This process of writing segments from the file cache to disk is flush. After the write to disk completes, the translog is cleared.

The Translog is very useful:

It ensures that documents still only in the file cache are not lost

The system recovers from the translog when it restarts

The new segment is included in the commit point

For details, please see the official document: https://www.elastic.co/guide/...
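The flush and translog-recovery behavior can likewise be sketched as a toy model (illustrative only; real ES fsyncs the translog itself and replays indexing operations, not raw documents):

```python
class DurableShard:
    """Toy model of flush and translog-based crash recovery."""

    def __init__(self):
        self.buffer = []             # in-memory buffer
        self.fs_cache_segments = []  # refreshed, but not yet on disk
        self.disk_segments = []      # safe after flush
        self.translog = []

    def write(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        if self.buffer:
            self.fs_cache_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # persist cached segments to disk, then the translog can be cleared
        self.disk_segments.extend(self.fs_cache_segments)
        self.fs_cache_segments.clear()
        self.translog.clear()

    def crash_and_recover(self):
        # buffer and filesystem cache are lost; replay the translog
        pending = list(self.translog)
        self.buffer.clear()
        self.fs_cache_segments.clear()
        self.translog.clear()
        for doc in pending:
            self.write(doc)
```

This shows why the translog is only cleared on flush: anything written since the last flush, refreshed or not, can still be replayed from it.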

3.3.4 merge process

The steps above mean segments keep accumulating, so searches get slower and slower. What can be done about it?

This is resolved by the merge process:

Many small segment files are merged into one large file

When the segment merge finishes, the old small files are deleted

Documents recorded as deleted in .liv files are physically removed during this process
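A minimal sketch of the merge step (illustrative; real Lucene merges sorted posting lists on disk, not Python dicts):

```python
def merge_segments(segments, deleted_ids):
    """Merge many small segments into one, dropping tombstoned docs.

    segments:    list of {doc_id: doc} dicts standing in for segment files
    deleted_ids: set of doc ids standing in for .liv tombstones
    """
    merged = {}
    for seg in segments:
        for doc_id, doc in seg.items():
            if doc_id not in deleted_ids:  # deleted docs are NOT copied over
                merged[doc_id] = doc
    return merged

big_segment = merge_segments(
    [{"d1": "a"}, {"d2": "b"}, {"d3": "c"}],
    deleted_ids={"d2"},
)
```

Note that deletion is only made physical here: the merge simply skips tombstoned documents when copying, so the new large segment never contains them and the old small files (with their .liv entries) can be dropped.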

IV. Summary

To summarize, the principle of ES writes is not difficult; just remember the key points:

Write-> refresh-> flush

Write: document data is buffered in memory and recorded in the translog

Refresh: data in the in-memory buffer is written to a segment in the filesystem cache; it becomes searchable at this point

Flush: segments in the filesystem cache are written to disk

The above is the analysis of how Elasticsearch writes data. If you happen to have similar questions, the analysis above may help you understand.


© 2024 shulou.com SLNews company. All rights reserved.
