2025-02-25 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/02 Report--
This article explains the principles behind Elasticsearch: how Lucene segments work, how writes are handled, what the translog does, and how deletes, updates, and segment merges are carried out.
Lucene and ES
Lucene
Lucene is the Java library on which Elasticsearch is built. It introduces the concept of segment-based search:
Segment: an inverted index over a subset of the documents; effectively a self-contained, searchable dataset.
Commit point: a file that records all known segments.
Lucene index: a collection of segments plus a commit point.
The composition of a Lucene index is shown in the figure below:
ES
An Elasticsearch index consists of one or more shards.
A Lucene index is equivalent to a single ES shard.
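To make the index-to-shard relationship concrete, here is a minimal sketch of how a document is routed to one of an index's primary shards. The documented formula is shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document id; ES actually uses murmur3, and the md5 used below is only a stand-in to get a stable integer.

```python
import hashlib

def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    # shard = hash(_routing) % number_of_primary_shards
    # (_routing defaults to the doc id; md5 stands in for ES's murmur3)
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_primary_shards

# Each shard is a complete Lucene index in its own right.
assignments = {f"doc-{i}": route_to_shard(f"doc-{i}", 3) for i in range(5)}
```

Because the shard count appears in the modulo, changing the number of primary shards after indexing would invalidate every routing decision, which is why it is fixed at index creation.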
Writing process
Write process 1.0 (imperfect)
The write process 1.0 works as follows:
Documents are continuously written to the in-memory buffer.
When certain conditions are met, the documents in the buffer are flushed to disk.
This generates a new segment and a new commit point.
The new segment can then be read like any other segment.
The process is illustrated in the figure below:
Flushing files to disk is very resource-intensive. There is, however, a filesystem cache between the memory buffer and the disk, and once a file enters that cache it can be read just like a segment on disk.
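The 1.0 write path can be sketched as a small simulation (illustrative names, not the ES API): documents accumulate in an in-memory buffer, and only an expensive commit makes them searchable.

```python
# Minimal simulation of write process 1.0: docs accumulate in an
# in-memory buffer; a commit fsyncs them to disk as a new immutable
# segment and records it at the commit point. Only then are they
# searchable, and the commit itself is expensive.
class Index10:
    def __init__(self):
        self.buffer = []        # in-memory buffer
        self.commit_point = []  # segments known to be safely on disk

    def write(self, doc: str):
        self.buffer.append(doc)

    def commit(self):
        # expensive: fsync the buffered docs to disk as a new segment
        self.commit_point.append(tuple(self.buffer))
        self.buffer = []

    def search(self, term: str):
        # in 1.0, only committed (on-disk) segments are visible
        return [d for seg in self.commit_point for d in seg if term in d]
```

The weakness is visible in the model: a document that has been written but not yet committed cannot be found by search.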
Write process 2.0
The write process 2.0 works as follows:
Documents are continuously written to the in-memory buffer.
When certain conditions are met, the documents in the buffer are written to the filesystem cache.
This generates a new segment, which still lives only in the cache.
No commit has taken place yet, but the segment is already searchable.
The process is illustrated in the figure below:
Data moves from the buffer to the cache through a periodic refresh, which runs once per second by default, so a newly written document becomes searchable within about one second.
This buffer-to-cache step is what Elasticsearch calls refresh. It needs no extra configuration, though the interval can be changed if required. This is why Elasticsearch is described as near-real-time rather than real-time.
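The 2.0 write path with refresh can be sketched the same way (illustrative, not the ES API): refresh moves buffered documents into a segment in the filesystem cache, where they are searchable before any fsync has happened.

```python
# Sketch of write process 2.0: refresh creates a new segment in the
# filesystem cache, making docs searchable *before* any fsync. ES runs
# refresh once per second by default, hence "near real time".
class Index20:
    def __init__(self):
        self.buffer = []          # in-memory buffer
        self.cache_segments = []  # searchable, but not yet fsynced
        self.disk_segments = []   # fsynced and recorded at a commit point

    def write(self, doc: str):
        self.buffer.append(doc)

    def refresh(self):
        # cheap compared to an fsync; triggered every ~1 s
        if self.buffer:
            self.cache_segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, term: str):
        # cached segments are searched even though nothing was committed
        return [d for seg in self.cache_segments + self.disk_segments
                for d in seg if term in d]
```

The trade-off relative to 1.0 is durability: a cached-only segment is searchable but would be lost on a crash, which is the gap the translog fills.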
To make a document immediately visible, trigger a refresh as part of the write:

PUT /test/_doc/1?refresh
{"test": "test"}

or

PUT /test/_doc/2?refresh=true
{"test": "test"}
Translog transaction log
Just as MySQL has its binlog, ES has a translog (transaction log) used for failure recovery:
As documents are written to the in-memory buffer, they are simultaneously appended to the translog.
When the buffer is refreshed to the cache every second, the translog is neither flushed to disk nor cleared; it simply keeps growing by appending.
The translog is fsynced to disk every 5 seconds.
The translog therefore keeps accumulating and growing; when it becomes large enough, or after a fixed interval, a flush is executed.
The flush operation performs the following steps:
The in-memory buffer is emptied.
A new commit point is recorded.
The segments in the cache are fsynced to disk.
The translog is deleted.
It is worth noting that:
The translog is fsynced to disk every 5 s, so a crash and restart may lose up to 5 s of data.
A flush is performed every 30 minutes by default, or earlier if the translog grows too large.
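The interplay of buffer, cache, translog, and flush can be sketched in one simulation (illustrative, not the real implementation): refresh leaves the translog alone, flush truncates it, and after a crash anything not yet committed is replayed from it.

```python
# Sketch of the translog's role in the write path: every write is
# appended to the translog as well as the buffer; refresh does not touch
# the translog; flush fsyncs cached segments, records a commit point,
# and truncates the translog; recovery replays it.
class TranslogIndex:
    def __init__(self):
        self.buffer = []
        self.cache_segments = []   # searchable, lost on power failure
        self.disk_segments = []    # fsynced, survive a crash
        self.translog = []         # append-only log, fsynced every ~5 s

    def write(self, doc: str):
        self.buffer.append(doc)
        self.translog.append(doc)  # appended on every write

    def refresh(self):             # every ~1 s; translog keeps growing
        if self.buffer:
            self.cache_segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):               # every ~30 min, or when translog is large
        self.refresh()                              # empty the buffer
        self.disk_segments += self.cache_segments   # fsync cached segments
        self.cache_segments = []                    # record the commit point
        self.translog = []                          # truncate the translog

    def recover(self):
        # after a crash the cache is gone; replay the translog
        self.buffer, self.cache_segments = list(self.translog), []
        self.refresh()

    def search(self, term: str):
        return [d for seg in self.cache_segments + self.disk_segments
                for d in seg if term in d]
```

In this model the at-most-5 s data-loss window corresponds to writes appended to the translog but not yet fsynced; the simulation treats the translog as always durable for simplicity.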
To perform a flush manually:

POST /my-index-000001/_flush
Delete and Update
Segments are immutable, so a document cannot be removed from, or updated in, an existing segment.
Instead, each commit point includes a .del file listing the deleted documents (a logical delete).
At query time, results are filtered against the .del file before being returned. On an update, the old document is likewise marked as deleted in the .del file, and the new version is written to a new segment.
A query may therefore match both versions of the document, but the old one is filtered out before the results are returned.
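The logical-delete mechanism can be sketched as follows (illustrative structures, not Lucene's file formats): the .del file is modeled as a set of (segment index, doc id) pairs, and updates write the new version into a brand-new segment.

```python
# Sketch of logical deletion via a .del list: segments are immutable,
# so a delete/update only records the old (segment, doc id) pair in a
# .del set; reads filter through it, and an update writes the new
# version of the document into a new segment.
class DelIndex:
    def __init__(self):
        self.segments = []    # list of {doc_id: body}; immutable once written
        self.deleted = set()  # stand-in for the per-commit .del file

    def add_segment(self, docs: dict):
        self.segments.append(dict(docs))

    def update(self, doc_id: str, body: str):
        # mark every live old version as deleted in .del ...
        for i, seg in enumerate(self.segments):
            if doc_id in seg:
                self.deleted.add((i, doc_id))
        # ... then write the new version to a new segment
        self.add_segment({doc_id: body})

    def get(self, doc_id: str):
        # both versions match, but .del filters the old one out
        return [seg[doc_id] for i, seg in enumerate(self.segments)
                if doc_id in seg and (i, doc_id) not in self.deleted]
```

Note that the old document's bytes are still physically present in the old segment; only a merge reclaims that space.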
Segment merging
Because refresh runs every second, each run creates a new segment from the data in memory.
Too many segments cause trouble: every segment consumes file handles, memory, and CPU cycles.
More importantly, every search request must check each segment in turn, so the more segments there are, the slower the search.
A background thread in ES performs segment merging:
The refresh operation creates new segments and opens them for search.
The merge process selects a few segments of similar (small) size and merges them into one larger segment in the background, without interrupting indexing or search.
When the merge finishes, the old segments are deleted.
When a merge completes:
The new segment is flushed to disk, and a new commit point is written that includes the new segment and excludes the old, smaller ones.
The new segment is opened for search.
The old segments are deleted.
Physical deletion: during a segment merge, documents that were logically deleted are finally physically removed.
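The merge step, including physical deletion, can be sketched with the same illustrative structures as above (a segment as a dict, the .del file as a set of (segment index, doc id) pairs):

```python
# Sketch of a background segment merge: several small segments are
# rewritten into one larger segment; docs marked in .del are skipped,
# i.e. physically deleted. The merged segment then serves searches and
# the old segments (and their .del entries) are dropped.
def merge_segments(segments, deleted):
    """segments: list of {doc_id: body}; deleted: set of (segment_idx, doc_id)."""
    merged = {}
    for i, seg in enumerate(segments):
        for doc_id, body in seg.items():
            if (i, doc_id) not in deleted:   # drop logically deleted docs
                merged[doc_id] = body        # later segments win for a doc_id
    return merged
```

After the merge, the space held by logically deleted documents is reclaimed, and searches touch one segment instead of several.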