This article explains how to use Elasticsearch's ScanScroll: its characteristics, how to call it, and how it works internally.
Characteristics of ScanScroll
Advantages
High speed
Handles large volumes of data
Disadvantages
Sorting is not supported
Pagination is not supported
Scoring is not supported
Going back to earlier results is not supported (a scroll only moves forward)
Usage scenarios
It seems that the disadvantages outweigh the advantages, yet SCAN is very useful. If BULK exists to get data into the index quickly, then SCAN was born to get it out quickly. ES has excellent query performance but weak analytical ability, so requirements arise such as pulling ES data into a Hadoop cluster for analysis and computation; there is already a ready-made plugin for this, and as expected it uses SCAN. And when SCAN meets BULK, that is, copying from ES to ES, the combination has a more familiar name: copying a table (a sketch of this follows the usage example below).
Usage

    def scanTest():
        searchRes = es.search(index="users", size=10,
                              body={"query": {"match_all": {}}},
                              search_type="scan", scroll="10s")
        while True:
            scrollRes = es.scroll(scroll_id=searchRes["_scroll_id"],
                                  scroll="10s", ignore=[400, 404])
            res_list = scrollRes["hits"]["hits"]
            if not len(res_list):
                break
            for res in res_list:
                print(res["_source"]["userName"])
            searchRes = scrollRes  # carry the latest _scroll_id forward
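Building on that, the "SCAN meets BULK" case mentioned above can be sketched in a few lines. This is a hedged illustration rather than a production tool: it assumes an elasticsearch-py client against an ES 1.x cluster (where search_type="scan" is still supported) and a pre-created target index; the name users_copy is made up here.

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["localhost:9200"])

    def copy_index(source, target, page_size=100):
        # stage one: open a scan over the source index
        res = es.search(index=source, search_type="scan", scroll="10s",
                        size=page_size, body={"query": {"match_all": {}}})
        while True:
            res = es.scroll(scroll_id=res["_scroll_id"], scroll="10s")
            hits = res["hits"]["hits"]
            if not hits:
                break
            # re-feed each scanned page into the target index via bulk
            helpers.bulk(es, [{"_index": target, "_type": h["_type"],
                               "_id": h["_id"], "_source": h["_source"]}
                              for h in hits])

    copy_index("users", "users_copy")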
Principle and flow

The whole flow is fairly clear: first count the total, then each scroll returns size * (number of shards) documents until everything has been traversed. SCAN supports the query preference parameter, so a request can be pinned to specific shards; that is why some people say "size * number of primary shards" is not accurate, and this is easy to verify, as the sketch below shows.
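A minimal verification sketch: pin the scan to a single shard with preference and the page size drops accordingly. It continues from the scanTest() example above (same es client and users index); "_shards:0" is standard ES preference syntax.

    # es: the Elasticsearch() client from the usage example above
    res = es.search(index="users", search_type="scan", scroll="10s", size=10,
                    preference="_shards:0",   # scan shard 0 only
                    body={"query": {"match_all": {}}})
    res = es.scroll(scroll_id=res["_scroll_id"], scroll="10s")
    print(len(res["hits"]["hits"]))   # at most 10, not 10 * number of shards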
The first stage: Search
Count the total with a TotalHitCountCollector and record, per shard, a (node, query context ID) pair; Base64-encode these into a ScrollId and return it.
The second stage: SearchScroll
Using the ScrollId, go to each node, look up the query context by its ID, execute an XFilteredQuery, collect the results, then merge and return them.
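One observable consequence of this two-stage design, sketched against the users index from the example above (ES 1.x): the initial scan request returns the total and a ScrollId but no documents; documents only start flowing on the first scroll.

    # es: the Elasticsearch() client from the usage example above
    res = es.search(index="users", search_type="scan", scroll="10s", size=10,
                    body={"query": {"match_all": {}}})
    print(res["hits"]["total"])       # the total counted in stage one
    print(len(res["hits"]["hits"]))   # 0 -- no hits until the first scroll
    res = es.scroll(scroll_id=res["_scroll_id"], scroll="10s")
    print(len(res["hits"]["hits"]))   # now up to size * number of shards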
In the first stage, besides the total, a rather mysterious ScrollId is returned. It looks like a Base64-encoded string, and it cannot be as simple as a plain ID, so let's get to know it. Sure enough, it is made up of three main parts: type, context, and attributes.
type is one of queryThenFetch, queryAndFetch, or scan; here we are talking about scan.
attributes has only one element: total_hits.
context is an array with one two-element tuple per shard: shard = [node ID, query context ID].
The ScrollId gives up its secrets easily. Decoding it shows that it depends on the node ID and the query context ID, both of which vary; the query context ID is incremented on every request. So each request's ScrollId is different, and if our SCAN process terminates unexpectedly we may have to start over from the beginning.
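You can decode one yourself. A minimal sketch, assuming an ES 1.x cluster where the ScrollId is plain Base64 of an ASCII string (the exact bytes differ per cluster, and if decoding complains about padding the id may use a URL-safe Base64 alphabet):

    import base64

    scroll_id = searchRes["_scroll_id"]   # from the scanTest() example above
    print(base64.b64decode(scroll_id))
    # roughly: scan;<shard count>;<ctx id>:<node id>;...;total_hits:<n>;
    #          type, then one "query context ID:node ID" pair per shard
    #          (the context), then the attributes such as total_hits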
Each time SCAN processes a scroll, it jumps to the next page by adding size to from, which is why any from we specify ourselves is invalid:
    // SearchService
    private void processScroll(InternalScrollSearchRequest request, SearchContext context) {
        // process scroll
        context.from(context.from() + context.size());
        context.scroll(request.scroll());
        // ...
    }

    // ScanContext
    public TopDocs execute(SearchContext context) throws IOException {
        ScanCollector collector = new ScanCollector(readerStates, context.from(), context.size(), context.trackScores());
        Query query = new XFilteredQuery(context.query(), new ScanFilter(readerStates, collector));
        try {
            context.searcher().search(query, collector);
        } catch (ScanCollector.StopCollectingException e) {
            // all is well
        }
        return collector.topDocs();
    }
A custom Filter and Collector then execute the search and collect the result set for that page:
    // ScanContext.ScanCollector
    public void collect(int doc) throws IOException {
        // collect only documents inside the current page [from, to)
        if (counter >= from) {
            docs.add(new ScoreDoc(docBase + doc, trackScores ? scorer.score() : 0f));
        }
        readerState.count++;
        counter++;
        if (counter >= to) {
            // page is full: abort the search via the sentinel exception
            throw StopCollectingException;
        }
    }
Based on my earlier experience with databases, count operations are always slow, which made me worry that the count would stretch the overall query time. That worry turned out to be unnecessary: counting in full-text retrieval is very fast. In a test with 1.7 billion documents across 24 shards, the average count time per shard was between 200 ms and 700 ms; even in the worst case the total comes back within 1 second, which is acceptable for the overall query time.
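A sketch of how such a per-shard measurement can be reproduced, using search_type="count" (the ES 1.x way to get only the total) and preference to target one shard at a time; the shard count of 24 and the users index are taken from the text above, not from any real cluster.

    import time

    # es: the Elasticsearch() client from the usage example above
    for shard in range(24):
        start = time.time()
        res = es.search(index="users", search_type="count",
                        preference="_shards:%d" % shard,
                        body={"query": {"match_all": {}}})
        elapsed_ms = (time.time() - start) * 1000
        print("shard %d: %d docs, %.0f ms" % (shard, res["hits"]["total"], elapsed_ms))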
Thank you for reading. That covers how to use Elasticsearch's ScanScroll; a deeper understanding comes from verifying the details above in practice.