The ineffective ES setTimeout, as seen from frequent application 502s


Source: SLTechnology News&Howtos (Shulou.com), 06/02 report; updated 2025-07-19.

Why does setTimeout on an Elasticsearch (ES) query seem to have no effect, and how did that end up causing frequent 502s in our application? This article walks through the analysis and the fix, in the hope that it gives others who hit the same problem a simpler way out.

Origin

The ES cluster used by the application was version 1.7. The company had asked internally for an upgrade because the cluster version was too old, but for various reasons the application was not upgraded, and the matter dragged on until the 618 shopping promotion. As a result, 502 responses and full RPC thread pools became frequent during that period, casting a dark cloud over a 618 that should have gone smoothly.

When the alarm fired, we took a thread dump on the machine and found a large number of ES threads blocked, along with many http threads. From there it is not hard to see why we got 502s and a full RPC thread pool. Because the ES operations never returned, the business thread pool filled up; the http connections kept waiting for results and were never released, so Tomcat's request-handling threads filled up as well. That is what produced the frequent 502s.

The jstack thread dump shows a large number of blocked http threads, so the next step is to look at the details.

In the details, a large number of threads are blocked on the ES call, so the problem is obvious. In theory, once a problem is located it should be easy to solve. I did not expect this to be only the beginning of the pain.
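The chain described above, where calls that never return pin every worker until new requests are turned away, can be reproduced in miniature. The sketch below is not the application's code: the class name, the pool sizes, and the latch standing in for the stuck ES call are all assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolSaturationDemo {
    /**
     * Submits `requests` tasks that all block on a backend call with no timeout,
     * and returns how many submissions the pool rejected.
     */
    static int countRejected(int workers, int queueSize, int requests) {
        // Hypothetical stand-in for the ES call that never returns.
        CountDownLatch backendReplies = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize)); // default handler rejects when full
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            backendReplies.await(); // blocks indefinitely
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // upstream, the caller would see an error such as a 502
                }
            }
        } finally {
            backendReplies.countDown(); // release the blocked workers so the demo can exit
            pool.shutdown();
            try {
                pool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 workers + 1 queue slot can absorb 3 of the 5 requests.
        System.out.println("rejected = " + countRejected(2, 1, 5));
    }
}
```

With two workers and one queue slot, the first three requests are absorbed and every later one is rejected until a worker is freed, which mirrors why the only remedy in production was to make the blocked threads return.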

Cause

Together with the ES middleware group we checked the ES cluster; there were no slow logs, and they concluded the cluster was normal. The network group investigated and reported that the network was normal, and the operations group confirmed that the machines were normal. We deleted all the machine instances and replaced them with a new batch, and the problem remained. Yet call monitoring, logs, threads, and stack traces all pointed at ES. The cluster was a low-version cluster that the company had declared it would no longer maintain. The problem had never been reported before but was now frequent, so we suspected that the cluster's resources had been reduced, partly because it was no longer maintained. As a temporary fix, the threads must not block indefinitely: they should be released as soon as possible so that a request returns quickly even if no data is found.

    FilteredQueryBuilder fqb = QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(), boolFilterBuilder);
    SearchResponse searchResponse = client.prepareSearch(indexName)
            .setTimeout(TimeValue.timeValueMillis(500)) // search-level timeout of 500 ms
            .setTypes(documentType)
            .setSearchType(SearchType.QUERY_THEN_FETCH)
            .setQuery(fqb)
            .addSort("created", SortOrder.DESC)
            .setFrom(0)
            .setSize(30)
            .execute()
            .actionGet(); // waits with no limit for the response

The key is the line setTimeout(TimeValue.timeValueMillis(500)), which sets the query time to 500 ms. We happily shipped it, thinking the problem was solved. Reality slapped us in the face: the problem was exactly the same.

Resolution

The problem stayed unsolved for a long time. The application raised 502 alarms every day with its threads exhausted, and the only remedy was to restart it. It was annoying, and I kept wondering why a timeout that was clearly set had no effect. In the end the only option was to dig down to the root. My understanding of ES was not thorough: I had assumed setTimeout meant that if the query had not returned after the given time, the query connection would be broken and the call would return immediately. That is not the case. A query touches data on multiple shards, and if the query has not finished when the time is up, the data collected so far is returned. Even so, this timeout often does not take effect, as an official ES issue explains:

Sadly, it is a best effort timeout, its not being checked on all places. Specifically, if you send a query that ends up being rewritten into many terms (fuzzy, or wildcard), that part (the rewrite part) does not check for a timeout.

Source: Elasticsearch issue "Timeout on search not respected"
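To make the "best effort" wording concrete, here is a toy model of a search that checks its time budget only between documents and has one phase that never checks it. This is not how ES is implemented; the class, the simulated clock, and the cost parameters are assumptions chosen only to show why the caller can still wait far longer than the budget. In the real client, the response exposes whether the search timed out through SearchResponse.isTimedOut().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BestEffortTimeoutDemo {
    /** Outcome of one simulated search. */
    static final class Result {
        final List<String> hits = new ArrayList<>();
        boolean timedOut = false;
        long elapsedMs = 0; // simulated clock, not wall-clock time
    }

    /**
     * @param rewriteCostMs simulated cost of a phase that never checks the budget
     * @param shards        documents held by each shard
     * @param costPerDocMs  simulated cost of collecting one document
     * @param budgetMs      the search-level timeout
     */
    static Result search(long rewriteCostMs, List<List<String>> shards,
                         long costPerDocMs, long budgetMs) {
        Result r = new Result();
        r.elapsedMs += rewriteCostMs; // unchecked phase: always runs to completion
        for (List<String> shard : shards) {
            for (String doc : shard) {
                if (r.elapsedMs >= budgetMs) { // budget is checked only here
                    r.timedOut = true;
                    return r; // partial results, no exception
                }
                r.hits.add(doc);
                r.elapsedMs += costPerDocMs;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        List<List<String>> shards = Arrays.asList(
                Arrays.asList("a", "b", "c"), Arrays.asList("d", "e", "f"));

        Result partial = search(0, shards, 200, 500);
        System.out.println(partial.hits + " timedOut=" + partial.timedOut);

        Result slowRewrite = search(3000, shards, 200, 500);
        System.out.println(slowRewrite.hits + " timedOut=" + slowRewrite.timedOut
                + " elapsedMs=" + slowRewrite.elapsedMs);
    }
}
```

In the second call the budget is 500 but the simulated call still takes 3000, because the expensive phase never looks at the clock. That is the behaviour the quoted issue describes for queries rewritten into many terms.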

So how do we configure the client so that a timeout exception is thrown when the data cannot be returned within the set time? It turns out that a timeout can be passed to the last step of the query API call, that is, to actionGet() or get(), for example through this overload:

    T actionGet(long var1, TimeUnit var3) throws ElasticsearchException;

If no result arrives within the set time, a timeout exception is thrown. Strictly speaking this is not a connection timeout but a processing timeout: its logic is that of a Java asynchronous Future timeout, which bounds how long the calling thread waits for the result. That is enough for our needs: if the query has not finished within the set time, a timeout exception is thrown and the thread is released.
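Since the ES client is not available in a standalone snippet, the sketch below shows the same Future timeout semantics with the JDK alone. The class name, the sleeping task that stands in for a slow search, and the fallback strings are assumptions for illustration. With the ES client, the corresponding change would be to end the request with .execute().actionGet(500, TimeUnit.MILLISECONDS) instead of .actionGet().

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class FutureTimeoutDemo {
    /**
     * Waits at most timeoutMs for a task that takes taskMs.
     * Returns a hypothetical fallback value when the wait times out, so the
     * calling thread is released instead of blocking indefinitely.
     */
    static String queryWithTimeout(long taskMs, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(() -> {
            Thread.sleep(taskMs); // stands in for a slow search
            return "hits";
        });
        try {
            // Bounded wait: this is the only place the timeout is enforced.
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // the timeout alone does not stop the task
            return "timeout";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed";
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(queryWithTimeout(50, 2000));  // task finishes in time
        System.out.println(queryWithTimeout(5000, 100)); // wait gives up first
    }
}
```

The design point is that the timeout protects the caller, not the callee: the waiting thread comes back on time, but whatever is running on the other side keeps going unless it is cancelled separately. That matches the description of it as a processing timeout rather than a connection timeout.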

After the change, the system no longer returned 502s, and the thread count stabilized at around 400, where it had often exceeded a thousand before.

Going further

Because my understanding of the ES timeout mechanism had been so poor, I also looked into the other timeout-related settings in ES.

Timeout for the client's ping to cluster nodes (client.transport.ping_timeout)

    Settings settings = Settings.builder()
            .put("client.transport.sniff", true)
            .put("client.transport.ping_timeout", "10s") // example value only; the default is 5s
            .build();
    TransportClient client = new PreBuiltTransportClient(settings);

client.transport.ping_timeout: "The time to wait for a ping response from a node. Defaults to 5s." In other words, the client waits up to 5 s by default for the response to its ping, and if nothing comes back the node is considered unavailable. If the network latency between the client and the cluster is high or the connection is unstable, you may need to increase this value.

Timeout in scroll

    SearchResponse scrollResp = client.prepareSearch(test)
            .addSort(FieldSortBuilder.DOC_FIELD_NAME, SortOrder.ASC)
            .setScroll(new TimeValue(60000))
            .setQuery(qb)
            .setSize(100)
            .get();

The time passed to setScroll enables scrolling and sets how long the scroll context is kept alive between requests. In our tests it turned out to be another Schrödinger parameter: it had no observable effect as a timeout, so do not rely on it to bound how long a query takes.

