This article explains the latest Elasticsearch interview questions for 2021. The explanations are simple, clear, and easy to follow; work through them to learn what the latest 2021 Elasticsearch interview questions are.
1. How much do you know about Elasticsearch? Describe your company's ES cluster architecture, index data size, number of shards, and some of the tuning methods you used.
Interviewer's intent: to learn the ES usage scenarios and scale the candidate has worked with, and whether they have done large-scale index design, planning, and tuning.
Answer: respond truthfully, based on your own hands-on experience.
For example: the ES cluster has 13 nodes, with 20+ indexes organized by business channel. Indexes roll daily by date, adding roughly 20 new indexes per day; each index has 10 shards; the cluster ingests about 100 million documents per day; and each channel's daily index is kept under 150 GB.
Index-level tuning strategies:
1.1. Tuning in the design phase
(1) Based on the business's incremental requirements, create indexes from date-based templates and roll them over via the rollover API.
(2) Use aliases for index management.
(3) Run force_merge on indexes at a fixed time in the early morning to merge segments and free up space (a sketch of (1)-(3) follows this list).
(4) Adopt a hot-warm separation mechanism: store hot data on SSDs to improve retrieval efficiency, and periodically shrink cold data to reduce storage.
(5) Use Curator to manage the index lifecycle.
(6) Configure analyzers appropriately, and only for the fields that actually need full-text analysis.
(7) In the mapping phase, consider each field's attributes carefully: whether it needs to be searchable, whether it needs to be stored, and so on.
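A minimal sketch of points (1)-(3), assuming hypothetical daily indexes blog_index_YYYYMMDD and an alias blog_current; exact names, schedules, and thresholds depend on your workload:

POST /_aliases
{
  "actions": [
    { "remove": { "index": "blog_index_20210601", "alias": "blog_current" } },
    { "add":    { "index": "blog_index_20210602", "alias": "blog_current" } }
  ]
}

POST /blog_index_20210601/_forcemerge?max_num_segments=1

Swapping the alias lets clients keep querying blog_current while the underlying daily index changes; force_merge should only be run against indexes that no longer receive writes.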
1.2. Write tuning
(1) Before a bulk load, set the number of replicas to 0.
(2) Before writing, disable the refresh mechanism by setting refresh_interval to -1.
(3) During the load, write with bulk requests.
(4) After writing, restore the replica count and refresh interval (a sketch follows this list).
(5) Use auto-generated document IDs wherever possible.
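A sketch of (1), (2), and (4) wrapped around a bulk load, with a hypothetical index name; the restored values should match your normal settings:

PUT /blog_index_20210602/_settings
{ "index": { "number_of_replicas": 0, "refresh_interval": "-1" } }

POST /_bulk
{ "index": { "_index": "blog_index_20210602" } }
{ "title": "doc one", "channel": "news" }
{ "index": { "_index": "blog_index_20210602" } }
{ "title": "doc two", "channel": "news" }

PUT /blog_index_20210602/_settings
{ "index": { "number_of_replicas": 1, "refresh_interval": "1s" } }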
1.3. Query tuning
(1) Avoid wildcard queries.
(2) Avoid terms queries with huge value lists (hundreds of terms or more).
(3) Make full use of the inverted index: map fields used for exact matching as keyword wherever possible.
(4) When the data volume is large, first narrow the target indexes by time range, then search (a sketch follows this list).
(5) Set up a reasonable routing mechanism.
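For example, a filter-only search against keyword fields on a single recent daily index, with a routing value to limit the shards queried (all names are hypothetical):

GET /blog_index_20210602/_search?routing=user_123
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "channel": "news" } },
        { "range": { "publish_time": { "gte": "2021-06-01" } } }
      ]
    }
  }
}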
1.4. Other tuning
Deployment tuning, business-level tuning, and so on.
From the points above, the interviewer can largely gauge your prior hands-on or operations experience.
2. What is Elasticsearch's inverted index?
An inverted index maps each term to the list of documents that contain it, which is what makes full-text retrieval fast. For its term dictionary, Lucene has made extensive use of the FST (finite state transducer) data structure since version 4+. FST has two advantages:
(1) Small footprint: by sharing prefixes and suffixes across dictionary entries, it compresses storage space.
(2) Fast lookups: query time complexity is O(len(term)).
3. What do you do when there is too much Elasticsearch index data? How do you tune and deploy?
Interviewer's intent: to gauge your ability to operate ES with large data volumes.
Answer: plan index data capacity in advance. As the saying goes, "design first, code later." Only then can you effectively avoid a sudden data surge overwhelming the cluster's processing capacity and affecting online retrieval or other business.
As for tuning, as mentioned in question 1, here it is in a bit more detail:
3.1 Dynamic index level
Create indexes from a template plus a date, rolled with the rollover API. For example, in the design phase, define the blog index template in the form blog_index_timestamp, with data growing daily. The advantage: a data surge cannot drive a single index to an abnormal size, with document counts approaching Lucene's per-shard ceiling (on the order of 2^31) and storage at TB+ scale or beyond. (A sketch follows below.)
Once a single index grows too large, storage and other risks follow, so plan ahead and head the problem off early.
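A sketch of the template-plus-rollover pattern, assuming the legacy _template API (6.x/early 7.x; newer clusters use _index_template) and that blog_write is an alias pointing at the current write index; all names and thresholds are hypothetical:

PUT /_template/blog_index_template
{
  "index_patterns": ["blog_index_*"],
  "settings": { "number_of_shards": 10, "number_of_replicas": 1 }
}

POST /blog_write/_rollover
{
  "conditions": { "max_age": "1d", "max_docs": 100000000, "max_size": "150gb" }
}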
3.2 Storage level
Separate hot and cold data: hot data (for example, the last 3 days or the last week) stays on fast storage; everything else is cold data.
Since cold data receives no new writes, consider periodic force_merge plus shrink operations to save storage space and improve retrieval efficiency (a sketch follows below).
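A sketch of the shrink flow on a hypothetical cold index: the source must first be made read-only and have a copy of every shard relocated to one node, and the target shard count must be a factor of the source's (here 10 → 2):

PUT /blog_index_20210501/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "cold_node_1",
    "index.blocks.write": true
  }
}

POST /blog_index_20210501/_shrink/blog_index_20210501_small
{
  "settings": { "index.number_of_shards": 2, "index.number_of_replicas": 1 }
}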
3.3 Deployment level
If no planning was done beforehand, this is the emergency fallback.
Because ES natively supports dynamic scaling, adding machines dynamically can relieve cluster pressure. Note: if the master nodes and other roles were planned sensibly in advance, dynamic scale-out does not require restarting the cluster.
4. How does Elasticsearch implement master election? (See question 9 below for the detailed answer.) The cluster's nodes can be listed with the _cat API:

GET /_cat/nodes?v&h=ip,port,heapPercent,heapMax,id,name
ip port heapPercent heapMax id name

5. Describe in detail the process of indexing documents in Elasticsearch. (See question 12 below.)
6. Describe the Elasticsearch search process in detail.
Interviewer's intent: to see whether you understand the underlying principles of ES search, rather than just the business level.
Answer:
Search breaks down into two phases: "query then fetch".
The query phase locates the matching documents but does not retrieve them.
The steps are as follows:
(1) Suppose an index has 5 primary shards, each with 1 replica, for 10 shard copies in total. A request will hit one copy of each shard (primary or replica).
(2) Each shard runs the query locally and puts its results into a local sorted priority queue.
(3) The results of step 2 are sent to the coordinating node, which merges them into a globally sorted list.
The fetch phase then actually retrieves the documents.
The coordinating node fetches all the winning documents and returns them to the client.
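For a request like the one below (hypothetical index and field names), the query phase collects only document IDs and sort values for the top size hits from each shard copy; the fetch phase then pulls just the 10 winning documents:

GET /blog_index_20210602/_search
{ "from": 0, "size": 10, "query": { "match": { "title": "elasticsearch" } } }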
7. What Linux settings do you optimize when deploying Elasticsearch?
Interviewer's intent: to gauge your ES cluster operations ability.
Answer:
(1) Disable swap.
(2) Set heap memory to min(half the node's RAM, 32 GB).
(3) Raise the maximum number of file handles.
(4) Adjust thread pool and queue sizes according to business needs.
(5) For disk storage, use RAID 10 if conditions allow, to increase single-node performance and tolerate individual disk failures.
8. What is the internal structure of Lucene?
Interviewer's intent: to probe the breadth and depth of your knowledge.
Answer:
Lucene consists of two processes, indexing and searching, covering index creation, index maintenance, and search. You can expand on these threads.
9. How does Elasticsearch implement master election?
(1) The ZenDiscovery module is responsible for Elasticsearch's election. It has two main parts: Ping (the RPC through which nodes discover each other) and Unicast (the unicast module holds a host list controlling which nodes need to be pinged).
(2) All master-eligible nodes (node.master: true) are sorted by nodeId in dictionary order. In each election, every node ranks the nodes it knows about and votes for the first one (position 0), tentatively treating it as the master.
(3) If a node's vote count reaches a threshold (a majority of the master-eligible nodes) and the node also voted for itself, it becomes the master. Otherwise re-election continues until the conditions are met.
(4) Note: the master node's responsibilities cover cluster, node, and index management, not document-level management; data nodes can have the HTTP function disabled.
10. Suppose an Elasticsearch cluster has 20 nodes, and 10 of them elect one master while the other 10 elect a different master. What do you do?
(1) When the cluster has at least 3 master-eligible candidates, the split-brain problem can be prevented by setting the minimum number of votes (discovery.zen.minimum_master_nodes) to more than half of all candidate nodes.
(2) When there are only 2 candidates, keep just one as the sole master candidate and use the other as a data node, to avoid the split-brain problem.
11. How does the client select a specific node to execute a request when connecting to the cluster?
TransportClient uses the transport module to connect remotely to an Elasticsearch cluster. It does not join the cluster; it simply takes one or more initial transport addresses and communicates with them in round-robin fashion.
12. Describe in detail the process of indexing documents in Elasticsearch.
By default, the coordinating node uses the document ID (custom routing is also supported) to compute the shard the document should be routed to:

shard = hash(document_id) % num_of_primary_shards
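For example, indexing with an explicit routing value (hypothetical names) hashes the routing string instead of the document ID; searches must then pass the same routing to hit the right shard:

PUT /blog_index_20210602/_doc/1?routing=user_123
{ "title": "hello", "author": "user_123" }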
(1) When the node holding the shard receives the coordinating node's request, it writes the request to the memory buffer and then periodically (every 1 second by default) writes it to the filesystem cache. This move from memory buffer to filesystem cache is called refresh.
(2) Of course, data sitting in the memory buffer or filesystem cache could still be lost. ES guarantees durability through the translog mechanism: after a request is received, it is also written to the translog; once the data in the filesystem cache has been written to disk, the translog is cleared. This disk write is called flush.
(3) During flush, the in-memory buffer is cleared, the content is written to a new segment, the segment is fsynced and a new commit point is created, the content is flushed to disk, the old translog is deleted, and a new translog is started.
(4) Flush is triggered on a timer (every 30 minutes by default) or when the translog grows too large (512 MB by default). Addendum on Lucene segments:
(1) A Lucene index is composed of multiple segments, and each segment is itself a fully functional inverted index.
(2) Segments are immutable, which lets Lucene add new documents to the index incrementally without rebuilding the index from scratch.
(3) Every search request searches all segments of the index, and each segment consumes CPU cycles, file handles, and memory. The more segments there are, the lower the search performance.
(4) To solve this, Elasticsearch merges small segments into a larger segment, commits the newly merged segment to disk, and deletes the old ones.
13. What is Elasticsearch?
Elasticsearch is a distributed, RESTful search and data analytics engine.
(1) Querying: Elasticsearch lets you perform and combine many types of searches (structured, unstructured, geo, metrics) in whatever way suits you.
(2) Analysis: finding the ten documents that best match a query is one thing; but how do you make sense of, say, a billion lines of logs? Elasticsearch aggregations let you zoom out and explore trends and patterns in your data.
(3) Speed: Elasticsearch is fast. Really, really fast.
(4) Scalability: it runs on a laptop, and it runs on hundreds of servers hosting petabytes of data.
(5) Resilience: Elasticsearch runs in a distributed environment and was designed with that in mind from the beginning.
(6) Flexibility: many use cases. Numbers, text, geo, structured, unstructured. All data types are welcome.
(7) Hadoop & Spark: Elasticsearch integrates with Hadoop via Elasticsearch-Hadoop (ES-Hadoop).
14. What are some use cases for Elasticsearch?
Elasticsearch is a highly scalable open-source full-text search and analytics engine. It lets you store, search, and analyze large volumes of data quickly and in near real time.
Here are some typical use cases:
(1) You run an online store and allow your customers to search the products you sell. Here you can use Elasticsearch to store the entire product catalog and inventory, and provide search and autocomplete suggestions.
(2) You want to collect log or transaction data and analyze and mine it for trends, statistics, summaries, or anomalies. Here you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse the data, then have Logstash feed it into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine whatever information interests you.
(3) You run a price alerting platform that lets price-savvy customers specify rules such as: "I am interested in buying a specific electronic gadget; notify me if any vendor's price drops below $X within the next month." Here you can scrape vendor prices, push them into Elasticsearch, and use its reverse-search (Percolator) capability to match price movements against customer queries, pushing alerts to customers when a match is found (a sketch follows this list).
(4) You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions over large amounts of data (think millions or billions of records). Here you can use Elasticsearch to store the data and Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards visualizing the dimensions that matter to you. You can also use Elasticsearch aggregations to run complex business-intelligence queries against the data.
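A minimal sketch of use case (3) using the percolate query, with hypothetical index and field names: alert rules are stored as queries, and each incoming price document is matched against all of them:

PUT /price-alerts
{
  "mappings": {
    "properties": {
      "query": { "type": "percolator" },
      "price": { "type": "float" }
    }
  }
}

PUT /price-alerts/_doc/1
{ "query": { "range": { "price": { "lt": 500 } } } }

GET /price-alerts/_search
{ "query": { "percolate": { "field": "query", "document": { "price": 450 } } } }

The final search returns alert rule 1, because the document's price (450) satisfies the stored range query.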
15. Describe in detail how Elasticsearch updates and deletes documents.
(1) Deletes and updates are also write operations, but documents in Elasticsearch are immutable, so they cannot be deleted or modified in place.
(2) Each segment on disk has a corresponding .del file. A delete request does not actually remove the document; it only marks it as deleted in the .del file. The document still matches queries but is filtered out of the results. When segments are merged, documents marked as deleted in the .del file are not written into the new segment.
(3) When a new document is created, Elasticsearch assigns it a version number. When an update is performed, the old version of the document is marked as deleted in the .del file and the new version is indexed into a new segment. The old version can still match queries but is filtered out of the results.
16. Describe the Elasticsearch search process in detail. (See question 6 above.)
17. In Elasticsearch, how do you find the inverted index entries for a given term?
(1) Lucene's indexing process writes the inverted index into Lucene's index file format, following the basic flow of full-text indexing.
(2) Lucene's search process reads the index information back out of that file format and then computes a score for each document.
18. What optimizations do you make to the environment when deploying Elasticsearch?
(1) Machines with 64 GB of RAM are ideal, but 32 GB and 16 GB machines are also common. Less than 8 GB tends to be counterproductive.
(2) If you must choose between faster CPUs and more cores, choose more cores. The extra concurrency that multiple cores provide far outweighs a slightly faster clock rate.
(3) If you can afford SSDs, they far outperform any spinning media and improve both query and indexing performance. If you can afford them, SSDs are a good choice.
(4) Avoid clusters that span multiple data centers, even if the data centers are close together. Absolutely avoid clusters spanning large geographic distances.
(5) Make sure the JVM running your application is exactly the same as the server's JVM. In several places, Elasticsearch uses Java's native serialization.
(6) By setting gateway.recover_after_nodes, gateway.expected_nodes, and gateway.recover_after_time, you can avoid excessive shard swapping when the cluster restarts, which can shrink data recovery from hours to seconds.
(7) Elasticsearch is configured to use unicast discovery by default, to prevent nodes from accidentally joining a cluster. Only nodes running on the same machine cluster automatically. Prefer unicast over multicast.
(8) Do not casually modify the garbage collector (CMS) or the sizes of the various thread pools.
(9) Give half of your memory to Lucene (but no more than 32 GB!), set via the ES_HEAP_SIZE environment variable.
(10) Swapping memory to disk is fatal to server performance: an operation that took 100 microseconds can end up taking 10 milliseconds. Add up all of those delays and it is easy to see how terrible swapping is for performance.
(11) Lucene uses a large number of files, and Elasticsearch uses a large number of sockets to communicate between nodes and with HTTP clients. All of this requires plenty of file descriptors. Raise the file descriptor limit to a large value, such as 64,000.
19. What should you pay attention to regarding GC when using Elasticsearch?
(1) The inverted index's term dictionary must stay resident in memory and cannot be GC'd, so the growth trend of segment memory on data nodes needs monitoring.
(2) Set all the caches to reasonable sizes: field cache, filter cache, indexing cache, bulk queue, and so on. Reason from the worst case: when every cache is full, is there still heap available for other tasks? Avoid "self-deceiving" tricks such as clearing caches to free memory.
(3) Avoid searches and aggregations that return very large result sets. Scenarios that genuinely need to pull large amounts of data can use the scan & scroll API (see the sketch after this list).
(4) Cluster stats are held resident in memory and cannot scale horizontally. Very large clusters can be split into several clusters connected by tribe nodes.
(5) To know whether the heap is sufficient, combine the actual application scenario with continuous monitoring of the cluster's heap usage.
(6) Use the monitoring data to understand memory requirements, and configure the various circuit breakers sensibly to minimize the risk of memory overflow.
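A sketch of point (3): scroll keeps a search context alive and pages through a large result set instead of deep from/size pagination. (The separate scan search type was folded into scroll in later releases; the index name is hypothetical.)

GET /blog_index_20210602/_search?scroll=1m
{ "size": 1000, "query": { "match_all": {} } }

POST /_search/scroll
{ "scroll": "1m", "scroll_id": "<scroll_id returned by the previous call>" }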
20. How does Elasticsearch aggregate over large amounts of data (on the order of hundreds of millions of records)?
The first approximate aggregation Elasticsearch provides is the cardinality metric, which gives the cardinality of a field, i.e. the number of distinct (unique) values. It is based on the HLL (HyperLogLog) algorithm: HLL hashes the input and estimates the cardinality from the bit patterns of the hash results. Its characteristics: configurable precision that controls memory use (more accurate = more memory); very high accuracy on small datasets; and a fixed, configurable memory budget for deduplication. Whether there are thousands or billions of unique values, memory usage depends only on the precision you configure.
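A minimal cardinality aggregation, assuming a hypothetical user_id keyword field; precision_threshold trades memory for accuracy (maximum 40000):

GET /blog_index_20210602/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": { "field": "user_id", "precision_threshold": 3000 }
    }
  }
}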
21. Under concurrency, how does Elasticsearch ensure read and write consistency?
(1) Optimistic concurrency control via version numbers ensures that a newer version is never overwritten by an older one; specific conflicts are left to the application layer to resolve.
(2) For writes, the consistency level supports quorum/one/all, defaulting to quorum: a write operation is allowed only when a majority of shard copies are available. Even when a majority is available, a write to a replica can still fail for reasons such as network issues; in that case the replica is considered failed and the shard will be rebuilt on a different node.
(3) For reads, replication defaults to sync, so the operation returns only after both the primary and the replicas have completed. If replication is set to async, you can set the search request parameter _preference to primary to query the primary shard and make sure the document you read is the latest version. (A sketch of (1) follows.)
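A sketch of optimistic concurrency control on a write. In recent versions (6.7+), the check is expressed with if_seq_no/if_primary_term taken from a previous read of the document; older versions used a ?version=N parameter. The values and index name here are placeholders:

PUT /blog_index_20210602/_doc/1?if_seq_no=10&if_primary_term=1
{ "title": "updated title" }

If another writer modified the document in the meantime, this request fails with a version-conflict error instead of silently overwriting the newer copy.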
22. How to monitor the status of Elasticsearch clusters?
Marvel makes it easy to monitor Elasticsearch through Kibana. You can view your cluster's health and performance in real time, and also analyze historical cluster, index, and node metrics. (In newer releases, Marvel's functionality became X-Pack Monitoring.)
23. Introduce the overall technical framework of your e-commerce search.
24. Tell me about your personalized search solution.
Personalized search based on word2vec and Elasticsearch:
(1) Using word2vec, Elasticsearch, and a custom script plugin, we implemented a personalized search service. Compared with the original implementation, the new version's click-through rate and conversion rate improved significantly.
(2) The word2vec-based product vectors can also be used to recommend similar products.
(3) Using word2vec for personalized search or recommendation has limitations: it only handles sequential data such as the user's click history and cannot fully model user preferences, so there is still considerable room for improvement.
25. What do you know about dictionary trees (tries)?
Commonly used dictionary data structures are as follows:
The core idea of the trie is to trade space for time, using the common prefixes of strings to reduce query time and improve efficiency.
It has three basic properties:
1) The root node contains no character; every other node contains exactly one character.
2) Concatenating the characters along the path from the root to a node gives the string corresponding to that node.
3) The children of each node all contain different characters.
Implementation notes:
(1) Arrays can be used to simulate dynamic allocation; the space cost will not exceed (number of words) × (word length).
(2) Implementation options: give each node an array the size of the alphabet, or hang a linked list off each node, or use the left-child/right-sibling representation to record the tree.
(3) For a Chinese dictionary tree, store each node's children in a hash table, so that no space is wasted while lookups keep hashing's O(1) query complexity.
26. How is spelling correction implemented?
(1) Spelling correction is based on edit distance, a standard measure of the minimum number of insert, delete, and replace operations needed to transform one string into another.
(2) The edit distance is computed with a table. For example, to compute the edit distance between batyu and beauty, first build a 7 × 8 table (batyu has length 5 and beauty has length 6, plus 2 each), fill in the known first row and column, and then compute every other cell as the minimum of three values:
If the character at the top of the column equals the character at the left of the row, take the upper-left number; otherwise take the upper-left number + 1.
The number to the left + 1.
The number above + 1.
Finally, the value in the bottom-right corner is the edit distance: 3.
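The same fill rule written as the standard Levenshtein recurrence for strings $a$ and $b$ (added here for clarity):

$$D(i,0) = i, \qquad D(0,j) = j$$
$$D(i,j) = \min\{\, D(i-1,j) + 1,\;\; D(i,j-1) + 1,\;\; D(i-1,j-1) + [a_i \neq b_j] \,\}$$

where $[a_i \neq b_j]$ is 1 if the characters differ and 0 if they match. For batyu and beauty, $D(5,6) = 3$.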
Thank you for reading. That covers "the latest Elasticsearch interview questions for 2021". After studying this article, you should have a deeper understanding of these questions; actual usage still needs to be verified in practice.