Build your own search and analysis engine with ElasticSearch

2025-07-02 Update From: SLTechnology News&Howtos

Introduction: search features can be seen everywhere in Internet products. When your project is on the scale of Baidu's web search or commercial search, or WeChat official account search, it is natural to develop your own search engine and add all kinds of customized requirements and optimizations. However, for an ordinary small or medium-sized project, or even a startup team, taking an existing wheel off the shelf is the more reasonable choice.

ElasticSearch is just such a search engine wheel. More importantly, in addition to regular full-text retrieval, it also has basic statistical analysis functions (the most common being aggregation), which makes it more powerful and practical.

Are you still using the database's LIKE operator to implement full-text search in your product? Drop it and use ElasticSearch ~

ElasticSearch (hereinafter referred to as ES) is an open source search engine product based on Lucene. Lucene is a set of open source document retrieval libraries written in Java, covering terms, documents, fields, inverted indexes, segments, relevance scoring and other basics, while ES uses these libraries to build a search engine product that can be used directly. Intuitively, Lucene provides auto parts, while ES sells complete cars.

The birth of ES is also a very interesting story. As the account goes: "A few years ago Shay Banon, the author of ES, was an unemployed engineer who had come to London with his new wife. His wife wanted to train as a chef in London, and he wanted to develop an app for her to search recipes, which led him to Lucene. Building search directly on Lucene has a lot of problems, including a lot of repetitive work, so Shay kept abstracting on top of Lucene to make it easier for Java programs to embed search. After a period of polishing, his first open source work, Compass, was born. After that, Shay found a new job in a high-performance distributed development environment, where he gradually saw a growing need for an easy-to-use, high-performance, real-time, distributed search service. So he decided to rewrite Compass, turn it from a library into a standalone server, and rename it Elasticsearch."

It shows how much a programmer who loves tinkering can achieve, although it is said that the recipe search Shay Banon promised his wife has still not been released.

This article briefly introduces the principles of ES and summarizes some of WeTest's experience in using it. Because ES covers a wide range of functions and knowledge, the focus here is on the key points that are likely to be used in real projects, and on the pitfalls that are likely to be hit.

Important concepts

Cluster: ES is a distributed search engine, generally composed of multiple physical machines. By configuring the same cluster name, these machines discover each other and organize themselves into a cluster.

Node: an Elasticsearch host within a cluster.

Primary shard: a physical subset of the index (described below). Physically, one index can be divided into multiple shards that are distributed to different nodes. Each shard is implemented as a Lucene index.

Note: the number of primary shards of an ES index must be specified when the index is created and cannot be changed afterwards. So when you create an index, estimate the size of the data and choose a shard count in a reasonable range.

Replica shard: each primary shard can have one or more replicas, the number of which is configured by the user. ES tries to place the copies of the same shard on different nodes to improve fault tolerance. An index remains usable as long as, for each of its shards, the machines holding the copies of that shard are not all down. The relationship between primary shards, replicas and nodes is shown in the following figure:

Index: a logical concept, a collection of retrievable document objects, similar to a database in a relational DB. Multiple indexes can be built in the same cluster. For example, a common practice in production environments is to create a separate index for the data generated each month, so that the size of any single index stays manageable. Index -> Type -> Document: documents in ES are organized in this logical hierarchy.

Type: the level below an index, roughly equivalent to a table in a database. One index can contain multiple types. In my experience this level is rarely needed in practice: one simply creates a single type in an index, then stores and searches documents under that type.

Document: the document concept of a search engine and the basic retrievable unit in ES, equivalent to a row (a record) in a database.

Field: equivalent to a column in a database. In ES, each document is stored as JSON, and a document can be regarded as a collection of fields. For example, an article may include a topic, abstract, body text, author, time and so on; each of these is a field, and they are finally serialized together into one JSON string and written to disk.

Mapping: equivalent to the schema in a database, used to constrain the types of fields. However, an Elasticsearch mapping does not have to be specified explicitly; it can be created automatically from the document data.

Elasticsearch conveniently provides a RESTful API, so all operations can be done directly through HTTP requests. For example, the following official example adds a document to the index twitter; the type is tweet and the id of the document is 1:
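The original request is not reproduced in this copy of the article, so here is a minimal sketch of what such a call looks like. It only builds the HTTP method, path and JSON body (nothing is sent), and the field values user, post_date and message are assumed placeholders rather than data taken from the article.

```python
import json


def build_index_request(index, doc_type, doc_id, document):
    """Return (method, path, body) for indexing one document via the REST API."""
    # Path layout of the type-based API described above: /{index}/{type}/{id}
    path = "/{}/{}/{}".format(index, doc_type, doc_id)
    return "PUT", path, json.dumps(document)


# Placeholder field values (assumptions, not from the article)
method, path, body = build_index_request(
    "twitter",
    "tweet",
    1,
    {
        "user": "kimchy",
        "post_date": "2009-11-15T14:12:12",
        "message": "trying out Elasticsearch",
    },
)

print(method, path)
print(body)
```

The printed method, path and body can be sent with any HTTP client to an ES node (port 9200 by default).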

Accordingly, retrieve the document based on the user field:
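As a hedged sketch of such a retrieval, the snippet below builds both the short URI form (the q parameter) and an equivalent request body with a term query. The user value kimchy is an assumed placeholder, and no request is actually sent.

```python
import json
from urllib.parse import urlencode


def build_uri_search(index, doc_type, field, value):
    """Return the path for a URI search such as ?q=user:kimchy (URL-encoded)."""
    query_string = urlencode({"q": "{}:{}".format(field, value)})
    return "/{}/{}/_search?{}".format(index, doc_type, query_string)


def build_term_search_body(field, value):
    """Return a request body that matches documents whose field contains the term."""
    return {"query": {"term": {field: value}}}


# "kimchy" is a placeholder user name (assumption)
path = build_uri_search("twitter", "tweet", "user", "kimchy")
body = build_term_search_body("user", "kimchy")

print("GET", path)
print(json.dumps(body))
```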

Key configuration items

1. The number of shards of the index:

The number of shards is best chosen in relation to the number of nodes. In theory, for one index it is best to have no more than two shards on a single machine, so that each query runs as much in parallel as possible. However, because the shard count of an ES index is fixed once set and cannot be adjusted, if you expect the data to grow quickly you can allocate more shards at the beginning. Another common idea is to define ES indexes along the time dimension (for example one per month), because each newly created index can be given a different number of shards. There are other cases too, such as the WeTest aggregation example given below, where many shards are defined because the data needs to be split apart by channel as much as possible. Still, too many shards are usually not recommended, since managing them carries overhead for ES.
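The time-based approach can be sketched as follows: each month gets its own index, and the shard count is chosen at creation time. The index prefix posts, the naming pattern and the shard and replica counts are all assumptions for illustration; only the request body is built, nothing is sent.

```python
import json
from datetime import date


def monthly_index_name(prefix, day):
    """Name the index after the month the data belongs to, e.g. posts-2017.03."""
    return "{}-{:04d}.{:02d}".format(prefix, day.year, day.month)


def index_settings(num_shards, num_replicas):
    """Settings body for index creation; number_of_shards is fixed afterwards."""
    return {
        "settings": {
            "number_of_shards": num_shards,
            "number_of_replicas": num_replicas,
        }
    }


# Hypothetical prefix and sizes; a later month could use a different shard count
name = monthly_index_name("posts", date(2017, 3, 15))
body = index_settings(num_shards=6, num_replicas=1)

print("PUT /" + name)
print(json.dumps(body))
```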

2. Heap memory: the official recommendation is half of the available memory, set through an environment variable in the environment where ES is started, for example export ES_HEAP_SIZE=10g

3. cluster.name: the logical name of the cluster. Only machines with the same cluster name logically form a cluster. For example, five ES instances on an intranet can be organized into several ES clusters that do not interfere with each other.

4. discovery.zen.minimum_master_nodes:

This is the minimum number of master-eligible machines required for distributed decision-making in the cluster. As with common distributed coordination algorithms, to avoid split brain it is recommended to set it to more than half of the machines: n/2 + 1, where n is the number of master-eligible nodes.
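The n/2 + 1 rule uses integer division. A small sketch (the function name is made up for illustration):

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum size: strictly more than half of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for n in (1, 2, 3, 5):
    print(n, "->", minimum_master_nodes(n))
```

With 3 master-eligible nodes the quorum is 2, so the cluster survives the loss of one node; with 2 nodes the quorum is also 2, so losing either one blocks master election.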

5. discovery.zen.ping.unicast.hosts:

The list of machines in the ES cluster. Note that a single ES node does not need to be configured with the list of all machines in the cluster. As in a connected graph, as long as each machine is configured with some other machines and these configurations link up with one another, ES will eventually find all the machines and form a cluster. For example, ['111.111.111.0]

Mapping

Mapping is similar to the table structure in a database. Defining a mapping is part of creating an index. Unlike a database, an index does not need an explicitly defined mapping. For example, in the earlier example of inserting document data into the twitter index, if the index does not exist at the time of execution, ES automatically creates the index and its mapping based on the fields and contents of the document. However, the fields of an index created this way may not be what we need, so it is better to create the index with a manually defined mapping in advance. Here is an example of creating a mapping, which defines mappings for the types user and blogpost in the index my_index. Under properties are the definitions of the various fields, including strings, numeric values, dates and so on.
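The original figure is not available in this copy, so the following is a sketch of such a mapping request body in the string / not_analyzed syntax that the article describes. The names my_index, user, blogpost and user_id come from the text; the remaining field names and the date format string are assumptions.

```python
import json

# Body for: PUT /my_index
mapping_request = {
    "mappings": {
        "user": {
            "properties": {
                "name": {"type": "string"},   # analyzed full-text field
                "age": {"type": "integer"},
            }
        },
        "blogpost": {
            "properties": {
                "title": {"type": "string"},
                "body": {"type": "string"},
                # Exact-match only: the value is indexed as one term, not tokenized
                "user_id": {"type": "string", "index": "not_analyzed"},
                # Accepted input formats, separated by "||"
                "created": {
                    "type": "date",
                    "format": "strict_date_optional_time||epoch_millis",
                },
            }
        },
    }
}

print("PUT /my_index")
print(json.dumps(mapping_request, indent=2))
```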

As in the red box in the figure, there are two things to note in this example:

1. user_id is of string type, but its index parameter is set to "not_analyzed", and the meaning of this needs to be clearly understood. Full-text retrieval in a search engine works roughly as follows: the original document is tokenized and the tokens are used to build an inverted index. At query time, the user's query is tokenized as well; the resulting terms are used to fetch posting lists from the inverted index, the lists are merged and ranked by relevance, and the final results are returned. For some string fields, however, you do not want tokenization at all and only want exact matching, such as a user's name: you want only the person whose name field is exactly "Zhang San", not the "Zhang Si" and "Li San" that also match after tokenization. In that case you need to set the index parameter of the field. It takes three values: no, analyzed and not_analyzed. no does not index the field at all, analyzed tokenizes the field and indexes it for full-text search, and not_analyzed indexes the whole value as is, for exact keyword matching.

2. Date type. When creating the mapping, you can list the time formats the input may take through "format". ES then determines which format applies based on the field value of each incoming document. Intuitively, though, specifying one explicit time format should spare ES the overhead of this dynamic detection and improve performance slightly. Also note that epoch_second (timestamps in seconds) and epoch_millis (timestamps in milliseconds) should not be mixed if at all possible; if they must be mixed, the unit should be made explicit at insertion time. We once hit this pitfall: a timestamp in seconds was inserted, but ES gave preference to milliseconds, so the time shrank by a factor of 1000 and a recent time ended up somewhere in 1970.
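Both notes can be illustrated without a running cluster. The sketch below is a toy model, not how ES is implemented: a whitespace tokenizer stands in for a real analyzer, and the timestamp value is an arbitrary example.

```python
from datetime import datetime, timezone


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs, analyzed):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        # analyzed: one entry per token; not_analyzed: the whole value is one term
        terms = analyze(text) if analyzed else [text]
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


names = {1: "Zhang San", 2: "Zhang Si", 3: "Li San"}

# Analyzed field: the query is tokenized too, and posting lists are merged
analyzed_index = build_inverted_index(names, analyzed=True)
hits = set()
for term in analyze("Zhang San"):
    hits |= analyzed_index.get(term, set())
print(sorted(hits))  # [1, 2, 3]

# not_analyzed field: only the exact value matches
exact_index = build_inverted_index(names, analyzed=False)
print(sorted(exact_index.get("Zhang San", set())))  # [1]

# Seconds vs milliseconds: the same number read in the wrong unit
ts_seconds = 1500000000
as_seconds = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
as_millis = datetime.fromtimestamp(ts_seconds / 1000, tz=timezone.utc)
print(as_seconds.isoformat())  # 2017-07-14T02:40:00+00:00
print(as_millis.isoformat())   # 1970-01-18T08:40:00+00:00
```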

The following figure lists the data types, built-in fields, and parameters that can be carried by mapping operations in the current version of ES. Because of the space, I will not explain it in detail here:

What we want to describe in detail here are the items in the red boxes in the figure above: the two key built-in fields and the two mapping parameters that we actually used when creating mappings. All of them directly affect the read and write performance of the resulting index:

1) _source: ES combines all the fields into one raw JSON document and writes it to disk, so this can be regarded as the complete raw data. It is not indexed, but it can be returned when needed. Avoid disabling it if at all possible; for example, updating by script is no longer supported once it is disabled.

2) _all: a "pseudo" field used for full-text search across all fields. Think of it this way: at indexing time, all fields are concatenated into one big string, this "big" field is tokenized and an inverted index is built from it, and then the string itself is discarded and never written to disk. During full-text retrieval, if you do not specify which fields to query, such as title or body (which is very common), the posting lists are fetched from this big inverted index. As you can imagine, tag-like or numeric fields such as date and score are meaningless for full-text retrieval and can be excluded from _all, while text fields such as title and doc are included. All of this can be, and is best, specified when the mapping is created.

3) doc_values: doc_values and field_data (below) are both parameters that serve aggregation (described later), sorting and similar statistics. Both are enabled by default. Sorting and aggregation operate globally across documents, for which an inverted index is clearly unsuitable. So for not_analyzed (that is, untokenized) fields, doc_values stores a forward index of the documents in a columnar layout (compare HBase), which makes global statistics convenient. doc_values is stored on disk; if you are sure certain fields are only displayed and never used for statistics, you can disable it for them. doc_values cannot be applied to analyzed fields (which makes sense: it is hard to see how a column index would be built over them); those use field_data, described next.

4) field_data (the ES mapping parameter itself is spelled fielddata): analyzed text fields, such as a body text, can also have statistical needs. For example, ES supports aggregating documents by keywords, although the more common approach is to compute this offline, with Hadoop or a stand-alone analysis tool, and then push the result to the online index; computing it directly in ES feels odd. This does not really suit a search engine, but if you do it anyway, ES dynamically loads the data into an in-memory structure called field data. As you can imagine, this is very memory-hungry and can easily eat up the JVM heap! By default ES enables the feature but does not load anything; the data is loaded into memory lazily, only when you actually sort or aggregate on an analyzed field. So try not to open this Pandora's box at query time, or simply turn the option off.
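How these four settings might appear together in one mapping is sketched below, using the ES 2.x-era parameter names as I understand them (include_in_all, doc_values, and fielddata with format disabled). The index, type and field names are hypothetical, and the exact parameter spelling should be checked against the documentation of the ES version in use.

```python
import json

# Body for: PUT /articles  (hypothetical index and type names)
mapping_request = {
    "mappings": {
        "article": {
            "_source": {"enabled": True},  # keep the raw JSON; needed for scripted updates
            "_all": {"enabled": True},     # catch-all field for queries without a field
            "properties": {
                # Text fields: included in _all by default
                "title": {"type": "string"},
                # Analyzed text we never sort or aggregate on: keep fielddata off the heap
                "body": {"type": "string", "fielddata": {"format": "disabled"}},
                # Value-like fields: pointless in full-text search, so excluded from _all
                "publish_date": {"type": "date", "include_in_all": False},
                "score": {"type": "integer", "include_in_all": False},
                # Display-only exact-value field: no doc_values needed
                "cover_url": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": False,
                    "include_in_all": False,
                },
            },
        }
    }
}

print(json.dumps(mapping_request, indent=2))
```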

Aggregation

Who says search engines can only be used to search? ES can not only search, but also compute statistics directly on the set of search results. At present, the stable, non-experimental aggregations in ES fall into two types: Metrics Aggregation and Bucket Aggregation.

Metrics aggregation mainly refers to conventional mathematical statistics over a set of documents, such as this example from the official guide: find all the red cars sold and then calculate their average price:
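The original request is missing from this copy; a minimal sketch of what it could look like follows. The index and type names (cars, transactions), the field names color and price, and the aggregation name avg_price are assumptions modeled on the official guide's car sales sample.

```python
import json

# Body for: GET /cars/transactions/_search  (assumed index and type names)
avg_price_request = {
    "size": 0,  # return only the aggregation result, no individual hits
    "query": {"match": {"color": "red"}},                 # restrict to red cars
    "aggs": {"avg_price": {"avg": {"field": "price"}}},   # metric over the matching set
}

print(json.dumps(avg_price_request, indent=2))
```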

The result is something like this:

Magical, isn't it? Metrics aggregations also include others, such as maximum, minimum, sum, count, geo-coordinate operations and so on. However, the example we want to give today is mainly about bucket aggregation. Bucket aggregation divides documents into groups according to a given field, then aggregates further within each group and returns bucket-level results. Intuitive examples include histograms and statistics by time interval. The following example is a terms aggregation, one kind of bucket aggregation: documents are placed into buckets by exact match on the color field, and inside each bucket an average-price aggregation and a further bucket aggregation by manufacturer are nested.
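A sketch of such a nested request body is shown below. The aggregation names (colors, avg_price, make) and the field names are assumptions; the structure, a terms bucket with sub-aggregations under its own aggs key, is the point.

```python
import json

# Body for a _search request; no hits are needed, only the buckets
nested_agg_request = {
    "size": 0,
    "aggs": {
        "colors": {
            "terms": {"field": "color"},  # one bucket per distinct color
            "aggs": {
                # Metric computed inside each color bucket
                "avg_price": {"avg": {"field": "price"}},
                # Further bucketing by manufacturer inside each color bucket
                "make": {"terms": {"field": "make"}},
            },
        }
    },
}

print(json.dumps(nested_agg_request, indent=2))
```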

The statistical results are similar to the following. There are four red cars, the average price is 32500, and they include three Hondas and one BMW:
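To make the bucket logic concrete, the same computation can be reproduced locally in plain Python. The eight sample transactions below are assumed data, chosen so that the red bucket matches the numbers quoted above; the output shape only loosely mirrors an ES response.

```python
from collections import Counter, defaultdict

# Assumed sample data (not taken from the article)
transactions = [
    {"price": 10000, "color": "red", "make": "honda"},
    {"price": 20000, "color": "red", "make": "honda"},
    {"price": 30000, "color": "green", "make": "ford"},
    {"price": 15000, "color": "blue", "make": "toyota"},
    {"price": 12000, "color": "green", "make": "toyota"},
    {"price": 20000, "color": "red", "make": "honda"},
    {"price": 80000, "color": "red", "make": "bmw"},
    {"price": 25000, "color": "blue", "make": "ford"},
]


def bucket_by_color(docs):
    """Group by color, then compute an average price and per-make counts per bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["color"]].append(doc)
    buckets = {}
    for color, members in groups.items():
        buckets[color] = {
            "doc_count": len(members),
            "avg_price": sum(d["price"] for d in members) / len(members),
            "make": dict(Counter(d["make"] for d in members)),
        }
    return buckets


buckets = bucket_by_color(transactions)
print(buckets["red"])
# {'doc_count': 4, 'avg_price': 32500.0, 'make': {'honda': 3, 'bmw': 1}}
```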

Here is a simple example. WeTest's public opinion monitoring has a forum hot posts feature: real-time statistics of the top N hottest posts within a data source (such as Baidu Tieba) and a forum (such as Arena of Valor) over a period of time (such as 3 months).
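The article's own implementation is not shown in this copy. Purely as an illustration, one way such a top-N query could be expressed is a filtered terms aggregation. Everything here is hypothetical: the field names source, forum, post_time and thread_id, the placeholder values, and the assumption that "hot" means the threads with the most matching documents.

```python
import json


def hot_posts_query(source, forum, months, top_n):
    """Build a search body that counts documents per thread within a time window."""
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    # Exact-match filters; these fields would need to be not_analyzed
                    {"term": {"source": source}},
                    {"term": {"forum": forum}},
                    # Date math: "now-3M" means three months before the current time
                    {"range": {"post_time": {"gte": "now-{}M".format(months)}}},
                ]
            }
        },
        "aggs": {
            # Buckets are returned in descending doc_count order, limited to top_n
            "hot_posts": {"terms": {"field": "thread_id", "size": top_n}}
        },
    }


# Placeholder values (assumptions)
body = hot_posts_query("tieba", "arena_of_valor", months=3, top_n=10)
print(json.dumps(body, indent=2))
```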