
Summary of ElasticSearch Practice in the Pre-loan System


The pre-loan system handles all business processes from purchase to loan disbursement, which involves comprehensive queries over large volumes of data with varied and complex conditions. ElasticSearch was introduced mainly to improve query efficiency; we also hoped to quickly build a simple data warehouse on top of it and provide some OLAP-related functions. This article summarizes our practical experience with ElasticSearch in the pre-loan system.

I. Index

Description: a data structure designed to quickly locate data.

An index is like the table of contents at the front of a book: it speeds up database queries. Understanding how indexes are built and used helps greatly in understanding how ES works.

Commonly used indexes:

Bitmap index

Hash indexing

BTREE index

Inverted index

1.1 Bitmap Index (BitMap)

Bitmap indexes apply when a field takes a limited, enumerable set of values.

A bitmap index uses a binary string (bitmap) to mark the existence of data: 1 means there is data at the current position (sequence number), and 0 means there is none.

Figure 1 below is the user table, which stores the fields of gender and marital status.

In figure 2, two bitmap indexes are established for gender and marital status, respectively.

For example, the index for gender -> male is 101110011, indicating that the 1st, 3rd, 4th, 5th, 8th and 9th users are male; the other attribute values work the same way.

Query using a bitmap index:

Records that are male and married: 101110011 & 110100010 = 100100010, that is, the 1st, 4th and 8th users are married men.
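Both queries can be sketched in a few lines of Java (a minimal illustration using java.util.BitSet; the bit patterns are the ones from the figures, and the OR query is the one described just below):

import java.util.BitSet;

public class BitmapIndexDemo {

    // Parse a bitmap string such as "101110011" into a BitSet
    // (bit i is set <=> user i+1 has the attribute value).
    static BitSet bits(String s) {
        BitSet b = new BitSet(s.length());
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '1') b.set(i);
        }
        return b;
    }

    public static void main(String[] args) {
        BitSet male      = bits("101110011");
        BitSet married   = bits("110100010");
        BitSet female    = bits("010001100");
        BitSet unmarried = bits("001010100");

        BitSet marriedMen = (BitSet) male.clone();
        marriedMen.and(married);             // 100100010 -> users 1, 4, 8

        BitSet femaleOrUnmarried = (BitSet) female.clone();
        femaleOrUnmarried.or(unmarried);     // 011011100 -> users 2, 3, 5, 6, 7

        System.out.println("married men: " + marriedMen);                // {0, 3, 7} (0-based)
        System.out.println("female or unmarried: " + femaleOrUnmarried); // {1, 2, 4, 5, 6}
    }
}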

Records that are female or unmarried: 010001100 | 001010100 = 011011100, that is, the 2nd, 3rd, 5th, 6th and 7th users are female or unmarried.

1.2 Hash Index

As the name implies, a hash index is an index structure that uses some hash function to implement a key -> value mapping.

Hash indexes are suitable for equality lookups: a single hash computation locates the data.

Figure 3 below shows the structure of a hash index, similar to the HashMap implementation in Java, where hash conflicts are resolved with conflict tables.
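A minimal Java sketch of an equality lookup through a hash index (field values invented; java.util.HashMap resolves hash conflicts internally with chained entries, playing the role of the conflict table in figure 3):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy hash index: key -> list of row positions holding that value.
public class HashIndexDemo {
    private final Map<String, List<Integer>> index = new HashMap<>();

    public void put(String key, int rowId) {
        index.computeIfAbsent(key, k -> new ArrayList<>()).add(rowId);
    }

    // Equality lookup: one hash computation locates the matching rows.
    public List<Integer> lookup(String key) {
        return index.getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        HashIndexDemo idx = new HashIndexDemo();
        idx.put("male", 1);
        idx.put("male", 3);
        idx.put("female", 2);
        System.out.println(idx.lookup("male")); // [1, 3]
    }
}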

1.3 BTREE Index

The BTREE index is the most commonly used index structure in relational databases and facilitates data queries.

BTREE: an ordered, balanced N-ary tree; each node holds up to N keys and N+1 pointers, each pointing to a child node.

The simple structure of a BTREE is shown in figure 4: a 2-level, 3-way tree holding 7 data items.

Take InnoDB, the most commonly used storage engine in MySQL, as an example of how BTREE indexes are applied.

InnoDB tables are stored as index-organized tables; that is, the entire data table is stored as a B+tree structure, as shown in figure 5.

The primary key index is the left half of figure 5 (if no auto-increment primary key is explicitly defined, a non-null unique index is used as the clustered index; if there is no unique index either, InnoDB internally generates a 6-byte hidden primary key as the clustered index). The leaf nodes store the complete rows (stored as primary key + row data).

The secondary index is also stored as a B+tree, shown in the right half of figure 5. Unlike the primary key index, the leaf nodes of a secondary index store not row data but the index key value and the corresponding primary key value; it follows that a secondary-index query needs one extra step to look up the row by its primary key.

Maintaining an ordered, balanced N-ary tree is costly when inserted nodes arrive in random order, because node positions must be adjusted throughout the tree; with ordered inserts, adjustments happen only in a local part of the tree, so the impact is small and the efficiency high.

You can refer to the node insertion algorithm of the red-black tree:

https://en.wikipedia.org/wiki/Red%E2%80%93black_tree

Therefore, if an InnoDB table has an auto-increment primary key, data writes are ordered and efficient; without one, inserting random primary key values causes a large number of B+tree adjustments and is inefficient. This is why InnoDB tables are recommended to have an auto-increment primary key with no business meaning: it greatly improves insertion efficiency.
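The note below mentions Snowflake-style IDs; as a minimal, hedged sketch (the bit layout follows the common Snowflake convention and is not necessarily the exact algorithm in use), such a generator produces trend-increasing IDs, which keeps B+tree inserts near-ordered:

// Minimal Snowflake-style ID generator: the high bits are a millisecond
// timestamp, so IDs increase over time and insert near-ordered into a B+tree.
public class SnowflakeIdDemo {
    private static final long EPOCH = 1514764800000L; // custom epoch (2018-01-01), arbitrary
    private final long workerId;    // 10 bits: which machine generated the ID
    private long lastTimestamp = -1L;
    private long sequence = 0L;     // 12 bits: counter within one millisecond

    public SnowflakeIdDemo(long workerId) {
        this.workerId = workerId & 0x3FF;
    }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis();
        if (ts == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;  // same millisecond: bump the counter
            if (sequence == 0) {                // counter overflowed: wait for the next millisecond
                while ((ts = System.currentTimeMillis()) <= lastTimestamp) { /* spin */ }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = ts;
        return ((ts - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}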

Note:

MySQL InnoDB inserts efficiently with auto-increment primary keys.

Using an ID generation algorithm similar to Snowflake, the generated IDs are trend-increasing and insertion is also relatively efficient.

1.4 Inverted Index (Reverse Index)

Inverted indexes, also known as reverse indexes, are best understood in contrast with forward indexes.

A forward index maps a document to the keywords it contains: given a document identifier, you can obtain the keywords in that document along with their frequencies and positions, as shown in figure 6 (documents on the left, index on the right).

A reverse index maps a keyword to the documents containing it: given a keyword identifier, you can obtain the list of all documents in which the keyword appears, including frequency, position and other information, as shown in figure 7.

The words and documents in a reverse (inverted) index form the "word-document matrix" shown in figure 8; a ticked cell indicates a mapping between that word and that document.

The storage structure of an inverted index is shown in figure 9. The dictionary, held in memory, is the list of all words parsed out of the whole document collection; each word points to its corresponding inverted list, and the inverted lists together form the inverted file, stored on disk, which records each word's per-document information, i.e. the frequency and positions mentioned earlier.

Here is a concrete example of how to generate an inverted index from a collection of documents.

As shown in figure 10, there are five documents; the first column is the document number and the second the document's text content.

Segment the document collection above into words; the 10 words found are: [Google, Map, father, job-hopping, Facebook, join, founder, Lars, leave, and]. Take the first word "Google" as an example: first give it a unique word ID, with value 1; its document frequency is 5, i.e. it appears in all 5 documents; it appears twice in the third document and once in each of the others. This yields the inverted index shown in figure 11.
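The construction can be sketched in Java (a minimal illustration assuming whitespace tokenization; a real analyzer also does segmentation, normalization and so on):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexDemo {

    static class Posting {
        final int docId;
        int freq; // term frequency within the document
        Posting(int docId) { this.docId = docId; }
        @Override public String toString() { return "(doc=" + docId + ", tf=" + freq + ")"; }
    }

    // Dictionary: term -> inverted list (one posting per document containing the term).
    private final Map<String, List<Posting>> dictionary = new LinkedHashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Posting> postings = dictionary.computeIfAbsent(term, t -> new ArrayList<>());
            Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
            if (last == null || last.docId != docId) {
                last = new Posting(docId);
                postings.add(last);
            }
            last.freq++;
        }
    }

    public List<Posting> lookup(String term) {
        return dictionary.get(term);
    }

    public static void main(String[] args) {
        InvertedIndexDemo idx = new InvertedIndexDemo();
        idx.add(1, "Google Map father job-hopping Facebook");
        idx.add(2, "Google Map father join Facebook");
        idx.add(3, "Google Map founder Lars leave Google join Facebook");
        System.out.println(idx.lookup("google")); // [(doc=1, tf=1), (doc=2, tf=1), (doc=3, tf=2)]
    }
}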

1.4.1 Query optimization of the word dictionary

For a large document collection, the dictionary may contain hundreds of thousands or even millions of distinct words, and how quickly a word can be located directly affects search response time. The solution is to index the word dictionary itself. Several schemes are available for reference:

Dictionary Hash index

A hash index is simple and direct: to query a word, compute its hash; if the hash table hits, the data exists, otherwise return empty immediately. It suits exact-match, equality queries. As shown in figure 12, words with the same hash value are placed in a conflict table.

Dictionary BTREE index

Similar to InnoDB's secondary index: the words are sorted by some rule to build a BTREE index, and the data nodes are pointers to the inverted lists.

Binary search

Similarly, sort the words by some rule into an ordered array and use binary search to find them; binary search over the array maps to an ordered balanced binary tree, like the structure in figure 14.
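A minimal sketch of this scheme, assuming the sorted term array is paired position-for-position with its inverted lists (the posting contents here are invented):

import java.util.Arrays;

public class TermDictionaryDemo {
    // Sorted term dictionary; TERMS[i] corresponds to POSTINGS[i].
    private static final String[] TERMS    = {"cat", "deep", "do", "dog", "dogs"};
    private static final String[] POSTINGS = {"docs(1,4)", "docs(2)", "docs(3)", "docs(1,5)", "docs(5)"};

    static String lookup(String term) {
        int i = Arrays.binarySearch(TERMS, term); // O(log n) over the sorted dictionary
        return i >= 0 ? POSTINGS[i] : null;
    }

    public static void main(String[] args) {
        System.out.println(lookup("dog")); // docs(1,5)
    }
}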

FST (Finite State Transducers) implementation

FST is a finite state transducer. It has two advantages: 1) small space consumption: by reusing word prefixes and suffixes in the dictionary, it compresses storage space; 2) fast queries: O(len(str)) query time complexity.

Take inserting "cat", "deep", "do", "dog" and "dogs" as an example of building an FST (note: the input must be sorted).

As shown in figure 15, we end up with a directed acyclic graph. This structure makes queries convenient: given the word "dog", we can easily check whether it exists, and we can even associate each word with a number or another word during construction, implementing a key-value mapping.

Of course, there are other optimization methods, such as Skip List, Trie, Double Array Trie and similar structures; I will not cover them one by one.

II. Experience of Using ElasticSearch

The following sections summarize our experience with ES, combined with concrete use cases from the pre-loan system.

2.1 Overview

Currently used version of ES: 5.6

Official website address: https://www.elastic.co/products/elasticsearch

ES one sentence introduction: The Heart of the Elastic Stack (excerpt from the official website)

Some key facts about ES:

First released in February 2010

Elasticsearch Store, Search, and Analyze

Rich RESTful interfaces

2.2 Basic Concepts

Index (index)

The ES index (Index) is not the same concept as the indexes discussed earlier; it refers to the collection of all documents and can be compared to a database in an RDB.

Document (document)

That is, a record written to ES, usually in the form of JSON.

Mapping (Mapping)

The metadata description of the document data structure, usually in the form of JSON schema, can be dynamically generated or predefined in advance.

Type (type)

Because it was widely misunderstood and misused, type is no longer recommended. In the ES version we use, an index currently has only one default type.

Node

A service instance of ES is called a service node. To ensure data safety and reliability and to improve query performance, ES is generally deployed in cluster mode.

Cluster

Multiple ES nodes communicate with each other and share the storage and query of data, thus forming a cluster.

Shard

Sharding mainly solves the storage of large volumes of data: the data is split into several parts, and shards are generally distributed evenly across the ES nodes. Note that the number of shards cannot be modified once set.

Replica

A complete copy of a shard's data. A shard usually has one replica; replicas can also serve queries, improving query performance in a cluster environment.

2.3 installation and deployment

JDK version: JDK1.8

The installation process is relatively simple. Please refer to the official website: download the installation package -> decompress -> run.

Pitfalls encountered during installation:

ES consumes a lot of system resources at startup, so system parameters such as file handles, threads and memory need to be adjusted. Refer to the following documentation:

http://www.cnblogs.com/sloveling/p/elasticsearch.html

2.4 Example explanation

Here are some specific operations that illustrate the use of ES:

2.4.1 Initialize the index

Initializing an index mainly means creating a new index in ES and initializing some parameters, including the index name, document mapping (Mapping), index alias, number of shards (default: 5) and number of replicas (default: 1); with small data volumes, the default shard and replica counts can be used as-is without configuration.

Here are two ways to initialize an index: one uses Dynamic Mapping based on a Dynamic Template; the other explicitly predefines the mapping.

1) dynamic template (Dynamic Template)

curl -X PUT http://ip:9200/loan_idx -H 'content-type: application/json' -d '
{
  "mappings": {
    "order_info": {
      "dynamic_date_formats": ["yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"],
      "dynamic_templates": [
        {
          "orderId2": {
            "match_mapping_type": "string",
            "match_pattern": "regex",
            "match": "^orderId$",
            "mapping": {
              "type": "long"
            }
          }
        },
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "norms": false
            }
          }
        }
      ]
    }
  },
  "aliases": {
    "loan_alias": {}
  }
}'

The JSON above is the dynamic template we use. It defines the date formats (the dynamic_date_formats field); the rule orderId2: whenever the orderId field is encountered, convert it to the long type; and the rule strings_as_keywords: map all string-typed fields to the keyword type with the norms attribute set to false. The keyword type and the norms keyword are described in the data types section below.
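For illustration (orderStatus and createTime are invented field names), indexing a document like the following under this template would map orderId to long, orderStatus to keyword and createTime to date:

curl -X POST http://ip:9200/loan_idx/order_info -H 'content-type: application/json' -d '
{
  "orderId": "20180910001",
  "orderStatus": "APPROVED",
  "createTime": "2018-09-10 09:00:00"
}'

The resulting types can be checked with GET http://ip:9200/loan_idx/_mapping.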

2) predefined mapping

The difference between a predefined mapping and the above is that all known field type descriptions are written into the mapping in advance. The figure below shows part of it as an example:

The upper part of the JSON structure in figure 16 is the same as in the dynamic template; the contents in the red box are the predefined attributes: apply.applyInfo.appSubmissionTime, apply.applyInfo.applyId, apply.applyInfo.applyInputSource and other fields, where type indicates the field's type. Once the mapping is defined, inserted data must conform to the field definitions, otherwise ES returns an exception.
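A hedged reconstruction of what one fragment of that predefined mapping might look like (the types are assumed from the field names; the authoritative definitions are the ones in figure 16):

"apply": {
  "properties": {
    "applyInfo": {
      "properties": {
        "appSubmissionTime": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
        "applyId":           { "type": "keyword" },
        "applyInputSource":  { "type": "keyword" }
      }
    }
  }
}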

2.4.2 Common data types

The commonly used data types are text, keyword, date, long, double, boolean and ip.

In practice, we define string fields as keyword rather than text. The main reason is that text-typed data is analyzed as full text, undergoing word segmentation, filtering and other operations, whereas keyword-typed data is stored as one complete token, which eliminates unnecessary work and improves indexing performance.

Used together with keyword is the norms keyword, which means the current field does not participate in scoring. Scoring assigns each query result a score based on the TF/IDF of its words or some other rule, which can be used to sort search results for display. Typical business scenarios need no such sorting (they have explicit sort fields), so disabling it further improves query efficiency.

2.4.3 Index name cannot be modified

When initializing an index, you need to specify the index name in the URL. Once specified, it cannot be modified, so in general you should specify a default alias when creating an index:

"aliases": {
  "loan_alias": {}
}

Aliases and index names are many-to-many: an index can have multiple aliases, and an alias can map to multiple indexes. In one-to-one mode, an alias can be used everywhere an index name is used; the advantage of aliases is that they can be switched at any time, which is very flexible.

2.4.4 Fields that already exist in Mapping cannot be updated

Once a field has been initialized (dynamically mapped by inserting data, or predefined by setting the field type), its type is fixed; inserting incompatible data will produce an error. For example, if a field is defined as long and non-numeric data is written to it, ES returns a data type error.

In this case you may need to rebuild the index, and the alias mentioned above comes in handy. This is generally done in three steps (a sketch follows below):

1) Create a new index, specifying the correct format for the malformed field; 2) use ES's Reindex API to migrate data from the old index to the new one; 3) use the Aliases API to add the old index's alias to the new index and remove the association between the alias and the old index.

The above steps are suitable for offline migration; real-time, zero-downtime migration is somewhat more complicated.
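A hedged sketch of the three steps (loan_idx_v2 and the corrected orderId field are examples, not actual production names):

# 1) create a new index with the corrected mapping
curl -X PUT http://ip:9200/loan_idx_v2 -H 'content-type: application/json' -d '
{"mappings": {"order_info": {"properties": {"orderId": {"type": "long"}}}}}'

# 2) migrate the data from the old index into the new one
curl -X POST http://ip:9200/_reindex -H 'content-type: application/json' -d '
{"source": {"index": "loan_idx"}, "dest": {"index": "loan_idx_v2"}}'

# 3) atomically move the alias from the old index to the new one
curl -X POST http://ip:9200/_aliases -H 'content-type: application/json' -d '
{"actions": [
  {"remove": {"index": "loan_idx",    "alias": "loan_alias"}},
  {"add":    {"index": "loan_idx_v2", "alias": "loan_alias"}}
]}'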

2.4.5 API

The basic operation is to add, delete, modify and query. You can refer to the official documentation of ES:

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html

Some more complex operations require ES scripts; we generally use the Groovy-like painless script, which supports some common Java APIs (our ES installation runs on JDK8, so some JDK8 APIs are available) and also supports Joda time, etc.

Here is a more complex update example to illustrate how painless scripts are used:

Requirement description

appSubmissionTime is the purchase time, lenssonStartDate the course start time, and expectLoanDate the loan time. The requirement: for purchases made on September 10, 2018, if the difference between the purchase time and the course start time is less than 2 days, set the loan time to the purchase time.

The painless script is as follows:

POST loan_idx/_update_by_query
{
  "script": {
    "source": "long getDayDiff(def dateStr1, def dateStr2){ LocalDateTime date1 = toLocalDate(dateStr1); LocalDateTime date2 = toLocalDate(dateStr2); ChronoUnit.DAYS.between(date1, date2); } LocalDateTime toLocalDate(def dateStr){ DateTimeFormatter formatter = DateTimeFormatter.ofPattern(\"yyyy-MM-dd HH:mm:ss\"); LocalDateTime.parse(dateStr, formatter); } if (getDayDiff(ctx._source.appSubmissionTime, ctx._source.lenssonStartDate) < 2) { ctx._source.expectLoanDate = ctx._source.appSubmissionTime }",
    "lang": "painless"
  },
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "must": [
              {
                "range": {
                  "appSubmissionTime": {
                    "from": "2018-09-10 00:00:00",
                    "to": "2018-09-10 23:59:59",
                    "include_lower": true,
                    "include_upper": true
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Explanation: the request body has two parts. The lower part, the query keyword, is a range query on time (September 10, 2018). The upper part, script, is the operation performed on each matched record, a piece of Groovy-like code (easy to read with a Java background). Formatted, it looks like the following; it defines two methods, getDayDiff() and toLocalDate(), and the if statement carries out the actual modification:

long getDayDiff(def dateStr1, def dateStr2){
    LocalDateTime date1 = toLocalDate(dateStr1);
    LocalDateTime date2 = toLocalDate(dateStr2);
    ChronoUnit.DAYS.between(date1, date2);
}

LocalDateTime toLocalDate(def dateStr){
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    LocalDateTime.parse(dateStr, formatter);
}

if (getDayDiff(ctx._source.appSubmissionTime, ctx._source.lenssonStartDate) < 2) {
    ctx._source.expectLoanDate = ctx._source.appSubmissionTime;
}

Then submit the POST request to complete the data modification.

2.4.6 Querying data

Here we particularly recommend an ES plugin, ES-SQL:

https://github.com/NLPchina/elasticsearch-sql/wiki/Basic-Queries-And-Conditions

This plugin provides a fairly rich SQL syntax that lets us query data using familiar SQL statements. A few points to note:

ES-SQL sends requests via HTTP GET, so the length of the SQL is limited (4 KB); it can be changed with the parameter http.max_initial_line_length: "8k".

For numeric operations such as sums and averages, if the field is set to a non-numeric type, using ES-SQL directly will report an error; a painless script can be used instead.

Results returned by the Select ... as syntax have a different positional structure from ordinary query results and must be handled separately.

NRT (Near Real Time): near real time

Insert a record into ES and then query it, and you will usually get the latest record back; ES feels like a real-time search engine, which is what we expect. In practice, however, this is not always the case, which has to do with ES's write mechanism. A brief introduction:

Lucene index segment -> ES index

Data written to ES goes into a Lucene index segment first and is only then written into the ES index; until it reaches the ES index, queries see only the old data.

Commit: atomic write operation

The data in an index segment is written into the ES index as one atomic write, so a record submitted to ES is guaranteed to be written completely or not at all; there is no need to worry that only part of it is written while the rest fails.

Refresh: refresh operation to ensure that the latest submission is searched

After an index segment is committed, there is one final step: refresh, which guarantees that newly indexed data becomes searchable.

For performance reasons, Lucene delays the costly refresh rather than refreshing for every newly added document; by default it refreshes once per second. This is already very frequent, yet some applications need faster refreshes still; if you run into this, either turn to other technology or examine whether the requirement is reasonable.

However, ES provides us with a convenient real-time query API. The data queried using this API is always up-to-date. The calling method is described as follows:

GET http://IP:PORT/index_name/type_name/id

The API above queries by the data's primary key (id) via HTTP GET. This kind of query finds and merges data from both the ES index and the Lucene index segments, so the result is always the latest. There is a side effect, however: every such operation forces ES to refresh, causing an IO; if used frequently, it can also hurt ES performance.

2.4.7 Array processing

Array handling is quite special, so it deserves a separate discussion.

1) the representation is in the normal JSON array format, such as:

[1, 2, 3], ["a", "b"], [{"first": "John", "last": "Smith"}, {"first": "Alice", "last": "White"}]

2) Note that ES has no array type; arrays are ultimately converted to types such as object or keyword.

3) The query problem with ordinary array objects.

Ordinary array objects are stored flattened, with the fields stored separately. For example:

{"user": [

{"first": "John", "last": "Smith"

}

{"first": "Alice", "last": "White"

}

]

}

Will be converted to the following text

{"user.first": ["John", "Alice"

], "user.last": ["Smith", "White"

]

}

This breaks the association inside the original document. Figure 17 shows the brief journey of this data from indexing to query:

Assemble the data, a text with a JSONArray structure.

On writing to ES, the default type is set to object.

Query for documents whose user.first is Alice and whose user.last is Smith (no single user satisfies both conditions).

The returned results do not match expectations.

4) Nested array object query

Nested array objects solve the query mismatch above. ES's solution is to create a separate document for each object in the array, independent of the original document. As figure 18 shows, after the data is declared nested, the same query returns empty, because there truly is no document with user.first = Alice and user.last = Smith (see the sketch after this list).

5) Modifications to an array are generally full replacements; if you need to modify a single field separately, you need a painless script. See https://www.elastic.co/guide/en/elasticsearch/reference/5.6/docs-update.html.
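Returning to point 4), here is a hedged sketch of declaring the array field as nested and of the corresponding query (index and type names are invented; the field names follow the example above):

curl -X PUT http://ip:9200/user_idx -H 'content-type: application/json' -d '
{"mappings": {"user_info": {"properties": {"user": {"type": "nested"}}}}}'

curl -X POST http://ip:9200/user_idx/user_info/_search -H 'content-type: application/json' -d '
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            {"match": {"user.first": "Alice"}},
            {"match": {"user.last":  "Smith"}}
          ]
        }
      }
    }
  }
}'

With the nested mapping, this query matches only documents in which a single user object has first = Alice and last = Smith.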

2.5 Security

Data security is a vital concern; we provide access security control mainly through the following three measures:

XPACK

XPACK provides the Security plugin, which implements access control based on username and password. It offers a one-month free trial, after which a paid license is required.

IP whitelist

A firewall is enabled on the ES servers so that only designated servers on the private network can connect directly to the service.

Proxy

Generally, business systems are not allowed to connect to and query the ES service directly, so the ES interfaces need to be wrapped; this job falls to the proxy. The proxy server can also perform security authentication, so even where XPACK is not applicable, access control can still be achieved.

2.6 Network

The ElasticSearch server needs to open ports 9200 and 9300 by default.

The following describes a network-related error; if you encounter something similar, it may serve as a reference.

Before introducing an exception, let's introduce a network-related keyword, keepalive:

HTTP keep-alive and TCP keepalive.

"Connection: Keep-Alive" is enabled by default in HTTP1.1, indicating that the HTTP connection can be reused, and the current connection can be directly used in the next HTTP request to improve performance. Keep-alive is generally used in HTTP connection pooling implementations.

Keepalive plays a different role in TCP than in HTTP. TCP keepalive is mainly for keeping the connection alive; the relevant configuration is chiefly net.ipv4.tcp_keepalive_time, which means: if a TCP connection has exchanged no data for this long (default: 2 hours), send a heartbeat to check whether the link is still valid. Normally the peer's ack packet comes back, indicating the connection is usable.

Here is the specific exception information, which is described as follows:

Two business servers connect through restClient (based on HTTPClient, with persistent connections) to the ES cluster (three machines); the business servers and the ES servers are deployed in different network segments. An exception occurred at a fixed time:

Connection reset by peer appeared at around 9 o'clock every morning, three times in a row:

Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)

To solve this problem, we tried many approaches: checking the official documents, comparing code, capturing packets. After several days of hard work, we finally found that the exception was related to the keepalive keyword introduced above (thanks to our colleagues on the operations team for their help).

In the actual production environment there is a firewall between the business servers and the ES cluster, and the firewall policy defines an idle-connection timeout of, say, 1 hour, which is inconsistent with the 2-hour default on the Linux servers mentioned above. Because our system receives little traffic at night, some connections sat unused for more than 2 hours. After 1 hour, the firewall silently dropped those connections; at the 2-hour mark, the server tried to send a keepalive heartbeat, which the firewall blocked, and after several failed attempts the server sent RST to tear down the link, while the client still knew nothing. The next morning, when the client reused the stale connection, the peer returned RST directly and the client reported Connection reset by peer. All three servers in the cluster hit the same problem, hence three identical exceptions in a row. The fix is simple: change the server's keepalive timeout configuration to less than the firewall's 1 hour.
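For example (the values are illustrative, assuming the 1-hour firewall policy described above), on the server side:

# /etc/sysctl.conf -- probe idle connections well before the firewall's 1-hour cutoff
net.ipv4.tcp_keepalive_time = 1800   # first keepalive probe after 30 minutes idle
net.ipv4.tcp_keepalive_intvl = 30    # retry every 30 seconds
net.ipv4.tcp_keepalive_probes = 3    # drop the connection after 3 failed probes

Apply the change with sysctl -p.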

Author: Lei Peng (Comprehensive Credit team)

Source: Yixin Institute of Technology
