Construction background and current situation
Background
In the past, Huatai Securities' log situation had four characteristics: system operation and maintenance was tedious, logs could not be kept for a long time, the value of log data was not mined, and big data covered only a corner of the field. Tedious operation and maintenance means that there were many independently built systems whose services generated large volumes of logs, and operations staff had to log in to the servers to view them; when a complex problem came up, they had to log in to the servers one by one and search the relevant log files to locate it. Logs not being kept for a long time means that, according to the relevant laws and regulations of the securities industry, logs must be retained for a certain number of years and remain accessible at any time. The value of log data not being mined means that the logs contain very valuable information, but there was no in-depth mining of it. Big data covering only a corner of the field means that Hadoop mainly focused on the storage and analysis of structured data rather than unstructured data.
Current situation
Huatai Securities began building Elasticsearch clusters in June 2016. Because the data centers are connected through direct-connect lines that become saturated when the data volume is too large, a single cluster cannot span the data centers; having multiple data centers therefore means multiple clusters, and each has to be built separately.
The built Elasticsearch capacity is 120 TB, while the data generated each day occupies 600 GB or even 800 GB of space. Because the Elasticsearch capacity is relatively small, logs are stored for a very short time: some can only be kept for two months or even a few weeks, and only very important information is preserved for three months. The plan is to expand the Elasticsearch capacity in the future, with log retention expected to reach more than one year or even two years.
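As a rough back-of-the-envelope illustration of why retention is so short (the replica count is an assumption, not a figure from the article):

capacity_tb = 120          # stated total Elasticsearch capacity
daily_gb = 800             # upper end of the stated daily log volume
replicas = 1               # assumed one replica per shard; the article does not say

effective_daily_tb = daily_gb * (1 + replicas) / 1024
print(f"~{capacity_tb / effective_daily_tb:.0f} days of retention")  # roughly 77 days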
The applications supported by Huatai Securities include Shengle, the financial management platform, monitoring data analysis, log analysis, and so on. For example, on the financial management platform, Elasticsearch can discover user behavior and give users suggestions. Looking to the future, China's securities market has undergone fundamental changes and now faces a historic opportunity for development. Facing this new situation, Huatai Securities will continue to adhere to its development strategy of standardization, collectivization, and internationalization, further speed up the pace of expanding and strengthening the company, and strive to make it, as soon as possible, a securities holding group with core competitiveness whose business scale and comprehensive strength are at the forefront of the industry.
Application practice
Log search
In log search, the log source is the first part of the whole ES pipeline, which is basically consistent with ELK-style collection and display. It is mainly divided into a log collection module, a collection agent, a log storage module, a log retrieval module, a foreground display module, and a collection management module. The log collection module is divided into two parts, the collection agent and log processing: the collected data is first cached, and log processing then consumes each log twice, writing one copy into ES and using the other copy for real-time alerting. The log retrieval module displays the logs and has three main functions: unified log retrieval, real-time monitoring and alerting, and unified rights management.
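A minimal sketch of that two-way consumption, assuming a Kafka topic serves as the data cache (the article does not name the caching component); the topic name, the ES address, and the alert endpoint and rule are all illustrative:

import json
import requests
from kafka import KafkaConsumer  # assumed caching layer; not named in the article

ES_URL = "http://es-node1:9200"          # illustrative address
consumer = KafkaConsumer(
    "app-logs",                          # hypothetical topic fed by the collection agents
    bootstrap_servers=["kafka1:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    log = msg.value
    # Copy 1: index the log into Elasticsearch for unified retrieval.
    requests.post(f"{ES_URL}/logs-{log.get('app', 'unknown')}/_doc", json=log, timeout=5)
    # Copy 2: feed the same log to real-time alerting (rule and endpoint are hypothetical).
    if log.get("level") == "ERROR":
        requests.post("http://alert-gateway/api/alerts", json=log, timeout=5)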
Log analysis
Data sources include not only logs but also web pages and databases. Through morphling, the collected data is written into three stores: ES, HDFS, and HBase. ES is ultimately used for log retrieval, while the data in HDFS and HBase is ultimately presented through Hive and Kylin.
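A sketch of fanning one collected record out to the three stores, assuming the common requests, hdfs (WebHDFS), and happybase client libraries; none of the hostnames, paths, index, or table names come from the article:

import json
import requests
import happybase                      # assumed HBase client
from hdfs import InsecureClient       # assumed WebHDFS client

ES_URL = "http://es-node1:9200"                        # illustrative
hdfs_client = InsecureClient("http://namenode:9870")   # illustrative
hbase_conn = happybase.Connection("hbase-master")      # illustrative

def route(record: dict) -> None:
    """Fan one collected record out to the three stores described above."""
    line = json.dumps(record)
    # 1) Elasticsearch: serves interactive log retrieval.
    requests.post(f"{ES_URL}/collected-data/_doc", json=record, timeout=5)
    # 2) HDFS: raw data for Hive batch analysis (appends to an existing per-source file).
    hdfs_client.write(f"/data/raw/{record['source']}.json",
                      data=line + "\n", append=True, encoding="utf-8")
    # 3) HBase: keyed storage behind Kylin cubes (column family "d" is illustrative).
    hbase_conn.table("collected_data").put(record["id"], {b"d:payload": line.encode("utf-8")})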
Link management system
In the link management system, an internal closed loop is formed by data quality analysis, a unified collection framework, automated deployment, and application monitoring.
For data quality analysis, users require that no data be added or lost, but in practice, because the link is relatively long, it is almost impossible for the systems along it to guarantee zero data loss. What the link management system can do is help determine whether the link has a quality problem and analyze the data quality in the system in time. For example, a node sends 10 million records that eventually enter ES through the managed link; during this process data may be delayed. If the curve of records sent matches the corresponding curve of records finally received, the data quality is intact, that is, the final data is consistent. If the two curves cannot be fitted, some data has been lost.
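A sketch of that curve comparison, counting the records that actually arrived in ES with a per-minute date_histogram aggregation and diffing them against the sender's own counts (the ES address and index name are illustrative; fixed_interval assumes a recent ES version, older versions use interval):

import requests

ES_URL = "http://es-node1:9200"   # illustrative address

def received_counts_per_minute(index: str) -> dict:
    """Count documents that actually arrived in ES, bucketed per minute."""
    body = {
        "size": 0,
        "aggs": {
            "per_minute": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"}
            }
        },
    }
    resp = requests.post(f"{ES_URL}/{index}/_search", json=body, timeout=30).json()
    buckets = resp["aggregations"]["per_minute"]["buckets"]
    return {b["key_as_string"]: b["doc_count"] for b in buckets}

def mismatched_minutes(sent: dict, received: dict, tolerance: int = 0) -> list:
    """Return the minutes where the sent curve and the received curve do not fit."""
    return [t for t, n in sent.items() if abs(n - received.get(t, 0)) > tolerance]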
The unified collection framework standardizes the logs in the collected log files, for example by specifying what fields a log entry should contain.
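A sketch of such a standard; the required field names are hypothetical, since the article does not list them:

# Hypothetical minimal schema; the actual required fields are not listed in the article.
REQUIRED_FIELDS = {"@timestamp", "app", "host", "level", "message"}

def is_standard(log: dict) -> bool:
    """Return True if a collected log entry carries every field the unified framework expects."""
    return REQUIRED_FIELDS.issubset(log)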
Automated deployment means automating every step of the deployment process, which currently covers applications, environments, deployment processes, and so on. The first step toward automation is to model the applications, environments, and processes to be deployed, and an automated deployment system is needed to support this.
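One possible shape for such a model, sketched with dataclasses; all field names are illustrative, as the article does not describe the actual model:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Environment:
    name: str                   # e.g. "test" or "prod"
    hosts: List[str]            # machines the application is deployed to

@dataclass
class Step:
    name: str                   # one step of the modeled deployment process
    command: str                # command executed on each host for this step

@dataclass
class Application:
    name: str
    environments: List[Environment]
    process: List[Step] = field(default_factory=list)   # ordered deployment steps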
Application monitoring means that daily inspection relies solely on the application monitoring system to know whether each node on a link is normal; if a node is abnormal, an alarm is raised.
Usage experience
There are six points to pay attention to in Huatai Securities' use of Elasticsearch.
First, avoid having too many disk directories; RAID 5 is recommended for easier management.
Second, allocate index shards according to actual needs.
Third, adopt a hot-cold separation strategy.
Fourth, deploy clusters according to business characteristics and connect them through tribe nodes.
Fifth, avoid too much metadata, which degrades cluster performance and can trigger bugs.
Sixth, set a delayed shard allocation policy for when a node leaves.
On the first point, about not having too many disk directories: the official website's suggestion was initially followed, namely to use bare disks. Since a single machine has 24 disks, when some shards stay on one disk for a long time that disk becomes a read/write hotspot, and when those shards hold a large amount of data the disk easily fills up; ES does not rebalance the data across the disks of a node. Once a disk exceeds the threshold, the whole node no longer receives new shards, so space is underutilized. The problem was solved by configuring RAID 5 on newly purchased machines and managing their disks in a unified way, so that no single disk fills up and overall throughput improves, while the old machines are gradually replaced.
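The threshold behavior mentioned above is governed by Elasticsearch's disk watermark settings; a sketch of inspecting disk usage and adjusting the watermarks over the REST API (the address and percentages are illustrative, not values from the article):

import requests

ES_URL = "http://es-node1:9200"   # illustrative address

# Inspect disk usage and shard counts per node.
print(requests.get(f"{ES_URL}/_cat/allocation?v", timeout=10).text)

# Adjust the watermarks that stop shard allocation once a disk fills up.
settings = {
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
}
requests.put(f"{ES_URL}/_cluster/settings", json=settings, timeout=10)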
With jstat it was found that the master node kept doing GC while CPU usage was very high, which pointed to a possible memory leak. A jmap heap dump therefore had to be taken and analyzed to locate the leak, and finally the source code was modified and the fix verified.
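The troubleshooting steps described above correspond roughly to the following JVM tooling, wrapped here in a small Python sketch; the PID and dump file name are placeholders:

import subprocess

MASTER_PID = "12345"   # placeholder: PID of the Elasticsearch master process

# Sample GC statistics every second, ten times; persistent full GC with high CPU
# on the master points to a memory problem.
subprocess.run(["jstat", "-gcutil", MASTER_PID, "1000", "10"], check=True)

# Dump live heap objects for offline analysis of the suspected leak.
subprocess.run(["jmap", "-dump:live,format=b,file=master-heap.hprof", MASTER_PID], check=True)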
As for the sixth point: after the cluster has been running for a long time, there are more and more shards and more and more metadata. Once the number of shards in a cluster exceeds 15,000, performance problems appear; after running for a period of time the cluster no longer works properly and nodes cannot connect to the master. In actual operation, nodes often leave the cluster because of network problems or GC, and nodes also go offline during routine maintenance. To avoid data loss, the cluster must ensure that every shard has enough replicas, which generates a large amount of network I/O. If the node then rejoins the cluster, Elasticsearch redistributes data shards to it, which again causes a large amount of network I/O. In this situation the cluster stays in the Yellow state for a long time, recovery is very slow, and monitoring shows shards in a migrating state. This requires setting the parameter index.unassigned.node_left.delayed_timeout to delay shard rebalancing and provide buffer time for maintenance operations and node failure recovery.
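A sketch of applying the delayed-allocation setting mentioned above to all indices through the REST API; the 5-minute delay and the address are illustrative values:

import requests

ES_URL = "http://es-node1:9200"   # illustrative address

# Give a departed node time to come back before its shards are reallocated elsewhere.
body = {"settings": {"index.unassigned.node_left.delayed_timeout": "5m"}}
requests.put(f"{ES_URL}/_all/_settings", json=body, timeout=10)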
[This article is reposted from the Yunqi Community. Author: Yunqi Kyushu. Original link: https://yq.aliyun.com/]