2025-03-31 Update From: SLTechnology News&Howtos > Database
Shulou(Shulou.com)05/31 Report--
What are the benefits of abandoning MongoDB in favor of ES? This article answers that question in detail, walking through the analysis and the solution adopted, in the hope of helping readers facing the same decision find a simple, feasible path.
Preface
Figure: MongoDB and Elasticsearch popularity ranking
This article revolves around two questions:
Why migrate from MongoDB to Elasticsearch?
How do I migrate from MongoDB to Elasticsearch?
Current situation and background
MongoDB positions itself as a competitor to relational databases, yet few projects put the data of core business systems on it; most still choose a traditional relational database.
1. Project background
The company operates in the logistics and express-delivery industry. Its business systems are large and complex, with many users and operators; a large volume of business data is generated every day, and individual records change many times. To make this activity easy to record, track, and analyze, an operation-log system was built. Given the average daily data volume at the time, the operation logs were stored in MongoDB.
The operation logging system records two types of data, described below:
1) The master record: who performed what operation, in which module of the system, at what time, on which data record (dataId), and under which operation trace number (traceId).
{"dataId": 1, "traceId": "abc", "moduleCode": "crm_01", "operateTime": "2019-11-11 12:12:12", "operationId": 100, "operationName": "Zhang San", "departmentId": 1000, "departmentName": "Customer Department", "operationContent": "visiting customers"}
2) The slave (detail) records: the actual before and after values of the changed data. There are many of these, since every changed field of a row produces one record.
[{"dataId": 1, "traceId": "abc", "moduleCode": "crm_01", "operateTime": "2019-11-11 12:12:12", "operationId": 100, "operationName": "Zhang San", "departmentId": 1000, "departmentName": "Customer Department", "operationContent": "visiting customers", "beforeValue": "20", "afterValue": "30", "columnName": "customerType"}, {"dataId": 1, "traceId": "abc", "moduleCode": "crm_01", "operateTime": "2019-11-11 12:12:12", "operationId": 1000, "operationName": "Zhang San", "departmentId": 1000, "departmentName": "Customer Department", "operationContent": "visiting customers", "beforeValue": "2019-11-02", "afterValue": "2019-11-10", "columnName": "lastVisitDate"}]
2. Project architecture
The project architecture is described as follows:
The business system creates or edits data and sends the resulting operation-log records to the Kafka cluster, using the dataId field as the message key
The new or edited data itself is stored in the MySQL database
A Canal cluster subscribes to the MySQL cluster, with the monitored databases and tables configured per business-system module
Canal sends the captured change data to the Kafka cluster, again keyed on the dataId field
The operation-log system consumes both master records and slave records from Kafka
The operation-log system writes the data to MongoDB and supports querying it back
Figure: operation-log business process
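The keying convention above matters: because both the business system and Canal use dataId as the Kafka message key, every event for a given record lands on the same partition and is consumed in order. A minimal sketch of that idea (the hash scheme and partition count here are illustrative, not the project's actual configuration):

```python
import hashlib

def partition_for(data_id: str, num_partitions: int = 12) -> int:
    """Map a record key to a Kafka partition deterministically,
    so every event for the same dataId preserves its order."""
    digest = hashlib.md5(data_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events keyed "1" go to the same partition on every call.
p1 = partition_for("1")
p2 = partition_for("1")
```

Consumers then see the master and slave events of one business record in write order, which the core implementation logic below relies on.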
3. MongoDB architecture
Cluster architecture:
Server configuration: 8C / 32 GB RAM / 500 GB SSD
Router servers: 3 nodes deployed
Config servers: 3 nodes deployed
Shard servers: 9 nodes deployed
The master operation records are split across 3 shards
The slave operation records are split across 3 shards
Problem description
Devotees of MongoDB may suspect that we simply used it badly, that we lacked the operations skills, or that we happened to have Elasticsearch experts on hand. None of that is the case. Choosing Elasticsearch over MongoDB was not a matter of technical bias but a requirement of our actual scenario, for the following reasons:
1. Search query
MongoDB uses B-tree indexes, which follow the leftmost-prefix principle: a query can use an index only if its conditions match the order of the indexed fields. This is an advantage in simple cases, but it is fatal in complex business scenarios.
Queries against the operation-log records carry many filter conditions, combined arbitrarily. MongoDB does not support this well, and neither does any relational database: supporting every combination would mean creating a huge number of compound indexes, which is impractical. We discussed this in the article "Analysis and discussion of the mixed Application system scenario of DB and ES", which covers it in detail.
At the same time, master and slave records contain many string fields, and queries on them must support both exact matching and full-text search. Here MongoDB's functionality is limited and its performance poor; business queries often timed out. Elasticsearch, by contrast, is a natural fit.
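As an illustration of the kind of request Elasticsearch handles naturally, the sketch below builds a single bool query that combines exact filters with a full-text match. The field names come from the operation-log model above; the helper itself is hypothetical:

```python
def build_log_query(module_code=None, operator_id=None, content_text=None):
    """Assemble an Elasticsearch bool query from any combination of
    exact filters (term) and full-text search (match)."""
    filters = []
    if module_code is not None:
        filters.append({"term": {"moduleCode": module_code}})
    if operator_id is not None:
        filters.append({"term": {"operationId": operator_id}})
    query = {"bool": {"filter": filters}}
    if content_text is not None:
        query["bool"]["must"] = [{"match": {"operationContent": content_text}}]
    return {"query": query}

# Any subset of conditions yields one valid request body.
body = build_log_query(module_code="crm_01", content_text="visiting customers")
```

Because filters compose freely inside `bool`, no combination-specific index has to exist in advance, which is exactly what the B-tree approach could not offer.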
2. Maturity of technology stack
Consider how sharding and replication are implemented. In MongoDB, a collection's shards and replicas must be bound to specific machine instances at design time: which shards live on which nodes, and which replicas on which nodes, are fixed when the cluster is configured, which is essentially no different from a traditional relational database. Many other data products still have this coupling, such as Redis Cluster and ClickHouse. In Elasticsearch, by contrast, shards and replicas are not directly bound to nodes; they can be rebalanced and moved freely, and the hardware of individual nodes can easily be differentiated.
The operation-log data grows quickly, with more than 10 million entries written per day. Before long the operations staff would need to expand the cluster, and doing so is considerably more complicated in MongoDB than in Elasticsearch.
A single MongoDB collection held more than a billion documents. At that scale, even simple conditional queries performed poorly, nowhere near the speed of Elasticsearch's inverted index.
The company's experience with the two stacks also differs. Elasticsearch is used in many projects, including very core ones, so there is far more development and operations experience with it. MongoDB, on the other hand, had barely found an entry point outside non-core scenarios; nobody dared to use it in core projects, which is an awkward position for a technology.
3. Identical document model
Both MongoDB and Elasticsearch are document databases: BSON corresponds to JSON, and MongoDB's _objectid corresponds to Elasticsearch's _id, so migrating the master and slave data models to the Elasticsearch platform required almost no changes.
Migration scheme
Migrating between heterogeneous data systems mainly involves two pieces of work:
Migrating the upper application layer: code originally written against MongoDB's syntax rules is modified to target Elasticsearch's.
Migrating the underlying MongoDB data into Elasticsearch.
1. Elastic capacity evaluation
The original MongoDB cluster used 15 servers, 9 of them data servers. How many servers does the Elastic cluster need? We used a simple extrapolation: first synchronize a sample of 1 million documents from a production MongoDB collection into ES in the test environment and measure the disk they occupy (about 10 GB in our case), then scale that figure linearly to the production volume and add some redundancy for expected business growth. Based on this assessment, the Elastic cluster was set up with 3 servers, each configured with 8C / 16 GB RAM / 2 TB spinning disk. The server count shrank from 15 to 3, with greatly reduced specs.
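The capacity estimate is a linear extrapolation and can be written as a tiny helper. Note that the original write-up pairs a 1 TB estimate with a billion documents, but at the measured 10 GB per million documents, 1 TB corresponds to roughly 100 million documents; the production count below is therefore illustrative:

```python
def estimate_disk_gb(sample_docs: int, sample_gb: float,
                     prod_docs: int, redundancy: float = 1.3) -> float:
    """Extrapolate production disk need from a test-environment sample,
    padded by a redundancy factor for growth (the 1.3 factor is illustrative)."""
    per_doc_gb = sample_gb / sample_docs
    return prod_docs * per_doc_gb * redundancy

# 1M sample docs occupying 10 GB; ~100M production docs then need ~1 TB raw.
raw = estimate_disk_gb(1_000_000, 10, 100_000_000, redundancy=1.0)
```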
2. Elastic index rules
The system operation log is time-series data: once written, it is essentially never modified again. Queries mostly target the current month, and historical data is queried far less often. Accordingly, the core data indexes are created per month, every business query must include an operation-time range, and back-end queries sort by time in reverse chronological order. The Elastic API supports matching multiple indexes in one query, which neatly handles searches that span several months. For non-core data, one index per year is sufficient.
Figure: Elastic operation log index creation rules
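The monthly naming rule and the multi-index query it enables can be sketched as follows; the index base name and date format are illustrative, not the project's actual convention:

```python
from datetime import date

def monthly_index(base: str, day: date) -> str:
    """Index name per month, e.g. oplog_master-2019.11."""
    return f"{base}-{day:%Y.%m}"

def indices_for_range(base: str, start: date, end: date) -> list[str]:
    """Expand a query's time range into the monthly indices it spans,
    suitable for a multi-index search (GET /idx1,idx2/_search)."""
    names, y, m = [], start.year, start.month
    while (y, m) <= (end.year, end.month):
        names.append(f"{base}-{y:04d}.{m:02d}")
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return names

# A query from 2019-11-05 to 2020-01-20 touches three monthly indices.
spans = indices_for_range("oplog_master", date(2019, 11, 5), date(2020, 1, 20))
```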
3. Core implementation logic design
Elasticsearch is not a relational database and has no transaction mechanism. All of the operation-log system's data arrives via Kafka, whose consumption is ordered within a partition. Two scenarios need special attention:
The master record reaches the operation-log system before the slave records: when a slave record is written, it is first assembled from the master record plus the binlog field data.
The slave records reach the operation-log system before the master record: when the master record arrives, it updates the relevant fields of the already-written slave index documents.
Elasticsearch index updates are near-real-time: after data is submitted it is not immediately visible via the Search API. So how can the master record's data be reliably folded into the slave records? Worse, because of inconsistent usage by business units, multiple master records can share the same dataId and traceId.
The master and slave data are associated only by dataId and traceId. If both arrive at nearly the same time, an update_by_query can silently miss documents that are not yet searchable; moreover, the master-slave relationship can be many-to-many, so dataId plus traceId cannot uniquely identify one record.
Elasticsearch can, however, serve as a key-value store. So an additional Elastic index was introduced as an intermediate cache. The idea: both master and slave data are written to this cache first, with the document _id set to dataId+traceId. Through this intermediate index one can look up the _id of the master record and the _ids of the slave records. The cache's data model is shown below; detailId is the array of slave-index _ids.
{"dataId": 1, "traceId": "abc", "moduleCode": "crm_01", "operationId": 100, "operationName": "Zhang San", "departmentId": 1000, "departmentName": "Customer Department", "operationContent": "visiting customers", "detailId": [1, 2, 3, 4, 5, 6]}
As noted earlier, a master record and its slave records share a Kafka partition, so the consumer pulls them together in batches. The core ES APIs used while processing a batch are:
_mget: batch-get records from an index
bulk: batch insert
_delete_by_query: batch-delete from the intermediate temporary index
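The two ordering scenarios and the intermediate cache can be sketched with plain dictionaries standing in for the Elastic indexes; in production these lookups would be the _mget / bulk / _delete_by_query calls listed above, and all names here are illustrative:

```python
cache = {}        # intermediate index: _id "dataId:traceId" -> master + detailId list
slave_store = {}  # stands in for the slave (detail) index

def _entry(data_id, trace_id):
    return cache.setdefault(f"{data_id}:{trace_id}", {"master": None, "detailId": []})

def handle_master(master):
    """Master record arrives: remember it, and back-fill any slaves already seen."""
    e = _entry(master["dataId"], master["traceId"])
    e["master"] = master
    for sid in e["detailId"]:                       # slaves that arrived first
        slave_store[sid].update(operationName=master["operationName"])

def handle_slave(slave_id, slave):
    """Slave record arrives: store it, enriched from the master if it came first."""
    e = _entry(slave["dataId"], slave["traceId"])
    e["detailId"].append(slave_id)
    if e["master"] is not None:
        slave = {**slave, "operationName": e["master"]["operationName"]}
    slave_store[slave_id] = slave

# Scenario 2 above: the slave arrives first, then the master back-fills it.
handle_slave(10, {"dataId": 1, "traceId": "abc", "columnName": "customerType"})
handle_master({"dataId": 1, "traceId": "abc", "operationName": "Zhang San"})
```

Keeping the detailId list in the cache avoids the unreliable update_by_query path: the master can address each slave document by its exact _id.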
Migration process
1. Data migration
We chose DataX as the data synchronization tool for the following reasons:
Historical data: operation-log records are almost never modified after creation, so they can be treated as offline data.
One-off migration: after the project finished, the original MongoDB cluster would be decommissioned entirely; no second migration would be needed.
Data volume: the original MongoDB operation log held billions of records. The migration could be neither too fast nor too slow: too fast, and the MongoDB cluster develops performance problems; too slow, and the project drags on, increasing operations cost and complexity. (For even larger volumes, Hadoop could have been chosen as a transit platform.)
DataX source modifications for our scenario: date-type conversion, index primary key _id generation, _id mapping, and support for repeated (idempotent) synchronization.
Multi-instance, multi-threaded parallelism: multiple instances synchronize the master data, multiple instances synchronize the slave data, and each instance is configured with multiple channels.
Figure: schematic diagram of DataX synchronization data
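A skeleton of what such a DataX job file looks like, built here as a Python dict and serialized to the JSON that DataX expects. The plugin names (mongodbreader, elasticsearchwriter) exist in the open-source DataX distribution, but the hosts, database, collection, index, and column lists below are placeholders, and the project's customized build may differ:

```python
import json

job = {
    "job": {
        "setting": {"speed": {"channel": 4}},   # parallel channels per instance
        "content": [{
            "reader": {
                "name": "mongodbreader",
                "parameter": {
                    "address": ["mongo-host:27017"],
                    "dbName": "oplog",
                    "collectionName": "master_record",
                    "column": [{"name": "dataId", "type": "long"},
                               {"name": "traceId", "type": "string"}],
                },
            },
            "writer": {
                "name": "elasticsearchwriter",
                "parameter": {
                    "endpoint": "http://es-host:9200",
                    "index": "oplog_master-2019.11",
                    "column": [{"name": "dataId", "type": "long"},
                               {"name": "traceId", "type": "keyword"}],
                },
            },
        }],
    }
}
job_file = json.dumps(job, indent=2)   # written to disk and passed to datax.py
```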
2. Migrate index settings
Some index settings are temporarily modified for the bulk load and restored after data synchronization completes, as follows:
"index.number_of_replicas": 0, "index.refresh_interval": "30s", "index.translog.flush_threshold_size": "1024m", "index.translog.durability": "async", "index.translog.sync_interval": "5s"
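The same settings can be managed as a small toggle that also knows the steady-state values to restore after the load. The migration values are from the article; the restored values below are common Elasticsearch defaults, not taken from the article:

```python
def index_settings(migrating: bool) -> dict:
    """Index settings for the bulk-load window vs. normal operation."""
    if migrating:
        return {
            "index.number_of_replicas": 0,          # skip replica writes during load
            "index.refresh_interval": "30s",        # refresh far less often
            "index.translog.flush_threshold_size": "1024m",
            "index.translog.durability": "async",   # fsync translog in background
            "index.translog.sync_interval": "5s",
        }
    return {                                        # assumed defaults to restore
        "index.number_of_replicas": 1,
        "index.refresh_interval": "1s",
        "index.translog.durability": "request",
    }
```

Either dict would be sent as the body of a `PUT /<index>/_settings` request.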
3. Application migration
The operation-log project is built on Spring Boot, with custom configuration items added as follows:
# flag: app writes to MongoDB
writeflag.mongodb: true
# flag: app writes to Elasticsearch
writeflag.elasticsearch: true
Description of project transformation:
On first release, set both write flags to true, double-writing to MongoDB and ES.
For reads, provide two different interfaces so the front end can switch freely between them.
Once the data migration is complete and verified to show no differences, flip the flag values.
Figure: smooth application migration
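The flag-controlled dual write can be sketched as follows; the class and method names are illustrative stand-ins for the project's Spring Boot wiring:

```python
class OperationLogWriter:
    """Route each log record to MongoDB, Elasticsearch, or both,
    depending on the two write flags from configuration."""

    def __init__(self, write_mongodb: bool, write_elasticsearch: bool):
        self.write_mongodb = write_mongodb
        self.write_elasticsearch = write_elasticsearch
        self.mongo_writes, self.es_writes = [], []   # stand-ins for real clients

    def save(self, record: dict):
        if self.write_mongodb:
            self.mongo_writes.append(record)         # would call the Mongo DAO
        if self.write_elasticsearch:
            self.es_writes.append(record)            # would call the ES bulk client

# First release: both flags true, so every record is double-written.
writer = OperationLogWriter(write_mongodb=True, write_elasticsearch=True)
writer.save({"dataId": 1})
```

After verification, constructing the writer with `write_mongodb=False` completes the cut-over without touching the call sites.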
Conclusion
1. Migration effect
Replacing MongoDB with Elasticsearch as the storage database cut the server count from 15 (MongoDB) to 3 (Elasticsearch), saving the company a significant sum every month. Query performance improved more than tenfold, and a wider variety of queries is now supported, earning unanimous praise from business users, the operations team, and leadership.
2. Summary of experience
The whole project took several months, with many colleagues participating in design, development, data migration, testing, data verification, and stress testing. The technical solution was not achieved in one step; there were plenty of pitfalls along the way before it finally went live. ES has many excellent technical features, and only when used flexibly does it show its full power.
That concludes this share on the benefits of abandoning MongoDB in favor of ES. I hope the content above has been of some help. If you still have questions, you can follow the industry information channel to learn more.