High-capacity NoSQL solution: Aerospike in practice

2025-04-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Getui has provided message push services to developers for many years. Through a push SDK, the mobile client establishes a long-lived connection with the server and stays online. When the network is abnormal, however, a message cannot be delivered to the end user in real time, so the push server keeps an offline message list and delivers the messages when the user comes back online. This data is stored in a Redis cluster dedicated to push, which comprises more than 100 master and slave instances. The number of keys is on the order of one billion and the storage footprint is at the terabyte level, which brings considerable maintenance cost and operational challenges. As back-end engineers on the push service, we have been looking for a more cost-effective solution.

When we evaluated Aerospike, we found that a single physical machine with a single Intel SSD 4600 can reach nearly 100,000 QPS. A few dozen machines can therefore meet current demand and support business growth for a long time to come.

Advantages of Aerospike

Aerospike is a high-performance, scalable, and reliable NoSQL solution. It supports RAM and SSD as storage media and is specifically optimized for SSD, and it is widely used in real-time scenarios such as real-time bidding. The vendor states that 99% of operations complete within 1 ms, and the product provides automatic rebalancing of cluster data, cluster-aware clients, and support for very large datasets (on the order of 100 TB).

As a key-value store, Aerospike offers a variety of data types and is operated in much the same way as Redis. Beyond the basic features, Aerospike can be monitored through the AMC console, APIs, and other channels, covering metrics such as cluster QPS, health, and load, which makes it friendly to operations. Because data in the cluster is rebalanced automatically, its maintenance cost is much lower than that of a Redis cluster solution.

This article shares our experience with the grayscale (gradual) rollout and use of Aerospike, in the hope of providing a reference for readers who are evaluating or preparing to adopt it. The grayscale approach is not specific to Aerospike; it can also serve as a reference when migrating or planning other infrastructure components.

Data model description

Aerospike uses schemaless storage, and its data model resembles that of an RDBMS, so it is fairly easy to understand and use:

Each namespace contains multiple sets, each set contains multiple records, and each record contains multiple bins (comparable to database columns). A record is looked up by its key through the index. Different businesses can use different namespaces in the same cluster to isolate resources, which allows resources to be pooled and used as fully as possible.
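
To make the hierarchy concrete, here is a minimal in-memory sketch of the namespace / set / record / bin nesting. It is only an illustration of the model, not the Aerospike client API; the class name, method names, and sample keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# A record is a mapping of bin name -> value (a bin is comparable to a column).
Bins = Dict[str, object]


@dataclass
class Namespace:
    """Toy model: namespace -> set -> record (by key) -> bins."""
    name: str
    sets: Dict[str, Dict[str, Bins]] = field(default_factory=dict)

    def put(self, set_name: str, key: str, bins: Bins) -> None:
        # Create the set and the record on first use, then merge the bins.
        self.sets.setdefault(set_name, {}).setdefault(key, {}).update(bins)

    def get(self, set_name: str, key: str) -> Optional[Bins]:
        # A record is looked up by its key inside a set.
        return self.sets.get(set_name, {}).get(key)


# Two businesses isolated by namespace (names are made up).
push = Namespace("push")
stats = Namespace("stats")
push.put("offline_msgs", "user:42", {"msgs": ["hello"], "count": 1})
```

Because the model is schemaless, a later `put` on the same key may add bins that other records in the set do not have.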

Grayscale rollout process

Getui uses a large Redis cluster to store offline message lists. We investigated SSDB, Pika, and other disk-based stores that support the Redis protocol; overall, Aerospike offered the best price-performance ratio.

Early on, we simulated the actual read/write ratio of the production workload (analysis of online traffic showed a write-to-read ratio of roughly 1:1 to 1:2), evaluated and verified feasibility, and then planned the production rollout.

The online business is fairly complex, so cutting over to Aerospike all at once is unrealistic and risky. Simulation in a test environment rarely exposes the problems that can occur in production, so we divided the rollout into an observation phase and a grayscale phase. In the observation phase, the original Redis cluster still serves all online reads and writes, and a copy of the same traffic is sent to Aerospike purely for realistic load verification. In the grayscale phase, online traffic is gradually cut over to the Aerospike cluster, and the grayscale scope is widened as long as the cluster remains stable, until the business runs entirely on Aerospike. The two phases work as follows:

Observation phase: after a Redis operation succeeds, the same read or write is replayed to Aerospike asynchronously, and Aerospike serves no business traffic. In effect, every write goes to both Redis and Aerospike. This phase mainly checks whether the data on the two sides is consistent and how Aerospike behaves under load. During the observation phase, operations such as node restarts and cluster expansion can also be rehearsed to evaluate operations cost and tune the configuration. The AMC console and the monitoring API can be used to watch cluster status, while the client-side calling code records the necessary logs and monitoring data.

Grayscale phase: Aerospike begins to serve offline message list storage for some applications and tasks. During this phase, data is written to and cleared from both Redis and Aerospike, so that Redis remains a hot standby until all traffic has been switched to Aerospike and has run stably for some time.
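
The observation-phase pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not the production code: plain dicts stand in for the Redis and Aerospike clients, and the class name `ShadowWriter` and its counters are hypothetical. Only writes are shown; reads could be replayed through the same queue.

```python
import queue
import threading


class ShadowWriter:
    """Write to the primary store, then replay the write to a shadow store
    asynchronously through a bounded queue."""

    def __init__(self, primary, shadow, max_pending=1000):
        self.primary = primary          # stand-in for the Redis client
        self.shadow = shadow            # stand-in for the Aerospike client
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0                # replays skipped because the queue was full
        self.shadow_errors = 0          # replays that failed on the shadow store
        threading.Thread(target=self._drain, daemon=True).start()

    def set(self, key, value):
        self.primary[key] = value       # business write; an exception propagates
        try:
            self.pending.put_nowait((key, value))
        except queue.Full:
            self.dropped += 1           # never block business traffic on the shadow

    def get(self, key):
        return self.primary.get(key)    # reads are served by the primary only

    def _drain(self):
        while True:
            key, value = self.pending.get()
            try:
                self.shadow[key] = value
            except Exception:
                self.shadow_errors += 1
            finally:
                self.pending.task_done()

    def flush(self):
        """Block until every queued replay has been processed."""
        self.pending.join()
```

The `dropped` and `shadow_errors` counters are the kind of figures worth exporting to monitoring, since a growing drop count indicates that the queue is too small or the shadow store is too slow.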
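
One way to picture the grayscale phase is the sketch below. It assumes that traffic is selected per application by a stable hash bucket, which is an illustrative choice and not necessarily how the original service filtered traffic. Dicts stand in for the two stores, and `GrayscaleStore` and `in_grayscale` are hypothetical names.

```python
import zlib


def in_grayscale(app_id: str, percent: int) -> bool:
    """Stable bucketing: an app always maps to the same bucket in 0..99,
    so raising `percent` only ever adds apps to the grayscale group."""
    return zlib.crc32(app_id.encode("utf-8")) % 100 < percent


class GrayscaleStore:
    def __init__(self, redis_store, aerospike_store, percent=0):
        self.redis = redis_store
        self.aerospike = aerospike_store
        self.percent = percent          # share of apps served by Aerospike

    def set(self, key, value):
        # Double write keeps both stores current, so either can serve reads.
        self.redis[key] = value
        self.aerospike[key] = value

    def delete(self, key):
        # Double clear, for the same reason.
        self.redis.pop(key, None)
        self.aerospike.pop(key, None)

    def get(self, app_id, key):
        source = self.aerospike if in_grayscale(app_id, self.percent) else self.redis
        return source.get(key)
```

Since both stores receive every write and delete, lowering `percent` back to 0 is enough to return all reads to Redis.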

The observation phase is very important: it is essentially an online evaluation of the feasibility of the whole plan. The observation points fall into two groups, the client (AS-Client) and the server (AS-Server). On the client we mainly observe:

1. Use metrics to monitor client request response time, and use the distribution of request times over a period (50th, 90th, 99th, and 99.9th percentiles) to evaluate the system SLA.

2. Monitor the counts of successful and failed reads and writes.

3. Set the slow-log threshold to 50 ms, and collect slow-log statistics for peak and normal periods.

4. Monitor the queue used for asynchronous writes to Aerospike, and size the queue appropriately.
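
The client-side latency figures can be computed with a few lines of code. The sketch below uses the nearest-rank percentile definition, which is one common choice among several; the function names are hypothetical, and a real deployment would more likely rely on a metrics library.

```python
import math


def percentile(sorted_ms, pct):
    """Nearest-rank percentile of an ascending list of latencies (ms)."""
    if not sorted_ms:
        raise ValueError("no samples")
    rank = math.ceil(pct * len(sorted_ms) / 100)
    return sorted_ms[max(rank, 1) - 1]


def summarize(latencies_ms, slow_threshold_ms=50.0):
    """Return p50/p90/p99/p99.9 and the number of slow requests."""
    ordered = sorted(latencies_ms)
    report = {f"p{p}": percentile(ordered, p) for p in (50, 90, 99, 99.9)}
    # Requests above the slow-log threshold (50 ms in this article).
    report["slow"] = sum(1 for ms in ordered if ms > slow_threshold_ms)
    return report
```

Running `summarize` separately on peak and off-peak samples gives the two slow-log views described in item 3.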

On the server we mainly observe:

1. Cluster health.

2. Disk and memory usage, and the ratio of memory space to disk space.

3. Machine I/O load, CPU load, disk fragmentation, and similar information.

4. Cluster throughput: whether read and write TPS is comparable to that of the online Redis cluster.

5. Data consistency. How can consistency be checked during the observation and grayscale phases? Key-by-key comparison is too slow. Since data read from Redis should be exactly the same as data read from Aerospike, we log the query results of sampled records from both stores, then compare and analyze the proportion of inconsistent data within 1 minute, 5 minutes, 30 minutes, and 1 hour. With on the order of one billion keys online, even a detected difference of only 1 in 10,000 is significant, and the cause must be investigated and resolved.
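
The sampled consistency check can be sketched as below. The sample tuple layout and function names are assumptions made for illustration; in practice the samples would be parsed from the comparison log. For scale, a mismatch rate of 1 in 10,000 across one billion keys corresponds to about 100,000 inconsistent keys.

```python
# Time windows named in the article, in seconds.
WINDOWS = {"1m": 60, "5m": 300, "30m": 1800, "1h": 3600}


def mismatch_ratio(samples, now, window_seconds):
    """samples: iterable of (timestamp, redis_value, aerospike_value).
    Returns the share of samples inside the window whose values differ."""
    recent = [(r, a) for ts, r, a in samples
              if now - window_seconds <= ts <= now]
    if not recent:
        return 0.0
    return sum(1 for r, a in recent if r != a) / len(recent)


def consistency_report(samples, now):
    """Mismatch ratio for each window, keyed by window name."""
    samples = list(samples)  # allow generators to be reused across windows
    return {name: mismatch_ratio(samples, now, seconds)
            for name, seconds in WINDOWS.items()}
```

Comparing the short and long windows helps tell a transient lag in the asynchronous replay (high in the 1-minute window only) from a persistent divergence (high in every window).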

In terms of maintainability, the main concern is that automatic rebalancing of cluster data causes some performance degradation, which may significantly affect user experience. Drawing on our experience, we simulated several typical operations scenarios:

1. Simulate the impact on system performance of a cluster rebalance caused by a single-node failure.

2. Simulate the impact on system performance of a cluster rebalance caused by cluster expansion.

3. Based on the impact on online business, work out the rebalance speed for daytime and nighttime, and support updating it through cron jobs.

4. Node restart.

5. Adding an SSD mount.

6. Tuning of related configuration, and so on.
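
As an illustration of item 3, a cron job could pick a rebalance throttle by time of day. The sketch below is built on assumptions: the hours and thread counts are hypothetical, and it assumes the `migrate-threads` service parameter can be changed dynamically through `asinfo`, which should be verified against the documentation for the server version in use. The code only builds the command and does not execute it.

```python
def migrate_threads_for_hour(hour, night_start=1, night_end=6,
                             day_threads=1, night_threads=4):
    """Return a faster rebalance setting at night, a gentler one by day.
    All default values are illustrative, not recommendations."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return night_threads if night_start <= hour < night_end else day_threads


def asinfo_command(threads):
    """Build (but do not run) the asinfo invocation for a cron job.
    Assumes migrate-threads is a dynamic service-context parameter."""
    return ["asinfo", "-v",
            f"set-config:context=service;migrate-threads={threads}"]
```

A cron entry would call this once an hour on each node and pass the resulting argument list to `subprocess.run`.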

To sum up, the complete rollout is divided into the following steps:

0. Run load tests that simulate the online environment to verify feasibility.

1. Wrap the Aerospike client in a Redis-like interface, and add the necessary logging, monitoring, bin validity checks, and so on.

2. Integrate the Aerospike client into the message service. Required functions include asynchronous Aerospike reads and writes, switching of the business data source, and traffic filtering.

3. QA functional verification.

4. Request resources and deploy the Aerospike cluster online.

5. Bring the message service with Aerospike integration online.

6. Once the observation phase passes verification, enter the grayscale phase, which ends in either full rollout or rollback.
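
Step 1 can be sketched as a thin wrapper that exposes Redis-style list commands over a record store. This is a hedged illustration only: a dict keyed by (set, key) stands in for the Aerospike client, the class and method names are hypothetical, and the 14-character bin name check uses the limit quoted later in this article.

```python
from collections import Counter

MAX_BIN_NAME_LEN = 14  # bin name limit quoted in this article


def check_bin_name(name):
    """Reject bin names that are empty or exceed the length limit."""
    if not name or len(name.encode("utf-8")) > MAX_BIN_NAME_LEN:
        raise ValueError(f"invalid bin name: {name!r}")


class RedisLikeStore:
    """Redis-style list commands on top of a (set, key) -> bins mapping."""

    def __init__(self, backend, set_name, bin_name="msgs"):
        check_bin_name(bin_name)
        self.backend = backend      # stand-in for the real client
        self.set_name = set_name
        self.bin_name = bin_name
        self.calls = Counter()      # per-command counters for monitoring

    def rpush(self, key, *values):
        self.calls["rpush"] += 1
        record = self.backend.setdefault((self.set_name, key), {})
        items = record.setdefault(self.bin_name, [])
        items.extend(values)
        return len(items)           # like Redis, return the new list length

    def lrange(self, key, start, stop):
        self.calls["lrange"] += 1
        record = self.backend.get((self.set_name, key), {})
        items = record.get(self.bin_name, [])
        n = len(items)
        if start < 0:
            start = max(n + start, 0)
        if stop < 0:
            stop = n + stop
        if stop < 0:
            return []
        return items[start:stop + 1]  # Redis ranges include the stop index

    def delete(self, key):
        self.calls["delete"] += 1
        removed = self.backend.pop((self.set_name, key), None)
        return 1 if removed is not None else 0
```

Keeping the wrapper's surface close to the Redis commands the service already uses means the data source can be switched without touching business code.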

Experience summary

While using Aerospike we ran into several problems and challenges, summarized as follows:

1. Enabling single-bin mode in Aerospike saves space.

2. Aerospike does not store the original key; the index actually holds a 20-byte hash of it. If the business needs the original key, it must be stored separately in a bin. Even when the key and value are only a few bytes long, the key hash alone occupies 20 bytes, so the actual space consumed is considerable.

3. Aerospike rebalances data when a node goes down or when nodes are added or removed, and this process affects the quality of service. The rebalance speed can be controlled, so there is a tradeoff between preserving service quality and recovering the cluster quickly.

4. Each time a Community Edition cluster restarts, the index is rebuilt and loaded into memory, which makes startup slow. Namespaces must be declared in the configuration file, so it is best to pre-allocate the namespaces the business may need in the future to avoid unnecessary restarts.

5. Because SSDs suffer from fragmentation and write amplification, we found in practice that performance degrades severely once disk usage reaches about 50%. The defragmentation parameters can therefore be tuned to the actual workload.

6. Aerospike limits hot keys: reading or writing the same key too frequently returns a HotKey error (error code 14). On the server, raising the transaction-pending-limit setting allows more concurrent operations on the same key; the default is 20, and 0 means unlimited. Raising it may cost some performance. The client may need to retry on this exception, but retries can further aggravate the hot key.

7. A change to an infrastructure component like this must be verified with online traffic as much as possible, so that potential problems surface early.

8. Operations and maintenance cost should also be evaluated during the observation phase, to avoid jumping from one pit into another.

9. Be aware of some inherent limits of Aerospike, such as at most 1023 sets per namespace, at most 14 single-byte characters in a bin name, and at most 64 SSDs per namespace. For more information, see aerospike_known_limitations.
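
For the HotKey case in item 6, a client-side retry can be bounded and spread out so that it adds less pressure to the busy key. The sketch below uses exponential backoff with random jitter, which is a common general technique and not something the article prescribes; `StoreError` is a hypothetical stand-in for the client library's exception, and only the error code 14 comes from the text.

```python
import random
import time

KEY_BUSY = 14  # HotKey error code cited in this article


class StoreError(Exception):
    """Hypothetical stand-in for the client library's exception type."""

    def __init__(self, code):
        super().__init__(f"store error {code}")
        self.code = code


def call_with_retry(op, max_attempts=3, base_delay=0.01,
                    sleep=time.sleep, rng=random.random):
    """Run op(); retry only on KEY_BUSY, with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except StoreError as err:
            # Other errors, or the final attempt, are passed to the caller.
            if err.code != KEY_BUSY or attempt == max_attempts:
                raise
            # The delay ceiling doubles each attempt; jitter spreads retries out.
            sleep(rng() * base_delay * 2 ** (attempt - 1))
```

Keeping `max_attempts` small matters here, because every retry is one more operation on a key that is already overloaded.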

Conclusion

As a high-capacity NoSQL solution, Aerospike is not yet widely used by companies in China. It suits scenarios with large capacity requirements and relatively modest QPS, and it can reduce TCO to some extent. Its supported commands are similar to those of Redis, which makes migration easier. It supports cluster deployment natively and is friendly to monitoring and operations. Despite these strengths, technology selection calls for caution: evaluate in advance whether it fits your business scenario and whether its performance and cost meet your requirements. In some official benchmarks its performance even exceeds that of Redis, but in practice, because of the limits of SSDs, its QPS in most cases falls well short of Redis. Finally, be sure to verify with online traffic before going live, and roll out to production in a grayscale manner to minimize the impact on user experience.
