2025-04-02 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 05/31 Report--
This article shows how to build a large-scale distributed system on a cloud computing platform. The content is concise and easy to follow; through the detailed introduction below, I hope you will take something away from it.
It covers how to use the flexibility of a cloud computing platform, combined with the characteristics of your business, to design and build a highly available and highly scalable back-end architecture. It also walks through the evolution from a simple back end to a large-scale distributed system, based on real cases on the QingCloud platform.
When many enterprises and developers build a product, their first concern is implementing its features, so the back-end architecture is usually very simple and direct. In the early days after launch, user traffic is light and little data has accumulated, so the system runs well for a while.
However, with the popularity and maturity of the mobile Internet, a product can gather a large number of users in a short time and face "worries" such as surging traffic and exponential data growth. These are good problems to have, but if the back end cannot be expanded in time, the result is slow responses, frequent errors, and even denial of service.
Even without a sudden surge in load, the various functional modules grow more and more complex through continuous development and upgrades. If the back-end architecture is not well organized, the risk of system failures and outages keeps increasing.
In the era before cloud computing, taking physical hardware from purchasing, racking, and cabling through installation, debugging, and deployment to actual use was a long, labor-intensive process that often could not keep up with an emergency expansion. Cloud services not only save cost; more importantly, their elasticity lets enterprises and developers expand a system online quickly, on demand.
However, rapid expansion of cloud resources alone is not enough. Enterprises also need to pay attention to the availability and scalability of each system component at the business level. Below, I will show how to use the advantages of cloud computing, combined with the characteristics of your business, to build a stable and reliable distributed system.
Let's start with the simplest back-end architecture:
Access layer: nginx
Business layer: Java application
Data layer: MySQL
In a cloud computing environment, how you organize the network architecture is very important. QingCloud provides both basic networks and VPC networks; the difference between them has been covered in the official user guide and previous articles, so I won't repeat it here. Enterprises are advised to build their own network with a VPC, place all hosts and system resources inside it, and assign a private network segment (such as 192.168.x.x or 172.16.x.x). Hosts communicate via private addresses, which do not change.
As the number of hosts grows, IP addresses become hard to remember. To make hosts easier to identify, give each one an intranet alias. To ease management in the console, label each resource and organize them by tag.
Now let's return to the simple back-end architecture above. As access pressure increases, a single nginx + Java application may no longer be enough: you will see the host's CPU getting busier and its memory usage climbing. And if this host fails, the whole service becomes unavailable.
So we first adjust the architecture: add multiple nginx + Java application servers to serve requests in parallel, and introduce a load balancer (LB below) in the access layer, so that public requests hit the LB first. There are many choices for the LB: nginx and HAProxy provide layer-7 load balancing, while LVS provides layer-4, each with its own installation and configuration method.
Introducing an LB distributes request pressure across multiple back-end business servers and, through heartbeat checks, automatically isolates servers that fail, achieving high availability at the business layer. But now the LB itself becomes a single point: when it fails, the whole service goes down. You can therefore use Keepalived to run a standby LB that takes over immediately when something goes wrong; there is plenty of material on how to deploy it.
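To make the heartbeat-isolation idea concrete, here is a minimal sketch in Python (not any real LB's implementation; the backend names are made up): a round-robin picker that skips any backend whose health check has failed.

```python
class RoundRobinLB:
    """Round-robin load balancer that isolates unhealthy backends.
    Illustrative sketch only; backend names are hypothetical."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cursor = 0

    def mark_down(self, backend):
        # A failed heartbeat check isolates the backend.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        self.healthy.add(backend)

    def pick(self):
        if not self.healthy:
            raise RuntimeError("no healthy backends")
        # Advance around the ring until a healthy backend is found.
        while True:
            backend = self.backends[self._cursor % len(self.backends)]
            self._cursor += 1
            if backend in self.healthy:
                return backend

lb = RoundRobinLB(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")
picks = [lb.pick() for _ in range(4)]  # "app-2" is never chosen
```

A real LB does the same thing with active health checks and per-connection scheduling, but the isolation logic follows the same shape.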
Some may suggest round-robin DNS across different IPs to make the LB highly available, but this does not actually work: once one LB goes down, DNS keeps resolving to it. Even if you modify the DNS record immediately, the service remains unavailable until DNS caches expire, which usually takes a long time.
Although the principle of an LB is not complex, deploying and configuring one takes considerable work, and making the LB itself highly available takes extra effort. Starting from Beijing Zone 3, QingCloud provides a high-performance, highly available LB cluster service that can be used directly.
Next, consider scaling the business layer. The first problem is how to expand business servers quickly. If a server's runtime environment and programs are not updated often, you can create a host image from an existing business server; when you need more capacity, create new hosts directly from the image and attach them to the LB back end to serve traffic immediately.
You can also use the AutoScaling feature to automate this process: host expansion is triggered automatically when a condition such as LB concurrency or response latency crosses a threshold, and resources are reclaimed when the condition no longer holds.
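The trigger logic can be sketched as a simple threshold function. The thresholds and parameter names below are illustrative assumptions, not QingCloud's actual AutoScaling parameters.

```python
def scaling_decision(concurrency, latency_ms, instances,
                     max_per_node=500, max_latency_ms=200,
                     min_instances=2, max_instances=10):
    """Evaluate LB metrics against (illustrative) thresholds and decide
    whether to scale the business-server group out, in, or hold."""
    per_node = concurrency / instances
    if instances < max_instances and (per_node > max_per_node
                                      or latency_ms > max_latency_ms):
        return "scale_out"   # add hosts from the image
    if instances > min_instances and (per_node < max_per_node * 0.3
                                      and latency_ms < max_latency_ms * 0.5):
        return "scale_in"    # reclaim idle hosts
    return "hold"
```

The asymmetric scale-in thresholds avoid flapping: the group shrinks only when load is well below the scale-out point.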
Of course, if the environment or programs on your business servers change frequently, a fixed image is not suitable. In that case you can build your own automated deployment (with Puppet or Ansible, for example) to expand the business automatically. All of these operations can be driven through QingCloud's open API, combined with your deployment tooling.
In addition, make sure your business servers are stateless: the LB may send each request to a different back end, so you cannot assume that two consecutive requests land on the same server. If a server needs to keep a user's session information, push it down to a cache or database.
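Externalizing session state can be sketched as follows. A plain in-memory dict stands in for a shared store such as Redis, and the class and field names are hypothetical.

```python
import uuid

class SessionStore:
    """Session state pushed down to shared storage so app servers stay
    stateless; a dict stands in for Redis here (illustrative sketch)."""

    def __init__(self):
        self._data = {}

    def create(self, user_info):
        sid = uuid.uuid4().hex          # session ID returned to the client
        self._data[sid] = user_info
        return sid

    def get(self, sid):
        return self._data.get(sid)

store = SessionStore()
sid = store.create({"user": "alice"})
# Whichever server the LB picks next can resolve the session by its ID.
fetched = store.get(sid)
```

Because every server resolves sessions through the shared store, any of them can serve any request, which is exactly what the LB assumes.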
As the product gains features, you will find the original monolithic project growing ever larger, with all kinds of functional logic intertwined; a failure in one function can take down the whole service. At this point, consider splitting the monolith into multiple independent sub-services that communicate via messages or RPC.
Sub-service invocation comes in two flavors: synchronous and asynchronous. Try to make every request that does not need an immediate result asynchronous. For such requests, introduce a message queue to buffer the data they generate, and let the consumers of the queue scale horizontally with the number of queued tasks. There are many message queue options, such as Redis, RabbitMQ, ActiveMQ, and Kafka. The QingCloud platform currently provides a distributed, partitioned, multi-replica message queue service with high throughput and low latency that users can easily integrate into their own systems.
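The buffering pattern can be sketched with Python's standard library, where `queue.Queue` stands in for a real broker such as RabbitMQ or Kafka; scaling the consumers horizontally is just a matter of adding workers.

```python
import queue
import threading

# queue.Queue stands in for a message broker (illustrative sketch only).
tasks = queue.Queue()
results = []
lock = threading.Lock()

def consumer():
    while True:
        task = tasks.get()
        if task is None:              # poison pill: worker shuts down
            tasks.task_done()
            return
        with lock:
            results.append(task * 2)  # stand-in for real processing
        tasks.task_done()

# Horizontal scaling = adding more consumers.
workers = [threading.Thread(target=consumer) for _ in range(3)]
for w in workers:
    w.start()
for i in range(5):                    # producer: fire-and-forget
    tasks.put(i)
for _ in workers:                     # one pill per worker
    tasks.put(None)
for w in workers:
    w.join()
```

The producer returns to its caller as soon as the task is enqueued; the queue absorbs bursts and the consumer pool drains it at its own pace.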
Nowadays, data analysis is increasingly important to enterprises. While handling requests, business servers can feed raw data through queues into a big-data processing system. QingCloud provides complete distributed big-data platforms, Spark and Hadoop, which users can easily create, use, and scale on demand.
By splitting out services, we minimize the global impact when one sub-service fails and improve the availability of the system as a whole. For sub-services under heavy pressure, we can also scale them out independently, the same way we scaled the business servers earlier. QingCloud's private-network LB service is useful here as well.
As the business grows, the data layer comes under more and more pressure, and a single-machine database is no longer enough to support it. Next, let's talk about distributing and scaling the data layer.
In most business scenarios, data access is read-heavy and write-light, with reads concentrated on a small set of hot data. The first step is to introduce a cache layer to relieve read pressure on the database. If the required cache capacity is large, build a cache cluster: in the layer above it, distribute keys across nodes with a consistent hashing algorithm, so that only a small portion of keys is invalidated when cache nodes are added later.
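A minimal consistent-hashing ring (an illustrative sketch, not a production implementation) shows why adding a node invalidates only a fraction of the keys:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: adding a cache node
    remaps only a fraction of the key space."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._keys = []   # sorted virtual-node hashes
        self._owner = {}  # virtual-node hash -> physical node
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            bisect.insort(self._keys, h)
            self._owner[h] = node

    def locate(self, key):
        # The first virtual node clockwise from the key's hash owns it.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._owner[self._keys[idx]]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
before = {f"user:{i}": ring.locate(f"user:{i}") for i in range(1000)}
ring.add("cache-4")
moved = [k for k, node in before.items() if ring.locate(k) != node]
# Only the moved keys are invalidated, and they all land on the new node.
```

Compare this with plain hash-modulo placement, where going from 3 nodes to 4 remaps roughly three quarters of all keys.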
Then introduce new kinds of databases. Redis has become the first choice of many enterprises: it supports rich data types and query interfaces, and as an in-memory database it is naturally fast. You can move the less relational data in your business from MySQL to Redis, especially lists and counters; this lightens MySQL's load while improving query performance. A single Redis node may not meet your capacity requirements. The QingCloud platform provides a multi-master, multi-slave Redis 3.0 cluster service, which partitions data automatically to grow storage capacity on the one hand and ensures high availability of the service on the other.
MySQL itself can be extended in several steps. First, add a MySQL slave node and route part of the read traffic to it at the upper layer. Because slave replication may lag, the business must tolerate briefly inconsistent data: if a user changes his age, for example, other users may not see the new value for a short while.
QingCloud's MySQL database supports a one-master, multi-slave architecture with load balancing already done across the slave nodes; you can easily add new slaves in the console to absorb read pressure for you.
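The read/write split described above can be sketched as a tiny router. The connection names are placeholders, and real deployments usually rely on middleware rather than hand-rolled routing like this.

```python
import itertools

class ReadWriteRouter:
    """Send writes to the master and spread SELECTs across slaves.
    Connection names are hypothetical (illustrative sketch)."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            # Reads may be slightly stale because of replication lag.
            return next(self._slaves)
        return self.master  # INSERT/UPDATE/DELETE/DDL go to the master

router = ReadWriteRouter("db-master", ["db-slave-1", "db-slave-2"])
```

Note that a read that must observe its own preceding write should also be routed to the master; keyword sniffing alone cannot express that requirement.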
Even with slaves holding copies of the data, you should still take regular cold backups of the database, so that you can roll back or restore to a point in time after an operational mistake. On the QingCloud platform, backups can be run manually or configured as an automatic task, and they do not affect normal use of the database.
As data grows, a single database can no longer hold the full data set, and write pressure on it becomes more and more obvious; you should consider splitting databases and tables. Move the largest tables into separate databases, which frees space in the main database and spreads read and write pressure. When splitting, keep related tables together in one database according to functional logic.
When a single table is so large that both reads and writes hit bottlenecks, you need to consider horizontal sharding of the table. This approach solves single-table capacity, read pressure, and write pressure all at once, but development and operations become harder, so it is recommended to finish the optimizations above first and do this only when truly necessary.
Here are the main points of horizontal sharding. First, pick a reasonable partition key (shard key) from the table's columns. It should be the column used most frequently in the table's query conditions, so that most queries can determine in advance which specific shard to hit. If a query condition does not contain the shard key, all shards must be scanned and the results merged.
With a shard key chosen, you also need a partitioning algorithm: by range, for example user_id in [0, 100] on shard 1 and user_id in [101, 200] on shard 2; or by hash modulo; and so on. When designing the algorithm, fully consider the characteristics of the business and think from the perspective of reads and writes, asking whether the design spreads pressure and data evenly across the shards.
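Both partitioning schemes can be sketched in a few lines; the ranges and shard names are hypothetical.

```python
def range_shard(user_id, ranges):
    """Range partitioning: ranges is a list of (low, high, shard) tuples."""
    for low, high, shard in ranges:
        if low <= user_id <= high:
            return shard
    raise KeyError(f"no shard covers user_id={user_id}")

def hash_shard(user_id, n_shards):
    """Modulo partitioning: spreads load evenly, but resharding
    moves most existing rows."""
    return user_id % n_shards

RANGES = [(0, 100, "shard-1"), (101, 200, "shard-2")]
```

Range partitioning keeps related IDs together and makes adding a new range cheap, but a hot range concentrates load on one shard; modulo spreads load evenly but makes resharding expensive, which is the trade-off the text asks you to weigh.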
We also need to consider how to make the data layer's scaling transparent to the layers above, for example by introducing distributed database middleware, or by wrapping database operations, together with business logic, into an independent sub-service for other services to call. If you don't build a sub-service, at least keep a separate layer in the business code that encapsulates all database access.
Besides the structured data above, enterprises also need to store large numbers of small files (unstructured data). A single machine's disks, LVM, or NAS can serve as stopgaps, but none of them satisfies unlimited capacity, high performance, high security, and high availability all at once. Self-built distributed storage such as Ceph, GlusterFS, or HDFS fits only limited scenarios, and the cost of operating and extending it is very high.
On the QingCloud platform, users can store massive numbers of files with the QingStor object storage service, which itself offers unlimited capacity, high scalability, high availability, and high security.
Having covered scaling the data layer, let's finally discuss multi-datacenter deployment and remote disaster recovery. Starting from the Beijing Zone 3 data center, QingCloud uses its own backbone fiber and multi-path ring-network technology so that users do not notice a network failure within a data center, ensuring high availability at the infrastructure level. But if a user's business can be deployed across multiple data centers, it can both share access load and speed up regional access, for example for users in northern and southern China or overseas.
Suppose there are three data centers. In the middle is QingCloud's Beijing Zone 3, which carries the main business. On the left is QingCloud's Asia-Pacific Zone 1, which mainly serves Asia-Pacific and overseas customers; both use QingCloud private networks (VPC) and interconnect through GRE or IPsec encrypted tunnels. On the right is your office's physical machine room, where IT staff develop and work.
Achieving active-active across regions usually proceeds in three stages, from easy to hard. First, set up a reverse proxy in the standby data center: user requests arrive there and are forwarded directly to the primary. If the two data centers are linked by dedicated lines or the latency is very low, this is the easiest deployment. Second, deploy business servers and caches in both data centers; since most data requests can be served from the cache, cross-datacenter access is rarely needed, though a cache miss still queries the primary's database. Third, deploy the full stack in both data centers, including the access, business, and data layers, with the data layer kept in sync across centers via database master-master or master-slave replication.
Finally, to sum up today's sharing: there is no classic or perfect architecture, only the architecture best suited to your business. What we covered today are the common ways to scale the access, business, and data layers in the most general business scenarios. The evolution of an enterprise back end is a long, arduous process; it is impossible to design a complete system from day one, but a design that looks a bit further ahead at the start leaves room for later optimization.
Question 1: How should an enterprise customer build distributed systems of different sizes on a private cloud?
First, the enterprise needs to know the scale of its current business: the type of business, its QPS, the kinds and volume of its data, and the SLA and performance expectations for both business and data. Only with these clear can trade-offs be made during planning.
In a cloud environment, basic resources are created and destroyed very quickly, so attention should go to scalability at the business level: a stateless business layer, data-layer indexing, hot/cold separation, and so on. Regardless of size, no component of the system should be a single point of failure or a single bottleneck. A small system doesn't have to scale yet, but it must be able to.
Question 2: How are hot and cold data managed, and how is data persistence done?
Hot data should be the fastest to access, and access speed is determined mainly by distance and medium: by distance, local memory > local disk > remote memory > remote disk; by medium, SSD > SAS > SATA. The ratio of hot to cold data is generally very lopsided, so hot data should be stored on the closer, better media.
Each storage system, such as MySQL or Redis, has its own data persistence strategy.
Question 3: Is data stored on a big-data platform less safe than through the original point-to-point interfaces?
In fact, regardless of the storage form, data safety depends mainly on redundancy: whether copies exist, how many there are, whether they span physical machines or even data centers, whether writes actually reach disk, and whether replicas are written synchronously or asynchronously.
Question 4: When building a large-scale distributed platform with Redis for cache management, what should we pay attention to?
First consider cache granularity: too coarse, and invalidation becomes too frequent. Also consider capacity: if a single node cannot hold enough hot data, choose an appropriate distribution strategy across multiple nodes, comparing the common consistent hashing and hash-modulo schemes. Redis 3.0 and above provides cluster capability, which partitions data automatically and offers high availability.
Question 5: How are resources pooled for distributed databases and caches?
Proxy middleware can be placed in front of the database service; there are open-source solutions, or you can build your own. The interface it exposes should hide the distributed details, so users need not care about capacity, performance, or distribution strategy, as if they were seeing a single-machine database.
Question 6: How should transaction data be stored in a large-scale distributed system, and how is multi-center disaster recovery achieved?
The most important thing is that transaction data must not be lost; performance is secondary. In the past, many traditional enterprises chose commercial databases like Oracle, while more and more newer companies are willing to use open-source options such as MySQL and PostgreSQL. When configuring them, however, you must require the strictest mode: multiple replicas written synchronously, with logging, before a write returns success.
Question 7: What types of applications suit cloud computing, and what are the criteria?
As IT infrastructure, cloud computing has success stories across industries, so it is not a question of which kind of application it suits. The only measures are whether it meets the need: whether it can replace what traditional hardware provides, and whether it offers capabilities traditional hardware cannot, such as elastic scaling, pay-per-use, and rapid creation and destruction.
Question 8: How is Keepalived's performance? Is the back end HAProxy?
Keepalived mainly uses the Virtual Router Redundancy Protocol (VRRP) to achieve high availability and has essentially no impact on performance. It is an independent service, unrelated to HAProxy.
Question 9: Can QingCloud's cloud service prevent Hadoop cluster failure caused by a NameNode power outage or similar events?
Currently QingCloud's IaaS layer triggers disaster recovery when a physical machine loses power: an identical host starts up with no data loss, and the HDFS service is then restarted to restore the cluster. Support for Hadoop's own HA will be available soon, so that HDFS can recover automatically.
Question 10: If AutoScaling creates too many instances, what if the database's maximum connections are exhausted?
You can put proxy middleware between the instances and the database, or implement an independent data-access service yourself, so that instances never manipulate the database directly.
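The connection-capping idea can be sketched as a bounded pool; `connect_fn` is a placeholder for any real connection factory, and the sizes are illustrative.

```python
import threading

class BoundedConnectionPool:
    """Cap simultaneous DB connections so that autoscaled app instances
    cannot exhaust the database's max_connections (illustrative sketch;
    `connect_fn` stands in for a real connection factory)."""

    def __init__(self, connect_fn, max_size):
        self._connect = connect_fn
        self._sem = threading.BoundedSemaphore(max_size)
        self._idle = []
        self._lock = threading.Lock()

    def acquire(self):
        self._sem.acquire()            # blocks once max_size are checked out
        with self._lock:
            return self._idle.pop() if self._idle else self._connect()

    def release(self, conn):
        with self._lock:
            self._idle.append(conn)    # keep the connection for reuse
        self._sem.release()

pool = BoundedConnectionPool(connect_fn=object, max_size=2)
c1 = pool.acquire()
c2 = pool.acquire()
# A third acquire would now block until a connection is released.
pool.release(c1)
c3 = pool.acquire()  # reuses c1 instead of opening a new connection
```

A shared proxy in front of the database does the same capping for the whole fleet, whereas a per-instance pool only bounds each instance; the product of instances and pool size must still stay under the database limit.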
The above is how to build a large-scale distributed system on a cloud computing platform. Have you picked up some knowledge or skills? If you want to learn more or enrich your knowledge, welcome to follow the industry information channel.