Understanding the Knowledge System of Distributed Architecture (a Comprehensive Map of Core Knowledge)


Author | Xiaotu, Senior Engineer at Alibaba

Recommended companion reading: [Cloud-Native Era: The Essential Knowledge Map for Distributed System Design (22 Knowledge Points)](http://mp.weixin.qq.com/s?__biz=MzUzNzYxNjAzMg==&mid=2247486600&idx=1&sn=0ad92a1fe535f141fe2e8c87ffbd1229&chksm=fae50747cd928e51c05c41d2cc206069babbe9dfdba5957c52ac6e77cb754192169bb6b3e898&scene=21#wechat_redirect)

Introduction: starting from basic distributed theory, architectural design patterns, engineering applications, deployment and operations, and industry solutions, this article sketches the outline of a distributed knowledge system centered on MSA (microservice architecture), so that you gain a three-dimensional view of the evolution from SOA to MSA and, moving from concepts to tools, a deeper, hands-on understanding of the distributed nature of microservices and of how a full microservice architecture is built.

Follow the "Alibaba Cloud Native" official account and reply "Distribution" to download a high-resolution version of the distributed system knowledge map!

With the growth of the mobile Internet and the spread of smart terminals, computer systems long ago moved from stand-alone operation to multi-machine cooperation: clusters built on distributed theory now power huge and complex application services. On top of this distributed foundation, a cloud-based technological revolution is under way, breaking the traditional development model and unleashing a new generation of productivity.

The Big Picture: The Knowledge Architecture of Distributed Systems



Basic Theory

The Evolution from SOA to MSA

SOA: Service-Oriented Architecture

As the business grows to a certain scale, services need to be decoupled: a single large system is divided logically into different subsystems that communicate through service interfaces. Service-oriented design ultimately relies on a bus to integrate services, and most of the time the database is shared, so a single point of failure can bring down the bus and may further drag down the database. That is why more independent designs emerged.

MSA: Microservice Architecture

A microservice is an independent service in the true sense: it is logically isolated from its service entry point all the way down to its data persistence layer, with no service bus required. This, however, increases the difficulty of building and managing the whole distributed system, since services must be orchestrated and governed. With the rise of microservices, the entire technology stack of the microservice ecosystem must therefore integrate seamlessly to support microservice governance.

Nodes and Networks

Node

Traditionally a node was a single physical machine, and all services plus the database lived on it. With virtualization, a single physical machine is routinely split into multiple virtual machines to maximize resource utilization, and the notion of a node became a service on a single virtual machine. In recent years, as container technology matured, services have been fully containerized, so a node is now just a lightweight container service. In general, a node is a collection of logical computing resources that can deliver a unit of service.

The network

The foundation of any distributed architecture is the network, whether a local area network or the public Internet; without it, computers cannot work together. But the network also brings a series of problems: messages arrive in some order, and message loss and delay are everyday occurrences. We define three network working models:

- Synchronous network: nodes execute in lockstep, message delay is bounded, and efficient global locks are possible.
- Semi-synchronous network: the scope of locking is relaxed.
- Asynchronous network: nodes execute independently, message delay has no upper bound, there are no global locks, and some algorithms become infeasible.

A brief look at the characteristics of the two major transport-layer protocols:

- TCP: transmission is reliable (even though other protocols can be faster), and TCP solves duplication and out-of-order delivery.
- UDP: constant data streams, and packet loss is not fatal.

Time and Order

Time

In the slow physical world, time flows on its own. For serial transactions it is simple to follow the pace of time: first come, first served. We then invented clocks to record moments in the past, and clocks keep the world in order. In a distributed world, however, dealing with time is genuinely painful.

In the distributed world we must coordinate the before-and-after relationships between different nodes, yet each node has its own opinion about what time it is. We created the Network Time Protocol (NTP) to try to standardize time across nodes, but NTP's own behavior is not satisfactory, so we constructed the logical clock and eventually improved it into the vector clock:

Shortcomings of NTP that keep it from fully coordinating concurrent tasks in a distributed environment:

- Clocks on different nodes drift out of sync
- Hardware clock drift
- Threads may sleep
- The operating system may sleep
- Hardware may hibernate

- Logical clock: defines the happened-before order of events; on receiving a message, t' = max(t, t_msg) + 1
- Vector clock: t_i' = max(t_i, t_msg_i) for each component i
- Atomic clock
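
To make the vector-clock rule above concrete, here is a minimal Java sketch (my own illustration, not code from the article): each node keeps one counter per node, increments its own slot on local events, and merges element-wise maxima when it receives a message.

```java
import java.util.Arrays;

// Minimal vector clock sketch: one counter per node.
// Local events tick the local slot; receiving a message takes the
// element-wise max with the sender's clock and then ticks the local slot.
public class VectorClock {
    private final long[] clock;   // one entry per node
    private final int nodeId;     // index of the local node

    public VectorClock(int nodeCount, int nodeId) {
        this.clock = new long[nodeCount];
        this.nodeId = nodeId;
    }

    public synchronized void localEvent() {
        clock[nodeId]++;
    }

    // t_i' = max(t_i, t_msg_i) for every i, then tick the local slot
    public synchronized void onReceive(long[] msgClock) {
        for (int i = 0; i < clock.length; i++) {
            clock[i] = Math.max(clock[i], msgClock[i]);
        }
        clock[nodeId]++;
    }

    // happened-before: <= in every slot and strictly < in at least one
    public synchronized boolean happenedBefore(long[] other) {
        boolean strictlyLess = false;
        for (int i = 0; i < clock.length; i++) {
            if (clock[i] > other[i]) return false;
            if (clock[i] < other[i]) strictlyLess = true;
        }
        return strictlyLess;
    }

    public synchronized long[] snapshot() {
        return Arrays.copyOf(clock, clock.length);
    }
}
```

Comparing two snapshots slot by slot recovers the happened-before relation; snapshots that are incomparable in both directions represent concurrent events.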

Order

With tools for measuring time in hand, the problem of order follows naturally. The theoretical core of distribution is how different nodes negotiate consistency, and order is the basic concept underlying consistency theory, which is why we took the time to introduce the scales and tools for measuring time.

Consistency theory

When it comes to consistency theory, it is worth looking at a chart comparing how consistency choices affect system design:

The figure compares the trade-offs between transactions, performance, errors, and latency under different consistency algorithms.

Strong Consistency: ACID

In a single-machine environment we place stringent requirements on traditional relational databases, and ACID is the set of principles that guarantees transactions. The four principles are so familiar that they hardly need explaining:

- Atomicity: all operations in a transaction either complete in full or not at all; a transaction never stops halfway.
- Consistency: the integrity of the database is not violated before or after the transaction.
- Isolation: the database allows multiple concurrent transactions to read, write, and modify data at the same time; isolation prevents the inconsistencies that interleaved concurrent execution could cause.
- Durability: once a transaction completes, its changes to the data are permanent and survive system failures.
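
As a small illustration of atomicity at work, here is a minimal JDBC sketch (the connection URL and `accounts` table are hypothetical, not from the article): either both updates commit together or the whole transfer rolls back.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Minimal JDBC sketch of atomicity: debit and credit commit together,
// or the whole transfer rolls back on any failure.
public class TransferDemo {
    public static void transfer(String url, long from, long to, long amount) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url)) {
            conn.setAutoCommit(false);              // start an explicit transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setLong(1, amount);  debit.setLong(2, from);  debit.executeUpdate();
                credit.setLong(1, amount); credit.setLong(2, to);   credit.executeUpdate();
                conn.commit();                      // atomically apply both updates
            } catch (SQLException e) {
                conn.rollback();                    // undo partial work
                throw e;
            }
        }
    }
}
```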

Distributed Consistency: CAP

In a distributed environment we cannot guarantee that the network stays connected or that messages are delivered, so three important results were developed: CAP, FLP, and DLS:

- CAP: a distributed computing system cannot simultaneously guarantee Consistency, Availability, and Partition tolerance.
- FLP: in an asynchronous environment where network delay has no upper bound, as long as even one node may fail, no deterministic algorithm can reach consensus in bounded time.
- DLS: protocols running in a partially synchronous network model (delay is bounded, but we do not know the bound) can tolerate up to 1/3 arbitrary (that is, Byzantine) faults; deterministic protocols in an asynchronous model (no upper bound on delay) cannot be fault-tolerant (although the paper does not discuss the fact that randomized algorithms can tolerate 1/3 faults); and protocols in a synchronous model (delay guaranteed below a known bound d) can, surprisingly, reach 100% fault tolerance, though with limits on behavior once half or more of the nodes are faulty.

Weak Consistency: BASE

In most cases we do not actually need strong consistency; some businesses can tolerate a certain amount of delayed consistency. To balance consistency and efficiency, the eventual-consistency theory BASE was developed. BASE stands for Basically Available, Soft State, and Eventual Consistency:

- Basically Available: the distributed system is allowed to lose part of its availability when failures occur, that is, the core stays available.
- Soft State: the system is allowed to have intermediate states that do not affect its overall availability. In distributed storage there are generally at least three replicas of each piece of data, and the replication lag permitted between replicas on different nodes is one embodiment of soft state.
- Eventual Consistency: all replicas of the data reach a consistent state after a certain period of time. Weak consistency is the opposite of strong consistency, and eventual consistency is a special case of weak consistency.

Consistency Algorithms

The core of distributed architecture lies in achieving, and compromising on, consistency, so designing a set of algorithms that keeps communication and data consistent across nodes is critical. Guaranteeing that different nodes converge on the same replica over an unreliable network is very hard, and the industry has studied this topic extensively.

First, we need to understand the overarching premise of consistency, the CALM principle:

CALM stands for Consistency And Logical Monotonicity. It describes the relationship between monotonic logic and consistency in distributed systems (see "Consistency as Logical Monotonicity"). Its content is as follows:

In a distributed system, monotonic logic guarantees eventual consistency without relying on scheduling by a central node; and in any distributed system, if all non-monotonic logic is scheduled by a central node, the system can likewise achieve eventual consistency.

Next, let us focus on CRDTs (Conflict-Free Replicated Data Types), the data structures of distributed systems:

Having understood some of the laws and principles of distribution, we can start thinking about how to implement solutions. The premise of any consistency algorithm is its data structure; in other words, data structures are the foundation of all algorithms. Well-designed data structures combined with elegant algorithms solve real problems effectively, and after much exploration we know that CRDTs are widely used in distributed systems.

Refer to "talking about CRDT", A comprehensive study of Convergent and Commutative Replicated Data Types

- State-based: CRDT state is merged directly between nodes; all nodes can merge into the same state, and the order in which states are merged does not affect the final result.
- Operation-based: every operation on the data is broadcast to the other nodes; as long as a node learns of all operations on the data (in whatever order they are received), it merges into the same state.
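
A minimal sketch of a state-based CRDT, the grow-only counter (G-Counter), written for illustration and not taken from the referenced papers: merging is an element-wise maximum, so it is commutative, associative, and idempotent, and the merge order cannot change the final state.

```java
import java.util.Arrays;

// Minimal state-based CRDT sketch: a grow-only counter (G-Counter).
// Each replica increments only its own slot; merge is an element-wise max.
public class GCounter {
    private final long[] counts;    // one slot per replica
    private final int replicaId;

    public GCounter(int replicaCount, int replicaId) {
        this.counts = new long[replicaCount];
        this.replicaId = replicaId;
    }

    public void increment() {
        counts[replicaId]++;
    }

    public long value() {
        long sum = 0;
        for (long c : counts) sum += c;
        return sum;
    }

    // State-based merge: take the per-slot maximum of the two states.
    public void merge(long[] otherState) {
        for (int i = 0; i < counts.length; i++) {
            counts[i] = Math.max(counts[i], otherState[i]);
        }
    }

    public long[] state() {
        return Arrays.copyOf(counts, counts.length);
    }
}
```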

After the data structures, we should look at some important protocols for distributed systems: HATs (Highly Available Transactions) and ZAB (Zookeeper Atomic Broadcast):

Refer to "High availability transactions", "ZAB Protocol Analysis"

The last thing to learn is the mainstream consistency algorithm in the industry:

To be honest, I do not fully understand every one of these algorithms. Consistency algorithms are the essential core of distributed systems; progress here drives architectural innovation, and different application scenarios give rise to different algorithms.

Paxos: "elegant Paxos algorithm" Raft: "Raft consistency algorithm" Gossip: "Gossip Visualization"

Having covered the core theoretical foundations of distributed systems and how data consistency is reached between nodes, let us look at the mainstream distributed systems in use today.

Classification by Scenario

File Systems

A single computer's storage always has an upper limit, so with the arrival of networks, schemes for storing files cooperatively across multiple machines appeared one after another. The earliest distributed file systems were actually called network file systems, and the first file servers were developed in the 1970s. In 1976 DEC designed the File Access Listener (FAL), while the modern distributed file system traces back to Google's famous paper "The Google File System", which laid its foundations. For today's mainstream systems, see "A Comparison of Distributed File Systems". Several commonly used file systems:

HDFS, FastDFS, Ceph, MooseFS

Databases

Of course, a database is also a kind of file system. On top of raw storage it adds advanced features such as transactions, retrieval, and deletion, so complexity rises again: we must consider both data consistency and adequate performance. Because they must balance transactional guarantees with performance, traditional relational databases are limited in how far they can be distributed. Non-relational databases shed the strong-consistency shackles of transactions and settle for eventual consistency, and NoSQL (Not Only SQL) has produced a number of database architectures, including key-value stores, column stores, and document stores.

- Column store: HBase
- Document store: Elasticsearch, MongoDB
- Key-value: Redis
- Relational: Spanner

Computing

Distributed computing systems are built on top of distributed storage. They exploit the data redundancy and disaster tolerance of distributed systems, fetch data efficiently from multiple replicas, and then compute in parallel: a task that once took a long time to compute is split into many tasks processed in parallel, improving computational efficiency. By scenario, distributed computing is divided into offline, real-time, and streaming computation.

- Offline: Hadoop
- Real-time: Spark
- Streaming: Storm, Flink/Blink

Cache

Caching, as a powerful tool for improving performance, can be found everywhere, from CPU cache architecture to distributed application storage. A distributed cache provides random access to hot data and greatly reduces access latency, but it raises the problem of keeping data consistent, which introduces distributed locks. The mainstream distributed cache is basically Redis.

- Persistent: Redis
- Non-persistent: Memcache

Messaging

A distributed message queue is a powerful tool for taming the complexity that asynchrony brings. In multi-threaded, high-concurrency scenarios we often have to design business code very carefully to avoid deadlocks caused by resource contention. A message queue, in a delayed-consumption mode, stores all asynchronous tasks in a queue and digests them one by one.
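
As a small illustration of handing an asynchronous task to a queue, here is a minimal producer sketch using Kafka, one of the queues listed below; the broker address (localhost:9092) and topic name (order-events) are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch of enqueueing an asynchronous task into Kafka.
// Broker address and topic name are illustrative assumptions.
public class TaskProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Enqueue the task; a consumer group drains the topic at its own pace.
            producer.send(new ProducerRecord<>("order-events", "order-42", "{\"action\":\"create\"}"));
        }
    }
}
```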

Kafka, RabbitMQ, RocketMQ, ActiveMQ

Monitoring

As distributed systems grow from single machines into clusters, complexity rises sharply, so monitoring of the whole system becomes essential.

Zookeeper

Applications

The core of a distributed system is how applications handle business logic. Applications call one another directly over specific communication protocols, some based on RPC and some on the general-purpose HTTP protocol.

HSF, Dubbo

Logging

Errors are common in distributed systems, and we should design for fault tolerance as the normal case. When a failure does occur, fast recovery and troubleshooting matter most; distributed log collection, storage, and retrieval give us a powerful tool for locating problems along the request chain.

- Log collection: Flume
- Log storage: ElasticSearch/Solr, SLS
- Log locating: Zipkin

Ledger

We noted earlier that distributed systems exist because single-machine performance is limited: hardware cannot be piled on endlessly, and a single machine eventually hits the bottleneck of its performance curve, so we use many machines to do the same job. Such a distributed system, however, always needs a centralized node to monitor or schedule its resources, even if that central node is itself made up of several nodes. Blockchain is a truly decentralized distributed system: nodes communicate only via P2P network protocols, there is no real central node, and new blocks are produced in coordination according to the computing power and stake of the blockchain's nodes.

Bitcoin, Ethereum

Design Patterns

The previous section listed the roles of different distributed systems in different scenarios. This section goes further and summarizes how to think about architecture design, the differences and trade-offs between design schemes, and the need to choose suitable design patterns for each scenario to reduce the cost of trial and error. A distributed system design needs to consider the following concerns.

Availability

Availability is the proportion of time the system is up and working, usually measured as a percentage of uptime. It can be affected by system errors, infrastructure problems, malicious attacks, and system load. Distributed systems typically promise users a service level agreement (SLA), so applications must be designed to maximize availability.

- Health check: the system implements full-link health checks, and external tools periodically probe it through exposed endpoints.
- Queue-based load leveling: use a queue as a buffer between requests and services to smooth intermittent heavy load.
- Throttling: limit the resources consumed by an application instance, a tenant, or the entire service.

Data Management

Data management is a key element of distributed systems and affects most quality attributes. For reasons of performance, scalability, or availability, data is usually hosted in multiple locations and on multiple servers, which poses a series of challenges: data consistency must be maintained, and data usually needs to be synchronized across locations.

- Cache: load data from the storage tier into a cache on demand.
- CQRS (Command Query Responsibility Segregation): separate commands from queries.
- Event sourcing: record the complete series of domain events in an append-only log.
- Index table: create indexes on fields that are frequently queried.
- Materialized view: generate one or more pre-populated views over the data.
- Sharding: split data into horizontal partitions or shards.

Design and Implementation

Good design covers factors such as consistency in component design and deployment, maintainability that simplifies administration and development, and reusability that lets components and subsystems serve other applications and scenarios. Decisions made during design and implementation have a huge impact on the quality and total cost of ownership of distributed systems and services.

- Proxy: reverse proxy.
- Adapter: implement an adapter layer between modern applications and legacy systems.
- Backends for Frontends: back-end services expose interfaces tailored for front-end applications to call.
- Compute resource consolidation: merge multiple related tasks or operations into a single computational unit.
- Configuration separation: move configuration out of the application deployment package into a configuration center.
- Gateway aggregation: use a gateway to aggregate multiple individual requests into a single request.
- Gateway offloading: offload shared or specialized service functionality to a gateway proxy.
- Gateway routing: route requests to multiple services through a single endpoint.
- Leader election: coordinate a distributed system by electing one instance as the leader responsible for managing the other instances.
- Pipes and filters: decompose complex tasks into a series of reusable components.
- Sidecar: deploy an application's monitoring components into a separate process or container, providing isolation and encapsulation.
- Static content hosting: deploy static content to a CDN to speed up access.

Messaging

Distributed systems need messaging middleware to connect components and services, ideally in a loosely coupled way that maximizes scalability. Asynchronous messaging is widely used and brings many benefits, but it also brings challenges such as message ordering and idempotency.

- Competing consumers: multiple concurrent consumers process messages from the queue.
- Priority queue: messages carry priorities, and higher-priority messages are consumed first.

Management and Monitoring

Distributed systems run in remote data centers where we do not fully control the infrastructure, which makes management and monitoring harder than for single-machine deployments. Applications must expose runtime information that administrators can use to manage and monitor the system, and support changing business requirements and customization without stopping or redeploying the application.

Performance and scalability

Performance is the responsiveness of the system when performing an operation within a given time interval, while scalability is its ability to handle increased load without losing performance, or to add available resources easily. Distributed systems routinely face variable load and activity peaks, especially in multi-tenant scenarios, that are almost impossible to predict; applications should therefore scale out within limits to meet peak demand and scale back in when demand falls. Scalability concerns not only compute instances but also other elements such as data storage and message queues.

Resilience

Resilience means the system can handle failures gracefully and recover from them. Distributed systems are typically multi-tenant, use shared platform services, compete for resources and bandwidth, communicate over the Internet, and run on commodity hardware, all of which makes both transient and longer-lasting failures more likely. To stay resilient, failures must be detected and recovered from quickly and effectively.

- Bulkhead isolation: isolate the elements of an application into pools so that if one fails, the others keep running.
- Circuit breaker: handle faults that may take a variable amount of time to repair when connecting to a remote service or resource.
- Compensating transaction: undo the work performed by a series of steps that together define an eventually consistent operation.
- Health check: the system implements full-link health checks, and external tools periodically probe it through public endpoints.
- Retry: transparently retry previously failed operations so that the application handles expected transient failures when connecting to a service or network resource.

Security

Security means the system can prevent malicious or accidental behavior outside its intended design and use, and can prevent disclosure or loss of information. Distributed systems run on the Internet outside trusted local boundaries, are often open to the public, and may serve untrusted users. Applications must be protected from malicious attacks, access must be restricted to approved users, and sensitive data must be protected.

- Federated identity: delegate authentication to an external identity provider.
- Gatekeeper: protect applications and services with a dedicated host instance that acts as a broker between clients and the application or service, validating and sanitizing requests and relaying requests and data between them.
- Valet key: use a token or key that gives clients restricted direct access to a specific resource or service.

Engineering Applications

So far we have introduced the core theory of distributed systems, some of the hard problems and the trade-offs made to solve them, listed the main categories of existing distributed systems, and summarized some methodologies for building them. Next we introduce, from an engineering point of view, the content and steps of actually building a distributed system.

Resource scheduling

Even the cleverest cook cannot prepare a meal without rice: all our software systems are built on top of hardware servers. From deploying software directly on physical machines, to virtual machines, and finally to resources on the cloud, the use of hardware has become intensively managed. This section covers the scope of responsibility that traditionally belonged to operations roles; in a DevOps environment where development and operations merge, we also want resources to be used flexibly and efficiently.

Elastic Scaling

In the past, when a software system needed more machines as its user base grew, the traditional route was to ask the operations staff to requisition machines, then deploy the software and join it to the cluster. The whole process depended on the operators' experience and was inefficient and error-prone. For distributed microservices there is no need to add physical machines by hand: with containerization, we only need to request cloud resources and run a container script.

- Application scale-out: a surge of users requires scaling the service out, with automatic expansion and automatic shrinkage after the peak.
- Application offlining: outdated applications are taken offline and the cloud platform reclaims their container host resources.
- Machine replacement: faulty machines are replaced with fresh container host resources; services start automatically and traffic is switched over seamlessly.

Network Management

Besides computing resources, the other most important resource is the network. In today's cloud environment we rarely touch physical bandwidth directly; bandwidth is managed by the cloud platform. What we need is to request network resources to the full and manage them effectively.

- Domain name application: apply for matching domain name resources and specify multiple sets of domain name mapping rules.
- Domain name change: manage domain name changes on a unified platform.
- Load management: set access policies for multi-machine applications.
- Security: basic access authentication and interception of illegal requests.
- Unified access: provide a unified access application platform with unified login management.

Fault Snapshot

When the system fails, our first priority is to restore it; at the same time, preserving the scene of the failure is just as important, and the resource-scheduling platform needs a unified mechanism for doing so.

- Scene preservation: preserve memory layout, thread counts, and other resource snapshots, for example via a JavaDump hook.
- Debug access: bytecode instrumentation that does not intrude into business code and can be used for on-site log debugging in production.

Traffic Scheduling

Once a distributed system is built, the gateway is the first thing put to the test, and the next concern is system traffic, that is, how to manage it. Within the upper limit of traffic the system can hold, we want to reserve resources for the highest-quality traffic and keep illegal and malicious traffic outside the door, saving cost while ensuring the system does not collapse.

Load balancing

Load balancing is the general design for how services digest traffic. It is usually divided into hard load balancing of the underlying protocols at the physical layer and soft load balancing at the software layer. Load-balancing solutions are mature in the industry, and we usually tune them for a specific business in a specific environment. The following solutions are commonly used:

- Switches, F5
- LVS / Ali-LVS
- Nginx / Tengine
- VIPServer / ConfigServer

Gateway Design

The gateway bears the brunt of load balancing, because it is the first place where the cluster's concentrated traffic lands; if the gateway cannot take the pressure, the whole system becomes unavailable.

- High performance: the first consideration in gateway design is high-performance traffic forwarding; a single gateway node can usually handle millions of concurrent connections.
- Distributed: to share traffic pressure and for disaster recovery, the gateway itself also needs to be distributed.
- Business filtering: simple rules in the gateway screen out most malicious traffic.
- Request management: request authentication blocks a large share of illegal requests and cleanses the traffic.
- Data cache: most stateless requests have data hotspots, so a CDN can absorb a considerable part of the traffic.

Flow Control

For the remaining genuine traffic, we use different algorithms to distribute the requests.

Flow distribution

- Counter
- Queue
- Leaky bucket
- Token bucket
- Dynamic flow control

Rate limiting: when traffic surges we usually need rate-limiting measures to keep the system from avalanching. We estimate the upper limit of traffic the system can bear and set a cap; once traffic climbs past that threshold, the excess is kept out of the system, preserving availability by sacrificing part of the traffic.

Rate-limiting policies:

- QPS granularity
- Thread-count granularity
- RT threshold
- Rate-limiting tool: Sentinel
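
As an illustration of one of the algorithms above, here is a minimal token-bucket sketch (my own simplification, not Sentinel's implementation): tokens refill at a fixed rate up to a burst capacity, and a request is admitted only if it can take a token.

```java
// Minimal token-bucket sketch: tokens refill at a fixed rate up to a burst
// capacity; a request is admitted only if a whole token is available.
public class TokenBucket {
    private final long capacity;        // maximum burst size
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill proportionally to elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;    // admit the request
        }
        return false;       // shed the request (rate limited)
    }
}
```

Libraries such as Guava's RateLimiter and Sentinel implement far more complete versions of the same idea.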

Service Scheduling

As the saying goes, to forge good iron the smith himself must be strong: once traffic scheduling is under control, what remains is the robustness of the services themselves. Service failures are routine in distributed systems; we even need to treat failure itself as part of a distributed service.

Service Registry

The network management section introduced the gateway: the gateway is the distribution center for traffic, while the registry is the home base of services.

- Status types: the registry records application service status and can detect whether a service is available.
- Lifecycle: the different states of an application service make up the application lifecycle.

Version management:

- Cluster version: beyond each application's own version number, a cluster composed of different services also needs an overall major version number.
- Rollback: when a deployment goes wrong, services can be rolled back according to the cluster's major version.

Service Orchestration

Service orchestration is defined as controlling how resources interact through sequences of messages; the interacting resources are peers and there is no centralized control. In a microservice environment there are many services, and we need a general coordinator to settle the dependency and invocation relationships between them. K8s has become the natural choice.

K8s, Spring Cloud, HSF, ZK+Dubbo

Service Control

Earlier we addressed the robustness and efficiency of the network; this section describes how to make the services themselves more robust.

Discovery: in the resource management section we noted that, after container host resources are requested from the cloud platform, the application service can be started by an automated script. Once started, the service needs to discover the registry and register its own service information so that the service gateway can pick it up; this is called gateway access. The registry monitors the different states of services, performs health checks, and flags the ones that are unavailable (see the registration sketch after the list below).

- Gateway access
- Health check
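
A minimal sketch of service registration via an ephemeral ZooKeeper node, assuming a local ZooKeeper at localhost:2181 and an illustrative path whose parent nodes already exist; the ephemeral node disappears automatically when the session dies, which is how the registry notices that an instance is gone.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch of registering a service instance as an ephemeral znode.
// Connection string, path, and payload are illustrative assumptions, and the
// parent path /services/order-service is assumed to exist already.
public class ServiceRegistrar {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { /* ignore watch events */ });

        String instanceInfo = "10.0.0.12:8080";
        // Ephemeral node: removed automatically when this session expires,
        // so liveness of the session doubles as a health check.
        zk.create("/services/order-service/instance-1",
                  instanceInfo.getBytes("UTF-8"),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);

        Thread.sleep(Long.MAX_VALUE);   // keep the session (and the registration) alive
    }
}
```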

Downgrade: when users surge, we first act on the traffic side, that is, rate limiting. If, after limiting traffic, the system's responses still slow down in ways that may cause bigger problems, we also have to act on the service itself. Service downgrade means switching off functions that are not core right now, or relaxing precision where it is not critical, and then doing some manual remediation afterwards.

- Relax consistency constraints
- Turn off non-core services
- Simplify functionality

Circuit breaking: even after the measures above we may still feel uneasy, so we guard one level further. A circuit breaker is a form of self-protection against overload, just like a breaker tripping in a fuse box. For example, when our service keeps querying the database and a business problem causes query trouble, the database needs its own circuit breaker to make sure it is not dragged down by the application, returning a friendly message that tells the service to stop calling blindly.

- Closed state
- Half-open state
- Open state
- Circuit-breaker tool: Hystrix

Idempotency: the defining property of an idempotent operation is that executing it any number of times has the same effect as executing it once. We therefore assign a global ID to each operation so that, across multiple requests, we can recognize that they refer to the same client operation and avoid dirty data.

- Globally unique ID
- Snowflake
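
A minimal Snowflake-style ID sketch for illustration (the bit widths follow the common 41/10/12 convention; the custom epoch and worker id are assumptions): a millisecond timestamp, a worker id, and a per-millisecond sequence are packed into one 64-bit ID.

```java
// Minimal Snowflake-style ID sketch: 41 bits of millisecond timestamp,
// 10 bits of worker id, 12 bits of per-millisecond sequence.
public class SnowflakeIdGenerator {
    private static final long EPOCH = 1577836800000L;   // 2020-01-01, illustrative custom epoch
    private static final long WORKER_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long workerId) {
        this.workerId = workerId & ((1L << WORKER_BITS) - 1);
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {                      // sequence exhausted for this millisecond
                while (now <= lastTimestamp) {        // spin until the next millisecond
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```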

Data Scheduling

The biggest challenge of data storage is managing redundancy: too much redundancy makes the system inefficient and resource-hungry, while too few replicas cannot serve disaster recovery. Our usual practice is to separate state out of requests, turning stateful requests into stateless ones.

State transition

Separate state into global storage and turn requests into stateless traffic. For example, we usually cache login information in a global Redis middleware instead of duplicating user login data across multiple applications.

Sub-database and Sub-table Sharding

Data is scaled out horizontally.
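
A minimal sketch of routing a row to a shard by hashing the sharding key; the database and table names here are purely illustrative.

```java
// Minimal sketch of sub-database/sub-table routing: hash the sharding key
// and take it modulo the shard count to pick the physical database and table.
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    // e.g. returns "user_db_3.user_table_3" for a given userId
    public String route(long userId) {
        int shard = (Long.hashCode(userId) & Integer.MAX_VALUE) % shardCount;
        return "user_db_" + shard + ".user_table_" + shard;
    }
}
```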

Sharding and Partitioning

Multiple replicas provide redundancy.

Automated Operations

We touched on the DevOps trend when discussing resource requisition and management; achieving the integration of development and operations requires various kinds of middleware.

Configuration center

A global configuration center, organized by environment and managed uniformly, reduces the confusion of scattered configuration.

Switch, Diamond

Deployment Strategy

Distributed deployment is the norm for microservices. How do we make our services support the business better? A robust deployment strategy is the first thing to consider; the following strategies suit different businesses and different stages.

- Downtime deployment
- Rolling deployment
- Blue-green deployment
- Grayscale (canary) deployment
- A/B testing

Job Scheduling

Task scheduling is an indispensable part of any system. The traditional approach was to configure cron jobs on a Linux machine or to implement scheduling directly in business code, but this has now been replaced by mature middleware.

- SchedulerX
- Spring scheduled tasks

Application Management

A large part of operations work involves restarting applications, taking them online and offline, and cleaning up logs.

- Application restart
- Application offlining
- Log cleanup

Fault Tolerance

Since failures are routine in distributed systems, ways of handling them are indispensable. We usually have both active and passive approaches:

The active approach: when an error occurs, try again a few more times; if a retry succeeds, the error is avoided. The passive approach: the error has already happened, so we handle it promptly to keep the negative impact as small as possible.

Retry Design

The key to retry design is choosing the retry interval and the number of attempts: beyond a certain number of retries, or beyond a certain window of time, retrying is pointless. The open-source project spring-retry is a good way to implement a retry plan.
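
A minimal retry sketch with a capped attempt count and exponential backoff, similar in spirit to what spring-retry provides; this is my own plain-Java illustration, not the spring-retry API.

```java
import java.util.concurrent.Callable;

// Minimal retry sketch: bounded attempts with exponential backoff.
public class Retry {
    public static <T> T withBackoff(Callable<T> task, int maxAttempts, long initialDelayMs) throws Exception {
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;                       // remember the failure
                if (attempt == maxAttempts) break;
                Thread.sleep(delay);            // wait before the next attempt
                delay *= 2;                     // exponential backoff
            }
        }
        throw last;                             // retries exhausted: give up
    }

    public static void main(String[] args) throws Exception {
        String result = withBackoff(() -> {
            // Illustrative flaky operation, e.g. a remote call that may time out.
            if (Math.random() < 0.7) throw new RuntimeException("transient failure");
            return "ok";
        }, 5, 100);
        System.out.println(result);
    }
}
```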

Transaction compensation

Transaction compensation is in line with our concept of eventual consistency. A compensating transaction does not necessarily return the data to the state it was in at the start of the original operation; instead, it compensates for the work performed by the steps that completed successfully before the operation failed. The order of steps in a compensating transaction is not necessarily the reverse of the original order: for example, one data store may be more sensitive to inconsistency than another, so the step that compensates changes to that store should run first. Taking short, timeout-based locks on each resource required to complete the operation, and acquiring those resources in advance, increases the likelihood that the overall activity succeeds; work should only be performed after all resources have been acquired, and all operations must finish before the locks expire.
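
A minimal sketch of the idea, written only for illustration: each completed step registers an undo action, and if a later step fails, the recorded compensations are replayed (here in reverse order of completion, which is the common default, although as noted above the order may need to differ in practice).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal compensating-transaction sketch: every completed step registers an
// undo action; on failure, the recorded compensations are replayed.
public class CompensatingFlow {
    @FunctionalInterface interface Step { void run() throws Exception; }

    private final Deque<Runnable> compensations = new ArrayDeque<>();

    public void execute(Step step, Runnable compensation) throws Exception {
        step.run();                         // do the work
        compensations.push(compensation);   // remember how to undo it
    }

    public void compensateAll() {
        while (!compensations.isEmpty()) {
            compensations.pop().run();      // undo previously completed steps
        }
    }

    public static void main(String[] args) {
        CompensatingFlow flow = new CompensatingFlow();
        try {
            flow.execute(() -> System.out.println("reserve inventory"),
                         () -> System.out.println("release inventory"));
            flow.execute(() -> { throw new RuntimeException("payment failed"); },
                         () -> System.out.println("refund payment"));
        } catch (Exception e) {
            flow.compensateAll();           // bring the business flow back to a consistent state
        }
    }
}
```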

Full stack monitoring

Since a distributed system is many machines cooperating, and the network cannot be assumed fully available, we need monitoring that covers every layer, from the infrastructure up to the business, so that when an accident happens the fault can be repaired promptly and further problems avoided.

Infrastructure Layer

The infrastructure layer monitors container resources, including the load on each hardware metric.

CPU, I/O, memory, threads, throughput

Middleware

A distributed system integrates a large amount of middleware, and the health of the middleware itself also needs monitoring.

Application layer:

- Performance monitoring: at the application level, monitor each service's real-time metrics (QPS, RT), upstream and downstream dependencies, and so on.
- Business monitoring: beyond monitoring the application itself, business monitoring is another link in keeping the system healthy; with well-designed business rules, alarms are raised for abnormal situations.
- Link (trace) monitoring: Zipkin/EagleEye, SLS, GOC, Alimonitor

Fault Recovery

When a fault has occurred, the first thing to do is eliminate it immediately and make sure the service is available again; usually this means rolling back.

Application Rollback

Before rolling the application back, the fault site needs to be preserved for later troubleshooting.

Baseline Rollback

After the application service is rolled back, the code baseline also needs to be reverted to the previous version.

Version rollback

A full rollback requires service orchestration: the cluster is rolled back by its major version number.

Performance tuning

Performance optimization is a big topic for distributed systems and covers a very wide area; it could easily be a series of its own, so this section will not expand on it. Doing service governance is, in itself, already a process of performance optimization. See "The Knowledge System of High-Concurrency Programming".

Distributed lock

Caching is a powerful tool for solving performance problems: ideally, every request gets its result immediately without extra computation. From the CPU's three-level cache to distributed caches, caching is everywhere. What a distributed cache must solve is data consistency, which is where distributed locks come in: how the distributed lock is handled determines how efficiently cached data can be fetched.
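
A minimal Redis-based distributed lock sketch, assuming the Jedis 3.x client: SET with NX and PX acquires the lock atomically with a TTL, and a small Lua script releases it only if the caller still owns it. Production setups usually reach for a library such as Redisson rather than hand-rolling this.

```java
import java.util.Collections;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Minimal distributed-lock sketch over Redis (Jedis 3.x assumed).
public class RedisLock {
    private static final String RELEASE_SCRIPT =
        "if redis.call('get', KEYS[1]) == ARGV[1] then " +
        "  return redis.call('del', KEYS[1]) " +
        "else return 0 end";

    public static boolean tryLock(Jedis jedis, String key, String owner, long ttlMillis) {
        // NX: only set if absent; PX: expire after ttlMillis so a crashed owner cannot block forever.
        String reply = jedis.set(key, owner, SetParams.setParams().nx().px(ttlMillis));
        return "OK".equals(reply);
    }

    public static boolean unlock(Jedis jedis, String key, String owner) {
        // Release only if we still own the lock, to avoid deleting someone else's lock.
        Object result = jedis.eval(RELEASE_SCRIPT,
                Collections.singletonList(key),
                Collections.singletonList(owner));
        return Long.valueOf(1L).equals(result);
    }
}
```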

High concurrency

Multi-threaded programming improves the throughput of the system, but it also adds business complexity.

Async

Event-driven asynchronous programming is a different programming model that sidesteps the complexity of multi-threaded business handling and can improve the system's responsiveness.

Summary

Finally, if at all possible, prefer a single-node approach over a distributed system. Distributed systems come with failed operations: to cope with catastrophic failures we use backups, and to improve reliability we introduce redundancy.

The essence of a distributed system is a group of machines cooperating, and our job is to find every means of making them run as expected. Such a complex system requires understanding every link and integrating every piece of middleware, which is a very big project. Fortunately, in the microservice era most of this groundwork has already been done for us. When the distributed architecture described above is put into practice in a project, it can basically be built with the distributed three-piece suite: Docker + K8s + Spring Cloud.

The distribution map of core technologies of distributed architecture is as follows:

Source of original image: https://dzone.com/articles/deploying-microservices-spring-cloud-vs-kubernetes

Middleware used in the distributed technology stack:

Source of original image: https://dzone.com/articles/deploying-microservices-spring-cloud-vs-kubernetes

"Alibaba Cloud's native Wechat official account (ID:Alicloudnative) focuses on micro-services, Serverless, containers, Service Mesh and other technology areas, focuses on cloud native popular technology trends, and large-scale cloud native landing practices, and is the technical official account that best understands cloud native developers."
