This article introduces what microservices are. Many people run into these questions in real projects, so let's walk through them step by step. I hope you read it carefully and come away with something useful.
Splitting out the parts duplicated across multiple modules, or simply breaking up a bloated monolithic application, produces pieces that are deployed and maintained independently as standalone services. Those pieces are microservices.
The split naturally creates some new requirements:
Local method calls become remote procedure calls, so reliable communication is paramount.
As the split progresses, the scheduling relationships between resources grow more complex, and solid service governance becomes necessary.
The overall complexity of the call graph also brings greater risk: a chain reaction can trigger a service avalanche, so how to keep services stable must also be considered in a microservice architecture.
This last point is less an external demand than self-evolution: once services are in place, combining containerization and DevOps to unify development and operations greatly reduces the cost of maintaining microservices, now and in the future.
What does a microservice architecture look like?
Viewed macroscopically against a typical website architecture today, microservices sit in the middle layer (the parts circled in red in the original diagram).
They include the most basic pieces, such as the RPC framework, registry and configuration center, and, from a broader perspective, monitoring and tracing, a governance center, a scheduling center, and so on.
From the point of view of the microservice itself, it will roughly include the following modules:
Service registration and discovery
RPC remote call
Routing and load balancing
Service monitoring
Service governance
The prerequisite for microservices
Is something a microservice just because you adopt a microservice framework? That gives you the outward form of microservices, but not their essence, which is the "micro".
The prerequisite for microservices is that services are split finely enough that each has a single responsibility; of course, the degree of splitting and the service boundaries must be decided together with the business.
Broadly speaking, service splitting covers both application splitting and data splitting. After the application is split, we need to introduce a microservice framework for service communication and service governance, which is the traditional definition of microservices.
After the data is split, a series of supporting measures are also needed. Since they are not strongly related to microservices themselves, they are only listed briefly here:
Distributed ID
New table optimization
Data migration and data synchronization
Adjusting the SQL call scheme
Database cutover plan
Data consistency
Behind a complete microservice request
Now that we have an overall picture of the microservice architecture and the prerequisite of service splitting has been met, what does a complete microservice request involve?
It involves the three basic capabilities of a microservice framework:
Publishing and referencing of services
Registration and Discovery of Services
Remote communication of services
Publishing and referencing of services
The first problem we face is how to publish and reference a service. Concretely: what is the service's interface name, what are its parameters, what is the return type, and so on; in other words, the interface description.
Common ways of publishing and referencing include:
RESTful API / declarative RESTful API
XML
IDL
Generally speaking, no matter which method is used, it is necessary for the server to define and implement the interface, for example:
@exa(id = "xxx")
public interface testApi {
    @PostMapping(value = "/soatest/{id}")
    String getResponse(@PathVariable(value = "id") final Integer index,
                       @RequestParam(value = "str") final String Data);
}
The specific implementation is as follows:
public class testApiImpl implements testApi {
    @Override
    public String getResponse(final Integer index, final String Data) {
        return "ok";
    }
}
Declarative RESTful API: this approach usually invokes services over HTTP or HTTPS and its performance is relatively poor.
First the server defines and implements the interface; the service provider then publishes the service through a Servlet using a framework such as RESTEasy, while the service consumer simply references the defined interface to make calls.
There is also a Feign-like variant, in which the server side publishes via a Spring MVC Controller and the framework only provides templated HTTP calls on the client side.
In this case, the interface definition must be kept in agreement with the server-side Controller, so that the client can reference the API directly and initiate calls, as in the sketch below.
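As an illustration, below is a minimal sketch of such a declarative client. The annotations follow Spring Cloud OpenFeign; the service name user-service, the path and the method are hypothetical and must match whatever the provider's Controller actually declares.

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

// Hypothetical declarative client: the interface mirrors the provider's
// Spring MVC Controller, and the framework turns each method into an HTTP call.
@FeignClient(name = "user-service")
public interface UserClient {

    // The path and parameter bindings must stay in agreement with the provider's Controller.
    @GetMapping("/users/{id}")
    String getUser(@PathVariable("id") Long id);
}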
XML: teams using a private RPC protocol often choose XML configuration to describe the interface, which is more efficient; Dubbo and Motan are examples.
As above, the server defines and implements the interface, then exposes it through server.xml, and the service consumer references the interfaces it needs through client.xml.
However, this approach is quite intrusive to business code, and when the XML configuration changes, both the service consumer and the service provider have to be updated.
IDL: an interface description language, often used for cross-language calls. The most common IDLs are Thrift and gRPC.
For example, gRPC uses Protobuf to define the interface: you write a .proto file, then the protoc plugin for each language generates the server-side and client-side code, which can be used directly.
However, if there are many parameter fields, the .proto file becomes very large and hard to maintain. And if fields need to change frequently, for example when fields are deleted, Protobuf does not stay forward compatible.
A few tips: whichever approach is used, the service consumer must be notified when the interface changes. Consumers' strong dependence on the API is hard to avoid, and call failures caused by interface changes are common.
So when changes are needed, prefer adding new interfaces, or define a version number for each interface. In practice, most teams choose RESTful APIs externally, XML internally, and IDL for cross-language calls.
Some problems: there are still many issues when actually publishing and referencing services, most of them related to configuration.
For example, take a simple interface call timeout: should it be configured at the service level or the interface level? On the service provider's side or the service consumer's side?
In practice, most service consumers ignore these settings, so the service provider should supply a default configuration template, that is, a predefined configuration.
After inheriting the provider's predefined configuration, each service consumer also needs to be able to override it with its own settings.
However, suppose a service has 100 interfaces, each with its own timeout configuration, and the service has 100 consumers. When a service node changes, the registry produces 100 × 100 notification messages. That is terrible and may cause a network storm.
Registration and Discovery of Services
Suppose you have released the service and deployed the service on a machine, how can the consumer find the address of your service?
Some might say DNS, but DNS has many drawbacks:
Troublesome to maintain, and updates are delayed
Unable to do load balancing on the client
Unable to achieve port-level service discovery
In fact, distributed systems have a very important component, the registry, which exists to solve exactly this problem.
The procedure for addressing and calling using the registry is as follows:
When the service starts, it registers itself with the registry and regularly sends heartbeats to report that it is alive.
When the client calls the service, it subscribes to the service in the registry, caches the node list locally, and establishes connections with the servers (lazily, of course). When initiating a call, it selects a server from the locally cached node list using a load balancing algorithm and sends the request.
When the server node changes, the registry can sense it and notify the client.
The implementation of the registry mainly needs to consider the following issues:
Consistency and availability
Registration mode
Storage structure
Service health monitoring
Status change notification
① Consistency and availability
An old proposition in distributed systems is CAP (consistency, availability, partition tolerance).
We know that all three cannot be satisfied at the same time, so a trade-off is needed. Common registries roughly divide into CP registries and AP registries.
CP registries: typical examples are Zookeeper, etcd and Consul. They sacrifice availability to guarantee consistency, using the Zab or Raft protocol.
AP registries: they sacrifice consistency to guarantee availability; Eureka is about the only one that comes to mind. Each Eureka server keeps its own node list independently, so inconsistencies may occur.
In theory, for a pure registry the AP type is far more appropriate than the CP type: availability matters much more than consistency, eventual consistency is enough, and various fault-tolerance strategies can compensate for temporary inconsistency.
In practice there are many ways to ensure high availability, such as cluster deployment or multi-IDC deployment. Consul is a typical example of multi-IDC deployment for availability; it uses wan gossip to keep state synchronized across data centers.
② Registration method
There are two ways to interact with the registry: integrate an SDK inside the application, or interact with the registry indirectly from outside the application.
In-application: this is probably the most common way; both the client and the server integrate the relevant SDK to interact with the registry.
For example, if you choose Zookeeper as the registry, you can use the Curator SDK for service registration and discovery, as in the sketch below.
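Here is a minimal registration-and-lookup sketch using Curator's curator-x-discovery module; the connection string, the base path /services and the service name order-service are assumptions for illustration.

import java.util.Collection;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.x.discovery.ServiceDiscovery;
import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
import org.apache.curator.x.discovery.ServiceInstance;

public class CuratorRegistration {
    public static void main(String[] args) throws Exception {
        // Connect to Zookeeper (address is an assumption) with exponential backoff retries.
        CuratorFramework client = CuratorFrameworkFactory
                .newClient("127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Describe this provider instance and register it under /services.
        ServiceInstance<Void> instance = ServiceInstance.<Void>builder()
                .name("order-service")
                .address("10.0.0.12")
                .port(8080)
                .build();
        ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder.builder(Void.class)
                .client(client)
                .basePath("/services")
                .thisInstance(instance)
                .build();
        discovery.start();

        // A consumer can now look up all registered instances of the service.
        Collection<ServiceInstance<Void>> nodes = discovery.queryForInstances("order-service");
        System.out.println("order-service nodes: " + nodes.size());
    }
}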
Out-of-application: Consul offers an out-of-application solution. A Consul Agent or the third-party Registrator can monitor service status and take care of registering or deregistering the service provider.
Meanwhile, Consul Template can periodically pull the node list from the registry and refresh the load balancer configuration (such as an Nginx upstream), which effectively completes load balancing on the service consumer side.
③ Storage structure
The information stored in the registry generally uses a directory-like hierarchical structure, usually divided into service - interface - node information.
Registries also usually support grouping; the concept of a group is very broad and can be by data center or by environment.
Node information mainly includes the node's address (IP and port), plus other per-node data such as the number of retries on failure, timeout settings, and so on.
In many cases the interface layer is omitted, because with a large number of interfaces, too many nodes cause problems such as the network storm mentioned earlier.
④ Service health monitoring
Monitoring whether services are alive is also a necessary registry function. In Zookeeper, each client keeps a long-lived connection with the server and creates a Session.
Within the Session timeout, the client regularly sends heartbeats to the server to check that the link is healthy, and the server pushes back the expiration time of the Session.
If no heartbeat from the client is seen within the Session timeout, the node is considered unavailable and removed from the node list.
⑤ Status change notification
Once the registry can detect service health, it also needs to notify clients of status changes. In Zookeeper, service changes can be received through the process method of a listener (Watcher), as in the sketch below.
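As a rough illustration, a consumer-side listener might look like the following sketch; the watched path is an assumption, and since Zookeeper watches are one-shot, the watcher re-registers itself when it re-reads the node list.

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: react to child-node changes under a service path and refresh the local cache.
public class NodeChangeWatcher implements Watcher {

    private final ZooKeeper zk;

    public NodeChangeWatcher(ZooKeeper zk) {
        this.zk = zk;
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                // Re-read the node list and register this watcher again (watches fire only once).
                List<String> nodes = zk.getChildren(event.getPath(), this);
                // ... update the locally cached node list with `nodes` here
            } catch (Exception e) {
                // in real code: handle session expiry / reconnect and retry
            }
        }
    }
}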
Remote communication of services
At this point the service consumer has correctly referenced the service and found its address; how does it actually send a request to that address?
To solve remote communication between services, several issues need to be considered:
Network I/O handling
Transport protocol
Serialization mode
① Network I/O handling
Simply put: how does the client send the request, and how does the server handle it?
From the client side, the connection can be created as soon as we get the node information from the registry, but more often it is created on the first request. In addition, we usually maintain a connection pool per node for connection reuse.
If calls are asynchronous, we also need to number each request and maintain a request pool so that the matching request can be found when the response returns. Of course, this is not something you have to build yourself; many frameworks, such as rxNetty, handle it for us.
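A bare-bones version of that request pool might look like the following sketch: each outgoing request is assigned an id, and the I/O thread completes the matching future when the response carrying that id comes back.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class RequestPool {

    private final AtomicLong idGenerator = new AtomicLong();
    private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Assign an id to an outgoing request and remember its future; the id travels with the request.
    public long register(CompletableFuture<String> future) {
        long id = idGenerator.incrementAndGet();
        pending.put(id, future);
        return id;
    }

    // Called by the I/O thread when a response frame carrying this id arrives.
    public void complete(long id, String response) {
        CompletableFuture<String> future = pending.remove(id);
        if (future != null) {
            future.complete(response);
        }
    }
}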
On the server side, request handling goes back to the five Unix I/O models. We can use network frameworks such as Netty or MINA to handle server-side requests, or, if you are really interested, implement a communication framework yourself.
② Transport protocol
The most common choice is plain HTTP: users do not need to pay attention to or understand the protocol's internals, which is convenient and direct, but performance naturally suffers.
There is also the popular HTTP/2 protocol, which has many excellent features such as binary framing, header compression and multiplexing.
However, judging from our own practice, HTTP/2 still has some way to go before production use. The simplest example: after upgrading to HTTP/2, all header names become lowercase rather than case-insensitive, which causes compatibility problems.
Of course, if you want more efficient and more controllable transport, you can define a private protocol on top of TCP. Designing a private protocol requires that both sides of the communication understand its characteristics; you also need to reserve extension fields and handle issues such as sticky packets and packet splitting.
③ Serialization
Before and after network transmission, the sender usually needs to encode and the receiver to decode, mainly to reduce the amount of data transmitted over the network.
Common serialization formats include text formats such as XML/JSON and binary formats such as Protobuf/Thrift.
When choosing a serialization format, consider:
First, performance: Protobuf's compressed size and compression speed are much better than JSON's, and its overall performance is better.
Second, compatibility: JSON is relatively more tolerant of changes, so it suits scenarios where the interface changes frequently.
Whatever serialization you use, it is important to understand its characteristics and to manage the boundaries carefully when interfaces change.
For example, Jackson's FAIL_ON_UNKNOWN_PROPERTIES property, Kryo's CompatibleFieldSerializer, the strict serialVersionUID comparison in JDK serialization, and so on.
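For instance, with Jackson a consumer can be made tolerant of fields added by the provider; a minimal sketch (the class name is just for illustration):

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TolerantMapper {
    // With FAIL_ON_UNKNOWN_PROPERTIES disabled, a consumer compiled against an older DTO
    // simply ignores fields the provider has since added instead of failing the whole call.
    public static final ObjectMapper MAPPER = new ObjectMapper()
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
}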
Stability of microservices
When a monolithic application is transformed into multiple microservices, more things can go wrong during a request, and a problem may occur at every link of the communication.
If problems are left unhandled, a chain reaction can lead to a service avalanche. Service governance exists to deal with exactly this kind of problem.
We will look at this from the three roles in microservices: the registry, the service consumer and the service provider.
How to ensure the stability of the registry
The registry is mainly responsible for maintaining node state and for the corresponding change detection and notification.
On the one hand, the stability of the registry itself is very important. On the other hand, we cannot rely entirely on the registry; we also need fault drills, for example verifying that microservices keep working normally when the registry is completely down.
In this section we focus not on the availability of the registry itself but on the parts related to node state.
① Guaranteeing node information
As mentioned, the microservice framework must keep working even when the registry is completely down. This relies on some mechanisms inside the framework for handling node state.
Local memory: first, the service consumer keeps the node state in local memory.
On the one hand, node state does not change that frequently, so keeping it in memory reduces network overhead; on the other hand, when the registry goes down, the service consumer can still find the service node list in local memory and make calls.
Local snapshot: we said that after the registry goes down, the service consumer can still find the node list in local memory. But what if the service consumer itself restarts?
That is where a local snapshot comes in: we save a copy of the node state to a local file and restore it into memory after each restart, as in the sketch below.
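A minimal snapshot sketch, assuming a fixed snapshot path and a simple one-node-per-line format:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public class NodeSnapshot {

    private static final Path SNAPSHOT = Paths.get("/tmp/service-nodes.snapshot");

    // Persist the current node list (one "ip:port" per line) after each registry update.
    public static void save(List<String> nodes) throws IOException {
        Files.write(SNAPSHOT, nodes, StandardCharsets.UTF_8);
    }

    // On restart, restore the last known node list before the registry is even reachable.
    public static List<String> restore() throws IOException {
        return Files.exists(SNAPSHOT)
                ? Files.readAllLines(SNAPSHOT, StandardCharsets.UTF_8)
                : Collections.emptyList();
    }
}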
② Removing service nodes
Now, whether or not the registry is working, we can obtain the service nodes. But are all of those nodes actually usable?
In practice this deserves a question mark. If we do not verify the correctness of service nodes, we are likely to call an abnormal node, so some node management is needed.
For node management we have two main means of removing faulty service nodes.
Registry-side removal: one way is to remove nodes through the registry. The service provider keeps a heartbeat with the registry; if no heartbeat is received within a certain period, the registry decides the node has a problem, removes it from the service list and notifies the service consumers, so consumers no longer call the faulty node.
Consumer-side removal: the other way is to remove the node on the service consumer side. The consumer itself knows best whether a node is usable, so it is more reasonable to make the judgment there: if a call from the consumer hits a network exception, the node is removed from the in-memory cache list.
Of course, details such as how many failed calls trigger removal and how long before the node is restored are similar to a client-side circuit breaker and can be combined with it.
Generally, for high-traffic applications, consumer-side removal is more sensitive than registry-side removal, and no synchronization between the two is needed, because registry-side removal will automatically cover consumer-side removal after a while.
③ Can service nodes be removed or changed at will?
In the previous section we removed problem nodes to keep traffic away from them. But can nodes be removed at will? And can they be updated at will?
Frequent changes: when the network jitters, the nodes in the registry keep changing. Change notifications are pushed to service consumers over and over, and the consumers keep flushing their local caches.
If a service provider has 100 nodes and 100 consumers, the effect of frequent changes can be 100 × 100 notifications, saturating the bandwidth.
Here the registry can apply some controls, for example batching change notifications over a period of time, blocking notifications entirely behind a switch, or using a probabilistic calculation to decide which consumers need to be notified.
Incremental updates: also because of the network storms caused by frequent changes, a feasible approach is incremental updates, in which the registry pushes only the changed node information rather than the full list, to avoid network storms when changes are frequent.
Too few available nodes: when the network jitters and nodes are removed, there may well be too few nodes left.
Then too much traffic is allocated to too few nodes, overloading the remaining nodes and making the situation even worse.
In fact, most of the nodes may still be usable; they merely failed to keep their registry heartbeats in time because of the network problem.
At this point the service consumer needs a protection ratio threshold: when the registry notifies it to remove nodes but the remaining cached nodes would drop below a certain proportion (compared with a recent period), no more nodes are removed, so that enough nodes remain to serve traffic normally.
This value can be set fairly high, say 70%, because under normal circumstances there is no frequent network jitter. Of course, if developers really do need to take most nodes offline, the switch can be turned off. A minimal sketch of this protection follows.
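A minimal sketch of that protection switch, with an assumed 70% threshold:

import java.util.List;

public class NodeRemovalGuard {

    private static final double PROTECTION_RATIO = 0.7;  // assumed threshold, tune per service

    // cachedNodes: the consumer's current cache; nodesFromRegistry: the list after the removal notice.
    public static List<String> apply(List<String> cachedNodes, List<String> nodesFromRegistry) {
        if (nodesFromRegistry.size() < cachedNodes.size() * PROTECTION_RATIO) {
            // Too many nodes would disappear at once: likely network jitter, so keep the old list.
            return cachedNodes;
        }
        return nodesFromRegistry;
    }
}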
How to ensure the stability of service consumers
When a request fails, the most directly affected party is the service consumer. So what can be done on the consumer side?
① Timeouts
If we call an interface and never get a response, we need a timeout so that we are not dragged down by the remote call.
Choosing the timeout value takes care. If it is too long, it offers little protection and the risk of being dragged down is high; if it is too short, normal requests may be misjudged as failures and the error rate rises sharply.
In practice we can use the application's P999 latency over a recent period, or P95 × 2; the specific value has to be decided case by case.
When setting timeouts, synchronous and asynchronous interfaces also differ. For a synchronous interface, the timeout has to take both the downstream interfaces and the upstream interface into account.
For an asynchronous interface, because the call returns quickly, you do not have to consider the upstream interface; you only need to consider the blocking time in the asynchronous thread, so the timeout can be looser.
② Fault tolerance
A request call can never be guaranteed to succeed, so how can the service consumer tolerate failures?
Fault tolerance mechanisms usually fall into the following categories:
FailTry: retry on failure. This is the most common retry mechanism: when a request fails, it is simply retried.
Probabilistically, this makes the failure rate drop exponentially. The number of retries also needs a sensible value; too many retries can make the service deteriorate further.
Combined with timeouts, services with strict latency requirements can also initiate a retry before the timeout fires, to improve the request's chances probabilistically. Of course, the precondition for any retry is idempotency.
FailOver: switch on failure. The strategy is similar, except that FailTry retries on the current instance, while FailOver re-selects a node from the available node list according to the load balancing algorithm and retries there.
FailFast: fail fast. If the request fails, report the error directly or just record it in an error log; there is not much more to say.
There are many other fault-tolerance mechanisms, most of them customized for specific business characteristics and mostly variations on retry, for example increasing the wait time exponentially on each retry.
Third-party frameworks also have built-in default fault tolerance. For example, Ribbon's fault tolerance consists of retry and retry-next, that is, retry the current instance and then retry the next instance.
One more note here: in Ribbon, the retry count and the retry-next count combine as a Cartesian product. A retry sketch with exponential backoff follows.
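As an illustration of FailOver combined with exponential backoff, here is a minimal hand-rolled sketch; it assumes the call is idempotent and simply picks nodes at random, and it is not how Ribbon itself is implemented.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

public class FailoverInvoker {

    // maxAttempts must be at least 1; initialBackoffMillis is the wait before the next attempt.
    public static <T> T invoke(List<String> nodes, Function<String, T> call,
                               int maxAttempts, long initialBackoffMillis) throws InterruptedException {
        long backoff = initialBackoffMillis;
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // FailOver: each attempt may hit a different node from the available list.
            String node = nodes.get(ThreadLocalRandom.current().nextInt(nodes.size()));
            try {
                return call.apply(node);
            } catch (RuntimeException e) {
                last = e;
                Thread.sleep(backoff);   // wait before the next attempt
                backoff *= 2;            // exponential backoff between retries
            }
        }
        throw last;
    }
}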
③ Circuit breaking
The fault-tolerance mechanisms in the previous section are mainly retries, which work well for errors caused by transient factors such as the network.
However, if the error is caused by a failure of the service provider itself, retrying will only make the service deteriorate further.
This is where the circuit breaker comes in: stop making calls for a period of time, give the service provider time to recover, and resume calls once it is back to normal. This protection greatly reduces the chance of a service avalanche caused by cascading failures.
In practice, circuit breakers usually have three states: open, half-open and closed, as in the well-known diagram by Martin Fowler.
Normally the breaker is closed and requests go through. When failures reach a certain threshold, the breaker opens and calls to the service provider are forbidden.
After the breaker has been open for a while, it enters a half-open state; in this state a successful request closes the breaker, while a failed one reopens it to wait for the next half-open cycle.
The most important part of implementing a circuit breaker is the failure threshold. Depending on business requirements, the failure condition can be a number of consecutive failed calls, or a failure ratio within a time window computed with a sliding-window algorithm.
Some tricks can also be played with the half-open period; a common approach is to grow the period exponentially with the number of failures.
The concrete implementation can be tailored to the business, or you can choose a third-party framework such as Hystrix. A hand-rolled sketch of the state machine follows.
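To make the state machine concrete, here is a minimal hand-rolled sketch of the three states; the failure threshold and the open period are assumed values, and real implementations such as Hystrix add sliding windows, metrics and isolation on top.

public class SimpleCircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private static final int FAILURE_THRESHOLD = 5;        // consecutive failures before opening
    private static final long OPEN_PERIOD_MILLIS = 10_000; // how long the breaker stays open

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt;

    // Ask before each call; once the open period has elapsed, a single probe is let through (half-open).
    public synchronized boolean allowRequest() {
        if (state == State.OPEN && System.currentTimeMillis() - openedAt >= OPEN_PERIOD_MILLIS) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    public synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;            // a successful probe closes the breaker
    }

    public synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= FAILURE_THRESHOLD) {
            state = State.OPEN;          // open (or re-open) and wait for the next half-open cycle
            openedAt = System.currentTimeMillis();
        }
    }
}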
④ Isolation
Isolation is often used together with circuit breaking. Taking Hystrix as an example again, it provides two isolation modes:
Semaphore isolation: use semaphores to limit concurrency; different resources can be given different semaphores so that they limit concurrency and are isolated from each other. In practice it is not much different from using an atomic counter.
Thread pool isolation: isolate resources by giving each its own thread pool. It consumes more resources but copes better with burst traffic.
⑤ Degradation
Degradation is also mostly used together with circuit breaking. When the caller's circuit breaker is open, calls can no longer reach the service provider; returning degraded data instead limits the impact of the open breaker.
Degradation is mostly used in businesses with high error tolerance. How to choose the degraded data is also an art in itself.
One method is to pre-define acceptable degraded data for each interface, but this static approach has limited applicability.
Another is to fetch the last correct response from an online log system or traffic recording system and use it as the degraded data; the key here is having a log system or traffic-sampling system that reliably captures requests.
In addition, degradation is usually put behind an operations switch: degradations with small impact can happen automatically, while those with larger impact require human intervention.
How to ensure the stability of service providers
① Rate limiting
Rate limiting means limiting the request traffic to a service. The service provider sets a request threshold based on its own situation (capacity), and requests beyond the threshold are discarded, which keeps the service itself running normally.
The threshold can be considered from two angles:
QPS, that is, requests per second
Number of concurrent threads
In practice we tend to choose the latter, because a high QPS is often simply the result of high processing power and does not by itself mean the system is overwhelmed.
There are also many rate-limiting algorithms; for example, the token bucket and leaky bucket algorithms are mainly optimized for burst traffic.
Third-party implementations such as Guava's RateLimiter implement the token bucket algorithm; a minimal sketch follows, without going into the details here.
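A minimal provider-side sketch using Guava's RateLimiter; the 200-permits-per-second capacity and the reject-on-overflow behavior are assumptions for illustration.

import com.google.common.util.concurrent.RateLimiter;

public class ProviderRateLimit {

    // Token bucket: up to 200 requests per second on this instance (assumed capacity).
    private static final RateLimiter LIMITER = RateLimiter.create(200.0);

    public static String handle(String request) {
        if (!LIMITER.tryAcquire()) {
            // Over the threshold: discard the request instead of queueing, to protect the provider.
            return "rejected";
        }
        return "ok";   // stand-in for real business logic
    }
}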
② Restart and rollback
Rate limiting is more of a safeguard, but what if the service provider already has a problem?
There are two situations. One is a code bug: on one side the service consumer needs circuit breaking and degradation, and on the other the service provider, together with DevOps, needs the ability to roll back quickly to the last correct version.
More often, we only hit a single-machine failure that is not strongly related to the code; a simple and crude remedy is an automatic restart.
For example, if the average latency of an interface is observed to be outside the normal range, the instance is restarted automatically.
Of course, automatic restarts need a lot of care, for example whether the restart window should be at night, and they can cause the same problems as the node removal described above, which must be considered and handled.
In hindsight, if the scene was not preserved at the time, it is hard to locate the root cause of the problem later. So we usually need on-site evidence preservation before a one-click rollback or automatic restart.
On-site protection can be automatic, for example:
Add GC logging parameters to the JVM from the start: -XX:+PrintGCDetails
Or dump the heap on out-of-memory: -XX:+HeapDumpOnOutOfMemoryError
It can also be done with DevOps automatic script, or manually.
Generally speaking, we will do the following:
Print stack traces: jstack -l <java process PID>
Dump the heap: jmap -dump:format=b,file=hprof <java process PID>
Keep gc log and business log
③ Traffic scheduling
Besides the measures above, steering traffic away from problem nodes is also a very common means.
When one service provider instance has a problem and the other machines are normal, we can quickly set that machine's weight to 0 in the load balancing algorithm so that no traffic flows to it, and then troubleshoot the machine at leisure instead of having to restart it right away.
If the service provider is divided into different clusters / groups, then when one cluster has a problem we can route traffic to a normal cluster through the routing algorithm; here a cluster corresponds to a microservice group.
When an IDC fails, say the data center goes down or a fiber cable is cut, and we have a multi-IDC deployment, we can switch the traffic to a healthy IDC in several ways so that the service keeps running normally.
Switching traffic can also be done through microservice routing, in which case an IDC corresponds to a microservice group.
In addition, traffic can be switched through DNS resolution, moving the external domain name's VIP from one IDC to another.
That is the end of "what is a microservice". Thank you for reading.