The inflection point has arrived, and cloud native leads the digital transformation and upgrading.


Author | Yi Li, Senior Technical Expert, Alibaba Cloud

This article is based on the talk "The Inflection Point Has Arrived: Cloud Native Leads Digital Transformation and Upgrading," delivered by Yi Li at the 2019 Ctrip Technology Summit.

Follow the "Alibaba Cloud Native" official account and reply with the keyword "transformation" to download the slides for this article.

The topic I would like to share with you today is "The inflection point has arrived: cloud native leads digital transformation and upgrading." First, a brief self-introduction. I am Yi Li, from the Alibaba Cloud container platform team, and I have been in charge of Alibaba Cloud's container products since 2015. Before that, I worked at IBM for 14 years, mainly on product development for enterprise middleware and cloud computing.

Today I will share some of our thinking on cloud native, organized around four trends in its development:

Embrace serverless: extreme elasticity with no operations burden.

Service mesh: decouple service governance from applications and sink it into the infrastructure layer.

Standardized cloud native application management: build an efficient, automated, and reliable application delivery system.

Borderless computing: efficient collaboration across cloud, edge, and IoT devices.

Basic concepts of cloud native

Let's start with a brief introduction to some basic concepts of cloud native.


For the many customers we work with, whether to move to the cloud is no longer the question; the question is how: how to make full use of the cloud's capabilities and maximize its value. In the era of All in Cloud, technology capability has become a core competitive advantage for enterprises, and they are eager to use the cloud as a multiplier for their IT capabilities.

Cloud native is a set of best practices and methodologies for building scalable, robust, loosely coupled applications in public and proprietary cloud environments, enabling faster innovation and low-cost trial and error. New computing paradigms such as containers, service meshes, and serverless computing continue to emerge around it.

Containers are the prelude to cloud native technology:

Docker images have become the standard for application distribution and delivery, decoupling applications from the underlying runtime environment.

Kubernetes has become the standard for distributed resource scheduling and orchestration. It shields applications from differences in the underlying infrastructure, allowing them to run on any of it.

On this basis, the community began building higher-level application abstractions. At the service governance layer, for example, Istio is becoming the network stack for service-to-service communication, decoupling service governance capabilities from the application layer.

On top of this, domain-specific cloud native frameworks are emerging rapidly, such as Kubeflow for machine learning and Knative for serverless workloads. Through this layered architecture, developers can focus on their own business logic without worrying about the complexity of the underlying implementation.

We can see the prototype of a cloud native operating system beginning to emerge. This is the best of times for developers, and it greatly accelerates business innovation.

In the early days, Kubernetes mainly ran stateless web applications, such as microservice applications based on Apache Dubbo or Spring Cloud. Now more and more core business, data intelligence, and innovation workloads run on Kubernetes as well.

For example, Alibaba Cloud's own cloud products, such as the enterprise distributed application service EDAS, the real-time computing platform Flink, the elastic AI algorithm service EAS, and the blockchain platform BaaS, are themselves deployed on Alibaba Cloud's Kubernetes service, ACK.

K8s has become an operating system for the cloud era and the interface through which applications consume cloud infrastructure. Alibaba Cloud ACK is deeply integrated with and optimized for this infrastructure, providing an agile, elastic, and portable cloud native application platform with consistent application deployment and management across public, proprietary, and edge clouds.

From containers to serverless: Serverless Kubernetes

Let's talk about the Serverless evolution of Kubernetes.

Everyone likes the power and flexibility that K8s provides, but the operation and maintenance of a Kubernetes production cluster is extremely challenging.

Alibaba Cloud's Kubernetes service ACK simplifies K8s cluster lifecycle management and hosts the cluster's master nodes, but users still have to keep a worker node resource pool: maintaining the nodes, applying security patches, and planning resource capacity around their own usage.

Given the complexity of K8s operations, Alibaba Cloud launched the Serverless Kubernetes container service ASK. It is fully compatible with existing K8s container applications, but all container infrastructure is managed by Alibaba Cloud, letting users focus on their own applications. It has several characteristics:

First, users reserve no resources and pay only for the resources their container applications actually consume; there is no concept of nodes, so node maintenance drops to zero; and all resources are created on demand, with no capacity planning needed.

Serverless Kubernetes greatly reduces operational complexity, and its design is well suited to bursty workloads such as CI/CD and batch computing. For example, a typical online-education customer deploys teaching applications on demand around class schedules and releases the resources automatically when classes end; the overall computing cost is a fraction of what keeping monthly-subscription nodes would cost.

Cloud-scale nodeless architecture: Viking

How is this achieved? When we started the Serverless Kubernetes project at the end of 2017, we kept asking: if Kubernetes had grown up on the cloud, how should its architecture be designed? Internally we code-named the product Viking, because ancient Viking warships were famous for being fast and easy to operate.

First, we wanted to be compatible with Kubernetes. Users can use Kubernetes's declarative API directly, and existing Kubernetes application definitions, such as Deployment, StatefulSet, Job, and Service, work without modification.
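
For instance, a plain Deployment like the following minimal sketch (the name and image are illustrative) should run on ASK exactly as it would on any Kubernetes cluster:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-demo            # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-demo
  template:
    metadata:
      labels:
        app: web-demo
    spec:
      containers:
      - name: web
        image: nginx:1.17   # illustrative image
        ports:
        - containerPort: 80
```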

Second, the underlying layer leans as far as possible on cloud infrastructure and cloud services for compute, storage, networking, and resource scheduling. This fundamentally simplifies the container platform's design, raises its scale ceiling, and reduces the user's operational burden. We follow the Kubernetes controller design pattern, driving the state of the underlying IaaS resources to converge on the state declared by the user's application.

At the resource layer we provide the Elastic Container Instance (ECI). Unlike Azure Container Instances (ACI) and AWS Fargate, ECI provides native support for the Kubernetes Pod rather than a standalone container instance. ECI runs Pods in a sandbox based on lightweight virtual machines for security isolation, and is fully compatible with Pod semantics, supporting multiple containers, health checks, startup ordering, and more. This makes the K8s compatibility layer above it simple and direct to build.

At the orchestration and scheduling layer, we use Microsoft's Virtual Kubelet and extend it deeply. Virtual Kubelet provides an abstract controller model that simulates a Kubernetes node. When a Pod is scheduled onto the virtual node, the controller calls the ECI service to create an ECI instance to run the Pod. The controller also synchronizes state in both directions: if a running ECI instance is deleted, the controller recreates a new one according to the application's target state.
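
A hedged sketch of what steering a Pod onto the virtual node can look like from the user's side; the `type: virtual-kubelet` label and the `virtual-kubelet.io/provider` taint follow the common virtual-kubelet convention, and actual names may differ per cluster setup:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: eci-demo
spec:
  nodeSelector:
    type: virtual-kubelet            # assumed label on the virtual node
  tolerations:
  - key: virtual-kubelet.io/provider # tolerate the virtual node's taint
    operator: Exists
  containers:
  - name: app
    image: nginx:1.17                # illustrative image
```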

We also implement the behavior of kube-proxy, kube-dns, and the Ingress controller on top of Alibaba Cloud services, providing complete Kubernetes Service support:

For example, Alibaba Cloud's DNS service PrivateZone dynamically configures DNS resolution for ECI instances and supports Headless Services; Server Load Balancer (SLB) on the private network provides ClusterIP and load balancing; and SLB's layer-7 routing implements Ingress routing rules.

We also provide end-to-end observability for ECI, deeply integrated with Alibaba Cloud Log Service, CloudMonitor, and other services, and ECI workloads can easily be scaled horizontally with HPA.

Container startup acceleration: "zero-second" image download

For serverless container technology, application startup speed is a core metric. Containers affect application startup time mainly in two ways:

Resource preparation: by optimizing the end-to-end management links and tailoring the virtualization stack and operating system for container scenarios, ECI can cut resource preparation time to seconds.

Image download time: downloading images from the Docker image repository and decompressing them locally is a very time-consuming operation. Download time depends on the size of the image, usually ranging from 30 seconds to a few minutes.

In traditional Kubernetes, the worker node caches downloaded images locally, so later starts skip the download and decompression. To maximize elasticity and cost efficiency, ECI and ECS adopt a pooled architecture with compute-storage separation, which means local disks cannot be used to cache container images in the traditional way.

To this end, we have implemented an innovative scheme: the container image can be made into a data disk snapshot.

When an ECI instance starts, if an image snapshot exists, a read-only data disk is created directly from the snapshot and mounted automatically as the instance boots, and the container application uses the mounted data disk directly as its rootfs. Built on the Pangu 2.0 architecture and the extreme I/O performance of Alibaba Cloud ESSD cloud disks, this reduces image loading time to under one second.

To simplify operation, we provide a CRD in K8s that lets users declare which images need snapshots built. On the software delivery pipeline of the ACR image registry service, users can likewise declare which images should be accelerated, so that whenever a new image is pushed, the corresponding snapshot cache is built automatically.
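
As a sketch of what such a declaration can look like, modeled on the ECI ImageCache CRD (the group/version eci.alibabacloud.com/v1 and field names are assumptions; check the CRDs installed in your cluster):

```yaml
apiVersion: eci.alibabacloud.com/v1
kind: ImageCache
metadata:
  name: transcoder-cache
spec:
  images:   # images to bake into a data disk snapshot
  - registry.cn-hangzhou.aliyuncs.com/demo/transcoder:v1   # illustrative image
```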

Extreme elasticity

Let's talk about elasticity. For the vast majority of enterprises, elasticity is the most important reason to move to the cloud. Double 11 is a classic pulse-computing scenario, with peak resource demand many times higher than usual. There are also unexpected spikes, such as a game suddenly going viral and needing rapid scale-out on the cloud. Kubernetes helps maximize the cloud's elasticity.

ACK provides rich elasticity strategies at both the resource layer and the application layer. At the resource layer, the mainstream approach today is horizontal node scaling with cluster-autoscaler: when a Pod cannot be scheduled for lack of resources, cluster-autoscaler selects a scaling group and automatically adds an instance to it.

Within the auto scaling group, we can scale out ECS virtual machines, X-Dragon (Shenlong) bare metal instances, or GPU instances according to application load. Spot instances deserve a special mention: they tap Alibaba Cloud's spare computing capacity at discounts of up to 90% off the pay-as-you-go price.

Spot instances are a great fit for stateless, fault-tolerant applications such as batch data processing or video rendering, and can cut computing costs dramatically. On top of Alibaba Cloud's powerful elastic compute capacity, we can scale to 1,000 nodes within minutes.

Combined further with the ECI described above, ACK supports auto scaling based on virtual nodes. Virtual Kubelet registers as a virtual node with, in theory, unlimited capacity. When Pods are scheduled onto the virtual node, they are created dynamically on ECI, which suits offline big data jobs, CI/CD tasks, and sudden online load spikes. In one large customer's production environment, elastic container instances launched 500 Pods within 30 seconds, easily absorbing a sudden request peak.

At the application layer, Kubernetes provides HPA for scaling Pods horizontally and VPA for scaling them vertically. Alibaba Cloud's alibaba-cloud-metrics-adapter exposes richer metrics: for example, the number of Pods can be adjusted dynamically based on the Ingress gateway's QPS or on CloudMonitor metrics.
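
For example, an HPA driven by an external QPS metric might look like the following sketch; the metric name `sls_ingress_qps` and the selector labels are assumptions modeled on alibaba-cloud-metrics-adapter examples:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: web-demo-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-demo                      # illustrative target
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: sls_ingress_qps           # assumed external metric name
        selector:
          matchLabels:
            sls.project: demo-project   # illustrative metric selector
      target:
        type: AverageValue
        averageValue: "50"              # target QPS per replica
```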

In addition, many industry customers have periodic load profiles. For example, one of our clients in the securities industry trades only during stock market hours, Monday through Friday; outside those hours only queries are served, not trades, and peak-to-trough resource demand differs by as much as 20x.

For this scenario, Alibaba Cloud Container Service provides a scheduled scaling component for periodic loads with known resource profiles. Developers define a time schedule to scale resources out ahead of the peak and reclaim them once the trough arrives. Combined with the node-level scaling of cluster-autoscaler underneath, this strikes a good balance between system stability and resource cost.
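
A hedged sketch of such a schedule, modeled on the open source kubernetes-cronhpa-controller (the group/version autoscaling.alibabacloud.com/v1beta1 is assumed; note the cron format has a leading seconds field):

```yaml
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: trading-cronhpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trading-frontend          # illustrative target
  jobs:
  - name: scale-up-before-open
    schedule: "0 30 8 * * 1-5"      # 08:30 Mon-Fri, before market open
    targetSize: 20
  - name: scale-down-after-close
    schedule: "0 30 15 * * 1-5"     # 15:30 Mon-Fri, after market close
    targetSize: 2
```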

Going forward, we will release elastic scaling policies based on machine learning, using historical resource profiles to achieve better resource prediction and stronger scaling SLAs.

Enable the next generation of serverless applications

We touched above on why serverless keeps gaining popularity with developers: people care more about their business than about infrastructure maintenance. Serverless is the inevitable direction of cloud services, sinking resource scheduling and operations down into the infrastructure. Google, IBM, Cloud Foundry, and others jointly launched Knative as a serverless orchestration framework that makes building serverless applications concise and efficient. It provides several core capabilities:

Eventing: an event-driven processing model. We have extended it with rich Alibaba Cloud event sources; for example, when OSS receives a video clip uploaded by a user, it can trigger an application in a container to transcode the video.

Serving: elastic, request-driven serving that scales automatically with request volume, even down to zero (see the sketch after this list). With Alibaba Cloud's elastic infrastructure, this can greatly reduce resource costs.

Tekton: automated pipelines from code to application deployment.
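
A minimal sketch of the Serving capability from this list: a Knative Service that scales with request volume and down to zero when idle (the image and annotation values are illustrative):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "10"  # minScale defaults to 0: scale to zero
    spec:
      containers:
      - image: gcr.io/knative-samples/helloworld-go  # illustrative image
        env:
        - name: TARGET
          value: "cloud native"
```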

Combined with application management capabilities and application performance monitoring services, we can quickly build domain-specific application hosting services (micro PaaS) on Knative, greatly reducing the complexity of manipulating Kubernetes resources directly and letting developers focus on application iteration and service delivery efficiency.

Evolution of secure sandbox container technology

Having covered the programming model, let's look at the underlying implementation. At the core of every serverless implementation is the secure container sandbox. A traditional Docker RunC container shares the Linux kernel with its host and isolates resources through cgroups and namespaces. This is very efficient, but because the operating system kernel is large, a malicious container that exploits a kernel vulnerability can affect every container on the host.

More and more enterprise customers care about container security. To strengthen isolation, Alibaba Cloud and the Ant Financial team jointly introduced secure sandbox container technology. In September this year we released the RunV secure sandbox based on lightweight virtualization. Compared with RunC containers, each RunV container has its own kernel; even if one container's kernel is compromised, the others are unaffected. It is well suited to running untrusted third-party applications, and to multi-tenant scenarios that need stronger isolation.

After performance optimization, the secure sandbox container now reaches 90% of native RunC performance, and RunV containers provide exactly the same user experience as RunC containers, including logging, monitoring, and elasticity. ACK can also mix RunC and RunV containers on the same X-Dragon bare metal instance, so users can choose per workload.
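
Per-workload runtime selection goes through the standard Kubernetes RuntimeClass mechanism; a sketch, assuming the cluster registers the RunV sandbox under the class name `runv`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: runv   # assumed RuntimeClass name for the RunV sandbox
  containers:
  - name: app
    image: nginx:1.17      # illustrative image
```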

At the end of the fiscal year, we will launch RunE, a trusted container sandbox based on Intel SGX trusted computing. Container applications run inside an enclave, a secure, trusted execution environment in the CPU. An analogy: we put the container in a safe that no one, not even the cloud provider, can tamper with or intercept data from. Customers can run highly confidential logic, such as key signing, signature verification, and private data processing, in RunE containers.

From microservices to service mesh

Let's turn to another thread: the evolution of microservice architecture. Internet application architecture gave rise to microservices. The core idea is to decompose a complex application into a set of loosely coupled services, each following the single responsibility principle. Each service can be deployed and delivered independently, greatly improving business agility, and each can scale out or in independently to meet Internet-scale challenges.

Sinking service governance into the infrastructure

Microservice frameworks such as HSF, Dubbo, and Spring Cloud provide strong service governance capabilities, such as service discovery, load balancing, and circuit breaking with degradation. These capabilities are built into the application as a fat SDK and are released and maintained along with it, coupling service governance to the lifecycle of the business logic.

Upgrading the microservice framework then forces the whole application to be rebuilt and redeployed. And because the fat SDK is usually bound to one language, it is hard to support polyglot enterprise applications.

To address these challenges, the community proposed the Service Mesh architecture. It sinks service governance into the infrastructure, delivering it through an independent sidecar process, while the application keeps only the protocol encoding and decoding. Service governance and business logic are thus decoupled and can evolve independently without interfering with each other, which makes the overall architecture more flexible; at the same time, the mesh is less intrusive to business logic and reduces the complexity of multi-language support.
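
With Istio, for example, turning on governance is configuration rather than code: a namespace label requests automatic sidecar injection, and a VirtualService expresses a routing rule outside the application (service names and subsets are illustrative; the subsets assume a matching DestinationRule):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: shop
  labels:
    istio-injection: enabled    # Envoy sidecar injected into every Pod here
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: orders
  namespace: shop
spec:
  hosts:
  - orders                      # illustrative service name
  http:
  - route:
    - destination:
        host: orders
        subset: v1
      weight: 90                # canary-style traffic split, no app changes
    - destination:
        host: orders
        subset: v2
      weight: 10
```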

Service mesh

Within the Alibaba economy, we have begun applying service mesh technology at scale to provide multi-language support, lower the onboarding threshold for business teams, offer a unified architectural model, and speed up technology iteration. Service mesh technology, represented by Istio, has a bright future, but many challenges remain on the way to large-scale production.

The first is the complexity of Istio service grid technology itself.

The second is the stability and performance challenges brought by scale:

With massive numbers of services, can the control plane distribute service configuration efficiently?

Can the data plane keep the communication latency added by the extra two hops to a minimum?

Observability and policy management capabilities must sink into the data plane, avoiding the performance bottleneck introduced by a centralized Mixer.

Finally, the mesh must stay compatible with the existing microservice architecture, supporting its existing configuration management services and communication protocols.

To meet these challenges, Alibaba and Ant Financial built their service mesh capability on a technology stack compatible with the Istio community. During this year's 618 (June 18) shopping festival, Ant Financial completed verification of its core systems on SOFAMosn, and in the Double 11 that has just concluded, Alibaba and Ant Financial rolled out Service Mesh at scale in core systems.

Meanwhile, the Alibaba economy feeds its own technology evolution back upstream, working with the community to advance Service Mesh. For example, the latest version of Nacos, Alibaba's open source service discovery and configuration management project, adds Istio support through the MCP protocol. Alibaba Cloud will later launch a hosted Service Mesh service to help developers on the cloud adopt service mesh technology easily.

Focus on the application lifecycle

Another focus is automation and standardization of the application lifecycle. Kubernetes positions itself as a Platform for Platforms, helping enterprises automate application operations and management.

Kubernetes provides many basic primitives for distributed application management, such as Deployment for stateless applications and StatefulSet for stateful ones. But in enterprise production environments, facing diverse application needs, the existing capabilities still fall short. At technology meetups we often hear enterprises describe how they modified K8s to solve their own problems, and many of those problems are the same.

OpenKruise

As a leader in cloud native technology, Alibaba has distilled its best practices from running cloud native computing at large production scale and opened them up, co-built with the community, as the open source project OpenKruise.

On one hand, this helps enterprise customers avoid detours and reduce technology fragmentation on their cloud native journey; on the other, it pushes the upstream community to gradually complete and enrich Kubernetes's application lifecycle automation.

Take the following new controllers as examples:

BroadcastJob: runs a one-time task on specified nodes, for example installing a security patch on each node or pre-pulling a container image onto nodes ahead of time.

SidecarSet: more and more operational capabilities are delivered as sidecars, such as logging, monitoring, and Envoy, the data plane component of the service mesh. SidecarSet manages sidecar lifecycles declaratively.

Advanced StatefulSet: supports in-place updates and batched upgrades, making stateful services easier to run.

These controllers address the real pain points of many customers.
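
A hedged sketch of the BroadcastJob use case above: pre-pulling an image onto every node so later Pod starts skip the download (apps.kruise.io/v1alpha1 is the OpenKruise API group; the image name is illustrative):

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
  name: prepull-image
spec:
  template:
    spec:
      containers:
      - name: prepull
        image: registry.example.com/app:v2   # image to warm up on each node
        command: ["sh", "-c", "exit 0"]      # pulling the image is the real work
      restartPolicy: Never
  completionPolicy:
    type: Always   # job is done once it has run on every matching node
```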

OAM: the industry's first open application model

On November 16, Microsoft and Alibaba Cloud jointly released the Open Application Model (OAM), hoping to establish a standardized cloud native application model that helps developers, application operators, and infrastructure operators collaborate more efficiently.

Its design separates concerns along role boundaries: developers define application components, dependencies, and architecture; application operators define runtime configuration and operational requirements, such as release policies and monitoring metrics; and infrastructure operators configure environment-specific parameters for different deployment targets.

Through this separation of concerns, the application definition, operational capabilities, and infrastructure are decoupled, making application delivery more efficient, reliable, and automated.
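
A hedged sketch of that separation, modeled on the OAM v1alpha1 spec published at the time (kind names and fields may differ in later revisions): the developer authors the component, and application operations binds it to an environment with operational traits:

```yaml
apiVersion: core.oam.dev/v1alpha1
kind: ComponentSchematic          # authored by the developer
metadata:
  name: frontend
spec:
  workloadType: core.oam.dev/v1alpha1.Server
  containers:
  - name: web
    image: example/frontend:v1    # illustrative image
    ports:
    - name: http
      containerPort: 8080
---
apiVersion: core.oam.dev/v1alpha1
kind: ApplicationConfiguration    # authored by application operations
metadata:
  name: frontend-prod
spec:
  components:
  - componentName: frontend
    instanceName: frontend-prod
    traits:
    - name: manual-scaler         # operational concern: replica count
      properties:
        replicaCount: 3
```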

Borderless computing

Finally, some thoughts on the borderless future of cloud computing. As the 5G era approaches, with the rapid development of low-latency networks, AI hardware, and intelligent applications, an era of intelligence in everything is bound to arrive. Extending computing power from the cloud to the edge and the device, with unified application delivery and resource control from the cloud, is the inevitable direction of cloud computing.

Cloud-edge-device collaboration

Based on containers, we have built ACK@Edge, an integrated cloud-edge-device collaboration platform. With it, latency-sensitive applications can be deployed to edge nodes for nearby access. For example, AI model inference and real-time data processing can run at the edge for real-time intelligent decisions, while model training, big data processing, and other applications that need massive computing power stay on the cloud.

ACK@Edge provides unified management and can bring cloud ECS instances, edge ENS nodes, and IoT devices into a single K8s cluster. For the particular conditions at the edge, it provides unitized isolation, autonomy during network disconnection, and self-healing. It is already applied at scale in Alibaba Cloud Video Cloud, Youku, and other scenarios.

Youku "Somersault Cloud"

Take Youku's "Somersault Cloud" as an example of how its computing architecture evolved.

Youku is the largest video platform in China. As its business grew rapidly, the centralized architecture originally deployed across a few IDCs had to evolve into a cloud + edge computing architecture, which required a way to uniformly manage a dozen-plus Alibaba Cloud regions and many edge nodes.

Youku chose ACK@Edge, which manages cloud and edge nodes in a unified way and enables unified application releases and elastic scaling. Through this elasticity, machine costs dropped by 50%, and with the new architecture, user terminals access nearby edge nodes, cutting end-to-end network latency by 75%.

From the community, giving back to open source

Finally: cloud native technology grows out of community collaboration. As a cloud native practitioner and leader, Alibaba fully embraces cloud native technology and gives its best practices from mass production back to the community, to build a better cloud native ecosystem together.

"Alibaba Cloud's native Wechat official account (ID:Alicloudnative) focuses on micro-services, Serverless, containers, Service Mesh and other technology areas, focuses on cloud native popular technology trends, and large-scale cloud native landing practices, and is the technical official account that best understands cloud native developers."
