2025-01-16 Update From: SLTechnology News & Howtos
Shulou (Shulou.com) 06/01 report
This article explains how a big data system can evolve gradually toward a cloud-native architecture. The content is detailed and the steps are laid out clearly, so it should serve as a useful reference. We hope you get something out of it.
1. Introduction
With the rise of the cloud-native concept, more and more enterprises are undertaking cloud-native transformation to solve the problems of traditional applications, such as poor elasticity, low resource utilization, and long iteration cycles. Cloud-native technologies (containers, immutable infrastructure, declarative APIs, and so on) make it easier for enterprises to build and run applications in public, private, and hybrid clouds and to take full advantage of the cloud environment: faster application iteration, lower resource costs, and better fault tolerance and elasticity.
Traditional big data systems built on the Hadoop ecosystem face similar problems: insufficient elasticity, low resource utilization, and difficult operations. Cloud-native technology is a natural fit for these problems. However, transforming a Hadoop-based big data system into a cloud-native architecture involves high transformation costs and significant migration risk. Is there a solution that uses cloud-native technology to address the elasticity, utilization, and management problems of big data systems while keeping transformation cost and migration risk low? Based on the current state of big data systems and the characteristics of big data and container technologies, the Tencent Cloud Big Data and Container teams have proposed a gradual cloud-native evolution solution. With this scheme, a big data system can become cloud-native and make full use of cloud-native advantages at a low transformation cost and with low migration risk.
This article analyzes the main problems big data systems currently face, how cloud-native technology can solve them, and the challenges of a cloud-native transformation. Based on these problems and challenges, it then focuses on a progressive cloud-native evolution scheme based on Hadoop Yarn on Kubernetes Pod (described in detail below) and its best practices.
2. Main problems of big data systems
Traditional big data systems grew up around the rapidly developing Hadoop ecosystem, and each enterprise gradually built its own big data platform, or even a data mid-platform. However, driven by fierce market competition and rising user expectations, the business needs rapid iteration to keep up with fast growth, while the cost of ever-increasing resource demand must be kept under control to stay competitive. This requires that a big data system be able to scale out quickly to meet production needs, while also using resources as efficiently as possible to reduce cost. The specific problems are the following:
Elastic scaling cannot keep up with rapid business growth: as the business develops, traffic and data volume can surge, and for real-time computing in particular, resources must be expanded in time to meet demand. Although some big data management platforms attempt automatic scaling (for example, triggered by cluster load), under a traditional big data platform architecture scaling usually requires a series of steps such as applying for resources, installing dependent software, and deploying services. This process is slow and cannot relieve cluster load in time.
Separate deployment and coarse-grained scheduling limit resource utilization: under the traditional Hadoop architecture, offline jobs and online jobs usually run in different clusters, but online and streaming workloads have obvious peaks and troughs. During the troughs, a large amount of capacity sits idle, wasting resources and raising costs. In a mixed online-offline cluster, dynamic scheduling can smooth peaks and fill troughs: when the online cluster's utilization hits a trough, offline tasks can be scheduled onto it, significantly improving utilization. However, Hadoop Yarn currently allocates only the static resources reported by the NodeManager; it cannot schedule based on dynamic resource availability and therefore does not support mixed online-offline scenarios well.
Heavy system images and deployment complexity slow down releases: virtual machines or bare-metal devices rely on images containing many software packages (HDFS, Spark, Flink, Hadoop, and so on), so a system image can easily exceed 10 GB, and such images tend to be too large, tedious to build, and slow to distribute across regions. Because of this, some big data development teams split requirements into image and non-image changes and only release image changes in batches once enough of them have accumulated, which limits iteration speed; when an urgent requirement demands an image change, the team faces great business pressure. In addition, after resources are purchased, application deployment still involves dependency installation, service deployment, and other steps, further slowing down releases.
Figure 1 Main problems of big data systems
Elastic scaling, release efficiency, and resource utilization are common problems in today's big data systems, and how to solve them has become a topic of growing concern for enterprises. Next, we analyze how to solve these problems from a cloud-native point of view.
3. How cloud-native technology solves the problems of big data systems
How cloud-native technology solves elastic scaling: in a cloud-native architecture, the application and its dependencies are built into an image in advance, and the application runs in a container started from that image. During business peaks, as volume grows, you simply request container resources from the cloud-native environment and wait for the image download and container start (image download usually takes only seconds). Once the container starts, the application runs immediately and provides computing power; there is no time-consuming virtual machine creation, software installation, or service deployment. During business troughs, idle containers are deleted to take the corresponding applications offline and save resource costs. With a cloud-native environment and container technology, container resources can be obtained quickly, applications can be started in seconds from their images, and business capacity can be scaled in real time to meet production needs.
How cloud-native technology solves low resource utilization: in the traditional architecture, big data workloads and online services are usually deployed in different resource clusters and are independent of each other. But big data workloads are mostly offline computing jobs whose peak falls at night, while online services are mostly idle at night. Cloud-native technology uses the container's isolation capabilities (CPU, memory, disk I/O, network I/O, and so on) and the powerful orchestration and scheduling capabilities of Kubernetes to deploy online and offline services together, so that offline workloads can make full use of resources while online services are idle, improving resource utilization.
In addition, with serverless technology and containerized deployment, resources are requested only when a computing task needs them, used and paid for on demand, and returned promptly after use. This greatly increases the flexibility of resource use, improves efficiency, and effectively reduces resource costs.
How cloud-native technology solves long release cycles: in a traditional big data system, all environments essentially share the same image, so the dependency environment is complex and the deployment and release cycle is long. Updating a basic component can take days because the image must be rebuilt and uploaded to every region. A cloud-native architecture deploys with containers, so releasing an application or updating a basic component only requires pulling the new image and restarting the container. This is naturally fast and avoids environment-consistency problems, speeding up releases and solving the long-release-cycle problem.
4. The challenges of evolving a big data system to a cloud-native architecture
Although cloud-native technology can solve the problems of current big data systems, migrating a big data system from the traditional Hadoop-based architecture to a cloud-native architecture faces several challenges:
High application transformation cost: migrating big data applications running on the Hadoop platform to a cloud-native platform requires, on the one hand, that the big data team containerize the business applications, adapting task startup methods and infrastructure (environment variables, configuration file acquisition, and so on), and switch resource management from Yarn to Kubernetes; the overall transformation cost is high. On the other hand, the applications' resource-request layer must be modified so that they can request resources directly from the Kubernetes cluster, the so-called Native on Kubernetes approach. Apache Spark and Apache Flink already support this feature in their framework cores to varying degrees, but overall completeness still depends on community effort.
High migration risk: the more changes a single migration introduces, the more likely it is to cause failures. In the Hadoop world, Hadoop Yarn manages and schedules the resources of big data applications; specifically, the applications run in Containers provided by Yarn, where a Container is Yarn's resource abstraction, not a Linux container. Migrating this to a container-based cloud-native architecture crosses the underlying infrastructure, so the scope of change and the risk are both larger.
Organizational structure adds cost: the teams responsible for developing and operating Hadoop systems in an enterprise usually belong to different departments with clearly different technology stacks. Migration requires a great deal of cross-department communication, which adds cost; if the changes are large, this cross-department cost becomes very high.
It is therefore not so simple to migrate big data applications from the traditional Hadoop architecture to a Kubernetes architecture, especially if one relies on the community's transformation of the big data applications themselves to make them runnable on a cloud-native platform. These transformations cannot be completed overnight, and the big data communities still need to put more effort into the cloud-native direction.
5. A progressive cloud-native evolution scheme for big data systems
5.1 Brief introduction to the progressive evolution scheme
We have described the problems of big data systems, how cloud-native technology can solve them, and the challenges of migrating a big data system from the traditional architecture to a cloud-native one. Is there a solution that solves the problems of big data systems and makes their architecture more cloud-native, while also reducing transformation cost during migration and avoiding migration risk?
Next, this article introduces a progressive evolution of big data systems toward a cloud-native architecture. Through gradual migration, with only minor architectural changes, cloud-native technology solves the problems of the big data system, reaping the dividends of cloud-native technology with a small investment and avoiding the risks of migration. On this basis, the big data system can then evolve smoothly toward a fully cloud-native architecture.
The progressive evolution scheme includes two modes, elastic scaling and online-offline colocation, with slightly different emphases. Elastic scaling focuses on using cloud-native resources: with serverless technology, resources are expanded quickly to supplement computing power and meet real-time business needs. Online-offline colocation focuses on using idle resources during online-business troughs: big data offline computing tasks are scheduled onto idle online resources, greatly improving resource efficiency while keeping the online business stable. Both modes take the form of Yarn on Kubernetes Pod, as shown in the figure below. The basic idea is to run Yarn NodeManager inside newly created Pods in the Kubernetes cluster. When a Yarn NodeManager Pod starts, it automatically registers with the Yarn ResourceManager of the existing Hadoop cluster according to its configuration file, supplementing the Yarn cluster's computing power in the form of Kubernetes Pods.
Figure 2 Yarn on Kubernetes Pod
5.2 Elastic scaling mode of the progressive evolution
In elastic scaling mode, the scaling module dynamically requests (and releases) resources from the serverless Kubernetes cluster according to the resource usage of the big data cluster. Concretely, requesting resources means creating (or destroying) a Yarn-operator custom resource (CustomResourceDefinition, CRD) in the Kubernetes cluster; the Yarn-operator deployed in the cluster then creates (or deletes) a Yarn Pod based on that CRD resource. A Yarn NodeManager process starts inside the Yarn Pod and, once started, automatically registers with the Yarn ResourceManager of the big data cluster, expanding (or shrinking) the cluster's computing power to meet the tasks' resource needs.
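To make the flow concrete, here is a minimal Python sketch of the kind of custom-resource body the scaling module might submit. The API group, version, kind, and field names are all hypothetical, invented for illustration, not the actual Tencent Cloud EMR/Yarn-operator API.

```python
# Hypothetical sketch of the custom resource that requests one
# Yarn NodeManager Pod. All group/version/field names are illustrative.

def build_yarn_node_cr(name: str, vcores: int, memory_gb: int,
                       resource_manager: str) -> dict:
    """Build a custom-resource body describing one Yarn NodeManager Pod."""
    return {
        "apiVersion": "emr.example.com/v1alpha1",  # hypothetical CRD group/version
        "kind": "YarnNode",                        # hypothetical kind
        "metadata": {"name": name},
        "spec": {
            "vcores": vcores,                      # capacity the NodeManager reports
            "memoryGb": memory_gb,
            "resourceManager": resource_manager,   # RM address the NM registers with
        },
    }

cr = build_yarn_node_cr("yarn-nm-0", vcores=4, memory_gb=16,
                        resource_manager="rm.emr.internal:8032")
# In a real cluster this dict would be submitted via
# kubernetes.client.CustomObjectsApi().create_namespaced_custom_object(...),
# and the operator would react by creating the corresponding Pod.
```

Destroying the resource (deleting the custom object) would trigger the reverse path: the operator deletes the Pod and the NodeManager is removed from the Yarn cluster.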
As shown in Figure 3, the big data cluster running on Tencent Cloud EMR (Elastic MapReduce) is on the left, and the Tencent Cloud EKS (serverless Kubernetes) cluster is on the right.
Figure 3 Elastic scaling scheme (EMR big data cluster)
The key components of this scheme are Yarn-operator and Yarn-autoscaler. Yarn-autoscaler monitors resource usage in the Yarn cluster, decides whether to scale up or down, and creates Yarn-operator CRD resources in the EKS cluster; Yarn-operator then creates or deletes the corresponding Yarn Pod instances based on those CRD resources. The two components work as follows.
1) Yarn-operator
Yarn-operator watches, through the Kubernetes API, the CRD resources created by the Yarn-autoscaler module of the big data cluster management platform. Its main functions are:
(1) create the corresponding Yarn Pod according to the configuration in the CRD; (2) maintain the Pod's life cycle, automatically restarting the Pod when an exception occurs; (3) scale down a specified Pod; (4) mark a Pod that fails to start as failed fast.
Pod exception recovery and fixed Pod names follow the design ideas of Kubernetes StatefulSets, ensuring that a node can rejoin the Yarn cluster under the same name after a failure. Scale-down can target any specified Pod, unconstrained by Pod index order, which gives users who care about cluster topology more flexibility. The fast-failure mark allows a Pod that has not reached the running state for a long time to be deleted proactively, so the scaling process is not blocked indefinitely.
2) Yarn-autoscaler
The Yarn-autoscaler component provides two scaling modes: load-based scaling and time-based scaling. For load-based scaling, users can set thresholds on different metrics to trigger scaling, such as the available vcores, pending vcores, available memory, and pending memory of the Yarn root queue. When these metrics reach the preset thresholds, Yarn-autoscaler triggers the scaling process and expands the EKS cluster by creating Yarn-operator CRD resources.
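A minimal sketch of such a load-based decision, assuming the queue metrics named above; the metric keys and threshold values below are illustrative, not Yarn-autoscaler's actual configuration format.

```python
# Illustrative load-based scaling decision using Yarn root-queue metrics.

def scaling_decision(metrics: dict, up: dict, down: dict) -> str:
    """Return 'scale-up', 'scale-down', or 'hold' from queue metrics."""
    # Scale up when demand backs up: many pending vcores or little free memory.
    if (metrics["pending_vcores"] >= up["pending_vcores"]
            or metrics["available_mem_mb"] <= up["available_mem_mb"]):
        return "scale-up"
    # Scale down when nothing is pending and there is ample headroom.
    if (metrics["pending_vcores"] == 0
            and metrics["available_vcores"] >= down["available_vcores"]):
        return "scale-down"
    return "hold"

busy = {"pending_vcores": 32, "available_vcores": 2, "available_mem_mb": 1024}
idle = {"pending_vcores": 0, "available_vcores": 64, "available_mem_mb": 262144}
up_rule = {"pending_vcores": 16, "available_mem_mb": 2048}
down_rule = {"available_vcores": 48}
print(scaling_decision(busy, up_rule, down_rule))   # scale-up
print(scaling_decision(idle, up_rule, down_rule))   # scale-down
```

A "scale-up" result would correspond to creating a Yarn-operator CRD resource in the EKS cluster; "scale-down" to deleting one.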
Figure 4 Scaling rule management: load-based scaling
For time-based scaling, users can set time rules to trigger scaling, such as one-off, daily, weekly, and monthly rules. When a rule fires, the scaling process runs and increases or decreases the Yarn cluster's computing power by creating (or deleting) Yarn-operator CRD resources in the EKS cluster.
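As a sketch, time rules like these can be matched against the current time as follows; the rule dictionary format is invented for illustration and is not the product's actual configuration schema.

```python
from datetime import datetime

# Illustrative matching of daily/weekly scheduled-scaling rules.

def rule_triggers(rule: dict, now: datetime) -> bool:
    """True if a scheduled scaling rule fires at the given minute."""
    if now.hour != rule["hour"] or now.minute != rule["minute"]:
        return False
    if rule["repeat"] == "daily":
        return True
    if rule["repeat"] == "weekly":
        return now.weekday() == rule["weekday"]  # 0 = Monday
    return False

# A nightly rule that would scale the Yarn cluster up at 01:00 every day.
nightly = {"repeat": "daily", "hour": 1, "minute": 0}
print(rule_triggers(nightly, datetime(2021, 3, 1, 1, 0)))   # True
print(rule_triggers(nightly, datetime(2021, 3, 1, 2, 0)))   # False
```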
Figure 5 Scaling rule management: time-based scaling
In addition, big data clusters that customers have built themselves on the cloud can be imported into the EMR management system to gain elastic scaling and improve resource efficiency. Specifically, you only need to install the EMR agent component on each node, and the EMR team can then import the cluster by adding the corresponding cluster information in the back end. The EMR agent does not intrude on the cluster and consumes few resources (less than 0.1 CPU core and less than 150 MB of memory); it mainly collects monitoring metrics and logs and reports cluster heartbeats. After the agent is installed, the cluster is fully managed by the EMR control system. Customers can use the elastic scaling capability, use the log monitoring provided by EMR alongside their own monitoring, and continue to enjoy the other capabilities EMR provides in the future.
Figure 6 Elastic scaling scheme (a user-built cluster imported into the EMR management system)
5.3 Online-offline colocation mode of the progressive evolution
In online-offline colocation mode, the agent component on each node monitors the node's real CPU and memory usage, and these statistics are collected by a server. Through this server, the big data management platform learns the specification and quantity of idle computing power currently available in the online cluster, calls the Kubernetes API to create the corresponding resources, and the ex-scheduler extension scheduler ensures that the Pods are created on the nodes with the most remaining resources. Resources are requested in the same form as in elastic scaling mode: Yarn-operator creates (or deletes) Yarn Pods based on CRD resources.
Figure 7 Online-offline colocation scheme
As shown in the figure above, the TKE (Tencent Kubernetes Engine) cluster is on the left and the EMR big data cluster is on the right. Online business has obvious peaks and troughs with a fairly regular pattern; at night in particular, resource utilization is low. At that time, the big data management platform sends a request to the Kubernetes cluster to create resources, increasing the computing power available to big data applications. This is implemented mainly through four components: ceres-agent, ceres-server, ex-scheduler, and Yarn-operator.
The ceres-agent reads the node's idle CPU from Prometheus (node-exporter, Telegraf) as the amount of CPU that can be oversold, computes free memory as the node's allocatable memory minus the total requested memory, and patches the results into the node.status.capacity field of the Node object.
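The arithmetic the ceres-agent performs can be sketched as follows; the `squeezed-cpu`/`squeezed-memory` resource names appear later in the ex-scheduler description, but the exact units and field spellings here are assumptions for illustration.

```python
# Sketch of the ceres-agent calculation: idle CPU becomes oversellable
# CPU, and free memory = allocatable memory - sum of Pod memory requests.

def oversold_capacity(idle_cpu_cores: float,
                      allocatable_mem_mb: int,
                      requested_mem_mb: int) -> dict:
    """Compute the extra capacity to patch into node.status.capacity."""
    free_mem_mb = max(allocatable_mem_mb - requested_mem_mb, 0)
    return {
        # milli-core granularity, matching Kubernetes resource units
        "squeezed-cpu": f"{int(idle_cpu_cores * 1000)}m",
        "squeezed-memory": f"{free_mem_mb}Mi",
    }

# A node with 6.5 idle cores and 32 GiB allocatable / 20 GiB requested memory:
patch = oversold_capacity(idle_cpu_cores=6.5,
                          allocatable_mem_mb=32768,
                          requested_mem_mb=20480)
print(patch)   # {'squeezed-cpu': '6500m', 'squeezed-memory': '12288Mi'}
```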
The ceres-server aggregates the oversold CPU and memory information that the ceres-agents patch onto each node, and returns the number of Pods that can be supported for the Pod specification provided by the HTTP client.
The ex-scheduler is an extended scheduler based on the Kubernetes scheduler extender. Compared with the Yarn scheduler, the Kubernetes scheduler has finer scheduling granularity: CPU is scheduled in milli-cores (for example, 500m represents 0.5 CPU) and memory in bytes, and finer granularity usually brings better resource utilization. In the scoring phase, the scheduler scores each node according to the squeezed-cpu declared by the Pod to be scheduled and the squeezed-cpu written by the ceres-agent into the node's node.status.capacity; nodes with more free resources score higher, so the node with the most idle resources is selected.
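The scoring step can be sketched like this: nodes with more free squeezed-cpu (capacity minus what squeezed Pods already request) score higher. The 0-10 normalization and the field names are assumptions for illustration, not the actual extender implementation.

```python
# Illustrative scoring step of the extender: prefer nodes with the most
# free oversold CPU after placing the Pod. Values are in milli-cores.

def score_nodes(pod_request_m: int, nodes: dict) -> dict:
    """Score each node by remaining squeezed-cpu after placing the Pod."""
    scores = {}
    for name, n in nodes.items():
        free = n["squeezed_cpu_m"] - n["squeezed_requested_m"]
        if free < pod_request_m:
            scores[name] = 0  # cannot fit the Pod: effectively filtered out
        else:
            # remaining free CPU after placement, scaled into 0..10
            scores[name] = round(10 * (free - pod_request_m) / n["squeezed_cpu_m"])
    return scores

nodes = {
    "node-a": {"squeezed_cpu_m": 8000, "squeezed_requested_m": 1000},
    "node-b": {"squeezed_cpu_m": 8000, "squeezed_requested_m": 6000},
}
print(score_nodes(500, nodes))   # node-a scores higher: more idle CPU left
```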
The main function of Yarn-operator is to dynamically create (or delete) Pods based on CRD resources. It works the same way as in elastic scaling mode, so it is not repeated here.
5.4 How the progressive evolution scheme solves the problems of big data systems
The two schemes above address the problems and challenges mentioned at the beginning of the article. The progressive evolution solution not only solves the problems of big data systems and the challenges of migration, making the architecture more cloud-native and taking full advantage of cloud-native capabilities, but also reduces transformation cost during migration and avoids migration risk as much as possible. This is mainly reflected in the following aspects:
For elastic scaling and resource requests, serverless services based on Kubernetes allow resources to be created, used, and paid for on demand, while the resource scheduling model stays unchanged. Specifically, Kubernetes is only the resource provider: it offers the APIs for creating and destroying resources, and the business side calls those APIs. Once a resource is created on Kubernetes, its Yarn NodeManager component automatically registers with the Yarn ResourceManager and provides computing power in the form of a Kubernetes Pod; Yarn remains responsible for scheduling the resources used in subsequent job execution.
For images and the release cycle, container image technology simplifies the application's runtime environment: the image only needs to provide the dependencies the application requires, which greatly reduces its size and shortens upload and download times. Fast start and teardown become easy, and the overall release cycle of the application is greatly shortened.
For resource utilization, the technical capabilities of the cloud-native architecture improve utilization in several ways: fine-grained scheduling (dividing the two core resources, CPU and memory, more finely and thus allocating them more fully), dynamic scheduling (scheduling tasks onto nodes whose resources are allocated but not actually used, based on the nodes' real load rather than static partitions, thus making fuller use of the system's computing power), and online-offline colocation (exploiting the periodicity of offline and online tasks to smooth peaks and fill troughs, making full use of the system's idle resources).
For transformation cost, migration risk, and organizational structure: through gradual migration, the big data team does not need to transform its existing architecture; it only needs to build an image of the Hadoop version currently in use to create supplementary container computing resources on Kubernetes. Changes are thus minimized and migration risk is reduced as much as possible. At the same time, the big data team is responsible for scheduling and using the Yarn resources, and the container team for keeping the Yarn Pods running stably; this clear division of labor maximizes system stability.
6. Best practices for the cloud-native gradual evolution of big data systems
6.1 Best practices for elastic scaling based on EKS
Figure 8 User best practice: elastic scaling
This user built a big data cluster based on Hadoop Yarn containing a variety of components, such as Spark, Flink, and Hive. Its main problem was how to quickly scale up to add computing power in the face of temporary traffic bursts, and how to release resources promptly after computation finishes to control cost. With the serverless capability of Tencent Cloud EKS, the fast automatic scaling solution we implemented meets this user's needs.
On the console, users freely configure the trigger thresholds of the automatic scaling policy we provide. For example, when remaining CPU or memory falls below a specified value, the Yarn auto-scaling component calls the EKS Kubernetes API to create Yarn NodeManager Pods, which automatically register with the Yarn ResourceManager after the container starts and begin providing computing power. When the user-configured scale-down policy is triggered, for example when remaining CPU or memory exceeds a specified value, the Yarn auto-scaling component likewise calls the EKS Kubernetes API to scale down the Yarn NodeManager Pods. Users never need to create a virtual machine in this process, and billing is based on the Pods' CPU and memory, so resources are created and paid for on demand.
6.2 Best practices for online-offline colocation based on TKE
Figure 9 User best practice: online-offline colocation
This customer's big data applications and storage run in a big data cluster managed by Yarn, and the production environment faced several problems, mainly insufficient big data computing power and wasted resources during online-business troughs. For example, when offline computing power was insufficient, data punctuality could not be guaranteed, especially for ad-hoc emergency big data queries: with no computing resources available, the only options were to stop existing computing tasks or wait for them to finish, and either way the efficiency of overall task execution dropped sharply.
The online-offline colocation scheme based on TKE automatically expands offline tasks into the cloud cluster and colocates them with online services, making full use of idle resources during cloud troughs to increase offline computing power, and using the rapid elasticity of cloud resources to supplement offline computing power in time. In short, the scheme can be used in three ways:
Configure a scheduled scaling task for the trough period of the online business. When the scheduled task reaches the specified time, it calls the TKE Kubernetes API to submit an expansion request; Kubernetes creates Yarn NodeManagers as Pods, which automatically register with the Yarn ResourceManager according to the configuration prepared in the image and provide computing resources. This not only improves the utilization of the online cluster but also increases the computing power of the offline cluster.
The big data management platform can also send expansion instructions directly to the TKE Kubernetes API to handle temporary emergency big data query tasks, avoiding tasks that cannot start because of insufficient computing power and thus improving the system's SLA.
Users can configure an automatic scaling policy on the console and create Yarn NodeManagers on suitable nodes with the help of Ceres server/client resource prediction.
That concludes this article on the gradual cloud-native evolution of big data systems. We hope the content shared here is helpful to you.