
Understanding the Application of Kubernetes in Big Data


This article introduces the basics of Kubernetes and how it applies to big data workloads. These are questions many practitioners run into in real projects, so let's work through them together. I hope you read carefully and come away with something useful!

What is Kubernetes?

Over the past few years, Kubernetes has been an exciting topic in the DevOps and data science communities, and it has steadily become one of the preferred platforms for developing cloud-native applications. Kubernetes, open-sourced by Google, handles scheduling containers onto compute clusters and manages workloads to ensure they run as expected. But there is a catch: what does that actually mean? You can of course research Kubernetes further on your own, but many of the articles on the Internet are high-level summaries filled with complex jargon that assume the reader already knows the basics of the technology.

What is a microservice?

To understand how Kubernetes works and why we need it, we need to look at microservices. There is no universally accepted definition of a microservice, but put simply, microservices are small, separate components of a larger application that each perform a specific task. These components communicate with each other through REST APIs. This architecture makes applications scalable and maintainable. It also makes development teams more productive, because each team can focus on its own component without interfering with the rest of the application.
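To make this concrete, here is a minimal sketch of one such component: a hypothetical reporting microservice exposing a single REST endpoint, written with only the Python standard library. The service name, port, and route are illustrative assumptions, not anything prescribed by Kubernetes.

```python
# A toy microservice: one small component doing one task, reachable over REST.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ReportHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A single, narrowly scoped endpoint; other services call it over HTTP.
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    # Listen on all interfaces so other components (or a container network)
    # can reach this service.
    HTTPServer(("0.0.0.0", 8080), ReportHandler).serve_forever()
```

A larger application would be composed of many such services, each independently deployable, which is exactly the situation the rest of this article addresses.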

Because each component runs more or less independently of the rest of the application, you need an infrastructure that can manage and integrate all of these components. That infrastructure must ensure all components work properly when deployed in production.

Containers and virtual machines (VMs)

Each microservice has its own dependencies and needs its own environment, or virtual machine (VM), to host them. You can think of a VM as a "giant" process on a computer whose storage, compute, and networking are isolated from the rest of the machine. In other words, a VM is a software abstraction layer on top of the physical hardware that emulates a complete operating system.

As you can imagine, a virtual machine is a resource-hungry process that consumes the computer's CPU, memory, and storage. If your components are small (which is very common), you are left with a lot of underutilized resources inside the VM. This makes most microservice-based applications hosted on VMs time-consuming and costly to maintain.

Containers, like shipping containers in real life, hold things inside them. A container packages the code, system libraries, and settings needed to run a microservice, giving developers confidence that their application will run no matter where it is deployed. Most production-ready applications consist of multiple containers, each running a separate part of the application while sharing the operating system (OS) kernel. Unlike VMs, containers need only minimal resources to run reliably in production. Compared with VMs, containers are therefore considered lightweight, standalone, and portable.

Diving deeper into Kubernetes

We hope you are still with us! With containers and microservices covered, Kubernetes should be easier to understand. In a production environment, you must manage the life cycle of containerized applications, ensuring there is no downtime and that system resources are used efficiently. Kubernetes provides a framework for managing all of these operations dynamically and flexibly across a distributed system. In short, it is the operating system of the cluster. A cluster consists of multiple virtual or physical machines connected together in a network. Officially, this is how Kubernetes is defined on the official website:

"Kubernetes is a portable, extensible open source platform for managing containerized workloads and services to facilitate declarative configuration and automation. It has a large and fast-growing ecosystem. Kubernetes services, support and tools are widely available."

Kubernetes is a scalable system. It achieves scalability through a modular architecture, in which each service of your application sits behind a defined API and a load balancer. A load balancer is a mechanism by which the system ensures that every component, whether a server or a service, makes the most of its available capacity. Scaling the application is simply a matter of changing the number of container replicas in a configuration file, or you can enable autoscaling. This is particularly convenient because the complexity of scaling is delegated to Kubernetes. Autoscaling is driven by real-time metrics such as memory consumption and CPU load. On the client side, Kubernetes automatically distributes traffic evenly among the replica containers in the cluster, keeping the deployment stable.
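As a hedged illustration of both scaling paths, here is a sketch using the official Kubernetes Python client (pip install kubernetes). The deployment name "web" and the thresholds are assumptions for illustration, not values from the article.

```python
from kubernetes import client, config

config.load_kube_config()  # reads cluster credentials from ~/.kube/config

# Manual scaling: change the replica count on an existing Deployment,
# the programmatic equivalent of editing replicas in a configuration file.
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="web", namespace="default", body={"spec": {"replicas": 5}})

# Autoscaling: let Kubernetes add or remove replicas based on CPU load.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="web-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web"),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=80))
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```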

Kubernetes also achieves better hardware utilization. Production-ready applications typically rely on a large number of components that must be deployed, configured, and managed across multiple servers. As mentioned above, Kubernetes greatly simplifies the task of deciding which server a component should be deployed to, based on resource-availability criteria (processors, memory, and so on).
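That placement decision is driven by the resource requests and limits declared on each container. A minimal sketch of such a declaration with the Python client; the image and the figures are illustrative assumptions.

```python
from kubernetes import client

container = client.V1Container(
    name="analytics-worker",
    image="example/worker:1.0",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "1Gi"},  # the scheduler bin-packs on these
        limits={"cpu": "1", "memory": "2Gi"},       # hard caps enforced at runtime
    ),
)
```

The scheduler only places the pod on a node with at least the requested CPU and memory free, which is how Kubernetes keeps the hardware well utilized without overcommitting.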

Another great feature of Kubernetes is that it is self-healing, meaning it can automatically recover from failures, for example by recreating crashed containers. If a container fails for any reason, Kubernetes compares the number of containers currently running with the number defined in the configuration file and starts new containers as needed, ensuring minimal downtime.
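The reconciliation idea behind self-healing can be sketched in a few lines of plain Python. This toy loop is a stand-in, not real Kubernetes code: list_running() and start_replica() are hypothetical helpers mimicking what the kube-controller-manager and kubelet do against real cluster state.

```python
import time

DESIRED_REPLICAS = 3
running = ["pod-a", "pod-b"]  # pretend one of three replicas just crashed

def list_running():
    return list(running)

def start_replica(index):
    name = f"pod-replacement-{index}"
    running.append(name)
    print(f"started {name}")

def reconcile():
    # Compare observed state with desired state and make up the difference.
    missing = DESIRED_REPLICAS - len(list_running())
    for i in range(missing):
        start_replica(i)

for _ in range(3):  # a real controller loops forever; a few ticks suffice here
    reconcile()
    time.sleep(1)
```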

With that out of the way, it is time to look at the main elements that make up Kubernetes. We will first cover the lower-level Kubernetes worker nodes, and then the higher-level Kubernetes master. Worker nodes are the workhorses that run the containers, while the master is the headquarters that supervises the system.

Kubernetes worker node components

A Kubernetes worker node (also known as a Kubernetes minion) contains all the components necessary to communicate with the Kubernetes master (primarily the kube-apiserver) and to run containerized applications.

Docker container runtime: Kubernetes needs a container runtime to orchestrate. Docker is a common choice, but alternatives such as CRI-O and Frakti are also available. Docker is a platform for building, shipping, and running containerized applications. It runs on each worker node and is responsible for running containers, downloading container images, and managing the container environment.

Pod: a pod contains one or more tightly coupled containers (for example, one for the back-end server and another for auxiliary services such as uploading files, generating analytics reports, or collecting data). These containers share the same network IP address, port space, and even volumes (storage). A shared volume has the same lifecycle as the pod, which means that if the pod is removed, the volume disappears with it. However, Kubernetes users can set up persistent volumes to decouple storage from the pod, so the mounted volume still exists after the pod is removed. (See the sketch after this list.)

kube-proxy: responsible for routing incoming and outgoing network traffic on each node. kube-proxy also acts as a load balancer, distributing incoming traffic across containers.

kubelet: receives a set of pod configurations from the kube-apiserver and ensures that the containers they define are running and healthy.
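Here is the sketch promised above: a pod with two tightly coupled containers sharing a non-persistent emptyDir volume, created with the official Python client. The pod name and images are hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()

# A volume whose lifecycle is tied to the pod: it vanishes when the pod does.
shared = client.V1Volume(name="shared-data",
                         empty_dir=client.V1EmptyDirVolumeSource())
mount = client.V1VolumeMount(name="shared-data", mount_path="/data")

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="backend-with-helper"),
    spec=client.V1PodSpec(
        containers=[
            # The main back-end server and an auxiliary reporting helper,
            # sharing one IP, one port space, and the /data volume.
            client.V1Container(name="backend", image="example/backend:1.0",
                               volume_mounts=[mount]),
            client.V1Container(name="report-helper", image="example/helper:1.0",
                               volume_mounts=[mount]),
        ],
        volumes=[shared],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Swapping the emptyDir for a persistentVolumeClaim source is what decouples the data's lifetime from the pod's.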

Kubernetes master components

The Kubernetes master manages the Kubernetes cluster and coordinates the worker nodes. It is the main entry point for most administrative tasks.

etcd: an essential part of the Kubernetes cluster, etcd is a key-value store that shares and replicates all configuration, state, and other cluster data.

kube-apiserver: almost all communication between Kubernetes components, as well as the user commands that control the cluster, happens through REST API calls. The kube-apiserver handles all of these calls. (See the example after this list.)

kube-scheduler: the default scheduler in Kubernetes, which finds the best worker node for a newly created pod. If desired, you can also write your own custom scheduler.

kubectl: a command-line client for communicating with and controlling a Kubernetes cluster through the kube-apiserver.

kube-controller-manager: a daemon (background process) that embeds the set of core Kubernetes controllers, such as the endpoints, namespace, replication, and service-account controllers.

cloud-controller-manager: runs controllers that interact with the underlying cloud provider, allowing providers to integrate Kubernetes into the cloud infrastructure they are building. Cloud providers such as Google Cloud, AWS, and Azure already offer their own managed Kubernetes services.
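As noted under kube-apiserver above, everything is a REST call. For example, listing pods (what kubectl get pods does) is essentially a GET request to the apiserver, which the official Python client wraps for us:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Roughly equivalent to GET /api/v1/namespaces/default/pods on the kube-apiserver.
for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.status.phase)
```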

Kubernetes and big data

One of the main challenges in developing big data solutions is defining the right architecture for deploying big data software in production. As the name implies, big data systems are large-scale applications that handle exponentially growing volumes of online and batch data. A reliable, scalable, secure, and easy-to-manage platform is therefore needed to bridge the gap between the huge amounts of data to be processed, the software applications, and the underlying infrastructure (on-premises or cloud-based).

Kubernetes is one of the best options for deploying applications on large infrastructure. With Kubernetes, you can run all the online and batch workloads you need, such as analytics and machine learning applications.

In the big data world, Apache Hadoop has long been the dominant framework for deploying scalable, distributed applications. However, the rise of cloud computing and cloud-native applications has eroded Hadoop's popularity (although most cloud providers, such as AWS and Cloudera, still offer Hadoop services). Hadoop essentially provides three main capabilities: a resource manager (YARN), a data storage layer (HDFS), and a computing paradigm (MapReduce). All three are being displaced by more modern technologies: Kubernetes for resource management, Amazon S3 for storage, and Spark/Flink/Dask for distributed computing. In addition, most cloud providers offer their own proprietary computing solutions.

The first thing to clarify is that there is no one-to-one relationship between Hadoop (or most other big data stacks) and Kubernetes; in fact, you can deploy Hadoop on Kubernetes. However, Hadoop was built and matured in an environment very different from today's. It was built at a time when network latency was a major bottleneck, and companies were forced to run internal data centers to avoid moving large amounts of data around for data science and analytics. That said, large enterprises that want their own data centers will continue to use Hadoop, though adoption is likely to stay low given the better alternatives.

Today, led by cloud storage providers and cloud-native solutions, a large share of enterprise computing happens in the cloud, and many companies also choose to deploy their own private clouds internally. For these reasons, Hadoop, HDFS, and similar products have been losing ground to newer, more flexible, and ultimately better technologies such as Kubernetes.

Big data applications are good candidates for the Kubernetes architecture because Kubernetes clusters are scalable and extensible. Recently there have been major pushes to use Kubernetes for big data. For example, Apache Spark, the poster child for heavy computation over large amounts of data, has been working on a native Kubernetes scheduler for running Spark jobs. Google has announced that it will replace YARN with Kubernetes for scheduling its Spark workloads. The e-commerce giant eBay has deployed thousands of Kubernetes clusters to manage its Hadoop AI/ML pipelines.

So why is Kubernetes a good fit for big data applications? Take, for example, two Apache Spark jobs A and B that do some data aggregation on a machine, and suppose a shared dependency has been updated from version X to version Y, but job A needs version X while job B needs version Y. Job A will then fail to run.

In a Kubernetes cluster, each node runs isolated Spark jobs in their own driver and executor containers. This setup prevents the jobs' dependencies from interfering with each other while still preserving parallelism.
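A hedged sketch of that isolation: each job is submitted as its own pod built from its own image, so job A can pin dependency version X while job B pins version Y. The image tags and pod names are illustrative assumptions, not a real Spark-on-Kubernetes submission flow.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def spark_job_pod(name, image):
    # One pod per job; the image bakes in that job's exact dependencies.
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, labels={"app": "spark"}),
        spec=client.V1PodSpec(
            restart_policy="Never",  # batch job: run once, do not restart
            containers=[client.V1Container(name="driver", image=image)],
        ),
    )

# Job A pins dependency version X; job B pins version Y. Neither can break the other.
v1.create_namespaced_pod("default", spark_job_pod("job-a", "example/spark-job:depX"))
v1.create_namespaced_pod("default", spark_job_pod("job-b", "example/spark-job:depY"))
```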

Kubernetes still has some major pain points when it comes to deploying the big data stack. For example, because containers were designed for short-lived, stateless applications, the lack of persistent storage that can be shared between jobs is a significant problem for big data applications running on Kubernetes. Other major issues include scheduling (the Spark implementation mentioned above is still experimental), security, and networking.
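One common mitigation for the storage pain point is a PersistentVolumeClaim, which outlives any single pod so that jobs can share state. A minimal sketch with the official Python client; the claim name, access mode, and size are assumptions, and note that newer client versions may expect V1VolumeResourceRequirements for the resources field.

```python
from kubernetes import client, config

config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="shared-job-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],  # required if many pods mount it at once
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)

# The claim (and the data behind it) persists even as job pods come and go.
client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)
```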

Consider the following situation: a job running on node A needs to read data stored in HDFS on a data node that sits on node B of the cluster. This greatly increases network latency because, unlike with YARN, where computation is scheduled close to the data, the data must now be sent across the network to the isolated job for computation. Although there are efforts to address these data-locality problems, Kubernetes still has a long way to go before it is truly a viable and realistic option for deploying big data applications.

Nevertheless, the open-source community is working tirelessly on these problems to make Kubernetes a practical option for deploying big data applications. Thanks to its inherent advantages, such as resilience, scalability, and resource utilization, Kubernetes gets closer every year to becoming the de facto platform for distributed big data applications.

That is the end of "Understanding the Application of Kubernetes in Big Data". Thank you for reading! If you want to learn more about the industry, follow the site, where the editor will keep publishing practical articles for you.
