This article shows how to run Hadoop on the Kubernetes platform. The content is straightforward and easy to follow; we hope it clears up your doubts as we study the topic together.
Hadoop and Kubernetes are like two great masters of the martial-arts world: one is an elder who has been famous for many years, the other a fledgling prodigy of astonishing talent who refuses to walk the usual path and has amazed the entire martial-arts community. The two share a deep bond, for both come from the same IT family, Google; but the latter is the family's own son, and it is precisely because of the patriarch's endorsement that, as soon as Kubernetes came down from the mountain, schools of every stripe flocked to crown it king.
Perhaps because Hadoop is only a godson, or perhaps because it is getting on in years, its juniors such as Spark and Storm enjoy all kinds of materials and case studies on deploying and running atop Kubernetes, while Hadoop itself has remained detached from the Kubernetes ecosystem. This article presents a practical case of Hadoop on Kubernetes to make up for that gap.
There is plenty of material on Hadoop containerization, but almost none on deploying Hadoop on Kubernetes. This is mainly due to the following reasons:
First, a Hadoop cluster relies heavily on the DNS mechanism, and some components use reverse domain-name resolution to determine the identity of nodes in the cluster. This poses a great challenge to modeling and running Hadoop on Kubernetes; solving it requires both a deep understanding of how a Hadoop cluster works and proficiency in Kubernetes.
Second, YARN, Hadoop's new MapReduce computing framework, appeared relatively late; its cluster mechanism is more complex than that of HDFS, and documentation about it is scarcer, which further increases the difficulty of modeling Hadoop as a whole and migrating it to the Kubernetes platform.
Third, Hadoop and Kubernetes belong to two different fields: one is traditional big data, the other the emerging world of containers and microservice architecture. The intersection of the two is small, and Hadoop has lost some of its spotlight in recent years (as Baidu search keywords suggest), so it is understandable that few people study deploying Hadoop on Kubernetes.
Hadoop 2.0 actually consists of two complete clusters: a basic HDFS file-storage cluster and a YARN resource-scheduling cluster.
Therefore, before modeling for Kubernetes, we need an in-depth analysis of the working mechanism and operating principles of these two clusters. The following figure shows the architecture of the HDFS cluster:
[Figure: HDFS cluster architecture]
We can see that the HDFS cluster is composed of two types of nodes: the NameNode (master node) and the DataNodes (data nodes). Both client programs and the DataNodes access the NameNode, so the NameNode needs to be modeled as a Kubernetes Service. The following is the corresponding Service definition file:
apiVersion: v1
kind: Service
metadata:
  name: k8s-hadoop-master
spec:
  type: NodePort
  selector:
    app: k8s-hadoop-master
  ports:
    - name: rpc
      port: 9000
      targetPort: 9000
    - name: http
      port: 50070
      targetPort: 50070
      nodePort: 32007
The NameNode exposes two service ports:
Port 9000 is used for internal IPC communication, mainly for obtaining file metadata.
Port 50070 is used for the HTTP service, i.e., Hadoop's web management interface.
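As a quick sanity check, once this manifest is applied, the HDFS web UI becomes reachable on any cluster node through the NodePort. The commands below are illustrative assumptions (the manifest file name is not given in the article):

# Create the Service (file name assumed for illustration)
kubectl create -f k8s-hadoop-master-service.yaml
# Verify the Service and the assigned ports
kubectl get svc k8s-hadoop-master
# The HDFS web UI is then reachable on any node via the NodePort:
curl http://<node-ip>:32007/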
To reduce the number of Hadoop images, we build a single image and use the container environment variable HADOOP_NODE_TYPE to distinguish node types and start the corresponding Hadoop component. The following is the content of the startup script startnode.sh inside the image:
#!/usr/bin/env bash
# Substitute the master addresses passed in via environment variables
# into the Hadoop configuration files
sed -i "s/@HDFS_MASTER_SERVICE@/$HDFS_MASTER_SERVICE/g" $HADOOP_HOME/etc/hadoop/core-site.xml
sed -i "s/@HDOOP_YARN_MASTER@/$HDOOP_YARN_MASTER/g" $HADOOP_HOME/etc/hadoop/yarn-site.xml

HADOOP_NODE="${HADOOP_NODE_TYPE}"
if [ "$HADOOP_NODE" = "datanode" ]; then
    echo "Start DataNode ..."
    hdfs datanode -regular
elif [ "$HADOOP_NODE" = "namenode" ]; then
    echo "Start NameNode ..."
    hdfs namenode
elif [ "$HADOOP_NODE" = "resourceman" ]; then
    echo "Start Yarn Resource Manager ..."
    yarn resourcemanager
elif [ "$HADOOP_NODE" = "yarnnode" ]; then
    echo "Start Yarn Node Manager ..."
    yarn nodemanager
else
    echo "not recognized nodetype"
fi
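For context, here is a minimal sketch of how such a single image could be assembled. This Dockerfile is an assumption for illustration; the actual kubeguide/hadoop image may be built differently:

# Sketch only: base image, paths and Hadoop version are assumptions
FROM openjdk:8-jdk
ENV HADOOP_HOME=/usr/local/hadoop
# Unpacked Hadoop distribution and the startup script shown above
COPY hadoop-2.7.2 $HADOOP_HOME
COPY startnode.sh /startnode.sh
RUN chmod +x /startnode.sh
# HADOOP_NODE_TYPE is injected per Pod
# (namenode / datanode / resourceman / yarnnode)
CMD ["/startnode.sh"]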
Note that in the startup script, the HDFS master placeholder in the Hadoop configuration files (core-site.xml and yarn-site.xml) is replaced with the value of the environment variable HDFS_MASTER_SERVICE, and the YARN master placeholder with HDOOP_YARN_MASTER. The following figure is the complete modeling diagram of a two-node Hadoop HDFS cluster:
[Figure: Modeling diagram of the two-node Hadoop HDFS cluster]
The circles in the figure represent Pods. Note that the DataNodes are not modeled as Kubernetes Services but as standalone Pods, because DataNodes are not accessed directly by clients and therefore need no Service. When a DataNode runs in a Pod container, we need to modify the following parameter in the configuration file to disable the check that a registering DataNode's hostname (DNS) must match the IP address of its host:
dfs.namenode.datanode.registration.ip-hostname-check=false
If this parameter is not modified, the DataNode cluster will be "split": because a Pod's hostname does not map to its IP address, the management interface shows two nodes, both in an abnormal state.
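In hdfs-site.xml this corresponds to the following property (rendered here in XML form for completeness):

<property>
  <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
  <value>false</value>
</property>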
The following is the Pod definition behind the HDFS Master node Service:
apiVersion: v1
kind: Pod
metadata:
  name: k8s-hadoop-master
  labels:
    app: k8s-hadoop-master
spec:
  containers:
    - name: k8s-hadoop-master
      image: kubeguide/hadoop
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 9000
        - containerPort: 50070
      env:
        - name: HADOOP_NODE_TYPE
          value: namenode
        - name: HDFS_MASTER_SERVICE
          valueFrom:
            configMapKeyRef:
              name: ku8-hadoop-conf
              key: HDFS_MASTER_SERVICE
        - name: HDOOP_YARN_MASTER
          valueFrom:
            configMapKeyRef:
              name: ku8-hadoop-conf
              key: HDOOP_YARN_MASTER
  restartPolicy: Always
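Both this Pod and the DataNode Pod below pull the master addresses from a ConfigMap named ku8-hadoop-conf, which the article does not list. A sketch of what it plausibly contains (the values are assumptions, chosen to match the Service names used in this article):

apiVersion: v1
kind: ConfigMap
metadata:
  name: ku8-hadoop-conf
data:
  # Assumed values: the Service names of the HDFS and YARN masters
  HDFS_MASTER_SERVICE: k8s-hadoop-master
  HDOOP_YARN_MASTER: ku8-yarn-master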
The following is the Pod definition for one of HDFS's DataNodes (hadoop-datanode-1):
apiVersion: v1
kind: Pod
metadata:
  name: hadoop-datanode-1
  labels:
    app: hadoop-datanode-1
spec:
  containers:
    - name: hadoop-datanode-1
      image: kubeguide/hadoop
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 9000
        - containerPort: 50070
      env:
        - name: HADOOP_NODE_TYPE
          value: datanode
        - name: HDFS_MASTER_SERVICE
          valueFrom:
            configMapKeyRef:
              name: ku8-hadoop-conf
              key: HDFS_MASTER_SERVICE
        - name: HDOOP_YARN_MASTER
          valueFrom:
            configMapKeyRef:
              name: ku8-hadoop-conf
              key: HDOOP_YARN_MASTER
  restartPolicy: Always
In fact, the DataNodes could be deployed as a DaemonSet so that one runs on every Kubernetes node; we do not do so here for clarity, but a sketch is given below.
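A hedged DaemonSet sketch (not from the original article; the apps/v1 API group shown here postdates the Kubernetes 1.7.4 used in the tests, where extensions/v1beta1 would apply):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hadoop-datanode
spec:
  selector:
    matchLabels:
      app: hadoop-datanode
  template:
    metadata:
      labels:
        app: hadoop-datanode
    spec:
      containers:
        - name: hadoop-datanode
          image: kubeguide/hadoop
          imagePullPolicy: IfNotPresent
          env:
            - name: HADOOP_NODE_TYPE
              value: datanode
            - name: HDFS_MASTER_SERVICE
              valueFrom:
                configMapKeyRef:
                  name: ku8-hadoop-conf
                  key: HDFS_MASTER_SERVICE

Next, let's look at how the YARN framework is modeled. The following figure shows the cluster architecture of the YARN framework: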
[Figure: YARN cluster architecture]
We can see that a YARN cluster has two kinds of nodes: the ResourceManager and the NodeManagers. The former is the brain (Master) of the YARN cluster, while the latter carry the actual work (worker nodes). Although this architecture closely resembles HDFS, the HDFS modeling approach cannot simply be copied, because of one important detail: the ResourceManager strictly verifies each NodeManager, requiring that the hostname (DNS) of the host where a NodeManager resides strictly matches the corresponding IP address. Put simply, the following rule must hold:
The IP address from which a NodeManager establishes its TCP connection must be the IP address that the node's hostname resolves to, i.e., the address returned by DNS resolution of the host's name.
Therefore, we use a special kind of Kubernetes Service, the Headless Service, to solve this problem: we model a Headless Service and a corresponding Pod for each NodeManager node. The following is the modeling diagram of a YARN cluster composed of one ResourceManager and two NodeManager nodes:
[Figure: Modeling diagram of a YARN cluster with one ResourceManager and two NodeManager nodes]
What makes a Headless Service special is that it is not assigned a Cluster IP: when its name is resolved (for example, pinged) through the Kubernetes DNS, the IP address of the corresponding Pod is returned, and if multiple Pod instances stand behind it, their addresses are returned in rotation. When modeling NodeManagers with Headless Services, there is one more detail to attend to: the name of the Pod (the container's hostname) must be identical to the name of its Headless Service. When the NodeManager process in the container initiates a TCP connection to the ResourceManager, it uses the container's hostname; that hostname happens to be the Service name, and the IP address the Service name resolves to happens to be the container's own IP address. In this way, YARN's DNS restriction is cleverly satisfied.
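This can be verified from the cluster itself; the commands below are illustrative assumptions (and presume nslookup is available in the image):

# Resolve the Headless Service name from inside the Pod of the same name;
# for yarn-node-1 this should return the Pod's own IP address
kubectl exec yarn-node-1 -- nslookup yarn-node-1
# Compare with the Pod IP recorded by Kubernetes
kubectl get pod yarn-node-1 -o wide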
Taking yarn-node-1 as an example, here are its Service and Pod YAML files. First, the YAML definition of the Headless Service corresponding to yarn-node-1:
apiVersion: v1
kind: Service
metadata:
  name: yarn-node-1
spec:
  clusterIP: None
  selector:
    app: yarn-node-1
  ports:
    - port: 8040
Notice the line "clusterIP: None" in the definition, indicating that this is a Headless Service with no Cluster IP address of its own. Next, the YAML definition of the Pod:
apiVersion: v1
kind: Pod
metadata:
  name: yarn-node-1
  labels:
    app: yarn-node-1
spec:
  containers:
    - name: yarn-node-1
      image: kubeguide/hadoop
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 8040
        - containerPort: 8041
        - containerPort: 8042
      env:
        - name: HADOOP_NODE_TYPE
          value: yarnnode
        - name: HDFS_MASTER_SERVICE
          valueFrom:
            configMapKeyRef:
              name: ku8-hadoop-conf
              key: HDFS_MASTER_SERVICE
        - name: HDOOP_YARN_MASTER
          valueFrom:
            configMapKeyRef:
              name: ku8-hadoop-conf
              key: HDOOP_YARN_MASTER
  restartPolicy: Always
There is nothing special about the ResourceManager's YAML definitions. Its Service is defined as follows:
apiVersion: v1
kind: Service
metadata:
  name: ku8-yarn-master
spec:
  type: NodePort
  selector:
    app: yarn-master
  ports:
    - name: "8030"
      port: 8030
    - name: "8031"
      port: 8031
    - name: "8032"
      port: 8032
    - name: http
      port: 8088
      targetPort: 8088
      nodePort: 32088
The corresponding Pod is defined as follows:
apiVersion: v1
kind: Pod
metadata:
  name: yarn-master
  labels:
    app: yarn-master
spec:
  containers:
    - name: yarn-master
      image: kubeguide/hadoop
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 9000
        - containerPort: 50070
      env:
        - name: HADOOP_NODE_TYPE
          value: resourceman
        - name: HDFS_MASTER_SERVICE
          valueFrom:
            configMapKeyRef:
              name: ku8-hadoop-conf
              key: HDFS_MASTER_SERVICE
        - name: HDOOP_YARN_MASTER
          valueFrom:
            configMapKeyRef:
              name: ku8-hadoop-conf
              key: HDOOP_YARN_MASTER
  restartPolicy: Always
In the current scheme, one more problem must be solved: re-formatting of the HDFS file system after the NameNode Pod restarts. This can be handled in the startup script: check whether HDFS has already been formatted; if not, run the format command at startup, otherwise skip it.
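A minimal sketch of such a check in startnode.sh (the data directory used as the indicator is an assumption; the real script may test differently):

# Sketch: format HDFS only on first start, assuming
# /root/hdfs/namenode is the NameNode data directory
if [ "$HADOOP_NODE" = "namenode" ]; then
    if [ ! -d /root/hdfs/namenode/current ]; then
        echo "Formatting HDFS ..."
        hdfs namenode -format
    fi
    hdfs namenode
fi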
After deployment, we can access Hadoop's HDFS management interface through a browser. Clicking the Overview tab on the home page displays the familiar HDFS interface:
[Figure: HDFS management interface, Overview tab]
Switch to the Datanodes tab to see the information and current status of each DataNode:
[Figure: Datanodes tab showing each DataNode's information and status]
Next, we can log in to the Pod where the NameNode resides and execute HDFS commands for functional verification. The following commands create an HDFS directory and upload a file to it:
root@hadoop-master:/usr/local/hadoop/bin# hadoop fs -ls /
root@hadoop-master:/usr/local/hadoop/bin# hadoop fs -mkdir /leader-us
root@hadoop-master:/usr/local/hadoop/bin# hadoop fs -ls /
Found 1 items
drwxr-xr-x   - root supergroup          0 2017-02-17 07:32 /leader-us
root@hadoop-master:/usr/local/hadoop/bin# hadoop fs -put hdfs.cmd /leader-us
We can then browse the HDFS file system in the HDFS administration interface to verify the results of the operation:
[Figure: Browsing the uploaded file in the HDFS management interface]
Next, we log in to the Pod corresponding to hadoop-master and start a MapReduce test job, wordcount. After the job starts, we can see its execution information in the YARN management interface, as shown in the following figure:
[Figure: wordcount job execution information in the YARN management interface]
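For reference, such a job can be launched with the standard examples jar shipped with Hadoop 2.7.2; the input directory below reuses the /leader-us directory populated earlier, and the output directory name is an assumption:

root@hadoop-master:/usr/local/hadoop# hadoop jar \
    share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
    wordcount /leader-us /wordcount-out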
When the job has finished, detailed statistics can be viewed through the interface. For example, the execution result of wordcount is shown in the following figure:
[Figure: wordcount execution result]
Finally, we compare the performance of a bare-metal Hadoop cluster with that of a Hadoop cluster running on Kubernetes. The test environment is a cluster of ten servers with the following specifications:
Hardware:
CPU: 2 x E5-2640v3 (8 cores each)
Memory: 16 x 16GB DDR4
Network card: 2 x 10GE multimode optical ports
Hard disk: 12 x 3TB SATA
Software:
BigCloud Enterprise Linux 7 (GNU/Linux 3.10.0-514.el7.x86_64 x86_64)
Hadoop 2.7.2
Kubernetes 1.7.4 + Calico v3.0.1
We performed the following standard tests:
TestDFSIO: distributed file read/write test (see the sample invocation below)
NNBench: NameNode benchmark
MRBench: MapReduce benchmark
WordCount: word-frequency counting test
TeraSort: TeraSort task test
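For reproducibility, a typical TestDFSIO invocation from $HADOOP_HOME looks like this; the file count and size are assumptions, not the article's actual parameters:

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.2-tests.jar \
    TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.2-tests.jar \
    TestDFSIO -read -nrFiles 10 -fileSize 1000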
Comprehensive testing shows that Hadoop's performance decreases when it runs on the Kubernetes cluster. Taking TestDFSIO as an example, the following is the file-read performance comparison:
[Figure: TestDFSIO file-read performance, bare metal vs. Kubernetes]
We can see that file-read performance on the Kubernetes cluster dropped by about 30% compared with the physical machines, and task execution time also increased considerably. Comparing file-write performance, the test results are as follows:
[Figure: TestDFSIO file-write performance, bare metal vs. Kubernetes]
The gap in write performance is much smaller, mainly because during the test HDFS writes to disk far more slowly than it reads, so the network overhead does not widen the gap.
One of the main reasons for the performance decline of a Hadoop cluster deployed on Kubernetes is the overhead of the container virtual network. If the Host Only network model is used instead, the gap narrows further. The following figure shows the TestDFSIO file-read comparison under that model:
[Figure: TestDFSIO file-read performance with the Host Only network model]
Therefore, we recommend adopting the Host Only network model in production to improve the performance of the Hadoop cluster.
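In Kubernetes terms, one way to achieve host networking for a Pod is the hostNetwork flag; a sketch, assuming that this is what the Host Only model refers to (note that the Pod then shares the node's network namespace, so the DNS modeling above would have to be revisited):

apiVersion: v1
kind: Pod
metadata:
  name: hadoop-datanode-1
  labels:
    app: hadoop-datanode-1
spec:
  # Share the node's network stack instead of the container network
  hostNetwork: true
  containers:
    - name: hadoop-datanode-1
      image: kubeguide/hadoop
      env:
        - name: HADOOP_NODE_TYPE
          value: datanode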
Having conquered the deployment of Hadoop on Kubernetes and verified it in production, we can proudly say that nothing can hold back the march of applications onto Kubernetes. Using a unified PaaS to build both enterprise application clusters and big data clusters, sharing resources and managing services in one place, will greatly improve the speed of business deployment and the efficiency of management.