In this article, I would like to share an example of quickly building a multi-node Hadoop cluster based on Docker. Most people are not very familiar with this topic, so I am sharing it for your reference; I hope you learn a lot from reading it. Let's get started.
I. Project Overview
GitHub: kiwenlau/hadoop-cluster-docker
Building a Hadoop cluster directly on physical machines is a very painful process, especially for beginners, who may be worn down by configuration problems before they ever get to run wordcount. And not everyone has several machines at hand, right? You can try building a cluster with multiple virtual machines, as long as you have a sufficiently powerful machine.
My goal is to run a Hadoop cluster inside Docker containers, so that Hadoop developers can quickly and easily build a multi-node Hadoop cluster on a single machine. This idea has in fact been implemented many times already, but none of the implementations is ideal: either the image is too large, or it is too slow to use, or it relies on third-party tools that make it too complex. The following table shows some known Hadoop-on-Docker projects and their problems:
Project                           Image size   Problems
sequenceiq/hadoop-docker:latest   1.491GB      image too large; single node only
sequenceiq/hadoop-docker:2.7.0    1.76GB       same as above
sequenceiq/hadoop-docker:2.6.0    1.624GB      same as above
sequenceiq/ambari:latest          1.782GB      image too large; too slow; too complex to use
sequenceiq/ambari:2.0.0           4.804GB      same as above
sequenceiq/ambari:1.7.0           4.761GB      same as above
alvinhenrick/hadoop-mutinode      4.331GB      image too large; slow to build; adding nodes is troublesome; has bugs
My project is based on the alvinhenrick/hadoop-mutinode project, but with a lot of optimization and refactoring. The GitHub home page of the alvinhenrick/hadoop-mutinode project is:
GitHub: Hadoop (YARN) Multinode Cluster with Docker
The following two tables compare the alvinhenrick/hadoop-mutinode project and my kiwenlau/hadoop-cluster-docker project:
alvinhenrick/hadoop-mutinode images:

Image name                  Build time   Layers   Size
alvinhenrick/serf           258.213s     21       239.4MB
alvinhenrick/hadoop-base    2236.055s    58       4.328GB
alvinhenrick/hadoop-dn      51.959s      74       4.331GB
alvinhenrick/hadoop-nn-dn   49.548s      84       4.331GB

kiwenlau/hadoop-cluster-docker images:

Image name                  Build time   Layers   Size
kiwenlau/serf-dnsmasq       509.46s      8        206.6MB
kiwenlau/hadoop-base        400.29s      7        775.4MB
kiwenlau/hadoop-master      5.41s        9        775.4MB
kiwenlau/hadoop-slave       2.41s        8        775.4MB
As these tables show, I mainly optimized the following points:

- Smaller image size, faster build time, and fewer image layers
- Faster and easier changes to the number of Hadoop cluster nodes
Moreover, adding nodes in the alvinhenrick/hadoop-mutinode project requires manually modifying the Hadoop configuration files, rebuilding the hadoop-nn-dn image, and then modifying the container startup script. In my project this is fully automated with a shell script: I can rebuild the hadoop-master image in under one minute and run it immediately. The project starts a 3-node Hadoop cluster by default, but it supports a Hadoop cluster with any number of nodes.
In addition, starting Hadoop, running wordcount, and rebuilding the images are all automated with shell scripts, which makes using and developing the whole project very convenient and fast.
Development and test environment
Operating system: Ubuntu 14.04 and Ubuntu 12.04
Kernel version: 3.13.0-32-generic
Docker version: 1.5.0 and 1.6.2
Note: insufficient disk space, insufficient memory, and especially a kernel version that is too low will all cause the cluster to fail to run.
II. Image Overview
A total of 4 images have been developed in this project:
serf-dnsmasq
hadoop-base
hadoop-master
hadoop-slave
The serf-dnsmasq image
Based on ubuntu:15.04 (chosen because it is the smallest Ubuntu image, not because it is the newest).
Installs serf: a distributed node-management tool that can dynamically discover all the nodes of the Hadoop cluster.
Installs dnsmasq: a lightweight DNS server that provides domain-name resolution for the Hadoop cluster.
When the containers start, the IP of the master node is passed to all slave nodes, and serf starts immediately inside each container. The serf agents on the slave nodes quickly find the master node (they all know the master's IP), and the master node quickly discovers all the slave nodes. By exchanging this information with one another, every node learns of the existence of every other node. Whenever serf discovers a new node, it reconfigures dnsmasq and restarts it, so that dnsmasq can resolve the domain names of all nodes in the cluster. This process takes longer as the number of nodes grows. Therefore, if you configure more Hadoop nodes, you need to check after starting the containers that serf has discovered all the nodes and that DNS can resolve all their domain names, and wait a moment before starting Hadoop. This serf/dnsmasq solution was proposed by SequenceIQ, a company that focuses on running Hadoop in Docker.
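As a quick check, assuming the *.kiwenlau.com hostname convention used later in this article, you can verify discovery and resolution from inside the master container:

# Check that serf has discovered every node (all should be listed as "alive"):
serf members
# Check that dnsmasq resolves a slave's domain name to its container IP:
nslookup slave1.kiwenlau.com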
The hadoop-base image
Based on the serf-dnsmasq image.
Installs the JDK (OpenJDK).
Installs openssh-server and configures password-less SSH.
Installs vim.
Installs Hadoop 2.3.0: a pre-compiled build (the 2.5.2, 2.6.0, and 2.7.0 packages are all larger than 2.3.0, so I didn't bother to upgrade).
If you want to rebuild hadoop-base yourself, you need to download the compiled hadoop-2.3.0 package and place it in the hadoop-cluster-docker/hadoop-base/files directory.
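One way to fetch the package; the exact URL is an assumption based on the standard Apache archive layout, not taken from the project:

# Download a pre-built Hadoop 2.3.0 tarball (URL assumed from the Apache archive layout)
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.3.0/hadoop-2.3.0.tar.gz
# Place it where the hadoop-base build expects it
mv hadoop-2.3.0.tar.gz hadoop-cluster-docker/hadoop-base/files/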
The hadoop-master image
Based on the hadoop-base image.
Configures Hadoop for the master node.
Formats the namenode.
Configuring the master node requires configuring the slaves file, which must list the domain names or IPs of all the slave nodes. When the number of Hadoop nodes changes, the slaves file must change too; therefore, changing the cluster size means modifying the slaves file and then rebuilding the hadoop-master image. I wrote a resize-cluster.sh script to automate this process: you can easily change the number of nodes in the Hadoop cluster simply by passing the desired node count as the script's parameter. Since the hadoop-master image only does some configuration work and does not need to download any files, the whole process is very fast; one minute is enough. An example slaves file is sketched below.
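For illustration only, a hypothetical slaves file for the default 3-node cluster (1 master + 2 slaves), following the hostname convention used in this project, might look like this:

# Hypothetical contents of the Hadoop slaves file for a 3-node cluster
slave1.kiwenlau.com
slave2.kiwenlau.com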
The hadoop-slave image
Based on the hadoop-base image; configures Hadoop for a slave node.
Image size analysis
The following table shows the output of sudo docker images:
REPOSITORY                              TAG     IMAGE ID       CREATED        VIRTUAL SIZE
index.alauda.cn/kiwenlau/hadoop-slave   0.1.0   d63869855c03   17 hours ago   777.4 MB
index.alauda.cn/kiwenlau/hadoop-master  0.1.0   7c9d32ede450   17 hours ago   777.4 MB
index.alauda.cn/kiwenlau/hadoop-base    0.1.0   5571bd5de58e   17 hours ago   777.4 MB
index.alauda.cn/kiwenlau/serf-dnsmasq   0.1.0   09ed89c24ee8   17 hours ago   206.7 MB
ubuntu                                  15.04   bd94ae587483   3 weeks ago    131.3 MB
From this output, the following conclusions are easy to draw:
- The serf-dnsmasq image adds 75.4MB on top of ubuntu:15.04.
- The hadoop-base image adds 570.7MB on top of serf-dnsmasq.
- The hadoop-master and hadoop-slave images add almost nothing on top of hadoop-base.
The following table shows part of the output of sudo docker history index.alauda.cn/kiwenlau/hadoop-base:0.1.0:
IMAGE          CREATED        CREATED BY                                      SIZE
2039b9b81146   44 hours ago   /bin/sh -c #(nop) ADD multi:a93c971a49514e787   158.5 MB
cdb620312f30   44 hours ago   /bin/sh -c apt-get install -y openjdk-7-jdk     324.6 MB
da7d10c790c1   44 hours ago   /bin/sh -c apt-get install -y openssh-server    87.58 MB
c65cb568defc   44 hours ago   /bin/sh -c curl -Lso serf.zip https://dl.bint   14.46 MB
3e22b3d72e33   44 hours ago   /bin/sh -c apt-get update && apt-get install    60.89 MB
b68f8c8d2140   3 weeks ago    /bin/sh -c #(nop) ADD file:d90f7467c470bfa9a3   131.3 MB
From this we can see:
- The base image ubuntu:15.04 is 131.3MB.
- Installing OpenJDK takes 324.6MB.
- Installing Hadoop takes 158.5MB.
- Ubuntu, OpenJDK, and Hadoop are all necessary for the image; together they account for 614.4MB.
Therefore, my hadoop images are already close to the minimum possible size, and there is very little room left for optimization.
III. Steps to build a 3-node Hadoop cluster
1. Pull the images
sudo docker pull index.alauda.cn/kiwenlau/hadoop-master:0.1.0
sudo docker pull index.alauda.cn/kiwenlau/hadoop-slave:0.1.0
sudo docker pull index.alauda.cn/kiwenlau/hadoop-base:0.1.0
sudo docker pull index.alauda.cn/kiwenlau/serf-dnsmasq:0.1.0
This takes about 3 to 5 minutes. Alternatively, you can pull the images directly from my DockerHub repository, which lets you skip step 2 (modifying the tags):
sudo docker pull kiwenlau/hadoop-master:0.1.0
sudo docker pull kiwenlau/hadoop-slave:0.1.0
sudo docker pull kiwenlau/hadoop-base:0.1.0
sudo docker pull kiwenlau/serf-dnsmasq:0.1.0
View the downloaded images:
sudo docker images
Running result:
REPOSITORY                              TAG     IMAGE ID       CREATED        VIRTUAL SIZE
index.alauda.cn/kiwenlau/hadoop-slave   0.1.0   d63869855c03   17 hours ago   777.4 MB
index.alauda.cn/kiwenlau/hadoop-master  0.1.0   7c9d32ede450   17 hours ago   777.4 MB
index.alauda.cn/kiwenlau/hadoop-base    0.1.0   5571bd5de58e   17 hours ago   777.4 MB
index.alauda.cn/kiwenlau/serf-dnsmasq   0.1.0   09ed89c24ee8   17 hours ago   206.7 MB
The hadoop-base image is based on the serf-dnsmasq image, and the hadoop-slave and hadoop-master images are based on the hadoop-base image, so the four images together actually occupy only 777.4MB in total.
2. Modify the image tags
sudo docker tag d63869855c03 kiwenlau/hadoop-slave:0.1.0
sudo docker tag 7c9d32ede450 kiwenlau/hadoop-master:0.1.0
sudo docker tag 5571bd5de58e kiwenlau/hadoop-base:0.1.0
sudo docker tag 09ed89c24ee8 kiwenlau/serf-dnsmasq:0.1.0
View the images after modifying the tags:
sudo docker images
Running result:
REPOSITORY                              TAG     IMAGE ID       CREATED        VIRTUAL SIZE
index.alauda.cn/kiwenlau/hadoop-slave   0.1.0   d63869855c03   17 hours ago   777.4 MB
kiwenlau/hadoop-slave                   0.1.0   d63869855c03   17 hours ago   777.4 MB
index.alauda.cn/kiwenlau/hadoop-master  0.1.0   7c9d32ede450   17 hours ago   777.4 MB
kiwenlau/hadoop-master                  0.1.0   7c9d32ede450   17 hours ago   777.4 MB
kiwenlau/hadoop-base                    0.1.0   5571bd5de58e   17 hours ago   777.4 MB
index.alauda.cn/kiwenlau/hadoop-base    0.1.0   5571bd5de58e   17 hours ago   777.4 MB
kiwenlau/serf-dnsmasq                   0.1.0   09ed89c24ee8   17 hours ago   206.7 MB
index.alauda.cn/kiwenlau/serf-dnsmasq   0.1.0   09ed89c24ee8   17 hours ago   206.7 MB
The reason for modifying the tags is that I push the images to DockerHub by default, so the image names in the Dockerfiles and shell scripts carry no alauda prefix; sorry about that. Changing the tags is quick, though, and if you download the images directly from DockerHub there is no need to modify the tags at all. The Alauda registry, however, is much faster to download from. The retagging can also be done in one loop, as sketched below.
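For convenience, here is a small loop equivalent to the four docker tag commands above (it tags by name rather than by image ID; both work):

# Retag all four images in one loop (equivalent to the commands in step 2)
for img in hadoop-slave hadoop-master hadoop-base serf-dnsmasq; do
    sudo docker tag "index.alauda.cn/kiwenlau/${img}:0.1.0" "kiwenlau/${img}:0.1.0"
done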
3. Download the source code
git clone https://github.com/kiwenlau/hadoop-cluster-docker
In case GitHub is blocked or unreachable, I have also imported the code into a Git repository on Open Source China:
git clone http://git.oschina.net/kiwenlau/hadoop-cluster-docker
4. Run the containers
cd hadoop-cluster-docker
./start-container.sh
Running result:
start master container...
start slave1 container...
start slave2 container...
root@master:~#
A total of 3 containers are started: 1 master and 2 slaves. After the containers start, you are placed in the home directory of the master container's root user (/root). A rough sketch of the docker commands behind start-container.sh follows.
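For the curious, here is a minimal sketch of what start-container.sh roughly does for the 3-node default. The exact flags are assumptions rather than the project's actual script (which, for example, also passes the master's IP to the slaves); but each node needs a hostname for serf/dnsmasq, and --dns 127.0.0.1 makes each container query its own local dnsmasq:

# Rough sketch only; flags are assumptions, see the project's start-container.sh
sudo docker run -d -t --dns 127.0.0.1 -P --name master -h master.kiwenlau.com kiwenlau/hadoop-master:0.1.0
sudo docker run -d -t --dns 127.0.0.1 -P --name slave1 -h slave1.kiwenlau.com kiwenlau/hadoop-slave:0.1.0
sudo docker run -d -t --dns 127.0.0.1 -P --name slave2 -h slave2.kiwenlau.com kiwenlau/hadoop-slave:0.1.0
# Finally attach to the master container
sudo docker attach master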
View the files in the home directory of master's root user:
ls
Running result:
hdfs  run-wordcount.sh  serf_log  start-hadoop.sh  start-ssh-serf.sh
start-hadoop.sh is the shell script that starts Hadoop, and run-wordcount.sh is the shell script that runs wordcount; the latter can be used to test whether the images work properly.
5. Test whether the containers started normally (at this point you are inside the master container)
View hadoop cluster members:
serf members
Running result:
master.kiwenlau.com  172.17.0.65:7946  alive
slave1.kiwenlau.com  172.17.0.66:7946  alive
slave2.kiwenlau.com  172.17.0.67:7946  alive
If any node is missing from the output, wait a moment and run serf members again; the serf agents need time to discover all the nodes. A small polling loop is sketched below.
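If you want to script this step, a simple (hypothetical) polling loop can wait until all 3 nodes report as alive:

# Poll until all 3 nodes are listed as alive (hypothetical helper)
while [ "$(serf members | grep -c alive)" -lt 3 ]; do
    sleep 5
done
echo "all nodes discovered"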
Test ssh:
ssh slave2.kiwenlau.com
Running result:
Warning: Permanently added 'slave2.kiwenlau.com,172.17.0.67' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 15.04 (GNU/Linux 3.13.0-53-generic x86_64)
 * Documentation: https://help.ubuntu.com/
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
root@slave2:~#
Exit slave2:
exit
Running result:
logout
Connection to slave2.kiwenlau.com closed.
If ssh fails, wait a moment and test again, because dnsmasq's DNS server needs time to start. Once the tests succeed, you can start the Hadoop cluster. In fact, you can also skip the tests entirely: just start the containers and wait patiently for about a minute.
6. Start Hadoop
./start-hadoop.sh
If you sshed into slave2 in the previous step, remember to return to master first! The output is too long to reproduce here. Hadoop's startup speed depends on the performance of your machine.
7. Run wordcount
./run-wordcount.sh
Running result:
input file1.txt:
Hello Hadoop
input file2.txt:
Hello Docker
wordcount output:
Docker    1
Hadoop    1
Hello     2
The execution speed of wordcount depends on machine performance.
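For reference, here is a minimal sketch of the kind of commands that would produce the output above. The file names and jar path are assumptions based on a standard Hadoop 2.3.0 layout, not necessarily the project's exact script:

# Sketch only: create the two input files shown above
echo "Hello Hadoop" > file1.txt
echo "Hello Docker" > file2.txt
# Put them into HDFS and run the stock wordcount example (jar path assumed)
hadoop fs -mkdir -p input
hadoop fs -put file1.txt file2.txt input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount input output
# Print the result
hadoop fs -cat output/part-r-00000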
IV. Steps to build an N-node Hadoop cluster
1. Preparatory work
Refer to Part III, steps 1 to 3: pull the images, modify the tags, and download the source code. Note that you do not have to pull serf-dnsmasq, but you should pull hadoop-base, because hadoop-master is built on top of hadoop-base.
2. Rebuild the hadoop-master image
./resize-cluster.sh 5

Don't worry, this finishes within one minute. You can pass any positive integer (1, 2, 3, 4, 5, 6...) to the resize-cluster.sh script as its parameter.
3. Start the containers
./start-container.sh 5

You can pass any positive integer (1, 2, 3, 4, 5, 6...) to the start-container.sh script as its parameter, and it should match the parameter given to resize-cluster.sh in the previous step. If it is larger, the extra nodes you start will not be recognized by Hadoop; if it is smaller, Hadoop will consider the missing nodes dead. A complete example session follows.
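Putting the whole N-node workflow together (here N = 5), using only the commands already introduced above:

# Example session for a 5-node cluster; both parameters must match
./resize-cluster.sh 5     # rebuild hadoop-master for 5 nodes (under a minute)
./start-container.sh 5    # start 1 master + 4 slave containers
serf members              # repeat until all 5 nodes show as alive
./start-hadoop.sh         # then start Hadoop
./run-wordcount.sh        # and verify with wordcount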
4. Test the cluster
Refer to Part III, steps 5 to 7: test the containers, start Hadoop, and run wordcount. Note that if the number of nodes has increased, be sure to test the containers first and only then start Hadoop, because serf may not yet have discovered all the nodes and dnsmasq's DNS server may not yet be able to resolve them; the waiting time depends on machine performance. That is the whole example of quickly building a multi-node Hadoop cluster based on Docker. Thank you for reading!