
How to Build and Configure a Hadoop Cluster Environment


This article explains how to build and configure a Hadoop cluster environment. The content is straightforward and easy to follow; read along to learn how to plan and set up a Hadoop cluster.

I. The Choice of Hardware

First of all, hardware selection for a Hadoop cluster environment comes down to a few questions:

1. How many nodes do you need to build a cluster?

The point to consider here is how many server environments you need: in a distributed environment, one server is one node, so the number of nodes should be decided with reference to the business scenarios the cluster will serve. Of course, the more nodes a distributed cluster has, the better the overall performance of the cluster, but more nodes also mean higher cost.

However, there is a reference minimum for the number of nodes in a Hadoop cluster.

First, in a Hadoop cluster environment, the NameNode, SecondaryNameNode, and DataNode need to run on different nodes, so at least three nodes are needed to take on these roles, which means at least three servers. In addition, once Hadoop jobs complete, another role, the History Server, records the execution history of completed applications; it is recommended to run this role on a separate server.

So the simplest Hadoop distributed cluster needs at least three servers, plus an optional fourth (a sketch of this role-to-node layout follows the list):

The first records how all the data is distributed across the cluster; the process it runs is the NameNode.

The second runs the SecondaryNameNode. Despite its name, it is not a hot standby; it keeps periodic checkpoints of the NameNode's metadata, which can help recover the cluster's state if the first server goes down.

The third machine stores the actual data; the process it runs is the DataNode.

The fourth is an optional server that records the history of completed applications; the process it runs is the History Server.
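As a minimal sketch, a role-to-node layout like the one just described might look as follows (the hostnames are hypothetical, and in a real cluster the DataNode role runs on many machines, not just one):

```python
# Minimal role-to-node layout for a small Hadoop cluster (hypothetical hostnames).
cluster_layout = {
    "node01": ["NameNode"],           # records and coordinates data distribution
    "node02": ["SecondaryNameNode"],  # periodic checkpoints of NameNode metadata
    "node03": ["DataNode"],           # stores the actual data blocks
    "node04": ["JobHistoryServer"],   # optional: history of completed jobs
}

for host, roles in cluster_layout.items():
    print(f"{host}: {', '.join(roles)}")
```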

2. How do you choose the configuration for each server in the cluster?

This is really a question of hardware configuration: how to choose memory, CPU, storage, and so on. Of course, as long as the budget permits, the higher the configuration the better. When building a Hadoop environment, consider the following points.

First, when configuring the nodes according to their roles, not every server needs the same configuration. In a Hadoop cluster, the most important machine is the server running the NameNode, because it schedules and coordinates the whole cluster. Alongside it runs another critical process, the YARN ResourceManager, which actually coordinates the work of every node in the cluster. So this server should be configured higher than the other nodes.

Second, while a Hadoop cluster runs, the NameNode pulls the entire record of data distribution (the metadata) into memory. This means that as the cluster's data grows, and in a big data environment several TB or even PB of data is common, the metadata grows too, so NameNode memory must grow with it. Here is a reference:

On average, 1 GB of NameNode memory can manage about one million block files.

For example: with a block size of 128 MB, 3 replicas, and 200 servers each holding 4 TB of data, the number of blocks the NameNode must track is: 200 (servers) x 4,194,304 MB (4 TB each) / (128 MB x 3 replicas) ≈ 2,184,533 blocks, i.e. about 2.2 million, so the NameNode memory needed is close to 2.2 GB.
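Here is a small worked sketch of this estimate, using the article's rule of thumb of roughly 1 GB of NameNode memory per million blocks:

```python
def namenode_memory_gb(servers, raw_tb_per_server, block_mb=128, replicas=3):
    """Estimate NameNode heap from raw cluster capacity.

    Rule of thumb from the text: ~1 GB of memory per million blocks.
    """
    raw_mb = servers * raw_tb_per_server * 1024 * 1024  # total raw capacity in MB
    blocks = raw_mb / (block_mb * replicas)             # unique blocks to track
    return blocks / 1_000_000                           # heap in GB

# The example from the text: 200 servers with 4 TB each, 128 MB blocks, 3 replicas.
print(f"{namenode_memory_gb(200, 4):.2f} GB")  # -> 2.18 GB, close to 2.2 GB
```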

Third, because one machine serves as the backup, the SecondaryNameNode needs about the same amount of memory as the NameNode. As for the memory needed by each worker node, here is also a reference:

First calculate the number of virtual cores (Vcores) of the current CPUs: Vcores = number of CPU sockets x cores per CPU x hyper-threads per core.

Then size memory according to the virtual cores: memory capacity = Vcores x 2 GB (at least 2 GB per Vcore).
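For instance, applying these two rules of thumb (the socket and core counts below are hypothetical):

```python
def vcores(sockets, cores_per_socket, threads_per_core=2):
    # Vcores = CPU sockets x cores per socket x hyper-threads per core
    return sockets * cores_per_socket * threads_per_core

def min_memory_gb(vcore_count, gb_per_vcore=2):
    # The article's rule: at least 2 GB of memory per virtual core
    return vcore_count * gb_per_vcore

v = vcores(sockets=2, cores_per_socket=8)  # e.g. a dual-socket, 8-core machine
print(v, min_memory_gb(v))                 # -> 32 vcores, 64 GB minimum
```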

As for the choice of CPU, Hadoop performs distributed computation and its workload is basically compute-intensive parallel processing, so it is recommended to choose multi-socket, multi-core CPUs wherever possible, on every node if the budget permits.

Also, in a large distributed cluster, note that distributed computation requires frequent communication and IO between nodes, which places demands on network bandwidth; network cards of at least 1 gigabit, along with matching switches, are recommended.

3. How should the storage size of each node be configured? Is RAID needed?

First, let's talk about RAID. RAID is a storage-layer data backup mechanism whose purpose is to prevent data loss, and its best use case today is protecting a high-risk single server. In a distributed cluster, however, stored data is spread across the data nodes (DataNodes), and Hadoop already implements data replication by default. So RAID does not add much in a distributed system. The reasoning is simple: local backup on a single node is basically unable to recover usable data when that node goes down unexpectedly, whereas replicas on other nodes can.

Then let's analyze the question of storage. Clearly, the amount of data determines the overall storage size of the cluster, and with it the scale of the whole cluster.

Let's give an example:

Suppose the initial amount of data is 1 TB and it grows by 10 GB every day. The storage needed by the cluster for one year is calculated as follows:

(1 TB + 10 GB x 365 days) x 3 x 1.3 ≈ 17.8 TB

As you can see, this cluster needs about 18 TB of storage for one year. To explain the formula: the multiplication by 3 accounts for redundant copies to prevent data loss, since by default HDFS stores three replicas of the data on different servers; the multiplication by 1.3 reserves space for the node operating systems and for intermediate results of computation.

Then we proceed to calculate the number of nodes:

Number of nodes (Nodes) = 18 TB / 2 TB = 9

The division by 2 TB assumes that each node provides 2 TB of storage. From the cluster's total storage requirement, the number of data storage nodes works out to 9.

So the total number of nodes required: total nodes = 9 (data storage nodes) + 2 (NameNode and SecondaryNameNode) = 11.

At this point, you need to set up 11 servers to run the cluster.
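Putting the whole sizing calculation together as a small sketch (the initial volume, daily growth, and per-node capacity are the article's example figures):

```python
import math

def cluster_storage_tb(initial_tb, daily_growth_gb, days=365, replicas=3, overhead=1.3):
    # (initial data + one year of growth) x 3 replicas x 1.3 OS/temp overhead
    data_tb = initial_tb + daily_growth_gb * days / 1024
    return data_tb * replicas * overhead

def data_node_count(total_tb, tb_per_node=2):
    # Divide the cluster's storage requirement by each node's capacity
    return math.ceil(total_tb / tb_per_node)

storage = cluster_storage_tb(initial_tb=1, daily_growth_gb=10)
nodes = data_node_count(storage)
print(f"{storage:.1f} TB -> {nodes} data nodes + 2 masters = {nodes + 2} servers")
# -> 17.8 TB -> 9 data nodes + 2 masters = 11 servers
```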

II. The Choice of Software

Software selection for a Hadoop cluster environment revolves around a few products: the OS (operating system), the Hadoop version, the JDK version, the Hive version, the MySQL version, and so on.

1. Which operating system should I choose?

Hadoop is developed in Java, so the recommended operating system is Linux. The reason is simple: it is open source and free, and free alone is reason enough when competing with Microsoft's operating systems, because a cluster environment requires many servers, so the cost of Microsoft server licenses would be much higher. Besides, Microsoft can hardly be found among big data open-source products, so in this space it has been left far behind.

Among open-source Linux operating systems there are many distributions, a hundred flowers blooming; you can compare the differences and advantages of each online. Here I will simply tell you my recommended operating system: CentOS.

The following is quoted from a fellow blogger's introduction:

CentOS is an enterprise-class Linux distribution built from the freely available source code of Red Hat Enterprise Linux. Each version of CentOS is supported for seven years (through security updates). New versions of CentOS are released every two years, and each version is updated periodically (roughly every six months) to support new hardware. The result is a secure, low-maintenance, stable, highly predictable and highly reproducible Linux environment.

Characteristics of CentOS

CentOS can be understood as the Red Hat AS series: it is released after improvements on Red Hat AS, and its operation and use are no different from RED HAT.

CentOS is completely free, without RED HAT AS4's requirement for a serial number.

CentOS's yum command supports online upgrades and can update the system instantly, unlike RED HAT, which requires paid support services.

CentOS has fixed many of RED HAT AS's bugs.

CentOS version notes: CentOS 3.1 is equivalent to RED HAT AS3 Update 1; CentOS 3.4 to RED HAT AS3 Update 4; CentOS 4.0 to RED HAT AS4.

Well, I believe the reasons above are enough to win you over.

2. How to choose the version of Hadoop?

Hadoop has gone through many versions over its history; interested readers can look them up on their own. Here I split the versions only along broad lines, into what I will call Hadoop 1.0 and Hadoop 2.0. As of the time of writing, Hadoop 2.0 has become quite stable and is gradually being widely adopted in enterprise applications. I will not introduce the two versions in detail here; you can look them up yourself, or refer to my earlier article comparing the architectures of the two.

Therefore, this series is based on the Hadoop 2.0 line.

As for the JDK version, it must match the Hadoop version; we will analyze other related products later. You can check the compatibility on the official Hadoop website, so I will not repeat it here.

III. The Operating System

For ease of demonstration, I will use virtual machines; interested readers can download virtualization software and follow me step by step to build this platform. The virtualization software I chose is VMware.

You can download and install it from the Internet; the process is very simple and needs no explanation. Your PC does need a reasonably good configuration, at least 8 GB of memory or more, otherwise it basically cannot run the virtual machines.

[Screenshot: VMware after installation.] For the installation process itself, please consult the relevant information online; I will not repeat it here.

Then we install the Linux operating system. As mentioned above, we chose CentOS, so we download it from the official CentOS website. Remember: don't be afraid, and don't spend money!

When choosing the CentOS version, keep one thing in mind: unless your company requires otherwise, try not to pick the newest release but the most stable one. The reason is simple: nobody wants to be the guinea pig for a new version.

Then select a stable version to download; here I recommend the 64-bit CentOS 6.8.

Then click through to the download package and download it.

Before installing each node, we need to prepare the configuration information of the nodes in advance, such as machine name, IP address, role, super-admin account, memory allocation, storage, and so on. Here is a table for reference:

Machine name | IP address | Role | OS | Super admin (Name) | Super admin password (PWD) | General user (Name) | General password (PWD)
Master.Hadoop | 192.168.1.50 | Master | CentOS 6.8 | root | password01 | hadoop | password01
Salve01.Hadoop | 192.168.1.51 | Slave1 | CentOS 6.8 | root | password01 | hadoop | password01
Salve02.Hadoop | 192.168.1.52 | Slave2 | CentOS 6.8 | root | password01 | hadoop | password01
Salve03.Hadoop | 192.168.1.53 | Slave3 | CentOS 6.8 | root | password01 | hadoop | password01
MySQLServer | 192.168.1.100 | MySQL Server | Ubuntu | root | password01 | hadoop | password01

As you can see, I first planned four servers to build the Hadoop cluster, assigned machine names, and set the IPs within a single network segment. Then, to build our Hadoop cluster, we need to create a dedicated user on every node in the cluster; I named it hadoop. For ease of memory, I set all passwords uniformly to password01.
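Since every node must be able to resolve the others by machine name, a common first step after installing the OS is to add all planned nodes to each machine's /etc/hosts. Here is a minimal sketch that prints those entries from the plan above (the hostnames and IPs are the table's example values):

```python
# Planned Hadoop nodes from the table above (example values).
nodes = {
    "Master.Hadoop":  "192.168.1.50",
    "Salve01.Hadoop": "192.168.1.51",
    "Salve02.Hadoop": "192.168.1.52",
    "Salve03.Hadoop": "192.168.1.53",
}

# Lines to append to /etc/hosts on every node in the cluster.
for name, ip in nodes.items():
    print(f"{ip}\t{name}")
```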

Of course, we also configure memory and storage in advance here; since we are using virtual machines, these can be adjusted dynamically later according to actual usage.

In addition, I set up two Ubuntu servers to install MySQL Server separately, in master-slave mode. Ubuntu is a user-friendly operating system; the reason for separating MySQL from the Hadoop cluster is that the MySQL database consumes memory, so we install it on separate machines. Of course, MySQL is not required by the Hadoop cluster itself, and there is no necessary relationship between the two; it is built here for the later installation of Hive to analyze data. We can develop and debug against it from these machines, or of course from the Windows platform, which after all is the one we use most proficiently.

Thank you for reading. The above covered how to build and configure a Hadoop cluster environment. After studying this article, I believe you have a deeper understanding of the topic; the specifics still need to be verified in practice.
