
What is the construction plan for a Hadoop cluster management system?


This article explains the construction plan for a Hadoop cluster management system. The editor finds it very practical and shares it here for study, in the hope that you will get something out of it. Let's take a look.

Building a Hadoop distributed cluster environment is a headache for every beginner: you may spend a great deal of time setting up the runtime environment and still not know why the build fails. For beginners, the probability of an unsuccessful build is quite high.

The DKHadoop distribution recommended for Hadoop beginners in an earlier article is indeed much easier to install than other Hadoop distributions. Because DKHadoop re-integrates and encapsulates the underlying components, it is a very friendly distribution for studying Hadoop, especially for beginners.

1. Distributed machine architecture

Machine 1 is the master node and machine 2 is the slave (standby) node; machine 3, machine 4 and so on are compute nodes. When the master node goes down, the slave node takes over its work; in the normal state, the slave node also works as a compute node. This architectural design ensures data integrity.

First of all, we make sure that each compute node runs a DataNode and a NodeManager. The compute nodes are the machines that actually do the work, so their number must be guaranteed. NameNode and ResourceManager are the two most important managers: a client request is first handled by the NameNode and the ResourceManager. The NameNode manages the metadata of the HDFS file system; whether reading or writing a file, the client must first ask the NameNode for the file's metadata and only then operate on the file. The ResourceManager is similar: it manages the cluster's resources and task scheduling, and you can think of it as the "operating system" of the big data cluster. Whether a client can submit and run an application depends on whether the ResourceManager is healthy.
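To make the NameNode's role concrete, here is a minimal Java sketch of an HDFS client read; the cluster address and file path are assumptions for illustration (fs.defaultFS = hdfs://master:9000). The open() call first asks the NameNode for the file's block metadata, and the bytes are then streamed from the DataNodes.

    // Minimal HDFS read sketch; addresses and paths are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://master:9000");      // NameNode address (assumed)
            try (FileSystem fs = FileSystem.get(conf);           // the client talks to the NameNode first
                 FSDataInputStream in = fs.open(new Path("/data/input/sample.txt")); // hypothetical file
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {     // the data itself streams from the DataNodes
                    System.out.println(line);
                }
            }
        }
    }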

2. At what scale of data is it worth using a big data approach?

First, from the point of view of data volume, there is no definite answer. Qualitatively, when you judge that the data cannot be processed on a single machine, because of memory limits, unacceptably long processing time, and so on, you turn to a cluster to reduce the time, provided that your processing logic can be distributed. Quantitatively, when the current data, or the data you expect in the future, reaches the PB level (or in some cases even the GB level) or more, you need to go distributed; again, the premise is that your processing logic can be distributed.
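As a rough, illustrative calculation (the figures are assumptions, not from the article): scanning 1 PB of data from a single disk at about 100 MB/s of sequential throughput takes roughly 10^7 seconds, on the order of 115 days, while spreading the same scan across 1,000 disks in parallel brings it down to a few hours. That is the kind of gap that justifies a cluster.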

Second, from the point of view of the algorithm, or the time complexity of the processing logic: even if you do not have many data records, the time complexity of your algorithm or processing logic may be n squared or even higher, and if at the same time the algorithm can be distributed, it is worth considering distributed processing. For example, even if you have only 10,000 records, if the time complexity really is n squared, think about how long a single machine would take; if the algorithm can be distributed, consider distributed processing.
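For an illustrative example under assumed numbers: comparing all pairs of 10,000 records means about 5 x 10^7 comparisons; if each comparison is expensive, say 1 millisecond, that is roughly 14 hours on a single machine, which is exactly the situation where a distributable algorithm starts to pay off.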

3. Several factors that constrain big data processing capability

A. Network bandwidth

The network is the link between computers, and of course the wider the better: when computer resources permit, a wider pipe lets more data be transmitted per unit of time and therefore more data be processed. In enterprise networks today, 100-megabit networking is still widespread and gigabit is common; ten-gigabit exists but is not yet widely used.
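For a rough sense of the numbers (illustrative, not from the article): a 100-megabit link carries at most about 12.5 MB per second, so moving 1 TB over it takes roughly 22 hours; a gigabit link cuts that to a little over 2 hours, and ten-gigabit to around 13 minutes.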

B. Disk

All data, no matter where it comes from, is eventually stored on some kind of hard drive or flash storage. Flash is much more efficient for reads and writes than a hard drive, but it has obvious disadvantages: high price and small capacity. Today the main storage medium is still the hard disk, which has two access patterns: sequential read/write and random read/write. In sequential reads and writes, the head moves steadily along the track, like an assembly line; in random reads and writes, the head jumps around looking for free space on the track to write data into. Sequential access is clearly more efficient than random access, so system architects make sequential read/write the main choice when designing big data storage schemes.
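HDFS is built around this preference: files are written as large sequential streams and read back in large blocks (128 MB by default in current Hadoop versions), rather than updated randomly in place. Below is a minimal Java sketch of such a sequential write; the cluster address and output path are assumptions for illustration.

    // Minimal HDFS sequential-write sketch; addresses and paths are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.charset.StandardCharsets;

    public class HdfsSequentialWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://master:9000");           // NameNode address (assumed)
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/data/output/records.txt"))) { // hypothetical path
                // HDFS only appends to the end of a file; there is no in-place random update,
                // so the underlying disks see large sequential writes.
                for (int i = 0; i < 1000; i++) {
                    out.write(("record " + i + "\n").getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }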

C. The number of computers

In a distributed cluster environment, more computers is generally better: for the same amount of data, the more computers there are, the less data each one has to handle, and processing efficiency naturally rises. However, the number of computers cannot be increased indefinitely; a cluster has a peak size it can accommodate, beyond which further improvement is very difficult and, if handled poorly, performance can even decline. The reasons mainly come from the short-board (bucket) effect, boundary effects and scale amplification effects. According to a test many years ago, based on Pentium 3 and Pentium 4 chips and running the LAXCUS big data system on a 100-megabit network, the bottleneck began to emerge at a scale of about a thousand computers. With today's X86 chips and higher-speed networks, a cluster should be able to accommodate more machines.

D. Code quality

This is not a key issue, but it is one that enterprises must pay attention to, because it relates to the quality of the code written by programmers. In fact, every big data product is a semi-finished product: it only provides a computing framework, and to actually apply it in enterprise production, programmers still have to implement a large amount of business code. To make a big data application high quality, the technical lead should do the up-front design well, clarify and standardize the business processes, and have programmers write the code in a unified style once they receive the plan. This is a process of cooperation between the two sides; in other words, good coordination is essential.

The above is the construction plan for a Hadoop cluster management system. The editor believes it covers knowledge points you may see or use in your daily work, and hopes you can learn more from this article. For more details, please follow the industry information channel.


