2025-04-04 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
To manage and schedule cluster resources in a unified way, Hadoop 2.0 introduced the data operating system YARN. The introduction of YARN greatly improved cluster resource utilization and reduced cluster management costs. First, YARN allows multiple applications to run in one cluster and allocates resources to them on demand, which greatly improves resource utilization. Second, YARN allows all kinds of short jobs and long services to be deployed in one cluster, and provides support for fault tolerance, resource isolation, and load balancing, which greatly simplifies the deployment and management of jobs and services.
[toc]
Before sharing, a word about the big data exchange group I created: 784557197. Everyone, from beginners to experts, is welcome to join the discussion.
YARN uses a master/slave architecture, as shown in figure 1, where the master is called ResourceManager and the slaves are called NodeManagers; ResourceManager is responsible for the unified management and scheduling of the resources on each NodeManager. When users submit an application, they provide an ApplicationMaster to track and manage the program; it requests resources from ResourceManager and asks NodeManagers to start Containers that occupy certain resources. Because different ApplicationMasters are distributed across different nodes, and their resources are separated by an isolation mechanism, they do not affect each other.
Resource management and scheduling in YARN are handled by the resource scheduler, one of the core components of Hadoop YARN and a pluggable service component inside ResourceManager. YARN organizes and divides resources through hierarchical queues and provides several multi-tenant resource schedulers. Such a scheduler lets administrators group users or applications according to their needs and allocate different amounts of resources to different groups, while adding constraints that prevent individual users or applications from monopolizing resources, thereby meeting a variety of QoS requirements. The typical representatives are Yahoo!'s Capacity Scheduler and Facebook's Fair Scheduler.
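To make the hierarchical-queue idea concrete, a capacity-scheduler.xml fragment along these lines divides cluster capacity between two groups and caps one of them so it cannot monopolize the cluster. The queue names and percentages below are hypothetical; the property names are the Capacity Scheduler's standard ones.

```xml
<!-- Hypothetical queue layout: two groups sharing the cluster 40/60 -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>analytics,services</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.services.capacity</name>
  <value>60</value>
</property>
<property>
  <!-- cap the queue so one group cannot monopolize idle resources -->
  <name>yarn.scheduler.capacity.root.analytics.maximum-capacity</name>
  <value>55</value>
</property>
```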
As a general data operating system, YARN can not only run short jobs such as MapReduce and Spark but also host long services such as web servers and MySQL servers, truly giving one cluster multiple uses. Such a cluster is often called a lightweight elastic computing platform: lightweight because YARN adopts the lightweight cgroups isolation scheme, and elastic because YARN can adjust the resources occupied by each computing framework or application according to its load or demand, so that cluster resources are shared and can shrink or grow flexibly.
Application of Hadoop YARN in Heterogeneous Clusters
Starting from version 2.6.0, YARN introduced a new scheduling strategy: the label-based scheduling mechanism. Its main motivation is to let YARN run better in heterogeneous clusters, and thus better manage and schedule mixed types of applications.
1. What is label-based scheduling?
As the name suggests, label-based scheduling is a scheduling strategy. Like priority-based scheduling, it is one of many strategies in the scheduler and can be combined with other scheduling strategies. The basic idea is that users can attach labels to each NodeManager, such as highmem or highdisk, as basic attributes of that NodeManager; at the same time, users can set several labels on a queue in the scheduler to restrict the queue to occupying only resources on nodes carrying the corresponding labels, so that jobs submitted to the queue run only on specific nodes. By labeling, users can divide a Hadoop cluster into several subclusters, which in turn lets them run applications on nodes with particular characteristics, for example memory-intensive applications (such as Spark) on large-memory nodes.
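The label-matching rule above can be sketched in a few lines. This is only an illustration of the idea, not YARN's actual scheduler code; the node and label names are made up.

```python
def schedulable_nodes(node_labels, queue_labels):
    """Return the nodes whose label set intersects the queue's labels.

    node_labels:  dict mapping node name -> set of labels on that NodeManager
    queue_labels: set of labels the queue is allowed to use
    """
    return {node for node, labels in node_labels.items()
            if labels & queue_labels}

nodes = {
    "nm1": {"highmem"},
    "nm2": {"highdisk"},
    "nm3": {"highmem", "highdisk"},
    "nm4": set(),          # unlabeled node: no labeled queue can use it
}

# A queue restricted to the "highmem" label only sees nm1 and nm3.
print(sorted(schedulable_nodes(nodes, {"highmem"})))  # ['nm1', 'nm3']
```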
2. Hulu application case
The label-based scheduling strategy is widely used in Hulu. The mechanism was enabled mainly for the following three considerations:
- Clusters are heterogeneous. As a Hadoop cluster evolves, newly added machines usually have better configurations than the old ones, so the cluster eventually becomes heterogeneous. Many of Hadoop's original design decisions assume a homogeneous cluster, and even now its support for heterogeneous clusters is imperfect; for example, MapReduce's speculative execution mechanism does not take the diversity of heterogeneous clusters into account. Hulu deploys MapReduce, Spark, Spark Streaming, Docker services, and other types of applications on its YARN cluster. When running mixed workloads on a heterogeneous cluster, the completion times of parallel tasks often vary widely because of differing machine configurations, which hurts the efficiency of distributed programs.
- Applications interfere with each other. Because YARN lacks complete resource isolation, multiple applications running on one node can easily interfere with each other, which is usually intolerable for low-latency applications.
- Personalized machine requirements. Because of dependencies on special environments, some applications can only run on specific nodes in a large cluster. Typical examples are Spark and Docker: Spark MLlib may use native libraries which, to avoid polluting the system, are installed on only a few nodes; running Docker containers depends on the Docker engine, and to simplify operations and maintenance we install it on only a few designated nodes.
To solve the above problems, Hulu enabled the label-based scheduling policy on top of Capacity Scheduler. As shown in figure 3, we label the various nodes in the cluster according to machine configuration and application requirements, including:
- spark-node: machines used to run Spark jobs, usually with high configurations, especially large memory;
- mr-node: machines running MapReduce jobs; these machines have diverse configurations;
- docker-node: machines running Docker applications; these machines have the Docker engine installed;
- streaming-node: machines running Spark Streaming applications.
It is important to note that YARN allows multiple labels on one node at the same time, so a single machine can run multiple types of applications (within Hulu, we allow some nodes to be shared and to run several applications at once). On the surface, introducing labels divides the cluster into multiple physical clusters, but unlike traditional, completely isolated clusters, these subclusters are both independent and interrelated: users can easily and dynamically adjust how a node is used by changing its labels.
Application Cases and Experience Summary of Hadoop YARN
1. Typical application cases
As a data operating system, Hadoop YARN provides a rich API for developing applications. Hulu has done a lot of exploration and practice in designing YARN applications, and has developed several distributed computing frameworks and engines that run directly on YARN; typical representatives are Voidbox and Nesto.
(1) Docker-based container computing framework Voidbox
Docker is a container virtualization technology that has become very popular in the past two years. It can package and deploy most applications automatically, and it lets any program run in a resource-isolated container environment, providing a more elegant solution for building, releasing, and running projects.
In order to integrate the unique advantages of YARN and Docker, the Beijing big data team of Hulu developed Voidbox. Voidbox is a distributed computing framework, which uses YARN as the resource management module and Docker as the engine to execute tasks, so that YARN can schedule not only traditional MapReduce and Spark applications, but also applications encapsulated in Docker images.
Voidbox supports Docker Container-based DAG (directed acyclic graph) tasks and long services (such as web services), and provides multiple ways to submit applications, such as a command line and an IDE, to meet the needs of both production and development environments. In addition, Voidbox can work with Jenkins, GitLab, and a private Docker repository to complete a full development, testing, and automatic release process.
In Voidbox, YARN is responsible for resource scheduling in the cluster, and Docker acts as an execution engine that pulls images from Docker Registry and runs them. Voidbox requests resources for Container-based DAG tasks and runs the Docker tasks. As shown in figure 4, each black box represents a machine with several modules running on it, as follows:
Voidbox components:
- VoidboxClient: the client program. It allows users to manage Voidbox applications (a Voidbox application contains one or more Docker jobs, and a job contains one or more Docker tasks), for example submitting or killing them.
- VoidboxMaster: actually a YARN ApplicationMaster, responsible for requesting resources from YARN and further allocating them to internal Docker tasks.
- VoidboxDriver: responsible for task scheduling within a single Voidbox application. Voidbox supports Docker Container-based DAG task scheduling, and user code can be inserted between tasks; VoidboxDriver handles the dependency ordering between DAG tasks and runs that user code.
- VoidboxProxy: a bridge between YARN and the Docker engine, relaying YARN commands to the Docker engine, such as starting or killing a Docker container.
- StateServer: maintains health information for each Docker engine and provides VoidboxMaster with the list of machines that can run Docker Containers, so that VoidboxMaster can request resources more effectively.
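The DAG scheduling that VoidboxDriver performs, running each task only after all of its dependencies finish, can be sketched with the standard library's topological sorter. This is an illustration of the idea only, with made-up task names, not Voidbox code.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

def execution_batches(dag):
    """Yield batches of tasks that may run in parallel.

    dag: dict mapping task -> set of tasks it depends on.
    """
    ts = TopologicalSorter(dag)
    ts.prepare()
    while ts.is_active():
        batch = sorted(ts.get_ready())  # all tasks whose deps are done
        yield batch
        ts.done(*batch)                 # mark the batch finished

dag = {
    "extract": set(),
    "clean":   {"extract"},
    "train":   {"clean"},
    "report":  {"clean", "train"},
}
print(list(execution_batches(dag)))
# [['extract'], ['clean'], ['train'], ['report']]
```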
Docker components:
- DockerRegistry: stores Docker images and serves as the version management tool for internal Docker images.
- DockerEngine: the engine that runs Docker Containers; it pulls the corresponding image from Docker Registry and executes Docker commands.
- Jenkins: works with GitLab to manage application versions. When an application's version is updated, Jenkins compiles and packages it, generates a Docker image, and uploads it to Docker Registry, completing the automated release process.
Similar to Spark on YARN, Voidbox provides two modes of running applications: yarn-cluster mode and yarn-client mode. In yarn-cluster mode, the application's control and resource management components run in the cluster; after the Voidbox application is submitted successfully, the client can exit at any time without affecting the application running in the cluster, so yarn-cluster mode is suitable for submitting applications in production. In yarn-client mode, the application's control components run on the client while the other components run in the cluster; the client can see more information about the application's running state, and when the client exits, the application in the cluster exits too, which makes yarn-client mode convenient for debugging.
(2) Parallel computing engine Nesto
Nesto is an MPP computing engine developed at Hulu, similar to Presto/Impala. It is designed to process complex nested data and supports complex processing logic that is difficult to express in SQL, using columnar storage, code generation, and other optimization techniques to speed up data processing. Nesto's architecture, like Presto's and Impala's, is decentralized: multiple nesto servers perform service discovery through ZooKeeper.
To simplify Nesto deployment and administration, Hulu deploys Nesto directly on YARN. This makes installation and deployment very simple: the Nesto installer (configuration files and jar packages) is packed into a single package stored in HDFS, and users can quickly deploy a Nesto cluster by running a submit command that specifies the number of nesto servers to start and the resources required by each server.
The Nesto-on-YARN program consists of an ApplicationMaster and multiple Executors. The ApplicationMaster requests resources from YARN and starts Executors, and each Executor starts a nesto server. The key design point is the ApplicationMaster, whose functions include:
Communicating with ResourceManager to request resources, which need to come from different nodes so that only one Executor is started per node.
Communicating with NodeManagers to start Executors and monitor their health; once an Executor is found to have failed, a new Executor is restarted on another node.
Providing an embedded web server that shows the status of the tasks in each nesto server.
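The first two responsibilities, one Executor per node plus restart-on-failure, can be sketched as a small placement function. This is illustrative logic only (with made-up node names), not the Hadoop AMRMClient API.

```python
def place_executor(occupied, failed_node, all_nodes):
    """Pick a node for a replacement Executor.

    occupied:    set of nodes currently running an Executor
    failed_node: node whose Executor just died (avoid restarting there)
    all_nodes:   all nodes in the cluster
    Returns a free node, or None if every other node already has an Executor.
    """
    occupied = occupied - {failed_node}              # the dead Executor is gone
    candidates = sorted(all_nodes - occupied - {failed_node})
    return candidates[0] if candidates else None     # one Executor per node

nodes = {"n1", "n2", "n3", "n4"}
running = {"n1", "n2", "n3"}
# n2's Executor fails: the only node with no Executor (other than n2) is n4.
print(place_executor(running, "n2", nodes))  # n4
```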
2. Summary of Hadoop YARN development experience
(1) Make good use of the resource request API
Hadoop YARN provides rich semantics for expressing resource requests. Users can request resources on a specific node or rack, or use a blacklist to stop accepting resources from certain nodes.
(2) Pay attention to memory overhead
The memory of a container consists of three parts: Java heap, JVM overhead, and non-Java memory. If the heap size the user sets for the application is X GB (-XmxXg), then the container memory requested by the ApplicationMaster should be X + D, where D is the JVM overhead; otherwise the container may be killed by YARN because its total memory exceeds the limit.
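The sizing rule above amounts to a simple sum; the helper and the numbers below are hypothetical, just to make the arithmetic concrete.

```python
def container_memory_mb(heap_mb, jvm_overhead_mb, non_java_mb=0):
    """Memory the ApplicationMaster should request so that YARN does not
    kill the container for exceeding its limit: heap (X) + JVM overhead (D)
    + any non-Java memory."""
    return heap_mb + jvm_overhead_mb + non_java_mb

# e.g. a -Xmx4096m heap with an assumed ~512 MB of JVM overhead:
print(container_memory_mb(4096, 512))  # 4608
```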
(3) Log rotation
For long services, logs accumulate over time, so log rotation is particularly important. Since an application cannot know the specific location of its logs (which node, which directory) before startup, YARN provides macros to make the log directory accessible: when a macro appears in the startup command, YARN automatically replaces it with the concrete log directory. For example:
```shell
echo $log4jcontent > $PWD/log4j.properties && \
  java -Dlog4j.configuration=log4j.properties ... com.example.NestoServer \
  1>><LOG_DIR>/server.log 2>><LOG_DIR>/server.log
```
The variable log4jcontent is as follows:
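A minimal log4j configuration along these lines would enable rotation; the file size, backup count, and pattern below are illustrative assumptions, not the exact configuration used at Hulu.

```properties
# Illustrative only: rotate server.log at 256 MB, keep 10 backups
log4j.rootLogger=INFO, R
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=server.log
log4j.appender.R.MaxFileSize=256MB
log4j.appender.R.MaxBackupIndex=10
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```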
(4) Debugging skills
Before starting a Container, NodeManager writes the Container's environment variables, startup command, and other information into a shell script, and launches the Container by running that script. In some cases a Container fails to start because the startup command is wrong (for example, some special characters were escaped). To diagnose this, you can inspect the contents of the script that was actually executed, for instance by prepending a command that prints the script's contents before the Container's command runs.
(5) Performance problems caused by shared clusters
When multiple applications run in a YARN cluster at the same time, node loads differ, which can make tasks on some nodes slower than on others; this is unacceptable for applications with OLAP-style latency requirements. There are usually two solutions: 1) use labels to run such applications on dedicated nodes; 2) implement a speculative execution mechanism, similar to those in MapReduce and Spark, inside the application, starting one or more duplicates of each slow task and trading space for time so that slow tasks do not drag down the whole application.
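The speculative-execution idea in solution 2 can be sketched as: launch a duplicate for any task whose observed progress rate falls well behind its peers, then keep whichever copy finishes first. The threshold and task names below are made-up illustrations, not MapReduce's or Spark's actual heuristics.

```python
def pick_speculative_tasks(progress_rates, slowdown_factor=0.5):
    """Return tasks whose progress rate is below slowdown_factor * median,
    i.e. the stragglers worth duplicating on another node.

    progress_rates: dict mapping task name -> progress units per second.
    """
    rates = sorted(progress_rates.values())
    median = rates[len(rates) // 2]
    return sorted(t for t, r in progress_rates.items()
                  if r < slowdown_factor * median)

rates = {"task1": 1.0, "task2": 0.9, "task3": 0.2, "task4": 1.1}
print(pick_speculative_tasks(rates))  # ['task3']
```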
Development trend of Hadoop YARN
YARN will continue to develop toward general resource management and scheduling, no longer limited to big data processing: it will support not only short jobs such as MapReduce and Spark but also long services such as web servers.