What are the new features of Hadoop 3.x?


This article introduces the main new features of Hadoop 3.x. Many readers run into questions about these features in real-world work, so let the editor walk you through them. I hope you read it carefully and come away with something useful!

JDK

In Hadoop 3, all Hadoop JAR packages are compiled against Java 8, so if you are still using Java 7 or earlier, you will need to upgrade to Java 8 to run Hadoop 3 properly.
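As a quick sanity check before upgrading (a minimal sketch of my own, not part of Hadoop), the following Java program reads the JVM's specification version and reports whether it meets the Java 8 requirement; the version-string handling is a simplifying assumption that covers the legacy "1.x" and modern single-number formats:

```java
public class JavaVersionCheck {
    public static void main(String[] args) {
        // "1.8" on Java 8; "9", "11", "17", ... on later releases.
        String spec = System.getProperty("java.specification.version");
        int major = spec.startsWith("1.")
                ? Integer.parseInt(spec.substring(2))   // legacy "1.x" scheme
                : Integer.parseInt(spec);               // modern single-number scheme
        if (major < 8) {
            System.err.println("Hadoop 3 needs Java 8 or later, but this JVM is " + spec);
            System.exit(1);
        }
        System.out.println("Java " + spec + " is new enough for Hadoop 3.");
    }
}
```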

EC technology

First of all, let's take a look at what erasure coding (EC) actually is.

Generally speaking, in storage systems EC is mainly used in Redundant Arrays of Inexpensive Disks (RAID). RAID applies EC through striping: logically sequential data (such as a file) is divided into smaller units (bits, bytes, or blocks), and consecutive units are stored on different disks.

Then, for each stripe of original data units, a certain number of parity units are calculated and stored; this process is called encoding. An error in any stripe unit can then be recovered by a decoding calculation over the remaining data units and parity units.
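To make the encode/decode idea concrete, here is a minimal sketch of my own using a single XOR parity unit, a deliberate simplification of the Reed-Solomon codes HDFS actually uses: the parity is the byte-wise XOR of all data units in a stripe, and any one lost unit can be rebuilt by XOR-ing the survivors with the parity.

```java
import java.util.Arrays;

public class XorParityDemo {
    // Encode: the parity unit is the byte-wise XOR of all data units in the stripe.
    static byte[] encodeParity(byte[][] dataUnits) {
        byte[] parity = new byte[dataUnits[0].length];
        for (byte[] unit : dataUnits) {
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= unit[i];
            }
        }
        return parity;
    }

    // Decode: rebuild one lost data unit by XOR-ing the parity with the surviving units.
    static byte[] recoverLostUnit(byte[][] survivingUnits, byte[] parity) {
        byte[] recovered = parity.clone();
        for (byte[] unit : survivingUnits) {
            for (int i = 0; i < recovered.length; i++) {
                recovered[i] ^= unit[i];
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        byte[][] stripe = { "unitA".getBytes(), "unitB".getBytes(), "unitC".getBytes() };
        byte[] parity = encodeParity(stripe);

        // Simulate losing the second unit and recover it from the rest plus the parity.
        byte[][] survivors = { stripe[0], stripe[2] };
        byte[] rebuilt = recoverLostUnit(survivors, parity);
        System.out.println("Recovered: " + new String(rebuilt)
                + ", matches original: " + Arrays.equals(rebuilt, stripe[1]));
    }
}
```

Production HDFS EC policies such as RS-6-3-1024k use Reed-Solomon coding instead, which can tolerate the loss of up to three units per stripe rather than one, but the encode-then-decode principle is the same.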

Before looking at how HDFS applies erasure coding, it is worth recalling the replication scheme used in Hadoop 2. By default, HDFS has a replication factor of 3: one original block plus two additional copies. Each copy costs another 100% of the original storage, so the two replicas add 200% storage overhead, and they also consume other resources such as network bandwidth when they are written. Yet for cold datasets with low I/O activity, those extra replicas are rarely accessed in normal operation while still consuming the same amount of resources as the original data.

This is where EC comes in: in HDFS, erasure coding can be used in place of replication, providing the same level of fault tolerance while using far less storage.

Integrating EC into HDFS keeps the same fault tolerance while improving storage efficiency. For example, with a replication factor of 3, a file of 6 blocks consumes 6 × 3 = 18 blocks of disk space; deployed with EC (6 data blocks plus 3 parity blocks), the same file consumes only 9 blocks of disk space (6 data + 3 parity), a saving of up to 50% in storage overhead.
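That comparison is easy to check with a little arithmetic. The sketch below is my own illustration (not Hadoop code) of the raw disk blocks consumed by 3-way replication versus an RS(6,3) layout for the 6-block file in the example:

```java
public class StorageOverhead {
    public static void main(String[] args) {
        int fileBlocks = 6;          // logical blocks in the file
        int replicationFactor = 3;   // default HDFS replication

        // 3-way replication: every logical block is stored three times.
        int replicatedBlocks = fileBlocks * replicationFactor;            // 18

        // RS(6,3): every full stripe of 6 data blocks adds 3 parity blocks.
        int dataPerStripe = 6, parityPerStripe = 3;
        int stripes = (fileBlocks + dataPerStripe - 1) / dataPerStripe;   // 1
        int ecBlocks = fileBlocks + stripes * parityPerStripe;            // 9

        System.out.printf("replication: %d blocks, EC RS(6,3): %d blocks (%.0f%% of replication)%n",
                replicatedBlocks, ecBlocks, 100.0 * ecBlocks / replicatedBlocks);
    }
}
```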

Because erasure coding adds extra overhead for data reconstruction and remote reads, it is usually used for data that is accessed infrequently. Before deploying EC, users should weigh all of its costs, such as storage, network, and CPU.

YARN Timeline Service v.2

Hadoop introduced YARN Timeline Service v.2 to solve two main problems:

Improve the scalability and reliability of timeline services

Enhance usability by introducing flows and aggregation

First of all, let's analyze its scalability.

1. Scalability

Timeline Service v.1 is limited to a single instance for both writes and reads and does not scale well beyond small clusters. Version 2 uses a more scalable distributed architecture and scalable back-end storage, separating the data-write path from the data-read path. Writes go through distributed collectors, essentially one collector per YARN application, while reads are served by separate reader instances that answer queries through a REST API.
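On the read side, the v.2 timeline readers expose their data over HTTP under the /ws/v2/timeline/ path. The sketch below is a minimal, hand-rolled query against that REST API; the reader host, the 8188 port, the cluster name, and the application id are placeholders you would replace with your own values:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimelineReaderQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port/ids -- adjust to your cluster's timeline reader settings.
        String base = "http://timeline-reader-host:8188/ws/v2/timeline";
        String cluster = "my-cluster";
        String appId = "application_1234567890123_0001";

        // Fetch the timeline entity describing one YARN application.
        URL url = new URL(base + "/clusters/" + cluster + "/apps/" + appId);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // raw JSON response from the reader daemon
            }
        } finally {
            conn.disconnect();
        }
    }
}
```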

2. Usability

As for usability, in many cases users are interested in information about flows, i.e. logical groups of YARN applications: it is common to launch a set or series of YARN applications to complete a single logical application, and Timeline Service v.2 models these flows explicitly.

3. Architecture

YARN Timeline Service v.2 uses a set of collectors (writers) to write data to the back-end storage. The collectors are distributed and co-located with the application masters they are dedicated to: all data belonging to an application is sent to that application's timeline collector, with the exception of the resource manager's timeline collector.

For a given application, the application master writes its data to the co-located timeline collector. In addition, the node managers of other nodes running containers for the application also write data to the timeline collector on the node running the application master. The resource manager maintains its own timeline collector as well, but it emits only YARN-generic lifecycle events to keep its volume of writes reasonable. The timeline readers are separate daemons, independent of the collectors, and are dedicated to serving queries through the REST API.
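From an application master's point of view, publishing to its co-located collector goes through the TimelineV2Client API. The following is a rough sketch of the pattern described in the Hadoop documentation; the entity type and id are placeholders, and the step of handing the collector address to the client (normally done by registering the client with the AM-RM client) is left out as a simplifying assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineV2Client;

public class TimelinePublishSketch {
    public static void publishAppEvent(ApplicationId appId, Configuration conf) throws Exception {
        // One v.2 client per application; it talks to the collector co-located with the AM.
        TimelineV2Client timelineClient = TimelineV2Client.createTimelineClient(appId);
        timelineClient.init(conf);
        timelineClient.start();
        try {
            TimelineEntity entity = new TimelineEntity();
            entity.setType("MY_APPLICATION");        // placeholder entity type
            entity.setId("my-app-attempt-1");        // placeholder entity id
            entity.setCreatedTime(System.currentTimeMillis());

            // Blocking write to the application's timeline collector.
            timelineClient.putEntities(entity);
        } finally {
            timelineClient.stop();
        }
    }
}
```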

This is the end of "What are the new features of Hadoop 3.x?". Thank you for reading. If you want to learn more about the industry, you can follow the website; the editor will keep publishing more high-quality, practical articles for you!
