

What does Apache Tez mean?

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article explains what Apache Tez refers to. The introduction is concise and easy to follow, and I hope you gain something from it.

You may have heard of Apache Tez, a new distributed execution framework for Hadoop data processing applications. But what is it? How does it work? Who should use it and why? If you have these questions, take a look at the presentation "Apache Tez: accelerating Hadoop query processing" provided by Bikas Saha and Arun Murthy, in which they discussed the design of Tez, some of its highlights, and shared some of the initial results achieved by letting Hive use Tez instead of MapReduce.

This presentation transcript was edited by Roopesh Shenoy.

Tez is Apache's latest open-source computing framework supporting DAG jobs. By expressing multiple dependent jobs as a single job, it can greatly improve their performance. Tez is not aimed directly at end users; rather, it enables developers to build faster and more scalable applications for end users. Hadoop has traditionally been a platform for batch processing of large amounts of data, but many use cases demand near-real-time query performance, and other workloads, such as machine learning, do not fit the MapReduce model at all. The purpose of Tez is to help Hadoop address these use cases.

The goal of the Tez project is to support a high degree of customization, so that it can meet the needs of a wide variety of use cases and let people get their work done without resorting to external tools. If projects such as Hive and Pig use Tez instead of MapReduce as the backbone of their data processing, their response times improve significantly. Tez is built on top of YARN, the new resource-management framework used by Hadoop.

Design philosophy

The main reason Tez exists is to get around the restrictions imposed by MapReduce. Beyond the limitation of having to write a Mapper and a Reducer, forcing every kind of computation into this paradigm is inefficient; for example, using HDFS to store temporary data between multiple MR jobs is a heavy cost. In Hive, queries that require multiple shuffle operations on unrelated keys are common, such as join - group by - window function - order by.

The key elements of Tez's design philosophy include:

Allow developers (including end users) to do what they want to do in the most efficient way

Better execution performance

Tez's ability to achieve these goals depends on the following:

An expressive dataflow API: the Tez team wanted an API through which users can describe, as an expressive data flow, the directed acyclic graph (DAG) of the computation they want to run. To this end, Tez provides a structured, typed API to which you add all the processors and edges, and which lets you visualize the graph that is actually built.
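The real Tez API is Java; as a rough, self-contained illustration of the idea of building up a graph of processors and edges and then ordering it for execution, here is a minimal Python model. All class and method names (`Dag`, `add_vertex`, `add_edge`) are invented for illustration, not the actual Tez API:

```python
# Minimal illustrative model of a DAG of processing steps.
# Names are invented; the real Tez API is Java (DAG, Vertex, Edge classes).
from collections import deque

class Dag:
    def __init__(self, name):
        self.name = name
        self.vertices = []   # processor steps
        self.edges = []      # (producer, consumer) pairs

    def add_vertex(self, name):
        self.vertices.append(name)
        return self

    def add_edge(self, producer, consumer):
        self.edges.append((producer, consumer))
        return self

    def topological_order(self):
        """Order vertices so every producer precedes its consumers."""
        indegree = {v: 0 for v in self.vertices}
        for _, consumer in self.edges:
            indegree[consumer] += 1
        queue = deque(v for v in self.vertices if indegree[v] == 0)
        order = []
        while queue:
            v = queue.popleft()
            order.append(v)
            for p, c in self.edges:
                if p == v:
                    indegree[c] -= 1
                    if indegree[c] == 0:
                        queue.append(c)
        if len(order) != len(self.vertices):
            raise ValueError("graph has a cycle; a DAG must be acyclic")
        return order

# A map step feeding two chained reduce phases (the MRR pattern below):
dag = (Dag("hive-query")
       .add_vertex("map").add_vertex("reduce1").add_vertex("reduce2")
       .add_edge("map", "reduce1").add_edge("reduce1", "reduce2"))
print(dag.topological_order())  # ['map', 'reduce1', 'reduce2']
```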

A flexible Input-Processor-Output runtime model: runtime executors can be built dynamically by connecting different inputs, processors, and outputs.

Data-type independence: Tez only cares about the movement of data, not the data format (key-value pairs, tuple-oriented formats, etc.).

Dynamic graph reconfiguration

Simple deployment-Tez is entirely a client application that leverages YARN's local resources and distributed cache. As far as the use of Tez is concerned, you don't need to deploy anything on your own cluster, just upload the relevant Tez class libraries to HDFS, and then use the Tez client to submit them.

You can even put two class libraries on your cluster: one for production, using a stable version for all production jobs, and another with the latest version for users to try out. The two libraries are independent and do not affect each other.

Tez can run any MR job without changes, which allows tools that currently rely on MR to migrate incrementally.

Let's take a closer look at this expressive dataflow API and see what we can do with it. For example, instead of multiple MapReduce jobs you can use the MRR pattern, in which a single map is followed by multiple reduce phases. Data then flows between the processors without being written to HDFS (it is written to local disk, but only for checkpointing), which is a significant performance improvement over the old approach. The following diagrams illustrate the process:

The process shown in the first diagram consists of multiple MR jobs, each storing its intermediate results on HDFS: the reducers of one step feed the mappers of the next. The second diagram shows the process with Tez, where the same work is done in a single job with no need to access HDFS in between.
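As a toy illustration of that difference (pure Python, nothing Tez-specific; the dictionary standing in for HDFS and both function names are invented), the first function simulates two chained MR jobs that round-trip through a fake HDFS, while the second streams the same stages directly:

```python
# Toy comparison: chained MR jobs persist intermediate data to a fake HDFS,
# while a single Tez-style MRR job passes data straight between phases.

fake_hdfs = {}  # stands in for HDFS; real jobs pay disk + replication cost here

def mr_style(records):
    # Job 1: map + reduce, result written out to "HDFS"
    stage1 = sorted(r * 2 for r in records)
    fake_hdfs["job1/output"] = stage1
    # Job 2 must re-read job 1's output from "HDFS"
    return [r + 1 for r in fake_hdfs["job1/output"]]

def mrr_style(records):
    # One job, two reduce phases: data flows directly between phases
    stage1 = sorted(r * 2 for r in records)
    return [r + 1 for r in stage1]

data = [3, 1, 2]
assert mr_style(data) == mrr_style(data) == [3, 5, 7]  # same result
assert "job1/output" in fake_hdfs  # but the MR path paid an HDFS round trip
```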

The flexibility Tez offers means it takes more effort to use than MapReduce: you have to learn more APIs and implement more processing logic. But that is fine; unlike MapReduce, Tez is not an end-user-facing application, and its purpose is to let developers build end-user applications on top of it.

The above is an overview of Tez and a description of its goals. Let's take a look at its actual API.

Tez API

The Tez API includes the following components:

Directed acyclic graph (DAG)-defines the overall task. A DAG object corresponds to a task.

Node (Vertex)-defines the user logic and the resources and environments required to execute the user logic. A node corresponds to a step in a task.

Edge-defines the connection between the producer and consumer nodes.

Edges need to be assigned attributes, which are necessary for Tez to expand the logical diagram into a collection of physical tasks that can be executed in parallel on the cluster at run time. Here are some of these properties:

Data movement attributes that define how data is moved from a producer to a consumer.

Scheduling attributes (sequential or concurrent) that define when producer and consumer tasks should be scheduled relative to each other.

Data source attributes (persistent, reliable, or temporary) that define the lifecycle or persistence of the task output, allowing us to decide when to terminate.
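The three edge attributes above can be sketched as enumerations. In this illustrative Python model the value names loosely follow Tez's Java enums, but the `EdgeProperty` container class and its fields are an invented simplification, not the real API:

```python
# Illustrative model of the three edge attributes listed above.
# Enum value names loosely follow Tez's Java enums; the rest is invented.
from dataclasses import dataclass
from enum import Enum

class DataMovement(Enum):
    ONE_TO_ONE = 1       # producer task i feeds consumer task i
    BROADCAST = 2        # every producer output goes to every consumer task
    SCATTER_GATHER = 3   # shuffle: producers partition, consumers gather

class Scheduling(Enum):
    SEQUENTIAL = 1       # consumer starts only after the producer completes
    CONCURRENT = 2       # producer and consumer tasks may run in parallel

class DataSource(Enum):
    PERSISTED = 1            # output outlives the task (may be lost later)
    PERSISTED_RELIABLE = 2   # output stored reliably (e.g. on HDFS)
    EPHEMERAL = 3            # output lives only while the producer runs

@dataclass(frozen=True)
class EdgeProperty:
    movement: DataMovement
    scheduling: Scheduling
    source: DataSource

# A classic shuffle edge between a map vertex and a reduce vertex:
shuffle_edge = EdgeProperty(DataMovement.SCATTER_GATHER,
                            Scheduling.SEQUENTIAL,
                            DataSource.PERSISTED)
```

At run time, Tez uses exactly this kind of edge metadata to expand the logical graph into physical tasks, as the next paragraph notes.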

If you want to see a usage example of API, a detailed description of these properties, and how to expand the logical diagram at run time, you can take a look at this article provided by Hortonworks.

The runtime API is based on the input-processor-output model, with which all inputs and outputs are pluggable. For convenience, Tez uses an event-based model to enable communication between tasks and systems, and between components. Events are used to pass information (such as task failure information) to the required components, to transmit output data streams (such as generated data location information) to the input, and to make changes to the DAG execution plan at run time.
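The event-based communication style described above can be sketched with a simple publish/subscribe bus. This is a generic illustration of the pattern, not Tez's actual event classes; the `EventBus` name and the event payloads are invented:

```python
# Sketch of event-based communication between tasks and components:
# subscribers register for event types, publishers fire events without
# being wired directly to their consumers.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.handlers[event_type]:
            handler(payload)

bus = EventBus()
seen = []
# e.g. an input subscribes to learn where a producer wrote its output
bus.subscribe("data_movement", lambda p: seen.append(("fetch", p)))
# e.g. the application master subscribes to task failures to re-plan the DAG
bus.subscribe("task_failed", lambda p: seen.append(("retry", p)))

bus.publish("data_movement", {"vertex": "map", "location": "node3"})
bus.publish("task_failed", {"vertex": "reduce1", "attempt": 0})
```

The decoupling is the point: the producer of an event does not need to know which inputs, outputs, or planner components consume it.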

Tez also provides a variety of out-of-the-box input and output processors.

These expressive APIs allow writers of higher-level languages, such as Hive, to elegantly translate their queries into Tez tasks.

Tez scheduler

When deciding how to assign tasks, the Tez scheduler takes many factors into account, including: task locality requirements, container compatibility, the total resources available in the cluster, the priority of pending task requests, automatic parallelization, and releasing resources the application no longer uses (because the data is no longer local to it). It also maintains a pool of pre-warmed JVMs with shared registered objects. Applications can choose to store different kinds of pre-computed information in these shared registered objects so it can be reused later without recomputation, and the shared connections and pooled containers also let tasks start very quickly.
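The benefit of container reuse can be sketched in a few lines. This toy model (the `ContainerPool` class and its method names are invented for illustration) shows how handing a finished container to the next waiting task avoids repeated JVM start-up and keeps cached pre-computed data alive:

```python
# Toy model of container reuse: instead of paying JVM start-up cost for
# every task, finished containers are kept warm and handed to the next task.

class ContainerPool:
    def __init__(self):
        self.idle = []
        self.launched = 0   # how many "JVMs" were actually cold-started

    def acquire(self):
        if self.idle:
            return self.idle.pop()   # reuse a pre-warmed container
        self.launched += 1
        return {"id": self.launched, "cache": {}}   # cold start

    def release(self, container):
        self.idle.append(container)  # keep it warm (cache survives) for reuse

pool = ContainerPool()
for _ in range(5):                   # run 5 tasks one after another
    c = pool.acquire()
    # pre-computed info stored once is still there on every reuse
    c["cache"].setdefault("precomputed", "reusable lookup data")
    pool.release(c)

assert pool.launched == 1            # one warm JVM served all five tasks
```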

If you want to learn more about container reuse, you can check out here.

Extensibility

Overall, Tez provides developers with rich extensibility so that they can cope with complex processing logic. This can be illustrated by the example "how Hive uses Tez."

Let's take a look at this classic TPC-DS query mode, where you need to join multiple dimension tables to a fact table. Most optimizers and query systems can do the scenario described in the upper-right corner of the figure: if the dimension table is small, you can broadcast all the dimension tables to a larger fact table, in which case you can do the same thing on Tez.

But what if these broadcasts involve user-defined, computationally expensive functions? Then you cannot broadcast them all at once. Instead you have to divide the work into stages, as shown in the topology diagram on the left side of the figure: the first dimension table is broadcast-joined with the fact table, then the result of that join is broadcast-joined with the second dimension table.

The third dimension table no longer has a broadcast connection because it is too large. You can choose to use shuffle connections, and Tez can navigate topologies very effectively.
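That planning decision, broadcast the small dimension tables and fall back to a shuffle join for the one that is too large, can be sketched as a tiny rule. The threshold, function name, and table names here are all invented for illustration:

```python
# Toy join planner: broadcast-join small dimension tables, shuffle-join
# large ones. The cutoff is a hypothetical stand-in for a real optimizer's
# cost model.
BROADCAST_LIMIT = 1000  # rows; invented threshold

def plan_join(dim_sizes):
    """Return one (strategy, table) stage per dimension table."""
    stages = []
    for name, rows in dim_sizes:
        if rows <= BROADCAST_LIMIT:
            stages.append(("broadcast_join", name))
        else:
            stages.append(("shuffle_join", name))
    return stages

plan = plan_join([("dim_date", 400),
                  ("dim_store", 900),
                  ("dim_customer", 50_000)])
# The two small dimensions are broadcast; the large one is shuffled.
```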

The benefits of using Tez to complete this type of Hive query include:

It provides you with comprehensive DAG support, while automatically doing a lot of work on the cluster, so it can take full advantage of the parallel capabilities of the cluster; as described above, this means that there is no need to read / write data from HDFS between multiple MR tasks, and all calculations can be done through a single Tez task.

It provides sessions and reusable containers, so latency is low and reorganization can be avoided as much as possible.

Using the new Tez engine to execute this particular Hive query will improve performance by more than 100%.

Road map

Richer DAG support. For example, could Samza use Tez as its underlying layer and build applications on top of it? The development team would need to add some support for Tez to handle Samza's core scheduling and streaming requirements, and the Tez team will explore how to support these kinds of connection patterns in a DAG. They also want to provide better fault-tolerance support and more efficient data transfer, to further optimize performance and improve session performance.

Since these DAGs can become arbitrarily complex, many automated tools will be needed to help users understand their performance bottlenecks.

Tez is a distributed execution framework that supports DAG jobs. It maps easily to higher-level declarative languages such as Hive, Pig, and Cascading. It has a highly customizable execution architecture, allowing dynamic performance optimization at run time based on real-time information about data and resources. The framework itself automatically handles many thorny issues so that jobs run smoothly and correctly.

With Tez, you can get good performance and efficiency out of the box. The goal of Tez is to solve some problems in the field of Hadoop data processing, including latency and complexity of execution. Tez is an open source project and has been used by Hive and Pig.

So, that is what Apache Tez refers to. Have you learned any new knowledge or skills? If you want to learn more or enrich your knowledge, you are welcome to follow the industry information channel.

