Understanding Livy: An Apache Spark-Based REST Service

Background

Apache Spark, the most popular open-source big data computing framework, is widely used in data processing and analysis applications. It offers two ways to process data. The first is interactive processing: the user launches a Spark application with the spark-shell or pyspark script, and Spark starts a REPL (Read-Eval-Print Loop) in the current terminal to receive user code, compile it into Spark jobs, and submit them to the cluster for execution. The second is batch processing: the user implements the program logic, compiles it into a jar package, and starts the Spark application with the spark-submit script to execute that logic; unlike interactive processing, there is no interaction between the user and Spark while a batch program runs.

Although the two processing modes look completely different, both require the user to log in to a Gateway node and start the Spark process there with a script. What is wrong with that?

First, resource usage and failure risk are concentrated on the Gateway nodes. Because every Spark process is started on a Gateway node, the node's resource load and its chance of failure inevitably increase; at the same time, a Gateway node failure becomes a single point of failure that brings down the Spark programs running through it.

Second, it is difficult to manage, audit, and integrate with existing permission-management tools. Because Spark applications are started from scripts, they are far less convenient to manage and audit than a Web-based approach, and hard to integrate with existing tools such as Apache Knox.

At the same time, the deployment details and configuration on the Gateway node are inevitably exposed to every user who logs in.

To avoid these problems while keeping the processing modes native Spark already offers, and to add the enterprise-grade management, deployment, and auditing capabilities Spark lacks, this article introduces a Spark-based REST service: Livy.

Livy

Livy is an open-source REST service for Spark that delivers code snippets or serialized binaries to a Spark cluster for execution over REST. It provides these basic functions:

Submit Scala, Python, or R code snippets to remote Spark clusters for execution;

Submit Spark jobs written in Java, Scala, and Python to remote Spark clusters for execution;

Submit batch applications to run in clusters.

From these basic functions you can see that Livy covers both processing modes that native Spark provides. Unlike native Spark, however, every operation is submitted to the Livy server over REST, and the server dispatches it to the appropriate Spark cluster for execution. Let's first look at Livy's architecture.

Livy's basic architecture

Livy has a typical REST-service architecture: on one hand it accepts user REST requests, parses them, and translates them into the corresponding operations; on the other hand it manages all the Spark clusters started by users. See Figure 1 for the detailed architecture.

Figure 1: Basic architecture of Livy

A user can start a new Spark cluster through a REST request to Livy. Livy calls each Spark cluster it starts a session; a session consists of a complete Spark cluster, and Livy communicates with that cluster over an RPC protocol. Depending on how they interact, Livy divides sessions into two types:

Interactive session: the counterpart of Spark's interactive processing. Once started, an interactive session receives code snippets submitted by the user and compiles and executes them on the remote Spark cluster;

Batch session: the counterpart of Spark's batch processing. Users start Spark applications in batch mode through Livy, and Livy calls this a batch session.

As you can see, Livy provides the same core functionality as native Spark, but exposes Spark's two processing modes as two session types. Let's look at each session type in detail.

Interactive Session

Using an interactive session is similar to using Spark's own spark-shell, pyspark, or sparkR: the user submits code snippets to a REPL, which compiles them into Spark jobs and executes them. The main difference is that spark-shell starts the REPL on the current node to receive user input, while a Livy interactive session starts the REPL inside the remote Spark cluster, so all code and data travel over the network.

Let's look at how to use an interactive session.

Create interactive sessions

POST /sessions

To use an interactive session, you first create one. When submitting the creation request, you specify the session type with the "kind" field, for example "spark"; Livy launches the corresponding REPL according to that type. Livy currently supports three interactive session kinds, spark, pyspark, and sparkr, to cover the different languages.

Once the session is created, Livy returns a JSON data structure describing the current session.

The field to pay attention to is the session id, which identifies this session; every subsequent operation on the session must reference this id.
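As a concrete illustration, here is a minimal sketch of creating a session from Python with the requests library. The server address (localhost:8998, Livy's default port) and the example response fields are assumptions about a default setup, not part of the original article.

import requests

LIVY_URL = "http://localhost:8998"  # assumed Livy server address (8998 is the default port)

# Create an interactive session; "kind" selects which REPL Livy launches
# (spark, pyspark, or sparkr).
resp = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"})
session = resp.json()

# The response describes the new session; the important field is its id,
# e.g. {"id": 0, "state": "starting", "kind": "pyspark", ...}
session_id = session["id"]
print(session_id, session["state"])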

Submit code

POST /sessions/{sessionId}/statements

After creating an interactive session, we can submit code to it for execution. As with session creation, submitting code also returns an id that identifies the request; we can use that id to query the result of the code's execution.
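A minimal sketch of submitting a code snippet, again with Python's requests library; the server address, session id, and the snippet itself are placeholders.

import requests

LIVY_URL = "http://localhost:8998"  # assumed Livy server address
session_id = 0                       # id returned when the session was created

# Submit a code snippet to the session's remote REPL for execution.
resp = requests.post(
    f"{LIVY_URL}/sessions/{session_id}/statements",
    json={"code": "1 + 1"},
)
statement = resp.json()
statement_id = statement["id"]       # keep this id to query the result later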

Query execution results

GET /sessions/{sessionId}/statements/{statementId}

Livy's REST API is designed to be non-blocking. When a code request is submitted, Livy immediately returns the request id instead of blocking until execution finishes, so the user can poll for the result with that id. Of course, the query only returns the final result once the code has finished executing.
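A sketch of that polling loop follows; the statement state names (such as "available" when execution has finished) follow Livy's documented REST API, but verify them against your Livy version.

import time
import requests

LIVY_URL = "http://localhost:8998"   # assumed Livy server address
session_id, statement_id = 0, 0      # ids obtained from the earlier requests

# Poll until the statement has finished executing.
while True:
    statement = requests.get(
        f"{LIVY_URL}/sessions/{session_id}/statements/{statement_id}"
    ).json()
    if statement["state"] == "available":
        break
    time.sleep(1)

# Once finished, the "output" field holds the REPL output.
print(statement["output"])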

Livy's interactive sessions also provide a number of other REST APIs for manipulating sessions and code, which are not covered in detail here.

Using Programming APIs

In interactive session mode, Livy can receive not only user-submitted code snippets but also serialized Spark jobs. For this, Livy provides a set of programmatic APIs: users write Spark jobs against Livy's API much as they would against the native Spark API, and Livy serializes the job and sends it to the remote Spark cluster for execution. Table 1 compares a PI program written with the Spark API to one written with the Livy API.

Table 1 PI programs written using the Spark API compared to programs written using the Livy API

You can see that the core logic is exactly the same except for the entry point, so users can easily migrate existing Spark jobs to Livy.

A Livy interactive session is, in effect, Spark interactive processing over HTTP. With interactive sessions, users no longer have to log in to a Gateway node to start Spark processes and execute code. Interactive processing over REST gives users richer choices, is easier to use, and, more importantly, is easier to operate and manage.

Batch Session

A large class of Spark applications are batch applications that do not interact with the user at runtime, the most typical being Spark Streaming applications. The user compiles and packages the business logic into a jar and starts a Spark cluster with spark-submit to execute it.

Livy offers the same capability: users can create batch applications through a REST request, as in the sketch below.
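This is a minimal sketch of such a request from Python; the jar path, class name, and arguments are hypothetical, and the jar must already be reachable by the cluster (for example on HDFS).

import requests

LIVY_URL = "http://localhost:8998"  # assumed Livy server address

# Create a batch session: "file" points at the packaged application jar,
# "className" names its entry point.
resp = requests.post(
    f"{LIVY_URL}/batches",
    json={
        "file": "hdfs:///user/tom/my-spark-app.jar",  # hypothetical jar location
        "className": "com.example.MyApp",             # hypothetical main class
        "args": ["--input", "/data/in"],              # hypothetical arguments
    },
)
print(resp.json())  # contains the batch id and its state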

Given the user-specified "className" and "file", Livy launches a Spark cluster to run the application; Livy calls this mode a batch session.

So far we have briefly introduced Livy's two session types, which correspond to Spark's two processing modes; in other words, Livy exposes both of Spark's processing modes through REST.

Enterprise Features

We have covered Livy's core functionality; beyond its completeness, it is Livy's enterprise-level features that set it apart from native Spark's processing modes. This section introduces several of Livy's key enterprise features.

Multi-User Support

Suppose user tom sends a REST request to the Livy server to start a new session, and the Livy server itself runs as user livy. Which user owns the Spark cluster that gets created, tom or livy? By default it is livy. This causes an access problem: the session cannot access resources that tom actually has permission to use, yet it can access resources owned by user livy.

To solve this problem, Livy adopts Hadoop's proxy-user mechanism, which is widely used in multi-user environments such as HiveServer2. In this mode a superuser can access resources on behalf of an ordinary user and carries that user's permissions. With proxy-user mode enabled, the Spark cluster created for a session started by user tom runs as tom.

Figure 2: Livy Multi-User Support

To use this feature, you need to set "livy.impersonation.enabled" and configure the user that runs the Livy server process as a Hadoop proxy user. Some additional configuration is required as well, which is not covered here.
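As a rough sketch of what that configuration looks like, assuming the Livy server process runs as the user livy; the exact file locations and any additional properties depend on your installation.

# livy.conf: turn on impersonation so sessions run as the requesting user
livy.impersonation.enabled = true

<!-- Hadoop core-site.xml: allow the "livy" service user to act as a proxy
     for other users (narrow the wildcards in a real deployment) -->
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>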

With proxy-user support, Livy can truly serve multiple users: sessions started by different users access resources as the corresponding users.

End-to-End Security

Another critical feature in enterprise applications is security. What are the security considerations for a complete Livy service?

Client Authentication

When user tom sends a REST request to the Livy server, how do we know the request comes from a legitimate user? Livy uses Kerberos-based SPNEGO authentication. Once SPNEGO is configured on the Livy server, a user must obtain Kerberos credentials before issuing HTTP requests; only authenticated requests can reach the Livy server, and unauthenticated ones receive a 401 error.
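From the client side, one way to attach SPNEGO credentials to a request is sketched below; it assumes the requests-kerberos Python package is installed and the caller already holds a Kerberos ticket (for example obtained via kinit). The server address is hypothetical.

import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

LIVY_URL = "http://livy-server:8998"  # hypothetical Livy server address

# Attach SPNEGO/Kerberos credentials to the HTTP request; without them the
# server would answer 401.
resp = requests.get(
    f"{LIVY_URL}/sessions",
    auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
)
print(resp.status_code)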

HTTPS/SSL

How do we protect the HTTP traffic between the client and the Livy server? Livy uses standard SSL to encrypt the HTTP protocol and secure the messages in transit. To do this, you configure SSL on the Livy server; the feature is enabled through configuration.
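The keystore-related settings below are the ones commonly documented for Livy; treat the exact key names as assumptions to verify against your Livy release, and the paths and passwords as placeholders.

# livy.conf: a sketch of the usual SSL settings
livy.keystore = /path/to/livy-keystore.jks
livy.keystore.password = <keystore-password>
livy.key-password = <key-password>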

SASL RPC

Besides the client-to-server traffic, there is also network communication between the Livy server and the Spark clusters, and its security must be considered too. Livy uses an RPC channel protected by SASL authentication: when the Livy server starts a Spark cluster, it generates a random string to serve as the shared authentication key. Only the Livy server and that Spark cluster hold the key, which guarantees that only the Livy server can communicate with the cluster and prevents anonymous connections from doing so.

Figure 3 summarizes the three security mechanisms described above.

Figure 3: Livy end-to-end security mechanism

Together these form Livy's complete end-to-end security mechanism, ensuring that no unauthenticated connection can communicate with any part of the Livy service.

Failure Recovery

Since the Livy server is a single point and every operation is forwarded to the Spark clusters through it, how do we ensure that the sessions created before a Livy server failure are not affected, and that the server can reconnect to the existing sessions after it recovers?

Livy provides a failure-recovery mechanism. When a user starts a session, Livy records the session's metadata in reliable storage; when Livy recovers from a failure, it reads that metadata back and reconnects to the corresponding Spark clusters. To use this feature, recovery must be enabled in Livy's configuration.
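A sketch of the relevant settings, using the filesystem state store; the storage path is a placeholder, and a ZooKeeper state store can be used instead for higher reliability.

# livy.conf: a sketch of enabling session recovery
livy.server.recovery.mode = recovery            # turn recovery on (default is off)
livy.server.recovery.state-store = filesystem   # or zookeeper
livy.server.recovery.state-store.url = /path/to/livy/recovery   # placeholder path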

Failure recovery prevents a single point of failure of the Livy server from making every session unavailable, and also avoids needlessly losing sessions when the Livy server is restarted.

Conclusion

Starting from the limitations of Spark's processing modes, this article introduced Livy, a Spark-based REST service. Livy not only covers all the processing modes Spark provides but also adds a range of enterprise-level features. Although the Livy project is still at an early stage and many features remain to be added and improved, in time Livy should become a solid Spark-based REST service.

