We know that Apache Spark is now a booming technology, so it is important to understand all aspects of it as well as the questions that come up in Spark interviews. Below I cover every aspect of Spark that may also appear as a frequently asked interview question, and I do my best to provide an answer for each one, so your search for a solid, complete set of Spark interview questions can end here.
Answers to Apache Spark interview questions
1. What is Apache Spark?
Apache Spark is a powerful, open-source, flexible data processing framework built around speed, ease of use, and sophisticated analytics. Apache Spark is growing rapidly in the cluster computing space. Spark can run on Hadoop, standalone, or in the cloud, and can access data from a variety of sources, including HDFS, HBase, Cassandra, and others.
Because Spark performs in-memory cluster computing, it does not need to shuffle data in and out of disk, which allows it to process data faster.
Compared with other big data and MapReduce technologies, such as Hadoop and Storm, Spark has several advantages. A few of them are:
1. Speed
It can run programs up to 100 times faster than Hadoop MapReduce in memory, and 10 times faster on disk.
2. Ease of use
Spark has easy-to-use APIs for operating on large datasets. This includes a collection of more than 100 operators for transforming data and a familiar DataFrame API for manipulating semi-structured data (a short sketch follows this list).
We can write applications in Java, Scala, Python, and R.
3. Unified engine
Spark comes with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing.
4. Runs everywhere
Spark can run on Hadoop, Mesos, standalone, or in the cloud.
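As a rough illustration of point 2 above, here is a minimal Scala sketch; the file path and the pre-existing SparkSession named spark are assumptions for this example, not part of the original article. It counts words by combining a couple of those operators with the DataFrame API:
import spark.implicits._                                  // needed for the flatMap encoder
val lines = spark.read.textFile("data/sample.txt")        // Dataset[String]; path is a placeholder
val counts = lines
  .flatMap(_.split(" "))                                  // one of the 100+ operators
  .groupBy("value")                                       // DataFrame-style API on the words
  .count()
counts.show()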
Spark ecosystem
The following is a brief overview of the Spark ecosystem and its components.
It includes:
Spark Streaming: Spark Streaming is used to handle real-time streaming data.
Spark SQL: the Spark SQL component is a library on top of the Spark cluster, and we can run SQL queries on Spark data by using it.
Spark MLlib: MLlib is an extensible machine learning library for Spark.
Spark GraphX: GraphX is used for graphs and graph-parallel computation.
2. Why choose Apache Spark?
Basically, we already have many general-purpose cluster computing tools, such as Hadoop MapReduce, Apache Storm, Apache Impala, Apache Giraph, and so on. But each of them has limitations in what it can do. For example:
1. Hadoop MapReduce only allows batch processing.
2. For stream processing, we need something like Apache Storm / S4.
3. For interactive processing, we again need Apache Impala / Apache Tez.
4. And when we need to perform graph processing, we choose Neo4j / Apache Giraph.
Therefore, no single engine can perform all of these tasks together. As a result, there is great demand for a powerful engine that can process data in both real-time (streaming) and batch mode, respond in sub-seconds, and perform in-memory processing.
This is where Apache Spark comes in. It is a powerful open-source engine that provides interactive processing, real-time stream processing, graph processing, in-memory processing, and batch processing, and it is very fast and easy to use with a standard interface.
3. What are the components of the Apache Spark ecosystem?
Apache Spark consists of the following components:
1. Spark Core
2. Spark SQL
3. Spark Streaming
4. MLlib
5. GraphX
Spark Core: Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, and interacting with storage systems. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), Spark's main programming abstraction, and it provides many APIs for building and manipulating these RDDs.
Spark SQL: Spark SQL provides an interface for working with structured data. It allows queries in SQL as well as in the Apache Hive variant of SQL (HQL), and it supports many data sources.
Spark Streaming: the Spark component that enables processing of live streams of data.
MLlib: Spark comes with a general machine learning library called MLlib.
GraphX: GraphX is a library for manipulating graphs (for example, a social network's friend graph) and performing graph-parallel computation.
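To make the Spark SQL description above concrete, here is a minimal hedged sketch; the JSON file, its columns, and the SparkSession variable spark are invented for illustration:
val people = spark.read.json("examples/people.json")   // load structured data into a DataFrame (path assumed)
people.createOrReplaceTempView("people")                // expose it as a temporary SQL view
val adults = spark.sql("SELECT name FROM people WHERE age > 21")
adults.show()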
4. What is Spark Core?
Spark Core is the basic unit of the entire Spark project. It provides all kinds of functionality, such as task dispatching, scheduling, and input/output operations. Spark uses a special data structure known as the RDD (resilient distributed dataset), and Spark Core is home to the API used to define and manipulate RDDs. Spark Core is a distributed execution engine with all the other functionality built on top of it, for example MLlib, Spark SQL, GraphX, and Spark Streaming; this allows diverse workloads on a single platform. All the basic functionality of Apache Spark, such as in-memory computing, fault tolerance, memory management, monitoring, and task scheduling, is provided by Spark Core.
In addition, Spark Core provides basic connectivity to data sources, for example HBase, Amazon S3, HDFS, and so on.
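A minimal sketch of that connectivity, assuming an existing SparkContext named sc and a placeholder HDFS path (both are illustrative, not from the article):
val logs = sc.textFile("hdfs://namenode:9000/data/app.log")   // read a file from HDFS into an RDD
val errors = logs.filter(_.contains("ERROR")).count()          // count error lines on the cluster
println(s"error lines: $errors")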
5. What languages does Apache Spark support?
Apache Spark is written in Scala. Spark provides APIs in Scala, Python, and Java to interact with Spark, and it also provides an API for the R language.
6. How is Apache Spark better than Hadoop?
Apache Spark is an up-and-coming, fast cluster computing tool. Thanks to its very fast in-memory data analytics and processing capability, it can be up to 100 times faster than Hadoop MapReduce.
Apache Spark is a big data framework and a general-purpose data processing engine that is usually used on top of HDFS. It is suitable for a variety of data processing requirements, from batch processing to data streaming.
Hadoop is an open source framework for processing data stored in HDFS. Hadoop can handle structured, unstructured or semi-structured data. Hadoop MapReduce can only process data in batch mode.
In many respects Apache Spark surpasses Hadoop, for example:
1. It processes data in memory, which is not possible with Hadoop.
2. It handles batch, iterative, interactive, and streaming (i.e. real-time) data, while Hadoop processes data only in batch mode.
3. Spark is faster because it reduces the number of disk reads and writes, since it can store intermediate data in memory; in Hadoop MapReduce, the output of map() is always written to the local disk. A sketch of this point follows the list below.
4. Apache Spark is easier to program because it ships with hundreds of high-level operators on RDDs (resilient distributed datasets).
5. Apache Spark code is more compact than Hadoop MapReduce code; written in Scala it becomes very short, reducing programming effort. In addition, Spark provides APIs in a rich variety of languages, such as Java, Scala, Python, and R.
6. Both Spark and Hadoop are highly fault tolerant.
7. Spark applications running in a Hadoop cluster are up to 10 times faster on disk than Hadoop MapReduce.
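Here is a small sketch of point 3, keeping intermediate data in memory instead of writing it back to disk between jobs; the input path is a placeholder and sc is an assumed existing SparkContext:
val words = sc.textFile("hdfs://namenode:9000/data/corpus.txt")
  .flatMap(_.split(" "))
  .cache()                                     // keep the intermediate RDD in memory
val totalWords = words.count()                 // first action computes and caches the RDD
val distinctWords = words.distinct().count()   // later actions reuse the in-memory data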
7. What are the different ways to run Spark on Apache Hadoop?
Instead of MapReduce, we can use Spark on top of the Hadoop ecosystem: Spark with HDFS, where you can read and write data in HDFS, and Spark with Hive, where you can read from Hive, analyze the data, and write the results back to Hive.
8. What is the SparkContext in Apache Spark?
SparkContext is the client of the Spark execution environment and acts as the master of the Spark application. SparkContext sets up internal services and establishes a connection to the Spark execution environment. After creating the SparkContext you can create RDDs, accumulators, and broadcast variables, access Spark services, and run jobs (until the SparkContext is stopped). Only one SparkContext may be active per JVM; you must stop() the active SparkContext before creating a new one.
In the Spark shell, a special interpreter-aware SparkContext is already created for the user, in a variable named sc.
The first step in any Spark driver application is to create a SparkContext. SparkContext allows Spark driver applications to access the cluster through the resource manager. The resource manager can be either YARN or Cluster Manager of Spark.
SparkContext provides a few notable capabilities (a short sketch follows this list):
1. We can get the current status of the Spark application, such as its configuration and application name.
2. We can set configuration, such as the master URL and the default log level.
3. We can create distributed entities such as RDDs.
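A short sketch of those three capabilities, assuming an already created SparkContext named sc:
println(sc.appName)                       // 1. current state: application name
println(sc.getConf.get("spark.master"))   // 1. current state: configuration such as the master URL
sc.setLogLevel("WARN")                    // 2. set configuration: default log level
val nums = sc.parallelize(1 to 100)       // 3. create a distributed entity (an RDD)
val factor = sc.broadcast(10)             // broadcast variable
val badRecords = sc.longAccumulator("badRecords")   // accumulator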
9. What is the SparkSession in Apache Spark?
Starting with Apache Spark 2.0, SparkSession is the new entry point for Spark applications.
Before 2.0, SparkContext was the entry point for Spark jobs. RDD was the main API at the time, and it was created and manipulated using the SparkContext. For every other API you needed a different context: for SQL you needed a SQLContext, for streaming a StreamingContext, and for Hive a HiveContext.
But since 2.0, RDDs along with Datasets and their subset, the DataFrame API, have become the standard APIs and the basic units of data abstraction in Spark. User-defined code is written and evaluated against the Dataset and DataFrame APIs as well as RDDs.
Therefore, a new entry point was needed to cover these new APIs, which is why SparkSession was introduced. SparkSession also includes all the APIs that were previously spread across different contexts: SparkContext, SQLContext, StreamingContext, and HiveContext; a short sketch follows.
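As a hedged sketch of that unification (the application name is made up, and the Hive part assumes a Hive-enabled build), one SparkSession stands in for the old SQLContext and HiveContext while the underlying SparkContext remains reachable:
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder()
  .appName("UnifiedEntryPoint")
  .enableHiveSupport()                // replaces the separate HiveContext
  .getOrCreate()

session.sql("SHOW TABLES").show()     // replaces the separate SQLContext
val legacyContext = session.sparkContext   // the classic SparkContext is still available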
10. SparkSession vs SparkContext in Apache Spark
Prior to Spark 2.0.0, the SparkContext was used as the channel to access all Spark functionality.
The Spark driver uses the SparkContext to connect to the cluster through a resource manager (YARN, Mesos, etc.).
A SparkConf is required to create the SparkContext object; it stores configuration parameters such as the appName (which identifies your Spark driver), the number of cores, and the amount of memory of the executors running on the worker nodes.
To use the APIs for SQL, Hive, and Streaming, separate contexts had to be created.
Example:
Creation of a SparkConf:
val conf = new SparkConf()
  .setAppName("RetailDataAnalysis")
  .setMaster("spark://master:7077")
  .set("spark.executor.memory", "2g")
Creation of the SparkContext:
val sc = new SparkContext(conf)
SparkSession:
Starting with Spark 2.0.0, SparkSession provides a single entry point for interacting with the underlying Spark functionality and allows you to program Spark with the DataFrame and Dataset APIs. All the functionality available through the SparkContext is also available through the SparkSession.
To use the APIs for SQL, Hive, and Streaming, there is no need to create separate contexts, because the SparkSession includes all of these APIs.
Once the SparkSession is instantiated, we can configure Spark's run-time configuration properties.
Example:
Create a Spark session:
val spark = SparkSession
  .builder()
  .appName("WorldBankIndex")
  .getOrCreate()
Configuration properties:
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")
Starting with Spark 2.0.0, it is best to use SparkSession, because it provides access to all the Spark functionality that the SparkContext has, and in addition it provides APIs for working with DataFrames and Datasets.
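For illustration only (the case class and the sample rows are invented here), the same SparkSession handles both the Dataset and DataFrame APIs:
import spark.implicits._

case class Sale(region: String, amount: Double)
val sales = Seq(Sale("EU", 120.0), Sale("US", 80.0)).toDS()   // Dataset API
sales.filter(_.amount > 100).show()

val salesDF = sales.toDF()                                     // DataFrame view of the same data
salesDF.groupBy("region").sum("amount").show()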
11. What are the abstractions of Apache Spark?
Apache Spark has several abstractions:
RDD:
RDD stands for resilient distributed dataset. An RDD is a read-only, partitioned collection of records. It is the core abstraction as well as the fundamental data structure of Spark, and it offers in-memory computation on large clusters in a fault-tolerant manner. For more detailed insight into RDDs, follow the link: Introduction, Features, and Operations of Spark RDD.
DataFrames:
A DataFrame is a dataset organized into named columns. DataFrames are equivalent to a table in a relational database or a data frame in R / Python. In other words, it is a relational table with good optimization techniques. It is an immutable, distributed collection of data that allows a higher level of abstraction, letting developers impose a structure on a distributed collection of data (a short sketch of moving between RDDs and DataFrames follows this overview). For more detailed insight into DataFrames, see the link: Spark SQL DataFrame tutorial - Introduction to DataFrame.
Spark Streaming:
It is a core extension of Spark that allows real-time stream processing from several sources, for example Flume and Kafka. The two sources work together to provide a unified, continuous stream abstraction that can be used for interactive and batch queries. It offers scalable, high-throughput, and fault-tolerant processing. For more detailed insight into Spark Streaming, see the link: Spark Streaming tutorial for beginners.
GraphX
This is another example of a specialized data abstraction. It enables developers to analyze social networks and other graphs, alongside Excel-like two-dimensional data.
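A short sketch tying the first two abstractions together; the column names and values are chosen only for this example, and spark is an existing SparkSession:
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(("alice", 30), ("bob", 25)))   // RDD abstraction
val df = rdd.toDF("name", "age")                                            // DataFrame abstraction over the same data
df.filter($"age" > 26).show()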
12. How to create an RDD in Apache Spark?
A resilient distributed dataset (RDD) is the core abstraction of Spark. It is an immutable (read-only) collection of objects distributed across the cluster. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster. An RDD can contain any type of Python, Java, or Scala object, including user-defined classes.
We can create an RDD in Apache Spark in three ways:
1. By distributing (parallelizing) a collection of objects
2. By loading an external dataset
3. From an existing Apache Spark RDD
Using parallelized collections
RDDs are generally created by parallelizing an existing collection, that is, by taking an existing collection in the program and passing it to the parallelize() method of SparkContext.
scala> val data = Array(1, 2, 3, 4, 5)
scala> val dataRDD = sc.parallelize(data)
scala> dataRDD.count
External data set
In Spark, distributed datasets can be built from any data source supported by Hadoop.
val dataRDD = spark.read.textFile("F:/BigData/DataFlair/Spark/Posts.xml").rdd
Creating an RDD from an existing RDD
A transformation is the way to create an RDD from an existing RDD. A transformation is a function that takes an RDD as input and produces another RDD as a result. The input RDD is not changed; operations applied to an RDD include filter, map, and flatMap.
val dataRDD = spark.read.textFile("F:/Mritunjay/BigData/DataFlair/Spark/Posts.xml").rdd
// the original snippet is truncated here; the "<row" prefix below is an assumed completion for illustration
val resultRDD = dataRDD.filter { line => line.trim().startsWith("<row") }
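Transformations such as filter are lazy, so nothing is computed until an action runs; a hedged follow-up to the snippet above might be:
println(resultRDD.count())          // action: triggers the actual computation
resultRDD.take(5).foreach(println)  // action: fetch a small sample of the result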