2025-02-24 Update From: SLTechnology News&Howtos
This article explains what Spark is, how it compares with Hadoop, and the main projects in its ecosystem.
What is Spark?
Spark is a general-purpose parallel computing framework, similar to Hadoop MapReduce, open-sourced by UC Berkeley's AMPLab. Distributed computation in Spark is based on the MapReduce model and retains the advantages of Hadoop MapReduce, but unlike MapReduce, a job's intermediate output and results can be kept in memory, so repeated reads and writes to HDFS are no longer necessary. Spark is therefore well suited to MapReduce-style algorithms that require iteration, such as those used in data mining and machine learning. Its architecture is shown in the following figure:
Comparison between Spark and Hadoop
Spark keeps intermediate data in memory, which makes iterative computation more efficient.
Spark is better suited to ML and data-mining workloads that involve many iterations, thanks to its RDD (Resilient Distributed Dataset) abstraction.
Spark is more general than Hadoop.
Spark provides many types of dataset operations, whereas Hadoop provides only Map and Reduce. Spark's transformations include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy, and more. It also provides actions such as count, collect, reduce, lookup, and save.
These varied dataset operations make life easier for developers of higher-level applications. The communication model between processing nodes is no longer limited to Hadoop's single data-shuffle pattern: users can name, materialize, and control the storage and partitioning of intermediate results. The programming model is therefore more flexible than Hadoop's.
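As a rough illustration of what a few of these operations compute, here is a plain-Python sketch in which a local list of key-value pairs stands in for an RDD. This is not the Spark API, only the semantics of map, filter, and reduceByKey expressed with ordinary Python:

```python
from itertools import groupby

# A local Python list stands in for an RDD; this illustrates semantics only.
data = [("a", 1), ("b", 2), ("a", 3)]

# map: apply a function to every element
mapped = [(k, v * 10) for k, v in data]

# filter: keep only the elements matching a predicate
filtered = [(k, v) for k, v in data if v > 1]

# reduceByKey: merge the values of each key with a binary function (here, sum)
by_key = sorted(data)
reduced = {k: sum(v for _, v in grp)
           for k, grp in groupby(by_key, key=lambda kv: kv[0])}

print(mapped)    # [('a', 10), ('b', 20), ('a', 30)]
print(filtered)  # [('b', 2), ('a', 3)]
print(reduced)   # {'a': 4, 'b': 2}
```

In real Spark code these would be chained lazily on an RDD (e.g. `rdd.map(...).reduceByKey(...)`) and only executed when an action such as `collect` is called.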
However, because of the nature of RDDs, Spark is not suitable for applications that make asynchronous, fine-grained state updates, such as the storage layer of a web service or an incremental web crawler and indexer; in general, it is a poor fit for incrementally modified data.
Fault tolerance
Fault tolerance in distributed dataset computation is achieved through checkpointing, which can take two forms: checkpointing the data itself, or logging the updates (the lineage of transformations) applied to it. Users can choose which mechanism to use.
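The two styles can be contrasted with a toy sketch in plain Python (not Spark code): checkpointing materializes the computed data, while logging the updates records the chain of transformations so a lost partition can be recomputed from the original source:

```python
# Toy illustration of the two fault-tolerance styles.
base = list(range(5))
# The "logged updates": an ordered list of transformations (the lineage).
lineage = [lambda xs: [x + 1 for x in xs],
           lambda xs: [x * 2 for x in xs]]

# Style 1: checkpoint the computed data itself.
checkpoint = base
for step in lineage:
    checkpoint = step(checkpoint)
# checkpoint now holds the materialized result: [2, 4, 6, 8, 10]

# Style 2: after a "failure", rebuild the data by replaying the lineage.
def recover(source, steps):
    data = source
    for step in steps:
        data = step(data)
    return data

assert recover(base, lineage) == checkpoint
```

Checkpointing costs storage but recovers instantly; lineage logging is cheap to record but pays the recomputation cost at recovery time.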
Usability
Spark improves usability by providing rich Scala, Java, and Python APIs along with an interactive shell.
The combination of Spark and Hadoop
Spark can read and write data directly to HDFS, and it also supports Spark on YARN. Spark and MapReduce can run in the same cluster and share storage and compute resources. The data warehouse Shark borrows from the Hive implementation and is almost completely compatible with Hive.
Applicable scenarios for Spark
Spark is a memory-based iterative computing framework, suited to applications that operate on the same dataset repeatedly. The more often the data is reused, and the larger the dataset, the greater the benefit; when the dataset is small but the computation is intensive, the benefit is relatively small (an important factor when considering Spark in large database architectures).
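The benefit of in-memory reuse can be made concrete with a small sketch (plain Python, not Spark) that counts how many times an expensive data source is read during an iterative job:

```python
# Count reads from an (expensive) data source during 10 iterations.
reads = 0

def load_from_storage():
    """Stand-in for reading a dataset from HDFS or another slow store."""
    global reads
    reads += 1
    return list(range(1000))

# Without caching, every iteration re-reads the source (MapReduce-style).
for _ in range(10):
    data = load_from_storage()
    total = sum(data)
assert reads == 10

# With the dataset cached in memory (the Spark style), the source is read once.
reads = 0
cached = load_from_storage()
for _ in range(10):
    total = sum(cached)
assert reads == 1
```

The more iterations there are, and the slower each read is, the larger the gap between the two styles — which is exactly the workload Spark targets.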
As noted above, the RDD model makes Spark a poor fit for applications with asynchronous, fine-grained state updates or incremental modification.
Overall, though, Spark is general-purpose and applicable to a wide range of workloads.
Operation mode
Local mode
Standalone mode
Mesos mode
Yarn mode
Spark ecosystem
Shark (Hive on Spark): Shark provides essentially the same HiveQL command interface as Hive, built on the Spark framework. To maximize compatibility with Hive, Shark uses Hive's API for query parsing and logical-plan generation, while the final physical-plan execution phase uses Spark instead of Hadoop MapReduce. Through configuration, Shark can automatically cache specific RDDs in memory for reuse, speeding up retrieval of particular datasets. Shark also supports specific analysis and learning algorithms through user-defined functions (UDFs), so that SQL querying and analytical computation can be combined while maximizing RDD reuse.
Spark Streaming: a framework for processing stream data, built on Spark. The basic principle is to divide the stream into small time slices (on the order of seconds) and process each slice in a batch-like manner. Spark Streaming is built on Spark for two reasons: Spark's low-latency execution engine (latency around 100 ms) can be used for near-real-time computation, and compared with record-at-a-time frameworks such as Storm, RDD datasets make efficient fault-tolerant processing easier. In addition, the micro-batch approach is compatible with both batch and real-time processing logic and algorithms, which helps applications that need to analyze historical and live data together.
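The micro-batch idea itself is simple enough to sketch in plain Python (this is an illustration of the principle, not the Spark Streaming API): timestamped records are grouped into fixed-length time windows, and each window is then processed with ordinary batch logic:

```python
# Toy micro-batch sketch: (timestamp_seconds, value) records grouped
# into 1-second windows, each processed as a small batch.
records = [(0.2, 1), (0.7, 2), (1.1, 3), (1.9, 4), (2.5, 5)]
batch_seconds = 1.0

batches = {}
for t, value in records:
    window = int(t // batch_seconds)        # which micro-batch this record falls in
    batches.setdefault(window, []).append(value)

# Process each micro-batch with ordinary batch logic (here, a sum per window).
results = {window: sum(vals) for window, vals in sorted(batches.items())}
print(results)  # {0: 3, 1: 7, 2: 5}
```

In Spark Streaming each window's data becomes an RDD, so the same transformations used for batch jobs apply unchanged to the stream.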
Bagel (Pregel on Spark): lets you use Spark for graph computation; a small but very useful project. Bagel ships with an example that implements Google's PageRank algorithm.
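Bagel itself is a Scala library, but the algorithm its example implements can be sketched on a single machine in plain Python. This is a minimal PageRank over a hypothetical three-page link graph, not Bagel's API:

```python
# Minimal PageRank sketch on a toy link graph: each page's rank is split
# among its outgoing links, then damped and redistributed each iteration.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
damping = 0.85
n = len(links)
ranks = {page: 1.0 / n for page in links}   # start with uniform ranks

for _ in range(50):                         # iterate until ranks stabilize
    contribs = {page: 0.0 for page in links}
    for page, outs in links.items():
        share = ranks[page] / len(outs)     # rank flows equally along out-links
        for out in outs:
            contribs[out] += share
    ranks = {page: (1 - damping) / n + damping * c
             for page, c in contribs.items()}

# Ranks sum to ~1; "c", which both "a" and "b" link to, ends up highest.
assert abs(sum(ranks.values()) - 1.0) < 1e-6
```

Because PageRank repeatedly transforms the same graph dataset, it is exactly the kind of iterative workload where keeping the data in memory pays off.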
End.