In this article, the editor shares what Spark is used for. Most people probably don't know much about it yet, so this article is offered for your reference. I hope you gain a lot from reading it; let's look at it together!
SPARK
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a general-purpose parallel framework in the style of Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab (the AMP Lab of the University of California, Berkeley). Spark has the advantages of Hadoop MapReduce; but unlike MapReduce, a job's intermediate output can be kept in memory, so there is no need to read and write HDFS between stages. Spark is therefore better suited to MapReduce-style algorithms that require iteration, such as those used in data mining and machine learning.
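As a rough illustration of why in-memory reuse matters for iterative jobs, here is a minimal Scala sketch (the file path is hypothetical, and `sc` is assumed to be an existing SparkContext): an RDD is cached once and then scanned repeatedly without touching HDFS again.

```scala
// Minimal sketch: cache an RDD so iterations reuse memory, not HDFS.
val points = sc.textFile("hdfs:///data/points.txt") // hypothetical path
  .map(_.toDouble)
  .cache() // keep the parsed data in memory after the first pass

var estimate = 0.0
for (_ <- 1 to 10) {
  // each pass reads the in-memory copy instead of re-reading HDFS
  estimate = points.map(p => (p + estimate) / 2).mean()
}
println(estimate)
```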
Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make it superior for certain workloads. In particular, Spark introduces in-memory distributed datasets that optimize iterative workloads, in addition to providing interactive queries.
Spark is implemented in Scala, which also serves as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated: Scala makes it easy to manipulate distributed datasets as if they were native collection objects.
Although Spark was created to support iterative jobs on distributed datasets, it is actually complementary to Hadoop and can run alongside it on the Hadoop file system. This is enabled by a third-party cluster framework called Mesos. Developed by the UC Berkeley AMP Lab (Algorithms, Machines, and People Lab), Spark can be used to build large-scale, low-latency data analytics applications.
Today, Spark has grown into a rapidly developing and widely used ecosystem.
Spark has three main characteristics:
First, its high-level APIs abstract away the cluster itself, so Spark application developers can focus on the computation their application performs.
Second, Spark is fast, supporting both interactive computation and complex algorithms.
Finally, Spark is a general-purpose engine that can be used for a wide variety of operations, including SQL queries, text processing, and machine learning; before Spark, each of these typically required learning a separate engine (see the sketch after this list).
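As a small taste of the first two points, here is a complete distributed word count as it might be typed into the interactive `spark-shell`, where `sc` is predefined (the input path is hypothetical). Note that no cluster bookkeeping appears in user code.

```scala
// Word count in spark-shell: the high-level API hides the cluster.
val counts = sc.textFile("hdfs:///data/input.txt") // hypothetical path
  .flatMap(_.split("\\s+"))   // text processing...
  .map(word => (word, 1))
  .reduceByKey(_ + _)         // ...and distributed aggregation
counts.take(10).foreach(println)
```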
Performance characteristics
Faster speed
For in-memory computation, Spark can be up to 100 times faster than Hadoop MapReduce.
Ease of use
Spark provides more than 80 high-level operators.
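A few of those operators chained together, as a minimal sketch (assumes an existing SparkContext `sc`; the data is made up):

```scala
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val visits = sc.parallelize(Seq((1, "home"), (1, "cart"), (2, "home")))

// join, map, groupByKey and mapValues are all built-in operators:
val pagesPerUser = users
  .join(visits)                          // (id, (name, page))
  .map { case (_, (name, page)) => (name, page) }
  .groupByKey()                          // name -> all pages visited
  .mapValues(_.toSeq.distinct.size)      // distinct pages per user
pagesPerUser.collect().foreach(println)  // e.g. (alice,2), (bob,1)
```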
Versatility
Spark provides a large set of libraries, including SQL and DataFrames, MLlib (machine learning), GraphX, and Spark Streaming. Developers can seamlessly combine these libraries within the same application.
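A hedged sketch of that combination, mixing the DataFrame API and SQL in one program (uses the modern `SparkSession` entry point; the app name and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CombinedLibsDemo") // illustrative name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// DataFrames and SQL side by side in the same application:
val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
// The resulting DataFrame could feed MLlib, GraphX or Streaming next.
```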
Support for multiple resource managers
Spark supports Hadoop YARN, Apache Mesos, and its own built-in standalone cluster manager.
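In practice the resource manager is chosen through the master URL, so the same application can move between managers without code changes. A minimal sketch (hosts and ports are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Possible master URLs (illustrative hosts/ports):
//   "local[*]"                 -- run locally on all cores
//   "spark://master-host:7077" -- Spark's own standalone manager
//   "yarn"                     -- Hadoop YARN
//   "mesos://master-host:5050" -- Apache Mesos
val conf = new SparkConf()
  .setAppName("MasterUrlDemo")
  .setMaster("local[*]") // swap for yarn/mesos/standalone on a cluster
val sc = new SparkContext(conf)
```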
Spark ecosystem
Shark: Shark provides a HiveQL command interface similar to Hive's, built on the Spark framework. To stay as compatible with Hive as possible, Shark uses the Hive API for query parsing and logical plan generation, replacing Hadoop MapReduce with Spark only at the physical plan execution stage. Through configuration parameters, Shark can automatically cache specific RDDs in memory, enabling data reuse and thus speeding up retrieval of particular datasets. At the same time, Shark supports user-defined functions (UDFs) that implement specific data analysis and learning algorithms, so SQL queries and operational analysis can be combined to maximize RDD reuse.
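Shark itself has since been retired (see below), but its core idea of caching a queried dataset in memory lives on in Spark SQL. A hedged sketch with the modern API (table and path names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlCacheDemo").master("local[*]").getOrCreate()

spark.read.json("hdfs:///logs/events.json") // hypothetical path
  .createOrReplaceTempView("events")
spark.catalog.cacheTable("events") // keep the table in memory for reuse

// Repeated queries over `events` now hit the in-memory cache:
spark.sql("SELECT COUNT(*) FROM events WHERE level = 'ERROR'").show()
```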
SparkR: SparkR is an R package that provides a lightweight Spark front end for R. SparkR offers a distributed data frame structure, removing the bottleneck that R's data frames can only be used on a single machine. It supports operations such as select, filter, and aggregate (similar to the functionality in the dplyr package), which nicely addresses R's big-data limitations. SparkR also supports distributed machine learning algorithms, for example through the MLlib machine learning library. SparkR has energized the R community around Spark, attracting a large number of data scientists to begin their data analysis work directly on the Spark platform.
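SparkR's R syntax is not reproduced here, but the same select/filter/aggregate vocabulary exists on Spark's DataFrame API; a minimal Scala sketch of the equivalent operations (data and names invented):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("FrameOps").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)).toDF("key", "value")
df.filter($"value" > 1.0)            // filter, as in SparkR/dplyr
  .groupBy($"key")
  .agg(avg($"value").as("mean"))     // aggregate
  .select($"key", $"mean")           // select
  .show()
```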
Basic principle
Spark Streaming: a framework for processing streaming data, built on Spark. Its basic principle is to divide the stream into small time slices (a few seconds) and process each small portion of data in a batch-like manner. Spark Streaming is built on Spark for two reasons. On one hand, Spark's low-latency execution engine (latencies of around 100 ms), while not matching specialized stream-processing software, is still usable for real-time computation. On the other hand, compared with record-at-a-time frameworks such as Storm, an RDD partition with narrow dependencies can be recomputed from the source data, which provides fault tolerance. In addition, the micro-batch model makes Spark Streaming compatible with both batch and real-time processing logic and algorithms, which is convenient for applications that need joint analysis of historical and real-time data.
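The micro-batch principle in code: a minimal word count over a socket stream, cut into 5-second batches (host and port are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999) // illustrative source
lines.flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _) // each batch is processed like a small RDD job
  .print()

ssc.start()
ssc.awaitTermination()
```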
Calculation method
Bagel: Pregel on Spark. Bagel can be used for graph computation on Spark and is a very useful small project. It ships with an example that implements Google's PageRank algorithm.
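The Bagel API itself is not shown here (the project was later deprecated), but the same PageRank idea can be expressed directly on RDDs. A minimal sketch with a tiny hard-coded link graph (assumes an existing SparkContext `sc`):

```scala
val links = sc.parallelize(Seq(
  ("a", Seq("b", "c")), ("b", Seq("a")), ("c", Seq("a"))
)).cache() // the graph is reused in every iteration

var ranks = links.mapValues(_ => 1.0)
for (_ <- 1 to 10) { // ten synchronous "supersteps"
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.collect().foreach(println)
```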
At present, Spark has not stopped at real-time computing; it is aiming to become a general-purpose big data processing platform. The decision to terminate Shark and launch Spark SQL may be where that ambition begins to take shape.
In recent years, parallel algorithms for machine learning and data mining have become an important research hotspot in the big data field. In earlier years, researchers and industry at home and abroad focused mostly on parallel algorithm design for the Hadoop platform. However, because of its high network and disk I/O overhead, the Hadoop MapReduce platform struggles to efficiently implement machine learning algorithms that require many iterative computations. With the emergence and steady development of Spark, the new-generation big data platform launched by the UC Berkeley AMPLab, attention at home and abroad has shifted in recent years to designing parallel machine learning and data mining algorithms on the Spark platform. To let data analysts in general application fields complete data analysis on Spark using the familiar R language, the Spark project provides a programming interface called SparkR, so that these analysts can easily use Spark's parallel programming interface and powerful computing capability from within an R environment.
That is all the content of this article, "What is the use of Spark". Thank you for reading! I hope it has given you some understanding of the topic and that the shared content helps you. If you would like to learn more, welcome to follow our industry information channel!