

A first experience with Spark


I. The background of Spark

(1) Development of MapReduce:

1. Disadvantages of MRv1:

Back in Hadoop 1.x, MapReduce used the MRv1 programming model. The MRv1 implementation is encapsulated in the org.apache.hadoop.mapred package, and MRv1's Map and Reduce are defined as interfaces.

MRv1 has only three parts: the runtime environment (JobTracker and TaskTracker), the programming model (MapReduce), and the data processing engine (MapTask and ReduceTask).

Poor scalability: at run time, JobTracker is responsible for both resource management and task scheduling. When the cluster is busy, JobTracker easily becomes a bottleneck, which ultimately limits the cluster's scalability.

Poor availability: a single-node Master is used, with no standby Master and no leader election, so the entire cluster becomes unavailable if the Master fails (a single point of failure).

Low resource utilization: TaskTracker divides the resources on its node into equal-sized "slots". Slots are split into Map slots and Reduce slots, which can only be used by MapTask and ReduceTask respectively. Sometimes, for example when a job has just started, there are many MapTasks while no ReduceTask has yet been scheduled, and the Reduce slots sit idle.

No support for multiple computing frameworks: the built-in MapReduce framework cannot be replaced with other implementations, such as Spark or Storm, in a pluggable way.

2. Disadvantages of MRv2:

MRv2 reuses the programming model and data processing engine of MRv1, but the runtime environment was refactored: JobTracker is split into a general resource scheduling platform (ResourceManager, RM), node managers (NodeManager, NM), and a task scheduling model responsible for each computing framework (ApplicationMaster, AM). However, because of frequent operations on HDFS (including persisting computing results, data backup, resource download, and Shuffle), disk I/O became the bottleneck of system performance, so MRv2 is only suitable for offline or batch processing and cannot support iterative, interactive, or streaming data processing.

(2) Advantages of Spark:

Reduced disk I/O: Spark allows the map side's intermediate output and results to be stored in memory, and the reduce side avoids a large amount of disk I/O when pulling intermediate results. Spark also buffers the resource files uploaded by an application into the memory of the Driver's local file service, and Executors read them directly from the Driver's memory when executing tasks, which saves further disk I/O.
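As a minimal sketch of keeping intermediate results in memory, the snippet below caches an RDD so that two subsequent actions share one computation. It assumes an existing SparkContext `sc` (as in spark-shell); the data and sizes are illustrative only.

```scala
// Minimal caching sketch; assumes an existing SparkContext `sc`,
// e.g. inside spark-shell. Data and sizes are illustrative only.
val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
// the RDD is materialized in memory the first time it is computed.
squares.cache()

val n   = squares.count()       // first action: computes and caches
val sum = squares.reduce(_ + _) // second action: read back from memory
```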

Increased parallelism: Spark abstracts the different steps of a job as Stages, allowing multiple Stages to be executed either serially or in parallel.

No recomputation: when the Task for a partition in a Stage fails, the Stage is rescheduled, but partitions whose Tasks already succeeded are filtered out during rescheduling, so no duplicate computation or wasted resources result.

Optional shuffle sort: Spark can sort on the map side or on the reduce side, as different scenarios require.
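A small sketch of this choice (again assuming an existing SparkContext `sc`): groupByKey shuffles without sorting, sortByKey requests a sorted result, and repartitionAndSortWithinPartitions pushes the sort into the shuffle itself.

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

val grouped = pairs.groupByKey() // shuffles, but gives no ordering guarantee
val sorted  = pairs.sortByKey()  // range-partitions and sorts by key

// Sort while shuffling, instead of shuffling first and sorting afterwards.
val mapSideSorted =
  pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))
```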

Flexible memory management: Spark divides memory into four parts: on-heap storage memory, off-heap storage memory, on-heap execution memory, and off-heap execution memory. Spark supports both a fixed boundary and a "soft" boundary between execution memory and storage memory. The "soft" boundary is the default: either side can borrow the other's memory when its own is insufficient, maximizing resource utilization and reducing waste.

Because Spark leans heavily on memory, the amount and utilization of memory resources are particularly important. Spark's memory manager therefore provides Tungsten, which implements a data structure very similar to the operating system's memory pages and operates on OS memory directly, avoiding the overhead of Java objects created on the heap and bringing Spark's memory usage closer to the hardware. Spark also assigns each Task its own task memory manager, so memory is managed at Task granularity: a Task's memory can be consumed by multiple internal consumers, each of which gets its allocation from the task memory manager, giving Spark fine-grained control over memory.
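These boundaries are tunable. A hedged sketch of the relevant standard configuration keys follows; the values shown are the documented defaults or illustrative sizes, not recommendations.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MemoryTuningSketch")
  // Fraction of (heap - 300 MB) shared by execution and storage
  // under the unified, "soft boundary" manager (default 0.6).
  .set("spark.memory.fraction", "0.6")
  // Portion of that region initially set aside for storage; execution
  // and storage may still borrow across this boundary (default 0.5).
  .set("spark.memory.storageFraction", "0.5")
  // Let Tungsten allocate off-heap memory, with an explicit budget.
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g")
```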

(3) The Spark ecosystem:

The Spark ecosystem takes Spark Core as its core: it reads data from persistence layers such as HDFS, Amazon S3, or HBase, and uses Mesos, YARN, or its own Standalone manager as the resource manager that schedules the Jobs carrying out a Spark application's computation. The main components are listed below (a minimal sketch of selecting a resource manager follows the list).

Batch processing via SparkShell/SparkSubmit

Real-time stream processing with Spark Streaming

Structured data processing and ad hoc queries with Spark SQL

Approximate (trade-off) queries with BlinkDB

Machine learning with MLlib/MLbase, graph processing with GraphX, mathematical/scientific computing with PySpark, and data analysis with SparkR.
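As promised above, here is a minimal sketch of how the choice of resource manager surfaces in application code: the master URL selects it. Host names and ports are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// The master URL selects the resource manager:
//   "local[*]"          -- run in-process on all cores (testing)
//   "spark://host:7077" -- Spark's own Standalone cluster manager
//   "yarn"              -- Hadoop YARN (needs HADOOP_CONF_DIR set)
//   "mesos://host:5050" -- Apache Mesos
val spark = SparkSession.builder()
  .appName("EcosystemSketch")
  .master("local[*]") // swap in the manager your cluster actually uses
  .getOrCreate()
```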

(4) Characteristics of Spark:

Speed: fast and efficient. Spark allows intermediate output and results to be stored in memory, saving a great deal of disk I/O. Apache Spark uses a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine to achieve high performance on both batch and streaming data, and Spark's own DAG execution engine supports computing over data in memory. The Spark website claims it is up to 100 times faster than Hadoop; even when insufficient memory forces disk I/O, it is still more than 10 times faster.

Generality: full-stack data processing, supporting batch processing, interactive queries, stream processing, machine learning, and graph computation.

Ease of use: Spark supports writing applications in programming languages such as Java, Scala, Python, and R, greatly lowering the bar for users. It comes with more than 80 high-level operators that allow interactive queries in the Scala, Python, and R shells, and it is very convenient to use a Spark cluster from these shells to verify a solution to a problem.
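For instance, a classic word count runs line by line in spark-shell, where the SparkContext `sc` is predefined; "README.md" stands in for any file the cluster can read.

```scala
// Entered interactively in spark-shell (`sc` is predefined there).
val lines = sc.textFile("README.md") // placeholder path

val counts = lines
  .flatMap(_.split("\\s+")) // high-level operator: split lines into words
  .filter(_.nonEmpty)
  .map(w => (w, 1))
  .reduceByKey(_ + _)       // aggregate counts per word

counts.take(10).foreach(println)
```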

High availability: Spark does not have to depend on a third-party resource manager and scheduler; it implements Standalone as its built-in resource management and scheduling framework, which further lowers the barrier to using Spark and makes it very easy to deploy and use. In this mode there can be multiple Masters, solving the single-point-of-failure problem. Of course, the cluster manager can also be replaced with others, such as YARN, Mesos, Kubernetes, or EC2.
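A sketch of connecting to such a multi-Master Standalone cluster: the application lists all Masters and fails over to whichever is the elected leader (the Masters themselves coordinate through ZooKeeper). Host names here are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// List every Standalone Master; the client registers with the current
// leader and fails over if that Master dies.
val spark = SparkSession.builder()
  .appName("StandaloneHASketch")
  .master("spark://master1:7077,master2:7077")
  .getOrCreate()
```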

Rich data source support: in addition to the operating system's local file system and HDFS, Spark can access Cassandra, HBase, Hive, Tachyon (memory-based storage), and any Hadoop data source. This greatly eases a smooth migration to Spark for users already running HDFS and HBase.
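A brief sketch of reading from several of these sources; the paths and table names are placeholders, and the Hive access assumes a Hive-enabled Spark build with a reachable metastore.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataSourcesSketch")
  .enableHiveSupport() // assumes Spark was built with Hive support
  .getOrCreate()

val localFile = spark.read.textFile("file:///tmp/sample.txt")           // local FS
val hdfsFile  = spark.read.textFile("hdfs://namenode:8020/data/in.txt") // HDFS
val hiveRows  = spark.sql("SELECT * FROM some_db.some_table")           // Hive

// Cassandra and HBase need their connector packages on the classpath,
// e.g. the spark-cassandra-connector or the HBase-Spark module.
```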

(5) Application scenarios of Spark:

① Yahoo uses Spark in Audience Expansion applications for click prediction and ad hoc queries, etc.

② The Taobao technical team uses Spark for iterative machine learning algorithms and algorithms with high computational complexity, applying it to content recommendation, community discovery, and so on.

③ Tencent's big-data precise-recommendation team takes advantage of Spark's fast iteration to run real-time parallel high-dimensional algorithms across the whole pipeline of real-time data collection, real-time algorithm training, and real-time system prediction, and finally applied it successfully to the pCTR delivery system of Tencent's advertising solutions.

④ Youku Tudou applies Spark to video recommendation (graph computation) and its advertising business, mainly for iterative computation such as machine learning and graph computation.
