

Introduction to Spark




Spark is a unified computing engine for large-scale data processing. It covers workloads that previously required several different distributed platforms, including batch processing, iterative computation, interactive queries, and stream processing, and it integrates all of them in a single framework.

Spark characteristics

Speed

Spark achieves high performance for both batch and streaming workloads through an advanced DAG scheduler, a query optimizer, and a physical execution engine. On iterative computations such as logistic regression, Spark can run more than 100 times faster than Hadoop.

Ease of use

Spark supports multiple programming languages, such as Java, Scala, Python, R, and SQL.

Spark provides more than 80 high-level operators that make it easy to build parallel applications.
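As an illustration, here is a minimal word count in Scala that chains a few of those operators (flatMap, map, reduceByKey). The input path is a placeholder, and local[*] is used only so the sketch runs on a single machine:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on all local cores; drop it when submitting to a cluster
    val spark = SparkSession.builder().appName("WordCount").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("input.txt")   // "input.txt" is a placeholder path
      .flatMap(line => line.split("\\s+"))  // split each line into words
      .map(word => (word, 1))               // pair each word with a count of 1
      .reduceByKey(_ + _)                   // sum the counts per word in parallel

    counts.collect().foreach(println)
    spark.stop()
  }
}
```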

Versatility

Spark builds out a complete ecosystem stack that integrates batch computing, interactive computing, stream computing, machine learning, and graph computing into a unified framework.

Runs everywhere

Spark can run on a variety of scheduling platforms, including its standalone mode, YARN, Mesos, Kubernetes, and EC2.

In addition, Spark can access a variety of data sources, such as HDFS, Alluxio, HBase, Cassandra, Hive, and local files.

The Spark ecosystem stack

Spark Core

Spark Core implements Spark's basic functionality, including task scheduling, memory management, fault recovery, and interaction with storage systems. Spark Core also defines the API for resilient distributed datasets (RDDs). An RDD represents a collection of elements that is distributed across multiple compute nodes and can be operated on in parallel; it is Spark's core abstraction.
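A minimal sketch of the RDD API, with made-up data: parallelize distributes a local collection, transformations are lazy, and an action such as collect triggers the computation.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RddBasics").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Distribute a local collection across the cluster as an RDD
val numbers = sc.parallelize(1 to 100)

// Transformations such as filter are lazy; nothing runs yet
val evens = numbers.filter(_ % 2 == 0)

// collect() is an action: it triggers the parallel computation
println(evens.collect().mkString(", "))
```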

Spark SQL

Spark SQL is Spark's subframework for working with structured data. It supports a variety of data sources, such as Hive tables, Parquet, and JSON; lets you query data using SQL or Hive's SQL dialect (HQL); and supports converting between SQL results and RDDs.
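A minimal Spark SQL sketch, assuming a hypothetical people.json file where each line is a JSON object such as {"name": "alice", "age": 29}:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSqlExample").master("local[*]").getOrCreate()

// "people.json" is a placeholder; Spark SQL infers the schema from the JSON
val people = spark.read.json("people.json")

// Register the DataFrame so it can be queried by name
people.createOrReplaceTempView("people")

// Query it with plain SQL
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```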

Spark Streaming

Spark Streaming is the Spark component for processing live streams of data. It provides an API for manipulating data streams, along with strong fault tolerance, throughput, and scalability.
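A hedged sketch of the classic streaming word count, assuming a text source on localhost:9999 (for example, one started with nc -lk 9999):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least two local threads: one to receive data, one to process it
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second micro-batches

// Hypothetical source: a socket text stream on localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the job is stopped
```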

MLlib

MLlib is Spark's library of common machine learning (ML) functionality. It provides many machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as model evaluation, data import, and lower-level ML primitives (including a general gradient descent optimizer).
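As a small sketch of the classification side, here is a logistic regression model being fit with the DataFrame-based spark.ml API; the toy training data is invented:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MLlibSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Invented toy data: (label, feature vector)
val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
).toDF("label", "features")

// Fit a regularized logistic regression model
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)
println(s"Coefficients: ${model.coefficients}")
```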

GraphX

GraphX is a library for manipulating graphs (such as a social network's friend graph) and running parallel graph computations. GraphX extends the RDD API so you can create a directed graph with arbitrary properties attached to each vertex and edge. It supports various graph operations (such as subgraph for extracting a subgraph and mapVertices for transforming all vertices), as well as common algorithms (such as PageRank and triangle counting).
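A minimal GraphX sketch with an invented friend graph; Graph is built from a vertex RDD and an edge RDD, and pageRank is one of the built-in algorithms mentioned above:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("GraphXSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Invented friend graph: vertices are (id, name), edges carry a relationship label
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val relationships = sc.parallelize(Seq(Edge(1L, 2L, "friend"), Edge(2L, 3L, "friend")))
val graph = Graph(users, relationships)

// Run PageRank until the ranks converge within the given tolerance
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach(println)
```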

Cluster managers

Spark is designed to scale computation efficiently from one compute node to thousands, so for maximum flexibility it supports running on a variety of cluster managers, including Hadoop YARN, Apache Mesos, and Spark's own standalone scheduler.

Users and uses of Spark

Spark users fall mainly into two groups: data analysts and engineers. The two groups' typical use cases differ and can be roughly divided into two categories: data analysis and data processing.

Data analysis

Data analysts are primarily responsible for analyzing data and building models. They have skills in SQL, statistics, and predictive modeling (machine learning), and some ability to program in Python, MATLAB, or R.

Spark supports data analysis through a series of components. The Spark shell offers Python and Scala interfaces for interactive analysis. Spark SQL provides a separate SQL shell for exploring data with SQL, and SQL queries can also be issued from a standard Spark program or from the Spark shell. The MLlib library supports machine learning and data analysis, and Spark can also call external R or MATLAB programs.
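As a hedged sketch of interactive analysis, the lines below could be typed into bin/spark-shell, where a SparkSession named spark is already created; the table and values are invented:

```scala
// Inside spark-shell, `spark` (a SparkSession) already exists
import spark.implicits._

val sales = Seq(("east", 100), ("west", 250), ("east", 75)).toDF("region", "amount")
sales.createOrReplaceTempView("sales")

// Explore the data interactively with SQL
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```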

Data processing

Engineers are software developers who use Spark to build data processing applications. They understand software engineering concepts (encapsulation, interface design, and object-oriented thinking) and can apply engineering techniques to design software systems.

Spark offers a shortcut for developing programs that execute in parallel on a cluster: developers do not need to concern themselves with distribution, network communication, or fault tolerance. Spark also gives engineers sufficient interfaces for common tasks and for monitoring, inspecting, and tuning applications.
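To make that concrete, here is a hedged skeleton of a self-contained Spark application; the computation is deliberately trivial, and Spark handles partitioning and scheduling without any networking code from the developer:

```scala
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // The master is typically supplied at submit time, so the same code
    // runs unchanged on the standalone scheduler, YARN, or Mesos.
    val spark = SparkSession.builder().appName("SimpleApp").getOrCreate()
    val sc = spark.sparkContext

    // Spark partitions the data and distributes the work across the cluster
    val total = sc.parallelize(1L to 1000000L).map(_ * 2).sum()
    println(s"total = $total")

    spark.stop()
  }
}
```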

A brief history of Spark

In 2009, Spark was born in UC Berkeley's AMPLab.

In 2010, Spark was officially open-sourced.

2012-10-15: Spark 0.6.0 released

Extensive performance improvements, new features, and a simplified standalone deployment model.

2013-02-27: Spark 0.7.0 released

Added more key features, such as a Python API and an alpha version of Spark Streaming.

2013-06-21: Spark accepted into the Apache Incubator

2013-09-17: Spark 0.8.0 released

Support for Scala 2.9 and YARN 2.2, high-availability scheduling in standalone deployment mode, shuffle optimizations, etc.

2014-01-24: Spark 0.9.0 released

Added GraphX, new machine learning features, new streaming features, and core engine optimizations (external aggregation, enhanced YARN support), etc.

2014-05-26: Spark 1.0.0 released

Spark SQL, MLlib, GraphX, and Spark Streaming were all added or improved, and the Spark core engine gained support for secure YARN clusters.

2014-09-03: Spark 1.1.0 released

Spark Core API additions and bug fixes in Streaming, Python, SQL, GraphX, and MLlib.

2014-12-10: Spark 1.2.0 released

Spark Core API additions and bug fixes in Streaming, Python, SQL, GraphX, and MLlib.

2015-03-06: Spark 1.3.0 released

The biggest highlight of this release is the newly introduced DataFrame API, which provides more convenient and powerful operations on structured datasets. Also worth noting is that Spark SQL became an official (non-alpha) component, meaning it will be more stable and complete going forward.

2015-06-03: Spark 1.4.0 released

This version brought an R API to Spark and improved the usability of Spark's core engine, MLlib, and Spark Streaming.

2015-09-09: Spark 1.5.0 released

Spark 1.5.0 is the sixth release on the 1.x line. It incorporates roughly 1,400 patches from 230+ contributors and 80+ institutions.

Many of the changes in Spark 1.5 revolve around improving Spark's performance, usability, and operational stability.

Spark 1.5.0 centers on the Tungsten project, which improves Spark's performance by optimizing low-level internals.

Spark 1.5 also adds operational features to Streaming, such as support for backpressure. Another important update is the addition of new machine learning algorithms and tools, and an expanded SparkR API.

2015-12-22: Spark 1.6.0 released

This version contains more than 1,000 patches centered on three main themes: the new Dataset API, performance improvements (50% faster Parquet reads, automatic memory management, a tenfold speedup in streaming state management), and a large number of new machine learning and statistical analysis algorithms.

The DataFrame API, introduced in Spark 1.3.0, gives Spark high-level operations for handling structured data, letting the Catalyst optimizer and Tungsten execution engine automatically accelerate big data analysis. Developers gave plenty of feedback after DataFrame's release, and one recurring complaint was its lack of compile-time type safety. To address this, Spark introduced the new Dataset API, a typed extension of the DataFrame API that supports static typing and user-defined functions in Scala or Java. The Dataset API also provides better memory management than the traditional RDD API, especially for long-running tasks.
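A brief sketch of the difference, with invented data: the DataFrame is untyped, so a bad column reference fails only at runtime, while converting it to a Dataset of a case class restores compile-time checking:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("DatasetSketch").master("local[*]").getOrCreate()
import spark.implicits._

// A DataFrame is untyped: column errors surface only at runtime
val df = Seq(("alice", 29L), ("bob", 35L)).toDF("name", "age")

// A Dataset[Person] is statically typed: the filter below is checked by the Scala compiler
val ds = df.as[Person]
val adults = ds.filter(_.age >= 30)
adults.show()
```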

2016-07-20: Spark 2.0.0 released

This version mainly updates the APIs, adds support for SQL:2003 and R UDFs, and improves performance. It includes 2,500 patches contributed by 300 developers.

2016-12-16: Spark 2.1.0 released

This is the second release in the 2.x line. It is a major step toward running Structured Streaming in production: Structured Streaming now supports event-time watermarks and Kafka 0.10.

In addition, this version focuses on usability, stability, and polish, resolving more than 1,200 tickets.

2017-07-01: Spark 2.2.0 released

This is the third release in the 2.x line. It removes the experimental tag from Structured Streaming, meaning Structured Streaming is now considered safe for production use.

The updates in this version mainly target the system's usability, stability, and code refinement. They include:

API upgrades plus performance and stability improvements for Core and Spark SQL, such as support for reading data from Hive metastore 2.0/2.1 and for parsing multi-line JSON or CSV files; removal of support for Java 7 and for Hadoop 2.5 and earlier; and broader SparkR coverage of existing Spark SQL features, such as a Structured Streaming API for R, the full Catalog API in R, and DataFrame checkpointing in R.

2018-02-23: Spark 2.3.0 released

This is the fourth release in the 2.x line. It adds support for Continuous Processing in Structured Streaming as well as a new Kubernetes scheduler backend.

Other major updates include the new DataSource V2 and Structured Streaming v2 APIs, as well as several PySpark performance enhancements.

In addition, this release continues to improve the project's usability and stability and to refine the codebase.
