Introduction to Spark

2025-02-24 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

==> What is Spark?

-> Spark is a fast, general engine for large-scale data processing

-> Spark is an alternative to MapReduce: it is compatible with HDFS and Hive, so it can be incorporated into the Hadoop ecosystem to make up for MapReduce's shortcomings.

==> Spark Core: RDD (Resilient Distributed Dataset)

-> An RDD can be understood simply as a data collection that provides many operation interfaces, is stored distributed across the storage devices (memory or disk) of a cluster, and offers fault tolerance, parallel processing, and other features.
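To make the RDD programming model concrete, here is a minimal sketch using plain Python collections in place of a real distributed dataset (no Spark installation is assumed; a real RDD would partition the data across cluster nodes and rebuild lost partitions from lineage):

```python
from functools import reduce

# A toy stand-in for a partitioned dataset held in cluster memory.
data = [1, 2, 3, 4, 5]

# "map" transformation: apply a function to every element.
squared = list(map(lambda x: x * x, data))

# "filter" transformation: keep only elements matching a predicate.
evens = [x for x in squared if x % 2 == 0]

# "reduce" action: aggregate the surviving elements into one result.
total = reduce(lambda a, b: a + b, evens)

print(squared)  # [1, 4, 9, 16, 25]
print(evens)    # [4, 16]
print(total)    # 20
```

In real Spark the transformations (`map`, `filter`) are lazy and only execute when an action (`reduce`, `collect`, `count`) is called; the plain-Python version above evaluates eagerly, which is the main simplification here.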

==> Characteristics of Spark

-> Fast

- Advantage: compared with MapReduce, Spark runs up to 100 times faster when data fits in memory, and about 10 times faster on disk.

- Disadvantage: Spark does no fine-grained memory management of its own; memory management is left to the application, so OOM (out-of-memory) errors appear easily. Java heap-dump tools can be used to analyze memory overflows in Java programs.

-> Easy to use

- Spark provides APIs in Java, Python, and Scala

- Supports more than 80 high-level operators

- Supports interactive use: you can try out approaches to a problem directly in the Spark shell

-> General-purpose (a whole ecosystem)

- Batch processing

- Interactive queries (Spark SQL)

- Real-time stream processing (Spark Streaming)

- Machine learning (Spark MLlib)

- Graph computation (GraphX)

- Integrates well with Hadoop: it can operate on HDFS directly, and the Hive on Spark and Pig on Spark frameworks provide further Hadoop integration (the Hive on Spark configuration is not yet mature)

-> Compatible: integrates easily with other open-source products

- Hadoop's YARN and Apache Mesos can serve as its resource managers and schedulers

- Can process all data sources supported by Hadoop: HDFS, HBase, Cassandra, etc.

- Spark's processing power can therefore be used without any data migration.

- Without relying on a third-party resource manager or scheduler, the built-in Standalone mode can serve as the resource management and scheduling framework, reducing deployment complexity

- Provides tools for deploying a Standalone Spark cluster on EC2

==> The Spark Ecosystem

-> Spark Core

-> Spark SQL

-> Spark Streaming

-> Spark MLlib: machine learning

-> Spark GraphX: graph computation
