

What are Hadoop and Spark?

2025-04-27 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article explains what Hadoop and Spark are. The content is simple and clear, and easy to learn and understand; please follow along to study "what are Hadoop and Spark."

What is Hadoop?

Hadoop started as a Yahoo project in 2006 and has since been promoted to a top-level Apache open source project. It is a general-purpose distributed system infrastructure with multiple components: the Hadoop Distributed File System (HDFS), which stores files in Hadoop's native format and spreads them across a cluster; YARN, a scheduler that coordinates application runtimes; and MapReduce, the algorithm that actually processes the data in parallel. Hadoop is built in the Java programming language, but applications on it can also be written in other languages; through a Thrift client, for example, users can write MapReduce code in Python.
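To make the MapReduce idea concrete, here is a minimal, hypothetical sketch of a word-count job's mapper and reducer in plain Python. It runs in a single process for illustration; a real Hadoop job would distribute the map and reduce phases across the cluster (for example via Hadoop Streaming, which feeds lines over stdin/stdout).

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Reduce phase: group pairs by word (Hadoop's shuffle/sort step
    # is simulated here with sorted()) and sum the counts per word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

data = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reducer(mapper(data)))
print(counts)  # e.g. {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

The key point of the model is that the map and reduce functions are independent per record and per key, which is what lets Hadoop parallelize them across many machines.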

In addition to these basic components, Hadoop includes Sqoop, which moves relational data into HDFS; Hive, an SQL-like interface that lets users run queries against HDFS; and Mahout, a machine learning library. Besides using HDFS for file storage, Hadoop can now also be configured to use S3 buckets or Azure Blob storage as input.

It is available as open source through Apache distributions, or from vendors such as Cloudera (the largest Hadoop vendor), MapR, or Hortonworks.

What is Spark?

Spark is a newer project, launched in 2012 out of the AMPLab at the University of California, Berkeley. It is also a top-level Apache project focused on processing data in parallel across a cluster; the big difference is that it works in memory.

Where Hadoop reads and writes files to HDFS, Spark processes data in RAM using RDDs (Resilient Distributed Datasets). Spark can run in standalone mode, with a Hadoop cluster serving as the data source, or together with Mesos; in the latter case, the Mesos master replaces the Spark master or YARN for scheduling.
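The RDD programming model chains transformations (like map and filter) and then triggers a result with an action (like reduce). The following is a toy, single-process sketch of that style, assuming a hypothetical `ToyRDD` class; real Spark RDDs are partitioned across the cluster, cached in memory, and fault tolerant, none of which this illustration attempts.

```python
from functools import reduce as _reduce

class ToyRDD:
    """A hypothetical in-memory stand-in for a Spark RDD (illustration only)."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Transformations return a new dataset rather than mutating in place.
        return ToyRDD(fn(x) for x in self.data)

    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))

    def reduce(self, fn):
        # Actions like reduce() materialize a concrete result.
        return _reduce(fn, self.data)

rdd = ToyRDD(range(10))
total = (rdd.filter(lambda x: x % 2 == 0)   # keep 0, 2, 4, 6, 8
            .map(lambda x: x * x)           # square each element
            .reduce(lambda a, b: a + b))    # sum: 0 + 4 + 16 + 36 + 64
print(total)  # 120
```

In real Spark, the same chain of calls would be distributed: each transformation is recorded per partition, and the action pulls results back to the driver.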

Spark is built around Spark Core, the engine that drives scheduling, optimization, and the RDD abstraction, and that connects Spark to the correct file system (HDFS, S3, an RDBMS, or Elasticsearch). Several libraries run on top of Spark Core, including Spark SQL, which lets users run SQL-like commands on distributed datasets; MLlib for machine learning; GraphX for graph problems; and Streaming for ingesting continuous streaming data such as logs.
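To give a flavor of the SQL-like commands Spark SQL supports, here is the kind of aggregation query involved, sketched against Python's built-in sqlite3 purely as a local stand-in (an assumption for illustration; in Spark the equivalent query would run over a distributed DataFrame via `spark.sql()`).

```python
import sqlite3

# Local stand-in for a distributed dataset: a small in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", 3), ("bob", 7), ("alice", 2)],
)

# Aggregate clicks per user, highest first -- the same SQL would be
# valid as a Spark SQL query over a registered table.
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events "
    "GROUP BY user ORDER BY total DESC"
).fetchall()
print(rows)  # [('bob', 7), ('alice', 5)]
```

The point is that Spark SQL exposes familiar SQL semantics while Spark Core handles distributing the scan, shuffle, and aggregation across the cluster.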

Spark has several APIs. The original interface was written in Scala; because of heavy use by data scientists, Python and R interfaces were added. Java is another option for writing Spark jobs.

Databricks, a company founded by Spark creator Matei Zaharia, now leads Spark development and provides Spark distributions to customers.

Thank you for reading. That covers "what are Hadoop and Spark." After studying this article, you should have a deeper understanding of what Hadoop and Spark are, though specific usage needs to be verified in practice. The editor will push more related articles and knowledge points for you; welcome to follow!



