2025-04-03 Update · From: SLTechnology News&Howtos > Internet Technology
Shulou (Shulou.com) 06/03 report:
Big data is a development hotspot, both as an industry and as a technology. In China, the government controls roughly 80% of the data, and most of the rest is held by large companies such as Baidu, Alibaba, and Tencent ("BAT"). How, then, can small and medium-sized enterprises, and other companies, build their own big data systems?
This article recommends two of the most widely used and best-known Apache open-source big data frameworks: Spark and Hadoop.
Spark: fast and easy to use
Spark is known for its performance, but it is also valued for its ease of use: it ships with APIs for Scala (its native language), Java, and Python, along with Spark SQL. Spark SQL closely resembles SQL-92, so it takes little learning to get started.
Spark is a general-purpose parallel computing framework, similar to Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab. Spark's distributed computing is based on the map-reduce model and retains the advantages of Hadoop MapReduce; unlike MapReduce, however, a job's intermediate output can be kept in memory, eliminating repeated reads and writes to HDFS. This makes Spark well suited to iterative map-reduce workloads such as data mining and machine learning.
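The map-reduce model described above can be sketched in plain Python. This is a conceptual illustration of the programming model only, not the actual Spark API; the function names are invented for the example:

```python
from collections import defaultdict

# Conceptual sketch of a map-reduce word count whose intermediate
# results stay in memory between phases, the way Spark keeps a job's
# intermediate output in memory instead of writing it back to HDFS.
def map_phase(lines):
    # Emit a (word, 1) pair for every word in every line.
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data spark", "spark and hadoop"]
intermediate = map_phase(lines)   # held in memory, never spilled to disk
word_counts = reduce_phase(intermediate)
```

In real Spark the same shape is written as transformations on an RDD (`flatMap`, `map`, `reduceByKey`), and the intermediate pair list would be a partitioned, distributed dataset rather than a local Python list.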
Spark also offers an interactive mode, in which developers and users get immediate feedback on queries and other operations. MapReduce has no interactive mode, although add-ons such as Hive and Pig make MapReduce easier for adopters to use.
Cost: Spark needs a lot of memory, but it can run on an ordinary number of ordinary-speed disks. Some users complain about the temporary files it leaves behind, which are kept for seven days by default to speed up repeated processing of the same dataset. Disk space is relatively cheap, and since Spark does not rely on disk I/O for processing, the disk space it does use can sit on SAN or NAS storage.
Fault tolerance: Spark uses resilient distributed datasets (RDDs), fault-tolerant collections of data elements that support parallel operations. An RDD can reference a dataset in an external storage system such as a shared file system, HDFS, HBase, or any data source that provides a Hadoop InputFormat; Spark can create an RDD from any storage source supported by Hadoop, including the local file system and the file systems listed earlier.
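The key to RDD fault tolerance is lineage: an RDD records how it was derived from its source, so a lost result can be recomputed rather than restored from a replica. A toy sketch of the idea (invented class names, not Spark's implementation):

```python
# Conceptual sketch (not the real RDD machinery): each dataset remembers
# its parent and the transformation that derives it, so any lost result
# can be rebuilt by replaying the lineage from the source data.
class MiniRDD:
    def __init__(self, source, transform=lambda x: x):
        self.source = source        # parent data, or another MiniRDD
        self.transform = transform  # how this dataset is derived

    def compute(self):
        if isinstance(self.source, MiniRDD):
            data = self.source.compute()  # recurse up the lineage chain
        else:
            data = self.source
        return [self.transform(x) for x in data]

base = MiniRDD([1, 2, 3])
squared = MiniRDD(base, lambda x: x * x)
result = squared.compute()   # [1, 4, 9]
result = None                # "lose" the computed partition...
result = squared.compute()   # ...and rebuild it from lineage: [1, 4, 9]
```

Real RDDs add partitioning, lazy evaluation, and checkpointing on top of this, but the recovery principle is the same: recompute, don't replicate.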
Hadoop: distributed file system
Hadoop is an Apache.org project: a software library and framework for distributed processing of large datasets (big data) across computer clusters using a simple programming model. Hadoop scales flexibly from a single machine to thousands of commodity systems, each providing local storage and compute. In practice, Hadoop is the heavyweight platform in the field of big data analysis.
Hadoop consists of several modules that work together to form the Hadoop framework. The main modules are the following:
Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce
Although these four modules form Hadoop's core, there are several others, including Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop. They further enhance and extend Hadoop into additional big data application areas and large-dataset processing.
Many companies that work with large datasets and analytics tools use Hadoop, and it has become the de facto standard for big data application systems. Hadoop was originally designed to crawl and index billions of web pages and collect that information into a database. It was this desire to search and index the Internet that produced Hadoop's HDFS and its distributed processing engine, MapReduce.
Cost: MapReduce uses an ordinary amount of memory, but because its processing is disk-based, companies must buy faster disks and a lot of disk space to run it well. MapReduce also needs more machines in order to spread disk I/O across multiple systems.
Fault tolerance: MapReduce uses TaskTracker nodes, which send heartbeats to the JobTracker node. If a TaskTracker's heartbeat stops, the JobTracker reschedules all of its pending and in-progress operations onto another TaskTracker. This approach provides effective fault tolerance, but it can greatly lengthen the completion time of some jobs even after a single failure.
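The heartbeat-and-reschedule mechanism can be sketched as follows. This is a simplified single-process model with invented names, not Hadoop's actual JobTracker/TaskTracker code:

```python
# Conceptual sketch: a JobTracker that tracks heartbeats from TaskTrackers
# and moves the tasks of a silent tracker onto a live one.
class JobTracker:
    TIMEOUT = 10.0  # seconds without a heartbeat before a tracker is dead

    def __init__(self):
        self.last_beat = {}  # tracker id -> last heartbeat timestamp
        self.tasks = {}      # tracker id -> list of assigned task ids

    def heartbeat(self, tracker, now):
        self.last_beat[tracker] = now

    def assign(self, tracker, task):
        self.tasks.setdefault(tracker, []).append(task)

    def reschedule_dead(self, now):
        # Collect tasks from trackers whose heartbeat has gone stale.
        moved = []
        for tracker, beat in list(self.last_beat.items()):
            if now - beat > self.TIMEOUT:
                moved += self.tasks.pop(tracker, [])
                del self.last_beat[tracker]
        # Reassign to any live tracker (first one found, for simplicity).
        live = next(iter(self.last_beat), None)
        if live is not None:
            for task in moved:
                self.assign(live, task)
        return moved

jt = JobTracker()
jt.heartbeat("tracker-1", now=0.0)
jt.heartbeat("tracker-2", now=0.0)
jt.assign("tracker-1", "map-task-1")
jt.heartbeat("tracker-2", now=20.0)   # tracker-1 has gone silent
moved = jt.reschedule_dead(now=20.0)  # tracker-1's task moves to tracker-2
```

The rescheduled task restarts from scratch on the new tracker, which is why even one failure can noticeably delay a job, as noted above.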
Conclusion: Spark and MapReduce are symbiotic. Hadoop provides capabilities that Spark lacks, such as a distributed file system, while Spark provides real-time in-memory processing for the datasets that need it. The ideal big data scenario is exactly what the designers intended: Hadoop and Spark working together on the same team.
Author: Zhang Jinglong, CTO of Changyan (Shanghai) Information Technology Co., Ltd., member of CCF YOCSEF Shanghai, technology founder and first CTO of JD.com's "Tonight Hotel Special" app, and a first-generation Chinese smartphone developer.