This article introduces what Hadoop and Spark are and how the two compare. Many people run into these questions in real projects, so let the editor walk you through how to deal with them. I hope you read carefully and come away with something useful!
Definition of Hadoop
Hadoop is an Apache.org project: a software library and framework for the distributed processing of large datasets (big data) across computer clusters using a simple programming model. Hadoop scales flexibly from a single computer to thousands of commodity machines, each providing local storage and compute. In practice, Hadoop is the heavyweight platform of the big data analytics world.
Hadoop consists of several modules that work together to build a Hadoop framework. The main modules of the Hadoop framework include the following:
Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce
Although the four modules above form the core of Hadoop, there are several others, including Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop. They further extend Hadoop's reach into big data applications and the processing of large datasets.
Many companies with large datasets and analytics needs use Hadoop; it has become the de facto standard for big data applications. Hadoop was originally designed to handle crawling and searching billions of web pages and collecting that information into a database. It is that need to crawl and search the web that gave rise to Hadoop's HDFS and its distributed processing engine, MapReduce.
If a dataset becomes so large or so complex that current solutions cannot process it within a time frame that data users consider reasonable, Hadoop can be of great use to the company.
MapReduce is an excellent text processing engine, and rightly so, since crawling and searching the web (its original purpose) are text-based tasks.
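To make that text-processing flavor concrete, here is a minimal, single-process sketch of the classic MapReduce word count: the map step emits (word, 1) pairs, a shuffle groups them by key, and the reduce step sums the counts. This is only an illustration of the model; a real job would run the same logic across a cluster through Hadoop (for example via Hadoop Streaming or the Java API).

```python
# A toy, single-process illustration of the MapReduce word-count flow.
from collections import defaultdict

def map_phase(lines):
    # Emit a (word, 1) pair for every word in the input text.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group the emitted pairs by key, as the framework's shuffle would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    # Sum the counts for each word.
    for word, counts in grouped:
        yield word, sum(counts)

if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    for word, count in sorted(reduce_phase(shuffle(map_phase(sample)))):
        print(word, count)
```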
Definition of Spark
Apache Spark's developers describe it as "a fast, general-purpose engine for large-scale data processing". By comparison, if Hadoop's big data framework is the 800-lb gorilla, Spark is the 130-lb cheetah.
While critics of Spark's in-memory processing admit that Spark is indeed fast (up to 100 times faster than Hadoop MapReduce), they may be less willing to acknowledge that it also runs up to 10 times faster on disk. Spark can also perform batch processing, but what it really excels at are streaming workloads, interactive queries, and machine learning.
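As a rough illustration of the streaming workloads mentioned above, the sketch below counts words arriving on a network socket using Spark Streaming's micro-batch model. The host, port, and batch interval are placeholder values chosen for the example, not anything prescribed by Spark.

```python
# A minimal Spark Streaming sketch: word counts over 5-second micro-batches.
# Assumes a text source on localhost:9999, e.g. started with `nc -lk 9999`.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=5)          # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)      # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # print each batch's counts

ssc.start()
ssc.awaitTermination()
```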
Compared with MapReduce's disk-based batch processing engine, Spark is known for real-time data processing. Spark is compatible with Hadoop and its modules; in fact, it is listed as a module on Hadoop's project page.
Spark has its own project page because, although it can run in a Hadoop cluster through YARN (Yet Another Resource Negotiator), it also has a standalone mode. It can run either as a Hadoop module or as a standalone solution, which makes a direct comparison difficult. Over time, however, some big data scientists expect Spark to diverge and perhaps replace Hadoop, especially where faster access to processed data is critical.
Spark is a cluster computing framework, which means it competes more with MapReduce than with the entire Hadoop ecosystem. For example, Spark does not have its own distributed file system, but it can use HDFS.
Spark processes data in memory and can also spill to disk, while MapReduce is entirely disk-based. The main difference between MapReduce and Spark is that MapReduce uses persistent storage, while Spark uses resilient distributed datasets (RDDs), which are explained in more detail below.
Performance
There is no shortage of numbers on how much faster Spark is than MapReduce. The problem with comparing the two is that they process data differently, as covered in the data processing section. Spark is so fast because it processes everything in memory; it can also use disk for data that does not fit entirely in memory.
Spark's in-memory processing delivers near-real-time analytics on data from many sources: marketing campaigns, machine learning, Internet of Things sensors, log monitoring, security analytics, and social media sites. MapReduce, by contrast, uses batch processing and was never designed for blistering speed; its original purpose was to continuously gather information from websites, with no requirement that the data be processed in real time or near real time.
Ease of use
Spark is known for its performance, but it is also known for its ease of use: it comes with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL. Spark SQL is very similar to SQL-92, so it takes little learning to get started.
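As a small illustration of that SQL-friendly interface, the sketch below registers a DataFrame as a temporary view and queries it with familiar SQL syntax. The file name and column names are invented for the example.

```python
# A minimal Spark SQL sketch: load JSON data and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlExample").getOrCreate()

df = spark.read.json("events.json")        # hypothetical input file
df.createOrReplaceTempView("events")       # expose the data to SQL queries

# Familiar SQL-92-style syntax:
top_users = spark.sql("""
    SELECT user, COUNT(*) AS n
    FROM events
    GROUP BY user
    ORDER BY n DESC
    LIMIT 10
""")
top_users.show()
```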
Spark also has an interactive mode, so both developers and users can get immediate feedback on queries and other operations. MapReduce has no interactive mode, although add-ons such as Hive and Pig make it easier for adopters to work with.
Cost
MapReduce and Spark are both Apache projects, meaning they are free, open-source software. While the software itself costs nothing, there are costs in staffing and hardware to run either platform. Both products are designed to run on commodity hardware, such as so-called low-cost white-box server systems.
MapReduce and Spark run on the same hardware, so where do the costs of the two solutions differ? MapReduce uses a standard amount of memory, but because its processing is disk-based, a company has to buy faster disks and a lot of disk space to run it. MapReduce also needs more systems across which to distribute the disk I/O.
Spark requires a lot of memory, but it can get by with a standard amount of standard-speed disk. Some users complain about the temporary files it leaves behind, which are typically kept for seven days to speed up repeated processing of the same dataset. Disk space is relatively cheap, and since Spark does not rely on disk I/O for processing, the disk space it does use can sit on a SAN or NAS.
However, it is true that Spark systems cost more, because they need large amounts of memory to hold all the data being processed. But Spark's technology also reduces the number of systems required: you end up with fewer, more expensive machines. On balance, Spark may actually lower the cost per unit of computation, despite the extra memory requirement.
For example, "Spark has proved to be easy when there is as much data as PB. It is used to sort 100TB data three times faster than Hadoop MapReduce on machines with only 1/10 data." This achievement makes Spark the benchmark for Daytona GraySort in 2014.
Compatibility
MapReduce and Spark are compatible with each other; MapReduce is compatible with many data sources, file formats, and business intelligence tools through JDBC and ODBC, and Spark offers the same compatibility.
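As one example of that compatibility path, the sketch below reads a table over JDBC into a Spark DataFrame. The connection URL, table, and credentials are placeholders, and the matching JDBC driver jar would need to be supplied to the cluster (for example with --jars).

```python
# A hedged sketch of reading a relational table into Spark over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JdbcExample").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/shop")  # placeholder URL
          .option("dbtable", "orders")                          # placeholder table
          .option("user", "reporting")
          .option("password", "secret")
          .load())

orders.show(5)   # preview the first few rows
```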
Data processing
MapReduce is a batch processing engine. It operates in sequential steps: read data from the cluster, perform an operation on the data, write the results back to the cluster, read the updated data from the cluster, perform the next operation, write those results back to the cluster, and so on. Spark performs similar operations, but in a single pass and in memory: it reads the data from the cluster, performs its operations on the data, and then writes the results back to the cluster.
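The difference is easier to see in code. The sketch below chains several transformations in Spark; intermediate results stay in memory, and the cluster is read from once and written to once. The input and output paths are illustrative.

```python
# A sketch of Spark's single-pass, in-memory pipeline style.
from pyspark import SparkContext

sc = SparkContext(appName="Pipeline")

result = (sc.textFile("hdfs:///logs/access.log")        # read once from the cluster
            .filter(lambda line: " 500 " in line)       # keep server-error lines
            .map(lambda line: (line.split()[0], 1))     # (client IP, 1)
            .reduceByKey(lambda a, b: a + b))           # count per IP, in memory

result.saveAsTextFile("hdfs:///reports/errors-by-ip")   # write back once
```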
Spark also includes its own graph computation library, GraphX. GraphX lets users view the same data both as graphs and as collections, and to transform and join graphs using resilient distributed datasets (RDDs), which are discussed in the fault tolerance section below.
Fault tolerance
When it comes to fault tolerance, MapReduce and Spark approach the problem from two different directions. MapReduce uses TaskTracker nodes that send heartbeats to the JobTracker node. If a heartbeat is missed, the JobTracker reschedules all pending and in-progress operations onto another TaskTracker node. This approach provides fault tolerance effectively, but it can significantly lengthen the completion time of operations that hit even a single failure.
Spark uses resilient distributed datasets (RDDs), fault-tolerant collections of elements that can be operated on in parallel. RDDs can reference datasets in external storage systems such as shared file systems, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark can create an RDD from any storage source supported by Hadoop, including the local file system and the file systems listed earlier.
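A brief sketch of creating RDDs from a few of those sources (all paths and the NameNode address are placeholders):

```python
# Creating RDDs from different storage sources in PySpark.
from pyspark import SparkContext

sc = SparkContext(appName="RddSources")

local_rdd = sc.textFile("file:///tmp/sample.txt")         # local file system
hdfs_rdd  = sc.textFile("hdfs://namenode:8020/data/in")   # HDFS
parallel  = sc.parallelize(range(1000), numSlices=8)      # in-memory collection

# Any Hadoop InputFormat can also be used, e.g. via sc.newAPIHadoopRDD(...).
print(parallel.count())
```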
An RDD has five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a partitioner for key-value RDDs (for example, saying the RDD is hash-partitioned)
Optionally, a list of preferred locations for computing each split (such as the block locations of an HDFS file)
An RDD can be persisted in order to cache the dataset in memory, which speeds up subsequent operations dramatically, by as much as 10 times. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the original transformations.
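A minimal sketch of that caching behaviour: persisting an RDD keeps its partitions in memory so that later actions reuse them instead of recomputing from the source. The input path is illustrative.

```python
# Persisting an RDD so repeated actions reuse the in-memory partitions.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="CacheExample")

words = (sc.textFile("hdfs:///data/corpus.txt")
           .flatMap(lambda line: line.split()))
words.persist(StorageLevel.MEMORY_ONLY)    # equivalent to words.cache()

print(words.count())              # first action: computes and caches the RDD
print(words.distinct().count())   # later actions reuse the cached partitions
```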
Scalability
By definition, both MapReduce and Spark can scale using HDFS. So how big can a Hadoop cluster get?
Yahoo is said to run a 42,000-node Hadoop cluster, so the room for expansion is practically unlimited. The largest known Spark cluster is 8,000 nodes, but as big data grows, cluster sizes are expected to increase to keep meeting throughput expectations.
Security
Hadoop supports Kerberos authentication, which is troublesome to manage. Third-party vendors, however, let organizations take full advantage of Active Directory Kerberos and LDAP for authentication, and those same vendors also offer encryption for data in transit and data at rest.
The Hadoop Distributed File System supports access control lists (ACLs) and the traditional file permission model. For user control over job submission, Hadoop provides Service Level Authorization, which ensures that clients have the correct permissions.
Spark's security is a bit thinner; it currently supports only authentication via a shared secret (password authentication). One security bonus is that if you run Spark on HDFS, it can use HDFS ACLs and file-level permissions. In addition, Spark can run on YARN, which lets it use Kerberos authentication.
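For completeness, here is a hedged sketch of turning on Spark's shared-secret authentication through configuration properties. The secret value is a placeholder, and this style of manually setting the secret applies to standalone or local deployments; on YARN the secret is handled by the cluster manager, and Kerberos can be used instead.

```python
# A sketch of enabling shared-secret authentication via Spark configuration.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("SecureApp")
        .set("spark.authenticate", "true")               # require authentication
        .set("spark.authenticate.secret", "CHANGE_ME"))  # placeholder shared key

sc = SparkContext(conf=conf)
```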
That's all for "what are the definitions of Hadoop and Spark". Thank you for reading. If you want to learn more about the industry, you can follow the site, where the editor will keep publishing practical, high-quality articles for you!