This article introduces the advantages of Hadoop MapReduce. Many people run into difficulties with this topic in practice, so let the editor walk you through how to handle those situations. I hope you read it carefully and come away with something useful!
Data! Data!
There is a saying: "A lot of data is better than a good algorithm." It means that for some applications (such as recommending movies or music based on past preferences), no matter how good your algorithm is, a larger amount of data will usually produce better recommendations.
The good news is that we now have a great deal of data. The bad news is that we are struggling to store and analyze it.
Data storage and analysis
The problem is simple: the speed at which data can be read from disks has not kept pace with the growth in disk storage capacity.
1. How can we reduce the time it takes to read data from disk?
Reading all the data on a single disk takes a long time, and writing is even slower. A simple way to reduce read time is to read from multiple disks at once.
Imagine that we have 100 disks, each holding 1% of the data. Reading in parallel, we could read all of the data in under two minutes.
Using only 1% of each disk's capacity may seem wasteful, but we can store one hundred datasets, each one terabyte in size, and provide shared access to them. Users of such a system would likely be happy to share disk access in exchange for shorter analysis times; and, statistically speaking, their analysis jobs would run at different points in time, so they would not interfere with one another very much.
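To make the parallel-read claim concrete, here is a small back-of-the-envelope calculation. The figures used (a 1 TB dataset and a sustained transfer rate of 100 MB/s per disk) are illustrative assumptions, not numbers stated in this article:

// Back-of-the-envelope estimate of read time, assuming a 1 TB dataset
// and a sustained transfer rate of 100 MB/s per disk (assumed figures).
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double datasetMB = 1_000_000;   // 1 TB expressed in MB (assumed size)
        double mbPerSecond = 100;       // assumed sustained transfer rate per disk
        int disks = 100;                // number of disks reading in parallel

        double singleDiskSeconds = datasetMB / mbPerSecond;
        double parallelSeconds = singleDiskSeconds / disks;

        System.out.printf("One disk:  %.0f s (about %.1f hours)%n",
                singleDiskSeconds, singleDiskSeconds / 3600);
        System.out.printf("100 disks: %.0f s (under two minutes)%n",
                parallelSeconds);
    }
}

With these assumed figures, one disk needs roughly two and three-quarter hours, while 100 disks working in parallel finish in well under two minutes.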
2. What other problems must be solved in order to read and write data from many disks in parallel? Problem 1): hardware failure.
Once many pieces of hardware are in use, the chance that any one of them will fail becomes quite high.
What is the common way to avoid losing data?
The common practice is to keep redundant copies of the data, so that when a failure occurs another available copy can be used.
Redundant arrays of independent disks (RAID), for example, work on this principle.
Hadoop's filesystem, HDFS (Hadoop Distributed Filesystem), also relies on keeping redundant copies of the data.
Problem 2): most analysis tasks need to combine data in some way in order to complete the analysis.
That is, data read from one disk may need to be combined with data read from any of the other 99 disks. Various distributed systems allow data from multiple sources to be combined, but doing this correctly is notoriously challenging.
MapReduce offers a programming model that abstracts away the disk read and write problem described above, transforming it into a computation over sets of key/value pairs.
The model has two parts, map and reduce, and only these two parts are exposed as the programming interface. Like HDFS, MapReduce also has reliability built in.
In short, Hadoop provides a reliable shared storage and analysis system: HDFS provides the storage and MapReduce provides the analysis. Hadoop has other capabilities, but these two parts are its core.
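To make the map and reduce interface concrete, here is a minimal word-count sketch written against the standard Hadoop Java MapReduce API (org.apache.hadoop.mapreduce). It is an illustrative example, not code from this article; the class names are invented for the sketch:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: turns each input line into (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into whitespace-separated tokens and emit (word, 1).
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

A small driver (a Job that sets the mapper, reducer, and input/output paths) would wire these two pieces together; it is omitted here for brevity.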
Compared with other systems
MapReduce may look like a brute-force approach: every query processes the entire dataset, or at least a large portion of it.
But that is also precisely its power.
MapReduce is a batch query processor: it can run an ad hoc query against an entire dataset and return the results in a reasonable amount of time.
Relational database management systems
1. Why can't we use databases with lots of disks to do large-scale batch analysis? Why do we need MapReduce?
The answer to these questions comes from another trend in disk technology: seek time is improving much more slowly than transfer rate.
Seeking is the process of moving the disk head to a particular location to read or write data. It is the main cause of latency in disk operations, whereas the transfer rate corresponds to the disk's bandwidth.
If a data access pattern involves a large number of seeks, reading a large dataset will inevitably take much longer than a streaming access pattern, whose speed depends mainly on the transfer rate.
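A rough worked example shows why seek-heavy access loses to streaming over large datasets. All figures (1 TB dataset, 100-byte records, 10 ms per seek, 100 MB/s transfer rate) are assumptions chosen purely for illustration:

// Rough comparison of seek-dominated access versus streaming access.
// All figures below are illustrative assumptions, not measured values.
public class SeekVsStream {
    public static void main(String[] args) {
        double datasetBytes = 1e12;      // 1 TB (assumed)
        double recordBytes = 100;        // assumed record size
        double seekSeconds = 0.010;      // assumed seek time (10 ms)
        double bytesPerSecond = 100e6;   // assumed transfer rate (100 MB/s)

        // Touch 1% of the records individually: roughly one seek per record.
        double recordsTouched = (datasetBytes / recordBytes) * 0.01;
        double seekHours = recordsTouched * seekSeconds / 3600;

        // Stream through the entire dataset once instead.
        double streamHours = datasetBytes / bytesPerSecond / 3600;

        System.out.printf("Seeking to 1%% of records:    ~%.0f hours%n", seekHours);
        System.out.printf("Streaming the whole dataset: ~%.1f hours%n", streamHours);
    }
}

Under these assumptions, seeking to just 1% of the records takes on the order of hundreds of hours, while streaming the entire dataset takes only a few hours, which is exactly the regime where MapReduce's streaming style wins.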
On the other hand, if the database updates only a small fraction of its records, a traditional B-tree (the data structure used in relational databases, whose performance is limited by the rate at which it can perform seeks) works well.
However, when most of the database has to be updated, a B-tree is far less efficient than MapReduce, which uses sort/merge to rebuild the database.
In many cases, MapReduce can be seen as a complement to a relational database management system.
1) The differences between the two systems are summarized in a comparison figure (not reproduced in this article).
2) Another difference between MapReduce and relational databases is the degree of structure in the datasets they operate on.
Structured data:
Data organized into entities with a defined format, such as XML documents or database tables that conform to a predefined schema. This is the realm of the RDBMS.
Semi-structured data:
It is looser: although there may be a schema, it is often ignored, so it serves only as a general guide to the structure of the data.
For example, a spreadsheet is structured as a grid of cells, but each cell may hold data of any form.
Unstructured data:
There is no special internal structure.
For example, plain text or image data.
MapReduce works well on unstructured or semi-structured data,
because the data is interpreted only at the time it is processed.
In other words, the input keys and values of MapReduce are not intrinsic properties of the data; they are chosen by the person analyzing the data.
2. Why is MapReduce so well suited to analyzing log files?
Web server logs are a typical example of non-normalized records (for example, the client hostname is written in full on every line, so the same hostname appears many times). This is one of the reasons MapReduce is a good fit for analyzing all kinds of log files.
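As an illustration of choosing keys and values at analysis time, here is a minimal mapper sketch that counts requests per client host in a space-separated web server log. The log layout (hostname as the first field) and the class name are assumptions made for this example:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (client hostname, 1) for each log line; a summing reducer
// (like the word-count reducer above) then yields requests per host.
class LogHostMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text host = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(" ");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            host.set(fields[0]);   // assumes the hostname is the first field
            context.write(host, ONE);
        }
    }
}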
3. MapReduce is a linearly scalable programming model.
The programmer writes two functions, a map function and a reduce function, each of which defines a mapping from one set of key/value pairs to another.
These functions do not need to know anything about the size of the dataset or the cluster they run on, so they can be applied unchanged to small or large datasets. More importantly, if you feed in twice as much data, the job takes twice as long to run; but if you also double the size of the cluster, the job runs just as fast as before. SQL queries generally do not behave this way.
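As a back-of-the-envelope illustration of that linear-scaling claim, here is a tiny model in which job time grows with data size and shrinks with cluster size; the constant used is an arbitrary assumption:

// Illustrative only: a simple model where job time scales with data size
// and inversely with cluster size (T ~ data / nodes).
public class LinearScaling {
    static double jobHours(double dataTB, int nodes) {
        double hoursPerTBPerNode = 1.0;   // assumed constant, for illustration
        return dataTB * hoursPerTBPerNode / nodes;
    }

    public static void main(String[] args) {
        System.out.println(jobHours(10, 10));  // baseline: 1.0 hour
        System.out.println(jobHours(20, 10));  // double the data: 2.0 hours
        System.out.println(jobHours(20, 20));  // also double the cluster: 1.0 hour
    }
}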
However, in the near future the differences between relational database systems and MapReduce systems are likely to blur.
Relational databases have begun to absorb some of MapReduce's ideas; conversely, higher-level query languages built on MapReduce, such as Pig and Hive, bring MapReduce systems closer to traditional database programming.
Grid computing
MapReduce tries to co-locate the data with the compute node, so data access is fast because it is local.
This data locality is a core feature of MapReduce and the reason for its good performance.
Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to exhaust it by copying data around), MapReduce conserves bandwidth by explicitly modeling the network topology.
1. How are partial failures handled in a large-scale distributed computing environment?
In a large-scale distributed computing environment, coordinating the execution of many processes is a great challenge. The hardest part is handling partial failure of the system gracefully, that is, carrying on with the overall computation without knowing whether or not a remote operation has failed.
MapReduce spares the programmer from having to think about partial failure, because the system implementation itself detects failed map or reduce tasks and reschedules them on healthy machines.
MapReduce can do this because it is a shared-nothing architecture, meaning that tasks are independent of one another. (Strictly speaking, rerunning a reducer requires more care than rerunning a mapper, because a reducer depends on fetching the mappers' output; if that output is not available, the relevant mappers must be rerun to regenerate it. The MapReduce system itself controls the process by which mapper output is passed to the reducers.)
Therefore, from the programmer's point of view, the order in which tasks execute does not matter. By contrast, MPI (Message Passing Interface) programs must explicitly manage their own checkpointing and recovery; this gives the programmer more control but makes programs harder to write.
2. What inspired MapReduce?
MapReduce draws on long-standing ideas from the functional programming, distributed computing, and database communities, and the model has since been applied in many other settings.
3. What is MapReduce designed for?
MapReduce is designed for jobs that run for minutes or hours on dedicated, reliable hardware in a single data center connected by a very high-bandwidth internal network.
A brief history of Hadoop development
Hadoop was created by Doug Cutting, the creator of Apache Lucene, a widely used text search library.
Hadoop originated from Apache Nutch
Hadoop originated from Apache Nutch, an open source web search engine, which itself is part of the Lucene project.
The Nutch project began in 2002, and a working crawler and search system quickly emerged. Later, however, its developers concluded that the architecture would not scale to the billions of pages on the web.
A paper published by Google in 2003 came to the rescue. It described the architecture of Google's distributed filesystem, known as GFS, which was in production use at Google. GFS, or something like it, could solve the storage needs of the very large files generated during web crawling and indexing. Crucially, it would also save a great deal of time otherwise spent on administrative tasks such as managing storage nodes.
1) In 2004, the Nutch Distributed Filesystem (NDFS) was implemented
In 2004, Doug Cutting began writing an open source implementation, the Nutch Distributed Filesystem (NDFS).
2) In early 2005, the Nutch developers implemented MapReduce in Nutch
In 2004, Google published a paper that introduced its MapReduce system.
In early 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run on MapReduce and NDFS.
3) NDFS and MapReduce were moved out of Nutch to form a Lucene subproject named Hadoop
The NDFS and MapReduce implementations in Nutch were useful beyond search. In February 2006, the developers moved them out of Nutch to form an independent subproject of Lucene called Hadoop.
4) In January 2008, Hadoop became a top-level Apache project
5) In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data
6) Further milestones followed in Hadoop's history
Apache Hadoop and the Hadoop ecosystem
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the name Hadoop is also used for a family of related projects that build on this platform for distributed computing and large-scale data processing.
The Hadoop projects mentioned in the book are briefly described as follows:
Common
A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
A dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used to build distributed applications.
Sqoop
A tool for efficiently moving bulk data between relational databases and HDFS.
That concludes "What are the advantages of Hadoop MapReduce". Thank you for reading. If you want to learn more about the field, follow this site; the editor will keep producing practical, high-quality articles for you!