
Basic principles of Hadoop

Basic concepts of Hadoop

Author: Xiaoyu Ma

Link: https://www.zhihu.com/question/27974418/answer/38965760

Source: Zhihu

The copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.

Big data itself is a very broad concept, and the Hadoop ecosystem (or extended ecosystem) basically exists to handle data processing beyond the scale of a single machine. You can compare it to the full set of tools you need in a kitchen. Pots and pans each have their own uses, and their uses overlap: you can eat and drink soup straight from the soup pot, and you can peel with either a knife or a peeler. But each tool has its own strengths, and although odd combinations can work, they may not be the best choice.

Big data: first of all, you have to be able to store big data.

Traditional file systems are single-machine and cannot span different machines. HDFS (Hadoop Distributed File System) is designed in essence to let large amounts of data span hundreds of machines, while what you see is still one file system rather than many file systems. For example, when you say you want the data at /hdfs/tmp/file1, you refer to a single file path, but the actual data is stored on many different machines. As a user, you don't need to know this, just as on a single computer you don't care which tracks and sectors your files are scattered across. HDFS manages the data for you.

Once you have stored the data, you start to think about how to process it. Although HDFS can manage data on different machines for you as a whole, the data is still very large. A single machine reading terabytes or petabytes of data (a lot of data, say the combined size of all the high-definition movies in the history of Tokyo Hot) might grind away for days or even weeks. For many companies, single-machine processing is intolerable; Weibo, for example, must update its 24-hour trending list, so it has to finish these computations within 24 hours. So if I have to use many machines to process the data, I face problems such as how to distribute the work, how to restart the corresponding task if one machine dies, and how the machines communicate and exchange data to complete complex calculations. This is the role of MapReduce / Tez / Spark. MapReduce is the first-generation computing engine; Tez and Spark are the second generation. MapReduce adopts a radically simplified computing model with only two computing phases, Map and Reduce (connected by Shuffle in the middle). With this model, a large share of the problems in the big data field can already be solved.

So what is Map and what is Reduce?

Consider counting words in a huge text file stored on something like HDFS: you want to know how frequently each word appears in the text. You launch a MapReduce job. In the Map phase, hundreds of machines read different parts of the file at the same time, each counting the word frequencies of the part it read and producing pairs like the following.

(hello, 12100 times), (world, 15214 times), and so on (I am lumping Map and Combine together here for simplicity). Each of these hundreds of machines produces such a collection, and then hundreds of machines start the Reduce phase. Reducer machine A receives from the Mapper machines all the counts for words starting with A, and machine B receives the counts for words starting with B (in reality the partitioning is not really by first letter but by a hash function, to avoid data skew: words starting with X are certainly far rarer than others, and you don't want the workload to differ wildly from machine to machine). Each Reducer then sums its counts, e.g. (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). With every Reducer doing the same, you get the word frequency result for the entire file.
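
To make the Map and Reduce phases concrete, here is a minimal word-count sketch using the Hadoop Java API; the class names and layout are illustrative, not from the original article. The Mapper emits (word, 1) pairs, and the Reducer sums the counts that the shuffle delivers to it. The same Reducer class can also serve as the Combiner mentioned above, pre-summing counts on the map side.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: each machine reads its slice of the file and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);               // e.g. (hello, 1)
            }
        }
    }

    // Reduce phase: all counts for the same word arrive at one reducer and are summed,
    // e.g. (hello, 12100) + (hello, 12311) + ... = (hello, 370292).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}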

This seems to be a very simple model, but many algorithms can be described by this model.

The simple Map+Reduce model is crude but brutally effective: easy to use, yet clumsy. The second-generation engines Tez and Spark, besides new features such as in-memory caching, essentially make the Map/Reduce model more general: the boundary between Map and Reduce becomes more blurred, data exchange becomes more flexible, and there are fewer disk reads and writes, so complex algorithms are easier to describe and higher throughput is achievable.

With MapReduce, Tez and Spark in place, programmers found that MapReduce programs are genuinely tedious to write and wanted to simplify the process. It is like having assembly language: you can do almost anything with it, but it still feels cumbersome, and you want a higher-level, more abstract layer to describe algorithms and data processing flows. So Pig and Hive appeared. Pig describes MapReduce in something close to a scripting language, while Hive uses SQL. They translate the scripts and SQL into MapReduce programs and hand them to the computing engine to run, freeing you from tedious MapReduce code so you can write in a simpler, more intuitive language.

With Hive, people found that SQL has a huge advantage over Java. One is that it is so easy to write: the word count above takes only a line or two of SQL, versus dozens or hundreds of lines of MapReduce. More importantly, users without a computer science background finally felt the love: I can write SQL too! Data analysts were finally freed from begging engineers for help, and engineers were freed from writing strange one-off processing programs. Everyone was happy. Hive gradually became a core component of the big data warehouse. In many companies, even the pipeline job sets are described entirely in SQL, because SQL is easy to write, easy to change, easy to read, and easy to maintain.
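
As a rough illustration of "a line or two of SQL", here is a hedged sketch that submits the same word count to HiveServer2 over JDBC. The table docs with a single line column, and the connection URL, are assumptions for the example, not details from the article; Hive compiles the query into MapReduce (or Tez/Spark) jobs behind the scenes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWordCount {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint; adjust host, port, database and credentials.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            // The whole word count is a single SQL statement.
            // Assumes a table docs(line STRING) already loaded from HDFS.
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS freq "
                    + "FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w "
                    + "GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
            }
        }
    }
}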

Once data analysts started using Hive, they found that Hive, running on MapReduce, is really, painfully slow! For pipeline job sets that may not matter much: a recommendation that refreshes every 24 hours is fine as long as it finishes within 24 hours. But for data analysis, people always want results faster. For example, I want to see how many people landed on the inflatable doll page in the past hour and how long they stayed. On a huge website with a huge amount of data, this query may take tens of minutes or even hours. And this analysis may be only the first step of your long march: you also want to see how many people browsed the vibrating eggs and how many looked at Rachmaninoff CDs, so you can report to the boss whether our users are more the lecherous type or artsy young men and women. You cannot stand the torture of waiting, so you can only tell the handsome engineer: come on, make it faster!

So Impala, Presto and Drill were born (along with countless less famous interactive SQL engines, too many to list). The core idea of these three systems is that the MapReduce engine is too slow because it is too general, too heavyweight, too conservative; SQL needs something lighter that grabs resources more aggressively, optimizes specifically for SQL, and does not need so many fault-tolerance guarantees (because if something fails, the worst case is just restarting the task, which is acceptable when the whole job runs in, say, a few minutes). These systems let users handle SQL tasks much faster at the cost of generality and stability. If MapReduce is a machete that will cut anything without fear, then these three are boning knives: nimble and sharp, but unable to take on anything too big or too hard.

These systems, to be honest, have never been as popular as people expected, because around the same time two other mavericks appeared: Hive on Tez / Spark, and SparkSQL. Their design philosophy is: MapReduce is slow, but if I run SQL on a new general-purpose computing engine such as Tez or Spark, I can go faster, and users do not have to maintain two separate systems. It is like having a small kitchen, being lazy, and not being too picky about how refined the food is: you buy a rice cooker that can steam, boil and stew, and save yourself a lot of kitchen utensils.
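
A rough sketch of the "same SQL, newer engine" idea: the word-count query submitted through SparkSQL instead of Hive-on-MapReduce. The table docs(line) and the Hive metastore setup are assumptions carried over from the previous sketch.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlWordCount {
    public static void main(String[] args) {
        // enableHiveSupport() lets Spark read the same Hive tables (assumed setup).
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-word-count")
                .enableHiveSupport()
                .getOrCreate();

        // The SQL is unchanged; only the engine underneath is different.
        Dataset<Row> freq = spark.sql(
                "SELECT word, COUNT(*) AS freq "
                + "FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w "
                + "GROUP BY word");
        freq.show();

        spark.stop();
    }
}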

What has been introduced above is basically the skeleton of a data warehouse: HDFS at the bottom, MapReduce/Tez/Spark running on top of it, and Hive and Pig running on top of those. Or run Impala, Drill or Presto directly on HDFS. This covers the requirements of medium- and low-speed data processing.

What if I want to handle it at a higher speed?

If I am a company like Weibo and what I want to show is not a trending list that updates every 24 hours but one that changes constantly, with an update delay within a minute, none of the methods above will do the job. So another computing model was developed: Streaming (stream) computing. Storm is the most popular stream computing platform. The idea of stream computing is: if I want more real-time updates, why not process the data as it flows in? Take the word frequency example again: my data stream is the words arriving one by one, and I count them as they flow past me. Stream computing is very impressive, with essentially no latency, but its weakness is inflexibility: you have to know in advance what you want to count, because once the data has flowed past it is gone, and anything you did not compute cannot be made up afterwards. So it is a great thing, but it cannot replace the data warehouse and batch systems above.
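
A minimal sketch of the stream-computing idea using Storm's Java API, counting words the moment they flow past. The spout is assumed to exist upstream and to emit tuples with a single "word" field; that assumption, the component names and the parallelism are illustrative, not from the article.

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Counts are updated as each word streams through the bolt; no batch job is needed.
public class WordCountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        long count = counts.merge(word, 1L, Long::sum);
        collector.emit(new Values(word, count));        // emit the running total downstream
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }

    // Wiring: fieldsGrouping routes every occurrence of the same word to the same bolt instance.
    public static TopologyBuilder buildTopology(IRichSpout wordSpout) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", wordSpout);           // assumed source emitting a "word" field
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("words", new Fields("word"));
        return builder;
    }
}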

There is also another somewhat independent class of module, the KV Store, such as Cassandra, HBase, MongoDB and many, many others (really too many to count). A KV Store means: I have a pile of keys, and I can quickly fetch the data bound to each key. For example, with an ID number I can fetch the corresponding identity data. This could also be done with MapReduce, but it would likely scan the whole dataset; a KV Store is dedicated to this operation, and all of its reads and writes are optimized for it. Finding one ID number in several petabytes of data might take only seconds. This greatly speeds up certain specialized operations for big data companies. For example, my website has a page that looks up order contents by order number, and the orders of the whole site cannot fit in a single-machine database, so I would consider using a KV Store to hold them. The philosophy of a KV Store is that it basically cannot handle complex computation: most cannot do JOINs, many cannot aggregate, and there is no strong consistency guarantee (different pieces of data live on different machines, you may read different results on different reads, and it cannot handle operations that demand strong consistency, such as a bank transfer). But it is fast. Extremely fast.
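
For the order-number lookup mentioned above, a KV-style point read with HBase's Java client might look like the following sketch; the table name orders, the column family d, the qualifier content and the row key are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OrderLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table orders = connection.getTable(TableName.valueOf("orders"))) {

            // The order number is the row key: a single point lookup, no full scan.
            Result row = orders.get(new Get(Bytes.toBytes("order-000123")));
            byte[] content = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("content"));
            System.out.println(content == null ? "not found" : Bytes.toString(content));
        }
    }
}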

Different KV Store designs make different trade-offs: some are faster, some hold more, and some support more complex operations. There is bound to be one that suits you.

In addition, there are some more specialized systems and components: Mahout is a distributed machine learning library, Protobuf is an encoding format and library for data interchange, ZooKeeper is a highly consistent distributed coordination system, and so on.

With so many assorted tools running on the same cluster, they need to respect one another and work in an orderly way. So another important component is the scheduling system, and the most popular today is YARN. You can think of it as central management, like your mum supervising in the kitchen: hey, once your sister has finished chopping vegetables, you can take the knife and kill the chicken. As long as everyone obeys your mum's assignments, everyone can cook happily.

You can think of the big data ecosystem as a kitchen-tool ecosystem. To cook different dishes, Chinese, Japanese, French, you need all kinds of different tools. And as diners' demands grow more complicated, new kitchen utensils keep being invented; no single universal utensil can handle every situation, so the ecosystem only becomes more and more complex.

Hadoop architecture

Write: the process by which a Client saves data to the HDFS file system (a minimal client-side sketch in Java follows the steps below).

1. The HDFS Client splits the data into blocks of a specified size through the HDFS file system API.

2. The Client sends a storage request to the NameNode, and the NameNode tells the Client which DataNodes the data can be stored on.

3-5. The Client writes the data through an FSDataOutputStream to the DataNodes returned by the NameNode in step 2, saving at least three replicas on different DataNodes. After each replica is saved, the DataNode sends an ACK confirmation back to the Client.

6. The Client closes the FSDataOutputStream and informs the NameNode that the write is complete.
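
A minimal client-side sketch of the write path using the Hadoop FileSystem API; the path reuses the /hdfs/tmp/file1 example from earlier, and block placement, replication and ACKs all happen behind the create() call.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode where the blocks should go; the returned stream
        // writes to the chosen DataNodes, which replicate the data among themselves.
        try (FSDataOutputStream out = fs.create(new Path("/hdfs/tmp/file1"))) {
            out.write("hello world\n".getBytes(StandardCharsets.UTF_8));
        }                                              // close() completes the write at the NameNode
        fs.close();
    }
}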

Read: the process by which a Client reads data from the HDFS file system (see the sketch after the steps).

1. The Client asks the NameNode to open the file for reading through the DistributedFileSystem API.

2. On receiving the request, the NameNode consults two locally maintained metadata tables (one mapping the file to its blocks, the other recording which DataNodes hold each block) and returns the information to the Client.

3-5. Using the table returned by the NameNode, the Client reads each block from the nearest DataNode that holds it; for example, the first block is read from datanode4 and the second from datanode5.

6. After the data has been read, the Client closes the FSDataInputStream.
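
From the client's point of view the read path is symmetric; a short sketch under the same assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() fetches the block locations from the NameNode; the stream then
        // reads each block from the nearest DataNode holding a replica.
        try (FSDataInputStream in = fs.open(new Path("/hdfs/tmp/file1"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}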

The MapReduce process:

MapReduce is mainly used to process files. The data in a file may be unstructured, and Hadoop's Reduce phase can only handle data in key-value pair form, so the data must first be extracted into key-value pairs.

The process is as follows (a job driver sketch that wires these stages together follows the steps):

Split: a split task cuts or extracts the data in the file into key-value pairs according to criteria specified by the user, such as a pair of (line offset, line text).

Map: these key-value pairs are sent to the mapper worker processes. When a mapper worker receives a key-value pair, it performs the first round of processing and emits new key-value pairs, such as (word, 1) in the word count example; this is the data that the reducer will fold (aggregate).

Shuffle & sort: the mapper output is shuffled and sorted and sent to the reducers for the folding operation, with all data sharing the same key routed to the same reducer.

Store: the data produced by the Reduce phase is stored.
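
A driver sketch that wires the split, map, shuffle and store stages together, reusing the Mapper and Reducer from the word-count sketch earlier; the input and output paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Split: the input format slices the input files into (byte offset, line) pairs
        // and hands one split to each mapper.
        FileInputFormat.addInputPath(job, new Path("/hdfs/tmp/file1"));

        // Map and Reduce classes from the earlier sketch; the Combiner is optional
        // map-side pre-aggregation.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        // Shuffle & sort happens between the two phases: pairs with the same key
        // are routed to the same reducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Store: the reducer output is written back to HDFS.
        FileOutputFormat.setOutputPath(job, new Path("/hdfs/tmp/wordcount-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}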

HBase:

Master: schedules requests and coordinates the work of the RegionServers, deciding which RegionServer a piece of data is sent to for storage. It relies on ZooKeeper to complete this coordination; for example, if a RegionServer goes down, the Master, with ZooKeeper's help, rebuilds the data that was on that RegionServer from the other RegionServers.

RegionServer: the server that actually handles data reads and writes.

Hive: it wraps the MapReduce API in a SQL-like interface similar to relational database syntax, supporting insert, delete, update and query statements.

Hadoop ecosystem

Pig

Pig is a data-flow language used to process huge amounts of data quickly and easily. Pig consists of two parts: the Pig interface and Pig Latin. Pig can easily process data in HDFS and HBase. Like Hive, Pig handles what it needs to do very efficiently; running a Pig query directly saves a great deal of labor and time.

You can use Pig when you want to make some transformations on your data and don't want to write MapReduce jobs.

Hive

Hive originated at Facebook and plays the role of a data warehouse in Hadoop. Built on top of the Hadoop cluster, it provides a SQL-like interface for operating on the data stored in the cluster. You can use HiveQL for select, join, and so on.

If you have data warehouse requirements and you are good at writing SQL and do not want to write MapReduce jobs, you can use Hive instead.

HBase

HBase runs on top of HDFS as a column-oriented database. HDFS lacks random, real-time read and write operations, which is why HBase appeared. HBase is modeled on Google's BigTable and stores data in the form of key-value pairs. The goal of the project is to quickly locate and access the required data among billions of rows.

HBase is a database, a NoSQL database. Like other databases, it provides real-time reads and writes; Hadoop by itself cannot meet real-time needs, but HBase can. If you need to access some data in real time, store it in HBase.

You can use Hadoop as a static data warehouse and HBase as the data store for data that will be modified by your operations.

Pig VS Hive

Hive is better suited to data warehouse tasks; it is mainly used with static structures and workloads that require frequent analysis. Hive's similarity to SQL makes it an ideal point of integration between Hadoop and other BI tools.

Pig gives developers more flexibility when working with large data sets and allows concise scripts to be developed for transforming data flows so they can be embedded into larger applications.

Pig is relatively lightweight compared to Hive, and its main advantage is that it can significantly reduce the amount of code compared to using Hadoop Java APIs directly. Because of this, Pig still attracts a large number of software developers.

Both Hive and Pig can be used in combination with HBase. Hive and Pig also provide high-level language support for HBase, which makes it very easy to process data statistics on HBase.

Hive VS HBase

Hive is a batch-processing system built on Hadoop that reduces the work of writing MapReduce jobs, while HBase is a project that makes up for Hadoop's shortcomings in real-time operations.

Imagine you are operating an RDBMS: if the workload is full-table scans, use Hive+Hadoop; if it is indexed access, use HBase+Hadoop.

A Hive query is executed as MapReduce jobs and can take anywhere from 5 minutes to several hours. HBase is very efficient, certainly much more efficient than Hive.
