Analysis of the Hadoop Ecosystem: MapReduce and Hive


This article explains MapReduce and Hive in the Hadoop ecosystem. The content is simple and clear and easy to follow; let's work through it step by step.

1. Computing framework

Hadoop is a computing framework. At present there are roughly five big data computing frameworks in common use, falling into three categories:

Batch-only framework: Apache Hadoop.

Stream-only frameworks: Apache Storm, Apache Samza.

Hybrid frameworks: Apache Spark, Apache Flink.

Among them, Hadoop and Spark are the most famous and widely used.

Although both are called big data frameworks, they actually sit at different levels. Hadoop is a distributed data infrastructure that includes the MapReduce computing framework, the HDFS distributed file system, YARN, and so on. Spark, by contrast, is a tool dedicated to processing data that is stored in distributed fashion; it does not store data itself and is more of a replacement for MapReduce.

In terms of usage scenarios, Hadoop is mainly used for offline data computation, while Spark is better suited to scenarios with stricter real-time requirements.

2. MapReduce

2.1 What is MapReduce

MapReduce is a Java-based parallel, distributed computing framework.

As mentioned earlier, HDFS provides a distributed file system based on a master-slave structure. On top of this storage service, MapReduce can distribute, track, and execute tasks and collect the results.

2.2 MapReduce composition

Put simply, the main idea of MapReduce is to split a large computation into Map (mapping) and Reduce (reduction). Incidentally, Java 8 also gained map and reduce methods with the introduction of lambdas. Here is how they are used in Java:

// requires java.util.* and java.util.stream.*
List<Integer> nums = Arrays.asList(1, 2, 3);
// map: transform each element (here, double it)
List<Integer> doubleNums = nums.stream().map(number -> number * 2).collect(Collectors.toList());
// reduce: fold the elements into a single result (here, their sum)
Optional<Integer> sum = nums.stream().reduce(Integer::sum);

The code is simple: map is responsible for transforming the data, and reduce is responsible for aggregating it. MapReduce in Hadoop has both similarities to and differences from this.

The following analysis is based on the official WordCount example:

public class WordCount {

    // Mapper generic class; the 4 type parameters are the input key, input value, output key, and output value types
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // split the line into tokens
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // nextToken(): returns the string from the current position to the next delimiter
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // The Reducer also takes four type parameters
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            // loop over values and count the occurrences of the "word"
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
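
The official example also includes a main method that wires the Mapper and Reducer into a job and submits it. A minimal sketch of that driver is shown below (the usual org.apache.hadoop.* imports are omitted, and the input and output paths are taken from the command line):

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // the reducer can double as a combiner here
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }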

In this code it is not hard to see that the core of the program is the map function and the reduce function. Is MapReduce made up of just these two? Keep reading.

2.3 Map and Reduce

2.3.1 Map

In the WordCount case, the input to the map function is mainly a <key, value> pair, where the value is one line of the input text.

Context can be ignored for now; it is an inner abstract class of the Mapper class that is not needed for ordinary computation and can simply be understood as the job's "context" for the time being.

The computation performed by the map function is to extract the words from this line of text and, for each word, output a <word, 1> pair.

2.3.2 Reduce

Next, let's take a look at reduce, where the input parameter values is the collection of 1s mentioned above and key is the specific word.

Its computation is to sum the 1s in the collection and then output the word (key) together with the total (sum) as a <word, sum> pair.

Suppose there are two blocks of text whose word frequencies need to be counted. The MapReduce computation process is shown in the following figure:
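
The same flow can also be traced with a tiny in-memory sketch in plain Java streams (illustrative only, not Hadoop code; the two input lines are made up):

    // requires java.util.* and java.util.stream.*
    List<String> block1 = Arrays.asList("hello world");
    List<String> block2 = Arrays.asList("hello hadoop");
    // "map": split each line into words; "shuffle": group equal words together; "reduce": count each group
    Map<String, Long> counts = Stream.concat(block1.stream(), block2.stream())
            .flatMap(line -> Arrays.stream(line.split(" ")))
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    System.out.println(counts); // e.g. {world=1, hello=2, hadoop=1} (map order is not guaranteed)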

This is easy to understand; after all, it is just a HelloWorld-level example. But the most critical part of the whole MapReduce process actually happens between map and reduce.

Take the example above: counting how many times the same word appears across all of the input data. A single Map task only processes part of the data, and a hot word is likely to appear in every Map's output, so occurrences of the same word must be merged to obtain the correct result. Almost every big data computing scenario has to handle this kind of data association. In this example it is of course enough to merge by Key, but for more complex data, such as a database join, two (or more) kinds of data have to be associated by Key.
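
As a toy illustration of this kind of key-based association (plain Java with made-up data, not a real MapReduce join):

    // requires java.util.*; uses Java 9+ collection factories; the data is hypothetical
    Map<String, String> users = Map.of("u1", "Alice", "u2", "Bob");      // key -> user name
    List<String[]> orders = List.of(new String[]{"u1", "order-9"},
                                    new String[]{"u2", "order-7"});      // key -> order id
    for (String[] order : orders) {
        // records that share a key are brought together, which is what shuffle guarantees at scale
        System.out.println(users.get(order[0]) + " placed " + order[1]);
    }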

In MapReduce, this data association operation is called shuffle.

2.4 Shuffle

Shuffle literally means shuffling cards. Let's walk through a complete MapReduce process to see how the cards are shuffled.

Look at the left half first.

1. Data is read from HDFS, and the data blocks are fed into the individual map tasks. When a map task finishes its computation, it stores the results in the local file system. When the map tasks are nearly finished, the shuffle process starts.

2. As shown in the figure, shuffle itself is divided into two parts; the part on the map side is the Map shuffle. Roughly, the Map task process calls a Partitioner interface for each <key, value> pair that the Map produces.

Here the Map results can be partitioned, sorted, and spilled, and the output belonging to the same partition is merged and written to disk, yielding a file that is ordered by partition. In this way, no matter which server node a Map runs on, the same Key is always sent to the same Reduce process, which then processes the <key, value> pairs it receives.

And look at the right half.

1. The Reduce shuffle can be divided into two stages: copying the Map output, then sorting and merging.

Copy: the Reduce task pulls data from each Map task. When a Map task finishes, it notifies its parent TaskTracker that its status has been updated, and the TaskTracker notifies the JobTracker. The Reduce task periodically obtains the locations of Map output from the JobTracker; once it has a location, it copies the output to local storage from the corresponding TaskTracker, without waiting for all Map tasks to finish.

Merge sort:

The copied data is first placed into a memory buffer; if it fits in the buffer, it is written directly to memory, which is the memory-to-memory merge.

Reduce pulls data from every Map, and each Map corresponds to a block of data in memory. When the data held in the memory buffer reaches a certain threshold, an in-memory merge is started and the merged data is written out to a file on disk; this is the memory-to-disk merge.

When all of the Map output belonging to this Reduce has been copied, several files have accumulated on the Reduce side and a merge is performed on them; this is the disk-to-disk merge. At this point the Map output data is already ordered, so the merge performs a merge sort, and the so-called sort phase on the Reduce side is really this merging process.

2. After the Reduce shuffle of the previous step, reduce performs the final computation and writes the output to HDFS.

Those are, roughly, the four steps of shuffle. The key question is which Reduce process each piece of Map output is shuffled to, and that is decided by the Partitioner. The MapReduce framework's default Partitioner hashes the Key and takes the result modulo the number of Reduce tasks, so the same Key always lands on the same Reduce task ID.

public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
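
As a quick sanity check of that formula (assuming, say, 3 reduce tasks), the same key always yields the same partition number:

    int numReduceTasks = 3;                                    // assumed number of reduce tasks
    String key = "hello";
    int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    System.out.println(partition);                             // identical every time "hello" is partitioned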

To sum up shuffle: it is the process in distributed computing of bringing data from different servers together so that it can be computed on next.

Shuffle is the magical part of big data computation. Whether in MapReduce or Spark, every big data batch computation involves a shuffle phase, because only when the data is associated do its internal relationships and value emerge.

3. Hive

The previous part introduced MapReduce; now let's briefly talk about Hive.

Every technology emerges to solve some kind of problem, and MapReduce undoubtedly lowers the programming difficulty of big data development. In practice, though, the more common tool for data computation may well be SQL, so is there a way to run SQL directly?

3.1 What is Hive

Hive is a data warehouse system built on Hadoop that defines an SQL-like query language: Hive SQL.

This introduces the term data warehouse, which refers to a subject-oriented (Subject Oriented), integrated (Integrated), relatively stable (Non-Volatile), time-variant (Time Variant) collection of data used to support management decisions.

This may sound a little abstract, so let's break it down:

Subject-oriented: the data warehouse is organized around subjects, meaning the key areas of concern when using the warehouse to make decisions. For example, subscription analysis can be one subject.

Integrated: the data warehouse stores data from multiple sources together, but because those sources store data differently, the data has to be extracted, cleaned, and transformed (i.e. ETL).

Non-volatile: the stored data is a series of historical snapshots, which may not be modified, only analyzed.

Time-variant: new data is received periodically, reflecting new changes in the data.

Now look at the definition again: a data warehouse integrates data from multiple sources around a given subject through extraction, cleaning, and transformation; the integrated data may not be modified at will, can only be analyzed, and is updated periodically.

3.2 Why Hive

Now that you understand the basic definition of Hive, think about this: what role can an HDFS-based data warehouse play in a Hadoop environment?

The question raised earlier was whether SQL can run directly on the Hadoop platform, and the answer is Hive: it converts Hive SQL into MapReduce programs and runs them.

Early versions of Hive default to Hive on MapReduce.

You usually need to start HDFS and YARN before starting Hive, and you generally need to configure MySQL as well. Hive's data storage depends on HDFS, but to operate on the data sets stored in HDFS you need to know things like the field delimiters, storage types, and locations of the data. This information is stored in tables and is called metadata, and it can be kept in MySQL.

Now let's take a look at some of Hive's commands.

Create a new database: create database xxx

Delete database: drop database xxx

Build a table:

create table table_name (col_name data_type)

Hive has two kinds of tables: **internal tables and external tables**. The default is the internal table. Simply put, an internal table's data is stored in the HDFS directory corresponding to that table, while an external table's data lives elsewhere. When an external table is dropped, the data it points to is not deleted; only the metadata corresponding to the external table is removed.

Query:

select * from t_table where a > 1000

Join query:

select a.* from t_tablea a join t_tableb b on a.name = b.name

Seeing this, you may think I am simply writing SQL. Exactly: for anyone familiar with SQL, Hive is very easy to get started with.

3.3 Hive SQL to MapReduce

As mentioned earlier, HQL can be 'converted' into MapReduce. Let's look at the Hive architecture that turns HQL into MapReduce:

SQL commands are submitted to Hive through a client. For DDL, Hive records the table information in the Metastore metadata component via the execution engine (Driver). The Metastore is usually backed by a relational database and records meta information such as the table name, field names, field types, and the associated HDFS file path.

For DQL, the Driver hands the statement to its compiler for parsing, semantic analysis, optimization, and other steps, and finally generates a MapReduce execution plan. A MapReduce job is then produced from that plan and submitted to Hadoop's MapReduce computing framework for processing.

For example, for a statement like select xxx from a;, the execution order is roughly: metastore lookup -> SQL parsing -> query optimization -> physical plan -> MapReduce execution.
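
From the client's point of view, all of this is hidden behind an ordinary SQL interface. As a sketch, a Java program can submit HQL to HiveServer2 over JDBC; the URL, credentials, and table name below are placeholders, and the Hive JDBC driver (org.apache.hive:hive-jdbc) must be on the classpath:

    // requires java.sql.*; call from a method that declares throws Exception
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("select count(*) from t_table")) {
        while (rs.next()) {
            // behind the scenes Hive compiles this query into a MapReduce job
            System.out.println(rs.getLong(1));
        }
    }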

Thank you for reading. That concludes "Analysis of the Hadoop Ecosystem: MapReduce and Hive". After working through this article you should have a deeper understanding of MapReduce and Hive in the Hadoop ecosystem, though the details still need to be verified in practice.
