An overview of MapReduce on the big data learning route. MapReduce is a distributed parallel offline computing framework: a programming framework for distributed computing programs, and the core framework for developing "data analysis applications based on Hadoop". Its core function is to combine the business logic code written by the user with its own default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.
The idea is similar to the way HDFS solves its problem: HDFS splits large files into pieces and stores them across the hosts in the cluster.
In the same way, MapReduce splits a complex computation into sub-computations, hands them to the hosts in the cluster, and the hosts compute in parallel.
1.1 The background of MapReduce
Massive data cannot be processed on a single machine because of hardware resource limits.
Once a single-machine program is extended to run distributed on a cluster, the complexity and difficulty of program development increase greatly.
With the MapReduce framework, developers can focus most of their work on business logic and leave the complexity of distributed computing to the framework.
1.2 The MapReduce programming model
MapReduce is a distributed computing model.
It abstracts the parallel computation into two functions:
Map (mapping): applies a specified operation to each element of a list of independent elements; this can be done in a highly parallel way.
Reduce (reduction): merges the elements of a list.
A simple MapReduce program only needs to specify map(), reduce(), the input, and the output; the framework does the rest.
1.3 Several key terms in MapReduce
Job: each computation request submitted by a user is called a job.
Task: each job is split up and handed to multiple hosts to complete; each unit of that split execution is a task.
Tasks fall into the following three types:
Map task: responsible for the entire data processing flow of the map phase
Reduce task: responsible for the entire data processing flow of the reduce phase
MRAppMaster: responsible for process scheduling and state coordination of the whole program
1.4 The running process of a MapReduce program
Specific process description:
When an MR program starts, the first process to start is MRAppMaster. After MRAppMaster starts, it uses the job's description information to calculate the number of maptask instances needed, and then asks the cluster for machines to start the corresponding number of maptask processes.
After a maptask process starts, it processes the data in the data slice (split) assigned to it. The main steps are:
-use the InputFormat specified by the client to obtain a RecordReader and read the data into input KV pairs
-pass each input KV pair (k is the starting offset of the line in the file, v is the content of that line) to the client-defined map() method for processing, and collect the KV pairs output by map() into a cache
-the KV pairs in the cache are partitioned and sorted by key, and are continually spilled to disk files
After MRAppMaster sees that all maptask processes have finished, it starts the number of reducetask processes specified by the client's parameters and tells each reducetask process the range of data (the data partition) it is responsible for.
After a reducetask process starts, it follows the data locations given by MRAppMaster, fetches the relevant maptask output files from the machines where the maptasks ran, and merges and sorts them locally. It then takes the KV pairs with the same key as one group, calls the client-defined reduce() method on each group, collects the output result KV pairs, and finally calls the OutputFormat specified by the client to write the result data to external storage.
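How map output is partitioned among the reducetask processes can be controlled on the Job object. The following is a minimal sketch, not part of the original tutorial, assuming a Job object like the one built in the Driver class later in this article; by default Hadoop uses HashPartitioner, which routes a key to partition (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks:
// assumption: job is the org.apache.hadoop.mapreduce.Job instance configured in the Driver class below
job.setNumReduceTasks(2); // start two reducetask processes, i.e. two output partitions
job.setPartitionerClass(org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.class); // the default partitioner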
1.5 Writing MapReduce programs
Writing distributed parallel programs based on the MapReduce computing model is straightforward: the programmer's main coding work is to implement the Map and Reduce functions.
The other hard problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, are handled by the YARN framework.
In MapReduce, the map and reduce functions follow this general form:
Map: (K1, V1) → list(K2, V2)
Reduce: (K2, list(V2)) → list(K3, V3)
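In the wordcount example developed later in this article, these general types are instantiated as follows (this mapping follows from the generics of the Mapper and Reducer classes shown below):
Map: (LongWritable offset, Text line) → list(Text word, IntWritable 1)
Reduce: (Text word, list(IntWritable counts)) → list(Text word, IntWritable total)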
The interface of Mapper:
protected void map(KEYIN key, VALUEIN value, Context context)
        throws IOException, InterruptedException {
}
The interface of Reducer:
protected void reduce(KEYIN key, Iterable<VALUEIN> values,
        Context context) throws IOException, InterruptedException {
}
Basic structure of MapReduce program code
2 MapReduce instance development
2.1 Programming steps
The program written by the user is divided into three parts: Mapper, Reducer, and Driver (the client that submits the MR program).
The input data of Mapper is in the form of KV pairs (the KV types can be customized).
The output data of Mapper is in the form of KV pairs (the KV types can be customized).
The business logic of Mapper is written in the map() method.
The map() method (in the maptask process) is called once for each input KV pair.
The input data types of Reducer correspond to the output data types of Mapper, and are also KV pairs.
The business logic of Reducer is written in the reduce() method.
The reducetask process calls the reduce() method once for each group of KV pairs with the same key.
Both the user-defined Mapper and Reducer must extend their respective parent classes.
The whole program needs a Driver to submit it; the Driver builds a Job object that describes all the necessary information.
2.2 Classic wordcount programming
Requirement: given a batch of files (TB or PB scale), count how many times every word occurs in them.
Suppose there are three files, named qf_course.txt, qf_stu.txt, and qf_teacher.
qf_course.txt content:
php java linux
bigdata VR
C C++ java web
linux shell
qf_stu.txt content:
tom jim lucy
lily sally
andy
tom jim sally
qf_teacher content:
jerry lucy tom
jim
Approach:
-count the number of occurrences of each word in each file separately: map()
-accumulate the counts of the same word across the different files: reduce()
Implementation code
-create a simple maven project
-add the jar that the hadoop client depends on; the main content of pom.xml is as follows:
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.1</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
</dependencies>
-write the code
-define a Mapper class
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * The four generic types of Mapper are, from left to right:
 *
 * LongWritable KEYIN: by default, the starting offset of the line of text read by the MR framework, of type Long.
 *   It is similar to a line number, but hadoop has its own more compact serialization interface, so LongWritable is used instead of Long.
 * Text VALUEIN: by default, the content of the line of text read by the MR framework, of type String; as above, Text is used.
 *
 * Text KEYOUT: the key of the output data after the user-defined logic has run; here it is the word, of type String; as above, Text is used.
 * IntWritable VALUEOUT: the value of the output data after the user-defined logic has run; here it is the word count, of type Integer; as above, IntWritable is used.
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * The business logic of the map phase is written in the custom map() method.
     * maptask calls our custom map() method once for each line of input data.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // first convert the line of text passed to us by maptask into a String
        String line = value.toString();
        // split the line into words on spaces
        String[] words = line.split(" ");
        // output each word as <word, 1>, e.g. <hello, 1>
        for (String word : words) {
            // use the word as the key and the count 1 as the value, so that the data can later be
            // distributed by word and the same word ends up at the same reduce task
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
-define a Reducer class
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * The four generic types of Reducer are, from left to right:
 * Text KEYIN: corresponds to the KEYOUT of the mapper output
 * IntWritable VALUEIN: corresponds to the VALUEOUT of the mapper output
 *
 * Text KEYOUT: the word
 * IntWritable VALUEOUT: the output data type of the result of the custom reduce logic, i.e. the total count
 */
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * The input parameter key is the key of one group of KV pairs with the same word.
     * values is the collection of values belonging to that same key.
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0; // accumulates the number of occurrences of the word
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
-write a Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Equivalent to a client of the yarn cluster.
 * It encapsulates the relevant running parameters of our MR program, specifies the jar package,
 * and finally submits everything to yarn.
 */
public class WordcountDriver {

    /**
     * This class runs on the hadoop client. As soon as main runs, the yarn client starts up and communicates with the yarn server.
     * The yarn server is responsible for starting the mapreduce program, which uses the WordcountMapper and WordcountReducer classes.
     */
    public static void main(String[] args) throws Exception {
        // this code expects two input parameters: the first is the source files to process,
        // the second is the output path for the results
        if (args == null || args.length == 0) {
            args = new String[2];
            // both paths are file paths on the HDFS file system
            args[0] = "hdfs://192.168.18.64:9000/wordcount/input/";
            args[1] = "hdfs://192.168.18.64:9000/wordcount/output";
        }
        /**
         * When nothing is set and this is run on a machine with hadoop installed, it automatically reads
         * /home/hadoop/app/hadoop-2.7.1/etc/hadoop/core-site.xml
         * and puts that file into the Configuration.
         */
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // specify the local path of the jar package of this program
        job.setJarByClass(WordcountDriver.class);
        // specify the mapper class this job will use
        job.setMapperClass(WordcountMapper.class);
        // specify the kv types of the mapper output data
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // specify the Reducer class this job will use
        job.setReducerClass(WordcountReducer.class);
        // specify the kv types of the final output data
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // specify the directory where the job's input files are located
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // specify the directory where the job's output will be written
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // submit the parameters configured in job, together with the jar of the java classes the job uses, to yarn to run
        /* job.submit(); */
        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
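A usage sketch (the jar name is an assumption; the HDFS paths match the defaults hard-coded in the Driver above): package the project with maven and submit it from a machine with a hadoop client installed. Note that the output directory must not already exist, otherwise FileOutputFormat will refuse to run the job.
mvn clean package
hadoop jar wordcount-1.0.jar WordcountDriver hdfs://192.168.18.64:9000/wordcount/input/ hdfs://192.168.18.64:9000/wordcount/output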
The wordcount process
The input files are split into splits. Because the test files are small, each file is one split, and each split is divided line by line into <key, value> pairs. This step is done automatically by the MapReduce framework; the offset (that is, the key) includes the characters taken up by the line break, which differ between Windows and Linux environments.
The pairs of each split are processed by the map() method, which produces new <word, 1> pairs.
After the map() method has produced its output, Mapper sorts the pairs by key and, if a Combine step is configured, runs it, adding up the values of pairs with the same key to obtain Mapper's final output.
Reducer first merges and sorts the data received from the Mappers, then passes each group of pairs with the same key to the user-defined reduce() method, producing new <word, count> pairs, which are the output of WordCount.
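The Combine step mentioned above is optional. A minimal sketch of enabling it (not shown in the Driver above) is to reuse the reducer as the combiner, which is valid for wordcount because adding up counts is associative and commutative:
// assumption: added in the Driver before the job is submitted
job.setCombinerClass(WordcountReducer.class);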