MR Programming Model and MR V1 Explained

MR programming model

The MR programming model consists of five main steps: input, map, grouping (partition), reduce, and output.

Input (InputFormat):

It mainly involves two steps: data splitting (getSplits) and record-by-record input.

Data splitting (getSplits): the input is divided into a number of splits, and the size of a single split is determined by the configured split.minsize and split.maxsize; the formula is max{minsize, min{maxsize, blocksize}}. The block size defaults to 64 MB before Hadoop 2.7.3 and 128 MB afterwards. Once the split size is determined, hosts are selected for each split: a split may contain multiple blocks (for example, if minsize is set larger than 128 MB), and those blocks may be spread across multiple host nodes (each block has 3 replicas by default, so 4 blocks could sit on up to 12 nodes); getSplits selects the hosts that hold the largest share of the split's data. To get good data locality, it is therefore best to have one block per task, i.e. to make the split size equal to the block size.

getSplits produces two files. job.split mainly stores, for each split, the corresponding HDFS file path together with the split's start offset and length within that file (map tasks use it to locate the split's data). job.splitmetainfo stores each split's start offset within job.split, the split size, and its hosts (it is mainly used during job initialization, for map task locality).

Record-by-record input: the records of a split are read iteratively. For text data, the key is the line's starting offset and the value is the current line's text.
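As a quick illustration of the split-size rule above, here is a minimal, self-contained Java sketch; the example sizes are made up and not tied to any real cluster configuration.

```java
// A minimal sketch of the split-size rule: max(minSize, min(maxSize, blockSize)).
public class SplitSizeDemo {

    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // a 128 MB block

        // With default-like min/max values, the split size equals the block size.
        System.out.println(splitSize(1L, Long.MAX_VALUE, blockSize));

        // With minSize raised to 256 MB, a single split spans two 128 MB blocks.
        System.out.println(splitSize(256L * 1024 * 1024, Long.MAX_VALUE, blockSize));
    }
}
```

With the default minimum and maximum values, the split size collapses to the block size, which is why one map task per block is the usual case.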

Map: an ordinary map operation that transforms one KV pair into another KV pair.

Grouping (partition):

Records are grouped according to the configured number of reduce tasks; getPartition takes three parameters: the key, the value, and the number of partitions.

By default HashPartitioner is used. If a fully sorted result is needed, TotalOrderPartitioner can be set instead: it samples part of the data, sorts it, and picks R-1 split points (R is the number of reduce tasks), so that the R files produced by the map tasks are ordered relative to one another and each reduce only needs to sort within its own file.
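To make the partition step concrete, here is a sketch of a custom partitioner written against the classic org.apache.hadoop.mapred API; it simply reproduces the default hash-based behaviour, and the class name is ours rather than Hadoop's.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// getPartition receives the key, the value, and the configured number of
// reduce tasks, and returns the partition (reduce) number for this record.
public class HashLikePartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // No configuration is needed for this simple example.
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result always falls in [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

For a globally sorted result, TotalOrderPartitioner would be configured instead, as described above.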

Reduce: performs the aggregation processing.

Output (OutputFormat):

One task is to check whether the output directory already exists and, if so, report an error.

The other is to write the output data to a temporary directory first.

Job submission and initialization

Job submission and initialization roughly involve four steps: the client submits the job, the client uploads files to HDFS, the client communicates with the JobTracker to submit the task, and the JobTracker notifies the TaskScheduler to initialize the task. The communication between JobClient and JobTracker proceeds as follows.

Step 1: JobClient first interacts with JobTracker to get a jobid

Step 2: JobClient interacts with HDFS to create an output directory

Step 3: interact with HDFS to upload the files needed to run the task (configuration files, jar packages, etc.)

Step 4: JobClient calls getSplits, interacts with HDFS to generate the split information, and writes it to the split files.

Step 5: interact with the JobTracker to submit the task. When the JobTracker receives a submission request, it creates a JobInProgress object, which manages and monitors the whole lifecycle of the job; the JobTracker then tells the TaskScheduler to initialize the job.
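Below is a minimal client-side driver sketch using the classic org.apache.hadoop.mapred API, showing where the submission steps above are triggered. WordCountMapper and WordCountReducer refer to the mapper and reducer sketches later in this article; all class names are our own, not Hadoop's.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // Mapper and reducer classes are the sketches from the map and reduce sections below.
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setNumReduceTasks(2);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // The output directory must not already exist, otherwise OutputFormat reports an error.
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // runJob drives steps 1-5 above: it obtains a jobid, uploads the job files
        // and split information to HDFS, submits the job to the JobTracker and waits.
        JobClient.runJob(conf);
    }
}
```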

JobTracker and TaskTracker

The JobTracker is mainly responsible for the runtime management of jobs, which it organizes as a three-level tree: a JobInProgress object is first created for the job; after initialization, each task has a TaskInProgress, and each TaskInProgress corresponds to multiple TaskAttempts. If one TaskAttempt succeeds, its TaskInProgress succeeds; when all TaskInProgress succeed, the job succeeds.

The JobTracker stores much of its data in key-value maps; for example, the jobs map stores the mapping from jobid to JobInProgress.

The JobTracker monitors and manages the running of jobs by receiving heartbeat requests from TaskTrackers and issuing responses; the responses carry commands such as launching a new task or killing a task. TaskTracker: a TaskTracker process runs on each machine, continually sending heartbeats to the JobTracker that report the node's resource usage and the state of its tasks, and executing the commands the JobTracker returns in its replies.

TaskTracker launches a JVM for each task (the JVM can be reused, but only by tasks of the same type)

TaskTracker starts a new task

The first step is to localize the job. The first task of a job to run on a TaskTracker localizes the job, that is, it downloads the files and jar packages the job depends on from HDFS to the local node. (To prevent multiple tasks from localizing the job at the same time, localization is protected by a lock.)

Step 2: create a temporary directory for tasks

Step 3: start a JVM and run the task in it (in some cases the JVM can be reused).

Map Task internal execution process

A map task goes through five phases in total: read, map, collect, spill, and combine.

Read: read records from the input split.

Map: pass each record to the map function, which turns it into another KV pair.
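As a concrete example of the read and map phases, here is a word-count mapper sketch in the classic org.apache.hadoop.mapred API; TextInputFormat supplies the (offset, line) pairs, and the class name is our own.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // key: the line's starting offset in the file; value: the line's text.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                // Emit one (word, 1) pair per token; the framework buffers it for collect.
                output.collect(word, ONE);
            }
        }
    }
}
```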

Collect phase:

The data produced by map is first written into an in-memory ring buffer; when the buffer fills beyond a certain proportion, the next step, spill, writes its contents to disk.

getPartition is called inside collect() to determine the partition, and the ring buffer stores the key, the value, and the partition number.

A two-level index structure is used so that, during sorting, records in the same partition stay together: records are first ordered by partition, and then the data within each partition is sorted.

kvindices records, for each KV pair, the partition number, the position where the key begins, and the position where the value begins, so one pair occupies three ints; kvoffsets only records the offset of a pair's entry in kvindices, so it needs one int per pair. Memory is therefore allocated to the two arrays in a 1:3 ratio.
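As a rough illustration (plain Java, not the actual Hadoop source; the class and field names are ours), the two index arrays could be laid out like this:

```java
// Sketch of the two-level index used by collect(): kvindices stores 3 ints per
// record (partition, key start, value start), kvoffsets stores 1 int per record
// (the record's slot in kvindices) - hence the 1:3 memory split between them.
public class CollectIndexSketch {

    static final int PARTITION = 0, KEYSTART = 1, VALSTART = 2, NMETA = 3;

    final int[] kvoffsets;  // one int per record
    final int[] kvindices;  // three ints per record

    CollectIndexSketch(int capacity) {
        kvoffsets = new int[capacity];
        kvindices = new int[capacity * NMETA];
    }

    void record(int recNo, int partition, int keyStart, int valStart) {
        int base = recNo * NMETA;
        kvindices[base + PARTITION] = partition;
        kvindices[base + KEYSTART] = keyStart;
        kvindices[base + VALSTART] = valStart;
        // Sorting later permutes kvoffsets only; the raw key/value bytes stay put.
        kvoffsets[recNo] = base;
    }
}
```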

Spill process:

After a certain threshold is exceeded, the in-memory data in the ring buffer is spilled to disk; before being written out it is sorted in memory (quick sort).

The data is then written to temporary files by partition number, and files with the same partition number carry a sequence number indicating which spill they came from. Combine (merge): the multiple spill files are merged over several rounds of recursion, merging the smallest n files in each round.
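A rough sketch of that multi-round merge policy, with plain numbers standing in for spill-file sizes and an assumed merge factor of 3 (in Hadoop the factor comes from io.sort.factor):

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// In each round, take the mergeFactor smallest remaining files and merge them
// into one; repeat until a single output file is left.
public class MergeRoundsSketch {

    static long mergeAll(List<Long> spillSizes, int mergeFactor) {
        PriorityQueue<Long> files = new PriorityQueue<>(spillSizes);
        while (files.size() > 1) {
            long merged = 0;
            for (int i = 0; i < mergeFactor && !files.isEmpty(); i++) {
                merged += files.poll();   // pick the smallest remaining files
            }
            files.add(merged);            // the merged file joins the next round
        }
        return files.peek();
    }

    public static void main(String[] args) {
        // Five spill files of 10..50 units, merged three at a time.
        System.out.println(mergeAll(Arrays.asList(10L, 20L, 30L, 40L, 50L), 3));
    }
}
```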

Reduce Task internal operation process

Reduce can be divided into the following stages: shuffle, merge, sort, reduce, write

Shuffle: obtain the list of completed map tasks and their output locations from the JobTracker, then pull the data over the HTTP interface.

Merge: the data pulled during shuffle is first placed in memory; if memory is insufficient it is written to disk, and a background thread continually merges the data in memory and on disk.

Sort: a reduce task pulls multiple sorted files from different map tasks and merge-sorts them again, so that each reduce ends up with its input in order.
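Finally, the reduce phase itself is the user-supplied aggregation over each key's sorted group of values. Here is a word-count reducer sketch in the classic org.apache.hadoop.mapred API; the class name is our own.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Called once per key with all of that key's values, already grouped and sorted.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
```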
