Introduction to Concurrent File Operations in Hadoop MapReduce

This article introduces concurrent file operations in Hadoop MapReduce. The method described here is simple, fast, and practical, so let's walk through it.

Concurrent file operations can be performed on either the map side or the reduce side. The approach is briefly illustrated below with an example from a real business scenario.

A brief description of the problem:

Suppose the reduce input key is Text (String) and the value is BytesWritable (byte[]); there are about one million distinct keys, each value averages about 30 KB, and each key corresponds to roughly 100 values. The requirement is to create two files for each key: one that continuously appends the binary data from the values, and one that records the position index of each value within the data file. (A large number of small files hurts HDFS performance; here that would be roughly 1,000,000 × 100 = 100 million individual values, about 3 TB in total, so it is best to splice these small pieces of data into larger files.)

When the number of files is small, you can consider using MultipleOutput to split the key-value pairs, writing them to different files or directories according to the key. However, the number of reducers can then only be 1, otherwise each reducer would generate the same directories or files, defeating the goal. More importantly, the operating system limits the number of files a process may have open: the default is 1024, and even though each datanode in the cluster may be configured higher, the limit is at most tens of thousands, which is still a bottleneck and cannot satisfy a demand for millions of files.

The main job of reduce is to merge key-value pairs and write them to HDFS, but we can also do other work in reduce, such as reading and writing files. Because the default partitioner guarantees that all data for the same key lands in the same reducer, you only need to open two files in each reducer for reading and writing (an index file and a data file). The degree of concurrency is determined by the number of reducers, so if we set the number of reducers to 256 we can process 256 keys at the same time (the partitioner ensures that different reducers handle different keys, so there are no file read-write conflicts). The concurrency gained this way is considerable, and the job can be completed in a relatively short time.
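As a rough illustration, here is a minimal job-setup sketch assuming the org.apache.hadoop.mapreduce API; the class name and job name are placeholders, not taken from the original scenario.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class SpliceJobSetup {
        public static Job buildJob() throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "splice-small-files");
            // 256 reducers: up to 256 distinct keys can be processed concurrently.
            job.setNumReduceTasks(256);
            // The default HashPartitioner routes all records with the same key to one reducer:
            //   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(BytesWritable.class);
            return job;
        }
    }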

That is the overall idea. At the same time, due to the characteristics of HDFS and Hadoop's task scheduling, many problems can still arise while reading and writing files. Below are some of the common ones.

1. org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException

This is probably the most common problem encountered. The possible reasons are as follows:

(1) File stream conflict.

Creating a file usually also opens a file stream for writing, but here we want to append, so using the wrong API can trigger the exception above. Taking the FileSystem class as an example, calling create() and then append() on the same path will throw this exception. It is therefore best to use the createNewFile() method, which only creates the file and does not open a stream.
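A minimal sketch of this create-then-append pattern with the FileSystem API, assuming the cluster has HDFS append enabled; the class and method names are just placeholders.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendExample {
        /** Create the file without opening a stream, then append; avoids the create()/append() conflict. */
        public static void appendBytes(FileSystem fs, Path dataFile, byte[] value) throws IOException {
            if (!fs.exists(dataFile)) {
                fs.createNewFile(dataFile); // only creates the file, no write stream is left open
            }
            FSDataOutputStream out = fs.append(dataFile); // requires append support on the cluster
            try {
                out.write(value);
            } finally {
                out.close();
            }
        }
    }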

(2) MapReduce speculative execution

To improve efficiency, MapReduce may launch several identical attempts of a task after it starts; as soon as one attempt finishes successfully, the whole task is considered complete, its output is taken as the final result, and the slower attempts are killed. Clusters generally enable this option to trade resources for time. In the context of this problem, however, speculative execution is not appropriate: we generally want one task to process one file, and with speculative execution several attempts may try to operate on the same file at the same time, which throws the exception above. It is therefore best to turn this off, by setting mapred.reduce.max.attempts to 1 or mapred.reduce.tasks.speculative.execution to false.
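A small sketch of applying the two settings mentioned above through the Configuration object, using the old mapred.* property names quoted in the text:

    import org.apache.hadoop.conf.Configuration;

    public class SpeculationConfig {
        /** Disable reduce-side speculative execution and extra attempts for this file-writing job. */
        public static void disableReduceSpeculation(Configuration conf) {
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
            conf.setInt("mapred.reduce.max.attempts", 1);
        }
    }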

Even then, problems can still arise. If the single attempt of a task runs into trouble and is killed, the task will still be given another attempt, and because the previous attempt terminated abnormally it may interfere with the new attempt's file operations and cause exceptions. The safest approach is to borrow the idea behind speculative execution itself (each attempt produces its own result, and one is chosen as the final result): suffix every file being written with the id of the attempt writing it, and at the same time catch and handle all file-operation exceptions, so that read-write conflicts on files are avoided. The Context object provides runtime information, including an easy way to obtain the attempt id. Note that speculative execution can even be left on in this scheme, but it will produce many copies of the same file (one per attempt), so it is still not the best solution.
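Here is a hypothetical reducer skeleton illustrating the attempt-id suffix idea; the output directory, file-name pattern, and class name are assumptions made for this sketch, not part of the original article.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SpliceReducer extends Reducer<Text, BytesWritable, Text, Text> {
        // Hypothetical base directory for the spliced data and index files.
        private final Path outputDir = new Path("/data/splice");

        @Override
        protected void reduce(Text key, Iterable<BytesWritable> values, Context context)
                throws IOException, InterruptedException {
            // Suffix every file with the attempt id, so attempt 0 and a restarted
            // attempt 1 of the same task never write to the same file.
            int attemptId = context.getTaskAttemptID().getId();
            Path dataFile = new Path(outputDir, key.toString() + ".data." + attemptId);
            Path indexFile = new Path(outputDir, key.toString() + ".idx." + attemptId);
            // ... append every value to dataFile and record its offset in indexFile ...
        }
    }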

At the same time, we can use the reduce output to record the keys that ran "abnormally". Most of these tasks are cases where attempt_0 was killed and an attempt_1 was restarted, so there are usually two files for such a key. For the keys output in these situations (a file exception, or an attempt id > 0) you can do some follow-up processing, such as renaming the files or immediately rewriting those keys. Since the number of such keys is generally very small, this does not affect overall efficiency.

2. File exception handling

It is best to wrap every file operation in MapReduce with exception handling; otherwise a single file exception can cause the whole job to fail. For efficiency, when an exception occurs, record the key of the affected file as reduce output. Because MapReduce will restart another task attempt to re-read and re-write the files, we are still guaranteed to end up with the final data, and all that remains for the abnormal keys is some simple file renaming.
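A sketch of this per-key exception handling; writeKeyFiles() is a hypothetical helper standing in for the actual append logic, and the reduce output types are assumed to be Text/Text.

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SafeWrite {
        /** Wrap the per-key file writes; on failure, emit the key instead of failing the job. */
        public static void writeOrReport(FileSystem fs, Text key, Iterable<BytesWritable> values,
                                         int attemptId,
                                         Reducer<Text, BytesWritable, Text, Text>.Context context)
                throws IOException, InterruptedException {
            try {
                writeKeyFiles(fs, key, values, attemptId); // hypothetical helper doing the actual appends
            } catch (IOException e) {
                // Record the abnormal key as reduce output so it can be renamed or rewritten later.
                context.write(key, new Text("FILE_EXCEPTION attempt=" + attemptId));
            }
        }

        private static void writeKeyFiles(FileSystem fs, Text key, Iterable<BytesWritable> values,
                                          int attemptId) throws IOException {
            // ... open the data and index files for this key/attempt, append values, write offsets ...
        }
    }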

3. Multi-directory and file splicing

If the number of distinct keys grows to 10 million, the approach above generates too many small files and hurts HDFS performance. In addition, since all of the files sit in the same directory, that directory ends up holding too many files, which hurts access efficiency.

A useful trick is to create subdirectories as the files are created, using the reduce task id as the subdirectory name. This yields as many subdirectories as there are reducers, with no file conflicts, and all keys handled by the same reducer end up in the same directory, as sketched below.

File splicing also raises the question of indexing. To keep the file index as simple as possible, try to ensure that all data for the same key goes into the same large file. This can be achieved with the key's hashCode: if we want to create 1000 files in each directory, we simply take the hashCode modulo 1000.
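The two ideas above (one subdirectory per reduce task, and hashCode-based bucketing into 1000 files per directory) might look roughly like this; class and method names are illustrative only.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.TaskAttemptID;

    public class FileLayout {
        /** One subdirectory per reduce task, so different reducers never share a directory. */
        public static Path taskSubdir(Path baseDir, TaskAttemptID attemptId) {
            return new Path(baseDir, "reduce-" + attemptId.getTaskID().getId());
        }

        /**
         * Bucket a key into one of 1000 large files per directory; all data for a given key
         * always lands in the same file, which keeps the index simple.
         */
        public static int fileBucket(Text key) {
            return (key.hashCode() & Integer.MAX_VALUE) % 1000;
        }
    }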

At this point you should have a deeper understanding of concurrent file operations in Hadoop MapReduce. The best way to consolidate it is to try the approach in practice.
