
What are the Hadoop testing methods?


This article introduces common Hadoop testing methods. Many people have questions about how to test Hadoop programs in daily work, so the editor has consulted various materials and organized a set of simple, practical approaches. I hope it helps answer the question "what are the Hadoop testing methods?" Please follow along.

I. Frequently asked testing questions

1. Deleting destination files (reduce output, upload and download targets, and so on)

[phenomenon] The program ran successfully the first time, and neither the data nor the program was changed. Why does the same command fail the second time it is run?

[problem description] The HDFS file system does not support overwriting existing files. Writing reduce output, uploading a local file to HDFS, or downloading an HDFS file to a local path will all fail when the destination file already exists.

[test method] For programs that perform these operations, be sure to delete the corresponding destination files before the program runs, especially temporary directories used across multiple rounds of iteration.
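As a sketch of this cleanup, assuming hypothetical paths /app/test/output and upload.dat and the classic hadoop dfs syntax (adjust the commands to your client version):

hadoop dfs -rmr /app/test/output      # remove the job output directory if it already exists
hadoop dfs -rm /app/test/upload.dat   # remove the destination of a file upload
rm -f ./result.txt                    # remove the local destination of a file download
hadoop jar ./my_job.jar ...           # then start the job (placeholder jar name)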

2. Setting the HADOOP_HOME environment variable

[phenomenon] On your own test machine you create a new directory with the hadoop command, and hadoop dfs -ls <path> shows that the directory exists, but it cannot be found from a public machine. Why?

[problem description] Several Hadoop clients on the same test machine may connect to different Hadoop clusters. When you type the hadoop command directly on the shell command line, the system uses the client under HADOOP_HOME by default. If another user has changed the HADOOP_HOME environment variable, you connect to a different cluster, and of course the directory you are looking for cannot be found.

[test method] When using the hadoop command in a program, always specify the full path of the hadoop binary; in particular, in programs provided by RD, the path of the hadoop command must be configurable.
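A sketch of making the client path explicit and configurable in a driver script; the variable name HADOOP_BIN and the path shown are hypothetical and should come from the project's configuration rather than from the ambient HADOOP_HOME:

HADOOP_BIN=/home/work/hadoop-client/bin/hadoop   # configured per project, not inherited from the shell
${HADOOP_BIN} dfs -mkdir /app/test/newdir
${HADOOP_BIN} dfs -ls /app/test/newdir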

3. Normalizing the program's input paths on Hadoop

[phenomenon] There is nothing wrong with the program's input data: the file paths and formats are correct. Why are the result files empty?

[problem description] For multi-input jobs (that is, input files in several formats), RD often designs the program to decide the file type from the path name and then handle each type differently. In the map phase, the program compares the path parameters passed in from outside with the path of the file currently being processed, obtained from a map environment variable, to determine which input directory the current file block comes from. When the externally supplied path is not normalized (for example, it carries a redundant "./" prefix), the comparison concludes that the current file does not come from any of the input directories, so the type of the input data cannot be determined. (At the time this problem took a long time to track down; it was first thought to be an HDFS problem and was finally found by reading the program's source code.)

[test method] If this happens, first check the task's monitoring page. Is Map input records 0? If so, check that the input data address is correct. Is Map output records 0? Map output records is the output of the map phase; if it is 0, the data was filtered out during map and the format of the input data needs to be checked. Then check whether Reduce input records is 0; if the reduce input is 0, the output is bound to be 0.
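The path comparison described above looks roughly like the following sketch inside a streaming mapper; the directory names input_a and input_b are hypothetical, and it assumes the framework exports the current file path as map_input_file (older releases) or mapreduce_map_input_file (newer ones):

CURRENT_FILE="${map_input_file:-${mapreduce_map_input_file}}"
case "${CURRENT_FILE}" in
  */input_a/*) TYPE=a ;;          # record comes from the first input directory
  */input_b/*) TYPE=b ;;          # record comes from the second input directory
  *)           TYPE=unknown ;;    # a non-normalized external path tends to end up here
esac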

4. The effect of Hadoop backup (speculative) tasks on program results

[phenomenon] Local files generated in reduce need to be uploaded to HDFS. To avoid an upload failure caused by the destination file already existing, the file is deleted before uploading. All reduce tasks finish normally, yet result files are occasionally missing, and the problem cannot be reproduced reliably.

[problem description] When Hadoop runs map and reduce tasks, to keep a single slow task from dragging down the completion time of the whole job, it enables speculative execution for some tasks: backup tasks run over the same data, and once one task finishes, the system automatically kills the backup tasks. A backup task may therefore delete the destination file after the original task has already uploaded the correct result but before the backup task itself is killed, so the final result file is lost. This is also why the phenomenon does not reproduce reliably.

[test method] When operating on the same file or directory on HDFS, always consider interference from parallel operations. In particular, when doing HDFS operations inside reduce, take the impact of backup tasks into account (this problem is well hidden). Possible solutions: 1. forbid the platform from generating backup tasks (startup parameters can be configured for this); 2. perform such operations uniformly on a single machine, for example prepare the environment on a single machine first and then start the mapred task.
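A sketch of option 1 for a classic streaming job; the old-style property names mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution are assumed here, so check which parameters your platform actually honors:

hadoop jar ${HADOOP_HOME}/contrib/streaming/hadoop-streaming-*.jar \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.reduce.tasks.speculative.execution=false \
  -input /app/test/input -output /app/test/output \
  -mapper ./mapper -reducer ./reducer -file ./mapper -file ./reducer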

5. Uneven data distribution across reduce buckets

[phenomenon] The task monitoring page shows that some reduces finish very quickly while others run for a very long time.

[problem description] Since a Hadoop job is being used at all, the data being processed must be large. A simple hash mapping into buckets can distribute the keys unevenly, so the amount of data handled by different reduces differs significantly.

[test method] Most data processed by Hadoop jobs today is at the terabyte scale. At that scale, uneven bucketing can leave a single node with far too much data, degrading performance and possibly pushing memory past the threshold so that the task is killed by the platform. Before testing, therefore, work out whether the bucket key and the bucket function can lead to unbalanced buckets.
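One quick way to check for skew before running the full job is to sample the key distribution of the mapper output on a single machine; sample_input is a hypothetical file, and the key is assumed to be the first tab-separated field:

cat sample_input | ./mapper | cut -f1 | sort | uniq -c | sort -rn | head
# if a handful of keys dominate the counts, the corresponding reduce buckets will be overloaded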

6. Worker resource allocation

[phenomenon] Each individual task runs for only a short time and cluster resources are plentiful, yet the whole job takes a very long time to finish.

[problem description] When the amount of data is large, the job is divided into many tasks, but by default the cluster allocates only a few workers at job start. Even when cluster resources are idle, only a small number of workers run the tasks, so the job's finish time is stretched out.

[test method] If the data being processed is very large, be sure to specify the resource parameters when the job starts; otherwise the system default applies and very few workers are allocated (50 on the hy cluster). For large amounts of data this limit can significantly degrade performance. After a job starts, you can check how many workers it is running on through the monitoring page.
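Besides the monitoring page, the classic command-line client can also show how a job is progressing; job_201X_XXXX below is a placeholder job ID, and the names of the worker/capacity parameters themselves are platform specific, so take them from your cluster's documentation:

hadoop job -list                   # list running jobs and their IDs
hadoop job -status job_201X_XXXX   # completion percentage and counters for one job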

7. Memory limit of a single worker

[phenomenon] With a small amount of data the test passes, but when big data is run the task is always killed by the platform.

[problem description] The Hadoop platform limits the memory of each running task; the default is 800 MB. When a program's runtime memory exceeds 800 MB, the platform automatically kills the task.

[test method] There are two ways to test this point: 1. run a large amount of data on the cluster and, if the task is killed by the platform, check the log to confirm that it was killed for exceeding the memory limit; 2. run the mapred program locally and watch its memory footprint. If it is around 800 MB, going online will be very risky.
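A sketch of option 2 on a single machine, assuming a sample_input file and GNU time; the -v flag reports the maximum resident set size, which approximates the peak memory the platform would see:

/usr/bin/time -v ./mapper < sample_input > /dev/null 2> mapper_mem.log
grep "Maximum resident set size" mapper_mem.log   # reported in kilobytes; compare against the 800 MB limit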

8. MPI programs operating on file directories on Hadoop

[phenomenon] Operations on files in the same directory occasionally fail on the MPI nodes.

[problem description] The cause is the same as in item 4, the effect of Hadoop backup tasks on program results: multiple nodes operate on the same file on Hadoop. The only difference is that there the conflict is between Hadoop tasks, while here it is between MPI nodes.

[test method] Watch out for operations on the same HDFS file or directory from multiple places, especially when the same module runs on both the Hadoop cluster and the MPI cluster. Do not modify the same Hadoop directory or file from several MPI nodes at the same time.
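One way to honor this rule is to let only a single MPI rank touch the shared HDFS path; the sketch below assumes an Open MPI launcher that exports OMPI_COMM_WORLD_RANK, and /app/shared/output is a hypothetical path:

if [ "${OMPI_COMM_WORLD_RANK:-0}" -eq 0 ]; then
  hadoop dfs -rmr /app/shared/output                 # only rank 0 rewrites the shared directory
  hadoop dfs -put ./local_result /app/shared/output
fi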

9. Formatting the map-reduce start-up parameters

[phenomenon] The program runs successfully locally but cannot be run on Hadoop.

[problem description] The start-up parameters of a map-reduce job can be quite long. To make them easier to read, RD may fold the command over several lines and indent it with tabs, which makes the command fail to parse on Hadoop.

[test method] When the map-reduce start-up parameters are long, urge RD to collect them in shell variables and then substitute those variables into the Hadoop program's start-up command. That keeps the command readable without introducing errors.
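A sketch of this pattern; every path and property value below is hypothetical and only illustrates gathering the long options into shell variables instead of folding the command line with tabs:

JOB_OPTS="-D mapred.reduce.tasks=100 -D mapred.job.name=my_test_job"
INPUT=/app/test/input
OUTPUT=/app/test/output
hadoop jar ${HADOOP_HOME}/contrib/streaming/hadoop-streaming-*.jar ${JOB_OPTS} \
  -input ${INPUT} -output ${OUTPUT} -mapper ./mapper -reducer ./reducer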

10. Binary result files when the Hadoop program uses bistreaming

[phenomenon] The program's result file is in binary format, but after it is downloaded locally, parsing it according to the format described in the detailed design always fails.

[problem description] In the current streaming mode, either streaming or bistreaming can be used. With bistreaming the result file is in Hadoop's sequence file format, which carries key-length and value-length information. When this data is used on Hadoop the format is transparent to the user, but once it is downloaded locally it cannot be used directly.

[test method] When the task's output format is outputformat=SequenceFileAsBinaryOutputFormat, you can use the hadoop dfs -copySeqFileToLocal -ignoreLen command to download the binary data locally with the length information stripped, which then matches the format written in the document.

11. How Hadoop splits the input file

[phenomenon] The input file consists of query log lines grouped into sessions, with blank lines separating the sessions. With a single map configured, the program's result is correct; with multiple maps, the result is wrong.

[problem description] Hadoop splits the input file with the line as the smallest unit. So when the input uses blank lines to separate sessions and therefore has a larger-grained data format, Hadoop cannot guarantee that a session will not be cut across two map tasks, which splits one session into several.

[test method] When the program's logic depends on a unit whose granularity is larger than a line, set the map split size to be larger than a single input file; otherwise the input file may be split across multiple map inputs and the larger input units will be cut apart.
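A sketch of forcing each input file into a single map by raising the minimum split size above the largest file; the old-style property name mapred.min.split.size is assumed (newer releases use mapreduce.input.fileinputformat.split.minsize), and the 10 GB value is only an illustration:

# 10737418240 bytes (~10 GB) is assumed to be larger than any single input file
hadoop jar ${HADOOP_HOME}/contrib/streaming/hadoop-streaming-*.jar \
  -D mapred.min.split.size=10737418240 \
  -input /app/test/sessions -output /app/test/output \
  -mapper ./session_mapper -reducer ./session_reducer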

II. Common test methods

1. Copying large amounts of data across or within clusters

During testing you may need to copy a large amount of test data from another cluster or directory. Downloading the data locally and then uploading it to the destination cluster would be very time-consuming; in such cases, consider the distcp command.

DistCp (distributed copy) is a tool for copying data within and between large clusters. It uses Map/Reduce for file distribution, error handling and recovery, and report generation. It takes a list of files and directories as input to the map tasks, and each task copies some of the files in the source list.

The command hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo expands all the files and directory names under /foo/bar on the nn1 cluster and stores them in a temporary file. Copying of these files is then divided among multiple map tasks, and each TaskTracker performs its share of the copy from nn1 to nn2. Note that DistCp works with absolute paths.

Because distcp cannot be given two separate usernames and passwords, the username and password for the source and destination clusters must be the same, and that account must have read access on the source cluster and write access on the destination cluster.

2. Simulating a distributed run on a single machine to test function points

When testing function points, or performance with no more than 800 MB of memory, you can consider simulating the distributed run on a single machine:

cat input | ./mapper | sort | ./reducer > output

When simulating a distributed test on a single machine, note the following points:

1) The input of Streaming is text divided into lines, so cat input works directly, but BiStreaming uses a binary <keyLength, key, valueLength, value> format, so the input has to be converted before it is fed in. A common approach is:

cat input | ./reader | ./mapper | ./reducer > output

The reader program converts the file into the binary keyLength, key, valueLength, value format that the mapper program can recognize. The reader is not needed when the input is already in sequencefile format.

2) When Hadoop environment variables are used in the mapper or reducer, you need to modify these environment variables, or set their values at run time, when simulating on a single machine.
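A sketch of point 2); the variable names and values below are only illustrative, assuming the mapper reads streaming-style variables such as map_input_file and mapred_task_partition:

export map_input_file=/app/test/input/part-00000
export mapred_task_partition=0
cat input | ./mapper | sort | ./reducer > output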

3. Comparing the results of the distributed program and a stand-alone program

A common way to verify the results of a distributed program is to implement a stand-alone version of the program and then diff the stand-alone and distributed results.

Because Hadoop splits the input file and buckets the map output into reduces, the order of input lines seen by the distributed version differs from the stand-alone version, so the impact of line ordering on the results must be considered. Provided that the ordering of input lines does not affect the correctness of the results, always sort both the distributed result and the local stand-alone result before diffing them.
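A minimal sketch of the comparison; distributed_output and local_output are hypothetical file names:

sort distributed_output > distributed.sorted
sort local_output > local.sorted
diff distributed.sorted local.sorted && echo "results match"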

4. Testing the master control script

Although distributed programs have map-reduce programs as their main body, each round of map-reduce tasks is submitted and started by a script on a single machine, and most projects contain several rounds of map-reduce tasks. The scheduling and coordination between rounds, and the system operations of the project, are therefore handled by a master control script.

When testing the master control script, run it with the -x option and redirect the run log to an output file. After the run finishes, even if the results are correct, read through the script's run log; you are likely to find some unexpected problems.
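A sketch of this, where control.sh is a hypothetical name for the master script:

sh -x ./control.sh > control_run.log 2>&1
less control_run.log   # read the trace even when the final results look correct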

This concludes the study of "what are the Hadoop testing methods". I hope it has resolved your doubts. Combining theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow this site; the editor will keep working to bring you more practical articles!
