2025-02-24 Update From: SLTechnology News&Howtos shulou
Shulou(Shulou.com)05/31 Report--
This article introduces how to use MapReduce to get the top N records. Many people have questions about solving the top-N problem with MapReduce, so a simple, practical approach is laid out below; I hope it helps answer those questions.
When first learning MapReduce, a common solution to the top-N problem is to put the (sorted) MapReduce output into an in-memory collection and take the first N elements. This is too naive: the collection is limited by available memory, and once the data volume grows, an out-of-memory error is likely.
Presented here is another way to implement it. It is not necessarily the best way, but it avoids that problem.
Requirement: find the N largest records (top N).
Only the core code is given here; the rest of the job configuration is described briefly.
Configuration conf = new Configuration();
conf.setInt("N", 5);
The call conf.setInt("N", 5) must be made before the job is initialized; its purpose is to let the map phase read N, the size of the top-N result.
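The article does not show the surrounding job setup, so here is a minimal driver sketch. The class names TopNDriver and TopNReducer, the job name, and the use of command-line arguments for the paths are assumptions for illustration; only the conf.setInt("N", 5) line comes from the article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopNDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("N", 5);                 // must be set before the Job is created

        Job job = Job.getInstance(conf, "top n");
        job.setJarByClass(TopNDriver.class);
        job.setMapperClass(TopNMapper.class);
        job.setReducerClass(TopNReducer.class);  // hypothetical reducer class
        job.setNumReduceTasks(1);            // one reducer sees every map's top N
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```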
The mapper is as follows:
package com.lzz.one;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * topN: find the N largest payment values.
 * Input format: orderid,userid,payment,productid
 */
public class TopNMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    int len;
    int[] top;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        len = context.getConfiguration().getInt("N", 10);
        top = new int[len + 1];
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] arr = line.split(",");
        if (arr != null && arr.length == 4) {
            int pay = Integer.parseInt(arr[2]);
            add(pay);
        }
    }

    // Overwrite the smallest slot, then re-sort so that top[1..len]
    // always holds the len largest values seen so far.
    public void add(int pay) {
        top[0] = pay;
        Arrays.sort(top);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        for (int i = len; i > 0; i--) {
            context.write(new IntWritable(len - i + 1), new IntWritable(top[i]));
        }
    }
}
A diagram would make the logic clearer, but time and drawing skill are limited, so a verbal description will have to do; hopefully it is clear enough.
To take the top 5, define an array of length 6. For each input record, map writes the field to be ranked into the first element of the array and calls Arrays.sort(top) to sort the array in ascending numerical order. For example, if the first record is 9000, the sorted result is [0, 0, 0, 0, 0, 9000]; if the second record is 8000, the sorted result is [0, 0, 0, 0, 8000, 9000]; if the third is 8500, the result is [0, 0, 0, 8000, 8500, 9000], and so on. Each time a number larger than the smallest element is written into the array, sorting effectively discards that smallest element, so the array always holds the largest records seen so far.
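The overwrite-and-sort trick described above can be tried outside Hadoop. A minimal sketch in plain Java (the class name TopNArray and the sample payments are made up for illustration):

```java
import java.util.Arrays;

public class TopNArray {
    // Array of length N+1: slot 0 is the scratch slot, slots 1..N hold the top N.
    static int[] top = new int[6];

    static void add(int pay) {
        top[0] = pay;          // overwrite the current smallest value
        Arrays.sort(top);      // re-sort; the new value sinks or rises into place
    }

    public static void main(String[] args) {
        int[] payments = {9000, 8000, 8500, 100, 2000, 1234, 910, 333};
        for (int p : payments) {
            add(p);
        }
        // Slots 1..5 now hold the five largest payments in ascending order.
        System.out.println(Arrays.toString(Arrays.copyOfRange(top, 1, 6)));
        // prints [1234, 2000, 8000, 8500, 9000]
    }
}
```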
In cleanup, map emits the array in order. Reduce receives the five sorted elements from each map and ranks them the same way map does. After sorting, the array runs from smallest to largest, so emitting its elements in reverse order gives the final result.
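The article does not show the reducer, but under this scheme the reduce side can reuse the same overwrite-and-sort array on the values it receives from every map. A plain-Java simulation of that merge (the class name TopNMerge and the two per-map outputs are invented sample data, not from the article):

```java
import java.util.Arrays;

public class TopNMerge {
    // Merge any number of per-map top lists into one global top-len array,
    // using the same overwrite-and-sort trick as the mapper.
    static int[] merge(int len, int[]... mapOutputs) {
        int[] top = new int[len + 1];
        for (int[] one : mapOutputs) {
            for (int v : one) {
                top[0] = v;        // overwrite the smallest slot
                Arrays.sort(top);  // keep top[1..len] as the current largest values
            }
        }
        return top;
    }

    public static void main(String[] args) {
        // Simulated cleanup() output of two map tasks, each already
        // holding its own top 5 in ascending order.
        int[] mapA = {910, 1234, 2000, 8500, 9000};
        int[] mapB = {333, 777, 1500, 3000, 7000};

        int len = 5;
        int[] top = merge(len, mapA, mapB);
        // Emit in reverse order: rank 1 is the largest value overall.
        for (int i = len; i > 0; i--) {
            System.out.println((len - i + 1) + "\t" + top[i]);
        }
    }
}
```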
Compared with the earlier approach, where map did very little and reduce had to rank all the data to pick the first five, the pressure on reduce was enormous: it processed everything, and since the number of reducers is usually small, a large data set would overwhelm it. The present approach cleverly shifts the work to map, which runs distributed across many servers, spreading the load instead of burdening one machine. Each map outputs only five elements; with five map tasks, reduce only operates on five times five values, so there is no risk of memory overflow.
This concludes the study of how to use MapReduce to get the top N records; I hope it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it!