Sharing a Hadoop development idea

2025-04-04 Update From: SLTechnology News&Howtos shulou

A problem that had plagued me for a week has finally been solved. Thanks to everyone who helped me; I am recording the solution here to share it with you.

Simplified description of the problem:

HDFS holds a file in which each record has this format: user ID, topic ID, score (the user's preference score for that topic).

Given several topics T1, T2, T3, ..., TN, we must select a specified number of user IDs for each topic: M1, M2, M3, ..., MX.

User IDs should be selected by preference score as far as possible, and no user ID may appear under more than one topic.

In addition, the number of distinct user IDs on HDFS is greater than or equal to M1+M2+M3+...+MX.

Analysis of ideas:

The problem has three key points: first, no duplicates; second, meeting the required counts; third, the preference score.

How do we ensure that users are not repeated across topics?

In effect, each user may belong to only one topic. We could simply sort all of a user's preference scores and pick that user's most preferred topic.

This avoids duplicates, but it raises a new question: if topic T1 requires 1,000,000 users and only 900,000 user IDs have T1 as their top preference, how do we make up the shortfall? And how do we automate that in a program? The more you think about it, the more complicated it becomes!

The problem resembles the ranked application choices in the college entrance examination. Each applicant lists several colleges in order of preference but is ultimately admitted by only one. How is that done, and can we borrow the idea?

First, from the HDFS data, we write a MapReduce job that produces a user-preference file on HDFS with the following content:

User ID topic ID-A:score1; topic ID-B:score2;...

Put simply, for each user ID we want the list of topics sorted by score in descending order. The line above says:

The user's first choice is topic A, and the second choice is topic B.
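The grouping and sorting described above can be sketched in plain Python, outside Hadoop. This is only an illustration of the logic a reducer would apply per user; the function name `build_preference_lines`, the tab separator between the user ID and the topic list, and the sample records are assumptions, not part of the original job.

```python
from collections import defaultdict


def build_preference_lines(records):
    """Group (user_id, topic_id, score) records by user and emit one line
    per user, with that user's topics sorted by score, highest first."""
    by_user = defaultdict(list)
    for user_id, topic_id, score in records:
        by_user[user_id].append((topic_id, score))

    lines = []
    for user_id in sorted(by_user):
        # Highest score first: the first entry is the user's first choice.
        ranked = sorted(by_user[user_id], key=lambda ts: ts[1], reverse=True)
        prefs = ";".join(f"{topic}:{score}" for topic, score in ranked)
        # Assumed layout: user ID, a tab, then the ranked topic list.
        lines.append(f"{user_id}\t{prefs}")
    return lines


# Hypothetical sample records: (user ID, topic ID, score).
records = [
    ("u1", "A", 0.9), ("u1", "B", 0.4),
    ("u2", "B", 0.8), ("u2", "A", 0.7),
]
for line in build_preference_lines(records):
    print(line)
```

In a real job the map phase would emit the user ID as the key, so that all of a user's records reach the same reducer for sorting.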

-

Next, build a list of the topics and the number of users each one requires:

Topic A count-A

Topic B count-B

Topic C count-C

...

Sort the list in ascending order by the number of user IDs each topic requires.

-

Next, let's look at how to select users for a topic:

We first choose topic A, which requires the fewest users, and run a MapReduce job over the user-preference file to select its users.

That completes the user selection for topic A.
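The article does not spell out the selection rule for this job, so the sketch below is one possible reading of the college-admission analogy, not the author's implementation: users who rank the topic earlier in their preference list are taken first, and ties are broken by the higher score. The function name `select_users`, the tab-separated line layout, and the sample lines are all assumptions.

```python
def select_users(pref_lines, topic_id, count):
    """Pick up to `count` users for `topic_id` from user-preference lines
    of the assumed form 'user<TAB>topicA:score1;topicB:score2'."""
    candidates = []
    for line in pref_lines:
        user_id, prefs = line.split("\t")
        for rank, item in enumerate(prefs.split(";")):
            topic, score = item.split(":")
            if topic == topic_id:
                # Assumed rule: earlier choice wins, then higher score.
                candidates.append((rank, -float(score), user_id))
    candidates.sort()
    return [user_id for _, _, user_id in candidates[:count]]


# Hypothetical user-preference lines.
lines = [
    "u1\tA:0.9;B:0.4",
    "u2\tB:0.8;A:0.7",
    "u3\tA:0.6;B:0.5",
]
print(select_users(lines, "A", 2))  # ['u1', 'u3']
```

Under this rule, users whose first choice is the topic fill the quota first, and users who listed it as a later choice make up any shortfall.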

Next, we take topic B from the list, which requires slightly more users than A. How does it select its users?

Quite simply, we only need to pass the HDFS output path of topic A's job to the map phase (the files under that path contain topic A's users) and use it to filter out those users when users are selected in the reduce phase.

So, how does topic C get its users?

Similarly, we only need to pass the map phase the users already taken by both topic A and topic B for filtering; the rest of the processing stays the same!
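The filtering step can be sketched as follows, again in plain Python rather than as a real MapReduce job. The set `taken` stands in for the user IDs that would be loaded from the output paths of earlier topics. The function name `select_with_filter`, the ranking rule (earlier choice first, then higher score), and the sample data are assumptions made for illustration.

```python
def select_with_filter(prefs, topic_id, count, taken):
    """Select up to `count` users for `topic_id`, skipping users in `taken`.

    `prefs` maps user_id -> list of (topic_id, score), best choice first.
    """
    candidates = []
    for user_id, ranked in prefs.items():
        if user_id in taken:
            continue  # already assigned to an earlier topic
        for rank, (topic, score) in enumerate(ranked):
            if topic == topic_id:
                # Assumed rule: earlier choice wins, then higher score.
                candidates.append((rank, -score, user_id))
    candidates.sort()
    return [user_id for _, _, user_id in candidates[:count]]


# Hypothetical preferences, already sorted by score for each user.
prefs = {
    "u1": [("A", 0.9), ("B", 0.4)],
    "u2": [("B", 0.8), ("A", 0.7)],
    "u3": [("A", 0.6), ("B", 0.5)],
    "u4": [("B", 0.3), ("A", 0.2)],
}

taken = set()
assignment = {}
# Topics are processed in ascending order of required count.
for topic_id, count in [("A", 1), ("B", 2)]:
    chosen = select_with_filter(prefs, topic_id, count, taken)
    assignment[topic_id] = chosen
    taken.update(chosen)

print(assignment)  # {'A': ['u1'], 'B': ['u2', 'u4']}
```

Because each topic only sees users who are not yet in `taken`, no user ID can end up under two topics.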

In fact, the MapReduce job above can be written as a general-purpose program, invoked with a command like:

hadoop jar XXX.jar <topic ID> <number of users> [input1,input2,input3,...] <user-preference HDFS path> <output path>

Here input1,input2,input3,... is optional and is used for user filtering.

Finally, we can write a shell script that calls the command above repeatedly according to the contents of the list, passing different parameters each time, which automates the whole process.
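The driver loop can be sketched like this. The article proposes a shell script; Python is used here only to make the logic easy to check. The sketch builds the argument lists without running them. The argument order (topic ID, user count, optional comma-separated filter paths, preference path, output path) is an assumed reading of the command above, and the function name `build_commands` and all paths are hypothetical.

```python
def build_commands(required, prefs_path, output_root, jar="XXX.jar"):
    """Build one `hadoop jar` argument list per topic, smallest quota first.

    `required` maps topic_id -> number of users needed. Each command
    receives the output paths of all earlier topics as filter inputs.
    """
    commands = []
    previous_outputs = []
    for topic_id, count in sorted(required.items(), key=lambda kv: kv[1]):
        output_path = f"{output_root}/{topic_id}"
        cmd = ["hadoop", "jar", jar, topic_id, str(count)]
        if previous_outputs:
            # Optional filter argument: users already taken by earlier topics.
            cmd.append(",".join(previous_outputs))
        cmd += [prefs_path, output_path]
        commands.append(cmd)
        previous_outputs.append(output_path)
    return commands


# Hypothetical quotas and paths.
commands = build_commands({"B": 200, "A": 100}, "/data/prefs", "/data/out")
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list could then be run in order, for example with `subprocess.run(cmd, check=True)`, so that a failed job stops the chain before later topics read incomplete filter data.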
