How to implement WordCount in SogouQ

2025-03-26 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article shows how to implement WordCount over the SogouQ search logs with Spark. It covers the log format and its pitfalls, a plain WordCount, and two ways to sort the result by count in descending order.

PS1: the raw log files are encoded in GB2312. Be sure to convert them to UTF-8.
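The conversion in PS1 can be sketched in a few lines of Scala. This is only an illustration, not the author's procedure: the object name `EncodingSketch` and the function `gb2312ToUtf8` are hypothetical, and the sketch assumes the JVM in use ships the GB2312 charset (full JDK builds normally do).

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper, named for illustration only.
object EncodingSketch {
  // Decode the raw bytes as GB2312, then re-encode the text as UTF-8.
  // Assumes the running JVM provides the "GB2312" charset.
  def gb2312ToUtf8(raw: Array[Byte]): Array[Byte] = {
    val text = new String(raw, Charset.forName("GB2312"))
    text.getBytes(StandardCharsets.UTF_8)
  }
}
```

For whole log files the same decode-then-encode step would be applied while streaming, rather than loading every byte into memory at once.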

PS2: log format and field description:

access time\tuser ID\t[query word]\trank of the URL in the returned results\tsequence number of the user's click\tURL clicked by the user

This format has a pitfall.

Pitfall:

The separator between the fields "rank of the URL in the returned results" and "sequence number of the user's click" is not a tab (\t) but a space. Splitting a well-formed line on \t therefore yields 5 fields, not 6.
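To make the pitfall concrete, here is a minimal parsing sketch in plain Scala (no Spark needed). The `Record` case class, its field names, and the `parse` function are assumptions made for illustration; the source only describes the field order and the separators.

```scala
import scala.util.Try

// Hypothetical names; only the field order and separators come from the log description.
object SogouLineSketch {
  final case class Record(
      time: String,
      userId: String,
      query: String,
      rank: Int,        // rank of the URL in the returned results
      clickOrder: Int,  // sequence number of the user's click
      url: String
  )

  // Split on tabs first (5 fields expected), then split the 4th field on the space.
  def parse(line: String): Option[Record] =
    line.split('\t') match {
      case Array(time, userId, query, rankAndOrder, url) =>
        rankAndOrder.trim.split(' ') match {
          case Array(rank, order) =>
            Try(Record(time, userId, query, rank.toInt, order.toInt, url)).toOption
          case _ => None
        }
      case _ => None // wrong field count, e.g. a tab inside the query word
    }
}
```

Returning `Option` keeps malformed lines from throwing, so they can simply be dropped downstream.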

val sogouQRdd = sc.textFile("hdfs://node1:9000/sogouQ/input")
sogouQRdd.cache() // the log data is cached in memory when the next action runs

Implement a plain WordCount. Unlike MapReduce, the results are not sorted by key (word).

sogouQRdd.filter(_.split('\t').length == 5): two records have a query word that itself contains a tab (\t) (why exactly two, don't ask me how I know). Be sure to filter them out.

val wcWithoutSortRdd = sogouQRdd
  .filter(_.split('\t').length == 5) // keep only well-formed lines
  .map(_.split('\t')(2))             // extract the query word
  .map((_, 1))
  .reduceByKey(_ + _)                // sum the counts per query word
wcWithoutSortRdd.saveAsTextFile("hdfs://node1:9000/sogouQ/output/wc1")
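The same filter, extract, and count steps can be tried on a local collection without a cluster. The sketch below is an analogue using plain Scala collections, not Spark code: `groupBy` plus a size count stands in for `reduceByKey(_ + _)`, and the object and function names are hypothetical.

```scala
// Local analogue of the WordCount pipeline; names are illustrative only.
object LocalWordCountSketch {
  def countQueries(lines: Seq[String]): Map[String, Int] =
    lines
      .map(_.split('\t'))
      .filter(_.length == 5)             // drop lines with a tab inside the query word
      .map(fields => fields(2))          // third field is the query word
      .groupBy(identity)                 // stands in for reduceByKey(_ + _)
      .map { case (query, occurrences) => (query, occurrences.size) }
}
```

As with the RDD version, the resulting map has no particular key order.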

Top 10 output of wcWithoutSortRdd

([Zhongtian ZT1818 Review + site:www.pcpop.com | product.pcpop.com | channel.pcpop.com | pop.pcpop.com], 1)
([Sany heavy Industry + Road Construction Machinery], 1)
([fastest video website], 1)
([zhutan], 3)
([Fukang], 3)
([Battle of Shijiazhuang], 2)
([Foreign Women's Prison], 1)
([A42B331 Parameter], 1)
([78bar], 1)
([Linyi McCos], 2)

Implement WordCount sorted (descending) by Value (count)

Idea 1: starting from wcWithoutSortRdd, swap K (word) and V (count), sort by the new key (count) in descending order, then swap back.

val wcSortByCountRdd = wcWithoutSortRdd
  .map(x => (x._2, x._1)) // (word, count) -> (count, word)
  .sortByKey(false)       // false: descending order
  .map(x => (x._2, x._1)) // (count, word) -> (word, count)
wcSortByCountRdd.saveAsTextFile("hdfs://node1:9000/sogouQ/output/wc2")

Idea 2: use the sortBy () operation directly

// _._2: the second element of the tuple, i.e. the count; false: sort in descending order
val wcSortByCountRdd = wcWithoutSortRdd.sortBy(_._2, false)
wcSortByCountRdd.saveAsTextFile("hdfs://node1:9000/sogouQ/output/wc2")
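Both ideas can be compared on a small local sequence. The sketch below uses plain Scala collections rather than RDDs, so it only mirrors the logic: `Ordering[Int].reverse` plays the role of the `false` (descending) flag, and the object and method names are hypothetical.

```scala
// Local analogue of the two sorting ideas; names are illustrative only.
object SortByCountSketch {
  // Idea 1: swap to (count, word), sort by the count descending, swap back.
  def viaSwap(counts: Seq[(String, Int)]): Seq[(String, Int)] =
    counts.map(_.swap).sortBy(_._1)(Ordering[Int].reverse).map(_.swap)

  // Idea 2: sort directly on the second tuple element, descending.
  def viaSortBy(counts: Seq[(String, Int)]): Seq[(String, Int)] =
    counts.sortBy(_._2)(Ordering[Int].reverse)
}
```

Calling `take(10)` on either result gives a local Top 10. When several words share the same count, their relative order is not something the two approaches are guaranteed to agree on in Spark.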

Top 10 output of wcSortByCountRdd

([looting relief materials], 66906)
([causes of Wenchuan earthquake], 58766)
([blocking Sharon Stone], 12649)
([self-report of a female prostitute], 9758)
([commander of Guangzhou military region], 8661)
([female prostitute Li Xiang], 8584)
([Chengdu Police Anti-pornography scene], 5371)
([Baidu], 4958) // searching for Baidu on Sogou looks like a dig at Baidu, heh heh
([map of Nepal], 4886)
([list of Lieutenant generals of the people's Liberation Army in active Service], 4721)

That covers how to implement WordCount on the SogouQ logs.
