This article explains how Python can use multithreading together with a queue to crawl the site rankings published on zhongjie.com. The technique is demonstrated through a practical case; read through it carefully and you should be able to reproduce the result yourself.
Target site analysis
The target site for this crawl is zhongjie.com (the "intermediary network"), which publishes data such as overall website rankings, Internet site rankings, and Chinese site rankings.
At the time of writing, the site displayed 58,341 sample records.
Collection starts from the listing page at https://www.zhongjie.com/top/rank_all_1.html
Because the page contains a [last page] hyperlink, the total number of pages can be read directly from that link's href.
The rest of the pages follow simple paging rules:
https://www.zhongjie.com/top/rank_all_1.html
https://www.zhongjie.com/top/rank_all_2.html
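For illustration, once the total page count is known, the full URL list can be generated from this pattern. A minimal sketch; total_pages stands in for the count that the article later parses with get_total_page():

total_pages = 10  # assumption: in practice this comes from get_total_page()
page_urls = [
    f"https://www.zhongjie.com/top/rank_all_{p}.html"
    for p in range(1, total_pages + 1)
]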
Based on this, the plan for this Python crawler is as follows: page requests use the requests library, page parsing uses lxml, multithreading uses the threading module, and the work queue uses the queue module.
Coding time
Before writing any code, sort out the logic. The crawler works in these steps:
1. Request the first page up front and parse the total page count from it.
2. Producer threads keep fetching listing pages, extracting the domain detail-page addresses and adding them to a queue.
3. Consumer threads take detail-page addresses from the queue and parse out the target data.
The code that determines the total page count is straightforward:
def get_total_page():
    # get_headers() is defined in the complete listing below
    res = requests.get('https://www.zhongjie.com/top/rank_all_1.html',
                       headers=get_headers(), timeout=5)
    element = etree.HTML(res.text)
    last_page = element.xpath("//a[@class='weiye']/@href")[0]
    pattern = re.compile(r'(\d+)')
    page = pattern.search(last_page)
    return int(page.group(1))
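The helper assumes the [last page] link (class weiye) is always present. A slightly defensive variant, a sketch of mine rather than part of the original code, falls back to a single page when the link or its digits are missing:

def get_total_page_safe(default=1):
    # returns `default` instead of raising when the 'weiye' link is absent
    res = requests.get('https://www.zhongjie.com/top/rank_all_1.html',
                       headers=get_headers(), timeout=5)
    hrefs = etree.HTML(res.text).xpath("//a[@class='weiye']/@href")
    if not hrefs:
        return default
    match = re.search(r'(\d+)', hrefs[0])
    return int(match.group(1)) if match else default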
Once the total page count is available, the multithreaded part can be written. Note that this example does not implement the storage step; that part is left for you to complete (a minimal sketch is given below).
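As one possibility, here is a minimal storage sketch that appends each record to a CSV file. It is not part of the original article; the save_row name, the csv_lock, and the rank_data.csv path are all assumptions:

import csv
import threading

csv_lock = threading.Lock()  # serialize writes across consumer threads

def save_row(title, link, description, path="rank_data.csv"):
    # append one record; a consumer could call this instead of print()
    with csv_lock:
        with open(path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([title, link, description])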
The complete code is as follows:
from queue import Queue
import threading
import requests
from lxml import etree
import random
import re


def get_headers():
    uas = [
        "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
    ]
    ua = random.choice(uas)
    headers = {"user-agent": ua}
    return headers


def get_total_page():
    res = requests.get('https://www.zhongjie.com/top/rank_all_1.html',
                       headers=get_headers(), timeout=5)
    element = etree.HTML(res.text)
    last_page = element.xpath("//a[@class='weiye']/@href")[0]
    pattern = re.compile(r'(\d+)')
    page = pattern.search(last_page)
    return int(page.group(1))


# producer: fetch listing pages and queue the detail-page addresses
def producer():
    while True:
        # take one listing-page URL
        url = urls.get()
        urls.task_done()
        if url is None:
            break
        res = requests.get(url=url, headers=get_headers(), timeout=5)
        element = etree.HTML(res.text)
        links = element.xpath('//a[@class="copyright_title"]/@href')
        for link in links:
            wait_list_urls.put("https://www.zhongjie.com" + link)


# consumer: fetch detail pages and extract the target fields
def consumer():
    while True:
        url = wait_list_urls.get()
        wait_list_urls.task_done()
        if url is None:
            break
        res = requests.get(url=url, headers=get_headers(), timeout=5)
        element = etree.HTML(res.text)
        # data extraction; write more xpath expressions for extra fields
        title = element.xpath('//div[@class="info-head-l"]/h2/text()')
        link = element.xpath('//div[@class="info-head-l"]/p[1]/a/text()')
        description = element.xpath('//div[@class="info-head-l"]/p[2]/text()')
        print(title, link, description)


if __name__ == "__main__":
    # initialize the listing-page queue
    urls = Queue(maxsize=0)
    last_page = get_total_page()
    for p in range(1, last_page + 1):
        urls.put(f"https://www.zhongjie.com/top/rank_all_{p}.html")
    wait_list_urls = Queue(maxsize=0)
    # start 2 producer threads
    for p_in in range(1, 3):
        p = threading.Thread(target=producer)
        p.start()
    # start 2 consumer threads
    for p_in in range(1, 3):
        p = threading.Thread(target=consumer)
        p.start()
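One caveat: the worker loops check for a None sentinel, but the main block never enqueues one, so the threads block forever once the queues drain. A possible ending, a sketch that assumes the started Thread objects are collected into two lists named producers and consumers, is:

    # assumption: producers / consumers hold the Thread objects started above
    for _ in producers:
        urls.put(None)            # one stop sentinel per producer
    for t in producers:
        t.join()                  # after this, every detail URL is queued
    for _ in consumers:
        wait_list_urls.put(None)  # one stop sentinel per consumer
    for t in consumers:
        t.join()

Because queue.Queue.get() blocks indefinitely, some shutdown step like this is needed for the program to exit on its own.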
This concludes the introduction to "how Python uses multithreading + queue technology to crawl the zhongjie.com site rankings". Thank you for reading; follow the site for more practical articles.