
How to Implement an Advanced Multi-threaded Web Crawling Project in Python

2025-02-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

This article mainly introduces how to implement an advanced multi-threaded web crawling project in Python. Many readers have doubts about this topic in their daily work, so the editor has consulted various materials and put together a simple, easy-to-follow approach, hoping it will help resolve those doubts. Now, please follow the editor and study along!

Contents

I. Web page analysis

II. Code implementation

I. Web page analysis

The website we chose to crawl this time is the Python page of Shuimu Community.

Web page: https://www.mysmth.net/nForum/#!board/Python?p=1

As usual, our first step is to analyze the page structure and the request that is made when turning pages.

By comparing the links of the first three pages, we can see that the last parameter in each page's link is the page number; modifying it lets us fetch the data of any other page.

Looking further, to get the links to the posts we need to locate the element whose id is body, drill down layer by layer to the table inside it, and then traverse its rows to collect each post's title link.

Let's click on a post: https://www.mysmth.net/nForum/#!article/Python/162717

As before, let's first analyze how the title and content are structured in the page.

It is not hard to see that for the topic we only need to find the element whose id is main, then the element whose class is b-head corner, and take the second span under it.

Subject part

For the content, just find each div whose class is a-wrap corner and then the a-content element inside it.

Content part
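Before writing the full crawler, it can help to verify these selectors on a single post. Below is a minimal sketch, assuming the ?ajax request style used later in this article and the class names identified above (b-head corner, a-wrap corner, a-content); the exact URL path and tag types are assumptions, and the site may change its markup at any time.

import requests
from bs4 import BeautifulSoup

# Quick selector check on one post (hypothetical URL path; the ?ajax suffix
# matches the request style used by the crawler later in this article).
resp = requests.get('http://www.mysmth.net/nForum/article/Python/162717?ajax')
soup = BeautifulSoup(resp.text, 'html.parser')

# Topic: the second span under the element with class 'b-head corner' inside id 'main'
main = soup.find(id='main')
spans = main.find(class_='b-head corner').find_all('span')
print('topic:', spans[1].text)

# Content: each div with class 'a-wrap corner', then the 'a-content' element inside it
for wrap in soup.find_all('div', class_='a-wrap corner'):
    body = wrap.find(class_='a-content')
    print(body.text.strip()[:60])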

After analyzing the structure of the web page, we can start writing code!

II. Code implementation

The first thing to do is decide what to save: this time we save all the post titles from the first 10 pages of the Shuimu Community Python board, together with all the replies on the first page of each post.

Parse the list page to get links to all posts

from bs4 import BeautifulSoup

# Parse the contents of a list page and return the content links on that page
def parse_list_page(text):
    soup = BeautifulSoup(text, 'html.parser')
    # The line below is equivalent to soup.find('table', class_='board-list tiz').find('tbody')
    tbody = soup.find('table', class_='board-list tiz').tbody
    urls = []
    for tr in tbody:
        td = tr.find('td', class_='title_9')
        urls.append(td.a.attrs['href'])
    return urls

Parse the content page to get the title and all the posts on this page

# Parse a content page and return the title and all posts on the page
def parse_content_page(text):
    soup = BeautifulSoup(text, 'html.parser')
    title = soup.find('span', class_='n-left').text.strip('article topic:').strip()
    content_div = soup.find('div', class_='b-content corner')
    contents = []
    for awrap in content_div.find_all('div', class_='a-wrap corner'):
        content = awrap.p.text
        contents.append(content)
    return title, contents

Convert the link on the list page into the link we want to grab

def convert_content_url(path):
    URL_PREFIX = 'http://www.mysmth.net'
    path += '?ajax'
    return URL_PREFIX + path

Generate the links for the first 10 list pages

list_urls = []
for i in range(1, 11):
    url = 'http://www.mysmth.net/nForum/board/Python?ajax&p='
    url += str(i)
    list_urls.append(url)

Next we collect the links to all the posts on the first 10 list pages. This is where we bring in the thread pool.

import requests
from concurrent import futures

session = requests.Session()
executor = futures.ThreadPoolExecutor(max_workers=5)

# This function fetches a list page, parses the post links,
# and converts them into the real content links
def get_content_urls(list_url):
    res = session.get(list_url)
    content_urls = parse_list_page(res.text)
    real_content_urls = []
    for url in content_urls:
        url = convert_content_url(url)
        real_content_urls.append(url)
    return real_content_urls

# Submit one task for each of the ten list page links just generated
fs = []
for list_url in list_urls:
    fs.append(executor.submit(get_content_urls, list_url))
futures.wait(fs)
content_urls = set()
for f in fs:
    for url in f.result():
        content_urls.add(url)

Note that we used set() (the content_urls = set() line near the end) to remove duplicate values. It creates a set, a built-in Python data type that can hold multiple elements but never contains duplicates. Let's look at how set() is used.

numbers = [1, 1, 2, 2, 2, 3, 4]
unique = set(numbers)
print(type(unique))  # output: <class 'set'>
print(unique)        # output: {1, 2, 3, 4}

We can see that set() converts the list numbers into the set {1, 2, 3, 4}, with the repeated elements removed.

Taking advantage of this property, we first create an empty set with content_urls = set(); then, as we add links to it, duplicates are discarded automatically.

After collecting all the post links, we parse each content page and store the results in a dictionary.

# Fetch a content page and parse the title and posts
def get_content(url):
    r = session.get(url)
    title, contents = parse_content_page(r.text)
    return title, contents

# Submit the tasks that parse the content pages
fs = []
for url in content_urls:
    fs.append(executor.submit(get_content, url))
futures.wait(fs)
results = {}
for f in fs:
    title, contents = f.result()
    results[title] = contents
print(results.keys())

With that, the multi-threaded Shuimu Community crawler is complete. Print results.keys() to see the titles of all the posts.
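If you want to keep the data after the script exits, here is a minimal sketch that dumps the results dictionary built above to a JSON file (the file name is arbitrary):

import json

# Persist the {title: [replies]} dictionary built above
with open('smth_python_posts.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=2)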

This run crawled all the topics on the first ten list pages, together with the first page of replies for each topic: 10 list pages, 300 topic pages, and 1,533 parsed replies in total. On a machine with a good network connection and average performance, the test run took only about 13 seconds.
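The timing will of course vary with network speed and the forum's response time. To measure your own run, a minimal sketch is to wrap the crawling steps in time.perf_counter():

import time

start = time.perf_counter()
# ... submit the list-page and content-page tasks and wait for them, as above ...
elapsed = time.perf_counter() - start
print(f'crawl finished in {elapsed:.1f} seconds')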

The complete code is as follows

import requests
from concurrent import futures
from bs4 import BeautifulSoup

# Parse the contents of a list page and return the content links
def parse_list_page(text):
    soup = BeautifulSoup(text, 'html.parser')
    tbody = soup.find('table', class_='board-list tiz').tbody
    urls = []
    for tr in tbody:
        td = tr.find('td', class_='title_9')
        urls.append(td.a.attrs['href'])
    return urls

# Parse a content page and return the title and all posts
def parse_content_page(text):
    soup = BeautifulSoup(text, 'html.parser')
    title = soup.find('span', class_='n-left').text.strip('article topic:').strip()
    content_div = soup.find('div', class_='b-content corner')
    contents = []
    for awrap in content_div.find_all('div', class_='a-wrap corner'):
        content = awrap.p.text
        contents.append(content)
    return title, contents

# Convert a link from the list page into the link we actually want to fetch
def convert_content_url(path):
    URL_PREFIX = 'http://www.mysmth.net'
    path += '?ajax'
    return URL_PREFIX + path

# Generate the links for the first ten list pages
list_urls = []
for i in range(1, 11):
    url = 'http://www.mysmth.net/nForum/board/Python?ajax&p='
    url += str(i)
    list_urls.append(url)

session = requests.Session()
executor = futures.ThreadPoolExecutor(max_workers=5)

# Fetch a list page, parse the post links, and convert them into real content links
def get_content_urls(list_url):
    res = session.get(list_url)
    content_urls = parse_list_page(res.text)
    real_content_urls = []
    for url in content_urls:
        url = convert_content_url(url)
        real_content_urls.append(url)
    return real_content_urls

# Submit one task per list page link generated above
fs = []
for list_url in list_urls:
    fs.append(executor.submit(get_content_urls, list_url))
futures.wait(fs)
content_urls = set()
for f in fs:
    for url in f.result():
        content_urls.add(url)

# Fetch a content page and parse the title and posts
def get_content(url):
    r = session.get(url)
    title, contents = parse_content_page(r.text)
    return title, contents

# Submit the tasks that parse the content pages
fs = []
for url in content_urls:
    fs.append(executor.submit(get_content, url))
futures.wait(fs)
results = {}
for f in fs:
    title, contents = f.result()
    results[title] = contents
print(results.keys())

At this point, the study of "how to implement an advanced multi-threaded web crawling project in Python" is over. I hope it has helped resolve everyone's doubts. Pairing theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow the website; the editor will keep working hard to bring you more practical articles!
