
How to Crawl V2EX Posts with Python in the Pyspider Framework


How do you crawl V2EX posts with Python in the Pyspider framework? This article walks through the analysis and solution in detail, and hopefully helps anyone facing the same problem find a simpler way to do it.

Background:

PySpider: a powerful web crawler system with a rich WebUI, developed in China. It is written in Python, has a distributed architecture, and supports a variety of database backends; the WebUI provides a script editor, task monitor, project manager and results viewer.

Prerequisites:

You have installed Pyspider and MySQL-python (used to save the data; a sketch of the database setup follows the article links below).

If you haven't installed them yet, please read my previous articles first so you can avoid some detours:

Some pitfalls in learning Pyspider framework

HTTP 599: SSL certificate problem: unable to get local issuer certificate error

Some of the mistakes I encountered:
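Since the crawler below saves its results into a question table in a local MySQL database named wenda, it is worth creating them up front. The article does not show the schema, so the following is only a rough sketch of my own, with column names taken from the insert statement used later and column types guessed:

# Sketch only (not from the original article): create the assumed wenda database and question table.
# Column types are guesses inferred from the insert statement used by the crawler below.
import MySQLdb

db = MySQLdb.connect('localhost', 'root', 'root', charset='utf8')
cursor = db.cursor()
cursor.execute('CREATE DATABASE IF NOT EXISTS wenda DEFAULT CHARACTER SET utf8')
cursor.execute('USE wenda')
cursor.execute('''
    CREATE TABLE IF NOT EXISTS question (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        content TEXT,
        user_id INT,
        created_date DATETIME,
        comment_count INT DEFAULT 0
    ) DEFAULT CHARSET=utf8
''')
db.commit()
db.close()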

First of all, the goal of this crawler: use the Pyspider framework to crawl the questions (titles) and content of posts on the V2EX site, and then save the crawled data locally.

Most posts on V2EX can be viewed without logging in, although some can only be viewed after logging in. (I only discovered this later: while crawling I kept getting errors, and only after checking the cause did I realize those posts require a login.) So I don't think it is necessary to use a Cookie here. Of course, if you do have to log in, it's very simple: just add your login cookie, as sketched below.
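For reference, if you did need to log in, one possible approach (my own sketch, not part of the original tutorial) is to attach the cookie from your logged-in browser session to every request through Pyspider's crawl_config, which passes its entries to every self.crawl call. The header value below is only a placeholder:

class Handler(BaseHandler):
    # Sketch only: send a login cookie with every request.
    # Replace the placeholder with the Cookie header copied from your own logged-in browser session.
    crawl_config = {
        'headers': {
            'Cookie': 'your_cookie_name=your_cookie_value',
        },
    }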

Scanning https://www.v2ex.com/, we found that there is no single list containing all the posts, so we settle for second best: traverse all the posts by grabbing every tab list page under each category (for example https://www.v2ex.com/?tab=tech), then every node list page under it (for example https://www.v2ex.com/go/progr....); the detail address of each post looks like, for example, https://www.v2ex.com/t/314683...

Create a project

In the lower right corner of pyspider's dashboard, click the "Create" button
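Creating the project opens the script editor with a generated sample handler. It looks roughly like the following (a sketch of pyspider's default template reproduced from memory, so details may vary slightly between versions; __START_URL__ stands for the start URL you typed when creating the project):

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Project: V2EX

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('__START_URL__', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }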

Replace the URL in the self.crawl call of the on_start function:

@every(minutes=24 * 60)
def on_start(self):
    self.crawl('https://www.v2ex.com/', callback=self.index_page, validate_cert=False)

self.crawl tells pyspider to crawl the specified page and then parse the result with the callback function.

The @every decorator means that on_start is executed once a day, so that you can always catch the latest posts.

validate_cert=False is required here; otherwise the HTTP 599: SSL certificate problem: unable to get local issuer certificate error will be reported.

Home page:

Click the green run button to execute; you will see a red 1 on the follows tab. Switch to the follows panel and click the green play button:

The problem shown in the second screenshot appeared here at first; see the previous article for the solution, after which it no longer appears.

Tab list page:

On the tab list page, we need to extract the URLs of all the node (topic) list pages. As you may have noticed, the sample handler has already extracted a very large number of URLs.

Code:

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
        self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

Since the post list pages do not look the same as the tab list pages, a new callback, self.tab_page, is created here.

@config(age=10 * 24 * 60 * 60) indicates that we consider the page valid for 10 days, so it will not be crawled again during that period.

Go list page:

Code:

@config(age=10 * 24 * 60 * 60)
def tab_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
        self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

Post detail page (/t/):

You can see that the results contain some URLs with a #reply anchor, which we don't need, so we strip them off.

At the same time, we also need to implement automatic page turning (pagination).

Code:

@config(age=10 * 24 * 60 * 60)
def board_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
        url = each.attr.href
        if url.find('#reply') > 0:
            url = url[0:url.find('#')]  # strip the #reply anchor so each post is crawled only once
        self.crawl(url, callback=self.detail_page, validate_cert=False)
    for each in response.doc('a.page_normal').items():  # follow the pager links to realize automatic page turning
        self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

Screenshot of the run after removing the reply anchors:

Screenshot after implementing automatic page turning:

At this point we can match the URLs of all the posts.

Click the button next to each post to see its detail page.

Code:

@config(priority=2)
def detail_page(self, response):
    title = response.doc('h2').text()
    content = response.doc('div.topic_content').html().replace('"', '\\"')
    self.add_question(title, content)  # insert into the database
    return {
        "url": response.url,
        "title": title,
        "content": content,
    }

To insert into the database, we first need to define an add_question function (and connect to the database in __init__).

# Connect to the database
def __init__(self):
    self.db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')

def add_question(self, title, content):
    try:
        cursor = self.db.cursor()
        # SQL insert statement
        sql = 'insert into question(title, content, user_id, created_date, comment_count) values ("%s", "%s", %d, %s, 0)' % (title, content, random.randint(1, 10), 'now()')
        print sql
        cursor.execute(sql)
        print cursor.lastrowid
        self.db.commit()
    except Exception, e:
        print e
        self.db.rollback()
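Note that building the SQL string with % formatting breaks as soon as the title or content contains quote characters (which is why detail_page escapes double quotes above). A safer variant, a sketch of my own rather than part of the original article, is to let MySQLdb parameterize the values itself:

def add_question(self, title, content):
    # Sketch: parameterized variant of add_question.
    # MySQLdb escapes the values, so no manual quote escaping is needed in detail_page.
    try:
        cursor = self.db.cursor()
        sql = ('insert into question (title, content, user_id, created_date, comment_count) '
               'values (%s, %s, %s, now(), 0)')
        cursor.execute(sql, (title, content, random.randint(1, 10)))
        self.db.commit()
    except Exception, e:
        print e
        self.db.rollback()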

View the running results of the crawler:

First run the project in debug, then switch its status to running. (See also: a bug of the pyspider framework under Windows.)

When setting the running speed (rate/burst), it is recommended not to crawl too fast; otherwise it is easy to be detected as a crawler and have your IP blocked.

Viewing the crawler at work:

Viewing the crawled content:

Then query with a local database GUI tool and you can see that the data has been saved locally.

If you need to use the data elsewhere, you can simply export or import it from there.
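If you prefer checking from code rather than a GUI tool, a quick sketch (assuming the same wenda database and question table as above) looks like this:

# Quick check (sketch): count and peek at the rows the crawler has saved.
import MySQLdb

db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')
cursor = db.cursor()
cursor.execute('select count(*) from question')
print 'rows saved:', cursor.fetchone()[0]
cursor.execute('select id, title from question order by id desc limit 5')
for row in cursor.fetchall():
    print row
db.close()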

I shared the crawler's code at the beginning. If you look at that project in detail, you will also find the crawled data I uploaded. (For study only; do not use it commercially!)

Of course, you will also find other crawler code there. If you like it, you can give it a Star; or, if you are interested, fork my project and learn along with me. The project will keep being updated for a long time.

Code:

# created by 10412
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2016-10-20 20:43:00
# Project: V2EX

from pyspider.libs.base_handler import *
import re
import random
import MySQLdb


class Handler(BaseHandler):
    crawl_config = {}

    def __init__(self):
        self.db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')

    def add_question(self, title, content):
        try:
            cursor = self.db.cursor()
            sql = 'insert into question(title, content, user_id, created_date, comment_count) values ("%s", "%s", %d, %s, 0)' % (title, content, random.randint(1, 10), 'now()')
            print sql
            cursor.execute(sql)
            print cursor.lastrowid
            self.db.commit()
        except Exception, e:
            print e
            self.db.rollback()

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.v2ex.com/', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
            self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def tab_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def board_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
            url = each.attr.href
            if url.find('#reply') > 0:
                url = url[0:url.find('#')]
            self.crawl(url, callback=self.detail_page, validate_cert=False)
        for each in response.doc('a.page_normal').items():  # follow the pager links to realize automatic page turning
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        title = response.doc('h2').text()
        content = response.doc('div.topic_content').html().replace('"', '\\"')
        self.add_question(title, content)  # insert into the database
        return {
            "url": response.url,
            "title": title,
            "content": content,
        }

That is the answer to how Python crawls V2EX posts in the Pyspider framework. I hope the above content helps you to some extent; if you still have questions to resolve, you can follow the industry information channel to learn more related knowledge.
