
How to implement a dynamic web crawler with the requests module in Python


In this article, the editor shares how to use Python's requests module to implement a dynamic web crawler. I hope you get something out of it after reading; let's work through it together.

Development tools

Python version: 3.6.4

Related modules:

urllib module

random module

requests module

traceback module

And some modules that come with Python.

Environment building

Install Python, add it to the PATH environment variable, and use pip to install the required third-party modules.
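Of the modules used in this article, requests and lxml are third-party packages, while urllib, random, traceback, csv, and time ship with Python. A typical install command (assuming pip is available on your PATH) is:

pip install requests lxml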

Now let's approach the crawler the right way: instead of scraping the rendered page, we will write it against the data interface (the Ajax request) directly.

First, find the real request. Right-click the page and choose Inspect, open the Network tab, select XHR, refresh the page, and pick the jsp file in the Name list. It's that simple: the real request is hidden in there.

Take a closer look at this jsp entry; it's a goldmine. It shows the real request URL, the request method (POST), the Headers, and the Form Data. The Form Data holds the parameters passed to the URL, and by changing those parameters we can fetch the data we want. For safety, I blurred out my own Cookie.

Clicking through the pages, we find that only the pagesnum parameter changes.

from urllib.parse import urlencode
import csv
import random
import requests
import traceback
from time import sleep
from lxml import etree  # lxml is a third-party HTML parsing library: powerful and fast

base_url = 'http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search_content.jsp?'  # replace with the URL of the corresponding Ajax request

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'your User-Agent',
    'Origin': 'http://www.hshfy.sh.cn',
    'Referer': 'http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search.jsp?zd=splc',
    'Accept-Language': 'zh-CN,zh',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'your Cookie'
}

Next, build a get_page function that takes a page argument, i.e. the page number. Create the form data as a dictionary and request the page data with a POST. Note that the returned bytes must be decoded as 'gbk', otherwise the text comes back garbled!

def get_page(page):
    n = 3
    while True:
        try:
            sleep(random.uniform(1, 2))  # pause for a random 1-2 seconds (decimals included)
            data = {
                'yzm': 'yxAH',
                'ft': '',
                'ktrqks': '2020-05-22',
                'ktrqjs': '2020-06-22',
                'spc': '',
                'yg': '',
                'bg': '',
                'ah': '',
                'pagesnum': page
            }
            url = base_url + urlencode(data)
            print(url)
            try:
                response = requests.request("POST", url, headers=headers)
                # print(response)
                if response.status_code == 200:
                    re = response.content.decode('gbk')
                    # print(re)
                    return re  # the page content
            except requests.ConnectionError as e:
                print('Error', e.args)  # print the exception information
        except (TimeoutError, Exception):
            n -= 1
            if n == 0:
                print('All three requests failed; give up this url and check the request conditions')
                return
            else:
                print('Request failed, retrying')
                continue
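A side note on the request itself: get_page serializes the form fields into the query string with urlencode and sends the POST with an empty body, which is what the original article does. A more conventional requests pattern, sketched below on the assumption that the server also reads the parameters from the POST body, passes the dictionary through the data argument:

# Minimal sketch only: reuses the same 'data' dict and 'headers' as in get_page above.
response = requests.post(base_url, headers=headers, data=data, timeout=10)
if response.status_code == 200:
    html = response.content.decode('gbk')  # the site returns gbk-encoded HTML, so decode the bytes explicitly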

Then construct a parse_page function: it parses the returned page, extracts each field with XPath, and appends the results to a CSV file.

def parse_page(html):
    try:
        parse = etree.HTML(html)  # parse the web page
        items = parse.xpath('//*[@id="report"]/tbody/tr')
        for item in items[1:]:
            item = {
                'a': ''.join(item.xpath('./td[1]/font/text()')).strip(),
                'b': ''.join(item.xpath('./td[2]/font/text()')).strip(),
                'c': ''.join(item.xpath('./td[3]/text()')).strip(),
                'd': ''.join(item.xpath('./td[4]/text()')).strip(),
                'e': ''.join(item.xpath('./td[5]/text()')).strip(),
                'f': ''.join(item.xpath('./td[6]/div/text()')).strip(),
                'g': ''.join(item.xpath('./td[7]/div/text()')).strip(),
                'h': ''.join(item.xpath('./td[8]/text()')).strip(),
                'i': ''.join(item.xpath('./td[9]/text()')).strip()
            }
            # print(item)
            try:
                # 'a' opens the file in append mode; utf_8_sig keeps the exported csv from being garbled
                with open('./law.csv', 'a', encoding='utf_8_sig', newline='') as fp:
                    fieldnames = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
                    writer = csv.DictWriter(fp, fieldnames)
                    writer.writerow(item)
            except Exception:
                print(traceback.print_exc())  # prints the full traceback instead of just the exception message
    except Exception:
        print(traceback.print_exc())
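One caveat with the CSV writing above: csv.DictWriter.writerow only writes data rows, so law.csv ends up without a header line. If you want column names in the file, a small optional tweak (my addition, reusing the same field names a-i) is to write the header once, before the first data row:

import os

csv_path = './law.csv'
write_header = not os.path.exists(csv_path)  # write the header only if the file does not exist yet
with open(csv_path, 'a', encoding='utf_8_sig', newline='') as fp:
    writer = csv.DictWriter(fp, fieldnames=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])
    if write_header:
        writer.writeheader()  # column names as the first line of the csv
    writer.writerow(item)     # 'item' is the dict built inside parse_page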

Finally, loop over the page numbers and call the function (see the note after this snippet about wiring in parse_page as well):

for page in range(1, 5):  # set the number of pages you want to crawl here
    html = get_page(page)
    # print(html)
    print("Page " + str(page) + " extraction complete")
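Note that this loop only fetches each page; nothing is written to law.csv until parse_page is called on the returned HTML. A minimal runner that ties the two functions together (my addition, not part of the original snippet) could look like this:

if __name__ == '__main__':
    for page in range(1, 5):      # adjust the range to the number of pages you need
        html = get_page(page)
        if html:                  # get_page returns None when all retries fail
            parse_page(html)      # extract the fields and append them to ./law.csv
        print("Page " + str(page) + " extraction complete")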

The result (screenshot omitted):

After reading this article, I believe you now have a basic understanding of how to implement a dynamic web crawler with Python's requests module. Thank you for reading!
