In this article, the editor shares how Python can use the requests module to build a dynamic web crawler. I hope you will get something out of it after reading; let's go through it together.
Development tools
Python version: 3.6.4
Related modules:
urllib module
random module
requests module
traceback module
lxml module
And some modules that come with Python (csv, time, etc.).
Environment building
Install Python, add it to your environment variables, and use pip to install the required third-party modules.
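Before writing any crawler code, it helps to confirm that the third-party modules are actually importable. The following quick check is only a minimal sketch; it assumes pip install requests lxml has already been run.

# Quick import check for the third-party modules used below.
# If an import fails, install the module with pip, e.g.:  pip install requests lxml
import requests
from lxml import etree

print('requests', requests.__version__)
print('lxml', etree.LXML_VERSION)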
Now let's get down to the crawler properly, starting with the approach of parsing the Ajax interface directly.
First, find the real request. Right-click the page and choose Inspect, open the Network tab, select XHR, refresh the page, and pick the jsp file in the Name list. Yes, it's as simple as that: the real request is hiding in there.
Take a closer look at this jsp entry; it's a treasure. It contains the real request URL, the request method (POST), the Headers, and the Form Data. The Form Data holds the parameters passed to the URL, and by changing those parameters we can fetch different data! For safety's sake, I blurred out my own Cookie.
Clicking through the pages, we find that only the pagesnum parameter changes.
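To see the idea in isolation before the full script below, here is a minimal sketch that replays the captured XHR with requests. The URL and the pagesnum field are the ones used later in this article; the other form fields, headers and Cookie are deliberately left out here.

# Minimal replay of the captured XHR: POST to the search interface and vary
# pagesnum to pull different pages. The real headers/Cookie are added in the
# full script below; this sketch only demonstrates the changing parameter.
import requests

url = 'http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search_content.jsp?pagesnum=1'
resp = requests.post(url)
print(resp.status_code)  # 200 means the interface answered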
from urllib.parse import urlencode
import csv
import random
import requests
import traceback
from time import sleep
from lxml import etree  # lxml is a powerful and fast third-party web page parsing library

base_url = 'http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search_content.jsp?'  # replace this with the link in the corresponding Ajax request

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'your User-Agent',
    'Origin': 'http://www.hshfy.sh.cn',
    'Referer': 'http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search.jsp?zd=splc',
    'Accept-Language': 'zh-CN,zh',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'your Cookie'
}
Build the get_page function with the argument page, i.e. the page number. Create the form data as a dictionary and use a POST request to fetch the page data. Note that the returned content must be decoded as 'gbk', otherwise it will be garbled!
def get_page(page):
    n = 3
    while True:
        try:
            sleep(random.uniform(1, 2))  # wait a random time between 1 and 2 seconds, decimals included
            data = {
                'yzm': 'yxAH',
                'ft': '',
                'ktrqks': '2020-05-22',
                'ktrqjs': '2020-06-22',
                'spc': '',
                'yg': '',
                'bg': '',
                'ah': '',
                'pagesnum': page
            }
            url = base_url + urlencode(data)
            print(url)
            try:
                response = requests.request("POST", url, headers=headers)
                # print(response)
                if response.status_code == 200:
                    re = response.content.decode('gbk')
                    # print(re)
                    return re  # return the page content
            except requests.ConnectionError as e:
                print('Error', e.args)  # print the exception information
        except (TimeoutError, Exception):
            n -= 1
            if n == 0:
                print('All three requests failed; abandon this url and check the request conditions')
                return
            else:
                print('Request failed, retrying')
                continue
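Before wiring everything together, a quick sanity check of get_page is useful. This is only a sketch and assumes the 'your Cookie' and 'your User-Agent' placeholders in headers have been replaced with real values copied from DevTools.

# Sanity check: fetch page 1 and peek at the decoded HTML.
html = get_page(1)
if html:
    print(html[:300])  # first 300 characters of the gbk-decoded page
else:
    print('Request failed; check the headers and form parameters.')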
Construct the parse_page function, parse the returned web page data, extract all the field contents with XPath, and save them in csv format.
def parse_page(html):
    try:
        parse = etree.HTML(html)  # parse the web page
        items = parse.xpath('//*[@id="report"]/tbody/tr')
        for item in items[1:]:
            item = {
                'a': ''.join(item.xpath('./td[1]/font/text()')).strip(),
                'b': ''.join(item.xpath('./td[2]/font/text()')).strip(),
                'c': ''.join(item.xpath('./td[3]/text()')).strip(),
                'd': ''.join(item.xpath('./td[4]/text()')).strip(),
                'e': ''.join(item.xpath('./td[5]/text()')).strip(),
                'f': ''.join(item.xpath('./td[6]/div/text()')).strip(),
                'g': ''.join(item.xpath('./td[7]/div/text()')).strip(),
                'h': ''.join(item.xpath('./td[8]/text()')).strip(),
                'i': ''.join(item.xpath('./td[9]/text()')).strip()
            }
            # print(item)
            try:
                with open('./law.csv', 'a', encoding='utf_8_sig', newline='') as fp:
                    # 'a' is append mode; utf_8_sig keeps the exported csv from being garbled
                    fieldnames = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
                    writer = csv.DictWriter(fp, fieldnames)
                    writer.writerow(item)
            except Exception:
                print(traceback.print_exc())  # prints detailed exception information instead of just print(e)
    except Exception:
        print(traceback.print_exc())
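The code above only appends data rows, so law.csv ends up without column names. If you also want a header row, a minimal sketch (using the same file path and the same one-letter field names as above) is to write it once before crawling:

# Optional: write the header row once before the crawl starts, so that law.csv
# begins with column names. Uses the same path and field names as parse_page.
with open('./law.csv', 'w', encoding='utf_8_sig', newline='') as fp:
    writer = csv.DictWriter(fp, fieldnames=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])
    writer.writeheader()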
Finally, iterate over the pages and call the functions:
for page in range(1, 5):  # set the number of pages you want to crawl here
    html = get_page(page)
    # print(html)
    if html:
        parse_page(html)  # parse the returned html and append the rows to law.csv
    print("Page " + str(page) + " extraction complete")
Effect:
After reading this article, I believe you now have a working understanding of how Python uses the requests module to implement a dynamic web crawler. If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!