This article shows you how to implement a simple web page collector in Python. The content is concise and easy to follow, and I hope you take something useful away from the walkthrough.
Requests module
A Python module that wraps network requests and is used to simulate a browser sending a request. Installation: pip install requests
Coding flow of the requests module:
Specify url
Initiate a request
Get the response data
Persistent storage
# crawl the page source data of Sogou's home page
import requests

# 1. Specify url
url = "https://www.sogou.com"
# 2. Send a GET request; get() returns a Response object
response = requests.get(url=url)
# 3. Get the response data, which lives in the Response object
page_text = response.text  # .text returns the response data as a string
# 4. Persistent storage
with open("sogou.html", "w", encoding='utf-8') as fp:
    fp.write(page_text)
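Before the persistence step, it can be worth verifying that the request actually succeeded. The original walkthrough skips this, so the following is just an illustrative sketch using the status-checking helpers that requests provides:

import requests

url = "https://www.sogou.com"
response = requests.get(url=url)
# raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
# so a failed request is caught before anything is written to disk
response.raise_for_status()
print(response.status_code)  # 200 on success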
Project: implement a simple web page collector
Requirement: based on Sogou search, the program takes an arbitrary keyword as input and fetches the entire results page corresponding to that keyword.
# 1. Specify url; the parameters carried by the url need to be made dynamic
import requests

url = "https://www.sogou.com/web"
# Stitching the parameters onto the url by hand is not recommended; with many
# parameters it is quite troublesome. The requests module implements a more
# convenient method: put the required request parameters into a dictionary and
# pass it to the params parameter of get(), which accepts a dictionary.
ky = input("enter a keyword: ")
params = {'query': ky}
response = requests.get(url=url, params=params)
page_text = response.text
with open(f"{ky}.html", "w", encoding='utf-8') as fp:
    fp.write(page_text)
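As an aside (not part of the original code), requests URL-encodes the dictionary and appends it to the url as a query string automatically; this can be confirmed by inspecting response.url:

import requests

# "python" stands in for whatever keyword the user would type
response = requests.get("https://www.sogou.com/web", params={"query": "python"})
# response.url is the final url that was actually requested,
# with the dictionary encoded and appended as a query string
print(response.url)  # e.g. https://www.sogou.com/web?query=python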
After the collector code above is executed, two problems show up:
The saved page comes out garbled (an encoding problem).
The amount of data is wrong (far less than the real results page).
# resolve the garbled characters
import requests

url = "https://www.sogou.com/web"
ky = input("enter a keyword: ")
params = {'query': ky}
response = requests.get(url=url, params=params)
# print(response.encoding)  # prints the original response encoding format
response.encoding = 'utf-8'  # modify the response data encoding format
page_text = response.text
with open(f"{ky}.html", "w", encoding='utf-8') as fp:
    fp.write(page_text)
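Hard-coding 'utf-8' works here because Sogou's pages are UTF-8. When the correct charset is not known in advance, a more general approach (my own suggestion, not from the original article) is to let requests guess the encoding from the response body via apparent_encoding:

import requests

response = requests.get("https://www.sogou.com/web", params={"query": "python"})
# apparent_encoding is detected from the response body itself, so it also
# helps when the server's declared charset is missing or wrong
response.encoding = response.apparent_encoding
page_text = response.text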
Executing the corrected code surfaces the second problem:
An error page is received (Sogou's anti-crawling mechanism).
UA detection
Most websites use UA detection as an anti-crawling mechanism.
The site determines whether a request comes from a crawler by inspecting the User-Agent that identifies the request carrier.
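To see why this catches scripts built on the requests module, you can print the identity it sends by default (a quick illustrative check, not from the original article):

import requests

# the default User-Agent plainly identifies the request as coming from a
# Python script, which is exactly what UA detection looks for
print(requests.utils.default_user_agent())  # e.g. 'python-requests/2.31.0'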
Counter-strategy: UA spoofing. Add a User-Agent field to the request headers.
Open the Sogou page in a browser, right-click and choose Inspect, switch to the Network tab, and look under Headers to find the browser's User-Agent.
Note: the User-Agent of any browser will do.
# Counter-strategy: add User-Agent to the request headers
import requests

url = "https://www.sogou.com/web"
ky = input("enter a keyword: ")
params = {'query': ky}
# Request headers are key-value pairs, and keys and values are all strings.
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/83.0.4103.61 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
response.encoding = 'utf-8'
page_text = response.text
with open(f"{ky}.html", "w", encoding='utf-8') as fp:
    fp.write(page_text)
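As an optional extension beyond the original article, a requests.Session can hold the spoofed headers so that repeated queries reuse them; the sketch below assumes the same Sogou url and UA string as above:

import requests

session = requests.Session()
# headers set on the session are sent with every request it makes
session.headers.update({
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/83.0.4103.61 Safari/537.36"
})

for ky in ["python", "requests"]:
    response = session.get("https://www.sogou.com/web", params={"query": ky})
    response.encoding = 'utf-8'
    with open(f"{ky}.html", "w", encoding='utf-8') as fp:
        fp.write(response.text)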
The above is how to implement a web page collector in Python. If you want to learn more skills or enrich your knowledge, you are welcome to follow the industry information channel.