Shulou
2025-09-18 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article shows how to implement a simple web page collector in Python. The walkthrough is concise and easy to follow; I hope you get something out of it.
Requests module
A Python module that wraps network requests and is used to simulate a browser sending requests. Installation: pip install requests
Coding flow of the requests module:
Specify the url
Initiate a request
Get the response data
Persistent storage
```python
# Crawl the page source data of Sogou's home page
import requests

# 1. Specify the url
url = "https://www.sogou.com"
# 2. Send a GET request; the return value is a Response object
response = requests.get(url=url)
# 3. Get the response data; .text returns the response body as a string
page_text = response.text
# 4. Persistent storage
with open("sogou.html", "w", encoding="utf-8") as fp:
    fp.write(page_text)
```

Project: implement a simple web page collector
Requirement: the program takes an arbitrary keyword and fetches the corresponding Sogou result page.
```python
import requests

# 1. Specify the url; the parameters it carries need to be dynamic
url = "https://www.sogou.com/web"
# Stitching parameters into the url by hand is not recommended: with many
# parameters it gets troublesome. The requests module offers a more
# convenient way: put the request parameters in a dictionary and pass it
# to the params argument of get(), which accepts a dictionary.
ky = input("enter a keyword: ")
params = {'query': ky}
response = requests.get(url=url, params=params)
page_text = response.text
with open(f"{ky}.html", "w", encoding='utf-8') as fp:
    fp.write(page_text)
```
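Under the hood, requests simply URL-encodes the params dictionary and appends it to the url as a query string. A stdlib-only sketch of what that produces (no network needed; the sample keywords are just examples):

```python
from urllib.parse import urlencode

url = "https://www.sogou.com/web"
params = {"query": "python"}
# requests builds the final request url roughly like this:
print(f"{url}?{urlencode(params)}")   # https://www.sogou.com/web?query=python

# Non-ASCII keywords are percent-encoded as UTF-8:
print(urlencode({"query": "搜狗"}))    # query=%E6%90%9C%E7%8B%97
```

This is why passing params is safer than string concatenation: encoding of special characters is handled for you.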
After the above code is executed, two problems appear:
The saved page contains garbled characters.
The amount of data returned is wrong (far less than a real result page).
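The garbled characters come from decoding: the response body arrives as bytes, and if requests guesses the wrong charset, non-ASCII text gets mangled. A small stdlib-only illustration (the sample string is just an example, not Sogou's actual response):

```python
# Bytes as a server might actually send them, encoded as UTF-8
raw = "搜狗搜索".encode("utf-8")

wrong = raw.decode("latin-1")   # decoding with the wrong charset gives mojibake
right = raw.decode("utf-8")     # decoding with the correct charset restores the text

print(repr(wrong))              # mangled characters
print(right)                    # 搜狗搜索
```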
```python
import requests

# Resolve the garbled characters
url = "https://www.sogou.com/web"
ky = input("enter a keyword: ")
params = {'query': ky}
response = requests.get(url=url, params=params)
# print(response.encoding)  # prints the original response encoding
response.encoding = 'utf-8'  # override the response encoding before reading .text
page_text = response.text
with open(f"{ky}.html", "w", encoding='utf-8') as fp:
    fp.write(page_text)
```
After the above code is executed, the garbled characters are gone, but we receive an error page instead (Sogou's anti-crawling mechanism kicked in).
UA detection
Most websites use a User-Agent check as an anti-crawling mechanism.
The site determines whether a request comes from a crawler by inspecting the identity of the request carrier, i.e. the User-Agent header.
Counter-strategy: UA spoofing, i.e. adding a browser's User-Agent to the request headers.
Open the Sogou page in a browser, right-click and choose Inspect, switch to the Network tab, and find the browser's User-Agent under Headers.
Note: any browser's User-Agent will do.
```python
# Counter-strategy: add a User-Agent request header
import requests

url = "https://www.sogou.com/web"
ky = input("enter a keyword: ")
params = {'query': ky}
# Request headers are key-value pairs, and both keys and values are strings
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/83.0.4103.61 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
response.encoding = 'utf-8'
page_text = response.text
with open(f"{ky}.html", "w", encoding='utf-8') as fp:
    fp.write(page_text)
```

The above is how to implement a simple web page collector in Python. I hope you have picked up some knowledge or skills from it.
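As an optional refinement that goes beyond the article's code (the helper name and the timeout value are my own choices, not part of the original), the collector can be hardened with a timeout and a status check, so that failures surface instead of silently writing an error page to disk:

```python
def build_request_kwargs(keyword):
    """Assemble the url, params and headers used throughout this article."""
    return {
        "url": "https://www.sogou.com/web",
        "params": {"query": keyword},
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/83.0.4103.61 Safari/537.36"
        },
        "timeout": 10,  # seconds; avoids hanging forever on a dead connection
    }

if __name__ == "__main__":
    import requests
    ky = input("enter a keyword: ")
    response = requests.get(**build_request_kwargs(ky))
    response.raise_for_status()   # raise an error on 4xx/5xx responses
    response.encoding = "utf-8"
    with open(f"{ky}.html", "w", encoding="utf-8") as fp:
        fp.write(response.text)
```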