Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to implement a web page collector in Python

2025-02-24 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/02 Report--

This article shows you how to achieve a web page collector in Python, the content is concise and easy to understand, it can definitely brighten your eyes. I hope you can get something through the detailed introduction of this article.

Requests module

A module based on network request encapsulated in python. Used to simulate a browser to send a request. Installation: pip install requests

Coding flow of requests module

Specify url

Initiate a request

Get the corresponding data

Persistent storage

# crawl the page source data import requests# 1 of Sogou's home page. Specify urlurl = "https://www.sogou.com"# 2. Send request getresponse = requests.get (url=url) # get return value is Response object # to get response data, response data in Response object page_text = response.text # text returns string response data # 4. Persistent storage with open ("sogou.html", "w", encoding='utf-8') as fp: fp.write (page_text) project: implement a simple web page collector

Requirements: the program is based on Sogou to enter arbitrary keywords and then get the relevant entire page corresponding to the keywords.

# 1. To specify url, you need to make the parameters carried by url dynamic url = "https://www.sogou.com/web"# to make the parameters dynamic. The stitching of parameters is not recommended. If there are too many parameters, it is quite troublesome. # requests module implements a more convenient method ky = input ("enter a key") params= {'query':ky} # to apply the dictionary corresponding to the required request parameters to the params parameter of the get method, and the params parameter accepts a dictionary response = requests.get (url=url,params=params) page_text = response.textwith open (f "{ky} .html", "w", encoding='utf-8') as fp: fp.write (page_text)

After the above code is executed:

There is garbled code.

The data is of the wrong magnitude.

# resolve garbled url= "https://www.sogou.com/web"ky = input (" enter a key ") params= {'query':ky} response = requests.get (url=url,params=params) # print (response.encoding) will print the original response encoding format response.encoding =' utf-8' # modify the response data encoding format page_text = response.textwith open (f" {ky} .html "," w ") Encoding='utf-8') as fp: fp.write (page_text)

After the above code is executed:

Received the error page (Sogou's anti-climbing mechanism)

UA detection

Most websites have UA check anti-crawling mechanism.

The portal determines whether the request is made by a crawler by detecting the identity of the request carrier.

Anti-crawling strategy: UA camouflage request header to add User-Agent

Open the browser request Sogou page, right-click to check to enter Network, and click Headers to find the browser's User-Agent

Note: the identity of any browser is fine.

# Anti-crawling strategy: add User-Agenturl = "https://www.sogou.com/web"ky = input (" enter a key ") params = {'query':ky} # to the request header. Note that the data format of the request header is a key-value pair, and they are all strings. Headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64) X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 "} response = requests.get (url=url,params=params,headers=headers) response.encoding = 'utf-8' page_text = response.textwith open (f" {ky} .html "," w ", encoding='utf-8') as fp: fp.write (page_text) the above content is how to implement a web collector in Python. Have you learned any knowledge or skills? If you want to learn more skills or enrich your knowledge reserve, you are welcome to follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report