How to operate web pages with Python

2025-01-19 Update From: SLTechnology News&Howtos

This article explains how to operate web pages with Python, using the built-in urllib library. The content is kept simple and clear so that it is easy to learn and understand.

Introduction

urllib is a library that ships with Python for working with the URLs of web pages, and it can be used to fetch the contents of a page in a few lines. This is most commonly needed in crawler development. The third-party requests library is usually the better choice for crawlers, but because urllib is built in and needs no additional installation, it can stand in for requests in simple cases.

Installation

urllib is part of Python's standard library, so no additional installation is required.

Function

The urllib package contains four modules: request, error, parse, and robotparser.

urllib.request: this module defines functions and classes for opening URLs, covering tasks such as sending requests, authentication, redirection, and cookies.

For crawlers, you generally only need to know the urlopen() function of urllib.request.
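A minimal sketch of urlopen() in use might look like the following. The helper name fetch_text and the choice to decode the body as UTF-8 are assumptions made for illustration, not part of urllib. A data: URL is used only so that the sketch runs without network access.

```python
from urllib import request


def fetch_text(url, timeout=10.0):
    """Open url and return its body decoded as UTF-8 text (an assumption)."""
    # The response object is a context manager, so it is closed
    # automatically when the with-block ends.
    with request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# A data: URL keeps this sketch runnable offline. In a crawler you would
# pass an http(s) address such as 'https://www.yisu.com/' instead.
print(fetch_text("data:text/plain,hello%20urllib"))
```

The timeout is given in seconds. Without it, a slow server can block the crawler for a long time.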

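The parameters below are the ones crawlers use most often (the list is not complete). The first three can be passed to urlopen() directly. headers and method are not urlopen() parameters: they are set on a urllib.request.Request object, which is then passed to urlopen() in place of the URL string.

url: the URL address, that is, the requested link.

data: the data sent to the server (used for a POST request), typically as bytes. Defaults to None.

timeout: the access timeout, in seconds.

headers: the request headers. Crawlers often need to set this field when a site applies anti-crawling checks.

method: the request method. If it is not set, the request is a GET, or a POST when data is provided.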

Code example:

    from urllib import request

    url = 'https://www.yisu.com/'
    headers = {
        # Pretend to be a browser
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    }
    # Build the request object; pass it to request.urlopen(req) to send it
    req = request.Request(url, data=None, headers=headers, method='GET')
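To see what a Request object carries before it is sent, the sketch below builds one GET and one POST request and inspects them without touching the network. The helper names build_get and build_post, the shortened User-Agent string, and the form field q are made up for illustration.

```python
from urllib import parse, request

# Shortened, illustrative User-Agent; real crawlers usually send a full one.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)",
}


def build_get(url):
    """Return a GET request that carries a browser-like User-Agent."""
    return request.Request(url, headers=BROWSER_HEADERS, method="GET")


def build_post(url, form):
    """Return a POST request; the form dict is URL-encoded and sent as bytes."""
    body = parse.urlencode(form).encode("utf-8")
    return request.Request(url, data=body, headers=BROWSER_HEADERS, method="POST")


get_req = build_get("https://www.yisu.com/")
post_req = build_post("https://www.yisu.com/", {"q": "python"})

# Request stores header names capitalized, so look them up as 'User-agent'.
print(get_req.get_method(), get_req.get_header("User-agent"))
print(post_req.get_method(), post_req.data)
```

Either object can then be sent with request.urlopen(req, timeout=10), which is where the timeout parameter comes in.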

urllib.error: this module defines the exception classes raised by urllib.request and is used to handle those exceptions. The two most commonly used are URLError and its subclass HTTPError.
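A hedged sketch of how these exceptions are typically caught is shown below. The function fetch_status, its opener parameter, and the fake_404 stand-in are hypothetical names introduced so that the example can run offline. Only HTTPError, URLError, and urlopen() come from urllib itself.

```python
from urllib import error, request


def fetch_status(url, opener=request.urlopen, timeout=10.0):
    """Try to open url and describe the outcome as a short string.

    opener is injectable only so this sketch can be exercised offline.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return f"ok: {len(resp.read())} bytes"
    except error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it must be caught first.
        return f"http error: {exc.code}"
    except error.URLError as exc:
        # Raised for problems such as an unreachable host or unknown scheme.
        return f"url error: {exc.reason}"


def fake_404(url, timeout=None):
    """Stand-in opener that always fails with a 404 (illustration only)."""
    raise error.HTTPError(url, 404, "Not Found", None, None)


print(fetch_status("https://www.yisu.com/missing", opener=fake_404))
print(fetch_status("foo://bar"))  # unknown scheme, fails without network
```

Catching HTTPError before URLError lets the crawler react to status codes such as 403 or 404 separately from connection-level failures.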

urllib.parse: this module is used to parse URLs. It can split a URL into its scheme, network location, hierarchical path, parameters of the last path element, query component, and fragment identifier, and it can also extract the user name, password, host name (lowercased), and port number, provided the URL contains the corresponding value.

Generally speaking, a developer who is familiar with the structure of a URL can read these parts off directly, so this module is mainly useful for automation and is of limited use in a crawler, where the developer has usually done this work already while analyzing the website. For more detail, see the Python documentation.
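As a small illustration of the components listed above, the following sketch splits a URL with urlparse() and decodes its query string with parse_qs(). The URL and the helper name describe_url are invented for the example.

```python
from urllib.parse import parse_qs, urlparse


def describe_url(url):
    """Return the main components of url as a plain dict."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "params": parts.params,      # parameters of the last path element
        "query": parse_qs(parts.query),
        "fragment": parts.fragment,
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,  # always lowercased
        "port": parts.port,          # int, or None if the URL has no port
    }


# Made-up URL that contains every component.
info = describe_url(
    "https://user:secret@WWW.Example.com:8443/docs/page.html;v=1?lang=en&tag=a&tag=b#intro"
)
for name, value in info.items():
    print(f"{name}: {value}")
```

parse_qs() returns a list for every key because a query string may repeat a key, as tag does here.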

urllib.robotparser: this module is used to parse robots.txt files.

A robots.txt file is how a website tells crawlers which content may be crawled and which may not. It is a gentleman's agreement between the website and crawler developers. Although there is no explicit rule that content disallowed by robots.txt must never be crawled, the website may hold the developer accountable for crawling it.
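Checking a URL against these rules can be sketched as follows. The robots.txt content, the crawler name MyCrawler, and the example.com URLs are all invented for the example. The rules are parsed from a string so that the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Invented rules: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from the given robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


rp = build_parser(ROBOTS_TXT)
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))
print(rp.can_fetch("MyCrawler", "https://example.com/private/data.html"))
```

Against a live site, the usual pattern is to call rp.set_url() with the address of the site's robots.txt and then rp.read() to download it, instead of calling parse() on a string.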
