How to Block Python's requests Library by Disabling HTTP/1.x

2025-01-14 Update From: SLTechnology News&Howtos


This article introduces how to make Python's requests library unusable against your site by disabling all HTTP/1.x requests. Many people run into this kind of situation in practice, so let the editor walk you through how to deal with it. I hope you read it carefully and get something out of it!

I. Preface

A very strong anti-crawler scheme: disable all HTTP/1.x requests!

At present, many crawler libraries do not support HTTP/2.0 well. The famous Python library requests, for example, still supports only HTTP/1.1, and it is unclear when it will support HTTP/2.0.

The latest version of the Scrapy framework, 2.5.0 (released on 2021-04-06), adds support for HTTP/2.0, but the official documentation clearly states that it is an experimental feature not recommended for production environments. The original text reads:

"HTTP/2 support in Scrapy is experimental, and not yet recommended for production environments. Future Scrapy versions may introduce related changes without a deprecation period or warning."

By the way, how do you enable HTTP/2.0 in Scrapy? Change the download handlers in settings.py:

DOWNLOAD_HANDLERS = {
    'https': 'scrapy.core.downloader.handlers.http2.H2DownloadHandler',
}

Known limitations of the current HTTP/2.0 implementation of Scrapy include:

HTTP/2.0 cleartext (h2c) is not supported, because no major browser supports unencrypted HTTP/2.0.

There is no setting to raise the maximum frame size above the default of 16384; connections to servers that send larger frames will fail.

Server push is not supported.

The bytes_received and headers_received signals are not supported.

The other libraries need no elaboration; their HTTP/2.0 support is also poor. At present, the libraries that do support HTTP/2.0 are hyper and httpx, which are relatively easy to use.

II. Anti-Crawler

So, have you come up with an anti-crawler plan?

If we disable all HTTP/1.x requests, can we kill more than half of the crawlers? requests stops working entirely; Scrapy can only limp along with its experimental handler, and only after upgrading to the latest version; and most crawler libraries in other languages are blocked as well, needless to say.

Browser support for HTTP/2.0 is now very good, so this will not affect users' experience of browsing the web.

III. Measures

Then let's do it!

How do we do this? It is actually very simple: just configure it in Nginx, mainly by adding this judgment:

if ($server_protocol !~* "HTTP/2.0") {
    return 444;
}

It's that simple. Here $server_protocol is the transport protocol, which currently has three possible values: HTTP/1.0, HTTP/1.1, and HTTP/2.0. The !~* operator is a case-insensitive negated regex match, so the condition means: if the protocol is not HTTP/2.0, return status code 444 directly. 444 is Nginx's non-standard code for CONNECTION CLOSED WITHOUT RESPONSE, meaning the connection is closed without returning anything.
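To sanity-check that matching logic, here is a small Python sketch (the helper name is hypothetical, for illustration only) that mirrors the case-insensitive negated regex match that !~* performs:

```python
import re

def should_reject(server_protocol: str) -> bool:
    """Mirror Nginx's `if ($server_protocol !~* "HTTP/2.0") { return 444; }`.

    Returns True when the connection would get a 444 (closed without response).
    """
    # ~* is a case-insensitive regex match; the leading ! negates it.
    return re.search(r"HTTP/2\.0", server_protocol, re.IGNORECASE) is None

print(should_reject("HTTP/1.1"))  # True  -> connection gets 444
print(should_reject("HTTP/2.0"))  # False -> request is allowed through
```

The real matching happens inside Nginx, of course; this is only a way to reason about which protocol strings pass the condition.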

My service runs in Kubernetes, so to add this configuration I have to change the Nginx Ingress configuration. Fortunately, the ingress-nginx annotations (https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/) include one called nginx.ingress.kubernetes.io/server-snippet, which lets us add custom logic to Nginx's server block.

The official usage is as follows:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      set $agentflag 0;
      if ($http_user_agent ~* "(Mobile)") {
        set $agentflag 1;
      }
      if ($agentflag = 1) {
        return 301 https://m.example.com;
      }

So here, we just need to change to the configuration just now:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      if ($server_protocol !~* "HTTP/2.0") {
        return 444;
      }
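For readers running plain Nginx rather than Kubernetes, a minimal sketch of the same rule in an ordinary server block might look like the following (the server name and certificate paths are hypothetical; note that http2 must be enabled on the listen directive for HTTP/2.0 to be negotiated at all):

```nginx
server {
    listen 443 ssl http2;
    server_name example.com;

    ssl_certificate     /etc/nginx/certs/example.com.pem;
    ssl_certificate_key /etc/nginx/certs/example.com.key;

    # Close the connection without a response for any non-HTTP/2.0 client.
    if ($server_protocol !~* "HTTP/2.0") {
        return 444;
    }

    location / {
        root /var/www/html;
    }
}
```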

The great task has been completed!

The configuration is complete. The sample website is: https://spa16.scrape.center/

Let's take a look at the effect in the browser:

You can see that all the requests are HTTP/2.0, and the page loads normally.

However, if we use requests to fetch it:

import requests

response = requests.get('https://spa16.scrape.center/')
print(response.text)

It cheerfully reports an error:

Traceback (most recent call last):
  ...
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='spa16.scrape.center', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
requests.exceptions.ProxyError: HTTPSConnectionPool(host='spa16.scrape.center', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))

With requests there is no way around this, because it simply does not support HTTP/2.0.

What if we switch to a library that supports HTTP/2.0, such as httpx? Install it as follows:

pip3 install 'httpx[http2]'

Note that Python 3.6 or higher is required to use httpx.
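As a small sketch (the helper name is hypothetical), you can fail fast on older interpreters before importing httpx, rather than hitting a confusing error later:

```python
import sys

def httpx_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets httpx's minimum (Python 3.6+)."""
    return tuple(version_info[:2]) >= (3, 6)

if not httpx_supported():
    raise RuntimeError("httpx requires Python 3.6 or newer")
```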

After installation, test:

import httpx

client = httpx.Client(http2=True)
response = client.get('https://spa16.scrape.center/')
print(response.text)

The results are as follows:

Scrape | Book
We're sorry but portal doesn't work properly without JavaScript enabled. Please enable it to continue.

As you can see, we successfully obtained the HTML! This is the magic of HTTP/2.0!

What if we set the http2 parameter to False?

import httpx

client = httpx.Client(http2=False)
response = client.get('https://spa16.scrape.center/')
print(response.text)

Unfortunately, it fails as well:

Traceback (most recent call last):
  ...
    raise RemoteProtocolError(msg)
httpcore.RemoteProtocolError: Server disconnected without sending a response.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  ...
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: Server disconnected without sending a response.

So it is confirmed: HTTP/1.x clients have no way in at all. You can say your goodbyes to requests!

That is the end of this article. Thank you for reading. If you want to learn more about the industry, follow this site; the editor will keep publishing practical, high-quality articles for you!
