What are the common anti-crawler methods of websites and how to deal with them in Python?


In this article, the editor introduces in detail the common anti-crawler methods used by websites and how to deal with them in Python, with detailed content and clear steps. I hope this article can help you resolve your doubts.

Common website anti-crawler methods:

First, use User-Agent to control access:

Both browsers and crawlers send request headers to identify themselves when they make a network request to the server. For crawlers, the most important field to pay attention to is User-Agent. Many websites set up a User-Agent whitelist, and only requests whose User-Agent falls within the normal range are served normally.

Solution: you can set your own User-Agent, or better yet, randomly pick one that meets the criteria from a pool of User-Agent strings for each request, as in the sketch below.
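A minimal sketch of this idea using the requests library; the User-Agent strings and the URL below are placeholder values, not taken from the article:

import random
import requests

# A small pool of common desktop browser User-Agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # Pick a random User-Agent for each request so the traffic looks like ordinary browsers.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

if __name__ == "__main__":
    response = fetch("https://example.com")  # placeholder URL
    print(response.status_code)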

Second, use JS scripts to prevent crawlers:

For example: before letting you request a page, some websites serve a verification page to check whether you are a crawler. How is this implemented? The page generates a large random number with JS code, asks the browser to calculate a result from that string of numbers with JS, and then sends the result back to the server.

Solution: use a headless browser such as PhantomJS. PhantomJS is not a Python package but a full browser without a graphical interface; driven from Python (for example through Selenium), it can completely simulate a real browser, so JS scripts are no longer a problem. (PhantomJS itself is no longer maintained; headless Chrome or Firefox can be used in the same way.)
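A minimal sketch of this approach, using Selenium with headless Chrome in place of PhantomJS (recent Selenium releases have dropped PhantomJS support); it assumes Selenium and a matching Chrome/chromedriver installation, and the URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure a headless Chrome instance: it executes JS just like a normal browser,
# so JS-based verification pages are handled automatically.
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com")  # placeholder URL
    # page_source contains the HTML after the page's JS has run.
    html = driver.page_source
    print(html[:200])
finally:
    driver.quit()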

Third, anti-crawlers through IP restrictions:

If a fixed IP address visits a website very frequently in a short period of time, it will naturally attract attention; the administrator can block that IP by various means, and the crawler can then do nothing.

Solution: a mature approach is to use an IP proxy pool. Simply put, requests are routed through different proxy IPs so that no single IP gets blocked. But obtaining proxy IPs is itself troublesome: there are free and paid ones online, and their quality is uneven.

If you need this in an enterprise setting, you can build a proxy pool on top of purchased cluster cloud services.

The implementation is as follows:

import random

def get_ip_poll():
    """
    Simulated proxy pool: returns a dictionary of proxy settings.
    """
    ip_poll = ["http://xx.xxx.xxx.xxx:9999",
               "http://xx.xxx.xxx.xxx:8000",
               "http://xx.xxx.xxx.xxx:8080",
               "http://xx.xxx.xxx.xxx:9922",
               "http://xx.xxx.xxx.xxx:8090"]
    addresses = {}
    # Pick a random proxy from the pool (randint's upper bound is inclusive,
    # so it must be len(ip_poll) - 1).
    addresses['http'] = ip_poll[random.randint(0, len(ip_poll) - 1)]
    return addresses
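A hypothetical usage of this pool with the requests library might look like the sketch below; the proxy addresses above are placeholders, so this only illustrates how the returned dictionary plugs into a request:

import requests

proxies = get_ip_poll()  # e.g. {'http': 'http://xx.xxx.xxx.xxx:8080'}
# Route the request through the randomly chosen proxy so the target site
# sees the proxy's IP instead of ours.
response = requests.get("http://example.com", proxies=proxies, timeout=10)
print(response.status_code)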

Fourth, restrict crawlers through robots.txt:

The biggest and best-known crawler in the world is Google's: a search engine is itself a super-large crawler. Google's crawlers run around the clock, continuously fetching new information on the Internet and returning it to the database. But these search-engine crawlers all abide by one agreement: robots.txt. robots.txt (uniformly lowercase) is an ASCII-encoded text file stored in the root directory of a website. It usually tells search-engine crawlers (also known as web spiders) which content on the site should not be accessed by crawlers and which content can be. The robots.txt protocol is not a standard, only a convention, so it does not guarantee the privacy of the site.

Note that robots.txt uses string (prefix) matching to decide whether a URL is blocked, so a directory URL with a trailing slash and the same URL without one are treated as different URLs.

Some robots.txt implementations allow wildcards, such as "Disallow: *.gif". Because URLs on some systems are case-sensitive, the robots.txt filename should be uniformly lowercase. robots.txt should be placed in the root directory of the website. If you want to define the behavior of search-engine crawlers separately for subdirectories, you can merge those rules into the robots.txt under the root directory, or use robots metadata (the robots meta tag).
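As a brief sketch, a polite Python crawler can check robots.txt before fetching a page using the standard library's urllib.robotparser; the site URL and the crawler name below are hypothetical:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # downloads and parses the robots.txt file

# can_fetch() reports whether the given user agent may crawl the URL.
user_agent = "MyCrawler"  # hypothetical crawler name
print(rp.can_fetch(user_agent, "https://example.com/public/page.html"))
print(rp.can_fetch(user_agent, "https://example.com/private/"))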

Of course, in certain circumstances, for example when our crawler fetches pages at roughly the same speed as a human browsing them, it will not cause much performance loss to the server. In that case we do not necessarily have to abide by the robots protocol.

This concludes the introduction to the common anti-crawler methods of websites and how to deal with them in Python. To master these points, you still need to practice and use them yourself. If you want to read more related articles, welcome to follow the industry information channel.
