
What are the anti-crawling techniques in web development?

2025-02-22 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article focuses on "what are the anti-crawling techniques in web development". Interested readers may wish to take a look: the methods introduced here are simple, fast, and practical. Let's walk through them one by one.

1. User-Agent

The User-Agent request header is the most basic anti-crawling check; a crawler can slip past it simply by sending a simulated request header.

Solution: set your own User-Agent, or better yet, randomly pick a valid one from a pool of User-Agent strings for each request.
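As a minimal sketch of the rotation idea, using only the Python standard library (the User-Agent strings below are illustrative examples, not a curated or current list):

```python
import random
import urllib.request

# A small pool of common desktop User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def build_request(url: str) -> urllib.request.Request:
    """Attach a randomly chosen User-Agent to each outgoing request."""
    return urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}
    )

req = build_request("https://example.com/")
# urllib would then fetch it with: urllib.request.urlopen(req)
```

In practice you would refresh the pool periodically, since stale User-Agent strings are themselves a fingerprint.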

2. Verification code

CAPTCHAs are the most commonly used anti-crawler measure, but simple CAPTCHAs can be recognized automatically with machine learning, usually with an accuracy of 50% or higher.

Complex CAPTCHAs can be submitted to a dedicated human-solving platform, where workers type them in manually. Depending on the complexity, workers charge on average 1-2 cents per code, so the cost is low, the check is easy to bypass, and the data remains easy to crawl.

3. IP blocking

This is the most effective measure, but also the one most prone to false positives. The strategy relies on IP addresses being scarce; today, however, pools of hundreds of thousands of IPs can be obtained cheaply through proxy providers, ADSL re-dialing, or dial-up VPS hosts, so simple IP-blocking strategies are becoming less and less effective.

Solution:

A mature approach is an IP proxy pool.

Simply put, requests go out through different IPs via proxies, so no single IP gets blocked. But acquiring proxies is itself troublesome: free and paid lists exist online, but their quality is uneven. For enterprise use, you can build your own proxy pool on purchased cluster cloud services.
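The rotation logic of such a pool can be sketched in a few lines; the proxy addresses below are placeholders, and the `requests` call in the comment is just one way the pool might be plugged in:

```python
import itertools

class ProxyPool:
    """Minimal round-robin proxy pool (a sketch; real pools also health-check
    proxies and fetch replacements from a provider)."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)

    def get(self):
        """Return the next proxy in the rotation."""
        return next(self._cycle)

    def remove(self, proxy):
        """Drop a proxy that has been banned or stopped responding."""
        self._proxies.remove(proxy)
        self._cycle = itertools.cycle(self._proxies)

# Placeholder addresses; each request would use the next proxy in turn, e.g.
# with the requests library:
#   requests.get(url, proxies={"http": p, "https": p}) where p = pool.get()
pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
```

A round-robin cycle is the simplest policy; production pools typically weight proxies by recent success rate instead.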

4. Slider verification

Slider verification combines machine-learning techniques: instead of reading letters that are sometimes too distorted even for the human eye, the user drags a slider into place. However, because some vendors' trajectory-checking algorithms are relatively simple, a fairly basic simulated drag is often enough to bypass them, and the data gets crawled anyway. Similar cases: Taobao, Aliyun, Taobao Alliance.

5. Correlate the request context

An anti-crawler can judge whether a visitor is a real person by checking whether a token, or the surrounding network requests, follow the complete expected flow. However, for a technician skilled in protocol analysis, fully simulating that flow is not too difficult. Similar cases: the Zhihu and Baidu login flows.
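The idea can be shown with a toy in-process model (all names here are hypothetical, not any real site's API): a data endpoint that only answers requests carrying a token issued by an earlier page load, so a crawler that skips straight to the data endpoint is rejected:

```python
import secrets

class TokenGate:
    """Toy model of request-context checking: the data endpoint only answers
    when the request carries a token issued by a prior page load."""

    def __init__(self):
        self._issued = set()

    def get_page(self):
        """Simulates loading the page; real sites embed the token in HTML/JS."""
        token = secrets.token_hex(8)
        self._issued.add(token)
        return token

    def get_data(self, token):
        """Simulates the data endpoint: no valid token, no data."""
        return token in self._issued

gate = TokenGate()
print(gate.get_data(None))    # hitting the endpoint cold: False
tok = gate.get_page()         # replaying the full flow first...
print(gate.get_data(tok))     # ...is accepted: True
```

This is why such checks stop naive crawlers but not protocol analysts: once the flow is understood, replaying it in order is mechanical.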

6. JavaScript participates in the computation

Simple crawlers cannot execute JavaScript, so if some intermediate results require a JS engine to parse and run scripts, a naive attacker cannot simply scrape the data. However, crawler developers can still automate the parsing by embedding a JS engine module or by driving a headless browser such as PhantomJS or headless Chrome.

Solution: here comes the big gun, the headless browser. PhantomJS is a scriptable headless WebKit browser (not a Python package, and now discontinued) that can fully simulate a browser without a graphical interface; today the same role is usually played by headless Chrome or Firefox driven through Selenium or Playwright. Either way, JS script verification is no longer a problem.

7. Increase the cost of data acquisition

When facing a professional adversary, the only remaining option is to raise the opponent's labor cost: code obfuscation, dynamic encryption schemes, fake data, confusing data, and so on, exploiting the advantage that development is faster than reverse analysis in order to wear down the other side's will. If the opponent refuses to let go, the fight simply continues until one side gives up over machine or labor cost. Typical cases: Autohome's font replacement, Qunar hiding values in the CSS coordinates of elements.
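The Autohome-style font replacement can be illustrated with a hypothetical glyph map: the HTML carries private-use codepoints (or deliberately wrong characters) that only the site's custom font renders as the right digits, so scraped text is garbage until the crawler recovers the mapping from the font file. The mapping below is invented for illustration:

```python
# Hypothetical glyph-to-character mapping, as a crawler would recover it by
# parsing the site's custom font file. The raw HTML contains private-use
# codepoints; only the font renders them as the correct digits.
GLYPH_MAP = {
    "\ue001": "0",
    "\ue002": "1",
    "\ue003": "5",
    "\ue004": "8",
}

def decode(text: str) -> str:
    """Replace obfuscated glyphs with their real characters; pass the rest through."""
    return "".join(GLYPH_MAP.get(ch, ch) for ch in text)

raw = "\ue002\ue004\ue003 km"   # what the scraped HTML actually contains
print(decode(raw))               # prints "185 km"
```

The defender's counter-move is to re-shuffle the font on every page load, forcing the crawler to re-derive the mapping each time, which is exactly the cost-raising game this section describes.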

At this point, I believe you have a deeper understanding of "what are the anti-crawling techniques in web development"; you might as well try them out in practice. Follow us to keep learning more related content.


© 2024 shulou.com SLNews company. All rights reserved.
