What exactly is the proxy IP needed by the crawler?

2025-02-28 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Many newcomers are unsure what the proxy IPs used by crawlers actually are. This article summarizes the background to the question and how to resolve it, and I hope it helps you do so.

When crawling some websites, we often set a proxy IP to prevent the crawler from being blocked. We usually take these IPs from the free lists published by well-known domestic proxy providers (such as Xici Proxy, Kuaidaili, and Wuyou Proxy). These providers generally offer transparent proxies, anonymous proxies, and high-anonymity proxies. So what is the difference between them, and how should we choose? This article explains the principles behind the various kinds of proxy IP.
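As a minimal sketch of what "setting a proxy IP" means in a crawler, the snippet below builds an opener with Python's standard urllib that routes requests through a proxy. The proxy address is a placeholder from a documentation-only range and the helper name build_proxy_opener is hypothetical; neither comes from the original article.

```python
import urllib.request

# Hypothetical proxy address (203.0.113.0/24 is reserved for documentation).
PROXY_URL = "http://203.0.113.10:8080"


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_URL)
# A real crawl would then call, for example:
#     opener.open("http://example.com", timeout=10)
# No request is sent here because the proxy address above is only a placeholder.
```

In practice the crawler would rotate through several such proxy addresses so that a single blocked IP does not stop the crawl.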

1 Proxy types

There are four types of proxy. In addition to the transparent, anonymous, and high-anonymity proxies mentioned above, there is also the distorting proxy. In terms of security, the four types rank as follows: high anonymity > distorting > anonymous > transparent.

2 How proxies work

The proxy type depends mainly on how the proxy server is configured; different configurations produce different types of proxy. Three variables, REMOTE_ADDR, HTTP_VIA, and HTTP_X_FORWARDED_FOR, are the determining factors.

1) REMOTE_ADDR

REMOTE_ADDR represents the IP of the client, but its value is not supplied by the client; the server sets it from the address of the incoming connection.

If you visit a website directly using a browser, the site's web server (Nginx, Apache, etc.) will set REMOTE_ADDR as the IP address of the client.

If we set up a proxy for the browser, our request to the target website first goes to the proxy server, which then forwards it to the target site. The website's web server will then set REMOTE_ADDR to the IP of the proxy server.

2) X-Forwarded-For (XFF)

X-Forwarded-For is an HTTP extension header that carries the real IP of the HTTP requester. When the client uses a proxy, the web server cannot otherwise know the client's real IP address. To address this, the proxy server usually adds an X-Forwarded-For header and puts the client's IP in it.

The format of the X-Forwarded-For request header is as follows:

X-Forwarded-For: client, proxy1, proxy2

Here client is the IP address of the client, proxy1 is the IP of the proxy farthest from the server, and proxy2 is the IP of the second-level proxy. The format shows that a request can pass through several layers of proxies between client and server.

If an HTTP request reaches the server after passing through three proxies, Proxy1, Proxy2 and Proxy3, whose IPs are IP1, IP2 and IP3 respectively, and the user's real IP is IP0, then according to the XFF convention the server will receive the following:

X-Forwarded-For: IP0, IP1, IP2

Proxy3 connects directly to the server and appends IP2 to XFF, indicating that it is forwarding the request on behalf of Proxy2. IP3 does not appear in the list; the server obtains it from the Remote Address field. HTTP runs on top of a TCP connection, and the HTTP protocol itself has no concept of an IP address. Remote Address comes from the TCP connection and is the IP of the device that established the TCP connection with the server, which in this case is IP3.
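The chain described here can be sketched in a few lines of Python. This is only an illustrative simulation under the assumption that every proxy follows the XFF convention; the function name forward_through is hypothetical, and IP0 to IP3 are the placeholder labels used in the text.

```python
def forward_through(client_ip, proxy_ips):
    """Simulate a request passing through the given proxies in order.

    Returns (x_forwarded_for, remote_addr) as seen by the final server.
    """
    xff = []
    peer = client_ip  # whoever opened the current TCP connection
    for proxy_ip in proxy_ips:
        xff.append(peer)  # each proxy appends the address it received the request from
        peer = proxy_ip   # the proxy then opens the next connection itself
    return ", ".join(xff), peer


xff, remote_addr = forward_through("IP0", ["IP1", "IP2", "IP3"])
print(xff)          # IP0, IP1, IP2
print(remote_addr)  # IP3
```

The last proxy never appears in the header because nobody after it adds an entry; the server only sees it as the peer of the TCP connection.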

3) HTTP_VIA

Via is an HTTP header that records the proxies and gateways an HTTP request passes through. Each proxy server along the path adds one entry, so the header holds one entry after one proxy and two entries after two.
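A rough sketch of how the Via value grows hop by hop is shown below. Each entry has the form "protocol-version proxy-name"; the proxy names proxy1 and proxy2 and the helper append_via are made up for illustration.

```python
def append_via(existing_via, proxy_name, http_version="1.1"):
    """Return the Via value after one more proxy has handled the request."""
    entry = f"{http_version} {proxy_name}"
    return f"{existing_via}, {entry}" if existing_via else entry


via = None  # a request sent directly by the client carries no Via header
for name in ["proxy1", "proxy2"]:
    via = append_via(via, name)

print(via)  # 1.1 proxy1, 1.1 proxy2
```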

3 Differences between proxy types

1) Transparent proxy (Transparent Proxy)

The proxy server is configured as follows:

REMOTE_ADDR = Proxy IP

HTTP_VIA = Proxy IP

HTTP_X_FORWARDED_FOR = Your IP

Although a transparent proxy nominally "hides" the client's IP address, the server can still read the client's real IP from HTTP_X_FORWARDED_FOR.

2) Anonymous proxy (Anonymous Proxy)

The proxy server is configured as follows:

REMOTE_ADDR = Proxy IP

HTTP_VIA = Proxy IP

HTTP_X_FORWARDED_FOR = Proxy IP

An anonymous proxy hides the client's IP address. With an anonymous proxy, the server can tell that the client is using a proxy, but it cannot learn the client's real IP address.

3) Distorting proxy (Distorting Proxy)

The proxy server is configured as follows:

REMOTE_ADDR = Proxy IP

HTTP_VIA = Proxy IP

HTTP_X_FORWARDED_FOR = Random IP address

The principle is similar to that of an anonymous proxy, but the disguise is more convincing. If the client uses a distorting proxy, the server can still tell that a proxy is in use, but it receives a fake client IP address.

4) High-anonymity proxy (Elite Proxy or High Anonymity Proxy)

The proxy server is configured as follows:

REMOTE_ADDR = Proxy IP

HTTP_VIA = not set

HTTP_X_FORWARDED_FOR = not set

A high-anonymity proxy not only prevents the server from knowing whether the client is using a proxy, it also ensures that the server cannot obtain the client's real IP address.
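The four configurations can be turned into a small classifier. The sketch below assumes you control a test server that echoes back the values it received and that you already know your own real IP; the function classify_proxy and the sample addresses (all from documentation-only ranges) are hypothetical. It also assumes the request really did go through a proxy, and it treats an absent header as None.

```python
def classify_proxy(real_ip, remote_addr, http_via=None, http_x_forwarded_for=None):
    """Classify a proxy from the values the target server observed."""
    if not http_via and not http_x_forwarded_for:
        return "high anonymity"  # no proxy headers at all
    forwarded = [ip.strip() for ip in (http_x_forwarded_for or "").split(",") if ip.strip()]
    if real_ip in forwarded:
        return "transparent"     # the real client IP leaks through XFF
    if not forwarded or all(ip == remote_addr for ip in forwarded):
        return "anonymous"       # proxy is visible, client IP is not
    return "distorting"          # XFF carries some other, fake address


REAL = "198.51.100.7"    # hypothetical client IP
PROXY = "203.0.113.10"   # hypothetical proxy IP

print(classify_proxy(REAL, PROXY, PROXY, REAL))         # transparent
print(classify_proxy(REAL, PROXY, PROXY, PROXY))        # anonymous
print(classify_proxy(REAL, PROXY, PROXY, "192.0.2.55")) # distorting
print(classify_proxy(REAL, PROXY))                      # high anonymity
```

Running a check like this against each candidate before adding it to a crawler's proxy pool is one way to verify the anonymity level a free list claims.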

4 Choosing a proxy

An ordinary anonymous proxy hides the client's real IP, but it modifies the request, so the visited website can tell that a proxy is in use even though it cannot see the client's IP address. Some pages designed to detect IPs may still be able to find the client's IP.

A high-anonymity proxy, on the other hand, does not change the client's request, so to the server it looks as if a real client browser is accessing it. The client's real IP is hidden, and the server does not suspect that a proxy is in use.

Therefore, when a crawler needs a proxy IP, choose an ordinary anonymous proxy or a high-anonymity proxy. In addition, if you do not want the proxy server to be able to read the data, it is recommended to use a proxy that supports the HTTPS protocol.

Having read the above, do you now understand the proxy IPs a crawler needs? If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!
