2025-03-09 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
What exactly is the proxy IP that a crawler needs?
When we crawl websites, we often rotate IP addresses to keep the crawler from being blocked. This is a relatively simple operation. Many IP proxy providers are available today, such as Xici, Zhima, and Xiniu. These providers generally offer transparent proxies, anonymous proxies, and high-anonymity proxies. So what is the difference between them, and how should we choose? This article explains the principles behind the various kinds of proxy IP.
1 Proxy types
Proxy IPs can be divided into four types: the transparent proxy, the anonymous proxy, and the high-anonymity proxy mentioned above, plus the distorting proxy. Ranked by the basic level of protection they give the client, the order is high anonymity > distorting > anonymous > transparent.
2 How proxies work
The proxy type depends mainly on how the proxy server is configured; different configurations produce different types of proxy. Three variables are the determining factors in that configuration: REMOTE_ADDR, HTTP_VIA, and HTTP_X_FORWARDED_FOR.
1) REMOTE_ADDR
REMOTE_ADDR represents the IP of the client, but its value is not supplied by the client; the server sets it from the IP address of the connection it receives.
If you visit a website directly with a browser, the site's web server (Nginx, Apache, etc.) sets REMOTE_ADDR to the IP address of the client.
If we configure a proxy for the browser, our request to the target website first goes to the proxy server, and the proxy server then forwards the request to the target site. In that case the website's web server sets REMOTE_ADDR to the IP of the proxy server.
2) X-Forwarded-For (XFF)
X-Forwarded-For is an HTTP extension header that represents the real IP of the HTTP requester. When the client uses a proxy, the web server sees only the proxy's address and does not know the real IP address of the client. To address this, the proxy server usually adds an X-Forwarded-For header that contains the client's IP.
The format of the X-Forwarded-For request header is as follows:
X-Forwarded-For: client, proxy1, proxy2
Here, client is the IP address of the client; proxy1 is the IP of the proxy farthest from the server; proxy2 is the IP of the second-level proxy. The format shows that there can be multiple layers of proxies between the client and the server.
If an HTTP request reaches the server after passing through three proxies, Proxy1, Proxy2, and Proxy3, whose IPs are IP1, IP2, and IP3 respectively, and the user's real IP is IP0, then according to the XFF convention the server will eventually receive the following information:
X-Forwarded-For: IP0, IP1, IP2
Proxy3 connects directly to the server and appends IP2 to XFF, indicating that it is forwarding the request on behalf of Proxy2. IP3 does not appear in the list; the server obtains IP3 from the Remote Address field. HTTP runs on top of a TCP connection, and the HTTP protocol itself has no concept of an IP address. Remote Address comes from the TCP connection and is the IP of the device that established the TCP connection with the server, which in this case is IP3.
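The append-at-each-hop behavior described above can be sketched in a few lines of Python. This is only an illustrative simulation, not real proxy code: the helper names `forward` and `send_through` are hypothetical, and it assumes every proxy in the chain follows the XFF convention.

```python
def forward(headers, peer_ip):
    """Simulate one proxy hop: append the IP of the peer that connected to this proxy."""
    out = dict(headers)
    existing = out.get("X-Forwarded-For")
    out["X-Forwarded-For"] = f"{existing}, {peer_ip}" if existing else peer_ip
    return out


def send_through(client_ip, proxy_ips):
    """Return (remote_addr, headers) as the origin server would see them."""
    headers = {}
    peer_ip = client_ip  # whoever opened the TCP connection to the next hop
    for proxy_ip in proxy_ips:
        headers = forward(headers, peer_ip)  # the proxy records its peer
        peer_ip = proxy_ip                   # then it becomes the peer of the next hop
    return peer_ip, headers


remote_addr, headers = send_through("IP0", ["IP1", "IP2", "IP3"])
print(remote_addr)                   # IP3
print(headers["X-Forwarded-For"])    # IP0, IP1, IP2
```

The last proxy never writes its own address into the header, which is why IP3 only shows up as the remote address of the TCP connection.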
3) HTTP_VIA
Via is a header in the HTTP protocol that records the proxies and gateways through which an HTTP request passes. Each proxy server the request passes through adds one entry, so a request that passes through two proxy servers carries two entries.
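The names HTTP_VIA and HTTP_X_FORWARDED_FOR follow the CGI/WSGI convention, in which a request header is exposed to the application in upper case, with hyphens turned into underscores and an HTTP_ prefix, while REMOTE_ADDR is filled in by the server from the connection. As a rough sketch of how a target site might read the three values, here is a minimal WSGI application; the function names `proxy_view` and `app` are illustrative and not taken from any particular site.

```python
def proxy_view(environ):
    """Pick the three proxy-related values out of a WSGI/CGI-style environ dict."""
    return {
        "REMOTE_ADDR": environ.get("REMOTE_ADDR"),                    # set by the server
        "HTTP_VIA": environ.get("HTTP_VIA"),                          # from the Via header
        "HTTP_X_FORWARDED_FOR": environ.get("HTTP_X_FORWARDED_FOR"),  # from X-Forwarded-For
    }


def app(environ, start_response):
    """Minimal WSGI app that echoes the three values as plain text."""
    lines = [f"{name} = {value}" for name, value in proxy_view(environ).items()]
    body = "\n".join(lines).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Served with any WSGI server (for example `wsgiref.simple_server.make_server` from the standard library), requesting this page through a proxy shows what that proxy reveals to the target site.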
3 Differences between proxy types
1) Transparent proxy (Transparent Proxy)
The proxy server is configured as follows:
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP
Although a transparent proxy "hides" the client's IP address at the connection level, the server can still read the client's real IP address from HTTP_X_FORWARDED_FOR.
2) Anonymous proxy (Anonymous Proxy)
The proxy server is configured as follows:
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Proxy IP
An anonymous proxy hides the client's IP address. With an anonymous proxy, the server can tell that the client is using a proxy, but it cannot learn the client's real IP address.
3) Distorting proxy (Distorting Proxy)
The proxy server is configured as follows:
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Random IP address
The principle is similar to that of an anonymous proxy, but the disguise is more convincing. If the client uses a distorting proxy, the server can still tell that a proxy is in use, but it receives a fake client IP address.
4) High-anonymity proxy (Elite Proxy or High Anonymity Proxy)
The proxy server is configured as follows:
REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined
Because HTTP_VIA and HTTP_X_FORWARDED_FOR carry no value, a high-anonymity proxy keeps the server from telling whether the client is using a proxy at all, and also keeps the server from obtaining the client's real IP address.
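The four configurations above can be turned into a small check for testing a proxy against a server you control. The sketch below is only an illustration of that table: `classify_proxy` is a hypothetical helper, it assumes you already know your own real IP (`client_ip`), and the case with Via but no X-Forwarded-For is not covered by the table, so its handling here is an assumption. The addresses are reserved documentation ranges.

```python
def classify_proxy(remote_addr, via, xff, client_ip):
    """Classify a proxy from the three values the target server observed."""
    if remote_addr == client_ip:
        return "no proxy"            # the server saw our own address directly
    if not via and not xff:
        return "high anonymity"      # no header reveals a proxy or a client IP
    if not xff:
        # Assumption: Via without X-Forwarded-For discloses the proxy but no
        # client address, so it is treated like an anonymous proxy here.
        return "anonymous"
    first_hop = xff.split(",")[0].strip()
    if first_hop == client_ip:
        return "transparent"         # real client IP leaked in XFF
    if first_hop == remote_addr:
        return "anonymous"           # XFF only repeats the proxy IP
    return "distorting"              # XFF carries some other, fake address


me, proxy = "198.51.100.4", "203.0.113.7"
print(classify_proxy(proxy, proxy, me, me))            # transparent
print(classify_proxy(proxy, proxy, proxy, me))         # anonymous
print(classify_proxy(proxy, proxy, "192.0.2.55", me))  # distorting
print(classify_proxy(proxy, None, None, me))           # high anonymity
```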
4 Choosing a proxy
An ordinary anonymous proxy hides the client's real IP, but it also changes our request information (for example, by setting the Via and X-Forwarded-For headers described above). The visited website therefore cannot learn the client's IP address, but it can still tell that a proxy is in use. Some pages that specialize in IP detection may still be able to find the client's IP.
A high-anonymity proxy does not change the client's request, so to the server it looks as if a real client browser is accessing it: the client's real IP is hidden, and the server does not think we are using a proxy.
Therefore, when a crawler needs proxy IPs, prefer ordinary anonymous proxies and high-anonymity proxies. In addition, if you want to keep the transmitted data from being read by the proxy server, it is recommended to use a proxy that supports the HTTPS protocol.
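As a rough sketch of how a crawler might rotate through a pool of such proxies, the example below uses only Python's standard `urllib.request`. The proxy addresses are placeholders from a documentation range, `opener_cycle` is a hypothetical helper, and a real crawler would also need timeouts, retries, and removal of dead proxies.

```python
import itertools
import urllib.request

# Placeholder proxy addresses (documentation range); replace with real ones.
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]


def opener_cycle(proxies):
    """Yield (proxy_url, opener) pairs forever, switching proxy on every step."""
    for proxy_url in itertools.cycle(proxies):
        # Route both http:// and https:// requests through this proxy.
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        yield proxy_url, urllib.request.build_opener(handler)


openers = opener_cycle(PROXIES)
proxy_url, opener = next(openers)
print(proxy_url)  # http://203.0.113.10:8080
# opener.open("https://example.com/", timeout=10) would send the request via proxy_url.
```

Calling `next(openers)` before each request moves on to the next proxy in the pool, so consecutive requests reach the target site from different addresses.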