This article "Java target site anti-crawler how to solve" the knowledge points of most people do not understand, so the editor summed up the following content, detailed, clear steps, with a certain reference value, I hope you can get something after reading this article, let's take a look at this "Java target site anti-crawler how to solve" article.
1. Preface
When collecting data from websites, we face all kinds of anti-crawler techniques every day, and getting the data means building a targeted counter-measure for each one. If the target site identifies your User-Agent, you need a large pool of User-Agent strings to rotate through and evade detection. Some sites track visitors by cookie, so cookies have to be handled as well. If the site limits the number of requests per IP, you must either throttle your request rate or switch IPs. The strictest sites use CAPTCHAs to decide whether you are a human or a machine, which forces you to simulate human behavior to break through. A sketch of the User-Agent rotation idea follows below.
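As a concrete illustration of the User-Agent rotation mentioned above, here is a minimal sketch. It assumes Java 11+ with the built-in java.net.http client; the User-Agent strings and the httpbin.org test URL are illustrative placeholders, not part of the original project.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class UserAgentRotation {
    // A small pool of common desktop User-Agent strings (illustrative values).
    private static final List<String> USER_AGENTS = List.of(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
    );

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Pick a random User-Agent per request so successive requests
        // do not all carry the same fingerprint.
        String ua = USER_AGENTS.get(ThreadLocalRandom.current().nextInt(USER_AGENTS.size()));
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://httpbin.org/headers"))
                .header("User-Agent", ua)
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}

The same pattern extends naturally to rotating cookies or other request headers.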
Take the project I am working on now as an example: it needs to collect data from Dianping, a site we all know is very hard to scrape. Its per-IP request limits are very strict, yet I need a large amount of data, so simply slowing down my crawler will certainly not work; at that rate the collection would take months or years. The only way to cope is to keep switching dynamic IPs. And since I need a large amount of data in a very short time, I cannot spend it managing an IP pool and verifying availability; the IP switching needs to happen automatically from Java so that my time goes into the collection itself.
I looked at many proxy providers online. Most offer an API model that would still leave me managing the IP pool myself, which is obviously unsuitable given my tight schedule and large data volume. I found a few companies offering a dynamic-forwarding model and tested several; whether because of this site's peculiarities or because of their proxies' instability, most did not work well.
2. Connection example
Java
HttpClient 3.1
import org.apache.commons.httpclient.Credentials;
import org.apache.commons.httpclient.HostConfiguration;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;

import java.io.IOException;

public class Main {
    private static final String PROXY_HOST = "t.16yun.cn";
    private static final int PROXY_PORT = 31111;

    public static void main(String[] args) {
        HttpClient client = new HttpClient();
        HttpMethod method = new GetMethod("https://httpbin.org/ip");

        // Route every request through the forwarding proxy.
        HostConfiguration config = client.getHostConfiguration();
        config.setProxy(PROXY_HOST, PROXY_PORT);

        // Send the proxy credentials preemptively rather than waiting for a 407 challenge.
        client.getParams().setAuthenticationPreemptive(true);
        String username = "16ABCCKJ";
        String password = "712323";
        Credentials credentials = new UsernamePasswordCredentials(username, password);
        AuthScope authScope = new AuthScope(PROXY_HOST, PROXY_PORT);
        client.getState().setProxyCredentials(authScope, credentials);

        try {
            client.executeMethod(method);
            if (method.getStatusCode() == HttpStatus.SC_OK) {
                String response = method.getResponseBodyAsString();
                System.out.println("Response = " + response);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            method.releaseConnection();
        }
    }
}
This demo can be copied and used directly. The proxy settings come with the purchased proxy package; fill in the corresponding configuration and it runs.
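HttpClient 3.1 is quite old. For readers on a modern JDK, here is an equivalent sketch using the built-in java.net.http client (a minimal sketch assuming Java 11+; the host, port, and credentials are the same placeholders as in the demo above). Note that the JDK disables Basic authentication for HTTPS tunneling through a proxy by default, so the jdk.http.auth.tunneling.disabledSchemes property must be cleared, either with -D on the command line or early in main before the client classes load.

import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ModernProxyExample {
    public static void main(String[] args) throws Exception {
        // Placeholder values; substitute the settings from your proxy provider.
        String proxyHost = "t.16yun.cn";
        int proxyPort = 31111;
        String username = "16ABCCKJ";
        String password = "712323";

        // By default the JDK refuses Basic auth when tunneling HTTPS through
        // a proxy; clearing this property re-enables it.
        System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "");

        HttpClient client = HttpClient.newBuilder()
                // Route all requests through the forwarding proxy.
                .proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)))
                // Supply the proxy credentials when the proxy challenges us.
                .authenticator(new Authenticator() {
                    @Override
                    protected PasswordAuthentication getPasswordAuthentication() {
                        return new PasswordAuthentication(username, password.toCharArray());
                    }
                })
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://httpbin.org/ip"))
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}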
Notes
Dynamic forwarding is billed by requests per second, so purchase capacity according to your data volume. The provider also offers a standard and an enhanced version, which apparently differ in the size of the IP pool; confirm the details with customer service and choose based on your actual needs.
That covers this article on how to deal with a target site's anti-crawler measures in Java. I believe you now have a general understanding, and I hope the content shared here is helpful. To learn more, please follow the industry information channel.