This article introduces the basics of Java web crawlers. Many people have questions about these basics in day-to-day work, so I have collected the material into a set of simple, practical notes. I hope it helps answer the question "what are the basics of a Java web crawler?". Let's get started.
1. An "ethical" crawler
Why do I put this first? Because I think it is the most important point. What is an "ethical" crawler? It is one that follows the rules of the server being crawled, does not interfere with the server's normal operation, and does not damage the service being crawled.
A question that is often discussed is whether crawlers are legal. If you look this up on Zhihu, you will find tens of thousands of answers; the one I personally agree with goes roughly as follows.
As a kind of computer technology, a crawler is technically neutral, so crawling itself is not prohibited by law; what can be illegal, or even criminal, is using crawler technology to obtain data unlawfully. Each case has to be judged on its own facts: a fruit knife is not banned by law, but using it to stab someone will not be tolerated by the law.
So does crawling break the law? That depends on whether what you do with it is illegal. What is the essence of a web crawler? It is simply using a machine instead of a person to visit pages. Reading public news is certainly not against the law, so collecting news that is openly published on the Internet is not illegal either; this is exactly what the major search engines do, and most websites are eager to be crawled by search-engine spiders. Collecting other people's private data is a different matter: looking through someone's private information yourself is illegal, so using a program to collect it is illegal as well. As with the fruit knife, the tool itself is not illegal, but stabbing someone with it is.
To be an "ethical" crawler, the Robots protocol is something you must understand; the Baidu Encyclopedia entry on the Robots protocol describes it in detail.
Many websites publish a Robots file that states which pages may be crawled and which may not. Of course, the Robots protocol is only a convention, not a technical barrier, much like the priority seats on a bus reserved for the elderly, the sick and the disabled: it is not illegal to sit in them, but it is not the decent thing to do.
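As an illustration, a minimal robots.txt might look like the following; the paths and bot name here are invented for this example and are not taken from any real site:

# Hypothetical robots.txt, for illustration only
User-agent: *
Disallow: /private/     # please do not crawl anything under /private/
Allow: /public/         # crawling under /public/ is fine

User-agent: SomeBot
Disallow: /             # this particular bot is asked not to crawl at all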
Beyond the protocol, we also need to exercise restraint in our collection behavior. As stated in Article 16 of Chapter II of the Measures for the Administration of Data Security (draft for comment):
A network operator that uses automated means to access and collect website data must not hinder the normal operation of the website. If such behavior seriously affects the website's operation, for example if the traffic from automated access and collection exceeds one third of the site's average daily traffic, the operator shall stop the automated access and collection when the website asks it to.
This rule states that a crawler must not interfere with the normal operation of a website. If your crawler brings a site down so that real visitors can no longer access it, that is extremely unethical behavior and must be avoided.
Besides how data is collected, we also need to pay attention to how it is used. Even personal information collected with authorization must never be sold; this is explicitly prohibited by law. See:
Article 5 of the Interpretation of the Supreme People's Court and the Supreme People's Procuratorate on Several Issues Concerning the Application of Law in Handling Criminal Cases of Infringing on Citizens' Personal Information defines "serious circumstances" as including:
(1) illegally obtaining, selling or providing more than 50 items of whereabouts information, communication content, credit information or property information;
(2) illegally obtaining, selling or providing more than 500 items of citizens' personal information that may affect personal or property safety, such as accommodation records, communication records, health and physiological information or transaction information;
(3) illegally obtaining, selling or providing more than 5,000 items of citizens' personal information other than that specified in the two preceding items. Any of these constitutes the "serious circumstances" required by the crime of infringing on citizens' personal information.
In addition, even personal information that was collected lawfully may not be provided to others without the consent of the person it was collected from; doing so also counts as "providing citizens' personal information" under Article 253 of the Criminal Law and may constitute a crime.
2. Learn to analyze HTTP requests
Every interaction with a server goes through the HTTP protocol (some traffic is not HTTP, of course; I have never collected that kind of data, so only HTTP is discussed here). Analyzing the HTTP requests behind a web page is fairly simple. Let's take searching for a piece of news on Baidu as an example.
Open the F12 developer tools and click the Network tab to see all the requests, then find the request that matches the link in the address bar; the main document request is usually at the top of the Network list.
In the Headers panel on the right we can see all the parameters this request needs; the sections we care most about are Request Headers and Query String Parameters.
Request Headers holds the header parameters of the HTTP request. Some websites block crawlers based on the request headers, so it is worth understanding these parameters. Most of them are common ones; User-Agent and Cookie are used most often. User-Agent identifies the browser making the request, and Cookie stores the user's login credentials.
Query String Parameters shows the query parameters carried by the request (for a POST request the body parameters appear in a similar Form Data section). Being able to read the parameters here is very useful when we want to simulate requests such as a login POST.
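To make this concrete, here is a minimal sketch of sending a request that carries a User-Agent and a Cookie, using the HttpClient built into Java 11; the URL, User-Agent string and cookie value are placeholders for illustration, not real ones:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RequestHeaderDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Build a GET request that carries the two headers discussed above.
        // The URL, User-Agent string and cookie value are placeholders.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.example.com/s?word=news"))
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .header("Cookie", "SESSIONID=placeholder-value")
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("Status: " + response.statusCode());
    }
}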
That is how we analyze HTTP requests for ordinary web pages. If you need to collect data from inside an app, there is no debugging tool available, so you have to capture the app's traffic with a packet-capture tool instead. Two tools commonly used for this are listed below; if you are interested you can look into them:
Fiddler
Wireshark
3. Learn to parse HTML pages
The pages we collect are HTML pages, and we need to extract the information we want from them. That means parsing the HTML, i.e. parsing DOM nodes, and it is absolutely essential: without it you are like a magician without props, able only to stare at the page. Take the following HTML page as an example.
Suppose we want to extract the title "java user-agent to determine whether computer access". First, inspect the element with F12.
The span tag containing the title is framed in the figure. How do we extract this node's text? There are countless ways, but the two selectors used most often are CSS selectors and XPath:
Written as a CSS selector: #wgt-ask > h2 > span
Written as XPath: //span[@class="wgt-ask"]
Either way we get the span node, and we only need to take its text. Besides writing CSS selectors and XPath expressions by hand, we can also let the browser generate them for us, for example in the Chrome browser.
Simply select the target node, right-click it and find Copy in the context menu, which offers several ways to obtain a selector for the node. As shown in the figure above, Copy selector gives the CSS selector and Copy XPath gives the XPath expression, which is very handy.
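As a small sketch of what this parsing can look like in Java with Jsoup (the library used later in this series), assuming the page has already been downloaded; the HTML string below is a simplified stand-in for the page in the figure:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseDemo {
    public static void main(String[] args) {
        // Simplified stand-in for the page in the figure; a real crawler would
        // download the page first and parse the returned HTML instead.
        String html = "<div id=\"wgt-ask\"><h2><span>"
                + "java user-agent to determine whether computer access"
                + "</span></h2></div>";

        Document doc = Jsoup.parse(html);

        // The CSS selector written above: #wgt-ask > h2 > span
        Element span = doc.selectFirst("#wgt-ask > h2 > span");
        if (span != null) {
            System.out.println(span.text());   // prints the title text
        }
    }
}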
4. Understand anti-crawler strategies
Because crawlers are now so rampant, many websites deploy anti-crawler mechanisms to filter out crawler traffic and keep the site available, which is an entirely necessary measure; after all, if the site cannot be used there is nothing left to discuss. There are many anti-crawler techniques; let's look at a few common ones.
Anti-crawler mechanism based on Headers
This is a common anti-crawler mechanism: the website checks the User-Agent and Referer parameters in the Request Headers to decide whether the visitor is a crawler program. To bypass it, we just look at what values the site expects for User-Agent and Referer when the page is opened in a browser, and then set those values in the crawler's Request Headers.
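For example, with Jsoup the two headers can be set on the connection before fetching the page; the URL and header values below are placeholders, and in practice you would copy the values observed in the browser's Network tab:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HeaderBypassDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and header values; copy the real values from the
        // Request Headers shown in the browser's F12 Network tab.
        Document doc = Jsoup.connect("https://www.example.com/page")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .referrer("https://www.example.com/")
                .get();

        System.out.println(doc.title());
    }
}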
Anti-crawler mechanism based on user behavior
This is also a common anti-crawler mechanism, and the most widely used form is limiting access by IP: an IP address is allowed only so many requests within a period of time, and anything beyond that frequency is treated as a crawler. Douban movies, for example, limits access by IP.
For this mechanism we can use proxy IPs: fetch a batch of proxy IPs from a proxy provider and set one of them when making requests.
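A minimal sketch with Java 11's HttpClient, assuming a placeholder proxy address; in a real crawler the host and port would come from your pool of proxy IPs and be rotated:

import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProxyDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder proxy address; in practice pick one from the batch of
        // proxy IPs obtained from a proxy provider and rotate them.
        HttpClient client = HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress("127.0.0.1", 8888)))
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.example.com/"))
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}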
Besides IP limits, some sites also look at the interval between your visits: if every request arrives at exactly the same interval, you may also be flagged as a crawler. To get around this, vary the interval between requests, for example sleeping one minute this time and thirty seconds the next.
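A simple way to do this in Java is to sleep for a random amount of time between requests, for example somewhere between 30 and 60 seconds, as in this sketch:

import java.util.concurrent.ThreadLocalRandom;

public class RandomDelayDemo {
    // Sleep for a random time between minMillis and maxMillis so the interval
    // between requests is not a fixed, machine-like value.
    static void politePause(long minMillis, long maxMillis) throws InterruptedException {
        long delay = ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
        Thread.sleep(delay);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            // ... fetch one page here ...
            politePause(30_000, 60_000);   // wait 30 to 60 seconds before the next request
        }
    }
}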
Anti-crawler mechanism based on dynamic pages
On many websites the data we need is requested via Ajax or generated by JavaScript, which is more troublesome. There are two ways around this mechanism. One is to use an auxiliary tool such as Selenium to obtain the fully rendered page. The other is to think in reverse: find the Ajax URL the page uses to request its data and fetch that URL directly.
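Here is a minimal sketch of the first approach using Selenium's Java bindings; it assumes a Chrome driver is available on the machine, and the URL is a placeholder:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class DynamicPageDemo {
    public static void main(String[] args) {
        // Assumes a Chrome driver is available on this machine; the URL is a placeholder.
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.example.com/news");
            // getPageSource() returns the DOM after JavaScript has run, so the
            // rendered HTML can then be parsed with Jsoup as in section 3.
            String renderedHtml = driver.getPageSource();
            System.out.println(renderedHtml.length());
        } finally {
            driver.quit();
        }
    }
}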
That covers the basic knowledge of crawlers: mainly the tools used for web crawling and the common anti-crawler strategies. These basics will be useful for the rest of our crawler study. Over the years I have written several crawler projects on and off; I used Java in the early ones and switched to Python later, but recently I became interested in Java crawlers again, so I plan to write a series of posts re-examining Java web crawlers. This will be a summary of my Java crawler experience, and it would be even better if it helps anyone who wants to write web crawlers in Java. The series is expected to have six articles, going step by step from simple to complex and covering all the problems I have run into over the years. Here is the outline of the six articles.
1. Web crawler, it's so simple.
This article is an introduction to web crawlers. It uses Jsoup and HttpClient to fetch a page and then uses a selector to parse out the data. In the end you will see that a crawler is essentially just an HTTP request; it really is that simple.
2. If I encounter login problems in web page collection, what should I do?
This chapter briefly discusses how to get data that requires logging in. Taking Douban personal information as an example, it covers this kind of problem in two ways: setting the Cookie manually and simulating the login.
3. What should I do when the data on a page is loaded asynchronously via Ajax?
This chapter briefly covers the problem of asynchronously loaded data, taking NetEase News as an example. It looks at two approaches, using the htmlunit tool to obtain the rendered page and thinking in reverse to call the Ajax request URL directly, and briefly discusses how to deal with this kind of problem.
4. The web page collection IP is blocked. What should I do?
Having your IP access restricted is quite common. Taking Douban movies as an example, this article focuses on setting up proxy IPs, briefly discusses how to deal with IP limits, and also talks about how to build your own IP proxy service.
5. The performance of network collection is too poor. What should I do?
Sometimes there are performance requirements for a crawler and a single-threaded approach may not be enough; we may need a multi-threaded or even distributed crawler. This article therefore focuses on multi-threaded crawlers and distributed crawler architectures.
6. Use case analysis of open source crawler framework webmagic
I used webmagic for a crawler some years ago, but at the time I did not understand the framework very well. With a few more years of experience I now have a new appreciation of it, so I want to build a simple demo following the webmagic conventions to experience the power of webmagic.
This concludes our look at "what are the basics of a Java web crawler". I hope it has cleared up your doubts; pairing theory with practice is the best way to learn, so go and try it out! If you want to keep learning more on the subject, stay tuned for the follow-up articles.