
The basics of web crawlers in Java


This article introduces the basics of web crawlers in Java. It should be a useful reference for anyone getting started, and I hope you learn a lot from reading it.

There are four basics you need to know before getting started with Java web crawlers.

1. An "ethical" crawler

Why do I put this first? Because I think it is the most important point. What is an "ethical" crawler? It is one that follows the rules of the server being crawled, does not interfere with that server's normal operation, and does not damage the service it crawls.

A question that is often debated is whether crawlers are legal. Search the topic on Zhihu and you will see something like this.

There are tens of thousands of answers, and the one I personally agree with most is the following.

As a computer technology, crawling is neutral in itself, so crawlers are not prohibited by law. Using crawler technology to obtain data, however, can be illegal or even criminal; each case has to be judged on its own facts. A fruit knife is not prohibited by law either, but using it to stab someone will not be tolerated.

So when does a crawler break the law? It depends on whether what you do with it is illegal. What is the essence of a web crawler? It is using a machine to visit pages instead of a person. It is certainly not illegal for me to read public news, so it is not illegal to collect news that is openly published on the Internet, just as the major search engines do; in fact, most websites are eager to be crawled by search engine spiders. Conversely, collecting other people's private data is illegal even when you look it up by hand, so collecting it with a program is also illegal. Again, the fruit knife itself is legal, but stabbing someone with it is not.

To be a "moral" crawler, the Robots protocol is something you must understand. Here is the Baidu encyclopedia of the Robots protocol.

Many websites publish a Robots file that tells you which pages may be crawled and which may not. Of course, the Robots protocol is only a convention; like the priority seats on a bus reserved for the elderly, the sick, and the disabled, sitting in them is not illegal.

Beyond the protocol, we also need to exercise restraint in how much we collect, as stated in Article 16 of Chapter II of the Measures for the Administration of Data Security (draft for comment):

Where a network operator uses automated means to access and collect website data, it must not hinder the normal operation of the website. If such behavior seriously affects the operation of the website, for example when the traffic from automated collection exceeds one third of the website's average daily traffic, and the website demands that the automated collection stop, it shall be stopped.

This rule states that a crawler must not interfere with the normal operation of the site. If your crawler overwhelms the site so that real visitors cannot access it, that is extremely unethical behavior and should be stopped.

Besides collection, we also need to pay attention to how the data is used. Even personal information collected with authorization must never be sold; this is explicitly prohibited by law. See:

According to Article 5 of the Interpretation of the Supreme People's Court and the Supreme People's Procuratorate on Several Issues Concerning the Application of Law in Handling Criminal Cases of Infringing on Citizens' Personal Information, the following constitute "serious circumstances":

(1) illegally obtaining, selling, or providing more than 50 items of location-tracking information, communication content, credit information, or property information;

(2) illegally obtaining, selling, or providing more than 500 items of citizens' personal information, such as accommodation records, communication records, health and physiological information, or transaction information, that may affect personal or property safety;

(3) illegally obtaining, selling, or providing more than 5,000 items of citizens' personal information other than those specified in the two items above constitutes the "serious circumstances" required for the crime of infringing on citizens' personal information.

In addition, providing legally collected personal information of citizens to others without the consent of the person concerned also counts as "providing citizens' personal information" under Article 253 of the Criminal Law and may constitute a crime.

2. Learn to analyze HTTP requests

Every interaction with a server goes through the HTTP protocol. Of course, some traffic does not use HTTP; I have never collected such data and cannot say whether it can be done, so here we only discuss HTTP. Analyzing the HTTP requests of a web page is relatively simple. Let's take searching for a piece of news on Baidu as an example.

Open the F12 developer tools and click the Network tab to view all requests, then find the link shown in the address bar. The main request is usually at the top of the Network list.

In the Headers panel on the right we can see the parameters required by this request. Pay particular attention to the Request Headers and Query String Parameters sections.

Request Headers are the header parameters carried by the HTTP request. Some websites block crawlers based on the request headers, so you need to understand these parameters. Most of them are common; User-Agent and Cookie are used most often. User-Agent identifies the browser making the request, and Cookie stores the user's login credentials.

Query String Parameters are the request parameters of the HTTP request. They are especially important for POST requests, because here you can see the parameters being sent, which is very useful when simulating requests such as a login.
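As an illustration, here is a minimal sketch of replaying such a request in Java with the built-in java.net.http client (Java 11+). The URL and form fields are placeholders I made up for the example, not values from any real site:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LoginRequestDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Form parameters as observed under "Query String Parameters" / form data in the F12 Network panel
        String form = "username=demo&password=demo123";

        // https://example.com/login is a placeholder; replace it with the real login endpoint
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/login"))
                .header("User-Agent", "Mozilla/5.0")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());

        // The Set-Cookie headers returned here usually carry the login credential for later requests
        response.headers().allValues("Set-Cookie").forEach(System.out::println);
    }
}
```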

The above covers analyzing HTTP requests for web pages. If you need to collect data from an app, you will need a packet-capture tool, because an app has no built-in debugging tools. Two commonly used tools are listed below; if you are interested, you can look into them.

Fiddler

Wireshark

3. Learn to parse HTML pages

The pages we collect are HTML pages, and we need to extract the information we want from them. This means parsing the HTML page, that is, parsing DOM nodes, and it is a top priority: without this skill you are like a magician without props, able only to stare. Take the following HTML page as an example.

We want to get the title "java user-agent to determine whether computer access", so we first inspect the element with F12.

The span tag containing the title is framed in the figure. How do we parse this node? There are countless ways, but the most frequently used selectors are CSS selectors and XPath. If you are not familiar with them, you can learn more at the links below:

CSS selector reference manual: https://www.w3school.com.cn/cssref/css_selectors.asp

XPath tutorial: https://www.w3school.com.cn/xpath/xpath_syntax.asp

Parsing with a CSS selector is written as: #wgt-ask > h2 > span

Parsing with XPath is written as: //span[@class="wgt-ask"]

Either way we get the span node, and then we only need to take out its text. Besides writing CSS selectors and XPath expressions by hand, we can also let the browser generate them for us, for example in Chrome.

Simply select the node, right-click, and find Copy; it offers several ways to obtain the node's selector. As shown in the figure above, Copy selector corresponds to the CSS selector and Copy XPath corresponds to XPath, which is very handy.
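As a concrete sketch, the snippet below parses such a fragment with the Jsoup library (my choice for the example; the article does not name a parser) using the same CSS selector:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TitleParserDemo {
    public static void main(String[] args) {
        // A small HTML fragment mimicking the page structure described above
        String html = "<div id=\"wgt-ask\"><h2><span class=\"wgt-ask\">"
                + "java user-agent to determine whether computer access</span></h2></div>";

        Document doc = Jsoup.parse(html);

        // Same CSS selector as in the text: #wgt-ask > h2 > span
        Element span = doc.selectFirst("#wgt-ask > h2 > span");
        if (span != null) {
            System.out.println(span.text());
        }

        // jsoup 1.14.3+ also supports XPath directly:
        // doc.selectXpath("//span[@class='wgt-ask']")
    }
}
```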

4. Understand anti-crawler strategies

Because crawlers are now so rampant, many websites deploy anti-crawler mechanisms to filter out crawler programs and keep the site available, which is a perfectly reasonable measure; after all, if the site cannot be used, there is nothing left to discuss. There are many anti-crawler techniques; let's look at a few common ones.

Headers-based anti-crawler mechanisms

This is a common anti-crawler mechanism. The website checks the User-Agent and Referer parameters in the Request Headers to decide whether the request comes from a crawler. To get around it, we simply look up the User-Agent and Referer values the site expects in the browser's Network panel and set the same parameters in the crawler's request headers.
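A minimal sketch, again with java.net.http; the URL and header values below are placeholders, and in practice they would be copied from the browser's Request Headers:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HeadersDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Set the same User-Agent and Referer the browser sends, so the request looks like a normal visit
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/news/list"))
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .header("Referer", "https://example.com/")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```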

Behavior-based anti-crawler mechanisms

This is also a common anti-crawler mechanism, the most typical form being IP rate limiting: an IP is allowed only a certain number of visits within a period of time, and exceeding that frequency marks it as a crawler. Douban Movies, for example, applies this kind of IP limit.

For this mechanism we can use proxy IPs: obtain a batch of proxy IPs from a proxy provider and set one of them when making requests.

Besides IP limits, some sites also look at your access interval. If the interval between your visits is always the same, you may also be identified as a crawler. To get around this, vary the interval between requests, for example sleeping one minute this time and 30 seconds the next.
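Here is a rough sketch combining both ideas (a proxy plus a randomized pause) with java.net.http; the proxy address and URLs are placeholders:

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class PoliteCrawlerDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical proxy address; in practice this would come from a proxy IP provider
        HttpClient client = HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress("127.0.0.1", 8888)))
                .build();

        List<String> urls = List.of(
                "https://example.com/page/1",
                "https://example.com/page/2");

        for (String url : urls) {
            HttpRequest request = HttpRequest.newBuilder().uri(URI.create(url)).build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(url + " -> " + response.statusCode());

            // Randomized pause (30 to 90 seconds here) so the access interval is never a fixed value
            Thread.sleep(ThreadLocalRandom.current().nextLong(30_000, 90_000));
        }
    }
}
```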

Dynamic-page anti-crawler mechanisms

On many websites the data we need is requested via Ajax or generated by JavaScript, which makes collection more painful. There are two ways around this. One is to use a helper tool such as Selenium to obtain the rendered page. The other is to reverse the process: find the Ajax endpoint the page calls and request that link directly to get the data.
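For the first approach, a minimal Selenium sketch might look like this (it assumes the selenium-java dependency and a matching ChromeDriver are installed; the URL is a placeholder):

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class DynamicPageDemo {
    public static void main(String[] args) {
        // Starts a real Chrome instance driven by ChromeDriver
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com/dynamic-page"); // placeholder URL

            // getPageSource() returns the DOM after JavaScript has rendered it,
            // which can then be parsed with Jsoup as in section 3
            String renderedHtml = driver.getPageSource();
            System.out.println(renderedHtml.length());
        } finally {
            driver.quit();
        }
    }
}
```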

Thank you for reading this article carefully. I hope this overview of the basics of web crawlers in Java has been helpful to you.
