How to configure Nutch to emulate a browser and bypass anti-crawler restrictions

2025-03-26 Update From: SLTechnology News&Howtos


This article explains how to configure Nutch to simulate a web browser so that it can get past simple anti-crawler restrictions. It first looks at how Nutch builds its User-Agent header, then lists ready-made configurations for common browsers.

When we configure Nutch to crawl http://yangshangchuan.iteye.com, every page it fetches contains only the message "your access request is denied". This is the simplest anti-crawler strategy: the server reads the User-Agent header of the HTTP request to decide whether the client is a human using a browser or a robot crawler. We can bypass this restriction simply by configuring Nutch to simulate a web browser.
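To make the strategy concrete, here is a minimal Java sketch of what such a server-side check might look like. It is purely illustrative: the class name `NaiveAgentFilter`, the method `isBlocked`, and the list of blocked substrings are assumptions, not the actual logic used by any particular site.

```java
import java.util.Locale;

/** Hypothetical sketch of a naive server-side User-Agent filter. */
final class NaiveAgentFilter {

    // Assumed substrings that mark a request as coming from a crawler.
    private static final String[] BLOCKED_TOKENS = {"nutch", "crawler", "spider", "bot"};

    /** Returns true when a request with this User-Agent would be rejected. */
    static boolean isBlocked(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return true; // a missing header is treated as non-browser traffic
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : BLOCKED_TOKENS) {
            if (ua.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A crawler-style User-Agent is rejected.
        System.out.println(isBlocked("Nutch-1.7")); // true
        // A browser-style User-Agent passes the check.
        System.out.println(isBlocked(
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0")); // false
    }
}
```

A filter of this kind looks only at the header text, which is why sending a browser-like User-Agent is enough to get past it.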

There are five configurations in nutch-default.xml related to User-Agent:

    <property>
      <name>http.agent.description</name>
      <value></value>
      <description>Further description of our bot- this text is used in
      the User-Agent header. It appears in parenthesis after the agent name.
      </description>
    </property>

    <property>
      <name>http.agent.url</name>
      <value></value>
      <description>A URL to advertise in the User-Agent header. This will
      appear in parenthesis after the agent name. Custom dictates that this
      should be a URL of a page explaining the purpose and behavior of this
      crawler.
      </description>
    </property>

    <property>
      <name>http.agent.email</name>
      <value></value>
      <description>An email address to advertise in the HTTP 'From' request
      header and User-Agent header. A good practice is to mangle this
      address (e.g. 'info at example dot com') to avoid spamming.
      </description>
    </property>

    <property>
      <name>http.agent.name</name>
      <value></value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty -
      please set this to a single word uniquely related to your organization.

      NOTE: You should also check other related properties:

        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version

      and set their values appropriately.
      </description>
    </property>

    <property>
      <name>http.agent.version</name>
      <value>Nutch-1.7</value>
      <description>A version string to advertise in the
      User-Agent header.</description>
    </property>

You can see how these five properties are combined into the User-Agent string in the class nutch2.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:

    // Reads the five properties and builds the User-Agent string
    this.userAgent = getAgentString(conf.get("http.agent.name"),
        conf.get("http.agent.version"), conf.get("http.agent.description"),
        conf.get("http.agent.url"), conf.get("http.agent.email"));

    private static String getAgentString(String agentName,
        String agentVersion, String agentDesc, String agentURL,
        String agentEmail) {

      if ((agentName == null) || (agentName.trim().length() == 0)) {
        // TODO: NUTCH-258
        if (LOGGER.isErrorEnabled()) {
          LOGGER.error("No User-Agent string set (http.agent.name)!");
        }
      }

      StringBuffer buf = new StringBuffer();

      // name and version are joined as "name/version"
      buf.append(agentName);
      if (agentVersion != null) {
        buf.append("/");
        buf.append(agentVersion);
      }

      // description, URL and email go in parentheses after the name
      if (((agentDesc != null) && (agentDesc.length() != 0))
          || ((agentEmail != null) && (agentEmail.length() != 0))
          || ((agentURL != null) && (agentURL.length() != 0))) {
        buf.append(" (");

        if ((agentDesc != null) && (agentDesc.length() != 0)) {
          buf.append(agentDesc);
          if ((agentURL != null) || (agentEmail != null))
            buf.append("; ");
        }

        if ((agentURL != null) && (agentURL.length() != 0)) {
          buf.append(agentURL);
          if (agentEmail != null)
            buf.append("; ");
        }

        if ((agentEmail != null) && (agentEmail.length() != 0))
          buf.append(agentEmail);

        buf.append(")");
      }
      return buf.toString();
    }
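To see what the joining rule produces in practice, the following is a simplified, standalone Java sketch of the same idea. It is not the Nutch source: the class name `AgentStringSketch` is hypothetical, the example crawler values are made up, and the sketch joins only the non-empty optional parts with an assumed "; " separator.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical re-implementation of the User-Agent joining rule. */
final class AgentStringSketch {

    static String build(String name, String version, String desc, String url, String email) {
        // The name and version are always joined as "name/version".
        StringBuilder buf = new StringBuilder(name);
        if (version != null) {
            buf.append('/').append(version);
        }
        // Collect the optional parts that actually carry text.
        List<String> extras = new ArrayList<>();
        for (String part : new String[] {desc, url, email}) {
            if (part != null && !part.isEmpty()) {
                extras.add(part);
            }
        }
        // The optional parts go in parentheses after the name/version pair.
        if (!extras.isEmpty()) {
            buf.append(" (").append(String.join("; ", extras)).append(')');
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // A crawler that identifies itself (all values are made-up examples).
        System.out.println(build("MyCrawler", "Nutch-1.7", "test crawler",
            "http://example.com/bot", "info at example dot com"));
        // MyCrawler/Nutch-1.7 (test crawler; http://example.com/bot; info at example dot com)

        // With only name and version set, the result is just "name/version".
        System.out.println(build("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko",
            "20100101 Firefox/27.0", "", "", ""));
        // Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0
    }
}
```

The second call shows the property that matters for browser simulation: when the description, URL, and email are left empty, the header is exactly the agent name, a "/", and the agent version.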

The User-Agent request header is then written into the request by the class nutch2.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java. The value returned by http.getUserAgent() here is the userAgent field set in HttpBase.java:

    String userAgent = http.getUserAgent();
    if ((userAgent == null) || (userAgent.length() == 0)) {
      // No User-Agent configured: log an error and send no header
      if (Http.LOG.isErrorEnabled()) {
        Http.LOG.error("User-agent is not set!");
      }
    } else {
      // Append the header line to the raw request text
      reqStr.append("User-Agent: ");
      reqStr.append(userAgent);
      reqStr.append("\r\n");
    }
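For context, the snippet above appends one header line to a request that is assembled as plain text. The sketch below shows, under simplifying assumptions, what a complete minimal request with that header could look like. The class `RawRequestSketch` and the method `buildGet` are hypothetical, and the real Nutch request contains further headers that are not reproduced here.

```java
/** Hypothetical sketch: assembling a minimal raw HTTP GET request as text. */
final class RawRequestSketch {

    static String buildGet(String host, String path, String userAgent) {
        StringBuilder req = new StringBuilder();
        // Request line and Host header, each terminated by CRLF.
        req.append("GET ").append(path).append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append("\r\n");
        // The User-Agent header is only sent when a value is configured.
        if (userAgent != null && !userAgent.isEmpty()) {
            req.append("User-Agent: ").append(userAgent).append("\r\n");
        }
        // An empty line ends the header section.
        req.append("\r\n");
        return req.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildGet("yangshangchuan.iteye.com", "/",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

Whatever string the configuration produces is sent to the server verbatim on the `User-Agent:` line, so the server cannot tell it apart from the same string sent by a real browser.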

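Based on this analysis, adding one of the following configurations to nutch-site.xml is enough to imitate a specific browser. Because HttpBase joins http.agent.name and http.agent.version with a "/", each browser User-Agent string is split at a "/" across the two properties: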

1. Simulate Firefox browser:

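    <property>
      <name>http.agent.name</name>
      <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
    </property>
    <property>
      <name>http.agent.version</name>
      <value>20100101 Firefox/27.0</value>
    </property>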

2. Simulate IE browser:

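    <property>
      <name>http.agent.name</name>
      <value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
    </property>
    <property>
      <name>http.agent.version</name>
      <value>6.0)</value>
    </property>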

3. Simulate Chrome browser:

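    <property>
      <name>http.agent.name</name>
      <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
    </property>
    <property>
      <name>http.agent.version</name>
      <value>537.36</value>
    </property>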

4. Simulate Safari browser:

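    <property>
      <name>http.agent.name</name>
      <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
    </property>
    <property>
      <name>http.agent.version</name>
      <value>534.57.2</value>
    </property>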

5. Simulate Opera browser:

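    <property>
      <name>http.agent.name</name>
      <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
    </property>
    <property>
      <name>http.agent.version</name>
      <value>19.0.1326.59</value>
    </property>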
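The same pattern can be applied to any other browser: take its full User-Agent string and split it at a "/" into a name part and a version part. The Java sketch below does this at the last "/" in the string. The class `AgentSplitSketch` and its method are hypothetical helpers, not part of Nutch, and splitting at the last "/" is just one possible choice (the Firefox example above splits at an earlier "/", which yields the same header once the two parts are rejoined).

```java
/** Hypothetical helper: split a browser User-Agent into name and version parts. */
final class AgentSplitSketch {

    /** Returns {name, version} such that name + "/" + version equals the input. */
    static String[] split(String browserUserAgent) {
        int slash = browserUserAgent.lastIndexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("no '/' in User-Agent: " + browserUserAgent);
        }
        return new String[] {
            browserUserAgent.substring(0, slash),   // value for http.agent.name
            browserUserAgent.substring(slash + 1)   // value for http.agent.version
        };
    }

    public static void main(String[] args) {
        String chrome = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36";
        String[] parts = split(chrome);
        System.out.println("http.agent.name    = " + parts[0]);
        System.out.println("http.agent.version = " + parts[1]);
    }
}
```

For the Chrome string used in the example, this reproduces the name and version values listed in configuration 3 above.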

Postscript: sites where you can view your browser's User-Agent string:

1. http://www.useragentstring.com

2. http://whatsmyuseragent.com

3. http://www.enhanceie.com/ua.aspx
