In this issue, the editor will walk you through how to analyze a Robots.txt configuration. The article is rich in content and examines the topic from a professional point of view; I hope you get something out of it after reading.
Robots.txt is a crawler exclusion protocol; as the name suggests, it is a plain txt text file, placed in the root directory of the website. A robots.txt file consists of one or more rules, and each rule forbids (or allows) a specific crawler to crawl a specified file path on the site. Its main purpose is to tell crawlers: here is my website, and here is an agreement about which parts you may look at and which parts you may not.
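For example, assuming a site served at https://example.com (a placeholder domain, not from the original article), crawlers will request the file from the site root:

    https://example.com/robots.txt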
Robots.txt has a specification, in other words a set of rules or a grammar:
Robots.txt must be an ASCII or UTF-8 text file; other encodings are not allowed.
A robots.txt file consists of one or more rules.
Each rule consists of multiple directives (instructions), one directive per line.
Each rule contains the following information: which crawler the rule applies to (that is, the user agent); the directories or files that agent may access; and/or the directories or files that agent may not access.
Rules are processed from top to bottom, and a user agent matches only one rule set, namely the first, most specific rule that matches that user agent.
By default, a user agent may crawl any page or directory that is not blocked by a Disallow: rule.
Rules are case sensitive.
A website can have only one robots.txt file.
There are four main keywords in Robots.txt. Each keyword is placed at the beginning of a line, and an English colon ":" separates it from the directive body.
User-agent: the name of the web crawler the rule applies to
Disallow: directories or web pages that should not be crawled
Allow: directories or web pages that may be crawled
Sitemap: the location of the website's sitemap
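As a minimal sketch using all four keywords (the domain and paths are placeholders, not taken from a real site):

    User-agent: *
    Allow: /public/
    Disallow: /private/
    Sitemap: https://example.com/sitemap.xml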
Next, let me give you a few examples to make this easier to understand.
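The configuration shown in the original figure is not reproduced here; a representative sketch that names two well-known crawlers and blocks each of them from the whole site looks like this:

    User-agent: Baiduspider
    Disallow: /

    User-agent: Googlebot
    Disallow: /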
The above configuration prohibits each of those crawlers from crawling any content on the site. A single crawler can also be given more than one Disallow line, as in the configuration of Baidu's network disk.
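The original screenshot is not available here; as a hedged sketch of what a multi-Disallow rule looks like (the paths are illustrative and not the actual pan.baidu.com file):

    User-agent: Baiduspider
    Disallow: /share/
    Disallow: /wap/
    Disallow: /account/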
That is the shape of the Robots.txt configuration of Baidu network disk. If you want to block all crawlers, configure the following.
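Blocking every crawler from the entire site takes only two lines:

    User-agent: *
    Disallow: /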
Robots.txt also supports wildcard (fuzzy) matching. For example, the following configuration forbids crawling any URL that ends with .xls.
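Using the * wildcard and the $ end-of-URL anchor defined in the robots.txt standard, the rule looks like this:

    User-agent: *
    Disallow: /*.xls$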
Note that if no robots.txt is provided at all, or if it is configured as shown below, everything is allowed.
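An allow-everything file is simply an empty Disallow (equivalently, Allow: /):

    User-agent: *
    Disallow: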
In addition to Robots.txt, we can also use other technical means, such as Nginx configuration, to block crawlers. But relatively speaking, configuring Robots.txt is the most efficient approach.
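As a hedged sketch of the Nginx approach (not taken from the original article; the crawler names in the pattern are illustrative), a server block can return 403 to requests whose User-Agent header matches known crawlers:

    # Inside a server { } block: reject requests from the named crawlers.
    if ($http_user_agent ~* (Baiduspider|Googlebot|bingbot)) {
        return 403;
    }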
The above is how to analyze a Robots.txt configuration. If you happen to have similar questions, the analysis above may help you understand the topic. If you want to learn more, you are welcome to follow the industry information channel.