2025-01-18 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
How do you use the robots.txt file? Many people who are new to it are unsure, so this article summarizes what the file does, how it is formatted, and how to use it in practice.
Search engines automatically visit web pages on the Internet and collect their content through a program called a robot (also known as a spider).
You can create a plain text file named robots.txt on your website to declare the parts of the site that you do not want robots to access. This keeps some or all of the site's content from being included by search engines, or restricts search engines to the content you specify. The robots.txt file must be placed in the root directory of the website.
When a search robot (sometimes called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines its scope of access from the contents of the file; if it does not, the robot simply follows the links it finds and crawls the site.
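Because the file always lives in the root directory, a robot can derive its location from any page URL on the site. A minimal Python sketch of that step follows; the function name robots_url and the example.com address are illustrative assumptions, not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only for illustration.
print(robots_url("https://www.example.com/blog/2024/post.html?id=7"))
# https://www.example.com/robots.txt
```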
Format of the robots.txt file:
The "robots.txt" file contains one or more records separated by blank lines (lines are terminated by CR, CR/NL, or NL). Each record has the following format:
"<field>:<optionalspace><value><optionalspace>"
You can use # for comments in this file, following the usual UNIX convention. A record usually starts with one or more User-agent lines, followed by several Disallow lines, as follows:
User-agent:
The value of this field is the name of the search engine robot. If the "robots.txt" file contains multiple User-agent records, multiple robots are restricted by the protocol; the file must contain at least one User-agent record. If the value is set to *, the record applies to any robot, and the file may contain only one "User-agent: *" record.
Disallow:
The value of this field is a URL that you do not want accessed. It can be a complete path or a partial one; any URL that begins with the Disallow value will not be accessed by the robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, while "Disallow: /help/" allows the robot to access /help.html but not /help/index.html. An empty Disallow record means that all parts of the site may be accessed. The "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" is an empty file, the site is open to all search engine robots.
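The record layout and the prefix rule for Disallow can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a complete parser: the helper names parse_records and is_blocked, the sample file, and the robot name OtherBot are all made up for the example.

```python
def parse_records(text):
    """Split robots.txt text into records, each a list of (field, value) pairs."""
    records, current = [], []
    for raw in text.splitlines():            # handles CR, CR/NL and NL endings
        if not raw.strip():                  # a blank line closes the record
            if current:
                records.append(current)
                current = []
            continue
        line = raw.split("#", 1)[0].strip()  # drop "#" comments
        if ":" not in line:
            continue                         # comment-only or malformed line
        field, value = line.split(":", 1)
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records


def is_blocked(record, path):
    """True if any non-empty Disallow value in the record is a prefix of path."""
    return any(
        field == "disallow" and value and path.startswith(value)
        for field, value in record
    )


# Hypothetical file contrasting "/help" with "/help/".
SAMPLE = """\
# help pages
User-agent: *
Disallow: /help   # no trailing slash

User-agent: OtherBot
Disallow: /help/
"""

first, second = parse_records(SAMPLE)
print(is_blocked(first, "/help.html"), is_blocked(first, "/help/index.html"))
# True True
print(is_blocked(second, "/help.html"), is_blocked(second, "/help/index.html"))
# False True
```

An empty Disallow value never blocks anything in this sketch, which mirrors the rule that an empty Disallow record leaves the whole site open.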
Examples of robots.txt file usage:
Example 1. Prohibit all search engines from accessing any part of the website
User-agent: *
Disallow: /
Example 2. Allow all robots access (alternatively, create an empty "/robots.txt" file)
User-agent: *
Disallow:
Example 3. Prohibit a particular search engine from accessing the site
User-agent: BadBot
Disallow: /
Example 4. Allow only a particular search engine to access the site
User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /
Example 5. A simple example
In this example, the site has three directories that are closed to search engines, so search engines will not access them. Note that each directory must be declared on its own line; they cannot be combined as "Disallow: /cgi-bin/ /tmp/". Also note that the * after User-agent: is a special value meaning "any robot". The basic format has no general wildcard, so records such as "Disallow: /tmp/*" or "Disallow: *.gif" are not valid in it.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
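One way to check examples like these before publishing them is Python's standard urllib.robotparser module. The sketch below feeds Example 4 and Example 5 to it; the site address, the robot name OtherBot, and the tested paths are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def build(lines):
    """Parse robots.txt lines into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


SITE = "https://www.example.com"  # hypothetical site

# Example 4: only baiduspider may crawl.
example4 = build([
    "User-agent: baiduspider",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
])

# Example 5: three directories closed to every robot.
example5 = build([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /~joe/",
])

print(example4.can_fetch("baiduspider", SITE + "/index.html"))  # True
print(example4.can_fetch("OtherBot", SITE + "/index.html"))     # False
print(example5.can_fetch("OtherBot", SITE + "/tmp/cache.dat"))  # False
print(example5.can_fetch("OtherBot", SITE + "/index.html"))     # True
```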
Robot special parameters:
1. Google
Allow Googlebot:
If you want to block all robots except Googlebot from accessing your pages, you can use the following syntax:
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Googlebot follows the record addressed to it specifically rather than the record addressed to all robots.
"Allow" extension:
Googlebot recognizes an extension to the robots.txt standard called "Allow". Robots from other search engines may not recognize this extension, so check with the other search engines you care about. The "Allow" line works exactly like the "Disallow" line: simply list the directories or pages you want to allow.
You can also use "Disallow" and "Allow" together. For example, to block all pages in a subdirectory except one, you can use the following entries:
User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
These entries block all pages in the folder1 directory except myfile.html.
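The reason myfile.html stays reachable is that the more specific rule takes precedence. The Python sketch below assumes the longest-match precedence that Google documents for its crawler (the longest matching path wins, and Allow wins a tie); other robots may resolve conflicts differently, and the helper name is_allowed is hypothetical.

```python
def is_allowed(rules, path):
    """Decide access for path, given (directive, prefix) rules.

    Assumes longest-match precedence: the longest matching prefix wins,
    and "allow" wins when an allow and a disallow prefix are equally long.
    """
    best_len, verdict = -1, True          # no matching rule means allowed
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            allow = directive == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and allow):
                best_len, verdict = len(prefix), allow
    return verdict


rules = [
    ("disallow", "/folder1/"),
    ("allow", "/folder1/myfile.html"),
]

print(is_allowed(rules, "/folder1/myfile.html"))  # True
print(is_allowed(rules, "/folder1/other.html"))   # False
print(is_allowed(rules, "/folder2/page.html"))    # True
```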
If you want to block Googlebot but allow another Google robot, such as Googlebot-Mobile, you can use the "Allow" rule to grant access to that robot. For example:
User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Allow: /
Use * to match a sequence of characters:
You can use an asterisk (*) to match any sequence of characters. For example, to block access to all subdirectories whose names begin with private, use the following entry:
User-agent: Googlebot
Disallow: /private*/
To block all URLs that contain a question mark (?), use the following entries:
User-agent: *
Disallow: /*?*
Use $ to match the end of the URL:
You can use the $ character to match the end of the URL. For example, to block URLs that end in .asp, use the following entry:
User-agent: Googlebot
Disallow: /*.asp$
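To see how * and $ behave, it can help to translate a pattern into a regular expression. The Python sketch below is one simplified way to do that, assuming * matches any run of characters and a trailing $ anchors the end of the URL path; the names pattern_to_regex and matches, and the sample paths, are illustrative.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where each * stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """True if the pattern matches path, starting from the first character."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/private*/", "/private-data/report.html"))  # True
print(matches("/private*/", "/public/private/"))           # False
print(matches("/*.asp$", "/shop/item.asp"))                # True
print(matches("/*.asp$", "/shop/item.aspx"))               # False
```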
You can combine this pattern matching with the Allow directive. For example, if ? indicates a session ID, you can exclude all URLs that contain one so that Googlebot does not crawl duplicate pages. However, a URL that ends with ? may be the version of the page that you do want included. In that case, the robots.txt file can be set up as follows:
User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? line blocks any URL that contains a ? (specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ line allows any URL that ends with a ? (specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, with no characters after the question mark).
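The interaction of these two lines can be traced with a short Python sketch. It assumes, as Google documents for its crawler, that the longest matching pattern wins and that Allow wins a tie; the helper names and the sample paths are hypothetical.

```python
import re

# The two rules from the session-ID example.
RULES = [("allow", "/*?$"), ("disallow", "/*?")]


def to_regex(pattern):
    """Translate * and a trailing $ into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path):
    """Apply RULES to path; the longest matching pattern decides."""
    best_len, verdict = -1, True          # no matching rule means allowed
    for directive, pattern in RULES:
        if to_regex(pattern).match(path):
            allow = directive == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, verdict = len(pattern), allow
    return verdict


print(is_allowed("/page?"))            # True: ends with ?, Allow is longer
print(is_allowed("/page?session=42"))  # False: only Disallow matches
print(is_allowed("/page"))             # True: no rule matches
```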
Sitemap:
A newer way to point robots at a sitemap is to include a link to the sitemap file directly in the robots.txt file.
It looks like this:
Sitemap: http://www.eastsem.com/sitemap.xml
At the time of writing, the search engine companies that support this are Google, Yahoo, Ask and MSN.
However, I suggest also submitting the sitemap through Google Sitemap, which offers many features for analyzing the status of your links.
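A robot or a site audit script can read the Sitemap line back out of the file. The sketch below uses the site_maps() method of Python's standard urllib.robotparser (available from Python 3.8); the Disallow line in the sample is an assumption added only to make the file look realistic.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /tmp/",  # illustrative rule, not from the article
    "Sitemap: http://www.eastsem.com/sitemap.xml",
])

# site_maps() returns a list of sitemap URLs, or None if the file has none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.eastsem.com/sitemap.xml']
```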