This article explains the methods of preventing search engines from indexing website content. The material is simple and clear and easy to learn, so let's study "what are the methods of preventing search engines from indexing website content" together.
First, the robots.txt method
Search engines obey the robots.txt protocol by default. Create a plain-text file named robots.txt, place it in the root directory of the website, and put the following rules in it:
User-agent: *
Disallow: /
The rules above tell search engines not to crawl this site at all. Note that used exactly as written, they block every search engine from visiting any part of the site.
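For readers who want to verify such rules before deploying them, here is a minimal sketch using Python's standard urllib.robotparser module; example.com and the crawler names are placeholders:
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Any crawler that honors robots.txt is refused for any path:
print(rp.can_fetch("Baiduspider", "https://example.com/"))         # False
print(rp.can_fetch("Googlebot", "https://example.com/page.html"))  # False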
How to block only the Baidu search engine from crawling web pages:
1. Edit the robots.txt file as follows:
User-agent: Baiduspider
Disallow: /
The robots.txt above blocks all crawling by Baidu.
What is Baiduspider, the user-agent given here?
Baidu uses a different user-agent for each of its products:
Product name: user-agent
Wireless (mobile) search: Baiduspider
Image search: Baiduspider-image
Video search: Baiduspider-video
News search: Baiduspider-news
Baidu Bookmarks: Baiduspider-favo
Baidu Alliance: Baiduspider-cpro
Business search: Baiduspider-ads
Web search and all other searches: Baiduspider
You can set different crawling rules for each product's user-agent. The following robots.txt blocks all crawling by Baidu except image search, which may crawl the /image/ directory. Note that the image record needs its own Disallow line, since records do not inherit rules from one another:
User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Allow: /image/
Disallow: /
Please note: the pages crawled by Baiduspider-cpro and Baiduspider-ads are not built into the index; those crawlers only perform operations agreed upon with Baidu's customers, and therefore do not obey the robots protocol. If they cause problems, you need to contact Baidu directly to resolve them.
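As a local sanity check, the per-agent rules above can be tested with the same standard-library module. One caveat: urllib.robotparser applies the first record whose agent token appears in the crawler's name, so in this sketch the more specific Baiduspider-image record is listed first; real crawlers such as Baidu's pick the most specific matching record regardless of order. The URLs are placeholders:
from urllib import robotparser

rules = [
    "User-agent: Baiduspider-image",  # listed first: robotparser uses the first
    "Allow: /image/",                 # record whose agent token appears in the
    "Disallow: /",                    # crawler name
    "",
    "User-agent: Baiduspider",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Baiduspider", "https://example.com/page.html"))          # False
print(rp.can_fetch("Baiduspider-image", "https://example.com/image/a.jpg"))  # True
print(rp.can_fetch("Baiduspider-image", "https://example.com/page.html"))    # False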
How to block only the Google search engine from crawling web pages:
Edit the robots.txt file as follows:
User-agent: googlebot
Disallow: /
A few more notes on editing the robots.txt file. Search engines follow the robots.txt protocol by default, and the file must be placed in the root directory of the site. When a search engine visits a website, it first checks whether a file named robots.txt exists in the site's root directory; if it finds the file, it determines the scope of what it may crawl from the file's contents.
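That discovery step can be sketched with the standard library as well; the bot name MyBot and the URLs below are hypothetical:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the file, as a crawler would before fetching pages

if rp.can_fetch("MyBot", "https://example.com/private/report.html"):
    print("allowed to crawl")
else:
    print("blocked by robots.txt")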
User-agent:
The value of this field names the search-engine robot that the record applies to. If the robots.txt file contains multiple User-agent records, then multiple robots are restricted by the protocol, and the file must contain at least one User-agent record. If the value is set to *, the record applies to every robot, and only one "User-agent: *" record may appear in the file.
Disallow:
The value of this field describes a URL prefix that must not be visited. The value may be a complete path or a prefix of one; any URL that begins with the Disallow value will not be visited by the robot. For example, "Disallow: /help" blocks access to both /help.html and /help/index.html, while "Disallow: /help/" lets the robot access /help.html but not /help/index.html. An empty Disallow record means that every part of the site may be visited, and the robots.txt file must contain at least one Disallow record. If robots.txt is an empty file, the site is open to all search-engine robots.
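The /help versus /help/ distinction can be confirmed with a short sketch, again standard library only, with placeholder URLs:
from urllib import robotparser

def allowed(disallow_value, url):
    # Build a fresh parser per rule set; robotparser keeps earlier records otherwise.
    rp = robotparser.RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: " + disallow_value])
    return rp.can_fetch("*", url)

print(allowed("/help",  "https://example.com/help.html"))        # False
print(allowed("/help",  "https://example.com/help/index.html"))  # False
print(allowed("/help/", "https://example.com/help.html"))        # True
print(allowed("/help/", "https://example.com/help/index.html"))  # False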
Here are a few examples of robots.txt usage:
User-agent: *
Disallow: /
Prohibit all search engines from accessing all parts of the site
User-agent: Baiduspider
Disallow: /
Block Baidu from indexing the whole site.
User-agent: Googlebot
Disallow: /
Block Google from indexing the whole site.
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
Block every search engine except Google from indexing the whole site.
User-agent: Baiduspider
Disallow:
User-agent: *
Disallow: /
Block every search engine except Baidu from indexing the whole site.
User-agent: *
Disallow: /css/
Disallow: /admin/
Block all search engines from accessing particular directories (here, the css and admin directories under the root).
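These combinations behave as described. For instance, the "everyone except Baidu" variant can be checked with the following sketch (placeholder URLs again):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Baiduspider",
    "Disallow:",          # empty Disallow: Baidu may access everything
    "",
    "User-agent: *",
    "Disallow: /",        # everyone else is blocked
])

print(rp.can_fetch("Baiduspider", "https://example.com/"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/"))    # False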
Second, the web page code method
Add the following code between the <head> and </head> tags of the site's home page; this tag tells all search engines not to show a snapshot (cached copy) of the page in search results:
<meta name="robots" content="noarchive">
To restrict only the Baidu search engine, add instead:
<meta name="Baiduspider" content="noarchive">
To restrict only the Google search engine, add:
<meta name="googlebot" content="noarchive">
In addition, some special needs come up, such as the following situations:
1. My site has a robots.txt, so why do its pages still turn up in Baidu search?
Because updating the search-engine index database takes time. Although Baiduspider has stopped accessing the pages on your site, it may take months for the index entries already stored in Baidu's database to be cleared. Also check that your robots.txt configuration is correct. If your removal request is urgent, you can also submit it through Baidu's complaint platform.
2. I want my site's content to be indexed by Baidu, but with no snapshot saved. What should I do?
Baiduspider complies with the Internet meta robots protocol. Using the page's meta tags, you can let Baidu index a page but keep its snapshot out of the search results. As with robots.txt changes, updating the search-engine index database takes time: although you have used meta tags to forbid Baidu from showing a snapshot, it may take two to four weeks before entries already in Baidu's index reflect the change online.
To have a page indexed by Baidu but no snapshot of it saved, use the following code:
<meta name="Baiduspider" content="noarchive">
If you want to prevent all search engines from saving snapshots of your pages, the code is as follows:
<meta name="robots" content="noarchive">
Some common code combinations:
<meta name="robots" content="index,follow">: this page may be crawled, and the links on it may be followed and indexed
<meta name="robots" content="noindex,follow">: do not index this page, but the links on it may still be followed and indexed
<meta name="robots" content="index,nofollow">: this page may be crawled, but the links on it must not be followed
<meta name="robots" content="noindex,nofollow">: do not index this page, and do not follow the links on it
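To audit which of these directives a given page declares, you can scan its meta tags. The following sketch uses only Python's standard html.parser module; page_html is a made-up example document:
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Collects the content of <meta name="robots"> and of crawler-specific
    # variants such as Baiduspider and googlebot.
    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in ("robots", "baiduspider", "googlebot"):
            self.directives[name] = attrs.get("content", "")

page_html = """
<html><head>
<meta name="robots" content="index,follow">
<meta name="Baiduspider" content="noarchive">
</head><body>...</body></html>
"""

parser = RobotsMetaParser()
parser.feed(page_html)
print(parser.directives)
# {'robots': 'index,follow', 'baiduspider': 'noarchive'}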
Thank you for reading. That concludes "what are the methods of preventing search engines from indexing website content". After studying this article, you should have a deeper understanding of these methods, though their use still needs to be verified in practice. The editor will keep publishing more articles on related topics; you are welcome to follow!