This article looks at how to minimize the crawling and indexing of invalid URLs in website optimization, a question many people run into in day-to-day SEO work. The discussion below lays out the problem and the main ways of dealing with it, along with the drawbacks of each.
To put it simply, the post points to a serious and very real SEO problem: many websites, especially B2C sites, have product filtering systems (selecting by brand, price, size, performance, other parameters and so on) that generate a huge number of invalid URLs. They are "invalid" only from an SEO point of view: these URLs produce no SEO benefit and in fact have a negative effect, so you do not want them indexed. The reasons include:
The content of many filter pages is duplicated or very similar (large amounts of duplicate content lower the overall quality of the site)
Many filter pages have no matching products and therefore no content (for example, selecting "42-inch LED TV under 100 RMB")
Most filter pages have no ranking ability (far lower than category pages) yet still consume a certain amount of link weight
These filter pages are not a necessary path to the product pages (product pages should have other internal links to help them get crawled and indexed)
Crawling huge numbers of filter pages badly wastes spider crawl time, reducing the chance that useful pages get indexed (the number of possible filter combinations is enormous)
So how do you keep these URLs from being crawled, indexed and included? Unfortunately, I cannot think of a perfect solution at the moment. Two methods come to mind, but I do not think either solves the problem completely.
The first method is to keep the URLs you do not want indexed as dynamic URLs, even deliberately making them as dynamic as possible, to keep them from being crawled and indexed. However, search engines can now crawl and index dynamic URLs, and this is less and less of a technical obstacle. Although a large number of parameters is to some extent unfavorable for indexing, URLs with four or five parameters can usually still be indexed. We cannot confirm how many parameters are needed to prevent indexing, so this cannot be counted as a reliable method. Also, these URLs still receive internal links, have no real ranking ability, and therefore still waste a certain amount of link weight.
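For illustration only (the domain and parameter names here are made up), a dynamic, multi-parameter filter URL of the kind discussed above might look like this:

```
https://www.example.com/tv/list?brand=sony&price=0-1000&size=42&panel=led&sort=price&page=3
```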
The second method is to disallow crawling with robots. Again, these URLs receive link weight through the internal links pointing at them, and because the robots file forbids crawling them, that weight cannot be passed on (the search engine cannot see the outgoing links without crawling the page). The page becomes a black hole that link weight can enter but never leave.
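As a minimal sketch, assuming hypothetically that the filter pages are ordinary listing URLs carrying query parameters such as brand, price and size, the robots.txt rules could look something like this (both Googlebot and Baiduspider understand the * wildcard):

```
# robots.txt sketch; the parameter names are hypothetical
User-agent: *
# Block any listing URL that carries a filter parameter
Disallow: /*?*brand=
Disallow: /*?*price=
Disallow: /*?*size=
```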
Adding nofollow to the links pointing at these URLs is not perfect either. Similar to a robots disallow, the effect of nofollow in Google is that these URLs do not receive weight, but that weight is not reassigned to other links either, so it is wasted all the same. Baidu is said to support nofollow, but how it handles the weight is unknown.
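For reference, a filter link carrying nofollow would look like this (the URL and anchor text are made up):

```html
<!-- Sketch: a filter link marked nofollow -->
<a href="/tv/list?brand=sony&amp;price=0-1000" rel="nofollow">Sony, under 1000 RMB</a>
```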
Putting these URL links in Flash or JavaScript is useless. Search engines can already crawl links in Flash and JS, and can be expected to get better and better at it. What many SEOs overlook is that links in JS are not only crawled, they can also pass weight, just like normal links.
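As a quick illustration of the point (the URL is hypothetical), here is a filter link written into the page by JavaScript; because modern crawlers render JS, it can still be discovered and treated much like a normal link:

```html
<!-- Sketch: a JS-injected filter link, still discoverable by crawlers that render JavaScript -->
<div id="filters"></div>
<script>
  document.getElementById('filters').innerHTML =
    '<a href="/tv/list?brand=sony">Sony</a>';
</script>
```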
You can also implement the filter links with AJAX, so that clicking does not take the user to a new URL, or only appends a # fragment to the current URL, which is not treated as a different URL. As with the JS issue, though, search engines are actively trying to crawl and parse AJAX content, so this is not safe either.
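A rough sketch of the AJAX approach is shown below; the endpoint and function name are hypothetical, and only a # fragment is added to the URL, so no separate filter URL exists from the browser's point of view:

```html
<!-- Sketch: the filter is applied via AJAX; only a fragment is appended to the URL -->
<a href="#brand=sony&price=0-1000"
   onclick="loadFilteredResults('brand=sony&price=0-1000'); return false;">
  Sony, under 1000 RMB
</a>
<div id="product-list"><!-- filtered product list rendered here --></div>
<script>
  function loadFilteredResults(query) {
    // Fetch the filtered product list from a hypothetical endpoint and render it in place
    fetch('/api/products?' + query)
      .then(function (resp) { return resp.text(); })
      .then(function (html) { document.getElementById('product-list').innerHTML = html; });
  }
</script>
```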
Another way is to add a noindex,follow tag to the head section of these pages, meaning: do not index this page, but do follow the links on it. This solves the duplicate-content problem and the weight black hole problem (weight can flow out through the page's outgoing links). What it cannot solve is the waste of spider crawl time: these pages still have to be crawled (only then can the spider see the noindex,follow tag in the page HTML). For some sites the number of filter pages is so huge that crawling them leaves the spiders without enough time to crawl the useful pages.
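Concretely, the tag placed in the head of each filter page would be:

```html
<!-- Placed in the <head> of each filter page: do not index this page, but follow its links -->
<meta name="robots" content="noindex,follow">
```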
Another method worth considering is cloaking, that is, using a program to detect the visitor: if it is a search engine spider, return a version of the page with the filter links removed; if it is a normal user, return the normal page with the filter conditions. This would be an ideal solution; the only problem is that it may be treated as cheating. The highest principle search engines give SEOs for judging cheating is: would you do this if search engines did not exist? In other words, is a given method done purely for the search engines? Clearly, using cloaking to hide URLs you do not want crawled is done for the search engines, not for users. Although the intent of cloaking in this case is benign and harmless, there is still risk, and only the bold should try it.
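For what it is worth, a very rough sketch of this kind of user-agent detection is shown below; every name in it is hypothetical, and, as said above, the approach carries a real risk of being treated as cheating:

```python
# Sketch of user-agent-based cloaking for filter links; all names are hypothetical.
# The article itself warns this approach may be treated as cheating by search engines.
SPIDER_SIGNATURES = ("googlebot", "baiduspider", "bingbot")

def is_spider(user_agent: str) -> bool:
    """Very rough spider detection by User-Agent substring."""
    ua = user_agent.lower()
    return any(sig in ua for sig in SPIDER_SIGNATURES)

def filter_links_html(filter_links) -> str:
    """Build the filter-condition link block shown to normal users."""
    return "".join(f'<a href="{url}">{label}</a>' for label, url in filter_links)

def render_listing_page(user_agent: str, filter_links) -> str:
    """Return the category page; spiders get it without the filter links."""
    filters = "" if is_spider(user_agent) else filter_links_html(filter_links)
    return f"<html><body>{filters}<!-- product list would go here --></body></html>"

# Example: a spider sees no filter links, a normal browser sees them all.
links = [("Sony", "/tv/list?brand=sony"), ("Under 1000 RMB", "/tv/list?price=0-1000")]
print(render_listing_page("Mozilla/5.0 (compatible; Googlebot/2.1)", links))
print(render_listing_page("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0", links))
```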
Yet another way is the canonical tag. The biggest problem is that it is unknown whether Baidu supports it, and the canonical tag is a suggestion to the search engine, not a directive; in other words, the search engine may choose not to obey it, in which case it is useless. In addition, the canonical tag is meant to designate a canonical URL among duplicates, and it is doubtful whether filter pages qualify, since the content on these pages is often different.
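For reference, a filter page would declare its parent category page as canonical roughly like this (the URLs are hypothetical):

```html
<!-- Sketch: on a page such as /tv/list?brand=sony&price=0-1000,
     the plain category page is declared as the canonical URL -->
<link rel="canonical" href="https://www.example.com/tv/list">
```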
At present, one of the better methods is iframe plus a robots disallow. Putting part of the filtering code into an iframe amounts to calling the content of another file; to search engines that content does not belong to the current page, which in effect hides it. But not belonging to the current page does not mean it does not exist: search engines can still discover the content and links inside the iframe and may still crawl those URLs, so a robots disallow is added to block the crawling. The content in the iframe still leaks some weight, but because the links in the iframe draw weight from the called file rather than from the current page, the loss is relatively small. Apart from headaches such as layout and browser compatibility, a potential problem with the iframe approach is the perceived risk of cheating. Search engines generally do not treat iframes as cheating today, and plenty of ads are served in iframes, but there is a subtle difference between hiding a pile of links and hiding ads. Going back to the search engines' general principle for judging cheating, it is hard to argue this is not being done specifically for the search engines. Remember that Matt Cutts has said Google may change the way it handles iframes in the future; they still want to see, on one page, everything an ordinary user can see.
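A minimal sketch of the setup, with hypothetical paths, would be:

```html
<!-- Sketch: the filter block lives in a separate file pulled in by iframe -->
<iframe src="/widgets/filter-links.html" width="300" height="600"></iframe>

<!-- robots.txt then blocks the called file so the links inside it are not crawled:
       User-agent: *
       Disallow: /widgets/filter-links.html
-->
```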
In short, I do not have a perfect answer to this realistic and serious question. Of course, not having a perfect solution does not mean you cannot get by. Different sites have different SEO priorities; analyse the specific situation, and using one or more of the methods above should solve the main problems.
And the biggest problem is not any of the above, but that sometimes you actually want these filter pages to be crawled and indexed; that is where the real trouble begins. More on that another time.
That concludes this look at how to minimize invalid URL crawling and indexing in website optimization. I hope it helps clear up your doubts; combining the theory with practice on your own site will make it stick, so go and try it.