How to solve website crawling errors

2025-02-24 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/02 Report --

This article explains how to solve website crawling errors. The approach described here is simple, fast, and practical, so interested readers may want to follow along and learn how to fix these errors on their own sites.

How can a page be indexed if it is never crawled, and how can it rank if it is never indexed? Yet a surprising number of websites ignore this obvious problem. Among the clients who come to A5 for SEO diagnostic services, roughly 20% of the websites have crawling errors that seriously hold back their growth. If you happen to be reading this article, I hope you finish it and share it, because it really is worth your time.

He Guijiang: I once diagnosed a site with roughly ten million indexed pages whose index kept being dropped and re-added over and over, and the company could not find the cause. When we examined the website, we found a strange phenomenon:

1. Mistaken blocking

When updating robots.txt through Baidu's webmaster tools, clicking "detect and update" repeatedly sometimes worked and sometimes failed. Under those conditions it is no surprise that URLs which should never be indexed, and which are explicitly disallowed in robots.txt, get indexed anyway and are then deleted later. So what was the real cause? It was not server overload: the firewall had mistakenly blacklisted some of Baiduspider's IP addresses.
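Before blacklisting a suspicious crawler IP in a firewall, it is worth confirming whether it really belongs to Baiduspider. Baidu's documented method is a reverse-DNS check: a genuine Baiduspider IP resolves to a hostname under baidu.com or baidu.jp, and that hostname resolves back to the same IP. Below is a minimal sketch of that check; the sample IP at the bottom is only a placeholder, not taken from this article.

```python
import socket

def is_baiduspider(ip: str) -> bool:
    """Reverse-DNS check for Baiduspider: the PTR hostname should end in
    .baidu.com or .baidu.jp, and resolving that hostname forward should
    return the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        # No PTR record at all: not a verifiable Baiduspider address.
        return False
    if not hostname.endswith((".baidu.com", ".baidu.jp")):
        return False
    try:
        # Forward-confirm: the claimed hostname must map back to the same IP.
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in addresses

if __name__ == "__main__":
    # Placeholder IP; replace with the address you found in your firewall log.
    print(is_baiduspider("180.76.15.1"))
```

If the check passes, the address should not be blocked; if it fails, the visitor is merely pretending to be Baiduspider and can safely stay on the blacklist.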

The example above is a crawling error caused at the robots.txt level. As a webmaster, you should at least check once a week that robots.txt can be updated properly. Now let's look at "page crawl" errors:

2. Server exception

Conventional servers need little discussion; as everyone knows, hosting in Beijing, Shanghai, and Guangzhou is generally fine. But there are some special servers that the vast majority of webmasters probably know nothing about. For example, the "Hong Kong servers" sold by the domestic IDC Western Digital are quite interesting: are they really in Hong Kong? The provider's own data center is on the mainland, so what Hong Kong is there to speak of? A Hong Kong IP is used simply to avoid ICP filing, while all the data stays on the mainland.

What is wrong with that? We found that the site's traffic is routed through a CDN, so even an uploaded image is returned with a "302 status code". Access speed improves, but is that good for SEO? I really do not know what Western Digital, a large domestic IDC provider, is thinking by taking advantage of webmasters who do not know better.
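If you suspect that resources on your own site are being served through an unexpected redirect, a quick way to check is to request them without following redirects and look at the raw status code. A minimal sketch using the requests library follows; the URLs are placeholders to replace with pages and images from your own site.

```python
import requests

def check_status(url: str) -> None:
    """Fetch a URL without following redirects so 301/302 responses stay visible."""
    resp = requests.head(url, allow_redirects=False, timeout=10)
    print(url, resp.status_code, resp.headers.get("Location", ""))

if __name__ == "__main__":
    # Placeholder URLs; substitute your own pages and uploaded images.
    for u in ["https://www.example.com/", "https://www.example.com/uploads/sample.jpg"]:
        check_status(u)
```

A healthy page or image should normally answer 200 directly; a string of 302 responses pointing at a CDN hostname is the symptom described above.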

3. Cannot get the real IP

Larger websites generally use CDN acceleration, but some sites apply the CDN not only to visitors' devices but also to the spider. What is the end result? If a CDN node becomes unstable, the problem is fatal from the crawler's point of view.

The reason many large sites turn on a CDN in the first place is that they are easy targets for attacks, so you can imagine what happens if you do not also set up "spider origin" routing (sending crawler requests straight to the origin server). Has your site enabled a CDN? Log in to the Baidu webmaster platform and check whether the spider can fetch the real IP address.
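A simple first check from your own side is whether your domain currently resolves to a CDN edge node or to your origin server. The sketch below assumes you know your origin IP; both the domain and the IP shown are placeholders. If the resolved addresses never include the origin, crawlers are also being routed through the CDN.

```python
import socket

# Placeholder values; substitute your own domain and the origin server IP you expect.
DOMAIN = "www.example.com"
ORIGIN_IP = "203.0.113.10"

def resolved_ips(domain: str) -> set:
    """Return every IPv4 address the domain currently resolves to."""
    _, _, addresses = socket.gethostbyname_ex(domain)
    return set(addresses)

if __name__ == "__main__":
    ips = resolved_ips(DOMAIN)
    print("Resolved:", ips)
    if ORIGIN_IP not in ips:
        print("Domain resolves only to CDN nodes; the origin IP is not exposed.")
```

This does not replace the check in the Baidu webmaster platform, but it tells you quickly whether crawler traffic could even reach the origin.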

4. Frequent 50X errors

The common feature of such links is that they open normally in a browser, so why does the spider report an error? Because at the moment the crawler initiated the request, the httpcode returned was 5XX. Does your site often have this problem? If so, get your developers on it immediately, or notify your IDC service provider to fix it.
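One way to confirm whether 5XX responses are being served specifically to the crawler is to scan your access log for them. The sketch below assumes an nginx/Apache combined log format and a log path that you will need to adapt to your own server; it simply counts 5XX responses returned to Baiduspider, grouped by URL path.

```python
import re
from collections import Counter

# Assumed log location and format (combined log); adjust to your own setup.
LOG_PATH = "/var/log/nginx/access.log"
LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+)[^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$')

def count_spider_5xx(log_path: str) -> Counter:
    """Count 5XX responses that were returned to Baiduspider, grouped by path."""
    errors = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LINE_RE.search(line)
            if not m:
                continue
            if m.group("status").startswith("5") and "Baiduspider" in m.group("agent"):
                errors[m.group("path")] += 1
    return errors

if __name__ == "__main__":
    for path, n in count_spider_5xx(LOG_PATH).most_common(10):
        print(n, path)
```

If the counts are non-trivial even though the pages open fine in a browser, the errors are intermittent and the server or IDC provider needs to investigate load at crawl time.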

5. An excessive crawl-error ratio

No website can keep this at zero, but there is a limit to everything: in our experience an error ratio below 5% has basically no impact on the site, and such errors should not occur every day. The most common crawl error is a connection timeout: the crawl request establishes a connection, but the page downloads so slowly that it times out, usually because the server is overloaded or bandwidth is insufficient. The remedies (a quick way to measure page size and download time is sketched after the list):

A: compress images at upload time, as far as possible without affecting image quality.

B: reduce the use of JS scripts and similar files, or merge them.

C: control page size, especially for pages with high pageviews and frequent crawls; keeping them under 2MB is recommended.

D: increase the site's bandwidth to improve download speed, or replace the server.
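To see which pages are at risk, you can measure each page's transfer size and download time directly. Here is a minimal sketch using requests; the URL list is a placeholder to point at your own high-traffic pages, and the 2MB threshold mirrors the guideline in point C above.

```python
import time
import requests

# Placeholder list; replace with the high-traffic pages you want to audit.
PAGES = ["https://www.example.com/", "https://www.example.com/hot-article.html"]

SIZE_LIMIT = 2 * 1024 * 1024  # the 2MB guideline mentioned above

for url in PAGES:
    start = time.monotonic()
    resp = requests.get(url, timeout=15)
    elapsed = time.monotonic() - start
    size = len(resp.content)
    flag = "OVER LIMIT" if size > SIZE_LIMIT else "ok"
    print(f"{url}: {size / 1024:.0f} KB in {elapsed:.2f}s ({flag})")
```

Pages that are both large and slow here are the first candidates for image compression, script merging, or a bandwidth upgrade.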

At this point, I believe you have a deeper understanding of how to solve website crawling errors; you might as well put it into practice. For more related content, browse the relevant channels on this site, and follow us to keep learning!
