

What tips does robots.txt offer for crawling a website quickly?

2025-01-21 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Today I'll talk about how robots.txt can help you crawl a website quickly, something many people may not be familiar with. To make it easier to follow, I've summarized the key points below; I hope you get something out of this article.

When I hit a bottleneck crawling a website and want a shortcut around it, I often look at the site's robots.txt file first; sometimes it opens another door for the crawl.

Writing crawlers comes with plenty of headaches, for example:

1. Your access frequency is too high and you get restricted.

2. How to find a large number of URLs on a site.

3. How to crawl a site's newly generated URLs, and so on.

These problems haunt crawler authors. They are not an issue if you have a large pool of distinct IPs and accounts, but most companies don't have that luxury.

Most crawlers we write at work are one-off, temporary tasks that need to be finished quickly. When you run into the situations above, try taking a look at the robots.txt file.

Here's an example:

Your boss assigns you a task: capture Douban's new film reviews, book reviews, group posts, city posts and personal diaries every day.

Think about how big this task is. Douban has 160 million registered users, and just for the personal-diary part you would have to visit every user's home page at least once a day.

That is 160 million visits a day, before you even count group and city posts. Spread over 86,400 seconds, that works out to roughly 1,850 requests per second, sustained around the clock.

Designing a conventional crawler for that with only a few dozen IPs is impossible.

A preliminary study of robots.txt

When the boss hands you a task like this, with just you and maybe one other person, how do you get it done? Don't try to explain the technical details to the boss; he doesn't understand them, he just wants results.

Let's take a look at Douban's robots.txt

https://www.douban.com/robots.txt

Look at the red box in the screenshot above: there are two sitemap files.

Open the sitemap_updated_index file and take a look:

Inside it are compressed files, and inside those are the URLs of Douban's new film reviews, book reviews, posts and so on. If you are curious, open the compressed files and take a look.
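If you want to do this in code, here is a minimal sketch in Python (assuming the requests library is installed; the site may also require extra headers or polite rate limiting beyond what is shown). It reads robots.txt, pulls out the Sitemap: lines, then opens each sitemap index and lists the child sitemaps it points to.

import gzip

import requests
import xml.etree.ElementTree as ET

HEADERS = {"User-Agent": "Mozilla/5.0"}  # some sites reject the default UA
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_url):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt file."""
    text = requests.get(robots_url, headers=HEADERS, timeout=10).text
    return [line.split(":", 1)[1].strip()
            for line in text.splitlines()
            if line.lower().startswith("sitemap:")]


def urls_in_sitemap(sitemap_url):
    """Yield every <loc> in a sitemap or sitemap index (gzipped or plain XML)."""
    data = requests.get(sitemap_url, headers=HEADERS, timeout=30).content
    if sitemap_url.endswith(".gz"):
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    for loc in root.iter(SITEMAP_NS + "loc"):
        yield loc.text.strip()


if __name__ == "__main__":
    for index_url in sitemaps_from_robots("https://www.douban.com/robots.txt"):
        print("sitemap index:", index_url)
        # each child sitemap (often .xml.gz) lists the actual page URLs:
        # film reviews, book reviews, posts and so on
        for child in urls_in_sitemap(index_url):
            print("  child sitemap:", child)

The same urls_in_sitemap() helper works for both a sitemap index and the child sitemaps it points to, since both use the standard sitemap XML namespace.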

In other words, you only need to fetch the sitemap files listed in robots.txt every day to know what new URLs have been generated.

You no longer have to walk through hundreds of millions of links on Douban. This not only saves crawling time and crawler design complexity, it also reduces the bandwidth load on Douban's servers. A win-win.

So the sitemap files referenced in robots.txt give us a recipe for crawling a site's newly generated URLs. Following the same line of thinking, you can also solve the problem of finding a large number of URLs on a site.
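To turn this into the daily "what's new" check, you only need to remember which URLs earlier runs have already seen. A minimal sketch, assuming the two helpers from the previous snippet have been saved in a (hypothetical) sitemap_utils.py; the seen-URLs file name is also just illustrative.

from pathlib import Path

from sitemap_utils import sitemaps_from_robots, urls_in_sitemap

SEEN_FILE = Path("douban_seen_urls.txt")  # remembers what earlier runs found


def new_urls_today(robots_url):
    """Return only the URLs that were not present in any earlier run."""
    seen = set(SEEN_FILE.read_text().splitlines()) if SEEN_FILE.exists() else set()
    fresh = []
    for index_url in sitemaps_from_robots(robots_url):
        for child in urls_in_sitemap(index_url):    # child sitemaps in the index
            for url in urls_in_sitemap(child):      # actual page URLs
                if url not in seen:
                    fresh.append(url)
                    seen.add(url)
    SEEN_FILE.write_text("\n".join(sorted(seen)))
    return fresh


if __name__ == "__main__":
    for url in new_urls_today("https://www.douban.com/robots.txt"):
        print(url)

Run it once a day (from cron, for instance) and each run prints only the URLs generated since the previous run.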

Another example:

The boss gives you another task. Last time, for Douban, you said crawling it every day would need a lot of IPs; this time he gives you 1,000 IPs and asks you to grab the business registration information of the tens of millions of companies on Tianyancha.

You drool over so many IPs, but after analyzing the site you find that sites like this have very few crawl entry points (an entry point being a channel page, the kind of page that aggregates many links).

The existing stock of URLs is quickly exhausted, and all those IPs sit there without enough work to keep them busy.

If you could find tens of thousands or even hundreds of thousands of URLs on the site at a time, you could keep all those IPs fully occupied instead of idle.

Let's take a look at its robots.txt file:

https://www.tianyancha.com/robots.txt

Open the sitemap in the red box and you'll find 30,000 company URLs inside. The sitemap shown above was generated on January 3; its URL is built from the year, month and day. Change the date in the URL to January 2 and you can see the tens of thousands of company URLs in January 2's sitemap. In this way you can collect more than 100,000 seed URLs to crawl.

PS: this sitemap also solves the problem of finding newly updated URLs.
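As a sketch of the date-based trick, you could generate the daily sitemap URLs yourself and walk backwards through recent days. The exact URL pattern is not shown here, so DATE_PATTERN below is only a hypothetical placeholder; replace it with whatever the site's robots.txt actually lists. It reuses urls_in_sitemap() from the first sketch (assumed saved in sitemap_utils.py).

from datetime import date, timedelta

from sitemap_utils import urls_in_sitemap  # helper from the first sketch

# Hypothetical URL pattern -- replace with the one actually listed in robots.txt.
DATE_PATTERN = "https://www.tianyancha.com/sitemap/company-{:%Y%m%d}.xml"


def collect_seed_urls(days_back=10):
    """Gather company URLs from the last `days_back` daily sitemaps."""
    seeds = []
    for offset in range(days_back):
        day = date.today() - timedelta(days=offset)
        sitemap_url = DATE_PATTERN.format(day)
        try:
            seeds.extend(urls_in_sitemap(sitemap_url))
        except Exception as exc:   # a day may be missing or fail to parse
            print("skipping", sitemap_url, "->", exc)
    return seeds


if __name__ == "__main__":
    urls = collect_seed_urls(days_back=10)
    print(len(urls), "seed company URLs collected")

Ten days at roughly 30,000 company URLs each already gives you the 100,000+ seed URLs mentioned above.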

A small trick that reduces both the complexity of your crawler design and the bandwidth consumed on the other side.

After reading the above, do you have a better understanding of how robots.txt can help you crawl websites quickly? If you want to learn more, please follow the industry information channel. Thank you for your support.



