2025-02-25 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Today I will walk you through some tips for running and debugging a Scrapy crawler project. These tricks may not be widely known, so to help you understand them better I have summarized the following for you. I hope you get something out of this article.
Tips on running and debugging the Scrapy crawler project
Set robots.txt compliance to False
In general, before using the Scrapy framework to scrape data, we first need to go into the settings.py file and change "ROBOTSTXT_OBEY = True" to "ROBOTSTXT_OBEY = False".
By default, without any changes, the crawler generated with the settings.py file obeys the site's robots.txt rules, as shown in the following figure.
If the robots.txt rules are obeyed, the crawl results will automatically filter out much of the target information we want, so this parameter needs to be set to False, as shown in the following figure.
After overriding the robots.txt rule, we can obtain more of the web page's information.
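A minimal sketch of the relevant line in settings.py (ROBOTSTXT_OBEY is Scrapy's actual setting name; the rest of the generated file is omitted):

```python
# settings.py (excerpt)
# Scrapy's project template generates this setting as True by default.
# Set it to False so responses are not filtered by the site's robots.txt.
ROBOTSTXT_OBEY = False
```

Note that ignoring robots.txt means the target site has not granted permission to crawl those pages, so use this setting responsibly.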
Debugging with Scrapy shell
Usually, to run a Scrapy crawler we type "scrapy crawl crawler_name" on the command line. Attentive readers will recall that the main.py file created in the previous article can also improve debugging efficiency, but both methods run the Scrapy project from beginning to end and issue a URL request on every run, which is very inefficient. Anyone who has run a Scrapy project knows that Scrapy starts relatively slowly, and on an unstable network it can stall entirely. To avoid rerunning the whole crawler every time, here we introduce Scrapy shell debugging, which gets twice the result with half the effort.
Scrapy provides a shell mode that lets us fetch the source of an entire page inside an interactive shell session. Run it on the command line with the syntax "scrapy shell URL", where URL is the address of the page you want to crawl, as shown in the following figure.
This command starts a debugging session against the given URL. Once it runs, we already have the page content for that URL and can debug inside the shell, without executing the whole Scrapy crawler and issuing a fresh URL request every time.
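For example, the invocation looks like the following (the URL is a placeholder; substitute the page you actually want to crawl):

```shell
# Fetch the page once and drop into an interactive shell.
# Inside the session, Scrapy binds the downloaded page to `response`,
# so you can test selectors repeatedly without re-requesting the URL.
scrapy shell "http://example.com"
```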
Debugging through the shell greatly improves efficiency, and the debugging syntax is identical to the expressions used in the crawler's main file. For example, see the following figure.
By placing the selectors for the two XPath expressions into the scrapy shell debugging session, we can clearly see the extracted target information, skip the repeated step of running the whole Scrapy crawler each time, and improve development efficiency. This method is very common and very practical in Scrapy crawler development; I hope you master it and put it to active use.
After reading the above, do you have a better understanding of the tips for running and debugging a Scrapy crawler project? Thank you for your support.