

Python crawler case: code analysis


This article shares a Python crawler case with a full code analysis. The write-up is detailed and the logic is laid out step by step; read it through and try the code for yourself, and you should get something useful out of it.

Crawl target: Dangdang, http://book.dangdang.com/

Requirement: get the three levels of category names, plus the book title and image URL from each third-level category's listing page.

Step 1: page analysis

Goal: the first-level, second-level, and third-level categories, plus the book names and the src of the book images.

First-level category

1. The first-level category block shows a span tag in the rendered page, but the span is not present in the page source.

2. Some first-level categories have an a tag under the dt tag, so use ./dt//text() with .extract() to get all the text under the dt tag.

All of the categories sit under div[@class="con flq_body"], at div/dl/dt.

Second-level category

div (first level) -> dl[@class="inner_dl"] -> dt//text()

Note: some of these categories also have an a tag under the dt tag, so again use ./dt//text() with .extract().

Third-level category

dl[@class="inner_dl"] -> dd/a/text()

Note: when the image src is the placeholder images/model/guan/url_none.png, you still have to look at the page source; the real image URL is lazy-loaded in the data-original attribute:

data-original="http://img3m0.ddimg.cn/95/11/27854240-1_b_14.jpg"
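
To sanity-check these selectors before writing the spider, they can be tried in scrapy shell. A minimal sketch, assuming Dangdang still serves the same markup as in the analysis above:

scrapy shell http://book.dangdang.com/
# first-level category text (all text nodes under dt)
response.xpath('//div[@class="con flq_body"]/div/dl/dt//text()').extract()
# second-level category text inside the inner_dl blocks
response.xpath('//dl[@class="inner_dl"]/dt//text()').extract()
# third-level category names and links
response.xpath('//dl[@class="inner_dl"]/dd/a/text()').extract()
response.xpath('//dl[@class="inner_dl"]/dd/a/@href').extract()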

Step 2: implementation

1. Create the scrapy project and spider

2. Analyze the page and implement the logic

3. Rewrite the program with scrapy_redis

Idea: first implement an ordinary scrapy crawler, then rewrite it with scrapy_redis.
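
As a rough sketch of what the rewrite changes (the full spider file appears below), only three things differ between the ordinary version and the scrapy_redis version:

# ordinary scrapy spider
# class DangdangSpider(scrapy.Spider):
#     start_urls = ['http://book.dangdang.com/']

# scrapy_redis spider
from scrapy_redis.spiders import RedisSpider       # 1. import RedisSpider

class DangdangSpider(RedisSpider):                  # 2. inherit from RedisSpider
    redis_key = 'dangdang'                          # 3. replace start_urls with a redis key
    ...

The parse logic itself stays the same; only where the start URLs come from changes.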

(I) Preparation

Enter at the terminal:

scrapy startproject book    # the project name must not be the same as the spider name
scrapy genspider dangdang dangdang.com
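
After these two commands the project should look roughly like this (a sketch of the standard scrapy layout; start.py is added by hand in the next step):

book/
    scrapy.cfg
    start.py          # added by hand, next to scrapy.cfg
    book/
        items.py
        pipelines.py
        settings.py
        spiders/
            dangdang.py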

Create a start.py file and put it in the same directory as scrapy.cfg:

# to run the whole program, just run this file
from scrapy import cmdline

# cmdline.execute('scrapy crawl db'.split())
cmdline.execute(['scrapy', 'crawl', 'dangdang'])

Make sure the Redis server is running and can be connected to, then push the start URL onto the request queue:

lpush dangdang:start_urls http://book.dangdang.com/
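
The same push can be done from code instead of the redis-cli command above. A minimal sketch using the redis-py client, assuming Redis runs locally on the default port:

import redis

# assumed default connection settings (localhost:6379, db 0)
r = redis.Redis(host='localhost', port=6379, db=0)
r.lpush('dangdang:start_urls', 'http://book.dangdang.com/')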

(II) settings.py file

Fixed format: the scrapy_redis boilerplate plus the usual request settings.

SPIDER_MODULES = ['book.spiders']
NEWSPIDER_MODULE = 'book.spiders'

# deduplication filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# scheduler queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# data persistence
SCHEDULER_PERSIST = True

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# enable the pipelines
ITEM_PIPELINES = {
    # 'book.pipelines.BookPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

(III) dangdang.py file

import scrapy
from copy import deepcopy
from scrapy_redis.spiders import RedisSpider   # step 1: import the scrapy_redis spider


class DangdangSpider(RedisSpider):   # step 2: change the inherited parent class
    name = 'dangdang'
    allowed_domains = ['dangdang.com']
    # start_urls = ['http://book.dangdang.com/']
    # step 3: replace start_urls with redis_key = '<spider name>'
    redis_key = 'dangdang'

    def parse(self, response):
        div_list = response.xpath('//div[@class="con flq_body"]/div')
        for div in div_list:
            item = {}
            # get the first-level category
            item['b_cate'] = div.xpath('./dl/dt//text()').extract()
            item['b_cate'] = [i.strip() for i in item['b_cate'] if len(i.strip()) > 0]
            dl_list = div.xpath('.//dl[@class="inner_dl"]')
            for dl in dl_list:
                # get the second-level category
                item['m_cate'] = dl.xpath('./dt//text()').extract()
                item['m_cate'] = [i.strip() for i in item['m_cate'] if len(i.strip()) > 0]
                # get the third-level category
                a_list = dl.xpath('./dd/a')
                for a in a_list:
                    item['s_cate'] = a.xpath('./text()').extract_first()
                    item['s_href'] = a.xpath('./@href').extract_first()
                    if item['s_href'] is not None:
                        yield scrapy.Request(url=item['s_href'],
                                             callback=self.parse_book_list,
                                             meta={'item': deepcopy(item)})
                    print(item)

    def parse_book_list(self, response):
        item = response.meta.get('item')
        li_list = response.xpath('//ul[@class="list_aa"]/li')
        for li in li_list:
            # image url; lazy-loaded images show the placeholder images/model/guan/url_none.png
            item['book_img'] = li.xpath('./a[@class="img"]/img/@src').extract_first()
            if item['book_img'] == 'images/model/guan/url_none.png':
                item['book_img'] = li.xpath('./a[@class="img"]/img/@data-original').extract_first()
            # book name (this selector is truncated in the original text; the class value below is a reconstruction)
            item['book_name'] = li.xpath('./a[@class="name"]/a/@title').extract_first()
            # print(item)
            yield item

(IV) items.py file

import scrapy


class BookItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    pass
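
Because the RedisPipeline is enabled in settings.py, scraped items are not written to a local file; they accumulate in a Redis list. A minimal sketch for reading them back, assuming scrapy_redis's default key pattern '<spider name>:items' and a local Redis:

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)   # assumed default connection
# scrapy_redis's RedisPipeline serializes items to JSON in the '<spider>:items' list by default
for raw in r.lrange('dangdang:items', 0, -1):
    item = json.loads(raw)
    print(item.get('b_cate'), item.get('s_cate'), item.get('book_name'))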

That is all the content of "Python crawler case: code analysis". Thank you for reading; hopefully the walkthrough gives you a useful reference.
