
My first Scrapy program: crawling Dangdang information


Now that Scrapy is installed, let's implement the first test program.

Overview

Scrapy is a crawler framework. Its basic flow, usually drawn as the familiar architecture diagram, is roughly this: the engine takes requests from the spider, schedules them, downloads the pages, hands the responses back to the spider, and passes the items the spider yields on to the item pipeline.

To put it simply, we need to write an items file that defines the data structure to be returned, a spider file that does the actual crawling, and a pipeline file for follow-up operations such as saving the data.

Let's take Dangdang as an example to see how to implement it.

In this example I want to crawl the down jacket products on the first 20 pages of the category listing, collecting each product's name, link, and number of comments.
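The listing pages follow a simple URL pattern: the page number sits in the pgN segment and cid4010275 is the down-jacket category id (both appear in the spider code below). A small sketch of how the 20 URLs look:

# The 20 listing URLs, page 1 through page 20 (pattern taken from dd.py below)
urls = ['http://category.dangdang.com/pg%d-cid4010275.html' % page
        for page in range(1, 21)]
print(urls[0])   # http://category.dangdang.com/pg1-cid4010275.html
print(urls[-1])  # http://category.dangdang.com/pg20-cid4010275.html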

Process

1. Create a Scrapy project

scrapy startproject dangdang

2. Create a crawler file

scrapy genspider -t basic dd dangdang.com

This command automatically creates a spider file dd.py in the project's spiders directory.
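At this point the project layout looks roughly like the sketch below (exact files vary a little between Scrapy versions):

dangdang/
    scrapy.cfg
    dangdang/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dd.py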

3. Write items.py

items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    comment = scrapy.Field()
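As a quick aside (this snippet is only an illustration with placeholder values, not part of the project files): a scrapy.Item behaves like a dict that accepts only the declared fields, which is why the spider below can assign to item['title'], item['url'] and item['comment'] directly.

from dangdang.items import DangdangItem

item = DangdangItem()
item['title'] = ['some product name']      # placeholder values
item['url'] = ['http://example.com/item']
item['comment'] = ['1234']
print(dict(item))       # prints like a plain dict
# item['price'] = 99    # would raise KeyError: 'price' is not a declared field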

4. Write the crawler file dd.py

The second step above already generated a spider template, so we can modify it directly.
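For reference, the skeleton produced by scrapy genspider -t basic typically looks roughly like this (the exact boilerplate differs between Scrapy versions); the finished dd.py below fills in the parse logic and the pagination:

# -*- coding: utf-8 -*-
import scrapy


class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        pass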

dd.py

# -*- coding: utf-8 -*-
import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request


class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://category.dangdang.com/pg1-cid4010275.html']

    def parse(self, response):
        item = DangdangItem()
        # The dd_name values below follow the (translated) original text;
        # check them against the attribute values in the actual page markup.
        item['title'] = response.xpath(u"//a[@dd_name='item title']/text()").extract()
        item['url'] = response.xpath("//a[@dd_name='item title']/@href").extract()
        item['comment'] = response.xpath("//a[@dd_name='item comment']/text()").extract()
        text = response.body
        # content_type = chardet.detect(text)
        # if content_type['encoding'] != 'UTF-8':
        #     text = text.decode(content_type['encoding'])
        #     text = text.encode('utf-8')
        # print(text)
        yield item
        # Pages 2-20 (page 1 is already in start_urls); the upper bound follows
        # from the "first 20 pages" stated above.
        for i in range(2, 21):
            url = 'http://category.dangdang.com/pg%d-cid4010275.html' % i
            yield Request(url, callback=self.parse)
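If you want to sanity-check the XPath expressions before running the full crawl, Scrapy's interactive shell is handy; a quick sketch using the same selectors as dd.py:

# In a terminal, from the project directory:
#   scrapy shell "http://category.dangdang.com/pg1-cid4010275.html"
# Then, at the shell prompt:
response.xpath("//a[@dd_name='item title']/text()").extract()[:3]
response.xpath("//a[@dd_name='item title']/@href").extract()[:3]
response.xpath("//a[@dd_name='item comment']/text()").extract()[:3]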

5. Write pipelines.py

To use the pipeline, the configuration file needs a small change; in the snippet below, 300 is just the pipeline's order value (lower numbers run earlier). While I was at it, I also turned off the robots.txt check.

settings.py

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}

pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class DangdangPipeline(object):
    def process_item(self, item, spider):
        conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root',
                               db='dangdang', use_unicode=True, charset='utf8')
        for i in range(0, len(item['title'])):
            title = item['title'][i]
            link = item['url'][i]
            comment = item['comment'][i]
            print(type(title))
            print(title)
            sql = "insert into dd (title, link, comment) values ('" + title + "','" + link + "','" + comment + "')"
            try:
                conn.query(sql)
            except Exception as err:
                pass
        conn.commit()  # make sure the inserts persist (pymysql does not autocommit by default)
        conn.close()
        return item
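One caveat about the pipeline above: building the SQL by string concatenation breaks as soon as a title contains a quote character. A safer variant (a sketch, not the author's original code) uses pymysql's parameterized queries and opens the connection once per spider run:

# Alternative pipeline sketch: parameterized inserts, one connection per run
import pymysql


class DangdangPipeline(object):
    def open_spider(self, spider):
        self.conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root',
                                    db='dangdang', charset='utf8')

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            for title, link, comment in zip(item['title'], item['url'], item['comment']):
                cursor.execute(
                    "insert into dd (title, link, comment) values (%s, %s, %s)",
                    (title, link, comment))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()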

6. Create the database and table

The scraped data ultimately needs to be saved in MySQL, which Python can talk to through pymysql. I created the database and an empty table ahead of time in the MySQL command-line client.

mysql> create database dangdang;
mysql> use dangdang;
mysql> create table dd (id int auto_increment primary key, title varchar(255), link varchar(255), comment varchar(255));

7. Execution

scrapy crawl dd

If you don't want to see the log output, you can use:

scrapy crawl dd --nolog

8. Test result

test.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root',
                       db='dangdang', use_unicode=True, charset='utf8')
cursor = conn.cursor(cursor=pymysql.cursors.DictCursor)
# SQL query
cursor.execute("select * from dd")
rows = cursor.fetchall()
for row in rows:
    print(row)
conn.close()
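If you only want to confirm how many rows were saved rather than dump them all, a count query is enough; a small sketch:

import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root',
                       db='dangdang', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("select count(*) from dd")
    print(cursor.fetchone()[0], "rows in table dd")
conn.close()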

The test was successful.
