How to install and use Scrapy

Updated 2025-01-19. Source: SLTechnology News&Howtos (Shulou.com), Internet Technology.

This article answers a common question: how do you install Scrapy and use its basic features? It walks through the process in detail, in the hope of giving readers who face this problem a simpler way through it.

I. A simple example to understand the basics

1. Install the Scrapy framework

On Windows, running pip3 install scrapy directly may fail with an error.

If it does, install the dependencies first:

Install lxml: pip3 install lxml (skip this if it is already installed).

Install pyOpenSSL: download the wheel file from the download page below.

Install Twisted: download the wheel file from the download page below.

Install PyWin32: download the wheel file from the download page below.

Download address: https://www.lfd.uci.edu/~gohlke/pythonlibs/

Use Ctrl+F to search that page for each package name, then install each downloaded wheel with pip3 install followed by the path to the .whl file.

Configure the environment variable: add the directory where scrapy is located to the system environment variable.

Finally, install Scrapy: pip3 install scrapy
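To confirm that the steps above worked, a quick check like the following can help. It is only a sketch: the helper name missing_packages is made up for this example, and the list of import names is an assumption based on the packages mentioned above (pyOpenSSL is imported as OpenSSL, and pywin32 provides win32api).

```python
import importlib.util
import sys


def missing_packages(import_names):
    """Return the names from import_names that the import system cannot find."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# Import names differ from pip names: pyOpenSSL is imported as OpenSSL.
required = ["lxml", "OpenSSL", "twisted", "scrapy"]
if sys.platform == "win32":
    required.append("win32api")  # provided by the pywin32 package

print("Missing:", missing_packages(required) or "nothing")
```

If the script reports a missing name, install that package first and run pip3 install scrapy again.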

2. Create a scrapy project

Create a new directory, then hold Shift, right-click inside it, and choose "Open command window here".

Enter scrapy startproject tutorial to create a project folder named tutorial.

The folder directory is as follows:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

What each file does:

scrapy.cfg: the project configuration file

spiders: stores your Spider files, that is, the .py files that do the crawling

items.py: defines Items, containers for scraped data that behave much like dictionaries

middlewares.py: defines the Downloader Middlewares and Spider Middlewares

pipelines.py: defines the Item Pipelines, which handle data cleaning, validation, and storage

settings.py: the project's global configuration

3. Create a spider (crawler file defined by yourself)

As a learning example, we will crawl the Maoyan hot word-of-mouth movie list.

Create a maoyan.py file under the spiders folder by hand, or open a command window in the project folder (Shift + right-click, "Open command window here") and enter: scrapy genspider <spider name> <domain to crawl>.

A file you create by hand has to be written from scratch, while a file created with the command already contains the basic skeleton.

Let's take a look at what is created with the command.

Here is what each part is for:

name: the name of the spider, which must be unique within the project

allowed_domains: the domain names the spider is allowed to crawl. Links that lead to a different domain are ignored.

start_urls: the URLs the Spider starts crawling from. This defines the initial requests, and there can be more than one.

parse method: a method of the Spider. By default it is called with the response for each URL in start_urls, parses the page, and extracts what you want.

response parameter: the content returned by the request, that is, the page you need to parse.

There are other attributes and methods that you can look up if you are interested.

4. Define Item

Item is a container that holds crawled data in much the same way as a dictionary.

Open items.py. The information we want to extract is:

index (ranking), title (movie title), star (starring actors), releasetime (release time), score (rating)

So we modified the items.py file like this.

That's it.

5. Open the spider again and extract the information we want

Modify it like this:

All right, a simple crawler is done.

6. Run

In the project folder, open a command window (Shift + right-click, "Open command window here") and enter: scrapy crawl maoyan (maoyan is the name of the spider)

The scraped items are printed in the console output.

7. Save

So far we have only run the code to check for errors; nothing has been saved.

To save the results in CSV, XML, or JSON format, use the -o option. In the project folder, open a command window and enter one of:

scrapy crawl maoyan -o maoyan.csv

scrapy crawl maoyan -o maoyan.xml

scrapy crawl maoyan -o maoyan.json

Choose any one of them. Other formats are also supported; these are just the common ones. With the JSON format chosen here, a new maoyan.json file appears in the folder. Opening it shows the Chinese text as what looks like garbled characters, so the export encoding needs to be changed. This can be done in the configuration:

(just add FEED_EXPORT_ENCODING = 'utf-8' to the settings.py file)
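The "garbled" text in the JSON file is most likely a run of \uXXXX escape sequences rather than corrupted data: when no export encoding is set, Scrapy's JSON output escapes non-ASCII characters. The standard library shows the same effect. This sketch does not use Scrapy itself, and the sample title is made up for the illustration.

```python
import json

record = {"title": "霸王别姬", "score": "9.5"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(record)

# With ensure_ascii=False the characters are written as-is
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same data; only the on-disk form differs
print(json.loads(escaped) == json.loads(readable))
```

Setting FEED_EXPORT_ENCODING to utf-8 makes Scrapy write the readable form.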

To set it directly on the command line instead:

scrapy crawl maoyan -o maoyan.json -s FEED_EXPORT_ENCODING=utf-8

That's it.

Try it out for yourself.

It is also possible to save the results automatically at run time, without typing the options each time. That is introduced later, since several of the project files have not been used yet.

II. How does Scrapy parse pages?

I wrote an earlier article on this topic: "The use of three major parsing libraries".

Scrapy also provides its own parsing tool, Selector, which works much like those libraries. Let's take a look:

1. CSS

First import the class: from scrapy import Selector

For example, take a piece of HTML like this:

html = '<html><head><title>Demo</title></head><body><div class="cla">This is Demo</div></body></html>'

1.1. First, build a Selector object and query it:

sel = Selector(text=html)

text = sel.css('.cla::text').extract_first()

.cla selects the div node above, and ::text gets its text; this syntax differs from the libraries covered earlier.

extract_first() returns the first element. sel.css('.cla::text') returns a list, so you could also write sel.css('.cla::text')[0] to get the first element, but that raises an index-out-of-range error when the list is empty, so it is not recommended. extract_first() does not raise an error in that case, and if you write extract_first('123'), it returns 123 when nothing matches.

1.2. Where extract_first() selects the first match, extract() selects all of them. Use it when multiple values are returned.

1.3. To get an attribute, write sel.css('.cla::attr(class)').extract_first(), which gets the value of class.

1.4. To get the text of a node with a given attribute value: sel.css('div[class="cla"]::text')

1.5. Other selectors are written the same way as in standard CSS.

1.6. Scrapy also provides a shortcut. In the simple example above, response is the value returned by requesting the web page.

We can write response.css() directly to parse it and extract the information we want. Likewise, the XPath queries below can be written directly as response.xpath().

2. XPath

For XPath syntax, see the article mentioned above: "The use of three major parsing libraries".

Note: the result is still a list, so add extract_first() or extract().

3. Regular expression matching (using response here)

For example: response.css('a::text').re('your regular expression')

If response.css('a::text') matches multiple nodes, re() returns every match from all of them.

To get only the first match, change re() to re_first().

Note: response cannot call re() directly, but response.xpath('.').re() achieves the effect of applying a regular expression to the whole page.

For regular expression syntax, see: "Omnipotent regular expressions".

III. Using Downloader Middleware

Scrapy itself provides many Downloader Middlewares, but sometimes we need to customize the behavior,

for example, to change the User-Agent or use a proxy IP.

Take changing the User-Agent as an example (setting a proxy IP works in much the same way):

The simplest approach is to add USER_AGENT = 'xxx' directly to settings.py.

But if we want several User-Agent values, with one picked at random for each request, we can use a Downloader Middleware.

The first step is to change USER_AGENT = 'xxx' in settings.py to USER_AGENT = ["xxx", "xxxxx", "xxxxxxx"]

The second step is to add a middleware class to middlewares.py:
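The original listing is not available. A minimal sketch of such a middleware, assuming the class name RandomUserAgentMiddleware (made up for this example) and a USER_AGENT setting that holds a list as described in the first step:

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware; read the list from settings
        return cls(user_agents=crawler.settings.getlist("USER_AGENT"))

    def process_request(self, request, spider):
        # Pick one value at random for every outgoing request
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets the request continue through the remaining middlewares
        return None
```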

from_crawler(): gives access to the configuration through its crawler parameter. Our User-Agent list is in the settings file, so we need to read it from there.

The name of this method cannot be changed.

The third step is to register the middleware in settings.py:
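The original listing is not available. As a sketch, the settings.py entries could look like the following, assuming the project is named tutorial and the custom class is called RandomUserAgentMiddleware (a name made up for this example). The User-Agent strings are placeholders.

```python
# settings.py (fragment)

# Placeholder values; replace them with real User-Agent strings
USER_AGENT = ["xxx", "xxxxx", "xxxxxxx"]

DOWNLOADER_MIDDLEWARES = {
    # None disables Scrapy's built-in User-Agent middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Hypothetical path to the custom middleware in this project
    "tutorial.middlewares.RandomUserAgentMiddleware": 400,
}
```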

Set the value for Scrapy's built-in UserAgentMiddleware to None to disable it.

The custom middleware is set to 400. The smaller the value, the earlier the middleware's process_request() is called.

IV. Using Item Pipeline

1. Clean the data

In the example from Part I, suppose we treat any movie with a score of 8.5 or lower as a bad movie and change its score to "(not good-looking!)". We modify pipelines.py like this:
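The original listing is not available. A minimal sketch of such a cleaning pipeline, assuming the score field holds a numeric string such as "9.5":

```python
class TextPipeline:
    """Replace low scores with a text label."""

    threshold = 8.5

    def process_item(self, item, spider):
        # Assumes item["score"] can be converted to a float
        if float(item["score"]) <= self.threshold:
            item["score"] = "(not good-looking!)"
        # Always return the item so later pipelines receive it
        return item
```

Scrapy calls process_item() once for every item the spider yields.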

Add to settings.py:
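The original listing is not available. A sketch of the entry, assuming the project is named tutorial and the cleaning pipeline class is called TextPipeline; the number 300 is an arbitrary choice for this example.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Hypothetical path; the value sets the order in which pipelines run
    "tutorial.pipelines.TextPipeline": 300,
}
```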

Then run the crawler again.

2. Storage

2.1 Save in JSON format

We modify pipelines.py to look like this:
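The original listing is not available. One possible sketch writes each item as one JSON object per line. The class name JsonPipeline comes from the text below, while the output file name maoyan.json and the path parameter are assumptions made for this example.

```python
import json


class JsonPipeline:
    """Write every item to a file, one JSON object per line."""

    def __init__(self, path="maoyan.json"):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()
```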

Add to settings.py:
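The original listing is not available. A sketch of the setting with both pipelines registered, assuming the project is named tutorial; the numbers 300 and 400 are arbitrary, and only their relative order matters.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Lower values run first: clean the item, then save it
    "tutorial.pipelines.TextPipeline": 300,
    "tutorial.pipelines.JsonPipeline": 400,
}
```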

This means TextPipeline runs first and JsonPipeline second: the data is cleaned first and then saved.

2.2 Save to a MySQL database

First create a database named maoyanreying in MySQL, and in it a table named maoyan.

We modify pipelines.py to look like this:
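The original listing is not available, so this is only a sketch under several assumptions: the PyMySQL driver is installed, the maoyan table has one column per item field, and the connection details come from hypothetical settings named MYSQL_HOST, MYSQL_PORT, MYSQL_USER, MYSQL_PASSWORD, and MYSQL_DATABASE. The class name MysqlPipeline is also made up for this example.

```python
class MysqlPipeline:
    """Insert each item into the maoyan table."""

    table = "maoyan"

    def __init__(self, host, port, user, password, database):
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.database = database

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting names; define them in settings.py
        s = crawler.settings
        return cls(
            host=s.get("MYSQL_HOST", "localhost"),
            port=s.getint("MYSQL_PORT", 3306),
            user=s.get("MYSQL_USER"),
            password=s.get("MYSQL_PASSWORD"),
            database=s.get("MYSQL_DATABASE", "maoyanreying"),
        )

    @classmethod
    def build_insert(cls, data):
        # Backticks are needed because index is a reserved word in MySQL
        columns = ", ".join(f"`{name}`" for name in data)
        placeholders = ", ".join(["%s"] * len(data))
        return f"INSERT INTO {cls.table} ({columns}) VALUES ({placeholders})"

    def open_spider(self, spider):
        # Third-party driver, imported here so the module loads without it
        import pymysql

        self.conn = pymysql.connect(
            host=self.host, port=self.port, user=self.user,
            password=self.password, database=self.database, charset="utf8mb4",
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        data = dict(item)
        # Values are passed separately so the driver escapes them
        self.cursor.execute(self.build_insert(data), tuple(data.values()))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```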

Add to settings.py:
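The original listing is not available. A sketch of the entries, assuming a pipeline class named MysqlPipeline in a project named tutorial. The MYSQL_* names are hypothetical custom settings (Scrapy does not define them), and the credentials are placeholders.

```python
# settings.py (fragment)

# Hypothetical custom settings for the MySQL pipeline to read
MYSQL_HOST = "localhost"
MYSQL_PORT = 3306
MYSQL_USER = "root"               # placeholder
MYSQL_PASSWORD = "your-password"  # placeholder
MYSQL_DATABASE = "maoyanreying"

ITEM_PIPELINES = {
    # Clean first, then store; the numbers are arbitrary, only the order matters
    "tutorial.pipelines.TextPipeline": 300,
    "tutorial.pipelines.MysqlPipeline": 400,
}
```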

That's it.

End.

That covers how to install and use Scrapy. I hope the content above is of some help to you.

