2025-01-18 Update From: SLTechnology News & Howtos
Shulou(Shulou.com)06/02 Report--
This article explains in detail how to use CSS selectors in the Scrapy crawler framework. The editor finds it very practical and shares it here for your reference; I hope you get something out of it after reading.
Introduction to CSS selectors
A selector is a pattern in CSS used to pick out the elements that need to be styled. Through CSS selectors, CSS exerts one-to-one, one-to-many, or many-to-one control over the elements of an HTML page.
Basic syntax of CSS selectors
Class selector: matches on the element's class attribute; for example, .box selects elements with class="box"
ID selector: matches on the element's id attribute; for example, #box selects the element with id="box"
Element selector: selects document elements directly; for example, p selects all p elements and div selects all div elements
Attribute selector: selects elements carrying a given attribute; for example, *[title] selects all elements that have a title attribute, and a[href] selects all a elements that have an href attribute
Descendant selector: selects elements nested anywhere inside another element; for example, li a selects all a elements under any li
Child element selector: selects elements that are direct children of another element; for example, h2 > strong selects all strong elements whose parent is an h2
Adjacent sibling selector: selects the element immediately following another element when both share the same parent; for example, h2 + p selects every p element that immediately follows an h2
How to use CSS selectors in Scrapy
Take the a element as an example:
response.css('a'): returns a list of selector objects
response.css('a').extract(): returns the a tags as strings
response.css('a::text').extract_first(): returns the text of the first a tag
response.css('a::attr(href)').extract_first(): returns the value of the href attribute of the first a tag
response.css('a[href*=image]::attr(href)').extract(): returns the href values of all a tags whose href contains "image"
response.css('a[href*=image] img::attr(src)').extract(): returns the src attributes of img tags under those a tags
Expression / meaning:
#box — selects the element with id "box"
.box — selects elements with class "box"
p — selects all p elements
div img — selects img elements inside div elements
div, img — selects all div elements and all img elements
div#box — selects the div element with id "box"
div > p — selects all p elements whose parent is a div
[title~=flower] — selects all elements whose title attribute contains the word "flower"
a[href="/page/2"] — selects a elements whose href attribute is exactly "/page/2"
a[href^="/page"] — selects a elements whose href attribute begins with "/page"
a[href$=".png"] — selects a elements whose href attribute ends with ".png"
In the previous section we used an XPath selector to get today's recommended titles on CSDN. Now let's try to get them with a CSS selector.
# -*- coding: utf-8 -*-
import scrapy

class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['http://www.csdn.net/']

    def parse(self, response):
        result = response.css('.company_list .company_name a::text').extract()
        for i in result:
            print(i)
Let's take a look at the running results of the code and see if we can get the information we want.
Get the jump link and picture address of the element
First, use a CSS selector to extract the element's jump link and the image's src address. Because these paths may be relative, you need the parse.urljoin() method from the urllib library to join the extracted path with the page URL to form an absolute path.
The signature is urljoin(base, url, allow_fragments=True): the base parameter is used as the base address and is combined with the relative-path url in the second parameter to form an absolute URL; the allow_fragments parameter can be set according to your own needs.
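A quick sketch of how urljoin resolves the common cases (the URLs here are invented for illustration):

```python
# Join a page URL with relative paths scraped from it.
from urllib.parse import urljoin

base = 'http://dribbble.com/shots'
print(urljoin(base, '/shots/123'))                     # root-relative path
print(urljoin(base, 'img/1.png'))                      # relative path, resolved against base
print(urljoin(base, 'https://cdn.example.com/1.png'))  # absolute URL is kept as-is
```

Note that an already-absolute second argument simply replaces the base, so urljoin is safe to call on every extracted href or src.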
import scrapy
from urllib import parse

class DribbbleSpider(scrapy.Spider):
    name = 'dribbble'
    allowed_domains = ['dribbble.com']
    start_urls = ['http://dribbble.com/']

    def parse(self, response):
        a_href = response.css('.dribbble-shot .dribbble-over::attr(href)').extract_first('')
        href = parse.urljoin(response.url, a_href)
        print(a_href)
        print(href)

import scrapy
from urllib import parse

class DribbbleSpider(scrapy.Spider):
    name = 'dribbble'
    allowed_domains = ['dribbble.com']
    start_urls = ['http://dribbble.com/']

    def parse(self, response):
        image_src = response.css('img.enrique-image::attr(src)').extract_first('')
        src = parse.urljoin(response.url, image_src)
        print(image_src)
        print(src)
Download and save the picture locally
import scrapy
from urllib import parse
import requests

class DribbbleSpider(scrapy.Spider):
    name = 'dribbble'
    allowed_domains = ['dribbble.com']
    start_urls = ['http://dribbble.com/']

    def parse(self, response):
        image_src = response.css('img.enrique-image::attr(src)').extract_first('')
        src = parse.urljoin(response.url, image_src)
        ret = requests.get(src, stream=True)
        with open('./1.png', 'wb') as f:
            for block in ret.iter_content(chunk_size=1024):
                f.write(block)

How to quickly get the elements in a page
To get an element's XPath selector: select the tag, right-click, then choose Copy > Copy XPath.
To get an element's CSS selector: use Chrome's developer tools with a third-party plug-in; download the CSS Select plug-in, then select the tag element directly.
This is the end of this article on "how to use the CSS selectors of the crawler Scrapy framework". I hope the above content has been of some help to you and that you have learned something new. If you think the article is good, please share it so more people can see it.
Welcome to subscribe to "Shulou Technology Information" to get the latest news, interesting things, and hot topics in the IT industry; it covers the hottest and latest Internet news, technology news, and IT industry trends.
© 2024 shulou.com SLNews company. All rights reserved.