2025-01-18 Update From: SLTechnology News & Howtos
Shulou(Shulou.com)06/02 Report--
This article explains in detail how to use CSS selectors in the Scrapy crawler framework. The editor finds it very practical and shares it here for your reference; I hope you get something out of it after reading.
Introduction to CSS selectors
A selector is a pattern in CSS used to pick out the elements that need to be styled. Through CSS selectors, CSS exerts one-to-one, one-to-many, or many-to-one control over the elements of an HTML page.
Basic syntax of CSS selectors
Class selector: matches on the element's class attribute; for example, .box selects elements with class="box"
ID selector: matches on the element's id attribute; for example, #box selects the element with id="box"
Element selector: selects document elements directly; for example, p selects all p elements and div selects all div elements
Attribute selector: selects elements carrying a given attribute; for example, *[title] selects all elements that have a title attribute, and a[href] selects all a elements that have an href attribute
Descendant selector: selects elements nested anywhere inside another element; for example, li a selects all a elements under any li
Child element selector: selects elements that are direct children of another element; for example, h2 > strong selects all strong elements whose parent is an h2
Adjacent sibling selector: selects the element immediately following another element when both share the same parent; for example, h2 + p selects every p element that immediately follows an h2
How to use CSS selectors in Scrapy
Take the a element as an example:
response.css('a'): returns a list of selector objects
response.css('a').extract(): returns the a tags as strings
response.css('a::text').extract_first(): returns the text of the first a tag
response.css('a::attr(href)').extract_first(): returns the value of the href attribute of the first a tag
response.css('a[href*=image]::attr(href)').extract(): returns the href values of all a tags whose href contains "image"
response.css('a[href*=image] img::attr(src)').extract(): returns the src attributes of img tags under those a tags
Expression / meaning:
#box — selects the element with id "box"
.box — selects elements with class "box"
p — selects all p elements
div img — selects img elements inside div elements
div, img — selects all div elements and all img elements
div#box — selects the div element with id "box"
div > p — selects all p elements whose parent is a div
[title~=flower] — selects all elements whose title attribute contains the word "flower"
a[href="/page/2"] — selects a elements whose href attribute is exactly "/page/2"
a[href^="/page"] — selects a elements whose href attribute begins with "/page"
a[href$=".png"] — selects a elements whose href attribute ends with ".png"
In the previous section we used an XPath selector to get today's recommended titles on CSDN. Now let's try to get them with a CSS selector.
# -*- coding: utf-8 -*-
import scrapy

class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['http://www.csdn.net/']

    def parse(self, response):
        result = response.css('.company_list .company_name a::text').extract()
        for i in result:
            print(i)
Let's take a look at the running results of the code and see if we can get the information we want.
Get the jump link and picture address of the element
First, use a CSS selector to extract the element's jump link and the image's src address. Because these paths may be relative, you need the parse.urljoin() method from the urllib library to join the extracted path with the page URL to form an absolute path.
The signature is urljoin(base, url, allow_fragments=True): the base parameter is used as the base address and is combined with the relative-path url in the second parameter to form an absolute URL; the allow_fragments parameter can be set according to your own needs.
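A quick sketch of how urljoin resolves the common cases (the URLs here are invented for illustration):

```python
# Join a page URL with relative paths scraped from it.
from urllib.parse import urljoin

base = 'http://dribbble.com/shots'
print(urljoin(base, '/shots/123'))                     # root-relative path
print(urljoin(base, 'img/1.png'))                      # relative path, resolved against base
print(urljoin(base, 'https://cdn.example.com/1.png'))  # absolute URL is kept as-is
```

Note that an already-absolute second argument simply replaces the base, so urljoin is safe to call on every extracted href or src.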
import scrapy
from urllib import parse

class DribbbleSpider(scrapy.Spider):
    name = 'dribbble'
    allowed_domains = ['dribbble.com']
    start_urls = ['http://dribbble.com/']

    def parse(self, response):
        a_href = response.css('.dribbble-shot .dribbble-over::attr(href)').extract_first('')
        href = parse.urljoin(response.url, a_href)
        print(a_href)
        print(href)

import scrapy
from urllib import parse

class DribbbleSpider(scrapy.Spider):
    name = 'dribbble'
    allowed_domains = ['dribbble.com']
    start_urls = ['http://dribbble.com/']

    def parse(self, response):
        image_src = response.css('img.enrique-image::attr(src)').extract_first('')
        src = parse.urljoin(response.url, image_src)
        print(image_src)
        print(src)
Download and save the picture locally
import scrapy
from urllib import parse
import requests

class DribbbleSpider(scrapy.Spider):
    name = 'dribbble'
    allowed_domains = ['dribbble.com']
    start_urls = ['http://dribbble.com/']

    def parse(self, response):
        image_src = response.css('img.enrique-image::attr(src)').extract_first('')
        src = parse.urljoin(response.url, image_src)
        ret = requests.get(src, stream=True)
        with open('./1.png', 'wb') as f:
            for block in ret.iter_content(chunk_size=1024):
                f.write(block)

How to quickly get the elements in a page
To get an element's XPath selector: select the tag, right-click, then choose Copy > Copy XPath.
To get an element's CSS selector: use Chrome's developer tools with a third-party plug-in; download the CSS Select plug-in, then select the tag element directly.
This is the end of this article on "how to use the CSS selectors of the crawler Scrapy framework". I hope the above content has been of some help to you and that you have learned something new. If you think the article is good, please share it so more people can see it.
Welcome to subscribe to "Shulou Technology Information" to get the latest news, interesting things, and hot topics in the IT industry; it covers the hottest and latest Internet news, technology news, and IT industry trends.
© 2024 shulou.com SLNews company. All rights reserved.