In this issue, the editor shows you how to parse the URL field in a Scrapy spider. The article is rich in content and analyzes the topic step by step from a practical point of view; I hope you get something out of it after reading.
1. Modify the crawl target address
We know that to crawl data from a website, you need to create a spider under the spiders directory. After creation, a class is generated automatically in the spider file; its name is the spider name plus Spider, as with the CsdnSpider class generated for the csdn website in the previous section. In this class, name is the name of the spider, allowed_domains lists the domains the spider is allowed to crawl, and start_urls holds the URLs of the target pages. If you need to change the crawl target page, you only need to modify start_urls.
import scrapy

class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['http://www.csdn.net/']

    def parse(self, response):
        pass
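For example, to crawl a section page instead of the homepage, only start_urls changes; the URL below is purely illustrative and is not used later in this article:

start_urls = ['https://www.csdn.net/nav/python']  # hypothetical target page; the rest of the spider stays the same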
2. Parse the title hyperlink: the a tag's jump address and the title content

If we continue to use the daily recommendations on csdn as the information to crawl, we first need the response object in the parse(self, response) method of the CsdnSpider class to extract the href value of the a element with a css selector, for example response.css('h3 a::attr(href)').extract(), which gives us a list of url addresses.
import scrapy

class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['http://www.csdn.net/']

    def parse(self, response):
        urls = response.css('.company_list .company_name a::attr(href)').extract()
        print(urls)
Then we loop over that list, take the url of each a tag, and issue a Request with two arguments: the url, which tells Scrapy which page to parse next (the page link can be assembled with the parse.urljoin() method), and the callback function. This callback function is one we define ourselves; it also receives a response object, and through that response object we parse the elements selected by the css selector, so we can get the title content of each parsed page.
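To see what parse.urljoin() does with an href before it is handed to Request, here is a small standalone check; the paths are made up for illustration:

from urllib import parse

# a relative href is joined onto the page's own URL
print(parse.urljoin('http://www.csdn.net/', 'nav/python'))   # http://www.csdn.net/nav/python
# an absolute href is returned unchanged
print(parse.urljoin('http://www.csdn.net/', 'http://blog.csdn.net/abc'))  # http://blog.csdn.net/abc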
import scrapy
from scrapy.http import Request
from urllib import parse

class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['http://www.csdn.net/']

    def parse(self, response):
        # get the list of url addresses
        urls = response.css('.company_list .company_name a::attr(href)').extract()
        # print(urls)
        # parse each linked page
        for url in urls:
            yield Request(url=parse.urljoin(response.url, url), callback=self.parse_analyse, dont_filter=True)

    # callback function
    def parse_analyse(self, response):
        title = response.css('.company_list .company_name a::text').extract_first()
        print(title)
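Assuming the file lives in the spiders directory of a standard Scrapy project, running scrapy crawl csdn from the project root executes the spider and prints each parsed title to the console.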
3. Introduction to the Request object

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback]): a Request object represents an HTTP request; it is usually generated by the Spider and executed by the Downloader to produce a Response.
url: the URL for the request.
callback: a callback function that receives the response of this request as its first argument. If callback is not specified, the spider's parse() method is used by default.
method: the HTTP method of the request. Default is 'GET'.
headers: the headers of the request.
body: the body of the request, as bytes or str.
cookies: the cookies the request carries.
meta: the initial value of the Request.meta attribute. If given, the dict is shallow-copied.
encoding: the encoding of the request. Default is 'utf-8'.
priority: the priority of the request. Requests with higher priority are downloaded first.
dont_filter: whether the request should bypass the Scheduler's duplicate filter. Setting it to True allows the same request to be scheduled more than once.
errback: a callback function that handles exceptions raised while processing the request.
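To show how these parameters fit together, here is a minimal sketch of a spider that passes several of them to Request; the spider name, header value, and the parse_detail/on_error handlers are illustrative assumptions, not part of the example above:

import scrapy
from scrapy.http import Request

class RequestDemoSpider(scrapy.Spider):
    name = 'request_demo'                           # placeholder spider name
    start_urls = ['http://www.csdn.net/']

    def parse(self, response):
        yield Request(
            url=response.url,                       # re-request the same page just to illustrate the parameters
            callback=self.parse_detail,             # receives the Response as its first argument
            headers={'User-Agent': 'Mozilla/5.0'},  # example header, purely illustrative
            meta={'from_url': response.url},        # shallow-copied; readable as response.meta in the callback
            priority=1,                             # downloaded before priority-0 requests
            dont_filter=True,                       # bypass the Scheduler's duplicate filter
            errback=self.on_error,                  # called with a Failure if the request errors out
        )

    def parse_detail(self, response):
        print(response.meta['from_url'])

    def on_error(self, failure):
        print('request failed:', failure)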
The above is how to parse the URL field in a Scrapy spider. If you happen to have similar doubts, you might as well refer to the analysis above. If you want to learn more, you are welcome to follow the industry information channel.