
How to crawl Hanfu circle pictures with 100 lines of Python code


This article explains how to crawl pictures from the Hanfu circle with about 100 lines of Python code. The walkthrough is quite detailed and has some reference value, so interested friends should read on!

Analyzing the website

The website is as follows:

https://www.aihanfu.com/zixun/tushang-1/

This is the URL of the first page. By observation, the URL of the second page differs only in the trailing serial number, which changes from 1 to 2, and so on; incrementing that number lets us reach every page.
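As a minimal sketch (assuming the numbering pattern holds for every page, which the main function below also relies on), the page URLs can be generated like this:

    # Build the URL of every listing page; the article crawls pages 1 through 49.
    base = 'https://www.aihanfu.com/zixun/tushang-{}/'
    page_urls = [base.format(n) for n in range(1, 50)]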

We need to obtain the link of each sub-page, that is, the URL in each href attribute, then visit each of those URLs, find the picture URLs, and download the pictures.

Getting the sub-links

To get the data described above, we could use BeautifulSoup, re, or XPath. Here the editor uses XPath: we write a locating function that finds the link of each sub-page and hands it back to the main function. Note the small technique in the for loop: the function uses yield, so it is a generator and produces the links one by one. Take a look!

    import os
    import re

    import requests
    from lxml import etree


    def get_menu(url, headers):
        """Yield the sub-page URL of every link on one listing page.
        params: url -- the URL of a listing page
        """
        r = requests.get(url, headers=headers)
        if r.status_code == 200:
            r.encoding = r.apparent_encoding
            html = etree.HTML(r.text)
            html = etree.tostring(html)
            html = etree.fromstring(html)
            # find the link of each sub-page, then yield it
            children_url = html.xpath('//div[@class="news_list"]//article/figure/a/@href')
            for child in children_url:
                yield child

Getting the title and the image address

To collect as much data as possible, we collect the title and the image addresses. Of course, if another project also needs the publisher and the publication time, that can be collected in the same way; this article does not go into it.
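That idea is a small extension, not something this article implements. The XPath expressions below are hypothetical, since the publisher and date nodes of the page were not examined here; inside get_page (shown below), after html has been parsed, it would look something like:

    # Hypothetical XPath expressions -- adjust them to the page's real structure.
    author = html.xpath('//header//span[@class="author"]/text()')
    pub_date = html.xpath('//header//time/text()')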

Opening one of the sub-links, we can see that the title sits in the page's header node; the title is used later when creating the folder.

The code is as follows:

    def get_page(url, headers):
        """Fetch a sub-page, extract the title and image addresses, then package them for download.
        params: url -- the sub-page URL
        """
        r = requests.get(url, headers=headers)
        if r.status_code == 200:
            r.encoding = r.apparent_encoding
            html = etree.HTML(r.text)
            html = etree.tostring(html)
            html = etree.fromstring(html)
            # get the title
            title = html.xpath(r'//header/h2/text()')
            # get the addresses of the pictures
            img = html.xpath(r'//div[@class="arc_body"]//figure/img/@src')
            # title preprocessing
            title = ''.join(title)
            title = re.sub(r'[|]', '', title)
            print(title)
            save_img(title, img, headers)

Saving the pictures

When turning the pages, we need to save the pictures behind each sub-link. Note the two checks here: the status code of the request and whether the target path already exists.

    def save_img(title, img, headers):
        """Create a subfolder named after the title and download every img link into it.
        params: title -- the page title
        params: img   -- the list of picture addresses
        """
        if not os.path.exists(title):
            os.mkdir(title)
        # download
        for i, j in enumerate(img):  # iterate over the URL list
            r = requests.get(j, headers=headers)
            if r.status_code == 200:
                with open(os.path.join(title, str(i) + '.png'), 'wb') as fw:
                    fw.write(r.content)
                print(title, 'picture', str(i), 'downloaded!')

Main function

    if __name__ == '__main__':
        # working directory for the downloads ('*' stands in for the user name)
        path = '/Users/*/Hanfu/'
        if not os.path.exists(path):
            os.mkdir(path)
        os.chdir(path)
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/81.0.4044.129 Safari/537.36'}
        for n in range(1, 50):
            url = 'http://www.aihanfu.com/zixun/tushang-{}/'.format(n)
            for child in get_menu(url, headers):
                get_page(child, headers)  # handle one sub-page
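One caveat: the code strips only '|' from the title before using it as a folder name, but titles can contain other characters that are illegal in file names, especially on Windows. A minimal sketch of a stricter sanitizer, assuming the same title string:

    import re

    def sanitize(title):
        # replace the characters that Windows forbids in file names: \ / : * ? " < > |
        return re.sub(r'[\\/:*?"<>|]', '_', title).strip()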

So far we have completed the whole pipeline. The editor has written about crawlers more than once: on the one hand, we hope you grow more familiar with crawling techniques; on the other hand, the editor believes that crawling is the foundation of data analysis and data mining. Without a crawler to fetch the data, there is no data analysis.

The above is the whole content of this article, "How to crawl Hanfu circle pictures with 100 lines of Python code". Thank you for reading! We hope the content helps you; for more related knowledge, welcome to follow the industry information channel!
