

How to use Python code to crawl Ultraman pictures

2025-01-17 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/01 Report

This article mainly explains how to use Python code to crawl Ultraman pictures. The content is simple, clear, and easy to learn; follow along step by step to study how to crawl Ultraman pictures with Python.

Crawl URL: http://www.ultramanclub.com/allultraman/

Tools used: PyCharm, requests

Go to the web page

Open developer tools

Click Network

Refresh the web page to get information

The Request URL shown there is the URL we want to crawl.

Scroll to the bottom to find the User-Agent header and copy it.

Send a request to the server

A status code of 200 means the request was successful.

Use response.text to get the text data.

You can see that there is some garbled text.

Use encode/decode to convert the encoding.

import requests

url = 'http://www.ultramanclub.com/allultraman/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
html = response.text.encode('iso-8859-1').decode('gbk')  # fix the garbled GBK text
print(html)
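Why the encode/decode pair works can be shown offline: the server sends GBK bytes, but requests decoded them as ISO-8859-1. Since ISO-8859-1 maps every byte value one-to-one, that wrong decode is lossless and reversible. A minimal sketch (the sample string is an arbitrary example, not taken from the site):

```python
# The page is GBK-encoded, but it was decoded as ISO-8859-1, producing mojibake.
# Because ISO-8859-1 maps all 256 byte values one-to-one, the wrong decode is
# lossless and can be reversed: re-encode to bytes, then decode as GBK.
original = "奥特曼"                                      # sample GBK-representable text
garbled = original.encode("gbk").decode("iso-8859-1")    # what response.text looked like
repaired = garbled.encode("iso-8859-1").decode("gbk")    # the fix used above
print(repaired)                                          # 奥特曼
```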

Then start crawling the data we need.

Use XPath to get the links from the web page.

To use XPath, first install and import the parsel package.

import requests
import parsel

def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

url = 'http://www.ultramanclub.com/allultraman/'
response = get_response(url)
html = response.text.encode('iso-8859-1').decode('gbk')
selector = parsel.Selector(html)
period_hrefs = selector.xpath('//div[@class="btn"]/a/@href')  # get the page links for the three eras
for period_href in period_hrefs:
    print(period_href.get())

You can see that the printed links are incomplete, so we complete them manually:

period_href = 'http://www.ultramanclub.com/allultraman/' + period_href.get()
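As an aside, the standard library's urllib.parse.urljoin does the same completion and also resolves ./ and ../ prefixes automatically. A small sketch (the hrefs below are made-up examples, not actual links from the site):

```python
from urllib.parse import urljoin

base = "http://www.ultramanclub.com/allultraman/"
# urljoin resolves a relative href against the page URL, handling ./ and ../
print(urljoin(base, "./heisei/"))     # http://www.ultramanclub.com/allultraman/heisei/
print(urljoin(base, "../img/a.png"))  # http://www.ultramanclub.com/img/a.png
```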

Go to one of the web pages

As before, use XPath to get each Ultraman's page link.

for period_href in period_hrefs:
    period_href = 'http://www.ultramanclub.com/allultraman/' + period_href.get()
    period_response = get_response(period_href).text
    period_html = parsel.Selector(period_response)
    lis = period_html.xpath('//div[@class="ultraheros-Contents_Generations"]/div/ul/li/a/@href')
    for li in lis:
        print(li.get())

After running it, we find that these links are also incomplete, so we complete them the same way:

li = 'http://www.ultramanclub.com/allultraman/' + li.get().replace('./', '')

After getting each URL, you can continue nesting requests in the same way to reach the picture data:

png_url = 'http://www.ultramanclub.com/allultraman/' + li_selector.xpath('//div[@class="left"]/figure/img/@src').get().replace('../', '')

Complete code

import requests
import parsel
import os

dirname = "Ultraman"
if not os.path.exists(dirname):  # create the folder if it does not already exist
    os.mkdir(dirname)

def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

url = 'http://www.ultramanclub.com/allultraman/'
response = get_response(url)
html = response.text.encode('iso-8859-1').decode('gbk')
selector = parsel.Selector(html)
period_hrefs = selector.xpath('//div[@class="btn"]/a/@href')  # page links for the three eras

for period_href in period_hrefs:
    period_href = 'http://www.ultramanclub.com/allultraman/' + period_href.get()
    period_html = get_response(period_href).text
    period_selector = parsel.Selector(period_html)
    lis = period_selector.xpath('//div[@class="ultraheros-Contents_Generations"]/div/ul/li/a/@href')
    for li in lis:
        li = 'http://www.ultramanclub.com/allultraman/' + li.get().replace('./', '')  # URL of each Ultraman
        li_html = get_response(li).text
        li_selector = parsel.Selector(li_html)
        url = li_selector.xpath('//div[@class="left"]/figure/img/@src').get()
        if url:
            png_url = 'http://www.ultramanclub.com/allultraman/' + url.replace('../', '')
            png_title = li_selector.xpath('//ul[@class="lists"]/li[3]/text()').get()
            png_title = png_title.encode('iso-8859-1').decode('gbk')
            png_content = get_response(png_url).content
            with open(f'{dirname}/{png_title}.png', 'wb') as f:
                f.write(png_content)
            print(png_title, 'picture download completed')
        else:
            continue

When the crawler reached Nexter Ultraman, the XPath returned None. I tinkered with it for a long time but could not figure it out, so I skipped that page with the if url: check. Does anyone know why?
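For reference, .get() on a parsel selection returns None whenever the XPath matches nothing, for example if that one page nests its image under a different structure or class name. The same behavior can be reproduced with the standard library (the HTML fragments below are invented for illustration):

```python
import xml.etree.ElementTree as ET

# parsel's .get() yields None when the XPath matches no node; the same
# pattern reproduced with ElementTree on two invented page fragments.
page_with_img = '<div class="left"><figure><img src="img/tiga.png"/></figure></div>'
page_without = '<div class="left"><figure></figure></div>'

def extract_src(html):
    node = ET.fromstring(html).find('.//figure/img')
    return node.get('src') if node is not None else None

print(extract_src(page_with_img))  # img/tiga.png
print(extract_src(page_without))   # None -> skipped by the `if url:` guard
```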

url = li_selector.xpath('//div[@class="left"]/figure/img/@src').get()

Thank you for reading. The above is the content of "how to use Python code to crawl Ultraman pictures". After studying this article, I believe you have a deeper understanding of how to crawl Ultraman pictures with Python. The editor will push more articles on related knowledge points for you; welcome to follow!
