2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains how to use Python to crawl Ultraman pictures. The content is simple, clear, and easy to follow, so read on and work through it step by step.
Crawl URL: http://www.ultramanclub.com/allultraman/
Tools used: PyCharm, requests
Go to the web page.
Open the developer tools and click the Network tab.
Refresh the page to capture the requests.
The Request URL shown is the URL we are crawling.
Scroll to the bottom of the request headers, find the User-Agent, and copy it.
Send a request to the server.
A status code of 200 means the request succeeded.
Use response.text to get the text data.
You will notice some garbled characters: the page is GBK-encoded, but requests decodes it as ISO-8859-1 by default.
Fix this by re-encoding the text as ISO-8859-1 and decoding it as GBK.
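Why this round-trip works can be shown with a tiny self-contained sketch (the sample string here is hypothetical, not taken from the site): the server sends GBK bytes, requests decodes them as ISO-8859-1, and because ISO-8859-1 decoding is lossless for every byte, reversing it recovers the original bytes.

```python
# Sketch of the mojibake round-trip (sample text is hypothetical).
original = '奥特曼'                          # "Ultraman" in Chinese
gbk_bytes = original.encode('gbk')          # the bytes the server actually sends
mojibake = gbk_bytes.decode('iso-8859-1')   # what response.text shows by default
repaired = mojibake.encode('iso-8859-1').decode('gbk')  # reverse the wrong decode
print(repaired)  # prints the readable text again
```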
import requests

url = 'http://www.ultramanclub.com/allultraman/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'}
response = requests.get(url=url, headers=headers)
html = response.text
html = html.encode('iso-8859-1').decode('gbk')
print(html)
Now we can start extracting the data we need.
Use XPath to get the links on the page.
To use XPath here, first install and import the parsel package (pip install parsel).
import requests
import parsel

def get_response(html_url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'}
    response = requests.get(url=html_url, headers=headers)
    return response

url = 'http://www.ultramanclub.com/allultraman/'
response = get_response(url)
html = response.text.encode('iso-8859-1').decode('gbk')
selector = parsel.Selector(html)
period_hrefs = selector.xpath('//div[@class="btn"]/a/@href')  # links to the three eras' pages
for period_href in period_hrefs:
    print(period_href.get())
You can see that the returned links are incomplete, so we prepend the base URL manually:

period_href = 'http://www.ultramanclub.com/allultraman/' + period_href.get()
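As an alternative to plain string concatenation, the standard library's urllib.parse.urljoin resolves relative hrefs against the page they came from; a small sketch (the relative path shown is hypothetical):

```python
from urllib.parse import urljoin

base = 'http://www.ultramanclub.com/allultraman/'
# urljoin resolves a relative href against the base page URL
print(urljoin(base, './heisei.html'))
# http://www.ultramanclub.com/allultraman/heisei.html
```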
Go into one of the era pages.
As before, use XPath to get each Ultraman's page link.
for period_href in period_hrefs:
    period_href = 'http://www.ultramanclub.com/allultraman/' + period_href.get()
    # print(period_href)
    period_response = get_response(period_href).text
    period_html = parsel.Selector(period_response)
    lis = period_html.xpath('//div[@class="ultraheros-Contents_Generations"]/div/ul/li/a/@href')
    for li in lis:
        print(li.get())
Running this shows that these links are also incomplete, so prepend the base URL again:

li = 'http://www.ultramanclub.com/allultraman/' + li.get().replace('./', '')
After getting each Ultraman's URL, nest one more request to reach the picture data:
png_url = 'http://www.ultramanclub.com/allultraman/' + li_selector.xpath('//div[@class="left"]/figure/img/@src').get().replace('../', '')
Complete code
import requests
import parsel
import os

dirname = 'Ultraman'
if not os.path.exists(dirname):  # create the output folder if it does not exist
    os.mkdir(dirname)

def get_response(html_url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'}
    response = requests.get(url=html_url, headers=headers)
    return response

url = 'http://www.ultramanclub.com/allultraman/'
response = get_response(url)
html = response.text.encode('iso-8859-1').decode('gbk')
selector = parsel.Selector(html)
period_hrefs = selector.xpath('//div[@class="btn"]/a/@href')  # links to the three eras' pages

for period_href in period_hrefs:
    period_href = 'http://www.ultramanclub.com/allultraman/' + period_href.get()
    period_html = get_response(period_href).text
    period_selector = parsel.Selector(period_html)
    lis = period_selector.xpath('//div[@class="ultraheros-Contents_Generations"]/div/ul/li/a/@href')
    for li in lis:
        li = 'http://www.ultramanclub.com/allultraman/' + li.get().replace('./', '')  # URL of each Ultraman's page
        # print(li)
        li_html = get_response(li).text
        li_selector = parsel.Selector(li_html)
        url = li_selector.xpath('//div[@class="left"]/figure/img/@src').get()
        # print(url)
        if url:
            png_url = 'http://www.ultramanclub.com/allultraman/' + url.replace('../', '')
            png_title = li_selector.xpath('//ul[@class="lists"]/li[3]/text()').get()
            png_title = png_title.encode('iso-8859-1').decode('gbk')
            # print(li, png_title)
            png_content = get_response(png_url).content
            with open(os.path.join(dirname, f'{png_title}.png'), 'wb') as f:
                f.write(png_content)
            print(png_title, 'picture download completed')
        else:
            continue
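One robustness note: the scraped png_title may contain characters that Windows forbids in file names (such as ':' or '?'), which would make open() fail. A hypothetical helper, not part of the original script, that sanitizes the title before saving:

```python
import re

def safe_filename(name):
    # replace characters Windows forbids in file names with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()

print(safe_filename('Ultraman: Nexus'))  # Ultraman_ Nexus
```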
When the crawler reached Nexter Ultraman, the XPath query returned None. I debugged it for a long time without finding the cause, so I skipped that page with the if url: check. Does anyone know why?
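One way to investigate is to dump the fetched HTML to disk and check whether the div.left/figure/img structure is actually present on that page; a minimal debugging helper (the file name is arbitrary):

```python
def dump_page(html, path='debug_page.html'):
    # save the fetched HTML so the markup can be inspected in an editor
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)

# usage inside the loop (li_html comes from get_response(li).text):
# if url is None:
#     dump_page(li_html)
```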
url = li_selector.xpath('//div[@class="left"]/figure/img/@src').get()

Thank you for reading. That is the content of "how to use Python code to crawl Ultraman pictures". After studying this article, I believe you have a deeper understanding of the topic. More articles on related topics will follow; welcome to keep reading!
© 2024 shulou.com SLNews company. All rights reserved.