A Python Crawler That Downloads Pictures, Explained in Detail

2025-01-15 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/02 Report --

Next, three versions of the code are presented in turn (expect to spend roughly a month mastering each stage; by proficiency I mean being able to write the code without looking anything up; note that the material below has not been fully polished yet):

import requests, threading  # threading will handle multi-threaded downloads later
from lxml import etree
from bs4 import BeautifulSoup

# get the page source
def get_html(url):
    url = 'http://www.doutula.com/?qqdrsign=01495'
    # the address is hard-coded for now, which is a dead end; we have not handled multiple pages yet
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
    # the headers above simulate a browser; the format is fixed and worth memorizing
    request = requests.get(url=url, headers=headers)  # send a GET request to the URL
    response = request.content  # the raw bytes of the page source, slightly more robust than .text
    # print(response)
    return response

# next, fetch the outer pages, i.e. the pages that hold the pictures themselves
def get_img_html(html):
    soup = BeautifulSoup(html, 'lxml')  # parse the page (html.parser would also work)
    all_a = soup.find_all('a', class_='list-group-item')  # class is a Python keyword, so add an underscore: class_
    for i in all_a:
        print(i)  # i is one link element
        img_html = get_html(i['href'])  # follow the hyperlink and fetch that page's source
        print(img_html)

# http://www.doutula.com/article/list/?page=2
a = get_html(1)  # the argument is ignored for now because url is hard-coded inside get_html
get_img_html(a)
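As the comment in the code notes, `class` is a Python keyword, so BeautifulSoup filters on the HTML class attribute through the `class_` parameter instead. A minimal sketch of that filtering against a hand-written snippet (the markup below is invented for illustration, not copied from the real site):

```python
from bs4 import BeautifulSoup

# a tiny hand-written page standing in for the real list page
snippet = """
<div>
  <a class="list-group-item" href="/article/1">first</a>
  <a class="other" href="/article/2">second</a>
  <a class="list-group-item" href="/article/3">third</a>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")
# class_ (with the trailing underscore) filters by the HTML class attribute
links = soup.find_all("a", class_="list-group-item")
hrefs = [a["href"] for a in links]
print(hrefs)  # -> ['/article/1', '/article/3']
```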

Good: we can already fetch part of the source, so the next job is to handle multiple pages.
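Paging works by splicing each page number into a URL template with str.format (note the template needs a `{}` placeholder for format to substitute into). A quick stdlib-only sketch of the idea:

```python
# build the list-page URLs for the first nine pages, the same way main() will
start_url = 'http://www.doutula.com/article/list/?page={}'
urls = [start_url.format(i) for i in range(1, 10)]  # range(1, 10) yields 1 through 9

print(urls[0])   # -> http://www.doutula.com/article/list/?page=1
print(urls[-1])  # -> http://www.doutula.com/article/list/?page=9
```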

import requests, threading  # threading will handle multi-threaded downloads later
from lxml import etree
from bs4 import BeautifulSoup

# get the page source
def get_html(url):
    # url = 'http://www.doutula.com/?qqdrsign=01495'  # the old hard-coded address; the caller now passes each page's URL
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
    # the headers above simulate a browser; the format is fixed and worth memorizing
    request = requests.get(url=url, headers=headers)  # send a GET request to the URL
    response = request.content  # the raw bytes of the page source, slightly more robust than .text
    # print(response)
    return response

# next, fetch the outer pages, i.e. the pages that hold the pictures themselves
def get_img_html(html):
    soup = BeautifulSoup(html, 'lxml')  # parse the page (html.parser would also work)
    all_a = soup.find_all('a', class_='list-group-item')  # class is a Python keyword, so add an underscore: class_
    for i in all_a:
        print(i)  # i is one link element
        img_html = get_html(i['href'])  # follow the hyperlink and fetch that page's source
        print(img_html)

# http://www.doutula.com/article/list/?page=2
def main():
    start_url = 'http://www.doutula.com/article/list/?page={}'
    for i in range(1, 10):
        start_html = get_html(start_url.format(i))  # fetch the source of list pages 1 through 9
        get_img_html(start_html)  # then fetch each linked page that holds a picture

main()

Finally, the complete source code:

import requests, threading  # threading handles the multi-threaded downloads
from lxml import etree  # xpath parsing, to locate the content directly
from bs4 import BeautifulSoup

# get the page source
def get_html(url):
    # url = 'http://www.doutula.com/?qqdrsign=01495'  # the old hard-coded address; the caller now passes each page's URL
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
    # the headers above simulate a browser; the format is fixed and worth memorizing
    request = requests.get(url=url, headers=headers)  # send a GET request to the URL
    response = request.content  # the raw bytes of the page source, slightly more robust than .text
    # print(response)
    return response

# next, fetch the outer pages, i.e. the pages that hold the pictures themselves
def get_img_html(html):
    soup = BeautifulSoup(html, 'lxml')  # parse the page (html.parser would also work)
    all_a = soup.find_all('a', class_='list-group-item')  # class is a Python keyword, so add an underscore: class_
    for i in all_a:
        # print(i)
        img_html = get_html(i['href'])  # follow the hyperlink and fetch that page's source
        get_img(img_html)
        # print(img_html)

# http://www.doutula.com/article/list/?page=2

# obtain the url of each image:
def get_img(html):
    soup = etree.HTML(html)  # initialize before parsing; lxml auto-corrects broken markup
    items = soup.xpath('//div[@class="artile_des"]')  # @ selects an attribute; this finds each image's container
    for item in items:
        imgurl_list = item.xpath('table/tbody/tr/td/a/img/@onerror')  # the real URL hides in the onerror attribute
        # print(imgurl_list)
        start_save_img(imgurl_list)

# the images are located above; next, download them with multiple threads
x = 0  # global counter used to name the saved image files

# splice the complete link and open the file
def save_img(img_url):
    global x  # use the global counter
    x += 1
    img_url = img_url.split('=')[-1][1:-2].replace('jp', 'jpg')  # strip the JS wrapper and repair the clipped extension
    print("downloading " + 'http:' + img_url)
    img_content = requests.get('http:' + img_url).content
    with open('doutu/%s.jpg' % x, 'wb') as f:  # the doutu/ directory must already exist
        f.write(img_content)
    print('saved picture %s' % x)

# create the threads
def start_save_img(imgurl_list):
    for i in imgurl_list:
        print(i)
        th = threading.Thread(target=save_img, args=(i,))  # note the trailing comma: args must be a tuple
        th.start()

def main():
    start_url = 'http://www.doutula.com/article/list/?page={}'
    for i in range(1, 10):
        start_html = get_html(start_url.format(i))  # fetch the source of list pages 1 through 9
        get_img_html(start_html)  # then fetch each linked page that holds a picture

if __name__ == '__main__':
    main()
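The trickiest line in save_img is the string surgery on the onerror attribute. A stdlib-only sketch of that step, assuming the attribute is a JavaScript fragment of the shape below (the sample value is invented to match the slicing logic, not copied from the site):

```python
# a hypothetical onerror value: everything after the last '=' is the quoted URL,
# and the scraped extension arrives clipped down to 'jp'
onerror = "this.src='//ws2.sinaimg.cn/large/abc123.jpg'"

img_url = onerror.split('=')[-1]        # "'//ws2.sinaimg.cn/large/abc123.jpg'"
img_url = img_url[1:-2]                 # drop the opening quote and the last two chars
img_url = img_url.replace('jp', 'jpg')  # repair the clipped extension
print('http:' + img_url)               # -> http://ws2.sinaimg.cn/large/abc123.jpg
```

Be aware that str.replace substitutes every occurrence of 'jp', so this trick only works while 'jp' appears nowhere else in the URL.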

To be continued; improvements will follow in a later installment.
