Next, three cases will be worked through in turn (expect roughly a month to master each one; by mastery I mean being able to write the code without looking anything up; the material below has not been fully tidied yet):
import requests, threading  # threading will be used later for multithreaded downloading
from lxml import etree
from bs4 import BeautifulSoup

# get the page source
def get_html(url):
    url = 'http://www.doutula.com/?qqdrsign=01495'
    # the URL is hard-coded for now because we have not handled multiple pages yet
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
    # the headers above simulate a browser; the format is fixed and worth memorizing
    request = requests.get(url=url, headers=headers)  # send a GET request to the URL
    response = request.content  # the raw source; .content works slightly better than .text here
    # print(response)
    return response

# next, get the outer page, i.e. the source of the page that actually holds the images
def get_img_html(html):
    soup = BeautifulSoup(html, 'lxml')  # parse the page (html.parser would also work)
    all_a = soup.find_all('a', class_='list-group-item')  # class is a Python keyword, hence class_; the class value was garbled in the original, 'list-group-item' is an assumption
    for i in all_a:
        print(i)
        img_html = get_html(i['href'])  # fetch the source of the linked page
        print(img_html)

# http://www.doutula.com/article/list/?page=2
a = get_html(1)  # the argument is ignored for now because the URL is hard-coded inside get_html
get_img_html(a)
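A quick aside on the class_ argument used above: class is a reserved word in Python, so BeautifulSoup accepts the CSS class filter under the name class_ (or via an attrs dict). A minimal sketch on a made-up snippet, not doutula's real markup:

from bs4 import BeautifulSoup

demo = BeautifulSoup('<a class="list-group-item" href="/x">demo</a>', 'lxml')  # made-up snippet for illustration
print(demo.find_all('a', class_='list-group-item'))  # class_ avoids clashing with the keyword class
print(demo.find_all('a', attrs={'class': 'list-group-item'}))  # equivalent spelling via attrs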
Well, we can already get part of the source code, so the next job is to handle multiple pages.
import requests, threading  # threading will be used later for multithreaded downloading
from lxml import etree
from bs4 import BeautifulSoup

# get the page source
def get_html(url):
    # url = 'http://www.doutula.com/?qqdrsign=01495'  # no longer hard-coded, since we now handle multiple pages
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
    # the headers above simulate a browser; the format is fixed and worth memorizing
    request = requests.get(url=url, headers=headers)  # send a GET request to the URL
    response = request.content  # the raw source; .content works slightly better than .text here
    # print(response)
    return response

# next, get the outer page, i.e. the source of the page that actually holds the images
def get_img_html(html):
    soup = BeautifulSoup(html, 'lxml')  # parse the page (html.parser would also work)
    all_a = soup.find_all('a', class_='list-group-item')  # class is a Python keyword, hence class_; the class value was garbled in the original, 'list-group-item' is an assumption
    for i in all_a:
        print(i)
        img_html = get_html(i['href'])  # fetch the source of the linked page
        print(img_html)

# http://www.doutula.com/article/list/?page=2
def main():
    start_url = 'http://www.doutula.com/article/list/?page={}'  # {} is the placeholder for the page number
    for i in range(1, 10):
        start_html = get_html(start_url.format(i))  # fetch the source of pages 1-9
        get_img_html(start_html)  # get the pages that the images live on

main()
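A quick sanity check of the page-number formatting used in main(); this is just a sketch showing what the {} placeholder expands to, nothing beyond str.format:

start_url = 'http://www.doutula.com/article/list/?page={}'
for i in range(1, 10):
    print(start_url.format(i))  # prints .../?page=1 through .../?page=9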
Finally, the complete source code:
import requests, threading  # threading for multithreaded downloading
from lxml import etree  # xpath parsing, to pull the content out directly
from bs4 import BeautifulSoup

# get the page source
def get_html(url):
    # url = 'http://www.doutula.com/?qqdrsign=01495'  # no longer hard-coded, since we now handle multiple pages
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
    # the headers above simulate a browser; the format is fixed and worth memorizing
    request = requests.get(url=url, headers=headers)  # send a GET request to the URL
    response = request.content  # the raw source; .content works slightly better than .text here
    # print(response)
    return response

# next, get the outer page, i.e. the source of the page that actually holds the images
def get_img_html(html):
    soup = BeautifulSoup(html, 'lxml')  # parse the page (html.parser would also work)
    all_a = soup.find_all('a', class_='list-group-item')  # class is a Python keyword, hence class_; the class value was garbled in the original, 'list-group-item' is an assumption
    for i in all_a:
        # print(i)
        img_html = get_html(i['href'])  # fetch the source of the linked page
        get_img(img_html)
        # print(img_html)

# http://www.doutula.com/article/list/?page=2
# obtain the URLs of the images:
def get_img(html):
    soup = etree.HTML(html)  # initialize before parsing; lxml automatically corrects broken markup
    items = soup.xpath('//div[@class="artile_des"]')  # @ selects an attribute; this finds the box each image sits in
    for item in items:
        imgurl_list = item.xpath('table/tbody/tr/td/a/img/@onerror')  # the onerror attribute carries the image URL
        # print(imgurl_list)
        start_save_img(imgurl_list)

# the images have been located above; next, download them with multiple threads
x = 0  # global counter used to name the saved files

# stitch the complete link together and open the file
def save_img(img_url):
    global x  # use the global counter
    x += 1
    img_url = img_url.split('=')[-1][1:-2].replace('jp', 'jpg')  # strip the JS wrapper and quotes around the URL
    print("downloading " + 'http:' + img_url)
    img_content = requests.get('http:' + img_url).content  # the extracted URL starts with //, so prepend the scheme
    with open('doutu/%s.jpg' % x, 'wb') as f:
        f.write(img_content)
    print('saved image %s' % x)

# create the threads
def start_save_img(imgurl_list):
    for i in imgurl_list:
        print(i)
        th = threading.Thread(target=save_img, args=(i,))  # args must be a tuple
        th.start()

def main():
    start_url = 'http://www.doutula.com/article/list/?page={}'  # {} is the placeholder for the page number
    for i in range(1, 10):
        start_html = get_html(start_url.format(i))  # fetch the source of pages 1-9
        get_img_html(start_html)  # get the pages that the images live on

if __name__ == '__main__':
    main()
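One practical note, offered as a hedged sketch rather than part of the original code: open('doutu/%s.jpg' % x, 'wb') fails if the doutu folder does not already exist, so it is worth creating it before main() runs; os.makedirs with exist_ok=True does that safely, assuming the same relative path as above.

import os

if __name__ == '__main__':
    os.makedirs('doutu', exist_ok=True)  # create the output folder if it is missing (path assumed from the code above)
    main()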
To be continued; further improvements will follow later.