2025-01-16 Update From: SLTechnology News&Howtos > Development
Shulou(Shulou.com)05/31 Report--
Most readers are unfamiliar with the techniques covered in "How to use Python to view Liyang's photography circle", so this article summarizes them for you. The content is detailed, the steps are clear, and it should have some reference value. Let's take a look.
Target site analysis
The pagination of the target site to be collected follows this pattern:
http://www.jsly001.com/thread-htm-fid-45-page-{pagecode}.html
The code is built from the threading module, the requests module, and the BeautifulSoup module.
Collection follows a list page → detail page rule:
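Before writing the crawler itself, it can help to expand the pagination template above into concrete list-page URLs. A minimal sketch (the page count passed in is an assumption; the real board may have more or fewer pages):

```python
# Template taken from the target-site analysis above.
BASE = "http://www.jsly001.com/thread-htm-fid-45-page-{pagecode}.html"

def page_urls(last_page):
    """Return the list-page URLs for pages 1..last_page."""
    return [BASE.format(pagecode=i) for i in range(1, last_page + 1)]

print(page_urls(3))  # three list-page URLs, page-1 through page-3
```

These URLs are what each crawler thread would fetch in turn.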
Liyang Photography Circle Photo Collection Code
This is a practical case: the complete code is shown first, and then explained through its comments and key functions.
The main implementation steps are as follows:
Set log output level
Declare a LiYang class that inherits from threading.Thread
Instantiate multithreaded objects
Each thread gets global resources
Call html parsing function
Locate the board's topic area, mainly to avoid picking up pinned (top) topics
Parsing with lxml
Parse out the title and data
Parse image address
save the picture
import random
import threading
import logging

from bs4 import BeautifulSoup
import requests
import lxml

logging.basicConfig(level=logging.NOTSET)  # set the log output level


# Declare a LiYang class that inherits from threading.Thread
class LiYangThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)  # initialise the thread object
        self._headers = self._get_headers()  # pick a random UA
        self._timeout = 5  # request timeout in seconds

    # each thread gets the global resources
    def run(self):
        # while True:  # the multithreaded crawl would loop here
        res = None
        try:
            # test fetch of the first list page
            res = requests.get(url="http://www.jsly001.com/thread-htm-fid-45-page-1.html",
                               headers=self._headers, timeout=self._timeout)
        except Exception as e:
            logging.error(e)
        if res is not None:
            html_text = res.text
            self._format_html(html_text)  # call the HTML parsing function

    def _format_html(self, html):
        # parse with lxml
        soup = BeautifulSoup(html, 'lxml')
        # locate the board's topic area, mainly to avoid pinned topics
        part_tr = soup.find(attrs={'class': 'bbs_tr4'})
        if part_tr is not None:
            items = part_tr.find_all_next(attrs={"name": "readlink"})  # detail-page links
        else:
            items = soup.find_all(attrs={"name": "readlink"})
        # parse out the titles and detail-page URLs
        data = [(item.text, f'http://www.jsly001.com/{item["href"]}') for item in items]
        # visit each detail page
        for name, url in data:
            self._get_imgs(name, url)

    def _get_imgs(self, name, url):
        """Parse the image addresses."""
        res = None
        try:
            res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
        except Exception as e:
            logging.error(e)
        # image-extraction logic
        if res is not None:
            soup = BeautifulSoup(res.text, 'lxml')
            origin_div1 = soup.find(attrs={'class': 'tpc_content'})
            origin_div2 = soup.find(attrs={'class': 'imgList'})
            content = origin_div2 if origin_div2 else origin_div1
            if content is not None:
                imgs = content.find_all('img')
                # print([img.get("src") for img in imgs])
                self._save_img(name, imgs)  # save the images

    def _save_img(self, name, imgs):
        """Save the pictures."""
        for img in imgs:
            url = img.get("src")
            if url.find('http') < 0:
                continue
            # read the id attribute from the parent tag
            id_ = img.find_parent('span').get("id")
            res = None
            try:
                res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
            except Exception as e:
                logging.error(e)
            if res is not None:
                name = name.replace("/", "_")
                # note: create the imgs folder in the script's working directory beforehand
                with open(f'./imgs/{name}_{id_}.jpg', "wb+") as f:
                    f.write(res.content)

    def _get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        ]
        ua = random.choice(uas)
        headers = {"user-agent": ua}
        return headers


if __name__ == '__main__':
    my_thread = LiYangThread()
    my_thread.run()
In this case the BeautifulSoup module uses the lxml parser to parse the HTML data. This parser comes up often in later cases; install and import the lxml module before use.
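The parser name is just the second argument to BeautifulSoup. A small sketch (the HTML snippet is made up; it uses the stdlib "html.parser" so it runs without extra installs, whereas the article's "lxml" requires pip install lxml):

```python
from bs4 import BeautifulSoup

html = "<div class='tpc_content'><img src='http://example.com/a.jpg'></div>"
# the article passes "lxml" here; "html.parser" is the stdlib fallback
soup = BeautifulSoup(html, "html.parser")
print(soup.find(attrs={"class": "tpc_content"}).img["src"])  # → http://example.com/a.jpg
```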
The data-extraction part uses the soup.find() and soup.find_all() functions; the code also uses find_parent() to read the id attribute from the parent tag:
# read the id attribute from the parent tag
id_ = img.find_parent('span').get("id")
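find_parent() walks up the tree from a tag to its nearest matching ancestor. A self-contained illustration (the span id and image URL are invented for the example):

```python
from bs4 import BeautifulSoup

html = '<span id="img_42"><img src="http://example.com/a.jpg"></span>'
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")
id_ = img.find_parent("span").get("id")  # climb from <img> to the enclosing <span>
print(id_)  # → img_42
```

In the crawler this id is then used to build a unique filename for each saved image.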
DEBUG messages appear while the code runs; raising the logging output level will suppress them.
That covers "How to use Python to view Liyang's photography circle". I hope the content shared here is helpful; for more on related topics, see the industry information channel.