How to use python to view Liyang's photography circle

Updated 2025-01-16. From: SLTechnology News&Howtos, Shulou (Shulou.com), 05/31 report.

This article, "How to use python to view Liyang's photography circle", summarizes how to collect the photos posted in the Liyang photography forum with Python. The steps are laid out in order and the complete code is included for reference.

Target site analysis

The list pages of the target site follow this pagination rule, where {pagecode} is the page number:

http://www.jsly001.com/thread-htm-fid-45-page-{pagecode}.html
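
The pagination rule can be expanded into concrete list page URLs with a few lines of Python. This is a minimal sketch: the host name is the one used in the article's code, and the helper name build_page_urls and the page range 1 to 3 are assumptions made for illustration.

```python
# Pagination template taken from the article; {pagecode} is the page number.
PAGE_URL = "http://www.jsly001.com/thread-htm-fid-45-page-{pagecode}.html"


def build_page_urls(first, last):
    """Return the list page URLs for pages first..last (inclusive)."""
    return [PAGE_URL.format(pagecode=n) for n in range(first, last + 1)]


urls = build_page_urls(1, 3)
for url in urls:
    print(url)
```

How many pages the forum actually has is not stated in the article, so the upper bound has to be checked on the site before crawling.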

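The code combines the threading module, the requests module and the BeautifulSoup module.

Collection follows the path list page → detail page: thread links are read from a list page, then the images are read from each detail page.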

Liyang Photography Circle Photo Collection Code

This is a practical case. The complete code is shown first, followed by an explanation based on its comments and key functions.

The main implementation steps are as follows:

Set the log output level

Declare a LiYangThread class that inherits from threading.Thread

Initialize the thread object

Each thread gets global resources

Call the HTML parsing function

Find the divider of the board's thread list, mainly to avoid collecting the pinned (top) threads

Parse with lxml

Parse out the titles and detail page URLs

Parse the image addresses

Save the pictures

import logging
import random
import threading

import lxml  # not used directly; fails early if the lxml parser is not installed
import requests
from bs4 import BeautifulSoup

# Set the log output level (NOTSET lets every message through, including DEBUG)
logging.basicConfig(level=logging.NOTSET)


# Declare a LiYangThread class that inherits from threading.Thread
class LiYangThread(threading.Thread):
    def __init__(self):
        super().__init__()  # Initialize the thread object
        self._headers = self._get_headers()  # Pick a user agent at random
        self._timeout = 5  # Request timeout in seconds

    # Each thread gets global resources
    def run(self):
        # while True:  # This is where a multithreaded loop would start
        res = None  # stays None if the request fails
        try:
            # Test: fetch the first list page
            res = requests.get(url="http://www.jsly001.com/thread-htm-fid-45-page-1.html",
                               headers=self._headers, timeout=self._timeout)
        except Exception as e:
            logging.error(e)

        if res is not None:
            self._format_html(res.text)  # Call the HTML parsing function

    def _format_html(self, html):
        # Parse with lxml
        soup = BeautifulSoup(html, 'lxml')
        # Find the divider of the board's thread list, mainly to skip pinned threads
        part_tr = soup.find(attrs={'class': 'bbs_tr4'})
        if part_tr is not None:
            # Only links after the divider: detail page addresses
            items = part_tr.find_all_next(attrs={"name": "readlink"})
        else:
            items = soup.find_all(attrs={"name": "readlink"})

        # Parse out the titles and detail page URLs
        data = [(item.text, f'http://www.jsly001.com/{item["href"]}') for item in items]
        # Visit each detail page
        for name, url in data:
            self._get_imgs(name, url)

    def _get_imgs(self, name, url):
        """Parse image addresses"""
        res = None
        try:
            res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
        except Exception as e:
            logging.error(e)

        # Image extraction logic
        if res is not None:
            soup = BeautifulSoup(res.text, 'lxml')
            origin_div1 = soup.find(attrs={'class': 'tpc_content'})
            origin_div2 = soup.find(attrs={'class': 'imgList'})
            content = origin_div2 if origin_div2 is not None else origin_div1
            if content is not None:
                imgs = content.find_all('img')
                # print([img.get("src") for img in imgs])
                self._save_img(name, imgs)

    def _save_img(self, name, imgs):
        """Save pictures"""
        name = name.replace("/", "_")
        for index, img in enumerate(imgs):
            url = img.get("src")
            if not url or url.find('http') < 0:
                continue
            # Find the id attribute in the parent span tag; fall back to the index
            span = img.find_parent('span')
            id_ = span.get("id") if span is not None else index
            res = None
            try:
                res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
            except Exception as e:
                logging.error(e)

            if res is not None:
                # Note: the imgs folder must be created in advance in the directory where Python runs
                with open(f'./imgs/{name}_{id_}.jpg', "wb+") as f:
                    f.write(res.content)

    def _get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua
        }
        return headers


if __name__ == '__main__':
    my_thread = LiYangThread()
    my_thread.start()  # start() executes run() in a new thread
    my_thread.join()

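The listing fetches only the first list page, and the "while True" loop that would hand out more work is commented out. One possible way to let several threads share a pool of page URLs is a queue, sketched below. This is an illustration and not the author's implementation: the names worker, crawl and fake_fetch are hypothetical, and fake_fetch stands in for the real requests call so the sketch runs offline.

```python
import queue
import threading


def worker(tasks, results, lock, fetch):
    """Pull page URLs from the shared queue until none are left."""
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return  # queue drained, let the thread finish
        try:
            body = fetch(url)
        except Exception as exc:  # keep the worker alive when one page fails
            body = exc
        with lock:
            results.append((url, body))


def crawl(urls, fetch, thread_count=3):
    """Fetch all urls with thread_count worker threads and collect the results."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock, fetch))
               for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    # Stand-in for requests.get(url).text so the sketch needs no network access
    return f"<html>{url}</html>"


pages = [f"http://www.jsly001.com/thread-htm-fid-45-page-{n}.html" for n in range(1, 6)]
results = crawl(pages, fake_fetch)
print(len(results))  # 5
```

In a real run, fake_fetch would be replaced by a function that calls requests.get with the headers and timeout used in the listing, and the thread count should stay small so the forum is not overloaded.
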
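In this case, BeautifulSoup parses the HTML with the lxml parser, which is used often in later cases. lxml must be installed before use (for example with pip install lxml). BeautifulSoup selects it through the 'lxml' argument, so the import lxml line in the code is not strictly required.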
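
If lxml might be missing on the machine, BeautifulSoup raises FeatureNotFound when asked for it. A minimal sketch of falling back to the parser built into Python follows. The helper name make_soup and the sample link are assumptions for illustration; only the name="readlink" attribute comes from the article's code.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Prefer the lxml parser, fall back to the standard library parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python
        return BeautifulSoup(html, "html.parser")


# Hypothetical fragment shaped like a thread link on the list page
soup = make_soup('<a name="readlink" href="read-htm-tid-1.html">Sunset</a>')
link = soup.find(attrs={"name": "readlink"})
print(link.text, link["href"])
```

The two parsers can build slightly different trees for badly formed HTML, so results on real pages may differ between them.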

The data extraction part uses the soup.find() and soup.find_all() functions, plus find_all_next() to take only the links that come after the divider. The code also uses the find_parent() function to collect the id attribute of the parent tag.

# Find the id attribute in the parent tag
id_ = img.find_parent('span').get("id")
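
The behavior of these calls can be checked offline on a hand-written fragment. The HTML below is invented for illustration and only borrows the class names and the name="readlink" attribute from the article's code; the real pages may be structured differently. The sketch uses html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Invented sample: one pinned thread, the divider row, one normal thread,
# and a content area with one image inside a span and one without.
html = """
<table>
  <tr><td><a name="readlink" href="pinned.html">Pinned notice</a></td></tr>
  <tr class="bbs_tr4"><td>Threads</td></tr>
  <tr><td><a name="readlink" href="thread-1.html">Lake at dawn</a></td></tr>
</table>
<div class="tpc_content">
  <span id="att_101"><img src="http://example.com/a.jpg"></span>
  <img src="http://example.com/b.jpg">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all_next() only returns matches that come after the divider row
separator = soup.find(attrs={"class": "bbs_tr4"})
titles = [a.text for a in separator.find_all_next(attrs={"name": "readlink"})]

# find_parent() returns None when no enclosing span exists, so guard the .get()
ids = []
for img in soup.find(attrs={"class": "tpc_content"}).find_all("img"):
    span = img.find_parent("span")
    ids.append(span.get("id") if span is not None else None)

print(titles)  # ['Lake at dawn']
print(ids)     # ['att_101', None]
```

The second image shows why calling .get("id") directly on the result of find_parent() can fail: with no enclosing span the result is None.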

DEBUG messages appear while the code runs because the log level is set to logging.NOTSET, which lets every message through. Raising the level passed to logging.basicConfig(), for example to logging.INFO, controls how much is printed.
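
A minimal sketch of controlling the output level follows. The helper name configure_logging is hypothetical, the force argument requires Python 3.8 or later, and the assumption that the noisy DEBUG lines come from the urllib3 logger used underneath requests should be verified against the actual output.

```python
import logging


def configure_logging(level=logging.INFO):
    """Reset the root logger to the given level and quiet the HTTP library."""
    # force=True replaces any handlers configured earlier (Python 3.8+)
    logging.basicConfig(level=level, force=True)
    # requests sends its connection messages through the urllib3 logger
    logging.getLogger("urllib3").setLevel(logging.WARNING)


configure_logging()
logging.debug("not printed at INFO level")
logging.info("printed at INFO level")
```

Errors logged with logging.error() in the crawler are still shown, since ERROR is above both thresholds.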

That covers how to use Python to view Liyang's photography circle.
