In this issue, the editor walks you through how to use a Python crawler to grab NetEase Cloud Music album covers, album titles, and album release dates. The article is rich in content and analyzes the topic from a practical point of view. I hope you get something out of it after reading.
I. Preface
I mentioned earlier the design needs of my designer friend: he wants to make a waterfall (masonry) layout of the Beatles' albums over the years.
A quick search showed that NetEase Cloud Music has fairly complete album information over the years, with cover images of decent quality, although the sizes vary.
Why do all my examples crawl pictures? Blame it on always hanging out with my designer friend. Pictures clearly mean a lot to a designer, so let's see what kind of work he can make with them; I look forward to showing his results later.
In fact, crawling NetEase Cloud Music is slightly different from the sites crawled before. Of course, once you have written enough crawlers, the routines start to feel fixed.
II. Operating environment
My operating environment is as follows:
System version
Windows 10
Python version
Python 3.5. The scientific computing distribution Anaconda is recommended, mainly because its bundled package manager avoids many package installation errors. Go to the Anaconda official website, choose the Python 3.5 version, then download and install it.
IDE
I am using PyCharm, an IDE developed specifically for Python.
III. Hands-on practice
As mentioned above, NetEase Cloud Music's pages differ from ordinary web pages in two main ways:
The web page is dynamically loaded by js
Using the iframe framework
So, first of all, the page cannot be fetched with the requests library alone; you need Selenium + PhantomJS. Second, after switching to Selenium + PhantomJS, you also need to handle the iframe specifically.
Enough talk; let's look at the actual steps:
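If you want to see the first point for yourself, a quick check (not part of the final crawler; the container id m-song-module is the one we parse later) is to fetch the page with plain requests and look for that id:

import requests

# A plain HTTP request only returns the page skeleton; the album list is
# injected later by JavaScript inside an iframe, so the container id is missing.
url = "http://music.163.com/#/artist/album?id=101988&limit=120&offset=0"
html = requests.get(url).text
print('m-song-module' in html)  # expected to print False for the static HTML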
First open the web page http://music.163.com
On the page that appears, select "All Albums" in the red box and click it.
What we need are the cover images of all the albums, the album names, and the release dates. From here you can already picture the crawler's logic: open the page, get the number of pages, then request each page in turn and scrape its content.
Click the page-flip button and see whether the url follows a pattern.
Click on the second page and look at the address bar. Once I saw this address bar, I didn't even bother turning pages anymore.
The limit parameter limits how many albums are loaded on one page.
The offset parameter says how many albums to skip. There are 12 albums per page, so the second page is offset=12, the third page offset=24, and so on.
There are nine pages with 12 albums each, fewer than 120 in total. So... change the url and you don't have to turn pages at all!
Set the limit parameter to 120 and the offset parameter to 0, and you're done! Enter the url below and check that all the albums are loaded.
http://music.163.com/#/artist/album?id=101988&limit=120&offset=0
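If you did want to request the pages one by one instead, the limit/offset pattern makes the page urls easy to generate; a minimal sketch (the variable names here are mine, not from the original code):

# Build the url of every page from the limit/offset rule described above.
base = "http://music.163.com/#/artist/album?id=101988"
page_size = 12          # albums shown per page on the site
for page in range(9):   # nine pages in total
    url = base + "&limit={}&offset={}".format(page_size, page_size * page)
    print(url)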
Let's start the crawler code.
Here we will use several tools and methods written in the previous blog post:
def save_img(self, url, file_name):  # save the picture
    print('Start requesting the image address; this can take a while.')
    img = self.request(url)
    print('Start saving the picture')
    f = open(file_name, 'ab')
    f.write(img.content)
    print(file_name, 'picture saved successfully!')
    f.close()

def request(self, url):  # encapsulated requests call
    r = requests.get(url)  # send a GET request to the target url and return a response object; headers are optional here
    return r

def mkdir(self, path):  # create a folder
    path = path.strip()
    isExists = os.path.exists(path)
    if not isExists:
        print('Creating a folder named', path)
        os.makedirs(path)
        print('Created successfully!')
        return True
    else:
        print(path, 'folder already exists, not creating it again')
        return False

def get_files(self, path):  # get the list of file names in the folder
    pic_names = os.listdir(path)
    return pic_names
OK, start our crawler logic section:
It is worth noting that the page uses an iframe, and after loading the page with Selenium + PhantomJS the content inside the iframe is not yet available. An iframe is essentially another page loaded inside the page, and you need to switch into it with Selenium's switch_to.frame() method (the method given in the official docs used to be switch_to_frame(), but the IDE warns you to replace it with the former).
Look at the structure of the web page below; the iframe's id is "g_iframe":
Load the contents of the iframe:
driver = webdriver.PhantomJS()
driver.get(self.init_url)
driver.switch_to.frame("g_iframe")
html = driver.page_source
Then find all the cover elements:
According to the web page structure shown above, all the album information sits inside a ul tag, and each album sits in its own li tag. Each li contains the image url, the album name, and the album date.
Just grab the contents.
all_li = BeautifulSoup(html, 'lxml').find(id='m-song-module').find_all('li')
for li in all_li:
    album_img = li.find('img')['src']
    album_name = li.find('p', class_='dec')['title']
    album_date = li.find('span', class_='s-fc3').get_text()
The image url obtained here still carries the image's width and height parameters, so we filter them out:
http://p4.music.126.net/pLA1GX0KtU-vU4ZA6Cr-OQ==/1401877340532770.jpg?param=120y120
Filter out the parameters after the question mark:
end_pos = album_img.index('?')        # find the position of the question mark
album_img_url = album_img[:end_pos]   # keep only the part before the question mark
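An equivalent way to drop everything after the question mark is to let the standard library split the url; this is only an alternative sketch using urllib.parse, not what the original code does (album_img is the url taken from the li tag above):

from urllib.parse import urlsplit, urlunsplit

parts = urlsplit(album_img)
# Rebuild the url with an empty query and fragment, keeping scheme, host and path.
album_img_url = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))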
Picture naming logic: album date + album name.
The album title may contain special characters that need to be replaced!
photo_name = album_date + '-' + album_name.replace('/', '').replace(':', '') + '.jpg'
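If you want to be thorough, Windows forbids a few more characters in file names than just '/' and ':'; a small helper built on the re module could strip them all (the pattern and the helper name are my own, not part of the original code):

import re

def safe_file_name(name):
    # Remove every character that Windows does not allow in file names.
    return re.sub(r'[\\/:*?"<>|]', '', name)

photo_name = album_date + '-' + safe_file_name(album_name) + '.jpg'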
Adding the deduplication logic from the previous post's example, the modified crawler logic looks like this:
def spider(self):
    print("Start!")
    driver = webdriver.PhantomJS()
    driver.get(self.init_url)
    driver.switch_to.frame("g_iframe")
    html = driver.page_source
    self.mkdir(self.folder_path)          # create the folder
    print('Start switching to the folder')
    os.chdir(self.folder_path)            # switch the working path to the folder created above
    file_names = self.get_files(self.folder_path)  # get all file names in the folder; the type is list
    all_li = BeautifulSoup(html, 'lxml').find(id='m-song-module').find_all('li')
    for li in all_li:
        album_img = li.find('img')['src']
        album_name = li.find('p', class_='dec')['title']
        album_date = li.find('span', class_='s-fc3').get_text()
        end_pos = album_img.index('?')
        album_img_url = album_img[:end_pos]
        photo_name = album_date + '-' + album_name.replace('/', '').replace(':', '') + '.jpg'
        print(album_img_url, photo_name)
        if photo_name in file_names:
            print('Picture already exists, not downloading it again')
        else:
            self.save_img(album_img_url, photo_name)
In fact, compared with the example of the previous blog post, the logic of this crawler is quite concise.
Finally, the complete code (it can also be downloaded from GitHub):
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import os


class AlbumCover():
    def __init__(self):
        self.init_url = "http://music.163.com/#/artist/album?id=101988&limit=120&offset=0"  # request URL
        self.folder_path = r"C:\D\TheBeatles"  # directory where the downloaded files will be stored

    def save_img(self, url, file_name):  # save the picture
        print('Start requesting the image address; this can take a while.')
        img = self.request(url)
        print('Start saving the picture')
        f = open(file_name, 'ab')
        f.write(img.content)
        print(file_name, 'picture saved successfully!')
        f.close()

    def request(self, url):  # encapsulated requests call
        r = requests.get(url)  # send a GET request to the target url and return a response object; headers are optional here
        return r

    def mkdir(self, path):  # create a folder
        path = path.strip()
        isExists = os.path.exists(path)
        if not isExists:
            print('Creating a folder named', path)
            os.makedirs(path)
            print('Created successfully!')
            return True
        else:
            print(path, 'folder already exists, not creating it again')
            return False

    def get_files(self, path):  # get the list of file names in the folder
        pic_names = os.listdir(path)
        return pic_names

    def spider(self):
        print("Start!")
        driver = webdriver.PhantomJS()
        driver.get(self.init_url)
        driver.switch_to.frame("g_iframe")
        html = driver.page_source
        self.mkdir(self.folder_path)          # create the folder
        print('Start switching to the folder')
        os.chdir(self.folder_path)            # switch the working path to the folder created above
        file_names = self.get_files(self.folder_path)  # get all file names in the folder; the type is list
        all_li = BeautifulSoup(html, 'lxml').find(id='m-song-module').find_all('li')
        for li in all_li:
            album_img = li.find('img')['src']
            album_name = li.find('p', class_='dec')['title']
            album_date = li.find('span', class_='s-fc3').get_text()
            end_pos = album_img.index('?')
            album_img_url = album_img[:end_pos]
            photo_name = album_date + '-' + album_name.replace('/', '').replace(':', '') + '.jpg'
            print(album_img_url, photo_name)
            if photo_name in file_names:
                print('Picture already exists, not downloading it again')
            else:
                self.save_img(album_img_url, photo_name)


album_cover = AlbumCover()
album_cover.spider()
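One caveat if you run this today: recent Selenium releases have dropped PhantomJS support, so the driver creation may fail. In that case a headless Chrome driver can stand in for those few lines; this is only a sketch assuming chromedriver is installed, not what the original post used:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without a visible window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on the PATH
driver.get("http://music.163.com/#/artist/album?id=101988&limit=120&offset=0")
driver.switch_to.frame("g_iframe")
html = driver.page_source
driver.quit()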
Execution result:
See what it looks like in the folder:
We have got the album covers over the years, as well as the album titles and release dates.
IV. Afterwords
This hands-on exercise makes good use of the knowledge explained earlier:
Use Selenium + PhantomJS to crawl dynamically loaded pages
Use Selenium's switch_to.frame() to load the contents of the iframe
Use the requests library to get pictures
Use the BeautifulSoup library to parse and crawl web content.
Use the os library to create a folder and get a list of file names in the folder
The above is how to crawl album covers, album titles, and release dates from NetEase Cloud Music with Python, as shared by the editor. If you happen to have similar questions, the analysis above should help you work through them.