In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-27 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/02 Report--
This article introduces how to get JS dynamic content in Python. The content is very detailed. Interested friends can use it for reference. I hope it will be helpful to you.
The news of the web page can not be found in the HTML source code, and it is all dynamically generated and loaded by JS.
In this case, how should we crawl the web page? There are two ways:
1. Find the JSON data returned by the JS script from the web page response; 2. Use Selenium to simulate the web page visit
Only the first method is introduced here, and there is a special article on the use of Selenium.
Find the JSON data returned by the JS script from the web page response
Even if the web page content is dynamically generated and loaded by JS, JS still needs to call an interface and load and render according to the JSON data returned by the interface.
So we can find the data interface called by JS and find the last data rendered in the web page from the data interface.
Take Jinri Toutiao as an example to demonstrate:
1. From finding the data interface requested by JS
Undefined
Undefined
Undefined
Undefined
Undefined
Undefined
Undefined
Undefined
Undefined
The data is the same as the picture news on the home page, so the data should be in it.
Check out the other links:
This should be a hot search keyword.
This is the news below the photo news.
Let's open an interface link and take a look: http://www.toutiao.com/api/pc/focus/
A string of garbled codes is returned, but the normal encoded data is viewed from the response:
With the corresponding data interface, we can imitate the previous method to request and get response to the data interface.
2. Request and parse data interface data
Start with the complete code:
# coding:utf-8import requestsimport jsonurl = 'http://www.toutiao.com/api/pc/focus/'wbdata = requests.get (url). Textdata = json.loads (wbdata) news = data [' data'] ['pc_feed_focus'] for n in news: title = n [' title'] img_url = n ['image_url'] url = n [' media_url'] print (url,title,img_url)
The result returned is as follows:
As usual, explain the code a little bit:
The code is divided into four parts
Part one: introduce related libraries
# coding:utf-8import requestsimport json
The second part: http request to the data interface
Url = 'wbdata = requests.get (url). Text
The third part: JSON the data in response to HTTP and index to the location of the news data
Data = json.loads (wbdata) news = data ['data'] [' pc_feed_focus']
Part IV: traversing and extracting the indexed JSON data
For n in news: title = n ['title'] img_url = n [' image_url'] url = n ['media_url'] print (url,title,img_url) about how to obtain JS dynamic content in Python is shared here. I hope the above content can be helpful to you and learn more knowledge. If you think the article is good, you can share it for more people to see.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.