2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article mainly introduces how to use Python to crawl web page data. Many people have questions about this in daily work, so the editor has consulted various materials and put together a simple, easy-to-use method. I hope it helps answer your doubts about crawling web page data with Python. Follow along with the editor to study!
Preparation
IDE: PyCharm
Libraries: requests, lxml
Note:
requests: fetches the source code of a web page
lxml: extracts the specified data from the page source
Build the environment
"Building the environment" here does not mean setting up a Python development environment from scratch; it means creating a new Python project in PyCharm and then installing requests and lxml.
Create a new project:
Import the dependencies
Since we are using PyCharm, importing these two libraries is very easy:
import requests
At this point PyCharm underlines requests in red. Place the cursor on requests and press the shortcut Alt+Enter; PyCharm suggests a fix. Select "Install package requests" and PyCharm installs it automatically; we only need to wait a moment and the library is installed. lxml is installed the same way.
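Once both packages are installed, a quick sanity check (a minimal sketch, not from the original article) is to import them and print their versions:

```python
# Verify that both libraries installed correctly by importing them
# and printing their versions.
import requests
from lxml import etree

print(requests.__version__)   # e.g. a version string such as '2.31.0'
print(etree.LXML_VERSION)     # e.g. a tuple such as (4, 9, 3, 0)
```

If either import fails, the package did not install into the interpreter the project is using.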
Get the source code of the web page
As mentioned earlier, requests makes it easy for us to get the source code of a web page.
Take my blog address for example: https://coder-lida.github.io/
Get the source code:
# get the source code
html = requests.get("https://coder-lida.github.io/")
# print the source code
print(html.text)
The code is that simple: html.text is the source code of the URL.
Complete code:
import requests
import lxml

html = requests.get("https://coder-lida.github.io/")
print(html.text)
Output:
Get specified data
Now that we have the source code of the web page, we need lxml to filter out the information we want.
Here I will take getting the list of my blog posts as an example. Open the page, press F12, and copy the element's XPath, as shown in the figure.
We get the content of the web page through XPath syntax.
Get the title of the first article
//*[@id="layout-cart"]/div[1]/a/@title
//: locates from the root node
/: moves down one level
Extract text content: text()
Extract attribute content: @xxxx
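The rules above can be tried out without any network access on a small inline HTML snippet (a self-contained sketch; the snippet, its links, and titles are made up for illustration, reusing the blog's layout-cart id):

```python
from lxml import etree

# A tiny hand-written page to exercise the XPath rules above
snippet = """
<div id="layout-cart">
  <div><a href="/post1" title="First post">First post</a></div>
  <div><a href="/post2" title="Second post">Second post</a></div>
</div>
"""
tree = etree.HTML(snippet)

# text() extracts the text content of every matched <a>
texts = tree.xpath('//*[@id="layout-cart"]/div/a/text()')
print(texts)   # ['First post', 'Second post']

# div[1] selects only the first <div>, and @title extracts its attribute
titles = tree.xpath('//*[@id="layout-cart"]/div[1]/a/@title')
print(titles)  # ['First post']
```

Note that XPath indexing is 1-based: div[1] is the first div, not the second.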
import requests
from lxml import etree

html = requests.get("https://coder-lida.github.io/")
# print(html.text)
etree_html = etree.HTML(html.text)
content = etree_html.xpath('//*[@id="layout-cart"]/div[1]/a/@title')
print(content)
View all article titles
//*[@id="layout-cart"]/div/a/@title
Code:
import requests
from lxml import etree

html = requests.get("https://coder-lida.github.io/")
# print(html.text)
etree_html = etree.HTML(html.text)
content = etree_html.xpath('//*[@id="layout-cart"]/div/a/@title')
print(content)
Output:
['springboot reverse engineering', 'implement a simple version of HashMap', '25 single lines of JavaScript commonly used in development', 'shiro encrypted login password with salt treatment', 'Spring Boot build RESTful API and unit test', 'remember the use of jsoup']
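Because the result above depends on fetching my blog live, here is an offline sketch of the same flow in which the HTTP response is replaced by a hard-coded string shaped like the blog's layout-cart list (the structure and titles here are stand-ins, not the real page):

```python
from lxml import etree

# Stand-in for the html.text that requests.get(...) would return
page_source = """
<div id="layout-cart">
  <div><a title="springboot reverse engineering">post</a></div>
  <div><a title="implement a simple version of HashMap">post</a></div>
</div>
"""

# Same two parsing steps as the real code: build the tree, then query it
etree_html = etree.HTML(page_source)
content = etree_html.xpath('//*[@id="layout-cart"]/div/a/@title')
print(content)  # ['springboot reverse engineering', 'implement a simple version of HashMap']
```

Swapping page_source for html.text from a real requests.get call gives exactly the live version shown above.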
At this point, the study of "how to use Python to crawl web page data" is over. I hope it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it! If you want to continue learning more related knowledge, please keep following the website; the editor will keep working hard to bring you more practical articles!