In this article, the editor explains in detail how to scrape web page content in Python using BeautifulSoup. The content is thorough, the steps are laid out clearly, and the details are handled carefully; I hope this article helps resolve your doubts.
What is web scraping?
The simple answer is: not every website has an API for getting its content. You may want to collect recipes from your favorite cooking website or photos from a travel blog. Without an API, extracting the HTML, or scraping, may be the only way to get that content. I'll show you how to do this in Python.
Note: not all websites welcome scraping, and some prohibit it explicitly. Check with the website owner whether scraping is allowed.
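One practical check is the site's robots.txt file, which many sites publish at the root to describe what automated clients may fetch. The snippet below is a minimal sketch, not part of the original example; it uses Python's standard urllib.robotparser, and the user-agent string "my-scraper" and the /robots.txt URL are illustrative assumptions.
Python:
#!/usr/bin/python3
from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt (assumed to live at the usual root path) and download it.
robots = RobotFileParser()
robots.set_url('https://notes.ayushsharma.in/robots.txt')
robots.read()

# True means the published rules allow this user-agent to fetch the page.
allowed = robots.can_fetch('my-scraper', 'https://notes.ayushsharma.in/technology')
print(allowed)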
How to scrape a website in Python?
For web scraping to work in Python, we will perform three basic steps:
1. Use the requests library to fetch the HTML content.
2. Analyze the HTML structure and identify the tags that contain our content.
3. Use BeautifulSoup to extract the content from those tags and put the data into a Python list.
Installing the libraries
Let's first install the libraries we need. requests fetches the HTML content from the website, and BeautifulSoup parses that HTML and turns it into Python objects. To install both for Python 3, run:
pip3 install requests beautifulsoup4
Extracting the HTML
In this example, I will scrape the Technology section of the site. If you go to that page, you will see a list of articles, each with a title, an excerpt, and a publication date. Our goal is to create a list of articles containing that information.
The complete URL of the Technology page is:
https://notes.ayushsharma.in/technology
We can use Requests to get the HTML content from this page:
#!/usr/bin/python3
import requests

url = 'https://notes.ayushsharma.in/technology'
data = requests.get(url)

print(data.text)
The variable data will contain the HTML source code for the page.
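Before parsing data.text, it can be worth confirming that the request actually succeeded. The following is a minimal sketch, not part of the original example, using the standard requests API to check the HTTP status:
Python:
#!/usr/bin/python3
import requests

url = 'https://notes.ayushsharma.in/technology'
data = requests.get(url)

# raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
# so anything printed below comes from a successful fetch.
data.raise_for_status()

print(data.status_code)   # e.g. 200
print(len(data.text))     # length of the HTML source in characters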
Extracting content from the HTML
To extract our data from the HTML we received, we need to determine which tags contain the content we need.
If you browse the HTML, you will find this section near the top. Rendered on the page, a single article card looks like this:
Using variables in Jekyll to define custom content
I recently discovered that Jekyll's config.yml can be used to define custom variables for reusing content. I feel like I've been living under a rock all this time. But to err over and over again is human.
Aug 2021
This card structure is repeated for every article on the page. We can see that .card-title contains the article title, .card-text contains the excerpt, and .card-footer > small contains the publication date.
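If you want to see the underlying markup of one card rather than just its rendered text, a small sketch like the one below can help. It is not part of the original example and assumes the cards are the a.post-card elements that the extraction code further down selects; adjust the selector if the site's markup differs.
Python:
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup

url = 'https://notes.ayushsharma.in/technology'
data = requests.get(url)

html = BeautifulSoup(data.text, 'html.parser')

# Grab the first article card and print its markup with indentation,
# so the class names (.card-title, .card-text, .card-footer) are easy to spot.
first_card = html.select('a.post-card')[0]
print(first_card.prettify())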
Let's use BeautifulSoup to extract this content.
Python:
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://notes.ayushsharma.in/technology'
data = requests.get(url)

my_data = []

html = BeautifulSoup(data.text, 'html.parser')
articles = html.select('a.post-card')

for article in articles:
    title = article.select('.card-title')[0].get_text()
    excerpt = article.select('.card-text')[0].get_text()
    pub_date = article.select('.card-footer small')[0].get_text()
    my_data.append({"title": title, "excerpt": excerpt, "pub_date": pub_date})

pprint(my_data)
The above code extracts the articles and puts them in the my_data variable. I'm using pprint for a nicely formatted printout, but you can skip it in your own code. Save the code above in a file called fetch.py, and then run it with the following command:
python3 fetch.py
If all goes well, you should see:
Python:
[{'excerpt': "I recently discovered that Jekyll's config.yml can be used to "
             "define custom variables for reusing content. I feel like I've "
             'been living under a rock all this time. But to err over and over '
             'again is human.',
  'pub_date': 'Aug 2021',
  'title': 'Using variables in Jekyll to define custom content'},
 {'excerpt': "In this article, I'll highlight some ideas for Jekyll "
             'collections, blog category pages, responsive web-design, and '
             'netlify.toml to make static website maintenance a breeze.',
  'pub_date': 'Jul 2021',
  'title': 'The evolution of ayushsharma.in: Jekyll, Bootstrap, Netlify, '
           'static websites, and responsive design.'},
 {'excerpt': "These are the top 5 lessons I've learned after 5 years of "
             'Terraform-ing.',
  'pub_date': 'Jul 2021',
  'title': '5 key best practices for sane and usable Terraform setups'},
 ... (truncated)
After reading this, the article "How to scrape web page content in Python using BeautifulSoup" has been covered in full. To truly master it, you still need to practice with it and apply it yourself. If you want to read more related articles, you are welcome to follow the industry information channel.