
How to use Web Scraping to crawl HTML pages


This article explains how to use Web Scraping to crawl HTML pages. The content is simple and clear and easy to learn; please follow along to study how to crawl HTML pages with Web Scraping.

There are three common ways to obtain data:

- crawling HTML web pages

- downloading data files directly, such as csv, txt, or pdf files

- accessing data through an application programming interface (API), such as a movie database or Twitter

If you choose page crawling, you of course need to understand the basic structure of an HTML page first; you can refer to the following:

The basic structure of HTML

HTML tags: head, body, p, a, form, table, etc.

Tags have attributes. For example, the a tag has an href attribute whose value is the target of the link.

class and id are special attributes that HTML uses to control the style of elements through Cascading Style Sheets (CSS). An id is the unique identifier of an element, while a class is used to group elements for styling.

An element can be associated with multiple classes; the classes are separated by spaces, as in the London example below.

The example below comes from W3SCHOOL: the city class is used on three elements, the main class on one, and the London element carries both the city and main classes.
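As a rough sketch of that kind of markup (the exact W3SCHOOL snippet is not reproduced here), the following made-up HTML puts both the city and main classes on the London element; note that BeautifulSoup returns a multi-valued class attribute as a list:

from bs4 import BeautifulSoup

# hypothetical HTML modeled on the W3SCHOOL "city" example
html_doc = """
<div class="city main"><h2>London</h2><p>London is the capital of England.</p></div>
<div class="city"><h2>Paris</h2><p>Paris is the capital of France.</p></div>
<div class="city"><h2>Tokyo</h2><p>Tokyo is the capital of Japan.</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

london = soup.find("div", class_="main")
print(london["class"])       # ['city', 'main'] -- multiple classes come back as a list
print(london.h2.get_text())  # London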

Tags can be referenced by their position relative to each other

Child: a child is a tag within another tag; for example, two p tags inside a div are child tags of that div tag.

Parent: a parent is a tag that contains another tag; for example, the html tag is the parent tag of the body tag.

Sibling: siblings are tags that share the same parent tag. For example, the head and body tags are siblings because they are both directly inside html, and two p tags inside the same body are siblings (see the sketch below).
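A minimal sketch of these parent/child/sibling relationships, using a made-up HTML string and BeautifulSoup's navigation attributes (.children, .parent, .find_next_sibling):

from bs4 import BeautifulSoup

# hypothetical document: head and body are siblings; the two p tags are children of div
html_doc = "<html><head><title>Demo</title></head><body><div><p>first</p><p>second</p></div></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

div = soup.div
print([child.name for child in div.children])   # ['p', 'p']  -- children of the div tag
print(div.parent.name)                          # 'body'      -- parent of the div tag
print(soup.head.find_next_sibling().name)       # 'body'      -- head and body are siblings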

Crawling a web page takes four steps:

Step 1: install the modules

Install requests and beautifulsoup4 to crawl web information. Other options include scrapy, selenium, and so on.

requests: lets you send HTTP/1.1 requests using Python. To install it, open a terminal (Mac) or the Anaconda Command Prompt (Windows) and run the install command.

BeautifulSoup: a web page parsing library; install it the same way.
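The original does not show the commands themselves; the usual pip commands for the two packages named above are:

pip install requests
pip install beautifulsoup4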

Step 2: use the installation package to read the web page source code

Step 3: browse the web source code to find the location where the information needs to be read

Different browsers expose the page source in different ways; a few are listed below, and each browser's documentation has the details.

Firefox: right-click on the web page and select "View Page Source". Safari: see the browser's instructions for showing the page source. Internet Explorer: see the browser's instructions.

Step 4: start reading

BeautifulSoup: simple to use; supports CSS Selectors but not XPath. scrapy: supports both CSS Selectors and XPath. Selenium: can crawl dynamic web pages (for example, pages that keep updating).

Key BeautifulSoup concepts (similar to lxml):

Tag: an XML or HTML tag.

Name: every tag has a name.

Attributes: a tag may have any number of attributes. A tag's attributes are exposed as a dictionary of the form {attribute1_name: attribute1_value, attribute2_name: attribute2_value, ...}; if an attribute has multiple values, the value is stored as a list.

NavigableString: the text within a tag.
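For reference, a short sketch (with a made-up HTML string) of the two lookup styles BeautifulSoup supports, tag search with find_all and CSS selectors with select, plus the NavigableString text inside a tag:

from bs4 import BeautifulSoup

html_doc = '<body><p class="city">London</p><p id="intro">Hello</p></body>'
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.find_all("p"))                        # all p tags
print(soup.select("p.city"))                     # CSS selector: p tags with class "city"
print(type(soup.find("p", id="intro").string))   # NavigableString -- the text within a tag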

The code is as follows:

# import requests and BeautifulSoup packages
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import requests package
import requests

# import BeautifulSoup from package bs4 (i.e. beautifulsoup4)
from bs4 import BeautifulSoup

Get web page content

# send a GET request to the web page
page = requests.get("...")   # URL of the page to crawl, e.g. the "A simple example page" demo

# status_code 200 indicates success;
# any other status code indicates the request did not succeed
if page.status_code == 200:
    # the content attribute gives the content returned in bytes
    print(page.content)   # text in bytes
    print(page.text)      # text in unicode

# parse web page content

# process the returned content using the BeautifulSoup module
# initiate a BeautifulSoup object using the html source and Python's html.parser
soup = BeautifulSoup(page.content, 'html.parser')

# the soup object stands for the **root**
# node of the html document tree
print("Soup object:")

# print the soup object nicely
print(soup.prettify())

# soup.children returns an iterator of all children nodes
print("\nsoup children nodes:")
soup_children = soup.children
print(soup_children)

# convert to list
soup_children = list(soup.children)
print("\nlist of children of root:")
print(len(soup_children))

# html is the only child of the root node
html = soup_children[0]
html

# get the head and body tags
html_children = list(html.children)
print("how many children under html?", len(html_children))

for idx, child in enumerate(html_children):
    print("Child {} is: {}\n".format(idx, child))

# head is the second child of html
head = html_children[1]

# extract all text inside head
print("\nhead text:")
print(head.get_text())

# body is the fourth child of html
body = html_children[3]

# get details of a tag

# get the first p tag in the div of body
div = list(body.children)[1]
p = list(div.children)[1]
p

# get the details of the p tag

# first, get the data type of p
print("\ndata type:")
print(type(p))

# get the tag name (a property of the p object)
print("\ntag name:")
print(p.name)

# a tag object with attributes has a dictionary;
# use .attrs to get the dictionary;
# each attribute name of the tag is a key

# get all attributes
p.attrs

# get the "class" attribute
print("\ntag class:")
print(p["class"])

# how to determine if 'id' is an attribute of p?

# get the text of the p tag
p.get_text()
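One way to answer the question in the comment above, sketched with standard bs4 Tag checks:

# 'id' is an attribute of p only if it appears as a key in p.attrs
print("id" in p.attrs)
# equivalently, Tag objects provide has_attr()
print(p.has_attr("id"))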

Thank you for reading. The above covers how to use Web Scraping to crawl HTML pages. After studying this article, you should have a deeper understanding of the topic; the specifics still need to be verified in practice.
