What is the introduction and function of BeautifulSoup

2025-04-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article introduces BeautifulSoup and what it does: constructing a BeautifulSoup object, working with Tag objects, navigating the tree, searching with find and find_all, and selecting nodes with CSS selectors through select.

1. BeautifulSoup Construction

1.1 Build from a string

    from bs4 import BeautifulSoup

    # Sample markup (the tags are an assumption)
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p>Once upon a time there were three little sisters; and their names were</p>
    </body></html>
    """
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())

1.2 Load from a file

    from bs4 import BeautifulSoup

    with open(r"F:\tmp\etree.html") as fp:
        soup = BeautifulSoup(fp, "lxml")
    print(soup.prettify())

2. Tag object

2.1 string, strings, stripped_strings

If a node contains only a single text node, you can access that text directly through string.

If the node contains more than a single text node, string is None.

In that case you can get the text through strings and stripped_strings. Both are generators; stripped_strings also strips surrounding whitespace and skips strings that are only whitespace.
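As a minimal sketch of the three accessors, the snippet below parses a small made-up fragment and reads its text each way. The markup and variable names are illustrative assumptions, not taken from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a div holding two p nodes
html = "<div><p>  hello </p><p>world</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")
div = soup.find("div")

# p contains a single text node, so .string returns it unchanged
single = first_p.string
# div contains two child tags, so .string is None
missing = div.string
# .strings and .stripped_strings are generators; list() consumes them
raw = list(div.strings)
clean = list(div.stripped_strings)

print(repr(single), missing, raw, clean)
```

Because both generators are consumed lazily, wrapping them in list() is only needed when the values are reused.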

2.2 get_text()

Gets only the text nodes.

    soup.get_text()
    # You can specify a separator to place between the text of different nodes
    soup.get_text("|")
    # You can also strip the whitespace around each piece of text
    soup.get_text("|", strip=True)

2.3 Attributes

tag.attrs is a dictionary, and a value can be read with tag['id']. Subscript access raises KeyError when the attribute is missing, so you can use tag.get('id') instead, which returns None if the id attribute does not exist.
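The following sketch, built on an assumed one-line fragment, shows get_text with and without stripping, and the two ways of reading an attribute. The fragment and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one attribute and nested text
soup = BeautifulSoup('<p id="p1">a <b>b</b></p>', "html.parser")
p = soup.p

# Separator only: the trailing space after "a" is kept
joined = soup.get_text("|")
# Separator plus strip=True: each piece is stripped first
stripped = soup.get_text("|", strip=True)

attrs = p.attrs          # plain dict of attributes
by_key = p["id"]         # subscript access
by_get = p.get("class")  # missing attribute returns None

# Subscript access on a missing attribute raises KeyError
try:
    p["class"]
    raised = False
except KeyError:
    raised = True

print(joined, stripped, attrs, by_key, by_get, raised)
```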

3. contents, children and descendants

contents and children both give the child nodes of a node, but contents is a list, while children is an iterator over the same nodes.

contents and children include only direct child nodes. descendants is a generator that also walks through the children of children, at every depth.
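A small sketch of the difference, using an assumed fragment with one level of nesting. The isinstance check against Tag is used here to separate tags from text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: div > (p > b), span
soup = BeautifulSoup(
    "<div><p>one<b>two</b></p><span>three</span></div>", "html.parser"
)
div = soup.div

# contents is a real list, so it supports len() and indexing
contents_is_list = isinstance(div.contents, list)
# children iterates over the same direct children
direct = [child.name for child in div.children]
# descendants walks the whole subtree, tags and text nodes alike
nested_tags = [n.name for n in div.descendants if isinstance(n, Tag)]
nested_text = [str(n) for n in div.descendants if not isinstance(n, Tag)]

print(direct, nested_tags, nested_text)
```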

3.1 parent, parents

parent: the parent node
parents: all ancestors in turn, from the parent up to the root

3.2 next_sibling, previous_sibling

next_sibling: the next sibling node
previous_sibling: the previous sibling node

3.3 next_element, previous_element

next_element: the next node in parse order
previous_element: the previous node in parse order

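The difference between next_element and next_sibling is where each one looks:

next_sibling looks after the closing tag of the current tag, so it returns the next node at the same level.

next_element looks after the start tag of the current tag, so it returns whatever was parsed next, which is usually the tag's first child.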
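To make the navigation attributes concrete, here is a minimal sketch on an assumed two-element fragment. The markup is hypothetical and has no whitespace between tags, so siblings are tags rather than text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a div with a p followed by a span
soup = BeautifulSoup("<div><p>one</p><span>two</span></div>", "html.parser")
p = soup.p
span = soup.span

parent_name = p.parent.name
# parents climbs to the BeautifulSoup object, whose name is "[document]"
ancestor_names = [a.name for a in p.parents]

# Sibling navigation stays on the same level
next_sib = p.next_sibling.name
prev_sib = span.previous_sibling.name

# Element navigation follows parse order, so it enters the tag
next_el = str(p.next_element)         # the text inside p
prev_el = str(span.previous_element)  # the last thing parsed before span

print(parent_name, ancestor_names, next_sib, prev_sib, next_el, prev_el)
```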

4. find and find_all

4.1 Methods

find_parent: find the nearest ancestor that meets the condition
find_parents: find all ancestors that meet the condition
find_next_siblings: find all following siblings that meet the condition
find_next_sibling: find the first following sibling that meets the condition
find_all_next: find all nodes after the current node that meet the condition
find_next: find the first node after the current node that meets the condition
find_all_previous: find all nodes before the current node that meet the condition
find_previous: find the first node before the current node that meets the condition
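The directional methods listed above can be sketched on a short list. The markup and ids are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical list with three items
html = "<ul><li id='a'>A</li><li id='b'>B</li><li id='c'>C</li></ul>"
soup = BeautifulSoup(html, "html.parser")
a = soup.find(id="a")
b = soup.find(id="b")

# Upward: nearest matching ancestor
parent_name = b.find_parent("ul").name
# Sideways: first match, then all matches, among following siblings
next_one = b.find_next_sibling("li")["id"]
next_all = [li["id"] for li in a.find_next_siblings("li")]
# Parse order: matches after and before the current node
after_all = [li["id"] for li in a.find_all_next("li")]
before_one = b.find_previous("li")["id"]

print(parent_name, next_one, next_all, after_all, before_one)
```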

4.2 Tag name

    # find all p nodes
    soup.find_all('p')
    # find title nodes among the direct children only (no recursion)
    soup.find_all("title", recursive=False)
    # find p nodes and span nodes
    soup.find_all(["p", "span"])
    # find the first a node
    soup.find_all("a", limit=1)
    soup.find('a')

4.3 Attributes

    # find nodes whose id is id1
    soup.find_all(id='id1')
    # find nodes whose name attribute is tim; use attrs for this, because
    # name= is the tag-name argument and find_all(name="tim") matches <tim> tags
    soup.find_all(attrs={"name": "tim"})
    # find p nodes whose class is clazz
    soup.find_all("p", "clazz")
    soup.find_all("p", class_="clazz")
    # find p nodes whose class attribute is exactly "body strikeout"
    soup.find_all("p", class_="body strikeout")

4.4 Regular expressions

    import re
    # find nodes whose class begins with p
    soup.find_all(class_=re.compile("^p"))

4.5 Functions

    # find nodes that have a class attribute and no id attribute
    def hasClassNoId(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

    soup.find_all(hasClassNoId)

4.6 Text

    soup.find_all(string="tim")
    soup.find_all(string=["alice", "tim", "allen"])
    soup.find_all(string=re.compile("tim"))

    # find text nodes that are the only content of their parent
    def onlyTextTag(s):
        return (s == s.parent.string)

    soup.find_all(string=onlyTextTag)
    # find a nodes whose text is tim
    soup.find_all("a", string="tim")

5. select

5.1 Methods

Like find, select comes in two forms: select and select_one. select returns every element that meets the condition, while select_one returns only the first.

The key to select is the selector, so the examples below walk through the commonly used ones. If you are not familiar with CSS selectors, see the introduction to CSS selectors in section 7.
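As a rough sketch of how the two select methods line up with find_all and find, the snippet below runs both styles against an assumed fragment. The class name x and the hrefs are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links sharing a class
html = '<div><a class="x" href="/1">first</a><a class="x" href="/2">second</a></div>'
soup = BeautifulSoup(html, "html.parser")

# select returns every match; select_one returns the first match or None
all_links = [a["href"] for a in soup.select("a.x")]
first_link = soup.select_one("a.x")["href"]
no_match = soup.select_one("a.y")

# The same queries written with find_all and find
via_find_all = [a["href"] for a in soup.find_all("a", class_="x")]
via_find = soup.find("a", class_="x")["href"]

print(all_links, first_link, no_match, via_find_all, via_find)
```

Which style to use is mostly a matter of taste: selectors are compact for structural queries, while find_all accepts regular expressions and functions as filters.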

5.2 Select by tag

    # select the title node
    soup.select("title")
    # select all a nodes under the body node
    soup.select("body a")
    # select the title node under the head node under the html node
    soup.select("html head title")

Selecting by tag is simple: write the tag names level by level, separated by spaces.

5.3 id and class selectors

    # select nodes whose class is article
    soup.select(".article")
    # select the a node whose id is id1
    soup.select("a#id1")
    # select the node whose id is id1
    soup.select("#id1")
    # select the nodes whose id is id1 or id2
    soup.select("#id1,#id2")

id and class selectors are also simple: a class selector starts with . and an id selector starts with #.
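The tag, id and class selectors can be tried on a small page. This is a sketch under assumed markup; the ids and class name mirror the selectors in the text, and everything else is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical page using the ids and class from the text
html = """
<html><head><title>Demo</title></head>
<body>
<div class="article"><a id="id1" href="/one">one</a></div>
<p><a id="id2" href="/two">two</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag names separated by spaces descend level by level
titles = [t.get_text() for t in soup.select("html head title")]
body_links = [a["id"] for a in soup.select("body a")]
# A leading dot selects by class, a leading # selects by id
articles = [d.name for d in soup.select(".article")]
by_id = [a["href"] for a in soup.select("a#id1")]
# A comma combines selectors; results come back in document order
either = [a["id"] for a in soup.select("#id1,#id2")]

print(titles, body_links, articles, by_id, either)
```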

5.4 Attribute selectors

    # select a nodes that have an href attribute
    soup.select('a[href]')
    # select a nodes whose href attribute is http://mycollege.vip/tim
    soup.select('a[href="http://mycollege.vip/tim"]')
    # select a nodes whose href begins with http://mycollege.vip/
    soup.select('a[href^="http://mycollege.vip/"]')
    # select a nodes whose href ends with png
    soup.select('a[href$="png"]')
    # select a nodes whose href contains china
    soup.select('a[href*="china"]')
    # select a nodes whose href contains china as a whole, space-separated word
    soup.select("a[href~=china]")

5.5 Other selectors

    # p nodes whose parent is a div node
    soup.select("div > p")
    # p nodes immediately preceded by a div node
    soup.select("div + p")
    # ul nodes that come after a p node with the same parent
    soup.select("p~ul")
    # the third p node among its parent's p children
    soup.select("p:nth-of-type(3)")

6. Example

Finally, let's take a look at the use of BeautifulSoup through a small example.

    from bs4 import BeautifulSoup

    # Markup sketched from the selectors used below (an assumption);
    # the href, src and title values are left elided.
    text = '''
    <a class="nbg" href="..."><img src="..."/></a>
    <h3><a href="..." title="...">Worry-relieving grocery store</a></h3>
    <div class="pub">[Japan] Keigo Higashino / Li Yingchun / Nanhai Publishing Company / 2014-5 / 39.50 RMB</div>
    <span class="rating_nums">8.5</span>
    <span class="pl">(evaluated by 537322 people)</span>
    <p>This grocery store can help you find what is lost in the hearts of modern people. There is a grocery store on a secluded street: as long as you write down your troubles and drop the letter through the slot in the shutter door, you will find an answer in the milk box at the back of the store the next day. Because of her boyfriend's illness.</p>
    '''

    soup = BeautifulSoup(text, 'lxml')
    print(soup.select_one("a.nbg").get("href"))
    print(soup.find("img").get("src"))
    title = soup.select_one("h3 a")
    print(title.get("href"))
    print(title.get("title"))
    print(soup.find("div", class_="pub").string)
    print(soup.find("span", class_="rating_nums").string)
    print(soup.find("span", class_="pl").string.strip())
    print(soup.find("p").string)

It's very simple, and many complex structures can be easily handled if you are familiar with CSS selectors.

7. CSS selectors

7.1 Common selectors

    Selector             Example            Description
    .class               .intro             Selects all nodes with class="intro"
    #id                  #firstname         Selects all nodes with id="firstname"
    *                    *                  Selects all nodes
    element              p                  Selects all p nodes
    element,element      div,p              Selects all div nodes and all p nodes
    element element      div p              Selects all p nodes inside div nodes
    element>element      div>p              Selects all p nodes whose parent is a div node
    element+element      div+p              Selects all p nodes immediately after a div node
    element~element      p~ul               Selects all ul nodes that share a parent with a p node and come after it
    [attribute^=value]   a[src^="https"]    Selects each a node whose src attribute value begins with "https"
    [attribute$=value]   a[src$=".png"]     Selects each a node whose src attribute value ends with ".png"
    [attribute*=value]   a[src*="abc"]      Selects each a node whose src attribute value contains the substring "abc"
    [attribute]          [target]           Selects all nodes that have a target attribute
    [attribute=value]    [target=_blank]    Selects all nodes with target="_blank"
    [attribute~=value]   [title~=china]     Selects all nodes whose title attribute contains the word "china"
    [attribute|=value]   [lang|=zh]         Selects all nodes whose lang attribute value is "zh" or begins with "zh-"

div p includes descendants at any depth, such as grandchildren. div > p selects only direct children.
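A short sketch of the descendant and child combinators and of the attribute operators from the table, run through select. The markup and the example.com URLs are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one direct p, one nested p, two links
html = """
<div>
  <p>child</p>
  <section><p>grandchild</p></section>
  <a href="https://example.com/china.png" title="visit china">img</a>
  <a href="http://example.com/page">page</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (space) versus child combinator (>)
descendants = [p.get_text() for p in soup.select("div p")]
children = [p.get_text() for p in soup.select("div > p")]

# Attribute operators: prefix, suffix, substring, whole word
starts = [a.get_text() for a in soup.select('a[href^="https"]')]
ends = [a.get_text() for a in soup.select('a[href$=".png"]')]
contains = [a.get_text() for a in soup.select('a[href*="page"]')]
word = [a.get_text() for a in soup.select('a[title~="china"]')]

print(descendants, children, starts, ends, contains, word)
```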

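The element~element selector is a little harder to understand. Take a look at the following example:

    <style>
    p~ul {background: red;}
    </style>
    <div>
      <ul><li>ul-li1</li></ul>
      <p>p label</p>
      <ul><li>ul-li2</li></ul>
      <h3>h3 tag</h3>
      <ul><li>ul-li3</li></ul>
    </div>

The ul nodes containing ul-li2 and ul-li3 are selected, because they come after the p node under the same parent. The first ul comes before the p node, so it is not selected.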
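The same behavior can be checked with select. The sketch below assumes markup in which the labels ul-li1, p label, ul-li2, h3 tag and ul-li3 are siblings in that order inside one div; the tags themselves are an assumption.

```python
from bs4 import BeautifulSoup

# Assumed markup: ul, p, ul, h3, ul under one parent
html = (
    "<div>"
    "<ul><li>ul-li1</li></ul>"
    "<p>p label</p>"
    "<ul><li>ul-li2</li></ul>"
    "<h3>h3 tag</h3>"
    "<ul><li>ul-li3</li></ul>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# p~ul: every ul after a p with the same parent
general = [ul.get_text() for ul in soup.select("p~ul")]
# p+ul: only a ul placed immediately after a p
adjacent = [ul.get_text() for ul in soup.select("p+ul")]

print(general, adjacent)
```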

7.2 Position selectors

    Selector               Example                 Description
    :first-of-type         p:first-of-type         Selects each p node that is the first p node of its parent
    :last-of-type          p:last-of-type          Selects each p node that is the last p node of its parent
    :only-of-type          p:only-of-type          Selects each p node that is the only p node of its parent
    :only-child            p:only-child            Selects each p node that is the only child of its parent
    :nth-child(n)          p:nth-child(2)          Selects each p node that is the second child of its parent
    :nth-last-child(n)     p:nth-last-child(2)     Same as above, but counting from the last child
    :nth-of-type(n)        p:nth-of-type(2)        Selects each p node that is the second p node of its parent
    :nth-last-of-type(n)   p:nth-last-of-type(2)   Selects each p node that is the second-to-last p node of its parent
    :last-child            p:last-child            Selects each p node that is the last child of its parent

The two to keep apart are tag:nth-child(n) and tag:nth-of-type(n): nth-child counts all children regardless of type, while nth-of-type counts only children with the same tag.

It's a little roundabout, so take a look at the following example.

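    <title>Nth</title>
    <style>
    #wrap p:nth-of-type(3) {background: red;}
    #wrap p:nth-child(3) {background: yellow;}
    </style>
    <div id="wrap">
      <p>1-1p</p>
      <div>2-1div</div>
      <p>3-2p</p>
      <p>4-3p</p>
      <p>5-4p</p>
    </div>

In each label the first number is the position among all children and the second is the position among children of the same tag. p:nth-child(3) therefore matches 3-2p, while p:nth-of-type(3) matches 4-3p.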
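A minimal sketch that runs both selectors through select, assuming the five labelled nodes are the children of a div with id wrap, as the selectors suggest.

```python
from bs4 import BeautifulSoup

# Assumed markup: p, div, p, p, p inside #wrap
html = (
    '<div id="wrap">'
    "<p>1-1p</p><div>2-1div</div><p>3-2p</p><p>4-3p</p><p>5-4p</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# nth-of-type counts only p siblings, so the third p is the fourth child
of_type = [p.get_text() for p in soup.select("#wrap p:nth-of-type(3)")]
# nth-child counts every sibling, so the third child happens to be a p
nth_child = [p.get_text() for p in soup.select("#wrap p:nth-child(3)")]
# The second child is a div, so no p matches here
second_child = soup.select("#wrap p:nth-child(2)")

print(of_type, nth_child, second_child)
```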

7.3 Other selectors

    Selector          Example          Description
    :not(selector)    :not(p)          Selects nodes other than p nodes
    :empty            p:empty          Selects p nodes that have no child nodes
    ::selection       ::selection      Selects the portion selected by the user
    :focus            input:focus      Selects the input node that has focus
    :root             :root            Selects the root node of the document
    :enabled          input:enabled    Selects each enabled input node
    :disabled         input:disabled   Selects each disabled input node
    :checked          input:checked    Selects each checked input node
    :link             a:link           Selects all unvisited links
    :visited          a:visited        Selects all visited links
    :active           a:active         Selects the active link
    :hover            a:hover          Selects the link the mouse pointer is over
    :first-letter     p:first-letter   Selects the first letter of each p node
    :first-line       p:first-line     Selects the first line of each p node
    :first-child      p:first-child    Selects each p node that is the first child of its parent
    :before           p:before         Inserts content before the content of each p node
    :after            p:after          Inserts content after the content of each p node
    :lang(language)   p:lang(it)       Selects each p node whose lang attribute value begins with "it"
