

How to use Python to extract the text of web pages

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article introduces how to extract the main text of a web page with Python. The explanation is detailed, easy to understand, and simple to apply in practice; after reading it, you should be able to implement web page text extraction yourself. Let's take a look.

A typical news page consists of several different areas:

(Figure: the areas of a news page)

The news elements we are going to extract are as follows:

Title area

Meta data area (release time, etc.)

Image area (if you also want to extract the article's accompanying images)

Text area

The text in the navigation bar area and the related-links area is not part of the news.

The title, release time, and body text of the news are generally extracted from the html we have crawled. If we only deal with the news pages of a single website, extracting these three items is easy: three regular expressions can extract them perfectly. However, our crawler fetches pages from hundreds of websites. Writing regular expressions for so many different formats is exhausting, a slight redesign of any page can break its expressions, and maintaining such a collection of expressions quickly becomes a burden.

Since that exhausting practice doesn't scale, we need to find a good algorithm instead.

1. Extraction of title

The title usually appears in the html <title> tag, but with extra information such as the channel name and website name appended.

The title also appears in the "title area" of the page.

So which of these two places is it easier to extract the headline from?

The "title area" of the page is not clearly marked up, and its html code differs greatly from site to site, so this area is not easy to extract from.

That leaves the <title> tag, which is easy to extract whether with regular expressions or lxml parsing; the hard part is removing the channel name, website name, and other extras.

First, let's look at what <title> tags full of this additional information look like:

Shanghai uses "Wisdom" to activate urban traffic pulse to make roads safer, more orderly and more unobstructed. Pujiang headline _ thepaper.cn-The Paper

The "Shanghai-Hong Kong University Alliance" established today in Fudan University _ Education _ Xinmin Network

The old man in Sanya was sentenced to 3 years in prison for kicking the driver and causing the bus to lose control and hit the wall.

Ministry of Foreign Affairs: Sino-US diplomatic and Security Dialogue held in the United States on the 9th

The Expo: China's action has attracted worldwide attention, and China acts as the world's likes.

Capital market ushered in major reform what is the deep meaning of the establishment of Science and Technology Innovation Board? -Xinhuanet

Looking at these titles, it is not hard to notice that there are connector symbols (such as "_" and "-") between the news headline, the channel name, and the website name. We can split the title on these connectors and take the longest fragment, which is usually the news headline.

This idea is also very easy to implement, so no full code is given here; it is left as a thinking exercise for readers to implement on their own.
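As a starting point for that exercise, here is a minimal sketch of the split-on-connectors idea. The set of connector characters below is an assumption drawn from the sample titles above; real sites may use others, so extend it as needed.

```python
import re

def extract_title(page_title: str) -> str:
    """Split a <title> string on common connector symbols and return the
    longest fragment, which is usually the news headline."""
    # Connector characters seen between headline, channel name, and site
    # name in the samples above (an assumed, non-exhaustive set).
    parts = re.split(r'[-_|]', page_title)
    parts = [p.strip() for p in parts if p.strip()]
    if not parts:
        return page_title.strip()
    # The headline is normally the longest fragment.
    return max(parts, key=len)
```

For example, `extract_title('Some long news headline here _ Education _ Xinmin Network')` returns the headline fragment rather than the channel or site name.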

2. Extraction of release time

The release time is the time at which the page went online on the site. It usually appears just below the title, in the meta data area. In the html code there is no special feature we can use to locate this area, and with so many different website layouts it is almost impossible to pinpoint it reliably. We have to find another way.

Like the title, let's first take a look at the release time of some websites:

22:22 on November 6, 2018

Time: 2018-11-07 14:27:00

2018-11-07 11:20:37 Source: Xinhuanet

Source: China Daily Network, 2018-11-07 08:06:39

07:39:19 on November 07, 2018

2018-11-06 09:58 Source: thepaper.cn

The release times written on web pages share a common feature: they are strings made up of year, month, day, hour, minute, and second, and nothing more. By enumerating a handful of regular expressions for the different time formats (there are only a few), we can match and extract the release time from the page text.

This idea is also easy to implement, but the details matter: the expressions should cover as many formats as possible, and writing a robust release-time extractor is not so simple. Readers are encouraged to try their hand and see what kind of function they can write; this, too, is left as an exercise.
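As a starting point, here is a minimal sketch that enumerates a few patterns covering the sample formats listed above. The pattern list is deliberately small and assumed; a production extractor would need many more variants.

```python
import re

# A few (non-exhaustive) patterns for common date-time formats seen on
# news pages; extend this list to cover more sites.
DATE_PATTERNS = [
    # 2018-11-07 14:27:00  /  2018-11-07 14:27
    re.compile(r'\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}(?::\d{1,2})?'),
    # 2018/11/07 14:27
    re.compile(r'\d{4}/\d{1,2}/\d{1,2}\s+\d{1,2}:\d{1,2}'),
    # 2018年11月06日 22:22 (as written on the original Chinese pages)
    re.compile(r'\d{4}年\d{1,2}月\d{1,2}日\s*\d{1,2}:\d{1,2}'),
]

def extract_publish_time(page_text: str) -> str:
    """Return the first date-time string matched in the page text, or ''."""
    for pattern in DATE_PATTERNS:
        m = pattern.search(page_text)
        if m:
            return m.group(0)
    return ''
```

Note that a page may contain several timestamps (e.g. in related-links areas), so in practice you may want to prefer matches near the title rather than simply the first one.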

3. Extraction of text

The body text (including news images) is the main part of a news web page; it occupies the middle of the page visually and is the main text area of the news content. There are many ways to extract it, some complex and some simple. The method introduced here, distilled from the author's years of practical experience, is simple and fast; we call it the "node text density method".

We know that the html code of a web page is composed of different tags forming a tree structure, and each tag is a node of that tree. By traversing every node of the tree, we can find the node containing the most text, which is the node where the body text lives. Let's implement this idea in code.

3.1 Implementation source code

```python
#!/usr/bin/env python3
# File: maincontent.py
# Author: veelion

import re
import time
import traceback

import cchardet
import lxml
import lxml.html
from lxml.html import HtmlComment

REGEXES = {
    'okMaybeItsACandidateRe': re.compile(
        'and|article|artical|body|column|main|shadow', re.I),
    'positiveRe': re.compile(
        ('article|arti|body|content|entry|hentry|main|page|'
         'artical|zoom|arti|message|editor|'
         'pagination|post|txt|text|blog|story'), re.I),
    'negativeRe': re.compile(
        ('copyright|combx|comment|com-|contact|foot|footer|footnote|decl|copy|'
         'notice|'
         'masthead|media|meta|outbrain|promo|related|link|pagebottom|bottom|'
         'other|shoutbox|sidebar|sponsor|shopping|tags|tool|widget'), re.I),
}


class MainContent:
    def __init__(self):
        self.non_content_tag = set([
            'object', 'embed', 'iframe', 'marquee', 'select',
        ])
        self.title = ''
        self.p_space = re.compile(r'\s')
        # self.p_html = re.compile(r'...
        # (the source listing is cut off at this point in the original article)
```
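As a complement to the author's lxml-based implementation, here is a minimal, dependency-free sketch of the node text density idea using only the standard library's `html.parser`. It is a naive illustration, not the author's method: it simply keeps the container element whose subtree accumulates the most text, without the tag/class scoring heuristics a full implementation would add.

```python
from html.parser import HTMLParser

class TextDensityParser(HTMLParser):
    """Naive node-text-density sketch: track the text accumulated under
    each open container element and remember the one with the most."""
    SKIP = {'script', 'style', 'head', 'iframe', 'select', 'marquee'}
    BLOCKS = {'div', 'article', 'section', 'td'}  # candidate containers

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, [text pieces]) for each open element
        self.skip_depth = 0    # inside a SKIP tag -> ignore text
        self.best_text, self.best_len = '', 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.stack.append((tag, []))

    def handle_endtag(self, tag):
        # Pop until the matching open tag (tolerates unclosed tags crudely).
        while self.stack:
            open_tag, pieces = self.stack.pop()
            text = ''.join(pieces)
            if self.stack:
                # Propagate text upward so ancestors count it too.
                self.stack[-1][1].append(text)
            if open_tag in self.BLOCKS and len(text) > self.best_len:
                self.best_len, self.best_text = len(text), text
            if open_tag == tag:
                break
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.stack and not self.skip_depth:
            self.stack[-1][1].append(data)

def extract_main_text(html: str) -> str:
    """Return the text of the densest container element in the page."""
    parser = TextDensityParser()
    parser.feed(html)
    parser.close()
    return parser.best_text.strip()
```

Because text is propagated upward, an outer wrapper div can outscore the real text node; the author's full implementation refines the choice with the positive/negative class-name patterns shown above and by excluding non-content tags.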
