This article explains how to use Python to accurately locate and extract the core content of a web page. The explanation is straightforward and easy to follow; work through the ideas step by step and you should come away with a working approach.
Generate a PDF
My first idea was a clever shortcut: use a tool (wkhtmltopdf[2]) to render the target web page as a PDF file.
The advantage is that you don't have to care about the specific structure of the page; it is like taking a photograph of the page, so the article layout stays intact.
Although a PDF can still be searched at the source level, generating PDFs has many drawbacks: it consumes a lot of computing resources, is slow, has a high error rate, and the files are large.
Tens of thousands of pages already exceeded 200 GB; if the volume of data grows, storage alone becomes a serious problem.
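For reference, a minimal sketch of this rendering approach using the pdfkit wrapper around wkhtmltopdf; the URL and output path are placeholders, not from the original article:

    import pdfkit  # thin Python wrapper around the wkhtmltopdf command-line tool

    # Render a target page to PDF; requires the wkhtmltopdf binary to be installed.
    pdfkit.from_url("https://example.com/some-article", "article.pdf")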
Extract article content
Instead of generating PDFs, a simpler way is to extract all of the text on the page via XPath[3].
But the content loses its structure and becomes hard to read. Worse, a web page contains plenty of irrelevant content such as sidebars, advertisements, and related links, which would be extracted as well and hurt the accuracy of the result.
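As a baseline, the naive all-text extraction in Scrapy looks roughly like this (a sketch; response is the usual object passed to a spider callback):

    # Grab every text node under <body>, strip whitespace, and join the pieces.
    texts = response.xpath("//body//text()").getall()
    plain = " ".join(t.strip() for t in texts if t.strip())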
To keep some structure while still isolating the core content, the only option is to identify and extract the article's own structure, much like a search engine does: find a way to recognize the core content of the page.
We know that the core content of a page (the article body) usually has relatively concentrated text, so that is a good place to start the analysis.
So I wrote some code. I used Scrapy[4] as the crawler framework; only the part that extracts the article is shown here:
divs = response.xpath("body//div")sel = Nonemaxvalue = 0for d in divs: ds = len(d.xpath(".// div")) ps = len(d.xpath(".// p")) value = ps - ds if value > maxvalue: sel = { "node": d, "value": value } maxvalue = value print("".join(sel['node'].getall()))
response is the page response, which contains the full content of the page; the parts you want can be extracted with XPath.
"body//div" means to extract all div child tags under the body tag, note://operation is recursive
Iterate over all extracted tags and count the div and p elements each one contains.
The difference between the number of p and the number of div is used as the element's weight: if an element contains many p tags, it is likely to be the article body.
By comparing the weights and selecting the element with the largest one, we get the main body of the article.
Once the article body is found, extract the content of that element; getall() here is roughly equivalent to jQuery's outerHTML[5].
Simple and clear, and testing it on a few pages worked well.
However, when extracting a large number of pages, many pages produced no data. A closer look revealed two problems.
Some articles are wrapped in elements other than div, so they were not captured at all.
In some articles, each p has its own div wrapped around it, so the p and div counts cancel each other out (a small illustration follows).
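A minimal, made-up illustration of the second failure mode, using Scrapy's Selector directly on an HTML string:

    from scrapy.selector import Selector

    # Hypothetical markup: every <p> is wrapped in its own <div>.
    html = """
    <body>
      <div id="article">
        <div><p>First paragraph of the article.</p></div>
        <div><p>Second paragraph of the article.</p></div>
        <div><p>Third paragraph of the article.</p></div>
      </div>
    </body>
    """
    article = Selector(text=html).xpath("//div[@id='article']")[0]
    ds = len(article.xpath(".//div"))  # 3 inner wrapper divs
    ps = len(article.xpath(".//p"))    # 3 paragraphs
    print(ps - ds)  # 0 -- the counts cancel out, so the article node never wins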
So I adjusted the strategy slightly: instead of only looking at div elements, look at all elements.
The preference is still for nodes with more p tags, and among those, for nodes with fewer div tags. The adjusted code is as follows:
divs = response.xpath("body//*")sels = []maxvalue = 0for d in divs: ds = len(d.xpath(".// div")) ps = len(d.xpath(".// p")) if ps >= maxvalue: sel = { "node": d, "ps": ps, "ds": ds } maxvalue = ps sels.append(sel) sels.sort(lambda x: x.ds) sel = sels[0] print("".join(sel['node'].getall()))
In the main loop, first pick the nodes with a large number of p tags; note that the if condition now uses >=, so nodes that tie on the p count are kept as well.
After filtering, sort by the number of div tags and select the candidate with the fewest.
This change does make up for the earlier problems to some extent, but it introduces a more troublesome one.
The article body it finds is unstable, and it is especially easy to throw off with stray p tags in other parts of the page.
Select the optimal node
Since a direct count is not good enough, the algorithm needs to be redesigned.
I noticed that the place where text is concentrated is usually the article body; the earlier approach does not take this into account and just mechanically looks for the most p tags.
Another point is that a web page's structure is a DOM tree[6].
The closer a node is to the p tags, the more likely it is to be the article body. In other words, nodes closer to the p tags should carry more weight, while nodes farther away, even if they contain many p descendants, should carry less.
After trial and error, the final code is as follows:
def find(node, sel):
    value = 0
    for n in node.xpath("*"):                      # direct children only
        if n.xpath("local-name()").get() == "p":
            # Direct text plus the text of immediate children, stripped and joined.
            t = "".join([s.strip() for s in (n.xpath("text()").getall() + n.xpath("*/text()").getall())])
            value += len(t)                        # weight = length of the text
        else:
            value += find(n, sel) * 0.5            # deeper text counts at half weight
    if value > sel["value"]:
        sel["node"] = node
        sel["value"] = value
    return value

sel = {"value": 0, "node": None}
find(response.xpath("body"), sel)
A find function is defined to make the recursion convenient; the first call passes the body tag, as before.
Inside the function, only the direct children of the node are selected, and those children are iterated over.
If a child is a p node, extract all of the characters it contains, including those in its own children, and use the length of that text as the weight.
The text extraction is a little roundabout: the direct text is extracted first, then the text of the immediate children, the surrounding whitespace is stripped from each piece, and everything is joined into one string to get the contained text (see the short illustration after this explanation).
If the child is not a p node, find is called recursively; find returns the length of the text contained in the given node.
When adding a child node's length, it is scaled down, reflecting the rule that the farther away the text is, the lower its weight.
Finally, the node with the highest weight is recorded through the sel parameter, which is passed by reference.
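As a made-up illustration of that text-gathering step (the HTML snippet is invented for the example):

    from scrapy.selector import Selector

    p = Selector(text="<p> Hello <b>dense</b> world </p>").xpath("//p")[0]
    parts = p.xpath("text()").getall() + p.xpath("*/text()").getall()
    t = "".join(s.strip() for s in parts)
    print(t, len(t))  # "Helloworlddense" 15 -- word order is shuffled, but only the length matters here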
After this transformation, the effect is particularly good.
Why is that? It is essentially a density principle: the closer to the center of the text, the higher the density; the farther from the center, the more the contribution is discounted at each step, so the density center can be singled out. For example, 1,000 characters of text in a direct child p contribute 1,000 to a node's weight, but the same text one level deeper contributes only 500, and two levels deeper only 250.
How was the 50% decay ratio chosen?
It was determined experimentally. I set it to 90% at first, but then the body node always came out on top, because body contains all of the text content.
After some trial and error, 50% turned out to be a good value; if it does not suit your application, you can adjust it.
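To put everything together, here is a minimal sketch of how the find function above might be wired into a Scrapy spider, assuming find is defined in the same module; the spider name, start URL, and output fields are placeholders, not part of the original article:

    import scrapy

    class ArticleSpider(scrapy.Spider):
        name = "article"                                   # placeholder name
        start_urls = ["https://example.com/some-article"]  # placeholder URL

        def parse(self, response):
            sel = {"value": 0, "node": None}
            find(response.xpath("body"), sel)              # the find() defined above
            if sel["node"] is not None:
                yield {
                    "url": response.url,
                    "html": "".join(sel["node"].getall()), # outer HTML of the densest node
                }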
Thank you for reading. That covers how to use Python to accurately locate and extract the core content of a web page. After working through this article you should have a deeper understanding of the problem; how well the approach fits your own pages still needs to be verified in practice.