Example Analysis of Content That Selenium Can't Catch

2025-02-23 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article walks through an example of content that Selenium cannot extract in the usual way, explains why, and shows how to get at it anyway.

Some people rely too heavily on Selenium when writing crawlers, assuming that as long as they drive a real browser they can scrape any content without being blocked by the website.

Leaving aside font-based and CSS-based anti-scraping techniques, let's look at a very simple web page. It consists of a single HTML file, with no special fonts and no CSS files.

What's so strange about this web page? We tried to use XPath Helper to extract the red text on the web page and found that XPath could not find the text, as shown in the following figure:

Then we use Selenium to give it a try:

Sure enough, Selenium can't get the red text either. Let's print the source code of the web page as Selenium sees it:

Why is the source code obtained by Selenium different from the source shown in the Chrome developer tools?

The answer lies in the following part of the developer tools view:

This node is a shadow DOM [1]. A shadow DOM behaves much like an iframe: it embeds one piece of HTML inside another. The difference is that an iframe points at a separate address, which requires an additional HTTP request to load, while a shadow DOM embeds just a fragment of HTML code, so it uses fewer resources than an iframe.

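In the screenshot above, the following three lines of code embed a new tag into the original HTML:

    var content = document.querySelector('.content');
    var root = content.attachShadow({mode: 'open'});
    // The markup inside this string was stripped from this copy of the article.
    // The <p> tag is an assumption; the class real_content is the one the Python code queries.
    root.innerHTML = '<p class="real_content">you won\'t catch this text!</p>';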
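To make the mismatch between the page source and the developer tools view concrete, here is a minimal sketch using only the Python standard library. The page below is a hypothetical reconstruction (the `<p>` tag, the inline style, and the placeholder text are assumptions, not the original page). It lists every class that appears on an element in the document markup. The shadow content only exists as a string inside the script, so its class never shows up as an element.

```python
from html.parser import HTMLParser

# Hypothetical reconstruction of the demo page: a single HTML file with no
# special fonts and no CSS files. The <p> tag, the inline style and the
# placeholder text are assumptions; only the class names "content" and
# "real_content" come from the article's code.
PAGE = """<!DOCTYPE html>
<html>
<body>
<div class="content"></div>
<script>
var content = document.querySelector('.content');
var root = content.attachShadow({mode: 'open'});
root.innerHTML = '<p class="real_content" style="color: red">hidden text</p>';
</script>
</body>
</html>
"""


class ClassCollector(HTMLParser):
    """Collect the class names of every element found in the markup."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.extend(value.split())


def static_classes(html):
    """Return the class names of all elements in the document markup."""
    parser = ClassCollector()
    parser.feed(html)
    parser.close()
    return parser.classes


if __name__ == "__main__":
    # Script bodies are treated as raw text, so the <p> inside the
    # JavaScript string is not reported as an element.
    print(static_classes(PAGE))  # ['content']
```

Only the empty `.content` host is part of the document markup. The paragraph is attached at runtime inside the shadow root, which is why a query against the regular document tree comes back empty.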

Like the contents of an iframe, this embedded shadow content cannot be extracted directly with Selenium. To extract it anyway, we first use JavaScript to get the shadow DOM and then query inside it. Here is a piece of code that works:

    shadow = driver.execute_script('return document.querySelector(".content").shadowRoot')
    content = shadow.find_element_by_class_name('real_content')
    print(content.text)

The result is shown in the following figure:

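This code first uses JavaScript to find the parent element that hosts the shadow root, then returns that element's .shadowRoot property. Once you have this object in Python, call its .find_element_by_class_name() method to get the content. Note that the find_element_by_* helper methods were deprecated in Selenium 4; on current versions the equivalent call is shadow.find_element(By.CSS_SELECTOR, '.real_content').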
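As a variation on the same idea, the whole lookup can be done inside the page so that only a plain string crosses back into Python. This sidesteps differences in how Selenium and browser versions represent a returned shadow root. The sketch below is not the article's code: the helper name `shadow_text` and the URL are hypothetical, and it assumes the shadow root was attached in open mode.

```python
# JavaScript run in the page: find the shadow host, query inside its shadow
# root, and return the text (or null if any step finds nothing).
_SHADOW_TEXT_JS = """
const host = document.querySelector(arguments[0]);
if (!host || !host.shadowRoot) { return null; }
const node = host.shadowRoot.querySelector(arguments[1]);
return node ? node.textContent : null;
"""


def shadow_text(driver, host_selector, inner_selector):
    """Return the text of inner_selector inside host_selector's shadow root.

    `driver` is a Selenium WebDriver and both selectors are CSS selectors.
    Returns None if the host, its open shadow root, or the inner node is
    missing.
    """
    return driver.execute_script(_SHADOW_TEXT_JS, host_selector, inner_selector)


if __name__ == "__main__":
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Chrome()
    try:
        driver.get("http://localhost:8000/shadow.html")  # hypothetical URL
        print(shadow_text(driver, ".content", ".real_content"))
    finally:
        driver.quit()
```

Because the script returns `textContent`, the result is the raw text of the node rather than Selenium's rendered `.text`. For a simple paragraph the two are normally the same.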

Note that once you have the shadow-root node, you can only query inside it with CSS selectors. XPath cannot be used there and raises an error.
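One way to keep this restriction from surfacing as a confusing driver error is to reject XPath locators before they reach the browser. The following is a minimal sketch under stated assumptions: `find_in_shadow` is a hypothetical helper, and `host` is assumed to be a WebElement from a recent Selenium 4 release that exposes the `shadow_root` property.

```python
# String values of By.CSS_SELECTOR and By.XPATH from
# selenium.webdriver.common.by, spelled out so this helper does not need
# Selenium at import time.
CSS_SELECTOR = "css selector"
XPATH = "xpath"


def find_in_shadow(host, value, by=CSS_SELECTOR):
    """Find an element inside the shadow root attached to `host`.

    `host` is assumed to be a Selenium 4 WebElement with a `shadow_root`
    property. XPath locators are rejected up front with a clear message.
    """
    if by == XPATH:
        raise ValueError(
            "XPath is not supported inside a shadow root; use a CSS selector"
        )
    return host.shadow_root.find_element(by, value)


if __name__ == "__main__":
    from selenium import webdriver  # third-party: pip install selenium
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("http://localhost:8000/shadow.html")  # hypothetical URL
        host = driver.find_element(By.CSS_SELECTOR, ".content")
        print(find_in_shadow(host, ".real_content").text)
    finally:
        driver.quit()
```

Defaulting `by` to a CSS selector keeps the common case short, and the explicit check means the failure is reported in Python rather than by the browser.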

