Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to disguise PhantomJS as a Chrome browser in Python crawler

2025-01-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/02 Report--

This article shows you how to disguise PhantomJS as a Chrome browser in Python crawler. The content is concise and easy to understand, which will definitely brighten your eyes. I hope you can get something through the detailed introduction of this article.

Preface

In the process of writing crawlers, due to system environment or efficiency problems, we often use PhantomJS as the browser webdriver operated by Selenium, rather than directly using Chrome or FireFox's webdriver, although the latter is more intuitive.

Although PhantomJS has many advantages, there are also many disadvantages. One disadvantage that cannot be called a disadvantage is that the browser logo of PhantomJS is "PhantomJS". :))

There is nothing wrong with the logo of PhantomJS, but as more and more websites upgrade their anti-crawler technology, PhantomJS has clearly become the same target as "requests".

As long as the server background recognizes the visitor's User-Agent as PhantomJS, it may be determined by the server as crawler behavior, resulting in crawler failure.

Just like modifying the header header field in requests to disguise it as a browser, we can change the browser identity of PhantomJS to the identity of any browser in Selenium.

Browser identity of PhantomJS

First, let's take a look at what the browser logo of PhantomJS looks like.

Http://service.spiritsoft.cn/ua.html is a website that gets the browser identity User-Agent, and visiting it will display the identity of the browser currently in use:

Let's use Selunium to manipulate PhantomJS to access http://service.spiritsoft.cn/ua.html to see the returned result:

There are obvious traces of PhantomJS. Next, we modify the browser identity of PhantomJS.

Disguise as Chrome

Introduce a key module-DesiredCapabilities:

What is this module for? Let's look at the explanation of the source code:

Describes a series of encapsulated key-value pairs of browser properties, which are roughly used to set the properties of webdriverde. We use it to set the User-Agent of PhantomJS.

First, convert DesiredCapabilities to a dictionary to facilitate the addition of key-value pairs:

Then add a key-value pair identified by the browser:

Finally, set it as a parameter in the instantiated PhantomJS:

The complete code is as follows:

Let's run the code:

Successfully identified PhantomJS as a Chrome browser.

The above is how to disguise PhantomJS as a Chrome browser in Python crawler. have you learned any knowledge or skills? If you want to learn more skills or enrich your knowledge reserve, you are welcome to follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report