

How to do dynamic crawler analysis and environment setup with chrome

2025-02-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Today I am going to talk about how to do dynamic crawler analysis and environment setup with chrome. Many people may not be familiar with this topic, so I have summarized the following content in the hope that you get something out of this article.

One: Overview

Dynamic crawlers versus static crawlers: many years ago, websites responded with static HTML pages, so we simply made an HTTP request to the server and read the response to get the data we wanted. With the development of the web, however, front-end frameworks have emerged one after another, and the old crawler approach no longer applies. A captured page may look something like the screenshot below:

In this situation we need to be able to execute the JS on the page and obtain the final, rendered HTML; the sketch below shows why the plain HTTP approach falls short.
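To make the contrast concrete, here is a minimal sketch of the traditional static-crawler approach in Go using only the standard library (the URL is a placeholder). Against a JS-rendered site, this returns only the initial skeleton HTML, not the content the scripts would render:

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    // Classic static crawling: one HTTP GET, then read the raw response body.
    resp, err := http.Get("https://example.com") // placeholder URL
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // For a JS-heavy site this body is mostly empty markup plus script tags;
    // none of the dynamically rendered content is present.
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(body))
}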

Dynamic rendering is inseparable from a browser (unless you want to parse the HTML, CSS and JS yourself). Google has open-sourced the Chromium browser, so if we have custom requirements we can modify its source code. We also need a protocol for talking to the browser, and Google has open-sourced that too: the Chrome DevTools Protocol (CDP) is how we communicate with the browser. Finally, we need to choose a comfortable programming language. Since this is a learning project, I went for the full Google stack and picked golang as the development language; for the library, chromedp is more or less the standard choice and has everything we need. One last note on what "headless" means: it is simply a run mode of chrome without a visible window. Normally we want the UI, but a crawler does not need one; during development you can also run in non-headless mode to see what the browser is doing. A minimal chromedp sketch follows the links below.

Attached:

CDP: https://chromedevtools.github.io/devtools-protocol/

Chromium: https://www.chromium.org/

chromedp: https://github.com/chromedp/chromedp
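Here is a minimal sketch of the chromedp workflow described above: start a browser context (headless by default), navigate to a page, wait for the body to render, and read back the final HTML. The target URL and selectors are placeholders and error handling is kept to the bare minimum:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/chromedp/chromedp"
)

func main() {
    // Create a chromedp context; by default this launches chrome in headless mode.
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Guard against pages that never finish rendering.
    ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    var html string
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://example.com"),           // placeholder URL
        chromedp.WaitVisible("body", chromedp.ByQuery),      // wait until the body is rendered
        chromedp.OuterHTML("html", &html, chromedp.ByQuery), // grab the final, JS-rendered HTML
    )
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(html)
}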

Two: Development environment

I will skip the browser installation itself; just make sure chrome can be found in your PATH after installation. On my Mac, I add the following aliases to .zshrc:

# chrome
alias chrome="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
alias chrome-canary="/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary"
alias chromium="/Applications/Chromium.app/Contents/MacOS/Chromium"

Run it from the command line; if it prints the version information, everything is working:

λ chrome --version
Google Chrome 84.0.4147.89

Create a golang project (go mod) as follows:

Here is my go.mod with its dependencies (most of the extra libraries are conveniences for things like reading configuration files and building command-line tools, chosen according to personal preference):

module github.com/anyeshe/caterpillar

go 1.14

require (
    github.com/chromedp/cdproto v0.0.0-20200709115526-d1f6fc58448b
    github.com/chromedp/chromedp v0.5.3
    github.com/fsnotify/fsnotify v1.4.9 // indirect
    github.com/go-playground/universal-translator v0.17.0 // indirect
    github.com/gobwas/pool v0.2.1 // indirect
    github.com/gobwas/ws v1.0.3 // indirect
    github.com/leodido/go-urn v1.2.0 // indirect
    github.com/mitchellh/go-homedir v1.1.0
    github.com/mitchellh/mapstructure v1.3.2 // indirect
    github.com/panjf2000/ants/v2 v2.4.1 // indirect
    github.com/pelletier/go-toml v1.8.0 // indirect
    github.com/spf13/afero v1.3.2 // indirect
    github.com/spf13/cast v1.3.1 // indirect
    github.com/spf13/cobra v1.0.0
    github.com/spf13/jwalterweatherman v1.1.0 // indirect
    github.com/spf13/pflag v1.0.5 // indirect
    github.com/spf13/viper v1.7.0
    go.uber.org/zap v1.15.0
    golang.org/x/sys v0.0.0-20200625212154-ddb9806d33ae // indirect
    golang.org/x/text v0.3.3 // indirect
    gopkg.in/go-playground/validator.v9 v9.31.0
    gopkg.in/ini.v1 v1.57.0 // indirect
)

The preparation is almost done at this point; the next step is to get familiar with the library and with CDP.
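As mentioned in the overview, during development it can be useful to run chrome with a visible window instead of headless, so you can watch what the browser is doing. Here is a small sketch of how that could be configured with chromedp's exec allocator; the headless flag is the only option being changed, and the URL is a placeholder:

package main

import (
    "context"
    "log"

    "github.com/chromedp/chromedp"
)

func main() {
    // Start from chromedp's default chrome flags and turn headless off,
    // so a browser window is shown while developing.
    opts := append(chromedp.DefaultExecAllocatorOptions[:],
        chromedp.Flag("headless", false),
    )

    allocCtx, cancelAlloc := chromedp.NewExecAllocator(context.Background(), opts...)
    defer cancelAlloc()

    ctx, cancel := chromedp.NewContext(allocCtx)
    defer cancel()

    // Navigate somewhere so the window has something to show (placeholder URL).
    if err := chromedp.Run(ctx, chromedp.Navigate("https://example.com")); err != nil {
        log.Fatal(err)
    }
}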

Three: Enabling protocol monitoring in chrome

In this window you can watch the interaction between the developer tools and chrome, which is convenient for reference and learning. It is enabled as follows:

First, open chrome's developer tools, as shown below:

When that is done, click the settings icon to go to the following page:

Enable protocol monitoring there; then, back in the console, remember to show the console drawer, as follows:

After that, switch to the protocol monitoring panel and you can see the messages.

Note: the developer tools themselves are built on top of the CDP protocol.

The names in the Method column correspond to the domains documented on the official CDP website, as follows:
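The same domain and method names appear when you program against CDP through chromedp. As a rough sketch (the URL is a placeholder), you can listen to the raw protocol events on a target and see methods such as Network.requestWillBeSent and Page.loadEventFired arrive as typed events from the cdproto packages:

package main

import (
    "context"
    "log"

    "github.com/chromedp/cdproto/network"
    "github.com/chromedp/cdproto/page"
    "github.com/chromedp/chromedp"
)

func main() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Each CDP event arrives as a typed struct from cdproto; the struct name
    // mirrors the "Domain.method" shown in the protocol monitor.
    chromedp.ListenTarget(ctx, func(ev interface{}) {
        switch e := ev.(type) {
        case *network.EventRequestWillBeSent: // Network.requestWillBeSent
            log.Printf("request: %s", e.Request.URL)
        case *page.EventLoadEventFired: // Page.loadEventFired
            log.Println("page load event fired")
        }
    })

    // network.Enable() switches on the Network domain so its events are emitted.
    if err := chromedp.Run(ctx,
        network.Enable(),
        chromedp.Navigate("https://example.com"), // placeholder URL
    ); err != nil {
        log.Fatal(err)
    }
}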

After reading the above, do you have a better understanding of how to do dynamic crawler analysis and environment setup with chrome? If you want to learn more, please keep following the industry information channel. Thank you for your support.



