2025-04-09 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/02 Report--
In this issue, the editor looks at how Puppeteer can be used as a crawler. The article is rich in content and approaches the topic from a professional point of view; I hope you get something out of reading it.
Preface
Automated testing is very important and convenient for software development, but automated testing tools are not limited to testing: they can also simulate human operations. Because of their powerful simulation capabilities, E2E automated testing tools (such as Selenium, Puppeteer and Appium) are often used by crawler engineers to grab data.
There are many online tutorials that use automated testing tools as crawlers, but they are limited to how to obtain data, and we know that these browser-based solutions are costly and inefficient, so they are not the best choice for crawling.
This article introduces another use of automated testing tools: automating manual operations. The tool we use is Puppeteer, an open-source testing framework developed by Google that drives Chromium (Google's open-source browser). We will walk step by step through using Puppeteer to automatically publish articles on the Nuggets (Juejin).
The principle of automated testing tools
Automated testing tools work by programmatically operating a browser and simulating interaction with the target web page (clicking, typing, navigating and so on). They can also obtain the page's DOM or HTML, so extracting web page data with them is easy as well.
In addition, on dynamic websites, data rendered by JS is usually hard to obtain, while automated testing tools get it easily, because the HTML actually runs inside a real browser.
Introduction to Puppeteer
Here is an excerpt from the definition on Puppeteer's Github home page.
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
Loco's note: headless means the browser runs without displaying its GUI; this is designed to improve performance, because rendering is resource-intensive.
Here are some things Puppeteer can do:
Generate screenshots and page PDF
Crawl single-page applications and generate pre-rendered content (i.e. SSR, server-side rendering)
Automate form submission, UI testing, keyboard input, etc.
Create an up-to-date, automated test environment
Capture the timeline of the website to help diagnose performance problems
Test the Chrome plug-in
...
Puppeteer installation
Installing Puppeteer is not difficult; just make sure Node.js is installed in your environment and npm can run.
Since the official installation does not account for a Chromium that is already installed, we use a third-party library, puppeteer-chromium-resolver, which wraps Puppeteer and manages the Chromium download.
Run the following command to install Puppeteer:
npm install puppeteer-chromium-resolver --save
For more information on the usage of puppeteer-chromium-resolver, please refer to the official website: https://www.npmjs.com/package/puppeteer-chromium-resolver.
Puppeteer common commands
Puppeteer's official API documentation is at https://pptr.dev/, where all public interfaces are described in detail. Here we list only some commonly used commands.
```js
// Import puppeteer-chromium-resolver
const PCR = require('puppeteer-chromium-resolver')

// Create a PCR instance
const pcr = await PCR({
  revision: '',
  detectionPath: '',
  folderName: '.chromium-browser-snapshots',
  hosts: ['https://storage.googleapis.com', 'https://npm.taobao.org/mirrors'],
  retry: 3,
  silent: false
})

// Launch the browser
const browser = await pcr.puppeteer.launch({ /* ... */ })

// Close the browser
await browser.close()

// Create a page
const page = await browser.newPage()

// Navigate
await page.goto('https://baidu.com')

// Wait
await page.waitFor(3000)

// Get a page element
const el = await page.$(selector)

// Click an element
await el.click()

// Input content
await el.type(text)

// Execute Console code (the key API)
const res = await page.evaluate((arg1, arg2, arg3) => {
  // any frontend code
  return 'frontend awesome'
}, arg1, arg2, arg3)
```
page.evaluate is probably the most powerful API in Puppeteer. Any developer familiar with front-end technology knows the Console in Chrome developer tools, where any JS code can run: click events, getting elements, adding, deleting and modifying elements, and so on. Our automated publishing program will make extensive use of this API.
As you can see, the evaluate method accepts parameters that become the arguments of the callback function running in the front-end context. This lets us inject any back-end data into the front-end DOM, such as the article title and article content.
In addition, the callback's return value becomes the return value of evaluate, assigned here to res; this is often used for data extraction.
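To make this argument-and-return flow concrete, here is a local stand-in sketch; it is not a Puppeteer API (the name evaluateLike is made up), it only mimics how page.evaluate serializes arguments into the page and serializes the return value back out:

```javascript
// A local stand-in (NOT a Puppeteer API) mimicking page.evaluate's data flow:
// extra arguments are serialized into the page context, the callback runs
// there, and its return value is serialized back out.
function evaluateLike(fn, ...args) {
  const serializedArgs = args.map(a => JSON.parse(JSON.stringify(a)))
  return JSON.parse(JSON.stringify(fn(...serializedArgs)))
}

// Inject "backend" data (a title and tag list) into a "frontend" callback
const res = evaluateLike(
  (title, tags) => ({ heading: title.toUpperCase(), nTags: tags.length }),
  'hello puppeteer',
  ['automation', 'crawler']
)
console.log(res)  // { heading: 'HELLO PUPPETEER', nTags: 2 }
```

One consequence of this serialization is that only JSON-safe values can cross the boundary: DOM nodes and functions cannot be returned directly from the callback.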
Note that the code above uses the keyword await. async/await is syntax introduced in ES2017, syntactic sugar over ES6 Promises that makes asynchronous code easier to read and understand. If you are not familiar with async/await, you can refer to this article: https://juejin.im/post/596e142d5188254b532ce2da.
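A minimal, self-contained illustration of that equivalence (nothing here is Puppeteer-specific):

```javascript
// The two functions below do the same thing: async/await is
// syntactic sugar over the Promise chain.
function plusOneWithThen() {
  return Promise.resolve(1).then(n => n + 1)
}

async function plusOneWithAwait() {
  const n = await Promise.resolve(1)  // pauses until the Promise resolves
  return n + 1                        // an async function always returns a Promise
}

plusOneWithThen().then(v => console.log('then:', v))    // then: 2
plusOneWithAwait().then(v => console.log('await:', v))  // await: 2
```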
Puppeteer practice: automatically publish articles on the Nuggets
As the saying goes: Talk is cheap, show me the code.
Next, we will demonstrate Puppeteer's functionality with an automatic-posting example. The platform used as the example in this article is the Nuggets.
Why choose the Nuggets? Unlike some other websites (such as CSDN) that require a CAPTCHA at login (which adds complexity), the Nuggets lets you log in by simply entering an account name and password.
To make it easier for beginners to follow, we start with the basic structure of the crawler. (Due to space limits, we skip the initialization of the browser and pages and focus on the main points.)
Basic structure
To keep the crawler from looking cluttered, we extract the steps of publishing an article into a base class. (The Nuggets may not be the only platform we target, so we write the code in an object-oriented style; other platforms then only need to inherit from the base class.)
The general structure of this crawler base class is as follows:
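A rough sketch of what that base class looks like; only the method names come from the code shown later in the article, while the stub bodies and the calls trace are illustrative assumptions of mine (the real bodies drive Puppeteer):

```javascript
// Sketch of the spider base class; real bodies operate the browser,
// here they only record the step order for illustration.
class BaseSpider {
  constructor(platform, task, article) {
    this.platform = platform // account info: username, password
    this.task = task         // publish task: authType, category, tag, ...
    this.article = article   // article title and content
    this.calls = []          // trace of executed steps (illustration only)
  }
  async init()        { this.calls.push('init') }        // launch browser, open page
  async login()       { this.calls.push('login') }       // username/password login
  async setCookies()  { this.calls.push('setCookies') }  // cookie-based auth
  async goToEditor()  { this.calls.push('goToEditor') }  // navigate to the editor
  async inputEditor() { this.calls.push('inputEditor') } // title, content, footer
  async afterInputEditor() {}                            // hook: category, tags
  async publish()     { this.calls.push('publish') }     // click the publish button
  async afterPublish() {}                                // hook: verify status, get URL

  // Entry point: the same flow as the run method shown later
  async run() {
    await this.init()
    if (this.task.authType === 'login') {
      await this.login()
    } else {
      await this.setCookies()
    }
    await this.goToEditor()
    await this.inputEditor()
    await this.publish()
  }
}

// A concrete platform only overrides the hooks
class JuejinSpider extends BaseSpider {
  async afterInputEditor() { /* select category and tags on the Nuggets */ }
}

const spider = new JuejinSpider({}, { authType: 'login' }, {})
spider.run().then(() => console.log(spider.calls))
// [ 'init', 'login', 'goToEditor', 'inputEditor', 'publish' ]
```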
We don't need to understand every method yet; we only need to know that the entry point is run.
async is added to all methods, indicating that each method returns a Promise; if you need to call one in sequential style, you must use the keyword await.
The run method is as follows:
```js
async run() {
  // Initialize
  await this.init()
  if (this.task.authType === constants.authType.LOGIN) {
    // Log in
    await this.login()
  } else {
    // Use cookies
    await this.setCookies()
  }
  // Navigate to the editor
  await this.goToEditor()
  // Input the editor content
  await this.inputEditor()
  // Publish the article
  await this.publish()
  // Close the browser
  await this.browser.close()
}
```
As you can see, the crawler first initializes, completing basic configuration; then, depending on the task's verification type (authType), it either logs in or uses cookies to pass the site's authentication (this article only considers login authentication); next it navigates to the editor and inputs the editor content; then it publishes the article; finally it closes the browser, and the publishing task is complete.
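The login step below reads CSS selectors from this.loginSel. For orientation, such a selector map might look like the following; every selector value here is a placeholder I made up, not the Nuggets' real DOM:

```javascript
// Hypothetical selector map, e.g. set in the spider's constructor.
// All selector strings below are placeholders, not Juejin's real DOM.
this.loginSel = {
  username: '#login-username-input',  // placeholder
  password: '#login-password-input',  // placeholder
  submit: '.login-submit-button'      // placeholder
}
```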
Log in

The login code is as follows:

```js
async login() {
  logger.info(`logging in... navigating to ${this.urls.login}`)
  await this.page.goto(this.urls.login)
  let errNum = 0
  while (errNum < 10) {
    try {
      await this.page.waitFor(1000)
      const elUsername = await this.page.$(this.loginSel.username)
      const elPassword = await this.page.$(this.loginSel.password)
      const elSubmit = await this.page.$(this.loginSel.submit)
      await elUsername.type(this.platform.username)
      await elPassword.type(this.platform.password)
      await elSubmit.click()
      await this.page.waitFor(3000)
      break
    } catch (e) {
      errNum++
    }
  }
  // Check whether login succeeded
  this.status.loggedIn = errNum !== 10
  if (this.status.loggedIn) {
    logger.info('Logged in')
  }
}
```

The Nuggets' login URL is https://juejin.im/login, so we first navigate the browser there.

We then loop up to 10 times trying to enter the username and password; if all 10 attempts fail, the login status is set to false, otherwise to true.

Here we use the page.$(selector) and el.type(text) APIs to get elements and type content respectively, and the final elSubmit.click() submits the form.

Edit the article

We skip the step of navigating to the article editor, since it only requires a call to page.goto(url); the source code address will be given later for reference.

The code that fills in the editor is as follows:

```js
async inputEditor() {
  logger.info(`input editor title and content`)
  // Input the title
  await this.page.evaluate(this.inputTitle, this.article, this.editorSel, this.task)
  await this.page.waitFor(3000)
  // Input the content
  await this.page.evaluate(this.inputContent, this.article, this.editorSel)
  await this.page.waitFor(3000)
  // Input the footer
  await this.page.evaluate(this.inputFooter, this.article, this.editorSel)
  await this.page.waitFor(3000)
  await this.page.waitFor(10000)
  // Follow-up processing
  await this.afterInputEditor()
}
```

First we input the title by calling page.evaluate, the front-end execution function, passing in the this.inputTitle callback and its arguments; on the same principle we then call the content callback, then the footer callback; finally we run the follow-up handler.

Let's look at this.inputTitle in detail:

```js
async inputTitle(article, editorSel, task) {
  const el = document.querySelector(editorSel.title)
  el.focus()
  el.select()
  document.execCommand('delete', false)
  document.execCommand('insertText', false, task.title || article.title)
}
```

We first get the title element through the standard front-end API document.querySelector(selector). In case the title field contains a placeholder, we clear it with el.focus() (focus), el.select() (select all) and document.execCommand('delete', false) (delete), then input the title text with document.execCommand('insertText', false, text).

Next comes inputting the content, which works on the same principle as the title:

```js
async inputContent(article, editorSel) {
  const el = document.querySelector(editorSel.content)
  el.focus()
  el.select()
  document.execCommand('delete', false)
  document.execCommand('insertText', false, article.content)
}
```

Some may ask: why go to the trouble of using document.execCommand instead of simply calling el.type(text)? We avoid el.type because it fully simulates human keystrokes, which would break the existing content formatting, whereas document.execCommand can insert the whole content in one go.

We reserved a method in the base class BaseSpider for selecting the category, tags and so on; in the subclass JuejinSpider it looks like this:

```js
async afterInputEditor() {
  // Click "publish article"
  const elPubBtn = await this.page.$('.publish-popup')
  await elPubBtn.click()
  await this.page.waitFor(5000)
  // Select the category
  await this.page.evaluate((task) => {
    document.querySelectorAll('.category-list > .item')
      .forEach(el => {
        if (el.textContent === task.category) {
          el.click()
        }
      })
  }, this.task)
  await this.page.waitFor(5000)
  // Select the tag
  const elTagInput = await this.page.$('.tag-input > input')
  await elTagInput.type(this.task.tag)
  await this.page.waitFor(5000)
  await this.page.evaluate(() => {
    document.querySelector('.suggested-tag-list > .tag:nth-child(1)').click()
  })
  await this.page.waitFor(5000)
}
```

Publish
The publishing operation is relatively simple: just click the publish button. The code is as follows:
```js
async publish() {
  logger.info(`publishing...`)
  // Publish the article
  const elPub = await this.page.$(this.editorSel.publish)
  await elPub.click()
  await this.page.waitFor(10000)
  // Follow-up processing
  await this.afterPublish()
}
```
this.afterPublish verifies the publishing status and obtains the published URL; we will not describe it in detail here.
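The ten-attempt loop in login() is a general retry pattern worth noting. Extracted into a standalone helper it might look like this (the retry helper and the example names are my own sketch, not part of the original code):

```javascript
// Retry an async action up to maxTries times.
// Resolves to true as soon as one attempt succeeds, false otherwise,
// mirroring the errNum !== 10 check in login().
async function retry(action, maxTries = 10) {
  let errNum = 0
  while (errNum < maxTries) {
    try {
      await action()
      return true
    } catch (e) {
      errNum++
    }
  }
  return false
}

// Example: an action that fails twice, then succeeds on the third try
let attempts = 0
const flaky = async () => {
  attempts++
  if (attempts < 3) throw new Error('not ready yet')
}
retry(flaky).then(ok => console.log(ok, attempts))  // true 3
```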
Summary
This article has described how to use Puppeteer to drive the Chromium browser to publish articles on the Nuggets.
Many people use Puppeteer to fetch data, but in our view it is inefficient and expensive, and therefore not suitable for large-scale crawling.
Instead, Puppeteer is better suited to automation work, such as operating browsers to publish articles, make posts, submit forms, and so on.
Puppeteer-style automation is very similar to RPA (Robotic Process Automation): both automate tedious, repetitive tasks. The difference is that RPA is not limited to the browser; its scope is the entire operating system, which makes it more powerful but also more expensive.
As a relatively lightweight automation tool, Puppeteer is well suited to web page automation. The practical content introduced in this article is also part of the open-source multi-platform publishing project ArtiPub; interested readers can give it a try.
The Night team, founded in 2019, includes Cui Qingcai, Zhou Ziqi, Chen Xiangan, Tang Yifei, Feng Wei, Cai Jin, Dai Huangjin, Zhang Yeqing and Wei Shidong.
The programming languages involved include but are not limited to Python, Rust, C++ and Go, covering crawlers, deep learning, service development, object storage and more. The team is neither good nor evil; we just do what we think is right. Please proceed with care.
That is how Puppeteer can be used as a crawler and automation tool, as shared by the editor. If you have similar doubts, the analysis above may help; to learn more, you are welcome to follow the industry information channel.