
How to use Node.js to crawl arbitrary web resources and output high-quality PDF files locally


This article explains how to use Node.js to crawl arbitrary web resources and output high-quality PDF files locally. The content is simple, clear, and easy to follow; work through it step by step to learn the whole process.

Requirements:

Use Node.js to crawl web resources, with an out-of-the-box configuration

Output the crawled web page content in PDF format

If you are a developer, read the rest of this article; otherwise, go straight to my GitHub repository and read the documentation there.

Repository address: (with documentation and source code; don't forget to give it a star).

Technologies used for this requirement: Node.js and Puppeteer

Puppeteer official website address: puppeteer address

Node.js official website address: link description

Puppeteer is an official Node library from Google that controls headless Chrome through the DevTools protocol. With the API Puppeteer provides, you can drive Chrome directly to simulate most user operations for UI testing, or visit pages as a crawler and collect data.
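As a first taste, here is a minimal sketch, assuming only that Puppeteer is installed; example.com is just a placeholder URL. It launches headless Chrome, opens a page, reads its title, and closes the browser.

```js
// Minimal sketch (placeholder URL, not the article's project code):
// launch headless Chrome, open a page, read its title, close the browser.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/');
  console.log(await page.title()); // page.title() resolves to the document title
  await browser.close();
})();
```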

Environment and installation

Puppeteer itself requires Node v6.4.0 or later, but for the extremely useful async/await syntax it is recommended to use Node v7.6.0 or later. In addition, headless Chrome has fairly high requirements for the system libraries the server depends on: on CentOS 6 the bundled libraries are old, and it is difficult to get headless Chrome running, while upgrading those dependencies can cause all kinds of server problems (including, but not limited to, being unable to use ssh). It is therefore best to use a server with a newer system, and the latest version of Node.js is recommended.
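If you want your script to fail fast on an old runtime, a tiny check like the following can be added to the entry file; this is only an illustrative sketch, not part of the article's project.

```js
// Illustrative sketch (not part of the original project): warn if the running
// Node.js version predates 7.6, where native async/await became available.
const [major, minor] = process.versions.node.split('.').map(Number);
if (major < 7 || (major === 7 && minor < 6)) {
  console.warn(`Node ${process.versions.node} is too old for async/await; please upgrade to 7.6 or later.`);
}
```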

Let's try crawling resources from JD.com's home page.

```js
const puppeteer = require('puppeteer'); // pull in the dependency

(async () => { // use an async function for the asynchronous calls
  const browser = await puppeteer.launch();   // launch a new (headless) browser
  const page = await browser.newPage();       // open a new page
  await page.goto('https://www.jd.com/');     // go to the target url
  const result = await page.evaluate(() => {  // write the processing logic inside this arrow function
    let arr = [];                             // this array will hold the src of every image
    const imgs = document.querySelectorAll('img');
    imgs.forEach(function (item) {
      arr.push(item.src);
    });
    return arr;
  });
  // 'result' now holds the crawled data; you can save it locally with the 'fs' module
  await browser.close(); // remember to close the headless browser when you are done
})();
```

Copy the code and run it from the command line with `node <file name>` to get the crawled data. What the puppeteer package actually does is open another browser for us, open the web page again, and collect the data from it.

The code above only crawled the images on JD.com's home page. Suppose the requirement grows: I now need to follow every link tag on JD.com's home page and put the text content of the title of each target page into an array.

Our async function breaks down into five steps; only puppeteer.launch(), browser.newPage() and browser.close() are fixed boilerplate.

page.goto specifies which web page to crawl; you can change the url it receives, or call the method several times.

page.evaluate contains the logic we run inside the page we want to crawl.

page.goto and page.evaluate can both be called multiple times inside the async function, which means we can first enter JD.com's page, handle our logic, and then call page.goto again to move to another page (a minimal skeleton of this flow follows).

Note that all of the above runs in another browser that the puppeteer package opens for us, one we cannot see, so at the end you must call browser.close() to shut that browser down.
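A minimal skeleton of that five-step flow might look like the sketch below; the second URL is just a placeholder, only there to show that page.goto and page.evaluate can be called again before the browser is closed.

```js
// Skeleton sketch of the five steps (the second URL is a placeholder):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();                 // 1. launch the hidden browser
  const page = await browser.newPage();                     // 2. open a new page
  await page.goto('https://www.jd.com/');                   // 3. enter the first page
  const first = await page.evaluate(() => document.title);  // 4. run logic inside the page
  await page.goto('https://example.com/');                  //    goto/evaluate may be called again
  const second = await page.evaluate(() => document.title);
  console.log(first, second);
  await browser.close();                                    // 5. always close the hidden browser
})();
```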

Now let's optimize the code above and crawl the corresponding resources.

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.jd.com/');
  const hrefArr = await page.evaluate(() => {
    let arr = [];
    const aNodes = document.querySelectorAll('.cate_menu_lk'); // the category menu links on the home page
    aNodes.forEach(function (item) {
      arr.push(item.href);
    });
    return arr;
  });
  let arr = [];
  for (let i = 0; i < hrefArr.length; i++) {
    const url = hrefArr[i];
    console.log(url);                          // printing is allowed here, outside page.evaluate
    await page.goto(url);
    const result = await page.evaluate(() => { // console.log inside this function has no visible effect
      return $('title').text();                // return the title text of each page (jQuery is available on JD.com)
    });
    arr.push(result);                          // push the value collected in each loop into the array
  }
  console.log(arr);
  await browser.close();
})();
```

The collected data can then be saved locally through Node.js's fs module.

A big pitfall: console.log inside page.evaluate does not print to our terminal, and the function body cannot access variables from the outer scope; the only way to get data out is to return it (a workaround sketch follows).
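One common workaround, sketched below and again using JD.com's images only as an example, is to pass outside values into page.evaluate as extra arguments and to bring data back solely through the return value.

```js
// Sketch of the workaround: pass values in as arguments, get data out via return.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.jd.com/');

  const selector = 'img';                      // this variable lives in Node.js, not in the page
  const srcs = await page.evaluate((sel) => {  // it is serialized and passed in as `sel`
    // console.log here would print inside the headless page, not in our terminal
    return Array.from(document.querySelectorAll(sel)).map(img => img.src);
  }, selector);

  console.log(srcs);                           // logging works out here, back in Node.js
  await browser.close();
})();
```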

Before using any selector, first open the console on the target page and test that it can actually select the DOM you want. For example, on JD.com we could not rely on querySelector here; since JD.com's pages include jQuery, we can use jQuery selectors instead. In short, whatever selectors the site's own developers can use, we can use too; anything else will not work.

Next, let's crawl the home page of the official Node.js website and generate a PDF directly.

Whether or not you are familiar with Node.js and Puppeteer crawlers, please read this section carefully and carry out each step in order.

The requirement for this project: given a web address, crawl its content and output it as the PDF format we want; note that it should be a high-quality PDF document.

Step 1: install Node.js. It is recommended to download the installer for your operating system from the Chinese Node.js site, http://nodejs.cn/download/.

Step 2: after downloading and installing Node.js, start the Windows command line tool (open the Windows search, type cmd, and press Enter).

Step 3: check whether the environment variables have been configured automatically. Enter `node -v` in the command line tool; if a version string such as v10.x.x appears, Node.js has been installed successfully.

Step 4: if no version string appears when you type `node -v` in step 3, restart your computer.

Step 5: open the project folder, open the command line tool there (on Windows, type cmd in the address bar of File Explorer), and run `npm i cnpm nodemon -g`.

Step 6: download the puppeteer crawler package. After completing step 5, download it with the `cnpm i puppeteer --save` command.

Step 7: after the download in step 6 finishes, open the project's url.js and replace the address with the web address you need to crawl (the default is http://nodejs.cn/); see the sketch after these steps.

Step 8: run `nodemon index.js` on the command line to crawl the corresponding content and automatically output it to the index.pdf file in the current folder.
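For reference, url.js presumably just exports the address to crawl; the line below is an assumption based on step 7, so check the repository's own file for the exact content.

```js
// Presumed content of url.js (an assumption based on step 7): export the address to crawl.
module.exports = 'http://nodejs.cn/';
```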

TIPS: the project is designed as one web page per PDF file, so each run crawls a single page. Copy index.pdf elsewhere before changing the url address and crawling again to generate a new PDF file; of course, you can also write a loop to crawl multiple pages in one run and generate multiple PDF files.

For a page like JD.com's home page, where images are lazy-loaded, some of the crawled content will still be in a loading state (see the scrolling helper sketch after the code below). Pages with anti-crawler mechanisms can also cause problems for the crawler, but most websites work fine.

```js
const puppeteer = require('puppeteer');
const url = require('./url');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // open the target web page and wait until the network is idle
  await page.goto(url, { waitUntil: 'networkidle0' });
  // path of the PDF file to export; an existing file at this path will be overwritten
  let pdfFilePath = './index.pdf';
  // output options: A4 paper here, which is convenient for printing
  await page.pdf({
    path: pdfFilePath,
    format: 'A4',
    scale: 1,
    printBackground: true,
    landscape: false,
    displayHeaderFooter: false
  });
  await browser.close();
})();
```
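For the lazily-loaded pages mentioned above, one optional workaround is to scroll the page to the bottom so the images get a chance to load before the PDF is generated. The helper below is only a sketch and is not part of the article's code; if you use it, call it after page.goto(...) and before page.pdf(...).

```js
// Optional helper (a sketch, not part of the original code): scroll to the bottom
// of the page so lazily-loaded images are fetched before the PDF is generated.
async function scrollToBottom(page) {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let scrolled = 0;
      const step = 300;
      const timer = setInterval(() => {
        window.scrollBy(0, step);
        scrolled += step;
        if (scrolled >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}
```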

File structure and design ideas

Data is very precious in this era. Following the design logic of the target page, you can select a specific set of href addresses and either fetch those resources directly, or enter each address again with page.goto and then call page.evaluate() to handle the logic or output the corresponding PDF file; of course, you can also output several PDF files in one run (a sketch of that loop follows).
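Here is a sketch of the "several PDFs in one run" idea; it is an assumed example rather than the repository's code, reusing the JD.com category links from earlier: collect the href addresses first, then visit each one and write a separate PDF file.

```js
// Sketch: crawl the home page's category links and write one PDF per linked page.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://www.jd.com/', { waitUntil: 'networkidle0' });
  const hrefArr = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.cate_menu_lk')).map(a => a.href)
  );
  for (let i = 0; i < hrefArr.length; i++) {
    await page.goto(hrefArr[i], { waitUntil: 'networkidle0' });
    await page.pdf({ path: `./page-${i}.pdf`, format: 'A4', printBackground: true });
  }
  await browser.close();
})();
```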

Thank you for reading. That is all for "how to use Node.js to crawl arbitrary web resources and output high-quality PDF files locally". After studying this article you should have a deeper understanding of the topic, although the details still need to be verified in practice. More articles on related topics will follow; welcome to keep reading!
