How to crawl a website in Node.js with the help of third-party open source libraries

Updated 2025-01-17. From: SLTechnology News&Howtos (Development)

Shulou (Shulou.com), 06/02

This article shows how to crawl a website in Node.js with the help of third-party open source libraries. It introduces the libraries, outlines the approach, and then gives a complete implementation for reference.

Crawling a website with Node.js

The third-party libraries

request: a wrapper for making network requests

cheerio: a jQuery-style API for Node

mkdirp: creates nested directories in one call
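
For comparison, the job mkdirp does here (creating a directory together with any missing parents) can also be done with Node's built-in fs module, using the recursive option available since Node 10.12. The following is a minimal sketch, not part of the article's code; the ensureDir helper name and the temporary directory are made up for illustration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: create dir and any missing parent directories,
// roughly what mkdirp.sync does in the article's code.
function ensureDir(dir) {
  fs.mkdirSync(dir, { recursive: true });
  return dir;
}

// Demo inside a throwaway temporary directory.
const base = fs.mkdtempSync(path.join(os.tmpdir(), "crawl-"));
const nested = ensureDir(path.join(base, "a", "b", "c"));
console.log(fs.statSync(nested).isDirectory()); // true
```

Calling ensureDir again on an existing directory does not throw, which matters for a crawler that saves many pages into the same folder.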

Approach

Fetch the content of the given URL with request

Use cheerio to find the links in the page, skipping duplicates

Create the target directory with mkdirp

Create a file with fs and write the fetched content to it

Repeat the steps above for every link that has not been visited yet
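
The control flow of these steps can be sketched without any network access. In the example below the pages object and the fetchLinks function are hypothetical stand-ins for request and cheerio; only the Set-based de-duplication and the recursion mirror the approach described above.

```javascript
// Hypothetical in-memory "site": each url maps to the links found on that page.
const pages = {
  "/index.html": ["/a.html", "/b.html"],
  "/a.html": ["/index.html", "/b.html"],
  "/b.html": ["/a.html", "/c.html"],
  "/c.html": [],
};

// Stand-in for "fetch the page and extract its links".
function fetchLinks(url) {
  return pages[url] || [];
}

function crawl(startUrl) {
  // Every url is added to the set before it is fetched, so it is fetched once.
  const visited = new Set([startUrl]);
  const order = [];

  function grab(url) {
    order.push(url); // a real crawler would save the page to disk here
    for (const link of fetchLinks(url)) {
      if (visited.has(link)) continue; // already seen, skip
      visited.add(link);
      grab(link);
    }
  }

  grab(startUrl);
  return order;
}

console.log(crawl("/index.html"));
// [ '/index.html', '/a.html', '/b.html', '/c.html' ]
```

Marking a link as visited before recursing into it is what keeps pages that link to each other from causing an endless loop.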

Code implementation

const fs = require("fs");
const path = require("path");
const request = require("request");
const cheerio = require("cheerio");
const mkdirp = require("mkdirp");

// Entry url
const homeUrl = "https://www.baidu.com";
// Set of urls that have already been visited, to avoid fetching a page twice
const set = new Set([homeUrl]);

function grab(url) {
  // Validate the url
  if (!url) return;
  // Strip surrounding whitespace
  url = url.trim();
  // Complete directory-style urls with a file name
  if (url.endsWith("/")) {
    url += "index.html";
  }
  const chunks = [];
  // The url may contain symbols or Chinese characters, so encode it with encodeURI
  request(encodeURI(url))
    .on("error", (e) => {
      // Print the error
      console.log(e);
    })
    .on("data", (chunk) => {
      // Collect the response body
      chunks.push(chunk);
    })
    .on("end", () => {
      // Convert the response body to text
      const html = Buffer.concat(chunks).toString();
      // Nothing was returned
      if (!html) return;
      // Parse the url
      let { host, origin, pathname } = new URL(url);
      pathname = decodeURI(pathname);
      // A bare origin such as the entry url has the pathname "/",
      // which cannot be written as a file, so give it a file name too
      if (pathname.endsWith("/")) pathname += "index.html";
      // Parse the html with cheerio
      const $ = cheerio.load(html);
      // Use the url path as the directory
      const dir = path.dirname(pathname);
      // Create the directory
      mkdirp.sync(path.join(__dirname, dir));
      // Write the page to a file
      fs.writeFile(path.join(__dirname, pathname), html, "utf-8", (err) => {
        // Print the error
        if (err) {
          console.log(err);
          return;
        }
        console.log(`[${url}] saved successfully`);
      });
      // Get all a elements on the page
      const aTags = $("a");
      Array.from(aTags).forEach((aTag) => {
        // Get the path in the a tag
        const href = $(aTag).attr("href");
        // This is the place to validate href or to limit the scope of the crawl,
        // for example to links under a single domain
        // Skip empty links
        if (!href) return;
        // Skip anchor links
        if (href.startsWith("#")) return;
        if (href.startsWith("mailto:")) return;
        // Uncomment to skip images
        // if (/\.(jpg|jpeg|png|bit)$/.test(href)) return;
        // Absolute links must be on the entry url's domain
        let reg = new RegExp(`^https?:\/\/${host}`);
        if (/^https?:\/\//.test(href) && !reg.test(href)) return;
        // More rules can be added here
        let newUrl = "";
        if (/^https?:\/\//.test(href)) {
          // Absolute path
          newUrl = href;
        } else {
          // Relative path
          newUrl = origin + path.join(dir, href);
        }
        // Skip urls that have already been visited
        if (set.has(newUrl)) return;
        if (newUrl.endsWith("/") && set.has(newUrl + "index.html")) return;
        if (newUrl.endsWith("/")) newUrl += "index.html";
        set.add(newUrl);
        grab(newUrl);
      });
    });
}

// Start crawling
grab(homeUrl);
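
The code above builds relative links with path.join, which joins an href that starts with "/" onto the current directory and produces backslashes on Windows. One possible alternative, shown here only as a sketch, is to let the built-in URL class resolve the href against the page url. The resolveLink helper and the example.com addresses are hypothetical and not part of the article's code.

```javascript
// Hypothetical helper: turn an href found on pageUrl into an absolute url
// on the same host, or return null if the link should be skipped.
function resolveLink(href, pageUrl) {
  // Same early exits as the crawler: empty links, anchors, mailto links.
  if (!href || href.startsWith("#") || href.startsWith("mailto:")) return null;

  let target;
  try {
    // The URL constructor resolves "../x", "/x" and "x" against the page url.
    target = new URL(href, pageUrl);
  } catch (e) {
    return null; // not a parsable url
  }

  // Only follow http(s) links on the same host as the page.
  if (!/^https?:$/.test(target.protocol)) return null;
  if (target.host !== new URL(pageUrl).host) return null;

  // Drop the fragment so "/c.html#top" and "/c.html" count as one page.
  target.hash = "";
  // Complete directory-style urls with a file name, as the crawler does.
  if (target.pathname.endsWith("/")) target.pathname += "index.html";
  return target.href;
}

const page = "https://example.com/a/index.html";
console.log(resolveLink("d.html", page)); // https://example.com/a/d.html
console.log(resolveLink("../b/", page)); // https://example.com/b/index.html
console.log(resolveLink("https://other.com/x", page)); // null
```

The returned string can be checked against the visited set and passed to grab in place of newUrl.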
