This article explains how to crawl a website in Node.js with the help of third-party open source libraries. The walkthrough is detailed and should make a useful reference; interested readers are encouraged to follow it through to the end!
Implementing a website crawler in Node.js
Introduction to the third-party libraries
request: a wrapper around Node's network requests
cheerio: a Node.js implementation of jQuery's core API
mkdirp: creates multi-level directory trees in one call
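Before combining them, here is a minimal usage sketch of each library on its own (a rough illustration: it assumes the three packages have been installed with npm install request cheerio mkdirp, and uses https://example.com as a stand-in url):

const request = require("request");
const cheerio = require("cheerio");
const mkdirp = require("mkdirp");

// request: issue a GET request; the callback receives the full response body
request("https://example.com", (err, res, body) => {
  if (err) return console.log(err);
  // cheerio: load the html and query it with jQuery-style selectors
  const $ = cheerio.load(body);
  console.log($("title").text());
});

// mkdirp: create a nested directory tree in a single call
mkdirp.sync("a/b/c");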
Implementation idea
Fetch the content of the specified url with request
Find the link paths in the page with cheerio (de-duplicating them)
Create the target directory with mkdirp
Create a file with fs and write the fetched content into it
Repeat the steps above for every path that has not been visited yet
Code implementation
const fs = require("fs");
const path = require("path");
const request = require("request");
const cheerio = require("cheerio");
const mkdirp = require("mkdirp");

// define the entry url
const homeUrl = "https://www.baidu.com";
// the Set stores visited paths so we avoid requesting them twice
const set = new Set([homeUrl]);

function grab(url) {
  // validate the url
  if (!url) return;
  // trim whitespace
  url = url.trim();
  // auto-complete the url path
  if (url.endsWith("/")) {
    url += "index.html";
  }
  const chunks = [];
  // the url may contain special symbols or Chinese characters,
  // so encode it with encodeURI
  request(encodeURI(url))
    .on("error", (e) => {
      // print the error message
      console.log(e);
    })
    .on("data", (chunk) => {
      // collect the response content
      chunks.push(chunk);
    })
    .on("end", () => {
      // convert the response content to text
      const html = Buffer.concat(chunks).toString();
      // nothing was received
      if (!html) return;
      // parse the url
      let { host, origin, pathname } = new URL(url);
      pathname = decodeURI(pathname);
      // parse the html with cheerio
      const $ = cheerio.load(html);
      // use the url path as the directory
      const dir = path.dirname(pathname);
      // create the directory
      mkdirp.sync(path.join(__dirname, dir));
      // write the page to a file
      fs.writeFile(path.join(__dirname, pathname), html, "utf-8", (err) => {
        // print the error message
        if (err) {
          console.log(err);
          return;
        }
        console.log(`[${url}] saved successfully`);
      });
      // get all <a> elements on the page
      const aTags = $("a");
      Array.from(aTags).forEach((aTag) => {
        // get the path from the a tag
        const href = $(aTag).attr("href");
        // you can validate href here or limit the crawl scope,
        // for example requiring it to stay under a single domain
        // exclude empty tags
        if (!href) return;
        // exclude anchor links
        if (href.startsWith("#")) return;
        if (href.startsWith("mailto:")) return;
        // optionally filter out images
        // if (/\.(jpg|jpeg|png|gif)$/.test(href)) return;
        // href must belong to the entry url's domain
        let reg = new RegExp(`^https?:\\/\\/${host}`);
        if (/^https?:\/\//.test(href) && !reg.test(href)) return;
        // more rules can be added here
        let newUrl = "";
        if (/^https?:\/\//.test(href)) {
          // handle absolute paths
          newUrl = href;
        } else {
          // handle relative paths
          newUrl = origin + path.join(dir, href);
        }
        // skip urls that have already been visited
        if (set.has(newUrl)) return;
        if (newUrl.endsWith("/") && set.has(newUrl + "index.html")) return;
        if (newUrl.endsWith("/")) newUrl += "index.html";
        set.add(newUrl);
        grab(newUrl);
      });
    });
}

// start crawling
grab(homeUrl);

That is all the content of "how to crawl websites with the help of third-party open source libraries in node". Thank you for reading, and I hope you found it helpful!
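A quick note if you want to run the example yourself: save the code to a file (crawler.js is just an illustrative name), install the three dependencies, and start the script with Node; crawled pages are saved under the script's own directory.

npm install request cheerio mkdirp
node crawler.js

Be aware that the request package has since been deprecated on npm; it still works for a demo like this, but new projects usually reach for the built-in http/https modules or another maintained HTTP client.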