
How to make simple crawler based on node.js


This article mainly explains "how to make a simple crawler based on node.js". The content of the article is simple and clear, and it is easy to learn and understand. Please follow the editor's train of thought to study and learn how to make a simple crawler based on node.js.

Goal: crawl basic information about hairstylists in all the stores on the http://tweixin.yueyishujia.com/webapp/build/html/ website.

Idea: visit the above website, analyze the page content through the Network panel of the chrome browser, find the interface that returns the hairstylists of each store, and analyze its parameters and return data; then traverse all the hairstylists in all the stores, storing the information locally as we go, until the traversal is finished.

Step 1: install node.js

Download and install node. This step is relatively simple and needs no detailed explanation; if you have any questions, just ask Baidu.
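Once installed, you can check from the dos window that both node and npm are available (the version numbers printed will depend on your install):

node -v
npm -v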

Step 2: establish the project

1) Open the dos command window, and cd to the path where you want to create the project (I put this project directly on drive E, and the following takes this path as an example)

2) mkdir node (create a folder to store the project, which I call node here)

3) cd into the folder named node and execute npm init to initialize the project (you will be asked to fill in some information along the way; I just pressed Enter to accept the defaults)

Step 3: create a folder where the crawled data is stored

1) create a data folder to store the basic information of the hairstylist

2) create an image folder to store the hairdresser's profile picture
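If you prefer, these two folders can also be created from code rather than by hand. A minimal sketch using node's built-in fs module (saving it as, say, setup.js is just a suggestion):

var fs = require('fs');

// create the two storage folders if they do not exist yet
['./data', './image'].forEach(function (dir) {
    if (!fs.existsSync(dir)) {
        fs.mkdirSync(dir);
    }
});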

At this point, the files under the project are as follows:

node/
├── data/
├── image/
└── package.json

Step 4: install the third-party dependency packages (fs is a built-in module and does not need to be installed separately)

1) npm install cheerio --save

2) npm install superagent --save

3) npm install async --save

4) npm install request --save

Briefly explain the dependency packages installed above:

Cheerio: a nodejs module for grabbing page content, customized specially for the server side; it is fast and flexible, implements the core of jQuery, and can parse a request result in almost the same way as jQuery does.
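To give a feel for that jQuery-style parsing, here is a minimal self-contained sketch (the HTML snippet is made up for illustration; this particular crawler ends up reading JSON instead):

var cheerio = require('cheerio');
var $ = cheerio.load('<ul><li class="name">Tony</li><li class="name">Amy</li></ul>');
// select elements with jQuery-style selectors and read their text
$('.name').each(function (i, el) {
    console.log($(el).text()); // prints "Tony", then "Amy"
});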

Superagent: can actively initiate get/post/delete and other requests.
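For example, a plain get request with superagent looks roughly like this (example.com is a placeholder address):

var superagent = require('superagent');
superagent
    .get('http://example.com')
    .end(function (err, res) {
        if (err) {
            console.log(err);
        } else {
            console.log(res.status); // e.g. 200
        }
    });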

Async: the async module is created to solve the nested callback pyramid and to handle asynchronous flow control. Because nodejs uses an asynchronous programming model, some things that are easy to do in synchronous programming become very troublesome; async's flow control simplifies these operations.
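A small sketch of async.mapLimit, the flow-control helper the crawler below relies on; it runs at most 2 of the 5 tasks at a time and collects the results in order:

var async = require('async');
async.mapLimit([1, 2, 3, 4, 5], 2, function (n, callback) {
    // stand-in for an asynchronous task such as a network request
    setTimeout(function () {
        callback(null, n * 2);
    }, 100);
}, function (err, results) {
    console.log(results); // [2, 4, 6, 8, 10]
});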

Request: with this module, http requests become super simple. Request is easy to use, and both https and redirection are supported.
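Downloading a file with request is a one-line pipe, which is also how the crawler below saves the avatars (the URL and file name here are placeholders):

var fs = require('fs');
var request = require('request');
request('http://example.com/logo.png')
    .pipe(fs.createWriteStream('./image/logo.png'));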

Step 5: write crawler code

Create a file named hz.js in the project root, open it, and write the following code:

var superagent = require('superagent');
var cheerio = require('cheerio');
var async = require('async');
var fs = require('fs');
var request = require('request');

var url = '...'; // the designer-list interface found through the chrome Network panel (exact URL elided)
var page = 1;    // the designer list is paged, so this variable controls the current page
var num = 0;     // number of records crawled
var storeid = 1; // store ID

console.log('The crawler starts running...');

// wrapper function
function fetchPage(x) {
    startRequest(x);
}

function startRequest(x) {
    superagent
        .post(url)
        .send({
            // the requested form data
            page: x,
            storeid: storeid
        })
        // the Http header information of the request
        .set('Accept', 'application/json, text/javascript, */*; q=0.01')
        .set('Content-Type', 'application/x-www-form-urlencoded; charset=UTF-8')
        .end(function (err, res) {
            // processing after the request returns:
            // convert the result returned in the response into a JSON object
            if (err) {
                console.log(err);
            } else {
                var designJson = JSON.parse(res.text);
                var deslist = designJson.data.designerlist;
                if (deslist.length > 0) {
                    num += deslist.length;
                    // traverse the deslist objects concurrently, 5 at a time
                    async.mapLimit(deslist, 5, function (hair, callback) {
                        // processing logic for a single hairstylist
                        console.log('... Crawling data ID:' + hair.id + '---- hairstylist:' + hair.name);
                        saveImg(hair, callback);
                    }, function (err, result) {
                        console.log('... Cumulative number of records captured →→' + num);
                    });
                    page++;
                    fetchPage(page);
                } else {
                    // an empty first page means every store has been traversed: stop
                    if (page == 1) {
                        console.log('... The crawler has finished running~');
                        console.log('... Crawled ' + num + ' records in total...');
                        return;
                    }
                    // otherwise move on to the next store and start again from page 1
                    storeid += 1;
                    page = 1;
                    fetchPage(page);
                }
            }
        });
}

fetchPage(page);

function saveImg(hair, callback) {
    // store the picture
    var img_filename = hair.store.name + '-' + hair.name + '.png';
    var img_src = 'http://photo.yueyishujia.com:8112' + hair.avatar; // the picture's url
    // use the request module to send a request to the server and fetch the image resource
    request.head(img_src, function (err, res, body) {
        if (err) {
            console.log(err);
        } else {
            // write the picture to the local /image directory as a stream,
            // using the store name and the hairstylist's name as the file name
            request(img_src).pipe(fs.createWriteStream('./image/' + img_filename));
            console.log('... Successfully stored the pictures related to id=' + hair.id + '!');
        }
    });
    // store the hairstylist's related information
    var html = 'Name: ' + hair.name + '<br>' +
        'Occupation: ' + hair.jobtype + '<br>' +
        'Occupation level: ' + hair.jobtitle + '<br>' +
        'Introduction: ' + hair.simpleinfo + '<br>' +
        'Personality signature: ' + hair.info + '<br>' +
        'Haircut price: ' + hair.cutmoney + ' yuan<br>' +
        'Shop name: ' + hair.store.name + '<br>' +
        'Address: ' + hair.store.location + '<br>' +
        'Contact information: ' + hair.telephone + '<br>' +
        'Avatar: <img src="' + img_src + '">';
    fs.appendFile('./data/' + hair.store.name + '-' + hair.name + '.html', html, 'utf-8', function (err) {
        if (err) {
            console.log(err);
        }
    });
    callback(null, hair);
}

Step 6: run the crawler

Enter the node hz.js command to run the crawler; as it runs, the console prints a progress line for each hairstylist it captures.
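Given the console.log calls in hz.js, the output looks roughly like the following (the IDs and names here are made up for illustration):

The crawler starts running...
... Crawling data ID:1---- hairstylist:Tony
... Crawling data ID:2---- hairstylist:Amy
... Successfully stored the pictures related to id=1!
... Successfully stored the pictures related to id=2!
... Cumulative number of records captured →→2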

After the run completes, each hairstylist's basic information is stored as a html file in the data folder, and the profile pictures are stored in the image folder.

Thank you for reading. The above is the content of "how to make a simple crawler based on node.js". After studying this article, I believe you have a deeper understanding of how to make a simple crawler based on node.js. The editor will keep pushing more articles on related knowledge points; welcome to follow!
