This article shows how to use superagent, eventproxy, and cheerio to build a simple crawler. It is quite practical, so I am sharing it here; I hope you get something out of it. Let's take a look.
When it comes to Node.js, perhaps its most prominent feature is its asynchronous nature.
Here we will use Node.js to build a simple crawler: fetch all the post titles and links on the home page of the CNode community.
Three packages are needed: express, superagent, and cheerio.
express: the most widely used web framework for Node.js.
superagent: an HTTP client library that can issue GET or POST requests.
cheerio: extracts data from a web page using CSS selectors; it can be understood as a server-side jQuery for Node.js.
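As a quick illustration of that jQuery-style API, here is a tiny cheerio sketch (the HTML string is made up purely for demonstration):

```js
// Load an HTML string and query it with CSS selectors, much like jQuery.
const cheerio = require('cheerio');

const html = '<ul><li class="title">Post A</li><li class="title">Post B</li></ul>';
const $ = cheerio.load(html);

$('.title').each((i, el) => {
  console.log($(el).text()); // "Post A", then "Post B"
});
```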
First, create an empty folder named creeper, open the VS Code terminal, and change into the creeper folder.
With the creeper folder created and entered, initialize the project with the npm init command in the terminal.
Then use the npm install command to install the three dependencies: express, superagent, and cheerio.
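The terminal commands look roughly like this:

```sh
npm init
npm install express superagent cheerio --save
```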
Next, import the three packages we just installed at the top of the app.js file.
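Assuming the entry file is named app.js, the top of the file might look like this:

```js
// Pull in the three dependencies described above.
const express = require('express');
const superagent = require('superagent');
const cheerio = require('cheerio');

const app = express();
```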
At this point the basic setup of the project is complete; the next step is to write the logic of the simple crawler.
In fact, the simple crawler takes only about 30 lines of code. The dependencies are imported at the top, the app listens on port 5000 at the bottom, and the crawler logic sits in between. A GET endpoint is defined with app.get; its path is /, the root path, so to test it we only need to send a GET request to http://127.0.0.1:5000 to reach the crawler we wrote. Inside the handler, superagent issues a GET request to the CNode community home page; on success, the text field of the response holds the content of the CNode page. cheerio.load then parses that content, a loop walks through it to pull out each post title and link, and finally all the titles and links are returned to the client. With that, the little crawler is done, and we can test whether the endpoint works properly.
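Here is a minimal sketch of what those roughly 30 lines might look like. The CSS selector #topic_list .topic_title is an assumption about CNode's markup and may need adjusting:

```js
const express = require('express');
const superagent = require('superagent');
const cheerio = require('cheerio');

const app = express();

app.get('/', (req, res) => {
  // Request the CNode community home page.
  superagent.get('https://cnodejs.org/').end((err, sres) => {
    if (err) {
      return res.status(500).send(err.message);
    }
    // sres.text holds the raw HTML of the page; hand it to cheerio.
    const $ = cheerio.load(sres.text);
    const items = [];
    // Assumed selector for the post title links on the home page.
    $('#topic_list .topic_title').each((idx, element) => {
      const $el = $(element);
      items.push({
        title: $el.attr('title'),
        href: $el.attr('href')
      });
    });
    // Return all the post titles and links to the client as JSON.
    res.json(items);
  });
});

app.listen(5000, () => {
  console.log('crawler is listening on port 5000');
});
```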
You can clearly see that we have successfully crawled all the post titles and links on the home page of the CNode community and returned them to the client in JSON format.
Is that the end? Certainly not! Don't forget that the most important thing in this article is to learn about the asynchronous nature of Node.js. So far we have only used superagent and cheerio to crawl the titles and links on the home page, which needs just one GET request from superagent. If we also want to fetch the first comment of each post, we have to issue a request to each post link obtained in the previous step and again use cheerio to extract the first comment. There are 40 posts on the home page of the CNode community, so logically we need one request to get all the post titles and links, and then one request per link to get the corresponding first comment: 41 requests in total. This is where the asynchronous nature of Node.js comes in. Those familiar with Node.js may know that promises or generators can be used to manage the callbacks, but I still prefer plain callbacks in my own work. When using callbacks for asynchronous requests, there are generally two choices: eventproxy or async.
The difference between eventproxy and async
In fact, both eventproxy and async are used for asynchronous flow control. If the crawl involves fewer than about 10 requests, eventproxy is a fine choice; if it involves hundreds of requests, you should consider async instead, because firing hundreds of requests at once may lead the target site to treat you as malicious and simply block your IP. async lets you control the number of concurrent requests, say five to ten at a time, so you can slowly crawl through all the data.
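For reference, here is a sketch of the async approach mentioned above, assuming urls is the array of full post links gathered in the earlier step; async.mapLimit caps the concurrency at 5:

```js
const async = require('async');
const superagent = require('superagent');

// Fetch the given URLs no more than 5 at a time.
function fetchWithConcurrencyLimit(urls, done) {
  async.mapLimit(urls, 5, (url, callback) => {
    superagent.get(url).end((err, sres) => {
      if (err) return callback(err);
      callback(null, sres.text);
    });
  }, done); // done(err, results) receives the pages in the original order
}
```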
Here I choose eventproxy for the asynchronous crawl. Using it requires the eventproxy package, so first install that dependency with npm install.
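The install command:

```sh
npm install eventproxy --save
```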
First, here is the adjusted code logic:
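Below is a minimal sketch of that adjusted logic, matching the walkthrough that follows. The selectors .topic_full_title and .reply_content are assumptions about CNode's post-page markup, as is the error handling:

```js
const express = require('express');
const superagent = require('superagent');
const cheerio = require('cheerio');
const EventProxy = require('eventproxy');
const url = require('url');

const app = express();
const cnodeUrl = 'https://cnodejs.org/';

app.get('/', (req, res) => {
  superagent.get(cnodeUrl).end((err, sres) => {
    if (err) return res.status(500).send(err.message);

    // Step 1: collect the post links on the home page and splice them
    // onto the CNode base URL so they become requestable addresses.
    const $ = cheerio.load(sres.text);
    const topicUrls = [];
    $('#topic_list .topic_title').each((idx, element) => {
      // hrefs look like "/topic/5bd4772a14e994202cd5bdb7".
      topicUrls.push(url.resolve(cnodeUrl, $(element).attr('href')));
    });

    // Step 2: create an eventproxy instance and register a handler that
    // fires once 'topic_html' has been emitted 40 times (once per post).
    const ep = new EventProxy();
    ep.after('topic_html', topicUrls.length, (topics) => {
      // topics is an array of the emitted data, in emit order.
      res.json(topics);
    });

    // Step 3: request every post page and emit its title and first comment.
    topicUrls.forEach((topicUrl) => {
      superagent.get(topicUrl).end((err2, tres) => {
        if (err2) {
          return ep.emit('topic_html', { href: topicUrl, error: err2.message });
        }
        const $$ = cheerio.load(tres.text);
        ep.emit('topic_html', {
          href: topicUrl,
          // Assumed selectors for the post title and the first comment.
          title: $$('.topic_full_title').text().trim(),
          firstComment: $$('.reply_content').eq(0).text().trim()
        });
      });
    });
  });
});

app.listen(5000);
```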
Let's walk through the new logic:
First, import the packages we depend on at the top.
superagent.get is still the earlier step that fetches all the post links on the home page, but as the previous screenshot showed, the crawled links all look like topic/5bd4772a14e994202cd5bdb7, which is obviously not a directly accessible URL, so we need to splice on the CNode base URL to form a genuinely accessible link for each post.
Next, create an eventproxy instance.
Then we use eventproxy's after method to wait on the 40 requests. after suits repetitive operations such as reading 10 files or calling the database 5 times: it registers a handler for N occurrences of the same event, and once that event has been emitted the specified number of times, the handler runs with the data from each emit passed in as an array, in emit order.
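As a standalone illustration of the after/emit pairing, here is a sketch of the read-10-files case just mentioned (the file names are made up):

```js
const fs = require('fs');
const EventProxy = require('eventproxy');

const ep = new EventProxy();

// Run the handler once 'file_read' has been emitted 10 times.
ep.after('file_read', 10, (contents) => {
  // contents is an array of the 10 results, in emit order.
  console.log(contents.join('\n'));
});

for (let i = 1; i <= 10; i++) {
  fs.readFile(`file${i}.txt`, 'utf8', (err, data) => {
    ep.emit('file_read', err ? '' : data);
  });
}
```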
At the bottom, a forEach loop issues a GET request through superagent for each post link in turn to fetch the actual content of the post, then tells the ep instance that the request has finished via eventproxy's emit method. When all 40 requests have completed, ep.after runs its callback and returns the fetched data to the client. At this point the 40 concurrent crawling requests have all been executed; next, let's look at the result.
You can see that we successfully fetched the post titles and links from the CNode community home page and, through concurrent requests, obtained the first comment of each post. Of course, this crawling approach must be used with caution beyond a dozen or so requests, because some large websites will treat you as a malicious requester and block your IP outright, which is not worth it. In that case, consider using async to limit the concurrency and crawl the data slowly.
The above is how to use superagent, eventproxy, and cheerio to implement a simple crawler. Some of these points may well come up in everyday work; I hope you can take something away from this article.