This article mainly introduces how to use Node.js to write a proxy crawler website. Many people have doubts about this topic in daily work, so we have sorted out a simple, easy-to-follow method. We hope it helps answer those doubts; please follow along and study!
Node.js has many uses: besides working with files and doing web development, it can also power crawlers. Today we will demonstrate, in just a few lines of code, how to implement a proxy crawler with Node.js.
The principle of a proxy crawler combines a proxy server with a crawler: the user's browser requests pages from our server, and our server fetches the corresponding pages from the target site on the user's behalf. The main logic of the program lives in the proxy server, which forwards requests, crawls the data, and processes it.
The technology stack used here includes express, axios, cheerio, and art-template: express creates the web service, axios crawls the pages, cheerio processes the data, and art-template renders it.
The crawler's target is this novel website: https://www.biquke.com. For simplicity, we will only demonstrate crawling one of its novels, Fanren Xiuxian Zhuan (A Record of a Mortal's Journey to Immortality): https://www.biquke.com/bq/0/990/.
First, let's use express to build a web server.
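A minimal sketch of this step is shown below; the port (3000) and the root route path are assumptions for illustration:

```javascript
const express = require('express');

const app = express();

// This route will eventually return the novel's directory (chapter list) page
app.get('/', (req, res) => {
  res.send('directory page goes here');
});

app.listen(3000, () => {
  console.log('proxy crawler listening on http://localhost:3000');
});
```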
In the code above, we design a route that will serve the novel's directory page.
The second step is to request the target page with axios. The axios library can be used on both the front end and the back end: in the browser it sends asynchronous requests through the XMLHttpRequest object, while on the Node side, that is, the back end, it goes through Node's http module.
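A minimal sketch of the request, assuming the directory URL above:

```javascript
const axios = require('axios');

// Request the novel's directory page; response.data is the raw HTML string
async function fetchDirectory() {
  const response = await axios.get('https://www.biquke.com/bq/0/990/');
  console.log(response.data); // prints the HTML of the page
  return response.data;
}

fetchDirectory().catch(console.error);
```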
If you look closely at the printed result, you can see that it is a string in HTML format. This string contains the pages of the novel, and from it we need to extract the following information:
1. The title of the novel
2. The latest chapter of the novel
3. The list of chapters of the novel and the links to each chapter
How do we get this information? Should we filter it with regular expressions? Of course not.
The third step is to process the data and extract what we want. Here we need to get familiar with an npm package that processes page data, cheerio: https://www.npmjs.com/package/cheerio
Let's look at the official documentation to see how this package is used.
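The following short demo follows the style of the cheerio documentation (the sample HTML string is just an illustration):

```javascript
const cheerio = require('cheerio');

// Load an HTML string, then query and modify it with jQuery-style selectors
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

console.log($.html());
// => <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>
```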
From the printed result we can see what cheerio does: it converts an HTML-formatted string into a jQuery-like DOM object, so that we can use jQuery-style selectors to filter out the data we want. With this usage understood, we can proceed to process our data.
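Here is a sketch of the extraction step. The selectors (#info h1, #info p, #list dd a) follow a common layout for sites of this kind and are assumptions, not verified against the live page:

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

async function getDirectoryData() {
  const { data: html } = await axios.get('https://www.biquke.com/bq/0/990/');
  const $ = cheerio.load(html);

  // 1. the novel's title
  const title = $('#info h1').text();
  // 2. the latest chapter (the last link in the info block)
  const latest = $('#info p').last().find('a').text();
  // 3. the chapter list and the link to each chapter
  const chapters = [];
  $('#list dd a').each((i, el) => {
    chapters.push({
      name: $(el).text(),
      href: $(el).attr('href'),
    });
  });

  return { title, latest, chapters };
}

getDirectoryData().then((data) => console.log(data)).catch(console.error);
```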
Sending a request from the browser and checking the printed result in the terminal shows that we now get the data we want. But this is not the final result: we want to render the data into a page and return it to the user. For that we use art-template.
The fourth step is to render the data into a page with art-template.
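A sketch of the render step, reusing the assumed selectors above; the template path views/index.art is also an assumption:

```javascript
const path = require('path');
const express = require('express');
const axios = require('axios');
const cheerio = require('cheerio');
const template = require('art-template');

const app = express();

app.get('/', async (req, res) => {
  const { data: html } = await axios.get('https://www.biquke.com/bq/0/990/');
  const $ = cheerio.load(html);

  const chapters = [];
  $('#list dd a').each((i, el) => {
    chapters.push({ name: $(el).text(), href: $(el).attr('href') });
  });

  // Render the crawled data into an HTML page and return it to the user
  const page = template(path.join(__dirname, 'views/index.art'), {
    title: $('#info h1').text(),
    chapters,
  });
  res.send(page);
});

app.listen(3000);
```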
The template code is as follows:
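This is a possible views/index.art, assuming the { title, chapters } data shape from the sketch above and that each chapter's href is a bare file name such as 123.html:

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>{{title}}</title>
</head>
<body>
  <h1>{{title}}</h1>
  <ul>
    {{each chapters}}
    <!-- point each chapter link at our own route, not at the target site -->
    <li><a href="/chapter/{{$value.href}}">{{$value.name}}</a></li>
    {{/each}}
  </ul>
</body>
</html>
```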
Notice how the <a> links for the chapter list are handled during rendering: each link points at our own /chapter route rather than at the target site.
Requesting the home address now returns the rendered directory page.
The above only implements the directory page. For each chapter's detail page, note that we design the detail route with a params route parameter; through this parameter we can splice together the URL of the chapter the user requested, then fetch and process that chapter's data.
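A sketch of the detail route, which in the full program sits alongside the directory route above; the :id parameter name, the chapter selectors, and the views/chapter.art path are all assumptions:

```javascript
const path = require('path');
const express = require('express');
const axios = require('axios');
const cheerio = require('cheerio');
const template = require('art-template');

const app = express();

// The :id params value is spliced into the target site's URL, so the proxy
// fetches exactly the chapter the user clicked on the directory page.
app.get('/chapter/:id', async (req, res) => {
  const url = `https://www.biquke.com/bq/0/990/${req.params.id}`;
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  res.send(template(path.join(__dirname, 'views/chapter.art'), {
    title: $('.bookname h1').text(), // assumed selector for the chapter title
    content: $('#content').text(),   // assumed selector for the chapter body
  }));
});

app.listen(3000);
```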
The code address is as follows: https://github.com/clm1100/spidertest
At this point, our study of "how to use nodejs to write a proxy crawler website" is over. We hope it has helped resolve your doubts. Pairing theory with practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow this site; we will keep bringing you practical articles!