

How to use a Node.js crawler to scrape ancient books


This article explains how to use a Node.js crawler to scrape ancient books. The content is simple and clear and easy to follow; read along to learn the approach.

Analysis of the project implementation scheme

The project is a typical multi-level crawling case. There are currently only three levels: the book list, the chapter list for each book, and the content behind each chapter link. There are two ways to crawl such a structure. One is to crawl from the outer layer straight through to the inner layers. The other is to save the outer layer to the database first, then use it to fetch all the inner chapter links and save those, and finally query the database for each link and crawl its content. Both schemes have their advantages and disadvantages, and I have in fact tried both. The latter has one advantage: because the three levels are crawled separately, it is easier to keep as much of the data related to each chapter as possible. Imagine using the former instead: following the normal logic, you traverse the first-level catalog to get the corresponding second-level chapter list, then traverse the chapter list to crawl the content. When a third-level content unit has been crawled and needs to be saved, and it also needs a lot of first-level information, that data has to be passed down through the levels, which quickly becomes complicated. Storing each level separately avoids much of this unnecessary and complex data passing.

For now, the number of ancient Chinese books we want to crawl is not large: only about 180 books covering the various classics and histories. The books and chapters themselves are a very small amount of data, i.e. a collection with 180 document records. Across all 180 books there are about 16,000 chapters in total, which means roughly 16,000 pages have to be visited to crawl the corresponding content. So choosing the second scheme is reasonable.
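To make the second scheme concrete, below is a minimal sketch of what the first two collections might look like in mongoose. The field names are assumptions for illustration, not the project's exact schema.

const mongoose = require('mongoose');

// Level 1: the book catalog (about 180 records)
const bookSchema = new mongoose.Schema({
    key: String,      // unique key that ties chapters and content back to this book
    bookName: String,
    author: String,
    link: String,     // URL of the book's chapter-list page
});

// Level 2: the chapter lists (about 16,000 records in total)
const chapterSchema = new mongoose.Schema({
    key: String,      // same key as the parent book
    bookName: String,
    chapter: String,  // chapter title
    link: String,     // URL of the chapter's content page
});

const bookListModel = mongoose.model('bookList', bookSchema);
const chapterListModel = mongoose.model('chapterList', chapterSchema);
// The third level (chapter content) goes into one collection per book; see getModel(key) later in the article.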

Project implementation

The main program exposes three methods: bookListInit, chapterListInit and contentListInit, the initialization methods for crawling the book catalog, the chapter lists and the book content respectively. Their execution order is controlled with async/await: once the book catalog has been crawled, the data is saved to the database and the result is returned to the main program. If that step succeeds, the chapter lists are crawled based on the book list, and then the book content is crawled in the same way.

Project main entry

/**
 * Main entry of the crawler
 */
const start = async () => {
    let booklistRes = await bookListInit();
    if (!booklistRes) {
        logger.warn('Book list crawling error, program terminated');
        return;
    }
    logger.info('Book list crawled successfully, now crawling book chapters...');

    let chapterlistRes = await chapterListInit();
    if (!chapterlistRes) {
        logger.warn('Book chapter list crawling error, program terminated.');
        return;
    }
    logger.info('Book chapter list crawled successfully, now crawling book content...');

    let contentListRes = await contentListInit();
    if (!contentListRes) {
        logger.warn('Book chapter content crawling error, program terminated.');
        return;
    }
    logger.info('Book content crawled successfully');
};

// Start entry
if (typeof bookListInit === 'function' && typeof chapterListInit === 'function') {
    // Start crawling
    start();
}

The three methods bookListInit, chapterListInit and contentListInit introduced above:

Booklist.js

/**
 * Initialization entry
 */
const chapterListInit = async () => {
    const list = await bookHelper.getBookList(bookListModel);
    if (!list) {
        logger.error('Failed to initialize the query for the book catalog');
    }
    logger.info('Start crawling the book chapter lists, total: ' + list.length + ' books');
    let res = await asyncGetChapter(list);
    return res;
};

Chapterlist.js

/**
 * Initialization entry
 */
const contentListInit = async () => {
    // Get the book list
    const list = await bookHelper.getBookList(bookListModel);
    if (!list) {
        logger.error('Failed to initialize the query for the book catalog');
        return;
    }
    const res = await mapBookList(list);
    if (!res) {
        logger.error('Error while grabbing chapter information: the serial traversal in getCurBookSectionList() returned a callback error. The error message has been printed, please check the log!');
        return;
    }
    return res;
};

Thoughts on content capture

The logic for crawling the book catalog is very simple: a single traversal with async.mapLimit is enough to save the data. The simplified logic for the content would likewise be to traverse the chapter list and crawl the content behind each link. In reality, however, there are tens of thousands of links, and from a memory point of view we cannot hold them all in one array and then traverse it, so the content crawl has to be broken into units.

A common approach is to crawl a fixed number of links per query. Its drawback is that the batches are grouped only by count, there is no relationship between the data within a batch, and inserts are done in bulk, so fault tolerance becomes awkward when an error occurs; it also clashes with our plan to save each book as its own collection. We therefore take the second approach: crawl and save the content one book at a time.

The traversal uses async.mapLimit(list, 1, (series, callback) => {}), which unavoidably drags callbacks into the code and makes it less pleasant to work with. The second parameter of async.mapLimit() sets the number of concurrent requests.

/**
 * Content crawling steps:
 * Step 1: get the book list, and for each book find the list of all its chapters.
 * Step 2: traverse the chapter list and save the content to the database.
 * Step 3: return the result after the data has been saved.
 */

/**
 * Initialization entry
 */
const contentListInit = async () => {
    // Get the book list
    const list = await bookHelper.getBookList(bookListModel);
    if (!list) {
        logger.error('Failed to initialize the query for the book catalog');
        return;
    }
    const res = await mapBookList(list);
    if (!res) {
        logger.error('Error while grabbing chapter information: the serial traversal in getCurBookSectionList() returned a callback error. The error message has been printed, please check the log!');
        return;
    }
    return res;
};

/**
 * Traverse the chapter lists under the book catalog
 * @param {*} list
 */
const mapBookList = (list) => {
    return new Promise((resolve, reject) => {
        async.mapLimit(list, 1, (series, callback) => {
            let doc = series._doc;
            getCurBookSectionList(doc, callback);
        }, (err, result) => {
            if (err) {
                logger.error('Error in the asynchronous execution of the book catalog crawl!');
                logger.error(err);
                reject(false);
                return;
            }
            resolve(true);
        });
    });
};

/**
 * Get the chapter list of a single book and call the chapter traversal to grab the content
 * @param {*} series
 * @param {*} callback
 */
const getCurBookSectionList = async (series, callback) => {
    let num = Math.random() * 1000 + 1000;
    await sleep(num);
    let key = series.key;
    const res = await bookHelper.querySectionList(chapterListModel, { key: key });
    if (!res) {
        logger.error('Failed to get the chapter content of the current book: ' + series.bookName + ', moving on to the next book!');
        callback(null, null);
        return;
    }
    // Check whether the current data already exists
    const bookItemModel = getModel(key);
    const contentLength = await bookHelper.getCollectionLength(bookItemModel, {});
    if (contentLength === res.length) {
        logger.info('Current book: ' + series.bookName + ' has already been crawled into the database, moving on to the next task');
        callback(null, null);
        return;
    }
    await mapSectionList(res);
    callback(null, null);
};
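The sleep() helper used in getCurBookSectionList() is not shown in the article; a minimal sketch, assuming it is simply a promisified setTimeout used to pause between books:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage, as in getCurBookSectionList(): wait 1-2 seconds before the next request
// await sleep(Math.random() * 1000 + 1000);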

How to save the data after crawling is a problem.

Here we use the key to classify the data: each time, the links are fetched and traversed by key. The advantage is that the saved data forms a coherent whole. Now consider how to save the data.

1. Insert the whole book at once.

Advantages: fast; no time is wasted on repeated database operations.

Disadvantages: some books have hundreds of chapters, which means holding hundreds of pages in memory before inserting them; this consumes a lot of memory and may make the program unstable.

2. Insert each chapter into the database as it is crawled.

Advantages: the crawl-and-save-per-page approach persists the data in time, and even if an error occurs later, the chapters already saved do not need to be saved again.

Disadvantages: just as obviously, it is slow. If you crawl tens of thousands of pages, you end up doing tens of thousands × N database operations. A buffer is also an option: accumulate entries and save them all at once when a threshold is reached, which is a good compromise (see the sketch below).
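Below is a minimal sketch of the buffered variant mentioned in the disadvantage above. The helper name and buffer size are assumptions; insertMany() flushes a whole batch in a single database operation.

const BUFFER_SIZE = 50;   // assumed batch size
let buffer = [];

// Push one crawled chapter into the buffer and flush it to the database
// when it is full (or when the caller signals the last item of a book).
const saveWithBuffer = async (model, doc, isLast = false) => {
    buffer.push(doc);
    if (buffer.length >= BUFFER_SIZE || isLast) {
        const batch = buffer;
        buffer = [];
        try {
            await model.insertMany(batch);
        } catch (err) {
            logger.error('Batch insert of ' + batch.length + ' chapters failed');
            logger.error(err);
        }
    }
};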

/**
 * Call the content-crawling method for all chapters of a single book
 * @param {*} list
 */
const mapSectionList = (list) => {
    return new Promise((resolve, reject) => {
        async.mapLimit(list, 1, (series, callback) => {
            let doc = series._doc;
            getContent(doc, callback);
        }, (err, result) => {
            if (err) {
                logger.error('Error in the asynchronous execution of the book catalog crawl!');
                logger.error(err);
                reject(false);
                return;
            }
            const bookName = list[0].bookName;
            const key = list[0].key;
            // Save the whole book as one unit
            saveAllContentToDB(result, bookName, key, resolve);

            // Save each chapter as a unit
            // logger.info(bookName + ' data crawl is complete, entering the next book crawl function...');
            // resolve(true);
        });
    });
};
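The getContent() method called in mapSectionList() is not shown in the article. Here is a minimal sketch, assuming superagent and cheerio are used for the request and the parsing; the '.content' selector and the returned fields are placeholders, not the real page structure.

const superagent = require('superagent');
const cheerio = require('cheerio');

const getContent = async (doc, callback) => {
    try {
        const res = await superagent.get(doc.link);
        const $ = cheerio.load(res.text);
        const content = $('.content').text();   // placeholder selector
        // Hand the crawled chapter back to async.mapLimit so it ends up in `result`
        callback(null, {
            key: doc.key,
            bookName: doc.bookName,
            chapter: doc.chapter,
            content: content,
        });
    } catch (err) {
        // Passing the error to the callback aborts the whole mapLimit traversal
        callback(err);
    }
};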

Both approaches have their pros and cons, and we tried both here. Two collections for recording errors, errContentModel and errorCollectionModel, are prepared so that when an insertion fails the information is saved to the corresponding collection; either of them works. The reason for adding collections to save this data is to make it easy to review everything at once and act on it later, instead of digging through the logs.

(PS: in fact, using only the errorCollectionModel collection would be enough; the errContentModel collection just saves the full chapter information.)

// Schema for saving the data of a failed chapter
const errorSpider = mongoose.Schema({
    chapter: String,
    section: String,
    url: String,
    key: String,
    bookName: String,
    author: String,
});

// Schema that saves only the key and bookName information
const errorCollection = mongoose.Schema({
    key: String,
    bookName: String,
});
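The saveAllContentToDB() method used in mapSectionList() is also not shown; below is a sketch of one possible implementation, assuming models built from the two schemas above and the getModel(key) helper described next.

// Models assumed to be created from the schemas above
const errContentModel = mongoose.model('errContent', errorSpider);
const errorCollectionModel = mongoose.model('errorCollection', errorCollection);

const saveAllContentToDB = async (result, bookName, key, resolve) => {
    const bookItemModel = getModel(key);            // per-book collection named after key
    const list = result.filter((item) => item);     // drop chapters that failed to crawl
    try {
        await bookItemModel.insertMany(list);
        logger.info(bookName + ' saved, moving on to the next book.');
    } catch (err) {
        logger.error(bookName + ' failed to save, recording it to the error collection.');
        logger.error(err);
        // Record the key and bookName so the failed book can be re-crawled later
        await errorCollectionModel.create({ key: key, bookName: bookName });
    }
    resolve(true);
};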

We put the content of each book into its own new collection, named after the book's key.
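The getModel(key) helper that creates one collection per book is not shown either; a minimal sketch of how it might work with mongoose (the content fields are assumptions). The third argument of mongoose.model() pins the collection name to the book's key.

const bookContentSchema = new mongoose.Schema({
    chapter: String,
    section: String,
    url: String,
    content: String,
});

const models = {};
const getModel = (key) => {
    if (!models[key]) {
        // Register the model once per key; the third argument sets the collection name
        models[key] = mongoose.model(key, bookContentSchema, key);
    }
    return models[key];
};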

Thank you for reading. That covers how to use a Node.js crawler to scrape ancient books. After studying this article you should have a deeper understanding of the topic; the specific usage still needs to be verified in practice.
