2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
I have recently focused on studying Go. After learning the basic syntax, working through the unofficial translation of "The Way to Go", and finishing ccmouse's course, I felt my foundation was nearly solid, so I dug deeper into ccmouse's crawler project. I gained a great deal from it, though it remained hard to chew through. Afterwards I realized how narrow my view had been: I had wasted countless hours while my thinking stayed at the most primitive level, unable to move forward. Still, life passes in a hurry and time is the most precious thing; whatever field you choose, keep your head down, push forward, and keep enriching your mind and improving your understanding. That digression aside, here is a summary of the project.
The project has a single entry file, main.go, with functionality split into subdirectories, as shown in the figure: engine is the overall controller, which pushes the Requests produced by regular-expression parsing onto a shared []Request slice; fetcher obtains the page body through the http package; and model holds the struct that stores each person's information.

With the directory structure covered, here is the process. The single-machine crawler is fairly simple, but for me it still carried a lot to absorb, such as interface definitions and methods on structs. The concurrent and distributed versions that follow are more complex. The concurrent version makes full use of goroutines and channels; the key is to sort out the overall design first, abstract out the common methods and structures, write tests for the important regular-expression parsing, and let those guide the next step of construction rather than pushing ahead blindly.

The concurrent version needs two channels: in := make(chan Request) and out := make(chan ParseResult). It starts WorkerCount goroutines that concurrently take URLs from the in channel, fetch and parse their contents, and push newly discovered URLs to the out channel. It also has a scheduler, which puts the initial Request (containing the URL and its corresponding parser; the two must travel together because the parsing rules differ for each kind of URL) into the scheduler's workerChan, which is the same channel as the in channel defined earlier, and the program begins executing concurrently. Because execution is fast, the target site may cut off the crawler, so you can use time.Tick to limit the request rate, and you may also need to set the appropriate request headers when crawling; otherwise you will be blocked.
Because the multiple workers in the concurrent version all compete for Requests, there is little central control; it suits a single machine but not a multi-machine distributed deployment. This led to a third version: the queue implementation, whose execution efficiency is about the same as the concurrent version's. The scheduler holds a requestChan chan Request and a workerChan chan chan Request (note that there are two channels here), and an out chan ParseResult is defined in the Run method. Compared with the concurrent version, the queue version adds the workerChan channel, which is what implements the queue scheduling. The whole process is hard to describe fully in words, so a few slides from ccmouse's lectures are attached below to help; if anything is unclear, please leave a comment for discussion.
© 2024 shulou.com SLNews company. All rights reserved.