How to use Golang web crawler framework gocolly/colly

2025-04-06 Update From: SLTechnology News&Howtos

This article shows how to use the Golang web crawler framework gocolly/colly. It is shared as a reference for readers who are not yet familiar with the framework.

Installation

go get -u github.com/gocolly/colly/...

Examples

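package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Print the response body, which for this URL is the caller's IP address.
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("IP:", string(r.Body))
	})

	// Uncomment to send requests through a local HTTP proxy.
	// c.SetProxy("http://127.0.0.1:1080")

	c.Visit("http://ip.cip.cc/")
}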

The SetProxy function can be used to configure the HTTP proxy.
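To make clearer what routing requests through an HTTP proxy means, here is a minimal sketch that uses only the Go standard library rather than colly. The helper names newProxiedClient, fetchVia and newFakeProxy are hypothetical, and the "proxy" is a local test server that merely reports which host was requested, so the program runs without external network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// newProxiedClient returns an HTTP client that sends every request through
// the proxy at proxyAddr (for example "http://127.0.0.1:1080").
func newProxiedClient(proxyAddr string) (*http.Client, error) {
	u, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, err
	}
	return &http.Client{Transport: &http.Transport{Proxy: http.ProxyURL(u)}}, nil
}

// fetchVia requests target through the proxy and returns the response body.
func fetchVia(proxyAddr, target string) (string, error) {
	client, err := newProxiedClient(proxyAddr)
	if err != nil {
		return "", err
	}
	resp, err := client.Get(target)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// newFakeProxy starts a local stand-in for a proxy (an assumption made for
// this demo). It does not forward anything; it only reports the host that
// the client asked for.
func newFakeProxy() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "proxied request for %s", r.Host)
	}))
}

func main() {
	proxy := newFakeProxy()
	defer proxy.Close()

	body, err := fetchVia(proxy.URL, "http://ip.cip.cc/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```

The request for the target host arrives at the proxy address instead of being sent to the target directly, which is the effect a proxy setting is meant to have.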

The core of colly is the Collector object, which manages network communication and is responsible for executing the attached callback functions while a collector job is running. To use colly, you first need to initialize a Collector:

c := colly.NewCollector()

vgo

vgo is the dependency (third-party library) management tool released by the Go project. Newer versions of Go include it as Go modules, used through the go mod command.

Commonly used commands:

go help mod shows the help text.

go mod init initializes a module and generates a go.mod file in the root directory of the project. The file can be edited manually.

Most dependency packages are hosted on GitHub, and installing them may fail with problems such as connection timeouts. You can configure a global git proxy:

git config --global http.proxy http://127.0.0.1:1080
git config --global https.proxy https://127.0.0.1:1080
# Remove the proxy settings:
git config --global --unset http.proxy
git config --global --unset https.proxy

To route traffic from the Windows command prompt (cmd) through a shadowsocks proxy:

set http_proxy=127.0.0.1:1080
set https_proxy=127.0.0.1:1080
curl cip.cc
IP: 140.206.97.42
address: Shanghai
data 2: Shanghai | China Unicom
URL: http://www.cip.cc/140.206.97.42

On Linux, use export instead of set to define the same environment variables.

Call order of the callback functions

1. OnRequest is called before initiating the request.

2. OnError is called if an error occurs during the request.

3. OnResponse is called after receiving the reply.

4. OnHTML is called after OnResponse, if the received content is HTML.

5. OnScraped is called after OnHTML.

The official Basic example code:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}

The example program only visits links within the hackerspaces.org domains. The selector passed to OnHTML is a[href], so the callback runs for every a element on the page that has an href attribute; each link found is printed and then visited in turn. Part of the output looks like this:
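The two steps the example relies on, turning an href into an absolute URL and checking it against the allowed domains, can be sketched with the standard library alone. The functions resolveLink and allowed are hypothetical illustrations of the idea, not colly's AbsoluteURL or AllowedDomains, and the exact host comparison used here is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveLink turns an href found on pageURL into an absolute URL.
// It returns "" for fragment-only links such as "#searchInput" and for
// values that cannot be parsed.
func resolveLink(pageURL, href string) string {
	if strings.HasPrefix(href, "#") {
		return ""
	}
	base, err := url.Parse(pageURL)
	if err != nil {
		return ""
	}
	abs, err := base.Parse(href)
	if err != nil {
		return ""
	}
	return abs.String()
}

// allowed reports whether the host of rawURL is in the allow-list.
// Hosts are compared exactly (an assumption of this sketch).
func allowed(rawURL string, domains []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	for _, d := range domains {
		if u.Hostname() == d {
			return true
		}
	}
	return false
}

func main() {
	domains := []string{"hackerspaces.org", "wiki.hackerspaces.org"}
	page := "https://hackerspaces.org/"

	for _, href := range []string{"#column-one", "/File:Cbase07.jpg", "https://example.com/"} {
		abs := resolveLink(page, href)
		ok := abs != "" && allowed(abs, domains)
		fmt.Printf("%q -> %q allowed=%v\n", href, abs, ok)
	}
}
```

Of the three sample links, only the site-relative path resolves to a URL on an allowed domain, so it is the only one a crawler restricted this way would go on to visit.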

Visiting https://hackerspaces.org/
Link found: "navigation" -> #column-one
Link found: "search" -> #searchInput
Link found: "" -> /File:Cbase07.jpg
Visiting https://hackerspaces.org/File:Cbase07.jpg
Link found: "navigation" -> #column-one
Link found: "search" -> #searchInput
Link found: "File" -> #file
Link found: "File history" -> #filehistory
Link found: "File usage" -> #filelinks
Link found: "" -> /images/e/ec/Cbase07.jpg
Visiting https://hackerspaces.org/images/e/ec/Cbase07.jpg
Link found: "800x600 pixels" -> /images/thumb/e/ec/Cbase07.jpg/800px-Cbase07.jpg
Visiting https://hackerspaces.org/images/thumb/e/ec/Cbase07.jpg/800px-Cbase07.jpg

That covers the basics of using the Golang web crawler framework gocolly/colly. Thank you for reading.
