Go language web crawler example code sharing


This article shares example code for a web crawler written in Go. Many people have questions about how to write a web crawler in Go, so the sections below walk through simple, practical examples that should help answer them. Let's get started!

Crawl the page

This article uses web crawler examples to illustrate Go language features such as recursion, multiple return values, deferred function calls, and anonymous functions.

First comes a basic crawler. The following two examples fetch the contents of a page using the net/http package.

Get a URL

The following program fetches the contents of a URL from the Internet and prints it without parsing:

// Output the content obtained from each URL.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	for _, url := range os.Args[1:] {
		url = checkUrl(url)
		resp, err := http.Get(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "ERROR fetch request %s: %v\n", url, err)
			os.Exit(1) // exit the process with status code 1
		}
		_, err = io.Copy(os.Stdout, resp.Body)
		resp.Body.Close()
		if err != nil {
			fmt.Fprintf(os.Stderr, "ERROR fetch reading %s: %v\n", url, err)
			os.Exit(1)
		}
	}
}

// checkUrl prepends "http://" if the argument has no scheme.
func checkUrl(s string) string {
	if strings.HasPrefix(s, "http") {
		return s
	}
	return fmt.Sprint("http://", s)
}

This program uses the net/http package. The http.Get function issues an HTTP request, and if there is no error, the result is stored in the response struct resp. The Body field of resp contains a readable stream of the server's response. The entire response could be read here with ioutil.ReadAll, but io.Copy(dst, src) is used instead, so the whole response does not need to be buffered in memory. After the data has been read, the Body stream is closed to avoid leaking resources.
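For comparison, here is a minimal sketch (not from the original example) of the same loop using ioutil.ReadAll: it buffers the whole response in memory before printing it, whereas io.Copy streams it to standard output. The checkUrl helper is omitted here.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
)

func main() {
	for _, url := range os.Args[1:] {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch %s: %v\n", url, err)
			os.Exit(1)
		}
		b, err := ioutil.ReadAll(resp.Body) // reads the entire body into memory
		resp.Body.Close()                   // close the stream to avoid leaking resources
		if err != nil {
			fmt.Fprintf(os.Stderr, "reading %s: %v\n", url, err)
			os.Exit(1)
		}
		fmt.Printf("%s", b)
	}
}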

Fetch multiple URLs concurrently

This program does the same thing as the previous one, fetching the contents of URLs, but it fetches them concurrently. This version discards the response bodies and reports only the size of each response and the time it took:

// Fetch URLs concurrently and report their time and size.
package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"os"
	"strings"
	"time"
)

func main() {
	start := time.Now()
	ch := make(chan string)
	for _, url := range os.Args[1:] {
		url = checkUrl(url)
		go fetch(url, ch) // start a goroutine per URL
	}
	for range os.Args[1:] {
		fmt.Println(<-ch) // receive one result per URL from channel ch
	}
	fmt.Printf("%.2fs elapsed\n", time.Since(start).Seconds())
}

func fetch(url string, ch chan<- string) {
	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		ch <- fmt.Sprint(err) // send the error to channel ch
		return
	}
	// Discard the body; only its size is of interest here.
	nbytes, err := io.Copy(ioutil.Discard, resp.Body)
	resp.Body.Close() // don't leak resources
	if err != nil {
		ch <- fmt.Sprintf("while reading %s: %v", url, err)
		return
	}
	secs := time.Since(start).Seconds()
	ch <- fmt.Sprintf("%.2fs  %7d  %s", secs, nbytes, url)
}

func checkUrl(s string) string {
	if strings.HasPrefix(s, "http") {
		return s
	}
	return fmt.Sprint("http://", s)
}

Many programming languages use a fixed-length function call stack, with sizes ranging from 64KB to 2MB. The depth of recursion is limited by this fixed stack size, so deep recursive calls risk stack overflow, and a fixed-length stack can even pose security risks. In contrast, the Go implementation uses variable-length stacks that grow as needed, up to a limit of around 1GB. This lets us use recursion safely without worrying about overflow.
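As a rough illustration (not part of the crawler code), the sketch below recurses a million levels deep; on a typical fixed-size stack this would overflow, but Go simply grows the goroutine's stack:

package main

import "fmt"

// deep recurses n times. Go's growable stacks make a depth of a
// million calls safe, which a fixed 64KB-2MB stack could not handle.
func deep(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + deep(n-1)
}

func main() {
	fmt.Println(deep(1000000)) // prints 1000000
}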

Traversing the HTML node tree (2)

Here is another way to implement the traversal, using function values. The logic for operating on each node is separated from the logic for walking the tree. The fetch code is not reused this time; everything is written in one file:

package main

import (
	"fmt"
	"net/http"
	"os"

	"golang.org/x/net/html"
)

func main() {
	for _, url := range os.Args[1:] {
		outline(url)
	}
}

func outline(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	doc, err := html.Parse(resp.Body)
	if err != nil {
		return err
	}
	forEachNode(doc, startElement, endElement)
	return nil
}

// forEachNode visits every node in the tree rooted at n and calls
// pre(x) and post(x). Both functions are optional:
// pre is called before the children are visited (pre-order),
// post is called after the children have been visited (post-order).
func forEachNode(n *html.Node, pre, post func(n *html.Node)) {
	if pre != nil {
		pre(n)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		forEachNode(c, pre, post)
	}
	if post != nil {
		post(n)
	}
}

var depth int

func startElement(n *html.Node) {
	if n.Type == html.ElementNode {
		fmt.Printf("%*s<%s>\n", depth*2, "", n.Data)
		depth++
	}
}

func endElement(n *html.Node) {
	if n.Type == html.ElementNode {
		depth--
		fmt.Printf("%*s</%s>\n", depth*2, "", n.Data)
	}
}

The forEachNode function takes two functions as arguments: one is called before a node's children are visited, the other after all the children have been visited. This organization gives the caller a lot of flexibility.

It also makes clever use of fmt's padded output. The * in %*s prints a string padded to a variable width; the width and the string are supplied by the next two arguments. Since only padding is needed here, an empty string is passed, producing depth*2 spaces before each element name.
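A small sketch of how %*s produces the indentation; the tag names here are just placeholders:

package main

import "fmt"

func main() {
	// %*s takes the field width from the argument list:
	// depth*2 spaces are printed, then the element name.
	for depth, tag := range []string{"html", "body", "div"} {
		fmt.Printf("%*s<%s>\n", depth*2, "", tag)
	}
	// Output:
	// <html>
	//   <body>
	//     <div>
}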

This time the output shows the indented structure of the page:

PS H:\Go\src\gopl\ch6\outline2> go run main.go http://baidu.com
(the page's element tags are printed here, indented by nesting depth)

Deferred function call (defer)

The two functions in the following examples are good candidates for the deferred function call, defer.

Get the title of the page

If the data returned by the http.Get request is used directly, everything works fine when the requested URL is an HTML page, but many URLs point to images, plain text, and other formats. Handing such a file to the HTML parser can lead to unexpected results. So the program first checks that the GET request returned an HTML page, which is determined by the Content-Type response header, typically Content-Type: text/html; charset=utf-8. It then parses the HTML and prints the contents of the title element:

package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func forEachNode(n *html.Node, pre, post func(n *html.Node)) {
	if pre != nil {
		pre(n)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		forEachNode(c, pre, post)
	}
	if post != nil {
		post(n)
	}
}

func title(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Check Content-Type to make sure the response is HTML,
	// e.g. Content-Type: text/html; charset=utf-8
	ct := resp.Header.Get("Content-Type")
	if ct != "text/html" && !strings.HasPrefix(ct, "text/html;") {
		return fmt.Errorf("%s has type %s, not text/html", url, ct)
	}
	doc, err := html.Parse(resp.Body)
	if err != nil {
		return fmt.Errorf("parsing %s as HTML: %v", url, err)
	}
	visitNode := func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "title" && n.FirstChild != nil {
			fmt.Println(n.FirstChild.Data)
		}
	}
	forEachNode(doc, visitNode, nil)
	return nil
}

func main() {
	for _, url := range os.Args[1:] {
		err := title(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "get url: %v\n", err)
		}
	}
}

Save the page to a file

Request a page using Get and save it to a local file. Use the path.Base function to get the last part of the URL path as the file name:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path"
)

func fetch(url string) (filename string, n int64, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", 0, err
	}
	defer resp.Body.Close()
	local := path.Base(resp.Request.URL.Path)
	if local == "/" {
		local = "index.html"
	}
	f, err := os.Create(local)
	if err != nil {
		return "", 0, err
	}
	n, err = io.Copy(f, resp.Body)
	// Close the file and keep any error message.
	if closeErr := f.Close(); err == nil {
		// If the err returned by io.Copy is nil, report the closeErr error.
		err = closeErr
	}
	return local, n, err
}

func main() {
	for _, url := range os.Args[1:] {
		local, n, err := fetch(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch %s: %v\n", url, err)
			continue
		}
		fmt.Fprintf(os.Stderr, "%s => %s (%d bytes).\n", url, local, n)
	}
}

In the example's fetch function, a file is opened with os.Create. However, closing the local file with a deferred call to f.Close would cause a subtle problem, because os.Create opens the file for writing, creating it as needed. On many file systems, notably NFS, write errors are not reported immediately but are delayed until the file is closed. Failing to check the result of the close operation can silently lose data. If io.Copy and f.Close both fail, we prefer to report the error from io.Copy, since it happened first and is more likely to record the cause of the failure. The error handling at the end of the example implements exactly this logic.

Optimize the location of defer

After opening the file, the defer statement should ideally follow the error check immediately, so that the defer and the file-open appear as a pair. But the io.Copy statement sits in between, and while here it is only one statement, it could just as well be a block of code, which would make the defer less clear. We can take advantage of the fact that a deferred function can modify the function's named return values: by registering the defer before io.Copy runs, the returned error value err still records any error that io.Copy produces.

Starting from the original code, you only need to wrap the if block in an anonymous function and defer it:

f, err := os.Create(local)
if err != nil {
	return "", 0, err
}
defer func() {
	if closeErr := f.Close(); err == nil {
		err = closeErr
	}
}()
n, err = io.Copy(f, resp.Body)
return local, n, err

What happens here is that the deferred function modifies the result returned to the caller.
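To isolate the mechanism, here is a minimal sketch, independent of the crawler code, showing a deferred anonymous function changing a named return value:

package main

import "fmt"

// double returns 2*x, but the deferred function runs after the return
// statement has set the named result and adjusts it before the caller sees it.
func double(x int) (result int) {
	defer func() { result += 1 }()
	return x * 2
}

func main() {
	fmt.Println(double(4)) // prints 9, not 8
}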

Anonymous function

Next comes the traversal part of the web crawler.

Parsing links

Building on the earlier traversal of the node tree, this time the goal is to collect all the links in a page. Replacing the earlier visit function with an anonymous function (a closure) lets the links that are found be appended directly to the links slice, which makes the logic clearer and easier to follow. Because Extract only needs a pre-order traversal, nil is passed as the post argument. The code is turned into a package here so it can be reused later:

// Package links provides a link-extraction function.
package links

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// Extract issues an HTTP GET request to the given URL, parses the
// response as HTML, and returns the links present in the document.
func Extract(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		return nil, fmt.Errorf("get %s: %s", url, resp.Status)
	}
	doc, err := html.Parse(resp.Body)
	resp.Body.Close()
	if err != nil {
		return nil, fmt.Errorf("parse %s: %s", url, err)
	}
	var links []string
	visitNode := func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key != "href" {
					continue
				}
				link, err := resp.Request.URL.Parse(a.Val)
				if err != nil {
					continue // ignore illegal URLs
				}
				links = append(links, link.String())
			}
		}
	}
	forEachNode(doc, visitNode, nil) // pre-order only, so post is nil
	return links, nil
}

func forEachNode(n *html.Node, pre, post func(n *html.Node)) {
	if pre != nil {
		pre(n)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		forEachNode(c, pre, post)
	}
	if post != nil {
		post(n)
	}
}

/* Example of calling Extract:
func main() {
	url := "https://baidu.com"
	urls, err := Extract(url)
	if err != nil {
		// handle the error; no separate package is needed for this demo
		fmt.Printf("extract: %v\n", err)
		return
	}
	for n, u := range urls {
		fmt.Printf("%2d: %s\n", n, u)
	}
}
*/

Parsing the URL into an absolute path

Instead of appending the href value directly to the slice, it is parsed relative to the current document's base URL, resp.Request.URL. The resulting link is an absolute URL, so it can be passed straight to http.Get for the next request.
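A minimal sketch of that resolution using net/url directly; the base document and the relative href below are made-up values for illustration:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	base, err := url.Parse("http://lab.scrapyd.cn/archives/57.html")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Resolve a relative href against the document's base URL,
	// just as resp.Request.URL.Parse(a.Val) does in Extract.
	abs, err := base.Parse("../tag/life/")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(abs.String()) // http://lab.scrapyd.cn/tag/life/
}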

Traversal of graphs

The core of a web crawler is graph traversal. Recursion gives a depth-first traversal, but a web crawler needs breadth-first traversal. Concurrent traversal is also possible; that is left for the final example.

The following example function shows the essence of breadth-first traversal. The caller provides an initial worklist containing the items to visit and a function f to process each item; each item is identified by a string. f returns a list of new items to add to the worklist. breadthFirst returns once every reachable item has been visited. It maintains a set of strings to ensure that each item is visited only once.

In the crawler, each item is a URL. Here the crawl function is supplied as the f argument to breadthFirst; it prints the URL, extracts the links from that page, and returns them:

package main

import (
	"fmt"
	"log"
	"os"

	"gopl/ch6/links"
)

// breadthFirst calls f for each item in the worklist and appends
// whatever f returns to the worklist; f is called at most once per item.
func breadthFirst(f func(item string) []string, worklist []string) {
	seen := make(map[string]bool)
	for len(worklist) > 0 {
		items := worklist
		worklist = nil
		for _, item := range items {
			if !seen[item] {
				seen[item] = true
				worklist = append(worklist, f(item)...)
			}
		}
	}
}

func crawl(url string) []string {
	fmt.Println(url)
	list, err := links.Extract(url)
	if err != nil {
		log.Print(err)
	}
	return list
}

func main() {
	// Start the breadth-first traversal from the command-line arguments.
	breadthFirst(crawl, os.Args[1:])
}

Traverse and output the links

The next step is to find a web page to test with. Here is part of the resulting output:

PS H:\Go\src\gopl\ch6\findlinks3> go run main.go http://lab.scrapyd.cn/
http://lab.scrapyd.cn/
http://lab.scrapyd.cn/archives/57.html
http://lab.scrapyd.cn/tag/%E8%89%BA%E6%9C%AF/
http://lab.scrapyd.cn/tag/%E5%90%8D%E7%94%BB/
http://lab.scrapyd.cn/archives/55.html
http://lab.scrapyd.cn/archives/29.html
http://lab.scrapyd.cn/tag/%E6%9C%A8%E5%BF%83/
http://lab.scrapyd.cn/archives/28.html
http://lab.scrapyd.cn/tag/%E6%B3%B0%E6%88%88%E5%B0%94/
http://lab.scrapyd.cn/tag/%E7%94%9F%E6%B4%BB/
http://lab.scrapyd.cn/archives/27.html
......

The whole process will end when all accessible web pages are accessed or memory is exhausted.

Example: concurrent Web crawler

The next step is concurrency, so that the link-finding program above can run concurrently. Independent calls to crawl can then take full advantage of the I/O parallelism of the Web.

Concurrent modification

The crawl function is the same as before and needs no changes. The main function below is similar to the original breadthFirst. As before, a worklist records the queue of items to process, each entry being a list of URLs to crawl, but this time the queue is represented by a channel instead of a slice. Each call to crawl runs in its own goroutine and sends the links it finds back onto the worklist:

package main

import (
	"fmt"
	"log"
	"os"

	"gopl/ch6/links"
)

func crawl(url string) []string {
	fmt.Println(url)
	list, err := links.Extract(url)
	if err != nil {
		log.Print(err)
	}
	return list
}

func main() {
	worklist := make(chan []string)

	// Start with the command-line arguments.
	go func() { worklist <- os.Args[1:] }()

	// Crawl the web concurrently.
	seen := make(map[string]bool)
	for list := range worklist {
		for _, link := range list {
			if !seen[link] {
				seen[link] = true
				go func(link string) {
					worklist <- crawl(link)
				}(link)
			}
		}
	}
}
