The way of crawlers: the comparison of Python, Golang and GraphQuery

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

This article uses Python, Golang and GraphQuery to parse the material details page of a website. The page is characterized by a clear data structure, but a DOM structure that is not standard enough to locate elements with a single selector, which makes parsing take a few twists and turns. By walking through this page, we can get a clear picture of the similarities and differences between the parsing approaches of these languages.

Contents:

1. Preface
   1. Semantic DOM structure
   2. Stable parsing code
2. Page parsing
   - Using Python for page parsing
     1. Get the title node
     2. Get the size node
     3. The complete Python code
   - Using Golang for page parsing
   - Parsing with GraphQuery
     1. Call GraphQuery in Golang
     2. Call GraphQuery in Python
3. Postscript

1. Preface

To prevent unnecessary confusion in later chapters, let's first go over a few basic programming concepts.

1. Semantic DOM structure

The semantic DOM structure we are talking about here includes not only semantic html tags but also semantic selectors. In front-end development, every piece of dynamic text should be wrapped in its own html tag, preferably with a semantic class or id attribute. This benefits both front-end and back-end development as version features iterate. Consider the following HTML code:

Serial number: 32490230

Mode: RGB

Volume: 16.659 MB

Resolution: 72dpi

This is front-end code that is not semantic enough: the values 32490230, RGB, 16.659 MB and 72dpi are dynamic attributes that change from item to item. Under a proper development standard, each of these dynamic values should be wrapped in an inline tag (such as a span) and given a semantic selector. From the HTML structure above we can roughly infer that the page is rendered directly by a foreach loop on the back end. This does not fit the idea of front-end/back-end separation: if one day the team decides to render these attributes on the front end via jsonp or Ajax, the workload will undoubtedly go up a level. A semantic DOM structure looks more like this:

Mode: RGB

You can also use property-mode directly as the class attribute of the span, so that both back-end rendering and front-end dynamic rendering become easier, reducing the burden of iteration.
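With a semantic class in place, extraction collapses to a single selector lookup. A minimal standard-library sketch (the property-mode class follows the example above, but the surrounding markup is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Invented markup following the semantic pattern described above:
# the dynamic value sits in its own tag with a semantic class.
html = '<p>Mode: <span class="property-mode">RGB</span></p>'
root = ET.fromstring(html)

# One unambiguous selector; reordering sibling nodes cannot break it.
mode = root.find(".//span[@class='property-mode']").text
print(mode)  # RGB
```

The same one-line lookup works in pyquery or goquery through the CSS selector .property-mode.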

2. Stable parsing code

Having covered semantic DOM structure, let's talk about stable parsing code, using the following DOM structure:

Serial number: 32490230

Mode: RGB

Volume: 16.659 MB

Resolution: 72dpi

If we want to extract the mode information, we can of course take the following steps:

Select the div whose class attribute contains main-right, select the second p element inside it, take the text it contains, and delete the leading "Mode:" from the text, obtaining the mode RGB.

Although we successfully obtain the desired result, we consider this parsing method unstable: a certain degree of structural change in ancestor or sibling nodes causes the parse to fail or return the wrong value. For example, suppose one day a size attribute is added in front of the node where the mode is located:

Serial number: 32490230

Size: 4724 × 6299 pixels

Mode: RGB

Volume: 16.659 MB

Resolution: 72dpi

Then our previous parse will go wrong (What? You don't think such a change is possible? Compare Page1 with Page2).

So how should we write more stable parsing code? For the DOM structure above, we have the following ideas:

Idea 1: iterate over the p nodes whose class attribute is main-rightStage and check whether each node's text starts with "Mode". If it does, take the content after it. The disadvantage is that the logic is heavy, which makes the code hard to maintain and less readable.

Idea 2: use the regular expression Mode: ([A-Z]+) to match. The disadvantage is that improper use may cause efficiency problems.

Idea 3: use the :contains method of CSS selectors, e.g. .main-rightStage:contains("Mode"), to select the node whose class attribute contains main-rightStage and whose text contains "Mode". The disadvantage is that different languages and libraries support this syntax to different degrees, so it lacks compatibility.

Which method to use is a matter of opinion. Different parsing approaches differ in stability, code complexity, running efficiency and compatibility; developers need to weigh these factors to write the best parsing code for the situation.
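Idea 2 can be sketched in a few lines of Python. Because the regex keys off the label text rather than node position, inserting a new attribute line does not break extraction (the sample strings are made up to mirror the page text above):

```python
import re

def get_mode(text):
    # Match the value after the "Mode:" label, wherever it appears.
    match = re.search(r"Mode:\s*([A-Z]+)", text)
    return match.group(1) if match else ""

before = "Serial number: 32490230 Mode: RGB Volume: 16.659 MB"
# After a "Size" attribute is inserted, a positional selector breaks,
# but the label-based regex still finds the mode:
after = "Serial number: 32490230 Size: 4724 x 6299 pixels Mode: RGB Volume: 16.659 MB"

print(get_mode(before))  # RGB
print(get_mode(after))   # RGB
```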

2. Page parsing

Before extracting the page data, the first thing to do is to be clear about what data we need and what data the page provides, and then design our data structure. First, open the page to be parsed. The pageviews, collections, downloads and other numbers at the top are dynamically loaded and not needed in our demonstration; the size, mode and other data on the right side of the page, as the comparison of Page1 and Page2 above shows, do not necessarily exist, so we group them together into metainfo. The data we need to get is shown in the following figure:

As a result, we can quickly design our data structure:

{
    title
    pictype
    number
    type
    metadata {
        size
        volume
        mode
        resolution
    }
    author
    images []
    tags []
}

Among them, size, volume, mode and resolution may not exist, so they are grouped under metadata; images is an array of image addresses, and tags is an array of tags. With the data structure to extract determined, we can start parsing.
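In Python this design can be sketched as nested dataclasses, with the optional metadata fields defaulting to empty strings (the field names come from the structure above; this particular layout is just one reasonable choice, not prescribed by the article):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Metadata:
    # These four attributes may be absent on some pages, hence the defaults.
    size: str = ""
    volume: str = ""
    mode: str = ""
    resolution: str = ""

@dataclass
class Result:
    title: str = ""
    pictype: str = ""
    number: str = ""
    type: str = ""
    metadata: Metadata = field(default_factory=Metadata)
    author: str = ""
    images: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
```

A page missing, say, the resolution then simply yields an empty metadata.resolution instead of a parse error.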

Use Python for page parsing

The number of Python libraries is very large, and there are many excellent libraries that can help us. When using Python for page parsing, we usually use the following libraries:

- re, which provides regular expression support
- pyquery and beautifulsoup4, which provide CSS selector support
- lxml, which provides XPath support
- jsonpath_rw, which provides JSON PATH support

These libraries are supported under Python 3 and can be installed through pip install.

Because CSS selector syntax is more concise than XPath syntax, and pyquery is more convenient to call than beautifulsoup4, we chose pyquery from the second and third options.

Below, we will take the acquisition of title and type attributes as examples, and the acquisition of other nodes is the same. First, let's use the requests library to download the source file for this page:

import requests
from pyquery import PyQuery as pq

response = requests.get("http://www.58pic.com/newpic/32504070.html")
document = pq(response.content.decode('gb2312'))

All of the Python parsing below builds on this.

1. Get the title node

Open the page to be parsed, right-click on the title, click to view the element, and you can see that its DOM structure is as follows:

At this point, we notice that the title text we want to extract, "swordsman poster Jin Yong martial arts ink black and white", is not wrapped in an html tag of its own, which does not conform to the semantic DOM structure discussed above. At the same time, a CSS selector cannot directly select this bare text node (XPath can select it directly, but this article keeps to CSS selectors). For such nodes, we can think of the following two ideas:

Idea 1: first select its parent node, then remove the child element nodes one by one; the text remaining in the node is the title.

Idea 2: first select its parent node, get its HTML content, and use a regular expression to match the bare text between the child tags.

The complete code below uses idea 1 for the title. For other bare-text values, such as the file format, a regular expression is more direct:

file_type_matches = re.compile(r"File format: ([a-z]+)").findall(context)
filetype = ""
if len(file_type_matches) > 0:
    filetype = file_type_matches[0]
print(filetype)
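Idea 1 — take the parent node's direct text while skipping its child elements — can be sketched with the standard library alone (the markup below is a simplified, invented stand-in for the real title block):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in: the title is a bare text node followed by
# child elements we do not want.
html = '<div class="detail-title">Swordsman poster<div>extra</div><p>more</p></div>'
node = ET.fromstring(html)

# .text holds only the text before the first child element,
# so the child tags are skipped automatically.
title = node.text.strip()
print(title)  # Swordsman poster
```

pyquery achieves the same effect by calling .remove() on the children and then .text() on the parent.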

2. Get the size node

Since the same regex approach works for attributes such as size, volume, mode and resolution, we can boil it down to a single regex extraction function:

def regex_get(text, expr):
    matches = re.compile(expr).findall(text)
    if len(matches) == 0:
        return ""
    return matches[0]

Therefore, when we get the size node, our code can be reduced to:

size = regex_get(context, r"Size: (.*?) pixels")

3. The complete Python code

At this point, most of the problems we may encounter in parsing the page have been solved. The entire Python code is as follows:

import requests
import re
from pyquery import PyQuery as pq

def regex_get(text, expr):
    matches = re.compile(expr).findall(text)
    if len(matches) == 0:
        return ""
    return matches[0]

conseq = {}

## download the document
response = requests.get("http://www.58pic.com/newpic/32504070.html")
document = pq(response.content.decode('gb2312'))

## get the file title
title_node = document.find(".detail-title")
title_node.find("div").remove()
title_node.find("p").remove()
conseq["title"] = title_node.text()

## get the material type
conseq["pictype"] = document.find(".pic-type").text()

## get the file format
conseq["filetype"] = regex_get(document.find(".mainRight-file").text(), r"File format: ([a-z]+)")

## get the metadata
context = document.find(".main-right p").text()
conseq["metainfo"] = {
    "size": regex_get(context, r"Size: (.*?) pixels"),
    "volume": regex_get(context, r"Volume: (.*? MB)"),
    "mode": regex_get(context, r"Mode: ([A-Z]+)"),
    "resolution": regex_get(context, r"Resolution: (\d+dpi)"),
}

## get the author
conseq["author"] = document.find(".user-name").text()

## get the images
conseq["images"] = []
for node_image in document.find("#show-area-height img"):
    conseq["images"].append(pq(node_image).attr("src"))

## get the tags
conseq["tags"] = []
for node_tag in document.find(".mainRight-tagBox .fl"):
    conseq["tags"].append(pq(node_tag).text())

print(conseq)

Using Golang for page parsing

The following libraries are commonly used to parse html and xml documents in Golang:

- regexp, which provides regular expression support
- github.com/PuerkitoBio/goquery, which provides CSS selector support
- gopkg.in/xmlpath.v2, which provides XPath support
- github.com/tidwall/gjson, which provides JSON PATH support

These libraries can be obtained through go get -u. Since we have already sorted out the parsing logic in the Python section above, we only need to reproduce it in Golang. Unlike Python, it is best to first define a struct for our data structure, like the following:

type Result struct {
    Title    string
    Pictype  string
    Number   string
    Type     string
    Metadata struct {
        Size       string
        Volume     string
        Mode       string
        Resolution string
    }
    Author string
    Images []string
    Tags   []string
}

Also, because the page to be parsed uses the non-mainstream gbk encoding, after downloading the document we need to manually convert it from gbk to utf-8. Although this step is not strictly part of parsing, it must be done. We use the library github.com/axgle/mahonia for the conversion, wrapped in a decoderConvert function:

func decoderConvert(name string, body string) string {
    return mahonia.NewDecoder(name).ConvertString(body)
}

Therefore, the final golang code should look like this:

package main

import (
    "encoding/json"
    "log"
    "regexp"
    "strings"

    "github.com/PuerkitoBio/goquery"
    "github.com/axgle/mahonia"
    "github.com/parnurzeal/gorequest"
)

type Result struct {
    Title    string
    Pictype  string
    Number   string
    Type     string
    Metadata struct {
        Size       string
        Volume     string
        Mode       string
        Resolution string
    }
    Author string
    Images []string
    Tags   []string
}

// RegexGet returns the first capture group of expr in text, or "".
func RegexGet(text string, expr string) string {
    matches := regexp.MustCompile(expr).FindStringSubmatch(text)
    if len(matches) < 2 {
        return ""
    }
    return matches[1]
}

func decoderConvert(name string, body string) string {
    return mahonia.NewDecoder(name).ConvertString(body)
}

func main() {
    // download the document
    request := gorequest.New()
    _, body, _ := request.Get("http://www.58pic.com/newpic/32504070.html").End()
    document, err := goquery.NewDocumentFromReader(strings.NewReader(decoderConvert("gbk", body)))
    if err != nil {
        panic(err)
    }
    conseq := &Result{}
    // get the file title
    titleNode := document.Find(".detail-title")
    titleNode.Find("div").Remove()
    titleNode.Find("p").Remove()
    conseq.Title = titleNode.Text()
    // get the material type
    conseq.Pictype = document.Find(".pic-type").Text()
    // get the file format
    conseq.Type = RegexGet(document.Find(".mainRight-file").Text(), `File format: ([a-z]+)`)
    // get the metadata
    context := document.Find(".main-right p").Text()
    conseq.Metadata.Size = RegexGet(context, `Size: (.*?) pixels`)
    conseq.Metadata.Volume = RegexGet(context, `Volume: (.*? MB)`)
    conseq.Metadata.Mode = RegexGet(context, `Mode: ([A-Z]+)`)
    conseq.Metadata.Resolution = RegexGet(context, `Resolution: (\d+dpi)`)
    // get the author
    conseq.Author = document.Find(".user-name").Text()
    // get the images
    document.Find("#show-area-height img").Each(func(i int, element *goquery.Selection) {
        if attribute, exists := element.Attr("src"); exists && attribute != "" {
            conseq.Images = append(conseq.Images, attribute)
        }
    })
    // get the tags
    document.Find(".mainRight-tagBox .fl").Each(func(i int, element *goquery.Selection) {
        conseq.Tags = append(conseq.Tags, element.Text())
    })
    bytes, _ := json.Marshal(conseq)
    log.Println(string(bytes))
}

The parsing logic is exactly the same, and the code volume and complexity are about the same as those of the Python version. Let's take a look at how the new GraphQuery does it.

Parsing with GraphQuery

It is known that the data structure we want is as follows:

{
    title
    pictype
    number
    type
    metadata {
        size
        volume
        mode
        resolution
    }
    author
    images []
    tags []
}

The code for GraphQuery looks like this:

{
    title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
    pictype `css(".pic-type")`
    number `css(".detailBtn-down");attr("data-id")`
    type `regex("File format: ([a-z]+)")`
    metadata `css(".main-right p")` {
        size `regex("Size: (.*?) pixels")`
        volume `regex("Volume: (.*? MB)")`
        mode `regex("Mode: ([A-Z]+)")`
        resolution `regex("Resolution: (\d+dpi)")`
    }
    author `css(".user-name")`
    images `css("#show-area-height img")` [
        src `attr("src")`
    ]
    tags `css(".mainRight-tagBox .fl")` [
        tag `text()`
    ]
}

By comparison, we can see that the query just adds some functions wrapped in backquotes to the data structure we designed. Yet it fully reproduces the parsing logic we wrote above in Python and Golang, and the structure of the returned data can be read directly off its syntax. The execution result of this GraphQuery query is as follows:

{
    "data": {
        "author": "Ice bear",
        "images": [
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072"
        ],
        "metadata": {
            "mode": "RGB",
            "resolution": "200dpi",
            "size": "4724 × 6299",
            "volume": "196.886 MB"
        },
        "number": "32504070",
        "pictype": "original",
        "tags": ["Hero", "poster", "Black and White", "Jin Yong", "Ink", "Swordsman", "Chinese style"],
        "title": "Hero poster Jin Yong Wushu Chinese style Black and White",
        "type": "psd"
    },
    "error": "",
    "timecost": 10997800
}

GraphQuery is a text query language. It does not depend on any back-end language and can be called from any back-end language; a single GraphQuery query statement returns the same parsing result in every language.

It has built-in xpath selectors, css selectors, jsonpath selectors and regular expressions, plus a sufficient set of text processing functions. Its structure is clear and readable, and it guarantees consistency among the data structure, the parsing code and the structure of the returned result.

Project address: github.com/storyicon/graphquery

The grammar of GraphQuery is simple and easy to understand; even on first contact you can pick it up quickly, since intuitiveness is one of its design goals. So how do we call it in practice?

1. Call GraphQuery in Golang

In golang, you just need to run go get -u github.com/storyicon/graphquery to get GraphQuery, then call it in your code:

package main

import (
    "log"

    "github.com/axgle/mahonia"
    "github.com/parnurzeal/gorequest"
    "github.com/storyicon/graphquery"
)

func decoderConvert(name string, body string) string {
    return mahonia.NewDecoder(name).ConvertString(body)
}

func main() {
    request := gorequest.New()
    _, body, _ := request.Get("http://www.58pic.com/newpic/32504070.html").End()
    body = decoderConvert("gbk", body)
    response := graphquery.ParseFromString(body, "{ title `xpath(\"/html/body/div[4]/div[1]/div/div/div[1]/text()\")` pictype `css(\".pic-type\")` number `css(\".detailBtn-down\");attr(\"data-id\")` type `regex(\"File format: ([a-z]+)\")` metadata `css(\".main-right p\")` { size `regex(\"Size: (.*?) pixels\")` volume `regex(\"Volume: (.*? MB)\")` mode `regex(\"Mode: ([A-Z]+)\")` resolution `regex(\"Resolution: (\\d+dpi)\")` } author `css(\".user-name\")` images `css(\"#show-area-height img\")` [ src `attr(\"src\")` ] tags `css(\".mainRight-tagBox .fl\")` [ tag `text()` ] }")
    log.Println(response)
}

Our GraphQuery expression is passed in as a single line as the second argument to the function graphquery.ParseFromString, and the result is exactly the same as expected.

2. Call GraphQuery in Python

In other back-end languages such as Python, calling GraphQuery requires starting its service first. The service has been compiled for windows, mac and linux and can be downloaded from GraphQuery-http.

After unzipping and starting the service, we can happily parse any document graphically in any back-end language using GraphQuery. The sample code for the Python call is as follows:

import requests

def GraphQuery(document, expr):
    response = requests.post("http://127.0.0.1:8559", data={
        "document": document,
        "expression": expr,
    })
    return response.text

response = requests.get("http://www.58pic.com/newpic/32504070.html")
conseq = GraphQuery(response.text, r"""
{
    title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
    pictype `css(".pic-type")`
    number `css(".detailBtn-down");attr("data-id")`
    type `regex("File format: ([a-z]+)")`
    metadata `css(".main-right p")` {
        size `regex("Size: (.*?) pixels")`
        volume `regex("Volume: (.*? MB)")`
        mode `regex("Mode: ([A-Z]+)")`
        resolution `regex("Resolution: (\d+dpi)")`
    }
    author `css(".user-name")`
    images `css("#show-area-height img")` [
        src `attr("src")`
    ]
    tags `css(".mainRight-tagBox .fl")` [
        tag `text()`
    ]
}
""")
print(conseq)

The output is as follows:

{
    "data": {
        "author": "Ice bear",
        "images": [
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072"
        ],
        "metadata": {
            "mode": "RGB",
            "resolution": "200dpi",
            "size": "4724 × 6299",
            "volume": "196.886 MB"
        },
        "number": "32504070",
        "pictype": "original",
        "tags": ["Hero", "poster", "Black and White", "Jin Yong", "Ink", "Swordsman", "Chinese style"],
        "title": "Hero poster Jin Yong Wushu Chinese style Black and White",
        "type": "psd"
    },
    "error": "",
    "timecost": 10997800
}

3. Postscript

Complex parsing logic brings not only code readability problems, but also great trouble in maintaining and porting the code; different languages and different libraries also produce differences in parsing results. GraphQuery is a new open source project whose main purpose is to free developers from these repetitive and tedious parsing logics and to let them write highly readable, highly portable and highly maintainable code. Welcome to try it, follow it, and contribute code to witness the growth of GraphQuery and the open source community!
