2025-01-15 Update From: SLTechnology News&Howtos (shulou.com)
This article is about whether to use XPath or regular expressions to extract data. It is shared here as a practical reference.
XPath and regular expressions are the two most commonly used methods for data extraction. Which one is faster?
The test code is shown below. The experiment extracts the same fields from the same HTML document 100 times with each of three methods: the xpath module in the webscraping library, XPath in the lxml library, and regular expressions, and records the time taken by each:
# coding: utf-8
# xpath_speed_test.py
import re
import time
from lxml import etree
from webscraping import common, download, xpath

TEST_TIMES = 100

def test():
    url = 'http://hotels.ctrip.com/international/washington26363'
    html = download.Download().get(url)
    html = common.to_unicode(html)

    # Test the XPath extraction speed of the webscraping library
    start_time = time.time()
    for i in range(TEST_TIMES):
        for hid, hprice in zip(xpath.search(html, '//div[@class="hlist_item"]/@id'),
                               xpath.search(html, '//div[@class="hlist_item_price"]/span')):
            # print hid, hprice
            pass
    end_time = time.time()
    webscraping_xpath_time_used = end_time - start_time
    print '"webscraping.xpath" time used: {} seconds.'.format(webscraping_xpath_time_used)

    # Test the XPath extraction speed of the lxml library
    start_time = time.time()
    for i in range(TEST_TIMES):
        root = etree.HTML(html)
        for hlist_div in root.xpath('//div[@class="hlist_item"]'):
            hid = hlist_div.get('id')
            hprice = hlist_div.xpath('.//div[@class="hlist_item_price"]/span')[0].text
            # print hid, hprice
    end_time = time.time()
    lxml_time_used = end_time - start_time
    print '"lxml" time used: {} seconds.'.format(lxml_time_used)

    # Test the extraction speed of regular expressions
    start_time = time.time()
    for i in range(TEST_TIMES):
        for hid, hprice in zip(re.compile(r'<div class="hlist_item" id="(\d+)"').findall(html),
                               re.compile(ur'¥([\d\.]+)').findall(html)):
            # print hid, hprice
            pass
    end_time = time.time()
    re_time_used = end_time - start_time
    print '"re" time used: {} seconds.'.format(re_time_used)

if __name__ == '__main__':
    test()
The running results are as follows:
"webscraping.xpath" time used: 100.677000046 seconds.
"lxml" time used: 2.09100008011 seconds.
"re" time used: 0.138999938965 seconds.
The result was striking:
The regular expressions took only 0.14 seconds.
lxml's XPath took 2.1 seconds.
The webscraping library's XPath took 101 seconds!
We usually choose XPath when developing crawlers because it is simple and flexible, but this experiment shows that it is much less efficient than regular expressions; the XPath implementation in the webscraping library in particular is frighteningly slow.
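One caveat before reading too much into lxml's number: the lxml timing above calls etree.HTML inside the loop, so it pays the full parse cost on every iteration, not just the query cost. A minimal sketch (using made-up synthetic HTML in place of the live hotel page) that separates the two:

```python
import time
from lxml import etree

# Synthetic stand-in for the hotel listing page used in the benchmark.
html = '<html><body>' + ''.join(
    '<div class="hlist_item" id="{0}">'
    '<div class="hlist_item_price"><span>{0}.00</span></div></div>'.format(i)
    for i in range(200)
) + '</body></html>'

TEST_TIMES = 100

# Variant A: re-parse the document on every iteration, as the benchmark does.
start = time.time()
for _ in range(TEST_TIMES):
    root = etree.HTML(html)
    ids = root.xpath('//div[@class="hlist_item"]/@id')
reparse_time = time.time() - start

# Variant B: parse once, then run only the XPath query in the loop.
root = etree.HTML(html)
start = time.time()
for _ in range(TEST_TIMES):
    ids = root.xpath('//div[@class="hlist_item"]/@id')
query_time = time.time() - start

print('parse every time: {:.3f}s, parse once: {:.3f}s'.format(reparse_time, query_time))
```

On a page fetched once and queried many times, hoisting the parse out of the loop narrows the gap with regular expressions considerably, although regexes still avoid building a tree at all.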
Therefore, during crawler development we should prefer regular expressions; if a pattern is genuinely hard to express as a regex, then consider XPath, and when using XPath, be sure to choose an efficient library such as lxml. Efficiency matters especially when the amount of data is large.
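Following that advice, here is a dependency-free sketch of the regex approach (the HTML, ids, and prices are made up for illustration), with the patterns compiled once outside the extraction step rather than on every call:

```python
import re

# Made-up fragment mimicking the hotel-list markup targeted by the benchmark.
html = ''.join(
    '<div class="hlist_item" id="{0}">'
    '<div class="hlist_item_price"><span>¥{0}.50</span></div></div>'.format(i)
    for i in range(1, 4)
)

# Compile once and reuse; this avoids repeated pattern-cache lookups in hot loops.
ID_RE = re.compile(r'<div class="hlist_item" id="(\d+)"')
PRICE_RE = re.compile(r'¥([\d.]+)')

hotels = list(zip(ID_RE.findall(html), PRICE_RE.findall(html)))
print(hotels)  # -> [('1', '1.50'), ('2', '2.50'), ('3', '3.50')]
```

Note that zipping two independent findall results, as the benchmark also does, silently assumes ids and prices appear in lockstep; matching both fields in a single pattern is safer when items can be missing a price.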
Thank you for reading! This concludes the article on whether to use XPath or regular expressions for data extraction. I hope it has been helpful; if you found it worthwhile, please share it so more people can see it.
© 2024 shulou.com SLNews company. All rights reserved.