2025-01-15 Update From: SLTechnology News&Howtos (shulou.com)
This article is about whether to use XPath or regular expressions to extract data. It is shared here as a practical reference.
XPath and regular expressions are the two most commonly used methods for data extraction. Which one is faster?
The test code is shown below. The experiment extracts the same fields from the same HTML document 100 times with each of three methods: the xpath module in the webscraping library, XPath in the lxml library, and regular expressions, and records the time taken by each:
# coding: utf-8
# xpath_speed_test.py
import re
import time
from lxml import etree
from webscraping import common, download, xpath

TEST_TIMES = 100

def test():
    url = 'http://hotels.ctrip.com/international/washington26363'
    html = download.Download().get(url)
    html = common.to_unicode(html)

    # Test the XPath extraction speed of the webscraping library
    start_time = time.time()
    for i in range(TEST_TIMES):
        for hid, hprice in zip(xpath.search(html, '//div[@class="hlist_item"]/@id'),
                               xpath.search(html, '//div[@class="hlist_item_price"]/span')):
            # print hid, hprice
            pass
    end_time = time.time()
    webscraping_xpath_time_used = end_time - start_time
    print '"webscraping.xpath" time used: {} seconds.'.format(webscraping_xpath_time_used)

    # Test the XPath extraction speed of the lxml library
    start_time = time.time()
    for i in range(TEST_TIMES):
        root = etree.HTML(html)
        for hlist_div in root.xpath('//div[@class="hlist_item"]'):
            hid = hlist_div.get('id')
            hprice = hlist_div.xpath('.//div[@class="hlist_item_price"]/span')[0].text
            # print hid, hprice
    end_time = time.time()
    lxml_time_used = end_time - start_time
    print '"lxml" time used: {} seconds.'.format(lxml_time_used)

    # Test the extraction speed of regular expressions
    start_time = time.time()
    for i in range(TEST_TIMES):
        for hid, hprice in zip(re.compile(r'<div class="hlist_item" id="(\d+)"').findall(html),
                               re.compile(ur'¥([\d\.]+)').findall(html)):
            # print hid, hprice
            pass
    end_time = time.time()
    re_time_used = end_time - start_time
    print '"re" time used: {} seconds.'.format(re_time_used)

if __name__ == '__main__':
    test()
The running results are as follows:
"webscraping.xpath" time used: 100.677000046 seconds.
"lxml" time used: 2.09100008011 seconds.
"re" time used: 0.138999938965 seconds.
The result was striking:
The regular expressions took only 0.14 seconds.
lxml's XPath took 2.1 seconds.
The webscraping library's XPath took 101 seconds!
We usually choose XPath when developing crawlers because it is simple and flexible, but this experiment shows that it is much less efficient than regular expressions; the XPath implementation in the webscraping library in particular is frighteningly slow.
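One caveat before reading too much into lxml's number: the lxml timing above calls etree.HTML inside the loop, so it pays the full parse cost on every iteration, not just the query cost. A minimal sketch (using made-up synthetic HTML in place of the live hotel page) that separates the two:

```python
import time
from lxml import etree

# Synthetic stand-in for the hotel listing page used in the benchmark.
html = '<html><body>' + ''.join(
    '<div class="hlist_item" id="{0}">'
    '<div class="hlist_item_price"><span>{0}.00</span></div></div>'.format(i)
    for i in range(200)
) + '</body></html>'

TEST_TIMES = 100

# Variant A: re-parse the document on every iteration, as the benchmark does.
start = time.time()
for _ in range(TEST_TIMES):
    root = etree.HTML(html)
    ids = root.xpath('//div[@class="hlist_item"]/@id')
reparse_time = time.time() - start

# Variant B: parse once, then run only the XPath query in the loop.
root = etree.HTML(html)
start = time.time()
for _ in range(TEST_TIMES):
    ids = root.xpath('//div[@class="hlist_item"]/@id')
query_time = time.time() - start

print('parse every time: {:.3f}s, parse once: {:.3f}s'.format(reparse_time, query_time))
```

On a page fetched once and queried many times, hoisting the parse out of the loop narrows the gap with regular expressions considerably, although regexes still avoid building a tree at all.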
Therefore, during crawler development we should prefer regular expressions; if a pattern is genuinely hard to express as a regex, then consider XPath, and when using XPath, be sure to choose an efficient library such as lxml. Efficiency matters especially when the amount of data is large.
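Following that advice, here is a dependency-free sketch of the regex approach (the HTML, ids, and prices are made up for illustration), with the patterns compiled once outside the extraction step rather than on every call:

```python
import re

# Made-up fragment mimicking the hotel-list markup targeted by the benchmark.
html = ''.join(
    '<div class="hlist_item" id="{0}">'
    '<div class="hlist_item_price"><span>¥{0}.50</span></div></div>'.format(i)
    for i in range(1, 4)
)

# Compile once and reuse; this avoids repeated pattern-cache lookups in hot loops.
ID_RE = re.compile(r'<div class="hlist_item" id="(\d+)"')
PRICE_RE = re.compile(r'¥([\d.]+)')

hotels = list(zip(ID_RE.findall(html), PRICE_RE.findall(html)))
print(hotels)  # -> [('1', '1.50'), ('2', '2.50'), ('3', '3.50')]
```

Note that zipping two independent findall results, as the benchmark also does, silently assumes ids and prices appear in lockstep; matching both fields in a single pattern is safer when items can be missing a price.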
Thank you for reading! This concludes the article on whether to use XPath or regular expressions for data extraction. I hope it has been helpful; if you found it worthwhile, please share it so more people can see it.
© 2024 shulou.com SLNews company. All rights reserved.