How to implement a web crawler in Python

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article walks through how to implement a web crawler in Python. The walkthrough is detailed; interested readers can follow along and will hopefully find it useful.

Part one:

Get web page information:

import requests

url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
response = requests.get(url)

Part two:

Observe how the page delivers its data: the data is embedded in a script tag, so XPath can extract it. Import etree from the lxml module, build an HTML object from the response, and run the XPath query; it returns a list whose first item holds the entire JSON string. Then use the json module to convert that string into a dictionary (a Python data structure). The domestic data sits under caseList inside component.

Next, add the code:

from lxml import etree
import json

# generate the HTML object
html = etree.HTML(response.text)
result = html.xpath('//script[@type="application/json"]/text()')
result = result[0]
# json.loads() converts the string into a Python data type
result = json.loads(result)
result_in = result['component'][0]['caseList']
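The same extract-then-parse pattern can be exercised on a small inline document. The HTML and the embedded JSON below are made up for illustration; they only mimic the structure of the Baidu page scraped above:

```python
import json

from lxml import etree

# A made-up page that embeds its data in a <script type="application/json"> tag.
sample_html = """
<html><body>
<script type="application/json">
{"component": [{"caseList": [{"area": "Hubei", "confirmed": "68135"}]}]}
</script>
</body></html>
"""

html = etree.HTML(sample_html)
# XPath returns a list of matching text nodes; the first one is the JSON payload
raw = html.xpath('//script[@type="application/json"]/text()')[0]
data = json.loads(raw)  # string -> Python dict
case_list = data['component'][0]['caseList']
print(case_list[0]['area'])  # -> Hubei
```
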

Part three:

Store the domestic data in an Excel table using the openpyxl module: import openpyxl first, create a workbook, create a worksheet under the workbook, then name the worksheet and append the rows.

The code is as follows:

import openpyxl

# create a workbook
wb = openpyxl.Workbook()
# create a worksheet
ws = wb.active
ws.title = "domestic epidemic"
ws.append(['province', 'cumulative diagnosis', 'death', 'cure', 'existing diagnosis',
           'cumulative diagnosis increment', 'death increment', 'cure increment',
           'existing diagnosis increment'])
'''
area               --> mostly provinces (sometimes cities)
city               --> city
confirmed          --> cumulative confirmed
crued              --> cured
relativeTime       --> time of the report
confirmedRelative  --> cumulative confirmed increment
curedRelative      --> cured increment
curConfirm         --> existing confirmed
curConfirmRelative --> existing confirmed increment
'''
for each in result_in:
    temp_list = [each['area'], each['confirmed'], each['died'], each['crued'],
                 each['curConfirm'], each['confirmedRelative'], each['diedRelative'],
                 each['curedRelative'], each['curConfirmRelative']]
    # replace empty cells with '0' before writing the row
    for i in range(len(temp_list)):
        if temp_list[i] == '':
            temp_list[i] = '0'
    ws.append(temp_list)

wb.save('./data.xlsx')
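The empty-cell cleanup inside the loop above is easy to factor out and check in isolation. `normalize_row` is a hypothetical helper name introduced here for illustration, not part of the original script:

```python
def normalize_row(row, fill='0'):
    """Replace empty-string cells with a fill value so no blanks land in the sheet."""
    return [fill if cell == '' else cell for cell in row]

row = ['Hubei', '68135', '', '63623', '']
print(normalize_row(row))  # -> ['Hubei', '68135', '0', '63623', '0']
```
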

Part four:

Store the foreign data in Excel: the foreign data sits under globalList inside component. Create one sheet per continent in the same workbook.

The code is as follows:

data_out = result['component'][0]['globalList']

for each in data_out:
    sheet_title = each['area']
    # create a new worksheet for this continent
    ws_out = wb.create_sheet(sheet_title)
    ws_out.append(['country', 'cumulative diagnosis', 'death', 'cure',
                   'existing diagnosis', 'cumulative diagnosis increment'])
    for country in each['subList']:
        list_temp = [country['country'], country['confirmed'], country['died'],
                     country['crued'], country['curConfirm'],
                     country['confirmedRelative']]
        # replace empty cells with '0' before writing the row
        for i in range(len(list_temp)):
            if list_temp[i] == '':
                list_temp[i] = '0'
        ws_out.append(list_temp)

wb.save('./data.xlsx')
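The one-sheet-per-continent idea can be tried on made-up data without fetching anything. The `global_list` below is a stand-in for `result['component'][0]['globalList']`, invented for illustration:

```python
import openpyxl

# Made-up stand-in for the globalList payload: one entry per continent.
global_list = [
    {'area': 'Asia', 'subList': [{'country': 'Japan', 'confirmed': '1000'}]},
    {'area': 'Europe', 'subList': [{'country': 'Italy', 'confirmed': '2000'}]},
]

wb = openpyxl.Workbook()
wb.remove(wb.active)  # drop the default sheet so only continent sheets remain
for each in global_list:
    ws = wb.create_sheet(each['area'])  # one worksheet per continent
    ws.append(['country', 'cumulative diagnosis'])
    for country in each['subList']:
        ws.append([country['country'], country['confirmed']])

print(wb.sheetnames)  # -> ['Asia', 'Europe']
```
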

The overall code is as follows:

import requests
from lxml import etree
import json
import openpyxl

url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
response = requests.get(url)
# print(response.text)

# generate the HTML object
html = etree.HTML(response.text)
result = html.xpath('//script[@type="application/json"]/text()')
result = result[0]
# json.loads() converts the string into a Python data type
result = json.loads(result)

# create a workbook and the domestic worksheet
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "domestic epidemic"
ws.append(['province', 'cumulative diagnosis', 'death', 'cure', 'existing diagnosis',
           'cumulative diagnosis increment', 'death increment', 'cure increment',
           'existing diagnosis increment'])

result_in = result['component'][0]['caseList']
data_out = result['component'][0]['globalList']

# domestic epidemic data
for each in result_in:
    temp_list = [each['area'], each['confirmed'], each['died'], each['crued'],
                 each['curConfirm'], each['confirmedRelative'], each['diedRelative'],
                 each['curedRelative'], each['curConfirmRelative']]
    for i in range(len(temp_list)):
        if temp_list[i] == '':
            temp_list[i] = '0'
    ws.append(temp_list)

# foreign epidemic data: one worksheet per continent
for each in data_out:
    sheet_title = each['area']
    ws_out = wb.create_sheet(sheet_title)
    ws_out.append(['country', 'cumulative diagnosis', 'death', 'cure',
                   'existing diagnosis', 'cumulative diagnosis increment'])
    for country in each['subList']:
        list_temp = [country['country'], country['confirmed'], country['died'],
                     country['crued'], country['curConfirm'],
                     country['confirmedRelative']]
        for i in range(len(list_temp)):
            if list_temp[i] == '':
                list_temp[i] = '0'
        ws_out.append(list_temp)

wb.save('./data.xlsx')

The result is a "domestic epidemic" sheet plus one sheet per continent (result screenshots omitted).


That is how to implement a web crawler in Python. I hope the content above is helpful and teaches you something new. If you found the article useful, please share it so more people can see it.
