How Python Crawls a Website's Document Data
This article introduces how to crawl a website's document data with Python. Many people run into questions about this in daily practice, so the editor has consulted a variety of materials and put together a simple, easy-to-follow method that will hopefully answer them. Please follow along!
Preface
The text and pictures in this article come from the Internet and are intended for learning and communication only, not for any commercial use. If you have any questions, please contact us promptly so we can deal with them.
Basic development environment
Python 3.6
PyCharm
Related modules used:
import os
import requests
import time
import re
import json
from docx import Document
from docx.shared import Cm
Install Python, add it to the PATH environment variable, and use pip to install the required third-party modules.
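Concretely, the two third-party packages used here can be installed from the command line like this (requests handles the HTTP requests, python-docx provides the docx module for the Word output; os, time, re, and json ship with Python):

pip install requests python-docx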
Target Page Analysis
The document content on this website is stored as pictures, and the site exposes its own data interface for them.
Interface link:
https://openapi.book118.com/getPreview.html?&project_id=1&aid=272112230&t=f2c66902d6b63726d8e08b557fef90fb&view_token=SqX7ktrZ_ZakjDI@vcohcCwbn_PLb3C1&page=1&callback=jQuery18304186406662159248_1614492889385&_=1614492889486
Request parameters of the interface
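As a quick way to see what the interface expects, the query string of the link above can be unpacked with the standard library. This is a minimal sketch using only urllib.parse; the values are the ones from the example link:

from urllib.parse import urlsplit, parse_qs

link = ('https://openapi.book118.com/getPreview.html?&project_id=1&aid=272112230'
        '&t=f2c66902d6b63726d8e08b557fef90fb'
        '&view_token=SqX7ktrZ_ZakjDI@vcohcCwbn_PLb3C1&page=1'
        '&callback=jQuery18304186406662159248_1614492889385&_=1614492889486')

# Split the URL and decode each query parameter
for key, value in parse_qs(urlsplit(link).query).items():
    print(key, '=', value[0])
# Prints project_id, aid, t, view_token, page, callback, and _ (a timestamp)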
Overall approach
1. Request the web page and get the response data (a string).
2. Use the re module to match out the embedded data (a list) and take index 0 (a string) — see the sketch after this list.
3. Convert the extracted string into a Python dictionary with the json module.
4. Traverse the dictionary to get each picture's URL.
5. Save the pictures to a local folder.
6. Insert the pictures into a Word document.
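To make steps 2 and 3 concrete, here is a minimal sketch with a made-up jsonpReturn payload (the wrapper and the "data" field mirror what the interface returns; the sample values are invented for illustration):

import re
import json

# A shortened, made-up example of the JSONP the interface returns
text = 'jsonpReturn({"status": 6, "data": {"1": "//view-cache.book118.com/a.jpg"}});'

# Step 2: regex out the JSON between the jsonpReturn( ... ) wrapper
result = re.findall(r'jsonpReturn\((.*?)\)', text)[0]

# Step 3: parse the string into a Python dict and take the "data" field
json_data = json.loads(result)['data']
print(json_data)  # {'1': '//view-cache.book118.com/a.jpg'}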
Crawler code implementation
def download():
    content = 0
    # make sure the save folder exists before writing into it
    os.makedirs('img', exist_ok=True)
    # the interface returns 6 pages per request, hence the step of 6
    for page in range(1, 96, 6):
        # give a 2-second delay between requests
        time.sleep(2)
        # current timestamp in milliseconds
        now_time = int(time.time() * 1000)
        url = 'https://openapi.book118.com/getPreview.html'
        # request parameters
        params = {
            'project_id': '1',
            'aid': '272112230',
            't': 'f2c66902d6b63726d8e08b557fef90fb',
            'view_token': 'SqX7ktrZ_ZakjDI@vcohcCwbn_PLb3C1',
            'page': f'{page}',
            '_': now_time,
        }
        # request headers
        headers = {
            'Host': 'openapi.book118.com',
            'Referer': 'https://max.book118.com/html/2020/0427/8026036013002110.shtm',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
        }
        response = requests.get(url=url, params=params, headers=headers)
        # extract the JSON inside the jsonpReturn(...) wrapper with a regular expression
        result = re.findall(r'jsonpReturn\((.*?)\)', response.text)[0]
        # string -> json data
        json_data = json.loads(result)['data']
        # traverse the page-number -> image-url mapping
        for value in json_data.values():
            content += 1
            # splice the full picture url (the interface returns protocol-relative urls)
            img_url = 'http:' + value
            print(img_url)
            headers_1 = {
                'Host': 'view-cache.book118.com',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                              '(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
            }
            # request the picture url to obtain its binary content
            img_content = requests.get(url=img_url, headers=headers_1).content
            # file name: a running number so the pages sort correctly later
            img_name = str(content) + '.jpg'
            # save path
            filename = 'img\\'
            # save in binary mode (pictures, audio, video, etc.)
            with open(filename + img_name, mode='wb') as f:
                f.write(img_content)
Note:
1. Be sure to add a delay; otherwise the API will stop returning data for later requests.
2. When requesting an image URL, the headers must be written out in full; otherwise the saved images cannot be opened.
3. Name the files with sequential numbers, such as 1.jpg and 2.jpg, so they can be inserted into the Word document in order later (see the sketch after these notes).
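Note 3 matters because plain string sorting puts 10.jpg before 2.jpg. A minimal sketch of the numeric sort that the document-writing code below relies on:

files = ['10.jpg', '2.jpg', '1.jpg']

# Lexicographic order is wrong for numbered files
print(sorted(files))  # ['1.jpg', '10.jpg', '2.jpg']

# Sorting on the integer part restores page order
print(sorted(files, key=lambda name: int(name.replace('.jpg', ''))))
# ['1.jpg', '2.jpg', '10.jpg']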
The crawler part of the code is relatively simple; there is nothing particularly difficult in it.
Since these documents are crawled so that they can be printed or searched, the individual images are then saved into a Word document.
Write to the document
def save_picture():
    document = Document()
    path = './img/'
    lis = os.listdir(path)
    # collect the numeric part of every file name
    c = []
    for li in lis:
        index = li.replace('.jpg', '')
        c.append(index)
    # sort numerically so that page order is preserved
    clear1 = sorted(list(map(int, c)))
    print(clear1)
    new_files = [str(i) + '.jpg' for i in clear1]
    for num in new_files:
        img_path = path + num
        # insert each page image at roughly A4 size
        document.add_picture(img_path, width=Cm(17), height=Cm(24))
        os.remove(img_path)  # delete the locally saved image once it is embedded
    # python-docx writes the .docx format, so use a .docx extension
    document.save('tu.docx')

At this point, the study of "how Python crawls a website's document data" is over. Hopefully it has resolved your doubts; pairing theory with practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow this site, where the editor will keep working to bring you more practical articles.