How Python Crawls a Website's Document Data
This article introduces how to crawl a website's document data with Python. Many people run into questions about this in daily practice, so the editor has consulted a variety of materials and put together a simple, easy-to-follow method that will hopefully answer them. Please follow along!
Preface
The text and pictures in this article come from the Internet and are intended for learning and communication only, not for any commercial use. If you have any questions, please contact us promptly so we can deal with them.
Basic development environment
Python 3.6
PyCharm
Related modules used:
import os
import requests
import time
import re
import json
from docx import Document
from docx.shared import Cm
Install Python, add it to the PATH environment variable, and use pip to install the required third-party modules.
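Concretely, the two third-party packages used here can be installed from the command line like this (requests handles the HTTP requests, python-docx provides the docx module for the Word output; os, time, re, and json ship with Python):

pip install requests python-docx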
Target Page Analysis
The document content on this website is stored as pictures, and the site exposes its own data interface for them.
Interface link:
https://openapi.book118.com/getPreview.html?&project_id=1&aid=272112230&t=f2c66902d6b63726d8e08b557fef90fb&view_token=SqX7ktrZ_ZakjDI@vcohcCwbn_PLb3C1&page=1&callback=jQuery18304186406662159248_1614492889385&_=1614492889486
Request parameters of the interface
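As a quick way to see what the interface expects, the query string of the link above can be unpacked with the standard library. This is a minimal sketch using only urllib.parse; the values are the ones from the example link:

from urllib.parse import urlsplit, parse_qs

link = ('https://openapi.book118.com/getPreview.html?&project_id=1&aid=272112230'
        '&t=f2c66902d6b63726d8e08b557fef90fb'
        '&view_token=SqX7ktrZ_ZakjDI@vcohcCwbn_PLb3C1&page=1'
        '&callback=jQuery18304186406662159248_1614492889385&_=1614492889486')

# Split the URL and decode each query parameter
for key, value in parse_qs(urlsplit(link).query).items():
    print(key, '=', value[0])
# Prints project_id, aid, t, view_token, page, callback, and _ (a timestamp)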
Overall approach
1. Request the web page and get the response data (a string).
2. Use the re module to match out the embedded data (a list) and take index 0 (a string) — see the sketch after this list.
3. Convert the extracted string into a Python dictionary with the json module.
4. Traverse the dictionary to get each picture's URL.
5. Save the pictures to a local folder.
6. Insert the pictures into a Word document.
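To make steps 2 and 3 concrete, here is a minimal sketch with a made-up jsonpReturn payload (the wrapper and the "data" field mirror what the interface returns; the sample values are invented for illustration):

import re
import json

# A shortened, made-up example of the JSONP the interface returns
text = 'jsonpReturn({"status": 6, "data": {"1": "//view-cache.book118.com/a.jpg"}});'

# Step 2: regex out the JSON between the jsonpReturn( ... ) wrapper
result = re.findall(r'jsonpReturn\((.*?)\)', text)[0]

# Step 3: parse the string into a Python dict and take the "data" field
json_data = json.loads(result)['data']
print(json_data)  # {'1': '//view-cache.book118.com/a.jpg'}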
Crawler code implementation
def download():
    content = 0
    # make sure the save folder exists before writing into it
    os.makedirs('img', exist_ok=True)
    # the interface returns 6 pages per request, hence the step of 6
    for page in range(1, 96, 6):
        # give a 2-second delay between requests
        time.sleep(2)
        # current timestamp in milliseconds
        now_time = int(time.time() * 1000)
        url = 'https://openapi.book118.com/getPreview.html'
        # request parameters
        params = {
            'project_id': '1',
            'aid': '272112230',
            't': 'f2c66902d6b63726d8e08b557fef90fb',
            'view_token': 'SqX7ktrZ_ZakjDI@vcohcCwbn_PLb3C1',
            'page': f'{page}',
            '_': now_time,
        }
        # request headers
        headers = {
            'Host': 'openapi.book118.com',
            'Referer': 'https://max.book118.com/html/2020/0427/8026036013002110.shtm',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
        }
        response = requests.get(url=url, params=params, headers=headers)
        # extract the JSON inside the jsonpReturn(...) wrapper with a regular expression
        result = re.findall(r'jsonpReturn\((.*?)\)', response.text)[0]
        # string -> json data
        json_data = json.loads(result)['data']
        # traverse the page-number -> image-url mapping
        for value in json_data.values():
            content += 1
            # splice the full picture url (the interface returns protocol-relative urls)
            img_url = 'http:' + value
            print(img_url)
            headers_1 = {
                'Host': 'view-cache.book118.com',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                              '(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
            }
            # request the picture url to obtain its binary content
            img_content = requests.get(url=img_url, headers=headers_1).content
            # file name: a running number so the pages sort correctly later
            img_name = str(content) + '.jpg'
            # save path
            filename = 'img\\'
            # save in binary mode (pictures, audio, video, etc.)
            with open(filename + img_name, mode='wb') as f:
                f.write(img_content)
Note:
1. Be sure to add a delay; otherwise the API will stop returning data for later requests.
2. When requesting an image URL, the headers must be written out in full; otherwise the saved images cannot be opened.
3. Name the files with sequential numbers, such as 1.jpg and 2.jpg, so they can be inserted into the Word document in order later (see the sketch after these notes).
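Note 3 matters because plain string sorting puts 10.jpg before 2.jpg. A minimal sketch of the numeric sort that the document-writing code below relies on:

files = ['10.jpg', '2.jpg', '1.jpg']

# Lexicographic order is wrong for numbered files
print(sorted(files))  # ['1.jpg', '10.jpg', '2.jpg']

# Sorting on the integer part restores page order
print(sorted(files, key=lambda name: int(name.replace('.jpg', ''))))
# ['1.jpg', '2.jpg', '10.jpg']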
The crawler part of the code is relatively simple; there is nothing particularly difficult in it.
Since these documents are crawled so that they can be printed or searched, the individual images are then saved into a Word document.
Write to the document
def save_picture():
    document = Document()
    path = './img/'
    lis = os.listdir(path)
    # collect the numeric part of every file name
    c = []
    for li in lis:
        index = li.replace('.jpg', '')
        c.append(index)
    # sort numerically so that page order is preserved
    clear1 = sorted(list(map(int, c)))
    print(clear1)
    new_files = [str(i) + '.jpg' for i in clear1]
    for num in new_files:
        img_path = path + num
        # insert each page image at roughly A4 size
        document.add_picture(img_path, width=Cm(17), height=Cm(24))
        os.remove(img_path)  # delete the locally saved image once it is embedded
    # python-docx writes the .docx format, so use a .docx extension
    document.save('tu.docx')

At this point, the study of "how Python crawls a website's document data" is over. Hopefully it has resolved your doubts; pairing theory with practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow this site, where the editor will keep working to bring you more practical articles.