
How to use Python to clean up invalid websites in your favorites


This article introduces how to use Python to clean up invalid websites in your favorites. Many people run into this problem in daily use, so I have put together a simple, practical walkthrough. I hope it answers your questions about cleaning up dead bookmarks with Python. Let's get started!

Recently I opened a lot of sites I had bookmarked and found that many of them are no longer valid. A lot of content I meant to read but never got around to is now simply gone...

Dead bookmarks

When browsing the web, we come across something new from time to time and quietly click "bookmark" or "add to favorites". But faced with hundreds of bookmarks and favorites, it always gives us a headache...

Especially programming blogs that were updated yesterday, are dead today, and will never be updated again. Or the movie site that looked lively yesterday and returns a 404 today. With so many invalid pages, every time I open one I already know it is dead, and I still have to delete it by hand. Is that any way for a programmer to work?

However, whether it is Chrome or a domestic browser, at most it offers a backup/export service for favorites. So it has to be Python.

The exported favorites file format

There is very little direct support for favorites, mainly because they are hidden inside the browser; all we can do is export them manually as an htm file and manage that.

The content is fairly simple. I don't know much about the front end, but the tree structure and internal logic are easy to see.

Each bookmark line follows the same pattern: fixed markup, then the URL, more fixed markup, then the page name, then closing markup.

Regular-expression matching comes to mind naturally: each entry contains two substrings of interest. Extract them, visit them one by one to see which are invalid, delete those, and what remains is the cleaned-up favorites file.
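For reference, browsers export favorites in the old Netscape bookmark format, where each saved page sits on one line roughly like the made-up example below (the URL, date and title are invented for illustration):

<DT><A HREF="https://example.com/" ADD_DATE="1600000000">Example Site</A>

The folder structure around such lines is built from markup like <DL><p>, <DT><H3>...</H3> and </DL><p>.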

Read Favorite Files

import os

path = "C:\\Users\\XU\\Desktop"        # folder containing the exported file
fname = "bookmarks.html"               # the exported favorites file
os.chdir(path)
bookmarks_f = open(fname, "r+", encoding='UTF-8')
booklists = bookmarks_f.readlines()    # read the file line by line
bookmarks_f.close()
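If you prefer, the same read can be done with a context manager so the file is closed automatically even if an error occurs; a minimal equivalent sketch, reusing the fname defined above:

with open(fname, "r", encoding="UTF-8") as bookmarks_f:
    booklists = bookmarks_f.readlines()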

Not being familiar with the front end, I simply divide the exported favorites file into two kinds of lines:

structural markup (the lines that build the folder tree)

bookmark entries (the lines that actually save a web bookmark)

The structural markup must not be touched; it has to be kept intact. The bookmark entries are the lines we extract content from and then decide whether to keep or delete.

That is why readlines is used here: read the file line by line and judge each line separately.

Regular Matching

import re

pattern = r'HREF="(.*?)".*?>(.*?)</A>'   # group 1: the URL, group 2: the page name
while len(booklists) > 0:
    bookmark = booklists.pop(0)
    detail = re.search(pattern, bookmark)

If the line is a bookmark entry, the extracted substrings are in detail.group(1) (the URL) and detail.group(2) (the page name).

If the line is structural markup, detail == None.
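To make the distinction concrete, here is a minimal sketch; the two sample lines are made up, following the usual exported format:

import re

pattern = r'HREF="(.*?)".*?>(.*?)</A>'

bookmark_line = '<DT><A HREF="https://example.com/" ADD_DATE="1600000000">Example Site</A>'
struct_line = '<DL><p>'

print(re.search(pattern, bookmark_line).groups())   # ('https://example.com/', 'Example Site')
print(re.search(pattern, struct_line))              # None -> structural markup, keep as-is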

Access the Page

import requests

r = requests.get(detail.group(1), timeout=500)   # timeout is in seconds

After some trial and error, I found there are four cases:

r.status_code == requests.codes.ok (the page is reachable, keep it)

r.status_code == 404 (the page is gone, delete it)

r.status_code != 404 but the page still cannot be accessed normally (the crawler was probably blocked; better to keep it)

requests.exceptions.ConnectionError is raised (the site cannot be reached at all)

Sites like Zhihu and Jianshu basically all have anti-crawler measures, so a plain GET cannot access them properly. They are not worth the extra trouble, so just keep them. For connection errors, wrap the request in try/except to catch the exception, otherwise the program stops running.

With the logic added:

new_lists = []                                     # lines to keep for the cleaned-up file

while len(booklists) > 0:
    bookmark = booklists.pop(0)
    detail = re.search(pattern, bookmark)
    if detail:                                     # a bookmark entry
        # print(detail.group(1) + "---" + detail.group(2))
        try:
            # visit the page
            r = requests.get(detail.group(1), timeout=500)
            if r.status_code == requests.codes.ok:
                # reachable: keep it
                new_lists.append(bookmark)
                print("ok----kept: " + detail.group(1) + " " + detail.group(2))
            else:
                if r.status_code == 404:
                    # gone: delete it
                    print("inaccessible, deleted: " + detail.group(1) + " " + detail.group(2) + " error code " + str(r.status_code))
                else:
                    # other status codes (anti-crawler and the like): keep it
                    print("kept for other reasons: " + detail.group(1) + " " + detail.group(2) + " error code " + str(r.status_code))
                    new_lists.append(bookmark)
        except:
            # connection error: delete it
            print("inaccessible, deleted: " + detail.group(1) + " " + detail.group(2))
            # new_lists.append(bookmark)
    else:
        # no match: structural markup, keep as-is
        new_lists.append(bookmark)
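If you want the keep-or-delete decision in one place, the four cases above could also be factored into a small helper. This is only a sketch; the name should_keep is hypothetical and the status handling mirrors the list above:

import requests

def should_keep(url, timeout=500):
    """Return True if the bookmark should be kept, False if it should be deleted."""
    try:
        r = requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException:
        return False                  # cannot connect at all: delete
    if r.status_code == requests.codes.ok:
        return True                   # reachable: keep
    if r.status_code == 404:
        return False                  # page gone: delete
    return True                       # blocked crawler or other errors: keep to be safe

Inside the loop it would then read: if should_keep(detail.group(1)): new_lists.append(bookmark).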

Running the Program

Export htm

bookmarks_f = open('new_' + fname, "w+", encoding='UTF-8')   # write the cleaned-up favorites to a new file
bookmarks_f.writelines(new_lists)
bookmarks_f.close()

Import Browser

I imported the generated file back and actually applied it to my own browser.

At this point, the study of "how to use Python to clean up invalid websites in your favorites" is over. I hope it has cleared up your doubts; theory works best together with practice, so go and try it yourself! If you want to keep learning, please continue to follow the site, where more practical articles will be published.
