
Crawling a face recognition image dataset with Python / Python image crawler


For a long time I have been selling large volumes of Weibo data and travel-site review data, and I also offer custom data crawling services. Message me at YuboonaZhang@Yahoo.com. You are also welcome to join the social media data exchange group: 99918768.

Preface

Recently I have been studying face recognition with machine learning. Machine learning is a bit brute-force: how well it works depends largely on the amount of training data. To find data, I browsed several well-known datasets with the guidance of a blog post.

Several of the large datasets have to be requested by email, several small ones can be downloaded directly from links on their web pages, and the Pubfig dataset provides links to a large number of images so that we can write our own program to download them.

After weighing how much data I needed, I finally chose the Pubfig dataset and wrote a Python image acquisition program, using both the urllib and requests approaches.

Analyzing the download files provided by Pubfig

One data file lists all the people who appear in the dataset.

The other data file provides each person's image urls.

As you can see, processing this dataset is actually very simple: read the file into a list with readlines and split each line on whitespace to extract the urls.
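For example, here is a minimal sketch of that parsing step. The sample line below is made up for illustration; the real Pubfig urls file puts the url in a fixed whitespace-separated column, so adjust the indexing to the actual format.

# Illustrative sketch only: the sample line is hypothetical, but it mirrors the
# whitespace-separated layout described above, with the url as the second field.
sample = '42 http://example.com/miley_1.jpg 60,80,220,240 8f3a9c0d'
fields = sample.split()
url = fields[1]
print(url)  # -> http://example.com/miley_1.jpg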

Processing the urls file

The urls sit in the middle of each line of the file, so I wrote a short script to extract them into a separate file that is easier to use.

I extracted the Miley_Cyrus portion separately and put it into a txt file.

pic_url = []
with open('./Miley_Cyrus.txt') as f:
    for i in f.readlines():
        pic_url.append(i.strip('\r\n'))

urls = []
for s in pic_url:
    _, url, _, _ = s.split()
    urls.append(url)

# write the extracted urls to a file
with open('url.data', 'w') as f:
    for i in urls:
        f.write(i)
        f.write('\n')

Crawl the images from the urls

1. urllib method

import urllib.request as request
import socket
import os

# create a new folder img3 in the same directory
os.mkdir('./img3')

# add a header to the request
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
headers = ('User-Agent', user_agent)
opener = request.build_opener()
opener.addheaders = [headers]
request.install_opener(opener)

# set a timeout so bad pictures cannot block the download for too long
timeout = 20
socket.setdefaulttimeout(timeout)

# read the urls
urls = []
with open('./url.data') as f:
    for i in f.readlines():
        if i != '':
            urls.append(i)

# fetch all the pictures through urllib's urlretrieve
count = 1
bad_url = []
for url in urls:
    url = url.rstrip('\n')
    print(url)
    try:
        pic = request.urlretrieve(url, './img3/%d.jpg' % count)
        print('pic %d' % count)
        count += 1
    except Exception as e:
        print(Exception, ':', e)
        bad_url.append(url)
    print('\n')
print('got all photos that can be got')

# save the urls that could not be fetched
with open('bad_url3.data', 'w') as f:
    for i in bad_url:
        f.write(i)
        f.write('\n')
print('saved bad urls')

2. requests method

import requests
import socket
import os

# create a new folder img2 in the same directory to store the images
os.mkdir('./img2')

# set a timeout so bad pictures cannot block the download for too long
timeout = 20
socket.setdefaulttimeout(timeout)

# read the urls
urls = []
with open('./url.data') as f:
    for i in f.readlines():
        if i != '':
            urls.append(i)

# add a header to the request and fetch the pictures
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
headers = {'User-Agent': user_agent}
bad_url = []
count = 1
for url in urls:
    url = url.rstrip('\n')
    print(url)
    try:
        pic = requests.get(url, headers=headers)
        with open('./img2/%d.jpg' % count, 'wb') as f:
            f.write(pic.content)
            f.flush()
        print('pic %d' % count)
        count += 1
    except Exception as e:
        print(Exception, ':', e)
        bad_url.append(url)
    print('\n')
print('got all photos that can be got')

# save the bad links
with open('bad_url.data', 'w') as f:
    for i in bad_url:
        f.write(i)
        f.write('\n')
print('saved bad urls')
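The scripts above only record links that raise an exception; a dead link that still returns a page will be saved as a broken image file. As an optional check (not part of the original scripts, and assuming the images were saved to ./img2 by the requests version above), here is a minimal sketch that flags saved files which do not start with the JPEG magic bytes:

import os

# Optional post-download check (an assumption, not part of the original scripts):
# flag files in ./img2 that do not begin with the JPEG magic bytes 0xFFD8,
# since dead links often return an HTML error page instead of an image.
img_dir = './img2'
suspect = []
for name in os.listdir(img_dir):
    with open(os.path.join(img_dir, name), 'rb') as f:
        head = f.read(2)
    if head != b'\xff\xd8':
        suspect.append(name)
print('possibly broken files:', suspect)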

Personal blog: 8aoy1.cn
