In this issue, the editor brings you an example analysis of whether data crawled from Jiayuan with Python can prove the site unreliable. The article is rich in content and analyzes the question from a professional point of view. I hope you get something out of it after reading.
Preface
Today I saw a question on Zhihu about how reliable Jiayuan (the Chinese dating site) is for finding a partner. The question had 1,903 followers and 1,940,753 views, and most of the answers said it was unreliable. Can crawling Jiayuan's data with Python prove that it is unreliable?
I. Crawling the data
After paging through a few result pages, I found the request behind them: search_v2.php. The return value is an irregular JSON string that includes the nickname, gender, whether the profile matches, the match conditions, and so on.
Sending GET requests to that URL with the right query parameters, I crawled 10,000 pages of data, 240,116 rows in total.
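Before writing the full crawler, it helps to look at one raw response. A minimal sketch (the user-agent and cookie below are illustrative placeholders; the real values appear in the full script further down, and the cookie must come from your own logged-in session):

import requests

url = ('http://search.jiayuan.com/v2/search_v2.php?key=&sex=f'
       '&stc=23:1,2:20.30&sn=default&sv=1&p=1&f=select&listStyle=bigPhoto')
headers = {'user-agent': 'Mozilla/5.0', 'Cookie': 'PHPSESSID=<your session id>'}
r = requests.get(url, headers=headers)
print(r.text[:200])  # not bare JSON: the payload arrives wrapped in a sentinel string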
Besides requests, the only module that needs to be installed is openpyxl (pip install openpyxl); its ILLEGAL_CHARACTERS_RE is used to filter out special characters.
# coding:utf-8
import csv
import json

import requests
from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE

line_index = 0

def fetchURL(url):
    headers = {
        'accept': '*/*',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/86.0.4240.75 Safari/537.36',
        # Cookie from a logged-in session; replace with your own.
        'Cookie': 'guider_quick_search=on; accessID=20201021004216238222; '
                  'PHPSESSID=11117cc60f4dcafd131b69d542987a46; is_searchv2=1; '
                  'SESSION_HASH=8f93eeb87a87af01198f418aa59bccad9dbe5c13; '
                  'user_access=1; Qs_lvt_336351=1603457224; '
                  'Qs_pv_336351=4391272815204901400%2C3043552944961503700'
    }
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    # Round-trip through GBK to drop characters the GBK codec cannot represent.
    return r.text.encode('gbk', 'ignore').decode('gbk', 'ignore')

def parseHtml(html):
    # The endpoint wraps its JSON in a ##jiayser##//...##jiayser## sentinel; strip it.
    html = html.replace('##jiayser##', '').replace('//', '')
    # Remove characters that openpyxl/Excel consider illegal.
    html = ILLEGAL_CHARACTERS_RE.sub(r'', html)
    s = json.loads(html, strict=False)
    global line_index
    userInfo = []
    for key in s['userInfo']:
        line_index = line_index + 1
        a = (key['uid'], key['nickname'], key['age'], key['work_location'],
             key['height'], key['education'], key['matchCondition'],
             key['marriage'], key['shortnote'].replace('\n', ''))
        userInfo.append(a)
    with open('sjjy.csv', 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(userInfo)

if __name__ == '__main__':
    for i in range(1, 10000):
        url = ('http://search.jiayuan.com/v2/search_v2.php?key=&sex=f'
               '&stc=23:1,2:20.30&sn=default&sv=1&p=' + str(i)
               + '&f=select&listStyle=bigPhoto')
        html = fetchURL(url)
        print(str(i) + ' page ' + str(len(html)) + ' ' + '*' * 20)
        parseHtml(html)

II. Deduplication
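One practical caveat that is not in the original script: firing 10,000 requests back to back can trip rate limiting or invalidate the session. A hedged variant of the main loop, assuming the fetchURL and parseHtml functions defined above, pauses between pages and skips individual failures:

import time
import requests

for i in range(1, 10000):
    url = ('http://search.jiayuan.com/v2/search_v2.php?key=&sex=f'
           '&stc=23:1,2:20.30&sn=default&sv=1&p=' + str(i)
           + '&f=select&listStyle=bigPhoto')
    try:
        html = fetchURL(url)              # defined in the script above
    except requests.RequestException as e:
        print('page ' + str(i) + ' failed: ' + str(e))
        continue
    parseHtml(html)
    time.sleep(0.5)                       # pause between requests; adjust to taste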
When I went to deduplicate the data, I found a huge amount of repetition. At first I thought my code had a bug, but after hunting for it for a long time I found that the site only serves about 100 pages of distinct data; beyond that, the pages repeat. The two screenshots below show the data on page 110 and page 111.
[Figure: data on page 110]
[Figure: data on page 111]
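The repetition is easy to confirm in code. A minimal sketch, assuming the fetchURL helper above and that uid uniquely identifies a profile:

import json
from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE

def page_uids(page):
    # Fetch one result page and return the set of uids on it.
    url = ('http://search.jiayuan.com/v2/search_v2.php?key=&sex=f'
           '&stc=23:1,2:20.30&sn=default&sv=1&p=' + str(page)
           + '&f=select&listStyle=bigPhoto')
    html = fetchURL(url).replace('##jiayser##', '').replace('//', '')
    html = ILLEGAL_CHARACTERS_RE.sub(r'', html)
    return {key['uid'] for key in json.loads(html, strict=False)['userInfo']}

print(page_uids(110) == page_uids(111))  # prints True if the pages carry the same profiles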
After filtering out the duplicates, only 1,872 records were left, so the 240,116 rows were mostly padding.
def filterData():
    filter = []
    csv_reader = csv.reader(open('sjjy.csv', encoding='gbk'))
    i = 0
    for row in csv_reader:
        i = i + 1
        print('processing line ' + str(i))
        if row[0] not in filter:          # row[0] is the uid
            filter.append(row[0])
    print(len(filter))                    # number of unique records
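A side note on the deduplication above: the test "row[0] not in filter" scans a Python list, which is O(n) per row and slow over 240,116 rows. A minimal equivalent using a set (my variant, not the original author's) does the same count in roughly linear time:

import csv

def filterData_fast():
    seen = set()                          # set membership is O(1); list is O(n)
    with open('sjjy.csv', encoding='gbk') as f:
        for row in csv.reader(f):
            seen.add(row[0])              # row[0] is the uid column
    print(len(seen))                      # number of unique profiles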
The above is the editor's example analysis of whether crawling Jiayuan's data with Python can prove the site unreliable. If you happen to have similar doubts, you may want to refer to the analysis above. If you want to learn more, you are welcome to follow the industry information channel.