How can Python crawl 58.com rental data and crack the site's font encryption? Many beginners are not clear about this, so to help you solve the problem this article explains the process in detail. Anyone who needs it is welcome to follow along; I hope you gain something from it.
[1] The idea of cracking the encrypted font
Press F12 to open the browser's developer tools and analyze the page. You will notice that wherever numbers appear on the site, they are displayed as garbled characters. This is font encryption. So how is font encryption implemented?
CSS has a @font-face rule that allows a web page to specify online fonts, that is, to introduce custom fonts. The rule was meant to remove the dependence on fonts installed on the visitor's computer, but many websites now also use it to implement anti-crawling.
On the right you can see the fonts used by the website. Most are common ones such as Microsoft YaHei and SimSun, but one is special: fangchan-secret. It is not hard to see that this must be 58.com's custom font.
To defeat the encrypted font, we must analyze its font file. The first step is to obtain that file, so go back to the page source and search for the fangchan-secret font information in it.
The highlighted blue part is the base64-encoded encrypted font string. We decode it into binary data and write it to a .woff font file, which can be done with the following code:
import requests
import base64

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
url = 'https://wh.58.com/chuzu/'
response = requests.get(url=url, headers=headers)
# Match the base64-encoded encrypted font string
base64_string = response.text.split("base64,")[1].split("'")[0].strip()
# Decode the base64-encoded font string into binary data
bin_data = base64.decodebytes(base64_string.encode())
# Save it as a font file
with open('58font.woff', 'wb') as f:
    f.write(bin_data)
After getting the font file, we can open it with the FontCreator software to see which code each character uses:
Look at the codes that appear in the page source: numeric character references such as &#x9fa4; and &#x9f92;
Compare them with the corresponding codes in the font file: names such as uni9FA4 and uni9F92
You can see that, apart from the first three characters, they are identical except for a difference in letter case.
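The relationship is easy to verify with a tiny illustrative snippet (not part of the original crawler code): strip the &#x...; wrapper from a numeric character reference in the page source and prepend uni to get the glyph name that FontCreator displays.

# Illustrative only: a numeric character reference from the page source
# maps to the uniXXXX glyph name shown in FontCreator.
entity = '&#x9fa4;'
print('uni' + entity[3:-1].upper())   # uni9FA4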
Now we might think: can't we simply replace each code with its corresponding number? Unfortunately, it is not that simple.
If you refresh the page, you will see that the base64-encoded encrypted font string changes; in other words, the codes and the numbers do not correspond one to one. Download a few more font files and you can confirm this by comparison.
Although the codes assigned to the numbers differ each time, the same ten codes always appear, so there must be some fixed relationship between the codes and the numbers. We can convert the font file into an xml file to observe that relationship; the conversion only requires extending the previous code:
import requests
import base64
from fontTools.ttLib import TTFont

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
url = 'https://wh.58.com/chuzu/'
response = requests.get(url=url, headers=headers)
# Match the base64-encoded encrypted font string
base64_string = response.text.split("base64,")[1].split("'")[0].strip()
# Decode the base64-encoded font string into binary data
bin_data = base64.decodebytes(base64_string.encode())
# Save it as a font file
with open('58font.woff', 'wb') as f:
    f.write(bin_data)
# Load the font file and convert it to an xml file
font = TTFont('58font.woff')
font.saveXML('58font.xml')
Open the 58font.xml file and analyze it. Inside the map tags you can see familiar codes such as 0x9476 and 0x958f, whose last four characters are exactly the encrypted codes used by the web font. Each code corresponds to a name that begins with glyph.
Comparing this with the 58font.woff file, we can see that the code 0x958f corresponds to the digit 3, and that its name is glyph00004.
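If you prefer not to read the xml by hand, the same code-to-name mapping can also be printed directly with fontTools. This is a small sketch added here for convenience (it assumes 58font.woff has already been saved by the earlier code):

from fontTools.ttLib import TTFont

# Print the cmap of the saved font: unicode codepoint -> glyph name,
# e.g. 0x958f -> glyph00004, which as noted above stands for the digit 3.
font = TTFont('58font.woff')
for code, name in font['cmap'].getBestCmap().items():
    print(hex(code), '->', name)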
Let's get a font file again for comparative analysis.
At this point we know the relationship between the codes and the digits. Next, for each code we look up its name in the xml file (the value beginning with glyph), translate that name back into the corresponding digit, and replace the codes in the page source. We can then successfully extract the information we need!
Summarizing, the general idea for cracking an encrypted font is:
Analyze the web page and find the corresponding encrypted font file
If the referenced encrypted font is a base64-encoded string, convert it to binary data and save it as a .woff font file
Convert the font file to an xml file
Use the FontCreator software to inspect the font file and, combined with the xml file, work out the relationship between the codes and the real characters
Once the relationship between the codes and the characters is clear, replace the codes in the page with the normal characters
[2] Mind map
[3] Encrypted font processing module

# Get the font file and convert it to an xml file
def get_font(page_url, page_num):
    response = requests.get(url=page_url, headers=headers)
    # Match the base64-encoded encrypted font string
    base64_string = response.text.split("base64,")[1].split("'")[0].strip()
    # print(base64_string)
    # Decode the base64-encoded font string into binary data
    bin_data = base64.decodebytes(base64_string.encode())
    # Save it as a font file
    with open('58font.woff', 'wb') as f:
        f.write(bin_data)
    print('Visit ' + str(page_num) + ': font file saved successfully!')
    # Load the font file and convert it to an xml file
    font = TTFont('58font.woff')
    font.saveXML('58font.xml')
    print('Successfully converted the font file to an xml file!')
    return response.text
The main function passes in the url to request. The string split() method matches the base64-encoded encrypted font string, base64.decodebytes() decodes it into binary data that is saved as a font file, and the fontTools library converts the font file into an xml file.
# Match the encrypted font codes with the real digits
def find_font():
    # Digits corresponding to the names beginning with glyph
    glyph_list = {
        'glyph00001': '0', 'glyph00002': '1', 'glyph00003': '2', 'glyph00004': '3', 'glyph00005': '4',
        'glyph00006': '5', 'glyph00007': '6', 'glyph00008': '7', 'glyph00009': '8', 'glyph00010': '9'
    }
    # The ten encrypted font codes
    unicode_list = ['0x9476', '0x958f', '0x993c', '0x9a4b', '0x9e3a', '0x9ea3', '0x9f64', '0x9f92', '0x9fa4', '0x9fa5']
    num_list = []
    # Use xpath syntax to match the contents of the xml file
    font_data = etree.parse('./58font.xml')
    for unicode in unicode_list:
        # Look up the name corresponding to each code in the xml file
        result = font_data.xpath("//cmap//map[@code='{}']/@name".format(unicode))[0]
        # print(result)
        # Loop over the dictionary keys; if the name matches a key, take its value
        for key in glyph_list.keys():
            if key == result:
                num_list.append(glyph_list[key])
    print('The digits corresponding to the codes have been found successfully!')
    # print(num_list)
    # Return the list of digits
    return num_list
From the previous analysis we know that each name value (the code beginning with glyph) corresponds to a fixed digit: glyph00001 corresponds to 0, glyph00002 to 1, and so on, so we can build the dictionary glyph_list.
We also build a list of the ten codes (the encrypted font codes such as 0x9476).
We then loop over the ten codes, look up the name value of each in the xml file, and compare that name with the keys of the dictionary. When they match, the key's value is appended, producing the list num_list, whose elements are the true digits of the encrypted fonts, in the same order as unicode_list.
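As a quick sanity check of this mapping for a single code, here is an illustrative sketch (it assumes 58font.xml has already been generated and relies on the glyph0000N naming pattern observed above):

from lxml import etree

# Look up one code in the xml file and convert its glyph name to the digit:
# 0x958f -> glyph00004 -> 3 (glyph0000N stands for the digit N-1).
font_data = etree.parse('./58font.xml')
name = font_data.xpath("//cmap//map[@code='0x958f']/@name")[0]
print(name, '->', int(name[-5:]) - 1)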
# Replace all encrypted font codes in the web page
def replace_font(num, page_response):
    # 9476 958F 993C 9A4B 9E3A 9EA3 9F64 9F92 9FA4 9FA5
    result = page_response.replace('\u9476', num[0]).replace('\u958f', num[1]).replace('\u993c', num[2]) \
        .replace('\u9a4b', num[3]).replace('\u9e3a', num[4]).replace('\u9ea3', num[5]).replace('\u9f64', num[6]) \
        .replace('\u9f92', num[7]).replace('\u9fa4', num[8]).replace('\u9fa5', num[9])
    print('All encrypted font codes have been replaced successfully!')
    return result
The list of real digits obtained by find_font() in the previous step is passed in, and the replace() method replaces the ten encrypted font codes one by one.
[4] Rental information extraction module

# Extract rental information
def parse_pages(pages):
    num = 0
    soup = BeautifulSoup(pages, 'lxml')
    # Find the li tags
    all_house = soup.find_all('li', class_='house-cell')
    for house in all_house:
        # Title
        title = house.find('a', class_='strongbox').text.strip()
        # print(title)
        # Price
        price = house.find('div', class_='money').text.strip()
        # print(price)
        # Layout and area
        layout = house.find('p', class_='room').text.replace(' ', '')
        # print(layout)
        # Property and address
        address = house.find('p', class_='infor').text.replace(' ', '').replace('\n', '')
        # print(address)
        # Listed by a broker
        if house.find('div', class_='jjr'):
            agent = house.find('div', class_='jjr').text.replace(' ', '').replace('\n', '')
        # Listed by a branded apartment
        elif house.find('p', class_='gongyu'):
            agent = house.find('p', class_='gongyu').text.replace(' ', '').replace('\n', '')
        # Listed by an individual landlord
        else:
            agent = house.find('p', class_='geren').text.replace(' ', '').replace('\n', '')
        # print(agent)
        data = [title, price, layout, address, agent]
        save_to_mysql(data)
        num += 1
        print('Item ' + str(num) + ' crawled; pausing for 3 seconds!')
        time.sleep(3)
With the BeautifulSoup parsing library it is easy to extract the relevant information. Note that listings come from three sources: brokers, branded apartments and individual landlords. The element nodes of the three differ, so each must be matched separately.
[5] MySQL data storage module

[5.1] Create the table in the MySQL database

# Create the MySQL table: 58tc_data
def create_mysql_table():
    db = pymysql.connect(host='localhost', user='root', password='000000', port=3306, db='58tc_spiders')
    cursor = db.cursor()
    sql = 'CREATE TABLE IF NOT EXISTS 58tc_data (title VARCHAR(255) NOT NULL, price VARCHAR(255) NOT NULL, layout VARCHAR(255) NOT NULL, address VARCHAR(255) NOT NULL, agent VARCHAR(255) NOT NULL)'
    cursor.execute(sql)
    db.close()
First, the database is specified as 58tc_spiders. It needs to be created in advance, either with a MySQL statement or manually through MySQL Workbench (a small sketch for doing this from Python is given below).
Then a SQL statement creates the table 58tc_data, which contains five fields: title, price, layout, address and agent, all of type VARCHAR.
The table can also be created manually in advance, in which case this function is not needed.
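For reference, the database itself can also be created from Python. The following is a minimal sketch (not part of the original code; it assumes the same local MySQL credentials used in create_mysql_table):

import pymysql

# Create the 58tc_spiders database once, before running the crawler.
db = pymysql.connect(host='localhost', user='root', password='000000', port=3306)
cursor = db.cursor()
cursor.execute('CREATE DATABASE IF NOT EXISTS 58tc_spiders DEFAULT CHARACTER SET utf8mb4')
db.close()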
[5.2] Save data to the MySQL database

# Store data in the MySQL database
def save_to_mysql(data):
    db = pymysql.connect(host='localhost', user='root', password='000000', port=3306, db='58tc_spiders')
    cursor = db.cursor()
    sql = 'INSERT INTO 58tc_data (title, price, layout, address, agent) values (%s, %s, %s, %s, %s)'
    try:
        cursor.execute(sql, (data[0], data[1], data[2], data[3], data[4]))
        db.commit()
    except:
        db.rollback()
    db.close()
The commit() method is what actually submits the INSERT statement to the database for execution and inserts the data. The try...except statement handles exceptions: if execution fails, rollback() is called to roll back the transaction and ensure the original data is not damaged.
[6] Complete code

import requests
import time
import random
import base64
import pymysql
from lxml import etree
from bs4 import BeautifulSoup
from fontTools.ttLib import TTFont

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}

# Get the font file and convert it to an xml file
def get_font(page_url, page_num):
    response = requests.get(url=page_url, headers=headers)
    # Match the base64-encoded encrypted font string
    base64_string = response.text.split("base64,")[1].split("'")[0].strip()
    # print(base64_string)
    # Decode the base64-encoded font string into binary data
    bin_data = base64.decodebytes(base64_string.encode())
    # Save it as a font file
    with open('58font.woff', 'wb') as f:
        f.write(bin_data)
    print('Visit ' + str(page_num) + ': font file saved successfully!')
    # Load the font file and convert it to an xml file
    font = TTFont('58font.woff')
    font.saveXML('58font.xml')
    print('Successfully converted the font file to an xml file!')
    return response.text

# Match the encrypted font codes with the real digits
def find_font():
    # Digits corresponding to the names beginning with glyph
    glyph_list = {
        'glyph00001': '0', 'glyph00002': '1', 'glyph00003': '2', 'glyph00004': '3', 'glyph00005': '4',
        'glyph00006': '5', 'glyph00007': '6', 'glyph00008': '7', 'glyph00009': '8', 'glyph00010': '9'
    }
    # The ten encrypted font codes
    unicode_list = ['0x9476', '0x958f', '0x993c', '0x9a4b', '0x9e3a', '0x9ea3', '0x9f64', '0x9f92', '0x9fa4', '0x9fa5']
    num_list = []
    # Use xpath syntax to match the contents of the xml file
    font_data = etree.parse('./58font.xml')
    for unicode in unicode_list:
        # Look up the name corresponding to each code in the xml file
        result = font_data.xpath("//cmap//map[@code='{}']/@name".format(unicode))[0]
        # print(result)
        # Loop over the dictionary keys; if the name matches a key, take its value
        for key in glyph_list.keys():
            if key == result:
                num_list.append(glyph_list[key])
    print('The digits corresponding to the codes have been found successfully!')
    # print(num_list)
    # Return the list of digits
    return num_list

# Replace all encrypted font codes in the web page
def replace_font(num, page_response):
    # 9476 958F 993C 9A4B 9E3A 9EA3 9F64 9F92 9FA4 9FA5
    result = page_response.replace('\u9476', num[0]).replace('\u958f', num[1]).replace('\u993c', num[2]) \
        .replace('\u9a4b', num[3]).replace('\u9e3a', num[4]).replace('\u9ea3', num[5]).replace('\u9f64', num[6]) \
        .replace('\u9f92', num[7]).replace('\u9fa4', num[8]).replace('\u9fa5', num[9])
    print('All encrypted font codes have been replaced successfully!')
    return result

# Extract rental information
def parse_pages(pages):
    num = 0
    soup = BeautifulSoup(pages, 'lxml')
    # Find the li tags
    all_house = soup.find_all('li', class_='house-cell')
    for house in all_house:
        # Title
        title = house.find('a', class_='strongbox').text.strip()
        # print(title)
        # Price
        price = house.find('div', class_='money').text.strip()
        # print(price)
        # Layout and area
        layout = house.find('p', class_='room').text.replace(' ', '')
        # print(layout)
        # Property and address
        address = house.find('p', class_='infor').text.replace(' ', '').replace('\n', '')
        # print(address)
        # Listed by a broker
        if house.find('div', class_='jjr'):
            agent = house.find('div', class_='jjr').text.replace(' ', '').replace('\n', '')
        # Listed by a branded apartment
        elif house.find('p', class_='gongyu'):
            agent = house.find('p', class_='gongyu').text.replace(' ', '').replace('\n', '')
        # Listed by an individual landlord
        else:
            agent = house.find('p', class_='geren').text.replace(' ', '').replace('\n', '')
        # print(agent)
        data = [title, price, layout, address, agent]
        save_to_mysql(data)
        num += 1
        print('Item ' + str(num) + ' crawled; pausing for 3 seconds!')
        time.sleep(3)

# Create the MySQL table: 58tc_data
def create_mysql_table():
    db = pymysql.connect(host='localhost', user='root', password='000000', port=3306, db='58tc_spiders')
    cursor = db.cursor()
    sql = 'CREATE TABLE IF NOT EXISTS 58tc_data (title VARCHAR(255) NOT NULL, price VARCHAR(255) NOT NULL, layout VARCHAR(255) NOT NULL, address VARCHAR(255) NOT NULL, agent VARCHAR(255) NOT NULL)'
    cursor.execute(sql)
    db.close()

# Store data in the MySQL database
def save_to_mysql(data):
    db = pymysql.connect(host='localhost', user='root', password='000000', port=3306, db='58tc_spiders')
    cursor = db.cursor()
    sql = 'INSERT INTO 58tc_data (title, price, layout, address, agent) values (%s, %s, %s, %s, %s)'
    try:
        cursor.execute(sql, (data[0], data[1], data[2], data[3], data[4]))
        db.commit()
    except:
        db.rollback()
    db.close()

if __name__ == '__main__':
    create_mysql_table()
    print('MySQL table 58tc_data created successfully!')
    for i in range(1, 71):
        url = 'https://wh.58.com/chuzu/pn' + str(i) + '/'
        response = get_font(url, i)
        num_list = find_font()
        pro_pages = replace_font(num_list, response)
        parse_pages(pro_pages)
        print('Page ' + str(i) + ' data crawling complete!')
        time.sleep(random.randint(3, 60))
    print('All data crawled!')