2025-01-15 Update | From: SLTechnology News & Howtos, Shulou (Shulou.com), 06/02 Report
This article explains how to use the K-nearest-neighbor (KNN) algorithm in Python to defeat CSS dynamic font encryption. The method is simple, fast, and practical; let's walk through it step by step.
1. Font anti-crawling
Font anti-crawling works through a custom font with an encrypted character mapping: the page loads the custom font file to render its text, so the "text" in the HTML is no longer readable characters but font-specific code points. Copying the text, or scraping it naively, only yields those encoded values.
2. Font viewing software: FontCreator (optional to download; a web-based tool works just as well)
3. Fonts before and after CSS processing
The data we see on the rendered page looks normal.
But when we open the developer tools and inspect the elements, the amounts and box-office figures turn into what looks like garbled characters.
Checking the page source confirms that the data there differs from what is rendered, and the amounts are encrypted into different ciphertext on every request.
After many requests, we find that the returned font files rarely repeat (duplicates exist, but they are few).
4. Solution idea
Anyone familiar with CSS will know that @font-face lets web developers specify online fonts for their pages. It was originally intended to remove the dependence on fonts installed on the user's machine, but it has since gained a new use: font anti-crawling (see the MDN reference under .../CN/docs/Web/CSS/@font-face). With that in mind, look at the encoded data in the page source.
Looking closely, the data inside certain span elements has been processed, as shown in the figure below.
Looking up the class name of those spans reveals the font style they use.
The woff file is the font file; other formats exist as well, such as ttf, woff2 and svg, but only woff is used here. It can be viewed in the Font column of the developer tools.
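As a quick sketch of how that font file can be located programmatically, the snippet below pulls the .woff URL out of a style block with a regular expression. The HTML fragment and the domain in it are made up for illustration; the same regex pattern appears again in the crawling code later.

```python
import re

# Hypothetical page fragment containing an @font-face rule like the target site's
html = """
<style>
@font-face {
  font-family: stonefont;
  src: url('//vfile.example.net/colorstone/abc123.woff') format('woff');
}
</style>
"""

# Lazily capture up to 100 characters ending in ".woff" after url('
part_font_url = re.findall(r"url\('(.{,100}?\.woff)", html, re.S)
font_url = "https:" + part_font_url[0]
print(font_url)  # https://vfile.example.net/colorstone/abc123.woff
```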
Download the font and open it in the online font editor at https://font.qqe2.com/ to view the glyphs. Refreshing the page several times shows that the same digit is rendered by a different glyph each time.
Once again, compare one of the figures before and after processing. The number rendered on the page is 241.33, while the page source and the font file contain:

    HTML entities: &#xE290; &#xED17; &#xF1A7; &#xEFBD; &#xEFBD;
    Glyph names:   uniE290  uniED17  uniF1A7  uniEFBD  uniEFBD

The pattern is clear, but remember that the digit-to-glyph mapping is dynamic and changes with every request.
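The naming convention itself is mechanical: a glyph named uniE290 is referenced in the HTML source as the entity &#xe290;. A short sketch of that conversion, using the glyph names from the example above:

```python
# Convert fontTools glyph names to the HTML entities that appear in the page source
glyph_names = ["uniE290", "uniED17", "uniF1A7", "uniEFBD"]
entities = [g.lower().replace("uni", "&#x") + ";" for g in glyph_names]
print(entities)  # ['&#xe290;', '&#xed17;', '&#xf1a7;', '&#xefbd;']
```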
5. Use TTFont to convert the woff file into an XML file
First convert the font to an XML file:
    import requests
    from fontTools.ttLib import TTFont

    def woff_xml():
        url = "https://vfile.meituan.net/colorstone/167b59ea53b59e17be72018703b759c32284.woff"
        woff_dir = r"./colorstone/"
        file_name = url.split("/")[-1]
        xml_name = file_name.replace(file_name.split(".")[-1], "xml")
        save_woff = file_name
        save_xml = xml_name
        resp = requests.get(url=url)
        with open(woff_dir + save_woff, "wb") as f:
            f.write(resp.content)
        font = TTFont(woff_dir + save_woff)
        font.saveXML(woff_dir + save_xml)
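The filename handling in woff_xml() can be tried standalone: the woff name is taken from the last URL segment, and the extension is swapped for "xml".

```python
# Derive the local woff and xml filenames from the font URL
url = "https://vfile.meituan.net/colorstone/167b59ea53b59e17be72018703b759c32284.woff"
file_name = url.split("/")[-1]
xml_name = file_name.replace(file_name.split(".")[-1], "xml")
print(file_name)  # 167b59ea53b59e17be72018703b759c32284.woff
print(xml_name)   # 167b59ea53b59e17be72018703b759c32284.xml
```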
The converted data is shown in the figure:
After a closer look, the tags relevant to our fonts are <glyf> and <TTGlyph>, whose contents match the data shown in the figure above. Each <TTGlyph> carries attributes such as xMin, yMin, xMax, yMax and a list of points, which is clearly coordinate information: it defines the outline of the glyph. If you don't believe it, we can plot it:
    import re
    import matplotlib.pyplot as plt

    xml_str = """
    (paste the <pt x="..." y="..." on="..."/> lines of one TTGlyph from the XML here)
    """
    x = [int(i) for i in re.findall(r'x="(-?\d+)"', xml_str)]
    y = [int(i) for i in re.findall(r'y="(-?\d+)"', xml_str)]
    plt.scatter(x, y)
    plt.show()
6.1. Get 10 sets of base fonts:

    def get_base_fonts(self):
        '''
        Get 10 sets of base fonts; they were initially also saved as XML, but that later turned out to be unnecessary
        :return: None
        '''
        for i in range(0, 10):  # fetch 10 sets of fonts to use as base fonts
            time.sleep(1)
            res = requests.get(url=self.start_url, headers=self.headers, proxies=self.proxies)
            res.encoding = "utf-8"
            part_font_url = re.findall(r"url\('(.{,100}?\.woff)", res.text, re.S)
            # each request returns part of the font url
            if part_font_url:
                font_url = "https:" + part_font_url[0]
                file_name = str(i + 1) + ".woff"  # font file names: 1.woff, 2.woff, ...
                save_woff = file_name
                resp = requests.get(url=font_url, proxies=self.proxies)
                try:
                    with open(r"./colorstone/" + save_woff, "wb") as f:  # save the woff file
                        f.write(resp.content)
                    # font = TTFont(r"./colorstone/" + save_woff)
                    # font.saveXML(r"./colorstone/base" + str(i + 1) + ".xml")  # save as base1.xml, ...
                    print("Base font set {} saved!".format(i + 1))
                except Exception as e:
                    print(e)
            else:
                print("Request {} failed; check whether access to the site has been blocked.".format(i + 1))
6.2. Extract the numbers + coordinates from the sample font:
    def base_font(self):
        '''
        Get the x, y coordinate values of the digits in the 10 base fonts
        :return: num_coordinate
        '''
        # digit order read off manually from each of the 10 downloaded base fonts
        # (each row is a permutation of 0-9)
        base_num = [[3, 8, 9, 2, 0, 1, 7, 5, 4, 6], [3, 6, 5, 2, 4, 8, 9, 1, 7, 0],
                    [6, 0, 4, 8, 1, 9, 5, 2, 3, 7], [1, 8, 2, 5, 7, 9, 4, 6, 3, 0],
                    [0, 9, 8, 6, 1, 4, 7, 3, 2, 5], [9, 7, 5, 8, 3, 4, 6, 1, 2, 0],
                    [6, 5, 9, 4, 0, 2, 8, 3, 1, 7], [6, 5, 1, 0, 4, 7, 8, 2, 9, 3],
                    [0, 6, 9, 5, 3, 8, 4, 1, 2, 7], [0, 6, 2, 8, 5, 9, 4, 3, 1, 7]]
        num_coordinate = []
        for i in range(0, 10):
            woff_path = "./colorstone/" + str(i + 1) + ".woff"
            font = TTFont(woff_path)
            obj1 = font.getGlyphOrder()[2:]  # skip the first two non-digit glyphs
            for j, g in enumerate(obj1):
                coors = font['glyf'][g].coordinates
                coors = [_ for c in coors for _ in c]  # flatten the (x, y) pairs
                coors.insert(0, base_num[i][j])  # prepend the digit label
                num_coordinate.append(coors)
        return num_coordinate
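The coordinate flattening inside base_font() is worth seeing in isolation. fontTools glyph coordinates behave like a sequence of (x, y) pairs; the list comprehension turns them into one flat feature row, with the digit label prepended. The points below are made up for illustration:

```python
# Flatten a glyph's (x, y) outline points into one feature row
coors = [(120, 693), (96, 677), (77, 647)]   # made-up outline points
flat = [v for pt in coors for v in pt]
flat.insert(0, 3)  # prepend the digit label, as base_font() does
print(flat)  # [3, 120, 693, 96, 677, 77, 647]
```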
6.3. In the function knn(self):
6.3.1 Get the feature values and the target values:

    num_coordinate = self.base_font()
    data = pd.DataFrame(num_coordinate)
    data = data.fillna(value=0)
    x = data.drop([0], axis=1)
    y = data[0]

6.3.2 Split the data into a training set and a test set:

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

6.3.3 Call the KNN algorithm (n_neighbors was tuned with a grid search; the optimal value is 1):

    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(x_train, y_train)
    return knn, x.shape[1]  # presumably the feature width is also returned, for the padding done in get_map()
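To see why KNN fits this problem, here is a self-contained toy version of the classification step. The coordinate rows are made up, but they follow the same layout: the first element is the digit label, the rest are flattened x/y values. Glyphs of the same digit have slightly perturbed but similar outlines, so a 1-nearest-neighbor classifier separates them cleanly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Made-up coordinate rows: label first, then flattened outline coordinates
rows = [
    [3, 120, 693, 96, 677],
    [3, 122, 690, 95, 675],
    [8, 40, 12, 300, 410],
    [8, 42, 15, 298, 408],
] * 5  # repeat so the train/test split contains samples of each class

data = pd.DataFrame(rows).fillna(value=0)
x = data.drop([0], axis=1)   # features: the coordinates
y = data[0]                  # target: the digit label
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, y_train)
print(knn.score(x_test, y_test))  # 1.0 on this toy data
```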
6.4. Build the mapping, pairing each digit with its corresponding code in dictionary form:

    def get_map(self):
        font = TTFont("./colorstone/target.woff")
        glyf_order = font.getGlyphOrder()[2:]
        info = []
        for g in glyf_order:
            coors = font['glyf'][g].coordinates
            coors = [_ for c in coors for _ in c]
            info.append(coors)
        print(info)
        knn, length = self.knn()
        df = pd.DataFrame(info)
        # pad the rows with zero columns so their width matches the training data
        data = pd.concat([df, pd.DataFrame(np.zeros((df.shape[0], length - df.shape[1])),
                                           columns=range(df.shape[1], length))], axis=1)
        data = data.fillna(value=0)
        y_predict = knn.predict(data)
        num_uni_dict = {}
        for i, uni in enumerate(glyf_order):
            num_uni_dict[uni.lower().replace('uni', '&#x') + ';'] = str(y_predict[i])
        return num_uni_dict
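The final substitution that get_map() enables is plain string replacement: each private-use entity in the HTML is swapped for the digit the classifier assigned to its glyph. A sketch with a made-up mapping and HTML fragment:

```python
# Made-up mapping from HTML entities to predicted digits
num_uni_dict = {"&#xe290;": "2", "&#xed17;": "4", "&#xf1a7;": "1"}
html = '<p class="realtime">&#xe290;&#xed17;&#xf1a7; 万</p>'
for uni in num_uni_dict:
    html = html.replace(uni, num_uni_dict[uni])
print(html)  # <p class="realtime">241 万</p>
```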
6.5. Collect the data and replace the ciphertext to recover the real values.
Extract the data according to the structure of the web page:

    def get_info(self):
        res = requests.get(url=self.start_url, headers=self.headers)
        res.encoding = "utf-8"
        part_font_url = re.findall(r"url\('(.{,100}?\.woff)", res.text, re.S)
        # each request returns part of the font url
        if part_font_url:
            font_url = "https:" + part_font_url[0]
            resp = requests.get(url=font_url, proxies=self.proxies)
            with open(r"./colorstone/target.woff", "wb") as f:  # save the font file to be analysed
                f.write(resp.content)
        html = res.text
        map_dict = self.get_map()
        for uni in map_dict.keys():
            html = html.replace(uni, map_dict[uni])
        parse_html = etree.HTML(html)
        for i in range(1, 11):
            name = parse_html.xpath('//dd[{}]/p[@class="name"]/a/@title'.format(i))
            star = parse_html.xpath('//dd[{}]/p[@class="star"]/text()'.format(i))
            releasetime = parse_html.xpath('//dd[{}]/p[@class="releasetime"]/text()'.format(i))
            realtime_amount = parse_html.xpath('//dd[{}]/p[@class="realtime"]//text()'.format(i))
            total_amount = parse_html.xpath('//dd[{}]/p[@class="total-boxoffice"]//text()'.format(i))
            print("".join(name), "".join(star), "".join(releasetime),
                  "".join(realtime_amount).replace(" ", "").replace("\n", ""),
                  "".join(total_amount).replace(" ", ""))
Printing the result and comparing it with the original web page, the data match exactly. That completes the reversal of the dynamic font encryption.
At this point you should have a deeper understanding of how to use the K-nearest-neighbor algorithm to defeat CSS dynamic font encryption in Python. Give it a try in practice!