
How to Crack CSS Dynamic Font Encryption Anti-Crawling in Python with the K-Nearest Neighbor Algorithm


This article explains how to crack CSS dynamic font encryption anti-crawling in Python with the K-nearest neighbor algorithm. The method described here is simple, fast and practical, so interested readers may want to take a look. Let's work through it step by step.

1. Font anti-crawling

Font anti-crawling relies on a custom font file with an encrypted glyph mapping: the page loads the custom font to render its text, so what actually sits in the HTML is no longer plain text but the corresponding private font codes. Copying the page or scraping it naively therefore only yields the encoded characters, not the real content.
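To make this concrete, here is a minimal sketch (the codepoints and the class name are made up for illustration, not taken from a real page) of what the scraped HTML contains and how a recovered mapping turns it back into readable digits:

html_fragment = '<span class="stonefont">&#xe290;&#xed17;.&#xf1a7;</span>'  # what the scraper sees
mapping = {'&#xe290;': '2', '&#xed17;': '4', '&#xf1a7;': '1'}  # recovered by analysing the font file
for code, digit in mapping.items():
    html_fragment = html_fragment.replace(code, digit)
print(html_fragment)  # <span class="stonefont">24.1</span>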

2. Font viewing software: FontCreator can be downloaded, but it is optional; a web-based tool works just as well.

3. Fonts before and after CSS processing

The data rendered on the web page looks normal.

But when we open the developer tools and inspect the fonts, the amount and box-office figures turn into what looks like garbled characters.

Checking the page source confirms it: the data there is different from what is displayed, and the amounts are encrypted into different ciphertext on every request.

After many requests we also find that the returned font file rarely repeats (it does occasionally, but very seldom).

4. Solution idea

Anyone who knows CSS will know (I didn't) that CSS provides a @font-face rule that lets web developers specify online fonts for their pages. It was originally meant to remove the dependence on the fonts installed on the user's computer, but it has since gained a new use: font anti-crawling (documentation path: CN/docs/Web/CSS/@font-face). Looking at the page source, we then find the encoded data.

Looking closely, we find that the data inside certain specific span elements has been processed, as shown in the figure below.

So we look up the class name and find its font style.

The .woff file is the font file; other formats exist too, such as .ttf, .woff2 and .svg, but only .woff is used here. It can be viewed in the Font tab.

Download the font and open it in the online font editor https://font.qqe2.com/ to see the glyphs. Refreshing the page several times shows that the same digit does not keep the same code.
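A quick way to confirm this without the online editor is to load two woff files saved from two different page loads with fontTools and compare their glyph names (the file names below are just placeholders):

from fontTools.ttLib import TTFont

# Compare the glyph order of two downloads of the "same" font
for path in ["./colorstone/first.woff", "./colorstone/second.woff"]:
    font = TTFont(path)
    print(path, font.getGlyphOrder())  # the uniXXXX names and their order differ between downloads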

Once again, we compare some of the digits before and after processing:

Digits displayed on the page: 241.3
Entities in the page source: &#xE290; &#xED17; &#xF1A7; &#xEFBD; &#xEFBD;
Glyph names in the font file: uniE290 uniED17 uniF1A7 uniEFBD uniEFBD

The pattern is plain to see: the hexadecimal part of each glyph name matches the entity in the page source. But keep in mind that the position of every digit is dynamic and changes with each request.
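The glyph names in the font file and the entities in the page source differ only in their prefix, so a small transformation (the same one used later in get_map()) converts between the two forms:

glyph_name = "uniE290"                                   # name as it appears in the font file
entity = glyph_name.lower().replace("uni", "&#x") + ";"  # "&#xe290;", as it appears in the HTML source
print(entity)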

5. Use TTFont to convert the woff file into an XML file

First, convert the font into an XML file.

import requests
from fontTools.ttLib import TTFont

def woff_xml():
    url = "https://vfile.meituan.net/colorstone/167b59ea53b59e17be72018703b759c32284.woff"
    woff_dir = r"./colorstone/"
    file_name = url.split("/")[-1]
    xml_name = file_name.replace(file_name.split(".")[-1], "xml")
    save_woff = file_name
    save_xml = xml_name
    resp = requests.get(url=url)
    with open(woff_dir + save_woff, "wb") as f:  # save the downloaded woff file
        f.write(resp.content)
    font = TTFont(woff_dir + save_woff)
    font.saveXML(woff_dir + save_xml)
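One practical note on the helper above: the ./colorstone/ directory has to exist before the file is written, so a minimal run might look like this:

import os

os.makedirs("./colorstone", exist_ok=True)  # create the output directory if it is missing
woff_xml()                                  # download the woff and write the XML next to it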

The converted data is shown in the figure:

Taking a closer look, we can pick out the tags relevant to our fonts: <GlyphOrder> and <glyf>. The data inside these tags is what appears in the image above, so let's examine it:

Inside there are attributes such as xMin, yMin, xMax and yMax together with lists of points, which is clearly coordinate information: it determines the shape of each glyph. If you don't believe it, we can plot it:

import re
import matplotlib.pyplot as plt

# Paste the point data of one glyph (the contour section from the XML above) between the quotes
s = """
...paste the corresponding content here...
"""
# Pull the x and y attribute values out of the pasted points (patterns assumed)
x = [int(i) for i in re.findall(r'x="(-?\d+)"', s)]
y = [int(i) for i in re.findall(r'y="(-?\d+)"', s)]
plt.scatter(x, y)
plt.show()
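From section 6 onward the article shows methods that reference self.start_url, self.headers and self.proxies without showing the class they belong to. A possible skeleton they could hang off of is sketched below; the class name, the placeholder URL and the header value are assumptions rather than details from the article:

import re
import time

import numpy as np
import pandas as pd
import requests
from fontTools.ttLib import TTFont
from lxml import etree
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


class FontCrawler:
    """Assumed container class for the methods shown below."""

    def __init__(self):
        self.start_url = "https://example.com/target-page"  # replace with the page actually being scraped
        self.headers = {"User-Agent": "Mozilla/5.0"}         # a plain desktop user agent
        self.proxies = None                                  # or a {"http": ..., "https": ...} dict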

6.1. Download 10 sets of base fonts:

def get_base_fonts(self):  # method name assumed; only the body is given in the article
    '''
    Get 10 sets of base fonts (they were initially saved as XML as well, but that turned out to be unnecessary).
    :return: None
    '''
    for i in range(0, 10):  # fetch 10 sets of fonts to use as the base fonts
        time.sleep(1)
        res = requests.get(url=self.start_url, headers=self.headers, proxies=self.proxies)
        res.encoding = "utf-8"
        part_font_url = re.findall(r"url\('(.{,100}?\.woff)", res.text, re.S)
        # each request only yields part of the font url
        if part_font_url:
            font_url = "https:" + part_font_url[0]
            file_name = str(i + 1) + ".woff"  # font file name, e.g. 1.woff
            save_woff = file_name
            resp = requests.get(url=font_url, proxies=self.proxies)
            try:
                with open(r"./colorstone/" + save_woff, "wb") as f:  # save the woff file
                    f.write(resp.content)
                # font = TTFont(r"./colorstone/" + save_woff)
                # font.saveXML(r"./colorstone/base" + str(i + 1) + ".xml")  # save as base1.xml, base2.xml, ...
                print("Base font set {} saved!".format(i + 1))
            except Exception as e:
                print(e)
        else:
            print("Request {} failed; check whether access to the site has been blocked.".format(i + 1))

6.2. Extract the digits and their coordinates from the base fonts:

def base_font(self):
    '''
    Get the x and y values of the digits in the 10 sets of base fonts.
    :return: num_coordinate
    '''
    # Inspect the 10 sets of base fonts by hand to get the order of the digits in each set
    # base_num1 = [3, 8, 9, 2, 0, 1, 7, 5, 4, 6]
    # base_num2 = [3, 6, 5, 2, 4, 8, 9, 1, 7, 0]
    # base_num3 = [6, 0, 4, 8, 1, 9, 5, 2, 3, 7]
    # base_num4 = [1, 8, 2, 5, 7, 9, 4, 6, 3, 0]
    # base_num5 = [0, 9, 8, 6, 1, 4, 7, 3, 2, 5]
    # base_num6 = [9, 7, 5, 8, 3, 4, 6, 1, 2, 0]
    # base_num7 = [6, 5, 9, 4, 0, 2, 8, 3, 1, 7]
    # base_num8 = [6, 5, 1, 0, 4, 7, 8, 2, 9, 3]
    # base_num9 = [0, 6, 9, 5, 3, 8, 4, 1, 2, 7]
    # base_num10 = [0, 6, 2, 8, 5, 9, 5, 3, 1, 7]
    base_num = [[3, 8, 9, 2, 0, 1, 7, 5, 4, 6], [3, 6, 5, 2, 4, 8, 9, 1, 7, 0],
                [6, 0, 4, 8, 1, 9, 5, 2, 3, 7], [1, 8, 2, 5, 7, 9, 4, 6, 3, 0],
                [0, 9, 8, 6, 1, 4, 7, 3, 2, 5], [9, 7, 5, 8, 3, 4, 6, 1, 2, 0],
                [6, 5, 9, 4, 0, 2, 8, 3, 1, 7], [6, 5, 1, 0, 4, 7, 8, 2, 9, 3],
                [0, 6, 9, 5, 3, 8, 4, 1, 2, 7], [0, 6, 2, 8, 5, 9, 5, 3, 1, 7]]
    num_coordinate = []
    for i in range(0, 10):
        woff_path = "./colorstone/" + str(i + 1) + ".woff"
        font = TTFont(woff_path)
        obj1 = font.getGlyphOrder()[2:]  # skip the first two glyphs, which are not digits
        for j, g in enumerate(obj1):
            coors = font['glyf'][g].coordinates
            coors = [_ for c in coors for _ in c]  # flatten the (x, y) pairs into a single list
            coors.insert(0, base_num[i][j])  # prepend the digit label for this glyph
            num_coordinate.append(coors)
    return num_coordinate

6.3. Inside the knn(self) method:

6.3.1 Get the feature values and target values

num_coordinate = self.base_font()
data = pd.DataFrame(num_coordinate)
data = data.fillna(value=0)
x = data.drop([0], axis=1)
y = data[0]
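A side note on why fillna(value=0) is needed: different glyphs have different numbers of contour points, so the rows of num_coordinate have different lengths and pandas pads the shorter rows with NaN. A tiny standalone illustration:

import pandas as pd

rows = [[3, 10, 20, 30, 40], [8, 5, 15]]  # digit label followed by flattened coordinates, ragged lengths
df = pd.DataFrame(rows).fillna(value=0)   # the shorter row is padded out with zeros
print(df)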

6.3.2 Split the data into a training set and a test set

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

6.3.3 Call the KNN algorithm (the n_neighbors parameter was tuned with a grid search; the optimum is 1):

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, y_train)
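Putting the pieces of 6.3 together, the knn() method presumably looks something like the sketch below (inside the class skeleton assumed earlier); the returned length value is inferred from the way get_map() calls it in the next step:

def knn(self):
    num_coordinate = self.base_font()
    data = pd.DataFrame(num_coordinate)
    data = data.fillna(value=0)
    x = data.drop([0], axis=1)  # features: the flattened glyph coordinates
    y = data[0]                 # target: the digit labels
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(x_train, y_train)
    length = x.shape[1]         # number of feature columns; get_map() pads the target font to this width
    return knn, length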

6.4. Build a mapping between each digit and its code, in dictionary form:

def get_map(self):
    font = TTFont("./colorstone/target.woff")
    glyf_order = font.getGlyphOrder()[2:]
    info = []
    for g in glyf_order:
        coors = font['glyf'][g].coordinates
        coors = [_ for c in coors for _ in c]
        info.append(coors)
    print(info)
    knn, length = self.knn()
    df = pd.DataFrame(info)
    # pad the target font's rows with zero-filled columns so they match the training feature width
    data = pd.concat([df, pd.DataFrame(np.zeros((df.shape[0], length - df.shape[1])),
                                       columns=range(df.shape[1], length))], axis=1)
    data = data.fillna(value=0)
    y_predict = knn.predict(data)
    num_uni_dict = {}
    for i, uni in enumerate(glyf_order):
        num_uni_dict[uni.lower().replace('uni', '&#x') + ';'] = str(y_predict[i])
    return num_uni_dict
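The dictionary returned by get_map() maps each entity found in the page source to the digit its glyph actually draws; with the made-up codepoints from the earlier sketch it would look roughly like this (illustrative values only, since the real keys change on every request):

num_uni_dict = {'&#xe290;': '2', '&#xed17;': '4', '&#xf1a7;': '1'}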

6.5. Scrape and substitute the data to get the correct values:

Extract the data according to the structure of the web page:

def get_info(self):
    res = requests.get(url=self.start_url, headers=self.headers)
    res.encoding = "utf-8"
    part_font_url = re.findall(r"url\('(.{,100}?\.woff)", res.text, re.S)
    # each request only yields part of the font url
    if part_font_url:
        font_url = "https:" + part_font_url[0]
        resp = requests.get(url=font_url, proxies=self.proxies)
        with open(r"./colorstone/target.woff", "wb") as f:  # save the font file that needs to be decoded
            f.write(resp.content)
    html = res.text
    map_dict = self.get_map()
    for uni in map_dict.keys():
        html = html.replace(uni, map_dict[uni])
    parse_html = etree.HTML(html)
    for i in range(0, 11):
        name = parse_html.xpath('//dd[{}]/p[@class="name"]/a/@title'.format(i))
        star = parse_html.xpath('//dd[{}]/p[@class="star"]/text()'.format(i))
        releasetime = parse_html.xpath('//dd[{}]/p[@class="releasetime"]/text()'.format(i))
        realtime_amount = parse_html.xpath('//dd[{}]/p[@class="realtime"]//text()'.format(i))
        total_amount = parse_html.xpath('//dd[{}]/p[@class="total-boxoffice"]//text()'.format(i))
        print("".join(name), "".join(star), "".join(releasetime),
              "".join(realtime_amount).replace(" ", "").replace("\n", ""),
              "".join(total_amount).replace(" ", ""))

Print the result

Compare with the original web page

The data matches the original page exactly, and that completes the reverse engineering of the dynamic fonts.

At this point, you should have a deeper understanding of how to crack CSS dynamic font encryption anti-crawling in Python with the K-nearest neighbor algorithm, so why not try it out yourself? For more related content, follow us and keep learning!
