How to import CDN log files into MySQL for analysis based on Python

This article explains how to import CDN log files into MySQL for analysis with Python. The method is simple, fast, and practical, so let's walk through it step by step.
I. Background
There was a large number of CDN requests over the weekend. We need to analyze the frequency and source of the requests and check whether the site is being attacked.
This article uses Alibaba Cloud CDN logs as the data source; other cloud platforms are much the same. The console provides offline log files for download.
II. Requirements
The log example is as follows:
[9/Jun/2015:01:58:09 +0800] 10.10.10.10 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://example.com/robot/)" "text/html"
The relevant fields are explained as follows:

[9/Jun/2015:01:58:09 +0800]: log start time.
10.10.10.10: client IP.
-: proxy IP ("-" means none).
1542: request response time, in milliseconds.
"-": the Referer in the HTTP request header.
GET: request method.
http://www.aliyun.com/index.html: the URL requested by the user.
200: HTTP status code.
191: request size, in bytes.
2830: response size, in bytes.
MISS: cache hit status.
  HIT: the user request hit a resource cached on a CDN edge node (no origin fetch required).
  MISS: the requested content was not cached on the CDN edge node and had to be fetched upstream (from a CDN L2 node or the origin server).
Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://example.com/robot/): the User-Agent request header.
text/html: file type.
Following the field descriptions above, create a MySQL table so the data can later be imported through Python. The column definitions are up to you.
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- Table structure for ll
DROP TABLE IF EXISTS `ll`;
CREATE TABLE `ll` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `s_time` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `ip` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `pro_ip` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `dura_time` int(11) NULL DEFAULT NULL,
  `referer` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `method` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `url` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `code` int(11) NULL DEFAULT NULL,
  `size` double NULL DEFAULT NULL,
  `res_size` double NULL DEFAULT NULL,
  `miss` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `ua` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `html_type` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = MyISAM CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;
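If the imported table grows large, the grouping queries in section III will scan every row. As an optional addition (my own suggestion, not part of the original dump), secondary indexes on the columns you group by can speed those queries up:

-- optional: helps GROUP BY referer / GROUP BY ip on large tables
ALTER TABLE `ll` ADD INDEX `idx_referer` (`referer`);
ALTER TABLE `ll` ADD INDEX `idx_ip` (`ip`);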
After downloading all the logs, use Python to import them into the database in batches. Before writing the parser, look closely at a row of data to see what needs to be extracted:
[11/Mar/2022:00:34:17 +0800] 118.181.139.215 - 1961 "http://xx.baidu.cn/" "GET https://cdn.baidu.com/video/1111111111.mp4" 206 66 3739981 HIT "Mozilla/5.0 (iPad; CPU OS 15.1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 SP-engine/2.43.0 main%2F1.0 baiduboxapp/13.5.0.10 (Baidu P215.1) NABar/1.0" "video/mp4"
At first glance, splitting on spaces seems enough, as in the following code:
import os

# collect all log files under the directory
my_path = r"C:\log directory"
file_names = os.listdir(my_path)
file_list = [os.path.join(my_path, file) for file in file_names]

for file in file_list:
    with open(file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    for i in lines:
        item_list = i.split(' ')
        s_time = item_list[0] + ' ' + item_list[1]
        ip = item_list[2]
        pro_ip = item_list[3]
        dura_time = item_list[4]
        referer = item_list[5]
        method = item_list[6]
        url = item_list[7]
        code = item_list[8]
        size = item_list[9]
        res_size = item_list[10]
        miss = item_list[11]
        html_type = item_list[12]
        print(s_time, ip, pro_ip, dura_time, referer, method, url,
              code, size, res_size, miss, html_type)
After running it, you will find that both the start time and the User-Agent contain internal spaces, so the split positions shift from line to line. This scheme has to be discarded in favor of regular expressions.
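To see the problem concretely, here is a minimal sketch (with the User-Agent shortened for readability) showing how a plain space split fragments both the bracketed timestamp and the quoted User-Agent:

line = ('[11/Mar/2022:00:34:17 +0800] 118.181.139.215 - 1961 "http://xx.baidu.cn/" '
        '"GET https://cdn.baidu.com/video/1111111111.mp4" 206 66 3739981 HIT '
        '"Mozilla/5.0 (iPad; CPU OS 15.1 like Mac OS X)" "video/mp4"')

parts = line.split(' ')
print(parts[0])    # '[11/Mar/2022:00:34:17' - the timestamp is cut at its internal space
print(parts[12:])  # the quoted User-Agent is scattered across several tokens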
Using the sample line as a template, the regular expression comes out as follows:
\[(?P<s_time>.*?)\] (?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?P<pro_ip>.*?) (?P<dura_time>\d+) "(?P<referer>.*?)" "(?P<method>.*?) (?P<url>.*?)" (?P<code>\d+) (?P<size>\d+) (?P<res_size>\d+) (?P<miss>.*?) "(?P<ua>.*?)" "(?P<html_type>.*?)"
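Before wiring the pattern into the full script, a quick sanity check (a sketch, run against the shortened sample line) confirms that the named groups capture what we expect:

import re

pattern = re.compile(
    r'\[(?P<s_time>.*?)\] (?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?P<pro_ip>.*?) '
    r'(?P<dura_time>\d+) "(?P<referer>.*?)" "(?P<method>.*?) (?P<url>.*?)" '
    r'(?P<code>\d+) (?P<size>\d+) (?P<res_size>\d+) (?P<miss>.*?) '
    r'"(?P<ua>.*?)" "(?P<html_type>.*?)"')

line = ('[11/Mar/2022:00:34:17 +0800] 118.181.139.215 - 1961 "http://xx.baidu.cn/" '
        '"GET https://cdn.baidu.com/video/1111111111.mp4" 206 66 3739981 HIT '
        '"Mozilla/5.0 (iPad; CPU OS 15.1 like Mac OS X)" "video/mp4"')

m = pattern.search(line)
print(m.groupdict())
# {'s_time': '11/Mar/2022:00:34:17 +0800', 'ip': '118.181.139.215', 'pro_ip': '-', ...}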
Next, read the data in a loop, and then extract:
import os
import re
import pymysql

# collect all log files under the directory
my_path = r"C:\log folder"
file_names = os.listdir(my_path)
file_list = [os.path.join(my_path, file) for file in file_names]

# compile the pattern once, outside the loop
pattern = re.compile(
    r'\[(?P<s_time>.*?)\] (?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?P<pro_ip>.*?) '
    r'(?P<dura_time>\d+) "(?P<referer>.*?)" "(?P<method>.*?) (?P<url>.*?)" '
    r'(?P<code>\d+) (?P<size>\d+) (?P<res_size>\d+) (?P<miss>.*?) '
    r'"(?P<ua>.*?)" "(?P<html_type>.*?)"')

wait_list = []
for file in file_list:
    with open(file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    for i in lines:
        gs = pattern.findall(i)
        if not gs:
            continue  # skip lines that do not match the expected format
        item_list = gs[0]
        s_time = item_list[0]
        ip = item_list[1]
        pro_ip = item_list[2]
        dura_time = item_list[3]
        referer = item_list[4]
        method = item_list[5]
        url = item_list[6]
        code = item_list[7]
        size = item_list[8]
        res_size = item_list[9]
        miss = item_list[10]
        ua = item_list[11]
        html_type = item_list[12]
        values_str = (f"('{s_time}', '{ip}', '{pro_ip}', {int(dura_time)}, "
                      f"'{referer}', '{method}', '{url}', {int(code)}, {int(size)}, "
                      f"{int(res_size)}, '{miss}', '{ua}', '{html_type}')")
        wait_list.append(values_str)
The parsed rows accumulate in the wait_list list; next, write them to MySQL in batches of 1000 elements to keep each SQL statement from getting too long.
def insert_data():
    # insert 1000 rows per statement to keep the SQL from getting too long
    for i in range(len(wait_list) // 1000 + 1):
        items = wait_list[i * 1000:i * 1000 + 1000]
        if not items:
            continue
        item_str = ",".join(items)
        insert_sql = (f"INSERT INTO ll (s_time, ip, pro_ip, dura_time, referer, method, url, "
                      f"code, size, res_size, miss, ua, html_type) VALUES {item_str}")
        db = pymysql.connect(host='localhost', user='root',
                             password='root', database='logs')
        cursor = db.cursor()
        try:
            cursor.execute(insert_sql)
            db.commit()
        except Exception as e:
            print(e)
            db.rollback()
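Building the VALUES string with f-strings works for trusted logs, but a single quote inside a Referer or User-Agent would break the statement. A safer variant (a sketch under the assumption that you append 13-element tuples to the list instead of pre-rendered strings; insert_data_safe is a name of my own) lets pymysql quote and escape every value:

import pymysql

def insert_data_safe(rows):
    # rows: list of 13-element tuples in the column order of the INSERT
    sql = ("INSERT INTO ll (s_time, ip, pro_ip, dura_time, referer, method, url, "
           "code, size, res_size, miss, ua, html_type) "
           "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")
    db = pymysql.connect(host='localhost', user='root',
                         password='root', database='logs')
    try:
        with db.cursor() as cursor:
            for i in range(0, len(rows), 1000):
                # executemany substitutes and escapes each batch of tuples safely
                cursor.executemany(sql, rows[i:i + 1000])
        db.commit()
    except Exception as e:
        print(e)
        db.rollback()
    finally:
        db.close()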
Once the import finishes, all parsed rows are in MySQL and you can sort and query them however you need.
III. Custom query
For example, you can count requests per Referer:
SELECT count(id) num, referer FROM ll GROUP BY referer ORDER BY num DESC;
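In the same spirit, and since the goal stated in section I was to check request frequency and source, here are two more query sketches (column names follow the table created above; s_time is stored as text, so the minute bucket is simply a 17-character string prefix such as '11/Mar/2022:00:34'):

-- top client IPs by request count
SELECT count(id) num, ip FROM ll GROUP BY ip ORDER BY num DESC LIMIT 20;

-- requests per minute, to spot traffic spikes
SELECT LEFT(s_time, 17) minute_bucket, count(id) num FROM ll GROUP BY minute_bucket ORDER BY num DESC LIMIT 20;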
At this point, you should have a deeper understanding of how to import CDN log files into MySQL for analysis with Python. Try it out in practice.