A website's access log is a very important file: by analyzing it, you can mine a great deal of valuable information. This article shows how to use Python to analyze the access log of a real website, drawing on Python file operations, string handling, lists, sets, dictionaries, and other related knowledge. The access log used in this article comes from my personal cloud virtual machine and can be downloaded from the attachment at the end of this article.
1. Extracting the logs for a specified date
Here is a typical website access log entry; one entry is recorded for each resource a client requests from the site.
193.112.9.107 - - [25/Jan/2020:06:32:58 +0800] "GET /robots.txt HTTP/1.1" 404 208 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"
Each log entry is divided by spaces into nine parts, the most important of which are:
Part 1, 193.112.9.107: the IP address of the client.
Part 4, [25/Jan/2020:06:32:58 +0800]: the time at which the access request occurred.
Part 5, "GET /robots.txt HTTP/1.1": the first line of the HTTP request header sent by the client, in the format "request method, requested resource, request protocol". This is the most important part of the log. "GET /robots.txt HTTP/1.1" means the client requested the server's /robots.txt file using the GET method over HTTP/1.1.
Part 6, 404: the HTTP response status code, which indicates whether the user's request succeeded. A value of 200 means the access succeeded; otherwise there may be a problem. In general, status codes starting with 2 indicate success, those starting with 3 indicate the request was redirected to another location, those starting with 4 indicate a client-side error, and those starting with 5 indicate a server-side error.
Part 7, 208: the size of the response body in bytes, not including the response headers. Summing these values across the log tells you how much data the server sent over a given period.
Part 9, "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0": the value of the User-Agent header in the client's HTTP request, i.e., the requesting application, usually a browser.
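To make the field positions concrete, here is a minimal sketch showing how Python's str.split() breaks the sample entry apart (the User-Agent string is truncated for readability). Note that split() splits on every space, so the bracketed time and the quoted request line each span several list elements; this is why the code later in this article uses indexes 0, 6 and 8.
>>> line = '193.112.9.107 - - [25/Jan/2020:06:32:58 +0800] "GET /robots.txt HTTP/1.1" 404 208 "-" "Mozilla/5.0 ..."'
>>> fields = line.split()
>>> fields[0]    # Part 1: client IP address
'193.112.9.107'
>>> fields[6]    # from Part 5: requested resource
'/robots.txt'
>>> fields[8]    # Part 6: HTTP status code
'404'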
A log file usually contains records from many days, while we typically analyze the logs for a single day, so the first step is to extract that day's entries from the log file.
For example, to extract the entries generated on January 25, you can execute the following code:
>>> with open('access_log','r') as f1, open('access_log-0125','w') as f2:
...     for line in f1:
...         if '25/Jan/2020' in line:
...             f2.write(line)
...
In this code, the log file access_log is opened in read ('r') mode as file object f1, and the file access_log-0125 is created in write ('w') mode as file object f2.
We then traverse each line in f1, check whether it contains the keyword "25/Jan/2020", and if so, write that line to f2.
This extracts all log entries for January 25 and saves them to the file access_log-0125, which is what the rest of this article analyzes.
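If you need to do this for more than one day, the same logic can be wrapped in a small helper function (extract_day is a hypothetical name used for illustration, not part of the original session):
>>> def extract_day(src, dst, day):
...     # Copy every line containing the given date keyword from src to dst.
...     with open(src, 'r') as fin, open(dst, 'w') as fout:
...         for line in fin:
...             if day in line:
...                 fout.write(line)
...
>>> extract_day('access_log', 'access_log-0125', '25/Jan/2020')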
2. Counting PV and UV
PV stands for PageView, the number of page views on the website. Each page a user requests or visits on the site is recorded as 1 PV; for example, if a user visits 4 pages on the site, PV increases by 4. Repeated visits to the same page by the same user are also counted cumulatively.
UV stands for Unique Visitor, the number of distinct visitors to the site. Each IP address that visits the site is treated as one visitor, and the same IP is counted only once per day.
Therefore, we only need to extract the IP address from each log entry: counting all of them gives PV, and deduplicating them first gives UV.
Execute the following code to extract the IP address of each entry and store it in the list ips.
>>> ips = []
>>> with open('access_log-0125','r') as f:
...     for line in f:
...         ips.append(line.split()[0])
...
In this code, we first define an empty list ips, then open the file access_log-0125 and iterate over it line by line. Each line is split into a list using whitespace as the separator, and its first element (the IP address) is appended to ips.
Now the length of the list ips is PV, and its length after deduplication is UV. Deduplication here uses the set() function to convert the list into a set; since a Python set cannot contain duplicate elements, this completes the operation simply and efficiently.
>>> pv = len(ips)
>>> uv = len(set(ips))
>>> print(pv,uv)
1011 48
3. Calculating the proportion of error pages
The proportion of error pages is a very important piece of data, directly tied to the user experience of the website. To calculate it, we can tally the HTTP status code of each request: a status code of 2xx or 3xx is regarded as a successful access, while 4xx or 5xx is regarded as an error.
First, extract the status codes of all requests and save them to a list:
>>> codes = []
>>> with open('access_log-0125','r') as f:
...     for line in f:
...         codes.append(line.split()[8])
...
Then count the number of occurrences of each status code and save the counts in a dictionary:
>>> ret = {}
>>> for i in codes:
...     if i not in ret:
...         ret[i] = codes.count(i)
...
>>> ret
{'200': 192, '404': 796, '"-"': 4, '400': 13, '403': 3, '401': 2, '405': 1}
The above code iterates over the list codes, uses each status code as a dictionary key, and stores the number of times that status code appears in codes as the corresponding value. The "if i not in ret" test ensures each status code is counted only once.
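Note that list.count() rescans the entire list for every distinct status code, so this approach does more work than necessary on large logs. A single-pass variant, offered here only as a sketch of an alternative, uses dict.get() to keep a running tally:
>>> ret = {}
>>> for i in codes:
...     ret[i] = ret.get(i, 0) + 1    # start from 0 for a new code, then add 1
...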
If you want to calculate the proportion of 404 pages, you can execute the following code:
>>> ret['404']/sum(ret.values())
0.7873392680514342
In this code, ret['404'] retrieves the value stored under the key '404' in the dictionary ret, i.e., the number of 404 responses. ret.values() returns the values of all entries in the dictionary, and summing them with the sum() function gives the total number of status codes. The ratio of the two is the proportion of 404 pages.
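Following the 2xx/3xx versus 4xx/5xx rule described earlier, the overall error proportion can also be computed from the same ret dictionary. A minimal sketch (the '"-"' entries, which carry no status code, are simply excluded from the error count):
>>> errors = sum(v for k, v in ret.items() if k.startswith(('4', '5')))
>>> round(errors / sum(ret.values()), 3)    # 815 error responses out of 1011
0.806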
As the results show, the error rate of my website is remarkably high: 404 responses alone account for 78.7% of all requests. For a normal website this would certainly indicate a problem. However, mine is not a public-facing site and has no valuable pages, so most of the access log entries were actually generated by vulnerability-scanning software. This is also a reminder that any website exposed to the Internet is being scanned and probed by someone at all times.
4. Finding the website's popular resources
Next, we count the number of visits to each page and sort the results.
The first step is again to traverse the log file, extract every page the users visited, and save the pages to a list:
>>> webs = []
>>> with open('access_log-0125','r') as f:
...     for line in f:
...         webs.append(line.split()[6])
...
Then count the number of visits to each page and store the counts in a dictionary:
>>> counts = {}
>>> for i in webs:
...     if i not in counts:
...         counts[i] = webs.count(i)
...
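As an aside, the standard library's collections.Counter builds the same tally in a single pass; a minimal sketch (Counter also provides a most_common() method that returns the tallies already sorted in descending order, which would cover the sorting step below as well):
>>> from collections import Counter
>>> counts = Counter(webs)    # behaves like the dict built above, e.g. counts['/'] == 175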
Sort by page views in descending order:
>>> sorted(counts.items(),key=lambda x:x[1],reverse=True)
[('/', 175), ('/robots.txt', 25), ('/phpinfo.php', 6), ('/Admin13790d6a/Login.php', 4), ……
To better understand the usage of the sorted() function above, here is an example with a dictionary called services. If we sort the dictionary directly with sorted(), the default is to sort its keys in ascending order. To keep the values as well, we can use the items() method, which yields each key-value pair as a tuple; these tuples are then sorted by default on their first element, the dictionary key.
>>> services = {'http':80,'ftp':21,'https':443,'ssh':22}
>>> sorted(services)
['ftp', 'http', 'https', 'ssh']
>>> sorted(services.items())
[('ftp', 21), ('http', 80), ('https', 443), ('ssh', 22)]
If you want to sort by the dictionary values instead, that is, by the second element of each tuple, you can pass a lambda expression as the key parameter that selects the second element of each tuple as the sort key.
>>> sorted(services.items(),key=lambda x:x[1])
[('ftp', 21), ('ssh', 22), ('http', 80), ('https', 443)]
This explains the behavior of the sorted() calls above. As for the lambda expression, it is simply a small anonymous function that can be defined on the spot wherever it is needed. In "lambda x:x[1]", the x to the left of the colon is the function's parameter, the expression to the right of the colon is the operation the function performs, and the result of that expression is the function's return value.
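An equivalent alternative to the lambda, mentioned here only as an aside, is operator.itemgetter from the standard library, which builds a function that fetches the element at a given index:
>>> from operator import itemgetter
>>> sorted(services.items(), key=itemgetter(1))
[('ftp', 21), ('ssh', 22), ('http', 80), ('https', 443)]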
This article is part of the "Python Security and Operations" course series, which has been updated through its second part. Interested readers can refer to:
Part 1 Python Basic Syntax https://edu.51cto.com/sd/53aa7
Part 2 Python Programming Basics https://edu.51cto.com/sd/d100c