
How to analyze website log data in Python

2025-07-11 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article walks through, step by step, how to analyze website log data in Python. It is aimed at readers who want a simple and practical way to approach the problem.

Data source

# install the parsing library first, e.g. pip install apache-log-parser
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import apache_log_parser

# log format; %l is included because the sample line carries two "-" placeholders
fformat = '%V %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T'
p = apache_log_parser.make_parser(fformat)  # first create the parser
sample_string = 'koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0'
data = p(sample_string)  # the parsed result is a dictionary
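To make the shape of the parsed record concrete without installing anything, here is a minimal sketch that splits the same sample line with a standard-library regular expression. It is not how apache_log_parser works internally. The pattern `LINE_RE`, the helper `parse_line`, and the group names `server_name`, `logname`, `user`, `referer`, `user_agent` and `seconds` are illustrative assumptions; only the five keys used later in this article are taken from the text.

```python
import re

# Hypothetical pattern for the format string used above:
# %V %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" %T
LINE_RE = re.compile(
    r'(?P<server_name>\S+) (?P<remote_host>\S+) (?P<logname>\S+) '
    r'(?P<user>\S+) (?P<time_received>\[[^\]]+\]) '
    r'"(?P<request_first_line>[^"]*)" (?P<status>\d{3}) '
    r'(?P<response_bytes_clf>\S+) "(?P<referer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)" (?P<seconds>\S+)'
)


def parse_line(line):
    """Return a dict of string fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


sample = (
    'koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] '
    '"GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" '
    '"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) '
    'Chrome/23.0.1271.64 Safari/537.11" 0'
)
record = parse_line(sample)
```

Every value comes back as a string, which is why the later steps convert `status` and `response_bytes_clf` explicitly.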

datas = open(r'H:\python data Analysis\data\apache_access_log').readlines()  # read the log data line by line
log_list = []  # each line is parsed into a dictionary and collected here
for line in datas:
    data = p(line)
    # time data processing: keep the date, the clock time and the UTC offset
    data['time_received'] = data['time_received'][1:12] + ' ' + data['time_received'][13:21] + ' ' + data['time_received'][22:27]
    log_list.append(data)  # append to the list
log = pd.DataFrame(log_list)  # construct the DataFrame
log = log[['status', 'response_bytes_clf', 'remote_host', 'request_first_line', 'time_received']]  # extract the fields of interest
log.head()

The extracted fields are:
status: status code
response_bytes_clf: number of bytes returned (traffic)
remote_host: remote host IP address
request_first_line: request content
time_received: time data
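The three slices in the loop are easier to follow on a single value. The sketch below applies them to the timestamp from the sample line and then parses the result with the standard library. The explicit `strptime` format is an assumption added for illustration; the article itself relies on `pd.to_datetime` to infer it.

```python
from datetime import datetime

raw = '[16/Mar/2013:00:19:43 +0400]'  # time_received as it appears in the sample line

# Same slices as in the loop: date, clock time, UTC offset
cleaned = raw[1:12] + ' ' + raw[13:21] + ' ' + raw[22:27]

# Parse with an explicit format (day/abbreviated month/year, time, offset)
parsed = datetime.strptime(cleaned, '%d/%b/%Y %H:%M:%S %z')
```

The slicing drops the surrounding brackets and replaces the colon between date and time with a space, leaving a string that common date parsers accept.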

log['time_received'] = pd.to_datetime(log['time_received'])  # convert the time_received field to a datetime type
log = log.set_index('time_received')  # and set it as the index
log.head()

log['status'] = log['status'].astype('int')  # convert to int
log['response_bytes_clf'].unique()
# array(['26126', '10532', '1853', ..., '66386', '47413', '48212'], dtype=object)
# Converting response_bytes_clf directly raises an error; the cause is that some rows contain "-"
log[log['response_bytes_clf'] == '-'].head()

def dash3nan(x):
    # conversion function: a "-" character becomes NaN, any other value is converted from bytes to MB
    if x == '-':
        x = np.nan
    else:
        x = float(x) / 1048576
    return x

log['response_bytes_clf'] = log['response_bytes_clf'].map(dash3nan)
log.head()
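As a quick check of the conversion, the sketch below applies the same rule to three hand-picked values using only the standard library. The name `bytes_field_to_mb` and the input list are assumptions made up for this example.

```python
import math

BYTES_PER_MB = 1048576  # 1024 * 1024, the divisor used in the article


def bytes_field_to_mb(value):
    """Return the size in MB, or NaN for the '-' placeholder."""
    if value == '-':
        return math.nan
    return float(value) / BYTES_PER_MB


# '65237' is the byte count from the sample log line
sizes = [bytes_field_to_mb(v) for v in ['65237', '-', '1048576']]
```

Using NaN rather than 0 for "-" keeps missing sizes out of sums and means, while the row itself remains in the table.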

log.dtypes

The traffic does not fluctuate much, but there is one large peak that exceeds 20 MB.

log[log['response_bytes_clf'] > 20]  # view the peak traffic

t_log = log['response_bytes_clf'].resample('30t').count()  # '30t' means 30 minutes; recent pandas versions prefer '30min'
t_log.plot()

After resampling into 30-minute bins and counting, the number of visits per time period peaks at around 8:00 in the morning; for the rest of the day it fluctuates up and down.
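What "resample and count" amounts to can be sketched in plain Python: floor each timestamp to the start of its 30-minute bin, then count how many fall into each bin. This is an illustration of the idea, not the pandas implementation; `floor_to_30min` and the four timestamps are made up for the example.

```python
from collections import Counter
from datetime import datetime


def floor_to_30min(ts):
    """Round a datetime down to the start of its 30-minute bin."""
    return ts.replace(minute=ts.minute - ts.minute % 30, second=0, microsecond=0)


hits = [
    datetime(2013, 3, 16, 8, 5, 0),
    datetime(2013, 3, 16, 8, 29, 59),
    datetime(2013, 3, 16, 8, 30, 0),
    datetime(2013, 3, 16, 9, 10, 0),
]
counts = Counter(floor_to_30min(t) for t in hits)
```

One difference worth noting: pandas resampling also emits bins that contain no rows (with a count of 0), whereas this sketch only lists bins that received at least one hit.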

h_log = log['response_bytes_clf'].resample('H').count()  # hourly bins; recent pandas versions prefer 'h'
h_log.plot()

When the data is resampled to a lower frequency (hourly bins), the fluctuations become less obvious.

d_log = pd.DataFrame({'count': log['response_bytes_clf'].resample('10t').count(),
                      'sum': log['response_bytes_clf'].resample('10t').sum()})
d_log.head()

Construct the DataFrame for the number of visits and access traffic.
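The same kind of table can also be built in a single pass with `agg`. The sketch below runs on four made-up rows so that the result can be checked by hand; the timestamps and sizes are assumptions, and `'10min'` is used as the bin alias because it is accepted by both older and newer pandas versions.

```python
import pandas as pd

# Hypothetical response sizes in MB, indexed by request time
idx = pd.to_datetime([
    '2013-03-16 08:01', '2013-03-16 08:07',
    '2013-03-16 08:12', '2013-03-16 08:25',
])
mb = pd.Series([0.5, 1.5, 2.0, 0.25], index=idx)

# One resample, two aggregations: visits per bin and traffic per bin
summary = mb.resample('10min').agg(['count', 'sum'])
```

Here the 08:00 bin holds two requests totalling 2.0 MB, and the 08:10 and 08:20 bins hold one request each.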

plt.figure(figsize=(10, 6))  # set the chart size
ax1 = plt.subplot(111)  # one subplot
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax1.plot(d_log['count'], color='r', label='count')
ax1.legend(loc=2)
ax2.plot(d_log['sum'], label='sum')
ax2.legend(loc=0)

Draw a line chart, which shows that there is a correlation between the number of visits and access traffic.
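Reading a correlation off a chart is a judgment call, so it can help to put a number on it. The sketch below computes the Pearson correlation coefficient by hand for two short made-up series; the values are assumptions and are not taken from the log.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


visits = [10, 20, 30, 40]        # hypothetical visits per bin
traffic = [1.0, 2.1, 2.9, 4.2]   # hypothetical MB per bin
r = pearson(visits, traffic)
```

With pandas, the equivalent for two columns of a DataFrame is `Series.corr`, for example `d_log['count'].corr(d_log['sum'])`.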

IP address analysis

ip_count = log['remote_host'].value_counts()[0:10]  # count requests per remote_host and keep the top 10
ip_count

ip_count.plot(kind='barh')  # horizontal bar chart of the top 10 IP addresses
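The top-N count itself needs nothing beyond the standard library. The sketch below does the equivalent of `value_counts()[0:10]` with `collections.Counter` on a small made-up list; the addresses come from the reserved documentation ranges and are not from the log.

```python
from collections import Counter

# Hypothetical remote_host values, one per request
hosts = [
    '192.0.2.1', '198.51.100.7', '192.0.2.1',
    '203.0.113.9', '192.0.2.1', '198.51.100.7',
]

# most_common(n) returns (value, count) pairs, highest count first
top_hosts = Counter(hosts).most_common(2)
```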

import pygeoip  # install the library with: pip install pygeoip
# the DAT file also needs to be downloaded from the website (http://dev.maxmind.com/geoip/legacy/geolite) to resolve IP addresses
gi = pygeoip.GeoIP(r'H:\python data Analysis\data\GeoLiteCity.dat', pygeoip.MEMORY_CACHE)
info = gi.record_by_addr('64.233.161.99')  # resolve an IP address
info

ips = log.groupby('remote_host')['status'].agg(['count'])  # group by IP address and count the requests
ips.head()

ips.drop('91.224.246.183', inplace=True)
# write the country and the latitude and longitude resolved from each IP into the DataFrame
ips['country'] = [gi.record_by_addr(i)['country_code3'] for i in ips.index]
ips['latitude'] = [gi.record_by_addr(i)['latitude'] for i in ips.index]
ips['longitude'] = [gi.record_by_addr(i)['longitude'] for i in ips.index]
ips.head()

country = ips.groupby('country')['count'].sum()  # total requests per country
country = country.sort_values(ascending=False)[0:10]  # keep the top 10 countries
country
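The list comprehensions above index straight into the lookup result, so they fail if a lookup returns nothing for some address. The sketch below shows one defensive way to aggregate per country, assuming the lookup can return `None`. The function `lookup`, the table `FAKE_GEO` and all addresses and counts are hypothetical stand-ins, used so the example runs without the GeoLite data file.

```python
from collections import defaultdict

# Hypothetical stand-in for a GeoIP database: IP -> record
FAKE_GEO = {
    '192.0.2.1': {'country_code3': 'RUS'},
    '192.0.2.2': {'country_code3': 'RUS'},
    '198.51.100.7': {'country_code3': 'DEU'},
}


def lookup(ip):
    """Return a record dict, or None when the address is unknown."""
    return FAKE_GEO.get(ip)


hits_per_ip = {'192.0.2.1': 5, '192.0.2.2': 3, '198.51.100.7': 4, '203.0.113.9': 2}

hits_per_country = defaultdict(int)
unresolved = []
for ip, hits in hits_per_ip.items():
    record = lookup(ip)
    if record is None:
        unresolved.append(ip)  # keep track instead of raising
        continue
    hits_per_country[record['country_code3']] += hits

# Countries ordered by total hits, highest first
ranking = sorted(hits_per_country.items(), key=lambda item: item[1], reverse=True)
```

Collecting the unresolved addresses in a list makes it visible how much traffic is left out of the country ranking.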

country.plot(kind='bar')

Russia accounts for the most visits, which suggests that the website is based in Russia.

from mpl_toolkits.basemap import Basemap

plt.style.use('ggplot')
plt.figure(figsize=(10, 6))
map1 = Basemap(projection='robin', lat_0=39.9, lon_0=116.3, resolution='l', area_thresh=1000.0)
map1.drawcoastlines()
map1.drawcountries()
map1.drawmapboundary()
map1.drawmeridians(np.arange(0, 360, 30))
map1.drawparallels(np.arange(-90, 90, 30))
size = 0.03  # scale factor from request count to marker size
for lon, lat, mag in zip(list(ips['longitude']), list(ips['latitude']), list(ips['count'])):
    x, y = map1(lon, lat)  # project longitude/latitude to map coordinates
    msize = mag * size
    map1.plot(x, y, 'ro', markersize=msize)

That covers how to analyze website log data in Python. I hope the above content is of some help to you.
