How to analyze website log data in Python? This article works through the question step by step, with detailed analysis and runnable code, in the hope of giving readers who face the same problem a simpler way to solve it.
Data source
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import apache_log_parser  # install first: pip install apache-log-parser
%matplotlib inline

# Apache log format string; %l is included so the identd field ('-') in the sample line parses
fformat = '%V %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T'
p = apache_log_parser.make_parser(fformat)  # create the parser
sample_string = ('koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] '
                 '"GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" '
                 '"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) '
                 'Chrome/23.0.1271.64 Safari/537.11" 0')
data = p(sample_string)  # the parsed result is a dictionary
data
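For orientation, the fields the rest of the article relies on can be read straight off the parsed dictionary. A small illustrative check (the key names follow from the format string above and from how they are used later):

for key in ['remote_host', 'time_received', 'request_first_line', 'status', 'response_bytes_clf']:
    print(key, '->', data[key])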
datas = open(r'H:\python data analysis\data\apache_access_log').readlines()  # read the log file line by line
log_list = []
for line in datas:
    data = p(line)  # parse each line into a dictionary
    # reassemble the timestamp into a form pandas can parse
    data['time_received'] = (data['time_received'][1:12] + ' ' +
                             data['time_received'][13:21] + ' ' +
                             data['time_received'][22:27])
    log_list.append(data)

log = pd.DataFrame(log_list)  # build a DataFrame
log = log[['status', 'response_bytes_clf', 'remote_host', 'request_first_line', 'time_received']]  # keep the fields of interest
log.head()

The fields are: status (HTTP status code), response_bytes_clf (bytes returned, i.e. traffic), remote_host (IP address of the remote host), request_first_line (first line of the request), and time_received (timestamp).
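The three slices above just strip the brackets from Apache's timestamp and re-space it; a quick check on the sample line's timestamp (illustrative only, not part of the original walkthrough):

ts = '[16/Mar/2013:00:19:43 +0400]'
print(ts[1:12] + ' ' + ts[13:21] + ' ' + ts[22:27])  # -> 16/Mar/2013 00:19:43 +0400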
log['time_received'] = pd.to_datetime(log['time_received'])  # convert time_received to a datetime type and make it the index
log = log.set_index('time_received')
log.head()
log['status'] = log['status'].astype('int')  # cast the status code to int

log['response_bytes_clf'].unique()
array(['26126', '10532', '1853', ..., '66386', '47413', '48212'], dtype=object)

Converting response_bytes_clf to a numeric type raises an error; looking for the cause shows that the column contains '-' placeholders:

log[log['response_bytes_clf'] == '-'].head()
def dash3nan(x):
    # replace the '-' placeholder with NaN, and convert the byte count to megabytes
    if x == '-':
        x = np.nan
    else:
        x = float(x) / 1048576  # 1 MB = 1048576 bytes
    return x

log['response_bytes_clf'] = log['response_bytes_clf'].map(dash3nan)
log.head()
log.dtypes
Plotted over time, the traffic does not fluctuate much, but there is one pronounced peak that exceeds 20 MB.
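A one-line plot makes this visible (a minimal sketch, assuming the log DataFrame built above):

log['response_bytes_clf'].plot(figsize=(10, 6))  # traffic in MB over time; the spike past 20 MB stands out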
log[log['response_bytes_clf'] > 20]  # inspect the peak traffic rows (more than 20 MB)
t_log = log['response_bytes_clf'].resample('30t').count()
t_log.plot()
Resampling to 30-minute bins and counting shows that the number of visits peaks at around 8:00 in the morning and fluctuates up and down through the rest of the day.
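Rather than reading the peak off the chart, the busiest bin can be located directly (my addition, using the t_log series above):

print(t_log.idxmax(), t_log.max())  # timestamp of the busiest 30-minute bin and its request count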
h_log = log['response_bytes_clf'].resample('H').count()
h_log.plot()
Continuing to convert to an even lower frequency (hourly), the fluctuations become less obvious.
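The same pattern extends to any frequency; for instance, a daily view (an extra step not in the original walkthrough):

day_log = log['response_bytes_clf'].resample('D').count()  # requests per day
day_log.plot()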
d_log = pd.DataFrame({'count': log['response_bytes_clf'].resample('10t').count(),
                      'sum': log['response_bytes_clf'].resample('10t').sum()})
d_log.head()
This builds a DataFrame holding both the number of visits ('count') and the traffic volume ('sum') per 10-minute bin.
plt.figure(figsize=(10, 6))  # set the chart size
ax1 = plt.subplot(111)  # one subplot
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax1.plot(d_log['count'], color='r', label='count')
ax1.legend(loc=2)
ax2.plot(d_log['sum'], label='sum')
ax2.legend(loc=0)
The resulting line chart shows that the number of visits and the traffic volume move together, i.e. the two are correlated.
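The relationship can also be quantified instead of judged by eye (a small addition, assuming the d_log frame built above):

d_log[['count', 'sum']].corr()  # Pearson correlation between visit count and traffic volume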
IP address analysis

ip_count = log['remote_host'].value_counts()[0:10]  # count requests per remote_host and keep the top 10
ip_count
ip_count.plot(kind='barh')  # horizontal bar chart of the top ten IPs
import pygeoip  # install with: pip install pygeoip
# the GeoLiteCity.dat database must also be downloaded (http://dev.maxmind.com/geoip/legacy/geolite) to resolve IP addresses
gi = pygeoip.GeoIP(r'H:\python data analysis\data\GeoLiteCity.dat', pygeoip.MEMORY_CACHE)
info = gi.record_by_addr('64.233.161.99')
info  # look up a single IP address
ips = log.groupby('remote_host')['status'].agg(['count'])  # request count per IP address
ips.head()
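The lookup comes back empty for addresses the database cannot place, which would break the list comprehensions below; that is presumably why one address gets dropped first. A defensive alternative to the hard-coded drop (my sketch, not the article's code):

unresolved = [ip for ip in ips.index if not gi.record_by_addr(ip)]  # every IP GeoIP cannot place
print(unresolved)  # in this data set, presumably just ['91.224.246.183']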
ips.drop('91.224.246.183', inplace=True)  # drop an address the GeoIP database cannot resolve
# write the country and the latitude/longitude resolved from each IP into the DataFrame
ips['country'] = [gi.record_by_addr(i)['country_code3'] for i in ips.index]
ips['latitude'] = [gi.record_by_addr(i)['latitude'] for i in ips.index]
ips['longitude'] = [gi.record_by_addr(i)['longitude'] for i in ips.index]
ips.head()

country = ips.groupby('country')['count'].sum()  # aggregate the request counts by country
country = country.sort_values(ascending=False)[0:10]  # keep the top 10 countries
country
country.plot(kind='bar')
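To make the leading country's dominance explicit in relative terms (my addition):

share = country / country.sum() * 100  # each country's share of the top-10 requests, in percent
print(share.round(1))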
Russia has by far the most visitors, from which it can be inferred that the website is based in Russia.
from mpl_toolkits.basemap import Basemap

plt.style.use('ggplot')
plt.figure(figsize=(10, 6))
map1 = Basemap(projection='robin', lat_0=39.9, lon_0=116.3,
               resolution='l', area_thresh=1000.0)
map1.drawcoastlines()
map1.drawcountries()
map1.drawmapboundary()
map1.drawmeridians(np.arange(0, 360, 30))
map1.drawparallels(np.arange(-90, 90, 30))

size = 0.03
for lon, lat, mag in zip(list(ips['longitude']), list(ips['latitude']), list(ips['count'])):
    x, y = map1(lon, lat)  # project longitude/latitude to map coordinates
    msize = mag * size  # scale the marker by the request count
    map1.plot(x, y, 'ro', markersize=msize)
That answers the question of how to analyze website log data in Python. I hope the content above is of some help to anyone facing the same problem.