
Pandas from Entry Level to Proficiency

2025-02-23 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Data object

There are mainly two kinds of data objects in pandas:

Series
DataFrame

Note: the code below uses pandas version 0.20.1, imported via import pandas as pd.

Series

Series is a sequence object with an index

Creation mode

The simplest way to create one is as follows.

# initialize a Series object via pd.Series by passing a sequence, such as a list
s1 = pd.Series(list("1234"))
print(s1)
0    1
1    2
2    3
3    4
dtype: object
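A Series can also be given an explicit index at creation time. A minimal illustration (the labels below are invented for the example, not from the article):

```python
import pandas as pd

# pass a list of labels alongside the data
s_labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
# elements can then be accessed by label instead of position
print(s_labeled["b"])
```

The default 0-based integer index described below is simply what pandas generates when no index argument is given.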

DataFrame

Similar to a database table: a data object with both rows and columns.

The creation method is as follows.

# initialize a DataFrame object via pd.DataFrame by passing in a numpy 2D array or a dict object
# from a numpy 2D array
import numpy as np
df1 = pd.DataFrame(np.random.randn(6, 4))
print(df1)
          0         1         2         3
0 -0.646340 -1.249943  0.393323 -1.561873
1  0.371630  0.069426  1.693097  0.907419
2 -0.328575 -0.256765  0.693798 -0.787343
3  1.875764 -0.416275 -1.028718  0.158259
4  1.644791 -1.321506 -0.337425  0.820689
5  0.006391 -1.447894  0.506203  0.977290
# from a dict object
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print(df2)
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

Index

Both Series and DataFrame objects carry an index. The index of a Series corresponds to each element, while the index of a DataFrame corresponds to each row.
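One practical consequence of the index, shown here on a small constructed example (not from the original text): arithmetic between two Series aligns values by index label, not by position.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])
# values are matched on their labels before adding:
# "x" pairs 1 with 30, "y" pairs 2 with 20, "z" pairs 3 with 10
c = a + b
print(c)
```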

View

If no index is specified when an object is created, pandas generates an auto-incrementing integer index starting at 0. This is the same for DataFrame.

# when printing an object, the first column is the index
print(s1)
0    1
1    2
2    3
3    4
dtype: object
# or view the index alone; DataFrame works the same way
print(s1.index)

Add, delete, query and modify

The additions, deletions, queries and modifications here mainly operate on DataFrame objects.

In order to have enough data to display, we use data from tushare here.

Tushare installation

pip install tushare

Create the data object as follows

import tushare as ts
df = ts.get_k_data("000001")
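ts.get_k_data needs network access. If tushare is unavailable, a small stand-in frame with the same columns lets you follow along (the values are copied from the head() output shown later in this article):

```python
import pandas as pd

# three rows with the same schema that ts.get_k_data("000001") returns
df = pd.DataFrame({
    "date": ["2015-12-23", "2015-12-24", "2015-12-25"],
    "open": [9.927, 9.919, 9.855],
    "close": [9.935, 9.823, 9.879],
    "high": [10.174, 9.998, 9.927],
    "low": [9.871, 9.744, 9.815],
    "volume": [1039018.0, 640229.0, 399845.0],
    "code": ["000001", "000001", "000001"],
})
print(df.shape)
```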

(Figure: DataFrame rows and columns, and the axis layout — axis=0 runs down the rows, axis=1 across the columns)

Query

View the data type of each column

# view df data types
df.dtypes
date       object
open      float64
close     float64
high      float64
low       float64
volume    float64
code       object
dtype: object
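If a column comes in with the wrong dtype, it can be converted explicitly with astype. A small sketch on a toy frame (not the tushare data):

```python
import pandas as pd

toy = pd.DataFrame({"volume": ["100", "200"], "code": [1, 1]})
# parse the volume strings into floats, and keep stock codes as strings
toy["volume"] = toy["volume"].astype("float64")
toy["code"] = toy["code"].astype(str)
print(toy.dtypes)
```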

View a specified number of rows

The head function shows the first five rows by default, and the tail function the last five. Both accept a number to view that many rows.

# view the first five rows
df.head()
         date   open  close    high    low     volume    code
0  2015-12-23  9.927  9.935  10.174  9.871  1039018.0  000001
1  2015-12-24  9.919  9.823   9.998  9.744   640229.0  000001
2  2015-12-25  9.855  9.879   9.927  9.815   399845.0  000001
3  2015-12-28  9.895  9.537   9.919  9.537   822408.0  000001
4  2015-12-29  9.545  9.624   9.632  9.529   619802.0  000001
# view the last five rows
df.tail()
           date  open  close  high   low    volume    code
636  2018-08-01  9.42   9.15  9.50  9.11  814081.0  000001
637  2018-08-02  9.13   8.94  9.15  8.88  931401.0  000001
638  2018-08-03  8.93   8.91  9.10  8.91  476546.0  000001
639  2018-08-06  8.94   8.94  9.11  8.89  554010.0  000001
640  2018-08-07  8.96   9.17  9.17  8.88  690423.0  000001
# view the first ten rows
df.head(10)
         date   open  close    high    low     volume    code
0  2015-12-23  9.927  9.935  10.174  9.871  1039018.0  000001
1  2015-12-24  9.919  9.823   9.998  9.744   640229.0  000001
2  2015-12-25  9.855  9.879   9.927  9.815   399845.0  000001
3  2015-12-28  9.895  9.537   9.919  9.537   822408.0  000001
4  2015-12-29  9.545  9.624   9.632  9.529   619802.0  000001
5  2015-12-30  9.624  9.632   9.640  9.513   532667.0  000001
6  2015-12-31  9.632  9.545   9.656  9.537   491258.0  000001
7  2016-01-04  9.553  8.995   9.577  8.940   563497.0  000001
8  2016-01-05  8.972  9.075   9.210  8.876   663269.0  000001
9  2016-01-06  9.091  9.179   9.202  9.067   515706.0  000001

View one or more rows, one or more columns

# view the first row
df[0:1]
         date   open  close    high    low     volume    code
0  2015-12-23  9.927  9.935  10.174  9.871  1039018.0  000001
# view rows 10 to 20
df[10:21]
          date   open  close   high    low    volume    code
...
11  2016-01-08  8.924  8.852  8.987  8.677  747527.0  000001
12  2016-01-11  8.757  8.566  8.820  8.502  732013.0  000001
13  2016-01-12  8.621  8.605  8.685  8.470  561642.0  000001
14  2016-01-13  8.669  8.526  8.709  8.518  391709.0  000001
15  2016-01-14  8.566  8.430  8.597  8.343  666314.0  000001
16  2016-01-15  8.486  8.327  8.597  8.295  448202.0  000001
17  2016-01-18  8.231  8.287  8.406  8.199  421040.0  000001
18  2016-01-19  8.319  8.526  8.582  8.287  501109.0  000001
19  2016-01-20  8.518  8.390  8.597  8.397  603752.0  000001
...
# view the first five values of the date column
df["date"].head()   # or df.date.head()
0    2015-12-23
1    2015-12-24
2    2015-12-25
3    2015-12-28
4    2015-12-29
Name: date, dtype: object
# view the first five rows of the date, code and open columns
df[["date", "code", "open"]].head()
         date    code   open
0  2015-12-23  000001  9.927
1  2015-12-24  000001  9.919
2  2015-12-25  000001  9.855
3  2015-12-28  000001  9.895
4  2015-12-29  000001  9.545
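A detail worth knowing here, illustrated on a toy frame (a constructed example): df["col"] returns a Series, while df[["col"]] with a list returns a one-column DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({"open": [1.0, 2.0], "close": [1.5, 2.5]})
s = toy["open"]      # single label -> Series
d = toy[["open"]]    # list of labels -> DataFrame with one column
print(type(s).__name__, type(d).__name__)
```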

Query by combining row and column labels

# view row 10 of the date and code columns
df.loc[10, ["date", "code"]]
date    2016-01-07
code        000001
Name: 10, dtype: object
# view rows 10 to 20 of the date and code columns
df.loc[10:20, ["date", "code"]]
          date    code
10  2016-01-07  000001
11  2016-01-08  000001
12  2016-01-11  000001
13  2016-01-12  000001
14  2016-01-13  000001
15  2016-01-14  000001
16  2016-01-15  000001
17  2016-01-18  000001
18  2016-01-19  000001
19  2016-01-20  000001
20  2016-01-21  000001
# view a single value
df.loc[0, "open"]
9.9269999999999996

Query via ==position==

It is worth noting that the index values used above are labels, whereas iloc selects by integer position.

# view the first row
df.iloc[0]
date      2015-12-24
open           9.919
close          9.823
high           9.998
low            9.744
volume        640229
code          000001
Name: 0, dtype: object
# view the last row
df.iloc[-1]
date      2018-08-08
open            9.16
close           9.12
high            9.16
low              9.1
volume         29985
code          000001
Name: 640, dtype: object
# view the first five values of the first column
df.iloc[:, 0].head()
0    2015-12-24
1    2015-12-25
2    2015-12-28
3    2015-12-29
4    2015-12-30
Name: date, dtype: object
# view rows 2 to 4, columns 1 and 3
df.iloc[2:4, [0, 2]]
         date  close
2  2015-12-28  9.537
3  2015-12-29  9.624
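The label/position distinction is easiest to see on a frame whose index does not start at 0 (a constructed example): loc slices are inclusive of both endpoints, iloc slices exclude the end, just like Python lists.

```python
import pandas as pd

toy = pd.DataFrame({"v": [10, 20, 30]}, index=[10, 11, 12])
# loc selects by label; the slice includes both ends -> 2 rows
n_loc = len(toy.loc[10:11])
# iloc selects by integer position; the slice excludes the end -> 1 row
n_iloc = len(toy.iloc[0:1])
print(n_loc, n_iloc)
```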

Pass conditional screening

# view the first five rows where the open column is greater than 10
df[df.open > 10].head()
           date    open   close    high     low     volume    code
378  2017-07-14  10.483  10.570  10.609  10.337  1722570.0  000001
379  2017-07-17  10.619  10.483  10.987  10.396  3273123.0  000001
380  2017-07-18  10.425  10.716  10.803  10.299  2349431.0  000001
...
382  2017-07-20  10.745  10.638  10.880  10.580  1537338.0  000001
# view the first five rows where open is greater than 10 and open is less than 10.6
df[(df.open > 10) & (df.open < 10.6)].head()
           date    open   close    high     low     volume    code
378  2017-07-14  10.483  10.570  10.609  10.337  1722570.0  000001
380  2017-07-18  10.425  10.716  10.803  10.299  2349431.0  000001
387  2017-07-27  10.550  10.422  10.599  10.363  1194490.0  000001
388  2017-07-28  10.441  10.569  10.638  10.412   819195.0  000001
390  2017-08-01  10.471  10.865  10.904  10.432  2035709.0  000001
# view the first five rows where open is greater than 10 or open is less than 10.6
df[(df.open > 10) | (df.open < 10.6)].head()
         date   open  close   high    low    volume    code
0  2015-12-24  9.919  9.823  9.998  9.744  640229.0  000001
1  2015-12-25  9.855  9.879  9.927  9.815  399845.0  000001
2  2015-12-28  9.895  9.537  9.919  9.537  822408.0  000001
3  2015-12-29  9.545  9.624  9.632  9.529  619802.0  000001
4  2015-12-30  9.624  9.632  9.640  9.513  532667.0  000001

Add

Series and DataFrame creation was described briefly earlier; here are some commonly used, useful ways to create objects.

# create a time series starting 2018-08-08 with 7 periods; the default interval is one day
s2 = pd.date_range("20180808", periods=7)
print(s2)
DatetimeIndex(['2018-08-08', '2018-08-09', '2018-08-10', '2018-08-11',
               '2018-08-12', '2018-08-13', '2018-08-14'],
              dtype='datetime64[ns]', freq='D')
# from 2018-08-08 00:00 to 2018-08-09 00:00 with an interval of one hour
# for the values the freq parameter accepts, see:
# http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
s3 = pd.date_range("20180808", "20180809", freq="H")
print(s3)
DatetimeIndex(['2018-08-08 00:00:00', '2018-08-08 01:00:00',
               '2018-08-08 02:00:00', '2018-08-08 03:00:00',
               '2018-08-08 04:00:00', '2018-08-08 05:00:00',
               '2018-08-08 06:00:00', '2018-08-08 07:00:00',
               '2018-08-08 08:00:00', '2018-08-08 09:00:00',
               '2018-08-08 10:00:00', '2018-08-08 11:00:00',
               '2018-08-08 12:00:00', '2018-08-08 13:00:00',
               '2018-08-08 14:00:00', '2018-08-08 15:00:00',
               '2018-08-08 16:00:00', '2018-08-08 17:00:00',
               '2018-08-08 18:00:00', '2018-08-08 19:00:00',
               '2018-08-08 20:00:00', '2018-08-08 21:00:00',
               '2018-08-08 22:00:00', '2018-08-08 23:00:00',
               '2018-08-09 00:00:00'],
              dtype='datetime64[ns]', freq='H')
# create a time series from an existing sequence
s4 = pd.to_datetime(df.date.head())
print(s4)
0   2015-12-24
1   2015-12-25
2   2015-12-28
3   2015-12-29
4   2015-12-30
Name: date, dtype: datetime64[ns]

Modify

# change df's index to the data of the date column, converted to datetime type
df.index = pd.to_datetime(df.date)
df.head()
                  date   open  close   high    low    volume    code
date
2015-12-24  2015-12-24  9.919  9.823  9.998  9.744  640229.0  000001
2015-12-25  2015-12-25  9.855  9.879  9.927  9.815  399845.0  000001
2015-12-28  2015-12-28  9.895  9.537  9.919  9.537  822408.0  000001
2015-12-29  2015-12-29  9.545  9.624  9.632  9.529  619802.0  000001
2015-12-30  2015-12-30  9.624  9.632  9.640  9.513  532667.0  000001
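Once the index is a DatetimeIndex, rows can be selected with partial date strings. A sketch on synthetic data (the dates here are invented, not the stock data):

```python
import pandas as pd
import numpy as np

idx = pd.date_range("2016-01-01", periods=60, freq="D")
prices = pd.DataFrame({"open": np.arange(60.0)}, index=idx)
# select a whole month by its label prefix
jan = prices.loc["2016-01"]
print(len(jan))
```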
# rename the columns
df.columns = ["Date", "Open", "Close", "High", "Low", "Volume", "Code"]
print(df.head())
                  Date   Open  Close   High    Low    Volume    Code
date
2015-12-24  2015-12-24  9.919  9.823  9.998  9.744  640229.0  000001
2015-12-25  2015-12-25  9.855  9.879  9.927  9.815  399845.0  000001
2015-12-28  2015-12-28  9.895  9.537  9.919  9.537  822408.0  000001
2015-12-29  2015-12-29  9.545  9.624  9.632  9.529  619802.0  000001
2015-12-30  2015-12-30  9.624  9.632  9.640  9.513  532667.0  000001
# add 1 to every value of the Open column; apply does not modify the source
# data directly, so the new values must be assigned back to df
df.Open = df.Open.apply(lambda x: x + 1)
df.head()
                  Date    Open  Close   High    Low    Volume    Code
date
2015-12-24  2015-12-24  10.919  9.823  9.998  9.744  640229.0  000001
2015-12-25  2015-12-25  10.855  9.879  9.927  9.815  399845.0  000001
2015-12-28  2015-12-28  10.895  9.537  9.919  9.537  822408.0  000001
2015-12-29  2015-12-29  10.545  9.624  9.632  9.529  619802.0  000001
2015-12-30  2015-12-30  10.624  9.632  9.640  9.513  532667.0  000001
# add 1 to the values of both the Open and Close columns; with multiple
# columns, apply receives each whole column
df[["Open", "Close"]].head().apply(lambda x: x.apply(lambda x: x + 1))
              Open   Close
date
2015-12-24  11.919  10.823
2015-12-25  11.855  10.879
2015-12-28  11.895  10.537
2015-12-29  11.545  10.624
2015-12-30  11.624  10.632

Delete

Use the drop method to drop specified rows or columns.

Note: drop does not modify the source data directly; to make the source dataframe object be modified, pass inplace=True.

From the axis diagram earlier, row values (or labels) live on axis=0, and column values (or labels) on axis=1.

# delete a specified column, here the Open column
df.drop("Open", axis=1).head()   # or df.drop(df.columns[1], axis=1).head()
                  Date  Close   High    Low    Volume    Code
date
2015-12-24  2015-12-24  9.823  9.998  9.744  640229.0  000001
2015-12-25  2015-12-25  9.879  9.927  9.815  399845.0  000001
2015-12-28  2015-12-28  9.537  9.919  9.537  822408.0  000001
2015-12-29  2015-12-29  9.624  9.632  9.529  619802.0  000001
2015-12-30  2015-12-30  9.632  9.640  9.513  532667.0  000001
# delete columns 1 and 3, i.e. the Open and High columns
df.drop(df.columns[[1, 3]], axis=1).head()   # or df.drop(["Open", "High"], axis=1).head()
                  Date  Close    Low    Volume    Code
date
2015-12-24  2015-12-24  9.823  9.744  640229.0  000001
2015-12-25  2015-12-25  9.879  9.815  399845.0  000001
2015-12-28  2015-12-28  9.537  9.537  822408.0  000001
2015-12-29  2015-12-29  9.624  9.529  619802.0  000001
2015-12-30  2015-12-30  9.632  9.513  532667.0  000001

Commonly used pandas options

Numeric display format

When values are very large, pandas uses scientific notation by default.

# display float values in {:.4f} format, i.e. show the full number with four decimal places
pd.options.display.float_format = '{:.4f}'.format

Commonly used functions

Statistics

# describe computes count, mean, std, min, max and several quantiles for
# every numeric column
df.describe()
            Open     Close      High       Low        Volume
count   641.0000  641.0000  641.0000  641.0000      641.0000
mean     10.7862    9.7927    9.8942    9.6863   833968.6162
std       1.5962    1.6021    1.6620    1.5424   607731.6934
min       8.6580    7.6100    7.7770    7.4990   153901.0000
25%       9.7080    8.7180    8.7760    8.6500   418387.0000
50%      10.0770    9.0960    9.1450    8.9990   627656.0000
75%      11.8550   10.8350   10.9920   10.7270  1039297.0000
max      15.9090   14.8600   14.9980   14.4470  4262825.0000
# the mean of the Open column alone
df.Open.mean()
10.786248049922001
# the value at the 95% quantile, linear interpolation by default
df.Open.quantile(0.95)
14.187
# how often each value appears in the Open column
df.Open.value_counts().head()
9.8050    12
9.8630    10
9.8440    10
9.8730    10
9.8830     8
Name: Open, dtype: int64

Missing values

Delete or fill missing values.

# drop any row containing NaN
df.dropna(how='any')
# drop any column containing NaN
df.dropna(how='any', axis=1)
# replace NaN values with 5
df.fillna(value=5)

Sorting

Sort by rows or columns; by default this does not modify the source data either.

# sort by column name
df.sort_index(axis=1).head()
             Close    Code        Date    High     Low     Open       Volume
date
2015-12-24  9.8230  000001  2015-12-24  9.9980  9.7440  10.9190  640229.0000
2015-12-25  1.0000  000001  2015-12-25  1.0000  9.8150  10.8550  399845.0000
2015-12-28  1.0000  000001  2015-12-28  1.0000  9.5370  10.8950  822408.0000
2015-12-29  9.6240  000001  2015-12-29  9.6320  9.5290  10.5450  619802.0000
2015-12-30  9.6320  000001  2015-12-30  9.6400  9.5130  10.6240  532667.0000
# sort by row index, descending
df.sort_index(ascending=False).head()
                  Date     Open   Close    High     Low       Volume    Code
date
2018-08-08  2018-08-08  10.1600  9.1100  9.1600  9.0900  153901.0000  000001
2018-08-07  2018-08-07   9.9600  9.1700  9.1700  8.8800  690423.0000  000001
2018-08-06  2018-08-06   9.9400  8.9400  9.1100  8.8900  554010.0000  000001
2018-08-03  2018-08-03   9.9300  8.9100  9.1000  8.9100  476546.0000  000001
2018-08-02  2018-08-02  10.1300  8.9400  9.1500  8.8800  931401.0000  000001

Sort by the values of a column

# sort by the values of the Open column, ascending
df.sort_values(by="Open")
                  Date    Open   Close    High     Low       Volume    Code
date
2016-03-01  2016-03-01  8.6580  7.7220  7.7770  7.6260  377910.0000  000001
2016-02-15  2016-02-15  8.6900  7.7930  7.8410  7.6820  278499.0000  000001
2016-01-29  2016-01-29  8.7540  7.9610  8.0240  7.7140  544435.0000  000001
2016-03-02  2016-03-02  8.7620  8.0400  8.0640  7.7380  676613.0000  000001
2016-02-26  2016-02-26  8.7770  7.7930  7.8250  7.6900  392154.0000  000001

Merge

concat merges along the row direction or the column direction.

# take rows 0 to 2, 2 to 4 and 4 to 9 as a list, then merge them with concat
# along axis=0, the row direction; when not specified, axis defaults to 0
split_rows = [df.iloc[0:2, :], df.iloc[2:4, :], df.iloc[4:9]]
pd.concat(split_rows)
                  Date     Open   Close    High     Low       Volume    Code
date
2015-12-24  2015-12-24  10.9190  9.8230  9.9980  9.7440  640229.0000  000001
2015-12-25  2015-12-25  10.8550  1.0000  1.0000  9.8150  399845.0000  000001
2015-12-28  2015-12-28  10.8950  1.0000  1.0000  9.5370  822408.0000  000001
2015-12-29  2015-12-29  10.5450  9.6240  9.6320  9.5290  619802.0000  000001
2015-12-30  2015-12-30  10.6240  9.6320  9.6400  9.5130  532667.0000  000001
2015-12-31  2015-12-31  10.6320  9.5450  9.6560  9.5370  491258.0000  000001
2016-01-04  2016-01-04  10.5530  8.9950  9.5770  8.9400  563497.0000  000001
2016-01-05  2016-01-05   9.9720  9.0750  9.2100  8.8760  663269.0000  000001
2016-01-06  2016-01-06  10.0910  9.1790  9.2020  9.0670  515706.0000  000001
# take columns 1 to 2, 2 to 4 and 4 onwards as a list, then merge them with
# concat along axis=1, the column direction
split_columns = [df.iloc[:, 1:2], df.iloc[:, 2:4], df.iloc[:, 4:]]
pd.concat(split_columns, axis=1).head()
               Open   Close    High     Low       Volume    Code
date
2015-12-24  10.9190  9.8230  9.9980  9.7440  640229.0000  000001
2015-12-25  10.8550  1.0000  1.0000  9.8150  399845.0000  000001
2015-12-28  10.8950  1.0000  1.0000  9.5370  822408.0000  000001
2015-12-29  10.5450  9.6240  9.6320  9.5290  619802.0000  000001
2015-12-30  10.6240  9.6320  9.6400  9.5130  532667.0000  000001

Append rows; correspondingly there is also insert, which inserts at a specified position.

# append the first row after the last row
df.append(df.iloc[0, :], ignore_index=True).tail()
           Date     Open   Close    High     Low       Volume    Code
637  2018-08-03   9.9300  8.9100  9.1000  8.9100  476546.0000  000001
638  2018-08-06   9.9400  8.9400  9.1100  8.8900  554010.0000  000001
639  2018-08-07   9.9600  9.1700  9.1700  8.8800  690423.0000  000001
640  2018-08-08  10.1600  9.1100  9.1600  9.0900  153901.0000  000001
641  2015-12-24  10.9190  9.8230  9.9980  9.7440  640229.0000  000001

Copying objects

Since a dataframe is a reference object, the copy method must be called explicitly to copy the whole dataframe object.

Plotting

pandas plots with matplotlib. For more detailed charts, use matplotlib directly, but for quick simple plots pandas works well.

Since including the images here would be cumbersome, they are omitted; see the pandas-blog.ipynb file in the resource files, or type the code yourself.

# this runs in a notebook; to show plots inline, the following is needed
%matplotlib inline
# draw a line chart of the Open, Low, High, Close columns
df[["Open", "Low", "High", "Close"]].plot()
# draw an area chart
df[["Open", "Low", "High", "Close"]].plot(kind="area")

Reading and writing data

Read and write common file formats, such as csv, excel and json, and even the ==system clipboard==. That last one is sometimes very handy: content selected and copied with the mouse can be read directly into a dataframe object.

# save the df data to stock.csv in the current working directory
df.to_csv("stock.csv")
# view the first five lines of stock.csv
with open("stock.csv") as rf:
    print(rf.readlines()[:5])
['date,Date,Open,Close,High,Low,Volume,Code\n', '2015-12-24,2015-12-24,9.919,9.823,9.998,9.744,640229.0,000001\n', '2015-12-25,2015-12-25,9.855,9.879,9.927,9.815,399845.0,000001\n', '2015-12-28,2015-12-28,9.895,9.537,9.919,9.537,822408.0,000001\n', '2015-12-29,2015-12-29,9.545,9.624,9.632,9.529,619802.0,000001\n']
# read stock.csv and use the first column as the index
df2 = pd.read_csv("stock.csv", index_col=0)
df2.head()
                  Date    Open   Close    High     Low       Volume  Code
date
2015-12-24  2015-12-24  9.9190  9.8230  9.9980  9.7440  640229.0000     1
2015-12-25  2015-12-25  9.8550  9.8790  9.9270  9.8150  399845.0000     1
2015-12-28  2015-12-28  9.8950  9.5370  9.9190  9.5370  822408.0000     1
2015-12-29  2015-12-29  9.5450  9.6240  9.6320  9.5290  619802.0000     1
2015-12-30  2015-12-30  9.6240  9.6320  9.6400  9.5130  532667.0000     1
# read stock.csv with the first column as the index, and read 000001 as a
# str type, otherwise it is parsed as an integer
df2 = pd.read_csv("stock.csv", index_col=0, dtype={"Code": str})
df2.head()

A simple example

Here we process a web log as an example. It may not be very practical, since ELK handles this kind of task with ease, but doing it yourself is fine too.

Analyze the access.log log file: https://raw.githubusercontent.com/Apache-Labor/labor/master/labor-04/labor-04-example-access.log
Log format and example

# log format
# field reference: https://ru.wikipedia.org/wiki/Access.log
%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"
# concrete example
75.249.65.145 US - [51.003372] "GET /cms/tina-access-editor-for-download/ HTTP/1.1" 200 7113 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" www.example.com 124.165.3.7 443 redirect-handler - + "-" Vea2i8CoAwcAADevXAgAAAAB TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 701 12118 -% 88871 803 0 0 0

Read and parse the log file

Parsing log files

import re

HOST = r'^(?P<host>.*?)'
SPACE = r'\s'
IDENTITY = r'\S+'
USER = r'\S+'
TIME = r'\[(?P<time>.*?)\]'
# REQUEST = r'\"(?P<request>.*?)\"'
REQUEST = r'\"(?P<method>.*?)\s(?P<path>.*?)\s(?P<protocol>.*?)\"'
STATUS = r'(?P<status>\d{3})'
SIZE = r'(?P<size>\S+)'
REFER = r'\S+'
USER_AGENT = r'\"(?P<user_agent>.*?)\"'
REGEX = (HOST + SPACE + IDENTITY + SPACE + USER + SPACE + TIME + SPACE
         + REQUEST + SPACE + STATUS + SPACE + SIZE + SPACE + REFER
         + SPACE + USER_AGENT + SPACE)
line = ('79.81.243.171 - - [30/Mar/2009:20:58:31 +0200] "GET /exemples.php HTTP/1.1" 200 11481 '
        '"http://www.facades.fr/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; '
        '.NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)" "-"')
reg = re.compile(REGEX)
reg.match(line).groups()
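The same named-group idea on a minimal pattern, to show what match objects return (the pattern and line here are simplified versions of the ones above):

```python
import re

# only capture the host and the timestamp
PAT = r'^(?P<host>\S+)\s\S+\s\S+\s\[(?P<time>.*?)\]'
line = '79.81.243.171 - - [30/Mar/2009:20:58:31 +0200] "GET /exemples.php HTTP/1.1" 200 11481'
m = re.match(PAT, line)
# groupdict() maps every named group to its captured text
print(m.groupdict())
```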

Inject data into DataFrame objects

COLUMNS = ["Host", "Time", "Method", "Path", "Protocol", "Status", "Size", "User_Agent"]
field_lis = []
with open("access.log") as rf:
    for line in rf:
        # some records do not match, so the exception needs to be caught;
        # the data that cannot be matched looks like this:
        # 80.32.156.105 - - [27/Mar/2009:13:39:51 +0100] "GET HTTP/1.1" 400 - "-" "-"
        # since the emphasis here is not on writing regular expressions, skip them
        try:
            fields = reg.match(line).groups()
        except Exception:
            # print(e)
            # print(line)
            continue
        field_lis.append(fields)

log_df = pd.DataFrame(field_lis)
# set the column names
log_df.columns = COLUMNS

def parse_time(value):
    try:
        return pd.to_datetime(value)
    except Exception as e:
        print(e)
        print(value)

# change the values to a time format pandas can parse
log_df.Time = log_df.Time.apply(lambda x: x.replace(":", " ", 1))
log_df.Time = log_df.Time.apply(parse_time)
# use the Time column as the index, then drop the column itself
log_df.index = pd.to_datetime(log_df.Time)
log_df.drop("Time", axis=1, inplace=True)
log_df.head()
                              Host Method                  Path  Protocol Status  Size                                       User_Agent
Time
2009-03-22 06:00:32  88.191.254.20    GET                     /  HTTP/1.0    200  8674                                                -
2009-03-22 06:06:20  66.249.66.231    GET  /popup.php?choix=-89  HTTP/1.1    200  1870  Mozilla/5.0 (compatible; Googlebot/2.1; +htt...
...

View data types

# view the data types
log_df.dtypes
Host          object
Method        object
Path          object
Protocol      object
Status        object
Size          object
User_Agent    object
dtype: object

As can be seen above, apart from the datetime information now held in the index, all the fields are object, but Size and Status should be numbers.

def parse_number(value):
    try:
        return pd.to_numeric(value)
    except Exception:
        # unparsable values (e.g. "-") become 0
        return 0

# change the Size and Status field values to a numeric type
log_df[["Status", "Size"]] = log_df[["Status", "Size"]].apply(lambda x: x.apply(parse_number))
log_df.dtypes
Host          object
Method        object
Path          object
Protocol      object
Status         int64
Size           int64
User_Agent    object
dtype: object
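An alternative to the try/except helper is pd.to_numeric with errors="coerce", which turns unparsable values into NaN that can then be filled. A sketch on toy values:

```python
import pandas as pd

size = pd.Series(["11481", "-", "8674"])
# "-" cannot be parsed; coerce it to NaN, then fill with 0 and cast to int
clean = pd.to_numeric(size, errors="coerce").fillna(0).astype("int64")
print(clean.tolist())
```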

Statistical status data

# count the occurrences of each Status value
log_df.Status.value_counts()
200    5737
304    1540
404    1186
400     251
302      37
403       3
206       2
Name: Status, dtype: int64
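value_counts can also report proportions instead of counts via normalize=True (shown on toy data, not the log):

```python
import pandas as pd

status = pd.Series([200, 200, 200, 404])
# each count is divided by the total number of values
ratio = status.value_counts(normalize=True)
print(ratio)
```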

Draw a pie chart

log_df.Status.value_counts().plot(kind="pie", figsize=(10, 8))

View log file time span

log_df.index.max() - log_df.index.min()
Timedelta('15 days 11:11:43')
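The subtraction returns a pandas Timedelta, whose components are easy to pull out. The same computation on the two boundary timestamps directly:

```python
import pandas as pd

span = pd.Timestamp("2009-04-06 17:12:15") - pd.Timestamp("2009-03-22 06:00:32")
print(span)        # the full Timedelta
print(span.days)   # just the day component
```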

View the start and end time separately

print(log_df.index.max())
print(log_df.index.min())
2009-04-06 17:12:15
2009-03-22 06:00:32

In the same way you can compute statistics for the Method and User_Agent fields, although the User_Agent values need some cleaning first.

Count the top 10 IP addresses

log_df.Host.value_counts().head(10)
91.121.31.184     745
88.191.254.20     441
41.224.252.122    420
194.2.62.185      255
86.75.35.144      184
208.89.192.106    170
79.82.3.8         161
90.3.72.207       157
62.147.243.132    150
81.249.221.143    141
Name: Host, dtype: int64

Draw the request trend chart

log_df2 = log_df.copy()
# add a Request field to each row, with the value 1
log_df2["Request"] = 1
# count the number of requests per hour, replace NaN values with 0, and
# finally draw a 16x10 line chart
log_df2.Request.resample("H").sum().fillna(0).plot(kind="line", figsize=(16, 10))
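The resample("H").sum() step on a concrete toy series, to show the hourly bucketing (the timestamps below are invented):

```python
import pandas as pd

idx = pd.to_datetime([
    "2009-03-22 06:05:00",
    "2009-03-22 06:45:00",
    "2009-03-22 07:10:00",
])
req = pd.Series([1, 1, 1], index=idx)
# two requests fall into the 06:00 bucket, one into 07:00
hourly = req.resample("H").sum()
print(hourly.tolist())
```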

Plotting each status separately

# resample the 200, 304 and 404 statuses separately and put them in a list
req_df_lis = [
    log_df2[log_df2.Status == 200].Request.resample("H").sum().fillna(0),
    log_df2[log_df2.Status == 304].Request.resample("H").sum().fillna(0),
    log_df2[log_df2.Status == 404].Request.resample("H").sum().fillna(0),
]
# combine the three dataframes together
req_df = pd.concat(req_df_lis, axis=1)
req_df.columns = [200, 304, 404]
# draw
req_df.plot(figsize=(16, 10))

Reference link

https://pandas.pydata.org/pandas-docs/stable/index.html

Source code and other resource files

https://github.com/youerning/blog/tree/master/pandas
