2025-04-09 Update From: SLTechnology News & Howtos (Development)
Shulou(Shulou.com)06/01 Report--
This article gives a worked introduction to data processing and visualization in Python. The examples are explained in detail and should serve as a useful reference; interested readers are encouraged to follow along.
I. Preliminary use of NumPy
A table is a common way to present data, but a machine cannot interpret it directly, so we need to convert the table into a different form.
In machine learning, the usual representation is a data matrix.
Looking at the table, we see two kinds of attributes in the matrix: numeric and Boolean. Let us now build a model that describes the table:
```python
# Build the data matrix
import numpy as np

data = np.mat([[1, 200, 105, 3, False],
               [2, 165, 80, 2, False],
               [3, 184.5, 120, 2, False],
               [4, 116, 70.8, 1, False],
               [5, 270, 150, 4, True]])
row = 0
for line in data:
    row += 1
print(row)
print(data.size)
print(data)
```
The first line of code imports NumPy and renames it np. We then use NumPy's mat() method to build a data matrix; row is a counter variable used to tally the number of rows.
The size here is 25, i.e. a 5 × 5 table. You can inspect the data by printing data directly:
II. Using the Matplotlib package for graphical data processing
Look at the table above again: the second column holds the house prices. The differences are hard to see intuitively from raw numbers alone, so we want to draw them (plotting the distribution of the data is the usual way to study numeric differences and anomalies):
```python
import numpy as np
import scipy.stats as stats
import pylab

data = np.mat([[1, 200, 105, 3, False],
               [2, 165, 80, 2, False],
               [3, 184.5, 120, 2, False],
               [4, 116, 70.8, 1, False],
               [5, 270, 150, 4, True]])
coll = []
for row in data:
    coll.append(row[0, 1])  # collect the second column (house price)
stats.probplot(coll, plot=pylab)
pylab.show()
```
Running this code generates the following plot:
So we can see the difference clearly.
A coordinate plot is meant to show the concrete values of the data across different rows and columns.
Of course, we can also display the data as a coordinate plot:
III. Theory and methods of deep learning: similarity calculation (can be skipped)
There are many ways to compute similarity; we choose the two most common: Euclidean similarity and cosine similarity.
1. Similarity calculation based on Euclidean distance.
Euclidean distance expresses the true distance between two points in two- or three-dimensional space (and generalizes to any number of dimensions). The formula is widely known, even if its name comes up less often:
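The formula itself was a figure in the original and did not survive extraction; for two points x = (x1, …, xn) and y = (y1, …, yn) it is the standard Euclidean distance:

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```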
So let's take a look at its practical application:
This table shows three users' ratings of the items:
d12 denotes the similarity between user 1 and user 2; then:
Similarly, for d13:
It can be seen that user 2 is more similar to user 1 (the smaller the distance, the greater the similarity).
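The comparison above can be sketched in a few lines of NumPy. The original rating table was an image and is not reproduced here, so the vectors u1, u2, and u3 below are made-up illustrative ratings, not the article's actual numbers:

```python
import numpy as np

# Hypothetical ratings by three users over four items (illustrative values)
u1 = np.array([5, 3, 4, 4])
u2 = np.array([4, 3, 5, 3])
u3 = np.array([1, 5, 2, 5])

d12 = np.linalg.norm(u1 - u2)  # Euclidean distance between user 1 and user 2
d13 = np.linalg.norm(u1 - u3)  # Euclidean distance between user 1 and user 3

# The smaller distance indicates the more similar pair
print(d12, d13)
```

With these sample numbers, d12 ≈ 1.73 while d13 = 5.0, so user 2 is the more similar one, matching the rule that a smaller distance means greater similarity.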
2. Similarity calculation based on cosine angle.
The cosine measure starts from the difference in the angle between the two vectors.
It can be seen that user 2 is more similar to user 1 than user 3 is (the more similar two targets are, the smaller the angle formed by their vectors).
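A matching sketch for the cosine measure, again with made-up rating vectors (u1, u2, and u3 are illustrative, not the article's data):

```python
import numpy as np

# Hypothetical ratings by three users over four items (illustrative values)
u1 = np.array([5, 3, 4, 4])
u2 = np.array([4, 3, 5, 3])
u3 = np.array([1, 5, 2, 5])

def cosine_sim(a, b):
    # cos(theta) = a.b / (|a| * |b|); values closer to 1 mean a smaller angle
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s12 = cosine_sim(u1, u2)
s13 = cosine_sim(u1, u3)
print(s12, s13)
```

Here s12 > s13, so user 2 forms the smaller angle with user 1, i.e. is the more similar of the two.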
IV. Visual display of data statistics (using precipitation in Bozhou City as an example)
Quartiles of the data
The quartiles are a kind of statistical cut point: sort the data from smallest to largest and divide it into four equal parts; the values at the three division points are the quartiles.
First quartile (Q1), also called the lower quartile
Second quartile (Q2), also called the median
Third quartile (Q3), also called the upper quartile
The difference between the third and first quartiles is called the interquartile range (IQR).
If n is the number of items, then:
Position of Q1 = (n + 1) × 0.25
Position of Q2 = (n + 1) × 0.50
Position of Q3 = (n + 1) × 0.75
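The positions above follow the (n + 1) convention. NumPy's np.percentile uses a different default interpolation (based on n − 1), so the sketch below illustrates the idea of quartiles and the IQR rather than reproducing the (n + 1) formula exactly; the sample values are made up:

```python
import numpy as np

# Made-up sample, already sorted from smallest to largest
sample = [7, 15, 36, 39, 40, 41]
q1, q2, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1  # interquartile range
print(q1, q2, q3, iqr)
```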
Quartile example:
Regarding rain.csv: the file is available on request. I am using the monthly precipitation of Bozhou City from 2010 to 2019.
```python
from pylab import *
import pandas as pd
import matplotlib.pyplot as plot

filepath = "C:\\Users\\AWAITXM\\Desktop\\rain.csv"
dataFile = pd.read_csv(filepath)
summary = dataFile.describe()
print(summary)
array = dataFile.iloc[:, :].values
boxplot(array)
plot.xlabel("year")
plot.ylabel("rain")
show()
```
The following is the result of running the plot:
This is the output produced with pandas.
The fluctuation range of the data can be clearly seen here.
It can be seen that there is a big difference in precipitation in different months, with the most in August and the least in January-April and October-December.
So how do you compare the increase and decrease of monthly precipitation?
```python
from pylab import *
import pandas as pd
import matplotlib.pyplot as plot

filepath = "C:\\Users\\AWAITXM\\Desktop\\rain.csv"
dataFile = pd.read_csv(filepath)
summary = dataFile.describe()
minRings = -1
maxRings = 99
nrows = 11
for i in range(nrows):
    dataRow = dataFile.iloc[i, 1:13]
    labelColor = (dataFile.iloc[i, 12] - minRings) / (maxRings - minRings)
    dataRow.plot(color=plot.cm.RdYlBu(labelColor), alpha=0.5)
plot.xlabel("Attribute")
plot.ylabel("Score")
show()
```
The result is shown in the figure:
It can be seen that precipitation rises and falls irregularly from month to month.
So, are the monthly precipitation values correlated?
```python
from pylab import *
import pandas as pd
import matplotlib.pyplot as plot

filepath = "C:\\Users\\AWAITXM\\Desktop\\rain.csv"
dataFile = pd.read_csv(filepath)
summary = dataFile.describe()
corMat = pd.DataFrame(dataFile.iloc[1:20].corr())
plot.pcolor(corMat)
plot.show()
```
The result is shown in the figure:
The color distribution is fairly uniform, indicating little correlation, so the monthly precipitation values can be regarded as independent.
That is all of "Python data processing and visualization example analysis". Thank you for reading! I hope the content is helpful; for more related knowledge, follow our industry information channel.