2025-02-20 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Which six probability distributions come up most often in data science, and how can each be implemented in Python? Many newcomers are unsure where to start, so this article summarizes the key ideas behind each distribution and shows Python code for plotting it.
Introduction
A solid statistical background can be of great benefit to the daily work of data scientists. Every time we start exploring a new dataset, we first do exploratory data analysis (EDA) to understand the probability distribution of certain features. If we can identify specific patterns in how the data are distributed, we can choose the machine learning model that best suits our case and get better results in less time (with fewer optimization steps). In fact, some machine learning models are designed to work best under particular distributional assumptions, so knowing which probability distribution we are dealing with helps us decide which models are most appropriate.
Different types of data
Every dataset we work with represents a sample of a population. Using this sample, we can try to understand its probability distribution and then use that to make predictions about the whole population.
Suppose we want to predict house prices. We could find a dataset (our sample) containing house prices in San Francisco and, after some statistical analysis, we might be able to make fairly accurate predictions of house prices in other cities in the United States (our population).
Datasets consist of two main types of data: numerical (such as integers and floating-point numbers) and categorical (such as names or computer brands).
Numerical data can be further divided into two categories: discrete and continuous. Discrete data can only take certain values (for example, the number of students in a school), while continuous data can take any real or fractional value (for example, height and weight).
From a discrete random variable we can compute a probability mass function (PMF), while from a continuous random variable we can obtain a probability density function (PDF).
The probability mass function gives the probability that the variable is equal to a certain value. The value of a probability density function is not itself a probability; it has to be integrated over a given range to yield one.
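To make the difference between the two functions concrete, here is a minimal sketch using `scipy.stats`. The specific numbers (10 coin tosses, a normal distribution with standard deviation 0.1) are assumptions chosen only for illustration.

```python
from scipy import stats

# Discrete case: the PMF value is itself a probability.
# Probability of exactly 3 heads in 10 tosses of a fair coin.
p_three_heads = stats.binom.pmf(3, 10, 0.5)

# Continuous case: the PDF value is a density, not a probability.
# A narrow normal distribution has a density above 1 at its peak.
narrow = stats.norm(loc=0, scale=0.1)
peak_density = narrow.pdf(0)

# A probability is obtained by integrating the density over a range,
# which equals the difference of the CDF at the two endpoints.
p_interval = narrow.cdf(0.1) - narrow.cdf(-0.1)

print(f"P(X = 3) from the PMF: {p_three_heads:.4f}")
print(f"Density at x = 0: {peak_density:.4f}")
print(f"P(-0.1 <= X <= 0.1): {p_interval:.4f}")
```

The density at the peak is larger than 1, which would be impossible for a probability, while the integrated value over the interval stays between 0 and 1.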
There are many different probability distributions in nature. In this article, I will introduce to you the most commonly used probability distributions in data science.
I will provide code on how to create each different probability distribution. First, let's import all the necessary libraries:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    import seaborn as sns
Bernoulli distribution
The Bernoulli distribution is one of the easiest distributions to understand and can be used as a starting point for deriving more complex ones. It has only two possible outcomes. A simple example is a single toss of a biased or unbiased coin. In this example, the probability of heads is p, while the probability of tails is (1 - p), because the probabilities of mutually exclusive events that cover all possible outcomes must sum to 1.
    # Probabilities of the two outcomes (0 and 1) of a loaded coin
    probs = np.array([0.75, 0.25])
    face = [0, 1]
    plt.bar(face, probs)
    plt.title('Loaded coin Bernoulli Distribution', fontsize=12)
    plt.ylabel('Probability', fontsize=12)
    plt.xlabel('Loaded coin Outcome', fontsize=12)
    axes = plt.gca()
    # Show the full probability range on the y-axis
    axes.set_ylim([0, 1])
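As a complement to the bar chart, the following sketch draws simulated tosses with `scipy.stats.bernoulli` and compares the observed frequency with p. The value p = 0.25 for outcome 1 mirrors the loaded coin above; the sample size and seed are arbitrary assumptions.

```python
from scipy import stats

p_one = 0.25  # assumed probability of outcome 1 for the loaded coin

# The two PMF values are (1 - p) and p, and they sum to 1.
pmf_zero = stats.bernoulli.pmf(0, p_one)
pmf_one = stats.bernoulli.pmf(1, p_one)

# Simulate 10,000 tosses; each toss is 0 or 1.
tosses = stats.bernoulli.rvs(p_one, size=10_000, random_state=42)
observed = tosses.mean()  # fraction of tosses that came up 1

print(f"P(0) = {pmf_zero:.2f}, P(1) = {pmf_one:.2f}")
print(f"Observed frequency of outcome 1: {observed:.3f}")
```

With enough tosses, the observed frequency should settle close to p.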
Uniform distribution
The uniform distribution can be seen as a generalization of the Bernoulli distribution. In this case, any number of outcomes is allowed, and every outcome has the same probability. For example, when you roll a fair die, there are several possible outcomes, each with the same probability.
    # Six faces, each with probability 1/6
    probs = np.full(6, 1/6)
    face = [1, 2, 3, 4, 5, 6]
    plt.bar(face, probs)
    plt.ylabel('Probability', fontsize=12)
    plt.xlabel('Dice Roll Outcome', fontsize=12)
    plt.title('Fair Dice Uniform Distribution', fontsize=12)
    axes = plt.gca()
    axes.set_ylim([0, 1])
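The same fair die can also be described with `scipy.stats.randint`, SciPy's discrete uniform distribution, and checked against simulated rolls. This is only a sketch; the number of rolls and the seed are assumptions.

```python
import numpy as np
from scipy import stats

# Discrete uniform distribution on {1, ..., 6}; the upper bound is exclusive.
die = stats.randint(1, 7)
faces = np.arange(1, 7)
pmf = die.pmf(faces)  # every face has probability 1/6

# Simulate rolls and measure how often each face appears.
rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=60_000)
frequencies = np.bincount(rolls, minlength=7)[1:] / rolls.size

for f, theory, seen in zip(faces, pmf, frequencies):
    print(f"face {f}: theoretical {theory:.4f}, observed {seen:.4f}")
print(f"Expected value of one roll: {die.mean():.1f}")
```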
Binomial distribution
The binomial distribution can be thought of as the sum of the outcomes of independent events that each follow a Bernoulli distribution. It is therefore used for binary-outcome events where the probability of success stays the same across all trials. The distribution takes two parameters as input: the number of trials and the probability of success in each trial. The simplest example of a binomial distribution is tossing a biased or unbiased coin a certain number of times.
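The "sum of Bernoulli trials" view can be checked numerically. The sketch below simulates many experiments of 20 Bernoulli trials each, counts the successes, and compares the resulting frequencies with `stats.binom.pmf`. The values n = 20 and p = 0.3, the number of experiments, and the seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

n_trials, p = 20, 0.3
rng = np.random.default_rng(0)

# Each row is one experiment made of n_trials Bernoulli trials (True = success).
bernoulli_trials = rng.random((50_000, n_trials)) < p
successes = bernoulli_trials.sum(axis=1)

# Empirical distribution of the number of successes per experiment.
empirical = np.bincount(successes, minlength=n_trials + 1) / len(successes)

# Theoretical binomial probabilities for 0..n_trials successes.
theoretical = stats.binom.pmf(np.arange(n_trials + 1), n_trials, p)

max_gap = np.abs(empirical - theoretical).max()
print(f"Largest gap between simulation and binomial PMF: {max_gap:.4f}")
```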
You can take a look at the binomial distribution under different probabilities:
    # pmf(random_variable, number_of_trials, probability)
    for prob in range(3, 10, 3):
        x = np.arange(0, 25)
        binom = stats.binom.pmf(x, 20, 0.1*prob)
        plt.plot(x, binom, '-o', label="p = {:f}".format(0.1*prob))
    plt.xlabel('Random Variable', fontsize=12)
    plt.ylabel('Probability', fontsize=12)
    plt.title("Binomial Distribution varying p")
    plt.legend()
The main characteristics of binomial distribution are:
Across multiple trials, each trial is independent of the others (the result of one trial does not affect any other).
Each trial can produce only two possible results (for example, winning or losing), with probabilities p and (1 - p), respectively.
If the success probability (p) and the number of trials (n) are known, the probability of obtaining x successes in the n trials can be calculated using the following formula:
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$$
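As a quick check of the standard binomial formula, P(X = x) = C(n, x) p^x (1 - p)^(n - x), the sketch below evaluates it by hand and compares the result with `stats.binom.pmf`. The values n = 20, p = 0.3 and x = 6 are arbitrary assumptions.

```python
import math
from scipy import stats

n, p, x = 20, 0.3, 6  # illustrative values, not taken from the article

# Binomial formula evaluated directly:
# number of ways to pick x successes, times the probability of each such sequence.
by_hand = math.comb(n, x) * p**x * (1 - p) ** (n - x)

# The same probability from SciPy.
from_scipy = stats.binom.pmf(x, n, p)

print(f"By hand: {by_hand:.6f}")
print(f"SciPy:   {from_scipy:.6f}")
```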
Normal (Gaussian) distribution
The normal (Gaussian) distribution is one of the most commonly used distributions in data science.
Many phenomena in daily life are often described as approximately normally distributed, such as income in an economy, students' average grades, average height and so on. In addition, the central limit theorem states that, under appropriate conditions, the suitably standardized mean of a large number of independent random variables converges in distribution to a normal distribution.
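The central limit theorem can be illustrated with a small simulation. The sketch below averages draws from a strongly skewed exponential distribution and shows that the standardized means are far more symmetric than the raw draws. The sample size of 50, the number of repetitions, and the seed are assumptions made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_size = 50

# 20,000 samples, each with 50 draws from an exponential distribution
# with mean 1 and standard deviation 1 (a clearly non-normal shape).
draws = rng.exponential(scale=1.0, size=(20_000, sample_size))

# Mean of each sample, standardized with the known mean and standard error.
sample_means = draws.mean(axis=1)
z = (sample_means - 1.0) / (1.0 / np.sqrt(sample_size))

skew_raw = stats.skew(draws.ravel())  # skewness of the individual draws
skew_means = stats.skew(z)            # skewness of the standardized means

print(f"Skewness of raw draws:          {skew_raw:.2f}")
print(f"Skewness of standardized means: {skew_means:.2f}")
print(f"Mean and std of standardized means: {z.mean():.3f}, {z.std():.3f}")
```

The standardized means have mean close to 0, standard deviation close to 1 and much lower skewness than the raw data, which is what convergence toward a standard normal distribution predicts.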
    n = np.arange(-50, 50)
    mean = 0
    # pdf(x, mean, standard_deviation)
    normal = stats.norm.pdf(n, mean, 10)
    plt.plot(n, normal)
    plt.xlabel('Distribution', fontsize=12)
    plt.ylabel('Probability', fontsize=12)
    plt.title("Normal Distribution")
We can see the characteristics of normal distribution:
The curve is symmetric about its center. Therefore, the mean, mode and median are all equal, and all values are distributed symmetrically around the mean.
The area under the distribution curve is equal to 1 (the sum of all probabilities must be equal to 1).
The normal distribution is described by the following formula, where μ is the mean and σ is the standard deviation:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
When working with the normal distribution, the mean and standard deviation play a very important role. If we know their values, we can easily find the probability that a value falls within a given range. According to the properties of the normal distribution, about 68% of the data lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.
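The 68 / 95 / 99.7 figures can be reproduced from the cumulative distribution function. The following sketch uses `stats.norm` with mean 0 and standard deviation 10, the same parameters as the plot above; any other mean and standard deviation would give the same percentages.

```python
from scipy import stats

mean, std = 0, 10
dist = stats.norm(mean, std)

# Probability of falling within k standard deviations of the mean:
# CDF at the upper bound minus CDF at the lower bound.
coverage = {
    k: dist.cdf(mean + k * std) - dist.cdf(mean - k * std)
    for k in (1, 2, 3)
}

for k, prob in coverage.items():
    print(f"Within {k} standard deviation(s): {prob:.4f}")
```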
Many machine learning models are designed to work best with data that follow a normal distribution. Here are some examples:
Gaussian naive Bayes classifier
Linear discriminant analysis
Quadratic discriminant analysis
Regression models based on least squares
In some cases, non-normal data can be made closer to normal by applying a logarithmic or square-root transformation.
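To illustrate the effect of such a transformation, the sketch below generates right-skewed synthetic data and compares its skewness before and after taking the logarithm. The data are hypothetical (drawn from a log-normal distribution, for which the log transform works especially well); real data will usually respond less cleanly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed data, e.g. synthetic income-like values.
values = rng.lognormal(mean=10, sigma=0.8, size=5_000)

# Skewness near 0 indicates a roughly symmetric distribution.
skew_before = stats.skew(values)
skew_after = stats.skew(np.log(values))

print(f"Skewness before log transform: {skew_before:.2f}")
print(f"Skewness after log transform:  {skew_after:.2f}")
```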
Poisson distribution
The Poisson distribution is typically used to model how often events occur and to predict how many times an event is likely to occur in a given period of time.
For example, insurance companies often use the Poisson distribution for risk analysis (predicting the number of car accidents within a predetermined period of time) to decide the pricing of car insurance.
When working with the Poisson distribution, we know the average time between events, but the exact timing of each event is random.
The Poisson distribution can be modeled using the following formula, where λ represents the average number of events per unit of time (or unit of area) and k is the number of events observed:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
The main characteristics of Poisson distribution are:
Events are independent of each other.
An event can occur any number of times (within a defined period of time).
Two events cannot happen at the same time.
The average rate at which events occur is constant.
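As a small worked example of the standard Poisson formula, P(X = k) = λ^k e^(-λ) / k!, the sketch below evaluates it by hand and with `stats.poisson`. The rate of 3 events per period is a hypothetical value (for instance, an assumed average number of accidents per week).

```python
import math
from scipy import stats

lam = 3  # hypothetical average number of events per period
k = 2    # number of events we ask about

# Poisson formula evaluated directly.
by_hand = lam**k * math.exp(-lam) / math.factorial(k)

# The same probability from SciPy, plus a cumulative probability.
from_scipy = stats.poisson.pmf(k, lam)
p_at_most_two = stats.poisson.cdf(2, lam)

print(f"P(X = 2) by hand: {by_hand:.4f}")
print(f"P(X = 2) SciPy:   {from_scipy:.4f}")
print(f"P(X <= 2):        {p_at_most_two:.4f}")
```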
The following code plots how changing the value of λ affects the Poisson distribution:
    for lambd in range(2, 8, 2):
        n = np.arange(0, 10)
        # pmf(number_of_events, average_rate)
        poisson = stats.poisson.pmf(n, lambd)
        plt.plot(n, poisson, '-o', label="λ = {:f}".format(lambd))
    plt.xlabel('Number of Events', fontsize=12)
    plt.ylabel('Probability', fontsize=12)
    plt.title("Poisson Distribution varying λ")
    plt.legend()
Exponential distribution
The exponential distribution is used to model the time between events.
For example, suppose we work in a restaurant and want to predict the time intervals between the arrivals of different customers. The exponential distribution is an ideal starting point for this kind of problem. Another common application of the exponential distribution is survival analysis (such as the expected lifetime of equipment or machines).
The exponential distribution is controlled by the parameter λ. The larger the value of λ, the faster the curve decays.
    for lambd in range(1, 10, 3):
        x = np.arange(0, 15, 0.1)
        # Exponential density with rate 0.1*lambd
        y = 0.1*lambd*np.exp(-0.1*lambd*x)
        plt.plot(x, y, label="λ = {:f}".format(0.1*lambd))
    plt.xlabel('Random Variable', fontsize=12)
    plt.ylabel('Probability', fontsize=12)
    plt.title("Exponential Distribution varying λ")
    plt.legend()
The exponential distribution is modeled using the following formula, where λ is the rate parameter:
$$f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0$$
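The same density is available through `scipy.stats.expon`, with one caveat: SciPy parameterizes the distribution by `scale`, which equals 1/λ, rather than by λ itself. The sketch below assumes a hypothetical rate of 0.4 arrivals per minute for the restaurant example.

```python
import math
from scipy import stats

lam = 0.4  # hypothetical rate: average number of arrivals per minute

# SciPy uses scale = 1 / lambda for the exponential distribution.
wait = stats.expon(scale=1 / lam)

# Density at x = 2 from SciPy and from lambda * exp(-lambda * x).
density_scipy = wait.pdf(2.0)
density_by_hand = lam * math.exp(-lam * 2.0)

# Mean waiting time is 1 / lambda; sf(t) gives P(wait > t) = exp(-lambda * t).
mean_wait = wait.mean()
p_longer_than_5 = wait.sf(5.0)

print(f"Density at 2 minutes: {density_scipy:.4f} (by hand: {density_by_hand:.4f})")
print(f"Mean waiting time: {mean_wait:.2f} minutes")
print(f"P(wait > 5 minutes): {p_longer_than_5:.4f}")
```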