
PROBABILITY & STATISTICS IN PYTHON

from Learning Python for Data Analysis and Visualization by Jose Portilla https://www.udemy.com/learning-python-for-data-analysis-and-visualization/

Notes by Michael Brothers. Companion to the file Python for Data Analysis.

Table of Contents

Discrete Uniform Distributions
    Setting up a Discrete Uniform Distribution using Scipy
Continuous Uniform Distributions
    Setting up a Continuous Uniform Distribution using Scipy
Binomial Distribution
    Setting up a Binomial Distribution using Scipy
Poisson Distribution
    Setting up a Poisson Distribution using Scipy
Normal Distribution
    Setting up a Normal Distribution using Scipy
Sampling
T-Distributions
Hypothesis Testing
Chi-Square Distribution and the Chi-Square Test
Bayes' Theorem
Beta Distribution
APPENDIX I: Mathematical Symbols & Conventions


Standard Imports:

# if using Python 2 (a __future__ import must come before the other imports)
from __future__ import division
import numpy as np
from numpy.random import randn
import pandas as pd
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter/IPython only
%matplotlib inline

Discrete Uniform Distributions

A random variable X has a discrete uniform distribution if each of the n values in its range x_1, x_2, x_3, ..., x_n has equal probability. The probability mass function f(x_i) and, when the values are n consecutive integers, the variance are:

$$f(x_i) = \frac{1}{n}, \qquad \sigma^2 = \frac{n^2 - 1}{12}$$

Setting up a Discrete Uniform Distribution using Scipy

On the roll of a fair die:

from scipy.stats import randint
roll_options = [1,2,3,4,5,6]
# randint covers low <= x < high, so go to 7 to include 6
low, high = 1, 7
# .stats can also return skew & kurtosis (moments='mvsk')
mean, var = randint.stats(low, high)
print('The mean is %2.1f, and the variance is %2.2f' %(mean, var))

The mean is 3.5, and the variance is 2.92

Check by hand: $\mu = (6+1)/2 = 3.5$ and $\sigma^2 = ((6-1+1)^2 - 1)/12 = 2.92$

Make a PMF bar plot using matplotlib:

plt.bar(roll_options, randint.pmf(roll_options, low, high));

This gives a simple bar plot (blue) with wide, easily read bars. roll_options sets the x-values (1,2,3,4,5,6), and every bar has height 1/6 = 0.167.

For more info: http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.randint.html
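As a cross-check of the mean and variance for the fair die, the values can be enumerated directly from the PMF. This is a minimal sketch using only the Python standard library rather than SciPy; the variable names are illustrative and not part of the course material.

```python
from fractions import Fraction

# Faces of a fair six-sided die; each has probability 1/n.
faces = [1, 2, 3, 4, 5, 6]
n = len(faces)
p = Fraction(1, n)

# Mean and variance by direct enumeration of the PMF.
mean = sum(p * x for x in faces)
var = sum(p * (x - mean) ** 2 for x in faces)

# Closed forms for n consecutive integers starting at 1.
mean_formula = Fraction(n + 1, 2)
var_formula = Fraction(n ** 2 - 1, 12)

print(mean, var)                  # 7/2 35/12
print(float(mean), float(var))
```

Exact fractions are used so that the enumerated values can be compared with the closed forms without floating-point rounding.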

Continuous Uniform Distributions

For a continuous distribution, probability is calculated as the area under the probability density function; in the uniform case that curve is a horizontal straight line. For a uniform distribution on the interval [a, b], the probability density function f(x) and variance σ² are as follows:

$$f(x) = \frac{1}{b - a} \quad (a \le x \le b), \qquad \sigma^2 = \frac{(b - a)^2}{12}$$

Setting up a Continuous Uniform Distribution using Scipy

from scipy.stats import uniform
import numpy as np

%matplotlib inline

# Let's set an A and B
A=19
B=27

# Set x as 1000 linearly spaced points about A and B
# Note: linspace can't be used to plot perfectly vertical lines
x = np.linspace(A-5, B+5, 1000)

# Use uniform(loc=start point, scale=endpoint - start point)
rv = uniform(loc=A, scale=B-A)

# Plot the PDF of that uniform distribution
plt.plot(x, rv.pdf(x));
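To make the "area under a horizontal line" idea concrete, here is a small standard-library sketch for the same A=19, B=27 example. The helper names uniform_pdf and uniform_cdf and the 21 to 24 interval are assumptions chosen for illustration; they are not SciPy functions.

```python
# Uniform distribution on [A, B], matching the A and B used in the notes.
A, B = 19, 27

def uniform_pdf(x, a=A, b=B):
    """Density: constant 1/(b-a) inside [a, b], zero outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a=A, b=B):
    """P(X <= x): area of the rectangle from a up to x, clipped to [0, 1]."""
    return min(max((x - a) / float(b - a), 0.0), 1.0)

# Probability of landing between 21 and 24 = width * height = 3 * (1/8).
prob = uniform_cdf(24) - uniform_cdf(21)

# Variance from the closed form (b - a)^2 / 12.
variance = (B - A) ** 2 / 12.0

print(uniform_pdf(20))   # 0.125
print(prob)              # 0.375
print(variance)
```

Because the density is flat, every probability reduces to a rectangle: interval width times 1/(B-A).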


Binomial Distribution

A Bernoulli trial is a random experiment in which there are only two possible outcomes: success and failure. The number of successes in a series of n trials follows a binomial distribution so long as the probability of success p remains constant and the trials are independent of one another. The probability mass function of a binomial distribution is:

$$P(X = k) = C(n, k)\, p^k (1 - p)^{n - k} \quad \text{where} \quad C(n, k) = \frac{n!}{k!\,(n - k)!}$$

Setting up a Binomial Distribution using Scipy

For this example, a basketball player takes an average of 11 shots per game, with a 72% probability of success.

n, p = 11, .72
from scipy.stats import binom
# .stats can also return skew & kurtosis (moments='mvsk')
mean, var = binom.stats(n, p)
print('Mean = %1.2f StdDev = %1.2f' %(mean, var**0.5))

Mean = 7.92 StdDev = 1.49

In practice we might use whole numbers.

Plotting the probability mass function using matplotlib:

x = range(n+1)
Y = binom.pmf(x,n,p)

%matplotlib inline
plt.plot(x,Y,'o');

So what's the probability of making exactly 8 shots in a game? binom.pmf(8,11,.72) returns 0.261588 (26%)
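The 0.261588 figure can be reproduced by evaluating the PMF formula directly. The following is a sketch using only the standard library; binomial_pmf is a hypothetical helper written for this illustration, not a SciPy function.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    # C(n, k) = n! / (k! * (n - k)!), computed with exact integers.
    n_choose_k = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
    return n_choose_k * p ** k * (1 - p) ** (n - k)

n, p = 11, 0.72

# Probability of exactly 8 made shots out of 11.
p_eight = binomial_pmf(8, n, p)

# Mean and standard deviation from the closed forms n*p and sqrt(n*p*(1-p)).
mean = n * p
std = math.sqrt(n * p * (1 - p))

# The PMF over k = 0..n should add up to 1 (up to floating-point rounding).
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print('P(X = 8) = %.6f' % p_eight)                 # P(X = 8) = 0.261588
print('Mean = %.2f StdDev = %.2f' % (mean, std))   # Mean = 7.92 StdDev = 1.49
```

Here C(11, 8) = 165, so the result is 165 × 0.72^8 × 0.28^3.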

NOTE: Y is a 12-item numpy array. Y.sum() returns 1.0000000000000007 (that is, 1 up to floating-point rounding).

Just for fun, let's plot two binomial distributions together. A second player averages 15 shots per game with a 48% probability of success.

n1, p1 = 11, .72
n2, p2 = 15, .48
x1 = range(n1+1)
Y = binom.pmf(x1,n1,p1)
x2 = range(n2+1)
Z = binom.pmf(x2,n2,p2)

%matplotlib inline
plt.plot(x1,Y,'o')
plt.plot(x2,Z,'x',color='r');

To see the first player's skew and kurtosis values:

mean, var, skew, kurt = binom.stats(n1, p1, moments='mvsk')
print('skew = %s and kurtosis = %s' %(skew, kurt))

skew = -0.295468420143 and kurtosis = -0.0945165945166

For more info:
1.) http://en.wikipedia.org/wiki/Binomial_distribution
2.) http://stattrek.com/probability-distributions/binomial.aspx
3.) http://mathworld.wolfram.com/BinomialDistribution.html


Poisson Distribution

Considers the number of discrete events or occurrences over a specified interval or continuum (e.g. time, distance, etc.). The Poisson distribution takes just one parameter, the mean λ, and assigns a probability to every count k from zero to infinity.

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

where e = Euler's number = 2.718…

Setting up a Poisson Distribution using Scipy

On average, 10 customers visit a restaurant during lunchtime rush. What is the probability of seeing exactly 7 customers?

from scipy.stats import poisson
# this is our known mean for the interval (don't use "lambda" as a variable name; it is a reserved word in Python)
mu = 10
# this is our target for the probability mass function
x = 7
# returns the mean (mu) and variance (if we want)
mean, var = poisson.stats(mu)
# returns the value of the probability mass function at x
prob = poisson.pmf(x,mu)
print('There is a %2.2f%% chance that exactly %d customers show up at the lunch rush' %(100*prob, x))

There is a 9.01% chance that exactly 7 customers show up at the lunch rush

Note: poisson.pmf(0,10) returns 0.0000454 (about 4.5 thousandths of a percent chance that no customers show up)

Plotting the Poisson Distribution

k = range(30)
Y = poisson.pmf(k,mu)

%matplotlib inline
plt.bar(k,Y);

Applying a Cumulative Distribution Function

So what is the probability that more than 10 customers arrive? We need to sum up the value of every bar past the 10 customers bar. We can do this by using a Cumulative Distribution Function (CDF). This describes the probability that a random variable X with a given probability distribution (such as the Poisson in this case) will be found to have a value less than or equal to x. If we use the CDF to calculate the probability of 10 or fewer customers showing up, we can subtract that probability from the total probability space, which is 1 (the sum of all probabilities).

x = poisson.cdf(10,10)
# multiply by 100 so the value prints as a percentage
print('The probability that more than 10 customers show up is %2.2f%%.' %(100*(1-x)))

The probability that more than 10 customers show up is 41.70%.
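Both restaurant figures (9.01% for exactly 7 customers and 41.70% for more than 10) can be recomputed from the Poisson formula itself. This is a standard-library sketch; poisson_pmf and poisson_cdf are hypothetical helper names for this illustration, not the SciPy functions.

```python
import math

def poisson_pmf(k, mu):
    """P(X = k) = mu**k * exp(-mu) / k!"""
    return mu ** k * math.exp(-mu) / math.factorial(k)

def poisson_cdf(k, mu):
    """P(X <= k): sum of the PMF over 0..k."""
    return sum(poisson_pmf(i, mu) for i in range(k + 1))

mu = 10

p_seven = poisson_pmf(7, mu)
# "More than 10" is the complement of "10 or fewer".
p_more_than_ten = 1 - poisson_cdf(10, mu)

print('P(X = 7)  = %.2f%%' % (100 * p_seven))           # P(X = 7)  = 9.01%
print('P(X > 10) = %.2f%%' % (100 * p_more_than_ten))   # P(X > 10) = 41.70%
```

Taking the complement avoids summing the infinite tail of bars beyond 10.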

For more info:
1.) http://en.wikipedia.org/wiki/Poisson_distribution#Definition
2.) http://stattrek.com/probability-distributions/poisson.aspx
3.) http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html


Normal Distribution

The distribution is defined by the probability density function equation:

$$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2} \quad \text{where} \quad z = \frac{x - \mu}{\sigma}$$

The area under the curve = 1.

Setting up a Normal Distribution using Scipy

from scipy import stats
import numpy as np

%matplotlib inline

mu = 0
std = 1
# grab 800 datapoints
X = np.arange(-4,4,.01)
Y = stats.norm.pdf(X,mu,std)
plt.plot(X,Y);

Alternatively, you can pull a "random, normal" set of datapoints from numpy:

import numpy as np
mu, sigma = 0, 0.1
norm_set = np.random.normal(mu,sigma,1000)

And plot them using matplotlib:

import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(norm_set, bins=50)

To be clear, Y is a set of y-values that fit a curve, and norm_set is a set of random x-values (so the norm_set values differ on every run):
Y[300] returned 0.24197 and norm_set[300] returned 0.24952
Y[301] returned 0.24439 and norm_set[301] returned -0.08108

For more info:
1.) http://en.wikipedia.org/wiki/Normal_distribution
2.) http://mathworld.wolfram.com/NormalDistribution.html
3.) http://stattrek.com/probability-distributions/normal.aspx
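The Y[300] and Y[301] values quoted for the standard normal curve can be reproduced from the density formula alone. This is a standard-library sketch; normal_pdf is a hypothetical helper for illustration, and the Riemann sum is only a rough numerical check of the unit area, not how SciPy computes it.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x; mu, sigma) = exp(-z**2 / 2) / (sigma * sqrt(2*pi)), z = (x - mu) / sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))

# np.arange(-4, 4, .01) puts index 300 at x = -1.00 and index 301 at x = -0.99.
print('%.5f' % normal_pdf(-1.00))   # 0.24197
print('%.5f' % normal_pdf(-0.99))   # 0.24439

# Rough check that the area under the curve is 1: a Riemann sum over [-8, 8].
step = 0.001
area = sum(normal_pdf(-8 + i * step) * step for i in range(16000))
print(round(area, 6))
```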

Sampling

When taking a sample of size n from a population of size N, the sample mean is an estimate of the population mean. Any single sample mean will usually differ somewhat from the population mean, but when taking repeated samples, the sampling distribution of the sample mean will be approximately normal about the population mean. The standard deviation of that sampling distribution (the standard error of the mean) is calculated as

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

This means that to achieve half the measurement error, the sample size must be quadrupled.

For more info: https://en.wikipedia.org/wiki/Sampling_distribution
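The quadrupling rule can be seen in a short simulation. This sketch assumes a hypothetical normal population with mean 50 and standard deviation 10 (values chosen only for illustration) and uses the standard library; the simulated figures vary slightly with the random seed.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population parameters, chosen only for illustration.
mu, sigma = 50.0, 10.0

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def simulated_standard_error(n, repeats=2000):
    """Standard deviation of the means of many simulated samples of size n."""
    means = [
        sum(random.gauss(mu, sigma) for _ in range(n)) / n
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

# Going from n = 25 to n = 100 (four times the size) halves the standard error.
for n in (25, 100):
    print(n, standard_error(sigma, n), round(simulated_standard_error(n), 2))
```

The formula gives 2.0 for n=25 and 1.0 for n=100, and the simulated spread of the sample means should land close to those values.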


T-Distributions

For the previous distributions the sample size was assumed to be large (n>30). The Student's t-distribution allows for the use of small samples, but does so by sacrificing certainty with a margin-of-error trade-off. The t-distribution takes the sample size into account using n-1 degrees of freedom, which means there is a different t-distribution for every sample size. If we plot the t-distribution against a normal distribution, we see that the tails get heavier as the peak gets 'squished' down. It's important to note that as n gets larger, the t-distribution converges to a normal distribution.

Setting up a T-Distribution using Scipy

import numpy as np

%matplotlib inline

from scipy import stats
from scipy.stats import t

# Create x range
x = np.linspace(-5,5,100)

# Create the t distribution with scipy
# this creates a "frozen" RV object with a shape parameter (df=3)
rv = t(3)
norm1 = stats.norm.pdf(x,0,1)

# Plot the PDF versus the x range (I added a normal distribution for comparison)
plt.plot(x, norm1, color='r', label='normal dist')
plt.plot(x, rv.pdf(x), label='t distribution')
plt.legend(loc='upper right');
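The lower peak, heavier tails, and convergence to the normal curve can also be checked numerically. This sketch writes out the standard Student's t density with the standard library only; the density formula does not appear in these notes, and t_pdf and normal_pdf are hypothetical helpers for illustration.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def t_pdf(x, df):
    """Student's t density with df degrees of freedom.

    Uses lgamma rather than gamma so that large df values do not overflow.
    """
    log_coef = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2.0 * math.log(1 + x * x / float(df)))

# Lower peak and heavier tails than the normal curve at df=3.
print('x=0: t=%.4f normal=%.4f' % (t_pdf(0, 3), normal_pdf(0)))   # x=0: t=0.3676 normal=0.3989
print('x=3: t=%.4f normal=%.4f' % (t_pdf(3, 3), normal_pdf(3)))   # x=3: t=0.0230 normal=0.0044

# The gap at the peak shrinks as the degrees of freedom grow.
for df in (3, 30, 1000):
    print(df, round(normal_pdf(0) - t_pdf(0, df), 5))
```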

For more info:
1.) http://en.wikipedia.org/wiki/Student%27s_t-distribution
2.) http://mathworld.wolfram.com/Studentst-Distribution.html
3.) http://stattrek.com/probability-distributions/t-distribution.aspx
4.) http://docs.scipy.org/doc/scipy-0.17.0/reference/tutorial/stats.html
5.) http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.t.html

Trivia: the t-distribution was devised by William Sealy Gosset, who published under the pseudonym "Student".


Hypothesis Testing

Refer to the jupyter notebook for exercises on testing a null hypothesis, choosing between t- and z-scores, and avoiding Type I and Type II errors.

For more info: 1.) Wolfram 2.) Stat Trek 3.) Wikipedia 4.) Probability Book Sample

Chi-Square Distribution and the Chi-Square Test (also written as $\chi^2$ test)

Refer to the jupyter notebook for an exercise on testing the fairness of a pair of dice after observing 500 rolls.

from scipy import stats
# observed and expected hold the observed and expected counts for each outcome (both are defined in the notebook)
chisq, p = stats.chisquare(observed, expected)
print('The chi-squared test statistic is %.2f' %chisq)
print('The p-value for the test is %.2f' %p)

The chi-squared test statistic is 9.89
The p-value for the test is 0.45

With such a high p-value, we have no reason to doubt the fairness of the dice.

For more info: 1. Wikipedia 2. Stat trek 3. Khan Academy 4. http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chisquare.html

Bayes' Theorem

Assesses the likelihood of A given B when you know the overall likelihood of A, of B, and the likelihood of B given A.

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

When you don't know the overall likelihood of B, use

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|\text{not }A)\,P(\text{not }A)}$$

Refer to the jupyter notebook for an exercise involving a test for breast cancer.

For more info: 1.) Simple Explanation using Legos 2.) Wikipedia 3.) Stat Trek
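The second form of the theorem is the one that usually applies to screening tests. The sketch below uses hypothetical numbers (1% prevalence, 80% sensitivity, 9.6% false-positive rate) chosen only for illustration; they are not the values from the notebook exercise, and posterior is a hypothetical helper name.

```python
def posterior(prior, p_b_given_a, p_b_given_not_a):
    """P(A|B) via Bayes' theorem, expanding P(B) over A and not A."""
    p_b = p_b_given_a * prior + p_b_given_not_a * (1 - prior)
    return p_b_given_a * prior / p_b

# Hypothetical screening-test numbers, for illustration only
# (not the values used in the notebook exercise):
prevalence = 0.01             # P(A): has the condition
sensitivity = 0.80            # P(B|A): positive test given the condition
false_positive_rate = 0.096   # P(B|not A): positive test without the condition

p = posterior(prevalence, sensitivity, false_positive_rate)
print('P(condition | positive test) = %.1f%%' % (100 * p))   # P(condition | positive test) = 7.8%
```

Even with a fairly accurate test, the posterior stays low here because the condition is rare: false positives among the 99% without it outnumber true positives among the 1% with it.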

Links for the Hypothesis Testing references:
1.) Wolfram: http://mathworld.wolfram.com/HypothesisTesting.html
2.) Stat Trek: http://stattrek.com/hypothesis-test/hypothesis-testing.aspx
3.) Wikipedia: http://en.wikipedia.org/wiki/Statistical_hypothesis_testing
4.) Probability Book Sample: http://www.sagepub.com/upm-data/40007_Chapter8.pdf

Links for the Chi-Square references:
1. Wikipedia: http://en.wikipedia.org/wiki/Chi-squared_test
2. Stat trek: http://stattrek.com/chi-square-test/independence.aspx
3. Khan Academy: https://www.khanacademy.org/math/probability/statistics-inferential/chi-square/v/pearson-s-chi-square-test-goodness-of-fit

Links for the Bayes' Theorem references:
1.) Simple Explanation using Legos: https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego
2.) Wikipedia: http://en.wikipedia.org/wiki/Bayes%27_theorem
3.) Stat Trek: http://stattrek.com/probability/bayes-theorem.aspx

Beta Distribution

The beta distribution is a family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution.

import numpy as np
from scipy import stats

%matplotlib inline

a = 0.5
b = 0.5
x = np.arange(0.01, 1, 0.01)
y = stats.beta.pdf(x, a, b)
plt.plot(x, y)
plt.title('Beta: a=%.1f, b=%.1f' %(a,b))
plt.xlabel('x')
plt.ylabel('Probability density')

For more info: https://en.wikipedia.org/wiki/Beta_distribution
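To show how α and β "appear as exponents of the random variable", the density can be written out by hand. This sketch uses the standard beta density with the gamma function from the standard library; the formula itself is not given in these notes, and beta_pdf is a hypothetical helper for illustration.

```python
import math

def beta_pdf(x, a, b):
    """f(x; a, b) = x**(a-1) * (1-x)**(b-1) / B(a, b) for 0 < x < 1."""
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b) normalizes the area to 1.
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / norm

# a = b = 0.5 gives a U-shaped curve: lowest in the middle,
# rising steeply towards 0 and 1.
for x in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(x, round(beta_pdf(x, 0.5, 0.5), 4))

# a = b = 1 reduces to the continuous uniform distribution on [0, 1].
print(beta_pdf(0.3, 1, 1))   # 1.0
```

With a = b = 0.5 both exponents are negative (-0.5), which is why the density grows without bound near the ends of the interval.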


APPENDIX I: Mathematical Symbols & Conventions:

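Greek alphabet

Α α   alpha
Β β   beta
Γ γ   gamma
Δ δ   delta
Ε ε   epsilon
Ζ ζ   zeta
Η η   eta
Θ θ   theta
Ι ι   iota
Κ κ   kappa
Λ λ   lambda
Μ μ   mu
Ν ν   nu
Ξ ξ   ksi (xi)
Ο ο   omicron
Π π   pi
Ρ ρ   rho
Σ σ   sigma
Τ τ   tau
Υ υ   upsilon
Φ φ   phi
Χ χ   chi
Ψ ψ   psi
Ω ω   omega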

Number   Prefix   Symbol
10^1     deka-    da
10^2     hecto-   h
10^3     kilo-    k
10^6     mega-    M
10^9     giga-    G
10^12    tera-    T
10^15    peta-    P
10^18    exa-     E
10^21    zetta-   Z
10^24    yotta-   Y

Number   Prefix   Symbol
10^-1    deci-    d
10^-2    centi-   c
10^-3    milli-   m
10^-6    micro-   µ
10^-9    nano-    n
10^-12   pico-    p
10^-15   femto-   f
10^-18   atto-    a
10^-21   zepto-   z
10^-24   yocto-   y

P(A|B)   probability of A given B
A ∈ B    A is an element of B
c ⊂ C    c is a subset of C
A ∩ B    the intersection of A and B
A ∪ B    the union of A and B
A ∧ B    logical A and B
A ∨ B    logical A or B
