Lecture 2: Probability distributions
TRANSCRIPT
Comment on measures of dispersion
Why do common measures of dispersion (variance and standard deviation) use sums of squares:
∑_{i=1}^{n} (x_i − μ̂)²
instead of sums of absolute residuals:
∑_{i=1}^{n} |x_i − μ̂|
Comment on measures of dispersion
Working with absolute values can be difficult, but this is not the main reason.
The measure of central tendency that minimizes the sums of absolute differences is the median, not the mean.
And since the mean is the prevalent measure of central tendency, we commonly use sums of squares.
However, for statistical methods that rely on medians, sums of absolute differences can be more appropriate.
Consider the population of all possible outcomes when throwing two dice. How probable is each sum S of counts from the two dice?
What is a probability distribution?
The probability distribution provides the probability of occurrence of all possible outcomes in an experiment
Probability distribution of a population
It is generally not known. However, we may have sufficient information about this distribution to meet the goals of our research.
Statistical modeling: It is possible to rely on a small set of probability distributions that capture the key characteristics of a population.
We can then infer the parameters of these model probability distributions from our sample.
Why? We expect that the sample contains some information about the population from which it was sampled.
Example distributions
Source: https://medium.com/runkeeper-everyone-every-run/how-long-till-the-finish-line-494361cc901b
Half Marathon and Marathon race finish time
The population (and hence its samples) contain discrete values, either finite or infinite in number
e.g., {-3, 1, 0, 1, 2}, {‘blue’, ‘brown’, ‘green’}, or {1, 2, 3, ...}
The probability of a value x in a population can be expressed as a function: f(x) = p(X = x) (the probability of random variable X taking value x)
The function f(x) is called a probability mass function (pmf)
Discrete probability distributions
Probabilities should sum to 1:
∑_{x∈S} f(x) = 1
The expected value (or mean):
E(X) = ∑_{x∈S} x·f(x)
For the sum S of two dice:
E(S) = 2·(1/36) + 3·(1/18) + 4·(1/12) + 5·(1/9) + 6·(5/36) + 7·(1/6) + 8·(5/36) + 9·(1/9) + 10·(1/12) + 11·(1/18) + 12·(1/36) = 7
The mode is the most frequent value: which one?
The median is the middle value: which one?
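As a sanity check, the two-dice pmf and its expected value can be computed in R (a small sketch, enumerating all 36 equally likely outcomes):

```r
# Enumerate all 36 equally likely outcomes of throwing two fair dice
outcomes <- expand.grid(d1 = 1:6, d2 = 1:6)
s <- outcomes$d1 + outcomes$d2
# Probability mass function of the sum S
pmf <- table(s) / length(s)
# Probabilities sum to 1
sum(pmf)
# Expected value: sum over all values of value * probability
ev <- sum(as.numeric(names(pmf)) * pmf)
ev  # 7
```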
Symmetrical probability distributions
When the mean coincides with the median
The above is a symmetrical, unimodal distribution
The binomial distribution
Consider a population containing two values: 1 and 0. Let's set the probability of 1 to Pr(1) = .75 and the probability of 0 to Pr(0) = .25.
A single sample (n = 1) from such a population is known as a Bernoulli trial.
e.g., a coin flip with a fair coin is a Bernoulli trial with Pr(Head) = .5 and Pr(Tail) = .5
If we perform n independent Bernoulli trials, their outcomes will follow a binomial distribution.
The binomial distribution
If a random variable X has a binomial distribution, we write:
X ∼ B(n, P)
where n and P are the parameters of the distribution:
• n is the number of Bernoulli trials
• P is the probability of success
If we know n and P, we can fully describe the distribution.
The binomial distribution
Probability Mass Function (pmf)
f(x; n, P) = C(n, x) · P^x · (1 − P)^(n−x)
where:
• C(n, x) = n! / (x!·(n − x)!) is the number of possible ways that n Bernoulli trials lead to x successes
• P^x is the probability of exactly x successes
• (1 − P)^(n−x) is the probability of exactly n − x failures
The binomial distribution
Consider the distribution of errors for 10 trials, when for each trial errors occur with a probability of 40%.
Continuous distributions
Not restricted to specific values. They can take any value between the lower and upper bound of a population (of course, populations can be unbounded).
The probability of any particular value is zero. Probabilities can only be obtained for intervals (i.e., a range of values):
Pr(a ≤ X ≤ b) = ∫_a^b f(x) dx
where f(x) is the probability density function (pdf). It provides the relative (rather than absolute) likelihood that a random variable X takes the value x.
The total probability over all possible values is 1:
Pr(−∞ ≤ X ≤ ∞) = ∫_{−∞}^{+∞} f(x) dx = 1
Continuous distributions
Mode: value of highest peak
Median: value that divides the area exactly in half
Mean: μ = ∫_{−∞}^{+∞} x·f(x) dx
The normal distribution
Symmetrical, unimodal and continuous
Can be derived as the limit of a sum of a large number of independent random variables
Thus, it is appropriate when data arise from a process that involves adding together the contributions from a large number of independent, random events.
Example
The human height can be considered as the outcome of many independent random genetic and environmental influences
Normal distribution parameters
A normal distribution can be fully described by only two parameters: its mean μ and its variance σ²
A normally distributed variable X can be written as:
X ∼ N(μ, σ²)
Its probability density function (pdf) is as follows:
f(x; μ, σ²) = (1/√(2πσ²)) · e^(−(x−μ)²/(2σ²))
Standard normal distribution
It is the normal distribution with a mean equal to 0 and a standard deviation (also variance) equal to 1:
The standard normal distribution is often abbreviated to z. It is frequently used to simplify working with normal distributions.
z ∼ N(0, 1)
Sampling distribution of a statistic
It is the distribution obtained by calculating the statistic (e.g. the mean) from an infinite number of independent samples of size n
Example
An experiment measures the time it takes n = 10 people to visually locate a target on a computer screen.
The same experiment is repeated a large (or infinite) number of times, where each time, we draw a new sample of size n. For each experiment, we compute the mean time:
Experiment 1: M = 11.4 s
Experiment 2: M = 12.6 s
Experiment 3: M = 10.2 s
...
What’s the distribution of these mean values?
Sampling distribution of a statistic
Such distributions are interesting as they determine the probability of observing a particular value of the statistic, e.g., the mean.
It is often very different from the distribution of the data used to calculate the statistic.
distribution of the data ≠ sampling distribution of their means
Sampling distribution of the mean
Its mean value is also the mean (expected value) of the original population the samples were drawn from
Its standard deviation (SD) is known as the standard error of the mean (SEM)
The central limit theorem (CLT)
States that the sampling distribution of a statistic approaches the normal distribution as n approaches infinity.
It applies to statistics computed by summing or averaging quantities (means, variances), but not to standard deviations (the square root of an average)
The central limit theorem (CLT)
central = fundamental to probability and statistics
limit = refers to a limit condition, n → ∞
Practical importance of the CLT
If the size of the sample is sufficiently large, then the sampling distribution of the statistic will be approximately normal
(no matter what the distribution of the original population was)
But which sample size is sufficiently large?
Sampling from normal distributions
If the original population is normal, then the CLT will always hold, even if the sample size is as low as n = 1
The further the original population moves away from a normal distribution, the larger the sample size n should be
Sampling from binomial distributions
[Figure: histograms of the count of successes over repeated samples, for n = 10, 30, 100 Bernoulli trials with P = .15 (top row) and P = .35 (bottom row).]
Statistic of interest: Count of successes from n Bernoulli trials
Sampling from exponential distributions
[Figure: density of the exponential source population (dexp(x)) and sampling distributions of the mean for n = 10, 30, 100.]
Statistic of interest: Mean of a sample of n drawn from an exponential distribution
What n is sufficiently large?
Several textbooks claim that n = 30 is enough to assume that a sampling distribution is normal, irrespective of the shape of the source distribution.
« This is untrue » [Baguley]. There is no magic number that guarantees this.
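A small simulation sketch in R illustrates the point (the exponential population, the seed, and the replicate count are arbitrary choices for illustration): the sampling distribution of the mean is centered correctly even for small n, but the asymmetry of the source fades only gradually.

```r
set.seed(42)
reps <- 10000
# Draw many independent samples of size n from an exponential population
# (population mean = 1/rate = 1) and compute each sample's mean
sample_means <- function(n) replicate(reps, mean(rexp(n, rate = 1)))
m10  <- sample_means(10)
m100 <- sample_means(100)
# Both sampling distributions are centered on the population mean ...
mean(m10); mean(m100)
# ... but the skewness of the source carries over and shrinks only
# gradually as n grows
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew(m10); skew(m100)
```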
Log-normal distribution
A random variable X is log-normally distributed if the logarithm of X is normally distributed:
X ∼ LogN(μ, σ²) ⟺ ln(X) ∼ N(μ, σ²)
Simple math with logarithms
log_b(x) = a ⟺ b^a = x
log_b(1) = 0 ⟺ b^0 = 1
log_b(b) = 1 ⟺ b^1 = b
If the base of the logarithm is equal to the Euler number e ≈ 2.71828, we write:
ln(x) = loge(x)
Which base we use is not important, but it is common to use e as the base.
Log-normal distribution
A common choice for real-world data bounded by zero e.g., response time or task-completion time
« The reasons governing frequency distributions in nature usually favor the log-normal, whereas people are in favor of the normal »
« For small coefficients of variation, normal and log-normal distribution both fit well. »
[ Limpert et al. 2001 ] https://stat.ethz.ch/~stahel/lognormal/bioscience.pdf
Sampling from lognormal distributions
[Figure: density of the lognormal source population (dlnorm(x, meanlog = 0, sdlog = 1), i.e., μ = 0, σ = 1) and sampling distributions of the mean for n = 10, 30, 100.]
Sampling from lognormal distributions
[Figure: sampling distributions of the mean for n = 10, 30, 100, before and after applying a log transformation on the data.]
The chi-square (χ2) distribution
Consider a squared observation z² drawn at random from the standard normal (z) distribution.
The distribution of z² will follow a χ² distribution with 1 degree of freedom (df).
[Figure: probability density of Z (the standard normal distribution) and of Z², which follows a χ² distribution with 1 df.]
The chi-square (χ2) distribution
A χ2 distribution with k degrees of freedom is the distribution of a sum of squares of k independent variables that follow a standard normal distribution
Q = ∑_{i=1}^{k} Z_i² ⟹ Q ∼ χ²(k)
The chi-square (χ2) distribution
Given the link between variances and sums of squares, the chi-square distribution is useful for modeling variances of samples from normal (or approximately normal) distributions.
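A brief simulation sketch (the choice of k = 3 and the replicate count are arbitrary) confirms the first two moments of a χ²(k) variable: its mean is k and its variance is 2k.

```r
set.seed(1)
k <- 3
# Sum of k squared standard normal draws, repeated many times
q <- replicate(100000, sum(rnorm(k)^2))
# A chi-square(k) variable has mean k and variance 2k
mean(q)  # close to k = 3
var(q)   # close to 2k = 6
```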
The t distribution (Student’s t)
The sampling distribution of means for a normally distributed population
Useful when: the sample size is small, and
the population standard deviation is unknown
Published by William Gosset (1908) under the pseudonym « Student »
The t distribution (Student’s t)
When the population standard deviation σ is unknown and is estimated from the unbiased variance estimate:
then, the resulting standardized sample mean has a t distribution with ν = n - 1 degrees of freedom.
σ̂² = ∑_{i=1}^{n} (x_i − μ̂)² / (n − 1)
Note: We’ll further explain this later.
The t distribution (Student’s t)
A random variable X following a t distribution is denoted as:
X ∼ t(ν)
[Figure: probability densities of the t distribution and the standard normal z, for t(1) and t(29); t(29) is nearly indistinguishable from z.]
R distribution functions
dbinom(x, n, P) Provides the probability mass function for the binomial distribution B(n,P)
Examples:
dbinom(4, 20, .2): It will return the probability of x = 4 successes for n = 20 Bernoulli trials with a P = .2 probability of success.
dbinom(c(1,2,3,4), 10, .2): It will return a vector with the probabilities of x = {1, 2, 3, 4} successes for n = 10 Bernoulli trials with a P = .2 probability of success.
Binomial distribution
R distribution functions
pbinom(x, n, P) Provides the cumulative distribution function (cdf) for the binomial distribution B(n, P).
Example: pbinom(4, 20, .2): It will return the cumulative probability up to x = 4 successes for n = 20 Bernoulli trials with a P = .2 probability of success.
Binomial distribution
R distribution functions
rbinom(size, n, P) It will generate a random sample of size size from the binomial distribution B(n, P).
Example: rbinom(10, 20, .2): It will return a random sample of size = 10 from the binomial distribution B(n = 20, P = .2).
Binomial distribution
R distribution functions
dnorm(x, mean, sd) Provides the probability density function for the normal distribution with a mean value equal to mean and a standard deviation equal to sd.
Examples:
dnorm(.2, 0, 1): It will return the relative likelihood of the value x = .2 for the standard normal distribution.
curve(dnorm(x, 100, 15), xlim = c(60, 140)): It will plot the probability density function from x = 60 to x = 140 for the normal distribution with mean = 100 and sd = 15.
Normal distribution
R distribution functions
pnorm(x, mean, sd) Provides the cumulative distribution function (cdf) for the normal distribution with a mean value equal to mean and a standard deviation equal to sd.
Example: pnorm(100, 100, 15): It will return the cumulative probability up to x = 100 for the normal distribution with mean = 100 and sd = 15. (What do you expect it to be?)
Normal distribution
R distribution functions
rnorm(size, mean, sd) It will generate a random sample of size size from the normal distribution with a mean value equal to mean and a standard deviation equal to sd.
Example: rnorm(10, 0, 1): It will return a random sample of size = 10 from the standard normal distribution.
Normal distribution
R distribution functions
Distribution   Density (pmf or pdf)    Cumulative distr. function   Random sampling
Binomial       dbinom(x, n, P)         pbinom(x, n, P)              rbinom(size, n, P)
Normal         dnorm(x, mean, sd)      pnorm(x, mean, sd)           rnorm(size, mean, sd)
Log-normal     dlnorm(x, mean, sd)     plnorm(x, mean, sd)          rlnorm(size, mean, sd)
Chi-squared    dchisq(x, k)            pchisq(x, k)                 rchisq(size, k)
Student (t)    dt(x, ν)                pt(x, ν)                     rt(size, ν)
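The three prefixes fit together; a brief illustration with the normal distribution:

```r
# d*: density (or mass) at a point
dnorm(0, mean = 0, sd = 1)   # peak of the standard normal pdf, 1/sqrt(2*pi)
# p*: cumulative probability up to a point
pnorm(0, mean = 0, sd = 1)   # 0.5: half of the area lies below the mean
# r*: random sampling from the distribution
set.seed(7)
x <- rnorm(1000, mean = 0, sd = 1)
mean(x)                      # close to 0 for a large sample
```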
Statistical inference
The process of deducing the parameters of an underlying probability distribution from a sample
Four broad types: point estimation
interval estimation
hypothesis testing
prediction
Point estimates
How informative is the following graph?
[Figure: bar chart of mean time (s) for three techniques, T1, T2, T3.]
Point estimates
A point estimate can be thought of as a « best guess » of the true population parameter
Descriptive statistics such as the sample mean or the median are examples of point estimates
Question: What are the point estimates of a population’s variance and standard deviation?
Point estimates
How informative is the following graph?
A point estimate communicates no information about the uncertainty or quality of the estimate it provides
[Figure: the same bar chart of mean times (s) for techniques T1, T2, T3.]
Interval estimate
An interval estimate does not provide an exact value, but rather a range of values that the parameter might plausibly take.
Most common method: constructing a confidence interval (CI)
Confidence interval (CI)
It specifies a range of values that is expected to contain the true parameter value (but it may not)
It is associated with a confidence level, usually expressed as a percentage
e.g., 95% CI or 99% CI
Formal interpretation of CIs
Classical frequentist statistics views a probability as a statement about the frequency with which events occur in the long run.
Of the many 95% CIs that might be constructed, 95% are expected to contain the true population parameter. The other 5% may completely fail!
Informal interpretation of CIs
Formally speaking, a CI does not specify a probability range! A 95% CI does not necessarily contain the true population parameter with a 95% probability (or 95% confidence).
However, it is often reasonable to treat a CI as an expression of confidence or belief that it does contain the true value. See [Baguley] and [Cumming and Finch, 2005]
Attention: This view is not shared by everyone! It has received a lot of criticism from Bayesian statisticians.
Confidence level
A 100% CI will include the whole range of possible values
A 0% CI reduces to a point estimate
A 95% CI is the most common choice (by tradition)
alpha
If C is the confidence level of a CI, then:
C = 100 (1 - α)
where α (or alpha) represents the proportion of times that a C% CI is expected to fail:
If C = 95, then α = .05
Structure of a confidence interval
It is defined by two points that form its limits, i.e., its lower and upper bounds
It can be symmetrical, where the point estimate lies in the center of the CI
...or asymmetrical, where the point estimate is not at the center of the CI
Symmetrical CIs
The intervals can be described by the point estimate plus or minus half of the interval, e.g., 165 ± 6 cm
This half width of the interval is known as the margin of error (MOE)
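A symmetrical 95% CI for a mean can be sketched in R using the t distribution introduced earlier (the height data below are made up for illustration):

```r
# Hypothetical sample of 10 heights (cm), for illustration only
heights <- c(158, 171, 164, 180, 167, 162, 175, 169, 160, 173)
n   <- length(heights)
m   <- mean(heights)
sem <- sd(heights) / sqrt(n)          # standard error of the mean
moe <- qt(0.975, df = n - 1) * sem    # margin of error for a 95% CI
c(lower = m - moe, upper = m + moe)   # the CI: point estimate ± MOE
```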
Width of a CI
Depends on the confidence level e.g., 99% CIs are wider than 95% CIs
Also depends on the sampling distribution of the statistic
The smaller the sample size, the wider the sampling distribution
Small samples produce wide CIs
Example
Consider the sampling distribution of the mean for a normally distributed population (M = 100, SD = 10)
The sampling distribution becomes narrower as the sample size n increases.
[Figure: histograms of the sampling distribution of the mean for n = 10, 30, 100; the distribution narrows around the population mean of 100 as n increases.]