cs.brown.edu/courses/cs195w/slides/introtostats.pdf
TRANSCRIPT
An Informal Introduction to Statistics in 2h
Tim Kraska
Goal of this Lecture
• This is not a replacement for a proper introduction to probability and statistics
• Instead, it only tries to convey the very basic intuition behind some of the ideas
• The risk of this lecture: half knowledge can be dangerous
• Most slides are based on CS 155 (big thanks to Eli)
The Very Basics
Statistics ≠ Probability
Probability: mathematical theory that describes uncertainty.
Statistics: set of techniques for extracting useful information from data.
Probability Space
Probability Function
Tossing a (Fair) Coin
Ω = {H, T}, F = 2^Ω, |F| = 2^2 = 4 events
F = { ∅, {H}, {T}, {H,T} }
Pr(∅) = 0
Pr({H}) = 0.5
Pr({T}) = 0.5
Pr({H,T}) = 1
Rolling a (Fair) Die
Ω = {1, 2, 3, 4, 5, 6}, F = 2^Ω, |F| = 2^6 = 64 events
Pr(∅) = 0
Pr({1}) = Pr({2}) = Pr({3}) = Pr({4}) = Pr({5}) = Pr({6}) = 1/6
Pr({1,2}) = Pr({1,3}) = Pr({1,4}) = Pr({1,5}) = Pr({1,6}) = 2/6
...
Independent Events
Tossing a (Fair) Coin Twice
Ω = {HH, HT, TH, TT}, F = 2^Ω, |F| = 2^4 = 16 events
Pr(∅) = 0
Pr({HH}) = Pr(H) · Pr(H) = 0.5 × 0.5 = 0.25
Pr({HT}) = Pr({TH}) = Pr({TT}) = 0.25
Pr({HT,TT}) = Pr({HH,TH}) = 0.5
Pr({HH,HT}) = Pr({TH,TT}) = 0.5
...
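The two-toss probability space above can be checked by brute-force enumeration; a minimal sketch (the function and variable names are mine, not from the slides):

```python
from itertools import product

# Sample space for two independent tosses of a fair coin.
omega = [''.join(t) for t in product('HT', repeat=2)]   # ['HH', 'HT', 'TH', 'TT']

def pr(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return sum(1 for o in omega if o in event) / len(omega)

print(pr({'HH'}))        # 0.25 = Pr(H) * Pr(H), by independence
print(pr({'HT', 'TT'}))  # 0.5
```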
Conditional Probability
Computing Conditional Probabilities
Example - a posteriori probability
Law of Total Probability
In-Class Exercises
1. A fair coin was tossed 10 times and always landed heads. What is the likelihood that it will land tails next?
2. Stan has two kids. One of his kids is a boy. What is the likelihood that the other one is also a boy?
Bayesian Statistics
Bayes’ Law
Bayes Theorem
P(H|D) = P(D|H) · P(H) / P(D)
Prior, P(H): the probability of the hypothesis being true before collecting data
Likelihood, P(D|H): the probability of collecting this data when our hypothesis is true
Marginal, P(D): the probability of collecting this data under all possible hypotheses
Posterior, P(H|D): the probability of our hypothesis being true given the data collected
Deriving Bayes’ Law
P(A ∩ B) = P(A|B) · P(B) = P(B|A) · P(A)
P(A|B) = P(B|A) · P(A) / P(B)
Application: Finding a Biased Coin
Class Example: Drug Test
• 0.4% of the Rhode Island population use marijuana*
• Drug test: the test will produce 99% true positive results for drug users and 99% true negative results for non-drug users.
If a randomly selected individual tests positive, what is the probability he or she is a user?
* http://medicalmarijuana.procon.org/view.answers.php?questionID=001199
P(User | +) = P(+ | User) · P(User) / P(+)
= P(+ | User) · P(User) / [ P(+ | User) · P(User) + P(+ | ¬User) · P(¬User) ]
= (0.99 × 0.004) / (0.99 × 0.004 + 0.01 × 0.996)
≈ 28.4%
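The same computation as a short script (the variable names are mine):

```python
# Bayes' law for the drug-test example.
p_user = 0.004         # prior: 0.4% of the population
p_pos_user = 0.99      # P(+ | User), true positive rate
p_pos_nonuser = 0.01   # P(+ | not User), false positive rate

# Law of total probability for the marginal P(+).
p_pos = p_pos_user * p_user + p_pos_nonuser * (1 - p_user)
p_user_pos = p_pos_user * p_user / p_pos

print(f"{p_user_pos:.1%}")  # 28.4%
```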
Spam Filtering with Naïve Bayes
9/12/13 Bill Howe, UW 25
P(spam | words) = P(spam) · P(words | spam) / P(words)
P(spam | viagra, rich, ..., friend) = P(spam) · P(viagra, rich, ..., friend | spam) / P(viagra, rich, ..., friend)
P(spam | words) ≈ P(spam) · P(viagra | spam) · P(rich | spam) ⋯ P(friend | spam) / P(viagra, rich, ..., friend)
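A sketch of the naïve Bayes scoring rule; the per-word probabilities below are made-up numbers purely for illustration. Comparing posterior odds lets the marginal P(words) cancel, so it never has to be computed:

```python
import math

# Made-up per-word conditional probabilities, for illustration only.
p_word_given_spam = {'viagra': 0.30, 'rich': 0.20, 'friend': 0.10}
p_word_given_ham  = {'viagra': 0.001, 'rich': 0.01, 'friend': 0.20}
p_spam, p_ham = 0.5, 0.5

def log_posterior_odds(words):
    """log[ P(spam | words) / P(ham | words) ] under the naive
    independence assumption; the marginal P(words) cancels in the ratio."""
    odds = math.log(p_spam / p_ham)
    for w in words:
        odds += math.log(p_word_given_spam[w] / p_word_given_ham[w])
    return odds

print(log_posterior_odds(['viagra', 'rich']) > 0)  # True: scored as spam
print(log_posterior_odds(['friend']) > 0)          # False: scored as ham
```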
Bayesian Inference
P(H|E) = P(E|H) · P(H) / P(E)
P(Θ | E ∩ α) = P(E | Θ ∩ α) · P(Θ | α) / P(E | α)
H: hypothesis
P(H): prior probability
P(H|E): posterior probability
P(E|H): probability of observing E given H, the likelihood
P(E): model evidence (marginal likelihood)
Random Variables
How to Model A Simple Game
I get $5 from you
You get $10 from me
Random Variables
Independence
Expectation (µ)
Linearity of Expectation
How to Model A Simple Game
I get $5 from you
You get $10 from me
Would you play this game?
Variance
Variance
So far we knew the distribution What if we do not?
Population N
Red/Blue/Green Lottery
Empirical Probability
f_i = n_i / N = n_i / Σ_i n_i
For the population of N = 20 balls:
f_blue = 10/20, f_red = 6/20, f_green = 4/20
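The lottery frequencies can be computed directly; a minimal sketch using the 20-ball population from the slide:

```python
from collections import Counter

# The red/blue/green lottery population from the slide: N = 20 balls.
population = ['blue'] * 10 + ['red'] * 6 + ['green'] * 4

counts = Counter(population)
N = sum(counts.values())
freqs = {color: n_i / N for color, n_i in counts.items()}  # f_i = n_i / N

print(freqs)  # {'blue': 0.5, 'red': 0.3, 'green': 0.2}
```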
Population
Mean: µ = Σ_i x_i / N
Variance: σ² = Σ_i (x_i − µ)² / N
Population N
Red/Blue/Green Lottery
Sample n
Population vs. Sample
Population (parameter), estimated by the Sample (statistic):
Mean: µ = Σ_i x_i / N, estimated by x̄ = Σ_i x_i / n
Variance: σ² = Σ_i (x_i − µ)² / N, estimated by
 S_N² = Σ_i (x_i − x̄)² / n (biased estimate)
 S_{N−1}² = Σ_i (x_i − x̄)² / (n − 1) (unbiased estimate)
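A quick check that the two estimates differ only in the divisor, using the standard library (the data set is an arbitrary example):

```python
import statistics

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # arbitrary example data

n = len(sample)
xbar = sum(sample) / n
ss = sum((x - xbar) ** 2 for x in sample)

biased = ss / n          # S_N^2: divides by n
unbiased = ss / (n - 1)  # S_{N-1}^2: divides by n - 1 (Bessel's correction)

# The standard library implements both: pvariance divides by n, variance by n - 1.
print(biased, statistics.pvariance(sample))    # 4.0 4.0
print(unbiased, statistics.variance(sample))   # ~4.571 for both
```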
Big Data: How to Calculate the Variance in 1-Pass
S_{N−1}² = Σ_i (x_i − x̄)² / (n − 1)
 = 1/(n−1) · [ Σ_i x_i² − (1/n) · (Σ_i x_i)² ]
 = 1/(n−1) · [ Σ_i (x_i − K)² − (1/n) · (Σ_i (x_i − K))² ] for any constant K (shifting the data improves numerical stability)
Law of Large Numbers
Law of Large Numbers
• Draw independent observations at random from any population with finite mean μ.
• As the number of observations increases, the sample mean approaches mean μ of the population.
• The more variation in the outcomes, the more trials are needed to ensure that the sample mean x̄ is close to μ.
Weak law of large numbers: the sample mean X̄_n converges to µ in probability.
Strong law of large numbers: Pr( lim_{n→∞} X̄_n = µ ) = 1
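A small simulation of the coin version (the seed and sample sizes are arbitrary choices of mine):

```python
import random

random.seed(42)

def sample_mean(n):
    """Mean of n fair-coin flips (1 = heads, 0 = tails)."""
    return sum(random.randint(0, 1) for _ in range(n)) / n

# The sample mean approaches mu = 0.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```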
Central Limit Theorem
Law of Large Numbers (Coin)
Convolution
Die 1: X, Die 2: Y, Dice 1+2: Z = X + Y
P(Z = z) = Σ_{k=−∞}^{∞} P(X = k) · P(Y = z − k)
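The convolution formula for two dice, computed exactly with rational arithmetic (a sketch; the helper names are mine):

```python
from fractions import Fraction

# PMF of one fair die.
die = {k: Fraction(1, 6) for k in range(1, 7)}

def convolve(px, py):
    """PMF of Z = X + Y: P(Z = z) = sum_k P(X = k) * P(Y = z - k)."""
    pz = {}
    for x, p in px.items():
        for y, q in py.items():
            pz[x + y] = pz.get(x + y, Fraction(0)) + p * q
    return pz

two_dice = convolve(die, die)
print(two_dice[7])  # 1/6, the most likely sum
print(two_dice[2])  # 1/36
```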
Tossing 2 Dice
[Histograms: the distribution of X1 (a single die) and of the sums S2, S4, S8, S16, S32 of 2, 4, 8, 16 and 32 dice. As more dice are summed, the distribution becomes increasingly bell-shaped.]
[Histograms for a second, skewed starting distribution on {1, 2, 3}: X1 and the sums S2, S4, S8, S16, S32. The sums again approach a bell shape.]
[Histograms for a third starting distribution on {0, ..., 5}: X1 and the sums S2, S4, S8, S16, S32. Whatever the starting distribution, the sums approach a bell shape.]
Normal Distribution
Probability Density Function
f(x) = 1/(σ√(2π)) · e^(−(x−µ)² / (2σ²))
Probability Density Function (PDF), Cumulative Distribution Function (CDF)
N(µ, σ²)
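The PDF can be coded directly from the formula, and the CDF via the standard error-function identity (a sketch; function names are mine):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Phi(x), via the error function: (1 + erf((x - mu) / (sigma * sqrt(2)))) / 2."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(normal_pdf(0.0))   # peak of the standard normal, ~0.3989
print(normal_cdf(1.96))  # ~0.975
```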
The Central Limit Theorem
1. The distribution of means will be approximately a normal distribution for larger sample sizes
2. The mean of the distribution of means approaches the population mean, μ, for large sample sizes
3. The standard deviation of the distribution of means approaches σ/√n for large sample sizes, where σ is the standard deviation of the population and n is the sample size
The Central Limit Theorem Side Notes
1. For practical purposes, the distribution of means will be nearly normal if the sample size is larger than 30
2. If the original population is normally distributed, then the sample means will remain normally distributed for any sample size n, and it will become narrower
3. The original variable can have any distribution, it does not have to be a normal distribution
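The three claims can be checked by simulation for a decidedly non-normal starting variable, a fair die (the seed, sample size, and trial count are arbitrary choices of mine):

```python
import random
import statistics

random.seed(1)

# Draw many samples of size n from a non-normal population (a fair die),
# and look at the distribution of the sample means.
n, trials = 30, 5_000
means = [statistics.mean(random.randint(1, 6) for _ in range(n)) for _ in range(trials)]

sigma = statistics.pstdev(range(1, 7))   # population std of one die, ~1.708

print(statistics.mean(means))            # close to the population mean 3.5
print(statistics.stdev(means))           # close to sigma / sqrt(n), ~0.312
```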
Shapes of Distributions as Sample Size Increases
Testing
Hypothesis Testing
The FDA or "science" needs to decide on a new theory, drug, treatment…
• H0: The null hypothesis: the current theory, drug, or treatment is as good or better
• H1: The alternative hypothesis: the new theory, drug, or treatment should replace the old one
Researchers do not know which hypothesis is true. They must make a decision on the basis of the evidence presented.
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc.
What is a Hypothesis?
• A hypothesis is a claim (assumption) about a population parameter:
 – population mean (example: the mean monthly cell phone bill of this city is μ = $42)
 – population proportion (example: the proportion of adults in this city with cell phones is p = .68)
The Null Hypothesis, H0
• States the assumption to be tested about a population parameter, e.g. H0: μ = 3
• Is always about the population parameter, never about a sample statistic: H0: μ = 3, not H0: x̄ = 3
Hypothesis Testing Process
Population
Claim: the population mean age is 50 (null hypothesis: H0: μ = 50)
Now select a random sample. Suppose the sample mean age is x̄ = 20.
Is x̄ = 20 likely if μ = 50? If not likely, REJECT the null hypothesis.
Reason for Rejecting H0
Outcomes and Probabilities
Possible hypothesis test outcomes; key: Outcome (Probability)

Decision          | H0 True           | H0 False
Do Not Reject H0  | No error (1 − α)  | Type II error (β)
Reject H0         | Type I error (α)  | No error (1 − β)
Level of Significance and the Rejection Region
Level of significance = α; the rejection region is shaded and bounded by the critical value.
Lower-tail test: H0: μ ≥ 3, H1: μ < 3 (rejection region of size α in the lower tail)
Upper-tail test: H0: μ ≤ 3, H1: μ > 3 (rejection region of size α in the upper tail)
Two-tail test: H0: μ = 3, H1: μ ≠ 3 (rejection region of size α/2 in each tail)
p-Value Approach to Testing
• p-value: Probability of obtaining a test statistic at least as extreme ( ≤ or ≥ ) as the observed sample value, given H0 is true
– Also called observed level of significance
– Smallest value of α for which H0 can be rejected
[Figure: rejection regions for a two-tail test at α = .10. Calculate the p-value and compare it to α: reject H0 if the p-value is less than α; otherwise do not reject H0.]
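For an observed z statistic, the two-sided p-value follows directly from the normal CDF; a minimal sketch (the function name is mine):

```python
import math

def two_sided_p_value(z):
    """p-value for an observed z statistic under a standard-normal H0:
    the probability of a value at least as extreme in either tail."""
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

print(two_sided_p_value(1.96))  # ~0.050: borderline at alpha = .05
print(two_sided_p_value(2.58))  # ~0.010
```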
http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer
Anders Pape Møller, 1991: "female barn swallows were far more likely to mate with male birds that had long, symmetrical feathers." "Between 1992 and 1997, the average effect size shrank by eighty per cent."
Joseph Rhine, 1930s, coiner of the term "extrasensory perception": tested individuals with card-guessing experiments. A few students achieved multiple low-probability streaks, but there was a "decline effect": their performance became worse over time.
Jonah Lehrer, 2010, The New Yorker: "The Truth Wears Off"
John Davis, University of Illinois: "Davis has a forthcoming analysis demonstrating that the efficacy of antidepressants has gone down as much as threefold in recent decades."
Jonathan Schooler, 1990: "subjects shown a face and asked to describe it were much less likely to recognize the face when shown it later than those who had simply looked at it." The effect became increasingly difficult to measure.
Reason 1: Publication Bias
“In the last few years, several meta-analyses have reappraised the efficacy and safety of antidepressants and concluded that the therapeutic value of these drugs may have been significantly overestimated.”
Publication bias: What are the challenges and can they be overcome?Ridha Joober, Norbert Schmitz, Lawrence Annable, and Patricia BoksaJ Psychiatry Neurosci. 2012 May; 37(3): 149–152. doi: 10.1503/jpn.120065
“Although publication bias has been documented in the literature for decades and its origins and consequences debated extensively, there is evidence suggesting that this bias is increasing.”
“A case in point is the field of biomedical research in autism spectrum disorder (ASD), which suggests that in some areas negative results are completely absent”
(emphasis mine)
“… a highly significant correlation (R2= 0.13, p < 0.001) between impact factor and overestimation of effect sizes has been reported.”
Publication Bias
“decline effect”
“decline effect” = publication bias!
Background: Effect Size
• Expressed in relevant units
• Not just "significant": how significant?
• Used prolifically in meta-analysis to combine results from multiple studies. But be careful: averaging results from different experiments can produce nonsense.
Robert Coe, 2002, Annual Conference of the British Educational Research Association: It's the Effect Size, Stupid: What effect size is and why it is important.
Effect size = ( [Mean of experimental group] − [Mean of control group] ) / standard deviation
Caveat: other definitions of effect size exist, e.g. the odds ratio and the correlation coefficient
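A sketch of the standardized mean difference using one common pooled-standard-deviation choice (the data are arbitrary illustration values; Glass, 1976 instead uses the control group's SD):

```python
import statistics

def effect_size(experimental, control):
    """Standardized mean difference, using a pooled standard deviation
    (one common choice among several)."""
    n1, n2 = len(experimental), len(control)
    v1, v2 = statistics.variance(experimental), statistics.variance(control)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(experimental) - statistics.mean(control)) / pooled_sd

d = effect_size([5, 6, 7, 8, 9], [4, 5, 6, 7, 8])
print(round(d, 3))  # 0.632: between "medium" and "large" on Cohen's heuristic
```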
Effect Size
• Standardized Mean Difference
There are lots of ways to estimate the pooled standard deviation (e.g., Hartung et al., 2008; Glass, 1976).
Effect size: Cohen’s Heuristic
• Standardized mean difference effect size:
 – small = 0.20
 – medium = 0.50
 – large = 0.80
Reason 3: Multiple Hypothesis Testing
• If you perform experiments over and over, you're bound to find something
• This is a bit different from the publication bias problem: same sample, different hypotheses
• The significance level must be adjusted down when performing multiple hypothesis tests
P(falsely detecting an effect when there is none) = α = 0.05
P(no false detection in a single experiment) = 1 − α
P(no false detection in any of k experiments) = (1 − α)^k
P(falsely detecting an effect in at least one of k experiments) = 1 − (1 − α)^k
This last quantity is the "familywise error rate"; with α = 0.05 it grows quickly with k.
Familywise Error Rate Corrections
• Bonferroni correction: just divide by the number of hypotheses m, testing each at level α/m
• Šidák correction: assumes independence, testing each at level 1 − (1 − α)^(1/m)
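The familywise error rate and both corrections take only a few lines (a sketch; function names are mine):

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive across m independent tests at level alpha)."""
    return 1 - (1 - alpha) ** m

def bonferroni(alpha, m):
    """Per-test level keeping the FWER at most alpha (independence not required)."""
    return alpha / m

def sidak(alpha, m):
    """Per-test level making the FWER exactly alpha, assuming independence."""
    return 1 - (1 - alpha) ** (1 / m)

print(familywise_error_rate(0.05, 10))  # ~0.401: a 40% chance of a false finding
print(bonferroni(0.05, 10))             # 0.005
print(sidak(0.05, 10))                  # ~0.00512
```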
Summary
• Stochastic Variables
• Basics in Statistics
• Bayes' Law
• Central Limit Theorem
• Law of Large Numbers
• Testing