introduction to data science · 2017-04-07 · introduction to data science john p dickerson...
TRANSCRIPT
INTRODUCTION TO DATA SCIENCE
JOHN P DICKERSON
Lecture #19 – 4/6/2017
CMSC320
Tuesdays & Thursdays
3:30pm – 4:45pm
ANNOUNCEMENTS
Mini-Project #3 is out!
• It is now due April 21st (was April 18th).
• ELMS link: https://myelms.umd.edu/courses/1218364/assignments/4389209
This is a two-part mini-project (regression & classification)!
Midterm next Thursday (4/13); good ways to study:
• Lecture slides; trying stuff out in iPython; Piazza; external links on the course website; asking TAs about gaps in knowledge
TODAY’S LECTURE
Data collection → Data processing → Exploratory analysis & data viz → Analysis, hypothesis testing, & ML → Insight & Policy Decision
BIG THANKS: Zico Kolter (CMU)
SUMMARY
Hypothesis testing allows us to formulate beliefs about investment attributes and subject those beliefs to rigorous testing following the scientific method.
• For parametric hypothesis testing, we formulate our beliefs (hypotheses), collect data, and calculate a value of the investment attribute in which we are interested (the test statistic) for that set of data (the sample), and then we compare that with a value determined under assumptions that describe the underlying population (the critical value). We can then assess the likelihood that our beliefs are true given the relationship between the test statistic and the critical value.
• Commonly tested beliefs associated with the expected return and variance of returns for a given investment or investments can be formulated in this way.
Outline
Motivation
Background: sample statistics and central limit theorem
Basic hypothesis testing
Experimental design
Motivating setting
For a data science course, there has been very little “science” thus far…
“Science” as I’m using it roughly refers to “determining truth about the real world”
Asking scientific questions
Suppose you work for a company that is considering a redesign of their website; does their new design (design B) offer any statistical advantage to their current design (design A)?
In linear regression, does a certain variable impact the response? (E.g. does energy consumption depend on whether or not a day is a weekday or weekend?)
In both settings, we are concerned with making actual statements about the nature of the world
Sample statistics
To be a bit more consistent with standard statistics notation, we’ll introduce the notion of a population and a sample
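Population:
• Mean: $\mu = \mathbf{E}[X]$
• Variance: $\sigma^2 = \mathbf{E}[(X - \mu)^2]$
Sample:
• Mean: $\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$
• Variance: $s^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x^{(i)} - \bar{x})^2$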
Sample mean as random variable
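The sample mean is an empirical average over $m$ independent samples from the distribution; it can also be considered as a random variable
This new random variable has the mean and variance
$\mathbf{E}[\bar{x}] = \mathbf{E}\left[\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\right] = \frac{1}{m} \sum_{i=1}^{m} \mathbf{E}[X] = \mathbf{E}[X] = \mu$
$\mathbf{Var}[\bar{x}] = \mathbf{Var}\left[\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\right] = \frac{1}{m^2} \sum_{i=1}^{m} \mathbf{Var}[X] = \frac{\sigma^2}{m}$
where we used the fact that for independent random variables $X_1, X_2$, $\mathbf{Var}[X_1 + X_2] = \mathbf{Var}[X_1] + \mathbf{Var}[X_2]$ (and that $\mathbf{Var}[aX] = a^2 \mathbf{Var}[X]$ for a constant $a$)
When estimating the variance of the sample mean, we use $s^2/m$ (the square root of this term is called the standard error)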
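A quick simulation can make the $\sigma^2/m$ result concrete. This is only a sketch: the exponential distribution, the sample size, and the number of repetitions are illustrative choices, not something from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50          # sample size (illustrative)
trials = 20000  # number of repeated samples (illustrative)

# Exponential with scale 2 has population variance sigma^2 = 4.
samples = rng.exponential(scale=2.0, size=(trials, m))
xbars = samples.mean(axis=1)  # one sample mean per repeated sample

var_of_mean = xbars.var()  # empirical variance of the sample mean
theory = 4.0 / m           # sigma^2 / m

# Standard error estimated from a single sample: sqrt(s^2 / m)
std_err = np.sqrt(samples[0].var(ddof=1) / m)
print(var_of_mean, theory, std_err)
```

The empirical variance of the sample means should land close to $\sigma^2/m$, and the single-sample standard error should be in the neighborhood of its square root.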
Central limit theorem
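The central limit theorem states further that $\bar{x}$ (for “reasonably sized” samples, in practice $m \geq 30$) is approximately Gaussian regardless of the distribution of $X$ (provided $X$ has finite variance)
$\bar{x} \to \mathcal{N}(\mu, \sigma^2/m)$ or equivalently $\frac{\bar{x} - \mu}{\sigma / m^{1/2}} \to \mathcal{N}(0, 1)$
In practice, for $m < 30$ and for estimating $\sigma^2$ using the sample variance, we use a Student’s t-distribution with $m - 1$ degrees of freedom
$\frac{\bar{x} - \mu}{s / m^{1/2}} \to T_{m-1}, \quad p(x; \nu) \propto \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$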
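To see the central limit theorem at work, one can standardize sample means drawn from a clearly non-Gaussian distribution and check how often they fall inside the usual Gaussian range. A minimal sketch, assuming a Uniform(0, 1) population (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
m, trials = 30, 20000
mu = 0.5                     # mean of Uniform(0, 1)
sigma = np.sqrt(1.0 / 12.0)  # standard deviation of Uniform(0, 1)

x = rng.uniform(0.0, 1.0, size=(trials, m))
z = (x.mean(axis=1) - mu) / (sigma / np.sqrt(m))  # standardized sample means

# For a standard Gaussian, about 95% of the mass lies within +/- 1.96.
frac_within = np.mean(np.abs(z) < 1.96)
print(frac_within)
```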
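Aside: why the $m - 1$ scaling?
We scale the sample variance by $m - 1$ so that it is an unbiased estimate of the population variance
$\mathbf{E}\left[\sum_{i=1}^{m} (x^{(i)} - \bar{x})^2\right] = \mathbf{E}\left[\sum_{i=1}^{m} \left((x^{(i)} - \mu) - (\bar{x} - \mu)\right)^2\right]$
$= \mathbf{E}\left[\sum_{i=1}^{m} (x^{(i)} - \mu)^2 - 2(\bar{x} - \mu) \sum_{i=1}^{m} (x^{(i)} - \mu) + m(\bar{x} - \mu)^2\right]$
$= \mathbf{E}\left[\sum_{i=1}^{m} (x^{(i)} - \mu)^2\right] - m\,\mathbf{E}\left[(\bar{x} - \mu)^2\right]$ (using $\sum_{i=1}^{m} (x^{(i)} - \mu) = m(\bar{x} - \mu)$)
$= m\,\mathbf{Var}[X] - m\,\frac{\mathbf{Var}[X]}{m} = (m - 1)\sigma^2$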
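The bias is easy to check numerically. The sketch below, with an illustrative Gaussian population of variance 9 and a deliberately small sample size, averages both versions of the estimator over many repeated samples; NumPy's `ddof` argument selects the divisor ($m$ for `ddof=0`, $m - 1$ for `ddof=1`).

```python
import numpy as np

rng = np.random.default_rng(2)
m, trials = 5, 200000
sigma2 = 9.0  # population variance (illustrative)

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))

biased = x.var(axis=1, ddof=0).mean()    # divides by m
unbiased = x.var(axis=1, ddof=1).mean()  # divides by m - 1
print(biased, unbiased)
```

Dividing by $m$ should average out near $\frac{m-1}{m}\sigma^2 = 7.2$, while dividing by $m - 1$ should average out near $\sigma^2 = 9$.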
Hypothesis testing
Using these basic statistical techniques, we can devise some tests to determine whether certain data gives evidence that some effect “really” occurs in the real world
Fundamentally, this is evaluating whether things are (likely to be) true about the population (all the data) given a sample
Lots of caveats about the precise meaning of these terms, to the point that many people debate the usefulness of hypothesis testing at all
But, still incredibly common in practice, and important to understand
Hypothesis testing basics
Posit a null hypothesis $H_0$ and an alternative hypothesis $H_1$ (usually just that “$H_0$ is not true”)
Given some data $x$, we want to accept or reject the null hypothesis in favor of the alternative hypothesis
• Accept $H_0$ when $H_0$ is true: correct
• Accept $H_0$ when $H_1$ is true: Type II error (false negative)
• Reject $H_0$ when $H_0$ is true: Type I error (false positive)
• Reject $H_0$ when $H_1$ is true: correct
$p(\text{reject } H_0 \mid H_0 \text{ true})$ = “significance of test”
$p(\text{reject } H_0 \mid H_1 \text{ true})$ = “power of test”
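Both quantities can be estimated by simulation: generate many data sets under each hypothesis and count how often the test rejects. This is a hedged sketch using SciPy's one-sample t-test; the significance level, sample size, and the effect size of 0.5 under $H_1$ are all illustrative assumptions.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
alpha, m, trials = 0.05, 30, 5000  # illustrative settings

def rejection_rate(true_mean):
    """Fraction of simulated samples for which H0: mu = 0 is rejected."""
    x = rng.normal(true_mean, 1.0, size=(trials, m))
    _, p = st.ttest_1samp(x, 0.0, axis=1)  # one test per row
    return np.mean(p < alpha)

type1_rate = rejection_rate(0.0)  # H0 true: estimates the significance
power = rejection_rate(0.5)       # H1 true: estimates the power
print(type1_rate, power)
```

When $H_0$ holds, the rejection rate should sit near the chosen significance level; the power depends on how far the true mean is from zero and on the sample size.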
Basic approach to hypothesis testing
Basic approach: compute the probability of observing the data (or something more extreme) under the null hypothesis (this is the p-value of the statistical test)
$p = p(\text{data} \mid H_0 \text{ is true})$
Reject the null hypothesis if the p-value is below the desired significance level (alternatively, just report the p-value itself, which is the lowest significance level we could use to reject the null hypothesis)
Important: the p-value is $p(\text{data} \mid H_0 \text{ is true})$, not $p(H_0 \text{ not true} \mid \text{data})$
Canonical example: t-test
Given a sample $x^{(1)}, \ldots, x^{(m)} \in \mathbb{R}$
$H_0: \mu = 0$ (for population)
$H_1: \mu \neq 0$
By the central limit theorem, we know that $(\bar{x} - \mu) / (s / m^{1/2}) \sim T_{m-1}$ (Student’s t-distribution with $m - 1$ degrees of freedom)
So we just compute $t = \bar{x} / (s / m^{1/2})$ (called the test statistic), then compute
$p = p(x > |t|) + p(x < -|t|) = F(-|t|) + 1 - F(|t|) = 2F(-|t|)$
(where $F$ is the cumulative distribution function of Student’s t-distribution)
Visual example
What we are doing fundamentally is modeling the distribution $p(\bar{x} \mid H_0)$ and then determining the probability of the observed $\bar{x}$ or a more extreme value
[Figure: density of $p(\bar{x} \mid H_0)$, with $p$ = area under the curve]
Code in Python
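Compute the $t$ statistic and $p$ value from data
import numpy as np
import scipy.stats as st

m = 100  # sample size (not given on the slide; illustrative value)
x = np.random.randn(m)

# compute t statistic and p value
xbar = np.mean(x)
s2 = np.sum((x - xbar)**2) / (m - 1)  # sample variance
std_err = np.sqrt(s2 / m)             # standard error of the mean
t = xbar / std_err

t_dist = st.t(m - 1)                  # Student's t with m-1 degrees of freedom
p = 2 * t_dist.cdf(-np.abs(t))        # two-sided p value

# with scipy alone
t, p = st.ttest_1samp(x, 0)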
Two-sided vs. one-sided tests
The previous test considered deviation from the null hypothesis in both directions (two-sided test); it is also possible to consider a one-sided test
$H_0: \mu \geq 0$ (for population)
$H_1: \mu < 0$
Same $t$ statistic as before, but we only compute the area under the left side of the curve
$p = p(x < t) = F(t)$
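The one-sided p-value can be computed by hand from the t-distribution's CDF and compared against SciPy. A sketch on synthetic data (the sample and its true mean of -0.3 are made up for illustration); the `alternative="less"` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(4)
m = 40
x = rng.normal(-0.3, 1.0, size=m)  # synthetic sample (illustrative)

# Same t statistic as in the two-sided test
t = x.mean() / np.sqrt(x.var(ddof=1) / m)

p_one_sided = st.t(m - 1).cdf(t)            # area to the left of t
p_two_sided = 2 * st.t(m - 1).cdf(-abs(t))  # both tails, for comparison

# SciPy equivalent for H1: mu < 0 (requires SciPy >= 1.6)
t_sp, p_sp = st.ttest_1samp(x, 0.0, alternative="less")
print(p_one_sided, p_two_sided, p_sp)
```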
Confidence intervals
We can also use the $t$ statistic to create confidence intervals for the mean
Because $\bar{x}$ has mean $\mu$ and (estimated) variance $s^2/m$, we know that $1 - \alpha$ of its probability mass must lie within the range
$\bar{x} = \mu \pm \frac{s}{m^{1/2}} \cdot F^{-1}\left(1 - \frac{\alpha}{2}\right) \equiv \mu \pm \mathrm{CI}(s, m, \alpha)$
$\Longleftrightarrow \mu = \bar{x} \pm \mathrm{CI}(s, m, \alpha)$
where $F^{-1}$ denotes the inverse CDF of the t-distribution with $m - 1$ degrees of freedom
# simple confidence interval computation
CI = lambda s,m,a : s / np.sqrt(m) * st.t(m-1).ppf(1-a/2)
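One way to sanity-check the interval is to measure its coverage: over many repeated samples, the interval $\bar{x} \pm \mathrm{CI}(s, m, \alpha)$ should contain the true mean about $1 - \alpha$ of the time. A sketch with an illustrative Gaussian population; `ci_half_width` is just a named version of the lambda above.

```python
import numpy as np
import scipy.stats as st

def ci_half_width(s, m, alpha):
    """Half-width of the (1 - alpha) confidence interval for the mean."""
    return s / np.sqrt(m) * st.t(m - 1).ppf(1 - alpha / 2)

rng = np.random.default_rng(5)
mu, m, alpha, trials = 2.0, 25, 0.05, 5000  # illustrative settings

x = rng.normal(mu, 1.5, size=(trials, m))
xbar = x.mean(axis=1)
half = ci_half_width(x.std(axis=1, ddof=1), m, alpha)  # one width per sample

# Fraction of intervals that contain the true mean
coverage = np.mean(np.abs(xbar - mu) <= half)
print(coverage)
```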
Experimental design: A/B testing
Up until now, we have assumed that the null hypothesis is given by some known mean, but in reality, we may not know the mean that we want to compare to
Example: we want to tell if some additional feature on our website makes users stay longer, so we need to estimate both how long users stay on the current site and how long they stay on the redesigned site
Standard approach is A/B testing: create a control group (mean $\mu_1$) and a treatment group (mean $\mu_2$)
$H_0: \mu_1 = \mu_2$ (or e.g. $\mu_1 \geq \mu_2$)
$H_1: \mu_1 \neq \mu_2$ (or e.g. $\mu_1 < \mu_2$)
Independent t-test (Welch’s t-test)
Collect samples (possibly different numbers) from both populations
$x_1^{(1)}, \ldots, x_1^{(m_1)}, \quad x_2^{(1)}, \ldots, x_2^{(m_2)}$
compute sample means $\bar{x}_1, \bar{x}_2$ and sample variances $s_1^2, s_2^2$ for each group
Compute test statistic
$t = \frac{\bar{x}_1 - \bar{x}_2}{\left(s_1^2/m_1 + s_2^2/m_2\right)^{1/2}}$
And evaluate using a t distribution with degrees of freedom given by
$\frac{\left(s_1^2/m_1 + s_2^2/m_2\right)^2}{\frac{\left(s_1^2/m_1\right)^2}{m_1 - 1} + \frac{\left(s_2^2/m_2\right)^2}{m_2 - 1}}$
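The Welch statistic and its degrees of freedom translate directly into a few lines of NumPy, and SciPy's `ttest_ind` with `equal_var=False` performs the same test. The two groups below are synthetic stand-ins for control and treatment measurements; their sizes and parameters are illustrative.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(6)
x1 = rng.normal(10.0, 2.0, size=40)  # control group (synthetic)
x2 = rng.normal(11.0, 4.0, size=60)  # treatment group (synthetic)

m1, m2 = len(x1), len(x2)
v1 = x1.var(ddof=1) / m1  # s1^2 / m1
v2 = x2.var(ddof=1) / m2  # s2^2 / m2

t = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
dof = (v1 + v2) ** 2 / (v1 ** 2 / (m1 - 1) + v2 ** 2 / (m2 - 1))
p = 2 * st.t(dof).cdf(-abs(t))  # two-sided p value

# SciPy's Welch test for comparison
t_sp, p_sp = st.ttest_ind(x1, x2, equal_var=False)
print(t, dof, p)
```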
Starting to seem a bit ad hoc?
There are a huge number of different tests for different situations
You probably won’t need to remember these, and can just look up whatever test is most appropriate for your given situation
But the basic idea in all cases is the same: you’re trying to find the distribution of your test statistic under the null hypothesis, and then you are computing the probability of the observed test statistic or something more extreme
All the different tests are really just about different distributions based upon your problem setup
Hypothesis testing in linear regression
One last example (because it’s useful in practice): consider the linear regression $y \approx \theta^T x$, and suppose we want to perform a hypothesis test on the coefficients of $\theta$
Example: suppose that instead of just two websites, you have a website with multiple features that can be turned on/off, and your sample data includes a wide variety of different samples
We would like to ask the question: is the $i$th variable relevant for predicting the output?
We’ve already seen ways we can do this (e.g., evaluating cross-validation error), but it’s a bit difficult to understand what such results mean
Formula for sample variance in linear regression
There is an analogous formula for sample variance on the errors that a linear regression model makes
$s^2 = \frac{1}{m - n} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$
Use this to determine sample covariance of coefficients
$\mathbf{Cov}[\theta] = s^2 (X^T X)^{-1}$
Can then evaluate null hypothesis $H_0: \theta_i = 0$, using t statistic
$t = \theta_i / \mathbf{Cov}[\theta]_{i,i}^{1/2}$
Similar procedure to get confidence intervals of coefficients
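These formulas can be sketched in NumPy on synthetic data. Two things here are assumptions rather than statements from the slides: the data-generating model is made up, and the p-values use a t-distribution with $m - n$ degrees of freedom, which is the standard choice for ordinary least squares.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(7)
m = 200
feature = rng.normal(size=m)
y = 1.0 + 0.3 * feature + rng.normal(size=m)  # synthetic response

X = np.column_stack([np.ones(m), feature])  # intercept column plus one feature
n = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
theta = XtX_inv @ X.T @ y              # least-squares coefficients
resid = y - X @ theta
s2 = resid @ resid / (m - n)           # sample variance of the errors
cov = s2 * XtX_inv                     # sample covariance of the coefficients

t = theta / np.sqrt(np.diag(cov))      # one t statistic per coefficient
p = 2 * st.t(m - n).cdf(-np.abs(t))    # two-sided p values (assumed m - n dof)
print(theta, t, p)
```

With a single feature plus an intercept, the slope's p-value should agree with the one reported by `scipy.stats.linregress`.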
P-values considered harmful
A basic problem is that $p(\text{data} \mid H_0) \neq p(H_0 \mid \text{data})$ (despite being frequently interpreted as such)
People treat $p < 0.05$ with way too much importance
[Figure: histogram of p values from ~3,500 published journal papers (from E. J. Masicampo and Daniel Lalande, A peculiar prevalence of p values just below .05, 2012)]
QUICK ASIDE FOR HW #3
Precision P: #correct positive results / #positive results returned
Recall R: #correct positive results / #all possible positive results
F-Score F: weighted average of the precision and recall of a test
F1: (harmonic) mean of precision and recall: $F_1 = \frac{2PR}{P + R}$
Can be parameterized to attach higher importance to recall: $F_\beta = \frac{(1 + \beta^2)\,PR}{\beta^2 P + R}$, with $\beta > 1$
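For the mini-project, precision, recall, and the F-score can be computed from binary labels in a few lines. This is a minimal sketch: the helper name `precision_recall_fbeta`, the toy labels, and the convention of returning 0 when a denominator is zero are all illustrative choices, and the standard $F_\beta$ definition is assumed.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Precision, recall, and F-beta for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correct positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # wrongly returned
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f

y_true = [1, 1, 1, 0, 0, 1, 0, 0]  # toy ground truth
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]  # toy predictions

precision, recall, f1 = precision_recall_fbeta(y_true, y_pred)
_, _, f2 = precision_recall_fbeta(y_true, y_pred, beta=2.0)
print(precision, recall, f1, f2)
```

Here recall is lower than precision, so weighting recall more heavily with $\beta = 2$ pulls the score below $F_1$.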
NEXT CLASS:
MIDTERM PREP & REGULARIZATION