Some definitions
◦ Individual: each object described by a set of data
◦ Variable: any characteristic of an individual
⋄ Categorical variable: places an individual into one of several
groups or categories.
⋄ Quantitative variable: takes numerical values on which we can
do arithmetic.
◦ Distribution of a variable: tells what values it takes and how often
it takes these values.
Example:
The following data set consists of five variables about 20 individuals.
ID  Age  Education  Sex  Total income  Job class
 1   43      4       1        18526        5
 2   35      3       2         5400        7
 3   43      2       1         3900        7
 4   33      3       1        28003        5
 5   38      3       2        43900        7
 6   53      4       1        53000        5
 7   64      6       1        51100        6
 8   27      4       2        44000        5
 9   34      4       1        31200        5
10   27      3       2        26030        5
11   47      6       1         6000        6
12   48      3       1         8145        5
13   39      2       1        37032        5
14   30      3       2        30000        5
15   35      3       2        17874        5
16   47      4       2          400        5
17   51      4       2        22216        5
18   56      5       1        26000        6
19   57      6       1       100267        7
20   34      1       1        15000        5
Age: age in years
Education: 1=no high school, 2=some high school, 3=high school diploma, 4=some college, 5=bachelor’s degree, 6=postgraduate degree
Sex: 1=male, 2=female
Total income: income from all sources
Job class: 5=private sector, 6=government, 7=self-employed
Variables Age and Total income are quantitative; variables Education, Sex,
and Job class are categorical.
Categorical variable analysis
Questions to ask about a categorical variable:
◦ How many categories are there?
◦ In each category, how many observations are there?
Bar graphs and pie charts
Categorical data can be displayed by bar graphs or pie charts.
◦ In a bar graph, the horizontal axis lists the categories, in any order.
The height of the bars can be either counts or percentages.
◦ For better comparison of the frequencies, the categories can be ordered
from most frequent to least frequent.
◦ In a pie chart, the area of each slice is proportional to the percentage
of individuals who fall into that category.
Example: Education of people aged 25 to 34
[Bar graph: percent of people aged 25 to 34 by education level (no HS, some HS, HS diploma, some college, Bachelor’s, postgrad).]

[Bar graph: the same percentages with the categories sorted from most to least frequent (HS diploma, Bachelor’s, some college, some HS, postgrad, no HS).]

[Pie chart of education level: no HS 3.6%, some HS 7.5%, HS diploma 30.4%, Bachelor’s 29.1%, some college 22.7%, postgrad 6.7%.]
Categorical variable analysis
Example: Education of people aged 25 to 34
STATA commands:
. infile ID AGE EDUC SEX EARN JOB using individuals.txt, clear
. drop if AGE<25 | AGE>34
. label values EDUC Education
. label define Education 1 "no HS" 2 "some HS" 3 "HS diploma" 4 "Bachelor’s"
> 5 "some college" 6 "postgrad"
. set scheme s1mono
. gen COUNT=100/_N
. graph bar (sum) COUNT, over(EDUC) ytitle("Percent of people aged 25 to 34")
> b1title("Education level")
. translate @Graph bar1.eps, translator(Graph2eps) replace
. graph bar (sum) COUNT, over(EDUC, sort(1) descending)
> ytitle("Percent of people aged 25 to 34") b1title("Education level")
. translate @Graph bar2.eps, translator(Graph2eps) replace
. set scheme s1color
. graph pie COUNT, over(EDUC) plabel(_all perc, format(%4.1f) gap(-5))
. translate @Graph pie.eps, translator(Graph2eps) replace
Quantitative variables: stemplots
Example: Sammy Sosa home runs
Producing stemplots in STATA:
. infile YEAR HR using sosa.dat
. stem HR
Stem-and-leaf plot for HR
0* | 48
1* | 05
2* | 5
3* | 366
4* | 009
5* | 0
6* | 346
Year  Home runs
1989      4
1990     15
1991     10
1992      8
1993     33
1994     25
1995     36
1996     40
1997     36
1998     66
1999     63
2000     50
2001     64
2002     49
2003     40
How to make a stemplot
1. Separate each observation into a stem and a leaf.
e.g. 15 → stem 1, leaf 5; and 4 → stem 0, leaf 4.
2. Write the stems in a vertical column in increasing order.
3. Write each leaf next to its stem, in increasing order out from the stem.
How to choose the stem
◦ Rounding: each leaf should have exactly one digit, so rounding long
numbers before producing the stemplot can help produce a more com-
pact and informative plot.
◦ Splitting: if many stems have a large number of leaves, each stem can
be split in two, with leaves of 0-4 going to the first stem and 5-9
going to the second; a STATA sketch follows below.
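In STATA, splitting can be requested directly (a sketch, reusing the sosa.dat file from above; stem’s lines() option controls how many stems each leading digit is split into):

. infile YEAR HR using sosa.dat
. stem HR, lines(2)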
Quantitative variables: histograms
How to make a histogram
1. Group the observations into “bins” according to their value. Choose
the bins carefully: too few hide detail, too many fragment the pattern.
2. Count the individuals in each bin.
3. Draw the histogram
◦ Leave no space between bars.
◦ Label the axes with units of measurement.
◦ The y-axis can show counts or percentages (per unit).
Example: Sammy Sosa home runs
(Home run data as in the stemplot example above.)
[Density-scaled histogram of the Sosa home run data; x-axis: home runs, 0 to 70.]
The area of each bar is proportional to the percentage of data in that range.
We care about the area, not the height, but when the bars have equal width,
area is determined by the height.
For simplicity, use equally spaced bins.
Quantitative variables: histograms
Example: Sammy Sosa home runs
Histograms with different bin widths:
[Four histograms of the Sosa home runs drawn with different bin widths; x-axis: home runs, 0 to 70; y-axis: percentage, 0.00 to 0.07.]
Producing histograms in STATA:
. infile YEAR HR using sosa.dat
. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs)
. translate @Graph hist1.eps, translator(Graph2eps) replace
. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs) freq
. translate @Graph hist2.eps, translator(Graph2eps) replace
[The two histograms produced by the commands above: one on the density scale, one on the frequency scale; x-axis: home runs, 0 to 70.]
Why is a histogram not a bar graph?
◦ Frequencies are represented by area, not height.
◦ There is no space between the bars.
◦ The horizontal axis represents a numerical quantity, with an inherent
order.
Interpreting histograms
◦ Describe the overall pattern and any significant deviations from that
pattern.
◦ Shape: Is the distribution (approximately) symmetric or skewed?
[Histogram of a variable x with a long right-hand tail; x-axis: x, 0.0 to 2.0; y-axis: frequency.]
This distribution is skewed right
because it has a long right-hand
tail.
◦ Center: Where is the “middle” of the distribution?
◦ Spread: What are the smallest and largest values?
◦ Outliers: Are there any observations that lie outside the overall pat-
tern? They could be unusual observations, or they could be mistakes.
Check them!
Example: Newcomb’s measurements of the passage time of light (IPS Table 1.1)
[Histogram of Newcomb’s passage times; x-axis: Time, −60 to 60; y-axis: Frequency, 0 to 25; a few observations lie far below the main cluster.]
Time plots
Example: Average retail price of gasoline from Jan 1988 to Apr 2001
[Time plot: average retail gasoline price (0.9 to 1.8) against year, 1988 to 2000.]
Note: Whenever data are collected over time, it is a good idea to have
a time plot. Stemplots and histograms ignore time order, which can be
misleading when systematic change over time exists.
Producing a time plot in STATA:
. infile PRICE using gasoline.txt, clear
. graph twoway line PRICE T, ylabel(0.9(0.1)1.8, format(%3.1f)) xtick(0(12)159)
> xlabel(0 "1988" 24 "1990" 48 "1992" 72 "1994" 96 "1996" 120 "1998" 144 "2000")
> xtitle(Year) ytitle(Retail gasoline price)
Measures of center
The mean
The mean of a distribution is the arithmetic average of the observations:

x̄ = (x1 + · · · + xn)/n = (1/n) ∑_{i=1}^{n} xi
The median
The median is the midpoint of a distribution: the number M
such that
◦ half the observations are smaller and
◦ half are larger.
How to find the median
Suppose the observations are x1, x2, . . . , xn.
1. Arrange the data in increasing order and let x(i) denote the ith
smallest observation.
2. If the number of observations n is odd, the median is the center
observation in the ordered list:
M = x((n+1)/2)
3. If the number of observations n is even, the median is the average
of the two center observations in the ordered list:

M = (x(n/2) + x(n/2+1))/2
Measures of center
Examples:
Data set 1:
x1 x2 x3 x4 x5 x6 x7 x8 x9
2 4 3 4 6 5 4 -6 5
Arrange in increasing order:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)
-6 2 3 4 4 4 5 5 6
There is an odd number of observations, so the median is
M = x((n+1)/2) = x(5) = 4.
The mean is given by

x̄ = (2 + 4 + 3 + 4 + 6 + 5 + 4 + (−6) + 5)/9 = 27/9 = 3.
Data set 2:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
2.3 8.8 3.9 4.1 6.4 5.9 4.2 2.9 1.3 5.1
Arrange in increasing order:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10)
1.3 2.3 2.9 3.9 4.1 4.2 5.1 5.9 6.4 8.8
There is an even number of observations, so the median is

M = (x(n/2) + x(n/2+1))/2 = (x(5) + x(6))/2 = (4.1 + 4.2)/2 = 4.15.
The mean is given by

x̄ = (2.3 + 8.8 + 3.9 + 4.1 + 6.4 + 5.9 + 4.2 + 2.9 + 1.3 + 5.1)/10 = 44.9/10 = 4.49.
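In STATA, both measures can be obtained at once (a sketch, assuming the observations are stored in a variable X; summarize with the detail option reports the mean and the median, the latter labeled 50%):

. summarize X, detail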
Mean versus median
◦ The mean is easy to work with algebraically, while the median
is not.
◦ The mean is sensitive to extreme observations, while the median
is more robust.
Example:
Data: 0, 1, 2. Now replace the largest observation, 2, by 10.
The original mean and median are

x̄ = (0 + 1 + 2)/3 = 1 and M = x((n+1)/2) = x(2) = 1

The modified mean and median are

x̄ = (0 + 1 + 10)/3 = 3 2/3 and M = x((n+1)/2) = x(2) = 1
◦ If the distribution is exactly symmetric, then mean=median.
◦ In a skewed distribution, the mean is further out in the longer
tail than the median.
◦ The median is preferable for strongly skewed distributions, or
when outliers are present.
Measures of spread
Example: Monthly returns on two stocks
[Histograms of the returns (in %) of Stock A and Stock B on the same scale; Stock B is much more spread out.]
Stock A Stock B
Mean 4.95 4.82
Median 4.99 4.68
The distributions of the two stocks have approximately the same
mean and median, but stock B is more volatile and thus more risky.
◦ Measures of center alone are an insufficient description of a
distribution and can be misleading
◦ The simplest useful numerical description of a distribution con-
sists of both a measure of center and a measure of spread.
Common measures of spread are
◦ the quartiles and the interquartile range
◦ the standard deviation
Quartiles
Quartiles divide the data into four equal parts
◦ Lower (or first) quartile QL:
median of all observations less than the median M
◦ Middle (or second) quartile M = QM :
median of all observations
◦ Upper (or third) quartile QU :
median of all observations greater than the median M
◦ Interquartile range: IQR = QU − QL
distance between upper and lower quartile
How to find the quartiles
1. Arrange the data in increasing order and find the median M
2. Find the median of the observations to the left of M; that is the lower
quartile QL
3. Find the median of the observations to the right of M; that is the
upper quartile QU
Examples:
Data set:
x1 x2 x3 x4 x5 x6 x7 x8 x9
2 4 3 4 6 5 4 -6 5
Arrange in increasing order:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)
-6 2 3 4 4 4 5 5 6
◦ QL is the median of {−6, 2, 3, 4}: QL = 2.5
◦ QU is the median of {4, 5, 5, 6}: QU = 5
◦ IQR = 5 − 2.5 = 2.5
Percentiles
More generally we might be interested in the value which is ex-
ceeded only by a certain percentage of observations:
The pth percentile of a set of observations is the value such that
◦ p% of the observations are less than or equal to it and
◦ (100 − p)% of the observations are greater than or equal to it.
How to find the percentiles
1. Arrange the data into increasing order.
2. If np/100 is not an integer, then x(k+1) is the pth percentile,
where k is the largest integer less than np/100.
3. If np/100 is an integer, the pth percentile is the average of the
x(np/100) and x(np/100+1).
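A worked sketch using data set 2 above (n = 10): for the 30th percentile, np/100 = 10 · 30/100 = 3 is an integer, so the percentile is the average (x(3) + x(4))/2 = (2.9 + 3.9)/2 = 3.4. For the 35th percentile, np/100 = 3.5 is not an integer; the largest integer below it is k = 3, so the percentile is x(4) = 3.9.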
Five-number summary
A numerical summary of a distribution {x1, . . . , xn} is given by
x(1) QL M QU x(n)
A simple boxplot is a graph of the five-number summary.
Boxplots
A common “rule” for discovering outliers is the 1.5 × IQR rule:
An observation is a suspected outlier if it falls more
than 1.5 × IQR below QL or above QU .
How to draw a boxplot (box-and-whisker plot)
1. A box (the box) is drawn from the lower to
the upper quartile (QL and QU).
2. The median of the data is shown by a line in
the box.
3. Lines (the whiskers) are drawn from the ends
of the box to the most extreme observations
within a distance of 1.5 IQR (Interquartile
range).
4. Measurements falling more than 1.5 IQR outside
the ends of the box are potential outliers and
are marked by ◦ or ∗.
[Side-by-side boxplots of the Stock A and Stock B returns.]
Plotting a boxplot with STATA:
. infile A B using stocks.txt, clear
. label var A "Stock A"
. label var B "Stock B"
. graph box A B, xsize(2) ysize(5)
Boxplots
Interpretation of Box Plots
◦ The IQR is a measure of the sample’s variability.
◦ If the whiskers differ in length the distribution of the data is
probably skewed in the direction of the longer whisker.
◦ Very extreme observations (more than 3 IQR away from the
lower or upper quartile, respectively) are outliers, with one of the following
explanations:
a) The measurement is incorrect (error in measurement process
or data processing).
b) The measurement belongs to a different population.
c) The measurement is correct, but represents a rare (chance)
event.
We accept the last explanation only after carefully ruling out
all others.
Variance and standard deviation
Suppose there are n observations x1, x2, . . . , xn.
The variance of the n observations is:

s² = [(x1 − x̄)² + (x2 − x̄)² + · · · + (xn − x̄)²]/(n − 1) = (1/(n − 1)) ∑_{i=1}^{n} (xi − x̄)²
This is (approximately) the average of the squared distances of the
observations from the mean.
The standard deviation is:

s = √s² = √[ (1/(n − 1)) ∑_{i=1}^{n} (xi − x̄)² ]
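A worked sketch using data set 1 above (n = 9, x̄ = 3):

s² = [(2 − 3)² + (4 − 3)² + (3 − 3)² + (4 − 3)² + (6 − 3)² + (5 − 3)² + (4 − 3)² + (−6 − 3)² + (5 − 3)²]/8 = 102/8 = 12.75

so s = √12.75 ≈ 3.57.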
Why n − 1?
Division by n − 1 instead of n in the variance calculation is a
common cause of confusion. Why n − 1? Note that
∑_{i=1}^{n} (xi − x̄) = 0
Thus, if you know any n − 1 of the differences, the last difference
can be determined from the others. The number of “freely varying”
observations, n− 1 in this case, is called the “degrees of freedom”.
Properties of s
◦ Measures spread around the mean =⇒ use only if the mean
is used as a measure of center.
◦ s = 0 ⇔ all observations are the same
◦ s is in the same units as the measurements, while s2 is in the
square of these units.
◦ s, like x̄, is not resistant to outliers.
Five-number summary versus standard deviation
◦ The 5-number summary is better for describing skewed distri-
butions, since each side has a different spread.
◦ x and s are preferred for symmetric distributions with no out-
liers.
Histograms and density curves
What’s in our toolkit so far?
◦ Plot the data: histogram (or stemplot)
◦ Look for the overall pattern and identify deviations and outliers
◦ Numerical summary to briefly describe center and spread
A new idea:
If the pattern is sufficiently regular, approximate it with a
smooth curve.
Any curve that is always on or above the horizontal axis and has
total area underneath equal to one is a density curve.
◦ Area under the curve in a range of values indicates the propor-
tion of values in that range.
◦ Density curves come in a variety of shapes, but the “normal” family of familiar
bell-shaped densities is commonly used.
◦ Remember the density is only an approximation, but it sim-
plifies analysis and is generally accurate enough for practical
use.
Examples
[Histograms of sulfur oxide emissions (in tons) with a smooth density curve overlaid; one panel shades the same range under the histogram and under the curve.]
Shaded area of histogram: 0.29
Shaded area under the curve: 0.30
[Histogram of waiting times between eruptions (min), 40 to 100, with a density curve overlaid.]
Median and mean of a density curve
Median:
The equal-areas point with 50% of the “mass” on either side.
Mean:
The balancing point of the curve, if it were a solid mass.
Note:
◦ The mean and median of a symmetric density curve are equal.
◦ The mean of a skewed curve is pulled away from the median in
the direction of the long tail.
The mean and standard deviation of a density are denoted µ and
σ, rather than x̄ and s, to indicate that they refer to an idealized
model, and not actual data.
Normal distributions: N (µ, σ)
The normal distribution is
◦ symmetric,
◦ single-peaked,
◦ bell-shaped.
The density curve is given by

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).
It is determined by two parameters µ and σ:
◦ µ is the mean (also the median)
◦ σ is the standard deviation
Note: The point where the curve changes from concave to convex
is σ units from µ in either direction.
The 68-95-99.7 rule
◦ About 68% of the data fall inside (µ − σ, µ + σ).
◦ About 95% of the data fall inside (µ − 2σ, µ + 2σ).
◦ About 99.7% of the data fall inside (µ − 3σ, µ + 3σ).
Example
Scores on the Wechsler Adult Intelligence Scale (WAIS) for the 20
to 34 age group are approximately N(110, 25).
◦ About what percent of people in this age group have scores
above 110?
◦ About what percent have scores above 160?
◦ In what range do the middle 95% of all scores lie?
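A sketch of the answers using the 68-95-99.7 rule with µ = 110 and σ = 25: the normal curve is symmetric about 110, so about 50% score above 110; 160 = µ + 2σ, so about (100 − 95)/2 = 2.5% score above 160; and the middle 95% of scores lie within 2σ of the mean, i.e. between 60 and 160.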
Standardization and z-scores
Linear transformation of normal distributions:

X ∼ N(µ, σ) ⇒ aX + b ∼ N(aµ + b, |a| σ)

In particular it follows that

(X − µ)/σ ∼ N(0, 1).

N(0, 1) is called the standard normal distribution.
For a real number x the standardized value or z-score

z = (x − µ)/σ
tells how many standard deviations x is from µ, and in what di-
rection.
Standardization enables us to use a standard normal table to find
probabilities for any normal variable.
For example:
◦ What is the proportion of N(0, 1) observations less than 1.2?
◦ What is the proportion of N(3, 1.5) observations greater than 5?
◦ What is the proportion of N(10, 5) observations between 3 and 9?
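These three proportions can be checked in STATA (a sketch; normal() is STATA’s standard normal cumulative distribution function):

. display normal(1.2)
. display 1 - normal((5-3)/1.5)
. display normal((9-10)/5) - normal((3-10)/5)

The results are approximately 0.885, 0.091, and 0.340.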
Normal calculations
Standard normal calculations
1. State the problem in terms of x.
2. Standardize: z = (x − µ)/σ.
3. Look up the required value(s) on the standard normal table.
4. Reality check: Does the answer make sense?
Backward normal calculations
We can also calculate the values, given the probabilities:
If MPG ∼ N (25.7, 5.88), what is the minimum MPG required to be in the
top 10%?
“Backward” normal calculations
1. State the problem in terms of the probability of being less
than some number.
2. Look up the required value(s) on the standard normal table.
3. “Unstandardize,” i.e. solve z = (x − µ)/σ for x.
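A worked sketch for the MPG example above: we need x with P(X ≥ x) = 0.10, i.e. P(X ≤ x) = 0.90. The standard normal table gives z ≈ 1.28, so x = µ + zσ = 25.7 + 1.28 · 5.88 ≈ 33.2 MPG.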
Example
Suppose X ∼ N (0, 1).
◦ P(X ≤ 2) = ?
◦ P(X > 2) = ?
◦ P(−1 ≤ X ≤ 2) = ?
◦ Find the value z such that
⋄ P(X ≤ z) = 0.95
⋄ P(X > z) = 0.99
⋄ P(−z ≤ X < z) = 0.68
⋄ P(−z ≤ X < z) = 0.95
⋄ P(−z ≤ X < z) = 0.997
Suppose X ∼ N (10, 5).
◦ P(X < 5) = ?
◦ P(−3 < X < 5) = ?
◦ P(−x < X < x) = 0.95
Assessing Normality
How to make a normal quantile plot
1. Arrange the data in increasing order.
2. Record the percentiles (1/n, 2/n, . . . , n/n).
3. Find the z-scores for these percentiles.
4. Plot x on the vertical axis against z on the horizontal axis.
Use of normal quantile plots
◦ If the data are (approximately) normal, the plot will be close
to a straight line.
◦ Systematic deviations from a straight line indicate a nonnormal
distribution.
◦ Outliers appear as points that are far away from the overall
pattern of the plot. A STATA command for such plots is sketched below.
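STATA can draw such a plot directly (a sketch, reusing the Sosa home run data from earlier; qnorm plots the sample quantiles against normal quantiles and adds a reference line):

. infile YEAR HR using sosa.dat
. qnorm HR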
[Normal quantile plots of samples from N(0, 1), Exp(1), and U(0, 1); only the N(0, 1) sample lies close to a straight line.]
Density Estimation
The normal density is just one possible density curve. There are
many others, some with compact mathematical formulas and many
without.
Density estimation software fits an arbitrary density to data to give
a smooth summary of the overall pattern.
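In STATA, a kernel density estimate like the one below can be drawn with kdensity (a sketch; the variable name VELOCITY is an assumption):

. kdensity VELOCITY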
[Density estimate for the velocities of galaxies (in 1000 km/s).]
Histogram
How to scale a histogram?
◦ Easiest way to draw a histogram:
⋄ equally spaced bins
⋄ counts on the vertical axis
[Frequency histogram of the Sosa home runs; y-axis: counts, 0 to 5.]
Disadvantage: Scaling depends on number of observations and
bin width.
◦ Scale histogram such that area of each bar corresponds to pro-
portion of data:
height = counts / (width · total number of observations)
[The same histogram on the density scale; y-axis: 0.00 to 0.04.]
Proportion of data in interval (0, 10]:
height · width = 0.02 · 10 = 0.2 = 20%
Since n = 15 this corresponds to 3 observations.
Density curves
[Histograms of standard normal samples of size n = 250, n = 2500, and n = 250000, and the limiting density curve as n → ∞.]
Proportion of data in (1, 2]:

#{xi : 1 < xi ≤ 2}/n → ∫_1^2 f(x) dx as n → ∞

Probability that a new observation X falls into [a, b]:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx = lim_{n→∞} #{xi : a < xi ≤ b}/n
Relationships between data
Example: Smoking and mortality
◦ Data from 25 occupational groups
(condensed from data on thousands of individual men)
◦ Smoking (100 = average number of cigarettes per day)
◦ Mortality ratio for deaths from lung cancer
(100 = average ratio for all English men)
Scatter plot of the data:
[Scatter plot: mortality index vs smoking index for the 25 occupational groups.]
In STATA:
. insheet using smoking.txt
. graph twoway scatter mortality smoking
Relationship between data
Assessing a scatter plot:
◦ What is the overall pattern?
⋄ form of the relationship?
⋄ direction of the relationship?
⋄ strength of the relationship?
◦ Are there any deviations (e.g. outliers) from these patterns?
Direction of relationship/association:
◦ positive association: above-average values of both variables
tend to occur together, and the same for below-average values
◦ negative association: above-average values of one variable
tend to occur with below-average values of the other, and vice
versa.
Strength of relationship/association:
◦ determined by how closely the points follow the overall pattern
◦ difficult to assess by eye; hence the need for a numerical measure
Correlation
Correlation is a numerical measure of the direction and strength
of the linear relationship between two quantitative variables.
The sample correlation r is defined as

r_xy = s_xy / √(s_x s_y),

where

s_x = (1/(n − 1)) ∑_{i=1}^{n} (xi − x̄)²  (sample variance of x),
s_y = (1/(n − 1)) ∑_{i=1}^{n} (yi − ȳ)²  (sample variance of y),
s_xy = (1/(n − 1)) ∑_{i=1}^{n} (xi − x̄)(yi − ȳ)  (sample covariance).
Properties:
◦ dimensionless quantity
◦ not affected by increasing linear transformations:
for x′i = a xi + b and y′i = c yi + d with a, c > 0,
r_x′y′ = r_xy
◦ −1 ≤ r_xy ≤ 1
◦ |r_xy| = 1 if and only if yi = a xi + b exactly, for some a ≠ 0 and b;
the sign of r_xy is the sign of a
◦ measures linear association between xi and yi
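In STATA, the sample correlation is computed with correlate (a sketch, reusing the smoking data loaded earlier):

. insheet using smoking.txt
. correlate mortality smoking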
Correlation
[Eight scatter plots of (x, y) samples with correlations ρ = −0.9, −0.6, −0.3, 0, 0.3, 0.6, 0.9, and 0.99.]
Introduction to regression
Regression describes how one variable (response) depends on
another variable (explanatory variable).
◦ Response variable: variable of interest, measures the out-
come of a study
◦ Explanatory variable: explains (or even causes) changes in
response variable
Examples:
◦ Hearing difficulties:
response - sound level (decibels), explanatory - age (years)
◦ Real estate market:
response - listing price ($), explanatory - house size (sq. ft.)
◦ Salaries:
response - salary ($), explanatory - experience (years), educa-
tion, sex
Introduction to regression
Example: Food expenditures and income
Data: Sample of 20 households
[Scatter plot: food expenditure vs income for the 20 households.]
Questions:
◦ How does food expenditure (Y ) depend on income (X)?
◦ Suppose we know that X = x0, what can we tell about Y ?
Linear regression:
If the response Y depends linearly on the explanatory variable
X , we can use a straight line (regression line) to predict Y
from X .
Least squares regression
How to find the regression line
[Scatter plot of food expenditure vs income with a fitted line; a zoomed-in panel marks an observed y, the predicted ŷ, and the difference y − ŷ.]
Since we intend to predict Y from X , the errors of interest are
mispredictions of Y for fixed X .
The least squares regression line of Y on X is the line that
minimizes the sum of squared errors.
For observations (x1, y1), . . . , (xn, yn), the regression line is given
by

Ŷ = a + bX

where

b = r · (sy/sx) and a = ȳ − b x̄

(r: correlation coefficient; sx, sy: standard deviations; x̄, ȳ: means).
Least squares regression
Example: Food expenditure and income

X  28   26   32   24   54   59   44   30   40   82
Y  5.2  5.1  5.6  4.6  11.3 8.1  7.8  5.8  5.1  18.0

X  42   58   28   20   42   47   112  85   31   26
Y  4.9  11.8 5.2  4.8  7.9  6.4  20.0 13.7 5.1  2.9
The summary statistics are:
◦ x̄ = 45.50
◦ ȳ = 7.97
◦ sx = 23.96
◦ sy = 4.66
◦ r = 0.946
The regression coefficients are:

b = r · (sy/sx) = 0.946 · (4.66/23.96) = 0.184
a = ȳ − b x̄ = 7.97 − 0.184 · 45.5 = −0.402
[Scatter plot of food expenditure vs income with the fitted regression line.]
Interpreting the regression model
◦ The response in the model is denoted Ŷ to indicate that these
are predicted Y values, not the true Y values. The “hat” denotes prediction.
◦ The slope of the line indicates how much Ŷ changes for a unit
change in X.
◦ The intercept is the value of Ŷ for X = 0. It may or may not have
a physical interpretation, depending on whether or not X can
take values near 0.
◦ To make a prediction for an unobserved X, just plug it in and
calculate Ŷ.
◦ Note that the line need not pass through the observed data
points. In fact, it often will not pass through any of them.
Regression and correlation
Correlation analysis:
We are interested in the joint distribution of two (or more)
quantitive variables.
Example: Heights of 1,078 fathers and sons
[Scatter plot: son’s height vs father’s height (inches) for the 1,078 pairs.]
Points are scattered around the SD line, which
◦ is given by (y − ȳ) = (sy/sx)(x − x̄),
◦ goes through the center (x̄, ȳ),
◦ has slope sy/sx.
The correlation r measures how much the points spread around
the SD line.
Regression and correlation
Regression analysis:
We are interested how the distribution of one response variable
depends on one (or more) explanatory variables.
Example: Heights of 1,078 fathers and sons
[Scatter plot of son’s vs father’s height, with density panels showing the distribution of son’s height for fathers of height 64, 68, and 72 inches, and the regression line.]
In each vertical strip, the
points are distributed
around the regression
line.
Properties of least squares regression
◦ The distinction between explanatory and response variables is
essential. Looking at vertical deviations means that changing
the axes would change the regression line.
[Father-son height scatter plot with the two lines y = a + bx (regression of y on x) and x = a′ + b′y (regression of x on y).]
◦ A change of 1 sd in X corresponds to a change of r sds in Y .
◦ The least squares regression line always passes through the
point (x̄, ȳ).
◦ r2 (the square of the correlation) is the fraction of the variation
in the values of y that is explained by the least squares regres-
sion on x.
When reporting the results of a linear regression,
you should report r2.
These properties depend on the least-squares fitting criterion and
are one reason why that criterion is used.
The regression effect
Regression effect
In virtually all test-retest situations, the bottom group on the
first test will on average show some improvement on the sec-
ond test - and the top group will on average fall back. This is
the regression effect. The statistician and geneticist Sir Fran-
cis Galton (1822-1911) called this effect “regression to medi-
ocrity”.
[Father-son height scatter plot with the regression line, illustrating regression toward the mean.]
Regression fallacy
Thinking that the regression effect must be due to something
important, not just the spread around the line, is the regression
fallacy.
Regression in STATA
. infile food income size using food.txt
. graph twoway scatter food income || lfit food income, legend(off)
>   ytitle(food)
. regress food income

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =  151.97
       Model |  369.572965     1  369.572965           Prob > F      =  0.0000
    Residual |  43.7725361    18  2.43180756           R-squared     =  0.8941
-------------+------------------------------           Adj R-squared =  0.8882
       Total |  413.345502    19  21.7550264           Root MSE      =  1.5594

------------------------------------------------------------------------------
        food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .1841099   .0149345    12.33   0.000     .1527336    .2154862
       _cons |  -.4119994   .7637666    -0.54   0.596    -2.016613    1.192615
------------------------------------------------------------------------------
[Scatter plot of food expenditure vs income with the fitted regression line.]
This graph has been generated using the graphical user interface of STATA.
The complete command is:
. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
>   (lfit food income, range(0 120) clcolor(black) clpat(solid) clwidth(medium)),
>   ytitle(Food expenditure, size(large)) ylabel(, valuelabel angle(horizontal)
>   labsize(medlarge)) xtitle(Income, size(large)) xscale(range(0 120))
>   xlabel(0(20)120, labsize(medlarge)) legend(off) ysize(2) xsize(3)
Residual plots
Residuals: difference of observed and predicted values
ei = observed y − predicted y = yi − ŷi = yi − (a + b xi)
For a least squares regression, the residuals always have mean zero.
Residual plot
A residual plot is a scatterplot of the residuals against the
explanatory variable. It is a diagnostic tool to assess the fit of
the regression line.
Patterns to look for:
◦ Curvature indicates that the relationship is not linear.
◦ Increasing or decreasing spread indicates that the prediction
will be less accurate in the range of explanatory variables where
the spread is larger.
◦ Points with large residuals are outliers in the vertical direc-
tion.
◦ Points that are extreme in the x direction are potential high
influence points.
Influential observations are individuals with extreme x values
that exert a strong influence on the position of the regression line.
Removing them would significantly change the regression line.
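In STATA, residuals can be computed and plotted after a regression (a sketch, reusing the food expenditure data; E is a new variable name chosen here):

. regress food income
. predict E, residuals
. graph twoway scatter E income, yline(0)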
Regression Diagnostics
Example: First data set
[Scatter plot of Y vs X with fitted line; residuals plotted against fitted values and against X.]

Residuals are regularly distributed.
Regression Diagnostics
Example: Second data set
[Scatter plot of Y vs X with fitted line; the residual plots show a curved pattern.]

Functional relationship other than linear.
Regression Diagnostics
Example: Third data set
[Scatter plot of Y vs X with fitted line; the residual plots show one point with a very large residual.]

Outlier: the regression line misfits the majority of the data.
Regression Diagnostics
Example: Fourth data set
[Scatter plot of Y vs X with fitted line; the residual spread increases with X.]

Heteroscedasticity.
Regression Diagnostics
Example: Fifth data set
[Scatter plot of Y vs X with fitted line; one point lies far to the right of the others.]

One separate point in the direction of x: highly influential.
The Question of Causation
Example: Are babies brought by the stork?
◦ Data from 54 countries
◦ Variables:
⋄ Birth rate (newborns per 1000 women)
⋄ Number of storks (per 1000 women)
[Scatter plot: birth rate vs number of storks (per 1000 women) for the 54 countries, with a fitted line through the origin.]
Model: Birth rate (Y) is proportional to the number of storks (X)
Y = b X + ε
Least squares regression yields the slope estimate b = 4.3 ± 0.2.
Can we conclude that babies are brought by the stork?
The Question of Causation
A more serious example:
Variables:
◦ Income Y - response
◦ level of education X - explanatory variable
There is a positive association between income and education.
Question: Does better education increase income?
[Diagrams of possible explanations: (a) X has a causal effect on Y; (b) a third variable Z affects both X and Y (confounding); (c) a combination of both.]
Possible alternative explanation: Confounding
◦ People from prosperous homes are likely to receive many years of edu-
cation and are more likely to have high earnings.
◦ Education and income might both be affected by personal attributes
such as self-assurance. On the other hand, the level of education could
itself have an impact on, e.g., self-assurance. The effects of education and
self-assurance cannot be separated.
Confounding:
Response and explanatory variable both depend on a third
(hidden) variable.
Establishing Causal Relationships
Controlled experiments:
A cause-effect relationship between two variables X and Y can be
established by conducting an experiment where
◦ the values of X are manipulated and
◦ the effect on Y is observed.
Problem: Often such experiments are not possible.
If we cannot establish a causal relationship by a controlled experi-
ment, we can still collect evidence from observational studies:
◦ The association is strong.
◦ The association is consistent across multiple studies.
◦ Higher doses are associated with stronger responses.
◦ The alleged cause precedes the effect in time.
◦ The alleged cause is plausible.
Example: Smoking and lung cancer
Caution about Causation
Association is not causation
Two variables may be correlated because both are affected
by some other (measured or unmeasured) variable.
Unmeasured confounding variables can influence the in-
terpretation of relationships among the measured vari-
ables. They
◦ may suggest a relationship where there is none or
◦ may mask a real relationship.
No causation in - no causation out
Causation, unlike association, is not a statistical concept.
For inference on cause-effect relationships, we need some
knowledge about the causal relationships between the vari-
ables in the study.
Randomized experiments guarantee the absence of any
confounding variables. Any relationship between the ma-
nipulated variable and the response must be due to a
cause-effect relationship.
Experiments and Observational Studies
Two major types of statistical studies
◦ Observational study - observes individuals/objects and mea-
sures variables of interest but does not attempt to interfere with
the natural process.
◦ Designed experiment - deliberately imposes some treatment
on individuals to observe their responses.
Remarks:
◦ Sample surveys are an example of observational studies.
◦ In economics, most studies are observational.
◦ Clinical studies are often designed experiments.
◦ Designed experiments allow statements about causal relation-
ship between treatment and response.
◦ Observational studies have no control over variables. Thus the
effect of the explanatory variable on the response variable might
be confounded (mixed up) with the effect of some other vari-
ables. Such variables are called confounders and are a major source
of bias.
Designed Experiments
• In controlled experiments, the subjects are assigned to one of
two groups,
◦ treatment group and
◦ control group (which does not receive treatment).
• A controlled experiment is randomized if the subjects are ran-
domly assigned to one of the two groups.
• One precaution in designed experiments is the use of a placebo,
which is made of a completely neutral substance. The subjects
do not know whether they receive the treatment or a placebo;
any difference in the response thus cannot be attributed to
psychological and psychosomatic effects.
• In a double blind experiment, neither the subjects nor the
treatment administrators know who is assigned to the two
groups.
Example: The Salk polio vaccine field trial
◦ Randomized controlled double-blind experiment in 11 states
◦ 200,000 children in treatment group
◦ 200,000 children in control group treated with placebo
The difference between the responses of the two groups shows that
the vaccine reduces the risk of polio infection.
Confounding
Confounding means a difference between the treatment and control
groups—other than the treatment—which affects the responses
being studied. A confounder is a third variable, associated with
both exposure and disease.
Example: Lanarkshire Milk Experiment
The purpose of the experiment was to study the effect of pasteur-
ized milk on the health of children.
◦ The subjects of the experiment were school children.
◦ The children in the treatment group got a daily portion of pas-
teurized milk.
◦ The children in the control group did not receive any extra milk.
◦ The teachers assigned poorer children to the treatment group so
that they would get extra milk.
The effect of pasteurized milk on the health of children is con-
founded with the effect of wealth: Poorer children are more exposed
to diseases.
Observational Studies
Confounding is a major problem in observational studies.
Association is NOT Causation
Example: Does smoking cause cancer?
• Designed experiment not possible (cannot make people
smoke).
• Observation: Smokers have higher cancer rates
• Tobacco industry: There might be a gene which
◦ makes people smoke and
◦ causes cancer
In that case stopping smoking would not prevent cancer since
it is caused by the gene. The observed high association could
be attributed to the confounding effect of such a gene.
• However: studies with identical twins—one smoker and one
nonsmoker—cast serious doubt on the gene theory.
Example
Do screening programs speed up detection of breast cancer?
◦ Large-scale trial run by the Health Insurance Plan of Greater
New York, starting in 1963
◦ 62,000 women age 40 to 64 (all members of the plan)
◦ Randomly assigned to two equal groups
◦ Treatment group:
⋄ women were encouraged to come in for annual screening
⋄ 20,200 women did come in for screening
⋄ 10,800 refused.
◦ Control group:
⋄ was offered usual health care
◦ All the women were followed for many years.
Epidemiologists who worked on the study found that
◦ screening had little impact on diseases other than breast cancer;
◦ poorer women were less likely to accept screening than richer
ones; and
◦ most diseases fall more heavily on the poor than the rich.
Example
Deaths in the first five years of the screening trial, by cause. Rates per
1,000 women.
                           Breast cancer      All other causes
                  Persons  Number  Rate       Number  Rate
Treatment group    31,000      39   1.3          837    27
  Examined         20,200      23   1.1          428    21
  Refused          10,800      16   1.5          409    38
Control group      31,000      63   2.0          879    28
Questions:
◦ Does screening save lives?
◦ Why is the death rate from all other causes in the whole treatment
group (“examined” and “refused” combined) about the same as the
rate in the control group?
◦ Why is the death rate from all other causes higher for the “refused”
group than the “examined” group?
◦ Breast cancer (like polio, but unlike most other diseases) affects the
rich more than the poor. Which numbers in the table confirm this
association between breast cancer and income?
◦ The death rate (from all causes) among women who accepted screening
is about half the death rate among women who refused. Did screening
cut the death rate in half? If not, what explains the difference in death
rates?
◦ To show that screening reduces the risk from breast cancer, someone
wants to compare 1.1 and 1.5. Is this a good comparison? Is it biased
against screening? For screening?
Survey Sampling
Situation:
Population of N individuals (or items)
e.g. ◦ students at this university
◦ light bulbs produced by a company on one day
Seek information about population
e.g. ◦ amount of money students spent on books this quarter
◦ percentage of students who bought more than 10 books
in this quarter
◦ lifetime of light bulbs
Full data collection is often not possible because it is e.g.
◦ too expensive
◦ too time consuming
◦ not sensible (e.g. testing every produced light bulb for its lifetime)
Statistical approach:
◦ collect information from part of the population (sample)
◦ use information on sample to draw conclusions on whole pop-
ulation
Questions:
◦ How to choose a sample?
◦ What conclusions can be drawn?
Survey Sampling
Objective of a sample survey:
Gather information on some variable for population of N individ-
uals:
xi value of interest for ith individual
x1, . . . , xN values for population
Sample of length n:
x1, . . . , xn values obtained from sampling
Parameter - a number that describes the population, e.g.

µ_pop = (1/N) ∑_{j=1}^{N} xj   (population mean)

σ²_pop = (1/N) ∑_{j=1}^{N} (xj − µ_pop)²   (population variance)

Estimate population parameters from the sampled values:

µ̂_pop = x̄ = (1/n) ∑_{i=1}^{n} xi   (sample mean)

σ̂²_pop = s² = (1/(n − 1)) ∑_{i=1}^{n} (xi − x̄)²   (sample variance)
A function of the sample x1, . . . , xn is called a statistic.
Sampling Distribution
Suppose we are interested in the amount of money students at this
university have spent on books this quarter.
Idea: Ask 20 students about the amount they have spent and take
the average.
The value we obtain will vary from sample to sample, that is, if we
asked another 20 students we would get a different answer.
Sampling distribution
The sampling distribution of a statistic is the distribution of
all values taken by the statistic if evaluated for all possible
samples of size n taken from the same population.
In our example, the sampling distribution of the average amount
obtained from the sample depends on the way we choose the sample
from the population:
◦ Ask 20 students in this class.
◦ Ask 20 students in your department.
◦ Ask 20 students in the University bookshop.
◦ Select randomly 20 students from the register of the university.
The design of a sample refers to the method used to choose the
sample from the population.
Sampling Distribution
Example:
Consider a population of 15 students who spent the following
amounts on books:

x1   x2   x3   x4   x5   x6   x7   x8   x9   x10  x11  x12  x13  x14  x15
100  120  150  180  200  220  220  240  260  280  290  300  310  350  400
[Histograms of the sampling distribution of x̄ = (1/n) ∑_{i=1}^{n} xi for sample sizes (a) n = 2 (σ = 55.42), (b) n = 3 (σ = 43.38), and (c) n = 4 (σ = 35.97); the spread decreases as n grows.]
Bias
Example:
Suppose we are interested in the amount of money students at this
university have spent on books last quarter.
Sample: 20 students in the University bookshop
Do we get a good estimate for the average amount spent on books
last quarter by UofC students?
◦ Students who buy more books and spend more money on books
are more likely to be found in bookshops than students who buy
fewer books.
◦ The sample mean might overestimate the true amount spent
on books.
◦ The sample is not representative of the population of all
students.
Careful: A poor sample design can produce misleading conclu-
sions.
The design of a study is biased if it systematically favors some
parts of the population over others.
A statistic is unbiased if the mean of its sampling distribution
is equal to the parameter being estimated. Otherwise we say the
statistic is biased.
Bias
Examples: Biased Sampling
◦ Midway Airlines Ads in the New York Times and the Wall Street Jour-
nal stated that “84 percent of frequent business travelers to Chicago
prefer Midway Metrolink to American, United, and TWA.”
The survey was “conducted among Midway Metrolink passengers
between New York and Chicago.”
◦ A 1992 Roper poll asked “Does it seem possible or does it seem im-
possible to you that the Nazi extermination of Jews never happened?”
22% of the American respondents said “seems possible.”
A reworded 1994 poll asked “Does it seem possible to you that the Nazi
extermination of Jews never happened, or do you feel certain that it
happened?” This time only 1% of the respondents said it was “possible
it never happened.”
◦ ABC network program Nightline once asked whether the United Na-
tions should continue to have its headquarters in the United States.
More than 186,000 callers responded, and 67% said “No.”
A properly designed sample survey showed that 72% of adults want the
UN to stay.
◦ A call-in poll conducted by USA Today concluded that Americans love
Donald Trump.
USA Today later reported that 5,640 of the 7,800 calls for the poll came
from the offices owned by one man, Cincinnati financier Carl Lindner.
Caution about Sample Surveys
• Undercoverage
◦ occurs when some groups in the population are left out of
the process of choosing the sample
◦ no accurate list of the population
◦ results in bias if this group differs from the rest of the
population
• Nonresponse
◦ occurs when a chosen individual cannot be contacted or
does not cooperate
◦ results in bias if this group differs from the rest of the
population
• Response bias
◦ subjects may not want to admit illegal or unpopular be-
haviour
◦ subjects may be affected by the interviewer’s appearance or
tone
◦ subjects may not remember correctly
• Question wording
◦ confusing or leading questions can introduce strong bias
◦ do not trust sample survey results unless you have read the
exact questions posed
Simple Random Sampling
A simple random sample (SRS) of size n consists of n indi-
viduals chosen from the population in such a way that every set of
n individuals is equally likely to be selected.
◦ Every possible sample has an equal chance of being selected.
◦ Every individual has an equal chance of being selected.
◦ Random selection eliminates bias in sampling.
SRS or Not?
Is each of the following samples an SRS or not?
◦ A deck of cards is shuffled, and the top five cards are dealt.
◦ A sample of Illinois residents is drawn by choosing all the resi-
dents in each of 100 census blocks (in such a way that each set
of 100 blocks is equally likely to be chosen)
◦ A telephone survey is conducted by dialing telephone numbers
at random (i.e. each valid phone number is equally likely).
◦ A sample of 10% of all students at the University of Chicago is
chosen by numbering the students 1, . . . , N, drawing a random
integer i from 1 to 10, and taking every tenth student beginning
with i.
(E.g. if i = 5, students 5, 15, 25, . . . are chosen.)
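In STATA, an SRS could be drawn from a data set held in memory (a sketch; sample with the count option keeps a random subset of the observations, here 20):

. set seed 1234
. sample 20, count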
Stratified Sampling
Example:
◦ Population: Students at this university
◦ Objective: Amount of money spent on books this quarter
◦ Knowledge: Students in e.g. humanities spend more money on
books
Use this knowledge to build the sample:
◦ divide the population into groups of similar individuals, called strata
◦ choose a simple random sample within each group
◦ make the sample size in each group e.g. proportional to the size of the group
Summary
◦ A number which describes a population is a parameter.
◦ A number computed from the data is a statistic.
◦ Use statistics to make inferences about unknown population
parameters.
◦ A Simple random sample (SRS) of size n consists of n in-
dividuals from the population sampled without replacement,
that is, every set of n individuals has an equal chance to be the
sample actually selected.
◦ A statistic from a random sample has a sampling distribution
that describes how the statistic varies in repeated data produc-
tion.
◦ A statistic as an estimator of a parameter may suffer from bias
or from high variability. Bias means that the mean of the
sampling distribution is not equal to the true value of the pa-
rameter. The variability of the statistic is described by the
spread of its sampling distribution.
First Step Towards Probability
Experiment:
Toss a die and observe the number on the face up.
What is the chance
◦ of getting a six?
Event of interest: 6
All possible events: 1 2 3 4 5 6
⇒ 1/6 (one out of six)
◦ of getting an even number?
Event of interest: 2 4 6
All possible events: 1 2 3 4 5 6
⇒ 1/2 (three out of six)
The classical probability concept:
If there are N equally likely possibilities, of which one must occur
and s are regarded as favorable, or as a “success”, then the probability
of a “success” is

s/N.
First Step Towards Probability
Example:
Suppose that of 100 applicants for a job 50 were women and 50
were men, all equally qualified. Further suppose that the company
hired 2 women and 8 men.
How likely is this outcome under the assumption that
the company does not discriminate?
How many ways are there to choose
◦ 10 out of 100 applicants? (⇒ N)
◦ 2 out of 50 female applicants and 8 out of 50 male applicants?
(⇒ s)
To compute such probabilities we need a way to count the num-
ber of possibilities (favorable and total).
The Multiplicative Rule
Suppose you have to make k choices with N1, . . . , Nk possibilities,
respectively. Then the total number of possibilities is
the product

N1 · · · Nk.
Sampling in order with replacement
If you sample n times in order with replacement from a set of N
elements, then the total number of possible sequences (x1, . . . , xn)
is Nn.
Example:
If you toss a die 5 times, the number of possible results is 65 = 7776.
Sampling in order without replacement
If you sample n times in order without replacement from a set of N
elements, then the total number of possible sequences (x1, . . . , xn)
is
N(N − 1) · · · (N − n + 1) = N!/(N − n)!
Example:
If you select 5 cards in order from a card deck of 64, the number
of possible results is 64 · 63 · 62 · 61 · 60 = 914, 941, 440.
Permutations and Combinations
Example:
If you select 5 cards from a card deck of 64, you are typically only
interested in the cards you have, not in the order in which you
received them. How many different combinations of 5 cards out
of 64 are there?
To answer this question we first address the question of how many
different sequences of the same 5 cards exist.
Permutation:
Let (x1, . . . , xn) be a sequence. A permutation of this sequence is
any rearrangement of the elements without losing or adding any
elements, that is, any new sequence

(xi1, . . . , xin)

with permuted indices {i1, . . . , in} = {1, . . . , n}. The trivial
permutation does not change the order, i.e. ij = j.
How many permutations of n distinct elements are there? The
multiplicative rule yields

n · (n − 1) · · · 1 = n!

Example (contd):
The number of different sequences of 5 fixed cards is 5! = 5 · 4 · 3 · 2 · 1 = 120.
Permutations and Combinations
How many different combinations of n elements chosen from
N distinct elements are there?
Recall that
◦ The number of different sequences of length n that can be chosen
from N distinct elements is N!/(N − n)!.
◦ The number of permutations of any sequence of length n is n!.

Thus the number of combinations of n elements chosen from N
distinct elements is

N!/(n! (N − n)!) = C(N, n) = C(N, N − n).

The numbers C(N, n) are referred to as binomial coefficients.
Since two permuted (ordered) sequences (x1, . . . , xn) lead to the same (un-
ordered) combination {x1, . . . , xn} we divide the number of ordered se-
quences by the number of permutations.
Examples
Example:
If you select 5 cards from a card deck of 64, you are typically only
interested in the cards you have, not in the order in which you
received them. How many different combinations of 5 cards out
of 64 are there?
The answer is

C(64, 5) = (64 · 63 · 62 · 61 · 60)/(5 · 4 · 3 · 2 · 1) = 914,941,440/120 = 7,624,512.
Example:
Recall the example with the 100 applicants for a job. The number
of ways to choose
◦ 2 women out of 50 is C(50, 2).
◦ 8 men out of 50 is C(50, 8).
◦ 10 applicants out of 100 is C(100, 10).

Thus the chance of this event is

C(50, 2) · C(50, 8) / C(100, 10) = 0.037.
Moreover, the chance of this or a more extreme event (only one or
no woman is hired) is 0.046.
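The probability above can be evaluated in STATA (a sketch; comb(n, k) is STATA’s binomial coefficient function):

. display comb(50,2)*comb(50,8)/comb(100,10)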
Summary
The number of possibilities to sample n elements, with or without replacement,
in order or unordered, from a set of N distinct elements
is summarized in the following table:

Sampling              in order       without order
without replacement   N!/(N − n)!    C(N, n)
with replacement      N^n            C(N + n − 1, n)
Introduction to Probability
Classical Concept:
◦ requires finitely many and equally likely outcomes
◦ probability of an event defined as the number of favorable outcomes
(s) divided by the number of total outcomes (N):

Probability of event = s/N
◦ can be determined by counting outcomes
In many practical situations the different outcomes are not equally
likely:
◦ Success of treatment
◦ Chance to die of a heart attack
◦ Chance of snowfall tomorrow
It is not immediately clear how to measure chance in each of these
cases.
Three Concepts of Probability
◦ Frequency interpretation
◦ Subjective probabilities
◦ Mathematical probability concept
The Frequentist Approach
In the long run, we are all dead.
John Maynard Keynes (1883-1946)
The Frequency Interpretation of Probability
The probability of an event is the proportion of time that events
of the same kind (repeated independently and under the same
conditions) will occur in the long run.
Example:
Suppose we collect data on the weather in Chicago on Jan 21 and
we note that in the past 124 years it snowed on Jan 21 in 34 years,
that is, (34/124) · 100% = 27.4% of the time.
Thus we would estimate the probability of snowfall on Jan 21 in
Chicago as 0.274.
The frequency interpretation of probability is based on the follow-
ing theorem:
The Law of Large Numbers
If a situation, trial, or experiment is repeated again and again, the
proportion of successes will converge to the probability of any one
outcome being a success.
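A small simulation illustrates this for coin tossing (a sketch in STATA; runiform() draws uniform random numbers and sum() forms running sums, so PROP is the running proportion of heads):

. clear
. set obs 1000
. set seed 42
. gen HEADS = runiform() < 0.5
. gen T = _n
. gen PROP = sum(HEADS)/T
. graph twoway line PROP T, yline(0.5) ytitle(Relative frequency of heads)
>   xtitle(Number of tosses)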
The Frequentist Approach
[Figure: relative frequency of heads plotted against the number of tosses, in three panels: tosses 1-1000, tosses 1000-100000, and tosses 100000-1000000. The relative frequency settles down ever closer to 0.5.]
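A plot of this kind can be reproduced by simulation; a minimal STATA sketch (seed and number of tosses are arbitrary choices; runiform() is the modern name of the uniform random-number function):
. clear
. set obs 1000
. set seed 1
. gen head = runiform() < 0.5
. gen toss = _n
. * running relative frequency of heads after each toss
. gen relfreq = sum(head)/_n
. line relfreq toss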
Elements of Probability, Jan 23, 2003 - 3 -
The Subjectivist (Bayesian) Approach
Not all events are repeatable:
◦ Will it snow tomorrow?
◦ Will Mr Jones, 42, live to 65?
◦ Will the Dow Jones rise tomorrow?
◦ Does Iraq have weapons of mass destruction?
To all these questions the answer is either “yes” or “no”, but we
are uncertain about the right answer.
Need to quantify our uncertainty about an event A:
Game with two players:
◦ 1st player determines p such that he will “win” $c · (1 − p) if event A occurs and otherwise he will “lose” $c · p.
◦ 2nd player chooses c which can be positive or negative.
The Bayesian interpretation of probability is that probability
measures the personal (subjective) uncertainty of an event.
Example: Weather forecast
Meteorologist says that the probability of snowfall tomorrow is
90%.
He should be willing to bet $90 against $10 that it snows tomorrow
and $10 against $90 that it does not snow.
Elements of Probability, Jan 23, 2003 - 4 -
The Elements of Probability
A (statistical) experiment is a process of observation or mea-
surement. For a mathematical treatment we need:
Sample Space S - set of possible outcomes
Example: An urn contains five balls, numbered from 1 through
5. We choose two at random and at the same time. What is the
sample space?
S = { {1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {3, 5}, {4, 5} }.
Events A ⊆ S - an event is a subset of the sample space S
Example: In the example above the event A that two balls with odd numbers are chosen is
A = { {1, 3}, {1, 5}, {3, 5} }.
Probability Function P - assigns each A a value in [0, 1]
Example: Assuming that all outcomes are equally likely we obtain P(A) = 3/10.
Elements of Probability, Jan 23, 2003 - 5 -
The Elements of Probability
Why not assign probabilities to outcomes?
Example: Spinner labeled from 0 to 1.
◦ Suppose that all outcomes s ∈ S = [0, 1) are equally likely.
◦ Assign probabilities uniformly on S.
◦ P({s}) = c > 0 ⇒ P(S) = ∞
◦ P({s}) = 0 ⇒ P(S) = 0
Solution: Assign to each subset of S a probability equal to the
“length” of that subset:
◦ Probability that the spinner lands in [0, 1/4) is 1/4.
◦ Probability that the spinner lands in [1/2, 3/4) is 1/4.
◦ Probability that the spinner lands on 1/2 is 0.
In integral notation we have
P(spinner lands in [a, b]) = ∫_a^b dx = b − a.
Remark:
Strictly speaking, we can define the above probability only on a collection A of subsets A ⊆ S, which however covers all subsets relevant for this class.
In the case of finite or countably infinite sample spaces S there are no such exceptions
and A covers all subsets of S.
Elements of Probability, Jan 23, 2003 - 6 -
A Set Theory Primer
A set is “a collection of definite, well distinguished objects of our perception
or of our thought”. (Georg Cantor, 1845-1918)
Some important sets:
◦ N = {1, 2, 3, . . .}, the set of natural numbers
◦ Z = {. . . ,−2,−1, 0, 1, 2, . . .}, the set of integers
◦ R = (−∞,∞), the set of real numbers
Intervals are denoted as follows:
[0, 1] the interval from 0 to 1 including 0 and 1
[0, 1) the interval from 0 to 1 including 0 but not 1
(0, 1) the interval from 0 to 1 not including 0 and 1
If a is an element of the set A then we write a ∈ A.
If a is not an element of the set A then we write a /∈ A.
Suppose that A and B are subsets of S (denoted as A, B ⊆ S).
The empty set is denoted by ∅ (Note: ∅ ⊆ A for all subsets A of S).
Difference of A and B (A\B): Set of all elements in A which are not in B.
Intersection of A and B (A ∩ B): Set of all elements in S which are both
in A and in B.
Union of A and B (A∪B): Set of all elements in S that are in A or in B.
Complement of A (A∁ or A′): Set of all elements in S that are not in A.
Note that A ∩ A∁ = ∅ and A ∪ A∁ = S
A and B are disjoint if A and B have no common elements, that is A∩B =
∅. Two events A and B with this property are said to be mutually
exclusive.
Elements of Probability, Jan 23, 2003 - 7 -
The Postulates of Probability
A probability on a sample space S (and a set A of events) is a
function which assigns each subset A a value in [0, 1] and satisfies
the following rules:
Axiom 1: All probabilities are nonnegative: P(A) ≥ 0 for all events A.
Axiom 2: The probability of the whole sample space is 1: P(S) = 1.
Axiom 3 (Addition Rule): If two events A and B are mutually exclusive then
P(A ∪ B) = P(A) + P(B),
that is, the probability that one or the other occurs is the sum of their probabilities.
More generally, if countably many events Ai, i ∈ N, are mutually exclusive (i.e. Ai ∩ Aj = ∅ whenever i ≠ j) then
P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).
Elements of Probability, Jan 23, 2003 - 8 -
The Postulates of Probability
Classical Concept of Probability
The probability of an event A is defined as
P(A) = #A / #S,
where #A denotes the number of elements (outcomes) in A.
It satisfies
◦ P(A) ≥ 0
◦ P(S) = #S/#S = 1
◦ If A and B are mutually exclusive then
P(A ∪ B) = #(A ∪ B)/#S = #A/#S + #B/#S = P(A) + P(B).
Elements of Probability, Jan 23, 2003 - 9 -
The Postulates of Probability
Frequency Interpretation of Probability
The probability of an event A is defined as
P(A) = lim_{n→∞} n(A)/n,
where n(A) is the number of times event A occurred in n repetitions.
It satisfies
◦ P(A) ≥ 0
◦ P(S) = lim_{n→∞} n/n = 1
◦ If A and B are mutually exclusive then n(A ∪ B) = n(A) + n(B). Hence
P(A ∪ B) = lim_{n→∞} n(A ∪ B)/n
= lim_{n→∞} (n(A)/n + n(B)/n)
= lim_{n→∞} n(A)/n + lim_{n→∞} n(B)/n = P(A) + P(B).
Elements of Probability, Jan 23, 2003 - 10 -
The Postulates of Probability
Example: Toss of one die
The events A = {1} and B = {4, 5} are mutually exclusive.
Since all outcomes are equiprobable we obtain
P(A) = 1/6 and P(B) = 1/3.
The addition rule yields
P(A ∪ B) = 1/6 + 1/3 = 3/6 = 1/2.
On the other hand we get for C = A ∪ B = {1, 4, 5}
P(C) = 3/6 = 1/2.
The first two axioms can be summarized by the
Cardinal Rule: For any subset A of S
0 ≤ P(A) ≤ 1.
In particular
◦ P(∅) = 0
◦ P(S) = 1
Elements of Probability, Jan 23, 2003 - 11 -
The Calculus of Probability
Let A and B be events in a sample space S.
Partition rule: P(A) = P(A ∩ B) + P(A ∩ B∁)
Example: Roll a pair of fair dice
P(Total of 10) = P(Total of 10 and double) + P(Total of 10 and no double)
= 1/36 + 2/36 = 3/36 = 1/12
Complementation rule: P(A∁) = 1 − P(A)
Example: Often useful for events of the type “at least one”:
P(At least one even number) = 1 − P(No even number) = 1 − 9/36 = 3/4
Containment rule: P(A) ≤ P(B) for all A ⊆ B
Example: Compare two aces with doubles,
1/36 = P(Two aces) ≤ P(Doubles) = 6/36 = 1/6
Calculus of Probability, Jan 26, 2003 - 1 -
The Calculus of Probability
Inclusion and exclusion formula: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Example: Roll a pair of fair dice
P(Total of 10 or double) = P(Total of 10) + P(Double) − P(Total of 10 and double)
= 3/36 + 6/36 − 1/36 = 8/36 = 2/9
The two events are
Total of 10 = {46, 55, 64} and Double = {11, 22, 33, 44, 55, 66}.
The intersection is
Total of 10 and double = {55}.
When we simply add the probabilities of the two events, the probability of the outcome 55 is counted twice.
Calculus of Probability, Jan 26, 2003 - 2 -
Conditional Probability
Probability gives chances for events in sample space S.
Often: Have partial information about event of interest.
Example: Number of Deaths in the U.S. in 1996
Cause All ages 1-4 5-14 15-24 25-44 45-64 ≥ 65
Heart 733,125 207 341 920 16,261 102,510 612,886
Cancer 544,161 440 1,035 1,642 22,147 132,805 386,092
HIV 32,003 149 174 420 22,795 8,443 22
Accidents1 92,998 2,155 3,521 13,872 26,554 16,332 30,564
Homicide2 24,486 395 513 6,548 9,261 7,717 52
All causes 2,171,935 5,947 8,465 32,699 148,904 380,396 1,717,218
1 Accidents and adverse effects, 2 Homicide and legal intervention
Idea: measure probability with respect to a subset of S.
Conditional probability of A given B:
P(A|B) = P(A ∩ B)/P(B), if P(B) > 0.
If P(B) = 0 then P(A|B) is undefined.
Conditional probabilities for causes of death:
◦ P(accident) = 0.04282
◦ P(age=10) = 0.00390
◦ P(accident|age=10) = 0.42423
◦ P(accident|age=40) = 0.17832
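These conditional probabilities are simply column ratios in the table above; for instance, for the 25-44 age group:
. * P(accident | age 25-44) = accidental deaths / deaths from all causes
. display 26554/148904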
Calculus of Probability, Jan 26, 2003 - 3 -
Conditional Probability
Example: Select two cards from 32 cards
◦ What is the probability that the second card is an ace?
P(2nd card is an ace) = 1/8
◦ What is the probability that the second card is an ace if the first was an ace?
P(2nd card is an ace | 1st card was an ace) = 3/31
Calculus of Probability, Jan 26, 2003 - 4 -
Multiplication rules
Example: Death Rates (per 100,000 people)
All Ages 1-4 5-14 15-24 25-44 45-64 ≥ 65
872.5 38.3 22.0 90.3 177.8 708.0 5071.4
Can we combine these rates with the table on causes of death?
◦ What is the probability to die from an accident (HIV)?
◦ What is the probability to die from an accident at age 10 (40)?
Know P(accident|die) = P(die from accident)/P(die)
⇒ P(die from accident) = P(accident|die)P(die)
Calculate probabilities:
◦ P(die from accident) = 0.04281 · 0.00873 = 0.00037
◦ P(die from accident|age = 10) = 0.42423 · 0.00090 = 0.00038
◦ P(die from accident|age = 40) = 0.17832 · 0.00178 = 0.00031
◦ P(die from HIV) = 0.01473 · 0.00873 = 0.00013
◦ P(die from HIV|age = 10) = 0.02055 · 0.00090 = 0.00002
◦ P(die from HIV|age = 40) = 0.15308 · 0.00178 = 0.00027
General multiplication rule
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Calculus of Probability, Jan 26, 2003 - 5 -
Independence
Example: Roll two dice
◦ What is the probability that the second die shows 1?
P(2nd die = 1) = 1/6
◦ What is the probability that the second die shows 1 if the first die already shows 1?
P(2nd die = 1 | 1st die = 1) = 1/6
◦ What is the probability that the second die shows 1 if the first does not show 1?
P(2nd die = 1 | 1st die ≠ 1) = 1/6
The chances of getting 1 with the second die are the same, no
matter what the first die shows. Such events are called indepen-
dent:
The event A is independent of the event B if its chances are
not affected by the occurrence of B,
P(A|B) = P(A).
Equivalently, A and B are independent if
P(A ∩ B) = P(A)P(B).
Otherwise we say A and B are dependent.
Calculus of Probability, Jan 26, 2003 - 6 -
Let’s Make a Deal
The Rules:
◦ Three doors - one prize, two blanks
◦ Candidate selects one door
◦ Showmaster reveals one losing door
◦ Candidate may switch doors
[Picture: three doors, numbered 1, 2, 3.]
Would YOU change?
Can probability theory help you?
◦ What is the probability of winning if candidate switches doors?
◦ What is the probability of winning if candidate does not switch
doors?
Calculus of Probability, Jan 26, 2003 - 7 -
The Rule of Total Probability
Events of interest:
◦ A - choose winning door at the beginning
◦ W - win the prize
Strategy: Switch doors (S)
Know: ◦ PS(W|A) = 0
◦ PS(W|A∁) = 1
◦ PS(A) = 1/3
◦ PS(A∁) = 2/3
Probability of interest: PS(W):
PS(W) = PS(W ∩ A) + PS(W ∩ A∁)
= PS(W|A)PS(A) + PS(W|A∁)PS(A∁)
= 0 · 1/3 + 1 · 2/3 = 2/3
Strategy: Do not switch doors (N)
Know: ◦ PN(W|A) = 1
◦ PN(W|A∁) = 0
◦ PN(A) = 1/3
◦ PN(A∁) = 2/3
Probability of interest: PN(W):
PN(W) = PN(W ∩ A) + PN(W ∩ A∁)
= PN(W|A)PN(A) + PN(W|A∁)PN(A∁)
= 1 · 1/3 + 0 · 2/3 = 1/3
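The two winning probabilities can also be checked by simulation; a minimal STATA sketch (seed and number of games are arbitrary):
. clear
. set obs 10000
. set seed 42
. * door hiding the prize and the candidate's first pick, each uniform on {1, 2, 3}
. gen prize = floor(3*runiform()) + 1
. gen pick = floor(3*runiform()) + 1
. * switching wins exactly when the first pick misses the prize
. gen winswitch = prize != pick
. summarize winswitch
The mean of winswitch should be close to 2/3; one minus that mean estimates the winning probability without switching.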
Calculus of Probability, Jan 26, 2003 - 8 -
The Rule of Total Probability
Rule of Total Probability
If B1, . . . , Bk are mutually exclusive and B1 ∪ . . . ∪ Bk = S, then
P(A) = P(A|B1)P(B1) + . . . + P(A|Bk)P(Bk)
Example:
Suppose an applicant for a job has been invited for an interview.
The chance that
◦ he is nervous is P(N) = 0.7,
◦ the interview is successful if he is nervous is P(S|N) = 0.2,
◦ the interview is successful if he is not nervous is P(S|N∁) = 0.9.
What is the probability that the interview is successful?
P(S) = P(S|N)P(N) + P(S|N∁)P(N∁)
= 0.2 · 0.7 + 0.9 · 0.3
= 0.41
Calculus of Probability, Jan 26, 2003 - 9 -
The Rule of Total Probability
Example:
Suppose we have two unfair coins:
◦ Coin 1 comes up heads with probability 0.8
◦ Coin 2 comes up heads with probability 0.35
Choose a coin at random and flip it. What is the probability that it comes up heads?
Events: H = “heads comes up”, C1 = “1st coin”, C2 = “2nd coin”
P(H) = P(H|C1)P(C1) + P(H|C2)P(C2)
= (1/2)(0.8 + 0.35) = 0.575
Calculus of Probability, Jan 26, 2003 - 10 -
Bayes’ Theorem
Example: O.J. Simpson
“Only about 1/10 of one percent of wife-batterers actually murder their wives”
Lawyer of O.J. Simpson on TV
Fact: Simpson pleaded no contest to beating his wife in 1988.
So he murdered his wife with probability 0.001?
◦ Sample space S - married couples in U.S. in which the husband
beat his wife in 1988
◦ Event H - all couples in S in which the husband has since
murdered his wife
◦ Event M - all couples in S in which the wife has been murdered
since 1988
We have ◦ P(H) = 0.001
◦ P(M|H) = 1 since H ⊆ M
◦ P(M|H∁) = 0.0001 at most in the U.S.
Then
P(H|M) = P(M|H)P(H)/P(M)
= P(M|H)P(H) / (P(M|H)P(H) + P(M|H∁)P(H∁))
= 0.001/(0.001 + 0.0001 · 0.999) = 0.91
Calculus of Probability, Jan 26, 2003 - 11 -
Bayes’ Theorem
Reversal of conditioning (general multiplication rule):
P(B|A)P(A) = P(A|B)P(B)
Rewriting P(A) using the rule of total probability we obtain
Bayes’ Theorem
P(B|A) = P(A|B)P(B) / (P(A|B)P(B) + P(A|B∁)P(B∁))
If B1, . . . , Bk are mutually exclusive and B1 ∪ . . . ∪ Bk = S, then
P(Bi|A) = P(A|Bi)P(Bi) / (P(A|B1)P(B1) + . . . + P(A|Bk)P(Bk))
(General form of Bayes’ Theorem)
Calculus of Probability, Jan 26, 2003 - 12 -
Bayes’ Theorem
Example: Testing for AIDS
Enzyme immunoassay test for HIV:
◦ P(T+|I+) = 0.98 (sensitivity - positive for infected)
◦ P(T-|I-) = 0.995 (specificity - negative for noninfected)
◦ P(I+) = 0.0003 (prevalence)
What is the probability that the tested person is infected if the test was positive?
P(I+|T+) = P(T+|I+)P(I+) / (P(T+|I+)P(I+) + P(T+|I-)P(I-))
= (0.98 · 0.0003) / (0.98 · 0.0003 + 0.005 · 0.9997)
= 0.05556
Consider a different population with P(I+) = 0.1 (greater risk):
P(I+|T+) = (0.98 · 0.1) / (0.98 · 0.1 + 0.005 · 0.9) = 0.956
⇒ testing on a large scale is not sensible (too many false positives)
Repeat test (Bayesian updating):
◦ P(I+|T++) = 0.92 in 1st population
◦ P(I+|T++) = 0.9998 in 2nd population
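Both posterior probabilities are one-line calculations:
. display (0.98*0.0003)/(0.98*0.0003 + 0.005*0.9997)
. * Bayesian updating: after a second positive test the posterior becomes the prior
. display (0.98*0.05556)/(0.98*0.05556 + 0.005*(1 - 0.05556))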
Calculus of Probability, Jan 26, 2003 - 13 -
Random Variables
Aim: ◦ Learn about population
◦ Available information: observed data x1, . . . , xn
Problem: ◦ Data affected by chance variation
◦ New set of data would look different
Suppose we observe/measure some characteristic (variable) of n
individuals. The actual observed values x1, . . . , xn are the outcome
of a random phenomenon.
Random variable: a variable whose value is a numerical out-
come of a random phenomenon
Remark: Mathematically, a random variable is a real-valued func-
tion on the sample space S:
X : S → R, ω ↦ x = X(ω)
◦ SX = X(S) is the sample space of the random variable.
◦ The outcome x = X(ω) is called realisation of X .
◦ X induces a probability P (B) = P(X ∈ B) on SX , the prob-
ability distribution of X
Example: Roll one die
Outcome ω:        1 2 3 4 5 6
Realization X(ω): 1 2 3 4 5 6
Random Variables, Jan 28, 2003 - 1 -
Random Variables
Example: Roll two dice
◦ X1 - number on the first die
◦ X2 - number on the second die
◦ Y = X1 + X2 - total number of points
(a function of random variables is again a random variable)
Table of outcomes (rows: X1, columns: X2; cell entries: Y = X1 + X2):
X1\X2  1  2  3  4  5  6
1      2  3  4  5  6  7
2      3  4  5  6  7  8
3      4  5  6  7  8  9
4      5  6  7  8  9 10
5      6  7  8  9 10 11
6      7  8  9 10 11 12
Random Variables, Jan 28, 2003 - 2 -
Random Variables
Two important types of random variables:
• Discrete random variable
◦ takes values in a finite or countable set
• Continuous random variable
◦ takes values in a continuum, or uncountable set
◦ probability of any particular outcome x is zero:
P(X = x) = 0 for all x ∈ SX
Example: Ten tosses of a coin
Suppose we toss a coin ten times. Let
◦ X be the number of heads in ten tosses of a coin
◦ Y be the time it takes to toss ten times
Random Variables, Jan 28, 2003 - 3 -
Discrete Random Variables
Suppose X is a discrete random variable with values x1, x2, . . ..
Example: Roll two dice
Y = X1 + X2 total number of points
y        2    3    4    5    6    7    8    9    10   11   12
P(Y = y) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
Frequency function: The function
p(x) = P(X = x) = P({ω ∈ S|X(ω) = x})
is called the frequency function or probability mass function.
Note: p defines a probability on SX = {x1, x2, . . .}:
P(B) = Σ_{x∈B} p(x) = P(X ∈ B).
We call P the (probability) distribution of X .
Properties of a discrete probability distribution
◦ p(x) ≥ 0 for all values of X
◦ Σ_i p(xi) = 1
Random Variables, Jan 28, 2003 - 4 -
Discrete Random Variables
Example: Roll one die
Let X denote the number of points on the face turned up. Since
all numbers are equally likely we obtain
p(x) = P(X = x) = 1/6 if x ∈ {1, . . . , 6}, and 0 otherwise.
Example: Roll two dice
The probability mass function of the total number of points
Y = X1 + X2
can be written as:
p(y) = P(Y = y) = (1/36)(6 − |y − 7|) if y ∈ {2, . . . , 12}, and 0 otherwise.
Example: Three tosses of a coin
Let X be the number of heads in three tosses of a coin. There are (3 choose x) outcomes with x heads and 3 − x tails, thus
p(x) = (3 choose x) · 1/8.
Random Variables, Jan 28, 2003 - 5 -
Continuous Random Variables
For a continuous random variable X, the probability that X falls in the interval (a, b] is given by
P(a < X ≤ b) = ∫_a^b f(x) dx,
where f is the density function of X .
Note: The density defines a probability on R:
P([a, b]) = ∫_a^b f(x) dx = P(X ∈ [a, b])
We call P the (probability) distribution of X .
Remark: The definition of P can be extended to (almost) all B ⊆ R.
Example: Spinner
Consider a spinner that turns freely on its axis and slowly comes to a stop.
◦ X is the stopping point on the circle marked from 0 to 1.
◦ X can take any value in SX = [0, 1).
◦ The outcomes of X are uniformly distributed over the interval [0, 1).
Then the density function of X is
f(x) = 1 if 0 ≤ x < 1, and 0 otherwise.
Consequently
P(X ∈ [a, b]) = b − a.
Note that for all possible outcomes x ∈ [0, 1) we have
P(X ∈ [x, x]) = x − x = 0.
Random Variables, Jan 28, 2003 - 6 -
Independence of Random Variables
Recall: Two events A and B are independent if
P(A ∩ B) = P(A)P(B)
Independence of Random Variables
Two discrete random variables X and Y are independent if
P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)
for all A ⊆ SX and B ⊆ SY.
Remark: It is sufficient to show that
P(X = x, Y = y) = pX(x) pY(y) = P(X = x)P(Y = y)
for all x ∈ SX and y ∈ SY.
More generally, X1, X2, . . . are independent if for all n ∈ N
P(X1 ∈ A1, . . . , Xn ∈ An) = P(X1 ∈ A1) · · · P(Xn ∈ An)
for all Ai ⊆ SXi.
Example: Toss coin three times
Consider
Xi = 1 if head in ith toss of coin, and 0 otherwise.
X1, X2, and X3 are independent:
P(X1 = x1, X2 = x2, X3 = x3) = 1/8 = P(X1 = x1)P(X2 = x2)P(X3 = x3)
Random Variables, Jan 28, 2003 - 7 -
Multivariate Distributions: Discrete Case
Discrete Case
Let X and Y be discrete random variables.
Joint frequency function of X and Y:
pXY(x, y) = P(X = x, Y = y) = P({X = x} ∩ {Y = y})
Marginal frequency function of X:
pX(x) = Σ_i pXY(x, yi)
Marginal frequency function of Y:
pY(y) = Σ_i pXY(xi, y)
The random variables X and Y are independent if and only if
pXY(x, y) = pX(x) pY(y)
for all possible values x ∈ SX and y ∈ SY.
Conditional probability of X = x given Y = y:
P(X = x|Y = y) = pX|Y(x|y) = pXY(x, y)/pY(y) = P(X = x, Y = y)/P(Y = y)
where pX|Y(x|y) is the conditional frequency function.
Random Variables, Jan 28, 2003 - 8 -
Multivariate Distributions
Discrete Case
Example: Three Tosses of a Coin
◦ X - number of heads on the first toss (values in {0, 1})
◦ Y - total number of heads (values in {0, 1, 2, 3})
The joint frequency function pXY (x, y) is given by the following
table
x\y    0    1    2    3   | pX(x)
0     1/8  2/8  1/8   0   |  1/2
1      0   1/8  2/8  1/8  |  1/2
pY(y) 1/8  3/8  3/8  1/8  |   1
Marginal frequency function of Y
pY(0) = P(Y = 0) = P(Y = 0, X = 0) + P(Y = 0, X = 1) = 1/8 + 0 = 1/8
pY(1) = P(Y = 1) = P(Y = 1, X = 0) + P(Y = 1, X = 1) = 2/8 + 1/8 = 3/8
...
Random Variables, Jan 28, 2003 - 9 -
Multivariate Distributions
Continuous Case
Let X and Y be continuous random variables.
Joint density function of X and Y: fXY such that
∫_A ∫_B fXY(x, y) dy dx = P(X ∈ A, Y ∈ B)
Marginal density function of X:
fX(x) = ∫ fXY(x, y) dy
Marginal density function of Y:
fY(y) = ∫ fXY(x, y) dx
The random variables X and Y are independent if and only if
fXY(x, y) = fX(x) fY(y)
for all possible values x ∈ SX and y ∈ SY.
Conditional density function of X given Y = y:
fX|Y(x|y) = fXY(x, y)/fY(y)
Conditional probability of X ∈ A given Y = y:
P(X ∈ A|Y = y) = ∫_A fX|Y(x|y) dx
Random Variables, Jan 28, 2003 - 10 -
Bernoulli Distribution
Example: Toss of coin
Define X = 1 if head comes up and X = 0 if tail comes up.
Both realizations are equally likely: P(X = 1) = P(X = 0) = 1/2
Examples:
Often: Two outcomes which are not equally likely:
◦ Success of medical treatment
◦ Interviewed person is female
◦ Student passes exam
◦ Transmission of a disease
Bernoulli distribution (with parameter θ)
◦ X takes two values, 1 and 0, with probabilities θ and 1 − θ
◦ Frequency function of X:
p(x) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1}, and 0 otherwise
◦ Often:
X = 1 if event A has occurred, and 0 otherwise.
Example: A = blood pressure above 140/90 mm Hg.
Distributions, Jan 30, 2003 - 1 -
Bernoulli Distribution
Let X1, . . . , Xn be independent Bernoulli random variables with
same parameter θ.
Frequency function of X1, . . . , Xn:
p(x1, . . . , xn) = p(x1) · · · p(xn) = θ^{x1+...+xn} (1 − θ)^{n−x1−...−xn}
for xi ∈ {0, 1} and i = 1, . . . , n
Example: Paired-Sample Sign Test
◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants
Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
Define for the ith plant
Xi = 1 if the first value is greater than the second, and 0 otherwise.
Result: 1 1 1 1 0 1 1 1 1 1
The Xi’s are independently Bernoulli distributed with unknown
parameter θ.
Distributions, Jan 30, 2003 - 2 -
Binomial Distribution
Let X1, . . . , Xn be independent Bernoulli random variables
◦ Often only interested in number of successes
Y = X1 + . . . + Xn
Example: Paired Sample Sign Test (contd)
Define for the ith plant
Xi = 1 if the first value is greater than the second, and 0 otherwise,
and let
Y = Σ_{i=1}^n Xi.
Y is the number of plants for which the number of lost hours has
decreased after the installation of the safety program
We know:
◦ Xi is Bernoulli distributed with parameter θ
◦ Xi’s are independent
What is the distribution of Y ?
◦ Probability of a realization x1, . . . , xn with y successes:
p(x1, . . . , xn) = θ^y (1 − θ)^{n−y}
◦ Number of different realizations with y successes: (n choose y)
Distributions, Jan 30, 2003 - 3 -
Binomial Distribution
Binomial distribution (with parameters n and θ)
Let X1, . . . , Xn be independent and Bernoulli distributed with pa-
rameter θ and
Y = Σ_{i=1}^n Xi.
Y has frequency function
p(y) = (n choose y) θ^y (1 − θ)^{n−y} for y ∈ {0, . . . , n}
Y is binomially distributed with parameters n and θ. We write
Y ∼ Bin(n, θ).
Note that
◦ the number of trials is fixed,
◦ the probability of success is the same for each trial, and
◦ the trials are independent.
Example: Paired Sample Sign Test (contd)
Let Y be the number of plants for which the number of lost hours
has decreased after the installation of the safety program. Then
Y ∼ Bin(n, θ)
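STATA's binomial functions give such probabilities directly. For instance, if the program had no effect (θ = 1/2), the chance of 9 or more decreases among the 10 plants would be (a computation we will meet again when testing hypotheses):
. * P(Y >= 9) under Bin(10, 0.5)
. display binomialtail(10, 9, 0.5)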
Distributions, Jan 30, 2003 - 4 -
Binomial Distribution
Binomial distribution for n = 10
[Figure: frequency functions p(x) of Bin(10, θ) for θ = 0.1, 0.3, 0.5, and 0.8.]
Distributions, Jan 30, 2003 - 5 -
Geometric Distribution
Consider a sequence of independent Bernoulli trials.
◦ On each trial, a success occurs with probability θ.
◦ Let X be the number of trials up to the first success.
What is the distribution of X?
◦ Probability of no success in x − 1 trials: (1 − θ)^{x−1}
◦ Probability of a success in the xth trial: θ
The frequency function of X is
p(x) = θ(1 − θ)^{x−1}, x = 1, 2, 3, . . .
X is geometrically distributed with parameter θ.
Example:
Suppose a batter has probability 1/3 to hit the ball. What is the chance that he misses the ball less than 3 times?
The number X of balls up to the first success is geometrically distributed with parameter 1/3. Thus
P(X ≤ 3) = 1/3 + (1/3)(2/3) + (1/3)(2/3)² = 0.7037.
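Equivalently, P(X ≤ 3) is one minus the probability of three initial misses; a quick check:
. display 1/3 + (1/3)*(2/3) + (1/3)*(2/3)^2
. display 1 - (2/3)^3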
Distributions, Jan 30, 2003 - 6 -
Hypergeometric Distribution
Example: Quality Control
Quality control - sample and examine fraction of produced units
◦ N produced units
◦ M defective units
◦ n sampled units
What is the probability that the sample contains x defective units?
The frequency function of X is
p(x) = (M choose x)(N−M choose n−x) / (N choose n), x = 0, 1, . . . , n.
X is a hypergeometric random variable with parameters N , M ,
and n.
Example:
Suppose that of 100 applicants for a job 50 were women and 50 were men,
all equally qualified. If we select 10 applicants at random what is the
probability that x of them are female?
The number of chosen female applicants is hypergeometrically distributed with parameters 100, 50, and 10. The frequency function is
p(x) = (50 choose x)(50 choose 10−x) / (100 choose 10) for x = 0, 1, . . . , 10.
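STATA has this frequency function built in as hypergeometricp(N, M, n, x); for example, the chance that exactly 5 of the 10 selected applicants are women:
. display hypergeometricp(100, 50, 10, 5)
. * the same value from binomial coefficients
. display comb(50,5)*comb(50,5)/comb(100,10)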
Distributions, Jan 30, 2003 - 7 -
Poisson Distribution
Often we are interested in the number of events which occur in a specific period of time or in a specific area or volume:
◦ Number of alpha particles emitted from a radioactive source during a given period of time
◦ Number of telephone calls coming into an exchange during one unit of
time
◦ Number of diseased trees per acre of a certain woodland
◦ Number of death claims received per day by an insurance company
Characteristics
Let X be the number of times a certain event occurs during a given
unit of time (or in a given area, etc).
◦ The probability that the event occurs in a given unit of time is
the same for all the units.
◦ The number of events that occur in one unit of time is inde-
pendent of the number of events in other units.
◦ The mean (or expected) rate is λ.
Then X is a Poisson random variable with parameter λ and
frequency function
p(x) = (λ^x / x!) e^{−λ}, x = 0, 1, 2, . . .
Distributions, Jan 30, 2003 - 8 -
Poisson Approximation
The Poisson distribution is often used as an approximation for
binomial probabilities when n is large and θ is small:
p(x) = (n choose x) θ^x (1 − θ)^{n−x} ≈ (λ^x / x!) e^{−λ}
with λ = nθ.
Example: Fatalities in Prussian cavalry
Classical example from von Bortkiewicz (1898).
◦ Number of fatalities resulting from being kicked by a horse
◦ 200 observations (10 corps over a period of 20 years)
Statistical model:
◦ Each soldier is kicked to death by a horse with probability θ.
◦ Let Y be the number of such fatalities in one corps. Then
Y ∼ Bin(n, θ)
where n is the number of soldiers in one corps.
Observation: The data are well approximated by a Poisson distribution
with λ = 0.61
Deaths per Year Observed Rel. Frequency Poisson Prob.
0 109 0.545 0.543
1 65 0.325 0.331
2 22 0.110 0.101
3 3 0.015 0.021
4 1 0.005 0.003
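The Poisson column of the table can be reproduced with poissonp(), STATA's Poisson probability mass function:
. * P(X = 0), P(X = 1), P(X = 2) for lambda = 0.61
. display poissonp(0.61, 0)
. display poissonp(0.61, 1)
. display poissonp(0.61, 2)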
Distributions, Jan 30, 2003 - 9 -
Poisson Approximation
[Figure: frequency functions of Bin(40, θ) for θ = 1/4, 1/8, 1/40, 1/400, each next to the approximating Poisson distribution with λ = 10, 5, 1, 1/10.]
Distributions, Jan 30, 2003 - 10 -
Continuous Distributions
Uniform distribution U(0, θ)
Range (0, θ)
f(x) = (1/θ) · 1_(0,θ)(x),  E(X) = θ/2,  var(X) = θ²/12
Exponential distribution Exp(λ)
Range [0, ∞)
f(x) = λ exp(−λx) · 1_[0,∞)(x),  E(X) = 1/λ,  var(X) = 1/λ²
Normal distribution N(µ, σ²)
Range R
f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),  E(X) = µ,  var(X) = σ²
[Figure: histograms of samples from U(0, θ), Exp(λ), and N(µ, σ²), and a plot comparing the three samples.]
Distributions, Jan 30, 2003 - 11 -
Expected Value
Let X be a discrete random variable which takes values in SX = {x1, x2, . . . , xn}.
Expected Value or Mean of X:
E(X) = Σ_{i=1}^n xi p(xi)
Example: Roll one die
Let X be the outcome of rolling one die. The frequency function is
p(x) = 1/6, x = 1, . . . , 6,
and hence
E(X) = Σ_{x=1}^6 x/6 = 7/2 = 3.5
Example: Bernoulli random variable
Let X ∼ Bin(1, θ). Then
p(x) = θ^x (1 − θ)^{1−x}.
Thus the mean of X is
E(X) = 0 · (1 − θ) + 1 · θ = θ.
Expected Value and Variance, Feb 2, 2003 - 1 -
Expected Value
Linearity of the expected value
Let X and Y be two discrete random variables. Then
E(aX + bY) = aE(X) + bE(Y)
for any constants a, b ∈ R.
Note: No independence is required.
Proof:
E(aX + bY) = Σ_{x,y} (ax + by) p(x, y)
= a Σ_{x,y} x p(x, y) + b Σ_{x,y} y p(x, y)
= a Σ_x x p(x) + b Σ_y y p(y)   (using Σ_y p(x, y) = p(x) and Σ_x p(x, y) = p(y))
= aE(X) + bE(Y)
Example: Binomial distribution
Let X ∼ Bin(n, θ). Then X = X1 + . . . + Xn with Xi ∼ Bin(1, θ):
E(X) = Σ_{i=1}^n E(Xi) = Σ_{i=1}^n θ = nθ
Expected Value and Variance, Feb 2, 2003 - 2 -
Expected Value
Example: Poisson distribution
Let X be a Poisson random variable with parameter λ.
E(X) = Σ_{x=0}^∞ x (λ^x / x!) e^{−λ}
= λ e^{−λ} Σ_{x=1}^∞ λ^{x−1} / (x − 1)!
= λ e^{−λ} e^λ
= λ
Remarks:
◦ For most distributions some “advanced” knowledge of calculus is required to find the mean.
◦ Use tables for the means of commonly used distributions.
Expected Value and Variance, Feb 2, 2003 - 3 -
Expected Value
Example: European Call Options
Agreement that gives an investor the right (but not the obligation) to buy a stock, bond, commodity, or other instrument at a specific time T at a specific price K (the strike price).
What is a fair price P for European call options?
If ST is the price of the stock at time T, the profit will be
Profit = (ST − K)+ − P,
where (x)+ = max(x, 0).
Profit is a random variable.
[Figure: profit (ST − K)+ − P as a function of the stock price ST.]
The fair price P for this option is the expected value
P = E[(ST − K)+].
Expected Value and Variance, Feb 2, 2003 - 4 -
Expected Value
Example: European Call Options (contd)
Consider the following simple model:
◦ St = St−1 + εt, t = 1, . . . , T
◦ P(εt = 1) = p and P(εt = −1) = 1 − p.
(St) is also called a random walk.
The distribution of ST is given by (s0 known at time 0)
ST = s0 + 2Y − T, with Y ∼ Bin(T, p).
Therefore the price P is (assuming s0 = 0 without loss of generality)
P = E[(ST − K)+] = Σ_y (2y − T − K) p(y) 1{y > (K + T)/2}
For T = 20, K = 10, p = 0.6:
P = 2.75
[Figure: frequency function of the profit.]
Expected Value and Variance, Feb 2, 2003 - 5 -
Expected Value
Example: Group testing
Suppose that a large number of blood samples are to be screened for a rare
disease with prevalence 1 − p.
• If each sample is assayed individually, n tests will be required.
• Alternative scheme:
◦ n samples, split into m groups of k samples each (n = mk)
◦ Split each sample in half and pool all samples in one group
◦ Test pooled sample for each group
◦ If test positive test all samples in group separately
What is the expected number of tests under this alternative scheme?
Let Xi be the number of tests in group i. The frequency function of Xi is
p(x) = p^k if x = 1, and 1 − p^k if x = k + 1.
The expected number of tests in each group is
E(Xi) = p^k + (k + 1)(1 − p^k) = k + 1 − k p^k
Hence
E(N) = Σ_{i=1}^m E(Xi) = n(1 + 1/k − p^k)
Plot of E(N):
[Figure: expected proportion of tests E(N)/n as a function of the group size k. The mean is minimized for groups of size k = 11.]
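The optimum is easy to locate numerically. A sketch, assuming a prevalence of 1%, i.e. p = 0.99 (the plot does not state p, so this value is an assumption):
. * expected proportion of tests per sample, 1 + 1/k - p^k
. display 1 + 1/10 - 0.99^10
. display 1 + 1/11 - 0.99^11
. display 1 + 1/12 - 0.99^12
Under this assumption k = 11 gives the smallest value (about 0.196), in line with the plot.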
Expected Value and Variance, Feb 2, 2003 - 6 -
Variance
Let X be a random variable.
Variance of X :
var(X) = E(X − E(X))².
The variance of X is the expected squared distance of X from its
mean.
Suppose X is a discrete random variable with SX = {x1, . . . , xn}. Then the variance of X can be written as
var(X) = Σ_{i=1}^n (xi − Σ_{j=1}^n xj p(xj))² p(xi)
Example: Roll one die
X takes values in {1, 2, 3, 4, 5, 6} with frequency function p(x) = 1/6.
E(X) = Σ_{x=1}^6 x · 1/6 = 7/2
var(X) = Σ_{x=1}^6 (x − 7/2)² · 1/6 = (1/6)(25/4 + 9/4 + 1/4 + 1/4 + 9/4 + 25/4) = 35/12
We often denote the variance of a random variable X by σX²,
σX² = var(X),
and its standard deviation by σX.
Expected Value and Variance, Feb 2, 2003 - 7 -
Properties of the Variance
The variance can also be written as
var(X) = E(X²) − (E(X))²
To see this (using linearity of the mean):
var(X) = E(X − E(X))²
= E[X² − 2XE(X) + (E(X))²]
= E(X²) − 2E(X)E(X) + (E(X))²
= E(X²) − (E(X))²
Example: Let X ∼ Bin(1, θ). Then
var(X) = E(X²) − (E(X))² = E(X) − (E(X))² = θ − θ² = θ(1 − θ)
Rules for the variance:
◦ For constants a and b
var(aX + b) = a² var(X).
◦ For independent random variables X and Y
var(X + Y ) = var(X) + var(Y ).
Example: Let X ∼ Bin(n, θ). Then
var(X) = n θ (1 − θ)
Expected Value and Variance, Feb 2, 2003 - 8 -
Covariance
For independent random variables X and Y we have
var(X + Y ) = var(X) + var(Y ).
Question: What about dependent random variables?
It can be shown that
var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)
where
cov(X, Y) = E[(X − E(X))(Y − E(Y))]
is the covariance of X and Y.
Properties of the covariance
◦ cov(X, Y ) = E(XY ) − E(X)E(Y )
◦ cov(X, X) = var(X)
◦ cov(X, 1) = 0
◦ cov(X, Y ) = cov(Y, X)
◦ cov(a X1 + b X2, Y ) = a cov(X1, Y ) + b cov(X2, Y )
Expected Value and Variance, Feb 2, 2003 - 9 -
Covariance
Important:
cov(X, Y ) = 0 does NOT imply that X and Y are independent.
Example:
Suppose X ∈ {−1, 0, 1} with probabilities P(X = x) = 1/3 for x = −1, 0, 1. Then E(X) = 0 and
cov(X, X²) = E(X³) = E(X) = 0.
On the other hand
P(X = 1, X² = 0) = 0 ≠ 1/9 = P(X = 1)P(X² = 0),
that is, X and X² are not independent!
Note: The covariance of X and Y measures only linear depen-
dence.
Expected Value and Variance, Feb 2, 2003 - 10 -
Correlation
The correlation coefficient ρ is defined as
ρXY = corr(X, Y) = cov(X, Y) / √(var(X) var(Y)).
Properties:
◦ dimensionless quantity
◦ not affected by linear transformations, i.e.
corr(a X + b, c Y + d) = corr(X, Y )
◦ −1 ≤ ρXY ≤ 1
◦ |ρXY| = 1 if and only if P(Y = a + bX) = 1 for some a and b ≠ 0
◦ measures linear association between X and Y
Example: Three boxes: pp, pd, and dd (Ex 3.6)
Let Xi = 1{penny on ith draw}. Then Xi ∼ Bin(1, p) with p = 1/2 and joint frequency function p(x1, x2):
x1\x2   0    1
0      1/3  1/6
1      1/6  1/3
Thus:
cov(X1, X2) = E[(X1 − p)(X2 − p)] = (1/4)(1/3) + (1/4)(1/3) − 2 · (1/4)(1/6) = 1/12
corr(X1, X2) = (1/12)/(1/4) = 1/3
Expected Value and Variance, Feb 2, 2003 - 11 -
Prediction
An instructor standardizes his midterm and final so the class aver-
age is µ = 75 and the SD is σ = 10 on both tests. The correlation
between the tests is always around ρ = 0.50.
◦ X - score of student on the first examination
◦ Y - score of student on the second examination
Since X and Y are dependent we should be able to predict the
score in the final from the midterm score.
Approach:
◦ Predict Y from a linear function a + bX
◦ Minimize the mean squared error
MSE = E(Y − a − bX)² = var(Y − bX) + [E(Y − a − bX)]²
Solution:
a = µ − bµ and b = σXY/σX² = ρ
Thus the best linear predictor is
Ŷ = µ + ρ(X − µ)
Note:
We expect the student’s score on the final to differ from the mean
only by half the difference observed in the midterm (regression to
the mean).
Expected Value and Variance, Feb 2, 2003 - 12 -
Summary
Bernoulli distribution - Bin(1, θ)
p(x) = θ^x (1 − θ)^{1−x},  E(X) = θ,  var(X) = θ(1 − θ)
Binomial distribution - Bin(n, θ)
p(x) = (n choose x) θ^x (1 − θ)^{n−x},  E(X) = nθ,  var(X) = nθ(1 − θ)
Poisson distribution - Poiss(λ)
p(x) = (λ^x / x!) e^{−λ},  E(X) = λ,  var(X) = λ
Geometric distribution
p(x) = θ(1 − θ)^{x−1},  E(X) = 1/θ,  var(X) = (1 − θ)/θ²
Hypergeometric distribution - H(N, M, n)
p(x) = (M choose x)(N−M choose n−x)/(N choose n),  E(X) = nM/N
Expected Value and Variance, Feb 2, 2003 - 13 -
Properties of the Sample Mean
Consider X1, . . . , Xn independent and identically distributed (iid) with mean µ and variance σ².
X̄ = (1/n) Σ_{i=1}^n Xi (sample mean)
Then
E(X̄) = (1/n) Σ_{i=1}^n µ = µ
var(X̄) = (1/n²) Σ_{i=1}^n σ² = σ²/n
Remarks:
◦ The sample mean is an unbiased estimate of the true mean.
◦ The variance of the sample mean decreases as the sample size
increases.
◦ Law of Large Numbers: It can be shown that for n → ∞
X̄ = (1/n) Σ_{i=1}^n Xi → µ.
Question:
◦ How close to µ is the sample mean for finite n?
◦ Can we answer this without knowing the distribution of X?
Central Limit Theorem, Feb 4, 2004 - 1 -
Properties of the Sample Mean
Chebyshev’s inequality
Let X be a random variable with mean µ and variance σ². Then for any ε > 0
P(|X − µ| > ε) ≤ σ²/ε².
Proof: Let 1{|xi − µ| > ε} = 1 if |xi − µ| > ε, and 0 otherwise. Then
P(|X − µ| > ε) = Σ_i 1{|xi − µ| > ε} p(xi)
= Σ_i 1{(xi − µ)²/ε² > 1} p(xi)
≤ Σ_i ((xi − µ)²/ε²) p(xi) = σ²/ε²
Application to the sample mean:
P(µ − 3σ/√n ≤ X̄ ≤ µ + 3σ/√n) ≥ 1 − 1/9 ≈ 0.889
However: Chebyshev's bound is known to be not very precise.
Example: Xi iid ∼ N(0, 1)
X̄ = (1/n) Σ_{i=1}^n Xi ∼ N(0, 1/n)
Therefore
P(−3/√n ≤ X̄ ≤ 3/√n) = 0.997
Central Limit Theorem, Feb 4, 2004 - 2 -
Central Limit Theorem
Let X1, X2, . . . be a sequence of random variables
◦ independent and identically distributed
◦ with mean µ and variance σ2.
For n ∈ N define
Zn = √n (X̄ − µ)/σ = (1/√n) Σ_{i=1}^n (Xi − µ)/σ.
Zn has mean 0 and variance 1.
Central Limit Theorem
For large n, the distribution of Zn can be approximated by the standard normal distribution N(0, 1). More precisely,
lim_{n→∞} P(a ≤ √n (X̄ − µ)/σ ≤ b) = Φ(b) − Φ(a),
where Φ(z) is the standard normal probability
Φ(z) = ∫_{−∞}^z f(x) dx,
that is, the area under the standard normal curve to the left of z.
Example:
◦ U1, . . . , U12 uniformly distributed on [0, 12).
◦ What is the probability that the sample mean exceeds 9?
P(Ū > 9) = P(√12 (Ū − 6)/√12 > 3) ≈ 1 − Φ(3) = 0.0013
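In STATA, normal() is the standard normal cumulative distribution function, so the tail probability is:
. display 1 - normal(3)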
Central Limit Theorem, Feb 4, 2004 - 3 -
Central Limit Theorem
[Figure: densities of the standardized sample mean Zn for samples from U[0, 1] with n = 1, 2, 6, 12, 100 and from Exp(1) with n = 1, 2, 6, 12, 100, each compared with the N(0, 1) density.]
Central Limit Theorem, Feb 4, 2004 - 4 -
Central Limit Theorem
Example: Shipping packages
Suppose a company ships packages that vary in weight:
◦ Packages have mean 15 lb and standard deviation 10 lb.
◦ They come from a large number of customers, i.e. packages are independent.
Question: What is the probability that 100 packages will have a
total weight exceeding 1700 lb?
Let Xi be the weight of the ith package and
T = Σ_{i=1}^{100} Xi.
Then
P(T > 1700 lb) = P((T − 1500 lb)/(√100 · 10 lb) > (1700 lb − 1500 lb)/(√100 · 10 lb))
= P((T − 1500 lb)/(√100 · 10 lb) > 2)
≈ 1 − Φ(2) = 0.023
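A quick check of the standardization and the tail probability:
. display (1700 - 1500)/(sqrt(100)*10)
. display 1 - normal(2)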
Central Limit Theorem, Feb 4, 2004 - 5 -
Central Limit Theorem
Remarks
• How fast approximation becomes good depends on distribution
of Xi’s:
◦ If it is symmetric and has tails that die off rapidly, n can
be relatively small.
Example: If Xi iid ∼ U[0, 1], the approximation is good for
n = 12.
◦ If it is very skewed or if its tails die down very slowly, a
larger value of n is needed.
Example: Exponential distribution.
• Central limit theorems are very important in statistics.
• There are many central limit theorems covering many situa-
tions, e.g.
◦ for not identically distributed random variables or
◦ for dependent, but not “too” dependent random variables.
Central Limit Theorem, Feb 4, 2004 - 6 -
The Normal Approximation to the Binomial
Let X be binomially distributed with parameters n and p. Recall that X is the sum of n iid Bernoulli random variables,
X = Σ_{i=1}^n Xi, Xi iid ∼ Bin(1, p).
Therefore we can apply the Central Limit Theorem:
Normal Approximation to the Binomial Distribution
For n large enough, X is approximately N(np, np(1 − p)) distributed:
P(a ≤ X ≤ b) ≈ P(a − 1/2 ≤ Z ≤ b + 1/2)
where
Z ∼ N(np, np(1 − p)).
Rule of thumb for n: np > 5 and n(1 − p) > 5.
In terms of the standard normal distribution we get
P(a ≤ X ≤ b) ≈ P((a − 1/2 − np)/√(np(1 − p)) ≤ Z′ ≤ (b + 1/2 − np)/√(np(1 − p)))
= Φ((b + 1/2 − np)/√(np(1 − p))) − Φ((a − 1/2 − np)/√(np(1 − p)))
where Z′ ∼ N(0, 1).
Central Limit Theorem, Feb 4, 2004 - 7 -
The Normal Approximation to the Binomial
[Figure: frequency functions of Bin(1, 0.5), Bin(2, 0.5), Bin(5, 0.5), Bin(10, 0.5), Bin(20, 0.5) and of Bin(1, 0.1), Bin(5, 0.1), Bin(10, 0.1), Bin(20, 0.1), Bin(50, 0.1); the shapes approach the normal curve as n grows.]
Central Limit Theorem, Feb 4, 2004 - 8 -
The Normal Approximation to the Binomial
Example: The random walk of a drunkard
Suppose a drunkard executes a “random” walk in the following
way:
◦ Each minute he takes a step north or south, with probability 1/2 each.
◦ His successive step directions are independent.
◦ His step length is 50 cm.
How likely is he to have advanced 10 m north after one hour?
◦ Position after one hour: X · 1 m − 30 m
◦ X binomially distributed with parameters n = 60 and p = 1/2
◦ X is approximately normal with mean 30 and variance 15:
P(X · 1 m − 30 m ≥ 10 m) = P(X ≥ 40)
≈ P(Z > 39.5), Z ∼ N(30, 15)
= P((Z − 30)/√15 > 9.5/√15)
= 1 − Φ(2.452) = 0.007
How does the probability change if he has some idea of where he wants to go and steps north with probability p = 2/3 and south with probability 1/3?
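Both versions are one-line normal-approximation computations; the second line answers the closing question, using mean 60 · 2/3 = 40 and variance 60 · (2/3) · (1/3):
. display 1 - normal((39.5 - 30)/sqrt(15))
. * p = 2/3: the chance rises to roughly 0.55
. display 1 - normal((39.5 - 40)/sqrt(60*(2/3)*(1/3)))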
Central Limit Theorem, Feb 4, 2004 - 9 -
Estimation
Example: Cholesterol levels of heart-attack patients
Data: Observational study at a Pennsylvania medical center
◦ blood cholesterol levels of patients treated for heart attacks
◦ measurements 2, 4, and 14 days after the attack
Id Y1 Y2 Y3 Id Y1 Y2 Y3
1 270 218 156 15 294 240 264
2 236 234 193 16 282 294 220
3 210 214 242 17 234 220 264
4 142 116 120 18 224 200 213
5 280 200 181 19 276 220 188
6 272 276 256 20 282 186 182
7 160 146 142 21 360 352 294
8 220 182 216 22 310 202 214
9 226 238 248 23 280 218 170
10 242 288 298 24 278 248 198
11 186 190 168 25 288 278 236
12 266 236 236 26 288 248 256
13 206 244 238 27 244 270 280
14 318 258 200 28 236 242 204
Aim: Make inference on the distribution of
◦ the cholesterol level 14 days after the attack: Y3
◦ the decrease in cholesterol level: D = Y1 − Y3
◦ the relative decrease in cholesterol level: R = (Y1 − Y3)/Y3
Confidence intervals I, Feb 11, 2004 - 1 -
Estimation
Data:
d1, . . . , d28 observed decrease in cholesterol level
In this example, parameters of interest might be
µD = E(D) the mean decrease in cholesterol level,
σ²D = var(D) the variation of the cholesterol level,
pD = P(D ≤ 0) the probability of no decrease in cholesterol level.
These parameters are naturally estimated by the following sample statistics:
µ̂D = (1/n) Σ_{i=1}^n di (sample mean)
σ̂²D = (1/n) Σ_{i=1}^n (di − d̄)² (sample variance)
p̂D = #{di | di ≤ 0}/n (sample proportion)
Such statistics are point estimators since they estimate the corre-
sponding parameter by a single numerical value.
◦ Point estimates provide no information about their chance vari-
ation.
◦ Estimates without an indication of their variability are of lim-
ited value.
Confidence intervals I, Feb 11, 2004 - 2 -
Confidence Intervals for the Mean
Recall:
◦ CLT for the sample mean: For large n we have
X̄ ≈ N(µ, σ²/n)
◦ 68-95-99.7 rule: With 95% probability the sample mean differs from the mean µ by less than two standard deviations.
More precisely, we have
P(µ − 1.96 σ/√n ≤ X̄ ≤ µ + 1.96 σ/√n) = 0.95,
or equivalently, after rearranging the terms,
P(X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n) = 0.95.
Interpretation: There is 95% probability that the random interval
[X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n]
will cover the mean µ.
Example: Cholesterol levels
d̄ = 36.89, σ̂ = 51.00, n = 28.
Therefore, the 95% confidence interval for µ is [18.00, 55.78].
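The endpoints are a one-line computation each:
. display 36.89 - 1.96*51.00/sqrt(28)
. display 36.89 + 1.96*51.00/sqrt(28)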
Confidence intervals I, Feb 11, 2004 - 3 -
Confidence Intervals for the Mean
Assumption: The population standard deviation σ is known.
◦ In the next lecture, we will drop this unrealistic assumption.
◦ The assumption is approximately satisfied for large sample sizes, since then σ̂ ≈ σ by the law of large numbers.
Definition: Confidence interval for µ (σ known)
The interval
[X̄ − zα/2 σ/√n, X̄ + zα/2 σ/√n]
is called a (1 − α) confidence interval for the population mean µ. (1 − α) is the confidence level.
For large sample sizes n, an approximate (1 − α) confidence interval for µ is given by
[X̄ − zα/2 σ̂/√n, X̄ + zα/2 σ̂/√n].
Here, zα is the α-critical value of the standard normal distribution:
[Figure: standard normal density with the critical value zα cutting off area α in the right tail.]
◦ zα has area α to its right
◦ Φ(zα) = 1 − α
Confidence intervals I, Feb 11, 2004 - 4 -
Confidence Interval for the Mean
Example: Community banks
◦ Community banks are banks with less than a billion dollars of assets.
◦ Approximately 7500 such banks in the United States.
Annual survey of the Community Bankers Council of the American Bankers
Association (ABA)
◦ Population: Community banks in the United States.
◦ Variable of interest: Total assets of community banks.
◦ Sample size: n = 110
◦ Sample mean: X̄ = 220 million dollars
◦ Sample standard deviation: s = 161 million dollars
◦ Histogram of sampled values:
[Histogram: assets (in millions of dollars) of a sample of 110 community banks.]
Suppose we want to give a 95% confidence interval for the mean total assets
of all community banks in the United States.
◦ α = 0.05, zα/2 = 1.96
A 95% confidence interval for the mean assets (in millions of dollars) is
[220 − 1.96 · 161/√110, 220 + 1.96 · 161/√110] ≈ [190, 250].
Confidence intervals I, Feb 11, 2004 - 5 -
Sample Size
Example: Cholesterol levels
Suppose we want a 99% confidence interval for the decrease in
cholesterol level:
◦ α = 0.01, z0.005 = 2.58
◦ The 99% confidence interval for µD is
[36.89 − 2.58 · 50.93/√28, 36.89 + 2.58 · 50.93/√28] ≈ [12.06, 61.72].
Note: If we raise the confidence level, the confidence interval
becomes wider.
Suppose we want to increase the confidence level without increasing the error of estimation (indicated by the half-width of the confidence interval). For this we have to increase the sample size n.
Question: What sample size n is needed to estimate the mean
decrease in cholesterol with error e = 20 and confidence level 99%?
The error (half-width of the confidence interval) is
e = zα/2 σ/√n.
Therefore the sample size ne needed is given by
ne ≥ (zα/2 σ/e)² = (2.58 · 50.93/20)² = 43.16,
that is, a sample of 44 patients is needed to estimate µD with error e = 20 and 99% confidence.
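A quick check, rounding up with ceil():
. display (2.58*50.93/20)^2
. display ceil((2.58*50.93/20)^2)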
Confidence intervals I, Feb 11, 2004 - 6 -
Estimation of the Mean
Example: Banks’ loan-to-deposit ratio
The ABA survey of community banks also asked about the loan-to-deposit
ratio (LTDR), a bank’s total loans as a percent of its total deposits.
Sample statistics:
◦ n = 110
◦ µ̂LTDR = 76.7
◦ σ̂LTDR = 12.3
[Histogram: loan-to-deposit ratio (LTDR, in %) of a sample of 110 community banks.]
Construction of 95% confidence interval:
◦ α = 0.05, zα/2 = 1.96
◦ Standard error σ̂X̄ = σ̂LTDR/√n = 1.17
◦ 95% confidence interval for µLTDR:
[X̄ − zα/2 σ̂LTDR/√n, X̄ + zα/2 σ̂LTDR/√n] = [74.4, 79.0]
◦ To get an estimate with error e = 3.0 (half-width of the confidence interval) it suffices to sample ne banks,
ne ≥ (zα/2 σ̂LTDR/e)² = (1.96 · 12.3/3.0)² = 64.6.
Thus a sample of ne = 65 banks is sufficient.
Confidence intervals I, Feb 11, 2004 - 7 -
Confidence intervals
Definition: Confidence interval
A (1 − α) confidence interval for a parameter is an interval that
◦ depends only on sample statistics and
◦ covers the parameter with probability (1 − α).
Note:
◦ Confidence intervals are random while the estimated parameter is fixed.
◦ For repeated samples, only a fraction (1 − α) of the confidence intervals (e.g. 95%) will cover the true parameter:
[Figure: confidence intervals computed from repeated samples, drawn around the true mean µ; most, but not all, cover µ.]
Confidence intervals II, Feb 13, 2004 - 1 -
Confidence Intervals for the Mean
Suppose that X1, . . . , Xn iid ∼ N(µ, σ²). Then
(X̄ − µ)/(σ/√n) ∼ N(0, 1) (*)
Assuming that σ is known, we obtain
[X̄ − zα/2 · σ/√n, X̄ + zα/2 · σ/√n]
as (1 − α) confidence interval for µ.
More realistic situation: σ is unknown.
Approach: Replace σ by the estimate σ̂ = s.
This approach leads to the t statistic
T = (X̄ − µ)/(s/√n) ∼ tn−1.
It is t distributed with n − 1 degrees of freedom.
[Figure: densities of the t1, t3, and t10 distributions compared with N(0, 1).]
Confidence interval for the mean µ (σ unknown)
The interval
[X̄ − tn−1,α/2 · s/√n, X̄ + tn−1,α/2 · s/√n]
is a (1 − α) confidence interval for the mean µ.
Notation: Critical values of distributions
zα standard normal distribution
tn,α t distribution with n degrees of freedom
Confidence intervals II, Feb 13, 2004 - 2 -
Confidence Intervals for the Mean
Example: Cholesterol levels
In the study on cholesterol levels, the standard deviation of the decrease
of cholesterol level was unknown.
◦ µ̂D = 36.89, σ̂D = 50.94
◦ t27,0.025 = 2.05
◦ Then
[36.89 − 2.05 · 50.94/√28, 36.89 + 2.05 · 50.94/√28] = [17.16, 56.63]
is a 95% confidence interval for µD.
◦ The large sample confidence interval based on (*) was [18.00, 55.78].
Example: Level of vitamin C
The following data are the amounts of vitamin C, measured in milligrams
per 100 grams (mg/100 g) of corn soy blend, for a random sample of size 8
from a production run:
26 31 23 22 11 22 14 31
What is the 95% confidence interval for µ, the mean vitamin C content of
the CSB produced during this run?
◦ µ̂ = 22.5, σ̂ = 7.2, t7,0.025 = 2.36
◦ The 95% confidence interval for µ is
[22.5 − 2.36 · 7.2/√8, 22.5 + 2.36 · 7.2/√8] = [16.5, 28.5].
◦ The large sample CI would be [17.5, 27.5].
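STATA's invttail() returns the critical value tn,α directly, so the interval can be computed as:
. display invttail(7, 0.025)
. display 22.5 - invttail(7, 0.025)*7.2/sqrt(8)
. display 22.5 + invttail(7, 0.025)*7.2/sqrt(8)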
Confidence intervals II, Feb 13, 2004 - 3 -
Confidence Intervals for the Variance
For normally distributed data X1, . . . , Xn iid ∼ N(µ, σ²), the ratio
(n − 1) · s²/σ²
has a χ² distribution with n − 1 degrees of freedom.
The (1 − α) confidence interval for σ² is
[(n − 1) · s²/χ²n−1,α/2, (n − 1) · s²/χ²n−1,1−α/2],
where χ²n−1,α is the α-critical value of the χ²n−1 distribution.
Caution: This confidence interval is not robust against depar-
tures from normality regardless of the sample size.
Example: Cholesterol levels
Suppose we are interested in the variance of Y3, the cholesterol level 14 days after the attack.
◦ Normal probability plot:
[Figure: normal quantile plot of the cholesterol levels; the points lie close to a line, so the data seem to be normally distributed.]
◦ s² = 2030.55, χ²27,0.975 = 14.57, χ²27,0.025 = 43.19
◦ The 95% confidence interval for σ² is
[27 · 2030.55/43.19, 27 · 2030.55/14.57] = [1269.26, 3761.99]
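The χ² critical values come from invchi2tail() (the upper-tail inverse), so the interval can be computed as:
. display invchi2tail(27, 0.025)
. display invchi2tail(27, 0.975)
. display 27*2030.55/invchi2tail(27, 0.025)
. display 27*2030.55/invchi2tail(27, 0.975)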
Confidence intervals II, Feb 13, 2004 - 4 -
Statistical Tests
Example:
Suppose that of 100 applicants for a job 50 were women and 50 were men,
all equally qualified. Further suppose that the company hired 2 women
and 8 men.
Question:
◦ Does the company discriminate against female job applicants?
◦ How likely is this outcome under the assumption that the company
does not discriminate?
Example:
◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants
Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
Question:
◦ Does the safety program have an effect on the loss of labor due to accidents?
◦ In 9 out of 10 plants the average weekly losses have decreased after implementation of the safety program. How likely is this (or a more extreme) outcome under the assumption that there is no difference before and after implementation of the safety program?
Testing Hypotheses I, Feb 16, 2004 - 1 -
Statistical Tests
Example: Fair coin
Suppose we have a coin. We suspect it might be unfair. We devise a
statistical experiment:
◦ Toss coin 100 times
◦ Conclude that coin is fair if we see between 40 and 60 heads
◦ Otherwise decide that the coin is not fair
Let θ be the probability that the coin lands heads, that is,
P(Xi = 1) = θ and P(Xi = 0) = 1 − θ.
Our suspicion (“coin not fair”) is a hypothesis about the population parameter θ (θ ≠ 1/2) and thus about P. We emphasize this dependence of P on θ by writing Pθ.
Decision problem:
Null hypothesis H0: X ∼ Bin(100, 1/2)
Alternative hypothesis Ha: X ∼ Bin(100, θ), θ ≠ 1/2
The null hypothesis represents the default belief (here: the coin is fair).
The alternative is the hypothesis we accept in view of evidence against the
null hypothesis.
The data-based decision rule
reject H0 if X /∈ [40, 60]
do not reject H0 if X ∈ [40, 60]
is called a statistical test for the test problem H0 vs. Ha.
Testing Hypotheses I, Feb 16, 2004 - 2 -
Statistical Tests
Example: Fair coin (contd)
Note: It is possible to obtain e.g. X = 55 (or X = 65)
◦ with probability 0.048 (resp. 0.0009) if θ = 0.5
◦ with probability 0.048 (resp. 0.049) if θ = 0.6
◦ with probability 0.0005 (resp. 0.047) if θ = 0.7
[Figure: frequency functions of Bin(100, 0.5), Bin(100, 0.6), and Bin(100, 0.7), with the acceptance region X ∈ [40, 60] and the rejection region X ∉ [40, 60] marked.]
Testing Hypotheses I, Feb 16, 2004 - 3 -
Types of errors
Example: Fair coin (contd)
It is possible that the test (decision rule) gives a wrong answer:
◦ If θ = 0.7 and x = 55, we do not reject the null hypothesis that the
coin is fair although the coin in fact is not fair.
◦ If θ = 0.5 and x = 65, we reject the null hypothesis that the coin is fair
although the coin in fact is fair.
The following table lists the possibilities:
Decision H0 true H0 false
Reject H0 type I error correct decision
Accept H0 correct decision type II error
Definition (Types of error)
◦ If we reject H0 when in fact H0 is true, this is a Type I error.
◦ If we do not reject H0 when in fact H0 is false, this is a Type II error.
Testing Hypotheses I, Feb 16, 2004 - 4 -
Types of errors
Question: How good is our decision rule?
For a good decision rule, the probability of committing an error of either
type should be small.
Probability of type I error: α
If the null hypothesis is true, i.e. θ = 1/2, then
Pθ(reject H0) = Pθ(X ∉ [40, 60])
= 1 − Pθ(X ∈ [40, 60])
= 1 − Σ_{x=40}^{60} (100 choose x)(1/2)^100
= 0.035.
Thus the probability of a type I error, denoted as α, is 3.5%.
Probability of type II error: β(θ)
If the null hypothesis is false and the true probability of observing “head” is θ with θ ≠ 1/2, then
Pθ(accept H0) = Pθ(X ∈ [40, 60]) = Σ_{x=40}^{60} (100 choose x) θ^x (1 − θ)^{100−x}
Thus, the probability of an error of type II depends on θ. It will be denoted
as β(θ).
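Both error probabilities are binomial tail computations; a sketch using binomial(), STATA's cumulative binomial function (β evaluated at θ = 0.6 as an example):
. * alpha: probability of rejecting when theta = 0.5
. display 1 - (binomial(100, 60, 0.5) - binomial(100, 39, 0.5))
. * beta(0.6): probability of accepting when theta = 0.6
. display binomial(100, 60, 0.6) - binomial(100, 39, 0.6)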
Testing Hypotheses I, Feb 16, 2004 - 5 -
Power of Tests
Question: How good is our test in detecting the alternative?
Consider the probability of rejecting H0:
Pθ(reject H0) = Pθ(X ∉ [40, 60]) = 1 − Pθ(accept H0) = 1 − β(θ).
Note:
◦ If θ = 1/2 this is the probability of committing an error of type I:
1 − β(1/2) = α
◦ If θ ≠ 1/2 this is the probability of correctly rejecting H0.
Definition (Power of a test)
We call 1 − β(θ) the power of the test as it measures the ability to
detect that the null hypothesis is false.
[Figure: power curve 1 − β(θ), θ ∈ [0, 1], of the test that rejects if X ∉ [40, 60].]
Testing Hypotheses I, Feb 16, 2004 - 6 -
Significance Tests
Idea: minimize probability of committing an error of type I and II
Different probabilities of type I error:
[Figure: power curves 1 − β(θ) for the rejection regions X ∉ [40, 60], X ∉ [38, 62], and X ∉ [42, 58].]
Note: If we decrease the probability of a type I error,
◦ the power of the test, 1 − β(θ), decreases as well and
◦ the probability of a type II error increases.
Problem: cannot minimize both errors simultaneously
Solution:
◦ choose fixed level α for probability of a type I error
◦ under this restriction find test with small probability of a type II error
Remark:
◦ you do not have to do this minimization yourself.
◦ all tests taught in this course are of this kind.
Definition
A test of this kind is called a significance test with significance level α.
Testing Hypotheses I, Feb 16, 2004 - 7 -
Statistical Hypotheses
A statistical hypothesis is an assertion or conjecture about a population,
which may be expressed in terms of
◦ some parameter: mean is zero;
◦ some parameters: mean and median are identical; or
◦ some sampling distribution: this sample is normally distributed.
Test problem - decide between two hypotheses
◦ the null hypothesis H0 and
◦ the alternative hypothesis Ha.
Popperian approach to scientific theories
◦ Scientific theories are subject to falsification.
◦ It is impossible to verify a scientific theory.
Null hypothesis H0
default (current) theory which we try to falsify
Alternative hypothesis Ha
alternative to adopt if null hypothesis is rejected
Examples:
◦ Clinical study of new drug - H0 : drug has no effect
◦ Criminal case - H0 : suspect is not guilty
◦ Safety test of nuclear power station - H0 : power station is not safe
◦ Chances of new investment - H0 : project not profitable
◦ Testing for independence - H0 : random variables are independent
Testing Hypotheses II, Feb 18, 2004 - 1 -
Statistical Tests
Example: Testing for pesticide in discharge water
Suppose the Environmental Protection Agency takes 10 readings on the
amount of pesticide in the discharge water of a chemical company.
Question: Does the concentration cP of pesticide in the water exceed the
allowed maximum concentration c0?
◦ Before taking action against the company, the agency must have some
evidence that the concentration cP exceeds the allowed level.
◦ Without evidence the agency assumes that the pesticide concentration
cP is within the limits of the law.
Consequently, the null hypothesis of the agency is that the pesticide con-
centration cP does not exceed c0. Thus the question corresponds to the
test problem
H0 : cP ≤ c0 vs Ha : cP > c0.
Suppose that the company regularly also runs tests on the amount of pes-
ticide in the discharge water.
Question: Does the concentration cP of pesticide in the water exceed the
allowed maximum concentration c0?
◦ The aim of the company is to avoid fines for exceeding the allowed
level. Thus the company wants to make sure that the concentration
stays within the allowed limits.
Thus, the null hypothesis of the company should be that the pesticide
concentration cP exceeds c0. The question now corresponds to the test
problem
H0 : cP ≥ c0 vs Ha : cP < c0.
Testing Hypotheses II, Feb 18, 2004 - 2 -
Six Steps of Conducting a Test
Steps of a significance test
1. Determine null hypothesis H0 and alternative Ha.
2. Decide on probability of type I error, the significance level α.
3. Find an appropriate test statistic T .
4. Based on the sampling distribution of T , formulate a criterion for
testing H0 against Ha.
5. Calculate value of the test statistic T .
6. Decide whether or not to reject the null hypothesis H0.
Example: Fair coin (contd)
We want to decide from 100 tosses of a coin whether it is fair or not. Let
θ be the probability of heads.
1. Test problem:
H0 : θ = 1/2 vs Ha : θ ≠ 1/2
2. Significance level:
α = 0.05 (most commonly used significance level)
3. Test statistic:
T = X (number of heads in 100 tosses of the coin)
4. Rejection criterion:
reject H0 if T ∉ [40, 60]
5. Observed value of test statistic: Suppose after 100 tosses we obtain
t = 55
6. Decision: Since 55 does not lie in the rejection region, we
do not reject H0.
Testing Hypotheses II, Feb 18, 2004 - 3 -
One and Two-sided Hypotheses
Example: Blood cholesterol after a heart attack
Suppose we are interested in whether the blood cholesterol level two days
after a heart attack differs from the average cholesterol level in the (general)
population (µ0 = 193).
Two cases:
◦ We are interested in any difference from the population mean µ0. Then
we have a two-sided test problem
H0 : µY1 = µ0 vs Ha : µY1 ≠ µ0.
◦ We suspect that the cholesterol level after a heart attack might be
higher than in the general population. In this case, we have a one-sided
test problem
H0 : µY1 = µ0 vs Ha : µY1 > µ0.
Remark:
◦ More generally, we might be interested in one-sided test problems of
the form
H0 : µY1 ≤ µ0 vs Ha : µY1 > µ0,
which accounts for the possibility that µY1 might be smaller than µ0.
◦ For all common test situations (in particular those discussed in this
course), the form of the test does not depend on the form of H0, but
only on the parameter value in H0 that is closest to Ha, that is µ0.
Testing Hypotheses II, Feb 18, 2004 - 4 -
Test Statistic
Let θ be the parameter of interest.
Two-sided test problem
H0 : θ = θ0 against Ha : θ ≠ θ0
One-sided test problem
H0 : θ = θ0 against Ha : θ > θ0 (or Ha : θ < θ0)
Suppose that θ̂ is an estimate for θ.
◦ If θ = θ0 (null hypothesis), we expect the estimate θ̂ to take a value near θ0.
◦ Large deviations from θ0 are evidence against H0.
This suggests the following decision rules:
◦ Ha : θ > θ0: reject H0 if θ̂ − θ0 is much larger than zero
◦ Ha : θ < θ0: reject H0 if θ̂ − θ0 is much smaller than zero
◦ Ha : θ ≠ θ0: reject H0 if |θ̂ − θ0| is much larger than zero
Problem: Often the sampling distribution of the estimate θ̂ depends on the
unknown parameter θ.
Definition (Test statistic)
A test statistic is a random variable
◦ that measures the compatibility between the null hypothesis and the
data and
◦ has a sampling distribution which we know (under H0).
Testing Hypotheses II, Feb 18, 2004 - 5 -
Test Statistic
Example: Blood cholesterol after a heart attack
Data: X1, . . . , X28
◦ blood cholesterol level of 28 patients two days after a heart attack
◦ assumed to be normally distributed with mean µX and variance σ²X
The parameter µX can be estimated by the sample mean

    X̄ = (1/28) Σ_{i=1}^{28} Xi ∼ N(µX, σ²X/28).
This suggests using the standardized sample mean as a test statistic:

    (X̄ − µ0)/(σ/√28) ∼ N(0, 1)   (under H0).
Test H0 : µ ≤ 193 vs Ha : µ > 193 at significance level α = 0.05
◦ Test statistic: Assume σ = 47.7 to be known.
      T = (X̄ − µ0)/(σ/√28)
◦ Rejection criterion: Reject H0 if T > z0.05 = 1.645
◦ Outcome of test: Since the observed value of T is

      t = (253.9 − 193)/(47.7/√28) = 6.76,
we reject the null hypothesis that µ = 193.
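The computation is easily verified in STATA (a minimal check; normal() is the standard normal cdf):

. display (253.9 - 193)/(47.7/sqrt(28))
. display 1 - normal((253.9 - 193)/(47.7/sqrt(28)))

The first line reproduces t = 6.76; the second gives the probability of a value this large under H0, which is essentially zero.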
Testing Hypotheses II, Feb 18, 2004 - 6 -
Tests for the Mean
Tests for the mean µ (σ2 known):
◦ Test statistic:

      T = (X̄ − µ0)/(σ/√n)
◦ Two sided test:
H0 : µ = µ0 against Ha : µ ≠ µ0
reject H0 if |T | > zα/2
◦ One sided tests:
H0 : µ = µ0 against Ha : µ > µ0 (µ < µ0)
reject H0 if T > zα (T < −zα)
Tests for the mean µ (σ2 unknown):
◦ Test statistic:

      T = (X̄ − µ0)/(s/√n)
◦ Two sided test:
H0 : µ = µ0 against Ha : µ ≠ µ0
reject H0 if |T | > tn−1,α/2
◦ One sided tests:
H0 : µ = µ0 against Ha : µ > µ0 (µ < µ0)
reject H0 if T > tn−1,α (T < −tn−1,α)
Example: Blood cholesterol after a heart attack
Estimating the standard deviation from the data, we obtain the test statistic

    T = (X̄ − µ0)/(s/√28) ∼ t27.
Noting that t27,0.05 = 1.703 and t = 6.76, we still reject H0.
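With only the summary statistics, the test can be run with STATA's immediate command (assuming, as above, that the sample standard deviation is s = 47.7):

. ttesti 28 253.9 47.7 193

The output should report t ≈ 6.76 together with the one- and two-sided P-values.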
Testing Hypotheses II, Feb 18, 2004 - 7 -
Tests and Confidence Intervals
Consider level α significance test for the two-sided test problem
H0 : θ = θ0 vs Ha : θ ≠ θ0.
Let
◦ T = Tθ0(X) be the test statistic of the test (depends on θ0)
◦ R be the critical region of the test
Then
C(X) = {θ : Tθ(X) ∉ R}
is a (1 − α) confidence interval for θ: If θ is the true parameter, then

    Pθ(θ ∈ C(X)) = Pθ(Tθ(X) ∉ R) = 1 − Pθ(Tθ(X) ∈ R) = 1 − α.
We have
θ0 ∈ C(X) ⇔ Tθ0(X) ∉ R ⇔ H0 is not rejected
Result A level α two-sided significance test rejects the null hypothesis
H0 : θ = θ0 if and only if the parameter θ0 falls outside a (1 − α)
confidence interval for θ.
Example: Normal distribution
Let X1, . . . , Xn iid∼ N(µ, σ²). We reject H0 : µ = µ0 if

    |X̄ − µ0|/(s/√n) > tn−1,α/2

or equivalently

    |X̄ − µ0| > tn−1,α/2 · s/√n.

Rearranging terms, we find that we reject if

    µ0 ∉ [X̄ − tn−1,α/2 · s/√n, X̄ + tn−1,α/2 · s/√n].
Testing Hypotheses II, Feb 18, 2004 - 8 -
The P -value
Definition (P -value)
The probability that under the null hypothesis H0 the test statistic
would take a value as extreme as or more extreme than that actually
observed is called the P-value of the test.
The P-value is often interpreted as a measure of the strength of evidence
against the null hypothesis: the smaller the P-value, the stronger the
evidence.
However:
◦ The P-value is a random variable (under H0 it is uniformly distributed on [0, 1] for a continuous test statistic).
◦ Without a measure of its variability it is not safe to interpret the actu-
ally observed P -value.
◦ If the P -value is smaller than the chosen significance level α, we reject
the null hypothesis H0.
Three approaches to deciding on test problem:
◦ reject if θ0 ∉ C(X)
◦ reject if T (X) ∈ R
◦ reject if P -value p ≤ α
Example: Blood cholesterol after a heart attack
The observed value for the test statistic

    T = (X̄ − µ0)/(s/√28) ∼ t27

is t = 6.76. The corresponding P-value is

    P(T > 6.76) = 1.47 · 10⁻⁷.
We thus reject the null hypothesis.
Equivalently, the confidence interval for µ is [235.43, 272.42]. Since it does
not contain µ0 = 193 we reject H0 (for the third and last time!).
Testing Hypotheses II, Feb 18, 2004 - 9 -
Example
Data: Banks’ net income
◦ percent change in net income between first half of last year and first
half of this year
◦ sample mean x̄ = 8.1%
◦ sample standard deviation s = 26.4%
Test problem: H0 : µ = 0 against Ha : µ 6= 0
. ttesti 110 8.1 26.4 0
One-sample t test
------------------------------------------------------------------
   |    Obs      Mean   Std. Err.   Std. Dev.  [95% Conf. Interval]
---+--------------------------------------------------------------
 x |    110       8.1    2.517141        26.4   3.111108   13.08889
------------------------------------------------------------------
Degrees of freedom: 109

Ho: mean(x) = 0

  Ha: mean < 0         Ha: mean != 0          Ha: mean > 0
    t =  3.2179           t =  3.2179            t =  3.2179
  P < t =  0.9991      P > |t| =  0.0017      P > t =  0.0009
Critical value of t distribution with 109 degrees of freedom:
t109,0.025 = 1.982
Result:
◦ |t| > t109,0.025, therefore the test rejects H0 at significance level α = 0.05.
◦ Equivalently, µ0 = 0 /∈ [3.11, 13.09] and thus the test rejects H0.
◦ Equivalently, P -value is less than α = 0.05 and thus the test rejects H0.
Testing Hypotheses II, Feb 18, 2004 - 10 -
Exact Binomial Test
Example: Fair coin
Data: 100 tosses of a coin which we suspect might be unfair.
Modelling:
◦ θ is the probability that the coin lands heads up
◦ X is the number of heads in 100 tosses of the coin
◦ X is binomially distributed with parameters n and θ.
Decision problem:
◦ Null hypothesis H0: coin is fair
◦ Alternative hypothesis Ha: coin is unfair
Test problem:

    H0 : θ = 1/2 vs Ha : θ ≠ 1/2.
Under the null hypothesis H0, the distribution of X is known,

    X ∼ Bin(100, 1/2).
Reject the null hypothesis if

    X ∉ [b100,0.5,0.975, b100,0.5,0.025] = [40, 60],

where bn,θ,α denotes the α fractile of Bin(n, θ).
Note:
◦ Exact binomial tests typically have smaller significance level α due to
discreteness of distribution.
◦ In the above example, the probability of a type I error is P(reject H0) = α = 0.035.
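In STATA the exact binomial test is available as an immediate command; for 55 heads in 100 tosses under H0 : θ = 1/2 (a sketch of the call):

. bitesti 100 55 0.5

Since 55 lies well inside [40, 60], the reported exact P-values are far above conventional significance levels.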
Testing Hypotheses III, Feb 20, 2004 - 1 -
Sign Test
Example: Safety program
◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants
Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
Question:
◦ Does the safety program have an effect on the loss of labour due to accidents?
The Sign Test for matched pairs
◦ Ignore pairs with difference 0
◦ Number of trials n is the count of the remaining pairs
◦ The test statistic is the count X of pairs with positive difference
◦ X is binomially distributed with parameters n and θ.
◦ Null hypothesis H0: θ = 1/2
  (i.e. the median of the differences is zero)
Example:
For the safety program data, we find
◦ n = 10, X = 9
◦ Test H0 : θ = 1/2 against Ha : θ > 1/2
◦ The P-value of the observed count X is

      P(X ≥ 9) = (10 choose 9)(1/2)^10 + (10 choose 10)(1/2)^10 = 11/1024 = 0.0107
Since the P -value is smaller than α = 0.05 we reject the null hypothesis H0
that the safety program has no effect on the loss of labour due to accidents.
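The same P-value can be obtained from STATA's exact binomial test, or with signtest if the raw data were loaded (hypothetical variable names before and after):

. bitesti 10 9 0.5
. signtest before = after

bitesti reports the one-sided probability P(X ≥ 9) = 0.0107 computed above.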
Testing Hypotheses III, Feb 20, 2004 - 2 -
Tests for Proportions
Example: Blood cholesterol after a heart attack
Suppose we are interested in the proportion p of patients who show a
decrease of cholesterol level between the second and the 14th day after a
heart attack.
The proportion p can be estimated by the sample proportion

    p̂ = X/n,

where X is the number of patients whose cholesterol level decreased.
Question: Does a decrease occur more often than an increase?
Test problem: H0 : p = 1/2 vs Ha : p > 1/2
Exact tests:
Since X is binomially distributed, we can use exact binomial tests.
Large sample approximations:
Facts:
◦ E(p̂) = p
◦ var(p̂) = p(1 − p)/n
◦ (p̂ − p)/√(p(1 − p)/n) ≈ N(0, 1)   (for large n)
Under the null hypothesis H0, we get

    T = (p̂ − p0)/√(p0(1 − p0)/n) ≈ N(0, 1).
Hence, we reject H0 if T > zα.
Example: Blood cholesterol after a heart attack
◦ n = 28, x = 22, p̂ = 22/28 = 0.79, α = 0.05, z0.05 = 1.645
◦ t = (0.79 − 0.5)/√(0.5 · 0.5/28) = 3.07   (note: the denominator uses p0 = 0.5, as in the definition of T above)
◦ P-value: P(T > t) ≈ 0.0011.
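STATA's immediate command for a one-sample proportion test uses exactly this null-based standard error (a sketch; the sample proportion is entered as 22/28):

. prtesti 28 0.7857143 0.5

The reported z should be close to the hand computation above; small differences come only from rounding p̂ to 0.79.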
Testing Hypotheses III, Feb 20, 2004 - 3 -
Confidence Intervals for Proportions
Exact binomial confidence intervals
◦ difficult to compute
◦ use statistics software
Example: Blood cholesterol after a heart attack
◦ 28 patients in the study
◦ 22 showed a decrease in cholesterol level between second and 14th day
after the attack
Computation of an exact binomial confidence interval in STATA:
. cii 28 22
                                  -- Binomial Exact --
Variable |  Obs       Mean   Std. Err.   [95% Conf. Interval]
---------+----------------------------------------------------
         |   28   .7857143    .0775443     .590469    .9170394
Testing Hypotheses III, Feb 20, 2004 - 4 -
Confidence Intervals for Proportions
Large sample approximations
The CLT states that for large n, p̂ is approximately normally distributed,

    p̂ ≈ N(p, p(1 − p)/n).
Problems:
◦ variance is unknown
◦ the estimate p̂(1 − p̂)/n is zero if p̂ = 0 or p̂ = 1
Example: What is the proportion of HIV+ students at the UofC?
◦ Random sample of 100 students
◦ None test positive for HIV
Are you absolutely sure that there are no HIV+ students at the UofC?
Idea: Estimate p by

    p̃ = (X + 2)/(n + 4)   (Wilson estimate)

and use

    [ p̃ − zα/2 √(p̃(1 − p̃)/(n + 4)),  p̃ + zα/2 √(p̃(1 − p̃)/(n + 4)) ]

as a (1 − α) confidence interval for p.
Example: Blood cholesterol after a heart attack
. cii 28 22, wilson
                                    ------ Wilson ------
Variable |  Obs       Mean   Std. Err.   [95% Conf. Interval]
---------+----------------------------------------------------
         |   28   .7857143    .0775443    .6046141    .8978754
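The "plus four" interval from the formula above can also be computed by hand, here with p̃ = (22 + 2)/(28 + 4) = 0.75 (a sketch; it is close to, though not identical to, the Wilson score interval that cii reports):

. display (22+2)/(28+4) - 1.96*sqrt(0.75*0.25/(28+4))
. display (22+2)/(28+4) + 1.96*sqrt(0.75*0.25/(28+4))

This gives approximately [0.60, 0.90].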
Testing Hypotheses III, Feb 20, 2004 - 5 -
Paired Samples
Example: Safety program
◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants
Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
Question: Does the safety program have a positive effect?
Approach:
◦ Consider the differences before and after implementation of the program,

      Di = Xi^(before) − Xi^(after)   (the decrease in losses)

◦ The Di's are approximately normal: Di iid∼ N(µ, σ²)
◦ H0 : µ = 0 against Ha : µ > 0
◦ Significance level α = 0.01
◦ One sample t test:

      T = D̄/(s/√n)

  Reject if T > tn−1,α
[Figure: normal quantile plot of the decrease in losses of work against normal quantiles.]
Result:
◦ d̄ = 10.27, s = 7.98, n = 10
◦ t = 4.07 and t9,0.01 = 2.82, P-value: 0.0014
◦ Test rejects H0 at significance level α = 0.01
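Given only the summary statistics, the same test can be run with the immediate command (using d̄ = 10.27 and s = 7.98 from above):

. ttesti 10 10.27 7.98 0

The relevant one-sided P-value is the one reported under Ha: mean > 0.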
Testing Hypotheses III, Feb 20, 2004 - 6 -
Paired Sample t Test
Data: (X1, Y1), . . . , (Xn, Yn)
Assumptions:
◦ Pairs are independent
◦ Di = Xi − Yi iid∼ N(µ, σ²)
◦ Apply one-sample t test
Paired sample t test
◦ Test statistic

      T = (D̄ − µ0)/(s/√n)
◦ Two-sided test:
H0 : µ = µ0 against Ha : µ ≠ µ0
reject H0 if |T | > tn−1,α/2
◦ One-sided test:
H0 : µ = µ0 against Ha : µ > µ0
reject H0 if T > tn−1,α
Power of the paired sample t test and the paired sign test:

[Figure: power curves 1 − β(δ) of the paired sample t test and the sign test plotted against the shift δ.]
Testing Hypotheses III, Feb 20, 2004 - 7 -
Sign and t Test
t test:
◦ based on Central Limit Theorem
◦ reasonably robust against departures from normality
◦ do not use if n is small and
⋄ data are strongly skewed or
⋄ data have clear outliers
Sign test:
◦ uses much less information than the t test
◦ for normal data, less powerful than the t test
◦ makes no assumption on the distribution, and hence keeps its significance level regardless of the distribution
◦ preferable for very small data sets
Remark:
◦ The two-step procedure
1. assess normality by normal quantile plot
2. conduct either t test or sign test depending on result in step 1
does not attain the chosen significance level α (two tests!).
◦ The sign test is rarely used since there are more powerful distribution-free tests (e.g. the Wilcoxon signed rank test).
Testing Hypotheses III, Feb 20, 2004 - 8 -
Two Sample Problems
Two sample problems
◦ The goal of inference is to compare the responses in two groups.
◦ Each group is a sample from a different population.
◦ The responses in each group are independent of those in the other
group.
Example: Effects of ozone
Study the effects of ozone by controlled randomized experiment
◦ 55 seventy-day-old rats were randomly assigned to the treatment or the control group
◦ Treatment group: 22 rats were kept in an environment containing ozone.
◦ Control group: 23 rats were kept in an ozone-free environment
◦ Data: Weight gains after 7 days
We are interested in the difference in weight gain be-
tween the treatment and control group.
Question: Do the weight gains differ between groups?
◦ x1, . . . , x22 - weight gains for treatment group
◦ y1, . . . , y23 - weight gains for control group
◦ Test problem:
H0 : µX = µY vs Ha : µX ≠ µY
◦ Idea: Reject the null hypothesis if |x̄ − ȳ| is large.
[Figure: boxplots of weight gain (in grams) for the treatment and control groups.]
Two Sample Tests, Feb 23, 2004 - 1 -
Comparing Means
Let X1, . . . , Xm and Y1, . . . , Yn be two independent normally distributed
samples. Then
    X̄ − Ȳ ∼ N(µX − µY, σ²X/m + σ²Y/n)
Two-sample t test
◦ Two-sample t statistic

      T = (X̄ − Ȳ) / √(s²X/m + s²Y/n)

  The distribution of T can be approximated by a t distribution.
◦ Two-sided test:
H0 : µX = µY against Ha : µX ≠ µY
reject H0 if |T | > tdf,α/2
◦ One-sided test:
H0 : µX = µY against Ha : µX > µY
reject H0 if T > tdf,α
◦ Degrees of freedom:
  ⋄ Approximations for df are provided by statistical software.
  ⋄ Satterthwaite approximation (commonly used):

        df = (s²X/m + s²Y/n)² / [ (1/(m−1)) · (s²X/m)² + (1/(n−1)) · (s²Y/n)² ]

  ⋄ Otherwise: use the conservative approximation df = min(m − 1, n − 1).
Two Sample Tests, Feb 23, 2004 - 2 -
Comparing Means
Example: Effects of ozone
Data:
◦ Treatment group: x̄ = 11.01, sX = 19.02, m = 22
◦ Control group: ȳ = 22.43, sY = 10.78, n = 23
Test problem:
◦ H0 : µX = µY vs Ha : µX ≠ µY
◦ α = 0.05, df = min(m − 1, n − 1) = 21, t21,0.025 = 2.08
The value of the test statistic is

    t = (x̄ − ȳ) / √(s²X/m + s²Y/n) = −2.46.

The corresponding P-value is

    P(|T| ≥ |t|) = P(|T| ≥ 2.46) = 0.023.
Thus we reject the hypothesis that ozone has no effect on weight gain.
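As a check, Satterthwaite's df formula from the previous page can be evaluated directly (a sketch; it plugs in sX = 19.02, m = 22, sY = 10.78, n = 23):

. display (19.02^2/22 + 10.78^2/23)^2 / ((19.02^2/22)^2/21 + (10.78^2/23)^2/22)

This returns approximately 32.9, matching the degrees of freedom in the STATA output below.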
Two-sample t test with STATA:
. ttest weight, by(group) unequal
Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      23    22.42609    2.247108    10.77675    17.76587     27.0863
       1 |      22    11.00909    4.054461    19.01711    2.577378     19.4408
---------+--------------------------------------------------------------------
combined |      45    16.84444    2.422057    16.24765    11.96311    21.72578
---------+--------------------------------------------------------------------
    diff |              11.417    4.635531                1.985043    20.84895
------------------------------------------------------------------------------
Satterthwaite's degrees of freedom: 32.9179

Ho: mean(0) - mean(1) = diff = 0

  Ha: diff < 0           Ha: diff != 0           Ha: diff > 0
    t =  2.4629            t =  2.4629             t =  2.4629
  P < t =  0.9904       P > |t| =  0.0192        P > t =  0.0096
Two Sample Tests, Feb 23, 2004 - 3 -
Comparing Means
Suppose that σ²X = σ²Y = σ². Then

    σ²/m + σ²/n = σ² (1/m + 1/n).
Estimate σ² by the pooled sample variance

    s²p = [(m − 1) s²X + (n − 1) s²Y] / (m + n − 2).
Pooled two-sample t test
◦ Two-sample t statistic

      T = (X̄ − Ȳ) / (sp √(1/m + 1/n))

  T is t distributed with m + n − 2 degrees of freedom.
◦ Two-sided test:
H0 : µX = µY against Ha : µX ≠ µY
reject H0 if |T | > tm+n−2,α/2
◦ One-sided test:
H0 : µX = µY against Ha : µX > µY
reject H0 if T > tm+n−2,α
Remarks:
◦ If m ≈ n, the test is reasonably robust against
◦ nonnormality and
◦ unequal variances.
◦ If sample sizes differ a lot, test is very sensitive to unequal variances.
◦ Tests for differences in variances are sensitive to nonnormality.
Two Sample Tests, Feb 23, 2004 - 4 -
Comparing Means
Example: Parkinson’s disease
Study on Parkinson’s disease
◦ Parkinson’s disease, among other things, affects a
person’s ability to speak
◦ Overall condition can be improved by an operation
◦ How does the operation affect the ability to speak?
◦ Treatment group: Eight patients received operation
◦ Control group: Fourteen patients
◦ Data:
⋄ scores on several tests
⋄ high scores indicate problem with speaking
[Figure: boxplots of speaking ability scores for the treatment and control groups.]
Pooled two-sample t test with STATA:
. infile ability group using parkinson.txt
. ttest ability, by(group)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      14    1.821429     .148686    .5563322    1.500212    2.142645
       1 |       8        2.45      .14516    .4105745    2.106751    2.793249
---------+--------------------------------------------------------------------
combined |      22        2.05    .1249675    .5861497    1.790116    2.309884
---------+--------------------------------------------------------------------
    diff |           -.6285714    .2260675                -1.10014   -.1570029
------------------------------------------------------------------------------
Degrees of freedom: 20

Ho: mean(0) - mean(1) = diff = 0

  Ha: diff < 0           Ha: diff != 0           Ha: diff > 0
   t = -2.7805            t = -2.7805             t = -2.7805
  P < t =  0.0058       P > |t| =  0.0115        P > t =  0.9942
Two Sample Tests, Feb 23, 2004 - 5 -
Comparing Variances
Example: Parkinson’s disease
In order to apply the pooled two-sample t test, the variances of the two
groups have to be equal. Are the data compatible with this assumption?
F test for equality of variances
The F test statistic

    F = s²X / s²Y

is, under H0, F distributed with m − 1 and n − 1 degrees of freedom.
. sdtest ability, by(group)
Variance ratio test
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      14    1.821429     .148686    .5563322    1.500212    2.142645
       1 |       8        2.45      .14516    .4105745    2.106751    2.793249
---------+--------------------------------------------------------------------
combined |      22        2.05    .1249675    .5861497    1.790116    2.309884
------------------------------------------------------------------------------
Ho: sd(0) = sd(1)

F(13,7) observed   = F_obs         = 1.836
F(13,7) lower tail = F_L = 1/F_obs = 0.545
F(13,7) upper tail = F_U = F_obs   = 1.836

  Ha: sd(0) < sd(1)        Ha: sd(0) != sd(1)          Ha: sd(0) > sd(1)
  P < F_obs = 0.7865    P < F_L + P > F_U = 0.3767     P > F_obs = 0.2135
Result: We cannot reject the null hypothesis that the variances are equal.
Problem: Are the data normally
distributed?
[Figure: normal quantile plots of speaking ability for the treatment group and the control group.]
Two Sample Tests, Feb 23, 2004 - 6 -
Comparing Proportions
Suppose we have two populations with unknown proportions p1 and p2.
◦ Random samples of size n1 and n2 are drawn from the two populations
◦ p̂1 is the sample proportion for the first population
◦ p̂2 is the sample proportion for the second population
Question: Are the two proportions p1 and p2 different?
Test problem:
H0 : p1 = p2 vs Ha : p1 ≠ p2

Idea: Reject H0 if |p̂1 − p̂2| is large.
Note that

    p̂1 − p̂2 ≈ N(p1 − p2, p1(1 − p1)/n1 + p2(1 − p2)/n2).
This suggests the test statistic

    T = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2)),

where p̂ is the combined proportion of successes in both samples,

    p̂ = (X1 + X2)/(n1 + n2) = (n1 p̂1 + n2 p̂2)/(n1 + n2),

with X1 and X2 denoting the number of successes in each sample.
with X1 and X2 denoting the number of successes in each sample.
Under H0, the test statistic is approximately standard normally dis-
tributed.
Two Sample Tests, Feb 23, 2004 - 7 -
Comparing Proportions
Example: Question wording
The ability of question wording to affect the outcome of a survey can be a
serious issue. Consider the following two questions:
1. Would you favor or oppose a law that would require a person to obtain
a police permit before purchasing a gun?
2. Would you favor or oppose a law that would require a person to obtain
a police permit before purchasing a gun, or do you think such a law
would interfere too much with the right of citizens to own guns?
In two surveys, the following results were obtained:
Question Yes No Total
1 463 152 615
2 403 182 585
Question: Is the true proportion of people favoring the permit law the
same in both groups or not?
. prtesti 615 0.753 585 0.689
Two-sample test of proportion              x: Number of obs =  615
                                           y: Number of obs =  585
--------------------------------------------------------------------------
Variable |      Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------+----------------------------------------------------------------
       x |      .753   .0173904                      .7189155    .7870845
       y |      .689   .0191387                      .6514889    .7265111
---------+----------------------------------------------------------------
    diff |      .064   .0258595                      .0133163    .1146837
         | under Ho:   .0258799    2.47   0.013
--------------------------------------------------------------------------
Ho: proportion(x) - proportion(y) = diff = 0

  Ha: diff < 0           Ha: diff != 0           Ha: diff > 0
    z =  2.473             z =  2.473              z =  2.473
  P < z =  0.9933       P > |z| =  0.0134        P > z =  0.0067
Two Sample Tests, Feb 23, 2004 - 8 -
Final Remarks
Statistical theory focuses on the significance level, the probability of a type
I error.
In practice, a discussion of the power of a test is also important:
Example: Efficient Market Hypothesis
“Efficient market hypothesis” for stock prices:
◦ future stock prices show only random variation
◦ market incorporates all information available now in present prices
◦ no information available now will help to predict future stock prices
Testing of the efficient market hypothesis:
◦ Many studies tested
H0: Market is efficient
Ha: Prediction is possible
◦ Almost all studies failed to find good evidence against H0.
◦ Consequently the efficient market hypothesis became quite popular.
Problem:
◦ Power was generally low in the significance tests employed in the stud-
ies.
◦ Failure to reject H0 is no evidence that H0 is true.
◦ More careful studies showed that the size of a company and measures
of value such as ratio of stock price to earnings do help predict future
stock prices.
Two Sample Tests, Feb 23, 2004 - 9 -
Final Remarks
Example
◦ IQ of 1000 women and 1000 men
◦ µw = 100.68, σw = 14.91
◦ µm = 98.90, σm = 14.68
◦ Pooled two-sample t test: T = −2.7009
◦ Reject H0 : µw = µm since |T | > t1998,0.005 = 2.58.
◦ The difference in the IQ is statistically significant at the 0.01 level.
◦ However we might conclude that the difference is scientifically irrele-
vant.
Note: A small P-value does not mean there is a large difference; it only
means there is strong evidence that there is some difference.
Two Sample Tests, Feb 23, 2004 - 10 -
Final Remarks
Example: Is radiation from cell phones harmful?
◦ Observational study
◦ Comparison of brain cancer patients and similar group without brain
cancer
◦ No statistically significant association between cell phone use and a
group of brain cancers known as gliomas.
◦ Separate analysis for 20 types of gliomas found association between
phone use and one rare form.
◦ Risk seemed to decrease with greater mobile phone use.
Think for a moment:
◦ Suppose all 20 null hypotheses are true.
◦ Each test has 5% chance of being significant - the outcome is Bernoulli
distributed with parameter 0.05.
◦ The number of false positive tests is binomially distributed:
N ∼ Bin(20, 0.05)
◦ The probability of getting one or more positive results is

      P(N ≥ 1) = 1 − P(N = 0) = 1 − 0.95^20 = 0.64.
We therefore might have expected at least one significant association.
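This probability is quickly verified in STATA:

. display 1 - 0.95^20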
Beware of searching for significance
Two Sample Tests, Feb 23, 2004 - 11 -
Final Remarks
Problem: If several tests are performed, the probability of a type I error
increases.
Idea: Adjust significance level of each single test.
Bonferroni procedure:
◦ Perform k tests
◦ Use significance level α/k for each of the k tests
◦ If all null hypotheses are true, the probability is at most α that any of the
tests rejects its null hypothesis.
Example
Suppose we perform k = 6 tests at overall level 0.05, i.e. each test is carried out at level α/k = 0.05/6 = 0.0083, and obtain the following P-values:

    0.476   0.032   0.241   0.008*   0.010   0.001*

Only the two tests marked (*) have P-values below α/k and are therefore significant at the overall 0.05 level.
Two Sample Tests, Feb 23, 2004 - 12 -
Two-Way Tables
Example: Depression and marital status
Question: Does severity of depression depend on marital status?
◦ Study of 159 depression patients
◦ Patients were categorized by
⋄ severity of depression (severe, normal, mild)
⋄ marital status (single, married, widowed/divorced)
The following two-way table summarizes the data:
Depression Marital Status Total
Single Married Wid/Div
Severe 16 22 19 57
Normal 29 33 14 76
Mild 9 14 3 26
Total 54 69 36 159
◦ Each combination of values defines a cell.
◦ The severity of depression is a row variable.
◦ The marital status is a column variable.
Inference for Two-Way Tables, Feb 25, 2004 - 1 -
Two-Way Tables
From this table of counts, the sample distribution can be obtained
by dividing each cell by the total sample size n = 159:
Depression Marital Status Total
Single Married Wid/Div
Severe 0.101 0.138 0.119 0.358
Normal 0.182 0.208 0.088 0.478
Mild 0.057 0.088 0.019 0.164
Total 0.340 0.434 0.226 1.000
◦ Joint distribution: proportion for each combination of values
◦ Marginal distribution: distribution of the row and column
variables separately.
◦ Conditional distribution: distribution of one variable at a
given level of the other variable
Inference for Two-Way Tables, Feb 25, 2004 - 2 -
Test for Independence
Example: Depression and marital status
Conditional distributions of severity of depression given marital
status:
[Figure: bar chart of the conditional distributions (sample proportions) of depression severity (severe, normal, mild) given marital status (single, married, wid/div).]
Question: Is there a relationship between the row variable (depression)
and the column variable (marital status)?
◦ The distribution for widowed/divorced patients seems to differ
from the distributions for single or married patients.
◦ Are these differences significant or can they be attributed to
chance variation?
◦ How likely are differences as large as or larger than those observed
if the two variables were indeed independent (and thus the conditional
distributions were the same)?
A statistical test will be required to answer these questions.
Inference for Two-Way Tables, Feb 25, 2004 - 3 -
Test for Independence
Test problem:
H0: the row and the column variables are independent
Ha: the row and the column variables are dependent
How can we measure evidence against the null hypothesis?
◦ What counts would we expect to observe if the null hypothesis
were true?
    Expected Cell Count = (row total × column total) / total count
Recall: For two independent events A and B, P(A∩B) = P(A)P(B).
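For example, in the depression data the expected count for severely depressed single patients is 57 × 54/159 = 19.36, compared with the observed count of 16.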
If the null hypothesis H0 is true, then the table of expected
counts should be “close” to the observed table of counts.
◦ We need a statistic that measures the difference between the
tables.
◦ And we need to know what is the distribution of the statistic
to make statistical inference.
Inference for Two-Way Tables, Feb 25, 2004 - 4 -
Test for Independence
Idea of the test:
◦ construct table of expected counts
◦ compare expected with observed counts
◦ if the null hypothesis is true, the difference between the tables
should be “small”
The χ2 (Chi-Squared) Statistic
To measure how far the expected table is from the observed table,
we use the following test statistic:
    X = Σ_{all cells} (Observed − Expected)² / Expected

◦ Under the null hypothesis, X is approximately χ² distributed
  with (r − 1)(c − 1) degrees of freedom.

Why (r − 1)(c − 1)?
Recall that our “expected” table is based on some quantities estimated
from the data: namely the row and column totals.
Once these totals are known, filling in any (r− 1)(c− 1) undetermined
table entries actually gives us the whole table. Thus, there are only
(r − 1)(c − 1) freely varying quantities in the table.
◦ We reject H0 if observed and expected counts are very different
and hence X is large. Consequently we reject H0 at significance
level α if
X ≥ χ²(r−1)(c−1),α.
Inference for Two-Way Tables, Feb 25, 2004 - 5 -
The χ2 Distribution
What does the χ2 distribution look like?
[Figure: χ² densities for 1, 5, 10, 20, and 30 degrees of freedom.]
◦ Unlike the Normal or t distributions, the χ2 distribution takes
values in (0,∞).
◦ As with the t distribution, the exact shape of the χ2 distribution
depends on its degrees of freedom.
Recall that X has only an approximate χ²(r−1)(c−1) distribution.
When is the approximation valid?
◦ For any two-way table larger than 2 × 2, we require that the
average expected cell count is at least 5 and each expected count
is at least one.
◦ For 2×2 tables, we require that each expected count be at least
5.
Inference for Two-Way Tables, Feb 25, 2004 - 6 -
Test for Independence
Example: Depression and marital status
The following table show the observed counts and expected counts
(in brackets):
Depression Marital Status Total
Single Married Wid/Div
Severe 16 22 19 57
(19.36) (24.74) (12.90)
Normal 29 33 14 76
(25.81) (32.98) (17.21)
Mild 9 14 3 26
(8.83) (11.28) (5.89)
Total 54 69 36 159
◦ The table is 3 × 3, so there are (r − 1)(c − 1) = 2 × 2 = 4
degrees of freedom.
◦ The critical value (significance level α = 0.05) is χ²4,0.05 = 9.49.
◦ The observed value of the χ² test statistic is

      x = (16 − 19.36)²/19.36 + (22 − 24.74)²/24.74 + . . . + (3 − 5.89)²/5.89
        = 6.83 ≤ χ²4,0.05

  Thus we do not reject the null hypothesis of independence.
◦ The corresponding P-value is

      P(X ≥ x) = P(X ≥ 6.83) = 0.145 ≥ α.

  Again we do not reject H0.
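The P-value can be computed in STATA from the χ² tail function (chi2tail(df, x) returns P(X ≥ x)):

. display chi2tail(4, 6.83)

which returns approximately 0.145, matching the output below.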
Inference for Two-Way Tables, Feb 25, 2004 - 7 -
Test for Independence
The χ2 test in STATA:
. insheet using depression.txt, clear
(3 vars, 159 obs)

. tabulate depression marital, chi2

           |             Marital
Depression |   Married     Single    Wid/Div |     Total
-----------+---------------------------------+----------
      Mild |        14          9          3 |        26
    Normal |        33         29         14 |        76
    Severe |        22         16         19 |        57
-----------+---------------------------------+----------
     Total |        69         54         36 |       159

          Pearson chi2(4) =   6.8281   Pr = 0.145
The same result can be obtained by the command
. tabi 16 22 19 \ 29 33 14 \ 9 14 3, chi2
           |                col
       row |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |        16         22         19 |        57
         2 |        29         33         14 |        76
         3 |         9         14          3 |        26
-----------+---------------------------------+----------
     Total |        54         69         36 |       159

          Pearson chi2(4) =   6.8281   Pr = 0.145
Inference for Two-Way Tables, Feb 25, 2004 - 8 -
Models for Two-Way Tables
The χ²-test for the presence of a relationship between the two variables
in a two-way table is valid for data produced by several different study
designs, although the exact null hypothesis varies.
◦ Examining independence between variables
⋄ Select random sample of size n from a population.
⋄ Classify each individual according to two categorical variables.
Question: Is there a relationship between the two variables?
Test problem:
H0: The two variables are independent
Ha: The two variables are not independent
Example: Suppose we collect an SRS of 114 college students, and categorize each by major and GPA (e.g. (0, 0.5], . . . , (3.5, 4]). Then we can
use the χ²-test to ascertain whether grades and major are independent.
◦ Comparing several populations
⋄ Select independent random samples from each of c populations, of
sizes n1, . . . , nc.
⋄ Classify each individual according to a categorical response variable
with r possible values (the same across populations).
⋄ This yields an r × c table.
Question: Does the distribution of the response variable differ between populations?
Test problem:
H0: The distribution is the same in all populations.
Ha: The distribution is not the same.
Example: Suppose we select independent SRSs of Psychology, Biology
and Math majors, of sizes 40, 39, 35, and classify each individual by
GPA range. Then, we can use a χ2-test to ascertain whether or not the
distribution of grades is the same in all three populations.
Inference for Two-Way Tables, Feb 25, 2004 - 9 -
Models for Two-Way Tables
Example: Literary Analysis (Rice, 1995)
When Jane Austen died, she left the novel Sanditon only partially completed, but she left a summary of the remainder. A highly literate admirer
finished the novel, attempting to emulate Austen's style, and the hybrid
was published. Someone counted the occurrences of various words in several chapters from various works.
Word        Sense and Sensibility    Emma    Sanditon I    Sanditon II
                      (Austen)     (Austen)    (Austen)    (Imitator)
a                          147         186          101            83
an                          25          26           11            29
this                        32          39           15            15
that                        94         105           37            22
with                        59          74           28            43
without                     18          10           10             4
TOTAL                      375         440          202           196
Questions:
◦ Is there consistency in Austen’s work (do the frequencies with which
Austen used these words change from work to work)?
Answer X = 12.27, df=?, P-value=?
◦ Was the imitator successful (are the frequencies of the words the same
in Austen’s work and the imitator’s work)?
Inference for Two-Way Tables, Feb 25, 2004 - 10 -
Simpson’s Paradox
Example: Medical study
◦ contacted randomly chosen people in a district in England
◦ data on 1314 women contacted
◦ each was either a current smoker or had never smoked
Question: Survival rate after 20 years?
Smoker Not
Dead 139 230
Alive 438 502
Result: A higher percent of smokers stayed alive (438/577 ≈ 76% of the smokers vs 502/732 ≈ 69% of the nonsmokers)!
Here are the same data classified by their age at time of the survey:
Age 18 to 44
Smoker Not
Dead 19 13
Alive 269 327
Age 45 to 64
Smoker Not
Dead 78 52
Alive 162 147
Age 65+
Smoker Not
Dead 42 165
Alive 7 28
Age at the time of the study is a confounding variable: in each age
group, a higher percent of nonsmokers survived.
Simpson’s Paradox
An association or comparison that holds for each of several groups can
reverse direction when the data are combined to form a single group.
Inference for Two-Way Tables, Feb 25, 2004 - 11 -
Simple Linear Regression
Example: Body density
Aim: Measure body density (weight per unit volume of the body)
(Body density indicates the fat content of the human body.)
Problem:
◦ Body density is difficult to measure directly.
◦ Research suggests that skinfold thickness can accurately predict body
density.
◦ Skinfold thickness is measured by pinching a fold of skin between
calipers.
[Figure: scatter plot of body density (10³ kg/m³) against skinfold thickness (mm).]
Questions:
◦ Are body density and skinfold thickness related?
◦ How accurately can we predict body density from skinfold thickness?
Regression: predict response variable for fixed value of explanatory variable
◦ describe linear relationship in data by regression line
◦ fitted regression line is affected by chance variation in observed data
Statistical inference: accounts for chance variation in data
Simple Linear Regression, Feb 27, 2004 - 1 -
Population Regression Line
Simple linear regression studies the relationship between
◦ a response variable Y and
◦ a single explanatory variable X.
We expect that different values of X will produce different mean responses
of Y .
For given X = x, we consider the subpopulation with X = x:
◦ this subpopulation has mean

      µY|X=x = E(Y | X = x)   (conditional mean of Y given X = x)

◦ and variance

      σ²Y|X=x = var(Y | X = x)   (conditional variance of Y given X = x)
Linear regression model with constant variance:

    E(Y | X = x) = µY|X=x = a + b x   (population regression line)
    var(Y | X = x) = σ²Y|X=x = σ²
◦ The population regression line connects the conditional means of the
response variable for fixed values of the explanatory variable.
◦ This population regression line tells how the mean response of Y varies
with X.
◦ The variance (and standard deviation) does not depend on x.
Simple Linear Regression, Feb 27, 2004 - 2 -
Conditional Mean
[Figure: sampling density f(x, y) of a sample (x1, y1), . . . , (xn, yn); fixing x = x0 gives the slice f(x0, y), which after rescaling by fX(x0) becomes the conditional density.]

Conditional probability:

    f(y|x0) = fXY(x0, y) / fX(x0)

Conditional mean:

    E(Y | X = x0) = ∫ y fY|X(y|x0) dy
Simple Linear Regression, Feb 27, 2004 - 3 -
The Linear Regression Model
Simple linear regression
Yi = a + b xi + εi, i = 1, . . . , n
where
Yi response (also dependent variable)
xi predictor (also independent variable)
εi error
Assumptions:
◦ Predictor xi is deterministic (fixed values, not random).
◦ Errors have zero mean, E(εi) = 0.
◦ Variation about the mean does not depend on xi, i.e. var(εi) = σ².
◦ Errors εi are independent.
Often we additionally assume:
◦ The errors are normally distributed,

      εi iid∼ N(0, σ²).
For fixed x the response Y is normally distributed:

    Y ∼ N(a + b x, σ²).
Simple Linear Regression, Feb 27, 2004 - 4 -
Least Squares Estimation
Data: (Y1, x1), . . . , (Yn, xn)
Aim: Find the straight line which fits the data best:

    Ŷi = a + b xi   fitted values for coefficients a and b
    a - intercept
    b - slope
Least Squares Approach:
Minimize the squared distance between observed Yi and fitted Ŷi:

    L(a, b) = Σ_{i=1}^n (Yi − Ŷi)² = Σ_{i=1}^n (Yi − a − b xi)²

Set the partial derivatives to zero (normal equations):

    ∂L/∂a = 0 ⇔ Σ_{i=1}^n (Yi − a − b xi) = 0
    ∂L/∂b = 0 ⇔ Σ_{i=1}^n (Yi − a − b xi) · xi = 0
Solution: Least squares estimators

    â = Ȳ − (SXY/SXX) · x̄
    b̂ = SXY/SXX

where

    SXY = Σ_{i=1}^n (Yi − Ȳ)(xi − x̄)   (sum of cross products)
    SXX = Σ_{i=1}^n (xi − x̄)²
Simple Linear Regression, Feb 27, 2004 - 5 -
Least Squares Estimation
Least squares predictor Ŷ:

    Ŷi = â + b̂ xi

Residuals ε̂i:

    ε̂i = Yi − Ŷi = Yi − â − b̂ xi

Residual sum of squares (SSResidual):

    SSResidual = Σ_{i=1}^n ε̂i² = Σ_{i=1}^n (Yi − Ŷi)²

Estimation of σ²:

    σ̂² = (1/(n − 2)) Σ_{i=1}^n (Yi − Ŷi)² = SSResidual/(n − 2)

Regression standard error:

    se = σ̂ = √(SSResidual/(n − 2))

Variation accounting:

    SSTotal = Σ_{i=1}^n (Yi − Ȳ)²      total variation
    SSModel = Σ_{i=1}^n (Ŷi − Ȳ)²      variation explained by the linear model
    SSResidual = Σ_{i=1}^n (Yi − Ŷi)²  remaining variation
Simple Linear Regression, Feb 27, 2004 - 6 -
Least Squares Estimation
Example: Body density
Scatter plot with least squares regression line:
[Figure: scatter plot of body density (10³ kg/m³) against skinfold thickness (mm) with the least squares regression line.]
Calculation of least squares estimates:

    x̄       ȳ       SXX      SXY       SYY     SSResidual
    1.064   1.568   0.0235   −0.2679   4.244   1.187

    b̂ = SXY/SXX = −0.2679/0.0235 = −11.40
    â = ȳ − b̂ x̄ = 1.568 + 11.40 · 1.064 = 13.70
    σ̂² = SSResidual/(n − 2) = 1.187/90 = 0.0132
    se = √σ̂² = √0.0132 = 0.1149
Simple Linear Regression, Feb 27, 2004 - 7 -
Least Squares Estimation
Example: Body density
Using STATA:
. infile ID BODYD SKINT using bodydens.txt, clear
(92 observations read)
. regress BODYD SKINT
      Source |       SS       df       MS              Number of obs =      92
-------------+------------------------------           F(  1,    90) =  231.89
       Model |  3.05747739     1  3.05747739           Prob > F      =  0.0000
    Residual |  1.18663025    90  .013184781           R-squared     =  0.7204
-------------+------------------------------           Adj R-squared =  0.7173
       Total |  4.24410764    91  .046638546           Root MSE      =  .11482

------------------------------------------------------------------------------
       BODYD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       SKINT |  -11.41345   .7494999   -15.23   0.000    -12.90246   -9.924433
       _cons |   13.71221   .7975822    17.19   0.000     12.12768    15.29675
------------------------------------------------------------------------------
. twoway (lfitci BODYD SKINT, range(1 1.1)) (scatter BODYD SKINT), xtitle(Skin thickness) ytitle(Body density) scheme(s1color) legend(off)
[Figure: scatter of body density against skin thickness with the fitted line and confidence band.]
Simple Linear Regression, Feb 27, 2004 - 8 -
Properties of Estimators
Statistical properties of â and b̂

Mean and variance of b̂:

    E(b̂) = b
    var(b̂) = σ²/SXX

Distribution of b̂:

    b̂ ∼ N(b, σ²/SXX)

Mean and variance of â:

    E(â) = a
    var(â) = (1/n + x̄²/SXX) σ²

Distribution of â:

    â ∼ N(a, (1/n + x̄²/SXX) σ²)

Recall that

    SXX = Σ_{i=1}^n (xi − x̄)²
Inference for Regression, Mar 1, 2004 - 1 -
Confidence Intervals
Note that b̂ ∼ N(b, σ²/SXX). Thus

    (b̂ − b)/(σ/√SXX) ∼ N(0, 1).

Substituting se for σ, we obtain

    (b̂ − b)/(se/√SXX) ∼ tn−2.

(1 − α) confidence interval for b:

    b̂ ± tn−2,α/2 · se/√SXX

Similarly

    (â − a)/(σ √(1/n + x̄²/SXX)) ∼ N(0, 1).

Substituting se for σ, we obtain

    (â − a)/(se √(1/n + x̄²/SXX)) ∼ tn−2.

(1 − α) confidence interval for a:

    â ± tn−2,α/2 · se · √(1/n + x̄²/SXX)
Inference for Regression, Mar 1, 2004 - 2 -
Tests on the Coefficients
Question: Is b equal to some value b0?
The corresponding test problem is

    H0 : b = b0 versus Ha : b ≠ b0.

The test statistic is given by

    Tb = (b̂ − b0)/(se/√SXX) ∼ tn−2.

The null hypothesis H0 : b = b0 is rejected if

    |Tb| > tn−2,α/2.

Question: Is a equal to some value a0?
The corresponding test problem is

    H0 : a = a0 versus Ha : a ≠ a0.

The test statistic is given by

    Ta = (â − a0)/(se √(1/n + x̄²/SXX)) ∼ tn−2.

The null hypothesis H0 : a = a0 is rejected if

    |Ta| > tn−2,α/2.
Inference for Regression, Mar 1, 2004 - 3 -
Inference for the Coefficients
Example: Body density
The confidence interval for b is given by

    b̂ ± tn−2,α/2 · se/√SXX = −11.41 ± 1.99 · √0.0132/√0.023 = [−12.92, −9.90]

The confidence interval for a is given by

    â ± tn−2,α/2 · se · √(1/n + x̄²/SXX)
        = 13.71 ± 1.99 · √0.0132 · √(1/92 + 1.06²/0.023) = [12.11, 15.30]

Furthermore we find

    Tb = b̂/(se/√SXX) = −15.22,   so |Tb| > t90,0.025 = 1.99.

Thus we reject H0 : b = 0 at significance level 0.05: the coefficient b is
statistically significantly different from zero.
Similarly

    Ta = â/(se √(1/n + x̄²/SXX)) = 17.26 > t90,0.025 = 1.99.

Thus we reject H0 : a = 0 at significance level 0.05: the coefficient a is
statistically significantly different from zero.
The corresponding P-values are
◦ P(|Tb| ≥ 15.22) ≈ 0
◦ P(|Ta| ≥ 17.26) ≈ 0
Inference for Regression, Mar 1, 2004 - 4 -
Estimating the Mean
In the linear regression model, the mean of Y at x = x0 is given by

    E(Y) = a + b x0.

Our estimate for the mean of Y at x = x0 is

    Ŷx0 = â + b̂ x0.

Question: How precise is this estimate?
Note that

    Ŷx0 = â + b̂ x0 = Ȳ + b̂ (x0 − x̄).

Hence we obtain

    E(Ŷx0) = a + b x0
    var(Ŷx0) = (1/n + (x0 − x̄)²/SXX) σ²

(1 − α) confidence interval for E(Yx0):

    (â + b̂ x0) ± tn−2,α/2 · se · √(1/n + (x0 − x̄)²/SXX)
Inference for Regression, Mar 1, 2004 - 5 -
Estimating the Mean
Example: Body density
Suppose the measured skin thickness is x0 = 1.1 mm.
What is the mean body density for this value of skin thickness?
◦ Point estimate:

      Ŷx0 = â + b̂ x0 = 13.71 − 11.41 · 1.1 = 1.159

  The mean body density is 1.159 · 10³ kg/m³.
◦ Confidence interval:

      (â + b̂ x0) ± tn−2,α/2 · se · √(1/n + (x0 − x̄)²/SXX)
          = (13.71 − 11.41 · 1.1) ± 1.99 · √0.0132 · √(1/92 + (1.1 − 1.06)²/0.023)
          = [1.09, 1.22]
In STATA, the standard error for estimating the mean of Y is calculated
by passing the option stdp to predict:
. predict BDH
. predict SE, stdp
. generate low=BDH-invttail(90,.025)*SE
. generate high=BDH+invttail(90,.025)*SE
. sort SKINT
. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black black black) || scatter BODYD SKINT, legend(off) scheme(s1color)
[Figure: pointwise confidence band for the mean response (dashed) around the fitted line, with the observed data.]
Inference for Regression, Mar 1, 2004 - 6 -
Prediction
Suppose we want to predict Y at x = x0.
Aim: (1 − α) prediction interval for Y
Note that

    â + b̂ x0 − Y ∼ N(0, σ² (1 + 1/n + (x0 − x̄)²/SXX)).

Thus the desired (1 − α) prediction interval for Yx0 is given by

    â + b̂ x0 ± tn−2,α/2 · se · √(1 + 1/n + (x0 − x̄)²/SXX)
Inference for Regression, Mar 1, 2004 - 7 -
Prediction
Example: Body density
Suppose the measured skin thickness is x0 = 1.1 mm.
What is the predicted body density for this value of skin thickness?
◦ Point estimate: Ŷx0 = â + b̂ x0 = 13.71 − 11.41 · 1.1 = 1.159
  The predicted body density is 1.159 · 10³ kg/m³.
◦ Prediction interval:

      (â + b̂ x0) ± tn−2,α/2 · se · √(1 + 1/n + (x0 − x̄)²/SXX)
          = (13.71 − 11.41 · 1.1) ± 1.99 · √0.0132 · √(1 + 1/92 + (1.1 − 1.06)²/0.023)
          = [0.92, 1.40]
In STATA, the standard error for predicting Y is calculated by passing the
option stdf to predict:
. drop SE low high
. predict SE, stdf
. generate low=BDH-invttail(90,.025)*SE
. generate high=BDH+invttail(90,.025)*SE
. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black black black) || scatter BODYD SKINT, legend(off) scheme(s1color)
Alternatively, we can use the following command:
. twoway (lfitci BODYD SKINT, range(1 1.1) stdf) (scatter BODYD SKINT), xtitle(Skin thickness) ytitle(Body density) scheme(s1color) legend(off)
[Figures: fitted line with prediction bands (stdf) and the observed data; the band for predicting an individual response is wider than the confidence band for the mean.]
Inference for Regression, Mar 1, 2004 - 8 -
Multiple Regression
Example: Food expenditure and family income
Data: ◦ Sample of 20 households
◦ Food expenditure (response variable)
◦ Family income and family size
. regress food income
-------------------------------------------------------------------------
    food |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
---------+---------------------------------------------------------------
  income |   .1841099   .0149345   12.33   0.000     .1527336    .2154862
   _cons |  -.4119994   .7637666   -0.54   0.596    -2.016613    1.192615
-------------------------------------------------------------------------

. regress food number
-------------------------------------------------------------------------
    food |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
---------+---------------------------------------------------------------
  number |   2.287334   .4224493    5.41   0.000     1.399801    3.174867
   _cons |   1.217365   1.410627    0.86   0.399    -1.746252    4.180981
-------------------------------------------------------------------------
[Figure: scatter plots of food expenditure against income and against family size.]
Multiple Regression, Mar 3, 2004 - 1 -
Multiple Regression
Multiple regression model
Yi = b0 + b1 x1,i + b2 x2,i + . . . + bp xp,i + εi i = 1, . . . , n
where
◦ Yi response variable
◦ x1,i, . . . , xp,i predictor variables (fixed, nonrandom)
◦ b0, . . . , bp regression coefficients
◦ εi iid∼ N(0, σ²) error variable
Example: Food expenditure and family income
Fitting multiple regression models in STATA:
. regress food income number
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  2,    17) =  121.47
       Model |  386.312865     2  193.156433           Prob > F      =  0.0000
      Resid. |  27.0326365    17  1.59015509           R-squared     =  0.9346
-------------+------------------------------           Adj R-squared =  0.9269
       Total |  413.345502    19  21.7550264           Root MSE      =   1.261

-------------------------------------------------------------------------
    food |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
---------+---------------------------------------------------------------
  income |   .1482117   .0163786    9.05   0.000     .1136558    .1827676
  number |   .7931055   .2444411    3.24   0.005     .2773798    1.308831
   _cons |  -1.118295   .6548524   -1.71   0.106    -2.499913    .2633232
-------------------------------------------------------------------------
Multiple Regression, Mar 3, 2004 - 2 -
Multiple Regression
Example: Food expenditure and family income
Data: (Foodi, Incomei, Numberi), i = 1, . . . , 20
Fitted regression model:
Food = b0 + b1 Income + b2 Number
[Figure: 3D plot of observed responses Yi and fitted values Ŷi over the (income, family size) plane.]
Fitted model is a two-dimensional plane - difficult to visualize.
Multiple Regression, Mar 3, 2004 - 3 -
Inference for Multiple Regression
Multiple regression model (matrix notation)
Y = X b + ε
where
Y n dimensional vector
X n × (1 + p) dimensional matrix
b 1 + p dimensional vector
ε n dimensional vector
Thus the model can be written as

    (Y1, . . . , Yn)ᵀ = X · (b0, b1, . . . , bp)ᵀ + (ε1, . . . , εn)ᵀ,

where the i-th row of X is (1, x1,i, . . . , xp,i).
Least squares approach: Minimize

    ‖Y − Ŷ‖² = Σ_{i=1}^n (Yi − Ŷi)²

Results:

    b̂ = (XᵀX)⁻¹XᵀY ∼ N(b, σ²(XᵀX)⁻¹)
    Ŷ = X(XᵀX)⁻¹XᵀY ∼ N(X b, σ²X(XᵀX)⁻¹Xᵀ)
    ε̂ = Y − Ŷ = (I − X(XᵀX)⁻¹Xᵀ) Y ∼ N(0, σ²(I − X(XᵀX)⁻¹Xᵀ))
    σ̂² = s²e = ‖Y − Ŷ‖²/(n − p − 1) = (1/(n − p − 1)) Σ_{i=1}^n (Yi − Ŷi)²
Details: see a course in regression analysis (STAT 22200) or econometrics
Multiple Regression, Mar 3, 2004 - 4 -
Inference for Multiple Regression
Example: Food expenditure and family income
Interpretation of regression coefficients
. quietly regress food income
. predict e_food1, residuals
. quietly regress number income
. predict e_num, residuals
. regress e_food1 e_num
------------------------------------------------------------------------
 e_food1 |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
---------+--------------------------------------------------------------
   e_num |   .7931055   .2375541    3.34   0.004     .2940229    1.292188
------------------------------------------------------------------------
. quietly regress food number
. predict e_food2, residuals
. quietly regress income number
. predict e_inc, residuals
. regress e_food2 e_inc
------------------------------------------------------------------------
 e_food2 |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
---------+--------------------------------------------------------------
   e_inc |   .1482117   .0159172    9.31   0.000      .114771    .1816525
------------------------------------------------------------------------
Result:
◦ b̂j measures the dependence of Y on xj after removing the linear effects
of all other predictors xk, k ≠ j.
◦ b̂j = 0 if xj provides no information for the prediction of Y in addition
to the information given by the other predictor variables.
Multiple Regression, Mar 3, 2004 - 5 -
Multiple Regression
Example: Heart catheterization
Description: A Teflon tube (catheter) 3 mm in diameter is passed into a major vein or
artery at the femoral region and pushed up into the heart to obtain information about
the heart's physiology and functional ability. The catheter length is typically
determined by a physician's educated guess.
Data:
◦ Study with 12 children with congenital heart defects
◦ The exact required catheter length was measured using a fluoroscope
◦ Each patient's height and weight were recorded
Question: How accurately can catheter length be determined from height
and weight?
[Figure: scatter plots of distance (cm) against height (in) and against weight (lb).]
Multiple Regression, Mar 3, 2004 - 6 -
Multiple Regression
Example: Heart catheterization (contd)
Regression model:
Y = b0 + b1 x1 + b2 x2 + ε
where ◦ Y - distance to pulmonary artery
◦ x1 - height
◦ x2 - weight
STATA regression output:
. regress distance height weight
      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  2,     9) =   18.62
       Model |   578.81613     2  289.408065           Prob > F      =  0.0006
    Residual |  139.913037     9   15.545893           R-squared     =  0.8053
-------------+------------------------------           Adj R-squared =  0.7621
       Total |  718.729167    11  65.3390152           Root MSE      =  3.9428

------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .1963566   .3605845    0.54   0.599    -.6193422    1.012056
      weight |   .1908278    .165164    1.16   0.278    -.1827991    .5644547
       _cons |    21.0084   8.751156    2.40   0.040     1.211907    40.80489
------------------------------------------------------------------------------
Note:
◦ Neither height nor weight seems to be significant for predicting the distance to the pulmonary artery.
◦ Yet the regression on both variables explains 80% of the variation of the
response (catheter length).
Multiple Regression, Mar 3, 2004 - 7 -
Multiple Regression
Example: Heart catheterization (contd)
Consider predicting the length by height alone and by weight alone:
. regress distance height
                                                       R-squared = 0.7765
------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .5967612   .1012558    5.89   0.000     .3711492    .8223732
       _cons |   12.12405   4.247174    2.85   0.017     2.660752    21.58734
------------------------------------------------------------------------------

. regress distance weight
                                                       R-squared = 0.7989
------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   .2772687   .0439881    6.30   0.000     .1792571    .3752804
       _cons |   25.63746   2.004207   12.79   0.000     21.17181    30.10311
------------------------------------------------------------------------------
Note:
◦ In a simple regression of Y on either height or weight, the explanatory
variable is highly significant for predicting Y .
◦ In a multiple regression of Y on height and weight, the coefficients for
both height and weight are not significantly different from zero.
Problem: Explanatory variables are highly linearly dependent (collinear)
[Scatterplot: Weight (lb) against Height (in), showing a strong linear relationship]
Multiple Regression, Mar 3, 2004 - 8 -
Analysis of Variance
Decomposition of variation:
◦ SSTotal = Σi (Yi − Ȳ)² - total variation
◦ SSResidual = Σi (Yi − Ŷi)² - variation not explained by the regression model
◦ SSModel = SSTotal − SSResidual = Σi (Ŷi − Ȳ)² - variation explained by the regression
Coefficient of determination: The ratio
    R² = SSModel / SSTotal
indicates how well the regression model predicts the response. R² is also
the squared multiple correlation coefficient - in a simple linear regression
we have
    R² = ρ²XY.
Example: Heart catheterization
      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  2,     9) =   18.62
       Model |   578.81613     2  289.408065           Prob > F      =  0.0006
    Residual |  139.913037     9   15.545893           R-squared     =  0.8053
-------------+------------------------------           Adj R-squared =  0.7621
       Total |  718.729167    11  65.3390152           Root MSE      =  3.9428
The coefficient of determination for these data is
    R² = 578.82/718.73 = 0.81.
Regression on height and weight explains 81% of the variation of distance.
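The same ratio can be recovered from the results that regress leaves behind
(a sketch; e(mss) and e(rss) are the model and residual sums of squares
stored by regress):

. regress distance height weight
. display e(mss)/(e(mss) + e(rss))

This prints the R-squared of 0.8053 reported in the output above.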
Multiple Regression, Mar 3, 2004 - 9 -
Analysis of Variance
Question: Is improvement in prediction (decrease in variation) significant?
Our null hypothesis is that none of the explanatory variables helps to
predict the response, that is,
H0 : b1 = . . . = bp = 0 versus Ha : bj ≠ 0 for at least one j ∈ {1, . . . , p}.
Under the null hypothesis H0 the F statistic
    F = ((n − p − 1)/p) · SSModel/SSResidual
      = ((n − p − 1)/p) · (SSTotal − SSResidual)/SSResidual
is F-distributed with p and n − p − 1 degrees of freedom.
The null hypothesis H0 is rejected at level α if F > Fp,n−p−1,α.
Example: Heart catheterization
      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  2,     9) =   18.62
       Model |   578.81613     2  289.408065           Prob > F      =  0.0006
    Residual |  139.913037     9   15.545893           R-squared     =  0.8053
-------------+------------------------------           Adj R-squared =  0.7621
       Total |  718.729167    11  65.3390152           Root MSE      =  3.9428
The value of the F statistic is
    F = (9/2) · (578.82/139.91) = 18.62.
The critical value for rejecting H0 : b1 = b2 = 0 is F2,9,0.05 = 4.26. Thus
the null hypothesis H0 that both coefficients b1 and b2 are zero is rejected
at significance level α = 0.05.
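The critical value and the p-value can also be computed directly in STATA
(a sketch using the built-in distribution functions invFtail and Ftail):

. display invFtail(2, 9, 0.05)
. display Ftail(2, 9, 18.62)

The first line prints the critical value 4.26, the second the p-value of the
observed F statistic (about 0.0006, matching Prob > F in the output).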
Multiple Regression, Mar 3, 2004 - 10 -
Comparing Models
Example: Cobb-Douglas production function
Y = t · K^a · L^b · M^c
where ◦ Y - output
◦ K - capital
◦ L - labour
◦ M - materials
Regression model:
log Y = log t + a log K + b log L + c log M
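In STATA the logged variables might be generated along these lines (a sketch;
the raw variable names Y, K, L and M are assumptions chosen to match the
notation above):

. generate LY = log(Y)
. generate LK = log(K)
. generate LL = log(L)
. generate LM = log(M)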
[Scatterplots: output Y against capital K, labour L, and materials M]
Multiple Regression, Mar 3, 2004 - 11 -
Comparing Models
Example: Cobb-Douglas production function (contd)
Regression model M0 for Cobb-Douglas function:
log Y = log t + a log K + b log L + c log M
. regress LY LK LM LL

  Source |       SS       df       MS             Number of obs =      25
---------+------------------------------         F(  3,    21) =  138.98
   Model |  1.35136742     3  .450455808         Prob > F      =  0.0000
Residual |  .068065609    21  .003241219         R-squared     =  0.9520
---------+------------------------------         Adj R-squared =  0.9452
   Total |  1.41943303    24  .059143043         Root MSE      =  .05693

-------------------------------------------------------------------------
      LY |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
---------+---------------------------------------------------------------
      LK |   .0718626   .1543912     0.47   0.646   -.2492114    .3929366
      LM |   .7072231   .3004146     2.35   0.028    .0824768    1.331969
      LL |   .2117778   .4248755     0.50   0.623   -.6717991    1.095355
   _cons |   .0347117   .0374354     0.93   0.364   -.0431395    .1125629
-------------------------------------------------------------------------
Two variables, log K and log L, do not appear to improve the prediction of log Y.
Alternative model M1:
log Y = log t + c log M
. regress LY LM

  Source |       SS       df       MS             Number of obs =      25
---------+------------------------------         F(  1,    23) =  445.69
   Model |  1.34977753     1  1.34977753         Prob > F      =  0.0000
Residual |  .069655501    23    .0030285         R-squared     =  0.9509
---------+------------------------------         Adj R-squared =  0.9488
   Total |  1.41943303    24  .059143043         Root MSE      =  .05503

-------------------------------------------------------------------------
      LY |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
---------+---------------------------------------------------------------
      LM |   .9086794   .0430421    21.11   0.000      .81964    .9977188
   _cons |   .0512244   .0189767     2.70   0.013     .011968    .0904808
-------------------------------------------------------------------------
Question: Is model M0 significantly better than model M1?
Multiple Regression, Mar 3, 2004 - 12 -
Comparing Models
Consider the multiple regression model with p explanatory variables
Yi = b0 + b1 x1,i + . . . + bp xp,i + εi.
Problem:
Test the null hypothesis
H0: q specific explanatory variables all have zero coefficients
versus
Ha: at least one of these q explanatory variables has a nonzero coefficient.
Solution:
◦ Regress Y on all p explanatory variables and read SSResidual⁽¹⁾ from the
output.
◦ Regress Y on the p − q explanatory variables that remain after removing
the q variables from the model, and read SSResidual⁽²⁾ from the output.
◦ The test statistic is
      F = ((n − p − 1)/q) · (SSResidual⁽²⁾ − SSResidual⁽¹⁾) / SSResidual⁽¹⁾.
  Under the null hypothesis, F is F-distributed with q and n − p − 1
  degrees of freedom.
◦ Reject H0 if F > Fq,n−p−1,α (see the STATA sketch below).
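For the Cobb-Douglas comparison on the next slides, the statistic can be
computed from STATA's stored results (a sketch; e(rss) and e(N) are stored
by regress, and here p = 3 and q = 2):

. regress LY LK LM LL
. scalar RSS1 = e(rss)
. regress LY LM
. scalar RSS2 = e(rss)
. display ((e(N) - 3 - 1)/2)*(RSS2 - RSS1)/RSS1

The last line prints the F statistic (about 0.245).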
Multiple Regression, Mar 3, 2004 - 13 -
Comparing Models
Example: Cobb-Douglas production function
Comparison of models M0 and M1:
◦ M0: SSResidual⁽⁰⁾ = .06807 and n − p − 1 = 21.
◦ M1: SSResidual⁽¹⁾ = .06966 and q = 2.
◦     F = (21/2) · (.06966 − .06807)/.06807 = 0.2453
◦ Since F < F2,21,0.05 = 3.47 we cannot reject H0 : a = b = 0.
Using STATA:
. test LK LL
 ( 1)  LK = 0
 ( 2)  LL = 0

       F(  2,    21) =    0.25
            Prob > F =    0.7847

. test LK LL _cons

 ( 1)  LK = 0
 ( 2)  LL = 0
 ( 3)  _cons = 0

       F(  3,    21) =    2.43
            Prob > F =    0.0934
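Note: The first test reproduces the hand computation (F = 0.25, matching
0.2453 up to rounding); the second additionally constrains the intercept to
zero, which is a different hypothesis.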
Multiple Regression, Mar 3, 2004 - 14 -
Case Study
Example: Headaches and pain reliever
◦ 24 patients with a common type of headache were treated with a new
pain reliever
◦ Medication was given to each patient at one of four dosage levels:
2, 5, 7, or 10 grams
◦ Response variable: time until noticeable relief (in minutes)
◦ Other explanatory variables:
⋄ sex (0=female, 1=male)
⋄ blood pressure (0.25=low, 0.50=medium, 0.75=high)
Box plots
[Box plots: time to relief (in minutes) by sex within each dosage level (2, 5, 7, and 10 grams)]
Multiple Regression II, Mar 5, 2004 - 1 -
Case Study
. regress time dose bp if sex==0
                                                       R-squared     =  0.8861
--------------------------------------------------------------------------
    time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
    dose |  -5.519608   .6608907    -8.35   0.000    -7.014646   -4.024569
      bp |         -5   9.439407    -0.53   0.609    -26.35342    16.35342
   _cons |   61.11765   6.458495     9.46   0.000     46.50752    75.72778
--------------------------------------------------------------------------

. predict YHf
(option xb assumed; fitted values)

. twoway line YHf dose if bp==0.25 || line YHf dose if bp==0.5 ||
>        line YHf dose if bp==0.75 || scatter time dose if sex==0, saving(a, replace)
(file a.gph saved)

. regress time dose bp if sex==1
                                                       R-squared     =  0.5765
--------------------------------------------------------------------------
    time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
    dose |  -3.343137   .9564492    -3.50   0.007    -5.506776   -1.179499
      bp |       -2.5   13.66083    -0.18   0.859    -33.40294    28.40294
   _cons |   51.39216   9.346814     5.50   0.000      30.2482    72.53612
--------------------------------------------------------------------------

. predict YHm
(option xb assumed; fitted values)

. twoway line YHm dose if bp==0.25 || line YHm dose if bp==0.5 ||
>        line YHm dose if bp==0.75 || scatter time dose if sex==1, saving(b, replace)
(file b.gph saved)
. graph combine a.gph b.gph
[Combined plot: fitted time vs. dose at each blood pressure level with observed times; left panel females, right panel males]
Multiple Regression II, Mar 5, 2004 - 2 -
Case Study
Model:
Time = Dose + Sex + Sex · Dose + BP + ε
. infile time dose sex bp using headache.dat
(24 observations read)

. generate sexdose=sex*dose

. regress time dose sex sexdose bp

    Source |       SS       df       MS              Number of obs =      24
-----------+------------------------------          F(  4,    19) =   16.78
     Model |  4387.65319     4   1096.9133          Prob > F      =  0.0000
  Residual |  1242.30515    19  65.3844814          R-squared     =  0.7793
-----------+------------------------------          Adj R-squared =  0.7329
     Total |  5629.95833    23  244.780797          Root MSE      =  8.0861

---------------------------------------------------------------------------
      time |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-----------+---------------------------------------------------------------
      dose |  -5.519608   .8006399    -6.89   0.000   -7.195367   -3.843849
       sex |   -8.47549   7.553222    -1.12   0.276   -24.28457    7.333585
   sexdose |   2.176471   1.132276     1.92   0.070     -.19341    4.546351
        bp |      -3.75   8.086067    -0.46   0.648   -20.67433    13.17433
     _cons |   60.49265   6.698634     9.03   0.000    46.47224    74.51305
---------------------------------------------------------------------------

. predict YH
(option xb assumed; fitted values)

. predict E, residuals
Residual plots: residuals vs. dose, and a normal QQ plot of the residuals
[Left: residuals (in minutes) against dose (in grams); right: sample quantiles of the residuals against theoretical normal quantiles]
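Similar diagnostic plots can be produced in STATA (a sketch; E holds the
residuals computed by predict above):

. scatter E dose
. qnorm E

scatter plots the residuals against dose; qnorm plots the quantiles of E
against the quantiles of a normal distribution.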
Multiple Regression II, Mar 5, 2004 - 3 -
Case Study
Model:
Time = Dose + Dose² + Sex + Sex · Dose + BP + ε
. drop YH E
. generate dosesq=dose^2
. regress time dose sex sexdose dosesq bp
    Source |       SS       df       MS              Number of obs =      24
-----------+------------------------------          F(  5,    18) =   24.20
     Model |  4901.02819     5  980.205637          Prob > F      =  0.0000
  Residual |  728.930147    18  40.4961193          R-squared     =  0.8705
-----------+------------------------------          Adj R-squared =  0.8346
     Total |  5629.95833    23  244.780797          Root MSE      =  6.3637

---------------------------------------------------------------------------
      time |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-----------+---------------------------------------------------------------
      dose |  -12.91961   2.171775    -5.95   0.000   -17.48234   -8.356878
       sex |   -8.47549   5.944312    -1.43   0.171   -20.96403    4.013047
   sexdose |   2.176471   .8910901     2.44   0.025    .3043598    4.048581
    dosesq |   .6166667   .1731968     3.56   0.002    .2527937    .9805396
        bp |      -3.75   6.363656    -0.59   0.563   -17.11955    9.619545
     _cons |   77.45098   7.104701    10.90   0.000    62.52456     92.3774
---------------------------------------------------------------------------
. predict E, residuals
[Left: residuals (in minutes) against dose (in grams); right: normal QQ plot of the residuals]
. test sex bp
 ( 1)  sex = 0
 ( 2)  bp = 0

       F(  2,    18) =    1.19
            Prob > F =    0.3270
Multiple Regression II, Mar 5, 2004 - 4 -
Case Study
Model:
Time = Dose + Dose² + Sex · Dose + ε
. regress time dose sexdose dosesq
    Source |       SS       df       MS              Number of obs =      24
-----------+------------------------------          F(  3,    20) =   38.81
     Model |  4804.63916     3  1601.54639          Prob > F      =  0.0000
  Residual |  825.319178    20  41.2659589          R-squared     =  0.8534
-----------+------------------------------          Adj R-squared =  0.8314
     Total |  5629.95833    23  244.780797          Root MSE      =  6.4239

---------------------------------------------------------------------------
      time |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-----------+---------------------------------------------------------------
      dose |  -12.34823   2.154675    -5.73   0.000    -16.8428   -7.853653
   sexdose |   1.033708   .3931338     2.63   0.016    .2136452    1.853771
    dosesq |   .6166667   .1748353     3.53   0.002    .2519667    .9813667
     _cons |   71.33824   5.667294    12.59   0.000    59.51647       83.16
---------------------------------------------------------------------------
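Reading the fitted curves off the coefficients: for females (sex = 0) the
predicted time is approximately 71.34 − 12.35 · Dose + 0.62 · Dose²; for
males (sex = 1) the dose slope shifts by the interaction coefficient, giving
approximately 71.34 − 11.31 · Dose + 0.62 · Dose².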
. predict YH
(option xb assumed; fitted values)

. twoway line YH dose if sex==0 || line YH dose if sex==1,
>        legend(label(1 "female") label(2 "male"))
[Plot: fitted time (in minutes) vs. dose (in grams), separate curves for females and males]
Multiple Regression II, Mar 5, 2004 - 5 -
Comparing Several Means
Example: Comparison of laboratories
◦ Task: Measure amount of chlorpheniramine maleate in tablets
◦ Seven laboratories were asked to make 10 determinations of one tablet
◦ Study consistency between labs and variability of measurements
Box plot
[Box plots: amount of chlorpheniramine (in mg) by laboratory, Labs 1-7]
One-Way Analysis of Variance, Mar 8, 2004 - 1 -
Comparing Several Means
Example: Comparison of drugs
◦ Experimental study of drugs to relieve itching
◦ Five drugs were compared to a placebo and no drug
◦ Ten volunteer male subjects
◦ Each subject underwent one treatment per day (randomized order)
◦ Drug or placebo was given intravenously
◦ Itching was induced on forearms with cowage
◦ Subjects recorded duration of itching
Box plot
[Box plots: duration of itching (in sec) for No drug, Placebo, Papaverine, Morphine, Aminophylline, Pentobarbital, and Tripelennamine]
One-Way Analysis of Variance, Mar 8, 2004 - 2 -
Comparing Several Means
. infile amount lab using labs.txt
(70 observations read)

. graph box amount, over(lab)

. oneway amount lab, bonferroni tabulate

            |        Summary of amount
        lab |       Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |      4.062   .03259178          10
          2 |      3.997   .08969706          10
          3 |      4.003   .02311808          10
          4 |      3.920   .03333330          10
          5 |      3.957   .05716445          10
          6 |      3.955   .06704064          10
          7 |      3.998   .08482662          10
------------+------------------------------------
      Total |  3.9845715   .07184294          70
                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups       .1247371      6    .020789517      5.66     0.0001
 Within groups      .231400073    63    .003673017
------------------------------------------------------------------------
    Total           .356137173    69    .005161408
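As a check, the F statistic is the ratio of the mean squares:
F = .020789517/.003673017 ≈ 5.66, in agreement with the output.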
Bartlett’s test for equal variances: chi2(6) = 24.3697 Prob>chi2 = 0.000
                   Comparison of amount by lab
                          (Bonferroni)
Row Mean-|
Col Mean |          1          2          3          4          5          6
---------+------------------------------------------------------------------
       2 |      -.065
         |      0.408
         |
       3 |      -.059       .006
         |      0.698      1.000
         |
       4 |      -.142      -.077      -.083
         |      0.000      0.127      0.068
         |
       5 |      -.105       -.04      -.046       .037
         |      0.005      1.000      1.000      1.000
         |
       6 |      -.107      -.042      -.048       .035      -.002
         |      0.004      1.000      1.000      1.000      1.000
         |
       7 |      -.064       .001      -.005       .078       .041       .043
         |      0.448      1.000      1.000      0.115      1.000      1.000
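Each cell of the Bonferroni table shows the difference in means (row minus
column) above the Bonferroni-adjusted p-value; for example, lab 4 differs
significantly from lab 1 (3.920 − 4.062 = −.142, p = 0.000).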
One-Way Analysis of Variance, Mar 8, 2004 - 3 -
Comparing Several Means
. oneway duration drug, bonferroni tabulate
            |       Summary of duration
       drug |       Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |      191.0   54.861442          10
          2 |      204.8   105.723750         10
          3 |      118.2   52.809511          10
          4 |      148.0   44.738748          10
          5 |      144.3   42.076782          10
          6 |      176.5   68.856130          10
          7 |      167.2   67.499465          10
------------+------------------------------------
      Total |  164.28571   68.463709          70
                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      53012.8857     6    8835.48095      2.06     0.0708
 Within groups      270409.4      63    4292.2127
------------------------------------------------------------------------
    Total           323422.286    69    4687.2795
Bartlett’s test for equal variances: chi2(6) = 11.3828 Prob>chi2 = 0.077
                  Comparison of duration by drug
                          (Bonferroni)
Row Mean-|
Col Mean |          1          2          3          4          5          6
---------+------------------------------------------------------------------
       2 |       13.8
         |      1.000
         |
       3 |      -72.8      -86.6
         |      0.328      0.092
         |
       4 |        -43      -56.8       29.8
         |      1.000      1.000      1.000
         |
       5 |      -46.7      -60.5       26.1       -3.7
         |      1.000      0.904      1.000      1.000
         |
       6 |      -14.5      -28.3       58.3       28.5       32.2
         |      1.000      1.000      1.000      1.000      1.000
         |
       7 |      -23.8      -37.6         49       19.2       22.9       -9.3
         |      1.000      1.000      1.000      1.000      1.000      1.000
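No pairwise difference is significant after the Bonferroni adjustment (all
adjusted p-values are at least 0.092), consistent with the overall F test
(Prob > F = 0.0708).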
One-Way Analysis of Variance, Mar 8, 2004 - 4 -