Post on 19-Dec-2015
Lecture 10. Random Sampling and Sampling Distributions
David R. Merrell
90-786 Intermediate Empirical Methods for Public Policy and Management
Agenda
Normal approximation to the binomial
Poisson process
Random sampling
Sampling statistics and sampling distributions
Expected values and standard errors of sample sums and sample means
Binomial Random Variable
Binomial random variable X is the number of “successes” in n trials, where
Probability of success remains the same from trial to trial
Trials are independent
Binomial Probability Distribution
Discrete distribution with:
$P(X = x) = \dfrac{n!}{x!\,(n-x)!}\, p^x q^{n-x}$
n is the number of trials
x is the number of successes in n trials (x = 0, 1, 2, ..., n)
p is the probability of success on a single trial
q is the probability of failure on a single trial
Properties of the Binomial RV
Mean: $\mu = np$
Variance: $\sigma^2 = npq$
Standard Deviation: $\sigma = \sqrt{npq}$
Binomial(n = 10, p = .4)
 x   P(X=x)
 0   0.006047
 1   0.040311
 2   0.120932
 3   0.214991
 4   0.250823
 5   0.200658
 6   0.111477
 7   0.042467
 8   0.010617
 9   0.001573
10   0.000105
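The table above can be reproduced directly from the binomial formula. The following is a minimal Python sketch (the lecture itself uses Minitab, so the language and the function name `binomial_pmf` are illustrative assumptions). It also checks that the mean and variance computed from the probabilities agree with np and npq.

```python
from math import comb, sqrt

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for n independent trials with success probability p."""
    q = 1.0 - p
    return comb(n, x) * p**x * q**(n - x)

n, p = 10, 0.4
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed from the distribution itself
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

for x, px in enumerate(pmf):
    print(f"{x:2d}   {px:.6f}")
print(f"mean = {mean:.3f}   (np = {n * p:.3f})")
print(f"variance = {variance:.3f}   (npq = {n * p * (1 - p):.3f})")
print(f"standard deviation = {sqrt(variance):.3f}")
```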
Approximation to Binomial Distribution
Use normal distribution when:
n is large
np > 10
n(1 - p) > 10
Parameters of the approximating normal distribution are the mean and standard deviation from the binomial distribution
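As a quick illustration of the rule of thumb, the sketch below checks both conditions for two cases that appear in this lecture, Binomial(80, .4) and Binomial(10, .4). The function name and the default threshold argument are hypothetical, chosen only for this example.

```python
def normal_approximation_ok(n: int, p: float, threshold: float = 10.0) -> bool:
    """Rule of thumb: both np and n(1 - p) must exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold

# n = 80, p = .4: np = 32 and n(1 - p) = 48, so the rule is satisfied
large_case = normal_approximation_ok(80, 0.4)

# n = 10, p = .4: np = 4 and n(1 - p) = 6, so the rule is not satisfied
small_case = normal_approximation_ok(10, 0.4)

print(large_case, small_case)
```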
[Figure: Approximation of Binomial Distribution, n = 80, p = .4. Axes labeled C1 and C2; x values from 10 to 58, probabilities from 0.00 to 0.09.]
How Good is the Approximation?
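Quantity of interest: P(X < 29), that is, P(X <= 28)
Normal with mean = 32.0000 and standard deviation = 4.38000
      x    P( X <= x)
28.0000    0.1806
28.5000    0.2121
Binomial with n = 80 and p = 0.400000
    x    P( X <= x)
28.00    0.2131
Evaluating the normal distribution at 28.5 instead of 28 (a continuity correction) gives 0.2121, much closer to the exact binomial value of 0.2131.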
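The comparison can be checked without Minitab. This is a rough Python sketch, assuming only the standard library; `normal_cdf` and `binomial_cdf` are illustrative helper names, and the standard deviation is computed from npq rather than rounded to 4.38, so the last digit may differ slightly from the output shown.

```python
from math import comb, erf, sqrt

def normal_cdf(x: float, mean: float, sd: float) -> float:
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for a binomial random variable."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))

n, p = 80, 0.4
mean = n * p                    # 32
sd = sqrt(n * p * (1 - p))      # about 4.38

exact = binomial_cdf(28, n, p)
plain = normal_cdf(28.0, mean, sd)       # no continuity correction
corrected = normal_cdf(28.5, mean, sd)   # with continuity correction

print(f"exact binomial P(X <= 28)      = {exact:.4f}")
print(f"normal, evaluated at 28        = {plain:.4f}")
print(f"normal, evaluated at 28.5      = {corrected:.4f}")
```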
Application 1
The Chicago Equal Employment Commission believes that the Chicago Transit Authority (CTA) discriminates against Republicans. The records show that 37.5% of the individuals listed as passing the CTA exam were Republicans; the remainder were Democrats (no one registers as an independent in Illinois). CTA hired 30 people last year, and 25 of them were Democrats. What is the probability that this situation could arise if CTA did not discriminate?
Application 1 (cont.)
Success: a Republican is hired
The probability of success, p = 0.375
The number of trials, n = 30
The number of successes, x = 5
P(X <= 5) = ???
Application 1 (cont.)
Mean: $\mu = np$ = 30*.375 = 11.25
Variance: $\sigma^2 = npq$ = 30*.375*.625 = 7.03
Standard Deviation: $\sigma$ = 2.65
Normal with mean = 11.25 and standard deviation = 2.65
     x    P( X <= x)
5.5000    0.0150
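The calculation for Application 1 can be sketched in Python as follows. This assumes the standard library's `statistics.NormalDist`; the exact binomial tail is included only as a cross-check and is not part of the original slides.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, x = 30, 0.375, 5
q = 1 - p

mean = n * p              # 11.25
sd = sqrt(n * p * q)      # about 2.65

# Normal approximation to P(X <= 5), evaluated at 5.5 as in the Minitab output
approx = NormalDist(mean, sd).cdf(x + 0.5)

# Exact binomial tail probability, for comparison
exact = sum(comb(n, k) * p**k * q ** (n - k) for k in range(x + 1))

print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(f"normal approximation P(X <= 5) = {approx:.4f}")
print(f"exact binomial       P(X <= 5) = {exact:.4f}")
```

Either way the probability is small, which is the point of the example: hiring 5 or fewer Republicans out of 30 would be unusual if hiring were unrelated to party.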
Poisson Process
Assumptions
time homogeneity
independence
no clumping
[Diagram: events marked x along a time axis starting at 0, occurring at a given rate]
Poisson Process
Earthquakes strike randomly over time with a rate of $\lambda$ = 4 per year.
Model time of earthquake strike as a Poisson process
Count: How many earthquakes will strike in the next six months?
Duration: How long will it take before the next earthquake hits?
Count: Poisson Distribution
What is the probability that 3 earthquakes will strike during the next six months?
Poisson Distribution
Count in time period t
$P(Y = y) = \dfrac{e^{-\lambda t}(\lambda t)^{y}}{y!}, \quad y = 0, 1, \ldots$
where $\lambda$ is the rate of the process
Minitab Probability Calculation
Click: Calc > Probability Distributions > Poisson
Enter: For mean 2, input constant 3
Output:
Probability Density Function
Poisson with mu = 2.00000
   x    P( X = x)
3.00    0.1804
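The same probability can be computed from the Poisson formula. A minimal Python sketch, assuming a rate of 4 per year and t = 0.5 years so that the mean count is 2 (the function name `poisson_pmf` is illustrative):

```python
from math import exp, factorial

def poisson_pmf(y: int, rate: float, t: float) -> float:
    """P(Y = y) for the count of events in a period of length t."""
    mu = rate * t   # expected count in the period
    return exp(-mu) * mu**y / factorial(y)

# 4 earthquakes per year, six months = 0.5 years, so the mean is 2
prob_three = poisson_pmf(3, rate=4.0, t=0.5)
print(f"P(Y = 3) = {prob_three:.4f}")
```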
Duration: Exponential Distribution
Time between occurrences in a Poisson process
Continuous probability distribution
Mean = 1/($\lambda$t), where $\lambda$ is the rate of the process
Exponential Probability Problem
What is the probability that 9 months will pass with no earthquake?
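Working in months: t = 1/12 (one month, in years), $\lambda$t = 4 × 1/12 = 1/3, 1/($\lambda$t) = 3
The mean time between earthquakes is therefore 3 months.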
Minitab Probability Calculation
Click: Calc > Probability Distributions > Exponential
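Enter: For mean 3, input constant 9
Output:
Cumulative Distribution Function
Exponential with mean = 3.00000
     x    P( X <= x)
9.0000    0.9502
The probability that 9 months pass with no earthquake is the complement: P(X > 9) = 1 - 0.9502 = 0.0498.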
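As a sketch of the same calculation in Python (standard library only; `exponential_cdf` is an illustrative name), the cumulative probability and its complement can be computed directly. The complement equals the Poisson probability of zero earthquakes in 9 months, which is a useful cross-check.

```python
from math import exp

def exponential_cdf(x: float, mean: float) -> float:
    """P(X <= x) for an exponential waiting time with the given mean."""
    return 1.0 - exp(-x / mean)

mean_months = 3.0   # mean time between earthquakes, in months

p_within_9 = exponential_cdf(9.0, mean_months)   # an earthquake within 9 months
p_none_in_9 = 1.0 - p_within_9                   # no earthquake for 9 months

# Cross-check: Poisson count of zero when the expected count is 9 / 3 = 3
p_zero_count = exp(-9.0 / mean_months)

print(f"P(X <= 9) = {p_within_9:.4f}")
print(f"P(X > 9)  = {p_none_in_9:.4f}")
print(f"Poisson P(Y = 0) over 9 months = {p_zero_count:.4f}")
```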
Exponential Probability Density Function
MTB > set c1
DATA > 0:12000
DATA > end
MTB > let c1 = c1/1000
Click: Calc > Probability distributions > Exponential > Probability density > Input column
Enter: Input column c1 > Optional storage c2
Click: OK > Graph > Plot
Enter: Y c2 > X c1
Click: Display > Connect > OK
[Figure: Exponential Probability Density Function. C2 (density, 0.0 to 0.3) plotted against C1 (0 to 10).]
Sampling
Population - entire set of objects that we are interested in studying
Sample - a chosen subset of a population
Some Samples Are ...
random -- each item in the population has an equal chance of being selected to be part of the sample
representative -- has the same characteristics as the population under study, a microcosm of the population
Population Parameters and Sample Statistics
Population Parameter
Numerical descriptor of a population
Values usually uncertain
e.g., population mean ($\mu$), population standard deviation ($\sigma$)
Sample Statistics
Numerical descriptor of a sample
Calculated from observations in the sample
e.g., sample mean ($\bar{X}$), sample standard deviation ($S_X$)
What is a sampling distribution?
Sample statistics are random variables
Sample statistics have probability distributions
“Sampling distribution” is the probability distribution of a sample statistic
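To make the idea concrete, the sketch below simulates a sampling distribution: it draws many random samples from a population and records the sample mean of each. The population here (exponential with mean 3), the sample size, and the number of samples are all hypothetical choices for illustration, not values from the lecture.

```python
import random
from statistics import mean, stdev

random.seed(0)   # fixed seed so the sketch is repeatable

POP_MEAN = 3.0       # hypothetical population: exponential with mean 3 (so sd is also 3)
SAMPLE_SIZE = 50
NUM_SAMPLES = 2000

def draw_sample_mean() -> float:
    """Draw one random sample and return its sample mean."""
    sample = [random.expovariate(1.0 / POP_MEAN) for _ in range(SAMPLE_SIZE)]
    return mean(sample)

# Each entry is one realization of the sample statistic
sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

center = mean(sample_means)     # should be near the population mean
spread = stdev(sample_means)    # should be near sd / sqrt(n)
theoretical_se = POP_MEAN / SAMPLE_SIZE ** 0.5

print(f"mean of sample means = {center:.3f}")
print(f"sd of sample means   = {spread:.3f}")
print(f"sd / sqrt(n)         = {theoretical_se:.3f}")
```

The list `sample_means` is an empirical picture of the sampling distribution of the sample mean; a histogram of it would show how the statistic varies from sample to sample.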
MTB > Retrieve 'C:\MTBWIN\DATA\RESTRNT.MTW'.
Retrieving worksheet from file: C:\MTBWIN\DATA\RESTRNT.MTW
Worksheet was saved on 5/31/1994
MTB > info
Information on the Worksheet
Column  Name      Count  Missing
C1      ID          279        0
C2      OUTLOOK     279        1
C3      SALES       279       25
C4      NEWCAP      279       55
C5      VALUE       279       39
C6      COSTGOOD    279       42
C7      WAGES       279       44
C8      ADS         279       44
C9      TYPEFOOD    279       12
C10     SEATS       279       11
C11     OWNER       279       10
C12     FT.EMPL     279       14
C13     PT.EMPL     279       13
C14     SIZE        279       16
MTB > desc 'sales'
Descriptive Statistics
Variable    N   N*   Mean   Median  TrMean  StDev  SEMean
SALES     254   25  332.6    200.0   248.9  650.5    40.8
Variable  Min     Max    Q1     Q3
SALES     0.0  8064.0  83.7  382.7
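The SEMean column is the standard error of the sample mean, which is the sample standard deviation divided by the square root of the number of non-missing observations. A small Python sketch using the figures from the output above (the function name is illustrative):

```python
from math import sqrt

def standard_error(sample_sd: float, n: int) -> float:
    """Standard error of the sample mean: s / sqrt(n)."""
    return sample_sd / sqrt(n)

# SALES: StDev = 650.5 with N = 254 non-missing observations
se_sales = standard_error(650.5, 254)
print(f"SEMean for SALES = {se_sales:.1f}")
```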
MTB > boxp 'sales'
* NOTE * N missing = 25
[Figure: boxplot of SALES, axis from 0 to 8000]
MTB > hist 'sales'
* NOTE * N missing = 25
[Figure: histogram of SALES; horizontal axis SALES from 0 to 8000, vertical axis Frequency from 0 to 200]
MTB > let c15 = loge('sales')
MTB > let c15 = loge('sales') J
*** Values out of bounds during operation at J
Missing returned 1 times
MTB > let c15 = loge('sales' + 1)
MTB > name c15 'logsales'
MTB > desc 'logsales'
Descriptive Statistics
Variable    N   N*    Mean  Median  TrMean   StDev  SEMean
logsales  254   25  5.1830  5.3033  5.2134  1.1387  0.0715
Variable     Min     Max      Q1      Q3
logsales  0.0000  8.9953  4.4394  5.9500
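The first attempt at the transformation fails because one restaurant has sales of 0 and the natural log of 0 is undefined; adding 1 before taking the log avoids this. The Python sketch below illustrates the point using the five summary values of SALES reported earlier (Min, Q1, Median, Q3, Max) as stand-in data, since the worksheet itself is not reproduced here. `safe_log` is a hypothetical helper.

```python
from math import log, log1p

# Min, Q1, Median, Q3, Max from the SALES summary, used as stand-in data
sales = [0.0, 83.7, 200.0, 382.7, 8064.0]

def safe_log(value: float):
    """Natural log, or None where the log is undefined (value <= 0)."""
    try:
        return log(value)
    except ValueError:
        return None

plain = [safe_log(v) for v in sales]    # log(0) is undefined, so the first entry is None
shifted = [log1p(v) for v in sales]     # log(sales + 1) is defined for every value

print(plain[0])
print(f"{shifted[0]:.4f}")    # log(0 + 1), the Min of logsales
print(f"{shifted[-1]:.4f}")   # log(8064 + 1), the Max of logsales
```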
MTB > boxp 'logsales'
* NOTE * N missing = 25
[Figure: boxplot of logsales, axis from 0 to 9]
[Figure: histogram of logsales; horizontal axis logsales from 0 to 9, vertical axis Frequency from 0 to 90]
Four Samples of Size 50 From Restaurant “Logsales” Data--Histograms
[Figure: four histograms, one per sample, in panels C16, C17, C18 and C19; vertical axis Frequency]
MTB > Desc c16-c19
Descriptive Statistics
Variable   N  N*   Mean  Median  TrMean  StDev  SEMean
C16       43   7  5.246   5.375   5.280  0.867   0.132
C17       43   7  5.351   5.352   5.383  1.223   0.186
C18       48   2  5.366   5.461   5.388  0.888   0.128
C19       43   7  5.244   5.198   5.253  0.937   0.143
Variable    Min    Max     Q1     Q3
C16       2.773  6.621  4.625  5.787
C17       1.099  8.456  4.710  6.176
C18       2.485  7.091  4.961  5.994
C19       3.434  6.868  4.595  6.089
Random Samples from Restaurant “Logsales” Data--Summary
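One way to read this summary is to ask how far each sample mean falls from the mean of the full logsales data (5.1830), measured in units of that sample's standard error. The sketch below does this with the figures from the Minitab output. Treating the full-data mean as the population mean is an assumption made here for illustration only.

```python
from math import sqrt

FULL_MEAN = 5.1830   # mean of logsales over all 254 non-missing observations

# (column, N, sample mean, sample standard deviation) from the summary above
samples = [
    ("C16", 43, 5.246, 0.867),
    ("C17", 43, 5.351, 1.223),
    ("C18", 48, 5.366, 0.888),
    ("C19", 43, 5.244, 0.937),
]

results = []
for name, n, xbar, s in samples:
    se = s / sqrt(n)                    # standard error of this sample mean
    distance = (xbar - FULL_MEAN) / se  # distance from the full-data mean, in SE units
    results.append((name, se, distance))
    print(f"{name}: SE = {se:.3f}, distance = {distance:.2f} standard errors")
```

The four sample means differ from one another and from the full-data mean, which is exactly the sample-to-sample variation that a sampling distribution describes.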
Next Time ...
Central Limit Theorem--“Sample averages are approximately normally distributed”