ch7. sampling and sampling distributions - sogangocw.sogang.ac.kr/rfile/2014/business...
CH7. Sampling and Sampling Distributions
Sampling variation
• (Sample) statistic – a random variable whose value depends on which
population items happen to be included in the random sample (independent &
identical).
• Depending on the sample size, the sample statistic could either represent the
population well or differ greatly from the population
• Ex) Consider eight random samples of size n = 5 from a large population of
GMAT scores for MBA applicants.
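The sampling variation described above can be illustrated with a short simulation. This is only a sketch: the population below is synthetic (the mean of 520 and standard deviation of 86 are made-up values, not data from the lecture), and `sample_means` is a hypothetical helper name.

```python
import random
import statistics

def sample_means(population, n, num_samples, seed=0):
    """Draw num_samples simple random samples of size n; return their means."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(num_samples)]

# Hypothetical population of 10,000 GMAT-like scores (parameters are assumptions).
_rng = random.Random(42)
population = [_rng.gauss(520, 86) for _ in range(10_000)]

# Eight samples of size n = 5: each sample mean differs from the others
# and from the population mean.
means = sample_means(population, n=5, num_samples=8)
for k, m in enumerate(means, start=1):
    print(f"sample {k}: mean = {m:.1f}")
print(f"population mean = {statistics.mean(population):.1f}")
```

With samples this small the eight means typically scatter widely around the population mean; rerunning with a larger `n` shrinks the scatter.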
Point Estimator
• Estimator – a statistic (\(\hat{\theta}\)) derived from a sample to infer the value of a
population parameter (\(\theta\)).
• Estimate – the specific value of the estimator in a particular sample.
• Population parameters are represented by Greek letters and the
corresponding statistic by Roman letters.
Ex) sample mean, sample proportion, sample standard deviation
Sampling distribution
• The sampling distribution of an estimator is the probability distribution of all
possible values the statistic may assume when a random sample of size n is
taken.
• An estimator is a random variable since samples vary.
• Sampling Error = \(\hat{\theta} - \theta\) (the difference between an estimate and the true parameter)
• We want to have an estimator minimizing the sampling error.
Unbiased
• Bias is the difference between the expected value of the estimator and the
true parameter.
• Bias = \(E(\hat{\theta}) - \theta\)
• An estimator is unbiased if \(E(\hat{\theta}) = \theta\).
• On average, an unbiased estimator neither overstates nor understates the
true parameter.
Efficiency (among unbiased estimators)
• Efficiency refers to the variance of the estimator’s sampling distribution.
• A more efficient estimator has smaller variance.
• If \(\hat{\theta}_1\) and \(\hat{\theta}_2\) are unbiased estimators with \(V(\hat{\theta}_1) < V(\hat{\theta}_2)\), then \(\hat{\theta}_1\) is a more
efficient estimator than \(\hat{\theta}_2\).
Better
• If \(\hat{\theta}_1\) and \(\hat{\theta}_2\) are estimators with \(MSE(\hat{\theta}_1) < MSE(\hat{\theta}_2)\), then \(\hat{\theta}_1\) is a better estimator
than \(\hat{\theta}_2\).
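As a rough illustration of bias, efficiency, and MSE, the sketch below compares two estimators of a normal population's mean: the sample mean and the sample median. The population parameters, sample size, and function names are assumptions chosen for the illustration. For any set of simulated estimates, the empirical MSE equals the empirical variance plus the squared bias.

```python
import random
import statistics

def simulate_estimators(mu=0.0, sigma=1.0, n=25, reps=2000, seed=1):
    """Return the sample means and sample medians of `reps` normal samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.mean(sample))
        medians.append(statistics.median(sample))
    return means, medians

def bias_var_mse(estimates, theta):
    """Empirical bias, variance, and MSE of estimates of the parameter theta."""
    bias = statistics.mean(estimates) - theta
    var = statistics.pvariance(estimates)
    mse = sum((e - theta) ** 2 for e in estimates) / len(estimates)
    return bias, var, mse

means, medians = simulate_estimators()
bias_mean, var_mean, mse_mean = bias_var_mse(means, theta=0.0)
bias_med, var_med, mse_med = bias_var_mse(medians, theta=0.0)
print(f"mean:   bias={bias_mean:+.4f} var={var_mean:.4f} mse={mse_mean:.4f}")
print(f"median: bias={bias_med:+.4f} var={var_med:.4f} mse={mse_med:.4f}")
```

Both estimators show a bias near zero for this symmetric population, but the sample mean has the smaller variance, so it is the more efficient of the two here.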
Consistency
• A consistent estimator converges toward the parameter being estimated as
the sample size increases (as \(n \to \infty\), \(\hat{\theta} \to \theta\)).
Central limit theorem
• Central limit theorem for a mean: If a random sample of size n is drawn from
a population with mean \(\mu\) and standard deviation \(\sigma\), the distribution of
the sample mean \(\bar{x}\) approaches a normal distribution with mean \(\mu\) and
standard deviation \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\) as the sample size increases.
• The central limit theorem does not require the population itself to follow a
normal distribution.
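The theorem can be checked numerically with a skewed population. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1), which is an illustrative choice and not part of the lecture; `simulate_sample_means` is a hypothetical helper.

```python
import random
import statistics
from math import sqrt

def simulate_sample_means(n, reps, seed=0):
    """Means of `reps` samples of size n from an Exponential(1) population."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

means = simulate_sample_means(n=30, reps=3000)
center = statistics.mean(means)    # should be close to mu = 1
spread = statistics.stdev(means)   # should be close to sigma / sqrt(n)
predicted = 1.0 / sqrt(30)
print(f"mean of sample means = {center:.3f}")
print(f"sd of sample means   = {spread:.3f} (sigma/sqrt(n) = {predicted:.3f})")
```

Even though the population is far from normal, the sample means cluster around the population mean with a spread close to \(\sigma/\sqrt{n}\).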
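Sample mean (\(\bar{X}\))
• The sample mean is an unbiased estimator of \(\mu\); therefore,
\[ E(\bar{X}) = E(X) = \mu \]
• The standard deviation (standard error) of the sample mean is:
\[ \sigma_{\bar{x}} = \sqrt{V(\bar{X})} = \sigma/\sqrt{n} \]
• If \(\sigma\) is unknown, use the following:
\[ S_{\bar{x}} = S/\sqrt{n} \]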
• The sample mean is a consistent estimator.
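Mean squared error (MSE)
• The MSE of an estimator splits into its variance plus its squared bias:
\[ MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = E[\{(\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)\}^2] \]
\[ = E[\{\hat{\theta} - E(\hat{\theta})\}^2 + 2\{\hat{\theta} - E(\hat{\theta})\}\{E(\hat{\theta}) - \theta\} + \{E(\hat{\theta}) - \theta\}^2] \]
\[ = E[\{\hat{\theta} - E(\hat{\theta})\}^2] + \{E(\hat{\theta}) - \theta\}^2 = V(\hat{\theta}) + (\text{bias})^2 \]
The cross term vanishes because \(E[\hat{\theta} - E(\hat{\theta})] = 0\).
Sample variance (\(S^2\))
• Is \(S^2\) unbiased, that is, does \(E(S^2) = \sigma^2\) hold?
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
• Rewrite the sum of squares around \(\mu\):
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} \{(X_i - \mu) - (\bar{X} - \mu)\}^2 \]
\[ = \sum_{i=1}^{n} [(X_i - \mu)^2 - 2(X_i - \mu)(\bar{X} - \mu) + (\bar{X} - \mu)^2] \]
\[ = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2 \]
The last step uses \(\sum_{i=1}^{n} (X_i - \mu) = n(\bar{X} - \mu)\).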
Sampling distribution of \(\bar{x}\)
• If we use a large (n > 30) simple random sample, the central limit theorem
enables us to conclude that the sampling distribution of \(\bar{x}\) can be
approximated by a normal distribution.
• When the simple random sample is small (n < 30), the sampling distribution
of \(\bar{x}\) can be considered normal only if we assume the population has a
normal distribution.
• Ex) The average price, \(\mu\), of a 5 GB MP3 player is $80.00 with a standard
deviation, \(\sigma\), equal to $10.00. We have a sample of 20 players.
\(E(\bar{X}) = E(X) = \mu\) = $80.00
\(\sigma_{\bar{x}} = \sigma/\sqrt{n} = 10/\sqrt{20} = 2.236\)
1) If the distribution of prices for these players is a normal distribution,
then what is the sampling distribution of \(\bar{x}\)? \(\bar{X} \sim N(80.00,\ 2.236^2)\), i.e. normal with mean 80.00 and standard deviation 2.236.
2) Compute \(P(80 < \bar{X} < 83)\) under the normal distribution assumption.
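Part 2) of the example can be worked out numerically. A minimal sketch using Python's standard library, where `NormalDist` takes the standard deviation (not the variance) as its second argument; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 80.0, 10.0, 20          # values from the MP3 player example
se = sigma / sqrt(n)                   # standard error of the sample mean
xbar_dist = NormalDist(mu, se)         # sampling distribution of the sample mean

# P(80 < X-bar < 83) = F(83) - F(80)
p = xbar_dist.cdf(83) - xbar_dist.cdf(80)
print(f"standard error = {se:.3f}")
print(f"P(80 < X-bar < 83) = {p:.4f}")
```

The same result follows by hand from \(z = (83 - 80)/2.236 \approx 1.34\) and a standard normal table, giving a probability of roughly 0.41.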
Sample proportion (p)
• X ~ Binomial(n, π), where X is the number of successes
• X = Y1 + Y2 + … + Yn, Yi ~ Bernoulli(π)
• p = X/n (a proportion is the mean of data whose only values are 0 and 1).
• E(p) = π and V(p) = π(1 - π)/n
Sampling distribution of p
• The sampling distribution of p can be approximated by a normal distribution
whenever the sample size is large.
p~ N(π, π(1- π)/n)
• (Rule of Thumb) The sample size is considered large whenever these
conditions are satisfied:
n π > 5 and n (1- π) > 5
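The rule of thumb and the moments of p can be sketched in a few lines. The values n = 50 and π = 0.3 are hypothetical, and the helper names are assumptions made for this illustration.

```python
import random

def normal_approx_ok(n, pi):
    """Rule of thumb from the notes: n*pi > 5 and n*(1 - pi) > 5."""
    return n * pi > 5 and n * (1 - pi) > 5

def simulate_proportions(n, pi, reps, seed=11):
    """Sample proportions p = X/n, with X a sum of n Bernoulli(pi) draws."""
    rng = random.Random(seed)
    return [sum(rng.random() < pi for _ in range(n)) / n for _ in range(reps)]

n, pi = 50, 0.3                      # hypothetical values
props = simulate_proportions(n, pi, reps=2000)
mean_p = sum(props) / len(props)
var_p = sum((p - mean_p) ** 2 for p in props) / len(props)
print(f"rule of thumb satisfied: {normal_approx_ok(n, pi)}")
print(f"simulated E(p) = {mean_p:.4f} (theory {pi})")
print(f"simulated V(p) = {var_p:.5f} (theory {pi * (1 - pi) / n:.5f})")
```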
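Expected value of the sample variance (\(S^2\))
• Since \(E(X_i - \mu)^2 = \sigma^2\) and \(E(\bar{X} - \mu)^2 = V(\bar{X}) = \sigma^2/n\):
\[ E(S^2) = \frac{1}{n-1} E\Big[\sum_{i=1}^{n} (X_i - \bar{X})^2\Big] = \frac{1}{n-1} E\Big[\sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2\Big] \]
\[ = \frac{1}{n-1} \Big[n\sigma^2 - n \cdot \frac{\sigma^2}{n}\Big] = \sigma^2 \]
• Hence \(S^2\) is an unbiased estimator of \(\sigma^2\).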
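A quick simulation makes the role of the divisor n - 1 visible. This sketch assumes a normal population with \(\sigma = 2\) (so \(\sigma^2 = 4\)) and samples of size 5; both values and the function name are illustrative assumptions.

```python
import random

def average_variance_estimates(sigma=2.0, n=5, reps=20000, seed=3):
    """Average the (n-1)-divisor and n-divisor variance estimates over many samples."""
    rng = random.Random(seed)
    total_unbiased = 0.0
    total_biased = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)   # sum of squared deviations
        total_unbiased += ss / (n - 1)
        total_biased += ss / n
    return total_unbiased / reps, total_biased / reps

avg_s2, avg_biased = average_variance_estimates()
print(f"average with divisor n-1: {avg_s2:.3f} (sigma^2 = 4)")
print(f"average with divisor n:   {avg_biased:.3f} (expected (n-1)/n * sigma^2 = 3.2)")
```

Dividing by n systematically understates the variance by the factor (n - 1)/n, while dividing by n - 1 averages out to \(\sigma^2\).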
CH8. Interval Estimation
Confidence level
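• Assume that \(X \sim N(\mu, \sigma^2)\) and \(\sigma^2\) is known. Then \(\bar{X} \sim N(\mu, \sigma^2/n)\) and
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]
• The confidence level (usually expressed as \(100(1-\alpha)\)%) is the area under the curve
of the sampling distribution between the two critical values.
\[ 1 - \alpha = P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = P\Big(-Z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < Z_{\alpha/2}\Big) \]
\[ = P\Big(-Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\Big) \]
We can rewrite this in terms of \(\mu\): if \(\alpha = 0.05\) (or \(1 - \alpha = 0.95\)), then
\[ \bar{X} - 1.96\,\sigma/\sqrt{n} < \mu < \bar{X} + 1.96\,\sigma/\sqrt{n} \]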
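Confidence interval for population mean \(\mu\) with known \(\sigma\)
• Assume that \(X \sim N(\mu, \sigma^2)\) and \(\sigma^2\) is known. Then \(\bar{X} \sim N(\mu, \sigma^2/n)\).
• A sample mean \(\bar{x}\) is a point estimate of the population mean \(\mu\).
• A confidence interval for the population mean is a range
\[ \bar{x} - z_0 \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_0 \frac{\sigma}{\sqrt{n}} \]
where \(z_0 = Z_{\alpha/2}\).
• Other expressions are \(\big(\bar{x} - z_0 \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_0 \frac{\sigma}{\sqrt{n}}\big)\) or \(\bar{x} \pm z_0 \frac{\sigma}{\sqrt{n}}\).
Interpretation
• A confidence interval either does or does not contain \(\mu\).
• Out of 100 confidence intervals for \(\mu\) built at the 95% level, approximately 95 would contain \(\mu\), while
approximately 5 would not.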
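The interpretation can be checked by simulation: build many 95% intervals from independent samples and count how many contain \(\mu\). The sketch assumes a normal population with \(\mu = 80\) and \(\sigma = 10\) (borrowed from the MP3 player example) and samples of size 20; `coverage` is a hypothetical function name.

```python
import random
from math import sqrt

def coverage(mu=80.0, sigma=10.0, n=20, z0=1.96, reps=2000, seed=7):
    """Fraction of intervals x-bar +/- z0*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    half_width = z0 * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / reps

rate = coverage()
print(f"share of intervals containing mu: {rate:.3f}")
```

Each individual interval either contains \(\mu\) or misses it; the 95% figure describes the long-run share of intervals that succeed.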
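Confidence interval for population mean \(\mu\) with unknown \(\sigma\)
• Use the Student’s t distribution instead of the normal distribution when the population is
normal but the standard deviation \(\sigma\) is unknown.
• The confidence interval for \(\mu\) (unknown \(\sigma\)) is
\[ \bar{x} - t_0 \frac{s}{\sqrt{n}} < \mu < \bar{x} + t_0 \frac{s}{\sqrt{n}} \]
(or \(\big(\bar{x} - t_0 \frac{s}{\sqrt{n}},\ \bar{x} + t_0 \frac{s}{\sqrt{n}}\big)\), or \(\bar{x} \pm t_0 \frac{s}{\sqrt{n}}\)).
Here \(t_0 = t_{n-1}(\alpha/2)\).
Note: If \(X \sim N(\mu, \sigma^2)\), then
\[ t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1} \]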
Degrees of freedom in the Student’s t distribution
• Degrees of Freedom (d.f.) is a parameter based on the sample size that is used to
determine the value of the t statistic.
• Degrees of freedom tell how many observations are used to calculate s, less the number
of intermediate estimates used in the calculation. d.f. = n - 1
• As n increases, the t distribution approaches the shape of the normal distribution.
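Ex) Construct a 90% C.I. for the mean GMAT score of all MBA applicants.
\(\bar{x} = 510\), \(s = 73.77\)
Since \(\sigma\) is unknown, use the Student’s t for the C.I. with d.f. = n – 1 = 20 – 1 = 19.
First find \(t_0\) from the Student’s t distribution: \(t_0 = 1.729\).
The 90% confidence interval is:
\[ \bar{x} - 1.729 \frac{s}{\sqrt{n}} < \mu < \bar{x} + 1.729 \frac{s}{\sqrt{n}} \]
that is, \(510 \pm 1.729\,(73.77/\sqrt{20})\), or about \(481.48 < \mu < 538.52\).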
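The arithmetic of the GMAT example can be reproduced as follows. This is a small sketch: `t_interval` is a hypothetical helper, and the critical value 1.729 is passed in as read from a t table rather than computed.

```python
from math import sqrt

def t_interval(xbar, s, n, t0):
    """Confidence interval x-bar +/- t0 * s / sqrt(n) for a mean with unknown sigma."""
    half_width = t0 * s / sqrt(n)
    return xbar - half_width, xbar + half_width

# Values from the example: x-bar = 510, s = 73.77, n = 20, t0 = 1.729 (19 d.f., 90%).
lower, upper = t_interval(xbar=510, s=73.77, n=20, t0=1.729)
print(f"90% C.I. for the mean: ({lower:.2f}, {upper:.2f})")
```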
Confidence Interval for a population proportion (\(\pi\))
• The confidence interval for \(\pi\) might be
\[ p \pm z_0 \sqrt{\frac{\pi(1-\pi)}{n}} \]
• Since \(\pi\) is unknown, the confidence interval for \(\pi\) based on p = x/n (assuming a large sample) is
\[ p \pm z_0 \sqrt{\frac{p(1-p)}{n}} \]
Ex) A sample of 75 retail in-store purchases showed that 24 were paid in cash.
What is p? p = x/n = 24/75 = .32
Is p normally distributed? np = (75)(.32) = 24, n(1-p) = (75)(.68) = 51. Both exceed 5, so the normal approximation is reasonable.
The 95% confidence interval for the proportion of retail in-store purchases that are paid in
cash is:
\[ p \pm z_0 \sqrt{\frac{p(1-p)}{n}} = .32 \pm 1.96 \sqrt{\frac{.32(1-.32)}{75}} \]
Therefore, \(0.214 < \pi < 0.426\)
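The cash-purchase example can be verified with a short calculation. In this sketch `proportion_interval` is a hypothetical helper, and z0 = 1.96 is the table value for 95% confidence.

```python
from math import sqrt

def proportion_interval(x, n, z0=1.96):
    """Large-sample interval p +/- z0 * sqrt(p(1-p)/n), with p = x/n."""
    p = x / n
    half_width = z0 * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# 24 cash purchases out of 75, as in the example.
lower, upper = proportion_interval(x=24, n=75)
print(f"95% C.I. for the proportion: ({lower:.3f}, {upper:.3f})")
```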
Confidence interval width
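• Example: for the interval \(\bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\), the width is \(2 Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\).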
• Confidence interval width reflects
- the sample size,
- the confidence level and
- the standard deviation.
• To obtain a narrower interval and more precision
- increase the sample size
- lower the confidence level (e.g., from 90% to 80% confidence)?
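The effect of each factor on the width can be tabulated. A minimal sketch for the known-\(\sigma\) interval, assuming \(\sigma = 10\) for illustration; `ci_width` is a hypothetical helper that obtains the critical value from the standard library's `NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(sigma, n, confidence):
    """Width 2 * Z_{alpha/2} * sigma / sqrt(n) of the known-sigma interval."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 critical value
    return 2 * z * sigma / sqrt(n)

for n in (20, 80):
    for confidence in (0.80, 0.90, 0.95):
        print(f"n={n:3d} confidence={confidence:.2f} width={ci_width(10, n, confidence):.3f}")
```

Quadrupling the sample size halves the width, while lowering the confidence level narrows the interval at the cost of being wrong more often.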
Sample size determination for a mean
• To estimate a population mean with a precision of ± E (allowable error), you would
need a sample of size
\[ n = \Big(\frac{z_0 \sigma}{E}\Big)^2 \]
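As a worked instance of the formula, the sketch below assumes \(\sigma = 10\), an allowable error of E = 2, and 95% confidence (z0 = 1.96); these numbers are hypothetical. Rounding the result up to the next whole observation is a common convention and is an assumption here, not something stated in the notes.

```python
from math import ceil

def sample_size_for_mean(sigma, error, z0=1.96):
    """n = (z0 * sigma / E)^2, rounded up to a whole observation."""
    return ceil((z0 * sigma / error) ** 2)

n_required = sample_size_for_mean(sigma=10, error=2)
print(f"required sample size: {n_required}")   # (1.96 * 10 / 2)^2 = 96.04, rounded up
```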
Sample size determination for a proportion
• To estimate a population proportion with a precision of ± E (allowable error), you
would need a sample of size
\[ n = \Big(\frac{z_0}{E}\Big)^2 \pi(1-\pi) \]
• How to estimate \(\pi\)?
Method 1: Take a Preliminary Sample
Take a small preliminary sample and use the sample proportion (p) in
place of \(\pi\) in the sample size formula.
Method 2: Use a Prior Sample or Historical Data
How often are such samples available? \(\pi\) might be different enough to
make it a questionable assumption.
Method 3: Assume that \(\pi\) = .50
This conservative method ensures the desired precision. However, the
sample may end up being larger than necessary.
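The difference between the methods can be seen by plugging two planning values into the sample size formula. In this sketch the allowable error E = 0.05 and 95% confidence are hypothetical choices, .32 stands in for a preliminary-sample proportion, and rounding up is an assumed convention.

```python
from math import ceil

def sample_size_for_proportion(pi, error, z0=1.96):
    """n = (z0 / E)^2 * pi * (1 - pi), rounded up to a whole observation."""
    return ceil((z0 / error) ** 2 * pi * (1 - pi))

n_conservative = sample_size_for_proportion(pi=0.50, error=0.05)   # Method 3
n_preliminary = sample_size_for_proportion(pi=0.32, error=0.05)    # Method 1 style
print(f"pi = .50: n = {n_conservative}")
print(f"pi = .32: n = {n_preliminary}")
```

Because \(\pi(1-\pi)\) is largest at \(\pi\) = .50, that choice never yields too small a sample, which is why it is the conservative option.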
Chi-square distribution
• If the population is normal, then the scaled sample variance \((n-1)s^2/\sigma^2\)
follows the chi-square distribution (\(\chi^2\)) with degrees of freedom v = n – 1.
• Lower (left) (\(\chi^2_L\)) and upper (right) (\(\chi^2_U\)) tail percentiles for the chi-square
distribution can be found using the chi-square distribution table.
Confidence interval for a population variance (\(\sigma^2\))
• Using the sample variance \(s^2\), the confidence interval is
\[ \frac{(n-1)s^2}{\chi^2_U} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_L} \]
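To illustrate the formula, the sketch below reuses the GMAT figures (s = 73.77, n = 20) for a 90% interval; applying them to the variance is an illustration added here, not an example from the notes. The critical values for 19 degrees of freedom (10.117 and 30.144) are taken from a standard chi-square table, and `variance_interval` is a hypothetical helper.

```python
def variance_interval(s, n, chi2_lower, chi2_upper):
    """Interval (n-1)s^2/chi2_U < sigma^2 < (n-1)s^2/chi2_L."""
    numerator = (n - 1) * s ** 2
    return numerator / chi2_upper, numerator / chi2_lower

# Table values for 19 d.f.: lower 5% point 10.117, upper 5% point 30.144.
lower, upper = variance_interval(s=73.77, n=20, chi2_lower=10.117, chi2_upper=30.144)
print(f"90% C.I. for the variance: ({lower:.1f}, {upper:.1f})")
print(f"90% C.I. for the std dev:  ({lower ** 0.5:.2f}, {upper ** 0.5:.2f})")
```

Note that the upper-tail critical value goes in the denominator of the lower limit, so the interval is not symmetric around \(s^2\).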
Summary
\((1-\alpha)100\)% Confidence Interval Formulas
A. One population case:
1. For \(\mu\) (population mean):
1) Population is a normal distribution. Population variance \(\sigma^2\) is known.
\[ \bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]
2) Population is a normal distribution. Population variance \(\sigma^2\) is unknown.
\[ \bar{x} \pm t_{(n-1),\,\alpha/2} \frac{s}{\sqrt{n}} \]
where ‘s’ is the sample standard deviation (\(s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2/(n-1)\)).
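3) Population distribution is not known, but the sample is large. (Approximate version)
(i) Population variance \(\sigma^2\) is known.
\[ \bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]
(ii) Population variance \(\sigma^2\) is unknown.
\[ \bar{x} \pm Z_{\alpha/2} \frac{s}{\sqrt{n}} \]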
2. For \(\pi\) (population proportion):
Under the normal approximation of the binomial distribution:
\[ p \pm Z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}} \]
where p is the sample proportion.
3. For \(\sigma^2\) (population variance):
Population is a normal distribution.
\[ \frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}} \]
<Chi-square critical values>