chapters1n2

Upload: sharon-huang-zihang

Post on 04-Apr-2018


  • 7/29/2019 chapters1n2

    1/22

    Chapter 1

    Elements of the sampling problem

    1.1 Introduction

    Often we are interested in some characteristics of a finite population, e.g. the average income of last year's graduates from HKUST, or the unemployment rate of last quarter in Hong Kong. Since the population is usually very large, we would like to say something (i.e. make inferences) about the population by collecting and analyzing only a part of it.

    The principles and methods of collecting and analyzing data from a finite population form a branch of statistics known as Sample Survey Methods. The theory involved is called Sampling Theory. Sample surveys are widely used in many areas such as agriculture, education, industry, social affairs, and medicine.

    1.2 Some technical terms

    An element is an object on which a measurement is taken.

    A population is a collection of elements about which we require information.

    Population characteristic: this is the aspect of the population we wish to measure, e.g. the average income of last year's graduates from HKUST, or the total wheat yield of all farmers in a certain country.

    Sampling units are nonoverlapping collections of elements from the population.

    Sampling units may be the individual members of the population, or they may be a coarser subdivision of the population, e.g. a household, which may contain more than one individual member.

    A frame is a list of sampling units, e.g., telephone directory.

    A sample is a collection of sampling units drawn from a frame or frames.

    1.3 Why sampling?

    If the sample is equal to the population, we have a census, which contains all the information one wants. However, a census is rarely conducted, for several reasons:


    (Margin notes: the central limit theorem is difficult to apply here because of the dependence between draws; e.g. for HKUST graduates, a student is an element and the graduates form the population; sometimes the units can be households or groups; a frame orders the elements, e.g. a list of names or student IDs; e.g. a sample of 100 from a population of 2000.)


    cost (money is limited),

    time (time is limited),

    destructiveness (testing a product can be destructive, e.g. light bulbs),

    accessibility (non-response can be a serious issue).

    In those cases, sampling is the only alternative.

    1.4 How to select the sample: the design of the sample survey

    The procedure for selecting the sample is called the sample survey design. The general aim of a sample survey is to draw samples which are representative of the whole population. Broadly speaking, we can classify sampling schemes into two categories: probability sampling and some other sampling schemes.

    1. Probability sampling

    This is a sampling scheme whereby the possible samples are enumerated and each has a non-zero probability of being selected. With probability built into the design, we can make statements such as "our estimate is unbiased" and "we are 95% confident that it is within 2 percentage points of the true proportion". In this course, we shall concentrate only on probability sampling.

    2. Some other sampling schemes

    a) volunteer sampling: e.g. TV telephone polls, medical volunteers for research.

    b) subjective sampling: we choose samples that we consider to be typical or representative of the population.

    c) quota sampling: one keeps sampling until a certain quota is filled.

    All these sampling procedures provide some information about the population, but it is hard to deduce the nature of the population from such studies, as the samples are very subjective and often very biased. Furthermore, it is hard to measure the precision of the resulting estimates.

    1.5 How to design a questionnaire and plan a survey

    This can be the most important, and perhaps most difficult, part of the survey sampling problem. We shall come back to this point in more detail later.


    (Margin notes: with a probability structure we know how the randomness behaves, so we can attach an error bound and be confident the estimate lies within it; with no probability structure there is no error bound, since we do not know how the estimator performs. Statistics rests on many assumptions; the issue is not that something is wrong, but how accurate the result is. E.g. in weighing, knowing the underlying structure of the measurement error tells us how far a reading may be from the true value. E.g. in a drug trial with a control group, say for AIDS, it is hard to judge from a biased sample whether there is an effect; a well-structured sample, rather than a biased one, is preferred to address those unknowns.)


    1.6 Some useful websites

    Many government statistical organizations and other collectors of survey data now have Web sites where they provide information on the survey design. Here are a few examples.

    Note that these sites are subject to change, but you should be able to find the organizationthrough a search.

    Organization                                                   Address
    Federal Interagency Council of Statistical Policy              www.fedstats.gov
    U.S. Bureau of the Census                                      www.census.gov
    Statistics Canada                                              www.statcan.ca
    Statistics Norway                                              www.ssb.no
    Statistics Sweden                                              www.scb.se
    UK Office for National Statistics                              www.ons.gov.uk
    Australian Bureau of Statistics                                www.statistics.gov.au
    Statistics New Zealand                                         www.stats.govt.nz
    Statistics Netherlands                                         www.cbs.nl
    Gallup Organization                                            www.gallup.com
    Nielsen Media Research                                         www.nielsenmedia.com
    National Opinion Research Center                               www.norc.uchicago.edu
    Inter-University Consortium for Political and Social Research  www.icpsr.umich.edu


    Chapter 2

    Simple random sampling

    Simple random sampling is the simplest sampling procedure, and is the building block for other more complicated sampling schemes to be introduced in later chapters.

    Definition: If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same probability of being selected, the sampling procedure is called simple random sampling (s.r.s. for short). The resulting sample is called a simple random sample.

    2.1 How to draw a simple random sample

    Suppose that the population of size N has values

    $\{u_1, u_2, \ldots, u_N\}$.

    There are $\binom{N}{n}$ possible samples of size n. If we assign probability $1/\binom{N}{n}$ to each of the different samples, then each sample thus obtained is a simple random sample. Denote such an s.r.s. as

    $(y_1, y_2, \ldots, y_n)$.

    Remark: In other statistics courses, we use upper-case letters like X, Y, etc. to denote random variables and lower-case letters like x, y, etc. to represent fixed values. However, in survey sampling, by convention, we use lower-case letters like $y_1, y_2$, etc. to denote random variables.

    We have the following result.

    Theorem 2.1.1 For simple random sampling, we have

    $$P(y_1 = u_{i_1}, y_2 = u_{i_2}, \ldots, y_n = u_{i_n}) = \frac{(N-n)!}{N!},$$

    where $i_1, i_2, \ldots, i_n$ are mutually different.


    Proof. By the definition of s.r.s., the probability of obtaining the sample $\{u_{i_1}, u_{i_2}, \ldots, u_{i_n}\}$ (where the order is not important) is $1/\binom{N}{n}$. There are $n!$ ways of ordering $\{u_{i_1}, u_{i_2}, \ldots, u_{i_n}\}$. Therefore,

    $$P(y_1 = u_{i_1}, y_2 = u_{i_2}, \ldots, y_n = u_{i_n}) = \frac{1}{\binom{N}{n}\, n!} = \frac{(N-n)!\, n!}{N!\, n!} = \frac{(N-n)!}{N!}.$$

    Recall that the total number of all possible samples is $\binom{N}{n}$, which could be very large if N and n are large. Therefore, getting a simple random sample by first listing all possible samples and then drawing one at random would not be practical. An easier way to get a simple random sample is simply to draw n values at random without replacement from the N population values. That is, we first draw one value at random from the N population values, then draw another value at random from the remaining $N-1$ population values, and so on, until we get a sample of n (different) values.
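The successive draws just described can be sketched in code; this is a minimal illustration (the population of 2000 IDs is made up, echoing the margin note), not part of the notes:

```python
import random

def srs_without_replacement(population, n):
    """Draw n values successively at random without replacement."""
    remaining = list(population)
    sample = []
    for _ in range(n):
        # every value still remaining is equally likely at this step
        pick = remaining.pop(random.randrange(len(remaining)))
        sample.append(pick)
    return sample

population = list(range(1, 2001))  # hypothetical population of N = 2000 IDs
sample = srs_without_replacement(population, 100)
print(len(sample), len(set(sample)))  # 100 distinct values
```

In practice this is equivalent to the standard library call random.sample(population, 100).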

    Theorem 2.1.2 A sample obtained by drawing n values successively without replacement from the N population values is a simple random sample.

    Proof. Suppose that our sample obtained by drawing n values without replacement from the N population values is

    $\{a_1, a_2, \ldots, a_n\}$,

    where the order is not important. Let $\{a_{i_1}, a_{i_2}, \ldots, a_{i_n}\}$ be any permutation of $\{a_1, a_2, \ldots, a_n\}$. Since the sample is drawn without replacement, we have

    $$P(y_1 = a_{i_1}, \ldots, y_n = a_{i_n}) = \frac{1}{N} \cdot \frac{1}{N-1} \cdots \frac{1}{N-n+1} = \frac{(N-n)!}{N!}.$$

    Hence, the probability of obtaining the sample $\{a_1, \ldots, a_n\}$ (where the order is not important) is

    $$\sum_{\text{all } (i_1, \ldots, i_n)} P(y_1 = a_{i_1}, \ldots, y_n = a_{i_n}) = \sum_{\text{all } (i_1, \ldots, i_n)} \frac{(N-n)!}{N!} = \frac{n!\,(N-n)!}{N!} = \frac{1}{\binom{N}{n}}.$$

    The theorem is thus proved by the definition of simple random sampling.


    Two special cases, involving one and two of the sample values respectively, will be used later.

    Theorem 2.1.3 For any $i, j = 1, \ldots, n$ and $s, t = 1, \ldots, N$,

    (i) $P(y_i = u_s) = \dfrac{1}{N}$;

    (ii) $P(y_i = u_s, y_j = u_t) = \dfrac{1}{N(N-1)}$, for $i \ne j$, $s \ne t$.

    Proof.

    $$P(y_k = u_j) = \sum_{\text{all } (i_1, \ldots, i_n) \text{ with } i_k = j} P(y_1 = u_{i_1}, \ldots, y_k = u_{i_k}, \ldots, y_n = u_{i_n}) = \frac{(N-n)!}{N!} \binom{N-1}{n-1} (n-1)! = \frac{(N-n)!}{N!} \cdot \frac{(N-1)!}{(N-n)!} = \frac{1}{N}.$$

    $$P(y_k = u_s, y_j = u_t) = \sum_{\text{all } (i_1, \ldots, i_n) \text{ with } i_k = s,\, i_j = t} P(y_1 = u_{i_1}, \ldots, y_n = u_{i_n}) = \frac{(N-n)!}{N!} \binom{N-2}{n-2} (n-2)! = \frac{(N-n)!}{N!} \cdot \frac{(N-2)!}{(N-n)!} = \frac{1}{N(N-1)}.$$
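Theorem 2.1.3 can be checked exactly on a small case by enumerating every equally likely ordered sample; the values N = 5, n = 3 below are just an illustration:

```python
from fractions import Fraction
from itertools import permutations

N, n = 5, 3
# all ordered samples drawn without replacement, each with probability (N-n)!/N!
ordered = list(permutations(range(N), n))

# (i) P(y_1 = u_1): fraction of ordered samples whose first value is element 0
p_one = Fraction(sum(1 for t in ordered if t[0] == 0), len(ordered))

# (ii) P(y_1 = u_1, y_2 = u_2): first two values are elements 0 and 1
p_two = Fraction(sum(1 for t in ordered if t[:2] == (0, 1)), len(ordered))

print(p_one)  # 1/5  = 1/N
print(p_two)  # 1/20 = 1/(N(N-1))
```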

    Example 1. A population contains {a, b, c, d}. We wish to draw an s.r.s. of size 2. List all possible samples and find the probability of drawing {b, d}.

    Solution. Possible samples of size 2 are

    {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}.

    The probability of drawing {b, d} is 1/6.
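The enumeration in Example 1 can be reproduced directly:

```python
from itertools import combinations

population = ['a', 'b', 'c', 'd']
samples = list(combinations(population, 2))  # all possible samples of size 2
print(samples)
print(len(samples))  # 6, so each sample, e.g. {b, d}, has probability 1/6
```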

    2.2 Estimation of population mean and total

    2.2.1 Estimation of population mean

    For a population of size N: $\{u_1, u_2, \ldots, u_N\}$, we are interested in

    the population mean

    $$\mu = \frac{u_1 + u_2 + \cdots + u_N}{N} = \frac{1}{N} \sum_{i=1}^{N} u_i,$$

    the population variance

    $$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (u_i - \mu)^2.$$
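As a small sketch (the population values are made up), note that the population variance here divides by N, unlike the usual sample variance with divisor n − 1:

```python
def population_mean(u):
    """mu = (1/N) * sum of u_i."""
    return sum(u) / len(u)

def population_variance(u):
    """sigma^2 = (1/N) * sum of (u_i - mu)^2; the divisor is N, not N - 1."""
    mu = population_mean(u)
    return sum((x - mu) ** 2 for x in u) / len(u)

u = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical population of N = 8 values
print(population_mean(u))      # 5.0
print(population_variance(u))  # 4.0
```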


    Given an s.r.s. of size n: $\{y_1, y_2, \ldots, y_n\}$, an obvious estimator for $\mu$ is the sample mean:

    $$\hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$

    Theorem 2.2.1

    (i) $E(y_i) = \mu$, $\mathrm{Var}(y_i) = \sigma^2$.

    (ii) $\mathrm{Cov}(y_i, y_j) = -\dfrac{\sigma^2}{N-1}$, for $i \ne j$.

    Proof. (i) By Theorem 2.1.3,

    $$E(y_i) = \sum_{k=1}^{N} u_k P(y_i = u_k) = \sum_{k=1}^{N} u_k \cdot \frac{1}{N} = \mu,$$

    $$\mathrm{Var}(y_i) = \sum_{k=1}^{N} (u_k - \mu)^2 P(y_i = u_k) = \sum_{k=1}^{N} (u_k - \mu)^2 \cdot \frac{1}{N} = \sigma^2.$$

    (ii) By definition, $\mathrm{Cov}(y_i, y_j) = E(y_i y_j) - E(y_i)E(y_j) = E(y_i y_j) - \mu^2$. Now,

    $$E(y_i y_j) = \sum_{\text{all } s \ne t} u_s u_t P(y_i = u_s, y_j = u_t) = \sum_{\text{all } s \ne t} u_s u_t \cdot \frac{1}{N(N-1)}$$

    $$= \frac{1}{N(N-1)} \left[ \sum_{\text{all } s,\, t} u_s u_t - \sum_{s=t} u_s u_t \right] = \frac{1}{N(N-1)} \left[ \left( \sum_{s=1}^{N} u_s \right) \left( \sum_{t=1}^{N} u_t \right) - \sum_{s=1}^{N} u_s^2 \right]$$

    $$= \frac{1}{N(N-1)} \left[ (N\mu)^2 - \left( \sum_{s=1}^{N} (u_s - \mu)^2 + N\mu^2 \right) \right] = \frac{1}{N(N-1)} \left[ N^2\mu^2 - N\sigma^2 - N\mu^2 \right] = -\frac{\sigma^2}{N-1} + \mu^2.$$

    Thus, $\mathrm{Cov}(y_i, y_j) = E(y_i y_j) - \mu^2 = -\dfrac{\sigma^2}{N-1}$.
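The negative covariance in Theorem 2.2.1(ii) can be verified exactly on a small made-up population by enumerating all ordered pairs of draws:

```python
from fractions import Fraction
from itertools import permutations

u = [1, 2, 3, 4, 5]   # hypothetical population
N = len(u)
mu = Fraction(sum(u), N)
sigma2 = sum((x - mu) ** 2 for x in u) / N

# each ordered pair (y_1, y_2) without replacement has probability 1/(N(N-1))
pairs = list(permutations(u, 2))
E_y1y2 = Fraction(sum(a * b for a, b in pairs), len(pairs))
cov = E_y1y2 - mu ** 2

print(cov)                 # -1/2
print(-sigma2 / (N - 1))   # -sigma^2/(N-1), the same value
```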

    Theorem 2.2.2 $E(\bar{y}) = \mu$, $\mathrm{Var}(\bar{y}) = \dfrac{\sigma^2}{n} \cdot \dfrac{N-n}{N-1}$.

    Proof. Note $\bar{y} = \frac{1}{n}(y_1 + \cdots + y_n)$. So $E(\bar{y}) = \frac{1}{n}(E y_1 + \cdots + E y_n) = \frac{1}{n}(n\mu) = \mu$. Now

    $$\mathrm{Var}(\bar{y}) = \frac{1}{n^2} \mathrm{Cov}\left( \sum_{i=1}^{n} y_i,\ \sum_{j=1}^{n} y_j \right) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}(y_i, y_j)$$

    $$= \frac{1}{n^2} \left( \sum_{i \ne j} \mathrm{Cov}(y_i, y_j) + \sum_{i = j} \mathrm{Cov}(y_i, y_j) \right) = \frac{1}{n^2} \left( \sum_{i \ne j} \left( -\frac{\sigma^2}{N-1} \right) + \sum_{i=1}^{n} \mathrm{Var}(y_i) \right)$$

    $$= \frac{1}{n^2} \left( n(n-1) \left( -\frac{\sigma^2}{N-1} \right) + n\sigma^2 \right) = \frac{\sigma^2}{n} \left( 1 - \frac{n-1}{N-1} \right) = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}.$$
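Similarly, Theorem 2.2.2 can be verified exactly by averaging over every possible sample of a small made-up population:

```python
from fractions import Fraction
from itertools import combinations

u = [1, 2, 3, 4, 5]   # hypothetical population
N, n = len(u), 2
mu = Fraction(sum(u), N)
sigma2 = sum((x - mu) ** 2 for x in u) / N

# sample mean of each of the C(N, n) equally likely simple random samples
means = [Fraction(sum(s), n) for s in combinations(u, n)]
var_ybar = sum((m - mu) ** 2 for m in means) / len(means)

print(var_ybar)                             # 3/4
print(sigma2 / n * Fraction(N - n, N - 1))  # (sigma^2/n)(N-n)/(N-1), the same
```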


    Remark: From Theorem 2.2.2, $\bar{y}$ is an unbiased estimator for $\mu$. Also, as n gets large (but $n \le N$), $\mathrm{Var}(\bar{y})$ tends to 0. Thus $\bar{y}$ becomes more accurate for $\mu$ as n gets larger. In particular, when $n = N$, we have a census and $\mathrm{Var}(\bar{y}) = 0$.

    Remark: In previous statistics courses, the sample $(y_1, y_2, \ldots, y_n)$ is usually independent and identically distributed (i.i.d.), namely the values are drawn from the population with replacement. As a result,

    $$E_{iid}(\bar{y}) = \mu, \qquad \mathrm{Var}_{iid}(\bar{y}) = \frac{\sigma^2}{n}.$$

    Notice that $\mathrm{Var}_{iid}(\bar{y})$ is different from $\mathrm{Var}(\bar{y})$ in Theorem 2.2.2. In fact, for $n > 1$,

    $$\mathrm{Var}(\bar{y}) = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1} < \frac{\sigma^2}{n} = \mathrm{Var}_{iid}(\bar{y}).$$
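The shrinkage factor (N − n)/(N − 1) is easy to compute; as a quick illustration (using the hypothetical figures of a sample of 100 from a population of 2000, as in the earlier margin note):

```python
# variance-reduction factor (N - n)/(N - 1) from sampling without replacement
N, n = 2000, 100
factor = (N - n) / (N - 1)
print(round(factor, 4))  # 0.9505: about a 5% reduction relative to sigma^2/n
```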