1
Data Collection and Sampling
ST 511
2
Methods of Collecting Data
• The reliability and accuracy of the data affect the validity of the results of a statistical analysis.
• The reliability and accuracy of the data depend on the method of collection.
• Three of the most popular sources of statistical data are:
  – Published data
  – Observational studies
  – Experimental studies
3
Published Data
– This is often a preferred source of data due to low cost and convenience.
– Published data is found as printed material, tapes, disks, and on the Internet.
– Data published by the organization that collected it is called PRIMARY DATA.
  For example: data published by the US Bureau of the Census.
– Data published by an organization other than the one that collected it is called SECONDARY DATA.
  For example:
  • The Statistical Abstract of the United States compiles data from primary sources.
  • Compustat sells a variety of financial data tapes compiled from primary sources.
4
Observational and experimental studies
• When published data is unavailable, one needs to conduct a study to generate the data.
  – An observational study is one in which measurements representing a variable of interest are observed and recorded, without controlling any factor that might influence their values.
  – An experimental study is one in which measurements representing a variable of interest are observed and recorded, while controlling factors that might influence their values.
5
Surveys
• Surveys solicit information from people.
• Surveys can be made by means of
  – personal interview
  – telephone interview
  – self-administered questionnaire
6
Surveys
A good questionnaire must be well designed:
• Keep the questionnaire as short as possible.
• Ask short, simple, and clearly worded questions.
• Start with demographic questions to help respondents get started comfortably.
• Use dichotomous and multiple-choice questions.
• Use open-ended questions cautiously.
• Avoid leading questions.
• Pretest the questionnaire on a small number of people.
• Think about the way you intend to use the collected data when preparing the questionnaire.
7
Sampling and Sampling Plans
• Motivation for conducting a sampling procedure:
  – Costs.
  – Population size.
  – The possible destructive nature of the sampling process.
• The sampled population and the target population should be similar to one another.
8
Sampling Plans
• We introduce four different sampling plans:
  – Simple random samples
  – Stratified random samples
  – Cluster samples
  – Systematic samples
9
Simple Random Samples
• In simple random sampling, all samples of the same size are equally likely to be chosen.
  – It is a consequence of this definition that each individual in the population has an equal chance of being chosen.
• An SRS is the standard against which we measure other sampling methods, and the sampling method on which the theory of working with sampled data is based.
• To conduct random sampling:
  – assign a number to each element of the chosen population (or use already given numbers),
  – randomly select the sample numbers (members). Use a random number table or a software package.
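To see why the definition implies an equal chance for each individual, a small enumeration can help. The sketch below assumes a hypothetical population of five labelled units and a sample size of 2 (neither comes from the slides). It lists every possible sample and counts how often each unit appears.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of five units; the labels are illustrative only.
population = ["A", "B", "C", "D", "E"]
n = 2  # sample size

# Under simple random sampling every one of these samples is equally likely.
all_samples = list(combinations(population, n))


def inclusion_probability(unit):
    """Fraction of the equally likely samples that contain `unit`."""
    containing = sum(1 for sample in all_samples if unit in sample)
    return Fraction(containing, len(all_samples))


probabilities = {unit: inclusion_probability(unit) for unit in population}

for unit, p in probabilities.items():
    print(unit, p)
```

Each unit appears in the same share of samples, namely n/N, which is the "equal chance" consequence stated above.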
10
Simple Random Samples (cont.)
• To select a sample at random, we first need to define where the sample will come from.
  – The sampling frame is a list of individuals from which the sample is drawn.
  – E.g., to select a random sample of students from a college, we might obtain a list of all registered full-time students.
  – When defining the sampling frame, we must deal with the details that define the population: are part-time students included? How about current study-abroad students?
• Once we have our sampling frame, the easiest way to choose an SRS is with random numbers.
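As a minimal sketch of drawing an SRS from a sampling frame with a software package, the example below uses Python's standard `random` module. The frame of 500 student IDs, the sample size of 25, and the helper name `draw_srs` are all hypothetical and chosen only for illustration.

```python
import random


def draw_srs(frame, n, seed=None):
    """Draw a simple random sample of n distinct individuals from the frame.

    random.Random.sample selects without replacement, so every subset of
    size n is equally likely to be returned.
    """
    rng = random.Random(seed)
    return rng.sample(frame, n)


# Hypothetical sampling frame: a list of 500 registered student IDs.
frame = [f"student_{i:04d}" for i in range(1, 501)]

sample = draw_srs(frame, n=25, seed=511)
print(sample[:5])
```

Fixing the seed makes the draw reproducible, which is useful when the selection has to be documented or audited.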
11
Simple Random Sampling
• Example
  – A government income-tax auditor is responsible for 1,000 tax returns.
  – The auditor will randomly select 40 returns to audit.
  – Use Excel’s random number generator to select the returns.
• Solution
  – We generate 50 numbers between 1 and 1,000 (we need only 40 numbers, but the extras can be used if duplicate numbers are generated).
12
Simple Random Sampling

Uniform (0 to 1)   X(1000)      Round-up
0.3820002          382.00018    383
0.1006806          100.68056    101
0.5964843          596.48427    597
0.8991058          899.10581    900
0.8846095          884.60952    885
0.9584643          958.46431    959
0.0144963          14.496292    15
0.4074221          407.4221     408
0.8632466          863.24656    864
0.1385846          138.58455    139
0.2450331          245.03311    246
. . .              . . .        . . .

• Column 1: 50 numbers uniformly distributed between 0 and 1.
• Column 2: 50 random numbers between 0 and 1,000.
• Column 3: 50 random, uniformly distributed whole numbers between 1 and 1,000. Each whole number has a probability of 1/1000 of being selected.

The auditor should select 40 files numbered 383, 101, ...
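The spreadsheet steps (uniform number, multiply by 1,000, round up) can be sketched in Python as follows. This is an illustration rather than the Excel procedure itself: the function names are hypothetical, and instead of generating a fixed batch of 50 numbers it keeps drawing until 40 distinct return numbers are available, which serves the same purpose as the 10 spare numbers.

```python
import math
import random


def round_up(u, population_size=1000):
    """Map a uniform number u in [0, 1) to a whole number in 1..population_size."""
    # math.ceil(0.0) would give 0, so that edge case is mapped to 1.
    return max(1, math.ceil(u * population_size))


def select_returns(n_needed=40, population_size=1000, seed=None):
    """Keep drawing random return numbers until n_needed distinct ones are found."""
    rng = random.Random(seed)
    selected = []
    while len(selected) < n_needed:
        number = round_up(rng.random(), population_size)
        if number not in selected:  # skip duplicates, as the slide suggests
            selected.append(number)
    return selected


# The first rows of the table: 0.3820002 -> 383, 0.1006806 -> 101.
print(round_up(0.3820002), round_up(0.1006806))

returns_to_audit = select_returns(seed=2024)
print(sorted(returns_to_audit))
```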
13
Stratified Random Sampling
• This sampling procedure separates the population into mutually exclusive sets (strata), and then selects a simple random sample from each stratum.
• Examples of stratification variables and their strata:
  – Sex: male, female
  – Age: under 20, 20-30, 31-40, 41-50
  – Occupation: professional, clerical, blue-collar
14
Stratified Random Sampling
• With this procedure we can acquire information about
  – the whole population
  – each stratum
  – the relationships among strata.
15
Stratified Random Sampling
• There are several ways to build the stratified sample. For example, keep the proportion of each stratum in the population.

A sample of size 1,000 is to be drawn.

Stratum   Income            Population proportion   Stratum size
1         under $15,000     25%                     250
2         $15,000-29,999    40%                     400
3         $30,000-50,000    30%                     300
4         over $50,000      5%                      50
Total                                               1,000
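The proportional allocation in the table can be reproduced with a few lines of Python. The sketch below takes the income strata and percentages from the slide; the function names and the tiny example frame used in the test of `stratified_sample` are hypothetical.

```python
import random


def proportional_allocation(percentages, sample_size):
    """Stratum sample sizes that keep each stratum's population proportion."""
    # With awkward percentages the rounded sizes may not add up exactly to
    # sample_size and would need a manual adjustment; here they do.
    return {s: round(sample_size * pct / 100) for s, pct in percentages.items()}


def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample of the allocated size from each stratum."""
    rng = random.Random(seed)
    return {s: rng.sample(members, sizes[s]) for s, members in strata.items()}


# Income strata and population proportions (in percent) from the slide.
percentages = {
    "under $15,000": 25,
    "$15,000-29,999": 40,
    "$30,000-50,000": 30,
    "over $50,000": 5,
}

sizes = proportional_allocation(percentages, sample_size=1000)
print(sizes)
```

Other allocation rules exist (the slide notes there are several ways to build the sample); proportional allocation is only the one illustrated here.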
16
Cluster Sampling
• Cluster sampling is a simple random sample of groups or clusters of elements.
• This procedure is useful when
  – it is difficult and costly to develop a complete list of the population members (making it difficult to develop a simple random sampling procedure).
  – the population members are widely dispersed geographically.
• Cluster sampling may increase sampling error, because of probable similarities among cluster members.
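A minimal sketch of cluster sampling, assuming a one-stage design in which every member of a selected cluster is included (the slide does not say whether members are subsampled). The city blocks, household labels, and the function name `cluster_sample` are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Select n_clusters clusters by simple random sampling and return
    the chosen cluster names together with all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members


# Hypothetical population: 20 city blocks with 5 households each.
clusters = {
    f"block_{b:02d}": [f"household_{b:02d}_{h}" for h in range(1, 6)]
    for b in range(1, 21)
}

chosen, members = cluster_sample(clusters, n_clusters=4, seed=1)
print(chosen)
print(len(members))
```

Only a list of clusters is needed to draw the sample, not a list of every household, which is the practical advantage described above.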
17
Systematic Samples
• Sometimes we draw a sample by selecting individuals systematically.
  – For example, you might survey every 10th person on an alphabetical list of students.
• To make it random, you must still start the systematic selection from a randomly selected individual.
• When there is no reason to believe that the order of the list could be associated in any way with the responses sought, systematic sampling can give a representative sample.
18
Systematic Samples (cont.)
• Systematic sampling can be much less expensive than true random sampling.
• When you use a systematic sample, you need to justify the assumption that the systematic method is not associated with any of the measured variables.
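The "every 10th person from a random start" idea can be sketched as follows. The alphabetical list of 200 students and the function name `systematic_sample` are hypothetical; only the step of 10 comes from the slides.

```python
import random


def systematic_sample(frame, step, seed=None):
    """Take every step-th individual, starting from a random position
    among the first `step` entries of the list."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random start index in 0..step-1
    return frame[start::step]


# Hypothetical alphabetical list of 200 students.
frame = [f"student_{i:03d}" for i in range(1, 201)]

sample = systematic_sample(frame, step=10, seed=7)
print(sample[:3], len(sample))
```

The only random element is the starting position, so the justification that list order is unrelated to the measured variables still has to be made separately.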
19
What Can Go Wrong? Or, How to Sample Badly
• Sample Badly with Volunteers:
  – In a voluntary response sample, a large group of individuals is invited to respond, and all who do respond are counted.
    • Voluntary response samples are almost always biased, and so conclusions drawn from them are almost always wrong.
  – Voluntary response samples are often biased toward those with strong opinions or those who are strongly motivated.
  – Since the sample is not representative, the resulting voluntary response bias invalidates the survey.
20
What Can Go Wrong? Or, How to Sample Badly (cont.)
• Sample Badly, but Conveniently:
  – In convenience sampling, we simply include the individuals who are convenient.
    • Unfortunately, this group may not be representative of the population.
  – Convenience sampling is not only a problem for students or other beginning samplers.
    • In fact, it is a widespread problem in the business world: the easiest people for a company to sample are its own customers.
21
What Can Go Wrong? Or, How to Sample Badly (cont.)
• Sample from a Bad Sampling Frame:
  – An SRS from an incomplete sampling frame introduces bias because the individuals included may differ from the ones not in the frame.
• Undercoverage:
  – Many of these bad survey designs suffer from undercoverage, in which some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population.
  – Undercoverage can arise for a number of reasons, but it’s always a potential source of bias.
22
What Else Can Go Wrong?
• Watch out for nonrespondents.
  – A common and serious potential source of bias for most surveys is nonresponse bias.
  – No survey succeeds in getting responses from everyone.
    • The problem is that those who don’t respond may differ from those who do.
    • And they may differ on just the variables we care about.
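A small worked calculation shows how nonresponse can distort an estimate when respondents differ from nonrespondents. All numbers below are hypothetical and chosen only for illustration: a population of 1,000 customers, 600 of them satisfied, with satisfied customers answering 20% of the time and dissatisfied customers 60% of the time.

```python
from fractions import Fraction


def respondent_share(group_sizes, response_rates, group):
    """Expected share of `group` among the people who actually respond."""
    responders = {g: group_sizes[g] * response_rates[g] for g in group_sizes}
    return responders[group] / sum(responders.values())


# Hypothetical population and response rates (not from the slides).
group_sizes = {"satisfied": 600, "dissatisfied": 400}
response_rates = {"satisfied": Fraction(1, 5), "dissatisfied": Fraction(3, 5)}

true_share = Fraction(group_sizes["satisfied"], sum(group_sizes.values()))
observed_share = respondent_share(group_sizes, response_rates, "satisfied")

print("true share satisfied:    ", true_share)
print("share among respondents: ", observed_share)
```

Under these assumed rates the survey would report one third of customers as satisfied when the true figure is three fifths, even though every response was recorded correctly.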
23
What Else Can Go Wrong? (cont.)
• Don’t bore respondents with surveys that go on and on and on and on…
  – Surveys that are too long are more likely to be refused, reducing the response rate and biasing all the results.
24
What Else Can Go Wrong? (cont.)
• Work hard to avoid influencing responses.
  – Response bias refers to anything in the survey design that influences the responses.
    • For example, the wording of a question can influence the responses: