STT 421
Day 7: September 28, 2015
September 28, 2015 STT 421: Vince Melfi 1
Sample Surveys
• We want to learn something about a group, often large, called the population.
• We can only collect data on a subset of the population, called the sample.
• We’d like the sample to be “representative” of the population.
• If a sampling method over- or underrepresents an important characteristic of the population, it’s called biased.
Literary Digest Poll (1936)
• Goal: Predict the outcome of the 1936 presidential election between Roosevelt and Landon
• Literary Digest magazine mailed out 10 million surveys and got 2.4 million responses.
• Of those who responded, 57% preferred Landon to Roosevelt.
• On the basis of this (large!) sample, Literary Digest predicted a landslide victory for Landon
Literary Digest Poll (1936)
• George Gallup, a pollster, also tried to predict the outcome of the election
• He had a smaller sample size of 50,000.
• But he selected his sample via “quota sampling,” where he tried to get proportions in his sample matching those in the population for important groups.
• For example, the sample should have the same proportion of middle class urban women, lower class rural men, etc.
Literary Digest Poll (1936)
• Roosevelt won the election by a landslide.
• Gallup’s poll predicted this.
• Literary Digest went out of business shortly after 1936.
• Gallup polls are still conducted today. (But they don’t use “quota sampling” any more. There are better methods that we’ll learn about.)
Literary Digest Poll (1936)
• What went wrong for Literary Digest?
  – They found their 10 million names in three places
    • Their own readers (who tended to be affluent)
    • Telephone registries (in 1936, at the height of the depression, many poorer people had no phone)
    • Automobile registries (in 1936, at the height of the depression, many poorer people had no car)
  – So the sample wasn’t representative of the population. In fact it overrepresented the wealthy.
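To see why a huge sample from an unrepresentative list still misses, here is a minimal sketch with made-up numbers. The group shares and preferences below are hypothetical, chosen only for illustration; they are not figures from the 1936 election.

```python
# Hypothetical numbers for illustration only; none come from the 1936 data.
# Share of each group among voters vs. among names on the mailing list.
population_share = {"wealthy": 0.30, "not_wealthy": 0.70}
frame_share = {"wealthy": 0.70, "not_wealthy": 0.30}

# Assumed proportion in each group who prefer Roosevelt.
prefers_roosevelt = {"wealthy": 0.35, "not_wealthy": 0.70}

def overall_proportion(shares, preference):
    """Weighted average of the group preferences using the given group shares."""
    return sum(shares[g] * preference[g] for g in shares)

true_p = overall_proportion(population_share, prefers_roosevelt)
frame_p = overall_proportion(frame_share, prefers_roosevelt)

print(f"Proportion for Roosevelt among voters:       {true_p:.3f}")
print(f"Proportion for Roosevelt on the mailing list: {frame_p:.3f}")
```

Under these assumed values the mailing list understates Roosevelt's support no matter how many surveys are mailed. A larger sample from a biased list does not remove the bias.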
Randomization
• How do we avoid bias even if we don’t know much about the population?
• The key idea is randomization.
• By choosing people “at random” we guard against potential biases.
• There are many sampling methods that employ randomization. One of the most basic is “simple random sampling.”
Population and Sample
• The population is the group we’re interested in.
• Numerical characteristics of the population are called parameters.
• The sample is the group we’re able to collect data on
• Numerical characteristics of the sample are called statistics.
Population and Sample
• Example: 1936 election prediction.
• Population is all those who will vote.
• Parameter of interest is p, the proportion of those who vote who will vote for Roosevelt.
• Statistic we’d calculate from the sample is the proportion in the sample who say they’ll vote for Roosevelt, denoted p̂ (“p-hat”).
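The difference between a parameter and a statistic can be sketched in a few lines. The population below is hypothetical (a 60/40 split is assumed purely for illustration), and the sample proportion is stored under the name p_hat.

```python
import random

# Hypothetical population: 1 = will vote for Roosevelt, 0 = will not.
# The 60/40 split is an assumption for illustration, not real data.
population = [1] * 6000 + [0] * 4000

# Parameter p: computed from the whole population (unknown in practice).
p = sum(population) / len(population)

# Statistic p_hat: computed from a random sample of size n.
rng = random.Random(421)  # fixed seed so the run is repeatable
n = 500
sample = rng.sample(population, n)  # draws n distinct members
p_hat = sum(sample) / n

print(f"parameter p     = {p:.3f}")
print(f"statistic p_hat = {p_hat:.3f}")
```

The parameter is a fixed number, while the statistic changes from sample to sample; rerunning with a different seed would typically give a slightly different p_hat.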
Simple Random Sample
• A simple random sample of size n is drawn in such a way that every sample of size n from the population has the same chance of being selected.
• Example: Population is A, B, C, D. n=2
• {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D} are all the samples of size 2. All should have the same chance of being selected.
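The A, B, C, D example can be checked directly. This is a minimal sketch using Python's standard library: `itertools.combinations` lists every sample of size 2, and `random.sample` is one common way to draw a simple random sample without replacement.

```python
import random
from itertools import combinations

population = ["A", "B", "C", "D"]
n = 2

# List every possible sample of size n (order within a sample does not matter).
all_samples = list(combinations(population, n))
print(all_samples)
print(len(all_samples))  # 6 possible samples, so each should have chance 1/6

# Draw one simple random sample: n distinct members, all subsets equally likely.
srs = random.sample(population, n)
print(srs)
```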
“Good” samples aren’t so easy to obtain
• Example: In an election poll, how do you determine who will actually vote, to avoid having people in your sample who are registered voters but won’t vote?
• Even ignoring this, how do you deal with people who refuse to answer, who lie, who will change their vote by the time of the election, etc.?
The Salk polio vaccine study
• Polio was a greatly feared disease in the first half of the 20th century
• Franklin Roosevelt contracted polio and was partially paralyzed
• Polio is caused by a virus
• Not all cases of polio cause severe symptoms: Some mild cases are hard to distinguish from other illnesses
February 13, 2013 STT 200: Vince Melfi 12
The Salk polio vaccine study
• Two references (class material largely drawn from the second):
  – “Polio: An American Story” by David Oshinsky
  – “The Biggest Public Health Experiment Ever: The 1954 Field Trial of the Salk Poliomyelitis Vaccine” by Paul Meier
The early 1950s
• In the early 1950s there were two vaccines under development that had substantial promise
  • A “live virus” vaccine developed by Albert Sabin
  • A “killed virus” vaccine developed by Jonas Salk
• Based on preliminary data, it was decided to do a large-scale study of the effectiveness of the Salk vaccine
• The vaccine was NOT expected to be 100% effective
A Simple Study
• Safety of the vaccine was not a worry
• A simple plan: Make the vaccine available as widely as possible; let subjects (or their parents) volunteer to get the vaccine.
• See whether and how much the rate of polio drops
• This is an observational study
Which of these are potential problems with the simple idea of distributing the vaccine widely and comparing the rate of polio with that in the past?
(a) If the rate drops, we don’t know whether the drop is due to the vaccine or other factors
(b) Those who volunteer may have different health characteristics than those who do not
(c) Since polio is hard to diagnose, doctors who know a patient is vaccinated might be less likely to diagnose polio
[Figure: Number of polio cases by year, 1937 to 1952]
Adding a control group
• A control group (people who would not have the opportunity to receive the vaccine) can help with some of the issues
• A suggestion:
  – Offer (but do not require) vaccination for all second graders (the treatment group)
  – Don’t offer vaccination to others
  – First and third graders form the control group
Which of these are potential problems with the modified study which includes a control group?
(a) Those who volunteer may have different health characteristics than those who do not
(b) Since polio is hard to diagnose, doctors who know a patient is vaccinated might be less likely to diagnose polio
(c) There may be differences between the treatment and control group that affect the results
Experiment vs observational study
• Adding a control group moves us closer to a designed experiment
An experimental study
• Assign children at random to one of two groups:
  – “Treatment” group: receives the polio vaccine
  – “Placebo control” group: receives an injection of an innocuous serum that does not affect polio
• Children, parents, and physicians are not allowed to know which children are in the control group and which are in the treatment group (a double-blind study)
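Random assignment can be sketched as a shuffle followed by a split. This is only an illustration, not the procedure actually used in the 1954 trial; the function name `assign_groups`, the ID numbers, and the seed are all hypothetical.

```python
import random

def assign_groups(child_ids, seed=None):
    """Randomly split children into treatment and placebo groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(child_ids)
    rng.shuffle(shuffled)  # every ordering is equally likely
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "placebo": shuffled[half:]}

# Hypothetical ID numbers. In a double-blind study only coded IDs, not the
# group labels, would be visible to children, parents, and physicians.
groups = assign_groups(range(1, 11), seed=1954)
print(groups)
```

Because chance alone decides who lands in each group, the two groups should be similar, on average, in health and every other characteristic.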
Sample Size
• Polio was relatively rare, about 50 cases per 100,000
• The vaccine was not expected to be 100% effective without further refinement
• Clearly a large sample size would be needed to detect effectiveness
If the incidence of polio is 50 per 100,000, the vaccine is 50% effective, and there are 40,000 children in the treatment group and 40,000 in the control group, how many children in the treatment group would be expected to contract polio?
(a) 20
(b) 40
(c) 50
(d) 10
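The arithmetic behind this question can be sketched as follows, assuming that "50% effective" means the vaccine cuts the incidence of polio in half. The variable names are illustrative.

```python
cases_per_100k = 50     # baseline incidence, from the question
effectiveness = 0.50    # assumed to mean the vaccine halves the incidence
group_size = 40_000     # children in each group

# Expected cases with no vaccine protection (the control group).
expected_control = group_size * cases_per_100k / 100_000

# Expected cases when the incidence is reduced by the vaccine.
expected_treatment = expected_control * (1 - effectiveness)

print(expected_control)    # 20.0
print(expected_treatment)  # 10.0
```

A difference of about 10 expected cases between the groups is small enough to be blurred by chance, which is why a much larger study was needed.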
Results of first study
Group                            Size      # Polio Cases   Rate (per 100,000)
Vaccine (2nd grade)              221,988   56              25
No vaccine (1st and 3rd grade)   725,173   391             54
Refused vaccine (2nd grade)      123,605   54              44
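The rate column can be reproduced from the group sizes and case counts in the table. This is a small sketch; the helper name `rate_per_100k` is illustrative.

```python
def rate_per_100k(cases, size):
    """Cases per 100,000 children, rounded to the nearest whole number."""
    return round(cases / size * 100_000)

# (group size, number of polio cases), taken from the first-study table.
first_study = {
    "Vaccine (2nd grade)": (221_988, 56),
    "No vaccine (1st and 3rd grade)": (725_173, 391),
    "Refused vaccine (2nd grade)": (123_605, 54),
}

for group, (size, cases) in first_study.items():
    print(f"{group}: {rate_per_100k(cases, size)} per 100,000")
```

Rates, not raw counts, are what can be compared across groups, since the groups differ greatly in size.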
Results of second study
Group        Size      # Polio Cases   Rate (per 100,000)
Vaccinated   200,745   57              28
Placebo      201,229   142             71
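One rough way to read these results is to compare the two rates directly. The sketch below computes the relative reduction in the polio rate for vaccinated children; treating that ratio as a measure of effectiveness is an assumption made here for illustration, not a calculation shown on the slides.

```python
vaccinated_size, vaccinated_cases = 200_745, 57
placebo_size, placebo_cases = 201_229, 142

vaccinated_rate = vaccinated_cases / vaccinated_size * 100_000
placebo_rate = placebo_cases / placebo_size * 100_000

# Relative reduction in the polio rate, vaccinated vs. placebo.
relative_reduction = 1 - vaccinated_rate / placebo_rate

print(f"{vaccinated_rate:.1f} vs {placebo_rate:.1f} cases per 100,000")
print(f"relative reduction: {relative_reduction:.0%}")
```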