THOUGHTS ON CONFIDENCE
Chapter 19! Yay!
SAMPLES AND POPULATIONS ARE DIFFERENT When we take a sample, it is not the same
as the population. Samples will pretty much always be at
least a little different from the population. When we take a sample, we presumably
have pretty much complete knowledge of the sample itself.
We also have rather incomplete knowledge of the population.
The sample is, however, at least sort of like the population pretty much every time.
INFERENCES ARE FUN Our goal is to do a little work (by taking
a sample) to draw conclusions (about the whole population) without doing a census (since they are dumb).
Our goal in inferential statistics is going to be to draw conclusions about the population.
For quantitative data, we are looking for mu and sigma, even though we only have x-bar and s.
INFERENCES ARE FUN For proportional data, we are trying to
draw conclusions about p (the true population p) using only the proportion gained from our sample (p-hat).
In chapter 19, we will focus on the proportional data side of this.
ESTIMATING P While we would expect p-hat to be close
to p, we should hopefully realize that it is not exactly the same as p.
In fact, the idea that the true average would only be close to a given number is quite normal, like we talked about yesterday.
Most of the time, people estimate using phrases like “about ___ minutes” or “___ to ___ minutes”.
ABOUT ABOUT When you say “about 4 minutes” you
are really saying it is close to 4 minutes but not necessarily 4 minutes exactly.
In adult, grown-up statistics we do not want to settle for crude approximations like “about 4 minutes”.
Such approximations are for regular people.
ABOUT BETWEEN We will give an interval that our p is
likely between. We can only be 100% guaranteed that it
is between 0% and 100%, but this is useless.
We can also reasonably assume that it is not the exact same value as the p-hat we found in our sample.
ABOUT BETWEEN p should be close to p-hat, however. So we start from p-hat, and add and
subtract a reasonable amount from it, so that we are spread out enough to likely include the true p, but not so spread out that our interval is useless.
To do this, we will go an appropriate number of standard deviations in each direction.
THE FORMULA! This is the one-proportion confidence
interval formula:
$\hat{p} \pm z^{*} \sqrt{\hat{p}(1 - \hat{p}) / n}$
MORE FORMULA STUFF! The z with the asterisk is pronounced z-star.
The reverse z-score formula (value = mean + z times standard deviation) becomes
this formula once p-hat stands in for the mean and the standard deviation of the sample proportion stands in for the standard deviation.
It is a plus/minus instead of just a plus sign because we want to consider that we might have overestimated or underestimated.
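As a rough sketch of the arithmetic, here is the interval computed in Python, assuming the usual form of the one-proportion interval: p-hat plus or minus z-star times sqrt(p-hat(1 - p-hat)/n). The function name and the sample of 120 successes out of 200 are made up for illustration.

```python
from math import sqrt

def one_proportion_interval(successes, n, z_star=1.96):
    """Return (low, high) for p-hat +/- z* * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = successes / n                          # sample proportion
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)  # part after the plus/minus
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 120 successes in 200 trials, 95% confidence.
low, high = one_proportion_interval(successes=120, n=200)
print(f"{low:.3f} to {high:.3f}")
```

The default z_star of 1.96 matches a 95% confidence level; a different level needs a different critical z.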
Z-STAR This is also known as the critical z, or
the critical value of z. When we do a 95% confidence interval,
we use z* = 1.96 because 95% of the area under the z-curve is between -1.96 and 1.96.
If we used a different confidence level, we would need to find a different critical z.
Z-STAR Since the confidence level is the middle percentage
of the z-curve, half of the remainder is above that middle and half of the remainder is below it.
So the middle 95% cuts off the lower 2.5% and the upper 2.5% (since half of the remaining 5% is 2.5%).
On the calculator, invNorm(.025) will give us the z-score of -1.96.
Due to the plus/minus we can just use positive 1.96.
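The same lookup can be sketched in Python, where the standard library's NormalDist().inv_cdf plays the role of invNorm. The helper name critical_z is made up for illustration.

```python
from statistics import NormalDist

def critical_z(confidence_level):
    """Return the positive z* that leaves half of the leftover area in each tail."""
    tail = (1 - confidence_level) / 2        # e.g. 0.025 for a 95% level
    return -NormalDist().inv_cdf(tail)       # inv_cdf gives the negative z; flip the sign

for level in (0.90, 0.95, 0.99):
    print(level, round(critical_z(level), 3))
```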
MARGIN OF ERROR The part after the plus/minus sign is the
margin of error. It is called this because it is the amount
we think our sample might be off by. The variable we use for margin of error
is usually lowercase e, although I have seen uppercase E.
MARGIN OF ERROR FORMULA This is the formula:
$e = z^{*} \sqrt{\hat{p}(1 - \hat{p}) / n}$
MARGIN OF ERROR PROBLEMS Generally the margin of error formula is
used to solve for the sample size, n. If we know what margin of error we
want and we know what confidence level we will use, we can solve this formula for n.
Of course, since we usually do this before the study, we might not know p-hat either.
MARGIN OF ERROR FORMULA, PART 2 Formula if you can comfortably assume
a value for p-hat:
$n = (z^{*})^{2} \, \hat{p}(1 - \hat{p}) / e^{2}$
MARGIN OF ERROR PROBLEMS Sometimes you cannot assume a value for
p-hat. In fact, in almost all real-world scenarios
you will not know p-hat ahead of time. The most conservative value for p-hat is
.5, because p-hat times its complement is largest there. So, if you put in .5 in place of p-hat, you
get the largest sample size you should need no matter what p-hat is.
Doing this gives you a different, easier formula.
MARGIN OF ERROR FORMULA, PART 3 Formula if you do not know p-hat:
$n = (z^{*})^{2} (0.25) / e^{2} = \left( z^{*} / (2e) \right)^{2}$
REALLY IMPORTANT!!! You must, must, must, must, must,
must, must always round up when determining sample size.
n = 24.000001 means that 24 is not large enough.
The answer would be 25. This is a really big deal if you round
wrong, actually.
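Here is a minimal sketch of the sample size calculation, assuming the margin of error formula e = z* sqrt(p-hat(1 - p-hat)/n) solved for n. The function name and the 3% margin are made up for illustration; math.ceil does the always-round-up step.

```python
from math import ceil

def sample_size(margin, z_star=1.96, p_hat=0.5):
    """Smallest whole n whose margin of error is no bigger than `margin`.

    p_hat defaults to 0.5, the conservative choice when p-hat is unknown.
    """
    raw = (z_star ** 2) * p_hat * (1 - p_hat) / margin ** 2
    return ceil(raw)  # always round up, never to the nearest whole number

# Hypothetical targets: 3% margin of error at 95% confidence.
print(sample_size(0.03))             # p-hat unknown, so 0.5 is used
print(sample_size(0.03, p_hat=0.2))  # p-hat assumed to be about 0.2
```

Plugging in .5 gives the larger of the two answers, which is the point of calling it conservative.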
FORESHADOWING FTW In addition to these formulas, we will
have to discuss two very important conceptual topics.
First, we will need to discuss what confidence level really means.
Second, we will need to discuss how to interpret a confidence interval.
There will be a script. The last time we did scripts they started
“the model predicts”.
WHEN YOU ASSUME
It is frequently inconvenient for u and me!
THE 4 CONDITIONS There are 4 conditions to check. They are Random, Independent, Less
Than 10%, and Large Enough sample. For proportional data, Large Enough
becomes Success/Failure instead. Some of these conditions are obviously
met in a problem, but you are still expected to address them.
RANDOM For the “The sample was selected randomly”
condition, it will generally be addressed in one of three ways.
Option 1: “Random – The problem said it was random.”
Option 2: “Random – The problem did not specify that it was random, but this is a reasonable assumption because <reason>.” Often a good claim is that, since the study was done
by experienced researchers, they presumably would have randomized.
Option 3: “Random – The sample fails to be random because <reason>.”
CONSEQUENCES OF FAILURE We need the sample to be random because if
it is not random, we introduce bias, and that means our sample might be even less representative than a random one.
This makes generalizing it to the population an unreliable process.
Therefore, if it is not random, our results cannot be generalized to the population with the same degree of accuracy.
In other words, a 95% confidence interval might actually carry with it even less confidence than the level suggests.
INDEPENDENT For the “The subjects of the sample are
independent from one another” condition, it will generally be handled just one way.
Option 1: “Independent – It is reasonable to assume independence because one <subject>’s <measurement> does not determine another.” Ex: “Independent – One person’s preference in hair
color does not determine another person’s preference.”
For truly random phenomena such as die rolling and coin flipping, there is another option.
Option 2: “Independent – Coin flips/Rolls of a die are independent.”
CONSEQUENCES OF FAILURE Unless there is some reason that the
method used is generating truly dependent values, this assumption pretty much always works.
If, however, you were not generating independent values, then just like with a randomization failure, the results are less reliable.
This means that using the methods of these two units is doomed to failure in this case.
LESS THAN 10% When we get an outlier, we can generally just
assume that the outlier is unique and is a freak occurrence as it were.
The problem is that in a large enough segment of the population, freak occurrences are going to happen regularly.
Consider the problem about the 200 coin tosses done in a large introductory stats class.
Even though the girl got a very unlikely result (42% and below was only about 1% likely), in such a large class where all results were recorded, we would actually expect someone to get a result like that.
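The coin toss numbers can be checked with a quick sketch, using the normal model for a sample proportion. The class size of 100 is a made-up number (the problem only says the class was large), and the last step assumes each student's tosses are independent of everyone else's.

```python
from math import sqrt
from statistics import NormalDist

n_tosses = 200
p = 0.5
sd = sqrt(p * (1 - p) / n_tosses)  # standard deviation of the sample proportion

# Chance that one student gets 42% heads or fewer (normal approximation).
prob_low = NormalDist(mu=p, sigma=sd).cdf(0.42)

class_size = 100  # hypothetical; not given in the problem
# Chance that at least one student in the class gets a result that low.
prob_at_least_one = 1 - (1 - prob_low) ** class_size

print(f"one student: {prob_low:.4f}")
print(f"at least one of {class_size}: {prob_at_least_one:.2f}")
```

For one student the chance is about 1%, but across a whole class it is far from a freak occurrence.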
LESS THAN 10% To go from 10% to 100%, we must multiply
by 10. So, for this condition, we first need to take
the sample size and multiply it by 10. We meet the condition if the population
size is presumably larger than that. We do not meet the condition if the
population size is smaller than that. Consider Mr. Sanford testing 16 bags of
microwave popcorn to see what percent of my sample pops at least 99% of the kernels when popped for 2 minutes.
LESS THAN 10% There would be two ways to handle this. Option 1: “Less than 10% – There are
more than 160 popcorn bags out there.” Option 1 Alternative (Use sass): “Less than
10% – There are more than 160 popcorn bags in my basement alone.”
Option 2: “Less than 10% – The number of popcorn bags is less than 160.”
Keep in mind this would mean…the full total.
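The multiply-by-10 check can be sketched in a couple of lines. The function name and the population figures below are made up for illustration; only the sample of 16 bags comes from the popcorn example.

```python
def meets_ten_percent_condition(sample_size, population_size):
    """True when the sample is less than 10% of the population."""
    return population_size > 10 * sample_size

# 16 bags sampled, so the population needs to be larger than 160 bags.
print(meets_ten_percent_condition(16, 1_000_000))  # hypothetical: lots of bags exist
print(meets_ten_percent_condition(16, 100))        # hypothetical: only 100 bags exist
```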
CONSEQUENCES OF FAILURE If there is noticeable skew or outliers in
a sample that is more than 10% of the population, it does not fully normalize.
For proportional data, our methods similarly break down and confidence does not mean exactly the same thing.
This condition is, however, the least harmful to violate, as its effect is smaller on the validity of our processes.
LARGE ENOUGH – SUCCESS/FAILURE If n times p is more than 10 and n times the
complement of p is more than 10, you meet this condition.
You are expected to show the product of both multiplications.
I do not expect you to write a sentence for this.
I feel like the multiplication speaks for itself. You do, however, need to say “Success/Failure
– ” or “Large Enough – ” before you start the multiplying so that I know you know you are doing it to check this condition.
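A small sketch of the two multiplications, using the "more than 10" cutoff stated above. The function name and the sample values are made up for illustration.

```python
def success_failure(n, p):
    """Return (n*p, n*(1-p), passed) for the Success/Failure condition."""
    expected_successes = n * p
    expected_failures = n * (1 - p)  # n times the complement of p
    passed = expected_successes > 10 and expected_failures > 10
    return expected_successes, expected_failures, passed

# Hypothetical samples: one that passes and one that fails.
print(success_failure(200, 0.6))
print(success_failure(50, 0.1))
```

Showing both products is the whole check; the pass/fail flag just states the conclusion.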
CONSEQUENCES OF FAILURE If you fail this condition, rather than
using z scores to find probability, you need to construct the entire binomial distribution in order to discuss confidence intervals or hypothesis tests.
I will not go over how to do this since you will not be expected to actually manage this.
WHEN WE FAIL In a more perfect world, if we failed a
condition, we might fix the failure and resample.
On the homework problems, quiz questions, and test questions we will instead go ahead with the rest of the process anyway.
In the real world, if you failed an assumption like that, you would either use a model that worked around it or you would not be able to continue further.
ASSIGNMENTS Chapter 18: 5, 17, 25, and 27. Due today.
Chapter 19: 7, 11, 13, 15, 23. Due Friday.
Chapters 18+19 quiz will be Thursday.
Midterm project presentations will be in two weeks.
QUIZ BULLETPOINTS Be able to use z-scores to find probabilities for individuals.
Be able to use z-scores to find probabilities for sample averages.
Be able to use z-scores to find probabilities for sample proportions.
Be able to find a confidence interval for the true proportion based on a sample.
Be able to find the sample size in order to get a desired margin of error.