introduction to inference - mr. song's statistics

Sampling Distributions

Introduction to Inference

Parameter

• A parameter is a number that describes the population.

– A parameter always exists, but in practice we rarely know its value because we cannot examine the entire population.

– We use Greek letters to describe them (μ or σ). If we are talking about a population proportion, we use rho (ρ).

Statistic

• A statistic is a number that describes a sample.

– The value of a statistic can be found when we sample.

– A statistic can change from sample to sample. (Sampling variability)

– Statistics use variables like x̄, s, and ρ̂.

– Ex: I take a random sample of 500 American males and find their IQs. We find that x̄ = 103.2.

– Ex: I take a random sample of 200 women and find that 40 like broccoli. Then ρ̂ = 40/200 = 0.2.

Exercises

• For each of the following, use appropriate notation to describe each number.

– 9.1 Making Ball Bearings A lot of ball bearings has a mean diameter of 2.5003 cm. An inspector chooses 100 bearings from the lot, and they have a mean diameter of 2.5009 cm.

μ = 2.5003 is a parameter; x̄ = 2.5009 is a statistic

– 9.2 Unemployment The Bureau of Labor Statistics last month interviewed 60,000 members of the U.S. labor force, of whom 7.2% were unemployed.

ρ̂ = 7.2% is a statistic

– 9.3 Telemarketing A telemarketing firm in LA uses a device that dials residential telephone numbers in that city at random. Of the first 100 numbers dialed, 48% are unlisted. This is not surprising because 52% of all LA residential phones are unlisted.

ρ̂ = 48% is a statistic; ρ = 52% is a parameter

– 9.4 Well-fed Rats A researcher carries out a randomized comparative experiment with young rats to investigate the effects of a toxic compound in food. She feeds the control group a normal diet. The experimental group receives a diet with 2500 parts per million of the toxic material. After 8 weeks, the mean weight gain is 335g for the control group and 289g for the experimental group.

Both x̄₁ = 335 and x̄₂ = 289 are statistics

Describing Sampling Distribution

• Television executives and companies who advertise on TV are interested in how many viewers watch particular television shows. According to 2001 Nielsen ratings, Survivor II was one of the most-watched television shows in the U.S. during every week that it aired. Suppose that the true proportion of U.S. adults who watched Survivor II is 𝜌 = 0.37. Figure 9.5 shows the results of drawing 1000 SRSs of size n=100 from a population with 𝜌 = 0.37.

• The overall shape of the distribution is symmetric and approximately normal.

• The center of the distribution is very close to the true value 𝜌 = 0.37.

• The values of ρ̂ have a large spread: they range from 0.22 to 0.54. Because the distribution is close to normal, we can use the standard deviation to describe its spread. The standard deviation is about 0.05.

• There are no outliers or other important deviations from the overall pattern.
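The four observations above (shape, center, spread, outliers) can be checked with a small simulation. This is only a sketch: it assumes the population is so large that each sampled person can be treated as an independent draw with probability 0.37, and the function name and seed are made up for illustration.

```python
import random
import statistics

def simulate_sample_proportions(p=0.37, n=100, num_samples=1000, seed=1):
    """Draw num_samples SRSs of size n and return each sample proportion."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(num_samples):
        # Assumption: each sampled person is a viewer with probability p,
        # independently of the others (reasonable for a very large population).
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

p_hats = simulate_sample_proportions()
center = statistics.mean(p_hats)    # should land near the true value 0.37
spread = statistics.stdev(p_hats)   # should land near 0.05
print(f"center = {center:.3f}, spread = {spread:.3f}")
print(f"smallest = {min(p_hats):.2f}, largest = {max(p_hats):.2f}")
```

A different seed gives slightly different numbers, which is sampling variability in action.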

Bias

• A statistic used to estimate a parameter is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated. The statistic is called an unbiased estimator of the parameter.

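• We say a statistic is biased if the mean of its sampling distribution is not equal to the true value of the parameter, so that it systematically overestimates or underestimates the parameter.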

• An unbiased statistic will sometimes fall above the true value of the parameter and sometimes below if we take many samples. Because its sampling distribution is centered at the true value, however, there is no systematic tendency to overestimate or underestimate the parameter.

Approximate sampling distributions of the sample proportion ρ̂ for SRSs of two sizes drawn from a population with ρ = 0.37.

(a) Sample size 100. (b) Sample size 1000.

• The approximate sampling distribution of ρ̂ for samples of size 100, shown in (a), is close to the normal distribution with mean 0.37 and standard deviation 0.05. So about 95% of the values of ρ̂ will fall within two standard deviations of the mean, ρ = 0.37. If in fact 37% of U.S. adults have seen Survivor II, the estimates from repeated SRSs of size 100 will usually fall between 27% and 47%. That’s not very satisfactory.

• For sample size 1000, shown in (b), the standard deviation is only about 0.01. So about 95% of these samples will give an estimate within about 0.02 of the true parameter, that is, between 0.35 and 0.39. An SRS of size 1000 can be trusted to give sample estimates that are very close to the truth about the entire population.

Variability

• The variability of a statistic is described by the spread of its sampling distribution. This spread is determined by the sampling design and the size of the sample. Larger samples give smaller spread.

• As long as the population is much larger than the sample (at least 10 times as large), the spread of the sampling distribution is approximately the same for any population size.

• The size of the population has little influence on the behavior of statistics from random samples.

• A statistic from an SRS of size 2500 from the more than 300 million residents of the U.S. is just as precise as one from an SRS of size 2500 from the 775,000 inhabitants of San Francisco.

• Why does the size of the population have little influence on the behavior of statistics from random samples?

• Imagine sampling harvested corn by thrusting a scoop into a lot of corn kernels. The scoop doesn’t know whether it is surrounded by a bag of corn or by an entire truckload. As long as the corn is well mixed (so that the scoop selects a random sample), the variability of the result depends only on the size of the scoop.
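The U.S. versus San Francisco comparison can be checked numerically. The sketch below uses the finite population correction, a standard result that is not stated on these slides: for an SRS drawn without replacement from a population of size N, the standard deviation of a sample proportion is √(ρ(1−ρ)/n) · √((N−n)/(N−1)). The value ρ = 0.5 is an assumption chosen only for illustration.

```python
import math

def sd_of_sample_proportion(p, n, population_size):
    """SD of the sample proportion for an SRS taken without replacement."""
    base = math.sqrt(p * (1 - p) / n)
    # Finite population correction: very close to 1 when N is much larger than n.
    correction = math.sqrt((population_size - n) / (population_size - 1))
    return base * correction

p = 0.5    # assumed population proportion, for illustration only
n = 2500   # sample size from the slide

sd_us = sd_of_sample_proportion(p, n, 300_000_000)
sd_sf = sd_of_sample_proportion(p, n, 775_000)
print(f"U.S.:          {sd_us:.5f}")
print(f"San Francisco: {sd_sf:.5f}")
```

The two standard deviations agree to about four decimal places, which is the scoop-of-corn idea in numbers: the spread depends on the size of the scoop, not the size of the pile.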

Bias and Variability

• We can think of the true value of the population parameter as the bull’s-eye on a target and of the sample statistic as an arrow fired at the target. Both bias and variability describe what happens when we take many shots at the target.

• Bias means that our aim is off and we consistently miss the bull’s-eye in the same direction. Our sample values do not center on the population value.

• High variability means that repeated shots are widely scattered on the target. Repeated samples do not give very similar results.

The Sampling Distribution

• The sampling distribution of a statistic is the distribution of the values taken by the statistic in all possible samples of the same size from the same population.

• When we sample, we sample with replacement.

• A sampling distribution is a sample space – it describes everything that can happen when we sample.

Central Limit Theorem

• If you take many SRSs of the same size n, the distribution of their means is centered at the true population mean. As n gets larger, that distribution gets closer and closer to a normal curve, no matter what the shape of the parent population.

• The sampling distribution of means has a mean of μ and a standard deviation of σ/√n.

CLT Summary

• The mean of the population (what we want to find) will be the same as the mean of all your many sample means.

• The standard deviation of all your many sample means will be the population standard deviation divided by √n (where n is your sample size).

• The histogram of the sample means will appear normal.

• The larger the sample size (n), the smaller the standard deviation will be and the more constricted the graph will be.
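The summary can be tried out on a population that is clearly not normal. The sketch below assumes a strongly right-skewed exponential population with mean 1 and standard deviation 1. The population, function name, and seed are illustrative choices, not part of the slides.

```python
import math
import random
import statistics

POP_MEAN = 1.0   # an exponential population with rate 1 has mean 1 ...
POP_SD = 1.0     # ... and standard deviation 1

def simulate_sample_means(n, num_samples=2000, seed=2):
    """Return the means of num_samples random samples of size n."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

results = {}
for n in (4, 25, 100):
    means = simulate_sample_means(n)
    # Store (mean of the sample means, SD of the sample means) for this n.
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(f"n={n:3d}  mean of means={results[n][0]:.3f}  "
          f"sd of means={results[n][1]:.3f}  "
          f"sigma/sqrt(n)={POP_SD / math.sqrt(n):.3f}")
```

In each row the mean of the sample means should sit near 1 and their standard deviation near σ/√n, shrinking as n grows.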

Example

• The true average study time for a final exam in history is found to be 6 hours and 25 minutes with a standard deviation of 1 hour and 45 minutes. Assume the distribution is normal: N(6.417, 1.75), in hours.

– What is the probability that a student chosen at random spends more than 7 hours studying? Normalcdf(7,100,6.417,1.75) = 37%

– What is the probability that an SRS of 4 students will average more than 7 hours of studying? Normalcdf(7,100,6.417,1.75/√4) = 25.3%.

– Why did the probability go down?

• It is not very probable that a single student studies more than 7 hours. It is even less probable that a group of 4 averages more than 7 hours, because averages vary less than individual values.
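The two calculator results in this example can be reproduced without a TI calculator. This sketch uses `NormalDist` from Python's standard `statistics` module in place of Normalcdf. The variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma = 6.417, 1.75   # study time in hours, from the example

# One randomly chosen student: X ~ N(mu, sigma)
individual = NormalDist(mu, sigma)
p_one = 1 - individual.cdf(7)

# Mean of an SRS of 4 students: x-bar ~ N(mu, sigma / sqrt(n))
n = 4
mean_of_four = NormalDist(mu, sigma / math.sqrt(n))
p_avg = 1 - mean_of_four.cdf(7)

print(f"P(one student > 7 hours) = {p_one:.1%}")
print(f"P(average of 4 > 7 hours) = {p_avg:.1%}")
```

The only change between the two calculations is the standard deviation, which drops from 1.75 to 1.75/√4 = 0.875.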

Example 2

• The length of pregnancy from conception to birth varies normally with a mean of 266 days and a standard deviation of 16 days.

– What is the probability that a woman chosen at random has a pregnancy lasting more than 270 days? 40.1%

– What is the probability that an SRS of 16 women has pregnancies averaging more than 270 days? 15.9%

– What are the mean and standard deviation of my sampling distribution?

μ_X̄ = μ = 266 and σ_X̄ = σ/√n = 16/√16 = 4
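The two probabilities in Example 2 can also be checked by brute force instead of a normal calculation. The sketch below simulates many samples of pregnancy lengths and counts how often the sample mean exceeds 270 days. The function name, number of trials, and seed are illustrative assumptions.

```python
import random
import statistics

def fraction_of_sample_means_above(threshold, mu, sigma, n, trials=20000, seed=3):
    """Estimate P(sample mean > threshold) for samples of size n from N(mu, sigma)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if statistics.fmean(sample) > threshold:
            count += 1
    return count / trials

single = fraction_of_sample_means_above(270, 266, 16, n=1)   # one woman
group = fraction_of_sample_means_above(270, 266, 16, n=16)   # SRS of 16 women
print(f"one woman:      {single:.3f}")
print(f"mean of 16:     {group:.3f}")
```

The simulated fractions should land close to the 40.1% and 15.9% given above, up to simulation noise.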

What if we’re talking about proportions?

ρ̂ = (count of "successes" in sample) / (size of sample) = X/n

Provided that the population is much larger than the sample, the count X will follow a binomial distribution.

μ_X = nρ and σ_X = √(nρ(1 − ρ))

• Choose an SRS of size n from a large population with population proportion ρ having some characteristic of interest. Let ρ̂ be the proportion of the sample having that characteristic. Then:

• The mean of the sampling distribution of ρ̂ is exactly ρ.

• The standard deviation of the sampling distribution of ρ̂ is √(ρ(1 − ρ)/n).
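The formulas for the count X and for the sample proportion are linked: dividing the binomial mean nρ and standard deviation √(nρ(1 − ρ)) by n gives ρ and √(ρ(1 − ρ)/n). The sketch below checks this with the Survivor II numbers (n = 100, ρ = 0.37). The function names are made up for illustration.

```python
import math

def count_mean_sd(n, p):
    """Mean and SD of the binomial count X of successes."""
    return n * p, math.sqrt(n * p * (1 - p))

def proportion_mean_sd(n, p):
    """Mean and SD of the sample proportion X / n."""
    return p, math.sqrt(p * (1 - p) / n)

n, p = 100, 0.37
mu_x, sd_x = count_mean_sd(n, p)
mu_phat, sd_phat = proportion_mean_sd(n, p)

print(f"count:      mean = {mu_x:.1f}, sd = {sd_x:.3f}")
print(f"proportion: mean = {mu_phat:.2f}, sd = {sd_phat:.4f}")
# Dividing the count's mean and SD by n recovers the proportion's mean and SD.
print(f"check:      {mu_x / n:.2f}, {sd_x / n:.4f}")
```

The standard deviation of the proportion comes out a little under 0.05, in line with the spread described for samples of size 100.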

Rule of Thumb

1. Use σ/√n (for x̄) or √(ρ(1 − ρ)/n) (for ρ̂) only when the population is at least 10 times as large as the sample.

2. We will use the normal approximation to the sampling distribution of ρ̂ for values of n and ρ that satisfy nρ ≥ 10 and n(1 − ρ) ≥ 10.
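Both rules of thumb are simple inequalities, so they are easy to check mechanically. The helper below is a hypothetical sketch, and the population sizes passed to it are made-up round numbers used only for illustration.

```python
def check_rules_of_thumb(n, p, population_size):
    """Return which of the two rules of thumb are satisfied."""
    return {
        # Rule 1: population at least 10 times as large as the sample.
        "sd_formula_ok": population_size >= 10 * n,
        # Rule 2: expected successes and failures are both at least 10.
        "normal_approx_ok": n * p >= 10 and n * (1 - p) >= 10,
    }

# SRS of 100 from a very large population with proportion 0.37
# (200,000,000 is an assumed population size, not a figure from the slides).
large_case = check_rules_of_thumb(n=100, p=0.37, population_size=200_000_000)

# A small hypothetical case where both rules fail.
small_case = check_rules_of_thumb(n=20, p=0.10, population_size=150)

print(large_case)
print(small_case)
```

In the small case the population (150) is less than 10 × 20 = 200, and the expected number of successes is only 20 × 0.10 = 2, so neither rule holds.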

Exercise 9.19 Do you drink the cereal milk? A USA Today poll asked a random sample of 1012 U.S. adults what they do with the milk in the bowl after they have eaten the cereal. Of the respondents, 67% said that they drink it. Suppose that 70% of U.S. adults actually drink the cereal milk.

(a) Find the mean and standard deviation of the proportion ρ̂ of the sample that say they drink the cereal milk.

(b) Explain why you can use the formula for the standard deviation of ρ̂ in this setting (rule of thumb 1).

(c) Check that you can use the normal approximation of the distribution of ρ̂ (rule of thumb 2).

(d) Find the probability of obtaining a sample of 1012 adults in which 67% or fewer say they drink the cereal milk. Do you have any doubts about the result of this poll?

(e) What sample size would be required to reduce the standard deviation of the sample proportion to half the value you found in (a)?
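After working the exercise by hand, one way to check the numerical parts is the sketch below. It assumes the normal approximation from rule of thumb 2 and uses a rough, assumed figure for the number of U.S. adults, since the exercise does not give one. It covers the arithmetic only, not the written explanations asked for in (b) and (d).

```python
import math
from statistics import NormalDist

p, n = 0.70, 1012    # assumed true proportion and sample size, from the exercise
observed = 0.67      # sample proportion reported by the poll

# (a) mean and standard deviation of the sample proportion
mean_phat = p
sd_phat = math.sqrt(p * (1 - p) / n)

# (b) rule of thumb 1; the population size is a rough assumption
us_adults = 200_000_000
sd_formula_ok = us_adults >= 10 * n

# (c) rule of thumb 2
normal_ok = n * p >= 10 and n * (1 - p) >= 10

# (d) probability of a sample proportion of 0.67 or less
prob = NormalDist(mean_phat, sd_phat).cdf(observed)

# (e) halving the standard deviation requires 4 times the sample size
n_half = 4 * n
sd_half = math.sqrt(p * (1 - p) / n_half)

print(f"(a) mean = {mean_phat:.2f}, sd = {sd_phat:.4f}")
print(f"(b) {sd_formula_ok}  (c) {normal_ok}")
print(f"(d) P(sample proportion <= 0.67) = {prob:.4f}")
print(f"(e) n = {n_half}, new sd = {sd_half:.4f}")
```

Part (e) works because the sample size sits under a square root: multiplying n by 4 divides the standard deviation by 2.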
