Post on 20-Dec-2015
Lecture 4
Variability: Standard Deviation
Variability
Reminder - variability is how spread out the scores are.
Range - how does the range of each of these distributions vary? Or the interquartile range?
Measure of error - is our sample similar to the population, OR is an individual score representative of its sample?
Standard Deviation
Standard deviation - the average distance on either side of the mean.
Goal of the SD is to measure the standard, or typical, distance from the mean.
– But measuring every distance directly is not practical with large N, so we need to estimate the variance and standard deviation using equations
[Figure: bar chart of height (in.) for Ben, Tom, Bill, James, and Matt; vertical axis from 60 to 76 in.]
• Mean = 70.8
• Ben is 66 in. tall. His deviation from the mean is -4.8.
• James is 75 in. tall. His deviation from the mean is +4.2.
How much scores typically vary around the mean; a measure of dispersion
Usually 1/5 - 1/6 of the range
Based on the mean, therefore:
– Requires at least interval data
– Sensitive to outliers
– Accounts for all scores in a distribution
Standard Deviation
[Figure: frequency distribution (f) of scores on an axis from 1 to 9, with the mean M marked.]
Logic of the Standard Deviation: Let’s start by looking at the population
Step 1: Find the deviation of each score from the mean: X - μ. Be sure to include both the sign (+/-) and the number.
  X     X - μ
 65     -14
 90     +11
 84      +5
 76      -3
 81      +2
 98     +19
 82      +3
 56     -23
μ = 79   Σ(X - μ) = 0
* Notice that the sum of the deviations = 0. This reflects the fact that the mean is a balancing point
* Bonus - you can use this fact to check yourselves
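The balancing-point check can be tried in a few lines of Python. This is only a minimal sketch using the eight scores from Step 1; the variable names are illustrative and not part of the lecture.

```python
# Scores from the Step 1 table (a population of N = 8).
scores = [65, 90, 84, 76, 81, 98, 82, 56]

mu = sum(scores) / len(scores)           # population mean
deviations = [x - mu for x in scores]    # signed distance of each score from the mean

print(mu)               # 79.0
print(deviations)       # [-14.0, 11.0, 5.0, -3.0, 2.0, 19.0, 3.0, -23.0]
print(sum(deviations))  # 0.0
```

If the deviations do not sum to (approximately) zero, either the mean or one of the subtractions is wrong.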
Step 2 - Remember, the standard deviation is meant to be the average of the deviations, but a simple average won’t work because the sum of our deviations = 0
– Solution = get rid of the signs (+/-)
– Square each deviation
Square each deviation and sum them = Sum of Squared Deviations = SS
  X     X - μ    (X - μ)²
 65     -14.4     207.4
 90      10.6     112.4
 84       4.6      21.2
 76      -3.4      11.6
 81       1.6       2.6
 98      18.6     346.0
 82       2.6       6.8
 59     -20.4     416.2
μ = 79.4   Σ(X - μ) = 0   Σ(X - μ)² = 1123.9
* Sum of Squared Deviations = SS
Step 3 - Calculate the mean squared deviation = SS / N
This value is called the variance and is represented with the symbol MS or σ².
Variance will be important for use in inferential stats methods, but it isn’t the best descriptive stat.
– It’s hard to visualize variability with the variance alone.
SS = Σ(X - μ)² = 1123.9 (Sum of Squared Deviations)
MS = σ² = SS / N = 1123.9 / 8 = 140.5
Step 4: Correct for having squared all the deviations, because we want a value on the same scale as the mean that we can visualize:
– Standard deviation = √variance
σ = √140.5 = 11.9
Standard deviation = the square root of the mean squared deviation
Conceptually, this is the typical distance from the mean: a score drawn at random from this distribution will typically lie about 11.9 points away from the mean.
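The four steps can be sketched in Python as a check on the hand calculation. This assumes the eight scores are the entire population (so SS is divided by N); the variable names are illustrative.

```python
import math

# Scores from the worked example, treated as a whole population (N = 8).
scores = [65, 90, 84, 76, 81, 98, 82, 59]
N = len(scores)

mu = sum(scores) / N                      # mean
deviations = [x - mu for x in scores]     # Step 1: deviation of each score
ss = sum(d ** 2 for d in deviations)      # Step 2: sum of squared deviations (SS)
variance = ss / N                         # Step 3: mean squared deviation
sigma = math.sqrt(variance)               # Step 4: undo the squaring

print(round(mu, 1), round(ss, 1), round(variance, 1), round(sigma, 1))
# 79.4 1123.9 140.5 11.9
```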
Putting it Together
μ = 79.4, σ = 11.9
What can we say about a score that lies 12 points from the mean (a score of 91)?
What about a score that lies 30 points from the mean (a score of 49)?
REVIEW:
Variance = mean squared deviation = σ² (Greek lowercase sigma, squared) = SS / N
Standard deviation = σ = √(SS / N)
Computing SS:
– Definitional formula: SS = Σ(X - μ)²
Shows exactly how scores vary about the mean (like we just did). Works best on whole numbers.
– Computational formula: SS = ΣX² - [(ΣX)² / N]
Easier for calculations because it works directly with the scores, but less intuitive about the mean.
Population Standard Deviation
Formulas for Pop. SD and Variance
Variance = SS / N (mean squared deviation)
Standard deviation = √(SS / N)
Denoted by the Greek letters σ (standard deviation) and σ² (variance)
Let’s Do It Together
   X     X - μ    (X - μ)²      X²
  24     -17.4     302.8       576
  24     -17.4     302.8       576
  28     -13.4     179.6       784
  32      -9.4      88.4      1024
  33      -8.4      70.6      1089
  48       6.6      43.6      2304
  64      22.6     510.8      4096
  42       0.6       0.36     1764
  38      -3.4      11.6      1444
  67      25.6     655.4      4489
  55      13.6     185        3025
Σ 455       0     2351       21171
(ΣX)² = 207025   σ² = 213.7   σ = 14.6
Definitional: SS = Σ(X - μ)²
Computational: SS = ΣX² - [(ΣX)² / N]
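To see that the definitional and computational formulas give the same SS, here is a short Python sketch using the eleven scores above. It treats them as a population (dividing by N), and the variable names are illustrative.

```python
import math

scores = [24, 24, 28, 32, 33, 48, 64, 42, 38, 67, 55]
N = len(scores)
mu = sum(scores) / N

# Definitional formula: sum the squared deviations from the mean.
ss_definitional = sum((x - mu) ** 2 for x in scores)

# Computational formula: work directly with the raw scores.
sum_x = sum(scores)                     # ΣX
sum_x2 = sum(x ** 2 for x in scores)    # ΣX²
ss_computational = sum_x2 - sum_x ** 2 / N

print(sum_x, sum_x2)                                       # 455 21171
print(math.isclose(ss_definitional, ss_computational))     # True

variance = ss_computational / N
print(round(variance, 1), round(math.sqrt(variance), 1))   # 213.7 14.6
```

The unrounded SS is about 2350.5; the hand-worked total of 2351 is slightly higher because intermediate values in the table are rounded.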
Another Example…
Find σ for the following sets of numbers:
X = 1, 7, 7, 9
X = 1, 6, 1, 1, 1, 1
  X     X²     (ΣX)²     σ²     σ
 10
 15
 17
 21
 24
 31
 13
Definitional: SS = Σ(X - μ)²
Computational: SS = ΣX² - [(ΣX)² / N]
Samples vs. Populations
Rationale: Inferential statistics rely on samples to draw general conclusions about the population.
– PROBLEM - sample variability tends to be less than population variability.
– Thus, sample variability is biased. That is, it underestimates the pop. variability.
[Figure: sample scores (x) clustered within a narrower span (sample variability) than the full population distribution (pop. variability).]
Terms
Biased - a sample statistic is said to be biased if, on average, it consistently underestimates or overestimates the population parameter.
Unbiased - a sample statistic is said to be unbiased if, on average, it is equal to the population parameter.
An Analogy for a Biased Stat
Imagine you were interested in studying learning in elementary school children.
– What if you chose as your sample child geniuses from computer and science camp?
– Could you generalize from your sample to the population of elementary school children?
A sample statistic for SD will be biased even with a representative sample - we have to perform a correction
Samples: s and s²
Changes in notation to reflect a sample:
– So to calculate SS (same as for pop.):
• (1) Find each deviation: X - M
• (2) Square each deviation: (X - M)²
• (3) Sum the squared deviations: SS = Σ(X - M)²
Correcting for the bias is done in the calculation of the mean squared deviation, or variance:
– Sample variance: s² = SS / (n - 1)
– Sample standard deviation: s = √(SS / (n - 1)), or s = √s²
Let’s Do it Together
[Figure: frequency distribution (f) of the scores X on an axis from 1 to 9.]
  X     X²
  4     16
  5     25
  6     36
  6     36
  6     36
  7     49
  7     49
  7     49
  8     64
  8     64
  8     64
  8     64
  9     81
  9     81
ΣX = 98   ΣX² = 714
Setting aside the scores that sit exactly on the mean (M = 7), the smallest distance from the mean is 1 and the largest distance is 3, so the SD should be somewhere in between.
SS = ΣX² - [(ΣX)² / n]
SS = 714 - (98² / 14) = 28
* NOTE: do not correct for bias in SS
s² or MS = SS / (n - 1)
s² or MS = 28 / 13 = 2.2
s = √2.2 = 1.5
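The same sample calculation can be sketched in Python. The cross-check uses `statistics.stdev` from the standard library, which also divides by n - 1; the variable names are illustrative.

```python
import math
import statistics

scores = [4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9]
n = len(scores)

# SS is computed the same way as for a population (no correction here).
ss = sum(x ** 2 for x in scores) - sum(scores) ** 2 / n

# The bias correction enters only when SS is averaged: divide by n - 1.
s2 = ss / (n - 1)
s = math.sqrt(s2)

print(ss)                          # 28.0
print(round(s2, 1), round(s, 1))   # 2.2 1.5

# statistics.stdev uses the n - 1 denominator, so it should agree.
print(math.isclose(s, statistics.stdev(scores)))   # True
```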
Start Easy: Find s
X = 5, 1, 5, 5
X = 1, 7, 1, 1
• NOTE: do not correct for bias in SS
SS = ΣX² - [(ΣX)² / n]
s² or MS = SS / (n - 1)
s = √s²
A little more complex
    X           X²
 322.84     104223.10
 336.63     113319.68
 368.80     136011.73
 276.84      76638.36
 512.20     262348.68
 285.05      81251.07
 239.68      57446.63
 262.86      69094.12
 302.13      91283.72
 300.12      90071.31
 326.62     106683.06
 257.65      66383.09
 429.81     184733.66
 291.71      85093.66
 263.15      69247.72
 323.49     104644.41
ΣX = 5099.6   ΣX² = 1698474.01
SS = ΣX² - [(ΣX)² / n]
MS or s² = SS / (n - 1)
s = √(SS / (n - 1))
SS = 1698474.01 - (26005920.2 / 16) = 73104
MS = 73104 / 15
s = 69.8
Sample Variability and Degrees of Freedom: Why do we correct with n - 1?
(1) The deviations computed from a sample are not “real” deviations.
Sampling error - sample and pop. are close, but not exact.
SS is smaller for the sample - math. proof
Using a sample mean places a restriction on the variability
Deviations from the population mean (μ = 8):
  X    X - μ   (X - μ)²
 12     +4       16
  8      0        0
 10     +2        4
SS = 20
Deviations from the sample mean (M = 10):
  X    X - M   (X - M)²
 12     +2        4
  8     -2        4
 10      0        0
SS = 8
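SS taken around a sample's own mean is never larger than SS taken around any other value, including μ. A small Python sketch with the three scores in this example illustrates the point (the helper name `sum_of_squares` is illustrative):

```python
def sum_of_squares(scores, center):
    """Sum of squared deviations of the scores from a chosen center."""
    return sum((x - center) ** 2 for x in scores)

sample = [12, 8, 10]
M = sum(sample) / len(sample)   # sample mean = 10.0
mu = 8                          # population mean given in the example

print(sum_of_squares(sample, mu))   # 20
print(sum_of_squares(sample, M))    # 8.0
```

Because the sample's SS is measured from M rather than from μ, it tends to come out too small, which is why the average is taken over n - 1 instead of n.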
More about n - 1
Sample mean is known before deviations and SS can be computed.
Sample of n = 3 with M = 10. Therefore, as soon as the first two values are given (X = 12, 8), you know the last value is 10.
n - 1 scores can vary; the last score is not free to vary
Degrees of Freedom
df is commonly encountered as n - 1, where n is the number of scores in the sample
Refers to the number of scores in a distribution that are free to vary once M and n are set
Example: {5, 10, 15}; n = 3; M = 10
How many scores could you change and still have n = 3 & M = 10?
n - 1 = 2
So, s² = SS / (n - 1) = SS / df
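A brief Python sketch of why only n - 1 scores are free: once n and M are fixed, the scores must sum to n × M, so the last score is determined by the others. The variable names are illustrative.

```python
n = 3
M = 10

# Choose any values for the first n - 1 scores ...
free_scores = [5, 10]

# ... and the last score is forced, because the scores must sum to n * M.
last_score = n * M - sum(free_scores)
print(last_score)   # 15

# A different choice for the free scores forces a different last score.
free_scores = [12, 8]
print(n * M - sum(free_scores))   # 10
```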
Cafeteria degrees of freedom: An analogy
You are 4th in line at the cafeteria to choose your dessert. The choices are a cheesecake, a piece of fruit, pumpkin pie, and a stale cookie.
– The first person chooses the cheesecake
– Next to go is the piece of fruit (an apple)
– Then the pumpkin pie
– The last choice is restricted and can’t vary.
You are stuck with the stale cookie
Degrees of Freedom
Why n - 1?
– Because you are estimating μ from M. Once this is done, the estimate is fixed & cannot be changed. Therefore, only n - 1 scores can vary around this fixed value
This is the case whenever we are estimating a parameter from a statistic.
A little more about biased stats
Population N = 6 (0, 0, 3, 3, 9, 9): μ = 4, σ² = 14
Take all possible n = 2 samples
Sample   First score   Second score   Mean   Biased variance (n)   Unbiased variance (n - 1)
  1          0              0          0          0                    0
  2          0              3          1.5        2.25                 4.5
  3          0              9          4.5       20.25                40.5
  4          3              0          1.5        2.25                 4.5
  5          3              3          3          0                    0
  6          3              9          6          9                   18
  7          9              0          4.5       20.25                40.5
  8          9              3          6          9                   18
  9          9              9          9          0                    0
Total                                  36         63                  126
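The table can be reproduced, and its column totals averaged, with a short Python sketch. It assumes, as the nine rows suggest, that samples are ordered pairs drawn from the three distinct score values; the variable and helper names are illustrative.

```python
from itertools import product

values = [0, 3, 9]   # distinct scores in the population (each occurs twice)

means, biased, unbiased = [], [], []
for sample in product(values, repeat=2):    # the 9 ordered samples of n = 2
    n = len(sample)
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    means.append(m)
    biased.append(ss / n)            # divide by n
    unbiased.append(ss / (n - 1))    # divide by n - 1

def average(xs):
    return sum(xs) / len(xs)

print(average(means))      # 4.0   equals the population mean
print(average(biased))     # 7.0   underestimates the population variance of 14
print(average(unbiased))   # 14.0  equals the population variance
```

Averaged over all samples, dividing by n gives 63 / 9 = 7, half the true variance, while dividing by n - 1 gives 126 / 9 = 14.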
Properties of the Standard Deviation
Distribution:
– Homogeneous sample: data values are very similar = small s² and s.
– Heterogeneous sample: data values are dissimilar = big s² and s.
Helps make predictions about the amount of error in your sample. How close is your sample to the population?
Properties of the Standard Deviation
Transforming scores:
Adding or subtracting a constant does not change the SD
[Figure: two frequency distributions (f) with the same shape, one on an axis from 1 to 9 and the other shifted to higher scores.]
Another way to determine if the SD is affected by a constant is to pick any two scores and calculate the distance between the two, both before and after the constant is applied
e.g. you and a friend compare scores on an exam: your friend earned an 85 and you earned a 90. Later you find out that a 5-point curve was added to everyone’s score. The scores are now 90 and 95, still 5 points apart.
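A minimal Python sketch of the exam-curve example. Only the scores 85 and 90 come from the example; the other two scores are hypothetical, added so there is a small distribution to work with.

```python
import math
import statistics

scores = [85, 90, 78, 92]           # 85 and 90 from the example; 78 and 92 are made up
curved = [x + 5 for x in scores]    # 5-point curve added to every score

# The distance between any two scores is unchanged by the curve.
print(scores[1] - scores[0], curved[1] - curved[0])   # 5 5

# So the SD is unchanged as well.
print(math.isclose(statistics.pstdev(scores), statistics.pstdev(curved)))   # True
```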
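Properties of the Standard Deviation
Transforming scores:
Multiplying or dividing every score by a constant multiplies or divides the SD by that same constant
Another way to determine if the SD is affected by a constant is to pick any two scores and calculate the distance between the two, both before and after the constant is applied
[Figure: two frequency distributions (f), one on an axis of 1, 2, 3, 4 and one on an axis of 10, 20, 30, 40; the distance between adjacent scores grows from 1 to 10.]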
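The scores on the two axes in the figure (1 to 4 and 10 to 40) can be used for a quick Python check. This is a sketch with illustrative variable names.

```python
import math
import statistics

scores = [1, 2, 3, 4]
scaled = [10 * x for x in scores]   # 10, 20, 30, 40

sd_before = statistics.pstdev(scores)
sd_after = statistics.pstdev(scaled)

# The distance between neighbouring scores grows from 1 to 10 ...
print(scores[1] - scores[0], scaled[1] - scaled[0])   # 1 10

# ... and the SD grows by the same factor of 10.
print(math.isclose(sd_after, 10 * sd_before))         # True
```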
Factors that affect Variability
Extreme Scores:
– Range is most affected
– SD and variance somewhat affected
– SIR (semi-interquartile range) not affected
Sample Size:
– Range is directly related to sample size. This is unacceptable.
– SD, variance, and SIR unaffected by sample size
Open-ended Distributions:
– Cannot compute range, SD, or variance
– SIR is your only option
Relationship with other Statistics
SD is derived using information about the mean (distances) - the two go hand-in-hand
Interquartile range (& SIR) are based on percentiles, and so is the median (mdn is the 50th percentile)
Range has no direct relationship with any other statistical measures
Why we need to know this information
Variability influences how easy it is to see patterns in our data….
Estimate M for each sample
Sample 1   Sample 2
   X          X
  34         26
  35         10
  36         64
  35         40
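A short Python sketch comparing the two samples. It uses `statistics.stdev` (the n - 1 version) on the assumption that these are samples rather than populations.

```python
import statistics

sample_1 = [34, 35, 36, 35]
sample_2 = [26, 10, 64, 40]

for name, sample in [("Sample 1", sample_1), ("Sample 2", sample_2)]:
    m = sum(sample) / len(sample)
    s = statistics.stdev(sample)    # sample SD, divides by n - 1
    print(name, m, round(s, 1))
# Sample 1 35.0 0.8
# Sample 2 35.0 22.9
```

Both samples have the same mean, but it is far easier to guess M = 35 by eye from Sample 1, where the scores barely vary.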
Why we need to know this information
Keep the goal in mind:
– Research uses samples to deduce information about the population
– Consider the data from two experiments and determine whether or not there appears to be a consistent difference
Talk therapy: M = 20
Meditation: M = 40
[Figure: two panels, Experiment 1 and Experiment 2, each showing frequency (f) distributions for the two groups on a score axis from 5 to 60.]
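Graphical Representation of the Standard Deviation
[Figure: frequency distribution (f) of scores on an axis from 1 to 9, annotated with a standard deviation of 1.58.]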
Graphic Representation - Box Plots
Also called box-and-whisker plots
Useful for
– comparing distributions
– displaying variability
Box defines the interquartile range
– Top line defines the third quartile
– Bottom line defines the first quartile
Whiskers extend out to the highest and lowest scores
Median is often displayed by a line
Graphic Representation - Boxplots
Pearson’s Coefficient of Skew
Pearson’s coefficient of skew tells us whether a distribution is positively or negatively skewed, and by how much (within +/- 0.5 is approximately symmetric/normal)
s3 = [3(M - mdn)] / s
Example: M = 20, s = 5, mdn = 24
s3 = [3(20 - 24)] / 5
s3 = -2.4
Negatively skewed
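The formula translates directly into a small Python function. The function name `pearson_skew` is illustrative, and the call below repeats the worked example.

```python
def pearson_skew(mean, median, sd):
    """Pearson's coefficient of skew: 3 * (mean - median) / SD."""
    return 3 * (mean - median) / sd

print(pearson_skew(mean=20, median=24, sd=5))   # -2.4
```

A negative result means the mean lies below the median, that is, the tail is pulled toward the low scores.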
Try one: M = 50, Mdn = 30, s = 7
s3 = [3(M - mdn)] / s
Putting it all together…
Find Pearson’s coefficient of skew
s3 = [3(M - mdn)] / s
  X     f
  1     1
  2     1
  3     1
  4     1
  5     1
  6     2
  7     4
  8     5
  9     6
 10     9
 11    11
 12     6
 13     2
For this table s = 2.74
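One way to check the work is to expand the frequency table into individual scores in Python. This sketch assumes the simple median (the middle of the sorted scores, with no interpolation within intervals) and the sample SD with n - 1; a textbook method that interpolates the median would give a slightly different coefficient.

```python
import statistics

# Frequency table: score -> frequency.
freq = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 4,
        8: 5, 9: 6, 10: 9, 11: 11, 12: 6, 13: 2}

# Expand the table into one entry per score.
scores = [x for x, f in freq.items() for _ in range(f)]

n = len(scores)
M = sum(scores) / n
mdn = statistics.median(scores)     # simple median, no interpolation
s = statistics.stdev(scores)        # sample SD, divides by n - 1

skew = 3 * (M - mdn) / s
print(n, M, mdn, round(s, 2), round(skew, 2))
# 50 9.16 10.0 2.74 -0.92
```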
Homework: Chapter 4
1, 3, 4, 6, 8, 11, 12, 14, 19, 20, 23, 24, 25
Read IN THE LITERATURE pg 122-123.
Skim Chapter 6 pages 161 - 166; section on Probability.
** BRING YOUR TEXT BOOKS TO CLASS TOMORROW**