Hss2381a – stats and stuff The Normal Curve, part 1

Upload: mili

Post on 05-Jan-2016

Page 1: Hss2381a – stats and stuff

Hss2381a – stats and stuff

The Normal Curve, part 1

Page 2: Hss2381a – stats and stuff

No class on Thursday!

Page 3: Hss2381a – stats and stuff

Interdisciplinary Journal of Health Sciences

• WANTED: Seeking applicants for the 2011-2012 editorial team

• Students in both the English and French HSS streams are encouraged to apply.

• Send an email expressing your interest in the position to [email protected], with your resume attached.

• Successful candidates will be invited to a panel interview.

• Deadline to apply: Wednesday, September 28th, 2011

Page 4: Hss2381a – stats and stuff

Last time….

• We covered measures of central tendency:
– Mode
– Median
– Mean

• And two measures of variability:
– Range
– Interquartile Range

Page 5: Hss2381a – stats and stuff

Two More Measures of Variability

• Standard deviation

• Variance

Page 6: Hss2381a – stats and stuff

The Standard Deviation

• Standard deviation (SD or σ): An index that conveys how much, on average, scores in a distribution vary

• SDs are based on deviation scores (x), calculated by subtracting the mean from each person’s original score

x = X - M

Page 7: Hss2381a – stats and stuff

Standard Deviation Interpretation

• In a normal distribution, a fixed percentage of cases lie within certain distances from the mean:

Page 8: Hss2381a – stats and stuff

Example

• We weigh 10 students and collect their weight in pounds:
– 110 120 130 140 150 150 160 170 180 190

• What is the mean? (M) 150

For the lightest person, their weight is the mean – 40
For the heaviest person, their weight is the mean + 40

Page 9: Hss2381a – stats and stuff

What’s a deviation?

• A “deviation” is how much each data point deviates from the mean
– So for X1 the deviation is -40
– And for X10 the deviation is +40

• So what’s a “standard deviation”?

• It’s some sort of measure of how much the “typical” data point deviates from the mean

Page 10: Hss2381a – stats and stuff

Let’s go back to our data…

• Mean = 150

Data (weights in pounds)    Deviation from Mean
110                         -40
120                         -30
130                         -20
140                         -10
150                           0
150                           0
160                          10
170                          20
180                          30
190                          40
TOTAL                         0
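As a quick check of the deviation table, the same numbers can be recomputed in a few lines of Python. This is only a minimal sketch; the variable names are illustrative and do not come from the slides.

```python
# Weights in pounds for the 10 students in the example.
weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]

# Mean (M): sum of the scores divided by the number of scores.
mean = sum(weights) / len(weights)

# Deviation score for each student: x = X - M.
deviations = [w - mean for w in weights]

print(mean)             # 150.0
print(deviations)       # [-40.0, -30.0, -20.0, -10.0, 0.0, 0.0, 10.0, 20.0, 30.0, 40.0]
print(sum(deviations))  # 0.0
```

The negative and positive deviations cancel out, which is why the TOTAL row is 0.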

Page 11: Hss2381a – stats and stuff

Defining Standard Deviation

• The sum of all deviation scores in a distribution always = 0

• So, to compute SDs, deviation scores must be squared (x²) before being summed

• SD equation: SD = √(Σx² ÷ (N − 1))

Page 12: Hss2381a – stats and stuff

Standard Deviation (cont’d)

Weights (pounds): 110 120 130 140 150 150 160 170 180 190

Deviation scores (x) for M = 150: -40 -30 -20 -10 0 0 10 20 30 40

Squared deviation scores (x²): 1600 900 400 100 0 0 100 400 900 1600

Sum of squared deviation scores: 1600+900+400+100+0+0+100+400+900+1600 = 6000

SD = √(6000 ÷ (N − 1)) = √(6000 ÷ 9) = 25.82
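The hand calculation can be reproduced step by step in Python. The sketch below follows the slide's steps (deviation, square, sum, divide by N − 1, square root) and then compares the result with `statistics.stdev` from the standard library, which also divides by N − 1. Variable names are illustrative.

```python
import math
import statistics

weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]

# Step 1: the mean.
mean = sum(weights) / len(weights)

# Step 2: squared deviation scores, (X - M) squared.
squared_deviations = [(w - mean) ** 2 for w in weights]

# Step 3: sum of squared deviation scores.
sum_of_squares = sum(squared_deviations)

# Step 4: divide by N - 1 and take the square root.
sd = math.sqrt(sum_of_squares / (len(weights) - 1))

print(sum_of_squares)                       # 6000.0
print(round(sd, 2))                         # 25.82
print(round(statistics.stdev(weights), 2))  # 25.82
```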

Page 13: Hss2381a – stats and stuff

A little bit about notation

σ (“sigma”) = standard deviation in the reference population

s (lower case “s”) = standard deviation in the sample

The textbook uses “SD” for both

Page 14: Hss2381a – stats and stuff

Standard Deviation Interpretation

• Provides a “standard”—the SD indicates the average amount of deviation of scores from the mean

• Tells you how wrong, on average, the mean is as a summary of the overall distribution

• An SD provides valuable information when the distribution is normal:
– Nearly all cases fall within approximately three SDs above and below the mean in a normal distribution

Page 15: Hss2381a – stats and stuff

Standard Deviation Interpretation (cont’d)

• In a normal distribution, a fixed percentage of cases lie within certain distances from the mean:

Page 16: Hss2381a – stats and stuff

SDs and Individual Scores

• A person who scores one SD below the mean has a higher score than 16% of the cases (2.3% + 13.6%)

• A person who scores one SD above the mean has a higher score than 84% of the cases (50.0% + 34.1%)
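The 16% and 84% figures can be checked against the standard normal curve. A minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of cases scoring below one SD under the mean, and below one SD over it.
below_minus_one = standard_normal.cdf(-1)
below_plus_one = standard_normal.cdf(1)

print(round(below_minus_one * 100, 1))  # 15.9
print(round(below_plus_one * 100, 1))   # 84.1
```

The slide's 16% and 84% are these values rounded to whole percentages.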

Page 17: Hss2381a – stats and stuff

Standard Deviation: Advantages

• Takes all data into account in describing variability

• Is more stable as a measure of variability than the range or IQR

• Lends itself to computation of other measures often used in inferential statistics

• Is helpful in interpreting individual scores when data are distributed approximately normally

Page 18: Hss2381a – stats and stuff

Standard Deviation: Disadvantages

• Can be influenced by extreme scores

• Not as “intuitive” or as easy to interpret as the range

Page 19: Hss2381a – stats and stuff

Variance

• An important variability concept in inferential statistics, but not used descriptively

• The variance = SD²

• In earlier example, SD² = 25.82² = 666.67

• Not easily interpreted because it is not in units of original data; it is in units squared (here, pounds squared)

Page 20: Hss2381a – stats and stuff

More about notation

σ (“sigma”) = standard deviation in the reference population

s (lower case “s”) = standard deviation in the sample

σ² (“sigma squared”) = variance in the reference population

s² = variance in the sample

Page 21: Hss2381a – stats and stuff

Formulae for Variance

Population variance: σ² = Σ(X − μ)² ÷ N

Sample variance: s² = Σ(X − M)² ÷ (N − 1)
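The population and sample variance differ only in the denominator: N for a population, N − 1 for a sample. As a rough illustration, the standard library's `statistics.pvariance` and `statistics.variance` apply those two denominators to the weight data from the earlier example.

```python
import statistics

weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]

# Population variance: sum of squared deviations divided by N (6000 / 10).
population_variance = float(statistics.pvariance(weights))

# Sample variance: sum of squared deviations divided by N - 1 (6000 / 9).
sample_variance = float(statistics.variance(weights))

print(population_variance)        # 600.0
print(round(sample_variance, 2))  # 666.67
```

The sample variance matches the SD² = 666.67 worked out on the Variance slide.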

Page 22: Hss2381a – stats and stuff

Measurement Scales and Descriptive Statistics

Scale                 Central Tendency Index    Variability Index
Nominal               Mode                      --
Ordinal               Median                    Range, IQR
Interval and ratio    Mean                      Standard deviation, Variance

Page 23: Hss2381a – stats and stuff

Relative Standing

• Central tendency and variability indexes describe a distribution

• There are also descriptive statistics to describe individual scores, i.e., their relative standing or position in a distribution:
– Percentile ranks
– Standard scores

Page 24: Hss2381a – stats and stuff

Percentiles

• A percentile is one one-hundredth of a distribution

• Quartiles divide a distribution into quarters

• Deciles divide a distribution into tenths

• Each percentile, quartile, etc. can be determined in relation to a score in a distribution

Page 25: Hss2381a – stats and stuff

Percentile Rank

• A percentile rank is the location of a given score in the distribution—it communicates what percentage of cases fall at or below that value

– Score → What percentile rank?
– Percentile → What score?
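A percentile rank, as defined above, is the percentage of cases at or below a given score. The sketch below applies that definition to the weight data from the earlier example; `percentile_rank` is a hypothetical helper name, not something from the textbook or SPSS.

```python
def percentile_rank(scores, value):
    """Percentage of scores that fall at or below `value`."""
    at_or_below = sum(1 for s in scores if s <= value)
    return 100 * at_or_below / len(scores)

weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]

print(percentile_rank(weights, 150))  # 60.0
print(percentile_rank(weights, 190))  # 100.0
```

Six of the ten students weigh 150 pounds or less, so a weight of 150 has a percentile rank of 60.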

Page 26: Hss2381a – stats and stuff

Percentiles and Outliers

• Outliers are often defined in relation to percentiles

• There are:
– Mild outliers
– Extreme outliers

Page 27: Hss2381a – stats and stuff

NOT what we’re talking about

Page 28: Hss2381a – stats and stuff

An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. -Grubbs (Wikipedia)

In this course (as per the textbook), an outlier is a value that lies more than 1.5 times the IQR below Q1 or above Q3

Page 29: Hss2381a – stats and stuff

Outliers: Formal Definition

• A mild outlier is a score that is between 1.5 and 3.0 times the value of the IQR, below Q1 or above Q3

• An extreme outlier is a score that is greater than 3.0 times the value of the IQR, below Q1 or above Q3

Page 30: Hss2381a – stats and stuff

Box Plots

• A box plot (or box-and-whiskers plot) is a graphic depiction of a distribution that shows the median, the IQR, and the outer limits of values not considered outliers
– Outlying cases can be shown on the box plot, with identifying information (e.g., an ID number)

Page 31: Hss2381a – stats and stuff

Traditionally…

Page 32: Hss2381a – stats and stuff

But for the purposes of this course (due to the textbook’s insistence)…

The extent of the boxplot is NOT the range, but rather those data points that are NOT outliers

Page 33: Hss2381a – stats and stuff

Box Plots (cont’d)

• Bottom of “box” shows Q1

• Top of “box” shows Q3

• Horizontal line in box shows median

• “Whiskers” show outer limits of what is NOT an outlier
– In SPSS, a circle O indicates value and ID of a mild outlier
– An asterisk * is for an extreme outlier

Page 34: Hss2381a – stats and stuff

Box Plot Illustration – p52

Textbook Heart Rate Data:

Q1 = 62
Q2 = 66 = Median
Q3 = 68

“Whiskers” limits: 53, 77

Mild outliers: 50 (#106), 45 (#105)

Extreme outliers: 40 (#104), 90 (#103), 95 (#102), 100 (#101)
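The mild and extreme labels in this illustration follow from the 1.5 × IQR and 3.0 × IQR rules. Here is a minimal sketch using the slide's Q1 = 62 and Q3 = 68; `classify_outlier` is a hypothetical helper name, and treating a value that sits exactly on a fence as "not an outlier" is an assumption, since the slides do not say how ties are handled.

```python
def classify_outlier(value, q1, q3):
    """Label a value using the 1.5 x IQR (mild) and 3.0 x IQR (extreme) rules."""
    iqr = q3 - q1
    if value < q1 - 3.0 * iqr or value > q3 + 3.0 * iqr:
        return "extreme outlier"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

Q1, Q3 = 62, 68  # from the textbook heart rate data; IQR = 6

# Fences: 62 - 9 = 53 and 68 + 9 = 77 (mild); 62 - 18 = 44 and 68 + 18 = 86 (extreme).
for heart_rate in (53, 77, 50, 45, 40, 90, 95, 100):
    print(heart_rate, classify_outlier(heart_rate, Q1, Q3))
```

With these quartiles, 50 and 45 come out as mild outliers and 40, 90, 95 and 100 as extreme outliers, while the whisker limits 53 and 77 are not flagged.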

Page 35: Hss2381a – stats and stuff

Box Plots Versus Histograms

• Outliers can be seen in histograms, but box plots give more useful information about degree of extremity and ID numbers

Page 36: Hss2381a – stats and stuff

(Stolen from wikipedia)

Page 37: Hss2381a – stats and stuff

Standard Scores

• Also called z-score or z-statistic or z-value or normal score

• Is a measure of how far an observation is from the mean of its distribution

• The z-score only has meaning if you know the parameters of the reference population

• i.e.: μ and σ

Page 38: Hss2381a – stats and stuff

Standard Scores

• Standard scores—another index of “relative standing” helpful in interpreting raw scores

• A standard score (also called a z score) is a score expressed in standard deviation units, in relative distance from the mean

Page 39: Hss2381a – stats and stuff
Page 40: Hss2381a – stats and stuff

Standard Scores (cont’d)

• Standard score equation: z = (X – M) ÷ SD

• That is, the mean is subtracted from an individual score, then divided by the SD

• For example:
– M = 100, SD = 25, X = 125, z = 1.0
– M = 100, SD = 25, X = 50, z = -2.0
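The standard score equation translates directly into a one-line function. A minimal sketch; `z_score` is an illustrative name.

```python
def z_score(x, mean, sd):
    """Standard score: distance of x from the mean, in SD units."""
    return (x - mean) / sd

print(z_score(125, 100, 25))  # 1.0
print(z_score(50, 100, 25))   # -2.0
```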

Page 41: Hss2381a – stats and stuff

How is this useful?

• Very useful in standardized testing (like MCAT, GRE, SAT, etc)

• Allows us to:
– Calculate the probability of a score occurring within a normal distribution
– Compare two scores that are from different normal distributions

Page 42: Hss2381a – stats and stuff

Calculating a Probability Using a z-score

For a normally distributed variable (such as MCAT scores in Canada), 95% of observations fall within 1.96 SDs of the mean, i.e., between z = -1.96 and z = +1.96.
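The link between 1.96 and 95% can be checked numerically. A small sketch with `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# Proportion of observations between z = -1.96 and z = +1.96.
within = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

print(round(within, 3))  # 0.95
```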

Page 43: Hss2381a – stats and stuff

Example

• We know that the LSAT score in Canada is normally distributed. The mean mark is 60% and the SD is 15. So….
– What is the lowest mark among those who were in the top 10% of performers?
– (Why? Because law schools will only take the top 10% and need to know what mark to make their cut-off)

Page 44: Hss2381a – stats and stuff

Example

• We know that the LSAT score in Canada is normally distributed. The mean mark is 60% and the SD is 15. So….

We get the “1.282” by looking it up in a table, or using a z-score calculator

http://www.fourmilab.ch/rpkp/experiments/analysis/zCalc.html
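Instead of a printed table, the z-score for the top 10% can also be looked up in code and converted back into a mark with X = M + z × SD. This is a sketch of one way to do it, using the slide's mean of 60 and SD of 15; the variable names are illustrative.

```python
from statistics import NormalDist

mean_mark, sd_mark = 60, 15  # values given on the slide

# z-score with 90% of cases below it (so 10% above it).
z_cutoff = NormalDist().inv_cdf(0.90)

# Convert the z-score back to a raw mark: X = M + z * SD.
cutoff_mark = mean_mark + z_cutoff * sd_mark

print(round(z_cutoff, 3))     # 1.282
print(round(cutoff_mark, 1))  # 79.2
```

So, under these figures, the cut-off for the top 10% is a mark of roughly 79%.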

Page 45: Hss2381a – stats and stuff

Using z-scores to compare tests

• A student is in two classes, English and Math.

• She got 70% in English and 70% in Math and wants to know which class she’s doing better in
– Why isn’t the answer automatically “English”?

Page 46: Hss2381a – stats and stuff

Using z-scores to compare tests

• A student is in two classes, English and Math.

Page 47: Hss2381a – stats and stuff

Using z-scores to compare tests

Since these scores are from two different distributions, we need to standardise them into z-scores so that they can be directly compared. This gives us:

Page 48: Hss2381a – stats and stuff

Using z-scores to compare tests

How do we interpret this?

z = 0.67 means that the student performed 0.67 SDs above the mean in both classes. This makes her above average in both classes, and she is doing equally well in both.

(If we use a z-score calculator, we’d find out that z=0.67 means that she’s in the top 25.1% of the class.)
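The comparison can be sketched in code. The class means and SDs did not survive in this transcript, so the values below are hypothetical, chosen only so that both classes give z = 0.67 as on the slide; the 70% marks and the 25.1% figure are from the slides.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Standard score: distance of x from the mean, in SD units."""
    return (x - mean) / sd

# HYPOTHETICAL class statistics (not from the slides), for illustration only.
english = {"mean": 60, "sd": 15}
math_class = {"mean": 65, "sd": 7.5}

z_english = round(z_score(70, english["mean"], english["sd"]), 2)
z_math = round(z_score(70, math_class["mean"], math_class["sd"]), 2)

# Share of a normally distributed class scoring above z = 0.67.
top_share = 1 - NormalDist().cdf(0.67)

print(z_english, z_math)         # 0.67 0.67
print(round(top_share * 100, 1)) # 25.1
```

The same raw mark gives the same z-score here only because the hypothetical means and SDs were picked that way; with different class statistics the two z-scores would differ.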

Page 49: Hss2381a – stats and stuff

Standard Scores (cont’d)

• Standard scores have a mean of 0.0 and an SD of 1.0:

• But z scores can be transformed mathematically to have any mean and SD

• Most typical:
– Mean = 500, SD = 100 (e.g., GRE, SAT)
– Mean = 100, SD = 15 (e.g., IQ tests)
– Mean = 50, SD = 10 (called T scores)
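Transforming a z score to another scale means multiplying by the new SD and adding the new mean. A minimal sketch; `rescale` is a hypothetical helper name.

```python
def rescale(z, new_mean, new_sd):
    """Express a z score on a scale with the given mean and SD."""
    return new_mean + z * new_sd

# z = 1.0 on a scale with mean 500 and SD 100.
print(rescale(1.0, 500, 100))  # 600.0

# z = -2.0 on a scale with mean 100 and SD 15.
print(rescale(-2.0, 100, 15))  # 70.0
```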

Page 50: Hss2381a – stats and stuff

The Normal Distribution

• Central Limit Theorem:
– Under “mild” conditions, the sum (or mean) of a large number of independent random values will be distributed approximately “normally”, whatever the distribution of the individual values

• For fun, go to:
– http://www.math.csusb.edu/faculty/stanton/probstat/clt.html
– This is an “applet” that you keep clicking on. It produces a graph of a random variable. You will see that it always ends up being a Normal curve
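The same idea can be tried without the applet. The simulation below is only a sketch under stated assumptions: it draws from a flat uniform(0, 1) distribution, averages 30 draws at a time, and repeats this 5000 times; the sample sizes and the seed are arbitrary choices, not from the slides. If the averages are roughly normal, about 68% of them should fall within one SD of their own mean.

```python
import random
import statistics

random.seed(2381)  # arbitrary fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a uniform(0, 1) distribution, which is flat, not bell-shaped."""
    return sum(random.random() for _ in range(n)) / n

# 5000 sample means, each based on 30 draws.
means = [sample_mean(30) for _ in range(5000)]

centre = statistics.mean(means)
spread = statistics.stdev(means)

# Share of sample means within one SD of the centre.
within_one_sd = sum(1 for m in means if abs(m - centre) <= spread) / len(means)

print(round(centre, 2))         # should be close to 0.5
print(round(within_one_sd, 2))  # should be close to 0.68
```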

Page 51: Hss2381a – stats and stuff

Properties of the Normal Distribution

• About 68% of values drawn from a normal distribution are within one standard deviation (σ) of the mean

• About 95% of the values lie within two standard deviations of the mean

• About 99.7% are within three standard deviations

• This fact is known as the 68-95-99.7 rule or the empirical rule or the 3-sigma rule
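The three percentages in the rule can be recomputed from the normal curve itself. A short sketch using `statistics.NormalDist` from the Python standard library; the names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# Share of values within k standard deviations of the mean, for k = 1, 2, 3.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(share * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```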

Page 52: Hss2381a – stats and stuff

3-sigma rule

Page 53: Hss2381a – stats and stuff

Homework

• P.57, A4, A5