Introductory Statistics for Laboratorians dealing with High Throughput Data sets Centers for Disease Control

Upload: tirzah

Post on 08-Feb-2016


TRANSCRIPT

Page 1: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Introductory Statistics for Laboratorians dealing with High Throughput Data sets

Centers for Disease Control

Page 2: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Interpreting Scores

What do the numbers mean?

Page 3: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Johnny came home from 4th grade and told his mother he’d made 100 on his test.

• That’s good!

• But it was a 200 point test.

• That’s bad!

• But it was a very difficult test and Johnny’s score was one of the highest in the district.

• That’s good!

• But Johnny wasn’t the only one who got 100; the average score on the test was 100.

• That’s not so good.

Page 4: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

What have we learned?

• The fact is that a raw score by itself is meaningless.

• To interpret a person’s score you must know how everybody else scored.

• For a score to have meaning, you have to know where that score is in the distribution.

Page 5: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

The two main things we need to know to interpret a score are:

• How far it is from the mean

• How spread out the scores are

Page 6: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

The Deviation Score

• The deviation score is commonly used in statistics to make a score more interpretable.

• Deviation score: how far the score is from the mean

Page 7: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Some Notation

• In statistics the raw score is symbolized by an upper case $X$.

• The mean of the raw scores is symbolized by $\bar{X}$.

• The deviation score is symbolized by a lower case $x$.

• The deviation score is computed by subtracting the mean from the score: $x = X - \bar{X}$

Page 8: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

• If someone scores at the mean, the deviation score would be zero.

• If someone scores above average, the deviation score will be a positive number.

• If the score is below the mean the deviation score will be a negative number
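
The sign rules above can be sketched in a few lines of Python. This is only an illustration: Johnny's score of 100 and the average of 100 come from the opening example, while the other students and their scores are hypothetical values chosen so that the mean works out to 100.

```python
# Johnny's score (100) is from the slides; the other names and scores
# are hypothetical, picked so that the class mean is 100.
scores = {"Johnny": 100, "Ana": 120, "Ben": 80, "Cho": 110, "Dev": 90}

# Mean of the raw scores (X-bar).
mean = sum(scores.values()) / len(scores)

# Deviation score x = X - mean for each student.
deviations = {name: X - mean for name, X in scores.items()}

for name, x in deviations.items():
    if x > 0:
        position = "above the mean"
    elif x < 0:
        position = "below the mean"
    else:
        position = "at the mean"
    print(f"{name}: x = {x:+.0f} ({position})")

# Deviation scores always sum to zero.
print("Sum of deviation scores:", sum(deviations.values()))
```

Johnny's deviation score comes out as zero because he scored exactly at the mean.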

Page 9: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

• If Johnny had come home and told his mother that his deviation score on the test was 0, she would have known immediately that he was average.

• (Johnny’s mother is a professor of statistics at the local college)

Page 10: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

But that is not all.

• While the distance of a person’s score from the mean is more meaningful than the raw score, the interpretation of that distance depends on how spread out the scores are.

Page 11: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

The importance of Dispersion

• For example, if Johnny tells his mother he scored 10 points above the mean on a test, we know right away that he is above average.

• The question is how far above average.

Page 12: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

• If the average score on the test is 55, Johnny scores 65, and that is the highest score on the test, then scoring 10 points above the mean is very good (see Figure 1).

Page 13: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

• If, on the other hand, the highest score on the test is 100, then a 65 is not so great.

[Figure: distribution of test scores (axis label: Score), with the value 55 and "Johnny's Score = 65" marked]

Page 14: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

So?

• What we really need is a way to express a score that takes into account both how far the score is from the mean and how spread out the scores are.

Page 15: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

z-Scores

• The standard deviation is the parameter that measures the dispersion or spread of the distribution.

• z-scores measure the distance from the mean in standard deviation units.

$z = \frac{X - \bar{X}}{s}$

Page 16: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

z-Scores

• If a person scores 1 standard deviation (SD) above the mean, the z-score will be +1

• If they score 1 SD below the mean the z-score will be –1

• If they score 2 SDs above the mean the z-score will be +2

• If they score at the mean the z-score will be zero.

• Etc.
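
A short Python sketch shows why the spread matters. Johnny's score of 65 and the mean of 55 are from the earlier slides; the two standard deviations (5 and 20) are hypothetical values chosen only to contrast a tightly clustered test with a widely spread one.

```python
def z_score(score, mean, sd):
    """Distance of a score from the mean, in standard deviation units."""
    return (score - mean) / sd

# Johnny scores 65 on a test with a mean of 55 (values from the slides).
# The standard deviations below are hypothetical, for illustration only.
z_narrow = z_score(65, 55, 5)    # scores tightly clustered
z_wide = z_score(65, 55, 20)     # scores widely spread

print(z_narrow)  # 2.0
print(z_wide)    # 0.5
```

The same 10-point deviation is 2 SDs above the mean in the first case but only half an SD above it in the second.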

Page 17: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Areas Under the Normal Curve

• The proportion of the area under the normal curve can be interpreted as the probability that a score appears in that area.

• Areas here are shown for standard deviation units.
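
The areas in each standard deviation band can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `area_between` is an illustrative choice, not something from the slides.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def area_between(lo, hi):
    """Proportion of the normal curve between two z-scores."""
    return std_normal.cdf(hi) - std_normal.cdf(lo)

# Area of each one-SD-wide band from -3 SD to +3 SD.
for lo in range(-3, 3):
    hi = lo + 1
    print(f"{lo:+d} SD to {hi:+d} SD: {area_between(lo, hi):.1%}")

# Cumulative areas centered on the mean.
for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_between(-k, k):.1%}")
```

The bands come out at roughly 34.1%, 13.6%, and 2.1% on each side of the mean, which add up to the familiar 68%, 95%, and 99.7% within 1, 2, and 3 SDs.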

Page 18: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Areas Under the Curve

• As shown here, the percentage of the distribution in a standard deviation band is the same for every normal distribution, regardless of its mean and standard deviation (that is, however wide or narrow the curve is).

Page 19: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 10: Compute z-Scores

Subject   Score   x = X - Mean   x²    z score
S1        1
S2        4
S3        4
S4        5
S5        5
S6        6
S7        7
S8        8

N =       Total =

Mean =

SS =

s =

Formulas: $z = \frac{x}{s} = \frac{X - \bar{X}}{s}$ and $s = \sqrt{\frac{SS}{N}}$

Page 20: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 10: Compute z-Scores

Subject   Score   x = X - Mean   x²    z score
S1        1       -4             16    -2
S2        4       -1             1     -0.5
S3        4       -1             1     -0.5
S4        5       0              0     0
S5        5       0              0     0
S6        6       1              1     0.5
S7        7       2              4     1
S8        8       3              9     1.5

N = 8     Total = 40

Mean = 5

SS = 32

s = 2

Formulas: $z = \frac{x}{s} = \frac{X - \bar{X}}{s}$ and $s = \sqrt{\frac{SS}{N}}$
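
The Problem 10 table can be reproduced step by step in Python. This is a minimal sketch that follows the slide's formulas; the variable names are illustrative. Note that it divides the sum of squares by N, as the slide does, which corresponds to `statistics.pstdev` rather than `statistics.stdev` (which divides by N − 1).

```python
import math

# Raw scores (X) for subjects S1..S8, from the Problem 10 table.
scores = [1, 4, 4, 5, 5, 6, 7, 8]

N = len(scores)
mean = sum(scores) / N

# Deviation scores x = X - mean.
deviations = [X - mean for X in scores]

# SS: sum of the squared deviation scores.
SS = sum(x ** 2 for x in deviations)

# Standard deviation s = sqrt(SS / N), dividing by N as on the slide.
s = math.sqrt(SS / N)

# z-score for each subject: z = x / s.
z_scores = [x / s for x in deviations]

print(f"N = {N}, Mean = {mean}, SS = {SS}, s = {s}")
for i, (X, x, z) in enumerate(zip(scores, deviations, z_scores), start=1):
    print(f"S{i}: X = {X}, x = {x:+.0f}, x^2 = {x ** 2:.0f}, z = {z:+.1f}")
```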

Page 21: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 11: Properties of z-Scores

Subject   z-score (from Problem 10)   Deviation score of the z’s   Squared deviation of the z’s
S1
S2
S3
S4
S5
S6
S7
S8

N =       Total of z’s =

Mean of z’s =

SS of z’s =

Standard deviation of z’s =

Page 22: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 11: Properties of z-Scores

Subject   z-score (from Problem 10)   Deviation score of the z’s   Squared deviation of the z’s
S1        -2                          -2                           4
S2        -0.5                        -0.5                         0.25
S3        -0.5                        -0.5                         0.25
S4        0                           0                            0
S5        0                           0                            0
S6        0.5                         0.5                          0.25
S7        1                           1                            1
S8        1.5                         1.5                          2.25

N = 8     Total of z’s = 0

Mean of z’s = 0

SS of z’s = 8

Standard deviation of z’s = 1
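
The two properties that Problem 11 demonstrates, a mean of 0 and a standard deviation of 1, can be confirmed with a short Python sketch. The z-scores are the ones from Problem 10; the variable names are illustrative.

```python
import math

# z-scores for subjects S1..S8, taken from Problem 10.
z_scores = [-2, -0.5, -0.5, 0, 0, 0.5, 1, 1.5]

N = len(z_scores)
mean_z = sum(z_scores) / N

# Because the mean is 0, each deviation score equals the z-score itself.
deviations = [z - mean_z for z in z_scores]

# Sum of squared deviations, then the standard deviation (dividing by N).
SS_z = sum(d ** 2 for d in deviations)
sd_z = math.sqrt(SS_z / N)

print(f"Mean of z's = {mean_z}")
print(f"SS of z's = {SS_z}")
print(f"Standard deviation of z's = {sd_z}")
```

The same holds for any set of scores converted to z-scores with their own mean and standard deviation.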

Page 23: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Using the Standard Normal Distribution

• Because all Normal distributions share the same properties, we can use the standard normal distribution (the distribution of z-scores) for our computations and get the same results.

• In the distribution with mean of 64.5 and standard deviation of 2.5, 68% of the distribution is between 62 and 67 (-1 SD to +1 SD).

• In the standard normal distribution (with mean 0 and standard deviation 1), 68% of the distribution is between -1 SD and +1 SD.

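[Figure: the N(64.5, 2.5) distribution of heights (x) is converted to the standard normal distribution N(0, 1) of z-scores; axis label: Standardized height (no units)]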
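
The equivalence of the two computations can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch under the slide's assumption that heights follow a normal distribution with mean 64.5 and standard deviation 2.5; the variable names are illustrative.

```python
from statistics import NormalDist

# Distribution of heights from the slide, and the standard normal.
heights = NormalDist(mu=64.5, sigma=2.5)
std_normal = NormalDist(mu=0, sigma=1)

# Proportion of heights between 62 and 67 inches, computed directly.
lo, hi = 62, 67
p_raw = heights.cdf(hi) - heights.cdf(lo)

# Same question after converting the limits to z-scores.
z_lo = (lo - heights.mean) / heights.stdev
z_hi = (hi - heights.mean) / heights.stdev
p_std = std_normal.cdf(z_hi) - std_normal.cdf(z_lo)

print(f"z limits: {z_lo:+.1f} to {z_hi:+.1f}")
print(f"Using N(64.5, 2.5): {p_raw:.1%}")
print(f"Using N(0, 1):      {p_std:.1%}")
```

Both routes give about 68%, which is why a single table of z-scores is enough for any normal distribution.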

Page 24: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 12: Women’s Heights

• The average woman is 64.5 inches tall.

• Mean = 64.5 inches

• Standard Deviation = 2.5 inches

Page 25: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 12: Women’s Heights

• Maria is 67 inches tall (5’ 7”).

• What is Maria’s z-score?

• What percent of women are shorter than Maria?

• What percent are taller?

Page 26: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 12: Women’s Heights

• Alexis is 62 inches tall (5’ 2”).

• What is Alexis’ z-score?

• What percent of women are shorter than Alexis?

• What percent are taller?

Page 27: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 12: Women’s Heights

• Barbie is 69.5 inches tall (5’ 9.5”).

• What is Barbie’s z-score?

• What percent of women are shorter than Barbie?

• What percent are between Alexis and Barbie?
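
One way to check answers for Maria, Alexis, and Barbie is sketched below in Python, using `statistics.NormalDist` (Python 3.8 or later). It assumes, as the problem does, that women's heights are normally distributed with mean 64.5 and standard deviation 2.5 inches; the dictionary names are illustrative.

```python
from statistics import NormalDist

MEAN, SD = 64.5, 2.5          # inches, from the problem statement
std_normal = NormalDist()     # standard normal: mean 0, SD 1

# Heights in inches, from the slides.
heights = {"Maria": 67, "Alexis": 62, "Barbie": 69.5}

# z-score for each woman, then the proportion of women who are shorter.
z = {name: (h - MEAN) / SD for name, h in heights.items()}
shorter = {name: std_normal.cdf(zv) for name, zv in z.items()}

for name in heights:
    print(f"{name}: z = {z[name]:+.1f}, "
          f"shorter = {shorter[name]:.1%}, taller = {1 - shorter[name]:.1%}")

# Proportion of women with heights between Alexis and Barbie.
between = shorter["Barbie"] - shorter["Alexis"]
print(f"Between Alexis and Barbie: {between:.1%}")
```

Because all three z-scores are whole numbers, the results can also be read from the standard deviation bands shown earlier, to within rounding.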

Page 28: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 12: Women’s Heights

• Leela is 68 ¾ inches tall (5’ 8 ¾ ”).

• What is Leela’s z-score?

• Can we compute the percent of women who are shorter than Leela?

• Why or why not?

Page 29: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 12: Women’s Heights

• Leela is 68 ¾ inches tall.

• Her z-score is (68.75 − 64.5) / 2.5 = 1.7.

• Use http://davidmlane.com/hyperstat/z_table.html to compute the percent of women who are shorter than Leela.
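
As an alternative to the online table, the lookup can be sketched in Python with `statistics.NormalDist` (Python 3.8 or later). Computing directly from the stated height of 68.75 inches, mean of 64.5, and standard deviation of 2.5 gives a z-score of 1.7, which does not fall on a whole standard deviation, so a table or a function like this is needed.

```python
from statistics import NormalDist

MEAN, SD = 64.5, 2.5      # inches, from the problem statement
leela_height = 68.75      # 68 3/4 inches

# z-score: distance from the mean in standard deviation units.
z = (leela_height - MEAN) / SD

# Proportion of women shorter than Leela: area to the left of z.
shorter = NormalDist().cdf(z)

print(f"z = {z:.2f}")
print(f"Shorter than Leela: {shorter:.1%}")
```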

Page 30: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 12: Women’s Heights

• How tall do you have to be to be taller than 50% of the women?

• How tall do you have to be to be taller than 84% of the women?

• How tall do you have to be to be taller than 97.6% of the women?

Page 31: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 12: Women’s Heights

• Use http://davidmlane.com/hyperstat/z_table.html for the following problems:

• How tall do you have to be to be taller than 95% of the women?

• How tall do you have to be to be taller than 99% of the women?

• We can be sure that 95% of the women are between what heights?
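
The height questions on the last two slides run in the opposite direction: they start from a percentage and ask for a height. A minimal Python sketch of that inverse lookup uses `NormalDist.inv_cdf` from the standard library (Python 3.8 or later), again assuming normally distributed heights with mean 64.5 and standard deviation 2.5 inches.

```python
from statistics import NormalDist

# Distribution of women's heights from the problem statement.
heights = NormalDist(mu=64.5, sigma=2.5)

# Height below which a given proportion of women fall.
cutoffs = {p: heights.inv_cdf(p) for p in (0.50, 0.84, 0.976, 0.95, 0.99)}
for p, h in cutoffs.items():
    print(f"Taller than {p:.1%} of women: above {h:.1f} inches")

# Central 95%: cut off 2.5% in each tail.
lower = heights.inv_cdf(0.025)
upper = heights.inv_cdf(0.975)
print(f"Middle 95% of women: {lower:.1f} to {upper:.1f} inches")
```

The first three percentages correspond closely to z-scores of 0, 1, and 2, so those answers can also be estimated from the standard deviation bands without a table.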

Page 32: Introductory Statistics  for  Laboratorians  dealing with High Throughput Data sets

Problem 13

• Use http://davidmlane.com/hyperstat/z_table.html for the following problem:

• You have been timing how long it takes to get to work in the mornings. The mean is 22.6 minutes with a standard deviation of 8.16 minutes.

• You have to be at work at 8:30 am at the latest.

• How many minutes before 8:30 do you have to leave to be 95% confident that you will get there at or before 8:30?

• When do you have to leave to be 99% sure you’ll be there by 8:30?
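
One way to check an answer to Problem 13 is sketched below in Python with `statistics.NormalDist` (Python 3.8 or later). It rests on an assumption the problem does not state explicitly: that commute times are approximately normally distributed. The variable names are illustrative.

```python
from statistics import NormalDist

# Commute time in minutes: mean and SD from the problem statement.
# Assumption: commute times are approximately normal.
commute = NormalDist(mu=22.6, sigma=8.16)

# Travel time that will not be exceeded with the given confidence.
# This is a one-sided cutoff: only being late matters.
minutes_needed = {c: commute.inv_cdf(c) for c in (0.95, 0.99)}

for confidence, minutes in minutes_needed.items():
    print(f"{confidence:.0%} confident: leave at least "
          f"{minutes:.1f} minutes before 8:30")
```

Rounding up to whole minutes gives the departure time to subtract from 8:30.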