
Post on 26-Dec-2015


Objectives

1.2 Describing distributions with numbers

Measures of center: mean, median

Mean versus median

Measures of spread: quartiles, standard deviation

Five-number summary and boxplot

Choosing among summary statistics

Changing the unit of measurement

Numerical descriptions of distributions

Describe the shape, center, and spread of a distribution…

Center: mean, median and mode.

Spread: range, IQR, standard deviation (SD).

We treat these as aids to understanding the distribution of the variable at hand…

The mean is often called the "average" and is in fact the arithmetic average ("add all the values and divide by the number of observations").

The mean or arithmetic average

To calculate the average, or mean, add all values, then divide by the number of individuals. It is the “center of mass.”

Measure of center: sample mean: Example 1

Heights (inches) of 5 women: 58.2, 59.5, 60.7, 60.9, 61.9

The sum of the heights is 301.2; dividing by 5 women gives 301.2/5 = 60.24 inches.

$\bar{x} = \dfrac{1598.3}{25} = 63.9$

Mathematical notation (sample mean):

$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

woman (i)   height (x)      woman (i)   height (x)
i = 1       x1 = 58.2       i = 14      x14 = 64.0
i = 2       x2 = 59.5       i = 15      x15 = 64.5
i = 3       x3 = 60.7       i = 16      x16 = 64.1
i = 4       x4 = 60.9       i = 17      x17 = 64.8
i = 5       x5 = 61.9       i = 18      x18 = 65.2
i = 6       x6 = 61.9       i = 19      x19 = 65.7
i = 7       x7 = 62.2       i = 20      x20 = 66.2
i = 8       x8 = 62.2       i = 21      x21 = 66.7
i = 9       x9 = 62.4       i = 22      x22 = 67.1
i = 10      x10 = 62.9      i = 23      x23 = 67.8
i = 11      x11 = 63.9      i = 24      x24 = 68.9
i = 12      x12 = 63.1      i = 25      x25 = 69.6
i = 13      x13 = 63.9      n = 25      sum = 1598.3

Learn right away how to get the mean using your calculators.

$\bar{x} = \dfrac{x_1 + x_2 + \dots + x_n}{n}$
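As a cross-check on the calculator, the formula can be sketched in a few lines of Python. This is only an illustration: the function name `sample_mean` and the variable names are not from the slides, and the data are the 25 heights from the table above.

```python
# Heights (inches) of the 25 women listed in the table above.
heights = [58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
           63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
           66.7, 67.1, 67.8, 68.9, 69.6]


def sample_mean(values):
    """Add all the values, then divide by the number of observations."""
    return sum(values) / len(values)


n = len(heights)
total = sum(heights)
x_bar = sample_mean(heights)

# Rounded to one decimal, as on the slide.
print(f"n = {n}, sum = {total:.1f}, mean = {x_bar:.1f}")
```

The same function applied to the first five heights reproduces the 60.24 inches of Example 1, up to floating-point rounding.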

Measure of center: sample mean: Example 2

Your numerical summary must be meaningful!

The distribution of women’s heights appears coherent and symmetrical. The mean is a good numerical summary.

[Figure: height of 25 women in a class, with the mean $\bar{x} = 63.9$ marked.]

The median (M) is often called the "middle" value: it is the value at the midpoint of the observations when they are ranked from smallest to largest.

Steps to get the median:

1) Arrange the data from smallest to largest.

2) If n is odd, the median is the single observation in the center (at the (n+1)/2 position in the ordering).

3) If n is even, the median is the average of the two middle observations (the (n+1)/2 position falls in between them).

E.g1: 5, 1, 7, 4, 3

E.g2: 5, 1, 7, 4, 3, 8

Note: the median splits the data in half: 50% of the data lie below it and 50% lie above it.
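The steps above can be sketched as a small Python function. The name `median` and the variables `eg1` and `eg2` are chosen here for illustration; the logic is just the odd/even rule, applied to the two example data sets.

```python
def median(values):
    """Median by the rule above: sort, then take the middle observation(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single observation at position (n + 1) / 2 (counting from 1).
        return ordered[mid]
    # n even: average the two observations on either side of position (n + 1) / 2.
    return (ordered[mid - 1] + ordered[mid]) / 2


eg1 = median([5, 1, 7, 4, 3])       # sorted: 1, 3, 4, 5, 7
eg2 = median([5, 1, 7, 4, 3, 8])    # sorted: 1, 3, 4, 5, 7, 8
print(eg1, eg2)
```

For E.g1 the middle observation is 4; for E.g2 the two middle observations are 4 and 5, so the median is 4.5.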

Example 1: with the data listed below, what are the mean and median? 2, 3, 5, 1.

Example 2: with the data listed below, what are the mean and median? 2, 3, 5, 1, 100.

Example 3: with the data listed below, what are the mean and median? -100, 2, 3, 5, 1, 100.

Question: What can we conclude from the examples above?

Measure of center: the median

The mean is sensitive to outliers; the median is robust to outliers.
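One way to see this conclusion is to compute both statistics for the three example data sets. The sketch below uses Python's standard `statistics` module; the dictionary names are illustrative only.

```python
from statistics import mean, median

# The three example data sets from the slides.
examples = {
    "Example 1": [2, 3, 5, 1],
    "Example 2": [2, 3, 5, 1, 100],
    "Example 3": [-100, 2, 3, 5, 1, 100],
}

# Map each example name to its (mean, median) pair.
results = {name: (mean(data), median(data)) for name, data in examples.items()}

for name, (m, med) in results.items():
    print(f"{name}: mean = {m:.2f}, median = {med}")
```

Adding the single value 100 moves the mean from 2.75 to 22.2 while the median only moves from 2.5 to 3; adding -100 as well pulls the mean back to about 1.83 and leaves the median at 2.5.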

Measure of center: the median

The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.

1. Sort observations by size. Let n = number of observations.

2.a. If n is odd, the median is observation (n+1)/2 down the list.

Sorted observations (n = 25): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

n = 25, (n+1)/2 = 26/2 = 13, Median = 3.4 (the 13th observation)

2.b. If n is even, the median is the mean of the two middle observations.

Sorted observations (n = 24): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6

n = 24, n/2 = 12, Median = (3.3 + 3.4)/2 = 3.35 (the mean of the 12th and 13th observations)

Mean and median of a distribution with outliers

[Figure: two plots of the same distribution (vertical axis: percent of people dying), one without the outliers ($\bar{x} = 3.4$) and one with the outliers ($\bar{x} = 4.2$).]

The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).

The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).

Impact of skewed data

Disease X: mean and median of a symmetric distribution. Mean and median are the same: $\bar{x} = 3.4$, $M = 3.4$.

Multiple myeloma: … and a right-skewed distribution. The mean is pulled toward the skew: $\bar{x} = 3.4$, $M = 2.5$.

We can describe the shape, center and spread of a density curve in the same way we describe data. For example:

The median of a density curve is the “equal-areas” point: the point on the horizontal axis that divides the area under the density curve into two equal parts (0.5 each).

The mean of the density curve is the balance point - the point on the horizontal axis where the curve would balance if it were made of a solid material. (See figures 1.24b and 1.25 below)

Skewness: The mean is pulled toward the skew.

[Figure: three density curves with mean, median, and mode marked. Symmetric: mode = mean = median. Skewed left (negatively) and skewed right (positively): the three measures separate, with the mean farthest out in the long tail.]

Spread: percentiles, quartiles (Q1 and Q3), IQR, 5-number summary (and boxplots), range, standard deviation

The pth percentile of a variable is a data value such that p% of the values of the variable fall at or below it.

The lower (Q1) and upper (Q3) quartiles are special percentiles dividing the data into quarters (fourths). Get them by finding the medians of the lower and upper halves of the data.

IQR = interquartile range = Q3 - Q1 = spread of the middle 50% of the data. IQR is used with the so-called 1.5*IQR criterion for outliers - know this!

Measure of spread: the quartiles

Eg1: Dataset: 3, 2, 1, 5, 6.

1) Find the Median, Q1, Q3 and IQR.

2) Find the 5-# summary.

3) Draw a Boxplot for Eg1.

Examples to find 5-# summary and Boxplot

Eg2: Dataset: 3, 2, 1, 5, 6, 8.

1) Find the Median, Q1, Q3 and IQR.

2) Find the 5-# summary.

3) Draw a Boxplot for Eg2.
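To check answers to Eg1 and Eg2, the quartile rule from these slides (medians of the lower and upper halves, leaving out M when n is odd) can be sketched in Python. The function names are illustrative, and other software may use a different quartile convention and give slightly different values.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max); needs at least two observations."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations below the median position
    upper = ordered[half + n % 2:]    # observations above it (M skipped if n is odd)
    return (ordered[0], median_sorted(lower), median_sorted(ordered),
            median_sorted(upper), ordered[-1])


for name, data in [("Eg1", [3, 2, 1, 5, 6]), ("Eg2", [3, 2, 1, 5, 6, 8])]:
    low, q1, m, q3, high = five_number_summary(data)
    print(f"{name}: min={low}, Q1={q1}, M={m}, Q3={q3}, max={high}, IQR={q3 - q1}")
```

For Eg1 this gives Q1 = 1.5, M = 3, Q3 = 5.5; for Eg2 it gives Q1 = 2, M = 4, Q3 = 6. In both cases the IQR is 4.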

Definition, pg 35, Introduction to the Practice of Statistics, Sixth Edition

© 2009 W.H. Freeman and Company

Measure of spread: the quartiles

Sorted observations (n = 25): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

Q1 = first quartile = 2.2

M = median = 3.4

Q3 = third quartile = 4.35

Measure of spread: the quartiles

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (it is the median of the lower half of the sorted data, excluding M).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (it is the median of the upper half of the sorted data, excluding M).

Definition, pg 37, Introduction to the Practice of Statistics, Sixth Edition

Definition, pg 38a, Introduction to the Practice of Statistics, Sixth Edition

Five-number summary and boxplot

Five-number summary: min, Q1, M, Q3, max

Smallest = min = 0.6

Q1 = first quartile = 2.2

M = median = 3.4

Q3 = third quartile = 4.35

Largest = max = 6.1

[Figure: boxplot for Disease X; vertical axis: years until death, 0 to 7.]

Comparing box plots for a normal and a right-skewed distribution

[Figure: side-by-side boxplots for Disease X and Multiple Myeloma; vertical axis: years until death, 0 to 15.]

Boxplots for skewed data: boxplots remain true to the data and clearly depict symmetry or skew.

5-number summary: min, Q1, median, Q3, max. When plotted, the 5-number summary is a boxplot. We can also do a modified boxplot to show outliers (mild and extreme). Boxplots have less detail than histograms and are often used for comparing distributions, e.g., Fig. 1.17, p.47 and below...

Suspected outliers: how to detect outliers

Outliers are troublesome data points, and it is important to be able to identify them.

One way to raise the flag for a suspected outlier is to compare the distance from the suspicious data point to the nearest quartile (Q1 or Q3). We then compare this distance to the interquartile range (distance between Q1 and Q3).

We call an observation a suspected outlier if it falls more than 1.5 times the size of the interquartile range (IQR) above the third quartile or below the first quartile. This is called the “1.5 * IQR rule for outliers.”

Modified boxplot (helps detect outliers)

1. Calculate 1.5*IQR, then the bounds Q1 – 1.5*IQR and Q3 + 1.5*IQR.

2. Draw box and line (similar to before).

3. Draw whiskers to the minimum and maximum observations within (Q1 – 1.5*IQR, Q3 + 1.5*IQR).

4. Observations outside this range should be plotted separately as dots.

Sorted observations (n = 25): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9

Q1 = 2.2

Q3 = 4.35

Modified Boxplot

Q1: Are there any suspected outliers?

Q2: If yes, then find the following values:

Calculate 1.5*IQR;

Lower bound = Q1 – 1.5*IQR;

Upper bound = Q3 + 1.5*IQR;

Find Min* = min within lower/upper bounds;

Find Max* = max within lower/upper bounds.

Q3: Can we verify any outliers?

Q4: Now draw the modified boxplot: draw Min* and Max*, Q1, Med, Q3. All observations outside the lower/upper bounds should be plotted separately as dots.

[Figure: modified boxplot for Disease X; vertical axis: years until death, 0 to 8.]

Interquartile range: Q3 – Q1 = 4.35 − 2.2 = 2.15

Distance to Q3: 7.9 − 4.35 = 3.55

Individual #25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 * IQR = 3.225 years. Thus, individual #25 is an outlier by our 1.5 * IQR rule.
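The same check can be sketched in Python for the whole data set. The quartiles are taken as given in the slides rather than recomputed, and the variable names (`lower_bound`, `whisker_max`, and so on) are illustrative.

```python
# Years until death for the 25 individuals, with the largest value 7.9.
years = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]
q1, q3 = 2.2, 4.35    # quartiles as given in the slides

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Suspected outliers fall outside the bounds; whiskers stop at Min* and Max*.
outliers = [x for x in years if x < lower_bound or x > upper_bound]
inside = [x for x in years if lower_bound <= x <= upper_bound]
whisker_min, whisker_max = min(inside), max(inside)

print(f"IQR = {iqr:.2f}, bounds = ({lower_bound:.3f}, {upper_bound:.3f})")
print(f"suspected outliers: {outliers}")
print(f"Min* = {whisker_min}, Max* = {whisker_max}")
```

Only 7.9 falls outside the bounds (about -1.025 and 7.575), so the upper whisker of the modified boxplot stops at 6.1 and 7.9 is drawn as a separate dot.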

Modified Boxplot

The standard deviation “s” is used to describe the variation around the mean. Like the mean, it is not resistant to skew or outliers.

1. First calculate the variance $s^2$:

$s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

2. Then take the square root to get the standard deviation $s$:

$s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Measure of spread: the standard deviation

[Figure: distribution with the mean $\bar{x}$ and the interval mean ± 1 s.d. marked.]

Example 1: to calculate sample SD

Calculations for the data 1, 2, 3, 4, 5. Q: Find the sample variance and sample SD.

$s = \sqrt{\dfrac{1}{df}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Mean = 3

Sum of squared deviations from mean = 10

Degrees of freedom (df) = (n − 1) = 4

$s^2$ = sample variance = 10/4 = 2.5

s = sample standard deviation = √2.5 = 1.58

Make sure to know how to get the standard deviation using your calculator.


Example 2: Calculate the sample SD by hand for the following data set: 3, 4, 5, 8.
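A hand calculation can be checked against a short Python sketch of the two steps (variance first, then square root). The function names `sample_variance` and `sample_sd` are illustrative, not from the slides.

```python
import math


def sample_variance(values):
    """s^2: sum of squared deviations from the mean, divided by df = n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_sd(values):
    """s: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


for name, data in [("Example 1", [1, 2, 3, 4, 5]), ("Example 2", [3, 4, 5, 8])]:
    print(f"{name}: s^2 = {sample_variance(data):.3f}, s = {sample_sd(data):.2f}")
```

For 1, 2, 3, 4, 5 this gives a variance of 2.5 and an SD of about 1.58. For 3, 4, 5, 8 the mean is 5, the squared deviations sum to 14, so the variance is 14/3 ≈ 4.667 and the SD is about 2.16.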


How to use a calculator to find statistics

To find the sample mean, sample SD, and 5-# summary, we can use a calculator as follows:

1. Stat → Edit → choose 1: Edit…, then input your data into L1.

2. Stat → Calc → choose 1: 1-Var Stats → Enter → Enter.

3. Read your outputs carefully.

Note: X-bar means sample mean; Sx means sample SD; n means sample size.

Q: Find the sample mean, sample SD, and 5-# summary for the following data:

Example 1: Data are: 3, 4, 5, 8.

Example 2: Data are: 1, 3, 5, 6, 7, 8.
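For readers without the calculator at hand, a rough Python stand-in for the 1-Var Stats output is sketched below. It relies on the standard `statistics` module; the function name `one_var_stats` and the dictionary keys are hypothetical labels that merely echo the calculator's notation, and the quartiles assume the median-of-halves rule from these slides.

```python
from statistics import mean, median, stdev


def one_var_stats(data):
    """Sample mean, sample SD, n, and the 5-# summary for a small data set."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    return {
        "x_bar": mean(ordered),
        "Sx": stdev(ordered),                   # sample SD (divides by n - 1)
        "n": n,
        "min": ordered[0],
        "Q1": median(ordered[:half]),           # median of the lower half
        "Med": median(ordered),
        "Q3": median(ordered[half + n % 2:]),   # median of the upper half
        "max": ordered[-1],
    }


ex1 = one_var_stats([3, 4, 5, 8])
ex2 = one_var_stats([1, 3, 5, 6, 7, 8])
print(ex1)
print(ex2)
```

For Example 1 the sketch returns a mean of 5, Sx of about 2.16, and the summary 3, 3.5, 4.5, 6.5, 8; for Example 2 it returns a mean of 5, Sx of about 2.61, and the summary 1, 3, 5.5, 7, 8.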

Definition, pg 43a, Introduction to the Practice of Statistics, Sixth Edition

ALWAYS PLOT DATA BEFORE DECIDING ON A NUMERICAL SUMMARY.

How to choose summary statistics? Use the 5-number summary rather than the mean and s.d. for skewed data; use the mean & s.d. for symmetric data.

How to perform data analysis:
