Chapter 7 Looking at Distributions

Upload: edward-green

Post on 30-Dec-2015



Modeling by a Distribution

For a given data set, we want to know which distribution fits each variable. This is a modeling problem.

When prior knowledge suggests a specific type of distribution (normal, exponential, Poisson), a goodness-of-fit test is useful for checking that choice.

Q-Q plots are a useful graphical way to find a suitable distribution to fit the data.
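To make the idea of a Q-Q plot concrete, here is a minimal sketch of how the points of a normal Q-Q plot could be computed by hand. It is not what SPSS does internally: the function name `normal_qq_points`, the plotting positions (i - 0.5) / n, and the sample values are all assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    xs = sorted(sample)
    n = len(xs)
    # Normal distribution fitted with the sample mean and standard deviation.
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i - 0.5) / n are one common convention (an assumption here).
    return [(fitted.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

# Hypothetical completion times in hours (made up, not taken from mar1500.sav).
times = [3.1, 3.6, 3.9, 4.0, 4.2, 4.3, 4.5, 4.8, 5.4, 6.7]
for theoretical, observed in normal_qq_points(times):
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the normal distribution fits well, the pairs lie close to the line observed = theoretical. A long right tail shows up as the largest observed values lying above that line.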


Two data sets

The contents of this chapter are from Chapter 7 of the textbook.

The textbook uses the data set marathon.sav to show how to use SPSS for looking at distributions. The Chicago Marathon has been run yearly since 1977.

Because the student version of SPSS limits the number of rows and columns, we use a similar data set, mar1500.sav, instead.


Data set “mar1500.sav”

The data set involves the following variables: “age”, “sex”, “hours”, “agecat8”, and “agecat6”.

hours = completion time in hours

agecat8: 1 = 24 or less, 2 = 25-39, 3 = 40-44, 4 = 45-49, 5 = 50-54, 6 = 55-59, 7 = 60-64, 8 = 65+

agecat6: 1 = 44 or less, 2 = 45-49, 3 = 50-54, 4 = 55-59, 5 = 60-64, 6 = 65+


Histogram


Impressions on the histogram

The mean falls between 4.3 and 4.4 hours.

The distribution is not symmetric about the mean.

The distribution has a tail toward larger times.

Low marathon times are difficult to achieve. It is hard to break the world record.

Since the distribution has a tail toward larger values, the median should be somewhat less than the mean.


Basic statistics

Descriptives: completion time in hours

Statistic                                        Value     Std. Error
Mean                                             4.3306    .02016
95% Confidence Interval for Mean, Lower Bound    4.2911
95% Confidence Interval for Mean, Upper Bound    4.3702
5% Trimmed Mean                                  4.3073
Median                                           4.2765
Variance                                         .609
Std. Deviation                                   .78070
Minimum                                          2.19
Maximum                                          7.70
Range                                            5.51
Interquartile Range                              1.00
Skewness                                         .482      .063
Kurtosis                                         .323      .126


Basic statistics

The 5% trimmed mean excludes the 5% largest and the 5% smallest values. It is based on the middle 90% of cases.

The trimmed mean provides an alternative to the median when the data contain some outliers.

In these data the 5% trimmed mean (4.3073) does not differ much from the usual mean (4.3306), because the distribution is not too far from being symmetric.
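The trimming step can be sketched in a few lines. This is a simplified illustration, not SPSS's exact procedure: the function name `trimmed_mean` is hypothetical, the number of cases dropped from each tail is simply truncated to a whole number (an assumption), and the data are made up.

```python
from statistics import mean

def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of the cases from each end of the sorted data."""
    xs = sorted(values)
    # Cases dropped per tail; any fractional part is ignored (a simplifying assumption).
    k = int(len(xs) * proportion)
    return mean(xs[k:len(xs) - k])

# Hypothetical data: 19 ordinary times plus one very slow finisher.
times = [4.0 + 0.05 * i for i in range(19)] + [9.5]
print("mean:        ", round(mean(times), 3))
print("trimmed mean:", round(trimmed_mean(times), 3))
```

With one extreme value in 20 cases, the ordinary mean is pulled toward the slow finisher, while the trimmed mean stays near the bulk of the data.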


Comparison of completion times by gender


Comparison of completion times by gender

Percentiles: completion time in hours, by sex

              Weighted Average (Definition 1)    Tukey's Hinges
Percentile    F         M                        F         M
5             3.5399    3.0256
10            3.7233    3.2373
25            4.0964    3.5950                   4.1006    3.5951
50            4.5433    4.0469                   4.5433    4.0469
75            5.0790    4.6183                   5.0783    4.6172
90            5.6372    5.1408
95            5.9658    5.5035


Comparison of completion times by gender

At every percentile, women's completion times exceed men's by roughly half an hour. The differences range from about 0.46 to 0.51 hour and average about 0.4882 hour.
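The 0.4882 figure can be checked directly from the weighted-average percentiles in the Percentiles table on the preceding slide. The short script below only redoes that arithmetic. The variable names are illustrative.

```python
# Weighted-average percentiles of completion time (hours), copied from the Percentiles table.
percentiles = [5, 10, 25, 50, 75, 90, 95]
women = [3.5399, 3.7233, 4.0964, 4.5433, 5.0790, 5.6372, 5.9658]
men = [3.0256, 3.2373, 3.5950, 4.0469, 4.6183, 5.1408, 5.5035]

# Gap between women and men at each percentile.
gaps = [f - m for f, m in zip(women, men)]
for p, gap in zip(percentiles, gaps):
    print(f"{p:>2}th percentile: women - men = {gap:.4f} hours")

average_gap = sum(gaps) / len(gaps)
print(f"average gap = {average_gap:.4f} hours")
```

The individual gaps are not identical, so the 0.4882 figure is best read as the average gap across the seven percentiles.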

The weighted average and Tukey’s hinges are two different ways of calculating sample percentiles. For more details, see p. 120.
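The following sketch contrasts the two approaches under commonly used definitions. These are assumptions, since the slide defers the details to p. 120. The weighted average is taken to interpolate at position (n + 1)p in the sorted data, and Tukey's hinges are taken to be the medians of the lower and upper halves, with the middle case shared when n is odd. The function names are hypothetical.

```python
from statistics import median

def weighted_average_percentile(values, p):
    """Percentile obtained by interpolating at position (n + 1) * p in the sorted data."""
    xs = sorted(values)
    n = len(xs)
    position = (n + 1) * p   # 1-based position, possibly fractional
    k = int(position)        # whole part
    g = position - k         # fractional part
    if k < 1:
        return xs[0]
    if k >= n:
        return xs[-1]
    return (1 - g) * xs[k - 1] + g * xs[k]

def tukey_hinges(values):
    """Lower and upper hinges: medians of each half, sharing the middle case when n is odd."""
    xs = sorted(values)
    half = (len(xs) + 1) // 2
    return median(xs[:half]), median(xs[-half:])

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(weighted_average_percentile(data, 0.25), weighted_average_percentile(data, 0.75))
print(tukey_hinges(data))
```

On small samples the two methods can give visibly different quartiles. On a large sample such as the marathon data they nearly coincide, which matches the table (for example 4.0964 versus 4.1006 for women at the 25th percentile).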


Histogram of completion times for women

[Figure: histogram for sex = F; vertical axis: Frequency. Mean = 4.6274, Std. Dev. = 0.74743, N = 589.]


Histogram of completion times for men

[Figure: histogram for sex = M; vertical axis: Frequency. Mean = 4.1387, Std. Dev. = 0.74105, N = 911.]


Age and Gender

Report: time (HR:MIN:SEC)

Age group     Sex      Mean       N       Std. Deviation    Median
24 or less    F        4:35:42    84      0:49:08           4:29:20
24 or less    M        4:05:50    69      0:43:59           4:02:12
24 or less    Total    4:22:14    153     0:49:03           4:19:38
25-39         F        4:34:28    350     0:41:28           4:31:57
25-39         M        4:05:14    479     0:43:14           3:59:51
25-39         Total    4:17:35    829     0:44:52           4:15:50
40-44         F        4:32:45    76      0:43:11           4:28:02
40-44         M        4:03:17    142     0:44:30           3:58:29
40-44         Total    4:13:33    218     0:46:09           4:12:03
45-49         F        4:58:23    38      0:55:24           4:47:13
45-49         M        4:10:22    102     0:39:23           4:01:09
45-49         Total    4:23:24    140     0:49:01           4:11:59
50-54         F        4:52:03    28      0:40:23           4:46:00
50-54         M        4:15:20    73      0:48:05           4:10:56
50-54         Total    4:25:31    101     0:48:46           4:27:22
55-59         F        4:36:21    5       0:40:08           4:21:03
55-59         M        4:25:36    25      0:35:23           4:20:53
55-59         Total    4:27:24    30      0:35:42           4:20:58
60-64         F        5:31:17    2       0:50:32           5:31:17
60-64         M        5:01:26    11      0:47:01           5:03:02
60-64         Total    5:06:01    13      0:46:42           5:03:02
65+           F        6:56:33    2       0:16:06           6:56:33
65+           M        5:19:27    8       1:00:13           5:17:02
65+           Total    5:38:52    10      1:07:16           5:41:53
Total         F        4:37:31    585     0:44:56           4:32:27
Total         M        4:08:15    909     0:44:29           4:02:42
Total         Total    4:19:43    1494    0:46:53           4:16:29


Age and Gender

[Figure: mean completion time in hours by age group (44 or less, 45-49, 50-54, 55-59, 60-64, 65+), shown separately by sex (M, F).]


Boxplots of completion times by age and gender

[Figure: boxplots of completion time in hours by age group (44 or less, 45-49, 50-54, 55-59, 60-64, 65+), shown separately by sex (M, F).]


Remarks

Average completion times for men and women of different ages are shown.

For every age group, the average time for men is less than the average time for women.

For men and women younger than 45, age does not seem to matter very much.

For both men and women, the variability of completion times is fairly stable across age groups, except in the oldest age group.


Detecting outliers

Cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box are called outliers and are designated with an “o”.

Cases with values of more than 3 box lengths from the upper or lower edge of the box are called extreme values. They are designated with “*”.
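The two rules can be sketched as follows, taking the box length to be the interquartile range. The function name `classify_boxplot_cases` and the sample data are hypothetical, and the quartiles come from Python's `statistics.quantiles`, which may differ slightly from the hinges SPSS uses to draw the box.

```python
from statistics import quantiles

def classify_boxplot_cases(values):
    """Label each value 'ok', 'outlier' (o) or 'extreme' (*) by its distance from the box."""
    q1, _, q3 = quantiles(values, n=4)   # box edges; the quartile method is an assumption
    box_length = q3 - q1                 # interquartile range
    labels = []
    for x in values:
        # Distance beyond the nearer box edge; 0 when x lies inside the box.
        distance = max(q1 - x, x - q3, 0.0)
        if distance > 3 * box_length:
            labels.append("extreme")
        elif distance > 1.5 * box_length:
            labels.append("outlier")
        else:
            labels.append("ok")
    return labels

# Hypothetical completion times in hours (made up, not taken from mar1500.sav).
times = [3.8, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.8, 6.0, 7.9]
for t, label in zip(times, classify_boxplot_cases(times)):
    print(t, label)
```

In this made-up sample the box runs from 4.1 to 4.8, so 6.0 is flagged as an outlier and 7.9 as an extreme value.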


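Extreme Values

time HR:MIN:SEC

Sex    Rank         Case Number    Value      Name
F      Highest 1    1500           7:42:02
F      Highest 2    1499           7:07:57
F      Highest 3    1498           6:57:19
F      Highest 4    1497           6:55:08
F      Highest 5    1496           6:54:12
F      Lowest 1     55             3:03:45
F      Lowest 2     62             3:04:59
F      Lowest 3     77             3:09:22
F      Lowest 4     79             3:09:53
F      Lowest 5     80             3:10:14
M      Highest 1    1489           6:36:27
M      Highest 2    1488           6:31:48
M      Highest 3    1483           6:24:16
M      Highest 4    1482           6:23:31
M      Highest 5    1481           6:22:40
M      Lowest 1     1              2:11:40    De Haven(USA)
M      Lowest 2     2              2:16:34
M      Lowest 3     3              2:24:44
M      Lowest 4     4              2:33:30
M      Lowest 5     5              2:38:39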


Stem-and-leaf plots

A stem-and-leaf plot is a display very much like a histogram, but it shows more information about the data.

In a stem-and-leaf plot, each row corresponds to a stem and each case is represented by a leaf.


Stem-and-leaf plots

The following are the prices paid by 15 students eating lunch at a fast-food restaurant:

5.35, 4.75, 4.30, 5.47, 4.85, 6.62, 3.54, 4.87, 6.26, 5.48, 7.27, 8.45, 6.05, 4.76, 5.91

Frequency    Stem | Leaf
1            3 | 5
5            4 | 83998
4            5 | 4559
3            6 | 631
1            7 | 3
1            8 | 5

The first value, 5.35, is rounded to 5.4. The second value, 4.75, is rounded to 4.8. Their stems are 5 and 4, respectively. Their leaves are 4 and 8, respectively.
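The lunch-price plot can be reproduced with a short script. This is a sketch with a hypothetical function name, `stem_and_leaf`. It assumes prices are rounded half up to one decimal (so 5.35 becomes 5.4, as on the slide) and that leaves are kept in the order the data are listed.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

def stem_and_leaf(values):
    """Group values rounded to one decimal by integer stem; leaves keep the data order."""
    leaves = defaultdict(str)
    for v in values:
        # Decimal(str(v)) with ROUND_HALF_UP turns 5.35 into 5.4, matching the slide.
        rounded = Decimal(str(v)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        stem, leaf = str(rounded).split(".")
        leaves[int(stem)] += leaf
    return dict(sorted(leaves.items()))

prices = [5.35, 4.75, 4.30, 5.47, 4.85, 6.62, 3.54, 4.87,
          6.26, 5.48, 7.27, 8.45, 6.05, 4.76, 5.91]
for stem, row in stem_and_leaf(prices).items():
    # Columns: frequency, stem, leaves.
    print(f"{len(row):>2}  {stem} | {row}")
```

Python's built-in `round` is avoided here because binary floating point can send values such as 5.35 to 5.3 rather than 5.4. Stems with no cases are simply omitted in this sketch.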


Stem-and-leaf plots

completion time in hours Stem-and-Leaf Plot for agecat6 = 45-49

Frequency    Stem & Leaf
 2.00        2 . 99
13.00        3 . 0022223344444
40.00        3 . 5555566777777788888888899999999999999999
35.00        4 . 00000001111111122222233333333334444
21.00        4 . 555666666777778888899
12.00        5 . 000111111234
 9.00        5 . 667778889
 4.00        6 . 0011
 4.00        Extremes (>=6.2)

Stem width: 1.00
Each leaf: 1 case(s)