Exploratory Data Analysis

Upload: simeon

Post on 23-Feb-2016



TRANSCRIPT

Page 1: Exploratory Data Analysis

Exploratory Data Analysis

Page 2: Exploratory Data Analysis

John Tukey

• Developed these procedures to help one get a first look at distributions of scores.

• What is the shape of the distribution?

• Are there any suspicious scores?

• Stem and Leaf Display

• Box and Whiskers Plot

Page 3: Exploratory Data Analysis

Stem and Leaf Display

• See the pulse rate data at Exploratory Data Analysis (EDA).

• The scores range from 48 to 104.

• We probably want to group them into 5 to 15 intervals.

• I’ll use two intervals for the 40’s, two for the 50’s, etc.
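The grouping described here can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf`, its `lines_per_stem` parameter, and the `demo` scores are assumptions made up for the example, not the pulse rate data from the slides.

```python
from collections import defaultdict

def stem_and_leaf(scores, lines_per_stem=2):
    """Return (stem, leaves) rows, splitting each stem into
    `lines_per_stem` rows (2 gives leaves 0-4 and leaves 5-9)."""
    width = 10 // lines_per_stem              # leaf digits covered by one row
    rows = defaultdict(list)
    for s in sorted(scores):
        stem, leaf = divmod(s, 10)            # 104 -> stem 10, leaf 4
        rows[(stem, leaf // width)].append(leaf)
    lo, hi = min(rows), max(rows)
    out = []
    stem, part = lo
    while (stem, part) <= hi:                 # keep empty rows so gaps stay visible
        leaves = "".join(str(d) for d in rows.get((stem, part), []))
        out.append((stem, leaves))
        part += 1
        if part == lines_per_stem:
            stem, part = stem + 1, 0
    return out

# Hypothetical scores, chosen only to span 48 to 104 like the slide data.
demo = [48, 52, 56, 57, 61, 64, 64, 68, 70, 70, 73, 79, 80, 88, 104]
for stem, leaves in stem_and_leaf(demo):
    print(f"{stem:>3} | {leaves}")
```

Empty rows are kept on purpose, because a gap in the display is one of the things that makes a suspicious score stand out.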

Page 4: Exploratory Data Analysis

The Stem

Consists of a column of leading (aka “most significant”) digits, the leftmost digits in the scores. I’ll add to the stem the leaves, the trailing (rightmost, least significant) digits of each score.

Page 5: Exploratory Data Analysis

The Stem With Leaves

Next, I’ll arrange the leaves (within each row) from lowest to highest and add a “depth” column.

Page 6: Exploratory Data Analysis

Leaves Arranged in Order

Page 7: Exploratory Data Analysis

The Depth Column

• This column tells you how many scores there are in that row and all rows between it and the closer tail of the distribution.

• The row that contains the median has the row frequency in parentheses.
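As a rough sketch of how the depth column could be computed from the number of leaves in each row: the function name `depth_column` and the row counts below are hypothetical, and the rule for the median row is one reasonable reading of the description above.

```python
def depth_column(row_counts):
    """Depth label for each stem row, given the number of leaves per row
    (rows ordered from the low tail to the high tail)."""
    n = sum(row_counts)
    labels, cum_low = [], 0
    for count in row_counts:
        cum_low += count                   # this row plus all rows toward the low tail
        cum_high = n - cum_low + count     # this row plus all rows toward the high tail
        if cum_low > n / 2 and cum_high > n / 2:
            labels.append(f"({count})")    # median row: row frequency in parentheses
        else:
            labels.append(str(min(cum_low, cum_high)))  # count from the closer tail
    return labels

# Hypothetical row frequencies for a five-row display.
print(depth_column([2, 3, 4, 3, 1]))
```

If the median falls exactly between two rows, no row is marked with parentheses and each row simply shows its depth from the closer tail.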

Page 8: Exploratory Data Analysis

Rotated Display

It looks like a histogram, but the bars are made up of the scores.

From this display, can you identify any scores that are odd, compared to most of the other scores?

Page 9: Exploratory Data Analysis

Box and Whisker Plot

• Median Location = (N + 1)/2 = 97/2 = 48.5.

• The median will be located between the 48th and the 49th scores from either tail.

Page 10: Exploratory Data Analysis

There are 40 scores from 48 up to 68. Count up 8 more scores, starting with the first 70. The 48th score is a 70 and the 49th score is a 70, so the median is 70.
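The counting above can be checked with a small Python sketch. The helper names `median_location` and `value_at_location` and the six demo scores are assumptions for illustration; only N = 96 comes from the slides.

```python
def median_location(n):
    """1-based location of the median among n ordered scores."""
    return (n + 1) / 2

def value_at_location(sorted_scores, loc):
    """Score at a 1-based location; a .5 location averages the two neighbours."""
    lower = int(loc)                               # drop the decimal
    upper = lower if loc == lower else lower + 1   # next score when loc ends in .5
    return (sorted_scores[lower - 1] + sorted_scores[upper - 1]) / 2

print(median_location(96))                         # N = 96, as in the slides

# Hypothetical ordered scores, not the full pulse-rate data.
demo = [48, 64, 70, 70, 79, 104]
print(value_at_location(demo, median_location(len(demo))))
```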

Page 11: Exploratory Data Analysis

The Hinge Location

• = (Median Location + 1)/2.

• Drop any decimal on the median location.

• For our data, hinge location = (48 + 1)/2 = 24.5.

• Now, the upper hinge is the 24.5th score from the upper end of the distribution.

Page 12: Exploratory Data Analysis

There are 24 scores from 80 up to 104, so the 24th score from the highest is an 80. Go in toward the median one more score. The 25th score from the highest is a 78. The upper hinge is (80 + 78)/2 = 79.

Page 13: Exploratory Data Analysis

The 26th score from the lowest score is a 64. Move towards the lower tail by one score and you see the 25th score is also a 64. One more, the 24th score is also a 64. The lower hinge is 64.
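A minimal sketch of the hinge rule, counting in from each end of the ordered scores. The function names and the eight demo scores are hypothetical; the scores were picked so that the hinges come out at 64 and 79, the values found for the slide data.

```python
import math

def hinge_location(n):
    """(median location with its decimal dropped + 1) / 2."""
    median_loc = (n + 1) / 2
    return (math.floor(median_loc) + 1) / 2

def hinges(scores):
    """Lower and upper hinge, counted in from the low and high ends."""
    xs = sorted(scores)
    loc = hinge_location(len(xs))
    a, b = math.floor(loc), math.ceil(loc)   # the two neighbours of a .5 location
    lower = (xs[a - 1] + xs[b - 1]) / 2      # a-th and b-th scores from the lowest
    upper = (xs[-a] + xs[-b]) / 2            # a-th and b-th scores from the highest
    return lower, upper

print(hinge_location(96))                    # N = 96, as in the slides

# Hypothetical scores, not the pulse-rate data.
demo = [48, 60, 68, 70, 70, 78, 80, 104]
print(hinges(demo))
```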

Page 14: Exploratory Data Analysis

The H-Spread

• = the difference between the upper hinge and the lower hinge. For our data, 79 - 64 = 15.

• This is the range of the middle 50% of the scores.

• You also know this as the interquartile range.

Page 15: Exploratory Data Analysis

The Inner Fences

• The upper inner fence = the upper hinge plus 1.5 H‑spreads. For our data, 79 + 1.5(15) = 101.5.

• The lower inner fence is the lower hinge minus 1.5 H‑spreads, 64 - 1.5(15) = 41.5.

• These are invisible fences; they are not plotted.
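The H-spread and inner fence arithmetic can be written as a short Python sketch. The function names are made up for this example; the hinge values 64 and 79 are the ones found above.

```python
def h_spread(lower_hinge, upper_hinge):
    """Difference between the upper hinge and the lower hinge."""
    return upper_hinge - lower_hinge

def inner_fences(lower_hinge, upper_hinge):
    """Fences 1.5 H-spreads beyond each hinge."""
    step = 1.5 * h_spread(lower_hinge, upper_hinge)
    return lower_hinge - step, upper_hinge + step

print(h_spread(64, 79))
print(inner_fences(64, 79))
```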

Page 16: Exploratory Data Analysis

Adjacent Values

• These are scores that are outside of the middle 50% of the scores but within the inner fences.

• For our data, these will be scores that fall

– between 79 and 101.5 or

– between 41.5 and 64

Page 17: Exploratory Data Analysis

Outliers

• These are scores that are beyond the inner fences.

• For our data, these are scores that are

– less than 41.5 or

– greater than 101.5

Page 18: Exploratory Data Analysis

Outer Fences

• These invisible fences are 3 H-spreads beyond the hinges.

• For our data the lower outer fence is at 64 - 3(15) = 19

• and the upper outer fence is at 79 + 3(15) = 124

• Scores that are beyond the outer fences are called way-outliers.
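Putting the hinges and both sets of fences together, a score can be sorted into one of the categories defined on these slides. The sketch below is only illustrative: the function name `classify` and the label "box" for scores between the hinges are assumptions, and the outer fences follow the stated rule of 3 H-spreads beyond the hinges.

```python
def classify(score, lower_hinge, upper_hinge):
    """Label a score relative to the hinges, inner fences, and outer fences."""
    h = upper_hinge - lower_hinge                      # H-spread
    if score < lower_hinge - 3 * h or score > upper_hinge + 3 * h:
        return "way-outlier"                           # beyond an outer fence
    if score < lower_hinge - 1.5 * h or score > upper_hinge + 1.5 * h:
        return "outlier"                               # beyond an inner fence
    if score < lower_hinge or score > upper_hinge:
        return "adjacent"                              # outside the box, inside the fences
    return "box"                                       # middle 50% of the scores

# Hinges of 64 and 79 as found above; 130 is a made-up score.
for score in (70, 48, 104, 130):
    print(score, classify(score, 64, 79))
```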

Page 19: Exploratory Data Analysis

Drawing the Plot

• Prepare a numerical scale.

• Draw a box that extends from the lower hinge to the upper hinge.

• Draw a line through the box at the median.

• May also insert a symbol at the mean.

• Draw whiskers out to the most extreme adjacent values on each side.

Page 20: Exploratory Data Analysis

Whiskers

• For our data, the lowest adjacent value is the 48, so we draw the whiskers on the lower end out to 48.

• We do not go all the way out to the inner fence unless there is a score there.

• The highest adjacent value is a 99, so we draw whiskers on the upper end out to 99.
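A short sketch of how the whisker ends could be found: take the most extreme scores that still lie within the inner fences. The function name `whisker_ends` and the demo scores are hypothetical; the hinges are the 64 and 79 found earlier.

```python
def whisker_ends(scores, lower_hinge, upper_hinge):
    """Lowest and highest scores that lie within the inner fences."""
    step = 1.5 * (upper_hinge - lower_hinge)           # 1.5 H-spreads
    inside = [s for s in scores
              if lower_hinge - step <= s <= upper_hinge + step]
    return min(inside), max(inside)

# Hypothetical scores that include the extremes mentioned on the slides.
demo = [48, 60, 64, 70, 79, 88, 99, 104]
print(whisker_ends(demo, 64, 79))
```

The 104 is left out because it lies beyond the upper inner fence of 101.5, so it would be plotted as an outlier rather than reached by a whisker.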

Page 21: Exploratory Data Analysis

Outliers

• Every outlier is plotted with a special symbol, often an O for a regular outlier and an * for a way-outlier.

• Some programs will also print the identification number next to every outlier.

• These days, we use statistical software to make these displays and plots rather than doing them by hand.

Page 22: Exploratory Data Analysis

Plots Produced by SAS

Page 23: Exploratory Data Analysis

How tall, in inches, is your ideal mate?

Page 24: Exploratory Data Analysis

Eight Foot Tall Mate !

• That is a WAY-OUTLIER for sure !

• Investigation of the original data sheets revealed that the actual response was 69 inches, not 96 inches.

Page 25: Exploratory Data Analysis

Exploratory Data Analysis (EDA)

• It is highly recommended that you read the document linked above.

• It includes additional examples and a bit of silliness that might help you remember key concepts.

• Do watch the video clip of the Id attempting to cross an outer fence on the Forbidden Planet.