exploratory data analysis

Post on 23-Feb-2016

Exploratory Data Analysis

John Tukey

• Developed these procedures to help one get a first look at distributions of scores.

• What is the shape of the distribution?

• Are there any suspicious scores?

• Stem and Leaf Display

• Box and Whiskers Plot

Stem and Leaf Display

• See the pulse rate data at Exploratory Data Analysis (EDA).

• The scores range from 48 to 104.

• We probably want to group them into 5 to 15 intervals.

• I’ll use two intervals for the 40’s, two for the 50’s, etc.

The Stem

Consists of a column of leading (aka “most significant”) digits, the leftmost digits in the scores. I’ll add to the stem the leaves, the trailing (rightmost, least significant) digits of each score.

The Stem With Leaves

Next, I’ll arrange the leaves (within each row) from lowest to highest and add a “depth” column.

Leaves Arranged in Order
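The construction described above can be sketched in Python. This is only an illustration: the function name `stem_and_leaf` and the `sample` scores are made up for the example (they are not the pulse rate data), and the two-rows-per-stem split (leaves 0 to 4, then 5 to 9) follows the "two intervals for the 40's" choice made earlier.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Build display rows with two rows per stem (leaves 0-4, then 5-9)."""
    rows = defaultdict(list)
    for score in sorted(scores):          # sorting puts leaves in order
        stem, leaf = divmod(score, 10)    # leading digits, trailing digit
        half = 0 if leaf < 5 else 1       # which of the stem's two rows
        rows[(stem, half)].append(leaf)

    lines = []
    stem, half = min(rows)
    last = max(rows)
    while (stem, half) <= last:           # include empty rows in between
        leaves = "".join(str(d) for d in rows.get((stem, half), []))
        lines.append(f"{stem:>2} | {leaves}")
        stem, half = (stem, 1) if half == 0 else (stem + 1, 0)
    return lines

# Hypothetical scores, for illustration only.
sample = [48, 52, 56, 57, 60, 61, 64, 64, 68, 70, 70, 72, 78, 80, 84, 99, 104]
for line in stem_and_leaf(sample):
    print(line)
```

Empty rows are kept on purpose, so that gaps in the distribution stay visible in the display.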

The Depth Column

• This column tells you how many scores there are in that row and all rows between it and the closer tail of the distribution.

• The row that contains the median has the row frequency in parentheses.
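As a rough sketch of how the depth column could be computed from the row frequencies: the function name `depth_column` and the example frequencies are hypothetical. When the median falls between two rows, this sketch assumes no row gets parentheses.

```python
def depth_column(freqs):
    """Depth labels for rows listed from the lowest stem to the highest.

    Rows below the median row count scores in from the low tail, rows
    above it count in from the high tail, and the row holding the median
    shows its own frequency in parentheses.
    """
    n = sum(freqs)
    median_location = (n + 1) / 2
    depths = []
    below = 0                                  # scores in all lower rows
    for f in freqs:
        first, last = below + 1, below + f     # positions held by this row
        if f > 0 and first <= median_location <= last:
            depths.append(f"({f})")
        elif last < median_location:
            depths.append(str(last))           # counted from the low tail
        else:
            depths.append(str(n - below))      # counted from the high tail
        below = last
    return depths

# Hypothetical row frequencies, for illustration only.
print(depth_column([1, 3, 5, 4, 2]))  # ['1', '4', '(5)', '6', '2']
```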

Rotated Display

It looks like a histogram, but the bars are made up of the scores.

From this display, can you identify any scores that are odd, compared to most of the other scores?

Box and Whisker Plot

• Median Location = (N + 1)/2 = 97/2 = 48.5.

• The median will be located between the 48th and the 49th scores from either tail.

There are 40 scores from 48 up to 68. Count up 8 more scores, starting with the first 70. The 48th score is a 70 and the 49th score is a 70, so the median is 70.
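The counting rule can be written out as a short Python sketch. The function name `median_by_location` is made up for illustration, and the placeholder list merely has N = 96 items so that the location matches the 48.5 above; it is not the pulse rate data.

```python
import math

def median_by_location(scores):
    """Return (median location, median) using the (N + 1)/2 rule."""
    ordered = sorted(scores)
    location = (len(ordered) + 1) / 2
    # A location ending in .5 means: average the two neighbouring scores.
    low = ordered[math.floor(location) - 1]
    high = ordered[math.ceil(location) - 1]
    return location, (low + high) / 2

# Placeholder data with N = 96, for illustration only.
placeholder = list(range(1, 97))
location, median = median_by_location(placeholder)
print(location, median)  # 48.5 48.5
```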

The Hinge Location

• = (Median Location + 1)/2.

• Drop any decimal on the median location.

• For our data, hinge location = (48 + 1)/2 = 24.5.

• Now, the upper hinge is the 24.5th score from the upper end of the distribution.

There are 24 scores from 80 up to 104, so the 24th score from the highest is an 80. Go in toward the median one more score. The 25th score from the highest is a 78. The upper hinge is (78 + 80)/2 = 79.

The 26th score from the lowest score is a 64. Move towards the lower tail by one score and you see the 25th score is also a 64. One more, the 24th score is also a 64. The lower hinge is 64.
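A minimal Python sketch of the hinge calculation follows, assuming the rule stated above (drop the decimal from the median location, add 1, halve, then count in from each tail). The names `value_at_depth` and `hinges` are hypothetical, and the placeholder data is not the pulse rate data.

```python
import math

def value_at_depth(ordered, depth):
    """Score at a depth counted from the start of `ordered`.

    A depth ending in .5 averages the two neighbouring scores.
    """
    first = ordered[math.floor(depth) - 1]
    second = ordered[math.ceil(depth) - 1]
    return (first + second) / 2

def hinges(scores):
    """Return (hinge location, lower hinge, upper hinge)."""
    ordered = sorted(scores)
    median_location = (len(ordered) + 1) / 2
    hinge_location = (math.floor(median_location) + 1) / 2
    lower = value_at_depth(ordered, hinge_location)        # from the low tail
    upper = value_at_depth(ordered[::-1], hinge_location)  # from the high tail
    return hinge_location, lower, upper

# Placeholder data with N = 96, so the hinge location is 24.5 as in the text.
print(hinges(range(1, 97)))  # (24.5, 24.5, 72.5)
```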

The H-Spread

• = the difference between the upper hinge and the lower hinge. For our data, 79 - 64 = 15.

• This is the range of the middle 50% of the scores.

• You also know this as the interquartile range.

The Inner Fences

• The upper inner fence = the upper hinge plus 1.5 H‑spreads. For our data, 79 + 1.5(15) = 101.5.

• The lower inner fence is the lower hinge minus 1.5 H‑spreads, 64 - 1.5(15) = 41.5.

• These are invisible fences; they are not plotted.
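The fence arithmetic can be checked with a few lines of Python. The function name `inner_fences` is made up for this sketch; the hinge values 64 and 79 are the ones found above.

```python
def inner_fences(lower_hinge, upper_hinge):
    """Return (lower, upper) inner fences: 1.5 H-spreads beyond each hinge."""
    h_spread = upper_hinge - lower_hinge
    step = 1.5 * h_spread
    return lower_hinge - step, upper_hinge + step

print(inner_fences(64, 79))  # (41.5, 101.5)
```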

Adjacent Values

• These are scores that are outside of the middle 50% of the scores but within the inner fences.

• For our data, these will be scores that fall

– between 79 and 101.5 or

– between 41.5 and 64

Outliers

• These are scores that are beyond the inner fences.

• For our data, these are scores that are

– Less than 41.5 or

– Greater than 101.5

Outer Fences

• These invisible fences are 3 H-spreads beyond the hinges.

• For our data the lower outer fence is at 64 - 3(15) = 19

• and the upper outer fence is at 79 + 3(15) = 124

• Scores that are beyond the outer fences are called way-outliers.
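Putting the hinges and both sets of fences together, each score can be sorted into one of the categories described above. This is only a sketch: the function name `classify` and the label strings are hypothetical, and "adjacent" is used in the sense given earlier (outside the middle 50% but within the inner fences).

```python
def classify(score, lower_hinge, upper_hinge):
    """Label a score relative to the hinges, inner fences and outer fences."""
    h_spread = upper_hinge - lower_hinge
    inner_low, inner_high = lower_hinge - 1.5 * h_spread, upper_hinge + 1.5 * h_spread
    outer_low, outer_high = lower_hinge - 3 * h_spread, upper_hinge + 3 * h_spread

    if score < outer_low or score > outer_high:
        return "way-outlier"   # beyond an outer fence
    if score < inner_low or score > inner_high:
        return "outlier"       # beyond an inner fence only
    if score < lower_hinge or score > upper_hinge:
        return "adjacent"      # outside the box, inside the inner fences
    return "box"               # within the middle 50%

# Hinges 64 and 79 come from the text; the scores tried here are illustrative.
for s in (48, 70, 99, 104):
    print(s, classify(s, 64, 79))
```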

Drawing the Plot

• Prepare a numerical scale.

• Draw a box that extends from the lower hinge to the upper hinge.

• Draw a line through the box at the median.

• May also insert a symbol at the mean.

• Draw whiskers out to the most extreme adjacent values on each side.

Whiskers

• For our data, the lowest adjacent value is the 48, so we draw the whiskers on the lower end out to 48.

• We do not go all the way out to the inner fence unless there is a score there.

• The highest adjacent value is a 99, so we draw whiskers on the upper end out to 99.
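The whisker rule (stop at the most extreme score that is still within the inner fences) might be sketched as below. The function name `whisker_ends` is made up, and the short `sample` list is hypothetical; it only borrows a few values mentioned in the text (48, 99, 104) to show the behaviour.

```python
def whisker_ends(scores, lower_hinge, upper_hinge):
    """Return the scores at which the lower and upper whiskers end."""
    step = 1.5 * (upper_hinge - lower_hinge)
    low_fence, high_fence = lower_hinge - step, upper_hinge + step
    # Keep only scores that are not beyond an inner fence.
    inside = [s for s in scores if low_fence <= s <= high_fence]
    return min(inside), max(inside)

# Hypothetical scores, for illustration only; 104 lies beyond the upper fence.
sample = [48, 64, 70, 79, 99, 104]
print(whisker_ends(sample, 64, 79))  # (48, 99)
```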

Outliers

• Every outlier is plotted with a special symbol, often an O for a regular outlier and an * for a way-outlier.

• Some programs will also print the identification number next to every outlier.

• These days, we use statistical software to make these displays and plots rather than doing them by hand.

Plots Produced by SAS

How tall, in inches, is your ideal mate?

Eight Foot Tall Mate!

• That is a WAY-OUTLIER for sure!

• Investigation of the original data sheets revealed that the actual response was 69 inches, not 96 inches.

Exploratory Data Analysis (EDA)

• It is highly recommended that you read the document linked above.

• It includes additional examples and a bit of silliness that might help you remember key concepts.

• Do watch the video clip of the Id attempting to cross an outer fence on the Forbidden Planet.
