exploratory data analysis: one variable

FPP 3-6

Exploratory Data Analysis: One Variable

Plan of attack

Distinguish different types of variables

Summarize data numerically

Summarize data graphically

Use theoretical distributions to potentially learn more about a variable.


The five steps of statistical analyses

1. Form the question
2. Collect data
3. Model the observed data (we start with exploratory techniques)
4. Check the model for reasonableness
5. Make and present conclusions

Just to make sure we are on the same page

More (or repeated) vocabulary

Individuals are the objects described by a set of data
  examples: employees, lab mice, states…

A variable is any characteristic of an individual that is of interest to the researcher. It takes on different values for different individuals
  examples: age, salary, weight, location…

How is this different from a mathematical variable?

Just to make sure we are on the same page #2

Measurement: the value of a variable obtained and recorded on an individual
  Example: 145 recorded as a person’s weight, 65 recorded as the height of a tree, etc.

Data is a set of measurements made on a group of individuals

The distribution of a variable tells us what values it takes and how often it takes these values

Chest Sizes of 5,738 Militiamen

Possible values (chest size)   How often each occurs (count)
33-34                            21
35-36                           266
37-38                          1169
39-40                          2152
41-42                          1592
43-44                           462
45-46                            71
47-48                             5

Two Types of Variables

A categorical/qualitative variable places an individual into one of several groups or categories
  examples: Gender, Race, Job Type, Geographic location…
  JMP calls these variables nominal

A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense
  examples: Height, Age, Salary, Price, Cost…
  Can be further divided into discrete (ordinal in JMP) and continuous

Why two types?

Each type requires its own summaries (graphical and numerical) and its own analysis.

I can’t emphasize enough the importance of identifying the type of variable being considered before proceeding with any type of statistical analysis

Example

Age: quantitative
Gender: categorical
Race: categorical
Salary: quantitative
Job type: categorical

Name                Age  Gender  Race   Salary  Job Type
Fleetwood, Delores  39   Female  White  62,100  Management
Perez, Juan         27   Male    White  47,350  Technical
Wang, Lin           20   Female  Asian  18,250  Clerical
Johnson, LaVerne    48   Male    Black  77,600  Management

Variable types in JMP

Qualitative/categorical: JMP uses Nominal

Quantitative
  Discrete: JMP uses Ordinal
  Continuous: JMP uses Continuous

Exploratory data analysis

Statistical tools that help examine data in order to describe their main features

Basic strategy
  Examine variables one by one, then look at the relationships among the different variables
  Start with graphs, then add numerical summaries of specific aspects of the data

Exploratory data analysis: One variable

Graphical displays
  Qualitative/categorical data: bar chart, pie chart, etc.
  Quantitative data: histogram, stem-leaf, boxplot, timeplot, etc.

Summary statistics
  Qualitative/categorical: contingency tables
  Quantitative: mean, median, standard deviation, range, etc.

Probability models
  Qualitative: Binomial distribution (others we won’t cover in this class)
  Quantitative: Normal curve (others we won’t cover in this class)

Example categorical/qualitative data

Summary table

We summarize categorical data using a table. Note that proportions (or percentages) are often called relative frequencies.

Class                     Frequency         Relative Frequency
Highest Degree Obtained   Number of CEOs    Proportion
None                       1                0.04
Bachelors                  7                0.28
Masters                   11                0.44
Doctorate / Law            6                0.24
Totals                    25                1.00
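The summary table can be rebuilt from raw category labels with a short tally. This is only a sketch: the list `degrees` is reconstructed from the counts in the table (one label per CEO), and the variable names are illustrative, not part of the course materials.

```python
from collections import Counter

# One label per CEO, reconstructed from the counts in the summary table.
degrees = (["None"] * 1 + ["Bachelors"] * 7
           + ["Masters"] * 11 + ["Doctorate / Law"] * 6)

frequency = Counter(degrees)   # count of individuals in each category
n = len(degrees)               # total number of individuals

# Relative frequency = count in the category / total count.
relative_frequency = {degree: count / n for degree, count in frequency.items()}

for degree, count in frequency.items():
    print(f"{degree:<16} {count:>3} {relative_frequency[degree]:.2f}")
```

The relative frequencies always sum to 1, which is a quick check that no category was left out.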

Bar graph

The bar graph quickly compares the four degree groups

The heights of the four bars show the counts for the four degree categories

Pie chart

A pie chart helps us see what part of the whole each group forms

To make a pie chart, you must include all the categories that make up a whole

Summary of categorical variables

Graphically: bar graphs, pie charts
  A bar graph is nearly always preferable to a pie chart. It is easier to compare bar heights than slices of a pie

Numerically: tables with total counts or percents

Quantitative variables

Graphical summary
  Histogram
  Stemplots
  Time plots
  more

Numerical summary
  Mean
  Median
  Quartiles
  Range
  Standard deviation
  more

Histograms

The bins are:
3.0 ≤ rate < 4.0
4.0 ≤ rate < 5.0
5.0 ≤ rate < 6.0
6.0 ≤ rate < 7.0
7.0 ≤ rate < 8.0
8.0 ≤ rate < 9.0
9.0 ≤ rate < 10.0
10.0 ≤ rate < 11.0
11.0 ≤ rate < 12.0
12.0 ≤ rate < 13.0
13.0 ≤ rate < 14.0
14.0 ≤ rate < 15.0

Histograms

The bins are:
2.0 ≤ rate < 4.0
4.0 ≤ rate < 6.0
6.0 ≤ rate < 8.0
8.0 ≤ rate < 10.0
10.0 ≤ rate < 12.0
12.0 ≤ rate < 14.0
14.0 ≤ rate < 16.0
16.0 ≤ rate < 18.0

Histograms

Where did the bins come from?
  They were chosen rather arbitrarily

Does choosing other bins change the picture?
  Yes!! And sometimes dramatically

What do we do about this?
  Some pretty smart people have come up with some “optimal” bin widths and we will rely on their suggestions

Histogram

The purpose of a graph is to help us understand the data

After you make a graph, always ask, “What do I see?”

Once you have displayed a distribution you can see the important features

Histograms

We will describe the features of the distribution that the histogram is displaying with three characteristics

1. Shape: symmetric, skewed right, skewed left, unimodal, multimodal, bell shaped
2. Center: mean, median
3. Spread (outliers or not): standard deviation, inter-quartile range

Body temperatures of 30 people

[Figure: JMP Distributions output for Body Temp (F); histogram with horizontal axis from 96.5 to 100]

Quantiles
100.0%  maximum   99.800
 99.5%            99.800
 97.5%            99.800
 90.0%            99.500
 75.0%  quartile  99.125
 50.0%  median    98.600
 25.0%  quartile  98.125
 10.0%            97.330
  2.5%            97.000
  0.5%            97.000
  0.0%  minimum   97.000

Moments
Mean            98.563333
Std Dev         0.7508539
Std Err Mean    0.1370865
upper 95% Mean  98.843707
lower 95% Mean  98.28296
N               30

Incomes from 500 households in 2000 current population survey

[Figure: JMP Distributions output for household income; histogram with count axis from 50 to 200 and horizontal axis from 0 to 250000]

Quantiles
100.0%  maximum   282577
 99.5%            255901
 97.5%            168707
 90.0%            101999
 75.0%  quartile   63135
 50.0%  median     33722
 25.0%  quartile   17292
 10.0%              7871
  2.5%              3773
  0.5%                 0
  0.0%  minimum        0

Moments
Mean            46854.196
Std Dev         43094.6
Std Err Mean    1929.1792
upper 95% Mean  50644.53
lower 95% Mean  43063.863
N               499

Histogram vs. Bar graph

Spaces mean something in histograms but not in bar graphs
Shape means nothing with bar graphs
The biggest difference is that they are displaying fundamentally different types of variables

Time Plots

Many variables are measured at intervals over time

Examples
  Closing stock prices
  Number of hurricanes
  Unemployment rates

If the interest in a variable is to see how it changes over time, use a time plot

Time Plots

Patterns to look for

Patterns that repeat themselves at known regular intervals of time are called seasonal variation

A trend is a persistent, long-term rise or fall

Time plots

[Figure: time plot of the number of hurricanes each year from 1970 to 1990; vertical axis Hurricanes (0 to 10), horizontal axis Year (1965 to 1995)]

Numerical summaries of quantitative variables

Want a numerical summary for center and spread

Center
  Mean
  Median
  Mode

Spread
  Range
  Inter-quartile range
  Standard deviation

The 5 number summary is a popular collection of the following: min, 1st quartile, median, 3rd quartile, max

Mean

To find the mean of a set of observations, add their values and divide by the number of observations

equation 1:

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

equation 2:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Mean example

The average age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the average age change? If so, what is the new average age?
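One way to reason about the mean example is to work with the total of the ages rather than the individual ages, since mean = total / count. The sketch below does that arithmetic; the variable names are illustrative.

```python
n = 20          # people in the room (unchanged: one leaves, one enters)
old_mean = 25   # average age before the swap

old_total = n * old_mean          # sum of all ages before the swap
new_total = old_total - 28 + 30   # remove the 28 year old, add the 30 year old
new_mean = new_total / n          # the mean only needs the total and the count

print(old_total, new_total, new_mean)
```

The individual ages are never needed, which is why the question can be answered at all.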

Median

The median is the midpoint of a distribution
  The number such that half the observations are smaller and the other half are larger
  Also called the 50th percentile or 2nd quartile

To compute a median
  Order the observations
  If the number of observations is odd, the median is the center observation
  If the number of observations is even, the median is the average of the two center observations
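The recipe for computing a median can be sketched directly in code. The function name `median_by_hand` and the two small lists are illustrative values, not data from the course.

```python
def median_by_hand(values):
    """Median following the slide's recipe: order, then pick the center."""
    ordered = sorted(values)          # step 1: order the observations
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: the single center observation
        return ordered[mid]
    # even count: average of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_by_hand([7, 1, 4])        # ordered: 1, 4, 7
even_case = median_by_hand([7, 1, 4, 10])   # ordered: 1, 4, 7, 10
print(odd_case, even_case)
```

Python's standard library offers `statistics.median`, which follows the same rule.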

Median example

The median age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the median age change? If so, what is the new median age?

The median age of 21 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the median age change? If so, what is the new median age?

Mean vs Median

When the histogram is symmetric, the mean and median are similar

The mean and median are different when the histogram is skewed
  Skewed to the right: mean is larger than median
  Skewed to the left: mean is smaller than median

The business magazine Forbes estimates that the “average” household wealth of its readers is either about $800,000 or about $2.2 million, depending on which “average” it reports. Which of these numbers is the mean wealth and which is the median wealth? Why?

Mean vs Median

[Figure: symmetric distribution]

[Figure: right skewed distribution]

[Figure: left skewed distribution]

Extreme example

Income in a small town of 6 people

$25,000 $27,000 $29,000 $35,000 $37,000 $38,000

Mean is about $31,833 and median is $32,000

Bill Gates moves to town

$25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000

Mean is $5,741,571 and median is $35,000

The mean is pulled by the outlier while the median is not. The median is a better measure of center for these data
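The numbers in the extreme example can be checked with Python's `statistics` module. The incomes are the ones listed on the slide; the variable names are illustrative.

```python
from statistics import mean, median

incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_outlier = incomes + [40_000_000]   # the newcomer's income is appended

# Before the outlier: mean and median are close to each other.
print(round(mean(incomes)), median(incomes))

# After the outlier: the mean jumps into the millions, the median barely moves.
print(round(mean(with_outlier)), median(with_outlier))
```

A single extreme value changes the mean by a factor of more than 100 here, while the median moves by only $3,000.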

Is a central measure enough?

A warm, stable climate greatly affects some individuals’ health. Atlanta and San Diego have about equal average temperatures (62° vs. 64°). If a person’s health requires a stable climate, in which city would you recommend they live?

Measures of spread

Range: subtract the smallest value from the largest

Inter-quartile range: subtract the 1st quartile from the 3rd quartile

Standard Deviation (SD): “average” distance from the mean

Which one should we use?

Standard Deviation

The standard deviation looks at how far observations are from their mean

It is the square root of the average squared deviation from the mean
  Compute the distance of each value from the mean
  Square each of these distances
  Take the average of these squares and then the square root

Often we will use SD to denote standard deviation

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$
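The three steps for the SD translate almost line by line into code. This is a minimal sketch that divides by N, as in the formula above; the function name `population_sd` and the list `data` are illustrative values chosen so the arithmetic is easy to follow.

```python
from math import sqrt


def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n                              # the mean
    squared_devs = [(x - mu) ** 2 for x in values]    # steps 1 and 2
    return sqrt(sum(squared_devs) / n)                # step 3


data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative: mean 5, squared deviations sum to 32
sd = population_sd(data)
print(sd)
```

Python's `statistics.pstdev` also divides by N and should agree with this function; `statistics.stdev` divides by N − 1 instead and gives a slightly larger value.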

Example

Standard deviation

Order these histograms by the SD of the numbers they portray. Go from smallest to largest

What is a reasonable guess of the SD for each?

[Figure: three histograms on their own scales, with horizontal axes from -15 to 20, from -1 to 2.5, and from -30 to 30]

Histograms on same scale

[Figure: the same three histograms, each drawn with a horizontal axis from -30 to 30]

Problem from text (p. 74, #2)

Which of the following sets of numbers has the smaller SD?
a) 50, 40, 60, 30, 70, 25, 75
b) 50, 40, 60, 30, 70, 25, 75, 50, 50, 50

Repeat for these two sets
c) 50, 40, 60, 30, 70, 25, 75
d) 50, 40, 60, 30, 70, 25, 75, 99, 1
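After reasoning about the problem by hand, the answers can be checked numerically. The sketch below uses `statistics.pstdev`, which divides by N; the names `set_a` through `set_d` are illustrative labels for the four lists.

```python
from statistics import mean, pstdev

set_a = [50, 40, 60, 30, 70, 25, 75]
set_b = set_a + [50, 50, 50]   # extra values sitting exactly at the mean
set_c = list(set_a)
set_d = set_a + [99, 1]        # extra values far from the mean

for label, values in [("a", set_a), ("b", set_b), ("c", set_c), ("d", set_d)]:
    print(label, mean(values), round(pstdev(values), 2))
```

All four lists have the same mean, so the comparison isolates what the added values do to the spread: values at the mean pull the SD down, values far from the mean push it up.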

More intuition behind the SD

This is a variance contest. You must give a list of six numbers chosen from the whole numbers 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with repeats allowed.

Give a list of six numbers with the largest standard deviation such a list described above can possibly have.

Give a list of six numbers with the smallest standard deviation such a list can possibly have.
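The contest is small enough to settle by brute force once you have made your own guess. This sketch enumerates every unordered list of six digits (order does not affect the SD) and uses `statistics.pstdev`; the variable names are illustrative.

```python
from itertools import combinations_with_replacement
from statistics import pstdev

# Every multiset of six numbers drawn from 0..9, repeats allowed.
candidates = list(combinations_with_replacement(range(10), 6))

largest = max(candidates, key=pstdev)    # list with the largest SD
smallest = min(candidates, key=pstdev)   # one of several lists tied for the smallest SD

print(largest, pstdev(largest))
print(smallest, pstdev(smallest))
```

The winning list pushes half the values to each extreme, and any list of six identical numbers ties for the smallest SD.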

Properties of SD

SD ≥ 0. (When is SD = 0?)

Has the same unit of measurement as the original observations

Inflated by outliers

Mean and SD

What happens to the mean if you add 5 to every number in a list?

What happens to the SD?

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
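A quick numerical experiment can confirm what the two formulas suggest. The list `values` is an illustrative example, not course data, and `statistics.pstdev` (dividing by N) stands in for the SD formula.

```python
from statistics import mean, pstdev

values = [1, 3, 5, 7, 9]              # illustrative list
shifted = [x + 5 for x in values]     # add 5 to every number

mean_change = mean(shifted) - mean(values)
sd_change = pstdev(shifted) - pstdev(values)

print(mean(values), mean(shifted), mean_change)
print(pstdev(values), pstdev(shifted), sd_change)
```

Adding a constant moves every value and the mean by the same amount, so each deviation x_i − μ is unchanged and the SD stays the same.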

Standard deviation

SDs are like measurement units on a ruler

Any quantitative variable can be converted into “standardized” units

These are often called z-scores and are denoted by the letter z

Important formula

$$z = \frac{\text{value} - \text{mean}}{\text{SD}} = \frac{\text{value} - \mu}{\sigma}$$

Example

ACT versus SAT scores

Which is more impressive: a 1340 on the SAT, or a 32 on the ACT?

The normal curve

When a histogram looks like a bell-shaped curve, z-scores are associated with percentages

The percentage of the data in between two different z-score values equals the area under the normal curve in between the two z-score values

A bit of notation here. N(μ, σ) is shorthand for a normal curve with mean μ and standard deviation σ (get used to this notation as it will be used fairly regularly throughout the course)

Normal curves

Properties of normal curve

In the Normal distribution with mean μ and standard deviation σ:
  68% of the observations fall within 1 σ of μ
  95% of the observations fall within 2 σs of μ
  99.7% of the observations fall within 3 σs of μ

By remembering these numbers, you can think about Normal curves without constantly making detailed calculations
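The 68, 95, and 99.7 figures are rounded areas under the normal curve, and they can be reproduced with `statistics.NormalDist` from Python's standard library (available in Python 3.8 and later). This is a sketch on the standard normal N(0, 1); the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area between -k and +k standard deviations, for k = 1, 2, 3.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in within.items():
    print(f"within {k} SD of the mean: {100 * share:.1f}%")
```

Because z-scores put every normal curve on the same scale, the same areas apply to any N(μ, σ).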

Properties of normal curves

For a N(0,1) the following holds

IQ

A person is considered to have mental retardation when

1. IQ is below 70
2. Significant limitations exist in two or more adaptive skill areas
3. Condition is present from childhood

What percentage of people have an IQ that meets the first criterion of mental retardation?

IQ

A histogram of all people’s IQ scores has a μ=100 and a σ=16

How to get % of people with IQ < 70
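One way to get the percentage is to convert 70 to a z-score and read off the area to its left under the normal curve. The sketch below assumes IQ scores follow N(100, 16) as stated on the slide and uses `statistics.NormalDist` in place of a normal table; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 100, 16

z = (70 - mu) / sigma                            # z-score of an IQ of 70
share_below_70 = NormalDist(mu, sigma).cdf(70)   # area to the left of 70

print(z)
print(f"{100 * share_below_70:.1f}% of people have IQ below 70 under this model")
```

The same area comes from the standard normal curve evaluated at z, which is what a printed normal table provides.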

More IQ

Reggie Jackson, one of the greatest baseball players ever, has an IQ of 140. What percentage of people have bigger IQs than Reggie?

Marilyn vos Savant, self-proclaimed smartest person in the world, has a reported IQ of 205. What percentage of people have IQ scores smaller than Marilyn’s score?

Mensa is a society for “intelligent people.” To qualify for Mensa, one needs to be in at least the upper 2% of the population in IQ score. What is the score needed to qualify for Mensa?
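These three questions use the normal curve in three different directions: an upper-tail area, a lower-tail area, and going from a percentage back to a score. The sketch assumes IQ scores follow N(100, 16), as on the earlier slide, and uses `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)

share_above_140 = 1 - iq.cdf(140)   # upper tail: IQ bigger than 140
share_below_205 = iq.cdf(205)       # lower tail: IQ smaller than 205
mensa_cutoff = iq.inv_cdf(0.98)     # score with 98% below it, i.e. the top 2%

print(f"above 140: {100 * share_above_140:.2f}%")
print(f"below 205: {100 * share_below_205:.6f}%")
print(f"Mensa cutoff: {mensa_cutoff:.1f}")
```

The `inv_cdf` call is the code equivalent of reading a normal table backwards: find the z-score with 98% of the area to its left, then convert it to an IQ with μ + zσ.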

Checking if data follow normal curve

Look for symmetric histogram

A different method is a normal probability plot. When normal curve is a good fit, points fall on a nearly straight line

Measurement error

Measurement error model

Measurement = truth + chance error

Outliers

Bias affects all measurements in the same way

Measurement = truth + bias + chance error

Often we assume that the chance error follows a normal curve that is centered at 0
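The measurement error model can be illustrated with a small simulation. Everything numeric here is a hypothetical choice for illustration only: the true value, the size of the bias, the SD of the chance error, and the random seed.

```python
import random
from statistics import mean, pstdev

random.seed(0)    # fixed seed so the sketch is repeatable

truth = 10.0      # hypothetical true value
bias = 0.5        # hypothetical bias, the same for every measurement
chance_sd = 0.2   # hypothetical SD of the chance error (normal, centered at 0)

# Measurement = truth + bias + chance error
measurements = [
    truth + bias + random.gauss(0, chance_sd) for _ in range(10_000)
]

average_error = mean(measurements) - truth
print(round(average_error, 3), round(pstdev(measurements), 3))
```

Averaging many measurements shrinks the chance error but leaves the bias untouched, so the average error stays near the bias rather than near 0.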
