exploratory data analysis: one variable
FPP 3-6
Exploratory Data Analysis: One Variable
Plan of attack
Distinguish different types of variables
Summarize data numerically
Summarize data graphically
Use theoretical distributions to potentially learn more about a variable.
The five steps of statistical analyses
1. Form the question
2. Collect data
3. Model the observed data
   We start with exploratory techniques.
4. Check the model for reasonableness
5. Make and present conclusions
Just to make sure we are on the same page
More (or repeated) vocabulary
Individuals are the objects described by a set of data
examples: employees, lab mice, states…
A variable is any characteristic of an individual that is of interest to the researcher. It takes on different values for different individuals
examples: age, salary, weight, location…
How is this different from a mathematical variable?
Just to make sure we are on the same page #2
Measurement: the value of a variable obtained and recorded on an individual
Example: 145 recorded as a person’s weight, 65 recorded as the height of a tree, etc.
Data is a set of measurements made on a group of individuals
The distribution of a variable tells us what values it takes and how often it takes these values
Chest Sizes of 5,738 Militiamen
Chest size (possible values)   Count (how often each occurs)
33-34                            21
35-36                           266
37-38                          1169
39-40                          2152
41-42                          1592
43-44                           462
45-46                            71
47-48                             5
Two Types of Variables
A categorical/qualitative variable places an individual into one of several groups or categories
examples: Gender, Race, Job Type, Geographic location…
JMP calls these variables nominal
A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense
examples: Height, Age, Salary, Price, Cost…
Can be further divided into ordinal and continuous
Why two types?
Each requires its own summaries (graphical and numerical) and analysis.
I can’t emphasize enough the importance of identifying the type of variable being considered before proceeding with any type of statistical analysis
Example
Name                 Age  Gender  Race   Salary  Job Type
Fleetwood, Delores   39   Female  White  62,100  Management
Perez, Juan          27   Male    White  47,350  Technical
Wang, Lin            20   Female  Asian  18,250  Clerical
Johnson, LaVerne     48   Male    Black  77,600  Management
Age: quantitative
Gender: categorical
Race: categorical
Salary: quantitative
Job type: categorical
Variable types in JMP
Qualitative/categorical: JMP uses Nominal
Quantitative
  Discrete: JMP uses Ordinal
  Continuous: JMP uses Continuous
Exploratory data analysis
Statistical tools that help examine data in order to describe their main features
Basic strategy
Examine variables one by one, then look at the relationships among the different variables
Start with graphs, then add numerical summaries of specific aspects of the data
Exploratory data analysis: One variable
Graphical displays
  Qualitative/categorical data: bar chart, pie chart, etc.
  Quantitative data: histogram, stem-leaf, boxplot, timeplot, etc.
Summary statistics
  Qualitative/categorical: contingency tables
  Quantitative: mean, median, standard deviation, range, etc.
Probability models
  Qualitative: Binomial distribution (others we won’t cover in this class)
  Quantitative: Normal curve (others we won’t cover in this class)
Example categorical/qualitative data
Summary table
We summarize categorical data using a table. Note that percentages are often called Relative Frequencies.
Highest Degree Obtained (Class)   Number of CEOs (Frequency)   Proportion (Relative Frequency)
None                               1                           0.04
Bachelors                          7                           0.28
Masters                           11                           0.44
Doctorate / Law                    6                           0.24
Totals                            25                           1.00
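The relative frequencies in the table can be reproduced from the counts alone. The following is a minimal sketch in Python (the variable names are illustrative and not part of the slides):

```python
# Counts copied from the CEO degree table above.
counts = {"None": 1, "Bachelors": 7, "Masters": 11, "Doctorate / Law": 6}

total = sum(counts.values())

# Relative frequency = count in the category / total count.
relative_frequency = {degree: n / total for degree, n in counts.items()}

for degree, n in counts.items():
    print(f"{degree:<16} {n:>3} {relative_frequency[degree]:.2f}")
print(f"{'Totals':<16} {total:>3} {sum(relative_frequency.values()):.2f}")
```

The relative frequencies always sum to 1, which is a handy check on a hand-built table.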
Bar graph
The bar graph quickly compares the degrees of the four groups
The heights of the four bars show the counts for the four degree categories
Pie chart
A pie chart helps us see what part of the whole each group forms
To make a pie chart, you must include all the categories that make up a whole
Summary of categorical variables
Graphically: bar graphs, pie charts
A bar graph is nearly always preferable to a pie chart. It is easier to compare bar heights than slices of a pie
Numerically: tables with total counts or percents
Quantitative variables
Graphical summary
  Histogram
  Stemplots
  Time plots
  more
Numerical summary
  Mean
  Median
  Quartiles
  Range
  Standard deviation
  more
Histograms
The bins are:
3.0 ≤ rate < 4.0
4.0 ≤ rate < 5.0
5.0 ≤ rate < 6.0
6.0 ≤ rate < 7.0
7.0 ≤ rate < 8.0
8.0 ≤ rate < 9.0
9.0 ≤ rate < 10.0
10.0 ≤ rate < 11.0
11.0 ≤ rate < 12.0
12.0 ≤ rate < 13.0
13.0 ≤ rate < 14.0
14.0 ≤ rate < 15.0
Histograms
The bins are:
2.0 ≤ rate < 4.0
4.0 ≤ rate < 6.0
6.0 ≤ rate < 8.0
8.0 ≤ rate < 10.0
10.0 ≤ rate < 12.0
12.0 ≤ rate < 14.0
14.0 ≤ rate < 16.0
16.0 ≤ rate < 18.0
Histograms
Where did the bins come from?
They were chosen rather arbitrarily
Does choosing other bins change the picture?
Yes!! And sometimes dramatically
What do we do about this?
Some pretty smart people have come up with some “optimal” bin widths and we will rely on their suggestions
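To see how the choice of bins changes the picture, the same values can be counted under two different binnings. This is only a sketch: the `rates` list is hypothetical (made up for illustration, not the data behind the slides), and `bin_counts` is a helper written here rather than a library function.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)  # index of the bin that contains v
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# HYPOTHETICAL rates, invented for this illustration only.
rates = [3.2, 4.1, 4.8, 5.5, 5.9, 6.0, 6.3, 6.7, 7.1, 7.9, 8.4, 9.6, 11.2, 14.5]

# Width-1 bins covering 3.0 <= rate < 15.0, as in the first binning above.
narrow = bin_counts(rates, start=3.0, width=1.0, n_bins=12)

# Width-2 bins covering 2.0 <= rate < 18.0, as in the second binning above.
wide = bin_counts(rates, start=2.0, width=2.0, n_bins=8)

print("width 1:", narrow)
print("width 2:", wide)
```

Both lists describe the same 14 values, but the narrow bins show gaps that the wide bins smooth over.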
Histogram
The purpose of a graph is to help us understand the data
After you make a graph, always ask, “What do I see?”
Once you have displayed a distribution you can see the important features
Histograms
We will describe the features of the distribution that the histogram is displaying with three characteristics
1. Shape
   Symmetric, skewed right, skewed left, uni-modal, multi-modal, bell shaped
2. Center
   Mean, median
3. Spread (outliers or not)
   Standard deviation, inter-quartile range
Body temperatures of 30 people
[JMP Distributions output: histogram of Body Temp (F), horizontal axis from 96.5 to 100]
Quantiles
100.0%  maximum   99.800
 99.5%            99.800
 97.5%            99.800
 90.0%            99.500
 75.0%  quartile  99.125
 50.0%  median    98.600
 25.0%  quartile  98.125
 10.0%            97.330
  2.5%            97.000
  0.5%            97.000
  0.0%  minimum   97.000
Moments
Mean            98.563333
Std Dev         0.7508539
Std Err Mean    0.1370865
upper 95% Mean  98.843707
lower 95% Mean  98.28296
N               30
Incomes from 500 households in 2000 current population survey
[JMP Distributions output: histogram of household income, horizontal axis from 0 to 250000, count axis from 50 to 200]
Quantiles
100.0%  maximum   282577
 99.5%            255901
 97.5%            168707
 90.0%            101999
 75.0%  quartile  63135
 50.0%  median    33722
 25.0%  quartile  17292
 10.0%            7871
  2.5%            3773
  0.5%            0
  0.0%  minimum   0
Moments
Mean            46854.196
Std Dev         43094.6
Std Err Mean    1929.1792
upper 95% Mean  50644.53
lower 95% Mean  43063.863
N               499
Histogram vs. Bar graph
Spaces mean something in histograms but not in bar graphs
Shape means nothing with bar graphs
The biggest difference is that they are displaying fundamentally different types of variables
Time Plots
Many variables are measured at intervals over time
Examples
  Closing stock prices
  Number of hurricanes
  Unemployment rates
If the interest in a variable is to see how it changes over time, use a time plot
Time Plots
Patterns to look for
Patterns that repeat themselves at known regular intervals of time are called seasonal variation
A trend is a persistent, long-term rise or fall
Time plots
Number of hurricanes each year from 1970 - 1990
[Time plot: Hurricanes (vertical axis, 0 to 10) against Year (horizontal axis, 1965 to 1995)]
Numerical summaries of quantitative variables
Want a numerical summary for center and spread
Center
  Mean
  Median
  Mode
Spread
  Range
  Inter-quartile range
  Standard deviation
The 5 number summary is a popular collection of the following: min, 1st quartile, median, 3rd quartile, max
Mean
To find the mean of a set of observations, add their values and divide by the number of observations
equation 1:
\[ \mu = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
equation 2:
\[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \]
Mean example
The average age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the average age change? If so, what is the new average age?
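One way to reason about this example is to work with the total age rather than the individual ages, since mean = total / count. A short sketch of that arithmetic (variable names are illustrative):

```python
n_people = 20
old_mean = 25

# The mean fixes the total: total = mean * number of observations.
old_total = n_people * old_mean      # 20 * 25 = 500

# One person leaves (28) and one enters (30); the count stays at 20.
new_total = old_total - 28 + 30      # 502
new_mean = new_total / n_people      # 502 / 20 = 25.1

print(new_mean)
```

The individual ages are never needed: swapping one value for another shifts the mean by the difference divided by the number of people.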
Median
The median is the midpoint of a distribution
The number such that half the observations are smaller and the other half are larger
Also called the 50th percentile or 2nd quartile
To compute a median
  Order observations
  If number of observations is odd the median is the center observation
  If number of observations is even the median is the average of the two center observations
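The recipe for computing a median can be sketched directly in Python. The function below is written for illustration (Python's standard library also offers `statistics.median`), and the two example lists are made up; it assumes a non-empty list.

```python
def median(observations):
    """Median by the recipe above: sort, then take the center value(s)."""
    ordered = sorted(observations)   # step 1: order the observations
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single center observation.
        return ordered[mid]
    # Even count: average of the two center observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median([3, 1, 2])        # center of 1, 2, 3
even_example = median([4, 1, 3, 2])    # average of 2 and 3
print(odd_example, even_example)
```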
Median example
The median age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the median age change? If so, what is the new median age?
The median age of 21 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the median age change? If so, what is the new median age?
Mean vs Median
When histogram is symmetric mean and median are similar
Mean and median are different when histogram is skewed
  Skewed to the right: mean is larger than median
  Skewed to the left: mean is smaller than median
The business magazine Forbes estimates that the “average” household wealth of its readers is either about $800,000 or about $2.2 million, depending on which “average” it reports. Which of these numbers is the mean wealth and which is the median wealth? Why?
Mean vs Median: Symmetric distribution
Mean vs Median: Right skewed distribution
Mean vs Median: Left skewed distribution
Extreme example
Income in small town of 6 people
$25,000 $27,000 $29,000 $35,000 $37,000 $38,000
Mean is $31,833 and median is $32,000
Bill Gates moves to town
$25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000
Mean is $5,741,571 and median is $35,000
Mean is pulled by the outlier while the median is not. The median is a better measure of center for these data
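The numbers in this example can be checked with Python's standard `statistics` module. This is a small sketch using the incomes listed above; the variable names are illustrative.

```python
from statistics import mean, median

# Incomes of the 6 residents, from the example above.
town = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]

# The same town after one very large income is added.
with_outlier = town + [40_000_000]

print(round(mean(town)), median(town))
print(round(mean(with_outlier)), median(with_outlier))
```

A single extreme value moves the mean by millions of dollars, while the median only moves from one middle income to the next.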
Is a central measure enough?
A warm, stable climate greatly affects some individuals’ health. Atlanta and San Diego have about equal average temperatures (62° vs. 64°). If a person’s health requires a stable climate, in which city would you recommend they live?
Measures of spread
Range: subtract the smallest value from the largest
Inter-quartile range: subtract the 1st quartile from the 3rd quartile
Standard Deviation (SD): “average” distance from the mean
Which one should we use?
Standard Deviation
The standard deviation looks at how far observations are from their mean
It is the square root of the average squared deviation from the mean
  Compute distance of each value from mean
  Square each of these distances
  Take the average of these squares and square root
Often we will use SD to denote standard deviation
\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \]
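The three steps of the SD computation translate almost line by line into code. The sketch below divides by N, as in the formula on this slide (this matches `statistics.pstdev`, not the N - 1 version); the example list is made up for illustration.

```python
from math import sqrt

def population_sd(values):
    """SD as the square root of the average squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n
    squared_deviations = [(x - mu) ** 2 for x in values]  # distance, squared
    return sqrt(sum(squared_deviations) / n)              # average, then root

# HYPOTHETICAL list chosen so the arithmetic is easy to follow by hand:
# mean is 5, squared deviations sum to 32, and 32 / 8 = 4.
example = [2, 4, 4, 4, 5, 5, 7, 9]
sd = population_sd(example)
print(sd)
```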
Example
Standard deviation
Order these histograms by the SD of the numbers they portray. Go from smallest to largest
What is a reasonable guess of the SD for each?
[Three histograms, with horizontal axes from -15 to 20, from -1 to 2.5, and from -30 to 30]
Histograms on same scale
[The same three histograms, each with horizontal axis from -30 to 30]
Problem from text (p. 74, #2)
Which of the following sets of numbers has the smaller SD?
a) 50, 40, 60, 30, 70, 25, 75
b) 50, 40, 60, 30, 70, 25, 75, 50, 50, 50
Repeat for these two sets
c) 50, 40, 60, 30, 70, 25, 75
d) 50, 40, 60, 30, 70, 25, 75, 99, 1
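After reasoning about the problem by hand, an answer can be checked numerically. The sketch below uses `statistics.pstdev`, which divides by N like the SD formula in these slides; the set names simply mirror the labels a) through d).

```python
from statistics import mean, pstdev

set_a = [50, 40, 60, 30, 70, 25, 75]
set_b = set_a + [50, 50, 50]   # extra values sitting exactly at the mean
set_c = list(set_a)            # same list as a)
set_d = set_a + [99, 1]        # extra values far from the mean

for name, values in [("a", set_a), ("b", set_b), ("c", set_c), ("d", set_d)]:
    print(name, mean(values), round(pstdev(values), 2))
```

All four lists have mean 50, so the comparison comes down to how far the added values sit from 50.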
More intuition behind the SD
This is a variance contest. You must give a list of six numbers chosen from the whole numbers 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with repeats allowed.
Give a list of six numbers with the largest standard deviation such a list described above can possibly have.
Give a list of six numbers with the smallest standard deviation such a list can possibly have.
Properties of SD
SD ≥ 0. (When is SD = 0?)
Has the same unit of measurement as the original observations
Inflated by outliers
Mean and SD
What happens to the mean if you add 5 to every number in a list?
What happens to the SD?
\[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \]
\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \]
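A quick numerical experiment can back up the algebra here. The list below is hypothetical (any list would do), and `pstdev` is the divide-by-N SD used in these slides.

```python
from statistics import mean, pstdev

# HYPOTHETICAL list, chosen only for illustration.
values = [1, 3, 4, 8]
shifted = [x + 5 for x in values]   # add 5 to every number

mean_change = mean(shifted) - mean(values)
sd_change = pstdev(shifted) - pstdev(values)

print(mean_change, sd_change)
```

Adding a constant moves every value and the mean by the same amount, so each deviation x_i - μ is unchanged and the SD stays the same.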
Standard deviation
SDs are like measurement units on a ruler
Any quantitative variable can be converted into “standardized” units
These are often called z-scores and are denoted by the letter z
Important formula
\[ z = \frac{\text{value} - \text{mean}}{\text{SD}} = \frac{\text{value} - \mu}{\sigma} \]
Example
ACT versus SAT scores
Which is more impressive: a 1340 on the SAT, or a 32 on the ACT?
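The z-score formula is easy to apply once a mean and SD are known for each test. The slides do not give those values, so the means and SDs below are HYPOTHETICAL placeholders chosen only to show the mechanics; real values would have to come from the test publishers.

```python
def z_score(value, mean, sd):
    """Convert a value to standardized units: (value - mean) / SD."""
    return (value - mean) / sd

# HYPOTHETICAL reference distributions (assumptions, not from the slides).
SAT_MEAN, SAT_SD = 1000, 200
ACT_MEAN, ACT_SD = 21, 5

sat_z = z_score(1340, SAT_MEAN, SAT_SD)
act_z = z_score(32, ACT_MEAN, ACT_SD)

print(f"SAT z = {sat_z:.2f}, ACT z = {act_z:.2f}")
```

Whichever score has the larger z-score sits further above its own test's mean, measured in that test's SDs, which is what makes scores on different scales comparable.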
The normal curve
When histogram looks like a bell-shaped curve, z-scores are associated with percentages
The percentage of the data in between two different z-score values equals the area under the normal curve in between the two z-score values
A bit of notation here. N(μ, σ) is shorthand for writing normal curve with mean μ and standard deviation σ (get used to this notation as it will be used fairly regularly throughout the course)
Normal curves
Properties of normal curve
In the Normal distribution with mean μ and standard deviation σ:
  68% of the observations fall within 1 σ of μ
  95% of the observations fall within 2 σs of μ
  99.7% of the observations fall within 3 σs of μ
By remembering these numbers, you can think about Normal curves without constantly making detailed calculations
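The 68, 95, and 99.7 figures are rounded areas under the normal curve, and they can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). A minimal sketch:

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)   # the N(0, 1) curve

within = {}
for k in (1, 2, 3):
    # Area between -k and +k SDs = cdf(k) - cdf(-k).
    within[k] = standard.cdf(k) - standard.cdf(-k)
    print(f"within {k} SD: {within[k]:.4f}")
```

The same areas hold for any N(μ, σ), because converting to z-scores maps every normal curve onto N(0, 1).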
Properties of normal curves
For a N(0,1) the following holds
IQ
A person is considered to have mental retardation when
1. IQ is below 70
2. Significant limitations exist in two or more adaptive skill areas
3. Condition is present from childhood
What percentage of people have an IQ that meets the first criterion of mental retardation?
IQ
A histogram of all people’s IQ scores has a μ=100 and a σ=16
How to get % of people with IQ < 70
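One way to get this percentage is to convert 70 to a z-score and then look up the area under the normal curve to its left. The sketch below assumes IQ scores follow N(100, 16) as stated above, and uses `statistics.NormalDist` in place of a printed normal table.

```python
from statistics import NormalDist

MU, SIGMA = 100, 16

# Step 1: standardize the cutoff.
z = (70 - MU) / SIGMA                      # -1.875

# Step 2: area under N(0, 1) to the left of z.
share_below_70 = NormalDist(0, 1).cdf(z)

print(f"z = {z}, share below 70 = {share_below_70:.4f}")
```

As a rough check, 70 is close to 2 SDs below the mean (which would be 68), and the 68-95-99.7 rule puts about 2.5% below that point.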
More IQ
Reggie Jackson, one of the greatest baseball players ever, has an IQ of 140. What percentage of people have bigger IQs than Reggie?
Marilyn vos Savant, self-proclaimed smartest person in the world, has a reported IQ of 205. What percentage of people have IQ scores smaller than Marilyn’s score?
Mensa is a society for “intelligent people.” To qualify for Mensa, one needs to be in at least the upper 2% of the population in IQ score. What is the score needed to qualify for Mensa?
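These three questions use the normal curve in both directions: from a score to a percentage, and from a percentage back to a score. A sketch under the same N(100, 16) assumption for IQ scores, using `statistics.NormalDist` (variable names are illustrative):

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)

# Score to percentage: area to the right of 140.
share_above_140 = 1 - iq.cdf(140)

# Score to percentage: area to the left of 205 (more than 6 SDs above the mean).
share_below_205 = iq.cdf(205)

# Percentage to score: the score with 98% of people below it (upper 2%).
mensa_cutoff = iq.inv_cdf(0.98)

print(f"above 140: {share_above_140:.4%}")
print(f"below 205: {share_below_205:.10f}")
print(f"upper 2% cutoff: {mensa_cutoff:.1f}")
```

The last step is the reverse lookup: find the z-score with 98% of the area below it, then convert back with score = μ + z·σ.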
Checking if data follow normal curve
Look for symmetric histogram
A different method is a normal probability plot. When normal curve is a good fit, points fall on a nearly straight line
Measurement error
Measurement error model
Measurement = truth + chance error
Outliers
Bias affects all measurements in the same way
Measurement = truth + bias + chance error
Often we assume that the chance error follows a normal curve that is centered at 0
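The measurement error model can be illustrated with a small simulation. Everything numeric below is HYPOTHETICAL (the true value, the bias, and the SD of the chance error are made up), and the chance error is drawn from a normal curve centered at 0, as assumed above.

```python
import random
from statistics import mean

random.seed(0)   # fixed seed so the run is reproducible

# HYPOTHETICAL values, for illustration only.
truth = 145.0      # the true weight being measured
bias = 0.5         # shifts every measurement the same way
chance_sd = 1.0    # SD of the chance error

# Measurement = truth + bias + chance error
measurements = [truth + bias + random.gauss(0, chance_sd) for _ in range(10_000)]

average = mean(measurements)
print(f"average of measurements: {average:.3f}")
print(f"truth + bias:            {truth + bias:.3f}")
```

Averaging many measurements cancels out most of the chance error, but the bias remains: the average settles near truth + bias rather than near the truth.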