introduction to descriptive statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfa...
TRANSCRIPT
![Page 1: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/1.jpg)
Introduction to
Descriptive
Statistics
17.871
![Page 2: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/2.jpg)
Types of Variables
Nominal
(Qualitative)
“categorical”
~Nominal(Quantitative)
Ordinal Interval orratio
![Page 3: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/3.jpg)
Describing data
Moment Non-mean
based measure
Center Mean Mode, median
Spread Variance
(standard
deviation)
Range,
Interquartile
range
Skew Skewness --
Peaked Kurtosis --
![Page 4: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/4.jpg)
Population vs. Sample Notation
Population Vs Sample
Greeks Romans
μ, σ, β s, b
![Page 5: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/5.jpg)
Mean
Xn
xn
i
i
1
![Page 6: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/6.jpg)
Variance, Standard Deviation
n
i
i
n
i
i
n
x
n
x
1
2
2
1
2
)(
,)(
![Page 7: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/7.jpg)
Variance, S.D. of a Sample
sn
x
sn
x
n
i
i
n
i
i
1
2
2
1
2
1
)(
,1
)(
Degrees of freedom
![Page 8: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/8.jpg)
Binary data
)1()1(
1 timeof proportion1)(
2 xxsxxs
xXprobX
xx
![Page 9: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/9.jpg)
Normal distribution example
IQ
SAT
Height
“No skew”
“Zero skew”
Symmetrical
Mean = median = mode Value
Frequency
![Page 10: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/10.jpg)
Skewness
Asymmetrical distribution
GPA of MIT students
“Negative skew”
“Left skew”
Value
Frequency
![Page 11: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/11.jpg)
Skewness
(Asymmetrical distribution) Income
Contribution to candidates
Populations of countries
“Residual vote” rates
“Positive skew”
“Right skew”
Value
Frequency
![Page 12: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/12.jpg)
Skewness
Value
Frequency
![Page 13: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/13.jpg)
Kurtosis
Value
Frequency
k > 3
k = 3
k < 3
leptokurtic
platykurtic
mesokurtic
![Page 14: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/14.jpg)
Normal distribution
Skewness = 0
Kurtosis = 3
22/)(
2
1)(
xexf
![Page 15: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/15.jpg)
The z-score
or the
“standardized score”
z x x
x
![Page 16: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/16.jpg)
More words about the normal curve
![Page 17: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/17.jpg)
Commands in STATA for
getting univariate statistics
summarize varname
summarize varname, detail
histogram varname, bin() start() width()
density/fraction/frequency normal
graph box varnames
tabulate [NB: compare to table]
![Page 18: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/18.jpg)
Example of Sophomore Test
Scores
High School and Beyond, 1980: A Longitudinal Survey of Students in the United States (ICPSR Study 7896)
totalscore = % of questions answered correctly minus penalty for guessing
recodedtype = (1=public school, 2=religious private, 3 = non-sectarian private)
![Page 19: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/19.jpg)
Explore totalscore some more
. table recodedtype,c(mean totalscore)
--------------------------
recodedty |
pe | mean(totals~e)
----------+---------------
1 | .3729735
2 | .4475548
3 | .589883
--------------------------
![Page 20: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/20.jpg)
Graph totalscore
. hist totalscore
0.5
11.5
2
Densi
ty
-.5 0 .5 1totalscore
![Page 21: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/21.jpg)
Divide into “bins” so that each bar
represents 1% correct hist totalscore,width(.01)
(bin=124, start=-.24209334, width=.01)
0.5
11.5
2
Density
-.5 0 .5 1totalscore
![Page 22: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/22.jpg)
Add ticks at each 10% mark
histogram totalscore, width(.01) xlabel(-.2 (.1) 1)
(bin=124, start=-.24209334, width=.01)
0.5
11.5
2
Density
-.2 -.1 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1totalscore
![Page 23: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/23.jpg)
Superimpose the normal curve
(with the same mean and s.d. as
the empirical distribution). histogram totalscore, width(.01) xlabel(-.2 (.1) 1)
normal
(bin=124, start=-.24209334, width=.01)
0.5
11.5
2
Density
-.2 -.1 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1totalscore
![Page 24: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/24.jpg)
Histograms by category
.histogram totalscore, width(.01) xlabel(-.2 (.1)1)
by(recodedtype)
(bin=124, start=-.24209334, width=.01)
01
23
01
23
-.2 -.1 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1
-.2 -.1 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1
1 2
3Density
totalscoreGraphs by recodedtype
![Page 25: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/25.jpg)
Main issues with histograms
Proper level of aggregation
Non-regular data categories
![Page 26: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/26.jpg)
A note about histograms with
unnatural categoriesFrom the Current Population Survey (2000), Voter and Registration Survey
How long (have you/has name) lived at this address?
-9 No Response
-3 Refused
-2 Don't know
-1 Not in universe
1 Less than 1 month
2 1-6 months
3 7-11 months
4 1-2 years
5 3-4 years
6 5 years or longer
![Page 27: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/27.jpg)
Solution, Step 1
Map artificial category onto
“natural” midpoint-9 No Response missing
-3 Refused missing
-2 Don't know missing
-1 Not in universe missing
1 Less than 1 month 1/24 = 0.042
2 1-6 months 3.5/12 = 0.29
3 7-11 months 9/12 = 0.75
4 1-2 years 1.5
5 3-4 years 3.5
6 5 years or longer 10 (arbitrary)
![Page 28: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/28.jpg)
Graph of recoded dataF
ract
ion
longevity0 1 2 3 4 5 6 7 8 9 10
0
.557134
histogram longevity, fraction
![Page 29: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/29.jpg)
longevity
0 1 2 3 4 5 6 7 8 9 10
0
15
Density plot of data
Total area of last bar = .557
Width of bar = 11 (arbitrary)
Solve for: a = w h (or)
.557 = 11h => h = .051
![Page 30: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/30.jpg)
Density plot template
Category Fraction X-min X-max X-length
Height
(density)
< 1 mo. .0156 0 1/12 .082 .19*
1-6 mo. .0909 1/12 ½ .417 .22
7-11 mo. .0430 ½ 1 .500 .09
1-2 yr. .1529 1 2 1 .15
3-4 yr. .1404 2 4 2 .07
5+ yr. .5571 4 15 11 .05
* = .0156/.082
![Page 31: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/31.jpg)
Draw the previous graph with a box
plot
. graph box totalscore-.
50
.51
Upper quartile
Median
Lower quartile} Inter-quartile
range
} 1.5 x IQR
![Page 32: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/32.jpg)
Draw the box plots for the different
types of schools. graph box totalscore,by(recodedtype)
-.5
0.5
1-.
50
.51
1 2
3
Graphs by recodedtype
![Page 33: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/33.jpg)
Draw the box plots for the
different types of schools using
“over” option-.
50
.51
1 2 3
graph box totalscore,over(recodedtype)
![Page 34: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/34.jpg)
Three words about pie charts:
don’t use them
![Page 35: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/35.jpg)
So, what’s wrong with them
For non-time series data, hard to get a
comparison among groups; the eye is very
bad in judging relative size of circle slices
For time series, data, hard to grasp cross-
time comparisons
![Page 36: Introduction to Descriptive Statisticsweb.mit.edu/17.871/www/2009/02 descriptive_stats_2009.pdfA note about histograms with unnatural categories From the Current Population Survey](https://reader030.vdocuments.us/reader030/viewer/2022040802/5e3afaf83ac8dd5ec228ce61/html5/thumbnails/36.jpg)
Some Words about
Graphical Presentation
Aspects of graphical integrity (following
Edward Tufte, Visual Display of
Quantitative Information)
Represent number in direct proportion to
numerical quantities presented
Write clear labels on the graph
Show data variation, not design variation
Deflate and standardize money in time series