Post on 24-Dec-2015
Descriptive Statistics
the everyday notions of central tendency
Usual Customary Most Standard Expected normal Ordinary Medium commonplace
NY Times, 10/24/2010, "Stories vs. Statistics," by John Allen Paulos
Overview
• What are descriptive statistics?
• A bit of terminology/notation
• Measures of Central Tendency
  • Mean, Mode, Median
• Measures of Variability
  • Ranges, Standard Deviations
• The Normal Curve
Terminology/Notation
• A data distribution = a set of data/scores (the whole thing)
  • 1, 2, 4, 7
• X = a raw, single score (e.g., 2 from above)
• ∑ = summation (added up)
  • ∑X = 14 (each individual score added up)
• n = sample size (distribution size, or number of scores)
  • n = 4 (from above)
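The notation above maps directly onto a few lines of code. This is a minimal Python sketch using the slide's example distribution; the variable names are illustrative and not from the original slides.

```python
# The example distribution from the slide: the whole set of scores.
scores = [1, 2, 4, 7]

x = scores[1]        # X: one raw, single score (here, the 2)
sum_x = sum(scores)  # ∑X: each individual score added up
n = len(scores)      # n: sample size (number of scores)

print("X =", x, "| sum of X =", sum_x, "| n =", n)
```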
Descriptive Statistics
• Descriptive statistics are the side of statistics we most often use in our everyday lives
• Realize that most observations/data are too “large” for a human to take in and comprehend – we must “reduce” them
• How can we summarize what we see?
• Example – Grades/Registrar
Making sense out of chaos
Descriptive Statistics
• Descriptive statistics = describing the data
• n = 50, a test score of 83%
• Where does it fit in the class?
Descriptive Statistics
• Transform a set of numbers or observations into indices that describe or characterize the data
• “Summary statistics”
• A large group of statistics that are used in all research manuscripts
• Even the most complex statistical tests and studies start with descriptive statistics
Descriptive Statistics
• Measurement Scales: Nominal, Ordinal, Interval, Ratio
• Graphic Portrayals: Frequencies, Histograms, Bar graphs, Normal distribution
• Central Tendency: Mean, Median, Mode
• Relationship: Scatterplot, Correlation, Regression
• Variability: Range, Standard deviation, Standardized scores
Descriptive Statistics
• Descriptive statistics usually accomplish two major goals:
  1) Describe the central location of the data
  2) Describe how the data are dispersed about that point
• In other words, they provide:
  1) Measures of Central Tendency
  2) Measures of Variability
Measure of Central Tendency
• What SINGLE summary value best describes the CENTRAL location of an entire distribution?
  • Mode: the value that occurs most often
  • Median: the value above and below which 50% of the cases fall (the middle; 50th percentile)
  • Mean: mathematical balance point; arithmetic/mathematical average
Mode
• Most frequent occurrence
• What if data were?
  • 17, 19, 20, 20, 22, 23, 25, 28
  • 17, 19, 20, 20, 22, 23, 23, 28
• Problem: a set of numbers can be bimodal, or trimodal, depending on the scores
• Not a stable measure
  • Ex. 17, 19, 20, 22, 23, 28, 28
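The three data sets on the Mode slide can be checked with a short Python sketch. It assumes Python 3.8 or later, where the standard library provides `statistics.multimode`; the variable names are illustrative.

```python
from statistics import multimode

one_mode = [17, 19, 20, 20, 22, 23, 25, 28]   # 20 appears twice
two_modes = [17, 19, 20, 20, 22, 23, 23, 28]  # 20 and 23 both appear twice
unstable = [17, 19, 20, 22, 23, 28, 28]       # the repeated score is now 28

# multimode returns every value tied for most frequent, so a bimodal
# set yields two values instead of silently picking one.
for scores in (one_mode, two_modes, unstable):
    print(scores, "-> mode(s):", multimode(scores))
```

Changing a single score moves the mode from the middle of the distribution to its top end, which is the sense in which the slide calls the mode "not a stable measure."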
Median
• Rank numbers, pick middle one
• What if data were…?
  • 17, 19, 20, 23, 23, 28
  • Solution: add up the two middle scores, divide by 2 (= 21.5)
• Best measure in an asymmetrical (i.e., skewed) distribution; not sensitive to extreme scores
  • Ex. 17, 19, 20, 23, 23, 428
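A brief Python sketch of the Median slide's two examples, using the standard-library `statistics` module. The mean is printed alongside only for contrast; the variable names are illustrative.

```python
from statistics import mean, median

scores = [17, 19, 20, 23, 23, 28]
with_extreme_score = [17, 19, 20, 23, 23, 428]

# With an even number of scores, median() averages the two middle
# scores: (20 + 23) / 2.
print("median:", median(scores), "| mean:", round(mean(scores), 2))

# Replacing 28 with 428 leaves the two middle scores untouched, so the
# median does not move, while the mean is pulled far upward.
print("median:", median(with_extreme_score),
      "| mean:", round(mean(with_extreme_score), 2))
```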
Mean = X̄
• Add up the numbers and divide by the sample size (the number of numbers!)
• Try this one… 2, 3, 5, 6, 9
  • 2 + 3 + 5 + 6 + 9 = 25, and 25 / 5 = 5
• (Usually) best measure of the three: it uses the most information (all values from the distribution contribute)
$$\bar{X} = \frac{\sum X}{n}$$
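The mean is the sum of the scores divided by n, which can be sketched in a few lines of Python. The function name `mean_of` is a hypothetical helper for illustration.

```python
def mean_of(scores):
    """Return the arithmetic mean: the sum of the scores divided by n."""
    return sum(scores) / len(scores)

# The "try this one" example from the slide: 25 / 5.
print(mean_of([2, 3, 5, 6, 9]))
```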
Characteristics of the Mean
• Balance point
  • Point around which deviations sum to zero
  • Deviation = X – X̄
• For instance, if scores are 2, 3, 5, 6, 9
  • Mean is 5
  • Sum of deviations: (-3) + (-2) + 0 + 1 + 4 = 0
  • ∑ (X – X̄) = 0
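The balance-point property can be verified numerically. This is a small Python sketch with the slide's scores; with less tidy data, floating-point rounding may leave a sum that is tiny rather than exactly zero.

```python
scores = [2, 3, 5, 6, 9]
mean = sum(scores) / len(scores)

# Deviation of each score from the mean: X minus the mean.
deviations = [x - mean for x in scores]

print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
```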
Characteristics of the Mean
• Affected by extreme scores
  • Example 1
    • Scores 7, 11, 11, 14, 17
    • Mean = 12, Mode and Median = 11
  • Example 2
    • Scores 7, 11, 11, 14, 170
    • Mean = 42.6, Mode & Median = 11
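The two examples can be reproduced with Python's standard-library `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean, median, mode

example_1 = [7, 11, 11, 14, 17]
example_2 = [7, 11, 11, 14, 170]  # same scores, but 17 becomes 170

# Only the mean reacts to the single extreme score.
for scores in (example_1, example_2):
    print(scores, "| mean:", mean(scores),
          "| median:", median(scores), "| mode:", mode(scores))
```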
Characteristics of the Mean
• Balance point
• Affected by extreme scores
• Appropriate for use with interval or ratio scales of measurement
• More stable than Median or Mode when multiple samples are drawn from the same population
• Basis for inferential stats
Guidelines to Choose Measure of Central Tendency
• Mean is preferred because it is the basis of inferential statistics
• Median may be better for skewed data
  • Distribution of wealth in the US, e.g., annual household income in Washington state for 2000: mean = $76,818; median = $42,024
• Mode to describe the average of nominal data (eye color, hair color, etc.)
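Normal Distribution
[Figure: frequency (how often a score occurs) plotted against scores. MLB batting averages over a 3-year span (min. 100 AB); mean = 0.267, n = 1291.]
Normal Distribution
“Normal” distribution indicates the data are perfectly symmetrical
[Figure: normal curve over scores, with the mean, median, and mode marked.]
Positively skewed distribution
[Figure: positively skewed curve over scores, with the median, mode, and mean marked.]
NFL Salaries 2011
Negatively skewed distribution
[Figure: negatively skewed curve over scores, with the median, mode, and mean marked.]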
Relationship among the MCT & shape of distribution
Alaska’s average elevation of 1900 feet is less than that of Kansas. Nothing in that average suggests the 16 highest mountains in the United States are in Alaska. Averages mislead, don’t they?
Grab Bag, Pantagraph, 08/03/2000
Variability
Measures of dispersion or spread
The only thing constant is variation.
the notions of variability
• Unusual • Peculiar • Strange • Original • Extreme • Special • Unlike • Deviant • Dissimilar • Different
NY Times, 10/24/2010, "Stories vs. Statistics," by John Allen Paulos
Variability defined
• Measures of Central Tendency provide a summary level of the data
• Recognizes that scores vary across individual cases
  • i.e., the mean or median may not be an actual score in your distribution
• Variability quantifies the spread of performance
  • How scores vary around mean/mode/median
To describe a distribution
1) Measure of Central Tendency
  • Mean, Mode, Median
2) Measure of Variability
  • Multiple measures
    • Range, Interquartile Range, Semi-Interquartile Range
    • Standard Deviation
Range
• Range = difference between low/high score
• # of hours spent watching TV/week
  • 2, 5, 7, 7, 8, 8, 10, 12, 12, 15, 17, 20
• Range = (Max - Min) Score
  • 20 - 2 = 18
• Very susceptible to outliers
• Doesn’t indicate anything about variability around the mean/central point
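A short Python sketch of the range for the TV-hours data. The extra score of 60 is a hypothetical value added only to show how one outlier inflates the range; it is not in the original data.

```python
tv_hours = [2, 5, 7, 7, 8, 8, 10, 12, 12, 15, 17, 20]

# Range = max score minus min score.
score_range = max(tv_hours) - min(tv_hours)
print("range:", score_range)

# Hypothetical outlier (not from the slides): one person watching
# 60 hours a week stretches the range without changing the other scores.
with_outlier = tv_hours + [60]
range_with_outlier = max(with_outlier) - min(with_outlier)
print("range with outlier:", range_with_outlier)
```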
Semi-Interquartile range
• What is a quartile??
  • Divide sample into 4 parts of equal size
  • Q1, Q2, Q3 = Quartile Points
• Interquartile Range = Q3 - Q1
  • Difference between the highest and lowest quartile points
• SIQR = IQR / 2
• Related to the Median…prevents outliers from overly skewing the measure
• For ordinal data or skewed interval/ratio data
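The quartile points, IQR, and SIQR can be sketched in Python. This borrows the TV-hours scores from the Range slide as sample data and assumes Python 3.8 or later for `statistics.quantiles`. Textbooks and software use several slightly different quartile conventions; `method="inclusive"` is just one of them, chosen here as an assumption, so other tools may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles

tv_hours = [2, 5, 7, 7, 8, 8, 10, 12, 12, 15, 17, 20]

# n=4 asks for the three cut points (Q1, Q2, Q3) that split the data
# into four parts. The "inclusive" method is an assumed convention.
q1, q2, q3 = quantiles(tv_hours, n=4, method="inclusive")

iqr = q3 - q1   # interquartile range
siqr = iqr / 2  # semi-interquartile range

print("Q1:", q1, "| Q2 (median):", q2, "| Q3:", q3)
print("IQR:", iqr, "| SIQR:", siqr)
```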
BMD and walking
Quartiles based on miles walked/week
Krall et al, 1994, Walking is related to bone density and rates of bone loss. AJSM, 96:20-26
Notes: Skewed Distribution?
95th Percentile?
50th Percentile vs Median?
Standard Deviation
• Most commonly accepted measure of spread
  1. Compute the deviations of all numbers from the mean
  2. Square and THEN sum each of the deviations
  3. Divide by the number of deviations
  4. Finally, take the square root
Variation itself is nature's only irreducible essence. Stephen Jay Gould
$$SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}$$
Standard Deviation
• Distribution = 1, 3, 5, 7
• X̄ = 16 / 4 = 4
  1) Compute deviations = -3, -1, 1, 3
  2) Square deviations = 9, 1, 1, 9
  3) Sum squared deviations = 20
  4) Divide by n = 20 / 4 = 5
  5) Take square root = √5 ≈ 2.2
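The worked example can be followed step by step in Python. This is a minimal sketch; `population_sd` is a hypothetical helper name. It divides by n, as the slide does, which matches the standard library's `statistics.pstdev`. The related `statistics.stdev` divides by n - 1 and would give a larger value.

```python
from math import sqrt
from statistics import pstdev

def population_sd(scores):
    """Standard deviation computed by dividing by n, as on the slide."""
    n = len(scores)
    mean = sum(scores) / n
    deviations = [x - mean for x in scores]  # 1) compute deviations
    squared = [d ** 2 for d in deviations]   # 2) square each deviation
    total = sum(squared)                     # 3) sum the squared deviations
    variance = total / n                     # 4) divide by n
    return sqrt(variance)                    # 5) take the square root

scores = [1, 3, 5, 7]
print("step by step:", round(population_sd(scores), 3))
print("statistics.pstdev:", round(pstdev(scores), 3))
```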
Key points about SD
• SD small → data clustered around the mean
• SD large → data scattered from the mean
• Affected by extreme scores (just like the mean)…oftentimes called “outliers”
• Consistent (more stable) across samples from the same population
  • Just like the mean, so it works well with inferential stats (where repeated samples are taken)
SD Example
• Three NFL quarterbacks with similar QB ratings in 2006:
  • Matt Hasselbeck (SEA) = 76.0
  • Rex Grossman (CHI) = 73.9
  • Brett Favre (GB) = 72.7
  • Note: QB rating involves a complex formula accounting for passing attempts, completions, yards, touchdowns, and interceptions…100+ is considered outstanding & 70-80 is average
• All appear to have had very similar, somewhat mediocre seasons as QBs
SD Example
• Let’s look at the SD of their game-by-game QB ratings:
  • Matt Hasselbeck (SEA) = 29.97
  • Rex Grossman (CHI) = 47.60
  • Brett Favre (GB) = 27.81
• Grossman had, by far, the most variability (i.e., inconsistency) in his game-by-game performances…is this good or bad?
Clinical Use of SD
SD and the normal curve
• The following concepts are critical to your understanding of how descriptive statistics work
• Remember – a “normal” curve is perfectly symmetrical. This is not typical, but usually data are almost normal…
SD and the normal curve
[Figure: normal curve with X̄ = 70, SD = 10; score axis marked at 60, 70, 80; 34.1% of the area labeled on each side of the mean.]
• About 68% of scores fall within 1 SD of the mean
• About 68% of scores fall between 60 and 80
The standard deviation and the normal curve
[Figure: same curve, X̄ = 70, SD = 10; score axis marked at 50, 60, 70, 80, 90; areas labeled 34.1% and 13.6% on each side of the mean.]
• About 95% of scores fall within 2 SD of the mean
• About 95% of scores fall between 50 and 90
The standard deviation and the normal curve
[Figure: same curve, X̄ = 70, SD = 10; score axis marked at 40, 50, 60, 70, 80, 90, 100; areas labeled 34.1%, 13.6%, and 2.3% on each side of the mean.]
• About 99.7% of scores fall within 3 SD of the mean
• About 99.7% of scores fall between 40 and 100
What about X̄ = 70, SD = 5?
• What approximate percentage of scores fall between 65 & 75?
  • …1 SD below + 1 SD above = 68%
• What range includes about 99.7% of all scores?
  • …3 SD below to 3 SD above = 55 to 85
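These answers can be checked against an exact normal curve. The sketch below assumes Python 3.8 or later for `statistics.NormalDist`, and assumes the scores are perfectly normal with mean 70 and SD 5. The function name `share_within` is a hypothetical helper.

```python
from statistics import NormalDist

def share_within(mean, sd, k):
    """Proportion of a normal distribution within k SDs of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

mean, sd = 70, 5
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    share = share_within(mean, sd, k)
    print(f"within {k} SD ({low} to {high}): {share:.1%}")
```

The proportions depend only on the number of SDs, not on the particular mean or SD, which is why the same 68 / 95 / 99.7 figures apply to both the SD = 10 and SD = 5 examples.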
Interpreting The Normal Table
• Area under Normal Curve
  • Specific SD values (z) include certain percentages of the scores
  • Values of Special Interest
    • From the mean out to 1.96 SD on one side = 47.5% of scores (47.5 + 47.5 = 95%)
    • From the mean out to 2.58 SD on one side = 49.5% of scores (49.5 + 49.5 = 99%)
  • i.e., 95% of scores fall within 1.96 standard deviations of the mean (1.96 above and 1.96 below)
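The two table values of special interest can be reproduced from the standard normal curve. This sketch assumes Python 3.8 or later for `statistics.NormalDist`; `one_side_area` is a hypothetical helper name.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, SD 1, so scores are already in z units.
standard_normal = NormalDist()

def one_side_area(z):
    """Area between the mean and z SDs above it (one side only)."""
    return standard_normal.cdf(z) - 0.5

for z in (1.96, 2.58):
    one_side = one_side_area(z)
    print(f"z = {z}: one side {one_side:.1%}, both sides {2 * one_side:.1%}")
```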
IQ
[Figure: normal curve with X̄ = 100, SD = 15; score axis marked at 55, 70, 85, 100, 115, 130, 145; areas labeled 34.1%, 13.6%, and 2.3% on each side of the mean.]
68% have an IQ between 85 and 115
MLB players’ batting averages over a 3-year span (min. 100 at bats)
~95% of players have an average between 0.196 and 0.337
Next Week…
• We will utilize our understanding of descriptive statistics concepts, including central tendency, variability, and the normal curve, to examine standardized scores
• Homework = Cronk 3.1 – 3.4
• Bring calculator to class
• In-class activity 2…