AN OVERVIEW OF STATISTICS
WHAT IS STATISTICS?
What does a statistician do?
Player   Games  Minutes  Points  Rebounds  FG%
Bob      34     32.7     24      7.6       .552
Andy     36     31.5     21      8.4       .465
Larry    30     33.0     18      5.6       .493
Michael  31     35.1     29      6.1       .422
JOB OF A STATISTICIAN
• Collects numbers or data
• Systematically organizes or arranges the data
• Analyzes the data and extracts relevant information to provide a complete numerical description
• Infers general conclusions about the problem using this numerical description
POLITICS
Forecasting and predicting winners of elections
Where to concentrate campaign appearances, advertising and $$…
If the election for president of the United States were held today, who would you be more likely to vote for?
Rudy Giuliani    45%
Hillary Clinton  43%
Someone else      2%
Wouldn’t vote     4%
Unsure            6%
INDUSTRY
• To market product…
• Interested in the average length of life of a light bulb
• Cannot test all the bulbs
USES OF STATISTICS
• Statistics is a theoretical discipline in its own right
• Statistics is a tool for researchers in other fields
• Used to draw general conclusions in a large variety of applications
COMMON PROBLEM
Decision or prediction about a large body of measurements which cannot be totally enumerated.
Examples
• Light bulbs (to enumerate population is destructive)
• Forecasting the winner of an election (population too big; people change their minds)
Solutions
Collect a smaller set of measurements that will (hopefully) be representative of the larger set.
DATA AND STATISTICS
Data consists of information coming from observations, counts, measurements, or responses.
Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.
A population is the collection of all outcomes, responses, measurements, or counts that are of interest.
A sample is a subset of a population.
Introduction to Probability and Statistics
Thirteenth Edition
Chapter 1
Describing Data with Graphs
Introduction to Statistical Terms
Variable
o Something that can assume some type of value
Data
o Consists of information coming from observations, counts, measurements, or responses
Data Set
o A collection of data values
Observation
o The value, at a particular period, of a particular variable
An experimental unit is the individual or object on which a variable is measured.
A measurement results when a variable is actually measured on an experimental unit.
A set of measurements, called data, can be either a sample or a population.
Example
• Variable – Time until a light bulb burns out
• Experimental unit – Light bulb
• Typical Measurements – 1500 hours, 1535.5 hours, etc.
Populations and Samples
• A Population is the set of all items or individuals of interest
– Examples: All likely voters in the next election
            All parts produced today
            All sales receipts for November
• A Sample is a subset of the population
– Examples: 1000 voters selected at random for interview
            A few parts selected for destructive testing
            Every 100th receipt selected for audit
[Figure: sampling techniques draw a sample from the population; statistical procedures support inference from the sample (described by statistics) back to the population (described by parameters).]
Parameters & Statistics
A parameter is a numerical description of a population characteristic.
A statistic is a numerical description of a sample characteristic.
Parameter → Population
Statistic → Sample
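The distinction between a parameter and a statistic can be sketched in a few lines of Python. The population below is hypothetical (simulated light bulb lifetimes with an assumed mean of 1500 hours and spread of 50 hours), and the sample size of 100 is likewise an assumption made only for illustration.

```python
import random
from statistics import mean

# Hypothetical population: simulated lifetimes (hours) of 10,000 light bulbs.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [rng.gauss(1500, 50) for _ in range(10_000)]

# Parameter: a numerical description of the whole population.
population_mean = mean(population)

# Statistic: the same description computed on a sample (a subset).
sample = rng.sample(population, 100)
sample_mean = mean(sample)

print(f"parameter (population mean): {population_mean:.1f}")
print(f"statistic (sample mean):     {sample_mean:.1f}")
```

In practice the population mean is usually unknown, which is why the sample mean is used to estimate it.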
Univariate data: One variable is measured on a single experimental unit.
Bivariate data: Two variables are measured on a single experimental unit.
Multivariate data: More than two variables are measured on a single experimental unit.
Nominal
o For things that are mutually exclusive/non-overlapping
o There is no order or ranking
o For example: gender (male or female), religion
Ordinal
o Can be ordered, but not precisely
o For example: health quality (excellent, good, adequate, bad, terrible)
Interval
o Involves measurements, but there is no meaningful zero
o For example: temperature (in °C or °F)
Ratio
o Involves measurements; it can be ranked and there are precise differences between the ranks, as well as a meaningful zero
o For example: height, time, or weight
Types of Variables
[Diagram: variables are either qualitative or quantitative; quantitative variables are either discrete or continuous.]
• Qualitative variables measure a quality or characteristic on each experimental unit.
• Examples:
– Hair color (black, brown, blonde…)
– Make of car (Dodge, Honda, Ford…)
– Gender (male, female)
– State of birth (California, Arizona,….)
• Quantitative variables measure a numerical quantity on each experimental unit.
– Discrete if it can assume only a finite or countable number of values.
– Continuous if it can assume the infinitely many values corresponding to the points on a line interval.
Examples
• For each orange tree in a grove, the number of oranges is measured. – Quantitative discrete
• For a particular day, the number of cars entering a college campus is measured. – Quantitative discrete
• Time until a light bulb burns out – Quantitative continuous
Statistical Methods
• Descriptive Statistics
• Inferential Statistics
Descriptive Statistics
• Utilizes numerical and graphical methods to look for patterns in the data set.
• The data can either be a representation of the entire population or a sample.
Graphical
• Qualitative: Bar Chart, Pie Chart
• Quantitative: Bar/Pie Chart, Line Plot (Time Series), Dotplot, Stem-and-Leaf Plot, Histogram, Ogive, Boxplot
Note: Some graphs require a tabular representation (frequency distribution)
Numerical
• Qualitative: Tables, frequency, percentage, cumulative percentage; Cross tabulation
• Quantitative: Central Tendency, Dispersion (Variability)
Graphing Qualitative Variables
• Use a data distribution to describe:
– What values of the variable have been measured
– How often each value has occurred
• “How often” can be measured 3 ways:
– Frequency
– Relative frequency = Frequency/n
– Percent = 100 x Relative frequency
• Bar Chart
• Pie Chart
Example
• A bag of M&Ms contains 25 candies.
• Raw Data: [Figure: the 25 individual candies]
• Statistical Table:
Color    Frequency  Relative Frequency  Percent
Red      3          3/25 = .12          12%
Blue     6          6/25 = .24          24%
Green    4          4/25 = .16          16%
Orange   5          5/25 = .20          20%
Brown    3          3/25 = .12          12%
Yellow   4          4/25 = .16          16%
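The three ways of measuring "how often" can be checked against the M&M example with a short Python sketch. The color counts come from the example; the variable names are illustrative only.

```python
# Color counts taken from the M&M example (25 candies in total).
frequency = {"Red": 3, "Blue": 6, "Green": 4, "Orange": 5, "Brown": 3, "Yellow": 4}

n = sum(frequency.values())  # total number of candies

table = {}
for color, count in frequency.items():
    relative = count / n        # Relative frequency = Frequency/n
    percent = 100 * relative    # Percent = 100 x Relative frequency
    table[color] = (count, relative, percent)
    print(f"{color:<7}{count:>3}  {relative:.2f}  {percent:.0f}%")
```

The relative frequencies always sum to 1, which is a quick sanity check on any hand-built table.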
Graphs
[Bar chart: frequency (0 to 6) on the vertical axis by color on the horizontal axis, in the order Green, Orange, Blue, Red, Yellow, Brown]
[Pie chart: Green 16.0%, Orange 20.0%, Blue 24.0%, Red 12.0%, Yellow 16.0%, Brown 12.0%]
Graphing Quantitative Variables
• Bar/Pie Chart
• Line Plot (Time Series)
• Dotplot
• Stem-and-Leaf Plot
• Histogram
• Ogive
• Boxplot
Graphing Quantitative Variables (1)
• A single quantitative variable measured for different population segments or for different categories of classification can be graphed using a bar or pie chart.
A Big Mac hamburger costs $4.90 in Switzerland, $2.90 in the U.S. and $1.86 in South Africa.
[Bar chart: cost of a Big Mac ($), 0 to 5, on the vertical axis by country on the horizontal axis: South Africa, U.S., Switzerland]
Graphing Quantitative Variables (2)
• A single quantitative variable measured over time is called a time series. It can be graphed using a line or bar chart.
CPI: All Urban Consumers-Seasonally Adjusted
Month  Sept    Oct     Nov     Dec     Jan     Feb     Mar
CPI    178.10  177.60  177.50  177.30  177.60  178.00  178.60
Graphing Quantitative Variables (3) - Dotplot
• The simplest graph for quantitative data
• Plots the measurements as points on a horizontal axis, stacking the points that duplicate existing points.
• Example: The set 4, 5, 5, 7, 6
[Dotplot: horizontal axis from 4 to 7, with one dot above 4, two stacked dots above 5, one dot above 6, and one dot above 7]
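A dotplot for the set 4, 5, 5, 7, 6 can be sketched as plain text in Python. This is only an illustration of the stacking idea; the `*` character and the text layout are choices made here, not part of the slides.

```python
from collections import Counter

data = [4, 5, 5, 7, 6]  # the set from the example
counts = Counter(data)
axis = list(range(min(data), max(data) + 1))  # 4, 5, 6, 7

# Build rows from the top of the tallest stack down to the axis.
rows = []
for level in range(max(counts.values()), 0, -1):
    rows.append(" ".join("*" if counts[x] >= level else " " for x in axis))
rows.append(" ".join(str(x) for x in axis))

print("\n".join(rows))
```

Each duplicate value adds one more dot to the stack above its position on the axis, so the value 5 ends up with a stack of two.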
Stem and Leaf Plots (4)
• A simple graph for quantitative data
• Uses the actual numerical values of each data point.
– Divide each measurement into two parts: the stem and the leaf.
– List the stems in a column, with a vertical line to their right.
– For each measurement, record the leaf portion in the same row as its matching stem.
– Order the leaves from lowest to highest in each stem.
– Provide a key to your coding.
Example : Stem-and-Leaf Plot
The prices ($) of 18 brands of walking shoes:
90 70 70 70 75 70 65 68 60
74 70 95 75 70 68 65 40 65
4 | 0
5 |
6 | 0 5 5 5 8 8
7 | 0 0 0 0 0 0 4 5 5
8 |
9 | 0 5
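The stem-and-leaf steps can be sketched in Python for the walking shoe prices. The choice of the tens digit as the stem and the ones digit as the leaf is an assumption that fits two-digit prices; other data would need a different split.

```python
from collections import defaultdict

prices = [90, 70, 70, 70, 75, 70, 65, 68, 60,
          74, 70, 95, 75, 70, 68, 65, 40, 65]

# Stem = tens digit, leaf = ones digit.
leaves = defaultdict(list)
for p in sorted(prices):  # sorting first keeps the leaves in order
    stem, leaf = divmod(p, 10)
    leaves[stem].append(leaf)

# Include empty stems so gaps in the data stay visible.
plot = {}
for stem in range(min(leaves), max(leaves) + 1):
    plot[stem] = " ".join(str(leaf) for leaf in leaves.get(stem, []))
    print(f"{stem} | {plot[stem]}")
print("Key: 4 | 0 means $40")
```

Keeping the empty stems 5 and 8 in the output makes it easy to see that the $40 price sits apart from the rest.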
Relative Frequency Histograms (5)
• A relative frequency histogram for a quantitative data set is a bar graph in which the height of the bar shows “how often” (measured as a proportion or relative frequency) measurements fall in a particular class or subinterval.
• Divide the range of the data into 5-12 subintervals of equal length.
• Calculate the approximate width of the subinterval as Range/number of subintervals.
• Round the approximate width up to a convenient value.
• Use the method of left inclusion, including the left endpoint, but not the right in your tally.
• Create a statistical table including the subintervals, their frequencies and relative frequencies.
Relative Frequency Histograms (5): cont’d
• Draw the relative frequency histogram, plotting the subintervals on the horizontal axis and the relative frequencies on the vertical axis.
• The height of the bar represents
– The proportion of measurements falling in that class or subinterval.
– The probability that a single measurement, drawn at random from the set, will belong to that class or subinterval.
Example 1
The ages of 50 tenured faculty at a state university:
34 48 70 63 52 52 35 50 37 43 53 43 52 44
42 31 36 48 43 26 58 62 49 34 48 53 39 45
34 59 34 66 40 59 36 41 35 36 62 34 38 28
43 50 30 43 32 44 58 53
• We choose to use 6 intervals.
• Minimum class width = Range/6 = (70 – 26)/6 = 7.33
• Convenient class width = 8
• Use 6 classes of length 8, starting at 25.
Age         Frequency  Relative Frequency  Percent
25 to < 33  5          5/50 = .10          10%
33 to < 41  14         14/50 = .28         28%
41 to < 49  13         13/50 = .26         26%
49 to < 57  9          9/50 = .18          18%
57 to < 65  7          7/50 = .14          14%
65 to < 73  2          2/50 = .04          4%
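The class-width calculation and the left-inclusion tally for the faculty ages can be reproduced with a short Python sketch. Rounding the approximate width up to the next whole number is one way to pick a "convenient" value and is an assumption here; the starting point of 25 is the one chosen in the example.

```python
import math

ages = [34, 48, 70, 63, 52, 52, 35, 50, 37, 43, 53, 43, 52, 44,
        42, 31, 36, 48, 43, 26, 58, 62, 49, 34, 48, 53, 39, 45,
        34, 59, 34, 66, 40, 59, 36, 41, 35, 36, 62, 34, 38, 28,
        43, 50, 30, 43, 32, 44, 58, 53]

k = 6                                        # number of classes chosen in the example
approx_width = (max(ages) - min(ages)) / k   # (70 - 26)/6 = 7.33
width = math.ceil(approx_width)              # round up to a whole number: 8
start = 25                                   # starting point chosen in the example

# Left inclusion: x belongs to class i when
# start + i*width <= x < start + (i+1)*width.
freq = [0] * k
for x in ages:
    freq[(x - start) // width] += 1

for i, f in enumerate(freq):
    lo = start + i * width
    print(f"{lo} to < {lo + width}: {f:>2}  {f / len(ages):.2f}")
```

Integer division by the class width does the left-inclusion bookkeeping automatically: an age of exactly 33 lands in the second class, not the first.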
[Relative frequency histogram: ages on the horizontal axis with class boundaries 25, 33, 41, 49, 57, 65, 73; relative frequency on the vertical axis from 0 to 14/50]
Class     Class Boundaries  Midpoint  Frequency  Relative Frequency  Percent
25 to 33  24.5 – 33.5       29        5          5/50 = .10          10%
34 to 42  33.5 – 42.5       38        16         16/50 = .32         32%
43 to 51  42.5 – 51.5       47        14         14/50 = .28         28%
52 to 60  51.5 – 60.5       56        10         10/50 = .20         20%
61 to 69  60.5 – 69.5       65        4          4/50 = .08          8%
70 to 78  69.5 – 78.5       74        1          1/50 = .02          2%
Describing the Distribution
Shape? Skewed right
Outliers? No.
What proportion of the tenured faculty are younger than 42.5? (16 + 5)/50 = 21/50 = .42 = 42%
What is the probability that a randomly selected faculty member is 52 or older? (10 + 4 + 1)/50 = 15/50 = .30
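As a quick arithmetic check of the two proportions, the class frequencies from the table with class boundaries 24.5, 33.5, ..., 78.5 can be summed in Python. The variable names are illustrative only.

```python
# Class frequencies from the table with class boundaries 24.5, 33.5, ..., 78.5.
freq = [5, 16, 14, 10, 4, 1]
n = sum(freq)  # 50 faculty members

# Younger than 42.5: the first two classes (24.5-33.5 and 33.5-42.5).
p_younger = (freq[0] + freq[1]) / n

# 52 or older: the last three classes (51.5 and above).
p_52_plus = sum(freq[3:]) / n

print(f"younger than 42.5: {p_younger:.2f}")
print(f"52 or older:       {p_52_plus:.2f}")
```

The first proportion is 21/50 and the second is 15/50, since each is just the number of measurements in the relevant classes divided by n.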
How Many Class Intervals?
• Many (Narrow class intervals)
– may yield a very jagged distribution with gaps from empty classes
– can give a poor indication of how frequency varies across classes
• Few (Wide class intervals)
– may compress variation too much and yield a blocky distribution
– can obscure important patterns of variation.
[Histogram: frequency (0 to 12) of temperature with a few wide classes, labeled 0, 30, 60, More]
[Histogram: frequency (0 to 3.5) of temperature with many narrow classes, labeled 4, 8, 12, ..., 60, More]
(X axis labels are upper class endpoints)
Example 2
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature:
24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27
• Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 12)
• Compute class interval (width): 10 (46/5 then round up)
• Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
• Compute class midpoints: 15, 25, 35, 45, 55
• Count observations & assign to classes
Example 2: Solution (Frequency Distribution)
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class        Frequency  Relative Frequency  Percentage
10 ≤ X < 20  3          .15                 15
20 ≤ X < 30  6          .30                 30
30 ≤ X < 40  5          .25                 25
40 ≤ X < 50  4          .20                 20
50 ≤ X < 60  2          .10                 10
Total        20         1.00                100
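The steps of Example 2 (range, class width, boundaries, midpoints, counts) can be sketched in Python. Rounding the width up to the next multiple of 10 is an assumption that reproduces the width of 10 used in the example; the first boundary of 10 is taken from the example.

```python
import math

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

data_range = max(temps) - min(temps)         # 58 - 12 = 46
k = 5                                        # number of classes
width = math.ceil(data_range / k / 10) * 10  # 9.2 rounded up to the next multiple of 10
start = 10                                   # first class boundary from the example

boundaries = [start + i * width for i in range(k + 1)]
midpoints = [b + width // 2 for b in boundaries[:-1]]

# Class i covers boundaries[i] <= t < boundaries[i + 1].
freq = [0] * k
for t in temps:
    freq[(t - start) // width] += 1

n = len(temps)
for lo, f in zip(boundaries, freq):
    print(f"{lo} <= X < {lo + width}: {f}  {f / n:.2f}  {100 * f / n:.0f}")
```

The printed rows should match the frequency distribution table, and the counts should add up to the 20 days sampled.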
Histogram: Example 2
Class        Class Midpoint  Frequency
10 ≤ X < 20  15              3
20 ≤ X < 30  25              6
30 ≤ X < 40  35              5
40 ≤ X < 50  45              4
50 ≤ X < 60  55              2
[Histogram: Daily High Temperature. Frequency (0 to 7) on the vertical axis against class midpoints (5 to 65) on the horizontal axis; no gaps between bars]
Ogive (6)
An ogive is a curve drawn for the cumulative frequency distribution: mark a dot above the upper boundary of each class at a height equal to that class's cumulative frequency, then join the dots with straight lines.
Two types of ogive:
(i) ogive less than
(ii) ogive greater than
First, build a table of cumulative frequency.
Class        Frequency  Percentage  Cumulative Frequency  Cumulative Percentage
10 ≤ X < 20  3          15          3                     15
20 ≤ X < 30  6          30          9                     45
30 ≤ X < 40  5          25          14                    70
40 ≤ X < 50  4          20          18                    90
50 ≤ X < 60  2          10          20                    100
Total        20         100
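The cumulative columns are running totals of the class frequencies, which can be sketched in Python with `itertools.accumulate`. The list `ogive_points` is a hypothetical name for the (upper boundary, cumulative percentage) pairs that a "less than" ogive would plot.

```python
from itertools import accumulate

freq = [3, 6, 5, 4, 2]        # class frequencies for 10-20, ..., 50-60
upper = [20, 30, 40, 50, 60]  # upper class boundaries
n = sum(freq)

cum_freq = list(accumulate(freq))          # running totals
cum_pct = [100 * c / n for c in cum_freq]  # as a percentage of n

# Points of the "less than" ogive, starting at 0% below the first boundary.
ogive_points = [(10, 0.0)] + list(zip(upper, cum_pct))

for boundary, pct in ogive_points:
    print(f"less than {boundary}: {pct:.0f}%")
```

The last cumulative frequency always equals n, and the last cumulative percentage is always 100.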
Graphing Cumulative Frequencies: The Ogive
Class        Upper class boundary  Cumulative Percentage
<10          10                    0
10 ≤ X < 20  20                    15
20 ≤ X < 30  30                    45
30 ≤ X < 40  40                    70
40 ≤ X < 50  50                    90
50 ≤ X < 60  60                    100
[Ogive: Daily High Temperature. Cumulative percentage (0 to 100) on the vertical axis against class boundaries (not midpoints) 10, 20, 30, 40, 50, 60 on the horizontal axis]
Interpreting Graphs: Location and Spread
• Where is the data centered on the horizontal axis, and how does it spread out from the center?
Interpreting Graphs: Shapes
Mound shaped and symmetric (mirror images)
Skewed right: a few unusually large measurements
Skewed left: a few unusually small measurements
Bimodal: two local peaks
Interpreting Graphs: Outliers
Are there any strange or unusual measurements that stand out in the data set?
[Figure: two distributions, one labeled "Outlier" and one labeled "No Outliers"]
Example
• A quality control process measures the diameter (cm) of a gear being made by a machine. The technician records 16 diameters, but inadvertently makes a typing mistake on the second entry.
1.991 1.891 1.991 1.988 1.993 1.989 1.990 1.988
1.988 1.993 1.991 1.989 1.989 1.993 1.990 1.994
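One way to make the mistyped entry stand out numerically is to compare each diameter with the median. The sketch below uses the median absolute deviation (MAD) with a cutoff of 10 times the MAD; both the technique and the cutoff are assumptions introduced here for illustration, not a rule given in the slides.

```python
from statistics import median

diameters = [1.991, 1.891, 1.991, 1.988, 1.993, 1.989, 1.990, 1.988,
             1.988, 1.993, 1.991, 1.989, 1.989, 1.993, 1.990, 1.994]

center = median(diameters)
# Median absolute deviation: a spread measure that one bad entry barely moves.
mad = median(abs(d - center) for d in diameters)

threshold = 10 * mad  # assumed cutoff, chosen only for illustration
flagged = [(i + 1, d) for i, d in enumerate(diameters)
           if abs(d - center) > threshold]

print(flagged)  # (position in the list, value) for each flagged entry
```

A dotplot or stem-and-leaf plot of the same data would show the same thing visually: one measurement sitting far to the left of a tight cluster.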