Chapter 1 - Descriptive Statistics
STT 351, US18
January 10, 2019
Introduction
Population, Sample and Processes
Pictorial and Tabular Methods in Descriptive Statistics
Measures of Location
Measures of Variability
Introduction
"I am not much given to regret, so I puzzled over this one a while. Should have taken much more statistics in college, I think."
– Max Levchin, PayPal co-founder, Slide founder
Quote of the week from the Web site of the American Statistical Association on November 23, 2010
"I keep saying that the sexy job in the next 10 years will be statisticians, and I'm not kidding."
– Hal Varian, Chief Economist at Google
August 6, 2009, The New York Times
First Definitions
Definition
- Population: A population is any collection of ALL of the objects or individuals of interest.
- Sample: A sample is a selection of objects or individuals in the population. When the sample matches the population, we talk about a census.
- Variable: A variable is any characteristic of interest associated with an object or individual studied. When an object or an individual is associated with several variables, we talk about multivariate data. Otherwise, we say that the data is univariate.
Examples
- Type of transmission in automobiles
  - Population: All cars for sale
  - Sample: Cars in one dealership
  - Variable: Type of transmission
M A A A M A A M A A
- Physical measurements of basketball players
  - Population: All basketball players
  - Sample: A given basketball team
  - Variables: height, weight
(72, 168) (87.1, 192) (89.1, 201) (90, 183) (86, 179)
Branches of Statistics
Once data is collected, one may wish to summarize and describe its important features.
This branch is called Descriptive Statistics.
Some methods are graphical in nature:
- Histogram
- Boxplot
- Pie chart
Others are numerical in nature:
- Mean
- Standard deviation
- Median
Computers
Even though it is not the main focus of this class, here is a list of commonly used statistical software.
- The book often uses Minitab.
- Professional software such as SAS and S-plus.
- R (free software) can be downloaded at:
https://www.r-project.org/
- Matlab
Example: Charity business
- The Web site charitynavigator.com gives information on roughly 5500 charitable organizations.
- Some charities operate very efficiently, with fundraising and administrative expenses that are only a small percentage of total expenses, whereas others spend a high percentage of what they take in on such activities.
- Here is data on fundraising expenses as a percentage of total expenditures for a random sample of 60 charities:
 6.1 12.6  2.2  3.1  7.5  3.9  6.4 10.8  8.8  5.1
 5.3 16.6 34.7  1.6  1.3  1.1 10.1  8.1 83.1  3.6
 3.7 26.3  8.8 12.0 18.8 14.1 19.5  6.2  6.0  4.7
 2.2  3.0  2.2  4.0 21.0  6.1  5.2 12.0 15.8  6.3
16.3 12.7 48.0  8.2 11.7 14.7  6.4 17.0  5.6  3.8
 1.3 20.4 10.4  5.2  1.3  0.8  7.2  3.9  2.5 16.2
Example: Charity business (II)
- Without organization, it is hard to make sense of this data set.
- Some graphical representations: stem-and-leaf display and histogram
Branches of Statistics (II)
- Having obtained a sample from a population, an investigator would frequently like to use sample information to draw some type of conclusion (make an inference of some sort) about the population.
- That is, the sample is a means to an end rather than an end in itself. Techniques for generalizing from a sample to a population are gathered within the branch of Statistics called inferential statistics.
Example of inferential statistics
- A study of flexural strength (a measure of ability to resist failure in bending) in high performance concrete gives the following measurements (in megapascals, MPa):
 5.9  7.2  7.3  6.3  8.1  6.8  7.0  7.6  6.8  6.5  7.0
 6.3  7.9  9.0  8.2  8.7  7.8  9.7  7.4  7.7  9.7  7.8
 7.7 11.6 11.3 11.8 10.7
- We can average these numbers to get an estimate of the true flexural strength for all beams that could be made in this way.
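As a small illustration of this point estimate, the sketch below averages the 27 measurements listed above. The variable names are chosen for this example only and are not part of the course material.

```python
# Flexural strength measurements (MPa) from the example above.
strengths = [
    5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0,
    6.3, 7.9, 9.0, 8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8,
    7.7, 11.6, 11.3, 11.8, 10.7,
]

n = len(strengths)                # sample size
sample_mean = sum(strengths) / n  # point estimate of the true average strength

print(f"n = {n}, sample mean = {sample_mean:.2f} MPa")
```

The sample mean is only an estimate: a different sample of beams would give a somewhat different value, which is exactly what inferential statistics is designed to quantify.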
Notations
- The number of observations in a single sample will be denoted n (sample size).
- Given n observations of a single variable x, we denote the individual observations by $x_1, x_2, \ldots, x_n$.
- We write a sample between braces:
$\{x_1, x_2, \ldots, x_n\}$
- A sample of size n = 4 of universities:
{Stanford, Iowa State, Wyoming, Rochester}
- A sample of size n = 5 of pH measurements:
{6.3, 6.2, 5.9, 6.5, 5.1}
Stem-and-leaf display
Consider a numerical data set $x_1, \ldots, x_n$, where each $x_i$ consists of at least two digits.
1. Select one or more leading digits for the stem values. The trailing digits become the leaves.
2. List possible stem values in a vertical column.
3. Record the leaf for each observation beside the corresponding stem value.
4. Indicate the units for stems and leaves someplace in the display.
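The four steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming one-decimal data with the integer part as stem and the tenths digit as leaf; it reuses the flexural strength measurements from the earlier example, and the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Flexural strength measurements (MPa) from the inferential statistics example.
strengths = [
    5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0,
    6.3, 7.9, 9.0, 8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8,
    7.7, 11.6, 11.3, 11.8, 10.7,
]

def stem_and_leaf(values):
    """Return {stem: sorted leaves}; stem = integer part, leaf = tenths digit."""
    rows = defaultdict(list)
    for v in values:
        tenths = round(v * 10)            # e.g. 7.3 -> 73
        stem, leaf = divmod(tenths, 10)   # 73 -> stem 7, leaf 3
        rows[stem].append(leaf)
    # List every possible stem (step 2), so that gaps in the data stay visible.
    return {s: sorted(rows.get(s, [])) for s in range(min(rows), max(rows) + 1)}

display = stem_and_leaf(strengths)
for stem, leaves in display.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
print("Stem: ones digit   Leaf: tenths digit")   # step 4: state the units
```

Sorting the leaves within each row is a common refinement rather than a requirement; an unsorted display carries the same information.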
Example : Use of alcohol by college students.
- A study on 140 different campuses measured the percentage of undergraduate students who are binge drinkers.
- The figure below shows a stem-and-leaf display of this data.
- The first leaf on the stem 2 row is 1, which tells us that 21% of the students at one of the colleges in the sample were binge drinkers.
A stem-and-leaf display conveys information about the following aspects of the data:
- frequency distribution
- actual data values are displayed in ascending order
- identification of a typical or representative value
- extent of spread about the typical value
- presence of any gaps in the data
- extent of symmetry in the distribution of values
- number and location of peaks
- presence of any outlying values
In general, a display based on between 5 and 20 stems is recommended.
Dotplot
- Each observation is represented by a dot above the corresponding location on a horizontal measurement scale.
- As with a stem-and-leaf display, a dotplot gives information about frequency distribution, location, spread, extremes, gaps and outliers.
- Actual values are not displayed, but approximate locations of data values are.
- Example:
10.8  6.9  8.0  8.8  7.3  3.6  4.1  6.0  4.4  8.3
 8.1  8.0  5.9  5.9  7.6  8.9  8.5  8.1  4.2  5.7
 4.0  6.7  5.8  9.9  5.6  5.8  9.3  6.2  2.5  4.5
12.8  3.5 10.0  9.1  5.0  8.1  5.3  3.9  4.0  8.0
 7.4  7.5  8.4  8.3  2.6  5.1  6.0  7.0  6.5 10.3
Histogram
There is a fundamental distinction between two types of numerical (quantitative) variables.
- Ex: The number of 911 calls during a day.
- Ex: The length of a football throw.
Definition
- A numerical variable is discrete if its set of possible values either is finite or else can be listed in an infinite sequence.
- A numerical variable is continuous if its possible values consist of an entire interval on the number line.
Remark:
- Recall the real number system!
- A discrete variable almost always results from counting.
- Continuous variables arise from measurement.
Constructing a Histogram for Discrete Data
- The frequency of any particular x value is the number of times that value occurs in the data set.
- The relative frequency of a value is the fraction or proportion of times the value occurs:
\[
\text{relative frequency of a value} = \frac{\text{number of times the value occurs}}{\text{number of observations in the data set}}
\]
Construction of a histogram for a discrete variable:
1. Determine the frequency and relative frequency of each x value.
2. Mark possible x values on a horizontal scale.
3. Above each value, draw a rectangle whose height is the relative frequency (or alternatively, the frequency) of that value.
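The construction above can be sketched as follows. The data are the sample of size n = 20 used later in these slides, treated here as a discrete variable (an assumption made for illustration), and the text bars only stand in for a real histogram.

```python
from collections import Counter

# Sample of size n = 20 that also appears in the quartile example of these slides.
sample = [15, 6, 8, 5, 12, 1, 3, 0, 2, 9, 1, 11, 4, 4, 4, 3, 1, 6, 2, 11]
n = len(sample)

# Step 1: frequency and relative frequency of each observed x value.
freq = Counter(sample)
rel_freq = {x: freq[x] / n for x in sorted(freq)}

# Steps 2-3: one text "rectangle" per value, with length equal to the frequency.
for x, rf in rel_freq.items():
    print(f"x = {x:>2}  freq = {freq[x]}  rel. freq = {rf:.2f}  {'#' * freq[x]}")

# Proportions are obtained by adding relative frequencies.
at_most_two = sum(rf for x, rf in rel_freq.items() if x <= 2)
print(f"proportion of observations with x <= 2: {at_most_two:.2f}")
```

Because the relative frequencies sum to 1, the proportion of observations in any set of values is simply the sum of the corresponding rectangle heights.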
Example
The following table is a frequency distribution for the number of hits per team per baseball game for all nine-inning games that were played between 1989 and 1993.
Example (II)
Example (III)
- From the table or directly from the histogram, we can deduce the following:
\[
\begin{aligned}
\text{proportion of games with at most two hits}
&= (\text{rel. freq. for } x = 0) + (\text{rel. freq. for } x = 1) + (\text{rel. freq. for } x = 2) \\
&= .0010 + .0037 + .0108 \\
&= .0155
\end{aligned}
\]
- Exercise: What is the proportion of games with between 5 and 10 hits (inclusive)?
Constructing a Histogram for Continuous Data: Equal Class Widths
Constructing a histogram for continuous data (measurements) entails subdividing the measurement axis into a suitable number of class intervals or classes, such that each observation is contained in exactly one class.
Equal Class Widths
1. Break the data into classes of equal width.
2. Determine the frequency and relative frequency for each class.
3. Mark the class boundaries on a horizontal measurement axis.
4. Above each class interval, draw a rectangle whose height is the corresponding relative frequency (or frequency).
5. Adopt a convention for including or excluding class endpoints, for example left-closed intervals [a, b).
6. Between 6 and 15 classes (bins) are recommended.
Example
- Say we are given the following observations:
- The minimum is 0.3300578 and the maximum is 27.46153. We can start by dividing the data into classes, then we count how many observations fall into each class.
Classes    [0, 2)  [2, 4)  [4, 6)  ...  [24, 26)  [26, 28)
Frequency  32      181     252     ...  1         1
Example (II): The histogram
Motivation for unequal width classes
Consider the following observations:
11.5  5.7  3.6  5.2 12.1  9.9  9.3  7.8  6.2  6.6
 7.0 13.4 17.1  9.3  5.6  5.4  5.2  5.1  4.9 10.7
15.2  8.5  4.2  4.0  3.9  3.8  3.4 20.6 25.5 13.8
12.6 13.1  8.9  8.2 10.7 14.2  7.6  5.5  5.1  5.0
 5.2  4.8  4.1  3.8  3.7  3.6  3.6  3.6
If we choose to break these observations into classes of equal width:
Classes    [0, 5)  [5, 10)  [10, 15)  [15, 20)  [20, 25)  [25, 30)
Frequency  14      21       9         2         1         1
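The class counts in this table can be reproduced with a short sketch. It assumes left-closed classes [a, b), matching the notation of the table; the helper name `class_frequencies` is hypothetical.

```python
# The 48 observations listed above.
data = [
    11.5, 5.7, 3.6, 5.2, 12.1, 9.9, 9.3, 7.8, 6.2, 6.6,
    7.0, 13.4, 17.1, 9.3, 5.6, 5.4, 5.2, 5.1, 4.9, 10.7,
    15.2, 8.5, 4.2, 4.0, 3.9, 3.8, 3.4, 20.6, 25.5, 13.8,
    12.6, 13.1, 8.9, 8.2, 10.7, 14.2, 7.6, 5.5, 5.1, 5.0,
    5.2, 4.8, 4.1, 3.8, 3.7, 3.6, 3.6, 3.6,
]

def class_frequencies(values, edges):
    """Count observations in the left-closed classes [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

edges = [0, 5, 10, 15, 20, 25, 30]   # six classes of equal width 5
counts = class_frequencies(data, edges)

for lo, hi, c in zip(edges, edges[1:], counts):
    print(f"[{lo:>2}, {hi:>2})  frequency = {c:>2}  rel. freq = {c / len(data):.4f}")
```

With the left-closed convention an observation equal to a boundary, such as 5.0, is counted in the class that starts at that boundary.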
Histogram
- We get the histogram:
- The problem: the observations are concentrated between 0 and 10.
- We need more "precision" for smaller values.
Constructing a Histogram for Continuous Data: Unequal Class Widths
Unequal Class Widths
1. Choose an appropriate subdivision of the data.
2. Determine the frequency and relative frequency for each class.
3. Calculate the height of each rectangle using the formula:
\[
\text{rectangle height} = \frac{\text{relative frequency of the class}}{\text{class width}}
\]
4. The resulting rectangle heights are usually called densities, and the vertical scale is the density scale.
5. AREA PRINCIPLE: In general, area represents relative frequency or frequency. If the class widths are equal, then height can be used to represent relative frequency or frequency.
Motivation for unequal width classes (II)
We now choose the following classes:
Classes         [2, 4)  [4, 6)  [6, 8)  [8, 12)  [12, 20)  [20, 30)
Frequency       9       15      5       9        8         2
Relative freq.  .1875   .3125   .1042   .1875    .1667     .0417
Density         .094    .156    .052    .047     .021      .004
Remark
- The sample size in this example is n = 48.
- If we did not divide by the class width, the rectangle for the class [8, 12) would be higher than the rectangle for the class [6, 8): the picture would be distorted and would be a poor representation of the data.
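The density row of the table, and the area principle behind it, can be checked with a minimal sketch. The class edges and frequencies are copied from the table; the variable names are illustrative.

```python
edges = [2, 4, 6, 8, 12, 20, 30]   # class boundaries from the table
freqs = [9, 15, 5, 9, 8, 2]        # class frequencies from the table
n = sum(freqs)                     # sample size

rows = []
for lo, hi, f in zip(edges, edges[1:], freqs):
    rel = f / n                    # relative frequency of the class
    density = rel / (hi - lo)      # rectangle height = rel. freq. / class width
    rows.append((lo, hi, rel, density))
    print(f"[{lo:>2}, {hi:>2})  rel. freq = {rel:.4f}  density = {density:.3f}")

# Area principle: the rectangle areas (width * height) add up to 1.
total_area = sum((hi - lo) * d for lo, hi, _, d in rows)
print(f"total area = {total_area:.4f}")
```

The class [8, 12) has a larger relative frequency than [6, 8) but a smaller density, because its observations are spread over an interval twice as wide.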
Histogram
A Note on Histogram Shapes
- A unimodal histogram is one that rises to a single peak and then declines.
- A bimodal histogram has two different peaks. Bimodality typically occurs when the population is made up of two different types of individuals.
- A histogram with more than two peaks is said to be multimodal.
- A histogram is symmetric if the left half is a mirror image of the right half.
- A unimodal histogram is positively skewed if the right or upper tail is stretched out compared with the left or lower tail, and negatively skewed if the stretching is to the left.
Histograms as Density approximation
The Case of Qualitative Data
- Both a frequency distribution and a histogram can be constructed when the data set is qualitative (categorical) in nature.
- Qualitative variables are variables whose values are groups or categories.
- In some cases, there will be a natural ordering of classes, for example: freshmen, sophomores, juniors, seniors, graduate students.
- In other cases the order will be arbitrary, for example: Catholic, Jewish, Protestant, and the like.
"Overall, how would you rate the quality of public schools in your neighborhood today?"
The case of multivariate Data
Multivariate data is generally rather difficult to describe visually. Several methods for doing so appear later in the book, notably scatter plots for bivariate numerical data.
Exercise 17 - Sec 1.2 Devore
Temperature transducers of a certain type are shipped in batches of 50. A sample of 60 batches was selected, and the number of transducers in each batch not conforming to design specifications was determined, resulting in the following data:
2 1 2 4 0 1 3 2 0 5 3 3 1 3 2 4 7 0 2 3
0 4 2 1 3 1 1 3 4 1 2 3 2 2 8 4 5 1 3 1
5 0 2 3 2 1 0 6 4 2 1 6 0 3 3 3 6 1 2 3
1. Determine frequencies and relative frequencies for the observed values of x = number of nonconforming transducers in a batch.
2. What proportion of batches in the sample have at most five nonconforming transducers? What proportion have fewer than five? What proportion have at least five nonconforming units?
3. Draw a histogram of the data using relative frequency on the vertical scale, and comment on its features.
Exercise 23 - Sec 1.2 Devore
The article "Statistical Modeling of the Time Course of Tantrum Anger" (Annals of Applied Stats, 2009: 1013-1034) discussed how anger intensity in children's tantrums could be related to tantrum duration as well as behavioral indicators such as shouting, stamping, and pushing or pulling. The following frequency distribution was given (and also the corresponding histogram):
Classes    [0, 2)  [2, 4)  [4, 11)  [11, 20)  [20, 30)  [30, 40)
Frequency  136     92      71       26        7         3
Draw the histogram and then comment on any interesting features.
Measures of Location: Central Tendencies
- Visual summaries are great tools for preliminary impressions and insights.
- More formal data analysis often requires the calculation and interpretation of numerical summary measures.
- From the data we try to extract several numbers that might serve to characterize the data set and convey some of its salient features.
- Consider the data $x_1, x_2, \ldots, x_n$, where each $x_i$ is a number. What features of such a set of numbers are of most interest and deserve emphasis?
The mean
- The most common measure of location is the mean, or arithmetic average.
Definition
The sample mean, denoted x̄, of observations $x_1, x_2, \ldots, x_n$ is given by:
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + \cdots + x_n}{n}.
\]
- The sample mean has a flaw: its value can be greatly influenced by a single outlier.
- Ex: a sample of incomes (salaries).
The median
- The sample median is the middle value once the observations are ordered from smallest to largest.
- For a sample $x_1, \ldots, x_n$, we denote the sample median by x̃.
Definition
The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included so that every sample observation appears in the ordered list). Then,
\[
\tilde{x} =
\begin{cases}
\left(\dfrac{n+1}{2}\right)^{\text{th}} \text{ ordered value} & \text{if } n \text{ is odd,} \\[2ex]
\text{average of the } \left(\dfrac{n}{2}\right)^{\text{th}} \text{ and } \left(\dfrac{n}{2} + 1\right)^{\text{th}} \text{ ordered values} & \text{if } n \text{ is even.}
\end{cases}
\]
Comparison: Median vs Mean
- Suppose the distribution of salaries in a company is as follows:
72 69 71 67 72 69 70 73 69 70
69 68 67 67 69 76 67 69 67 70
69 68 70 71 67 70 69 68 69 70
66 66 71 70 68 69 71 70 71 68
- The sample mean is x̄ = 69.3 and the median is x̃ = 69.
- Now, assume one employee earns far more than their coworkers (here about 18 times as much):
72 69 71 67 72 69 70 73 69 70
69 68 67 67 1243 76 67 69 67 70
69 68 70 71 67 70 69 68 69 70
66 66 71 70 68 69 71 70 71 68
- The new sample mean is x̄ = 98.65, whereas the new median is still x̃ = 69.
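The comparison can be reproduced with Python's standard `statistics` module. This is a small sketch; the list names are illustrative, and the outlier is introduced by replacing the 15th observation, as in the second table.

```python
from statistics import mean, median

# Salaries from the first table above (40 observations).
salaries = [
    72, 69, 71, 67, 72, 69, 70, 73, 69, 70,
    69, 68, 67, 67, 69, 76, 67, 69, 67, 70,
    69, 68, 70, 71, 67, 70, 69, 68, 69, 70,
    66, 66, 71, 70, 68, 69, 71, 70, 71, 68,
]

# Second table: the 15th observation (69) is replaced by the outlier 1243.
with_outlier = list(salaries)
with_outlier[14] = 1243

print(f"original:     mean = {mean(salaries):.2f}, median = {median(salaries)}")
print(f"with outlier: mean = {mean(with_outlier):.2f}, median = {median(with_outlier)}")
```

A single extreme value moves the mean by almost 30 units, while the median does not change because the middle of the ordered list is untouched.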
Symmetry: Mean and Median
Other Measures of Location: Quartiles, Percentiles, and Trimmed Means
- Quartiles: divide the sample into fourths.
  - First quartile (Q1): cutoff of the lower 25% of the data
  - Second quartile (Q2): cutoff of the lower 50% of the data (the median)
  - Third quartile (Q3): cutoff of the lower 75% of the data
- Percentiles: divide the sample into hundredths.
  - The pth percentile is the cutoff of the lower p% of the data.
- Trimmed mean x̄p%: remove the bottom p% and the top p% of the observations before averaging.
  - This is a compromise between the mean and the median.
Other Measures of Location: Quartiles, Percentiles, and Trimmed Means
Example: We consider the sample of size n = 20:
15 6 8 5 12 1 3 0 2 9 1 11 4 4 4 3 1 6 2 11
We first sort the sample:
0 1 1 1 2 2 3 3 4 4 4 5 6 6 8 9 11 11 12 15
We thus get
- x̄ = 5.4
- x̃ = 4
- Q1 = 2
- Q3 = 9
- x̄25% = 4.5
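These summaries can be recomputed with a short sketch. It assumes the half-sample convention for quartiles (Q1 is the median of the smaller half, Q3 the median of the larger half, with the median placed in both halves when n is odd); the helper `trimmed_mean` is a hypothetical name, and it assumes p times n is a whole number.

```python
from statistics import mean, median

sample = [15, 6, 8, 5, 12, 1, 3, 0, 2, 9, 1, 11, 4, 4, 4, 3, 1, 6, 2, 11]
xs = sorted(sample)
n = len(xs)

# Half-sample convention: split the ordered data into two halves.
half = n // 2
lower = xs[: half + (n % 2)]   # the median is included in both halves if n is odd
upper = xs[half:]
lower_fourth = median(lower)
upper_fourth = median(upper)

def trimmed_mean(values, p):
    """Mean after dropping a proportion p of the observations from each end."""
    ordered = sorted(values)
    k = round(p * len(ordered))            # number of values removed per end
    return mean(ordered[k: len(ordered) - k])

print(f"mean = {mean(sample)}, median = {median(sample)}")
print(f"lower fourth = {lower_fourth}, upper fourth = {upper_fourth}")
print(f"25% trimmed mean = {trimmed_mean(sample, 0.25)}")
```

Under this convention the upper fourth comes out as 8.5 rather than the value 9 reported above. Quartile conventions differ between textbooks and software packages, so small discrepancies of this kind are to be expected; the mean, median, lower quartile and trimmed mean agree with the slide.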
Exercise 33 - Sec 1.3 Devore
The May 1, 2009 issue of The Montclarian reported the following home sale amounts for a sample of homes in Alameda, CA that were sold the previous month (1000s of $):
590 815 575 608 350 1285 408 540 555 679
1. Calculate and interpret the sample mean and median.
2. Suppose the 6th observation had been 985 rather than 1285. How would the mean and median change?
3. Calculate a 20% trimmed mean by first trimming the two smallest and two largest observations.
4. Calculate a 15% trimmed mean.
Motivation
- Reporting a measure of location alone gives only partial information about the sample.
→ Different samples can have the same mean; this does not mean that the samples are identical.
- We introduce indicators of dispersion.
Measures of Variability for Sample Data
- The simplest is the range:
Range = Maximum value − Minimum value
- Deviations from the mean:
Definition
The sample variance, denoted by $s^2$, is given by
\[
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.
\]
The sample standard deviation, denoted by $s$, is the (positive) square root of the variance:
\[
s = \sqrt{s^2}.
\]
Properties of the Sample Variance
Proposition
An alternative expression for the numerator of $s^2$ is
\[
S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}.
\]
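The identity follows from expanding the square and using the definition of the sample mean, which gives $\sum x_i = n\bar{x}$. A short derivation, with all sums running over $i = 1, \ldots, n$:

```latex
\begin{aligned}
S_{xx} = \sum (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x} \sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum x_i^2 - n\bar{x}^2 \\
  &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}.
\end{aligned}
```

The last step substitutes $\bar{x} = \frac{1}{n}\sum x_i$, so that $n\bar{x}^2 = \left(\sum x_i\right)^2 / n$.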
Proposition
Let $x_1, x_2, \ldots, x_n$ be a sample and $c$ be any nonzero constant.
- If $y_1 = x_1 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$.
- If $y_1 = c x_1, \ldots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$.
Here $s_x^2$ and $s_y^2$ are the sample variances of x and y respectively, and $s_x$ and $s_y$ are the sample standard deviations.
Example
Traumatic knee dislocation often requires surgery to repair ruptured ligaments. One measure of recovery is range of motion (measured as the angle formed when, starting with the leg straight, the knee is bent as far as possible). The given data on postsurgical range of motion appeared in the article "Reconstruction of the Anterior and Posterior Cruciate Ligaments After Knee Dislocation" (Amer. J. Sports Med., 1999: 189–197):
154 142 137 133 122 126 135 135 108 120 127 134 122
The sum of these 13 sample observations is $\sum x_i = 1695$, and the sum of their squares is:
\[
\sum x_i^2 = (154)^2 + (142)^2 + \cdots + (122)^2 = 222{,}581
\]
Thus the numerator of the sample variance is
\[
S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 222{,}581 - \frac{(1695)^2}{13} = 1579.0769
\]
from which $s^2 = 1579.0769/12 = 131.59$ and $s = 11.47$.
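The arithmetic of this example can be verified with a minimal sketch that computes the numerator both from the definition and from the shortcut formula. The variable names are illustrative.

```python
from math import sqrt

# Postsurgical range of motion (degrees) for the 13 patients above.
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]
n = len(rom)
xbar = sum(rom) / n

# Numerator of the sample variance, computed two ways.
sxx_definition = sum((x - xbar) ** 2 for x in rom)           # sum of squared deviations
sxx_shortcut = sum(x * x for x in rom) - sum(rom) ** 2 / n   # shortcut formula

s2 = sxx_shortcut / (n - 1)   # sample variance
s = sqrt(s2)                  # sample standard deviation

print(f"sum = {sum(rom)}, sum of squares = {sum(x * x for x in rom)}")
print(f"Sxx (definition) = {sxx_definition:.4f}, Sxx (shortcut) = {sxx_shortcut:.4f}")
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
```

The shortcut is convenient for hand computation, but it subtracts two large numbers of similar size, so the definition-based form is generally the safer choice in floating-point code.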
Exercise
The following table gives the temperature (in °F) for each day of January 2016 in a certain city in Michigan.
30.0 36.0 32.0 28.0 28.9 37.9 44.1 37.9 48.0 44.1
19.0 24.1 19.0 36.0 41.0 37.0 28.9 15.1 23.0 19.9
26.1 28.0 30.9 35.1 43.0 43.0 33.1 37.0 33.1 51.1
50.0
1. Find the sample mean and the sample variance.
2. We recall the conversion to Celsius:
\[
T(^\circ\mathrm{C}) = \frac{T(^\circ\mathrm{F}) - 32}{1.8}.
\]
What are the sample mean and the sample variance in °C?
Solution in R
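- We first enter the temperatures as a vector:
- Then, the sample mean and the sample variance are given by:
- Finally, since the conversion is a linear transformation, we have:
\[
(\bar{x} \text{ in } ^\circ\mathrm{C}) = \frac{(\bar{x} \text{ in } ^\circ\mathrm{F}) - 32}{1.8} = \frac{33.55806 - 32}{1.8} = 0.8655889.
\]
\[
s^2_{(x \text{ in } ^\circ\mathrm{C})} = \frac{s^2_{[(x \text{ in } ^\circ\mathrm{F}) - 32]}}{1.8^2} = \frac{s^2_{(x \text{ in } ^\circ\mathrm{F})}}{1.8^2} = \frac{88.45652}{1.8^2} = 27.3014.
\]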
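The same computation can be checked numerically. The sketch below uses Python's standard `statistics` module rather than R (a choice made for this illustration only) and compares the direct computation in °C with the values obtained from the linear-transformation rules.

```python
from statistics import mean, variance

# Daily temperatures (°F) for January 2016 from the exercise above.
temps_f = [
    30.0, 36.0, 32.0, 28.0, 28.9, 37.9, 44.1, 37.9, 48.0, 44.1,
    19.0, 24.1, 19.0, 36.0, 41.0, 37.0, 28.9, 15.1, 23.0, 19.9,
    26.1, 28.0, 30.9, 35.1, 43.0, 43.0, 33.1, 37.0, 33.1, 51.1,
    50.0,
]

mean_f = mean(temps_f)
var_f = variance(temps_f)      # sample variance, divisor n - 1

# Direct route: convert every observation, then summarize.
temps_c = [(t - 32) / 1.8 for t in temps_f]
mean_c = mean(temps_c)
var_c = variance(temps_c)

print(f"°F: mean = {mean_f:.5f}, variance = {var_f:.5f}")
print(f"°C: mean = {mean_c:.5f}, variance = {var_c:.4f}")
print(f"rule-based: mean = {(mean_f - 32) / 1.8:.5f}, variance = {var_f / 1.8 ** 2:.4f}")
```

Subtracting 32 shifts the mean but leaves the variance unchanged, and dividing by 1.8 divides the variance by 1.8 squared, which is why both routes agree.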
Boxplot
A box-and-whisker plot emphasizes:
- the center
- the spread
- the departure from symmetry
- the presence of outliers
Min Q1 median Q3 Max
Interquartile Range / Fourth Spread
Definition
- Order the n observations from smallest to largest and separate the data into the smallest half and the largest half; the median x̃ is included in both halves if n is odd.
- The lower fourth (quartile) is the median of the smallest half and the upper fourth (quartile) is the median of the largest half.
- A measure of spread that is resistant to outliers is the interquartile range (IQR), also called the fourth spread ($f_s$), given by
IQR = Q3 − Q1.
The IQR is unaffected by the smallest or the largest 25% of the data. The simplest boxplot is based on the following five-number summary:
Minimum $x_i$   Q1   median   Q3   Maximum $x_i$
Example
We consider again the sample of size n = 20:
15 6 8 5 12 1 3 0 2 9 1 11 4 4 4 3 1 6 2 11
We first sort the sample:
0 1 1 1 2 2 3 3 4 4 4 5 6 6 8 9 11 11 12 15
Thus, we get IQR = 9 − 2 = 7 and the boxplot:
Boxplot that shows outliers
We can add features to the boxplot to show outliers.
Definition
Any observation farther than 1.5 IQR from the closest quartile is an outlier. An outlier is extreme if it is more than 3 IQR from the nearest quartile, and it is mild otherwise.
We can now modify our construction:
1. Draw a whisker out from each end of the box to the smallest and largest observations that are not outliers.
2. Each mild outlier is represented by a closed circle.
3. Each extreme outlier is represented by an open circle.
Remark: Most software does this automatically.
Example 1.20 - Sec 1.4 Devore
- We have the following observations:
- We have x̃ = 92.17, Q1 = 45.64, Q3 = 167.79.
- Observe that Q1 − 1.5 IQR ≤ 0: no outliers on the left.
- Now,
Q3 + 1.5 IQR = 351.015
Q3 + 3 IQR = 534.24
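The outlier cutoffs quoted above follow directly from the two quartiles. The sketch below recomputes them and classifies a few values; the helper `classify` and the test values 400.0 and 600.0 are hypothetical and are not taken from the data set.

```python
q1, q3 = 45.64, 167.79          # quartiles quoted in the example
iqr = q3 - q1                   # interquartile range

# Cutoffs for mild outliers (1.5 IQR) and extreme outliers (3 IQR).
mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr

def classify(x):
    """Label an observation using the 1.5 IQR and 3 IQR rules."""
    if x < extreme_low or x > extreme_high:
        return "extreme outlier"
    if x < mild_low or x > mild_high:
        return "mild outlier"
    return "not an outlier"

print(f"IQR = {iqr:.2f}")
print(f"mild cutoffs:    {mild_low:.3f}, {mild_high:.3f}")
print(f"extreme cutoffs: {extreme_low:.3f}, {extreme_high:.3f}")
for value in (92.17, 400.0, 600.0):   # the median plus two hypothetical values
    print(f"{value:>7} -> {classify(value)}")
```

Since the lower cutoff is negative and the measurements cannot fall below zero, only the upper cutoffs matter in this example.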
Comparative boxplots
- Drawing boxplots side by side (on the same scale) is a very effective way of revealing similarities and differences between two or more data sets.
- Example: miles per gallon by number of cylinders:
Exercise 53 - Sec 1.4 Devore
A mutual fund is a professionally managed investment scheme that pools money from many investors and invests in a variety of securities. Growth funds focus primarily on increasing the value of investments, whereas blended funds seek a balance between current income and growth. Here is data on the expense ratio for samples of 20 large-cap balanced funds and 20 large-cap growth funds:
1. Calculate and compare the values of x̄, x̃, and s for the two types of funds.
2. Construct a comparative boxplot for the two types of funds, and comment on interesting features.