data displays and summary statisticsrenaes/251/lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 |...

13
Data displays and summary statistics Statistics 251: Statistical Methods Module 2 Updated 2020 Have data will, er. . . ? Now what to do with the data? • Graphs visualize what the data is telling us • see trends and other features • knowing what the data looks like in graph form tells us what analysis to use Qualitative Graphs Bar graph consist of bars that are separated from each other and are usually rectangular; bars represent the categories Pie graph shows parts of a whole, in pie form. Not very useful and easy to manipulate; not used in this class beyond an example Barplot Red Green Yellow Counts 0 50 100 150 200 Distribution of M&M Colors 1

Upload: others

Post on 13-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

Data displays and summary statisticsStatistics 251: Statistical Methods

Module 2Updated 2020

Have data will, er. . . ?Now what to do with the data?

• Graphs visualize what the data is telling us• see trends and other features• knowing what the data looks like in graph form tells us what analysis to use

Qualitative GraphsBar graph consist of bars that are separated from each other and are usually rectangular; bars representthe categories

Pie graph shows parts of a whole, in pie form. Not very useful and easy to manipulate; not used in thisclass beyond an example

Barplot

Red Green Yellow

Cou

nts

050

100

150

200

Distribution of M&M Colors

1

Page 2: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

Pie chart

Red 13 %

Blue 21 %

Green 14 %

Orange 26 %

Yellow 12 %

Brown 14 %

Pie Chart of M&M colors

Quantitative graphs Ihistogram consists of adjoining rectangles. The horizontal axis (x) is the ranges of data values and thevertical axis (y) is the frequency (counts) or relative frequency (percents or probability, denoted as rf).

stemplot (stem-and-leaf plot) comes from the field of exploratory data analysis and is good when the datasets are small. To create the plot, divide each observation of data into a stem and a leaf. The leaf consists ofa final significant digit.

HistogramHistogram of hgts

hgts

Fre

quen

cy

60 62 64 66 68 70 72 74

05

1020

30

Stemplot

The decimal point is 1 digit(s) to the right of the |

2

Page 3: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

0 | 020 | 681 | 000131 | 5578892 | 0011233344

Quantitative graphs IIboxplot also called box-and-whisker plots or box-whisker plots. Gives a good graphical image of theconcentration of the data and also show how far the extreme values are from most of the data. A boxplot isconstructed from five values: the minimum value (min), the first quartile (Q1), the median (Median), thethird quartile (Q3), and the maximum value (max); the collection of these five summaries are called the “5Number Summary”.

scatterplot is a type of plot or mathematical diagram using Cartesian coordinates to display values fortypically two variables for a set of data, an x and y. Is useful for identifying trends/associations between twovariables.

time series also called line plot. Shows the distribution of a varaible over a specified time period.

BoxplotVertical boxplot

05

1015

20

Hours playing video games per week

BoxplotHorizontal boxplot

3

Page 4: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

0 5 10 15 20

Hours playing video games per week

Scatterplot

1.5 2.5 3.5 4.5

5060

7080

90

eruptions

wai

ting

Old Faithful eruption and interval times

4

Page 5: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

Time series (line plot)

CO2 1957−1997

Time

co2

1960 1970 1980 1990

320

340

360

Multiple time series

2003 2005 2007 2009

1000

000

4000

000

year

Ukr

aine

CO2 1957−1997

5

Page 6: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

Contour plot

90100110120130140150160170180190

Height(meters)

100 400 700

100

200

300

400

500

600

The Topography of Maunga Whau

Meters North

Met

ers

Wes

t

Describing distributions Isymmetric: if a vertical line can be drawn at some point in the histogram such that the shape to the leftand the right of the vertical line are mirror images of each other. In symmetric distributions, the mean andmedian are the same (approximately equal) and the mode(s) are generally in the center as well

left (or negative) skew: when it looks like the graph is “pulled” to the left (fewer observations to the leftthan the right); mode(s) generally on right side

right (or positive) skew: when it looks like the graph is “pulled” to the right (fewer observations to theright than the left); mode(s) generally on left side

Symmetric

hist(hgts)

6

Page 7: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

Histogram of hgts

hgts

Fre

quen

cy

60 62 64 66 68 70 72 74

05

1020

30

Left skewed

hist(b)

Histogram of b

b

Fre

quen

cy

−7 −6 −5 −4 −3 −2 −1 0

010

3050

Right skewed

7

Page 8: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

hist(c)

Histogram of c

c

Fre

quen

cy

0 1 2 3 4

010

2030

4050

Describing distributions IIunimodal: one main mode in the data set

bimodal: two modes (more than two is multimodal)

Unimodal

hist(hgts)

8

Page 9: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

Histogram of hgts

hgts

Fre

quen

cy

60 62 64 66 68 70 72 74

05

1020

30

Also unimodal

hist(b)

Histogram of b

b

Fre

quen

cy

−7 −6 −5 −4 −3 −2 −1 0

010

3050

Still unimodal

9

Page 10: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

hist(c)

Histogram of c

c

Fre

quen

cy

0 1 2 3 4

010

2030

4050

Bimodal

with(faithful,hist(eruptions))

Histogram of eruptions

eruptions

Fre

quen

cy

2 3 4 5

020

4060

Measures of Location Ipercentiles: divide ordered data into hundredths; useful for comparing values, especially with largepopulations. Ex: unemployment rates, SAT scores

10

Page 11: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

quartiles: divide ordered data into quarters- Q1: quartile 1 refers to the 25th percentile (25% of the data is less than Q1 and 75% of the data is morethan Q1)- M : the median refers to the 50th percentile, the center-most value of the ordered dataset (50% of the datais less than M and 50% of the data is more than M).- Q3: quartile 3 refers to the 75th percentile (75% of the data is less than Q3 and 25% of the data is morethan Q3)

Measures of Location IImean: the mathematical average. µ represents the population mean and X represents the sample mean. Itis calculated by the sum of the observation values divided by the number of observations

minimum, maximum: the smallest (min) and largest (max), respectively value of the dataset

mode: the most frequently ocurring observation(s); can have more than 1, can also have 0

population size: the number of elements in a population; denoted as N (always upper case)

sample size: the number of observations in a dataset; denoted as n (always lower case)

Formulas IMedian: values would be calculated the same way for the population as for the sample. We mostly dealwith samples. The data must be ordered before calculating the values

M = n+12

th observation (if n is odd); M = avg(n2 ,

n2 + 1

)th observations (if n is even)

Q1 = median of lower half of ordered datasetQ3 = median of upper half of ordered datasetmin the smallest valuemax the largest valuemode the observation(s) that occur most frequently N population size (most often not known or difficult tofind)n the sample size; the number of observations in a dataset

ExamplesDataset: {1,11,6,7,4,1,12} → order it: {1,1,4,6,7,11,12} and n = 7

M = n+12

th = 7+12th = 4th observations. M = 6. Note that sometimes the median is not an original

observation (and it does not have to be)

Q1: median of the lower half of the dataset (not including M): {1,1,4} so Q1 = 1 Q3: median of the upperhalf of the dataset (not including M): {7,11,12} so Q3 = 11min = 1, max = 12, mode = 1, N is unknown/not given, n = 7

The MeanPopulation values (parameters) are usually denoted with letters from the Greek alphabet, while sample values(statistics) are usually denoted with English letters (± a few extra symbols here and there)

Population mean: µ =∑

Xi

N where Xi are the values of each element (individual) in the population

Sample mean: X =∑

xi

n = where xi are the values of each observation in the sample

Example of sample mean: X =∑

xi

n = 1+11+6+7+45 = 440

11

Page 12: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

Measures of VariabilityIQR (interquartile range): shows the “spread” of the middle 50% of the data; also used to help identifyoutliers

outlier: a data point that is significantly different than the other data points. Some could be due to dataentry errors, some are unique and usually require more investigation

range: the difference between the max and min values; shows entire “spread” of the data

variance: the average squared distance each data point is from its mean

standard deviation: the average distance each data point is from its mean; is used most often as the mainmeasure of variability (thus its name)

Formulas IIIQR = Q3−Q1

A value is considered a potential outlier if it is < Q1− IQR(1.5) or if > Q3 + IQR(1.5)

range = max−min

Population variance: σ2 =∑

(Xi−µ)2

N = (X1−µ)2+(X2−µ)2+···+(XN −µ)2

N

Sample variance: s2 =∑

(xi−X)2

n−1 = (x1−X)2+(x2−X)2+···+(xn−X)2

n−1 where n− 1 = df , the degrees of freedom(more to come on df later)

Population standard deviation: σ =√∑

(Xi−µ)2

N =√σ2

Sample variance: s =√∑

(xi−X)2

n−1 =√s2

Note: σ2, σ, s2, and s MUST ≥ 0

More ExamplesDataset: {1,1,4,6,7,11,12}

n = 5, the 5# summary is: {1,1,6,11,12} (min,Q1,M,Q3,max)

IQR = Q3 − Q1 = 11 − 1 = 10; IQR(1.5) = 10(1.5) = 15; lower outlier boundary: = Q1 − IQR(1.5) =1 − 15 = −14 and upper outlier boundary: = Q3 + IQR(1.5) = 11 + 15 = 26. Since no observations areoutside those boundaries, there are no potential outliers in the dataset

Continued Examplerange = max−min = 12− 1 = 11

s2 =∑

(xi−X)2

n−1 = (1−6)2+(1−6)2+(4−6)2+···+12−6)2

7−1 = 63800

s =√∑

(xi−X)2

n−1 =√s2 =

√19.33 = 252.5866

With no idea of population size nor population values, we cannot compute the parameters (population values)but will use the statistics calculated above as estimates of the parameters

12

Page 13: Data displays and summary statisticsrenaes/251/Lectures/module2.pdf0 | 02 0 | 68 1 | 00013 1 | 557889 2 | 0011233344 QuantitativegraphsII boxplot also called box-and-whisker plots

Appropriate summaries for different distributionsWhen looking at data that has a non-symmetric distribution (skewed, bimodal, multimodal, . . . ), the beststatistics to use would be the 5# summary with the IQR, with a boxplot.

When looking at data that has a symmetric distribution (espeically the “normal” distribution – bell curve),the best statistics to use would be the mean and standard deviation, with a histogram (usually – boxplotsare ok too when n is small).

Empirical RuleThe Empirical Rule (ER) is appropriate only with symmetric distributions. ER states:

68% of observations are within the interval X ± 1s95% of observations are within the interval X ± 2s99.7% of observations are within the interval X ± 3s

ER exampleExample: X = 15, s = 268% of observations are within the interval X ± 1s = 15± 2 = (13, 17)95% of observations are within the interval X ± 2s = 15± 2(2) = (11, 19)99.7% of observations are within the interval X ± 3s = 15± 3(2) = (9, 21)

13