stor 155, section 2, last time distributions (how are data “spread out”?) visual display:...


Upload: aja-denman

Post on 14-Dec-2015

Stor 155, Section 2, Last Time

• Distributions (how are data “spread out”?)

• Visual Display: Histograms

– Binwidth is critical

• Time Plots = Time Series

• Course Organization & Website

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155-07Home.html

Reading In Textbook

Approximate Reading for Today’s Material:

Pages 40-55

Approximate Reading for Next Class:

Pages 64-83

And now for something completely different

Is this class too “monotone”?

• Easier to understand?

• Calm environment enhances learning?

• Or does it induce somnolence?

What is “somnolence”?

Google definition:

Sleepiness, a condition of semiconsciousness approaching coma.

And now for something completely different

An experiment:

• Pull out any coins you have with you

• How many of you have:

– >= 1 penny?

– >= 1 nickel?

– >= 1 dime?

– >= 1 quarter?

• Choose most frequent denomination

And now for something completely different

Collect data (into Spreadsheet):

• Years stamped on coins

(chosen denomination)

• As many as each person has

• Enter into spreadsheet

• Look at “distribution” using histogram

And now for something completely different

• Predicted Answer

– From Textbook, Problem 1.32

• Distribution is Left Skewed

• Works out as predicted?

• Why?

• Note: most skewed dist’ns seem to be:

Right Skewed

Exploratory Data Analysis 4

Numerical Summaries of Quant. Variables:

Idea: Summarize distributional information

(“center”, “spread”, “skewed”)

In Text, Sec. 1.2

for data $x_1, x_2, \ldots, x_n$

(subscripts allow “indexing numbers” in list)

Numerical Summaries

A. “Centers” (note there are several)

1. “Mean” = Average = $\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Greek letter “Sigma” ($\Sigma$), for “sum”

In EXCEL, use “AVERAGE” function

Numerical Summaries of Center

2. “Median” = Value in middle (of sorted list)

Unsorted E.g:                  Sorted E.g:

 3                              0
 1                              1
27   “in middle”? (no)          2   better “middle”!
 2                              3
 0                             27

EXCEL: use function “MEDIAN”
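The two “centers” can be checked on the five-value list above. A minimal sketch using Python's standard `statistics` module; Python is an assumption here, since the course itself uses EXCEL's AVERAGE and MEDIAN:

```python
from statistics import mean, median

data = [3, 1, 27, 2, 0]      # unsorted list from the slide

x_bar = mean(data)           # (3 + 1 + 27 + 2 + 0) / 5
m = median(data)             # middle value of the sorted list [0, 1, 2, 3, 27]

print("sorted:", sorted(data))
print("mean  :", x_bar)
print("median:", m)
```

The mean (6.6) is pulled toward the value 27, while the median stays at 2.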

Difference Betw’n Mean & Median

Symmetric Distribution: Essentially no difference

Right Skewed Distribution:

M splits the distribution: 50% area on each side

$\bar{x}$ is bigger than M since it “feels tails more strongly”
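A small numerical check of the right-skewed case. The data below are hypothetical, chosen only so that most values are small with a long tail to the right:

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values small, a long tail to the right.
right_skewed = [1, 1, 2, 2, 3, 4, 10]

x_bar = mean(right_skewed)   # pulled toward the long right tail
m = median(right_skewed)     # half of the values on each side

print("mean   =", round(x_bar, 2))
print("median =", m)
print("mean > median:", x_bar > m)
```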

Difference Betw’n Mean & Median

Outliers (unusual values):

Simple Web Example:

http://www.stat.sc.edu/~west/applets/box.html

• Mean feels outliers much more strongly

– Leaves “range of most of data”

– Good notion of “center”? (perhaps not)

• Median affected very minimally

– Robustness Terminology:

Median is “resistant to the effect of outliers”
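The same effect can be seen without the applet. A sketch with hypothetical numbers, where one value is replaced by an outlier (as if it had been mistyped):

```python
from statistics import mean, median

clean = [10, 11, 12, 13, 14]          # hypothetical data, no outlier
with_outlier = [10, 11, 12, 13, 140]  # same list, largest value mistyped

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name:12s} mean = {mean(values):6.1f}   median = {median(values)}")
```

The mean moves from 12 to 37.2, outside the range of most of the data, while the median stays at 12.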

Difference Betw’n Mean & Median

A richer web example:

Publisher’s Web Site: Statistical Applets: Mean & Median

• For Symmetric distributions:

– Both are same

• Add an outlier:

– Mean feels it much more strongly

– Implication for “bad data”: can be very bad

• Two Clusters:

– Median jumps more quickly

– Mean more stable (better?)

Computation using Excel

Some Toy Examples:

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg3Done.xls

• Compute Using Excel Functions

• Mean feels location of data on number line

• Median feels location of data in sorted list

• Median breaks tie by averaging center points
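The tie-breaking rule in the last bullet, sketched in Python (an assumption; the slides use EXCEL's MEDIAN, which follows the same averaging rule for an even number of values):

```python
from statistics import median

odd = [0, 1, 2, 3, 27]     # 5 values: one middle value
even = [0, 1, 2, 3]        # 4 values: two "center points", 1 and 2

print(median(odd))         # middle value of the sorted list
print(median(even))        # average of the two center points: (1 + 2) / 2
```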

Numerical Centerpoint HW

HW: 1.46 a, 1.47, 1.49

• Use EXCEL

And now for something completely different

Check out this small quick movie clip:

And now for something completely different

Suggestions for other things to show here are very welcome….

• Movie Clips…

• Music…

• Jokes…

• Cartoons…

• …

Numerical Summaries (cont.)

B. “Spreads” (again there are several)

1. Range = biggest - smallest

range = $\max_i x_i - \min_i x_i$

Problems:

• Feels only “outliers”

• Not “bulk of data”

• Very non-resistant to outliers
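A quick illustration of why the range “feels only outliers”, reusing the sorted list from the median example. This is a sketch in Python, not part of the course's EXCEL material:

```python
data = [0, 1, 2, 3, 27]                        # sorted list from the median example

data_range = max(data) - min(data)             # biggest - smallest
bulk_range = max(data[:-1]) - min(data[:-1])   # same, with the value 27 left out

print("range with 27   :", data_range)
print("range without 27:", bulk_range)
```

One value changes the range from 3 to 27, so the range says little about the bulk of the data.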

Numerical Summaries of Spread

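2. Variance = $s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

= “average squared distance to $\bar{x}$”

EXCEL: VAR

Drawback: units are wrong

e.g. for $x_i$ in feet, $s^2$ is in square feet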
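The variance formula can be written out step by step and compared with a library routine. A minimal sketch in Python (an assumption; the slides use EXCEL's VAR, which likewise divides by n - 1):

```python
from statistics import mean, variance

data = [0, 1, 2, 3, 27]

x_bar = mean(data)
squared_distances = [(x - x_bar) ** 2 for x in data]
s2_by_hand = sum(squared_distances) / (len(data) - 1)   # divide by n - 1

s2_library = variance(data)    # sample variance, also divides by n - 1

print("by hand:", s2_by_hand)
print("library:", s2_library)
```

If the data were in feet, both numbers would be in square feet, which is the “units are wrong” drawback.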

Numerical Summaries of Spread

3. Standard Deviation = $s = \sqrt{s^2}$

EXCEL: STDEV

• Scale is right

• But not resistant to outliers

• Will use quite a lot later

(for reasons described later)

Interactive View of S. D.

Interesting web example (manipulate histogram):

http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html

• Note SD range centered at mean

• Can put SD “right near middle” (densely packed data)

• Can put SD at “edges of data” (U shaped data)

• Can put SD “outside of data” (big spike + outlier)

• But generally “sensible measure of spread”
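The three situations above can be imitated numerically. The data sets below are hypothetical, picked to mimic the histogram shapes named in the bullets; the “SD range” is taken to mean the interval from mean - SD to mean + SD:

```python
from statistics import mean, stdev

# Three hypothetical data sets, chosen to mimic the histogram shapes above.
examples = {
    "densely packed": [4, 5, 5, 5, 5, 5, 5, 6],
    "U shaped": [0, 0, 0, 10, 10, 10],
    "spike + outlier": [0] * 9 + [100],
}

def sd_interval(values):
    """Return (mean - SD, mean + SD), the SD range centered at the mean."""
    x_bar = mean(values)
    s = stdev(values)          # sample SD, the square root of the variance
    return x_bar - s, x_bar + s

for name, values in examples.items():
    low, high = sd_interval(values)
    print(f"{name:16s} data in [{min(values)}, {max(values)}], "
          f"SD range = [{low:.2f}, {high:.2f}]")
```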

Variance – S. D. HW

C3: For the data set in 1.46 (i.e. 1.37), find the:

i. Variance (1620)

ii. Standard Deviation (40.2)

• Use EXCEL

Numerical Summaries of Spread

4. Interquartile Range = IQR

Based on “quartiles”, Q1 and Q3

(idea: show the points 25% & 75% of the way “through the data”)

25% | Q1 | 25% | Q2 = median | 25% | Q3 | 25%

IQR = Q3 – Q1
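A sketch of the quartile and IQR computation in Python. The data set is hypothetical, and the choice of `method="inclusive"` is an assumption: several quartile conventions exist, and the textbook's rule may give slightly different values on other data sets.

```python
from statistics import quantiles

# Hypothetical sorted data set with 9 values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# method="inclusive" interpolates between data points; other conventions
# (including some textbook rules) can give slightly different quartiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1 =", q1, " Q2 (median) =", q2, " Q3 =", q3)
print("IQR = Q3 - Q1 =", iqr)
```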

Quartiles Example

Revisit Hidalgo Stamp Thickness example:

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Done.xls

Right skewness gives:

– Median < Mean

(mean “feels farther points more strongly”)

– Q1 near median

– Q3 quite far

(makes sense from histogram)

Quartiles Example

A look under the hood:

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Raw.xls

• Can compute as separate functions for each

• Or use:

Tools → Data Analysis → Descriptive Stats

• Which gives many other measures as well

• Use “k-th largest & smallest” to get quartiles

5 Number Summary

1. Minimum

2. Q1 - 1st Quartile

3. Median

4. Q3 - 3rd Quartile

5. Maximum

Summarize Information About:

a) Center - from 3

b) Spread - from 2 & 4 (maybe 1 & 5)

c) Skewness - from 2, 3 & 4

d) Outliers - from 1 & 5
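The five numbers can be collected by one small helper. This is a sketch in Python; the function name `five_number_summary` is hypothetical and the “inclusive” quartile convention is an assumption:

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Quartiles use the "inclusive" interpolation convention, which is an
    assumption of this sketch; other quartile rules exist.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

summary = five_number_summary([3, 1, 27, 2, 0])
print(summary)
```

For this list, Q1 and Q3 sit equally far from the median, while the maximum (27) lies far above Q3 (3), which hints at an outlier at the high end.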

5 Number Summary

How to Compute?

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Done.xls

• EXCEL function QUARTILE

• “One stop shopping”

• IQR seems to need explicit calculation

Rule for Defining “Outliers”

Caution: There are many of these

Textbook version:

Above Q3 + 1.5 * IQR

Below Q1 – 1.5 * IQR

For stamps data:

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Done.xls

– No outliers at “low end”

– Some at “high end”
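The textbook rule, sketched as a function. The name `flag_outliers` is hypothetical, the quartile convention is an assumption, and the input is the toy list from the median example rather than the stamps data:

```python
from statistics import quantiles

def flag_outliers(values):
    """Return values above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")  # convention assumed
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

print(flag_outliers([3, 1, 27, 2, 0]))
```

Here Q1 = 1 and Q3 = 3, so the fences are -2 and 6, and only 27 is flagged.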

5 Number Sum. & Outliers HW

1.43

Box Plot

• Additional Visual Display Device

• Again legacy from pencil & paper days

• Not supported in EXCEL

• So we won’t do

• Main use: comparing populations

– Example: Figure from text

• Want to do this?

Find better software package than Excel

And now for something completely different

Recall distribution of majors of students in this course:

[Bar chart: “Stat 155, Section 2, Majors”. Vertical axis: Frequency (0 to 0.4). Categories: Business / Man., Biology, Public Policy / Health, Pharm / Nursing, Journalism / Comm., Env. Sci., Other, Undecided.]

And now for something completely different

How about a business manager joke?

How many managers does it take to replace a light bulb?

Two. One to find out if it needs changing, and one to tell an employee to change it.

Source: http://www.joblatino.com/jokes/managers.html

Linear Transformations

Idea: What happens to data & summaries,

when data are:

“shifted and scaled”

i.e. “panned and zoomed”

Math: $x_1, \ldots, x_n \mapsto a x_1 + b, \ldots, a x_n + b$

Scaled by a

Shifted by b

Linear Transformations

Effect on linear summaries:

• Centerpoints, $\bar{x}$ and $M$,

“follow data”: $a\bar{x} + b$, $aM + b$.

• Spreads, $s$ and $IQR$,

“feel scale, not shift”: $as$, $a \cdot IQR$ (for $a > 0$; in general the factor is $|a|$).
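These rules can be verified numerically. A sketch in Python with hypothetical values of a and b; the helper `iqr` and its quartile convention are assumptions of this example:

```python
from math import isclose
from statistics import mean, median, quantiles, stdev

def iqr(values):
    """Interquartile range, using the "inclusive" quartile convention (assumed)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

a, b = 2, 5                      # hypothetical scale and shift
x = [0, 1, 2, 3, 27]
y = [a * xi + b for xi in x]     # transformed data a*x_i + b

# Centerpoints "follow data"
print(isclose(mean(y), a * mean(x) + b))
print(isclose(median(y), a * median(x) + b))

# Spreads "feel scale, not shift" (abs(a) also covers negative a)
print(isclose(stdev(y), abs(a) * stdev(x)))
print(isclose(iqr(y), abs(a) * iqr(x)))
```

All four checks print `True`: the shift b moves the centers but drops out of both spreads.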

Most Useful Linear Transfo.

“Standardization”

Goal: put data sets on “common scale”

Approach:

1. Subtract Mean $\bar{x}$,

to “center at 0”

2. Divide by S.D. $s$,

to “give common SD = 1”

Standardization

Result is called “z-score”: $z_i = \frac{x_i - \bar{x}}{s}$

Note that $x_i - \bar{x} = s z_i$, so $x_i = \bar{x} + s z_i$

Thus $z_i$ is interpreted as:

“number of SDs from the mean”
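The z-score and its inverse, sketched in Python on the toy list used earlier (Python is an assumption here; the slides use EXCEL):

```python
from statistics import mean, stdev

x = [0, 1, 2, 3, 27]
x_bar = mean(x)
s = stdev(x)

z = [(xi - x_bar) / s for xi in x]          # z_i = (x_i - x_bar) / s
x_back = [x_bar + s * zi for zi in z]       # x_i = x_bar + s * z_i

for xi, zi in zip(x, z):
    print(f"x = {xi:2d}   z = {zi:6.2f}   ({abs(zi):.2f} SDs from the mean)")
```

Values below the mean get negative z-scores, and `x_back` recovers the original data up to rounding.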

Standardization Example

Next time: work in Excel command:

STANDARDIZE

Standardization Example

Buffalo Snowfall Data:

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg7Done.xls

• Standardized data have same (EXCEL default) histogram shape as raw data.

(Since axes and bin edges just

follow the transformation)

• i.e. “shape” doesn’t depend on “scaling”

Standardization Example

A look under the hood:

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg7Raw.xls

1. Compute AVERAGE and SD

2. Standardize by:

a. Create Formula in cell B2

b. Drag downwards

c. Keep Mean and SD cells fixed using “$” signs (absolute cell references)

3. Check stand’d data have mean 0 & SD 1

note that “8.247E-16 = 0” (floating-point rounding error, effectively zero)
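The spreadsheet steps can be mirrored in a few lines of Python. The raw values below are hypothetical stand-ins for a spreadsheet column, not the course's snowfall data:

```python
from math import isclose
from statistics import mean, stdev

# Hypothetical raw values standing in for a spreadsheet column (not the course data).
raw = [25.0, 39.8, 54.7, 63.6, 71.5, 82.4, 110.5]

x_bar = mean(raw)                              # step 1: AVERAGE
s = stdev(raw)                                 # step 1: SD
standardized = [(x - x_bar) / s for x in raw]  # step 2: same formula in every row

# Step 3: the mean may print as a tiny nonzero number because of rounding.
z_mean = mean(standardized)
z_sd = stdev(standardized)
print("mean of z:", z_mean)
print("SD of z  :", z_sd)
print(isclose(z_mean, 0.0, abs_tol=1e-12), isclose(z_sd, 1.0))
```

Comparing with a tolerance, rather than testing for exact equality with 0, is the code version of reading a value like 8.247E-16 as zero.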

Standardization HW

C4: For data in 1.17, use EXCEL to:

a. Give the list of standardized scores

b. Give the Z-score for:

(i) the mean (0)

(ii) the median (-0.223)

(iii) the smallest (-1.21)

(iv) the largest (2.77)

1.59a, 1.73