161.120 introductory statistics week 4 lecture slides


Upload: stefan

Post on 09-Feb-2016


Page 1: 161.120 Introductory Statistics  Week 4 Lecture slides

161.120 Introductory Statistics Week 4 Lecture slides

• Exploring Time Series
– CAST chapter 4

• Relationships between Categorical Variables
– Text sections 6.1
– CAST chapter 5

• Data Presentation
– Study Guide: extra notes section 13

Page 2: 161.120 Introductory Statistics  Week 4 Lecture slides

Time Series – What you need to be able to do

• Plot time series and use least squares to make forecasts

• Identify and describe in words (in terms that the data collector might understand) the trend and seasonal components in a time series plot

Page 3: 161.120 Introductory Statistics  Week 4 Lecture slides

What is a Time Series?

A series of data values recorded (generally at equal time intervals) sequentially in time.

Average age at death of people each month in a large city over a period of 5 years

Area of rice grown in East Asia each year for the past 15 years

Weight of every 100th kiwifruit packed during an 8-hour shift

Number of hospital admissions each day over a period of 5 months

Page 4: 161.120 Introductory Statistics  Week 4 Lecture slides

What is a Time Series?

• There is often a time-related pattern to the variability.
– A trend towards higher or lower values over time
– A pattern that repeats regularly

• Ignoring the time ordering and examining the data with dot plots or similar univariate techniques may result in useful information being missed

• Particularly important in business and commerce

Page 5: 161.120 Introductory Statistics  Week 4 Lecture slides

The importance of plotting

• Can be difficult to get useful information from time series if they are presented in tabular form.

• Information in a time series is most easily understood from a graphical display.

• A time series plot is a type of dot plot in which the values are displayed as crosses against a vertical axis.
– The horizontal axis spreads out the crosses in time order.
– (It can also be thought of as a scatterplot in which the 'explanatory' variable is time.)
– The successive crosses are often joined by lines.

Page 6: 161.120 Introductory Statistics  Week 4 Lecture slides

[Figure: time series plot, "Drownings in New Zealand". Vertical axis: Total Drownings (100 to 220). Horizontal axis: Year (1985 to 2003).]

Page 7: 161.120 Introductory Statistics  Week 4 Lecture slides

Trend

• Time series data often change systematically over time – this change is called the trend.

– The long-term upward or downward movements in the values. For example, time series plots of commodity prices often have an upward trend over a period of years.
– The trend can be masked by random fluctuations.
– Trend is very important for forecasting future values.

Page 8: 161.120 Introductory Statistics  Week 4 Lecture slides

Smoothing Methods

• Reduce the fluctuations and show the trend more clearly.
– These methods replace each value in the series with a function of it and the adjacent values.

• Moving averages (also called running means)
– Each value is replaced by the mean of it and the two adjacent values (3-point moving average)

[Figure: "Drownings in New Zealand". Total Drownings and its 3-point moving average, MA(3), plotted against Year (1985 to 2003). Vertical axis: Total Drownings (100 to 220).]

Page 9: 161.120 Introductory Statistics  Week 4 Lecture slides

Greater smoothing is obtained by using means of more adjacent values.

• Effective at highlighting the trend in the centre of a time series, but cannot be used at the ends since the moving average requires values both before and after each value being smoothed.
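The moving average described above can be sketched in a few lines of Python. This is only an illustration: the function name `moving_average` and the short series are made up for the example and are not the drownings data from the slides. Positions where the window does not fit (the ends of the series) are left as `None`, which is the limitation the bullet above describes.

```python
def moving_average(values, window=3):
    """Centred moving average; None where the window does not fit."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        if i < half or i >= len(values) - half:
            # Not enough neighbours before or after this value.
            smoothed.append(None)
        else:
            block = values[i - half : i + half + 1]
            smoothed.append(sum(block) / window)
    return smoothed


# Made-up illustrative series (an assumption, NOT data from the slides).
series = [10, 13, 10, 16, 13, 19, 16]

ma3 = moving_average(series, 3)  # loses one value at each end
ma5 = moving_average(series, 5)  # smoother, but loses two values at each end

print(ma3)
print(ma5)
```

A wider window averages over more neighbours, so the zig-zag in the made-up series is flattened more by the 5-point average than by the 3-point one, at the cost of more missing values at the ends.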

[Figure: "Drownings in New Zealand". Total Drownings with moving averages MA(3), MA(5) and MA(7), plotted against Year (1985 to 2003). Vertical axis: Total Drownings (100 to 220).]

Page 10: 161.120 Introductory Statistics  Week 4 Lecture slides

Forecasting – Least squares

• Linear model
– Residuals

• Recode year

[Figure: Fitted Line Plot of Total Drownings (100 to 220) against Code (0 to 20). Fitted line: Total Drownings = 179.3 - 3.421 Code. S = 13.1705, R-Sq = 71.4%, R-Sq(adj) = 69.8%.]

Page 11: 161.120 Introductory Statistics  Week 4 Lecture slides

• Quadratic model

• Patterns in residuals

• Forecasting
– Once the equation of a trend line (using least squares) is obtained, insert future time values into the equation to get a forecast.
– Beware forecasting many time periods into the future
• The shape of the actual trend line might be different from your model
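As a rough sketch of the forecasting step, the Python below fits a straight line by least squares and then inserts a future time value into the fitted equation. The helper names `fit_line` and `forecast` and the five-point series are assumptions made for illustration only. The last line reuses the fitted line quoted on the slide (Total Drownings = 179.3 - 3.421 Code); the slide does not say how Code maps to calendar years, so the value 20 is just an example code beyond the plotted range.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast(intercept, slope, code):
    """Insert a (future) time code into the fitted trend line."""
    return intercept + slope * code


# Made-up series with recoded time 1..5 (an assumption, NOT the slide data).
codes = [1, 2, 3, 4, 5]
values = [20, 18, 17, 14, 11]

intercept, slope = fit_line(codes, values)
next_value = forecast(intercept, slope, 6)  # one period ahead
print(round(intercept, 2), round(slope, 2), round(next_value, 2))

# Using the coefficients quoted on the slide for an example code of 20.
slide_forecast = forecast(179.3, -3.421, 20)
print(round(slide_forecast, 2))
```

Forecasting only one or two periods ahead keeps the code value close to the observed range, which is where a straight-line model is most likely to remain a reasonable description of the trend.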

Page 12: 161.120 Introductory Statistics  Week 4 Lecture slides

Cycles

• Not all increases and decreases can be explained by a smooth trend line.
• Many time series change in cycles
• Cyclical Patterns
– Cycles do not repeat regularly
– Example: sunspot activity has a cycle of approximately 11 years, but not all cycles are of the same length.
• Seasonal Patterns
– Not usually referred to as ‘cyclical’
– Distinguished by a period that repeats exactly
– Regular cycles that are strongly related to the calendar
• Monthly or quarterly data often have a pattern of peaks and troughs that repeats in a similar way each year
– Important that the most recent value is not interpreted in relation to the immediately preceding value

Page 13: 161.120 Introductory Statistics  Week 4 Lecture slides

Relationships between Categorical Variables

• What we might ask

– Explain why relative frequencies allow better comparison between groups.

– Use stacked and grouped bars in a bar chart to better compare groups.

– Identify whether a table of data is a contingency table.

– Find marginal and conditional proportions from a contingency table to answer questions stated in words.

Page 14: 161.120 Introductory Statistics  Week 4 Lecture slides

Contingency Tables

• Single rectangular array combining frequency table for each variable

Example: A study exploring the relationship between hypertension (high blood pressure) and amount of smoking of a sample of 200 people.

Degree of hypertension   Frequency
Severe                   44
Mild                     69
None                     87
Total                    200

Amount of smoking   Frequency
None                70
Moderate            54
Heavy               76
Total               200

                         Amount of smoking
Degree of hypertension   None   Moderate   Heavy   Total
Severe                   10     14         20      44
Mild                     20     18         31      69
None                     40     22         25      87
Total                    70     54         76      200
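One way to see how the contingency table combines the two frequency tables is to recompute the margins from the cell counts. The sketch below uses the nine cell counts from the example; the variable names are illustrative choices, not part of the slides.

```python
# Cell counts from the hypertension / smoking example:
# outer key = degree of hypertension, inner key = amount of smoking.
counts = {
    "Severe": {"None": 10, "Moderate": 14, "Heavy": 20},
    "Mild": {"None": 20, "Moderate": 18, "Heavy": 31},
    "None": {"None": 40, "Moderate": 22, "Heavy": 25},
}

# Row totals reproduce the frequency table for degree of hypertension.
row_totals = {level: sum(row.values()) for level, row in counts.items()}

# Column totals reproduce the frequency table for amount of smoking.
column_totals = {}
for row in counts.values():
    for smoking, n in row.items():
        column_totals[smoking] = column_totals.get(smoking, 0) + n

grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```

The row and column totals match the two single-variable frequency tables, and both sets of totals add up to the 200 people in the sample.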

Page 15: 161.120 Introductory Statistics  Week 4 Lecture slides

• Fully describes categorical data (2 or more groups)
• Poor way to compare distributions if there are different total numbers in the groups

• Can be more informative to use proportions within the groups (each frequency in the table is divided by the total for that group)

                           Places
Means of Transport         Christchurch   Palmerston North   Total
Private/Company vehicle    111687         21825              133512
Public transport           5406           351                5757
Bicycle                    8667           2013               10680
Walked / Jogged            6624           2406               9030
Other                      9195           2106               11301
Total                      141579         28701              170280

                           Places
Means of Transport         Christchurch   Palmerston North
Private/Company vehicle    0.79           0.76
Public transport           0.04           0.01
Bicycle                    0.06           0.07
Walked / Jogged            0.05           0.08
Other                      0.06           0.07
Total                      1              1
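The step from the table of counts to the table of proportions can be sketched as follows. The counts are those in the transport table; the dictionary layout and variable names are assumptions made for this illustration.

```python
# Counts from the means-of-transport table, by place.
counts = {
    "Private/Company vehicle": {"Christchurch": 111687, "Palmerston North": 21825},
    "Public transport": {"Christchurch": 5406, "Palmerston North": 351},
    "Bicycle": {"Christchurch": 8667, "Palmerston North": 2013},
    "Walked / Jogged": {"Christchurch": 6624, "Palmerston North": 2406},
    "Other": {"Christchurch": 9195, "Palmerston North": 2106},
}
places = ["Christchurch", "Palmerston North"]

# Total for each group (place).
totals = {p: sum(row[p] for row in counts.values()) for p in places}

# Each frequency divided by the total for its group.
proportions = {
    mode: {p: row[p] / totals[p] for p in places}
    for mode, row in counts.items()
}

for mode, row in proportions.items():
    print(f"{mode:<25}" + "".join(f"{row[p]:>18.2f}" for p in places))
```

Because Christchurch has roughly five times as many respondents as Palmerston North, the raw counts are hard to compare, whereas the within-place proportions sit on the same scale.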

Page 16: 161.120 Introductory Statistics  Week 4 Lecture slides

Example 6.1 Smoking and Divorce Risk

Data on smoking habits and divorce history for the 1669 respondents who had ever been married.

Among smokers, 49% have been divorced, 51% have not.
Among nonsmokers, only 32% have been divorced, 68% have not.
The difference between row percents indicates a relationship.

Page 17: 161.120 Introductory Statistics  Week 4 Lecture slides

Same shape whether based on frequency or relative frequency

[Figure: bar chart of Relative Frequency (0 to 50) for Divorced? (Yes, No) within the Smoker and Non smoker groups.]

Page 18: 161.120 Introductory Statistics  Week 4 Lecture slides

When the groups correspond to different rows, the most important comparisons are down columns.

         Ever Divorced?
Smoke?   Yes    No     Total
Yes      0.49   0.51   1
No       0.32   0.68   1

Page 19: 161.120 Introductory Statistics  Week 4 Lecture slides

The corresponding bars for the smoking groups are widely spread, making comparison harder.

Can cluster bars by smoking group.

[Figure: bar chart of Relative Frequency (within smoking groups), 0.0 to 0.7, with bars grouped by divorce status (Divorced, Not Divorced) and then by Smoke (Yes, No).]

[Figure: the same relative frequencies with bars clustered by Smoke (Yes, No) and then by divorce status (Divorced, Not Divorced).]

Page 20: 161.120 Introductory Statistics  Week 4 Lecture slides

Example 6.2 Tattoos and Ear Pierces

Responses from n = 565 men to two questions:
1. Do you have a tattoo?
2. How many total ear pierces do you have?

Among men with no ear pierces, 43/424 = 10% have a tattoo.
Among men with one ear pierce, 16/70 = 23% have a tattoo.
Among men with two or more ear pierces, 26/71 = 37% have a tattoo.
The % with a tattoo increases as the number of ear pierces increases => relationship
Could examine column percents (see graph above) or overall percents too.

Page 21: 161.120 Introductory Statistics  Week 4 Lecture slides

Stacked Bar charts are often the best way to graphically compare groups

[Figure: stacked bar chart of Relative Frequency (within Tattoo groups), 0 to 100, by Ear pierces (None, One, Two or more). Segments: Tattoo, No tattoo.]

Page 22: 161.120 Introductory Statistics  Week 4 Lecture slides

Types of bivariate relationship

• Experimental data
– Categorical data sometimes collected separately from different groups
• Categorical measurement treated as response
• Grouping treated as explanatory variable

• Stimulus-response data
– Stimulus may affect the response
– Also can have two categorical measurements made from one individual
• One can affect the other but not the reverse

• Association
– Not all relationships are causal, so sometimes the variables cannot be classified into explanatory and response variables

Page 23: 161.120 Introductory Statistics  Week 4 Lecture slides

What type of bivariate relationship?

Page 24: 161.120 Introductory Statistics  Week 4 Lecture slides

• Joint proportions
– What proportion of the skiers were given the placebo and didn’t catch a cold?

• Marginal proportions
– What proportion of skiers didn’t catch a cold?

• Conditional proportions
– What proportion of skiers caught a cold given that they had the placebo?

                Cold   No Cold   Total
Ascorbic acid   17     122       139
Placebo         31     109       140
Total           48     231       279
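The three questions above can be answered directly from the table. The following is a minimal sketch using the counts shown; the variable names are assumptions made for the illustration.

```python
# Counts from the skiers table: treatment by outcome.
table = {
    "Ascorbic acid": {"Cold": 17, "No Cold": 122},
    "Placebo": {"Cold": 31, "No Cold": 109},
}

grand_total = sum(sum(row.values()) for row in table.values())  # 279 skiers

# Joint proportion: placebo AND no cold, out of all skiers.
joint = table["Placebo"]["No Cold"] / grand_total

# Marginal proportion: no cold (either treatment), out of all skiers.
no_cold_total = sum(row["No Cold"] for row in table.values())
marginal = no_cold_total / grand_total

# Conditional proportion: cold, out of the placebo group only.
placebo_total = sum(table["Placebo"].values())
conditional = table["Placebo"]["Cold"] / placebo_total

print(round(joint, 3))        # 0.391
print(round(marginal, 3))     # 0.828
print(round(conditional, 3))  # 0.221
```

The difference between the three lies in the denominator: joint and marginal proportions divide by the grand total, while a conditional proportion divides by the total of the group named after "given that".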