stats chapter 4

Chapter 4 More about relationships between 2 variables

Upload: richard-ferreria

Post on 03-Dec-2014


TRANSCRIPT

Page 1: Stats chapter 4

Chapter 4

More about relationships between 2 variables

Page 2: Stats chapter 4

4.1 TRANSFORMING TO ACHIEVE LINEARITY

Page 3: Stats chapter 4

What if the scatterplot is not linear?

• Of course, not all data are linear!

• Our method will involve mathematically operating on one or both of the explanatory and response variables

• An inverse transformation will then be used to create a non-linear regression model

• This will be a little “mathy”

Page 4: Stats chapter 4

Transformations

• Before we begin transformations, remember that some well-known phenomena act in predictable ways

– E.g. when working with time and gravity, you should know that there is a square relationship between distance and time!

Page 5: Stats chapter 4

The Basics

• The data from measurements (raw data) must be operated on.

• Apply the same mathematical transformation to all of the raw data

– Ex. “Square every response”

• Use methods from the previous chapter to find the LSRL for the transformed data

• Analyze your regression to ensure the LSRL is appropriate

• Apply an inverse transformation on the LSRL to find the regression for the raw data.

Page 6: Stats chapter 4

Example

Please refer to p. 265, exercise 4.2

Length (cm)   Period (s)
16.5          0.777
17.5          0.839
19.5          0.912
22.5          0.878
28.5          1.004
31.5          1.087
34.5          1.129
37.5          1.111
43.5          1.290
46.5          1.371
106.5         2.115

Page 7: Stats chapter 4

Example

• Data entered into L1 and L2

• Scatterplot

• Looks pretty good, right?

Page 8: Stats chapter 4

Exercise

• LSRL: Y = 0.6 + 0.015X, r = 0.991

• Residual plot

• Perhaps we can do better!
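For anyone working without the calculator, the LSRL and r quoted above can be reproduced with a short sketch in plain Python. The helper name `lsrl` is made up for this illustration; the data are the length and period values from the exercise table.

```python
import math

# Pendulum data from the exercise: length (cm) and period (s).
length = [16.5, 17.5, 19.5, 22.5, 28.5, 31.5, 34.5, 37.5, 43.5, 46.5, 106.5]
period = [0.777, 0.839, 0.912, 0.878, 1.004, 1.087, 1.129, 1.111, 1.290, 1.371, 2.115]

def lsrl(x, y):
    """Return (intercept, slope, r) for the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy / math.sqrt(sxx * syy)

a, b, r = lsrl(length, period)

# Residual = observed - predicted; these are what the residual plot shows.
residuals = [yi - (a + b * xi) for xi, yi in zip(length, period)]

print(f"Y = {a:.3f} + {b:.4f}X, r = {r:.3f}")
```

The residuals list is in the same order as the data, so the last entry belongs to the 106.5 cm pendulum.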

Page 9: Stats chapter 4

Example

• L3 = L2^.5 (square root)

• LinReg L1, L3

• Note that the value of r² has increased

• Note that the value of the residual of the last point has decreased
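Whether a transformation helps can be checked numerically by comparing r² before and after. The sketch below is not the calculator procedure from the slide. It is an assumption-based comparison of two transformations that a square relationship suggests: squaring the period, or taking the square root of the length. The function name `r_squared` is made up for this example.

```python
import math

# Pendulum data from the exercise: length (cm) and period (s).
length = [16.5, 17.5, 19.5, 22.5, 28.5, 31.5, 34.5, 37.5, 43.5, 46.5, 106.5]
period = [0.777, 0.839, 0.912, 0.878, 1.004, 1.087, 1.129, 1.111, 1.290, 1.371, 2.115]

def r_squared(x, y):
    """Squared correlation of x and y, i.e. r^2 for the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Raw data, then two candidate transformations (both are assumptions of this sketch).
r2_raw = r_squared(length, period)
r2_period_squared = r_squared(length, [t ** 2 for t in period])
r2_root_length = r_squared([math.sqrt(l) for l in length], period)

print(f"raw: {r2_raw:.4f}")
print(f"period squared vs length: {r2_period_squared:.4f}")
print(f"period vs sqrt(length): {r2_root_length:.4f}")
```

A higher r² alone does not settle the choice; the residual plot for the transformed data still needs to be checked.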

Page 10: Stats chapter 4

Exponential Models

• Many natural phenomena are explained by an exponential model.

• Exponential models are marked by sharp growth or decay.

• Basic model: y = A·B^x

• For this transformation, you need to take the logarithm of the response data.

• You may use “log10” or “ln”, your choice.

– I prefer “ln” (of course)

Page 11: Stats chapter 4

Exponential Models

After the transformation, we have the following linear model: ln(y) = a + b·x

1. ln(y) = a + b·x

2. e^(ln(y)) = e^(a + b·x)   (exponentiate)

3. y = e^a · e^(b·x)   (properties of logarithms and exponents)

4. Let A = e^a and B = e^b   (redefine variables)

5. y = A·B^x   (this is our model)

Page 12: Stats chapter 4

Exponential Models

• Since this is an ‘applied math’ course, you need not remember how to derive the inverse transformation

• Whew

• BUT you do need to memorize:

when ln(y) = a + b·x, y = A·B^x

where A = e^a and B = e^b
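The rule to memorize can be sketched as a short Python function: regress ln(y) on x, then exponentiate the intercept and slope. The names `lsrl` and `fit_exponential` and the data are hypothetical, invented for this illustration; the data are generated exactly from y = 3·1.5^x so the recovered A and B can be checked.

```python
import math

def lsrl(x, y):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def fit_exponential(x, y):
    """Fit y = A * B**x: LSRL on (x, ln y), then A = e^a and B = e^b."""
    a, b = lsrl(x, [math.log(v) for v in y])
    return math.exp(a), math.exp(b)

# Hypothetical data lying exactly on y = 3 * 1.5**x (not from the slides).
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * 1.5 ** x for x in xs]

A, B = fit_exponential(xs, ys)
print(f"y = {A:.3f} * ({B:.3f})^x")
```

With real data the points will not lie exactly on the curve, so A and B are estimates and the residuals of the transformed fit still need a look.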

Page 13: Stats chapter 4

Exponential Models

Let’s try this data

Page 14: Stats chapter 4

Exponential Models

Take the ln of L2 (the response list) and store it in L3

Page 15: Stats chapter 4

Exponential Models

These are our “transformed responses”

Page 16: Stats chapter 4

Exponential Models

From our home screen, we perform an LSRL using the transformed data

Page 17: Stats chapter 4

Exponential Models

We don’t have to store this regression for transformed data

Page 18: Stats chapter 4

Exponential Models

Take note of the values of ‘a’ and ‘b’

Page 19: Stats chapter 4

Exponential Models

A quick look at the residuals

Page 20: Stats chapter 4

Exponential Models

The values of the residuals are small ... no defined pattern

Page 23: Stats chapter 4

Exponential Models

• Our regression model is exponential: y = A·B^x, where A = e^a and B = e^b

• y = e^0.701 · (e^0.184)^x

• Or y = 2.06 · (1.20)^x

Page 24: Stats chapter 4

Exponential Models

Put our regression in Y1

Page 25: Stats chapter 4

Exponential Models

Change Plot1 from a resid. to a scatter plot

Page 26: Stats chapter 4

Exponential Models

Looks pretty good, eh?

Page 27: Stats chapter 4

Power Models

• These models are used when the rate of increase is less severe than in an exponential model, or if you suspect a ‘root’ model

• For this model, you will find the logarithms of both the explanatory variable and the response variable

Page 28: Stats chapter 4

Power models

LSRL on transformed data yields: ln(y) = a + b·ln(x)

1. ln(y) = a + b·ln(x)

2. e^(ln(y)) = e^(a + b·ln(x))

3. y = e^a · e^(ln(x^b))

4. y = e^a · x^b

5. Let A = e^a

6. y = A · x^b
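The derivation above can be sketched in Python the same way as for the exponential model, except that both lists are logged and only the intercept is exponentiated. The names `lsrl` and `fit_power` are made up, and the data are hypothetical values lying exactly on y = 2·x^0.5.

```python
import math

def lsrl(x, y):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def fit_power(x, y):
    """Fit y = A * x**b: LSRL on (ln x, ln y), then A = e^a; b is used as is."""
    a, b = lsrl([math.log(v) for v in x], [math.log(v) for v in y])
    return math.exp(a), b

# Hypothetical data lying exactly on y = 2 * x**0.5 (not from the slides).
xs = [1, 4, 9, 16, 25]
ys = [2 * x ** 0.5 for x in xs]

A, b = fit_power(xs, ys)
print(f"y = {A:.3f} * x^{b:.3f}")
```

Because ln(x) and ln(y) are both taken, every x and y must be positive for this approach to work.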

Page 29: Stats chapter 4

Power models

Let’s use this data to find a power model

Page 30: Stats chapter 4

Power models

This time we need to transform both lists

Page 32: Stats chapter 4

Power models

Transformed explanatory = L3

Transformed response = L4

Page 33: Stats chapter 4

Power models

LSRL on transformed data; no need to store in Y1

Page 34: Stats chapter 4

Power models

Take note of the values of ‘a’ and ‘b’

Page 35: Stats chapter 4

Power models

A quick look at the residuals

Page 36: Stats chapter 4

Power models

Note that we use the transformed explanatory variable

Page 37: Stats chapter 4

Power models

No defined pattern

Page 38: Stats chapter 4

Power models

Residuals are all small in size

Page 41: Stats chapter 4

Power models

• When ln(y) = a + b·ln(x), y = A · x^b, where A = e^a

Our model is y = (e^1.31) · x^1.27

Or y = 3.71 · x^1.27
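The last step is plain arithmetic and can be checked in a few lines of Python. The values a = 1.31 and b = 1.27 are the ones quoted on the slide; the function name `predict` is made up for this sketch.

```python
import math

# Intercept and slope of the LSRL on the transformed (ln x, ln y) data, as quoted.
a, b = 1.31, 1.27

# Back-transform: A = e^a, the exponent b carries over unchanged.
A = math.exp(a)

def predict(x):
    """Predicted response on the raw scale from the power model y = A * x**b."""
    return A * x ** b

print(f"A = {A:.2f}")
```

Taking ln of `predict(x)` gives back a + b·ln(x), which is a quick way to confirm the back-transformation was applied in the right direction.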

Page 42: Stats chapter 4

Power models

Regression in Y1

Page 43: Stats chapter 4

Power models

Change from resid to scatter plot

Page 44: Stats chapter 4

Power models

(notice L1 and L2)

Page 45: Stats chapter 4

Power models

Looks pretty good!

Page 46: Stats chapter 4

Power models

• Much like the exponential model, you only need to know how the transformed model becomes the model for the raw data.

• When ln(y) = a + b·ln(x), y = A · x^b, where A = e^a

Page 47: Stats chapter 4

Transformation thoughts

• Although this is not a major topic for the course, you still need to be able to apply these two transformations (exp and power)

• Be sure to check the residuals for the LSRL on transformed data! You may have picked the wrong model :/

• If one model doesn’t work, try the other. I would start with the exponential model.

• Don’t transform into a cockroach. Ask Kafka!

Page 48: Stats chapter 4

Assn 4.1

• pg 276 #5, 8, 9, 11, 12

Page 49: Stats chapter 4

4.2 RELATIONSHIPS BETWEEN CATEGORICAL VARIABLES

Page 50: Stats chapter 4

Marginal Distributions

• Tables that relate two categorical variables are called “Two-Way Tables”

– Ex 4.11 pg 292

• Marginal Distribution

– Very fancy term for “row totals and column totals”

– Named because the totals appear in the margins of the table. Wow.

• Often, each row or column total as a percentage of the grand total is very informative

Page 51: Stats chapter 4

Marginal Distributions

Age Group      Female   Male   Total
15-17              89     61     150
18-24            5668   4697   10365
25-34            1904   1589    3494
35 or older      1660    970    2630
Totals           9321   7317   16639

Page 52: Stats chapter 4

Marginal Distributions

Column totals (bottom row): Female 9321, Male 7317

Row totals (right-hand column): 150, 10365, 3494, 2630

Grand total: 16639

Page 55: Stats chapter 4

Marginal Distributions “Age Group”

Page 56: Stats chapter 4

Marginal Distributions “Age Group”

Add a “Marg. Dist.” column to the table: each entry is row total / grand total

For the 15-17 row: 150/16639 = 0.009, or 0.9%

Page 59: Stats chapter 4

Marginal Distributions “Age Group”

Age Group      Female   Male   Total   Marg. Dist.
15-17              89     61     150     0.9%
18-24            5668   4697   10365    62.3%
25-34            1904   1589    3494    21.0%
35 or older      1660    970    2630    15.8%
Totals           9321   7317   16639     100%

Adds to 100%

Page 60: Stats chapter 4

Marginal Distributions “Gender”

Age Group      Female   Male   Total
15-17              89     61     150
18-24            5668   4697   10365
25-34            1904   1589    3494
35 & up          1660    970    2630
Totals           9321   7317   16639
Margin dist.      56%    44%    100%

Similarly for columns
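Both marginal distributions can be reproduced in a few lines of Python. The sketch uses the totals exactly as printed in the table (the printed 25-34 total and the grand total are each one more than the sum of the cells, so computing totals from the cells would give slightly different figures). The variable names are made up for this example.

```python
# Totals as printed in the margins of the two-way table.
row_totals = {"15-17": 150, "18-24": 10365, "25-34": 3494, "35 or older": 2630}
col_totals = {"Female": 9321, "Male": 7317}
grand_total = 16639

# Marginal distribution of age group: row total / grand total, as a percentage.
age_marginal = {group: round(100 * total / grand_total, 1)
                for group, total in row_totals.items()}

# Marginal distribution of gender: column total / grand total, as a percentage.
gender_marginal = {sex: round(100 * total / grand_total)
                   for sex, total in col_totals.items()}

for group, pct in age_marginal.items():
    print(f"{group}: {pct}%")
for sex, pct in gender_marginal.items():
    print(f"{sex}: {pct}%")
```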

Page 61: Stats chapter 4

Describing Relationships

• Some relationships are easier to see when we look at the proportions within each group

• These distributions are called “Conditional Distributions”

• To find a conditional distribution, express each cell as a percentage of its row or column total.

• Let’s look at the same table, and find the conditional distribution of gender, given each age group

Page 62: Stats chapter 4

Conditional Distributions

Age Group      Female          Male            Total
15-17          89              61 (40.7%)      150 (100%)
18-24          5668 (54.7%)    4697 (45.3%)    10365 (100%)
25-34          1904 (54.5%)    1589 (45.5%)    3494 (100%)
35 or older    1660 (63.1%)    970 (36.9%)     2630 (100%)
Totals         9321 (56%)      7317 (44%)      16639 (100%)

Page 63: Stats chapter 4

Conditional Distributions

We will look at the conditional distribution for the 15-17 row

The Female cell is 89/150 (cell total / row total) = 59.3%

Page 65: Stats chapter 4

Conditional Distributions

Age Group      Female          Male            Total
15-17          89 (59.3%)      61 (40.7%)      150 (100%)
18-24          5668 (54.7%)    4697 (45.3%)    10365 (100%)
25-34          1904 (54.5%)    1589 (45.5%)    3494 (100%)
35 or older    1660 (63.1%)    970 (36.9%)     2630 (100%)
Totals         9321 (56%)      7317 (44%)      16639 (100%)

In the 15-17 row, the Female cell is 89/150 = 59.3% and the Male cell is 61/150 = 40.7% (cell total / row total)

Page 69: Stats chapter 4

Conditional Distributions

The table now has complete conditional distributions for each row (each row adds to 100%)
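The row percentages can be reproduced with a short Python sketch: divide each cell by its row total. The cell counts and row totals are the ones printed in the table; the variable names are made up for this example.

```python
# Cell counts (Female, Male) and the printed row total for each age group.
table = {
    "15-17": (89, 61, 150),
    "18-24": (5668, 4697, 10365),
    "25-34": (1904, 1589, 3494),
    "35 or older": (1660, 970, 2630),
}

# Conditional distribution of gender given age group: cell / row total.
gender_given_age = {
    group: (round(100 * female / total, 1), round(100 * male / total, 1))
    for group, (female, male, total) in table.items()
}

for group, (pct_f, pct_m) in gender_given_age.items():
    print(f"{group}: Female {pct_f}%, Male {pct_m}%")
```

Each pair can then be set against the marginal distribution of gender (56% female, 44% male) to see which age groups depart from the overall split.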

Page 70: Stats chapter 4

Conditional Distributions

For an analysis of the effect of age groups, compare a row’s conditional distribution with the marginal distribution for the columns.

They should be close, unless there is an effect caused by the age group (?)

Some rows are not close to the marginal distribution! For example, “35 or older” is 63.1% female and 36.9% male, against 56% and 44% overall.


Page 75: Stats chapter 4

Conditional Distributions

• Based on the previous table, the distributions of “gender given age group” are not that different from one another.

• We can see that the “35 or older” group seems to differ slightly from the overall trend.

Page 76: Stats chapter 4

Conditional Distributions “age group given gender”

Age Group      Female          Male            Total
15-17          89 (1%)         61 (0.8%)       150 (0.9%)
18-24          5668 (60.8%)    4697 (64.2%)    10365 (62.3%)
25-34          1904 (20.4%)    1589 (21.7%)    3494 (21.0%)
35 or older    1660 (17.8%)    970 (13.3%)     2630 (15.8%)
Totals         9321 (100%)     7317 (100%)     16639 (100%)

Page 77: Stats chapter 4

Conditional Distributions “age group given gender”

This is the same chart, now with the conditional distributions by gender (each column adds to 100%)

Is there a gender effect noticeable from this table?

Page 80: Stats chapter 4

Conditional Distribution

Conclusions from the previous chart

• Females are more likely to be in the “35 and older” group and less likely to be in the “18 to 24” group

• Males are more likely to be in the “18 to 24” group and less likely to be in the “35 and older” group

• These differences appear slight. Are they actually “significant” with respect to the overall distribution?
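The column percentages behind these conclusions can be reproduced with a short Python sketch: divide each cell by its column total and compare the two columns. The counts are the ones in the table; the variable names and the `gap` measure (female percentage minus male percentage) are made up for this example.

```python
# Cell counts by age group for each gender, from the two-way table.
female = {"15-17": 89, "18-24": 5668, "25-34": 1904, "35 or older": 1660}
male = {"15-17": 61, "18-24": 4697, "25-34": 1589, "35 or older": 970}

# Column totals (these match the totals printed in the table).
female_total = sum(female.values())
male_total = sum(male.values())

# Conditional distribution of age group given gender: cell / column total.
age_given_female = {g: 100 * n / female_total for g, n in female.items()}
age_given_male = {g: 100 * n / male_total for g, n in male.items()}

# Gap in percentage points: positive means the group is more common among females.
gap = {g: age_given_female[g] - age_given_male[g] for g in female}

for g in female:
    print(f"{g}: Female {age_given_female[g]:.1f}%, "
          f"Male {age_given_male[g]:.1f}%, gap {gap[g]:+.1f}")
```

This only describes the size of the differences; whether they are “significant” is a separate question that the sketch does not address.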

Page 81: Stats chapter 4

Conditional Distribution

• No single graph portrays the form of the relationship between categorical variables.

• No single numerical measure (such as correlation) summarizes the strength of the association.

Page 82: Stats chapter 4

Simpson’s Paradox

• Associations that hold true for all of several groups can reverse direction when the data are combined to form a single group.

• Ex 4.15 pg 299

• This phenomenon is often the result of an “unaccounted-for” variable.
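A small numeric sketch makes the reversal concrete. The counts below are invented for illustration and are not the textbook example. Option A has the higher success rate inside each group, yet the lower rate once the groups are pooled, because A was mostly used in the group where success is rare.

```python
# Hypothetical (successes, trials) for options A and B in two groups.
groups = {
    "group 1": {"A": (9, 10), "B": (80, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a fraction."""
    return successes / trials

# Rates within each group: A beats B in both.
within = {name: {opt: rate(*counts) for opt, counts in opts.items()}
          for name, opts in groups.items()}

# Pool the groups by adding successes and trials separately for each option.
combined = {}
for opt in ("A", "B"):
    successes = sum(groups[name][opt][0] for name in groups)
    trials = sum(groups[name][opt][1] for name in groups)
    combined[opt] = rate(successes, trials)

for name, rates in within.items():
    print(f"{name}: A {rates['A']:.0%}, B {rates['B']:.0%}")
print(f"combined: A {combined['A']:.1%}, B {combined['B']:.1%}")
```

Here the group itself plays the role of the unaccounted-for variable: ignoring it flips the comparison.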

Page 83: Stats chapter 4

Assignment 4.2

• Pg 298 #23-25, 29, 31-35

Page 84: Stats chapter 4

4.3 ESTABLISHING CAUSATION

Page 85: Stats chapter 4

Different Relationships

• Suppose two variables (X and Y) have some correlation

– e.g. when X increases in value, Y increases as well

– One of the following relationships may hold.

Page 86: Stats chapter 4

Different Relationships

Causation

• In this relationship, the explanatory variable is somehow affecting the response variable.

• In most instances, we are looking to find evidence of a causal relationship

Page 87: Stats chapter 4

Different Relationships

Causation

Page 88: Stats chapter 4

Different Relationships

Common Response

• In this relationship, both X and Y are correlated with a third (unknown) variable (Z).

• Ex. when Z increases, X increases and Y increases.

• Unless we know about Z, it appears as though X and Y have a causal relationship.
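Common response can be illustrated with made-up numbers. In the sketch below, X and Y are each built from Z plus a small fixed offset, and neither is computed from the other, yet their correlation comes out very high. All values and the function name `correlation` are hypothetical, chosen only for illustration.

```python
import math

def correlation(x, y):
    """Pearson correlation r between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical hidden variable Z and fixed small offsets standing in for noise.
z = list(range(1, 11))
x_offsets = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.5, -0.1]
y_offsets = [-0.6, 0.4, 0.5, -0.2, -0.3, 0.6, 0.1, -0.5, 0.2, -0.4]

# X and Y both respond to Z; Y is never computed from X, nor X from Y.
x = [2 * zi + e for zi, e in zip(z, x_offsets)]
y = [3 * zi + e for zi, e in zip(z, y_offsets)]

r_xy = correlation(x, y)
print(f"r between X and Y = {r_xy:.3f}")
```

Someone who saw only X and Y would find a strong association and might wrongly read it as X causing Y.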

Page 89: Stats chapter 4

Different Relationships

Common Response

Page 90: Stats chapter 4

Different Relationships

Confounding

• X and Y are correlated

• An (often unknown) third variable ‘Z’ is also correlated with Y

• Is X the explanatory variable, or is Z the explanatory variable, or are they both explanatory variables?

Page 91: Stats chapter 4

Different Relationships

Confounding

Page 92: Stats chapter 4

Causation

• The best way to establish causation is with a carefully designed experiment

– Possible ‘lurking variables’ are controlled

• Experiments cannot always be conducted

– Many times, they are costly or even unethical

• Some guidelines need to be established in cases where an observational study is the only method to measure variables.

Page 93: Stats chapter 4

Causation- some criteria

• Association is strong

• Association is consistent (among different studies)

• Larger values of the explanatory variable are associated with stronger responses

• The alleged cause precedes the effect in time

• The alleged cause is plausible

Page 94: Stats chapter 4

Assignment 4.3

• Pg 312 #41, 45, 50, 51

Page 95: Stats chapter 4

Chapter 4 Review

• #37, 53, 54, 57