
Page 1: Linear Regression

1 Oct 2010, CPSY501
Dr. Sean Ho, Trinity Western University

Please download from “Example Datasets”:
Record2.sav, ExamAnxiety.sav

Page 2: Outline for today

Correlation and Partial Correlation
Linear Regression (Ordinary Least Squares)
Using Regression in Data Analysis
Requirements: Variables
Requirements: Sample Size
Assignments & Projects

Page 3: Causation from correlation?

Ancient proverb: “correlation ≠ causation”
But sometimes correlation can suggest causality, in the right situations!
We need more than 2 variables, and what we find is not the “ultimate” (only, direct) cause, but the degree of influence one variable has on another (“causal relationship”, “causality”).
Is a significant fraction of the variability in one variable explained by the other variable(s)?

Page 4: How to get causality?

Use theory and prior empirical evidence, e.g., temporal sequencing: T2 can't cause T1.
Order of causation? (e.g., gender & career aspirations)
Best is a controlled experiment:
Randomly assign participants to groups
Change the level of the IV for each group
Observe the impact on the DV
But this is not always feasible/ethical!
Still need to “rule out” other potential factors.

Page 5: Potential hidden factors

What other variables might explain these correlations?
Height and hair length: e.g., both are related to gender
Increased time in bed and suicidal thoughts: e.g., depression can cause both
# of pastors and # of prostitutes (!): e.g., city size
To find causality, we must rule out hidden factors, e.g., cigarettes and lung cancer.

Page 6: Example: immigration study

[Path diagram with two time points: at Immigration, the variables Level of Acculturation, Linguistic Ability, and Psychological Well-being; at 1-Year Follow-up, Linguistic Ability and Psychological Well-being.]

Page 7: Partial Correlation

Purpose: to measure the unique relationship between two variables, after the effects of other variables are “controlled for”.
Two variables may have a large correlation, but it may be largely due to the effects of other moderating variables.
So we must factor out their influence, and see what correlation remains (the partial correlation).
SPSS's algorithm assumes parametricity; non-parametric methods do exist, though.
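To make “factoring out” concrete, here is a minimal Python sketch on simulated data (the variable names x, y, z are invented for illustration). It computes a partial correlation by correlating the residuals that remain after regressing each variable on the control:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
z = rng.normal(size=200)           # moderating variable to control for
x = 2 * z + rng.normal(size=200)   # x is partly driven by z
y = 3 * z + rng.normal(size=200)   # y is partly driven by z

def residuals(a, b):
    """Residuals of a after removing the linear effect of b."""
    fit = stats.linregress(b, a)
    return a - (fit.intercept + fit.slope * b)

r_zero, _ = stats.pearsonr(x, y)                               # 0-order r
r_partial, p = stats.pearsonr(residuals(x, z), residuals(y, z))
print(f"zero-order r = {r_zero:.2f}, partial r = {r_partial:.2f}, p = {p:.3f}")
```

The zero-order correlation is large only because z drives both variables; once z is held constant, the partial correlation drops toward zero.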

Page 8: Visualizing Partial Correlation

[Diagram: Variable 1 (e.g., pastors) and Variable 2 (e.g., prostitutes) are linked by the partial correlation; a moderating variable (e.g., city size) influences both, and the direction of causation is marked with a “?”.]

Page 9: Practise: Partial Correlation

Dataset: Record2.sav

Analyze → Correlate → Partial
Variables: adverts, sales
Controlling for: airplay, attract

Or: Analyze → Regression → Linear:
Dependent: sales; Independent: all other variables
Statistics → “Part and partial correlations”; uncheck everything else
This gives all partial r values, plus the regular (0-order) r.
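If you want to cross-check the SPSS output, the third-party pingouin library offers a partial correlation function; a sketch, assuming Record2.sav is readable and uses the column names shown on this slide:

```python
import pandas as pd
import pingouin as pg  # third-party: pip install pingouin

df = pd.read_spss("Record2.sav")  # reading .sav requires pyreadstat

# Partial correlation of adverts and sales,
# controlling for airplay and attract (as on the slide).
result = pg.partial_corr(data=df, x="adverts", y="sales",
                         covar=["airplay", "attract"])
print(result)  # n, partial r, 95% CI, p-value
```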

Page 10: Linear Regression

When we use correlation to try to infer a direction of causation, we call it regression: combining the influence of a number of variables (predictors) to determine their total effect on another variable (outcome).

Page 11: Simple Regression: 1 predictor

Simple regression is predicting scores on an outcome variable from a single predictor variable.
It is mathematically equivalent to bivariate correlation.
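A quick numerical check of that equivalence, as a sketch on simulated data: in simple regression the standardized slope equals Pearson's r, so the model's R² is just r².

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

fit = stats.linregress(x, y)
r, _ = stats.pearsonr(x, y)

# Standardized slope: b1 * sd(x) / sd(y), which equals Pearson's r
beta = fit.slope * x.std(ddof=1) / y.std(ddof=1)
print(f"standardized slope = {beta:.4f}, Pearson r = {r:.4f}")
print(f"model R^2 = {fit.rvalue**2:.4f}, r^2 = {r**2:.4f}")
```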

Page 12: OLS Regression Model

In Ordinary Least-Squares (OLS) regression, the “best” model is defined as the line that minimizes the error between model and data (the sum of squared differences).

Conceptual description of the regression line (General Linear Model):

Y = b0 + b1X1 + (b2X2 + … + bnXn) + ε

where Y is the outcome, b0 the intercept, b1 the gradient (slope), X1 the predictor, and ε the error.
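To make the least-squares criterion concrete, a minimal sketch on simulated data: np.linalg.lstsq finds the b0 and b1 that minimize the sum of squared differences between model and data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1000, size=50)
y = 134.0 + 0.1 * x + rng.normal(scale=20.0, size=50)  # true line plus error

# Design matrix: a column of ones (for the intercept b0) next to the predictor.
X = np.column_stack([np.ones_like(x), x])

# OLS: choose b to minimize sum((y - X @ b)**2).
b, sse, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"intercept b0 = {b[0]:.1f}, slope b1 = {b[1]:.3f}")
```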

Page 13: Assessing Fit of a Model

R²: proportion of variance in the outcome accounted for by the predictors: SSmodel / SStotal.
This is a generalization of r² from correlation.
F ratio: variance attributable to the model, divided by variance unexplained by the model.
The F ratio converts to a p-value, which shows whether the “fit” is good.
If fit is poor, the predictors may not be linearly related to the outcome variable.
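A sketch of both fit statistics computed from first principles (y is the observed outcome, yhat the model's predictions, k the number of predictors; the names are illustrative):

```python
import numpy as np
from scipy import stats

def fit_statistics(y, yhat, k):
    """R^2 and F ratio for a regression model with k predictors."""
    n = len(y)
    ss_total = np.sum((y - np.mean(y)) ** 2)   # total variability
    ss_resid = np.sum((y - yhat) ** 2)         # unexplained by the model
    ss_model = ss_total - ss_resid             # explained by the model

    r2 = ss_model / ss_total
    # F: explained variance per model df over unexplained per residual df
    f = (ss_model / k) / (ss_resid / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)            # upper-tail p-value
    return r2, f, p
```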

Page 14: Example: Record Sales

Dataset: Record2.sav
Outcome variable: Record sales
Predictor: Advertising Budget

Analyze → Regression → Linear
Dependent: sales; Independent: adverts
Statistics → Estimates, Model fit, Part and partial correlations
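The same analysis outside SPSS, as a statsmodels sketch (the column names sales and adverts are taken from this slide):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("Record2.sav")  # reading .sav requires pyreadstat

# Simple OLS regression of record sales on advertising budget.
model = smf.ols("sales ~ adverts", data=df).fit()
print(model.summary())  # R-squared, F, coefficients, significance
```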

Page 15: Record2: Interpreting Results

Model Summary: R Square = .335; adverts explains 33.5% of the variability in sales.
ANOVA: F = 99.587 and Sig. = .000, so there is a significant effect.
Report: R² = .335, F(1, 198) = 99.587, p < .001.
Coefficients: (Constant) B = 134.140 (the Y-intercept).
Unstandardized adverts B = .096 (slope).
Standardized adverts Beta = .578 (uses variables converted to z-scores).
Linear model: Ŷ = 134.14 + .096 × adverts, or in z-scores: ẑsales = .578 × zadverts.


Page 17: General Linear Model

Actually, nearly all parametric methods can be expressed as generalizations of regression!

Multiple Regression: several scale predictors

Logistic Regression: categorical outcome

ANOVA: categorical IV
t-test: dichotomous IV
ANCOVA: mix of categorical + scale IVs
Factorial ANOVA: several categorical IVs

Log-linear: categorical IVs and DV

We'll talk about each of these in detail later

Page 18: Regression Modelling Process

1. Develop RQ: IVs/DVs, metrics; calculate required sample size & collect data
2. Clean: data entry errors, missing data, outliers
3. Explore: assess requirements, transform if needed
4. Build model: try several models, see what fits
5. Test model, “diagnostic” issues: multivariate outliers, overly influential cases
6. Test model, “generalizability” issues: regression assumptions; rebuild model as needed

Page 19: Selecting Variables

According to your model or theory, what variables might relate to your outcomes?
Does the literature suggest important variables?
Do the variables meet all the requirements for an OLS multiple regression? (more on this next week)

Record sales example:
DV: what is a possible outcome variable?
IV: what are possible predictors, and why?

Page 20: Choosing Good Predictors

It's tempting to just throw in hundreds of predictors and see which ones contribute most.
Don't do this! There are requirements on how the predictors interact with each other!
Also, more predictors → less power.
Have a theoretical/prior justification for them.

Example: what's a DV you are interested in?
Come up with as many possible good IVs as you can, and have a justification for each:
background, internal personal, and current external environment.

Page 21: Using Derived Variables

You may want to use derived variables in regression (see the sketch below), for example:

Transformed variables (to satisfy assumptions)
Interaction terms (“moderating” variables), e.g., Airplay × Advertising Budget
Dummy variables, e.g., coding for categorical predictors
Curvilinear variables (non-linear regression), e.g., looking at X² instead of X
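A minimal pandas sketch of constructing the first, second, and fourth kinds (dummy coding is sketched under Categorical Predictors below); the adverts and airplay column names follow the Record2 example:

```python
import numpy as np
import pandas as pd

df = pd.read_spss("Record2.sav")  # reading .sav requires pyreadstat

# Transformed variable, e.g., a log-transform to satisfy assumptions
df["log_adverts"] = np.log(df["adverts"] + 1)

# Interaction term: airplay moderating the effect of advertising budget
df["airplay_x_adverts"] = df["airplay"] * df["adverts"]

# Curvilinear term: X squared instead of X
df["adverts_sq"] = df["adverts"] ** 2
```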

Page 22: Required Sample Size

Depends on effect size and # of predictors.
Use G*Power to find the exact sample size; rough estimates are on pp. 172-174 of Field.

Consequences of insufficient sample size:
The regression model may be overly influenced by individual participants (not generalizable).
“Real” effects of moderate size can't be detected.

Solutions:
Collect more data from more participants!
Reduce the number of predictors in the model.

Page 23: Requirements on DV

Must be interval/continuous.
If not: the mathematics simply will not work.
Solutions:
Categorical DV: use Logistic Regression
Ordinal DV: use Ordinal Regression, or possibly convert into categorical form

Independence of scores (research design).
If not: invalid conclusions.
Solutions: redesign the data set, or use multi-level modelling instead of regression.

Page 24: Requirements on DV, cont.

Normal (use normality tests).
If not: significance tests may be misleading.
Solutions: check for outliers, transform data, use caution in interpreting significance.

Unbounded distribution (obtained range of responses versus possible range of responses).
If not: artificially deflated R².
Solutions:
Get more data in the missing range
Use a more sensitive instrument

[Chart: distribution of responses on a 0-10 scale with anchors Strongly Disagree, Disagree, None, Agree, Strongly Agree, illustrating a restricted response range.]

Page 25: Requirements on IV

Scale-level.
Can be categorical, too (see next page).
If ordinal: either threshold into categorical form, or treat as if scale (only if it has a sufficient number of ranks).

Full range of values.
If an IV only covers 1-3 on a scale of 1-10, then the regression model will predict poorly for values beyond 3.

More requirements on the behaviour of IVs with each other and the DV (covered next week).

Page 26: Categorical Predictors

Regression can work for categorical predictors:

If dichotomous: code as 0 and 1.
e.g., 1 dichotomous predictor, 1 scale DV: regression is equivalent to a t-test!
And the slope B1 is the difference of means.

If n categories: use “dummy” variables.
Choose a base category.
Create n-1 dichotomous variables.
e.g., {BC, AB, SK}: the dummies are isAB and isSK (BC is the base category).
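A pandas sketch of this dummy coding, using the slide's {BC, AB, SK} example (the data values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "province": ["BC", "AB", "SK", "BC", "SK", "AB"],
    "outcome":  [10.0, 12.5, 9.0, 11.0, 8.5, 13.0],
})

# Base category: BC.  n = 3 categories -> n - 1 = 2 dummy variables.
df["isAB"] = (df["province"] == "AB").astype(int)
df["isSK"] = (df["province"] == "SK").astype(int)

# A case with isAB = 0 and isSK = 0 belongs to the base category, BC.
print(df)
```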