linear regression with r 1
TRANSCRIPT
Linear Regression with R
2012-12-07 @HSPH
Kazuki Yoshida, M.D. MPH-CLE student
FREEDOM TO KNOW
1: Prepare data/specify model/read results
Group Website is at:
http://rpubs.com/kaz_yos/useR_at_HSPH
Previously in this group
- Introduction
- Reading Data into R (1)
- Reading Data into R (2)
- Descriptive, continuous
- Descriptive, categorical
- Deducer
- Graphics
- Groupwise, continuous
Menu
- Linear regression
Ingredients
Statistics
- Data preparation
- Model formula
Programming
- within()
- factor(), relevel()
- lm()
- formula = Y ~ X1 + X2
- summary()
- anova(), car::Anova()
Open R Studio
Create a new script and save it.
We will use the lowbwt dataset used in BIO213
http://www.umass.edu/statdata/statdata/data/
lowbwt.dat
http://www.umass.edu/statdata/statdata/data/lowbwt.txt
http://www.umass.edu/statdata/statdata/data/lowbwt.dat
Load dataset from web
lbw <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat", header = TRUE, skip = 4)
header = TRUE: pick up variable names from the first row that is read
skip = 4: skip the first 4 rows
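To try header = TRUE and skip = 4 without a network connection, here is a minimal sketch. The text below is a made-up miniature file (4 description lines, a header row, 3 data rows); the values are illustrative only and are not taken from lowbwt.dat.

```r
## Hypothetical miniature file: 4 description lines, then header, then data
raw_text <- "Illustrative excerpt, not the real file
Description line 2
Description line 3
Description line 4
ID LOW AGE BWT
1 0 19 2523
2 0 33 2551
3 1 20 2557"

## text = reads from a character string instead of a file or URL
toy <- read.table(text = raw_text, header = TRUE, skip = 4)

names(toy)   # variable names come from the 5th line
nrow(toy)    # only the data rows remain
```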
“Fix” dataset
lbw[c(10,39), "BWT"] <- c(2655, 3035)
Replace data points to make the dataset identical to the BIO213 dataset
c(10,39): 10th and 39th rows
"BWT": BWT column
Lower case variable names
names(lbw) <- tolower(names(lbw))
tolower(names(lbw)): convert variable names to lower case
names(lbw) <- : put them back into variable names
See overview
library(gpairs)
gpairs(lbw)
Recoding: Changing and creating variables
dataset <- within(dataset, {
    _variable manipulations_
})
dataset <- : name of newly created dataset (here replacing the original)
within(dataset, ...): take dataset
{ ... }: perform variable manipulation. You can specify by variable name only. No need for dataset$var_name
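A small sketch of the within() pattern on a toy data frame. The data frame toy and its columns are made up for illustration; they are not part of the lbw dataset.

```r
## Toy data frame with made-up values
toy <- data.frame(height_cm = c(150, 160, 170), weight_kg = c(50, 60, 70))

toy <- within(toy, {
    ## Columns are referred to by name only, no toy$ prefix
    height_m <- height_cm / 100
    bmi      <- weight_kg / height_m^2
})

## Same result written in the dataset$var_name style
toy2 <- data.frame(height_cm = c(150, 160, 170), weight_kg = c(50, 60, 70))
toy2$height_m <- toy2$height_cm / 100
toy2$bmi      <- toy2$weight_kg / toy2$height_m^2

all.equal(toy$bmi, toy2$bmi)
```

Variables created earlier inside the braces (height_m) can be used by later lines in the same block.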
lbw <- within(lbw, {
    ## Relabel race
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

    ## Categorize ftv (frequency of visit)
    ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
    ftv.cat <- relevel(ftv.cat, ref = "Normal")

    ## Dichotomize ptl
    preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))
})
Numeric to categorical: element by element
race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))
Categorize race and label: 1 to White, 2 to Black, 3 to Other
The 1st level will be the reference
Explained more in depth
lbw <- within(lbw, {
    ## Relabel race
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))
})
race.cat <- : create new variable named race.cat
factor(): create categorical variable
race: take race variable
levels = 1:3: order levels 1, 2, 3; make 1 the reference level
labels = c("White","Black","Other"): label levels 1, 2, 3 as White, Black, Other
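The mapping done by factor() can be checked on a short vector. The codes below are made up; only the levels and labels follow the slide.

```r
race <- c(1, 3, 2, 1, 2)   # made-up numeric codes

race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))

as.character(race.cat)   # labels replace the numeric codes element by element
levels(race.cat)         # the first level is what lm() treats as the reference
table(race, race.cat)    # cross-check old codes against new labels
```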
How breaks work
Numeric to categorical: range to element
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
(-Inf, 0] None: ftv = 0
(0, 2] Normal: ftv = 1, 2
(2, Inf] Many: ftv = 3, 4, 5, 6
The 1st level (None) will be the reference
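The intervals can be checked by running cut() on the values 0 to 6 (the range shown on the slide). By default each interval is closed on the right, so 0 falls in None and 2 falls in Normal.

```r
ftv <- 0:6

ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf),
               labels = c("None", "Normal", "Many"))

## Side-by-side view of the original value and its category
data.frame(ftv, ftv.cat)
## 0 -> None; 1, 2 -> Normal; 3, 4, 5, 6 -> Many
```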
Reset reference level
ftv.cat <- relevel(ftv.cat, ref = "Normal")
Change the reference level of the ftv.cat variable from None to Normal
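A quick sketch of what relevel() changes, on a made-up factor: only the order of the levels moves, the data values stay the same.

```r
ftv.cat <- factor(c("None", "Normal", "Many", "Normal"),
                  levels = c("None", "Normal", "Many"))
levels(ftv.cat)     # "None" is first, so it is the reference

ftv.cat2 <- relevel(ftv.cat, ref = "Normal")
levels(ftv.cat2)    # "Normal" moved to the front

## The observations themselves are unchanged
identical(as.character(ftv.cat), as.character(ftv.cat2))
```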
Numeric to Boolean to Category
preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))
ptl >= 1: TRUE, FALSE vector created here
ptl < 1 to FALSE, then to “0”
ptl >= 1 to TRUE, then to “1+”
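The two steps (numeric to Boolean, Boolean to category) can be separated to see each one. The ptl values are made up; is.preterm is a helper name used only here.

```r
ptl <- c(0, 1, 3, 0, 2)    # made-up counts of previous preterm labor

is.preterm <- ptl >= 1     # step 1: logical vector
is.preterm

preterm <- factor(is.preterm, levels = c(FALSE, TRUE), labels = c("0", "1+"))
preterm                    # step 2: FALSE becomes "0", TRUE becomes "1+"
```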
Binary 0,1 to No,Yes
## One-by-one method
lbw <- within(lbw, {
    ## Categorize smoke ht ui
    smoke <- factor(smoke, levels = 0:1, labels = c("No","Yes"))
    ht    <- factor(ht,    levels = 0:1, labels = c("No","Yes"))
    ui    <- factor(ui,    levels = 0:1, labels = c("No","Yes"))
})
## Loop method: alternative to above (run one or the other, not both)
lbw[,c("smoke","ht","ui")] <- lapply(lbw[,c("smoke","ht","ui")], function(var) {
    factor(var, levels = 0:1, labels = c("No","Yes"))
})
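The loop method can be tried on a toy data frame (made-up values). The last lines also sketch why only one of the two methods should be run: a variable that is already a No/Yes factor no longer contains 0 or 1, so converting it a second time gives NA.

```r
toy  <- data.frame(smoke = c(0, 1, 1), ht = c(0, 0, 1), ui = c(1, 0, 0))
vars <- c("smoke", "ht", "ui")

## Apply the same conversion to each selected column
toy[, vars] <- lapply(toy[, vars], function(var) {
    factor(var, levels = 0:1, labels = c("No", "Yes"))
})
sapply(toy, class)
as.character(toy$smoke)

## Pitfall: converting an already converted column
twice <- factor(toy$smoke, levels = 0:1, labels = c("No", "Yes"))
all(is.na(twice))
```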
Model formula
outcome ~ predictor1 + predictor2 + predictor3
SAS equivalent: model outcome = predictor1 predictor2 predictor3;
In the case of t-test
age ~ zyg
age: continuous variable to be compared (variable to be explained)
zyg: grouping variable to separate groups (variable used to explain)
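The age and zyg variables belong to a dataset not shown here, so this sketch uses R's built-in sleep dataset instead: extra is the continuous variable, group is the two-level grouping variable.

```r
## outcome ~ grouping variable, same formula shape as age ~ zyg
res <- t.test(extra ~ group, data = sleep)

res$estimate   # mean of extra in each group
res$p.value
```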
Y ~ X1 + X2
linear sum
.        All variables except for the outcome
+ X2     Add X2 term
- 1      Remove intercept
X1:X2    Interaction term between X1 and X2
X1*X2    Main effects and interaction term
Interaction term
Y ~ X1 + X2 + X1:X2
X1 + X2: main effects
X1:X2: interaction
Y ~ X1 * X2
X1 * X2: main effects & interaction
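One way to see how R expands a formula is to look at its terms. The helper term_labels() is defined here for illustration only; Y, X1, X2 are placeholder names and need not exist as variables. The "." example uses the built-in mtcars dataset.

```r
## Helper (illustrative): list the terms a formula expands to
term_labels <- function(f, ...) attr(terms(f, ...), "term.labels")

term_labels(Y ~ X1 + X2 + X1:X2)   # "X1" "X2" "X1:X2"
term_labels(Y ~ X1 * X2)           # same three terms

## - 1 removes the intercept
attr(terms(Y ~ X1 + X2), "intercept")       # 1
attr(terms(Y ~ X1 + X2 - 1), "intercept")   # 0

## . needs a dataset to know what "all other variables" means
term_labels(mpg ~ ., data = mtcars)
```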
On-the-fly variable manipulation
Y ~ X1 + I(X2 * X3)
I(): inhibit formula interpretation, for math manipulation
New variable (X2 times X3) created on-the-fly and used
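The difference I() makes can be seen by comparing the terms with and without it. The fit at the end uses the built-in mtcars dataset, with hp and qsec chosen arbitrarily for illustration.

```r
## Without I(): * is a formula operator (main effects + interaction)
attr(terms(Y ~ X1 + X2 * X3), "term.labels")      # "X1" "X2" "X3" "X2:X3"

## With I(): * is ordinary multiplication, giving one new variable
attr(terms(Y ~ X1 + I(X2 * X3)), "term.labels")   # "X1" "I(X2 * X3)"

## In a fitted model the product gets a single coefficient
fit <- lm(mpg ~ wt + I(hp * qsec), data = mtcars)
names(coef(fit))
```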
Fit a model
lm.full <- lm(bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw)
See model object
lm.full
Call: command repeated
Coefficients: coefficient for each variable
See summary
summary(lm.full)
Call: command repeated
Residuals: residual distribution
Coefficients: dummy variables created; Coef/SE = t
R^2 and adjusted R^2
Model F-test
ftv.catNone: people with no 1st trimester visits compared to people with a normal number of 1st trimester visits (reference level)
ftv.catMany: people with many 1st trimester visits compared to people with a normal number of 1st trimester visits (reference level)
race.catBlack: Black people compared to White people (reference level)
race.catOther: Other people compared to White people (reference level)
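The pieces of the summary can also be pulled out as numbers. This sketch uses the built-in mtcars dataset rather than lbw; cyl.cat is a factor created here for illustration so that dummy variables appear.

```r
mt <- within(mtcars, {
    cyl.cat <- factor(cyl, levels = c(4, 6, 8), labels = c("Four", "Six", "Eight"))
})
fit <- lm(mpg ~ wt + cyl.cat, data = mt)

coef.table <- coef(summary(fit))
rownames(coef.table)   # dummy variables: cyl.catSix, cyl.catEight vs. Four (reference)

## Coef/SE = t
t.manual <- coef.table[, "Estimate"] / coef.table[, "Std. Error"]
all.equal(t.manual, coef.table[, "t value"])

## R^2 and adjusted R^2
summary(fit)$r.squared
summary(fit)$adj.r.squared
```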
Confidence intervals
confint(lm.full)
1st column: lower boundary
2nd column: upper boundary
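For a linear model, the default 95% intervals from confint() are the estimate plus or minus a t quantile times the standard error. A sketch on the built-in mtcars dataset (not lbw) that rebuilds them by hand:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

est    <- coef(fit)
se     <- sqrt(diag(vcov(fit)))
t.crit <- qt(0.975, df = df.residual(fit))   # two-sided 95%

ci.manual <- cbind(lower = est - t.crit * se,
                   upper = est + t.crit * se)

## Compare values only (column names differ)
all.equal(ci.manual, confint(fit), check.attributes = FALSE)
```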
ANOVA table (type I)
anova(lm.full)
Df: degree of freedom
Sum Sq: sequential SS
Mean Sq: mean SS = SS/DF
F value: F = Mean SS / Mean SS of residual
Type I = Sequential SS
1 age: the 1st gets all
2 lwt: the 2nd gets all but the overlap with the 1st
3 smoke: the last gets the remaining only
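Because type I SS are sequential, the order of the terms matters. A sketch on the built-in mtcars dataset (not lbw): the same two predictors entered in opposite orders get different SS, while the total explained SS stays the same.

```r
fit.a <- lm(mpg ~ wt + hp, data = mtcars)   # wt first
fit.b <- lm(mpg ~ hp + wt, data = mtcars)   # hp first

ss.a <- anova(fit.a)[c("wt", "hp"), "Sum Sq"]
ss.b <- anova(fit.b)[c("wt", "hp"), "Sum Sq"]

rbind(wt.first = ss.a, hp.first = ss.b)     # columns: wt, hp

## The split differs, the total does not
all.equal(sum(ss.a), sum(ss.b))
```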
ANOVA table (type III)
library(car)
Anova(lm.full, type = 3)
Df: degree of freedom
Sum Sq: marginal SS
F value: F = Mean SS / Mean SS of residual
Multi-category variables tested as one
Type III = Marginal SS
1 age: the 1st gets its margin only in type III
2 lwt: the 2nd gets its margin only in type III
3 smoke: the last gets its margin only in type III
Comparison: Type I vs. Type III
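The comparison can be sketched in base R. Note the substitution: drop1() is used here as a stand-in for car::Anova(type = 3) so that no extra package is needed; for a model without interaction terms, dropping each term in turn gives the marginal SS. The data are the built-in mtcars, not lbw.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

type1    <- anova(fit)              # sequential SS
marginal <- drop1(fit, test = "F")  # SS lost when each term is dropped

## Last term: sequential SS equals marginal SS
all.equal(type1["hp", "Sum Sq"], marginal["hp", "Sum of Sq"])

## First term: sequential SS also includes the overlap with hp
c(sequential = type1["wt", "Sum Sq"], marginal = marginal["wt", "Sum of Sq"])
```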
Effect plot
library(effects)
plot(allEffects(lm.full), ylim = c(2000,4000))
ylim: fix Y-axis values for all plots
Effect of a variable with other covariates set at average
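The idea of "other covariates set at average" can be sketched with predict() in base R. This is an illustration of the idea on the built-in mtcars dataset, not a reproduction of what the effects package does internally (its handling of factors, for example, is not covered here).

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

## Vary wt over its range, hold hp at its mean
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 5),
                   hp = mean(mtcars$hp))
grid$fit <- predict(fit, newdata = grid)
grid

## With no interaction, the slope along wt is just the wt coefficient
slopes <- diff(grid$fit) / diff(grid$wt)
all.equal(unname(slopes), rep(unname(coef(fit)["wt"]), 4))
```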
Interaction
lm.full.int <- lm(bwt ~ age*lwt + smoke + ht + ui + age*ftv.cat + race.cat*preterm, data = lbw)
age*lwt: Continuous * Continuous
age*ftv.cat: Continuous * Categorical
race.cat*preterm: Categorical * Categorical
This model is for demonstration purposes.
Anova(lm.full.int, type = 3)
Df: degree of freedom
Sum Sq: marginal SS
F value: F = Mean SS / Mean SS of residual
Interaction terms
Continuous * Continuous
plot(effect("age:lwt", lm.full.int))
Panels: lwt level
Continuous * Categorical
plot(effect("age:ftv.cat", lm.full.int), multiline = TRUE)
Categorical * Categorical
plot(effect(c("race.cat*preterm"), lm.full.int),
     x.var = "preterm", z.var = "race.cat", multiline = TRUE)