
Tuan V. Nguyen, Genetics Epidemiology of Osteoporosis Lab

Garvan Institute of Medical Research

Garvan Institute Biostatistical Workshop, 17 April 2014 © Tuan V. Nguyen

Introduction to R

•  A brief history

•  Installation

•  Packages

•  Essential grammar

•  A session with R

Previously …

•  Many statistical packages were/are available

•  Popular packages include

Systat, Minitab, Statistica, BMDP, S+, Gauss, Spida

JMP, SPSS, Stata, SAS

and now R

R is gaining popularity

Number of scholarly articles that reference each software by year (Source: Muenchen R. The popularity of data analysis software, r4stat.com/articles/popularity)

R is gaining popularity

Number of scholarly articles that reference each software by year, after removing the top two, SPSS and SAS (Source: Muenchen R. The popularity of data analysis software, r4stat.com/articles/popularity)

A brief history

•  R is a “statistical and graphical programming language”

•  Originated from S

–  1988 - S2: RA Becker, JM Chambers, A Wilks

–  1992 - S3: JM Chambers, TJ Hastie

–  1998 - S4: JM Chambers

•  R was initially written by Ross Ihaka and Robert Gentleman (Univ of Auckland, New Zealand) in the 1990s

•  From 1997: international “R-core”, 15 people

What can R do?

•  It is a statistical language

•  All models of statistical analysis

•  Great for simulation work

•  Programming (do you want to take a challenge?)

Why R ?

•  Open source – totally free!

•  Developed by professional and academic statisticians

•  Runs on Windows, Unix, MacOS

•  Keeps up to date with methodological developments

•  Speaks the language of experts (bioinformatics and statistics)

•  Large user community

Installation

cran.r-project.org

Installation of R on Windows

•  Select Windows

•  Select “base”

•  Run → OK → Next

•  Then Finish –  R icon on your desktop

A screenshot of R

RStudio

An “add-on” of R

RStudio http://rstudio.org

Introduction to RStudio

•  An IDE (Integrated Development Environment) for R.

•  Provides some convenient functions for running R

•  R also has a number of other IDEs:

–  Tinn-R

–  R Commander

R and RStudio

You can run R within RStudio (you don’t need to start R separately)

RStudio panes: R console, Workspace (variables), Files, Packages

R is a real demonstration of the power of collaboration

Ihaka

Packages

•  R = Base + Packages

•  Base R includes basic functions for simple tasks and analyses

•  Packages are modules for specific analyses

•  More than 6000 packages in R!

Common packages

Hmisc: Miscellaneous functions for data manipulation

tables: For tabulation of data

foreign: For reading data from other software

gmodels: Programming tools

ggplot2: Advanced graphics

sciplot: Scientific graphs

Zelig: “Everyone’s statistical software”

rms: Regression modeling strategies

car: Companion to regression analysis

survival: Survival analyses

epiR: Epidemiological analyses

epicalc: Epidemiological analyses

boot: Bootstrap analyses

cluster: Cluster analysis

psych: Psychometrics and descriptive statistics

Basic management of packages

•  Installing new packages (try now!)

install.packages(c("Hmisc", "rms", "tables", "foreign", "gmodels", "ggplot2", "sciplot", "Zelig", "car", "survival", "epiR", "epicalc", "boot", "cluster", "psych", "binom", "BMA", "ExactCIdiff", "lattice", "mgcv", "gam", "nlme", "quantreg"))

•  To find out which packages you have installed

library()
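Before installing, it can help to check which packages are already available. The sketch below is one possible way to do that; the second package name is a made-up placeholder used only to show what a missing package looks like, and the install call is left commented out because it needs an internet connection.

```r
# Names to check: "stats" ships with R; the second name is hypothetical.
wanted <- c("stats", "aPackageThatDoesNotExist")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
installed <- sapply(wanted, requireNamespace, quietly = TRUE)
installed

# Packages that still need to be installed.
missing <- wanted[!installed]
missing

# install.packages(missing)   # uncomment to install what is missing
```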

R Grammar: a quick introduction

Interacting with R

•  Start up R

•  Can use up/down arrow keys to retrieve command history

•  Can use left/right keys to edit a command line

•  Can use TAB to complete a command – very useful!

•  Multiple commands can be written on one line by using the “;” separator

Variable names

•  Use letters, numbers, and signs (. and _); a hyphen (-) is not allowed because R reads it as a minus sign

•  Assignment symbol: <- or =

•  Distinction between upper and lower case letters

Genotype = 5; genotype <- 7

Geno.type = Genotype + genotype

Object-oriented language

R is an object-oriented language

•  Function

•  Vector

•  Matrix

•  Dataframe

Function

•  R “commands” = functions

•  A function has arguments

•  Arguments include variables (name), parameters, options, etc

•  Example: fitting a linear regression model y = a + bx

m1 = lm(y ~ x, data=test)

Object name: m1

Function: lm = linear model

Arguments: variables (y, x) and dataset name (test)
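To try the lm() call, a dataset named test is needed. The sketch below simulates one; the true intercept (2), slope (0.5) and noise level are arbitrary assumptions chosen for illustration, not values from the workshop data.

```r
set.seed(1)                        # make the simulation reproducible

# Simulated dataset: y = 2 + 0.5*x + random noise (assumed values)
test <- data.frame(x = 1:20)
test$y <- 2 + 0.5 * test$x + rnorm(20, sd = 0.3)

m1 <- lm(y ~ x, data = test)       # object name m1, function lm, arguments y, x, test

coef(m1)                           # estimated intercept (a) and slope (b)
summary(m1)$r.squared              # proportion of variance explained
```

The estimates should land close to the values used in the simulation, which is a quick way to confirm that the model was specified as intended.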

Vector

•  Vectors are the basic building blocks in R

•  Vector = a series of values

•  Values can be numeric or character

score = c(4,2,1,5)
gender = c('F','M','F','M')

c (concatenation) for direct data entry
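Once a vector exists, most operations apply to all of its values at once. A short sketch using the two vectors above:

```r
score  <- c(4, 2, 1, 5)
gender <- c("F", "M", "F", "M")

length(score)            # 4 values
mean(score)              # 3
score * 2                # 8 4 2 10 (applied to every element)
score[2:3]               # 2 1 (elements 2 to 3)
score[gender == "F"]     # 4 1 (scores where gender is "F")
```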

Matrix

•  Rectangular data → rows, columns

•  A matrix can be a collection of vectors

1 3 6 7
3 4 7 9
5 7 8 0

Matrix

v1 = c(1,3,5)
v2 = c(3,4,7)
v3 = c(6,7,8)
v4 = c(7,9,0)
m = cbind(v1,v2,v3,v4)   # bind the vectors as columns
m

Reference to matrix

•  Row first, column later

•  Flexible in R

> m
     v1 v2 v3 v4
[1,]  1  3  6  7
[2,]  3  4  7  9
[3,]  5  7  8  0

> m[2,3]
v3
 7
> m[1,]
v1 v2 v3 v4
 1  3  6  7
> m[1:2,]
     v1 v2 v3 v4
[1,]  1  3  6  7
[2,]  3  4  7  9

> m[,2:3]
     v2 v3
[1,]  3  6
[2,]  4  7
[3,]  7  8
> m[,3:4]*m[1,2]
     v3 v4
[1,] 18 21
[2,] 21 27
[3,] 24  0
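The same matrix can also be built in one step with matrix(), which fills values column by column. The sketch below is an alternative to cbind(), not the approach used on the slides, and adds two common follow-up operations.

```r
# Values are listed column by column: v1, then v2, v3, v4
m <- matrix(c(1,3,5,  3,4,7,  6,7,8,  7,9,0), nrow = 3)
colnames(m) <- c("v1", "v2", "v3", "v4")

dim(m)                   # 3 rows, 4 columns
m[2, 3]                  # row 2, column 3
m[m[, "v4"] > 0, ]       # rows where v4 is positive
colMeans(m)              # mean of each column
```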

Dataframe

Dataset in R = “Dataframe”: laid out like a matrix, but columns can be of different types

ID  Gender  Math  Reading
1   F       5     8
2   M       5     2
3   F       7     3
4   F       8     6

Columns = fields = variables

Rows = records = observations

Column types: numeric, character, numeric, numeric

Reference to field/column in a dataframe

•  Dataframe should be attached prior to analysis

•  Reference to field: (dataframe name)$(field name)

•  Example:

v1 = c(1,3,5)
v2 = c(3,4,7)
v3 = c(6,7,8)
v4 = c(7,9,0)
dat = data.frame(v1, v2, v3, v4)
attach(dat)
dat$sum = dat$v1 + dat$v3   # new column stored inside dat
sum1 = v1 + v3              # new vector stored in the workspace
dat

The effect of $

> dat
  v1 v2 v3 v4 sum
1  1  3  6  7   7
2  3  4  7  9  10
3  5  7  8  0  13

There is NO sum1 in dat! sum1 was created in the workspace, not in the dataframe.
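The same distinction can be shown without attach(). The sketch below uses with(), a base R alternative that looks up column names inside the dataframe for a single expression; it is offered as an option, not as the method used in the workshop.

```r
dat <- data.frame(v1 = c(1, 3, 5), v3 = c(6, 7, 8))

# with() evaluates v1 + v3 using the columns of dat
dat$sum <- with(dat, v1 + v3)

names(dat)                    # "v1" "v3" "sum"
"sum1" %in% names(dat)        # FALSE: nothing called sum1 in dat
dat
```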

Data coding in R

id = c(1, 2, 3, 4, 5)

gender = c("male", "female", "male", "female", "female")

dat = data.frame(id, gender)

We want to create a new variable called sex with numeric values (1, 2)

dat$sex[gender=="male"] <- 1

dat$sex[gender=="female"] <- 2
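The same recoding can be written in one line with ifelse(), or the character variable can be turned into a factor. This is a sketch of two alternatives; the column name gender_f is made up for illustration.

```r
id     <- c(1, 2, 3, 4, 5)
gender <- c("male", "female", "male", "female", "female")
dat    <- data.frame(id, gender)

# Option 1: ifelse(condition, value if TRUE, value if FALSE)
dat$sex <- ifelse(dat$gender == "male", 1, 2)

# Option 2: a factor with an explicit level order (male = 1, female = 2)
dat$gender_f <- factor(dat$gender, levels = c("male", "female"))
as.integer(dat$gender_f)      # underlying numeric codes

dat
```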

Character and numeric coding

Character to numeric

X = c("1", "2", "3", "4", "5")

We want to create a new variable called Y with numeric values (for calculation)

Y = as.numeric(X)

mean(Y)

Numeric to character

Y = 1:10

We want to create a new variable called X with character values

X = as.character(Y)
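A point worth knowing when converting: text that does not look like a number becomes NA. The sketch below uses a made-up vector to show this; suppressWarnings() is used only to keep the output quiet, since R normally warns about the conversion.

```r
X <- c("1", "2", "three", "4")        # "three" is not a valid number

Y <- suppressWarnings(as.numeric(X))  # R would warn: NAs introduced by coercion
Y
is.na(Y)                              # TRUE for the value that failed
mean(Y, na.rm = TRUE)                 # mean of the values that converted
```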

Sorting data: sort()

X = rnorm(10); X
 [1]  1.5651300 -0.5382971 -0.1995302  1.0111098  0.3590144 -1.5245237
 [7] -0.3192534  0.1323256 -0.7916954 -0.0664167

sort(X)
 [1] -1.5245237 -0.7916954 -0.5382971 -0.3192534 -0.1995302 -0.0664167
 [7]  0.1323256  0.3590144  1.0111098  1.5651300
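sort() works on a single vector. To sort the rows of a whole dataframe by one variable, order() is the usual tool. The sketch below uses a small made-up dataset.

```r
# Hypothetical data for illustration
dat <- data.frame(id = c(1, 2, 3, 4), age = c(34, 21, 45, 32))

by_age      <- dat[order(dat$age), ]    # youngest first
by_age_desc <- dat[order(-dat$age), ]   # oldest first

by_age
by_age_desc
```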

Merging datasets

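id = c(1,2,3,4)
sex = c("M","F","M","F")
dat1 = data.frame(id, sex)

id = c(1,2,3,4,5)
age = c(21,34,45,32,18)
dat2 = data.frame(id, age)

# keep only the ids found in both datasets
dat = merge(dat1, dat2, by="id")

# keep all ids from both datasets (missing values become NA)
dat = merge(dat1, dat2, by="id", all.x=T, all.y=T)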
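The difference between the two merge() calls is easiest to see by comparing the results. A sketch using the same two datasets, with the results stored under the illustrative names inner and outer:

```r
dat1 <- data.frame(id = c(1, 2, 3, 4),    sex = c("M", "F", "M", "F"))
dat2 <- data.frame(id = c(1, 2, 3, 4, 5), age = c(21, 34, 45, 32, 18))

inner <- merge(dat1, dat2, by = "id")              # ids present in both
outer <- merge(dat1, dat2, by = "id", all = TRUE)  # all ids from either

nrow(inner)                # 4: id 5 is dropped because it is not in dat1
nrow(outer)                # 5: id 5 is kept
outer[outer$id == 5, ]     # sex is NA for id 5
```

Here all = TRUE is shorthand for all.x = TRUE together with all.y = TRUE.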

An R Session (demo)

To work with R …

•  R, like most statistical programs, works on observations (rows) and variables (columns)

•  You should keep in mind

–  Name of dataframe

–  Name of variables

Allison and Cicchetti’s study

Truett Allison; Domenic V. Cicchetti. Sleep in Mammals: Ecological and Constitutional Correlates. Science 1976; 194:732-734.

R Session

•  Reading a file into R for analysis

Filename: allison.csv

•  Some graphical analyses

•  Some descriptive (and not so descriptive) analyses

Allison T, Cicchetti DV (1976). Sleep in mammals: ecological and constitutional correlates. Science 194, 732–734.

Species                 BodyWt  BrainWt  NonDreaming  Dreaming  TotalSleep  LifeSpan  Gestation  Predation  Exposure  Danger
Africanelephant         6654    5712     NA           NA        3.3         38.6      645        3          5         3
Africangiantpouchedrat  1       6.6      6.3          2         8.3         4.5       42         3          1         3
ArcticFox               3.385   44.5     NA           NA        12.5        14        60         1          1         1
Arcticgroundsquirrel    0.92    5.7      NA           NA        16.5        NA        25         5          2         3
Asianelephant           2547    4603     2.1          1.8       3.9         69        624        3          5         4
Baboon                  10.55   179.5    9.1          0.7       9.8         27        180        4          4         4
Bigbrownbat             0.023   0.3      15.8         3.9       19.7        19        35         1          1         1
Braziliantapir          160     169      5.2          1         6.2         30.4      392        4          5         4
Cat                     3.3     25.6     10.9         3.6       14.5        28        63         1          2         1
Chimpanzee              52.16   440      8.3          1.4       9.7         50        230        1          1         1
Chinchilla              0.425   6.4      11           1.5       12.5        7         112        5          4         4
Cow                     465     423      3.2          0.7       3.9         30        281        5          5         5
Deserthedgehog          0.55    2.4      7.6          2.7       10.3        NA        NA         2          1         2
Donkey                  187.1   419      NA           NA        3.1         40        365        5          5         5
EasternAmericanmole     0.075   1.2      6.3          2.1       8.4         3.5       42         1          1         1

Reading file csv

•  Locate your folder and filename

•  Use the function read.csv

•  On a Mac, you can simply drag the file to the R command line to get its path

dat = read.csv("~/Dropbox/Garvan Lectures 2014/Datasets and Teaching Materials/allison.csv", header=T, na.strings="NA")

Reading file through file.choose()

f = file.choose() # find the file

dat = read.csv(f, header=T, na.strings="NA")

attach(dat) # attach the data before analysis

names(dat) # want to know variable names

dim(dat) # how many rows and columns?

summary(dat) # summarize data
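If allison.csv is not at hand, the reading step can still be practised on a small file. The sketch below writes four rows (taken from the data table shown earlier, with only five of the columns) to a temporary file and reads them back; the object names csv_text, f and small are made up for illustration.

```r
# A few rows of the dataset, written as CSV text
csv_text <- c(
  "Species,BodyWt,BrainWt,TotalSleep,LifeSpan",
  "Africanelephant,6654,5712,3.3,38.6",
  "ArcticFox,3.385,44.5,12.5,14",
  "Arcticgroundsquirrel,0.92,5.7,16.5,NA",
  "Baboon,10.55,179.5,9.8,27"
)

f <- tempfile(fileext = ".csv")    # temporary file path
writeLines(csv_text, f)

small <- read.csv(f, header = TRUE, na.strings = "NA")

names(small)                       # variable names
dim(small)                         # rows and columns
sum(is.na(small$LifeSpan))         # number of missing life spans
mean(small$LifeSpan, na.rm = TRUE) # mean, ignoring missing values
```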

Summary: an overall “picture”

> summary(dat)
                   Species       BodyWt             BrainWt
 Africanelephant       : 1   Min.   :   0.005   Min.   :   0.14
 Africangiantpouchedrat: 1   1st Qu.:   0.600   1st Qu.:   4.25
 ArcticFox             : 1   Median :   3.342   Median :  17.25
 Arcticgroundsquirrel  : 1   Mean   : 198.790   Mean   : 283.13
 Asianelephant         : 1   3rd Qu.:  48.203   3rd Qu.: 166.00
 Baboon                : 1   Max.   :6654.000   Max.   :5712.00
 (Other)               :56
  NonDreaming        Dreaming       TotalSleep       LifeSpan
 Min.   : 2.100   Min.   :0.000   Min.   : 2.60   Min.   :  2.000
 1st Qu.: 6.250   1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625
 Median : 8.350   Median :1.800   Median :10.45   Median : 15.100
 Mean   : 8.673   Mean   :1.972   Mean   :10.53   Mean   : 19.878
 3rd Qu.:11.000   3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750
 Max.   :17.900   Max.   :6.600   Max.   :19.90   Max.   :100.000
 NA's   :14       NA's   :12      NA's   :4       NA's   :4
   Gestation        Predation        Exposure         Danger
 Min.   : 12.00   Min.   :1.000   Min.   :1.000   Min.   :1.000
 1st Qu.: 35.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000
 Median : 79.00   Median :3.000   Median :2.000   Median :2.000
 Mean   :142.35   Mean   :2.871   Mean   :2.419   Mean   :2.613
 3rd Qu.:207.50   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :645.00   Max.   :5.000   Max.   :5.000   Max.   :5.000
 NA's   :4

Descriptive statistics: counting

library(tables)
tabular(factor(Exposure) ~ (n=1 + Percent("col")))

 factor(Exposure)   n   Percent
 1                 27   43.548
 2                 13   20.968
 3                  4    6.452
 4                  5    8.065
 5                 13   20.968

Descriptive statistics: mean, SD, etc

tabular(factor(Exposure) ~ LifeSpan*(n=1 + mean + median + sd), data=na.omit(dat))

                   LifeSpan
 factor(Exposure)   n   mean   median   sd
 1                 18   15.17    5.75   24.103
 2                  9   13.81    7.00   15.622
 3                  4   25.40   26.50   13.846
 4                  4   20.30   23.60    9.428
 5                  7   33.34   30.00   18.452
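The counts and percentages can also be obtained with base R alone. The sketch below rebuilds the Exposure variable from the counts in the first table (27, 13, 4, 5, 13) so that it runs without the data file; with the real data, Exposure would come from dat.

```r
# Exposure rebuilt from the counts reported in the table above
Exposure <- rep(1:5, times = c(27, 13, 4, 5, 13))

counts <- table(Exposure)                      # n per category
pct    <- round(100 * prop.table(counts), 3)   # column percentages

cbind(n = counts, Percent = pct)
```

For group means of a second variable, tapply(LifeSpan, Exposure, mean, na.rm = TRUE) gives the base R equivalent of the mean column.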

Descriptive statistics: graph

means = with(na.omit(dat), tapply(LifeSpan, Exposure, mean))
barplot(sort(means), horiz=T, las=1, col="blue", xlab="Life Span", ylab="Exposure")

Descriptive statistics: graph

library(sciplot)
bargraph.CI(Exposure, LifeSpan, lc=F, data=na.omit(dat))

Box plot

boxplot(LifeSpan ~ Exposure, notch=F, col="blue")


Even better box plot

library(ggplot2)
qplot(x=factor(Exposure), y=LifeSpan, data=dat, geom=c("boxplot", "jitter"), fill=Exposure)


Histogram

hist(LifeSpan, prob=T, col="blue")
lines(density(LifeSpan, na.rm=T), col="red", lwd=3)


Histogram with ggplot2

qplot(x=LifeSpan) + geom_histogram(col="white", fill="blue") + theme(legend.position="none")


Histogram and density with ggplot2

m = ggplot(data=dat, aes(x=LifeSpan))

m + geom_histogram(binwidth=20, aes(y=..density..), col="white", fill="blue", lwd=0.5) + geom_density()


More “fancy” histogram

library(ggplot2)
qplot(x=LifeSpan, geom="density", fill=factor(Exposure), alpha=I(0.5)) + theme(legend.position="top")

Scatter plot

plot(BodyWt, BrainWt, pch=16, col="blue")

Scatter plot with labels

plot(BodyWt, BrainWt, pch=16, col="blue")

text(BodyWt, BrainWt, labels=Species, cex= 0.5)

Scatter plot with transformation

plot(log(BodyWt), log(BrainWt), pch=16, col="blue")

Scatter plot with straight line

plot(log(BrainWt) ~ log(BodyWt), pch=16, col="blue")
abline(lm(log(BrainWt) ~ log(BodyWt)), col="red")
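The red line comes from a linear model on the log scale, and the fitted model can be inspected directly. The sketch below uses only the 15 species listed in the data table earlier, so its estimates will differ from those based on the full dataset; the body weight of 100 used for prediction is an arbitrary illustrative value.

```r
# BodyWt and BrainWt for the 15 species shown in the data table
d <- data.frame(
  BodyWt  = c(6654, 1, 3.385, 0.92, 2547, 10.55, 0.023, 160,
              3.3, 52.16, 0.425, 465, 0.55, 187.1, 0.075),
  BrainWt = c(5712, 6.6, 44.5, 5.7, 4603, 179.5, 0.3, 169,
              25.6, 440, 6.4, 423, 2.4, 419, 1.2)
)

fit <- lm(log(BrainWt) ~ log(BodyWt), data = d)
coef(fit)                          # intercept and slope on the log scale

# Predicted brain weight for a body weight of 100, back-transformed with exp()
pred <- exp(predict(fit, newdata = data.frame(BodyWt = 100)))
pred
```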

Scatter plot coloured by a 3rd variable

qplot(x=log(BodyWt), y=log(BrainWt), col=Exposure) + stat_smooth(method="lm", se=T)

Scatter plot scaled by size

qplot(x=log(BodyWt), y=log(BrainWt), size=Danger, col=Exposure) + stat_smooth(method="lm", se=T)

Multiple scatter plots with straight line

qplot(log(BodyWt), log(BrainWt), data=dat, facets=~Danger)+geom_abline()

Correlogram

library(psych)

vars=cbind(log(BodyWt), log(BrainWt), TotalSleep, Dreaming, LifeSpan, Gestation)

pairs.panels(vars)

Figure: correlogram from pairs.panels() for log(BodyWt), log(BrainWt), TotalSleep, Dreaming, LifeSpan and Gestation, showing pairwise scatter plots and correlation coefficients.
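The correlation coefficients shown in a correlogram can also be computed as a plain matrix with cor(). The sketch below uses three variables from the 15 species listed in the data table earlier, so the values will not match those from the full dataset; use = "pairwise.complete.obs" is one of several ways to handle the missing values.

```r
# Values for the 15 species shown in the data table (NA = missing)
TotalSleep <- c(3.3, 8.3, 12.5, 16.5, 3.9, 9.8, 19.7, 6.2,
                14.5, 9.7, 12.5, 3.9, 10.3, 3.1, 8.4)
LifeSpan   <- c(38.6, 4.5, 14, NA, 69, 27, 19, 30.4,
                28, 50, 7, 30, NA, 40, 3.5)
Gestation  <- c(645, 42, 60, 25, 624, 180, 35, 392,
                63, 230, 112, 281, NA, 365, 42)

vars <- cbind(TotalSleep, LifeSpan, Gestation)

# Each correlation uses all species with both values available
r <- cor(vars, use = "pairwise.complete.obs")
round(r, 2)
```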

Factor analysis

library(psych)

vars=cbind(BodyWt, BrainWt, LifeSpan, Gestation, TotalSleep, Danger, Predation)

fit = factanal(na.omit(vars), 2, rotation="varimax")

fit

Factor analysis

Loadings:
            Factor1  Factor2
BodyWt       0.933
BrainWt      0.995
LifeSpan     0.511
Gestation    0.771    0.264
TotalSleep  -0.333   -0.614
Danger                0.996
Predation             0.948

                Factor1  Factor2
SS loadings       2.834    2.345
Proportion Var    0.405    0.335
Cumulative Var    0.405    0.740

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 62.94 on 8 degrees of freedom.
The p-value is 1.23e-10

Summary

•  R – an important development in statistical science

•  Absolutely free, powerful, highly flexible

•  Widely used around the world

•  Fits all statistical models

•  Very useful for simulation work

•  High quality (e.g. publishable) graphics

Books and references

Dalgaard P (2008) Introductory Statistics with R. New York: Springer, 2nd edition.

Seefeld K, Linder E (2007) Statistics using R with biological examples. Available online (free). http://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf

Braun WJ, Murdoch DJ (2007) A First Course in Statistical Programming with R. Cambridge: Cambridge University Press.

Wickham H (2009) ggplot: using the grammar of graphics with R. Springer

Useful websites

www.rseek.org (Google)
