r introduction v2

THE WRIGHT LAB COMPUTATION LUNCHES
An introduction to R
Gene expression from DEVA to differential expression
Handling massively parallel sequencing data

Upload: martin-johnsson

Post on 11-Nov-2014


DESCRIPTION

A slightly different introduction to #Rstats. Data frames, linear models and plots with ggplot2.

TRANSCRIPT

Page 1: R introduction v2

THE WRIGHT LAB COMPUTATION LUNCHES

An introduction to R
Gene expression from DEVA to differential expression
Handling massively parallel sequencing data

Page 2: R introduction v2

R: data frames, plots and linear models

Martin Johnsson

Page 3: R introduction v2

statistical environment

Page 4: R introduction v2

free open source

Page 5: R introduction v2

scripting language

Page 6: R introduction v2

great packages (limma for microarrays, R/qtl for QTL mapping etc)

Page 7: R introduction v2

You need to write code!

Page 8: R introduction v2

Why scripting?

harder the first time
easier the next 20 times …
necessity for moderately large data

Page 9: R introduction v2

R-project.org

Rstudio.com

Page 10: R introduction v2

Interface is immaterial

ssh into a server, RStudio, alternatives
very few platform-dependent elements that the end user needs to worry about

Page 11: R introduction v2

Scripting

write your code down in a .R file
run it with source( )
## comments

Page 12: R introduction v2

if it runs without intervention from start to finish, you’re actually programming

Page 13: R introduction v2

Help!

within R: ? and tab completion
your favourite search engine
ask (Stack Exchange, R-help mailing list, package mailing lists)

Page 14: R introduction v2

First task: make a new script that imports a data set and takes a subset of the data

Page 15: R introduction v2

Reading in data

Excel sheet to data.frame
one sheet at a time
clear formatting
short succinct column names
export text

read.table

Page 16: R introduction v2

Subsetting

logical operators ==, !=, >, <, !
subset(data, column1==1)
subset(data, column1==1 & column2>2)
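A minimal sketch of the subset( ) calls above. The toy data frame is made up for illustration; only the column names column1 and column2 come from the slide.

```r
# Hypothetical toy data; the column names mirror the slide's placeholders.
data <- data.frame(column1 = c(1, 1, 2, 1),
                   column2 = c(1, 3, 5, 4))

# Rows where column1 equals 1
ones <- subset(data, column1 == 1)

# Rows where both conditions hold (& means "and")
ones.and.large <- subset(data, column1 == 1 & column2 > 2)

# ! negates a condition: rows where column1 is not 1
not.ones <- subset(data, !(column1 == 1))

nrow(ones)            # 3
nrow(ones.and.large)  # 2
nrow(not.ones)        # 1
```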

Page 17: R introduction v2

Indexing with [ ]

first three rows: data[c(1,2,3),]
two columns: data[,c(2,4)]
ranges: data[,1:3]
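The same indexing patterns on a small made-up data frame (the columns a to d are hypothetical). The part before the comma picks rows, the part after it picks columns, and leaving one side empty means "all of them".

```r
# Hypothetical toy data with five rows and four columns.
data <- data.frame(a = 1:5, b = letters[1:5], c = 6:10, d = 11:15)

first.three <- data[c(1, 2, 3), ]  # rows 1 to 3, all columns
two.columns <- data[, c(2, 4)]     # all rows, columns 2 and 4
col.range   <- data[, 1:3]         # all rows, columns 1 to 3

dim(first.three)    # 3 4
names(two.columns)  # "b" "d"
dim(col.range)      # 5 3
```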

Page 18: R introduction v2

Variables

columns in expressions
data$column1 + data$column2
log10(data$column2)

assignment arrow
data$new.column <- log10(data$column2)
new.data <- subset(data, column1==10)
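A short sketch of columns used in expressions and of the assignment arrow, again on invented numbers. The column name total is a hypothetical addition.

```r
# Hypothetical toy data.
data <- data.frame(column1 = c(10, 10, 20),
                   column2 = c(1, 10, 100))

# An expression on columns works element by element.
data$total <- data$column1 + data$column2

# Store a transformed column under a new name.
data$new.column <- log10(data$column2)

# Store a subset as a new data frame; the original is left untouched.
new.data <- subset(data, column1 == 10)

data$new.column  # 0 1 2
nrow(new.data)   # 2
```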

Page 19: R introduction v2

Exercise: Start a new script and save it as unicorn_analysis.R. Import unicorn_data.csv.

Take a subset that only includes green unicorns.

Page 20: R introduction v2

RStudio: File > New > R Script

data <- read.csv("unicorn_data.csv")
green.unicorns <- subset(data, colour=="green")

Page 21: R introduction v2

Anatomy of a function call

function.name(parameters)
mean(data$column)
mean(x=data$column)
mean(x=data$column, na.rm=TRUE)
?mean
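To make the anatomy concrete, here is a small sketch with an invented vector x. It shows positional versus named parameters and what na.rm changes when a value is missing.

```r
# Hypothetical vector with one missing value.
x <- c(2, 4, NA, 6)

plain   <- mean(x)                     # NA: a missing value propagates
cleaned <- mean(x, na.rm = TRUE)       # 4: missing values dropped first
named   <- mean(na.rm = TRUE, x = x)   # named parameters can come in any order

plain
cleaned
named
```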

Page 22: R introduction v2

mean(exp(log(x)))

Page 23: R introduction v2

programming in R == stringing together functions and writing new ones

Page 24: R introduction v2

Using a package (ggplot2) to make statistical graphics.

Page 25: R introduction v2

install.packages("ggplot2")
library(ggplot2)

Page 26: R introduction v2

qplot(x=x.var, y=y.var, data=data)
only x: histogram
x and y numeric: scatterplot

or set geometry (geom) yourself

Page 27: R introduction v2

geoms:
point
line
boxplot
jitter – scattered points
tile – heatmap
and many more

Page 28: R introduction v2

Exercise: Make a scatterplot of weight and horn length in green unicorns.

Write all code in the unicorn_analysis.R script.

Save the plots as variables so you can refer back to them.

Page 29: R introduction v2

green.scatterplot <- qplot(x=weight, y=horn.length, data=green.unicorns)

[Figure: scatterplot of horn.length (y) against weight (x) for green unicorns]

Page 30: R introduction v2

Exercise: Make a boxplot of horn length versus diet.

Page 31: R introduction v2

qplot(x=diet, y=horn.length, data=data, geom="boxplot")

[Figure: boxplots of horn.length (y) by diet (x: candy, flowers)]

Page 32: R introduction v2

Small multiples

split the plot into multiple subplots
useful for looking at patterns
qplot(x=x.var, y=y.var, data=data, facets=~variable)
facets=variable1~variable2
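A sketch of small multiples on invented unicorn-like data. It uses the longer ggplot( ) form with facet_wrap( ) rather than qplot( ), on the assumption that recent ggplot2 releases discourage qplot( ). The data values are made up, and the plot is only built if ggplot2 is installed.

```r
# Hypothetical data: 12 unicorns, two diets, two colours.
unicorns <- data.frame(
  diet = rep(c("candy", "flowers"), times = 6),
  colour = rep(c("green", "pink"), each = 6),
  horn.length = c(25, 31, 27, 33, 26, 30, 29, 36, 28, 38, 30, 35)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One boxplot panel per colour; facet_wrap(~ colour) plays the role
  # of facets=~colour in the qplot call.
  p <- ggplot(unicorns, aes(x = diet, y = horn.length)) +
    geom_boxplot() +
    facet_wrap(~ colour)
  # print(p) draws the plot in an interactive session
}
```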

Page 33: R introduction v2

Exercise: Again, make a boxplot of diet and horn length, but separated into small multiples by colour.

Page 34: R introduction v2

qplot(x=diet, y=horn.length, data=data, geom="boxplot", facets=~colour)

[Figure: boxplots of horn.length (y) by diet (x: candy, flowers), in two panels: green and pink]

Page 35: R introduction v2

Comparing means with linear models

Page 36: R introduction v2

Wilkinson–Rogers notation

one predictor: y ~ x
additive model: y ~ x1 + x2
interactions: y ~ x1 + x2 + x1:x2 or y ~ x1 * x2
factors: y ~ factor(x)
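One way to see what a formula means is to look at the design matrix it generates. The sketch below uses an invented data frame (toy) and model.matrix( ) to check that the long and short interaction forms give the same columns, and to show how factor( ) turns a numeric code into a dummy variable.

```r
# Hypothetical toy data: x1 is a 0/1 code, x2 is numeric.
toy <- data.frame(y  = c(1, 2, 3, 4, 5, 6, 7, 8),
                  x1 = c(0, 1, 0, 1, 0, 1, 0, 1),
                  x2 = c(1.5, 2.0, 0.5, 3.0, 2.5, 1.0, 4.0, 3.5))

long.form  <- model.matrix(y ~ x1 + x2 + x1:x2, data = toy)
short.form <- model.matrix(y ~ x1 * x2, data = toy)

colnames(long.form)   # "(Intercept)" "x1" "x2" "x1:x2"
same.columns <- identical(colnames(long.form), colnames(short.form))
same.values  <- all(long.form == short.form)

# factor() makes R treat x1 as categories rather than as a number.
factor.form <- model.matrix(y ~ factor(x1), data = toy)
colnames(factor.form)  # "(Intercept)" "factor(x1)1"
```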

Page 37: R introduction v2

Student’s t-test

Page 38: R introduction v2

t.test(variable ~ grouping, data=data)
alternative: "two.sided", "less", "greater"
var.equal
paired

geoms: boxplot, jitter
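A sketch of the t.test( ) options on simulated data. The group labels, means and seed are invented; variable and grouping are the slide's placeholder names.

```r
set.seed(1)  # arbitrary seed so the simulated numbers are reproducible

# Hypothetical data: two groups of ten with different true means.
toy <- data.frame(
  grouping = rep(c("a", "b"), each = 10),
  variable = c(rnorm(10, mean = 30, sd = 3), rnorm(10, mean = 34, sd = 3))
)

# Default: Welch test, which does not assume equal variances.
welch <- t.test(variable ~ grouping, data = toy)

# Classical Student's t-test with pooled variance.
pooled <- t.test(variable ~ grouping, data = toy, var.equal = TRUE)

# One-sided alternative: is the mean of group a less than that of group b?
one.sided <- t.test(variable ~ grouping, data = toy, alternative = "less")

welch$method
pooled$parameter       # degrees of freedom: 10 + 10 - 2 = 18
one.sided$alternative  # "less"
```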

Page 39: R introduction v2

Carl Friedrich Gauss
least squares estimation

Page 40: R introduction v2

[Figure: horn.length (y) plotted against weight (x)]

Page 41: R introduction v2

linear model
y = a + b x + e, e ~ N(0, sigma)
lm(y ~ x, data=some.data)

formula
data
summary( ) function

Page 42: R introduction v2

Exercise: Make a regression of horn length and body weight.

Page 43: R introduction v2

model <- lm(horn.length ~ weight, data=data)
summary(model)

Call:
lm(formula = horn.length ~ weight, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-6.5280 -2.0230 -0.1902  2.5459  7.3620

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.41236    3.85774   5.291 1.76e-05 ***
weight       0.03153    0.01093   2.886  0.00793 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.447 on 25 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.2499, Adjusted R-squared:  0.2199
F-statistic: 8.327 on 1 and 25 DF,  p-value: 0.007932

Page 44: R introduction v2

What have we actually fitted?
model.matrix(horn.length ~ weight, data)

What were the results?
coef(model)

How uncertain?
confint(model)
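The three questions above can be tried out on simulated data. This is a sketch only: the data frame toy, the true intercept and slope (20 and 0.03) and the seed are all invented, and the numbers will not match the unicorn output.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical data: horn length depends linearly on weight plus noise.
toy <- data.frame(weight = seq(250, 450, length.out = 30))
toy$horn.length <- 20 + 0.03 * toy$weight + rnorm(30, sd = 3)

model <- lm(horn.length ~ weight, data = toy)

# What was fitted: a column of ones (the intercept) and the weight column.
X <- model.matrix(horn.length ~ weight, toy)
head(X, 3)

# The estimates, and 95% confidence intervals around them.
estimates <- coef(model)
intervals <- confint(model)
estimates
intervals
```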

Page 45: R introduction v2

Plotting the model

regression equation y = a + b x
a is the intercept
b is the slope of the line

pull out coefficients with coef( )
a plot with two layers: scatterplot with added geom_abline( )

Page 46: R introduction v2

scatterplot <- qplot(x=weight, y=horn.length, data=data)
a <- coef(model)[1]
b <- coef(model)[2]
scatterplot + geom_abline(intercept=a, slope=b)

[Figure: scatterplot of horn.length (y) against weight (x) with the fitted regression line]

Page 47: R introduction v2

A regression diagnostic

the linear model makes several assumptions, particularly linearity and equal error variance
the residuals vs fitted plot can help spot gross deviations

Page 48: R introduction v2

qplot(x=fitted(model), y=residuals(model))

[Figure: residuals(model) (y) plotted against fitted(model) (x)]

Page 49: R introduction v2

Photo: Peter (anemoneprojectors), CC:BY-SA-2.0 http://www.flickr.com/people/anemoneprojectors/

Page 50: R introduction v2

analysis of variance
aov(formula, data=some.data)

drop1(aov.object, test="F")
F-tests (Type II SS)

post-hoc tests
pairwise.t.test
TukeyHSD
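A sketch of the anova workflow with post-hoc tests, on simulated data. Everything about the data is invented, including a third diet level ("hay") added so that pairwise comparisons have more than one pair to compare.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical balanced design: 3 diets x 2 colours, 4 animals per cell.
toy <- data.frame(
  diet   = factor(rep(c("candy", "flowers", "hay"), each = 8)),
  colour = factor(rep(c("green", "pink"), times = 12))
)
toy$weight <- 300 + 20 * (toy$diet == "flowers") +
  40 * (toy$diet == "hay") + rnorm(24, sd = 15)

model <- aov(weight ~ diet + colour, data = toy)

# F-test for each term, given the other term in the model.
f.tests <- drop1(model, test = "F")
f.tests

# Tukey's honest significant differences between diets.
tukey <- TukeyHSD(model, which = "diet")
tukey

# Pairwise t-tests with p-values adjusted for multiple testing (Holm by default).
pairwise <- pairwise.t.test(toy$weight, toy$diet)
pairwise
```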

Page 51: R introduction v2

Exercise: Perform analysis of variance on weight and the effects of diet while controlling for colour.

Page 52: R introduction v2

We want a two-way anova with an F-test for diet.

model.int <- aov(weight ~ diet * colour, data=data)
drop1(model.int, test="F")
model.add <- aov(weight ~ diet + colour, data=data)
drop1(model.add, test="F")

Single term deletions

Model:
weight ~ diet + colour
       Df Sum of Sq   RSS    AIC F value  Pr(>F)
<none>              85781 223.72
diet    1     471.1 86252 221.87  0.1318 0.71975
colour  1   13479.7 99260 225.66  3.7714 0.06396 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Page 53: R introduction v2

Black magic

none of this is really R’s fault, but these are things that come up along the way

Page 54: R introduction v2

Black magic

none of this is really R’s fault, but these are things that come up along the way
missing values: na.rm=TRUE, na.exclude( )

Page 55: R introduction v2

Black magic

none of this is really R’s fault, but these are things that come up along the way
missing values: na.rm=TRUE, na.exclude( )
type I, II and III sums of squares in anova

Page 56: R introduction v2

Black magic

none of this is really R’s fault, but these are things that come up along the way
missing values: na.rm=TRUE, na.exclude( )
type I, II and III sums of squares in anova
floating-point arithmetic, e.g. sin(pi)
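Two of these surprises are easy to reproduce. The sketch below uses invented vectors x and y. It shows that sin(pi) is a tiny number rather than exactly zero, and how na.exclude keeps residuals aligned with the original rows when a value is missing.

```r
# Floating-point arithmetic: pi cannot be stored exactly.
exactly.zero <- sin(pi) == 0                    # FALSE
nearly.zero  <- isTRUE(all.equal(sin(pi), 0))   # TRUE: equal within tolerance

# Missing values: hypothetical data with one NA in the predictor.
x <- c(1, 2, NA, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

mean(x)                # NA
mean(x, na.rm = TRUE)  # 3

# Default handling drops the incomplete row, so only 4 residuals come back.
fit.omit <- lm(y ~ x)
length(residuals(fit.omit))  # 4

# na.exclude also drops the row when fitting, but pads the residuals
# with NA so they line up with the original 5 rows.
fit.exclude <- lm(y ~ x, na.action = na.exclude)
length(residuals(fit.exclude))  # 5
```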

Page 57: R introduction v2

Reading

Dalgaard, Introductory Statistics with R, electronic resource at the library
Faraway, Linear Models with R
Gelman & Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models
Wickham, ggplot2 book
tons of tutorials online, for instance http://martinsbioblogg.wordpress.com/a-slightly-different-introduction-to-r/

Page 58: R introduction v2

Exercise

More (and some of the same) analysis of the unicorn data set. Use the R documentation and google. I will post solutions.