r introduction v2

Tags:

Post on 11-Nov-2014

4.415 Views

Category:

Technology

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

A slightly different introduction to #Rstats. Data frames, linear models and plots with ggplot2.

TRANSCRIPT

THE WRIGHT LAB COMPUTATION LUNCHES

An introduction to R Gene expression from DEVA to differential expression Handling massively parallel sequencing data

R: data frames, plots and linear models

Martin Johnsson

statistical environment

free open source

scripting language

great packages (limma for microarrays, R/qtl for QTL

mapping etc)

You need to write code!

Why scripting?

harder the first time easier the next 20 times … necessity for moderately large data ( )

R-project.org

Rstudio.com

Interface is immaterial

ssh into a server, RStudio, alternatives very few platform-dependent elements that the end user needs to worry about

Scripting

write your code down in a .R file run it with source ( ) ## comments

if it runs without intervention from start to finish, you’re

actually programming

Help!

within R: ? and tab your favourite search engine ask (Stack Exchange, R-help mailing list, package mailing lists)

First task: make a new script that imports a data set and takes a subset

of the data

Reading in data

Excel sheet to data.frame one sheet at a time clear formatting short succinct column names export text

read.table

Subsetting

logical operators ==, !=, >, <, ! subset(data, column1==1) subset(data, column1==1 & column2>2)

Indexing with [ ]

first three rows: data[c(1,2,3),] two columns: data[,c(2,4)] ranges: data[,1:3]

Variables

columns in expressions data$column1 + data$column2 log10(data$column2)

assignment arrow data$new.column <- log10(data$column2) new.data <- subset(data, column1==10)

Exercise: Start a new script and save it as unicorn_analysis.R. Import unicorn_data.csv.

Take a subset that only includes green unicorns.

RStudio: File > New > R Script

data <- read.csv("unicorn_data.csv") green.unicorns <- subset(data, colour=="green")

Anatomy of a function call

function.name(parameters) mean(data$column) mean(x=data$column) mean(x=data$column, na.rm=T) ?mean

mean(exp(log(x)))

programming in R == stringing together functions

and writing new ones

Using a package (ggplot2) to make statistical graphics.

install.packages("ggplot2") library(ggplot2)

qplot(x=x.var, y=y.var, data=data) only x: histogram x and y numeric: scatterplot

or set geometry (geom) yourself

geoms: point line boxplot jitter – scattered points tile – heatmap and many more

Exercise: Make a scatterplot of weight and horn length in green unicorns.

Write all code in the unicorn_analysis.R script.

Save the plots as variables so you can refer back to them.

green.scatterplot <- qplot(x=weight, y=horn.length, data=green.unicorns)

24

27

30

33

250 300 350 400 450weight

horn.length

Exercise: Make a boxplot of horn length versus diet.

qplot(x=diet, y=weight, data=unicorn.data, geom="boxplot")

25

30

35

40

candy flowersdiet

horn.length

Small multiples

split the plot into multiple subplots useful for looking at patterns qplot(x=x.var, y=y.var, data=data, facets=~variable) facets=variable1~variable2

Exercise: Again, make a boxplot of diet and horn length, but separated

into small multiples by colour.

qplot(x=diet, y=horn.length, data=data, geom="boxplot", facets=~colour)

green pink

25

30

35

40

candy flowers candy flowersdiet

horn.length

Comparing means with linear models

Wilkinson–Rogers notation

one predictor: y ~ x additive model: y ~ x1 + x2 interactions: y ~ x1 + x2 + x1:y2 or y ~ x1 * x2 factors: y ~ factor(x)

Student’s t-test

t.test(variable ~ grouping, data=data) alternative: two sided, less, greater var.equal paired

geoms: boxplot, jitter

Carl Friedrich Gauss least squares estimation

25

30

35

40

250 300 350 400 450weight

horn.length

linear model y = a + b x + e, e ~ N(0, sigma) lm(y ~ x, data=some.data)

formula data summary( ) function

Exercise: Make a regression of horn length and body weight.

model <- lm(horn.length ~ weight, data=data) summary(model) Call: lm(formula = horn.length ~ weight, data = data) Residuals: Min 1Q Median 3Q Max -6.5280 -2.0230 -0.1902 2.5459 7.3620 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 20.41236 3.85774 5.291 1.76e-05 *** weight 0.03153 0.01093 2.886 0.00793 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.447 on 25 degrees of freedom (3 observations deleted due to missingness) Multiple R-squared: 0.2499, Adjusted R-squared: 0.2199 F-statistic: 8.327 on 1 and 25 DF, p-value: 0.007932

What have we actually fitted? model.matrix(horn.length ~ weight, data) What were the results? coef(model) How uncertain? confint(model)

Plotting the model

regression equation y = a + x b a is the intercept b is the slope of the line

pull out coefficients with coef( ) a plot with two layers: scatterplot with added geom_abline( )

scatterplot <- qplot(x=weight, y=horn.length, data=data) a <- coef(model)[1] b <- coef(model)[2] scatterplot + geom_abline(intercept=a, slope=b)

25

30

35

40

250 300 350 400 450weight

horn.length

A regression diagnostic

the linear model needs several assumptions, particularly linearity and equal error variance the residuals vs fitted plot can help spot gross deviations

qplot(x=fitted(model), y=residuals(model))

-4

0

4

8

28 30 32 34fitted(model)

residuals(model)

Photo: Peter (anemoneprojectors), CC:BY-SA-2.0 http://www.flickr.com/people/anemoneprojectors/

analysis of variance aov(formula, data=some.data)

drop1(aov.object, test="F") F-tests (Type II SS)

post-hoc tests pairwise.t.test TukeyHSD

Exercise: Perform analysis of variance on weight and the effects of diet while

controlling for colour.

We want a two-way anova with an F-test for diet. model.int <- aov(weight ~ diet * colour, data=data) drop1(model.int, test="F") model.add <- aov(weight ~ diet + colour, data=data) drop1(model.add, test="F") Single term deletions Model: weight ~ diet + colour Df Sum of Sq RSS AIC F value Pr(>F) <none> 85781 223.72 diet 1 471.1 86252 221.87 0.1318 0.71975 colour 1 13479.7 99260 225.66 3.7714 0.06396 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Black magic

none of the is really R’s fault, but things that come up along the way

Black magic

none of the is really R’s fault, but things that come up along the way missing values: na.rm=T, na.exclude( )

Black magic

none of the is really R’s fault, but things that come up along the way missing values: na.rm=T, na.exclude( ) type I, II and III sums of squares in Anova

Black magic

none of the is really R’s fault, but things that come up along the way missing values: na.rm=T, na.exclude( ) type I, II and III sums of squares in anova floating-point arithmetic, e.g. sin(pi)

Reading

Daalgard, Introductory statistics with R, electronic resource at the library Faraway, The linear model in R Gelman & Hill, Data analysis using regression and multilevel/hierarchical models Wickham, ggplot2 book tons of tutorials online, for instance http://martinsbioblogg.wordpress.com/a-slightly-different-introduction-to-r/

Exercise

More (and some of the same) analysis of the unicorn data set. Use the R documentation and google. I will post solutions.

top related