
Computing with large data sets
Richard Bonneau, spring 2009

Lecture 16 (week 10): bioconductor: an example R multi-developer project

Monday, April 27, 2009


Acknowledgments and other sources:

Ben Bolstad, Biostats lectures, Berkeley

Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Gentleman, Carey, Huber, Irizarry, Dudoit)

The bioconductor website:

http://www.bioconductor.org/

v22.0480: computing with data, Richard Bonneau, Lecture 16

Central dogma

DNA → RNA → Protein (gene expression)

measuring mRNA, RNA : microarrays

Bacillus subtilis microarrays
• Genome of 4,106 protein coding genes, one spot-one gene
• PCR-amplified probes printed on aminosilane coated slides, UV-crosslinked
• Spotting inconsistencies

measuring mRNA, RNA : microarrays

Affymetrix chip
• Time spent on experiment: ~7 days
• Cost of experiment: $150-$600

2D - LC/LC

• Study protein complexes without gel electrophoresis
• Proteins are digested into peptides (trypsin)
• Peptides all bind to the cation exchange column
• Successive elution with increasing salt gradients separates peptides by charge
• Peptides are separated by hydrophobicity on the reverse phase column
• Complex mixture is simplified prior to MS/MS by 2D LC

transcription factors control expression of genes

bioconductor

• Bioconductor (BioC) is an open source, open development software project that actively develops tools for the analysis of many types of genomic data.
• Mainly written in R
• Global and open source
• Licensed under the GPL/LGPL/BSD licenses

bioconductor

• R gives us a wide range of powerful statistical and graphical methods.
• Tracking and management of biological metadata in the analysis of experimental data.
• R facilitates the rapid development of extensible, scalable, and interoperable software.
• Each package is expected to have high-quality documentation and to support reproducible research.
• The team provides training workshops in computational and statistical methods for genomic analysis.

bioconductor features

• Platform independent: Linux/Unix, Windows
• Predominantly command line interface
• Often object oriented: S4 objects
• Most of the current tools are designed for the analysis of microarray data
• R is used by many statisticians and has a large repository of packages which might also be useful: cran.r-project.org
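The "S4 objects" point can be made concrete with a small base R sketch. The class `MiniExprSet` and the generics `exprValues()` and `nSamples()` are hypothetical names invented for this illustration; they are not Bioconductor classes or functions, and only the `methods` package that ships with R is assumed.

```r
# Hypothetical S4 class for illustration only; not a Bioconductor class.
# It couples an expression matrix with one row of sample data per column.
setClass("MiniExprSet",
  slots = c(exprs = "matrix", pheno = "data.frame"),
  validity = function(object) {
    if (ncol(object@exprs) != nrow(object@pheno))
      return("one row of 'pheno' is required per column of 'exprs'")
    TRUE
  })

# Accessor generics hide the slot layout from callers.
setGeneric("exprValues", function(object) standardGeneric("exprValues"))
setMethod("exprValues", "MiniExprSet", function(object) object@exprs)

setGeneric("nSamples", function(object) standardGeneric("nSamples"))
setMethod("nSamples", "MiniExprSet", function(object) ncol(object@exprs))

x <- new("MiniExprSet",
         exprs = matrix(1:6, nrow = 2,
                        dimnames = list(c("g1", "g2"), c("a", "b", "c"))),
         pheno = data.frame(group = c("ctl", "trt", "trt")))

exprValues(x)
nSamples(x)
```

The validity function is the main reason to prefer a formal class here: an object whose sample table does not match its matrix is rejected when it is created, rather than failing later inside an analysis.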

bioconductor : open source

• Full access to algorithms and their implementation
• The ability to fix bugs
• To encourage good scientific computing and statistical practice by providing appropriate tools and instruction
• To provide a workbench of tools that allow researchers to explore and expand the methods used to analyze biological data
• To ensure that the international scientific community is the owner of the software tools needed to carry out research

bioconductor: docs

• Each package contains at least one vignette: a document that provides a textual, task-oriented description of the package's functionality and that can be used interactively. Many are simple "HowTo"s, designed to demonstrate how a particular task can be accomplished with that package's software. Others provide a more thorough overview of the package, or might even discuss general issues related to the package.
• The vignettes are generated using the Sweave function (originally in the R package tools, now in utils). They are documents that intermix text, code, and output (textual and graphical) and can be regenerated automatically whenever the data or analyses change.
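A minimal sketch of how Sweave intermixes text, code, and output, assuming a current R installation where `Sweave()` lives in the `utils` package. The file name `demo.Rnw` and its contents are made up for illustration. Running the block writes `demo.tex` to a temporary directory; no LaTeX installation is needed for this step.

```r
# Write a tiny Sweave source file: LaTeX text plus one R code chunk.
rnw <- file.path(tempdir(), "demo.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of 1 to 10 is \\Sexpr{mean(1:10)}.",
  "<<echo=TRUE>>=",
  "x <- c(2, 4, 6)",
  "summary(x)",
  "@",
  "\\end{document}"
), rnw)

# Sweave writes its output into the working directory, so switch to
# the temporary directory for the duration of the call.
old <- setwd(tempdir())
tex <- utils::Sweave(rnw, quiet = TRUE)   # returns the output file name
setwd(old)

# The .tex file now contains the text, the code, and the computed results.
writeLines(readLines(file.path(tempdir(), tex)))
```

Because the numbers in the output are computed when the document is processed, rerunning `Sweave()` after the data change regenerates the report without any manual editing.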

bioconductor : packages

• As of the 1.4 release (May 2004) there were almost 90 packages; the first release (May 2002) had only 15
• Some are very simple, while others provide extensive capabilities for the analysis of a particular type of data
• There is some level of dependency among the packages
• We will explore a subset of the packages

bioconductor : biobase

• Accessor functions that can be applied to exprSets
  – exprs(): access the expression values
  – se.exprs(): access standard error estimates
  – pData(): access phenotype data
  – description(): obtain the MIAME information
  – geneNames(): access the names of the genes
  – sampleNames(): access the names of the samples
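A minimal sketch of these accessors, assuming the Biobase package is installed. In current Biobase releases the exprSet class has been superseded by ExpressionSet, and geneNames() by featureNames(), so the sketch uses those. The toy matrix and sample table are made up for illustration.

```r
library(Biobase)

# Toy data (made up): 4 genes measured on 3 samples.
set.seed(1)
m <- matrix(rnorm(12), nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))
pd <- data.frame(group = c("ctl", "trt", "trt"),
                 row.names = colnames(m))

# Bundle expression values and phenotype data in one object.
eset <- ExpressionSet(assayData = m,
                      phenoData = AnnotatedDataFrame(pd))

exprs(eset)          # expression matrix (genes x samples)
pData(eset)          # phenotype data, one row per sample
featureNames(eset)   # gene identifiers (geneNames() in the exprSet era)
sampleNames(eset)    # sample identifiers

# Subsetting keeps expression values and phenotype data in sync.
treated <- eset[, pData(eset)$group == "trt"]
sampleNames(treated)
```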

bioconductor : affy, a package for affymetrix

• The core package for low-level analysis of Affymetrix data
• Provides
  – Mechanisms for reading and storing CEL file data (raw probe intensities)
  – Tools for exploring probe-intensity data
  – Methods for pre-processing: background correction, normalization
  – Computing expression measures

bioconductor : affy, a package for affymetrix

[Figure: output of boxplot() and hist()]
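The affy workflow described above might look like the following sketch, assuming the affy package is installed. The directory `path/to/celfiles` is a placeholder, not a real location, so the block only reads data when CEL files are actually found there. `rma()` is one of several expression measures the package offers and is used here only as an example.

```r
library(affy)

# Placeholder directory (assumption): point this at a folder of *.CEL files.
cel_dir <- "path/to/celfiles"

if (length(list.celfiles(cel_dir)) > 0) {
  # Read raw probe intensities into an AffyBatch object.
  abatch <- ReadAffy(celfile.path = cel_dir)

  # Explore probe-level data: one box / density curve per array.
  boxplot(abatch)
  hist(abatch)

  # Background-correct, normalize, and summarize in one step (RMA).
  eset <- rma(abatch)
  head(exprs(eset))   # log2 expression measures, probesets x arrays
} else {
  message("No CEL files found in ", cel_dir)
}
```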

affyPLM - Pseudo-chip images

[Figure: image() pseudo-chip images; panels: Weights, Residuals, Positive Residuals, Negative Residuals]

affyPLM - RLE Plots

[Figure: Mbox() plot of Relative Log Expression]

affyPLM - NUSE plots

[Figure: boxplot() of Normalized Unscaled Standard Errors]
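To show what the RLE and NUSE statistics measure, here is a base R sketch on simulated data. It does not use affyPLM; the matrices below stand in for the expression estimates and standard errors that a probe-level model fit would supply. One array is deliberately made worse, so it stands out in both plots.

```r
set.seed(1)
n_genes  <- 200
n_arrays <- 5
arrays   <- paste0("array", 1:n_arrays)

# Simulated log2 expression values (rows = probesets, columns = arrays).
log_expr <- matrix(rnorm(n_genes * n_arrays, mean = 8, sd = 1),
                   nrow = n_genes, dimnames = list(NULL, arrays))
log_expr[, 5] <- log_expr[, 5] + 0.8    # array 5 is deliberately shifted

# RLE: subtract each probeset's median across arrays.
rle_vals <- sweep(log_expr, 1, apply(log_expr, 1, median))

# Simulated standard errors of the expression estimates.
se <- matrix(runif(n_genes * n_arrays, 0.05, 0.15),
             nrow = n_genes, dimnames = list(NULL, arrays))
se[, 5] <- se[, 5] * 1.5                # array 5 is deliberately noisier

# NUSE: divide by each probeset's median standard error across arrays.
nuse_vals <- sweep(se, 1, apply(se, 1, median), "/")

# A good array has an RLE box centred on 0 and a NUSE box centred on 1.
boxplot(rle_vals, main = "RLE");  abline(h = 0, lty = 2)
boxplot(nuse_vals, main = "NUSE"); abline(h = 1, lty = 2)
```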

QC : affyPLM

• Fitting probe-level models to Affymetrix data provides quality control information
• Quality assessment focuses on
  – Residuals
  – Weights from a robust fitting procedure
  – Relative log expression
  – Standard errors
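A sketch of these quality checks with affyPLM itself, assuming the affyPLM and affydata packages are installed, along with the chip definition package for the example arrays. The `Dilution` dataset from affydata is used only as convenient example data; any AffyBatch could take its place.

```r
library(affyPLM)    # also attaches affy
library(affydata)   # example data package (assumed installed)

data(Dilution)      # small example AffyBatch

# Fit a robust probe-level model to every probeset.
pset <- fitPLM(Dilution)

# Pseudo-chip images for the first array.
image(pset, which = 1, type = "weights")      # weights from the robust fit
image(pset, which = 1, type = "resids")       # residuals
image(pset, which = 1, type = "pos.resids")   # positive residuals only
image(pset, which = 1, type = "neg.resids")   # negative residuals only

# Array-level summaries, one box per array.
RLE(pset, main = "RLE")     # relative log expression
NUSE(pset, main = "NUSE")   # normalized unscaled standard errors
```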

getting data from the web

library(Biobase)
library(GEOquery)

# Download the GDS file, put it in the current directory, and load it:
gds858 <- getGEO('GDS858', destdir=".")

# Or, open an existing GDS file (even if it is compressed):
gds858 <- getGEO(filename='GDS858.soft.gz')

A good example of using the connection to GEO:

http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/geo/
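Once a GDS object is loaded, a typical next step is to inspect it and convert it into an expression object. The sketch below assumes GEOquery is installed and that network access to GEO is available. It downloads into a temporary directory rather than the current one, which is a choice made for this example only.

```r
library(Biobase)
library(GEOquery)

# Download GDS858 into a temporary directory (needs network access).
gds858 <- getGEO("GDS858", destdir = tempdir())

Meta(gds858)$title           # dataset-level metadata
head(Columns(gds858))        # sample descriptions
head(Table(gds858)[, 1:5])   # first columns of the expression table

# Convert to an ExpressionSet, log2-transforming the values.
# By default this also fetches the platform (GPL) annotation.
eset <- GDS2eSet(gds858, do.log2 = TRUE)

dim(exprs(eset))
head(pData(eset))
```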

annotate

• Handles annotation
  – Convert between UniGene, LocusLink, Affymetrix probeset IDs and other annotation methods
• Methods for accessing online information from PubMed, GenBank
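A short sketch of identifier conversion with annotate, assuming both annotate and the chip annotation package hgu95av2.db are installed. The two probeset IDs are examples from that chip. LocusLink has since been replaced by Entrez Gene, which is why the lookup function is `getEG()`.

```r
library(annotate)
library(hgu95av2.db)   # annotation for the HG-U95Av2 chip (assumed installed)

# Two example Affymetrix probeset IDs from this chip.
probes <- c("1000_at", "1001_at")

# Map probeset IDs to gene symbols and to Entrez Gene identifiers.
symbols <- getSYMBOL(probes, "hgu95av2")
entrez  <- getEG(probes, "hgu95av2")

data.frame(probe = probes, symbol = symbols, entrez = entrez)
```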

Rgraphviz

• Allows you to create graphs with nodes and edges
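A minimal sketch of building and drawing a graph, assuming the graph and Rgraphviz packages are installed. The node names describe a made-up transcription factor with two target genes and are purely illustrative.

```r
library(Rgraphviz)   # also attaches the graph package

# Made-up example: one transcription factor regulating two genes.
nodes <- c("TF", "geneA", "geneB")
g <- graphNEL(nodes = nodes, edgemode = "directed")

# Add directed edges from the regulator to each target.
g <- addEdge("TF", "geneA", g)
g <- addEdge("TF", "geneB", g)

numNodes(g)
numEdges(g)

# Lay out and draw the graph with Graphviz.
plot(g)
```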

other useful packages

• cluster: for clustering
• class: for classification
• rpart: trees
• mclust: model-based clustering
• mgcv: smoothers
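As a quick taste of two of these packages, the sketch below uses cluster and rpart, both of which ship with standard R distributions. The built-in `iris` data is an assumption made for convenience; it stands in for an expression matrix with known sample classes.

```r
library(cluster)   # pam(): partitioning around medoids
library(rpart)     # rpart(): classification and regression trees

# Unsupervised: cluster the standardized measurements into 3 groups.
x  <- scale(iris[, 1:4])
cl <- pam(x, k = 3)
table(cluster = cl$clustering, species = iris$Species)

# Supervised: fit a classification tree for the known classes.
tree <- rpart(Species ~ ., data = iris)
pred <- predict(tree, type = "class")
mean(pred == iris$Species)   # accuracy on the training data
```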

There is only one take-home message from this entire lecture:

R can support a big effort very well, with web services, interfaces to data repositories and to languages better suited to web and database programming, access to high-level statistics and machine learning work, graphical interfaces, automated generation of interactive reports, etc.
