Computing with large data sets
Richard Bonneau, New York University, spring 2009
Lecture 16 (week 10): Bioconductor, an example R multi-developer project
Monday, April 27, 2009
Acknowledgments and other sources:
Ben Bolstad, Biostats lectures, Berkeley
Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Gentleman, Carey, Huber, Irizarry, Dudoit)
The bioconductor website:
http://www.bioconductor.org/
v22.0480: computing with data, Richard Bonneau, Lecture 16
Central dogma (gene expression): DNA → RNA → Protein
Bacillus subtilis microarrays
Genome of 4,106 protein-coding genes, one spot per gene
PCR-amplified probes printed on aminosilane coated slides, UV-crosslinked
measuring mRNA, RNA : microarrays
Spotting inconsistencies
measuring mRNA, RNA : microarrays
Affymetrix chip
Time spent on experiment: ~7 days
Cost of experiment: $150-$600
measuring mRNA, RNA : microarrays
2D - LC/LC
Study protein complexes without gel electrophoresis
Proteins are first digested into peptides (trypsin)
Peptides all bind to the cation exchange column
Successive elution with increasing salt gradients separates peptides by charge
Peptides are then separated by hydrophobicity on the reverse phase column
The complex mixture is simplified prior to MS/MS by 2D LC
transcription factors control expression of genes
• Bioconductor (BioC) is an open source, open development software project that actively develops tools for the analysis of many types of genomic data.
• Mainly written in R
• Global and open source
• Licensed under the GPL/LGPL/BSD licenses
bioconductor
• R gives us a wide range of powerful statistical and graphical methods.
• Bioconductor supports tracking and management of biological metadata in the analysis of experimental data.
• R facilitates the rapid development of extensible, scalable, and interoperable software.
• Each package comes with high-quality documentation and supports reproducible research.
• The team provides training workshops in computational and statistical methods for genomic analysis.
bioconductor
• Platform independent: Linux/Unix, Windows
• Predominantly command line interface
• Often object oriented: S4 objects
• Most of the current tools are designed for the analysis of microarray data
• R is used by many statisticians and has a large repository of packages which might also be useful: cran.r-project.org
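To make the "S4 objects" bullet concrete, here is a minimal sketch of an S4 class with a validity check, a generic, and a method, using only the `methods` package that ships with R. The class `MiniExprSet` and the generic `sampleLabels` are hypothetical names invented for this illustration; they are not part of Bioconductor.

```r
library(methods)

# Hypothetical toy class: an expression matrix plus one label per column.
setClass("MiniExprSet",
  slots = c(exprs = "matrix", samples = "character"),
  validity = function(object) {
    if (ncol(object@exprs) != length(object@samples)) {
      "one sample label is needed per column of exprs"
    } else {
      TRUE
    }
  }
)

# A generic plus a method: the S4 way to define an accessor function.
setGeneric("sampleLabels", function(object) standardGeneric("sampleLabels"))
setMethod("sampleLabels", "MiniExprSet", function(object) object@samples)

# Customize how the object prints at the command line.
setMethod("show", "MiniExprSet", function(object) {
  cat("MiniExprSet:", nrow(object@exprs), "genes x",
      ncol(object@exprs), "samples\n")
})

x <- new("MiniExprSet",
         exprs = matrix(as.numeric(1:6), nrow = 3),
         samples = c("a", "b"))
sampleLabels(x)
```

Because the validity function runs inside `new()`, an object whose labels do not match its matrix columns is rejected at construction time, which is one reason large multi-developer projects tend to favor S4 classes.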
bioconductor features
• Full access to algorithms and their implementation
• The ability to fix bugs
• To encourage good scientific computing and statistical practice by providing appropriate tools and instruction
• To provide a workbench of tools that allow researchers to explore and expand the methods used to analyze biological data
• To ensure that the international scientific community is the owner of the software tools needed to carry out research
bioconductor : open source
• Each package contains at least one vignette: a document that provides a textual, task-oriented description of the package's functionality and that can be used interactively. Many are simple "HowTo"s, that is, they are designed to demonstrate how a particular task can be accomplished with that package's software. Others provide a more thorough overview of the package, or might even discuss general issues related to the package.
• The vignettes are generated using the Sweave function from the R package tools. They are documents that intermix text, code, and output (textual and graphical) and can be regenerated automatically whenever the data or analyses change.
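As a rough illustration of how text, code, and output are intermixed, the sketch below writes a tiny Sweave source file and weaves it into LaTeX. The file name `demo.Rnw` and its contents are made up for this example. In current R releases `Sweave()` is exported from the `utils` package, which is what the call below assumes.

```r
# Write a tiny Sweave source file: LaTeX text plus one R code chunk.
rnw <- file.path(tempdir(), "demo.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the integers 1 to 10 is computed below.",
  "<<mean-chunk>>=",
  "mean(1:10)",
  "@",
  "\\end{document}"
), rnw)

# Sweave() runs the chunk and writes a .tex file (code plus output)
# into the current working directory, so work inside tempdir().
old_wd <- setwd(tempdir())
tex_name <- utils::Sweave(rnw, quiet = TRUE)
setwd(old_wd)

# The woven file now contains both the R command and its printed result.
tex_lines <- readLines(file.path(tempdir(), tex_name))
```

Rerunning `Sweave()` after the data or the chunk changes regenerates the document, which is the reproducibility property described above. The vignettes of an installed package can be listed with `vignette(package = "pkgname")`.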
bioconductor: docs
• As of the 1.4 release (May 2004) there were almost 90 packages. The first release, in May 2002, had only 15 packages.
• Some are very simple while others provide extensive capabilities for the analysis of a particular type of data
• There is some level of dependency among the packages
• We will explore a subset of the packages
bioconductor : packages
• Accessor functions that can be applied to exprSets
  – exprs(): access the expression values
  – se.exprs(): access standard error estimates
  – pData(): access phenotype data
  – description(): obtain the MIAME information
  – geneNames(): access the names of the genes
  – sampleNames(): names of the samples
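The accessors above can be tried on a toy object. This is a sketch under two assumptions: that the Biobase package is installed, and that a current Biobase is in use, where the `exprSet` class has been superseded by `ExpressionSet` and `featureNames()` takes the place of `geneNames()`. The expression values and sample annotations are made up.

```r
# Biobase is a Bioconductor package; skip quietly if it is not installed.
if (requireNamespace("Biobase", quietly = TRUE)) {
  library(Biobase)

  set.seed(1)
  # Toy data: 5 genes (rows) by 3 samples (columns), random values.
  m <- matrix(rnorm(15), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("s", 1:3)))
  pd <- data.frame(group = c("ctl", "ctl", "trt"),
                   row.names = colnames(m))

  eset <- ExpressionSet(assayData = m,
                        phenoData = AnnotatedDataFrame(pd))

  expr_dim  <- dim(exprs(eset))       # genes x samples
  pheno     <- pData(eset)            # one row of annotation per sample
  gene_ids  <- featureNames(eset)     # successor of geneNames()
  sample_id <- sampleNames(eset)
}
```

Going through accessors rather than reaching into slots directly lets the class internals change between releases without breaking analysis scripts.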
bioconductor : biobase
• The core package for low-level analysis of Affymetrix data
• Provides
  – Mechanisms for reading and storing cel file data (raw probe intensities)
  – Tools for exploring probe-intensity data
  – Methods for pre-processing: background correction, normalization
  – Computing expression measures
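To give a feel for what a normalization step does, here is a base-R sketch of quantile normalization, one commonly used method for making arrays comparable. It is an illustration written for these notes, not the affy package's own code, and the function name `quantile_normalize` and the simulated intensities are assumptions.

```r
quantile_normalize <- function(x) {
  # x: numeric matrix, rows = probes, columns = arrays; assumes no NAs.
  ranks  <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  # Reference distribution: mean of the k-th smallest value across arrays.
  target <- rowMeans(sorted)
  # Replace every value by the reference value that has the same rank.
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(42)
# Simulated raw intensities for 10 probes on 4 arrays.
raw <- matrix(rexp(40, rate = 0.1), nrow = 10,
              dimnames = list(paste0("probe", 1:10), paste0("array", 1:4)))
log_raw <- log2(raw)
normed  <- quantile_normalize(log_raw)

# After normalization every array has the same set of values.
apply(normed, 2, quantile)
```

The within-array ordering of probes is preserved; only the distribution of intensities is forced to be identical across arrays.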
bioconductor : affy, a package for affymetrix
boxplot() hist()
bioconductor : affy, a package for affymetrix
affyPLM - Pseudo-chip images
Panels: weights, residuals, positive residuals, negative residuals
image()
affyPLM - RLE Plots
Relative Log Expression
Mbox()
Normalized Unscaled Standard Errors
boxplot()
affyPLM - NUSE plots
• Fitting probe-level models to Affymetrix data provides quality control information
• Quality assessment focuses on
  – Residuals
  – Weights from a robust fitting procedure
  – Relative log expression
  – Standard errors
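The relative log expression idea can be sketched in base R without fitting a probe-level model: subtract each gene's median log expression across arrays, then compare the per-array distributions. This is a simplified illustration on simulated data, not the affyPLM implementation; `rle_values` and the shifted fourth array are assumptions made for the example.

```r
rle_values <- function(log_expr) {
  # log_expr: matrix of log-scale expression, rows = genes, columns = arrays.
  # Subtract each gene's median across arrays from that gene's values.
  sweep(log_expr, 1, apply(log_expr, 1, median), "-")
}

set.seed(7)
# Simulated log expression: 200 genes on 4 arrays.
log_expr <- matrix(rnorm(200 * 4, mean = 8), nrow = 200,
                   dimnames = list(NULL, paste0("array", 1:4)))
# Simulate one problematic array with a systematic upward shift.
log_expr[, 4] <- log_expr[, 4] + 1

rle_mat <- rle_values(log_expr)
array_medians <- apply(rle_mat, 2, median)
array_medians

# A good array has RLE values centered near zero with a small spread;
# boxplot(rle_mat) would show array4 sitting above the others.
```

If most genes are unchanged between samples, an array whose RLE box is off-center or unusually wide is a candidate for closer inspection.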
QC : affyPLM
getting data from the web
library(Biobase)
library(GEOquery)

# Download GDS file, put it in the current directory, and load it:
gds858 <- getGEO('GDS858', destdir=".")

# Or, open an existing GDS file (even if it's compressed):
gds858 <- getGEO(filename='GDS858.soft.gz')
A good example of using the connection to GEO:
http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/geo/
• Handles annotation
  – Convert between Unigene, LocusLink, Affymetrix probeset ids and other annotation methods
• Methods for accessing online information from PubMed, GenBank
annotate
• Allows you to create graphs with nodes and edges
Rgraphviz
• cluster: for clustering
• class: for classification
• rpart: trees
• mclust: model based clustering
• mgcv: smoothers
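As one small example from this list, the sketch below fits a classification tree with rpart. It assumes the rpart package is installed (it is one of the recommended packages distributed with R) and uses the built-in `iris` data purely for illustration, since no data set is named on the slide.

```r
library(rpart)

# Fit a classification tree predicting species from the four measurements.
fit <- rpart(Species ~ ., data = iris, method = "class")

# Predicted class for each training observation.
pred <- predict(fit, newdata = iris, type = "class")

# Confusion table and training accuracy (optimistic: same data as the fit).
confusion <- table(predicted = pred, actual = iris$Species)
accuracy  <- mean(pred == iris$Species)
confusion
```

The same formula interface carries over to many other modeling packages, which makes it easy to swap a tree for a different classifier.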
other useful packages
There is only one take-home message from this entire lecture:
R can support a big effort very well, with web services, interfaces to data repositories, interfaces to languages better suited to web and database programming, access to high-level stats and machine learning work, graphical interfaces, automated generation of interactive reports, etc.