BioConductor Steffen Durinck Robert Gentleman Sandrine Dudoit November 28, 2003 NETTAB Bologna

BioConductor

Steffen Durinck
Robert Gentleman
Sandrine Dudoit

November 28, 2003
NETTAB Bologna

Outline

• what is R

• what is Bioconductor

• packages

• getting and using Bioconductor

R

• R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S.

R

• what sorts of things is R good at?
– there are very many statistical algorithms
– there are very many machine learning algorithms
– visualization
– it is possible to write scripts that can be reused
– R is a real computer language
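As a small illustration of the last two points, the sketch below wraps a two-group comparison in a reusable function. The function name `compare_groups` and the simulated data are illustrative assumptions, not part of the talk; only base R and the stats package are used.

```r
# Hypothetical helper: compare two groups of measurements with a t-test
# and return the values most often reported.
compare_groups <- function(x, y) {
  test <- t.test(x, y)             # Welch two-sample t-test (stats package)
  list(
    mean_x  = mean(x),
    mean_y  = mean(y),
    p_value = test$p.value
  )
}

set.seed(1)                         # make the simulated data reproducible
control <- rnorm(20, mean = 0)      # made-up control measurements
treated <- rnorm(20, mean = 1)      # made-up treated measurements

result <- compare_groups(control, treated)
print(result)
```

Once saved in a script file, the same function can be applied to any pair of numeric vectors.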

R

• R supports many data technologies
– XML, database integration, SOAP

• R interacts with other languages
– C; FORTRAN; Perl; Python; Java

• R has good visualization capabilities

• R has a very active development environment

• R is largely platform independent
– Unix; Windows; OSX

Overview of the Bioconductor Project

Bioconductor

• Bioconductor is an open source and open development software project for the analysis of biomedical and genomic data.

• The project was started in the Fall of 2001 and includes 23 core developers in the US, Europe, and Australia.

• R and the R package system are used to design and distribute software.

• Releases
– v 1.0: May 2nd, 2002, 15 packages.
– v 1.1: November 18th, 2002, 20 packages.
– v 1.2: May 28th, 2003, 30 packages.
– v 1.3: October 28th, 2003, 54 packages.

• ArrayAnalyzer: Commercial port of Bioconductor packages in S-Plus.

Goals

• Provide access to powerful statistical and graphical methods for the analysis of genomic data.

• Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed) in the analysis of experimental data.

• Allow the rapid development of extensible, interoperable, and scalable software.

• Promote high-quality documentation and reproducible research.

• Provide training in computational and statistical methods.

Bioconductor Packages

Bioconductor packages

• Bioconductor software consists of R add-on packages.

• An R package is a structured collection of code (R, C, or other), documentation, and/or data for performing specific types of analyses.

• E.g., the affy, cluster, graph, and hexbin packages provide implementations of specialized statistical and graphical methods.
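A minimal sketch of how a package is used once installed. It attaches the stats package, which ships with R, so that it runs without any Bioconductor installation; a Bioconductor package such as affy would be attached the same way with `library(affy)`. The variable name `exported` is an illustrative assumption.

```r
# Attach a package (stats ships with R and is normally attached already).
library(stats)

# List the objects the package makes available.
exported <- ls("package:stats")
head(exported)

# Check that a particular function is provided by the package.
"hclust" %in% exported
```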

Bioconductor packages
Release 1.3, October 28th, 2003

• AnnBuilder: Bioconductor annotation data package builder
• Biobase: Base functions for Bioconductor
• DynDoc: Dynamic document tools
• MAGEML: handling MAGEML documents
• MeasurementError.cor: Measurement Error model estimate for correlation coefficient
• RBGL: Test interface to boost C++ graph lib
• ROC: utilities for ROC, with uarray focus
• RdbiPgSQL: PostgreSQL access
• Rdbi: Generic database methods
• Rgraphviz: Provides plotting capabilities for R graph objects
• Ruuid: Provides Universally Unique ID values
• SAGElyzer: A package that deals with SAGE libraries
• SNPtools: Rudimentary structures for SNP data
• affyPLM: Probe Level Models
• affy: Methods for Affymetrix Oligonucleotide Arrays
• affycomp: Graphics Toolbox for Assessment of Affymetrix Expression Measures
• affydata: Affymetrix Data for Demonstration Purpose
• annaffy: Annotation tools for Affymetrix biological metadata
• annotate: Annotation for microarrays

Bioconductor packages
Release 1.3, October 28th, 2003

• ctc: Cluster and Tree Conversion
• daMA: Efficient design and analysis of factorial two-colour microarray data
• edd: expression density diagnostics
• externalVector: Vector objects for R with external storage
• factDesign: Factorial designed microarray experiment analysis
• gcrma: Background Adjustment Using Sequence Information
• genefilter: filter genes
• geneplotter: plot microarray data
• globaltest: Global Test
• gpls: Classification using generalized partial least squares
• graph: A package to handle graph data structures
• hexbin: Hexagonal Binning Routines
• limma: Linear Models for Microarray Data
• makecdfenv: CDF Environment Maker
• marrayClasses: Classes and methods for cDNA microarray data
• marrayInput: Data input for cDNA microarrays
• marrayNorm: Location and scale normalization for cDNA microarray data
• marrayPlots: Diagnostic plots for cDNA microarray data
• marrayTools: Miscellaneous functions for cDNA microarrays

Bioconductor packages
Release 1.3, October 28th, 2003

• matchprobes: Tools for sequence matching of probes on arrays
• multtest: Multiple Testing Procedures
• ontoTools: graphs and sparse matrices for working with ontologies
• pamr: Pam: prediction analysis for microarrays
• reposTools: Repository tools for R
• rhdf5: An HDF5 interface for R
• siggenes: Significance and Empirical Bayes Analyses of Microarrays
• splicegear: splicegear
• tkWidgets: R based tk widgets
• vsn: Variance stabilization and calibration for microarray data
• widgetTools: Creates interactive tcltk widgets

Microarray data analysis

[Diagram: packages used at each stage of a microarray analysis]

• Pre-processing, producing an exprSet
– CEL, CDF files: affy, vsn
– .gpr, .Spot, MAGEML files: marray, limma, vsn

• Differential expression: edd, genefilter, limma, multtest, ROC + CRAN

• Graphs & networks: graph, RBGL, Rgraphviz

• Cluster analysis: CRAN (class, cluster, MASS, mva)

• Annotation: annotate, annaffy + metadata packages

• Prediction: CRAN (class, e1071, ipred, LogitBoost, MASS, nnet, randomForest, rpart)

• Graphics: geneplotter, hexbin + CRAN

marray packages

Pre-processing two-color spotted array data:
• diagnostic plots,
• robust adaptive normalization (lowess, loess).

[Figures: maPlot + hexbin; maBoxplot; maImage]
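The idea behind intensity-dependent lowess normalization can be sketched in base R. This is not the marray implementation: the simulated intensities, the dye-bias model, and the smoother span `f = 0.4` are assumptions chosen for illustration, and only `lowess` and `approx` from the stats package are used.

```r
set.seed(42)
n <- 500

# Simulated green and red channel intensities with an intensity-dependent
# dye bias (illustrative data only).
green <- 2^runif(n, 6, 14)
red   <- green * 2^(rnorm(n, sd = 0.3) + 0.1 * (log2(green) - 10))

M <- log2(red) - log2(green)          # log-ratio per spot
A <- (log2(red) + log2(green)) / 2    # average log-intensity per spot

# Fit a lowess curve of M on A, then evaluate the curve at every spot.
fit   <- lowess(A, M, f = 0.4)
trend <- approx(fit$x, fit$y, xout = A, ties = mean)$y

# Subtract the fitted trend so that M no longer depends on A.
M_norm <- M - trend

c(before = cor(A, M), after = cor(A, M_norm))
```

Plotting `M` against `A` before and after the correction gives the kind of diagnostic plot the slide refers to.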

affy package

Pre-processing oligonucleotide chip data:
• diagnostic plots,
• background correction,
• probe-level normalization,
• computation of expression measures.

[Figures: image; plotDensity; plotAffyRNADeg; barplot.ProbeSet]
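Probe-level normalization can be done in several ways. As one common example, and not necessarily the method shown on the slide, quantile normalization forces every array to share the same intensity distribution. The base-R sketch below uses a made-up matrix; the function name `quantile_normalize` is hypothetical and is not part of the affy package.

```r
# Hypothetical helper: quantile-normalize the columns (arrays) of a matrix.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(x, 2, sort)                          # sort each array
  target <- rowMeans(sorted)                           # shared reference distribution
  normalized <- apply(ranks, 2, function(r) target[r]) # map ranks to reference values
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Made-up probe intensities: 4 probes (rows) on 3 chips (columns).
intensities <- cbind(chip1 = c(5, 2, 3, 4),
                     chip2 = c(4, 1, 4, 2),
                     chip3 = c(3, 4, 6, 8))

normalized <- quantile_normalize(intensities)
print(normalized)
```

After normalization, sorting any column gives the same vector of values, namely the row means of the column-wise sorted input.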

annotate, annaffy, and AnnBuilder

• Assemble and process genomic annotation data from public repositories.

• Build annotation data packages or XML data documents.

• Associate experimental data in real time to biological metadata from web databases such as GenBank, GO, KEGG, LocusLink, and PubMed.

• Process and store query results: e.g., search PubMed abstracts.

• Generate HTML reports of analyses.

Metadata package hgu95av2: mappings between different gene identifiers for the hgu95av2 chip.

AffyID: 41046_s_at
ACCNUM: X95808
LOCUSID: 9203
SYMBOL: ZNF261
GENENAME: zinc finger protein 261
MAP: Xq13.1
PMID: 10486218, 9205841, 8817323
GO: GO:0003677, GO:0007275, GO:0016021
+ many other mappings
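The mapping idea can be sketched with a plain named list standing in for an annotation data package. The values are taken from the hgu95av2 example on this slide; the list layout and the helper name `lookup` are assumptions for illustration and do not reproduce the interface of the annotate package.

```r
# Toy stand-in for an annotation data package: one entry per probe set.
annotation <- list(
  "41046_s_at" = list(
    ACCNUM   = "X95808",
    LOCUSID  = 9203,
    SYMBOL   = "ZNF261",
    GENENAME = "zinc finger protein 261",
    MAP      = "Xq13.1",
    GO       = c("GO:0003677", "GO:0007275", "GO:0016021")
  )
)

# Hypothetical helper: return the first value of one field for each probe ID,
# or NA when the probe ID is not in the table.
lookup <- function(ids, field) {
  sapply(ids, function(id) {
    entry <- annotation[[id]]
    if (is.null(entry)) NA_character_ else as.character(entry[[field]][1])
  })
}

symbols <- lookup(c("41046_s_at", "not_on_chip"), "SYMBOL")
print(symbols)
```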

MAGEML package

<!DOCTYPE MAGE-ML SYSTEM "D:/DATA/MAGE-ML/MAGE-ML.dtd">

<MAGE-ML identifier="MAGE-ML:E-SNGR-4">

<QuantitationTypeDimension_assnlist>

<QuantitationTypeDimension identifier="QTD:1">

<QuantitationTypes_assnreflist>

<MeasuredSignal_ref identifier="QT:F635 Median"/>

<MeasuredSignal_ref identifier="QT:F635 Mean"/>

….

marray packages (cDNA arrays)

SIGGENES PACKAGE - SAM

[Figure, two panels. "Delta vs. FDR": delta (x-axis, 0.2 to 1.8) against FDR in % (y-axis). "Delta vs. Significant Genes": delta (x-axis, 0.2 to 1.8) against number of significant genes (y-axis).]

multtest package

• Multiple hypothesis testing

• Control the type I error rate, e.g., with the Bonferroni method
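A minimal sketch of the Bonferroni adjustment, using `p.adjust` from the stats package rather than multtest so that it runs in a plain R session. The five raw p-values are made up for illustration.

```r
# Raw p-values from five hypothetical tests.
p_raw <- c(0.001, 0.012, 0.02, 0.04, 0.3)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Tests that remain significant at a family-wise error rate of 0.05.
significant <- p_bonf < 0.05

print(data.frame(p_raw, p_bonf, significant))
```

With five tests, only a raw p-value below 0.05 / 5 = 0.01 survives the adjustment.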

mva package – clustering

[Figure: heatmap]
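The clustering behind such a heat map can be sketched with `dist`, `hclust`, and `cutree`. In R 1.8.0 these functions were in the mva package; in current R releases they are in stats, which is attached by default. The simulated expression matrix and the choice of average linkage are assumptions for illustration.

```r
set.seed(7)

# Simulated expression matrix: 6 samples (rows) x 50 genes (columns),
# with the last three samples shifted to form a second group.
expr <- matrix(rnorm(6 * 50), nrow = 6,
               dimnames = list(paste0("sample", 1:6), NULL))
expr[4:6, ] <- expr[4:6, ] + 3

d      <- dist(expr)                      # Euclidean distances between samples
tree   <- hclust(d, method = "average")   # agglomerative hierarchical clustering
groups <- cutree(tree, k = 2)             # cut the tree into two clusters
print(groups)

# heatmap(t(expr)) would draw the clustered heat map on a graphics device.
```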

mva package – principal component analysis
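A minimal sketch of a principal component analysis with `prcomp` (in mva at the time of the talk, in stats in current R). The simulated data with two sample groups are an assumption for illustration.

```r
set.seed(7)

# Simulated expression matrix: 6 samples (rows) x 50 genes (columns),
# with the last three samples shifted to form a second group.
expr <- matrix(rnorm(6 * 50), nrow = 6,
               dimnames = list(paste0("sample", 1:6), NULL))
expr[4:6, ] <- expr[4:6, ] + 3

# Principal component analysis on the centered, unscaled data.
pca <- prcomp(expr, center = TRUE, scale. = FALSE)

# Proportion of the total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Sample coordinates on the first two components, e.g. for a scatter plot.
scores <- pca$x[, 1:2]
print(round(var_explained[1:2], 2))
print(scores)
```

In this simulated example the first component separates the two groups of samples.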

Getting started

Installation

1. Main R software: download from CRAN (cran.r-project.org), use latest release, now 1.8.0.

2. Bioconductor packages: download from Bioconductor (www.bioconductor.org), use latest release, now 1.3.

Available for Linux/Unix, Windows, and Mac OS.

Installation

• After installing R, install Bioconductor packages using the getBioC install script.

• From R:
> source("http://www.bioconductor.org/getBioC.R")
> getBioC()

• In general, R packages can be installed using the function install.packages.

• On Windows, you can also use the “Packages” pull-down menu.

User interaction

• R Command-line

• Widgets. Small-scale graphical user interfaces (GUI), providing point & click access for specific tasks.
– E.g. File browsing and selection for data input, basic analyses.

Widgets

[Screenshots: tkMIAME, tkphenoData, and tkSampleNames widgets; reading in phenoData]

Documentation and help

• R manuals and tutorials: available from the R website or on-line in an R session.

• R on-line help system: detailed on-line documentation, available in text, HTML, PDF, and LaTeX formats.
> help.start()
> help(lm)
> ?hclust
> apropos("mean")
> example(hclust)
> demo()
> demo(image)

Short courses

• Bioconductor short courses
– modular training segments on software and statistical methodology;
– lecture notes, computer labs, and course packages available on WWW for self-instruction.

Vignettes

• Bioconductor has adopted a new documentation paradigm, the vignette.

• A vignette is an executable document consisting of a collection of code chunks and documentation text chunks.

• Vignettes provide dynamic, integrated, and reproducible statistical documents that can be automatically updated if either data or analyses are changed.

• Each Bioconductor package contains at least one vignette, providing task-oriented descriptions of the package's functionality.

Vignettes

• HowTo’s: Task-oriented descriptions of package functionality.
• Executable documents consisting of documentation text and code chunks.
• Dynamic, integrated, and reproducible statistical documents.
• Can be used interactively – vExplorer.
• Generated using Sweave (tools package).

[Screenshot: vExplorer]
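How text chunks and code chunks combine can be sketched by writing a tiny Sweave source file and weaving it. The file contents and the chunk label `mean-chunk` are made up for illustration. The slides place Sweave in the tools package; in current R releases the function is in utils, which is attached by default, so it is called here without a package prefix.

```r
# Write a tiny Sweave source file: LaTeX text plus one R code chunk.
rnw <- tempfile(fileext = ".Rnw")
tex <- sub("\\.Rnw$", ".tex", rnw)
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the numbers 1 to 10 is computed below.",
  "<<mean-chunk>>=",
  "mean(1:10)",
  "@",
  "\\end{document}"
), rnw)

# Run the code chunk and weave its output into a LaTeX file.
Sweave(rnw, output = tex, quiet = TRUE)

# The woven file contains the text, the code, and the computed result.
woven <- readLines(tex)
print(woven)
```

Re-running `Sweave` after changing the code chunk regenerates the document with updated results, which is what makes a vignette reproducible.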

References

• R: www.r-project.org, cran.r-project.org
– software (CRAN);
– documentation;
– newsletter: R News;
– mailing list.

• Bioconductor: www.bioconductor.org
– software, data, and documentation (vignettes);
– training materials from short courses;
– mailing list.

Acknowledgements

• Robert Gentleman, Department of Biostatistical Science, Dana-Farber Cancer Institute, Boston

• Sandrine Dudoit, Division of Biostatistics, University of California, Berkeley