Bioconductor tutorial Adapted by Alex Sanchez from tutorials by (1) Steffen Durinck, Robert Gentleman and Sandrine Dudoit (2) Laurent Gautier (3) Matt Ritchie (4) Jean Yang

Post on 14-May-2020


Bioconductor tutorial

Adapted by Alex Sanchez from tutorials by
(1) Steffen Durinck, Robert Gentleman and Sandrine Dudoit
(2) Laurent Gautier
(3) Matt Ritchie
(4) Jean Yang


Outline

• The Bioconductor Project
• OOP in R and Bioconductor
• Start-up: Installation, Courses, Vignettes


The Bioconductor Project


Bioconductor

• Bioconductor is an open source and open development software project for the analysis of biomedical and genomic data.

• The project was started in the Fall of 2001 and includes more than 25 core developers in the US, Europe, and Australia.

• Releases
– v 1.0: May 2nd, 2002, 15 packages.
– v 1.1: November 18th, 2002, 20 packages.
– v 1.2: May 28th, 2003, 30 packages.
– …
– v 1.9: October 4, 2006, 188 packages.


Goals

• Provide access to powerful statistical and graphical methods for the analysis of genomic data.

• Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed) in the analysis of experimental data.

• Allow the rapid development of extensible, interoperable, and scalable software.

• Promote high-quality documentation and reproducible research.

• Provide training in computational and statistical methods.


Bioconductor packages

• R and the R package system are used to design and distribute Bioconductor software.

• An R package is a structured collection of code (R, C, or other), documentation, and/or data for performing specific types of analyses.

• Different types of packages
– Code
– Data (metadata or experimental data)
– Other (course packages …)


Bioconductor packages

• Code packages provide implementations of specialized statistical and graphical methods.

• Data packages:
– Biological metadata: mappings between different gene identifiers (e.g., AffyID, GO, LocusID, PMID), CDF and probe sequence information for Affy arrays. E.g. hgu95av2, GO, KEGG.
– Experimental data: code, data, and documentation for specific experiments or projects. yeastCC: Spellman et al. (1998) yeast cell cycle. golubEsets: Golub et al. (2000) ALL/AML data.

• Course packages: code, data, documentation, and labs for the instruction of a particular course. E.g. EMBO03course package.


Some packages arranged by their functionality

[Figure: packages grouped by task around the exprSet class]

• Pre-processing (result: exprSet)
– CEL, CDF input: affy, simpleaffy, vsn
– .gpr, .Spot, MAGEML input: marray, limma, vsn
• Differential expression: edd, genefilter, limma, multtest, ROC, + CRAN
• Graphs & networks: graph, RBGL, Rgraphviz
• Cluster analysis: CRAN (class, cluster, MASS, mva)
• Annotation: annotate, annaffy, + metadata packages
• Prediction: CRAN, MLInterfaces (class, e1071, ipred, LogitBoost, MASS, nnet, randomForest, rpart)
• Graphics: gplots, geneplotter, hexbin, + CRAN
• Data: estrogen, AMLL


Packages in Bioconductor 1.9

• There are three types of packages
– Software packages: http://www.bioconductor.org/packages/1.9/bioc/
– Metadata packages (annotations, CDF, probes): http://www.bioconductor.org/packages/1.9/data/annotation/
– Experimental data packages: http://www.bioconductor.org/packages/1.9/data/experiment/


OOP in Bioconductor and R


OOP

• The Bioconductor project has adopted the object-oriented programming (OOP) paradigm proposed in J. M. Chambers (1998). Programming with Data.

• This object-oriented class/method design allows efficient representation and manipulation of large and complex biological datasets of multiple types.

• Tools for programming using the class/method mechanism are provided in the R methods package.

• Tutorial: www.omegahat.org/RSMethods/index.html.


OOP: classes

• A class provides a software abstraction of a real world object. It reflects how we think of certain objects and what information these objects should contain.

• Classes are defined in terms of slots which contain the relevant data.

• An object is an instance of a class.

• A class defines the structure, inheritance, and initialization of objects.
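As a minimal sketch of these ideas, the following uses the methods package to define a class with slots, a subclass that inherits from it, and an object. The class names `Sample` and `TumorSample` and their slots are hypothetical, chosen only for illustration.

```r
library(methods)

# Hypothetical class: a sample with a name and numeric measurements.
setClass("Sample",
         representation = representation(name = "character",
                                         values = "numeric"),
         prototype = prototype(name = NA_character_, values = numeric()))

# Hypothetical subclass: inherits both slots and adds one of its own.
setClass("TumorSample",
         contains = "Sample",
         representation = representation(grade = "integer"))

# An object is an instance of a class, created with new().
s <- new("TumorSample", name = "patient01", values = c(2.1, 3.4), grade = 2L)

s@name            # slots are accessed with @
is(s, "Sample")   # a TumorSample is also a Sample
```

The `prototype` supplies default slot values, which is one way a class controls the initialization of its objects.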


OOP: methods

• A method is a function that performs an action on data (objects).
– Methods define how a particular function should behave depending on the class of its arguments.
– Methods allow computations to be adapted to particular data types, i.e., classes.

• A generic function is a dispatcher: it examines its arguments and determines the appropriate method to invoke.
– Examples of generic functions in R include plot, summary, print.


OOP: methods

• It is important to realize that when calling a generic function (such as plot), the actions performed depend on the class of the arguments.

• Methods define how a particular function should behave depending on the class of its arguments.

• Methods allow computations to be adapted to particular data types, i.e., classes.
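Dispatch can be illustrated with a small user-defined generic. This is only a sketch: the generic name `describe` is hypothetical and not part of R or Bioconductor.

```r
library(methods)

# Hypothetical generic: the dispatcher itself contains no real work.
setGeneric("describe", function(object) standardGeneric("describe"))

# One method per class of argument.
setMethod("describe", "numeric",
          function(object) paste("numeric vector of length", length(object)))
setMethod("describe", "character",
          function(object) paste("character vector of length", length(object)))

describe(c(1.5, 2.5, 3.5))   # dispatches to the numeric method
describe(c("a", "b"))        # dispatches to the character method

# List the methods currently defined for the generic.
showMethods("describe")
```

The same call, `describe(x)`, runs different code depending on the class of `x`, which is what happens when plot, summary, or print are applied to objects of different classes.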


Methods package

• The methods package contains a number of functions for defining new classes and methods (e.g. setClass, setMethod) and for working with these classes and methods.

• A tutorial is available at http://www.omegahat.org/RSMethods/index.html


Examples

> x <- 1:10
> y <- 2*x + 1 + rnorm(10)
> class(x)
[1] "integer"
> plot(x, y)

> fit <- lm(y ~ x)
> class(fit)
[1] "lm"
> plot(fit)


Examples

> setClass("simple", representation(x="numeric", y="matrix"), prototype = list(x=numeric(), y=matrix(0)))

> z <- new("simple", x=1:10, y=matrix(rnorm(50),10,5))

> z@x
 [1]  1  2  3  4  5  6  7  8  9 10

> setMethod("plot", signature(x="simple", y="missing"), function(x, y, ...) plot(slot(x,"x"), slot(x,"y")[,1]))

> plot(z)


Environments and closures

• An environment is an object that contains bindings between symbols and values.

• It is very similar to a hash table.

• Environments can be accessed using the following functions:
– ls(envir=e) # get a listing of objects in the environment e
– get("x", envir=e) # get the value of the object with name x in the environment e
– assign("x", y, envir=e) # assign to the name x the value y in the environment e
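The three accessors can be tried on a fresh environment. A minimal sketch using base R only; the bound names and values are made up for illustration.

```r
e <- new.env()

assign("x", 10, envir = e)                       # bind the name x to 10 in e
assign("genes", c("TP53", "BRCA1"), envir = e)   # bind a second name

ls(envir = e)                                # names bound in e
get("x", envir = e)                          # value bound to x in e
exists("y", envir = e, inherits = FALSE)     # y is not bound in e

# mget() retrieves several bindings at once as a named list.
vals <- mget(c("x", "genes"), envir = e)
```

Like a hash table, the environment is looked up by name, and assigning into it changes it in place rather than creating a copy.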


Environments and closures

• Environments can be associated with functions.

• When an environment is associated with a function, then that environment is used to obtain values for any unbound variables.

• The term closure refers to the coupling of the function body with the enclosing environment.

• The annotate, genefilter, and other packages take advantage of environments and closures.

• Annotation packages consist of different environments for different groups of annotations.
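A closure is most easily seen with a function that returns another function. The sketch below is in the spirit of a gene filter but is not code from genefilter; `make_filter`, `cutoff`, and `min_count` are hypothetical names.

```r
# Hypothetical function factory: the returned filter remembers `cutoff`
# and `min_count` in its enclosing environment.
make_filter <- function(cutoff, min_count) {
  function(x) sum(x > cutoff) >= min_count
}

keep <- make_filter(cutoff = 100, min_count = 2)

# cutoff and min_count are unbound in the body of keep(); R finds them in
# the environment where keep() was created.
ls(environment(keep))

# Toy expression matrix, genes in rows and samples in columns.
expr <- rbind(geneA = c(50, 120, 300),
              geneB = c(10, 20, 150))
apply(expr, 1, keep)   # geneA passes the filter, geneB does not
```

Because the parameters travel with the function, the filter can be passed to `apply` or any other function without supplying them again.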


Examples

> x <- 4
> e1 <- new.env()
> assign("x", 10, envir=e1)
> f <- function() x
> environment(f) <- e1
> x    # returns 4
> f()  # returns 10!


exprSet class

Processed Affymetrix or spotted array data

– exprs: Matrix of expression measures, genes x samples
– se.exprs: Matrix of SEs for expression measures, genes x samples
– phenoData: Sample level covariates, instance of class phenoData
– annotation: Name of annotation data
– description: MIAME information
– notes: Any notes

• Use of object-oriented programming to deal with data complexity.

• S4 class/method mechanism (methods package).
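To show how such a container can be expressed with the S4 mechanism, here is a toy class loosely modelled on some of the slots listed above. It is an assumption-laden sketch, not Biobase's exprSet: the class name `MiniExprSet`, the accessor `exprMatrix`, and the validity rule are all hypothetical.

```r
library(methods)

# Toy container with a subset of the slots; NOT Biobase's exprSet class.
setClass("MiniExprSet",
         representation = representation(exprs = "matrix",
                                         se.exprs = "matrix",
                                         annotation = "character",
                                         notes = "character"),
         validity = function(object) {
           # If standard errors are supplied they must match exprs in shape.
           if (length(object@se.exprs) > 0 &&
               !identical(dim(object@exprs), dim(object@se.exprs)))
             return("exprs and se.exprs must have the same dimensions")
           TRUE
         })

# Hypothetical accessor generic, so callers need not touch slots directly.
setGeneric("exprMatrix", function(object) standardGeneric("exprMatrix"))
setMethod("exprMatrix", "MiniExprSet", function(object) object@exprs)

m <- matrix(rnorm(12), nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))
es <- new("MiniExprSet", exprs = m, annotation = "hgu95av2")

dim(exprMatrix(es))   # genes x samples
```

Keeping the expression matrix and its annotation in one validated object is what lets downstream functions rely on the pieces being consistent with each other.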


AffyBatch class

Probe-level intensity data for a batch of arrays (same CDF)

– cdfName: Name of CDF file for arrays in the batch
– nrow, ncol: Dimensions of the array
– exprs, se.exprs: Matrices of probe-level intensities and SEs; rows probe cells, columns arrays.
– phenoData: Sample level covariates, instance of class phenoData
– annotation: Name of annotation data
– description: MIAME information
– notes: Any notes


marrayRaw class

Pre-normalization intensity data for a batch of arrays

– maRf, maGf: Matrix of red and green foreground intensities
– maRb, maGb: Matrix of red and green background intensities
– maW: Matrix of spot quality weights
– maLayout: Array layout parameters - marrayLayout
– maGnames: Description of spotted probe sequences - marrayInfo
– maTargets: Description of target samples - marrayInfo
– maNotes: Any notes


Getting Started


Installation

1. Main R software: download from CRAN (cran.r-project.org), use latest release.

2. Bioconductor packages: download from Bioconductor (www.bioconductor.org), use latest release.

Available for Linux/Unix, Windows, and Mac OS.


Installation

• After installing R, install Bioconductor packages using the getBioC install script.

• From R:
> source("http://www.bioconductor.org/getBioC.R")
> getBioC()

• In general, R packages can be installed using the function install.packages.

• In Windows, the "Packages" pull-down menus can also be used.
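The getBioC script belongs to the Bioconductor 1.9 era covered by these slides. In current Bioconductor releases, packages are installed through the BiocManager package from CRAN. The following is a minimal sketch assuming internet access and a configured CRAN mirror; the helper name `install_if_missing` is hypothetical.

```r
# Hypothetical helper: install only the packages that are not yet available.
install_if_missing <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  todo <- unname(pkgs[!available])
  if (length(todo) > 0) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")   # BiocManager itself lives on CRAN
    BiocManager::install(todo)
  }
  invisible(todo)   # the packages that had to be installed
}

# Example call (commented out because it needs network access):
# install_if_missing(c("Biobase", "limma"))
```

Checking with `requireNamespace` first avoids reinstalling packages that are already present.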


Installing vs. loading

• Packages only need to be installed once.

• But … packages must be loaded with each new R session.

• Packages are loaded using the function library. From R:
> library(Biobase)
or the "Packages" pull-down menus in Windows.

• To update packages, use the function update.packages or the "Packages" pull-down menus in Windows.

• To quit:
> q()
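The difference between an installed package and a loaded one can be checked from within a session. A small sketch using the stats package as a stand-in, since it ships with R; substitute Biobase or any other installed package.

```r
# TRUE if the package is installed (loads its namespace without attaching it).
requireNamespace("stats", quietly = TRUE)

# Attach the package to the search path for this session.
library(stats)
"package:stats" %in% search()

# Record which version is in use.
packageVersion("stats")
```

`search()` lists what is attached in the current session only, which is why `library` has to be called again after restarting R.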


Documentation and help

• R manuals and tutorials: available from the R website or on-line in an R session.

• R on-line help system: detailed on-line documentation, available in text, HTML, PDF, and LaTeX formats.
> help.start()
> help(lm)
> ?hclust
> apropos("mean")
> example(hclust)
> demo()
> demo(image)


Short courses

• Bioconductor short courses
– modular training segments on software and statistical methodology;
– lecture notes, computer labs, and course packages available on WWW for self-instruction.


Vignettes

• Bioconductor has adopted a new documentation paradigm, the vignette.

• A vignette is an executable document consisting of a collection of code chunks and documentation text chunks.

• Vignettes provide dynamic, integrated, and reproducible statistical documents that can be automatically updated if either data or analyses are changed.

• Vignettes can be generated using the Sweave function (found in the utils package in current R; early R releases shipped it in tools).
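To make the idea of code chunks and text chunks concrete, the sketch below writes a tiny Sweave source file and processes it. The file name `demo.Rnw` and the chunk label are made up for illustration; only base R is needed, and no LaTeX installation is required to produce the .tex file.

```r
# A tiny Sweave source: LaTeX text plus one R code chunk between <<>>= and @.
rnw <- c("\\documentclass{article}",
         "\\begin{document}",
         "The mean of the first ten integers:",
         "<<meanchunk, echo=TRUE>>=",
         "mean(1:10)",
         "@",
         "\\end{document}")
src <- file.path(tempdir(), "demo.Rnw")
writeLines(rnw, src)

# Sweave() writes its output to the working directory, so switch there first.
old_wd <- setwd(tempdir())
utils::Sweave(src, quiet = TRUE)    # runs the chunk, produces demo.tex
utils::Stangle(src, quiet = TRUE)   # extracts just the R code to demo.R
setwd(old_wd)

tex <- readLines(file.path(tempdir(), "demo.tex"))
```

Re-running `Sweave` after changing the data or the code regenerates the output, which is what makes the document reproducible.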


Vignettes

• Each Bioconductor package contains at least one vignette, providing task-oriented descriptions of the package's functionality.

• Vignettes are located in the doc subdirectory of an installed package and are accessible from the help browser.

• Vignettes can be used interactively.

• Vignettes are also available separately from the Bioconductor website.
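In current R, installed vignettes can also be listed and located from the console. A brief sketch; grid is used only because it comes with R, and any installed Bioconductor package name can be substituted.

```r
# List the vignettes shipped with an installed package.
v <- vignette(package = "grid")
v$results[, c("Item", "Title")]

# The doc subdirectory of the installed package holds the vignette files.
system.file("doc", package = "grid")

# Open the vignette index in the help browser (interactive sessions only):
# browseVignettes("grid")
```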


Vignettes

• Tools are being developed for managing and using this repository of step-by-step tutorials
– Biobase: openVignette – Menu of available vignettes and interface for viewing vignettes (PDF).
– tkWidgets: vExplorer – Interactive use of vignettes.
– reposTools.


Vignettes

vExplorer

• HowTo's: Task-oriented descriptions of package functionality.
• Executable documents consisting of documentation text and code chunks.
• Dynamic, integrated, and reproducible statistical documents.
• Can be used interactively – vExplorer.
• Generated using Sweave (utils package in current R).


References

• R www.r-project.org, cran.r-project.org
– software (CRAN);
– documentation;
– newsletter: R News;
– mailing list.

• Bioconductor www.bioconductor.org
– software, data, and documentation (vignettes);
– training materials from short courses;
– mailing list.