gautier bosc2010 pythonbioconductor

Bioconductor with Python, What else ? ISMB / BOSC Laurent Gautier [[email protected]] DMAC / CBS July 10th, 2010 1 / 20


Page 1: Gautier bosc2010 pythonbioconductor

Bioconductor with Python, What else?

ISMB / BOSC

Laurent Gautier [[email protected]]

DMAC / CBS

July 10th, 2010

1 / 20

Page 2: Gautier bosc2010 pythonbioconductor

Disclaimer

• This is not about the comparative merits of scripting languages
• This is about being able to natively access libraries implemented in a different language

2 / 20

Page 3: Gautier bosc2010 pythonbioconductor

About Bioconductor

• Set of open-source packages for R
• Started circa 2002 with a focus on microarrays
• Rooted in statistics, data analysis, and visualization
• Several hundred packages, addressing NGS, HTS, flow cytometry, protein-protein interactions, ...
• Biannual releases
• Presence on the publication circuit (> 2,300 citations for the BioC publication, > 600 for limma, > 500 for affy)

3 / 20

Page 4: Gautier bosc2010 pythonbioconductor

About Python

• Simple and clear all-purpose scripting language
• Sometimes used in introductions to programming
• Popular for agile development
• Bioinformatics libraries:
    • biopython (libraries for bioinformatics)
    • galaxy (web front-end to pipelines)
    • PyCogent, pygr, bx-python (biological sequences-oriented)
• Large selection of libraries:
    • Web development: Zope, Django, Google App Engine
    • Scientific computing: Scipy / Numpy
    • Cloud computing: Disco, execnet
    • Interface with C: ctypes, Cython

4 / 20

Page 5: Gautier bosc2010 pythonbioconductor

A view on R/bioconductor and Python in bioinformatics

[Figure: diagram relating bioinformatics data, R/Bioconductor, Python, and their communities. Labels recovered from the diagram, in extraction order:]

• Bioinformatics data: automation, storage / retrieval, samples, microarray, NGS, annotation, flow-cytometry, proteomics, other assays...
• R/Bioconductor: statistical analysis, visualization, interactive programming
• Python: non-interactive abilities, data storage / retrieval, web, algorithm development, scientific computing
• Communities: computer scientists, physicists, biologists, statisticians

Python is an all-purpose scripting language.

5 / 20


Page 8: Gautier bosc2010 pythonbioconductor

Running R code from Python (an example)

Aim
Running edgeR from Python

Method
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140

Data (Control and Treated lanes)

                 lane1  lane2  lane3  lane4  lane5  lane6  lane8
ENSG00000230758      0      0      1      0      0      0      0
ENSG00000182463      0      2      4      1      5      5      0
ENSG00000124208     82    124    102    136     90    120     40
ENSG00000230753      0      0      0      3      0      0      0
ENSG00000224628      7      8      8     18      8      7      1
ENSG00000125835    138    209    227    295    281    220     54
ENSG00000125834     25     31     48     56     67     61     15
ENSG00000197818     17     27     16     26     41     39      9
ENSG00000243473      0      0      0      2      0      0      0
ENSG00000226325      0      0      2      0      3      1      0
...                ...    ...    ...    ...    ...    ...    ...

7 / 20

Page 9: Gautier bosc2010 pythonbioconductor

from rpy2.robjects.packages import importr
from bioc import edger

base = importr('base')

# counts: matrix of read counts; grp: group label for each lane
# (both are prepared beforehand and not shown on the slide)
summarized = edger.DGEList.new(counts = counts,
                               lib_size = base.colSums(counts),
                               group = grp)

# estimate the common dispersion, test, and list the top-ranked tags
disp = edger.estimateCommonDisp(summarized)

tested = edger.exactTest(disp)

results = edger.topTags(tested)

                 logConc  logFC  PValue   FDR
ENSG00000127954   -31.03  37.97    0.00  0.00
ENSG00000151503   -12.96   5.40    0.00  0.00
ENSG00000096060   -11.78   4.90    0.00  0.00
ENSG00000091879   -15.36   5.77    0.00  0.00
ENSG00000132437   -14.15  -5.90    0.00  0.00
ENSG00000166451   -12.62   4.57    0.00  0.00
ENSG00000131016   -14.80   5.27    0.00  0.00
ENSG00000163492   -17.28   7.30    0.00  0.00
ENSG00000113594   -12.25   4.05    0.00  0.00
ENSG00000116285   -13.02   4.11    0.00  0.00

8 / 20

Page 10: Gautier bosc2010 pythonbioconductor

R code / Python code

R code:

library(edgeR)
summarized <- DGEList(counts = counts,
                      lib.size = colSums(counts),
                      group = grp)

disp <- estimateCommonDisp(summarized)

Python code:

from rpy2.robjects.packages import importr
base = importr('base')
from bioc import edger

summarized = edger.DGEList.new(counts = counts,
                               lib_size = base.colSums(counts),
                               group = grp)

disp = edger.estimateCommonDisp(summarized)

Note:
• namespaces are searched explicitly
• R functions are called as native Python functions
• R objects are used as Python objects
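The R call uses lib.size where the Python call uses lib_size, since a dot cannot appear inside a Python identifier. A minimal sketch of that kind of name translation follows; it is illustrative only, not rpy2's actual implementation, and translate_r_names is a hypothetical helper.

```python
def translate_r_names(r_names):
    """Map R-style names to Python-safe names by replacing '.' with '_'.

    Returns a dict {python_name: r_name}. Raises ValueError when two
    different R names would collide after translation.
    """
    mapping = {}
    for r_name in r_names:
        py_name = r_name.replace(".", "_")
        if py_name in mapping and mapping[py_name] != r_name:
            raise ValueError(
                "%r and %r both translate to %r"
                % (mapping[py_name], r_name, py_name)
            )
        mapping[py_name] = r_name
    return mapping


# Argument names taken from the DGEList call on the slide.
names = translate_r_names(["counts", "lib.size", "group"])
print(names["lib_size"])
```

The collision check matters because R allows both lib.size and lib_size as distinct names, so a silent translation could hide one of them.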

9 / 20

Page 11: Gautier bosc2010 pythonbioconductor

Bioconductor library IRanges

10 / 20

Page 12: Gautier bosc2010 pythonbioconductor

Bioconductor library Biostrings

11 / 20

Page 13: Gautier bosc2010 pythonbioconductor

Separate communities

12 / 20

Page 14: Gautier bosc2010 pythonbioconductor

Bilingual community

13 / 20

Page 15: Gautier bosc2010 pythonbioconductor

Interpreters/Translators

14 / 20

Page 16: Gautier bosc2010 pythonbioconductor

Cost of translation

R package       Lines of code   Python module
AnnotationDbi   168             annotationdbi.py
Biobase         341             biobase.py
Biostrings      591             biostrings.py
BSgenome        112             bsgenome.py
edgeR           107             edger.py
GEOquery        102             geoquery.py
GGbase          104             ggbase.py
GGtools          77             ggtools.py
goseq            43             goseq.py
GSEABase        149             gseabase.py
IRanges         295             iranges.py
ShortRead       301             shortread.py
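To put the per-module figures in perspective, the short script below adds them up. The numbers are copied from the slide; reading them as lines of code per Python module is an assumption based on the column header.

```python
# Line counts per Python module, copied from the slide.
lines_of_code = {
    "annotationdbi.py": 168,
    "biobase.py": 341,
    "biostrings.py": 591,
    "bsgenome.py": 112,
    "edger.py": 107,
    "geoquery.py": 102,
    "ggbase.py": 104,
    "ggtools.py": 77,
    "goseq.py": 43,
    "gseabase.py": 149,
    "iranges.py": 295,
    "shortread.py": 301,
}

n_modules = len(lines_of_code)
total = sum(lines_of_code.values())
mean = float(total) / n_modules

print("modules:", n_modules)
print("total lines:", total)
print("mean lines per module: %.1f" % mean)
```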

15 / 20

Page 17: Gautier bosc2010 pythonbioconductor

R within Python

• R is running embedded in Python
• R objects remain in the R workspace, but can be accessed from Python
• Python-level shells to access the R objects
• The rpy2 package is used to achieve this

from rpy2.robjects import conversion
from rpy2.robjects.packages import importr

biostrings = importr('Biostrings')

# XString and _setExtractDelegators are defined elsewhere in the module
class AAString(XString):

    # R-level constructor wrapped by this Python class
    _aastring_constructor = biostrings.AAString

    @classmethod
    def new(cls, x):
        """ :param x: a string of amino-acids """
        res = cls(cls._aastring_constructor(conversion.py2ri(x)))
        _setExtractDelegators(res)
        return res

aas = AAString("PROTEIN")
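The pattern described on this slide, objects staying on the R side with thin Python classes acting as shells around them, can be sketched without R at all. In the sketch below a plain dictionary stands in for the R workspace; FakeWorkspace, Shell, and their methods are hypothetical names and not part of rpy2.

```python
class FakeWorkspace:
    """Stand-in for the embedded R workspace: owns objects, hands out handles."""

    def __init__(self):
        self._objects = {}
        self._next_handle = 0

    def create(self, value):
        handle = self._next_handle
        self._next_handle += 1
        self._objects[handle] = value
        return handle

    def call(self, handle, method, *args):
        # Run the method where the object lives; only the result crosses over.
        return getattr(self._objects[handle], method)(*args)


class Shell:
    """Python-level shell: stores a handle, never a copy of the object."""

    workspace = FakeWorkspace()

    def __init__(self, handle):
        self._handle = handle

    @classmethod
    def new(cls, x):
        # Same shape as the `new` classmethod on the slide:
        # build the object on the other side, then wrap the handle.
        return cls(cls.workspace.create(x))

    def __len__(self):
        return self.workspace.call(self._handle, "__len__")

    def count(self, letter):
        return self.workspace.call(self._handle, "count", letter)


aas = Shell.new("PROTEIN")
print(len(aas), aas.count("P"))
```

Keeping only a handle on the Python side means large objects are never copied between the two environments; each method call is forwarded instead.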

16 / 20

Page 18: Gautier bosc2010 pythonbioconductor

What is needed to continue

More interpreters/translators
• Many bioconductor packages.
• Keep existing translations up to date.

Keeping up-to-date
• Frequent API-breaking changes in bioconductor
• Tailored interfaces increase maintenance
• Meta-programming and reflection can alleviate this

17 / 20

Page 19: Gautier bosc2010 pythonbioconductor

Example with meta-programming:

import rpy2.robjects.methods

class AssayData(rpy2.robjects.methods.RS4):
    """ Abstract class. That class is a ClassUnionRepresentation
    in R, that is a way to create a parent class for existing
    classes. This is currently not modelled in Python. """
    __rname__ = 'AssayData'
    __metaclass__ = rpy2.robjects.methods.RS4_Type

    # each entry names an R function, its package, and the Python name to create
    __accessors__ = (('featureNames', 'Biobase', 'featurenames',
                      True, 'maps Biobase::featureNames'),
                     ('sampleNames', 'Biobase', 'samplenames',
                      True, 'maps Biobase::sampleNames'),
                     ('storageMode', 'Biobase', 'storagemode',
                      True, 'maps Biobase::storageMode'))
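How a metaclass can turn such a declarative tuple into methods may be easier to see in an R-free sketch. The version below uses Python 3 syntax and a dictionary of plain functions in place of R; FAKE_R and AccessorType are hypothetical, and reading the fourth field as "expose as a property" is an assumption about the tuple layout. It is not the RS4_Type implementation.

```python
# Stand-in for R functions, keyed by (package, function name). Hypothetical.
FAKE_R = {
    ("Biobase", "featureNames"): lambda obj: obj["features"],
    ("Biobase", "sampleNames"): lambda obj: obj["samples"],
}


class AccessorType(type):
    """Metaclass turning each __accessors__ entry into a method or property."""

    def __new__(mcs, name, bases, namespace):
        for entry in namespace.get("__accessors__", ()):
            r_name, r_package, py_name, as_property, doc = entry
            func = FAKE_R[(r_package, r_name)]

            # Bind func through a default argument so each accessor
            # keeps its own function rather than the last one in the loop.
            def accessor(self, _func=func):
                return _func(self._data)

            accessor.__doc__ = doc
            if as_property:
                namespace[py_name] = property(accessor, doc=doc)
            else:
                namespace[py_name] = accessor
        return super().__new__(mcs, name, bases, namespace)


class AssayData(metaclass=AccessorType):
    __accessors__ = (
        ("featureNames", "Biobase", "featurenames", True,
         "maps Biobase::featureNames"),
        ("sampleNames", "Biobase", "samplenames", False,
         "maps Biobase::sampleNames"),
    )

    def __init__(self, data):
        self._data = data


ad = AssayData({"features": ["ENSG00000230758"], "samples": ["lane1", "lane2"]})
print(ad.featurenames)    # property access
print(ad.samplenames())   # regular method call
```

With this layout, following an API change on the R side means editing one tuple entry rather than rewriting a hand-written method.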

18 / 20

Page 20: Gautier bosc2010 pythonbioconductor

Example of a complete application

A web-server to run edgeR.

from bottle import abort, request, route, run
# read_count_data is assumed to live in my_edger with the other helpers
from my_edger import get_toptags, make_results_page, read_count_data

@route('/')
def index():
    # upload form posting the count file to /edger
    return '''<html> <body>
<form action="/edger" method="post" enctype="multipart/form-data">
<input type="file" name="data" /> <input type="submit" />
</form>
</body> </html>'''

@route('/edger', method='POST')
def run_edger():
    data = request.files.get('data')
    if data:
        counts, grp = read_count_data(data.file.name)
        top_tags = get_toptags(counts, grp)
        return make_results_page(top_tags)
    else:
        abort(404, "Invalid count file.")

run(host='localhost', port=8080)
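The helper read_count_data is not shown on the slide. As a rough idea of what such a step involves, the sketch below parses a whitespace-delimited count table into plain Python lists. The function name parse_count_table, its signature, and the lane-to-group mapping are all assumptions; the real helper presumably returns R objects suitable for edgeR.

```python
import io


def parse_count_table(lines, groups):
    """Parse a whitespace-delimited count table.

    lines  -- iterable of text lines; the first is the header of lane names,
              each following line is a gene identifier followed by counts
    groups -- dict mapping lane name to group label (supplied by the caller)
    Returns (gene_ids, lanes, counts, grp) as plain Python lists.
    """
    rows = [line.split() for line in lines if line.strip()]
    lanes = rows[0]
    gene_ids, counts = [], []
    for fields in rows[1:]:
        if len(fields) != len(lanes) + 1:
            raise ValueError("unexpected number of columns for %s" % fields[0])
        gene_ids.append(fields[0])
        counts.append([int(value) for value in fields[1:]])
    grp = [groups[lane] for lane in lanes]
    return gene_ids, lanes, counts, grp


# Two rows and three lanes taken from the data slide.
example = io.StringIO(
    "lane1 lane2 lane3\n"
    "ENSG00000182463 0 2 4\n"
    "ENSG00000124208 82 124 102\n"
)
# The lane-to-group assignment below is made up for the example.
gene_ids, lanes, counts, grp = parse_count_table(
    example, {"lane1": "A", "lane2": "A", "lane3": "B"}
)
print(gene_ids, counts, grp)
```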

19 / 20

Page 21: Gautier bosc2010 pythonbioconductor

Acknowledgements

• Users, and communities from R, Bioconductor, Python, Biopython
• (Vincent Davis, Nicolas Rapin, Brad Chapman)

URLs

http://pypi.python.org/pypi/rpy2-bioconductor-extensions/

http://bitbucket.org/lgautier/rpy2-bioc-extensions

http://packages.python.org/rpy2-bioconductor-extensions/

http://rpy2.sourceforge.net/

20 / 20
