moeller bosc2010 debian_taverna

2010, Boston Community-driven computational biology with Debian and Taverna Steffen Möller, Hajo Krabbenhöft (Lübeck) Alan Williams, Katy Wolstencroft, Carole Goble (Manchester) Andreas Tille, Charles Plessy, David Paleino (Debian) BOSC 2010, Boston

Upload: bosc-2010

Post on 11-May-2015

TRANSCRIPT

Page 1: Moeller bosc2010 debian_taverna

2010,  Boston

Community-driven computational biology with Debian and Taverna

Steffen Möller, Hajo Krabbenhöft (Lübeck)

Alan Williams, Katy Wolstencroft, Carole Goble (Manchester)

Andreas Tille, Charles Plessy, David Paleino (Debian)

BOSC 2010, Boston

Page 2: Moeller bosc2010 debian_taverna

Motivation

● Open Source Bioinformatics continues to grow and improve
  ● steadily increasing number of tools and databases
  ● addressing more and more complex issues
● Bioinformatics found entry into wet-lab routine
  ● strong service units with many diverse projects
  ● single deeply embedded individuals
● Wanted:
  ● Exchange of bioinformatics recipes, as a database or eventually linked from papers' method sections
  ● Reliable, instantly available, powerful external resources to perform analysis

Page 3: Moeller bosc2010 debian_taverna

Dual role of Cloud technologies

● Sharing of physical resources
  ● Computation
  ● Storage
● Sharing of management resources
  ● Reference images
  ● Pre-downloaded, pre-indexed data
    – Amazon public data sets
    – “whatever BOSC 2010 agrees on” for our Eucalyptus playground

Page 4: Moeller bosc2010 debian_taverna

How to Co-Maintain Cloud Images

● Cloud images can be maintained just like regular machines
● The installation of many tools by many people
  ● works, you get somewhere, but then you don't want to touch it again
  ● is error-prone because of inter-dependencies of packages (shared files, version incompatibilities)
● The partial update of such co-maintained images
  ● will most likely break something somewhere → modularity
  ● you want to know what has been done to an image without a dependency on external web pages → introspection
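The introspection point can be made concrete by comparing the package listings of two images. The following is a minimal sketch, assuming each image exports its listing with `dpkg-query -W -f='${Package}\t${Version}\n'`; the function names and the package versions shown are made up for illustration.

```python
def parse_package_list(text):
    """Parse lines of 'package<TAB>version' into a dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, version = line.split("\t", 1)
        packages[name] = version
    return packages


def diff_images(old, new):
    """Return packages added, removed, and changed between two images."""
    added = {p: v for p, v in new.items() if p not in old}
    removed = {p: v for p, v in old.items() if p not in new}
    changed = {p: (old[p], new[p]) for p in old if p in new and old[p] != new[p]}
    return added, removed, changed


# Hypothetical listings (versions are illustrative, not real releases), e.g. from
#   dpkg-query -W -f='${Package}\t${Version}\n'
reference_image = "emboss\t6.1.0-5\nt-coffee\t8.14-1\n"
updated_image = "emboss\t6.2.0-1\nt-coffee\t8.14-1\nclustalw\t2.0.12-1\n"

added, removed, changed = diff_images(
    parse_package_list(reference_image), parse_package_list(updated_image)
)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Because the listing comes from the package database on the image itself, the comparison needs no external web page.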

Page 5: Moeller bosc2010 debian_taverna

How to Co-Maintain Cloud Images

Wanted:
● Mechanism to allow the individual upgrading of software tools and integrity checks
● Sharing of the effort
  – to compile the source code: one wants to install the binaries only whenever possible
  – to describe the packages: should be of little overhead or be already available

This is basically what Linux distributions do.

Page 6: Moeller bosc2010 debian_taverna

Dual role of Debian

● Package provider
  ● many tens of thousands of packages are offered
    – directly as a Linux distribution
    – indirectly via descendants such as Ubuntu or BioLinux
  ● technical excellence
    – coherent builds across many platforms (PowerPC, Intel 32 and 64 bit, AMD, MIPS) and kernels (Linux, HURD, BSD, OpenSolaris)
    – separation of documentation from binaries, GUI from command line, ...
● Community
  ● bug reports
  ● mailing lists and special interest groups, where you may discuss
    – packages that are missing
    – problems that many of us have that are yet unsolved

Page 7: Moeller bosc2010 debian_taverna

bioinformatics blend

● subversion and git repositories for packages
● friendly and open community
● keen on close links with upstream
● Series of tasks within Debian Med, not only bioinformatics:
  – Biology: Debian Med micro-biology packages
  – Biology development: Debian Med packages for development of micro-biology applications
  – Content management: Debian Med content management systems
  – Medical data: Debian Med suggestions for medical databases
  – Dental: Debian Med packages related to dental practice
  – Epidemiology: Debian Med epidemiology related packages
  – Hospital information systems: Debian Med suggestions for Hospital Information Systems
  – Imaging: Cross-platform for visualizing, processing and analysing of bioimages
  – Imaging development: Debian Med packages for medical image development
  – Laboratory: Debian Med suggestions for medical laboratories
  – Pharmacy: Debian Med packages for pharmaceutical research
  – Physics: Debian Med packages for medical physicists
  – Practice: Debian Med packages for practice management
  – Psychology: Debian Med packages for psychology
  – Statistics: Debian Med statistics
  – Tools: Debian Med several tools
  – Typesetting: Debian Med support for typesetting and publishing

Page 8: Moeller bosc2010 debian_taverna

How to Co-Maintain a Debian Package

● Technically
  ● Do not touch the original source tree
  ● Create folder “debian” with files
    – “control”: description of package + build dependencies
    – “changelog”: version of package and what's new
    – “rules”: how to say “make” and “make install”
    – “install”: to split documentation from the rest
    This should not be more difficult than executing “make all” directly; contact me or the list when running into problems.
  ● FTP-upload of package to distribution's server
  ● Sharing of “debian” folder with community via subversion/git/bazaar
● Community-driven security
  ● Web of trust: the creator of a package signs it with their GPG key prior to upload, and that GPG key is signed by others
  ● Bug reports may block transition of package to “stable” release
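To illustrate how small the “debian” folder can be, the sketch below writes skeleton versions of the four files named on this slide into a scratch directory. The package name `mytool`, the maintainer, the version numbers, and the install path are placeholders, and a real package needs more than this (for example a copyright file), so treat it as an outline rather than a complete packaging recipe.

```python
import tempfile
from pathlib import Path

# Placeholder package name, maintainer, and versions; not a real package.
FILES = {
    "control": (
        "Source: mytool\n"
        "Section: science\n"
        "Priority: optional\n"
        "Maintainer: Jane Doe <jane@example.org>\n"
        "Build-Depends: debhelper (>= 7)\n"
        "Standards-Version: 3.8.4\n"
        "\n"
        "Package: mytool\n"
        "Architecture: any\n"
        "Depends: ${shlibs:Depends}, ${misc:Depends}\n"
        "Description: hypothetical sequence analysis tool\n"
        " Longer description, indented by one space.\n"
    ),
    "changelog": (
        "mytool (1.0-1) unstable; urgency=low\n"
        "\n"
        "  * Initial release.\n"
        "\n"
        " -- Jane Doe <jane@example.org>  Fri, 09 Jul 2010 12:00:00 +0000\n"
    ),
    # Delegates "make" and "make install" to debhelper's dh sequencer.
    "rules": "#!/usr/bin/make -f\n%:\n\tdh $@\n",
    # "source destination-directory" pairs, as read by dh_install.
    "install": "doc/* usr/share/doc/mytool\n",
}


def write_debian_dir(source_tree):
    """Create <source_tree>/debian with the four files; return their names."""
    debian = Path(source_tree) / "debian"
    debian.mkdir(exist_ok=True)
    for name, content in FILES.items():
        (debian / name).write_text(content)
    (debian / "rules").chmod(0o755)  # rules is run as an executable makefile
    return sorted(p.name for p in debian.iterdir())


with tempfile.TemporaryDirectory() as tree:
    created = write_debian_dir(tree)
print(created)
```

The upstream sources stay untouched: everything the packaging adds lives in the one extra folder, which is what makes it easy to share through version control.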

Page 9: Moeller bosc2010 debian_taverna

Something's missing

● We now have the resources.
  ● packages that auto-transform into Cloud images
  ● machines and disk to compute and store in-/output
● We have quite some Bio* community
● Wanted:
  ● Linking of cloud resources with the desktop
  ● Linking of web resources into it
  ● Exchange and reference of
    – Inter-package
    – Inter-resource
    processes that (have) work(ed for someone) and may be adapted

Page 10: Moeller bosc2010 debian_taverna

Dual role of Taverna

● Technology:
  ● Connects files, web services and applications to workflows
  ● Workflows may comprise other workflows
● Community:
  ● Portal to complete and partial solutions as workflows on myExperiment.org

Page 11: Moeller bosc2010 debian_taverna

Taverna integrates command line

● Any command executed in the shell can be integrated
  ● local execution, remote execution with ssh or grid
  ● nicely links clouds, packages and web
● Introduction of UseCases as workflow elements
  ● Database with XML specification of
    – Inputs, outputs and their MIME types
    – Command line and tools it needs
  ● Purpose-specific wrappers around binaries or scripts

Krabbenhöft et al., Bioinformatics, 2008
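The idea of describing a command-line tool by its inputs, outputs, MIME types, command line, and required tools can be sketched as follows. The XML layout, the `%%name%%` placeholder syntax, and the `align_tool` command with its flags are all invented for this example; the actual UseCase schema is not shown in the slides and may look quite different.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical description; element names, attributes, and the tool are made up.
USECASE_XML = """
<usecase name="align_sequences">
  <command>align_tool --in %%infile%% --out %%outfile%%</command>
  <input name="infile" mime="text/plain"/>
  <output name="outfile" mime="text/plain"/>
  <requires tool="align-tool"/>
</usecase>
"""


def build_command(xml_text, bindings):
    """Turn a use-case description plus port bindings into an argv list."""
    root = ET.fromstring(xml_text)
    ports = [e.get("name") for e in root.findall("input")]
    ports += [e.get("name") for e in root.findall("output")]
    missing = [p for p in ports if p not in bindings]
    if missing:
        raise ValueError(f"unbound ports: {missing}")
    argv = []
    # Split first, substitute second, so file names with spaces stay one argument.
    for token in shlex.split(root.findtext("command")):
        for port in ports:
            token = token.replace(f"%%{port}%%", bindings[port])
        argv.append(token)
    return argv


def required_tools(xml_text):
    """List the tools the use case declares it needs on the executing machine."""
    return [e.get("tool") for e in ET.fromstring(xml_text).findall("requires")]


argv = build_command(USECASE_XML, {"infile": "seqs.fasta", "outfile": "result.aln"})
print(argv)
print(required_tools(USECASE_XML))
```

A declared tool list of this kind is what would let a workflow engine check, or install, the matching distribution package before running the command locally or over ssh.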

Page 12: Moeller bosc2010 debian_taverna

Shared UseCase management

Page 13: Moeller bosc2010 debian_taverna

Example: Clustering many sequences

● Compute times of several hours are generally not acceptable for public web services

● Not a problem with integrated clouds

[Figure: flow diagram with “Local” and “Cloud” lanes. Steps: Cloud Image Selection, apt-get install t-coffee, Start instance, Inform Taverna about IP number, Workflow Execution, Results Interpretation]
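The cloud-side part of this example can be outlined as a short list of commands. This is a sketch under several assumptions: the instance is already running and reachable over ssh as `root`, the address `192.0.2.10` is a documentation placeholder, and the helper names are invented. It only builds the command lines and does not execute anything.

```python
import shlex


def ssh_command(host, remote_argv, user="root"):
    """Build the argv for running one command on the instance via ssh."""
    return ["ssh", f"{user}@{host}", shlex.join(remote_argv)]


def plan(instance_ip, infile):
    """Commands to prepare the instance, upload the input, and run the alignment."""
    return [
        ssh_command(instance_ip, ["apt-get", "install", "-y", "t-coffee"]),
        ["scp", infile, f"root@{instance_ip}:input.fasta"],
        ssh_command(instance_ip, ["t_coffee", "input.fasta"]),
    ]


steps = plan("192.0.2.10", "seqs.fasta")  # placeholder IP and file name
for step in steps:
    print(shlex.join(step))
```

In the setting of the talk, the IP number would be handed to Taverna so that the workflow's remote execution targets the freshly started instance instead of a public web service.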

Page 14: Moeller bosc2010 debian_taverna

Remaining challenge: sharing public data

● Could work like the management of software, but
  ● data is often large with frequent updates; users differ in their demands for the latest versions
  ● it involves post-processing; users differ in their demand to perform it
● Clouds could help, but
  ● one would not want to pay for everything all the time
  ● the installation process would need to be transparent to locally recreate or update or … improve the data

Page 15: Moeller bosc2010 debian_taverna

Proposal: getData, a shared Perl script

● The script is a large hash table
  ● extendable by configuration files that may be contributed from various packages, like EMBOSS
  ● Every entry comprises another hash table with attributes
    – Name: full name of database
    – Source: how to retrieve it
    – Post-download: what to do once it has arrived
    – Recommends: tools suggested to install with the data
● All very simple and extendable
  ● Direct mirroring of effort performed on the command line
  ● The community can co-maintain this script more easily than some cloud instance
  ● More on http://wiki.debian.org/getData
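The table-of-tables layout described on this slide can be illustrated in a few lines. getData itself is a Perl script; the Python below only mirrors the structure, and the database keys, URLs, and post-download commands (including `index_tool`) are invented placeholders rather than real getData entries.

```python
REQUIRED = ("Name", "Source", "Post-download", "Recommends")

# Hypothetical core entry shipped with the script.
core = {
    "exampledb": {
        "Name": "Example sequence database (hypothetical)",
        "Source": "wget http://example.org/exampledb.fasta.gz",
        "Post-download": "gunzip exampledb.fasta.gz",
        "Recommends": ["emboss"],
    }
}

# Hypothetical entry contributed by another package's configuration file.
contributed = {
    "otherdb": {
        "Name": "Another database (hypothetical)",
        "Source": "wget http://example.org/otherdb.dat",
        "Post-download": "index_tool otherdb.dat",
        "Recommends": [],
    }
}


def merge_entries(*contributions):
    """Merge contributed tables into one, checking each entry is complete."""
    table = {}
    for contribution in contributions:
        for key, entry in contribution.items():
            missing = [a for a in REQUIRED if a not in entry]
            if missing:
                raise ValueError(f"{key}: missing attributes {missing}")
            table[key] = entry
    return table


def commands_for(table, key):
    """Shell commands to fetch and then post-process one database."""
    entry = table[key]
    return [entry["Source"], entry["Post-download"]]


table = merge_entries(core, contributed)
print(sorted(table))
print(commands_for(table, "exampledb"))
```

Each entry records exactly what a person would otherwise type on the command line, which is why such a table is easy to review and co-maintain.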

Page 16: Moeller bosc2010 debian_taverna

Summary

● Debian as community and repository for bioinformatics software
  ● Mailing lists, source code management
  ● FTP servers
● Clouds introduce dynamics into the collaboration
  ● Data flow between packages
  ● Usability
  ● Shared maintenance of public data
● Taverna
  ● Connects web, grid, cloud instances and local machine
  ● Fosters exchange of experiences with various workflows

Page 17: Moeller bosc2010 debian_taverna

References and Acknowledgements

[1] Debian-Med http://debian-med.alioth.debian.org

[2] getData http://wiki.debian.org/getData

[3] Eucalyptus http://www.eucalyptus.com

[4] Taverna http://www.taverna.org.uk

[5] Taverna UseCases http://taverna.nordugrid.org

[6] myExperiment http://www.myExperiment.org

The development of the UseCase plugin for Taverna was funded by the “KnowARC” EU project.

Page 18: Moeller bosc2010 debian_taverna

Debian/Ubuntu contributes

● Impressive number of packages
  ● Bioinformatics (Bio*, EMBOSS, clustering, ...)
  ● Cheminformatics (autodock, gromacs, ballview, …)
  ● General scientific computing tools and libraries
    – Clustering (Torque, Sun Grid Engine, ...)
    – Eucalyptus Cloud environment
● Automation of database updates and indexing with the “getData” script

Page 19: Moeller bosc2010 debian_taverna

Concept: Distro+Workflows+Cloud

● Debian/Ubuntu Linux Distribution
  ● Chem- + Bioinformatics packages
  ● Friendly Community
● Taverna Workflow Suite
  ● Access to services in the web
  ● Access to command line tools via ssh or grids
  ● Exchange of ideas via myExperiment.org
● Eucalyptus or Amazon Clouds
  ● Sharing of databases and indices
  ● Readily available or customized images to instantiate

Page 20: Moeller bosc2010 debian_taverna

The Cloud contributes

A platform for individuals to share
● Data (“download only once”)
● Its management (“update and index only once”)
● Experiences (“I show you”)

Physical resources
● To be shared in community (“common cluster”)
● To be bought on demand (“run at Amazon.com”)

Solutions
● Readily usable images – by community or industry
● Adaptability to local demands