embrace and emboss

Funded by: EMBRACE and EMBOSS Integrating everything and Integrated by everything Peter Rice, EBI ([email protected]) June 2006


Page 1: EMBRACE and EMBOSS

Funded by:

EMBRACE and EMBOSS

Integrating everything and

Integrated by everything

Peter Rice, EBI ([email protected])

June 2006

Page 2: EMBRACE and EMBOSS

EMBRACE and EMBOSS

EMBRACE is an EC-funded Network of Excellence with 18 partners, developing an integrated set of services for the major bioinformatics data resources and analysis tools.

The EMB name was selected after two previous names were rejected. It stands for "European Model for Bioinformatics Research And Community Education" ... and has no connection with EMBL.

EMBOSS is now 10 years old, with the project team hosted by EMBL-EBI, providing open source libraries and over 200 applications for sequence analysis.

EMBOSS has its roots at EMBL Heidelberg, but started at the Sanger Centre and the UK EMBnet node. The EMB name reflects the EMBL and EMBnet origins: "European Molecular Biology Open Software Suite".

Page 3: EMBRACE and EMBOSS

EMBRACE

Network of Excellence - 18 partners with data resources, analysis tools, expertise in grid technology and experimental biologists.

Graham Cameron, Peter Rice, Alan Bleasby - EBI, Cambridge, GB
Toby Gibson - EMBL, Heidelberg, DE
Andreas Gisel - Institute of Biomedical Technologies, Section Bari, CNR, IT
Teresa Attwood - University of Manchester, GB
Marco Pagni - Swiss Institute of Bioinformatics, CH
Erik Bongcam-Rudloff - LCB/BMC, Uppsala, SE
Vincent Breton - CNRS, Clermont Ferrand, FR
Søren Brunak - CBS, Lyngby, DK
José-María Carazo - CNB, Madrid, ES
Arne Elofsson - DBB, Stockholm, SE
Daniel Kahn - INRA/CNRS, Toulouse, FR
Ralf Herwig - MPI für Molekulare Genetik, Berlin, DE
Eija Korpelainen - CSC, Espoo, FI
Christine Orengo - University College London, GB
Yitzhak Pilpel - Weizmann Institute of Science, IL
Gert Vriend - CMBI, Nijmegen, NL
Alfonso Valencia - INTA-CAB, Madrid, ES
Christian Bryne - University of Bergen, NO

Page 4: EMBRACE and EMBOSS

EMBRACE Overview

This kind of programming is hard to do.

EMBRACE aims to make it easier, and within the reach of experimental biologists.

To do this, we need an interoperable set of services and clients that can both find and make use of them.

Page 5: EMBRACE and EMBOSS

EMBRACE aims to enable ...

•a scientist to invoke the latest and best version of a given program without any concern for its physical location

•the program to find the most up-to-date data without help from the user

•workflows to automatically take advantage of whatever compute power is available

•workflows to deliver results in a way which any user can understand

•the scientist to follow connections to other relevant data and tools using all the straightforward idioms of web browsing and hyperlinks.

Page 6: EMBRACE and EMBOSS

EMBRACE: Interconnectivity

[Figure: interconnectivity diagram; the surviving labels read "Application User interface" and "Application interface"]

Page 7: EMBRACE and EMBOSS

EMBRACE: Approaches

• Defining an application interface
  • Design from the view of the user/application
  • Browser example
    • User provides a query and a data type
    • Generate a list of results by data resource
    • Expand and browse the list, following links
    • Select some or all as input to analysis tools
    • Requires human-readable definitions
• Automation
  • A similar example, but with a program selecting and launching the analysis
  • Requires machine-readable definitions

Page 8: EMBRACE and EMBOSS

EMBRACE Data Content

• DNA sequence information
• Protein sequence information
• Genome annotation
• Macromolecular Structure Data
• Expression information
• Literature
• Orthologs
• Untranslated regions
• Protein Families
• Alignments
• Protein/protein-associations
• Structural domains
• Gene3D
• ORFandDB
• SNPs in regulatory regions
• 3D Electron Microscopy data

Page 9: EMBRACE and EMBOSS

EMBRACE Analysis Tools

• EMBOSS
• DNA sequence analysis
• Protein sequence analysis
• Pattern matching
• Genome annotation
• Expert systems
• Hidden Markov Models
• Homology searches
• Phylogenetic analysis
• Protein structure analysis
• Protein structure comparison
• Protein domain mapping
• Microarrays and gene expression
• Bioinformatics workflows
• Bioinformatics tool environments
• Protein structure prediction
• Electron microscopy
• Electron microscope tomography
• Systems biology modelling
• Text mining

Page 10: EMBRACE and EMBOSS

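EMBRACEgrid

Requires:

• Data management
• Data replication
• Service discovery
• Computing

[Figure: "Web services" and "Grid services" rated against these requirements, with the labels "Information world" and "Infrastructure world". The ratings survive only as the extracted strings "KO ?? KO OK" and "OK ?? OK KO"; their row and column positions are not recoverable.]

• Lack of infrastructure providing low-level services
• Instability and lack of robustness
• Standards still evolving, and implementations lagging behind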

Page 11: EMBRACE and EMBOSS

EMBRACE: Data Content Services

• Promised deliverables are prototypes
• Webservice technology
• Content provided by EBI and EMBL Heidelberg
• Access to:
  • Nucleotide sequence data resources
  • Protein sequence data resources
  • Protein motif resources
• Technology choices kept flexible
  • SOAP webservices from EBI
  • BioMart from EBI
  • Existing services from other partners

Page 12: EMBRACE and EMBOSS

EMBRACE: Analysis Tools Services

• Promised deliverables are prototypes
• Webservice technology
• Content provided by EBI
• Access to:
  • Sequence analysis tools (EMBOSS etc.)
  • Protein structure analysis tools (EMBOSS/EMBASSY etc.)
• Technology choices kept flexible
  • SOAP webservices
  • SOAPlab project (EBI/MyGrid)
  • Life Science Analysis Engine standard (OMG)
• Integration also implies
  • Tools will access data resources via EMBRACE interfaces

Page 13: EMBRACE and EMBOSS

EMBRACE: Technology Choice

• Promised deliverable is a survey of webservice and grid technologies
• Will be made publicly available
• To cover:
  • European Grids and Bioinformatics (EGEE etc.)
  • Webservice standards
  • Grid service standards
  • Current standards
  • Emerging standards
  • Recommendations on technology adoption
  • Recommendations on further technology watch
• Technology test cases
  • Designed to demonstrate technology
  • Designed to show improvements in technology
  • Designed to highlight problems

Page 14: EMBRACE and EMBOSS

EMBRACE: Test Cases

• EMBRACE is driven by biological test cases:
  • 4 initial test cases in the proposal
  • Workshop (Uppsala, 2005) defined new test cases
  • Partners illustrating use of their content/tool resources
• Test cases described in detail
  • Template adopted from BioMOBY
  • Implement template solutions
  • Identify missing components
  • Set priorities
  • ... and fill in the gaps

Page 15: EMBRACE and EMBOSS

EMBRACE: Outreach

• First workshops have been internal (inreach)
• In 2006, workshops will be mixed with outreach
• EMBRACE is aimed at skilled bioinformaticians
• Need to address needs of biological researchers
• EMBRACE provides a programming interface to services
• Biologists need a simple "browser"
• EMBRACE will need a simple interface to demonstrate utility
• Example interfaces:
  • Taverna (EBI/MyGrid/OMII-UK)
  • Other workflow systems
  • Simple program examples
  • Simple script examples
  • "The Big Red Button"

Page 16: EMBRACE and EMBOSS

EMBRACE Year Two

• Prototype content services to become standard
• Prototype tool services to become standard
• Further prototypes beyond sequence data
• Established technology choice
• Well documented test cases
• Good links to biological research community
• Selected collaborators
  • Willing to explore emerging technologies
  • Biological (and practical) use cases

Page 17: EMBRACE and EMBOSS

EMBOSS: History

• EMBOSS started in March 1996

• First requirements based on a list of long-standing problems in existing commercial software (GCG), and the need for public source code

• First "ajax" library written August 1996

• 30 potential developer/user sites identified November 1996 (EMBnet Helsinki)

• Wellcome Trust proposal February 1997 (Sanger, HGMP and EBI)

• Accepted August 1997

• Project started November 1997.

• EMBOSS 1.0.0 released on 15th July 2000.

• EMBOSS 2.0.0 released on 15th July 2002.

• EMBOSS 3.0.0 released on 15th July 2005

• EMBOSS 4.0.0 will be released on 15th July 2006

Page 18: EMBRACE and EMBOSS

Original Target Users

Each of the following groups had their own special needs which EMBOSS aimed to satisfy:

• Sanger Centre genomic sequencing and analysis groups

• RFCGR/HGMP registered academic users (about 10,000)

• EMBnet service providers in 30+ other countries with over 30,000 users

• Academic users everywhere

• Pharmaceutical and biotechnology industry

• Bioinformatics developers

Page 19: EMBRACE and EMBOSS

Seqret

Seqret is a very simple application

• It reads a sequence USA (in any format, from anywhere)

• It writes a sequence USA (in any format)

If you tell it the sequence has feature annotation:

• It reads the features (in any format)

• It writes the features (in any format)

Seqret has 13 lines of code
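
A USA (Uniform Sequence Address) such as "embl:paamir" names where a sequence comes from, optionally with a format prefix as in "fasta::myseq.fa". The following is a minimal sketch in plain C of how such an address could be split into its parts. It is not the EMBOSS implementation: UsaParts and usa_split are hypothetical names, and the sketch assumes only the "format::source:entry" shape, ignoring list files, sequence ranges and file paths that themselves contain colons.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the parts of a USA; not an EMBOSS type. */
typedef struct {
    char format[32];   /* optional; empty if there is no "format::" prefix */
    char source[128];  /* file name or database name */
    char entry[64];    /* optional entry name; empty if absent */
} UsaParts;

/* Copy at most dstsize-1 characters of src[0..len) into dst, NUL-terminated. */
static void copy_part(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split "format::source:entry" into its parts (simplified, see assumptions).
   Returns 0 on success, -1 if no source could be found. */
int usa_split(const char *usa, UsaParts *out)
{
    const char *rest = usa;
    const char *sep;
    const char *colon;

    assert(usa != NULL && out != NULL);

    out->format[0] = out->source[0] = out->entry[0] = '\0';

    /* Optional "format::" prefix. */
    sep = strstr(usa, "::");
    if (sep) {
        copy_part(out->format, sizeof out->format, usa, (size_t)(sep - usa));
        rest = sep + 2;
    }

    /* Remainder is "source" or "source:entry". */
    colon = strchr(rest, ':');
    if (colon) {
        copy_part(out->source, sizeof out->source, rest, (size_t)(colon - rest));
        copy_part(out->entry, sizeof out->entry, colon + 1, strlen(colon + 1));
    } else {
        copy_part(out->source, sizeof out->source, rest, strlen(rest));
    }

    return out->source[0] ? 0 : -1;
}
```

For example, usa_split("embl:paamir", &parts) leaves parts.source as "embl" and parts.entry as "paamir", with an empty format. Whether "embl" is then treated as a database or a file is a separate lookup that this sketch does not attempt.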

Page 20: EMBRACE and EMBOSS

The source code seqret.c

#include "emboss.h"

int main(int argc, char **argv)
{
    AjPSeqall seqall;
    AjPSeqout outseq;
    AjPSeq seq = NULL;

    embInit("seqret", argc, argv);

    seqall = ajAcdGetSeqall("sequence");
    outseq = ajAcdGetSeqout("seqout");

    while (ajSeqallNext(seqall, &seq))
        ajSeqWrite(outseq, seq);

    ajSeqWriteClose(outseq);

    ajExit();
}

Page 21: EMBRACE and EMBOSS

EMBOSS Quality Control

• Nightly build with no compiler warnings
• 2,000 test runs (including expected fail conditions)
• 150 valgrind memory leak tests
• Code documentation validation and indexing
• ACD file validation
• ACD documentation completeness
• Program documentation: description, command line qualifiers, example run(s) and input/output files
• Web site updates

Page 22: EMBRACE and EMBOSS

Disaster proof software licences

Page 23: EMBRACE and EMBOSS

Disaster proof software licences

• 1977 Fred Sanger sequences ΦX174 with computing by Rodger Staden

• 1996 EMBOSS started by Peter Rice (Sanger) and Alan Bleasby (SEQNET Daresbury), in collaboration with Thure Etzold (EBI)

• 1997 funding approved by the Wellcome Trust

• 1998 SEQNET relocated to Hinxton (HGMP)

• 1999 Thure goes to LION Bioscience

• 2000 Peter leaves Sanger – EMBOSS goes to Alan at HGMP

• 2001 LION (Peter) adds EMBOSS to SRS and updates EMBOSS

• CCP11 funding for EMBOSS development

• 2002 Peter leaves LION

• 2003 Peter joins EBI – integrating EMBOSS in myGrid services

• Medical Research Council terminates funding for Rodger Staden

• MRC still "owns" the Staden package. Rodger Staden retires.

• HGMP is renamed after Rosalind Franklin (by MRC)

• 2004 April 1st: MRC announces RFCGR will be closed within 15 months

• 2005 Alan Bleasby and Jon Ison move to EBI; Tim Carver moves to Sanger

• All the code is still licensed to everyone under (L)GPL.

Page 24: EMBRACE and EMBOSS

Users: Are you a Man or a Mouse?

Page 25: EMBRACE and EMBOSS

Command Line

EMBOSS has many possible command lines:

• Prompting for required values

% seqret
What sequence []: embl:paamir
Output file [paamir.fasta]:

• Unix style

% seqret embl:paamir -send 100 -auto
% seqret embl:paamir -se 100 -auto
% seqret -se 100 embl:paamir -auto

• GCG style

% seqret embl:paamir -send=100 -auto
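
The examples above show that a qualifier can be abbreviated (-se for -send) and that its value can follow either as the next argument or after an equals sign. A minimal sketch of that matching logic in plain C follows. It is not how EMBOSS itself parses its command line: the qualifier table, Qualifier, resolve_qualifier and parse_qualifier are hypothetical names for illustration, and the sketch assumes an abbreviation is accepted only when it is a prefix of exactly one known qualifier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative qualifier names only (assumption); a real program would
   take them from its own definition of its options. */
static const char *const QUALIFIERS[] = { "sbegin", "send", "sreverse", "auto" };
#define NQUAL (sizeof QUALIFIERS / sizeof QUALIFIERS[0])

/* Hypothetical result type: full qualifier name plus inline value, if any. */
typedef struct {
    const char *name;   /* full name, e.g. "send" */
    const char *value;  /* text after '=', or NULL when no '=' was given */
} Qualifier;

/* Resolve the first len characters of abbrev to a full qualifier name.
   An exact match always wins; otherwise the prefix must match exactly one
   name. Returns NULL when nothing matches or the prefix is ambiguous. */
const char *resolve_qualifier(const char *abbrev, size_t len)
{
    const char *match = NULL;
    size_t nmatches = 0;
    size_t i;

    if (len == 0)
        return NULL;

    for (i = 0; i < NQUAL; i++) {
        if (strncmp(QUALIFIERS[i], abbrev, len) != 0)
            continue;
        if (QUALIFIERS[i][len] == '\0')
            return QUALIFIERS[i];       /* exact match */
        if (nmatches++ == 0)
            match = QUALIFIERS[i];      /* remember the first prefix match */
    }
    return nmatches == 1 ? match : NULL;
}

/* Parse one token of the form "-name" or "-name=value".
   With "-name" alone, value stays NULL and the caller would read the
   next argument (Unix style). Returns 0 on success, -1 otherwise. */
int parse_qualifier(const char *token, Qualifier *out)
{
    const char *eq;
    size_t len;

    assert(token != NULL && out != NULL);

    out->name = NULL;
    out->value = NULL;

    if (token[0] != '-' || token[1] == '\0')
        return -1;
    token++;                            /* skip the leading '-' */

    eq = strchr(token, '=');
    len = eq ? (size_t)(eq - token) : strlen(token);

    out->name = resolve_qualifier(token, len);
    if (out->name == NULL)
        return -1;
    out->value = eq ? eq + 1 : NULL;
    return 0;
}
```

With this table, "-se" and "-send=100" both resolve to "send", while "-s" is rejected because it is a prefix of three names. A token without a leading hyphen, such as "embl:paamir", is not a qualifier and would be handled as a positional parameter by the caller.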

Page 26: EMBRACE and EMBOSS

Web Interface (wEMBOSS)

Page 27: EMBRACE and EMBOSS

Web interface (SRS)

Page 28: EMBRACE and EMBOSS

GUI Interfaces: Jemboss

Page 29: EMBRACE and EMBOSS

GUI Interfaces: Taverna

Page 30: EMBRACE and EMBOSS

Where are we now?

Page 31: EMBRACE and EMBOSS
Page 32: EMBRACE and EMBOSS
Page 33: EMBRACE and EMBOSS

New grant vision

• For the new grant we were asked to present a vision:
  • Genomics (whole genome analysis)
  • Phylogenetics (beyond phylip)
  • Gene expression (microarray data standards)
  • Biostatistics (R and BioConductor)
  • Proteomics (2d gel, MS, etc)
  • Genetic linkage
  • Chemistry (small molecules)
• All these ideas came from the 2005 User Survey
• We have funding only for core development (so far)

Page 34: EMBRACE and EMBOSS

Extending core EMBOSS

• There are many other things we can do:
  • Workflows
  • Automatic support for the 100+ interfaces
    • Generating XML files
    • Notification of changes to ACD standard
  • Testing
  • Ontologies
  • Graphics library
  • Database indexing
  • Non-sequence data access

Page 35: EMBRACE and EMBOSS

EMBOSS Books

• Three books are planned after 4.0.0
• Text ownership stays with the EMBOSS team for reuse
• Publisher: Cambridge University Press
• Programmer's guide
  • After a major code refactoring effort
  • Automated generation of code examples
• Administrator's guide
  • Installing and maintaining EMBOSS code
  • Managing data resources
  • Supporting in-house developers
• User's guide
  • Aimed at experimental biologists

Page 36: EMBRACE and EMBOSS

EMBOSS and Industry

• Celera were the first industrial users

• And the first to provide funding (for the SRS interface)

• Hardware manufacturers offer machines and compilers

• IBM, HP, Apple

• Our latest partners are SciTegic/Accelrys

• Pipeline Pilot Independent Software Vendor partnership

Page 37: EMBRACE and EMBOSS

Pipelining Heterogeneous Tools

Heterogeneous [BioJava, Perl, PROSITE, EMBOSS, (& GCG)]

tools for sequence annotation

Page 38: EMBRACE and EMBOSS

The SciTegic Challenge

• Pipeline Pilot runs on Linux
  • BioPerl interface to launch EMBOSS
  • EMBOSS team to maintain the BioPerl code
• Pipeline Pilot runs on Windows
  • EMBOSS team to support EMBOSSWIN
• Why? Because we can do it, and we expect the GCG development team will find it difficult!

Page 39: EMBRACE and EMBOSS

We need help

• Encouraging more developers
  • CUP books
  • Developer training courses - not in Hinxton
    • Course in Indiana May 2005
    • Sponsorship offer from Newcastle, UK
    • Willing to travel anywhere!!!

[email protected]

• Henrikki Almusa and Medicel (Helsinki)
• Suggestions for new applications
• Collaborations in proposed new areas.

Page 40: EMBRACE and EMBOSS

Acknowledgements

• (HGMP/RFCGR): Gary Williams, Tim Carver, Hugh Morgan, Claude Beesley, Damian Counsell, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop

• LION: (Thomas Laurent), (Bijay Jassal), Thure Etzold

• Sanger: (Ian Longden), (Richard Bruskiewich), Simon Kelley, (Ewan Birney)

• EBI: Peter Rice, Alan Bleasby, Jon Ison, Lisa Mullan, (Martin Senger), Tom Oinn, Rodrigo Lopez, Mahmut Uludag, Shaun McGlinchey

• EMBnet: UK, Norway, Italy, Germany, Belgium, Argentina, China, Turkey, Israel, Canada, Manchester

• Others: Don Gilbert, Will Gilbert, Rodger Staden, Bill Pearson, Catherine Letondal, Luke McCarthy, Susan Jean Johns, David Bauer, Andrew Lyall, Henrikki Almusa, Melody Clark, ....