rice bosc2010 emboss



Page 1: Rice bosc2010 emboss

EBI is an Outstation of the European Molecular Biology Laboratory.

EMBOSS

European Molecular Biology Open Software Suite

Open-Bio Project Update 2010

Peter Rice [email protected]

Page 2: Rice bosc2010 emboss

BOSC 2010: EMBOSS12.04.232

A quick introduction

• Open source package for sequence analysis
• ANSI C source code
• GPL licensed applications, LGPL libraries
• 200+ applications
• 100+ third party applications in 15 associated packages
  • MIRA, MEME, HMMER, PHYLIP, etc.
• Project started 1996 at Sanger and HGMP
• Now based at EBI
• Release 1.0.0 15th July 2000
• Release 6.3.0 15th July 2010
• Funded by UK-BBSRC and EMBL-EBI
• Originally funded by the Wellcome Trust
• Additional funds from UK-MRC

Page 3: Rice bosc2010 emboss

Who do we serve?

• Expert software developers
  • Bioinformaticians
  • Computer scientists

• Expert users
  • Biology research community
  • Industry

• Scientific users
  • Biology research community
  • Industry

Page 4: Rice bosc2010 emboss

EMBOSS command line interface

• EMBOSS applications run from the command line
• This is not the only interface
• There are over 100 interfaces and packaged systems available
  • Web: wEMBOSS
  • GUI: Jemboss
  • Web Services: SoapLab
  • Workflows: Galaxy, Taverna
  • Windows: mEMBOSS
• All applications have a command definition file (.acd)
  • Defines all inputs, outputs, and other options
  • Read at startup
  • Contains all command line options with descriptions
  • Template for any other interface

Page 5: Rice bosc2010 emboss

EMBOSS Update

• Release 6.3.0 as usual on 15th July 2010
• New support for NGS sequence formats
• Adaptor detection added to supermatcher
• Metadata and ontologies
• Full set of public data resources
• Three open source books: users, developers, admin
  • Cambridge University Press

Page 6: Rice bosc2010 emboss

NGS sequence formats

• SAM format: tab-delimited short read data
• BAM format: binary compressed SAM format
  • More work needed on remote access to mapped reads
• FASTQ short reads and quality scores
• OpenBio project collaboration on format standards
• Improved error detection (for all formats)
• Improved performance for input and output
• Indexing in dbxflat
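
To make "tab-delimited short read data" concrete, here is a minimal Python sketch that splits one SAM alignment line into the eleven mandatory columns plus optional tags. The column names follow the current SAM specification, the record is invented for illustration, and `SamRecord` and `parse_sam_line` are hypothetical names; this is not EMBOSS code.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, Phred+33)
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    """Split one non-header SAM line on tabs and convert numeric columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


# Invented example record (not taken from the slides).
example = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(example)
print(rec.qname, rec.rname, rec.pos, rec.cigar, rec.tags)
```

Header lines (those starting with `@`) would need to be skipped before calling the function, and BAM would need decompression and binary decoding first.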

Page 7: Rice bosc2010 emboss

NGS sequence formats

• FASTQ joint effort with Bio* projects
  • Definition of 3 conflicting FASTQ formats
  • Agreement on standard parsing procedures

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
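
The record layout above can be read with very little code. The Python sketch below assumes strict four-line records and decodes the quality line with the Sanger convention (ASCII offset 33). Which of the three variants a given file uses is something the reader has to know or guess, so the offset is a parameter. The function names `read_fastq` and `phred_scores` are hypothetical, and this is not the EMBOSS parser.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from strict four-line records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            seq = next(it).rstrip("\r\n")
            plus = next(it).rstrip("\r\n")
            qual = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError(f"truncated record: {header!r}") from None
        if not header.startswith("@"):
            raise ValueError(f"header does not start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"separator does not start with '+': {plus!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer scores (offset 33 = Sanger style)."""
    return [ord(ch) - offset for ch in qual]


# First record from the slide.
record = [
    "@EAS54_6_R1_2_1_413_324",
    "CCCTTCTTGTCTTCAGCGTTTCTCC",
    "+EAS54_6_R1_2_1_413_324",
    ";;3;;;;;;;;;;;;7;;;;;;;88",
]
for name, seq, qual in read_fastq(record):
    scores = phred_scores(qual)
    print(name, len(seq), min(scores), max(scores))
```

Multi-line FASTQ, where the sequence or quality wraps over several lines, is deliberately not handled here; it is one of the cases that makes agreed parsing procedures necessary.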

Page 8: Rice bosc2010 emboss

Other sequence formats

>AB036666 AB036666 Wolbachia sp. wKue genes
cattactatttcagtcgagacatattaggtcaatcaattttaatcaacaagattggtcaa
gatcaaagtaacattaaaaaatatatatactcatatggtgagtaccctctgaactggcct
cagggaacagaatacactttatctaacagccctgttacaacattaatatttgttcaaggt
aatgaaggacaagaaaaaacagcattcatttttcatatacgagagtccaatacaaaggaa
ttctatgctgataaaaaaattccagtgctaaacatacctaaaataggaaaagtaggaaat
gccgtagaaattaaaatgagtctaaaaaaatatgaaacagggttatcttttgaagacctt
tttgaaatagaacagataagtaaatatgaatcaagtggtaatgatcaacaatttacagat
ggcaagtttattgagatacctaattctgatgaattaaaggcaaaatttgatcaagcaatc
acttctcaacatgcttccgacggtgaggtttcattgcaagcctataaagtgttgcttact
gaagtagcagatacgatttaccctatcaaagatttgattactaatgaagcaagattacaa
gctgttcttaatggtttgcttagtagctatagtgatttaaagctacaggagacttctgcg
aagactgtaattatacctgaatttcaagtaggagcaggtggtcgtgtagatatggtaatt
caaggtattggtccttcgtctcagggtactaaagaatacactcctatagcgctggaattt

Page 9: Rice bosc2010 emboss

New data sources for EMBOSS

• BioMart access
  • As a sequence database, define sequence, identifier, etc.
  • Need to define a very large number of databases
• Ensembl access
  • Code from Michael Schuster
  • Ensembl SQL access code in library (access method soon)
  • Same issues as BioMart
• DAS 1.6 client access planned
• GMOD access planned
• BioSQL access planned

Page 10: Rice bosc2010 emboss

Data servers

• Defining individual sequence databases is tedious
• Many database definitions are similar
• Simplify (and extend) with server definitions:
  • SRS
  • MRS
  • BioMart
  • Ensembl
  • DAS 1.6
• Define server
• USA to give server:dbname:queryfield-value
• Database name and query field known to user
  • Or reported by a query to the server in an extended showdb
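
The `server:dbname:queryfield-value` pattern for a USA (Uniform Sequence Address) can be illustrated with a short Python sketch that splits such a string into its four parts. Everything here is an assumption made for illustration: the function name `parse_server_usa`, the example server and database names, and the rule that the first hyphen separates the field from the value. The real EMBOSS syntax may differ in detail.

```python
from typing import NamedTuple


class ServerQuery(NamedTuple):
    server: str
    dbname: str
    field: str
    value: str


def parse_server_usa(usa: str) -> ServerQuery:
    """Split 'server:dbname:queryfield-value' into its components."""
    parts = usa.split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected server:dbname:query, got {usa!r}")
    server, dbname, query = parts
    # Assumption: the first '-' separates the query field from its value.
    field, sep, value = query.partition("-")
    if not (server and dbname and field and sep and value):
        raise ValueError(f"incomplete server query: {usa!r}")
    return ServerQuery(server, dbname, field, value)


# Hypothetical server and database names; the identifier is the one
# shown on the FASTA slide.
print(parse_server_usa("srs:embl:id-AB036666"))
```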

Page 11: Rice bosc2010 emboss

New data sources for EMBOSS (2)

• Non-sequence data
• Cross-referenced resources from EMBL/UniProt/etc.
• Useful to return as:
  • Identifiers
  • Text for entries
  • HTML with markup
  • URLs for browsing
• Dbxref.dat
  • List of all known data resources
  • Standard names
  • Standard queries for sequence, text, HTML, etc.
  • Query by identifier and other fields

Page 12: Rice bosc2010 emboss

Ontologies

• Support for OBO format ontologies:
  • Gene Ontology
  • Sequence Ontology (used internally for features)
  • BioSapiens Ontology (used internally for features)
• Parsing and format validation
• Indexing with new dbx applications
• Indexing cross-references in EMBL/UniProt/etc.
• Navigation up, down, siblings, etc.
• Remote and local access
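
As a rough picture of what parsing and "navigation up, down, siblings" involve, the Python sketch below reads `[Term]` stanzas from OBO text and follows `is_a` links. The three-term ontology and its `EX:` identifiers are invented, the function names are hypothetical, and only the `id`, `name` and `is_a` tags are handled; it does not reflect how the EMBOSS library or its dbx indexes work.

```python
from typing import Dict, List

# Invented toy ontology in OBO flat-file syntax.
OBO_TEXT = """\
format-version: 1.2

[Term]
id: EX:0000001
name: root

[Term]
id: EX:0000002
name: child a
is_a: EX:0000001 ! root

[Term]
id: EX:0000003
name: child b
is_a: EX:0000001 ! root
"""


def parse_obo(text: str) -> Dict[str, dict]:
    """Collect id, name and is_a parents for every [Term] stanza."""
    terms: Dict[str, dict] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {"name": "", "is_a": []} if line == "[Term]" else None
            continue
        if current is None or not line or line.startswith("!"):
            continue
        tag, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.split("!", 1)[0].strip()  # drop trailing comment
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


def parents(terms: Dict[str, dict], term_id: str) -> List[str]:
    """Navigate up: direct is_a parents."""
    return list(terms[term_id]["is_a"])


def children(terms: Dict[str, dict], term_id: str) -> List[str]:
    """Navigate down: terms that name term_id as an is_a parent."""
    return sorted(t for t, rec in terms.items() if term_id in rec["is_a"])


def siblings(terms: Dict[str, dict], term_id: str) -> List[str]:
    """Terms sharing at least one parent with term_id."""
    found = set()
    for parent in parents(terms, term_id):
        found.update(children(terms, parent))
    found.discard(term_id)
    return sorted(found)


terms = parse_obo(OBO_TEXT)
print(parents(terms, "EX:0000002"), children(terms, "EX:0000001"),
      siblings(terms, "EX:0000002"))
```

The linear scan in `children` is fine for a toy example; a real ontology would want a precomputed child index.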

Page 13: Rice bosc2010 emboss

Ontologies: EDAM

• EMBRACE Datatypes And Methods
• OBO format (so far)
• All ACD files have relations attributes
  • “topic” for application (Immunological analysis)
  • “operation” for application (Epitope mapping)
  • “data” for inputs and outputs
    • Pure protein sequence
    • Sequence record
    • 1 or more
    • Sequence length
    • “Peptide immunogenicity report”
• Validation by acdvalid application

Page 14: Rice bosc2010 emboss

EDAM in ACD

application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
  relations: "/edam/topic/0000201 Immunological analysis"
  relations: "/edam/operation/0000416 Epitope mapping"
]

seqall: sequence [
  parameter: "Y"
  type: "proteinstandard"
  relations: "/edam/data/0001219 Pure protein sequence"
  relations: "/edam/data/0000849 Sequence record"
  relations: "/edam/data/0002178 1 or more"
]

integer: minlen [
  standard: "Y"
  minimum: "1"
  maximum: "50"
  default: "6"
  information: "Minimum length of antigenic region"
  relations: "/edam/data/0001249 Sequence length"
]

report: outfile [
  parameter: "Y"
  rformat: "motif"
  multiple: "Y"
  taglist: "int:pos=Max_score_pos"
  relations: "/edam/data/0001534 Peptide immunogenicity report"
]
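
To show how EDAM annotations could be pulled out of an ACD file like this one, here is a small Python sketch that collects the `relations` attributes of each definition block. It uses two regular expressions and is only an illustration: it assumes no `]` appears inside a quoted value, the name `edam_relations` is hypothetical, and it is not the parser used by EMBOSS, acdvalid or SoapLab.

```python
import re
from typing import Dict, List, Tuple

# Two of the blocks from the slide, with plain double quotes.
ACD_TEXT = '''
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
  relations: "/edam/topic/0000201 Immunological analysis"
  relations: "/edam/operation/0000416 Epitope mapping"
]

seqall: sequence [
  parameter: "Y"
  type: "proteinstandard"
  relations: "/edam/data/0001219 Pure protein sequence"
  relations: "/edam/data/0000849 Sequence record"
  relations: "/edam/data/0002178 1 or more"
]
'''

# "datatype: name [ ... ]" with a lazily matched body.
BLOCK = re.compile(r'(\w+):\s*(\w+)\s*\[(.*?)\]', re.S)
# relations: "/edam/<namespace>/<id> <label>"
RELATION = re.compile(r'relations:\s*"/edam/(\w+)/(\d+)\s+([^"]*)"')


def edam_relations(acd_text: str) -> Dict[str, List[Tuple[str, str, str]]]:
    """Map each block name to its (namespace, id, label) EDAM annotations."""
    result: Dict[str, List[Tuple[str, str, str]]] = {}
    for block in BLOCK.finditer(acd_text):
        _datatype, name, body = block.groups()
        result[name] = [m.groups() for m in RELATION.finditer(body)]
    return result


for name, relations in edam_relations(ACD_TEXT).items():
    for namespace, term_id, label in relations:
        print(f"{name}: {namespace} {term_id} ({label})")
```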

Page 15: Rice bosc2010 emboss

Ontologies: EDAM (2)

• SoapLab web services annotated with EDAM
• EDAM terms parsed from ACD files
• Web services have WSDL files
• SAWSDL annotation with EDAM terms
• Annotation can be used by BioCatalogue
  • www.biocatalogue.org
• Also can be used by EMBRACE registry
  • www.embraceregistry.net

Page 16: Rice bosc2010 emboss

Ontologies: NCBI Taxonomy

• Parsers for “.dmp” files
• Will add dbx indexing applications
• Local and remote access
• Navigation up, down, siblings (the usual suspects)
• Automatic cross references from sequence data
  • EMBL source line
  • UniProt OX lines
  • BioMart mart name (organism name)
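
The NCBI taxonomy dump files use an unusual delimiter: fields are separated by tab, pipe, tab, and each line ends with tab, pipe. The Python sketch below parses lines in the layout of `nodes.dmp` (taxon id, parent id, rank as the first three columns) and `names.dmp` (taxon id, name, unique name, name class), then walks up to the root, which is its own parent. The tiny data set and its taxon ids are invented, and the function names are hypothetical; this is not EMBOSS code.

```python
from typing import Dict, Iterable, List, Tuple


def split_dmp_line(line: str) -> List[str]:
    """Split one .dmp line on the tab-pipe-tab delimiter."""
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-line marker
    return line.split("\t|\t")


def load_nodes(lines: Iterable[str]) -> Dict[int, Tuple[int, str]]:
    """Map taxon id to (parent id, rank)."""
    nodes = {}
    for line in lines:
        fields = split_dmp_line(line)
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def load_scientific_names(lines: Iterable[str]) -> Dict[int, str]:
    """Map taxon id to its scientific name, ignoring synonyms."""
    names = {}
    for line in lines:
        tax_id, name, _unique, name_class = split_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name
    return names


def lineage(nodes: Dict[int, Tuple[int, str]], tax_id: int) -> List[int]:
    """Navigate up from tax_id to the root (the node that is its own parent)."""
    path, seen = [], set()
    while tax_id not in seen:
        seen.add(tax_id)
        path.append(tax_id)
        parent = nodes[tax_id][0]
        if parent == tax_id:
            break
        tax_id = parent
    return path


# Invented rows in the .dmp layout; the ids and names are not real taxa.
NODES = ["1\t|\t1\t|\tno rank\t|\n",
         "10\t|\t1\t|\tgenus\t|\n",
         "20\t|\t10\t|\tspecies\t|\n"]
NAMES = ["1\t|\troot\t|\t\t|\tscientific name\t|\n",
         "10\t|\tExamplia\t|\t\t|\tscientific name\t|\n",
         "20\t|\tExamplia demo\t|\t\t|\tscientific name\t|\n",
         "20\t|\tdemo bug\t|\t\t|\tcommon name\t|\n"]

nodes = load_nodes(NODES)
names = load_scientific_names(NAMES)
print(" > ".join(names[t] for t in reversed(lineage(nodes, 20))))
```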

Page 17: Rice bosc2010 emboss

EMBOSS Interfaces and wrappers

• Two releases in the past year
• Possibly three releases next year
• Too many for other projects to keep up
• So we are obliged to help, starting with:
  • SoapLab2
  • Jemboss
  • Galaxy
  • Pipeline Pilot
  • BioPerl
  • wEMBOSS and Explorer
  • G-language?
  • … and anyone else who asks!

Page 18: Rice bosc2010 emboss

The EMBOSS Team

Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag

Page 19: Rice bosc2010 emboss

Acknowledgements

• EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Martin Senger, Tom Oinn, Jaina Mistry, Rodrigo Lopez, Sharmilla Pillai, Hamish McWilliam, Michael Schuster, Syed Haider

• RFCGR/HGMP: Alan Bleasby, Jon Ison, Tim Carver, Hugh Morgan, Claude Beazley, Lisa Mullan, Damian Counsell, Gary Williams, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop

• LION: Thomas Laurent, Bijay Jassal, Bren Vaughan, Thure Etzold

• Sanger Institute: Ian Longden, Richard Bruskiewich, Simon Kelley

• National bioinformatics service providers in: Norway, Spain, Italy, Netherlands, Germany, Belgium, Russia, China, Canada, Australia, Argentina

• Others: Catherine Letondal, Don Gilbert, Rodger Staden, Bill Pearson, Webb Miller, Marie-Laetitia Denayer, Amandine Schurmann, Gabriele Weiler, Luke McCarthy, David Mathog, David Bauer, Henrikki Almusa, Thomas Siegmund, Scott Markel, Darryl Leon, Bastien Chevreux, Ivo Hofacker, Kristoffer Rapacki, Matus Kalas

• IBM, Hewlett-Packard, (Compaq), Apple, SGI, Sun, LION bioscience, SciTegic, Cambridge University Press

• Open-Bio Foundation, Sourceforge, ... And the British Antarctic Survey

http://emboss.sourceforge.net

http://emboss.open-bio.org/wiki