
Post on 30-Jan-2016





A generic and modular platform for automated sequence processing and annotation

Arthur Gruber

Instituto de Ciências Biomédicas Universidade de São Paulo




Sequence processing and annotation

• Analyzing and processing sequencing reads is a tedious and error-prone job

• Multistep process
  • All sequences are submitted to the same processing steps
  • Sequences processed by a given step are the input for the next one

• Requires different programs

• Integrated system: PIPELINE
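The idea that the output of one step is the input of the next can be sketched in a few lines of Python. This is only an illustration of the pipeline concept, not code from the platform described in these slides; the names `drop_short`, `to_upper` and `run_pipeline` are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable, List

# A step takes a list of sequences and returns the (possibly smaller) list
# that is handed to the next step.
Step = Callable[[List[str]], List[str]]


def drop_short(min_len: int) -> Step:
    """Hypothetical step: discard sequences shorter than min_len."""
    def step(seqs: List[str]) -> List[str]:
        return [s for s in seqs if len(s) >= min_len]
    return step


def to_upper(seqs: List[str]) -> List[str]:
    """Hypothetical step: normalize sequences to upper case."""
    return [s.upper() for s in seqs]


def run_pipeline(seqs: List[str], steps: Iterable[Step]) -> List[str]:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda data, step: step(data), steps, seqs)


reads = ["acgtacgt", "ac", "ggccttaa"]
result = run_pipeline(reads, [drop_short(4), to_upper])
```

Because every step has the same signature, steps can be reordered, removed or added without touching the others.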




Problem: how to build pipelines

• Creating scripts for new pipelines requires good programming knowledge

• Once created, most pipelines are difficult to change and customize

• Many programs must be used
  • Phred, Cross_match, Phrap, CAP3, Blast, HMMer, InterproScan, TMHMM, etc.




Problem: how to build pipelines (2)

• Each program needs a specific environment to work (e.g. directories with specific names)

• Each program produces output in different ways and formats

• Integrating programs is a hard task



Solution: creating an environment to build pipelines

• Abstract the environment of each program

• Abstract output format

• Easily specify “coupling” of different programs

• Document how the pipe was built
  • Easy to inspect and monitor
  • Easy to store (e.g. in a database)
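One way to picture these abstractions is a small wrapper class per program. The sketch below is an assumption made for illustration, not the platform's actual design: the `Component` base class, the `LengthTool` stand-in and the `couple` helper are all hypothetical, and no real third-party program is called.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List

Record = Dict[str, str]  # common output format shared by all components


class Component(ABC):
    """Hypothetical wrapper that hides a program's environment and output format."""

    def run(self, records: List[Record]) -> List[Record]:
        # Each program gets its own private working directory.
        with tempfile.TemporaryDirectory() as workdir:
            raw = self.execute(records, Path(workdir))
            return self.parse(raw)

    @abstractmethod
    def execute(self, records: List[Record], workdir: Path) -> str:
        """Run the wrapped program and return its raw output."""

    @abstractmethod
    def parse(self, raw: str) -> List[Record]:
        """Convert program-specific output into common records."""


class LengthTool(Component):
    """Stand-in for a third-party program; it reports id<TAB>length lines."""

    def execute(self, records: List[Record], workdir: Path) -> str:
        infile = workdir / "input.txt"  # file name this stand-in expects
        infile.write_text("\n".join(f"{r['id']}\t{r['seq']}" for r in records))
        lines = []
        for line in infile.read_text().splitlines():
            seq_id, seq = line.split("\t")
            lines.append(f"{seq_id}\t{len(seq)}")
        return "\n".join(lines)

    def parse(self, raw: str) -> List[Record]:
        out = []
        for line in raw.splitlines():
            seq_id, length = line.split("\t")
            out.append({"id": seq_id, "length": length})
        return out


def couple(components: List[Component], records: List[Record]) -> List[Record]:
    """Chain components; the list itself documents how the pipe was built."""
    for component in components:
        records = component.run(records)
    return records


report = couple([LengthTool()], [{"id": "read1", "seq": "ACGT"}])
```

The caller only sees `Record` dictionaries; directory layout and raw output parsing stay inside each wrapper.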






Aims and characteristics

• To develop a platform for pipeline construction that is simple to use and configure
  • Big sequencing centers already have sophisticated pipelines, but many are not published and/or publicly available
  • They are too complex for small and mid-sized labs

• Platform should be generic
  • Useful for any sequencing project

• Platform should provide components for the most common tasks

• New components should be easy to develop




EGene: a generic platform for pipeline construction

• Written in Perl

• Modular

• Easy to build specific components to interact with third-party programs

• EGene components can be integrated to fulfill user-specific needs

• CoEd: a graphical configuration editor written in Java, with a user-friendly interface





Sequence processing pipeline: the Eimeria ORESTES project

Input: chromatogram files

• Base calling and quality assignment: Phred
• Primer screening and masking: Cross_Match
• Vector masking and screening: Cross_Match
• Quality filtering: Filter-quality.pl
• End trimming: Trim-ends.pl
• Size filtering: Filter-size

• Human sequence filtering: Blast
• Chicken sequence filtering: Blast
• Bacterial sequence filtering: Blast
• Repetitive sequence filtering: Cross_Match
• Ribosomal sequence filtering: Cross_Match
• Plastid sequence filtering: Cross_Match
• Mitochondrial sequence filtering: Cross_Match
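Three of the pre-processing steps in this pipeline (quality filtering, end trimming and size filtering) can be illustrated with a minimal Python sketch. It does not reproduce the behavior of Filter-quality.pl, Trim-ends.pl or Filter-size: the function names, the order in which the steps are applied, the Phred cutoff of 20, the 50% fraction and the minimum length of 5 are all assumptions chosen for the example.

```python
from typing import List, Optional, Tuple

Read = Tuple[str, List[int]]  # (bases, Phred quality value per base)


def passes_quality(read: Read, min_q: int = 20, min_fraction: float = 0.5) -> bool:
    """Keep a read only if enough of its bases reach min_q (assumed rule)."""
    _, quals = read
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    return good / len(quals) >= min_fraction


def trim_ends(read: Read, min_q: int = 20) -> Read:
    """Remove bases below min_q from both ends of the read."""
    seq, quals = read
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_size(read: Read, min_len: int = 5) -> bool:
    """Keep a read only if it is at least min_len bases long."""
    return len(read[0]) >= min_len


def process(read: Read) -> Optional[Read]:
    """Quality filtering, end trimming, then size filtering; None means rejected."""
    if not passes_quality(read):
        return None
    trimmed = trim_ends(read)
    return trimmed if passes_size(trimmed) else None


good_read = ("NNACGTACGTNN", [5, 5, 30, 30, 30, 30, 30, 30, 30, 30, 5, 5])
short_read = ("ACG", [30, 30, 30])
kept = process(good_read)
rejected = process(short_read)
```

Here `kept` holds the read with its low-quality ends removed, while `rejected` is `None` because the read is too short after trimming.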




Sequence processing and graphical report




How to get EGene

Internet site: http://www.coccidia.icb.usp.br/egene

- EGene is distributed under the GNU General Public License
- EGene is Open Source







Recent developments

• Incorporation of forks

• Enhancement of the data model: incorporation of annotation evidence

• Development of annotation components

• Evidence-based annotation




Genome annotation

• Annotation is the process of adding information to a DNA sequence.

• The information usually has a DNA coordinate.

• Features could be repeats, genes, promoters, protein domains, etc.

• Features can be cross-referenced to other databases (e.g. Pfam/Pubmed)
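The points above map naturally onto a small data structure: a feature has a type, coordinates on the sequence, and optional cross-references. The Python sketch below is a hypothetical illustration, not EGene's data model; the class and field names are assumptions, and the coordinates and the Pfam accession are illustrative values only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    """One annotation: what it is, where it is, and where to read more."""
    kind: str            # e.g. "gene", "repeat", "promoter"
    start: int           # 1-based, inclusive coordinate on the sequence
    end: int             # 1-based, inclusive
    strand: str = "+"    # "+" or "-"
    dbxrefs: Dict[str, str] = field(default_factory=dict)  # database -> accession

    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class AnnotatedSequence:
    seq_id: str
    sequence: str
    features: List[Feature] = field(default_factory=list)

    def features_at(self, position: int) -> List[Feature]:
        """Return every feature that covers the given 1-based position."""
        return [f for f in self.features if f.start <= position <= f.end]


contig = AnnotatedSequence("contig1", "ACGT" * 25)
contig.features.append(Feature("gene", 10, 60, "+"))
# The accession below is only an illustrative cross-reference.
contig.features.append(Feature("protein_domain", 20, 40, "+", {"Pfam": "PF00069"}))
hits = contig.features_at(30)
```

Querying position 30 returns both the gene and the domain nested inside it, which is the kind of lookup a genome browser performs.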






Annotation file

A typical annotation file contains:

A header with:
• Information about the sequence
• Organism
• Authors
• References
• Comments

A feature table containing:
• Sequence features and coordinates




Feature table format

• Flatfile format

• Format definition available at


• Covers DDBJ/EMBL/GenBank

• Defines all accepted annotation terms and hierarchy




Incorporating annotation

• EGene’s data model was enriched to incorporate annotation information into the representation of the sequences

• All collected data is converted into a proprietary XML format

• The XML can be easily converted into different annotation formats: Feature Table, GFF3, etc.

• We provide some converters and new ones can be easily implemented
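A converter of this kind can be quite short. The Python sketch below turns a made-up XML document into GFF3 lines. EGene's own XML schema is not shown in these slides, so the element and attribute names here are invented for the example, and the `xml_to_gff3` function is hypothetical; only the nine-column GFF3 layout follows the published format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented for this sketch; the real schema may differ.
XML_DOC = """
<sequence id="contig1">
  <feature type="gene" start="10" end="60" strand="+" source="Genscan"/>
  <feature type="tRNA" start="80" end="95" strand="-" source="tRNAscan-SE"/>
</sequence>
"""


def xml_to_gff3(xml_text: str) -> str:
    """Convert the hypothetical XML above into GFF3 text."""
    root = ET.fromstring(xml_text.strip())
    seq_id = root.get("id")
    lines = ["##gff-version 3"]
    for n, feat in enumerate(root.iter("feature"), start=1):
        columns = [
            seq_id,               # 1: seqid
            feat.get("source"),   # 2: source (program that produced the evidence)
            feat.get("type"),     # 3: type
            feat.get("start"),    # 4: start, 1-based
            feat.get("end"),      # 5: end, inclusive
            ".",                  # 6: score (not available here)
            feat.get("strand"),   # 7: strand
            ".",                  # 8: phase (not applicable here)
            f"ID=feature{n}",     # 9: attributes
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


gff = xml_to_gff3(XML_DOC)
```

Writing a converter for another target format would reuse the same parsing loop and change only how each feature is serialized.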




Annotation components

• A comprehensive set of annotation components has been implemented:
  • ORF finding and translation
  • Tandem repeats finding: TRF, String, mREPS
  • tRNA finding: tRNAscan-SE
  • Gene prediction: Genscan, GlimmerM, GlimmerHMM, Twinscan, Phat, ESTscan, SNAP
  • Motif finding: HMMer x Pfam, RPS-BLAST, InterproScan
  • Similarity search: BLAST
  • EST mapping: Sim4, Exonerate




Annotation components

• A comprehensive set of annotation components has been implemented:
  • Transmembrane domain finding: TMHMM, Phobius
  • Signal peptide: SignalP, Phobius
  • GPI anchor: DGPI
  • GO mapping and quantification
  • Orthology assignment and quantification: COG/KOG
  • Pathway mapping: KEGG
  • Annotation visualization with GBrowse: web inspection
  • Annotation report generation: feature table, GFF3
  • Web site generation: HTML/PHP




EGene generates annotation files that can be inspected using standard annotation editors (Artemis, Apollo, etc.)




EGene’s annotation

• EGene can generate annotation in different formats:

• XML: local use, easy to feed a database management system

• Feature table
  • Convenient for manual curation on Artemis
  • Ready for submission to public databases

• GFF3
  • Current annotation interchange format
  • Manual curation/visualization on Artemis, Apollo and GMOD Genome Browser
  • Compliant with Sequence Ontology terms




EGene performs GO term mapping and constructs web pages for inspection




EGene performs an integrated and quantitative orthology analysis (COG/KOG) and constructs web pages




EGene automatically constructs a full web site for evidence inspection




Current developments

• Full integration with a database management system

• Automated task distribution management across multiple processing nodes

• Development of a graphical interface for evidence inspection and manual curation

• “Intelligent” annotation: use of probabilistic methods to evaluate evidence and assign protein functions




Why use EGene2?

• Ideal for small- and mid-sized laboratories
  • Genome and EST sequencing projects

• Conceived for biologists
  • Does not require programming skills

• Generic tool for any sequencing/annotation project, customized for specific user requirements

• Very easy to implement new components

• Multiplatform: MacOS, UNIX, Linux, etc.

• Well documented: HOWTOs, tutorials, example datasets available

• Easy configuration
  • CoEd: application with a GUI for pipeline construction
  • Generic pipeline templates provided




Research team

Prof. Alan M. Durham - IME-USP

Annotation
• Milene Ferro - ICB-USP
• Ricardo Yamamoto Abe - IME-USP
• Luiz Thiberio Rangel - ICB-USP

Sequence pre-processing
• André Yoshiaki Kashiwabara - IME-USP
• Fernando Tadashi G. Matsunaga - ICB-USP
• Paulo Henrique Ahagon - ICB-USP
• Leonardo Varuzza - ICB-USP




Financial Support

• FAPESP - São Paulo State Science Foundation

• CNPq - National Research Council




Thanks for your
