Post on 28-Dec-2015

Genome Annotation using MAKER-P at iPlant

Collaboration with Mark Yandell Lab (University of Utah)

www.yandell-lab.org

iPlant:

Josh Stein (CSHL)

Matt Vaughn (TACC)

Dian Jiao (TACC)

Zhenyuan Lu (CSHL)

Nirav Merchant (U. Arizona)

Carson Holt (Ontario Institute for Cancer Research)

Cantarel et al. 2008. Genome Research 18:188

Holt & Yandell. 2011. BMC Bioinformatics 12:491

What Are Annotations?

• Annotations are descriptions of features of the genome

  • Structural: exons, introns, UTRs, splice forms, etc.

  • Coding & non-coding genes

  • Functional: enzymatic activity, expression

• Annotations should include an evidence trail

  • Assists in quality control of genome annotations

• Examples of evidence supporting a structural annotation:

  • Ab initio gene predictions

  • ESTs

  • Protein homology

Secondary Annotation

• Protein Domains

  • InterProScan: combines many HMM databases

• GO and other ontologies

• Pathway mapping

  • E.g. BioCyc Pathway Tools

Challenges in Plant Genome Annotation

• Genomes are BIG

• Highly repetitive

• Many pseudogenes

Yet it is important to get it right!

Contamination Issue

Annotation Error

Example: split gene models

Typical Annotation Pipeline

• Contamination screening

• Repeat/TE masking

• Ab initio prediction

• Evidence alignment (cDNA, EST, RNA-seq, protein)

• Evidence-based prediction

• Combiner

• Evaluation/filtering

• Manual curation

Options for Protein-coding Gene Annotation

• MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the mountain of genomic data provided by next generation sequencing technologies into a usable resource.

MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, automatically synthesizes these data into gene annotations, and produces evidence-based quality values for downstream annotation management.

[Figure: Quality control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED). Axis: quality, from better to worse.]
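As a rough illustration of how an AED score can be computed, the sketch below assumes the nucleotide-level definition from Eilbeck et al. 2009 (cited later in these slides): AED = 1 − (SN + SP)/2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases covered by the evidence. The function names and the inclusive (start, end) exon coordinates are assumptions made for this example; this is not MAKER-P's own code.

```python
def covered_bases(intervals):
    """Return the set of positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def annotation_edit_distance(annotation_exons, evidence_exons):
    """Nucleotide-level AED sketch: 0.0 is perfect agreement, 1.0 is no overlap."""
    ann = covered_bases(annotation_exons)
    evi = covered_bases(evidence_exons)
    if not ann or not evi:
        # Assumption: with nothing to compare, treat the model as unsupported.
        return 1.0
    shared = len(ann & evi)
    sensitivity = shared / len(evi)   # evidence bases recovered by the annotation
    specificity = shared / len(ann)   # annotation bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    # Hypothetical coordinates: the annotation and evidence share half their bases.
    print(annotation_edit_distance([(1, 100)], [(51, 150)]))
```

Under this definition a lower AED means the annotation agrees more closely with its evidence, which matches the better-to-worse axis of the figure.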

MAKER-P MPI Support

Message Passing Interface (MPI) is a communication protocol for computer clusters which essentially allows multiple computers to act like a single powerful machine.

Annotating the Genome – Apollo View

[Figure: Apollo view showing the "Current evidence" and "Current Assembly" tracks]

Identify and Mask Repetitive Elements

• RepeatMasker

  – RepBase

  – Species specific library

• RepeatRunner

  – MAKER internal protein library
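Masking itself is simple once repeat coordinates are known. The sketch below assumes 0-based, half-open (start, end) repeat intervals; these are hypothetical inputs here, and in practice they would come from a tool such as RepeatMasker. Soft masking lowercases repeat bases, while hard masking replaces them with N. The function names are invented for this example.

```python
def soft_mask(sequence, repeats):
    """Lowercase the bases inside each 0-based, half-open (start, end) interval."""
    masked = list(sequence.upper())
    for start, end in repeats:
        # Clamp to the sequence so out-of-range intervals are ignored safely.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


def hard_mask(sequence, repeats):
    """Replace the bases inside each interval with 'N'."""
    masked = list(sequence.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    repeats = [(2, 5)]  # hypothetical repeat interval
    print(soft_mask(seq, repeats))
    print(hard_mask(seq, repeats))
```

Soft masking keeps the underlying sequence available to later steps, whereas hard masking removes it entirely.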

Generate Ab Initio Gene Predictions

[Figure: Apollo view adding an "Ab initio Predictions" track below the evidence and assembly tracks]

• MAKER currently supports:

  – SNAP

  – Augustus

  – GeneMark

  – FGENESH

• Can be run internally or externally

Align EST and Protein Evidence

[Figure: Apollo view adding alignment tracks: EST TBLASTX, EST BLASTN, Protein BLASTX]

• Identify regions being actively transcribed (i.e. EST data)

• Identify regions with homology to a known protein

Polish BLAST Alignments with Exonerate

[Figure: Apollo view adding "Polished protein" and "Polished EST" tracks]

• All base pairs must align in order.

• No HSP overlap is permitted.

• Aligns HSPs correctly with respect to splice sites.
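The first two constraints above can be expressed as a simple check on a set of HSPs. This is only a sketch of those stated rules, not Exonerate's algorithm: the `HSP` tuple, its field names, the inclusive coordinates, and the single-strand assumption are all introduced here for illustration. Splice-site handling is not modelled.

```python
from typing import NamedTuple


class HSP(NamedTuple):
    """One high-scoring pair; coordinates are inclusive and on the forward strand."""
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int


def is_collinear(hsps):
    """True if HSPs appear in the same order on query and subject with no overlap."""
    ordered = sorted(hsps, key=lambda h: h.subject_start)
    for prev, curr in zip(ordered, ordered[1:]):
        if curr.subject_start <= prev.subject_end:
            return False  # HSPs overlap on the genome
        if curr.query_start <= prev.query_end:
            return False  # HSPs overlap or are out of order on the query
    return True


if __name__ == "__main__":
    # Hypothetical two-exon alignment of a 120-residue query.
    good = [HSP(1, 50, 100, 249), HSP(51, 120, 400, 609)]
    shuffled = [HSP(51, 120, 100, 249), HSP(1, 50, 400, 609)]
    print(is_collinear(good), is_collinear(shuffled))
```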

Pass Gene Finders Evidence-based ‘hints’

[Figure: Apollo view adding "Hint-based SNAP" and "Hint-based FgenesH" tracks]

Identify Gene Model Most Consistent with Evidence*

*Eilbeck, Moore, Holt & Yandell. 2009. Quantitative Measures for the Management and Comparison of Annotated Genomes. BMC Bioinformatics 10:67. doi:10.1186/1471-2105-10-67

Revise it further if necessary; Create New Annotation

Compute Support for Each Portion of Gene Model
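One way to picture per-portion support is as the fraction of each exon's bases that are covered by at least one evidence alignment. The sketch below is an illustrative assumption, not the quality values MAKER actually reports; the function name and the inclusive (start, end) coordinates are hypothetical.

```python
def exon_support(exons, evidence):
    """Return, for each exon, the fraction of its bases covered by evidence.

    Both arguments are lists of inclusive (start, end) intervals.
    """
    fractions = []
    for start, end in exons:
        covered = set()
        for ev_start, ev_end in evidence:
            # Intersect the evidence interval with the exon.
            lo, hi = max(start, ev_start), min(end, ev_end)
            if lo <= hi:
                covered.update(range(lo, hi + 1))
        fractions.append(len(covered) / (end - start + 1))
    return fractions


if __name__ == "__main__":
    # Hypothetical gene model with two exons and two evidence alignments.
    exons = [(1, 100), (201, 300)]
    evidence = [(1, 100), (251, 400)]
    print(exon_support(exons, evidence))
```

In this example the first exon is fully supported and the second only half supported, which is the kind of signal a curator could use to decide which part of a model to revise.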

MAKER-P v2.28 at iPlant

• TACC Lonestar

  • Supercomputer with 22,656 CPU cores

  • MPI enabled for parallel computation

  • Can complete entire rice genome in ~2 hrs (1,152 cores; 96 CPU cores per chromosome)

  • Can complete Aegilops tauschii ALLPATHS-LG assembly in ~8 hrs (1,152 cores)

• Currently being integrated into the iPlant Discovery Environment

• Atmosphere

  • MPI enabled for parallel computation

  • Maximum instance size 16 CPU
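The rice figures above are consistent with an even split of cores across chromosomes: rice has 12 chromosomes, and 1,152 / 12 = 96. The sketch below shows that arithmetic and a simple round-robin assignment of sequences to workers. It is an assumption-laden illustration of dividing work, not a description of how MAKER-P's MPI scheduler actually distributes jobs; both function names are hypothetical.

```python
def cores_per_unit(total_cores, units):
    """Split cores evenly across units; return (cores per unit, leftover cores)."""
    return divmod(total_cores, units)


def assign_round_robin(contigs, n_workers):
    """Deal contigs out to workers in turn, like cards from a deck."""
    buckets = [[] for _ in range(n_workers)]
    for i, contig in enumerate(contigs):
        buckets[i % n_workers].append(contig)
    return buckets


if __name__ == "__main__":
    # 1,152 cores over the 12 rice chromosomes.
    print(cores_per_unit(1152, 12))
    # Hypothetical contig names dealt out to two workers.
    print(assign_round_robin(["c1", "c2", "c3", "c4", "c5"], 2))
```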

Assembly & Annotation at iPlant
