Challenges in Creating and Curating Plant PGDBs: Lessons Learned from AraCyc and PoplarCyc Peifen Zhang Carnegie Institution For Science Department of Plant Biology Stanford, CA


Post on 25-Feb-2016

Page 1: Challenges in Creating and Curating Plant PGDBs: Lessons Learned from AraCyc and PoplarCyc

Challenges in Creating and Curating Plant PGDBs: Lessons Learned from AraCyc and PoplarCyc

Peifen Zhang
Carnegie Institution For Science
Department of Plant Biology
Stanford, CA

Page 2: Who We Are

PMN:
- Sue Rhee (PI)
- Kate Dreher (curator)
- A. Karthik (curator, previous)
- Lee Chae (Postdoc)
- Anjo Chi (programmer)
- Cynthia Lee (TAIR tech team)
- Larry Ploetz (TAIR tech team)
- Shanker Singh (TAIR tech team)
- Bob Muller (TAIR tech team)

Key Collaborators:
- Peter Karp (MetaCyc, SRI)
- Ron Caspi (MetaCyc, SRI)
- Lukas Mueller (SGN)
- Anuradha Pujar (SGN)

http://plantcyc.org

Page 3: Introduction

• Background and rationale
  – Plants (food, feed, forest, medicine, biofuel…)
  – An ocean of sequences
    • More than 60 species in genome sequencing projects, hundreds in EST projects
  – Putting individual genes onto a network of metabolic reactions and pathways
    • Annotating, visualizing and analyzing at system level
  – AraCyc (Arabidopsis thaliana, TAIR/PMN)
    • predicted by using the Pathway Tools software, followed by manual curation

Page 4: Introduction (cont)

• Background and rationale
  – Plants (food, feed, forest, medicine, biofuel…)
  – An ocean of sequences
    • More than 60 species in genome sequencing projects, hundreds in EST projects
  – Putting individual genes onto a network of metabolic reactions and pathways
    • Annotating, visualizing and analyzing at system level
  – AraCyc (Arabidopsis thaliana, TAIR/PMN)
    • predicted by using the Pathway Tools software, followed by manual curation
  – Other plant pathway databases predicted by using the Pathway Tools
    • RiceCyc (Oryza sativa, Gramene)
    • MedicCyc (Medicago truncatula, Noble Foundation)
    • LycoCyc (Solanum lycopersicum, SGN), …

Page 5: Limitations

• Creating pathway databases includes three major components, and is resource-intensive
  – Sequence annotation
  – Reference pathway database
  – Pathway prediction, validation, refinement
• Heterogeneous sequence annotation protocols and varying levels of pathway validation impact quality and hinder meaningful cross-species comparison
• Using a non-plant reference database causes many false-positive and false-negative pathway predictions

Page 6: Introducing the PMN

• Scope
  – A platform for plant metabolic pathway database creation
  – A community for data curation
    • Curators, editorial board, allies in other databases, researchers
• Major goals
  – Create a plant-specific reference pathway database (PlantCyc)
  – Create an enzyme sequence annotation pipeline
  – Enhance pathway prediction by using PlantCyc, and including an automated initial validation step
  – Create metabolic pathway databases for plant species
    • e.g. PoplarCyc (Populus trichocarpa), SoyCyc (soybean)

Page 7: PlantCyc Creation

• Nature of PlantCyc
  – Multiple-species, plants-only
  – curator-reviewed pathways, predicted, hypothetical, empirical
  – primary and secondary metabolism
• Major Source
  – All AraCyc pathways and enzymes
  – Plant pathways and enzymes from MetaCyc
  – Additional pathways and enzymes manually curated and added
  – Enzymes from RiceCyc, LycoCyc and MedicCyc

Page 8: PMN Database Content Statistics

             PlantCyc 4.0   AraCyc 7.0   PoplarCyc 2.0
Pathways     685            369          288
Enzymes      11058          5506         3420
Reactions    2929           2418         1707
Compounds    2966           2719         1397
Organisms    343            1            1*

Valuable plant natural products: many are specialized metabolites that are limited to a few species or genera.

• medicinal: e.g. artemisinin and quinine (treatment of malaria), codeine and morphine (painkillers), ginsenosides (cardio-protectant), lupeol (anti-inflammatory), taxol and vinblastine (anti-cancer)
• industrial materials: e.g. resin and rubber
• food flavor and scents: e.g. capsaicin and piperine (chili and pepper flavor), geranyl acetate (aroma of rose) and menthol (mint)

Page 9: Challenges in Creating and Curating Plant PGDBs: Lessons Learned from AraCyc and PoplarCyc
Page 10: Enzyme Sequence Annotation (version 1.0)

• Reference sequences, enzymes with known functions
  – 14,187 enzyme sequences compiled from GOA-UniProt, Brenda, MetaCyc, and TAIR
  – 3805 functional identifiers (full EC number, MetaCyc reaction id, GO id)
• Annotation methods
  – BLASTP
• Cut-off
  – unique e-value threshold for each functional identifier
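The cut-off rule on this slide can be sketched in Python. This is only an illustration, assuming the BLASTP hits have already been parsed into (query, functional identifier, e-value) records. The function name `filter_hits`, the default cut-off, the threshold values and the gene names are hypothetical and are not taken from the PMN pipeline.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    query: str        # query protein ID
    identifier: str   # functional identifier of the reference enzyme that was hit
    evalue: float     # BLASTP e-value of the hit


def filter_hits(hits: List[Hit], thresholds: Dict[str, float],
                default: float = 1e-30) -> List[Hit]:
    """Keep hits whose e-value is at or below the cut-off of their identifier."""
    return [h for h in hits if h.evalue <= thresholds.get(h.identifier, default)]


# Hypothetical thresholds and hits, for illustration only.
thresholds = {"EC 1.1.1.1": 1e-50, "EC 2.7.1.1": 1e-20}
hits = [
    Hit("geneA", "EC 1.1.1.1", 1e-60),  # passes: 1e-60 <= 1e-50
    Hit("geneB", "EC 1.1.1.1", 1e-40),  # fails:  1e-40 >  1e-50
    Hit("geneC", "EC 2.7.1.1", 1e-25),  # passes: 1e-25 <= 1e-20
]
kept = filter_hits(hits, thresholds)
print([h.query for h in kept])
```

A per-identifier threshold lets a stringent cut-off apply to large, diverse enzyme families while a looser one applies to well-separated families; how the real thresholds were chosen is not described on the slide.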

Page 11: Challenges in Creating and Curating Plant PGDBs: Lessons Learned from AraCyc and PoplarCyc

                             Number of enzrxn   TAIR Annotation Accuracy   PMN Annotation Accuracy
Genes common to both: 3900
  enzrxn common to both      2493               n/a                        n/a
  TAIR-only enzrxn (EXP)     567                80% (12/15)                n/a
  TAIR-only enzrxn (IEA)     171                48% (11/23)                n/a
  TAIR-only enzrxn (ISS)     671                45% (10/22)                n/a
  PMN-only enzrxn (IEA)      3421               n/a                        69% (11/16)
Genes unique to TAIR: 2225
  EXP                        397                77% (10/13)                n/a
  IEA                        420                12% (2/17)                 n/a
  ISS                        378                45% (5/11)                 n/a
Genes unique to PMN: 1681    1503               n/a                        35% (12/34)

*Accurate: the annotation came from a top hit that has good homology to a known enzyme

Page 12: Conclusion

• Increased performance with potentially true enzymes
• Over-prediction for non-enzyme proteins

Page 13: Enzyme Sequence Annotation (version 2.0, in progress)

• Reference sequences, proteins with known functions (ERL)
  – SwissProt
    • 117,000 proteins, 26,000 enzymes, 2,400 full EC numbers
  – Additional enzymes from Brenda, MetaCyc, and TAIR
  – Functional identifiers: full EC number, MetaCyc reaction id, GO id
• Annotation methods
  – BLASTP
  – Priam (enzyme-specific, motif-based)
  – CatFam (enzyme-specific, motif-based)
• Function calling
  – Ensemble and voting
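The slide names "ensemble and voting" but does not describe the scheme. The Python sketch below shows one possible reading, a simple majority vote across the three methods. The function name `call_function`, the `min_votes` parameter and the example predictions are assumptions made for illustration and may differ from what the version 2.0 pipeline does.

```python
from collections import Counter
from typing import Dict, Optional


def call_function(predictions: Dict[str, Optional[str]],
                  min_votes: int = 2) -> Optional[str]:
    """Return the functional identifier backed by at least `min_votes` methods."""
    # Methods that made no prediction (None) do not vote.
    votes = Counter(p for p in predictions.values() if p is not None)
    if not votes:
        return None
    identifier, count = votes.most_common(1)[0]
    return identifier if count >= min_votes else None


# Hypothetical per-method predictions for two proteins.
agree = call_function({"BLASTP": "EC 1.1.1.1", "Priam": "EC 1.1.1.1", "CatFam": None})
split = call_function({"BLASTP": "EC 1.1.1.1", "Priam": "EC 2.7.1.1", "CatFam": None})
print(agree, split)
```

Under this assumed rule a function is called only when two methods agree, so a single unsupported BLASTP hit would no longer produce an annotation.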

Page 14: Enzyme Sequence Annotation (version 2.0, in progress)

Lee Chae (unpublished)

Page 15: Application to the Poplar Genome

• Sequence annotation version 1.0

• Pathway Tools version 12.5

• PGDB creation using PlantCyc vs MetaCyc

Page 16: Comparison of the PoplarCyc Initial Builds with Either PlantCyc or MetaCyc as the Reference Database

Reference database used                                                  PlantCyc (2.0)   MetaCyc (12.5)
Total number of pathways in the reference database (version)            646              1395
Total number of predicted pathways                                       285              346
Number of false-positive predictions (false positive rate, FP/(FP+TN))   25 (7.5%)        92 (8.5%)
Database-specific false-positive predictions                             2                69
Number of false-negative predictions (false negative rate, FN/(TP+FN))   51 (16.4%)       56 (18.1%)
Database-specific false-negative predictions                             9                13
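The rates in this comparison can be recomputed from the counts with a short Python sketch. It assumes that true positives are the predicted pathways minus the false positives, and that true negatives are the reference pathways that were neither predicted nor counted as false negatives. The slide does not state these definitions, but under them the percentages in the table are reproduced. The function name `rates` is illustrative.

```python
def rates(total_reference: int, predicted: int, fp: int, fn: int):
    """Return (false positive rate, false negative rate) as percentages."""
    tp = predicted - fp                     # predicted pathways judged correct
    tn = total_reference - predicted - fn   # assumed: not predicted and truly absent
    fpr = 100 * fp / (fp + tn)
    fnr = 100 * fn / (tp + fn)
    return round(fpr, 1), round(fnr, 1)


plantcyc = rates(total_reference=646, predicted=285, fp=25, fn=51)
metacyc = rates(total_reference=1395, predicted=346, fp=92, fn=56)
print(plantcyc)  # (7.5, 16.4)
print(metacyc)   # (8.5, 18.1)
```

The false positive rates are close because MetaCyc contains many more pathways that are correctly not predicted; the difference between the two references shows up in the absolute counts, 25 versus 92.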

Page 17: Conclusion

• The absolute number of false-positive pathways was reduced significantly by using PlantCyc as the reference
• The number of false-negative pathways was comparable using either PlantCyc or MetaCyc as the reference, indicating the usefulness of both databases as references

Page 18: Automated Initial Pathway Validation

– Remove non-plant pathways, identified from manual validation of AraCyc and PoplarCyc
  • A list of 132 MetaCyc pathways (an up-to-date file is posted online)
– Add universal plant pathways
  • A list of 115 pathways (an up-to-date file is posted online)
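The two validation steps on this slide amount to a set difference followed by a set union. The Python sketch below assumes the two curated lists are available as sets of pathway identifiers; the function name `auto_validate` and the placeholder identifiers are hypothetical and do not come from the posted PMN files.

```python
from typing import Set


def auto_validate(predicted: Set[str], non_plant: Set[str],
                  universal_plant: Set[str]) -> Set[str]:
    """Drop known non-plant pathways, then add pathways treated as universal in plants."""
    return (predicted - non_plant) | universal_plant


# Placeholder pathway identifiers, for illustration only.
predicted = {"PWY-A", "PWY-B", "PWY-C"}
non_plant = {"PWY-B"}                  # stands in for the list of 132 pathways
universal_plant = {"PWY-A", "PWY-D"}   # stands in for the list of 115 pathways
validated = auto_validate(predicted, non_plant, universal_plant)
print(sorted(validated))
```

In this example PWY-B is removed as a non-plant pathway and PWY-D is added even though it was not predicted.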

Page 19: A Recap of the PMN Workflow

Enzyme sequence annotation

Pathway prediction (PlantCyc)

Pathway prediction (MetaCyc)

Automated pathway validation

Manual validation

Page 20: An Example of Practical Issues

Page 21: Updating AraCyc with TAIR Functional Annotations

• Source and quality
  – Literature-based GO annotations
  – Catalytic activities
  – Experimental evidence (IDA, IMP, IGI, IPI, IEP)

Page 22: Problem

• TAIR: AT5G13700 (polyamine oxidase, IDA, PubMed 16778015)
• Polyamine oxidase reactions in MetaCyc/PlantCyc:
• Which one of the reactions catalyzed by AT5G13700 was supported in the paper?

Page 23: Conclusion

• Do not automatically propagate GO-exp annotations to enzrxns
• Enter them manually, along with the appropriate evidence
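One way to act on this conclusion is to turn each experimental GO annotation into a curation task that lists the candidate reactions, leaving the assignment to a curator. The Python sketch below is an assumption about how that could look, not a description of the PMN tooling. The gene, evidence code and PubMed ID come from the preceding slide; the reaction identifiers and the names `GoAnnotation` and `curation_queue` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class GoAnnotation(NamedTuple):
    gene: str
    activity: str
    evidence: str
    pubmed: str


def curation_queue(annotations: List[GoAnnotation],
                   activity_to_reactions: Dict[str, List[str]]) -> List[dict]:
    """List candidate reactions per annotation; nothing is assigned automatically."""
    queue = []
    for ann in annotations:
        queue.append({
            "gene": ann.gene,
            "pubmed": ann.pubmed,
            "evidence": ann.evidence,
            "candidates": activity_to_reactions.get(ann.activity, []),
            "assigned": None,  # filled in by a curator after reading the paper
        })
    return queue


# Reaction IDs are placeholders; the real polyamine oxidase reactions are not listed here.
anns = [GoAnnotation("AT5G13700", "polyamine oxidase", "IDA", "16778015")]
mapping = {"polyamine oxidase": ["RXN-A", "RXN-B", "RXN-C"]}
queue = curation_queue(anns, mapping)
print(queue[0]["candidates"], queue[0]["assigned"])
```

Because one GO activity can map to several reactions, the task keeps every candidate and records the assignment only after the curator has checked which reaction the paper supports.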

Page 24: Future Work

• Enhance pathway prediction and validation
  – Using additional evidence, such as presence of compounds, weighted confidence of enzyme annotations
• Refine pathways, hole-filling
  – Including non-sequence homology based information in enzyme function prediction, such as phylogenetic profiles, co-expression, protein structure
• Create new pathway databases
  – moss (P. patens), Selaginella, maize, cassava, wine grape …
• Add new data types, critical for strategic planning of metabolic engineering
  – Rate-limiting step
  – Transcriptional regulator

Page 25: Thank you for your attention!