Subliminal: exploiting semantic annotations in the reconstruction of metabolic networks

Upload: neil-swainston

Post on 11-Jun-2015

Neil Swainston

Manchester Centre for Integrative Systems Biology, University of Manchester, Manchester M1 7ND, UK

This work has been supported by the BBSRC/EPSRC grant: the Manchester Centre for Integrative Systems Biology

1. Applications of genome-scale metabolic reconstructions. Oberhardt MA, Palsson BØ, Papin JA. Mol Syst Biol. (2009) 5, 320.
2. A protocol for generating a high-quality genome-scale metabolic reconstruction. Thiele I, Palsson BØ. Nat Protoc. (2010) 5, 93-121.
3. High-throughput generation, optimization and analysis of genome-scale metabolic models. Henry CS, DeJongh M, et al. Nat Biotechnol. (2010) 28, 977-82.
4. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Hucka M, Finney A, et al. Bioinformatics. (2003) 19, 524-31.
5. Minimum information requested in the annotation of biochemical models (MIRIAM). Le Novère N, Finney A, et al. Nat Biotechnol. (2005) 23, 1509-15.
6. A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Herrgård MJ, Swainston N, et al. Nat Biotechnol. (2008) 26, 1155-60.
7. libAnnotationSBML: a library for exploiting SBML annotations. Swainston N, Mendes P. Bioinformatics. (2009) 25, 2292-3.
8. ChEBI: a database and ontology for chemical entities of biological interest. Degtyarenko K, de Matos P, et al. Nucleic Acids Res. (2008) 36, D344-50.
9. The Universal Protein Resource (UniProt) in 2010. UniProt Consortium. Nucleic Acids Res. (2010) 38, D142-8.
10. http://sbml.org/Software/KEGG2SBML/
11. The EcoCyc and MetaCyc databases. Karp PD, Riley M, et al. Nucleic Acids Res. (2000) 28, 56-9.
12. http://www.iupac.org/inchi/
13. The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. Steinbeck C, Han Y, et al. J Chem Inf Comput Sci. (2003) 43, 493-500.
14. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Nakai K, Horton P. Trends Biochem Sci. (1999) 24, 34-6.
15. PathText: a text mining integrator for biological pathway visualizations. Kemper B, Matsuzaki T, et al. Bioinformatics. (2010) 26, i374-81.

Introduction

Metabolic network reconstructions have become increasingly common in recent years. They now cover a range of organisms and have been applied to a number of research topics, including metabolic engineering, genome annotation, evolutionary studies, network property analysis, and the interpretation of omics datasets [1].

The process of developing such reconstructions is now well defined and is recognised as being time-consuming [2]. While many of the steps involved in generating a high-quality reconstruction require manual curation, some are amenable to automation, making it possible to generate a draft reconstruction automatically for use in subsequent manual curation [3].

The importance of using standard representations such as SBML [4] and the MIRIAM standard [5] has been recognised [6]: reconstructions in which all components are semantically annotated with unambiguous database identifiers are far easier for third parties to use.

However, to date, the use of semantic annotations has focused on the usability of the reconstruction after publication. Subliminal is a toolbox that exploits semantic annotations during the reconstruction process itself. It uses libAnnotationSBML [7] and web service interfaces to external databases such as ChEBI [8] and UniProt [9] to retrieve chemical and protein data. These data are then used to automate chemical protonation state determination, reaction mass / charge balancing, and enzyme (and reaction) localisation.

Initial pre-draft pathways: KEGG2SBML and other sources

Initial pre-draft pathways for a given organism are generated with the existing KEGG2SBML tool [10]. KEGG2SBML produces SBML files representing individual metabolic pathways, which are then enriched with semantic annotations: references to ChEBI identifiers for metabolites, UniProt identifiers for enzymes, and EC terms.
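As a rough illustration of what such an annotation looks like, the sketch below builds resource URIs for a metabolite, an enzyme and an EC term. The identifiers.org URI layout and the helper name `miriam_uri` are assumptions made for this example; the poster does not state which URI scheme Subliminal writes, and the UniProt accession is only a placeholder.

```python
# Collection names assumed for illustration (identifiers.org-style namespaces).
NAMESPACES = {"chebi": "chebi", "uniprot": "uniprot", "ec": "ec-code"}


def miriam_uri(database: str, identifier: str) -> str:
    """Return an identifiers.org-style URI for a database entry (hypothetical helper)."""
    try:
        namespace = NAMESPACES[database]
    except KeyError:
        raise ValueError(f"unsupported database: {database}") from None
    return f"http://identifiers.org/{namespace}/{identifier}"


annotations = {
    "ATP": miriam_uri("chebi", "CHEBI:15422"),
    "some enzyme": miriam_uri("uniprot", "P12345"),  # placeholder accession
    "hexokinase activity": miriam_uri("ec", "2.7.1.1"),
}
```

Because two pathway files that refer to the same metabolite end up carrying the same URI, these strings can serve as keys when comparing or combining files.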

Subsequent work will focus on generating additional pathways from MetaCyc [11] and genome sequences.

Model merging: pre-draft reconstruction

Protonation state prediction

Reaction mass / charge balancing

Protein localisation

Future directions

While individual steps in the reconstruction process are amenable to automation, gap-filling, manual curation and validation remain essential to generating a high-quality reconstruction. Semantic annotations can further aid validation: chemical synonyms can be harvested automatically and fed to text-mining tools such as PathText [15], simplifying the arduous but necessary task of finding literature evidence for present (and missing) reactions.

Automated acquisition of the InChI [12] (or SMILES) string for each metabolite from the ChEBI database allows the protonation state of the metabolite at a given pH to be predicted using cheminformatics resources such as the Chemistry Development Kit (CDK) [13].
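To make the idea of a pH-dependent protonation state concrete, here is a deliberately simple sketch. It does not use the CDK and does not predict anything from an InChI or SMILES string; it assumes the pKa values of the ionisable groups are already known and applies the textbook rule that a group is mostly deprotonated when the pH is above its pKa. The function name and its arguments are hypothetical.

```python
def dominant_charge(acidic_pkas, basic_pkas, ph):
    """Estimate the net charge of the dominant species at the given pH.

    Starts from the neutral form. Each acidic group whose pKa is below the
    pH is counted as deprotonated (-1); each basic group whose conjugate
    acid has a pKa above the pH is counted as protonated (+1).
    """
    deprotonated = sum(1 for pka in acidic_pkas if pka < ph)
    protonated = sum(1 for pka in basic_pkas if pka > ph)
    return protonated - deprotonated


# Glycine: carboxyl pKa about 2.34, amino pKa about 9.60.
glycine_at_ph7 = dominant_charge([2.34], [9.60], ph=7.0)
# Acetic acid: pKa about 4.76.
acetate_at_ph7 = dominant_charge([4.76], [], ph=7.0)
```

Under these assumptions glycine comes out as a net-neutral zwitterion at pH 7 and acetic acid as the singly charged acetate anion. A structure-based tool is still needed to obtain the pKa values in the first place.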

By acquiring the chemical formula and charge of each metabolite from the ChEBI database, each reaction can be represented as a matrix, A, whose columns hold the element counts and charge of each reactant and product. The vector, b, holds the stoichiometric coefficients of the reactants and products (with opposite signs for the two sides of the reaction, so that a balanced reaction sums to zero). Mixed integer linear programming can be applied to solve Ab = 0, producing a vector of stoichiometric coefficients to be applied to each reactant and product. Commonly absent species, such as water, protons and CO2, can also be considered as candidates, allowing previously unbalanceable reactions (for example, from KEGG) to be balanced automatically.

Ab = 0
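The balancing step can be sketched in a few lines. To stay dependency-free, the example replaces the mixed integer linear programme with an exhaustive search over small integer coefficients, which finds the same minimal solution for small reactions but would not scale. The function name `balance`, the `max_coeff` bound and the minimise-total-coefficients objective are assumptions of this sketch, and the formulae and charges are typed in by hand rather than retrieved from ChEBI.

```python
from itertools import product


def balance(reactants, products, optional=(), max_coeff=3):
    """Return signed integer coefficients b with Ab = 0, or None.

    Each species is a dict mapping element symbols, plus the key "charge",
    to integer counts. Negative coefficients mean "consumed". Species in
    `optional` may be added to either side or left out (coefficient 0).
    """
    required = [(-1, s) for s in reactants] + [(+1, s) for s in products]
    columns = [s for _, s in required] + list(optional)  # columns of A
    rows = sorted({key for column in columns for key in column})  # rows of A

    required_ranges = [range(1, max_coeff + 1)] * len(required)
    optional_ranges = [range(-max_coeff, max_coeff + 1)] * len(optional)

    best = None
    for magnitudes in product(*required_ranges, *optional_ranges):
        b = [sign * m for (sign, _), m in zip(required, magnitudes)]
        b += magnitudes[len(required):]
        balanced = all(
            sum(coeff * col.get(row, 0) for coeff, col in zip(b, columns)) == 0
            for row in rows
        )
        if balanced:
            cost = sum(abs(coeff) for coeff in b)
            if best is None or cost < best[0]:
                best = (cost, b)
    return None if best is None else best[1]


# Charged forms near neutral pH, entered by hand for illustration.
GLUCOSE = {"C": 6, "H": 12, "O": 6, "charge": 0}
ATP = {"C": 10, "H": 12, "N": 5, "O": 13, "P": 3, "charge": -4}
G6P = {"C": 6, "H": 11, "O": 9, "P": 1, "charge": -2}
ADP = {"C": 10, "H": 12, "N": 5, "O": 10, "P": 2, "charge": -3}
WATER = {"H": 2, "O": 1, "charge": 0}
PROTON = {"H": 1, "charge": 1}

# glucose + ATP -> G6P + ADP is one proton short on the product side.
coefficients = balance([GLUCOSE, ATP], [G6P, ADP], optional=[WATER, PROTON])
# coefficients == [-1, -1, 1, 1, 0, 1]: water unused, one proton produced
```

Requiring a coefficient of at least 1 for every species already in the reaction rules out the trivial solution b = 0, while the optional species are free to take any sign or to drop out.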

Because every initial pre-draft pathway, irrespective of its source, is semantically annotated with comparable terms, the pathways can be merged automatically to generate a pre-draft reconstruction in which duplicate metabolites, enzymes and reactions are removed.
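A minimal sketch of annotation-driven merging follows. It assumes each pathway has already been read into plain dictionaries; the data layout, the function name `merge_pathways` and the annotation strings are all hypothetical (the annotations are placeholders, not real ChEBI accessions). Enzymes are left out for brevity, and reactions are compared by their sets of participants only, ignoring stoichiometry.

```python
def merge_pathways(pathways):
    """Merge pathways, keeping one entry per annotated metabolite and reaction."""
    metabolites = {}  # annotation -> name (first name seen wins)
    reactions = {}    # (reactant annotations, product annotations) -> name
    for pathway in pathways:
        # Map this pathway's local species ids to their shared annotations.
        annotation_of = {s["id"]: s["annotation"] for s in pathway["species"]}
        for s in pathway["species"]:
            metabolites.setdefault(s["annotation"], s["name"])
        for r in pathway["reactions"]:
            key = (
                frozenset(annotation_of[i] for i in r["reactants"]),
                frozenset(annotation_of[i] for i in r["products"]),
            )
            reactions.setdefault(key, r["name"])
    return metabolites, reactions


def species(local_id, name, annotation):
    return {"id": local_id, "name": name, "annotation": annotation}


# Two sources with different local ids and names but comparable annotations.
pathway_a = {
    "species": [
        species("a1", "glucose", "CHEBI:glc"),  # placeholder annotations
        species("a2", "ATP", "CHEBI:atp"),
        species("a3", "glucose 6-phosphate", "CHEBI:g6p"),
        species("a4", "ADP", "CHEBI:adp"),
    ],
    "reactions": [
        {"name": "hexokinase", "reactants": ["a1", "a2"], "products": ["a3", "a4"]},
    ],
}
pathway_b = {
    "species": [
        species("b1", "D-glucose", "CHEBI:glc"),
        species("b2", "ATP", "CHEBI:atp"),
        species("b3", "G6P", "CHEBI:g6p"),
        species("b4", "ADP", "CHEBI:adp"),
        species("b5", "fructose 6-phosphate", "CHEBI:f6p"),
    ],
    "reactions": [
        {"name": "HK", "reactants": ["b2", "b1"], "products": ["b4", "b3"]},
        {"name": "glucose-6-phosphate isomerase", "reactants": ["b3"], "products": ["b5"]},
    ],
}

metabolites, reactions = merge_pathways([pathway_a, pathway_b])
```

Here the hexokinase reaction appears in both inputs under different names and local ids, yet it is recognised as a duplicate because its participants carry the same annotations.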

Because each enzyme is annotated with UniProt terms, the UniProt web services can be queried to acquire each protein sequence automatically. The sequences can then be fed to subcellular localisation prediction algorithms such as PSORT [14] in order to predict the subcellular location of the enzyme and, by implication, of the reaction(s) that it catalyses.
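The sequence-retrieval half of this step might look like the sketch below. The URL layout is an assumption based on the current UniProt REST service rather than the web services available when the poster was written, the helper names are hypothetical, and the FASTA record is a made-up placeholder so that the example runs offline. How the sequences are then submitted to PSORT is not shown.

```python
def uniprot_fasta_url(accession: str) -> str:
    """URL of the FASTA record for a UniProt accession (REST layout assumed)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without the leading '>') to its sequence."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


# Placeholder record; not a real protein sequence.
example = ">sp|P12345|EXAMPLE placeholder record\nMKTAYIAK\nQRQISFVK\n"
sequences = parse_fasta(example)
```

In practice the text passed to `parse_fasta` would come from fetching `uniprot_fasta_url(accession)` for each annotated enzyme, for example with `urllib.request.urlopen`.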