PathoLogic Pathway Predictor
SRI International Bioinformatics
Inference of Metabolic Pathways
[Figure labels: Annotated Genomic Sequence (DNA Sequences, Genes/ORFs, Gene Products); Multi-organism Pathway Database (MetaCyc) (Pathways, Reactions, Compounds); PathoLogic Software; Pathway/Genome Database (Genomic Map, Genes, Gene Products, Reactions, Pathways, Compounds)]
PathoLogic Software integrates genome and pathway data to identify putative metabolic networks
PathoLogic Functionality
Initialize schema for new PGDB
Transform existing genome to PGDB form
Infer metabolic pathways and store in PGDB
Infer operons and store in PGDB
Assemble Overview diagram
Assist user with manual tasks:
  Assign enzymes to reactions they catalyze
  Identify false-positive pathway predictions
  Build protein complexes from monomers
  Infer transport reactions
PathoLogic Input/Output
Inputs:
  File listing genetic elements (http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat)
  Files containing DNA sequence for each genetic element
  Files containing annotation for each genetic element
  MetaCyc database
Output:
  Pathway/genome database for the subject organism
  Reports that summarize:
    Evidence contained in the input genome for the presence of reference pathways
    Reactions missing from inferred pathways
PathoLogic Analysis Phases
Trial parsing of input data files [few days]
Initialize schema of new PGDB [3 min]
Create DB objects for replicons, genes, proteins [5 min]
Assign enzymes to reactions they catalyze [10 min / 1 week]
  [Figure labels: ferrochelatase; glutamate 1-semialdehyde 2,1-aminomutase; porphobilinogen deaminase; compounds A, B, C, D, E, F, G; enzymes E1, E2]
From assigned reactions, infer what pathways are present [5 min / few days]
Define metabolic overview diagram [30 min]
Define protein complexes [few days]
genetic-elements.dat
ID TEST-CHROM-1
NAME Chromosome 1
TYPE :CHRSM
CIRCULAR? N
ANNOT-FILE chrom1.pf
SEQ-FILE chrom1.fsa
//
ID TEST-CHROM-2
NAME Chromosome 2
CIRCULAR? N
ANNOT-FILE /mydata/chrom2.gbk
SEQ-FILE /mydata/chrom2.fna
//
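The record layout above can be read with a few lines of code. The following is a minimal sketch, not part of Pathway Tools: the function name parse_genetic_elements is hypothetical, and it assumes one attribute per line, with the attribute name separated from its value by whitespace and a line containing only // closing each record.

```python
def parse_genetic_elements(text):
    """Parse genetic-elements.dat style text into a list of dicts.

    Assumptions (not confirmed by the slides): one attribute per line,
    attribute name separated from the value by whitespace, and a line
    containing only '//' ending each record.
    """
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line == "//":
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        current[key] = value
    return records


SAMPLE = """ID TEST-CHROM-1
NAME Chromosome 1
TYPE :CHRSM
CIRCULAR? N
ANNOT-FILE chrom1.pf
SEQ-FILE chrom1.fsa
//
ID TEST-CHROM-2
NAME Chromosome 2
CIRCULAR? N
ANNOT-FILE /mydata/chrom2.gbk
SEQ-FILE /mydata/chrom2.fna
//
"""

elements = parse_genetic_elements(SAMPLE)
for element in elements:
    print(element["ID"], element["ANNOT-FILE"], element["SEQ-FILE"])
```

A check like this is only a convenience before the trial parse; the trial parse in PathoLogic itself remains the authoritative test of the input files.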
File Naming Conventions
One pair of sequence and annotation files for each genetic element
Sequence files: FASTA format, suffix .fsa or .fna
Annotation files:
  GenBank format: suffix .gbk
  PathoLogic format: suffix .pf
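As an illustration of the naming conventions, a small helper can classify an input file by its suffix. This is a sketch only: classify_input_file is a hypothetical name, and treating suffixes case-insensitively is an assumption that the slides do not state.

```python
from pathlib import PurePosixPath

# Suffixes taken from the naming conventions above.
SEQUENCE_SUFFIXES = {".fsa", ".fna"}
ANNOTATION_FORMATS = {".gbk": "GenBank", ".pf": "PathoLogic"}


def classify_input_file(filename):
    """Return (role, format) for a file name, based only on its suffix."""
    suffix = PurePosixPath(filename).suffix.lower()  # case-folding is an assumption
    if suffix in SEQUENCE_SUFFIXES:
        return ("sequence", "FASTA")
    if suffix in ANNOTATION_FORMATS:
        return ("annotation", ANNOTATION_FORMATS[suffix])
    return ("unknown", None)


print(classify_input_file("chrom1.pf"))
print(classify_input_file("/mydata/chrom2.fna"))
```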
Typical Problems Using GenBank Files With PathoLogic
Wrong qualifier names used: read PathoLogic documentation!
Extraneous information in a given qualifier
Check results of trial parse carefully
GenBank File Format
Accepted feature types: CDS, tRNA, rRNA, misc_RNA
Accepted qualifiers ([req] = required, [recm] = recommended, [opt] = optional):
  /locus_tag        Unique ID [recm]
  /gene             Gene name [req]
  /product          [req]
  /EC_number        [recm]
  /product_comment  [opt]
  /gene_comment     [opt]
  /alt_name         Synonyms [opt]
  /pseudo           Gene is a pseudogene [opt]
For multifunctional proteins, put each function in a separate /product line
PathoLogic File Format
Each record starts with a line containing an ID attribute
Tab delimited
Each record ends with a line containing //
One attribute-value pair is allowed per line
  Use multiple FUNCTION lines for multifunctional proteins
Lines starting with ‘;’ are comment lines
Valid attributes are:
  ID, NAME, SYNONYM
  STARTBASE, ENDBASE, GENE-COMMENT
  FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
  DBLINK
  INTRON
PathoLogic File Format (example records)
ID TP0734
NAME deoD
STARTBASE 799084
ENDBASE 799785
FUNCTION purine nucleoside phosphorylase
DBLINK PID:g3323039
PRODUCT-TYPE P
GENE-COMMENT similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID TP0735
NAME gltA
STARTBASE 799867
ENDBASE 801423
FUNCTION glutamate synthase
DBLINK PID:g3323040
PRODUCT-TYPE P
//
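The format rules above (ID first, tab-delimited pairs, // terminator, ‘;’ comments, repeatable FUNCTION lines) are enough to sketch a reader. The code below is illustrative only and is not the PathoLogic parser: parse_pf is a hypothetical name, every attribute is stored as a list so that repeated FUNCTION lines are kept, and accepting a final record with no closing // is a leniency added here.

```python
VALID_ATTRIBUTES = {
    "ID", "NAME", "SYNONYM", "STARTBASE", "ENDBASE", "GENE-COMMENT",
    "FUNCTION", "PRODUCT-TYPE", "EC", "FUNCTION-COMMENT", "DBLINK", "INTRON",
}


def parse_pf(text):
    """Parse PathoLogic-format annotation text into a list of records.

    Each record is a dict mapping attribute name -> list of values.
    """
    records = []
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith(";"):
            continue  # blank line or comment line
        if line.strip() == "//":
            if current is not None:
                records.append(current)
            current = None
            continue
        attr, _, value = line.partition("\t")  # tab-delimited pair
        attr = attr.strip()
        if attr not in VALID_ATTRIBUTES:
            raise ValueError(f"unknown attribute: {attr!r}")
        if current is None:
            if attr != "ID":
                raise ValueError("each record must start with an ID line")
            current = {}
        current.setdefault(attr, []).append(value.strip())
    if current is not None:
        records.append(current)  # lenient: last record lacked a closing //
    return records


SAMPLE = (
    "; example records from the slide\n"
    "ID\tTP0734\n"
    "NAME\tdeoD\n"
    "STARTBASE\t799084\n"
    "ENDBASE\t799785\n"
    "FUNCTION\tpurine nucleoside phosphorylase\n"
    "DBLINK\tPID:g3323039\n"
    "PRODUCT-TYPE\tP\n"
    "//\n"
    "ID\tTP0735\n"
    "NAME\tgltA\n"
    "STARTBASE\t799867\n"
    "ENDBASE\t801423\n"
    "FUNCTION\tglutamate synthase\n"
    "DBLINK\tPID:g3323040\n"
    "PRODUCT-TYPE\tP\n"
    "//\n"
)

genes = parse_pf(SAMPLE)
for gene in genes:
    print(gene["ID"][0], gene["NAME"][0], gene["FUNCTION"])
```

Storing values as lists means a multifunctional protein with several FUNCTION lines needs no special handling.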
Before you start: What to do when an error occurs
Most Navigator errors are automatically trapped; debugging information is saved to the error.tmp file.
All other errors (including most PathoLogic errors) will cause the software to drop into the Lisp debugger.
  Unix: The error message will show up in the original terminal window from which you started Pathway Tools.
  Windows: The error message will show up in the Lisp console. The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt.
Two goals when an error occurs:
  Try to continue working
  Obtain enough information for a bug report to send to the Pathway Tools support team
The Lisp Debugger
Sample error (details and number of restart actions differ for each case):
Error: Received signal number 2 (Keyboard interrupt)
Restart actions (select using :continue):
0: continue computation
1: Return to command level
2: Pathway Tools version 10.0 top level
3: Exit Pathway Tools version 10.0
[1c] EC(2):
To generate debugging information (stack backtrace):
  :zoom :count :all
To continue from error, find a restart that takes you to the top level (in this case, number 2):
  :cont 2
To exit Pathway Tools:
  :exit
How to report an error
Determine if the problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed)
Send email to [email protected] containing:
  Pathway Tools version number and platform
  Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem
  error.tmp file, if one was generated
  If the software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on the previous slide)
Using the PPP GUI to Create a Pathway/Genome Database
Input Project Information
  Organism -> Create New
Next Steps
Trial Parse
  Build -> Trial Parse
  Fix any errors in input files
Build pathway/genome database
  Build -> Automated Build
PathoLogic Parser Output
Assign Enzymes to Reactions
[Flowchart: does the gene product name (example: UDP-glucose-4-epimerase, EC 5.1.3.2) match an enzyme in MetaCyc (example compounds: UDP-D-glucose, UDP-galactose)? Yes: assign. No: is it a probable enzyme (name contains -ase)? No: not a metabolic enzyme. Yes: manually search; if a match is found, assign; otherwise, can't assign.]
Enzyme Name Matcher
Matches on full enzyme name
Match is case-insensitive and removes the punctuation characters “ -_(){}',:”
Also matches after removal of prefixes and suffixes such as:
  “Putative”, “Hypothetical”, etc
  alpha|beta|…|catalytic|inducible chain|subunit|component
  Parenthetical gene name
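The matching rules above can be sketched as a two-step lookup: first try the full name with case and punctuation removed, then retry after stripping prefixes, suffixes, and a trailing parenthetical gene name. This is an approximation, not the PathoLogic implementation. The function names are hypothetical, the prefix and suffix lists contain only the items visible on the slide, and reading "alpha|beta|…|catalytic|inducible chain|subunit|component" as an optional qualifier followed by a unit word is an assumption.

```python
import re

PUNCTUATION = " -_(){}',:"                      # characters removed before comparing
PREFIXES = ("putative", "hypothetical")         # slide lists these "etc"
QUALIFIERS = ("alpha", "beta", "catalytic", "inducible")
UNIT_WORDS = ("chain", "subunit", "component")

_DELETE_PUNCT = str.maketrans("", "", PUNCTUATION)
_PAREN_RE = re.compile(r"\s*\([^()]*\)\s*$")
_PREFIX_RE = re.compile(r"^(?:%s)\s+" % "|".join(PREFIXES))
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:(?:%s)\s+)?(?:%s)\s*$" % ("|".join(QUALIFIERS), "|".join(UNIT_WORDS))
)


def canonical(name):
    """Lower-case the name and delete the punctuation characters."""
    return name.lower().translate(_DELETE_PUNCT)


def candidate_keys(name):
    """Return canonical forms to try: the full name first, then a reduced form."""
    keys = [canonical(name)]
    reduced = name.lower()
    reduced = _PAREN_RE.sub("", reduced)    # drop trailing "(geneName)"
    reduced = _PREFIX_RE.sub("", reduced)   # drop leading "putative " etc.
    reduced = _SUFFIX_RE.sub("", reduced)   # drop trailing "alpha subunit" etc.
    key = canonical(reduced)
    if key not in keys:
        keys.append(key)
    return keys


def match_enzyme(name, known_names):
    """Return the known enzyme name that matches, or None."""
    index = {canonical(known): known for known in known_names}
    for key in candidate_keys(name):
        if key in index:
            return index[key]
    return None


known = ["UDP-glucose-4-epimerase", "ATP synthase"]
print(match_enzyme("Putative UDP-glucose 4-epimerase (galE)", known))
print(match_enzyme("ATP synthase alpha subunit", known))
```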
Enzyme Name Matcher
For names that do not match, software identifies probable metabolic enzymes as those
  Containing “ase”
  Not containing keywords such as “sensor kinase”, “topoisomerase”, “protein kinase”, “peptidase”, etc
Research unknown enzymes
  MetaCyc, Swiss-Prot, PubMed
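The probable-enzyme test described above amounts to a substring filter. A minimal sketch follows; the function name is hypothetical, and the keyword list holds only the four examples shown on the slide, whereas the real list is longer ("Etc").

```python
# Only the keywords shown on the slide; the actual exclusion list is longer.
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def is_probable_metabolic_enzyme(name):
    """True if the name contains 'ase' and none of the excluded keywords."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)


for product in ("ferrochelatase", "DNA topoisomerase I", "ribosomal protein L2"):
    print(product, is_probable_metabolic_enzyme(product))
```

A plain substring test also accepts names such as "baseplate protein", which is one reason the flagged names still need manual research.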
Enzyme Name to Reaction Mapping
See also file PTools Tutorial/PathoLogic Reports/name-matching-report.txt
Manual Polishing
Refine -> Assign Probable Enzymes (do this first)
Refine -> Rescore Pathways (redo after assigning enzymes)
Refine -> Create Protein Complexes (can be done at any time)
Refine -> Assign Modified Proteins (can be done at any time)
Refine -> Transport Identification Parser (can be done at any time)
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Update Overview (do this last, and repeat after any material changes to PGDB)
Assign Probable Enzymes
How to find reactions for probable enzymes
First, verify that the enzyme name describes a specific, metabolic function
Search for a fragment of the name in MetaCyc; you may be able to find a match that PathoLogic missed
Look up the protein in SwissProt or other DBs
Search for the gene name in the PGDB for a related organism (bear in mind that gene names are not reliable indicators of function, so check carefully)
Search for the function name in PubMed
Other…
Manual Polishing
Refine -> Assign Probable Enzymes
Refine -> Rescore Pathways
Refine -> Create Protein Complexes
Refine -> Assign Modified Proteins
Refine -> Transport Identification Parser
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Run Consistency Checker
Refine -> Update Overview
Automated Pathway Inference
All pathways in MetaCyc for which there is at least one enzyme identified in the target organism are considered for possible inclusion.
Algorithm errs on side of inclusivity – easier to manually delete a pathway from an organism than to find a pathway that should have been predicted but wasn’t.
Considerations taken into account when deciding whether or not a pathway should be inferred:
  Is there a unique enzyme (an enzyme not involved in any other pathway)?
  Does the organism fall in the expected taxonomic domain of the pathway?
  Is this pathway part of a variant set, and, if so, is there more evidence for some other variant?
  If there is no unique enzyme:
    Is there evidence for more than one enzyme?
    If a biosynthetic pathway, is there evidence for final reaction(s)?
    If a degradation pathway, is there evidence for initial reaction(s)?
    If an energy metabolism pathway, is there evidence for more than half the reactions?
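Several of the questions above can be answered mechanically from reaction-level evidence. The sketch below only collects the answers; it does not decide whether to keep a pathway, because the slides do not say how PathoLogic weighs the considerations. All names are hypothetical. It assumes that the reaction list is ordered from initial to final reaction, that "evidence for an enzyme" is approximated by a reaction having at least one assigned enzyme, and it leaves out the taxonomic and variant-set questions.

```python
def pathway_evidence_checklist(pathway_id, reactions, kind, evidenced, reaction_to_pathways):
    """Collect yes/no answers for one candidate pathway.

    pathway_id: identifier of the candidate pathway
    reactions: reaction ids, assumed ordered from initial to final
    kind: "biosynthesis", "degradation", "energy", or anything else
    evidenced: set of reaction ids with an assigned enzyme in the organism
    reaction_to_pathways: dict reaction id -> set of pathway ids using it
    """
    present = [r for r in reactions if r in evidenced]
    # A reaction counts as "unique" here if no other pathway uses it.
    unique = [r for r in present if reaction_to_pathways.get(r, set()) <= {pathway_id}]
    checklist = {
        "has_unique_enzyme": bool(unique),
        "more_than_one_enzyme": len(present) > 1,
    }
    if kind == "biosynthesis":
        checklist["final_reaction_present"] = reactions[-1] in evidenced
    elif kind == "degradation":
        checklist["initial_reaction_present"] = reactions[0] in evidenced
    elif kind == "energy":
        checklist["more_than_half_present"] = len(present) > len(reactions) / 2
    return checklist


# Hypothetical data for illustration only.
usage = {"R1": {"P1", "P2"}, "R2": {"P1"}, "R3": {"P1"}}
result = pathway_evidence_checklist("P1", ["R1", "R2", "R3"], "biosynthesis", {"R1", "R3"}, usage)
print(result)
```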
Assigning Evidence Scores to Predicted Pathways
X|Y|Z denotes the score for pathway P in organism O, where:
  X = total number of reactions in P
  Y = number of reactions in P for which there is evidence in O of a catalyzing enzyme
  Z = number of Y reactions that are used in other pathways in O
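The X|Y|Z score is a matter of counting, as the following sketch shows. The function and argument names are hypothetical, and the mapping from reactions to pathways is assumed to cover only pathways present in O.

```python
def evidence_score(pathway_id, pathway_reactions, evidenced, reaction_to_pathways):
    """Return the X|Y|Z evidence score for one pathway as a string.

    pathway_reactions: reaction ids in pathway P
    evidenced: set of reaction ids with enzyme evidence in organism O
    reaction_to_pathways: dict reaction id -> set of pathway ids in O using it
    """
    x = len(pathway_reactions)
    with_evidence = [r for r in pathway_reactions if r in evidenced]
    y = len(with_evidence)
    # Z counts evidenced reactions that also occur in some other pathway.
    z = sum(1 for r in with_evidence if reaction_to_pathways.get(r, set()) - {pathway_id})
    return f"{x}|{y}|{z}"


# Hypothetical data for illustration only.
usage = {"R1": {"P1", "P2"}, "R2": {"P1"}, "R3": {"P1"}, "R4": {"P1", "P3"}}
score = evidence_score("P1", ["R1", "R2", "R3", "R4"], {"R1", "R2", "R4"}, usage)
print(score)
```

In this example, four reactions make up the pathway, three have enzyme evidence, and two of those three are shared with other pathways, so only one evidenced reaction is specific to P1.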
Manual Pruning of Pathways
Use pathway evidence report
  Coloring scheme aids in assessing pathway evidence
Phase I: Prune extra variant pathways
Rescore pathways, re-generate pathway evidence report
Phase II: Prune pathways unlikely to be present
  No/few unique enzymes
  Most pathway steps present because they are used in another pathway
  Pathway very unlikely to be present in this organism
  Nonspecific enzyme name assigned to a pathway step
Caveats
Cannot predict pathways not present in MetaCyc
Evidence for short pathways is hard to interpret
Since many reactions occur in multiple pathways, some false positives are to be expected
Output from PPP
Pathway/genome database
Summary pages
  Pathway evidence page
    Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
  Missing enzymes report
Directory tree containing sequence files, reports, etc.
Resulting Directory Structure
ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
  input
    organism.dat
    organism-init.dat
    genetic-elements.dat
    annotation files
    sequence files
  reports
    name-matching-report.txt
    trial-parse-report.txt
  kb
    ORGIDbase.ocelot
  data
    overview.graph
released -> VERSION
Creating Protein Complexes
Complex Subunits
Stoichiometries
Manual Polishing
Refine -> Assign Probable Enzymes
Refine -> Re-run Name Matcher
Refine -> Create Protein Complexes
Refine -> Assign Modified Proteins
Refine -> Transport Identification Parser
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Run Consistency Checker
Refine -> Update Overview
Proteins as Reaction Substrates
Nomenclature
• WO pair = pair of genes within an operon
• TUB pair = pair of genes at a transcription unit boundary (delineate operons)
Operation of the operon predictor
For each contiguous gene pair, predict whether the pair is within the same operon or at a transcription unit boundary
Use pairwise predictions to identify potential operons
Example for contiguous genes A B C D E:
  AB = TUB pair
  BC = WO pair
  CD = WO pair
  DE = TUB pair
  operon = BCD
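Turning pairwise WO/TUB calls into operons is a single pass over the ordered genes: a WO call extends the current transcription unit, and a TUB call closes it. The sketch below uses the A to E example from the slide. The function name is hypothetical, and reporting only units with two or more genes as operons is an assumption.

```python
def operons_from_pairs(genes, pair_labels):
    """Group ordered genes into operons from pairwise predictions.

    genes: genes in chromosomal order
    pair_labels: pair_labels[i] is "WO" or "TUB" for (genes[i], genes[i + 1])
    """
    if not genes:
        return []
    if len(pair_labels) != len(genes) - 1:
        raise ValueError("need exactly one label per contiguous gene pair")
    units = []
    current = [genes[0]]
    for gene, label in zip(genes[1:], pair_labels):
        if label == "WO":
            current.append(gene)     # same operon: extend the current unit
        elif label == "TUB":
            units.append(current)    # boundary: close the current unit
            current = [gene]
        else:
            raise ValueError(f"unknown label: {label!r}")
    units.append(current)
    # Assumption: single-gene transcription units are not reported as operons.
    return [unit for unit in units if len(unit) > 1]


operons = operons_from_pairs(["A", "B", "C", "D", "E"], ["TUB", "WO", "WO", "TUB"])
print(operons)
```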
Operon predictor
Predicts operon gene pairs based on:
  intergenic distance between genes
  genes in the same functional class
Typically used for operon prediction
We use the method from Salgado et al., PNAS (2000) as a starting point.
  Uses E. coli experimentally verified data as a training set.
  Compute log likelihood of two genes being a WO or TUB pair based on intergenic distance.
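One common way to compute such a score is the log of the ratio between how often a given intergenic distance is seen among known WO pairs and among known TUB pairs. The sketch below follows that idea; whether it matches the exact computation in PathoLogic is an assumption. The bin width, the pseudocount, the function names, and the training distances are all hypothetical.

```python
import math
from collections import Counter


def distance_bin(distance, width=10):
    """Map an intergenic distance (bp, may be negative for overlaps) to a bin."""
    return math.floor(distance / width)


def train_log_likelihood(wo_distances, tub_distances, width=10, pseudocount=1.0):
    """Return a function scoring a distance as log(P(d | WO) / P(d | TUB)).

    Frequencies are estimated per distance bin; the pseudocount keeps bins
    that are empty in one training set from producing log(0) or division by 0.
    """
    if not wo_distances or not tub_distances:
        raise ValueError("both training sets must be non-empty")
    wo_counts = Counter(distance_bin(d, width) for d in wo_distances)
    tub_counts = Counter(distance_bin(d, width) for d in tub_distances)
    n_bins = len(set(wo_counts) | set(tub_counts))
    wo_total = len(wo_distances) + pseudocount * n_bins
    tub_total = len(tub_distances) + pseudocount * n_bins

    def score(distance):
        b = distance_bin(distance, width)
        p_wo = (wo_counts[b] + pseudocount) / wo_total
        p_tub = (tub_counts[b] + pseudocount) / tub_total
        return math.log(p_wo / p_tub)

    return score


# Hypothetical training distances, for illustration only.
score = train_log_likelihood(
    wo_distances=[5, 8, 12, -3, 20],
    tub_distances=[150, 200, 90, 300, 12],
)
print(score(5) > 0)    # positive score favors a WO pair
print(score(200) < 0)  # negative score favors a TUB pair
```

A positive score favors calling the pair WO and a negative score favors TUB.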
Operon predictor
Additional features easily computed from a PGDB
1. both gene products are enzymes in the same metabolic pathway
2. both gene products are monomers in the same protein complex
3. one gene product transports a substrate for a metabolic pathway in which the other gene product is involved as an enzyme
4. a gene upstream or downstream from the gene pair (and within the same directon) is related to either one of the genes in the pair as per features 1, 2 and 3 above