Functional Annotation
Background + Strategy
The Group
27th Feb 2012
Lavanya Rishishwar, Artika Nath, Lu Wang, Haozheng Tian
Shengyun Peng, Ashwath Kumar
Hamidreza Hassanzadeh
Outline
• What is Functional Annotation
• The Importance of Functional Annotation
• The Biology of H. haemolyticus
• Background for Functional Annotation
• Pros/Cons of Available Approaches
• Planned Approach
  – Breadth
  – Depth
THE ‘WHAT?’
Functional Annotation
Genome Assembly
Assemble the Pieces Right
Gene Prediction
When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers .
Identify the words
When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers .
Functional Annotation
When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers .
nat·u·ral·ist [nach-er-uh-list, nach-ruh-]
noun
1. a person who studies or is an expert in natural history, especially a zoologist or botanist.
2. an adherent of naturalism in literature or art.
Origin: 1580–90; natural + -ist

Origin of Species, The
noun
(On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life) a treatise (1859) by Charles Darwin setting forth his theory of evolution.
Identify the function (i.e., meaning) of each word
DATABASES
PROFILES
THE GRAVITY OF THE ANNOTATION PROCESS
Not just Newtonian
“Ultimately, one wishes to determine how genes—and the
proteins they encode—function in the intact organism.”
Alberts B, et al. (2002) Molecular Biology of the Cell. New York: Garland Science.
function
Function? What is it?
• To a cell biologist, function might refer to the network of interactions in which the protein participates or to its localization to a certain cellular compartment.
• To a biochemist, function refers to the metabolic process in which a protein is involved or to the reaction catalyzed by an enzyme.
Functional Annotation
Functional annotation consists of attaching biological information to genomic elements.
• Biochemical function
• Biological function
• Involved regulation and interactions
• Expression
Whatever happened to wet-lab?
“Experimentally annotating one complete bacterial genome varies from organism to organism. Roughly speaking, it could take as much as $25,000 and a period of 6-12 months for completing the process”
- Alejandro Caro
The Naked Truth
[Figure: No. of Genomes in KEGG, plotted half-yearly from 7/98 to 1/12 on a 0 to 2000 scale. KEGG Genome: Release Update of Jan 2012]
How Do Genes Perform Their Function? Operon
• Operon: Several genes with related functions that are regulated together, because one piece of mRNA codes for several related proteins.
• Polycistronic mRNA, an mRNA coding for more than one polypeptide, is found mainly in prokaryotes.
Coding and Non-coding RNAs
Protein Coding
• Enzymes
• Structural
• Regulatory
• Signal Transduction
• Receptors
• Toxins
• Virulence Factors
• Membrane/Transmembrane
Non-coding
• Riboswitches
• CRISPR
• sRNAs
Pathway Prediction
Domain/Motif
• Domain: A discrete structural unit that is assumed to fold independently of the rest of the protein and to have its own function (~20-100 aa).
• Motif: A short, conserved region, frequently the most conserved part of a domain. Motifs are critical for the domain to function.
Understanding the Target
Haemophilus haemolyticus - The Biography
Haemophilus haemolyticus
• Gram-negative
• Facultative anaerobe
• Known to colonize the human respiratory tract.
• Of the 8 Haemophilus species found to colonize the respiratory tract, H. influenzae and H. haemolyticus are the most prevalent.
• H. haemolyticus is an emerging pathogen
  – 5 cases of invasive disease reported between 2009 and 2010.
Strains of H. haemolyticus
fucK: encoding fuculose kinase. fucK deletion has been observed in some H. influenzae (Hi) isolates.
hpd: encoding lipoprotein protein D.

Strain   Species          Disease State  State Isolated  Hemolysis  hpd  fucK
M19107   H. haemolyticus  Asymptomatic   Minnesota       Y          -    -
M19501   H. haemolyticus  Asymptomatic   Minnesota       N          +    -
M21127   H. haemolyticus  Pathogenic     Georgia         Y          -    -
M21621   H. haemolyticus  Pathogenic     Texas           Y          -    -
M21639   H. haemolyticus  Pathogenic     Illinois        N          -    -
M21709   H. influenzae    Pathogenic     NY              N          -    +
Phylogeny
Nørskov-Lauritsen, N., et al. (2005). Multilocus sequence phylogenetic study of the genus Haemophilus with description of Haemophilus pittmaniae sp. nov. International Journal of Systematic and Evolutionary Microbiology, 55, 449–456.
View from 300 ft
and a brief time travel
Ontology
• An ontology is a “formal, explicit specification of a shared conceptualization”
• Two major formal ontology schemes:
  – EC – Enzyme Commission Number
  – GO – Gene Ontology
Enzyme Commission (EC)
• A large-scale, comprehensive attempt to organize and classify enzymes according to their function
• For inclusion in the list, direct experimental evidence must be provided for an enzyme's claimed activity
• Organizes the list of enzymes in four levels of hierarchy, starting with the 6 top-level classes:
  1. Oxidoreductases
  2. Transferases
  3. Hydrolases
  4. Lyases
  5. Isomerases
  6. Ligases
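The four-level hierarchy can be illustrated with a minimal sketch that splits an EC number into its levels and looks up the top-level class. The helper name `parse_ec` is hypothetical, and the EC number in the example is used only as an input string; the class names are the six listed on the slide.

```python
# Top-level EC classes as listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec(ec_number):
    """Split an EC number such as '2.7.1.51' into its four hierarchy levels."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError("an EC number has exactly four levels")
    return {
        "class": EC_CLASSES[int(parts[0])],
        "levels": parts,
        # The parent entry is obtained by dropping the last level.
        "parent": ".".join(parts[:3]),
    }


info = parse_ec("2.7.1.51")
print(info["class"], info["parent"])
```

Because each entry has exactly one parent, this structure can only express parent-to-child relationships.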
Chronology: Enzyme Commission (EC)
• Cons of EC:
  – The hierarchy only provides parent-to-child relationships
  – Specific to enzymes only (doesn't cover all proteins)
Chronology: Gene Ontology (GO)
Or in other words "give this protein a name and stick to it!!"
What is the GO?
• Molecular Function
• Biological Process
• Cellular Component
• Relations between the terms
  – ‘is_a’
  – ‘part_of’, ‘has_part’
  – ‘regulates’
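As a rough illustration of how relations between terms can be followed, the sketch below walks a toy graph upward through ‘is_a’ and ‘part_of’ edges. The graph is a simplified, hand-written assumption (the real GO has many more terms and intermediate levels), and `ancestors` is a hypothetical helper name.

```python
# Toy ontology: each term maps to a list of (relation, parent) edges.
# Simplified for illustration; not a copy of the real GO graph.
TOY_GO = {
    "kinase activity": [("is_a", "transferase activity")],
    "transferase activity": [("is_a", "catalytic activity")],
    "catalytic activity": [("is_a", "molecular_function")],
    "molecular_function": [],
    "plasma membrane": [("is_a", "membrane"), ("part_of", "cell")],
    "membrane": [],
    "cell": [],
}


def ancestors(term, graph, relations=("is_a", "part_of")):
    """Collect every term reachable from `term` through the given relations."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in graph.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("kinase activity", TOY_GO)))
```

Restricting `relations` to `("is_a",)` gives only the subclass chain, which is one way to see why the relation type matters when reasoning over the graph.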
Structure of GO
du Plessis L, Skunca N, Dessimoz C (2011). The what, where, how and why of gene ontology–a primer for bioinformaticians. Brief Bioinform. doi: 10.1093/bib/bbr002
General Rule To Apply Evidence Code
Where Do Annotations Come From?
• Inferred from experiment
  – Most reliable
  – Basis for computational methods
• Inferred from computational method
  – Sequence similarity, structural similarity, etc.
• Inferred from author statement
• Curator statement and obsolete evidence codes
Why use the GO?
• The ‘GO Consortium’ consists of a number of large databases working together to define standardized ontologies and provide annotations to the GO.
• Search for interacting genes
• Reason across the relations
• Analyze the results of high-throughput experiments
• Infer the function of un-annotated genes and infer protein-protein interactions.
CAUTION!
PROS AND CONS OF CONVENTIONAL APPROACHES
Choosing The Right Function Prediction Tool
“Perutz et al. showed in 1960 that myoglobin and hemoglobin, the first two protein structures to be solved at atomic resolution using X-ray crystallography, have similar structures even though their sequences differ.”
Pros and Cons: There are no free lunches!
• Homology is useful but different from “same” function
  – Simply implies common ancestry
Pros and Cons: There are no free lunches!
• The quality of a prediction is only as good as the quality of annotation in the database
• A function predictor built for eukaryotes cannot be used for prokaryotes, and vice versa
BREADTH AND DEPTH OF THE ANALYSIS
A Snapshot of the Iceberg Named Functional Annotation
BREADTH
Spectrum of Methods Selected
Criteria for selecting methods
1. Currently being maintained
2. Applicable to prokaryotic sequences
3. Could be installed locally (support batch jobs if GUI)
   OR
   Could be included in a pipeline, i.e., have a command-line interface
Categories of Approaches
• Sequence similarity-based
• Phylogenomics-based
• Domain/pattern/profile-based
  – Domain-based
  – Pattern-based
  – Profile-based
• Sequence clustering-based
• Machine learning-based
• Network-based
Breadth: Options
Legend: Dead, GUI, Proprietary, Eukaryotic Model, External Servers, InterPro, Web-based Servers

Approach                        Resources
Sequence similarity based       GOtcha, PFP, GOsling, OntoBlast, GOblet, Blast2GO
Phylogenomics based             SIFTER, AFAWE, RIO, OrthoStrapper
Domain/pattern/profile based    InterProScan, TMHMM, HMMTOP, HMMER, Pfam, SUPERFAMILY, PROSITE, PRINTS, SMART, Gene3D, PANTHER, TIGRFAMs, SCOP, CATH, CatFam, PIRSF, PRODOM, EFICAz, PRIAM
Sequence clustering based       ProtoNet, CluSTr, eggNOG, COGs, InParanoid, MultiParanoid, OrthoMCL
Machine learning based          ProtFun, GOPET, SVM-Prot, ffPred, EzyPred
Network based                   MCODE, AGeS, SAMBA, RNSC, PRODISTIN, Cytoscape, STRING, VisANT, VIRGO
Pipelines                       RAST, MultiParanoid, AGMIAL, MicroScope
Flowchart
DEPTH
Description of Selected Methods
Level 1
The building blocks!
PanGenome Analysis
• The pangenome is the full complement of genes in a species.
• It includes the core genome (genes present in all strains), the dispensable genome (genes present in two or more strains, but not all), and unique genes (genes specific to a single strain).
• In this case, we will be using the pangenome of Haemophilus influenzae.
• This database will be used as the reference database in BLAST.
• This method gives high-confidence annotations since the strains selected are very closely related to the organism in question.
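The core / dispensable / unique split can be sketched with plain set counting. This is a minimal illustration, not the group's pipeline: the strain names and gene family labels are hypothetical, and `partition_pangenome` is an assumed helper name.

```python
from collections import Counter


def partition_pangenome(strain_genes):
    """Split gene families into core, dispensable, and unique sets.

    `strain_genes` maps a strain name to the gene families it carries.
    Assumes at least two strains.
    """
    n_strains = len(strain_genes)
    # Count in how many strains each gene family occurs.
    counts = Counter(g for genes in strain_genes.values() for g in set(genes))
    core = {g for g, c in counts.items() if c == n_strains}
    unique = {g for g, c in counts.items() if c == 1}
    dispensable = set(counts) - core - unique
    return core, dispensable, unique


# Hypothetical strains and gene family labels, for illustration only.
example = {
    "strain_1": ["famA", "famB", "famC"],
    "strain_2": ["famA", "famB", "famD"],
    "strain_3": ["famA", "famE"],
}
core, dispensable, unique = partition_pangenome(example)
print(sorted(core), sorted(dispensable), sorted(unique))
```

In practice the gene families would first have to be defined by clustering homologous genes across strains, a step this sketch takes as given.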
BLAST: How it works?
1. Divide the query sequence into short chunks called words
2. Look for exact matches
3. In case of a hit, try extending the alignment
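The three steps above can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST itself: it extends only to the right, allows no gaps, and the scoring values (`match`, `mismatch`, `drop`) and function names are assumptions chosen for the example.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=3, match=1, mismatch=-1, drop=2):
    """Extend a seed to the right without gaps.

    Stops once the running score falls more than `drop` below the best
    score seen so far, and returns the best score with the query segment.
    """
    score = best = word_size * match
    length = best_len = word_size
    while i + length < len(query) and j + length < len(subject):
        score += match if query[i + length] == subject[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > drop:
            break
    return best, query[i:i + best_len]


seeds = find_seeds("ACGTACGT", "TTACGTAA")
print(seeds[0], extend_seed("ACGTACGT", "TTACGTAA", *seeds[0]))
```

Indexing the subject words first means each query word is looked up in constant time, which is the reason the seeding step is fast compared with a full alignment.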
Statistical assessment
E-value: $E = m \times n \times P$, where
$m$ = Total number of residues in the database
$n$ = Number of residues in the query sequence
$P$ = Probability that an HSP alignment is a result of random chance
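A short worked example, taking the E-value as the product of the database size, the query length, and the chance probability of the HSP. The numbers below are illustrative assumptions, not values from the slide, and `e_value` is a hypothetical helper name.

```python
def e_value(db_residues, query_residues, p_chance):
    """E-value as the product of database size, query length, and chance probability."""
    return db_residues * query_residues * p_chance


# Illustrative values only: a 100-residue query, a database of 10**12 residues,
# and a chance probability of 1e-20 for the HSP.
e = e_value(10**12, 100, 1e-20)
print(f"{e:.1e}")
```

With these inputs the product is about 1e-6, meaning roughly one such hit per million searches would be expected by chance; a larger database raises the E-value for the same alignment.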
Different flavors!
• BLASTN
  – Queries nucleotide vs. nucleotide sequences
• BLASTP
  – Queries protein vs. protein sequences
• BLASTX
  – Queries the 6 possible translated frames of a nucleotide sequence vs. protein sequences
• TBLASTN
  – Reciprocal of BLASTX: queries a protein sequence vs. the 6 possible translated frames of nucleotide sequences in the database
• TBLASTX
  – Queries the 6 possible translated frames of a nucleotide sequence vs. the 6 possible translated frames of nucleotide sequences in the database
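The "6 possible frames" used by the translated searches are three codon offsets on each strand. The sketch below only splits a sequence into codons per frame (translation to amino acids is left out to keep it short); the frame labels and function names are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return the six reading frames of a nucleotide sequence as codon lists."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete codons are kept; a trailing partial codon is dropped.
            codons = [s[k:k + 3] for k in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


for name, codons in six_frames("ATGGCCTAA").items():
    print(name, codons)
```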
InterPro
• Combines protein signatures from a number of member databases into a single searchable resource
• Capitalizes on their individual strengths to produce an integrated database and diagnostic tool.
"InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites."
Current release: 36.0 (23 February 2012)
New features:
• An update to Pfam (26.0) and PIRSF (2.78).
• The integration of 755 new methods from the GENE3D, PANTHER, PIRSF, Pfam and SUPERFAMILY databases.
Member database information
Signature Database   Version   Signatures*   Integrated Signatures**
GENE3D               3.3.0     2386          1441
HAMAP                140911    1702          1686
PANTHER              7         69566         2392
PIRSF                2.78      2983          2983
PRINTS               41.1      2050          2001
PROSITE patterns     20.72     1308          1291
PROSITE profiles     20.72     922           897
Pfam                 26        13672         12672
PfamB                26        20000         0
ProDom               2006.1    1894          1105
SMART                6.2       1008          1002
SUPERFAMILY          1.73      1774          1208
TIGRFAMs             10.1      4023          4002
* Some signatures may not have matches to UniProtKB proteins.
** Not all signatures of a member database may be integrated at the time of an InterPro release.
Gene3D: “The Gene3D database is a large collection of CATH (Class, Architecture, Topology, Homologous superfamily) protein domain assignments for ENSEMBL genomes and Uniprot sequences.”
HAMAP: High-quality Automated and Manual Annotation of microbial Proteomes
PANTHER: Protein ANalysis THrough Evolutionary Relationships
PIRSF: Evolutionary relationships of proteins from super- to sub-families
PRINTS: “PRINTS is a database of protein family ‘fingerprints’ offering a diagnostic resource for newly-determined sequences.”
PROSITE: Database of protein domains, families and functional sites
ProDom
SMART: Simple Modular Architecture Research Tool
SUPERFAMILY: “SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.”
TIGRFAMs
Integration into InterPro
Features of Member Databases
• ProDom: provider of sequence clusters built from UniProtKB using PSI-BLAST.
• PROSITE patterns: provider of simple regular expressions.
• PROSITE and HAMAP profiles: providers of sequence matrices.
• PRINTS: provider of fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs).
• PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: providers of hidden Markov models (HMMs).
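To make the "simple regular expressions" point concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression. It covers only the common syntax elements (`x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<`, `>`) and is a simplified assumption of the notation, not the scanner that InterProScan uses; the patterns and the test sequence are illustrative.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                            # patterns end with a period
    regex = regex.replace("-", "")                         # '-' separates positions
    regex = regex.replace("x", ".")                        # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")     # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")      # (2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")      # sequence-end anchors
    return regex


zinc_like = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(zinc_like)

# Hypothetical sequence constructed to contain one match.
sequence = "AACAACAAAL" + "A" * 8 + "HAAAH"
print(bool(re.search(zinc_like, sequence)))
```

The order of the replacements matters: the curly braces of the exclusion syntax are rewritten before the parentheses are turned into repeat counts, so the two never collide.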
Querying with InterProScan
“Sequence-based queries are performed using InterProScan, a tool that combines the different protein signature recognition methods native to the InterPro member databases into one resource.”
InterProScan
Query Sequence
• Web version
• Stand-alone version
  – A wrapper of sequence analysis apps
  – Database and output files scanning
  – Bulk data processing
Querying with InterProScan
Member Databases & Scanning Methods
The TMHMM and SignalP prediction search algorithms are provided through the web interface at EBI. However, they are not integrated into InterPro.
Member Databases                    Scanning Methods     Software Package
PROSITE patterns                    pfscan               Pftools
PROSITE profiles, HAMAP profiles    pfscan               Pftools
PRINTS                              FingerPRINTScan
PFAM                                hmmscan              HMMER3.0b3
PRODOM                              ProDomBlast
SMART                               hmmpfam              HMMER2.3.2
TIGRFAMs                            hmmscan              HMMER3.0b3
PIR SuperFamily                     hmmpfam              HMMER2.3.2
SUPERFAMILY                         hmmpfam/hmmsearch    HMMER2.3.2
GENE3D                              hmmpfam              HMMER2.3.2
Blast2GO
• B2G has been designed to (1) allow automatic and high-throughput sequence annotation and (2) integrate functionality for annotation-based data mining.
Why Blast2GO?
• Blast2GO is designed for high-throughput sequence annotation.
• Better data mining and visualization capabilities
• Good at utilizing annotated sequences already deposited in public databases.
How Blast2GO works?
• Basically, Blast2GO uses local or remote BLAST searches to find sequences similar to one or several input sequences.
• The program extracts the GO terms associated with each of the obtained hits and returns an evaluated GO annotation for the query sequence(s).
• Enzyme codes are obtained by mapping from equivalent GOs, while InterPro motifs are directly queried at the InterProScan web service.
• GO annotation can be visualized by reconstructing the structure of the Gene Ontology relationships, and ECs are highlighted on KEGG maps.
How Blast2GO works?
• OBTAINING GO TERMS
  – The first step is to find sequences similar to a query set by BLAST searching. Homology search can be done either against public databases or against custom databases when a local BLAST installation is available.
  – By using BLAST hit gene identifiers (gi) and gene accessions, B2G retrieves all GO annotations for the hit sequences, together with their evidence codes (EC).
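The retrieval step can be pictured as a lookup from hit accessions to annotated GO terms. The sketch below is only a conceptual illustration and not Blast2GO's code: the accessions, the annotation table, and the function name `collect_go_terms` are all hypothetical.

```python
def collect_go_terms(hit_accessions, annotations):
    """Gather GO terms and their evidence codes for a list of BLAST hit accessions.

    `annotations` maps an accession to a list of (GO term, evidence code) pairs.
    Hits without any annotation are skipped.
    """
    collected = {}
    for accession in hit_accessions:
        for go_term, evidence in annotations.get(accession, []):
            collected.setdefault(go_term, set()).add(evidence)
    return collected


# Hypothetical hits and annotation table, for illustration only.
hits = ["hit_1", "hit_2", "hit_3"]
annotation_table = {
    "hit_1": [("GO:0016301", "IDA")],
    "hit_2": [("GO:0016301", "IEA"), ("GO:0005524", "IEA")],
}
print(collect_go_terms(hits, annotation_table))
```

Keeping the evidence codes alongside each term matters because experimentally supported annotations are generally more reliable than computationally inferred ones.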
How Blast2GO works?
• ANNOTATION ASSIGNMENT
  – annotation score (AS), direct term (DT)
How Blast2GO works?
• STATISTICS– statistical assessment of GO term enrichments in a
group of interesting genes when compared with a reference group (Blüthgen et al., 2004).
– Gossip computes Fisher’s Exact Test applying robust FDR (false discovery rate) correction for multiple testing and returns a list of significant GO terms ranked by their corrected or one-test P-values
• VISUALIZATION27th Feb 2012 67
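For a single GO term, the enrichment test can be sketched as a one-sided Fisher's exact test, computed here as the upper tail of the hypergeometric distribution. This is a minimal stand-alone illustration and not Gossip's implementation; the FDR correction across many terms is not included, and the parameter names are assumptions.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    k: genes in the test group annotated with the term
    n: size of the test group
    K: genes in the whole gene set annotated with the term
    N: size of the whole gene set (test group included)
    """
    total = comb(N, n)
    upper = min(n, K)
    # Probability of observing k or more annotated genes by chance.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Illustrative numbers: 3 of 3 selected genes carry a term held by 3 of 10 genes.
print(enrichment_p_value(3, 3, 3, 10))
```

When many GO terms are tested at once, the resulting P-values need a multiple-testing correction such as the FDR procedure mentioned above before they are ranked.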
Systems for Functional Annotation
• Clusters of Orthologous Groups (COGs)
• euKaryote Orthologous Groups (KOGs)
• Gene Ontology (GO)
• Enzyme Commission no. (EC)
Clusters of Orthologous Groups of Genes (KOGs, COGs)
– Why?
  • Orthologs typically retain the same function during evolution and hence have a critical role in functional annotation. COGs provide a framework for functional analysis.
  • They are also important for phylogenetic and evolutionary analysis of genomes. Interpretable phylogenetic trees can generally be constructed only within sets of orthologs.
How to find Orthologous genes?
• Naive approach: For a query gene and target genome, the highest similarity score indicates an orthologous relationship
  – Gives good results for not so distant species
  – How about larger phylogenetic distances?
• Gene duplications: suggest that a many-to-many relationship is required
• What if several hits with not so high scores emerge? A stringent threshold may lead to false negatives
• COG approach: Any two genes inside a COG are either orthologous genes or members of orthologous groups of paralogs
How to create COGs
1. Choose all 2-permutations of available genes and perform pairwise comparison between genes from different clades (in this case 5 clades)
2. Best hits (BeTs) in other organisms are recognized
3. Make the graph of consistent relations (does not depend on an absolute threshold level)
4. The simplest case is a triangle: a gene yields best hits in two other genomes, and those two hits are also best hits of each other
5. Merge all triangles with a common side

Genes (n)   2-permutations, n(n-1)
2           2
10          90
3000        ~9.0e6
17967       ~3.2e8
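The growth in the number of pairwise comparisons follows directly from counting ordered pairs of genes, n(n-1). A quick check of the gene counts listed on the slide (the function name is a hypothetical helper):

```python
import math


def ordered_pairs(n):
    """Number of 2-permutations (ordered pairs) of n genes, i.e. n * (n - 1)."""
    return math.perm(n, 2)


for n in (2, 10, 3000, 17967):
    print(n, ordered_pairs(n))
```

The quadratic growth is the practical reason all-against-all comparison becomes expensive as more genomes are added.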
How to create COGs - continued
6. Due to the existence of paralogs, BeTs are not necessarily symmetrical (unlike RBBH [Reciprocal Best BLAST Hits])
Tatusov, Koonin & Lipman, Science 278, 631 (1997)
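The triangle idea can be sketched as follows: find triples of genes from three different genomes that are connected by best hits on every side, then merge triangles that share a side. This is a simplified illustration under stated assumptions, not the published COG procedure: gene names follow a hypothetical `genome:gene` convention, a side counts if either gene lists the other as a best hit, and all names in the example are made up.

```python
from itertools import combinations


def find_triangles(best_hits):
    """Return gene triples from three genomes linked by best hits on every side.

    `best_hits` maps each gene to the set of its best hits in other genomes.
    A side counts if either gene lists the other (BeTs need not be symmetrical).
    """
    def linked(a, b):
        return b in best_hits.get(a, set()) or a in best_hits.get(b, set())

    genes = sorted(set(best_hits) | {h for hits in best_hits.values() for h in hits})
    return [
        frozenset(trio)
        for trio in combinations(genes, 3)
        if len({g.split(":")[0] for g in trio}) == 3
        and all(linked(a, b) for a, b in combinations(trio, 2))
    ]


def merge_triangles(triangles):
    """Group triangles that share a side (two genes) and return gene clusters."""
    parent = list(range(len(triangles)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(range(len(triangles)), 2):
        if len(triangles[a] & triangles[b]) >= 2:
            parent[find(a)] = find(b)

    clusters = {}
    for idx, tri in enumerate(triangles):
        clusters.setdefault(find(idx), set()).update(tri)
    return list(clusters.values())


# Hypothetical best-hit table over four genomes A, B, C, D.
hits = {
    "A:g1": {"B:g1", "C:g1", "D:g1"},
    "B:g1": {"A:g1", "C:g1"},
    "C:g1": {"A:g1", "B:g1"},
    "D:g1": {"A:g1", "B:g1"},
    "A:g2": {"B:g2"},
}
print(merge_triangles(find_triangles(hits)))
```

In this toy table the triangles (A, B, C) and (A, B, D) share the A-B side and merge into one cluster of four genes, while `A:g2` and `B:g2` form no triangle and stay unclustered.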
Facing challenges when creating COGs
• The clusters, however, are subject to ambiguity:
  – Proteins with distinct regions (multi-domain proteins), each belonging to a different conserved family.
    • Sol: Further inspection of domains
  – When one gene in a pair of paralogs is lost in one lineage (but not in the other), the two COGs may be artificially merged.
    • Sol: Similarity measures
COGs vs. Gene Function
• Each COG includes proteins from at least 3 major clades with an estimated divergence time of over a billion years. Hence they are ancient conserved families with important (if not essential) functions
• Accordingly, the proteins belonging to mysterious COGs are good candidates for further analysis
• Also, if someone experimentally verifies a gene's function, it can be confidently applied to fellow COG members. Similarly, upon inclusion of a new gene in a COG (by COGNITOR), its function is derived
• For most free-living prokaryotes, ~80% of the genes belong to COGs. Up to 10% of proteins in genomes are estimated to be fast-evolving, poorly conserved proteins, and hence the COG coverage of most genomes is approaching saturation
http://www.ncbi.nlm.nih.gov/COG/
Clusters of Orthologous Groups (COGs)
Classification of COGs by functional categories
INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics
CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones
METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism
POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown
LipoP
• A tool used mainly to predict lipoprotein signal peptides.
• It is most suitable for Gram-negative bacteria but has been shown to have considerable accuracy for Gram-positive bacteria as well.
• It uses hidden Markov models to distinguish between lipoproteins (SPaseII-cleaved proteins), SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins.
Thank You!
To be continued…