
Functional Annotation

Background + Strategy

The Group

27th Feb 2012

Lavanya Rishishwar
Artika Nath
Lu Wang
Haozheng Tian
Shengyun Peng
Ashwath Kumar
Hamidreza Hassanzadeh

Outline

• What is Functional Annotation
• The Importance of Functional Annotation
• The Biology of H. haemolyticus
• Background for Functional Annotation
• Pros/Cons of Available Approaches
• Planned Approach
  – Breadth
  – Depth


THE ‘WHAT?’

Functional Annotation

Genome Assembly

Assemble the Pieces Right


Gene Prediction

When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers.

Identify the words





nat·u·ral·ist [nach-er-uh-list, nach-ruh-] noun
1. a person who studies or is an expert in natural history, especially a zoologist or botanist.
2. an adherent of naturalism in literature or art.
Origin: 1580–90; natural + -ist

Origin of Species, The noun
(On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life) a treatise (1859) by Charles Darwin setting forth his theory of evolution.

Identify the function (i.e., meaning) of each word

DATABASES

PROFILES


THE GRAVITY OF THE ANNOTATION PROCESS

Not just Newtonian

“Ultimately, one wishes to determine how genes—and the

proteins they encode—function in the intact organism.”

Alberts B, et al. (2002) Molecular Biology of the Cell. New York: Garland Science.


Function? What is it?

• To a cell biologist, function might refer to the network of interactions in which the protein participates or to its localization in a particular cellular compartment.

• To a biochemist, function refers to the metabolic process in which a protein is involved or to the reaction catalyzed by an enzyme.



Functional annotation consists of attaching biological information to genomic elements.
• Biochemical function
• Biological function
• Regulation and interactions involved
• Expression

Whatever happened to wet-lab?

“Experimentally annotating one complete bacterial genome varies from organism to organism. Roughly speaking, it could take as much as $25,000 and a period of 6-12 months for completing the process”

- Alejandro Caro


The Naked Truth

[Figure: KEGG Genome, release update of Jan 2012. Number of genomes in KEGG (y-axis, 0 to 2000) over time (x-axis, 7/98 to 1/12).]

How Do Genes Perform Their Function? The Operon

• Operon: several genes with related functions that are regulated together, because one piece of mRNA codes for several related proteins.

• Polycistronic mRNA, i.e., mRNA coding for more than one polypeptide, is found mainly in prokaryotes.

Coding and Non-coding RNAs

Protein coding
• Enzymes
• Structural
• Regulatory
• Signal transduction
• Receptors
• Toxins
• Virulence factors
• Membrane/transmembrane

Non-coding
• Riboswitches
• CRISPRs
• sRNAs

Pathway Prediction

Domain/Motif

• Domain: a discrete structural unit that is assumed to fold independently of the rest of the protein and to have its own function (~20-100 aa).

• Motif: short, conserved regions that are frequently the most conserved regions of domains. Motifs are critical for the domain to function.


Understanding the Target


Haemophilus haemolyticus - The Biography

Haemophilus haemolyticus

• Gram-negative
• Facultative anaerobe
• Known to colonize the human respiratory tract.
• Out of the 8 Haemophilus species found to colonize the respiratory tract, H. influenzae and H. haemolyticus are the most prevalent ones.
• H. haemolyticus is an emerging pathogen
  – 5 cases of invasive disease reported between 2009-10.

Strains of H. haemolyticus

fucK: encoding fuculose kinase. fucK deletion has been observed in some H. influenzae (Hi) isolates.
Hpd: encoding lipoprotein protein D.

Strain   Species          Disease State   State Isolated   Hemolysis   Hpd   fucK
M19107   H. haemolyticus  Asymptomatic    Minnesota        Y           -     -
M19501   H. haemolyticus  Asymptomatic    Minnesota        N           +     -
M21127   H. haemolyticus  Pathogenic      Georgia          Y           -     -
M21621   H. haemolyticus  Pathogenic      Texas            Y           -     -
M21639   H. haemolyticus  Pathogenic      Illinois         N           -     -
M21709   H. influenzae    Pathogenic      NY               N           -     +

Phylogeny

Nørskov-Lauritsen, N., et al. (2005). Multilocus sequence phylogenetic study of the genus Haemophilus with description of Haemophilus pittmaniae sp. nov. International Journal of Systematic and Evolutionary Microbiology, 55, 449–456.


View from 300 ft

and a brief time travel

Ontology

• An ontology is a “formal, explicit specification of a shared conceptualization”

• Two major formal ontology schemes:
  – EC – Enzyme Commission Number
  – GO – Gene Ontology

Enzyme Commission (EC)

• A large-scale, comprehensive attempt to organize and classify enzymes according to their function

• For inclusion in the list, direct experimental evidence must be provided for the claimed activity

• Organizes the list of enzymes in four levels of hierarchy, starting with the 6 top-level classes:
  1. Oxidoreductases
  2. Transferases
  3. Hydrolases
  4. Lyases
  5. Isomerases
  6. Ligases
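The four-level hierarchy can be made concrete with a small sketch. It assumes EC numbers are written as four dot-separated fields with "-" for an unspecified level; the function names `parse_ec` and `covers` are hypothetical and the example numbers are purely illustrative.

```python
# The six top-level EC classes listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec(text):
    """Parse 'EC 2.7.1.51' or '3.4.-.-' into a 4-tuple; None marks an unspecified level."""
    text = text.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("an EC number has exactly four levels: %r" % text)
    levels = tuple(None if p == "-" else int(p) for p in parts)
    if levels[0] not in EC_CLASSES:
        raise ValueError("unknown top-level class in %r" % text)
    return levels


def covers(parent, child):
    """True if `parent` is the same as, or a less specific ancestor of, `child`."""
    for p, c in zip(parent, child):
        if p is None:          # parent stops being specific here
            return True
        if p != c:
            return False
    return True


ec = parse_ec("EC 2.7.1.51")
print(ec, EC_CLASSES[ec[0]])
print(covers(parse_ec("2.7.-.-"), ec))
```

Because the hierarchy is a strict tree, a prefix comparison is enough to decide whether one entry is an ancestor of another.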


Chronology: Enzyme Commission (EC)

• Cons of EC:
  – The hierarchy only provides a parent-to-child relationship
  – Specific to enzymes only (does not cover all proteins)

Chronology: Gene Ontology (GO)

Or in other words "give this protein a name and stick to it!!"

What is the GO?

• Molecular Function
• Biological Process
• Cellular Component
• Relations between the terms
  – ‘is_a’
  – ‘part_of’, ‘has_part’
  – ‘regulates’
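Because GO terms are linked by typed relations, annotating a gene with one term implicitly annotates it with that term's ancestors. A minimal sketch of that traversal follows; the four-term ontology `TOY_GO` is a toy stand-in for illustration, not real GO data, and `ancestors` is a hypothetical helper name.

```python
from collections import deque

# Toy ontology: term -> list of (relation, parent term). Illustrative only.
TOY_GO = {
    "nucleolus": [("part_of", "nucleus")],
    "nucleus": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular_component")],
    "cellular_component": [],
}


def ancestors(term, graph, relations=("is_a", "part_of")):
    """Collect every term reachable from `term` by following the given relations upward."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in graph.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("nucleolus", TOY_GO)))
print(sorted(ancestors("nucleolus", TOY_GO, relations=("is_a",))))
```

Restricting the traversal to a subset of relations changes the result, which is why the relation type matters when reasoning across the ontology.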


Structure of GO

du Plessis L, Skunca N, Dessimoz C (2011). The what, where, how and why of gene ontology–a primer for bioinformaticians. Brief Bioinform. doi: 10.1093/bib/bbr002

General Rule To Apply Evidence Code


Where Do Annotations Come From?

• Inferred from experiment
  – Most reliable
  – Basis for computational methods

• Inferred from computational method
  – Sequence similarity, structural similarity, etc.

• Inferred from author statement

• Curator statement and obsolete evidence codes

Why use the GO?

• The ‘GO Consortium’ consists of a number of large databases working together to define standardized ontologies and provide annotations to the GO.

• Search for interacting genes

• Reason across the relations

• Analyze the results of high-throughput experiments

• Infer the function of un-annotated genes and infer protein-protein interactions.


CAUTION!

PROS AND CONS OF CONVENTIONAL APPROACHES

Choosing The Right Function Prediction Tool

“Perutz et al. showed in 1960 that myoglobin and hemoglobin, the first two protein structures to be solved at atomic resolution using X-ray crystallography, have similar structures even though their sequences differ.”


Pros and Cons: There are no free lunches!

• Homology is useful, but it is different from “same” function
  – It simply implies common ancestry

• The quality of a prediction is only as good as the quality of annotation of the database

• A eukaryotic function predictor cannot be used for prokaryotes, and vice versa


BREADTH AND DEPTH OF THE ANALYSIS

A Snapshot of the Iceberg Named Functional Annotation


BREADTH

Spectrum of Methods Selected

Criteria for selecting methods

1. Currently being maintained
2. Applicable to prokaryotic sequences
3. Can be installed locally (supporting batch jobs if GUI-based)
   OR
   Can be included in a pipeline, i.e., has a command-line interface

Categories of Approaches

• Sequence similarity-based
• Phylogenomics-based
• Domain/pattern/profile-based
  – Domain-based
  – Pattern-based
  – Profile-based
• Sequence clustering-based
• Machine learning-based
• Network-based

Breadth: Options

Legend: Dead, GUI, Proprietary, Eukaryotic Model, External Servers, InterPro, Web-based Servers

Approach: Resources

Sequence similarity based: GOtcha, PFP, GOsling, OntoBlast, GOblet, Blast2GO

Phylogenomics based: SIFTER, AFAWE, RIO, OrthoStrapper

Domain/pattern/profile based: InterProScan, TMHMM, HMMTOP, HMMER, Pfam, SUPERFAMILY, PROSITE, PRINTS, SMART, Gene3D, PANTHER, TIGRFAMs, SCOP, CATH, CatFam, PIRSF, PRODOM, EFICAz, PRIAM

Sequence clustering based: ProtoNet, CluSTr, eggNOG, COGs, InParanoid, MultiParanoid, OrthoMCL

Machine learning based: ProtFun, GOPET, SVM-Prot, ffPred, EzyPred

Network based: MCODE, AGeS, SAMBA, RNSC, PRODISTIN, Cytoscape, STRING, VisANT, VIRGO

Pipelines: RAST, MultiParanoid, AGMIAL, MicroScope

Flowchart


DEPTH

Description of Selected Methods

Level 1

The building blocks!

PanGenome Analysis

• The pan-genome is the full complement of genes in a species.

• It includes the core genome, which is the set of genes present in all strains; the dispensable genome, which consists of genes present in two or more strains but not all; and unique genes, which are specific to a single strain.

• In this case, we will be using the pan-genome of Haemophilus influenzae.

• This database will be used as the reference database in BLAST.

• This method gives high-confidence annotations, since the strains selected are very closely related to the organism in question.
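The core/dispensable/unique split described above can be sketched as a simple count over a gene presence table. This is only an illustration under stated assumptions: the strain and gene-family names are made up, genes are assumed to be already grouped into families, and `partition_pangenome` is a hypothetical helper.

```python
def partition_pangenome(presence):
    """Split gene families into (core, dispensable, unique).

    presence: dict mapping strain name -> set of gene-family names.
    core        = present in every strain
    dispensable = present in two or more strains, but not all
    unique      = present in exactly one strain
    """
    n_strains = len(presence)
    if n_strains < 2:
        raise ValueError("need at least two strains to partition a pan-genome")
    counts = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    core = {g for g, c in counts.items() if c == n_strains}
    unique = {g for g, c in counts.items() if c == 1}
    dispensable = {g for g, c in counts.items() if 1 < c < n_strains}
    return core, dispensable, unique


# Hypothetical presence table for three strains.
example = {
    "strain_A": {"famA", "famB", "famC"},
    "strain_B": {"famA", "famB", "famD"},
    "strain_C": {"famA", "famE"},
}
core, dispensable, unique = partition_pangenome(example)
print(sorted(core), sorted(dispensable), sorted(unique))
```

The union of the three sets is the pan-genome that would serve as the BLAST reference database.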


BLAST: How it works?

1. Divide a query sequence into short chunks called words
2. Look for exact matches
3. In case of a hit, try extending the alignment
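The three steps can be illustrated with a toy seed-and-extend search. This is a simplified sketch, not the BLAST implementation: it uses exact word matches only, an ungapped extension, and an assumed scoring scheme (+1 match, -1 mismatch, stop once the score falls 2 below the best); all names and parameters are hypothetical.

```python
def build_word_index(subject, w):
    """Step 1 applied to the subject: map each word of length w to its start positions."""
    index = {}
    for i in range(len(subject) - w + 1):
        index.setdefault(subject[i:i + w], []).append(i)
    return index


def extend(query, subject, q, s, w, match=1, mismatch=-1, x_drop=2):
    """Step 3: extend an exact word hit in both directions without gaps."""
    score = best = w * match
    q_start, s_start, q_end, s_end = q, s, q + w, s + w
    i, j = q + w, s + w
    while i < len(query) and j < len(subject):          # extend to the right
        score += match if query[i] == subject[j] else mismatch
        i, j = i + 1, j + 1
        if score > best:
            best, q_end, s_end = score, i, j
        elif best - score >= x_drop:
            break
    score = best
    i, j = q - 1, s - 1
    while i >= 0 and j >= 0:                            # extend to the left
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_start, s_start = score, i, j
        elif best - score >= x_drop:
            break
        i, j = i - 1, j - 1
    return best, q_start, q_end, s_start, s_end


def best_hsp(query, subject, w=3):
    """Steps 1-3: seed with exact word matches, extend each seed, keep the best segment."""
    index = build_word_index(subject, w)
    best = None
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):         # step 2: exact matches
            hsp = extend(query, subject, q, s, w)
            if best is None or hsp[0] > best[0]:
                best = hsp
    return best


print(best_hsp("ACGTACGT", "TTACGTACGTTT"))
```

The returned tuple is (score, query start, query end, subject start, subject end) with exclusive end positions.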


Statistical assessment

E-value: $E = m \times n \times P$, where
$m$ = total number of residues in the database
$n$ = number of residues in the query sequence
$P$ = probability that an HSP alignment is a result of random chance
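As a worked example of the E-value definition above (database size times query length times the chance probability of the HSP), the short sketch below plugs in hypothetical numbers. The values are chosen only to show the arithmetic and do not come from the slides.

```python
def e_value(db_residues, query_residues, p_hsp):
    """Expected number of chance hits: database size x query length x HSP chance probability."""
    if not 0.0 <= p_hsp <= 1.0:
        raise ValueError("p_hsp must be a probability")
    return db_residues * query_residues * p_hsp


# Hypothetical inputs: a 1,000,000-residue database, a 300-residue query,
# and a chance probability of 1e-12 for the observed HSP.
print(e_value(1_000_000, 300, 1e-12))
```

With these inputs the search space is 3e8 residue pairs, so the expected number of chance hits is about 3e-4. Doubling the database size doubles the E-value for the same alignment, which is why E-values from searches against different databases are not directly comparable.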


Different flavors!

• BLASTN
  – Queries nucleotide vs. nucleotide sequences

• BLASTP
  – Queries protein vs. protein sequences

• BLASTX
  – Queries 6 possible frames of nucleotide sequences vs. protein sequences

• TBLASTN
  – The reverse of BLASTX: queries protein sequences vs. 6 possible frames of the nucleotide sequences in the database

• TBLASTX
  – Queries 6 possible frames of nucleotide sequences vs. 6 possible frames of the nucleotide sequences in the database
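The "6 possible frames" are the three codon offsets on each strand. A minimal sketch of how they are enumerated follows; it stops at the nucleotide level (translation with a codon table is omitted), and the names `BLAST_FLAVORS` and `six_frames` are hypothetical.

```python
# What each flavor compares: (query type, database type).
BLAST_FLAVORS = {
    "BLASTN": ("nucleotide", "nucleotide"),
    "BLASTP": ("protein", "protein"),
    "BLASTX": ("translated nucleotide", "protein"),
    "TBLASTN": ("protein", "translated nucleotide"),
    "TBLASTX": ("translated nucleotide", "translated nucleotide"),
}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return {frame: nucleotides trimmed to whole codons} for frames +1..+3 and -1..-3."""
    frames = {}
    for strand, strand_seq in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            sub = strand_seq[offset:]
            sub = sub[: len(sub) - len(sub) % 3]   # drop a trailing partial codon
            frames[strand * (offset + 1)] = sub
    return frames


for frame, codons in sorted(six_frames("ATGAAACCC").items()):
    print(frame, codons)
```

Each of the six strings would then be translated codon by codon before the protein-level comparison.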

• Combines protein signatures from a number of member databases into a single searchable resource

• Capitalizes on their individual strengths to produce an integrated database and diagnostic tool.

"InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites."

Current release: 36.0 (23 February 2012)

New features:

• An update to Pfam (26.0) and PIRSF (2.78).

• The integration of 755 new methods from the GENE3D, PANTHER, PIRSF, Pfam and SUPERFAMILY databases.

Member database information

Signature Database   Version   Signatures*   Integrated Signatures**
GENE3D               3.3.0     2386          1441
HAMAP                140911    1702          1686
PANTHER              7         69566         2392
PIRSF                2.78      2983          2983
PRINTS               41.1      2050          2001
PROSITE patterns     20.72     1308          1291
PROSITE profiles     20.72     922           897
Pfam                 26        13672         12672
PfamB                26        20000         0
ProDom               2006.1    1894          1105
SMART                6.2       1008          1002
SUPERFAMILY          1.73      1774          1208
TIGRFAMs             10.1      4023          4002

* Some signatures may not have matches to UniProtKB proteins.
** Not all signatures of a member database may be integrated at the time of an InterPro release.


• Gene3D: “The Gene3D database is a large collection of CATH (Class, Architecture, Topology, Homologous superfamily) protein domain assignments for ENSEMBL genomes and UniProt sequences.”

• HAMAP: High-quality Automated and Manual Annotation of microbial Proteomes

• PANTHER: Protein ANalysis THrough Evolutionary Relationships

• PIRSF: Evolutionary relationships of proteins from super- to sub-families

• PRINTS: “PRINTS is a database of protein family ‘fingerprints’ offering a diagnostic resource for newly-determined sequences.”

• PROSITE: Database of protein domains, families and functional sites

• ProDom

• SMART: Simple Modular Architecture Research Tool

• SUPERFAMILY: “SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.”

• TIGRFAMs

Integration into InterPro

Features of Member Databases

• ProDom: provider of sequence clusters built from UniProtKB using PSI-BLAST.

• PROSITE patterns: provider of simple regular expressions.

• PROSITE and HAMAP profiles: providers of sequence matrices.

• PRINTS: provider of fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs).

• PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: providers of hidden Markov models (HMMs).
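To show what "simple regular expressions" means for PROSITE patterns, the sketch below converts the basic PROSITE pattern syntax into a Python regular expression. It is a hedged illustration that handles only the common elements (x, [..], {..}, (n) and (n,m) repeats, and < or > anchors outside brackets); the helper names are hypothetical, and this is not the scanner InterProScan itself uses.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE pattern such as 'N-{P}-[ST]-{P}.' to a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    pieces = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # {P}    -> [^P]
        element = element.replace("(", "{").replace(")", "}")    # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x      -> any residue
        element = element.replace("<", "^").replace(">", "$")    # terminal anchors
        pieces.append(element)
    return "".join(pieces)


def find_sites(pattern, sequence):
    """Return (start, matched text) for each non-overlapping match of the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_sites("N-{P}-[ST]-{P}.", "AANGSAA"))
```

Patterns like this are cheap to scan but match short, low-information sites, which is one reason the slide lists them separately from profile and HMM based signatures.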

Querying with InterProScan

“Sequence-based queries are performed using InterProScan, a tool that combines the different protein signature recognition methods native to the InterPro member databases into one resource.”

InterProScan

Query Sequence

• Web version
• Stand-alone version
  – A wrapper of sequence analysis apps
  – Database and output files scanning
  – Bulk data processing

Querying with InterProScan

Member Databases & Scanning Methods

The TMHMM and SignalP prediction search algorithms are provided through the web interface at EBI. However, they are not integrated into InterPro.

Member Databases                    Scanning Methods     Software Package
PROSITE patterns                    pfscan               Pftools
PROSITE profiles, HAMAP profiles    pfscan               Pftools
PRINTS                              FingerPRINTScan
PFAM                                hmmscan              HMMER3.0b3
PRODOM                              ProDomBlast
SMART                               hmmpfam              HMMER2.3.2
TIGRFAMs                            hmmscan              HMMER3.0b3
PIR SuperFamily                     hmmpfam              HMMER2.3.2
SUPERFAMILY                         hmmpfam/hmmsearch    HMMER2.3.2
GENE3D                              hmmpfam              HMMER2.3.2

Blast2GO

• B2G has been designed to (1) allow automatic and high-throughput sequence annotation and (2) integrate functionality for annotation-based data mining.

Why Blast2GO?

• Blast2GO is designed for high-throughput sequence annotation.

• Strong data mining and visualization capabilities

• Good at utilizing annotated sequences already deposited in public databases.

How Blast2GO works?

• Basically, Blast2GO uses local or remote BLAST searches to find sequences similar to one or several input sequences.

• The program extracts the GO terms associated with each of the obtained hits and returns an evaluated GO annotation for the query sequence(s).

• Enzyme codes are obtained by mapping from equivalent GOs, while InterPro motifs are directly queried at the InterProScan web service.

• GO annotation can be visualized by reconstructing the structure of the Gene Ontology relationships, and ECs are highlighted on KEGG maps

How Blast2GO works?

• OBTAINING GO TERMS
  – The first step is to find sequences similar to a query set by BLAST searching. The homology search can be done either against public databases or against custom databases when a local BLAST installation is available.
  – By using BLAST hit gene identifiers (gi) and gene accessions, B2G retrieves all GO annotations for the hit sequences, together with their evidence codes (EC).

How Blast2GO works?

• ANNOTATION ASSIGNMENT
  – Annotation score (AS), direct term (DT)

How Blast2GO works?

• STATISTICS
  – Statistical assessment of GO term enrichment in a group of interesting genes compared with a reference group (Blüthgen et al., 2004).
  – Gossip computes Fisher’s Exact Test, applying a robust FDR (false discovery rate) correction for multiple testing, and returns a list of significant GO terms ranked by their corrected or one-test P-values

• VISUALIZATION
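The enrichment test named in the STATISTICS step can be sketched with the standard library alone. The code below computes a one-sided Fisher's exact test P-value from the hypergeometric distribution and then applies a Benjamini-Hochberg adjustment. The Benjamini-Hochberg step is a common FDR procedure used here as a stand-in; it is an assumption, not necessarily the "robust FDR" correction that Gossip applies, and all counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided P(X >= k): at least k annotated genes in a test set of n,
    when K of the N reference genes carry the GO term."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                 # walk from largest to smallest P-value
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 5 of 5 selected genes carry a term held by 5 of 10 genes overall.
print(fisher_enrichment_p(5, 5, 5, 10))
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

In practice one P-value is computed per GO term, so the multiple-testing correction is applied across all terms tested.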

Systems for Functional Annotation

• Clusters of Orthologous Groups (COGs)
• euKaryote Orthologous Groups (KOGs)
• Gene Ontology (GO)
• Enzyme Commission no. (EC)

Clusters of Orthologous Groups of Genes (KOGs, COGs)

– Why?
  • Orthologs typically retain the same function during evolution and hence have a critical role in functional annotation. COGs provide a framework for functional analysis.

  • They are also important for phylogenetic and evolutionary analysis of genomes. Interpretable phylogenetic trees generally can be constructed only within sets of orthologs.

How to find Orthologous genes?

• Naive approach: for a query gene and a target genome, the hit with the highest similarity score is taken to indicate an orthologous relationship
  – Gives good results for species that are not too distant
  – How about larger phylogenetic distances?

• Gene duplications: suggest that a many-to-many relationship is required

• What if several hits with only moderate scores emerge? A stringent threshold may lead to false negatives

• COG approach: any two genes inside a COG are either orthologous genes or belong to orthologous groups of paralogs

How to create COGs

1. Take all 2-permutations (ordered pairs) of the available genes and perform pairwise comparisons between genes from different clades (in this case, 5 clades)
2. Recognize the best hits (BeTs) in other organisms
3. Build the graph of consistent relations (this does not depend on an absolute threshold level)
4. The simplest case is a triangle: if a gene yields best hits in two other genomes, those two genes are expected to be best hits of each other only if all three are orthologs
5. Merge all triangles with a common side

Number of genes n vs. number of 2-permutations n(n − 1):

Genes    2-permutations
2        2
10       90
3000     ~9.0e6
17967    ~3.2e8

How to create COGs - continued

6. Due to the existence of paralogs, BeTs are not necessarily symmetrical, i.e., they are not always Reciprocal Best BLAST Hits (RBBH)

Tatusov, Koonin & Lipman, Science 278, 631 (1997)
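The triangle-based construction can be sketched on a toy best-hit table. This is a simplified illustration of the idea on the slides, not the published COG procedure: gene names of the form "genome:gene" are hypothetical, an edge is drawn when a best hit exists in either direction (so asymmetric BeTs still count), and triangles sharing a side are merged into one cluster.

```python
from itertools import combinations


def genome_of(gene):
    """Genes are named 'genome:gene' in this toy example."""
    return gene.split(":")[0]


def bet_edges(best_hits):
    """Undirected edges between genes from different genomes linked by a best hit."""
    edges = set()
    for gene, hits in best_hits.items():
        for hit in hits:
            if genome_of(hit) != genome_of(gene):
                edges.add(frozenset((gene, hit)))
    return edges


def triangles(edges):
    """All triples of mutually linked genes drawn from three different genomes."""
    neighbours = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    found = set()
    for edge in edges:
        a, b = tuple(edge)
        for c in neighbours[a] & neighbours[b]:
            if len({genome_of(a), genome_of(b), genome_of(c)}) == 3:
                found.add(frozenset((a, b, c)))
    return found


def merge_triangles(tris):
    """Merge triangles that share a side (two genes) into clusters of genes."""
    clusters = []                                   # list of (genes, sides)
    for tri in tris:
        genes = set(tri)
        sides = {frozenset(pair) for pair in combinations(tri, 2)}
        kept = []
        for cluster_genes, cluster_sides in clusters:
            if cluster_sides & sides:               # common side: absorb the cluster
                genes |= cluster_genes
                sides |= cluster_sides
            else:
                kept.append((cluster_genes, cluster_sides))
        kept.append((genes, sides))
        clusters = kept
    return [genes for genes, _ in clusters]


# Hypothetical best hits; D:g1 points at A:g1 and B:g1 but is not their best hit.
best_hits = {
    "A:g1": ["B:g1", "C:g1"],
    "B:g1": ["A:g1", "C:g1"],
    "C:g1": ["A:g1", "B:g1"],
    "D:g1": ["A:g1", "B:g1"],
    "A:g2": ["B:g2"],
}
print(merge_triangles(triangles(bet_edges(best_hits))))
```

In this example the two triangles share the side A:g1 and B:g1, so they merge into a single four-gene cluster, while the A:g2 and B:g2 pair forms no triangle and stays out.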

Facing challenges when creating COGs

• The clusters, however, are subject to ambiguity:
  – Proteins with distinct regions (multi-domain proteins), each belonging to a different conserved family.
    • Solution: further inspection of domains
  – When one gene in a pair of paralogs is lost in one lineage (but not in the other), the two COGs may be artificially merged.
    • Solution: similarity measures

COGs vs. Gene Function

• Each COG includes proteins from at least 3 major clades, with divergence times estimated at over a billion years. Hence they are ancient conserved families with important (if not essential) functions

• Accordingly, the proteins belonging to uncharacterized COGs are good candidates for further analysis

• Also, if someone experimentally verifies a gene's function, it can be confidently applied to fellow COG members. Similarly, upon inclusion of a new gene in a COG (by COGNITOR), its function is inferred

• For most free-living prokaryotes, ~80% of the genes belong to COGs. Up to 10% of proteins in genomes are estimated to be fast-evolving, poorly conserved proteins, and hence the COG coverage of most genomes is approaching saturation

http://www.ncbi.nlm.nih.gov/COG/

Clusters of Orthologous Groups (COGs)


Classification of COGs by functional categories

INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones

METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown

LipoP

• A tool used mainly to predict lipoprotein signal peptides.

• It is most suitable for Gram-negative bacteria, but has been shown to have considerable accuracy for Gram-positive bacteria as well.

• It uses Hidden Markov Models to distinguish between lipoproteins (SPaseII-cleaved proteins), SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins.

Thank You!


To be continued…
