Aim and Outline of the course · 2014. 7. 10.


Post on 23-Sep-2020


Aim and Outline of the course
Db data mining

Db tools can be used to retrieve information for a gene or protein. The most important concept is similarity.

BLAST

InterPro

Blast2GO

Sequence Alignments

BLAST (Basic Local Alignment Search Tool)

• One of the tools of the NCBI - The U.S. National Center for Biotechnology Information.

• Uses word matching like FASTA

• Similarity matching of words (3 amino acids, or 11 nucleotides); the words do not have to be identical.

• If no words are similar, then there is no alignment

– so BLAST won't find matches for very short sequences

Sequence Alignments

BLAST word matching

Break query into words:

MEAAVKEEISVEDEAVDKNI
MEA EAA AAV AVK VKE KEE EEI EIS ISV ...

Break database sequences into words:
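The word-splitting step can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not BLAST's own code; the function name `split_into_words` is made up for this example.

```python
def split_into_words(sequence, word_len=3):
    """Return every overlapping word of length word_len, in order."""
    return [sequence[i:i + word_len]
            for i in range(len(sequence) - word_len + 1)]


query = "MEAAVKEEISVEDEAVDKNI"
# Show the first nine words, as on the slide.
print(" ".join(split_into_words(query)[:9]))
```

A sequence of length n yields n - 2 words of length 3, so a query shorter than the word length yields no words at all.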

Sequence Alignments

Compare word lists

Query Word List: MEA EAA AAV AVK VKL KEE EEI EIS ISV

Database Sequence Word Lists: RTT AAQ SDG KSS SRW LLN QEL RWY VKI GKG DKI NIS LFC WDV AAV KVR PFR DEI

Compare word lists by hashing, and allow near matches!
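A rough sketch of comparing word lists by hashing, with near matches allowed. As a simplifying assumption, a "near match" here means a word that differs at no more than one position; BLAST itself scores candidate words with a substitution matrix against a threshold. The names `build_index`, `neighbours` and `find_hits` are invented for this example.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_index(words):
    """Hash table: word -> list of positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return index


def neighbours(word):
    """The word itself plus every word differing at exactly one position
    (a toy stand-in for a substitution-matrix threshold)."""
    result = {word}
    for i in range(len(word)):
        for aa in AMINO_ACIDS:
            result.add(word[:i] + aa + word[i + 1:])
    return result


def find_hits(query_words, index):
    """Return (query word, database word) pairs found via hash lookups."""
    hits = []
    for qw in query_words:
        for candidate in sorted(neighbours(qw)):
            if candidate in index:  # O(1) lookup, no scan of the list
                hits.append((qw, candidate))
    return hits


db_words = ("RTT AAQ SDG KSS SRW LLN QEL RWY VKI GKG DKI NIS "
            "LFC WDV AAV KVR PFR DEI").split()
query_words = "MEA EAA AAV AVK VKL KEE EEI EIS ISV".split()
for query_word, db_word in find_hits(query_words, build_index(db_words)):
    print(query_word, "->", db_word)
```

Each lookup costs the same regardless of how many database words are stored, which is the reason for hashing instead of comparing every query word with every database word.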

Sequence Alignments

Find locations of matching words in all sequences

Query words: MEA EAA AAV AVK KLV KEE EEI EIS ISV

Database sequences:
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
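One possible way to locate the matching words is to index every word of every database sequence together with its position. The sketch below does this for exact matches only; `index_database` and `locate` are illustrative names, not part of any BLAST distribution.

```python
from collections import defaultdict


def index_database(sequences, word_len=3):
    """Map each word to the (sequence number, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for offset in range(len(seq) - word_len + 1):
            index[seq[offset:offset + word_len]].append((seq_id, offset))
    return index


def locate(query_words, index):
    """Return (word, sequence number, offset) for every exact word hit."""
    return [(word, seq_id, offset)
            for word in query_words
            for seq_id, offset in index.get(word, [])]


database = [
    "ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT",
    "TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY",
    "IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH",
]
query_words = "MEA EAA AAV AVK KLV KEE EEI EIS ISV".split()
for word, seq_id, offset in locate(query_words, index_database(database)):
    print(word, "found in sequence", seq_id, "at offset", offset)
```

Offsets are 0-based; each hit is a seed that the extension step can start from.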

Sequence Alignments

Extend hits one base at a time

• Then BLAST extends the matches in both directions, starting at the seed. The ungapped alignment process extends the initial seed match of length W in each direction in order to boost the alignment score. Indels are not considered during this stage.

• In the last stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user.
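The ungapped extension stage can be sketched as follows. This is a simplified illustration under stated assumptions: scoring is +1 for a match and -1 for a mismatch (protein BLAST uses a substitution matrix), and extension in a direction stops once the running score falls more than `x_drop` below the best score seen. The names `extend_seed`, `_extend` and `x_drop` are chosen for this example.

```python
def _extend(query, subject, q, s, step, x_drop, match, mismatch):
    """Walk from (q, s) in direction step; return (best gain, best length)."""
    gain = best_gain = 0
    length = best_length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += match if query[q] == subject[s] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > x_drop:
            break  # score dropped too far below the best: give up
        q += step
        s += step
    return best_gain, best_length


def extend_seed(query, subject, q_start, s_start, word_len,
                x_drop=2, match=1, mismatch=-1):
    """Extend a seed without gaps; return (score, query start, query end)."""
    seed_score = sum(
        match if query[q_start + i] == subject[s_start + i] else mismatch
        for i in range(word_len))
    right_gain, right_len = _extend(query, subject, q_start + word_len,
                                    s_start + word_len, 1,
                                    x_drop, match, mismatch)
    left_gain, left_len = _extend(query, subject, q_start - 1, s_start - 1,
                                  -1, x_drop, match, mismatch)
    return (seed_score + left_gain + right_gain,
            q_start - left_len,
            q_start + word_len + right_len)


# Toy sequences, invented for the example; the seed is "EAA" at offset 3.
score, start, end = extend_seed("GGMEAAVKCC", "TTMEAAVKPP", 3, 3, 3)
print(score, "GGMEAAVKCC"[start:end])
```

No indels are considered here, matching the slide: both sequences advance one position at a time.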


Why statistics??

Sequence Alignments

BLAST: example of result

Job Title: P14867|GBRA1_HUMAN Gamma-aminobutyric acid...
BLASTP 2.2.18 (Mar-02-2008) protein-protein BLAST
Database: Non-redundant SwissProt sequences
309,621 sequences; 115,465,120 total letters
Query= P14867|GBRA1_HUMAN Gamma-aminobutyric acid receptor subunit alpha-1 - Homo sapiens (Human).
Length=456

Sequences producing significant alignments:   Score (Bits)   E Value
sp|P14867.3|GBRA1_HUMAN Gamma-aminobutyric acid receptor subu...   948   0.0
sp|Q5R6B2.1|GBRA1_PONPY Gamma-aminobutyric acid receptor subu...   944   0.0
sp|Q4R534.1|GBRA1_MACFA Gamma-aminobutyric acid receptor subu...   944   0.0
sp|P08219.1|GBRA1_BOVIN Gamma-aminobutyric acid receptor subu...   939   0.0
sp|P62813.1|GBRA1_RAT Gamma-aminobutyric acid receptor subuni...   908   0.0
sp|P19150.1|GBRA1_CHICK Gamma-aminobutyric acid receptor subu...   882   0.0
sp|P47869.2|GBRA2_HUMAN Gamma-aminobutyric acid receptor subu...   670   0.0
sp|P26048.1|GBRA2_MOUSE Gamma-aminobutyric acid receptor subu...   669   0.0
sp|P23576.1|GBRA2_RAT Gamma-aminobutyric acid receptor subuni...   669   0.0
sp|P10063.1|GBRA2_BOVIN Gamma-aminobutyric acid receptor subu...   667   0.0
sp|Q08E50.1|GBRA5_BOVIN Gamma-aminobutyric acid receptor subu...   641   0.0
sp|Q8BHJ7.1|GBRA5_MOUSE Gamma-aminobutyric acid receptor subu...   640   0.0
sp|P31644.1|GBRA5_HUMAN Gamma-aminobutyric acid receptor subu...   638   0.0
sp|P19969.1|GBRA5_RAT Gamma-aminobutyric acid receptor subuni...   636   0.0
sp|P34903.1|GBRA3_HUMAN Gamma-aminobutyric acid receptor subu...   632   0.0
sp|P26049.1|GBRA3_MOUSE Gamma-aminobutyric acid receptor subu...   630   6e-180
sp|P10064.1|GBRA3_BOVIN Gamma-aminobutyric acid receptor subu...   628   2e-179
sp|P20236.1|GBRA3_RAT Gamma-aminobutyric acid receptor subuni...   627   3e-179
sp|P30191.1|GBRA6_RAT Gamma-aminobutyric acid receptor subuni...   520   6e-147
sp|P16305.2|GBRA6_MOUSE Gamma-aminobutyric acid receptor subu...   518   2e-146
sp|Q90845.1|GBRA6_CHICK Gamma-aminobutyric acid receptor subu...   518   3e-146
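To make the link between the bit score and the E value concrete, the standard relation E = m · n · 2^(-S') (m = query length, n = database length, S' = bit score) can be evaluated directly. The sketch below plugs in the raw lengths from the result above; BLAST uses edge-corrected "effective" lengths, so its reported values are somewhat smaller. The function name `expected_hits` is made up for this example.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E value: number of hits with at least this bit score
    expected by chance in a search space of query_len * db_len."""
    return query_len * db_len * 2.0 ** (-bit_score)


QUERY_LEN = 456          # Length=456 in the result
DB_LEN = 115_465_120     # total letters in the database

for bits in (630, 520):
    print(bits, "bits ->", f"{expected_hits(bits, QUERY_LEN, DB_LEN):.1e}")
```

For a bit score of 630 this estimate is within an order of magnitude of the 6e-180 shown in the result. An E value this small means that a match of this quality is essentially never expected by chance in a database of this size.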

Sequence Alignments

BLAST is approximate but fast

• BLAST makes similarity searches very quickly, but also makes errors

– misses some important similarities

– makes many incorrect matches

• The NCBI BLAST web server lets you compare your query sequence to various sequences stored in GenBank.

• It runs on a VERY fast and powerful computer.

• The speed and relatively good accuracy of BLAST are the key reasons why it is the most popular bioinformatics search tool.

InterPro Database: Protein Functional Analysis

Jennifer McDowall, Ph.D., Senior InterPro Curator

EBI Sequence Databases

[Diagram: nucleotide sequences (EMBL, GenBank, DDBJ), e.g. CGCGCCTGTACGCTGAACGCTCGTGACGTGTAGTGCGCG, are translated into protein sequences in UniProtKB TrEMBL (>7M); manual annotation produces UniProtKB Swiss-Prot (>400,000).]

InterPro: protein annotation using protein signatures, which describe groups of related proteins (same family or share domains)

InterPro ~80% Protein Coverage

[Diagram: InterPro signature matches in UniProtKB]

UniProt/Swiss-Prot proteins: ~400,000, of which ~370,000 have InterPro signature matches

UniProt/TrEMBL proteins: >7M, of which >5.3M have InterPro signature matches

UniMES metagenomic proteins: >6M (available 2009)

What are protein signatures?

[Figure: multiple sequence alignment]

• A signature describes the pattern of a set of conserved residues in a group of proteins

– Define a protein family
– Define a protein feature (domain or conserved site)

What value are signatures?

• More sensitive homology searches
– Find more distant homologues than BLAST

• Classification of proteins
– Associate proteins that share: function, domains, sequence, structure

• Annotation of protein sequences
– Define conserved regions of a protein, e.g. location and type of domains, key structural or functional sites

• Transfer additional (automatic) annotation
– Associate TrEMBL proteins with well-annotated Swiss-Prot proteins and transfer annotation


Signature methods

•  Pattern

•  Fingerprint

•  Sequence clustering

•  HMM

•  SAM

Patterns

Pattern/motif in sequence → regular expression

Can define important sites:

• Enzyme catalytic site
• Prosthetic group attachment
• Metal ion binding site
• Cysteines for disulphide bonds
• Protein or molecule binding

EXAMPLE: PS00262 Insulin family signature

B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

Matched region: CCTSICSLYQLENYC

Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Patterns – understanding a regular expression

C - C - {P} - x(2) - C - [STDNEKPI] - x(3) - [LIVMFS] - x(3) - C

• C: strictly conserved site; only one amino acid is accepted at this position

• {P}: curly brackets denote amino acids that cannot occur at a single position

• x: any amino acid can occur at a single position

• x(2): any amino acid can occur at each of the next two positions

• [STDNEKPI]: square brackets denote the range of amino acids that can occur at a single position

• There are dashes between each position
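The notation rules above map almost one-to-one onto an ordinary regular expression. The converter below is a minimal sketch that handles only the elements described on these slides (single residues, x, [..], {..} and repeat counts); the function name `prosite_to_regex` is invented for this example and is not part of any PROSITE tool.

```python
import re

ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"   # residue, x, [allowed] or {forbidden}
    r"(?:\((\d+)(?:,(\d+))?\))?")        # optional (n) or (n,m) repeat


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            core = "."                           # any amino acid
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"    # anything except these
        else:
            core = residue                       # C or [STDNEKPI] as-is
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


insulin = ("MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVG"
           "QVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN")
regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
hit = re.search(regex, insulin)
print(regex)
print(hit.group(0) if hit else "no match")
```

Applied to the insulin precursor sequence from the slides, the pattern matches the region in the A chain that contains the conserved cysteines.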

Patterns

[Diagram: building a pattern signature]

1) Sequence alignment
2) Define pattern (e.g. Insulin family motif)
3) Extract pattern sequences
4) Build regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
5) Pattern signature (PS00000)

Fingerprints

Several motifs → characterise family

Different combinations of motifs describe subfamilies

Identify small conserved regions in divergent proteins

EXAMPLE: PR00107 Phosphocarrier HPr signature

PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE

3-motif fingerprint:

1) GIHARPATLLVQTASKF (His phosphorylation site)
2) KGKSVNLKSIMGVMSL (Ser phosphorylation site)
3) LGVGQGSDVTITVDGADE (Conserved site)
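A small sketch of the fingerprint idea: a sequence matches only if all motifs are found, and in the right order. As a deliberate simplification, the motifs are searched for as exact strings; a real fingerprint scores each motif against a set of aligned motif sequences and tolerates variation. The function name `find_fingerprint` is made up for this example.

```python
def find_fingerprint(sequence, motifs):
    """Return the start offset of each motif if all occur in the given
    order, otherwise None. Exact string matching is a simplification."""
    positions = []
    search_from = 0
    for motif in motifs:
        pos = sequence.find(motif, search_from)
        if pos == -1:
            return None          # motif missing or out of order
        positions.append(pos)
        search_from = pos + 1    # next motif must start further along
    return positions


PTHP_ENTFA = ("MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSL"
              "GVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE")
MOTIFS = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]

print(find_fingerprint(PTHP_ENTFA, MOTIFS))
print(find_fingerprint(PTHP_ENTFA, list(reversed(MOTIFS))))
```

The second call presents the same motifs in the wrong order and therefore reports no fingerprint match, even though each motif is present in the sequence.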

Fingerprints

[Diagram: building a fingerprint signature]

1) Sequence alignment
2) Define motifs (His phosphorylation site, Ser phosphorylation site, conserved site)
3) Extract motif sequences
4) Fingerprint signature (PR00000): motifs 1, 2, 3 in the correct order with the correct spacing

Sequence clustering

Automatic clustering of homologous domains

[Diagram: known domain families → recruit homologous domains (PSI-BLAST) → automatic clustering (MKDOM2) → align domain families (ProDomAlign)]

** Rarely covers entire domain (conserved core)

** Signature size can change with release

Hidden Markov Models (HMM)

Can characterise protein over entire length

Models conserved and divergent regions (position-specific scoring)

Models insertions and deletions

– Outperform in sensitivity and specificity

– More flexible (can use partial alignments)

Hidden Markov Models (HMM)

[Diagram: sequence alignment of Sequences 1 to 7]

Profile: scoring matrix (residue frequency at each position in alignment)

HMM: Bayesian statistics, probability scoring
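The "residue frequency at each position" part of a profile can be computed directly from an alignment. The sketch below uses a toy four-sequence alignment invented for illustration, and the function name `column_frequencies` is likewise hypothetical; real profiles additionally apply sequence weighting and pseudocounts, which are left out here.

```python
from collections import Counter


def column_frequencies(alignment):
    """For each alignment column, return {residue: relative frequency}."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(width):
        counts = Counter(row[col] for row in alignment)
        profile.append({res: n / len(alignment) for res, n in counts.items()})
    return profile


# Toy alignment (made up): two conserved columns, two variable ones.
toy_alignment = ["CCTS", "CCAS", "CCTT", "CCSS"]
for position, freqs in enumerate(column_frequencies(toy_alignment), start=1):
    print(position, freqs)
```

Strictly conserved columns get a single residue with frequency 1.0, while variable columns spread the probability over several residues; this is the position-specific information that a fixed pattern cannot express.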

Hidden Markov Models (HMM)

[Diagram: profile HMM built column by column from the alignment of Sequences 1 to 7]

M = match state: M1 M2 M3 M4 M5 M6 M7 M8 M9 M10

I = insert state: I1 I2 I3 I4 I5 I6 I7 I8 I9

D = delete state: D2 D3 D4 D5 D6 D7 D8 D9
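The state layout in the diagram can be written out explicitly. The sketch below generates the states named on the slide (M1 to M10, I1 to I9, D2 to D9) and a set of allowed transitions. The transition rules are an assumption for illustration (a commonly used profile-HMM layout without direct insert/delete transitions); the slides do not spell them out, and `profile_hmm_topology` is an invented name.

```python
def profile_hmm_topology(n_match=10):
    """Return (states, transitions) for a simple profile HMM layout.
    Assumed rules: M_k -> M_k+1, I_k, D_k+1; I_k -> I_k, M_k+1;
    D_k -> D_k+1, M_k+1."""
    match = [f"M{k}" for k in range(1, n_match + 1)]
    insert = [f"I{k}" for k in range(1, n_match)]      # between matches
    delete = [f"D{k}" for k in range(2, n_match)]      # skip one match
    delete_set = set(delete)

    transitions = set()
    for k in range(1, n_match):
        transitions.add((f"M{k}", f"M{k + 1}"))        # next match column
        transitions.add((f"M{k}", f"I{k}"))            # start an insertion
        transitions.add((f"I{k}", f"I{k}"))            # extend the insertion
        transitions.add((f"I{k}", f"M{k + 1}"))        # end the insertion
        if f"D{k + 1}" in delete_set:
            transitions.add((f"M{k}", f"D{k + 1}"))    # skip next column
        if f"D{k}" in delete_set:
            transitions.add((f"D{k}", f"M{k + 1}"))    # resume matching
            if f"D{k + 1}" in delete_set:
                transitions.add((f"D{k}", f"D{k + 1}"))  # extend deletion
    return match + insert + delete, sorted(transitions)


states, transitions = profile_hmm_topology()
print(len(states), "states,", len(transitions), "transitions")
```

Match states emit the residue frequencies of one alignment column, insert states absorb extra residues between columns, and delete states emit nothing, which is how the model accounts for insertions and deletions.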

Hidden Markov Models (HMM)

HMM databases:

• PIR SUPERFAMILY, PANTHER, TIGRFAM: families conserved in sequence

• PFAM, SMART: domains conserved in sequence

• SUPERFAMILY, GENE3D: domains conserved in structure

SAM Profile HMMs

Homologous structural superfamilies

Start with single seed sequence

Proteins in superfamily may have low sequence identity

Few proteins in family have PDB structures

Create 1 model for every protein in superfamily → combine results

SAM Profile models

T99 script:

[Diagram: single seed sequence (GIHARPATLLVQTASKF) → initial model → WU-BLASTP search (close homologues, low identity matches) → new larger alignment → final model]

Signature Methods

• Pattern: describes protein features (active sites, binding sites…)

• Fingerprint: describes families and sibling subfamilies

• Sequence clustering: predicts conserved domains

• HMM: functional classification of families; functional domain annotation

• SAM: structural domain annotation

Comprehensive annotation

InterPro removes redundancy

• Domain annotation: SWIB/MDM2 domain, RanBP2-type zinc finger, RING-type zinc finger

• Annotate features: conserved site within zinc finger

• Family classification: Mdm2/Mdm4 family (parent), Mdm4 subfamily (child)


Domain Boundaries

Gene3D (and SSF) determines domain structural boundaries

Pfam trims domains to regions of good sequence conservation

ProDom displays shortest conserved sequence

Fragmented Signatures

1) Signature method

•  e.g. PRINTS – discrete motifs

2) Duplicated domains

•  e.g. SSF – duplication consisting of 2 domains with same fold

3) Repeated elements

•  e.g. Kringle, WD40

4) Non-contiguous domains

•  Structural domains can consist of non-contiguous sequence

Complementary Annotation

– Sequence-based signature (Pfam) shows that the domain is made up of repeating sequence elements (beta-propeller repeat)

– Structure-based signature (SSF) shows boundaries of structural domain (7-blade beta-propeller)

– PFAM shows domain is composed of two types of repeated sequence motifs; SUPERFAMILY shows the potential domain boundaries

– GENE3D shows that these domains share homologous structure; PFAM/SMART show 2 domains from distinct sequence families

Functional annotation with Blast2GO

Index

• Why Blast2GO? Concepts on functional annotations
– Function assignment
– Vocabularies
– GO and GOA
– The GO Directed Acyclic Graph
– The problem of function transfer

• Blast2GO Java application + Practicals

• Blast2GO @ Babelomics + Practicals

Why Blast2GO? Workflow analysis

[Diagram: Experiment → Data analysis → Gene list (MNAT1 CTNNBL1 ENOX2 GTPBP1 RALY TAGLN2 RAB3A PPP2R5A MAPRE1 ...) + Functional annotation → Functional profiling → Functional interpretation]

What does Blast2GO do?

• Generates annotations
• Visualization of functional annotations

Concepts of functional annotation

• Gene/Protein function
– Refers to the molecular function of a gene or a protein, e.g. tyrosine kinase

• Functional annotation
– More general: refers to the characterization of functional aspects of the protein, e.g. stress-related, cytoplasm, ABC transporter
– Also refers to the process of assigning a function label
– Usually, standard vocabularies are used to assign function

Functional Vocabularies

Molecular Function, Biological Process, Cellular Component

Metabolic pathways

KEGG orthologues

Functional motifs

The Gene Ontology

• Project developed by the Gene Ontology Consortium

• Provides a controlled vocabulary to describe gene and gene product attributes in any organism

• Includes both the development of the Ontology and the maintenance of a Database of annotations


The Ontology

✓  Annotations are given to the most specific (lowest) level.
✓  True path rule: annotation at a term implies annotation to all its parent terms.
✓  Annotation is given with an Evidence Code:
   o  IDA: inferred from direct assay
   o  TAS: traceable author statement
   o  ISS: inferred from sequence similarity
   o  IEA: inferred from electronic annotation
   o  ...

(Figure: GO term hierarchy, ranging from more general to more specific terms.)
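The true path rule can be illustrated with a small sketch. The term names and the `PARENTS` table below are hypothetical toy data, not real GO identifiers; the only assumption is that each term lists its direct parents.

```python
from collections import deque

# Hypothetical toy ontology: term -> list of direct parent terms.
PARENTS = {
    "GO:C": ["GO:A", "GO:B"],  # a term may have more than one parent
    "GO:A": ["GO:root"],
    "GO:B": ["GO:root"],
    "GO:root": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents.get(current, []))
    return seen


def propagate(direct_annotations, parents):
    """Apply the true path rule: add all ancestors of each direct annotation."""
    full = set(direct_annotations)
    for term in direct_annotations:
        full |= ancestors(term, parents)
    return full
```

Annotating a sequence with the most specific term only (here `GO:C`) is therefore enough: `propagate({"GO:C"}, PARENTS)` also yields the more general terms above it.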


The GO has a DAG (directed acyclic graph) structure: a term can have more than one parent


The Gene Ontology Annotation Database (GOA)

•  There is a collaborating institution per organism to provide annotations
•  Most of the GOA annotations come from UniProt
•  Most of the annotations are electronic annotations

http://www.geneontology.org/GO.current.annotations.shtml


InterPro

•  Collection of databases with functional annotation of protein motifs
•  Functional vocabulary at UniProt
•  There is an equivalence table between GO and InterPro

http://www.ebi.ac.uk/interpro/databases.html


Functional assignment

Annotation

•  Empirical
   –  Molecular interactions
   –  Gene/protein expression
   –  Biochemical assay
   –  Literature reference

•  Transference
   –  Structure comparison
      •  Identification of folds
   –  Sequence analysis
      •  Motif identification
      •  Phylogeny
      •  Sequence homology


Automatic annotation

•  GO annotations can be created by comparison to annotated sequences

•  To achieve enough coverage, high-throughput, automatic annotation is required

•  The most effective (but also error-prone) automatic annotation method is transfer based on sequence similarity


Concerns in functional transfer by similarity

•  Level of homology (transfer is possible from roughly 40-60% similarity)
•  The overlap between query and hit sequences (usually not much of a problem)
•  The domain or structure-function association
•  The paralog problem: genes with similar sequences might have different functional specifications
•  The evidence for the original annotation
•  Balance between quality and quantity: depends on the use

(Figure: the GO terms of the HIT (GO1, GO2, GO3, GO4) are transferred to the QUERY.)
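Several of the concerns above (similarity level, overlap, evidence for the original annotation) can be turned into simple filters before any GO term is transferred. The sketch below is only an illustration: the dictionary layout, the function names, the threshold values and the choice of trusted evidence codes are all assumptions made for this example, not settings taken from any tool.

```python
def hit_coverage(hit):
    """Percentage of the hit sequence spanned by the aligned region."""
    return 100.0 * hit["align_len"] / hit["hit_len"]


def transferable_terms(hits, min_similarity=55.0, min_coverage=50.0,
                       trusted_codes=("IDA", "TAS")):
    """Collect GO terms from hits that pass all three (hypothetical) filters."""
    terms = set()
    for hit in hits:
        if hit["similarity"] < min_similarity:
            continue  # too dissimilar to trust the transfer
        if hit_coverage(hit) < min_coverage:
            continue  # alignment covers too little of the hit
        for term, code in hit["go_terms"].items():
            if code in trusted_codes:
                terms.add(term)  # keep only well-supported annotations
    return terms


# Hypothetical hits: similarity in percent, lengths in residues,
# go_terms maps each GO term to the evidence code of its original annotation.
example_hits = [
    {"similarity": 80.0, "align_len": 90, "hit_len": 100,
     "go_terms": {"GO:1": "IDA", "GO:2": "IEA"}},
    {"similarity": 45.0, "align_len": 100, "hit_len": 100,
     "go_terms": {"GO:3": "IDA"}},
    {"similarity": 90.0, "align_len": 20, "hit_len": 100,
     "go_terms": {"GO:4": "TAS"}},
]
```

Loosening the thresholds or accepting more evidence codes increases the number of annotations at the cost of reliability, which is the quality versus quantity balance mentioned above.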


Blast2GO

•  Suite for functional annotation and data mining on functional data
   –  Considerations for annotation
      •  Similarity
      •  Length of the overlap
      •  Percentage of hit sequence spanned by the overlap
      •  Evidence of the original annotation
      •  Blast hits and motif hits
      •  Refinement by additional methods
   –  Visualization:
      •  Annotation charts
      •  Knowledge discovery on the DAG

•  Desktop Java application
•  Web interface @ Babelomics: Babelomics for non-model


Blast2GO Annotation strategy

(Figure: three-step pipeline applied to the query sequences Sq1 to Sq4.
1. Blast: each query sequence yields a list of hits (Hit1, Hit2, Hit3, Hit4, ...).
2. Mapping: the GO terms associated with each hit are retrieved, e.g. "go1, go2, go3", "go1, go3, go4", "go3, go5, go6, go8", "go1, go4".
3. Annotation: GO terms are selected for each sequence from these candidates.)
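The Blast, Mapping and Annotation steps can be sketched as a small pipeline. Everything below is illustrative: the input tables are made-up toy data, and the selection rule in `annotate` (keep a term when enough hits support it) is a deliberately simplified stand-in, not the annotation rule that Blast2GO itself applies.

```python
from collections import Counter

# Hypothetical result of the Blast step: query sequence -> hit identifiers.
blast_hits = {"Sq1": ["Hit1", "Hit2"], "Sq2": ["Hit3"], "Sq3": []}

# Hypothetical annotation table: hit identifier -> GO terms of that hit.
hit_to_go = {
    "Hit1": {"go1", "go2", "go3"},
    "Hit2": {"go1", "go3", "go4"},
    "Hit3": {"go6", "go9", "go8"},
}


def map_go_terms(blast_hits, hit_to_go):
    """Mapping step: per sequence, count how many hits carry each GO term."""
    candidates = {}
    for seq, hits in blast_hits.items():
        counts = Counter()
        for hit in hits:
            counts.update(hit_to_go.get(hit, ()))
        candidates[seq] = counts
    return candidates


def annotate(candidates, min_support=2):
    """Annotation step (simplified): keep terms seen in >= min_support hits."""
    return {
        seq: {term for term, n in counts.items() if n >= min_support}
        for seq, counts in candidates.items()
    }
```

With the toy data, `annotate(map_go_terms(blast_hits, hit_to_go))` keeps only the terms shared by both hits of `Sq1`; a sequence without hits simply stays unannotated.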


Blast2GO Annotation Strategy

(Figure: the candidate GO terms collected for the sequences Sq1 to Sq4 are reduced to a final annotation, followed by a Refinement step using InterPro, Annex, GOSlim and Manual curation. Annotations shown in the example: Sq1: go1, go2, go3, GO11; Sq2: go8, GO12, GO13; Sq3: go2, go4; Sq4: GO15.)


B2G Highlighting on the DAG: The B2G Score

•  Coloring strategy to highlight regions in the DAG where the most interesting information is concentrated

•  The confluence score (B2G score) keeps a balance between the number of annotated sequences at one node and the distance to the origin of annotation
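The balance described above can be sketched as a score in which every annotated sequence contributes to a node, but its contribution shrinks with the distance between the node and the term where the annotation was made. The slide does not give a formula, so the function below is an assumption for illustration: the name `node_score`, the decay factor `alpha` and the use of shortest-path distance are all hypothetical choices.

```python
from collections import deque


def node_score(node, children, seq_counts, alpha=0.6):
    """Sum directly annotated sequences at `node` and below it,
    each weighted by alpha ** (shortest distance from `node`)."""
    total = 0.0
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first walk down the DAG
        current = queue.popleft()
        total += seq_counts.get(current, 0) * alpha ** dist[current]
        for child in children.get(current, []):
            if child not in dist:  # visit each term once, at its shortest distance
                dist[child] = dist[current] + 1
                queue.append(child)
    return total


# Hypothetical toy DAG (term -> direct children) and direct annotation counts.
children = {"root": ["A", "B"], "A": ["C"], "B": ["C"], "C": []}
seq_counts = {"A": 2, "C": 10}
```

With `alpha` below 1, a very general node such as `root` scores lower than a node close to where the sequences were actually annotated, so highlighting the high-scoring nodes points to the informative regions of the DAG.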