![Page 1: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/1.jpg)
Aim and Outline of the course: Db data mining
Db tools can be used to retrieve information for a gene or protein.
The most important concept is similarity.
BLAST
InterPro
Blast2GO
Sequence Alignments
BLAST (Basic Local Alignment Search Tool)
• One of the tools of the NCBI, the U.S. National Center for Biotechnology Information.
• Uses word matching, like FASTA.
• Matches similar words (3 amino acids, or 11 nucleotides); the words do not have to be identical.
• If no words are similar, then there is no alignment
– so it won't find matches for very short sequences
BLAST word matching
Break query into words:
MEAAVKEEISVEDEAVDKNI → MEA EAA AAV AVK VKE KEE EEI EIS ISV ...
Break database sequences into words:
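The word-splitting step can be sketched in Python as follows. This is a minimal illustration rather than BLAST's actual implementation; the function name `split_into_words` is hypothetical, and the default word length of 3 follows the amino-acid word size given earlier.

```python
def split_into_words(sequence, word_size=3):
    """Return every overlapping word (k-mer) of length word_size, in order."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


# Query sequence taken from the slide.
query = "MEAAVKEEISVEDEAVDKNI"
words = split_into_words(query)
print(" ".join(words[:9]))
```

The same function can be applied to every database sequence, which is what makes the two word lists directly comparable.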
Database sequence word lists: RTT AAQ SDG KSS SRW LLN QEL RWY VKI GKG DKI NIS LFC WDV AAV KVR PFR DEI
Query word list: MEA EAA AAV AVK VKE KEE EEI EIS ISV
Compare word lists?
Compare word lists by hashing, and allow near matches!
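A rough sketch of comparing word lists by hashing, with near matches allowed. The "near match" rule used here (a word that differs at exactly one position) is a deliberate simplification chosen for illustration; BLAST itself decides similarity with a substitution-matrix score threshold. The names `neighbours` and `db_index` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_into_words(sequence, word_size=3):
    """Return every overlapping word of length word_size."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def neighbours(word):
    """Toy neighbourhood: all words differing from `word` at exactly one position."""
    result = set()
    for i, original in enumerate(word):
        for aa in AMINO_ACIDS:
            if aa != original:
                result.add(word[:i] + aa + word[i + 1:])
    return result


database_words = ("RTT AAQ SDG KSS SRW LLN QEL RWY VKI GKG DKI NIS "
                  "LFC WDV AAV KVR PFR DEI").split()
query_words = split_into_words("MEAAVKEEISV")

# A Python set is a hash table, so each lookup takes constant time on average.
db_index = set(database_words)

exact = [w for w in query_words if w in db_index]
near = {w: sorted(neighbours(w) & db_index) for w in query_words}
near = {w: hits for w, hits in near.items() if hits}

print("exact:", exact)
print("near:", near)
```

Generating the neighbourhood of each query word and looking those words up in the hash table avoids comparing every query word against every database word.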
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
MEA EAA AAV AVK KLV KEE EEI EIS ISV
Find locations of matching words in all sequences
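Locating the matching words can be sketched with a position index. This is an illustrative sketch only: the index layout (word mapped to a list of `(sequence number, offset)` pairs) and the function name `build_word_index` are assumptions, not a description of BLAST's internal data structures.

```python
from collections import defaultdict


def build_word_index(sequences, word_size=3):
    """Map each word to the (sequence number, 0-based offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos))
    return index


# The three database sequences and the query words shown on the slide.
database = [
    "ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT",
    "TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY",
    "IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH",
]
query_words = "MEA EAA AAV AVK KLV KEE EEI EIS ISV".split()

index = build_word_index(database)
hits = {w: index[w] for w in query_words if w in index}

for word, locations in hits.items():
    print(word, locations)
```

Each hit is a candidate seed: a place in a database sequence from which an alignment might be extended.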
Extend hits one base at a time
• BLAST then extends the matches in both directions, starting at the seed. This ungapped extension grows the initial seed match of length W in each direction in an attempt to raise the alignment score. Indels are not considered at this stage.
• In the last stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user.
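The ungapped extension stage can be sketched as below. The scoring scheme (+1 for a match, -1 for a mismatch) and the drop-off threshold `x_drop` are assumptions chosen to keep the example small; protein BLAST scores residue pairs with a substitution matrix. The function name `extend_ungapped` and the example subject sequence are hypothetical.

```python
def extend_ungapped(query, subject, q_start, s_start, word_len,
                    match=1, mismatch=-1, x_drop=3):
    """Extend a seed in both directions without gaps.

    Returns (score, q_lo, q_hi, s_lo, s_hi) with half-open coordinates.
    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen so far.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    # Score of the seed itself.
    best = sum(pair_score(query[q_start + i], subject[s_start + i])
               for i in range(word_len))

    # Extend to the right of the seed.
    running, best_right, step = best, 0, 0
    q_pos, s_pos = q_start + word_len, s_start + word_len
    while q_pos + step < len(query) and s_pos + step < len(subject):
        running += pair_score(query[q_pos + step], subject[s_pos + step])
        step += 1
        if running > best:
            best, best_right = running, step
        elif best - running > x_drop:
            break

    # Extend to the left, continuing from the best score reached so far.
    running, best_left, step = best, 0, 0
    while q_start - step - 1 >= 0 and s_start - step - 1 >= 0:
        running += pair_score(query[q_start - step - 1],
                              subject[s_start - step - 1])
        step += 1
        if running > best:
            best, best_left = running, step
        elif best - running > x_drop:
            break

    return (best,
            q_start - best_left, q_start + word_len + best_right,
            s_start - best_left, s_start + word_len + best_right)


query = "MEAAVKEEISV"
subject = "TTEAAVKEEISLTT"  # hypothetical database sequence
seed = "AAV"
result = extend_ungapped(query, subject,
                         query.find(seed), subject.find(seed), len(seed))
score, q_lo, q_hi, s_lo, s_hi = result
print(score, query[q_lo:q_hi], subject[s_lo:s_hi])
```

Only the best-scoring stretch is kept, so trailing mismatches at either end are trimmed from the reported segment.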
Why statistics??
BLAST: example of result
Job Title: P14867|GBRA1_HUMAN Gamma-aminobutyric acid...
BLASTP 2.2.18 (Mar-02-2008) protein-protein BLAST
Database: Non-redundant SwissProt sequences; 309,621 sequences; 115,465,120 total letters
Query= P14867|GBRA1_HUMAN Gamma-aminobutyric acid receptor subunit alpha-1 - Homo sapiens (Human). Length=456
Sequences producing significant alignments: Score (Bits), E Value
sp|P14867.3|GBRA1_HUMAN Gamma-aminobutyric acid receptor subu... 948 0.0
sp|Q5R6B2.1|GBRA1_PONPY Gamma-aminobutyric acid receptor subu... 944 0.0
sp|Q4R534.1|GBRA1_MACFA Gamma-aminobutyric acid receptor subu... 944 0.0
sp|P08219.1|GBRA1_BOVIN Gamma-aminobutyric acid receptor subu... 939 0.0
sp|P62813.1|GBRA1_RAT Gamma-aminobutyric acid receptor subuni... 908 0.0
sp|P19150.1|GBRA1_CHICK Gamma-aminobutyric acid receptor subu... 882 0.0
sp|P47869.2|GBRA2_HUMAN Gamma-aminobutyric acid receptor subu... 670 0.0
sp|P26048.1|GBRA2_MOUSE Gamma-aminobutyric acid receptor subu... 669 0.0
sp|P23576.1|GBRA2_RAT Gamma-aminobutyric acid receptor subuni... 669 0.0
sp|P10063.1|GBRA2_BOVIN Gamma-aminobutyric acid receptor subu... 667 0.0
sp|Q08E50.1|GBRA5_BOVIN Gamma-aminobutyric acid receptor subu... 641 0.0
sp|Q8BHJ7.1|GBRA5_MOUSE Gamma-aminobutyric acid receptor subu... 640 0.0
sp|P31644.1|GBRA5_HUMAN Gamma-aminobutyric acid receptor subu... 638 0.0
sp|P19969.1|GBRA5_RAT Gamma-aminobutyric acid receptor subuni... 636 0.0
sp|P34903.1|GBRA3_HUMAN Gamma-aminobutyric acid receptor subu... 632 0.0
sp|P26049.1|GBRA3_MOUSE Gamma-aminobutyric acid receptor subu... 630 6e-180
sp|P10064.1|GBRA3_BOVIN Gamma-aminobutyric acid receptor subu... 628 2e-179
sp|P20236.1|GBRA3_RAT Gamma-aminobutyric acid receptor subuni... 627 3e-179
sp|P30191.1|GBRA6_RAT Gamma-aminobutyric acid receptor subuni... 520 6e-147
sp|P16305.2|GBRA6_MOUSE Gamma-aminobutyric acid receptor subu... 518 2e-146
sp|Q90845.1|GBRA6_CHICK Gamma-aminobutyric acid receptor subu... 518 3e-146
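The link between the bit score and the E value in a result like this can be checked with the standard relation E ≈ m · n · 2^(−S′), where S′ is the bit score, m the query length, and n the database size in letters. The sketch below plugs in the raw lengths from the report as an assumption; BLAST uses length-adjusted (effective) values, so the figure it reports for the 630-bit hit (6e-180) is somewhat smaller. The function name `expected_hits` is hypothetical.

```python
def expected_hits(bit_score, query_len, db_letters):
    """Approximate E value: chance hits expected with at least this bit score."""
    return query_len * db_letters * 2.0 ** (-bit_score)


# Values read from the report: query length 456, database of 115,465,120 letters.
e_value = expected_hits(630, 456, 115_465_120)
print(f"{e_value:.1e}")
```

Because the E value shrinks by half for every additional bit of score, small differences in score between hits translate into large differences in significance.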
BLAST is approximate but fast
• BLAST performs similarity searches very quickly, but it also makes errors:
– it misses some important similarities
– it reports many incorrect matches
• The NCBI BLAST web server lets you compare your query sequence against the sequences stored in GenBank.
• The server runs on a VERY fast and powerful computer.
• Its speed and relatively good accuracy are the main reasons why BLAST is the most popular bioinformatics search tool.
InterPro Database Protein
Functional Analysis
Jennifer McDowall, Ph.D. Senior InterPro Curator
EBI Sequence Databases
EMBL (GenBank, DDBJ): nucleotide sequence, e.g. CGCGCCTGTACGCTGAACGCTCGTGACGTGTAGTGCGCG
→ translate → UniProtKB TrEMBL: protein sequence (>7M)
→ manual annotation → UniProtKB Swiss-Prot (>400,000)
InterPro: protein signatures for protein annotation; groups of related proteins (same family or share domains)
UniProtKB: InterPro signature matches
UniProt/SwissProt proteins: ~400,000, of which ~370,000 have InterPro signature matches
UniProt/TrEMBL proteins: >7M, of which >5.3M have InterPro signature matches
UniMES metagenomic proteins: >6M (available 2009)
InterPro: ~80% protein coverage
What are protein signatures?
Multiple sequence alignment
• A signature describes the pattern of a set of conserved residues in a group of proteins
➢ Define a protein family
➢ Define a protein feature (domain or conserved site)
What value are signatures?
• More sensitive homology searches
➢ Find more distant homologues than BLAST
• Classification of proteins
➢ Associate proteins that share: function, domains, sequence, structure
• Annotation of protein sequences
➢ Define conserved regions of a protein, e.g. location and type of domains, key structural or functional sites
• Transfer additional (automatic) annotation
➢ Associate TrEMBL proteins with well-annotated SwissProt proteins and transfer annotation
Signature methods
• Pattern
• Fingerprint
• Sequence clustering
• HMM
• SAM
Patterns
Pattern/motif in sequence → regular expression
Can define important sites:
• Enzyme catalytic site
• Prosthetic group attachment
• Metal ion binding site
• Cysteines for disulphide bonds
• Protein or molecule binding
EXAMPLE: PS00262 Insulin family signature
B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQ CCTSICSLYQLENYC N
Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Patterns – understanding a regular expression
C - C - {P} - x(2) - C - [STDNEKPI] - x(3) - [LIVMFS] - x(3) - C
• A single letter (e.g. C) is a strictly conserved site; only one amino acid is accepted at this position
• Curly brackets list the amino acids that cannot occur at a single position
• x denotes that any amino acid can occur at a single position
• x(2) therefore means that any amino acid can occur at each of the next two positions
• Square brackets list the range of amino acids that can occur at a single position
• There are dashes between each position
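These rules map almost one-to-one onto the regular expression syntax of most programming languages. The following sketch converts the insulin family pattern into a Python regular expression and searches the human preproinsulin sequence shown earlier. It covers only the elements used in this pattern (single residues, `x`, `[...]`, `{...}` and repeat counts); other parts of the pattern syntax, such as sequence-end anchors, are not handled. The function name `prosite_to_regex` is hypothetical.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden}, plus optional (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a dash-separated pattern into Python regular expression syntax."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            regex = "."                        # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # anything except these residues
        else:
            regex = core                       # single residue or [allowed set]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


PS00262 = "C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C"
INSULIN = ("MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQV"
           "ELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN")

regex = prosite_to_regex(PS00262)
match = re.search(regex, INSULIN)
print(regex)
print(match.group() if match else "no match")
```

The matched segment lies in the A chain, which is where the pattern's four cysteines sit.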
Patterns
1) Sequence alignment
2) Define pattern (e.g. insulin family motif)
3) Extract pattern sequences (xxxxxx xxxxxx xxxxxx xxxxxx)
4) Build regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
5) Pattern signature (PS00000)
Fingerprints
Several motifs → characterise family
Different combinations of motifs describe subfamilies
Identify small conserved regions in divergent proteins
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
Highlighted sites: His phosphorylation site, Ser phosphorylation site, conserved site
3-motif fingerprint:
1) GIHARPATLLVQTASKF
2) KGKSVNLKSIMGVMSL
3) LGVGQGSDVTITVDGADE
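A fingerprint hit requires the motifs to occur in the right order along the sequence. The sketch below checks this for the three HPr motifs using exact string matching, which is a simplification made for illustration: real fingerprint searches score each motif against residue frequencies rather than requiring identical text. The function name `match_fingerprint` is hypothetical.

```python
def match_fingerprint(sequence, motifs):
    """Return the start offset of each motif if all occur in order, else None."""
    positions = []
    search_from = 0
    for motif in motifs:
        pos = sequence.find(motif, search_from)
        if pos == -1:
            return None
        positions.append(pos)
        # The next motif must start after this one starts; adjacent motifs may overlap.
        search_from = pos + 1
    return positions


PTHP_ENTFA = ("MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSL"
              "GVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE")
MOTIFS = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]

positions = match_fingerprint(PTHP_ENTFA, MOTIFS)
print(positions)
```

Presenting the same motifs in the wrong order gives no hit, which is what distinguishes a fingerprint from three independent motif searches.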
Fingerprints
1) Sequence alignment
2) Define motifs (His phosphorylation site, Ser phosphorylation site, conserved site)
3) Extract motif sequences (xxxxxx xxxxxx xxxxxx xxxxxx for each motif)
4) Fingerprint signature (PR00000): motifs 1, 2, 3 in the correct order and with the correct spacing
Sequence clustering
Automatic clustering of homologous domains
**Rarely covers entire domain (conserved core)
**Signature size can change with release
1) Known domain families
2) PSI-BLAST: recruit homologous domains
3) MKDOM2: automatic clustering
4) ProDomAlign: align domain families
Hidden Markov Models (HMM)
Can characterise protein over entire length
Models conserved and divergent regions (position-specific scoring)
Models insertions and deletions
➢ Outperform other signature methods in sensitivity and specificity
➢ More flexible (can use partial alignments)
Hidden Markov Models (HMM)
Sequence alignment (Sequence 1 to Sequence 7)
→ Profile: scoring matrix (residue frequency at each position in alignment)
→ HMM: Bayesian statistics, probability scoring
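The step from alignment to profile can be sketched as a per-column residue frequency table. The alignment below is a made-up toy example, and the scoring function (a log-odds sum against a uniform background of 1/20 with a small pseudocount) is one simple choice assumed for illustration; gaps are not handled. The names `column_frequencies` and `log_odds_score` are hypothetical.

```python
import math
from collections import Counter


def column_frequencies(alignment):
    """Residue frequency at each position of an ungapped alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def log_odds_score(profile, sequence, background=0.05, pseudo=0.01):
    """Sum of log2(observed frequency / background) over all positions."""
    return sum(math.log2((column.get(res, 0.0) + pseudo) / background)
               for column, res in zip(profile, sequence))


alignment = ["GIHAR", "GIHSR", "GLHAR", "GIHAK"]  # hypothetical toy alignment
profile = column_frequencies(alignment)

print(profile[1])
print(log_odds_score(profile, "GIHAR") > log_odds_score(profile, "PPPPP"))
```

Strictly conserved columns contribute the most to the score, while variable columns tolerate substitutions that the alignment has already seen.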
Hidden Markov Models (HMM)
Alignment of Sequence 1 to Sequence 7, modelled position by position:
M = match state: M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
I = insert state: I1 I2 I3 I4 I5 I6 I7 I8 I9
D = delete state: D2 D3 D4 D5 D6 D7 D8 D9
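The way match, insert and delete states connect can be written out as a list of allowed transitions. The layout below follows a common profile-HMM arrangement and is an assumption: it reproduces the state names on the slide (M1 to M10, I1 to I9, D2 to D9) but not necessarily the exact wiring of any particular software package, and it carries no probabilities. The function name `profile_hmm_transitions` is hypothetical.

```python
def profile_hmm_transitions(n_match):
    """List the allowed (from_state, to_state) pairs for n_match match states."""
    edges = []
    for k in range(1, n_match):
        edges.append((f"M{k}", f"M{k + 1}"))   # residue matches the next column
        edges.append((f"M{k}", f"I{k}"))       # extra residue inserted after column k
        edges.append((f"I{k}", f"I{k}"))       # insertion continues
        edges.append((f"I{k}", f"M{k + 1}"))   # insertion ends
        if k + 1 < n_match:
            edges.append((f"M{k}", f"D{k + 1}"))          # column k+1 is skipped
            edges.append((f"D{k + 1}", f"M{k + 2}"))      # deletion ends
            if k + 2 < n_match:
                edges.append((f"D{k + 1}", f"D{k + 2}"))  # deletion continues
    return edges


edges = profile_hmm_transitions(10)
states = {state for edge in edges for state in edge}
print(sorted(s for s in states if s.startswith("D")))
```

A path through match states only corresponds to a sequence that aligns to every column; detours through I or D states account for insertions and deletions.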
Hidden Markov Models (HMM)
HMM databases:
• PIR SUPERFAMILY, PANTHER, TIGRFAM: families conserved in sequence
• PFAM, SMART: domains conserved in sequence
• SUPERFAMILY, GENE3D: domains conserved in structure
SAM Profile HMMs
Homologous structural superfamilies
Start with single seed sequence
Proteins in superfamily may have low sequence identity
Few proteins in family have PDB structures
Create 1 model for every protein in superfamily → combine results
SAM Profile models
T99 script:
Single seed sequence (GIHARPATLLVQTASKF) → WU-BLASTP search → close homologues → initial model → low identity matches → new larger alignment → final model
Signature Methods
• Pattern
• Fingerprint
• Sequence clustering
• HMM
• SAM
Describe protein features: active sites, binding sites…
Describe families and sibling subfamilies
Predict conserved domains
Functional classification of families
Functional domain annotation
Structural domain annotation
Comprehensive annotation
InterPro removes redundancy
Domain annotation: SWIB/MDM2 domain, RanBP2-type zinc finger, RING-type zinc finger
Annotate features: conserved site within zinc finger
Family classification: Mdm2/Mdm4 family (parent), Mdm4 subfamily (child)
Domain Boundaries
Gene3D (and SSF) determines domain structural boundaries
Pfam trims domains to regions of good sequence conservation
ProDom displays shortest conserved sequence
![Page 53: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/53.jpg)
Fragmented Signatures
4) Non-contiguous domains
3) Repeated elements
2) Duplicated domains
1) Signature method
![Page 54: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/54.jpg)
Fragmented Signatures
• e.g. PRINTS – discrete motifs 1) Signature method
3) Repeated elements
2) Duplicated domains
4) Non-contiguous domains
![Page 55: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/55.jpg)
Fragmented Signatures
1) Signature method
2) Duplicated domains
3) Repeated elements
4) Non-contiguous domains
• e.g. SSF - duplication consisting of 2 domains with same fold
![Page 56: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/56.jpg)
Fragmented Signatures
1) Signature method
2) Duplicated domains
3) Repeated elements
• e.g. Kringle, WD40
4) Non-contiguous domains
![Page 57: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/57.jpg)
Fragmented Signatures
1) Signature method
2) Duplicated domains
3) Repeats
4) Non-contiguous domains
• Structural domains can consist of non-contiguous sequence
![Page 58: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/58.jpg)
Fragmented Signatures
1) Signature method
2) Duplicated domains
3) Repeats
4) Non-contiguous domains
![Page 59: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/59.jpg)
Complementary Annotation
• Sequence-based signature (Pfam) shows that the domain is made up of repeating sequence elements
Beta-propeller repeat
• Structure-based signature (SSF) shows boundaries of structural domain
7-blade beta-propeller
![Page 60: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/60.jpg)
Complementary Annotation
Pfam shows the domain is composed of two types of repeated sequence motifs
SUPERFAMILY shows the potential domain boundaries
![Page 61: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/61.jpg)
Complementary Annotation
Gene3D shows that these domains share homologous structure
Pfam/SMART show 2 domains from distinct sequence families
![Page 62: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/62.jpg)
Functional annotation with Blast2GO
![Page 63: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/63.jpg)
Index
• Why Blast2GO?
• Concepts on functional annotations
– Function assignment
– Vocabularies
– GO and GOA
– The GO Directed Acyclic Graph
– The problem of function transfer
• Blast2GO Java application + Practicals
• Blast2GO @ Babelomics + Practicals
![Page 64: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/64.jpg)
Why Blast2GO? Workflow analysis
Experiment
Data-Analysis
Gene-List: MNAT1 CTNNBL1 ENOX2 GTPBP1 RALY TAGLN2 RAB3A PPP2R5A MAPRE1 ...
Functional interpretation
Functional Profiling + Functional Annotation
![Page 65: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/65.jpg)
What does Blast2GO do?
• Generates annotations
• Visualization of functional annotations
![Page 66: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/66.jpg)
Concepts of functional annotation
• Gene/Protein function
– Refers to the molecular function of a gene or a protein, e.g. tyrosine kinase
• Functional annotation
– More general: refers to the characterization of functional aspects of the protein, e.g. stress-related, cytoplasm, ABC transporter
– Also refers to the process of assigning a function label
• Usually, standard vocabularies are used to assign function
![Page 67: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/67.jpg)
Functional Vocabularies
Molecular Function
Biological Process
Cellular Component
Metabolic pathways
KEGG orthologues
Functional motifs
![Page 68: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/68.jpg)
The Gene Ontology
• Project developed by the Gene Ontology Consortium
• Provides a controlled vocabulary to describe gene and gene product attributes in any organism
• Includes both the development of the Ontology and the maintenance of a Database of annotations
![Page 69: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/69.jpg)
The Ontology
• Annotations are given to the most specific (low) level.
• True path rule: annotation at a term implies annotation to all its parent terms
• Annotation is given with an Evidence Code:
– IDA: inferred from direct assay
– TAS: traceable author statement
– ISS: inferred from sequence similarity
– IEA: inferred from electronic annotation
– ...
More general
More specific
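The true path rule can be illustrated with a small sketch. The ontology below is a made-up toy DAG (the term names `root`, `A`, `B`, `C`, `D` and the helper names `ancestors` and `propagate` are assumptions for illustration, not part of GO or of any GO library); it only shows how a direct annotation at a specific term implies annotation at every more general term.

```python
from typing import Dict, Set

# Hypothetical toy ontology: each term maps to its direct parents.
# "D" has two parents, which is what makes this a DAG rather than a tree.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "A": {"root"},
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable by following parent links from `term`."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_terms: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Apply the true path rule: direct annotations plus all their ancestors."""
    implied = set(direct_terms)
    for term in direct_terms:
        implied |= ancestors(term, parents)
    return implied


# A gene annotated directly at "D" is implicitly annotated at B, C, A and root.
implied_terms = propagate({"D"}, PARENTS)
```

Storing only the most specific term and deriving the rest on demand keeps the annotation set small while still letting queries at general terms find the gene.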
![Page 70: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/70.jpg)
The GO has a DAG structure
![Page 71: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/71.jpg)
The Gene Ontology Database (GOA)
• There is a collaborating institution per organism to provide annotations
• Most of the GOA annotations come from UniProt
• Most of the annotations are electronic annotations
http://www.geneontology.org/GO.current.annotations.shtml
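The claim that most annotations are electronic can be checked on any annotation file. The sketch below assumes the tab-separated GO Annotation File (GAF) layout, where comment lines start with `!` and the evidence code is the seventh column; the sample rows and identifiers are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def evidence_code_counts(gaf_lines: Iterable[str]) -> Counter:
    """Count evidence codes in GAF-style lines (evidence code = 7th column)."""
    counts: Counter = Counter()
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip malformed rows
        counts[fields[6]] += 1
    return counts


def electronic_fraction(counts: Counter) -> float:
    """Fraction of annotations carrying the IEA evidence code."""
    total = sum(counts.values())
    return counts["IEA"] / total if total else 0.0


# Invented sample rows (database, object ID, symbol, qualifier, GO ID,
# reference, evidence code); real files carry further columns.
sample = [
    "!gaf-version: 2.2",
    "\t".join(["DB", "X0001", "geneA", "", "GO:0000001", "REF:1", "IEA"]),
    "\t".join(["DB", "X0002", "geneB", "", "GO:0000002", "REF:2", "IEA"]),
    "\t".join(["DB", "X0003", "geneC", "", "GO:0000003", "REF:3", "IDA"]),
    "\t".join(["DB", "X0004", "geneD", "", "GO:0000004", "REF:4", "IEA"]),
]
counts = evidence_code_counts(sample)
fraction = electronic_fraction(counts)
```

The same two functions can be pointed at an open file handle of a downloaded annotation file, since a file object is itself an iterable of lines.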
![Page 72: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/72.jpg)
InterPro
• Collection of databases with functional annotation of protein motifs
• Functional vocabulary at UniProt
• There is an equivalence table between GO and InterPro
http://www.ebi.ac.uk/interpro/databases.html
![Page 73: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/73.jpg)
Functional assignment
Annotation
Empirical
Transference
Molecular interactions
Gene/protein expression
Biochemical assay
Structure Comparison
Sequence analysis
Identification of folds
Motif identification
Phylogeny
Literature reference
Sequence homology
![Page 74: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/74.jpg)
Automatic annotation
• GO annotations can be created by comparison to annotated sequences
• To achieve enough coverage, high-throughput, automatic annotation is required
• The most effective (but also error-prone) automatic annotation method is transfer based on sequence similarity
![Page 75: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/75.jpg)
Concerns in functional transfer by similarity
• Level of homology (transfer is possible from roughly 40-60%)
• The overlap between query and hit sequences (not much of a problem)
• The domain or structure-function association
• The paralog problem: genes with similar sequences might have different functional specifications
• The evidence for the original annotation
• Balance between quality and quantity: depends on the use
HIT: GO1, GO2, GO3, GO4
QUERY: GO1, GO2, GO3, GO4
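The concerns listed above can be turned into explicit filters before any GO term is copied from a hit to the query. The sketch below is only an illustration: the `Hit` record, its field names, and the default thresholds are assumptions chosen for the example, not values taken from the slides or from Blast2GO.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass
class Hit:
    """Hypothetical summary of one similarity hit and its annotation."""
    hit_id: str
    percent_similarity: float   # similarity over the aligned region
    overlap_length: int         # aligned positions
    hit_length: int             # full length of the hit sequence
    go_terms: Set[str]
    evidence_code: str          # evidence behind the hit's own annotation


def transferable_terms(
    hits: Iterable[Hit],
    min_similarity: float = 60.0,
    min_hit_coverage: float = 0.5,
    trusted_codes: FrozenSet[str] = frozenset({"IDA", "TAS"}),
) -> Set[str]:
    """Collect GO terms only from hits that pass every filter."""
    terms: Set[str] = set()
    for hit in hits:
        coverage = hit.overlap_length / hit.hit_length
        if (
            hit.percent_similarity >= min_similarity
            and coverage >= min_hit_coverage
            and hit.evidence_code in trusted_codes
        ):
            terms |= hit.go_terms
    return terms


example_hits = [
    Hit("h1", 82.0, 300, 320, {"GO1", "GO2"}, "IDA"),  # passes all filters
    Hit("h2", 45.0, 300, 320, {"GO3"}, "IDA"),         # similarity too low
    Hit("h3", 90.0, 50, 400, {"GO4"}, "TAS"),          # overlap too short
    Hit("h4", 88.0, 310, 320, {"GO5"}, "IEA"),         # electronic evidence only
]
strict = transferable_terms(example_hits)
```

Loosening `trusted_codes` or `min_similarity` trades quality for quantity, which is the balance the last bullet refers to.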
![Page 76: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/76.jpg)
Blast2GO
• Suite for functional annotation and data mining on functional data
– Considerations for annotation
  • Similarity
  • Length of the overlap
  • Percentage of hit sequence spanned by the overlap
  • Evidence of the original annotation
  • Blast hits and motif hits
  • Refinement by additional methods
– Visualization:
  • Annotation charts
  • Knowledge discovery on the DAG
• Desktop Java application
• Web interface @ Babelomics: Babelomics for non-model
![Page 77: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/77.jpg)
Blast2GO Annotation strategy
[Diagram: query sequences Sq1-Sq4 → Blast → list of hits (Hit1, Hit2, ...) for each sequence → Mapping → candidate GO terms for each hit (e.g. go1, go2, go3) → Annotation]
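The Blast, Mapping, Annotation flow can be sketched with plain dictionaries. This is a deliberately simplified stand-in: the selection rule used here (keep a GO term when at least `min_support` hits carry it) is an assumption for illustration and is not the annotation rule Blast2GO applies; the function names and sample data are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def map_hits_to_go(
    blast_hits: Dict[str, List[str]],
    hit_to_go: Dict[str, Set[str]],
) -> Dict[str, List[Set[str]]]:
    """Mapping step: replace each hit by the GO terms annotated to it."""
    return {
        seq: [hit_to_go.get(hit, set()) for hit in hits]
        for seq, hits in blast_hits.items()
    }


def annotate(
    candidates: Dict[str, List[Set[str]]],
    min_support: int = 2,
) -> Dict[str, Set[str]]:
    """Annotation step (simplified): keep terms supported by enough hits."""
    annotation: Dict[str, Set[str]] = {}
    for seq, term_sets in candidates.items():
        support = Counter(term for terms in term_sets for term in terms)
        annotation[seq] = {t for t, n in support.items() if n >= min_support}
    return annotation


# Output of the Blast step: hits found for each query sequence (made up).
blast_hits = {"Sq1": ["HitA", "HitB", "HitC"], "Sq2": ["HitD"]}
# GO terms known for each hit (made up).
hit_to_go = {
    "HitA": {"go1", "go2"},
    "HitB": {"go1", "go3"},
    "HitC": {"go1", "go3"},
    "HitD": {"go2"},
}
candidates = map_hits_to_go(blast_hits, hit_to_go)
result = annotate(candidates)
```

Keeping mapping and annotation as separate functions mirrors the diagram: the candidate pool can be inspected before any selection rule is applied.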
![Page 78: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/78.jpg)
Blast2GO Annotation Strategy
[Diagram: the candidate GO terms mapped to Sq1-Sq4 are reduced to final annotations and then refined]
Refinement: InterPro, Annex, GOSlim, Manual
Sq1: go1, go2, go3, GO11
Sq2: go8, GO12, GO13
Sq3: go2, go4
Sq4: GO15
![Page 79: Aim$and$Outline$of$the$course · 2014. 7. 10. · Aim$and$Outline$of$the$courseDb$datamining$ $ Db$tools$can$be$used$to$retrieve$informaon$for$agene$or$protein$ $ The$mostimportantconceptis$the$](https://reader034.vdocuments.us/reader034/viewer/2022051914/60054a55e222140fb75b9edc/html5/thumbnails/79.jpg)
B2G Highlighting on the DAG: The B2G Score
• Coloring strategy to highlight regions in the DAG where the most interesting information is concentrated
• The confluence score (B2G score) keeps a balance between the number of annotated sequences at one node and the distance to the origin of annotation
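One way to make that balance concrete is a score that sums the sequences annotated at a node and at its descendants, discounting each descendant by a factor `alpha` raised to its distance from the node. The slides do not give the formula, so the form below, the parameter `alpha`, and the toy DAG are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List


def confluence_score(
    node: str,
    children: Dict[str, List[str]],
    seq_counts: Dict[str, int],
    alpha: float = 0.6,
) -> float:
    """Sum of seq_counts[t] * alpha**distance over `node` and its descendants.

    Distance is the smallest number of edges from `node` down to the term,
    found with a breadth-first search over the child links.
    """
    distances = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in distances:
                distances[child] = distances[current] + 1
                queue.append(child)
    return sum(seq_counts.get(t, 0) * alpha ** d for t, d in distances.items())


# Toy DAG: term -> direct children. "C" is reachable through both A and B.
children = {"root": ["A", "B"], "A": ["C"], "B": ["C"]}
# Number of sequences directly annotated at each term (made up).
seq_counts = {"A": 2, "C": 4}

score_root = confluence_score("root", children, seq_counts, alpha=0.5)
score_a = confluence_score("A", children, seq_counts, alpha=0.5)
```

With `alpha` below 1, a node close to where the annotations actually sit scores higher than a distant general node, so coloring by this value highlights the informative regions of the DAG rather than the root.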