Aim and outline of the course · 2014. 7. 10.
Aim and Outline of the course Db data mining
Db tools can be used to retrieve information for a gene or protein. The most important concept is similarity.
BLAST
InterPro
Blast2GO
Sequence Alignments
BLAST (Basic Local Alignment Search Tool)
• One of the tools of the NCBI - The U.S. National Center for Biotechnology Information.
• Uses word matching like FASTA
• Similarity matching of words (3 AA’s, 11 bases/nucleotides) – does not require identical words.
• If no words are similar, then there is no alignment
– won’t find matches for very short sequences
BLAST word matching
Break query into words:
MEAAVKEEISVEDEAVDKNI
MEA EAA AAV AVK VKE KEE EEI EIS ISV ...
Break database sequences into words:
Database sequence word lists:
RTT AAQ SDG KSS SRW LLN QEL RWY VKI GKG DKI NIS LFC WDV AAV KVR PFR DEI
Query word list:
MEA EAA AAV AVK VKL KEE EEI EIS ISV
Compare the word lists by hashing, and allow near matches!
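A minimal sketch of the word-list comparison, assuming exact word matches only (real BLAST also accepts near matches scored with a substitution matrix, which is left out here). The function names and the two short database sequences are made up for illustration; only the query comes from the slide.

```python
from collections import defaultdict

def words(seq, w=3):
    """Return every overlapping word of length w with its start position."""
    return [(seq[i:i + w], i) for i in range(len(seq) - w + 1)]

def build_index(db, w=3):
    """Hash each database word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in db.items():
        for word, pos in words(seq, w):
            index[word].append((seq_id, pos))
    return index

def find_seeds(query, index, w=3):
    """Look up each query word in the hash index (exact matches only)."""
    return [(word, qpos, seq_id, dpos)
            for word, qpos in words(query, w)
            for seq_id, dpos in index.get(word, [])]

query = "MEAAVKEEISVEDEAVDKNI"
# Hypothetical database sequences, not taken from the slides.
db = {"db1": "TMEATTDVRW", "db2": "GKRAAVKLVA"}
seeds = find_seeds(query, build_index(db))
for seed in seeds:
    print(seed)
```

Because the database words are hashed once, each query word is found with a single dictionary lookup instead of a scan over every database sequence.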
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
MEA EAA AAV AVK KLV KEE EEI EIS ISV
Find locations of matching words in all sequences
Extend hits one base at a time
• Then BLAST extends the matches in both directions, starting at the seed. The un-gapped alignment process extends the initial seed match of length W in each direction in an attempt to boost the alignment score. Indels are not considered during this stage.
• In the last stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user.
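The un-gapped extension step can be sketched as follows. This is an illustration under stated assumptions, not BLAST's actual implementation: the scoring is a simple +1/-1 match/mismatch scheme instead of a substitution matrix, and the `x_drop` cut-off value and the example sequences are made up.

```python
def extend_ungapped(query, subject, q_start, s_start, w,
                    match=1, mismatch=-1, x_drop=2):
    """Extend a seed of length w in both directions without gaps.

    Extension in a direction stops when the running score falls more than
    x_drop below the best score seen so far. Returns the best score and the
    half-open (start, end) ranges in query and subject.
    """
    def score(a, b):
        return match if a == b else mismatch

    best = sum(score(query[q_start + k], subject[s_start + k]) for k in range(w))
    q_end, s_end = q_start + w, s_start + w

    # Extend to the right of the seed.
    running, i, j = best, q_end, s_end
    while i < len(query) and j < len(subject):
        running += score(query[i], subject[j])
        i, j = i + 1, j + 1
        if running > best:
            best, q_end, s_end = running, i, j
        elif best - running > x_drop:
            break

    # Extend to the left of the seed, starting from the best score so far.
    running, i, j = best, q_start - 1, s_start - 1
    while i >= 0 and j >= 0:
        running += score(query[i], subject[j])
        if running > best:
            best, q_start, s_start = running, i, j
        elif best - running > x_drop:
            break
        i, j = i - 1, j - 1

    return best, (q_start, q_end), (s_start, s_end)

query = "MEAAVKEEISV"
subject = "TTEAAVKEELSVW"   # hypothetical database sequence
# The seed word AAV sits at query position 2 and subject position 3.
result = extend_ungapped(query, subject, 2, 3, 3)
print(result)
```

The single I/L mismatch is tolerated because the score recovers before dropping by more than `x_drop`, so the extension covers `EAAVKEEISV` against `EAAVKEELSV`.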
Why statistics??
BLAST: example of result
Job Title: P14867|GBRA1_HUMAN Gamma-aminobutyric acid...
BLASTP 2.2.18 (Mar-02-2008), protein-protein BLAST
Database: Non-redundant SwissProt sequences; 309,621 sequences; 115,465,120 total letters
Query= P14867|GBRA1_HUMAN Gamma-aminobutyric acid receptor subunit alpha-1 - Homo sapiens (Human). Length=456

Sequences producing significant alignments:                        Score (Bits)  E Value
sp|P14867.3|GBRA1_HUMAN Gamma-aminobutyric acid receptor subu...   948   0.0
sp|Q5R6B2.1|GBRA1_PONPY Gamma-aminobutyric acid receptor subu...   944   0.0
sp|Q4R534.1|GBRA1_MACFA Gamma-aminobutyric acid receptor subu...   944   0.0
sp|P08219.1|GBRA1_BOVIN Gamma-aminobutyric acid receptor subu...   939   0.0
sp|P62813.1|GBRA1_RAT Gamma-aminobutyric acid receptor subuni...   908   0.0
sp|P19150.1|GBRA1_CHICK Gamma-aminobutyric acid receptor subu...   882   0.0
sp|P47869.2|GBRA2_HUMAN Gamma-aminobutyric acid receptor subu...   670   0.0
sp|P26048.1|GBRA2_MOUSE Gamma-aminobutyric acid receptor subu...   669   0.0
sp|P23576.1|GBRA2_RAT Gamma-aminobutyric acid receptor subuni...   669   0.0
sp|P10063.1|GBRA2_BOVIN Gamma-aminobutyric acid receptor subu...   667   0.0
sp|Q08E50.1|GBRA5_BOVIN Gamma-aminobutyric acid receptor subu...   641   0.0
sp|Q8BHJ7.1|GBRA5_MOUSE Gamma-aminobutyric acid receptor subu...   640   0.0
sp|P31644.1|GBRA5_HUMAN Gamma-aminobutyric acid receptor subu...   638   0.0
sp|P19969.1|GBRA5_RAT Gamma-aminobutyric acid receptor subuni...   636   0.0
sp|P34903.1|GBRA3_HUMAN Gamma-aminobutyric acid receptor subu...   632   0.0
sp|P26049.1|GBRA3_MOUSE Gamma-aminobutyric acid receptor subu...   630   6e-180
sp|P10064.1|GBRA3_BOVIN Gamma-aminobutyric acid receptor subu...   628   2e-179
sp|P20236.1|GBRA3_RAT Gamma-aminobutyric acid receptor subuni...   627   3e-179
sp|P30191.1|GBRA6_RAT Gamma-aminobutyric acid receptor subuni...   520   6e-147
sp|P16305.2|GBRA6_MOUSE Gamma-aminobutyric acid receptor subu...   518   2e-146
sp|Q90845.1|GBRA6_CHICK Gamma-aminobutyric acid receptor subu...   518   3e-146
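Each hit in the listing carries a bit score and an E-value. A rough sketch of how the two relate, assuming the standard relation E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the database size. The function name is illustrative, and the raw lengths from the listing are used rather than the length-corrected values BLAST works with, so the result only approximates the reported number.

```python
def expected_hits(bit_score, query_len, db_letters):
    """Approximate E-value: chance hits expected with at least this bit score."""
    return query_len * db_letters * 2.0 ** (-bit_score)

# Values from the example result: query length 456, database of
# 115,465,120 letters, and the GBRA3_RAT hit with 627 bits (reported 3e-179).
e_value = expected_hits(627, 456, 115_465_120)
print(f"{e_value:.1e}")
```

The estimate lands within an order of magnitude of the reported E-value, which is why the statistics matter: the bit score alone says nothing about how often such a match would turn up by chance in a database of this size.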
BLAST is approximate but fast
• BLAST makes similarity searches very quickly, but also makes errors
– misses some important similarities
– makes many incorrect matches
• The NCBI BLAST web server lets you compare your query sequence to various sequences stored in GenBank
• The server is a VERY fast and powerful computer
• Its speed and relatively good accuracy are the main reasons why BLAST is the most popular bioinformatics search tool
InterPro Database Protein
Functional Analysis
Jennifer McDowall, Ph.D. Senior InterPro Curator
EBI Sequence Databases
• EMBL (GenBank, DDBJ): nucleotide sequence
• translate → UniProtKB TrEMBL: protein sequence (>7M)
• manual annotation → UniProtKB Swiss-Prot (>400,000)
• InterPro: protein signatures, protein annotation
InterPro: groups of related proteins (same family or share domains)
Signature matches in UniProtKB (InterPro ~80% protein coverage)
• UniProt/SwissProt proteins: ~400,000 (InterPro ~370,000)
• UniProt/TrEMBL proteins: >7M (InterPro >5.3M)
• UniMES metagenomic proteins: >6M (available 2009)
What are protein signatures?
Multiple sequence alignment
• A signature describes the pattern of a set of conserved residues in a group of proteins
– Define a protein family
– Define a protein feature (domain or conserved site)
What value are signatures?
• More sensitive homology searches
– Find more distant homologues than BLAST
• Classification of proteins
– Associate proteins that share: function, domains, sequence, structure
• Annotation of protein sequences
– Define conserved regions of a protein, e.g. location and type of domains, key structural or functional sites
• Transfer additional (automatic) annotation
– Associate TrEMBL proteins with well-annotated SwissProt proteins
Signature methods
• Pattern
• Fingerprint
• Sequence clustering
• HMM
• SAM
Patterns
Pattern/motif in sequence → regular expression
Can define important sites:
• Enzyme catalytic site
• Prosthetic group attachment
• Metal ion binding site
• Cysteines for disulphide bonds
• Protein or molecule binding
EXAMPLE: PS00262 Insulin family signature
B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Patterns – understanding a regular expression
C - C - {P} - x(2) - C - [STDNEKPI] - x(3) - [LIVMFS] - x(3) - C
• A single letter is a strictly conserved site; only one amino acid is accepted at this position
• Curly brackets denote amino acids that cannot occur at a single position
• x denotes that any amino acid can occur at a single position
• x(2) therefore means that any amino acid can occur at the next two positions
• Square brackets denote the range of amino acids that can occur at a single position
• There are dashes between each position
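The rules above map directly onto an ordinary regular expression. The sketch below is a hypothetical helper, not a PROSITE tool, and it handles only the elements used in this pattern (single letters, `x`, square and curly brackets, and repeat counts); it converts the insulin family pattern and searches the insulin sequence from the slides.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a dash-separated PROSITE-style pattern into Python regex syntax."""
    pieces = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element.strip())
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            piece = "."                        # any amino acid
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"    # residues that cannot occur
        else:
            piece = core                       # single letter or [allowed set]
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)

insulin = ("MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGG"
           "PGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN")
regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
hit = re.search(regex, insulin)
print(regex)
print(hit.group())
```

The match falls in the A chain region near the C-terminus, which is the segment highlighted on the example slide.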
Patterns
Sequence alignment
Extract pattern sequences: xxxxxx xxxxxx xxxxxx xxxxxx
Define pattern: Insulin family motif
Build regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Pattern signature: PS00000
Fingerprints
Several motifs → characterise family
Different combinations of motifs describe subfamilies
Identify small conserved regions in divergent proteins
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
Annotated regions: His phosphorylation site, Ser phosphorylation site, conserved site
3-motif fingerprint:
1) GIHARPATLLVQTASKF
2) KGKSVNLKSIMGVMSL
3) LGVGQGSDVTITVDGADE
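A fingerprint hit requires the motifs to occur in the right order. The sketch below checks this with exact string matching, which is a simplification for illustration (fingerprint searches normally score motifs rather than requiring identity); the function name is made up.

```python
def locate_motifs(seq, motifs):
    """Return the start of each motif, or None if one is missing or out of order."""
    starts = []
    search_from = 0
    for motif in motifs:
        pos = seq.find(motif, search_from)
        if pos == -1:
            return None
        starts.append(pos)
        search_from = pos + 1   # the next motif must start further along
    return starts

pthp_entfa = ("MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSL"
              "GVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE")
motifs = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]
starts = locate_motifs(pthp_entfa, motifs)
print(starts)
```

The start positions also give the spacing between motifs, the second property a fingerprint match is checked against.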
Fingerprints
Sequence alignment
Define motifs: His phosphorylation site, Ser phosphorylation site, conserved site
Extract motif sequences:
xxxxxx xxxxxx xxxxxx xxxxxx
xxxxxx xxxxxx xxxxxx xxxxxx
xxxxxx xxxxxx xxxxxx xxxxxx
Fingerprint signature: motifs 1, 2, 3 in the correct order with the correct spacing
PR00000
Sequence clustering
Automatic clustering of homologous domains
• Rarely covers entire domain (conserved core)
• Signature size can change with release
Known domain families
• PSI-BLAST: recruit homologous domains
• MKDOM2: automatic clustering
• ProDomAlign: align domain families
Hidden Markov Models (HMM)
Can characterise protein over entire length
Models conserved and divergent regions (position-specific scoring)
Models insertions and deletions
– Outperform in sensitivity and specificity
– More flexible (can use partial alignments)
Sequence alignment (Sequence 1 to Sequence 7) → scoring matrix (residue frequency at each position in alignment) → profile
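A small sketch of the scoring-matrix idea, residue frequency at each position of an alignment. The four aligned sequences are invented for illustration, and a real profile or profile HMM would add pseudocounts, background frequencies and insert/delete states, all of which are left out here.

```python
from collections import Counter

def residue_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column (gaps ignored)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()})
    return profile

def consensus(profile):
    """Most frequent residue at each position ('-' for an all-gap column)."""
    return "".join(max(col, key=col.get) if col else "-" for col in profile)

# Hypothetical alignment, not taken from the slides.
alignment = ["GIHAR", "GLHAK", "GIHSR", "GI-AR"]
profile = residue_frequencies(alignment)
print(consensus(profile))
print(profile[1])
```

Position 1 tolerates either I or L while position 0 is invariant, which is the position-specific information a single pairwise comparison cannot capture.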
Hidden Markov Models (HMM)
Bayesian statistics, probability scoring
States built from the sequence alignment (Sequence 1 to Sequence 7):
• M = match state: M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
• I = insert state: I1 I2 I3 I4 I5 I6 I7 I8 I9
• D = delete state: D2 D3 D4 D5 D6 D7 D8 D9
HMM databases:
• PIR SUPERFAMILY
• PANTHER
• TIGRFAM
• PFAM
• SMART
• SUPERFAMILY
• GENE3D
Domains conserved in sequence
Families conserved in sequence
Domains conserved in structure
SAM Profile HMMs
Homologous structural superfamilies
Start with single seed sequence
Proteins in superfamily may have low sequence identity
Few proteins in family have PDB structures
Create 1 model for every protein in superfamily → combine results
SAM Profile models
T99 script:
• Single seed sequence (GIHARPATLLVQTASKF) → initial model
• WU-BLASTP search: close homologues, low identity matches
• New larger alignment → final model
Signature Methods
• Pattern
• Fingerprint
• Sequence clustering
• HMM
• SAM
What the methods provide:
• Describe protein features: active sites, binding sites…
• Describe families and sibling subfamilies
• Predicts conserved domains
• Functional classification of families
• Functional domain annotation
• Structural domain annotation
Comprehensive annotation
InterPro removes redundancy
• Domain annotation: SWIB/MDM2 domain, RanBP2-type zinc finger, RING-type zinc finger
• Annotate features: conserved site within zinc finger
• Family classification: Mdm2/Mdm4 family (parent), Mdm4 subfamily (child)
Domain Boundaries
Gene3D (and SSF) determines domain structural boundaries
Pfam trims domains to regions of good sequence conservation
ProDom displays shortest conserved sequence
Fragmented Signatures
1) Signature method
• e.g. PRINTS – discrete motifs
2) Duplicated domains
• e.g. SSF - duplication consisting of 2 domains with same fold
3) Repeated elements
• e.g. Kringle, WD40
4) Non-contiguous domains
• Structural domains can consist of non-contiguous sequence
Complementary Annotation
• Sequence-based signature (Pfam) shows that the domain is made up of repeating sequence elements: beta-propeller repeat
• Structure-based signature (SSF) shows boundaries of structural domain: 7-blade beta-propeller
• PFAM shows domain is composed of two types of repeated sequence motifs; SUPERFAMILY shows the potential domain boundaries
• GENE3D shows that these domains share homologous structure; PFAM/SMART show 2 domains from distinct sequence families
Functional annotation with Blast2GO
Index
• Why Blast2GO? Concepts on functional annotations
– Function assignment
– Vocabularies
– GO and GOA
– The GO Directed Acyclic Graph
– The problem of function transfer
• Blast2GO Java application + Practicals
• Blast2GO @ Babelomics + Practicals
Why Blast2GO? Workflow analysis
Experiment → Data-Analysis → Gene-List → Functional interpretation
Gene-List: MNAT1 CTNNBL1 ENOX2 GTPBP1 RALY TAGLN2 RAB3A PPP2R5A MAPRE1 ...
Functional Profiling + Functional Annotation
What does Blast2GO do?
• Generates annotations
• Visualization of functional annotations
Concepts of functional annotation
• Gene/Protein function
– Refers to the molecular function of a gene or a protein: Tyrosine kinase
• Functional annotation
– More general, refers to the characterization of a functional aspect of the protein: stress-related, cytoplasm, ABC transporter
– Also refers to the process of assigning a function label
– Usually, standard vocabularies are used to assign function
Functional Vocabularies
Molecular Function, Biological Process, Cellular Component
Metabolic pathways
KEGG orthologues
Functional motifs
The Gene Ontology
• Project developed by the Gene Ontology Consortium
• Provides a controlled vocabulary to describe gene and gene product attributes in any organism
• Includes both the development of the Ontology and the maintenance of a Database of annotations
The Ontology
• Annotations are given to the most specific (low) level.
• True path rule: annotation at a term implies annotation to all its parent terms
• Annotation is given with an Evidence Code:
– IDA: inferred from direct assay
– TAS: traceable author statement
– ISS: inferred from sequence similarity
– IEA: electronic annotation
– ….
The GO has a DAG structure (terms run from more general to more specific)
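The true path rule can be sketched as a walk up the DAG: every annotation is copied to all ancestors of the annotated term. The term names and gene names below are placeholders, not real GO identifiers, and the `parents` mapping is an assumed representation of the graph.

```python
def ancestors(term, parents):
    """Collect every term reachable by following parent links (a term may have several parents)."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def apply_true_path_rule(annotations, parents):
    """Extend each gene's direct annotations with all implied ancestor terms."""
    return {gene: set(terms).union(*(ancestors(t, parents) for t in terms))
            for gene, terms in annotations.items()}

# Toy DAG: "specific" has two parents, which share the same root.
parents = {
    "specific": ["general_a", "general_b"],
    "general_a": ["root"],
    "general_b": ["root"],
    "root": [],
}
direct = {"gene1": {"specific"}, "gene2": {"general_a"}}
full = apply_true_path_rule(direct, parents)
print(full)
```

The `seen` set is what makes this safe on a DAG: a term reached through two different parents is only visited once.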
The Gene Ontology Database (GOA)
• There is a collaborating institution per organism to provide annotations
• Most of the GOA annotations come from UniProt
• Most of the annotations are electronic annotations
http://www.geneontology.org/GO.current.annotations.shtml
InterPro
• Collection of databases with functional annotation of protein motifs
• Functional vocabulary at UniProt
• There is an equivalence table between GO and InterPro
http://www.ebi.ac.uk/interpro/databases.html
Functional assignment Annotation
Empirical Transference
Molecular interactions
Gene/protein expression
Biochemical assay
Structure Comparison
Sequence analysis
Identification of folds
Motif identification
Phylogeny
Literature reference
Sequence homology
Automatic annotation
• GO annotations can be created by comparison to annotated sequences
• To achieve enough coverage, high-throughput, automatic annotation is required
• The most effective (but also error-prone) automatic annotation method is transfer by sequence similarity
Concerns in functional transfer by similarity
• Level of homology (~ from 40-60% is possible)
• The overlap between query and hit sequences (not much of a problem)
• The domain or structure function association
• The paralog problem: genes with similar sequences might have different functional specifications
• The evidence for the original annotation
• Balance between quality and quantity: depends on the use
Function transfer: HIT (GO1, GO2, GO3, GO4) → QUERY (GO1, GO2, GO3, GO4)
Blast2GO
• Suite for functional annotation and data mining on functional data
– Considerations for annotation
• Similarity
• Length of the overlap
• Percentage of hit sequence spanned by the overlap
• Evidence of original annotation
• Blast hits and motif hits
• Refinement by additional methods
– Visualization:
• Annotation charts
• Knowledge discovery on the DAG
• Desktop Java application
• Web interface @ Babelomics: Babelomics for non-model
Blast2GO Annotation strategy
Blast: each query sequence (Sq1, Sq2, Sq3, Sq4) → Hit1, Hit2, Hit3, Hit4
Mapping: GO terms are retrieved for each hit, for example:
• go1,go2,go3 | go1,go3,go4 | go3,go5,go6,go8 | go1,go4
• go6,go9,go8 | go1,go8 | go4,go1,go8,go9
• go2 | go2,go4,go4 | go2,go5,go6 | go2,go4
Annotation
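The mapping and annotation steps can be illustrated with a deliberately simple rule: pool the GO terms of all hits for one sequence and keep the terms supported by a minimum fraction of hits. This is not the Blast2GO annotation rule, which also weighs similarity, overlap and evidence codes; the function name and the `min_support` threshold are assumptions, and the GO labels are placeholders in the style of the slide.

```python
from collections import Counter

def annotate(hit_terms, min_support=0.5):
    """Keep GO terms that occur in at least min_support of the hits.

    hit_terms is a list with one list of GO terms per BLAST hit.
    """
    if not hit_terms:
        return []
    # Count each term once per hit, even if a hit lists it twice.
    counts = Counter(term for terms in hit_terms for term in set(terms))
    cutoff = min_support * len(hit_terms)
    return sorted(term for term, n in counts.items() if n >= cutoff)

hits = [["go1", "go2", "go3"],
        ["go1", "go3", "go4"],
        ["go3", "go5", "go6", "go8"],
        ["go1", "go4"]]
print(annotate(hits))
print(annotate(hits, min_support=0.75))
```

Raising the threshold trades quantity for quality, the same balance listed among the concerns in functional transfer.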
Blast2GO Annotation Strategy
Refinement: InterPro, Annex, GOSlim, Manual
Annotations after refinement:
• Sq1: go1, go2, go3, GO11
• Sq2: go8, GO12, GO13
• Sq3: go2, go4
• Sq4: GO15
B2G Highlighting on the DAG: The B2G Score
• Coloring strategy to highlight regions in the DAG where the most interesting information is concentrated
• The confluence score (B2G score) keeps a balance between the number of annotated sequences at one node and the distance to the origin of annotation
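One way such a balance could be computed is sketched below. The form used is an assumption made for illustration: each node sums the sequences annotated at itself and at its descendants, with each contribution discounted by a factor `alpha` per edge of distance. The slides do not give the formula or parameter values, and the toy DAG and counts are made up.

```python
from collections import deque

def confluence_score(node, children, seq_counts, alpha=0.5):
    """Sum annotated-sequence counts over a node's descendants, discounted by distance.

    Distance is the smallest number of edges from node down to the descendant,
    found with a breadth-first walk over the DAG.
    """
    score = 0.0
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        score += seq_counts.get(current, 0) * alpha ** dist[current]
        for child in children.get(current, ()):
            if child not in dist:
                dist[child] = dist[current] + 1
                queue.append(child)
    return score

# Toy DAG: a -> b, a -> c, b -> d, c -> d; counts are sequences annotated directly at each node.
children = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
seq_counts = {"b": 2, "c": 1, "d": 4}
score_a = confluence_score("a", children, seq_counts)
score_d = confluence_score("d", children, seq_counts)
print(score_a, score_d)
```

With this form, a general node far above the annotated terms scores lower than the specific node where the annotations sit, so the highlighting favours regions close to the origin of annotation.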