Aim and outline of the course · 2014. 7. 10. · Aim and outline of the course: DB data mining ...

Aim and Outline of the course: DB data mining

DB tools can be used to retrieve information for a gene or protein. The most important concept is similarity.

Aim and Outline of the course: DB data mining

• BLAST
• InterPro
• Blast2GO

Sequence Alignments


BLAST (Basic Local Alignment Search Tool)

• One of the tools of the NCBI, the U.S. National Center for Biotechnology Information.

• Uses word matching like FASTA

• Similarity matching of words (3 amino acids or 11 nucleotide bases); identical words are not required.

• If no words are similar, then there is no alignment

– won't find matches for very short sequences


BLAST word matching

Break query into words:

MEAAVKEEISVEDEAVDKNI

MEA EAA AAV AVK VKE KEE EEI EIS ISV ...

Break database sequences into words:
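The word-breaking step can be sketched in a few lines of Python. This is only an illustration of the idea, not BLAST's implementation; the function name `break_into_words` is made up for this example, and the word size of 3 is the protein word length quoted earlier.

```python
def break_into_words(sequence, word_size=3):
    """Return every overlapping word of length word_size, in sequence order."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


query = "MEAAVKEEISVEDEAVDKNI"
words = break_into_words(query)

# The first nine words are the ones shown on the slide.
print(" ".join(words[:9]))
```

A sequence of length n yields n - 2 words of size 3, so the 20-residue query above gives 18 words. The same function can be applied to each database sequence.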


Compare word lists

Query Word List: MEA EAA AAV AVK VKL KEE EEI EIS ISV

Database Sequence Word Lists: RTT AAQ SDG KSS SRW LLN QEL RWY VKI GKG DKI NIS LFC WDV AAV KVR PFR DEI

Compare word lists by hashing (and allow near matches!)


ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT

TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY

IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH

MEA EAA AAV AVK KLV KEE EEI EIS ISV

Find locations of matching words in all sequences
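Locating the matching words by hashing might look like the sketch below. It is a simplified illustration: the function names are hypothetical, a Python dictionary stands in for the hash table, and only exact word matches are found (the near matches mentioned earlier would need a scoring matrix and are left out).

```python
from collections import defaultdict


def build_word_index(sequences, word_size=3):
    """Hash every word of every database sequence to its (sequence, offset) locations."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for offset in range(len(seq) - word_size + 1):
            index[seq[offset:offset + word_size]].append((seq_id, offset))
    return index


def find_seeds(query_words, index):
    """Look up each query word in the index; words without a hit are dropped."""
    return {word: index[word] for word in query_words if word in index}


# The three database sequences and the query word list from the slide.
database = [
    "ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT",
    "TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY",
    "IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH",
]
query_words = ["MEA", "EAA", "AAV", "AVK", "KLV", "KEE", "EEI", "EIS", "ISV"]

seeds = find_seeds(query_words, build_word_index(database))
for word, locations in seeds.items():
    print(word, locations)  # (sequence number, 0-based offset) pairs
```

Building the index costs one pass over the database, after which each query word is found with a single dictionary lookup instead of a scan of every sequence.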


Extend hits one base at a time

• Then BLAST extends the matches in both directions, starting at the seed. The ungapped alignment process extends the initial seed match of length W in each direction in an attempt to boost the alignment score. Indels are not considered during this stage.

• In the last stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user.
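The ungapped extension stage can be illustrated with a small sketch. Several things here are assumptions made for the example rather than BLAST's actual settings: the function names, the simple +1/-1 match/mismatch scoring (BLAST scores proteins with a substitution matrix), and the rule of stopping once the running score falls more than `x_drop` below the best score seen.

```python
def _best_extension(pairs, match, mismatch, x_drop):
    """Walk along aligned residue pairs and return (best score gain, length reaching it)."""
    best = running = 0
    best_len = 0
    for length, (a, b) in enumerate(pairs, start=1):
        running += match if a == b else mismatch
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score dropped too far below the best: stop extending
    return best, best_len


def extend_seed(query, subject, q_pos, s_pos, word_size=3,
                match=1, mismatch=-1, x_drop=2):
    """Extend a seed word in both directions without gaps.

    Returns (score, query_start, query_end) with an end-exclusive query_end.
    """
    seed_score = sum(match if a == b else mismatch
                     for a, b in zip(query[q_pos:q_pos + word_size],
                                     subject[s_pos:s_pos + word_size]))
    right = zip(query[q_pos + word_size:], subject[s_pos + word_size:])
    left = zip(reversed(query[:q_pos]), reversed(subject[:s_pos]))
    right_score, right_len = _best_extension(right, match, mismatch, x_drop)
    left_score, left_len = _best_extension(left, match, mismatch, x_drop)
    return (seed_score + left_score + right_score,
            q_pos - left_len,
            q_pos + word_size + right_len)


# Toy sequences (made up): the seed "AAV" is at query offset 2 and subject offset 3.
query = "MEAAVKEEISV"
subject = "TTEAAVKDEISVPP"
score, start, end = extend_seed(query, subject, q_pos=2, s_pos=3)
print(score, query[start:end])
```

In this toy case the extension runs through a single mismatch (E against D) because the score recovers afterwards, which is the behaviour the drop-off rule is meant to allow.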


Why statistics??


Job Title: P14867|GBRA1_HUMAN Gamma-aminobutyric acid...
BLASTP 2.2.18 (Mar-02-2008) protein-protein BLAST
Database: Non-redundant SwissProt sequences
          309,621 sequences; 115,465,120 total letters
Query= P14867|GBRA1_HUMAN Gamma-aminobutyric acid receptor subunit alpha-1 - Homo sapiens (Human).
Length=456

Sequences producing significant alignments:                        Score (Bits)  E Value
sp|P14867.3|GBRA1_HUMAN Gamma-aminobutyric acid receptor subu...   948           0.0
sp|Q5R6B2.1|GBRA1_PONPY Gamma-aminobutyric acid receptor subu...   944           0.0
sp|Q4R534.1|GBRA1_MACFA Gamma-aminobutyric acid receptor subu...   944           0.0
sp|P08219.1|GBRA1_BOVIN Gamma-aminobutyric acid receptor subu...   939           0.0
sp|P62813.1|GBRA1_RAT Gamma-aminobutyric acid receptor subuni...   908           0.0
sp|P19150.1|GBRA1_CHICK Gamma-aminobutyric acid receptor subu...   882           0.0
sp|P47869.2|GBRA2_HUMAN Gamma-aminobutyric acid receptor subu...   670           0.0
sp|P26048.1|GBRA2_MOUSE Gamma-aminobutyric acid receptor subu...   669           0.0
sp|P23576.1|GBRA2_RAT Gamma-aminobutyric acid receptor subuni...   669           0.0
sp|P10063.1|GBRA2_BOVIN Gamma-aminobutyric acid receptor subu...   667           0.0
sp|Q08E50.1|GBRA5_BOVIN Gamma-aminobutyric acid receptor subu...   641           0.0
sp|Q8BHJ7.1|GBRA5_MOUSE Gamma-aminobutyric acid receptor subu...   640           0.0
sp|P31644.1|GBRA5_HUMAN Gamma-aminobutyric acid receptor subu...   638           0.0
sp|P19969.1|GBRA5_RAT Gamma-aminobutyric acid receptor subuni...   636           0.0
sp|P34903.1|GBRA3_HUMAN Gamma-aminobutyric acid receptor subu...   632           0.0
sp|P26049.1|GBRA3_MOUSE Gamma-aminobutyric acid receptor subu...   630           6e-180
sp|P10064.1|GBRA3_BOVIN Gamma-aminobutyric acid receptor subu...   628           2e-179
sp|P20236.1|GBRA3_RAT Gamma-aminobutyric acid receptor subuni...   627           3e-179
sp|P30191.1|GBRA6_RAT Gamma-aminobutyric acid receptor subuni...   520           6e-147
sp|P16305.2|GBRA6_MOUSE Gamma-aminobutyric acid receptor subu...   518           2e-146
sp|Q90845.1|GBRA6_CHICK Gamma-aminobutyric acid receptor subu...   518           3e-146

BLAST : example of result
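The relationship between the bit score and the E value in such a result can be sketched with the standard Karlin-Altschul approximation E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the database length. The function name below is made up for this example, and raw lengths are used as an assumption: BLAST itself applies edge-corrected (effective) lengths, so its reported values differ somewhat.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value: E = m * n * 2**(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Query length and database size taken from the result shown above.
e_value = expect_value(bit_score=518, query_length=456, database_length=115_465_120)
print(f"{e_value:.1e}")
```

For a bit score of 518 this rough estimate lands within an order of magnitude of the 2e-146 and 3e-146 reported in the listing, which is why statistics matter: the E value estimates how many alignments with at least this score would be expected by chance in a database of this size.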


BLAST is approximate but fast

• BLAST makes similarity searches very quickly, but also makes errors

– misses some important similarities

– makes many incorrect matches

• The NCBI BLAST web server lets you compare your query sequence to the sequences stored in GenBank

• The server is a VERY fast and powerful computer.

• Speed and relatively good accuracy are the main reasons why BLAST is the most popular bioinformatics search tool.

InterPro Database: Protein Functional Analysis

Jennifer McDowall, Ph.D. Senior InterPro Curator

EBI Sequence Databases

[Diagram: flow between the EBI sequence databases]

• EMBL (GenBank, DDBJ): nucleotide sequence, e.g. CGCGCCTGTACGCTGAACGCTCGTGACGTGTAGTGCGCG
• translate → UniProtKB TrEMBL: protein sequence (>7M)
• manual annotation → UniProtKB Swiss-Prot (>400,000)
• InterPro: protein signatures, used for protein annotation

InterPro ~80% Protein Coverage

Signature matches: groups of related proteins (same family or share domains) in UniProtKB

• UniProt/Swiss-Prot proteins: ~400,000; InterPro: ~370,000
• UniProt/TrEMBL proteins: >7M; InterPro: >5.3M
• UniMESS metagenomic proteins: >6M (available 2009)

What are protein signatures?

Multiple sequence alignment

• A signature describes the pattern of a set of conserved residues in a group of proteins
  – Define a protein family
  – Define a protein feature (domain or conserved site)

What value are signatures?

• More sensitive homology searches
  – Find more distant homologues than BLAST

• Classification of proteins
  – Associate proteins that share: function, domains, sequence, structure

• Annotation of protein sequences
  – Define conserved regions of a protein, e.g. location and type of domains, key structural or functional sites

• Transfer additional (automatic) annotation
  – Associate TrEMBL proteins with well-annotated Swiss-Prot proteins

Signature methods

•  Pattern

•  Fingerprint

•  Sequence clustering

•  HMM

•  SAM

Patterns

Pattern/motif in sequence → regular expression

Can define important sites:

• Enzyme catalytic site
• Prosthetic group attachment
• Metal ion binding site
• Cysteines for disulphide bonds
• Protein or molecule binding

EXAMPLE: Insulin

B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx
(the vertical bars in the original figure mark disulphide bonds between cysteines)

EXAMPLE: PS00262 Insulin family signature

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

Matched region: CCTSICSLYQLENYC

Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Patterns – understanding a regular expression

C - C - {P} - x(2) - C - [STDNEKPI] - x(3) - [LIVMFS] - x(3) - C

• A single letter (e.g. C) is a strictly conserved site; only one amino acid is accepted at this position

• Curly brackets denote amino acids that cannot occur at a single position

• x denotes that any amino acid can occur at a single position

• x(2) means that any amino acid can occur at each of the next two positions

• Square brackets denote the range of amino acids that can occur at a single position

• There are dashes between each position
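The notation rules above translate almost one-to-one into an ordinary regular expression. The sketch below is a minimal converter written for this example (the function name `prosite_to_regex` is hypothetical); it handles only the elements used in the insulin pattern, namely single letters, `x`, `[...]`, `{...}` and a fixed repeat count such as `x(2)`.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        match = re.fullmatch(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, count = match.groups()
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded amino acids
        # single letters and [..] sets are already valid regex syntax
        if count:
            core += "{" + count + "}"        # fixed repeat, e.g. x(2)
        parts.append(core)
    return "".join(parts)


insulin = ("MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGG"
           "PGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN")

regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
hit = re.search(regex, insulin)
print(regex)
print(hit.group())
```

Run against the insulin precursor sequence shown in the PS00262 example, the search returns the 15-residue region containing the conserved cysteines.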

Patterns

Building a pattern signature:

1) Sequence alignment
2) Define pattern (e.g. insulin family motif)
3) Extract pattern sequences (xxxxxx xxxxxx xxxxxx xxxxxx)
4) Build regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Result: pattern signature (PS00000)

Fingerprints

Several motifs → characterise family

Different combinations of motifs describe subfamilies

Identify small conserved regions in divergent proteins

EXAMPLE: PR00107 Phosphocarrier HPr signature

PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE

3-motif fingerprint:

1) GIHARPATLLVQTASKF (His phosphorylation site)
2) KGKSVNLKSIMGVMSL (Ser phosphorylation site)
3) LGVGQGSDVTITVDGADE (conserved site)

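Fingerprints

Building a fingerprint signature:

1) Sequence alignment
2) Define motifs (His phosphorylation site, Ser phosphorylation site, conserved site)
3) Extract motif sequences (xxxxxx xxxxxx xxxxxx xxxxxx)
4) Fingerprint signature: motifs 1, 2, 3 in the correct order and with the correct spacing

Result: fingerprint signature (PR00000)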
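The "correct order" requirement of a fingerprint can be illustrated with a short sketch. It is a deliberate simplification: real fingerprint scans score each motif against a frequency matrix and also check spacing, whereas the hypothetical function below only looks for exact copies of the three PR00107 motifs and checks that they start in increasing order.

```python
def fingerprint_positions(sequence, motifs):
    """Return the start offset of each motif if all occur in the given order, else None."""
    positions = []
    search_from = 0
    for motif in motifs:
        start = sequence.find(motif, search_from)
        if start == -1:
            return None          # motif missing, or present only out of order
        positions.append(start)
        search_from = start + 1  # the next motif must start further along
    return positions


pthp_entfa = ("MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSL"
              "GVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE")
motifs = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]

positions = fingerprint_positions(pthp_entfa, motifs)
print(positions)
```

Requiring several short motifs together is what lets a fingerprint stay specific even though each individual motif is small; a sequence matching only some of the motifs would point to a more distant relative or a subfamily.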

Sequence clustering

Automatic clustering of homologous domains

**Rarely covers entire domain (conserved core)

**Signature size can change with release

Known domain families → recruit homologous domains (PSI-BLAST) → automatic clustering (MKDOM2) → align domain families (ProDomAlign)

Hidden Markov Models (HMM)

Can characterise protein over entire length

Models conserved and divergent regions (position-specific scoring)

Models insertions and deletions

– Outperform in sensitivity and specificity

– More flexible (can use partial alignments)

[Diagram: building a profile HMM]

Sequence alignment (Sequence 1 to Sequence 7) → scoring matrix (residue frequency at each position in alignment) → profile → Bayesian statistics, probability scoring

HMM states:

M = match state (M1 ... M10)
I = insert state (I1 ... I9)
D = delete state (D2 ... D9)
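The "residue frequency at each position" step that underlies a profile can be sketched as follows. This covers only the position-specific frequencies of the match columns; the insert and delete states and the transition probabilities of a full HMM are not modelled. The function names and the four-sequence toy alignment are made up for the example.

```python
from collections import Counter


def position_frequencies(alignment):
    """For each alignment column, return a dict of residue -> relative frequency."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({residue: n / len(column) for residue, n in counts.items()})
    return profile


def consensus(profile):
    """Pick the most frequent residue in each column."""
    return "".join(max(column, key=column.get) for column in profile)


# Hypothetical toy alignment: columns 1, 3 and 5 are fully or mostly conserved.
alignment = ["GIHAR", "GVHAR", "GIHSR", "GIHAK"]
profile = position_frequencies(alignment)
print(profile[1])
print(consensus(profile))
```

Conserved columns end up dominated by one residue while divergent columns spread their frequency over several, which is the position-specific scoring that distinguishes a profile from a single-sequence comparison.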



HMM databases:

• PIR SUPERFAMILY, PANTHER, TIGRFAM: families conserved in sequence

• PFAM, SMART: domains conserved in sequence

• SUPERFAMILY, GENE3D: domains conserved in structure

SAM Profile HMMs

Homologous structural superfamilies

Start with single seed sequence

Proteins in superfamily may have low sequence identity

Few proteins in family have PDB structures

Create 1 model for every protein in superfamily → combine results

SAM Profile models

T99 script:

1) Single seed sequence (e.g. GIHARPATLLVQTASKF) → initial model
2) WU-BLASTP search → close homologues and low identity matches
3) New larger alignment → final model

Signature Methods

• Pattern
• Fingerprint
• HMM
• SAM

Describe protein features: active sites, binding sites…

Describe families and sibling subfamilies

Predicts conserved domains

Signature Methods

• Pattern
• Fingerprint
• HMM
• SAM

Functional classification of families

Functional domain annotation

Structural domain annotation

Comprehensive annotation

InterPro removes redundancy

• Domain annotation: SWIB/MDM2 domain, RanBP2-type zinc finger, RING-type zinc finger

• Annotate features: conserved site within zinc finger

• Family classification: Mdm2/Mdm4 family (parent) → Mdm4 subfamily (child)

Domain Boundaries

Gene3D (and SSF) determines domain structural boundaries

Pfam trims domains to regions of good sequence conservation

ProDom displays shortest conserved sequence

Fragmented Signatures

1) Signature method

• e.g. PRINTS – discrete motifs

2) Duplicated domains

• e.g. SSF - duplication consisting of 2 domains with same fold

3) Repeated elements

• e.g. Kringle, WD40

4) Non-contiguous domains

• Structural domains can consist of non-contiguous sequence

Complementary Annotation

– Sequence-based signature (Pfam) shows that the domain is made up of repeating sequence elements (beta-propeller repeat)

– Structure-based signature (SSF) shows the boundaries of the structural domain (7-blade beta-propeller)

PFAM shows the domain is composed of two types of repeated sequence motifs

SUPERFAMILY shows the potential domain boundaries

GENE3D shows that these domains share homologous structure

PFAM/SMART show 2 domains from distinct sequence families

Functional annotation with Blast2GO

Index

• Why Blast2GO? Concepts on functional annotations
  – Function assignment
  – Vocabularies – GO and GOA
  – The GO Directed Acyclic Graph
  – The problem of function transfer

• Blast2GO Java application + Practicals

• Blast2GO @ Babelomics + Practicals

Why Blast2GO?

Workflow analysis: Experiment → Data analysis → Gene list (MNAT1 CTNNBL1 ENOX2 GTPBP1 RALY TAGLN2 RAB3A PPP2R5A MAPRE1 ...) → Functional profiling (+ functional annotation) → Functional interpretation

What does Blast2GO do?

• Generates annotations
• Visualization of functional annotations

Concepts of functional annotation

• Gene/protein function
  – Refers to the molecular function of a gene or a protein, e.g. tyrosine kinase

• Functional annotation
  – More general; refers to the characterization of functional aspects of the protein, e.g. stress-related, cytoplasm, ABC transporter
  – Also refers to the process of assigning a function label
  – Usually, standard vocabularies are used to assign function

Functional Vocabularies

• Molecular Function
• Biological Process
• Cellular Component
• Metabolic pathways
• KEGG orthologues
• Functional motifs

The Gene Ontology

• Project developed by the Gene Ontology Consortium

• Provides a controlled vocabulary to describe gene and gene product attributes in any organism

• Includes both the development of the Ontology and the maintenance of a database of annotations

The Ontology

• Annotations are given to the most specific (low) level.

• True path rule: annotation at a term implies annotation to all its parent terms

• Annotation is given with an Evidence Code:
  – IDA: inferred from direct assay
  – TAS: traceable author statement
  – ISS: inferred from sequence similarity
  – IEA: electronic annotation
  – ...

The GO has a DAG structure (more general terms at the top, more specific terms at the bottom)
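The true path rule can be made concrete with a small traversal. The graph below is a hypothetical four-term DAG written for this example, not real GO content, and the function name is likewise made up.

```python
def with_ancestors(term, parents):
    """Return the term plus every term reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


# Toy DAG: term "d" has two parents, which is what makes this a DAG rather than a tree.
parents = {"d": ["b", "c"], "b": ["a"], "c": ["a"], "a": []}

implied = with_ancestors("d", parents)
print(sorted(implied))
```

Annotating a gene product to the specific term `d` therefore implies annotation to `b`, `c` and the root `a`, which is why annotations only need to be stored at the most specific level.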

The Gene Ontology Annotation Database (GOA)

• There is a collaborating institution per organism to provide annotations

• Most of the GOA annotations come from UniProt

• Most of the annotations are electronic annotations

http://www.geneontology.org/GO.current.annotations.shtml

InterPro

• Collection of databases with functional annotation of protein motifs

• Functional vocabulary at UniProt

• There is an equivalence table between GO and InterPro

http://www.ebi.ac.uk/interpro/databases.html

Functional assignment

Annotation: empirical or by transference

• Molecular interactions
• Gene/protein expression
• Biochemical assay
• Structure comparison
• Sequence analysis
• Identification of folds
• Motif identification
• Phylogeny
• Literature reference
• Sequence homology

Automatic annotation

• GO annotations can be created by comparison to annotated sequences

• To achieve enough coverage, high-throughput, automatic annotation is required

• The most effective (but also error-prone) automatic annotation method is transfer by sequence similarity

Concerns in functional transfer by similarity

• Level of homology (transfer is possible from ~40-60%)

• The overlap between query and hit sequences (not much of a problem)

• The association between domain or structure and function

• The paralog problem: genes with similar sequences might have different functional specifications

• The evidence for the original annotation

• Balance between quality and quantity: depends on the use

[Diagram: the GO terms of the HIT (GO1, GO2, GO3, GO4) are transferred to the QUERY (GO1, GO2, GO3, GO4)]

Blast2GO

• Suite for functional annotation and data mining on functional data
  – Considerations for annotation:
    • Similarity
    • Length of the overlap
    • Percentage of hit sequence spanned by the overlap
    • Evidence of the original annotation
    • Blast hits and motif hits
    • Refinement by additional methods
  – Visualization:
    • Annotation charts
    • Knowledge discovery on the DAG

• Desktop Java application

• Web interface @ Babelomics: Babelomics for non-model

Blast2GO Annotation strategy

[Diagram: three steps applied to the query sequences Sq1 ... Sq4]

1) Blast: each query sequence returns a list of hits (Hit1, Hit2, Hit3, Hit4)

2) Mapping: the GO terms of each hit are retrieved, e.g. (go1, go2, go3), (go1, go3, go4), (go3, go5, go6, go8), (go1, go4)

3) Annotation: GO terms are selected for each query sequence from the terms of its hits
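How hits might be filtered and their GO terms pooled for one query sequence can be sketched as below. This is not Blast2GO's actual annotation rule: the function name, the dictionary layout of a hit, and both thresholds are assumptions chosen for illustration, covering only the similarity and hit-coverage considerations.

```python
def candidate_go_terms(hits, min_similarity=55.0, min_hit_coverage=0.5):
    """Pool GO terms from hits that pass similarity and overlap filters.

    Returns a dict mapping each GO term to the highest similarity among
    the hits that support it.
    """
    candidates = {}
    for hit in hits:
        coverage = hit["overlap_length"] / hit["hit_length"]
        if hit["similarity"] < min_similarity or coverage < min_hit_coverage:
            continue  # too dissimilar, or the overlap spans too little of the hit
        for term in hit["go_terms"]:
            candidates[term] = max(candidates.get(term, 0.0), hit["similarity"])
    return candidates


# Hypothetical hits for one query sequence (all values made up).
hits = [
    {"similarity": 90.0, "overlap_length": 180, "hit_length": 200, "go_terms": ["go1", "go2", "go3"]},
    {"similarity": 70.0, "overlap_length": 150, "hit_length": 200, "go_terms": ["go1", "go3", "go4"]},
    {"similarity": 40.0, "overlap_length": 190, "hit_length": 200, "go_terms": ["go5"]},
    {"similarity": 80.0, "overlap_length": 20, "hit_length": 200, "go_terms": ["go6"]},
]

candidates = candidate_go_terms(hits)
print(candidates)
```

Raising the thresholds trades quantity for quality, which mirrors the balance between the two noted among the concerns about functional transfer.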

Blast2GO Annotation Strategy

[Diagram: annotation followed by refinement]

Annotation result:

Sq1: go1, go2, go3, GO11
Sq2: go8, GO12, GO13
Sq3: go2, go4
Sq4: GO15

Refinement: InterPro, Annex, GOSlim, Manual

B2G Highlighting on the DAG: The B2G Score

• Coloring strategy to highlight regions in the DAG where the most interesting information is concentrated

• The confluence score (B2G score) keeps a balance between the number of annotated sequences at one node and the distance to the origin of annotation
