infer function by motifs pp2 introfunc3 - rostlab · 2014-11-05 · bairoch a (1991) nar 19 2241-5...
/00© Burkhard Rost
title: Protein Prediction 2 (for Bioinformaticians) - Protein function: Infer function by motifs
short title: pp2_introfunc3
lecture: Protein Prediction 2 - Protein function, TUM winter semester
So far: Function introduction
• Molecular biology is just at an exciting beginning
• We can compute some aspects of molecular life
• Most accurate inference of function: based on homology
Today
• Motifs
• Function by association
NEXT
• “compute” enzyme function?
• predict localization
Past - TOC today - Next
I.2b Function Intro: Sequence motifs
Motifs - intro
Full sequence (ADH1_human, 95 aa):
MANEVIKCKAAVAWEAGKPLSIEEIEVAPPKAHEVRIKIIATAVCHTDAY
TLSGADPEGCFPVILGHEGAGIVESVGEGVTKLKAVWRMQILSKS
Motifs could be: MANEVIKCKAA
Or: MAN[ED]hh[KR]C[KR]
Sequence vs motif
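The difference between a literal sequence and a motif can be sketched with a regular expression. This is a minimal illustration, not the tool used in the lecture. It assumes that the lowercase `h` in the motif means "any hydrophobic residue", and the residue set chosen for that class (`AVILMFWC`) is an assumption as well. The helper names are hypothetical.

```python
import re

# ADH1_human as printed on the slide (95 residues).
SEQ = (
    "MANEVIKCKAAVAWEAGKPLSIEEIEVAPPKAHEVRIKIIATAVCHTDAY"
    "TLSGADPEGCFPVILGHEGAGIVESVGEGVTKLKAVWRMQILSKS"
)

# ASSUMPTION: 'h' = hydrophobic; the exact residue set is illustrative only.
HYDROPHOBIC = "[AVILMFWC]"


def motif_to_regex(motif: str) -> str:
    """Translate the slide notation into a Python regular expression."""
    return motif.replace("h", HYDROPHOBIC)


def find_motif(motif: str, seq: str):
    """Return (1-based start, matched residues) for every hit of the motif."""
    regex = re.compile(motif_to_regex(motif))
    return [(m.start() + 1, m.group()) for m in regex.finditer(seq)]


if __name__ == "__main__":
    print(find_motif("MANEVIKCKAA", SEQ))          # literal fragment
    print(find_motif("MAN[ED]hh[KR]C[KR]", SEQ))   # degenerate motif
```

The literal fragment only matches itself, while the degenerate motif would also match relatives in which E is replaced by D or K by R.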
How can we use this concept to search?
Resources for motifs/patterns
PROSITE: http://us.expasy.org/prosite/ [Hulo et al., Nucl. Acids Res. 32:D134-D137 (2004)]
PRINTS: http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/ [Attwood, Briefings in Bioinformatics 3(3):252-263 (2002)]
BLOCKS: http://www.blocks.fhcrc.org/ [Henikoff et al., Nucl. Acids Res. 28:228-230 (2000)]
PROSITE
1986 starts SWISS-PROT
1988 starts PROSITE
1993 starts ExPASy (with Ron Appel)
1998 SIB: Swiss Institute of Bioinformatics
2009 CALIPHO: Computer and Laboratory Investigation of Proteins of Human Origin
Amos Bairoch
Shapers and Shakers
What’s good?
SWISS-PROT, PROSITE, ExPASy, CALIPHO, SIB (Swiss Institute of Bioinformatics)
papers:
• >220 papers (Nov 2013)
• 4 with >1,000 citations (Nov 2013)
• 70 with over 100 citations (Nov 2013)
• H-index 79 (ISI, Nov 2013)
Manually align family + annotate motifs
Use motifs for automatic alignment and annotation of unknown
Motifs and patterns
Find a motif or a pattern in a functionally characterized family
Search for the motif pattern in a new protein
Transfer function annotation
© Marco Punta & Yanay Ofran & Burkhard Rost (Columbia New York)
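The find / search / transfer workflow can be sketched in a few lines. This is a toy version under stated assumptions: the family alignment is hypothetical, the rule for turning an alignment column into a pattern position (keep up to three alternatives, otherwise a wildcard) is an arbitrary illustrative choice, and curated databases such as PROSITE rely on manual expert review rather than on such a rule.

```python
import re


def pattern_from_alignment(aligned, max_alternatives=3):
    """Derive a regex pattern from equal-length aligned sequences.

    Conserved columns become a single letter, columns with a few
    alternatives become a character class, everything else a wildcard.
    """
    parts = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if "-" in residues or len(residues) > max_alternatives:
            parts.append(".")
        elif len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("[" + "".join(residues) + "]")
    return "".join(parts)


def transfer_annotation(pattern, annotation, new_sequence):
    """Return the family annotation if the pattern occurs, else None."""
    return annotation if re.search(pattern, new_sequence) else None


if __name__ == "__main__":
    # HYPOTHETICAL family fragment, for illustration only.
    family = ["MANEVIKCK", "MANDLVRCR", "MANEIVKCR"]
    pattern = pattern_from_alignment(family)
    print(pattern)
    print(transfer_annotation(pattern, "toy family function", "GSMANDIVRCKTT"))
```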
completeness: DB with as many motifs as possible
high specificity: no false positives at a level at which most are found
documentation
periodic reviewing
PROSITE: Concepts for DB
Bairoch A (1991) NAR 19:2241-5. PROSITE: a dictionary of sites and patterns in proteins; repeated: 1992, 1993
Solution:
GxxGxxG (membrane)
[RK](2)-x-[ST] (phosphorylation)
PROSITE history
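Patterns such as `[RK](2)-x-[ST]` can be searched with ordinary regular expressions after a small translation step. The sketch below covers the common PROSITE pattern elements (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the termini). It is a hypothetical helper written for this note, not the ScanProsite implementation, and it ignores rarer syntax.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        # {P} (anything but P) -> [^P]; must run before '(' creates new braces.
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) -> x{2,4}
        element = element.replace("(", "{").replace(")", "}")
        # x (any residue) -> '.'; residue letters are upper case, so this is safe.
        element = element.replace("x", ".")
        # Anchors for N- and C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        regex.append(element)
    return "".join(regex)


def scan(pattern: str, sequence: str):
    """Return (1-based start, match) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("[RK](2)-x-[ST]"))
    print(scan("[RK](2)-x-[ST]", "AARKASAA"))   # toy sequence
    print(scan("GxxGxxG", "MAGAAGLLGTT"))       # compact slide notation also works
```

Short patterns like these match by chance in many unrelated proteins, which is why the specificity requirement on the following slides matters.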
Bairoch A (1991) NAR 19:2241-5. PROSITE: a dictionary of sites and patterns in proteins; repeated: 1992, 1993
A Bairoch & P Bucher (1994) NAR 22:3583-9. PROSITE: recent developments (profiles)
A Bairoch, P Bucher & K Hofmann (1996) NAR 24:189-96; repeated 1997, 1999 (Hofmann, Bucher, Falquet, Bairoch)
L Falquet, M Pagni, P Bucher, N Hulo, CJ Sigrist, K Hofmann & A Bairoch (2002) NAR 30:235-8
Philip Bucher
CJ Sigrist, L Cerutti, N Hulo, A Gattiker, L Falquet, M Pagni, A Bairoch, P Bucher (2002) Brief Bioinform 3:265-74
N Hulo, CJ Sigrist, V Le Saux, PS Langendijk-Genevaux, L Bordoli, A Gattiker, E De Castro, P Bucher, A Bairoch (2004) NAR 32:D134-7
A Gattiker, E Gasteiger, A Bairoch (2002) Appl Bioinformatics 1:107-8. ScanProsite: a reference implementation of a PROSITE scanning tool
A Bairoch (1991) NAR 19 Suppl: 2241-5
prev (1992) NAR 20 Suppl: 2013-8
x (1993) NAR 21: 3097-103
A Bairoch and P Bucher (1994) NAR 22: 3583-9
A Bairoch, P Bucher and K Hofmann (1996) NAR 24: 189-96
prev (1997) NAR 25: 217-21
K Hofmann, P Bucher, L Falquet and A Bairoch (1999) NAR 27: 215-9
L Falquet, M Pagni, P Bucher, N Hulo, CJ Sigrist, K Hofmann and A Bairoch (2002) NAR 30: 235-8
A Gattiker, E Gasteiger and A Bairoch (2002) Appl Bioinformatics 1: 107-8
CJ Sigrist, L Cerutti, N Hulo, A Gattiker, L Falquet, M Pagni, A Bairoch and P Bucher (2002) Brief Bioinform 3: 265-74
N Hulo, CJ Sigrist, V Le Saux, PS Langendijk-Genevaux, L Bordoli, A Gattiker, E De Castro, P Bucher and A Bairoch (2004) NAR 32: D134-7
CJ Sigrist, E De Castro, PS Langendijk-Genevaux, V Le Saux, A Bairoch and N Hulo (2005) Bioinformatics 21: 4060-6
E de Castro, CJ Sigrist, A Gattiker, V Bulliard, PS Langendijk-Genevaux, E Gasteiger, A Bairoch and N Hulo (2006) NAR 34: W362-5
N Hulo, A Bairoch, V Bulliard, L Cerutti, E De Castro, PS Langendijk-Genevaux, M Pagni and CJ Sigrist (2006) NAR 34: D227-30
N Hulo, A Bairoch, V Bulliard, L Cerutti, BA Cuche, E de Castro, C Lachaize, PS Langendijk-Genevaux and CJ Sigrist (2008) NAR 36: D245-9
PROSITE - evolution of method
PROSITE / ScanProsite
© Marco Punta & Yanay Ofran & Burkhard Rost (Columbia New York)
K Hofmann, P Bucher, L Falquet & A Bairoch (1999) Nucl Acids Res 27:215-9
N Hulo et al. (2004) Nucleic Acids Res 32:D134-7
PRINTS
University of Manchester (Faculty of Life Sciences & School of Computer Sciences)
PRINTS: diagnostic fingerprint database
TK Attwood & ME Beck (1994) PRINTS - a protein motif fingerprint database
Terry K Attwood
Motifs are stretches of evolutionarily conserved fingerprints
version 42.0 (Manchester Univ, Feb 2012)
2,156 FINGERPRINTS encoding 12,444 single motifs
TK Attwood, P Bradley, DR Flower, A Gaulton, N Maudling, A Mitchell, G Moulton, A Nordle, K Paine, P Taylor, A Uddin, C Zygouri (2003) NAR 31:400-2
PRINTS concept
homeobox
The homeobox is a 60-residue motif first identified in a number of Drosophila homeotic and segmentation proteins, but now known to be well-conserved in many other animals, including vertebrates [1-3]. Proteins containing homeobox domains are likely to play an important role in development - most are known to be sequence-specific DNA-binding transcription factors. The domain binds DNA through a helix-turn-helix (HTH) structure.
PRINTS: example
BLOCKS
Fred Hutchinson Cancer Center, Seattle
HHMI (Howard Hughes Medical Institute)
papers:
• >300 papers (Nov 2011)
• 3 with >1,000 citations (end 2011)
• 72 with over 100 citations
• H-index 83 (ISI Nov 2011)
Paradigm changes
• gene in gene - in intron (1986)
• histones NOT only in octamers (2004)
• DNA-methylation in histones: H2A.Z in histone spool promotes gene expression (2008): NOT DNA-methylation shuts off genes (important for cancer drug development)
Jorja & Steven Henikoff
Shapers and Shakers
compile log-odds ratios
BLOSUMn = threshold at n% pairwise sequence identity
S Henikoff & JG Henikoff (1992) PNAS 89:10915-9
BLOSUM
Steven Henikoff
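The "compile log-odds ratios" step can be illustrated on a toy block. The sketch counts residue pairs within each column of an ungapped block, estimates the observed pair frequencies q_ij and the background frequencies p_i, and scores each pair as 2·log2(q_ij / e_ij) in half-bit units, with e_ij = p_i·p_i for identical and 2·p_i·p_j for different residues. It deliberately omits the clustering of sequences above n% identity that defines BLOSUMn, and the toy block is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(block):
    """Half-bit log-odds scores from one ungapped block (no clustering)."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue within the observed pairs.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


if __name__ == "__main__":
    toy_block = ["AAL", "AAL", "AVL", "AVI"]  # HYPOTHETICAL block
    for pair, score in sorted(log_odds_scores(toy_block).items()):
        print(pair, score)
```

With real BLOCKS data, pairs that never co-occur in a column get no entry here; a production matrix needs the full database and pseudocount-free counts over many blocks.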
BLOcks of amino acid SUbstitution Matrices
Align only conserved regions
JG Henikoff and S Henikoff (1996) Meth Enzymology 266:88-104
S Pietrokovski, JG Henikoff & S Henikoff (1996) NAR 24:197-201
BLOSUM
idea taken from multiple alignments
BLOCKS
BLOCKS: length distribution
J Liu & B Rost (2003) Current Opinion in Chemical Biology 7, 5-11
Pfam
classify all proteins and RNA into families to better understand their function and evolution
1997 starts Pfam (Protein families)
2003 Rfam (RNA-families)
Citation giant:
• 229 papers (Nov 2011)
• 1 with >8,800 citations (Nov 2011)
• 6 with >1,000 citations (11/2011)
• 32 with >100 citations (11/2011)
• Hirsch index: 48
Alex Bateman
Shapers and Shakers
EL Sonnhammer, SR Eddy, R Durbin (1997) Pfam: a comprehensive database of protein families based on seed alignments. Proteins 28:405-20
EL Sonnhammer, SR Eddy, E Birney, A Bateman, R Durbin (1998) NAR 26:320-2
A Bateman, E Birney, R Durbin, SR Eddy, RD Finn, EL Sonnhammer (1999) Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. NAR 27:260-2
SJ Sammut, RD Finn, A Bateman (2008) Pfam 10 years on: 10,000 families and still growing. Brief Bioinform 9:210-9
Pfam: Protein families
Pfam: how it's done
manual alignment
version/families/
Pfam - current stats
Pfam-7TM
A Bateman, et al. (2004) Nucleic Acids Res 32: D138-41
© Marco Punta & Yanay Ofran & Burkhard Rost (Columbia New York)
Clusters & Families
DB/Method   Version  Latest update  Entries  Update       URL (all begin with http://)
Short sequence motifs
PROSITE     17.23    10/2002        1573     manual       www.expasy.ch/prosite/
Blocks+     -        8/2001         8656     manual       blocks.fhcrc.org/blocks/
PRINTS      35.0     7/2002         1750     manual       www.bioinf.man.ac.uk/dbbrowser/PRINTS/
Structural domain-like regions
Pfam-A      7.6      9/2002         4463     manual       pfam.wustl.edu
TIGRFAM     2.1      9/2002         1622     manual       www.tigr.org/TIGRFAMs/
SMART       3.4      10/2002        654      manual       smart.embl-heidelberg.de
SBASE       9.0      10/2002        483      semi-manual  hydra.icgeb.trieste.it/~kristian/SBASE/
DOMO        2.0      4/1998         -        automatic    www.infobiogen.fr/services/domo/
ProDom      2001.3   12/2001        -        automatic    prodes.toulouse.inra.fr/prodom/doc/prodom.htm
GeneRAGE    -        -              -        automatic    www.ebi.ac.uk/research/cgg/services/rage/
TribeMCL    -        -              -        automatic    www.ebi.ac.uk/research/cgg/tribe/
CHOP        -        10/2002        -        automatic    cubic.bioc.columbia.edu/db/chop/
Integration
InterPro    5.2      9/2002         5875     N/A          www.ebi.ac.uk/interpro/
MetaFam     4.1      9/2002         -        N/A          metafam.ahc.umn.edu
Clusters of proteins
CluSTr      -        -              -        automatic    www.ebi.ac.uk/clustr/
SYSTERS     3.0      -              -        automatic    systers.molgen.mpg.de
PICASSO     0        3/1998         -        automatic    systers.molgen.mpg.de
ProtoNet    1.4      9/2002         -        automatic    www.protonet.cs.huji.ac.il/protonet/
ProClust    1.0      -              -        automatic    promoter.mi.uni-koeln.de/~proclust/
J Liu & B Rost (2003) Cur Op Chem Biol 7, 5-11
Some overlap between databases
J Liu & B Rost (2003) Cur Op Chem Biol 7, 5-11
… not everything that shines is copper
J Liu & B Rost (2003) Cur Op Chem Biol 7, 5-11
localization motifs
motif-based inference of localization
Rajesh Nair
now: FDA, Washington
Similar proteins may differ in localization
R Nair & B Rost (2002) Protein Science 11: 2836-47
Shuttle into the nucleus
[Figure: transport between cytoplasm and nucleus; labels: NLS, M9, Importin, Transportin]
M Cokol, R Nair & B Rost (2000) EMBO Rep 1: 411-415
Types of zip-codes
following: B Alberts, D Bray, J Lewis, M Raff, K Roberts, JD Watson: The Cell, Garland, 1994
ONE in PROSITE: bi-partite motif
Set [A]  N NLS [B]  Nprot nuc [C]  Nfam nuc [D]  Accuracy [E]  Coverage [F]
PROSITE  1  96  31  90 %  3 %
SWISS-PROT  322  290  n.a.  9 %
NLS-lit cleaned  91  309  35  100 %  10 %
NLS-lit consensus  91  537  35  100 %  17 %
PredictNLS_DB  214  1354  186  100 %  43 %
How many NLS motifs in databases?
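The two quality columns of the table can be made concrete with a small sketch. The footnotes defining "Accuracy" and "Coverage" are not part of this transcript, so the definitions below are assumptions: accuracy as the fraction of motif-matching proteins that are truly nuclear, coverage as the fraction of all known nuclear proteins that match any motif. The protein identifiers are hypothetical.

```python
def accuracy_and_coverage(matched, nuclear, non_nuclear):
    """Return (accuracy, coverage) in percent for a set of motif hits.

    matched:      proteins matching at least one NLS motif
    nuclear:      proteins experimentally known to be nuclear
    non_nuclear:  proteins experimentally known NOT to be nuclear
    """
    true_pos = len(matched & nuclear)
    false_pos = len(matched & non_nuclear)
    judged = true_pos + false_pos
    accuracy = 100.0 * true_pos / judged if judged else float("nan")
    coverage = 100.0 * true_pos / len(nuclear) if nuclear else float("nan")
    return accuracy, coverage


if __name__ == "__main__":
    # HYPOTHETICAL toy data, not taken from the table.
    nuclear = {f"nuc{i}" for i in range(10)}
    non_nuclear = {f"cyt{i}" for i in range(5)}
    matched = {"nuc0", "nuc1", "nuc2", "cyt0"}
    print(accuracy_and_coverage(matched, nuclear, non_nuclear))
```

Under these definitions a motif set can reach 100 % accuracy and still cover only a small part of the nuclear proteins, which is the trade-off the table shows.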
Experimental NLS: positive charges
NLS                   Protein         Reference
RKRKK                 YstDNApolalpha  Hsieh et al., 1998
RKRRR                 Amida           Irie et al., 2000
KKKKRKREK             LEF-1           Prieve et al., 1998
KKKRRSREK             TCF-1           Prieve et al., 1998
RQARRNRRRRWR          HIV-1 Rev       Truant et al., 1999
RRMKWKK               PDX-1           Moede et al., 1999
PKKKRKV               SV40 LrgT       Kalderon et al., 1984
PRRRK                 SRY             Sudbeck and Scherer, 1997
GKKRSKA               H2B             Moreland et al., 1987
KAKRQR                v-Rel           Gilmore and Temin, 1988
RGRRRRQR              Amida           Irie et al., 2000
PPVKRERTS             RanBP3          Welch et al., 1999
PYLNKRKGKP            Pho4p           Welch et al., 1999
KRx{7,9}PQPKKKP       p53-NLS1        Liang and Clarke, 1999
KVTKRKHDNEGSGSKRPK    Hum-Ku70        Koike et al., 1999
RLKKLKCSKx{19}KTKR    GAL4            Chan et al., 1998
RKRIREDRKx{18}RKRKR   TCPTP           Chan et al., 1998
RRERx{4}RPRKIPR       BDV-P           Schwemmle et al., 1999
KKKKKEEEGEGKKK        act/inh betaA   Blauer et al., 1999
PRPRKIPR              BDV-P           Shoya et al., 1998
PPRIYPQLPSAPT         BDV-P           Shoya et al., 1998
KDCVINKHHRNRCQYCRLQR  TR2             Yu et al., 1998
APKRKSGVSKC           PolyomaVP1      Chang et al., 1992
RKKRRQRRR             HIV-1 Tat       Truant et al., 1999
MPKTRRRPRRSQRKRPPT    Rex             Palmeri and Malim, 1999
KRPMNAFIVWSRDQRRK     SRY             Sudbeck and Scherer, 1997
KRPMNAFMVWAQAARRK     SOX9            Sudbeck and Scherer, 1997
PPRKKRTVV             NS5A            Ide et al., 1996
YKRPCKRSFIRFI         DNAse EBV       Liu et al., 1998
LKDVRKRKLGPGH         DNAse EBV       Lyons et al., 1987
KRPRP                 AdenovE1a       Bouvier and Baldacci, 1995
RRSMKRK               hVDR            Vihinen-Ranta et al., 1997
PAKRARRGYK            CPV capsid      Kaneko et al., 1997
RKCLQAGMNLEARKTKK     hGlu.cort.      Kaneko et al., 1997
RRERNKMAAAKCRNRRR     CFOS            Kaneko et al., 1997
KRMRNRIAASKCRKRKL     CJUN            Kaneko et al., 1997
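A few of the motifs listed above can be turned into a scanner with very little code. The sketch reads `x` as "any residue" and `x{7,9}` as a spacer of 7 to 9 residues, which is an interpretation of the table notation, and it also reports the fraction of K/R in each motif to make the "positive charges" point explicit. The test sequences are invented, and this is not the PredictNLS method.

```python
import re

# Subset of the experimental NLS motifs from the table above.
NLS_MOTIFS = {
    "SV40 LrgT": "PKKKRKV",
    "SRY": "PRRRK",
    "p53-NLS1": "KRx{7,9}PQPKKKP",
    "GAL4": "RLKKLKCSKx{19}KTKR",
}


def nls_regex(motif: str):
    """Compile a table motif; 'x' means any residue (assumed reading)."""
    return re.compile(motif.replace("x", "."))


def scan_for_nls(sequence: str):
    """Return (protein of origin, 1-based start, match) for every hit."""
    hits = []
    for name, motif in NLS_MOTIFS.items():
        for m in nls_regex(motif).finditer(sequence):
            hits.append((name, m.start() + 1, m.group()))
    return hits


def basic_fraction(motif: str) -> float:
    """Fraction of K and R among the fixed (non-spacer) motif positions."""
    fixed = re.sub(r"x(\{[\d,]+\})?", "", motif)
    return sum(fixed.count(aa) for aa in "KR") / len(fixed)


if __name__ == "__main__":
    print(scan_for_nls("MSTPKKKRKVEDP"))  # INVENTED test sequence
    for name, motif in NLS_MOTIFS.items():
        print(name, round(basic_fraction(motif), 2))
```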
Experimental NLS: more complicated
NLS Protein Reference
CYGSKNTGAKKRKIDDA  DNAhelicaseQ1  Miyamoto et al., 1997
[AKR]TPIQKHWRPTVLTEGPPVKIRIETGEWE[KA]  ASVintegrase  Kukolj G. 1998
GGGx{3}KNRRx{6}RGGRN  Nab2  Truant et al., 1998
KRxxxxxxxxxKTKK  THOV NP  Weber et al., 1998
EYLSRKGKLEL  VirD2-Nterm  Tinland et al., 1992
KRPACTLKPECVQQLLVCSQEAKK  HCDA  Somasekaram et al., 1999
RVHPYQR  QKI-5  Wu et al., 1999
HARNT  Eguchi et al., 1997
YNNQSSNFGPMKGGN  M9  Bonifaci et al., 1997
SxGTKRSYxxM  InfluenzaNP  Wang et al., 1997
TKRSxxxM  InfluenzaNP  Wang et al., 1997
VNEAFETLKRC  MyoD  Vandromme et al., 1995
MNKIPIKDLLNPG  Mat-alpha  Hall et al., 1984
In silico mutagenesis
Increasing accuracy and coverage
Types of zip-codes
Sarah Gilman
Kaz Wrzeszczynski
ER
Sequence motif [1]                   ER/Golgi     Non-ER/Golgi
                                     N     %      N      %
Endoplasmic reticulum (ER) motifs [2]
KDEL-C-term                          56    92     5      8
KDEL                                 61    7      714    92
HDEL-C-term                          45    92     4      8
HDEL                                 46    15     269    2
HDEF-C-term                          2     50     2      50
HDEF                                 2     2      89     98
Golgi apparatus motifs [3]
YQRL                                 3     1      270    99
YKGL                                 5     1      442    99
YHPL                                 4     5      76     95
YXXZ                                 477   1      83112  99
NPFKD                                0     0      14     100
FXFXD                                31    1      3169   99
FQFND                                1     25     3      75
PXPXP                                65    1      8477   99
X                                    479   1      80461  99
GRIP-motif [5]                       1     50     1      50
GRIP-motif (shortened) [6]           1     3      28     97
C-term variations [4]
PROSITE Pattern [7]                  134   77     39     23
{KH}DEL                              86    78     5      4
{KHR}{DENQ}EL                        125   80     32     20
{KHR}{DENQ}L                         125   71     49     29
{KHRDENQAS}{DENQIYCV}{DENQ}L         156   25     477    75
{KRDEAVYF}{KRDEVYFMQ}{KHED}{DK}EL    39    89     5      11
KO Wrzeszczynski & B Rost (2004) CMLS 61: 1341-53
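The contrast between "KDEL-C-term" and "KDEL" in the table comes down to anchoring the pattern at the end of the sequence. A minimal sketch, assuming that `{KH}DEL` in the table means "K or H, followed by DEL" (in strict PROSITE syntax braces would mean exclusion, so this reading is an assumption); the example sequences are invented.

```python
import re

# ASSUMPTION: {KH} in the table denotes the set "K or H".
ANYWHERE = re.compile(r"[KH]DEL")
C_TERMINAL = re.compile(r"[KH]DEL$")


def classify_kdel(sequence: str) -> str:
    """Report where a [KH]DEL retention signal occurs, if at all."""
    if C_TERMINAL.search(sequence):
        return "C-terminal"
    if ANYWHERE.search(sequence):
        return "internal"
    return "absent"


if __name__ == "__main__":
    # INVENTED sequences for illustration.
    for seq in ("MKTAYIAKQRKDEL", "MKDELTAYIAKQR", "MKTAYIAKQR"):
        print(seq, classify_kdel(seq))
```

Only the C-terminal class corresponds to the rows with high ER percentages; the same four residues inside a sequence are mostly chance matches according to the table.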
Automate
Unify
Remote homologues
Open challenges - motifs and patterns
© Marco Punta & Yanay Ofran & Burkhard Rost (Columbia New York)
Identify active site / functional element
Search for this structural pattern in a new protein
Transfer function annotation
S Jones & J Thornton (2004) Curr Opin Chem Biol 8:3-7
Manual identification of active site
Automatic structural alignment?
Structural motifs
© Marco Punta & Yanay Ofran & Burkhard Rost (Columbia New York)
Find
Search
Add biophysics of the site to the spatial search
Open challenges - structural motifs
© Marco Punta & Yanay Ofran & Burkhard Rost (Columbia New York)
Example 3: Voltage-gated potassium channel
Example: Voltage-gated potassium channel
V Ruta et al. & R MacKinnon (2003) Nature, 422:180-5
• Eukaryotic voltage-gated potassium channel (VG-K+)
• Prokaryotic membrane proteins are easier to crystallize than eukaryotic ones
• find a prokaryotic VG-K+ having functional and structural features similar to the eukaryotic one
© Marco Punta
Voltage-gated K+ channel: sequence
1 MAAVAGLYGLGEDRQHRKKQQQQQQHQKEQLEQKEEQKKIAERKLQLREQQLQRNSLDGY
GSLPKLSSQDEEGGAGHGFGGGPQHFEPIPHDHDFCERVVINVSGLRFETQLRTLNQFPD
TLLGDPARRLRYFDPLRNEYFFDRSRPSFDAILYYYQSGGRLRRPVNVPLDVFSEEIKFY
ELGDQAINKFREDEGFIKEEERPLPDNEKQRKVWLLFEYPESSQAARVVAIISVFVILLS
IVIFCLETLPEFKHYKVFNTTTNGTKIEEDEVPDITDPFFLIETLCIIWFTFELTVRFLA
CPNKLNFCRDVMNVIDIIAIIPYFITLATVVAEEEDTLNLPKAPVSPQDKSSNQAMSLAI
LRVIRLVRVFRIFKLSRHSKGLQILGRTLKASMRELGLLIFFLFIGVVLFSSAVYFAEAG
SENSFFKSIPDAFWWAVVTMTTVGYGDMTPVGVWGKIVGSLCAIAGVLTIALPVPVIVSN
FNYFYHRETDQEEMQSQNFNHVTSCPYLPGTLGQHMKKSSLSESSSDMMDLDDGVESTPG
LTETHPGRSAVAPFLGAQQQQQQQPVASSLSMSIDKQLQHPLQHVTQTQLYQQQQQQQQQ
QQNGFKQQQQQTQQQLQQQQSHTINASAAAATSGSGSSGLTMRHNNALAVSIETDV
The template: voltage gated potassium channel from Shaker
© Marco Punta
Why called shaker?
© Wikipedia
The shaker (Sh) gene, when mutated, causes a variety of atypical behaviors in the fruit fly ... Under ether anesthesia, the fly’s legs will shake ...; it will exhibit aberrant movements. Sh-mutant flies have a shorter lifespan than regular flies; in their larvae, the repetitive firing of action potentials as well as prolonged exposure to neurotransmitters at neuromuscular junctions occurs.
Voltage-gated K+ channel: search
PSI-BLAST: http://www.ncbi.nih.gov/BLAST/
© Marco Punta
Voltage-gated K+ channel: alignment
Shaker: 413 AVYFAEAGSENSFFKSIPDAFWWAVVTMTTVGYGDMTPVGVWGKIVGSLCAIAGVLTIAL 472
            A+Y  E    NS  KS+ DA WWAVVT TTVGYGD+ P    GK++G    + G+  + L
Target: 150 AIYIVEYPDPNSSIKSVFDALWWAVVTATTVGYGDVVPATPIGKVIGIAVMLTGISALTL 209

Shaker: 473 PVPVIVSNF 481
             +  + + F
Target: 210 LIGTVSNMF 218
the alignment
© Marco Punta
~ 30% PIDE over 80 aligned residues: enough?
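Percentage pairwise sequence identity (PIDE) is simply the share of aligned positions that carry the same residue. The sketch below computes it for the excerpt printed on the slide. Only 69 ungapped positions are shown there, so the value for this excerpt need not equal the "~30% over 80 aligned residues" quoted for the full hit; that the full PSI-BLAST alignment is longer is an assumption.

```python
# Alignment excerpt as printed on the slide (Shaker 413-481, target 150-218).
SHAKER = (
    "AVYFAEAGSENSFFKSIPDAFWWAVVTMTTVGYGDMTPVGVWGKIVGSLCAIAGVLTIAL"
    "PVPVIVSNF"
)
TARGET = (
    "AIYIVEYPDPNSSIKSVFDALWWAVVTATTVGYGDVVPATPIGKVIGIAVMLTGISALTL"
    "LIGTVSNMF"
)


def percent_identity(a: str, b: str):
    """Return (PIDE in percent, number of aligned positions).

    Positions where either sequence has a gap ('-') are not counted.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    identical = sum(x == y for x, y in aligned)
    return 100.0 * identical / len(aligned), len(aligned)


if __name__ == "__main__":
    pide, n_aligned = percent_identity(SHAKER, TARGET)
    print(f"{pide:.1f}% identity over {n_aligned} aligned residues")
```

For short alignments, identity in this range falls into the zone where homology cannot be taken for granted, which is why the following slides add independent evidence (membrane helices, selectivity filter, voltage sensor).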
Voltage-gated K+ channel: filter
© Marco Punta
~ 30% PIDE over 80 aligned residues: not quite enough to infer similarity in structure
Target: the entire sequence of the identified protein (residues 1-295)
MSVERWVFPGCSVMARFRRGLSDLGGRVRNIGDVMEHPLVELGVSYAALLSVIVVVVEYT
MQLSGEYLVRLYLVDLILVIILWADYAYRAYKSGDPAGYVKKTLYEIPALVPAGLLALIE
GHLAGLGLFRLVRLLRFLRILLIISRGSKFLSAIADAADKIRFYHLFGAVMLTVLYGAFA
IYIVEYPDPNSSIKSVFDALWWAVVTATTVGYGDVVPATPIGKVIGIAVMLTGISALTLL
IGTVSNMFQKILVGEPEPSCSPAKLAEMVSSMSEEEFEEFVRTLKNLRRLENSMK
© Marco Punta
Voltage-gated K+ channel: function?
Shaker channel
• Membrane protein?
© Marco Punta
Voltage-gated K+ channel:
[Figure: membrane proteins between outside (Out) and inside (In): α-bundle and β-barrel]
© Marco Punta
Voltage-gated K+ channel: TMH predicted
Side View single subunit
Top View tetramer
© Marco Punta
Voltage-gated K+ channel: TMH predicted
[Shaker sequence as on the slide "Voltage-gated K+ channel: sequence", with the predicted segments S1, S2, S3, S4, S5, P and S6 marked]
© Marco Punta
Voltage-gated K+ channel: TMHs predicted
[Target sequence (residues 1-295), with the predicted segments S1, S2, S3, S4, S5, P and S6 marked]
TMHs predictions on the target sequence
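The slides do not say which method predicted the transmembrane helices (TMHs). As a stand-in, the sketch below uses the classic sliding-window hydropathy average with the Kyte-Doolittle scale, a much cruder technique than modern predictors. The window of 19 residues and the threshold of 1.6 are conventional choices for that scale, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19):
    """Average hydropathy for every full window, indexed by window centre."""
    half = window // 2
    return [
        sum(KD[aa] for aa in seq[i - half:i + half + 1]) / window
        for i in range(half, len(seq) - half)
    ]


def candidate_segments(seq: str, window: int = 19, threshold: float = 1.6):
    """Return 1-based (first, last) ranges of window centres above threshold."""
    half = window // 2
    segments, start = [], None
    for offset, score in enumerate(hydropathy_profile(seq, window)):
        centre = offset + half + 1  # 1-based residue number of the centre
        if score >= threshold and start is None:
            start = centre
        elif score < threshold and start is not None:
            segments.append((start, centre - 1))
            start = None
    if start is not None:
        segments.append((start, len(seq) - half))
    return segments


if __name__ == "__main__":
    toy = "D" * 20 + "L" * 25 + "K" * 20  # INVENTED sequence with one helix
    print(candidate_segments(toy))
```

Applied to the target sequence, such a profile gives a first impression of hydrophobic stretches, but a charged helix such as S4 is exactly the kind of segment a pure hydropathy threshold tends to miss.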
© Marco Punta
Voltage-gated K+ channel: function of template
Shaker channel
• Membrane protein
• K+ selectivity?
© Marco Punta
Voltage-gated K+ channel:
[Figure: channel in the membrane between outside (Out) and inside (In), with positive and negative charges]
© Marco Punta
Voltage-gated K+ channel: conservation of outer pore
Shaker: 413 AVYFAEAGSENSFFKSIPDAFWWAVVTMTTVGYGDMTPVGVWGKIVGSLCAIAGVLTIAL 472
            A+Y  E    NS  KS+ DA WWAVVT TTVGYGD+ P    GK++G    + G+  + L
Target: 150 AIYIVEYPDPNSSIKSVFDALWWAVVTATTVGYGDVVPATPIGKVIGIAVMLTGISALTL 209

Shaker: 473 PVPVIVSNF 481
             +  + + F
Target: 210 LIGTVSNMF 218
the selectivity filter
[Figure: channel topology with segments S1-S6 and pore helix P; selectivity filter residues marked between P and S6]
© Marco Punta
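Whether the target keeps the selectivity filter can be checked by looking for the filter signature in both aligned sequences. The motif letters on the slide are garbled in this transcript, so the sketch uses the K+ channel signature sequence TVGYG known from the potassium channel literature; treating exactly this five-residue string as the filter is an assumption of the example.

```python
# Alignment excerpt from the slide, with the residue number of the first position.
SHAKER = ("AVYFAEAGSENSFFKSIPDAFWWAVVTMTTVGYGDMTPVGVWGKIVGSLCAIAGVLTIAL", 413)
TARGET = ("AIYIVEYPDPNSSIKSVFDALWWAVVTATTVGYGDVVPATPIGKVIGIAVMLTGISALTL", 150)

# ASSUMPTION: canonical K+ selectivity filter signature.
SIGNATURE = "TVGYG"


def locate_signature(fragment: str, first_residue: int, signature: str = SIGNATURE):
    """Return the residue number where the signature starts, or None."""
    index = fragment.find(signature)
    return None if index < 0 else first_residue + index


if __name__ == "__main__":
    print("Shaker:", locate_signature(*SHAKER))
    print("Target:", locate_signature(*TARGET))
```

Finding the signature at equivalent alignment positions in template and target supports transferring K+ selectivity, which a bare identity percentage could not do.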
Voltage-gated K+ channel: functional characterization of target
Shaker channel
• Membrane protein
• K+ selectivity
• Voltage gating
© Marco Punta
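Voltage-gated K+ channel: closed
[Figure: channel in the membrane between outside (Out) and inside (In), closed state]
© Marco Punta
Voltage-gated K+ channel: open
[Figure: channel in the membrane between outside (Out) and inside (In) with charges (+/-), open state]
© Marco Punta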
Voltage-gated K+ channel: Conservation of functional residues in target
[Figure: channel topology with segments S1-S6 and pore helix P; the Shaker/target alignment (Shaker 413-481, target 150-218) is shown again with residues in S6 highlighted]
the gating hinge
© Marco Punta
Voltage-gated K+ channel: Conservation of functional residues in target
[Figure: channel topology with segments S1-S6 and pore helix P; positive charges (+) marked on S4; the Shaker/target alignment (Shaker 413-481, target 150-218) is shown again]
voltage sensor
© Marco Punta
Voltage-gated K+ channel: Conservation of functional residues in target
[Figure: channel topology with segments S1-S6 and pore helix P; the Shaker/target alignment (Shaker 413-481, target 150-218) is shown again]
other voltage sensing residues
© Marco Punta
Voltage-gated K+ channel: Function of target
Shaker channel
• Membrane protein
• K+ selectivity
• Voltage gating
© Marco Punta
Roderick MacKinnon’s Nobel Prize
© Wikipedia
Roderick MacKinnon (Rockefeller Univ New York)
Nobel Prize Chemistry 2003: “for structural and mechanistic studies of ion channels”
© Nobel Prize Foundation
[Figure labels: potassium, sodium]
DA Doyle, J Morais Cabral, RA Pfuetzner, A Kuo, JM Gulbis, SL Cohen, BT Chait and R MacKinnon. The structure of the potassium channel: Molecular basis of K+ conduction and selectivity. Science 280 (1998) 69-77.
JH Morais-Cabral, Y Zhou and R MacKinnon. Energetic optimization of ion conduction rate by the K+ selectivity filter. Nature 414 (2001) 37-47.
Y Jiang, A Lee, J Chen, M Cadene, BT Chait and R MacKinnon (2002). Crystal structure and mechanism of a calcium-gated potassium channel. Nature 417, 515-522.
I.2c Function Intro: Function by association
Co-expression
Expression data -> Machine Learning / Clustering -> Functional classes
For example: P Brown et al. (2000) PNAS 97:262-267
© Marco Punta & Yanay Ofran & Burkhard Rost (Columbia New York)
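The co-expression idea (genes with similar expression profiles are assigned to the same functional class) can be sketched with a nearest-profile rule. This stands in for the machine learning / clustering step and is not the method of the cited paper. The expression values, gene names and the correlation cut-off of 0.8 are all invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # flat profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def function_by_coexpression(profile, annotated, min_r=0.8):
    """Transfer the class of the best-correlated annotated gene.

    annotated maps gene name -> (expression profile, functional class).
    Returns (class, gene, r) or None if no gene reaches min_r.
    """
    best = None
    for gene, (expression, function) in annotated.items():
        r = pearson(profile, expression)
        if r >= min_r and (best is None or r > best[2]):
            best = (function, gene, r)
    return best


if __name__ == "__main__":
    # HYPOTHETICAL data: four conditions per gene.
    annotated = {
        "geneA": ([1.0, 2.0, 3.0, 4.0], "ribosome biogenesis"),
        "geneB": ([4.0, 3.0, 2.0, 1.0], "heat shock response"),
    }
    print(function_by_coexpression([1.1, 2.0, 2.9, 4.2], annotated))
```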
Interactions / networks
For example: AH Tong et al. (2002) Science 295: 321-324
© Marco Punta & Yanay Ofran & Burkhard Rost (Columbia New York)
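Function by association in an interaction network is often introduced as a majority vote among annotated interaction partners. The sketch below shows that simple rule on an invented toy network. It does not reproduce the cited study, and it does not distinguish physical from functional interactions.

```python
from collections import Counter


def neighbours(protein, interactions):
    """All interaction partners of a protein in an undirected edge list."""
    partners = set()
    for a, b in interactions:
        if a == protein:
            partners.add(b)
        elif b == protein:
            partners.add(a)
    return partners


def function_by_association(protein, interactions, annotations):
    """Majority vote over the annotated neighbours; None if there are none."""
    votes = Counter(
        annotations[p] for p in sorted(neighbours(protein, interactions))
        if p in annotations
    )
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    # HYPOTHETICAL network and annotations.
    interactions = [("X", "A"), ("X", "B"), ("C", "X"), ("A", "B"), ("Y", "Z")]
    annotations = {"A": "DNA repair", "B": "DNA repair", "C": "transport"}
    print(function_by_association("X", interactions, annotations))
    print(function_by_association("Y", interactions, annotations))
```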
A Bairoch (2000) Nucleic Acids Res 28:304-305
Differentiate functional and physical interaction
Improve accuracy and coverage (data, algorithm)
Ab initio/de novo prediction
Open challenges - function by association
© Marco Punta & Yanay Ofran & Burkhard Rost (Columbia New York)
Sub-cellular localization (nucleus, membrane, etc.)
Post-translational modifications
Functionally important residues
Interaction sites
Predict aspects of function
© Marco Punta & Yanay Ofran & Burkhard Rost (Columbia New York)
Function introduction
• Molecular biology is just at an exciting beginning
• We can compute some aspects of molecular life
• Most accurate inference of function: based on homology
• Homology-based inference of function can be improved by motifs
  problem: definition of motifs still not fully automated
NEXT
• Computing chemistry - enzyme function
• Prediction of subcellular localization
Conclusions today