
[Cover figure: The Composite Modular Protein Function Model. Labels: Cis-relation, Trans-relation, ELM-domain interaction, Domain-domain interaction.]

Linear Functional Modules - Implications for Protein Function

Rune Linding 2004


Linear Functional Modules - Implications for Protein Function

PhD thesis
Rune Linding

EMBL - Biocomputing Unit
Meyerhofstrasse 1
D-69117 Heidelberg
Germany

Heidelberg 2004


Abstract

For many decades, models of protein function have primarily been described in terms of functional modules known as globular domains. However, a large group of functional sites has been revealed over the last 15 years or so, and only recently have these begun to be catalogued. These are linear modules, and they encompass ligand sites such as 14-3-3, SH3 and Cyclin ligands, as well as post-translational modification sites and targeting signals. Linear modules are short and co-linear in both sequence and structure space. Their short length makes them difficult to detect based on sequence alone. Experimentally, they are neglected because they reside in disordered or unstructured parts of proteins, which are often removed recombinantly during protein expression. Yet linear modules are as important for protein function as globular domains. It is now clear that linear modules make up a very large component of the cellular regulatory networks, as they are ligands for many signalling domains and proteins.

Although linear modules cannot be detected accurately from sequence alone, their functionality is strongly dependent on context; e.g. a linear module may only be functional in a restricted set of cellular compartments. Such contextual information can be utilised in prediction.

The Eukaryotic Linear Motif computational resource (ELM, http://elm.eu.org), developed for predicting functional sites, is knowledge based and stores contextual information concerning linear functional sites annotated from the scientific literature. This information is later used for contextual and logical filtering in the prediction of linear functional modules.
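The filtering idea can be illustrated with a minimal Python sketch. It assumes that a linear motif is written as a regular expression and annotated with the cellular compartments in which it may be functional. The motif name, the pattern `P..P` (a commonly cited minimal core of SH3 ligands), the compartment labels and the example sequence are illustrative assumptions, not entries taken from ELM, and the filter shown is far simpler than anything a real resource would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Motif:
    name: str
    pattern: str             # regular expression over one-letter amino acid codes
    compartments: frozenset  # compartments in which the motif may be functional


# Hypothetical annotation, for illustration only.
SH3_LIKE = Motif("SH3-like ligand (illustrative)", r"P..P", frozenset({"cytosol"}))


def scan(sequence, motif):
    """Return (start, end, matched_text) for every match, including overlapping ones."""
    # Wrapping the pattern in a lookahead lets finditer report overlapping matches.
    lookahead = re.compile(f"(?=({motif.pattern}))")
    return [(m.start(1), m.end(1), m.group(1)) for m in lookahead.finditer(sequence)]


def contextual_filter(hits, motif, protein_compartments):
    """Keep the hits only if the protein shares a compartment with the motif."""
    if motif.compartments & set(protein_compartments):
        return hits
    return []


seq = "MAPLPPKPSTAGG"  # made-up sequence
hits = scan(seq, SH3_LIKE)
print(hits)
print(contextual_filter(hits, SH3_LIKE, ["extracellular"]))
```

A sequence match alone yields two candidate sites here; the same candidates are discarded once the protein is annotated as extracellular, which is the kind of reduction in false positives that contextual information is meant to provide.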

Linear modules are typically found in unstructured parts of proteins, and hence two methods, DisEMBL (http://dis.embl.de) and GlobPlot (http://globplot.embl.de), were developed for ab initio detection of protein disorder from sequence alone. These methods are used to reduce the search space in the ELM resource. The methods are also used by the structural proteomics community for setting up expression constructs and by researchers studying intrinsically disordered or unstructured proteins.
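How a disorder prediction reduces the search space can be sketched as a simple interval filter. The sketch below assumes that a disorder predictor reports disordered regions as half-open (start, end) intervals; the intervals, candidate hits and sequence length are hand-written placeholders, not output from DisEMBL or GlobPlot, and the function names are hypothetical.

```python
def within_disorder(hit_start, hit_end, disordered_regions):
    """True if the half-open span [hit_start, hit_end) lies inside one disordered region."""
    return any(start <= hit_start and hit_end <= end for start, end in disordered_regions)


def restrict_to_disorder(hits, disordered_regions):
    """Discard candidate motif hits that overlap predicted globular sequence."""
    return [(s, e) for s, e in hits if within_disorder(s, e, disordered_regions)]


def search_space_fraction(disordered_regions, sequence_length):
    """Fraction of the sequence that still has to be searched for linear motifs."""
    return sum(end - start for start, end in disordered_regions) / sequence_length


disordered = [(0, 12), (40, 55)]               # placeholder predictions
hits = [(2, 6), (10, 14), (41, 45), (60, 64)]  # placeholder candidate motif matches
kept = restrict_to_disorder(hits, disordered)
print(kept)
print(search_space_fraction(disordered, 100))
```

Requiring a hit to lie entirely inside a disordered region is a deliberately strict choice; a more permissive variant could accept partial overlap.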

In the post-genomic era, analysis of cellular protein based regulatory networks and systems is of increasing importance. Yet we do not know the role played by the topologies, functional modules and different protein-interaction networks in cellular systems. Based on the increasing number of different linear modules catalogued by ELM, one can anticipate the existence of large protein-protein networks within the cell, dominated by interactions between globular domains and linear modules, e.g. SH3 ligands. By systematic proteomic scale analysis of such networks, deconvolution into defined higher order system modules can now be commenced. In order to do this, a model for composite protein function is proposed, and it is shown how higher order functional relationships between the individual functional modules may be used to infer protein function.


Dansk Resume (Danish Summary)

For decades, models of protein function have been based on functional modules known as globular domains. A very large group of biochemically functional sites has been uncovered over the last 15 years, but only recently have these been mapped and catalogued. These sites are called linear modules and include protein ligands such as 14-3-3, SH3 and Cyclin ligands, post-translational modification sites and signal peptides. Linear modules are short and co-linear in both sequence and structure space. Their short length makes them very difficult to detect from sequence alone. Experimentally they are also difficult to investigate, because they reside in disordered or unstructured parts of proteins, which are most often removed recombinantly in connection with protein expression.

Linear modules are just as important for a protein's function as the globular domains. Furthermore, we now know that these modules make up a very large part of the cellular regulatory networks, primarily because they act as ligands for many signalling proteins and domains.

Although linear modules cannot be detected effectively from sequence, their functionality is strongly dependent on the context they are in; for example, a linear module may be functional only in particular cell organelles. This kind of contextual information can be used to improve the prediction of linear modules.

The Eukaryotic Linear Motif resource (ELM, http://elm.eu.org) is a computational resource developed for the prediction of linear functional sites. The resource is based on contextual information about linear modules that has been annotated and extracted from the scientific literature. These data sets are stored in a relational database and later used for contextual and logical filtering of the predictions of linear modules.

Linear modules are typically found in unstructured parts of proteins; we have therefore developed two methods, DisEMBL (http://dis.embl.de) and GlobPlot (http://globplot.embl.de), for the detection of protein disorder from sequence alone. These methods are used to restrict the search space for linear modules in the ELM resource. The methods are also used in structural genomics to design expression vectors and by researchers working with intrinsically disordered proteins.

In the post-genomic era, the analysis of cellular protein-based regulatory networks and systems is becoming ever more important. We still do not understand the role that topologies, functional modules and different protein-protein interaction networks play in cellular systems. Based on the growing number of linear modules catalogued in the ELM project, we can now begin to analyse the large intracellular interaction networks found in the cell that are primarily based on interactions between a linear module and a globular domain, for example SH3 signalling networks.

Through systematic proteome-scale analyses of these networks we can describe them in terms of higher-order system components. A model for composite protein function is therefore proposed, and it is shown how higher-order functional relations between the individual functional modules can be used to describe protein function.


Preface

This work was carried out at the European Molecular Biology Laboratory (EMBL) in the lab of Toby Gibson. This thesis is of the cumulative type, i.e. it is a collection of the essential publications I worked on during the course of my PhD. The thesis is organised in four sections. Following the introduction, the focus is on linear functional modules and then the Eukaryotic Linear Motif (ELM) resource. The third section is focused on the ab initio methods for prediction of protein disorder. Finally, the fourth section deals with protein diseases and a hypothesis for novel composite protein function models.


Acknowledgements

Below I wish to thank some of the many people that have contributed to this work. It has been a pleasure to work and do research with you. The 3.5+ years @EMBL could not have been as good without you.

• My great supervisor, Toby Gibson, from whom I learned a lot and who gave me the freedom to follow my own path.

• Thanks to Rob Russell for being a great second supervisor. Rob, we will continue to work together in the future.

• Thanks to Søren Brunak for helping to arrange the joint degree between EMBL and DTU. I am grateful for you writing the Nysyn series book on Neural Networks in the late 1980’s which inspired me when I was 14.

• Thanks to Luis Serrano for inviting me to his group at EMBL just to experience that I left the group before I even arrived :-) Luis, I enjoy all the projects we have been doing ever since.

• Thanks to Peer Bork for cool annual ‘retreats’ to the snowy white Alps. SMART rules!

• Ivica Letunic, thanks for SMARTer and good skiing.

• Federico De Masi for drinking too many beers with me, many ideas came from it.

• Will Stanley for disorder and philosophy.

• Thanks to all of ELM, it has been a great pleasure to work with all of you. Since we count about twenty people I can only mention a few. Rein and Pål, you have been very important people to me, thanks for everything. David, you have been a cool computer shark to have around.

• Thanks to Francesca Diella for being a great colleague and for commenting on all this work.

• Jesper Borg, you will be the first to fold a protein in mind :-) It has been a pleasure to feel the rush of physics again.

• Thanks to Kresten Lindorff-Larsen for being a great friend and researcher.


• Thanks to Lars Juhl Jensen for great collaborations and for helping with the thesis in your usual critical and straight-to-the-point manner. I appreciate the time we have had together at EMBL very much. I guess a headset is needed for doing trans-atlantic work in the future :-)

• I am deeply grateful to FreeBSD.org, (bio)Python.org, PostgreSQL.org, Debian.org, Gentoo.org, Apache.org and Linus Torvalds for fantastic open source and free software. Almost no part of this work could have been done without these resources from the open source community.

• Thanks to the LyX team for developing the software this thesis was written in.

• Thanks to the EU for the ELM grant QLRI-CT-2000-00127.

• My wife Sara.


Publications

Publications Included in this Thesis

Paper I

Puntervoll P, Linding R, Gemund C, Chabanis-Davidson S, Mattingsdal M, Cameron S, Martin DM, Ausiello G, Brannetti B, Costantini A, Ferre F, Maselli V, Via A, Cesareni G, Diella F, Superti-Furga G, Wyrwicz L, Ramu C, McGuigan C, Gudavalli R, Letunic I, Bork P, Rychlewski L, Kuster B, Helmer-Citterich M, Hunter WN, Aasland R, Gibson TJ.
ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins.
Nucleic Acids Res. 2003 Jul 1; 31(13): 3625-30.

Draft manuscript I

Linding R et al.
ELM - The Eukaryotic Linear Motif database.
In preparation - To be submitted to JMB 2004

Paper II

Linding R, Russell RB, Neduva V, Gibson TJ.
GlobPlot: Exploring protein sequences for globularity and disorder.
Nucleic Acids Res. 2003 Jul 1; 31(13): 3701-8.

Paper III

Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB.
Protein disorder prediction: implications for structural proteomics.
Structure (Camb). 2003 Nov; 11(11): 1453-9.

Paper IV

Linding R, Schymkowitz J, Rousseau F, Diella F, Serrano L.
Beta aggregation and protein disorder: Unfolded with reason?
JMB 2004 - Submitted


Draft manuscript II

Linding R et al.
Linear Functional Modules - Composite Systems of Protein Function.
In preparation - To be submitted to PLoS Biology 2004

Book chapter

Linding R, Letunic I, Gibson TJ and Bork P.
Computational analysis of Modular Protein Architectures.
Wiley-VCH GmbH April 2004 Ed.: Cesareni G, Gimona M, Sudol M & Yaffe M.
Modular Protein Domains - Chp. 13 - Accepted

Additional Publications

1. Bimpikis K, Budd A, Linding R, Gibson TJ.
   BLAST2SRS, a web server for flexible retrieval of related protein sequences in the SWISS-PROT and SPTrEMBL databases.
   Nucleic Acids Res. 2003 Jul 1; 31(13): 3792-4.

2. Aasland R, Abrams C, Ampe C, Ball LJ, Bedford MT, Cesareni G, Gimona M, Hurley JH, Jarchau T, Lehto VP, Lemmon MA, Linding R, Mayer BJ, Nagai M, Sudol M, Walter U, Winder SJ.
   Normalization of nomenclature for peptide motifs as ligands of modular protein domains.
   FEBS Lett. 2002 Feb 20; 513(1): 141-4.

3. Diella F, Cameron S, Gemund C, Linding R, Brunak S, Blom N, Gibson TJ.
   Phospho.ELM: A database of experimentally verified protein phosphorylation sites in Eukaryotic proteins.
   Genome Biology - Submitted

Work in Progress

1. Neduva V, Linding R, Gibson TJ, Russell RB.
   Discovery of linear functional modules.
   Nature Genetics or PNAS - In preparation

2. Reincke U, Linding R et al.
   TopicScore - SAS Enterprise powered text-mining engine for biosciences.
   In preparation

3. Schymkowitz J et al.
   Fold attractors correspond to energy profiles.
   In progress


4. Linding R, Borg J, Schymkowitz J, Serrano L.
   Forcefields and linear functional modules.
   In progress

5. De Masi F, Linding R, Neduva V, Russell RB.
   Global network analysis of linear motifs.
   In progress


Contents

Abstract
Dansk Resume
Preface
Acknowledgements
Publications

I Introduction

1 Protein Architecture
  1.1 The modular model of protein function
  1.2 Partitioning of protein space

2 Analysing globular domains
  2.1 Globularity of domains
  2.2 Resources for analysis of globular domains
  2.3 Other globular features and resources
    2.3.1 Globular repeats
    2.3.2 Domain-domain interactions
    2.3.3 No domains?

3 Non-globular protein segments
  3.1 Narrowing the search space
  3.2 Linear functional modules
  3.3 Available resources for linear functional modules
  3.4 Linear functional modules are important
  3.5 Two worlds

II Linear Protein Space

4 Functional Sites and Linear Motifs
  4.1 Coming to terms with linear motifs - ELM jargon and concepts
    4.1.1 Functional sites
    4.1.2 Linear motif - an ELM
    4.1.3 Regular expressions
    4.1.4 ELM instances
    4.1.5 ELM knowledge
    4.1.6 Classes of functional sites - ELM identifiers
  4.2 Check-list for functional sites
  4.3 Linear motifs - a structural gallery
  4.4 Are linear motifs only found in Eukaryotes?
  4.5 Low-complexity regions - Hyper ELM clusters
  4.6 Conformation of the functional sites
    4.6.1 Affinity and specificity of ELMs
    4.6.2 Helical - in solution and bound
    4.6.3 Beta conformation - bound and unbound?
    4.6.4 Linus Pauling would not be surprised
  4.7 ELM evolution
  4.8 Why ELM?

5 The Design of ELM
  5.1 Conceiving ELM - beyond sequence analysis
  5.2 The good, the bad and the .[DE].[VIL]
  5.3 elmDB - a relational database
  5.4 The prototype server
  5.5 The current version of ELM
  5.6 ELM - the future
    5.6.1 Plenty of room for improvement
    5.6.2 Formalising ELM
  Paper I
  5.7 ELM Server: A New Resource for Investigating Short Functional Sites in Modular Eukaryotic Proteins
  Draft I
  5.8 ELM - The Eukaryotic Linear Motif database
    5.8.1 ELM Resource architecture
    5.8.2 Prediction strategy
    5.8.3 Database schema of elmDB
  5.9 ELM resource content - a survey
    5.9.1 Mapping ELMs onto Gene Ontology
    5.9.2 The taxonomic distribution of linear motifs
  5.10 Two filter types: contextual and logical
    5.10.1 Logical filters - a result of compartmentalisation?
    5.10.2 Structure based contextual filtering
  5.11 ELM entries
    5.11.1 Molecular function classes - LIG, MOD, TRG, CLV and ISO
  5.12 ELM annotation - Siteseeing
    5.12.1 Instance data - an invaluable resource for biologists
    5.12.2 Mining for ELMs - creating contextual profiles
    5.12.3 Building sequence models

III Intrinsic Protein Disorder

6 Ab initio Prediction of Protein Disorder
  6.1 What is protein disorder?
  6.2 What role does protein disorder play in biology?
    6.2.1 Intrinsically Disordered Proteins (IDPs)
    6.2.2 Target selection
    6.2.3 Protein function and disorder?
  6.3 Hypothesis vs data driven
  6.4 Data sources for protein disorder
    6.4.1 X-ray data
    6.4.2 NMR data
    6.4.3 Other data types
  6.5 Old and novel definitions of disorder
    6.5.1 Hot loops - a novel disorder definition
    6.5.2 RUSSELL/LINDING propensities
  6.6 Methods for finding protein disorder
  Paper II
  6.7 GlobPlot: Exploring Protein Sequences for Globularity and Disorder
  Paper III
  6.8 Protein Disorder Prediction: Implications for Structural Proteomics

7 Practical Disorder Prediction
  7.1 GlobPlotting
    7.1.1 Analysing a globplot
  7.2 Prediction of multiple types of disorder with DisEMBL
    7.2.1 Differences between predictors in DisEMBL
    7.2.2 Using DisEMBL
  7.3 GlobPlot or DisEMBL
  7.4 Design of protein expression vectors
  7.5 Bridging to prediction of linear motifs
  7.6 Structural flexibility - order/disorder transitions

IV Function & Disease

8 Protein Disorder & Disease
  8.1 Protein Diseases - conformational diseases
  8.2 Aggregation, disorder and disease
  8.3 Disease and linear motifs
  Paper IV
  8.4 Beta Aggregation and Protein Disorder: Unfolded with Reason?

9 Composite Protein Function Models
  9.1 The motivation and foundation
  9.2 Interaction connects functions
    9.2.1 Physical interaction classification
    9.2.2 Cis/Trans relations - abstract function networks
  9.3 Composite models
  Draft II
  9.4 Linear Functional Modules - Composite Protein Function Systems
  9.5 Cellular pathway wide analysis using ELM
  9.6 Transient interaction networks?
  9.7 Abstract functional inference
  9.8 Composite Modular Protein Function Systems - proposal

Epilogue

V Additional Information

URLs
Database Scheme
List of Linear Motifs
List of Figures
List of Tables
Bibliography


Part I

Introduction


Chapter 1

Protein Architecture

Problems worthy of attack prove their worth by hitting back.
Piet Hein - Danish poet and scientist

Nature seems to present protein functions in two states: structured and unstructured. Proteins are heteropolymers of amino acids; the sequence of amino acids determines not only the structure and folding of a protein but also the lack of structure. Molecular functions in proteins are often found in structural units, e.g. modular globular domains. However, a large, emerging group of functional sites is found primarily in unstructured parts of proteins.

1.1 The modular model of protein function

Multidomain proteins predominate in eukaryotic proteomes. The basic hypothesis in what one might call the ‘modular model for protein function’ is that individual functions assigned to different sequence segments (often domains) combine to create a complex function for the whole protein. The term ‘modular’ refers directly to the autonomous nature of the individual folding units that determine these functions, whereas the term ‘globular’ describes the structural state of a domain, whether it is modular or not. A dogma in structural biology is that atomic structure determines function. The modular model has grown out of this paradigm; however, we know that at the fold level this is not always true. One can find structures belonging to the same fold (e.g. in SCOP [141]) that have completely different functions. An example is glutathione S-transferase and S-crystallin: they share 75% sequence similarity and the same fold, but the former is an enzyme and the latter a structural protein [7]. This is mainly a problem when one is trying to infer function from sequence; having the atomic structure solved, together with supporting biochemical data, will sometimes help to define the function.

As single-domain globular proteins are typically less difficult to crystallise, for a long time they dominated the perception of typical protein structure (although fibrous proteins like collagen were of course well known). Gradually, as protein sequences have accumulated, the monodomain view of protein structure has been replaced by the realisation that most proteins are multidomain, at least in higher eukaryotes. Multidomain architectures are usual for transmembrane receptors, signalling proteins, cytoskeletal proteins, chromatin proteins, transcription factors and so forth. Multidomain proteins can be described as consisting of a series of functional modules. The modules come in two classes: globular domains and short peptide functional sites. An archetypal protein of the modular model is Src, shown in Figure 1.1. Src consists of 3 globular domains, SH3, SH2 and a tyrosine kinase domain (itself consisting of two structural domains), and 8 known functional sites, including 4 phosphorylation sites and 3 ligands of modular domains (SH3, SH2 and Cyclin).

Figure 1.1: A schematic showing the domain and functional site architecture of the well-known proto-oncogenic protein kinase Src. About 60-80% of proteins from higher eukaryotes have analogous modular architectures [10]. Even though ‘only’ ~30 000 ORFs are predicted to be in the human genome, most of these will have several splice variants and, moreover, several functional sites, e.g. post-translational modification (PTM) sites. These sites will be in different states and thereby increase the number of different functional species (isoforms) several fold. The arrows show only the approximate locations of the functional sites.
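The remark accompanying Figure 1.1, that functional sites in different states multiply the number of functional species, can be made concrete with a small Python sketch using the Src counts given in the text. It assumes, purely for illustration, that each phosphorylation site is either phosphorylated or unmodified and that the sites vary independently; real sites may be coupled, so the result should be read as an upper bound under that assumption, not as a measured number of Src isoforms.

```python
from itertools import product

# Counts taken from the text; the binary, independent-site model is an assumption.
globular_domains = ["SH3", "SH2", "tyrosine kinase"]
n_functional_sites = 8
n_phosphorylation_sites = 4


def phospho_states(n_sites):
    """Enumerate every combination of phosphorylated (True) / unmodified (False) sites."""
    return list(product((False, True), repeat=n_sites))


states = phospho_states(n_phosphorylation_sites)
print(len(states))  # 2**4 = 16 combinations under the stated assumption
```

Even this restricted model gives 16 distinguishable species for a single splice variant, before the remaining functional sites are considered.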


Figure 1.2: A conceptual model of the observation space of protein structure and function. Many functions and their structures can be assigned to two different subspaces: a linear/non-globular one and a globular one. It is important to notice that these two subspaces are not detached from each other; a good example is disordered loops that protrude from a globular domain. At the (white) borderline we find coiled-coils, repeat proteins (often forming rods) and single transmembrane helices (TM1).

1.2 Partitioning of protein space

It is becoming increasingly clear that many functionally important protein segments occur outside globular domains [236, 69]. The observation space of protein structure and function is partitioned into two subspaces (see Figure 1.2). The first consists of globular units with binding pockets, active sites and interaction surfaces. The second subspace contains non-globular segments such as sorting signals, post-translational modification sites and protein ligands (e.g. SH3 or WW ligands). Globular units are built of regular secondary structure elements and contribute the majority of the structural data deposited in the PDB. The globular function space is described very well by domain databases such as SMART [134] and Pfam [20]. In contrast, the non-globular subspace encompasses disordered, unstructured and flexible regions without regular secondary structure. Functional sites within the non-globular space are known as linear motifs and are described by ELM [179], PROSITE [200] and Scansite [241]. This group of sites includes peptide ligand sites, cell compartment targeting signals, post-translational modification sites and cleavage sites. In this work I will refer to these using terms such as linear motifs, functional sites or linear functional modules; these terms will be defined later, but they all refer to the same functional entities.
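The partitioning can be written down as a small lookup structure. The Python sketch below is only a restatement of the feature types named in this section and in the Figure 1.2 caption; the class name, the dictionary and the `partition` helper are hypothetical names chosen for illustration, and treating the borderline cases as a third category is a simplification of the figure.

```python
from collections import defaultdict
from enum import Enum


class Subspace(Enum):
    GLOBULAR = "globular"
    NON_GLOBULAR = "linear/non-globular"
    BORDERLINE = "borderline"


# Feature types as listed in the text; the mapping itself is an illustrative simplification.
FEATURE_SUBSPACE = {
    "binding pocket": Subspace.GLOBULAR,
    "active site": Subspace.GLOBULAR,
    "interaction surface": Subspace.GLOBULAR,
    "sorting signal": Subspace.NON_GLOBULAR,
    "post-translational modification site": Subspace.NON_GLOBULAR,
    "peptide ligand site": Subspace.NON_GLOBULAR,
    "cleavage site": Subspace.NON_GLOBULAR,
    "coiled-coil": Subspace.BORDERLINE,
    "repeat protein": Subspace.BORDERLINE,
    "single transmembrane helix": Subspace.BORDERLINE,
}


def partition(features):
    """Group a list of feature types by the subspace they are assigned to."""
    groups = defaultdict(list)
    for feature in features:
        groups[FEATURE_SUBSPACE[feature]].append(feature)
    return dict(groups)


print(partition(["active site", "sorting signal", "coiled-coil", "cleavage site"]))
```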


Traditionally, protein function and interaction studies have focused on functional and structural domains, and in fact most large datasets dealing with protein interaction have also focused on this type of interaction [4]. This is because methods such as affinity purification tend to isolate ‘sticky’ and ‘non-transient’ protein complexes. The methods for isolating transient interactions, such as the binding of a cognate ligand to its modular domain, are different, and much smaller in vivo datasets exist. However, the non-transient network is only a subpart of the interaction networks within the cell. It is the author’s hope that this work will encourage the communities to focus on collecting large amounts of data on domain-peptide interactions, since only with these can we try to get a more complete picture of cellular protein networks.

In the following I will discuss how to annotate and analyse protein sequences for modular domains and linear functional sites. Methods for finding domains and predicting their function will be described first, followed by a description of how to identify unstructured regions and potential functions therein.


Chapter 2

Analysing globular domains

The past three decades have seen relatively steady levels of domain discovery [49]. It seems likely that most of the more common modular protein domains have already been described in the literature (see e.g. signalling domains and their number of occurrences in the human genome, Figure 2.1). However, as there is a large number of domains that are only detectable in a relatively small number of proteins, we predict that many domains remain to be discovered and that each genome may harbour its own repertoire of species-specific or at least lineage-specific domains [67].

[Figure 2.1: bar chart titled "SMART domains: number of occurrences in human proteome"; x-axis: SMART domain; y-axis: number of proteins, with tick marks at 10, 100 and 1000.]

Figure 2.1: Presence of SMART domains in the H. sapiens proteome. Only some domain names are shown. The most abundant domains are ANK (ankyrin repeats), WD40 and ZnF_C2H2 (out of scale). Data from Ivica Letunic [personal communication].

Since the late 1970s, similarity search has been an extremely powerful computational technique for assigning novel functions to proteins. In addition to the database search methods that were introduced in the early 1980s, the recognition of conserved entities such as local motifs was incorporated in the mid 1980s into software able to scan dedicated collections of such motifs (see e.g. PROSITE, 1985, for an early resource).

2.1 Globularity of domains

Both ‘domain’ and ‘globular’ are terms defined in structural protein biology. They have since been used in other contexts such as function description and sequence analysis. Domains were first defined in structural terms, as early X-ray structures showed separate entities with defined subfunctions connected by flexible regions. The term ‘globular’ refers to the globule structural state: a protein globule can be depicted as a soluble sphere with a hydrophobic core. An operational definition of globular proteins can be found in [53]:

“Most natural proteins in solution are much smaller in their dimensions than comparable polypeptides with random or repetitive conformations and have roughly spherical shapes; hence they are generally referred to as globular. Their physical properties do not change gradually as the environment is altered (e.g., by changes in temperature, pH or pressure) as do the properties of random polypeptides. Instead, globular proteins usually exhibit little or no change, until a point is reached at which there is a sudden drastic change and, invariably, a loss of biological function. This phenomenon is known as denaturation.”

Structural biologists relate the term globular to domains which are compact and fold independently of the remainder of the host protein. Theoretical definitions of domains exist [210, 209, 212], including several algorithms for classifying domains and folds [211, 118].

Resources for finding globular domains with determined tertiary structure include SCOP (fold level classification) [141], ASTRAL (domains from SCOP) [39], SUPERFAMILY (Hidden Markov Models of SCOP superfamilies) [92] and CATH (architecture level classification) [99, 164]. However, these resources are confined to the structural knowledge base PDB [231]. Therefore only a subset of the total fold and structure space is described; the remaining domains are to a certain extent described in Pfam and SMART.


2.2 Resources for analysis of globular domains

There are numerous domain databases available that can be useful for detection and analysis of globular domains in your favourite protein sequence. They can be separated into several categories:

Databases such as SMART (http://smart.embl.de, [134]) and PFAM (http://www.sanger.ac.uk/software/pfam/, [20]) primarily make use of hand-edited sequence alignments representing single protein domains with defined borders at the sequence level. The Conserved Domain Database (CDD) at the NCBI (http://www.ncbi.nlm.nih.gov/structure/cdd/cdd.shtml, [148]) allows the searching of profiles derived from SMART and PFAM using a modified version of the BLAST algorithm.

PROSITE (http://www.expasy.org/prosite/, [200]) is a manually annotated resource and contains a much more heterogeneous set of domains and structural motifs. However, for linear motifs it has been superseded by ELM and Scansite, refer to 3.3.

Another approach is to rely on various automatic methods to generate domain signatures. This is the case for databases such as ProDom (http://www.toulouse.inra.fr/prodom.html, [198]) and BLOCKS (http://www.blocks.fhcrc.org/, [103]). Such resources predict domains that do not always correspond to known structural and globular domains, and for the purpose of detecting those they may not be as sensitive. However, these resources are of substantial discovery value since they collect conserved sequence segments that may specify novel functions.

In addition to the search functionality of the databases themselves, there are several ‘meta servers’ that allow users to search multiple domain databases. The InterPro database at the EBI (http://www.ebi.ac.uk/interpro/, [156]) allows the searching of the PROSITE, PFAM, PRINTS, ProDom and SMART model collections.

2.3 Other globular features and resources

Many other types of structural entities, besides those corresponding to domains, are globular in nature. These include protein repeats (e.g. Ankyrins), some transmembrane domains and coiled-coil super helices. Many function as structural scaffolds for different cellular components and rod proteins.

2.3.1 Globular repeats

Repeats can be hard to detect in protein sequences: most of them are short and their sequences are often highly divergent. The number of repeats in different proteins is extremely variable. Finally, defining the first and last residues of a repeat is more contentious than for a domain, since repeats are more prone to circular permutation than are domains, particularly within closed structures [187], and partial truncation results in non-integer repeat numbers.


Repeat detection methods are often incorporated into domain prediction servers (for example, SMART uses the Prospero program [155] from the ARIADNE package), but there are also dedicated protein repeat prediction servers, for example REPeats (http://www.embl-heidelberg.de/~andrade/papers/rep/search.html, [8]). The GlobPlot method described in 7.1 on page 109 is also capable of repeat detection.

2.3.2 Domain-domain interactions

Homology-based methods are a primary source of globular domain discovery. However, another approach is to use domain-domain interactions as represented by physical interaction data sets. Given that protein-protein interactions involve physical interactions between protein domains, domain-domain interaction information can be useful for validating, annotating and even predicting protein interactions [160]. Naturally, protein interaction databases such as BIND, MINT and DIP [17, 238, 242] are essential for such studies. These resources are based on proteomic studies of protein-protein interactions, e.g. by yeast two-hybrid or peptide libraries.

Another type of service is offered by the STRING database (http://string.embl.de/), which is dedicated to proteome-wide prediction of protein-protein associations [224]. It is an integrated resource relying on a wide range of experimental and computational datasets in order to make reliable interaction predictions. STRING contains genomic context associations (derived from genome comparisons), interactions derived from co-expression analysis and various types of high-throughput experimental data, all of which are stringently benchmarked using a common reference. However, it is important to note that association does not always imply physical interaction, refer to Chapter 9 on page 137.

Protein structure determination is a very accurate way of finding and describing protein-protein interactions [5, 4], [Aloy et al., in press, Science 2004].

2.3.3 No domains?

If no domains are found by e.g. SMART or Pfam, this does not mean that your favourite protein does not contain any higher fold or globular domains. Most often it simply indicates the presence of non-annotated domains, which of course have potentially higher discovery value. Several resources exist to help identify potential domain boundaries and give hints to the structure of what might be hidden in the sequence.

Secondary structure prediction - good

Prediction of secondary structure is the most mature of any structure prediction strategy and accuracies of up to ~80% can be reached in some cases [175, 54, 55]. An initial BLAST search to find homologous proteins is important to get a further idea of the function and to build a sequence set for a multiple alignment that can be used in secondary structure prediction servers such as PredictProtein (PROFsec, http://www.predictprotein.org/) and JPRED (http://www.compbio.dundee.ac.uk/~www-jpred/), [55, 183].

Tertiary structure prediction - difficult

Prediction of tertiary structure and fold is still error-prone and difficult; however, having good secondary structure predictions at hand can assist this analysis as well. Perhaps the best approach is to submit the sequence to one of the homology-based prediction servers such as the 3D-JURY metaserver (http://bioinfo.pl/Meta/, [90]) or SWISS-MODEL (http://www.expasy.org/swissmod/SWISS-MODEL.html, [193]). Other resources can be found on the websites for the evaluation competitions CASP (http://predictioncenter.llnl.gov/casp5/Casp5.html) and CAFASP (http://bioinfo.pl/cafasp/). Recently David Baker’s group managed to design, at atomic level, a novel protein fold in silico [130]. This is a large leap forward in an old field and it will likely advance structure prediction methods in the years to come.

Other sequence features - narrowing down domain boundaries

Single transmembrane segments (TM1), coiled-coils and low-complexity regions are all incorporated in the SMART server. Sometimes low-complexity regions are disordered (see 6.6 on page 80). Coiled-coils are also disordered sequences; however, they behave as globular units after the coiled-coil structure is formed. This is a very clear example of a disorder-order transition.

At EMBL in Heidelberg we have three additional methods that are useful in the definition of potential domain boundaries: DisEMBL [138], DomCut [208] and GlobPlot [139], see 7.1.1 on page 110. Another resource for potential domain boundary prediction is DomPred (http://bioinf.cs.ucl.ac.uk/dompred/, [150]).


Chapter 3

Non-globular protein segments

Opposites are not contradictory but complementary.
Niels Bohr

As most attention in assigning function to proteins has been on globular domains, there are relatively few tools for analysing non-globular protein space. Structural biology has tended to avoid unstructured proteins and regions (e.g. by removing them in expression constructs), which has led to a skew towards globular proteins. However, this neglect is not confined to structural biology: bioinformatics has also tended to sweep non-globular function prediction under the academic carpet. While resources are readily available for revealing globular domains in sequences, until recently there has not been any comprehensive collection of short functional sites/motifs comparable to the globular domain resources, yet these are just as important for the function of multidomain proteins. Indeed, it is impossible for a researcher to find a list of currently known motifs, while going through the literature to retrieve them is impractical without foreknowledge in more areas than any one person will have. This neglect is primarily due to the fact that matches to short sequence motifs are statistically insignificant and difficult to handle compared to domains, for which accurate sequence models can be produced. The insignificance is due to the weak consensus sequences and the short length of these motifs.

In some cases an identical sequence motif is functional in one context but not in another. This is the case for the RGD motif that is bound by Integrins [217]. The SH3 ligand site in the Nef protein, shown in Figure 5.1 on page 37, is another good example of this. The two predicted SH3 ligand sites are almost identical in sequence; however, one is partially covered by a loop, which makes it inaccessible to the binding partner. This situation could not be resolved by any sequence analysis method: it is a contextual problem. It follows that it is not the detection of the motif which presents the real problem; rather, it is the fact that the context of the predicted site is incorrect, which renders the site non-functional.


3.1 Narrowing the search space

The approach for finding functional sites is fundamentally different from the one described above for globular domains. Since linear motifs are short, typically 5-10 amino acids, their sequence models overpredict massively (high level of false positives), even if the motifs are described using artificial neural networks or other sensitive probabilistic methods. Linear motifs are context dependent: they are only functional if they are exposed to interaction with a modular domain or are in the right cell compartment. Structurally, however, they prefer to be in non-globular or disordered regions of the protein, both of which can be detected fairly accurately. Recently, it has become possible to analyse natively unstructured proteins by methods such as NMR. Besides their high content of functional sites, disordered and non-globular regions are exciting for many other reasons. We therefore chose to create methods, refer to Chapters 6 on page 73 and 7 on page 109, for narrowing down the sequence search space for linear motifs by identifying regions with a higher likelihood of containing these modules, namely unstructured or disordered regions.
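The idea of narrowing the search space can be sketched in a few lines of Python. This is a minimal illustration, not the method developed in Chapters 6 and 7: the toy sequence and the disordered interval are invented, and the function names are hypothetical. The disordered regions are simply assumed to come from some disorder predictor.

```python
import re
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based residue indices


def find_motif(pattern: str, sequence: str) -> List[Interval]:
    """Return the span of every match of pattern, including overlapping ones."""
    # The lookahead lets matches overlap; group 1 holds the matched text.
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start(1), m.end(1)) for m in regex.finditer(sequence)]


def inside_any(span: Interval, regions: List[Interval]) -> bool:
    """True if span lies entirely within one of the regions."""
    return any(start <= span[0] and span[1] <= end for start, end in regions)


def filter_by_disorder(matches: List[Interval],
                       disordered: List[Interval]) -> List[Interval]:
    """Keep only the motif matches located in predicted disordered regions."""
    return [m for m in matches if inside_any(m, disordered)]


# Toy sequence (invented): the first R..P..P match sits in a region assumed
# to be globular, the second in a region assumed to be disordered.
SEQ = "AAA" + "RLLPVVP" + "A" * 6 + "GSGS" + "RSEPTQP" + "GSGS"
DISORDERED = [(16, 31)]  # assumed output of a disorder predictor

all_matches = find_motif("R..P..P", SEQ)
kept = filter_by_disorder(all_matches, DISORDERED)
print(all_matches)  # [(3, 10), (20, 27)]
print(kept)         # [(20, 27)]
```

Requiring the whole match to lie inside a disordered region is the strictest choice; a real filter might instead accept partial overlap or weight matches by a per-residue disorder score.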

3.2 Linear functional modules

Having identified candidate unstructured regions one can start searching for function therein. Most functions will correlate to short linear peptide motifs that are used for cell compartment targeting, protein-protein interaction, regulation by phosphorylation, acetylation, glycosylation and a host of other post-translational modifications. Refer to Table 3.1 for an overview of the many functions these sites perform. A typical functional site is shown in Figure 3.1: notice the linear unstructured and flexible protein backbone, a requirement for the CSK kinase to be able to modify the tyrosine.

The number of known categories of functional sites has increased dramatically in the last few years and it is obvious that there are more to be discovered. These sites are usually short and often reveal themselves in multiple sequence alignments as short patches of conservation, leading to their definition as linear motifs. In addition to occurring outside globular domains, some sites, e.g. phosphorylation sites, are often found in exposed, flexible loops protruding from globular domains. Considering the abundance of targeting signals and post-translational modification sites, it is reasonable to assume that there are more linear functional sites than globular domains in a higher eukaryotic proteome.

For more details on these functional modules and the concepts related to them refer to Chapter 4 on page 21.


Table 3.1: Main classes of functional sites. Functional sites are as varied and numerous as domains are. On a proteome level we expect at least 5 sites per protein, resulting in about 150000 instances in the human proteome. This indicates the presence of a gigantic and complex interaction and regulatory system.

3.3 Available resources for linear functional modules

ELM is now the largest collection of linear motifs, followed by Scansite and PROSITE [163, 241, 200]. ELM and PROSITE both use simple consensus sequences (regular expressions) for describing linear motifs, refer to 4.1.3 on page 23. Scansite (http://scansite.mit.edu) is a very capable resource focusing on cell signalling. It complements ELM in using Position Specific Scoring Matrices (PSSMs) for prediction, which are more sensitive than regular expressions. A series of individual predictors of functional sites can be found at http://www.cbs.dtu.dk/services/, hosted by the Centre for Biological Sequence Analysis (CBS). CBS provides high-performance neural network predictors. However, neither Scansite nor CBS provides an annotated database similar to ELM. The PROSITE database has collected a number of linear protein motifs, representing them as regular expressions. PROSITE patterns have been very useful but suffer from severe overprediction, and more recently the database has emphasised globular domain annotation at the expense of linear motifs.

Figure 3.1: The C-terminal CSK Tyr phosphorylation site in Src (PDB: 1fmk) in the closed conformation bound to the SH2 domain. This linear motif (in red) shows the general features of an ELM: it is linear in sequence and structure space. The sequence of the instance of this functional site is: TEPQYQPGE. In the ELM resource this is called the MOD_TYR_CSK functional site and described by the pattern: [TAD][EA].Q(Y)[QE].[GQA][PEDLS]. The image was created using PyMOL, http://pymol.sourceforge.net/.

Also informative are databases that store known instances of linear functional sites. Besides ELM, these include UniProt (http://www.uniprot.org/, [12]), HPRD (http://www.hprd.org/, [172]) and Phospho.ELM at http://phospho.elm.eu.org. Databases of instances are not directly useful for prediction but provide valuable data-mining resources.

In summary, there is a limited set of tools for analysing the non-globular or linear part of protein space. Identifying segments in protein sequences that are unstructured, as well as assigning function to these segments, is therefore the aim of this work.


3.4 Linear functional modules are important

Post-translational modification sites such as phosphorylation or glycosylation sites are important for the function, regulation and folding of proteins; this is well-established knowledge [121, 230]. However, it has recently been demonstrated that, in general, protein features such as functional sites are important for the functional classification of proteins [116, 170, 78]. This work was based on the ProtFun method, which is an ab initio method for prediction of broader functional classes based on protein features alone [117].

Besides being important for protein function, linear motifs and functional sites are individual functional entities that are likely to have their own evolutionary history [116], i.e. they are indeed modular just like domains. It follows that one should try to combine these two classes of modules into a composite model for protein function.

3.5 Two worlds

It is the author’s hope that the reader now has a split view of protein function description. That is entirely on purpose, because to a large extent this is how the models of protein function are currently drawn up. Is there a way to combine these two worlds? The answer is probably yes, and it might bring us much further in the understanding of proteins than just the sum of the two worlds. It might lead us to the first composite models for protein function.

Historically, modern molecular biology and bioscience are to a very large extent based on the quantum mechanical principles that were established by Bohr, Einstein, Heisenberg, Fermi and others in the 1920s and 1930s. Positivism and the focus on microscopic and atomic details later allowed people like Linus Pauling and Watson and Crick to study the molecules of life. Molecular biology has now generated a large amount of detailed molecular data; however, we still do not quite understand how these things work together in forming the system of the cell.

The common view of protein function has up to now been largely focused on the concepts of structure and domains. Structural biology has made major contributions to our understanding of the biomolecules; however, there are some intrinsic weaknesses to it. X-ray structures are images of dead proteins: they do not breathe and move as they do in vivo. Dynamics are likely as important for function as any given fold, which is indicated by the increased focus on unstructured but functionally important parts of the protein. This attention has to a large extent been based on the successful deployment of Nuclear Magnetic Resonance (NMR) spectroscopy in the study of proteins.

The concept of functional domains has been very successful in studying proteins; however, it lacks the level of detail of the structural dogma described in 1.1. Only by merging the domain-centric view with a non-globular and linear-motif-centric view, as presented in this thesis, can we get to a more complete conceptual model for protein function.


I will first describe the nature of linear motifs and then later show how prediction of disordered regions can be useful for finding linear motifs. This thesis will aim at a composite model for protein function that incorporates both globular domains and linear modules such as ELMs/functional sites.


Part II

Linear Protein Space


Chapter 4

Functional Sites and Linear Motifs

Proteins contain functional sites that are linear in structure and co-linear in sequence space. We describe these functional modules as linear motifs or ELMs. This chapter will give an overview and introduction to the molecular functions encoded in linear motifs.

4.1 Coming to terms with linear motifs - ELM jargon and concepts

Protein structure, function and sequence space can be described in terms of the modules that make up the system, refer to Table 4.1. Functional and structural modules within the non-globular or linear space are known as linear motifs or functional sites and are catalogued, though incompletely, by ELM [179], CBS, PROSITE [200] and Scansite [241]. This group of modules includes peptide ligands, cell compartment targeting signals, post-translational modification sites and cleavage sites.

As shown in Figure 4.1 and Table 3.1 on page 15, many different molecular and biological functions are related to linear modules. In this chapter we will discuss and illustrate some of the basic characteristics of linear functional sites, many of which are useful not just for classification and understanding of linear motifs but also for filtering and prediction.

Linear modules                              Globular modules
Post-translational modification sites       Globular domains
Disordered segments                         Binding pockets
Linkers                                     Active sites
Linear epitopes                             Interaction surfaces
Active peptides (e.g. inhibitors, toxins)   Epitopes
MHC molecules                               Molten globules

Table 4.1: Protein structure, sequence and function space can be separated into system modules: linear and globular modules. It is important to notice that these two classes are not entirely detached from one another; a good example is disordered loops that can protrude from a globular domain. Some modules are borderline between the two, e.g. coiled-coils, repeat elements (often forming rods) and single transmembrane helices.

Figure 4.1: The Epsin-1 protein is involved in Clathrin-mediated endocytosis, where it binds to lipid heads, clathrin, the Ap2 complex and Eps15. It is proposed to be responsible per se for membrane curvature [82]. Its principal binding partner is Eps15, which is localised at the edge of a forming coat. Epsin-1 contains many functional sites, including three ubiquitin-interacting motifs (UIM).

In a large project (the ELM project counted about 20-25 people) it is natural to develop certain words and concepts to describe things that seem logical to the people involved in the project but are likely less obvious to external scientists. Here I will lay out some of the major concepts that the reader needs to be aware of to fully comprehend the following chapters.

4.1.1 Functional sites

An example of a functional site is a phosphorylation site. It is the concept of a molecular function, modification in this case, that can be related to a particular molecular spot on the three-dimensional structure of a protein. The molecular functions we classify as functional sites are only the ones that are linear in sequence and structure space, i.e. an ‘active site’ is not compatible with our definition since it is non-linear in sequence (residues not close in sequence) and structure space (a pocket, not a linear peptide). Immunology would probably describe functional sites as linear epitopes. The functional site is the highest level of classification of a molecular function in the ELM database. It is a biological and bibliographical object. Lately we started to refer to functional sites as linear functional modules, not to complicate matters, but rather because we want to stress their functional modularity, refer to Chapter 9 on page 137.


4.1.2 Linear motif - an ELM

Functional sites are described by linear motifs. Linear motifs are bioinformatic objects that model functional sites. An ELM (Eukaryotic Linear Motif) consists of a sequence model and contextual knowledge that is used for prediction of a particular functional site. Currently ELM uses patterns (like PROSITE, refer to Figure 4.2) as sequence models for functional sites, but they could as well be neural networks or Position Specific Scoring Matrices [104].

The word linear is very precise: it states that the motif is co-linear in sequence and linear in structure, i.e. without more than one regular secondary structure element. The latter implies that it can be described using peptide models. Linearity is the hallmark of ELM and ELMs.
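The structural half of this definition can be made concrete with a small check. The sketch below assumes a per-residue secondary structure string in a three-state alphabet (H for helix, E for strand, any other character for coil); this encoding and the function names are illustrative assumptions, not part of the ELM resource.

```python
from itertools import groupby


def count_regular_elements(ss: str) -> int:
    """Count runs of helix (H) or strand (E) in a secondary structure string."""
    # groupby yields one (state, run) pair per stretch of identical characters.
    return sum(1 for state, _run in groupby(ss) if state in ("H", "E"))


def is_linear(ss: str, start: int, end: int) -> bool:
    """True if residues [start, end) span at most one regular element."""
    return count_regular_elements(ss[start:end]) <= 1


print(is_linear("CCCCHHHHCC", 0, 10))  # True: a single helix
print(is_linear("HHHCCEEE", 0, 8))     # False: a helix and a strand
```

Two helices separated by a coil residue count as two elements, so a segment such as "HHCHH" is not linear under this reading of the definition.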

The list of linear motifs currently in the ELM resource can be found on page 159.

4.1.3 Regular expressions

Currently ELM is using regular expressions or patterns for predicting linear motifs. These are equivalent to PROSITE’s patterns except that we are using POSIX standard patterns, refer to [83]. Table 4.2 shows the syntax used for regular expressions in describing linear motifs. A basic problem with regular expressions is that they do not provide any match statistics: either a pattern matches or it does not. Therefore it is difficult to evaluate the significance of a given match.
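Although a regular expression gives no score, a rough feel for how often a pattern matches by chance can be obtained with a back-of-the-envelope calculation. The sketch below is only that: it assumes all 20 amino acids are equally frequent and independent at every position (which real proteins are not), it handles only fixed-length patterns built from single residues, dots and character classes, and the function names are hypothetical.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def per_position_probs(pattern: str) -> List[float]:
    """Chance that a random residue satisfies each position of the pattern."""
    probs = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == ".":                      # any residue
            probs.append(1.0)
            i += 1
        elif ch == "[":                    # character class, possibly negated
            close = pattern.index("]", i)
            body = pattern[i + 1:close]
            negated = body.startswith("^")
            listed = len(set(body[1:] if negated else body))
            allowed = 20 - listed if negated else listed
            probs.append(allowed / 20)
            i = close + 1
        elif ch in "()":                   # parentheses only mark a position
            i += 1
        elif ch in AMINO_ACIDS:            # one specific residue
            probs.append(1 / 20)
            i += 1
        else:
            raise ValueError(f"unsupported pattern element: {ch!r}")
    return probs


def match_probability(pattern: str) -> float:
    """Probability that one random window of the right length matches."""
    p = 1.0
    for q in per_position_probs(pattern):
        p *= q
    return p


def expected_chance_matches(pattern: str, sequence_length: int) -> float:
    """Expected number of chance matches in a random sequence."""
    windows = sequence_length - len(per_position_probs(pattern)) + 1
    return max(windows, 0) * match_probability(pattern)


print(match_probability("R..P..P"))             # (1/20)**3, about 1.25e-4
print(expected_chance_matches("R..P..P", 500))  # 494 windows times 1.25e-4
```

Under these assumptions a pattern with only three fixed positions, such as R..P..P, is expected to match by chance roughly once in every 8000 windows, which over a whole proteome adds up to a large number of hits with no indication of which ones are real.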

Figure 4.2: The SH3 ligand site is modelled as five ELMs in ELM, refer to the listing of all linear motifs on page 159. Two of these linear motifs are shown here, together with the regular expressions used to detect them. Class I SH3 ligands are for instance described as RxxPxxP, or rather R..P..P in POSIX regular expression syntax, which is the one ELM uses. Figure by Barbara Brannetti.


Character   Name                      Meaning
.           dot                       Matches any one amino acid
[...]       character class           Matches any amino acids listed
[^...]      negated character class   Matches any amino acids not listed
{min,max}   specified range           Min required, max allowed
^           caret                     Matches the N-terminal
$           dollar                    Matches the C-terminal
?           question                  One residue allowed, but is optional
*           star                      Any number of residues allowed, but is optional
+           plus                      One residue required, additional residues are optional
|           alternation               Matches either expression it separates
(...)       parentheses               1. Used to mark positions of special interest, e.g. the covalently modified residue
                                      2. Defining groups in pattern

Table 4.2: Syntax of regular expressions or patterns, as described on the ELM website.
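The syntax in Table 4.2 can be tried out directly with Python's `re` module, whose behaviour agrees with the table for the elements used here. The sketch below uses the two patterns quoted in this thesis (R..P..P and the MOD_TYR_CSK pattern from Figure 3.1); the toy sequence, the label "SH3_class_I" and the function name `scan` are invented for illustration and are not ELM identifiers or ELM code.

```python
import re
from typing import Dict, List, Optional, Tuple

# (name, start, end, matched text, marked position); positions are 1-based.
Hit = Tuple[str, int, int, str, Optional[int]]

ELM_STYLE_PATTERNS: Dict[str, str] = {
    "SH3_class_I": r"R..P..P",  # hypothetical label for the class I pattern
    "MOD_TYR_CSK": r"[TAD][EA].Q(Y)[QE].[GQA][PEDLS]",
}


def scan(sequence: str, patterns: Dict[str, str]) -> List[Hit]:
    """Report every non-overlapping match of each pattern in the sequence."""
    hits: List[Hit] = []
    for name, pattern in patterns.items():
        for m in re.finditer(pattern, sequence):
            # Parentheses mark a position of special interest, e.g. the
            # modified residue; report it if the pattern has such a group.
            marked = m.start(1) + 1 if m.lastindex else None
            hits.append((name, m.start() + 1, m.end(), m.group(0), marked))
    return hits


# Toy sequence (invented) containing the Src instance TEPQYQPGE from
# Figure 3.1 and a made-up proline-rich stretch.
TOY = "MGS" + "TEPQYQPGE" + "NSS" + "RPLPPLP" + "K"

for hit in scan(TOY, ELM_STYLE_PATTERNS):
    print(hit)
# ('SH3_class_I', 16, 22, 'RPLPPLP', None)
# ('MOD_TYR_CSK', 4, 12, 'TEPQYQPGE', 8)
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping occurrences of the same pattern would need a lookahead-based search.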

4.1.4 ELM instances

This concept describes perhaps the most important data in the ELM resource. A match of one particular ELM sequence model (regular expression or pattern) to one particular polypeptide, verified experimentally, is an ELM instance. In clear-cut cases we do infer some instances by homology, e.g. between mouse, rat and human proteins. We are using the Human Proteome Initiative’s controlled vocabularies in the annotation of experimental evidence [105]. Soon the ELM user will be notified about verified functional sites within a query sequence if they are annotated in the ELM database.

The ELM instance data are not just valuable for the user, they are extremely important for the resource. They are the only data on which we can test our filters and predictions. Instance data are also a source of contextual information about functional sites, e.g. secondary structure context.
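One simple way to test a filter against instance data is to count how many verified instances survive the filter and how much of the sequence remains to be searched. The sketch below assumes that instances and unmasked regions are given as 1-based inclusive (start, end) pairs; the function name `filter_performance` and the numbers are hypothetical.

```python
def filter_performance(instances, unmasked, sequence_length):
    """Return (fraction of instances retained, fraction of sequence still searched).

    An instance counts as retained only if it lies completely inside
    one unmasked region.
    """
    if not instances:
        raise ValueError("at least one instance is required")

    def inside(site):
        return any(s <= site[0] and site[1] <= e for s, e in unmasked)

    retained = sum(1 for site in instances if inside(site))
    searched = sum(e - s + 1 for s, e in unmasked)
    return retained / len(instances), searched / sequence_length

# Hypothetical example: three instances on a 100-residue sequence.
retained, searched = filter_performance(
    instances=[(5, 9), (30, 34), (48, 52)],
    unmasked=[(1, 20), (40, 60)],
    sequence_length=100,
)
print(retained, searched)
```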

In retrospect we started slightly late with collecting these, but recently we have gathered a substantial data set¹. These data are used intensively in Draft I and II. A large subset specific for phosphorylation will be shipped to CBS in Denmark for training kinase specific predictors.

¹ About 1300 non-redundant instances on ~1500 sequences. Instances exist for ~70 of the 114 functional sites currently catalogued by ELM. Refer to Table 5.5 on page 67.

4.1.5 ELM knowledge

Functional sites are short in sequence and structure space. However, they can only be functional if they are placed in the right molecular and structural context. Moreover, one can use their cellular context to improve the relevance of their predictions, e.g. there is no point in predicting nuclear-specific functional sites if the user submitted a cytosolic protein sequence. We have two classes of knowledge attached to functional sites:

Contextual knowledge refers to the structural or molecular context dependency of a particular functional site, e.g. it has to be in surface-exposed loops to be functional.

Logical knowledge is for instance the taxonomic range or the set of compartments the site is active in. The difference between the two types of knowledge lies in the way they are used in the resource: logical knowledge is used for logical filtering of the relevance of predicted functional sites, whereas contextual knowledge is used for improving the predictive power of ELM. Paper I and Draft I discuss these concepts in more detail. The knowledge about functional sites is annotated from the literature by the ELM annotators, also known as siteseers, in a process we call siteseeing, refer to 5.12 on page 66.
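Logical filtering can be pictured as a simple set-membership test. The following sketch is only an illustration: the class `SiteKnowledge`, the function `logical_filter` and the compartment and taxon values attached to each identifier are assumptions and do not reproduce the actual ELM annotation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SiteKnowledge:
    """Logical knowledge for one functional site (values are illustrative)."""
    identifier: str
    compartments: frozenset  # compartments in which the site is active
    taxa: frozenset          # taxonomic groups in which the site occurs

def logical_filter(predicted, knowledge, compartment, taxon):
    """Keep predicted sites whose annotation permits this compartment and taxon."""
    kept = []
    for identifier in predicted:
        k = knowledge[identifier]
        if compartment in k.compartments and taxon in k.taxa:
            kept.append(identifier)
    return kept

# Hypothetical annotations, chosen only to make the example run.
knowledge = {
    "LIG_SH3_1": SiteKnowledge("LIG_SH3_1", frozenset({"cytosol"}),
                               frozenset({"Metazoa", "Fungi"})),
    "TRG_KDEL": SiteKnowledge("TRG_KDEL", frozenset({"endoplasmic reticulum"}),
                              frozenset({"Metazoa", "Fungi"})),
    "MOD_SUMO": SiteKnowledge("MOD_SUMO", frozenset({"nucleus"}),
                              frozenset({"Metazoa", "Fungi"})),
}

kept = logical_filter(list(knowledge), knowledge, "cytosol", "Metazoa")
print(kept)
```

Contextual knowledge would enter at a different stage, for example as a structural mask applied before pattern matching, rather than as a yes/no test on the annotation.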

4.1.6 Classes of functional sites - ELM identifiers

Chemically speaking all functional sites are ligands. However, the focus of ELM is on biology and we chose molecular biology based definitions to classify functional sites into five classes, refer to Figure 4.3. Each class is used as part of what we call ELM identifiers, which are simply names for entries in the ELM resource:

TRG targeting signals

LIG protein ligands

MOD covalent modification sites

CLV proteolytical cleavage sites

ISO non-covalent isomerisation sites

The ISO class is the latest class and is supposed to encompass prolyl cis-trans isomerisation sites, e.g. in the p53 protein. Examples of ELM identifiers are LIG_SH3_1, TRG_KDEL, MOD_SUMO and CLV_PCSK_FUR_1. These are essential for the user as reference points to the resource and can be used in publications etc.
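Because the class is the first component of every identifier, it can be recovered mechanically. The helper below is a small sketch; the function name `elm_class` is an assumption, while the five prefixes and their meanings are taken from the list above.

```python
ELM_CLASSES = {
    "TRG": "targeting signals",
    "LIG": "protein ligands",
    "MOD": "covalent modification sites",
    "CLV": "proteolytical cleavage sites",
    "ISO": "non-covalent isomerisation sites",
}

def elm_class(identifier):
    """Split an ELM identifier into its class prefix and the class description."""
    prefix, _, rest = identifier.partition("_")
    if prefix not in ELM_CLASSES or not rest:
        raise ValueError(f"not a valid ELM identifier: {identifier!r}")
    return prefix, ELM_CLASSES[prefix]

for name in ("LIG_SH3_1", "TRG_KDEL", "MOD_SUMO", "CLV_PCSK_FUR_1"):
    print(name, elm_class(name))
```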

4.2 Check-list for functional sites

The list below is meant as a general guideline to important concepts and rules for functional sites:

1. They can often be described with peptide models.

2. Their secondary structure preference is coils or loops.

3. Some have α-helical conformation.

4. Some have β-conformation.

5. They can be at most one regular secondary structure element long.

6. Low chain stiffness is likely important for the function.

7. Both the backbone and sidechains are used in the recognition and binding in the non α-helical motifs, i.e. lower residue specificity.

8. H-bonds are often used extensively for their binding.

9. They tend to cluster in low-complexity regions.

10. They have specific taxonomic ranges.

11. They are specific for cellular compartments.

Figure 4.3: A schematic figure illustrating the different classes of functional sites currently catalogued by ELM. The ISO class is not shown. Figure by Rein Aasland.

Many of these rules are directly reflected in the patterns for the corresponding ELMs. The backbone sampling during recognition of a particular functional site is often indicated by a dot in the pattern, which simply states that any amino acid is accepted in this position. Chain stiffness [129], flexibility, disorder and coils are all correlated parameters, refer to Chapter 6 for more information on protein disorder.


4.3 Linear motifs - a structural gallery

In the author’s view the best way to understand the nature of functional sites is by looking at the protein structures of the sites. The linear properties become particularly visible in three-dimensional space. There are a few pitfalls: many structures are solved using a peptide to represent the functional site, however short peptides are most often unstructured and might not reflect the nature of the functional site as it is in a full-length polypeptide. It is dangerous to take a given functional site and use it in a peptide model without good supporting biochemical evidence for doing so.

4.4 Are linear motifs only found in Eukaryotes?

No, however they are far more frequent in higher Eukaryotes. The LIG_PCNA shown in Figure 4.5 is also found in bacteria.

Viruses often mimic cellular functions in order to silently invade a host cell. HIV-1 and the Ebola virus use functional sites to exploit the cellular machinery for budding. The structural Gag proteins of HIV and Ebola display the PTAP (also known as ‘Late domain’) motif in order to bind to, among others, the Tsg-101 protein, which facilitates virus budding in the host cell. The PTAP motif has been observed as PPXY, PTAP, PSAP, or YXXL patterns [107, 113, 137, 226].

(a) Bi-partite NLS module. (b) PTS1 signal.

Figure 4.4: Two TRG class functional sites. (a) Currently Nuclear Localisation Signals are not predicted by ELM. Burkhard Rost’s lab has done a lot of work on NLS prediction and provides the tool PredictNLS [158] for the detection of this type of module. (b) The peroxisomal targeting signal-1 (PTS1) is a C-terminal tripeptide that is sufficient to direct proteins into peroxisomes. The canonical PTS1 sequence is Ser-Lys-Leu-COOH. The human PEX5 protein is the receptor for PTS1; it interacts with the signal via a series of tetratricopeptide repeats (TPRs) within its C-terminal half. Figure (a) is from [48]. Figure (b) is from [88].

(a) PCNA binding site. (b) Clathrin box motif.

Figure 4.5: Two LIG class functional sites. (a) The PCNA (Proliferating Cell Nuclear Antigen) binding site is found in proteins involved in DNA replication, repair, methylation and cell cycle control [227]. The PCNA binding site is a short helical motif that mediates interaction with PCNA, which binds proteins to the DNA replication fork. The alignment shows two LIG_PCNA instances. (b) Clathrin boxes are motifs found on cargo adaptor proteins. They interact with the N-terminal beta-propeller structure of the Clathrin heavy chain [62, 153, 79]. Figure (a) by Rein Aasland. Figure (b) is from a review by David Owen [165].

(a) A SUMOylation site. (b) A fucosylation site.

Figure 4.6: Two MOD class functional sites. (a) The RanGAP1 (red) protein has a functional site for SUMOylation which is recognised by the Ubc9 (blue) enzyme. Small ubiquitin-related modifier (SUMO) proteins are mini proteins that are covalently attached to many proteins, especially within the nucleus [152, 77, 32], (PDB: 1KPS). (b) Structure of a legume lectin in complex with the human lactotransferrin N2 fragment and with an isolated biantennary glycopeptide, showing the role of the fucose moiety [26], (PDB: 1LGB). Lactoferrin is also involved in amyloid formation, see Chapter 8 for more details. Figure (a) by Pål Puntervoll.

It is likely that PTAP functional sites mimic a naturally occurring site in the host cell. It is known that PP.Y from the Rous sarcoma virus Gag protein interacts with intracellular WW domains from ubiquitin ligases such as Nedd4/Rsp5. Since WW is a typical signaling domain, the virus is making its way into the host cell system by mimicking normal signaling motifs [137].

4.5 Low-complexity regions - Hyper ELM clusters

Epsins and Histones both contain long disordered C-terminal tails which are highly enriched in functional sites, refer to Figure 4.1 on page 22. This is perhaps a result of them being enriched in low-complexity sequence (sequence segments biased towards a few residues). Whether this enrichment is a result of the linear motifs clustering, or whether these regions evolutionarily attract functional sites, is not known at the moment. However, these regions function as hot spots for protein regulation and they are often unstructured [190, 235, 110]. Many of the proline rich functional sites like SH3 and SH2 are very often found clustered together in regions of low complexity, e.g. in the yeast Las-17 protein. As part of a collaboration with the lab of Rob Russell on linear motif discovery, it became clear that many of the potential novel motifs are also of low complexity but not confined to proline enrichment [unpublished results]. Recently this was found for p300 [214], the homologue of CBP (refer to Paper II).

4.6 Conformation of the functional sites

As mentioned previously, the linearity of functional sites is an intrinsic and important feature. However, conformational changes are likely to take place during recognition and binding of the motifs. Classic concepts of ‘induced fit’, ‘lock-and-key’ and ‘rigid adaptation’ were created to describe ligand binding by enzymes and receptors [33, 127, 149, 53, 174] and are loosely applied to linear motifs, though this may not be correct in the strict enzymatic sense, especially because functional sites often have low binding affinities, typically with Kd in the range of 1-500 µM [179].

4.6.1 Affinity and specificity of ELMs

It cannot be written more clearly than in these words from Tony Pawson, in a review from early 2004 [170]:

“In signaling networks, there is no optimal affinity for protein-protein interactions, but rather a wide range of dissociation constants that are tailored for distinct forms of biological regulation. It is natural to assume that tight protein-protein interactions yield a high degree of specificity and are more biologically relevant than interactions with relatively weak affinities. Strong interactions are long lived, and this can be advantageous, as in the tethering of inactive PKA to an AKAP in readiness for a cAMP signal. However, such interactions cannot always provide the flexibility that a cell needs to respond dynamically to changing external conditions or internal programs. Indeed, protein-protein interactions that are dependent on posttranslational modifications must by definition have relatively modest affinities since much of the binding energy must come from the modified residue itself [28]. However, affinities in the micromolar range do not necessarily mean an absence of specificity. Indeed for both SH2 and SH3 domains, increased affinity for a particular motif can (somewhat paradoxically) come at the expense of specificity [125, 243]. A resolution to this conundrum may be that enhanced binding to an optimal sequence may at the same time interfere with a barrier to recognition of ectopic motifs and thus decrease specificity.”

4.6.2 Helical - in solution and bound

Most linear motifs do not have regular secondary structure, either in solution or when bound. Others like LIG_PCNA, LIG_PTS2, LIG_IQ and LIG_NRBOX are helical both bound and unbound. We do not have any examples of motifs with beta conformation in solution yet, however a few function as ‘beta-addition’ motifs such as the one described for p53 below. The important point is that probably no linear motif will consist of more than one regular secondary structure element. Already with two regular secondary structure elements the ELM concepts begin to fall apart, because one can have packing and we defined linear motifs to be independent of structural packing. In solution, any helicity of an ELM is likely transient and disorder predominates, e.g. in the p53 N-terminal region that contains a LIG_MDM2 site. LIG_MDM2 is a helical ELM and it shows a disorder-to-order transition upon binding, refer to Figure 9.2 on page 143 and [57].

4.6.3 Beta conformation - bound and unbound?

The p53 protein contains acetylation sites in the C-terminal region, refer to Figure 9.2 on page 143. The structure of this MOD class functional site in complex with the de-acetylase Sir2 was recently solved [15]. As the paper states:

“The nature of the interaction between Sir2-Af2 and an acetylated p53 peptide indicates that the substrates of Sir2 proteins are likely to be either unstructured peptides or segments of proteins that can unfold upon binding to Sir2. The C terminus of p53, corresponding to the peptide used in the present study, is deacetylated by human SIRT1 in vivo [220] and has been shown by NMR studies to be unstructured in solution [188]. Acetylated histone N-terminal tails, which are substrates for yeast Sir2p, are similarly unstructured in the context of the nucleosome [147, 146].”

Figure 4.7: The structure of an acetylated p53 peptide model of a functional site and the Sir2 de-acetylase. The peptide adopts a beta conformation upon binding but is unstructured in solution. The unbound functional site, as it resides in the full-length p53, is also unstructured. Figure from [15].

The work shows (refer to Figure 4.7) that the bound form of the acetylated peptide is in β-conformation. It is unlikely, however, that this peptide is in β-conformation before binding. This is a beautiful example of a disorder-to-order transition during binding of a functional site: in the solved structure the p53 peptide is in a β-conformation, but in the structures solved of p53’s C-terminal part corresponding to the peptide used in Figure 4.7, the motif is unstructured. This is supported by disorder predictions as seen in Figure 4.8.

In other cases such as the DPF and DPW motifs from Epsin-15 and Epsin, the binding of the ELM requires formation of a type 1 β-turn and does not lead to any substantial conformational changes in the recognition protein [30].

Many solved complexes of linear motifs and their recognition modules show little or no change in conformation of either the receiving domain or the functional site [36, 107] and [240, 239]. Ligand sites such as FHA, 14-3-3 and PBD seem to bind in a rather extended and unstructured manner [Michael Yaffe - personal communication]. In many cases functional sites reside in unstructured segments of the host protein, and binding and cooperativity in such intrinsically disordered proteins have been studied extensively [75, 74, 73]. More discussion on order/disorder transitions can be found in 7.5 on page 115. A conformational study of the RGD ligand site is reported in [217].

Figure 4.8: DisEMBL predictions for the region of p53 that Avalos and co-workers used for their peptide model in [15]. Shown is a zoomed plot of the predictions that were made on full-length p53. DisEMBL predicts disorder by all the definitions (refer to Paper III) in the region. There is also a drop just around residues 381-383, where the actual linear motif is; one could interpret this as a slight tendency to be ordered. This is in agreement with the observation that substrates of Sir2 are disordered in solution but become ordered upon binding.

4.6.4 Linus Pauling would not be surprised

Functional site recognition and binding make extensive use of hydrogen bonding to the backbone [98, 97]. This has been seen for many functional sites such as hydroxyproline modification sites [107], phosphorylation sites [240, 239], PDZ ligands [84] and SH3 ligands [166].

Peptides binding in extended conformation will often have every second residue pointing towards the solvent; this is the reason why many ELMs have alternating dots in their pattern. Examples of this are LIG_14-3-3 with the pattern R[SFYW].S.P and MOD_OGLYCOS, which is described as C.S.PC in ELM. Often not all of the 20 amino acids are allowed in such positions and they are accordingly excluded in the pattern, as in the pattern ..[RK].{0,1}[VI][^P][FW]. of the Protein Phosphatase 1 site LIG_PP1. Finally, motifs such as the LXXLL motif or LIG_NRBOX do not make use of such extensive H-bonding.
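The alternation of constrained and free positions can be read directly from the patterns. The sketch below lists the wildcard positions of the two simple patterns quoted above and then shows the effect of the excluded proline in the LIG_PP1 pattern. The LIG_PP1 pattern is written with the range quantifier in braces, `{0,1}`, as in POSIX syntax. The helper `free_positions` and the toy peptides are assumptions made for illustration; the helper handles only literals, dots and bracketed classes.

```python
import re

def free_positions(pattern):
    """Return the 1-based positions of '.' wildcards in a simple pattern.

    Only literals, '.', and bracketed classes are handled; quantifiers
    such as {min,max} are outside the scope of this sketch.
    """
    tokens = re.findall(r"\[[^\]]*\]|.", pattern)
    return [i for i, tok in enumerate(tokens, start=1) if tok == "."]

print(free_positions("R[SFYW].S.P"))  # LIG_14-3-3
print(free_positions("C.S.PC"))       # MOD_OGLYCOS

# LIG_PP1 excludes proline at one position via the negated class [^P].
lig_pp1 = re.compile(r"..[RK].{0,1}[VI][^P][FW].")

# Toy peptides invented for illustration.
with_thr = lig_pp1.search("AAKGVTFA") is not None  # Thr at the [^P] position
with_pro = lig_pp1.search("AAKGVPFA") is not None  # Pro at the [^P] position
print(with_thr, with_pro)
```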


4.7 ELM evolution

This topic is very exciting and now that the ELM resource is large it is worth looking into more carefully, however this will be left for others to do. It is clear though that because ELMs in general are short they can arise by point mutations, although others likely have a longer evolutionary history. Knowledge about the sequence conservation of ELMs might be useful for detecting them. Rob Russell contributed some analysis to Draft I on this topic.

4.8 Why ELM?

Three years ago no comprehensive knowledge base and predictive resource existed for functional sites and linear motifs. The EU funded ELM consortium was founded in the spring of 2001 with the aim of building such a resource. We have created a rather large resource and developed several novel techniques for improving the detection of linear motifs. The next two papers describe the prototype resource and some of the design details. Since the author was involved in almost all parts of the resource development it is impossible to describe all the work that has been done. My major contributions were the design of the relational database framework and inventing and developing two novel ab initio protein disorder filter technologies. In the last phase of my PhD I have begun to apply the ELM resource in proteome scale studies. The next step is to propose protein function models composed of both globular domains and linear functional modules. These models can then be used in future studies of cellular regulatory pathways and systems.


Chapter 5

The Design of ELM

Even with optimal predictors for linear functional sites, the predictive power will still be hampered by a high level of false positives. Even with neural networks it is not possible to get the levels of accuracy known for e.g. domain predictions. The fundamental reason for this is that linear functional sites are short and varied in conservation. The only way to circumvent this is by introducing filters at various levels. This is how the cell deals with the problem. In this chapter I will discuss some of the design considerations that were made in the early phase of the ELM project. Three years ago no comprehensive or integrated resource for prediction of functional sites existed. ELM is still not comprehensive but it is large and it does provide an integrated resource, i.e. it offers predictions of many molecular functions and provides a knowledge interface as well.

5.1 Conceiving ELM - beyond sequence analysis

Most of the basic ideas in the ELM resource were conceived by Toby Gibson. Many ideas came to mind by trying to figure out how the cell deals with linear functional sites. Non-functional (false positive) linear motifs are distributed widely within any proteome sequence space. How can the cell discriminate these sites? By compartmentalisation and molecular context. The first is an intrinsic part of how targeting and sorting signals work. The second can be understood by structural analysis of proteins.

The cell is presented with a similar level of false positives for linear motifs as the user is when he/she pastes a sequence into the webserver (hopefully the cell experiences less overprediction). Somehow the cell seems to be able to discriminate true from false functional sites. There are three principal explanations for this:

Firstly, the cell is compartmentalised, i.e. it consists of several physically separated compartments. It follows that a protein in one compartment can contain a functional site that simply does not have the corresponding recognition module within that compartment, and therefore the site will be ‘silent’ due to the lack of potential interaction partners.


Secondly, and perhaps more complexly, functional sites are dependent on molecular context; in particular the structural context is crucial for the function. If a potential phosphorylation site is buried within the globular core of a domain, it is not possible for a kinase to reach it. The majority of functional sites lack regular secondary structure and need to be solvent exposed and accessible.

Thirdly, the cell uses complex-mediated protein delivery; for example, the insulin receptor needs to be in a complex with the IRS substrates to phosphorylate them. Depending on the state of the cell, this complex might not form (as in insulin resistance). IRS phosphorylation leads to new complexes forming, e.g. the binding of p85 PI3-kinase via SH2 domains to the phospho-Tyr sites [114]. Another, related mechanism is that a functional site can sometimes be masked because the host protein resides in a complex. After the complex is transported into the right subcompartment it might dissociate into subunits and thereby expose the previously masked ELM to interaction partners residing in that subcompartment.

Whereas the use of compartmentalisation as a strategy for separating functional networks is intrinsically logical to the way the cell works, the molecular context is more complex to interpret. Perhaps we will only get a deeper understanding of this by analysing the evolution of functional sites.

5.2 The good, the bad and the .[DE].[VIL]

Neural networks have been used successfully by others, browse http://www.cbs.dtu.dk [25, 96, 117, 115, 59, 161], however in many cases we do not have enough data to train networks and the performance gain would thus be marginal. Linear motifs do not predict well from sequence. An example is a class of PDZ ligands which have the pattern .[DE].[VIL]. In a non-redundant SWISS-PROT this pattern matches on average about 10 times per protein, refer to Figure 5.10. There are primarily two reasons for this poor predictability. Firstly, linear motifs are very short and co-linear, typically 5-10 consecutive residues. Secondly, they have varied sequence conservation, i.e. they have weak consensus motifs. A clear example of the problem is that in some cases an identical sequence motif is functional in one context but not in another, as is the case for many of the LIG_RGD sites [217]. Another archetypal example of this is the SH3 ligand instances in the Nef protein shown in Figure 5.1. In the case of LIG_RB, shown in Figure 7.5 on page 116, there is clearly more information in the sequences of the true positives and the regular expression is not catching all of it.
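A back-of-the-envelope estimate shows why such a short pattern overpredicts. The sketch below computes the expected number of chance matches for the pattern in the title of this section, .[DE].[VIL], under two simplifying assumptions that are not taken from the thesis: residues are independent, and all 20 amino acids are equally frequent. The function names and the protein length of 400 are likewise illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_probability(position_sets, freqs):
    """Probability that a random window matches, assuming independent positions.

    position_sets holds one entry per pattern position: a string of allowed
    residues, or None for a wildcard.
    """
    p = 1.0
    for allowed in position_sets:
        if allowed is not None:
            p *= sum(freqs[aa] for aa in allowed)
    return p

def expected_matches(position_sets, freqs, length):
    """Expected number of matching windows in a random sequence of given length."""
    windows = max(length - len(position_sets) + 1, 0)
    return windows * match_probability(position_sets, freqs)

uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}  # simplifying assumption
pdz_like = [None, "DE", None, "VIL"]           # .[DE].[VIL]

expected = expected_matches(pdz_like, uniform, 400)
print(round(expected, 2))
```

Under these assumptions a 400-residue protein is expected to contain about six chance matches, which is the same order of magnitude as the observed rate quoted above.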

There are two potential solutions to this ‘ELM problem’. One is to develop more sensitive sequence models, however even with near-optimal methods such as machine learning techniques, linear motifs will continue to overpredict, as the Nef case illustrates.

Another approach, which is the one used in this work, is to collect contextual information by trawling the scientific literature. This knowledge is then used to design and test logical and contextual filters that work on top of the predictors. That is the approach ELM chose.


Figure 5.1: HIV-1 Nef protein (magenta) in complex with wild type Fyn SH3 (red) domain (PDB: 1avz) contains two potential SH3 ELMs. Residue numbers are given as well as the accessibility (acc.) for the highlighted fragments. The yellow motif is the only one binding to the Fyn partner. The cyan putative ELM is covered by a loop and has an accessibility of 81% (compared to the 98% of the true one).

5.3 elmDB - a relational database

A framework was needed for storing and querying the contextual knowledge gathered for the functional sites in the ELM resource. The natural choice was a relational database. Relational databases store data according to a modelled structured schema. The language used is usually the Structured Query Language (SQL)¹ [229, 38].

The design of the ELM database, elmDB, was made less trivial by the fact that it was not clear what sort of data was needed. The early phase of database design is often an attempt to predict and find examples of data that are needed for the later software design. This approach often results in a database schema that is rather atomic and data centric. This is the case for the elmDB: it is not particularly object oriented, however it does allow for storage of important contextual knowledge and to some degree it supports filtering. An example of the latter is the taxonomic filtering, which is done entirely by the SQL queries. The schema is shown in Figure 5.6 on page 58 and as an abstract model in Figure 5.5 on page 57. The complete and ‘zoomed in’ schema can be found on page 155.

¹ We use the SQL-92 dialect of the language as it is implemented in the Relational Database Management System (RDBMS) PostgreSQL 7.x.
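To illustrate what filtering done entirely in SQL can look like, the sketch below builds a tiny in-memory database and selects the functional sites annotated for a given taxon. Everything here is hypothetical: the two-table schema, the table and column names and the example rows are not the actual elmDB schema, and SQLite (via Python's standard `sqlite3` module) stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE elm (
    elm_id     INTEGER PRIMARY KEY,
    identifier TEXT UNIQUE NOT NULL
);
CREATE TABLE elm_taxon (
    elm_id INTEGER NOT NULL REFERENCES elm(elm_id),
    taxon  TEXT NOT NULL
);
""")

# Hypothetical annotation rows, chosen only to make the example run.
conn.executemany("INSERT INTO elm (elm_id, identifier) VALUES (?, ?)",
                 [(1, "LIG_SH3_1"), (2, "TRG_KDEL"), (3, "MOD_SUMO")])
conn.executemany("INSERT INTO elm_taxon (elm_id, taxon) VALUES (?, ?)",
                 [(1, "Metazoa"), (1, "Fungi"),
                  (2, "Metazoa"),
                  (3, "Metazoa"), (3, "Fungi")])

def sites_for_taxon(connection, taxon):
    """Return identifiers of all sites annotated as occurring in the taxon."""
    rows = connection.execute(
        "SELECT e.identifier FROM elm AS e "
        "JOIN elm_taxon AS t ON t.elm_id = e.elm_id "
        "WHERE t.taxon = ? ORDER BY e.identifier",
        (taxon,),
    )
    return [row[0] for row in rows]

print(sites_for_taxon(conn, "Fungi"))
```

Keeping the taxonomic range in its own table means the filter is a single join, and new taxa can be annotated without changing the schema.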


5.4 The prototype server

The prototype version of ELM included filtering based on user selected species and cellular compartments. We used SMART and Pfam for masking the globular parts of the query sequences. Later on we came to realise that many sequence models in Pfam are not domains but rather signatures or families. This is often the case for ‘politically important’ molecules such as the Prion proteins and p53. Instead of making a domain prediction for such a molecule, Pfam made near-full length multiple alignments and derived HMMs based on these; this would typically label a whole sequence as globular in the first version of the ELM filter. Later we compared the predicted Pfam ‘domains’ with a database of known globular domains in Pfam in order to avoid using signatures for filtering.

5.5 The current version of ELM

We are currently testing, benchmarking and implementing the latest filter technologies for the ELM resource. Both GlobPlot and DisEMBL (refer to Chapter 6 on page 73) have been optimised for usage in ELM. We used the ELM ‘instance data’ to optimise these methods for globularity masking. This simply means that we wanted the methods to mask parts of the query sequence that we would exclude in the prediction of ELMs:

>PRIO_HUMAN_LOOPS 17-110, 128-177, 189-200, 226-239
manlgcwmlv lfvatwSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN
KPSKPKTNMK hmagaaaaga vvgglggYML GSAMSRPIIH FGSDYEDRYY
RENMHRYPNQ VYYRPMDEYS NQNNFVHdcv nitikqhtVT TTTKGENFTE
tdvkmmervv eqmcitqyer esqayYQRGS SMVLFSSPPv illisflifl
ivg

This works by passing the sequence through the DisEMBL or GlobPlot pipelines. Since these methods are fundamentally two-state, we can label the segments as ordered (globular, lowercase) or disordered (new search space, uppercase). In this case we see the coils predictions on the human prion protein. We use the coils predictor to mask segments that are not coils, i.e. potential domains. Based on the current set of instances DisEMBL can mask about 50% of the sequence space and leave ~70% of the instances in the non-globular (unmasked) sequence. In the Draft I paper we will include performance estimates of these filters for each ELM. The false positives (FP) in this data set can be approximated by all matches to linear motifs that are not annotated with experimental evidence. This is based on the assumption that in a large dataset the number of true positives (TP) is much less than the number of false positives, i.e. TP << FP.
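The case-masked output shown above can be consumed with a few lines of code. The sketch below recovers the uppercase (unmasked) segments from the prion sequence and restricts pattern matching to them. The function names `unmasked_segments` and `search_unmasked` are assumptions for illustration and are not part of the DisEMBL or GlobPlot pipelines.

```python
import re

# Case-masked human prion protein sequence, copied from the listing above.
MASKED = "".join("""
manlgcwmlv lfvatwSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN
KPSKPKTNMK hmagaaaaga vvgglggYML GSAMSRPIIH FGSDYEDRYY
RENMHRYPNQ VYYRPMDEYS NQNNFVHdcv nitikqhtVT TTTKGENFTE
tdvkmmervv eqmcitqyer esqayYQRGS SMVLFSSPPv illisflifl
ivg
""".split())

def unmasked_segments(masked):
    """Return 1-based inclusive (start, end) ranges of uppercase residues."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[A-Z]+", masked)]

def search_unmasked(pattern, masked):
    """Match the pattern inside each uppercase segment only.

    Matches that would span a masked (lowercase) residue are not reported.
    """
    regex = re.compile(pattern)
    hits = []
    for start, end in unmasked_segments(masked):
        segment = masked[start - 1:end]
        for m in regex.finditer(segment):
            hits.append((start + m.start(), start + m.end() - 1, m.group(0)))
    return hits

segments = unmasked_segments(MASKED)
searched = sum(end - start + 1 for start, end in segments)
print(segments)
print(searched, "of", len(MASKED), "residues remain in the search space")
```

The recovered ranges agree with those in the header line of the listing, which is a convenient consistency check when such files are exchanged between tools.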

5.6 ELM - the future

The next generation ELM resource should be based on a solid statistical framework. Only now can we do this, because only now are the filters and tools needed for detection of functional sites in place.

5.6.1 Plenty of room for improvement

The ELM resource is now up and running. There is, however, plenty of room for improvement. Integration and creation of ontologies for protein-ELM interactions and ontology based filtering are examples of enhancements that can be made to the resource. Development of a multiple alignment filter could help the predictive power. However, creating and deploying Position Specific Scoring Matrices (PSSMs) or neural networks would be an essential enhancement of ELM. Even if such methods may only result in a moderate improvement in predictive power, they are needed in order to establish a statistical framework for benchmarking and optimisation of the resource as a whole. PSSMs are well suited because of the easy integration with oriented peptide libraries [241, 163].
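In contrast to a regular expression, a PSSM assigns every candidate window a score. The following is a minimal sketch of a log-odds PSSM built from aligned peptides, assuming a uniform background and a pseudocount of one per residue. The three training peptides are invented for illustration and are not ELM instances, and this is not a method implemented in the resource.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(instances, pseudocount=1.0):
    """Build a log-odds (base 2) PSSM from equal-length aligned peptides.

    A uniform background frequency of 1/20 is assumed for every residue.
    """
    length = len(instances[0])
    if any(len(p) != length for p in instances):
        raise ValueError("all instances must have the same length")
    background = 1 / len(AMINO_ACIDS)
    pssm = []
    for i in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for peptide in instances:
            counts[peptide[i]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm

def score(pssm, window):
    """Sum the column scores for a window as long as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, window))

def best_window(pssm, sequence):
    """Return (score, 1-based start) of the highest scoring window."""
    width = len(pssm)
    scored = [(score(pssm, sequence[i:i + width]), i + 1)
              for i in range(len(sequence) - width + 1)]
    return max(scored)

# Toy training peptides invented for illustration.
pssm = build_pssm(["RPLPPLP", "RALPPKP", "RSLPALP"])
best_score, best_start = best_window(pssm, "GGGRPLPPLPGGG")
print(round(best_score, 2), best_start)
```

Because every window receives a score, thresholds can be tuned against instance data and match significance can be estimated, which is what a plain pattern match cannot offer.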

5.6.2 Formalising ELM

A framework could be produced where linear modules are predicted by machine learning techniques [18, 161, 173], e.g. neural networks. Moreover, provided that the annotated data in the elmDB were to be expanded and made more reliable by using ontologies and controlled vocabularies, e.g. HUPO and Gene-Ontology, the author anticipates the creation of a next generation ELM resource based on a supervised learning framework, e.g. Hidden Markov Models (HMM). We already know many of the rules this HMM should deal with, refer to 4.2. Many of these rules are included in the resource already, however not as a formal framework which can be trained. Combining such a framework with domain predictions such as SMART would likely generate a resource for predicting protein function that would outperform any current resource in predictive power and usefulness. The ProtFun architecture would be one possible starting point for such a resource, however a substantial amount of development would have to be directed towards designing the HMM models for describing the molecular and logical contexts of linear motifs. In other words, the resource would simulate contextual and logical filtering. The author is not convinced that we have enough data yet to create such a next generation protein function predictor.


Paper I

This paper demonstrated the prototype of the ELM resource with limited functionality. The first filters to be developed were a taxonomy filter and a simple globular domain filter based on SMART/Pfam. This paper focuses on the prototype server and does not go into detail about the underlying framework for the resource.

5.7 ELM Server: A New Resource for Investigating Short Functional Sites in Modular Eukaryotic Proteins

Pål Puntervoll, Rune Linding, Christine Gemünd, Sophie Chabanis-Davidson, Morten Mattingsdal, Scott Cameron, David M. A. Martin, Gabriele Ausiello, Barbara Brannetti, Anna Costantini, Fabrizio Ferrè, Vincenza Maselli, Allegra Via, Gianni Cesareni, Francesca Diella, Giulio Superti-Furga, Lucjan Wyrwicz, Chenna Ramu, Caroline McGuigan, Rambabu Gudavalli, Ivica Letunic, Peer Bork, Leszek Rychlewski, Bernhard Küster, Manuela Helmer-Citterich, William N. Hunter, Rein Aasland and Toby J. Gibson

Department of Molecular Biology, University of Bergen, Norway

European Molecular Biology Laboratory, Postfach 10.2209, 69012 Heidelberg, Germany

Division of Biological Chemistry and Molecular Microbiology, University of Dundee, Dundee, UK

Centre for Molecular Bioinformatics, Department of Biology, University of Rome Tor Vergata, Rome, Italy

Cellzome AG, Heidelberg, Germany

BioInfoBank Institute, Poznan, Poland


The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors.

To whom correspondence should be addressed. Tel: +49 6221387398; Fax: +49 6221387517; Email: [email protected]

Multidomain proteins predominate in eukaryotic proteomes. Individual functions assigned to different sequence segments combine to create a complex function for the whole protein. While on-line resources are available for revealing globular domains in sequences, there has hitherto been no comprehensive collection of small functional sites/motifs comparable to the globular domain resources, yet these are as important for the function of multidomain proteins. Short linear peptide motifs are used for cell compartment targeting, protein-protein interaction, regulation by phosphorylation, acetylation, glycosylation and a host of other post-translational modifications. ELM, the Eukaryotic Linear Motif server at http://elm.eu.org/, is a new bioinformatics resource for investigating candidate short non-globular functional motifs in eukaryotic proteins, aiming to fill the void in bioinformatics tools. Sequence comparisons with short motifs are difficult to evaluate because the usual significance assessments are inappropriate. Therefore the server is implemented with several logical filters to eliminate false positives. Current filters are for cell compartment, globular domain clash and taxonomic range. In favourable cases, the filters can reduce the number of retained matches by an order of magnitude or more.

Introduction

The first crystal structure of a protein to be solved, myoglobin, revealed a compact globular structure with regular α-helical elements linked by short irregular loops [124]. Because single domain globular proteins are often, though not always, easy to crystallise, for a long time they dominated perception of typical protein structure (although fibrous proteins like collagen were of course well known). Gradually, as protein sequences have accumulated, the monodomain view of protein structure has been replaced by the realisation that most proteins are multidomain, at least in higher eukaryotes. The current champion in size is the giant muscle protein titin at >38 000 residues encompassing some 320 autonomously folded domains [19]. Multidomain architectures are usual for transmembrane receptors, signalling proteins, cytoskeletal proteins, chromatin proteins, transcription factors and so forth. There are now several globular protein domain databases accessible on the web, including Pfam [20], SMART [134], PROSITE [81], INTERPRO [11], PRODOM [51] and BLOCKS [103]. Using these tools, a user can often get a good overview of the domain architecture of a polypeptide sequence and the functions these domains are likely to perform. However, there remain protein sequence segments that are difficult to analyse productively. For example, there are often large segments of multidomain proteins that are non-globular, intrinsically lacking the capability to fold into a defined tertiary structure [74, 215, 111]. Sometimes the function of such regions may be as simple as linkers connecting globular domains and the sequence of amino acids is not important. The structure of yeast RNA polymerase II [52] illustrates this point. Very often, however, these unstructured regions may contain functional sites such as protein interaction sites, cell compartment targeting signals, post-translational modification sites or cleavage sites. These sites are usually short and often reveal themselves in multiple sequence alignments as short patches of conservation, leading to their definition as short sequence motifs. In addition to occurring outside globular domains, some sites, for example, phosphorylation sites, are often found in exposed, flexible loops protruding from within globular domains. These short peptide functional sites are analogous to the linear epitopes of immunology. Considering the abundance of targeting signals and post-translational modification sites, it is reasonable to assume that there are more functional sites than globular domains in a higher eukaryotic proteome. The PROSITE database has collected a number of linear protein motifs, representing them as regular expression patterns [81]. PROSITE patterns have been very useful, but also suffer from severe overprediction problems and more recently the database has emphasised globular domain annotation at the expense of linear motifs. However, the number of known categories of functional sites has burgeoned dramatically in the last few years and it is clear that there are more to be discovered. One only has to think of the huge current research activity into specific methylation and acetylation of histones and chromatin proteins, which erupted after decades of more indirect analyses [207, 218].

There has been a growing gap in the bioinformatics resources available to researchers for dealing with small functional sites. Indeed, it is impossible for a researcher to find a list of currently known motifs, while going through the literature to retrieve them is impractical without foreknowledge in more areas than any one person will have. The Eukaryotic Linear Motif (ELM) consortium has established a project to provide a hitherto missing bioinformatics resource for linear motifs. Our aim is to cover the set of functional sites that can be defined by the local peptide sequence, operating essentially independently of protein tertiary structure. The resource suffers from the overprediction problem inherent to small protein motifs, but we are developing context filters such as cell compartment, taxonomy and globular domain clash that can partly reduce the severity of the problem. In this resource, we use the term ELM to denote our bioinformatical representation of a functional site including the sequence motif and its context. ELM is an ongoing project but already provides a working server with >80 motif patterns and access to basic annotation. This manuscript provides an overview of the current status of the ELM resource and an indication of the future directions we hope to take.


ELM resource architecture

At the core of the ELM resource is a PostgreSQL relational database with 69 tables storing data about linear motifs. Much of this complexity is not yet fully utilised: it anticipates current and future filtering strategies as well as information retrieval by users. The ELM database architecture is beyond the scope of this manuscript and will be presented elsewhere. All data input is by hand curation. Annotating each ELM (our jargon: Siteseeing) typically involves extensive literature searches, BLAST runs, multiple alignment of relevant protein families, perusal of SWISS-PROT and other on-line databases and, where practical, discussion with experimentalists from the field. In order to promote interoperability with other bioinformatics resources we use two public annotation standards. Gene Ontology (GO) identifiers are used for cell compartment, molecular function and biological process [14, 106], while the NCBI taxonomy database identifiers [232] are used for taxonomic nodes at the apex of phylogenetic groupings in which an ELM occurs. The motif patterns are currently represented as POSIX regular expressions (usable in the Python and Perl languages), analogous to PROSITE but with a different syntax. For example, the C-terminal peroxisome import signal PTS1 [88] has a consensus sequence of xSKL or KSxL and, allowing for observed redundancy, can be represented as (.[SAPTC][KRH][LMFI]$)|([KRH][SAPTC][NTS][LMFI]$) where $ is the C-terminus. ELM is primarily developed and deployed with open source software and is hosted on Debian GNU/Linux and secure FreeBSD/OpenBSD systems. Software is developed in Python, including some modules from the http://BioPython.org project to retrieve information from SWISS-PROT and PubMed [232]. The web interface software uses the CGImodel framework [40]. The server output is HTML.
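As an illustration of how such a pattern can be applied, the following minimal Python sketch searches a sequence with the PTS1 expression quoted in the text, with its two alternatives joined by the regular-expression alternation operator |. The function name find_pts1 and the test sequences are invented for this example and are not part of the ELM software.

```python
import re

# PTS1 pattern as quoted in the text; "$" anchors the match to the C-terminus.
PTS1 = re.compile(r"(.[SAPTC][KRH][LMFI]$)|([KRH][SAPTC][NTS][LMFI]$)")


def find_pts1(sequence):
    """Return (start, end, peptide) using 1-based inclusive positions, or None.

    Hypothetical helper for illustration only.
    """
    match = PTS1.search(sequence)
    if match is None:
        return None
    return (match.start() + 1, match.end(), match.group(0))


# Invented sequences: the first ends in ...SKL, the second has one extra residue.
print(find_pts1("MADGTESKL"))
print(find_pts1("MADGTESKLA"))
```

Because the pattern is anchored with $, a single additional C-terminal residue is enough to abolish the match, which is the behaviour expected of a C-terminal signal.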

The ELM server

The public ELM server is at http://elm.eu.org/ and will be mirrored by consortium partners. The scheme in Figure 5.2 outlines how the server is implemented. Users submit a protein sequence to the server and specify the species and (if known) one or more relevant subcellular compartments. The server reports a list of matching motifs that have been filtered to remove implausible matches. Users should be patient as the turn-around time can be a few minutes while the server accesses several separate resources including the SMART domain server [134]. Matched motifs are usually not statistically significant and overprediction will occur despite filtering, hence matches should not be thought to represent true instances of functional sites (unless experimentally verified). Potentially interesting matches might be useful as guides to experiment. The filtered output list has links to the unfiltered results should the user wish to inspect them and also links to retrieve motif annotation from the ELM database.


[Figure 5.2: flow diagram. Input: query sequence (>RASN_HUMAN MTEYKLVVVGAGGVG...) and user-supplied data (Cytoplasm, Human). Query-dependent data collection: sequence-associated information from SMART, TM, PredictProtein, sigP and SCOP (e.g. RAS domain, residues 1-166). ELM search and filtering: candidate sites list (20..23 LIG_FHA_1; 35..39 MOD_GSK3_1; 186..189 MOD_CAAXbox; ...) passed through the taxonomy, globular domain and cell compartment filters using motif-dependent data from the ELM database. Output: context-filtered ELM predictions (MOD_CAAXbox, CVVM, 186..189, Generic CAAX box).]

Figure 5.2: Scheme of the ELM server flowthrough using human RASN as a query. Dashed boxes indicate the four stages from input to result. As the server is further developed, more filters will be added (light blue) requiring more query-dependent data to be retrievable (pink parallelograms).

ELM Filters

There is an apparent paradox in sequence motif matching. Pattern methods find many false (but apparently plausible) sequence matches, yet, somehow, these are not recognised by their cognate binding/modification proteins. One obvious reason why a sequence that matches a motif is not a true functional site is that the motif does not fully and accurately represent the functional site. Another reason is that the sequence matches occur in an irrelevant context. They may match to a sequence from a wrong cellular compartment or from a species that does not use this functional site. For these cases, it is easy to develop context filters that remove such false positives. Other reasons are less amenable to filter development given current knowledge. For example, tyrosine kinases appear to be non-specific in vitro [25, 128], yet they may have highly specific substrates in vivo. This suggests that their substrates are delivered through adaptor-mediated complexation. We would need to know a lot more about such molecular complexes to deploy them as useful filters. Currently we have three filters installed on the ELM server. These filters are not 100% accurate and may exclude true matches on occasion. The interface provides links to masked matches if the user wishes to retrieve them, but the top level results have been filtered. This approach should encourage users to think critically about ELM server results.

Cell compartment filter

Each ELM will be annotated with GO terms for the set of cell compartments in which it is known to function. For example, PTS1 is found on proteins that are targeted to the peroxisomal lumen whereas the NxS/T N-glycosylation site applies to proteins transported out of the cell. The user specifies the compartments in which the query protein functions and all matches for ELMs not found in these compartments will be filtered out.
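The logic of this filter can be sketched as a set intersection between the compartments annotated for an ELM and those supplied by the user. The sketch below is an assumption-laden illustration, not the server code: the dictionary keys, the match coordinates and the choice of GO identifiers (GO:0005782 for the peroxisomal matrix, GO:0005576 for the extracellular region) are the editor's own and should be checked against the current ontology. It also assumes that an ELM without any compartment annotation is kept rather than discarded, and it ignores parent/child relations between GO terms.

```python
# Hypothetical annotation: ELM name -> GO cellular-component identifiers.
# Names and identifiers are illustrative assumptions, not elmDB content.
ELM_COMPARTMENTS = {
    "PTS1": {"GO:0005782"},             # peroxisomal matrix (lumen)
    "N_glycosylation": {"GO:0005576"},  # extracellular region
}


def compartment_filter(matches, query_compartments):
    """Keep matches whose ELM shares at least one compartment with the query.

    matches: iterable of (elm_name, start, end) tuples.
    ELMs lacking annotation are kept (an assumption of this sketch).
    """
    query = set(query_compartments)
    kept = []
    for elm_name, start, end in matches:
        allowed = ELM_COMPARTMENTS.get(elm_name)
        if allowed is None or allowed & query:
            kept.append((elm_name, start, end))
    return kept


# Invented matches on a protein the user declares to be peroxisomal.
matches = [("PTS1", 340, 343), ("N_glycosylation", 51, 53)]
print(compartment_filter(matches, ["GO:0005782"]))
```

A production filter would additionally need to treat a match as compatible when the query compartment is a descendant or ancestor of the annotated GO term.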

Globular domain filter

All matches inside globular domains identified with the SMART and Pfam domain databases [134, 20] are subtracted. (About 10% of Pfam entries do not in fact correspond to globular domains: at the time of manuscript submission these are part of the filter but we will shortly use flags in Pfam to eliminate them.) Some functional sites seem never to be found inside globular domains, for example, PTS1 or the NR box (LXXLL) [100]. However, others, such as phosphorylation sites, are frequently in exposed loops of globular domains. Given the limited accuracy of the domain filter, users should consult the unfiltered results too. The domain filter currently acts as a screen. In many cases users will be able to investigate surface accessibility by examination of an available three-dimensional structure or by using a good quality two-dimensional structure prediction [54, 175] or perhaps by using a homology modelling server such as SWISS-MODEL or the Meta server [94, 35]. We are working to provide better domain filtering in the future, for example, by using surface accessibility in known structures and annotating known instances of intradomain ELMs.
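A minimal sketch of the subtraction step, using the RASN coordinates shown in Figure 5.2 (RAS domain at residues 1-166 and three candidate sites), is given below. The class and function names are hypothetical, and the rule that a match is masked only when it lies entirely inside a domain is an assumption of this sketch; the text does not say how partial overlaps are treated.

```python
from typing import List, NamedTuple, Tuple


class Match(NamedTuple):
    elm: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


class Domain(NamedTuple):
    name: str
    start: int
    end: int


def globular_domain_filter(
    matches: List[Match], domains: List[Domain]
) -> Tuple[List[Match], List[Match]]:
    """Split matches into (kept, masked).

    A match is masked when it lies entirely inside any domain
    (assumed rule; partial overlaps are kept).
    """
    kept, masked = [], []
    for m in matches:
        inside = any(d.start <= m.start and m.end <= d.end for d in domains)
        (masked if inside else kept).append(m)
    return kept, masked


# Coordinates taken from Figure 5.2 (human RASN query).
matches = [
    Match("LIG_FHA_1", 20, 23),
    Match("MOD_GSK3_1", 35, 39),
    Match("MOD_CAAXbox", 186, 189),
]
domains = [Domain("RAS", 1, 166)]

kept, masked = globular_domain_filter(matches, domains)
print([m.elm for m in kept])
print([m.elm for m in masked])
```

Returning the masked matches alongside the kept ones mirrors the server behaviour described above, where the unfiltered results remain available to the user.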

Taxonomic filtering

Some types of functional site are found in all known eukaryotes, e.g. the ER retention signal KDEL is universal (unless there are any eukaryotes that have secondarily lost the endoplasmic reticulum). But others are restricted to specific eukaryotic taxa. For example, the origin of multicellular animals drove the development of protein export enhancements especially for intercellular communication systems, leading to many novel kinds of functional site. Perhaps most strikingly, the large tyrosine kinase multigene family is found only in Metazoa. Occasionally functional sites may have become secondarily lost in a lineage. An example is PTS2, a second peroxisomal import signal found widely in eukaryotes but absent from the Caenorhabditis elegans proteome [154]. Each ELM is annotated with one or more NCBI taxonomy node identifiers to indicate its known phylogenetic distribution, e.g. the node Metazoa for SH2-, PTB-binding and other phosphotyrosine sites. The user provides the query species and all ELMs that are not assigned to its lineage are filtered out.
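The lineage test can be sketched as follows. This is an illustration under stated assumptions: the lineages are abbreviated and typed in by hand (intermediate taxonomy nodes are omitted), the ELM labels are invented, and a real implementation would look the lineage up in the NCBI taxonomy database rather than in a dictionary.

```python
# Abbreviated lineages as NCBI taxonomy identifiers (intermediate nodes omitted).
LINEAGES = {
    9606: [9606, 33208, 2759],  # Homo sapiens -> Metazoa -> Eukaryota
    4932: [4932, 4751, 2759],   # Saccharomyces cerevisiae -> Fungi -> Eukaryota
}

# Hypothetical ELM annotation: taxonomy nodes at the apex of each distribution.
ELM_TAXA = {
    "phosphotyrosine_site": {33208},  # Metazoa only
    "ER_retention_KDEL": {2759},      # all eukaryotes
}


def taxonomy_filter(elm_names, species_taxid):
    """Keep ELMs annotated with at least one node on the query species' lineage."""
    lineage = set(LINEAGES[species_taxid])
    return [name for name in elm_names if ELM_TAXA[name] & lineage]


candidates = ["phosphotyrosine_site", "ER_retention_KDEL"]
print(taxonomy_filter(candidates, 9606))  # human query
print(taxonomy_filter(candidates, 4932))  # yeast query
```

Annotating only the apex node keeps the annotation compact, but a secondary loss such as PTS2 in Caenorhabditis elegans would need an explicit exclusion that this sketch does not model.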

Figure 5.3 shows the ELM server output using the human RASN sequence as a query. Of 77 ELM entries, 14 have matches in the sequence, but 13 are removed by the filters with only the (true) C-terminal prenylation site remaining. This example indicates the potential of logical filters for improving motif searches.

Applying ELM

There are two primary purposes motivating the ELM project. One aim is to create a comprehensive database of eukaryotic linear motifs: a knowledge base that is currently missing in biological research. As the resource matures it will become increasingly valuable for data-mining purposes. The second aim is to provide a resource to aid in ELM discovery, furthering the understanding of multidomain proteins. This aim is harder to achieve since the server will provide many false assignments, although this varies according to the sequence information content of the ELMs. We illustrate this by observing the effects of the three currently implemented context filters on four different ELMs occurring in nuclear proteins (Table 5.1). In the case of WRPW, a motif that occurs at or close to the C-terminus [168], the regular expression alone is highly discriminative; the 54 matches in SWISS-PROT include 33 presumptive true positives. All these are retained after applying the three filters yet only one presumptive false positive remains. At the other extreme is SUMO [157], which has nearly 25 000 matches in the human subset of SWISS-PROT, of which 4059 hits remain after filtering. Since this implies that three out of four nuclear proteins have on average 2.5 sumoylations, this ELM is obviously subject to massive overprediction. Until we are able to provide calibration of ELM results, users can evaluate motif discrimination with the SIRW server (http://sirw.embl.de/), which allows pattern searching of database subsets selected by keyword such as nuclear, cytoplasm or Golgi [181]. Our analysis also shows that the current implementation of the globular domain filter significantly decreases overprediction (for example, by 53% for RBBD [56]; see Table 5.1). As discussed above, however, some true positives are filtered out since a number of ELMs occur in globular domains. This is the case for RBBD, where three experimentally confirmed sites reside in globular domains (see Table 5.1, footnote g). This deficiency will be remedied with improved domain filtering. The predictive power of the ELM resource can be enhanced by harnessing it to other data, including experimental results. For example, many protein kinase recognition sites are among those that severely overpredict. If a protein is known not to be phosphorylated, kinase sites can all be ignored, whereas if it is known to be phosphorylated, then the kinase site matches can be targeted for experimental testing. Mass spectrometry can be a useful tool in revealing post-translational modifications. ELM can provide synergism with appropriate experiments and can help in mapping out a research program. In this way, the ELM resource should become increasingly useful to the research community.

                                                                 Subcellular location
ELM_ID     Regular expression    Total hits  Taxonomy  Matches  Nuclear  Non-nuclear  Unknown  Non-globular
LIG_WRPW   [WFY]RP[WFY].{0,7}$           54  Metazoa        42       34            7        1            34
                                             Human          10        7            2        1             7
LIG_RBBD   [LI].C.E                    6127  Metazoa      2784      487         1305      992
                                             Human         813      185          347      281            87
LIG_NRBOX  [^P]L[^P][^P]LL[^P]        44902  Metazoa     19963     2003        11752     6208
                                             Human        6138      775         3641     1722           458
MOD_SUMO   [VILAFP]K.[EDNGP]         255048  Metazoa     81329    16094        37502    27733
                                             Human       24319     5968        10428     7923          4059

Table 5.1: Distribution of selected nuclear ELM matches within SWISS-PROT release 40.41 (121515 sequence entries). LIG_WRPW: ligand motif for transcriptional cofactors; LIG_RBBD: ligand motif for Rb interacting proteins; LIG_NRBOX: ligand motif for nuclear receptors; MOD_SUMO: modification motif for sumoylation.
Total hits: the total number of regular expression matches. One sequence may have more than one hit.
Taxonomy: the taxonomy range for each ELM is given along with the number of matches within that taxonomy range. In addition, the corresponding numbers for Homo sapiens are shown.
Subcellular location: evaluated by the SWISS-PROT comment line 'subcellular location'. Nuclear: comment contains the word nuclear or nucleus. Non-nuclear: comment does not contain the words nuclear or nucleus. Unknown: comment line is missing.
Non-globular: globularity of the human nuclear sequences with ELM predictions was evaluated by the SMART server (including Pfam domains). All ELMs that are within SMART/Pfam domains were excluded.
LIG_WRPW: all but one of the predicted nuclear LIG_WRPWs are presumptive true positives.
LIG_RBBD: eleven of 19 experimentally verified instances of LIG_RBBD in human sequences are in this set. Among the missing occurrences are three which are known to reside in globular domains.
MOD_SUMO: due to the large number of sequences containing predicted MOD_SUMO, 200 randomly chosen sequences were subjected to the SMART/Pfam filtering. The obtained ELM number was scaled to reflect the theoretical number of MOD_SUMOs in non-globular regions of human nuclear sequences. MOD_SUMO is known to be located in globular domains as well as in non-globular regions, and some true positives are thus likely to have been filtered out by the crude SMART/Pfam filter.

Figure 5.3: Example of ELM server output using a short sequence, human RASN, as query. The output provides a table summarising the matches and the filtering, a list of globular domains revealed by the SMART server (in this case the RAS domain entry), the list of motif matches that survived filtering (in this case only the C-terminal prenylation site), and finally the list of matches excluded by domain filtering. Hyperlinks to the filtered results as well as to ELM annotation are provided.
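The effect of the globular domain filter quoted in the text (a reduction of about 53% for RBBD) can be recomputed from the human rows of Table 5.1 by comparing the nuclear hits with the hits remaining in non-globular regions. The short sketch below does this for all four ELMs; the variable and function names are the editor's, while the numbers are copied from the table.

```python
# Human rows of Table 5.1: nuclear hits before and after the SMART/Pfam filter.
HUMAN_NUCLEAR = {"LIG_WRPW": 7, "LIG_RBBD": 185, "LIG_NRBOX": 775, "MOD_SUMO": 5968}
HUMAN_NON_GLOBULAR = {"LIG_WRPW": 7, "LIG_RBBD": 87, "LIG_NRBOX": 458, "MOD_SUMO": 4059}


def reduction(elm_id):
    """Fraction of nuclear hits removed by the globular domain filter."""
    before = HUMAN_NUCLEAR[elm_id]
    after = HUMAN_NON_GLOBULAR[elm_id]
    return 1 - after / before


for elm_id in HUMAN_NUCLEAR:
    print(f"{elm_id}: {reduction(elm_id):.0%} of nuclear hits removed")
```

For LIG_RBBD this gives (185 - 87) / 185, or roughly 53%, in agreement with the figure quoted in the text, whereas no LIG_WRPW hits are removed.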


Name              Functional sites                                         URL                                                      PMID
Scansite          Phosphorylation and signaling motifs                     scansite.mit.edu                                         11283593
NetOGlyc          Mucin type GalNAc O-glycosylation sites                  www.cbs.dtu.dk/services/NetOGlyc                         9557871
NetNglyc          N-Glycosylation motifs                                   www.cbs.dtu.dk/services/NetNGlyc
PredictNLS        Nuclear localization signals                             cubic.bioc.columbia.edu/predictNLS                       11258480
The Sulfinator    Tyrosine sulfation motifs                                us.expasy.org/tools/sulfinator                           12050077
NMT               N-terminal N-myristoylation motifs                       mendel.imp.univie.ac.at/myristate/SUPLpredictor.htm      11955008
PSORT             Protein sorting signals                                  psort.nibb.ac.jp                                         10829231
TargetP           Protein sorting signals                                  www.cbs.dtu.dk/services/TargetP                          10891285
SignalP           Cleavage sites and signal/nonsignal peptide prediction   www.cbs.dtu.dk/services/SignalP                          9051728
Big-PI Predictor  GPI modification site                                    mendel.imp.univie.ac.at/gpi/gpi_server.html              11287675
MITOPROT          Mitochondrial targeting sequences                        www.mips.biochem.mpg.de/cgi-bin/proj/medgen/mitofilter   8944766

Table 5.2: Some specialised resources for motif analysis. PMID is the PubMed Identifier for the server publication.


Other motif resources

ELM is already the largest collection of linear motifs, followed by PROSITE and Scansite [241]. There are other sites that specialise in one or a few motifs, for which they may provide better prediction quality than ELM, and these should be utilised where appropriate. Many functional sites reside in unstructured polypeptide regions and the GlobPlot server (http://globplot.embl.de) is useful for revealing sequence segments of non-globular character [139], the inverse of the SMART and Pfam domain servers. Some useful motif servers are listed in Table 5.2 and the ELM and ExPASy servers list more. Also of note are protein interaction databases such as BIND [17] and DIP [238]. More informative protein interaction databases that store known instances of linear motifs [237] include MINT [242], Phosphobase [128] and ASC [80]. Databases of instances are not directly useful for prediction but provide valuable data-mining resources.

Future directions

The current ELM resource provides basic functionality and there are many ways in which it can be improved. More comprehensive coverage and better motif annotation are planned, including known instances, representative alignments and standardised motif nomenclature [1]. In many cases HMM or Profile methods [189] will provide complementary or more sensitive detection with respect to regular expressions and we plan to provide both. We are working to improve filtering logic, especially for globular domains, currently the weakest filter. Other filters, including a surface accessibility filter and a segment flexibility filter, are being developed and will be implemented after successfully passing the benchmarks. Calibration of prediction quality for each ELM is needed for users to assess overprediction likelihoods. The ELM server can be improved with a graphical interface and by performance enhancements that may include GRID technology. We intend to make ELM available for automated proteome analysis pipelines. Last, but not least, we hope that the research community will provide us with useful feedback and help us to improve ELM.

Acknowledgements

The ELM consortium is funded by EU grant QLRI-CT-2000-00127.


Draft I

This draft paper describes the ELM resource design and database in detail. We show the potential of the new filters developed since the prototype, including those of disorder and structure. A survey of the resource is included, e.g. correlations between linear motifs and Gene Ontology are shown. The first year of my PhD work was focused on designing and conceiving the prediction strategies and framework to support ELM. To be submitted to JMB 2004.

5.8 ELM - The Eukaryotic Linear Motif database

Rune Linding, Pål Puntervoll, Sophie Chabanis-Davidson, Allegra Via, Lars Juhl Jensen, Morten Mattingsdal, Anna Costantini, Barbara Brannetti, Caroline McGuigan, Christine Gemünd, David M. A. Martin, Francesca Diella, Gabriele Ausiello, Ivica Letunic, Jakub Pas, Lucjan S. Wyrwicz, Marcin von Grotthuss, Scott Cameron, Victor Neduva, Giulio Superti-Furga, Peer Bork, Gianni Cesareni, Robert B. Russell, Bernhard Küster, Manuela Helmer-Citterich, William N. Hunter, Leszek Rychlewski, Toby J. Gibson and Rein Aasland

The ELM Consortium - http://elm.eu.org

EMBL - Biocomputing Unit - Meyerhofstr 1 - D-69117 Heidelberg - Germany

CBU - Department of Computational Biology, University of Bergen, Bergen - Norway

MBI - Department of Molecular Biology, University of Bergen, Bergen - Norway

University of Rome Tor Vergata, Rome - Italy

Division of Biological Chemistry and Molecular Microbiology, University of Dundee, Dundee - UK

Centre for Molecular Bioinformatics, Department of Biology, University of Dundee, Dundee - UK

Cellzome AG, Heidelberg - Germany

BioInfoBank Institute, Poznan - Poland

To whom correspondence should be addressed. E-mail: [email protected]


Summary

A large number of functional sites, only now being comprehensively catalogued, are found primarily in unstructured parts of proteins. These linear modules encompass ligand sites such as 14-3-3, SH3 and Cyclin ligands as well as post-translational modification sites and targeting signals. We have created the largest and most comprehensive computational resource for finding these functional sites: the Eukaryotic Linear Motif resource, ELM (http://elm.eu.org). The ELM resource is knowledge based and stores contextual profiles for linear functional sites annotated from the scientific literature. Contextual and logical filters are applied to reduce the overprediction of short linear motifs. We here show the design details of the ELM resource. This is to our knowledge the first contextual and logical filtering resource available to the biological community. This is also the first survey of the biological information content of the ELM resource. We show how the resource currently covers cellular compartments, taxonomic ranges and biological processes. The process of annotating the knowledge base is described here in detail. Recently we have added new and very powerful structural filters to the resource. The resource now offers filtering based both on ab initio predictions and on known protein structures. We have collected a large number of experimentally verified instances of functional sites; these are used to test and benchmark proteome-wide predictions.

Introduction

The Eukaryotic Linear Motif server (http://elm.eu.org/, [179]) is a young bioinformatics resource for investigating candidate short functional motifs in eukaryotic proteins. Linear motifs are short (normally 5-10 amino acids) and their matches are therefore difficult to evaluate statistically. The ELM resource therefore deploys logical and contextual filters to eliminate false positives; the prediction strategy and the framework supporting it are described in this paper. The reason for using filtering is that many sequence matches (potential ELM instances) occur in an irrelevant context. They may match to a sequence from a wrong cellular compartment or from a species that does not use this functional site. As we have seen, the structural context is also of great importance, since a linear motif must be accessible in order to be functional. It is possible to develop context filters that remove such false positives. In ELM most of this is done inside the knowledge database that the annotation process is building. The ELM database is designed to accommodate these types of filters. Globular structured components are described by resources such as SMART and Pfam [20, 133], whereas the linear components are much less catalogued. Important resources for linear components are ELM, Scansite (http://scansite.mit.edu), PROSITE (http://expasy.org) and CBS (http://www.cbs.dtu.dk). ELM is the largest collection of linear motifs followed by Scansite and PROSITE [163, 241, 200].


5.8. ELM - The Eukaryotic Linear Motif database 55

Results & Discussion

5.8.1 ELM Resource architecture

The core of the ELM resource is a relational database, which stores data concerning linear motifs. Figure 5.4 outlines how the ELM server is implemented. The user submits a protein sequence to the server and receives a list of matching ELMs that has been filtered to remove false positives. This list may still contain residual false positives, and true sites may be missing from it (false negatives). Matched motifs are usually not statistically significant and overprediction will occur despite filtering; hence matches should be regarded as potential true instances of functional sites and used as guides to experimental determination.

5.8.2 Prediction strategy

We developed contextual and logical filtering strategies. To our knowledge this is the first time this has been done in a biocomputational resource. It entails the collection of contextual information sets, context profiles, for each functional site. Examples of contextual information are the taxonomic range, cellular compartment and structural environment in which a given functional site is functional. The profiles are stored in a knowledge base modelled as a relational database, elmDB, storing data concerning linear motifs. Figure 5.4 outlines how contextual profiling works in the ELM resource. In order to accommodate the prediction strategy a structured logical information system was needed. The best suited system was found to be the Structured Query Language (SQL), which allows for logical constraints and queries on structured information. The structured information is, in our case, the context profiles, and the filters can be formulated as queries against the data. Part of the knowledge gained will go into the development of advanced contextual filters such as those dealing with the three-dimensional structure of proteins.
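To illustrate how a context filter can be phrased as a query, the following is a minimal sketch in Python using an in-memory SQLite database. The two tables (`elms`, `elm_compartments`), their columns and the function `elms_allowed_in` are simplified, hypothetical stand-ins and not the actual elmDB schema, which runs on PostgreSQL.

```python
import sqlite3

# Hypothetical, heavily simplified stand-in for a context-profile store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elms (id INTEGER PRIMARY KEY, identifier TEXT UNIQUE);
    CREATE TABLE elm_compartments (elm_id INTEGER, compartment TEXT);
    INSERT INTO elms VALUES (1, 'TRG_ER_KDEL_1'), (2, 'MOD_SUMO');
    INSERT INTO elm_compartments VALUES
        (1, 'endoplasmic reticulum'), (2, 'nucleus'), (2, 'PML body');
""")


def elms_allowed_in(compartment):
    """Return identifiers of ELMs annotated as functional in `compartment`."""
    rows = conn.execute(
        """SELECT e.identifier
           FROM elms e JOIN elm_compartments c ON c.elm_id = e.id
           WHERE c.compartment = ?
           ORDER BY e.identifier""",
        (compartment,),
    )
    return [identifier for (identifier,) in rows]


# Raw pattern matches on a query protein, before any filtering.
raw_matches = ["TRG_ER_KDEL_1", "MOD_SUMO"]

# The filter is a query against the context profiles: keep only matches
# whose ELM is annotated for the compartment given by the user.
allowed = set(elms_allowed_in("nucleus"))
filtered = [m for m in raw_matches if m in allowed]
print(filtered)
```

Keeping the context in tables means that a new filter is a new query rather than new matching code, which is the design choice the paragraph above argues for.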

5.8.3 Database schema of elmDB

Figure 5.6 shows the entity-relationship model for the elmDB². A simpler version of the schema can be seen in Figure 5.5. The relational database powering the ELM resource consists of 69 relational tables. The database can store a large variety of information regarding functional sites, ELM sequence models, ELM instances and contextual information. The full database schema can be viewed at http://elm.eu.org/elmDB_scheme.pdf. The database was designed after the entity-relationship model and is normalised to third normal form (3NF) [38]. We have chosen the most powerful and feature-rich open source Relational Database Management System (RDBMS), PostgreSQL (http://www.postgresql.org), for the job of managing the elmDB.

² Please refer to page 155 for a magnified version.

Page 76: Linear Functional Modules€¦ · Dansk Resume I årtier har ... Martin DM, Ausiello G, Brannetti B, Costantini A, Ferre F, Maselli V, Via A, Cesareni G, Diella F, Superti-Furga G,

56 Chapter 5. The Design of ELM

Figure 5.4: The information flow in the current version of the ELM resource. Currently we have three filters installed on the ELM server. These filters are not completely accurate and will occasionally introduce false negatives, although we try to avoid this as much as possible. In general the approach in ELM is to predict as few false positives as possible, but avoiding false negatives is even more important.

A software suite for communicating with the database has been developed in Python (http://www.python.org).

5.9 ELM resource content - a survey

We here show the first resource-wide survey of ELM.

5.9.1 Mapping ELMs onto Gene Ontology

A major advancement in the biosciences was the recent introduction of ontologies for formalising biology. The Gene Ontology (GO) consortium provides three major ontologies for the description of ‘biological process’, ‘molecular function’ and ‘cellular component’. In ELM we are currently using the biological process and cellular component ontologies; the molecular function ontology is currently not suited for ELM. We use the cellular component ontology to describe subcellular compartment relationships for functional sites. The GO ‘biological process’ ontology is used to describe which major functions the host protein of a given functional site is involved in. In order to illustrate the relationship between ELMs and annotated GO terms we have done a survey in the Drosophila proteome. This is shown in Figure 5.7. Only some of the major categories of the ontology are shown for simplicity. When a black dot is observed on a white background for a cell compartment, it mostly means that the annotation of subcellular localisation of proteins in GOA is based on InterPro, which is error prone. For biological processes these missing correlations are likely due to overprediction by the pattern for the specific ELM.

Figure 5.5: Simplified diagram of the elmDB.

5.9.2 The taxonomic distribution of linear motifs

Functional sites are conserved across species; however, they often manifest themselves differently in sequence space. ELMs have specific taxonomic profiles, e.g. the retention signal for the endoplasmic reticulum, TRG_KDEL, has different sequences in mammals (KDEL), yeast (HDEL) and fungi (DDEL).
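The taxon dependence of a motif can be sketched with one pattern per taxonomic group. This is an illustrative example only: the dictionary `KDEL_VARIANTS`, the taxon labels and the function `has_er_retention_signal` are hypothetical, and the patterns are simplified to the four C-terminal residues named above rather than taken from the ELM resource.

```python
import re

# Hypothetical taxon-specific patterns for the C-terminal ER retention signal.
# "$" anchors the tetrapeptide at the end of the sequence.
KDEL_VARIANTS = {
    "mammals": re.compile(r"KDEL$"),
    "yeast": re.compile(r"HDEL$"),
    "fungi": re.compile(r"DDEL$"),
}


def has_er_retention_signal(sequence, taxon):
    """True if `sequence` ends with the variant annotated for `taxon`."""
    pattern = KDEL_VARIANTS.get(taxon)
    return bool(pattern and pattern.search(sequence))


# Toy sequences, not real proteins.
print(has_er_retention_signal("MAAAKDEL", "mammals"))
print(has_er_retention_signal("MAAAKDEL", "yeast"))
```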


Journals +id: SERIAL pkey +name: VARCHAR(512) ukey
Authors +id: SERIAL pkey +name: VARCHAR(512) ukey
Articles +id: SERIAL pkey +flag_id: INTEGER fkey = 1 +journal_id: INTEGER fkey +title: VARCHAR(1024) +abstract: VARCHAR(4096) +volume: SMALLINT +issue: VARCHAR(64) = '' +pagination: VARCHAR(64) +PMID: INTEGER ukey +publication_date: VARCHAR(48) = ('TODAY') +ukey = (title,pagination)
Site_siteseeings +site_id: INTEGER fkey +elmer_id: INTEGER fkey +pkey = (site_id,elmer_id)
Elms +id: SERIAL pkey +identifier: VARCHAR(32) ukey +short_description: VARCHAR(256) ukey +functional_site_id: INTEGER fkey = 1 +log_id: INTEGER fkey = 1
Article_authors +author_id: INTEGER fkey +article_id: INTEGER fkey +author_rank: SMALLINT = 1 +pkey = (author_id,article_id) +ukey = (article_id,author_rank)
Evidence_references +evidence_id: INTEGER fkey +article_id: INTEGER fkey +pkey = (evidence_id,article_id)
ELMers +id: SERIAL pkey +firstname: VARCHAR(64) +lastname: VARCHAR(64) +initials: VARCHAR(6) = '' +degree: VARCHAR(128) = '' +startdate: DATE = '2001-01-03' +enddate: DATE = 'TODAY' +birthday: DATE = 'TODAY' +gender: CHAR(1) +email_id: INTEGER fkey = 1 +cellphone: VARCHAR(64) = '' +workphone: VARCHAR(64) +homephone: VARCHAR(64) = '' +fax: VARCHAR(64) +workplace_id: INTEGER fkey = 1 +ukey = (firstname,lastname,initials) +gender_check = gender IN ('M','F') +url_id: INTEGER fkey = 1 +siteseer: BOOLEAN = TRUE +login: VARCHAR(128) ukey = Firstname+Lastname
Workplaces +id: SERIAL pkey +name: VARCHAR(128) +department: VARCHAR(128) +street: VARCHAR(128) +postalcode: VARCHAR(128) = '' +city: VARCHAR(128) +country: VARCHAR(128) +state: VARCHAR(128) = '' +url_id: INTEGER fkey ukey +ukey = (name,department)
Stati +id: SERIAL pkey +status: VARCHAR(128) ukey
Proteins +id: SERIAL pkey +public_name: VARCHAR(256) ukey +note_id: INTEGER fkey = 1 +expression_id: INTEGER fkey = 1
Taxonomy_tree NEW +node_id: INTEGER pkey = 1 +rank_id: INTEGER fkey = 1 +ncbi_scientific_name: VARCHAR(256) ukey +genbank_common_name: VARCHAR(256) = '' +left_metric: INTEGER = 1 +right_metric: INTEGER = 1
Protein_references +protein_id: INTEGER fkey +article_id: INTEGER fkey +pkey = (protein_id,article_id)
External_links +id: SERIAL pkey +external_id: VARCHAR(128) = '' +external_database_id: INTEGER fkey = 1 +external_id_type_id: INTEGER fkey = 1 +external_value: VARCHAR(1024) = '' +ukey = (external_id,external_database_id,external_id_type_id)
External_databases +id: SERIAL pkey +name: VARCHAR(256) ukey +url_id: INTEGER fkey = 1 +module_file_id: INTEGER fkey = 1 +ukey = (name,url_id)
Protein_external_links +protein_id: INTEGER fkey +external_link_id: INTEGER fkey +logic_id: INTEGER fkey +pkey = (protein_id,external_link_id,logic_id)
Site_external_links NEW +site_id: INTEGER fkey +external_link_id: INTEGER fkey +logic_id: INTEGER fkey +pkey = (site_id,external_link_id,logic_id)
Notes +id: SERIAL pkey +note: TEXT
URLs +id: SERIAL pkey +note_id: INTEGER fkey +url: VARCHAR(8190)
Synonyms +id: SERIAL pkey +synonym: VARCHAR(256) ukey
Protein_synonyms +protein_id: INTEGER fkey +synonym_id: INTEGER fkey +pkey = (protein_id,synonym_id)
Emails +id: SERIAL pkey +email: VARCHAR(256) ukey
Models +id: SERIAL pkey +model: TEXT +name: VARCHAR(128) ukey +model_type_id: INTEGER fkey +sha1_bin: BYTEA ukey +sha1_hex: CHAR(40) ukey +note_id: INTEGER fkey
Elm_models +elm_id: INTEGER fkey +model_id: INTEGER fkey +note_id: INTEGER fkey = 1 +pkey = (elm_id,model_id) +best_model: BOOLEAN = FALSE
Files +id: SERIAL pkey +file_oid: OID +description: VARCHAR(512) +filename: VARCHAR(512) +sha1_bin: BYTEA ukey +sha1_hex: CHAR(40) ukey +note_id: INTEGER fkey +file_type_id: INTEGER fkey +ukey = (file_oid,filename)
Protein_files +file_id: INTEGER fkey +protein_id: INTEGER fkey +pkey = (file_id,protein_id)
Site_URLs +site_id: INTEGER fkey +url_id: INTEGER fkey +pkey = (site_id,url_id)
Protein_URLs +protein_id: INTEGER fkey +url_id: INTEGER fkey +pkey = (protein_id,url_id)
Functional_sites +id: SERIAL pkey +abstract: TEXT +short_description: VARCHAR(1024) ukey +log_id: INTEGER fkey = 1 +public_name: VARCHAR(256) ukey +expert_email_id: INTEGER fkey = 1 +descriptive_title: VARCHAR(128) ukey +status_id: INTEGER fkey = 1
Elm_taxonomies +node_id: INTEGER fkey +elm_id: INTEGER fkey +excluded_node: BOOLEAN = FALSE +pkey = (node_id,elm_id)
Site_synonyms +site_id: INTEGER fkey +synonym_id: INTEGER fkey +pkey = (site_id,synonym_id)
Elm_instances +id: SERIAL pkey +elm_id: INTEGER fkey = 1 +sequence_id: INTEGER fkey = 1 +logic_id = 1 +startposition: INTEGER = 0 +endposition: INTEGER = 0 +model_id: INTEGER fkey +note_id: INTEGER fkey = 1 +ukey = (elm_id,sequence_id,startposition,endposition,logic_id,model_id)
External_methods +id: SERIAL pkey +name: VARCHAR(128) +method: VARCHAR(128) = 'Neural Network' +module_file_id: INTEGER fkey = 1 +url_id: fkey ukey = 1 +contact_email_id: INTEGER fkey = 1 +display: VARCHAR(1024) = 'Copyright 2002 ELM Consortium' +note_id: INTEGER fkey = 1 +interface_flag_id: INTEGER fkey = 1 +ukey = (name,url_id)
Protein_expression_contexts +id: SERIAL pkey +context: VARCHAR(1024) ukey
External_id_types +id: SERIAL pkey +external_id_type: VARCHAR(256) ukey
File_types +id: SERIAL pkey +type: VARCHAR(256) ukey
Elm_files +file_id: INTEGER fkey +elm_id: INTEGER fkey +pkey = (file_id,elm_id)
Site_files +file_id: INTEGER fkey +site_id: INTEGER fkey +pkey = (file_id,site_id)
Logic +id: INTEGER pkey +logic: VARCHAR(256) ukey +value: FLOAT8 = 0
Evidence_codes +id: SERIAL pkey +code: VARCHAR(8) ukey +description: VARCHAR(256) ukey
Article_evidence_codes +evidence_code_id: INTEGER fkey +article_id: INTEGER fkey +pkey = (evidence_code_id,article_id)
Flags +id: SERIAL pkey +flag: VARCHAR(128) ukey
External_link_flags +flag_id: INTEGER fkey +external_link_id: INTEGER fkey +pkey = (flag_id,external_link_id)
Search_profiles +id: SERIAL pkey +profile: VARCHAR(1024) ukey +note_id: INTEGER fkey = 1 +search_type_id: INTEGER fkey = 1
Sequences +id: SERIAL +name: VARCHAR(256) ukey +sha1_bin: BYTEA ukey +sha1_hex: CHAR(40) ukey +sequence: TEXT +length: INTEGER = 0 +complexity: VARCHAR(1024) = 'formula?' +composition: VARCHAR(1024) = 'atomic?' +molecular_weight: FLOAT8 +sequence_version: VARCHAR(128) +external_checksum: VARCHAR(512) ukey +taxonomy_node_id: INTEGER fkey
Sequences_external_links +sequence_id: INTEGER fkey +external_link_id: INTEGER fkey +logic_id: INTEGER fkey +pkey = (sequences_id,external_link_id,logic_id)
Site_search_profiles +site_id: INTEGER fkey +search_profile_id: INTEGER fkey +pkey = (site_id,query_id)
Elm_external_links +elm_id: INTEGER fkey +external_link_id: INTEGER fkey +logic_id: INTEGER fkey +pkey = (elm_id,external_link_id,logic_id)
Evidence_classes +id: SERIAL pkey +class: VARCHAR(128) ukey
Evidences +id: INTEGER pkey +method_id: INTEGER fkey +evidence_class_id: INTEGER fkey +reliability_id: INTEGER fkey +note_id: INTEGER fkey
Instance_evidence +elm_instance_id: INTEGER fkey +evidence_id: INTEGER fkey +logic_id: INTEGER fkey
Taxonomy_ranks +id: SERIAL pkey +rank: VARCHAR(64) ukey
Taxonomy_flags +flag_id: INTEGER fkey +node_id: INTEGER fkey +pkey = (flag_id,node_id)
Evidence_external_links +evidence_id: INTEGER fkey +external_link_id: INTEGER fkey +logic_id: INTEGER fkey +pkey = (elm_id,external_link_id)
Sequence_types +id: SERIAL pkey +type: VARCHAR(128) ukey
Article_files +file_id: INTEGER fkey +article_id: INTEGER fkey +pkey = (file_id,article_id)
Evidence_evidence_codes +evidence_code_id: INTEGER fkey +evidence_id: INTEGER fkey +pkey = (evidence_code_id,evidence_id)
Reliabilities +id: SERIAL pkey +reliability: VARCHAR(128) ukey
External_link_files +file_id: INTEGER fkey +external_link_id: INTEGER fkey +pkey = (file_id,external_link_id)
Model_types +id: INTEGER pkey +type: VARCHAR(64) ukey
Households +mtime: TIMESTAMP WITH TIME ZONE = ('NOW') +creation_date: TIMESTAMP WITH TIME ZONE = ('NOW')
Protein_taxa +protein_id: INTEGER fkey +node_id: INTEGER fkey +pkey = (protein_id,node_id)
Evidence_methods +id: INTEGER pkey +method: VARCHAR(512) ukey
Elm_siteseeings +elm_id: INTEGER fkey +elmer_id: INTEGER fkey +pkey = (elm_id,elmer_id)
Search_profile_types +id: SERIAL pkey +profile: VARCHAR(256) ukey
Protein_complexes +id: SERIAL pkey +name: VARCHAR(256) ukey
Complex_subunits +complex_id: INTEGER fkey +protein_id: INTEGER fkey +pkey = (complex_id,protein_id)
Protein_complexes +id: SERIAL pkey +name: VARCHAR(512) ukey
Complex_subunits +protein_id: INTEGER fkey +complex_id: INTEGER fkey
Logic_flags +flag_id: INTEGER fkey +logic_id: INTEGER fkey +pkey = (flag_id,logic_id)

Copyright (C) 2001-2004 Rune Linding - ELM Consortium - http://elm.eu.org

Figure 5.6: The elmDB represented as a Unified Modelling Language (UML) model. It consists of 69 relational tables and is highly normalised. The database is modelled in a descriptive manner and is therefore to a large extent self-explanatory. The schema can be downloaded at http://elm.eu.org/elmDB.pdf. The public version of the database does not currently include the information for recognition modules and interaction partners.


[Figure 5.7 image: heat map of ELMs (rows) against GO terms (columns).
Rows: CLV_NDR_NDR_1, CLV_PCSK_KEX2_1, LIG_14-3-3_1, LIG_14-3-3_2, LIG_AP_GamEar_1, LIG_COP1, LIG_CORNRBOX, LIG_CYCLIN_1, LIG_Clathr_ClatBox_1, LIG_CtBPBD, LIG_Dynein_DLC8_1, LIG_EH1, LIG_FHA_1, LIG_IQ, LIG_MYND, LIG_NRBOX, LIG_PCNA, LIG_PDZ_1, LIG_PDZ_2, LIG_PDZ_3, LIG_PIP2_ANTH_1, LIG_PIP2_ENTH_1, LIG_PP1, LIG_PTB_1, LIG_PTB_2, LIG_RBBD, LIG_RGD, LIG_SH2_1, LIG_SH2_2, LIG_SH2_3, LIG_SH3_1, LIG_SH3_2, LIG_SH3_3, LIG_SH3_4, LIG_SH3_5, LIG_TPR, LIG_TRAF2_1, LIG_TRAF2_2, LIG_WRPW_1, LIG_WRPW_2, LIG_WW_1, LIG_WW_2, LIG_WW_3, LIG_WW_4, MOD_ASX_betaOH_EGF, MOD_CAAXbox, MOD_CDK, MOD_CK1_1, MOD_CK2_1, MOD_Cter_Amidation, MOD_GlcNHglycan, MOD_N-GLC_1, MOD_NMyristoyl, MOD_PKA_1, MOD_PKA_2, MOD_PKB_1, MOD_SUMO, MOD_WntLipid, TRG_EH, TRG_ER_KDEL_1, TRG_ER_diArg_1, TRG_ER_diLys_1, TRG_EVH1_I, TRG_EVH1_II, TRG_Golgi_diPhe_1, TRG_PTS1, TRG_PTS2, TRG_WXXXY/F.
Columns: Regulation of transcription, Signal transduction, Response to stress, Cell communication, Cell differentiation, Development, Nucleotide metabolism, Phosphate metabolism, Protein catabolism, Electron transport, Intracellular protein transport, Synaptic vesicle transport, Nucleus, Endoplasmic reticulum, Integral to membrane, Membrane, Extracellular.
Colour scale, Low to High: overrepresentation of GO term among proteins matching ELM. Black dot: GO term annotated for ELM.]

Figure 5.7: Comparison of GO annotation of ELMs and GO annotation of predicted ELM-containing proteins in the D. melanogaster proteome. For each ELM passing the taxonomic filter, the set of proteins that matches that ELM’s regular expression was identified. Based on GOA [37], the GO terms significantly overrepresented among the proteins in each set were identified using a hypergeometric test. For each such associated ELM and GO term, the fold over-representation of the GO term is colour coded. A black dot in the centre of a field denotes that the ELM is annotated with the GO term in question.
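The hypergeometric test and the fold over-representation mentioned in the caption of Figure 5.7 can be sketched as follows. This is a generic textbook formulation and not necessarily the exact procedure used for the figure; the function and argument names are hypothetical, and the numbers in the usage lines are made up.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """Upper-tail probability P(X >= hits).

    population: proteins in the proteome
    annotated:  proteins in the proteome carrying the GO term
    sample:     proteins matching the ELM pattern
    hits:       matching proteins that carry the GO term
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


def fold_overrepresentation(population, annotated, sample, hits):
    """Observed GO-term frequency in the sample over the background frequency."""
    return (hits / sample) / (annotated / population)


# Made-up example: 1000 proteins, 50 carry the term, 40 match the ELM, 8 of those carry it.
print(hypergeom_pvalue(1000, 50, 40, 8))
print(fold_overrepresentation(1000, 50, 40, 8))
```

A small p-value marks the GO term as significantly overrepresented, while the fold value corresponds to the kind of quantity that is colour coded in such a heat map.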


[Figure 5.8 image, panel (a), taxonomy survey. Taxonomic nodes: Eukaryota, Viridiplantae, Fungi/Metazoa group, Metazoa, Eumetazoa, Vertebrates, Mammalia.]

[Figure 5.8 image, panel (b), compartment survey. Compartments: extracellular, plasma membrane, cytoplasm, endoplasmic reticulum, Golgi apparatus, mitochondrion, peroxisome, nucleus.]

Figure 5.8: The taxonomic and subcellular localisation surveys. The numbers indicate how many ELMs have the given annotation profile.

5.10 Two filter types: contextual and logical

The basic idea is that, since we cannot discriminate ELMs based on sequence matching alone, we can use a knowledge base of contextual information regarding functional sites and ELMs to filter out false positives. This knowledge base is created and curated manually from the scientific literature. The difference between logical and contextual filtering lies in the level of the context. A filter in general works by separating its input and then retaining or excluding parts of it, and ELM filters work in the same way. Currently two logical filters are part of ELM: the taxonomic context filter and the cellular compartment filter.


5.10.1 Logical filters - a result of compartmentalisation?

5.10.1.1 Taxonomic Filtering

Some types of functional site are found in all eukaryotes, e.g. the ER retention signal KDEL is universal, but others are restricted to specific eukaryotic taxa. Perhaps most strikingly, the large tyrosine kinase multigene family is found only in Metazoa. Each ELM is annotated with one or more NCBI taxonomy nodes to indicate its known phylogenetic distribution. The user provides the query species and all ELMs that are not assigned to its lineage are filtered out.
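A lineage check of this kind can be sketched with a nested-set encoding of the taxonomy tree, which the `left_metric` and `right_metric` columns of the Taxonomy_tree table suggest (an assumption; the text does not state how the columns are used). All interval values, the annotation in `ELM_TAXA` and the function names below are hypothetical.

```python
# Hypothetical nested-set intervals (left, right): a node lies inside the
# interval of each of its ancestors.
TAXONOMY = {
    "Eukaryota": (1, 12),
    "Fungi": (2, 5),
    "Saccharomyces cerevisiae": (3, 4),
    "Metazoa": (6, 11),
    "Mammalia": (7, 10),
    "Homo sapiens": (8, 9),
}

# Illustrative annotation: taxonomy nodes in which each ELM is known to occur.
ELM_TAXA = {
    "TRG_ER_KDEL_1": ["Eukaryota"],
    "LIG_SH2_1": ["Metazoa"],
}


def in_lineage(species, node):
    """True if `node` is `species` itself or one of its ancestors."""
    s_left, s_right = TAXONOMY[species]
    n_left, n_right = TAXONOMY[node]
    return n_left <= s_left and s_right <= n_right


def taxonomic_filter(elm_ids, species):
    """Keep only ELMs annotated with a node on the lineage of `species`."""
    return [
        elm for elm in elm_ids
        if any(in_lineage(species, node) for node in ELM_TAXA[elm])
    ]


matches = ["TRG_ER_KDEL_1", "LIG_SH2_1"]
print(taxonomic_filter(matches, "Saccharomyces cerevisiae"))
print(taxonomic_filter(matches, "Homo sapiens"))
```

With this encoding the ancestor test is two integer comparisons, so the whole filter can also be written as a single SQL range condition.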

5.10.1.2 Cell Compartment Filter

In ELM every linear motif is annotated with GO terms for the set of cell compartments in which it is known to function. For example, KDEL is a signal for retention of the host protein in the endoplasmic reticulum, whereas the SUMO site applies to proteins in the nucleus and the PML body. The user specifies the compartments in which the query protein functions and all matches for ELMs not found in these compartments will be filtered out. In the future ELM may support prediction of compartment using LOC3D [159].

5.10.2 Structure based contextual filtering

The structural context of ELMs is likely the most sensitive discriminator of true and false positives. We have developed several approaches to analyse the structure or predicted structural environment of a given ELM.

The basis for this is the set of ELM instances, a growing collection of experimentally verified instances of specific ELMs on specific polypeptides. These data are extremely difficult to mine from the literature and they are scarce. Another problem is that researchers tend to focus on one particular functional site and study it in several organisms, sometimes even on the same proteins. This leads to redundancy in the dataset, which has to be avoided when one performs benchmarking. The basic assumption is that, since we know that ELM patterns overpredict enormously, in a large dataset such as our SWISSPROT_EU_50% the number of true positives (TP) is much smaller than the number of false positives (FP), i.e. FP >> TP. We can then assume that any difference between a set consisting exclusively of TP (elm_instances_50%) and this dataset should be significant and a signal should be visible.

               Unfiltered               SMART/Pfam               GlobPlot2
FN/TP rate     Matches   TP    Acc.     Matches   TP    Acc.     Matches   TP    Acc.
0.0-0.1          22256  422   0.019       15569  387   0.025       13064  404   0.031
0.1-0.5          17935  479   0.027       11482  401   0.035        8458  320   0.038
0.5-1.0            567   59   0.104         293   39   0.133         232   10   0.043
0.0-1.0                       0.042                    0.052                    0.056

Table 5.3: Preliminary benchmarks for the SMART/Pfam and GlobPlot2 globularity filters. Acc. is prediction accuracy defined as TP/(TP+FP).

[Figure 5.9 image: plot titled "Prediction accuracy". y-axis: TP/(TP+FP), from 0 to 1. Series: Unfiltered, SMART/Pfam, GlobPlot.
x-axis (ELMs): LIG_14-3-3_1, LIG_14-3-3_3, LIG_AP_GamEar_1, LIG_COP1, LIG_CORNRBOX, LIG_CYCLIN_1, LIG_Clathr_ClatBox_1, LIG_CtBP, LIG_Dynein_DLC8_1, LIG_EH1, LIG_HOMEOBOX, LIG_HP1_1, LIG_MAD2, LIG_MDM2, LIG_MYND, LIG_NRBOX, LIG_PCNA, LIG_PIP2_ANTH_1, LIG_RB, LIG_RGD, LIG_SH2_GRB2, LIG_SH2_STAT3, LIG_SH2_STAT5, LIG_SH3_1, LIG_SH3_2, LIG_SH3_3, LIG_SID_1, LIG_Sin3_2, LIG_TPR, LIG_TRAF2_1, LIG_TRAF6, LIG_WRPW_1, MOD_CMANNOS, MOD_N-GLC_1, MOD_N-GLC_2, MOD_OFUCOSY, MOD_SPalmitoyl_4, MOD_SUMO, MOD_TYR_CSK, MOD_TYR_ITAM, MOD_TYR_ITIM, MOD_TYR_ITSM, TRG_AP2alpha, TRG_ENDOCYTIC_2, TRG_ER_KDEL_1, TRG_ER_diArg_1, TRG_ER_diLys_1, TRG_Golgi_diPhe_1, TRG_LysEnd_APsAcLL_1, TRG_LysEnd_APsAcLL_3, TRG_PML_SV, TRG_PTS1.]

Figure 5.9: Improvement in prediction accuracy, TP/(TP+FP), by applying GlobPlot or SMART/Pfam filters. In many cases the accuracy is improved; however, what is not shown is that the level of false negatives is raised as well. This is a preliminary figure.

As shown in Figure 5.9, the prediction accuracy (defined as TP/(TP+FP)) can be improved if one can accept the introduction of more false negatives. However, the prediction accuracy of most of the regular expressions is so poor that the sensitivity is of minor importance. The total results, based on a subset of the instance data without phosphorylation motifs, are shown in Table 5.3. The ELMs with FP/FN rate between 0.5-1.0 are: LIG_RB, MOD_N-GLC_2, TRG_Golgi_diPhe_1, TRG_PML_SV, LIG_PIP2_ANTH_1, MOD_TYR_DYR. These are preliminary results; however, we can see that although some true positives are lost by the filtering, a lot of false positives are removed, corresponding to about a 50% reduction in the sequence search space.
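The per-bin accuracies in Table 5.3 follow directly from the match and true-positive counts, since every match is either a TP or an FP. The short sketch below recomputes them from the first three rows of the table and also reports the fraction of unfiltered true positives that each filter retains; the names `TABLE_5_3`, `accuracy` and `tp_retained` are hypothetical helpers.

```python
# (matches, true positives) per rate bin and filter, copied from Table 5.3.
TABLE_5_3 = {
    "0.0-0.1": {"Unfiltered": (22256, 422), "SMART/Pfam": (15569, 387), "GlobPlot2": (13064, 404)},
    "0.1-0.5": {"Unfiltered": (17935, 479), "SMART/Pfam": (11482, 401), "GlobPlot2": (8458, 320)},
    "0.5-1.0": {"Unfiltered": (567, 59), "SMART/Pfam": (293, 39), "GlobPlot2": (232, 10)},
}


def accuracy(matches, true_positives):
    """TP / (TP + FP); the number of matches equals TP + FP."""
    return true_positives / matches


def tp_retained(row, filter_name):
    """Fraction of the unfiltered true positives that survive a filter."""
    return row[filter_name][1] / row["Unfiltered"][1]


for rate_bin, row in TABLE_5_3.items():
    for filter_name, (matches, tp) in row.items():
        print(f"{rate_bin:8s} {filter_name:11s} "
              f"acc={accuracy(matches, tp):.3f} "
              f"TP retained={tp_retained(row, filter_name):.2f}")
```

Printing both quantities side by side makes the trade-off described above explicit: accuracy rises as matches are removed, while some true positives are lost.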


5.10.2.1 Globular Domain Filter - a two track filtering

Globular domains identified with the SMART and Pfam (domain subset) resources are used for filtering out ELMs. There are two tracks in this filter:

• domain filtering

• ELM rescue

The domain filter works by removing all ELMs that match within the boundaries of SMART/Pfam domains found in the same sequence, on the assumption that they are false positives. The assumption here is that sites within globular units are not accessible and therefore not functional, clearly an oversimplification.

The ELM rescue module is annotation based and works to reduce false negatives introduced by the SMART/Pfam filter. From the siteseeing we know that, e.g., a tyrosine phosphorylation site can be found within a tyrosine kinase domain; it follows that this domain type should not exclude MOD_TYR sites. ELMs can occur inside certain domains, e.g. the internal tyrosine phosphorylation sites in the activation loops of tyrosine kinase domains, as described in 1.1. This latter group of ELMs is to a certain extent being ‘rescued’, i.e. for some ELMs certain SMART/Pfam domains are simply not used as part of the domain filter.
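The two tracks can be sketched as a single pass over the matches: a match is removed when it lies entirely inside a predicted domain, unless that domain type is listed as an exception for the ELM in question. The data structures, the `RESCUE` table, the domain name `TyrKc` and all coordinates are hypothetical and serve only to illustrate the logic.

```python
from typing import NamedTuple


class Match(NamedTuple):
    elm: str
    start: int
    end: int


class Domain(NamedTuple):
    name: str
    start: int
    end: int


# Hypothetical rescue annotation: domain types that must not mask a given ELM.
RESCUE = {"MOD_TYR_CSK": {"TyrKc"}}


def domain_filter(matches, domains):
    """Drop matches lying inside a domain, except for rescued ELM/domain pairs."""
    kept = []
    for match in matches:
        masked = any(
            dom.start <= match.start and match.end <= dom.end
            and dom.name not in RESCUE.get(match.elm, set())
            for dom in domains
        )
        if not masked:
            kept.append(match)
    return kept


domains = [Domain("SH3", 10, 60), Domain("TyrKc", 100, 350)]
matches = [
    Match("LIG_SH3_1", 20, 26),      # inside SH3: removed
    Match("MOD_TYR_CSK", 200, 205),  # inside TyrKc but rescued: kept
    Match("LIG_SH3_1", 210, 216),    # inside TyrKc, not rescued: removed
    Match("LIG_SH3_1", 400, 406),    # outside all domains: kept
]
print(domain_filter(matches, domains))
```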

Given the limited accuracy of the domain filter, the unfiltered results are provided to the user on the result front page. In many cases users will be able to investigate surface accessibility by examination of an available 3D structure or by using a good quality 2D structure prediction [175, 54, 55], or perhaps by using a homology modelling server such as SWISS-MODEL or the 3D-JURY meta-server [193, 90]. We are currently developing better domain filters, e.g. using surface accessibility from known structures to discriminate false from true positives.

5.10.2.2 Globularity Masking

The globular domain filter is beneficial because it assigns function to the sequence that is used to remove potential false positives; however, the intrinsic weakness of this filter is its coverage of sequence space. Pfam and SMART still have limited coverage (perhaps less than 50%) of the sequence space and many globular regions are not picked up by these resources. Another problem is that these databases do not focus on locating precise domain boundaries. In order to optimise their sequence models they most often use only the most conserved parts of the domain alignments. In general they predict domains to be shorter than they really are. Both of these problems are counteracted by ab initio globularity masking. The methods GlobPlot and DisEMBL, described in Chapter 6, can both be used for this purpose.

GlobPlot

We have earlier shown that GlobPlot’s RUSSELL/LINDING scale can detect domain boundaries ab initio, i.e. from sequence alone [139]. This module therefore complements the SMART/Pfam filter, which only deals with known domains. In general GlobPlot detects domain boundaries more accurately than Pfam and SMART [data not shown].

DisEMBL

Protein disorder and unstructured regions are often regions within proteins where one finds functional sites. We previously developed a method for ab initio prediction of protein disorder. The method, DisEMBL, provides several complementary definitions of protein disorder including the novel concept of hot loops [138]. This filter currently works as a globularity masker like GlobPlot.

5.10.2.3 Structural filter

The ab initio filter is expected to get rid of a great number of false positive hits. It is known, however, that at least some ELMs are located within structural domains (e.g. autophosphorylation sites). In such cases the exclusion of the motif caused by the globular domain filter should be tuned by using the known three-dimensional information, whenever this is available. To this end we envisaged two different strategies: the first to be used when structural information is available for the analysed ELM, the second when it is not.

1. For all ELMs of known structure, data about secondary structure and accessibility are collected in order to build a scoring schema for the predicted ELMs. When the motifs on the user sequence are analysed, a prediction of the secondary structure and accessibility of each motif is performed using homology modelling techniques. Based on the comparison between predicted values and data collected from true positive instances, the filter produces a score. The score is correlated to the degree of conservation of the ELM accessibility and secondary structure features across diverse true positive protein structures. It is also associated with a reliability value, which relies on the number of non-redundant 3D true positives available.

2. The second strategy considers only the predicted structural data, without performing a comparison with a benchmark. In this case the score is correlated to the accessibility and secondary structure information that is typically associated with linear motifs (e.g. high exposure to the solvent, high mobility, lack of secondary structure, etc.). We are using BLAST to identify a match with the query sequence in the ASTRAL collection of SCOP domains. If a significant match can be found we can analyse the position of the predicted ELM within the structural domain. This module will compute the accessibility and secondary structure of the predicted ELM and assign a score accordingly. The structural filter will be discussed further elsewhere [Via et al. 2004].


5.11 ELM entries

In elmDB the information is stored in layers. The top layer is the functional site, which refers to the set of information describing a molecular function that has been catalogued in ELM. A functional site consists of potentially several ELMs, or linear motifs, which are the sequence models for a particular molecular function. SH3 ligands are one example of a functional site. In ELM, SH3 ligands are described by five different sequence models or linear motifs, LIG_SH3_1-5, all slightly different in their consensus sequence motif; refer to Figure 4.2 on page 23. Contextual information is stored as related to ELMs if it is sequence specific and to functional sites if it is of a more general nature, such as taxonomic range or subcellular compartment range. Some context profiles are general for many functional sites, e.g. about 70% of the current ELM instances seem to be in non-globular or disordered regions of proteins. This is a very important finding because it means that we can exclude a lot of sequence space when we are trying to predict ELMs. We have developed several tools that exploit this tendency within functional sites.

5.11.1 Molecular function classes - LIG, MOD, TRG, CLV and ISO

Linear motifs seem to group themselves into certain classes according to their sequence specificity; although this is primarily based on sequence analysis, it is also known from large scale experimental studies [Gianni 2004, Serrano, Yaffe]. In ELM we have identified 5 classes of ELMs.

Isomerisation sites are recognised by domains which alter the stereo-chemical modus of the amino acid, perhaps resulting in an altered structureand function of the host protein, e.g. the prolyl isomerase PIN1 which recog-nises phosphorylated Ser/Thr-Pro motifs. Note that the Ser/Thr-Pro motif must

Concept Definition Example

a functional site A set of short linear (sub)sequences LIG_RB:

that can be related to a molecular function Rb binding motif

an ELM The common pattern of a set of linear (sub)sequences [LI].C.[DE]

that can be related to a molecular function

an ELM instance An instance of an ELM in a particular polypeptide RBB1_HUMAN

LVCHE

Table 5.4: Definitions of different concepts used in the ELM resource. Func-tional sites are, as opposed to e.g. active sites, short and linear in sequenceand structure space. In the ELM resource we describe functional sites as lin-ear motifs. Here the linear motif is shown as a regular expression or patternbut it could as well have been another type of sequence model e.g. a neuralnetwork or PSSM.


Figure 5.10: Occurrences of different classes of functional sites in SWISS-PROT. Functional sites are as varied and numerous as domains are. On a proteome level we expect at least 5 sites per protein, resulting in about 150k instances in the human proteome. This indicates the presence of a gigantic and complex interaction and regulatory system.

first be recognised and phosphorylated by another domain (a kinase), and the phosphorylated motif is in turn recognised by PIN1.

5.12 ELM annotation - Siteseeing

All data input is by hand curation performed by trained molecular biologists. Annotating each ELM is called Siteseeing; it includes the processes shown in Figure 5.11. In order to promote interoperability with other bioinformatics resources ELM uses three public annotation standards. Gene Ontology (GO) identifiers are used for cell compartment and biological process [14, 106], while the NCBI taxonomy database identifiers [232] are used for taxonomic nodes at the apex of phylogenetic groupings in which an ELM occurs. The third standard used in ELM is described below.

5.12.1 Instance data - an invaluable resource for biologists

A major task in the annotation of contextual profiles is the collection of known verified instances of ELMs. These are data sets of experimental evidence for a particular linear motif residing on a particular polypeptide. These data are invaluable for the training and benchmarking of filters. Refer to Materials & Methods for more details on the process of annotating contextual profiles. In


Figure 5.11: The siteseeing process. The manually annotated ELM resource makes use of bioinformatics and biology knowledge to collect and analyse a set of representative peptides bearing a given functional site (the instances). The analysis, update and evaluation of the dataset typically involves bioinformatics and biology methods as well as hybrid methods merging the two disciplines. At the end of the pipeline, a model to predict the motifs belonging to a functional class (the ELMs) is built and the frame of rules used for the logical context filtering of the ELMs is stored. The resource also provides a tailored abstract and a list of key references to its user. The flow of the siteseeing process typically involves extensive literature searches, BLAST runs, multiple alignment of relevant protein families, perusal of SWISS-PROT and other online databases and, where practical, discussion with experimentalists from the field.

Class | ELMs | Functional sites | Instances
CLV | 7 | 3 | 0
TRG | 15 | 12 | 108
MOD | 30 | 24 | 1012
LIG | 62 | 38 | 845
ISO | 0 | 0 | 0

Table 5.5: Overview of the content of current elmDB.


order to create contextual profiles one needs to understand how we classify molecular functions within ELM. ELM instance annotations are assigned ontology terms from the Proteomics Standards Initiative Molecular Interaction ontologies for evidence methods [105]. In the future the ELM resource will be able to report known instances of ELMs with details regarding which kind of experiments were performed to show the instance, and with links to the relevant literature. Refer to Table 5.5 for an overview of the current content of instances.

Other resources that contain instance-like data are protein interaction databases such as BIND, MINT and DIP [17, 238, 242] and Phospho.ELM, http://phospho.elm.eu.org, [Diella et al. Genome Biology 2004, submitted].

5.12.2 Mining for ELMs - creating contextual profiles

Starting from the current knowledge about a functional site (literature, sequence, taxonomy) the siteseer uses a range of methods as diverse as database searches, homology searches, literature/text mining or interactions with experimentalists to collect an extensive set of sequences containing examples of the functional site (the instances). This collection is used to extract the definition of one (or possibly several) ELMs and to derive the best models for predicting the ELMs: regular expression, profile or perhaps in the future a PSSM. The definition of an ELM would not be complete without the description of its biological context: taxonomic distribution, cellular distribution, 3D structure context. This information builds the unique frame of rules used for the logical context filtering that allows the reduction of false positives in the ELM prediction. Siteseeing consists of two complementary processes. The first is a rather descriptive annotation task to collect information found in the literature and databases and a representative population of ELM instances. In the second, more analytical phase, multiple alignments, elaboration of profiles or regular expressions and homology searches lead to the informatic description of an ELM. The two aspects are entwined and interdependent since biology and bioinformatics constantly feed each other during the siteseeing process. The description of an ELM typically results from hybrid methods: for example, literature mining (for searches and updates) and standardisation of annotation by controlled vocabularies combine the use of biological knowledge and bioinformatics tools.

5.12.3 Building sequence models

The motif patterns are currently represented as POSIX regular expressions (usable in the Python and Perl languages), analogous to PROSITE patterns, but with a different syntax. For example the FxDxF motif, which is responsible for the binding of accessory endocytic proteins to the alpha-subunit of adaptor protein complex AP-2, has a consensus sequence of F-x-D-x-F and is written as F.D.F. Linear motifs in ELM will, in the future, include motif descriptions according to the ‘Seefeld convention’ nomenclature for linear motifs [1].
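As a minimal illustration of how such a pattern can be applied, the sketch below scans a sequence for the F.D.F pattern with Python's standard `re` module. It is not the ELM server code: the function name `find_motif` and the query sequence are made up for this example. A zero-width lookahead is used so that overlapping matches are also reported, which a plain left-to-right scan would skip.

```python
import re
from typing import List, Tuple


def find_motif(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for every match of the pattern.

    The pattern is wrapped in a zero-width lookahead so that matches
    which overlap each other are all reported.
    """
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Toy query sequence, invented for illustration only.
hits = find_motif("F.D.F", "MAFSDAFGDKFFDTFK")
for start, peptide in hits:
    print(start, peptide)
```

In the toy sequence the first two matches share a phenylalanine, so the second would be missed without the lookahead. Every raw match is only a candidate; as discussed in this chapter, context filtering is still needed to reduce false positives.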


In the future ELM might incorporate neural networks or other sensitive search methods. Nevertheless, linear motifs will continue to overpredict and will require alternative approaches for reducing the levels of false positives.

Using ELM - now and in the future

The public ELM webserver allows the user to retrieve filtered as well as unfiltered, raw results. This approach should encourage users to think critically about ELM server results. Figure 5.12 shows the ELM server output using the human Src sequence as a query. This example indicates the potential of the filtering approach for improving motif searches. A pipeline interface to ELM prediction for use in proteome analysis is currently being developed and applied; this pipeline and the results will be made available as soon as possible.

The predictive power of the ELM resource can be enhanced by comparison to other data, including experimental results. For example, many protein kinase recognition sites are among those which severely overpredict. If a protein is known to be phosphorylated then the kinase site matches can be targeted

Figure 5.12: Sample output from the ELM server. The Src oncogene is used as an example. Light blue bars indicate retained predictions whereas the grey ones have been filtered out because of clashes with the SMART/Pfam domain filter.


for experimental testing. Mass spectrometry can be a useful tool in revealing post-translational modifications. ELM can provide synergism with appropriate experiments and can help in mapping out a research program. In this way, the ELM resource should become increasingly useful to the research community.

Future enhancements

In the future ELM may support prediction of compartments using LOC3D [159]. We would like to formalise our contextual profiles and try to create a comprehensive logical framework and simulation system for linear motifs. However, a vector space formal system could be anticipated in the future.

Acknowledgements

The ELM consortium was supported by EU grant QLRI-CT-2000-00127. Thanks to Sara Quirk for commenting on this manuscript. Finally we are deeply grateful to FreeBSD.org, kernel.org, (bio)Python.org, PostgreSQL.org, Debian.org, Apache.org and Linus Torvalds for fantastic open free software.


Part III

Intrinsic Protein Disorder


Chapter 6

Ab initio Prediction of Protein Disorder

We predict¹ about 70% of the instances of linear functional sites to be in non-globular regions. The remainder reside to a large extent in internal domain loops, exposed to the solvent. In order to narrow the search space for linear functional modules I have developed two novel and powerful methods for finding unstructured segments in proteins from sequence alone. Besides being potential filters these two methods, GlobPlot and DisEMBL, are used intensively by structural genomics groups for target optimisation. Both methods are able to detect Intrinsically Disordered Proteins (IDPs), a group of proteins that is receiving increasing interest.

6.1 What is protein disorder?

No commonly agreed definition of protein disorder exists. The thermodynamic definition of disorder in a polypeptide chain is the “random coil” structural state. The random coil state can best be understood as the structural ensemble spanned by a given polypeptide in which all degrees of freedom are used within the conformational space. However, even under extremely denaturing solvation conditions, such as 8M urea, this theoretical state is not observed in solvated proteins [126, 167, 213, 203]. Proteins in solution thus always keep a certain amount of residual structure. This indicates that protein disorder should rather be considered a highly dynamic and unstructured (with no regular secondary structure) state of the polypeptide. The closest operational definition of this is perhaps what NMR researchers refer to as the ‘unstructured ensemble’². However, as we will see, data from X-ray crystallography seem to be related to the same type of disorder that can be found by NMR.

¹ Based on GlobPlot and DisEMBL predictions in the ELM instance data.
² An unstructured ensemble is the set of solutions to an NMR spectrum sampled on the unfolded protein in solution.


6.2 What role does protein disorder play in biology?

Protein disorder is interesting for several reasons beyond enhancing the chances of finding functional sites. Intrinsic disorder is now forcing its way into the structure-function paradigm of structural biology. The author thinks that to a certain extent this paradigm will be rewritten over the next few years. The old dogma of a one-to-one relationship between the sequence, structure and function of a protein has fallen.

6.2.1 Intrinsically Disordered Proteins (IDPs)

Although under-researched, there is an increasing number of reports of IDPs (also known as Intrinsically Unstructured Proteins, IUPs). These are proteins or domains that, in their native state, are either entirely disordered or contain large disordered regions. More than 180 such proteins are known, including Tau, Prions, Bcl-2, p53, 4E-BP1 and HMG proteins (see Figure 7.4) [215, 219, 143, 131, 13, 57]. As these proteins are not necessarily entirely unstructured we proposed the IDP abbreviation, to make it clear that these are proteins that ‘contain’ disorder.

Protein disorder is important for understanding protein function as well as protein folding pathways [171, 221]. Although little is understood about the cellular and structural meaning of IDPs, they are thought to become ordered only when bound to another molecule (e.g. CREB-CBP complex [180]) or upon changes in the biochemical environment [71, 69]. This is supported by the lower cooperativity observed in these chains by NMR, i.e. their folding is more dependent on other molecules.

Many IDPs and misfolded proteins are involved in major protein diseases such as Parkinson’s and Alzheimer’s syndromes. This is due to abnormal aggregation patterns of these proteins. In Chapter 8 we will investigate this further.

6.2.2 Target selection

Both the tools, DisEMBL and GlobPlot, which we have developed, are now being used by several structural genomics initiatives and labs around the world who are either studying IDPs or trying to optimise their recombinant protein expression vectors by removing disordered segments.

In the post-genomic era discovery of novel domains and functional sites in proteins is of growing importance. One focus of structural genomics initiatives is to solve structures for novel domains and thereby increase the coverage of fold and structure space [29]. During the target selection process in structural genomics/biology intrinsic protein disorder is important to consider because disordered regions (at the N- and C-termini or even within domains) often lead to difficulties in protein expression, purification and crystallisation. This is, for instance, evident from the under-representation of such regions in the PDB.


It is therefore essential to be able to predict which regions of a target protein are potentially disordered/unstructured. This seems to be less true for the IDP class, i.e. disorder mainly causes trouble in work with globular proteins; refer to Paper IV.

6.2.3 Protein function and disorder?

The current view on protein disorder is that it allows for more interaction partners and modification sites [236, 140, 215]. However, we have not been able to confirm this hypothesis by analysing a large interaction dataset [unpublished results]. This might be because such datasets are enriched in non-transient interactions, whereas interactions carried out by disordered proteins are transient.

Perhaps disordered proteins have evolved to provide a simple solution to having large intermolecular interfaces while allowing for smaller protein, genome and cell sizes [95]. It has been proposed that having several relatively low affinity linear interaction sites allows for a flexible, subtle regulation as well as accounting for specificity and cooperative binding effects [79]. In the light of the modular model described in 1.1 on page 3, one can see how these sites could be used in a combinatory manner to generate a very large set of potential interaction environments.

Recently it was proposed that IDPs are ‘junk-proteins’ [144]; this hypothesis is not supported by Paper IV. Firstly, we found that in the majority of experimentally verified IDPs less than 50% of the sequence is of low complexity. Secondly, we know that low-complexity often co-occurs with functional sites and is therefore important for function, refer to 4.5 on page 30. Thirdly, it has been proposed how repeats might be important for the evolution of this protein class [216]. I find it ‘interesting’ that the term is now reused since we now know that ‘junk-DNA’ was also a false concept.

6.3 Hypothesis vs data driven

The lack of a commonly agreed definition of protein disorder is likely to be due to the fact that it is observed by different experimental techniques. The work that has been done hitherto, primarily by the Dunker group, has in this author’s opinion failed to state this clearly [69, 70, 71, 72, 111]. Instead, their work is an attempt to fit all types of disorder observations into one large paradigm of protein disorder.

We chose a reductionist approach and set out by defining, based on experimental data, several types of disorder which we thought would be biologically relevant. We then mined data sets for these definitions and trained predictors for them. The basis for our work was a simple two-state disorder-order description, i.e. we do not have degrees of disorder. This is clearly an oversimplification; however, with the lack of definitions it seems a reasonable approximation to a potentially much more complicated problem.

Another fundamental difference between our approach and that of the Dunker lab is that we have as an absolute requirement that there is no regular


secondary structure in a polypeptide segment in order for it to be even considered to be disordered. This hypothesis is also the basis for the work done by David Jones’ and Burkhard Rost’s labs [119, 140].

6.4 Data sources for protein disorder

Protein disorder is observed by a variety of experimental methods, such as X-ray crystallography, NMR, Raman and CD spectroscopy and hydrodynamic measurements [205, 71]. In vivo studies of disorder are possible with NMR spectroscopy on living cells (e.g. anti-sigma factor FlgM [60]). Each one of these methods detects different aspects of disorder, resulting in several operational definitions of protein disorder [see 215, for a review]. Table 6.1 shows the major data sources for protein disorder.

6.4.1 X-ray data

As seen in Table 6.1 X-ray crystallography provides at least three parameters that can describe intrinsic protein disorder: coils, REMARK465 (missing coordinates) and anisotropic thermofactors (B-factors).

6.4.1.1 Coils

Residues in X-ray structures are annotated as belonging to one of several secondary structure types. These annotations can be extracted from PDB files by the DSSP algorithm [120]. We considered residues assigned as α-helix (‘H’), 3₁₀-helix (‘G’) or β-strand (‘E’) to be ordered, and all other states (‘T’, ‘S’, ‘B’, ‘I’, ‘ ’) to be coils (also known as loops). Coils/loops are not necessarily disordered (e.g. turns); however, protein disorder is only found within loops. It follows that one can use loop assignments as a necessary but not sufficient requirement for disorder; a disorder predictor entirely based on this definition will thus be promiscuous.
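The two-state reduction described above can be written down in a few lines. This is a minimal sketch, assuming the per-residue DSSP states have already been extracted into a string; the function names and the output letters ‘O’ (ordered) and ‘C’ (coil) are choices made for this example only.

```python
# DSSP states treated as ordered, following the definition in the text:
# alpha-helix (H), 3-10 helix (G) and beta-strand (E).
ORDERED_STATES = {"H", "G", "E"}


def dssp_to_two_state(dssp: str) -> str:
    """Map a DSSP state string to 'O' (ordered) or 'C' (coil) per residue.

    Every state other than H, G or E (T, S, B, I and blank) becomes coil.
    """
    return "".join("O" if state in ORDERED_STATES else "C" for state in dssp)


def coil_fraction(dssp: str) -> float:
    """Fraction of residues assigned to coil; 0.0 for an empty string."""
    if not dssp:
        return 0.0
    return dssp_to_two_state(dssp).count("C") / len(dssp)


# Toy DSSP string, invented for illustration only.
print(dssp_to_two_state(" HHHTTEEES GGGB"))
```

Note that the π-helix state ‘I’ and the isolated bridge ‘B’ end up as coil here, exactly as in the definition above.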

Source | Parameters | Advantages | Disadvantages
X-ray | B-factors, Coils, REMARK465 | Very large dataset; Entropic disorder | Truncated data; Noisy data
NMR | Random-coil chemical shifts; Relaxation, Order vectors | In vivo/situ data | Very limited data
CD/Raman | Missing far UV signal | Low-tech | No residue specificity; Loops are not disorder
Hydrodynamics | Hydrodynamic volume | Low-tech | No residue specificity; Small dataset

Table 6.1: Data sources for protein disorder.


6.4.1.2 REMARK465 - biting its own tail

Missing coordinates (REMARK465) in X-ray structures are defined by ‘REMARK 465’ entries in PDB files. Non-assigned electron densities most often reflect intrinsic disorder and were used early on in disorder prediction [136]. This type of data is noisy; in particular, methionine suffers a bias in the dataset for at least two reasons:

1. Often the N-terminal methionine is cleaved off.

2. Some structures are solved using selenomethionine derivatives for phasing, which can lead to deletion of the residue in the PDB entry.

A more serious and unsolvable problem is that these data are truncated. Since protein disorder is often removed during expression and crystallisation of the protein, the data that go into PDB are only a subset of the full set. This could theoretically be counteracted by data mining and comparison of the sequences in PDB with the ones in SWISS-PROT, where the full-length sequence might be stored; however, this is not a trivial task to automate. The situation is similar to a dog biting its own tail.
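To make the REMARK 465 data source concrete, the sketch below collects the residues listed as missing in the text of a PDB file. It is only an illustration, not the procedure used for building our data sets. It assumes the common whitespace-separated row layout (optional model number, residue name, chain identifier, sequence number) and silently skips rows it cannot parse, such as rows with insertion codes or a blank chain identifier. The function name and the toy records are made up for this example.

```python
from typing import List, Tuple


def missing_residues(pdb_text: str) -> List[Tuple[str, str, int]]:
    """Return (residue name, chain, sequence number) for REMARK 465 rows."""
    missing = []
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 465"):
            continue
        fields = line.split()
        # Data rows have 5 fields, or 6 when a model number is present.
        if len(fields) not in (5, 6):
            continue
        res, chain, seq = fields[-3], fields[-2], fields[-1]
        if len(res) != 3 or not res.isupper() or len(chain) != 1:
            continue
        try:
            missing.append((res, chain, int(seq)))
        except ValueError:
            # Column header row or a sequence number with an insertion code.
            continue
    return missing


# Toy records, invented for illustration only.
example = "\n".join([
    "REMARK 465",
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     MET A     1",
    "REMARK 465     GLY A    -2",
    "ATOM      1  N   SER A   2      11.104   6.134  -6.504  1.00  0.00           N",
])
print(missing_residues(example))
```

Counting how often ‘MET’ appears at the first position of such lists is one simple way to see the methionine bias mentioned above.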

6.4.1.3 Anisotropic thermofactors (B-factors)

In crystal structures of proteins, the B-factor reflects the uncertainty in atom positions in the model and often represents the combined effects of thermal vibrations and entropic disorder. Several attempts have been made to use B-factors for disorder prediction [31, 222, 85, 70, 244] but there are many pitfalls in doing so, as B-factors can vary greatly within a single structure due to effects of local packing and structural environment. Recent progress in deriving propensity scales for residue mobility based on B-factors [201] encouraged us to use B-factors for defining protein disorder. Our breakthrough was to use B-factors and coils together in the hot loops definition, see 6.5.1. It should be noted that we were the first group to use B-factors successfully for disorder prediction, see Paper III, even if we are not cited in a recent paper in Protein Science by the Dunker lab on the same topic.

6.4.2 NMR data

NMR offers a fast growing and exciting [57] alternative data source for prediction of structural disorder. One characteristic of the conformations detected in solvated disordered proteins is their reduced stability and cooperativity compared to the structure found in globular proteins. This means that these structures are in equilibrium with disordered states or alternative conformations on a time-scale faster than the sampling rate in NMR experiments. It follows that common NMR parameters, such as chemical shifts, coupling constants and NOE (Nuclear Overhauser Effect), are averaged over an ensemble of conformations which are rather different from each other. This renders some difficulty in directly using these parameters for assessing the time-scale and the


extent of the motions in IDPs [34, 203, 202]. However, other measurements such as hydrodynamic radii [233] and hydrogen exchange can be used [34].

NMR relaxation data on disordered proteins can provide insight into both the structural and dynamic properties of these molecules. Negative values in ¹⁵N-¹H heteronuclear NOE observed in relaxation experiments indicate protein disorder [46, 223].

Many of these data are deposited in the BioMagResBank (BMRB, http://www.bmrb.wisc.edu/). In the future we would like to collaborate with NMR groups on training new predictors based on this type of data.

6.4.3 Other data types

The Raman, CD, proteolysis and hydrodynamic methods (such as gel filtration) share an intrinsic weakness, which is the lack of residue specificity. This is not necessarily a problem in terms of empirical studies; however, it does produce data that are not suited for training predictors. Another problem is that these data are not available in knowledge bases; one would have to mine them from the literature. The same problem applies to small-angle X-ray scattering (SAXS) data [47].

We simply chose not to use data from these methods for training predictors; others have chosen to use all data regardless of source [69, 70, 71, 72, 111].

6.5 Old and novel definitions of disorder

Both the coils and REMARK465 data were used directly for describing protein disorder; however, this has been done by others, so it has limited impact. We chose to deliver a two-state secondary structure predictor that deals only with prediction of loop/non-loop. We think this predictor is able to compete with established secondary structure predictors such as PROFsec or PSI-PRED [175, 228]. The predictors based on missing coordinate data could be marginally enhanced by cleaning up the data and ignoring e.g. proteolytically removed methionines from the set. This would require a substantial amount of manual data manipulation, and the performance gain is unlikely to be overwhelming.

Figure 6.1 shows the disorder propensities for each amino acid by the four definitions of disorder used in our work. A more detailed discussion of these values can be found in Paper III, but in general hydrophobic residues promote order according to all definitions of disorder. Disorder promoting residues include proline, lysine, serine, threonine and methionine. Some proteins contain short hydrophobic intra-domain linkers; these are highly disordered since they are not solvated properly. However, they are not long enough to force an opening of the domain structures, i.e. it is inherent that they have to be disordered and unhappy. Our predictors cannot pick up these types of disorder; they would rather be predicted as globular due to their hydrophobicity.


[Figure 6.1 plot. x-axis: amino acid (P K S T N D R M E G Q H W I V A L C F Y); y-axis: log(odds), from -0.4 to 0.6; series: Hot loops, Coils, Missing coordinates, Russell/Linding.]

Figure 6.1: Propensities for the amino acids to be disordered according to the definitions used in DisEMBL and GlobPlot (sorted by hot loop preference). This scale directly reflects the data sets used for training; however, it is only a rough approximation of what the DisEMBL neural networks are using in predicting disorder. Error bars correspond to the 25 and 75 percentiles as estimated by stochastic simulation. The RUSSELL/LINDING scale is an absolute scale. REMARK465 is noisy due to the reasons mentioned in 6.4.1.2. The same bias is seen in [71, Figure 10].

6.5.1 Hot loops - a novel disorder definition

Hot loops is our novel definition of protein disorder based on X-ray data. Hot loops constitute a refined subset of the coils set, namely those loops with a high degree of mobility as determined from Cα temperature factors (B-factors). It follows that highly dynamic loops should be considered protein disorder. The idea of hot loops is to a certain extent RASMOL’s fault: we wanted to predict what was always red and unstructured.

We think that it will prove difficult to create a more precise definition of disorder based on crystallographic data. An example of hot loop results is shown in Figure 7.4 on page 114, where we mapped the probabilities onto the structure of Nonhistone chromosomal protein 6A from yeast. It is remarkable that a definition based on X-ray data can predict so well on NMR structures, which argues that this novel definition of disorder is relevant.
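The hot loop idea, coil residues with high Cα B-factors, can be sketched as below. This is a simplified illustration and not the procedure used to build the DisEMBL training set: the per-chain z-score normalisation, the cutoff value `z_cutoff` and the function name are assumptions made for this example, and the B-factors are toy numbers.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def hot_loops(dssp: str, ca_bfactors: Sequence[float], z_cutoff: float = 1.0) -> List[bool]:
    """Flag residues that are in a coil AND have a high C-alpha B-factor.

    B-factors are z-score normalised within the chain (an assumption of
    this sketch); a residue is 'hot' if its z-score exceeds z_cutoff.
    """
    if len(dssp) != len(ca_bfactors):
        raise ValueError("one DSSP state and one B-factor per residue expected")
    mu = mean(ca_bfactors)
    sigma = pstdev(ca_bfactors)
    flags = []
    for state, b in zip(dssp, ca_bfactors):
        is_coil = state not in ("H", "G", "E")       # same coil definition as 6.4.1.1
        z = (b - mu) / sigma if sigma > 0 else 0.0   # flat profile: nothing is hot
        flags.append(is_coil and z > z_cutoff)
    return flags


# Toy chain: a helix followed by a loop whose last two residues are mobile.
toy_bfactors = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 40.0, 50.0]
print(hot_loops("HHHH  TT", toy_bfactors))
```

A mobile residue inside a helix is deliberately not flagged, which reflects the requirement that disorder is only considered within loops.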

6.5.2 RUSSELL/LINDING propensities

The RUSSELL/LINDING propensities are parameters based on the hypothesis that the tendency for disorder can be expressed as P = RC - SS, where RC and SS are the propensities for a given amino acid to be in ‘random coil’ and


regular ‘secondary structure’ respectively. This scale was defined during the development of the GlobPlot predictor described in Paper II. This scale is slightly magical in the sense that it came out as a direct result of subtracting the occurrence frequencies from SCOP; no normalisation of the data was performed.
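The subtraction P = RC - SS is simple enough to show directly. The sketch below computes the propensity per amino acid and then a running sum along a sequence, which is one simple way to turn a per-residue scale into a profile. The frequencies are toy values chosen for illustration and are not the published RUSSELL/LINDING parameters; the function names are likewise made up for this example.

```python
from itertools import accumulate
from typing import Dict, List


def disorder_propensity(rc: Dict[str, float], ss: Dict[str, float]) -> Dict[str, float]:
    """P = RC - SS for every amino acid present in both tables."""
    return {aa: rc[aa] - ss[aa] for aa in rc if aa in ss}


def running_sum(sequence: str, propensity: Dict[str, float]) -> List[float]:
    """Cumulative propensity along the sequence; unknown residues count as 0."""
    return list(accumulate(propensity.get(aa, 0.0) for aa in sequence))


# Toy frequencies, invented for illustration only.
rc_toy = {"P": 0.5, "L": 0.25}
ss_toy = {"P": 0.25, "L": 0.75}
p_toy = disorder_propensity(rc_toy, ss_toy)
print(p_toy)
print(running_sum("PPLL", p_toy))
```

With these toy numbers proline gets a positive (disorder promoting) value and leucine a negative (order promoting) value, so the running sum rises over the proline stretch and falls over the leucine stretch.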

6.6 Methods for finding protein disorder

There have been several other attempts to predict disorder. Perhaps the earliest are methods finding regions of low-complexity. Although many such regions are structurally disordered, the correlation is far from perfect as regions of low sequence complexity are not always disordered (and vice versa) [69]. Likely the strongest evidence for this correlation comes from the fact that low-complexity regions are rarely seen in protein 3D structures [190, 235, 110]. Methods to predict low complexity, like SEG [235] and CAST [178], are thus often used for this purpose. Methods using hydrophobicity can also give hints about disordered regions, as they are typically exposed and rarely hydrophobic.

The first tool designed specifically for prediction of protein disorder was PONDR (Predictor Of Naturally Disordered Regions, http://www.pondr.com, [182, 85, 86]). It is based on artificial neural networks. PONDR is, however, not freely accessible for academia. Refer to [151] for a recent evaluation of disorder prediction (DisEMBL was published after CASP5).

Regions without regular secondary structure can be predicted by the NORSp (NOn Regular Structure) server [140]; however, as the authors point out, such regions are not necessarily disordered. Structures such as the Kringle domain (PDB: 1krn) are almost entirely without regular secondary structure in their native state but they still have tertiary structure wherein the basic building block is coils. These ‘loopy proteins’ are not necessarily IDPs since they can still form a well defined globular tertiary structure. Prediction of protein tertiary structure could be an alternative route to disorder prediction, although such methods are computationally intensive and error-prone. Moreover, such methods are usually designed to predict the structure of globular domains and their behaviour on other sequences can be unpredictable.

Because PONDR only allows a limited number of predictions and because the basis on which it was developed was not clear, we decided to develop our own methods based on data driven definitions of protein disorder. This is presented in the following two papers.


Paper II

GlobPlot was the first disorder and domain boundary predictor. It was in the first instance created to narrow down regions enriched in functional sites. We defined a new disorder propensity scale named Russell/Linding. The method has at least once been used successfully for protein expression vector design. The method is competent at finding domain boundaries as well.

6.7 GlobPlot: Exploring Protein Sequences for Globularity and Disorder

Rune Linding, Robert B. Russell, Victor Neduva & Toby J. Gibson
European Molecular Biology Laboratory

Biocomputing Unit

D-69117 Heidelberg

Germany

To whom correspondence should be addressed. Phone: +49 6221 387451 -Fax: +49 6221 387517 - email: [email protected]

A major challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Non-globular sequence segments often contain short linear peptide motifs (e.g. SH3-binding sites) which are important for protein function. We present here a new tool for discovery of such unstructured, or disordered, regions within proteins. GlobPlot (http://globplot.embl.de) is a web service that allows the user to plot the tendency within the query protein for order/globularity and disorder. We show examples with known proteins where it successfully identifies inter-domain segments containing linear motifs, and also apparently ordered regions that do not contain any recognised domain. GlobPlot may be useful in domain hunting efforts. The

81

Page 102: Linear Functional Modules€¦ · Dansk Resume I årtier har ... Martin DM, Ausiello G, Brannetti B, Costantini A, Ferre F, Maselli V, Via A, Cesareni G, Diella F, Superti-Furga G,

82 Chapter 6. Ab initio Prediction of Protein Disorder

plots indicate that instances of known domains may often contain addi-tional N- or C- terminal segments that appear ordered. Thus GlobPlotmay be of use in the design of constructs corresponding to globularproteins, as needed for many biochemical studies, particularly structuralbiology. GlobPlot has a pipeline interface - GlobPipe - for the advanceduser to do whole proteome analysis. GlobPlot can also be used as ageneric infrastructure package for graphical displaying of any possiblepropensity.

Introduction

Figure 6.2: GlobPlot predictions for human Bcl-2. The predicted disordered segments are mapped on the structure in red. The yellow helix kink is falsely predicted as a disordered segment. The green segment is not predicted by GlobPlot, probably because the algorithm has lower sensitivity in the termini due to the Savitzky-Golay filter. Blue colour corresponds to the globular domain of Bcl-2.

In the post-genomic era, discovery of novel domains and functional sites in proteins is of growing importance. A key part of initiatives like structural genomics is to optimize target selection by identifying domains and thereby increase the coverage of fold and structure space [29]. In addition, it has recently been recognised that many functionally important protein segments lie outside of globular domains in regions that are intrinsically disordered [236]. Computational tools to help discern domains from inter-domain regions are key to such efforts. We describe here a graphical tool, GlobPlot, and a pipeline companion, GlobPipe, that do just this: they measure and display the propensity of protein sequences to be ordered or disordered.

There are many methods, e.g. SMART [134], PRODOM [198], Interpro [156], Pfam [20], PROSITE [200] and ELM [179] (http://elm.eu.org), available for finding globular domains (e.g. SH3, TyrKc, active sites) and linear motifs (e.g. SH3 ligands, LXXLL nuclear receptor ligands, tyrosine phosphorylation sites, post-translational modification sites) within a protein sequence. These methods typically rely on sequence similarity models, looking for recurrence of known domains or motifs by means such as HMMs [76], pattern discovery (http://www.cs.ucr.edu/~stelo/pattern.html) or SW-profiles [204]. Although these methods are of great value in annotating protein sequences, they are limited in their ability to uncover features not yet discovered.

A complementary approach to domain or feature discovery is to predict protein structure, though such methods are computationally intensive, error-prone and usually designed to predict structure only within globular regions.

In order to predict possible targets for further structural analysis, we present here a method complementary to structure prediction. We describe a simple, easy-to-use, propensity/scale-based tool for exploring both potential globular and disordered/flexible regions in proteins based on their sequence.

Protein disorder can be described as the lack of regular secondary structure and a high degree of flexibility in the polypeptide chain [236]. Ordered regions are often termed globular, and typically contain regular secondary structures packed into a compact globule. However, no general definition of disorder exists.

Disordered regions can contain functional sites, predicted as linear motifs by ELM, and they are of growing interest owing to the increasing number of reports of intrinsically unstructured/disordered proteins (IUPs). IUPs contain regions that are partially or completely unfolded/unstructured in the native in vivo state of the protein. More than 100 IUPs are known [236, 215], including Tau [194], Prions [143], Bcl-2 (refer to Figure 6.2) and partially p53 [131]. Although little is understood about the cellular and structural meaning of this state, it is thought that it may exist as a molten globule and become ordered only when bound to another molecule [219, 71]. It is clear, however, that IUPs play a central role in biology and in diseases mediated by protein misfolding and aggregation [69, 70].

Prediction of disorder can currently be performed using SEG [235], which searches for regions of low sequence complexity. However, low complexity of the sequence does not imply disorder in all cases. It is also possible to use methods such as hydrophobicity plots, though this approach is better suited to identifying segments such as transmembrane helices than to finding long segments of disorder.

PONDR (http://www.pondr.com) [85, 86] is a neural network based tool for disorder prediction, but it is not freely accessible.

Prot-Scale (http://us.expasy.org/cgi-bin/protscale.pl) is a general resource for showing amino acid propensity scales, using a sliding window algorithm. Prot-Scale does not offer any dedicated disorder predictor.

We discuss here GlobPlot, a tool to identify regions of globularity and disorder within protein sequences. It is a simple approach based on a running sum of the propensity for amino acids to be in an ordered or disordered state. We show that, despite its simplicity, this method is able to identify such regions when compared to domain databases and sets of disordered proteins.

Inside GlobPlot

Propensity sets

At the heart of GlobPlot are propensities P for all amino acids to be in globular or non-globular states. The GlobPlot package currently contains 7 different propensity sets, though others could easily be added. There is no standard definition of disorder, and no large set of universally agreed disordered proteins. Moreover, different parts of proteins are probably ordered under different conditions. We have thus developed a tool that allows parameters from different definitions of disorder to be applied.

[Figure 6.3: bar chart of disorder propensity (vertical axis, -0.5 to 2) for each of the 20 amino acids, comparing the Russell/Linding and Deleage/Roux sets.]

Figure 6.3: Propensities for disorder/globularity detection. The RUSSELL/LINDING set is the default set used by the GlobPlot algorithm.

Topic | URL
Online Help | http://globplot.embl.de/help.html
Propensity sets | http://globplot.embl.de/propensities.html
Gallery | http://globplot.embl.de/gallery/
Links | http://globplot.embl.de/links.html

Table 6.2: Supplementary online material.

We designed parameters based on the hypothesis that the tendency for disorder can be expressed as P = RC − SS, where RC and SS are the propensities for a given amino acid to be in ‘random coil’ and regular ‘secondary structure’ respectively. The starting point for the propensity scales was the set of parameters for secondary structure and ‘random coil’ described by CHOU and FASMAN [43, 45, 44] and later introduced as propensities by DELEAGE and ROUX [61]. Initially we defined a set based solely on these parameters (shown as DELEAGE/ROUX in Figure 6.3). However, we found that this scale performed poorly in finding disordered segments. Since the structure database is now much larger, we decided to recalculate propensities for amino acids to be either in regular secondary structures (α-helices or β-strands) defined by DSSP [120] or outside of them (‘random coil’, loops, turns etc.). We defined a non-redundant set of proteins by taking one representative from each superfamily in the SCOP database (version 1.59; http://scop.mrc-lmb.cam.ac.uk/scop/, [141, 39]). The frequencies RC and SS for each amino acid were calculated from this dataset. The resulting propensities, named RUSSELL/LINDING, are given in Figure 6.3. Combining “random coil” and “secondary structure” in the RUSSELL/LINDING set enhanced the discrimination of the graphs and is the key factor in the success of this scale in detecting both disorder and globular packing. GlobPlot is not intended as a competitor for secondary structure prediction. It cannot give the same level of detail as one can obtain from a secondary structure prediction based on a multiple alignment.
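The P = RC − SS construction can be illustrated with a minimal Python sketch. It assumes that each term is a Chou-Fasman style propensity (the fraction of an amino acid observed in a state divided by the fraction of all residues in that state); the text does not spell out the normalisation, so this choice, the function name and the toy data are assumptions for illustration only.

```python
from collections import Counter

def disorder_propensities(residues):
    """residues: iterable of (amino_acid, state) pairs, state in {"RC", "SS"}.

    Returns {amino_acid: RC - SS}. Each term is computed as the fraction of
    that amino acid seen in the state divided by the fraction of all residues
    seen in the state (an assumed, Chou-Fasman style normalisation).
    """
    aa_total = Counter()
    aa_state = Counter()
    state_total = Counter()
    for aa, state in residues:
        aa_total[aa] += 1
        aa_state[(aa, state)] += 1
        state_total[state] += 1
    n = sum(aa_total.values())
    result = {}
    for aa, count in aa_total.items():
        prop = {}
        for state in ("RC", "SS"):
            background = state_total[state] / n
            observed = aa_state[(aa, state)] / count
            prop[state] = observed / background if background else 0.0
        result[aa] = prop["RC"] - prop["SS"]
    return result

# Toy DSSP-like assignments (hypothetical, not taken from SCOP).
toy = [("P", "RC")] * 3 + [("P", "SS")] + [("L", "SS")] * 3 + [("L", "RC")]
scale = disorder_propensities(toy)
```

In this toy set proline is mostly found in coil and leucine mostly in regular secondary structure, so the sketch gives proline a positive and leucine a negative value.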

We also calculated propensities based on coordinates described as missing from the Protein Data Bank (http://www.pdb.org) [231]. We considered the same set of representatives, but here we ignored domain definitions (i.e. we took the whole chain or protein). We then looked for the ‘REMARK 465’ records in the associated PDB files (restricting these to the appropriate chain when required) for residues not seen either in the electron density or the NMR structure. We presumed this set to be disordered, and all residues with C-α entries to be ordered. The resulting propensity set, named REMARK465, is online but still under development. The performance of this scale in predicting disorder will be evaluated in future work.
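Collecting the missing residues could look roughly like the following Python sketch. It is a simplified reader that assumes the usual whitespace-separated layout of REMARK 465 lines (residue name, chain identifier, sequence number, optionally preceded by a model number); it is not the code used for the REMARK465 set, and files without chain identifiers would need extra handling.

```python
import re

# Sequence number, optionally negative and optionally with an insertion code.
_SEQ = re.compile(r"^-?\d+[A-Za-z]?$")

def missing_residues(pdb_text, chain=None):
    """Return [(chain_id, residue_name, seq_id), ...] from REMARK 465 lines."""
    found = []
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 465"):
            continue
        fields = line[10:].split()
        if len(fields) < 3:
            continue
        res_name, chain_id, seq_id = fields[-3:]
        # Skip the free-text header lines of the remark block.
        if len(res_name) > 3 or len(chain_id) != 1 or not _SEQ.match(seq_id):
            continue
        if chain is None or chain_id == chain:
            found.append((chain_id, res_name, seq_id))
    return found

# Hypothetical excerpt of a PDB entry.
sample = "\n".join([
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     MET A     1",
    "REMARK 465     GLY A     2",
    "REMARK 465     SER B    10",
    "ATOM      1  CA  ALA A   3",
])
missing_in_a = missing_residues(sample, chain="A")
```

Residues returned here would be counted as disordered, and residues with a C-α record as ordered, before computing propensities in the same way as for the other sets.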

We provide a variety of different scales or propensities for the user to explore; the numerical values can be obtained from the link given in Table 6.2. In addition to the scales mentioned for finding disorder, we also have some classical hydropathy scales online.

The algorithm

The basic algorithm behind GlobPlot is simple and very fast. For each amino acid $a_i$, we have defined a propensity $P(a_i) \in \mathbb{R}$, see Figure 6.3. Given a protein sequence of length $L$, we define a sum function $\Omega$ as follows:

$$\Omega(a_i) = \sum_{j=1}^{i-1} \Omega(a_j) + \ln(i+1) \cdot P(a_i) \quad \text{for } i = 1, \ldots, L$$

where $P(a_i)$ is the propensity for the $i$th amino acid and $\ln$ the natural logarithm.
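A minimal Python sketch of the sum function is given below. It reads the definition as a running sum in which each residue adds ln(i + 1) · P(a_i) to the value at the previous position; this reading, the function name and the toy scale are assumptions for illustration and do not reproduce the GlobPlot code or the RUSSELL/LINDING values.

```python
import math

def running_sum(sequence, propensity):
    """Cumulative log-weighted disorder propensity along a sequence.

    Assumes the reading omega_i = omega_(i-1) + ln(i + 1) * P(a_i);
    residues missing from the scale contribute nothing.
    """
    total = 0.0
    curve = []
    for i, aa in enumerate(sequence, start=1):
        total += math.log(i + 1) * propensity.get(aa, 0.0)
        curve.append(total)
    return curve

# Hypothetical two-letter scale: positive = disorder, negative = order.
toy_scale = {"P": 1.0, "L": -1.0}
curve = running_sum("PPLL", toy_scale)
```

With this convention the curve climbs through disorder-promoting stretches and falls through order-promoting ones, which is what the PeakFinder later looks for in the slope.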

We run a digital low-pass filter based on Savitzky-Golay (refer to section 14.8 in [177]) over Ω in order to smooth the curve and obtain a numerical estimate of the 1st order derivative. The filtering is performed by an external open source C module (sav_gol) from the TISEAN 2.1 [102] Nonlinear Time Series Analysis package (http://www.mpipks-dresden.mpg.de/~tisean/). The resulting smoothed function ΩS is plotted using the DISLIN 8.0 package. DISLIN is distributed as platform-specific binaries from http://www.linmpi.mpg.de/dislin/. The ln(i+1) term was introduced in order to balance the plot more evenly between the N- and C-termini, by increasing the weight of the terms as a function of residue number. Putative globular and disordered segments are selected using a simple peak finder algorithm (referred to as PeakFinder). Peaks are chosen when the 1st derivative shows positive (disorder) or negative (globular) values over a continuous stretch of at least the minimum length given by the user as ‘PeakFinder window length’.
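The derivative and peak-finding steps can be sketched in plain Python as follows. The sketch uses the standard 5-point quadratic Savitzky-Golay coefficients for the first derivative instead of the TISEAN sav_gol module, and the fixed window of 5, the zero padding at the ends and both function names are assumptions made for illustration.

```python
def savgol5_derivative(values):
    """First derivative from a 5-point quadratic Savitzky-Golay fit.

    The two points at each end are left as 0.0, so nothing is called there.
    """
    deriv = [0.0] * len(values)
    for i in range(2, len(values) - 2):
        deriv[i] = (-2 * values[i - 2] - values[i - 1]
                    + values[i + 1] + 2 * values[i + 2]) / 10.0
    return deriv

def find_segments(deriv, min_length, disorder=True):
    """Return 1-based (start, end) runs where the slope keeps one sign.

    disorder=True selects positive slope, disorder=False negative slope.
    """
    segments = []
    start = None
    for i, d in enumerate(deriv):
        hit = d > 0 if disorder else d < 0
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            if i - start >= min_length:
                segments.append((start + 1, i))
            start = None
    if start is not None and len(deriv) - start >= min_length:
        segments.append((start + 1, len(deriv)))
    return segments

# Toy curve that rises and then falls.
curve = [0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0]
slope = savgol5_derivative(curve)
up = find_segments(slope, min_length=3, disorder=True)
down = find_segments(slope, min_length=3, disorder=False)
```

On the toy curve the rising half is reported as a putative disordered segment and the falling half as a putative globular one; leaving the ends at zero also shows why segments at the termini are harder to call after filtering.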

We opted to use a running sum function for three reasons. First, it results in plots that are easy to interpret, whether by human or algorithm. Second, it is a simple approach to a very complex problem. Third, there is no dependency on frame length as is the case for sliding window methods such as SEG or Prot-Scale.

We expect that one could construct algorithms that avoid the unbalanced weighting of the residue numbers and work more directly on the propensities; we plan to incorporate this in later versions.

Testing GlobPlot

Benchmarking of methods to predict disorder is hampered by both the lack of a standard definition of disorder and the lack of a quality dataset. Performance of a particular set of parameters will clearly depend on the dataset from which they are derived. The benchmarking of GlobPlot was done using the GlobPipe script collection and SQL-based data mining (we use PostgreSQL as the relational database). We found that SQL data mining is a very efficient, fast and flexible way of doing data analysis. We benchmarked GlobPlot by a variety of approaches:

• Test of disorder prediction of IUPs

• Comparison to PONDR disorder prediction

• Structural (SMART) context analysis of predicted disordered segments

• Benchmarking prediction of globular segments (GlobDoms) vs SMART

• Benchmarking prediction of disorder using structural B-factors

GlobPlot on IUPs

The operational definition of IUPs is based on X-ray, NMR, CD and a variety of hydrodynamic volume measurements. Several IUPs are only unstructured under certain equilibrium conditions where the unfolded state is favoured over the folded/structured state [60]. Coiled-coil dimerisation provides a well understood example of such an equilibrium. Induced folding upon binding to a target protein is also observed, as in the case of the proteins CREB and CBP [180, 63]. This indicates that IUP-assigned protein sets are error-prone and should be carefully considered on a case by case basis. We selected proteins from a recent review by TOMPA ([215], Table 1). We applied GlobPlot to the 20 proteins listed and predicted non-globular segments in all of these proteins. The plots of these proteins can be viewed in the online GlobPlot gallery (http://globplot.embl.de/gallery/). We found that the CBP protein, more than the CREB protein, shows significantly disordered regions, as shown in Figure 6.4. FlgM has recently been shown to be partially folded in vivo [60]; the N-terminal part is correctly found by GlobPlot to be disordered. We suggest that Stathmin seems to be disordered because the interaction partner Tsg-101 is partially disordered. Finally, the globplot of the bovine Prion protein clearly shows the N-terminal flexible tail found in the NMR study of this protein, refer to Figure 6.5.

[Figure 6.4: GlobPlot graph titled ‘CREB-binding protein (EC 2.3.1.48).’; x-axis: Residue number (0-2400); y-axis: Integrated disorder propensity. SMART domains shown: ZnF_TAZ 348-433, Pfam:KIX 587-667, BROMO 1084-1194, coiled_coil_region 1547-1576, ZnF_ZZ 1701-1742, ZnF_TAZ 1766-1844, coiled_coil_region 2190-2217.]

Figure 6.4: GlobPlot of human CREB binding protein (CBP_HUMAN). About half of the sequence appears to be in a disordered state, with long flexible regions observed at the N- and C-termini. The flexible region just after the KIX domain might be important for induced binding of the pKID domain of CREB to CBP [180, 63]. For further discussion of disorder in CBP/CREB see WRIGHT et al. [236].

[Figure 6.5: GlobPlot graph titled ‘Major prion protein 1 precursor (PrP) (Major scrapie-associated fibril protein 1).’; x-axis: Residue number (0-260); y-axis: Integrated disorder propensity. SMART domains shown: PRP 25-251.]

Figure 6.5: GlobPlot of bovine Prion protein (P10279). The flexible N-terminal segment is easily spotted, refer to [143]. The SMART ‘domain’ is in this case a protein signature, not a descriptor of a globular fold. The plot was created using the ‘Create PostScript’ option.

GlobPlot vs PONDR

A comparison of GlobPlot and the neural network predictor PONDR (http://www.pondr.com) [85, 86] was severely hampered by the fact that the PONDR server only allows 30 predictions. Therefore we could only test qualitatively. In general GlobPlot and PONDR predict about the same segments on the disordered proteins we tested. Refer to Table 6.3 for further details of the comparison.

Protein | PONDR segment | GlobPlot segment | Comment
PRIO_BOVIN | 25-141 | 22-118, 133-141 | Effectively the same.
Q13541 [4E-BP1] | 1-43, 62-116 | 24-49, 63-95, 97-106 | Effectively the same.
PRP1_HUMAN | 15-331 | 28-322 | Effectively the same.
TN101_HUMAN | 145-225 | 137-220 | Effectively the same.
CA19_HUMAN | 256-402, 415-759, 777-902 | 260-471, 474-757, 782-903 | Effectively the same.
Q9HBB5 [MUCDHL-FL] | 435-664, 730-786, 805-845 | 454-660, 695-716, 723-768, 810-836 | Effectively the same.
ROG_HUMAN | 84-130, 139-205, 348-391 | 87-200, 203-337, 342-382 | Different.
K1CI_HUMAN | 344-384, 461-622 | 8-148, 460-613 | Different.
O95060 [AblBP4] | 158-368 | 160-283, 293-307, 317-363, 375-385 | Effectively the same.

Table 6.3: For 7 out of 9 of these proteins the regions identified by the two methods are the same, allowing for minor variations in the start/end of the disordered region. In the case of K1CI, GlobPlot and PONDR both find regions at the C-terminus (~460-620), but PONDR finds a short segment between 344-384 whereas GlobPlot finds a long segment between 8-148. For ROG, both methods again find a region at the C-terminus (~345-385) and one in the region of residues 85-200. However, GlobPlot also predicts the intervening region (203-337) to be disordered.

GlobDoms vs SMART

To determine to what extent GlobPipe can be used for isolation of putative globular domains (GlobDoms) we benchmarked against the SMART server (http://smart.embl.de). A data set of 10497 human protein sequences was created. The following criteria were imposed on the candidate sequences:

1. Subset of Swiss-Prot human proteome that contains SMART hit(s) [IVICA LETUNIC, personal communication]

2. Key word filtered for ‘fragments’, ‘putative’, ‘hypothetical’ and ‘similar to’

3. Non-redundant data (based on EMBL’s nrsp95)

The results of the structural context analysis can be seen in Figure 6.6. In the 10497 proteins SMART predicted 47340 domains, of which 25672 were longer than 30 residues. Finally, 47989 putative globular domains (GlobDoms) were found by GlobPipe using a PeakFinder search window length of 30. GlobPlot predicts a substantial fraction of putative domains that are not known to, or found by, SMART/Pfam.

The PeakFinder module was modified to find ‘downhill’ areas in the GlobPlot that were 30 or more amino acids long. 30 is about the minimum size that SMART annotation will consider a globular domain (except for disulfide-linked mini-domains).

[Figure 6.6: bar chart titled ‘GlobDoms vs SMART domains: Prediction & Discovery’; x-axis: Length in residues (bins 30-49, 50-74, 75-99, 100-149, 150-199, 200-249, 250-299, 300-399, 400-499, 500-999, 1000-1999, 2000+); y-axis: # domains in length range (0-8000); series: SMART domains, Patched GlobDoms, External GlobDoms.]

Figure 6.6: Benchmarking of GlobDoms vs SMART domains: We used the GlobPipe PeakFinder to search for ‘down-hill’ areas (negative 1st order derivative) in the GlobPlot graph. Assuming that such regions (GlobDoms) can be patched together (and thereby define a single domain) if they overlap with or are completely embedded in a SMART domain on the same sequence, we establish a recovery of the SMART domains. Patched GlobDoms are predicted domains co-located with a known SMART domain on the same sequence. The green ‘Discovery’ bar shows how many GlobDoms are found entirely outside SMART predictions. From fragments of length 100 and up we observe that the fraction originating from the overlapped segments results in overprediction.

Disordered segments in SMART context

[Figure 6.7: bar chart titled ‘Disordered segments: Length & Location’; x-axis: Segment length (bins <7, 7-9, 10-14, 15-19, 20-24, 25-29, 30-49, 50-74, 75-99, 100-149, >159); y-axis: # disordered segments per length range (0-15000); series: Overlapped, Nested, Outside.]

Figure 6.7: Structural analysis of disordered segments: For all lengths of segments it is observed that more segments are found in sequence not predicted to be globular by SMART. However, we observe a significant amount of internal disorder, which in many cases can correspond to a loop, hinge or another type of flexible insertion in the protein. The amount of overlapping is difficult to interpret; however, we have observed that GlobPlot often predicts the domain boundaries more precisely than SMART does. This is because SMART typically uses only the ‘core’ sequences to define the Hidden Markov Model for a given domain.

We also made an attempt to look for the structural context of the disordered segments GlobPipe predicted in the above dataset. GlobPipe predicted 75152 disordered segments using an 8-residue frame for the smoothing routine. We compared all of these segments with the domain architecture of the “host sequence” as predicted by SMART. We found that most flexible regions fall outside globular domains (refer to Figure 6.7). For short peptides (7-14 residues) about half are nested within SMART domains; for longer segments far fewer are nested and more are overlapping. We interpret these data as an indication that GlobPlot does seem to discriminate structural features and context of the disordered segments. Another observation is that most segments are within this short range; from a functional point of view this makes sense, since we know that most functional flexible/disordered sites are of length 5-10. This gives a basis for deploying GlobPlot as a discovery/overview tool for annotation of functional sites.
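The nested/overlapped/outside classification can be sketched in Python as below. The exact rules used in the GlobPipe analysis are not given in the text, so the definitions in the docstring and the function name are assumptions; the two domain ranges are the KIX and BROMO annotations listed for CBP in Figure 6.4, while the query segments are hypothetical.

```python
def classify_segment(segment, domains):
    """Classify a (start, end) segment against a list of (start, end) domains.

    'nested'     : entirely inside one domain
    'overlapped' : shares residues with a domain but is not inside it
    'outside'    : shares no residue with any domain
    Coordinates are 1-based and inclusive.
    """
    seg_start, seg_end = segment
    label = "outside"
    for dom_start, dom_end in domains:
        if dom_start <= seg_start and seg_end <= dom_end:
            return "nested"
        if seg_start <= dom_end and dom_start <= seg_end:
            label = "overlapped"
    return label

cbp_domains = [(587, 667), (1084, 1194)]  # KIX and BROMO from Figure 6.4
labels = [classify_segment(s, cbp_domains)
          for s in [(600, 610), (660, 700), (700, 800)]]
```

Counting these labels per segment length bin gives the kind of breakdown plotted in Figure 6.7.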

Benchmarking using B-factors

Structural B-factors (isotropic temperature factors) were chosen because they are unrelated to the data we used in creating the propensity sets and because they to a certain degree reflect disorder and flexibility in the polypeptide chain [169, 225, 244]. However, B-factors vary greatly between structures and are often influenced by crystal packing and other structural artefacts. In an attempt to avoid these issues only the B-factor for the Cα atom was considered.

We defined a non-redundant set of proteins by taking one representative from each family in the SCOP database. The set was reduced to contain only X-ray structures with a resolution better than 2.2 Å. The B-factor values were extracted from the PDB entries, and the average B-factor and standard deviation were calculated for each chain. In order to have a stringent set, only residues that had a B-factor 3.5 standard deviations above the average were marked as disordered. The resulting data were then compared to the predictions of GlobPlot using the RUSSELL/LINDING propensity set. The 7 most N- and C-terminal residues were ignored due to the lower sensitivity introduced by the Savitzky-Golay filtering.
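The labelling rule can be sketched in Python as follows. The sketch assumes one Cα B-factor per residue for a single chain and uses the population standard deviation; the text does not say which estimator was used, and the function name and toy values are illustrative only.

```python
import statistics

def label_by_bfactor(bfactors, n_sd=3.5, trim=7):
    """Map 1-based residue positions to True (disordered) or False (ordered).

    A residue counts as disordered when its B-factor lies more than n_sd
    standard deviations above the chain average. The first and last `trim`
    residues are left out, as in the benchmark described above.
    """
    mean = statistics.mean(bfactors)
    sd = statistics.pstdev(bfactors)  # population SD: an assumption
    cutoff = mean + n_sd * sd
    core = range(trim, len(bfactors) - trim)
    return {i + 1: bfactors[i] > cutoff for i in core}

# Hypothetical chain of 30 residues with one very mobile residue.
toy_b = [20.0] * 30
toy_b[15] = 200.0
b_labels = label_by_bfactor(toy_b)
```

Because the cut-off is relative to each chain's own mean and spread, the rule only flags residues that are exceptionally mobile within their structure, which is what makes the reference set stringent.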

At a specificity of 88% we obtained a sensitivity of 28%. The low sensitivity is accounted for by the fact that 74% of the false negatives were due to high B-factor helix bundles and domains observed in enzymes (especially ligases, hydrolases and oxidoreductases) [186]. We expect a putative sensitivity of 59% in this dataset.
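For reference, sensitivity and specificity in a two-state (disorder/order) benchmark can be computed per residue as in this short Python sketch; the function name and the toy labels are hypothetical and unrelated to the dataset above.

```python
def sensitivity_specificity(observed, predicted):
    """observed, predicted: equal-length sequences of booleans (True = disorder).

    Returns (sensitivity, specificity); a value is None when undefined.
    """
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)
    fn = sum(1 for o, p in pairs if o and not p)
    tn = sum(1 for o, p in pairs if not o and not p)
    fp = sum(1 for o, p in pairs if not o and p)
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    return sensitivity, specificity

obs = [True, True, False, False, False]
pred = [True, False, False, False, True]
sens, spec = sensitivity_specificity(obs, pred)
```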

A fundamental problem with benchmarking a disorder predictor is that currently no general definition of disorder is agreed on. Furthermore, we here describe disorder as two states (disorder/order), whereas one should expect it to be multistate. These issues make it very difficult to select unbiased datasets for benchmarking.

We will continue the hunt for better datasets and further benchmarking to be used in disorder studies in future work. We reiterate that GlobPlot successfully identifies the disorder in well characterised proteins like TAU, CBP, Prions and PRP1_HUMAN.

Name of option | Effect of option
Plot title | If a SWISS-PROT acc/entry is entered, an auto generated title is created. This option allows for an alternative title.
Windowlength PeakFinder | Min. length of cont. disorder the PeakFinder should select.
Perform SMART prediction | A domain search using SMART, slows down the plotting.
Show dy/dx plot | Shows the smoothed 1st order derivative of the sum function used by the PeakFinder to find segments of disorder.
Show raw plot | Show the raw sum function without digital filtering.
Window length for smoothing | The window length used by the least-square routine in the Savitzky-Golay filtering.
Windowlength for derivative | Length of smoothing window used for 1st derivative.
Propensity sets | The user can choose among several propensity sets.

Table 6.4: Non-trivial user options available.

Using GlobPlot

The GlobPlot software package consists of two parts, both implemented in the language Python 2.2 (http://www.python.org):

An Internet plotting server - GlobPlot

GlobPlot is a CGI (Common Gateway Interface) based server accessible at http://globplot.embl.de, for exploring disorder and globular segments (GlobDoms). The web interface is fairly straightforward to use: the user can paste a sequence or enter the SWISS-PROT/SWALL accession (e.g. P08630) or entry code (e.g. BTLK_DROME). The GlobPlot server fetches the sequence and description of the polypeptide from an ExPASy server using Biopython.org software. By default the server will send the sequence to the public SMART queue (http://smart.embl.de, which by default also predicts Pfam domains) and display any obtained domain predictions as coloured boxes layered on the graph. The SMART/Pfam prediction substantially increases the plotting time, but is set to ‘on’ by default because it is a very informative feature. Showing the boundaries of known SMART domains in the sequence is of great value for navigating as well as analysing the globplot. The SMART predictions are used solely for graphical viewing; they are not used in the GlobPlot routine itself.

In order to present a graph that is smoothed for digital noise, we use a digital low-pass filter based on Savitzky-Golay (least-square fitting). The user can obtain the non-smoothed curve, as well as change the window length used by the Savitzky-Golay algorithm; however, the default settings for the smoothing are normally optimal. Further information on the available user options is given in Table 6.4. To allow further data analysis, the numerical data for the plot can be downloaded in tabulated format from the result page and used in other plotting software such as Grace, OpenOffice.org or Excel. Because GlobPlot is ‘scale stable’ the user can paste in a specific sub-sequence and obtain a zoomed plot. The output file format for the plot is PNG (Portable Network Graphics), but publication quality plots can be created using the PostScript option. Residue ranges for the disordered segments and globular regions (GlobDoms) found are shown at the bottom of the output page.

A pipeline interface - GlobPipe

GlobPipe is a pipeline that can be used for proteome-scale analysis. The pipeline software is not a complete package but rather a set of routines that perform tasks relevant for SQL-driven data mining of large amounts of data in a relational database (PostgreSQL, www.postgresql.org). GlobPipe is still under development, but it should be possible for any user with some programming skills to set up their own pipeline analysis using these routines. We expect to set up a database of disordered protein sequences based on GlobPipe predictions.

The full GlobPlot/GlobPipe package (excluding the DISLIN and TISEAN modules, which both have to be obtained from their respective websites) can be downloaded as a tarball at http://globplot.embl.de/download.html. The software is released under the Academic Free License Version 1.2 and is thereby OSI Certified Open Source Software (http://www.opensource.org). The software has broad platform coverage and is currently served on a FreeBSD box.

GlobPlot can be used for...

• Finding regions that create trouble in your crystallisation setups

• Searching for putative new globular domains and functional sites

• Searching for IUPs and intrinsic disorder

• Graphical visualisation of any propensity set that can be constructed for a polymeric sequence

• Building a database of putative IUPs and domains using GlobPipe

Acknowledgements

We thank WILL STANLEY, KRESTEN LINDORFF-LARSEN, JESPER BORG and DAVID MARTIN for cool fruitful discussions and suggestions, and CHRISTINE GEMUEND and IVICA LETUNIC for SMART interface code/data. We are grateful to FRANCESCA DIELLA, CHENNA RAMU, SOPHIE CHABANIS and SARA QUIRK for reading this manuscript. This work was partly supported by EU grant QLRI-CT-2000-00127.

Finally we are deeply grateful to FreeBSD.org, (bio)Python.org, PostgreSQL.org, Debian.org and Apache.org for fantastic open free software.

Paper III

The success of GlobPlot led us to introduce a neural network predictor of protein disorder. To help the field of disorder we wanted to clearly define several types of protein disorder, including the novel concept of hot loops. This is the first successful use of thermo-factors in disorder prediction. The predictor of hot loops is extremely accurate, and we show that hot loops correspond to some degree to disorder observed in NMR experiments.

6.8 Protein Disorder Prediction: Implications for Structural Proteomics

Rune Linding, Lars Juhl Jensen, Francesca Diella, Peer Bork, Toby J. Gibson & Robert B. Russell

EMBL - Biocomputing Unit - Meyerhofstr 1 - D-69117 Heidelberg - Germany

Max-Delbrück-Centre für Molecular Medicine - Robert-Rössle-str 10 - D-13092 Berlin - Germany

CellZome GmbH - Meyerhofstr 1 - D-69117 Heidelberg - Germany

Corresponding author: [email protected]

Contributed equally to this work

A great challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Disordered regions in proteins often contain short linear peptide motifs (e.g. SH3-ligands and targeting signals) that are important for protein function. We present here DisEMBL, a computational tool for prediction of disordered/unstructured regions within a protein sequence. As no clear definition of disorder exists, we have developed parameters based on several alternative definitions, and introduced a new one based on the concept of ‘hot loops’, i.e. coils with high temperature factors. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects. The tool is freely available via a web interface (http://dis.embl.de) and can be downloaded for use in large scale studies.

Introduction

In the post-genomic era, discovery of novel domains and functional sites in proteins is of growing importance. One focus of structural genomics initiatives is to solve structures for novel domains and thereby increase the coverage of fold and structure space [29]. During the target selection process in structural genomics/biology, intrinsic protein disorder is important to consider since disordered regions at the N- and C-termini (or even within domains) often lead to difficulties in protein expression, purification and crystallisation. It is therefore essential to be able to predict which regions of a target protein are potentially disordered/unstructured. Computational tools to help discern ordered globular domains from disordered regions are key to such efforts.

It is becoming increasingly clear that many functionally important protein segments occur outside of globular domains [236, 69]. Protein structure and function space is partitioned in two subspaces. The first consists of globular units with binding pockets, active sites and interaction surfaces. The second subspace contains non-globular segments such as sorting signals, post-translational modification sites and protein ligands (e.g. SH3 ligands). Globular units are built of regular secondary structure elements and contribute the majority of the structural data deposited in PDB. In contrast, the non-globular subspace encompasses disordered, unstructured and flexible regions without regular secondary structure. Functional sites within the non-globular space are known as linear motifs (catalogued by ELM, http://elm.eu.org) [179].

There are also many recent reports of Intrinsically Disordered Proteins (IDPs, also known as Intrinsically Unstructured Proteins). These are proteins or domains that, in their native state, are either completely disordered or contain large disordered regions. More than 100 such proteins are known including Tau, Prions, Bcl-2, p53, 4E-BP1 and eIF1A (see Figure 6.11) [215, 219].

Protein disorder is important for understanding protein function as well as protein folding pathways [171, 221]. Although little is understood about the cellular and structural meaning of IDPs, they are thought to become ordered only when bound to another molecule (e.g. CREB-CBP complex [180]) or owing to changes in the biochemical environment [71, 69, 219].

The current view on disorder is that disordered proteins are disordered to allow for more interaction partners and modification sites [236, 140, 215]. It has also been suggested that disordered proteins exist to provide a simple solution to having large intermolecular interfaces while keeping protein, genome and cell sizes smaller [95]. It has been noted that having several relatively low affinity linear interaction sites allows for flexible, subtle regulation and can account for specificity with fewer linear motif types [79]. It has also been demonstrated that protein disorder plays a central role in biology and in diseases mediated by protein misfolding and aggregation [194, 123, 21].

No commonly agreed definition of protein disorder exists. The thermodynamic definition of disorder in a polypeptide chain is the “random coil” structural state. The random coil state can best be understood as the structural ensemble spanned by a given polypeptide in which all degrees of freedom are used within the conformational space. However, even under extremely denaturing solvation conditions, such as 8M urea, this theoretical state is not observed in solvated proteins [199, 3, 126]. Proteins in solution thus seem to always keep a certain amount of residual structure.

Protein disorder is only indirectly observed by a variety of experimental methods, such as X-ray crystallography, NMR, Raman and CD spectroscopy, and hydrodynamic measurements [205, 71]. In vivo studies of disorder are possible with NMR spectroscopy on living cells (e.g. anti-sigma factor FlgM [60]). Each one of these methods detects different aspects of disorder, resulting in several operational definitions of protein disorder [see 215, for a review].

There have been several previous attempts to predict disorder. Perhaps the earliest are methods finding regions of low-complexity. Although many such regions are structurally disordered, the correlation is far from perfect as regions of low sequence complexity are not always disordered (and vice versa) [69]. Likely the strongest evidence for this correlation comes from the fact that low-complexity regions are rarely seen in protein 3D structures [190]. Methods to predict low complexity, like SEG [235] and CAST [178], are thus often used for this purpose. Methods using hydrophobicity can also give hints as to disordered regions, as they are typically exposed and rarely hydrophobic.

The first tool designed specifically for prediction of protein disorder was PONDR (Predictor Of Naturally Disordered Regions, http://www.pondr.com) [182, 85, 86]. It is based on artificial neural networks. An alternative method is GlobPlot (http://globplot.embl.de) that instead relies on a novel propensity based disorder prediction algorithm [139].

Regions without regular secondary structure can be predicted by the NORSp (NOn Regular Structure) server [140]; however, as the authors admit, such regions are not necessarily disordered. Structures such as the Kringle domain (PDB: 1krn) are almost entirely without regular secondary structure in their native state but they still have tertiary structure wherein the basic building block is coils. These “loopy proteins” are not necessarily IDPs since they can still form a well defined globular tertiary structure.

Prediction of protein tertiary structure could be an alternative route to disorder prediction, though such methods are computationally intensive and error-prone. Moreover such methods are usually designed to predict the structure of globular domains, meaning that their behaviour on other sequences can be unpredictable.

Here we present DisEMBL, a method based on artificial neural networks trained for predicting several definitions of disorder. It predicts and displays the probability of disordered segments within a protein sequence. DisEMBL furthermore provides a pipeline interface for bulk predictions, essential for large scale structural genomics.

Results & Discussion

Our definitions of disorder

As no single definition of disorder exists, we will describe in detail what we define as disordered regions. We describe protein disorder as two-state models where each residue is either ordered or disordered. For this purpose we used three different criteria for assigning disorder:

• Loops/coils as defined by DSSP [120]. Residues are assigned as belonging to one of several secondary structure types. For this definition we considered residues assigned as α-helix (’H’), 3₁₀-helix (’G’) or β-strand (’E’) as ordered, and all other states (’T’, ’S’, ’B’, ’I’, ’ ’) as loops (also known as coils). Loops/coils are not necessarily disordered; however, protein disorder is only found within loops. It follows that one can use loop assignments as a necessary but not sufficient requirement for disorder; a disorder predictor entirely based on this definition will thus be promiscuous.

• Hot loops constitute a refined subset of the above, namely those loops with a high degree of mobility as determined from Cα temperature factors (B-factors). It follows that highly dynamic loops should be considered protein disorder. Several attempts have been made to try to use B-factors for disorder prediction [31, 222, 85, 70, 244] but there are many pitfalls in doing so as B-factors can vary greatly within a single structure due to effects of local packing and structural environment. Recent progress in deriving propensity scales for residue mobility based on B-factors [201] encouraged us to use B-factors for defining protein disorder.

• Missing coordinates in X-ray structures as defined by REMARK465 entries in PDB. Non-assigned electron densities most often reflect intrinsic disorder, and have been used early on in disorder prediction [136].
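The loops/coils criterion in the first bullet amounts to a two-state relabelling of a DSSP string. The following is a minimal sketch, assuming the DSSP assignment is available as one character per residue; the function name and the 'O'/'L' output alphabet are illustrative choices and are not taken from DisEMBL.

```python
# Two-state labelling of a DSSP secondary-structure string (illustrative sketch).
# States treated as ordered follow the text: H (alpha-helix), G (3-10 helix),
# E (beta-strand). Every other DSSP state, including blank, counts as loop/coil.
ORDERED_STATES = {"H", "G", "E"}


def two_state_labels(dssp: str) -> str:
    """Return 'O' for ordered residues and 'L' for loop/coil residues."""
    return "".join("O" if state in ORDERED_STATES else "L" for state in dssp)


if __name__ == "__main__":
    # Toy DSSP string, not taken from a real PDB entry.
    print(two_state_labels("HHHT SEEB"))
```

A loop label produced this way is only a necessary condition for disorder, which is why the text treats the coil predictor as promiscuous.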

A fundamental problem with X-ray data is that it is limited to what is found in the PDB. Many structures are solved on truncated polypeptides, explicitly because the parts that are cut off are disordered or highly flexible, meaning that the data set itself is truncated. Some of these regions could be recovered by combining PDB with sequence databases; however, this cannot be performed in an automated fashion.

Performance evaluation

For each of the three definitions of disorder described above, a data set was constructed and partitioned into five cross validation sets. We trained an ensemble of five artificial neural networks on each of the three data sets. Figure 6.8 shows the expected performance for these three predictors on novel sequences as estimated by cross validation. As can be seen, we are able to predict a large fraction of the missing coordinate residues with a very low error rate.

[Figure 6.8 plot: rate of false positives (y-axis, 0 to 1) against sensitivity (x-axis, 0 to 1) for the Loops/Coils, Hot loops and Missing coordinates networks.]

Figure 6.8: Sensitivity and rate of false positives for the various DisEMBL neural networks. The receiver operating characteristic (ROC) curves were constructed from cross validation test set performances. Performances are reported on a per residue basis. Since the data sets were homology reduced based on SCOP, the performances shown correspond to what can be expected for novel protein sequences.

The coils networks predict regions without regular secondary structure; together they form a two-state secondary structure prediction method. This neural network ensemble is capable of identifying approximately half of the negative examples, while discarding essentially no positive examples (see Figure 6.8). Therefore this predictor is perhaps better thought of as a filter to remove false positive predictions made by the other networks.

[Figure 6.9 plot: rate of false positives (y-axis, 0 to 1) against sensitivity (x-axis, 0 to 1) for Missing coordinates (raw), Missing coordinates (smoothed), PONDR (VL-XT), PONDR (XL1) and PONDR (CaN).]

Figure 6.9: Comparison with PONDR. We tested the performance of our networks on the data sets from Dunker et al. (http://disorder.chem.wsu.edu/PONDR/PONDR.htm). Only the performance points for the various PONDR predictors reported at the web site are shown, as we do not have access to the raw PONDR predictions. Relative to these points, the REMARK465 predictor of DisEMBL performs marginally better than PONDR.

Separate networks were trained for predicting which of the loops have high B-factors (‘hot loops’). The performance of these networks is shown in Figure 6.8. Hot loops are predicted using these two network ensembles collectively. The overall performance of this composite predictor can be estimated from the individual performances of these ensembles as follows. Choosing a cutoff of 80% sensitivity for each ensemble corresponds to a sensitivity of 64% (80% × 80%) for the composite predictor. At these cutoffs the rates of false positives are 6.9% and 19% respectively, corresponding to 1.3% when combined. It follows that hot loops are very predictable. In contrast, the missing coordinates predictor has a higher (16%) rate of false positives at the same sensitivity. A possible explanation for this is that REMARK465 can be assigned to a residue for several reasons, disorder being only one of these.
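The composite estimate for the hot loop predictor can be reproduced with a few lines of arithmetic. This sketch assumes, as the multiplication in the text implies, that the two ensembles make their errors independently; the function name is hypothetical.

```python
# Combined sensitivity and false positive rate of two predictors that must
# both fire, under the assumption that their errors are independent.
def composite_rates(sens_a, fpr_a, sens_b, fpr_b):
    """Return (sensitivity, false positive rate) of the AND-combination."""
    return sens_a * sens_b, fpr_a * fpr_b


if __name__ == "__main__":
    # Values quoted in the text: 80% sensitivity for each ensemble,
    # false positive rates of 6.9% and 19%.
    sensitivity, false_positive_rate = composite_rates(0.80, 0.069, 0.80, 0.19)
    print(f"sensitivity: {sensitivity:.0%}")
    print(f"false positive rate: {false_positive_rate:.1%}")
```

If the two ensembles were positively correlated in their mistakes, the real false positive rate would be higher than this product, so the figure is best read as an estimate.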

We compared DisEMBL to PONDR (see Figure 6.9). The comparison to PONDR was severely hampered by the fact that access to raw PONDR predictions is restricted. Therefore we can currently only compare to the performance points stated on the official website of PONDR’s developers. Since the VL-XT predictor smooths its predictions by a running average of 9 residues, we also applied this smoothing to our predictions. Relative to these points our predictor performs marginally better in predicting the same type of disorder. We only use smoothing in this comparison as the performance gain is a consequence of the design of this particular data set. Smoothing does not improve performance on our own data sets.

Complementarity of predictors

In order to investigate the relationships between the different disorder definitions we determined how correlated our predictors are. This was done by calculating the linear Pearson correlation coefficients for the predictions by the three predictors on a data set consisting of one sequence from each protein family in SCOP version 1.61. The correlation between hot loops and coils is trivial since the data set used for training the hot loop predictor is a subset of the one used for the coils network. The predictions of missing coordinates and coils are only weakly related (CC = 0.231), while the hot loops predictions show a stronger correlation to the REMARK465 networks (CC = 0.455).
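For reference, the Pearson correlation coefficient between two sets of per-residue predictions can be computed as sketched below. The input lists are toy values chosen for illustration, not DisEMBL output, and the function name is an assumption.

```python
import math


def pearson(x, y):
    """Linear Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-residue scores from two predictors.
    hot_loops = [0.1, 0.4, 0.8, 0.7, 0.2]
    missing = [0.2, 0.3, 0.9, 0.5, 0.1]
    print(pearson(hot_loops, missing))
```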

Since we initially assumed that missing coordinates directly reflect protein disorder, the correlation with the hot loops predictions supports this alternative definition of disorder. It also shows that the definitions are complementary, not redundant.

The relationship between the different predictors can also be seen in Figure 6.10. The figure shows the per residue counts in the different data sets. In general hydrophobic residues promote order according to all three definitions of disorder. Disorder promoting residues include proline, lysine, serine, threonine and methionine. For lysine it can be seen that even though this residue is not observed much in coils it is found primarily in hot loops; the opposite is the case for proline. Methionine suffers a bias in the REMARK465 data set for at least two reasons: 1) the N-terminal methionine is often cleaved off; 2) some structures are solved using selenomethionine derivatives for phasing, which can lead to deletion of the residue in the PDB entry. The same bias is seen in [71, Figure 10].

[Figure 6.10 plot: log(odds) (y-axis, -0.6 to 0.6) for each amino acid in the order P K S T N D R M E G Q H W I V A L C F Y (x-axis), for the Hot loops, Coils and Missing coordinates data sets.]

Figure 6.10: Propensities for the amino acids to be disordered according to the three definitions used in this work (sorted by hot loop preference). This scale directly reflects what was in the data sets used for training; however, it is only a first approximation of what the neural networks are using in predicting disorder. The coils scale is similar to the RUSSELL/LINDING propensity scale described in [139]. Error bars correspond to the 25 and 75 percentiles as estimated by stochastic simulation.

Comparing predictions and experiments

As mentioned, NMR and CD data can provide insight on protein disorder. Since such data were not used during training, they provide good examples for validating our predictors. NMR parameters such as order vectors, chemical shifts within the ‘random coil window’ and disordered residues as assigned by the authors will likely provide the best data sets for training disorder predictors in the future. Chemical shifts are already being used for prediction of secondary structure and coils, e.g. in TALOS [50]. Unfortunately, NMR data are rather scarce and only cover a subset of structure space, which is why we have not used such data for training predictors. However, it is interesting to notice that our predictions of disordered regions seem to agree with disorder determined by NMR. An example of this is Figure 6.11, which shows predictions of hot loops mapped on the NMR structure of human translation factor eIF1A.

Figure 6.11: DisEMBL hot loop predictions mapped on the NMR mean structure of human translation initiation factor eIF1A (PDB: 1D7Q, SWISS-PROT: P47813). The predicted probabilities were mapped with a colour scale going from blue to red, where red corresponds to the most likely disordered regions and blue to ordered regions. Both the manually assigned disordered regions score higher than the globular domain, in particular the N-terminal one [22].

Missing signals in the far UV of CD spectra indicate the absence of regular secondary structure. A fundamental problem in using CD data for assigning disorder is that the lack of regular secondary structure does not imply that the protein is disordered, merely that it is a ‘loopy protein’. Another disadvantage of CD data is that they do not provide information on which residues are disordered. The same limitations apply to hydrodynamic measurements. We have therefore not used CD or hydrodynamic data in the training of our predictors.

However, the lack of residue specific information allows us to make predictions of disordered segments within proteins shown to contain such segments by CD. A number of examples previously described in the literature as being disordered were analysed with DisEMBL (Table 6.5). In all of these cases, we predict either all or large parts of the protein to be disordered. In the case of histone H1.2 the predictions are in agreement with what is known from experimental studies of the homologue histone H5 and other linker histones [16]. CREB is another protein that has been intensively studied; we correctly predict the unstructured pKID domain (approx. residues 113-154) [180, 236, 63].

Using DisEMBL

The DisEMBL prediction method is publicly accessible as a web server at http://dis.embl.de for predicting disorder in proteins. Although the GlobPlot server at http://globplot.embl.de can also be used for predicting protein disorder, the two methods complement each other as they approach disorder prediction differently. GlobPlot is less accurate than DisEMBL in coils prediction; however, it was designed as a visual inspection tool for finding domain boundaries, repeats and unstructured regions. Furthermore, the GlobPlot algorithm is very simple and intuitive, which might appeal to some users.

Table 6.5: Hot loops predictions on proteins that are reported to be IDPs according to CD data [215, 219, 71].

Protein [acc number]               DisEMBL hot loops segments     Protein length
Histone H1.2 [P15865]              1-39, 110-218                  218
Somatoliberin (GHRH) [P42692]      1-4, 28-45                     45
Protamine [P15340]                 1-61 (full length)             61
CREB [P15337]                      1-10, 100-170, 330-341         341
30S Ribosomal Protein [P02379]     1-70 (full length)             70
Prothymosin alpha [P01252]         1-109 (full length)            109
cAMP-dependent PKI [P04541]        1-20, 26-34, 50-75             75
Hirudin [P01050]                   1-13, 42-65                    65

[Figure 6.12 plot: disorder probability (y-axis, 0.0 to 1.0) against residue number (x-axis, 0 to 200) for Histone H1.2 (H1d); predictors: Coils, Remark-465, Hot loops.]

Figure 6.12: Sample output from the DisEMBL web server, showing predictions for Histone H1.2 [P15865]. The green curve is the prediction for missing coordinates, red for the hot loop network and blue for coil. The horizontal lines correspond to the random expectation level for each predictor; for coils and hot loops the prior probabilities were used, while a neural network score of 0.5 is used for REMARK465. From this plot it is seen that the predictors agree on residues 1-39 and 110-218 as being disordered.

The web interface is fairly straightforward to use: the user can paste a sequence or enter the SWISS-PROT/SWALL accession (e.g. P08630) or entry code (e.g. PRIO_HUMAN). The DisEMBL server fetches the sequence and description of the polypeptide from an ExPASy server using Biopython.org software. The probability of disorder is shown graphically, as illustrated in Figure 6.12. The random expectation levels for the different predictors are shown on the graph as horizontal lines but should only be considered an absolute minimum.

The user will normally not have to change the default parameters, but online documentation of the different settings is provided at http://dis.embl.de/help.html. If the query protein sequence is very long (>1000 residues), the user can download the predictions and use a local graph/plotting tool such as Grace or OpenOffice.org to plot and zoom the data.

Having identified the potential disordered regions, the user should now have a good basis for setting up expression vectors and/or comparing the predictions with structural data already obtained. It is currently impossible to say which of the definitions of disorder is most appropriate for design of protein expression vectors. We thus strongly encourage feedback on successes and failures in using DisEMBL for expression and structural analysis of proteins.

The web server only allows predictions on one sequence at a time. If bulk predictions are needed, we supply DisEMBL as a pipeline software package. The pipeline consists of the same three neural networks implemented as one ANSI C code module, which reads sequence data from STDIN and writes predictions to STDOUT. The pipeline interface is intended for the structural genomics initiatives. The pipeline can analyse in the order of 1 million residues/min on a 1GHz x86 PC. This allows for very large scale predictions, e.g. as part of a structure space scanning. DisEMBL is released as OSI (http://www.opensource.org) certified open source software and can be downloaded from http://dis.embl.de/download.html.

Conclusion

We have presented a method to predict disordered regions within protein sequences. Our method profits from predicting protein disorder according to multiple definitions, including the new concept of ‘hot loops’. Furthermore, the method is highly accurate, predicting more than 60% of hot loops with less than 2% false positives. We anticipate that DisEMBL will be of great use to experimental biologists wishing to optimise constructs for expression and crystallisation, or who wish to identify features within a studied protein sequence. We expect further progress to come from a deeper understanding of disordered proteins, which will lead to more systematic definitions of the phenomenon.

Materials and Methods

The coils data set was constructed based on DSSP [120] secondary structure assignments as described in Linding et al. [139]. This data set only contains one chain from each SCOP superfamily according to SCOP 1.59. Two different versions of this data set were used for training neural networks. One version simply consists of the ‘raw’ labelling of residues as described above, while the other version is filtered to only include regions of at least 7 consecutive residues with the same labelling. The latter version of the data set consists of 1238 sequences with 132395 labelled residues, 75424 of which are labelled as coil.
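The filtered version of the coil data set keeps only stretches of at least 7 consecutive residues with the same label. One possible way to express such a filter is sketched below; masking the excluded residues with '-' is an assumption made for illustration, since the text does not say how they were represented.

```python
from itertools import groupby


def filter_short_runs(labels: str, min_len: int = 7, masked: str = "-") -> str:
    """Keep runs of identical labels of length >= min_len, mask shorter runs."""
    out = []
    for label, run in groupby(labels):
        run_length = len(list(run))
        keep = run_length >= min_len
        out.append((label if keep else masked) * run_length)
    return "".join(out)


if __name__ == "__main__":
    # 'O' = ordered, 'L' = loop/coil; toy labelling, not real data.
    print(filter_short_runs("OOOOOOOLLLOOOOOOOO"))
```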

A second data set was constructed for discriminating between ordered and disordered loops. Loops were identified in a similar manner as in the coil data set, except that Continuous DSSP [6] rather than DSSP was used for secondary structure assignment and SCOP 1.61 was used for reducing the data set to one chain per family. As B-factors from different chains are not directly comparable, B-factors from regions of regular secondary structure were used for normalisation by establishing chain specific cutoffs for discriminating between ordered and disordered regions. Subsequently, all loop regions with B-factors below the median for secondary structure elements were labelled as ordered loops, while only those above the 90% fractile were considered to be disordered loops. This results in 1412 sequences containing only 795 residues labelled as being disordered.
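The chain specific labelling of loops can be sketched as follows. Several details are assumptions rather than statements from the text: labels are assigned per residue instead of per loop region, the 90% fractile is taken from the B-factors of the regular secondary structure residues of the same chain, a nearest-rank fractile is used, and loop residues falling between the two cutoffs are left unlabelled.

```python
import math
import statistics


def fractile(values, q):
    """Nearest-rank q-fractile (0 < q <= 1) of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def label_loops(loop_bfactors, structured_bfactors):
    """Label loop residues of one chain as 'ordered', 'disordered' or None.

    The cutoffs are chain specific: the median and the 90% fractile of the
    B-factors observed in regular secondary structure of the same chain.
    """
    low_cutoff = statistics.median(structured_bfactors)
    high_cutoff = fractile(structured_bfactors, 0.90)
    labels = []
    for b in loop_bfactors:
        if b < low_cutoff:
            labels.append("ordered")
        elif b > high_cutoff:
            labels.append("disordered")
        else:
            labels.append(None)  # ambiguous, not used for training
    return labels


if __name__ == "__main__":
    # Toy B-factors for helix/strand residues and for loop residues.
    structured = list(range(10, 31, 2))
    print(label_loops([15, 30, 22], structured))
```

Using cutoffs derived from each chain's own secondary structure elements is what makes B-factors from different structures comparable in this scheme.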

Finally, a data set was constructed for the prediction of missing coordinates (REMARK465). Like the B-factor data set, this set was reduced to include only one chain from each protein family according to SCOP version 1.61, a total of 1547 sequences.

To form separate test and training sets, each of these data sets was split into five cross-validation partitions. As the data sets have already been reduced to at most one sequence per SCOP family, sequence similarity within the data set does not present a problem. Note that the use of single representatives from SCOP families ensures that no sequence detectable homologues are in the data set. Cross-validation performance estimates are thus independent of how the data sets are partitioned. To ensure that the partitions were balanced in the sense that each contains roughly the same number of disordered residues, the sequences were first sorted according to their number of disordered residues and then assigned to the five cross-validation partitions in a round-robin scheme.
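The balanced partitioning described above can be sketched in a few lines. The function name, the dictionary input and the descending sort order are illustrative assumptions; the text only states that sequences were sorted by their number of disordered residues and dealt out round-robin.

```python
def round_robin_partitions(disordered_counts, n_partitions=5):
    """Split sequence ids into partitions with similar disorder content.

    disordered_counts maps a sequence id to its number of disordered residues.
    Sequences are sorted by that count and dealt out like playing cards.
    """
    ordered_ids = sorted(disordered_counts, key=disordered_counts.get, reverse=True)
    partitions = [[] for _ in range(n_partitions)]
    for rank, seq_id in enumerate(ordered_ids):
        partitions[rank % n_partitions].append(seq_id)
    return partitions


if __name__ == "__main__":
    # Hypothetical sequence ids and disordered residue counts.
    counts = {"seqA": 9, "seqB": 7, "seqC": 5, "seqD": 3, "seqE": 1, "seqF": 0}
    for index, part in enumerate(round_robin_partitions(counts, n_partitions=2)):
        print(index, part)
```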

Neural network training

Artificial neural networks were trained on symmetric sequence windows centered at the position to be predicted. All neural networks were trained with a learning rate of 0.005. The size of these windows was systematically varied from 3 to 51 residues. For each window size the number of hidden units was varied in the range 5-50. The performance of every such combination was evaluated by five-fold cross validation and the best parameter combinations were selected based on ROC curves.
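A symmetric window centred on each residue can be extracted as sketched below. Padding the sequence ends with 'X' is an assumption; the text does not describe how positions near the termini were handled, nor how residues were encoded as network input.

```python
def sequence_windows(sequence: str, size: int, pad: str = "X"):
    """Return one symmetric window of odd length `size` per residue."""
    if size < 1 or size % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


if __name__ == "__main__":
    # Window of 3 residues around each position of a toy peptide.
    for centre, window in zip("ACDE", sequence_windows("ACDE", 3)):
        print(centre, window)
```

With a window of 19 residues, for example, each prediction would see 9 residues on either side of the central position.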

The best cross-validation performance for the coil data set was obtained using the filtered version where only consistently labelled regions of at least 7 residues were considered. The optimal network architecture was a window size of 19 residues and 30 hidden units. Other networks with good performance all had similar network architectures and gave very similar predictions. No significant improvement in performance could thus be obtained by forming a larger ensemble of networks.

Separate networks were trained for the task of discriminating between ordered and disordered hot loops. Possibly because of the small number of positive examples, networks with many hidden neurons performed no better than those with few. This was especially true when using large window sizes. The best performing ensemble of networks had a window size of 41 residues and only 5 hidden neurons.

For prediction of missing coordinates (REMARK465) it was discovered that networks with a window size of 9 residues performed best at relatively low sensitivities while networks with a window size of 21 performed better for higher sensitivities; for both window sizes, 30 hidden units gave the best performance. An ensemble was thus formed consisting of two sets of five cross validation networks with window sizes of 9 and 21 residues respectively. This ensemble outperformed both the individual cross validation ensembles over the entire range of sensitivities.

Conversion of network output to probability scores

For the coil and hot loops neural network ensembles, the score distributions of positive and negative test examples were estimated using Gaussian kernel density estimation. Based on these distributions a calibration curve for converting neural network output scores to probabilities was constructed as previously described [117]. For coil prediction a prior probability of 43% (the composition of the training data set) was used while a prior probability of 20% was used in the case of hot loop prediction. As the resulting calibration curves were essentially linear, they were approximated by a least squares linear fit (corr. >0.99).
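The idea behind such a calibration can be sketched as follows. This is an illustration under stated assumptions, not the procedure of [117]: the kernel bandwidth is fixed by hand, the two class densities are combined with the prior through Bayes' rule, and all names and sample scores are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from samples with Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def calibrated_probability(score, pos_density, neg_density, prior):
    """Posterior probability of the positive class for a network score."""
    positive = prior * pos_density(score)
    negative = (1.0 - prior) * neg_density(score)
    if positive + negative == 0.0:
        return prior  # no evidence either way, fall back to the prior
    return positive / (positive + negative)


if __name__ == "__main__":
    # Hypothetical network scores for disordered and ordered test residues.
    pos = gaussian_kde([0.7, 0.8, 0.9], bandwidth=0.1)
    neg = gaussian_kde([0.1, 0.2, 0.3], bandwidth=0.1)
    for score in (0.2, 0.5, 0.8):
        print(score, calibrated_probability(score, pos, neg, prior=0.20))
```

Evaluating this function over a grid of scores gives the calibration curve, which according to the text was close enough to a straight line to be replaced by a linear fit.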

Smoothing

We run a digital low-pass filter based on SAVITZKY-GOLAY (refer to section 14.8 in [177]) on the network output in order to smooth the curves. The filtering is performed by an external open source C module (sav_gol) from the TISEAN 2.1 [102] Nonlinear Time Series Analysis package (http://www.mpipks-dresden.mpg.de/~tisean/). The resulting smoothed functions are plotted using the DISLIN 8.0 package. DISLIN is distributed as platform specific binaries from http://www.linmpi.mpg.de/dislin/.
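The smoothing itself is done by the sav_gol module of TISEAN; the sketch below is not that code. It only illustrates the principle of Savitzky-Golay filtering using the standard 5-point quadratic coefficients (-3, 12, 17, 12, -3)/35. The window length, the polynomial order and leaving the two points at each end untouched are all assumptions made for this example.

```python
# 5-point quadratic/cubic Savitzky-Golay smoothing coefficients (sum to 35).
SG_SMOOTH_5 = (-3, 12, 17, 12, -3)


def savitzky_golay_smooth(values):
    """Smooth a list of numbers; the two points at each end are copied as-is."""
    smoothed = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        smoothed[i] = sum(c * v for c, v in zip(SG_SMOOTH_5, window)) / 35
    return smoothed


if __name__ == "__main__":
    # Hypothetical noisy per-residue scores.
    noisy = [0.10, 0.30, 0.15, 0.40, 0.35, 0.60, 0.50, 0.80]
    print(savitzky_golay_smooth(noisy))
```

Unlike a plain running average, this filter reproduces straight lines and parabolas exactly, so peaks in the prediction curve are flattened less.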

Acknowledgements

This work was partly supported by EU grant QLRI-CT-2000-00127. All neural networks used for this study were trained using the HOW program by Prof. Søren Brunak. Lars Juhl Jensen is funded by the Bundesministerium für Forschung und Bildung, BMBF-01-GG-9817. Thanks to Sophie Chabanis-Davidson and Sara Quirk for commenting on this manuscript. Finally we are deeply grateful to FreeBSD.org, (bio)Python.org, PostgreSQL.org, Debian.org and Apache.org for fantastic open free software.


Chapter 7

Practical Disorder Prediction

Having established novel techniques for identifying unstructured segments in proteins entirely from sequence, we discuss how these tools can be used in various ways. Many functional sites depend on disorder-to-order and order-to-disorder transitions, i.e. conformational changes during binding. It seems that our methods can detect these ‘soft’ spots in proteins that are likely to be able to undergo such changes.

7.1 GlobPlotting

GlobPlot, presented in Paper II, was invented specifically for aiding the ELM project, however it proved to be of much wider interest [139]. From the beginning we wanted a graphical tool that could generate ‘easy to interpret’ plots of the tendency within the sequence for being unstructured or structured. The basis for GlobPlot was the RUSSELL/LINDING scale mentioned earlier in 6.5. The combination of ‘random coil’ and ‘secondary structure’ in the RUSSELL/LINDING scale enhanced the discrimination of the graphs and was the key factor in the success of this scale in detecting both disorder and globular packing.

GlobPlot is not intended as a competitor for secondary structure prediction, since it cannot give the same level of detail as one can obtain from a secondary structure prediction based on a multiple alignment. GlobPlot is an ab initio method, i.e. it requires only a single sequence rather than a multiple alignment, and can therefore be applied to novel sequences with no homologues. The basic algorithm behind GlobPlot is beautifully simple and very fast:

For each amino acid a, we have defined a propensity $P(a_i) \in \mathbb{R}$, see RUSSELL/LINDING in Figure 6.1. Given a protein sequence of length L, we define a sum function $\Omega$ as follows:

$$\Omega(a_i) = \sum_{i=1}^{L} P(a_i)$$

where $P(a_i)$ is the propensity for the i’th amino acid. This algorithm is different from the one in the paper that includes a logarithmic term, however we finally came to realise that the term was absolutely absurd and only had negative effects on the algorithm performance.
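As a minimal sketch of the sum function, reading it as a running (cumulative) sum over the sequence, the following can be used. The propensity values are hypothetical placeholders and not the RUSSELL/LINDING values of Figure 6.1; the function name is likewise only illustrative.

```python
from itertools import accumulate

# Hypothetical propensities for illustration only; the real RUSSELL/LINDING
# values are given in Figure 6.1 and are not reproduced here.
TOY_PROPENSITY = {"P": 0.5, "S": 0.3, "G": 0.2, "A": -0.1, "L": -0.4, "V": -0.5}

def running_sum(sequence, propensity):
    """Cumulative propensity sum: element k is P(a_1) + ... + P(a_k)."""
    return list(accumulate(propensity[aa] for aa in sequence))

# A toy peptide: the curve climbs over P, S, G and descends over A, L, V.
omega = running_sum("PSGALV", TOY_PROPENSITY)
```

Residues missing from the table raise a KeyError in this sketch; a real implementation would have to decide how to treat ambiguous or non-standard residues.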

The GlobPlot webserver plots the Ω function and the graphs are referred to as globplots. Before plotting Ω the digital smoothing algorithm Savitzky-Golay is used to reduce noise on the curve. The Savitzky-Golay function also calculates a numerical estimate of the first order derivative, which is used by the algorithm that finds segments of disorder or order (globdoms).
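To illustrate how one filter can provide both a smoothed curve and a first derivative, here is a small pure-Python sketch using the standard 5-point quadratic Savitzky-Golay coefficients. The window length and polynomial order actually used by the webserver are not stated here, so the 5-point quadratic kernel is an assumption, and this code is not the TISEAN sav_gol module.

```python
# Standard 5-point quadratic Savitzky-Golay kernels (unit spacing between points).
SMOOTH = (-3, 12, 17, 12, -3)   # divide by 35 -> smoothed value
DERIV = (-2, -1, 0, 1, 2)       # divide by 10 -> first derivative

def savgol5(values, coeffs, norm):
    """Apply a 5-point kernel to the interior points of `values`.

    Returns len(values) - 4 numbers, corresponding to positions 2 .. n-3;
    the two points at each end are skipped in this sketch.
    """
    return [
        sum(c * values[i + k - 2] for k, c in enumerate(coeffs)) / norm
        for i in range(2, len(values) - 2)
    ]

# A quadratic is reproduced exactly by a quadratic Savitzky-Golay filter.
curve = [x * x for x in range(7)]
smoothed = savgol5(curve, SMOOTH, 35)
slope = savgol5(curve, DERIV, 10)
```

Applied to a running sum, the sign of the derivative estimate indicates whether the curve is locally going ‘up hill’ or ‘down hill’.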

7.1.1 Analysing a globplot

Reading globplots is fairly easy but different from e.g. hydropathy plots, in that globplots are cumulative sum curves rather than derivative curves. As GlobPlot plots a running sum, the graph is analysed by looking at the slope. The numbers on the ordinate axis do not matter, they equal the running sum and we are only interested in whether a given segment of the graph is disordered or not. The latter is seen by the drop or rise in the curve because that is how the RUSSELL/LINDING scale works: negative values correspond to ordered residues and positive ones indicate disorder promoting amino acids.

[Figure 7.1 plot: Mucin 5B precursor (Mucin 5 subtype B, tracheobronchial) (High molecular weight salivary mucin MG1) (Sublingual gland mucin). Abscissa: residue number (0 to 5500); ordinate: integrated disorder propensity. SMART domains overlaid: signal_peptide 1-26, VWD 67-224, Pfam:TIL 329-386, VWC 388-456, VWD 415-579, VWC 755-820, VWC 858-918, VWD 885-1043, VWD 5005-5178, VWC 5355-5424, VWC 5464-5527, CT 5601-5683.]

Figure 7.1: Globplot of Human Mucin 5 protein (SWISS-PROT: MU5B_HUMAN). Most of this slime protein is highly disordered. Since GlobPlot is plotting a running sum of the disorder propensities, the graph is analysed by looking at the slope. The numbers on the ordinate axis do not really matter, it is the ‘up hill’ or ‘down hill’ tendency that one should read. Referring to Figure 6.1, disorder promoting propensities are positive, so ‘up hill’ on the graph is equivalent to disorder.

[Figure 7.2 plot: Abscissa: residue (0 to 450); ordinate: disorder propensity. Curve: Russell/Linding. Domains overlaid: LIM 147-200, LIM 209-263, low_complexity_region 289-304, low_complexity_region 322-337, HOX 367-429.]

Figure 7.2: GlobPlot of Apterous protein involved in development of the wing and imaginal discs in Fruit fly. The domain boundaries of the LIM and HOX domains can be rationalised by comparison to the SMART predictions.

We designed GlobPlot like this because we think it results in ‘profile like’ intuitive plots. In particular we wished to avoid a high variation curve such as the derivative curve. The globplot in Figure 7.1 is a good example of these ‘profile like’ curves: the plot of Mucin shows that the central part of the protein is predicted to be almost completely disordered by GlobPlot (using the RUSSELL/LINDING disorder definition), which is probably why this protein is so slimy.

Domain detection with GlobPlot is as easy as finding protein disorder since both features are shown in the plot. In order to help the user navigate and understand the plots, the webserver overlays the graph with any predicted SMART domains. In domain hunting situations one would look for ‘down-hill’ regions of the graph. As seen in Figure 7.2 GlobPlot can detect potential domains: notice the down-hill slope whenever a domain is found by SMART/Pfam. GlobPlot often detects additional sequence to be ordered; this is because SMART and Pfam only use the most conserved sequence part of a domain to generate their Hidden Markov Models for the domain. This indicates that GlobPlotting is useful for domain boundary definition.
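The ‘down-hill’ reading can be sketched as a search for stretches where the running sum decreases from residue to residue. This is only an illustration of the idea: the actual globdom algorithm works on the smoothed derivative, and its thresholds and minimum segment lengths are not given here, so the function name and the `min_length` parameter below are assumptions.

```python
def downhill_segments(curve, min_length=3):
    """Return (start, end) pairs, 0-based and inclusive, of maximal runs of
    positions at which the curve is lower than at the preceding position."""
    segments, start = [], None
    for i in range(1, len(curve)):
        if curve[i] < curve[i - 1]:
            if start is None:
                start = i                      # a down-hill run begins here
        else:
            if start is not None and i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the curve.
    if start is not None and len(curve) - start >= min_length:
        segments.append((start, len(curve) - 1))
    return segments

# Toy running sum: up hill, a four-residue descent, up hill again, one last dip.
toy_curve = [0, 1, 2, 1.5, 1.0, 0.5, 0.2, 0.6, 1.0, 0.8]
candidates = downhill_segments(toy_curve)
```

Swapping the comparison yields the ‘up hill’ (putatively disordered) stretches in the same way.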

The performance of GlobPlot encouraged us to refine our approach and predict disorder in a more traditional biocomputational manner by training artificial neural network predictors for the various definitions of disorder mentioned above.

7.2 Prediction of multiple types of disorder with DisEMBL

DisEMBL, presented in Paper III [138], is a computational tool for prediction of disordered/unstructured regions within a protein sequence. Currently DisEMBL contains 5 different predictors. Three are disorder predictors, based on Coils, REMARK465 and hot loops. Besides these the webserver allows for prediction of low-complexity regions using CAST [178]. This is relevant since many low-complexity regions are structurally disordered [235]. We also provide prediction of beta aggregation segments using the Serrano lab’s TANGO algorithm, refer to Paper IV. This latter predictor is included because the amyloid and protein aggregation community is using this service.

7.2.1 Differences between predictors in DisEMBL

The coils predictor is primarily used as a filter to set the requirement of disorder to be within coils predicted regions (refer to 6.1). DisEMBL is a highly accurate predictor, predicting more than 60% of hot loops with less than 2% false positives. In Paper III, we showed this correlation as well as a comparison of the correlations between our alternative definitions of protein disorder. Even though the predictors show significant correlations they are not redundant. Moreover, when analysing a protein using DisEMBL, the user should not assume that two predictors which agree with each other are more right than the third; this is simply not the case. The predictors are predicting different types of protein disorder.

7.2.2 Using DisEMBL

DisEMBL is freely available via a web interface (http://dis.embl.de) and can be downloaded for use in large scale studies. The web interface is fairly straightforward to use: the user can submit a sequence or enter the SWISS-PROT/SWALL accession (e.g. P08630) or entry code (e.g. HMG1_HUMAN). The server fetches the sequence and description of the polypeptide from an ExPASy server using Biopython.org software. The probability of disorder is shown graphically, as illustrated in Figure 7.3. The random expectation levels for the different predictors are shown on the graph as horizontal lines but should only be considered an absolute minimum. The default parameters are set for optimal prediction and should only be changed in rare cases. On-line documentation of the different settings is provided at http://dis.embl.de/help.html. If the query protein sequence is very long, >1000 residues, the user can download the predictions and use a local graph/plotting tool such as Grace or OpenOffice.org to plot and zoom in on the data. DisEMBL now includes a web applet for interactive plotting and zooming of the graphs. We are very grateful to Claus Jensen¹ for writing this applet code.

¹ Lars’ dad ;-)

[Figure 7.3 plot: Nonhistone chromosomal protein 6A. Abscissa: residue (0 to 90); ordinate: disorder probability (0.0 to 1.0). Predictors: Coils, Remark-465, Hot loops.]

Figure 7.3: Sample output from the DisEMBL web server, showing predictions for Yeast Nonhistone chromosomal protein 6A (High Mobility Group protein, SWISS-PROT: NHPA_YEAST). The green curve shows the prediction for missing coordinates (REMARK465), red shows the hot loop prediction and blue the coils prediction. The horizontal lines correspond to the random expectation level for each predictor; for coils and hot loops the prior probabilities were used, while a neural network score of 0.5 is used for REMARK465. From this plot it is seen that especially the N-terminal tail of the protein is predicted to be disordered. See Figure 7.4 for a mapping of the hot loop predictions on the structure of this protein.


Figure 7.4: DisEMBL hot loop predictions mapped on the NMR structure of Nonhistone chromosomal protein 6A (High Mobility Group protein, PDB: 1cg7 - model 1, SWISS-PROT: NHPA_YEAST). The predicted probabilities were mapped with a colour scale going from blue to red, where red corresponds to the most likely disordered regions and blue to ordered regions. The unstructured tail clearly shows the highest disorder scores, refer to Figure 7.3. Surface plot generated with VMD, see Table 9.1.

7.3 GlobPlot or DisEMBL

The GlobPlot algorithm is very simple and intuitive, which is appealing. Although it was originally designed for prediction of protein disorder, the RUSSELL/LINDING propensity scale functions as well for detection of domain boundaries, repeats and other globular features. The RUSSELL/LINDING scale and the SMART domain overlay feature are unique to GlobPlot. DisEMBL is more accurate than GlobPlot in coils prediction, which is related to the RUSSELL/LINDING scale. It furthermore provides the novel hot loop definition of disorder. The two methods complement each other in their approach to disorder prediction. In general we urge the user to submit his or her sequences to both tools.

Where DisEMBL is a true prediction engine, GlobPlot is not really a predictor, rather it simply shows the RUSSELL/LINDING values for the query protein. It is in that sense a blackbox since it cannot really be benchmarked or compared to other predictors. This is an issue, however we know from analysing thousands of sequences that it is an extremely powerful tool and many users find it very intuitive to use. It is good that there is still room for heuristic tools in bioinformatics.


7.4 Design of protein expression vectors

As mentioned earlier protein disorder is related to problems during protein expression, purification and crystallisation. Other tools such as TANGO (http://tango.embl.de) deal with protein beta aggregation, which is different from the disorder in solvated proteins which our tools predict, refer to Chapter 8.

We think that the identification of the potential disordered regions should provide a good basis for setting up expression vectors and/or comparing the data with obtained structural data. However, currently we cannot assess which of the definitions of disorder is most appropriate for design of protein expression vectors. Recently GlobPlot has been used to create expression vectors [Charlie Boyd - personal communication]. A paper is under revision showing strong correlation between DisEMBL predictions and an X-ray crystallographic study [confidential communication].

7.5 Bridging to prediction of linear motifs

Figure 7.5 shows what we call a ‘disorder profile’ for a linear motif. The figure shows how much order or disorder is observed for true positive (experimentally verified) instances of this functional site, compared to all false positives. The false positives are assumed to correspond to the total set of matches of the regular expression in a non-redundant SWISS-PROT. This can be assumed if the number of true positives is much smaller than the number of false positives (TP << FP). The figure shows that the true positives are much more ordered than the false positives and that the signal goes towards zero when we move away from the functional site. It is difficult to interpret this signal: is it sequence conservation or just the known instances (about 25) which are strange? It does, however, match well with the structure of this site, shown in Figure 7.6. A helix is observed next to the functional site. This is the situation when the linear module is bound to its interaction partner, the Rb protein. However one can easily imagine this helix to be longer and include the ordered sequence of the functional site when the complex is not formed.

Another example of this is the binding of PTAP (refer to 4.4 on page 27) to the UEV domain in Tsg-101 which creates ~30% more order in the N-terminal part of this domain, judging by intramolecular NOE, refer to [176, 226, 113].

7.6 Structural flexibility - order/disorder transitions

The classic paradigm in structural biology is that sequence encodes structure that then determines function. This should be refined to: sequence encodes structure but also non-structure, which might sound trivial but which the author thinks is a stronger message.


[Figure 7.5 plot: Disorder profile, LIG_RB: [LI].C.[DE]. Abscissa: residue (-40 to 40); ordinate: log(p(TP)/p(FP)) (-3 to 2). Curves: Hot loops, Remark465, Coils.]

Figure 7.5: Disorder profile for LIG_RB. The graph shows the log ratio between the average DisEMBL disorder vectors for the known instances (TP) and the background matches (~FP) of the same regular expression in a non-redundant SWISS-PROT. The colours correspond to the DisEMBL networks: REMARK465 (green), Coils (blue) and hot loops (red). A window of 50 residues before and after the centre of the ELM was used. An order signal of up to 20 fold can be observed around the ELM.
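The construction of such a disorder profile can be sketched as follows. The regular expression is the one shown for LIG_RB; everything else (function names, how windows running off the sequence ends are treated, the use of the natural logarithm) is an assumption made for illustration, and the per-residue scores are expected to come from a predictor such as DisEMBL.

```python
import math
import re

LIG_RB = re.compile(r"[LI].C.[DE]")

def motif_centres(sequence, pattern=LIG_RB):
    """0-based centre index of every match, including overlapping ones."""
    lookahead = re.compile("(?=(" + pattern.pattern + "))")
    return [m.start() + len(m.group(1)) // 2 for m in lookahead.finditer(sequence)]

def window(scores, centre, flank):
    """Scores from centre - flank to centre + flank, or None near the ends."""
    if centre - flank < 0 or centre + flank >= len(scores):
        return None
    return scores[centre - flank: centre + flank + 1]

def mean_profile(windows):
    """Position-wise average over equally long score windows."""
    return [sum(column) / len(column) for column in zip(*windows)]

def log_ratio_profile(tp_windows, bg_windows):
    """Position-wise log of mean(TP) / mean(background); assumes non-zero means."""
    tp, bg = mean_profile(tp_windows), mean_profile(bg_windows)
    return [math.log(t / b) for t, b in zip(tp, bg)]

# Toy usage: one match in a short sequence, and two tiny invented score sets.
centres = motif_centres("AALACADAA")
profile = log_ratio_profile([[0.1, 0.2], [0.3, 0.2]], [[0.4, 0.4], [0.4, 0.4]])
```

Negative values mean that the known instances score lower (more ordered) than the background matches at that position. Scores of exactly zero would need special handling before taking the logarithm.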

The author has several unexplained observations regarding disordered proteins that will be investigated further in the future. Why is it that many brain and synapse related proteins are predicted to be disordered? Some molecules like coiled-coil regions seem unstructured in sequence space but are indeed folded and behave as globular; is this an indication of the cooperativity in these molecules?

Some functional sites, especially those that are post-translationally modified, are observed to be folded in their unmodified form. This is the case for Tyrosine 416 in Src, refer to Figure 7.7, which is in a small helix pointing inwards to the core of a globular domain. How can the kinase ‘see’ or modify this residue? Clearly some sort of order/disorder transition or unfolding needs to take place. Interestingly our predictors seem to be able to pick this up. We often predict such sites at, or just below, the thresholds we normally set for discriminating order from disorder. These ‘shadow-land’ sequences are likely interesting for the study of folding and dynamics. They might even provide good data for looking for phosphorylation sites [112].

Figure 7.6: LIG_RB is a functional site responsible for interaction with retinoblastoma (Rb) family proteins. Rb proteins are known for repression of E2F proteins required for transcription of proteins important in the cell cycle. The figure shows the SV40 Large T antigen interacting with the ‘Rb pocket’ (PDB: 1gh6). Figure by Pål Puntervoll.

Figure 7.7: Tyrosine 416 in the proto-oncogene Src (PDB: 2src). In the non-phosphorylated structure the tyrosine is wrapped up in a helix and points inwards below the surface of the protein. This is based on an X-ray structure, so one should be careful putting too much emphasis on this picture, however some unfolding or order-disorder transition must take place in order for the kinase to bind and modify the tyrosine.


Part IV

Function & Disease


Chapter 8

Protein Disorder & Disease

Unfortunately linear protein space also has its disadvantages. Currently it is thought that amyloidosis in many major protein diseases such as Parkinson’s, Alzheimer’s, Huntington’s, BSE/CJD and Type II Diabetes is related to the occurrence of short linear aggregation motifs.

8.1 Protein Diseases - conformational diseases

The deposition of proteins in the form of amyloid fibrils and plaques is the characteristic feature of more than 20 degenerative conditions affecting either the central nervous system or a variety of peripheral tissues. These conditions include Alzheimer’s, Parkinson’s, Type II Diabetes and the prion diseases (e.g. BSE). It is now clear that the ability to form amyloid structures is a general property of polypeptides [65, 66] and is not confined to the proteins associated with these diseases. It has been found recently that aggregates of proteins not associated with amyloid diseases can hinder normal cell function to the same extent as the aggregates of proteins linked with specific neurodegenerative conditions. Moreover, the mature amyloid fibrils or plaques appear to be substantially less toxic than the pre-fibrillar aggregates that are their precursors [206], refer to Figure 8.1. In fact Mark Pepys has proposed that amyloids in some cases might function as molecular ‘vents’ reducing cellular stress by accumulating otherwise toxic aggregates in ordered fibres [Protein Society - 2003].

Moreover, for diseases such as Parkinson’s, Huntington’s and serpinopathies it is known that protein misfolding plays a role [194, 123, 21]. Thus analysis of the unstructured ensembles of polypeptides and the pathways to misfolded states is important for understanding these diseases. The unstructured states of proteins can be analysed in intrinsically disordered proteins (IDPs) [73]. It was therefore natural for us to investigate the relationship between protein disorder, aggregation and amyloid patterns.


Figure 8.1: Formation of early aggregates in protein misfolding can lead to disease. The toxicity of these aggregates seems to result from an intrinsic ability to hinder fundamental cellular processes by interacting with cellular membranes, causing oxidative stress and increases in free Ca²⁺ that eventually lead to apoptotic or necrotic cell death [65, 64, 122]. Figure taken from Dyson and Wright [73].

8.2 Aggregation, disorder and disease

In order to identify potential drug targets to prevent or cure diseases associated with amyloidosis, a molecular understanding of the driving forces involved in the formation of these organised assemblies rich in β-sheet structure is needed.

Protein aggregation was for a time thought to be an unspecific process caused by the formation of non-native contacts between protein folding intermediates. The evidence for this was the discovery of a large variation in morphologies of aggregates that were observed by techniques such as electron microscopy and atomic force microscopy [64, 66]. Recently it has been found spectroscopically that two major types of aggregates are observed: the ones that involve the formation of β-sheets and the ones with a native spectrum corresponding to α-helical conformations [64, 66, 185, 27]. The latter is typically observed in proteins with high coiled-coil propensity.

Both forms of aggregates can be converted to amyloid-like fibres which are enriched in the characteristic cross-β structure [64, 66]. It has, for instance, been shown that a coiled-coil conformation under ambient conditions could transform into amyloid fibrils at elevated temperatures [122].

It should be noted that formation of β-sheet aggregates does not always lead to formation of amyloid fibrils, however β-sheet formation is a necessary intermediate in fibril formation [58]. This again points to the importance of structural flexibility in these processes.

8.3 Disease and linear motifs

Amyloidogenic diseases are thought to be related to the occurrence of short linear motifs in unfolded regions. These motifs are important for the initiation of the formation of the amyloid fibres that cause great harm to the cellular environment, particularly in brain tissue. There are several proposed peptide models for these motifs and the structural context in which they occur is under investigation [162, 206, 65, 142]. This points to a complex situation where other features in the polypeptide, such as structural disorder, are important for toxicity. The reason why one may think this is that these linear aggregation motifs need to be exposed in order to catalyse aggregation. Luis Serrano and Manuela de la Paz found that some of the highly amyloidogenic sequences are less frequent in proteins than innocuous amino acid combinations and that, if present, they are surrounded by amino acids that disrupt their aggregating capability [142, 58]. In order to further investigate this we chose to look more thoroughly at IDPs and their tendency to aggregate. Our hypothesis was that if they need to stay in solution they might have evolved away from aggregation. Since many IDPs are related to diseases one can speculate that mutations that lead to aggregation prone regions can be important for understanding their role in these diseases.

Figure 8.2: Lactoferrin has been identified in amyloid deposits. The protein contains a highly amyloidogenic linear motif NAGDVAFV. Dobson and co-workers showed how this can lead to amyloid formation in the full length protein [162]. The figure is from the same paper.


Paper IV

In this paper we showed, by computational analysis of Intrinsically Disordered Proteins, that this class of proteins has significantly fewer aggregation prone regions than globular proteins. This is likely because there is a selective pressure to keep them in solution.

8.4 Beta Aggregation and Protein Disorder: Unfolded with Reason?

Rune Linding, Joost Schymkowitz, Frederic Rousseau, Francesca Diella and Luis Serrano

European Molecular Biology Laboratory, Programme for Structural and Computational Biology, Meyerhofstr 1, D-69117 Heidelberg, Germany

Flemish Interuniversity Institute for Biotechnology, SWITCH laboratory, Vrije Universiteit Brussel, Pleinlaan 1050, Brussels, Belgium

Contributed equally.

Corresponding author.

A growing number of proteins is being identified that are biologically active though intrinsically unstructured, in sharp contrast with the classic notion that proteins require a well-defined three dimensional structure in order to be functional. Recent work has shown that specific sequences of amino acids in a polypeptide chain initiate the process of β-sheet aggregation and amyloid formation. Since aggregation prone sequences would be almost exclusively solvent exposed in unstructured/disordered segments, proteins containing aggregation prone sequences in unstructured regions would have solubility problems. The latter suggestion contrasts with the large number of unstructured proteins identified especially in higher eukaryotes and the fact that these proteins are often resistant to harsh treatments such as boiling. We have used the algorithm TANGO (http://tango.embl.de) to compare the β-aggregation tendency of globular proteins and intrinsically disordered proteins, including those predicted by the algorithms DisEMBL (http://dis.embl.de) and GlobPlot (http://globplot.embl.de). Our data clearly show that aggregation prone regions are generally buried inside structured proteins/regions. Disordered regions on the other hand show a much lower tendency to aggregate, suggesting that they have evolved away from β-sheet aggregation.

Introduction

Protein aggregation has long been thought of as an unspecific process caused by the formation of non-native contacts between protein folding intermediates. Recent work however points to an unexpected simplicity underlying the aggregation process [66, 42, 41]. Spectroscopically, two major types of aggregates are observed: species that are enriched in β-sheet [66] and species that retain the native spectrum [27, 184]. In this work we exclusively deal with the former. Several models were postulated for β-sheet aggregation that all involve the formation of an intermolecular β-sheet initiated by specific amino acid sequences that act as nuclei for β-sheet aggregation [24, 197, 23, 195, 196]. This is confirmed by the experimental finding that in many proteins, aggregation occurs during refolding or under conditions in which denatured or partially folded states are significantly populated, i.e. high concentration, destabilizing conditions or mutations [64]. If this is the general case, then we will expect that aggregating regions in proteins will be hidden in the folded state and not exposed to the medium where they can readily initiate aggregation. Based on the work of Chiti et al. [42, 41] that revealed the importance of simple factors such as hydrophobicity and charge on β-sheet aggregation, we recently developed the computer programme TANGO (Escamilla, AM, FR, JS and LS, submitted for publication) to predict β-sheet aggregation regions in proteins based on a statistical mechanics algorithm that considers different competing conformations: β-turn, α-helix, β-sheet aggregates and the folded state. The algorithm is based on the assumption that in the ordered β-sheet aggregates the nucleating regions will be fully buried, paying maximal desolvation energy as well as entropy, while satisfying their H-bonding potential. The energy contributions are derived from the FOLD-X force field [93]. In a blind test involving 176 peptides from over 20 proteins, TANGO achieved an accuracy of 95% in predicting aggregating sequences, as well as the effect of point mutations on the aggregation tendency of proteins (Escamilla, AM, FR, JS and LS, submitted for publication).

Many Intrinsically Disordered Proteins (IDPs) have been discovered in all kingdoms of life, but especially in higher eukaryotes [72, 71, 234]. These are proteins that are unstructured under physiological conditions but that have similar biological activity to globular folded proteins [74, 215]. More than 180 such proteins are known to date, including Prions, CREB, Tau, MAPs and p53 [215, 219]. These disordered polypeptides perform important regulatory functions and are widespread in eukaryotic cells and tissue. Some acquire structure upon post-translational modification or upon ligand-binding, others act as structural anchors in large protein-protein and protein-RNA complexes making use of extended interaction surfaces that are simply not available in more compact conformations [69]. Furthermore, many globular proteins contain disordered segments that act as functional modules, e.g. post-translational modification sites and domain ligands. Such functional sites, known as linear motifs, are catalogued and predicted by ELM, http://elm.eu.org, [179].

Some groups have suggested that disordered regions could be involved in protein misfolding & aggregation [194, 123, 21]. However, it is often found experimentally that unstructured proteins are resistant to aggregation, even under harsh treatments such as incubation at high temperature. In fact, heat-exposure of cell-extracts is an effective protocol for purification of several recombinantly expressed unstructured proteins [215]. Counterintuitively, it is the globular proteins that aggregate and not the unstructured proteins. If unstructured/disordered proteins or segments are to thrive in the crowded environment of the eukaryotic cell, they should avoid aggregation prone regions and it is expected that they will have very low aggregation tendencies when compared to globular proteins.

To test this hypothesis we have used the algorithm TANGO to compare the aggregation tendency of a set of globular proteins derived from the SCOP database [9], a set of proteins that were experimentally shown to be unstructured [215, 219] as well as a set of predicted disordered protein sequences. Data sets of experimentally verified disordered proteins are scarce and rather error-prone. We have collected and curated a set of 296 redundant sequences which is to our knowledge the largest data set available to the community. The data sets of predicted disordered segments or proteins were predicted by the DisEMBL [138] and GlobPlot [139] algorithms and divided into sequences of low (~50%) and average sequence complexity.

Results & Discussion

TANGO performance: what TANGO score is a good predictor of aggregation?

The TANGO algorithm was calibrated using data found in the literature on the aggregation of 174 peptides corresponding to sequence fragments of 21 different proteins (Escamilla, AM, FR, JS and LS, submitted). 70 out of the 151 peptides in our set were experimentally observed to aggregate in the concentration range between 100 µM and 1 mM, while the remainder were monomeric in this range. A detailed description of our results together with the source of the experimental data is given in Fernandez-Escamilla et al. (Escamilla, AM, FR, JS and LS, submitted). For each residue in a peptide TANGO returns its percentage occupancy of the β-aggregation conformation. When we consider peptides to have some aggregation tendency if they possess a segment of at least 5 consecutive residues, each populating the β-aggregated conformation for more than 5%, the overall hit rate of TANGO reaches 97% compared with our experimental set. Consequently, our guideline of 5 consecutive residues with a TANGO score above 5% is a good predictor of aggregation, independent of the size of the protein under study.
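The guideline above (at least 5 consecutive residues, each with a TANGO score above 5%) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-residue scores are taken to be already available as a list of percentages, and the function names are hypothetical and not part of TANGO.

```python
from typing import List, Tuple

def aggregation_prone_segments(scores: List[float],
                               threshold: float = 5.0,
                               min_length: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for every run of at least
    `min_length` consecutive residues that all score above `threshold`."""
    segments = []
    run_start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if run_start is None:
                run_start = i          # a new run begins here
        else:
            if run_start is not None and i - run_start >= min_length:
                segments.append((run_start, i))
            run_start = None           # the run (if any) is broken
    # A run may extend to the last residue.
    if run_start is not None and len(scores) - run_start >= min_length:
        segments.append((run_start, len(scores)))
    return segments

def has_aggregation_tendency(scores: List[float]) -> bool:
    """True if the sequence contains at least one aggregation-prone segment."""
    return bool(aggregation_prone_segments(scores))

# Hypothetical per-residue scores (percent), for illustration only.
example_scores = [0, 1, 12, 30, 45, 40, 22, 3, 0, 8, 9, 0]
print(aggregation_prone_segments(example_scores))
```

In this toy input only the five-residue run counts; the two-residue run near the end is too short to qualify.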


Overall aggregation tendency of globular and disordered proteins

Figure 8.3 shows the number of aggregation-prone regions (defined as being longer than 5 amino acids and having an aggregation tendency of more than 5%) per 1000 amino acids. Globular proteins have markedly more aggregating regions than IDPs or unstructured regions in proteins: there are around 20 aggregating windows per 1000 residues in globular proteins, whereas this number lies around 9 in IDPs. Moreover, the TANGO scores of the experimentally tested IDPs and the IDPs predicted by DisEMBL/GlobPlot are very similar, indicating that TANGO could be used in conjunction with other software to predict intrinsically disordered proteins (http://dis.embl.de provides a webserver integrating the TANGO and DisEMBL [138] algorithms). This is also exemplified in Figure 8.4, where we show that 70% of IDPs do not have any residue with an aggregation tendency above 5%, while only 20% of the proteins in the SCOP set fulfil the same criterion.
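The two dataset-level summaries used here can be written down as a small Python sketch. It assumes each protein is represented by a list of per-residue TANGO percentages, and it uses the calibration guideline of at least 5 consecutive residues as the region definition; the function names and the toy data are hypothetical.

```python
from typing import List

def count_regions(scores: List[float], threshold: float = 5.0,
                  min_length: int = 5) -> int:
    """Count runs of at least `min_length` consecutive residues above `threshold`."""
    count, run = 0, 0
    for score in scores:
        if score > threshold:
            run += 1
        else:
            if run >= min_length:
                count += 1
            run = 0
    if run >= min_length:      # a run that reaches the end of the sequence
        count += 1
    return count

def regions_per_1000(dataset: List[List[float]]) -> float:
    """Aggregation-prone regions per 1000 residues, pooled over a dataset."""
    total_regions = sum(count_regions(scores) for scores in dataset)
    total_residues = sum(len(scores) for scores in dataset)
    return 1000.0 * total_regions / total_residues

def fraction_without_aggregating_residues(dataset: List[List[float]],
                                          threshold: float = 5.0) -> float:
    """Fraction of proteins with no residue scoring above `threshold`."""
    clean = sum(1 for scores in dataset if all(s <= threshold for s in scores))
    return clean / len(dataset)

# Toy dataset of four 50-residue proteins (scores are made up).
toy_set = [
    [0.0] * 50,
    [0.0] * 20 + [50.0] * 6 + [0.0] * 24,
    [10.0] * 5 + [0.0] * 45,
    [0.0] * 50,
]
print(regions_per_1000(toy_set))
print(fraction_without_aggregating_residues(toy_set))
```

Pooling residues over the whole set, rather than averaging per protein, is one possible reading of "per 1000 amino acids"; the text does not say which was used.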

Statistical considerations

In order to exclude a bias in our results due to the difference in the number of proteins in the globular and IDP sets, or due to the average sequence length of IDPs and globular proteins, 25 control sets were generated by randomly selecting 10000 sequence fragments of 20 amino acids from each original set. These were analysed in an identical manner to the original data sets, and the standard deviation between the results of the original and control sets was calculated.

Figure 8.3: Comparison of aggregation tendency in globular and unstructured proteins (IDPs) - overview. For details, see Table 8.1.


Figure 8.4: Distribution of aggregation tendency in datasets: number of residues that have a TANGO score larger than 5%, per 1000 residues. Upper panel - 70% of unfolded regions in experimentally validated IDPs do not contain aggregation-prone regions. Lower panel - only 20% of SCOP proteins do not contain aggregation-prone stretches.

The standard deviation was below 1 window per 1000 residues, indicating that our results are not strongly influenced by the composition and size of the sets. The sequence sets for IDPs and globular proteins were both stringently similarity-reduced to a maximum of 40% similarity between any pair of sequences within the set.
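The control-set procedure can be sketched as follows. This is a minimal illustration, not the code used for the thesis: it assumes fragments are drawn by first picking a sequence uniformly and then a start position uniformly (the text does not specify the sampling scheme), it reports the spread across control sets as one reading of the standard deviation described above, and `statistic` is a hypothetical callable standing in for running TANGO on the fragments and counting windows per 1000 residues.

```python
import random
import statistics
from typing import Callable, List, Optional, Tuple

def sample_fragments(sequences: List[str], n_fragments: int = 10000,
                     length: int = 20,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Draw `n_fragments` random fragments of `length` residues."""
    rng = rng or random.Random()
    eligible = [s for s in sequences if len(s) >= length]
    fragments = []
    for _ in range(n_fragments):
        seq = rng.choice(eligible)
        start = rng.randrange(len(seq) - length + 1)
        fragments.append(seq[start:start + length])
    return fragments

def control_spread(sequences: List[str],
                   statistic: Callable[[List[str]], float],
                   n_sets: int = 25, n_fragments: int = 10000,
                   length: int = 20, seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of `statistic` over `n_sets` control sets."""
    rng = random.Random(seed)
    values = [statistic(sample_fragments(sequences, n_fragments, length, rng))
              for _ in range(n_sets)]
    return statistics.mean(values), statistics.stdev(values)
```

A small standard deviation across such control sets, relative to the difference between the globular and IDP sets, is what supports the claim that set size and sequence length do not drive the result.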

Aggregation tendency in proteins of different structural classes

In Table 8.1 we show the aggregation tendency for globular proteins of different fold classes. Interestingly, we do not find a significant difference between fold classes, including membrane proteins. Multidomain proteins show a slightly smaller number, suggesting that the linker regions could be disordered and thus non-aggregating. The fact that we


Dataset                                             Aggregation-prone regions   Number of protein
                                                    per 1000 residues           sequences in dataset

Globular proteins (ASTRAL40)
  All alpha proteins                                18                          29
  All beta proteins                                 20                          174
  Alpha and beta proteins (a/b)                     18                          190
  Alpha and beta proteins (a+b)                     19                          162
  Membrane and cell surface proteins                16                          11
  Multi Domain proteins                             13                          25

Experimentally validated Intrinsically Disordered Proteins
  IDPs, experimentally validated                    7                           183
  IDPs, experimentally validated, low complexity    4                           27

Predicted Intrinsically Disordered Proteins
  IDPs defined by Russell/Linding                   6                           602
  IDPs defined by Russell/Linding, low complexity   5                           60
  IDPs predicted by Hot-loops                       8                           7925
  IDPs predicted by Hot-loops, low complexity       3                           59
  IDPs predicted by Coils/Loops                     8                           16988
  IDPs predicted by Coils/Loops, low complexity     4                           74
  IDPs predicted by Remark-465                      6                           623
  IDPs predicted by Remark-465, low complexity      4                           139

Table 8.1: Comparison of the aggregation tendency in various datasets, see methods for details on the dataset composition.


cannot distinguish membrane proteins, which contain large hydrophobic transmembrane segments, from cytoplasmic ones is expected from the fact that the aggregation tendency predicted by TANGO does not rely only on hydrophobicity but considers that entropy, H-bonding and electrostatics play an equally important role in β-sheet aggregation (Escamilla, AM, FR, JS and LS, submitted).

Aggregation tendency in IDPs predicted by different methods

We earlier introduced two novel methods, GlobPlot [139] and DisEMBL [138], for prediction of IDPs ab initio, i.e. entirely from sequence. This work was mainly done to support the prediction of Eukaryotic Linear Motifs (ELMs, http://elm.eu.org, [179]). However, these methods are also being extensively used by the structural biology community to help the design of expression vectors. Disordered regions are often associated with problems during purification and crystallization of globular proteins. This is supported by the fact that low-complexity disorder and IDPs are under-represented in PDB [235]. Thus DisEMBL and GlobPlot can point to regions in the sequence that should be avoided. As indicated above, TANGO seems to distinguish between globular proteins and IDPs based on aggregation tendency alone. It is therefore interesting to compare the aggregation tendency of the IDPs predicted by different methods (using alternate disorder definitions, Table 8.1). As in the case of the globular proteins, we do not find any significant differences. The only consistent and expected result is that low-complexity sequences always score slightly lower, whichever method is used. However, the differences between low-complexity sequences and IDPs of normal sequence complexity are small, indicating that unstructured proteins avoid β-sheet aggregation through specific amino acid sequences and not just by low complexity. In other words, our analysis shows that there is a clear selective pressure against aggregation-prone regions in unstructured/disordered polypeptide stretches.

Aggregation tendency in globular proteins and solvent accessibility

As discussed above, globular domains have a significant number of aggregating regions on average, while IDPs have a much lower number. Since IDPs should be fully or almost completely solvent exposed, one could expect that the majority of the aggregating regions in globular proteins should be hidden from solvent and involved in secondary structure elements, thus avoiding aggregation. In Figure 8.5 we show the solvent accessibility of all the aggregation-prone regions found in our globular database. It is quite clear that, with few exceptions, aggregation-prone regions have a low solvent accessibility. This means that in globular proteins buried regions may aggregate only if the protein becomes unstable or if it has a long lifetime that increases the probability of unfolding


Figure 8.5: Histogram showing that the average solvent accessibility of aggregation-prone regions is low. The solvent accessible surface area of the aggregation-prone regions (defined as segments of 5 consecutive residues with a TANGO score of more than 5% per residue) in the SCOP [9] dataset was calculated with the program DSSP [108].

events. Thus mutations that destabilize the protein, or faulty chaperone or degradation machinery in the cell, can reveal the hidden aggregation tendency of globular proteins in vivo.

Conclusions

We have shown that intrinsically disordered proteins are significantly less prone to aggregation than globular proteins. This result indicates that these proteins are selected for low aggregation tendency and that this is not achieved by low-complexity sequences. In globular proteins aggregation-prone regions are normally buried, which implies that these proteins are only at risk of aggregating in partly folded states and in the absence of chaperones. Recently it has been proposed that some IDPs are ‘junk proteins’ genetically stabilized by the low complexity of their sequences [144], but our findings do not support this hypothesis. Rather, our observations suggest that unfolded proteins have evolved for the functional advantages discussed in the introduction, with the added value that they are not prone to aggregate, an important consideration in complex genomes. Moreover, as seen from Table 8.1, most verified IDPs are not of low complexity.


Materials & Methods

Datasets

In this study we have used datasets that cover globular proteins and IDPs. Both predicted and experimentally verified datasets are described, and each dataset has been split into a low- and a normal-complexity set.

Experimentally verified IDPs:

Databases of intrinsically disordered proteins are error-prone and rather scarce. We have built a curated and expanded database of 296 experimentally verified proteins. This dataset was annotated from the literature, starting from [215, 219] and from Dunker lab data (http://divac.ist.temple.edu/disprot). The 40% non-redundant subset consists of 183 non-homologous sequences. Redundancy reduction was done with the CD-HI algorithm [135]. All disordered segments were mapped onto the SWISS-PROT [87] database to get full-length chains. This ensures a more precise disorder/order residue count within the set. This dataset is available at http://dis.embl.de/datasets/idp40.fas.gz.

Predicted IDPs:

No commonly agreed definition of protein disorder exists. DisEMBL [138] and GlobPlot [139] provide four different definitions of disorder. The DisEMBL disorder definitions are known as Coils/Loops, Remark-465 (missing X-ray data) and Hot-loops. GlobPlot's primary disorder definition is known as Russell/Linding disorder [139, 138]. We have predicted IDPs according to each of these four definitions. In all cases a minimum disorder length of 50 consecutive residues was used to identify putative IDPs. All the predictions were done on a 40% similarity-reduced human proteome set. The proteome was based on the EBI (http://www.ebi.ac.uk/proteome/) proteome, cross-checked with the mouse predictions to remove erroneous gene predictions.

Globular proteins:

To represent globular proteins we used the Astral 1.63 40% subset of SCOP [9, 39].

Low-complexity

All sequences were subsequently classified according to low complexity as determined by CAST [178]. Only the segments predicted to be disordered by DisEMBL or GlobPlot were analysed with CAST. If at least one disordered segment contained > 50% low-complexity residues, the sequence was categorised as a low-complexity IDP.
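The classification rule above is simple enough to state as code. The sketch below assumes that the disordered segments are given as (start, end) index pairs with an exclusive end, and that the low-complexity annotation is available as one boolean per residue; both representations and the function name are assumptions for illustration, not the output formats of DisEMBL, GlobPlot or CAST.

```python
from typing import List, Sequence, Tuple

def is_low_complexity_idp(disordered_segments: List[Tuple[int, int]],
                          low_complexity_mask: Sequence[bool],
                          cutoff: float = 0.5) -> bool:
    """True if at least one disordered segment has more than `cutoff`
    of its residues flagged as low complexity."""
    for start, end in disordered_segments:
        window = low_complexity_mask[start:end]
        if window and sum(window) / len(window) > cutoff:
            return True
    return False

# Toy 20-residue protein: residues 10-17 are flagged as low complexity.
mask = [False] * 10 + [True] * 8 + [False] * 2
print(is_low_complexity_idp([(0, 10), (10, 20)], mask))
print(is_low_complexity_idp([(0, 10)], mask))
```

Note that the comparison is strict, so a segment with exactly half of its residues flagged is not counted as low complexity.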


The TANGO model

The model used by the TANGO algorithm is designed to predict β-aggregation in peptides and denatured proteins and consists of a phase space encompassing the random coil and the possible structural states: β-turn, α-helix, β-sheet aggregation and, in the case of globular proteins, the folded state. Every segment of a peptide can populate each of these states (except the folded state, which encompasses the whole sequence) according to a Boltzmann distribution, i.e. the frequency of population of each structural state for a given segment will be relative to its energy. Therefore, to predict β-aggregating segments of a peptide, TANGO simply calculates the partition function of the phase space. Here we give an overview of the main features of TANGO; a detailed description of the algorithm can be found in Fernandez-Escamilla et al. (Escamilla, AM, FR, JS and LS, submitted).
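To make the Boltzmann picture concrete, the sketch below computes the population of a handful of competing states for a single segment from their free energies. It is a toy illustration of the principle only: TANGO's actual partition function sums over all segments and windows of the sequence, and the state names, energies (kcal/mol, relative to the random coil) and temperature used here are made up.

```python
import math
from typing import Dict

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def boltzmann_populations(energies: Dict[str, float],
                          temperature: float = 298.0) -> Dict[str, float]:
    """Fractional population of each state, given its free energy."""
    weights = {state: math.exp(-energy / (R_KCAL * temperature))
               for state, energy in energies.items()}
    partition_function = sum(weights.values())
    return {state: w / partition_function for state, w in weights.items()}

# Hypothetical energies for one segment; the random coil is the reference.
segment_energies = {
    "random_coil": 0.0,
    "beta_turn": 1.0,
    "alpha_helix": 0.5,
    "beta_aggregate": -1.0,
}
populations = boltzmann_populations(segment_energies)
for state, fraction in populations.items():
    print(f"{state}: {100 * fraction:.1f}%")
```

The per-residue percentage that TANGO reports corresponds, in this picture, to the summed population of all β-aggregated states that include the residue.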

α-Helical propensities

The parameters used in the latest version of AGADIR (AGADIR-1s, [132]) have been used to determine the helical propensity of the amino acid sequences. The only modification has been the implementation of a two-window approximation (Escamilla, AM, FR, JS and LS, submitted).

β-Turn propensities

β-turn propensity is calculated by considering three energy contributions (see Methods & Supplementary Material): (1) an amino acid specific cost in conformational entropy for fixing that residue in a β-turn compatible conformation, (2) interactions of each amino acid with the turn structure in a position-dependent manner, and (3) a single H-bond between the main chains of residues i and i+3 of the turn. We have only considered the 4 types of turns for which we could obtain significant statistical data: Types I, I’, II and II’.

β-sheet aggregation

To estimate the aggregation tendency of a particular amino acid sequence, we have made the following assumptions: (1) In an ordered β-sheet aggregate the main secondary structure is the β-strand. (2) The regions involved in the aggregation process are fully buried, thus paying full solvation costs and gains and full entropy, and optimizing their H-bond potential (i.e. the number of H-bonds made in the aggregate is related to the number of donor groups that are compensated by acceptors; an excess of donors or acceptors remains unsatisfied). (3) Complementary charges in the selected window establish favourable electrostatic interactions, and overall net charge of the peptide inside and outside the window disfavours aggregation.


The effect of physico-chemical conditions on aggregation

The effect of pH, temperature and ionic strength on electrostatic interactions was taken into account as described in AGADIR2-1s [132]. Similarly, the dependence of entropy, H-bonds and hydrophobic interactions on temperature and ionic strength is taken into consideration as described in AGADIR2-1s [132].

Assumptions

We have opted for a two-window sampling approximation, which assumes that the probability of finding more than two ordered segments in the same polypeptide chain is too low to be considered (the simple one-window approximation will deviate too much from reality for peptides with > 50 residues). Further, we assume there is no energetic coupling between the two windows. Finally, we do not consider aggregation intermediates. This means that we consider aggregates as a single molecular species or structural state in competition with the folded protein, the β-turn and α-helical conformations. One simplification in this approximation is that we do not consider β-hairpins as aggregating motifs, the reason being that there is not yet a satisfactory algorithm to predict β-hairpin stability. Thus, when running TANGO over the above datasets we looked at the probability of finding an amino acid stretch having 5% or more aggregation tendency, as well as the overall aggregation tendency, considered as the sum of all prediction numbers above 5% divided by the protein length. The former assumes that non-aggregating sequences will have few aggregation-prone amino acid stretches, and the latter takes into consideration the possibility that a protein could aggregate even if it has only one strongly aggregating segment in the whole protein.
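The second measure, the overall aggregation tendency, follows directly from its definition. In the sketch below the per-residue scores are assumed to be percentages, and residues are counted when they score strictly above 5%; the text uses both "5% or more" and "above 5%", so the strict comparison is an assumption, as is the function name.

```python
from typing import List

def overall_aggregation_tendency(scores: List[float],
                                 threshold: float = 5.0) -> float:
    """Sum of all per-residue scores above `threshold`, divided by the
    protein length."""
    if not scores:
        return 0.0
    return sum(s for s in scores if s > threshold) / len(scores)

# One strongly aggregating segment in an otherwise clean 100-residue protein.
single_hot_spot = [0.0] * 40 + [90.0] * 6 + [0.0] * 54
# Many weakly scoring residues, none above the threshold.
diffuse_low = [4.0] * 100

print(overall_aggregation_tendency(single_hot_spot))
print(overall_aggregation_tendency(diffuse_low))
```

The comparison illustrates the point made in the text: a protein with a single strong segment still receives a non-zero overall tendency, whereas uniformly low scores contribute nothing.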

Acknowledgements

This work was partly supported by EU grant QLRI-CT-2000-00127. Thanks to Sara Quirk for reading this manuscript. We are grateful to Lars Juhl-Jensen for the human proteome set.


Chapter 9

Composite Protein Function Models

With more tools at hand for finding linear functional modules, I propose a new composite and modular model for describing protein function. This model will integrate both globular domain modules and linear modules in an attempt to describe function as a result of higher order relationships between these modules. This is hypothesis formulation in progress.

9.1 The motivation and foundation

I propose to expand the ‘Modular Protein Function Model’ to a ‘Composite Modular Function Model’ that will be described in draft manuscript II. The model is based on linear and globular modules (ELMs and Pfam/SMART modules). The author thinks it is important to formalise and classify physical interactions in order to help clear up the picture from network-based studies. Moreover, perhaps one can use abstract relationships between functional modules to infer protein function.

The modular model, described in 1.1 on page 3, has been used to describe a wide range of aspects of protein function [2, 32, 68, 78, 101, 170, 33, 191]. However, until now this has not been formalised and explored in depth; this is the aim of the work presented in Draft manuscript II, which is in a preliminary state.

In my model, polypeptides are described as graphs where nodes are modules and edges correspond to cis-relationships, see below. Polypeptides interact through physical trans-interactions that result in trans-relations in a function space. The basic functional units in proteins are linear and globular modules, and these come together in function space as vectorial collections. The vector is defined by the order of occurrence from N- to C-terminus. The linear modules can be distinguished from the globular ones by their co-linearity in sequence and structure space. These basic functional building blocks are combined in a polypeptide chain and define the composite functions of a given protein.
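One possible way to write this description down as a data structure is sketched below. It is only a sketch of the idea, not a format defined in the thesis: the class and field names are hypothetical, the module labels for Src are illustrative, and the residue coordinates are placeholders rather than real positions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class Module:
    name: str    # module label, e.g. a domain or linear motif name
    kind: str    # "linear" or "globular"
    start: int   # placeholder position, used only to order the modules
    end: int

@dataclass
class Polypeptide:
    name: str
    modules: List[Module] = field(default_factory=list)                # graph nodes
    cis_relations: List[Tuple[str, str]] = field(default_factory=list)  # graph edges

    def module_vector(self) -> List[str]:
        """Module names in order of occurrence from N- to C-terminus."""
        return [m.name for m in sorted(self.modules, key=lambda m: m.start)]

# Illustrative Src-like chain; coordinates are placeholders, not real data.
src = Polypeptide(
    name="Src",
    modules=[
        Module("Kinase", "globular", 300, 500),
        Module("SH2_ligand", "linear", 520, 526),
        Module("SH3", "globular", 80, 140),
        Module("SH2", "globular", 150, 250),
    ],
    cis_relations=[("SH2", "SH2_ligand")],
)
print(src.module_vector())
```

Trans-relations would then be edges between modules of different `Polypeptide` objects, which keeps the cis graph of each chain separate from the interaction network between chains.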


Modules from different polypeptide chains can come together in function space, mediated by transient physical interactions and protein complex formation. This will increase the functional complexity of the proteins, and every module in a physical interaction network is in a functional relationship with other modules in other polypeptides or in the same one.

9.2 Interaction connects functions

A functional unit is a set of sequence or structure elements that is sufficient for interaction with another molecule. It follows that function can only be based on physical interactions with e.g. polypeptides or small compounds. Structural proteins often function by polymerisation of monomers based on domain-domain packing. The model I am proposing is focused much more on cellular signaling networks. Figure 9.1 shows the different types of interactions and relations that are the framework for describing protein function in the model.

9.2.1 Physical interaction classification

As in many other protein function models, physical interactions are the ‘framework’ that relates the functional modules to each other.

Figure 9.1: The new model for protein function proposed. Physical interactions drive everything. They are classified in ELM-domain or domain-domain types of interactions. Functional relations are formed between functional modules because of these interactions. Two types of functional relations are described, cis within the same molecule or trans due to physical interactions.


Cis ELM-domain interaction

Intra-chain interactions are known for e.g. SH2 and SH3 ligands in Src. Often motifs that function intramolecularly are slightly different from the ones that work in trans; this is likely due to the reaction kinetics of intra-chain interactions. This type of interaction sometimes leads to a cis-trans switch. In Src the SH2 cis-binding can, after phosphorylation, be replaced by a trans-interaction with the PDGF receptor [145].

Trans ELM-domain interaction

This is the most typical interaction of linear modules. It forms a very large regulatory network within the cell. Many signaling pathways are based on this type of interaction [170].

Trans domain-domain interaction

The majority of published interaction networks are of this type [91, 89, 4]. This is to a certain extent because current techniques such as affinity purification, pull-downs and, to a lesser extent, two-hybrid screens favour this type of interaction. These interactions are typically more sticky (higher affinity) or non-transient in nature than the ones based on linear modules. However, we have recently extracted some interaction networks based on interactions of linear modules with globular domains, as shown in Draft II.

ELM-ELM interactions?

We do not know of these yet, and they would not fit well with our definitions. However, we wonder whether some of the longer proline-rich sequences in the cell could form non-specific aggregates which might approximate this class, e.g. in Collagen.

9.2.2 Cis/Trans relations - abstract function networks

Relations are based on physical interactions. They can be interpreted as interactions between functional modules in an abstract function space. In other words, they are associations between molecular functions derived from interactions between the polypeptides that encode them.

Cis ELM-ELM relation

The MOD_CDK phosphorylation site is dependent on a LIG_CYCLIN in the same polypeptide. This is a very powerful type of positive evidence in the prediction of linear modules. One could also include negative evidence, e.g. conflicting targeting modules.
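A cis ELM-ELM relation of this kind can act as a simple logical filter on predicted motifs. The sketch below assumes predictions are available as a list of motif names per polypeptide; the rule table contains only the MOD_CDK/LIG_CYCLIN dependency stated above, and the function name, the table name and the extra motif label `LIG_SH3` are hypothetical.

```python
from typing import Dict, List, Set

# Each key is kept only if all motifs in its value set occur in the same chain.
CIS_REQUIREMENTS: Dict[str, Set[str]] = {
    "MOD_CDK": {"LIG_CYCLIN"},
}

def filter_by_cis_context(predicted: List[str]) -> List[str]:
    """Drop predicted motifs whose required cis partners are missing."""
    present = set(predicted)
    return [motif for motif in predicted
            if CIS_REQUIREMENTS.get(motif, set()) <= present]

print(filter_by_cis_context(["MOD_CDK", "LIG_CYCLIN"]))
print(filter_by_cis_context(["MOD_CDK", "LIG_SH3"]))
```

Negative evidence, such as conflicting targeting modules, could be handled the same way with a second table of motifs that must not co-occur.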


Cis domain-domain relation

For structural domains this is normal, e.g. the Src kinase domain consists of two structural domains in one functional unit. Modular domains have many inter-relations; these can be extracted from SMART.

Cis domain-ELM relation

Examples of ELMs that always reside together with a globular domain are activation loop phosphorylation motifs in tyrosine kinases. Cleavage sites are likely to be dominant in this type of relation because they might be used to ‘release’ modular domains from each other.

Trans ELM-domain relation

Tricky to interpret, but for phosphorylation this could give hints about unknown switches in the pathway.

9.3 Composite models

I propose that only by combining these module types can a more complete description of cellular systems be achieved. Knowing the set of linear and globular modules of a given protein, the polypeptide could be described as a module vector. The components of the vector would be vector space encoded module descriptors. One could then measure functional similarity by proximity in this vector space. This could be formulated as a formal framework for training a future computational resource. Another very interesting project would be to compare with functional associations as they are defined in the STRING resource, mentioned in 2.3.2 on page 10.
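As a minimal sketch of what "proximity in this vector space" could mean, the code below encodes each protein as a count vector over its module names and compares two proteins by cosine similarity. Both choices are assumptions made for illustration: the text does not fix an encoding or a distance, and a bag-of-modules encoding ignores the N- to C-terminal order that the model otherwise treats as significant. The module names are illustrative.

```python
import math
from collections import Counter
from typing import List

def module_similarity(modules_a: List[str], modules_b: List[str]) -> float:
    """Cosine similarity between two bag-of-modules count vectors."""
    counts_a, counts_b = Counter(modules_a), Counter(modules_b)
    dot = sum(counts_a[name] * counts_b[name]
              for name in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

protein_a = ["SH3", "SH2", "Kinase", "SH2_ligand"]
protein_b = ["SH3", "SH2", "Kinase"]
protein_c = ["PDZ", "PDZ", "WW"]

print(module_similarity(protein_a, protein_b))
print(module_similarity(protein_a, protein_c))
```

An order-aware alternative would compare the module vectors as sequences, for example by alignment, at the cost of a less obvious notion of distance.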


Draft II

This paper will propose a more complex model for describing protein functions. We show a glimpse of the large regulatory networks that are based on linear modules within the cell. The atomic deconvolution, into linear and globular modules, of a canonical cellular pathway is currently missing from the manuscript. This unfinished draft manuscript will be submitted to PLoS Biology 2004.

9.4 Linear Functional Modules - Composite Protein Function Systems

Rune Linding*, Pål Puntervoll*, Lars Juhl Jensen, Morten Mattingsdal, Sean Hooper, Robert B. Russell, Toby J. Gibson and Rein Aasland

The ELM Consortium - http://elm.eu.org

EMBL - Biocomputing Unit - Meyerhofstr 1 - D-69117 Heidelberg - Germany
CBU - Department of Computational Biology, University of Bergen, Bergen - Norway
MBI - Department of Molecular Biology, University of Bergen, Bergen - Norway

To whom correspondence should be addressed. E-mail: [email protected]

* Contributed equally to this work.

Summary

In the post-genomic era, analysis of cellular regulatory networks and systems is of increasing importance. Yet we do not know the role that the topologies, modules and different regulatory protein interaction networks will play in a future complex model of cellular systems. Hitherto, protein function models have been primarily described in terms of functional


modules known as globular domains. However, a large group of functional sites, only now being catalogued, is found primarily in unstructured parts of proteins. These linear system modules encompass ligand sites such as 14-3-3, SH3 and Cyclin ligands as well as post-translational modification sites and targeting signals. This work describes the first comprehensive, proteome-level analysis of such system modules. Linear system modules are short and co-linear in both sequence and structure space, i.e. they behave as linear peptide motifs, which makes them difficult to detect from sequence alone. Experimentally they are often neglected because they reside in unstructured parts of proteins which are often recombinantly removed during protein expression. We have created the largest and most comprehensive computational resource for finding these functional sites: the Eukaryotic Linear Motif resource, ELM, http://elm.eu.org. The ELM resource is knowledge-based and stores contextual profiles for linear functional sites annotated from the scientific literature. Contextual profiles and logical filters reduce the overprediction of short linear motifs. Linear system modules are as important for protein function models as are globular domains, and we show how one can infer novel functional networks by considering both types as modules. We propose how they can be integrated into the modular model of protein function. Several proteome interaction datasets are analysed for interactions between a linear and a globular module, to show molecular details about verified protein interactions.

Introduction

As structural genomics initiatives are providing more structures, it is becoming increasingly clear that many functionally important protein segments reside outside of globular domains [236, 69]. About 60-80% of the proteome in higher Eukaryotes consists of multidomain proteins [10]. An archetypal example of this architecture is p53, shown in Figure 9.2. It is likely that most proteins have more linear functional sites than globular domains, thus identification of these is crucial for functional classification of the protein [116, 117]. Even though ‘only’ ~30000 ORFs are predicted to be in the human genome, most of these will have several splice variants and moreover several functional sites, e.g. post-translational modification sites. These sites can be in different chemical states and thereby increase the number of different functional species (isoforms) several fold [Humpherey-Smith]. This work concerns how these molecular functional sites behave as linear system modules in the modular model of protein function that has hitherto focused on protein domains. We postulate that linear motifs and globular domains come together in a composite model of protein function. Functional sites can be classified into groups, the largest ones possibly being covalent modification sites such as phosphorylation, acetylation, sumoylation and ubiquitination. Another very large group is the ligand sites, including ligands for MDM2, cyclin, p300 and TAFII31, as well as sub-cellular localisation signals. Often functional sites are found in densely


Figure 9.2: The p53 protein contains hyper ELM clusters. These contribute to the hub role this protein plays in central regulatory pathways.

packed clusters in sequence space that serve as regulatory hot-spots, refer to Figure 9.2.

Results

9.5 Cellular pathway wide analysis using ELM

Here we will show an atomic analysis of a major cellular signaling pathway. We will show how the pathway is composed of domain-domain and ELM-domain interactions and perhaps couple some relation analysis on top as a linkage to the model proposal.

9.6 Transient interaction networks?

Most studies of protein function and interaction have been domain centric. In fact most large datasets dealing with protein interaction have also focused on this type of interaction [4]. This is because methods such as affinity purification tend to isolate ‘sticky’ and ‘non-transient’ protein complexes. The methods for isolating transient interactions, such as the binding of a cognate ligand to its modular domain, are different, and much smaller in vivo datasets exist. However, the non-transient network is only a part, and perhaps even a subpart, of


Hsp83

spagHop

Dpit47

Hsc70-4

Hsp70Aa

CG5834

CG6489

CHIP

CG8286

CG9170

Pp1-alpha-96A

CG9326

eiger

CG7552

CG7417

veli

CG11208

Magi

CG32264

CG6688

Sr-Cl

dlg1

knr

CG32406

CG14582

CG15529

ac

usp

CG17575

CG5823 cic

ftz-f1

CG2926 CG1340

Traf3

CG9588

CG3402

CG18745

Actn

CG30084

Fur2 CG32677

CG15803

CG5053

X11L CG14168CG3214

Dap160

Figure 9.3: Interactions between D. melanogaster proteins predicted to bemediated by ELMs. All pairs of proteins interacting in the recent systematicscreen by CuraGen Corp. [91] were examined for possible ELM-domain in-teractions. ELMs were predicted using the regular expressions and applyingthe species filter, while the SMART resource was used for detection of proteindomains [133, 192]. Shown are all interacting pairs of proteins where one pro-tein contains a putative ELM (blue circle) and the other protein contains thecorresponding ELM binding domain (SMART domain symbol), excluding anyinteractions that involve SH3 domains.

the interaction networks within the cell. As mentioned earlier, methods suchas yeast-two-hybrid and affinity purification are sampling domain-domain typeprotein interaction at the expense of ELM-domain ones. However, even withskewed data such as these one should expect a low-level of elm-domain inter-actions. Recently a network for the WW motif was published [109].
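The screening procedure described for Figure 9.3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline actually used: the three patterns are simplified readings of motif descriptions (PPXY, C-terminal EEVD, PXXDY) rather than the regular expressions stored in ELM, the ELM-to-domain mapping is assumed, the species filter is omitted, and all protein names, sequences and domain annotations are invented.

```python
import re
from typing import AbstractSet, Dict, List, Set, Tuple

# Simplified, illustrative patterns; not the regular expressions stored in ELM.
ELM_PATTERNS: Dict[str, str] = {
    "LIG_WW_1": r"PP.Y",    # PPXY motif
    "LIG_TPR": r"EEVD$",    # C-terminal EEVD motif
    "LIG_SH3_5": r"P..DY",  # PXXDY motif
}
# Which domain type binds which ELM (assumed mapping for this sketch).
ELM_BINDING_DOMAIN: Dict[str, str] = {
    "LIG_WW_1": "WW",
    "LIG_TPR": "TPR",
    "LIG_SH3_5": "SH3",
}


def elm_domain_pairs(
    interactions: List[Tuple[str, str]],
    sequences: Dict[str, str],
    domains: Dict[str, Set[str]],
    excluded_domains: AbstractSet[str] = frozenset({"SH3"}),
) -> List[Tuple[str, str, str]]:
    """Return (elm_protein, domain_protein, elm_name) for each interacting pair
    in which one partner matches an ELM pattern and the other carries the
    corresponding binding domain."""
    hits = []
    for a, b in interactions:
        # Test both orientations: ligand on a / domain on b, and the reverse.
        for ligand, partner in ((a, b), (b, a)):
            for elm, pattern in ELM_PATTERNS.items():
                domain = ELM_BINDING_DOMAIN[elm]
                if domain in excluded_domains:
                    continue
                has_domain = domain in domains.get(partner, set())
                if has_domain and re.search(pattern, sequences.get(ligand, "")):
                    hits.append((ligand, partner, elm))
    return hits


# Invented toy data.
sequences = {
    "protA": "MSTPPEYAAK",  # contains PPEY
    "protB": "MGGSLLKK",
    "protC": "MAAEEVD",     # ends with EEVD
    "protD": "MKKLLD",
    "protE": "MPAADYK",     # contains PAADY
    "protF": "MQQQ",
}
domains = {"protB": {"WW"}, "protD": {"TPR"}, "protF": {"SH3"}}
interactions = [("protA", "protB"), ("protD", "protC"),
                ("protE", "protF"), ("protA", "protD")]
pairs = elm_domain_pairs(interactions, sequences, domains)
```

Only the first two toy interactions are retained: the third involves an SH3 domain and is excluded, and the fourth has a TPR domain on one side but no matching ligand on the other.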

9.7 Abstract functional inference

As mentioned, protein function has hitherto been described primarily in terms of globular domains. We here show what are, to our knowledge, the first attempts at drawing atomically deconvoluted functional networks. A central concept here is the so-called trans-function principle: any linear ligand site needs a partner modular domain in order to engage in its cellular role. We here show results for nuclear proteins if one enforces this principle upon experimentally verified protein interaction networks.

Figure 9.4: Nucset-2 is another ELM-domain interaction figure, based on nuclear proteins. The figure was created using NetView.jar from the Bork lab.

9.8 Composite Modular Protein Function Systems - proposal

Here we will propose the model.

Discussion

Linear system modules are linear because of their context

The linearity of ELMs is based on two things:

1. Their structural context: they are often linear and unstructured

2. They evolve in an independent manner

Linear motifs and aggregation disease [Linding et al. JMB 2004]. Contextual profiling - the future ELM resource and protein function model. A vector space description of ELMs with context scores as coordinates. Similar functions by proximity in a space based on functional module vectorial descriptions of proteins.

Figure 9.5: Edges correspond to cis-relations, i.e. the modules are on the same polypeptide. Domains are shown as blue dots and ELMs as red dots. Green lines show domain-domain cis-relations. Blue lines correspond to ELM-domain cis-relations. We are still trying to interpret this type of network.
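The vector space idea can be illustrated with a small sketch in which each protein is a sparse vector whose coordinates are context scores for the functional modules it carries, and functional similarity is measured as the cosine between two such vectors. The choice of cosine similarity, the protein names and all scores are assumptions made for this example; the text above does not specify a metric.

```python
import math
from typing import Dict

# Each protein is a sparse vector: module name -> context score (made-up values).
ProteinVector = Dict[str, float]


def cosine_similarity(u: ProteinVector, v: ProteinVector) -> float:
    """Cosine of the angle between two sparse module vectors (0.0 if either is empty)."""
    dot = sum(score * v[name] for name, score in u.items() if name in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


proteins = {
    "protein_1": {"LIG_MDM2": 0.9, "MOD_CDK": 0.7, "LIG_CYCLIN_1": 0.6},
    "protein_2": {"LIG_MDM2": 0.8, "MOD_CDK": 0.6},
    "protein_3": {"TRG_ER_KDEL_1": 0.9},
}
sim_12 = cosine_similarity(proteins["protein_1"], proteins["protein_2"])
sim_13 = cosine_similarity(proteins["protein_1"], proteins["protein_3"])
```

In this toy space protein_1 and protein_2 share two modules and lie close together, whereas protein_3 shares none and has similarity zero.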

Materials & Methods

Datasets used

ELM instances

The ELM resource currently holds 1900 experimentally verified instances of 72 ELMs distributed over 1500 sequences. In our analysis we used a non-redundant subset based on a 90% similarity threshold. A sequence was not removed if it carried instances of more than one ELM or if it contained the only known instance of an ELM. The sequence similarity check was performed using the CD-HI algorithm [135].
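The filtering rules can be sketched as a greedy pass over the sequences. This is a hedged illustration only: the similarity measure is a crude stand-in from the Python standard library and is not CD-HI, "the only known instance" is approximated as "the only sequence carrying that ELM", and the sequences and annotations are invented.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List, Set


def similarity(a: str, b: str) -> float:
    """Crude stand-in for a sequence identity measure (this is not CD-HI)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def non_redundant_subset(
    sequences: Dict[str, str],
    elms_on_sequence: Dict[str, Set[str]],
    threshold: float = 0.9,
) -> List[str]:
    """Greedy filter: visit longest sequences first and drop near-duplicates
    of already kept sequences, unless the sequence is protected."""
    # Number of sequences that carry an instance of each ELM.
    carriers = Counter(elm for elms in elms_on_sequence.values() for elm in elms)
    kept: List[str] = []
    for name in sorted(sequences, key=lambda n: (-len(sequences[n]), n)):
        elms = elms_on_sequence.get(name, set())
        # Protected: more than one ELM, or sole carrier of some ELM.
        protected = len(elms) > 1 or any(carriers[e] == 1 for e in elms)
        redundant = any(
            similarity(sequences[name], sequences[k]) >= threshold for k in kept
        )
        if protected or not redundant:
            kept.append(name)
    return kept


# Invented toy data.
toy_sequences = {
    "seqA": "MKTAYIAKQRQISFVKSHFSRQ",
    "seqB": "MKTAYIAKQRQISFVKSHFSRA",  # one substitution relative to seqA
    "seqC": "MKTAYIAKQRQLSFVKSHFSRQ",  # also close to seqA, but carries two ELMs
    "seqD": "GSPLWTNCDEHGSPLWTNCDEH",
}
toy_elms = {
    "seqA": {"MOD_CDK"},
    "seqB": {"MOD_CDK"},
    "seqC": {"MOD_CDK", "LIG_CYCLIN_1"},
    "seqD": {"LIG_PDZ_1"},
}
subset = non_redundant_subset(toy_sequences, toy_elms)
```

Here seqB is dropped as a near-duplicate of seqA, while seqC survives despite its similarity because it carries more than one ELM.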


Supporting information

The ELM resource is on-line at http://elm.eu.org. The web interface allows for queries against the ELM resource as well as moderated pipeline predictions performed by a remote user’s robot script. The interaction datasets and the underlying instance data can be downloaded from the website at http://elm.eu.org/datasets/. In the future ELM will be available for local installation. A figure illustrating the information flow in the siteseeing process can be seen at http://elm.eu.org/PLoS.paper/siteseeing.pdf.

Acknowledgements

The ELM consortium was supported by EU grant QLRI-CT-2000-00127. Thanks to Sara Quirk for commenting on this manuscript. Finally, we are deeply grateful to FreeBSD.org, kernel.org, (bio)Python.org, PostgreSQL.org, Debian.org, Apache.org and Linus Torvalds for fantastic open free software.

Author contributions

RL, PP, LJJ, MM, RBR, TJG and RA wrote the paper and performed the experiments. SH developed the NetView program that is available upon request.


Epilogue

Your theory is crazy, but it’s not crazy enough to be true.
Niels Bohr

Description of protein function has hitherto been skewed towards the role of globular domains. However, regulatory networks consist to a large extent of ELM-domain interactions. The prediction and classification of linear modules is still less mature than for globular domains, but at least we have clear proofs of principle in our work. We established that contextual and logical filtering is an absolute necessity in the prediction of linear modules. Moreover, novel techniques for reducing the search space of linear functional sites were developed.

If the appropriate statistical framework could be created, e.g. by introducing more complex sequence models, the integration and optimisation of these strategies could be enhanced even more. We have seen that linear motifs are important not just for protein function but also in some of the most destructive human diseases. The unstructured part of protein space is now being increasingly studied and I hope that the ab initio methods presented can prove useful for the communities working on intrinsic protein disorder and protein folding.

The work on proposing the next generation models of composite protein function will continue. A major task lies within the analysis and deconvolution of cellular pathways into domain-domain and ELM-domain interactions. The function of a protein is greater than the sum of the functional modules within it and even with a new model for function, our description of these complex biopolymers will still be very primitive.

Thank you for your attention!


Part V

Additional Information


URLs

In Table 9.1 I have listed some URLs that might be useful. Many more links can be found in the annual database (http://nar.oupjournals.org/content/vol31/issue1/) and webserver (http://nar.oupjournals.org/content/vol31/issue13/) open access issues of Nucleic Acids Research.


Resource | Functional classes | URL (http://)
SMART | Globular modular domains | smart.embl.de
Pfam | Globular modular domains | www.sanger.ac.uk/Software/Pfam/
Interpro | Globular domain meta server | www.ebi.ac.uk/interpro/
CDD | Globular modular domains | web.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
PROSITE | Domain signatures and a few linear motifs | www.expasy.ch/sprot/prosite.html
PredictProtein | Secondary structure prediction | www.predictprotein.org
ELM | Functional sites, ELMs, linear motifs | elm.eu.org
PyMOL | Very nice and easy to use molecule viewer and renderer | pymol.sourceforge.net
VMD | Feature rich molecule viewer | www.ks.uiuc.edu/Research/vmd/
Scansite | Phosphorylation and signaling motifs | scansite.mit.edu
Protfun | Enzyme categories and higher functional classes | www.cbs.dtu.dk/services/ProtFun
NetNglyc | N-Glycosylation motifs | www.cbs.dtu.dk/services/NetNGlyc
PredictNLS | Nuclear localization signals | cubic.bioc.columbia.edu/predictNLS
SignalP | Cleavage sites & signal/non-signal peptide prediction | www.cbs.dtu.dk/services/SignalP
PSORT | Protein sorting signals | psort.nibb.ac.jp
Sulfinator | Tyrosine sulfation motifs | us.expasy.org/tools/sulfinator
GlobPlot | Protein disorder and globularity | globplot.embl.de
DisEMBL | Protein disorder | dis.embl.de
GO | Biological function, component and process | www.geneontology.org
Ensembl | Genome browsing | www.ensembl.org
Phospho.ELM | Instances of Ser/Thr/Tyr phosphorylation | phospho.elm.eu.org
Perl | Script oriented language widely used in bioinformatics | www.perl.com
Python | Highly object oriented language designed for large projects | www.python.org
HMMER | Hidden Markov Model software suite | hmmer.wustl.edu
Biopython | Bioinformatics modules for Python | www.biopython.org
Bioperl | Bioinformatics modules for Perl | www.bioperl.org

Table 9.1: Some of the resources I have referred to in this thesis. For more information please refer to the individual web sites.


Database Schema

The ELM database schema in detail. We used our own dialect of the Unified Modeling Language (UML) to design the charts. The diagrams were drawn in the open source diagram editor Dia. A separate file with the corresponding SQL code is manually maintained in the CVS repository for ELM.


Elms
+id: SERIAL pkey
+identifier: VARCHAR(32) ukey
+short_description: VARCHAR(256) ukey
+functional_site_id: INTEGER fkey = 1
+log_id: INTEGER fkey = 1

Proteins
+id: SERIAL pkey
+public_name: VARCHAR(256) ukey
+note_id: INTEGER fkey = 1
+expression_id: INTEGER fkey = 1

External_links
+id: SERIAL pkey
+external_id: VARCHAR(128) = ’’
+external_database_id: INTEGER fkey = 1
+external_id_type_id: INTEGER fkey = 1
+external_value: VARCHAR(1024) = ’’
+ukey = (external_id,external_database_id,external_id_type_id)

Protein_external_links
+protein_id: INTEGER fkey
+external_link_id: INTEGER fkey
+logic_id: INTEGER fkey
+pkey = (protein_id,external_link_id,logic_id)

Models
+id: SERIAL pkey
+model: TEXT
+name: VARCHAR(128) ukey
+model_type_id: INTEGER fkey
+sha1_bin: BYTEA ukey
+sha1_hex: CHAR(40) ukey
+note_id: INTEGER fkey

Elm_models
+elm_id: INTEGER fkey
+model_id: INTEGER fkey
+note_id: INTEGER fkey = 1
+pkey = (elm_id,model_id)
+best_model: BOOLEAN = FALSE

Protein_files
+file_id: INTEGER fkey
+protein_id: INTEGER fkey
+pkey = (file_id,protein_id)

Protein_URLs
+protein_id: INTEGER fkey
+url_id: INTEGER fkey
+pkey = (protein_id,url_id)

Elm_taxonomies
+node_id: INTEGER fkey
+elm_id: INTEGER fkey
+excluded_node: BOOLEAN = FALSE
+pkey = (node_id,elm_id)

Elm_instances
+id: SERIAL pkey
+elm_id: INTEGER fkey = 1
+sequence_id: INTEGER fkey = 1
+logic_id = 1
+startposition: INTEGER = 0
+endposition: INTEGER = 0
+model_id: INTEGER fkey
+note_id: INTEGER fkey = 1
+ukey = (elm_id,sequence_id,startposition,endposition,logic_id,model_id)

External_id_types
+id: SERIAL pkey
+external_id_type: VARCHAR(256) ukey

Evidence_codes
+id: SERIAL pkey
+code: VARCHAR(8) ukey
+description: VARCHAR(256) ukey

Flags
+id: SERIAL pkey
+flag: VARCHAR(128) ukey

External_link_flags
+flag_id: INTEGER fkey
+external_link_id: INTEGER fkey
+pkey = (flag_id,external_link_id)

Sequences
+id: SERIAL
+name: VARCHAR(256) ukey
+sha1_bin: BYTEA ukey
+sha1_hex: CHAR(40) ukey
+sequence: TEXT
+length: INTEGER = 0
+complexity: VARCHAR(1024) = ’formula?’
+composition: VARCHAR(1024) = ’atomic?’
+molecular_weight: FLOAT8
+sequence_version: VARCHAR(128)
+external_checksum: VARCHAR(512) ukey
+taxonomy_node_id: INTEGER fkey

Sequences_external_links
+sequence_id: INTEGER fkey
+external_link_id: INTEGER fkey
+logic_id: INTEGER fkey
+pkey = (sequences_id,external_link_id,logic_id)

Elm_external_links
+elm_id: INTEGER fkey
+external_link_id: INTEGER fkey
+logic_id: INTEGER fkey
+pkey = (elm_id,external_link_id,logic_id)

Evidences
+id: INTEGER pkey
+method_id: INTEGER fkey
+evidence_class_id: INTEGER fkey
+reliability_id: INTEGER fkey
+note_id: INTEGER fkey

Instance_evidence
+elm_instance_id: INTEGER fkey
+evidence_id: INTEGER fkey
+logic_id: INTEGER fkey

Taxonomy_flags
+flag_id: INTEGER fkey
+node_id: INTEGER fkey
+pkey = (flag_id,node_id)

Sequence_types
+id: SERIAL pkey
+type: VARCHAR(128) ukey

Evidence_evidence_codes
+evidence_code_id: INTEGER fkey
+evidence_id: INTEGER fkey
+pkey = (evidence_code_id,evidence_id)

Model_types
+id: INTEGER pkey
+type: VARCHAR(64) ukey

Protein_taxa
+protein_id: INTEGER fkey
+node_id: INTEGER fkey
+pkey = (protein_id,node_id)

Evidence_methods
+id: INTEGER pkey
+method: VARCHAR(512) ukey

Protein_complexes
+id: SERIAL pkey
+name: VARCHAR(256) ukey

Complex_subunits
+complex_id: INTEGER fkey
+protein_id: INTEGER fkey
+pkey = (complex_id,protein_id)

Protein_complexes
+id: SERIAL pkey
+name: VARCHAR(512) ukey

Complex_subunits
+protein_id: INTEGER fkey
+complex_id: INTEGER fkey

Figure 9.6: elmDB schema 1.
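The Sequences, Models and Files tables each carry a sha1_bin (BYTEA) and a sha1_hex (CHAR(40)) column. A SHA-1 digest is 20 bytes, which is 40 hexadecimal characters, so the two columns can hold the same checksum in binary and text form. The sketch below shows how such a pair could be derived for a sequence; the normalisation step (removing whitespace and upper-casing) is an assumption of this example, since the schema does not say how the checksums are computed.

```python
import hashlib
from typing import Tuple


def sequence_checksums(sequence: str) -> Tuple[bytes, str]:
    """Return (sha1_bin, sha1_hex) for a protein sequence.

    Whitespace removal and upper-casing are assumptions of this sketch.
    """
    normalised = "".join(sequence.split()).upper()
    digest = hashlib.sha1(normalised.encode("ascii"))
    return digest.digest(), digest.hexdigest()


# The same sequence written with different case and line breaks.
sha1_bin, sha1_hex = sequence_checksums("mktayiak qrqisf\nvkshfsrq")
reference_bin, reference_hex = sequence_checksums("MKTAYIAKQRQISFVKSHFSRQ")
```

A unique key on such a checksum gives a cheap way to detect that an identical sequence is already stored, without comparing full TEXT values.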


Evidence_references
+evidence_id: INTEGER fkey
+article_id: INTEGER fkey
+pkey = (evidence_id,article_id)

Proteins
+id: SERIAL pkey
+public_name: VARCHAR(256) ukey
+note_id: INTEGER fkey = 1
+expression_id: INTEGER fkey = 1

Taxonomy_tree NEW
+node_id: INTEGER pkey = 1
+rank_id: INTEGER fkey = 1
+ncbi_scientific_name: VARCHAR(256) ukey
+genbank_common_name: VARCHAR(256) = ’’
+left_metric: INTEGER = 1
+right_metric: INTEGER = 1

Protein_references
+protein_id: INTEGER fkey
+article_id: INTEGER fkey
+pkey = (protein_id,article_id)

External_links
+id: SERIAL pkey
+external_id: VARCHAR(128) = ’’
+external_database_id: INTEGER fkey = 1
+external_id_type_id: INTEGER fkey = 1
+external_value: VARCHAR(1024) = ’’
+ukey = (external_id,external_database_id,external_id_type_id)

External_databases
+id: SERIAL pkey
+name: VARCHAR(256) ukey
+url_id: INTEGER fkey = 1
+module_file_id: INTEGER fkey = 1
+ukey = (name,url_id)

Notes
+id: SERIAL pkey
+note: TEXT

URLs
+id: SERIAL pkey
+note_id: INTEGER fkey
+url: VARCHAR(8190)

Synonyms
+id: SERIAL pkey
+synonym: VARCHAR(256) ukey

Protein_synonyms
+protein_id: INTEGER fkey
+synonym_id: INTEGER fkey
+pkey = (protein_id,synonym_id)

Files
+id: SERIAL pkey
+file_oid: OID
+description: VARCHAR(512)
+filename: VARCHAR(512)
+sha1_bin: BYTEA ukey
+sha1_hex: CHAR(40) ukey
+note_id: INTEGER fkey
+file_type_id: INTEGER fkey
+ukey = (file_oid,filename)

Elm_taxonomies
+node_id: INTEGER fkey
+elm_id: INTEGER fkey
+excluded_node: BOOLEAN = FALSE
+pkey = (node_id,elm_id)

External_methods
+id: SERIAL pkey
+name: VARCHAR(128)
+method: VARCHAR(128) = ’Neural Network’
+module_file_id: INTEGER fkey = 1
+url_id: fkey ukey = 1
+contact_email_id: INTEGER fkey = 1
+display: VARCHAR(1024) = ’Copyright 2002 ELM Consortium’
+note_id: INTEGER fkey = 1
+interface_flag_id: INTEGER fkey = 1
+ukey = (name,url_id)

Protein_expression_contexts
+id: SERIAL pkey
+context: VARCHAR(1024) ukey

File_types
+id: SERIAL pkey
+type: VARCHAR(256) ukey

Elm_files
+file_id: INTEGER fkey
+elm_id: INTEGER fkey
+pkey = (file_id,elm_id)

Logic
+id: INTEGER pkey
+logic: VARCHAR(256) ukey
+value: FLOAT8 = 0

Evidence_codes
+id: SERIAL pkey
+code: VARCHAR(8) ukey
+description: VARCHAR(256) ukey

Article_evidence_codes
+evidence_code_id: INTEGER fkey
+article_id: INTEGER fkey
+pkey = (evidence_code_id,article_id)

Evidence_classes
+id: SERIAL pkey
+class: VARCHAR(128) ukey

Taxonomy_ranks
+id: SERIAL pkey
+rank: VARCHAR(64) ukey

Taxonomy_flags
+flag_id: INTEGER fkey
+node_id: INTEGER fkey
+pkey = (flag_id,node_id)

Evidence_external_links
+evidence_id: INTEGER fkey
+external_link_id: INTEGER fkey
+logic_id: INTEGER fkey
+pkey = (elm_id,external_link_id)

Evidence_evidence_codes
+evidence_code_id: INTEGER fkey
+evidence_id: INTEGER fkey
+pkey = (evidence_code_id,evidence_id)

Reliabilities
+id: SERIAL pkey
+reliability: VARCHAR(128) ukey

External_link_files
+file_id: INTEGER fkey
+external_link_id: INTEGER fkey
+pkey = (file_id,external_link_id)

Households
+mtime: TIMESTAMP WITH TIME ZONE = (’NOW’)
+creation_date: TIMESTAMP WITH TIME ZONE = (’NOW’)

Protein_taxa
+protein_id: INTEGER fkey
+node_id: INTEGER fkey
+pkey = (protein_id,node_id)

Evidence_methods
+id: INTEGER pkey
+method: VARCHAR(512) ukey

Elm_siteseeings
+elm_id: INTEGER fkey
+elmer_id: INTEGER fkey
+pkey = (elm_id,elmer_id)

Complex_subunits
+complex_id: INTEGER fkey
+protein_id: INTEGER fkey
+pkey = (complex_id,protein_id)

Logic_flags
+flag_id: INTEGER fkey
+logic_id: INTEGER fkey
+pkey = (flag_id,logic_id)

Copyright (C) 2001-2004 Rune Linding - ELM Consortium - http://elm.eu.org

Figure 9.7: elmDB schema 2.
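The left_metric and right_metric columns of Taxonomy_tree suggest a nested-set encoding of the taxonomy, in which every node stores an interval that encloses the intervals of all its descendants. This reading is an assumption; the schema itself does not name the technique. Under that assumption, the rows of Elm_taxonomies (a node plus an excluded_node flag) could drive a species filter as sketched below. The tree, node IDs and metric values are invented and are not NCBI taxonomy identifiers.

```python
from typing import List, NamedTuple, Tuple


class TaxNode(NamedTuple):
    node_id: int
    left_metric: int
    right_metric: int


def is_within(node: TaxNode, ancestor: TaxNode) -> bool:
    """True if `node` is `ancestor` itself or lies in its subtree (nested-set test)."""
    return (
        ancestor.left_metric <= node.left_metric
        and node.right_metric <= ancestor.right_metric
    )


def elm_allowed_in(species: TaxNode, elm_taxonomies: List[Tuple[TaxNode, bool]]) -> bool:
    """Apply (node, excluded_node) rows for one ELM: the species must lie inside
    an included subtree and outside every excluded subtree."""
    in_included = any(
        is_within(species, node) for node, is_excluded in elm_taxonomies if not is_excluded
    )
    in_excluded = any(
        is_within(species, node) for node, is_excluded in elm_taxonomies if is_excluded
    )
    return in_included and not in_excluded


# Invented toy tree: root > (metazoa > (vertebrata, arthropoda), fungi).
root = TaxNode(1, 1, 10)
metazoa = TaxNode(2, 2, 7)
vertebrata = TaxNode(3, 3, 4)
arthropoda = TaxNode(4, 5, 6)
fungi = TaxNode(5, 8, 9)

# Hypothetical rule set: the ELM applies to metazoa, except arthropoda.
rules = [(metazoa, False), (arthropoda, True)]
```

The appeal of this encoding is that a subtree test needs only two integer comparisons, so a species filter can be expressed as a simple range condition in SQL rather than a recursive walk.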


Journals
+id: SERIAL pkey
+name: VARCHAR(512) ukey

Authors
+id pkey: SERIAL
+name ukey: VARCHAR(512) ukey

Articles
+id: SERIAL pkey
+flag_id: INTEGER fkey = 1
+journal_id: INTEGER fkey
+title: VARCHAR(1024)
+abstract: VARCHAR(4096)
+volume: SMALLINT
+issue: VARCHAR(64) = ’’
+pagination: VARCHAR(64)
+PMID: INTEGER ukey
+publication_date: VARCHAR(48) = (’TODAY’)
+ukey = (title,pagination)

Site_siteseeings
+site_id: INTEGER fkey
+elmer_id: INTEGER fkey
+pkey = (site_id,elmer_id)

Article_authors
+author_id: INTEGER fkey
+article_id: INTEGER fkey
+author_rank: SMALLINT = 1
+pkey = (author_id,article_id)
+ukey = (article_id,author_rank)

ELMers
+id: SERIAL pkey
+firstname: VARCHAR(64)
+lastname: VARCHAR(64)
+initials: VARCHAR(6) = ’’
+degree: VARCHAR(128) = ’’
+startdate: DATE = ’2001-01-03’
+enddate: DATE = ’TODAY’
+birthday: DATE = ’TODAY’
+gender: CHAR(1)
+email_id: INTEGER fkey = 1
+cellphone: VARCHAR(64) = ’’
+workphone: VARCHAR(64)
+homephone: VARCHAR(64) = ’’
+fax: VARCHAR(64)
+workplace_id: INTEGER fkey = 1
+ukey = (firstname,lastname,initials)
+gender_check = gender IN (’M’,’F’)
+url_id: INTEGER fkey = 1
+siteseer: BOOLEAN = TRUE
+login: VARCHAR(128) ukey = Firstname+Lastname

Workplaces
+id: SERIAL pkey
+name: VARCHAR(128)
+department: VARCHAR(128)
+street: VARCHAR(128)
+postalcode: VARCHAR(128) = ’’
+city: VARCHAR(128)
+country: VARCHAR(128)
+state: VARCHAR(128) = ’’
+url_id: INTEGER fkey ukey
+ukey = (name,department)

Stati
+id: SERIAL pkey
+status: VARCHAR(128) ukey

Site_external_links NEW
+site_id: INTEGER fkey
+external_link_id: INTEGER fkey
+logic_id: INTEGER fkey
+pkey = (site_id,external_link_id,logic_id)

Emails
+id: SERIAL pkey
+email: VARCHAR(256) ukey

Files
+id: SERIAL pkey
+file_oid: OID
+description: VARCHAR(512)
+filename: VARCHAR(512)
+sha1_bin: BYTEA ukey
+sha1_hex: CHAR(40) ukey
+note_id: INTEGER fkey
+file_type_id: INTEGER fkey
+ukey = (file_oid,filename)

Site_URLs
+site_id: INTEGER fkey
+url_id: INTEGER fkey
+pkey = (site_id,url_id)

Functional_sites
+id: SERIAL pkey
+abstract: TEXT
+short_description: VARCHAR(1024) ukey
+log_id: INTEGER fkey = 1
+public_name: VARCHAR(256) ukey
+expert_email_id: INTEGER fkey = 1
+descriptive_title: VARCHAR(128) ukey
+status_id: INTEGER fkey = 1

Site_synonyms
+site_id: INTEGER fkey
+synonym_id: INTEGER fkey
+pkey = (site_id,synonym_id)

Site_files
+file_id: INTEGER fkey
+site_id: INTEGER fkey
+pkey = (file_id,site_id)

Article_evidence_codes
+evidence_code_id: INTEGER fkey
+article_id: INTEGER fkey
+pkey = (evidence_code_id,article_id)

Search_profiles
+id: SERIAL pkey
+profile: VARCHAR(1024) ukey
+note_id: INTEGER fkey = 1
+search_type_id: INTEGER fkey = 1

Site_search_profiles
+site_id: INTEGER fkey
+search_profile_id: INTEGER fkey
+pkey = (site_id,query_id)

Taxonomy_ranks
+id: SERIAL pkey
+rank: VARCHAR(64) ukey

Article_files
+file_id: INTEGER fkey
+article_id: INTEGER fkey
+pkey = (file_id,article_id)

External_link_files
+file_id: INTEGER fkey
+external_link_id: INTEGER fkey
+pkey = (file_id,external_link_id)

Search_profile_types
+id: SERIAL pkey
+profile: VARCHAR(256) ukey

Copyright (C) 2001-2004 Rune Linding - ELM Consortium - http://elm.eu.org

Figure 9.8: elmDB schema 3.


List of Linear Motifs

The list of linear motifs that is currently in the ELM resource. More detailed information can be found on the ELM website.
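Several cleavage-site entries in this list state their pattern in three-letter residue notation, with "|" marking the cleaved bond, for example Arg-Xaa-[Arg/Lys]-Arg-|-Xaa for furin. As a hedged illustration of how such a description maps onto a regular expression, the sketch below translates that notation and reports where a sequence would be cut. The helper names, the residue table (which only covers residues needed here) and the test sequence are made up for this example; the regular expressions actually stored in the ELM resource may differ.

```python
import re
from typing import List, Tuple

# Only the residues needed for the example patterns; Xaa means any residue.
THREE_TO_ONE = {
    "Arg": "R", "Lys": "K", "Ile": "I", "Leu": "L", "Val": "V",
    "Phe": "F", "Gly": "G", "Ser": "S", "Xaa": ".",
}


def pattern_to_regex(pattern: str) -> Tuple[str, int]:
    """Translate e.g. 'Arg-Xaa-[Arg/Lys]-Arg-|-Xaa' into (regex, cut_offset),
    where cut_offset is the number of residues before the cleaved bond
    (-1 if the pattern contains no '|')."""
    parts: List[str] = []
    cut_offset = -1
    for token in pattern.split("-"):
        if token == "|":
            cut_offset = len(parts)
        elif token.startswith("["):
            options = token[1:-1].split("/")
            parts.append("[" + "".join(THREE_TO_ONE[t] for t in options) + "]")
        else:
            parts.append(THREE_TO_ONE[token])
    return "".join(parts), cut_offset


def cleavage_sites(sequence: str, pattern: str) -> List[int]:
    """Return the 1-based position of the residue just before each predicted cut."""
    regex, cut_offset = pattern_to_regex(pattern)
    # A lookahead lets overlapping matches be reported as well.
    return [m.start() + cut_offset for m in re.finditer(f"(?=({regex}))", sequence)]


FURIN = "Arg-Xaa-[Arg/Lys]-Arg-|-Xaa"
PAP = "[Ile/Leu/Val]-Xaa-Xaa-Arg-|-[Val/Phe]-[Gly/Ser]-Xaa"
furin_regex, furin_cut = pattern_to_regex(FURIN)
sites = cleavage_sites("MAARSKRSLLRARRAG", FURIN)  # invented sequence
```

On the invented sequence the furin pattern matches twice (RSKR|S and RARR|A), giving cuts after residues 7 and 14.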

CLV_MEL_PAP_1 Prophenoloxidase-activating proteinase (PAP) cleavage site ([Ile/Leu/Val]-Xaa-Xaa-Arg-|-[Val/Phe]-[Gly/Ser]-Xaa)

CLV_NDR_NDR_1 N-Arg dibasic convertase (nardilysine) cleavage site (Xaa-|-Arg-Lys or Arg-|-Arg-Xaa)

CLV_PCSK_FUR_1 Furin (PACE) cleavage site (Arg-Xaa-[Arg/Lys]-Arg-|-Xaa)

CLV_PCSK_KEX2_1 Yeast kexin 2 cleavage site (Lys-Arg-|-Xaa or Arg-Arg-|-Xaa)

CLV_PCSK_PC1ET2_1 NEC1/NEC2 cleavage site (Lys-Arg-|-Xaa)

CLV_PCSK_PC7_1 Proprotein convertase 7 (PC7, PCSK7) cleavage site (Arg-Xaa-Xaa-Xaa-[Arg/Lys]-Arg-|-Xaa)

CLV_PCSK_SKI1_1 Subtilisin/kexin isozyme-1 (SKI1) cleavage site ([RK]-X-[hydrophobic]-[LTKF]-|-X)

LIG_14-3-3_1 14-3-3 proteins interaction motif 1

LIG_14-3-3_2 14-3-3 proteins interaction motif 2

LIG_14-3-3_3 Consensus derived from natural interactors which do not exactly match the mode1 and mode2 ligands

LIG_AP_GAE_1 Acidic Phe motif mediates the interaction between ENTH containing proteins and the gamma-ear domains of Gga2p and AP-1. Proposed roles: in clathrin localization and assembly on TGN/endosome membranes and in traffic between the TGN and endosome.

LIG_Clathr_ClatBox_1 Clathrin box motif found on cargo adaptor proteins; it interacts with the beta propeller structure located at the N-terminus of Clathrin heavy chain.

LIG_Clathr_ClatBox_2 Clathrin box motif found on cargo adaptor proteins; it mediates binding to the N-terminal beta propeller of clathrin. This variant motif (also called W box) is found in the central region of Amphiphysins where it coexists with a "classical" clathrin box.


LIG_COP1 COP1 binding motif

LIG_CORNRBOX The corepressor nuclear receptor box motif confers binding to nuclear receptors.

LIG_CtBP PxDLS motif that interacts with the CtBP protein

LIG_CYCLIN_1 Substrate recognition site that interacts with cyclin and thereby increases phosphorylation by cyclin/cdk complexes. Predicted protein should have the MOD_CDK site. Also used by cyclin inhibitors.

LIG_Dynein_DLC8_1 The [KR]xTQT motif interacts with the common target-accepting grooves of 8kDa Dynein Light Chain dimer.

LIG_EH NPF motif responsible for the interaction with the EH eukaryotic signalling modules

LIG_EH1 The eh1 motif allows binding to Groucho/TLE corepressors

LIG_FHA_1 FHA domain interaction motif 1, threonine phosphorylation is required

LIG_GYF LIG_GYF is a proline-rich sequence specifically recognized by GYF domains

LIG_HOMEOBOX The YPWM motif confers binding to the PBX homeobox domain

LIG_HP1_1 Ligand to interface formed by dimerisation of two chromoshadow domains in HP1 proteins.

LIG_IBS_1 Integrins are major collagen receptors on the surface of eukaryotic cells. This consensus sequence is present in some alpha chains of different collagen types (e.g. alpha 1 chain of type I, II, V and alpha 2 chain of collagen type I and VIII).

LIG_IQ Calmodulin binding helical peptide motif

LIG_MAD2 Mad2 binding motif

LIG_MDM2 Motif found in p53 family members which confers binding to the N-terminal domain of MDM2

LIG_MYND PxLxP motif, MYND ligand

LIG_NRBOX The nuclear receptor box motif (LXXLL) confers binding to nuclear receptors.

LIG_PCNA The PCNA binding site is found in proteins involved in DNA replication, repair and cell cycle control.

LIG_PDZ_1 PDZ domains recognize short sequences at the carboxy terminus of target proteins


LIG_PDZ_2 This is the motif recognized by class II PDZ domains

LIG_PDZ_3 Class III PDZ domains binding motif

LIG_PIP2_ANTH_1 The ANTH-type-PI(4,5)P2 binding motif is found in ENTH containing proteins of the AP180 family and actin binding proteins. It binds to the lipid bilayer and may serve to tether clathrin and cytoskeleton to the plasma membrane.

LIG_PIP2_ENTH_1 The ENTH-type-PI(4,5)P2 binding motif is found on Epsin and Epsin-like proteins. Upon binding to PI(4,5)P2 and in conjunction with clathrin polymerisation Epsin directly modifies membrane curvature.

LIG_PP1 LIG_PP1 is a conserved PP1c (protein phosphatase 1 catalytic subunit)-binding motif

LIG_PTAP PTAP-like ligands

LIG_PTB_1 Phosphotyrosine binding (PTB) motif

LIG_PTB_2 LIG_PTB_2 is a NPXY PTB domain binding motif whose tyrosine is non-phosphorylated

LIG_PXL The paxillin LD motif is recognized by proteins mainly involved in cytoskeleton

LIG_RB Interacts with the Retinoblastoma protein

LIG_RGD This motif can be found in proteins of the extracellular matrix and it is recognized by different members of the integrin family. The structure of the tenth type III module of fibronectin has shown that the RGD motif lies on a flexible loop.

LIG_SH2_GRB2 GRB2-like Src Homology 2 (SH2) domains binding motif.

LIG_SH2_PTP2 SH-PTP2 and phospholipase C-gamma Src Homology 2 (SH2) domains binding motif.

LIG_SH2_SRC Src-family Src Homology 2 (SH2) domains binding motif.

LIG_SH2_STAT3 YXXQ motif found in the cytoplasmic region of cytokine receptors that bind STAT3 SH2 domain.

LIG_SH2_STAT5 STAT5 Src Homology 2 (SH2) domain binding motif.

LIG_SH2_STAT6 STAT6 Src Homology 2 (SH2) domain binding motif.

LIG_SH3_1 This is the motif recognized by class I SH3 domains

LIG_SH3_2 This is the motif recognized by class II SH3 domains

LIG_SH3_3 This is the motif recognized by those SH3 domains with a non-canonical class I recognition specificity


LIG_SH3_4 This is the motif recognized by those SH3 domains with a non-canonical class II recognition specificity

LIG_SH3_5 PXXDY motif recognized by some SH3 domains

LIG_SID_1 Motif interacts with PAH2 domain in the Sin3 scaffold protein

LIG_Sin3_2 Motif interacts with PAH2 domain in the Sin3 scaffold protein. (sp-1 like)

LIG_Sin3_3 Motif interacts with PAH2 domain in the Sin3 scaffold protein. (not mad or sp-1 like)

LIG_TNKBM The poly(ADP-ribose) polymerases Tankyrase-1 (TNK1_HUMAN) and Tankyrase-2 (TNK2_HUMAN) bind proteins through an ankyrin-repeat domain (SM0248). The recognised motif (tankyrase-binding motif) is RxxPDG.

LIG_TPR Ligands of the TPR (tetratricopeptide repeat motif) domains are EEVD motifs, C-terminal sequences highly conserved in all eukaryotic members of the Hsp70 and Hsp90 families.

LIG_TRAF2_1 Major TRAF2-binding consensus motif. Members of the tumor necrosis factor receptor (TNFR) superfamily initiate intracellular signaling by recruiting the C-domain of the TNFR-associated factors (TRAFs) through their cytoplasmic tails.

LIG_TRAF2_2 Minor TRAF2-binding consensus motif. Members of the tumor necrosis factor receptor (TNFR) superfamily initiate intracellular signaling by recruiting the C-domain of the TNFR-associated factors (TRAFs) through their cytoplasmic tails.

LIG_TRAF6 TRAF6 binding site. Members of the tumor necrosis factor receptor (TNFR) superfamily initiate intracellular signaling by recruiting the C-domain of the TNFR-associated factors (TRAFs) through their cytoplasmic tails.

LIG_WH1 LIG_WH1 is the WIP sequence motif binding to the WH1 domains of WASP and N-WASP

LIG_WRPW_1 The WRPW motif mediates recruitment of transcriptional co-repressors of the Groucho/transducin-like enhancer-of-split (TLE) family. LIG_WRPW_1 is based on the C-terminally located motifs found in the Hairy and Runt family proteins.

LIG_WRPW_2 The WRPW motif mediates recruitment of transcriptional co-repressors of the Groucho/transducin-like enhancer-of-split (TLE) family. LIG_WRPW_2 is not restricted to the C-terminus (in contrast to LIG_WRPW_1).

LIG_WW_1 PPXY is the motif recognized by WW domains of Group I


LIG_WW_2 PPLP is the motif recognized by WW domains of Group II

LIG_WW_3 WW domain of group III binding motif

LIG_WW_4 Class IV WW domains interaction motif; phosphorylation-dependent interaction.

MOD_ASX_betaOH_EGF ASX hydroxylation of some EGF domains.

MOD_CAAXbox Generic CAAX box prenylation motif

MOD_CDK Substrate motif for phosphorylation by CDK

MOD_CK1_1 CK1 phosphorylation site

MOD_CK2_1 CK2 phosphorylation site

MOD_CMANNOS Motif for attachment of a mannosyl residue to a tryptophan

MOD_Cter_Amidation Peptide C-terminal amidation

MOD_GlcNHglycan Glycosaminoglycan attachment site

MOD_GSK3_1 GSK3 phosphorylation recognition site

MOD_N-GLC_1 Generic motif for N-glycosylation. Shakin-Eshleman et al. showed that Trp, Asp, and Glu are uncommon before the Ser/Thr position. Efficient glycosylation usually occurs when ~60 residues or more separate the glycosylation acceptor site from the C-terminus

MOD_N-GLC_2 Atypical motif for N-glycosylation site. Examples are human CD69, which is uniquely glycosylated at typical (Asn-X-Ser/Thr) and atypical (Asn-X-Cys) motifs, and beta protein C

MOD_NMyristoyl Generic motif for N-Myristoylation site.

MOD_OFUCOSY Site for attachment of a fucose residue to a serine

MOD_OGLYCOS Site for attachment of a glucose residue to a serine

MOD_PK_1 Phosphorylase kinase phosphorylation site

MOD_PKA_1 PKA is a protein kinase involved in cell signalling

MOD_PKA_2 PKA phosphorylation site

MOD_PKB_1 PKB Phosphorylation site

MOD_PLK Site phosphorylated by the Polo-like-kinase

MOD_SPalmitoyl_2 Class 2 Palmitoylation motif

MOD_SPalmitoyl_4 Class 4 palmitoylation motif

MOD_SUMO Motif recognised for modification by SUMO-1


MOD_TYR_CSK Members of the non-receptor tyrosine kinase Csk family phosphorylate the C-terminal tyrosine residues of the Src family.

MOD_TYR_DYR The kinase activity of the DYRK (dual specificity kinase) is dependent on the autophosphorylation of the YXY motif in the activation loop.

MOD_TYR_ITAM ITAM (immunoreceptor tyrosine-based activatory motif). ITAM consists of a partially conserved short sequence of amino acids found in the cytoplasmic tail of antigen and Fc receptors.

MOD_TYR_ITIM ITIM (immunoreceptor tyrosine-based inhibitory motif). Phosphorylation of the ITIM motif, found in the cytoplasmic tail of some inhibitory receptors (KIRs) that bind MHC Class I, leads to the recruitment and activation of a protein tyrosine phosphatase.

MOD_TYR_ITSM ITSM (immunoreceptor tyrosine-based switch motif). This motif is present in the cytoplasmic region of the CD150 subfamily within the CD2 family and it enables these receptors to bind to and to be regulated by SH2 adaptor molecules, such as SH2D1A.

MOD_WntLipid Palmitoylation site in WNT signalling proteins that is required for correct processing in the endoplasmic reticulum.

TRG_AP2alpha FxDxF motif responsible for the binding of accessory endocytic proteins to the alpha-subunit of adaptor protein complex AP-2

TRG_ENDOCYTIC_2 Tyrosine-based sorting signal responsible for the interaction with mu subunit of AP (Adaptor Protein) complex

TRG_ER_diArg_1 ER retention/retrieving signal found at the N-terminus of type II ER membrane proteins (cytoplasmic in this topology). Occasionally found in type I and IV ER membrane proteins.

TRG_ER_diLys_1 ER retention and retrieving signal found at the C-terminus of type I ER membrane proteins (cytoplasmic in this topology). The di-lysine signal is responsible for COPI-mediated retrieval from post-ER compartments.

TRG_ER_KDEL_1 Retrieving signal found at the C-terminus of many ER soluble proteins. It interacts with the KDEL receptor (an integral membrane protein) which in turn interacts with components of the coatomer (COPI).

TRG_EVH1_I Proline-rich motif binding to signal transduction class I EVH1 domains

TRG_EVH1_II Proline-rich motif binding to signal transduction class II EVH1 domains

TRG_Golgi_diPhe_1 ER to Golgi anterograde transport signal found at the C-terminus of cargo receptors, binds to COPII and COPI.

TRG_LysEnd_APsAcLL_1 Sorting and internalisation signal found in the cytoplasmic juxta-membrane region of type I transmembrane proteins. Targets them from the Trans Golgi Network to the lysosomal-endosomal-melanosomal compartments. Interacts with adaptor protein (AP) complexes

TRG_LysEnd_APsAcLL_3 Sorting signal found in the cytoplasmic juxta-membrane region of type I transmembrane lysosomal, endosomal and melanosomal proteins. Based on experimental evidence and alignments, this very specific ELM represents the best combination for AP3 binding.

TRG_LysEnd_GGAAcLL_1 Sorting signal directing type I transmembrane proteins from the Trans Golgi Network (TGN) to the lysosomal-endosomal compartment. It is found near the C-terminus and interacts with the VHS domain of GGA adaptor proteins.

TRG_PML_SV WHXL motif found in the C-terminal region of Synaptotagmin family proteins that is essential for the localization of the synaptic vesicle at the plasma membrane.

TRG_PTS1 Generic PTS1 ELM for all eukaryotes

TRG_PTS2 Generic PTS2 pattern for all eukaryotes (except lineages which have lost it)

TRG_WXXXY/F Specific ELM present in Pex5p and binding to Pex13p and Pex14p. Part of the peroxisomal matrix protein import system
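
The two N-glycosylation entries in this list (MOD_N-GLC_1 and MOD_N-GLC_2) can serve as a small illustration of how such linear motifs may be scanned for with regular expressions. The sketch below is illustrative only: the patterns `N.[ST]` and `N.C` are simplified readings of the Asn-X-Ser/Thr and Asn-X-Cys descriptions given for those entries, not the curated ELM patterns, and the function name `find_sequons` and the example peptide are hypothetical.

```python
import re

# Simplified patterns read off the glossary descriptions (assumptions,
# not the curated ELM regular expressions).
PATTERNS = {
    "typical": re.compile(r"(?=(N.[ST]))"),   # Asn-X-Ser/Thr
    "atypical": re.compile(r"(?=(N.C))"),     # Asn-X-Cys
}


def find_sequons(sequence):
    """Return sorted (1-based position, matched tripeptide, label) tuples."""
    hits = []
    for label, pattern in PATTERNS.items():
        # The zero-width lookahead lets overlapping matches be reported.
        for match in pattern.finditer(sequence):
            hits.append((match.start() + 1, match.group(1), label))
    return sorted(hits)


# Hypothetical peptide, made up purely for illustration.
for position, tripeptide, label in find_sequons("MKNGSAANVCLLNNTS"):
    print(position, tripeptide, label)
```

A plain pattern match of this kind says nothing about context, which is why the thesis combines such patterns with filters based on globular domains and disorder.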

List of Figures

1.1 A schematic showing the domain and functional site architecture of the well known proto oncogenic protein kinase Src.

1.2 A conceptual model of protein structure and function observation space.

2.1 Presence of SMART domains in the H. sapiens proteome.

3.1 The C-terminal CSK Tyr phosphorylation site in Src (PDB: 1fmk) in the closed conformation bound to the SH2 domain.

4.1 The Epsin-1 protein is involved in Clathrin mediated endocytosis.

4.2 The SH3 ligand site is modelled as five ELMs in ELM.

4.3 The MOD, LIG, CLV and TRG classes.

4.4 Two TRG class functional sites.

4.5 Two LIG class functional sites.

4.6 Two MOD class functional sites.

4.7 The structure of an acetylated p53 peptide model of a functional site and the Sir2 de-acetylase.

4.8 DisEMBL predictions for the region of p53 that Avalos and co-workers used for their peptide model.

5.1 HIV-1 Nef protein (magenta) in complex with wild type Fyn SH3 (red) domain (PDB: 1avz) contains two potential SH3 ELMs.

5.2 Scheme of the ELM server flowthrough using human RASN as a query.

5.3 Example of ELM server output using a short sequence, human RASN, as query.

5.4 The information flow in the current version of the ELM resource.

5.5 Simplified diagram of the elmDB.

5.6 The ELM db represented as UML model.

5.7 Comparison of GO annotation of ELMs and GO annotation of predicted ELM containing proteins in the D. melanogaster proteome.

5.8 The taxonomic and sub cellular localisation surveys.

5.9 Improvement in prediction accuracy, TP/(TP+FP), by applying GlobPlot or SMART/Pfam filters.

5.10 Occurrences of different classes of functional sites in SWISS-PROT.

5.11 The siteseeing process.

5.12 Sample output from the ELM server.

6.1 Propensities for the amino acids to be disordered according to the definitions used in DisEMBL and GlobPlot (sorted by hot loop preference).

6.2 GlobPlot predictions for human Bcl-2.

6.3 Propensities for disorder/globularity detection.

6.4 GlobPlot of human CREB binding protein (CBP_HUMAN).

6.5 GlobPlot of bovine Prion protein (P10279).

6.6 Benchmarking of GlobDoms vs SMART domains.

6.7 Structural analysis of disordered segments.

6.8 Sensitivity and rate of false positives for the various DisEMBL neural networks.

6.9 Comparison with PONDR.

6.10 Propensities for the amino acids to be disordered according to the three definitions used in this work (sorted by hot loop preference).

6.11 DisEMBL hot loop predictions mapped on the NMR mean structure of human translation initiation factor eIF1A (PDB: 1D7Q, SWISS-PROT: P47813).

6.12 Sample output from the DisEMBL web server, showing predictions for Histone H1.2 [P15865].

7.1 GlobPlot of Human Mucin 5 protein (SWISS-PROT: MU5B_HUMAN).

7.2 GlobPlot of Apterous protein involved in development of the wing and imaginal discs in Fruit fly.

7.3 Sample output from the DisEMBL web server, showing predictions for Yeast Nonhistone chromosomal protein 6A (High Mobility Group protein, SWISS-PROT: NHPA_YEAST).

7.4 DisEMBL hot loop predictions mapped on the NMR structure of Nonhistone chromosomal protein 6A (High Mobility Group protein, PDB: 1cg7 - model 1, SWISS-PROT: NHPA_YEAST).

7.5 Disorder profile for LIG_RB.

7.6 LIG_RB is a functional site responsible for interaction with retinoblastoma (Rb) family proteins.

7.7 Tyrosine 416 in the proto-oncogene Src (PDB: 2src).

8.1 Formation of early aggregates in protein misfolding can lead to disease.

8.2 Lactoferrin has been identified in amyloid deposits.

8.3 Comparison of aggregation tendency in globular and unstructured proteins (IDPs) - overview.

8.4 Distribution of aggregation tendency in datasets: number of residues that have a TANGO score larger than 5%, per 1000 residues.

8.5 Histogram showing the average solvent accessibility of aggregation prone regions to be low.

9.1 The new model for protein function proposed.

9.2 The p53 protein contains hyper ELM clusters. These contribute to the hub role this protein plays in central regulatory pathways.

9.3 Interactions between D. melanogaster proteins predicted to be mediated by ELMs.

9.4 A nuclear protein ELM-domain interactions network.

9.5 Cis-relations in a nuclear interaction network.

9.6 elmDB schema 1.

9.7 elmDB schema 2.

9.8 elmDB schema 3.

List of Tables

3.1 Main classes of functional sites.

4.1 Protein structure, sequence and function space can be separated into system modules.

4.2 Syntax of regular expression or patterns, as described on the ELM website.

5.1 Distribution of selected nuclear ELM matches within SWISS-PROT release 40.41 (121515 sequence entries).

5.2 Some specialised resources for motif analysis. PMID is the PubMed Identifier for the server publication.

5.3 Preliminary benchmarking of the SMART/Pfam and GlobPlot2 filters.

5.4 Definitions of different concepts used in the ELM resource.

5.5 Overview of the content of current elmDB.

6.1 Data sources for protein disorder.

6.2 Supplementary online material.

6.3 Comparison of GlobPlot and PONDR on several IUPs.

6.4 Non-trivial user options available.

6.5 Hot loops predictions on proteins that are reported to be IDPs according to CD data [215, 219, 71].

8.1 Comparison of the aggregation tendency in various datasets, see methods for details on the dataset composition.

9.1 Some of the resources I have referred to in this thesis. For more information please refer to the individual web sites.

Bibliography

[1] Aasland, R., Abrams, C., Ampe, C., Ball, L., Bedford, M., Cesareni, G., Gimona, M., Hurley, J., Jarchau, T., Lehto, V., Lemmon, M., Linding, R., Mayer, B., Nagai, M., Sudol, M., Walter, U., and Winder, S. (2002). Normalization of nomenclature for peptide motifs as ligands of modular protein domains. FEBS Lett, 513(1):141–4.

[2] Aasland, R. and Stewart, A. (1995). The chromo shadow domain, a second chromo domain in heterochromatin-binding protein 1, HP1. Nucleic Acids Res, 23(16):3168–73.

[3] Ackerman, M. and Shortle, D. (2002). Robustness of the long-range structure in denatured staphylococcal nuclease to changes in amino acid sequence. Biochemistry, 41(46):13791–7.

[4] Aloy, P. and Russell, R. (2002a). The third dimension for protein interactions and complexes. Trends Biochem Sci, 27(12):633–8.

[5] Aloy, P. and Russell, R. B. (2002b). Potential artefacts in protein-interaction networks. FEBS Lett, 530(1-3):253–4.

[6] Andersen, C., Palmer, A., Brunak, S., and Rost, B. (2002). Continuum secondary structure captures protein flexibility. Structure (Camb), 10(2):175–84.

[7] Andrade, M. (2003). Bioinformatics and Genomes Current Perspectives. Horizon Scientific Press, first edition.

[8] Andrade, M., Ponting, C., Gibson, T., and Bork, P. (2000). Homology-based method for identification of protein repeats using statistical significance estimates. J Mol Biol, 298(3):521–37.

[9] Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J. P., Chothia, C., and Murzin, A. G. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res, 32 Database issue:D226–9.

[10] Apic, G., Gough, J., and Teichmann, S. (2001). An insight into domain combinations. Bioinformatics, 17 Suppl 1:S83–9.

[11] Apweiler, R., Attwood, T., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N., Oinn, T., Pagni, M., Servant, F., Sigrist, C., and Zdobnov, E. (2001). The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res, 29(1):37–40.

[12] Apweiler, R., Bairoch, A., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O’Donovan, C., Redaschi, N., and Yeh, L.-S. L. (2004). UniProt: the Universal Protein knowledgebase. Nucleic Acids Res, 32 Database issue:D115–9.

[13] Aritomi, M., Kunishima, N., Inohara, N., Ishibashi, Y., Ohta, S., and Morikawa, K. (1997). Crystal structure of rat Bcl-xL. Implications for the function of the Bcl-2 protein family. J Biol Chem, 272(44):27886–92.

[14] Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J., Harris, M., Hill, D., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J., Richardson, J., Ringwald, M., Rubin, G., and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25–9.

[15] Avalos, J. L., Celic, I., Muhammad, S., Cosgrove, M. S., Boeke, J. D., and Wolberger, C. (2002). Structure of a Sir2 enzyme bound to an acetylated p53 peptide. Mol Cell, 10(3):523–35.

[16] Aviles, F., Chapman, G., Kneale, G., Crane-Robinson, C., and Bradbury, E. (1978). The conformation of histone H5. Isolation and characterisation of the globular segment. Eur J Biochem, 88(2):363–71.

[17] Bader, G., Betel, D., and Hogue, C. (2003). BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res, 31(1):248–50.

[18] Baldi, P., Brunak, S., Chauvin, Y., Andersen, C., and Nielsen, H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5):412–24.

[19] Bang, M., Centner, T., Fornoff, F., Geach, A., Gotthardt, M., McNabb, M., Witt, C., Labeit, D., Gregorio, C., Granzier, H., and Labeit, S. (2001). The complete gene sequence of titin, expression of an unusual approximately 700-kDa titin isoform, and its interaction with obscurin identify a novel Z-line to I-band linking system. Circ Res, 89(11):1065–72.

[20] Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S., Griffiths-Jones, S., Howe, K., Marshall, M., and Sonnhammer, E. (2002). The Pfam protein families database. Nucleic Acids Res, 30(1):276–80.

[21] Bates, G. (2003). Huntingtin aggregation and toxicity in Huntington’s disease. Lancet, 361(9369):1642–4.

[22] Battiste, J., Pestova, T., Hellen, C., and Wagner, G. (2000). The eIF1A solution structure reveals a large RNA-binding surface important for scanning function. Mol Cell, 5(1):109–19.

[23] Blake, C. and Serpell, L. (1996). Synchrotron X-ray studies suggest that the core of the transthyretin amyloid fibril is a continuous beta-sheet helix. Structure, 4(8):989–98.

[24] Blake, C., Serpell, L., Sunde, M., Sandgren, O., and Lundgren, E. (1996). A molecular model of the amyloid fibril. Ciba Found Symp, 199:6–15; discussion 15–21, 40–6.

[25] Blom, N., Gammeltoft, S., and Brunak, S. (1999). Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol, 294(5):1351–62.

[26] Bourne, Y., Mazurier, J., Legrand, D., Rougé, P., Montreuil, J., Spik, G., and Cambillau, C. (1994). Structures of a legume lectin complexed with the human lactotransferrin N2 fragment, and with an isolated biantennary glycopeptide: role of the fucose moiety. Structure, 2(3):209–19.

[27] Bousset, L., Thomson, N. H., Radford, S. E., and Melki, R. (2002). The yeast prion Ure2p retains its native alpha-helical conformation upon assembly into protein fibrils in vitro. EMBO J, 21(12):2903–11.

[28] Bradshaw, J., Mitaxov, V., and Waksman, G. (2000). Mutational investigation of the specificity determining region of the Src SH2 domain. J Mol Biol, 299(2):521–35.

[29] Brenner, S. (2000). Target selection for structural genomics. Nat Struct Biol, 7 Suppl:967–9.

[30] Brett, T. J., Traub, L. M., and Fremont, D. H. (2002). Accessory protein recruitment motifs in clathrin-mediated endocytosis. Structure (Camb), 10(6):797–809.

[31] Brooks, B. and Karplus, M. (1985). Normal modes for specific motions of macromolecules: application to the hinge-bending mode of lysozyme. Proc Natl Acad Sci U S A, 82(15):4995–9.

[32] Buchberger, A. (2002). From UBA to UBX: new words in the ubiquitin vocabulary. Trends Cell Biol, 12(5):216–21.

[33] Buck, E. and Iyengar, R. (2003). Organization and functions of interacting domains for signaling by protein-protein interactions. Sci STKE, 2003(209):re14.

[34] Buck, M., Radford, S., and Dobson, C. (1994). Amide hydrogen exchange in a highly denatured state. Hen egg-white lysozyme in urea. J Mol Biol, 237(3):247–54.

[35] Bujnicki, J., Elofsson, A., Fischer, D., and Rychlewski, L. (2001). Structure prediction meta server. Bioinformatics, 17(8):750–1.

[36] Byeon, I., Yongkiettrakul, S., and Tsai, M. (2001). Solution structure of the yeast Rad53 FHA2 complexed with a phosphothreonine peptide pTXXL: comparison with the structures of FHA2-pYXL and FHA1-pTXXD complexes. J Mol Biol, 314(3):577–88.

[37] Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., and Apweiler, R. (2004). The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res, 32(1):D262–6.

[38] Celko, J. (2000). Joe Celko’s SQL for Smarties: Advanced SQL Programming. Morgan Kaufmann.

[39] Chandonia, J., Walker, N., Lo Conte, L., Koehl, P., Levitt, M., and Brenner, S. (2002). ASTRAL compendium enhancements. Nucleic Acids Res, 30(1):260–3.

[40] Chenna, R. and C, G. (2000). cgimodel: Cgi programming made easy with python. Linux Journal, 75:142–149.

[41] Chiti, F., Stefani, M., Taddei, N., Ramponi, G., and Dobson, C. M. (2003). Rationalization of the effects of mutations on peptide and protein aggregation rates. Nature, 424(6950):805–8.

[42] Chiti, F., Taddei, N., Baroni, F., Capanni, C., Stefani, M., Ramponi, G., and Dobson, C. M. (2002). Kinetic partitioning of protein folding and aggregation. Nat Struct Biol, 9(2):137–43.

[43] Chou, P. and Fasman, G. (1974). Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins. Biochemistry, 13(2):211–22.

[44] Chou, P. and Fasman, G. (1978a). Empirical predictions of protein conformation. Annu Rev Biochem, 47:251–76.

[45] Chou, P. and Fasman, G. (1978b). Prediction of the secondary structure of proteins from their amino acid sequence. Adv Enzymol Relat Areas Mol Biol, 47:45–148.

[46] Choy, W.-Y. and Kay, L. E. (2003). Probing residual interactions in unfolded protein states using NMR spin relaxation techniques: an application to delta131delta. J Am Chem Soc, 125(39):11988–92.

[47] Choy, W.-Y., Mulder, F. A. A., Crowhurst, K. A., Muhandiram, D., Millett, I. S., Doniach, S., Forman-Kay, J. D., and Kay, L. E. (2002). Distribution of molecular size within an unfolded state ensemble using small-angle X-ray scattering and pulse field gradient NMR techniques. J Mol Biol, 316(1):101–12.

[48] Conti, E. and Izaurralde, E. (2001). Nucleocytoplasmic transport enters the atomic age. Curr Opin Cell Biol, 13(3):310–9.

[49] Copley, R., Doerks, T., Letunic, I., and Bork, P. (2002). Protein domain analysis in the era of complete genomes. FEBS Lett, 513(1):129–34.

[50] Cornilescu, G., Delaglio, F., and Bax, A. (1999). Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J Biomol NMR, 13(3):289–302.

[51] Corpet, F., Servant, F., Gouzy, J., and Kahn, D. (2000). ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res, 28(1):267–9.

[52] Cramer, P., Bushnell, D., and Kornberg, R. (2001). Structural basis of transcription: RNA polymerase II at 2.8 angstrom resolution. Science, 292(5523):1863–76.

[53] Creighton, T. (1993). Proteins Structures and Molecular Properties. W.H. Freeman and Company, second edition.

[54] Cuff, J. and Barton, G. (2000). Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins, 40(3):502–11.

[55] Cuff, J., Clamp, M., Siddiqui, A., Finlay, M., and Barton, G. (1998). JPred: a consensus secondary structure prediction server. Bioinformatics, 14(10):892–3.

[56] Dahiya, A., Gavin, M., Luo, R., and Dean, D. (2000). Role of the LXCXE binding site in Rb function. Mol Cell Biol, 20(18):6799–805.

[57] Dawson, R., Müller, L., Dehner, A., Klein, C., Kessler, H., and Buchner, J. (2003). The N-terminal domain of p53 is natively unfolded. J Mol Biol, 332(5):1131–41.

[58] de la Paz, M. L. and Serrano, L. (2004). Sequence determinants of amyloid fibril formation. Proc Natl Acad Sci U S A, 101(1):87–92.

[59] de Lichtenberg, U., Jensen, T. S., Jensen, L. J., and Brunak, S. (2003). Protein feature based identification of cell cycle regulated proteins in yeast. J Mol Biol, 329(4):663–74.

[60] Dedmon, M., Patel, C., Young, G., and Pielak, G. (2002). FlgM gains structure in living cells. Proc Natl Acad Sci U S A, 99(20):12681–4.

[61] Deleage, G. and Roux, B. (1987). An algorithm for protein secondary structure prediction based on class prediction. Protein Eng, 1(4):289–94.

[62] Dell’Angelica, E. (2001). Clathrin-binding proteins: got a motif? Join the network! Trends Cell Biol, 11(8):315–8.

[63] Demarest, S., Martinez-Yamout, M., Chung, J., Chen, H., Xu, W., Dyson, H., Evans, R., and Wright, P. (2002). Mutual synergistic folding in recruitment of CBP/p300 by p160 nuclear receptor coactivators. Nature, 415(6871):549–53.

[64] Dobson, C. (2001). The structural basis of protein folding and its links with human disease. Philos Trans R Soc Lond B Biol Sci, 356(1406):133–45.

[65] Dobson, C. (2002a). Protein misfolding and human disease. ScientificWorldJournal, 2(1 Suppl 2):132.

[66] Dobson, C. M. (2002b). Getting out of shape. Nature, 418(6899):729–30.

[67] Doerks, T., Copley, R., Schultz, J., Ponting, C., and Bork, P. (2002). Systematic identification of novel protein domain families associated with nuclear functions. Genome Res, 12(1):47–56.

[68] Dueber, J. E., Yeh, B. J., Chak, K., and Lim, W. A. (2003). Reprogramming control of an allosteric signaling switch through modular recombination. Science, 301(5641):1904–8.

[69] Dunker, A., Brown, C., Lawson, J., Iakoucheva, L., and Obradovic, Z. (2002). Intrinsic disorder and protein function. Biochemistry, 41(21):6573–82.

[70] Dunker, A., Garner, E., Guilliot, S., Romero, P., Albrecht, K., Hart, J., Obradovic, Z., Kissinger, C., and Villafranca, J. (1998). Protein disorder and the evolution of molecular recognition: theory, predictions and observations. Pac Symp Biocomput, pages 473–84.

[71] Dunker, A., Lawson, J., Brown, C., Williams, R., Romero, P., Oh, J., Oldfield, C., Campen, A., Ratliff, C., Hipps, K., Ausio, J., Nissen, M., Reeves, R., Kang, C., Kissinger, C., Bailey, R., Griswold, M., Chiu, W., Garner, E., and Obradovic, Z. (2001). Intrinsically disordered protein. J Mol Graph Model, 19(1):26–59.

[72] Dunker, A., Obradovic, Z., Romero, P., Garner, E., and Brown, C. (2000). Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform, 11:161–71.

[73] Dyson, H. and Wright, P. (1998). Equilibrium NMR studies of unfolded and partially folded proteins. Nat Struct Biol, 5 Suppl:499–503.

[74] Dyson, H. and Wright, P. (2002a). Coupling of folding and binding for unstructured proteins. Curr Opin Struct Biol, 12(1):54–60.

[75] Dyson, H. J. and Wright, P. E. (2002b). Insights into the structure and dynamics of unfolded proteins from nuclear magnetic resonance. Adv Protein Chem, 62:311–40.

[76] Eddy, S. (1998). Profile hidden Markov models. Bioinformatics, 14(9):755–63.

[77] Endter, C., Kzhyshkowska, J., Stauber, R., and Dobner, T. (2001). SUMO-1 modification required for transformation by adenovirus type 5 early region 1B 55-kDa oncoprotein. Proc Natl Acad Sci U S A, 98(20):11312–7.

[78] Endy, D. and Yaffe, M. B. (2003). Signal transduction: molecular monogamy. Nature, 426(6967):614–5.

[79] Evans, P. and Owen, D. (2002). Endocytosis and vesicle trafficking. Curr Opin Struct Biol, 12(6):814–21.

[80] Facchiano, A., Facchiano, A., and Facchiano, F. (2003). Active Sequences Collection (ASC) database: a new tool to assign functions to protein sequences. Nucleic Acids Res, 31(1):379–82.

[81] Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C., Hofmann, K., and Bairoch, A. (2002). The PROSITE database, its status in 2002. Nucleic Acids Res, 30(1):235–8.

[82] Ford, M. G. J., Mills, I. G., Peter, B. J., Vallis, Y., Praefcke, G. J. K., Evans, P. R., and McMahon, H. T. (2002). Curvature of clathrin-coated pits driven by epsin. Nature, 419(6905):361–6.

[83] Friedl, J. E. F. (1997). Mastering Regular Expressions. O’Reilly.

[84] Fuentes, E. J., Der, C. J., and Lee, A. L. (2004). Ligand-dependent dynamics and intramolecular signaling in a PDZ domain. J Mol Biol, 335(4):1105–15.

[85] Garner, E., Cannon, P., Romero, P., Obradovic, Z., and Dunker, A. (1998). Predicting Disordered Regions from Amino Acid Sequence: Common Themes Despite Differing Structural Characterization. Genome Inform Ser Workshop Genome Inform, 9:201–213.

[86] Garner, E., Romero, P., Dunker, A., Brown, C., and Obradovic, Z. (1999). Predicting Binding Regions within Disordered Proteins. Genome Inform Ser Workshop Genome Inform, 10:41–50.

[87] Gasteiger, E., Jung, E., and Bairoch, A. (2001). SWISS-PROT: connecting biomolecular knowledge via a protein database. Curr Issues Mol Biol, 3(3):47–55.

[88] Gatto, Jr, G., Geisbrecht, B., Gould, S., and Berg, J. (2000). Peroxisomal targeting signal-1 recognition by the TPR domains of human PEX5. Nat Struct Biol, 7(12):1091–5.

[89] Gavin, A.-C., Bösche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J. M., Michon, A.-M., Cruciat, C.-M., Remor, M., Höfert, C., Schelder, M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D., Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier, M.-A., Copley, R. R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G., Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G., and Superti-Furga, G. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141–7.

[90] Ginalski, K., Elofsson, A., Fischer, D., and Rychlewski, L. (2003). 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics, 19(8):1015–8.

[91] Giot, L., Bader, J., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y., Ooi, C., Godwin, B., Vitols, E., Vijayadamodar, G., Pochart, P., Machineni, H., Welsh, M., Kong, Y., Zerhusen, B., Malcolm, R., Varrone, Z., Collis, A., Minto, M., Burgess, S., McDaniel, L., Stimpson, E., Spriggs, F., Williams, J., Neurath, K., Ioime, N., Agee, M., Voss, E., Furtak, K., Renzulli, R., Aanensen, N., Carrolla, S., Bickelhaupt, E., Lazovatsky, Y., DaSilva, A., Zhong, J., Stanyon, C., Finley, Jr, R., White, K., Braverman, M., Jarvie, T., Gold, S., Leach, M., Knight, J., Shimkets, R., McKenna, M., Chant, J., and Rothberg, J. (2003). A protein interaction map of Drosophila melanogaster. Science, 302(5651):1727–36.

[92] Gough, J. (2002). The SUPERFAMILY database in structural genomics. Acta Crystallogr D Biol Crystallogr, 58(Pt 11):1897–900.

[93] Guerois, R., Nielsen, J. E., and Serrano, L. (2002). Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol, 320(2):369–87.

[94] Guex, N. and Peitsch, M. (1997). SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis, 18(15):2714–23.

[95] Gunasekaran, K., Tsai, C., Kumar, S., Zanuy, D., and Nussinov, R. (2003). Extended disordered proteins: targeting function with less scaffold. Trends Biochem Sci, 28(2):81–5.

[96] Gupta, R. and Brunak, S. (2002). Prediction of glycosylation across the human proteome and the correlation to protein function. Pac Symp Biocomput, pages 310–22.

[97] Görbitz, C. and Etter, M. (1992). Hydrogen bond connectivity patterns and hydrophobic interactions in crystal structures of small, acyclic peptides. Int J Pept Protein Res, 39(2):93–110.

[98] Hahn, M., Winkler, D., Welfle, K., Misselwitz, R., Welfle, H., Wessner, H., Zahn, G., Scholz, C., Seifert, M., Harkins, R., Schneider-Mergener, J., and Höhne, W. (2001). Cross-reactive binding of cyclic peptides to an anti-TGFalpha antibody Fab fragment: an X-ray structural and thermodynamic analysis. J Mol Biol, 314(2):293–309.

[99] Harrison, A., Pearl, F., Sillitoe, I., Slidel, T., Mott, R., Thornton, J., and Orengo, C. (2003). Recognizing the fold of a protein structure. Bioinformatics, 19(14):1748–59.

[100] Heery, D., Kalkhoven, E., Hoare, S., and Parker, M. (1997). A signature motif in transcriptional co-activators mediates binding to nuclear receptors. Nature, 387(6634):733–6.

[101] Heger, A. and Holm, L. (2003). Exhaustive enumeration of protein domain families. J Mol Biol, 328(3):749–67.

[102] Hegger, R., Kantz, H., and Schreiber, T. (1999). Practical implementation of nonlinear time series methods: The TISEAN package. Chaos, 9(413).

[103] Henikoff, J., Greene, E., Pietrokovski, S., and Henikoff, S. (2000). Increased coverage of protein families with the blocks database servers. Nucleic Acids Res, 28(1):228–30.

[104] Henikoff, J. and Henikoff, S. (1996). Using substitution probabilities to improve position-specific scoring matrices. Comput Appl Biosci, 12(2):135–43.

[105] Hermjakob, H., Montecchi-Palazzi, L., Bader, G., Wojcik, J., Salwinski, L., Ceol, A., Moore, S., Orchard, S., Sarkans, U., von Mering, C., Roechert, B., Poux, S., Jung, E., Mersch, H., Kersey, P., Lappe, M., Li, Y., Zeng, R., Rana, D., Nikolski, M., Husi, H., Brun, C., Shanker, K., Grant, S. G. N., Sander, C., Bork, P., Zhu, W., Pandey, A., Brazma, A., Jacq, B., Vidal, M., Sherman, D., Legrain, P., Cesareni, G., Xenarios, I., Eisenberg, D., Steipe, B., Hogue, C., and Apweiler, R. (2004). The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. Nat Biotechnol, 22(2):177–83.

[106] Hill, D., Blake, J., Richardson, J., and Ringwald, M. (2002). Extension and integration of the gene ontology (GO): combining GO vocabularies with external vocabularies. Genome Res, 12(12):1982–91.

[107] Hon, W.-C., Wilson, M. I., Harlos, K., Claridge, T. D. W., Schofield, C. J., Pugh, C. W., Maxwell, P. H., Ratcliffe, P. J., Stuart, D. I., and Jones, E. Y. (2002). Structural basis for the recognition of hydroxyproline in HIF-1 alpha by pVHL. Nature, 417(6892):975–8.

[108] Hooft, R., Sander, C., Scharf, M., and Vriend, G. (1996). The PDBFINDER database: a summary of PDB, DSSP and HSSP information with added value. Comput Appl Biosci, 12(6):525–9.

[109] Hu, H., Columbus, J., Zhang, Y., Wu, D., Lian, L., Yang, S., Goodwin, J., Luczak, C., Carter, M., Chen, L., James, M., Davis, R., Sudol, M., Rodwell, J., and Herrero, J. J. (2004). A map of WW domain family interactions. Proteomics, 4(3):643–55.

[110] Huntley, M. A. and Golding, G. B. (2002). Simple sequences are rare in the Protein Data Bank. Proteins, 48(1):134–40.

[111] Iakoucheva, L., Brown, C., Lawson, J., Obradovic, Z., and Dunker, A. (2002). Intrinsic disorder in cell-signaling and cancer-associated proteins. J Mol Biol, 323(3):573–84.

[112] Iakoucheva, L. M., Radivojac, P., Brown, C. J., O’Connor, T. R., Sikes, J. G., Obradovic, Z., and Dunker, A. K. (2004). The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res, 32(3):1037–49.

[113] Irie, T., Licata, J. M., McGettigan, J. P., Schnell, M. J., and Harty, R. N. (2004). Budding of PPxY-Containing Rhabdoviruses Is Not Dependent on Host Proteins TGS101 and VPS4A. J Virol, 78(6):2657–2665.

[114] Iwanishi, M. (2003). Overexpression of Fer increases the association of tyrosine-phosphorylated IRS-1 with P85 phosphatidylinositol kinase via SH2 domain of Fer in transfected cells. Biochem Biophys Res Commun, 311(3):780–5.

[115] Jensen, L., Gupta, R., Staerfeldt, H.-H., and Brunak, S. (2003a). Prediction of human protein function according to Gene Ontology categories. Bioinformatics, 19(5):635–42.

[116] Jensen, L., Ussery, D., and Brunak, S. (2003b). Functionality of System Components: Conservation of Protein Function in Protein Feature Space. Genome Res.

[117] Jensen, L. J., Gupta, R., Blom, N., Devos, D., Tamames, J., Kesmir, C., Nielsen, H., Stærfeldt, H. H., Rapacki, K., Workman, C., Andersen, C. A. F., Knudsen, S., Krogh, A., Valencia, A., and Brunak, S. (2002). Prediction of human protein function from post-translational modifications and localization features. J. Mol. Biol., 319:1257–1265.

[118] Jonassen, I., Eidhammer, I., and Taylor, W. (1999). Discovery of local packing motifs in protein structures. Proteins, 34(2):206–19.

[119] Jones, D. T. and Ward, J. J. (2003). Prediction of disordered regions in proteins from position specific score matrices. Proteins, 53 Suppl 6:573–8.

[120] Kabsch, W. and Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577–637.

[121] Kamemura, K. and Hart, G. W. (2003). Dynamic interplay between O-glycosylation and O-phosphorylation of nucleocytoplasmic proteins: a new paradigm for metabolic control of signal transduction and transcription. Prog Nucleic Acid Res Mol Biol, 73:107–36.

[122] Kammerer, Kostrewa, Zurdo, Detken, García-Echeverría, Green, Müller, Meier, Winkler, Dobson, and Steinmetz (2004). Exploring amyloid formation by a de novo design. Proc Natl Acad Sci U S A.

[123] Kaplan, B., Ratner, V., and Haas, E. (2003). alpha-Synuclein: Its Biological Function and Role in Neurodegenerative Diseases. J Mol Neurosci, 20(2):83–92.

[124] Kendrew, J., Dickerson, R., Strandberg, B., Hart, R., and Davies, D. (1960). Structure of myoglobin: a three-dimensional Fourier synthesis at 2 Å resolution. Nature, 185:422–427.

[125] Kessels, H. W. H. G., Ward, A. C., and Schumacher, T. N. M. (2002). Specificity and affinity motifs for Grb2 SH2-ligand interactions. Proc Natl Acad Sci U S A, 99(13):8524–9.

[126] Klein-Seetharaman, J., Oikawa, M., Grimshaw, S., Wirmer, J., Duchardt, E., Ueda, T., Imoto, T., Smith, L., Dobson, C., and Schwalbe, H. (2002). Long-range interactions within a nonnative protein. Science, 295(5560):1719–22.

[127] Koshland, D. (1998). Conformational changes: how small is big enough? Nat Med, 4(10):1112–4.

[128] Kreegipuu, A., Blom, N., and Brunak, S. (1999). PhosphoBase, a database of phosphorylation sites: release 2.0. Nucleic Acids Res, 27(1):237–9.

[129] Krieger, F., Fierz, B., Bieri, O., Drewello, M., and Kiefhaber, T. (2003). Dynamics of unfolded polypeptide chains as model for the earliest steps in protein folding. J Mol Biol, 332(1):265–74.

[130] Kuhlman, B., Dantas, G., Ireton, G. C., Varani, G., Stoddard, B. L., and Baker, D. (2003). Design of a novel globular protein fold with atomic-level accuracy. Science, 302(5649):1364–8.

[131] Kussie, P., Gorina, S., Marechal, V., Elenbaas, B., Moreau, J., Levine, A., and Pavletich, N. (1996). Structure of the MDM2 oncoprotein bound to the p53 tumor suppressor transactivation domain. Science, 274(5289):948–53.

[132] Lacroix, E., Viguera, A., and Serrano, L. (1998). Elucidating the folding problem of alpha-helices: local motifs, long-range electrostatics, ionic-strength dependence and prediction of NMR parameters. J Mol Biol, 284(1):173–91.

[133] Letunic, I., Copley, R., Schmidt, S., Ciccarelli, F., Doerks, T., Schultz, J., Ponting, C., and Bork, P. (2004). SMART 4.0: towards genomic data integration. Nucleic Acids Res, 32(1):D142–4.

[134] Letunic, I., Goodstadt, L., Dickens, N., Doerks, T., Schultz, J., Mott, R., Ciccarelli, F., Copley, R., Ponting, C., and Bork, P. (2002). Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res, 30(1):242–4.

[135] Li, W., Jaroszewski, L., and Godzik, A. (2002). Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18(1):77–82.

[136] Li, X., Obradovic, Z., Brown, C., Garner, E., and Dunker, A. (2000). Comparing predictors of disordered protein. Genome Inform Ser Workshop Genome Inform, 11:172–84.

[137] Licata, J. M., Simpson-Holley, M., Wright, N. T., Han, Z., Paragas, J., and Harty, R. N. (2003). Overlapping motifs (PTAP and PPEY) within the Ebola virus VP40 protein function independently as late budding domains: involvement of host proteins TSG101 and VPS-4. J Virol, 77(3):1812–9.

[138] Linding, R., Jensen, L., Diella, F., Bork, P., Gibson, T., and Russell, R. (2003a). Protein Disorder Prediction: Implications for Structural Proteomics. Structure, 11(11):1453–1459.

[139] Linding, R., Russell, R., Neduva, V., and Gibson, T. (2003b). GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res, 31(13):3701–8.

[140] Liu, J., Tan, H., and Rost, B. (2002). Loopy proteins appear conserved in evolution. J Mol Biol, 322(1):53–64.

[141] Lo Conte, L., Brenner, S., Hubbard, T., Chothia, C., and Murzin, A. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res, 30(1):264–7.

[142] Lopez De La Paz, M., Goldie, K., Zurdo, J., Lacroix, E., Dobson, C., Hoenger, A., and Serrano, L. (2002). De novo designed peptide-based amyloid fibrils. Proc Natl Acad Sci U S A, 99(25):16052–7.

[143] Lopez Garcia, F., Zahn, R., Riek, R., and Wuthrich, K. (2000). NMR structure of the bovine prion protein. Proc Natl Acad Sci U S A, 97(15):8334–9.

[144] Lovell, S. C. (2003). Are non-functional, unfolded proteins (’junk proteins’) common in the genome? FEBS Lett, 554(3):237–9.

[145] Lubman, O. Y. and Waksman, G. (2003). Structural and thermodynamic basis for the interaction of the Src SH2 domain with the activated form of the PDGF beta-receptor. J Mol Biol, 328(3):655–68.

[146] Luger, K., Mäder, A., Richmond, R., Sargent, D., and Richmond, T. (1997). Crystal structure of the nucleosome core particle at 2.8 Å resolution. Nature, 389(6648):251–60.

[147] Luger, K. and Richmond, T. (1998). The histone tails of the nucleosome. Curr Opin Genet Dev, 8(2):140–6.

[148] Marchler-Bauer, A., Anderson, J., DeWeese-Scott, C., Fedorova, N., Geer, L., He, S., Hurwitz, D., Jackson, J., Jacobs, A., Lanczycki, C., Liebert, C., Liu, C., Madej, T., Marchler, G., Mazumder, R., Nikolskaya, A., Panchenko, A., Rao, B., Shoemaker, B., Simonyan, V., Song, J., Thiessen, P., Vasudevan, S., Wang, Y., Yamashita, R., Yin, J., and Bryant, S. (2003). CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res, 31(1):383–7.

[149] Margulies, D. H. (2003). Molecular interactions: stiff or floppy (or somewhere in between?). Immunity, 19(6):772–4.

[150] Marsden, R., McGuffin, L., and Jones, D. (2002). Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Sci, 11(12):2814–24.

[151] Melamud, E. and Moult, J. (2003). Evaluation of disorder predictions in CASP5. Proteins, 53 Suppl 6:561–5.

[152] Melchior, F. (2000). SUMO–nonclassical ubiquitin. Annu Rev Cell Dev Biol, 16:591–626.

[153] Miele, A. E., Watson, P. J., Evans, P. R., Traub, L. M., and Owen, D. J. (2004). Two distinct interaction motifs in amphiphysin bind two independent sites on the clathrin terminal domain beta-propeller. Nat Struct Mol Biol, 11(3):242–8.

[154] Motley, A., Hettema, E., Ketting, R., Plasterk, R., and Tabak, H. (2000). Caenorhabditis elegans has a single pathway to target matrix proteins to peroxisomes. EMBO Rep, 1(1):40–6.

[155] Mott, R. (2000). Accurate formula for P-values of gapped local sequence and profile alignments. J Mol Biol, 300(3):649–59.

[156] Mulder, N., Apweiler, R., Attwood, T., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., Bucher, P., Copley, R., Courcelle, E., Das, U., Durbin, R., Falquet, L., Fleischmann, W., Griffiths-Jones, S., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lopez, R., Letunic, I., Lonsdale, D., Silventoinen, V., Orchard, S., Pagni, M., Peyruc, D., Ponting, C., Selengut, J., Servant, F., Sigrist, C., Vaughan, R., and Zdobnov, E. (2003). The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res, 31(1):315–8.

[157] Muller, S., Hoege, C., Pyrowolakis, G., and Jentsch, S. (2001). SUMO, ubiquitin’s mysterious cousin. Nat Rev Mol Cell Biol, 2(3):202–10.

[158] Nair, R. and Rost, B. (2003a). Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins, 53(4):917–30.

[159] Nair, R. and Rost, B. (2003b). LOC3D: annotate sub-cellular localization for protein structures. Nucleic Acids Res, 31(13):3337–40.

[160] Ng, S.-K., Zhang, Z., and Tan, S.-H. (2003). Integrative approach for computationally inferring protein domain interactions. Bioinformatics, 19(8):923–9.

[161] Nielsen, H., Brunak, S., and von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng, 12(1):3–9.

[162] Nilsson, M. and Dobson, C. (2003). In vitro characterization of lactoferrin aggregation and amyloid formation. Biochemistry, 42(2):375–82.

[163] Obenauer, J., Cantley, L., and Yaffe, M. (2003). Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res, 31(13):3635–41.

[164] Orengo, C., Pearl, F., and Thornton, J. (2003). The CATH domain structure database. Methods Biochem Anal, 44:249–71.

[165] Owen, D. and Luzio, J. (2000). Structural insights into clathrin-mediated endocytosis. Curr Opin Cell Biol, 12(4):467–74.

[166] Palencia, A., Cobos, E. S., Mateo, P. L., Martínez, J. C., and Luque, I. (2004). Thermodynamic dissection of the binding energetics of proline-rich peptides to the Abl-SH3 domain: implications for rational ligand design. J Mol Biol, 336(2):527–37.

[167] Pappu, R., Srinivasan, R., and Rose, G. (2000). The Flory isolated-pair hypothesis is not valid for polypeptide chains: implications for protein folding. Proc Natl Acad Sci U S A, 97(23):12565–70.

[168] Paroush, Z., Finley, Jr, R., Kidd, T., Wainwright, S., Ingham, P., Brent, R., and Ish-Horowicz, D. (1994). Groucho is required for Drosophila neurogenesis, segmentation, and sex determination and interacts directly with hairy-related bHLH proteins. Cell, 79(5):805–15.

[169] Parthasarathy, S. and Murthy, M. (2000). Protein thermal stability: insights from atomic displacement parameters (B values). Protein Eng, 13(1):9–13.

[170] Pawson, T. (2004). Specificity in signal transduction: from phosphotyrosine-SH2 domain interactions to complex cellular systems. Cell, 116(2):191–203.

[171] Pei, J. and Grishin, N. (2001). AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics, 17(8):700–12.

[172] Peri, S., Navarro, J. D., Kristiansen, T. Z., Amanchy, R., Surendranath, V., Muthusamy, B., Gandhi, T. K. B., Chandrika, K., Deshpande, N., Suresh, S., Rashmi, B., Shanker, K., Padma, N., Niranjan, V., Harsha, H., Talreja, N., Vrushabendra, B., Ramya, M., Yatish, A., Joy, M., Shivashankar, H., Kavitha, M., Menezes, M., Choudhury, D. R., Ghosh, N., Saravana, R., Chandran, S., Mohan, S., Jonnalagadda, C. K., Prasad, C., Kumar-Sinha, C., Deshpande, K. S., and Pandey, A. (2004). Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res, 32 Database issue:D497–501.

[173] Petersen, S., Bohr, H., Bohr, J., Brunak, S., Cotterill, R., Fredholm, H., and Lautrup, B. (1990). Training neural networks to analyse biological sequences. Trends Biotechnol, 8(11):304–8.

[174] Petock, J. M., Torshin, I. Y., Weber, I. T., and Harrison, R. W. (2003). Analysis of protein structures reveals regions of rare backbone conformation at functional sites. Proteins, 53(4):872–9.

[175] Pollastri, G., Przybylski, D., Rost, B., and Baldi, P. (2002). Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47(2):228–35.

[176] Pornillos, O., Alam, S. L., Davis, D. R., and Sundquist, W. I. (2002). Structure of the Tsg101 UEV domain in complex with the PTAP motif of the HIV-1 p6 protein. Nat Struct Biol, 9(11):812–7.

[177] Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. (2002). Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press, second edition.

[178] Promponas, V., Enright, A., Tsoka, S., Kreil, D., Leroy, C., Hamodrakas, S., Sander, C., and Ouzounis, C. (2000). CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics, 16(10):915–22.

[179] Puntervoll, P., Linding, R., Gemund, C., Chabanis-Davidson, S., Mattingsdal, M., Cameron, S., Martin, D., Ausiello, G., Brannetti, B., Costantini, A., Ferre, F., Maselli, V., Via, A., Cesareni, G., Diella, F., Superti-Furga, G., Wyrwicz, L., Ramu, C., McGuigan, C., Gudavalli, R., Letunic, I., Bork, P., Rychlewski, L., Kuster, B., Helmer-Citterich, M., Hunter, W., Aasland, R., and Gibson, T. (2003). ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res, 31(13):3625–30.

[180] Radhakrishnan, I., Perez-Alvarado, G., Parker, D., Dyson, H., Montminy, M., and Wright, P. (1997). Solution structure of the KIX domain of CBP bound to the transactivation domain of CREB: a model for activator:coactivator interactions. Cell, 91(6):741–52.

[181] Ramu, C. (2003). SIRW: A web server for the Simple Indexing and Retrieval System that combines sequence motif searches with keyword searches. Nucleic Acids Res, 31(13):3771–4.

[182] Romero, P., Obradovic, Z., Kissinger, C. R., Villafranca, J., and Dunker, A. (1997). Identifying disordered proteins from amino acid sequences. Proc. IEEE Int. Conf. Neural Networks, 1:90–95.

[183] Rost, B. and Liu, J. (2003). The PredictProtein server. Nucleic Acids Res, 31(13):3300–4.

[184] Rousseau, F., Schymkowitz, J., Wilkinson, H., and Itzhaki, L. (2001). Three-dimensional domain swapping in p13suc1 occurs in the unfolded state and is controlled by conserved proline residues. Proc Natl Acad Sci U S A, 98(10):5596–601.

[185] Rousseau, F., Schymkowitz, J. W. H., and Itzhaki, L. S. (2003). The unfolding story of three-dimensional domain swapping. Structure (Camb), 11(3):243–51.

[186] Rudino-Pinera, E., Morales-Arrieta, S., Rojas-Trejo, S., and Horjales, E. (2002). Structural flexibility, an essential component of the allosteric activation in Escherichia coli glucosamine-6-phosphate deaminase. Acta Crystallogr D Biol Crystallogr, 58(Pt 1):10–20.

[187] Russell, R. and Ponting, C. (1998). Protein fold irregularities that hinder sequence analysis. Curr Opin Struct Biol, 8(3):364–71.

[188] Rustandi, R., Baldisseri, D., and Weber, D. (2000). Structure of the negative regulatory domain of p53 bound to S100B(betabeta). Nat Struct Biol, 7(7):570–4.

[189] Rychlewski, L., Jaroszewski, L., Li, W., and Godzik, A. (2000). Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci, 9(2):232–41.

[190] Saqi, M. and Sternberg, M. (1994). Identification of sequence motifs from a set of proteins with related function. Protein Eng, 7(2):165–71.

[191] Schoch, S., Cibelli, G., Magin, A., Steinmüller, L., and Thiel, G. (2001). Modular structure of cAMP response element binding protein 2 (CREB2). Neurochem Int, 38(7):601–8.

[192] Schultz, J., Milpetz, F., Bork, P., and Ponting, C. (1998). SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A, 95(11):5857–64.

[193] Schwede, T., Kopp, J., Guex, N., and Peitsch, M. (2003). SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Res, 31(13):3381–5.

[194] Schweers, O., Schönbrunn-Hanebeck, E., Marx, A., and Mandelkow, E. (1994). Structural studies of tau protein and Alzheimer paired helical filaments show no evidence for beta-structure. J Biol Chem, 269(39):24290–7.

[195] Serpell, L., Blake, C., and Fraser, P. (2000). Molecular structure of a fibrillar Alzheimer’s A beta fragment. Biochemistry, 39(43):13269–75.

[196] Serpell, L., Sunde, M., and Blake, C. (1997). The molecular basis of amyloidosis. Cell Mol Life Sci, 53(11-12):871–87.

[197] Serpell, L., Sunde, M., Fraser, P., Luther, P., Morris, E., Sangren, O., Lundgren, E., and Blake, C. (1995). Examination of the structure of the transthyretin amyloid fibril by image reconstruction from electron micrographs. J Mol Biol, 254(2):113–8.

[198] Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J., Peyruc, D., and Kahn, D. (2002). ProDom: automated clustering of homologous domains. Brief Bioinform, 3(3):246–51.

[199] Shortle, D. and Ackerman, M. (2001). Persistence of native-like topology in a denatured protein in 8 M urea. Science, 293(5529):487–9.

[200] Sigrist, C., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A., and Bucher, P. (2002). PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform, 3(3):265–74.

[201] Smith, D., Radivojac, P., Obradovic, Z., Dunker, A., and Zhu, G. (2003). Improved amino acid flexibility parameters. Protein Sci, 12(5):1060–72.

[202] Smith, L., Bolin, K., Schwalbe, H., MacArthur, M., Thornton, J., and Dobson, C. (1996a). Analysis of main chain torsion angles in proteins: prediction of NMR coupling constants for native and random coil conformations. J Mol Biol, 255(3):494–506.

[203] Smith, L., Fiebig, K., Schwalbe, H., and Dobson, C. (1996b). The concept of a random coil. Residual structure in peptides and denatured proteins. Fold Des, 1(5):R95–106.

[204] Smith, T. and Waterman, M. (1981). Identification of common molecular subsequences. J Mol Biol, 147(1):195–7.

[205] Smyth, E., Syme, C., Blanch, E., Hecht, L., Vasak, M., and Barron, L. (2001). Solution structure of native proteins with irregular folds from Raman optical activity. Biopolymers, 58(2):138–51.

[206] Stefani, M. and Dobson, C. (2003). Protein aggregation and aggregate toxicity: new insights into protein folding, misfolding diseases and biological evolution. J Mol Med.

[207] Strahl, B. and Allis, C. (2000). The language of covalent histone modifications. Nature, 403(6765):41–5.

[208] Suyama, M. and Ohara, O. (2003). DomCut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics, 19(5):673–4.

[209] Taylor, W. (1999). Protein structural domain identification. Protein Eng, 12(3):203–16.

[210] Taylor, W. (2000). A deeply knotted protein structure and how it might fold. Nature, 406(6798):916–9.

[211] Taylor, W. (2001). Defining linear segments in protein structure. J Mol Biol, 310(5):1135–50.

[212] Taylor, W. and Lin, K. (2003). Protein knots: A tangled problem. Nature, 421(6918):25.

[213] Teilum, K., Kragelund, B., and Poulsen, F. (2002). Transient structure formation in unfolded acyl-coenzyme A-binding protein observed by site-directed spin labelling. J Mol Biol, 324(2):349–57.

[214] Thompson, P. R., Wang, D., Wang, L., Fulco, M., Pediconi, N., Zhang, D., An, W., Ge, Q., Roeder, R. G., Wong, J., Levrero, M., Sartorelli, V., Cotter, R. J., and Cole, P. A. (2004). Regulation of the p300 HAT domain via a novel activation loop. Nat Struct Mol Biol.

[215] Tompa, P. (2002). Intrinsically unstructured proteins. Trends Biochem Sci, 27(10):527–33.

[216] Tompa, P. (2003). Intrinsically unstructured proteins evolve by repeat expansion. Bioessays, 25(9):847–55.

[217] Torshin, I. Y. (2002). Structural criteria of biologically active RGD-sites for analysis of protein cellular function - a bioinformatics study. Med Sci Monit, 8(8):BR301–12.

[218] Turner, B. (2002). Cellular memory and the histone code. Cell, 111(3):285–91.

[219] Uversky, V. (2002). Natively unfolded proteins: a point where biology waits for physics. Protein Sci, 11(4):739–56.

[220] Vaziri, H., Dessain, S., Eaton, E. N., Imai, S., Frye, R., Pandita, T., Guarente, L., and Weinberg, R. (2001). hSIR2(SIRT1) functions as an NAD-dependent p53 deacetylase. Cell, 107(2):149–59.

[221] Verkhivker, G., Bouzida, D., Gehlhaar, D., Rejto, P., Freer, S., and Rose, P. (2003). Simulating disorder-order transitions in molecular recognition of unstructured proteins: where folding meets binding. Proc Natl Acad Sci U S A, 100(9):5148–53.

[222] Vihinen, M., Torkkila, E., and Riikonen, P. (1994). Accuracy of protein flexibility predictions. Proteins, 19(2):141–9.

[223] Viles, J., Donne, D., Kroon, G., Prusiner, S., Cohen, F., Dyson, H., and Wright, P. (2001). Local structural plasticity of the prion protein. Analysis of NMR relaxation dynamics. Biochemistry, 40(9):2743–53.

[224] von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P., and Snel, B. (2003). STRING: a database of predicted functional associations between proteins. Nucleic Acids Res, 31(1):258–61.

[225] Wampler, J. (1997). Distribution analysis of the variation of B-factors of X-ray crystal structures; temperature and structural variations in lysozyme. J Chem Inf Comput Sci, 37(6):1171–80.

[226] Wang, H., Machesky, N. J., and Mansky, L. M. (2004). Both the PPPY and PTAP motifs are involved in human T-cell leukemia virus type 1 particle release. J Virol, 78(3):1503–12.

[227] Warbrick, E. (2000). The puzzle of PCNA’s many partners. Bioessays, 22(11):997–1006.

[228] Ward, J., McGuffin, L., Buxton, B., and Jones, D. (2003). Secondary structure prediction with support vector machines. Bioinformatics, 19(13):1650–5.

[229] Weinberg, G. (1999). The Complete Reference SQL. McGraw Hill.

[230] Wells, L. and Hart, G. W. (2003). O-GlcNAc turns twenty: functional implications for post-translational modification of nuclear and cytosolic proteins with a sugar. FEBS Lett, 546(1):154–8.

[231] Westbrook, J., Feng, Z., Chen, L., Yang, H., and Berman, H. (2003). The Protein Data Bank and structural genomics. Nucleic Acids Res, 31(1):489–91.

[232] Wheeler, D., Church, D., Lash, A., Leipe, D., Madden, T., Pontius, J., Schuler, G., Schriml, L., Tatusova, T., Wagner, L., and Rapp, B. (2002). Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res, 30(1):13–6.

[233] Wilkins, D., Grimshaw, S., Receveur, V., Dobson, C., Jones, J., and Smith, L. (1999). Hydrodynamic radii of native and denatured proteins measured by pulse field gradient NMR techniques. Biochemistry, 38(50):16424–31.

[234] Williams, R., Obradovic, Z., Mathura, V., Braun, W., Garner, E., Young, J., Takayama, S., Brown, C., and Dunker, A. (2001). The protein non-folding problem: amino acid determinants of intrinsic order and disorder. Pac Symp Biocomput, pages 89–100.

[235] Wootton, J. (1994). Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem, 18(3):269–85.

[236] Wright, P. and Dyson, H. (1999). Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol, 293(2):321–31.

[237] Xenarios, I. and Eisenberg, D. (2001). Protein interaction databases. Curr Opin Biotechnol, 12(4):334–9.

[238] Xenarios, I., Salwinski, L., Duan, X., Higney, P., Kim, S., and Eisenberg, D. (2002). DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res, 30(1):303–5.

[239] Xiong, J., Stehle, T., Diefenbach, B., Zhang, R., Dunker, R., Scott, D., Joachimiak, A., Goodman, S., and Arnaout, M. (2001). Crystal structure of the extracellular segment of integrin alpha Vbeta3. Science, 294(5541):339–45.

[240] Xiong, J.-P., Stehle, T., Zhang, R., Joachimiak, A., Frech, M., Goodman, S. L., and Arnaout, M. A. (2002). Crystal structure of the extracellular segment of integrin alpha Vbeta3 in complex with an Arg-Gly-Asp ligand. Science, 296(5565):151–5.

[241] Yaffe, M., Leparc, G., Lai, J., Obata, T., Volinia, S., and Cantley, L. (2001). A motif-based profile scanning approach for genome-wide prediction of signaling pathways. Nat Biotechnol, 19(4):348–53.

[242] Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., Helmer-Citterich, M., and Cesareni, G. (2002). MINT: a Molecular INTeraction database. FEBS Lett, 513(1):135–40.

[243] Zarrinpar, A., Park, S.-H., and Lim, W. A. (2003). Optimization of specificity in a cellular protein interaction network by negative selection. Nature, 426(6967):676–80.

[244] Zoete, V., Michielin, O., and Karplus, M. (2002). Relation between sequence and structure of HIV-1 protease inhibitor complexes: a model system for the analysis of protein flexibility. J Mol Biol, 315(1):21–52.