TECHNISCHE UNIVERSITÄT MÜNCHEN. Master's thesis to obtain the title Master of Science, Molecular Biotechnology, of the Technical University Munich, defended by Andrea Rinck: Bioinformatical analysis of extraordinarily large, prokaryotic proteins. Supervisor: Dr. Anita Kriško, Institut National de la Santé et de la Recherche Médicale, Paris. Professor (extern): Prof. Dr. Ivo F. Sbalzarini, Institute for Computational Science, ETH Zurich. Professor (intern): Prof. Dr. Dmitrij Frishman, Department of Genome-Oriented Bioinformatics, TU Munich. February 9, 2009


TECHNISCHE UNIVERSITÄT MÜNCHEN

MASTER'S THESIS
to obtain the title

Master of Science
Molecular Biotechnology

of the Technical University Munich

defended by

Andrea Rinck

Bioinformatical analysis of extraordinarily large, prokaryotic proteins

Supervisor:
Dr. Anita Kriško
Institut National de la Santé et de la Recherche Médicale, Paris

Professor (extern):
Prof. Dr. Ivo F. Sbalzarini
Institute for Computational Science, ETH Zurich

Professor (intern):
Prof. Dr. Dmitrij Frishman
Department of Genome-Oriented Bioinformatics, TU Munich

February 9, 2009


Zusammenfassung (Summary)

Background: Prokaryotic proteins are linear amino acid chains which, in contrast to eukaryotic polypeptides, originate from organisms of one of the two other biological domains, archaea or bacteria. Despite highly diverse cellular functions, most prokaryotic proteins have a similar length of about 300 amino acids. Occasionally, however, they reach astonishing dimensions of far more than 1000 residues and, despite immense synthesis costs for the cells, achieve evolutionary stability. The study of these so-called protein giants is therefore a highly interesting field of research.

Content: In the course of this work, the frequency of particularly long proteins was determined for all available prokaryotic proteomes (HAMAP), and differences in the amino acid distribution between protein sizes were identified. A comparison depending on sequence length was also carried out at the level of secondary structure. Finally, one of the largest archaeal proteins, halomucin (9159 amino acids) from the halophilic archaeon Haloquadratum walsbyi, was examined in somewhat more detail. By analyzing its organization into protein domains and its spatial structure, and by deriving potential protein functions from these, an attempt was made to relate the enormous protein size and its functionality to the extreme habitat of H. walsbyi.

Results and interpretation: It should be mentioned first that prokaryotic proteins with sequence lengths greater than 1000 amino acids make up only about one percent of the dataset examined, which in part results in questionable statistics. The examination of the amino acid distributions and the analyses at the secondary structure level revealed differences between the generally rather extremophilic archaea and the, taken together, more multifaceted bacteria, as well as between the majority of small proteins and the rare large polypeptide sequences. Long prokaryotic proteins generally show increased flexibility and hydrophilicity, which entails both a stronger tendency toward intrinsic disorder and a larger ratio of protein surface to protein volume. It is therefore conceivable that, instead of taking on specific enzymatic functions, they rather mediate the recognition, binding and regulation of diverse ligands. In addition, they could provide delimited reaction spaces in the cell interior of prokaryotes, which is not compartmentalized by membranes. A possible role in generating extracellular micromilieus is a further option. The protein colossus halomucin is probably an extracellular polypeptide which, owing to its structural prerequisites, could have further functions beyond its described contribution to building a cellular surface structure that protects against desiccation and maintains the square cell morphology. These would be an involvement in cell adhesion and/or enzymatic degradation processes.

Outlook: A larger body of data on large prokaryotic proteins, made possible by the availability of more, reliably annotated proteome information, together with a closer examination of individual relationships between protein sizes, structures and functions, as well as the evolutionary developments and environmental conditions of the corresponding organisms, will lead to a better understanding of the cellular significance of these fascinating protein giants.


Abstract

Background: Prokaryotic proteins are linear chains of amino acids which, unlike eukaryotic polypeptides, originate from one of the two other biological domains, bacteria or archaea. Despite the huge variety of cellular functions fulfilled by prokaryotic proteins, their general size is rather similar, comprising around 300 amino acids. However, a few so-called giant proteins reach lengths of far more than 1000 amino acids. Since their synthesis is associated with huge production costs for the cells, it is intriguing to analyze why they sometimes achieve evolutionary stability.

Content: All available prokaryotic full proteomes (HAMAP) were tested for the abundance of large proteins, and the amino acid compositions of protein sets of different lengths were compared. Furthermore, a comparison based on sequence lengths was carried out at the level of primary structure. After these preliminary analyses I focused my studies on the extraordinarily large protein halomucin (9159 amino acids) of the square halophilic archaeon Haloquadratum walsbyi. This protein giant represents one of the largest archaeal polypeptides. By approaches to elucidate its function, such as gaining insights into its structure through an analysis of its domain organization, the aim was to draw a potential correlation between that function, halomucin's size and the extreme environmental niche of H. walsbyi.

Results and conclusion: First of all, it has to be pointed out that prokaryotic proteins with sequence lengths equal to or larger than 1000 amino acids make up only one percent of the total dataset, resulting in an occasionally questionable reproducibility. In the amino acid composition and primary structure analyses, differences could be observed between the mainly more extremophilic archaea and the rather diverse bacteria, as well as between the bulk of small proteins and the rare polypeptide giants. In general, large prokaryotic proteins appear to be more flexible and hydrophilic, featuring a higher degree of intrinsic disorder and a larger surface to volume ratio. Hence they might not perform specific enzymatic activities, but rather the recognition, binding and modulation of multiple ligands. Further, the provision of separated cytoplasmic reaction spaces, compensating for the prokaryotic lack of eukaryote-like cellular compartmentalization, as well as the generation of extracellular micromilieus, come into consideration. It can be assumed that the halophilic archaeal protein giant halomucin represents an extracellular polypeptide which, according to its structural potential, might be able to perform further functions next to its supposed roles as an aqueous shield against desiccation and as a superficial framework accounting for the unique square cell morphology of H. walsbyi. These could include an involvement in cellular adhesion and/or degradation processes.

Outlook: An increased dataset of large prokaryotic proteins, obtained from more, reliably annotated proteome data, as well as a closer examination of individual correlations between the proteins' sizes, structures and functions, the evolutionary pressures and the environmental demands of the harboring cells, will provide a better insight into the fascinating world of protein giants.


Declaration of Authorship

Andrea Rinck
Friedrich-Engels-Str. 90
07749 [email protected] no: 2452494

I, Andrea Rinck, declare that this thesis titled 'Bioinformatical analysis of extraordinarily large, prokaryotic proteins' and the work presented in it is, to the best of my knowledge and belief, original and the result of my own investigations, except as acknowledged. Further, it has not been submitted, either in part or whole, for a degree at this or any other University.

February 9, 2009

Andrea Rinck


Acknowledgments

First, I would like to express my gratitude to my supervisor, Dr. Anita Kriško (INSERM, Paris), who through her expertise, encouragement, patience and kindness contributed substantially to the performance of my master project. She provided me with her skills in many areas, time out from her busy schedule and direction at hard times, and became more of a mentor and a good friend than a supervisor for me.

I must also acknowledge Prof. Dr. Ivo F. Sbalzarini (Institute for Computational Science, ETH Zurich) for offering me this research project as part of the 'SiROP, Student Research Opportunities Program' of the ETH Zurich.

A very special thanks goes out to Prof. Dr. Dmitrij Frishman (Department of Genome-Oriented Bioinformatics, TU Munich), who agreed to serve as my TUM-internal supervisor and whose motivation and prudence supported me at all levels of my project.

Likewise, I have to mention and greatly appreciate Dr. Bojan Zagrovic, for the inspiring scientific conversations, Zlatko Smole, for his support while I was learning to program with Python, Nela Nikolic, Christian Muller and all scientists at the Mediterranean Institute For Life Sciences in Split, Croatia, as well as my research group at the Institut National de la Santé et de la Recherche Médicale in Paris, France, for the exchanges of knowledge, their open reception and real friendship, which turned my master abroad into an unforgettable experience.


Contents

1 Introduction
  1.1 The three domains of life
    1.1.1 Prokaryotes
      1.1.1.1 Bacteria
      1.1.1.2 Archaea
  1.2 Haloquadratum walsbyi
    1.2.1 Halomucin
  1.3 Proteins
    1.3.1 From 1D sequence to 3D structure
    1.3.2 Natively unfolded proteins and intrinsically disordered regions
    1.3.3 Ordinarily and extraordinarily sized proteins
  1.4 Objectives of this master project
    1.4.1 Part 1
    1.4.2 Part 2

2 Materials and methods
  2.1 Part 1
    2.1.1 Collection of protein sequences
    2.1.2 Distribution of prokaryotic protein sizes
      2.1.2.1 Definition of lengths intervals
      2.1.2.2 Mean length of proteins within the three domains of life
    2.1.3 Amino acid composition
      2.1.3.1 Physicochemical properties
      2.1.3.2 Secondary structure preferences
      2.1.3.3 Hydropathy values
      2.1.3.4 Electric charge
    2.1.4 Amino acid sequence
      2.1.4.1 Secondary structure based on primary structure
      2.1.4.2 Natively unfolded regions
  2.2 Part 2
    2.2.1 Visualization of halomucin's dimensions
    2.2.2 Functional predictions
      2.2.2.1 Background: Amino acid sequence

3 Results and discussion
  3.1 Part 1
    3.1.1 Distribution of prokaryotic protein sizes and a consideration of the dataset
      3.1.1.1 Mean length of proteins within the three domains of life
    3.1.2 Amino acid composition
      3.1.2.1 Physicochemical properties
      3.1.2.2 Secondary structure preferences
      3.1.2.3 Hydropathy values
      3.1.2.4 Electric charge
    3.1.3 Amino acid sequence
      3.1.3.1 Secondary structure based on primary structure
      3.1.3.2 Natively unfolded regions
    3.1.4 Discussion and explanation attempts
      3.1.4.1 Archaea vs. bacteria
      3.1.4.2 Small vs. large
      3.1.4.3 Functional category of surface proteins for protein giants
  3.2 Part 2
    3.2.1 Visualization of halomucin's dimensions
    3.2.2 Functional predictions
      3.2.2.1 Background: Amino acid composition
      3.2.2.2 Background: Amino acid sequence

4 Conclusion
  4.1 Extraordinarily large proteins and mean protein lengths among the three domains of life
  4.2 Small, housekeeping generalists and large, accessory specialists
  4.3 Halomucin, the secret of H. walsbyi
  4.4 Challenges and future prospects

A Appendix
  A.1 Chou-Fasman Parameters
  A.2 Malkov Correlation Coefficients
  A.3 Hydropathy Indices of Kyte and Doolittle

Bibliography


List of Figures

1.1 Phylogenetic Tree of Life
1.2 Haloquadratum walsbyi
1.3 Protein structural levels

3.1 Distribution of protein sizes
3.2 Amino acid composition - test intervals
3.3 Amino acid composition - physicochemical properties
3.4 Amino acid composition - secondary structure preferences [Fasman 1989]
3.5 Amino acid composition - secondary structure preferences [Malkov 2008]
3.6 Amino acid composition - hydropathy values, test intervals
3.7 Amino acid composition - hydropathy values, all data points
3.8 Amino acid composition - hydropathy values, equal number of data points
3.9 Amino acid composition - electric charge, test intervals
3.10 Amino acid composition - electric charge, all data points
3.11 Amino acid composition - electric charge, equal number of data points
3.12 Amino acid sequence - PSIPRED
3.13 Amino acid sequence - SEG, lcr quantities
3.14 Amino acid sequence - SEG, lcr lengths (archaea)
3.15 Amino acid sequence - SEG, lcr lengths (bacteria)
3.16 Halomucin - length comparison
3.17 Halomucin - sequence plot
3.18 Halomucin - structural hints


List of Tables

2.1 Amount of prokaryotic sequences analyzed using PSIPRED
2.2 Calculation of halomucin's mean volume per amino acid

3.1 Definition of lengths intervals
3.2 Mean length of proteins within the three domains of life
3.3 Halomucin - sequence partitioning and structural homologues

A.1 Secondary structure preference groups according to [Fasman 1989]
A.2 Secondary structure preference groups according to [Malkov 2008]
A.3 Hydropathy scale according to [Kyte 1982]


Abbreviations

General abbreviations

1D One-dimensional

2D Two-dimensional

3D Three-dimensional

Aa comp. Amino acid composition

Å Ångström (unit), 10⁻¹⁰ meters

BLAST Basic Local Alignment Search Tool

C Coulomb (unit), 1 ampere × 1 second

COG Cluster of Orthologous Groups

DNA Deoxyribonucleic acid

e Elementary charge (unit), ≈ 1.6 × 10⁻¹⁹ C

e. g. Exempli gratia, for example

etc. Et cetera, and so on

HAMAP High-quality Automated and Manual Annotation of microbial Proteomes

HSP High-scoring Segment Pairs

i. e. Id est, that is

kb Kilo-base pair (unit), ≈ 3.4 × 10⁻⁷ meters

kDa Kilo-dalton (unit), ≈ 1.7 × 10⁻²⁴ kilograms

LEA Late embryogenesis abundant

lcr Low-complexity region

mRNA Messenger RNA

PDB Protein Data Bank of the Research Collaboratory for Structural Bioinformatics

PSI-BLAST Position-Specific Iterative-BLAST

RNA Ribonucleic acid

rRNA Ribosomal RNA

S Svedberg (unit), 10⁻¹³ seconds

SSEA Secondary structure element alignment

UV Ultraviolet

vs. Versus

Amino acids - single-letter representation

A  Alanine          M  Methionine
C  Cysteine         N  Asparagine
D  Aspartic acid    P  Proline
E  Glutamic acid    Q  Glutamine
F  Phenylalanine    R  Arginine
G  Glycine          S  Serine
H  Histidine        T  Threonine
I  Isoleucine       V  Valine
K  Lysine           W  Tryptophan
L  Leucine          Y  Tyrosine


Dedicated to my mother and my father


Chapter 1

Introduction

Contents

1.1 The three domains of life
  1.1.1 Prokaryotes
1.2 Haloquadratum walsbyi
  1.2.1 Halomucin
1.3 Proteins
  1.3.1 From 1D sequence to 3D structure
  1.3.2 Natively unfolded proteins and intrinsically disordered regions
  1.3.3 Ordinarily and extraordinarily sized proteins
1.4 Objectives of this master project
  1.4.1 Part 1
  1.4.2 Part 2

The first chapter of this thesis introduces general knowledge as well as recent scientific findings that provide the basis of this master project. To explain the objective of analyzing extraordinarily large, prokaryotic proteins, it first introduces some fundamental concepts of the classification of life and a few particularities of prokaryotic organisms. It then offers a basic consideration of proteins as such and a clarification of the term 'extraordinarily large', before the goals of this master project are outlined.

1.1 The three domains of life

It is now generally accepted that life can be divided into three domains: eukaryotes, bacteria and archaea [Woese 1990]. Based on differences between their 16S rRNA genes, it could be shown that all three arose separately from a common ancestor at the origin of life. Molecular phylogeneticists have therefore constructed a 'universal phylogenetic tree of life' as a hierarchical classification of all living things (Figure 1.1).

However, there is evidence that most genomes contain genes from multiple sources, and due to biological processes such as horizontal gene transfer, life might be more properly represented as a reticulated tree or net [Martin 1999], [Doolittle 1999].

Figure 1.1: Phylogenetic Tree of Life: Modified illustration of the 'universal phylogenetic tree of life' in rooted form according to [Woese 1990]. The branch lengths are based upon rRNA sequence comparisons and the branch tip for halophiles was additionally emphasized.

Almost all forms of life are organized as cells, use the same genetic code (DNA and RNA) and share similar metabolic pathways and proteins. Nevertheless, members of the three domains differ from each other in certain respects.

Animals, plants, fungi and protists are called eukaryotes, since their cells possess a true, DNA-containing nucleus (Greek karyon: 'kernel'). Bacteria and archaea lack such a membrane-enclosed organelle and are therefore called prokaryotes. Their genetic material is localized in an irregular DNA-protein complex called the nucleoid [Thanbichler 2005]. Compared to eukaryotes, prokaryotic genomes are generally circular and organized into multi-gene operons. Eukaryotes, on the other hand, have linear chromosomes with a single-gene organization but many introns. Further, their cellular structure differs greatly from that of bacteria or archaea, which lack membrane-bound cell compartments such as mitochondria and chloroplasts. Eukaryotic cells are mostly much larger than the cells of prokaryotes and are often organized as complex, multicellular organisms. In contrast, due to their larger surface to volume ratio, prokaryotic cells feature a higher metabolic and growth rate and thus a shorter generation time.

The next subsection focuses on prokaryotic organisms, in particular on the differences between bacteria and archaea, since proteins from these two biological domains were the subject of the analyses in this master project.
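The surface to volume argument made above for prokaryotic cells can be illustrated with a short worked calculation. This is only a sketch under simplifying assumptions that are not taken from the thesis: cells are idealized as spheres of radius r, and the two radii are illustrative orders of magnitude rather than measured values.

```latex
% Surface-to-volume ratio of an idealized spherical cell of radius r
\[
\frac{S}{V} = \frac{4\pi r^{2}}{\tfrac{4}{3}\pi r^{3}} = \frac{3}{r}
\]
% Illustrative (assumed) radii: a small prokaryote-like cell and a
% larger eukaryote-like cell
\[
r = 0.5\,\mu\mathrm{m}: \ \frac{S}{V} = 6\,\mu\mathrm{m}^{-1},
\qquad
r = 5\,\mu\mathrm{m}: \ \frac{S}{V} = 0.6\,\mu\mathrm{m}^{-1}
\]
```

Under these assumptions, a tenfold smaller radius gives a tenfold larger membrane area per unit of cell volume, which is consistent with the faster exchange and growth described above.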

1.1.1 Prokaryotes

The first organisms on earth must have been some form of prokaryotes, living approximately 3.5 billion years ago. They have diversified greatly throughout their long existence and include perhaps the most successful and abundant species in nature. Compared to eukaryotes, which obtain energy solely from photosynthesis or organic compounds, prokaryotes are able to utilize inorganic matter as an energy source, allowing them to occupy hostile environments.

With the recognition of archaea [Woese 1977], a group of organisms evolutionarily completely distinct from bacteria, prokaryotes had to be separated into bacteria (originally called eubacteria) and archaea (archaebacteria). Since these two kinds of prokaryotes are actually no more related to each other than they are to eukaryotes, it has further been argued that the term prokaryotes has no real evolutionary meaning and should therefore be discarded entirely [Woese 1994]. However, both share the mentioned absence of membrane-surrounded organelles and the small, circular chromosomes with gene organization in operons, as well as similar morphologies and cell sizes. In the following, some of their major differences and unique features are listed separately.

1.1.1.1 Bacteria

Bacterial cells are covered by cell walls, which are distinct from those of eukaryotic or archaeal cells. In contrast to plants and fungi, whose cell walls are made of cellulose and chitin, respectively, and unlike archaea, bacteria use peptidoglycan as the basis of their cell walls. This special polymer is made of polysaccharide chains cross-linked through unusual peptides containing D-amino acids [van Heijenoort 2001]. Gram-positive bacteria possess a thick cell wall containing many layers of peptidoglycan interspersed with teichoic acids, whereas the cell walls of Gram-negative bacteria consist of only a few layers of peptidoglycan surrounded by a second lipid membrane containing lipopolysaccharides and lipoproteins. Moreover, many bacteria surround their cells with slime layers of extracellular polymers or with more complex, structured capsules. Such envelopes may prevent the endocytotic uptake of bacteria by eukaryotic cells (e. g. macrophages) [Stokes 2004] or mediate cellular recognition and attachment processes as well as the formation of biofilms [Daffe 1999].

Bacteria of the phylum Firmicutes are able to form tough and dormant structures called endospores [Nicholson 2000]. In contrast to the spores that many eukaryotes produce for reproductive purposes, bacteria produce single, non-reproductive spores to ensure their survival through periods of environmental stress. Endospores are resistant to extreme physical and chemical conditions such as ultraviolet and gamma radiation, high and low temperatures and pressures, starvation, detergents and desiccation [Nicholson 2002]. They show no detectable metabolism and may remain viable for millions of years [Vreeland 2000].

Bacteria are ubiquitous in basically every habitat on earth. There are approximately ten times as many bacterial as human cells in the human body, most of them harmless, but some of beneficial or pathogenic character. They have adapted to a wide variety of ecological conditions and are therefore subject to stresses of various nature and amplitude. Next to an independent lifestyle as free-living cells, bacterial organisms can form complex associations with other cells and organisms, classified into mutualism, commensalism and parasitism. It could be shown that parasitic organisms like Mycoplasma or Chlamydia, which live in a protected environment and draw their nutrition from their hosts, exhibit smaller proteomes and longer median protein lengths than most other species [Brocchieri 2005]. On the other hand, shorter, less complex and more stable proteins can be observed for free-living species, which are exposed to more intense stresses and environmental fluctuations. This has been explained by the idea that a selective pressure for shorter, and therefore less expensive (in terms of amino acid usage), but more stable proteins would act more strongly on species that are more likely to encounter starvation and adverse conditions [Seligmann 2003], [Brocchieri 2005]. However, compared to archaea, bacteria can be considered as rather mesophilic.
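The two quantities compared in the passage above, proteome size and median protein length, can be computed from a proteome in FASTA format. The following is a minimal sketch and not the procedure of [Brocchieri 2005] or of this thesis; the function names `parse_fasta` and `proteome_summary` are hypothetical, and the three sequences are made-up toy data.

```python
from statistics import median


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def proteome_summary(fasta_text):
    """Return the number of proteins and their median length in residues."""
    lengths = [len(seq) for _, seq in parse_fasta(fasta_text)]
    # statistics.median raises StatisticsError for an empty proteome.
    return {"proteome_size": len(lengths), "median_length": median(lengths)}


# Toy input (assumption: invented sequences, not real proteome data).
toy = """>p1
MKT
>p2
MKTAYIAK
>p3
MKTAY
"""
summary = proteome_summary(toy)
print(summary)
```

Applied to two real proteomes, comparing the two summaries would show whether one organism combines a smaller proteome with a longer median protein length, as reported for parasitic species.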

1.1.1.2 Archaea

Archaea are generally regarded as life’s extremists, found in the harshest environments on earth. However, besides hostile habitats like hot springs, black smokers and highly saline, acidic or alkaline waters, archaea are also found in much less extreme environments such as oceans and soils [DeLong 1998]. They may contribute up to 20 % of the total biomass on earth [DeLong 2001]. In contrast to bacteria, no clear examples of archaeal pathogens or parasites are known [Eckburg 2003], [Cavicchioli 2003].

As already mentioned, bacterial and archaeal cells share similar morphologies and sizes. However, some archaeal species exhibit very unusual shapes like perfectly rectangular rods or flat squares, which are probably maintained through both intra- and extracellular structures.

Further, archaea possess genes and features involved in processes such as transcription and translation that are more closely related to those of eukaryotes. Therefore archaeal cells are resistant to antibiotics blocking these processes. Other typically eukaryotic features found in archaea are the presence of introns and N-linked glycosylated proteins.

An attribute unique to archaea is the chemical composition of their cytoplasmic membrane. Whereas the membranes of bacteria and eukaryotes contain mostly phospholipids derived from fatty acids linked to glycerol via an ester bond, archaeal lipids are composed of saturated phytanyl chains that are linked to glycerol via an ether bond [Rosa 1986]. Ether bonds are chemically more resistant than ester bonds, which might contribute to the ability of some archaea to survive at extreme temperatures and in very acidic or alkaline environments [Albers 2000]. Likewise, the branched, isoprenoid-based phospholipid side chains may help to prevent archaeal membranes from becoming leaky at high temperatures [Koga 2005]. Some archaea even exhibit tetraether lipids, which span the entire membrane and form a monolayer instead of a bilayer. Furthermore, archaeal lipids are unique in the stereochemistry of their glycerol group. In comparison to the D-glycerol utilized by bacteria and eukaryotes, archaeal lipids are built on the corresponding L-enantiomer. This suggests that archaea use entirely different enzymes for synthesizing their phospholipids.
Since such enzymes developed very early in the history of life, this would in turn indicate that archaea split off very early from the other two domains [Koga 2007].

Archaea resemble bacteria in that their cell membrane is usually surrounded by a cell wall. In overall structure, archaea are most similar to Gram-positive bacteria, since most have a single plasma membrane and a cell wall, without a periplasmic space. As already mentioned, unlike bacteria, most archaea lack peptidoglycan in their cell walls [Howland 2000].

Some species of archaea form aggregates or filaments of cells and can be prominent members of microbial communities forming biofilms [Hall-Stoodley 2004].

In contrast to bacteria and eukaryotes, none of the known archaea form spores [Onyenwoke 2004].

Compared to eukaryotic cells, archaea are able to exploit a much greater variety of energy sources. Besides organic compounds, they can use ammonia, metal ions or even hydrogen gas as nutrients. Salt-tolerant archaea (halobacteria) use sunlight as a source of energy, and other species of archaea fix carbon. However, unlike plants or cyanobacteria, archaeal cells are not known to do both. Since the vast majority of archaeal organisms have never been studied in the laboratory, the list of energy sources known today might be incomplete.

Likewise, the exact number of archaeal species, the full range of occupied habitats and their definite classification are still difficult to determine. Current classification systems aim to organize archaea into groups of organisms that share structural features and common ancestors [Gevers 2006].
Most of the culturable and hence better investigated species of archaea are members of two main phyla, euryarchaeota and crenarchaeota.

Besides this classical classification, extremophilic archaea are sorted into four physiological groups, namely thermophiles, halophiles, alkaliphiles and acidophiles [Pikuta 2007], and some archaea belong to several of these groups. Compared to mesophiles with optimum growth temperatures of 24 to 40 °C, thermophiles prefer temperatures between 50 and 70 °C. Organisms with even higher optimum growth temperatures (above 80 °C) are defined as hyperthermophiles. Halophiles, including the class halobacteria of the phylum euryarchaeota, live in extremely saline environments.

One archaeal species of the class halobacteria, the order halobacteriales, the family halobacteriaceae and the genus haloquadratum is Haloquadratum walsbyi. Since this organism produces one of the largest archaeal proteins (9159 amino acids), halomucin, which has been an object of investigation during this master project, the next sections dwell on this particular organism and its intriguing giant protein.

1.2 Haloquadratum walsbyi

The saturated brines of solar salterns, highly enriched in hygroscopic magnesium salts, form an environment that seems to make life nearly impossible. Cells suffer severe desiccation stress, almost anaerobic conditions, high levels of UV radiation and low concentrations of important nutrients due to complexing with Mg2+. Nevertheless, dense biomass is often found in such hostile habitats. Molecular approaches revealed that more than 80 % of the identified organisms are square, non-motile, pigmented archaea whose 16S rRNA genes differ by less than 1 % [Bolhuis 2006]. Thus Haloquadratum walsbyi (Figure 1.2), already described in 1980 [Walsby 1980], largely dominates this ecological niche, reaching population densities of over 10^7 cells per ml. To thrive in such an extreme environment, H. walsbyi requires sophisticated adaptations.

The first prominent feature is the remarkable cell morphology of this square archaeon. By flattening itself to a cell thickness between 0.1 and 0.5 µm, H. walsbyi achieves one of the highest surface-to-volume ratios in the entire microbial world [Bolhuis 2005]. In contrast to spherical microorganisms, which have to remain small to optimize this ratio and thus the nutrient uptake capacity per cell volume, the square archaeal cells can become almost unlimitedly large. In liquid cultures, square cells with a lateral length of 40 µm are not uncommon [Bolhuis 2004]. Membrane processes appear to be of major importance for H. walsbyi. A large surface enhances the cells’ ability to grow phototrophically, since the copy number of bacteriorhodopsins can then be increased drastically without reducing the membrane space needed for other components, such as transport proteins. Several gas vesicles help the squares to position themselves close and parallel to the surface, which further helps H. walsbyi to adapt to an environment low in nutrients.

Compared to other haloarchaeal genomes (60 - 70 % GC), the overall GC content of H. walsbyi’s genome is remarkably low (47.9 %). This seems curious, since it is generally accepted that AT-rich genomes are more prone to UV-induced thymine dimer formation, which would lead to an accumulation of mutations in the cells of H. walsbyi, which are exposed to high levels of solar radiation. However, the genome of this square archaeon also encodes four photolyase proteins, which might at least partly compensate for the increased mutation risk. The drift to an AT-rich genome could, on the other hand, be a necessary adaptation to high environmental concentrations of Mg2+, which is known to stabilize DNA and RNA molecules. An additional stabilization of the genome through a high GC content might interfere with essential processes like DNA replication and transcription. Moreover, it has been argued for bacterial species that an enrichment in AT could be a side effect of decreased selective pressure in a physically limited environment, leading to a neutral drift towards a slightly decreased demand for nitrogen by replacing guanines [Dufresne 2005], [Giovannoni 2005]. Hence the trend towards a more AT-rich genome, together with the unusually low coding density (76 %) compared to other haloarchaeal genomes (86 - 91 %), could also reflect the restricted environmental niche of H. walsbyi with a consequent lack of growth competition from other species.

The secret to the success of H. walsbyi in an environment of very low water activity and high desiccation stress might lie in the expression of a water-enriched capsule whose structure is provided by a huge protein called halomucin.

Figure 1.2: Darkfield micrograph of a normal-sized (arrow, ≈ 5 × 5 µm) and a large (≈ 40 × 40 µm) cell of Haloquadratum walsbyi, taken from [Bolhuis 2004]. Both cells contain gas vesicles (light dots). Owing to the flexibility of the larger cell structures, these are rarely found in an unfolded state.

1.2.1 Halomucin

A gene of over 27000 bp, which is transcribed in full length, encodes this polypeptide of 9159 amino acids in H. walsbyi [Bolhuis 2006]. With a sequence more than 30 times longer than that of ordinary archaeal polypeptides, halomucin represents one of the largest archaeal proteins. The name arose from the fact that its amino acid sequence as well as its domain organization is quite similar to that of animal mucins, which are known to play an important role in protecting various tissues from desiccation or harsh chemical conditions [Hollingsworth 2004]. It likewise contains domain sections with the potential to act as glycosylation or sulfation sites, which would increase the overall negative charge of halomucin. The resulting increased capacity of the polypeptide to bind cations and water from its environment might lead to an effect similar to that of animal mucins, namely the generation of a protective micromilieu around the cells. This hypothesis is supported by the presence of an N-terminal signal sequence in halomucin, indicating that it is an extracellular protein [Bolhuis 2006].

Further, H. walsbyi could be the first identified example of an archaeon with the ability to synthesize sialic acids. Within animal mucins, sialic acids form rigid structures by capping the ends of the polysaccharide side chains of these proteins. Together with the fact that H. walsbyi might additionally be capable of producing a cross-linked matrix of poly-gamma-glutamate, it is conceivable that the mentioned aqueous shield against desiccation stress simultaneously represents a rigid capsule, necessary to achieve and maintain the unique square cell morphology of H. walsbyi [Bolhuis 2006].

Consequently, the synthesis of halomucin in order to construct a chemical and physical cellular barrier could be interpreted as an expensive but worthwhile investment of H. walsbyi to survive in its extremely hostile ecological niche.

1.3 Proteins

After the description of their prokaryotic origin, a closer look at proteins as such should be taken before continuing with the related topics of this thesis. The next sections therefore provide a short introduction to general concepts and conventions regarding the different structural levels of proteins and the issue of intrinsically unstructured regions, as well as a better feeling for the length dimensions of proteins.

1.3.1 From 1D sequence to 3D structure

Proteins (Greek proteios: ’primary’) are essential parts of organisms and participate in almost every cellular process. Examples are the catalysis of biochemical reactions by enzymes, the recognition of other molecules (like antigens by antibodies), the regulation or modulation of various ligands, the performance of structural and mechanical functions (e.g. maintaining the cellular shape or achieving cell adhesion) or the mediation of transport processes (for instance of electrons). This class of biopolymers is in general made of 20 different L-alpha-amino acids arranged in a linear chain and joined together by peptide bonds. Proteins are assembled using information encoded in the nucleotide sequence of the corresponding gene within the genome of the organism. The entire complement of proteins expressed by a genome (at a given time and under defined conditions) is called the proteome. A gene encoded in DNA is first transcribed into mRNA, which is then used by the ribosome as a template for translation into the amino acid chain. Therefore the individual amino acid composition of a protein is already defined by the content of particular base triplets within the open reading frame of its gene. Starting with the N-terminus, each protein obtains its characteristic amino acid sequence, which is also referred to as its 1D or primary structure (Figure 1.3). Most proteins adopt certain regularly repeating, local conformations of their backbones, stabilized by hydrogen bonds. Examples of such 2D or secondary structures of a polypeptide are turns, alpha-helices and beta-strands. The 3D or tertiary structure of a protein is defined as its overall layout, i.e. the spatial relationship of all secondary structures to each other. It is generally stabilized by non-local interactions, like the formation of a hydrophobic core or by salt bridges, hydrogen and disulfide bonds, van der Waals interactions and sometimes by post-translational modifications. An even higher degree of spatial structure is achieved through the interaction of more than one protein molecule within the assembly of oligomeric proteins, referred to as quaternary structure.

Further terms often used with regard to a protein’s structure are domain and fold. A protein domain often denotes an autonomously folding functional module of a protein. Proteins are defined as having a common fold if they have the same major secondary structure elements in the same arrangement and with the same topological connections.

Although all structural information needed to achieve the native fold should be encoded by the amino acid sequence of a protein [Anfinsen 1973], many polypeptides require specific chemical conditions or the aid of molecular chaperones to fold properly.


Figure 1.3: Demonstration of the general convention of four structural levels describing the spatial arrangement of a polypeptide. Source: National Human Genome Research Institute/Educational Resources (www.genome.gov/Education/).


1.3.2 Natively unfolded proteins and intrinsically disordered regions

Ever since H. E. Fischer proposed the ’lock and key model’ in the 19th century to visualize the interaction between an enzyme and its substrate (invertase with alpha-glycosides) [Fischer 1894], it has been assumed that an amino acid chain first has to fold into its native structure, meaning a complex 3D structure, before it can become functionally active.

However, this so-called structure-function paradigm appears to be invalid for significant fractions of proteins, which perform important cellular functions in spite of a partial or complete lack of structure. In fact, it has been predicted that as many as one third of all tested eukaryotic proteins, 4 % of bacterial and 2 % of archaeal sequences contain long regions of intrinsic disorder or are completely unfolded in their native states [Ward 2004].

Several reasons have been proposed for this evident difference between eukaryotes and prokaryotes. Less complex prokaryotes lacking cellular compartmentalization might possess reduced abilities to physically protect unfolded structures, which are more prone to degradation. Since they are additionally subject to a stronger selective pressure on biochemical efficiency, the costs of short protein lifetimes are likely to be far greater. Further, it was observed that the direct correlation between the degree of disorder and the complexity of an organism is more pronounced for non-housekeeping proteins. This would suggest that highly evolved organisms depend significantly more on complex and regulatory protein-protein interactions performed by natively unfolded non-housekeeping proteins. In contrast, housekeeping proteins do not appear to become substantially more disordered with increasing complexity of the organism, which was explained by the fact that many of these proteins are structurally more rigid enzymes.

The higher degree of flexibility, which enables disordered proteins to interact more easily with a greater number of diverse ligands, might be one of the reasons for their wide prevalence. Another hypothesis assumes that natively unfolded proteins exist due to their ability to form large intermolecular interfaces while the size of the proteins, and hence of the genome and of the cell, can remain moderate. To achieve a comparable interface, monomeric folded proteins would have to be two or three times larger. Cells would therefore either be subject to increased cellular crowding or would need to be enlarged by 15 - 30 % [Gunasekaran 2003].

Since unfolded proteins are generally expected to be degraded rapidly, they might either be complexed with a binding partner, thereby undergoing at least partial folding, or they may be sequestered in cell regions of low protease activity. At any rate, the fact that large fractions of intrinsically disordered proteins remain evolutionarily stable indicates that they must have important cellular functions [Fink 2005].

More than 30 different types of functions have been ascribed to natively unfolded proteins, which are often connected with some of the most important cellular regulation processes like the control of the cell cycle, transcription or translation [Vucetic 2003]. Intrinsically disordered regions, which uncouple specificity from binding affinity, are also very frequent within RNA and protein chaperone sequences [Tompa 2004]. This might be a consequence of the increased need of these molecules for a flexible and malleable recognition of diverse substrates.
Further, protein phosphorylation sites, involved in a major post-translational regulation system, predominantly occur within sequence regions featuring intrinsic disorder [Iakoucheva 2004]. Moreover, a group of proteins involved in the cellular protection from dehydration (freezing, saline conditions or drying) is represented by natively unfolded polypeptides called late embryogenesis abundant (LEA) proteins. They were first found in cotton seeds (Gossypium hirsutum), where they accumulate late in embryogenesis [Dure 1981], but they occur in a variety of organisms from bacteria to plants and lower animals. For example, desiccation-tolerant rotifers show a strong induction of LEA gene expression during drying [Tunnacliffe 2005], [Pouchkina-Stantcheva 2007].

On the other hand, proteins involved in cellular processes such as biosynthesis and metabolism appear to contain far smaller fractions of intrinsic disorder [Iakoucheva 2002].

Natively unfolded proteins and extended disordered regions seem to differ greatly from structured proteins in their amino acid composition. They are usually characterized by a high overall hydrophilicity, a large net charge [Uversky 2000], a low sequence complexity and a high degree of flexibility. Hence, they exhibit an amino acid compositional bias towards higher levels of arginine, glutamate, glutamine, glycine, lysine, proline and serine with simultaneously lower levels of asparagine, cysteine, isoleucine, leucine, phenylalanine, tryptophan, tyrosine and valine [Romero 2001]. Due to the lack of hydrophobic amino acids, unfolded polypeptides do not possess a tight, hydrophobic core like globular proteins and they are generally deficient in secondary structures. These distinct amino acid sequence properties allow extended disordered regions to be predicted with a high degree of accuracy.

1.3.3 Ordinarily and extraordinarily sized proteins

The intention to study large, prokaryotic proteins first raises the question of how long proteins usually are and where the lower and especially the upper bounds lie. Further, it is worth considering (in advance and during the analyses) how frequently so-called giant proteins appear and what it means for the cells to invest in the expensive production of such extraordinarily large polypeptides.

Analyses of the protein lengths within several prokaryotic and eukaryotic proteomes revealed that mean sequence lengths are around 300 amino acids for prokaryotic proteins and approximately 500 residues for eukaryotic polypeptides [Skovgaard 2001], [Brocchieri 2005]. The lower limit of how long a protein may be is rather a matter of definition. Even single amino acids like proline are reported to feature enzymatic activity [Movassaghi 2002], but since the term protein is generally used to refer to a biological molecule larger than a peptide, the lower border might lie near 20 - 30 residues [Lodish 2004]. The largest experimentally and structurally characterized human protein is titin [Bang 2001], also known as connectin, which is important in the contraction of striated muscle tissues. With its 38138 amino acids and its richness in proline sequences, it might further constitute one of the largest intrinsically disordered proteins in nature [Ma 2006]. The only prokaryotic giant gene that has been experimentally proven to be transcribed in full length is that of halomucin (9159 amino acids) [Bolhuis 2006]. However, the largest archaeal gene, from Cenarchaeum symbiosum, would encode a protein of 11910 residues, and the bacterial genome of Chlorobium chlorochromatii contains an open reading frame which could be translated into as many as 36805 amino acids.

Admittedly, an examination of 580 completely sequenced prokaryotic genomes revealed that only 0.2 % of the open reading frames reached lengths of more than 5000 bp [Reva 2008]. Even though this fraction could be slightly higher, because many gene finders tend to predict several smaller genes instead of one extremely large open reading frame, large genes and hence large proteins obviously represent a minority. The reason for this rarity of giant genes becomes obvious when considering the production costs and times of their corresponding protein sequences. With a maximum bacterial translation rate of 40 amino acids per second [Watson 1987], the synthesis of the mentioned theoretically largest bacterial protein known to date (36805 residues) would take at least 15 minutes. To be worth such an immense production effort (which might be increased even further by the costs necessary to translocate such giant sequences in the case of extracellular proteins), cells must experience a substantial gain of fitness through this investment. The mentioned analysis of prokaryotic giant genes [Reva 2008] found that, with increasing sequence lengths of the derived proteins, two major functional purposes become more and more evident. Over 90 % of potential proteins spanning more than 5000 amino acids (derived from genes larger than 15 kb) have been suggested to belong to one of two functional categories, either intracellular non-ribosomal peptide/polyketide synthetases or extracellular surface proteins. Both functions can be associated with the defense against competitors. Non-ribosomal peptide or polyketide synthetases produce secondary metabolites mediating antimicrobial, antifungal or antiparasitic activities. The functional category of surface proteins, comparable to mucins and collectins in mammals, provides cellular envelopes and shields. They build up a protective, extracellular micromilieu around the cells, mediate cellular adhesion processes and sense environmental signals [Reva 2008].

It was recognized that giant genes typically do not belong to the core genome, but rather represent specific features increasing the fitness of the respective cells to persist in their individual ecological niches.

In spite of the extreme demands on cellular resources, the production of giant proteins appears profitable for such cells, and the investigation of their individual cellular functions challenging for us.
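The translation-time estimate given in this section can be checked with a few lines of Python. The following is only a back-of-the-envelope sketch: it assumes a constant elongation rate at the quoted maximum of 40 amino acids per second and ignores initiation, pausing and translocation. The function names are chosen for illustration and do not stem from the thesis.

```python
# Rough estimate of the minimum time needed to translate a protein,
# assuming a constant elongation rate (no pauses, no initiation delay).

MAX_RATE_AA_PER_S = 40  # maximum bacterial translation rate quoted in the text


def translation_time_minutes(length_aa, rate_aa_per_s=MAX_RATE_AA_PER_S):
    """Return the minimum translation time in minutes for length_aa residues."""
    return length_aa / rate_aa_per_s / 60.0


def coding_length_bp(length_aa):
    """Return the coding length in base pairs, counting one stop codon."""
    return 3 * (length_aa + 1)


if __name__ == "__main__":
    examples = [("C. chlorochromatii ORF", 36805), ("halomucin", 9159)]
    for name, length in examples:
        print(f"{name}: {length} aa, {coding_length_bp(length)} bp, "
              f"{translation_time_minutes(length):.1f} min")
```

Under these assumptions the 36805-residue open reading frame needs slightly more than 15 minutes, in line with the figure above, while halomucin would need just under 4 minutes.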

1.4 Objectives of this master project

After the introduction of prokaryotes and extraordinarily large proteins, the analyses this master project aimed to perform can be outlined. They are best divided into a first part of rather general examinations regarding large, prokaryotic proteins and a second part intended as a more detailed analysis of one such protein giant, namely halomucin.

1.4.1 Part 1: Extraordinarily large, prokaryotic proteins - in general

The first part of this project sought to survey the abundance of extremely large proteins within prokaryotes and to give an overview of the differences between large and ordinarily sized proteins. To this end, comparative analyses on the level of both amino acid composition and amino acid sequence were intended. In doing so, the objective was to identify potential correlations between protein sizes, their functions and the environmental conditions of the producing cells.

1.4.2 Part 2: Haloquadratum walsbyi’s protein giant halomucin - in detail

The intention of the second part was the closer analysis of one exemplary giant protein, for which halomucin of the halophilic square archaeon Haloquadratum walsbyi was chosen. Of interest was how H. walsbyi might benefit from the expensive synthesis of halomucin and in which way the functions of halomucin could be connected with the extremely hostile habitat of H. walsbyi. Functional hints were to be obtained by identifying structural homologues of sequence sections of halomucin.


Chapter 2

Materials and methods

Contents

2.1 Part 1
2.1.1 Collection of protein sequences
2.1.2 Distribution of prokaryotic protein sizes
2.1.3 Amino acid composition
2.1.4 Amino acid sequence
2.2 Part 2
2.2.1 Visualization of halomucin’s dimensions
2.2.2 Functional predictions

This chapter specifies the materials and methods used in this master project. Since all analyses were performed bioinformatically, all materials refer to the virtual data on which the research was based. In most cases, the approach to each analysis is described separately and in connection with its individual intention.

2.1 Part 1: Extraordinarily large, prokaryotic proteins - in general

2.1.1 Collection of protein sequences

The first step of this master project on large, prokaryotic proteins was the collection of analyzable protein data. All prokaryotic sequences examined later were obtained from the HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes) database [Gattiker 2003] by pooling the amino acid sequences of all complete proteome datasets available at that time (date of download: 2008-06-19). Hence all studies of prokaryotic proteins within this project are based on 51 archaeal and 595 bacterial proteomes comprising 110621 and 1955773 amino acid sequences, respectively. For comparison, an additional 986022 eukaryotic sequences were obtained by combining all 45 complete eukaryotic proteomes available at that time in the Ensembl database [Hubbard 2009] (date of download: 2008-11-23).
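Pooling the sequences of several proteome files can be sketched as follows. The sketch assumes that the downloaded proteomes are available as FASTA-formatted text, which the text above does not state explicitly. The helper names `read_fasta` and `pool_proteomes` and the toy sequences are purely illustrative.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pool_proteomes(handles):
    """Concatenate the records of several proteome files into one list."""
    pooled = []
    for handle in handles:
        pooled.extend(read_fasta(handle))
    return pooled


# Toy input standing in for two downloaded proteome files (made-up sequences).
proteome_a = io.StringIO(">protA1\nMKV\nLLA\n>protA2\nMSTN\n")
proteome_b = io.StringIO(">protB1\nMAAAGG\n")
records = pool_proteomes([proteome_a, proteome_b])
print(len(records), [len(seq) for _, seq in records])
```

With real data, the `io.StringIO` objects would be replaced by opened proteome files, and the number of pooled records would correspond to the sequence counts reported above.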

2.1.2 Distribution of prokaryotic protein sizes

2.1.2.1 Definition of length intervals

To analyze the abundance of large proteins and the distribution of sequence lengths in general, all proteins of each of the three biological domains (archaea, bacteria and eukaryotes) were combined into one large file of protein sequences per domain. Since it was of interest later on, all amino acid sequences of the archaeal proteome of Haloquadratum walsbyi were likewise pooled into one file. The protein sequences of all four files were then sorted according to different length interval arrangements to identify the distribution of their lengths. A length interval notation of [a;b) indicates that all sequences of length equal to or larger than a, but shorter than b, are included in this particular interval. [a;) denotes that an interval contains all sequences of length equal to or larger than a.

The examinations resulted in the definition of three arrangements of sequences into length intervals:

test intervals: [100;200), [200;300), [300;400), [3000;)

complete intervals: [0;100), [100;200), [200;300), [300;400), [400;500), [500;1000), [1000;1500), [1500;2000), [2000;3000), [3000;4000), [4000;5000), [5000;10000), [10000;)

adjusted complete intervals: [0;100), [100;200), [200;300), [300;400), [400;500), [500;1000), [1000;1500), [1500;2000), [2000;3000), [3000;)
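The assignment of sequence lengths to half-open intervals [a;b) can be illustrated with a short Python sketch. It covers contiguous interval sets such as the complete intervals, where each lower bound is the upper bound of the preceding interval. The non-contiguous test intervals would need explicit (a, b) pairs instead. The names `COMPLETE_BOUNDS`, `interval_label` and `count_by_interval` are illustrative and are not taken from the scripts used for the thesis.

```python
from bisect import bisect_right

# Lower bounds of the "complete intervals"; the last interval is open-ended.
COMPLETE_BOUNDS = [0, 100, 200, 300, 400, 500, 1000, 1500, 2000,
                   3000, 4000, 5000, 10000]


def interval_label(length, bounds=COMPLETE_BOUNDS):
    """Return the label of the half-open interval [a;b) that contains length."""
    i = bisect_right(bounds, length) - 1
    if i < 0:
        raise ValueError("length lies below the first lower bound")
    lower = bounds[i]
    upper = bounds[i + 1] if i + 1 < len(bounds) else ""
    return f"[{lower};{upper})"


def count_by_interval(lengths, bounds=COMPLETE_BOUNDS):
    """Count how many sequence lengths fall into each interval."""
    counts = {interval_label(b, bounds): 0 for b in bounds}
    for length in lengths:
        counts[interval_label(length, bounds)] += 1
    return counts


print(interval_label(9159))   # halomucin
print(interval_label(36805))  # largest bacterial open reading frame
```

Because `bisect_right` places a length equal to a bound to the right of that bound, a sequence of exactly 100 residues falls into [100;200) rather than [0;100), which matches the interval notation defined above.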

2.1.2.2 Mean length of proteins within the three domains of life

The mean sequence lengths and the corresponding standard deviations were calculated for all indicated groups (archaea, H. walsbyi, bacteria, prokaryotes and eukaryotes) following equations (2.1) and (2.2), respectively.

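$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{2.1}$$

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2} \tag{2.2}$$

Here x_i denotes the i-th value (for example the length of sequence i) and n the number of values in the respective group.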
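Equations (2.1) and (2.2) translate directly into code. The following minimal sketch computes the arithmetic mean and the sample standard deviation (with the n - 1 denominator). The example lengths are made up and serve only to show the call.

```python
import math


def mean(values):
    """Arithmetic mean, equation (2.1)."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation with the n - 1 denominator, equation (2.2)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    x_bar = mean(values)
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


lengths = [120, 250, 310, 480, 9159]  # made-up example lengths
print(mean(lengths), sample_std(lengths))
```

A single very long sequence, as in the example list, shifts the mean and inflates the standard deviation considerably, which is worth keeping in mind when comparing mean protein lengths between groups.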

2.1.3 Amino acid composition

One of the objectives of this master project was to analyze how the amino acid content of a protein depends on its sequence length. To this end, the four test length intervals were initially examined separately for archaeal and bacterial sequences. For each sequence in one of the resulting eight length intervals, the amino acid content was determined with respect to the 20 common proteinogenic amino acids (normalized to the sum of all 20 fractions). Residues labeled with symbols other than the 20 letters representing these amino acids were excluded. Subsequently, the mean values (2.1) and standard deviations (2.2) were calculated over all frequencies of one amino acid and one length interval at a time.
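The per-sequence composition described above can be sketched in Python as follows. The sketch assumes one-letter amino acid codes. The names `AMINO_ACIDS`, `composition` and `mean_composition` are illustrative choices, not identifiers from the thesis.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def composition(sequence):
    """Return the fractions of the 20 standard amino acids in one sequence.

    Any other symbol (e.g. X, B, Z, U) is ignored, so the 20 fractions sum to 1.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def mean_composition(sequences):
    """Average the per-sequence fractions over all sequences of one interval."""
    comps = [composition(seq) for seq in sequences]
    return {aa: sum(c[aa] for c in comps) / len(comps) for aa in AMINO_ACIDS}
```

Note that averaging per-sequence fractions gives every sequence the same weight regardless of its length. This differs from pooling all residues of an interval before computing fractions.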

2.1.3.1 Physicochemical properties

On closer examination, the 20 amino acids were sorted into 8 different groups according to their physicochemical properties:


nonpolar, aliphatic: alanine, isoleucine, leucine, methionine and valine

polar, uncharged: asparagine, glutamine, serine and threonine

positively charged: histidine, lysine and arginine

negatively charged: aspartate and glutamate

aromatic: phenylalanine, tryptophan and tyrosine

cysteine: cysteine

glycine: glycine

proline: proline

The subsequent analysis of the complete length intervals for either archaeal or bacterial proteins was performed similarly to the examination of the test intervals with 20 amino acid fractions (above). However, in this case only 8 different fractions were distinguished before the mean values (2.1) and standard deviations (2.2) were calculated. Further, the differences between these means and the averages calculated over all archaeal or bacterial sequences were plotted.
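The reduction from 20 to 8 fractions can be sketched like this. The group assignment is taken from the list above; the names `GROUPS` and `group_fractions` are hypothetical, and the exclusion of non-standard residues is assumed to work as for the 20-letter composition.

```python
# Physicochemical groups as listed above, in one-letter code.
GROUPS = {
    "nonpolar, aliphatic": "AILMV",
    "polar, uncharged": "NQST",
    "positively charged": "HKR",
    "negatively charged": "DE",
    "aromatic": "FWY",
    "cysteine": "C",
    "glycine": "G",
    "proline": "P",
}


def group_fractions(sequence):
    """Fraction of residues per physicochemical group (fractions sum to one)."""
    standard = "".join(GROUPS.values())
    residues = [c for c in sequence.upper() if c in standard]
    n = len(residues)
    if n == 0:
        raise ValueError("sequence contains none of the 20 standard residues")
    return {name: sum(c in members for c in residues) / n
            for name, members in GROUPS.items()}
```

Because every amino acid belongs to exactly one group, the eight fractions of a sequence add up to one.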

2.1.3.2 Secondary structure preferences

Further analyses on the level of amino acid content and for the complete length intervals involved grouping the amino acids by their preference for a certain type of secondary structure. Two different groupings were applied, either according to [Fasman 1989] and the Chou-Fasman parameters (Appendix A.1) or according to [Malkov 2008] and the Malkov correlation coefficients (Appendix A.2). For the arrangement according to [Fasman 1989], all residues with a Pα value larger than 1.20 were combined into the group of strong helix formers and all amino acids with Pα values from 1.08 to 1.20 into the category of helix formers. Further, residues with a Pβ value above 1.40 were pooled as strong sheet formers and amino acids with Pβ values from 1.10 to 1.40 as sheet formers. All amino acids with a Pτ value larger than 1.40 were grouped as turn formers. For the partitioning according to [Malkov 2008], all significant positive correlation coefficients between amino acids and the secondary structure types compared were used to assign residues to their secondary structure groups.

Arrangement according to [Fasman 1989]:

strong helix formers: alanine, glutamate, leucine, methionine

helix formers: glutamine, isoleucine, lysine, phenylalanine, tryptophan

strong sheet formers: isoleucine, tyrosine, valine

sheet formers: cysteine, glutamine, leucine, phenylalanine, threonine, tryptophan

turn formers: asparagine, aspartate, glycine, proline, serine

Arrangement according to [Malkov 2008]:

alpha-helix: alanine, arginine, glutamate, glutamine, leucine, lysine, methionine

3-helix: aspartate, proline, serine

strand: isoleucine, leucine, phenylalanine, threonine, tryptophan, tyrosine, valine


turn: asparagine, aspartate, glycine

bend: asparagine, aspartate, glycine, serine

coil: asparagine, aspartate, proline, serine, threonine

It should be pointed out again that within both arrangements several amino acids are assigned to more than one secondary structure preference type. In such cases the amino acid was counted multiple times, although all fractions were finally normalized by the unaltered length of the respective sequence. Hence the sum of all fractions within a sequence or for one length interval does not equal one. The averaging was nevertheless performed analogously over all fractions for each secondary structure preference group and one length interval at a time. Likewise, the differences between these mean values and the averages calculated over all archaeal or bacterial sequences were plotted.
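The effect of counting residues in several groups can be made concrete with a small sketch based on the [Fasman 1989] arrangement listed above. The names `FASMAN_GROUPS` and `preference_fractions` are hypothetical, and normalizing by the number of standard residues is an assumption about what "unaltered length" means.

```python
# Arrangement according to [Fasman 1989] as listed above (one-letter code).
FASMAN_GROUPS = {
    "strong helix formers": "AELM",
    "helix formers": "QIKFW",
    "strong sheet formers": "IYV",
    "sheet formers": "CQLFTW",
    "turn formers": "NDGPS",
}

STANDARD = "ACDEFGHIKLMNPQRSTVWY"


def preference_fractions(sequence, groups=FASMAN_GROUPS):
    """Fraction of residues per preference group.

    A residue that belongs to several groups is counted in each of them,
    while the denominator stays the plain number of standard residues.
    """
    residues = [c for c in sequence.upper() if c in STANDARD]
    n = len(residues)
    if n == 0:
        raise ValueError("sequence contains none of the 20 standard residues")
    return {name: sum(c in members for c in residues) / n
            for name, members in groups.items()}


# I, L and Q are each counted twice, H is counted in no group at all.
fractions = preference_fractions("ILQH")
assert abs(sum(fractions.values()) - 1.5) < 1e-12
```

The toy sequence shows why the fractions of one sequence need not sum to one: residues with two memberships raise the sum, residues without any membership lower it.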

2.1.3.3 Hydropathy values

Another examination based on the amino acid composition of a sequence was the calculation of the mean hydropathy per protein, normalized by the length of the respective amino acid sequence. These mean residue hydropathies were calculated by first representing each amino acid of a sequence by its characteristic hydropathy index (according to [Kyte 1982], Appendix A.3), second adding all these values and third dividing the sum by the number of residues considered (all residues among the 20 common amino acid characters). The hydropathy indices per residue according to [Kyte 1982] are given without unit. They can be understood as transfer energies (in pseudo-kcal/mol) that were manually adjusted to a numerical range from -4.5 to 4.5.

To determine the dependency of these values on the protein lengths, the four test length intervals were initially analyzed for archaea, bacteria or H. walsbyi and the resulting data plotted as histograms (one for each length interval). All occurrences of mean residue hydropathies between -3 and 3 were taken into account and distributed over 100 bins. Attention should be paid to the ordinate scales, which were adjusted to the far smaller number of large sequences in each [3000;) interval. Afterwards, the values were plotted against sequence length for the complete length intervals by means of box and whisker plots. For these, the ordinate scales were defined to cover mean residue hydropathies ranging from -4 to 4. The [0;) interval represents all protein sequences and thus all data points for archaea (110621), bacteria (1955773) or H. walsbyi (2645). Medians are indicated as horizontal red lines, boxes span the range between the first and third quartile and outliers are represented by red diamonds.

For archaeal and bacterial sequences the mean residue hydropathies were again represented as box and whisker plots, but in this case only for the adjusted complete length intervals. Here, all ten length intervals of archaeal or bacterial proteins were intended to show the same number of data points. Since the maximum number of points is given by the limited quantity of large sequences ([3000;) intervals), either 20 (archaea) or 1067 (bacteria) proteins were picked randomly for all length intervals containing smaller sequences. Such random selections were made several times for all length intervals except the [3000;) interval and the results were plotted as so-called tests.
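The two computational steps of this subsection, the mean residue hydropathy and the random selection of equally sized samples, can be sketched as follows. The hydropathy indices are the published Kyte-Doolittle values; the function names `mean_residue_hydropathy` and `subsample` as well as the optional `seed` parameter are hypothetical and only serve the illustration.

```python
import random

# Hydropathy indices according to Kyte and Doolittle (1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_residue_hydropathy(sequence):
    """Sum of hydropathy indices divided by the number of standard residues."""
    values = [KYTE_DOOLITTLE[c] for c in sequence.upper() if c in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains none of the 20 standard residues")
    return sum(values) / len(values)


def subsample(values, k, seed=None):
    """Draw k values without replacement, e.g. k = 20 for the archaeal tests."""
    rng = random.Random(seed)
    return rng.sample(values, k)
```

Repeating `subsample` with different seeds corresponds to the repeated random selections that were plotted as tests.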

2.1.3.4 Electric charge

Likewise on the level of amino acid content, a calculation similar to the determination of the mean residue hydropathy was performed to obtain the mean residue charge per protein. For this purpose, the theoretical total charge of a sequence at pH 7 was determined first, and this value was subsequently normalized by the number of residues in the sequence. For the calculation of the total charges, the charged fractions (at pH 7) of both termini and of the charged amino acid side chains were taken into account, using the Henderson-Hasselbalch equation (2.3) and the pKa values according to [Stryer 1995]. Charges were calculated in units of the elementary charge e (≈ 1.6 × 10^−19 C).

$$\mathrm{pH} = \mathrm{p}K_a + \log \frac{[\mathrm{A}^-]}{[\mathrm{HA}]} \qquad (2.3)$$

positively charged fractions: arginine (1.000), lysine (0.999), N-terminus (0.909), histidine (0.240)

negatively charged fractions: C-terminus (1.000), aspartate (0.997), glutamate (0.997), cysteine (0.031), tyrosine (0.001)
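A minimal sketch of the charge calculation, using the charged fractions listed above, could look as follows. The function names are hypothetical. The helper `charged_fraction` merely rearranges equation (2.3); the pKa of 8.0 in its usage example is an illustrative assumption that happens to reproduce the N-terminal fraction of 0.909 and is not quoted from [Stryer 1995].

```python
# Charged fractions at pH 7 as listed above, in units of e.
POSITIVE = {"R": 1.000, "K": 0.999, "H": 0.240}
NEGATIVE = {"D": 0.997, "E": 0.997, "C": 0.031, "Y": 0.001}
N_TERMINUS = 0.909   # positively charged fraction of the amino terminus
C_TERMINUS = 1.000   # negatively charged fraction of the carboxyl terminus


def charged_fraction(pka, ph=7.0, acidic=True):
    """Charged fraction of a titratable group from equation (2.3)."""
    ratio = 10 ** (ph - pka)             # [A-]/[HA], deprotonated over protonated
    if acidic:
        return ratio / (1 + ratio)       # acids are charged when deprotonated
    return 1 / (1 + ratio)               # bases are charged when protonated


def mean_residue_charge(sequence):
    """Theoretical total charge at pH 7 divided by the number of residues."""
    seq = sequence.upper()
    total = N_TERMINUS - C_TERMINUS
    total += sum(POSITIVE.get(c, 0.0) for c in seq)
    total -= sum(NEGATIVE.get(c, 0.0) for c in seq)
    return total / len(seq)


# Assumed pKa of 8.0 for a basic group gives the listed N-terminal fraction.
assert round(charged_fraction(8.0, acidic=False), 3) == 0.909
```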

The data were illustrated in the same way as in the three plots for the mean residue hydropathies. However, the 100 bins of each histogram only ranged from -0.3 to 0.3 (again with adjusted ordinate scales for the [3000;) intervals), the ordinate scales for the box and whisker plots of the complete length intervals ranged from -1 to 1 e, and the scales for all test box and whisker plots ranged from -0.4 to 0.4 e.

2.1.4 Amino acid sequence

On the level of primary structure, two kinds of examinations were performed: first a prediction of the secondary structure content and second the identification of potential intrinsically disordered regions.

2.1.4.1 Secondary structure based on primary structure

To determine the dependency of the secondary structure content on the sequence lengths, the secondary structure first had to be predicted. For this purpose the PSIPRED [Jones 1999b] secondary structure prediction method was applied. PSIPRED is a simple and accurate (average Q3 = 80.7 %) secondary structure prediction method incorporating two feed-forward neural networks. The PSIPRED predictions are based on output obtained from PSI-BLAST [Altschul 1997]. This program sensitively identifies distant evolutionary relationships by searching the protein database with profiles created from more closely related protein sequences. In this way a larger group of proteins is found and the profile queries are adjusted iteratively. However, this increase in sensitivity demands longer processing times, and due to time constraints not all of the over 2 million prokaryotic protein sequences could be analyzed. Hence, the analysis concentrated solely on large sequences equal to or larger than 2000 residues, which comprised 98 archaeal and 2506 bacterial proteins representing 290407 and 8246217 amino acids, respectively (Table 2.1).

Since the PSIPRED method only accepts sequences of less than 1500 amino acids, all proteins had to be submitted in more than one fragment. Subsequently, all residues within a certain length interval and contained in sequences of either archaea, bacteria, halomucin or archaea and bacteria together were assigned to one of four categories according to their prediction results. If the prediction confidence (calculated by PSIPRED) was equal to or larger than 6, the residue was assigned to the group of its predicted secondary structure type, H (helix), E (strand) or C (coil). Amino acids with predictions of lower confidence were counted as uncertainly predicted residues. Finally, the entries of all four categories were normalized by the total number of analyzed residues within the four categories together.
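The assignment of residues to the four categories can be sketched as below. The sketch assumes that the prediction and the confidence values are available as two strings of equal length, one state letter (H, E or C) and one confidence digit (0 to 9) per residue; the function name `classify_residues` is hypothetical and not part of PSIPRED.

```python
def classify_residues(pred, conf, threshold=6):
    """Fractions of confidently predicted H, E and C residues and of uncertain ones.

    pred: string of predicted states, one of 'H', 'E', 'C' per residue
    conf: string of confidence digits (0-9), same length as pred
    """
    if len(pred) != len(conf):
        raise ValueError("prediction and confidence strings differ in length")
    counts = {"H": 0, "E": 0, "C": 0, "uncertain": 0}
    for state, digit in zip(pred, conf):
        if int(digit) >= threshold:
            counts[state] += 1
        else:
            counts["uncertain"] += 1
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}
```

For fragments of one protein, the prediction and confidence strings would simply be concatenated before calling the function, so that the normalization runs over all analyzed residues.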

Table 2.1: List of the numbers of sequences equal to or larger than 2000 amino acids and of the respective amino acid numbers within each length interval analyzed using the PSIPRED secondary structure prediction method for archaeal and bacterial proteins.

2.1.4.2 Natively unfolded regions

For the identification of potential intrinsically disordered sequence sections, low-complexity regions were first predicted using the SEG method [Wootton 1993], [Wootton 1996]. SEG is usually applied to BLAST [Altschul 1997] queries to remove low-complexity regions or sequence repeats before starting the run, since these might produce high scores that prevent the program from finding the actually significant sequences in the database. Low-complexity regions can consist of long homopolymeric runs or other stretches of biased amino acid composition. More subtle over-representations of one or a few amino acids are also recognized by SEG. Since intrinsically disordered sequences feature extensive (longer than 30 or 40 amino acids) regions of low sequence complexity [Fink 2005], only those SEG-predicted low-complexity regions equal to or larger than 40 residues were taken into account. A further characteristic of natively unfolded proteins is a compositional bias towards hydrophilic amino acids [Romero 2001]. Therefore all detected hydrophobic low-complexity regions (positive overall hydropathy value according to [Kyte 1982]) were excluded as well.

The first plot of the resulting data shows the mean number of such potential intrinsically disordered regions per protein for all sequences within one of the complete length intervals. That is, the number of potential natively unfolded regions was determined for every sequence, and the mean value (2.1) and standard deviation (2.2) were then calculated over all sequences within one of the stated length intervals. Archaeal and bacterial proteins were analyzed for all of the complete length intervals, whereas the results for all prokaryotic sequences and the values for halomucin were only plotted for the overall interval ([0;)).

A second illustration shows the dependency of the lengths of the obtained potential intrinsically disordered regions on the respective protein lengths for archaeal and bacterial sequences separately. In each case (for archaea as well as for bacteria), first a scatter plot with an ordinate scale reaching up to 3000 amino acids and second a box and whisker plot with a range up to 500 residues were given for the complete length intervals as well as for all sequence lengths ([0;)). Further, the data of halomucin were indicated separately within the graph of archaea. However, since the sequence of halomucin features only two potential natively unstructured regions, a box and whisker plot of these data is hardly meaningful.
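The two filter criteria applied to the SEG output, a minimum length of 40 residues and a non-positive overall hydropathy, can be sketched as follows. The sketch assumes that the SEG result has already been parsed into 1-based, inclusive (start, end) positions; the function name `candidate_disordered_regions` is hypothetical, and taking the sum of the indices as the "overall hydropathy value" is an assumption (the sign is the same for sum and mean).

```python
# Hydropathy indices according to Kyte and Doolittle (1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def candidate_disordered_regions(sequence, segments, min_len=40):
    """Keep low-complexity segments that are long enough and not hydrophobic.

    segments: iterable of (start, end) positions, 1-based and inclusive
    """
    kept = []
    for start, end in segments:
        region = sequence[start - 1:end]
        if len(region) < min_len:
            continue                      # too short to count as extensive
        overall = sum(KYTE_DOOLITTLE.get(c, 0.0) for c in region.upper())
        if overall > 0:
            continue                      # hydrophobic region, excluded
        kept.append((start, end))
    return kept
```

The number of kept segments per protein corresponds to the quantity that was averaged per length interval, and `end - start + 1` gives the region lengths used in the second illustration.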


2.2 Part 2:

Haloquadratum walsbyi’s protein giant halomucin - in detail

2.2.1 Visualization of halomucin’s dimensions

Since halomucin is a protein of extraordinary dimensions, a few thought experiments were performed to become better acquainted with its spatial extent. Initially, it was imagined as a spherical molecule similar to globular proteins. To calculate the volume halomucin would occupy as a globular molecule, the mean volume per amino acid was first determined from halomucin’s amino acid composition (Table 2.2). With a mean volume of 118.3 Å³ per amino acid, a total of 9159 residues and the assumption that the van der Waals volume of halomucin accounts for between 75 % (tight protein interior) and 58 % (in water) of the occupied volume, the protein volume would range between 1444.7 and 1868.2 nm³, respectively. If the protein adopted a sphere-like shape, so that equation (2.4) applies, the diameter of a globularly folded halomucin would lie between 14.0 and 15.3 nm.

$$V = \frac{1}{6} \pi d^3 \qquad (2.4)$$

Table 2.2: Calculation of halomucin’s mean volume per amino acid based on its amino acid composition. The first column represents the 20 amino acid characters in single letter nomenclature, the second column lists the fraction of each residue within the sequence of halomucin, the third column indicates the theoretical volume of each amino acid (according to [Zamyatnin 1972]) in Å³ and the last column the product of the second and third column. The sum of the latter values represents the theoretical mean volume per amino acid for halomucin.

Secondly, an attempt was made to visualize the complete sequence of halomucin folded as one single alpha-helix. Since an alpha-helix features a translation of 0.15 nm per amino acid, such a helix built of 9159 residues would reach a length of approximately 1.4 µm, and its width would match the typical alpha-helix value of 0.54 nm.

Furthermore, it was imagined that the whole sequence of halomucin adopts a linearly extended conformation, forming a long all-trans polyaminocarboxylic acid. With a translation of 0.38 nm between two Cα atoms of the protein backbone, the length of this theoretical linear molecule would reach approximately 3.5 µm.
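The three thought experiments amount to simple arithmetic, which the following sketch retraces. All input numbers are taken from the text; the variable and function names are hypothetical. Small deviations from the quoted volumes can arise because the mean volume per amino acid enters here in its rounded form of 118.3 Å³.

```python
import math

N_RESIDUES = 9159          # length of halomucin
MEAN_VOLUME_A3 = 118.3     # mean volume per amino acid in cubic angstroms

# Van der Waals volume of the whole chain; 1 nm^3 equals 1000 cubic angstroms.
vdw_volume_nm3 = N_RESIDUES * MEAN_VOLUME_A3 / 1000


def sphere_diameter(volume):
    """Diameter of a sphere of the given volume, from V = (1/6) * pi * d^3."""
    return (6 * volume / math.pi) ** (1 / 3)


for packing in (0.75, 0.58):   # tight protein interior, protein in water
    volume = vdw_volume_nm3 / packing
    print(f"packing {packing:.2f}: V = {volume:.1f} nm^3, "
          f"d = {sphere_diameter(volume):.1f} nm")

helix_length_um = N_RESIDUES * 0.15 / 1000      # 0.15 nm rise per residue
extended_length_um = N_RESIDUES * 0.38 / 1000   # 0.38 nm between C-alpha atoms
print(f"single alpha-helix: {helix_length_um:.1f} µm")
print(f"fully extended chain: {extended_length_um:.1f} µm")
```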


2.2.2 Functional predictions

2.2.2.1 Background: Amino acid sequence

Dissection into submittable fragments

In order to make bioinformatical approaches such as structure prediction methods applicable to halomucin, its sequence first had to be split into submittable fragments. To avoid separating structural domains, the domain boundaries of halomucin were predicted.

The first method applied for this purpose was DomSSEA at the DomPred Protein Domain Prediction Server [Marsden 2002]. The homology-based approaches at the DomPred server had been unsuccessful, i.e. neither meaningful matches to domain sequences from Pfam-A [Bateman 2004] nor significant domain termini peaks searched by the PSI-BLAST [Altschul 1997] alignment algorithm were detectable, so this third method at the DomPred Protein Domain Prediction Server had to be utilized. DomSSEA is based on the idea that it might be possible to parse a long target sequence into putative domains by simply mapping its secondary structure elements (predicted for the target by a crude fold recognition algorithm) to observed secondary structure patterns in domains of known 3D structure [McGuffin 2003]. This process is known as secondary structure element alignment (SSEA). Because the DomSSEA method does not accept target sequences longer than 2500 residues, halomucin had to be submitted in parts. To account for prediction inaccuracies near the fragment borders, two shifted segmentations of halomucin’s sequence were applied. In addition to the submission of four almost equal parts of 2290 amino acids (1 - 2290, 2291 - 4580, 4581 - 6870 and 6870 - 9159 amino acids), five segments with unaltered spacing, but with the second fragment starting in the middle of the previously first one (1 - 1145, 1146 - 3435, 3436 - 5725, 5726 - 8015, 8016 - 9159 amino acids), were submitted as well. For all 9 fragments, the DomSSEA results with the highest scores for the predicted domain boundaries were used. Subsequently these outcomes were combined manually, which resulted in 22 fragments for the sequence dissection of halomucin.

A second approach to divide halomucin’s sequence into submittable fragments used the automatically constructed protein domain family database ProDom [Corpet 2000], [Servant 2002], [Bru 2005]. At this server no initial fragmentation of halomucin was necessary. Its sequence was compared with the ProDom database using a BLAST search against the multiple alignments provided for each ProDom family and a default E value of 0.01. The initial filter function to exclude low-complexity regions was switched off. The BLAST search resulted in 48 hits of ProDom domains producing High-scoring Segment Pairs (HSP). After the removal of domains with overlapping positions, favoring the ones with the lowest E values, 26 sequence sections similar to ProDom domains remained for halomucin.
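The removal of overlapping ProDom hits in favor of the lowest E values can be illustrated with a greedy selection. This is only one plausible reading of the procedure, stated here as an assumption; the text does not specify how ties or chains of overlaps were resolved. The function name `remove_overlaps` and the (start, end, E value) tuple layout are hypothetical.

```python
def remove_overlaps(hits):
    """Greedy selection of non-overlapping hits, best (lowest) E value first.

    hits: list of (start, end, evalue) with 1-based, inclusive positions
    """
    kept = []
    for start, end, evalue in sorted(hits, key=lambda hit: hit[2]):
        # Keep the hit only if it overlaps none of the hits accepted so far.
        if all(end < k_start or start > k_end for k_start, k_end, _ in kept):
            kept.append((start, end, evalue))
    return sorted(kept)   # report in sequence order


# Invented toy hits: the middle hit wins against the one it overlaps.
hits = [(1, 100, 1e-5), (50, 150, 1e-20), (160, 200, 1e-3)]
print(remove_overlaps(hits))
```

A greedy pass of this kind always retains the best-scoring hit of any group of mutually overlapping hits, which matches the stated preference for the lowest E values.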

Identification of structural homologues

One of the objectives during the analysis of halomucin was the identification of structural neighbors. Since a BLAST search revealed only quite distant homologies for halomucin, the detection of fold-level homology was attempted. For this purpose three fold recognition methods were compared, namely mGenTHREADER [Jones 1999a], [McGuffin 2003], 3D-PSSM [Fischer 1999], [Kelley 1999] and Phyre [Bennett-Lovsey 2008]. There are many approaches to this kind of inverse folding problem, i.e. predicting how well a fold fits a sequence. All are based on the observation that nature is apparently restricted to a limited number of protein folds, which increases the chance that a protein with a fold similar to that of the target exists whose 3D structure has already been determined experimentally. However, for full 3D threading, identifying the best alignment between target sequence and template structure is very difficult. Hence fold recognition methods often derive a 1D profile for each structure within their fold library and align the target sequence to these profiles. The three compared methods follow this profile-based approach. They further improve the search by generating profiles of the target sequences as well and by incorporating structural information through secondary structure predictions. Fold recognition by mGenTHREADER includes the secondary structure information predicted by PSIPRED as well as secondary structure element alignments (SSEAs). The web-based fold recognition method 3D-PSSM further utilizes solvation potential information to search for compatible folds and to predict a 3D structure for the target sequence. Phyre additionally introduces changes to the template backbones when modelling insertions or deletions, which is accomplished by searching a loop library for compatible loops.

All three methods were applied to each of the 22 sequence fragments (fragment 1 - 22) of halomucin obtained by using the DomSSEA method. Further, the 26 short sequence sections (domain A - Z) representing ProDom domains were analyzed. Hence, for fragment 1 (A - E), 2 (F), 3 (G), 4 (H), 5 (I, J), 6 (K - M), 7 (N, O), 9 (P, Q), 10 (R), 11 (S), 12 (T), 13 (U), 14 (V - X), 15 (Y) and 16 (Z) additional predictions were taken into account to manually (and to some extent intuitively) assign one template structure to each sequence fragment of halomucin. However, the predictions for fragment 3 did not seem sufficient to make a decision.

Conclusion

In order to prepare a slightly more realistic picture of halomucin’s spatial dimensions, the structures of the predicted homologues for fragments of halomucin’s sequence were attributed to the corresponding sequence sections of halomucin and manually arranged on a 2D plane. For the depiction of the PDB [Berman 2000] structures the Python-enhanced molecular graphics program PyMOL [DeLano 2002] was utilized. In order to present only the structural parts that actually aligned with halomucin’s sequence, each template structure was truncated. The figures were generated using a gray70 surface representation (transparency: 0.5) as well as a display of the backbone structure in cartoon mode. In addition, the secondary structure types were colored blue (helices), red (strands) and yellow (coils), and each N- (blue) and C- (black) terminus was indicated. Images were saved after choosing ray tracing and the white background option. The loops connecting these 21 structural representations within halomucin’s sequence were drawn in manually (arbitrarily).


Chapter 3

Results and discussion

Contents

3.1 Part 1

3.1.1 Distribution of prokaryotic protein sizes and a consideration of the dataset

3.1.2 Amino acid composition

3.1.3 Amino acid sequence

3.1.4 Discussion and explanation attempts

3.2 Part 2

3.2.1 Visualization of halomucin’s dimensions

3.2.2 Functional predictions

The subject matter of my thesis can be divided into a rather general first part, with analyses covering the whole dataset of prokaryotic proteins, and a more detailed second part concerned with one extremely large archaeal polypeptide, namely halomucin.

The following chapter presents the results of the two subprojects together with a critical discussion of these outcomes. After the data and their difficulties have been considered, the results based on the amino acid composition as well as on the amino acid sequence of the proteins are described. The subsequent section aims to combine and, if possible, to explain these findings. The chapter closes with the attempt to give a better picture of the extraordinarily large protein halomucin, originating from the square halophilic archaeon Haloquadratum walsbyi.

3.1 Part 1:

Extraordinarily large, prokaryotic proteins - in general

3.1.1 Distribution of prokaryotic protein sizes and a consideration of the dataset

The first step of this master project consisted in the collection of prokaryotic protein data, which were obtained from the HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes) database [Gattiker 2003]. 51 archaeal and 595 bacterial proteomes were available at that time. The protein sequences of all these complete prokaryotic proteomes were pooled and subsequently arranged according to their lengths and their membership in one of the following groups: archaea, H. walsbyi or bacteria. To compare the prokaryotic data with eukaryotic proteins, 45 proteomes of eukaryotes were downloaded from the Ensembl database [Hubbard 2009] and the proteins were likewise sorted by their lengths (eukaryotes). Figure 3.1 shows the resulting fraction of proteins for each of twenty length intervals, which were chosen in steps of 100 and 1000 amino acids, respectively. The fractions of proteins equal to or larger than 1000 amino acids are evidently quite small, considering that these proteins are the ones of interest in the present study. Conversely, it seems astonishing that 2 archaeal, 31 bacterial and 73 eukaryotic proteins are larger than 10000 amino acids. Since the length intervals from [500;600) to [900;1000) together contain fewer proteins than the corresponding [400;500) intervals, they are later merged into one interval [500;1000). On the other hand, the interval [1000;2000) is split into [1000;1500) and [1500;2000). Table 3.1 represents this new arrangement numerically. It can be observed that of approximately 3 million analyzed proteins, about one third is of eukaryotic and two thirds are of prokaryotic origin. Of the prokaryotic proteins, around 5 % derive from archaea and 95 % from bacteria.

Figure 3.1: Fraction of proteins per length interval: illustration of the length distribution of proteins derived from 51 archaeal and 595 bacterial (HAMAP) and 45 eukaryotic (Ensembl) proteomes. The same data are plotted twice, but with different ordinate scales. The brackets indicate that an interval covers protein lengths from the first value up to one amino acid less than the second value. [10000;) denotes all proteins with lengths of 10000 amino acids or more. The ordinate indicates the protein fraction relative to one of the four data sets shown: archaea, H. walsbyi (data included in archaea), bacteria or eukaryotes. The total number of proteins is written above the bars (upper graph: length intervals [0;100) - [900;1000), lower graph: [1000;2000) - [10000;)).

Table 3.1: Numerical representation of the arrangement applied to the protein data according to their lengths. The first column of each group gives the total numbers of proteins, the second column the corresponding fractions. The data set of H. walsbyi is included in that of archaea. The protein numbers of prokaryotes correspond to the sum of archaea and bacteria.

The smallest sequences within the datasets of archaea and bacteria are an archaeal putative uncharacterized protein of Methanosarcina mazei (Q8PZD8, 16 amino acids) and a bacterial hypothetical small peptide of Lactobacillus sakei subsp. sakei (Q38ZK3, 8 residues). The shortest sequence of the archaeon Haloquadratum walsbyi is described as an IS1341-type transposase (Q18KW6, 17 amino acids). Returning to the protein giants, the largest archaeal protein sequence found within the prokaryotic proteome data is a putative uncharacterized protein of Cenarchaeum symbiosum (A0RVT6, 11910 residues), the longest bacterial sequence is a parallel beta-helix repeat protein of Chlorobium chlorochromatii (Q3ASY8, 36805 amino acids) and the biggest protein of H. walsbyi is known as halomucin (Q18DN4, 9159 residues).

It should be pointed out again that prokaryotic proteins with sequence lengths equal to or larger than 1000 amino acids account for only about one percent of the total data set, which occasionally makes the reproducibility questionable. Furthermore, the reliability is reduced by the fact that more than 50 % of the proteins in prokaryotic (and eukaryotic) proteomes are described as putative, hypothetical, poorly characterized, unknown or, in short, of uncertain determination [Tatusov 2001], [Brocchieri 2005].

3.1.1.1 Mean length of proteins within the three domains of life

Despite these possible limitations of the data, the consistently reported result that eukaryotic proteins are generally larger than bacterial sequences, which in turn are slightly longer than archaeal proteins [Galperin 1999], [Zhang 2000], [Karlin 2002], [Brocchieri 2005], was obtained here as well. While most prokaryotic proteins are found in the interval from 100 to 199 amino acids, eukaryotes show the most entries between 300 and 399 amino acids. Table 3.2 likewise demonstrates the small difference in mean protein length within the prokaryotes, where archaeal sequences average slightly less than bacterial proteins, and the more considerable difference between prokaryotes and eukaryotes.

Table 3.2: List of mean values and respective standard deviations of protein lengths for all entries of the group indicated in the first column. The total number of tested proteins for each group is given in the last column.

The greater length of eukaryotic proteins may be due to a higher degree of cellular complexity compared to prokaryotic organisms [Brocchieri 2005]. It has been shown that sequences of eukaryotes are often expanded by functional regulator domains or motifs [Zhang 2000].


The observation that, within eukaryotes, evolutionary fusion processes led from single-function proteins to multi-functional and multi-domain sequences might be explained by the greater difficulty interacting subunits have in associating with each other in a more crowded cytoplasm and a complex array of compartments. Fusion of interacting functional units could have circumvented the need to produce higher concentrations of the interaction partners for achieving protein complexes [Das 1997], [Karlin 2002].

One explanation offered for the fact that archaeal proteins are on average marginally smaller than bacterial proteins has been that this represents an artifact of genome annotation [Skovgaard 2001]. However, it could be demonstrated that bacterial proteomes include a greater fraction of longer proteins involved in metabolism or other cellular processes, that within most functional classes protein families unique to bacteria are longer than ones unique to archaea, and that within one protein family homologues from bacteria tend to be longer than the corresponding homologues from archaea [Brocchieri 2005]. The biological reasons might lie in the different environmental temperatures of the mostly mesophilic bacterial and the mostly thermophilic archaeal species [Koonin 2002]. A higher need for protein stability often results in shorter disordered loops or terminal tails [Thompson 1999], [Kumar 2001], [Vieille 2001]. Further, smaller proteins feature a less distinct difference in heat capacity between the folded and the unfolded state [Myers 1995], [Ganesh 1999], which results in a lower curvature of the protein stability curves and thus in higher heat denaturation temperatures. In addition, bacterial proteins appear to be more prone to domain fusion, for reasons similar to those given for eukaryotes [Brocchieri 2005], even though there is no obstructive cellular compartmentalization. Finally, it should be noted that, compared to archaea, many bacterial organisms lead a parasitic way of life. They almost completely lack the pressure to minimize costs related to amino acid usage, which is in turn exerted on the predominantly free-living archaea, along with the exposure to environmental stresses and fluctuations [Seligmann 2003].

3.1.2 Amino acid composition

Following the general examination of large prokaryotic proteins, the basic features of long sequences and their differences compared to ordinarily sized proteins were of interest. The analysis of the amino acid composition seemed a worthwhile subject and fairly convenient to perform. First, it had to be clarified whether there is any dependency at all between the amino acid composition of a protein and its length. Therefore all archaeal or bacterial proteins with lengths falling into one of four test intervals (100 - 199, 200 - 299, 300 - 399, or 3000 and more amino acids) were analyzed for their amino acid composition, and the mean values within each length interval were plotted for each amino acid (Figure 3.2).
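The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code used for this thesis: the function names are hypothetical, frequencies are taken relative to the full sequence length, and the population standard deviation is used because the text does not specify the estimator.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# The four test intervals named in the text; None marks an open upper bound.
TEST_INTERVALS = [(100, 199), (200, 299), (300, 399), (3000, None)]


def composition(seq):
    """Relative frequency of each of the 20 amino acids in one sequence."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def interval_of(length):
    """Return the test interval a length falls into, or None."""
    for lo, hi in TEST_INTERVALS:
        if length >= lo and (hi is None or length <= hi):
            return (lo, hi)
    return None


def mean_composition_by_interval(sequences):
    """Mean and standard deviation of each frequency per test interval."""
    per_interval = {iv: [] for iv in TEST_INTERVALS}
    for seq in sequences:
        iv = interval_of(len(seq))
        if iv is not None:
            per_interval[iv].append(composition(seq))
    result = {}
    for iv, comps in per_interval.items():
        if comps:  # skip intervals without any protein
            result[iv] = {
                aa: (mean([c[aa] for c in comps]), pstdev([c[aa] for c in comps]))
                for aa in AMINO_ACIDS
            }
    return result
```

Proteins whose length lies outside the four test intervals are simply ignored, which mirrors the restriction to test intervals made in this first step.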

Figure 3.2: Amino acid composition - mean amino acid frequencies per protein: To test whether there is any correlation between the amino acid compositions and the lengths of the examined proteins, four test intervals (100 - 199, 200 - 299 and 300 - 399 amino acids, and large proteins of 3000 or more amino acids) were analyzed for archaeal (top) and bacterial proteins (bottom). Depicted are the mean frequencies of each of the 20 proteinogenic amino acids and the respective standard deviations within one test interval.

Evidently, there is indeed a correlation between the amino acid compositions and the lengths of the tested proteins. Differences are noticeable not only between the three intervals of small proteins and the large proteins, but also between archaeal and bacterial proteins of the same interval.

To begin with the differences between archaea and bacteria, the fractions of glutamate, isoleucine, lysine, valine and tyrosine appear to be increased, and those of alanine, histidine and glutamine decreased, in archaeal proteins compared to bacterial sequences. Assuming that the major difference between archaea and bacteria lies in the environmental conditions the two domains predominantly occupy, one would expect an evolutionary pressure towards a higher degree of protein stability in archaeal species. All information needed to achieve, for example, thermotolerance is encoded in the proteins’ sequences [Bohm 1994], [Vieille 1996], and thus potential differences may already be observable at the level of amino acid composition. It has been published [Chakravarty 2000], and is confirmed by this work, that amino acids such as valine and glutamate are enriched in proteins of thermophiles. The increase of the charged amino acid glutamate might be due to a more frequent occurrence of salt bridges and ionic bonding [Xiao 1999]. The rise of the valine content is perhaps a result of an increased need for the rigidity offered by β-branched amino acids [Lee 1994], which reduces the gain in conformational entropy upon unfolding. The higher amount of isoleucine compared to bacteria might be explained in the same way. The mentioned study of determinants of protein stability [Chakravarty 2000] also showed a simultaneous depletion of histidine, serine, threonine and glutamine. This can be confirmed here only for histidine and glutamine. The decrease of glutamine could protect proteins of thermophiles from temperature-induced deamidation [Haney 1999].

Turning to the characteristics of large proteins (3000 amino acids and more) compared to the three sets of smaller sequences, archaeal and bacterial proteins show an increase of alanine, aspartate, glycine, asparagine, serine and threonine as well as a decrease of cysteine, glutamate, phenylalanine, histidine, isoleucine, lysine, leucine, methionine, arginine and tryptophan with larger protein lengths.

The rise of rather small residues might improve the spatial packing of large proteins. However, the increase of amino acids such as glycine, asparagine and, for archaea, also proline could equally be the result of a larger number of turn structures, which require small, polar amino acids to sample their special conformations [Marcelino 2008].

On the other hand, the likewise small amino acid cysteine appears to be less frequent within large proteins. Under certain conditions, pairs of cysteines are able to form disulfide bridges. Analyses of the optimization of electrostatic interactions as a stabilization factor within proteins revealed that a low level of spatial optimization of electrostatic interactions is often compensated by covalent cross-links within the protein structure [Spassov 1994]. Further, smaller proteins are generally characterized by a lower electrostatic optimization value and hence show a higher density of disulfide bridges [Ladenstein 2006]. In general, covalent disulfide bridges are believed to stabilize proteins by decreasing the entropy of the unfolded state [Matsumura 1989]. Fewer cysteines, and therefore fewer cystines, could thus also mean that larger proteins tend to prefer the unfolded state.

Besides the loss of cysteines, the fractions of phenylalanine, isoleucine, leucine and tryptophan also drop. Together with the increase of glycine, serine and proline (archaea), such a compositional bias represents a distribution characteristic of natively unfolded proteins [Romero 2001]. The rise of asparagine and the decrease of glutamate, lysine and arginine, however, seem to contradict this.

The latter, along with the decrease of histidine and cysteine, appears to indicate a loss of catalytically active residues. Analyses showed that 65 % of the amino acids within the active sites of the tested enzymes are charged residues [Bartlett 2002]. This is explained by the requirement for electrostatic forces to catalyze movements of protons and electrons. Since the side chains of histidine and cysteine have the pKa constants closest to physiological pH values, they are also very often involved in biocatalytic processes, especially the catalysis of acid-base reactions (histidine 18 %, cysteine 6 % of all catalytic residues [Bartlett 2002]). Nevertheless, a catalytic center seldom consists of more than three, at most seven, residues, and the increase of the aspartate fraction in larger proteins would be inconsistent with this theory. However, the higher degree of flexibility introduced by small amino acids such as glycine and aspartate would further act against the rigidity needed for specific enzymatic activity.


Since several differences were observable, additional analyses of the residue contents were performed. Admittedly, a bias of the amino acid composition may often be of evolutionary relevance rather than a genuine indication of a specific adaptation. In either case, the amino acid sequences and the residue interactions within a protein are more important than the amino acid contents [Vieille 2001]. Both are disregarded in the subsequent analyses, which are based on the amino acid compositions alone.

3.1.2.1 Physicochemical properties

To analyze the influence of the protein lengths on the amino acid contents in more detail, the 20 side chains were grouped according to their physicochemical properties and all defined length intervals (complete intervals) were compared. The chosen groups were nonpolar, aliphatic amino acids (alanine, isoleucine, leucine, methionine and valine), polar, uncharged residues (asparagine, glutamine, serine and threonine), positively charged residues (histidine, lysine and arginine), negatively charged residues (aspartate and glutamate) and aromatic amino acids (phenylalanine, tryptophan and tyrosine), as well as the three residues cysteine, glycine and proline, which were plotted separately (Figure 3.3).
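The grouping described above can be illustrated with a short Python sketch. It is an assumption-laden example rather than the original implementation: the dictionary keys and function names are hypothetical, and only the group memberships are taken from the text.

```python
# Group memberships as listed in the text (one-letter codes).
PHYSICOCHEMICAL_GROUPS = {
    "nonpolar_aliphatic": "AILMV",
    "polar_uncharged": "NQST",
    "positive": "HKR",
    "negative": "DE",
    "aromatic": "FWY",
    "cysteine": "C",
    "glycine": "G",
    "proline": "P",
}


def group_fractions(seq):
    """Fraction of residues of one protein that fall into each group."""
    n = len(seq)
    return {
        name: sum(seq.count(aa) for aa in members) / n
        for name, members in PHYSICOCHEMICAL_GROUPS.items()
    }


def difference_to_overall_mean(interval_means, overall_means):
    """Group fractions of one length interval minus the mean over all proteins."""
    return {name: interval_means[name] - overall_means[name] for name in interval_means}
```

Because every one of the 20 amino acids belongs to exactly one of these eight groups, the fractions of a protein consisting only of standard residues add up to one.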

Since most of the proteins have lengths of up to 500 amino acids, these length intervals show rather small differences compared to the mean values over all proteins.

Concentrating on the larger proteins, from around 1000 amino acids on, the fraction of polar but uncharged residues as well as the fraction of glycine increases within archaeal and bacterial proteins. As mentioned before, glycine introduces a high degree of flexibility into a protein’s backbone and allows it to adopt turn structures [Marcelino 2008] as well as intrinsically disordered regions [Romero 2001].

On the other hand, the fractions of nonpolar, aliphatic and especially of positively charged amino acids seem to be decreased for both archaea and bacteria.

The reduction of the nonpolar, aliphatic fraction is mostly caused by the loss of isoleucine and leucine with increasing protein length. Both residues are depleted in natively unfolded regions [Romero 2001].

Since enzymatic active sites often contain positively charged amino acids (38 %, [Bartlett 2002]), their reduced amounts within larger sequences perhaps reflect a lower enzymatic activity of long proteins. Long proteins might therefore be exposed to less evolutionary pressure towards maintaining positively charged amino acids.

The drop of cysteine has already been explained by a higher density of disulfide bridges within smaller proteins [Ladenstein 2006], the possible aim to increase the entropy of the unfolded state, a higher tendency of large proteins to form unfolded regions, and a lower enzymatic activity, which decreases the need for catalytically active cysteines.

Fewer aromatic residues are consistent with the assumption of a higher proportion of intrinsic disorder in larger sequences, which show a compositional bias towards fewer hydrophobic and inflexible amino acids [Romero 2001].

3.1.2.2 Secondary structure preferences

Admittedly, the amino acid composition says rather little about the secondary structure of a protein, and structural conclusions have to be drawn very carefully. The following graphs do not represent any kind of secondary structure prediction. They are labeled Fasman89 (Figure 3.4) and Malkov08 (Figure 3.5) only because the amino acids were arranged into groups of preferred secondary structures (indicated in the figure captions) according to the Chou-Fasman parameters (Appendix A.1) and the Malkov correlation coefficients (Appendix A.2), respectively.


Figure 3.3: Amino acid composition - mean physicochemical amino acid fractions per protein: Dependency of the size of certain physicochemical amino acid fractions on the sequence lengths of either archaeal (top) or bacterial (bottom) proteins. In each case the first graph represents the mean amino acid fractions and the respective standard deviations within a certain sequence length interval. The two lower plots show, for each case, these fractions minus the mean values over all archaeal or bacterial sequences. They represent the differences in two versions of ordinate scaling. The chosen colors are yellow for nonpolar and aliphatic amino acids (A, I, L, M and V), orange for polar and uncharged ones (N, Q, S and T), red for positively charged (H, K and R) and blue for negatively charged ones (D and E). Aromatic amino acids are colored in violet (F, W and Y), and the three amino acids cysteine (green), glycine (dark turquoise) and proline (light turquoise) are shown separately.


Figure 3.4: Amino acid composition - mean secondary structure preferences [Fasman 1989] per protein: Relation between the numbers of amino acids with preferences for certain secondary structure types (according to [Fasman 1989]) and the lengths of the sequences they belong to. Each of the two upper graphs shows the direct amino acid fractions and the lower ones these fractions minus the mean values over all archaeal (top) or bacterial (bottom) proteins. The fractions are averages, including standard deviations, of the summed frequencies of all amino acids within the same secondary structure preference group of a given protein. Since some amino acids belong to more than one group, the sum of all fractions within a certain length interval does not equal one. Regarding the colors, dark blue stands for strong helix formers (A, E, L and M), blue for helix formers (F, I, K, Q and W), red for strong sheet formers (I, V and Y) and orange for sheet formers (C, F, L, Q, T and W); turn forming residue fractions (D, G, N, P and S) are colored in brown.
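The remark that the fractions do not add up to one can be made concrete with a small Python sketch. It uses only the group memberships listed in the caption of Figure 3.4. The names are hypothetical and the code is meant as an illustration, not as the script behind the figure.

```python
# Preference groups as listed in the caption of Figure 3.4.
FASMAN_GROUPS = {
    "strong_helix": "AELM",
    "helix": "FIKQW",
    "strong_sheet": "IVY",
    "sheet": "CFLQTW",
    "turn": "DGNPS",
}


def preference_fractions(seq):
    """Summed frequency of all residues belonging to each preference group."""
    n = len(seq)
    return {
        group: sum(1 for aa in seq if aa in members) / n
        for group, members in FASMAN_GROUPS.items()
    }


# Leucine is both a strong helix former and a sheet former, so a pure
# leucine stretch counts fully in two groups at once.
poly_leu = preference_fractions("L" * 10)
```

In this listing, histidine and arginine belong to no group at all, so the sum over all groups can fall below one as well as exceed it.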


Figure 3.5: Amino acid composition - mean secondary structure preferences [Malkov 2008] per protein: Correlation between the numbers of amino acids with preferences for certain secondary structure types (according to [Malkov 2008]) and the lengths of their sequences. The two upper graphs show the direct amino acid fractions and the corresponding lower ones the differences between these fractions and the mean values over all archaeal (top) or bacterial (bottom) proteins. The fractions were calculated by first computing the fraction sizes for a single protein by adding the frequencies of all amino acids of a certain secondary structure preference group. Subsequently, the respective mean values (and standard deviations) for all proteins within a certain length interval were derived. Since some amino acids belong to more than one secondary structure preference group, the sum of all fractions within a certain length interval does not equal one. The selected colors are dark blue for alpha-helix preferrers (A, E, K, L, M, Q and R), blue for 3-helix preferrers (D, P and S), red for strand preferrers (F, I, L, T, V, W and Y), dark brown for turn preferring (D, G, N and P) and light brown for bend preferring (D, G, N and S) amino acid fractions. Coil forming residue fractions (D, N, P, S and T) are colored in yellow.


Figure 3.4 shows that larger archaeal and bacterial proteins contain a higher number of amino acids with a tendency to form turn structures. This seems to be consistent with the above explanation for the increase of glycine and other small polar amino acids. Turns are broadly defined as those regions of a polypeptide chain where a change of direction takes place [Rose 1985]. If a longer sequence is folded back on itself with the same number of residues per turn and per intervening region as a small sequence, the fraction of residues within turn structures would increase until the boundary value (length of the sequence for one turn / length of one intermediate region) is reached.

Conversely, amino acids characterized as helix formers or strong sheet formers become less frequent. A resulting increase of turns and loops would support the idea of a higher degree of native disorder within large proteins.

The second classification method (Figure 3.5) is consistent with the one applied before (Figure 3.4). Again, the number of amino acids with a preference for turns, bends and especially unstructured coils is increased within large proteins. Frequencies of amino acids which are often found in helical or extended structures are decreased. This second approach is shown in addition to demonstrate its concordance with the earlier classification method (according to [Fasman 1989]). Besides, treating coil forming residues as a separate fraction puts a stronger emphasis on the increase of this particular secondary structure preference group for large proteins. More residues in flexible, unstructured and surface-exposed chain sections support the idea that large proteins tend to form big intermolecular interfaces and to perform molecular recognition and binding functions rather than enzymatic activities.

However, it has to be pointed out again that the calculations for both figures are based exclusively on the amino acid compositions and do not take into account the sequential arrangement of the residues.

3.1.2.3 Hydropathy values

Likewise at the level of amino acid content, the total hydropathy of each protein was calculated (according to the hydropathy indices for each amino acid [Kyte 1982]) and normalized by the number of residues of the sequence. To test the dependency of the results on the lengths of the analyzed proteins, initially the four test intervals were again plotted as histograms (Figure 3.6). The applied hydropathy indices can be understood as manually adjusted transfer energies (within an interval of -4.5 to 4.5) for moving a particular amino acid from vapor into water (in pseudo-kcal/mol). This means that the transfer of hydrophilic residues such as arginine would release energy, whereas the transfer of hydrophobic amino acids such as isoleucine would cost energy.

Figure 3.6: Amino acid composition - mean residue hydropathy [Kyte 1982] per protein, test intervals, all data points: To check whether there is any correlation between the mean hydropathy and the length of a particular protein, four test intervals (100 - 199, 200 - 299 and 300 - 399 amino acids, and large proteins of 3000 or more amino acids) were analyzed for archaeal (top), bacterial (middle) and H. walsbyi proteins (bottom). Presented are the numbers of entries for each of 100 bins ranging from a hydropathy index of -3 to 3. The hydropathy indices (according to [Kyte 1982]) range from -4.5 for arginine to 4.5 for isoleucine. The plotted values were calculated as an average over one protein for each protein within a test interval. Since there are many more small proteins than proteins of 3000 amino acids or more, two different ordinate scales were applied in each row. The data set of H. walsbyi is (as always) included in that of archaea, and the one entry of H. walsbyi within the [3000;) interval represents halomucin.

Keeping this in mind, two groups of proteins can be distinguished among all analyzed small proteins: one larger group of proteins containing predominantly hydrophilic amino acids and one approximately five times smaller group of generally more hydrophobic proteins. In contrast, the bacterial proteins of 3000 or more amino acids, whose number (1067) is large enough for a meaningful analysis, do not show a comparable peak of hydrophobic proteins around a mean hydropathy index of 0.8, which would be expected to reach a height of circa 30 entries. This indicates that overall hydrophilic polypeptides predominate among sequences of 3000 amino acids and more. A higher hydrophilicity of larger proteins would be consistent with the previous findings of higher loop and turn tendencies. Because turns occur between regions of regular secondary structure, they are frequently located at the protein’s surface. As a consequence, turns are primarily composed of hydrophilic residues [Rose 1978]. Since there are far fewer data points for large archaeal proteins (20) and large proteins of H. walsbyi (1), a reasonable statement seems to be difficult.

The histograms depict only the data of either small or large proteins. Therefore a second graph (Figure 3.7), in the form of box and whisker plots, attempts to represent the complete length intervals.

Figure 3.7: Amino acid composition - mean residue hydropathy [Kyte 1982] per protein, all data points: Box and whisker plots of archaeal (top), bacterial (middle) and H. walsbyi proteins (bottom) presenting all mean hydropathy values according to [Kyte 1982] within a particular length interval. The ordinate scale was defined to cover all data points calculated, and the [0;) interval contains all sequences of one of the three groups.

It should be kept in mind that the number of values for each box plot varies greatly between small and large protein lengths. It is therefore impossible to say whether the observed decrease of outliers for the intervals representing larger proteins reflects a narrower distribution and more similar properties of large proteins, or merely the fact that there were not enough points to sample the same number of rarely occurring outliers. Further, it cannot be said whether the small number of longer proteins per interval yields a representative graph that would not change much if more data points were added. To check this, plots of special subsets were drawn (Figure 3.8).
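As a rough illustration of the per-protein mean hydropathy used in this subsection, the following Python sketch averages the hydropathy indices of [Kyte 1982] over a sequence and bins the results into 100 bins between -3 and 3, as in Figure 3.6. The function names are hypothetical, and the handling of values outside the plotted range (they are dropped) is an assumption.

```python
# Hydropathy indices according to Kyte and Doolittle [Kyte 1982].
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Total hydropathy of a protein normalized by its number of residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def histogram(values, lo=-3.0, hi=3.0, bins=100):
    """Count values per bin; values outside [lo, hi) are ignored (assumption)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v < hi:
            # min() guards against floating-point rounding at the upper edge
            counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts
```

A sequence containing non-standard residue codes would raise a `KeyError` here. How such residues were treated in the original analysis is not stated in the text.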

The tests with equal numbers of entries in each of the adjusted complete length intervals serve two purposes. First, they show whether the amount of data within the last interval ([3000;)) is sufficient to form a representative sample (vertically) and whether all plots are consistent with the ones representing all data points (Figure 3.7). Second, they show whether a plot of randomly picked smaller proteins could by chance produce a graph similar to the one of the [3000;) interval (horizontally). Hence both approaches try to determine whether the data sets are comparable.

Figure 3.8: Amino acid composition - mean residue hydropathy [Kyte 1982] per protein, equal number of data points: Box and whisker plots of five tests for archaeal (left) as well as bacterial (right) protein sequences analyzing the dependency between the mean hydropathy of a protein (according to [Kyte 1982]) and its length. The number of analyzed proteins is equal for each of the ten length intervals within one test. Since the limiting number is the amount of proteins contained within the intervals of large sequences, for both archaea and bacteria all data of the previous (Figure 3.7) intervals [3000;4000), [4000;5000), [5000;10000) and [10000;) were combined into one [3000;) interval. Therefore all smaller archaeal intervals contain 20 randomly picked protein entries and all bacterial ones 1067. The ordinate scale was chosen to allow the most convenient comparison of the data rather than to show all outliers (four bacterial outliers between 2.0 and 2.1 are not shown).

For the analyses of the bacterial sequences, the latter seems to be confirmed. Particularly conspicuous is the distribution of outliers. Small bacterial proteins show far more entries with mean hydropathies above the range covered by the positive whiskers than bigger sequences do. Surprisingly, the interval before the last one ([2000;3000)) consistently shows a plot quite similar to the one of the [3000;) interval. Apparently the tendency of bacterial proteins to form a subgroup of rather hydrophobic sequences already drops substantially for proteins of 2000 amino acids or more.

The archaeal plots on the left side seem to be far less significant. Since the [3000;) interval of archaea, and therefore all other intervals, contain only 20 proteins, more than 5 tests (20, data not shown) were performed. However, neither a vertical nor a horizontal consistency was observable in any of the tests. Probably 20 data points are insufficient to derive clear conclusions.
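The equal-size tests discussed above amount to drawing, for every length interval, a random subset as large as the pooled [3000;) interval. The following Python sketch shows one way to do this. The function names, the interval edges used in the example and the seeds are all hypothetical, since the text does not list the ten interval boundaries or the random number generator that was used.

```python
import random
from bisect import bisect_right


def bin_by_length(proteins, lower_bounds):
    """Sort (length, value) pairs into intervals given by ascending lower bounds."""
    bins = [[] for _ in lower_bounds]
    for length, value in proteins:
        i = bisect_right(lower_bounds, length) - 1
        if i >= 0:
            bins[i].append(value)
    return bins


def equal_size_subsets(bins, rng):
    """Draw from every interval as many values as the last interval contains."""
    n = len(bins[-1])
    # Intervals with fewer than n values are kept whole (assumption).
    return [rng.sample(b, n) if len(b) >= n else list(b) for b in bins]


# Hypothetical toy data: (length, mean hydropathy) pairs and three intervals.
toy = [(150, -0.2)] * 50 + [(2500, -0.4)] * 30 + [(3500, -0.5)] * 5
toy_bins = bin_by_length(toy, [0, 2000, 3000])
# Five independent tests, analogous to the five rows of Figure 3.8.
tests = [equal_size_subsets(toy_bins, random.Random(seed)) for seed in range(5)]
```

Repeating the draw with different seeds corresponds to the vertical comparison in the figure, while comparing the subsets of different intervals within one draw corresponds to the horizontal comparison.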

3.1.2.4 Electric charge

In addition to the protein hydropathies per residue, similar values were calculated for the protein charges, normalized to the respective protein lengths. To test for a correlation between these charge values and the lengths of the analyzed proteins, the four test intervals were again used for plotting histograms (Figure 3.9).

Figure 3.9: Amino acid composition - mean residue charge [Stryer 1995] per protein, test intervals, all data points: In order to find out whether there is a relation between the charge of a protein normalized to its number of amino acids and the length of the examined protein, four test intervals (100 - 199, 200 - 299 and 300 - 399 amino acids, and 3000 and more) were compared for proteins of archaea (top), bacteria (middle) and H. walsbyi (bottom). Shown is the number of hits for each of 100 bins ranging from a mean residue charge of -0.3 to 0.3 (in units of e ≈ 1.6 × 10⁻¹⁹ C). The values were calculated by taking into account the charge of the termini and of the charged residues as well as their charged fractions at pH 7. Positive: R 1.000, K 0.999, N-terminus 0.909, H 0.240; negative: C-terminus 1.000, D 0.997, E 0.997, C 0.031, Y 0.001 (pKa values taken from [Stryer 1995]). Because the first three intervals in each row contain many more entries, an adjusted ordinate scale was applied for the [3000;) interval. The data set of H. walsbyi is included in that of archaea, and the single entry of H. walsbyi within the [3000;) interval represents halomucin.

Unlike the histograms of the mean residue hydropathies (Figure 3.6), the mean charges seem to follow a single normal distribution with a mean around 0. However, the histograms of archaeal proteins of 300 - 399 amino acids and especially of large bacterial proteins ([3000;)) exhibit a small left shoulder, implying that they encompass relatively more proteins which are overall more negatively charged. As presented above, the fractions of positively charged amino acids decrease strongly with increasing sizes of archaeal and bacterial proteins (Figure 3.3). In contrast, the negative residue fractions did not seem to correlate strongly with the proteins’ lengths. The resulting shift towards more negatively charged proteins might be explained in the same way as above, namely that large proteins could be employed for functions other than enzymatic ones. They would therefore harbor fewer active sites, which often contain positively charged amino acid side chains [Bartlett 2002]. The plots of the last interval for archaea and H. walsbyi are drawn with only 20 and 1 data points, respectively. Hence the explanatory power of these graphs is rather small.

As in the analysis of the proteins’ hydropathy values per amino acid (Figure 3.7), a graph in the form of box and whisker plots (Figure 3.10) attempts to include all length intervals. But, as mentioned before, one needs to be careful because of the unequal distribution of the data. Likewise, this graph must be viewed in combination with the subsequent one, which plots subsets of an equal amount of data (Figure 3.11).

Figure 3.10: Amino acid composition - mean residue charge [Stryer 1995] per protein, all data points: Box and whisker plots of the mean residue charge (according to [Stryer 1995]) for all proteins of archaea (top), bacteria (middle) or H. walsbyi (bottom) within a certain length interval. The ordinate scale is intended to cover all values, and the interval [0;) represents the data points of all other intervals within one row.

Analogously, transforming the data into graphs with only one length interval for the large proteins and an equal number of data points attempts to achieve a higher information content and to test the comparative reliability. As already observed for the box and whisker plots of the mean residue hydropathy (Figure 3.7), the graphs of bacterial proteins show many more outliers. Although 20 archaeal data points per interval seem to be insufficient to sample rarer outliers, and the plots are indeed not very representative of the graphs obtained with all of the data (Figure 3.10), the plots of all archaeal proteins observed there also show fewer outliers than the ones of bacteria. This can be interpreted in two ways: either bacterial proteins show a higher degree of diversity, i.e. more positively and especially more negatively charged proteins, or even the number of all archaeal data points per interval is simply insufficient to sample rarer outliers. But, in contrast to the previous hydropathy plots, the five tests (and 15 more, data not shown) for archaeal sequences show a quite high (horizontal and vertical) consistency. The consistently observable decrease of the median with increasing protein length appears more noticeable than for bacterial proteins. This agrees with the earlier finding of a greater loss of positively charged amino acids for larger archaeal proteins compared to bacteria (Figure 3.3). Further, the close distance between the median and the third quartile within the [3000;) interval of archaea implies that most of the proteins are now actually negatively charged. The small shoulder found in the histogram above for archaeal sequences of 300 to 399 amino acids has become the peak. Admittedly, the [3000;) interval contains only as few as 20 data points in both graphs.

For bacterial sequences it can be stated that smaller proteins again show a much broader distribution, in this case with respect to their mean residue charge.

Since small proteins form the majority and thus cover most of the diverse functions proteins are able to perform, a high degree of variety is not very surprising. In contrast, large proteins appear to be more homogeneous, which might be due to a certain kind of specialization. The previously observed higher turn structure tendency, as well as the increased likelihood of large proteins adopting intrinsically disordered regions, might suggest a greater involvement and specialization of large sequences in functions suited to these properties. These would encompass the participation in ligand binding, molecular recognition, the regulation and modulation of protein functions, etc. [Marcelino 2008], [Nakayama 2001], [Fink 2005], or, because of the increased surface to volume ratio, the provision of a compartmentalization of functions. The fraction of activities performed during cellular processes such as enzymatic biosynthesis or metabolism would thereby be decreased [Iakoucheva 2002].
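The mean residue charge discussed in this subsection can be sketched in Python from the charged fractions at pH 7 listed in the caption of Figure 3.9. The function names are hypothetical and the snippet is an illustration only. The helper for the protonated fraction applies the Henderson-Hasselbalch relation, and the pKa of 6.5 used in the example for histidine is an assumption chosen because it reproduces the listed fraction of 0.240; the pKa values actually used are not stated in the text.

```python
# Charged fractions at pH 7 as listed in the caption of Figure 3.9.
POSITIVE_FRACTION = {"R": 1.000, "K": 0.999, "H": 0.240}
NEGATIVE_FRACTION = {"D": 0.997, "E": 0.997, "C": 0.031, "Y": 0.001}
N_TERMINUS = 0.909  # positive
C_TERMINUS = 1.000  # negative


def mean_residue_charge(seq):
    """Net charge at pH 7 (in units of e), normalized to the protein length."""
    charge = N_TERMINUS - C_TERMINUS
    for aa in seq:
        charge += POSITIVE_FRACTION.get(aa, 0.0) - NEGATIVE_FRACTION.get(aa, 0.0)
    return charge / len(seq)


def protonated_fraction(pka, ph=7.0):
    """Henderson-Hasselbalch: fraction of a group carrying its proton at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Assumed pKa of 6.5 for the histidine side chain (illustrative only).
histidine_positive = protonated_fraction(6.5)
```

For long sequences the contribution of the two termini becomes negligible, so the mean residue charge of large proteins is determined almost entirely by the balance of positively and negatively charged side chains.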


Figure 3.11: Amino acid composition - mean residue charge [Stryer 1995] per protein, equal number of data points: Box and whisker plots of five tests performed for archaeal (left) and bacterial (right) proteins to examine the correlation between the mean residue charge of a protein (according to [Stryer 1995]) and its length. The number of sequences analyzed is equal for each of the ten length intervals within one test. Because the number of large proteins is limited, the data of the intervals [3000;4000), [4000;5000), [5000;10000) and [10000;∞) of the previous figure (Figure 3.7) were pooled to obtain one interval of large archaeal or bacterial proteins ([3000;∞)). Hence at least 20 archaeal and 1067 bacterial proteins could be accumulated, and for all smaller intervals the same number of sequences was randomly picked and analyzed. The ordinate scale was chosen to compare the data in the best way possible rather than to show all entries (in total six bacterial outliers between 0.4 and 0.6 are not shown).
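The equal-sized sampling described in this caption can be illustrated with a minimal Python sketch. It is not the code used in the thesis. The charge scheme (Lys/Arg = +1, Asp/Glu = -1, everything else 0) is a simplifying assumption standing in for the [Stryer 1995] values, the interval boundaries passed to `bin_by_length` are hypothetical, and all function names are invented for illustration.

```python
import random
from bisect import bisect_right

# ASSUMPTION: simplified charges at neutral pH, standing in for the
# [Stryer 1995] scale used in the thesis (His is treated as neutral here).
RESIDUE_CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def mean_residue_charge(sequence):
    """Net charge of a protein divided by its number of residues."""
    if not sequence:
        raise ValueError("empty sequence")
    return sum(RESIDUE_CHARGE.get(aa, 0.0) for aa in sequence) / len(sequence)


def bin_by_length(sequences, lower_bounds):
    """Group sequences into half-open length intervals.

    lower_bounds must be sorted; interval i is
    [lower_bounds[i]; lower_bounds[i + 1]) and the last one is open-ended.
    """
    bins = {lb: [] for lb in lower_bounds}
    for seq in sequences:
        idx = bisect_right(lower_bounds, len(seq)) - 1
        if idx >= 0:
            bins[lower_bounds[idx]].append(seq)
    return bins


def equal_size_samples(bins, rng):
    """Randomly pick the same number of sequences from every interval.

    The sample size is the size of the scarcest interval, which in the
    thesis is the pooled interval of large proteins.
    """
    n = min(len(seqs) for seqs in bins.values())
    return {lb: rng.sample(seqs, n) for lb, seqs in bins.items()}


if __name__ == "__main__":
    rng = random.Random(1)
    toy = ["KKDE" * 10, "DDAA" * 30, "AAAA" * 50, "DEDK" * 800]
    bins = bin_by_length(toy, [0, 100, 3000])  # hypothetical boundaries
    for lb, seqs in equal_size_samples(bins, rng).items():
        print(lb, [round(mean_residue_charge(s), 3) for s in seqs])
```

Repeating the draw with different seeds would correspond to the five tests mentioned in the caption.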


3.1.3 Amino acid sequence

In addition to the analyses at the level of amino acid composition, the sequential arrangement of residues and its potential correlation with the lengths of archaeal and bacterial proteins have been of particular interest. To this end, the secondary structure content was predicted and low-complexity regions that may represent natively unfolded sections of the proteins’ sequences were identified.

3.1.3.1 Secondary structure based on primary structure

For the determination of the secondary structure content, the PSIPRED [Jones 1999b] secondary structure prediction method was applied. PSIPRED is a simple and accurate (average Q3 = 80.7 %) secondary structure prediction method incorporating two neural networks. Since the analyses are based on output obtained from PSI-BLAST [Altschul 1997], the required time and processing power were considerable. Therefore only a smaller number of large prokaryotic proteins was analyzed, namely sequences of 2000 residues or more. For all such proteins of archaea, of bacteria, for halomucin and for archaea + bacteria combined, four fractions (ratios of amino acids predicted to adopt helix, strand or coil structures, and uncertainly predicted residues) were calculated as a function of the proteins’ lengths (Figure 3.12).

At first one can notice that for each data set only about one third of the predictions had to

Figure 3.12: PSIPRED results - confidence ≥ 6: Plot of the results of a PSIPRED [Jones 1999b] secondary structure prediction for all prokaryotic proteins of 2000 amino acids or more. To perform this analysis all proteins were submitted as fragments no longer than 1500 amino acids. Afterwards, four fractions were calculated for all amino acids of each length interval of archaea, of each length interval of bacteria, for all residues of halomucin and for archaea + bacteria together. The first three are the amino acid fractions predicted by PSIPRED as H (helix), E (strand) or C (coil). However, amino acids were counted only if the confidence of their prediction calculated by PSIPRED was equal to or larger than six. All other amino acids were put into a fourth fraction of uncertainly predicted residues.


be excluded due to a lack of confidence. Secondly, and quite clearly, archaeal proteins seem to contain far fewer amino acids within helical secondary structures than bacterial ones. On the other hand, their fractions of residues in extended and coiled structures are increased and nearly equal to each other. Bacterial proteins show almost equal amounts of residues within helices and sheets and an excess of amino acids within coiled structures.

The mentioned shift towards sheets in archaea might be, as mentioned before, the result of an increased need for protein stability, because archaea generally occupy harsher and more extreme environments. Thermophiles show a far higher fraction of residues within strand structures, probably because the resulting increase in hydrogen bonding leads to an enhanced thermal stability [Chakravarty 2000].

However, for neither biological domain is a clear correlation detectable between the fractions of residues within one of the three secondary structure types and the protein lengths. Admittedly, and in contrast to the previous secondary structure analyses based on the amino acid compositions (Figures 3.4, 3.5), the PSIPRED prediction was only performed for the larger and therefore fewer proteins of 2000 amino acids or more. Hence the earlier detected differences between small proteins and those of 1000 amino acids or more, namely the observed decrease of helix and strand preferring residues and the increase of amino acids with a tendency to form coils, cannot be traced here.

For proteins of archaea, and in particular for halomucin, the limitations of secondary structure preferences derived from the amino acid composition alone become obvious. The more accurate PSIPRED results, which take the amino acid sequence into account, show far higher sheet tendencies than found above. Nevertheless, the PSIPRED secondary structure prediction method is not infallible and can only provide indications of the real structures.
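The four-fraction tally described for the PSIPRED results (helix, strand, coil, and residues below a confidence of six) can be sketched as follows. This is an illustrative Python sketch, not the thesis code. It assumes that each submitted fragment yields two equally long strings, one with the predicted states (H, E, C) and one with per-residue confidence digits (0 to 9); parsing of real PSIPRED output files is deliberately left out, and the function name is hypothetical.

```python
from collections import Counter


def pooled_fractions(predictions, min_conf=6):
    """Pool residues of several fragments and return the four fractions.

    predictions: iterable of (states, confidences) pairs, e.g.
                 ("HHEECC", "995969"); both strings have the same length.
    min_conf:    residues with a confidence below this value are counted
                 as "uncertain" instead of as H, E or C.
    """
    counts = Counter()
    total = 0
    for states, confidences in predictions:
        if len(states) != len(confidences):
            raise ValueError("states and confidences differ in length")
        for state, conf in zip(states, confidences):
            key = state if int(conf) >= min_conf else "uncertain"
            counts[key] += 1
            total += 1
    if total == 0:
        raise ValueError("no residues supplied")
    return {key: counts[key] / total for key in ("H", "E", "C", "uncertain")}


if __name__ == "__main__":
    # Two toy fragments of one protein (fragments were at most 1500
    # residues long in the thesis; these are shortened for readability).
    fragments = [("HHEE", "9959"), ("CC", "69")]
    print(pooled_fractions(fragments))
```

Pooling over fragments before dividing means long proteins contribute more residues than short ones, which matches the per-residue fractions described in the caption of Figure 3.12.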

3.1.3.2 Natively unfolded regions

Natively unfolded proteins, or sequences with extended, intrinsically disordered regions, contain sequence sections of more than 30 to 40 residues (sometimes the whole protein length) with low sequence complexity, high flexibility, a low overall hydrophobicity and a characteristic amino acid composition [Fink 2005], [Romero 2001]. Although it was shown that the ratio of genome-encoded, intrinsically unstructured proteins increases with the complexity of an organism [Wright 1999], [Dunker 2000], [Iakoucheva 2002], [Tompa 2003], prokaryotic proteins (≈ 2 % of archaeal, ≈ 4 % of bacterial sequences) can also contain long regions of native disorder [Ward 2004]. Since they remain evolutionarily stable, they must have important functions, which in turn appears to conflict with the classical convention that the function of a protein is determined by its 3D structure. The observed functions of intrinsically unfolded regions are mostly connected to the regulation of cell cycle, transcription and translation [Vucetic 2003] through molecular recognition processes and binding functions.

To identify such sequence sections in the analyzed prokaryotic proteins, low-complexity regions were first predicted using the SEG [Wootton 1993], [Wootton 1996] method. Secondly, of all regions obtained, fragments with a length of 40 amino acids or more and a negative overall hydropathy value (according to [Kyte 1982]) were extracted. Figure 3.13 shows the number of potential intrinsically disordered regions per protein for each length interval.

Apparently the number of potential intrinsically disordered regions, and therefore the characteristic of proteins to be at least in part natively unfolded, increases for larger sequences.


Figure 3.13: SEG results - mean number of hydrophilic low-complexity regions ≥ 40 amino acids per protein: Mean numbers (and respective standard deviations) of low-complexity regions per protein for all sequences within a certain length interval. Low-complexity regions were detected with the SEG [Wootton 1993], [Wootton 1996] method, and only overall hydrophilic regions (negative values according to [Kyte 1982]) with lengths of 40 amino acids or more were counted. The results of the groups halomucin and archaea + bacteria are only plotted within the last length interval ([0;∞)).
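The second filtering step (keeping only low-complexity regions of at least 40 residues with a negative overall hydropathy) can be sketched in Python as shown below. The sketch does not run SEG itself. It assumes that the low-complexity regions are already available as 1-based, inclusive (start, end) positions, and it interprets "overall hydropathy" as the mean Kyte-Doolittle value of the region, whose sign equals that of the sum. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy index [Kyte 1982].
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(segment):
    """Mean Kyte-Doolittle value; non-standard residues are skipped."""
    values = [KYTE_DOOLITTLE[aa] for aa in segment if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("segment contains no standard residues")
    return sum(values) / len(values)


def candidate_unfolded_regions(sequence, low_complexity_spans, min_length=40):
    """Return the spans that are long enough and hydrophilic overall.

    low_complexity_spans: list of 1-based, inclusive (start, end) tuples,
                          e.g. as parsed from a low-complexity predictor.
    """
    hits = []
    for start, end in low_complexity_spans:
        segment = sequence[start - 1:end]
        if len(segment) >= min_length and mean_hydropathy(segment) < 0:
            hits.append((start, end))
    return hits


if __name__ == "__main__":
    toy = "A" * 10 + "S" * 50 + "L" * 45 + "G" * 20
    spans = [(11, 60), (61, 105), (106, 125)]
    # Only the serine-rich span is both long and hydrophilic.
    print(candidate_unfolded_regions(toy, spans))
```

Counting the returned spans per protein and averaging within each length interval would give values of the kind plotted in Figure 3.13.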


Archaeal proteins ranging from 5000 to 9999 amino acids, for example, show on average one such unstructured region per protein. Halomucin (9159 amino acids), from the square archaeon H. walsbyi, contains two. Although the standard deviations are quite large, there is a clear positive correlation between the number of natively unfolded regions within a protein and its length. This could be explained either by a higher functional or structural need, forcing larger proteins to contain more unfolded regions, or simply by an increased probability for larger sequences. In either case, large proteins feature natively unfolded sequence areas more often than small ones.

Perhaps the increase in malleability introduced by natively disordered regions [Fink 2005] is particularly advantageous for large sequences. Further, it is conceivable that, since prokaryotic organisms lack the true cellular compartmentalization of eukaryotes, large prokaryotic proteins provide comparable structures. For such purposes, the characteristic that natively unfolded proteins exhibit large intermolecular interfaces while keeping acceptable sizes [Gunasekaran 2003] would be helpful. Monomeric folded proteins aiming to achieve interfaces of the same size would need to be two or three times larger, resulting in increased cellular crowding or an enlarged cell size, both of which are rather disadvantageous.

Figure 3.14 and Figure 3.15 show the dependency of the lengths of potential natively unfolded regions on the sequence lengths of the analyzed proteins for archaea and bacteria respectively. Since the number of smaller proteins is far higher than that of large sequences, the additional scatter plots aim to visualize the amount of data that is actually available for each length interval.

For instance, for archaeal proteins (Figure 3.14) there is no low-complexity region with the attributes of a natively unfolded region in the [4000;5000) interval, but this interval also contains only 5 sequences that could be tested. Despite the lack of data, the medians of the lower box plots seem to increase slightly with larger archaeal sequence lengths. This would imply that large proteins of archaea contain not only more but also longer sequence sections which are unfolded in the native state of the proteins’ structure. This would support the mentioned idea that larger proteins could need these unfolded regions for functional or structural purposes.

The graph for bacterial proteins (Figure 3.15) is an impressive reminder that almost 20 times more bacterial than archaeal sequences could be analyzed. Nevertheless, the number of outliers within the box plots is striking. There are very large proteins ([3000;4000) interval) whose almost entire sequences appear to be natively unfolded. Hydrophilic low-complexity regions of nearly 3000 amino acids in bacteria contrast strongly with only one larger, possibly unfolded region of over 300 amino acids in archaea. In fact, many length intervals of bacteria show outliers with low-complexity regions of lengths near the respective interval limit, meaning that almost the whole protein would be intrinsically unstructured.

The mentioned positive relationship between the fractions of intrinsically unstructured sequences and the complexities of the respective organisms might be the reason behind these differences between archaea and bacteria. More highly evolved species seem to depend significantly more on complex protein-protein interactions, and non-housekeeping proteins appear to be preferentially affected by an increased proportion of disordered regions, which might reflect a diversification of their regulatory functions within higher organisms [Fink 2005]. In contrast, less complex species are subject to a stronger selective pressure towards biochemical efficiency. At this point, the differences in the general habitats of these two biological domains could be stated again, with often extreme conditions for archaea and sometimes even parasitic lands of plenty for bacteria.


Figure 3.14: SEG results - archaea - lengths of hydrophilic low-complexity regions ≥ 40 amino acids: Lengths of SEG-predicted low-complexity regions (hydrophilic, 40 amino acids or more) plotted against the lengths of archaeal proteins. The upper graph shows scatter plots, the lower graph box and whisker plots of the data. The values of halomucin are also included in the [5000;10000) interval and the ordinate scales were chosen to be comparable with the plots for bacteria (Figure 3.15).


Figure 3.15: SEG results - bacteria - lengths of hydrophilic low-complexity regions ≥ 40 amino acids: Lengths of low-complexity regions (hydrophilic, 40 amino acids or more) as a function of the lengths of the respective proteins, which are in this case of bacterial origin. Again the same data are presented as scatter (top) and box and whisker plots (bottom).


3.1.4 Discussion and explanation attempts

The previous analyses gathered a lot of information. However, some findings seem to confirm one another and should therefore be pointed out again. The following two subsections briefly combine the outcomes for the comparison between archaeal and bacterial sequences and between normal sized and large prokaryotic proteins. They may be captioned with oversimplified statements, but they try to offer grounds for these within the text. A third subsection addresses a fact disregarded so far, namely the assumption that a remarkable fraction of large prokaryotic proteins might belong to the functional category of surface proteins.

3.1.4.1 Archaea vs. bacteria - robust thermophiles vs. complex mesophiles

The previous analyses exposed a few considerable differences between archaeal and bacterial sequences. Among them were some differences in the amino acid compositions and, based on the amino acid sequences, a higher strand content in the secondary structure of archaeal proteins as well as longer, potentially natively unfolded regions in bacterial proteins.

As documented in the literature, there is strong evidence that the amino acid compositions differ significantly between thermophiles and mesophiles [Chakravarty 2000]. An increase of the charged glutamate fraction, explained by an enhanced need for salt bridges and ion pairs [Xiao 1999], as well as higher contents of valine and isoleucine, which stabilize through a decrease of the unfolded chain’s entropy [Lee 1994], have also been observed for the amino acid compositions of archaeal compared to bacterial sequences (Figure 3.15). A further compositional bias equal to the one detected for the proteins of thermophiles, namely the reduction of histidines and glutamines, could also be noticed for archaeal polypeptides.

In addition to an adjusted amino acid composition, proteins of thermophiles exhibit further stabilizing factors, including a reduced sequence size, a higher number of residues involved in hydrogen bonding and therefore an increased content of sheet secondary structures [Chakravarty 2000]. The results of the PSIPRED analyses (Figure 3.12) would reinforce these findings.

The fraction of intrinsically unstructured proteins increases with the complexity of an organism [Wright 1999], [Dunker 2000], [Iakoucheva 2002], [Tompa 2003]. Furthermore, only a few occurrences of natively unfolded regions have been observed for housekeeping proteins involved in cellular biosynthesis and metabolism [Iakoucheva 2002]. It could thus be speculated that more complex bacterial species, experiencing lower selective pressure on biochemical efficiency, evolutionarily established larger, intrinsically unstructured sequence sections (Figures 3.14, 3.15).

In short, a comparison between the biological domain of archaea, specialized to occupy eccentric environments, and the diverse and complex domain of bacteria must expose certain differences between these distinct kinds of prokaryotes.

However, only 5 % of all analyzed prokaryotic proteins were archaeal sequences, analyses at the level of amino acid content might be misleading, even the PSIPRED method is fallible, and again the proteins of far fewer archaeal species (compared to bacteria) were tested for potential intrinsically disordered regions.


3.1.4.2 Small vs. large - rigid, essential enzymes vs. flexible, regulating binders

Since the present thesis is titled a bioinformatical analysis of extraordinarily large, prokaryotic proteins, the comparison between smaller and larger proteins has been of major interest within the previous examinations.

Many differences were noticeable, but three major findings recurred consistently, all supporting the main hypothesis that large prokaryotic proteins perform flexible recognition and non-housekeeping binding activities, which increase a species’ complexity, rather than specific and essential enzymatic functions. Further, the tendency towards an increased surface to volume ratio suggests that they not only perform certain functions but also provide an environment for cellular processes, compensating for the prokaryotic lack of a eukaryotic compartmentalization of functions.

The three leading arguments are the reduced suitability of large sequences for enzymatic functions, the increased tendency to adopt turn and loop structures, and enhanced fractions of natively unstructured regions within larger proteins.

Studies at the level of amino acid composition (Figure 3.2) revealed a decrease in the residue content for six of the seven amino acids accounting for 70 % of all catalytically active residues [Bartlett 2002] and especially a loss of positively charged amino acids (Figure 3.3). Besides, the observed higher degree of flexibility introduced by an increase of small amino acids like glycine would act against the rigidity needed to perform specific enzymatic activities.

Turns represent a major class of protein secondary structure and are of heterogeneous, non-periodic nature. Because they are broadly defined as amino acid chain sections where a change in direction occurs [Rose 1985] and they are mostly situated between regions of regular secondary structure, turns are frequently located at a protein’s surface. Therefore they are primarily composed of hydrophilic amino acids [Rose 1978]. Furthermore, they show a compositional bias towards small residues like glycine or asparagine, which allow them to adopt their special 3D conformations. The previous analyses found an increase of such small residues (Figure 3.2) as well as a lower overall hydrophobicity for large sequences (Figure 3.6). Since turns often perform regulative and modulating functions through ligand binding or molecular recognition, a tendency of large proteins to carry out similar tasks might be expected. In addition, the trend towards turn and loop structures could result in a higher surface to volume ratio for larger sequences.

Natively unfolded proteins and intrinsically disordered regions are more malleable and exhibit larger intermolecular interfaces than monomeric folded proteins, leading to an improved regulation and binding of diverse ligands and a larger intermolecular interface to volume ratio respectively [Gunasekaran 2003]. Such regions occur especially within proteins performing non-housekeeping functions, such as regulators of protein-protein, protein-DNA or protein-membrane interactions, i.e. polypeptides that increase the complexity of a particular organism. The predictions of potential natively unfolded regions showed an increase of their number and lengths with larger protein sizes (Figures 3.13, 3.14, 3.15). Further support comes from the detected higher overall hydrophilicity, a compositional bias quite similar to that of the amino acid contents of intrinsically disordered regions, and the increase of flexibility for larger sequences. With a higher degree of natively unfolded regions, large prokaryotic proteins would generally be more likely to be involved in binding and recognition functions, as well as in providing an environment for cellular processes, than to function as rather rigid enzymes.

Therefore a general trend with respect to the functional differences between smaller and larger proteins could be seen in the involvement in biosynthesis and metabolism on one side and regulatory or modulatory functions as well as a supply of cellular reaction chambers on the other.

Admittedly, the fraction of proteins larger than 1000 amino acids represented only 1 % of all prokaryotic sequences, active sites of enzymatic proteins rarely exceed seven residues, archaeal proteins should tend to shorten loops with turn structures in order to achieve a higher protein stability, and the amino acid compositional bias observed for natively unstructured regions was not fully apparent for large prokaryotic proteins.

3.1.4.3 Functional category of surface proteins for protein giants

All statements and discussions made for the analyses of large prokaryotic sequences presumed that the majority of proteins within a prokaryotic proteome is composed of intracellular sequences, and conclusions were therefore oriented towards their standards. A study of so-called giant prokaryotic genes showed that proteins encoded by genes larger than 15 kb belong to several different functional categories [Reva 2008]. In addition to intracellular regulatory proteins, transporters, repeat domain proteins and a large group of non-ribosomal peptide or polyketide synthetases, many of the tested polypeptides belonged to the functional category of extracellular surface proteins, receptors and haemolysins. However, the annotations of only 16 of the 145 analyzed sequences were experimentally verified.

Nevertheless, the differences in amino acid usage between large extracellular and intracellular proteins pointed out by these authors are quite consistent with the differences noted between small and large prokaryotic proteins. Likewise, an increase of polar, aliphatic amino acids and aspartates as well as a decreased fraction of positively charged amino acids and of cysteines have been observed (Figures 3.2, 3.3). The described rise of glutamates has not been detected. The authors [Reva 2008] assume that, since these large surface proteins feature acidic and hydrophilic characteristics as well as a lack of cysteines, they should be prone to interactions with cations or water and exhibit a high flexibility owing to fewer constraints from covalent disulphide bridges. Like mucins and collectins in mammals, these features would allow such proteins to bind water, ions and other substrates abundantly and thus generate a special micromilieu around the cells [Reva 2008].

The next section deals with an extremely large archaeal protein, probably belonging to the same functional category.

3.2 Part 2: Haloquadratum walsbyi’s protein giant halomucin - in detail

The second part of this master project comprised a more detailed view of a single extraordinarily large prokaryotic protein, halomucin. The 9159 amino acids of its sequence are encoded by a gene of over 27000 nucleotides within the genome of the halophilic archaeon Haloquadratum walsbyi [Bolhuis 2006]. This organism occupies a hostile and narrow ecological niche at the limits of water activity, with high concentrations of MgCl₂ (more than 2 M) and NaCl (more than 3 M) [Bolhuis 2004]. In addition, its environment is almost anaerobic, the level of solar radiation is hazardous and the availability of nutrients is reduced due to complexing with Mg²⁺. However, H. walsbyi features many adaptation mechanisms that allow it to grow to high cell concentrations in this habitat, in which, unsurprisingly, few other organisms compete (Introduction 1.2). One of its secrets might lie in a water-enriched capsule covering the cells as an aqueous shield and perhaps also accounting for the maintenance of the unique square cell morphology of H. walsbyi [Bolhuis 2006]. According to Bolhuis et al., the giant protein halomucin resembles animal mucins, which protect organs such as eyes and lungs from desiccation [Hollingsworth 2004], in its amino acid sequence and domain organization. Supported further by an N-terminal signal sequence (suggesting that halomucin is translocated across the membrane) as well as the presence of domains harboring potential sites for glycosylation and sulfation (increasing halomucin’s overall negative charge), the authors assume that halomucin mediates a specific adaptation to desiccation stress by providing the structure of the mentioned capsule. Since they could show that the gene of halomucin is transcribed completely and H. walsbyi is potentially capable of synthesizing sialic acids as well as poly-gamma-glutamate, they suppose that a capsule stabilized and cross-linked in this way might in addition contribute to the postage-stamp-like cell morphology of H. walsbyi.

3.2.1 Visualization of halomucin’s dimensions

The unique dimensions of H. walsbyi cells range from 2 to 5 µm in width and 0.1 to 0.5 µm in thickness [Bolhuis 2004]. Since the structure of halomucin is still not solved, Figure 3.16 tries to visualize its dimensions with respect to the producing cell.

To get an idea of the volume of halomucin, first the mean volume per amino acid based on halomucin’s amino acid composition was calculated (118.3 Å³). With 9159 amino acids and the assumption that the van der Waals volume of the amino acids accounts for between 75 % (tight protein interior) and 58 % (in water) of the occupied volume, the protein volume could range between 1445 and 1868 nm³ respectively. For the improbable case that the protein adopted a sphere-like shape, its diameter would then lie between 14 and 15 nm. If the whole 9159 amino acids built only one large alpha-helix, the length of this helix would add up to around 1.4 µm (translation: 0.15 nm per amino acid). For the theoretical case that the whole sequence adopted the linearly extended conformation of a long all-trans isomer, its length would reach approximately 3.5 µm (translation: 0.38 nm between two Cα atoms). However, these computations are quite unrealistic and are only meant to make this incredibly large protein a bit more tangible.
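The estimates above can be reproduced with a few lines of arithmetic. The following Python sketch uses only the numbers quoted in the text. It reads the 118.3 Å³ as the mean van der Waals volume per residue, so that dividing by the packing fraction gives the occupied volume; this reading is an assumption, but it reproduces the quoted range. The function names are hypothetical.

```python
import math

# Values quoted in the text.
N_RESIDUES = 9159            # length of halomucin
MEAN_VDW_VOLUME_A3 = 118.3   # mean volume per residue in cubic angstroms
PACKING_TIGHT = 0.75         # van der Waals share in a tight protein interior
PACKING_WATER = 0.58         # van der Waals share in water
RISE_HELIX_NM = 0.15         # translation per residue in an alpha-helix
RISE_EXTENDED_NM = 0.38      # distance between two C-alpha atoms, all-trans


def protein_volume_nm3(n_residues, mean_volume_a3, packing_fraction):
    """Occupied volume in nm^3 (1 nm^3 = 1000 cubic angstroms)."""
    return n_residues * mean_volume_a3 / packing_fraction / 1000.0


def sphere_diameter_nm(volume_nm3):
    """Diameter of a sphere with the given volume, from V = pi * d^3 / 6."""
    return (6.0 * volume_nm3 / math.pi) ** (1.0 / 3.0)


def contour_length_um(n_residues, rise_nm):
    """Length of a chain with a fixed translation per residue, in micrometres."""
    return n_residues * rise_nm / 1000.0


if __name__ == "__main__":
    for label, packing in (("tight", PACKING_TIGHT), ("in water", PACKING_WATER)):
        volume = protein_volume_nm3(N_RESIDUES, MEAN_VDW_VOLUME_A3, packing)
        print(f"{label}: volume {volume:.0f} nm^3, "
              f"sphere diameter {sphere_diameter_nm(volume):.1f} nm")
    print(f"single helix: {contour_length_um(N_RESIDUES, RISE_HELIX_NM):.2f} um")
    print(f"fully extended: {contour_length_um(N_RESIDUES, RISE_EXTENDED_NM):.2f} um")
```

The computed volumes round to 1445 and 1868 nm³, the sphere diameters to about 14.0 and 15.3 nm, and the two chain lengths to about 1.37 and 3.48 µm, in line with the rounded values given in the text.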

3.2.2 Functional predictions

Although the functional predictions made for halomucin [Bolhuis 2006] seem quite plausible, no experimental evidence has been presented. The lack of structural information does not ease the problem of predicting halomucin’s cellular or extracellular roles. The following paragraphs aim to confirm and predict potential functions for halomucin based on either its amino acid composition or its amino acid sequence.

3.2.2.1 Background: Amino acid composition

Comparing the results for halomucin obtained within the general analyses of prokaryotic proteins at the level of amino acid content, polar and uncharged residues seem to be highly increased (Figure 3.3). Together with the drop in the positively charged amino acid fraction, the compositional bias would show similarity to the one observed for large extracellular proteins [Reva 2008]. In conflict with this, however, is the increased content of cysteines, which were diminished in the large surface proteins studied by these authors, resulting in a flexible protein structure. On the other hand, the amino acid content of halomucin might reflect a compromise to achieve both an increased potential


Figure 3.16: Halomucin - length comparison: Visualization of the spatial dimensions of the non-motile, pigmented halophilic archaeon Haloquadratum walsbyi and its encoded halomucin, one of the largest archaeal proteins (9159 amino acids). This square shaped organism shows a lateral length of 2 - 5 µm (sometimes even 40 µm) and a cell thickness of 0.1 - 0.5 µm [Bolhuis 2004]. Halomucin is a huge protein with a molecular weight of 927.7 kDa. It might be exported outside the cell, helping to create an aqueous shield against desiccation and solar irradiation as well as to maintain the unique square shape of H. walsbyi. The picture of the cell was taken from [Burns 2004], and the illustrations of a theoretical helix and of a trans-polypeptide for halomucin are not true to the scale of H. walsbyi. Black lines indicate scaled translations.


to bind water and cations, creating a covering water-enriched shield, and at the same time the maintenance of stability, contributing to the rigidity of H. walsbyi’s cellular capsule.

3.2.2.2 Background: Amino acid sequence

General facts The general examinations at the level of primary structure yielded two potential intrinsically unstructured regions (near the C-terminus), and the PSIPRED method predicted an extremely high content of strand structures.

According to the sequence annotation of the UniProtKB/Swiss-Prot Protein Knowledgebase [Consortium 2009], the first 30 N-terminal residues of halomucin represent its potential signal peptide. Furthermore, three domains are annotated: C-type lectin 1 (residues 644 - 776), C-type lectin 2 (residues 929 - 1060) and Cadherin (residues 7686 - 7793). Figure 3.17 depicts these annotations as well as the location of the two predicted intrinsically disordered regions.

Figure 3.17: Halomucin - sequence plot: Location of the sequence feature annotations derived from the UniProtKB/Swiss-Prot Protein Knowledgebase [Consortium 2009] and of the sequence sections predicted to be natively unfolded regions of low sequence complexity within the 9159 residues of halomucin’s amino acid sequence. The abbreviations used are sp: signal peptide, ctl: C-type lectin domain, cad: Cadherin domain and lcr: hydrophilic low-complexity region larger than 40 amino acids.

Dissection into submittable fragments Since the annotations of functions and sequence similarities obtainable from the UniProtKB/Swiss-Prot database are rather poor for halomucin, manual considerations and prediction methods were applied to acquire further information.
Because halomucin is approximately 30 times the size of ordinary archaeal proteins, most bioinformatical prediction methods would not accept the entire protein as a query sequence. Hence the first approach included the ’cleavage’ of the protein’s sequence according to its structural domain boundaries, for which the DomSSEA method at the DomPred Protein Domain Prediction Server [Marsden 2002] was used. For submission, the sequence was first split into four almost equal parts of 2290 amino acids (1 - 2290, 2291 - 4580, 4581 - 6870 and 6870 - 9159) and then split again with unaltered spacing, but with the second fragment starting in the middle of the previous first one (1 - 1145, 1146 - 3435, 3436 - 5725, 5726 - 8015, 8016 - 9159). Both results were combined manually, yielding 22 fragments of the whole protein sequence. In addition, the automatically constructed protein domain family database ProDom [Corpet 2000], [Servant 2002], [Bru 2005] was employed to carry out a comparative prediction of the possible domain organization. For this method the sequence of halomucin could be submitted as a whole. However, since the fragments predicted this way did not cover the total amino acid sequence, most of the further analyses were performed using the 22 protein parts obtained with the secondary structure element alignment method DomSSEA (Table 3.3).
Taking into account that these fragments vary in length from 132 to 952 amino acids and that their boundaries were determined manually by roughly combining the two different prediction results, it is quite unlikely that the defined fragments represent biologically relevant domains. Nevertheless, splitting the polypeptide into submittable fragments was necessary for further analyses, namely the identification of structural neighbors and thus the gathering of information about potential functions of particular sequence sections.
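The two tiling passes described above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `tile` and its parameters are hypothetical and are not part of DomSSEA or the DomPred server, and the manual merging of the resulting domain predictions into 22 fragments is not modelled.

```python
def tile(seq_len, window, first_len=None):
    """Return 1-based, inclusive (start, end) windows covering 1..seq_len.

    The first window may have a different length (first_len) so that a
    second pass can be shifted relative to the first one.
    """
    bounds = []
    start = 1
    size = first_len if first_len is not None else window
    while start <= seq_len:
        end = min(start + size - 1, seq_len)
        bounds.append((start, end))
        start = end + 1
        size = window  # all later windows use the regular spacing
    return bounds


HALOMUCIN_LEN = 9159
WINDOW = 2290

# Pass 1: four almost equal parts.
pass_one = tile(HALOMUCIN_LEN, WINDOW)
# Pass 2: same spacing, shifted by half a window.
pass_two = tile(HALOMUCIN_LEN, WINDOW, first_len=WINDOW // 2)

print(pass_one)
print(pass_two)
```

Note that a strictly non-overlapping tiling starts the last window of the first pass at residue 6871, one residue after the start listed in the text.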

Identification of structural homologues To identify structural homologues, three fold recognition methods were applied to the 22 sequence parts of halomucin: mGenTHREADER [Jones 1999a], [McGuffin 2003], 3D-PSSM [Fischer 1999], [Kelley 1999] and Phyre [Bennett-Lovsey 2008]. Fold recognition methods identify homologues of known structure in order to predict the 3D conformation of a target sequence. The methods mentioned use improved sequence searching based on sequence profiles and incorporate structural information to detect distant homologues. They differ in the further input they use, which is why their predictions were compared. Table 3.3 presents the results derived by manual combination of these predictions. Listed are the PDB (Protein Data Bank of the Research Collaboratory for Structural Bioinformatics [Berman 2000]) accession numbers of the template structures assigned to the fragments of halomucin (column 3) as well as the main functions of these templates (column 4). For fragment 3 no structural homologue was identifiable.

Some of the template structures were assigned to more than one fragment, although mostly to different parts of the template sequences, so that in total only eleven different templates served to derive structural and functional hints for halomucin.
With template 1cwv [Hamburger 1999], the structure of the invasin protein of Yersinia pseudotuberculosis was assigned to parts of the sequence sections 1 and 18 of halomucin. Bacterial invasin binds to integrins at the surface of eukaryotic cells to promote the entry of the pathogenic bacterium. However, an invasion of eukaryotic cells by H. walsbyi does not seem very likely. In this case, the argument that similar structures should perform analogous functions is probably not valid.
1dv8 [Meier 2000] (fragments 2 and 4) represents the carbohydrate recognition domain (CRD) of the human asialoglycoprotein receptor (ASGPR). This receptor can be subdivided into four functional domains, and the CRD is responsible for the association of glycoproteins by binding to terminal non-reducing galactose residues and N-acetyl-galactosamine residues. Since this domain belongs to the superfamily of C-type (calcium-dependent) lectins, the fold recognition methods were able to confirm the sequence annotations of the UniProtKB/Swiss-Prot Protein Knowledgebase [Consortium 2009]. There, C-type lectin 1 and C-type lectin 2 are annotated for the residues 644 - 776 and 929 - 1060, compared to the sequence sections 647 - 799 (fragment 2) and 932 - 1061 (fragment 4) identified as C-type lectins by the fold recognition methods.
Likewise, the UniProtKB/Swiss-Prot annotation of the third domain, Cadherin (7686 - 7793), has been reinforced by the prediction methods. 1l3w [Boggon 2002] (fragments 5 and 20) is the identifier for the 3D structure of C-cadherin’s extracellular domain. In addition to the residues 7576 - 8093 (fragment 20), parts of fragment 5 (1230 - 1569) were also predicted to adopt cadherin superfamily-like structures.
Cadherins are Ca2+-dependent cell adhesion molecules mediating specific cell-cell interactions. They have been found in vertebrates as well as invertebrates [Nollet 2000], [Cox 2004], but not in prokaryotes, which exhibit only adhesins as cellular adhesion molecules. Nevertheless, the functional hint towards a potential role of halomucin as a kind of adhesion molecule does not seem to be entirely unfounded. The square-shaped cells of H. walsbyi appear to divide at alternating right angles, often producing clusters of cells attached to each other like sheets of postage stamps [Walsby 1980], [Kessel 1982]. Thus there should be a mechanism mediating these cell-cell interactions.

Table 3.3: Results of the partitioning of halomucin’s sequence into submittable fragments and of the subsequent identification of structural homologues. Two superimposed predictions by the DomSSEA method [Marsden 2002] were combined manually to obtain the 22 fragments listed in the second column (starting - ending residue of each fragment). By merging the outputs of three fold recognition methods (mGenTHREADER [Jones 1999a], [McGuffin 2003], 3D-PSSM [Fischer 1999], [Kelley 1999] and Phyre [Bennett-Lovsey 2008]), possible structural neighbors (represented by their PDB identifiers) have been listed for each of the fragments except number 3 (column 3). The biological functions of these template structures are given in column 4. Further, the amino acids involved in the fragment-template alignments are listed for the templates (column 5) and the corresponding target fragments (column 6). The last column presents the length of each submitted fragment of halomucin (all lengths in amino acids).

PDB structure 1dab [Emsley 1996] (fragments 6 and 13) belongs to the virulence factor P.69 pertactin of Bordetella pertussis. It mediates the adhesion of this virulent bacterium to mammalian target cells. Again, endocytotic uptake by a eukaryotic cell appears to be an implausible aim for H. walsbyi. Nonetheless, the structural potential of halomucin to be involved in cellular adhesion processes would support the idea mentioned above.
1wxr [Otto 2005] (fragments 7, 8, 9, 12, 14, 16 and 17) denotes the 3D structure of hemoglobin protease (Hbp) from pathogenic Escherichia coli. This autotransporter passenger domain is released by its producing cells to degrade host hemoglobin and extract heme, assuring the pathogen’s iron supply. The reason for the repeated identification of structural similarity between this domain and parts of halomucin remains obscure. Although it would still corroborate the extracellular nature of halomucin, a serine protease activity appears rather improbable, and the structural relatedness might be reduced to the fact that both share a high content of sheet secondary structures.
1hg8 [Federici 2001] (fragment 10) is the PDB entry of endopolygalacturonase (PG) from the phytopathogenic fungus Fusarium moniliforme. This enzyme catalyzes the fragmentation and solubilization of homogalacturonan, contributing to the cell-wall degradation of plant target cells. Since there is also no need for H. walsbyi to invade plant tissues, the structural similarity to this template might likewise not imply a functional analogy for fragment 10 of halomucin.
The identifier 1nhc [van Pouderoyen 2003] (fragment 11), standing for another endopolygalacturonase (endopolygalacturonase I) from Aspergillus niger, represents a further cell wall-degrading enzyme, catalyzing the random hydrolysis of 1,4-alpha-D-galactosiduronic linkages in pectate and other galacturonans. For the same reasons as for 1hg8, no functional conclusions should be drawn for fragment 11 from the structural similarities.
2qub [Meier 2007] (fragment 15) is the PDB identifier for the 3D coordinates of the extracellular lipase LipA from Serratia marcescens. This lipolytic enzyme hydrolyzes the ester bonds of acylglycerides and possesses 13 copies of a calcium-binding tandem repeat motif within its C-terminal part. Even though lipase activity does not seem very plausible for halomucin, the structural potential to bind divalent cations like Ca2+ might be interesting. Since the habitat of H. walsbyi exhibits an extremely high concentration of MgCl2, the association of Mg2+ could be conceivable for halomucin’s middle section.
2z8r [Ochiai 2007] (fragment 19) identifies rhamnogalacturonan lyase YesW produced by saprophytic Bacillus subtilis. Among other reactions, this extracellular enzyme catalyzes the cleavage of glycoside bonds in polygalacturonan and thereby accounts for the degradation of plant cell walls. Once again, H. walsbyi is unlikely to have the opportunity to invade plant cells. However, if dead organic matter accidentally ends up in the solar salterns, or if organisms perish in the hypersaline environments of H. walsbyi, the square cells might have evolved mechanisms to gain access to these sources of nutrition.
The label 2c3f [Smith 2005] (fragment 21) is assigned to the structure of the hyaluronate lyase HylP1, a streptococcal phage-encoded virulence factor.
This tail-fiber protein is responsible for the digestion of the hyaluronan capsule during infection of its target Streptococcus pyogenes. Although the predicted structural homology to halomucin might not imply a functional similarity in this case either, the accumulated potential involvement of parts of halomucin’s structure in degradation processes appears quite remarkable.
2i1q [Qian 2006] (fragment 22) represents the PDB identifier for the archaeal RadA/Rad51 recombinase from Methanococcus voltae (MvRadA). The enzyme plays a key role in DNA repair by forming helical nucleoprotein filaments in which a hallmark strand exchange reaction between homologous DNA substrates occurs. This strand exchange activity is stimulated by calcium. Since a transmembrane topology prediction (Phobius [Kall 2004]) detected no transmembrane helices for halomucin, its simultaneous performance of extracellular and intracellular functions seems improbable. Nevertheless, this structural similarity supports the idea mentioned above that halomucin could associate with divalent cations.
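The agreement between the UniProtKB/Swiss-Prot domain annotations and the sequence sections assigned by fold recognition, as quoted in the text, can be quantified as a simple interval overlap. The sketch below is a minimal illustration using only the residue ranges given above; the function and variable names are hypothetical.

```python
def overlap(a, b):
    """Number of residues shared by two 1-based, inclusive ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def length(r):
    """Number of residues in a 1-based, inclusive range."""
    return r[1] - r[0] + 1


# (annotated range, fold-recognition range), both taken from the text.
PAIRS = {
    "C-type lectin 1": ((644, 776), (647, 799)),
    "C-type lectin 2": ((929, 1060), (932, 1061)),
    "Cadherin": ((7686, 7793), (7576, 8093)),
}

# Fraction of each annotated domain covered by the predicted section.
covered = {
    name: overlap(annotated, predicted) / length(annotated)
    for name, (annotated, predicted) in PAIRS.items()
}

for name, fraction in covered.items():
    print(f"{name}: {fraction:.1%} of the annotation covered")
```

For the three annotated domains the predicted sections cover nearly all (C-type lectins) or all (Cadherin) of the annotated residues.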

Conclusion Disregarding fragment 22, all structural homologues to the submitted parts of halomucin represent extracellular proteins or domains. In general, their annotated functions involve either binding and recognition or degradation processes. Therefore, in addition to the supposed functions of halomucin, namely the analogy to eukaryotic mucins as well as its potentially stabilizing effect on the cellular capsule and shape of H. walsbyi [Bolhuis 2006], one could speculate that this giant protein might have further functional abilities. The structural preconditions to bind sugars, surface proteins and divalent cations as well as the partial homology to cellular adhesion molecules could make halomucin responsible for the special spatial arrangement of H. walsbyi cells in the form of postage stamp sheets. On the other hand, degradative enzymatic activity could account for the utilization of the scarce resources of H. walsbyi’s environment.
Figure 3.18 depicts a manual and arbitrary 2D arrangement of the sequence parts of halomucin for which structural homologues could be predicted. The graph at the bottom indicates the coverage of the structure assignments along the protein sequence.

All structural templates evidently feature the same high content of strands and sheets. Certainly, no conclusions about the 3D structure of the whole protein can be drawn from this figure. However, the vast proportions of halomucin become impressively clear.
Perhaps one has to wait for the determination of the entire 3D conformation and for more experimental knowledge until the insights are sufficient to state the true function of this intriguing protein giant, as well as its connection to the environment of H. walsbyi.


Figure 3.18: Halomucin - structural hints: Arbitrary 2D arrangement of the structural homologues predicted for fragments of halomucin (Table 3.3). The 3D structures represent the parts of the templates that aligned with halomucin’s sequence. The secondary structure of these PDB structures is depicted in cartoon mode (helices - blue, strands - red, coils - yellow). Sequence regions of halomucin that could not be predicted by the fold recognition methods are indicated as black lines (not to scale, numbers denote amino acids per section). Further, the first (blue) and last (black) residue of each template structure as well as the N- (N) and C-termini (C) of halomucin are emphasized. A sequence plot at the bottom visualizes the alignment locations of the 21 structural templates (red lines) with the sequence of halomucin (black line).
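The coverage shown in the sequence plot of Figure 3.18 corresponds to the union of all aligned sections divided by the sequence length. A minimal sketch of this calculation follows; it uses only the four aligned sections whose residue ranges are quoted in the running text (the full set of 21 alignments is listed in Table 3.3 and is not reproduced here), and the function name is hypothetical.

```python
def merged_coverage(intervals, seq_len):
    """Fraction of 1..seq_len covered by the union of inclusive intervals."""
    covered = 0
    current_end = 0  # last residue already counted
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)  # skip residues counted before
        if start <= end:
            covered += end - start + 1
            current_end = end
    return covered / seq_len


HALOMUCIN_LEN = 9159
# Aligned sections quoted in the text (fragments 2, 4, 5 and 20 only).
SECTIONS = [(647, 799), (932, 1061), (1230, 1569), (7576, 8093)]

partial = merged_coverage(SECTIONS, HALOMUCIN_LEN)
print(f"coverage by these four sections: {partial:.1%}")
```

Sorting the intervals first ensures that overlapping or nested alignments are counted only once.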


Chapter 4

Conclusion

Contents

4.1 Extraordinarily large proteins and mean protein lengths among the three domains of life

4.2 Small, housekeeping generalists and large, accessory specialists

4.3 Halomucin, the secret of H. walsbyi

4.4 Challenges and future prospects

The last chapter of this thesis provides a concluding statement on the results of this master project as well as an outlook on potential follow-up examinations.

4.1 Extraordinarily large proteins and mean protein lengths among the three domains of life

One astonishing outcome of the analyses performed is that there are protein sequences in archaea (2), bacteria (31) and eukaryotes (73) that reach lengths of more than 10000 amino acids. When the numbers of these extremely large sequences are put into relation to the number of proteins analyzed for each of the three domains of life, the difference between the two kinds of prokaryotes is rather small compared to an almost fivefold increased amount for eukaryotes. Analogous proportions were obtained for the mean lengths over all proteins of a certain domain.
Probably the higher degree of cellular complexity [Brocchieri 2005], an increased fraction of regulatory domains [Zhang 2000] as well as the fusion of single-function proteins into versatile multi-domain structures [Das 1997], [Karlin 2002] account for the mostly greater lengths of eukaryotic proteins. The less pronounced difference between the smaller archaeal and the slightly larger bacterial sequences might be caused by the different general habitats of the two domains. A negative correlation between the mean protein length and the environmental temperature of an organism has been noted repeatedly and explained by an increased need for protein stability [Thompson 1999], [Kumar 2001], [Vieille 2001]. Likewise, the results of the analyses on the level of amino acid composition as well as the detected greater content of strand secondary structures for archaea, compared to a higher flexibility and an increased fraction of intrinsic disorder observed for bacterial sequences, support the idea of mostly thermophilic archaea and rather mesophilic bacteria. The often extreme environmental conditions for archaea differ greatly from the rather moderate, but diverse and sometimes even parasitic ways of bacterial life.
However, the amount of proteome data available for archaea clearly needs to be increased before reliable conclusions can be drawn about the differences between these two biological domains with respect to protein sizes.


4.2 Small, housekeeping generalists and large, accessory specialists

Approximately 99 % of prokaryotic protein sequences, and thus the majority of cellular protein functions performed through their corresponding structures, are represented by polypeptides with sequence lengths of less than 1000 amino acids. Nevertheless, a couple of amazingly long prokaryotic sequences achieved evolutionary stability, meaning that their immense costs of synthesis must be profitable for the producing cells. Large prokaryotic proteins appear to perform non-housekeeping activities that increase an organism’s complexity and adaptation, rather than specific, essential enzymatic functions. Differences in the amino acid contents imply a higher degree of flexibility and hydrophilicity and hence an increased tendency to adopt turn secondary structures. This corroborates the assumption of an increased surface-to-volume ratio as well as a greater involvement in molecular interaction processes for longer sequences. More and larger regions with the potential to represent natively unfolded protein sections might further confirm the general increase in malleability and in the sizes of available intermolecular interfaces. Therefore intracellular protein giants would be suitable either to recognize, bind and modulate various ligands or even to provide separated reaction environments similar to the eukaryotic cellular compartmentalization of functions. Following translocation from the crowded cytoplasm to the extracellular space, the generation of a shielding micromilieu [Reva 2008] as well as the improvement of the mechanical stability of the cell envelope would be conceivable. Further, they might allow cellular adhesion to animate or inanimate surfaces or enable the transduction of signals from the environment.
Because of the marginal fraction of large protein sequences, the analysis of far more annotated proteomes would be preferable in order to sample enough giant proteins for drawing more trustworthy conclusions.
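The fraction of large proteins referred to above is a simple threshold count over the sequence lengths of a proteome. The following sketch shows the calculation on a small, purely hypothetical list of lengths; the values are invented for illustration and do not stem from the analyzed HAMAP proteomes, and the function name is an assumption.

```python
def fraction_longer_than(lengths, threshold=1000):
    """Fraction of sequences whose length exceeds the threshold."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("at least one sequence length is required")
    n_large = sum(1 for n in lengths if n > threshold)
    return n_large / len(lengths)


# Hypothetical toy proteome (lengths in amino acids), for illustration only.
toy_lengths = [120, 250, 310, 298, 450, 2400, 305, 280, 330, 1500]

frac = fraction_longer_than(toy_lengths)
print(f"{frac:.0%} of the toy sequences exceed 1000 amino acids")
```

With real proteome data, the same function could be applied per organism or per domain of life to compare the shares of giant proteins.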

4.3 Halomucin, the secret of H. walsbyi

With a sequence more than 30 times larger than that of ordinarily sized archaeal polypeptides, the presumably extracellular protein colossus halomucin of the halophilic square archaeon H. walsbyi represents one of the biggest archaeal proteins. It might further be one of the main reasons why this remarkable archaeon is able to survive in its extremely hostile ecological niche [Bolhuis 2006]. Similar to animal mucins, it is supposed to create an aqueous shield protecting the cells from desiccation within their hypersaline environments. In addition, it might establish the basis of a cross-linked extracellular matrix contributing to the rigidity and maintenance of the unique square cell morphology of H. walsbyi. However, the prediction of structural homologues to sequence sections of halomucin brought further possible functions into play. Several structural homologues to parts of this protein giant exhibit degradative activity, which, if performed by halomucin, could grant H. walsbyi access to the scarce sources of nutrition within its habitat. Due to its predicted structural potential to interact with sugar molecules, surface proteins and divalent cations as well as its partial homology to cellular adhesion proteins, it is further imaginable that halomucin accounts for an attachment of the square cells to each other, producing the observed association pattern of H. walsbyi resembling sheets of postage stamps [Walsby 1980].
In either case, the production costs of this giant protein seem worthwhile for H. walsbyi, and future analyses might reveal the exact reasons.


4.4 Challenges and future prospects

All examinations based on proteome data can only be as reliable as the annotation of the proteins. As a matter of fact, eukaryotic and prokaryotic proteomes contain a large fraction of proteins described as putative, hypothetical, predicted, poorly characterized, uncharacterized or unknown [Tatusov 2001], [Brocchieri 2005]. Wrong predictions often arise already during genome annotation, when open reading frames that occur by chance are accidentally taken into account instead of solely protein-coding genes [Skovgaard 2001].
A trustworthy database is therefore very important, and it might be reasonable for further analyses to compare different data sources. For example, the COG database [Tatusov 2001], representing proteins that are conserved in various organisms and belong to a certain functional class, or the Pfam(-A) database [Bateman 2004] of manually curated alignments, characterizing proteins by the presence of conserved domains, might offer alternative datasets. Since the fraction of large prokaryotic proteins, and even of archaeal compared to bacterial sequences, is still insufficient to derive dependable statistics, one should hope that the amount of available and reliable data will increase altogether.
Furthermore, only the growth of experimentally gained data and knowledge might help to provide a deeper insight into the correlations between the sizes of individual giant proteins, their biological functions and the environments of their producing cells. Aside from that, it would be challenging to uncover the evolution of large proteins in general: which events are necessary for the creation of proteins featuring such incredible lengths, and why do cells maintain them in particular cases, accepting their huge production costs?


Appendix A

Appendix

A.1 Chou-Fasman Parameters

Table A.1: Visualization of the amino acids grouped to represent one of the five indicated secondary structure preference types (propensity values according to [Fasman 1989]).

A.2 Malkov Correlation Coefficients

Table A.2: Demonstration of the residues pooled to represent one of the six indicated secondary structure preference groups (correlation coefficients according to [Malkov 2008]).

A.3 Hydropathy Indices of Kyte and Doolittle

Table A.3: Hydropathy indices (line 1) and $\Delta G^{\circ}_{\mathrm{transfer}}(\mathrm{water} - \mathrm{vapor})$ values (in pseudo-kcal/mol), arbitrarily normalized to spread them between −4.5 and +4.5 (line 2), according to [Kyte 1982]. Normalization function: $-0.679 \times \Delta G^{\circ}_{\mathrm{transfer}}(\mathrm{water} - \mathrm{vapor}) + 2.32$.
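The normalization function given in the caption of Table A.3 is a linear map and can be written down directly. The sketch below only encodes the slope and intercept stated there; the function names are hypothetical, and no table values are reproduced.

```python
SLOPE = -0.679      # from the caption of Table A.3
INTERCEPT = 2.32    # from the caption of Table A.3


def normalize(delta_g_transfer):
    """Map a water-vapor transfer free energy onto the hydropathy scale."""
    return SLOPE * delta_g_transfer + INTERCEPT


def denormalize(scaled_value):
    """Inverse of normalize(): recover the transfer free energy."""
    return (scaled_value - INTERCEPT) / SLOPE


# Transfer free energies that map onto the ends of the -4.5 .. +4.5 range.
dg_for_max = denormalize(4.5)
dg_for_min = denormalize(-4.5)
print(f"+4.5 corresponds to {dg_for_max:.2f}, -4.5 to {dg_for_min:.2f}")
```

Because the slope is negative, more negative transfer free energies map to more positive values on the normalized scale.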


Bibliography

[Albers 2000] S. V. Albers, J. L. van de Vossenberg, A. J. Driessen and W. N. Konings. Adaptations of the archaeal cell membrane to heat stress. Front Biosci, vol. 5, pages D813–20, September 2000.

[Altschul 1997] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, vol. 25, no. 17, pages 3389–402, September 1997.

[Anfinsen 1973] C. B. Anfinsen. Principles that govern the folding of protein chains. Science, vol. 181, no. 96, pages 223–30, July 1973.

[Bang 2001] M. L. Bang, T. Centner, F. Fornoff, A. J. Geach, M. Gotthardt, M. McNabb, C. C. Witt, D. Labeit, C. C. Gregorio, H. Granzier and S. Labeit. The complete gene sequence of titin, expression of an unusual approximately 700-kDa titin isoform, and its interaction with obscurin identify a novel Z-line to I-band linking system. Circ Res, vol. 89, no. 11, pages 1065–72, November 2001.

[Bartlett 2002] G. J. Bartlett, C. T. Porter, N. Borkakoti and J. M. Thornton. Analysis of catalytic residues in enzyme active sites. J Mol Biol, vol. 324, no. 1, pages 105–21, November 2002.

[Bateman 2004] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E. L. L. Sonnhammer, D. J. Studholme, C. Yeats and S. R. Eddy. The Pfam protein families database. Nucl. Acids Res., vol. 32, no. suppl 1, pages D138–141, January 2004.

[Bennett-Lovsey 2008] R. M. Bennett-Lovsey, A. D. Herbert, M. J. E. Sternberg and L. A. Kelley. Exploring the extremes of sequence/structure space with ensemble fold recognition in the program Phyre. Proteins, vol. 70, no. 3, pages 611–25, February 2008.

[Berman 2000] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov and P. E. Bourne. The Protein Data Bank. Nucleic Acids Res, vol. 28, no. 1, pages 235–42, January 2000.

[Boggon 2002] T. J. Boggon, J. Murray, S. Chappuis-Flament, E. Wong, B. M. Gumbiner and L. Shapiro. C-cadherin ectodomain structure and implications for cell adhesion mechanisms. Science, vol. 296, no. 5571, pages 1308–13, May 2002.

[Bohm 1994] G. Bohm and R. Jaenicke. Relevance of sequence statistics for the properties of extremophilic proteins. Int J Pept Protein Res, vol. 43, no. 1, pages 97–106, January 1994.

[Bolhuis 2004] H. Bolhuis, E. M. T. Poele and F. Rodriguez-Valera. Isolation and cultivation of Walsby’s square archaeon. Environ Microbiol, vol. 6, no. 12, pages 1287–91, December 2004.

[Bolhuis 2005] H. Bolhuis. Walsby’s square archaeon; it’s hip to be square but even more hip to be culturable., volume 9 of Adaptation to life at high salt concentrations in archaea, bacteria and eukarya., pages 185–199. Springer, 2005.


[Bolhuis 2006] H. Bolhuis, P. Palm, A. Wende, M. Falb, M. Rampp, F. Rodriguez-Valera,F. Pfeiffer and D. Oesterhelt. The genome of the square archaeon Haloquadratumwalsbyi : life at the limits of water activity. BMC Genomics, vol. 7, page 169, July2006. 5, 6, 7, 10, 49, 50, 56, 60

[Brocchieri 2005] L. Brocchieri and S. Karlin. Protein length in eukaryotic and prokaryoticproteomes. Nucleic Acids Res, vol. 33, no. 10, pages 3390–400, June 2005. 3, 10, 25,26, 59, 61

[Bru 2005] C. Bru, E. Courcelle, S. Carrere, Y. Beausse, S. Dalmar and D. Kahn. TheProDom database of protein domain families: more emphasis on 3D. Nucleic AcidsRes, vol. 33, no. Database issue, pages D212–5, January 2005. 20, 52

[Burns 2004] D. G. Burns, H. M. Camakaris, P. H. Janssen and M. L. Dyall-Smith. Cul-tivation of Walsby’s square haloarchaeon. FEMS Microbiol Lett, vol. 238, no. 2,pages 469–73, September 2004. 51

[Cavicchioli 2003] R. Cavicchioli, P. M. G. Curmi, N. Saunders and T. Thomas. Pathogenicarchaea: do they exist? Bioessays, vol. 25, no. 11, pages 1119–28, November 2003.4

[Chakravarty 2000] S. Chakravarty and R. Varadarajan. Elucidation of determinants ofprotein stability through genome sequence analysis. FEBS Lett, vol. 470, no. 1,pages 65–9, March 2000. 28, 42, 47

[Consortium 2009] The UniProt Consortium. The Universal Protein Resource (UniProt)2009. Nucleic Acids Res, vol. 37, no. Database issue, pages D169–74, January 2009.52, 53

[Corpet 2000] F. Corpet, F. Servant, J. Gouzy and D. Kahn. ProDom and ProDom-CG:tools for protein domain analysis and whole genome comparisons. Nucleic AcidsRes, vol. 28, no. 1, pages 267–9, January 2000. 20, 52

[Cox 2004] E. A. Cox and J. Hardin. Sticky worms: adhesion complexes in C. elegans. JCell Sci, vol. 117, no. Pt 10, pages 1885–97, April 2004. 53

[Daffe 1999] M. Daffe and G. Etienne. The capsule of Mycobacterium tuberculosis and itsimplications for pathogenicity. Tuber Lung Dis, vol. 79, no. 3, pages 153–69, June1999. 3

[Das 1997] S. Das, L. Yu, C. Gaitatzes, R. Rogers, J. Freeman, J. Bienkowska, R. M.Adams, T. F. Smith and J. Lindelien. Biology’s new Rosetta stone. Nature, vol. 385,no. 6611, pages 29–30, January 1997. 26, 59

[DeLano 2002] W. L. DeLano. The PyMOL Molecular Graphics System., 2002. 21

[DeLong 1998] E. F. DeLong. Everything in moderation: archaea as ’non-extremophiles’.Curr Opin Genet Dev, vol. 8, no. 6, pages 649–54, December 1998. 3

[DeLong 2001] E. F. DeLong and N. R. Pace. Environmental diversity of bacteria andarchaea. Syst Biol, vol. 50, no. 4, pages 470–8, August 2001. 3

[Doolittle 1999] W. F. Doolittle. Phylogenetic classification and the universal tree. Science,vol. 284, no. 5423, pages 2124–9, June 1999. 1

[Dufresne 2005] A. Dufresne, L. Garczarek and F. Partensky. Accelerated evolution associated with genome reduction in a free-living prokaryote. Genome Biol, vol. 6, no. 2, page R14, January 2005. 5

[Dunker 2000] A. K. Dunker, Z. Obradovic, P. Romero, E. C. Garner and C. J. Brown. Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform, vol. 11, pages 161–71, December 2000. 42, 47

[Dure 1981] L. Dure, S. C. Greenway and G. A. Galau. Developmental biochemistry of cottonseed embryogenesis and germination: changing messenger ribonucleic acid populations as shown by in vitro and in vivo protein synthesis. Biochemistry, vol. 20, no. 14, pages 4162–8, July 1981. 10

[Eckburg 2003] P. B. Eckburg, P. W. Lepp and D. A. Relman. Archaea and their potential role in human disease. Infect Immun, vol. 71, no. 2, pages 591–6, February 2003. 4

[Emsley 1996] P. Emsley, I. G. Charles, N. F. Fairweather and N. W. Isaacs. Structure of Bordetella pertussis virulence factor P.69 pertactin. Nature, vol. 381, no. 6577, pages 90–2, May 1996. 55

[Fasman 1989] G. D. Fasman. The development of the prediction of protein structure., pages 317–358. Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, 1989. 11, 13, 15, 31, 33, 63

[Federici 2001] L. Federici, C. Caprari, B. Mattei, C. Savino, A. D. Matteo, G. D. Lorenzo, F. Cervone and D. Tsernoglou. Structural requirements of endopolygalacturonase for the interaction with PGIP (polygalacturonase-inhibiting protein). Proc Natl Acad Sci U S A, vol. 98, no. 23, pages 13425–30, November 2001. 55

[Fink 2005] A. L. Fink. Natively unfolded proteins. Curr Opin Struct Biol, vol. 15, no. 1, pages 35–41, February 2005. 9, 18, 39, 42, 44

[Fischer 1894] E. Fischer. Einfluss der Configuration auf die Wirkung der Enzyme. Berichte der deutschen chemischen Gesellschaft, vol. 27, no. 3, pages 2985–2993, December 1894. 9

[Fischer 1999] D. Fischer, C. Barret, K. Bryson, A. Elofsson, A. Godzik, D. Jones, K. J. Karplus, L. A. Kelley, R. M. MacCallum, K. Pawłowski, B. Rost, L. Rychlewski and M. Sternberg. CAFASP-1: critical assessment of fully automated structure prediction methods. Proteins, vol. Suppl 3, pages 209–17, May 1999. 20, 53, 54

[Galperin 1999] M. Y. Galperin, R. L. Tatusov and E. V. Koonin. Comparing microbial genomes: How the gene set determines the lifestyle., pages 91–108. Organization of the Prokaryotic Genome. ASM Press, 1999. 25

[Ganesh 1999] C. Ganesh, N. Eswar, S. Srivastava, C. Ramakrishnan and R. Varadarajan. Prediction of the maximal stability temperature of monomeric globular proteins solely from amino acid sequence. FEBS Lett, vol. 454, no. 1-2, pages 31–6, July 1999. 26

[Gattiker 2003] A. Gattiker, K. Michoud, C. Rivoire, A. H. Auchincloss, E. Coudert, T. Lima, P. Kersey, M. Pagni, C. J. A. Sigrist, C. Lachaize, A. L. Veuthey, E. Gasteiger and A. Bairoch. Automated annotation of microbial proteomes in SWISS-PROT. Comput Biol Chem, vol. 27, no. 1, pages 49–58, February 2003. 13, 23

[Gevers 2006] D. Gevers, P. Dawyndt, P. Vandamme, A. Willems, M. Vancanneyt, J. Swings and P. D. Vos. Stepping stones towards a new prokaryotic taxonomy. Philos Trans R Soc Lond B Biol Sci, vol. 361, no. 1475, pages 1911–6, November 2006. 4

[Giovannoni 2005] S. J. Giovannoni, H. J. Tripp, S. Givan, M. Podar, K. L. Vergin, D. Baptista, L. Bibbs, J. Eads, T. H. Richardson, M. Noordewier, M. S. Rappe, J. M. Short, J. C. Carrington and E. J. Mathur. Genome streamlining in a cosmopolitan oceanic bacterium. Science, vol. 309, no. 5738, pages 1242–5, August 2005. 5

[Gunasekaran 2003] K. Gunasekaran, C.-J. Tsai, S. Kumar, D. Zanuy and R. Nussinov. Extended disordered proteins: targeting function with less scaffold. Trends Biochem Sci, vol. 28, no. 2, pages 81–5, February 2003. 9, 44, 48

[Hall-Stoodley 2004] L. Hall-Stoodley, J. W. Costerton and P. Stoodley. Bacterial biofilms: from the natural environment to infectious diseases. Nat Rev Microbiol, vol. 2, no. 2, pages 95–108, February 2004. 4

[Hamburger 1999] Z. A. Hamburger, M. S. Brown, R. R. Isberg and P. J. Bjorkman. Crystal structure of invasin: a bacterial integrin-binding protein. Science, vol. 286, no. 5438, pages 291–5, October 1999. 53

[Haney 1999] P. J. Haney, J. H. Badger, G. L. Buldak, C. I. Reich, C. R. Woese and G. J. Olsen. Thermal adaptation analyzed by comparison of protein sequences from mesophilic and extremely thermophilic Methanococcus species. Proc Natl Acad Sci U S A, vol. 96, no. 7, pages 3578–83, March 1999. 28

[Hollingsworth 2004] M. A. Hollingsworth and B. J. Swanson. Mucins in cancer: protection and control of the cell surface. Nat Rev Cancer, vol. 4, no. 1, pages 45–60, January 2004. 6, 50

[Howland 2000] J. L. Howland. The surprising archaea: Discovering another domain of life. Oxford University Press, 2000. 4

[Hubbard 2009] T. J. P. Hubbard, B. L. Aken, S. Ayling, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen, P. Clapham, L. Clarke, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gordon, S. Graf, S. Haider, M. Hammond, R. Holland, K. Howe, A. Jenkinson, N. Johnson, A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl, B. Overduin, A. Parker, B. Pritchard, D. Rios, M. Schuster, G. Slater, D. Smedley, W. Spooner, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, S. Wilder, A. Zadissa, E. Birney, F. Cunningham, V. Curwen, R. Durbin, X. M. Fernandez-Suarez, J. Herrero, A. Kasprzyk, G. Proctor, J. Smith, S. Searle and P. Flicek. Ensembl 2009. Nucleic Acids Res, vol. 37, no. Database issue, pages D690–7, January 2009. 13, 23

[Iakoucheva 2002] L. M. Iakoucheva, C. J. Brown, J. D. Lawson, Z. Obradovic and A. K. Dunker. Intrinsic disorder in cell-signaling and cancer-associated proteins. J Mol Biol, vol. 323, no. 3, pages 573–84, October 2002. 10, 39, 42, 47

[Iakoucheva 2004] L. M. Iakoucheva, P. Radivojac, C. J. Brown, T. R. O’Connor, J. G. Sikes, Z. Obradovic and A. K. Dunker. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res, vol. 32, no. 3, pages 1037–49, January 2004. 9

[Jones 1999a] D. T. Jones. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol, vol. 287, no. 4, pages 797–815, April 1999. 20, 53, 54

[Jones 1999b] D. T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol, vol. 292, no. 2, pages 195–202, September 1999. 17, 41

[Kall 2004] L. Kall, A. Krogh and E. L. L. Sonnhammer. A combined transmembrane topology and signal peptide prediction method. J Mol Biol, vol. 338, no. 5, pages 1027–36, May 2004. 56

[Karlin 2002] S. Karlin, L. Brocchieri, J. Trent, B. E. Blaisdell and J. Mrazek. Heterogeneity of genome and proteome content in bacteria, archaea, and eukaryotes. Theor Popul Biol, vol. 61, no. 4, pages 367–90, June 2002. 25, 26, 59

[Kelley 1999] L. A. Kelley, R. M. MacCallum and M. J. E. Sternberg. Recognition of remote protein homologies using three-dimensional information to generate a position specific scoring matrix in the program 3d-pssm., pages 218–225. RECOMB ’99: Proceedings of the third annual international conference on Computational molecular biology. ACM, 1999. 20, 53, 54

[Kessel 1982] M. Kessel and Y. Cohen. Ultrastructure of square bacteria from a brine pool in Southern Sinai. J Bacteriol, vol. 150, no. 2, pages 851–60, May 1982. 55

[Koga 2005] Y. Koga and H. Morii. Recent advances in structural research on ether lipids from archaea including comparative and physiological aspects. Biosci Biotechnol Biochem, vol. 69, no. 11, pages 2019–34, November 2005. 4

[Koga 2007] Y. Koga and H. Morii. Biosynthesis of ether-type polar lipids in archaea and evolutionary considerations. Microbiol Mol Biol Rev, vol. 71, no. 1, pages 97–120, March 2007. 4

[Koonin 2002] E. V. Koonin, Y. I. Wolf and G. P. Karev. The structure of the protein universe and genome evolution. Nature, vol. 420, no. 6912, pages 218–23, November 2002. 26

[Kumar 2001] S. Kumar and R. Nussinov. How do thermophilic proteins deal with heat? Cell Mol Life Sci, vol. 58, no. 9, pages 1216–33, August 2001. 26, 59

[Kyte 1982] J. Kyte and R. F. Doolittle. A simple method for displaying the hydropathic character of a protein. J Mol Biol, vol. 157, no. 1, pages 105–32, May 1982. 13, 16, 18, 33, 34, 35, 36, 42, 43, 63

[Ladenstein 2006] R. Ladenstein and B. Ren. Protein disulfides and protein disulfide oxidoreductases in hyperthermophiles. FEBS J, vol. 273, no. 18, pages 4170–85, September 2006. 28, 29

[Lee 1994] K. H. Lee, D. Xie, E. Freire and L. M. Amzel. Estimation of changes in side chain configurational entropy in binding and folding: general methods and application to helix formation. Proteins, vol. 20, no. 1, pages 68–84, September 1994. 28, 47

[Lodish 2004] H. Lodish, A. Berk, P. Matsudaira, C. A. Kaiser, M. Krieger, M. P. Scott, S. L. Zipursky and J. Darnell. Molecular cell biology., volume 5. W.H. Freeman & Company, 2004. 10

[Ma 2006] K. Ma, J. G. Forbes, G. Gutierrez-Cruz and K. Wang. Titin as a giant scaffold for integrating stress and Src homology domain 3-mediated signaling pathways: the clustering of novel overlap ligand motifs in the elastic PEVK segment. J Biol Chem, vol. 281, no. 37, pages 27539–56, September 2006. 10

[Malkov 2008] S. N. Malkov, M. V. Zivkovic, M. V. Beljanski, M. B. Hall and S. D. Zaric. A reexamination of the propensities of amino acids towards a particular secondary structure: classification of amino acids based on their chemical structure. J Mol Model, vol. 14, no. 8, pages 769–75, August 2008. 11, 13, 15, 32, 63

[Marcelino 2008] A. M. C. Marcelino and L. M. Gierasch. Roles of beta-turns in protein folding: from peptide models to protein engineering. Biopolymers, vol. 89, no. 5, pages 380–91, May 2008. 28, 29, 39

[Marsden 2002] R. L. Marsden, L. J. McGuffin and D. T. Jones. Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Sci, vol. 11, no. 12, pages 2814–24, December 2002. 20, 52, 54

[Martin 1999] W. Martin. Mosaic bacterial chromosomes: a challenge en route to a tree of genomes. Bioessays, vol. 21, no. 2, pages 99–104, February 1999. 1

[Matsumura 1989] M. Matsumura, G. Signor and B. W. Matthews. Substantial increase of protein stability by multiple disulphide bonds. Nature, vol. 342, no. 6247, pages 291–3, November 1989. 28

[McGuffin 2003] L. J. McGuffin and D. T. Jones. Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics, vol. 19, no. 7, pages 874–81, May 2003. 20, 53, 54

[Meier 2000] M. Meier, M. D. Bider, V. N. Malashkevich, M. Spiess and P. Burkhard. Crystal structure of the carbohydrate recognition domain of the H1 subunit of the asialoglycoprotein receptor. J Mol Biol, vol. 300, no. 4, pages 857–65, July 2000. 53

[Meier 2007] R. Meier, T. Drepper, V. Svensson, K.-E. Jaeger and U. Baumann. A calcium-gated lid and a large beta-roll sandwich are revealed by the crystal structure of extracellular lipase from Serratia marcescens. J Biol Chem, vol. 282, no. 43, pages 31477–83, October 2007. 55

[Movassaghi 2002] M. Movassaghi and E. N. Jacobsen. Chemistry. The simplest “enzyme”. Science, vol. 298, no. 5600, pages 1904–5, December 2002. 10

[Myers 1995] J. K. Myers, C. N. Pace and J. M. Scholtz. Denaturant m values and heat capacity changes: relation to changes in accessible surface areas of protein unfolding. Protein Sci, vol. 4, no. 10, pages 2138–48, October 1995. 26

[Nakayama 2001] K. I. Nakayama, S. Hatakeyama and K. Nakayama. Regulation of the cell cycle at the G1-S transition by proteolysis of cyclin E and p27Kip1. Biochem Biophys Res Commun, vol. 282, no. 4, pages 853–60, April 2001. 39

[Nicholson 2000] W. L. Nicholson, N. Munakata, G. Horneck, H. J. Melosh and P. Setlow. Resistance of Bacillus endospores to extreme terrestrial and extraterrestrial environments. Microbiol Mol Biol Rev, vol. 64, no. 3, pages 548–72, September 2000. 3

[Nicholson 2002] W. L. Nicholson, P. Fajardo-Cavazos, R. Rebeil, T. A. Slieman, P. J. Riesenman, J. F. Law and Y. Xue. Bacterial endospores and their significance in stress resistance. Antonie Van Leeuwenhoek, vol. 81, no. 1-4, pages 27–32, August 2002. 3

[Nollet 2000] F. Nollet, P. Kools and F. van Roy. Phylogenetic analysis of the cadherin superfamily allows identification of six major subfamilies besides several solitary members. J Mol Biol, vol. 299, no. 3, pages 551–72, June 2000. 53

[Ochiai 2007] A. Ochiai, T. Itoh, Y. Maruyama, A. Kawamata, B. Mikami, W. Hashimoto and K. Murata. A novel structural fold in polysaccharide lyases: Bacillus subtilis family 11 rhamnogalacturonan lyase YesW with an eight-bladed beta-propeller. J Biol Chem, vol. 282, no. 51, pages 37134–45, December 2007. 55

[Onyenwoke 2004] R. U. Onyenwoke, J. A. Brill, K. Farahi and J. Wiegel. Sporulation genes in members of the low G+C Gram-type-positive phylogenetic branch (Firmicutes). Arch Microbiol, vol. 182, no. 2-3, pages 182–92, October 2004. 4

[Otto 2005] B. R. Otto, R. Sijbrandi, J. Luirink, B. Oudega, J. G. Heddle, K. Mizutani, S.-Y. Park and J. R. H. Tame. Crystal structure of hemoglobin protease, a heme binding autotransporter protein from pathogenic Escherichia coli. J Biol Chem, vol. 280, no. 17, pages 17339–45, April 2005. 55

[Pikuta 2007] E. V. Pikuta, R. B. Hoover and J. Tang. Microbial extremophiles at the limits of life. Crit Rev Microbiol, vol. 33, no. 3, pages 183–209, July 2007. 4

[Pouchkina-Stantcheva 2007] N. N. Pouchkina-Stantcheva, B. M. McGee, C. Boschetti, D. Tolleter, S. Chakrabortee, A. V. Popova, F. Meersman, D. Macherel, D. K. Hincha and A. Tunnacliffe. Functional divergence of former alleles in an ancient asexual invertebrate. Science, vol. 318, no. 5848, pages 268–71, October 2007. 10

[Qian 2006] X. Qian, Y. He, X. Ma, M. N. Fodje, P. Grochulski and Y. Luo. Calcium stiffens archaeal Rad51 recombinase from Methanococcus voltae for homologous recombination. J Biol Chem, vol. 281, no. 51, pages 39380–7, December 2006. 56

[Reva 2008] O. Reva and B. Tummler. Think big–giant genes in bacteria. Environ Microbiol, vol. 10, no. 3, pages 768–77, March 2008. 10, 11, 49, 50, 60

[Romero 2001] P. Romero, Z. Obradovic, X. Li, E. C. Garner, C. J. Brown and A. K. Dunker. Sequence complexity of disordered protein. Proteins, vol. 42, no. 1, pages 38–48, January 2001. 10, 18, 28, 29, 42

[Rosa 1986] M. D. Rosa, A. Gambacorta and A. Gliozzi. Structure, biosynthesis, and physicochemical properties of archaebacterial lipids. Microbiol Rev, vol. 50, no. 1, pages 70–80, March 1986. 4

[Rose 1978] G. D. Rose. Prediction of chain turns in globular proteins on a hydrophobic basis. Nature, vol. 272, no. 5654, pages 586–90, April 1978. 33, 48

[Rose 1985] G. D. Rose, L. M. Gierasch and J. A. Smith. Turns in peptides and proteins. Adv Protein Chem, vol. 37, pages 1–109, January 1985. 33, 48

[Seligmann 2003] H. Seligmann. Cost-minimization of amino acid usage. J Mol Evol, vol. 56, no. 2, pages 151–61, February 2003. 3, 26

[Servant 2002] F. Servant, C. Bru, S. Carrere, E. Courcelle, J. Gouzy, D. Peyruc and D. Kahn. ProDom: automated clustering of homologous domains. Brief Bioinform, vol. 3, no. 3, pages 246–51, September 2002. 20, 52

[Skovgaard 2001] M. Skovgaard, L. J. Jensen, S. Brunak, D. Ussery and A. Krogh. On the total number of genes and their length distribution in complete microbial genomes. Trends Genet, vol. 17, no. 8, pages 425–8, August 2001. 10, 26, 61

[Smith 2005] N. L. Smith, E. J. Taylor, A.-M. Lindsay, S. J. Charnock, J. P. Turkenburg, E. J. Dodson, G. J. Davies and G. W. Black. Structure of a group A streptococcal phage-encoded virulence factor reveals a catalytically active triple-stranded beta-helix. Proc Natl Acad Sci U S A, vol. 102, no. 49, pages 17652–7, December 2005. 55

[Spassov 1994] V. Z. Spassov, A. D. Karshikoff and R. Ladenstein. Optimization of the electrostatic interactions in proteins of different functional and folding type. Protein Sci, vol. 3, no. 9, pages 1556–69, September 1994. 28

[Stokes 2004] R. W. Stokes, R. Norris-Jones, D. E. Brooks, T. J. Beveridge, D. Doxsee and L. M. Thorson. The glycan-rich outer layer of the cell wall of Mycobacterium tuberculosis acts as an antiphagocytic capsule limiting the association of the bacterium with macrophages. Infect Immun, vol. 72, no. 10, pages 5676–86, October 2004. 3

[Stryer 1995] L. Stryer. Biochemistry. W.H. Freeman & Company, 4th edition, 1995. 17, 38, 39, 40

[Tatusov 2001] R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova and E. V. Koonin. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res, vol. 29, no. 1, pages 22–8, January 2001. 25, 61

[Thanbichler 2005] M. Thanbichler, S. C. Wang and L. Shapiro. The bacterial nucleoid: a highly organized and dynamic structure. J Cell Biochem, vol. 96, no. 3, pages 506–21, October 2005. 2

[Thompson 1999] M. J. Thompson and D. Eisenberg. Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J Mol Biol, vol. 290, no. 2, pages 595–604, July 1999. 26, 59

[Tompa 2003] P. Tompa. Intrinsically unstructured proteins evolve by repeat expansion. Bioessays, vol. 25, no. 9, pages 847–55, September 2003. 42, 47

[Tompa 2004] P. Tompa and P. Csermely. The role of structural disorder in the function of RNA and protein chaperones. FASEB J, vol. 18, no. 11, pages 1169–75, August 2004. 9

[Tunnacliffe 2005] A. Tunnacliffe, J. Lapinski and B. McGee. A Putative LEA Protein, but no Trehalose, is Present in Anhydrobiotic Bdelloid Rotifers. Hydrobiologia, vol. 546, no. 1, pages 315–321, March 2005. 10

[Uversky 2000] V. N. Uversky, J. R. Gillespie and A. L. Fink. Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins, vol. 41, no. 3, pages 415–27, November 2000. 10

[van Heijenoort 2001] J. van Heijenoort. Formation of the glycan chains in the synthesis of bacterial peptidoglycan. Glycobiology, vol. 11, no. 3, pages 25R–36R, March 2001. 3

[van Pouderoyen 2003] G. van Pouderoyen, H. J. Snijder, J. A. E. Benen and B. W. Dijkstra. Structural insights into the processivity of endopolygalacturonase I from Aspergillus niger. FEBS Lett, vol. 554, no. 3, pages 462–6, November 2003. 55

[Vieille 1996] C. Vieille, D. S. Burdette and J. G. Zeikus. Thermozymes. Biotechnol Annu Rev, vol. 2, pages 1–83, January 1996. 28

[Vieille 2001] C. Vieille and G. J. Zeikus. Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiol Mol Biol Rev, vol. 65, no. 1, pages 1–43, March 2001. 26, 29, 59

[Vreeland 2000] R. H. Vreeland, W. D. Rosenzweig and D. W. Powers. Isolation of a 250 million-year-old halotolerant bacterium from a primary salt crystal. Nature, vol. 407, no. 6806, pages 897–900, October 2000. 3

[Vucetic 2003] S. Vucetic, C. J. Brown, A. K. Dunker and Z. Obradovic. Flavors of protein disorder. Proteins, vol. 52, no. 4, pages 573–84, September 2003. 9, 42

[Walsby 1980] A. E. Walsby. A square bacterium. Nature, vol. 283, no. 5742, pages 69–71, January 1980. 10.1038/283069a0. 5, 55, 60

[Ward 2004] J. J. Ward, J. S. Sodhi, L. J. McGuffin, B. F. Buxton and D. T. Jones. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol, vol. 337, no. 3, pages 635–45, March 2004. 9, 42

[Watson 1987] J. D. Watson, N. H. Hopkins, J. W. Roberts, J.-A. Steitz and A. M. Weiner. Molecular biology of the gene., volume 1. Benjamin/Cummings Publishing Company, 4th edition, 1987. 10

[Woese 1977] C. R. Woese and G. E. Fox. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A, vol. 74, no. 11, pages 5088–90, November 1977. 2

[Woese 1990] C. R. Woese, O. Kandler and M. L. Wheelis. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A, vol. 87, no. 12, pages 4576–9, June 1990. 1, 2

[Woese 1994] C. R. Woese. There must be a prokaryote somewhere: microbiology’s search for itself. Microbiol Rev, vol. 58, no. 1, pages 1–9, March 1994. 3

[Wootton 1993] J. C. Wootton and S. Federhen. Statistics of local complexity in amino acid sequences and sequence databases. Computers & Chemistry, vol. 17, no. 2, pages 149–163, June 1993. 18, 42, 43

[Wootton 1996] J. C. Wootton and S. Federhen. Analysis of compositionally biased regions in sequence databases. Methods Enzymol., vol. 266, pages 554–571, January 1996. 18, 42, 43

[Wright 1999] P. E. Wright and H. J. Dyson. Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol, vol. 293, no. 2, pages 321–31, October 1999. 42, 47

[Xiao 1999] L. Xiao and B. Honig. Electrostatic contributions to the stability of hyperthermophilic proteins. J Mol Biol, vol. 289, no. 5, pages 1435–44, June 1999. 28, 47

[Zamyatnin 1972] A. A. Zamyatnin. Protein volume in solution. Progress in Biophysics and Molecular Biology, vol. 24, pages 107–123, January 1972. 19

[Zhang 2000] J. Zhang. Protein-length distributions for the three domains of life. Trends Genet, vol. 16, no. 3, pages 107–9, March 2000. 25, 59
