a holistic approach to understanding cazy families through
TRANSCRIPT
A holistic approach to understanding CAZy families through
reductionist methods.
Jens Eklöf
Licentiate thesis
KTH Royal Institute of Technology
School of Biotechnology
Department of Glycoscience
Stockholm 2009
ii
© Jens Eklöf
Stockholm 2009
Royal Institute of Technology
School of Biotechnology
Albanova University Center
SE-100 44 Stockholm
Sweden
ISBN 978-91-7415-269-2
TRITA BIO Report 2009:5
ISSN 1654-2312
iii
Abstract
In a time when the amount of biological data present in the public domain is becoming
increasingly vast, the need for good classification systems has never been greater. In the
field of glycoscience the necessity of a good classification for the enzymes involved in the
biosynthesis, modification and degradation of polysaccharides is even more pronounced
than in other fields. This is due to the complexity of the substrates, the polysaccharides, as
the theoretical number of possible hexa-oligosaccharides from only hexoses exceeds 1012
isomers!
An initiative to classify enzymes acting on carbohydrates began around 1990 by the French
scientist Bernard Henrissat. The resulting database, the Carbohydrate Active enZymes
database (CAZy), classifies enzymes by sequence similarity into families allowing the
inference of structure and catalytic mechanism. What CAZy does not provide however, are
means to understand how members of a family are related, and in what way they differ from
each other. The top-down approach used in this thesis, combining phylogenetic analysis of
whole CAZy families, or sub-families, with structural determinations and detailed kinetic
analysis allows for exactly that.
Finding determinants for transglycosylation versus hydrolysis within the xth gene product
family of GH16 as well as restricting the hydrolytic enzymes to a well defined clade are
integral parts of paper I. In paper II a new bacterial sub-clade within CE8 was discovered.
The structural determination of the Escherichia coli outer membrane lipoprotein YbhC from
the new sub-clade explained the difference in specificity. The information provided in the
two papers of this thesis gives a better understanding of the development of different
specificities of diverse CAZY families as well as it aids in future gene product annotations.
Furthermore this work has begun to fill the white spots uncovered in the phylogenetic trees.
iv
Sammanfattning
Nu när mängden biologisk data på servrar runt om i världen, vida överstiger det som kan
anses vara överblickbart, är nöden att finna goda klassificeringssystem som störst. Inom
glykovetenskapen är behovet av att finna goda klassificeringsmetoder för de enzymer som
ansvarar för biosyntes-, modifiering- och degradering av polysackarider extra stort. Skälet till
det står att finna i komplexiteten hos substraten till dessa enzymer, polysacckariderna.
Antalet teoretiskt möjliga isomerer av en hexasacckarid från enbart hexoser överstiger 1012
stycken.
Ett sådant klassificeringsinitiativ startades kring 1990 av den franske forskaren Bernard
Henrissat. Hans arbete resulterade i en databas kallad, Carbohydrate Active enZymes
(CAZy), där enzymer klassificeras i olika familjer med ledning av deras aminosyrasekvens.
Denna familjeuppdelning möjliggör för besökare att dra slutsatser om ett enzyms struktur
och mekanism. Vad CAZy dock inte gör, är att visa hur sekvenserna är besläktade och på
vilket sätt de skiljer sig åt. Det uppifrån-och-ner tillvägagånssätt som använts i denna
avhandling, en kombination av fylogenetiskanalys av hela CAZy-familjer, eller underfamiljer,
samt strukturbestämmning av proteiner och detaljerad kinetikanalys åskådligör precis de
saker som CAZy utelämnar. I denna avhandling visas två exempel av ovan nämnda
tillvägagångssätt.
I artikel I påvisas faktorer för transglykosylering jämfört med hydrolys inom en underfamilj
(xth genproduktfamiljen) av glykosylhydrolysfamilj 16, GH16, och den hydrolytiska
aktiviteten begränsas till en underfamilj av xth genproduktfamiljen. I artikel II identifieras en
ny underfamilj av kolhydratesterasfamilj 8, CE8, som efter strukturbestämning av en dess
medlemar, yttermembran lipoproteinet YbhC från Escherichia coli, visade sig ha utvecklat en
annan specificitet än sina stamfäder pektinmetylesteraserna. Informationen i ovanstående
artiklar ökar förståelsen för utvecklingen av nya aktiviteter inom dessa två familjer och
förenklar framtida genannoteringar. Dessutom, så har de vita fläckar som uppdagats i dessa
fylogenetiska träd börjat fyllas i.
v
List of publications
I The crystal structure of the outer membrane lipoprotein YbhC from Escherichia coli
sheds new light on the phylogeny of Carbohydrate Esterase family 8. Eklöf J.M.≡, Tan
T.C.≡, Divne C., Brumer H. Submitted (Proteins: Structure, Function, and Bioinformatics)
Personal contribution to paper I. I did the phylogenetic work on Carbohydrate Esterase Family 8
and the biochemical characterisation. I also contributed to the crystallisation of YbhC and
was the principal author.
II Structural Evidence for the Evolution of Xyloglucanase Activity from Xyloglucan
Endo-Transglycosylases: Biological Implications for Cell Wall Metabolism. Baumann M.J.,
Eklöf J.M., Michel G., Kallas Å.M., Teeri T.T., Czjzek M., Brumer H., III.. Plant Cell 2007;
19(6):1947-1963
Personal contribution to paper II. Shared the work on producing and characterising TmNXG1
and TmNXG2 with Martin Baumann, and the work on the phylogeny of the xth gene
product family with Gurvan Michel.
vi
Related publications
1. Analysis of nasturtium TmNXG1 complexes by crystallography and molecular
dynamics provides detailed insight into substrate recognition by family GH16 xyloglucan
endo-transglycosylases and endo-hydrolases. Mark P, Baumann M.J., Eklöf J.M., Gullfot F.,
Michel G., Kallas Å.M., Teeri T.T., Brumer H., Czjzek M. Proteins: Structure, Function, and
Bioinformatics 2008; DOI 10.1002/prot.22291
2. Top-Down Grafting of Xyloglucan to Gold Monitored by QCM-D and AFM:
Enzymatic Activity and Interactions with Cellulose. Nordgren N., Eklöf J. M., Zhou Q.,
Brumer H., Rutland M.W. Biomacromolecules 2008; 9(3):942-948.
3 Characterisation and 3-D structures of two distinct bacterial xyloglucanases from
families GH5 and GH12. Gloster T.M., Ibatullin F.M., Macauley K., Eklöf J.M., Roberts S.,
Turkenburg J.P., Bjornvad M.E., Jorgensen P.L., Danielsen S., Johansen K.E., Borchert
T.V., Wilson K.S., Brumer H., Davies G.J. Journal of Biological Chemistry 2007; 282(26):19177-
19189
vii
Table of contents
INTRODUCTION ..................................................................................................................................... 1
1 PROTEINS ................................................................................................................................................... 2
1.1 Protein evolution ............................................................................................................................. 4
1.2 Protein folding ................................................................................................................................. 5
2 BIOINFORMATICS ......................................................................................................................................... 8
2.1 Multiple sequence analysis ............................................................................................................. 9
2.1.1 An example of a multiple sequence alignment algorithm ...................................................................... 11
2.3 Phylogenetics ................................................................................................................................ 13
3 THE CARBOHYDRATE‐ACTIVE ENZYMES DATABASE, CAZY .................................................................................. 17
3.1 Reaction mechanism of glycosyl hydrolases ................................................................................. 18
3.2 Glycosyl Hydrolase family 16 ........................................................................................................ 20
3.2.1 Xyloglucan endo‐transglycosidases ........................................................................................................ 21
3.2.2 XETs physiological role in plant cell walls ............................................................................................... 22
3.3 Carbohydrate Esterase family 8 .................................................................................................... 24
3.3.1 Pectin methylesterases role in plant cell walls and pectin degradation ................................................ 25
3.3.2 The pectin methylesterases ................................................................................................................... 28
PRESENT INVESTIGATION .................................................................................................................... 32
4.1 PAPER I: A PHYLOGENETIC ANALYSIS OF CE8 LOCATES A NEW BACTERIAL SUB‐CLADE AND THE STRUCTURAL
DETERMINATION OF E. COLI YBHC. ................................................................................................................... 33
4.2 PAPER II: INVESTIGATION OF THE GH16 XTH GENE FAMILY CLARIFIES THE DETERMINANTS FOR TRANSGLYCOSYLATION
VERSUS HYDROLYSIS WITHIN THE FAMILY BY EXPLORING THE NEW 3D STRUCTURE OF TMNXG1 AND RESTRICTS THE
HYDROLYTIC ACTIVITY TO A SPECIFIC SUB‐CLADE. ................................................................................................. 35
CONCLUDING REMARKS ...................................................................................................................... 38
ACKNOWLEDGEMENT ......................................................................................................................... 39
REFERENCES ........................................................................................................................................ 41
viii
.
Jens Eklöf 1
Introduction
This thesis is focused on proteins, especially those acting on carbohydrates. The complexity
of the substrates of these proteins, the polysaccharides built from carbohydrates, and the
important functions they have in organisms, make carbohydrate acting proteins an exciting
and diverse field. Proteins acting on carbohydrates are interesting for the pharmaceutical
industry due to the diseases caused by dysfunctional biosynthetic enzymes and because of
immune response caused by oligosaccharides, to the energy sector for their use in making
alternative fuels and to the forest and agricultural industry because of the roles
polysaccharides play in their products, just to mention a few of the beneficiaries of the
research on these enzymes.
Because of the abundance and variety of carbohydrate acting proteins an initiative (CAZy)
was started around 1990 to order these proteins into families by their primary structure.
Research in the field of glycoscience is often focused on singular enzymes and clarifying that
enzyme’s role. There is therefore a gap in the information between the family level of CAZy
and the individual proteins of these families. Each enzyme can be seen as an island in an
ocean (representing a protein family) but there is no map of the ocean. The work in this
thesis has tried to bridge that gap between families and individual proteins by showing how
they are related (mapping the ocean) and investigating proteins in previously uncharacterised
areas of these families.
Chapter 1 gives an introduction to proteins, their origin, evolution and folding. In chapter 2
the field of bioinformatics is introduced together with some specific examples of
bioinformatics-methods used to understand the relationship between proteins, described for
non-bioinformaticians. In chapter 3 the concept of the CAZy database is described in more
detail as well as the CAZy families that are investigated in chapter 4.
2 A holistic approach to understanding CAZy families through reductionist methods
1 Proteins Proteins are extremely versatile biopolymers whose complexity is matched only by the
complexity of polysaccharides. They perform crucial functions in virtually all biological
processes. Some of these functions are as catalyst (enzymes), or involved in transport, in
signal transmission, as provider of structural support and a whole array of other functions.
The term protein was first mentioned in a letter by the Swedish chemist Jöns Jacob Berzelius
in 1838;1,2
“The name protein that I propose for the organic oxide of fibrin and albumin, I wanted to
derive from the Greek word πρωτειος, because it appears to be the primitive or principal
substance of animal nutrition.”
Apart from being essential to all forms of life, proteins have also become work horses of
man in our modern society. They are manufactured and sold as a commodity in animal feed
and as enzymes to the baking and brewing industries. Proteins are also used in detergents, as
biocatalyst in the chemical industry and an increasing demand is coming from the biofuel
sector.3 The first protein used as a pharmaceutical drug was insulin introduced back in 1922
and today pharmaceutical companies are focusing much of their research into protein drugs.4
The rationale for using enzymes for industrial purpose is their inherent specificity, the
possibility to save energy and replacing harmful chemicals. In 2007, the world wide enzyme
market was $4.1 billion according to the Freedonia Group Inc. (www.freedoniagroup.com).
Proteins are linear biopolymers built up by monomers called amino-acids. There are 20
common amino-acids and they all share a backbone structure with an amino group on one
end and a carboxylic group on the other. The amino acids vary in their R group, which
covers a wide variety of chemical and physical properties (Figure 1a).
Jens Eklöf 3
Figure 1. The common building blocks of all proteins. a, the common structure of all
amino-acids. The R group is variable and unique for the 20 amino-acids. b, two amino-
acid coupled together via a peptide bond (connecting the amino group and the carbonyl
carbon within the box).
The biosynthesis of proteins begins with the transcription of DNA into mRNA, called
transcription, the mRNA is then translated into a peptide chain, called translation. The
transcription and the translation form the core of what is called the central dogma of molecular
biology5,6 that stipulates the flow of information between proteins, RNA and DNA (Figure
2).
Figure 2. A simplified version of the central dogma of molecular biology.
In the translation, that takes place at the ribosome, amino-acids are coupled together through
a condensation reaction between the carboxylic group of one amino-acid and the amino
group of another. The bond created is called a peptide bond and it is part of the backbone of
all proteins (Figure 1b).
C
R
NH3+
OOCH
H3N
R
O
HN
R
a b
COO+
Transcription Translation
ProteinmRNADNA
Replication
4 A holistic approach to understanding CAZy families through reductionist methods
The order in which amino-acids are put together is dictated by the mRNA, and a protein’s
amino-acid sequence is called the primary structure. The polypeptide chain forms certain
regular structures such as α-helices, β-sheets, turns and coils. These regular structures are
referred to as a protein’s secondary structure. The tertiary structure is a protein’s 3D structure i.e.
how the different secondary structure elements interact in space. Proteins can also build
complexes with other proteins or domains and this is called the quaternary structure.
1.1 Protein evolution
Even though the origin of proteins is unknown one thing is certain. Once proteins entered
the stage of life they came to stay. One theory of the origin of life, “the RNA world”, states
that RNA was the first biopolymer,7 acting both as a blueprint and a catalyst to replicate
itself. Today the different proposed roles of prehistoric RNA have to a large extent been
exchanged by other biomolecules. DNA has taken over as the blueprint of life (even though
some viruses use RNA as their information carrier) and proteins have taken over the role as
the work horse of life. RNA still has important roles but they have become increasingly
specific.
The reason proteins have overtaken the proposed roles of RNA is because of their greater
functional potential. Firstly the peptide bond in proteins is less susceptible to hydrolysis than
the phosphodiester bond in RNA increasing the life span for a protein molecule compared to
an RNA molecule. Secondly and perhaps more important is the fact that RNA is limited in
the amount of building blocks i.e. the four ribonucleotide sugars, and their similarity in terms
of functional groups. Proteins on the other hand are built from 20 amino-acids that have a
wide variety of functional groups with different chemical properties.
Jens Eklöf 5
1.2 Protein folding
Rendering a protein active and functional requires both that a protein is translated correctly
and that the polypeptide chain folds properly. Incorrectly folded proteins are not just
dysfunctional. They can aggregate and become harmful as in the case of Alzheimer’s
disease8,9 or in the mad cow disease.10 In both cases proteins adopt an alternative fold,
building up insoluble plaques.
The mechanisms behind protein folding are not fully elucidated but several theories have
been presented. One of the first theories on protein folding, originally hypothesised by
Dorothy M. Wrinch and Harold Jeffries already in 1919, called hydrophobic collapse,11-13 is
sprung from the observation that most proteins have a hydrophobic core and that these
residues interact in the folding process to minimise their exposure to the aqueous solvent,
thereby maximising entropy. Other protein folding theories have found alternative driving
forces for protein folding, for example that small domains or islands of the polypeptide chain
adopt a correct secondary structure first and that these islands then interact to build the
correctly folded protein (ref. 14 and references therein). The consensus today is that several
mechanisms work together in protein folding. However, there is still debate on what
happens first and to what extent different mechanisms contribute to the folding of a given
protein. One issue in studying the complex mechanism behind protein folding is that the
studies are usually conducted on small peptides or certain proteins like lysozyme and
assessing to what extent the knowledge attained is applicable on larger more complex
proteins is difficult.15-17
So how many protein folds are there? As more and more 3D structures of protein are being
solved it is likely that we can make an accurate estimate of the protein fold space. The
millions of protein sequences known can be grouped into families counted in tens of
thousands.18,19 Even though these families can have very little or no detectable sequence
similarity they can still share a similar fold as folds tend to be more preserved than
sequence.20 Therefore the tens of thousands of families can be boiled down to roughly 1000
distinct folds.21,22
6 A holistic approach to understanding CAZy families through reductionist methods
Is there a common ancestor for these thousand folds or do different folds represent distinct
evolutionary events? Choi and Kim suggested that proteins may have had several
independent origins but also present the possibility that new folds could arise from a frame
shift, a reversed reading direction or from a mutated stop codon.23 Even though a fold is
usually seen as something rather rigid there is evidence for structural plasticity. In the last
couple of years it has been shown that the same sequence can adopt different conformations
based on the environment. Factors such as pH, ionic strength or the presence of metal ions
can all act as conformational switches (ref. 24,25 and references therein), being important both
in certain diseases and potentially also in the development of new folds.
Out of all protein folds some seem to have been more successful as they are used by more
sequences than others. The concept of designability has been used to explain this
phenomenon.26 Designability implies that proteins with a higher average number of contacts
formed by residues are more stable and therefore allow a sequence to be further from
optimal but still fold correctly (Figure 3). The higher stability of highly designable proteins
renders them more evolvable since they can withstand a higher degree of genetic drift.27
Figure 3. Higher mountains have wider bases. A conceptual illustration of designability.
Structure A with a higher foldability also has a larger space of possible sequences to
explore and still be able to fold into a native state. Structure B more easily falls under the
threshold of foldability and is less likely to evolve a new function. Picture adapted from
Goldstein, 2008.28
Jens Eklöf 7
The idea of designability has been supported by a study in yeast where highly designable
proteins had a faster evolution rate.29
We have seen that proteins evolve, but how and why? The answer is that proteins evolve
attaining improved characteristics or acquiring new functionalities that make their host better
suited for their environment. This evolution happens through random genetic drift in the
DNA. Through mutations, deletions and inserts in the DNA sequence, a protein amino-acid
sequence or expression is altered. Some changes are silent (do not affect the amino-acid
sequence) and the changes that leads to the translation of a different amino-acid are likely to
be deleterious or neutral to a protein and only a few actually improve a proteins property
such as catalytic rate, temperature stability or substrate specificity. A number of different
factors influence the rate of this drift, if designability sets the framework for protein
evolution, factors such as protein expression levels, genetic linkage and others, have been
shown to set the rate of evolution on a protein.30
8 A holistic approach to understanding CAZy families through reductionist methods
2 Bioinformatics The vast amount of data produced by scientists around the world demands efficient and fast
tools for handling it. Whether the data comes from a genome project, a metabolomics
project or from any other source it needs to be processed and ordered to get the most out of
it. Ever since the early 1990s the importance and acceptance of bioinformatics has grown to
become tools used by scientists within all sectors of biology. Because of its diversity and
complexity a definition of this field is challenging. Here two definitions are presented, the
first by the National Institute of Health and the second from the National Center for
Biotechnology Information;
“Bioinformatics: Research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioral or health data, including
those to acquire, store, organize, archive, analyze, or visualize such data.”
(http://www.bisti.nih.gov/docs/CompuBioDef.pdf)
“Bioinformatics is the field of science in which biology, computer science, and information
technology merge into a single discipline. There are three important sub-disciplines within
bioinformatics: the development of new algorithms and statistics with which to assess
relationships among members of large data sets; the analysis and interpretation of various
types of data including nucleotide and amino acid sequences, protein domains, and protein
structures; and the development and implementation of tools that enable efficient access and
management of different types of information.”
(http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html)
Because of the complexity and diversity of the field of bioinformatics only those areas that
have been touched upon within this work are presented. These two fields are the related
areas of multiple sequence alignments and phylogenetic analysis. Both are instrumental
disciplines for deciphering the relationship between different organisms and proteins.
Jens Eklöf 9
2.1 Multiple sequence analysis
Multiple sequence alignments (MSAs) are widely used as starting points for protein function
and structure prediction, phylogenetic studies and other common tasks in sequence analysis
of proteins and DNA. They also aid in experimental design by revealing conserved residues
of potential functional importance.
Ever since the most commonly used program ClustalW was introduced in 198831 many new
programs have been developed, especially in the last half decade. These have all made claims
to be either more accurate or faster than previous methods. The early programs align
sequences on the basis on sequence alone. This can be done in a relatively fast way, with for
example MAFFT32 or MUSCLE33 but as the sequence identity of a given dataset drops below
30% these methods become increasingly unreliable. Many newer programs therefore
incorporate secondary or tertiary structure data into their algorithm to improve the alignment
quality on distantly related sequences. Secondary structure prediction programs such as
PSIBLAST34 and PSIPRED35 are used by for example PROMALS36, PRALINE37, SPEM38
while other programs use real structural data (Expresso39) or both predicted and real
structural data (PROMALS3D). The secondary structure elements, α-helices and β-sheets,
are more important for structural stability and therefore these newer programs can put
insertions and deletions in loop regions making them more accurate even when the sequence
similarity is low.
Quality control of different MSA programs is done by comparing alignment results against
“golden standards”, such as BAliBASE40, HOMSTRAD41, Prefab42 and SABmark43. By
comparing the “golden standards” true alignment with the alignment produced by a MSA
program, the accuracy of a program can be assessed.
MSA programs can be divided into two categories, matrix or consistency-based. The matrix-
based algorithms such as ClustalW,31 MUSCLE,42 and SPEM,38 use substitution matrices to
find the cost of matching two symbols or profiles. This method only takes the symbol or its
immediate surrounding into account. The consistency-based methods such as T-Coffee,44
MAFFT,32 ProbCons45 or PROMALS 36 uses a combination of local and global alignments to
10 A holistic approach to understanding CAZy families through reductionist methods
make a position-specific substitution matrix. This incorporation of more information comes
with a computational cost generally making consistency-based methods slower.
To choose the right software for a certain set of sequences can be difficult for the
inexperienced user and to avoid some common pitfalls one first needs to decide what is most
sought after; biological accuracy, speed or memory usage. Certain parameters such as the
number of sequences, length of the sequences and the homology within a certain dataset also
makes an influence on the choice of program.
As mentioned above different programs use different means to achieve an alignment. In
Figure 4 the relationship between the results, and not the method, of different alignment
algorithms on a certain golden standard is shown, reflecting the similarity between different
algorithms in terms of the alignment they produce.
Jens Eklöf 11
Figure 4. Method tree. A tree showing clustering of some multiple sequence alignment
methods. The fftns -1, -2, -i, finsi and ginsi are different versions of MAFFT v5.531.
Pairwise distances were calculated using the HOMSTRAD41 benchmark by computing
Sum of Pair score differences produced by the individual methods. Methods in bold are
included in M-Coffee suite. M-Coffee uses these different algorithms and combines the
results through T-Coffee to make a MSA.46 Picture adapted from Wallace et al., 2005.46
2.1.1 An example of a multiple sequence alignment algorithm
For the reader to better grasp how an alignment algorithm works the outline of the
MUSCLE algorithm33,42 is presented below.
MUSCLE is an multiple sequence alignment program that is similar to MAFFT32,47,48 when it
comes to accuracy and speed but slightly faster on large datasets e.g. >5000 sequences. Even
though they are similar in terms of performance, the algorithms behind them are different.
MUSCLE uses a three step process (Figure 5), first building a rough guide tree, secondly
making a more accurate tree and then finally iterative refinement. A more detailed
description is given below:
ClustalW
Dialign
ProbCons
Muscle 6
T-Coffe2
ginsi
finsi
muscle 3.52
PCMA
fftnsi
fftns2
fftns1
Poa-local
Poa-global
Dialign-T
12 A holistic approach to understanding CAZy families through reductionist methods
1. 1.1 Computing the k-mer distance, a k-mer is a short string of length k, and related
sequences share more k-mers than expected by chance. The k-mer distance is a
measurement of have many k-mers two sequences have in common. 1.2 These
distances are made into a distance matrix and a guide tree is constructed. 1.3 A
progressive alignment is built by following the order of the branching in the first
guide tree.
2. In stage 2 the first tree is improved and a new progressive alignment made. 2.1 From
the alignment in 1.3 the Kimura distance49 is computed. The Kimura distance is a
fast scoring method for protein similarity that ignores gaps and only counts identities.
A score, S is calculated by dividing the number of identities, m by the number of
positions, npos. The distance, d is then calculated through a series of manipulations
(vide infra)
nposmS =
SD −=1
( )22.01ln DDd −−−=
2.2 From the new Kimura distance matrix a new tree is constructed that dictates the
order of the progressive alignment (2.3).
3. Refinement. 3.1 Tree 2 (Figure 5) is cut in two and a new alignment is made for each
of the two sub-trees (3.2). 3.3 These sub-alignments are, in turn, realigned and if the
Sum-of-Pairs score (SP, a score that takes gaps into account) is better than the old
tree 2 it now enters a new iteration process from point 3, otherwise it is discarded
(3.4). The iterative process continues until convergence or a predetermined limit.
Jens Eklöf 13
Figure 5. The flow of the MUSCLE algorithm. The picture is from Edgar.42
The more advanced programs using secondary or tertiary structures include extra steps in
their algorithms. The program PROMALS that uses profiles in the algorithm works much
like MUSCLE up to 1.3 in Figure 3. After that it aligns sequences with a sequence identity
over a certain predetermined threshold value in a fast way, clustering groups of pre-aligned
sequences that are relatively divergent from each other. In the second step one sequence
from each group is chosen and PSIBLAST and PSIPRED are used to build secondary
structure based profiles. These profiles are then used to increase the quality of the
consistency scoring.36
2.3 Phylogenetics
One of the reasons for making multiple sequence alignments is because of their use for
inferring phylogenetic relationships (Figure 6). As in the case of the multiple sequence
alignment programs there are multitude of phylogenetic or at least tree-making programs
based on different algorithms and methods.
The two main techniques for inferring phylogenies from proteins are the distance-based and
the character-based methods. The distance methods are usually faster but have some
drawbacks when it comes to making true phylogenies since they are only based on a distance
matrix (i.e. only using part of the information present). The UPGMA method (Unweighted
14 A holistic approach to understanding CAZy families through reductionist methods
Pair Group Method with Arithmetic mean) is the simplest and fastest method with the
drawbacks of being sensitive to difference in evolutionary rates and branch lengths. It is
often used, as an initial tree builder in multiple sequence alignment programs due to its speed
(For example in the MUSCLE algorithm, vide supra). The other commonly used distance
method is the Neighbour Joining method (NJ).50 Unlike UPGMA the NJ method does not
require the data set to be ultrametric (meaning it does not require all lineages to have evolved
equal distances from the root, or it allows different rates of evolution). Another difference
between NJ methods and UPGMA is that NJ methods start with a star-like tree assumption,
rendering an unrooted tree while UPGMA produces rooted trees.
Figure 6. A typical workflow in the phylogenetic analysis of protein or DNA sequences.
Starting by choosing a dataset, aligning the dataset and finally inferring phylogenetic
relationships from the multiple sequence alignment.
The character-based methods have the advantage of using the information in the characters
of a sequence as opposed to only distances between sequences. This comes at the cost of
computational time making them more cumbersome for large data sets but generally more
accurate. The three character-based methods are maximum parsimony, maximum likelihood
and Bayesian inference phylogenetics.
The maximum parsimony method is a method that tries to minimise the “cost” of mutational
events to explain a given data set. Compared to the other character-based methods it tries to
fit data unbiased of any evolutionary model. The maximum parsimony method is therefore
not a true phylogenetic method but rather a cladistic method since it does not build trees on
the basis of an evolutionary theory. A drawback with the maximum parsimony method is
that it only uses informative sites, i.e. sites where at least two different characters are present
in at least two sequences per character.
Choose sequences Phylogenetic analysis
Multiple sequence alignment
Jens Eklöf 15
Maximum likelihood methods, on the other hand, use evolutionary models when
constructing phylogenies and therefore require the use of an appropriate model for a given
dataset. The evolutionary models for amino-acid sequences are mostly derived from
empirical data. An early pioneer was Margaret Dayhoff the maker of the PAM matrices, that
were constructed by observing differences in closely related sequences.51 Other substitution
matrices soon followed, for example the JTT matrix52,53 and the BLOSUMs54 that are
supposed to be more accurate on more distantly related proteins.
The BLOSUMs were calculated by comparing conserved local alignments in datasets with
different sequence identities giving rise to a number of matrices suitable for datasets of
different sequence divergence. Therefore BLOSUM62 was made from a dataset with a
threshold at 62% identity, while BLOSUM42 had a threshold value of 42%. To make things
more difficult the PAM matrices have an opposite numbering with low numbers for datasets
of high similarity. There are also some special substitutions matrices tailored for
transmembrane regions55 or for chloroplast proteins,56 just to mention a few.
The Bayesian inference of phylogeny is also based on likelihood function but finds a suitable
tree in a different way using Bayes’s theorem. The Bayes’s theorem is a method used in
statistics that describes how one can update beliefs about a hypothesis in the light of new
data. This requires an a priori guess (hypothesis) about the tree you want to find. Fortunately
an a priori guess can be that all trees are equally likely. The problem comes in assessing the
probability of all the possible trees. This has been solved numerically, using Markov Chain
Monte Carlo methods (MCMC) that work in a two step process: (1) A new tree is proposed
by stochastically altering the current tree and (2) the new tree is either accepted or rejected
with a certain probability. This process of perturbing and evaluating trees is the chain
process and the number of times a certain tree comes up in this chain is an estimate of its
posterior probability i.e. how likely it is to be the best tree (ref.57 and references therein).
Compared to maximum parsimony methods, the maximum likelihood and Bayesian
inference of phylogeny methods use all sequence information potentially making them more
accurate but on the downside they are very CPU-intensive and therefore slow. The Bayesian
inference of phylogeny methods have the advantage over maximum likelihood that it
16 A holistic approach to understanding CAZy families through reductionist methods
calculates the probability of a clade as a part of algorithm making statistical sampling such as
bootstrapping redundant.
Once the appropriate method has been chosen for a given data set the choice left is what
program to use. For the distance-based methods a program called MEGA58,59 is suitable. It
is both simple to use and has a nice tree-drawing function as a starting point for figure
making. MEGA can also calculate maximum parsimony cladograms. Other programs with
packages of methods like MEGA are PHYLIP60 and PAUP*
(http://paup.csit.fsu.edu/index.html). PHYLIP and PAUP* can also do maximum
likelihood phylogenies. The trees presented in this thesis are all built using Phyml61, using
maximum likelihood calculations. For Bayesian phylogenetic inference the most popular
program is MrBayes3.62,63 Most of these programs are available as freeware to be downloaded
and run locally or as web-based services that can be run online. PAUP* is the only program
that needs to be purchased.
Jens Eklöf 17
3 The Carbohydrate‐Active enZymes database, CAZy The diversity in carbohydrates far exceeds any other biopolymer. In nature over 30 different
sugars64 have been found and these can be further modified by the non-carbohydrates such
as sulphate, phosphate and acyl esters. Only assuming unmodified carbohydrates the number
of possible hexasaccharide isomers from D-hexoses reach a staggering 1012. Therefore the
enzymes responsible for their biosynthesis, modification and degradation face a daunting
task. The work of classifying the carbohydrate active enzymes began around 1990 by
Bernard Henrissat, comparing different glycosyl hydrolases.65,66 The resulting database,
CAZy is an excellent example of a successful bioinformatic project and today it is an
instrumental tool for all glyco-scientists with approximately 3000 downloaded pages daily
showing the key role sequence classification has in carbohydrate research.
Bernard Henrissat began the CAZy project by ordering 291 sequences into 35 glycosyl
hydrolase families, GH1-GH35, using the unusual method, hydrophobic clustering analysis
(HCA).67 From the original classification of 35 GH families, CAZy has continuously grown
to now contain almost 40 000 ORFs of GHs ordered into a total of 112 GH families (Figure
7).68
Figure 7. The year by year growth of GH ORFs in the CAZy database. Picture from
Davies & Sinnott, 2008.68
18 A holistic approach to understanding CAZy families through reductionist methods
Following the initial initiative to classify GHs other types of carbohydrate-acting proteins
were added to CAZy. The newer groups are glycosyl transferases (GTs), polysaccharide
lyases (PLs), carbohydrate esterases (CEs) and carbohydrate binding modules (CBMs).
The power of the CAZy database lies in that the classification contains more information
than what is found in an EC number (NC-IUBMB,
http://www.chem.qmul.ac.uk/iubmb/enzyme/). The EC numbers 3.2.1.x account for all
GHs’ activities as the first three numbers indicate that an enzyme hydrolyses O or S-glycosyl
linkages. The shortcomings of EC numbers are that they only account for one reaction and
seldom reveal anything about the mechanism of the enzyme. For GHs, who often have
broad substrate specificities, a single EC number is not sufficient. Instead CAZy classifies
enzymes according to their amino acid sequence and fold. This classification enables the user
to infer protein fold, reaction mechanism and catalytic amino-acids from other family
members as well as getting hints to an enzymes activity.
3.1 Reaction mechanism of glycosyl hydrolases
The glycosidic bond, especially between two glucose residues, is the strongest bond found in
biopolymers with a calculated half-life of about 5 million years. Glycosyl hydrolases face a
daunting task trying to hydrolyse these bonds but have managed to accelerate the reaction by
an impressive 1017-fold.69 This is accomplished by several means but one of the more
important features is the catalytic residues within an enzymes active site. Since Koshland
back in 195370 first laid the basis of how the GHs accomplish hydrolysis much work has been
done, but his early theories still hold true.71 Essentially all GH families use one of two
reaction mechanisms, either the inverting mechanism or the retaining mechanism.72
The inverting mechanism is a one step mechanism leading to overall inversion of the
anomeric carbon. It uses a general base to activate a water molecule by extracting a proton
and an acid to facilitate the departure of the leaving group by protonation. The transition
Jens Eklöf 19
state is proposed to be oxocarbenium-ion-like73 (Figure 8a). The catalytic residues are usually
aspartates or glutamates and even though the distance between them have been shown to
vary considerably74 a general rule of thumb is that they are ~10 Å apart.
Figure 8. General mechanisms for inverting and retaining glycosidases. a, the inverting
mechanism, a one step mechanism leading to overall inversion of the anomeric carbon.
b, the retaining mechanism, a two step mechanism leading to overall retention of the
anomeric carbon.
The other mechanism used by glycosidases is the retaining mechanism (Figure 8b). It is a
two step mechanism involving a nucleophile and an acid-base functionality. Similar to the
inverting glycosidases, the catalytic residues are usually aspartates or glutamates but with
some variation as in clan GH-E, where a tyrosine acts as the nucleophile,75 and in GH20 and
GH84, that use substrate-assisted catalysis and where the nucleophile is an acetoamide group
from the substrate itself.76 In the retaining mechanism the first step is the attack of the
nucleophile on the C-1 carbon of a sugar ring resulting in a covalent enzyme-substrate
intermediate. The departure of the leaving group is assisted by protonation from the acid-
OO R
OO
HO O
OO
OO
O O
O
O O
H
R
O
H
H
OO
OO
OO
O O
H
H
O
OH
OO
HO O
δ−
δ+
δ+
δ−
δ−
δ−
O
O R
OO
HO O
HO H
O O
OO
O O
H
OH
O
OH
OHO
O OH
R
O
H
R
δ+
δ−
δ−
a
b
20 A holistic approach to understanding CAZy families through reductionist methods
base. In the second step the acid-base activates a water molecule that subsequently attacks
the C-1 carbon releasing the substrate with retention of the anomeric carbon and restoring
the active site. For a more in depth discussion on the mechanistic properties of glycosyl
hydrolases the reader is referred to a recent review by Vocadlo and Davies.71
3.2 Glycosyl Hydrolase family 16
The glycosyl hydrolase family 16 is an interesting GH family using the retaining mechanism
to cleave glycosidic bonds. It is interesting for several reasons, for one it has a wide spectrum
of enzyme specificities. The different enzymes of GH16 cleave β-1,4, β-1,3 and β-1,3-β-1,4
linkages in various glucans and galactans. Even more interesting, GH16 does not contain
only hydrolytic enzymes. Some enzymes in GH16 have evolved into strict transglycosidases.
While transglycosylation can often be forced upon an enzyme, especially enzymes using the
retaining mechanism with a glycosyl intermediate, most members of the xth gene family, a
subgroup of GH16 seem to be strict xyloglucan endo-transglycosidases77 (and paper II)
without any hydrolytic activity. There are also indications of other transglycosylating enzymes
in GH16. These indications are from a group of fungal enzymes called Crh1 and Crh2
proposed to covalently link chitin with β-1,3 branches of β-1,6 glucans,78,79 suggesting that
the GH16 scaffold might be well suited for constructing strict transglycosidases with novel
substrate specificities.
GH16 shares its β jelly roll fold with the related glycosyl hydrolase family 7. These two
glycosyl hydrolase families share a common ancestry and are ordered into clan-B of the
CAZy classification. The enzymes of GH7 mainly act on cellulose, while GH16 enzymes
have, as stated previously, a wider substrate specificity. Phylogenetic analysis of GH16 has
shown that these different substrate specificities can be grouped into separate, distinct clades
(Figure 9).80,81
Jens Eklöf 21
Figure 9. A tree representation of Clan B, showing its evolution81 with structural
representatives for the different substrate specificities.
3.2.1 Xyloglucan endo‐transglycosidases
The xyloglucan endo-transglycosidases (XETs) are exclusively found within the kingdom of
plantae, where they act on xyloglucan, a heavily substituted type of glucan with a β-1,4
glycosyl backbone. The smallest unit of xyloglucan is called a xylogluco-oligosaccharide
(XGO). The XGO is commonly Glc4-based with xylose attached to the three first glucoses
beginning from the non-reducing end via α-1,6 linkages (Figure 10). The xyloses can then be
further substituted in a multitude of ways. To simplify the nomenclature Fry et al. devised a
coding system for naming different sidechains back in 1993.82 An unsubstituted glucosyl
residue is named G, if substituted with xylose it is called X. The last relevant substitution
within the scope of this thesis is when the xylosyl residue is further substituted with a
β-agarases
κ-carrageenases
cellulases
Common ancestorActive site:β-bulge
XETs
lichenases
laminarinases
(1,4-β transglycosidases / hydrolases)
(1,3-β glucanases)
(1,3-1,4-β glucanases)
GH 7
GH 16
Subgroup 2Active site: regular β-strand
Subgroup 1Active site: β-bulge
(1,4-β glucanases)
Clan B
(1,4-β galactanases)
22 A holistic approach to understanding CAZy families through reductionist methods
galactosyl residue via a β-1,2 bond denominated L (Figure 10). The L can then be further
substituted by more galactosyl, fucosyl or arabinosyl residues depending on species and
tissue.
Figure 10. A xylogluco-oligosaccharide, XXLG, with the non-reducing end to the left
and the reducing end to the right.
Xyloglucan is the main hemicellulose in the primary cell wall of dicotelydons and non-
commelinoid monocotelydons where it crosslinks cellulose microfibrils.83-85 It is also present
in other plants species but to a lesser extent and in for example the commelinoid
monocotyledons the crosslinking function is instead mainly held by glucoronoarabinoxylan
(GAX) and β-1,3-1,4 glucans.86-88 In some species xyloglucan is also used as a storage
polysaccharide89,90 and is produced in large scale from tamarind seeds (Tamarindus indica).
3.2.2 XETs physiological role in plant cell walls
XETs are as xyloglucan, mainly found in the primary cell wall of plants. This thin cell wall of
growing or fleshy tissues is made up of strength-bearing thin cellulose microfibrils in a matrix
of crosslinking glucans, pectins and some structural proteins (ref 85 and references therein).
The primary cell wall serves many purpose such as defining cell shape, being the load bearing
entity, act as a first defence against plant pathogens, allowing passage of small molecules etc.
As this is the cell wall of growing cells it needs to allow the cells to expand. Some of the
cross-linking glucans functions are to prevent the cellulose microfibrils from aggregating, and
to make the cell wall resilient, yet sufficiently strong to maintain the cell wall intact.
O
HOOH
O
HOOH
O
O
HOOH
O
HOOH
OO
OHOHO
OOH
OH
O
HOOH
O
O
O
O
HO
HO
OH
O
HO
HO
HO
O
OH
OHH
n
X GLX
Jens Eklöf 23
Plant cells grow either by tip growth or by a more general diffuse growth where the cell
grows in all directions. The diffuse growth of plant cells is governed by an increase in turgor
pressure accompanied by a decrease in apoplastic pH and can be described as a controlled
polymer creep.85 The XETs act in the diffuse growth by transiently weakening the
interactions between cellulose microfibrils, allowing them to slide with respect to each other
as the cell wall expands. At the same time new cell wall material is synthesised and deposited
into the growing wall to maintain its thickness. XETs accomplish this transient weakening
by a cleaving and religation process. They cleave a xyloglucan chain but instead of
hydrolysing the glycosyl enzyme intermediate with water, the stable intermediate can only be
broken down by a new xyloglucan chain resulting in a new crosslink. It has been proposed,
and shown that at least some XETs can use other acceptors than xyloglucan but at reduced
rates91 (and paper II). The biological importance of this activity is however not clear and
might be irrelevant due to the rate of these reactions.
Plants have several xth genes in their genomes and at this early stage in plant genomics, when
only a few plants genomes have been published, it seems that plants regardless of origin have
20-50 xth genes in their genomes: Thale cress (Arabidopsis thaliana) has 33 xth genes,92 the
first monocot genome from rice (Oryza sativa) has 29 xth genes93 and the first tree genome
from black cottonwood (Populus trichocarpa) has 41 xth genes.94 The differential expression of
the Arabidopsis xth genes have been investigated in detail using GUS fusions showing their
spatial and temporal expression patterns.95
The first XET protein structure solved was a XET from Populus tremula×tremuloides,
(PttXET16-34, PDB ID 1UMZ) in 2002 by Johansson et al.77 Compared to other GH16
enzymes XETs have a C-terminal extension that elongates the acceptor side of the active site
cleft. A ligand structure with a XGO bound in the positive sub-sites of PttXET16-34
showed that the elongation was necessary to accommodate a full XGO.77 The second
structure of an xth gene product was the mainly hydrolytic xyloglucan endo-hydrolase (XEH),
TmNXG1 (paper II) from the ornamental plant nasturtium (Tropaeolum majus). This is the
first and so far only xth gene product that has been shown to have hydrolytic activity.96-98
Earlier phylogenetic analysis on xth gene products divided the xth gene product family into
three different subgroups, I, II and III with TmNXG1 in subgroup III.99,100 This grouping
24 A holistic approach to understanding CAZy families through reductionist methods
has recently been revised merging group I and II and splitting group III into IIIA and IIIB
with the XEH activity restricted to the sub group IIIA. (paper II)
3.3 Carbohydrate Esterase family 8
Carbohydrate esterases generally hydrolyse an ester into an acid and an alcohol. The alcohol
group can be contributed by the sugar as in the C-2, C-3 acetyl groups of xylan or the alcohol
can be exchanged for an amine as the N-acetyl groups of chitin.101 In carbohydrate esterase
family 8, CE 8, of the CAZy database all characterised enzymes work on one substrate. This
is the plant polysaccharide homogalacturonan, where the acid group of the ester originates
from the sugar. These enzymes are called pectin methylesterases (PMEs, EC nr. 3.1.1.11).
The members of CE8 can be divided into groups depending on their kingdom of origin e.g.
plantae, eubacteria or fungi (Figure 11, paper I and ref.102). The bacterial and fungal proteins
of CE8 are either from plant cell wall-degrading organisms or from plant pathogens. The
plant enzymes are either responsible for building/remodelling the cell wall or for degrading it
as in fruit ripening.103 To this day no proteins of archeal origin have been found belonging to
carbohydrate esterase family 8.
Jens Eklöf 25
Figure 11. Maximal likelihood tree showing the tree topology of all CE8 entries in CAZy.
The plant clade names are from Markovič and Janeček.102 The tree is a version of the tree
presented in paper I.
3.3.1 Pectin methylesterases role in plant cell walls and pectin degradation
Pectin is a collective name for a diverse collection of plant polysaccharides that are found in
the primary cell wall and in the middle lamella.104 The common denominator for pectins is
that they are extractable from the cell wall with hot water. Pectin contains four major
polysaccharide domains called homogalacturonan, HGA, rhamnogalcturonan-I, RGI,
rhamnogalcturonan-II, RGII and xylogalacturonan, XGA. While RGI and RGII are
complex polysaccharide with extensive branching HGA is an unsubstituted homopolymer
consisting of α-1,4-galacturonyl residues.105 HGA can be modified with acetyl groups at the
C-2 and the C-3 position. XGA has a structure similar to HGA but is supposed to be closely
connected to RGI and it is branched with single xylosyl residue, β-1,3 linked to the
galacturonan backbone. Exactly how these different pectin polysaccharides are connected to
each other is not known and an old consensus that HGA, RGI and RGII composed the
Bacterial clade
Fungal clade
Plant clades I-IV & X1Plant clade X2
26 A holistic approach to understanding CAZy families through reductionist methods
backbone of pectin was challenged in 2003 by Vincken who argued that HGA was in fact a
very long sidechain of RGI.106
Pectin is synthesised in the Golgi and here HGA becomes heavily methylesterified before it
is released into the cell wall.107 The properties of HGA are profoundly dependent on
postsecretory changes of its structure due to enzymes in the cell wall. At least three different
types of enzymes act on homogalcturonan in muro. These are the PMEs, the
polygalacturonases, PGs, that hydrolyse the α-1,4 bond between two galacturonyl residues
and the pectate lyases, PLs, that break α-1,4 bonds as PGs but through an elimination
reaction yielding a non-reducing end made up by a 4-deoxy-α-D-galact-4-enuronosyl.
PMEs expose the carboxylate of galacturonic acid by releasing protons and methanol,
thereby making HGA negatively charged. The action pattern of most plant PMEs is believed
to make contiguous stretches of free carboxyl groups i.e. work in a block-wise fashion.108
The negatively charged groups of different pectin chains can then be crosslinked via Ca2+
ions through “egg boxes” (Figure 12).109,110 This stiffens111 the cell wall restricting cell growth,
or glues112 cells together as in the middle lamella. The action of PMEs can also weaken the
cell wall. A random de-esterification pattern does not promote “egg box” formation and
instead the reduction in apoplastic pH due to the released protons, inhibits PME activity and
activates pectin degrading enzymes such as polygalacturonases and pectate and pectin lyases.
In addition many pectinolytic enzymes are more efficient in cleaving demethylated HGA
(Figure 12).113
Jens Eklöf 27
Figure 12. The action of PMEs in cell walls. Random de-esterification promotes pectin
degradation while block-wise de-esterification promotes cell wall stiffening.
The action pattern of fungal PMEs is different from plant PMEs (even though no plant PME
from the fungal-like plant clade X2 (Figure 11), has been characterised) as they have been
shown to work in a random fashion.114 A random demethylation pattern prevents the
formation of Ca2+ egg boxes and promotes the degradation by pectinases. Even though
bacterial PMEs have been proposed to work like fungal PMEs, crystallographic data of the
PmeA from E. chrysanthemi suggests that at least this bacterial PME works in a processive
manner preferring esterified galacturonyl residues at the -1 subsite and a free carboxyl group
at the +1 subsite.115 It is possible however that different bacterial enzymes work in different
fashions, either in a block-wise, random or mixed mode way.
+H+
pH↓
PG
PL
pectin degradation
PME
pH↑wall loosening/cell separation
cell growth
charge dilution
Ca2+
wall stiffening
-
-
-
-
-
-
-- -
-
-
-
-
--
-
de-esterified pectinesterified pectin
-
- +
+
28 A holistic approach to understanding CAZy families through reductionist methods
Most plant PMEs have an extra N-terminal domain. This domain is called the pro-domain
and shares sequence similarity with pectin methylesterase inhibitors (PMEIs) found in
plants.116 The function of the pro domain is unknown but it has been proposed to be a
folding chaperone or to act as a PMEI during transport to the cell wall where it is cleaved off
since PMEs found in the cell wall only contain the catalytic domain. Whether the final
destination of the pro-domain is the cytosol or the cell wall is unknown but it is likely cleaved
off at a conserved di-basic KEX2-like protease site situated in a linker-region between the
two domains.102
Plant PMEs can be further divided by their pI. While most plant PMEs isoforms have a
basic pI, two thirds of the PMEs in Arabidopsis have a pI over 8, only a few have a pI lower
than six. The pI of plant PMEs has been proposed to have important effects on the mode of
action of the enzymes and while acidic plant PMEs can be extracted with water from plant
cell walls basic isoforms need harsher conditions for extraction.117 Since the negative charge
of the cell wall is mostly due to homogalacturonan, the substrate for PMEs, one can
hypothesise that basic PMEs might have additional binding sites for homogalacturonan
aiding them in making egg box structures and that this could be the reason for their block-
wise action pattern.
3.3.2 The pectin methylesterases
The pectin methylesterases of CE 8 all act on the C-6 methylesterified galacturonyl residues
of homogalacturonan producing methanol, galacturonyl residues and protons. The
mechanism is not that of the common catalytic triad118 used by many esterases (Asp, His and
Ser), instead it resembles that of the aspartic acid proteases with two carboxylic acids acting
as a nucleophile and an acid-base respectively. The details of the PME mechanism have been
debated in the past but recent work by Fries et al., 2007 strongly suggests that the mechanism
goes through a covalent enzyme-substrate intermediate. It begins with a nucleophilic attack
by Asp199 on the C-6 carbonyl carbon of a galacturonyl residue (1 Figure 13, Erwinia
chrysanthemi numbering, pdb ID 2NSP). A tetrahedral intermediate is formed with a negative
charge on the carbonyl oxygen. This negative charge is stabilised by the anion hole made up
by Gln177 and Asp178 (2, Figure 13). The first leaving group, methoxide is released aided by
Jens Eklöf 29
protonation from the acid-base, Asp178 (3, Figure 13). Asp178 sits in a hydrophobic
environment making the deprotonated state unfavourable. It therefore acts as a base
activating water to break the anhydride again creating a negative charge on the carbonyl
carbon. The tetrahedral intermediate breaks, restoring the active site and releasing a
demethylated galacturonyl residue (4-6, Figure 13).
Figure 13. The proposed mechanism of CE8 pectin methylesterases.115
Pectin methylesterases have their right handed β-helix structure in common with other
pectinolytic enzymes such as pectate and pectin lyases, polygalacturonases and
O
O
Asp
H
O O
N N
H
H
NH H
HO
N
H
H
O
R2
R1
HOOH
O O
O
O
H
O OO
N
H
H
O
R2
R1
HOOH
O O
Asp
Asp
Gln
Arg
Asp
Gln
O
O
H
O OO
N
H
H
O
R2
R1
HOOH
O O
Asp
AspGln
O
O
O OO
N
H
H
O
R2
R1
HOOH
O
Asp
AspGln -CH3OH
+H2O
O
H H
O
O
O OO
N
H
H
O
R2
R1
HOOH
O
Asp
AspGln
O
HH
O
O
O OO
NH
H
O
R2
R1
HOOH
O
Asp
Asp
Gln
OH
H
1 2
4 3
5 6
30 A holistic approach to understanding CAZy families through reductionist methods
rhamnogalacturonases.119,120 In CAZy, there are three PME structures belonging to CE8 (not
counting the structure in paper I). Two are of plant origin, one from carrot121 and one from
tomato in complex with a pectin methylesterase inhibitor (PMEI).122 The third is from the
bacterial plant pathogen E. chrysanthemi.120 One striking difference between the plant and
bacterial enzymes is that in the bacterial PmeA from E. chrysanthemi, elongated loops line both
sides of the active site cleft (Figure 14).
Figure 14. a; Superpositioning of PmeA, silver, from E. chrysanthemi and a PME from
carrot, Daucus carota, gold, (PDB ID 1QJV and 1GQ8 respectively). b; Structure of
PME1 from tomato, Solanum lycopersicum green, in complex with a pectin methylesterase
inhibitor from kiwi, Actinidia chinensis, beige (PDB ID 1XG2). The catalytic residues are
represented in sticks.
These elongated loops of the bacterial PMEs could be an adaptation to avoid detection by
plant pectin methylesterase inhibitors (PMEIs). Plants secrete PMEIs into their cell walls to
regulate the activity of their own PMEs in muro. Plant PMEIs have been shown to inhibit
most plant PMEs but not fungal or bacterial PMEs.123,124 When the complex structure of a
plant PME and an inhibitor was solved in 2004122 the mechanism behind the inhibition
became clear (Figure 14b). The inhibitor binds to the active site cleft, mainly via polar
interactions and the reason why bacterial enzymes are not inhibited is most likely because of
a b
Jens Eklöf 31
the extended loops lining the active site cleft (Figure 14a) making an interaction impossible.
Loop extension is a common feature in the arms race between plants and pathogens to avoid
detection by inhibitors.125
32 A holistic approach to understanding CAZy families through reductionist methods
Present investigation
Jens Eklöf 33
4.1 Paper I: A phylogenetic analysis of CE8 locates a new bacterial sub‐
clade and the structural determination of E. coli YbhC. The phylogenetic analysis of the full carbohydrate esterase family 8 revealed a new bacterial
sub-clade previously not noted by others working on CE8.102,126 This new sub-clade
contained outer membrane lipoproteins exclusively from gram-negative bacteria (YbhC and
PmeB sub-clades, Figure 15). In comparison with the other pectin methylesterases of CE8
the YbhC sub-clade had some interesting sequence features that caught our attention. Apart
from major differences in loop regions several important CE8 conserved amino-acid were
different including the acid-base in the proposed mechanism115 of PMEs.
Figure 15. A maximum likelihood phylogram of the sequences present in the publicly
available CE8 in November 2007. The plant clades are named according to Markovič and
Janeček.102
While PME activity most likely needs an acid-base for aiding the departure of the methoxide
group the previously indicated thioesterase activity127 does not necessarily need assistance of
an acid-base for leaving group departure.128 This activity on palmitoyl-CoA could however
not be reproduced nor could any general esterase activity (on acyl pNPs of various lengths)
Bacterial clade
Fungal clade
Plant clades I-IV & X1
Plant clade X2
YbhC sub-clade
PmeB sub-clade
34 A holistic approach to understanding CAZy families through reductionist methods
or PME activity be demonstrated. As closely related to the experimentally determined PME
PmeB129 it was also tested for pectin binding but to no avail.
Figure 16. a, the structure of E. coli YbhC. b, Surface representation of YbhC (carbon,
yellow; nitrogen, blue, oxygen, red).
The proteins of the YbhC sub-clade all reside alone in their operons and gave no clue as to
their function. Instead clues to a putative function in the biosynthesis of murein or cell
division came from clustering transctriptomic data (http://genexpdb.ou.edu) where the top
hits were all involved in murein synthesis and turnover. The crystal structure of YbhC also
gives some clues to the characteristics of a potential substrate. While PMEs have a cleft-like
active site YbhC has one end of the cleft cut off by a long blocking loop and a large insert
(Figure 16). The closed end of the cleft is very hydrophobic with mainly aliphatic side chains
while the open end looks more like a PME indicating that a substrate should have a
hydrophobic part and a hydrophilic part (Figure 16b). Such substrates present in the
periplasm could be lipids, lipoproteins or some quorum sensing molecules to mention but a
few. Whatever the function of YbhCs might be, the crystal structure of the E.coli YbhC lays
the basis for further characterisation of these proteins.
Insert Insert
Blocking loop Blocking loopa b
Jens Eklöf 35
4.2 Paper II: Investigation of the GH16 xth gene family clarifies the
determinants for transglycosylation versus hydrolysis within the family
by exploring the new 3D structure of TmNXG1 and restricts the
hydrolytic activity to a specific sub‐clade.
It has been known for a long time that two different activities exist within the xth gene
family. The hydrolytic activity96 was discovered a long time ago but it is still confined to a
single member, namely TmNXG1 while all other characterised members have been
transglycosidases, XETs (29 according to CAZy). The underlying mechanisms for
transglycosylation versus hydrolysis are interesting both from a fundamental research
perspective as from a biotechnology perspective. Therefore the previous phylogenetic work
done on the xth gene product family was revisited. Our analysis included more sequences
than previous work and we could conclude that the old grouping was misleading. The old
family I and II were merged in our tree and the old family III was clearly divided in two sub-
clades, IIIA and IIIB (Figure 17). This revised grouping was supported by high bootstrap
values but also by enzyme characteristics. While all characterised members of group I and II
had been shown to be XETs, family III had conflicting activities with XETs in IIIB and the
xyloglucan endo-hydrolase (XEH), TmNXG1 in group IIIA.
36 A holistic approach to understanding CAZy families through reductionist methods
Figure 17. Unrooted phylogenetic tree of ca. 130 full length xth gene products and Bacillus
licheniformis lichenase (1GBG, GenPept CAA40547). Bootstrap values from 100
Maximum Likelihood resamplings are indicated.
To elucidate why TmNXG1 was hydrolytic, after the earlier success of the lab with
PttXET16-34,77 a group I-II member, TmNXG1 was also produced and crystallised. A loop
region conserved in family IIIB looked to be a promising target for mutagenesis and
therefore a chimera was made, using TmNXG1 as the scaffold and the loop from PttXET16-
34. In order to characterise these proteins in detail a new HPLC-based assay was developed
that could measure both hydrolysis and transglycosylation simultaneously.
Interestingly the chimera TmNXG1ΔYNIIG showed intermediate characteristics with a
transglycosylation rate in between that of TmNXG1 and PttXET16-34 and a hydrolytic rate
PD
B 1
gb
g (
lich
enas
e)A
t-X
TH
11A
t-X
TH
3A
t-X
TH
1A
t-X
TH
2P
t-X
TH
15
Os-
XTH
11O
s-X
TH10
SI-
XTH
11S
I-XTH
10P
t-XTH
10S
I-XTH
9Sl
-XTH
17
Pt-X
TH12
Pt-X
TH33
Pt-X
TH28
Pt-XTH
13Pt-X
TH24
At-XTH15
At-XTH16
Ptt-XTH1614
Pt-XTH14
Ptt-XTH1621
Pt-XTH21
SI-XTH13
At-XTH25
Pt-XTH19
Pt-XTH2
Pt-XTH37
Pt-XTH18
Pt-XTH17
At-XTH21
Os-XTH12
Pt-XTH11
At-XTH22
At-XTH23
At-XTH24At-XTH20At-XTH17At-XTH18At-XTH19SI-XTH2Ptt-XTH166Pt-XTH6
At-XTH14At-XTH12At-XTH13
Pt-XTH20Os-XTH9
Os-XTH8Os-XTH7
Os-XTH6Os-XTH5.2
Os-XTH5
Os-XTH4
Os-XTH16O
s-XTH17
Os-XTH
18
Os-X
TH14
Os-X
TH13
Os-X
TH15
At-X
TH26
Pt-X
TH23
At-X
TH
10
Pt-X
TH
29
Os-
XTH
3A
t-X
TH
9S
l-XT
H16
Ptt-
XE
T16
35P
t-X
TH
35
Ptt-
XT
H16
30
Pt-
XT
H30
SI-
XT
H7
At-
XT
H7
At-
XT
H6
Ptt-
XT
H16
36
Pt-
XT
H36
Os-
XTH
1
SI-
XTH
23Sl-X
TH15Pt-X
TH16
Pt-X
TH25
At-X
TH8
Os-
XTH
2
SI-X
TH4
SI-X
TH3
SI-XTH
1
Ptt-XTH16
27
Pt-XTH27
Pt-XTH34
Ptt-XET1634
Ptt-XTH1626
Pt-XTH26
At-XTH5
At-XTH4
Pt-XTH38Ptt-X
TH1638At-XTH33Pt-XTH22
Os-XTH23Os-XTH24Os-XTH25
Os-XTH26SI-XTH5
Ptt-XTH163
Pt-XTH3
Pt-XTH41
At-XTH30
At-XTH29
SI-XTH8
Pt-XTH40
Ptt-XTH1639
Pt-XTH39
At-XTH27
At-XTH28
Os-XTH27
Os-XTH28
Os-XTH29Lc-XET1
C. papaya
Pt-XTH7
Vv-XET1
Tm-NXG1
At-XTH31SI-XTH6
At-XTH32Ptt-XTH
1632Pt-XTH
32Pt-XTH
31
Os-X
TH22
Os-X
TH19
Os-X
TH20
Os-X
TH21
5945
993493
100
78
14
4
34
97
37
97
161
6285
51
76
26
7 9863
68
72
100
100
100
100
100100100
98
31
97
50
60
49
18
17100
100
100
100
100
100
100
100
100
100
100
100
100
100
100100
100
100
100
100
100
100
5985
4644
13
19
98
84
35
40
87
95
83
8195
95
99
86
91
96
98
100
100
100
100
56
70
3536
17
33
70
5397
97
94
7596
50
34
48
82
67
98
98
97
52
54
76
66
73
94
9891
94
100
100
100
100
100
100
97
17
5632
1427
18
Gro
up II
I-A
Group III-B
AncestralGroup
Group I/I
I
Jens Eklöf 37
close to 6 times lower than that of the parental scaffold TmNXG1 under saturating substrate
conditions (Figure 18).
Figure 18. Initial rate kinetics of TmNXG1 (A), TmNXG1-ΔYNIIG (B), and PttXET16-
34 (C) as a function of XGO2 concentration. Closed circles, rate of XGO3 production
due to transglycosylation (2 XGO2 → XGO3 + XGO1); open squares, total rate of XGO1
production; closed squares, corrected rate of XGO1 production obtained by subtracting
contribution of XGO1 release due to substrate transglycosylation. This observed rate is
twice the actual catalytic rate, according to the stoichiometry of the hydrolysis reaction
(XGO2 → 2 XGO1).
The presence of xyloglucan endo-hydrolases (group IIIB members) in plants is not only for
breaking down xyloglucan in germinating seeds. In fact only a few species uses xyloglucan as
a storage polysaccharide, instead it seems that group IIIB members are expressed in fast
growing tissues. The Arabidopsis group IIIB members XTH31 and XTH32 are expressed in
the root elongation zone and in the shoot apex respectively,95 both being fast
growing/dividing tissues.
C
[Glc8 based XGO2] (mM)
TmNXG1-DYNIIG
0 500 1000 1500 2000 2500 3000
0
1
2
3
v 0 / [E
0], (1
/min
)
PttXET16A
TmNXG1
[Glc8 based XGO2] (mM)
A
B
0 500 1000 1500 2000 2500 3000
0
2
4
6
8
[Glc8 based XGO2] (mM)
0 500 1000 1500 2000 2500 30000
1
2
3
v 0 / [E
0], (1
/min
)v 0 /
[E0],
(1/m
in)
38 A holistic approach to understanding CAZy families through reductionist methods
Concluding remarks
The work presented in this thesis has lead to a better understanding of the determinants for
different activities in glycosyl hydrolase family16 and carbohydrate esterase family 8 as well as
shedding some light on the evolution of these two families. CAZy contains hundreds of
families, and while the methods applied in this thesis are applicable to all of them, the
predictive power is probably greater in some families, especially those working on polymeric
substrates.
The comparison of the phylogenetic analysis of CE8 and GH16 in this work with previous
analyses done by others, clearly shows that incorporation of more sequences into an analysis
improves the quality of the outcome. Selecting only a few sequences without knowledge of
how they are related can lead to false conclusions such as the old family grouping of the xth
gene product family discussed in paper II.
While the speed for genome sequencing has increased, the annotation of a genome is
challenging and time consuming. In paper I the previous misannotation of YbhC as a PME,
is an example of the challenges posed on electronic annotation. To continue the functional
annotation of genes is important and the methodology used in this thesis can, and has
predicted which proteins are interesting to study from a functional divergence perspective.
Jens Eklöf 39
Acknowledgement
It all began with dissected radios on the living room table and with small experiments such as
spraying cold water on hot light bulbs. Yes your right, they do explode, but who would
know if no one tried? I think dad thought Emil i Lönneberga was a far too kind a nickname
and probably swore behind grinning teeth...
My inspirations to the exploration of nature, came as a results of walks in the woods near our
summer house with my grandmother, and from the digging for flint tools in the fields of my
aunts summerhouse in Bohuslän. A less obvious source of inspiration came from that
elusive uncle of mine, that was always somewhere else, weighing the largest southern sea
elephant ever close to Antarctica (think he was called Stalin, a bad boy), or diving with sperm
whales around Galapagos, or hunting down wolfs in Canada. Once home he taught me
about plants and birds and took me on excursions. On one of these trips, I and my sister
saw a jumping sperm whale, but no one believed us. “They’re not supposed to jump around
Lofoten” someone said… An early sign of global warming perhaps?
Under all these years I never said thanks. So here it comes, grandma thanks, where ever you
are and Tom, thanks for everything, you are the reason I ended up here working with great
enthusiasm on something that almost no one else in the world cares about…
On a sunny and warm early summer day I stepped into Tuula’s office and sat down with her
and my future mentor in the lab, Martin Baumann. A lot of laughs and Friday beers later my
diploma work ended and my PhD with Harry Brumer began.
Hooked on plants, I’m still here and I would like to take the opportunity to thank all the
people in the lab for their support, and the good times we have had. Especially the people in
my group, past and present, Farid, Fredrika, Johan, Maria, Martin, Niklas and Nomchit, one
could not wish for better team mates. A special thanks goes to the people in my room Erik,
Felicia, Gustav, Johanna for making work a lot easier, maybe not always more efficient, but
hey it’s all good, and finally to Vincent the biggest fish we’ve got.
40 A holistic approach to understanding CAZy families through reductionist methods
Did you really think I had forgotten you Harry? Of course I hadn’t. I’m not only thanking
you for work related help but also for interesting conversations and for putting up with me...
Finally, thank you friends and family for your support, someday, hopefully I can repay you.
Jens Eklöf 41
References
1. Vickery HB. The origin of the word protein. Yale J Biol Med, 1950(22):387-393. 2. Berzelius JJ. Brevväxling mellan Berzelius och G.J. Mulder (1834-1837). Bref Utgifna
af Kungl Svenska Vetenskapsakademien genom HG Söderbaum. Uppsala; 1916. 3. Ragauskas AJ, Williams CK, Davison BH, Britovsek G, Cairney J, Eckert CA,
Frederick WJ, Hallett JP, Leak DJ, Liotta CL, Mielenz JR, Murphy R, Templer R, Tschaplinski T. The path forward for biofuels and biomaterials. Science, 2006;311(5760):484-489.
4. Wong G. Biotech scientists bank on big pharma's biologics push. Nat Biotech, 2009;27(3):293-295.
5. Crick F. Central Dogma of Molecular Biology. Nature, 1970;227(5258):561-563. 6. Crick FHC. Symp. Soc. Exp. Biol., The Biological Replication of Macromolecules.
1958. 7. Gilbert W. Origin of life: The RNA world. Nature, 1986;319(6055):618-618. 8. Hardy J, Selkoe DJ. Medicine - The amyloid hypothesis of Alzheimer's disease:
Progress and problems on the road to therapeutics. Science, 2002;297(5580):353-356. 9. Lauren J, Gimbel DA, Nygaard HB, Gilbert JW, Strittmatter SM. Cellular prion
protein mediates impairment of synaptic plasticity by amyloid-β oligomers. 2009;457(7233):1128-1132.
10. Prusiner SB. Prions causing degenerative neurological diseases. Annu Rev Med, 1987;38:381-398.
11. Wrinch DM, Jeffreys H. On Some Aspects of the Theory of Probability. Philosophical Magazine, 1919;38:715-731.
12. Fiebig KM, Dill KA. Protein core assembly processes. J Chem Phys, 1993;98(4):3475-3487.
13. Kauzmann W, C.B. Anfinsen, Anson MLJ, Bailey K, John TE. Some Factors in the Interpretation of Protein Denaturation. Advances in Protein Chemistry. 14: Academic Press; 1959. p 1-63.
14. Ivarsson Y, Travaglini-Allocatelli C, Brunori M, Gianni S. Mechanisms of protein folding. Eur Biophys J Biophy, 2008;37(6):721-728.
15. Mok KH, Kuhn LT, Goez M, Day IJ, Lin JC, Andersen NH, Hore PJ. A pre-existing hydrophobic collapse in the unfolded state of an ultrafast folding protein. Nature, 2007;447(7140):106-109.
16. Dill KA, Ozkan SB, Shell MS, Weikl TR. The Protein Folding Problem. Annu Rev Biophys, 2008;37(1):289-316.
17. Udgaonkar JB. Multiple Routes and Structural Heterogeneity in Protein Folding. Annu Rev Biophys, 2008;37(1):489-510.
18. Pearl FMG, Bennett CF, Bray JE, Harrison AP, Martin N, Shepherd A, Sillitoe I, Thornton J, Orengo CA. The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res, 2003;31(1):452-455.
19. Yan Y, Moult J. Protein Family Clustering for Structural Genomics. J Mol Biol 2005;353(3):744-759.
20. Flaherty KM, McKay DB, Kabsch W, Holmes KC. Similarity of the 3-dimensional structures of actin and ATPase fragment of a 70 kDa heat-shock cognate protein. Proc Nat Acad Sci USA 1991;88(11):5041-5045.
42 A holistic approach to understanding CAZy families through reductionist methods
21. Chothia C. One thousand families for the molecular biologist. Nature, 1992;357(6379):543-544.
22. Zhang C, DeLisi C. Estimating the number of protein folds. J Mol Biol 1998;284(5):1301-1305.
23. Choi I-G, Kim S-H. Evolution of protein structural classes and protein sequence families. Proc Nat Acad Sci USA, 2006;103(38):14056-14061.
24. Minor DL, Kim PS. Context-dependent secondary structure formation of a designed protein sequence. Nature, 1996;380(6576):730-734.
25. Pagel K, Koksch B. Following polypeptide folding and assembly with conformational switches. Curr Opin Chem Biol, 2008;12(6):730-739.
26. Govindarajan S, Goldstein RA. Why are some proteins structures so common? Proc Nat Acad Sci USA 1996;93(8):3341-3345.
27. Zeldovich KB, Berezovsky IN, Shakhnovich EI. Physical Origins of Protein Superfamilies. J Mol Biol, 2006;357(4):1335-1343.
28. Goldstein RA. The structure of protein evolution and the evolution of protein structure. Curr Opin Struct Biol 2008;18(2):170-177.
29. Bloom JD, Drummond DA, Arnold FH, Wilke CO. Structural Determinants of the Rate of Protein Evolution in Yeast. Mol Biol Evol, 2006;23(9):1751-1761.
30. Pál C, Papp B, Lercher MJ. An integrated view of protein evolution. Nat Rev Genet, 2006;7(5):337-348.
31. Higgins DG, Sharp PM. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 1988;73(1):237-244.
32. Katoh K, Misawa K, Kuma K-i, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res, 2002;30(14):3059-3066.
33. Edgar R. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinf 2004;5(1):113.
34. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997;25(17):3389-3402.
35. McGuffin LJ, Bryson K, Jones DT. The PSIPRED protein structure prediction server. Bioinformatics, 2000;16(4):404-405.
36. Pei J, Grishin NV. PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics, 2007;23(7):802-808.
37. Simossis VA, Heringa J. PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res, 2005;33:W289-W294.
38. Zhou H, Zhou Y. SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics, 2005;21(18):3615-3621.
39. O'Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C. 3DCoffee: Combining Protein Sequences and Structures within Multiple Sequence Alignments. J Mol Biol, 2004;340(2):385-395.
40. Thompson JD, Plewniak F, Poch O. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 1999;15(1):87-88.
41. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: A database of protein structure alignments for homologous families. Protein Sci, 1998;7(11):2469-2471.
Jens Eklöf 43
42. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004;32(5):1792-1797.
43. Van Walle I, Lasters I, Wyns L. SABmark--a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 2005;21(7):1267-1268.
44. Notredame C, Higgins DG, Heringa J. T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000;302(1):205-217.
45. Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res, 2005;15(2):330-340.
46. Wallace IM, O'Sullivan O, Higgins DG, Notredame C. M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res, 2006;34(6):1692-1699.
47. Katoh K, Kuma K-i, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res, 2005;33(2):511-518.
48. Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform, 2008;9(4):286-298.
49. Kimura M. The Neutral Theory of Molecular Evolution: Cambridge University Press; 1985. 384 p.
50. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987;4(4):406-425.
51. Dayhoff MO, Eck EV, Park CM. A Model of Evolutionary Change in Proteins. In: Dayhoff MO, editor Atlas of Protein Sequence and Structure. Vol 5. Silver Spring, Maryland: National Biomedical Research Foundation; 1972. p pp. 89–99.
52. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data from protein sequences. CABIOS, 1992;8(3):275-282.
53. Gonnet GH, Cohen MA, Benner SA. Exhaustive matching of the entire protein-sequence database. Science, 1992;256(5062):1443-1445.
54. Henikoff S, Henikoff JG. Amino-acid substituition matrices from protein blocks. Proc Nat Acad Sci USA, 1992;89(22):10915-10919.
55. Ng PC, Henikoff JG, Henikoff S. PHAT: a transmembrane-specific substitution matrix. Bioinformatics, 2000;16(9):760-766.
56. Adachi J, Waddell PJ, Martin W, Hasegawa M. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol, 2000;50(4):348-358.
57. Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP. Evolution - Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 2001;294(5550):2310-2314.
58. Kumar S, Tamura K, Nei M. MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Brief Bioinform, 2004;5(2):150-163.
59. Tamura K, Dudley J, Nei M, Kumar S. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) Software Version 4.0. Mol Biol Evol, 2007;24(8):1596-1599.
60. Felsenstein J, Churchill GA. A hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol, 1996;13(1):93-104.
61. Guindon S, Gascuel O. A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood. Syst Biol, 2003;52(5):696 - 704.
62. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 2001;17(8):754-755.
44 A holistic approach to understanding CAZy families through reductionist methods
63. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 2003;19(12):1572-1574.
64. Marth JD. A unified vision of the building blocks of life. Nat Cell Biol, 2008;10(9):1015-1016.
65. Henrissat B, Claeyssens M, Tomme P, Lemesle L, Mornon JP. Cellulase families revealed by hydrophobic cluster-analysis. Gene, 1989;81(1):83-95.
66. Henrissat B. A classification of Glycosyl hydrolases based on amino-acid sequence similarities. Biochem J 1991;280:309-316.
67. Gaboriaud C, Bissery V, Benchetrit T, Mornon JP. Hydrophobic cluster analysis: An efficient new way to compare and analyse amino acid sequences. FEBS Lett, 1987;224(1):149-155.
68. Davies GJ, Sinnott ML. Sorting the diverse:the sequence-based classification of carbohydrate-active enzymes. Biochem J, 2008;DOI 10.1042/BJ20080382 (online only).
69. Wolfenden R, Lu X, Young G. Spontaneous Hydrolysis of Glycosides. J Am Chem Soc 1998;120(27):6814-6815.
70. Koshland JDE. Stereochemistry and the mechanism of enzymatic reactions. Biol Rev Camb Philos Soc, 1953;28(4):416-436.
71. Vocadlo DJ, Davies GJ. Mechanistic insights into glycosidase chemistry. Curr Opin Chem Biol 2008;12(5):539-555.
72. Rye CS, Withers SG. Glycosidase mechanisms. Curr Opin Chem Biol, 2000;4(5):573-580.
73. Vocadlo DJ, Davies GJ, Laine R, Withers SG. Catalysis by hen egg-white lysozyme proceeds via a covalent intermediate. Nature, 2001;412(6849):835-838.
74. Zechel DL, Withers SG. Glycosidase Mechanisms: Anatomy of a Finely Tuned Catalyst. Acc Chem Res 2000;33(1):11-18.
75. Watts AG, Damager I, Amaya ML, Buschiazzo A, Alzari P, Frasch AC, Withers SG. Trypanosoma cruzi Trans-sialidase Operates through a Covalent Sialyl-Enzyme Intermediate: Tyrosine Is the Catalytic Nucleophile. J Am Chem Soc, 2003;125(25):7532-7533.
76. Macauley MS, Whitworth GE, Debowski AW, Chin D, Vocadlo DJ. O-GlcNAcase Uses Substrate-assisted Catalysis: Kinetic analysisand development of highly selective mechanism-inspired inhibitors. J Biol Chem, 2005;280(27):25313-25322.
77. Johansson P, Brumer H, Baumann MJ, Kallas AM, Henriksson H, Denman SE, Teeri TT, Jones TA. Crystal structures of a poplar xyloglucan endotransglycosylase reveal details of transglycosylation acceptor binding. Plant Cell, 2004;16(4):874-886.
78. Cabib E, Blanco N, Grau C, Rodríguez-Peña JM, Arroyo J. Crh1p and Crh2p are required for the cross-linking of chitin to β-(1-6)glucan in the Saccharomyces cerevisiae cell wall. Mol Microbiol, 2007;63(3):921-935.
79. Cabib E, Farkas V, Kosik O, Blanco N, Arroyo J, McPhie P. Assembly of the Yeast Cell Wall: Crh1p AND Crh2p act as transglycosylases in vivo and in vitro. J Biol Chem, 2008;283(44):29859-29872.
80. Barbeyron T, Gerard A, Potin P, Henrissat B, Kloareg B. The kappa-carrageenase of the marine bacterium Cytophaga drobachiensis. Structural and phylogenetic relationships within family-16 glycoside hydrolases. Mol Biol Evol, 1998;15(5):528-537.
81. Michel G, Chantalat L, Duee E, Barbeyron T, Henrissat B, Kloareg B, Dideberg O. The κ-carrageenase of P. carrageenovora Features a Tunnel-Shaped Active Site: A Novel
Jens Eklöf 45
Insight in the Evolution of Clan-B Glycoside Hydrolases. Structure, 2001;9(6):513-525.
82. Fry SC, York WS, Albersheim P, Darvill A, Hayashi T, Joseleau JP, Kato Y, Lorences EP, Maclachlan GA, McNeil M, Mort AJ, Reid JSG, Seitz HU, Selvendran RR, Voragen AGJ, White AR. An unambiguous nomenclature for xyloglucan-derived oligosaccharides. Physiol Plant, 1993;89(1):1-3.
83. Rose JKC, Bennett AB. Cooperative disassembly of the cellulose-xyloglucan network of plant cell walls: parallels between cell expansion and fruit ripening. Trends Plant Sci, 1999;4(5):176-183.
84. Carpita NC, Gibeaut DM. Structural models of primary-cell walls in flowering plants - consistency of molecular-structure with the physical-properties of the walls during growth. Plant J, 1993;3(1):1-30.
85. Cosgrove DJ. Growth of the plant cell wall. Nat Rev Mol Cell Biol, 2005;6(11):850-861.
86. Pauly M, Albersheim P, Darvill A, York WS. Molecular domains of the cellulose/xyloglucan network in the cell walls of higher plants. Plant J, 1999;20(6):629-639.
87. Fry SC. The Structure and Functions of Xyloglucan. J Exp Bot, 1989;40(210):1-11. 88. Hayashi T. Xyloglucans in the Primary-Cell Wall. Annu Rev Plant Physiol Plant
Molec Biol, 1989;40:139-168. 89. Kooiman P. On the occurence of amyloids in plant seeds. Acta Bot Neerl,
1960;9:208-219. 90. Tine MAS, Silva CO, Lima DUd, Carpita NC, Buckeridge MS. Fine structure of a
mixed-oligomer storage xyloglucan from seeds of Hymenaea courbaril. Carbohy Pol, 2006;66(4):444-454.
91. Hrmova M, Farkas V, Lahnstein J, Fincher GB. A Barley Xyloglucan Xyloglucosyl Transferase Covalently Links Xyloglucan, Cellulosic Substrates, and (1,3;1,4)-beta-D-Glucans. J Biol Chem, 2007;282(17):12951-12962.
92. Yokoyama R, Nishitani K. A comprehensive expression analysis of all members of a gene family encoding cell-wall enzymes allowed us to predict cis-regulatory regions involved in cell-wall construction in specific organs of arabidopsis. Plant Cell Physiol, 2001;42(10):1025-1033.
93. Yokoyama R, Rose JKC, Nishitani K. A surprising diversity and abundance of xyloglucan endotransglucosylase/hydrolases in rice. Classification and expression analysis. Plant Physiol, 2004;134(3):1088-1099.
94. Geisler-Lee J, Geisler M, Coutinho PM, Segerman B, Nishikubo N, Takahashi J, Aspeborg H, Djerbi S, Master E, Andersson-Gunneras S, Sundberg B, Karpinski S, Teeri TT, Kleczkowski LA, Henrissat B, Mellerowicz EJ. Poplar Carbohydrate-Active Enzymes. Gene Identification and Expression Analyses. Plant Physiol, 2006;140(3):946-962.
95. Becnel J, Natarajan M, Kipp A, Braam J. Developmental Expression Patterns of Arabidopsis XTH Genes Reported by Transgenes and Genevestigator. Plant Mol Biol, 2006;61(3):451-467.
96. Edwards M, Dea IC, Bulpin PV, Reid JS. Purification and properties of a novel xyloglucan-specific endo-(1 →4)-beta-D-glucanase from germinated nasturtium seeds (Tropaeolum majus L.). J Biol Chem, 1986;261(20):9489-9494.
46 A holistic approach to understanding CAZy families through reductionist methods
97. Fanutti C, Gidley MJ, Reid JS. Action of a pure xyloglucan endo-transglycosylase (formerly called xyloglucan-specific endo-(1-->4)-beta-D-glucanase) from the cotyledons of germinated nasturtium seeds. Plant J, 1993;3(5):691-700.
98. Fanutti C, Gidley MJ, Reid JS. Substrate subsite recognition of the xyloglucan endo-transglycosylase or xyloglucan-specific endo-(1-->4)-beta-D-glucanase from the cotyledons of germinated nasturtium (Tropaeolum majus L.) seeds. Planta, 1996;200(2):221-228.
99. Xu W, Campbell, Vargheese PAK, Braam J. The Arabidopsis XET-related gene family: environmental and hormonal regulation of expression. Plant J, 1996;9(6):879-889.
100. Uozu S, Tanaka-Ueguchi M, Kitano H, Hattori K, Matsuoka M. Characterization of XET-Related Genes of Rice. Plant Physiol, 2000;122(3):853-860.
101. Davies GJ, Gloster TM, Henrissat B. Recent structural insights into the expanding world of carbohydrate-active enzymes. Curr Opin Struct Biol, 2005;15(6):637-645.
102. Markovič O, Janeček S. Pectin methylesterases: sequence-structural features and phylogenetic relationships. Carbohydr Res, 2004;339(13):2281-2295.
103. Duan XW, Cheng GP, Yang E, Yi C, Ruenroengklin N, Lu WJ, Luo YB, Jiang YM. Modification of pectin polysaccharides during ripening of postharvest banana fruit. Food Chem, 2008;111(1):144-149.
104. Iwai H, Masaoka N, Ishii T, Satoh S. A pectin glucuronyltransferase gene is essential for intercellular attachment in the plant meristem. Proc Natl Acad Sci U S A, 2002;99(25):16319-16324.
105. Visser J, Voragen AGJ. Progress in Biotechnology 14: Pectins and pectinases. Amsterdam: Elsevier; 1996.
106. Vincken J-P, Schols HA, Oomen RJFJ, McCann MC, Ulvskov P, Voragen AGJ, Visser RGF. If Homogalacturonan Were a Side Chain of Rhamnogalacturonan I. Implications for Cell Wall Architecture. Plant Physiol, 2003;132(4):1781-1789.
107. Scheller HV, Jensen JK, Sorensen SO, Harholt J, Geshi N. Biosynthesis of pectin. Physiol Plant, 2007;129(2):283-295.
108. Markovic O, Kohn R. Mode of pectin deesterification by Trichoderma reesei pectineserase. Experientia, 1984;40(8):842-843.
109. Cabrera JC, Boland A, Messiaen J, Cambier P, Van Cutsem P. Egg box conformation of oligogalacturonides: The time-dependent stabilization of the elicitor-active conformation increases its biological activity. Glycobiology, 2008;18(6):473-482.
110. Jarvis MC, Apperley DC. Chain conformation in concentrated pectic gels - evidence from C-13 NMR. Carbohydr Res, 1995;275(1):131-145.
111. Bordenave M, Goldberg R. Immobilized and free apoplastic pectinmethylesterases in mung bean hypocotyl. Plant Physiol, 1994;106(3):1151-1156.
112. Wen FS, Zhu YM, Hawes MC. Effect of pectin methylesterase gene expression on pea root development. Plant Cell, 1999;11(6):1129-1140.
113. Payasi A, Sanwal R, Sanwal GG. Microbial pectate lyases: characterization and enzymological properties. World J Microbiol Biotechnol, 2009;25(1):1-14.
114. Limberg G, Körner R, Buchholt HC, Christensen TMIE, Roepstorff P, Mikkelsen JD. Analysis of different de-esterification mechanisms for pectin by enzymatic fingerprinting using endopectin lyase and endopolygalacturonase II from A. Niger. Carbohydr Res, 2000;327(3):293-307.
115. Fries M, Ihrig J, Brocklehurst K, Shevchik VE, Pickersgill RW. Molecular basis of the activity of the phytopathogen pectin methylesterase. Embo J, 2007;26(17):3879-3887.
Jens Eklöf 47
116. Micheli F. Pectin methylesterases: cell wall enzymes with important roles in plant physiology. Trends Plant Sci, 2001;6(9):414-419.
117. Micheli F, Sundberg B, Goldberg R, Richard L. Radial Distribution Pattern of Pectin Methylesterases across the Cambial Region of Hybrid Aspen at Activity and Dormancy. Plant Physiol, 2000;124(1):191-200.
118. Matthews BW, Sigler PB, Henderson R, Blow DM. Three-dimensional Structure of Tosyl-α-chymotrypsin. Nature, 1967;214(5089):652-656.
119. Jenkins J, Pickersgill R. The architecture of parallel β-helices and related folds. Prog Biophys Mol Biol 2001;77(2):111-175.
120. Jenkins J, Mayans O, Smith D, Worboys K, Pickersgill RW. Three-dimensional structure of Erwinia chrysanthemi pectin methylesterase reveals a novel esterase active site. J Mol Biol 2001;305(4):951-960.
121. Johansson K, El-Ahmad M, Friemann R, Jörnvall H, Markovič O, Eklund H. Crystal structure of plant pectin methylesterase. FEBS Lett, 2002;514(2-3):243-249.
122. Di Matteo A, Giovane A, Raiola A, Camardella L, Bonivento D, De Lorenzo G, Cervone F, Bellincampi D, Tsernoglou D. Structural Basis for the Interaction between Pectin Methylesterase and a Specific Inhibitor Protein. Plant Cell, 2005;17(3):849-858.
123. Giovane A, Servillo L, Balestrieri C, Raiola A, D'Avino R, Tamburrini M, Ciardiello MA, Camardella L. Pectin methylesterase inhibitor. Biochim Biophys Acta, Proteins Proteomics, 2004;1696(2):245-252.
124. Raiola A, Camardella L, Giovane A, Mattei B, De Lorenzo G, Cervone F, Bellincampi D. Two Arabidopsis thaliana genes encode functional pectin methylesterase inhibitors. FEBS Lett, 2004;557(1-3):199-203.
125. Misas-Villamil JC, van der Hoorn RAL. Enzyme-inhibitor interactions at the plant-pathogen interface. Curr Opin Plant Biol 2008;11(4):380-388.
126. Spök A, Stubenrauch G, Schorgendorfer K, Schwab H. Molecular-cloning and sequencing of a pectinesterase gene from Pseudomonas-solanacearum. J Gen Microbiol 1991;137:131-140.
127. Kuznetsova E, Proudfoot M, Sanders SA, Reinking J, Savchenko A, Arrowsmith CH, Edwards AM, Yakunin AF. Enzyme genomics: Application of general enzymatic screens to discover new enzymes. FEMS Microbiol Rev, 2005;29(2):263-279.
128. Maskill H. The Physical Basis of Organic Chemistry: Oxford University Press; 1986. 480 p.
129. Shevchik VE, Condemine G, Hugouvieux-Cotte-Pattat N, Robert-Baudouy J. Characterization of pectin methylesterase B, an outer membrane lipoprotein of Erwinia chrysanthemi 3937. Mol Microbiol 1996;19(3):455-466.