CATH, biological databases, bioinformatics database
8/4/2019 CATH , bilogical data bases, bioinformatics data base
CATH Data Base
The CATH (http://www.cathdb.info/) Protein Structure Classification is a semi-automatic,
hierarchical classification of protein domains published in 1997 by Christine Orengo,
Janet Thornton and their colleagues. Its classification of protein domain structures is
manually curated: each protein is chopped into structural domains, which are assigned to
homologous superfamilies (groups of domains that are related by evolution). The
classification procedure uses a combination of automated and manual techniques, which
include computational algorithms, empirical and statistical evidence, literature review and
expert analysis.
The CATH database is a hierarchical domain classification of protein structures in the
Protein Data Bank (PDB, Berman et al. 2003). Only crystal structures solved to resolution
better than 4.0 angstroms are considered, together with NMR structures. All non-proteins,
models, and structures with greater than 30% C-alpha only are excluded from CATH. This
filtering of the PDB is performed using the SIFT protocol (Michie et al., 1996). Protein
structures are classified using a combination of automated and manual procedures. There are
four major levels in this hierarchy: Class, Architecture, Topology (fold family) and
Homologous superfamily (Orengo et al., 1997). Each level is described below, together with
the methods used for defining domain boundaries and assigning structures to a specific
family.
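The PDB filtering criteria above can be written as a simple predicate. This is only a minimal sketch of the stated rules, not the SIFT protocol itself: the `PdbEntry` record, its field names and the method labels are hypothetical, and treating "better than 4.0 angstroms" as a strict inequality is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdbEntry:
    """Hypothetical summary of one PDB entry (field names are illustrative)."""
    method: str                   # "XRAY" or "NMR" (labels assumed for this sketch)
    resolution: Optional[float]   # in angstroms; None when not applicable (e.g. NMR)
    is_protein: bool
    is_model: bool
    ca_only_fraction: float       # fraction of the structure that is C-alpha only (0.0 to 1.0)


def passes_cath_filter(entry: PdbEntry) -> bool:
    """Apply the inclusion rules described in the text."""
    # Non-proteins and models are excluded outright.
    if not entry.is_protein or entry.is_model:
        return False
    # Structures with more than 30% C-alpha only are excluded.
    if entry.ca_only_fraction > 0.30:
        return False
    # NMR structures are considered without a resolution cutoff.
    if entry.method == "NMR":
        return True
    # Crystal structures need a resolution better (numerically lower) than 4.0 angstroms.
    if entry.method == "XRAY":
        return entry.resolution is not None and entry.resolution < 4.0
    return False
```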
Domain Boundary Assignments
All classification is performed on individual protein domains. To divide multidomain
protein structures into their constituent domains, a combination of automatic and manual
techniques is used. If a given protein chain has sufficiently high sequence identity and
structural similarity (i.e. 80% sequence identity, SSAP score >= 80) with a chain that has
previously been chopped, the domain boundary assignment is performed automatically by
inheriting the boundaries from the other chain (ChopClose). Otherwise, the domain
boundaries are assigned manually, based on an analysis of results derived from a range of
algorithms which include structure based methods (CATHEDRAL, SSAP, DETECTIVE
(Swindells, 1995), PUU (Holm & Sander, 1994), DOMAK (Siddiqui and Barton, 1995)),
sequence based methods (Profile HMMs) and relevant literature.
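The routing between automatic inheritance and manual assignment can be sketched as a threshold check. The function name and return labels are hypothetical, and treating the 80% identity threshold as inclusive is an assumption; this is not the ChopClose implementation.

```python
def boundary_assignment_route(seq_identity_pct: float, ssap_score: float) -> str:
    """Decide how domain boundaries would be assigned for a new chain.

    Both values compare the new chain against a chain that has already been chopped.
    """
    # Thresholds from the text: 80% sequence identity and SSAP score >= 80.
    if seq_identity_pct >= 80.0 and ssap_score >= 80.0:
        return "inherit"   # boundaries copied from the previously chopped chain
    return "manual"        # curator reviews the output of the domain-finding algorithms
```

For example, a chain with 92% identity and an SSAP score of 88 against a chopped chain would be routed to "inherit", while one at 92% identity and an SSAP score of 75 would go to "manual".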
The CATH Hierarchy and Classification
Automated Procedures
If a given domain has sufficiently high sequence and structural similarity (i.e. 35% sequence
identity, SSAP score >= 80) with a domain that has been previously classified in CATH, the
classification is automatically inherited from the other domain. Otherwise, the domain is
classified manually, based upon an analysis of the results derived primarily from a range of
comparison algorithms (CATHEDRAL, HMMs, SSAP scores) and relevant literature.
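As an illustration of the inheritance rule, the sketch below scans comparison hits against already classified domains and returns a classification only when a hit meets both thresholds. The `Hit` record is hypothetical, the codes in the usage note are used only as sample input, and choosing the hit with the highest SSAP score when several qualify is an assumption of this sketch rather than something the text specifies.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """Hypothetical comparison result against a previously classified domain."""
    cath_code: str            # classification of the matched domain
    seq_identity_pct: float
    ssap_score: float


def inherit_classification(hits: Sequence[Hit]) -> Optional[str]:
    """Return the classification to inherit, or None if manual curation is needed."""
    # Keep only hits meeting both thresholds from the text (35% identity, SSAP >= 80).
    eligible = [h for h in hits if h.seq_identity_pct >= 35.0 and h.ssap_score >= 80.0]
    if not eligible:
        return None
    # Tie-breaking rule assumed for this sketch: best SSAP score, then best identity.
    best = max(eligible, key=lambda h: (h.ssap_score, h.seq_identity_pct))
    return best.cath_code
```

With hits `Hit("3.40.50.300", 40.0, 85.0)` and `Hit("2.60.40.10", 36.0, 91.0)`, the function returns the second code; a single hit at 90% identity but SSAP 70 returns `None`, which corresponds to manual classification.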
Manual and Automated Procedures Combined
Class, C-level: Class is determined according to the secondary structure composition and
packing within the structure. Three major classes are recognised: mainly-alpha, mainly-beta
and alpha-beta. This last class (alpha-beta) includes both alternating alpha/beta structures and
alpha+beta structures, as originally defined by Levitt and Chothia (1976). A fourth class
contains protein domains which have low secondary structure content.
Architecture, A-level: This describes the overall shape of the domain structure as
determined by the orientations of the secondary structures but ignores the connectivity
between the secondary structures. It is currently assigned manually using a simple description
of the secondary structure arrangement, e.g. barrel or 3-layer sandwich. Reference is made to
the literature for well-known architectures (e.g. the beta-propeller or alpha four-helix bundle).
Topology (Fold family), T-level: Structures are grouped according to whether they share
the same topology or fold in the core of the domain, that is, if they share the same overall
shape and connectivity of the secondary structures in the domain core. Domains in the same
fold group may have different structural decorations to the common core.
Homologous Superfamily, H-level: This level groups together protein domains which are
thought to share a common ancestor and can therefore be described as homologous.
Similarities are identified either by high sequence identity or structure comparison using
SSAP. Structures are clustered into the same homologous superfamily if they satisfy one of
the following criteria:
Sequence identity >= 35%, overlap >= 60% of larger structure equivalent to smaller.
SSAP score >= 80.0, sequence identity >= 20%, 60% of larger structure equivalent to smaller.
SSAP score >= 70.0, 60% of larger structure equivalent to smaller, and domains which have related functions, which is informed by the literature and the Pfam protein family database (Bateman et al., 2004).
Significant similarity from HMM-sequence searches and HMM-HMM comparisons using SAM (Hughey & Krogh, 1996), HMMER (http://hmmer.wustl.edu) and PRC (http://supfam.org/PRC).
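The four criteria can be read as alternatives, any one of which is sufficient. The following is a minimal sketch under that reading: the `PairEvidence` record, its field names and the returned labels are hypothetical, and the two boolean fields stand in for judgements (related function, significant HMM similarity) that the text leaves to curators and external tools.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PairEvidence:
    """Hypothetical evidence collected for a pair of domains."""
    seq_identity_pct: float
    overlap_pct: float          # percent of the larger structure equivalent to the smaller
    ssap_score: float
    related_function: bool      # informed by literature and Pfam
    significant_hmm_hit: bool   # from HMM-sequence or HMM-HMM comparison


def matching_criterion(e: PairEvidence) -> Optional[str]:
    """Return a label for the first satisfied criterion, or None if none applies."""
    enough_overlap = e.overlap_pct >= 60.0
    if e.seq_identity_pct >= 35.0 and enough_overlap:
        return "sequence"
    if e.ssap_score >= 80.0 and e.seq_identity_pct >= 20.0 and enough_overlap:
        return "structure"
    if e.ssap_score >= 70.0 and enough_overlap and e.related_function:
        return "structure+function"
    if e.significant_hmm_hit:
        return "hmm"
    return None
```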
Sequence Family Levels: (S,O,L,I,D)
Domains within each H-level are subclustered into sequence families using multi-linkage
clustering at the following levels:
Level   Sequence Identity   Overlap
S       35%                 80%
O       60%                 80%
L       95%                 80%
I       100%                80%
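To make the table concrete, the sketch below reports which levels a single pair of domains would satisfy. It is a pairwise check only and does not perform the multi-linkage clustering itself; the function name is hypothetical and the thresholds are assumed to be inclusive.

```python
# Identity thresholds per level, taken from the table; overlap is 80% at every level.
IDENTITY_THRESHOLDS = (("S", 35.0), ("O", 60.0), ("L", 95.0), ("I", 100.0))
MIN_OVERLAP_PCT = 80.0


def levels_satisfied(identity_pct: float, overlap_pct: float) -> str:
    """Return the letters of the levels whose thresholds this pair meets."""
    if overlap_pct < MIN_OVERLAP_PCT:
        return ""
    return "".join(
        level for level, min_identity in IDENTITY_THRESHOLDS if identity_pct >= min_identity
    )
```

A pair at 70% identity and 90% overlap gives "SO"; a pair at 99% identity but only 50% overlap gives an empty string.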
The D-level acts as a counter within each S100 family and is appended to the classification
hierarchy to ensure that every domain in CATH has a unique CATHSOLID classification.
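A CATHSOLID classification can be handled as nine ordered fields. The sketch below assumes the fields are written as dot-separated integers in the order C, A, T, H, S, O, L, I, D; the function names and the sample code are illustrative.

```python
from typing import Dict

LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_cathsolid(code: str) -> Dict[str, int]:
    """Split a dot-separated CATHSOLID string into a level-to-number mapping."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(parts)}: {code!r}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


def superfamily_id(code: str) -> str:
    """Return the first four fields (C.A.T.H), which identify the homologous superfamily."""
    return ".".join(code.split(".")[:4])
```

For the sample string "3.40.50.300.1.2.3.4.5", `parse_cathsolid` maps C to 3, H to 300 and D to 5, and `superfamily_id` returns "3.40.50.300".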
The sequence identity and overlap used for clustering are obtained from an implementation of
the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970) using a gap penalty of 3.
The percentage sequence identity is calculated as (100 * Number Of Identical
Residues/Length Of The Shortest Sequence) and the percentage overlap is calculated as (100
* Number Of Aligned Residues/Length Of The Longest Sequence).
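The two formulas can be applied to a finished pairwise alignment as follows. This sketch does not perform the Needleman-Wunsch alignment; it takes two already aligned, equal-length strings. It assumes "-" marks a gap and that "aligned residues" means columns in which both sequences have a residue.

```python
from typing import Tuple


def identity_and_overlap(aligned_a: str, aligned_b: str, gap: str = "-") -> Tuple[float, float]:
    """Return (percent identity, percent overlap) for two gapped, aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Ungapped lengths of the two sequences.
    len_a = sum(1 for ch in aligned_a if ch != gap)
    len_b = sum(1 for ch in aligned_b if ch != gap)
    if min(len_a, len_b) == 0:
        raise ValueError("sequences must contain at least one residue")

    # Columns where both sequences have a residue, and those where the residues match.
    aligned = sum(1 for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap)
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x != gap and x == y)

    identity_pct = 100.0 * identical / min(len_a, len_b)   # relative to the shortest sequence
    overlap_pct = 100.0 * aligned / max(len_a, len_b)      # relative to the longest sequence
    return identity_pct, overlap_pct
```

For the toy alignment "ACDEFGHIK-" against "ACDQFG--KL" (lengths 9 and 8, 7 aligned columns, 6 identical), the identity is 100 * 6 / 8 = 75% and the overlap is 100 * 7 / 9, about 77.8%.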