CATH, biological databases, bioinformatics database

Upload: rajesh-guru

Post on 07-Apr-2018

Category:

Documents


TRANSCRIPT

  • 8/4/2019 CATH, biological databases, bioinformatics database

    CATH Database

    The CATH (http://www.cathdb.info/) Protein Structure Classification is a semi-automatic,
    hierarchical classification of protein domains published in 1997 by Christine Orengo, Janet
    Thornton and their colleagues. The resulting classification of protein domain structures is
    manually curated. Each protein is chopped into structural domains, which are assigned to
    homologous superfamilies (groups of domains that are related by evolution). The
    classification procedure combines automated and manual techniques, including computational
    algorithms, empirical and statistical evidence, literature review and expert analysis.

    The CATH database is a hierarchical domain classification of protein structures in the
    Protein Data Bank (PDB, Berman et al., 2003). Only crystal structures solved to a resolution
    better than 4.0 angstroms are considered, together with NMR structures. All non-proteins,
    models, and structures that are more than 30% C-alpha only are excluded from CATH. This
    filtering of the PDB is performed using the SIFT protocol (Michie et al., 1996). Protein
    structures are classified using a combination of automated and manual procedures. There are
    four major levels in the hierarchy: Class, Architecture, Topology (fold family) and
    Homologous superfamily (Orengo et al., 1997). Each level is described below, together with
    the methods used for defining domain boundaries and assigning structures to a specific
    family.
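
    The inclusion criteria quoted above can be written down as a simple predicate. The sketch
    below is illustrative only: it is not the SIFT protocol itself, and the PdbEntry record, its
    field names and the "XRAY"/"NMR" labels are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional

MAX_RESOLUTION = 4.0          # angstroms; crystal structures must be better (lower) than this
MAX_CA_ONLY_FRACTION = 0.30   # more than 30% C-alpha only means exclusion


@dataclass
class PdbEntry:
    """Hypothetical summary of one PDB entry; field names are illustrative."""
    entry_id: str
    method: str                   # "XRAY" or "NMR" in this sketch
    resolution: Optional[float]   # angstroms; None for NMR structures
    is_protein: bool
    is_model: bool
    ca_only_fraction: float       # fraction of the structure that is C-alpha only


def passes_filter(entry: PdbEntry) -> bool:
    """Return True if the entry meets the inclusion criteria described in the text."""
    if not entry.is_protein or entry.is_model:
        return False
    if entry.ca_only_fraction > MAX_CA_ONLY_FRACTION:
        return False
    if entry.method == "NMR":
        return True
    if entry.method == "XRAY":
        return entry.resolution is not None and entry.resolution < MAX_RESOLUTION
    # Other experimental methods are not mentioned in the text, so they are rejected here.
    return False
```

    Rejecting entries from other experimental methods is a choice made for this sketch; the
    text only states which crystal and NMR structures are considered.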

    Domain Boundary Assignments

    All classification is performed on individual protein domains. To divide multidomain
    protein structures into their constituent domains, a combination of automatic and manual
    techniques is used. If a given protein chain has sufficiently high sequence identity and
    structural similarity (i.e. 80% sequence identity, SSAP score >= 80) with a chain that has
    previously been chopped, the domain boundary assignment is performed automatically by
    inheriting the boundaries from the other chain (ChopClose). Otherwise, the domain
    boundaries are assigned manually, based on an analysis of results from a range of
    algorithms, including structure-based methods (CATHEDRAL, SSAP, DETECTIVE
    (Swindells, 1995), PUU (Holm & Sander, 1994), DOMAK (Siddiqui & Barton, 1995)),
    sequence-based methods (profile HMMs) and the relevant literature.

    The CATH Hierarchy and Classification

    Automated Procedures

    If a given domain has sufficiently high sequence and structural similarity (i.e. 35% sequence
    identity, SSAP score >= 80) with a domain that has previously been classified in CATH, the
    classification is automatically inherited from the other domain. Otherwise, the domain is
    classified manually, based on an analysis of results derived primarily from a range of
    comparison algorithms (CATHEDRAL, HMMs, SSAP scores) and the relevant literature.
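
    Both inheritance rules (for domain boundaries and for classification) have the same shape:
    a sequence identity threshold paired with an SSAP threshold. A minimal sketch follows,
    assuming the quoted identity figures are lower bounds; the function and constant names are
    hypothetical and are not part of any CATH software.

```python
CHOPPING_MIN_IDENTITY = 80.0        # % identity needed to inherit domain boundaries
CLASSIFICATION_MIN_IDENTITY = 35.0  # % identity needed to inherit a classification
MIN_SSAP_SCORE = 80.0               # SSAP threshold quoted for both steps


def can_inherit(seq_identity: float, ssap_score: float, min_identity: float) -> bool:
    """True if a previously processed match is close enough to inherit from."""
    return seq_identity >= min_identity and ssap_score >= MIN_SSAP_SCORE


def assignment_mode(seq_identity: float, ssap_score: float, min_identity: float) -> str:
    """Route a chain or domain to automatic inheritance or to manual curation."""
    return "automatic" if can_inherit(seq_identity, ssap_score, min_identity) else "manual"
```

    For example, a match at 60% identity with an SSAP score of 90 would be too distant to
    inherit domain boundaries but close enough to inherit a classification.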

    Manual and Automated Procedures Combined

    Class, C-level: Class is determined according to the secondary structure composition and
    packing within the structure. Three major classes are recognised: mainly-alpha, mainly-beta
    and alpha-beta. The alpha-beta class includes both alternating alpha/beta structures and
    alpha+beta structures, as originally defined by Levitt and Chothia (1976). A fourth class
    contains protein domains with low secondary structure content.

    Architecture, A-level: This describes the overall shape of the domain structure as
    determined by the orientations of the secondary structures, ignoring the connectivity
    between them. It is currently assigned manually using a simple description of the
    secondary structure arrangement, e.g. barrel or 3-layer sandwich. Reference is made to
    the literature for well-known architectures (e.g. the beta-propeller or alpha four-helix
    bundle).

    Topology (Fold family), T-level: Structures are grouped according to whether they share
    the same topology or fold in the core of the domain, that is, whether they share the same
    overall shape and connectivity of the secondary structures in the domain core. Domains in
    the same fold group may have different structural decorations on the common core.

    Homologous Superfamily, H-level: This level groups together protein domains that are
    thought to share a common ancestor and can therefore be described as homologous.
    Similarities are identified either by high sequence identity or by structure comparison
    using SSAP. Structures are clustered into the same homologous superfamily if they satisfy
    one of the following criteria:

    • Sequence identity >= 35%, overlap >= 60% of larger structure equivalent to smaller.
    • SSAP score >= 80.0, sequence identity >= 20%, 60% of larger structure equivalent to
      smaller.
    • SSAP score >= 70.0, 60% of larger structure equivalent to smaller, and domains which
      have related functions, as informed by the literature and the Pfam protein family
      database (Bateman et al., 2004).
    • Significant similarity from HMM-sequence searches and HMM-HMM comparisons using SAM
      (Hughey & Krogh, 1996), HMMER (http://hmmer.wustl.edu) and PRC
      (http://supfam.org/PRC).
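
    The criteria above combine as a logical OR. The sketch below encodes them as a single
    predicate; it assumes the functional relatedness and HMM significance judgements are made
    elsewhere and passed in as booleans, and the function and parameter names are
    hypothetical.

```python
def same_superfamily(
    seq_identity: float,            # percentage sequence identity
    overlap: float,                 # % of the larger structure equivalent to the smaller
    ssap_score: float,              # SSAP structure comparison score
    related_function: bool = False,  # judged from literature/Pfam (input, not computed here)
    hmm_significant: bool = False,   # significant HMM-sequence or HMM-HMM match (input)
) -> bool:
    """Return True if any one of the four H-level criteria is satisfied."""
    enough_overlap = overlap >= 60.0
    by_sequence = seq_identity >= 35.0 and enough_overlap
    by_structure = ssap_score >= 80.0 and seq_identity >= 20.0 and enough_overlap
    by_function = ssap_score >= 70.0 and enough_overlap and related_function
    return by_sequence or by_structure or by_function or hmm_significant
```

    A pair with 25% identity, 65% overlap and an SSAP score of 85 would qualify under the
    second criterion, while the same pair with an SSAP score of 75 would qualify only if the
    domains also had related functions.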

    Sequence Family Levels: (S,O,L,I,D)

    Domains within each H-level are subclustered into sequence families using multi-linkage

    clustering at the following levels:

    Level   Sequence identity   Overlap
    S       35%                 80%
    O       60%                 80%
    L       95%                 80%
    I       100%                80%

    The D-level acts as a counter within each S100 family and is appended to the classification

    hierarchy to ensure that every domain in CATH has a unique CATHSOLID classification.
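
    A CATHSOLID classification therefore has nine components, one per level (C, A, T, H, S,
    O, L, I, D). The sketch below splits such an identifier into named fields. It assumes the
    components are integers joined by dots, which the text does not state explicitly; the
    class and function names are hypothetical, and the identifier in the usage note is made
    up for illustration.

```python
from typing import NamedTuple


class CathSolid(NamedTuple):
    """One integer per level of the hierarchy, in C.A.T.H.S.O.L.I.D order."""
    c: int
    a: int
    t: int
    h: int
    s: int
    o: int
    l: int
    i: int
    d: int


def parse_cathsolid(identifier: str) -> CathSolid:
    """Split a dot-separated identifier (assumed format) into its nine levels."""
    parts = identifier.split(".")
    if len(parts) != 9:
        raise ValueError(f"expected 9 dot-separated levels, got {len(parts)}")
    return CathSolid(*(int(p) for p in parts))


def superfamily_id(cs: CathSolid) -> str:
    """Return the first four levels (C.A.T.H), which identify the homologous superfamily."""
    return ".".join(str(n) for n in cs[:4])
```

    With the made-up identifier "1.10.8.10.1.2.3.4.5", parse_cathsolid returns a record whose
    d field is 5, and superfamily_id gives "1.10.8.10".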

    The sequence identity and overlap used for clustering are obtained from an implementation of
    the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970) using a gap penalty of 3. The
    percentage sequence identity is calculated as (100 * number of identical residues / length
    of the shortest sequence), and the percentage overlap is calculated as (100 * number of
    aligned residues / length of the longest sequence).
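
    Given a pairwise alignment, the two percentages can be computed directly from the
    formulas above. The sketch below assumes the alignment is supplied as two equal-length
    strings with "-" marking gaps, and that "aligned residues" means columns where neither
    sequence has a gap; both are assumptions, and the alignment step itself
    (Needleman-Wunsch) is not reproduced here.

```python
from typing import Tuple


def identity_and_overlap(aligned_a: str, aligned_b: str, gap: str = "-") -> Tuple[float, float]:
    """Return (percentage identity, percentage overlap) for a gapped pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Ungapped lengths of the two original sequences.
    len_a = len(aligned_a) - aligned_a.count(gap)
    len_b = len(aligned_b) - aligned_b.count(gap)
    if min(len_a, len_b) == 0:
        raise ValueError("sequences must contain at least one residue")

    columns = list(zip(aligned_a, aligned_b))
    aligned = sum(1 for x, y in columns if x != gap and y != gap)
    identical = sum(1 for x, y in columns if x != gap and x == y)

    identity = 100.0 * identical / min(len_a, len_b)  # relative to the shortest sequence
    overlap = 100.0 * aligned / max(len_a, len_b)     # relative to the longest sequence
    return identity, overlap
```

    For the toy alignment "ACDEFGHIK" against "ACDQFG---", five of six aligned columns are
    identical, giving an identity of about 83.3% (5/6) and an overlap of about 66.7% (6/9).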
