secondary databases
  Single protein analysis Peter Højrup

Upload: daljit-singh

Post on 06-Oct-2015

DESCRIPTION

SECONDARY DATABASE, BIOINFORMATICS, SINGLE PROTEIN ANALYSIS, DATABASE

TRANSCRIPT

  • Single protein analysis

    Peter Højrup

  • Single protein analysis

    Physical/chemical constants: mass, pI, net charge @ pH, hydrophobic index, absorption coefficient, etc.

    Structure analysis: hydrophobic regions (membrane spanning), coiled coil, secondary structure

    Secondary database analysis

  • Physical/chemical constants

    Number of amino acids: 223
    Molecular weight: 23475.5
    Theoretical pI: 8.26

    Amino acid composition:
    Ala (A)   15    6.7%
    Arg (R)    4    1.8%
    Asn (N)   18    8.1%
    ....

    Total number of negatively charged residues (Asp + Glu): 11
    Total number of positively charged residues (Arg + Lys): 14

    Atomic composition:
    Carbon    C   1020
    Hydrogen  H   1607
    Nitrogen  N    287
    Oxygen    O    321
    Sulfur    S     14

    Formula: C1020H1607N287O321S14
    Total number of atoms: 3249

                          276 nm   278 nm   279 nm   280 nm   282 nm
    Ext. coefficient       34070    34362    34120    33720    32720
    Abs 0.1% (=1 g/l)      1.451    1.464    1.453    1.436    1.394

    Estimated half-life:
    The N-terminal of the sequence considered is I (Ile).
    The estimated half-life is: 20 hours (mammalian reticulocytes, in vitro); 30 min (yeast, in vivo); >10 hours (Escherichia coli, in vivo).

    Instability index:
    The instability index (II) is computed to be 32.39.
    This classifies the protein as stable.

    Aliphatic index: 83.50
    Grand average of hydropathicity (GRAVY): -0.037

    GPMAW / ExPASy - ProtParam
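
    Two of the numbers in this output can be cross-checked with simple arithmetic: the molecular weight follows from the atomic composition, and Abs 0.1% is the extinction coefficient divided by the molecular weight. The sketch below is only an illustration. The function names are hypothetical, and the rounded standard atomic masses are an assumption, since the slide does not state which masses the tool used.

```python
# Rounded average atomic masses (assumed; not stated on the slide).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

# Values copied from the output above.
COMPOSITION = {"C": 1020, "H": 1607, "N": 287, "O": 321, "S": 14}
MOLECULAR_WEIGHT = 23475.5
EXT_COEFF = {276: 34070, 278: 34362, 279: 34120, 280: 33720, 282: 32720}


def molecular_weight(composition):
    """Sum of atomic masses weighted by the number of atoms of each element."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


def abs_01_percent(ext_coeff, mol_weight):
    """Absorbance of a 1 g/l solution: extinction coefficient / molecular weight."""
    return ext_coeff / mol_weight


if __name__ == "__main__":
    print("Total atoms:", sum(COMPOSITION.values()))
    print("Molecular weight from formula: %.1f" % molecular_weight(COMPOSITION))
    for wavelength, eps in EXT_COEFF.items():
        print("%d nm: Abs 0.1%% = %.3f" % (wavelength, abs_01_percent(eps, MOLECULAR_WEIGHT)))
```

    The weight computed from the formula should agree with the reported value to within rounding of the atomic masses.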

  • Sliding window concept

    Charge density, window size 7

    Sequence:  K  L  A  W  Y  I  G  D  E  H  N  R  T  P  A  E  V  T  G  H  I  K  F  R  T  E  N  Q  W  P  P
    Charge:    1  0  0  0  0  0  0 -1 -1  1  0  1  0  0  0 -1  0  0  0  1  0  1  0  1  0 -1  0  0  0  0  0

    The sum over each window is assigned to the residue at the centre of the window:
    value 1, residue 4 (W)
    value -1, residue 5 (Y)
    value -2, residue 6 (I)

    [Figure panels: window size 3; window size 13]

    Typical uses:
    Phys/chem properties
    Dot-plot analysis
    Other residue assignable parameters
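
    The window calculation on this slide can be reproduced in a few lines. This is a minimal sketch: the function name is hypothetical, and the charge table simply mirrors the slide, which counts K, R and H as +1 and D and E as -1.

```python
# Charge per residue as used on the slide (H is counted as +1 there).
CHARGE = {"K": 1, "R": 1, "H": 1, "D": -1, "E": -1}

SEQUENCE = "KLAWYIGDEHNRTPAEVTGHIKFRTENQWPP"


def sliding_window_sum(values, window):
    """Return {residue number (1-based): sum of the window centred on it}."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    half = window // 2
    return {
        centre + 1: sum(values[centre - half:centre + half + 1])
        for centre in range(half, len(values) - half)
    }


charges = [CHARGE.get(residue, 0) for residue in SEQUENCE]
profile = sliding_window_sum(charges, 7)

if __name__ == "__main__":
    for position in (4, 5, 6):
        print(position, SEQUENCE[position - 1], profile[position])
```

    Residues closer than half a window to either end get no value, which is why the first value belongs to residue 4 for a window size of 7.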

  • Hydrophobicity

    The average hydrophobicity in a sliding window can tell us about transmembrane regions and propeptides.

    [Figure panels: window size 3; window size 13]

    Use a window size corresponding to the feature you are looking for!
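
    As an illustration of the effect of window size, the sketch below averages the Kyte-Doolittle hydropathy scale (the scale behind the GRAVY value) over a sliding window. The choice of scale, the function names and the test sequence are assumptions made for this example; the slide does not say which scale was plotted.

```python
# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean value over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def hydropathy_profile(sequence, window):
    """Mean hydropathy of every window; element i is the window starting at index i."""
    return [
        gravy(sequence[start:start + window])
        for start in range(len(sequence) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical sequence: a 13-residue hydrophobic stretch between charged flanks.
    test_sequence = "DEKRDEKRDE" + "LIVLLAVLIVLLA" + "DEKRDEKRDE"
    for window in (3, 13):
        profile = hydropathy_profile(test_sequence, window)
        start = profile.index(max(profile))
        print("window %2d: highest mean %.2f, window starts at index %d"
              % (window, max(profile), start))
```

    With a window of 13 the single broad peak marks the whole hydrophobic stretch, whereas a window of 3 responds to every short run of hydrophobic residues.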

  • Prediction of coiled coils

    [Figure: coiled coil]

  • Secondary structure prediction

    [Figure: trypsin precursor]

  • Secondary databases

    Secondary databases are based on the observation that certain residues are conserved in proteins in a specific pattern for specific functions and/or families of proteins.

    This means that if we can recognize these patterns, and discriminate them from random sequences, we can identify function directly from the primary structure (even without a search for homologous sequences).

    Furthermore, functions may be obscured in homology searches if the search hits other proteins that only contain parts of the search sequence.

  • Secondary databases

    Secondary databases are built from the primary databases, using alignments of proteins of known function as input.

    The record in a secondary database can be:
    A regular expression
    Fingerprints
    Blocks
    Profiles
    Hidden Markov models (HMM)

  • Pattern databases:

  • Pattern databases

    PROSITE: Groups of proteins of similar biochemical function on the basis of amino acid patterns.

    BLOCKS: Derived from the PROSITE database. Uses position-specific scoring matrices (PSSM).

  • PROSITE entry

    PDOC00124 Serine protease, trypsin family, signatures and profiles

    -Consensus pattern: [LIVM]-[ST]-A-[STAG]-H-C
     [H is the active site residue]
    -Sequences known to belong to this class detected by the pattern: ALL, except for complement components C1r and C1s, pig plasminogen, bovine protein C, rodent urokinase, ancrod, gyroxin and two insect trypsins.
    -Other sequence(s) detected in SWISS-PROT: 18.

    -Consensus pattern: [DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-[LIVMFYWH]-[LIVMFYSTANQH]
     [S is the active site residue]
    -Sequences known to belong to this class detected by the pattern: ALL, except for 18 different proteases which have lost the first conserved glycine.
    -Other sequence(s) detected in SWISS-PROT: 8.

    -Sequences known to belong to this class detected by the profile: ALL.
    -Other sequence(s) detected in SWISS-PROT: NONE.

    -Note: if a protein includes both the serine and the histidine active site signatures, the probability of it being a trypsin family serine protease is 100%.

    -Note: this documentation entry is linked to both a signature pattern and a profile. As the profile is much more sensitive than the pattern, you should use it if you have access to the necessary software tools to do so.

  • BLOCKS entry

    Width = number of residues in the block
    Seqs = number of sequences in the block
    Parts of the overall alignment are clustered (80% identity)
    Last column is the position-based sequence weight (100 = most distant)

  • Regular expressions

    Regular expressions are a way to describe sequence patterns. For example, the N-glycosylation pattern of Asn, followed by any residue except Pro, followed by Ser/Thr/Cys: N-{P}-[STC]

    Sequence positions are separated by a dash.
    Completely conserved residues are written as the residue letter: N
    A choice between multiple residues is written in square brackets: [STC]
    Disallowed residues are written in curly brackets: {P}
    Any residue is written as x
    Multiple occurrences are given as a number in parentheses after the specification, e.g. any four residues are x(4); two consecutive positions that both have to be Glu or Asp are [ED](2)
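
    Patterns written in this notation translate almost directly into the regular expressions of a general-purpose language. The following is a rough sketch using Python's `re` module. The function names are hypothetical, and it handles only the elements described here plus the `<` and `>` terminal anchors, not every feature of the PROSITE syntax.

```python
import re

# One pattern element: optional '<', a residue specification,
# an optional repeat count such as (2) or (2,3), optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        n_term, core, low, high, c_term = match.groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # disallowed residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:                              # (n) or (n,m) repeat count
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if n_term else "") + core + repeat + ("$" if c_term else ""))
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (0-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    serine_site = ("[DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-"
                   "[SAPHV]-[LIVMFYWH]-[LIVMFYSTANQH]")
    print(prosite_to_regex("[RK]-x(2,3)-[DE]-x(2,3)-Y"))
    print(find_matches(serine_site, "AAADACEGDSGGPFVAAA"))
```

    Note that `finditer` reports non-overlapping matches only, so overlapping sites would need a lookahead-based search instead.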

  • Regular expressions

    Examples

    Kringle region
    [FY]-C-R-N-P-[DNR]

    Myristylation
    G-{EDRKHPFYW}-x(2)-[STAGCNI]-Y
    Glycine has to be N-terminal; this is not recognized by the search program!

    Tyrosine kinase phosphorylation site
    [RK]-x(2,3)-[DE]-x(2,3)-Y

    Serine protease active site
    DACEGDSGGPFV

  • Rules

    Short patterns that are not associated with any particular family

    Examples:

    N-glycosylation: N-{P}-[STC]

    Cell attachment site: RGD

    ER retention sequence: KDEL

  • Fuzzy regular expressions

    Fuzzy regular expressions are a special case used by the MOTIF system (based on BLOCKS and PRINTS). The fuzziness comes in when specific amino acid residues are replaced by groups of similar residues, e.g. D is replaced by [DENQ] and V by [VLI].

    This increases the sensitivity, but the signal-to-noise ratio becomes much worse.
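
    The trade-off can be seen in a toy example. The sketch below applies only the two substitutions named above to a literal pattern and counts the hits before and after. The function name and the test sequence are hypothetical, and this is not how the MOTIF system itself is implemented.

```python
import re

# Only the two substitutions given in the text; a real system would define more.
FUZZY_CLASSES = {"D": "[DENQ]", "V": "[VLI]"}


def fuzzify(literal_pattern):
    """Replace each residue that has a fuzzy class; leave the others unchanged."""
    return "".join(FUZZY_CLASSES.get(residue, residue) for residue in literal_pattern)


if __name__ == "__main__":
    sequence = "ARGDKRGEARGNV"          # hypothetical sequence
    exact = "RGD"
    fuzzy = fuzzify(exact)              # "RG[DENQ]"
    print(exact, re.findall(exact, sequence))
    print(fuzzy, re.findall(fuzzy, sequence))
```

    The exact pattern finds one site in this sequence and the fuzzy version finds three. The extra hits may be real relatives or noise, which is the loss of signal-to-noise ratio described above.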

  • Fingerprints

    Fingerprints are matrices of alignments populated by residue frequencies observed at each position of the matrix (Position-Specific Scoring Matrix PSSM).

    Blocks

    Blocks are ungapped aligned segments. Scoring is usually done with amino acid substitution matrices (e.g. BLOSUM). The score is increased by searching for multiple blocks with proper spacing.

    Profiles

    Profiles are complete alignments that are distilled into scoring tables, including information leading to insertions and deletions (INDELs).
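
    To make the idea of a position-specific scoring matrix concrete, the sketch below builds one from a small ungapped block and uses it to scan a sequence. Everything in it is an assumption made for illustration: the four-sequence block is invented, the background is uniform, and the log-odds scoring with a pseudocount is a common textbook scheme, not the procedure of any particular database.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0):
    """Build log2-odds scores per column from an ungapped block of equal-length sequences."""
    width = len(block[0])
    if any(len(sequence) != width for sequence in block):
        raise ValueError("all sequences in a block must have the same length")
    background = 1.0 / len(AMINO_ACIDS)          # uniform background (assumption)
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for position in range(width):
        counts = Counter(sequence[position] for sequence in block)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the column scores for a segment as wide as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def best_hit(pssm, sequence):
    """Return (score, 0-based start) of the best-scoring window in the sequence."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    return max(
        (score(pssm, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    block = ["GDSGGP", "GDSGGP", "GDSGSP", "GESGGP"]   # hypothetical block
    pssm = build_pssm(block)
    print(best_hit(pssm, "AAAGDSGGPAAA"))
```

    Unlike a regular expression, the matrix gives every window a graded score, so a near match is ranked rather than rejected outright.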

  • Hidden Markov Models (HMMs)

    Hidden Markov Models are probabilistic models that are used to encode linear chains of match, insertion and deletion states.
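
    The core computation on any HMM is finding the most probable path of hidden states for an observed sequence (the Viterbi algorithm). The sketch below shows it on a deliberately simple two-state model, membrane (M) versus soluble (S), rather than on the match/insertion/deletion architecture described above. All probabilities, the hydrophobic residue set and the names are hypothetical and chosen only to make the example run.

```python
import math

HYDROPHOBIC = set("AILMFVW")                 # assumed residue classification
STATES = "MS"                                # M = membrane, S = soluble

# Hypothetical model parameters.
START = {"M": 0.5, "S": 0.5}
TRANSITION = {"M": {"M": 0.9, "S": 0.1}, "S": {"M": 0.1, "S": 0.9}}
EMISSION = {"M": {"h": 0.8, "p": 0.2}, "S": {"h": 0.3, "p": 0.7}}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path (as a string) for a non-empty observation list."""
    # Work in log space to avoid numerical underflow on long sequences.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(trans_p[best_prev][s])
                             + math.log(emit_p[s][obs]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


def decode(sequence):
    """Classify residues as hydrophobic ('h') or polar ('p') and decode the state path."""
    observations = ["h" if aa in HYDROPHOBIC else "p" for aa in sequence]
    return viterbi(observations, STATES, START, TRANSITION, EMISSION)


if __name__ == "__main__":
    sequence = "KKKKK" + "LLLLLLLLLL" + "KKKKK"
    print(sequence)
    print(decode(sequence))
```

    A profile HMM uses the same algorithm, only with one match, insertion and deletion state per alignment column and with parameters estimated from training sequences.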

  • Hidden Markov Model

    Part of a Hidden Markov Model trained for recognition of globin sequences.
    Line width indicates the extent of use of a path.
    Bar length indicates the amino acid distribution.
    Deletion of residues 56-60 thus occurs in ~50% of the training sequences.

  • Hidden Markov Models (HMMs)

    HMMs are also used for predicting a large number of other features and modifications (e.g. activation peptides, phosphorylation, O-glycosylation, subcellular localization, proteasomal cleavage, etc.).

    NOTE: Even though HMMs use the most advanced mathematics, they do not always perform as well as models from the PROSITE database, which is mainly based on hand-crafted alignments.

    Always use several secondary database models and always check the definitions of the entry in the database. E.g. PROSITE has a description of each entry, including the specificity of the search.

    Note also that many of the sites, particularly enzyme targets such as phosphorylation and glycosylation sites, only indicate a potential modification site. The actual modification depends on physical exposure of the residues to the enzyme, both in space and time.