Identification of Proteins Through Mass Spectrometry Databases

Upload: syed-kashif-raza

Post on 04-Jun-2018


  • 8/13/2019 Identification of Proteins Through Mass Spectrometry Databases



    Proteomics and Mass Spectrometry

    Proteome: the complete set of proteins in a cell.

    Current methodologies: 2D gel electrophoresis, protein microarrays, fluorescence microscopy, mass spectrometry, chromatography, nuclear magnetic resonance, microfluidics, microchips.

    Mass spectrometry is an important technique in molecular and cell biology.

    Recent advances automate the mass spectrometry workflow: excision of protein spots, enzymatic digestion, acquisition of mass spectra, and database searching.

    Techniques for analysing modified proteins and for quantification have also been developed.


    Servers available for Protein Identification through Mass Spectrometry

    For PMF                        For Sequence Query             For MS/MS Ion Search
    ASCQ_ME                        Mascot                         InsPecT
    Bupid                          MS-Seq (Protein Prospector)    Mascot
    Mascot                         TagIdent                       MS-Seq (Protein Prospector)
    MassSearch                                                    OMSSA
    MS-Fit (Protein Prospector)                                   PepFrag (Prowl)
    PepMAPPER                                                     PepProbe
    ProFound (Prowl)                                              RAId_DbS
    Mowse                                                         Sonar (Knexus)
    PeptideSearch                                                 X!Tandem (The GPM)


    Mascot

    Software search engine

    Uses mass spectrometry data to identify proteins from sequence databases

    Mascot is unique

    Widely used

    Freely available on the public server hosted by Matrix Science

    A license is required for in-house use


    Mascot Server

    Gives excellent results with peak lists from instruments manufactured by:

    Agilent, Bruker, Thermo Scientific, Waters, AB Sciex, Shimadzu

    Reasons for in-house use:

    Data sets that exceed the 1200 spectrum limit

    Confidentiality

    Automation

    Adding and editing modifications, enzymes, quantitation methods, etc.

    The time taken by a search depends on the number of processors.


    Three proven ways of using mass spectrometry data

    Peptide mass fingerprint

    Uses the molecular masses of the peptides resulting from digestion of a protein by a specific enzyme.

    Sequence query

    Mass values combined with amino acid sequence or composition data.

    MS/MS ions search

    Uninterpreted MS/MS data from a single peptide or from a complete LC-MS/MS run.
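
To make the peptide mass fingerprint idea concrete, the sketch below digests a protein sequence in silico and computes the monoisotopic mass of each peptide. It assumes trypsin (cleavage after K or R, but not before P) and unmodified residues; the function names are hypothetical and this is an illustration, not Mascot's own implementation.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O added for the free N- and C-termini


def tryptic_digest(protein):
    """Split after K or R unless the next residue is P (assumed trypsin rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Monoisotopic neutral mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in tryptic_digest("MKTAYIAKQRPGLK"):
        print(pep, round(peptide_mass(pep), 4))
```

The list of masses produced this way is what a fingerprint search compares with the experimental peak list.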



    Peptide Mass Fingerprint

    Peak picking

    Find a utility to convert the raw data into a peak list

    Mass values matter most

    Get as many peptide masses as possible in the range 1000 to 3500 Da

    To perform a search

    Paste your peak list or upload it as a file

    Enter values for the search parameters

    After submission, you receive the results: a list of matching proteins.
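
The matching step can be pictured with a deliberately naive sketch: count how many observed masses fall within a tolerance of each protein's theoretical peptide masses, then rank the proteins. The function names, the toy database, and the simple match count are assumptions for illustration; Mascot uses a probability-based score rather than a raw count.

```python
def count_matches(observed, theoretical, tol_da=0.2):
    """Count observed masses lying within tol_da of any theoretical mass."""
    return sum(
        any(abs(obs - theo) <= tol_da for theo in theoretical)
        for obs in observed
    )


def rank_proteins(observed, database, tol_da=0.2):
    """Rank proteins by match count.

    `database` maps a protein name to its list of theoretical peptide masses.
    """
    hits = {
        name: count_matches(observed, masses, tol_da)
        for name, masses in database.items()
    }
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_db = {"P1": [1000.0, 1500.0, 2000.0], "P2": [1000.1, 3000.0]}
    print(rank_proteins([1000.05, 1500.1, 2000.3], toy_db))
```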


    Peptide Mass Fingerprinting

    Fast, simple analysis

    High sensitivity

    Needs a database of protein sequences, not ESTs

    The sequence must be present in the database

    Not good for mixtures

    Start with Swiss-Prot

    A protein hit is significant if its expect value is below 0.05



    MS/MS Ions Search

    The sample can be a single protein or a complex mixture.

    Chromatography is used to regulate the flow of peptides into the mass spectrometer.

    Peptides are selected one at a time using the first stage of mass analysis. Each isolated peptide is then induced to fragment, and the second stage of mass analysis is used to collect an MS/MS spectrum.

    Software determines which peptide sequence in the database gives the best match.

    The degree of matching is scored.


    Fragment ion structures

    Peptide molecular ions fragment at preferred locations along the backbone.

    The major peaks are b and y ions; which series appear depends on the ionization technique, the mass analyser, and the peptide structure.

    If peptides fragmented cleanly, we wouldn't need a database search: there would be a ladder of peaks for each ion series.

    Fragmentation is rarely perfect.
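
The "ladder of peaks" can be sketched by computing the theoretical b and y ion series for a peptide. The example assumes singly charged ions, monoisotopic masses, and no modifications; `fragment_ions` is a hypothetical helper name.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        # b ions contain the first i residues (N-terminal fragment).
        b_ions.append(sum(masses[:i]) + PROTON)
        # y ions contain the last i residues plus water (C-terminal fragment).
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDE")
    print([round(m, 3) for m in b])
    print([round(m, 3) for m in y])
```

Adjacent peaks within one series differ by a residue mass, which is what would let the sequence be read directly if fragmentation were complete.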


    MS/MS ion search

    Results are complicated to report

    The report lists a series of proteins and the peptide matches that have been assigned to them.

    The report uses a pop-up window to show the alternative peptide matches

    The top match has a high score


    MS/MS ion search

    Easily automated

    Searches can be slow with:

    No enzyme specificity

    Several variable modifications

    A large dataset

    A large database

    MS/MS identifies peptides; proteins are inferred from the peptide matches


    Sequence Query


    Sequence tag search

    Even when the quality of the spectrum is poor, it is possible to pick out a minimum of four clean peaks

    A few residues of amino acid sequence are interpreted from them

    Mann and Wilm realized that this very short stretch of amino acid sequence might provide sufficient specificity for identification if it was combined with the fragment ion mass values that enclose it, the peptide mass, and the enzyme specificity.

    Picking out a good tag requires both luck and experience.

    Requires interpretation of the spectrum

    Usually manual, hence not high throughput

    The tag has to be called correctly


    Peptide sequence tag

    The standard sequence tag search is obsolete.

    It is easier to skip the interpretation step and pass the peak list to the search engine.

    Rapid search times

    Error tolerant


    Search parameters

    Name, Email and Search Title

    The name and email are saved as a browser cookie. If Mascot security is enabled, this information is taken from the user database.

    The email address is used for sending results.


    Databases

    Swiss-Prot (~500000 entries)

    Best annotated database, ideal for PMF

    NCBI nr and UniRef100 (~19000000 entries)

    Large, comprehensive, best choice for MS/MS

    EST databases (>400000000 entries in translation)

    Huge, not advisable for PMF

    Single genome databases

    Not suitable for PMF

    cRAP and Contaminants


    Database

    Choose the right database

    In Mascot 2.3 and later, you can select multiple databases

    You cannot mix amino acid (AA) and nucleic acid (DNA) databases.

    Comprehensive database repositories such as NCBI and EBI let you download nr, GenBank, Swiss-Prot, EMBL, TrEMBL, etc.

    When searching a single organism, always include a database of common contaminants.

    If you are interested in a bacterium or plant, try comprehensive protein databases, e.g. NCBInr and UniRef100.


    Nucleic Acid Databases

    Mascot always performs a 6-frame translation

    It translates the entire sequence and does not look for a start codon to begin

    When a stop codon is encountered, it leaves a gap

    It uses the correct genetic code, as long as the taxonomy is known.
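
A minimal sketch of a six-frame translation is shown below. It assumes the standard genetic code only (the slide notes that Mascot chooses the code from the taxonomy), writes stop codons as `*`, and uses hypothetical function names.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from position 0, ignoring start codons; unknown codons -> X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse)
        for offset in range(3)
    ]


def stop_free_segments(frame):
    """Split a translated frame at stop codons, which act as gaps."""
    return [segment for segment in frame.split("*") if segment]


if __name__ == "__main__":
    for frame in six_frame_translation("ATGAAATAGGCC"):
        print(frame, stop_free_segments(frame))
```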


    Taxonomy

    Restricting the taxonomy speeds up the search

    Gives a simpler report

    Keep the indexes up to date

    Check the stats file for each database.

    If the correct protein from the correct species is not in the database, don't specify a very narrow taxonomy.


    Enzyme

    First choice

    The number of allowed missed cleavage sites can be set to zero

    Choose a setting of 1 or 2 when you're not sure about your sample

    A higher number increases the number of calculated peptide masses.

    Use no enzyme only in exceptional cases, and never for PMF

    The list is user configurable.
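
The effect of the missed cleavage setting on the number of calculated peptides can be seen in a short sketch. It assumes a trypsin-like rule (cleave after K or R, not before P); `digest` is a hypothetical helper, not a Mascot function.

```python
import re


def digest(protein, missed_cleavages=0):
    """Return peptides with up to `missed_cleavages` uncleaved sites each."""
    fragments = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for start in range(len(fragments)):
        for extra in range(missed_cleavages + 1):
            if start + extra < len(fragments):
                # Join neighbouring fragments to model a missed cleavage.
                peptides.append("".join(fragments[start:start + extra + 1]))
    return peptides


if __name__ == "__main__":
    for allowed in range(3):
        print(allowed, len(digest("AKCKDRGHKLMR", allowed)))
```

Each extra allowed missed cleavage adds peptides to test, which is why a higher setting enlarges the search space.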


    Modifications

    Fixed modifications

    Variable modifications, e.g. post-translational modifications

    Display all modifications

    Keep the number of variable modifications small

    Some modifications are worse than others

    Mods that affect a terminus are less of a problem, e.g. Pyro-glu

    Mods that apply to residues with a high fractional abundance, at any position, are a big problem, e.g. Phospho (ST)


    Modifications

    Post-translational

    Phosphorylation, acetylation

    Artifacts

    Oxidation, acetylations

    Derivatization

    Alkylation of cysteine

    Sequence variants

    Errors, SNPs, other variants

    The complete list is taken from Unimod

    The cysteine modification depends on the alkylating agent: iodoacetamide (carbamidomethyl), iodoacetic acid (carboxymethyl), or MMTS (methylthio).


    Phosphorylation

    Site heterogeneity

    Poor ionization efficiency

    3 fragmentation channels

    Intact fragments

    Neutral loss of HPO3 (80 Da)

    Neutral loss of H3PO4 (98 Da)

    Can occur at S, T and Y, which make up ~16% of residues


    Protein mass

    Mass of the intact protein in kDa.

    If this field is left blank, there is no restriction on protein mass.

    Slows down the search a little.


    Tolerance

    Peptide tolerance: error window on experimental peptide mass values

    MS/MS tolerance: error window on fragment ion mass values

    Units: percentage, milli-mass units (mmu), parts per million (ppm), or Daltons (Da).

    Protein/peptide view includes a graph of the mass errors for fragment ions.

    Specifying too tight a peptide tolerance is a common reason for failing to get a match.

    A more appropriate tolerance in MS/MS is often +/- 0.3 Da.
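
Because the tolerance can be entered in four different units, it helps to see how each converts to an absolute window in Daltons. The sketch below is an assumption-labelled illustration; the unit strings and function names are hypothetical and do not reflect Mascot's parameter syntax.

```python
def tolerance_in_da(mass, value, unit):
    """Convert a tolerance to Daltons for a given mass.

    Supported units (labels chosen for this sketch): "Da", "mmu", "ppm", "%".
    """
    factors = {
        "Da": 1.0,            # absolute
        "mmu": 1e-3,          # milli-mass units, absolute
        "ppm": mass * 1e-6,   # relative to the mass
        "%": mass * 1e-2,     # relative to the mass
    }
    return value * factors[unit]


def within_tolerance(observed, calculated, value, unit):
    """True if the observed mass lies inside the error window."""
    return abs(observed - calculated) <= tolerance_in_da(calculated, value, unit)


if __name__ == "__main__":
    for unit, value in [("Da", 0.3), ("mmu", 300), ("ppm", 10), ("%", 0.01)]:
        print(unit, tolerance_in_da(2000.0, value, unit))
```

Relative units (ppm, %) give a window that widens with mass, while Da and mmu stay fixed.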


    Mass type

    Average or monoisotopic.

    Monoisotopic mass is calculated from the most abundant natural isotope of each element; it is the first peak of the isotope distribution.

    Average mass is the chemical mass, the centre of gravity of the isotope distribution.

    The difference is approximately 0.06%.

    If you get this setting wrong, the mass errors will be very large.
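
The two mass types can be compared directly from an elemental composition. The sketch below uses commonly tabulated atomic masses (rounded) and a hypothetical `composition_mass` helper; the glycine residue is only a small worked example, so its relative difference will not exactly equal the typical figure quoted above.

```python
# Atomic masses in Da: (monoisotopic, average). Values are rounded.
ATOMIC_MASS = {
    "C": (12.0, 12.0107),
    "H": (1.007825, 1.00794),
    "N": (14.003074, 14.0067),
    "O": (15.994915, 15.9994),
    "S": (31.972071, 32.065),
}


def composition_mass(composition, average=False):
    """Sum atomic masses for a composition such as {"C": 2, "H": 3}."""
    index = 1 if average else 0
    return sum(ATOMIC_MASS[el][index] * n for el, n in composition.items())


if __name__ == "__main__":
    glycine_residue = {"C": 2, "H": 3, "N": 1, "O": 1}
    mono = composition_mass(glycine_residue)
    avg = composition_mass(glycine_residue, average=True)
    print(round(mono, 5), round(avg, 5), f"{100 * (avg - mono) / mono:.3f}%")
```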


    Data (PMF)

    The Query window is used to enter mass values when there is no data file.

    The data format is auto-detected.

    List of mass values, one per line. If a second value is present, it is assumed to be intensity. Any further values on the same line are ignored.

    Mascot also supports other peak list formats:

    Applied Biosystems Data Explorer (.pkm)

    Bruker Analysis AutoXecute data report

    Bruker XML

    mzData (1.05)

    mzML
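
The simple list format described above (mass first, optional intensity second, anything further ignored) can be read with a few lines of code. This sketch assumes whitespace-separated values and uses a hypothetical function name; it is not Mascot's parser.

```python
def parse_peak_list(text):
    """Parse a peak list into (mass, intensity) tuples.

    The first value on a line is the mass, an optional second value is the
    intensity, and any further values are ignored. Blank lines are skipped.
    """
    peaks = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        mass = float(fields[0])
        intensity = float(fields[1]) if len(fields) > 1 else None
        peaks.append((mass, intensity))
    return peaks


if __name__ == "__main__":
    example = "1046.54 1200\n1297.68 830 ignored\n\n1672.92"
    print(parse_peak_list(example))
```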


    Data (MS/MS)

    The format cannot be auto-detected, and must be specified.

    Instrument

    Type of instrument used to acquire the data.

    This setting determines which fragment ion series will be used for scoring.


    Report

    AUTO displays only the protein hits with significant scores, plus one additional hit after the cutoff at the significance threshold.


    Final tip

    Beware of

    Narrowing the taxonomy

    Reducing mass tolerances

    Removing modifications

    Selecting spectra or mass values

    Set search parameters using standard samples


    Types of Summary Report


    Scoring and statistics

    A list of proteins

    Some matches are not statistically significant.

    The score threshold for this search is 76, and the top scoring match is 47.

    The area shaded green indicates random, meaningless matches.


    Probability based scoring

    Scores whether the match is random or not.

    Probability: the chance that the observed match is a random event.

    A real match, which is not random, has a very low probability.

    Reject anything with a probability greater than a chosen threshold

    The Mascot score is $-10\log_{10}(P)$, where $P$ is that probability


    Significance thresholds

    The threshold is calculated from the number of trials

    $P = 1/(20 \times 500000)$, that is, a 1 in 20 chance of a random match across 500000 trials

    Standard score

    MudPIT score
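
The relationship between probability, score, and significance threshold can be checked numerically. The sketch assumes the score is defined as -10 times the base-10 logarithm of the probability and that the threshold probability is the significance level divided by the number of trials, as in the example of 1/(20 x 500000); the function names are hypothetical.

```python
import math


def mascot_score(probability):
    """Score defined as -10 * log10(P)."""
    return -10 * math.log10(probability)


def significance_threshold(n_trials, alpha=0.05):
    """Score a match must exceed to be significant at level alpha."""
    return mascot_score(alpha / n_trials)


if __name__ == "__main__":
    # 1 in 20 (alpha = 0.05) across 500000 trials, as in the slide.
    print(round(significance_threshold(500000), 2))
```

A larger number of trials (for example a larger database) raises the threshold, so the same match needs a higher score to be significant.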


    Expectation value

    The number of times you could expect to get this score or better by chance

    $E = P_{\text{threshold}} \times 10^{(S_{\text{threshold}} - \text{score})/10}$

    A completely random match has an expectation value of 1 or more

    The better the match, the smaller the expectation value.
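
The expectation value formula is easy to evaluate directly. The sketch below assumes a threshold probability of 0.05 by default and uses the threshold of 76 and top score of 47 quoted earlier in the deck as inputs; `expect_value` is a hypothetical name.

```python
def expect_value(score, threshold_score, p_threshold=0.05):
    """E = P_threshold * 10 ** ((S_threshold - score) / 10)."""
    return p_threshold * 10 ** ((threshold_score - score) / 10)


if __name__ == "__main__":
    # A score well below the threshold gives an expect value far above 1.
    print(expect_value(47, 76))
    # A score equal to the threshold gives exactly the threshold probability.
    print(expect_value(76, 76))
```

Every 10 points of score above the threshold lowers the expectation value by a factor of ten.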


    Error tolerant search

    Finds new matches by introducing mass shifts

    Take query 218. The observed mass difference could correspond to either carbamidomethylation or carboxymethylation at the N-terminus.

    Since the sample was alkylated with iodoacetamide, carbamidomethylation is very believable; it is a known artefact of over-alkylation.


    Phosphorylation site localization

    Scores for confident site localization: Ascore, PTM score and MD-score

    MD-score: the score difference between the top two matches

    Depends on the fragmentation technique

    The ability to localize increases with increasing distance between candidate sites

    The MD-score does not require complex computation


    Validation (Decoy)

    False discovery rate.

    The most reliable approach is a decoy database.

    Decoy entries can be in a separate database or concatenated to the target entries.


    Decoy

    Search a decoy database

    Very simple

    Repeat the search against the decoy database

    Matches that are found in the decoy database are false positives.

    It isn't useful when there is only a small number of spectra.
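
One common way to turn decoy matches into a false discovery rate estimate is to divide the number of decoy matches above a score threshold by the number of target matches above the same threshold. The sketch below assumes separate target and decoy searches and this simple ratio; the function name and inputs are hypothetical.

```python
def false_discovery_rate(target_scores, decoy_scores, threshold):
    """Estimate FDR as decoy matches / target matches at a score threshold."""
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    target = [82, 75, 64, 51, 47, 33, 28]
    decoy = [49, 31, 22]
    for cutoff in (30, 45, 60):
        print(cutoff, round(false_discovery_rate(target, decoy, cutoff), 3))
```

With only a handful of spectra the counts are too small for the ratio to be meaningful, which matches the caveat above.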


    Decoy

    A utility to create a decoy database

    A reversed or randomised sequence of the same length is automatically generated and tested.

    The average amino acid composition of the random sequences is the same as that of the target sequences

    The matches and scores for the decoy sequences are recorded separately in the result file.
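
Generating a decoy sequence of the same length can be sketched in two ways: reversing the target sequence or shuffling its residues. Shuffling is used here as one simple way to randomise; it preserves the exact composition of each sequence, which is a stronger property than the average composition mentioned above. The function names are hypothetical.

```python
import random


def reversed_decoy(sequence):
    """Decoy made by reversing the target sequence."""
    return sequence[::-1]


def shuffled_decoy(sequence, seed=0):
    """Decoy made by shuffling residues; length and composition are kept."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


if __name__ == "__main__":
    target = "MKTAYIAKQRPGLK"
    print(reversed_decoy(target))
    print(shuffled_decoy(target))
```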


    Mascot Daemon

    Automates the submission of data files

    Batch mode

    Real-time monitor mode

    Follow-up tasks


    Mascot Distiller

    Access all of the popular data formats

    Produce high quality peak lists

    Submit and review Mascot search results

    Perform de novo sequencing and interpret sequence tags for tag searches


    References

    http://www.matrixscience.com

    Mikhail M. S., Simone L., Markus B., Manja L., Toby M., Marcus B., Bernard K., The American Society for Biochemistry and Molecular Biology (2011)

    Ville R. Koskinen, Patrick A. Emery, David M. Creasy, and John S. Cottrell, Molecular and Cellular Proteomics (2011)

    Elias, J. E. and Gygi, S. P., Nature Methods 4, 207-214 (2007)
