Proteomics software available in the public domain.

Pratik Jagtap Minnesota Supercomputing institute

Two-Dimensional gel electrophoresis

pI

Mw

Proteins are resolved based on their isoelectric point (using isoelectric focusing) and then molecular weight (using SDS-PAGE).

Gels are compared, differentially expressed proteins are excised and identified.

Proteomics Fifteen Years Ago…

Search algorithm 

Mass Spectrometry

Data Extraction. Analysis Software that correlates the protein ID to the excised gel spot.

Two-Dimensional gel electrophoresis

pI

Mw

2DGE limitations: high-molecular-weight proteins, low-molecular-weight proteins, proteins with extreme isoelectric points, and membrane proteins were underrepresented in the analysis.

Multi-Dimensional Protein Identification Technology

Proteomics workflow

Protein Peptide

Mass spectrum

Fragmentation

Search against database.

mass spectrometry

Mass Spectrometers & data formats

Thermofinnigan Xcalibur / .raw

Life Technologies Analyst / .wiff ; .t2d

Waters Masslynx / .raw

Bruker .baf

Mascot .mgf .dat

Sequest .dta .out

X! Tandem .xml

OMSSA .xml .omx

mzXML mzData pepXML protXML

mzML

ProteinPilot .t2d .group

Proteo-Informatics

Mass Spectrometry

Data Extraction.

Data Conversion.

Search algorithm

Statistical validation of peptide and protein identifications.

De novo Tools.

Spectral Matching

Data Dissemination

Quantitative Tools.

Targeted Proteomics

Data extraction

ReAdW

http://www.ionsource.com/functional_reviews/readw/t2x_update_readw.htm

ReAdW converts Xcalibur .raw files to the universal mzXML format.

T2D Extractor

https://www.prime-sdms.org/PRIMEInstallationSite/MSViewer/T2DExtractor.zip

•  A tool that can access the Applied Biosystems MALDI-TOF/TOF 4700 and 4800 database and can extract T2D files as well as peak lists.

•  It can be used to extract individual spectra, runs, or entire spotsets. MS/MS peak lists are provided in .mgf format.

•  Runs on the Java 1.5 platform.

•  LCMS Peaklist Extractor: batch-mode tool for extracting concatenated .mgf peak list files.

•  Quantitation Extractor: batch-mode tool for extracting areas for peaks in MS/MS spectra.

data conversion

mzXML2Other

http://www.proteomecommons.org/current/522/

Converter from mzXML to Sequest .dta, Mascot generic (.mgf) and Micromass .pkl formats.

Peak List Conversion Utility (Java Web Start)

https://proteomecommons.org/tool.jsp?i=1012

The ProteomeCommons.org IO Framework's tool for converting peak list and spectrum files between different formats. The tool can also merge multiple peak lists into a single concatenated peak list. It uses Java Web Start and runs locally on your computer.

MassMatrix File Conversion Tools

http://searcher.rrc.uic.edu/mm-docs/downloads/MM_File_Conversion_1p0.exe

These tools convert between common input formats: .RAW, .mzXML, .MGF.
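Most of the conversions above re-serialize the same peak list in a different layout. As a minimal sketch (not the code of any tool named here), the Python below reads the text-based Mascot generic format (.mgf) and writes a Sequest-style .dta record, whose first line holds the singly protonated precursor mass and the charge. The function names `read_mgf` and `to_dta` and the sample spectrum are hypothetical, and the sketch assumes each spectrum carries `PEPMASS` and `CHARGE` lines.

```python
import io

PROTON = 1.007276  # mass of a proton in Da

# Hypothetical example spectrum in MGF layout.
EXAMPLE_MGF = """BEGIN IONS
TITLE=demo
PEPMASS=500.25
CHARGE=2+
114.11 1200.0
175.12 800.5
END IONS
"""


def read_mgf(handle):
    """Yield one dict per BEGIN IONS ... END IONS block."""
    spectrum = None
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # KEY=value header line inside the block
                key, value = line.split("=", 1)
                spectrum["params"][key.upper()] = value
            else:
                # peak line: m/z followed by intensity
                fields = line.split()
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                spectrum["peaks"].append((float(fields[0]), intensity))


def to_dta(spectrum):
    """Serialize one spectrum as .dta text: '(M+H)+ charge', then peaks."""
    params = spectrum["params"]
    precursor_mz = float(params["PEPMASS"].split()[0])
    charge = int(params.get("CHARGE", "1+").rstrip("+"))
    # Convert the observed precursor m/z to the singly protonated mass.
    mh = precursor_mz * charge - (charge - 1) * PROTON
    lines = [f"{mh:.4f} {charge}"]
    lines += [f"{mz:.4f} {inten:.1f}" for mz, inten in spectrum["peaks"]]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for spec in read_mgf(io.StringIO(EXAMPLE_MGF)):
        print(to_dta(spec), end="")
```

Vendor binary formats such as .raw cannot be handled this way; they need the vendor libraries that tools like ReAdW build on.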

search algorithm

X! Tandem & the GPM (search algorithm)

http://www.thegpm.org/TANDEM/index.html

•  X! Tandem can be used as a web-based application or deployed locally using precompiled binaries and FASTA-formatted files.

•  X! Tandem takes its input in .xml format and writes its output in .xml format.

•  The data analysis components consist of the input file, FASTA sequences, taxonomy, parameters and output.

•  Central axiom: “For each identifiable protein, there is at least one detectable tryptic peptide.”

•  It searches extensively for modified and non-enzymatic peptides only on identified proteins.

•  How far is the top-scoring match from the rest of the pack? X! Tandem uses an E-value. Much faster than Sequest’s Xcorr.
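To make the XML input concrete, here is a minimal sketch that writes a top-level X! Tandem input file with Python's standard library. The `<bioml>`/`<note>` layout and the five labels follow the example `input.xml` distributed with X! Tandem as best recalled; treat them as assumptions and check them against the installed version. The function name `build_tandem_input` and all file names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_tandem_input(defaults, taxonomy, taxon, spectra, output):
    """Return the text of a top-level X! Tandem input file (assumed labels)."""
    root = ET.Element("bioml")
    entries = [
        ("list path, default parameters", defaults),    # parameter defaults
        ("list path, taxonomy information", taxonomy),   # maps taxon -> FASTA
        ("protein, taxon", taxon),                       # which taxon to search
        ("spectrum, path", spectra),                     # input spectra
        ("output, path", output),                        # result file
    ]
    for label, value in entries:
        note = ET.SubElement(root, "note", {"type": "input", "label": label})
        note.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_tandem_input("default_input.xml", "taxonomy.xml",
                             "human", "spectra.mgf", "output.xml"))
```

The five entries correspond to the components listed above: parameters, taxonomy (which points to the FASTA files), input spectra and output.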

The Global Proteome Machine Organization

•  X! Hunter

•  X! P3

•  Common

OMSSA (search algorithm)

•  OMSSA takes experimental MS/MS spectra, filters noise peaks, extracts m/z values, and then compares these m/z values to calculated m/z values derived from peptides produced by an in silico digestion of a protein sequence library.

•  Calculates an E-value as a discriminant score.

•  The E-value for a hit is the expected number of random hits from a search library to a given spectrum such that the random hits have an equal or better score than the hit.

•  It uses classical hypothesis testing based on the type of statistical model that is used in BLAST.

•  Faster; Runs on all platforms
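The steps described above (in silico digestion, calculated m/z values, peak matching, E-value) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not OMSSA's implementation: the digestion allows no missed cleavages, only singly charged b and y ions are generated, and the E-value models the number of random peak matches as Poisson-distributed with a mean supplied by the caller. All function names are hypothetical.

```python
import math
import re

# Monoisotopic amino acid residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def fragment_mz(peptide):
    """Singly charged b- and y-ion m/z values for every backbone position."""
    masses = [MONO[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(masses[:i]) + PROTON)           # b ion, prefix of length i
        ions.append(sum(masses[i:]) + WATER + PROTON)   # complementary y ion
    return ions


def count_matches(observed, theoretical, tol=0.5):
    """Number of theoretical ions with an observed peak within +/- tol."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)


def poisson_e_value(matches, mean_random_matches, n_candidates):
    """Expected number of candidates reaching >= matches purely by chance."""
    p_fewer = sum(
        math.exp(-mean_random_matches) * mean_random_matches ** k / math.factorial(k)
        for k in range(matches)
    )
    return n_candidates * (1.0 - p_fewer)


if __name__ == "__main__":
    for pep in tryptic_digest("AAKPGGRCCKDD"):
        print(pep, [round(mz, 3) for mz in fragment_mz(pep)])
    print(poisson_e_value(matches=6, mean_random_matches=1.0, n_candidates=1000))
```

A small E-value means that few random candidates would be expected to match the spectrum as well as the reported hit.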

http://pubchem.ncbi.nlm.nih.gov/omssa/

MaxQuant (search algorithm)

•  MaxQuant is an integrated suite of algorithms specifically developed for high-resolution, quantitative MS data.

•  MaxQuant detects peaks, isotope clusters and stable amino acid isotope-labeled (SILAC) peptide pairs as three-dimensional objects in m/z, elution time and signal intensity space.

•  By integrating multiple mass measurements, mass accuracy in the p.p.b. range is achieved.

•  MaxQuant quantifies several hundred thousand peptides per SILAC-proteome experiment.
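Mass accuracy is usually expressed as a relative error in parts per million (ppm) or parts per billion (ppb). The sketch below shows the arithmetic and why combining several measurements of the same peptide can tighten the estimate. The measured values are invented for illustration, a plain arithmetic mean is assumed, and the helper `ppm_error` is hypothetical; this is not MaxQuant's procedure.

```python
from statistics import mean


def ppm_error(measured_mz, theoretical_mz):
    """Relative mass error in parts per million (1 ppm = 1000 ppb)."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


THEORETICAL_MZ = 500.0
# Hypothetical repeated measurements of the same peptide across scans.
measurements = [500.0012, 499.9990, 500.0004, 499.9996]

single_errors = [ppm_error(m, THEORETICAL_MZ) for m in measurements]
averaged_error = ppm_error(mean(measurements), THEORETICAL_MZ)

if __name__ == "__main__":
    print("per-scan errors (ppm):", [round(e, 2) for e in single_errors])
    print("error of the mean (ppm):", round(averaged_error, 2))
```

In this made-up example the individual scans are off by up to 2.4 ppm, while the mean is off by about 0.1 ppm (100 ppb).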

http://www.maxquant.org/

De novo tools

De novo analysis

Protein Peptide

Mass spectrum

Fragmentation

Search against database.

•  De novo analysis: generate a sequence from the spectrum and match it against a database using BLAST.

PepNovo (de novo tools)

•  PepNovo is software for de novo sequencing of peptides from mass spectra.

•  PepNovo uses a probabilistic network to model the peptide fragmentation events in a mass spectrometer.

•  In addition, it uses a likelihood ratio hypothesis test to determine if the peaks observed in the mass spectrum are more likely to have been produced under the fragmentation model, than under a probabilistic model that treats the appearance of peaks as random events.

http://peptide.ucsd.edu/pepnovo.html

Lutefisk (de novo tools)

•  LUTEFISK uses a graph theory approach for de novo peptide sequence determinations from low-energy collision-induced dissociation (CID) data of tryptic peptides.

•  Lutefisk converts all of the ions into their corresponding b-ion masses by making N- and C-terminal “evidence lists” that contain evidence for cleavage at every possible b-ion mass. Once the sequence spectrum has been established, the program proceeds by tracing sequences starting at the N-terminus.

•  Highest ranked sequences are subjected to a cross-correlation analysis and scores are combined and normalized to produce a final score and ranking.
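The conversion of ions to b-ion masses rests on a simple relation for singly charged fragments: a b ion and its complementary y ion sum to the singly protonated precursor mass plus one proton. The sketch below shows only that arithmetic. It is not Lutefisk's code, the function names are hypothetical, and a real evidence list would also weigh intensities and further ion types.

```python
PROTON = 1.007276  # mass of a proton in Da


def y_to_b(y_mz, precursor_mh):
    """Convert a singly charged y-ion m/z to its complementary b-ion m/z.

    Uses b + y = [M+H]+ + proton for singly charged fragments.
    """
    return precursor_mh + PROTON - y_mz


def evidence_list(peak_mzs, precursor_mh):
    """Candidate b-ion masses for each peak under both interpretations."""
    evidence = []
    for mz in peak_mzs:
        evidence.append((mz, "N-terminal"))                        # peak read as a b ion
        evidence.append((y_to_b(mz, precursor_mh), "C-terminal"))  # peak read as a y ion
    return sorted(evidence)


if __name__ == "__main__":
    # Dipeptide GK: [M+H]+ = 204.134261, y1 = 147.112801, b1 = 58.028736
    print(evidence_list([147.112801], 204.134261))
```

Once every peak has been mapped onto this common b-ion mass scale, sequence tracing amounts to looking for mass differences between candidates that equal an amino acid residue mass.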

http://sourceforge.net/projects/lutefiskxp

spectral matching

X! Hunter (spectral matching)

•  X! Hunter is a search engine that compares experimentally observed spectra directly with a library of spectra that have been confidently assigned to a particular peptide sequence (an Annotated Spectrum Library, or ASL).

•  It can identify proteins using information from the large number of spectra in the GPMDB database.

•  Creation of ASLs : 1) Confident assignments for human and yeast peptides were extracted from GPMDB. 2) Replicate observations of the same peptide were averaged together and a final list of averaged peptide spectra was produced.

•  Because the sequence modifications and cleavage sites for the peptides in the sequence library are already known, it is not necessary to specify as many parameters for this type of search as in more conventional search engines.

•  This type of pattern matching tool is ideal for applications such as biomarker discovery.

http://www.thegpm.org

MS-Clustering (spectral matching)

http://proteomics.bioprojects.org/MassSpec

MS-Clustering of MS/MS spectra takes advantage of dataset redundancy by identifying multiple spectra of the same peptide and replacing them with a single representative spectrum.

Analyzing only representative spectra results in significant speed-up of MS/MS database searches.

Large MS/MS data sets (over 10 million spectra) were reduced to smaller data sets, which resulted in a higher number of peptide identifications compared to regular non-clustered searches.
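The idea of replacing redundant spectra with one representative can be sketched with a greedy pass over the data. The code below is an illustration only and not the MS-Clustering algorithm: it assumes 1 Da bins, cosine similarity, a fixed threshold of 0.8, and a summed spectrum as the cluster representative. A practical tool would also compare precursor masses before comparing peaks. All names are hypothetical.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=1.0):
    """Turn a list of (m/z, intensity) pairs into a sparse binned vector."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(spectra, threshold=0.8):
    """Greedy clustering: join the first cluster that is similar enough."""
    clusters = []  # each entry: {"rep": summed binned spectrum, "members": indices}
    for idx, peaks in enumerate(spectra):
        vec = binned(peaks)
        for c in clusters:
            if cosine(vec, c["rep"]) >= threshold:
                c["members"].append(idx)
                for k, v in vec.items():
                    c["rep"][k] += v  # update the representative spectrum
                break
        else:
            clusters.append({"rep": defaultdict(float, vec), "members": [idx]})
    return clusters


if __name__ == "__main__":
    s1 = [(100.2, 10.0), (200.1, 5.0)]
    s2 = [(100.3, 9.0), (200.4, 6.0)]   # near-duplicate of s1
    s3 = [(300.0, 10.0), (400.0, 10.0)]
    print([c["members"] for c in cluster([s1, s2, s3])])
```

Only the representatives would then be submitted to the database search, which is where the speed-up comes from.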

Statistical validation of peptide and protein identifications.

Trans-Proteomic Pipeline (statistical validation of peptide and protein identifications)

Trans-Proteomic Pipeline (TPP) is a data analysis pipeline for the analysis of LC/MS/MS proteomics data. TPP includes modules for validation of database search results, quantitation of isotopically labeled samples, and validation of protein identifications, as well as tools for viewing raw LC/MS data, peptide identification results, and protein identification results. The XML backbone of this pipeline enables a uniform analysis for LC/MS/MS data generated by a wide variety of mass spectrometer types, and assigned peptides using a wide variety of database search engines.

PepArML (statistical validation of peptide and protein identifications)

http://mac.softpedia.com/get/Math-Scientific/PepArML.shtml

A model-free, result-combining peptide identification arbiter via machine learning.

[Diagram labels: X! Tandem, Mascot, OMSSA, Other; PepArML; Feature extraction]

Quantitative tools

iTRAQ™ : Isobaric Tags for Relative and Absolute Quantification.

[Figure: iTRAQ labeling scheme. Each isobaric tag (total mass = 145) consists of a reporter group, a balance group and a peptide reactive group (PRG); the reporter/balance pairs are 114/31, 115/30, 116/29 and 117/28. Samples are digested with trypsin, labeled, mixed and analyzed by MS and MS/MS, where the charged reporter ions (m/z 114 to 117) are released and the balance group is lost as a neutral. Example MS/MS spectrum: peptide QGQPIGLGEASNDTWITTK, % intensity versus mass (m/z).]

Proteomics Quantitation

i-Tracker (quantitative tools)

•  i-Tracker is an open-source peptide quantitation algorithm that allows the user to extract reporter ion peak ratios from non-centroided peak lists.

•  The algorithm uses .dta and .mgf files as inputs. The reporter ion areas are calculated and corrected for their purity.

•  The .csv output of i-Tracker allows for the relative comparison of the iTRAQ labeled peptides.
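Extracting reporter ion signals from a peak list can be sketched as follows. This is a simplified illustration and not i-Tracker's algorithm: it sums peak intensities in a window around each nominal reporter m/z instead of integrating peak areas, and it omits the purity correction. The window of ±0.2 and the function names are assumptions.

```python
# Nominal iTRAQ 4-plex reporter ion m/z values.
REPORTERS = (114.1, 115.1, 116.1, 117.1)


def reporter_intensities(peaks, tol=0.2):
    """Sum the intensity of all peaks within +/- tol of each reporter m/z."""
    return {
        r: sum(inten for mz, inten in peaks if abs(mz - r) <= tol)
        for r in REPORTERS
    }


def reporter_ratios(intensities, reference=114.1):
    """Express every reporter channel relative to the reference channel."""
    ref = intensities[reference]
    return {r: (v / ref if ref else float("nan")) for r, v in intensities.items()}


if __name__ == "__main__":
    # Hypothetical low-mass region of one MS/MS spectrum.
    peaks = [(114.11, 100.0), (115.10, 200.0), (116.12, 50.0),
             (117.11, 100.0), (175.12, 999.0)]
    print(reporter_ratios(reporter_intensities(peaks)))
```

Writing one row of such ratios per spectrum gives a table comparable in spirit to the .csv output described above.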

http://www.dasi.org.uk/download/itracker.htm

TPP Quantitative Tools: ASAPRatio, XPRESS and Libra

ASAPRatio

http://tools.proteomecenter.org/ASAPRatio.php

Automated Statistical Analysis on Protein Ratio (ASAPRatio) accurately calculates the relative abundances of proteins and the corresponding confidence intervals from ICAT-type ESI-LC/MS data.

XPRESS

http://tools.proteomecenter.org/XPRESS.php

The XPRESS software calculates the relative abundance of proteins, such as those obtained from an ICAT-reagent labeled experiment, by reconstructing the light and heavy elution profiles of the precursor ions and determining the elution areas of each peak.
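The ratio of elution areas can be illustrated with a small numeric example. The sketch below assumes the light and heavy elution profiles have already been reconstructed as (retention time, intensity) points and integrates them with the trapezoidal rule. It is not the XPRESS implementation, and the function names and profile values are hypothetical.

```python
def trapezoid_area(profile):
    """Area under a profile given as time-sorted (retention_time, intensity) points."""
    return sum(
        (t2 - t1) * (i1 + i2) / 2.0
        for (t1, i1), (t2, i2) in zip(profile, profile[1:])
    )


def light_heavy_ratio(light, heavy):
    """Relative abundance as the ratio of light to heavy elution areas."""
    heavy_area = trapezoid_area(heavy)
    return trapezoid_area(light) / heavy_area if heavy_area else float("inf")


if __name__ == "__main__":
    # Hypothetical co-eluting precursor pair (time in minutes).
    light = [(0.0, 0.0), (1.0, 10.0), (2.0, 0.0)]
    heavy = [(0.0, 0.0), (1.0, 5.0), (2.0, 0.0)]
    print(light_heavy_ratio(light, heavy))
```

Integrating over the whole elution peak is less sensitive to a single noisy scan than comparing intensities at one time point.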

Libra

http://tools.proteomecenter.org/wiki/index.php?title=Software:Libra

Libra is a module within the Trans-Proteomic Pipeline that performs quantification on MS/MS spectra from iTRAQ-labeled samples.

APEX (quantitative tools)

•  The APEX Quantitative Proteomics Tool is a free and open source Java implementation of the APEX technique for the absolute quantitation of proteins based on standard LC-MS/MS proteomics data.

•  It uses machine learning techniques to improve quantitation accuracy for the label-free technique.

•  The APEX Tool provides an intuitive user interface, an integrated help system, and rich documentation.

•  A tutorial and a sample data set are included to help first-time users become acquainted with the system.

http://pfgrc.jcvi.org/index.php/bioinformatics/

MaxQuant (quantitative tool)

http://www.maxquant.org/

•  MaxQuant quantifies several hundred thousand peptides per SILAC-proteome experiment.

Targeted Proteomics

Biochemistry vs Proteomics

Targeted proteomics vs Shotgun Proteomics

MRM (targeted proteomics)

Quantitative Proteomics Results Prediction

Choose and Optimize Transitions

Selectivity, Sensitivity and Dynamic Range…

TIQAM (targeted proteomics)

•  TIQAM generates MRM transition lists and identifies the best performing transitions from MRM pre-experiments.

•  In addition TIQAM provides a viewer to validate transitions by MRM-triggered MS/MS experiments.

•  All the peptide and transition information is stored in a database to enable smart retrieval of the validated transitions for quantitative analysis.

•  Commercial software: MRMPilot (Applied Biosystems), SRM Workflow Software (Thermo Scientific), VerifyE (Waters) and Optimizer (Agilent Technologies).
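An MRM transition is a pair of m/z values: the precursor selected in the first quadrupole (Q1) and a fragment selected in the third (Q3). As a rough sketch of what a candidate transition list contains, the code below computes the doubly charged precursor and the singly charged y ions of a peptide. It is not TIQAM's method, the function name `candidate_transitions` is hypothetical, and picking the best-performing transitions still requires the MRM pre-experiments mentioned above.

```python
# Monoisotopic amino acid residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def candidate_transitions(peptide, precursor_charge=2):
    """Return (peptide, Q1 m/z, fragment label, Q3 m/z) for each y ion."""
    residues = [MONO[aa] for aa in peptide]
    neutral_mass = sum(residues) + WATER
    q1 = (neutral_mass + precursor_charge * PROTON) / precursor_charge
    rows = []
    for n in range(1, len(peptide)):
        # y_n contains the last n residues plus water and one proton.
        q3 = sum(residues[-n:]) + WATER + PROTON
        rows.append((peptide, round(q1, 4), f"y{n}", round(q3, 4)))
    return rows


if __name__ == "__main__":
    for row in candidate_transitions("GAK"):
        print(row)
```

Storing such rows in a database, together with measured intensities, is what allows validated transitions to be retrieved later for quantitative runs.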

http://tools.proteomecenter.org/TIQAM/TIQAM.html

X! P3 (targeted proteomics)

•  Uses identification of “proteotypic peptides” for identification of a protein. Because there will only be a few proteotypic peptides for a protein, it improves both the speed and accuracy of the resultant protein identifications.

•  The X! P3 (Proteotypic Peptide Profiler) project uses the following steps :

1.  In the first round, the spectrum data set is examined for the presence of proteotypic peptides. This is done by querying GPMDB to find the best peptides representative of a particular protein.

2. The full protein sequences of the proteins identified in the first round are then pulled from a sequence library.

3. Using this small set of full sequences, multiple rounds of refinement are performed to extract all of the non-proteotypic peptides from the full spectrum data set

•  An X! P3 server has been established for two model organisms, namely Homo sapiens and Saccharomyces cerevisiae, as well as several commonly observed experimental artifacts, such as BSA and trypsin.

http://www.thegpm.org

ftp://ftp.thegpm.org/proteotypic_peptide_profiles

Data Dissemination

Your answer is going to be determined by the question asked.

Prestomic (data dissemination)

http://code.google.com/p/prestomic

•  An open-source suite of tools for storing data and for presenting the data in a user-friendly format via a browser.

•  The program was developed using mostly Perl.

Tranche (data dissemination)

https://proteomecommons.org/tranche/

Tranche is a free and open source file sharing tool that enables the storage of large amounts of data. Designed and built with scientists and researchers in mind, Tranche can handle very large data sets, is secure, is scalable, and all data sets are citable in scientific journals.

Proteomic pipelines that use Open-source software.

CPAS http://proteomics.fhrc.org/CPAS Open source toolkit that integrates open source proteomics tools along with existing commercial software.

CORRA http://tools.proteomecenter.org/Corra/corra.html Statistical Analysis tools for Quantitative proteomics

SysPIMP http://pimp.starflr.info Identify mutated proteins from mass spectrometry results.

SwissPIT http://swisspit.cscs.ch Multitool platform that promotes use of multiple search algorithms.

mMass data miner http://mmass.biographics.cz

The OpenMS Proteomics Pipeline http://ww.openms.de

Protip

[Workflow diagram: raw data from Orbitrap; mzXML format; dta format and Mgf format; X!TANDEM, MASCOT, SEQUEST and OMSSA searches; Scaffold Analysis; Scaffold Viewer.]

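Performing multiple searches through Protip

[Workflow diagram: mzXML format; dta format and Mgf format; X!TANDEM, MASCOT, SEQUEST and OMSSA searches; Scaffold Analysis.]

[Bar chart: HUMAN DATASET. Number of peptides and number of proteins identified per search strategy. Values are listed in the order they were extracted from the slide, which is assumed to match the order of the category labels.]

Search strategy: # of peptides, # of proteins

Sequest: 5522, 401

X! Tandem: 5137, 370

Mascot: 5486, 411

All Together: 8162, 491

Sequest + Mascot: 6554, 441

Sequest + X! Tandem: 6962, 441

X! Tandem + Mascot: 7443, 462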

LAST WORD…

Questions ? Pratik Jagtap [email protected]