TECHNICAL BRIEF
A QconCAT informatics pipeline for the analysis,
visualization and sharing of absolute quantitative
proteomics data
Neil Swainston1,2, Daniel Jameson1,2 and Kathleen Carroll1,3
1 Manchester Centre for Integrative Systems Biology, Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, UK
2 School of Computer Science, University of Manchester, Manchester, UK
3 School of Chemistry, University of Manchester, Manchester, UK
Received: July 21, 2010
Revised: September 29, 2010
Accepted: October 18, 2010
Absolute protein concentration determination is becoming increasingly important in a
number of fields including diagnostics, biomarker discovery and systems biology modeling.
The recently introduced quantification concatamer methodology provides a novel approach to
performing such determinations, and it has been applied to both microbial and mammalian
systems. While a number of software tools exist for performing analyses of quantitative data
generated by related methodologies such as SILAC, there is currently no analysis package
dedicated to the quantification concatamer approach. Furthermore, most tools that are
currently available in the field of quantitative proteomics do not manage storage and disse-
mination of such data sets.
Keywords:
Bioinformatics / Data analysis / Data management / Quantification concatamer /
Quantitation
An informatics workflow is introduced, which represents a
solution to the data analysis and management challenges
encountered in applying the quantification concatamer
(QconCAT) methodology to absolute quantitative proteo-
mics studies. The workflow includes automated database
searching and quantitation, and data storage and sharing,
utilizing existing tools and applying community-developed
standards where possible. A study of glycolytic enzymes
from Saccharomyces cerevisiae is discussed, alongside a
comparison of absolute protein concentrations calculated by
these tools and by manual analysis.
Absolute protein concentration is becoming increasingly
important in a number of fields including systems biology
modeling. The QconCAT methodology [1] is a recently
introduced approach for determining absolute protein
concentration. It involves expressing concatenated
constructs of reporter peptides in Escherichia coli and
growing these in culture media supplemented with isoto-
pically labeled amino acid analogues. This forms a recom-
binant QconCAT protein, containing isotopically labeled
peptides that act as unique internal standards for each of the
proteins of interest. Known concentrations of this labeled
protein can then be introduced to a given sample, which
upon digestion yield equimolar amounts of QconCAT
peptides, and when co-digested with endogenous proteins
will produce pairs of heavy labeled and light native
peptides, with identical chemical properties. Following
analysis by LC-MS, the co-elution of such peptide pairs
allows their relative abundance to be calculated from the
response ratio between the analyte and the internal standard
peptide. Absolute concentrations of each target protein can
then be inferred from the known concentrations of labeled
analyte.
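As a worked illustration of this inference, the sketch below converts a measured light/heavy ratio into molecules of native protein per cell. The function name, the use of femtomoles and the assumption that the number of cells represented by the sample is known are illustrative choices and are not taken from the article.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def copies_per_cell(light_heavy_ratio, standard_fmol, cell_count):
    """Convert a light/heavy ratio into native molecules per cell.

    light_heavy_ratio: native (light) to QconCAT (heavy) signal ratio.
    standard_fmol:     amount of labeled standard added, in femtomoles
                       (hypothetical unit choice).
    cell_count:        number of cells in the analyzed sample (assumed known).
    """
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    # The native amount is the ratio multiplied by the known standard amount.
    native_fmol = light_heavy_ratio * standard_fmol
    native_molecules = native_fmol * 1e-15 * AVOGADRO
    return native_molecules / cell_count


# Hypothetical inputs: ratio 0.5, 100 fmol of standard, one million cells.
print(f"{copies_per_cell(0.5, 100.0, 1e6):.3e} molecules per cell")
```

The same multiplication applies to every peptide pair; only the ratio changes from peptide to peptide, because all QconCAT peptides are released in equimolar amounts.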
The informatics steps involved in performing such an
experiment are as follows: (i) selection of suitable reporter
Colour Online: See the article online to view Figs. 1 and 2 in colour.
Abbreviation: QconCAT, quantification concatamer
Correspondence: Neil Swainston, Manchester Centre for Inte-
grative Systems Biology, Manchester Interdisciplinary Biocentre,
University of Manchester, Manchester M1 7DN, UK
E-mail: [email protected]
Fax: +44-161-306-8918
© 2010 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com
Proteomics 2011, 11, 329–333. DOI 10.1002/pmic.201000454
peptide(s) for each protein to be quantified; (ii) data acqui-
sition and identification of peptides; (iii) quantitation of
peptide pairs and inference of absolute protein concentra-
tions; and (iv) storage and dissemination of identifications,
quantitations and associated mass spectra. There is
currently no dedicated informatics support for steps (ii)–(iv).
The pipeline described here fills this gap with an integrated
system consisting of two parts: an analysis tool to perform peptide
identification and quantitation, and a database repository
allowing these data to be stored and visualized (see Fig. 1).
Analysis is performed by the QconCAT PrideWizard, an
extension of the original PrideWizard [2], which was devel-
oped to quantify iTRAQ labeled samples. The wizard
provides a user interface to which batches of spectra may be
submitted. Labeled peptides are then identified through a
Mascot [3] MS/MS Ion Search. Protein hits are filtered such
that only those that contain at least one QconCAT labeled
peptide, with rank 1 and a peptide expect score <5, are
retained. Furthermore, peptides are filtered such that the
only ones quantified are unmodified (apart from the
QconCAT label) and are unique to a single protein.
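The filtering rules above can be sketched as follows. This is one reading of the text rather than the PrideWizard source: the `PeptideHit` record, its field names and the example expect values are hypothetical, and the sketch assumes that only confident labeled hits go forward to quantitation.

```python
from dataclasses import dataclass

EXPECT_CUTOFF = 5.0  # expect score threshold stated in the text


@dataclass(frozen=True)
class PeptideHit:
    """Hypothetical record for one Mascot peptide match."""
    sequence: str
    rank: int
    expect: float
    labeled: bool             # carries the QconCAT heavy label
    other_modifications: int  # modifications other than the label
    proteins: tuple           # accessions of all proteins containing it


def is_confident_label_hit(hit):
    """Rank 1, labeled, and below the expect cutoff."""
    return hit.labeled and hit.rank == 1 and hit.expect < EXPECT_CUTOFF


def quantifiable_peptides(protein_hits):
    """Map each retained protein accession to the peptides to quantify.

    protein_hits: dict of accession -> list of PeptideHit.
    A protein is retained if it has at least one confident labeled hit;
    a peptide is quantified only if it is also unmodified (apart from
    the label) and unique to a single protein.
    """
    retained = {}
    for accession, hits in protein_hits.items():
        if not any(is_confident_label_hit(h) for h in hits):
            continue  # protein-level filter
        retained[accession] = [
            h for h in hits
            if is_confident_label_hit(h)
            and h.other_modifications == 0
            and len(h.proteins) == 1
        ]
    return retained


# Peptide names are taken from the article; scores are invented.
example = {
    "ENO1_YEAST": [
        PeptideHit("TFAEALR", 1, 0.01, True, 0, ("ENO1_YEAST",)),
        PeptideHit("SGETEDTFIADLVVGLR", 1, 0.001, True, 0,
                   ("ENO1_YEAST", "ENO2_YEAST")),
    ],
}
for accession, hits in quantifiable_peptides(example).items():
    print(accession, [h.sequence for h in hits])
```

Under this reading, a protein whose only confident hits are shared peptides is retained but ends up with an empty list, so it is reported as identified but not quantified.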
Quantitation is then performed by first generating an
extracted ion chromatogram for the m/z value correspond-
ing to the precursor ion matching each labeled peptide.
Where multiple matches occur against the same labeled
peptide in a given protein, the highest scoring one is
considered for quantitation. Savitzky-Golay smoothing [4] is
applied to the chromatogram, and the start and end reten-
tion time for the chromatographic peak matching the
peptide is determined, based on the retention time of the
fragmentation spectrum that supplied the peptide match.
Each precursor scan within this retention time window is
extracted and analyzed with an implementation of the
SILAC Analyzer linear fit quantitation algorithm [5]. This
provides a light/heavy ratio, and standard error, for each
identified unlabeled/labeled peptide pair (see Fig. 2).
Absolute quantification is obtained by multiplying this ratio
by the known amount of the standard. Protein quantitations
are a function of individual peptide quantitations and are
calculated using a formula described previously [6]. Peptides
from each replicate contribute to the overall protein quan-
titation.
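The smoothing and peak-boundary step can be sketched as below. The article does not state the filter window, polynomial order or boundary rule, so the 5-point quadratic Savitzky-Golay coefficients and the 5% of apex threshold used here are assumptions, and the function names are hypothetical.

```python
# 5-point quadratic Savitzky-Golay smoothing coefficients
# (Savitzky and Golay, 1964); the window size is an assumption.
SG5 = (-3, 12, 17, 12, -3)
SG5_NORM = 35


def smooth(intensities):
    """Return a smoothed copy; two points at each edge are left as-is."""
    out = list(intensities)
    for i in range(2, len(intensities) - 2):
        window = intensities[i - 2:i + 3]
        out[i] = sum(c * y for c, y in zip(SG5, window)) / SG5_NORM
    return out


def peak_window(times, intensities, msms_time, fraction=0.05):
    """Return (start, end) retention times of the peak near msms_time.

    The search starts at the scan closest to the retention time of the
    fragmentation spectrum, climbs to the local apex, then walks outwards
    until the smoothed signal drops to `fraction` of the apex
    (an assumed boundary rule).
    """
    smoothed = smooth(intensities)
    i = min(range(len(times)), key=lambda k: abs(times[k] - msms_time))
    while i + 1 < len(smoothed) and smoothed[i + 1] > smoothed[i]:
        i += 1
    while i > 0 and smoothed[i - 1] > smoothed[i]:
        i -= 1
    threshold = fraction * smoothed[i]
    start = end = i
    while start > 0 and smoothed[start - 1] > threshold:
        start -= 1
    while end + 1 < len(smoothed) and smoothed[end + 1] > threshold:
        end += 1
    return times[start], times[end]


# Synthetic chromatogram with a single peak centred at t = 5.
times = [float(t) for t in range(11)]
xic = [0, 0, 0, 1, 5, 10, 5, 1, 0, 0, 0]
print(peak_window(times, xic, msms_time=4.2))
```

Every precursor scan whose retention time falls inside the returned window would then be passed to the ratio calculation.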
Upon completion of the analysis, raw experimental data,
metadata, protein and peptide identifications and quantita-
tions are formatted into PRIDE XML [7] (see Supporting
Information Fig. S1). The PRIDE XML documents are
automatically uploaded to a native XML database, where
they can be queried through both a web and web service
interface (see http://www.mcisb.org/QconCAT/). The web
interface provides a simple search facility, allowing proteins
to be searched by UniProt identifiers. Protein summaries
display the calculated ratio, labeled peptides from each of
the submitted replicates, a link to the original Mascot
results, a tooltip showing experimental metadata and an
interactive panel, allowing fragmentation spectra, precursor
spectra and extracted mass chromatograms to be viewed. In
this way, both identifications and quantitations can be
viewed and shared online, displaying the original raw data
as was acquired by the instrument (see Fig. 2). Data can be
copied from the web interface and pasted into spreadsheets
or e-mails, providing an export facility for report generation.
The system can also be configured to manage the original,
vendor-specific raw data file, which can then also be down-
loaded from the web interface.
To test the system, a study was performed upon yeast
glycolytic enzymes with the results generated by the Qcon-
CAT PrideWizard compared to those generated manually.
Samples were prepared and data collected in triplicate by
LC-MS using a nanoACQUITY chromatograph (Waters MS
Technologies, Manchester, UK) coupled to an LTQ-Orbitrap
[Figure 1 diagram. Labels: mzData; Pride Converter; PRIDE XML; QconCAT Pride Wizard; Identify; Quantify; Format; Upload; Mascot; eXist database; Web / web service; Browser]
Figure 1. The QconCAT analysis pipeline. Data are exported from the instrument software in mzData format, which is then imported into
the PRIDE Converter where metadata is applied. Batches of resulting PRIDE XML may be input to the QconCAT PrideWizard, which
submits queries to Mascot and quantifies the resulting identified peptides. Identifications and quantitations are merged with the spectral
data, and the resulting document is uploaded to a native XML database where it can be accessed through a web and web service interface.
(ThermoFisher Scientific, Waltham, MA, USA). The
acquired raw data were converted to the vendor-independent
mzData [8] format using Bioworks Browser (v3.3.1 SP3,
ThermoFisher, Bremen, Germany), a conversion that
applies no processing such as deisotoping or charge
deconvolution. These data were passed through the PRIDE
Converter [9] version 2.2, metadata was added and the data
exported in PRIDE XML format. Analysis was then
performed on all replicates using the QconCAT PrideWi-
zard. Manual analysis was also performed through genera-
tion of peak areas from extracted ion chromatograms
(Bioworks Browser). Ratios of unlabeled to labeled peptide
areas were calculated and these averaged across replicates.
Results show that the QconCAT PrideWizard is able to
quantify 23 of the 27 proteins in the study, in comparison to
19 by manual analysis (see Table 1). Correlation is observed
between the 17 proteins quantified by both the QconCAT
PrideWizard and by manual analysis (R² = 0.88). The
QconCAT PrideWizard additionally quantified five low
abundance proteins, with the least abundant ADH5_YEAST
reported at 10^4 copies per cell. ALF_YEAST was successfully
quantified manually, but could not be quantified by the
QconCAT PrideWizard due to neither of the two labeled
marker peptides representing this protein being found by
Figure 2. Web interface screen captures providing visualization
of the quantitation of the peptide IDVAVDSTGVFK from
G3P1_YEAST. This represents the first of three labeled peptides
that were assigned to this protein (see http://www.mcisb.org/
QconCAT/G3P1_YEAST/). (A) shows the extracted ion chroma-
tograms for the masses 625.836 and 628.846 Da, representing the
unlabeled and labeled peptide in black and blue, respectively,
and illustrates their co-elution. Vertical lines indicate the calcu-
lated start and end retention times of the labeled peptide chro-
matographic peak. (B) illustrates a precursor scan taken between
the start and end retention times highlighted in (A), containing
isotopic envelopes for both unlabeled and labeled peptides. The
SILAC Analyzer linear fit algorithm is applied to each of these
precursor scans. This entails applying a sliding window across
the isotopic clusters in each of the precursor scans (B), gathering
pairs of intensity readings at m/z (representing the unlabeled
peptide) and m/z + Δm/z (representing the labeled peptide, where
Δm is the monoisotopic mass of the label (6.020 Da in these
studies) and z is the precursor ion charge). If both the labeled
and unlabeled peptides are present, these intensity pairs display
a linear correlation, which is plotted in the Fit tab (C). Applying
linear regression to this scatter plot provides an unlabeled to
labeled intensity ratio and standard error, which in the above
example is 0.793 ± 0.005.
Table 1. Protein concentrations calculated by the QconCAT PrideWizard and by manual analysis.
Protein concentration/molecules per cell ("-" indicates that the protein was not quantified).

Protein accession   QconCAT PrideWizard   Manual analysis
ADH1_YEAST          9.718E+04             1.492E+05
ADH2_YEAST          2.001E+04             -
ADH3_YEAST          7.518E+04             -
ADH4_YEAST          5.693E+05             5.692E+05
ADH5_YEAST          1.049E+04             -
ADH6_YEAST          -                     -
ADH7_YEAST          -                     -
ALF_YEAST           -                     5.032E+06
ENO1_YEAST          3.348E+06             3.552E+06
ENO2_YEAST          8.998E+06             1.035E+07
G3P1_YEAST          1.799E+06             1.766E+06
G3P2_YEAST          -                     1.623E+07
G3P3_YEAST          1.822E+07             1.623E+07
G6PI_YEAST          5.488E+05             5.646E+05
HXKA_YEAST          7.157E+04             8.513E+04
HXKB_YEAST          2.499E+05             2.490E+05
HXKG_YEAST          1.192E+05             1.430E+05
K6PF1_YEAST         1.806E+05             1.728E+05
K6PF2_YEAST         1.499E+05             1.702E+05
KPYK1_YEAST         9.535E+06             1.523E+07
KPYK2_YEAST         2.032E+04             -
PDC1_YEAST          4.620E+06             4.869E+06
PDC5_YEAST          4.343E+04             -
PDC6_YEAST          1.685E+04             -
PGK_YEAST           1.103E+06             1.518E+06
PMG1_YEAST          2.958E+06             3.234E+06
TPIS_YEAST          1.020E+06             2.394E+06
Mascot. Conversely, reporter peptides for ADH7_YEAST
were identified, but the corresponding native peptides were
in such low abundance that the protein could not be quantified.
No peptides were identified for ADH6_YEAST.
The study attempts to quantify a number of isoenzymes,
including two members of the enolase family. A number of
QconCAT peptides originally selected to act as unique
marker peptides for a given protein were found to be
duplicates; that is, they were shared between a number of
isoenzymes. An example is SGETEDTFIADLVVGLR,
originally selected as a marker peptide for ENO1_YEAST,
which was correctly ignored in the quantification calculation
of the QconCAT PrideWizard on the grounds that it is also
present in ENO2_YEAST. As it is common practice to select
multiple peptides to act as markers for a given protein,
ENO1_YEAST was successfully quantified by the unique
peptides TFAEALR and NVNDVIAPAFVK. However,
G3P2_YEAST was not quantified, as both selected marker
peptides (VLPELQGK and VPTVDVSVVDLTVK) were non-
unique. Such common peptides may be considered in
unbiased approaches such as SILAC analyses, where the
contribution that a shared peptide makes to each of its
proteins may be inferred. However, it is appropriate to
exclude such peptides from QconCAT studies, as their
presence can usually be mitigated by experimental design.
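A check of this kind can be run at design time, before a QconCAT construct is synthesized. The sketch below is a minimal illustration: the function names are hypothetical, the toy sequences are built by joining peptides named in the text and are not real UniProt sequences, and plain substring matching ignores tryptic cleavage context.

```python
def proteins_containing(peptide, proteome):
    """Return the sorted accessions whose sequence contains the peptide."""
    return sorted(acc for acc, seq in proteome.items() if peptide in seq)


def classify_markers(markers, proteome):
    """Split candidate marker peptides into unique and shared.

    markers:  dict of accession -> list of candidate peptide sequences.
    proteome: dict of accession -> protein sequence.
    Returns a dict of accession -> (unique_peptides, shared_peptides).
    """
    report = {}
    for accession, peptides in markers.items():
        unique = [p for p in peptides
                  if proteins_containing(p, proteome) == [accession]]
        shared = [p for p in peptides if p not in unique]
        report[accession] = (unique, shared)
    return report


# Toy sequences only: concatenations of peptides quoted in the article.
toy_proteome = {
    "ENO1_YEAST": "TFAEALR" + "NVNDVIAPAFVK" + "SGETEDTFIADLVVGLR",
    "ENO2_YEAST": "SGETEDTFIADLVVGLR",
}
candidates = {
    "ENO1_YEAST": ["SGETEDTFIADLVVGLR", "TFAEALR", "NVNDVIAPAFVK"],
}
unique, shared = classify_markers(candidates, toy_proteome)["ENO1_YEAST"]
print("unique:", unique)
print("shared:", shared)
```

A protein whose candidate list comes back with no unique peptides, as happened for G3P2_YEAST, needs different marker peptides.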
The approach taken by the QconCAT PrideWizard is to
first identify heavy/light peptide pairs and then to quantify
them. The identification of QconCAT pairs is driven by the
Mascot MS/MS Ion Search matching labeled QconCAT
peptides, which is dependent upon acquisition of MS/MS
data for these peptides. While this may not be applicable to
relative quantitative proteomics studies such as SILAC, where
a sample may contain hundreds or thousands of pairs across
a large dynamic range, QconCAT studies focus on a finite
number of peptide pairs (typically around 50). Furthermore, it is
assumed the labeled QconCAT protein is added to the sample
in a concentration large enough to ensure that its peptides
provide signals of sufficient intensity that fragmentation data
will be acquired. Quantitation is performed by extracting
individual survey scans containing each heavy/light peptide
pair and determining signal ratios of equivalent points across
the two isotopic clusters. Such an approach minimizes the
potentially detrimental effect of co-eluting peptides.
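The ratio determination can be sketched as a least-squares fit over paired intensity readings. This is not the SILAC Analyzer code: the function names are hypothetical, and fitting a line through the origin is an assumption, since the article does not say whether an intercept is fitted.

```python
import math


def labeled_mz(unlabeled_mz, label_mass, charge):
    """m/z of the labeled partner: m/z + delta_m / z."""
    return unlabeled_mz + label_mass / charge


def fit_ratio(light, heavy):
    """Fit light = ratio * heavy through the origin.

    light, heavy: equal-length sequences of intensities read at
    equivalent points of the unlabeled and labeled isotopic clusters.
    Returns (ratio, standard_error) of the unlabeled/labeled ratio.
    """
    if len(light) != len(heavy) or len(light) < 2:
        raise ValueError("need at least two paired intensity readings")
    sxx = sum(h * h for h in heavy)
    if sxx == 0:
        raise ValueError("labeled intensities are all zero")
    ratio = sum(l * h for l, h in zip(light, heavy)) / sxx
    residual = sum((l - ratio * h) ** 2 for l, h in zip(light, heavy))
    # Standard error of a slope through the origin, n - 1 degrees of freedom.
    std_err = math.sqrt(residual / ((len(light) - 1) * sxx))
    return ratio, std_err


# Label mass and m/z from the Figure 2 caption; the values are consistent
# with a doubly charged precursor.
print(round(labeled_mz(625.836, 6.020, 2), 3))

# Invented intensity pairs, for illustration only.
ratio, std_err = fit_ratio([79.0, 161.0, 238.0], [100.0, 200.0, 300.0])
print(f"ratio = {ratio:.3f}, standard error = {std_err:.3f}")
```

Because each point pairs the two clusters within a single scan, a co-eluting species that overlaps only part of one cluster shows up as scatter around the fitted line and a larger standard error, rather than silently inflating an integrated area.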
A key feature of the system is its performance. A single
raw data file of 133 MB can be quantified in 249 s (of which
132 s are accounted for by the Mascot database search itself)
on a MacBook Pro (Apple Corporation, CA, USA), running
Mac OS X 10.6.4, with a 2.5-GHz Intel Core 2 Duo processor
and 4 GB 667 MHz DDR2 SDRAM.
A further consideration in the development of the system
was to ensure ease-of-use, as the system has been designed
for use by mass spectrometrists rather than bioinformati-
cians. The QconCAT PrideWizard manages the flow of data
from spectral data submission through database searching,
peptide/protein quantitation, data formatting and storage.
As such, a user can submit a batch of spectral data files,
which with only a small number of additional inputs allows
all steps of the analysis pipeline to be performed in a single
operation.
Existing tools such as Mascot and the SILAC Analyzer
have been reused and repackaged: a deliberate strategy that
avoids wheel reinvention. Where possible, the workflow has
utilized existing data representation standards. Mascot has
been recognized as the de facto industry standard for
performing protein identification and a link to the original
Mascot results, along with individual peptide scores, is
provided. Reanalysis of the original raw data is possible due
to the facility to export the data in standard, non-proprietary
mzData format. Peptide and protein identifications,
matched spectra and experimental metadata can be viewed,
queried and exported. This corresponds to journal recom-
mendations, which state that original, raw experimental data
should be made available, along with the secondary derived
data in terms of protein identifications and quantitations in
a standardized format that facilitates query and use by third
parties [10]. Furthermore, the authors intend to update the
QconCAT pipeline to support subsequent iterations of the
PRIDE XML format that incorporate the newly introduced
HUPO Proteomics Standards Initiative standards mzML,
mzIdentML and ultimately, mzQuantML.
The QconCAT Browser allows the survey scan data upon
which quantitations are performed to be viewed, allowing
both identifications and quantitations to be verified by
visualization of the original fragmentation and survey
spectra in a web browser. This contrasts with the usual
procedure, in which verification of reported quantitative data
is rarely achievable due to the inaccessibility of the original
raw data. Even when raw data are accessible, they are commonly
held on the instrument computer and can usually only be
accessed through the vendor-supplied instrumentation
software. Explicitly displaying quantitative data along with
peptide and protein identifications in a web accessible
manner provides significant benefits and is an approach that
will hopefully become more widespread over time.
In addition to the web browser interface, a web service
interface is provided, allowing programmatic access to the
concentration values, identifications and all spectra
contained in the database. The web service interface allows
the user to submit customized XQuery commands to the
XML database, providing the flexibility to query and retrieve
any element of the PRIDE XML document, from individual
peptide or protein records up to the entire document itself.
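To give a flavour of such a query, the sketch below builds an XQuery path expression and evaluates the equivalent lookup locally with Python's standard library. The element names (`GelFreeIdentification`, `Accession`, `PeptideItem`, `Sequence`) are assumed from the PRIDE XML schema, the toy document is invented, and no service endpoint is shown because the article does not describe the request format.

```python
import re
import xml.etree.ElementTree as ET

# Invented miniature document using assumed PRIDE XML element names.
TOY_DOC = """<ExperimentCollection>
  <Experiment>
    <GelFreeIdentification>
      <Accession>G3P1_YEAST</Accession>
      <PeptideItem><Sequence>IDVAVDSTGVFK</Sequence></PeptideItem>
    </GelFreeIdentification>
    <GelFreeIdentification>
      <Accession>ENO1_YEAST</Accession>
      <PeptideItem><Sequence>TFAEALR</Sequence></PeptideItem>
      <PeptideItem><Sequence>NVNDVIAPAFVK</Sequence></PeptideItem>
    </GelFreeIdentification>
  </Experiment>
</ExperimentCollection>"""


def xquery_for(accession):
    """Build an XQuery path selecting peptide sequences for one protein."""
    if not re.fullmatch(r"[A-Za-z0-9_]+", accession):
        raise ValueError("unexpected characters in accession")
    return (f'//GelFreeIdentification[Accession = "{accession}"]'
            "/PeptideItem/Sequence/text()")


def peptides_for(document, accession):
    """Evaluate the same selection locally with ElementTree."""
    root = ET.fromstring(document)
    sequences = []
    for ident in root.iter("GelFreeIdentification"):
        if ident.findtext("Accession") == accession:
            sequences.extend(item.findtext("Sequence")
                             for item in ident.findall("PeptideItem"))
    return sequences


print(xquery_for("ENO1_YEAST"))
print(peptides_for(TOY_DOC, "ENO1_YEAST"))
```

The query string is what would be submitted to the database; the local function only demonstrates which records such a query is expected to return.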
This QconCAT analysis pipeline provides a freely avail-
able, vendor-independent means of analyzing, visualizing
and disseminating QconCAT experimental data, performing
the calculation of absolute protein concentrations and
managing the storage and dissemination of these data in a
standards compliant manner.
The authors thank Kieran Smallbone, Norman Paton, Rob Beynon, Claire Eyers, Simon Hubbard, Craig Lawless and Julian Selley, and Marcin Rzeznicki for his Savitzky-Golay
algorithm implementation (http://code.google.com/p/savitzky-golay-filter/). The authors thank the EPSRC and BBSRC for their funding of the Manchester Centre for Integrative Systems Biology (http://www.mcisb.org), BBSRC/EPSRC Grant BB/C008219/1.
The authors have declared no conflict of interest.
References
[1] Beynon, R. J., Doherty, M. K., Pratt, J. M., Gaskell, S. J.,
Multiplexed absolute quantification in proteomics using
artificial QCAT proteins of concatenated signature peptides.
Nat. Methods 2005, 2, 587–589.
[2] Siepen, J. A., Swainston, N., Jones, A. R., Hart, S. R. et al.,
An informatic pipeline for the data capture and submission
of quantitative proteomic data using iTRAQ. Proteome Sci.
2007, 5, 4.
[3] Perkins, D. N., Pappin, D. J., Creasy, D. M., Cottrell, J. S.,
Probability-based protein identification by searching
sequence databases using mass spectrometry data. Elec-
trophoresis 1999, 20, 3551–3567.
[4] Savitzky, A., Golay, M. J. E., Smoothing and differentiation
of data by simplified least squares procedures. Anal. Chem.
1964, 36, 1627–1639.
[5] Nilse, L., Sturm, M., Trudgian, D. et al., SILACAnalyzer – a
tool for differential quantitation of stable isotope derived
data. CIBB, 6th International Meeting on Computational
Intelligence Methods for Bioinformatics and Biostatistics,
Genoa 2009.
[6] Baker, R. W. R., Nissim, J. A., Expressions for combining
standard errors of two groups and for sequential standard
error. Nature 1963, 198, 1020.
[7] Jones, P., Cote, R. G., Martens, L., Quinn, A. F. et al., PRIDE:
a public repository of protein and peptide identifications for
the proteomics community. Nucleic Acids Res. 2006, 34,
D659–D663.
[8] Orchard, S., Taylor, C., Hermjakob, H., Zhu, W. et al.,
Current status of proteomic standards development. Expert
Rev. Proteomics 2004, 1, 179–183.
[9] Barsnes, H., Vizcaíno, J. A., Eidhammer, I., Martens, L.,
PRIDE Converter: making proteomics data-sharing easy.
Nat. Biotechnol. 2009, 27, 598–599.
[10] Anon. Democratizing proteomics data. Nat. Biotechnol.
2007, 25, 262.