
TECHNICAL BRIEF

A QconCAT informatics pipeline for the analysis, visualization and sharing of absolute quantitative proteomics data

Neil Swainston 1,2, Daniel Jameson 1,2 and Kathleen Carroll 1,3

1 Manchester Centre for Integrative Systems Biology, Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, UK
2 School of Computer Science, University of Manchester, Manchester, UK
3 School of Chemistry, University of Manchester, Manchester, UK

Received: July 21, 2010
Revised: September 29, 2010
Accepted: October 18, 2010

Absolute protein concentration determination is becoming increasingly important in a number of fields including diagnostics, biomarker discovery and systems biology modeling. The recently introduced quantification concatamer methodology provides a novel approach to performing such determinations, and it has been applied to both microbial and mammalian systems. While a number of software tools exist for performing analyses of quantitative data generated by related methodologies such as SILAC, there is currently no analysis package dedicated to the quantification concatamer approach. Furthermore, most tools that are currently available in the field of quantitative proteomics do not manage storage and dissemination of such data sets.

Keywords: Bioinformatics / Data analysis / Data management / Quantification concatamer / Quantitation

An informatics workflow is introduced that addresses the data analysis and management challenges encountered in applying the quantification concatamer (QconCAT) methodology to absolute quantitative proteomics studies. The workflow includes automated database searching and quantitation, and data storage and sharing, utilizing existing tools and applying community-developed standards where possible. A study of glycolytic enzymes from Saccharomyces cerevisiae is discussed, alongside a comparison of absolute protein concentrations calculated by these tools and by manual analysis.

Determining absolute protein concentration is becoming increasingly important in a number of fields, including systems biology modeling. The QconCAT methodology [1] is a recently introduced approach for making such determinations. It involves expressing concatenated constructs of reporter peptides in Escherichia coli grown in culture media supplemented with isotopically labeled amino acid analogues. This yields a recombinant QconCAT protein containing isotopically labeled peptides that act as unique internal standards for each of the proteins of interest. A known amount of this labeled protein is added to a given sample; upon digestion it yields equimolar amounts of QconCAT peptides, and co-digestion with the endogenous proteins produces pairs of heavy (labeled) and light (native) peptides with identical chemical properties. Following analysis by LC-MS, the co-elution of such peptide pairs allows their relative abundance to be calculated from the response ratio between the analyte and the internal standard peptide. Absolute concentrations of each target protein can then be inferred from the known concentration of the labeled internal standard.

Colour Online: See the article online to view Figs. 1 and 2 in colour.
Abbreviation: QconCAT, quantification concatamer
Correspondence: Neil Swainston, Manchester Centre for Integrative Systems Biology, Manchester Interdisciplinary Biocentre, University of Manchester, Manchester M1 7DN, UK
E-mail: [email protected]
Fax: +44-161-306-8918
© 2010 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim. www.proteomics-journal.com
Proteomics 2011, 11, 329–333. DOI 10.1002/pmic.201000454

The informatics steps involved in performing such an experiment are as follows: (i) selection of suitable reporter peptide(s) for each protein to be quantified; (ii) data acquisition and identification of peptides; (iii) quantitation of peptide pairs and inference of absolute protein concentrations; and (iv) storage and dissemination of identifications, quantitations and associated mass spectra. There is currently no dedicated informatics support for steps (ii)–(iv). The pipeline described here addresses this gap by providing an integrated system consisting of two parts: an analysis tool that performs peptide identification and quantitation, and a database repository that allows these data to be stored and visualized (see Fig. 1).

Analysis is performed by the QconCAT PrideWizard, an extension of the original PrideWizard [2], which was developed to quantify iTRAQ labeled samples. The wizard provides a user interface to which batches of spectra may be submitted. Labeled peptides are then identified through a Mascot [3] MS/MS Ion Search. Protein hits are filtered such that only those that contain at least one QconCAT labeled peptide, with rank 1 and a peptide expect score <5, are retained. Furthermore, peptides are filtered such that only those that are unmodified (apart from the QconCAT label) and unique to a single protein are quantified.
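The filtering rules above can be illustrated with a minimal Python sketch. It is not the QconCAT PrideWizard code: the `PeptideHit` record, its field names and the expect values in the example are hypothetical, and the protein-level and peptide-level filters are merged into a single predicate for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    """Hypothetical, simplified view of one Mascot peptide match."""
    sequence: str
    rank: int
    expect: float
    is_labeled: bool          # carries the QconCAT heavy label
    other_modifications: int  # modifications apart from the label
    proteins: tuple           # accessions of every protein containing the peptide


MAX_EXPECT = 5.0  # threshold quoted in the text


def is_quantifiable(hit):
    """Labeled, rank 1, expect below threshold, otherwise unmodified, unique."""
    return (hit.is_labeled
            and hit.rank == 1
            and hit.expect < MAX_EXPECT
            and hit.other_modifications == 0
            and len(hit.proteins) == 1)


def quantifiable_by_protein(hits):
    """Group the hits that pass the filter by their single parent protein."""
    kept = {}
    for hit in hits:
        if is_quantifiable(hit):
            kept.setdefault(hit.proteins[0], []).append(hit)
    return kept


# Example with made-up scores; the sequences are taken from the text.
example_hits = [
    PeptideHit("TFAEALR", 1, 0.01, True, 0, ("ENO1_YEAST",)),
    PeptideHit("SGETEDTFIADLVVGLR", 1, 0.001, True, 0,
               ("ENO1_YEAST", "ENO2_YEAST")),   # shared, so excluded
    PeptideHit("NVNDVIAPAFVK", 2, 0.02, True, 0, ("ENO1_YEAST",)),  # rank 2
]
example_kept = quantifiable_by_protein(example_hits)
```

In this example only TFAEALR survives, since the second peptide is shared between two proteins and the third is not a rank 1 match.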

Quantitation begins by generating an extracted ion chromatogram for the m/z value corresponding to the precursor ion matching each labeled peptide. Where multiple matches occur against the same labeled peptide in a given protein, the highest scoring one is considered for quantitation. Savitzky-Golay smoothing [4] is applied to the chromatogram, and the start and end retention times of the chromatographic peak matching the peptide are determined, based on the retention time of the fragmentation spectrum that supplied the peptide match.
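As an illustration of this step, the following Python sketch smooths a chromatogram with the standard 5-point quadratic Savitzky-Golay coefficients and then locates the peak around the MS/MS retention time. The window size, the 5% of apex cut-off used to place the peak boundaries and the example intensities are assumptions made for the sketch; the text does not state which settings or boundary rule the QconCAT PrideWizard uses.

```python
# 5-point quadratic Savitzky-Golay smoothing coefficients and their norm.
SG5 = (-3, 12, 17, 12, -3)
SG5_NORM = 35


def savgol5(intensities):
    """Smooth a chromatogram; the two points at each edge are left as-is."""
    smoothed = list(intensities)
    for i in range(2, len(intensities) - 2):
        window = intensities[i - 2:i + 3]
        smoothed[i] = sum(c * y for c, y in zip(SG5, window)) / SG5_NORM
    return smoothed


def peak_window(times, intensities, msms_time, fraction=0.05):
    """Return (start, end) retention times of the peak around msms_time.

    Assumed rule: climb from the scan nearest the MS/MS event to the local
    apex, then extend both ways while intensity >= fraction * apex.
    """
    last = len(intensities) - 1
    apex = min(range(len(times)), key=lambda i: abs(times[i] - msms_time))
    while apex < last and intensities[apex + 1] > intensities[apex]:
        apex += 1
    while apex > 0 and intensities[apex - 1] > intensities[apex]:
        apex -= 1
    threshold = fraction * intensities[apex]
    left = apex
    while left > 0 and intensities[left - 1] >= threshold:
        left -= 1
    right = apex
    while right < last and intensities[right + 1] >= threshold:
        right += 1
    return times[left], times[right]


# Made-up chromatogram: one scan per minute, a single peak at 5 min.
example_times = [float(t) for t in range(11)]
example_raw = [0.0, 0.0, 1.0, 5.0, 20.0, 50.0, 20.0, 5.0, 1.0, 0.0, 0.0]
example_smoothed = savgol5(example_raw)
example_window = peak_window(example_times, example_smoothed, msms_time=4.2)
```

Precursor scans whose retention times fall inside the returned window would then be passed on to the ratio calculation.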

Each precursor scan within this retention time window is extracted and analyzed with an implementation of the SILAC Analyzer linear fit quantitation algorithm [5]. This provides a light/heavy ratio, and standard error, for each identified unlabeled/labeled peptide pair (see Fig. 2).
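The idea behind the linear fit, as described in the caption of Fig. 2, can be sketched in Python as follows. This is a simplified illustration rather than the SILAC Analyzer implementation: intensity pairs are gathered over a fixed m/z range instead of a sliding window, the regression is forced through the origin (an assumption), and the scan data, the m/z tolerance and all function names are invented for the example.

```python
import math


def labeled_mz(unlabeled_mz, label_mass, charge):
    """m/z of the labeled partner: m/z + delta_m / z."""
    return unlabeled_mz + label_mass / charge


def intensity_pairs(scan, start_mz, end_mz, label_mass, charge, tol=0.005):
    """Pair each point in [start_mz, end_mz] with the point at m/z + delta_m/z.

    scan is a list of (m/z, intensity) tuples from one precursor scan.
    Returns a list of (light_intensity, heavy_intensity) pairs.
    """
    pairs = []
    for mz, light in scan:
        if not start_mz <= mz <= end_mz:
            continue
        target = labeled_mz(mz, label_mass, charge)
        nearest_mz, heavy = min(scan, key=lambda p: abs(p[0] - target))
        if abs(nearest_mz - target) <= tol:
            pairs.append((light, heavy))
    return pairs


def ratio_through_origin(pairs):
    """Least-squares fit of light = ratio * heavy; returns (ratio, std error)."""
    sxx = sum(h * h for _, h in pairs)
    sxy = sum(l * h for l, h in pairs)
    ratio = sxy / sxx
    residual = sum((l - ratio * h) ** 2 for l, h in pairs)
    std_error = math.sqrt(residual / ((len(pairs) - 1) * sxx))
    return ratio, std_error


# Synthetic doubly charged pair with a 6.020 Da label and a true ratio of 0.8.
LABEL_MASS, CHARGE = 6.020, 2
light_peaks = [(625.836, 80.0), (626.338, 56.0), (626.839, 24.0)]
heavy_peaks = [(labeled_mz(mz, LABEL_MASS, CHARGE), i / 0.8)
               for mz, i in light_peaks]
example_scan = light_peaks + heavy_peaks
example_pairs = intensity_pairs(example_scan, 625.5, 627.5, LABEL_MASS, CHARGE)
example_ratio, example_se = ratio_through_origin(example_pairs)
```

With real data the pairs from every precursor scan in the retention time window would be pooled before fitting, and the scatter about the fitted line gives the reported standard error.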

Absolute quantification is obtained by multiplying this ratio by the known amount of the standard. Protein quantitations are a function of the individual peptide quantitations and are calculated using a formula described previously [6]. Peptides from each replicate contribute to the overall protein quantitation.
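The peptide-level arithmetic can be made concrete with a short Python sketch. The ratio and standard error are those quoted for IDVAVDSTGVFK in the caption of Fig. 2; the amount of standard (100 fmol) and the number of cells (10^7) are hypothetical values chosen only to show the conversion to molecules per cell, and the standard amount is treated as exact when scaling the error. The combination of peptide values into a protein value [6] is not reproduced here.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def absolute_amount_fmol(ratio, ratio_se, standard_fmol):
    """Scale the light/heavy ratio (and its error) by the spiked standard."""
    return ratio * standard_fmol, ratio_se * standard_fmol


def molecules_per_cell(amount_fmol, cell_count):
    """Convert an amount in fmol to molecules per cell."""
    return amount_fmol * 1e-15 * AVOGADRO / cell_count


RATIO, RATIO_SE = 0.793, 0.005   # from the Fig. 2 caption
STANDARD_FMOL = 100.0            # hypothetical amount of QconCAT standard
CELL_COUNT = 1.0e7               # hypothetical number of cells analyzed

amount, amount_se = absolute_amount_fmol(RATIO, RATIO_SE, STANDARD_FMOL)
copies = molecules_per_cell(amount, CELL_COUNT)
```

Because both the standard amount and the cell count are invented, the resulting copy number should not be compared with the values reported in Table 1.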

Upon completion of the analysis, raw experimental data, metadata, protein and peptide identifications and quantitations are formatted into PRIDE XML [7] (see Supporting Information Fig. S1). The PRIDE XML documents are automatically uploaded to a native XML database, where they can be queried through both a web and a web service interface (see http://www.mcisb.org/QconCAT/). The web interface provides a simple search facility, allowing proteins to be searched by UniProt identifier. Protein summaries display the calculated ratio, the labeled peptides from each of the submitted replicates, a link to the original Mascot results, a tooltip showing experimental metadata and an interactive panel in which fragmentation spectra, precursor spectra and extracted mass chromatograms can be viewed. In this way, both identifications and quantitations can be viewed and shared online, displaying the original raw data as acquired by the instrument (see Fig. 2). Data can be copied from the web interface and pasted into spreadsheets or e-mails, providing an export facility for report generation. The system can also be configured to manage the original, vendor-specific raw data file, which can then also be downloaded from the web interface.

Figure 1. The QconCAT analysis pipeline. Data are exported from the instrument software in mzData format and imported into the PRIDE Converter, where metadata are applied. Batches of the resulting PRIDE XML may be input to the QconCAT PrideWizard, which submits queries to Mascot and quantifies the resulting identified peptides. Identifications and quantitations are merged with the spectral data, and the resulting document is uploaded to a native XML database where it can be accessed through a web and web service interface.
[Figure 1 diagram labels: mzData; PRIDE Converter; PRIDE XML; QconCAT PrideWizard; Identify; Quantify; Format; Upload; Mascot; eXist database; Web / web service; Browser.]

To test the system, a study was performed upon yeast glycolytic enzymes, with the results generated by the QconCAT PrideWizard compared to those generated manually. Samples were prepared and data collected in triplicate by LC-MS using a nanoACQUITY chromatograph (Waters MS Technologies, Manchester, UK) coupled to an LTQ-Orbitrap (ThermoFisher Scientific, Waltham, MA, USA). The acquired raw data were converted to the vendor-independent mzData [8] format using Bioworks Browser (v3.3.1 SP3, ThermoFisher, Bremen, Germany), a conversion that applies no processing such as deisotoping or charge deconvolution. These data were passed through the PRIDE Converter [9] version 2.2, metadata were added and the data were exported in PRIDE XML format. Analysis was then performed on all replicates using the QconCAT PrideWizard. Manual analysis was also performed by generating peak areas from extracted ion chromatograms (Bioworks Browser). Ratios of unlabeled to labeled peptide areas were calculated and averaged across replicates.

Results show that the QconCAT PrideWizard is able to quantify 23 of the 27 proteins in the study, in comparison to 19 by manual analysis (see Table 1). A correlation is observed between the values for the 17 proteins quantified by both the QconCAT PrideWizard and manual analysis (R² = 0.88). The QconCAT PrideWizard additionally quantified five low abundance proteins, with the least abundant, ADH5_YEAST, reported at 10⁴ copies per cell. ALF_YEAST was successfully quantified manually, but could not be quantified by the QconCAT PrideWizard because neither of the two labeled marker peptides representing this protein was found by Mascot. Conversely, reporter peptides for ADH7_YEAST were identified, but the corresponding native peptides were in such low abundance that the protein could not be quantified. No peptides were identified for ADH6_YEAST.

Table 1. Protein concentrations (molecules per cell) calculated by the QconCAT PrideWizard and by manual analysis. A dash indicates that the protein was not quantified by that method.

Protein accession   QconCAT PrideWizard   Manual analysis
ADH1_YEAST          9.718E+04             1.492E+05
ADH2_YEAST          2.001E+04             -
ADH3_YEAST          7.518E+04             -
ADH4_YEAST          5.693E+05             5.692E+05
ADH5_YEAST          1.049E+04             -
ADH6_YEAST          -                     -
ADH7_YEAST          -                     -
ALF_YEAST           -                     5.032E+06
ENO1_YEAST          3.348E+06             3.552E+06
ENO2_YEAST          8.998E+06             1.035E+07
G3P1_YEAST          1.799E+06             1.766E+06
G3P2_YEAST          -                     1.623E+07
G3P3_YEAST          1.822E+07             1.623E+07
G6PI_YEAST          5.488E+05             5.646E+05
HXKA_YEAST          7.157E+04             8.513E+04
HXKB_YEAST          2.499E+05             2.490E+05
HXKG_YEAST          1.192E+05             1.430E+05
K6PF1_YEAST         1.806E+05             1.728E+05
K6PF2_YEAST         1.499E+05             1.702E+05
KPYK1_YEAST         9.535E+06             1.523E+07
KPYK2_YEAST         2.032E+04             -
PDC1_YEAST          4.620E+06             4.869E+06
PDC5_YEAST          4.343E+04             -
PDC6_YEAST          1.685E+04             -
PGK_YEAST           1.103E+06             1.518E+06
PMG1_YEAST          2.958E+06             3.234E+06
TPIS_YEAST          1.020E+06             2.394E+06

Figure 2. Web interface screen captures providing visualization of the quantitation of the peptide IDVAVDSTGVFK from G3P1_YEAST. This represents the first of three labeled peptides that were assigned to this protein (see http://www.mcisb.org/QconCAT/G3P1_YEAST/). (A) shows the extracted ion chromatograms for the masses 625.836 and 628.846 Da, representing the unlabeled and labeled peptide in black and blue, respectively, and illustrates their co-elution. Vertical lines indicate the calculated start and end retention times of the labeled peptide chromatographic peak. (B) illustrates a precursor scan taken between the start and end retention times highlighted in (A), containing isotopic envelopes for both unlabeled and labeled peptides. The SILAC Analyzer linear fit algorithm is applied to each of these precursor scans. This entails applying a sliding window across the isotopic clusters in each of the precursor scans (B), gathering pairs of intensity readings at m/z (representing the unlabeled peptide) and m/z + Δm/z (representing the labeled peptide, where Δm is the monoisotopic mass of the label (6.020 Da in these studies) and z is the precursor ion charge). If both the labeled and unlabeled peptides are present, these intensity pairs display a linear correlation, which is plotted in the Fit tab (C). Applying linear regression to this scatter plot provides an unlabeled to labeled intensity ratio and standard error, which in the above example is 0.793 ± 0.005.

The study attempts to quantify a number of isoenzymes, including two members of the enolase family. A number of QconCAT peptides originally selected to act as unique marker peptides for a given protein were found to be duplicates; that is, they were shared between a number of isoenzymes. An example is SGETEDTFIADLVVGLR, originally selected as a marker peptide for ENO1_YEAST, which was correctly ignored in the quantification calculation of the QconCAT PrideWizard on the grounds that it is also present in ENO2_YEAST. As it is common practice to select multiple peptides to act as markers for a given protein, ENO1_YEAST was successfully quantified by the unique peptides TFAEALR and NVNDVIAPAFVK. However, G3P2_YEAST was not quantified, as both selected marker peptides (VLPELQGK and VPTVDVSVVDLTVK) were non-unique. Such common peptides may be considered in unbiased approaches such as SILAC analyses, where the contribution that a shared peptide makes to each of its proteins may be inferred. However, it is appropriate to exclude such peptides from QconCAT studies, as their presence can usually be mitigated by experimental design.
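A uniqueness check of this kind is straightforward to run at the design stage. The Python sketch below separates candidate marker peptides into unique and shared sets by substring search against a set of protein sequences. The two sequences are short toy strings built around the peptides named above, not the real UniProt entries, and the function names are hypothetical.

```python
def proteins_containing(peptide, proteome):
    """Return the sorted accessions of every protein containing the peptide."""
    return sorted(acc for acc, seq in proteome.items() if peptide in seq)


def split_markers(markers, proteome):
    """Split each protein's candidate markers into unique and shared peptides."""
    unique, shared = {}, {}
    for protein, peptides in markers.items():
        for peptide in peptides:
            owners = proteins_containing(peptide, proteome)
            target = unique if owners == [protein] else shared
            target.setdefault(protein, []).append(peptide)
    return unique, shared


# Toy sequences for illustration only; they are NOT the real enolase sequences.
TOY_PROTEOME = {
    "ENO1_toy": "MAAKTFAEALRGDKSGETEDTFIADLVVGLRNVNDVIAPAFVKAA",
    "ENO2_toy": "MSSKSGETEDTFIADLVVGLRTTEKLL",
}
TOY_MARKERS = {
    "ENO1_toy": ["SGETEDTFIADLVVGLR", "TFAEALR", "NVNDVIAPAFVK"],
}
toy_unique, toy_shared = split_markers(TOY_MARKERS, TOY_PROTEOME)
```

Run against a full proteome before the QconCAT construct is designed, such a check would flag shared peptides such as SGETEDTFIADLVVGLR so that alternatives can be chosen.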

The approach taken by the QconCAT PrideWizard is to first identify heavy/light peptide pairs and then to quantify them. The identification of QconCAT pairs is driven by the Mascot MS/MS Ion Search matching labeled QconCAT peptides, which is dependent upon acquisition of MS/MS data for these peptides. While this may not be applicable to relative quantitative proteomics studies such as SILAC, where a sample may contain hundreds or thousands of pairs across a large dynamic range, QconCAT studies focus on a finite number of peptide pairs (typically around 50). Furthermore, it is assumed that the labeled QconCAT protein is added to the sample at a concentration large enough to ensure that its peptides provide signals of sufficient intensity for fragmentation data to be acquired. Quantitation is performed by extracting individual survey scans containing each heavy/light peptide pair and determining signal ratios of equivalent points across the two isotopic clusters. Such an approach minimizes the potentially detrimental effect of co-eluting peptides.

A key feature of the system is its performance. A single raw data file of 133 MB can be quantified in 249 s (of which 132 s are accounted for by the Mascot database search itself) on a MacBook Pro (Apple Corporation, CA, USA), running Mac OS X 10.6.4, with a 2.5-GHz Intel Core 2 Duo processor and 4 GB 667 MHz DDR2 SDRAM.

A further consideration in the development of the system was ease of use, as the system has been designed for mass spectrometrists rather than bioinformaticians. The QconCAT PrideWizard manages the flow of data from spectral data submission through database searching, peptide/protein quantitation, data formatting and storage. A user can therefore submit a batch of spectral data files and, with only a small number of additional inputs, have all steps of the analysis pipeline performed in a single operation.

Existing tools such as Mascot and the SILAC Analyzer have been reused and repackaged, a deliberate strategy that avoids reinventing the wheel. Where possible, the workflow has utilized existing data representation standards. Mascot has been recognized as the de facto industry standard for performing protein identification, and a link to the original Mascot results, along with individual peptide scores, is provided. Reanalysis of the original raw data is possible due to the facility to export the data in the standard, non-proprietary mzData format. Peptide and protein identifications, matched spectra and experimental metadata can be viewed, queried and exported. This corresponds to journal recommendations, which state that original, raw experimental data should be made available, along with the secondary derived data in terms of protein identifications and quantitations, in a standardized format that facilitates query and use by third parties [10]. Furthermore, the authors intend to update the QconCAT pipeline to support subsequent iterations of the PRIDE XML format that incorporate the newly introduced HUPO Proteomics Standards Initiative standards mzML, mzIdentML and, ultimately, mzQuantML.

The QconCAT Browser allows the survey scan data upon which quantitations are performed to be viewed, so that both identifications and quantitations can be verified by visualization of the original fragmentation and survey spectra in a web browser. This contrasts with the usual procedure, in which verification of reported quantitative data is rarely achievable due to the inaccessibility of the original raw data. Even when raw data are accessible, they are commonly held upon the instrument computer and can usually only be accessed through the vendor-supplied instrumentation software. Explicitly displaying quantitative data along with peptide and protein identifications in a web accessible manner provides significant benefits and is an approach that will hopefully become more widespread over time.

In addition to the web browser interface, a web service interface is provided, allowing programmatic access to the concentration values, identifications and all spectra contained in the database. The web service interface allows the user to submit customized XQuery commands to the XML database, providing the flexibility to query and retrieve any element of the PRIDE XML document, from individual peptide or protein records up to the entire document itself.
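To give a flavour of such programmatic access, the Python sketch below composes an XQuery that retrieves the peptide sequences recorded for one protein and encodes it into a request URL. Several details are assumptions rather than facts from this article: the endpoint URL is a placeholder, the `_query` parameter follows the convention of the eXist REST interface and may not match the actual web service, and the element names (`GelFreeIdentification`, `Accession`, `PeptideItem`, `Sequence`) are assumed to follow the PRIDE XML 2.1 schema and should be checked against it. No request is sent.

```python
import re
from urllib.parse import urlencode

# Placeholder endpoint; the real web service address is not given here.
BASE_URL = "http://example.org/exist/rest/db/qconcat"


def peptide_query(accession):
    """XQuery returning the peptide sequences recorded for one protein."""
    # Restrict the accession to safe characters before embedding it.
    if not re.fullmatch(r"[A-Za-z0-9_]+", accession):
        raise ValueError("unexpected characters in accession")
    return (
        "for $id in //GelFreeIdentification[Accession = '%s'] "
        "return $id/PeptideItem/Sequence" % accession
    )


def query_url(xquery, base_url=BASE_URL):
    """Encode the XQuery as a GET request URL (assumed `_query` parameter)."""
    return base_url + "?" + urlencode({"_query": xquery})


example_query = peptide_query("G3P1_YEAST")
example_url = query_url(example_query)
```

The resulting URL could be fetched with any HTTP client, and the same pattern extends to queries for spectra or quantitation values once the relevant element paths are known.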

This QconCAT analysis pipeline provides a freely available, vendor-independent means of analyzing, visualizing and disseminating QconCAT experimental data, performing the calculation of absolute protein concentrations and managing the storage and dissemination of these data in a standards-compliant manner.

The authors thank Kieran Smallbone, Norman Paton, Rob Beynon, Claire Eyers, Simon Hubbard, Craig Lawless and Julian Selley, and Marcin Rzeznicki for his Savitzky-Golay algorithm implementation (http://code.google.com/p/savitzky-golay-filter/). The authors thank the EPSRC and BBSRC for their funding of the Manchester Centre for Integrative Systems Biology (http://www.mcisb.org), BBSRC/EPSRC Grant BB/C008219/1.

The authors have declared no conflict of interest.

References

[1] Beynon, R. J., Doherty, M. K., Pratt, J. M., Gaskell, S. J., Multiplexed absolute quantification in proteomics using artificial QCAT proteins of concatenated signature peptides. Nat. Methods 2005, 2, 587–589.

[2] Siepen, J. A., Swainston, N., Jones, A. R., Hart, S. R. et al., An informatic pipeline for the data capture and submission of quantitative proteomic data using iTRAQ. Proteome Sci. 2007, 5, 4.

[3] Perkins, D. N., Pappin, D. J., Creasy, D. M., Cottrell, J. S., Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.

[4] Savitzky, A., Golay, M. J. E., Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 1964, 36, 1627–1639.

[5] Nilse, L., Sturm, M., Trudgian, D. et al., SILACAnalyzer – a tool for differential quantitation of stable isotope derived data. CIBB, 6th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics, Genoa 2009.

[6] Baker, R. W. R., Nissim, J. A., Expressions for combining standard errors of two groups and for sequential standard error. Nature 1963, 198, 1020.

[7] Jones, P., Cote, R. G., Martens, L., Quinn, A. F. et al., PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 2006, 34, D659–D663.

[8] Orchard, S., Taylor, C., Hermjakob, H., Zhu, W. et al., Current status of proteomic standards development. Expert Rev. Proteomics 2004, 1, 179–183.

[9] Barsnes, H., Vizcaíno, J. A., Eidhammer, I., Martens, L., PRIDE Converter: making proteomics data-sharing easy. Nat. Biotechnol. 2009, 27, 598–599.

[10] Anon., Democratizing proteomics data. Nat. Biotechnol. 2007, 25, 262.
