Source: kth.diva-portal.org/smash/get/diva2:1252252/FULLTEXT01.pdf

Statistical and machine learning methods to analyze large-scale mass spectrometry data

MATTHEW THE

Doctoral Thesis
Stockholm, Sweden 2018


TRITA-CBH-FOU-2018:45
ISBN 978-91-7729-967-7

School of Engineering Sciences in Chemistry, Biotechnology and Health

SE-171 21 Solna
SWEDEN

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is presented for public examination for the degree of Doctor of Technology in Biotechnology on Wednesday, 24 October 2018, at 13:00 in Atrium, ground floor, Wargentinhuset, Karolinska Institutet, Solna.

© Matthew The, October 2018

Tryck: Universitetsservice US AB



Abstract

Modern biology is faced with vast amounts of data that contain valuable information yet to be extracted. Proteomics, the study of proteins, has repositories with thousands of mass spectrometry experiments. These data gold mines could further our knowledge of proteins as the main actors in cell processes and signaling. Here, we explore methods to extract more information from this data using statistical and machine learning methods.

First, we present advances for studies that aggregate hundreds of runs. We introduce MaRaCluster, which clusters mass spectra for large-scale datasets using statistical methods to assess similarity of spectra. It identified up to 40% more peptides than the state-of-the-art method, MS-Cluster. Further, we accommodated large-scale data analysis in Percolator, a popular post-processing tool for mass spectrometry data. This reduced the runtime for a draft human proteome study from a full day to 10 minutes.

Second, we clarify and promote the contentious topic of protein false discovery rates (FDRs). Often, studies report lists of proteins but fail to report protein FDRs. We provide a framework to systematically discuss protein FDRs and take away hesitance. We also added protein FDRs to Percolator, opting for the best-peptide approach, which proved superior in a benchmark of scalable protein inference methods.

Third, we tackle the low sensitivity of protein quantification methods. Current methods lack proper control of error sources and propagation. To remedy this, we developed Triqler, which controls the protein quantification FDR through a Bayesian framework. We also introduce MaRaQuant, which proposes a quantification-first approach that applies clustering prior to identification. This reduced the number of spectra to be searched and allowed us to spot unidentified analytes of interest. Combining these tools outperformed the state-of-the-art method, MaxQuant/Perseus, and found enriched functional terms for datasets that had none before.



Sammanfattning

Modern molecular biology generates enormous amounts of data, which require sophisticated algorithms to interpret. In the field of proteomics, that is, the large-scale study of proteins, large, easily accessible repositories of data from mass spectrometry experiments have been built up over the years. Furthermore, there are still many unexplored aspects of proteins, which are of particular interest since proteins constitute the most important actors in cell processes and cell signaling. In this thesis, we investigate methods to extract more information from the available data using statistics and machine learning methods.

First, we present advances in handling the analysis of studies with hundreds of aggregated runs. We describe MaRaCluster, a new method for clustering mass spectra that uses statistical methods to assess the similarity between mass spectra. The method identifies up to 40% more peptides than the established method, MS-Cluster. In addition, we have added features to Percolator, the popular software for post-processing of mass spectrometry data. This reduced the execution time for a large-scale study of the human proteome from a full day down to 10 minutes on a standard computer.

Second, we clarify and advocate the use of a measure of uncertainty in protein identifications from mass spectrometry data, the so-called protein-level false discovery rate (FDR). Many studies do not report protein-level FDRs, even though they choose to present their results as lists of proteins. Here, we present a statistical framework to systematically discuss protein-level FDRs and to remove hesitations. We compare scalable protein inference methods and included the best-peptide inference method in Percolator.

Third, we address the lack of sensitivity of protein quantification methods. Current methods lack proper control of error sources and error propagation. To remedy this, we developed a new tool, Triqler, which provides adequate false discovery rate control for protein quantification through a Bayesian framework. We also describe MaRaQuant, in which we introduce a quantification-first approach that applies clustering before identification. This reduced the number of spectra that needed to be matched, but also allowed us to discover unidentified analytes of interest. The combination of these two tools outperformed the established protein quantification method, MaxQuant/Perseus, and found enrichment of functional annotation terms for datasets where none had been found before.


Contents

Acknowledgments
List of publications
1 Introduction
   1.1 Thesis overview
2 Biological background
   2.1 From DNA to proteins
   2.2 Proteins
   2.3 Current state of research
3 Shotgun proteomics
   3.1 Protein digestion
   3.2 Liquid chromatography
   3.3 Tandem mass spectrometry
4 Data analysis pipeline
   4.1 Feature detection
   4.2 Peptide identification
   4.3 Protein identification
   4.4 Protein quantification
5 Statistical evaluation
   5.1 Terminology
   5.2 False discovery rates
   5.3 The Percolator software package
6 Mass spectrum clustering
   6.1 Distance measures
   6.2 Clustering algorithms
   6.3 Evaluating clustering performance
7 Label-free protein quantification
   7.1 Feature detection
   7.2 Peptide quantification
   7.3 Matches-between-runs
   7.4 Missing value imputation
   7.5 Peptide to protein summarization
   7.6 Differential expression analysis
8 Present investigations
   8.1 Paper I: MaRaCluster
   8.2 Paper II: Protein-level false discovery rates
   8.3 Paper III: Percolator 3.0
   8.4 Paper IV: Triqler
   8.5 Paper V: MaRaQuant
9 Future perspectives
Bibliography


Acknowledgements

I owe many thanks to all the people who have made all the work in my Ph.D. and this thesis possible.

First of all, none of this would have been possible without my marvelous supervisor, Lukas. Your support, optimism and never-ending stream of suggestions made this journey an immensely positive experience.

I would also like to thank my collaborators, William Stafford Noble, Fredrik Edfors, Johannes Griss, Yasset Perez-Riverol and Juan Antonio Vizcaíno, as well as the OpenMS crew, Timo Sachsenberg, Julianus Pfeuffer, Leon Bichmann and Oliver Alka, for their help and camaraderie. A special thanks to Oliver Serang, for all your fascinating stories and planting the Bayesian seed, and Wout Bittremieux, for your companionship all around the globe.

All the people who have passed through the Käll lab over the years. Viktor and Luminita, for providing me with a warm welcome to the group, no matter how short it was. David, Heydar, Xuanbin, Ayesha, Sara, Daniel, Yunyi, Alicia, Revant, Aslı and Yang, for all the good times in- and outside the lab. Jorrit, for injecting some “Hollandse nuchterheid” every so often. Yrin, for all the amusing discussions and board game evenings, go finish those degrees! Vital, for making the office an enjoyable place to come to. Richard, for sharing all your kooky ideas and stories. Gustavo, for your contagious good mood, stimulating conversations and for always being up for hot pot and a beer (or two). Patrick, for your infectious high-energy can-do attitude, go apply that to the mess I left you!

To everyone at Gamma 6 for all the enjoyable lunches, fikas and after-work activities: Mattias, Linus, Pekka, Ikram, Owais, Mehmood, Auwn, Hashim, Mohammad, Hazal, Seong, Jovana, Andrei, Parul, Johannes, Ilaria, Daniel, Yue and Mandi. Also to all of you from SciLifeLab, you are simply too many to name, but in particular thanks to Johannes, for being my TA buddy and all-round good company, and to Wenjing, Marcel and Fran, for providing me with a second group when our numbers were particularly low and all the times after that as well.

To my friends that I was fortunate enough to meet here in Stockholm, Steffen, Lisa, Lukas, Thomas, Nicole and Dmitry, and also my Berlin Master’s/Ph.D. buddies, Luis, Carlos and Xiao, for all the camaraderie and good times.




A big thanks as well to my family and friends in the Netherlands and extended family in Denmark. In particular, my parents, for all your continuous support and for always showing great interest in my work. And last but definitely not least, Ditte, for all your encouragement, patience and affection; I would not have come so far today if it wasn’t for you.


List of publications

I have included the following articles in the thesis.

PAPER I: MaRaCluster: A fragment rarity metric for clustering fragment spectra in shotgun proteomics.
Matthew The & Lukas Käll
Journal of Proteome Research, 15(3), 713-720 (2016).
DOI: 10.1021/acs.jproteome.5b00749

PAPER II: How to talk about protein-level false discovery rates in shotgun proteomics.
Matthew The, Ayesha Tasnim & Lukas Käll
Proteomics, 16(18), 2461-2469 (2016).
DOI: 10.1002/pmic.201500431

PAPER III: Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0.
Matthew The, Michael J. MacCoss, William S. Noble & Lukas Käll
Journal of the American Society for Mass Spectrometry, 27(11), 1719-1727 (2016).
DOI: 10.1007/s13361-016-1460-7

PAPER IV: Integrated identification and quantification error probabilities for shotgun proteomics.
Matthew The & Lukas Käll
Manuscript in preparation

PAPER V: Distillation of label-free quantification data by clustering and Bayesian modeling.
Matthew The & Lukas Käll
Manuscript in preparation

The articles are used here with permission from the respective publishers.



The following publications are not included in the thesis.

PAPER VI: Response to “Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra”.
Johannes Griss, Yasset Perez-Riverol, Matthew The, Lukas Käll & Juan Antonio Vizcaíno
Journal of Proteome Research, 17(5), 1993-1996 (2018).
DOI: 10.1021/acs.jproteome.7b00824

PAPER VII: Uncertainty estimation of predictions of peptides’ chromatographic retention times in shotgun proteomics.
Heydar Maboudi Afkham, Xuanbin Qiu, Matthew The & Lukas Käll
Bioinformatics, 33(4), 508-513 (2017).
DOI: 10.1093/bioinformatics/btw619

PAPER VIII: A protein standard that emulates homology for the characterization of protein inference algorithms.
Matthew The, Fredrik Edfors, Yasset Perez-Riverol, Samuel H. Payne, Michael R. Hoopmann, Magnus Palmblad, Björn Forsström & Lukas Käll
Journal of Proteome Research, 17(5), 1879-1886 (2018).
DOI: 10.1021/acs.jproteome.7b00899

PAPER IX: ABRF Proteome Informatics Research Group (iPRG) 2016 Study: Inferring Proteoforms from Bottom-up Proteomics Data.
Joon-Yong Lee, Hyungwon Choi, Christopher Colangelo, Darryl Davis, Michael R. Hoopmann, Lukas Käll, Henry Lam, Samuel H. Payne, Yasset Perez-Riverol, Matthew The, Ryan Wilson, Susan T. Weintraub & Magnus Palmblad
Journal of Biomolecular Techniques, 29(2), 39-45 (2018).
DOI: 10.7171/jbt.18-2902-003


Chapter 1

Introduction

We live in an era where data is generated in gigabytes, or sometimes even terabytes per hour. The field of biology forms no exception to this, where we use all this data to capture the complexity of life, most prominently in the forms of DNA, RNA and proteins. The field that deals with the analysis of this biological data is called bioinformatics.

Of the three aforementioned primary building blocks of life, proteins are arguably the most important when it comes to biological processes. Ideally, one would want to know the expression levels of all proteins in a cell or tissue, and with advances in technology that allow deep sampling of, for example, the human proteome, we are well on our way.

In the field of proteomics, the study of proteins, a major part of the data generation comes from so-called shotgun proteomics experiments. In these experiments, we start with a mixture of unknown proteins. We then use several experimental procedures to break the proteins up into smaller, more easily identifiable fragments, and subsequently detect and register these fragments. The challenge is then to put these shards of evidence together again and derive which proteins were originally present in the mixture.

Several gigabytes of data from these shotgun proteomics experiments are deposited daily to openly accessible repositories, such as the PRIDE archive [77] (http://www.ebi.ac.uk/pride/archive/) and PeptideAtlas [22] (http://www.peptideatlas.org/). In this thesis, we will look at novel methods we have developed which attempt to explore and summarize these gold mines of information, using techniques from machine learning and statistics.

1.1 Thesis overview

In order to make this thesis as accessible as possible to readers unfamiliar with the field of shotgun proteomics, Chapters 2 and 3 are devoted to the biological and experimental background, respectively. Chapter 4 contains an overview of the analysis pipeline that is commonly used in shotgun proteomics and has also been employed at several instances in the included papers. Chapter 5 focuses on the statistical methods that are used to provide appropriate levels of confidence in the obtained results and are the subject of papers II and III. Chapter 6 examines several approaches to summarize the data of large-scale studies by clustering similar observations, providing an introduction to paper I. Chapter 7 gives an introduction to protein quantification, which is the subject of papers IV and V. Chapter 8 summarizes the main points of the included papers and explains where they fit into the pipeline. Finally, Chapter 9 provides some perspectives on the future of the field of shotgun proteomics and the subjects of the included papers in particular.


Chapter 2

Biological background

Over the last century, molecular biology has fundamentally changed our understanding of life and the way we think about it. Among other things, it has given us the tools to reason about evolution, heredity and disease. We take it for granted that we can check for risk factors for certain diseases through analysis of our genome and that we can test blood samples for protein disease markers. Here, a short background in the field of molecular biology will be provided, with a focus on proteins, together with the relevant terminology used later in the thesis.

2.1 From DNA to proteins

The relationship between the three primary building blocks of life can be formulated in what is known as the central dogma of molecular biology. It can loosely be stated as “DNA is transcribed to RNA and RNA is translated into proteins”.

DNA stands for deoxyribonucleic acid and is the famous double-stranded helix that stores our genetic information. Each strand is built up of a chain of nucleotides, of which there are 4 types: cytosine (C), guanine (G), adenine (A) and thymine (T). The two strands are complementary, in the sense that a C on one of the strands has to be coupled to a G at the same location on the other strand and vice versa, while the same relation holds for A and T. Each such pair of C-G or A-T is called a base pair, and DNA molecules are in the order of millions of base pairs long.

RNA stands for ribonucleic acid and is very similar to DNA, except that it has only a single strand and is typically only a few thousand nucleotides long. Instead of thymine, RNA uses the nucleotide type uracil (U). RNA is copied from specific regions of the single DNA strands, called genes, through a process called transcription. In most eukaryotic genes, certain parts of the RNA chains, called introns, are subsequently removed in a process called RNA splicing. The remaining RNA chains, called exons, can be combined in different ways, giving rise to splice variants of a gene.


Residue        3-letter  1-letter  Formula     Mass (Da)
Alanine        Ala       A         C3H5NO      71.04
Arginine       Arg       R         C6H12N4O    156.10
Asparagine     Asn       N         C4H6N2O2    114.04
Aspartic acid  Asp       D         C4H5NO3     115.03
Cysteine       Cys       C         C3H5NOS     103.01
Glutamic acid  Glu       E         C5H7NO3     129.04
Glutamine      Gln       Q         C5H8N2O2    128.06
Glycine        Gly       G         C2H3NO      57.02
Histidine      His       H         C6H7N3O     137.06
Isoleucine     Ile       I         C6H11NO     113.08
Leucine        Leu       L         C6H11NO     113.08
Lysine         Lys       K         C6H12N2O    128.09
Methionine     Met       M         C5H9NOS     131.04
Phenylalanine  Phe       F         C9H9NO      147.07
Proline        Pro       P         C5H7NO      97.05
Serine         Ser       S         C3H5NO2     87.03
Threonine      Thr       T         C4H7NO2     101.05
Tryptophan     Trp       W         C11H10N2O   186.08
Tyrosine       Tyr       Y         C9H9NO2     163.06
Valine         Val       V         C5H9NO      99.07

Table 2.1: The 20 canonical amino acids. The second and third columns are, respectively, the 3-letter code and 1-letter code commonly used as a shorter representation of the amino acid when given in a protein sequence. The given mass is the monoisotopic mass in unified atomic mass units, also known as daltons (Da).
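As an illustration of how Table 2.1 is used in practice, the monoisotopic mass of an unmodified peptide can be computed by summing its residue masses and adding one water molecule for the free termini. A minimal sketch (the helper name and the subset of residues are illustrative, not from the thesis):

```python
# Monoisotopic peptide mass from the residue masses in Table 2.1.
# A peptide's mass is the sum of its residue masses plus one water
# (H2O, ~18.01 Da) accounting for the free N- and C-termini.

RESIDUE_MASS = {  # subset of Table 2.1, monoisotopic masses in Da
    "A": 71.04, "G": 57.02, "I": 113.08, "L": 113.08, "K": 128.09,
    "R": 156.10, "S": 87.03, "D": 115.03, "M": 131.04, "P": 97.05,
    "T": 101.05,
}
WATER = 18.0106  # monoisotopic mass of H2O in Da

def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of an unmodified peptide given as a 1-letter string."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

# The peptide ILGAG: 113.08 + 113.08 + 57.02 + 71.04 + 57.02 + water
print(round(peptide_mass("ILGAG"), 2))  # 429.25
```

Note that isoleucine and leucine share the same mass, which is why they are indistinguishable by mass spectrometry alone.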

Proteins consist of a chain of typically hundreds of amino acids, which are connected by amide bonds. The amino acids come in 20 types that all have the same basic structure of an amine (NH2) and carboxylic acid (COOH) group, but differ in their side chains (see Table 2.1). They are translated from RNA strands, where a triplet of nucleotides codes for a certain type of amino acid. As one can see from the math, there are 4³ = 64 triplets and only 20 amino acid types, and consequently certain triplets code for the same amino acid. Also, there is one triplet, the start codon, that signals the start of the protein, as well as three triplets, the stop codons, that mark the end of the protein.
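The codon arithmetic above can be checked directly by enumerating all nucleotide triplets; a tiny sketch:

```python
# Enumerate every possible RNA codon: 4 nucleotide types, triplets of 3.
from itertools import product

codons = ["".join(triplet) for triplet in product("ACGU", repeat=3)]
print(len(codons))  # 64 possible codons for only 20 amino acids (plus start/stop)
```

The surplus of codons over amino acid types is what makes the genetic code redundant.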

A protein is commonly represented as a string of amino acids, starting with the amino acid at the amino end, also called the N-terminus, and ending with the amino acid at the carboxyl end, also called the C-terminus. This is also the direction in which the protein is translated from RNA, i.e. new amino acids are added to the carboxyl end during translation.


From the brief description above, one might get the impression that it would be enough to know the DNA sequences to know what proteins would be produced. However, there are several complicating factors, such as splice variants, gene regulation, as well as RNA and protein degradation, that prevent this conclusion from being correct. Knowing the DNA sequence does not tell us how much RNA will be produced from a specific gene, and the correlation of protein abundance and RNA levels is generally low, although this is currently a controversial point of debate [47, 44, 75, 115, 25].

2.2 Proteins

In contrast to DNA strands, proteins can have many different structures, owing to the diversity of the amino acid types’ side chains. The side chains differ in, among other things, ion charge, size and hydrophilicity, i.e., the degree to which they are attracted to water. A protein typically folds into a conformation that minimizes its energy (Figure 2.1), which depends on the above-mentioned characteristics, as well as other chemical properties of the amino acid side chains. Proteins can also bind to one or more other proteins, forming so-called protein complexes.

Figure 2.1: The 3D structure of the protein myoglobin (blue). The amino acid chain folds into specific structures that give the protein its functionality. Myoglobin is involved in the transportation of oxygen to cells. Here, a heme group (gray) with a bound oxygen molecule (red) is bound to the myoglobin molecule. [7]


Exactly this diversity in structure allows proteins to act as the major providers of functionality at the cell level, for example, catalyzing chemical reactions, transporting molecules, or receiving and reacting to signals in the cell’s environment. An important mechanism that helps in this last task is post-translational modifications (PTMs), which are reversible chemical alterations to amino acids that can, for example, activate or deactivate the protein. In short, knowing the proteins that are present in a cell, ideally with their modifications and a functional annotation, can tell us a lot about what is happening in a cell.

PTMs and splice variants result in variation in the population of proteins stemming from the same gene, known as the different proteoforms of a gene [103]. Humans have approximately 20 000 different protein-coding genes [36]. Taking splice variants into account, this results in around 70 000 protein species [116]. Accounting for PTMs as well as single nucleotide polymorphisms (SNPs), the most common form of genetic variation between people, could send the total number of distinct proteoforms into the millions [1]. Proteoforms from a single gene are all slightly different and may even have different functionality. However, as they often have significant parts of their amino acid chains in common, it can be quite challenging to distinguish them from one another. Therefore, it is common to report proteins on the gene level, either with or without taking splice variants into account.

2.3 Current state of research

Nowadays, we can determine entire DNA sequences of humans and other organisms, their genomes, in a matter of hours. We can also get relatively complete images of the gene expression, the transcriptome, using RNA sequencing techniques. However, getting a complete image of the protein expression, the proteome, is not yet feasible. Note that gene and protein expression are very much cell type dependent, and therefore the transcriptome or proteome of an organism should be a collection of transcriptomes or proteomes from different tissues and cell types. Some notable attempts have been made at constructing a human proteome [115, 60], but both projects took a substantial amount of time and effort and are far from complete, each claiming a coverage of 70-80% of the known human genes.

One of the main challenges in proteomics is the large dynamic range of protein copy numbers in a cell. Protein copy numbers span 7 orders of magnitude in a typical cell, from just a single copy to tens of millions [122]. In stark contrast, typical untargeted proteomics experiments can cover about 4 orders of magnitude and are unable to detect proteins with fewer than 100 copies per cell [122].

Another important cornerstone of molecular biological research is the functional annotation of genes and proteins. As biological processes are often highly complex and interdependent, it is often difficult to isolate the specific functionality of a single gene or protein. While there are definitely cases where we have a clear image of the function of a gene or protein, we more often have to satisfy ourselves with only knowing the pathway it is active in, or use the functionality of a protein that has a (partly) similar sequence or structure as a proxy. Databases that have up-to-date information on gene or protein functionality include Gene Ontology [6] (http://geneontology.org/), Pfam [9] (http://pfam.xfam.org/) and Uniprot [4] (http://www.uniprot.org/).


Chapter 3

Shotgun proteomics

In this chapter, we will take a look at one of the most common techniques to identify and quantify large numbers of protein species in a sample, shotgun proteomics. Direct measurement of proteins has proven to be difficult [59, 13]. Therefore, in shotgun proteomics, several steps are undertaken to reduce the complexity of the analyte by breaking up the proteins into progressively smaller pieces, hence the adjective shotgun. The following sections give a more detailed explanation of each of these steps.

Note that we will focus on untargeted proteomics techniques, where we attempt to identify and quantify thousands of proteins at a time. We will not delve into targeted proteomics techniques, such as selected reaction monitoring (SRM) [91] and multiple reaction monitoring (MRM) [62], as only tens of proteins can typically be detected in such experiments. Also, fractionation techniques, such as gel electrophoresis and strong cation exchange [82], will not be explored here, as they generally reduce the reproducibility between runs.

3.1 Protein digestion

Proteins typically have lengths of a few hundred amino acids, which makes them too complex to detect directly. Therefore, the proteins are first digested, similar to the way that food is digested in our digestive tract. Proteases cut the long protein amino acid chains into smaller pieces, ideally to a length of about 6 to 50 amino acids. These short amino acid chains are called peptides and are basically shorter forms of proteins. In the following, when we refer to a peptide, we actually mean the set of all peptides with the same amino acid chain, sometimes also referred to as a peptide species, rather than a single molecule.

Common proteases include trypsin, Lys-C and chymotrypsin, and are often isolated from the digestive systems of vertebrates; porcine and bovine trypsin, for example, are isolated from the pancreas. Proteases cleave the protein at specific places in the amino acid chain, e.g. trypsin cleaves the amide bonds after arginine and lysine unless they are followed by proline. As an example, the hypothetical protein MKPATRDISILKILGAG would be split up into the peptides MKPATR, DISILK and ILGAG using trypsin as the protease. The predictability of this cleaving process makes it easier to identify the peptides in later stages. However, the cleaving process is not perfect and frequently some cleavage sites are missed, known as miscleavages. The opposite can also happen: proteins are split at other amide bonds than the cleavage sites, known as partial digestion.
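The trypsin rule described above (cleave after K or R, but not when a P follows) can be sketched as a small in-silico digestion; this is a minimal illustration with no miscleavages, not the implementation of any particular search engine:

```python
# In-silico tryptic digestion: cleave after K or R unless followed by P.
import re

def trypsin_digest(protein: str) -> list[str]:
    """Split a protein sequence at tryptic cleavage sites (no miscleavages)."""
    # Zero-width split: position after K/R that is not followed by P.
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

# The hypothetical protein from the text: the K at position 2 is followed
# by P and is therefore not cleaved.
print(trypsin_digest("MKPATRDISILKILGAG"))  # ['MKPATR', 'DISILK', 'ILGAG']
```

Modeling miscleavages would amount to also keeping concatenations of adjacent peptides from this list.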

3.2 Liquid chromatography

The goal is now to identify the peptides that are present after the digestion step. This task will be easier if we can analyze only a single peptide at a time. Liquid chromatography is used as a first step to accomplish this separation of the peptides.

As mentioned before, amino acid side chains have different chemical properties, for example, with respect to hydrophobicity and charge. Liquid chromatography (LC) uses this fact by letting the peptides, dissolved in a liquid solvent, pass through a column containing an adsorbent. The composition of the liquid solvent is gradually changed to include more organic solvent. The duration and strategy of this change is called the gradient. Through interactions between the adsorbent, the liquid solvent and the amino acid side chains, peptides are slowed down at different rates. This causes different peptides to elute at different times from the LC column, whereas peptides of the same species elute at approximately the same time. The time it takes for a particular peptide to pass through the LC column is known as the peptide’s retention time.

The longer the gradient, the more peptides and proteins can be identified. As a rule of thumb, the number of identified proteins is 1000 + 1000 × h, with h the number of hours of the gradient [122]. Typical gradient times are around 2 hours, but short gradients of just 20 minutes [38] and long gradients of more than 40 hours [54] have been used.

3.3 Tandem mass spectrometry

While liquid chromatography creates a first level of separation, there are usually still many peptide species that elute at approximately the same time. As our goal is to analyze a single peptide species at a time, we need to apply a second level of separation. This is done with a mass spectrometer, an instrument that ionizes molecules and then separates them on the basis of their mass-to-charge (m/z) ratio1 (Figure 3.1). In shotgun proteomics, we often use a technique that consists of two steps of mass spectrometry, tandem mass spectrometry.

1 Molecular masses are typically given in dalton (Da) units, whereas mass-to-charge ratios are given in thomson (Th) units, where 1 Th = 1 Da/e, with e the elementary charge unit.


The instrument makes use of the fact that charged molecules will change direction when passing through an electric or magnetic field, by the Lorentz force: F = q(E + v × B). The degree of deflection depends on the mass of the molecule by Newton's second law of motion: F = m · a. Modern mass spectrometers work on the same principles but use different techniques to separate molecules than the one schematically depicted in Figure 3.1, for instance, time of flight (TOF), quadrupoles or ion traps. Some of these techniques can detect the mass-to-charge ratio of an ion without it needing to hit a detector, facilitating further analysis of the molecule.
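As an illustration of these two laws, an ion moving perpendicular to a uniform magnetic field follows a circular path of radius r = mv/(qB), so ions with a larger mass-to-charge ratio are deflected less. The sketch below uses arbitrary illustrative numbers, not actual instrument settings.

```python
# Radius of circular motion of an ion in a magnetic field: r = m*v / (q*B).
# The speed and field strength below are arbitrary illustrative values.
DA_IN_KG = 1.66054e-27      # 1 dalton in kilograms
E_CHARGE = 1.60218e-19      # elementary charge in coulombs

def deflection_radius(mass_da, charge, speed_m_s, b_field_t):
    """Heavier ions at the same charge and speed follow a wider arc."""
    return (mass_da * DA_IN_KG * speed_m_s) / (charge * E_CHARGE * b_field_t)

r_light = deflection_radius(500.0, 1, 1e5, 0.5)
r_heavy = deflection_radius(1500.0, 1, 1e5, 0.5)
assert r_heavy > r_light  # larger m/z -> less deflection
```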

Figure 3.1: A mass spectrometer ionizes the molecules of interest and then separates the charged molecules on the basis of their mass-to-charge ratio. [35]

The LC column is often directly connected to the mass spectrometer and peptides enter the mass spectrometer according to their retention time. Peptides are first ionized into precursor ions before the deflection in an electric or magnetic field. The precursor mass-to-charge ratios, i.e., the mass-to-charge ratios of the precursor ions, are recorded in a so-called MS1 spectrum. Several MS1 spectra are recorded, each at a different retention time.

The MS1 spectrum consists of a list of mass-to-charge ratios with their corresponding signal intensity, which itself is a function of the precursor ions' abundance. We often plot the signal intensity as a function of the mass-to-charge ratio to visualize a spectrum, where we can identify the precursor peaks corresponding to certain precursor ions as mass-to-charge ratios with high signal intensity (Figure 3.2, bottom). The MS1 spectrum furthermore records the retention time at which the spectrum was registered.

To visualize the precursor ions entering the mass spectrometer we can create a mass chromatogram, which plots the ion intensity against the retention time. For example, a plot can be made of the ion intensity summed over all precursor mass-to-charge ratios (Figure 3.2, top), known as a total ion chromatogram (TIC). Such a plot can be seen as a summary of all the MS1 spectra, which themselves can be thought of as slices along the retention time axis, by disregarding the recorded individual precursor ion mass-to-charge ratios. This type of plot mostly serves as a diagnostic plot to check how many analytes elute at which retention time. Chromatograms of a specific precursor mass-to-charge ratio, extracted-ion chromatograms (XICs), can be used to quantify peptides, by using the peak height or the area under the curve at a certain retention time interval as a proxy for the peptides' abundance.

Figure 3.2: Examples of a TIC plot (top) and an MS1 spectrum (bottom). Top: in a TIC mass chromatogram we plot the retention time in minutes on the x axis and the total precursor ion intensity on the y axis. Bottom: the MS1 spectrum taken at a retention time of 85.2 minutes. It displays the precursor ion mass-to-charge ratio on the x axis and the precursor ion intensity on the y axis.


After obtaining the MS1 spectrum, two diverging approaches are available. Either we try to analyze a single peptide species at a time by selecting a very narrow precursor mass-to-charge ratio window, known as data-dependent acquisition, or we take large windows of mass-to-charge ratios and capture multiple peptide species at a time, known as data-independent acquisition.

Data-dependent acquisition (DDA)

For data-dependent acquisition (DDA), a narrow window called the isolation window, typically around 1 Th around the target precursor mass-to-charge ratio, is selected based on high-intensity precursor peaks in the MS1 spectrum. This ideally leads to the selection of a single peptide species. The precursor ions within this range are fragmented into smaller ions, using techniques such as collision-induced dissociation (CID) [53], higher-energy collisional dissociation (HCD) [88] and electron-transfer dissociation (ETD) [105, 121]. Most often, the peptide breaks at one of its amide bonds, thereby producing two fragment ions that are themselves short amino acid chains, or just a single amino acid. These fragment ions are then recorded in a so-called MS2 spectrum or fragment spectrum.

Several mass-to-charge ranges of the precursor ions are selected to produce numerous MS2 spectra. Usually, a list of recently selected mass-to-charge ranges is kept, so that one does not sample the same highly abundant precursor ions over and over again. This technique is known as dynamic exclusion and includes several parameters that govern how often one samples a particular range before excluding it and for how long it should be excluded.
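A minimal sketch of how dynamic exclusion could work is given below. The function and its parameters (an exclusion duration in seconds and a top-N selection) are illustrative assumptions; real instruments match precursors within an m/z tolerance window rather than on exact values.

```python
# A minimal sketch of dynamic exclusion in DDA precursor selection.
# Parameter names and values are illustrative, not vendor settings.
def select_precursors(ms1_peaks, exclusion, current_time,
                      exclusion_duration=30.0, top_n=10):
    """ms1_peaks: list of (mz, intensity); exclusion: dict mz -> expiry time."""
    # Drop exclusion-list entries that have expired.
    for mz in [mz for mz, expiry in exclusion.items() if expiry <= current_time]:
        del exclusion[mz]
    selected = []
    for mz, intensity in sorted(ms1_peaks, key=lambda p: -p[1]):
        if len(selected) == top_n:
            break
        if mz not in exclusion:          # skip recently fragmented precursors
            selected.append(mz)
            exclusion[mz] = current_time + exclusion_duration
    return selected
```

Calling this repeatedly on the same peaks first picks the most intense precursors, then falls through to less abundant ones until the exclusion entries expire.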

An MS2 spectrum, like an MS1 spectrum, consists of a list of mass-to-charge ratios with their corresponding signal intensity. In the plots of the signal intensity as a function of the mass-to-charge ratio, we can now identify the fragment peaks as corresponding to certain fragment ions (Figure 3.3). The MS2 spectrum also retains information about the mass-to-charge range that was selected for the precursor ion, as well as the retention time.

The advantage of data-dependent acquisition is that the MS2 spectra usually only feature a single (dominant) peptide species, thereby preventing the more difficult case of having the fragment peaks of two or more peptide species in a single MS2 spectrum. In that case, we need to both figure out which peaks belong to which peptide species and identify which peptide species they came from. However, it is not at all uncommon for two peptide species to be fragmented together anyway, resulting in so-called chimeric spectra, with recent studies showing over 50% of the MS2 spectra featuring this behavior [72, 79].

Figure 3.3: Example of an MS2 spectrum. In an MS2 spectrum, we plot the observed fragment ion intensity (black bars) as a function of the fragment ion m/z. Overlayed, we plot the predicted fragment ion peaks of its peptide identification (red lines, colored amino acid oligomer annotations), which frequently match up with the observed fragments. The precursor ion peak at a mass-to-charge ratio of 517.60 Th can be seen in Figure 3.2.

The major issue with data-dependent acquisition is its poor reproducibility. Each time the experiment is run, a different set of precursor ions could be selected. This is mainly due to small variations in the MS1 spectrum that influence the precursor mass-to-charge ratio selection and thereby also the outcome of the dynamic exclusion. This would not be a problem if we were able to sample all or most of the peptide species at least once, but in practice, we are often considerably undersampling the pool of present peptide species due to limits of the instrumentation [79].

Data-independent acquisition (DIA)

A recently popularized alternative to DDA is data-independent acquisition (DIA), also known as SWATH [110, 39]. Here, broad precursor mass-to-charge ratio windows are systematically selected for the fragmentation round, often with tens of peptide species in them. At the cost of more complex MS2 spectra, we now have a much deeper sampling of the pool of present peptide species. If the windows are taken broad enough, one could even sample all peptide species that elute from the LC column, which could greatly increase the reproducibility of shotgun proteomics experiments. One of the main use cases of DIA is in protein quantification, where we can now use chromatograms of specific fragment ions in the MS2 spectra for quantification, instead of using the chromatograms from the precursor ions in the MS1 spectra.


Chapter 4

Data analysis pipeline

Bioinformaticians receive the MS1 and MS2 spectra that were produced in the shotgun proteomics experiment and are now faced with the task of putting these pieces of the puzzle back together. The end product is commonly a list of proteins that one believes to have been present in the original mixture, possibly with a quantification of the protein's concentration. This chapter focuses on peptide and protein identification rather than quantification, which is the subject of Chapter 7. For a more elaborate review of the data analysis pipeline, see [86].

4.1 Feature detection

One of the most important pieces of information for matching an MS2 spectrum to a peptide sequence is the precursor mass of the peptide. By assigning an accurate precursor mass to a spectrum, the search space can be reduced to only the peptide sequences with the correct mass. Each MS2 spectrum does come with information regarding its precursor mass-to-charge ratio, but in fact, this only reflects the isolation window rather than the actual mass-to-charge ratio of the precursor ion of interest. To resolve this discrepancy, we can recalibrate the precursor mass-to-charge ratio of the precursor ion and make a prediction of its charge state by examination of the MS1 spectra.

Software packages that were developed for this purpose include Hardklör [50] and Bullseye [52]. Hardklör scouts the MS1 spectra for potential precursor ions, known as MS1 features, and Bullseye subsequently matches these to the corresponding MS2 spectrum, while propagating information extracted from the peaks in the MS1 spectra, such as precursor mass-to-charge ratio, charge, retention time and ion current. Note that multiple MS1 features can be assigned to a single MS2 spectrum and it is not uncommon to have multiple precursor ions within the isolation window, resulting in chimeric spectra. In the context of peptide and protein identification, we normally do not use the MS1 spectra from this point on and focus solely on the MS2 spectra for the identification of peptides. For the role of feature detection in protein quantification, see Chapter 7.

4.2 Peptide identification

Several approaches are available for identifying the peptide from an MS2 spectrum. They can roughly be classified into three categories: database search engines, spectral library searching and de novo sequencing. Database search engines are by far the most commonly used and usually are capable of identifying the highest number of peptides, but the other two categories are gaining in popularity as the amount of data, the accuracy of the instruments and the available computing power are increasing.

Database search engines

As the number of possible peptides grows exponentially with their length n as 20^n, identifying the amino acid sequence directly from the MS2 spectrum is a very difficult task. Instead, we usually work with databases containing sequences of known proteins for the organism of interest. We then do an in silico digestion of these proteins, i.e. we cleave the proteins based on the cleavage rules of the protease that was used. At this stage, we can also account for miscleavages, partial digestion and post-translational modifications.

Having obtained a list of potentially present peptides, we can construct a theoretical spectrum by considering the mass-to-charge ratios of the fragments that result from breaking the peptide at the amide bonds. Subsequently, a scoring function is applied to a pair of experimental and theoretical spectra, which assigns a score to the pair. Such a pair is called a peptide-spectrum match (PSM). We only compare experimental and theoretical spectra for which the precursor masses are sufficiently close, with a tolerance set depending on the accuracy of the instrumentation. On modern instruments, this mass search window can be as narrow as 20 parts per million (ppm) around the experimentally obtained precursor mass-to-charge ratio.
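For the simplest case of singly charged b- and y-ions of an unmodified peptide, constructing a theoretical spectrum amounts to computing prefix and suffix masses. The sketch below uses standard monoisotopic residue masses for a subset of amino acids and omits other ion series, higher charge states and modifications.

```python
# Monoisotopic residue masses in Da for a subset of amino acids.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "T": 101.04768, "L": 113.08406, "I": 113.08406, "N": 114.04293,
           "D": 115.02694, "K": 128.09496, "E": 129.04259, "R": 156.10111}
PROTON, WATER = 1.00728, 18.01056

def theoretical_spectrum(peptide):
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]          # prefixes
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]  # suffixes
    return sorted(b + y)

mz_values = theoretical_spectrum("PEPTIDE")  # 12 fragment m/z values
```

A scoring function would then count or weight the matches between these theoretical m/z values and the observed fragment peaks.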

Popular database search engines include SEQUEST [27], MASCOT [16], X!Tandem [19] and MS-GF+ [61].

Spectral library searching

One of the problems with database search engines is that the theoretical spectrum is not necessarily a good template for the experimental spectrum of a peptide due to the complex processes governing the fragmentation. Therefore, in spectral library searching we instead choose to use previously identified spectra as templates for a peptide. This first of all requires a good database of template spectra, as well as an appropriate scoring function that compares a template spectrum to the experimental spectrum.


Databases of template spectra are commonly constructed using the reliable identifications from database search engines. Notable examples include the NIST (http://peptide.nist.gov), PeptideAtlas (http://www.peptideatlas.org/speclib/) and PRIDE (https://www.ebi.ac.uk/pride/cluster/) spectral libraries. As different ionization and fragmentation techniques produce different spectra for the same peptide species, it is often beneficial to select a library created with the same or similar instrumentation.

Spectral libraries are especially important in the context of DIA, as traditional search engines typically do not have sufficient sensitivity to identify more than 2 peptide species in a spectrum. Using the peak intensity information from previous experimental spectra and reducing the search space of candidate peptides can provide the extra sensitivity needed [39].

Packages implementing a scoring function to identify peptides from spectral libraries include SpectraST [67], X!Hunter [20] and BiblioSpec [34]. For a comprehensive overview of spectral library searching, see [66].

De novo sequencing

The most challenging class of algorithms for peptide identification is de novo sequencing. Here, one attempts to identify the peptide sequence directly from the distances between fragment peaks, without making use of a database. De novo identification is most commonly used for experiments in which no good reference database is available, for example for organisms without a reference genome [24].

One of the challenges for de novo identification is that, frequently, not all amide bond breaks are recorded as fragment ions, resulting in a gap of two or more amino acids. By using combinations of amino acids one can bridge such gaps, but often several combinations of amino acids add up to an approximately equal mass that could explain such a gap. Additionally, we do not know the order of the amino acids in the gap. Therefore, de novo peptide identifications often contain ambiguous regions, i.e., multiple short amino acid chains are just as likely to have produced a part of the spectrum.
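Such ambiguities can be enumerated directly from the residue masses. For example, two glycines (2 × 57.02146 Da) are nearly indistinguishable from a single asparagine (114.04293 Da). The snippet below, using a subset of standard monoisotopic masses, finds residue pairs whose summed mass matches a single residue within a small tolerance.

```python
# Enumerate residue pairs whose summed mass matches a single residue
# within a small tolerance -- the kind of ambiguity de novo sequencing faces.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
           "N": 114.04293, "Q": 128.05858, "K": 128.09496, "W": 186.07931}

def ambiguous_gaps(tol_da=0.001):
    pairs = []
    residues = sorted(RESIDUE)
    for a in residues:
        for b in residues:
            for single, mass in RESIDUE.items():
                if abs(RESIDUE[a] + RESIDUE[b] - mass) < tol_da:
                    pairs.append((a + b, single))
    return pairs

print(ambiguous_gaps())  # includes ('GG', 'N') and ('AG', 'Q')
```

Note that K and Q differ by only about 0.036 Da, which is why high mass accuracy is so valuable for de novo sequencing.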

Popular de novo sequencing packages include PEAKS [74], PepNovo [32], NovoHMM [29], pNovo [14] and Novor [73].

Post-translational modification searches

The ability to identify post-translational modifications (PTMs) represents one of the main benefits of proteomics over transcriptomics. However, it comes with the challenge that the search space increases exponentially as a function of the number of modifications that are searched for. An increased search space not only increases the time for peptide identification but also reduces the sensitivity of the search, as the probability of obtaining a false positive increases as well.


Typically, only a handful of common modifications are searched for, such as oxidation of methionine, N-terminal acetylation and deamidation of asparagine and glutamine. Unfortunately, these modifications typically do not have much biological relevance, but failure to account for them will result in many missed peptides. More interesting are studies of, for example, phosphorylation by protein kinases, which plays an important role in cell signaling [28, 76]. However, these studies typically rely on the enrichment of the post-translationally modified peptides and proteins, as their expression levels are typically too low to be detected in normal experiments.

All major database search engines allow users to specify the modifications they want to search for, but choosing which modifications to search for is not a straightforward task. Alternatively, so-called open modification searches do not require a predetermined set of modifications, but choose a modification based on unexplained gaps between fragment peaks [83] or on the precursor mass difference [64].

Popular open modification search engines are MODa [83] and MSFragger [64].

Retention time prediction

The supplied precursor mass narrows down the number of candidate peptides for an experimental spectrum from the order of millions to several hundreds. We could narrow down the search space even further if we predict the retention time of a theoretical peptide and compare this to the retention time of the experimental spectrum. Unfortunately, the processes inside the LC column are quite complex and no simple expression can be found to predict a peptide's retention time; predictions are furthermore dependent on the type of LC column.

A popular approach has been to assign retention coefficients to the amino acids, reflecting their individual contribution to the retention time, as used in, for example, SSRCalc [65]. Also, attempts have been made to predict retention times by modeling the physical interactions of the peptide with the LC column, such as in BioLCCC [40]. Alternatively, machine learning algorithms, such as Elude [81] and RTPredict [90], can be applied that learn parameters from previously seen data and use these to make predictions of peptide retention times. For a review of the subject, see [80]. Retention time prediction is the subject of Paper VII (not included in this thesis).
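A toy version of the retention-coefficient approach can be sketched as a linear model: an intercept plus a per-residue contribution. The coefficients below are invented for illustration; real coefficients are fitted to training data and depend on the LC setup.

```python
# Toy retention-coefficient model in the spirit of SSRCalc: predicted
# retention time = intercept + sum of per-residue coefficients.
# All coefficient values below are made up for illustration only.
COEFF = {"L": 1.6, "I": 1.4, "F": 1.5, "W": 1.9,    # hydrophobic: retained longer
         "K": -0.5, "R": -0.6, "D": -0.3, "E": -0.3,  # charged: elute earlier
         "G": 0.0, "A": 0.2, "S": -0.1, "T": 0.1}

def predict_rt(peptide, intercept=2.0):
    return intercept + sum(COEFF.get(aa, 0.0) for aa in peptide)

# A hydrophobic peptide is predicted to elute later than a charged one.
assert predict_rt("LLIFW") > predict_rt("KRDES")
```

In practice such coefficients are fitted by regression against observed retention times, and modern predictors replace the simple sum with learned, context-dependent models.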

4.3 Protein identification

For certain studies, it is enough to have the peptide identifications, but often we are more interested in the actual proteins that those peptides came from. This task is made difficult by the fact that many peptides could stem from more than one protein, known as shared peptides, either due to multiple splice variants from the same gene or due to homologous genes, that is, genes with common ancestry. Therefore, we cannot simply combine evidence from all peptides that could have come from a protein, as some of those peptides could have come from other proteins. This issue has been designated the protein inference problem [84] and several protein inference methods have been developed to tackle it.

There is currently a wide availability of schemes to infer proteins, based on different principles, such as IDPicker [117], which uses the parsimony principle, and ProteinProphet [85], MSBayes [69] and Fido [99], which all use probabilistic models. For a review on the subject, see [100].

4.4 Protein quantification

Another entity of interest can be the concentration of each protein that was identified, and a wide variety of methods are employed for this. While most of these methods quantify peptides rather than proteins, the idea is that we can transfer these peptide quantifications to the protein level. Here, we again face the protein inference problem and care has to be taken to do this correctly. By comparing the quantitative levels of peptides or proteins between different samples, for example, treated and untreated, we can get a better insight into the inner workings of the biological processes. For a review on the subject of protein quantification, see [8].

Some quantification methods rely on modification of the sample by introducing stable isotopes, for example, stable isotope labeling by amino acids (SILAC) [89] and isobaric labeling techniques, such as tandem mass tags (TMT) [108] and isobaric tags for relative and absolute quantitation (iTRAQ) [114].

Others do not require modification of the sample and are therefore cheaper and simpler and allow retrospective analysis of the data, often at the cost of higher noise levels and more missing values. These so-called label-free quantification strategies include spectral counting [70] or using the total ion current (TIC) of precursor ions [12]. Chapter 7 will provide more detailed information with regards to label-free quantification. Alternatively, DIA methods allow protein quantification by fragment ions and achieve high coverage and reproducibility [95].


Chapter 5

Statistical evaluation

Having obtained a list of peptides, proteins or even protein quantities from the data analysis pipeline, we are faced with the question: "How confident are we that these are correct?". The field of statistics is essential in answering this question. Failure to apply statistical techniques, or to apply them correctly, could cost an experimenter valuable time and money. We start by providing some statistical terminology, followed by the most common approach to statistical evaluation in shotgun proteomics, and finish with a section about a software package developed by our lab, Percolator, that performs statistical evaluation of shotgun proteomics data. This chapter serves as a detailed introduction to Papers II and III.

5.1 Terminology1

In statistics, a null hypothesis usually represents the presupposition that the observations are a result of random chance alone. For example, a null hypothesis for a PSM produced by a database search engine could be: "The peptide was spuriously matched to the spectrum".

We will call a PSM, peptide or protein for which we have found evidence through the data analysis pipeline an inference, as they have respectively been inferred from mass spectra, PSMs and peptides. These inferences can be correct or incorrect, which actually depends on the null hypothesis we are testing, as will be illustrated later in this chapter.

Ideally, the list of inferences is sorted by some score, for instance, from the scoring function of a database search engine in the case of PSMs. Since we are only interested in the inferences we are very confident about, we typically only keep the inferences scoring better than some kind of threshold. These inferences scoring above the threshold will be referred to here as discoveries2. Discoveries, like inferences, can be correct or incorrect; correct discoveries are called true discoveries and incorrect discoveries are called false discoveries.

1 In some cases, other terms will be more common in the field of statistics than the ones used here. However, terms like "identification" or "feature" have confounding meanings due to their concurrent usage in the fields of proteomics and machine learning and were therefore avoided on purpose.

5.2 False discovery rates

Introduced by Benjamini and Hochberg [11], the false discovery rate (FDR) is a concept from multiple hypothesis testing aimed at controlling the number of false discoveries. When multiple hypotheses are tested at the same time, we cannot simply use a significance level on p values, as we would when testing a single hypothesis. To see this, say that we use a significance level of 0.05, test a hypothesis 1000 times, and let the null hypothesis be true every time. Then, by random chance alone, we would already expect 50 discoveries, without there being a true discovery in the whole set.
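This expectation is easy to verify by simulation, since p values are uniformly distributed on [0, 1] when the null hypothesis is true:

```python
import random

# Under a true null hypothesis, p values are uniform on [0, 1], so testing
# 1000 true-null hypotheses at a 0.05 significance level yields about
# 50 spurious "discoveries" purely by chance.
random.seed(1)
batches = [sum(random.random() < 0.05 for _ in range(1000)) for _ in range(100)]
mean_false_discoveries = sum(batches) / len(batches)  # close to 50
```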

The FDR is defined as the expected fraction of false discoveries in the group of discoveries:

FDR = E[ false positives / (false positives + true positives) ]

The FDR allows for a few false discoveries, provided that their number is small compared to the number of true discoveries. A commonly employed threshold of 1% FDR would, for example, allow 10 false discoveries, provided that there are 990 true discoveries. This might not be acceptable for certain critical applications, but in shotgun proteomics, we are often dealing with exploratory data analysis and thus are more interested in many discoveries as candidates for further examination and less concerned about false discoveries.

As the FDR is not a monotonic function of the score, we often prefer to use the q value of an inference (Figure 5.1), which is the multiple testing corrected analog of the widely used p value. The q value is defined as the minimum of the FDRs of all inferences with the same or a worse score than this inference. The q value is a monotonic function of the score and therefore it makes sense to set a threshold on it and call all inferences below a certain q value discoveries.
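Given FDR estimates sorted from best to worst score, the q values can be computed with a single cumulative-minimum pass starting from the worst score; a sketch:

```python
# q values from a score-sorted list of FDR estimates: the q value of an
# inference is the minimum FDR at its own or any worse score, which makes
# the q value monotonically non-decreasing as scores get worse.
def fdr_to_q_values(fdrs_best_to_worst):
    q_values, running_min = [], float("inf")
    for fdr in reversed(fdrs_best_to_worst):   # start from the worst score
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    return list(reversed(q_values))

# FDR estimates can dip as more true discoveries enter the accepted set;
# the q value smooths this out into a monotonic quantity.
print(fdr_to_q_values([0.00, 0.02, 0.01, 0.03]))  # [0.0, 0.01, 0.01, 0.03]
```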

Decoy models

For each inference, we test the selected null hypothesis by considering the distribution under the null hypothesis, i.e., we look at the score distribution of inferences for which the null hypothesis is true and compare the score of an inference to this null distribution. However, if we simply search our spectra against a protein database, we obtain a mixture of correct and incorrect inferences and, therefore, do not know the null distribution.

2 In statistical literature, discoveries are also frequently referred to as positives or significant features.

To obtain this null distribution, we artificially create a set of inferences for which we know the null hypothesis is true, known as the decoy model [56]. This is usually done by reversing or shuffling the original protein database, resulting in a decoy database containing decoy proteins, and subsequently searching the spectra against this database. As none of the decoy peptides, i.e., peptides derived from the decoy proteins, should be present in the sample, any match to such a decoy peptide, called a decoy PSM, is incorrect. The score distribution of the matches to decoy peptides is, therefore, a good estimation of the null distribution for the PSMs. The original protein database containing the proteins of interest is called the target database and correspondingly we have target PSMs, target peptides and target proteins.

Although any database of non-present proteins would produce a score distribution for which the null hypothesis is true, using a reversed or appropriately shuffled version of the protein sequences of interest ensures that several peptide characteristics, like length and amino acid frequency, are preserved. This, in turn, ensures that the probability of obtaining a hit in the decoy database is approximately equal to the probability of obtaining an incorrect hit in the target database.

PSM-level false discovery rates

To estimate the PSM-level FDR at a certain threshold t for a list of target PSMs, we can count the number of decoy PSMs scoring better than t and divide that by the number of target PSMs scoring better than t (Figure 5.1). An extra correction should be applied to account for the fact that the number of incorrect target PSMs is lower than the number of decoy PSMs [104].

An alternative method to calculate the PSM-level FDR is to search the spectra against a concatenated database, i.e., a protein database consisting of the target and decoy proteins combined, also known as target-decoy competition (TDC) [26]. In this case, we expect to get the same number of decoy PSMs as incorrect target PSMs at a threshold t and no correction factor is needed. This approach is especially beneficial if there is a bias in the scoring function, which causes some spectra to systematically score better than others [58]. If the scores are not biased, the two methods give approximately the same performance, though TDC is to be preferred as some issues remain with the first approach [57].
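A sketch of the TDC estimate: given PSMs from a concatenated target-decoy search, the FDR at threshold t is the number of decoy PSMs above t divided by the number of target PSMs above t. The scores below are made up for illustration.

```python
# Target-decoy FDR estimation with a concatenated database (TDC): decoy
# matches above the threshold estimate the number of incorrect target
# matches above the threshold.
def tdc_fdr(psms, threshold):
    """psms: list of (score, is_decoy); higher score is better."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

psms = [(9.1, False), (8.7, False), (7.9, True), (7.5, False), (6.2, True)]
assert tdc_fdr(psms, 7.0) == 1 / 3   # 1 decoy vs 3 targets above 7.0
```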

Peptide-level false discovery rates

Frequently, many PSMs map to the same peptide, and correctly inferred peptides usually have more PSMs mapped to them than incorrectly inferred peptides. Therefore, simply using the PSM-level FDR as the peptide-level FDR will lead to overconfident FDR estimates. In contrast, inferring each peptide by taking the score of its best-scoring PSM and subsequently recalculating the FDR based on this truncated list will give an accurate peptide-level FDR [41]. Note that multiple PSMs with the same peptide sequence are not independent pieces of evidence, as multiple spectra of the same analyte could receive the same incorrect identification. Therefore, these PSMs cannot simply be combined by, for example, Fisher's method for combining p values [30].
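The truncation to the best-scoring PSM per peptide can be sketched as follows (the peptide sequences and scores are invented):

```python
def best_psm_per_peptide(psms):
    """Reduce a list of (peptide, score) PSMs to the best-scoring PSM
    per peptide sequence, returned as a dict peptide -> best score."""
    best = {}
    for peptide, score in psms:
        if peptide not in best or score > best[peptide]:
            best[peptide] = score
    return best

psms = [("ELCLDPK", 4.0), ("ELCLDPK", 7.2), ("VEADIPGHGQEVLIR", 8.1)]
print(best_psm_per_peptide(psms))  # {'ELCLDPK': 7.2, 'VEADIPGHGQEVLIR': 8.1}
```

The peptide-level FDR is then estimated on this truncated list exactly as on the PSM level.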

Protein-level false discovery rates

While the FDR itself is well defined, and PSM- and peptide-level FDRs are well accepted in the field, protein-level FDRs have caused some controversy due to ambiguity in their definition and the difficulty of estimating them. We will first focus on the issue of ambiguity in the definition, which can be traced back to the different (often implicit) null hypotheses that different authors use.

Two camps can be distinguished right off the bat. Loosely, their two differing null hypotheses can be formulated as "the evidence for the protein came from incorrectly interpreted mass spectra" [85, 93, 97] and "the protein is absent from the sample" [99, 117, 69]. Interestingly, while some consider the difference between the two views obvious, many actually have trouble distinguishing them. The different viewpoints can be illustrated using the following scenario (see also Figure 5.2):

Protein A is present in the sample and has 3 peptides a1, a2 and a3 that can be detected. Due to undersampling of the pool of present peptides in the mass spectrometry experiment, peptides a2 and a3 do not produce a mass spectrum. Furthermore, a1 produces a very low-quality mass spectrum that cannot be interpreted confidently. Protein B is also present in the sample and has a peptide b4 that has a peptide sequence similar to a2. The mass spectrum belonging to b4 gets erroneously inferred, with high confidence, as originating from peptide a2. Based on this highly confident, albeit faulty, evidence, protein A gets called significant.

The first camp bases its classification on the observed mass spectra and would classify this protein as a false discovery, because the mass spectrum was erroneously interpreted. The second camp bases its classification on the actual presence of the protein in the sample and would therefore classify this protein as a true discovery.

Formulating a definition is also made harder by the shared peptide problem. A handful of protein inference methods choose to eliminate all shared peptides, thereby simplifying the problem considerably at the cost of reduced sensitivity. However, the majority of the available methods choose not to throw out this preciously obtained data. This often leads to a solution called protein grouping, where proteins with the same peptide evidence are inferred together as a protein group.

However, this solution comes with the immediate question: "What do we mean when we say we inferred a protein group?". Several answers are possible: "one



Figure 5.1: Estimation of false discovery rates (FDRs) and q values using a decoy model. Columns 1 and 2 represent the idealized situation in which we know which inferences are correct and incorrect; columns 3-5 represent the situation in practice, where these labels are unknown and a decoy model is used to approximate the number of incorrect inferences. Ideally, columns 2 and 4 should correspond.

[Figure 5.2 content: the sequences of protein A and protein B are shown with their peptides a1-a3 and b1-b4. Peptides a1-a3 yield missing or unidentifiable MS/MS spectra; peptide VEADIPGHGQEVLIR is correctly identified, so protein B is identified, while another spectrum is incorrectly identified as the protein A peptide ELCLDPK, so protein A is also identified.]

Figure 5.2: Should the identification of protein A be regarded as a false discovery or not? The figure depicts the dilemma presented in the text regarding the different null hypotheses that are used concurrently in the field.

protein in the group is present", "all proteins in the group are present" or "at least one of the proteins in the group is present". The first answer leaves open the question of which protein to choose, the second is generally too liberal, and the third is too unspecific. This might not be very problematic if the proteins in the group have the same function, but due to sequence similarity between different genes, this is certainly not always the case.

Provided with a well-defined null hypothesis, there is still the difficulty of estimating the corresponding protein-level FDR. These difficulties mostly stem from the shared peptide problem. Correctly inferred shared peptides break the assumptions of the standard decoy model, since decoy proteins cannot mimic the case of a correct peptide that is mapped to an absent protein. On the other hand, using only peptides unique to a protein does not break the decoy model, but does force us to throw out a large portion of the inferred peptides.

Sometimes the difficulty in estimating the protein-level FDR stems from faulty assumptions rather than a fundamental problem. This was the case in one of the first draft human proteomes [115, 98], which was later resolved [97].

FDR Calibration

Having obtained FDR estimates, one would like to verify that they provide good estimates of the true number of false discoveries. This can be done through calibration using engineered datasets where we know the ground truth of each inference before doing the experiment.

The most popular method to achieve this is to analyze mixtures of known proteins, such as the ISB 18 [63] or the Sigma UPS1/UPS2 protein mixtures [101]. Here, the entire shotgun proteomics pipeline is executed and we try to find the peptides and proteins that are known to be present against a database of peptides and proteins that are definitely not present. We can check the FDR estimates by comparing them to the fraction of non-present peptides or proteins reported by the method. These standards contain relatively few proteins, often on the order of 50, and often do not contain many shared peptides. Therefore, they are not very representative of the highly complex samples we typically analyze in shotgun proteomics [43]. Furthermore, an intrinsic problem is that we cannot verify the null hypothesis for a peptide or protein inference to have come from incorrect PSMs, as we are not in control of this part of the process.

Alternatively, one can use simulations of shotgun proteomics experiments. Here, correct and incorrect inferences are drawn from their respective score distributions. For example, to test peptide and protein inference methods we can simulate correct and incorrect PSMs, or in the case of protein inference methods, we could even simulate correct and incorrect peptides.

Simulations come with a certain set of assumptions about the shotgun proteomics pipeline which might be violated in practice. For example, in some simulations,

present peptides are randomly drawn, whereas in practice the sampling of present peptides is a function of their abundance. Because of assumptions like these, simulations could be considered to give less conclusive evidence than mixtures of known proteins. However, they can simulate results from samples of high complexity and can also verify the null hypothesis of peptide and protein inferences coming from incorrect PSMs.

5.3 The Percolator software package

A common issue with database search engines is that their scoring functions often contain biases; for example, scores are systematically higher for longer peptides or highly charged precursor ions. Initially, this was solved by, for instance, applying different score thresholds for different precursor charge states or having peptide-length-dependent thresholds.

Käll et al. [55] recognized that a better approach was to use a support vector machine (SVM), which was implemented in the widely used software package Percolator. Percolator is still being developed further in our group, and we regularly release new features and bug fixes.

An SVM is a classification algorithm: given a set of features for an inference, it assigns the inference one of (typically) two classes, or labels. In the context of shotgun proteomics, these labels are typically "correct" and "incorrect". A feature can be any numeric value, for example, the scoring function score, the peptide length or the precursor mass.

SVMs belong to the class of supervised learning algorithms, which means that they are trained by feeding them inferences for which the labels are known. Based on this training set, the SVM produces a high-dimensional decision boundary in the feature space which separates the inferences of the two labels in an optimal way (Figure 5.3).

The problem in the shotgun proteomics context is that we do not know to which of the two classes an inference belongs. We can create a decoy model which will produce exclusively incorrect inferences, but, initially, we do not have a set of correct inferences. To overcome this, Percolator uses a semi-supervised learning approach, in which the correct set is inferred during training. In practice, this means that we start by assigning the label "correct" to a small set of highly confident PSMs, typically those with a very good database search engine score. Subsequently, we train an SVM using this small set as the correct inferences and the decoy inferences as the incorrect inferences, and based on the new decision boundary we select a new set of highly confident PSMs. This process can be repeated a fixed number of times or until convergence, i.e., until the set of inferences labeled as correct no longer changes significantly. To prevent overfitting to the data, Percolator uses a cross-validation scheme, that is, the PSMs are divided into 3


Figure 5.3: A support vector machine calculates the decision boundary H3, which has the largest margin between inferences of the two classes. Here, one can think of the black circles representing the incorrect inferences, while the white circles represent the correct ones. The X1 axis represents one feature, e.g., a PSM's score, while the X2 axis represents a second feature, e.g., a PSM's peptide length. The decision boundary H1 does not provide a separation between the two classes; H2 and H3 both separate the two classes, but H3 does this with the largest possible margin to the closest inference. [112]

validation bins and each bin is evaluated by the SVM trained on the other two bins [42].
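The semi-supervised iteration can be illustrated with a toy sketch. This is not Percolator's implementation: a simple least-squares linear classifier on simulated two-feature data stands in for its SVM, the 0.8 and 0.99 quantile cutoffs are arbitrary choices, and the cross-validation scheme is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated features [search-engine score, peptide length]: targets are a
# mix of correct (higher-scoring) and incorrect PSMs; decoys are incorrect.
correct = rng.normal([4.0, 1.0], 1.0, size=(100, 2))
incorrect = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
decoys = rng.normal([0.0, 0.0], 1.0, size=(200, 2))

X = np.vstack([correct, incorrect, decoys])
is_target = np.array([True] * 200 + [False] * 200)

# Initialize: label the top-scoring 20% of targets as "correct".
positive = is_target & (X[:, 0] > np.quantile(X[is_target, 0], 0.8))

for _ in range(3):  # fixed number of semi-supervised iterations
    train = positive | ~is_target          # confident targets + all decoys
    y = np.where(positive[train], 1.0, -1.0)
    A = np.hstack([X[train], np.ones((train.sum(), 1))])  # bias column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)             # linear classifier
    scores = np.hstack([X, np.ones((len(X), 1))]) @ w
    cutoff = np.quantile(scores[~is_target], 0.99)        # decoy-based cutoff
    positive = is_target & (scores > cutoff)              # re-select positives

print(positive.sum(), "targets selected as confident")
```

The loop starts from a small confident set, retrains on it plus all decoys, and re-selects confident targets against a decoy-derived score cutoff, mirroring the iteration described above.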


Chapter 6

Mass spectrum clustering

In the field of shotgun proteomics, it is considered good practice to make data sets of tandem mass spectrometry experiments publicly available in repositories such as the PRIDE archive and PeptideAtlas. According to the latest statistics, the PRIDE archive contains 690 million mass spectra from 813 data sets [111], and the human part of PeptideAtlas contains 470 million mass spectra from over 1,000 samples [23]. While the data sets are interesting by themselves for researchers interested in the particular topic of the original study, much more interesting outcomes can be expected by considering the entire database as a whole. In particular, we would like to investigate commonalities between studies and to analyze the huge amount of unidentified spectra, which could, for example, unveil novel PTMs [51] or unexpected fragmentation patterns. This chapter serves as a detailed introduction to paper I.

In a typical experiment, more than half of the MS2 spectra remain unidentified [15]. A part of these can be attributed to low-quality spectra or spurious fragmentation events. However, also among high-quality spectra, a large proportion remains unidentified, even after accounting for common modifications and semi-tryptic or non-tryptic peptides [87].

It is quite a daunting task to sift through hundreds of millions of MS2 spectra and find spectra of interest. A useful technique from the field of machine learning is to cluster the spectra, i.e., to group similar spectra together, ideally such that each group stems from a single peptide species. By doing this, we can identify groups of interest, rather than single spectra of interest, which typically reduces the number of spectra to be searched by at least one order of magnitude. Furthermore, large groups of unidentified spectra are natural candidates for investigation, and one could construct spectral libraries for frequently recurring spectra [33]. By using open modification searches and de novo identification tools, we can shed more light on what Skinner et al. [102] aptly called the "dark matter" of proteomics.

The two key challenges in the clustering of mass spectra are how to measure the

similarity between two spectra, and how to decide whether the similarity is high enough to warrant, with high probability, that two or more spectra came from the same peptide species. The first challenge has been met with several distance measures that take into account different characteristics of mass spectra, while the second requires a combination of statistics and clustering algorithms from the field of machine learning.

Several mass spectrum clustering packages have been developed, such as Pep-Miner [10], MS2Grouper [106], NorBel2 [31], MS-Cluster [33], PRIDE-Cluster [45, 46], CAMS-RS [96] and Liberator [51]. Some of these have specifically targeted large-scale datasets [33, 45, 96, 51]. So far, only PRIDE-Cluster has managed to cluster all spectra, including the unidentified spectra, of an entire repository.

In the following sections, we will start by looking at the different aspects of mass spectrum clustering algorithms. In particular, we will look at approaches to measure similarities between mass spectra using a distance measure, how to subsequently group similar spectra together using clustering algorithms, and how to evaluate the results of clustering.

6.1 Distance measures

Before we start, it is helpful to clarify the small difference between two frequently used terms. Similarity measures assign high scores to pairs of similar spectra, whereas distance measures assign them low scores. This corresponds to our intuition that highly similar pairs receive high similarity scores but are close to each other when talking about distance. We will mostly use the term distance measure, as it corresponds better to a visualization of clusters, where similar spectra are close to each other and dissimilar spectra are far apart.

Spectrum distance measures are not only used for clustering but are also an integral part of spectral library search engines. Some measures were specifically designed for a certain type of experimental setting and are therefore not guaranteed to work when these conditions change. Here, we focus on the most generally applicable distance measures, which should perform reasonably well independent of the experimental settings.

To determine how similar two spectra are, it is beneficial to take into account the variation in the peak patterns of spectra from the same peptide species. Typically, the absolute peak intensities can vary over several orders of magnitude, and therefore some type of normalization is often a key step when using peak intensity information. However, even after normalization, the relative peak intensity of a specific fragment peak will usually still show significant variance. Furthermore, electronic and chemical noise, as well as poorly calibrated instruments, can cause systematic and random noise, which needs to be dealt with appropriately.

One of the most straightforward similarity measures is obtained by first binning the peaks, for example in 1 Th bins, and then applying the cosine similarity, of which adapted


versions have been employed by [10, 33, 51, 46]:

\[
C(I^Q, I^T) = \frac{\sum_{i=0}^{m} I^Q_i \cdot I^T_i}{\sqrt{\sum_{i=0}^{m} \left(I^Q_i\right)^2} \cdot \sqrt{\sum_{i=0}^{m} \left(I^T_i\right)^2}} \tag{6.1}
\]

where $I^Q$ and $I^T$ are vectors of the summed or maximum peak intensity per bin. This normalized dot product takes values in the interval [0, 1], where pairs of very similar spectra have a score close to 1. Its main advantage is that it is fast to calculate, but it has the potential problem that a few high-intensity peaks will dominate the scores. In the most extreme case, two spectra with just a single peak in the same bin, for instance at the precursor ion mass-to-charge ratio, could get a score of 1 even though we do not have much reason to believe they came from the same peptide species.
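As an illustration of equation 6.1, a minimal binned cosine similarity might look like the sketch below; the bin width, maximum m/z and peak lists are arbitrary choices:

```python
import numpy as np

def binned_cosine(mz1, int1, mz2, int2, bin_width=1.0, max_mz=2000.0):
    """Cosine similarity (eq. 6.1) after summing peak intensities
    into fixed-width m/z bins."""
    n_bins = int(max_mz / bin_width)
    v1, v2 = np.zeros(n_bins), np.zeros(n_bins)
    np.add.at(v1, (np.asarray(mz1) / bin_width).astype(int), int1)
    np.add.at(v2, (np.asarray(mz2) / bin_width).astype(int), int2)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom > 0 else 0.0

mz = [175.1, 304.2, 433.2]
intensities = [100.0, 50.0, 80.0]
print(round(binned_cosine(mz, intensities, mz, intensities), 6))  # 1.0
```

Note that the pathological case from the text also holds here: two spectra sharing only a single binned peak would likewise score 1.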

Quite similar to the cosine similarity is the Euclidean distance, as used in [92]:

\[
E(I^Q, I^T) = \sqrt{\sum_{i=0}^{m} \left(I^Q_i - I^T_i\right)^2} \tag{6.2}
\]

This distance measure puts less emphasis on a high-intensity peak when it is present in both spectra, while increasing the emphasis when it is absent from one of them. On the other hand, it is very sensitive to the natural variance of fragment peak intensity, and it is not straightforward how to normalize the peak intensities, nor the scores themselves.

To prevent a few peaks from overpowering all lower-intensity peaks, we could instead ignore all peak intensities and use a 1 if the peak intensity is above a certain threshold and a 0 if it is below this threshold:

\[
D(B^Q, B^T) = \sum_{i=0}^{m} B^Q_i \cdot B^T_i \tag{6.3}
\]

where $B^Q$ and $B^T$ are boolean vectors with 1s and 0s as specified above; a normalization factor can be added similarly to the cosine similarity above. While this approach solves the problem of a few peaks overpowering the rest, it introduces some new problems. First, an appropriate threshold needs to be chosen to set the 1s and 0s, or alternatively a fixed number k of 1s can be set, corresponding to the most intense bins. Second, peaks in common as a result of low-intensity peaks from background noise cannot be distinguished from high-intensity peaks in common from actual fragments.

The previous distance metrics do not specifically take advantage of the special characteristics of mass spectra, i.e., that they come from sequences of amino acids. This property can, for example, be exploited by looking at the mass-to-charge ratio

differences of consecutive peaks [96]. However, this relies on the ability to filter out noise peaks between actual fragment peaks, which is not a simple task.

Some other distance measures that have been proposed are the Hertz similarity index [49], the accurate mass spectrum distance [48], using mass differences on aligned peak lists [2] and a sigmoid-based fragment scoring function [31].

6.2 Clustering algorithms

After calculating the pairwise distances, we are left with a large affinity matrix. This matrix is normally quite sparse, as we typically only consider pairs of spectra with similar precursor mass or mass-to-charge ratio. One of the issues that makes mass spectrum clustering different from many other clustering problems in machine learning is that instead of a few clusters with many observations per cluster, we expect many clusters with relatively few spectra in each cluster. Furthermore, unless we analyze the data with a search engine first, we have no idea how many clusters to expect. This makes the application of popular machine learning clustering algorithms such as k-means or Gaussian mixture models rather difficult.

The class of clustering algorithms that comes quite naturally to our problem is that of the hierarchical clustering algorithms. Initially, each observation forms its own cluster, a singleton cluster. We start by linking the two most similar clusters and calculate a new distance between this newly created cluster and all other clusters. This continues until only one cluster remains. In this way, these algorithms construct a hierarchical tree of the observations (Figure 6.1), linking the most similar observations at the bottom and less similar observations higher up in the tree. Constructing the clusters then simply comes down to setting a threshold on the tree and ignoring all links above this threshold.

There are three main ways to construct the linkages, which differ in the calculation of the distance between a newly formed cluster and all other clusters (Figure 6.2). In single-linkage clustering, the new distance is the minimum of the distances over pairs with one observation in the newly formed cluster and one in the cluster we want to calculate the new distance to. This has, for example, been used by [31]. In average-linkage clustering, the new distance is the average distance over such pairs, as employed by [92]. In complete-linkage clustering, the new distance is the maximum distance over such pairs.

While single-linkage clustering is the easiest version to implement, its greedy nature tends to produce impure clusters. Average- and complete-linkage clustering improve on this by requiring more consensus, but are harder to implement for large matrices, as they require the storage of all distances between clusters. For average-linkage clustering, a memory-constrained algorithm is available that resolves this issue [71], and for complete-linkage clustering, a similar scheme can be constructed.

In the spirit of average-linkage clustering, one could recalculate the distances by first creating a consensus spectrum of the newly formed cluster and then apply the


Figure 6.1: Hierarchical clustering creates a tree of the observations; clusters can be constructed by choosing a cutoff threshold. Here, we have 8 spectra, marked s1, ..., s8, which all start off as clusters of size 1. The distances at which two clusters were linked are given in red. Clusters with a small distance are linked at the bottom, e.g., s1 and s4, whereas clusters with a large distance are linked at the top, e.g., s7 and s8. Setting the cutoff threshold at a large distance creates few clusters (top line, 4 clusters), which might result in impure clusters. Setting the threshold at a small distance results in many clusters (bottom line, 7 clusters), which reduces the summarizing power. Setting the threshold at an intermediate distance (middle line, 5 clusters) trades off these two issues and might give the optimal result.

\[
\text{(A)}\quad D_{U,V} = \min_{i,j} D(U_i, V_j) \qquad
\text{(B)}\quad D_{U,V} = \frac{1}{|U||V|} \sum_{i} \sum_{j} D(U_i, V_j) \qquad
\text{(C)}\quad D_{U,V} = \max_{i,j} D(U_i, V_j)
\]

Figure 6.2: Calculation of the distance between two clusters, U and V, for single-linkage (A), average-linkage (B) and complete-linkage (C) clustering.

distance measure between this new consensus spectrum and the consensus spectra of each of the other clusters [33, 46].

Instead of hierarchical clustering methods, one could employ graph-based algorithms. After first eliminating all links over a certain distance threshold, we could look for so-called cliques, i.e., groups of spectra for which each pair has a link in the graph [106].
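Cutting a hierarchical tree at a distance threshold can be sketched with SciPy on a made-up 5x5 distance matrix; average linkage and the 0.5 cutoff are arbitrary choices:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy pairwise distances for 5 "spectra": two tight pairs and a singleton.
D = np.array([
    [0.0, 0.1, 0.9, 0.9, 0.8],
    [0.1, 0.0, 0.9, 0.8, 0.9],
    [0.9, 0.9, 0.0, 0.2, 0.7],
    [0.9, 0.8, 0.2, 0.0, 0.7],
    [0.8, 0.9, 0.7, 0.7, 0.0],
])

Z = linkage(squareform(D), method="average")       # build the tree
labels = fcluster(Z, t=0.5, criterion="distance")  # cut the tree at 0.5
print(labels)  # spectra 0 and 1 cluster, 2 and 3 cluster, 4 is a singleton
```

The `t` parameter plays the role of the cutoff threshold in Figure 6.1: raising it merges more clusters, lowering it splits them.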

6.3 Evaluating clustering performance

The goal of mass spectrum clustering is to cluster spectra belonging to the same peptide species together, while preventing the formation of clusters with more than one peptide species. To evaluate the performance of a mass spectrum clustering algorithm, we can use the peptide inferences as the true labels. With these labels, we can apply different cluster evaluation metrics. We shall call the list of clusters resulting from a mass spectrum clustering algorithm a clustering.

However, the peptide inferences of the spectra themselves contain a certain level of uncertainty due to chimeric spectra and incorrect identifications. To account for incorrect identifications, one could cluster only the spectra with a PSM inference with a q value below 1% or even 0.1%. Still, this could give an inaccurate image of the actual performance of the mass spectrum clustering algorithm on the entire set, as the confident PSMs are typically from high-quality spectra, which are more easily distinguished from each other than low-quality spectra. Alternatively, we could first cluster all spectra and then eliminate all spectra with only PSMs above a certain q value. In this case, the clusters are no longer accurately represented by their remaining spectra, and care has to be taken to correctly interpret any statistics we calculate from them.

Several general-purpose clustering metrics have been proposed, such as the Rand index, the adjusted Rand index, the B-Cubed algorithm, purity and normalized mutual information; see [3] for a review. Most of these metrics give us a score, typically between 0 and 1, and allow us to rank different clusterings by this score. Unfortunately, the values of these scores are hard to interpret, and more simplistic metrics might give more information. A commonly used example is the cluster purity, which is the average fraction of spectra belonging to the most common peptide species in their cluster.

Alternatively, we can use the number of peptide discoveries from searching the consensus spectra with a database search engine. It is not hard to imagine that if the clusters are pure, the consensus spectra should be good models for spectra from their peptide species. Unfortunately, this is certainly not always the case, as a cluster with one high-quality spectrum together with many lower-quality spectra might result in a consensus spectrum falling below the discovery threshold. Nevertheless, the number of peptide discoveries for a clustering can be considered an informative metric regarding its purity.
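The cluster purity mentioned above can be sketched as follows; here it is computed as a cluster-size-weighted average, which is one possible reading of the definition, and the peptide labels are invented:

```python
from collections import Counter

def clustering_purity(clusters):
    """Size-weighted average fraction of spectra that carry the most
    common peptide label of their cluster."""
    total = sum(len(cluster) for cluster in clusters)
    agree = sum(Counter(cluster).most_common(1)[0][1] for cluster in clusters)
    return agree / total

clusters = [
    ["PEPTIDEA"] * 4,                      # pure cluster
    ["PEPTIDEB", "PEPTIDEB", "PEPTIDEC"],  # 2 of 3 spectra agree
]
print(clustering_purity(clusters))  # (4 + 2) / 7, about 0.857
```

In practice, the labels would be the confident peptide inferences of the clustered spectra, with the caveats about q-value filtering discussed above.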


Chapter 7

Label-free protein quantification

Though it might seem trivial to add quantitative values to proteins once they have been identified, this is by no means the case. There are many possible sources of errors along the way from the raw data of the mass spectrometer to a list of quantified proteins. Firstly, it is not always evident which quantification signal belongs to a certain peptide, nor whether the quantification signal is accurate, as peptides of similar mass and retention time might act as confounding factors. Secondly, as was the case in protein identification, we are still faced with the problem that we are detecting peptides rather than the proteins themselves. This means that we will somehow have to summarize the peptide evidence and deal with possibly conflicting evidence. In this chapter, we will investigate the data analysis pipeline of label-free shotgun proteomics and the types of error sources present at each step. This chapter serves as a detailed introduction to papers IV and V.

We will focus on quantification methods that use the precursor intensity signals from the MS1 spectra, in favor of alternative techniques such as spectral counting and using fragment intensity signals from the MS2 spectra. Furthermore, we will mostly consider relative quantification methods, rather than absolute quantification methods.

7.1 Feature detection

As mentioned before, we can use the precursor ion intensity of the MS1 spectra as a proxy for a peptide's abundance, relative to other runs. For a given peptide, the precursor ions will come in different isotopes due to the different isotopes of the constituting atoms, resulting in an isotopic envelope (see Figure 7.1). For a charge 1 precursor ion, this isotopic envelope starts with a peak at the monoisotopic mass, the sum of the atomic masses using the mass of the most abundant isotope, followed by peaks at 1 Da intervals. For a higher charged precursor ion of charge c, these masses will be divided by the charge, and the shifts of the isotopes will therefore be 1/c Th.
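The 1/c Th spacing can be made concrete with a small sketch; the mass constants are rounded approximations and the example neutral mass is invented:

```python
NEUTRON_MASS_DIFF = 1.00335  # approx. spacing between isotope peaks (Da)
PROTON_MASS = 1.00728        # Da

def isotope_mz(neutral_mass, charge, n_isotopes=4):
    """m/z positions of an isotopic envelope: starting at the monoisotopic
    peak, successive isotopes are spaced ~1/charge Th apart."""
    return [(neutral_mass + k * NEUTRON_MASS_DIFF) / charge + PROTON_MASS
            for k in range(n_isotopes)]

# An invented neutral mass of 1298.5 Da at charge 2: peaks ~0.5 Th apart.
print([round(mz, 3) for mz in isotope_mz(1298.5, 2)])
```

For charge 1 the same call yields the familiar ~1 Da spacing of the envelope.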



Figure 7.1: Examples of isotopic envelopes. We plotted the retention time on the x-axis, the precursor m/z on the y-axis and the precursor ion intensity on the z-axis, where purple represents the highest intensity, followed by increasingly lighter shades of red. We can see two clearly distinguishable MS1 features, in green and blue, one of charge +1 and one of charge +2. We can also see a more ambiguous feature in red, which could also be two separate features. There are also several less intense features in the plot, which we have not highlighted but would typically be called as well.

Due to the differing atomic compositions between peptides, the distribution among the isotopes depends on the peptide. It is important to deisotope the MS1 spectra, such that the most relevant isotopes are taken into account, resulting in an MS1 feature that can be used as the quantitative representation of an analyte [107]. Notable MS1 feature detection software packages are Hardklör [50], MaxQuant [18] and Dinosaur [107].

Errors can be made during the deisotoping, due to isotopes being attributed to the wrong analyte. Also, false positives will arise during the calling of the MS1 features, resulting from random noise, as well as false negatives, truly present analytes that do not get an MS1 feature assigned. Furthermore, in some cases, MS1 features are erroneously split up into multiple smaller MS1 features.


7.2 Peptide quantification

After feature detection, we can link the MS1 features to the MS2 spectra by considering the precursor isolation window, the charge state and the retention time [52]. Some MS2 spectra can be linked to multiple MS1 features, for example in the case of chimeric spectra. We can now do peptide identification in the same manner as in the identification pipeline. We can then assign the peptide from the MS2 spectrum to one of its MS1 features. If multiple MS1 features are present with the charge state of the peptide identification, we have to select one, for example, based on predicted precursor mass and retention time.

In this way, we obtain a quantity for a peptide with a certain charge state across several runs. We typically filter out peptides that have not been quantified in more than a certain number of runs. This filters out false positives, as well as reduces problems with missing value imputation. It has been observed that the intensity is influenced differently at different retention times; therefore, a retention time-dependent normalization factor helps to reduce this variance [118].
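One simple form such a retention time-dependent normalization could take is a per-bin median correction of log-intensities. This is a sketch under assumed conventions (the binning scheme, bin count and data layout are illustrative, not the method of [118]):

```python
import statistics

def rt_normalize(rows, n_bins=4):
    """Subtract, per retention-time bin, the median log-intensity of that
    bin, so intensities become comparable across elution time.
    `rows` are (rt, log_intensity) pairs for a single run."""
    lo = min(rt for rt, _ in rows)
    hi = max(rt for rt, _ in rows)
    width = (hi - lo) / n_bins or 1.0
    bins = {}
    for rt, x in rows:
        b = min(int((rt - lo) / width), n_bins - 1)
        bins.setdefault(b, []).append(x)
    medians = {b: statistics.median(xs) for b, xs in bins.items()}
    return [(rt, x - medians[min(int((rt - lo) / width), n_bins - 1)])
            for rt, x in rows]
```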

Errors are introduced during this stage by incorrect PSMs and the resulting incorrect peptide and protein identifications, as well as by matching the wrong MS1 feature to the peptide.

7.3 Matches-between-runs

Following the approach above, we are left with many missing values due to the random nature of MS2 acquisition in data-dependent acquisition. To alleviate this problem, we can search for MS1 features with similar retention time and precursor mass for identified peptides that did not have a (reliable) MS2 spectrum in that particular run. This strategy is known as matches-between-runs (MBR) [17, 118, 5]. In one particular case, the number of missing values could be reduced from 40% down to just a few percent using this technique [118].

To apply MBR effectively, we need a mapping between the retention times of different runs. While linear mappings are sufficient if experimental conditions are kept very steady, one often has to resort to smooth approximations with splines or other interpolation methods. We can either select one of the runs as a reference run [113] or use pairwise alignments [94].
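Such a pairwise mapping can be sketched with monotone piecewise-linear interpolation between anchor peptides identified in both runs. A spline would give the smoother approximation mentioned above; the anchor pairs here are made up:

```python
import bisect

def fit_rt_mapping(anchors):
    """Build a piecewise-linear mapping rt_b = f(rt_a) from (rt_a, rt_b)
    pairs of peptides identified in both runs, clamping outside the
    anchor range."""
    anchors = sorted(anchors)
    xs = [a for a, _ in anchors]
    ys = [b for _, b in anchors]
    def f(rt):
        i = bisect.bisect_left(xs, rt)
        if i == 0:
            return ys[0]
        if i == len(xs):
            return ys[-1]
        t = (rt - xs[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + t * (ys[i] - ys[i - 1])
    return f

# Hypothetical anchors: run B elutes slightly later and stretched.
f = fit_rt_mapping([(10, 12), (20, 25), (30, 34)])
```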

The same types of errors can be made in MBR as during feature detection, that is, deisotoping errors, false negatives due to insufficient or incoherent signal, and false positives due to random noise. Furthermore, we are dependent on the retention time alignment and the mass accuracy of the precursors to select the right MS1 feature. To control this error rate, one can use a decoy model in which we search for features in a shifted region of precursor m/z and/or retention time [118]. Depending on the implementation, a mapping has to be made between intensities obtained by MBR and intensities obtained by regular feature detection, which introduces another source of uncertainty.
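The shifted-window decoy idea can be sketched as counting candidate features in the target window versus an m/z-shifted decoy window; an excess of decoy hits signals false transfers. Tolerances, shift size and the feature representation are illustrative assumptions:

```python
def mbr_target_decoy_counts(features, query_mz, query_rt,
                            mz_tol=0.005, rt_tol=20.0, mz_shift=0.5):
    """Count feature candidates in the target (m/z, rt) window and in a
    decoy window shifted in precursor m/z, in the spirit of [118]."""
    def hits(mz):
        return sum(abs(f["mz"] - mz) <= mz_tol and
                   abs(f["rt"] - query_rt) <= rt_tol for f in features)
    return hits(query_mz), hits(query_mz + mz_shift)

# Made-up features: one genuine candidate, one piece of noise that only
# the decoy window picks up.
features = [{"mz": 500.001, "rt": 100.0},
            {"mz": 500.501, "rt": 105.0}]
```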

7.4 Missing value imputation

Even after MBR, many missing values remain, and imputation strategies are commonly applied. Many different strategies have been employed, each with its own sources of errors. A popular imputation method is to impute a very low value [17], under the assumption that the feature is missing due to low signal. Unfortunately, this assumption turns out to be frequently violated [68]. Other imputation methods try to use quantification values from peptides with a similar expression pattern over the present values, or use the very conservative approach of imputing the mean of all present values for the peptide [109].

Each of these methods comes with severe implications for the differential expression analysis. Aside from potentially producing false positives if an incorrect value was imputed, these imputation strategies also affect the variance that is used as input to typical differential expression tests. For example, imputation with some low value can reduce the sensitivity of the test by increasing the variance, whereas imputation with the mean value can artificially lower the variance and produce false positives in this way. Missing value imputation can also cause the p value distribution of the non-differentially expressed proteins to become non-uniform, which can produce an increased number of false positives and also affects the multiple testing correction.
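A toy numerical example of these two opposite effects on the variance (the intensity values are made up):

```python
import statistics

observed = [5.1, 5.3, 4.9]          # log-intensities seen in 3 of 5 runs
true_var = statistics.variance(observed)

# Mean imputation: the two missing runs get the mean of the observed
# values, which artificially deflates the variance fed to the test.
mean_imp = observed + [statistics.mean(observed)] * 2

# Low-value imputation: the missing runs get a constant low value,
# which inflates the variance and can mask real differences.
low_imp = observed + [1.0] * 2
```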

7.5 Peptide to protein summarization

Having obtained multiple peptide quantifications of the same protein, one now has to combine these quantifications into a quantification of the protein. While it is possible to use shared peptides for protein quantification [37], it is often easier to discard these values.

Different strategies can be employed for summarizing the peptide quantifications, often using the intuition that peptides with high ion intensity are the most reliable. For example, the top-3 approach takes the three peptides with the highest ion intensity and simply averages them.

One problem with these approaches is that a single outlier can invalidate the entire protein quantification. To prevent this, one can use the coherence between the peptide quantifications and remove the peptides that show aberrant behavior [120, 119].
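The top-3 approach can be sketched in a few lines:

```python
def top3(peptide_intensities):
    """Top-3 summarization: average the three most intense peptides of a
    protein (or all of them, if fewer than three were quantified). Note
    that a single outlier among the top three skews the protein estimate,
    which motivates coherence-based outlier removal."""
    top = sorted(peptide_intensities, reverse=True)[:3]
    return sum(top) / len(top)
```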

7.6 Differential expression analysis

Finally, one is interested in the proteins that show differential expression between two conditions. By far the most common way is to use a t-test or its variant, Welch's t-test. One problem with these tests is that they do not take the effect size into account; that is, proteins that have a small but significant difference between conditions are commonly not of interest. To remedy this, one could simply filter away these proteins, but this will invalidate the uniformity assumption of p values, making multiple testing correction an issue [78]. Alternatively, one can make adaptations to existing tests that penalize proteins with low effect sizes [78, 21].
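Because many proteins are tested at once, the resulting p values are typically corrected for multiple testing, commonly with the Benjamini-Hochberg procedure [11]. A sketch of that step-up procedure:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: find the largest k such that
    the k-th smallest p value satisfies p_(k) <= (k / m) * alpha, and
    reject the k hypotheses with the smallest p values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]
```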

Compared to RNA sequencing results, differential expression analysis at the protein level often suffers from low sensitivity due to its lower coverage and higher noise level. It is not uncommon to have only a handful or even no significantly differentially expressed proteins. There are multiple causes for this, though much of it can likely be attributed to erroneous quantification outliers that increase the variance, which lowers the sensitivity of the statistical test.


Chapter 8

Present investigations

The papers included in this thesis center on the peptide and protein inference pipeline. In particular, the papers address different issues arising from the increase in available data. Paper I introduces a new mass spectrum clustering algorithm that uses a probabilistic framework to calculate a distance measure and applies this to a dataset containing millions of spectra. Paper II focuses on protein-level false discovery rates, which have been criticized for not behaving properly on large datasets, and proposes a framework in which they can be discussed in a meaningful way, reopening the discussion of protein-level false discovery rates on large datasets. Paper III showcases two new features of the Percolator package that make it ready for application to aggregated datasets comprising multiple runs, such as the recent draft human proteome studies. Paper IV introduces a new framework for protein quantification based on Bayesian statistics that provides a rigorous way of treating error sources and uncertainties rather than hiding them, as typically happens. Paper V proposes a paradigm shift for protein quantification in which the interpretation of the data with the search engine is postponed until the end, such that we can pinpoint unidentified analytes of interest.

8.1 Paper I: MaRaCluster: A fragment rarity metric for clustering fragment spectra in shotgun proteomics

This study proposes a novel mass spectrum distance measure, the fragment rarity metric, which uses the frequency of the observed peaks in the MS2 spectra to calculate a p value for a pair of spectra. It makes use of the intuition that fragment peaks shared by many MS2 spectra give less information about two spectra stemming from the same peptide species than peaks shared by few MS2 spectra. In contrast to existing distance measures, the score of the fragment rarity metric can be interpreted by itself and is robust to both systematic and random noise. This distance metric is then used as input to a complete-linkage clustering, from which clusters are derived. The complete-linkage clustering mitigates the problem of chimeric spectra connecting clusters of different peptide species. Using a dataset containing 7.8 million spectra, we demonstrate that our algorithm, MaRaCluster, outperforms alternatives that use a cosine distance or single-linkage clustering, as well as the state-of-the-art method MS-Cluster.

8.2 Paper II: How to talk about protein-level false discovery rates in shotgun proteomics

In this work, we discuss the controversy regarding protein-level false discovery rates that has resurfaced in the wake of the draft human proteome studies. In particular, we propose a framework for having meaningful discussions about them, which mainly consists of separating their definition from their application. The definition is determined by the choice of null hypothesis, of which two significantly different versions are currently employed by protein inference methods. We show that interpretation of the results using the wrong null hypothesis can lead to a considerable overestimation of the confidence in the results. Furthermore, we recommend simulating the digestion and matching steps to calibrate and verify the protein-level FDR estimates reported by protein inference methods. We demonstrate this approach by verifying the FDR estimates of a very simple protein inference method, using decoy models for both of the null hypotheses.

8.3 Paper III: Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0

This paper introduces two new features of the Percolator package that are aimed at large-scale studies. The first feature accelerates the training algorithm by training the SVMs on only a small random subset of the available PSMs. Even going down to just 0.14% of the original 73 million PSMs does not significantly reduce the number of discovered PSMs and peptides at 1% FDR. For the second feature, we compared several scalable protein inference methods based on accuracy and number of discoveries. The protein inference methods we benchmarked were the best scoring peptide approach, the two-peptide rule, multiplication of peptide-level posterior error probabilities and Fisher's method for combining p values. Using a small yeast dataset with an entrapment database, we show that it is absolutely necessary to remove shared peptides to have working decoy models. Furthermore, we demonstrate that the best scoring peptide approach yields the most significant proteins at 1% FDR on the data of a draft of the human proteome. This approach was subsequently implemented in the Percolator package, which, together with the random subset training, allowed the analysis from 73 million PSMs to protein inferences in just 10 minutes.


8.4 Paper IV: Integrated identification and quantification error probabilities for shotgun proteomics.

In protein quantification, many error sources are present that could result in false positive findings. Almost all protein quantification methods use intermediate filters at several parts of the pipeline in an attempt to limit the number of false positives. However, they typically fail to propagate error probability information to subsequent steps and instead silently assume that the filtered list does not contain any false positives. Instead, we designed a probabilistic graphical model (PGM) that propagates the error probabilities through each of the steps in the protein quantification pipeline. Specifically, instead of imputing missing values, we integrate over a probability distribution for the missing value and propagate uncertainty all the way from the MS1 features, through peptide and protein expression levels, to a posterior distribution for the fold change difference between treatment groups. We implemented this PGM in a Python package called Triqler. Not only does Triqler achieve proper control of the false discovery rate, but it also increases the sensitivity on both engineered and clinical datasets. Specifically, Triqler manages to identify 35 differentially expressed proteins in a clinical dataset at 5% FDR where previously none were discovered, and also outperforms the state-of-the-art method MaxQuant. Furthermore, the proteins discovered by Triqler show enrichment for several functional annotation terms, suggesting that these are indeed relevant findings, whereas the proteins discovered by MaxQuant did not show any enrichment.
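A toy illustration of the difference between point imputation and integrating over a distribution for a missing value. This is not Triqler's actual model; the intensities, the assumed prior for the missing value and its parameters are all made up:

```python
import random
import statistics

random.seed(1)

observed_a = [5.0, 5.2, 5.1]      # log2 intensities, condition A
observed_b = [4.1, 4.0]           # condition B, with one run missing

# Point imputation commits to a single guess for the missing run,
# yielding a single point estimate of the fold change.
point = observed_b + [4.05]
point_fc = statistics.mean(observed_a) - statistics.mean(point)

# Integrating over an assumed distribution for the missing value instead
# yields a distribution of fold changes whose spread carries the
# uncertainty forward, rather than hiding it.
fc_samples = []
for _ in range(5000):
    missing = random.gauss(4.05, 0.5)   # assumed prior, for illustration
    fc_samples.append(statistics.mean(observed_a)
                      - statistics.mean(observed_b + [missing]))
```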

8.5 Paper V: Distillation of label-free quantification data by clustering and Bayesian modeling.

In conventional protein quantification pipelines, we start our quantification from the peptide identifications coming from a search engine. In this paper, we propose to reverse this process by first quantifying all possible analyte signals without interpretation and only at the very end using a search engine to assign identities to these signals. This can simply be formulated as the quantification-first approach, as opposed to the identification-first philosophy. In this way, we do not lose any potentially interesting signals just because they could not initially be identified. By using spectrum clustering with MaRaCluster, we can further reduce the time spent on the identification process, by creating spectrum clusters and consensus spectra that can directly be associated with the discovered features. Additionally, we note that the successful technique of matches-between-runs (MBR) is an error source that is typically not accounted for in terms of error propagation. We adapted the MBR approach to work in the quantification-first setting and retained the error probabilities associated with this process as an input to Triqler. We created a combined software package, MaRaQuant, that effectively recovers all signals of interest and provides the consensus spectra that belong to these signals, which can then be searched by a search engine. We demonstrated that we could indeed recover new features of interest that could be identified by open modification searching and de novo identification. Moreover, we showed that by combining MaRaQuant with Triqler we could obtain a much higher sensitivity than the state-of-the-art method MaxQuant on one engineered as well as two clinical datasets. In both clinical datasets, we could recover significant functional annotation terms, whereas the significant proteins from MaxQuant and from the original studies did not produce any such significant terms.


Chapter 9

Future perspectives

The studies presented in this thesis focused on summarizing data and isolating objects of interest from large-scale proteomics studies. As shotgun proteomics studies become increasingly easy and cheap to perform, the computational parts of the analysis pipeline will reach their limits and will require new solutions.

Mass spectrum clustering has already started to gain traction in the field of proteomics, and I believe that we can expect several new applications for this type of analysis in the near future. Especially the increasing popularity of spectral libraries, in both the DDA and DIA fields, appears to be a perfect match for spectrum clustering. However, more attention is needed in the statistical analysis of spectral library searching, specifically with regard to decoy spectra and the chimericity of spectra. Furthermore, I expect that more effort will be directed towards identifying the unidentified spectra, for example in spectral archives or retrieved through our “quantification first” approach. This can be achieved through advancements in instrumentation, improved algorithms for open modification searches and de novo identification, as well as large-scale analysis across datasets.

In the same way that PSM- and peptide-level false discovery rates have cemented the reputation of shotgun proteomics experiments as a reliable tool, I think that reporting protein-level identification and quantification false discovery rates will increase the applicability and popularity of shotgun proteomics even further. I also think that Bayesian statistics will become an integral part of protein quantification, as it can deal with the high complexity of the experiments in a relatively straightforward and understandable way.

Finally, there have already been some efforts to use deep learning in proteomics, and I expect that these efforts will intensify over the next few years. In particular, fragment ion intensity, retention time and ionization efficiency prediction seem to be good fits for the deep learning framework. Considering the vast amounts of data the field has collected in its repositories, it should not take long before we are able to tackle these prediction tasks with high accuracy.


Bibliography

[1] Ruedi Aebersold, Jeffrey N Agar, I Jonathan Amster, Mark S Baker, Carolyn R Bertozzi, Emily S Boja, Catherine E Costello, Benjamin F Cravatt, Catherine Fenselau, Benjamin A Garcia, et al. How many human proteoforms are there? Nature Chemical Biology, 14(3):206, 2018.

[2] Rikard Alm, Peter Johansson, Karin Hjernø, Cecilia Emanuelsson, Markus Ringner, and Jari Häkkinen. Detection and identification of protein isoforms using cluster analysis of MALDI-MS mass spectra. Journal of Proteome Research, 5(4):785–792, 2006.

[3] Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486, 2009.

[4] Rolf Apweiler, Amos Bairoch, Cathy H Wu, Winona C Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, et al. UniProt: The universal protein knowledgebase. Nucleic Acids Research, 32(suppl 1):D115–D119, 2004.

[5] Andrea Argentini, Ludger JE Goeminne, Kenneth Verheggen, Niels Hulstaert, An Staes, Lieven Clement, and Lennart Martens. moFF: A robust and automated approach to extract peptide ion intensities. Nature Methods, 13(12):964–966, 2016.

[6] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. Gene Ontology: Tool for the unification of biology. Nature Genetics, 25(1):25–29, 2000.

[7] AzaToth. Myoglobin 3D structure. https://upload.wikimedia.org/wikipedia/commons/6/60/Myoglobin.png, 2012. Accessed: 2016-03-29.

[8] Marcus Bantscheff, Markus Schirle, Gavain Sweetman, Jens Rick, and Bernhard Kuster. Quantitative mass spectrometry in proteomics: A critical review. Analytical and Bioanalytical Chemistry, 389(4):1017–1031, 2007.


[9] Alex Bateman, Lachlan Coin, Richard Durbin, Robert D Finn, Volker Hollich, Sam Griffiths-Jones, Ajay Khanna, Mhairi Marshall, Simon Moxon, Erik LL Sonnhammer, et al. The Pfam protein families database. Nucleic Acids Research, 32(suppl 1):D138–D141, 2004.

[10] Ilan Beer, Eilon Barnea, Tamar Ziv, and Arie Admon. Improving large-scale proteomics by clustering of mass spectrometry data. Proteomics, 4(4):950–960, 2004.

[11] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.

[12] Pavel V Bondarenko, Dirk Chelius, and Thomas A Shaler. Identification and relative quantitation of protein mixtures by enzymatic digestion followed by capillary reversed-phase liquid chromatography-tandem mass spectrometry. Analytical Chemistry, 74(18):4741–4749, 2002.

[13] Adam D Catherman, Owen S Skinner, and Neil L Kelleher. Top down proteomics: Facts and perspectives. Biochemical and Biophysical Research Communications, 445(4):683–693, 2014.

[14] Hao Chi, Rui-Xiang Sun, Bing Yang, Chun-Qing Song, Le-Heng Wang, Chao Liu, Yan Fu, Zuo-Fei Yuan, Hai-Peng Wang, Si-Min He, et al. pNovo: De novo peptide sequencing and identification using HCD spectra. Journal of Proteome Research, 9(5):2713–2724, 2010.

[15] Joel M Chick, Deepak Kolippakkam, David P Nusinow, Bo Zhai, Ramin Rad, Edward L Huttlin, and Steven P Gygi. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nature Biotechnology, 33(7):743, 2015.

[16] John S Cottrell and U London. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–3567, 1999.

[17] Jürgen Cox, Marco Y Hein, Christian A Luber, Igor Paron, Nagarjuna Nagaraj, and Matthias Mann. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Molecular & Cellular Proteomics, 13(9):2513–2526, 2014.

[18] Jürgen Cox and Matthias Mann. MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology, 26(12):1367, 2008.

[19] Robertson Craig and Ronald C Beavis. TANDEM: Matching proteins with tandem mass spectra. Bioinformatics, 20(9):1466–1467, 2004.


[20] Robertson Craig, JC Cortens, David Fenyo, and Ronald C Beavis. Using annotated peptide mass spectrum libraries for protein identification. Journal of Proteome Research, 5(8):1843–1849, 2006.

[21] Xiangqin Cui and Gary A Churchill. Statistical tests for differential expression in cDNA microarray experiments. Genome Biology, 4(4):210, 2003.

[22] Frank Desiere, Eric W Deutsch, Nichole L King, Alexey I Nesvizhskii, Parag Mallick, Jimmy Eng, Sharon Chen, James Eddes, Sandra N Loevenich, and Ruedi Aebersold. The PeptideAtlas project. Nucleic Acids Research, 34(suppl 1):D655–D658, 2006.

[23] Eric W Deutsch, Zhi Sun, David Campbell, Ulrike Kusebauch, Caroline S Chu, Luis Mendoza, David Shteynberg, Gilbert S Omenn, and Robert L Moritz. State of the human proteome in 2014/2015 as viewed through PeptideAtlas: Enhancing accuracy and coverage through the AtlasProphet. Journal of Proteome Research, 14(9):3461–3473, 2015.

[24] Arun Devabhaktuni and Joshua E Elias. Application of de novo sequencing to large-scale complex proteomics data sets. Journal of Proteome Research, 15(3):732–742, 2016.

[25] Fredrik Edfors, Frida Danielsson, Björn M Hallström, Lukas Käll, Emma Lundberg, Fredrik Pontén, Björn Forsström, and Mathias Uhlén. Gene-specific correlation of RNA and protein levels in human cells and tissues. Molecular Systems Biology, 12(10):883, 2016.

[26] Joshua E Elias and Steven P Gygi. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods, 4(3):207–214, 2007.

[27] Jimmy K Eng, Ashley L McCormack, and John R Yates. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5(11):976–989, 1994.

[28] Scott B Ficarro, Mark L McCleland, P Todd Stukenberg, Daniel J Burke, Mark M Ross, Jeffrey Shabanowitz, Donald F Hunt, and Forest M White. Phosphoproteome analysis by mass spectrometry and its application to Saccharomyces cerevisiae. Nature Biotechnology, 20(3):301, 2002.

[29] Bernd Fischer, Volker Roth, Franz Roos, Jonas Grossmann, Sacha Baginsky, Peter Widmayer, Wilhelm Gruissem, and Joachim M Buhmann. NovoHMM: A hidden Markov model for de novo peptide sequencing. Analytical Chemistry, 77(22):7265–7273, 2005.

[30] Ronald Aylmer Fisher. Statistical methods for research workers. Genesis Publishing Pvt Ltd, 1925.


[31] Kristian Flikka, Jeroen Meukens, Kenny Helsens, Joël Vandekerckhove, Ingvar Eidhammer, Kris Gevaert, and Lennart Martens. Implementation and application of a versatile clustering tool for tandem mass spectrometry data. Proteomics, 7(18):3245–3258, 2007.

[32] Ari Frank and Pavel Pevzner. PepNovo: De novo peptide sequencing via probabilistic network modeling. Analytical Chemistry, 77(4):964–973, 2005.

[33] Ari M Frank, Matthew E Monroe, Anuj R Shah, Jeremy J Carver, Nuno Bandeira, Ronald J Moore, Gordon A Anderson, Richard D Smith, and Pavel A Pevzner. Spectral archives: Extending spectral libraries to analyze both identified and unidentified spectra. Nature Methods, 8(7):587–591, 2011.

[34] Barbara E Frewen, Gennifer E Merrihew, Christine C Wu, William Stafford Noble, and Michael J MacCoss. Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. Analytical Chemistry, 78(16):5678–5684, 2006.

[35] Devon Fyson. Schematic of a typical mass spectrometer. https://commons.wikimedia.org/wiki/File:Mass_Spectrometer_Schematic.svg, 2008. Accessed: 2016-03-08.

[36] Pascale Gaudet, Pierre-André Michel, Monique Zahn-Zabal, Aurore Britan, Isabelle Cusin, Marcin Domagalski, Paula D Duek, Alain Gateau, Anne Gleizes, Valérie Hinard, et al. The neXtProt knowledgebase on human proteins: 2017 update. Nucleic Acids Research, 45(D1):D177–D182, 2016.

[37] Sarah Gerster, Taejoon Kwon, Christina Ludwig, Mariette Matondo, Christine Vogel, Edward M Marcotte, Ruedi Aebersold, and Peter Bühlmann. Statistical approach to protein quantification. Molecular & Cellular Proteomics, 13(2):666–677, 2014.

[38] Philipp E Geyer, Nils A Kulak, Garwin Pichler, Lesca M Holdt, Daniel Teupser, and Matthias Mann. Plasma proteome profiling to assess human health and disease. Cell Systems, 2(3):185–195, 2016.

[39] Ludovic C Gillet, Pedro Navarro, Stephen Tate, Hannes Röst, Nathalie Selevsek, Lukas Reiter, Ron Bonner, and Ruedi Aebersold. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: A new concept for consistent and accurate proteome analysis. Molecular & Cellular Proteomics, 11(6):O111–016717, 2012.

[40] Alexander V Gorshkov, Irina A Tarasova, Victor V Evreinov, Mikhail M Savitski, Michael L Nielsen, Roman A Zubarev, and Mikhail V Gorshkov. Liquid chromatography at critical conditions: Comprehensive approach to sequence-dependent retention time prediction. Analytical Chemistry, 78(22):7770–7777, 2006.


[41] Viktor Granholm and Lukas Käll. Quality assessments of peptide–spectrum matches in shotgun proteomics. Proteomics, 11(6):1086–1093, 2011.

[42] Viktor Granholm, William S Noble, and Lukas Käll. A cross-validation scheme for machine learning algorithms in shotgun proteomics. BMC Bioinformatics, 13(Suppl 16):S3, 2012.

[43] Viktor Granholm, William Stafford Noble, and Lukas Käll. On using samples of known protein content to assess the statistical calibration of scores assigned to peptide-spectrum matches in shotgun proteomics. Journal of Proteome Research, 10(5):2671–2678, 2011.

[44] Dov Greenbaum, Christopher Colangelo, Kenneth Williams, and Mark Gerstein. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biology, 4(9):117, 2003.

[45] Johannes Griss, Joseph M Foster, Henning Hermjakob, and Juan Antonio Vizcaíno. PRIDE Cluster: Building a consensus of proteomics data. Nature Methods, 10(2):95–96, 2013.

[46] Johannes Griss, Yasset Perez-Riverol, Steve Lewis, David L Tabb, José A Dianes, Noemi del Toro, Marc Rurik, Mathias Walzer, Oliver Kohlbacher, Henning Hermjakob, et al. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nature Methods, 13(8):651, 2016.

[47] Steven P Gygi, Yvan Rochon, B Robert Franza, and Ruedi Aebersold. Correlation between protein and mRNA abundance in yeast. Molecular and Cellular Biology, 19(3):1720–1730, 1999.

[48] Michael Edberg Hansen and Jørn Smedsgaard. A new matching algorithm for high resolution mass spectra. Journal of the American Society for Mass Spectrometry, 15(8):1173–1180, 2004.

[49] Harry S Hertz, Ronald A Hites, and Klaus Biemann. Identification of mass spectra by computer-searching a file of known spectra. Analytical Chemistry, 43(6):681–691, 1971.

[50] Michael R Hoopmann, Gregory L Finney, and Michael J MacCoss. High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. Analytical Chemistry, 79(15):5620–5632, 2007.

[51] Oliver Horlacher, Frédérique Lisacek, and Markus Müller. Mining large scale MS/MS data for protein modifications using spectral libraries. Journal of Proteome Research, 2015.


[52] Edward J Hsieh, Michael R Hoopmann, Brendan MacLean, and Michael JMacCoss. Comparison of database search strategies for high precursor massaccuracy MS/MS data. Journal of Proteome Research, 9(2):1138–1143, 2009.

[53] Donald F Hunt, John R Yates, Jeffrey Shabanowitz, Scott Winston, andCharles R Hauer. Protein sequencing by tandem mass spectrometry. Pro-ceedings of the National Academy of Sciences, 83(17):6233–6237, 1986.

[54] Mio Iwasaki, Shohei Miwa, Tohru Ikegami, Masaru Tomita, Nobuo Tanaka,and Yasushi Ishihama. One-dimensional capillary liquid chromatographicseparation coupled with tandem mass spectrometry unveils the escherichiacoli proteome on a microarray scale. Analytical Chemistry, 82(7):2616–2620,2010.

[55] Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, andMichael J MacCoss. Semi-supervised learning for peptide identification fromshotgun proteomics datasets. Nature Methods, 4(11):923–925, 2007.

[56] Lukas Käll, John D Storey, Michael J MacCoss, and William Stafford Noble. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. Journal of Proteome Research, 7(1):29–34, 2007.

[57] Uri Keich, Attila Kertesz-Farkas, and William Stafford Noble. Improved false discovery rate estimation procedure for shotgun proteomics. Journal of Proteome Research, 14(8):3148–3161, 2015.

[58] Uri Keich and William Stafford Noble. On the importance of well-calibrated scores for identifying shotgun proteomics spectra. Journal of Proteome Research, 14(2):1147–1160, 2014.

[59] Neil L Kelleher. Peer reviewed: Top-down proteomics. Analytical Chemistry, 76(11):196A, 2004.

[60] Min-Sik Kim, Sneha M Pinto, Derese Getnet, Raja Sekhar Nirujogi, Srikanth S Manda, Raghothama Chaerkady, Anil K Madugundu, Dhanashree S Kelkar, Ruth Isserlin, Shobhit Jain, et al. A draft map of the human proteome. Nature, 509(7502):575–581, 2014.

[61] Sangtae Kim, Nitin Gupta, and Pavel A Pevzner. Spectral probabilities and generating functions of tandem mass spectra: A strike against decoy databases. Journal of Proteome Research, 7(8):3354–3363, 2008.

[62] Donald S Kirkpatrick, Scott A Gerber, and Steven P Gygi. The absolute quantification strategy: A general procedure for the quantification of proteins and post-translational modifications. Methods, 35(3):265–273, 2005.

[63] John Klimek, James S Eddes, Laura Hohmann, Jennifer Jackson, Amelia Peterson, Simon Letarte, Philip R Gafken, Jonathan E Katz, Parag Mallick, Hookeun Lee, et al. The standard protein mix database: A diverse data set to assist in the production of improved peptide and protein identification software tools. Journal of Proteome Research, 7(1):96–103, 2007.

[64] Andy T Kong, Felipe V Leprevost, Dmitry M Avtonomov, Dattatreya Mellacheruvu, and Alexey I Nesvizhskii. MSFragger: Ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics. Nature Methods, 14(5):513, 2017.

[65] Oleg V Krokhin, R Craig, V Spicer, W Ens, KG Standing, RC Beavis, and JA Wilkins. An improved model for prediction of retention times of tryptic peptides in ion pair reversed-phase HPLC: Its application to protein peptide mapping by off-line HPLC-MALDI MS. Molecular & Cellular Proteomics, 3(9):908–919, 2004.

[66] Henry Lam and Ruedi Aebersold. Spectral library searching for peptide identification via tandem MS. In Proteome Bioinformatics, pages 95–103. Springer, 2010.

[67] Henry Lam, Eric W Deutsch, James S Eddes, Jimmy K Eng, Stephen E Stein, and Ruedi Aebersold. Building consensus spectral libraries for peptide identification in proteomics. Nature Methods, 5(10):873–875, 2008.

[68] Cosmin Lazar, Laurent Gatto, Myriam Ferro, Christophe Bruley, and Thomas Burger. Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies. Journal of Proteome Research, 15(4):1116–1125, 2016.

[69] Yong Fuga Li, Randy J Arnold, Yixue Li, Predrag Radivojac, Quanhu Sheng, and Haixu Tang. A Bayesian approach to protein inference problem in shotgun proteomics. Journal of Computational Biology, 16(8):1183–1193, 2009.

[70] Hongbin Liu, Rovshan G Sadygov, and John R Yates. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical Chemistry, 76(14):4193–4201, 2004.

[71] Yaniv Loewenstein, Elon Portugaly, Menachem Fromer, and Michal Linial. Efficient algorithms for accurate hierarchical clustering of huge datasets: Tackling the entire protein space. Bioinformatics, 24(13):i41–i49, 2008.

[72] Roland Luethy, Darren E Kessner, Jonathan E Katz, Brendan MacLean, Robert Grothe, Kian Kani, Vitor Faca, Sharon Pitteri, Samir Hanash, David B Agus, and Parag Mallick. Precursor-ion mass re-estimation improves peptide identification on hybrid instruments. Journal of Proteome Research, 7(9):4031–4039, 2008.

[73] Bin Ma. Novor: Real-time peptide de novo sequencing software. Journal of the American Society for Mass Spectrometry, 26(11):1885–1894, 2015.

[74] Bin Ma, Kaizhong Zhang, Christopher Hendrie, Chengzhi Liang, Ming Li, Amanda Doherty-Kirby, and Gilles Lajoie. PEAKS: Powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry, 17(20):2337–2342, 2003.

[75] Tobias Maier, Marc Güell, and Luis Serrano. Correlation of mRNA and protein in complex biological samples. FEBS Letters, 583(24):3966–3973, 2009.

[76] Matthias Mann, Shao-En Ong, Mads Grønborg, Hanno Steen, Ole N Jensen, and Akhilesh Pandey. Analysis of protein phosphorylation using mass spectrometry: Deciphering the phosphoproteome. Trends in Biotechnology, 20(6):261–268, 2002.

[77] Lennart Martens, Henning Hermjakob, Philip Jones, Marcin Adamski, Chris Taylor, David States, Kris Gevaert, Joël Vandekerckhove, and Rolf Apweiler. PRIDE: The proteomics identifications database. Proteomics, 5(13):3537–3545, 2005.

[78] Davis J McCarthy and Gordon K Smyth. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics, 25(6):765–771, 2009.

[79] Annette Michalski, Juergen Cox, and Matthias Mann. More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC-MS/MS. Journal of Proteome Research, 10(4):1785–1793, 2011.

[80] Luminita Moruz and Lukas Käll. Peptide retention time prediction. Mass Spectrometry Reviews, 2016.

[81] Luminita Moruz, Daniela Tomazela, and Lukas Käll. Training, selection, and robust calibration of retention time models for targeted proteomics. Journal of Proteome Research, 9(10):5209–5216, 2010.

[82] Akira Motoyama and John R Yates III. Multidimensional LC separations in shotgun proteomics, 2008.

[83] Seungjin Na, Nuno Bandeira, and Eunok Paek. Fast multi-blind modification search through tandem mass spectrometry. Molecular & Cellular Proteomics, pages mcp–M111, 2011.

[84] Alexey I Nesvizhskii and Ruedi Aebersold. Interpretation of shotgun proteomic data: The protein inference problem. Molecular & Cellular Proteomics, 4(10):1419–1440, 2005.

[85] Alexey I Nesvizhskii, Andrew Keller, Eugene Kolker, and Ruedi Aebersold. A statistical model for identifying proteins by tandem mass spectrometry. Analytical Chemistry, 75(17):4646–4658, 2003.

[86] Alexey I Nesvizhskii, Olga Vitek, and Ruedi Aebersold. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nature Methods, 4(10):787–797, 2007.

[87] Kang Ning, Damian Fermin, and Alexey I Nesvizhskii. Computational analysis of unassigned high-quality MS/MS spectra in proteomic data sets. Proteomics, 10(14):2712–2718, 2010.

[88] Jesper V Olsen, Boris Macek, Oliver Lange, Alexander Makarov, Stevan Horning, and Matthias Mann. Higher-energy C-trap dissociation for peptide modification analysis. Nature Methods, 4(9):709, 2007.

[89] Shao-En Ong, Blagoy Blagoev, Irina Kratchmarova, Dan Bach Kristensen, Hanno Steen, Akhilesh Pandey, and Matthias Mann. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Molecular & Cellular Proteomics, 1(5):376–386, 2002.

[90] Nico Pfeifer, Andreas Leinenbach, Christian G Huber, and Oliver Kohlbacher. Statistical learning of peptide retention behavior in chromatographic separations: A new kernel-based approach for computational proteomics. BMC Bioinformatics, 8(1):468, 2007.

[91] Paola Picotti and Ruedi Aebersold. Selected reaction monitoring–based proteomics: Workflows, potential, pitfalls and future directions. Nature Methods, 9(6):555, 2012.

[92] David W Powell, Connie M Weaver, Jennifer L Jennings, K Jill McAfee, Yue He, P Anthony Weil, and Andrew J Link. Cluster analysis of mass spectrometry data reveals a novel component of SAGA. Molecular and Cellular Biology, 24(16):7249–7259, 2004.

[93] Lukas Reiter, Manfred Claassen, Sabine P Schrimpf, Marko Jovanovic, Alexander Schmidt, Joachim M Buhmann, Michael O Hengartner, and Ruedi Aebersold. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Molecular & Cellular Proteomics, 8(11):2405–2417, 2009.

[94] Hannes L Röst, Yansheng Liu, Giuseppe D’Agostino, Matteo Zanella, Pedro Navarro, George Rosenberger, Ben C Collins, Ludovic Gillet, Giuseppe Testa, Lars Malmström, et al. TRIC: An automated alignment strategy for reproducible protein quantification in targeted proteomics. Nature Methods, 13(9):777, 2016.

[95] Hannes L Röst, George Rosenberger, Pedro Navarro, Ludovic Gillet, Saša M Miladinović, Olga T Schubert, Witold Wolski, Ben C Collins, Johan Malmström, Lars Malmström, et al. OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nature Biotechnology, 32(3):219, 2014.

[96] Fahad Saeed, Jason D Hoffert, and Mark A Knepper. CAMS-RS: Clustering algorithm for large-scale mass spectrometry data using restricted search space and intelligent random sampling. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 11(1):128–141, 2014.

[97] Mikhail M Savitski, Mathias Wilhelm, Hannes Hahne, Bernhard Kuster, and Marcus Bantscheff. A scalable approach for protein false discovery rate estimation in large proteomic data sets. Molecular & Cellular Proteomics, pages mcp–M114, 2015.

[98] Oliver Serang and Lukas Käll. Solution to statistical challenges in proteomics is more statistics, not less. Journal of Proteome Research, 2015.

[99] Oliver Serang, Michael J MacCoss, and William Stafford Noble. Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data. Journal of Proteome Research, 9(10):5346–5357, 2010.

[100] Oliver Serang and William Noble. A review of statistical methods for protein identification using tandem mass spectrometry. Statistics and its Interface, 5(1):3, 2012.

[101] Sigma-Aldrich. UPS1 & UPS2 proteomic standards. http://www.sigmaaldrich.com/life-science/proteomics/mass-spectrometry/ups1-and-ups2-proteomic.html. Accessed: 2016-03-15.

[102] Owen S Skinner and Neil L Kelleher. Illuminating the dark matter of shotgun proteomics. Nature Biotechnology, 33(7):717, 2015.

[103] Lloyd M Smith, Neil L Kelleher, et al. Proteoform: A single term describing protein complexity. Nature Methods, 10(3):186–187, 2013.

[104] John D Storey and Robert Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003.

[105] John EP Syka, Joshua J Coon, Melanie J Schroeder, Jeffrey Shabanowitz, and Donald F Hunt. Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry. Proceedings of the National Academy of Sciences, 101(26):9528–9533, 2004.

[106] David L Tabb, Melissa R Thompson, Gurusahai Khalsa-Moyers, Nathan C VerBerkmoes, and W Hayes McDonald. MS2Grouper: Group assessment and synthetic replacement of duplicate proteomic tandem mass spectra. Journal of the American Society for Mass Spectrometry, 16(8):1250–1261, 2005.

[107] Johan Teleman, Aakash Chawade, Marianne Sandin, Fredrik Levander, and Johan Malmström. Dinosaur: A refined open-source peptide MS feature detector. Journal of Proteome Research, 15(7):2143–2151, 2016.

[108] Andrew Thompson, Jürgen Schäfer, Karsten Kuhn, Stefan Kienle, Josef Schwarz, Günter Schmidt, Thomas Neumann, and Christian Hamon. Tandem mass tags: A novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical Chemistry, 75(8):1895–1904, 2003.

[109] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.

[110] John D Venable, Meng-Qiu Dong, James Wohlschlegel, Andrew Dillin, and John R Yates. Automated approach for quantitative analysis of complex peptide mixtures from tandem mass spectra. Nature Methods, 1(1):39–45, 2004.

[111] Juan Antonio Vizcaíno, Attila Csordas, Noemi del Toro, José A Dianes, Johannes Griss, Ilias Lavidas, Gerhard Mayer, Yasset Perez-Riverol, Florian Reisinger, Tobias Ternent, et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Research, 44(D1):D447–D456, 2016.

[112] Zack Weinberg. SVM separating hyperplanes. https://commons.wikimedia.org/wiki/File:Svm_separating_hyperplanes_(SVG).svg, 2012. Accessed: 2016-03-16.

[113] Hendrik Weisser, Sven Nahnsen, Jonas Grossmann, Lars Nilse, Andreas Quandt, Hendrik Brauer, Marc Sturm, Erhan Kenar, Oliver Kohlbacher, Ruedi Aebersold, et al. An automated pipeline for high-throughput label-free quantitative proteomics. Journal of Proteome Research, 12(4):1628–1644, 2013.

[114] Sebastian Wiese, Kai A Reidegeld, Helmut E Meyer, and Bettina Warscheid. Protein labeling by iTRAQ: A new tool for quantitative mass spectrometry in proteome research. Proteomics, 7(3):340–350, 2007.

[115] Mathias Wilhelm, Judith Schlegl, Hannes Hahne, Amin Moghaddas Gholami, Marcus Lieberenz, Mikhail M Savitski, Emanuel Ziegler, Lars Butzmann, Siegfried Gessulat, Harald Marx, et al. Mass-spectrometry-based draft of the human proteome. Nature, 509(7502):582–587, 2014.

[116] Daniel R Zerbino, Premanand Achuthan, Wasiu Akanni, M Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, Carla Cummins, Astrid Gall, Carlos García Girón, et al. Ensembl 2018. Nucleic Acids Research, 46(D1):D754–D761, 2017.

[117] Bing Zhang, Matthew C Chambers, and David L Tabb. Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. Journal of Proteome Research, 6(9):3549–3557, 2007.

[118] Bo Zhang, Lukas Käll, and Roman A Zubarev. DeMix-Q: Quantification-centered data processing workflow. Molecular & Cellular Proteomics, pages mcp–O115, 2016.

[119] Bo Zhang, Mohammad Pirmoradian, Roman Zubarev, and Lukas Käll. Covariation of peptide abundances accurately reflects protein concentration differences. Molecular & Cellular Proteomics, 16(5):936–948, 2017.

[120] Yafeng Zhu, Lina Hultin-Rosenberg, Jenny Forshed, Rui MM Branca, Lukas M Orre, and Janne Lehtiö. SpliceVista, a tool for splice variant identification and visualization in shotgun proteomics data. Molecular & Cellular Proteomics, 13(6):1552–1562, 2014.

[121] Roman A Zubarev. Electron-capture dissociation tandem mass spectrometry. Current Opinion in Biotechnology, 15(1):12–16, 2004.

[122] Roman A Zubarev. The challenge of the proteome dynamic range and its implications for in-depth proteomics. Proteomics, 13(5):723–726, 2013.