Protein Sequence Databases,
Peptides to Proteins, and
Statistical Significance
Nathan Edwards
Department of Biochemistry and Mol. & Cell. Biology
Georgetown University Medical Center
2
Protein Sequence Databases
• Link between mass spectra and proteins
• A protein’s amino-acid sequence provides a basis for interpreting
  • Enzymatic digestion
  • Separation protocols
  • Fragmentation
  • Peptide ion masses
• We must interpret database information as carefully as mass spectra.
3
More than sequence…
Protein sequence databases provide much more than sequence:
• Names
• Descriptions
• Facts
• Predictions
• Links to other information sources
Protein databases provide a link to the current state of our understanding about a protein.
4
Much more than sequence
Names
• Accession, Name, Description
Biological Source
• Organism, Source, Taxonomy
Literature
Function
• Biological process, molecular function, cellular component
• Known and predicted
Features
• Polymorphism, Isoforms, PTMs, Domains
Derived Data
• Molecular weight, pI
5
Database types
Curated
• Swiss-Prot
• UniProt
• RefSeq NP
Translated
• TrEMBL
• RefSeq XP, ZP
Omnibus
• NCBI’s nr
• MSDB
• IPI
Other
• PDB
• HPRD
• EST
• Genomic
6
Swiss-Prot
• From ExPASy
  • Expert Protein Analysis System
  • Swiss Institute of Bioinformatics
• ~ 515,000 protein sequence “entries”
• ~ 12,000 species represented
• ~ 20,000 Human proteins
• Highly curated
• Minimal redundancy
• Part of UniProt Consortium
7
TrEMBL
• Translated EMBL nucleotide sequences
  • European Molecular Biology Laboratory
  • European Bioinformatics Institute (EBI)
• Computer annotated
• Only sequences absent from Swiss-Prot
• ~ 10.5 M protein sequence “entries”
• ~ 230,000 species
• ~ 75,000 Human proteins
• Part of UniProt Consortium
8
UniProt
• Universal Protein Resource
• Combination of sequences from
  • Swiss-Prot
  • TrEMBL
• Mixture of highly curated/reviewed (Swiss-Prot) and computer-annotated (TrEMBL) entries
• “Similar sequence” clusters are available
  • 50%, 90%, 100% sequence similarity
9
RefSeq
• Reference Sequence
• From NCBI (National Center for Biotechnology Information), NLM, NIH
• Integrated genomic, transcript, and protein sequences.
• Varying levels of curation
  • Reviewed, Validated, …, Predicted, …
• ~ 9.7 M protein sequence “entries”
  • ~ 209,000 reviewed, ~ 90,000 validated
• ~ 39,000 Human proteins
10
RefSeq
• Particular focus on major research organisms
  • Tightly integrated with genome projects.
• Curated entries: NP accessions
• Predicted entries: XP accessions
• Others: YP, ZP, AP
11
IPI
• International Protein Index
• From EBI
• For a specific species, combines
  • UniProt, RefSeq, Ensembl
  • Species-specific databases: HInv-DB, VEGA, TAIR
• ~ 87,000 (from ~ 307,000) human protein sequence entries
• Human, mouse, rat, zebrafish, Arabidopsis, chicken, cow
• Slated for closure November 2010, but still going…
12
MSDB
• From Imperial College (London)
• Combines
  • PIR, TrEMBL, GenBank, Swiss-Prot
• Distributed with Mascot
  • …so well integrated with Mascot
• ~ 3.2 M protein sequence entries
• “Similar sequences” suppressed
  • 100% sequence similarity
• Not updated since September 2006 (obsolete)
13
NCBI’s nr
• “non-redundant”
• Contains
  • GenBank CDS translations
  • RefSeq Proteins
  • Protein Data Bank (PDB)
  • Swiss-Prot, TrEMBL, PIR
  • Others
• “Similar sequences” suppressed
  • 100% sequence similarity
• ~ 10.5 M protein sequence “entries”
14
Human Sequences
• Number of Human genes is believed to be between 20,000 and 25,000
SwissProt ~ 20,000
RefSeq ~ 39,000
TrEMBL ~ 75,000
IPI-HUMAN ~ 87,000
MSDB ~130,000
nr ~230,000
17
Accessions
• Permanent labels
• Short, machine readable
• Enable precise communication
• Typos render them unusable!
• Each database uses a different format
  • Swiss-Prot: P17947
  • Ensembl: ENSG00000066336
  • PIR: S60367; S60367
  • GO: GO:0003700;
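A format check is one way to catch some accession typos before they reach a search. The sketch below is a minimal illustration in Python; the three regular expressions are simplified assumptions fitted to the examples on this slide, not the databases' complete published rules, and `classify_accession` is a hypothetical helper name.

```python
import re

# Simplified, assumed patterns; each database documents its own exact rules.
ACCESSION_PATTERNS = {
    "Swiss-Prot": re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    "Ensembl gene": re.compile(r"ENSG[0-9]{11}"),
    "GO": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(label: str) -> list:
    """Return the names of all patterns that the whole label matches."""
    return [name for name, pattern in ACCESSION_PATTERNS.items()
            if pattern.fullmatch(label)]


for label in ["P17947", "ENSG00000066336", "GO:0003700", "P1794", "P17974"]:
    print(label, classify_accession(label))
```

The truncated label `P1794` matches nothing, but the transposed label `P17974` still passes the Swiss-Prot pattern: a format check cannot tell a well-formed wrong accession from the intended one.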
18
Names / IDs
• Compact mnemonic labels
• Not guaranteed permanent
• Require careful curation
• Conceptual objects
• ALBU_HUMAN
  • Serum Albumin
• RT30_HUMAN
  • Mitochondrial 28S ribosomal protein S30
• CP3A7_HUMAN
  • Cytochrome P450 3A7
19
Description / Name
• Free text description
• Human readable
• Space limited
• Hard for computers to interpret!
• No standard nomenclature or format
• Often abused….
• COX7R_HUMAN
  • Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]
20
FASTA Format
• >
• Accession number
  • No uniform format
  • Multiple accessions separated by |
• One line of description
  • Usually pretty cryptic
• Organism of sequence?
  • No uniform format
  • Official Latin name not necessarily used
• Amino-acid sequence in single-letter code
  • Usually spread over multiple lines.
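The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the accession fields sit in the first whitespace-delimited token of the header and are separated by `|`; real databases vary, which is exactly the "no uniform format" caveat. The function names are illustrative, the header is copied from a RefSeq entry shown later in this deck, and the sequence is a made-up placeholder.

```python
from typing import Iterator, List, Tuple


def split_header(header: str) -> Tuple[List[str], str]:
    """Split a header (without '>') into accession fields and description."""
    token, _, description = header.partition(" ")
    fields = [field for field in token.split("|") if field]
    return fields, description.strip()


def parse_fasta(text: str) -> Iterator[Tuple[List[str], str, str]]:
    """Yield (accession fields, description, sequence) for each record."""
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield (*split_header(header), "".join(chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined back together
    if header is not None:
        yield (*split_header(header), "".join(chunks))


# Real header text from this deck; the sequence is a made-up placeholder.
EXAMPLE = """>gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
ACDEFGHIKL
MNPQRSTVWY
"""

for fields, description, sequence in parse_fasta(EXAMPLE):
    print(fields, description, len(sequence))
```

Note that the organism only appears inside the free-text description here; the parser makes no attempt to extract it, because nothing guarantees the bracket convention.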
22
Organism / Species / Taxonomy
• The protein’s organism…
  • …or the source of the biological sample
• The most reliable sequence annotation available
  • Useful only to the extent that it is correct
• NCBI’s taxonomy is widely used
  • Provides a standard of sorts; hierarchical
  • Other databases don’t necessarily keep up
• Organism-specific sequence databases are starting to become available.
23
Organism / Species / Taxonomy
• Buffalo rat
• Gunn rats
• Norway rat
• Rattus PC12 clone IS
• Rattus norvegicus
• Rattus norvegicus8
• Rattus norwegicus
• Rattus rattiscus
• Rattus sp.
• Rattus sp. strain Wistar
• Sprague-Dawley rat
• Wistar rats
• brown rat
• laboratory rat
• rat
• rats
• zitter rats
24
Controlled Vocabulary
• Middle ground between computers and people
• Provides precision for concepts
  • Searching, sorting, browsing
  • Concept relationships
• Vocabulary / Ontology must be established
  • Human curation
• Link between concept and object:
  • Manually curated
  • Automatic / Predicted
25
Gene Ontology
• Hierarchical
  • Molecular function
  • Biological process
  • Cellular component
• Describes the vocabulary only!
  • Protein families provide GO association
• Not necessarily any appropriate GO category.
• Not necessarily in all three hierarchies.
• Sometimes general categories are used because none of the specific categories are correct.
27
Protein Families
• Similar sequence implies similar function
• Similar structure implies similar function
• Common domains imply similar function
• Bootstrap up from small sets of proteins/domains with well understood characteristics
• Usually a hybrid manual / automatic approach
30
Sequence Variants
• Protein sequence can vary due to
  • Polymorphism
  • Alternative splicing
  • Post-translational modification
• Sequence databases typically do not capture all versions of a protein’s sequence
33
Omnibus Database Redundancy Elimination
• Source databases often contain the same sequences with different descriptions
• Omnibus databases keep one copy of the sequence, and
  • An arbitrary description, or
  • All descriptions, or
  • Particular description, based on source preference
• Good definitions can be lost, including taxonomy
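The three description policies can be sketched in a few lines of Python. This is an illustration only, not how nr, MSDB, or IPI are actually built: the `collapse` function, the policy names, and the source tags are hypothetical, the descriptions are taken from the COMMD4 entries in this deck, and the sequence is a made-up placeholder.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Record = Tuple[str, str, str]  # (source tag, description, sequence)


def collapse(records: Iterable[Record], policy: str = "all",
             preference: Sequence[str] = ()) -> Dict[str, List[str]]:
    """Keep one copy of each sequence and pick descriptions by policy."""
    by_sequence: Dict[str, List[Tuple[str, str]]] = {}
    for source, description, sequence in records:
        by_sequence.setdefault(sequence, []).append((source, description))

    rank = {source: i for i, source in enumerate(preference)}
    collapsed: Dict[str, List[str]] = {}
    for sequence, entries in by_sequence.items():
        if policy == "all":
            kept = [description for _, description in entries]
        elif policy == "first":  # stands in for "an arbitrary description"
            kept = [entries[0][1]]
        elif policy == "preferred":  # best-ranked source wins
            best = min(entries, key=lambda e: rank.get(e[0], len(rank)))
            kept = [best[1]]
        else:
            raise ValueError(f"unknown policy: {policy}")
        collapsed[sequence] = kept
    return collapsed


SEQ = "ACDEFGHIKLMNPQRSTVWY"  # made-up placeholder sequence
RECORDS = [
    ("emb", "hypothetical protein [Homo sapiens]", SEQ),
    ("gb", "COMMD4 protein [Homo sapiens]", SEQ),
    ("ref", "COMM domain containing 4 [Homo sapiens]", SEQ),
    ("sp", "COMM domain containing protein 4", SEQ),
]

print(collapse(RECORDS, "first")[SEQ])
print(collapse(RECORDS, "preferred", preference=("sp", "ref"))[SEQ])
```

With the `first` policy the entry ends up labeled "hypothetical protein", hiding the informative names; with the `sp` source preferred, the surviving description no longer carries the bracketed organism. Both are instances of good definitions being lost.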
34
Description Elimination
• gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
• gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
• gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
• gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
• gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
• gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]
37
Peptides to Proteins
• A peptide sequence may occur in many different protein sequences
  • Variants, paralogues, protein families
• Separation, digestion and ionization are not well understood
• Proteins in sequence databases are extremely non-random, and very dependent
Protein Grouping Scenarios
• Parsimony
  • Minimum # of proteins
• Weighted
  • Choose proteins with the most confident peptides (ProteinProphet)
• Show all
  • Mark repeated peptides
• Often no (ideal) resolution is possible!
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
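The parsimony scenario can be illustrated with a greedy set-cover heuristic in Python. This is a sketch under stated assumptions: greedy selection only approximates the true minimum protein set, it is not the method of ProteinProphet or any other named tool, and the protein and peptide labels are invented for the example.

```python
from typing import Dict, List, Set


def greedy_parsimony(protein_peptides: Dict[str, Set[str]]) -> List[str]:
    """Pick proteins until every identified peptide is explained.

    Each round takes the protein that explains the most still-unexplained
    peptides; ties go to the alphabetically first protein name.
    """
    uncovered: Set[str] = set().union(*protein_peptides.values())
    chosen: List[str] = []
    while uncovered:
        best = max(sorted(protein_peptides),
                   key=lambda p: len(protein_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_peptides[best]
    return chosen


# Invented example: peptides p1..p4 mapped onto four candidate proteins.
EXAMPLE = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3"},
    "C": {"p3", "p4"},
    "D": {"p4"},
}

print(greedy_parsimony(EXAMPLE))
```

Here protein A is chosen first, and p4 is then explained equally well by C or D. The code picks C only because of the alphabetical tie-break, which is a small instance of "often no (ideal) resolution is possible".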
42
High Quality Peptide Identification: E-value < 10^-8
44
Peptide Identification
• Peptide fragmentation by CID is poorly understood
• MS/MS spectra represent incomplete information about amino-acid sequence
  • I/L, K/Q, GG/N, …
• Correct identifications don’t come with a certificate!
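The I/L, K/Q, and GG/N ambiguities follow directly from residue masses. The Python sketch below uses standard monoisotopic residue masses rounded to five decimals; the helper names are illustrative.

```python
# Monoisotopic residue masses in Da, rounded to five decimals.
RESIDUE_MASS = {
    "G": 57.02146,
    "N": 114.04293,
    "I": 113.08406,
    "L": 113.08406,
    "K": 128.09496,
    "Q": 128.05858,
}


def residue_mass(sequence: str) -> float:
    """Sum of residue masses for a short stretch of sequence."""
    return sum(RESIDUE_MASS[aa] for aa in sequence)


def mass_difference(a: str, b: str) -> float:
    """Absolute mass difference between two stretches of sequence."""
    return abs(residue_mass(a) - residue_mass(b))


for a, b in [("I", "L"), ("GG", "N"), ("K", "Q")]:
    print(f"{a} vs {b}: {mass_difference(a, b):.5f} Da")
```

I and L are isomers with identical mass, and GG has the same elemental composition as N, so any difference printed for that pair is rounding in the table. K and Q differ by roughly 0.036 Da, which can only be resolved when fragment mass accuracy is good enough.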
45
Peptide Identification
• High-throughput workflows demand we analyze all spectra, all the time.
• Spectra may not contain enough information to be interpreted correctly
  • …bad static on a cell phone
• Peptides may not match our assumptions
  • …it’s all Greek to me
• “Don’t know” is an acceptable answer!
46
What scores do “wrong” peptides get?
• Generate random peptide sequences
• Real looking fragment masses
• Empirical distribution
• Require similar precursor mass
• Arbitrary score function can model anything we like!
49
Random Peptide Scores
• Truly random peptides don’t look much like real peptides
• Just use peptides from the sequence database!
• Assumptions:
  • IID sampling of “score” values per spectrum
• Caveats:
  • Correct peptide (non-random) may be included
  • Peptides are not independent
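Treating the scores of database peptides as an empirical "random" distribution gives a simple p-value estimate. The Python sketch below assumes higher scores are better and adds one to numerator and denominator, a common convention that is an assumption here rather than something stated on the slide; the scores are invented.

```python
from typing import Sequence


def empirical_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """Estimated chance that a random peptide scores at least `observed`.

    The +1 terms keep the estimate above zero when no null score
    reaches the observed score (an assumed convention).
    """
    at_least = sum(1 for score in null_scores if score >= observed)
    return (at_least + 1) / (len(null_scores) + 1)


# Invented scores of other database peptides against one spectrum.
NULL_SCORES = [1.0, 1.5, 2.0, 2.2, 2.5, 3.0, 3.1, 3.3, 4.0]

print(empirical_p_value(3.2, NULL_SCORES))
print(empirical_p_value(9.0, NULL_SCORES))
```

Both caveats apply directly: if the correct peptide's score sits in `NULL_SCORES` it inflates the estimate, and the smallest p-value this approach can report is limited by the number of null scores, which is why extrapolating with a fitted model is attractive.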
50
Extrapolating from the Empirical Distribution
• Often, the empirical shape is consistent with a theoretical model
Geer et al., J. Proteome Research, 2004
Fenyo & Beavis, Anal. Chem., 2003
E-values vs p-values
• Need to adjust for the size of the sequence database
  • Best false/random score goes up with number of trials
• E-value makes this adjustment
  • Expected number of incorrect peptides (with this score) from this sequence database.
• E-value = # Trials * p-value (to 1st approx.)
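A short worked example of the relation, written in Python. It assumes "# Trials" means the number of candidate peptides scored against the spectrum and that the trials are independent; the numbers are invented for illustration.

```python
def e_value(p_value: float, n_trials: int) -> float:
    """Expected number of random matches at least this good."""
    return n_trials * p_value


def prob_at_least_one(p_value: float, n_trials: int) -> float:
    """Chance of one or more random matches, assuming independent trials."""
    return 1.0 - (1.0 - p_value) ** n_trials


# Invented numbers: p-value 1e-6 for one peptide, 10,000 candidates tried.
p, n = 1e-6, 10_000
print(e_value(p, n))            # about 0.01
print(prob_at_least_one(p, n))  # slightly below the E-value
```

For small values the two quantities nearly coincide; they diverge once the E-value approaches 1, where the probability saturates but the expected count keeps growing with database size.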
52
False Discovery Rate
• Which peptide IDs to accept?
  • E-value only provides a per-spectrum statistic
  • With enough spectra, even these can be misleading!
• Decide which spectra (w/ scores) will be accepted:
  • SEQUEST Xcorr, E-value, Score, etc., plus...
  • Threshold on identification criteria
• Control the proportion of incorrect identifications in the result for the entire dataset
53
Distribution of scores over all spectra
[Figure: distribution of scores over all spectra; vertical axis ticks 0 to 200, horizontal axis ticks -3.9 to 7.3. Credit: Brian Searle, Proteome Software]
54
Distribution of scores over all spectra
[Figure: distribution of scores over all spectra on the same axes, with “False” and “True” components labeled. Credit: Brian Searle, Proteome Software]
55
False Discovery Rate
• FDR(score ≥ x) = (# false ids with score ≥ x) / (# all ids with score ≥ x)
• Need to estimate numerator!
• Assumes the false (and true) scores, sampled over spectra, are IID
  • Not true for some peptide-spectrum scores
  • (Mostly) true for E-values
• Can compute the # false ids using a decoy search…
Decoy searches
• Shuffle or reverse sequence database
  • Same size as original
  • Known false identifications
  • Estimate “False” distribution
• Alternatively, merge target+decoy results:
  • Competition between target and decoy scores
  • Assume false target and false decoys each win half the time
  • FDR(score ≥ x) = 2 * (# decoy ids with score ≥ x) / (# target ids with score ≥ x)
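Both decoy estimates reduce to counting identifications above a threshold. The Python sketch below assumes higher scores are better and uses invented scores; `fdr_separate` and `fdr_merged` are hypothetical names, and the merged version follows the formula on this slide.

```python
from typing import Sequence, Tuple


def fdr_separate(target_scores: Sequence[float],
                 decoy_scores: Sequence[float], x: float) -> float:
    """Separate searches: decoy hits at or above x estimate false target hits."""
    n_target = sum(1 for s in target_scores if s >= x)
    n_decoy = sum(1 for s in decoy_scores if s >= x)
    return n_decoy / n_target if n_target else 0.0


def fdr_merged(best_hits: Sequence[Tuple[float, bool]], x: float) -> float:
    """Merged search: best_hits holds (score, is_decoy) for each spectrum's winner."""
    n_decoy = sum(1 for s, is_decoy in best_hits if s >= x and is_decoy)
    n_target = sum(1 for s, is_decoy in best_hits if s >= x and not is_decoy)
    return 2 * n_decoy / n_target if n_target else 0.0


# Invented scores for illustration only.
TARGET = [9.1, 8.7, 8.2, 7.5, 6.8, 5.9]
DECOY = [7.9, 6.1, 5.0, 4.2, 3.9, 3.1]
MERGED = [(9.1, False), (8.7, False), (8.2, False), (7.9, True),
          (7.5, False), (6.8, False), (6.1, True), (5.9, False)]

for threshold in (8.0, 7.0):
    print(threshold, fdr_separate(TARGET, DECOY, threshold),
          fdr_merged(MERGED, threshold))
```

Raising the threshold from 7.0 to 8.0 removes the only decoy hit and one target hit, so the estimated FDR drops to zero at the cost of one fewer accepted identification. With counts this small the estimate is very coarse; real datasets have thousands of spectra.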