![Page 1: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/1.jpg)
Proteomics Informatics – Protein identification I: searching protein
sequence collections and significance testing (Week 4)
![Page 2: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/2.jpg)
Peptide Mapping - Mass Accuracy
![Page 3: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/3.jpg)
Peptide Mapping - Database Size
C. elegans
S. cerevisiae
Human
![Page 4: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/4.jpg)
Peptide Mapping - Cys-Containing Peptides
C. elegans
S. cerevisiae
Human
![Page 5: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/5.jpg)
Identification – Peptide Mass Fingerprinting
Sample: Digestion; MS
Search: Sequence DB; Pick Protein; All Peptide Masses; Compare, Score, Test Significance; Repeat for each protein
Result: Identified Proteins
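The search loop on this slide can be sketched in a few lines. This is a minimal illustration, not the ProFound method: it assumes trypsin cleaves after K or R except before P, uses the monoisotopic residue masses from the amino acid table later in this deck, and scores each protein by a plain count of matched masses. The function names and the 0.5 Da tolerance are hypothetical.

```python
import re

# Monoisotopic residue masses (Da), as listed in the deck's amino acid table.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER = 18.0106   # H2O added to the sum of residue masses
PROTON = 1.0073   # charge carrier for [M+H]+


def tryptic_peptides(protein):
    """Cleave after K or R, except before P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def mh_mass(peptide):
    """Singly protonated monoisotopic mass [M+H]+ of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def count_matches(observed, protein, tol=0.5):
    """Number of observed masses within tol (Da) of a theoretical peptide mass."""
    theoretical = [mh_mass(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)


def rank_proteins(observed, database, tol=0.5):
    """Pick each protein in turn, score it, and sort by match count."""
    scores = {name: count_matches(observed, seq, tol) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A match count alone says nothing about significance; the later slides on expectation values address that step.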
![Page 6: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/6.jpg)
ProFound – Search Parameters
http://prowl.rockefeller.edu/
![Page 7: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/7.jpg)
ProFound – Protein Identification by Peptide Mapping
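$$P(k \mid DI) \propto P(k \mid I)\,\frac{(N-r)!}{N!}\,\prod_{i=1}^{r}\left\{\sqrt{\frac{2}{\pi}}\,\frac{m_{\max}-m_{\min}}{\sigma_i}\,\sum_{j=1}^{g_i}\exp\left[-\frac{(m_i-m_{ij0})^2}{2\sigma_i^2}\right]\right\}F_{\text{pattern}}$$
W. Zhang & B.T. Chait, Analytical Chemistry 72 (2000) 2482-2489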
![Page 8: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/8.jpg)
ProFound Results
![Page 9: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/9.jpg)
Peptide Mapping – Mass Accuracy
[Figure: two panels plotted against Mass Tolerance (Da), 0 to 2. ProFound: -log(e), 0 to 7. Mascot: Score, 0 to 140.]
![Page 10: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/10.jpg)
Peptide Mapping - Database Size
Databases searched: S. cerevisiae, Fungi, All Taxa
Expectation values (peptide mapping example):
S. cerevisiae: 4.8e-7
Fungi: 8.4e-6
All Taxa: 2.9e-4
![Page 11: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/11.jpg)
Database size
![Page 12: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/12.jpg)
Missed Cleavage Sites
Panels: u = 1, u = 2, u = 4
Expectation values (peptide mapping example):
u = 1: 4.8e-7
u = 2: 1.1e-5
u = 4: 6.8e-4
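One way to see why the expectation value gets worse as more missed cleavages are allowed is to count the candidate peptides. The sketch below assumes that u is the maximum number of missed cleavage sites per peptide (the slide does not define it) and that trypsin cleaves after K or R except before P. The function name `digest` is hypothetical.

```python
import re


def digest(protein, max_missed=0):
    """Tryptic peptides containing up to max_missed missed cleavage sites."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for start in range(len(pieces)):
        for extra in range(max_missed + 1):
            end = start + extra + 1
            if end > len(pieces):
                break
            # Joining adjacent pieces corresponds to skipping cleavage sites.
            peptides.append("".join(pieces[start:end]))
    return peptides
```

Every extra candidate peptide is another chance for a random mass match, which is consistent with the trend in the expectation values above.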
![Page 13: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/13.jpg)
Peptide Mapping - Partial Modifications
No Modifications
Phosphorylation (S, T, or Y)
Searched Searched With Without Possible Modifications Phosphorylation
of S/T/Y
DARPP-32 0.00006 0.01
CFTR 0.00002 0.005
Even if the protein is modified it is usually better to search a protein sequence database without specifying possible modifications using peptide mapping data.
![Page 14: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/14.jpg)
Peptide Mapping - Ranking by Direct Calculation of the Significance
![Page 15: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/15.jpg)
General Criteria for a Good Protein Identification Algorithm
The response to random input data should be random.
Maximum number of correct identifications and minimum number of incorrect identifications for any data set.
Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.
The statistical significance of the results should be calculated.
The searches should be fast.
![Page 16: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/16.jpg)
Response to Random Data
[Figure: y-axis Normalized Frequency]
![Page 17: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/17.jpg)
Peptide Fragmentation
Ion Source -> Mass Analyzer 1 -> Fragmentation -> Mass Analyzer 2 -> Detector
Fragment ion types: b, y
![Page 18: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/18.jpg)
Identification – Tandem MS
![Page 19: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/19.jpg)
Tandem MS – Sequence Confirmation
KLEDEELFGS
[Figure: MS/MS spectrum; x-axis m/z (250 to 1000), y-axis % Relative Abundance (0 to 100)]
![Page 20: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/20.jpg)
Tandem MS – Sequence Confirmation
KLEDEELFGS
b ions: K 1166, L 1020, E 907, D 778, E 663, E 534, L 405, F 292, G 145, S 88
[Figure: MS/MS spectrum; x-axis m/z (250 to 1000), y-axis % Relative Abundance (0 to 100)]
![Page 21: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/21.jpg)
Tandem MS – Sequence Confirmation
KLEDEELFGS
y ions: K 147, L 260, E 389, D 504, E 633, E 762, L 875, F 1022, G 1080, S 1166
b ions: K 1166, L 1020, E 907, D 778, E 663, E 534, L 405, F 292, G 145, S 88
[Figure: MS/MS spectrum; x-axis m/z (250 to 1000), y-axis % Relative Abundance (0 to 100)]
![Page 22: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/22.jpg)
Tandem MS – Sequence Confirmation
KLEDEELFGS
y ions: K 147, L 260, E 389, D 504, E 633, E 762, L 875, F 1022, G 1080, S 1166
b ions: K 1166, L 1020, E 907, D 778, E 663, E 534, L 405, F 292, G 145, S 88
[Figure: MS/MS spectrum; x-axis m/z (250 to 1000), y-axis % Relative Abundance (0 to 100)]
Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080
![Page 23: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/23.jpg)
![Page 24: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/24.jpg)
Tandem MS – Sequence Confirmation
KLEDEELFGS
[Same ion ladder and labeled spectrum as the previous slide, with a mass difference of 113 marked twice]
![Page 25: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/25.jpg)
Tandem MS – Sequence Confirmation
KLEDEELFGS
[Same ion ladder and labeled spectrum as the previous slide, with a mass difference of 129 marked twice]
![Page 26: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/26.jpg)
![Page 27: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/27.jpg)
![Page 28: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/28.jpg)
![Page 29: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/29.jpg)
Tandem MS – de novo Sequencing
[Figure: MS/MS spectrum; x-axis m/z (250 to 1000), y-axis % Relative Abundance (0 to 100)]
Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080
Mass Differences
Amino acid masses

| 1-letter code | 3-letter code | Chemical formula | Monoisotopic (Da) | Average (Da) |
| --- | --- | --- | --- | --- |
| A | Ala | C3H5ON | 71.0371 | 71.0788 |
| R | Arg | C6H12ON4 | 156.101 | 156.188 |
| N | Asn | C4H6O2N2 | 114.043 | 114.104 |
| D | Asp | C4H5O3N | 115.027 | 115.089 |
| C | Cys | C3H5ONS | 103.009 | 103.139 |
| E | Glu | C5H7O3N | 129.043 | 129.116 |
| Q | Gln | C5H8O2N2 | 128.059 | 128.131 |
| G | Gly | C2H3ON | 57.0215 | 57.0519 |
| H | His | C6H7ON3 | 137.059 | 137.141 |
| I | Ile | C6H11ON | 113.084 | 113.159 |
| L | Leu | C6H11ON | 113.084 | 113.159 |
| K | Lys | C6H12ON2 | 128.095 | 128.174 |
| M | Met | C5H9ONS | 131.04 | 131.193 |
| F | Phe | C9H9ON | 147.068 | 147.177 |
| P | Pro | C5H7ON | 97.0528 | 97.1167 |
| S | Ser | C3H5O2N | 87.032 | 87.0782 |
| T | Thr | C4H7O2N | 101.048 | 101.105 |
| W | Trp | C11H10ON2 | 186.079 | 186.213 |
| Y | Tyr | C9H9O2N | 163.063 | 163.176 |
| V | Val | C5H9ON | 99.0684 | 99.1326 |

Sequences consistent with spectrum
![Page 30: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/30.jpg)
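Tandem MS – de novo Sequencing
Mass differences between peaks (row peak subtracted from column peak):

| | 292 | 389 | 405 | 504 | 534 | 633 | 663 | 762 | 778 | 875 | 907 | 1020 | 1022 | 1079 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 260 | 32 | 129 | 145 | 244 | 274 | 373 | 403 | 502 | 518 | 615 | 647 | 760 | 762 | 819 |
| 292 | | 97 | 113 | 212 | 242 | 341 | 371 | 470 | 486 | 583 | 615 | 728 | 730 | 787 |
| 389 | | | 16 | 115 | 145 | 244 | 274 | 373 | 389 | 486 | 518 | 631 | 633 | 690 |
| 405 | | | | 99 | 129 | 228 | 258 | 357 | 373 | 470 | 502 | 615 | 617 | 674 |
| 504 | | | | | 30 | 129 | 159 | 258 | 274 | 371 | 403 | 516 | 518 | 575 |
| 534 | | | | | | 99 | 129 | 228 | 244 | 341 | 373 | 486 | 488 | 545 |
| 633 | | | | | | | 30 | 129 | 145 | 242 | 274 | 387 | 389 | 446 |
| 663 | | | | | | | | 99 | 115 | 212 | 244 | 357 | 359 | 416 |
| 762 | | | | | | | | | 16 | 113 | 145 | 258 | 260 | 317 |
| 778 | | | | | | | | | | 97 | 129 | 242 | 244 | 301 |
| 875 | | | | | | | | | | | 32 | 145 | 147 | 204 |
| 907 | | | | | | | | | | | | 113 | 115 | 172 |
| 1020 | | | | | | | | | | | | | 2 | 59 |
| 1022 | | | | | | | | | | | | | | 57 |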
![Page 31: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/31.jpg)
![Page 32: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/32.jpg)
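Tandem MS – de novo Sequencing
Mass differences that match an amino acid residue mass are replaced by the residue letter:

| | 292 | 389 | 405 | 504 | 534 | 633 | 663 | 762 | 778 | 875 | 907 | 1020 | 1022 | 1079 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 260 | 32 | E | 145 | 244 | 274 | 373 | 403 | 502 | 518 | 615 | 647 | 760 | 762 | 819 |
| 292 | | P | I/L | 212 | 242 | 341 | 371 | 470 | 486 | 583 | 615 | 728 | 730 | 787 |
| 389 | | | 16 | D | 145 | 244 | 274 | 373 | 389 | 486 | 518 | 631 | 633 | 690 |
| 405 | | | | V | E | 228 | 258 | 357 | 373 | 470 | 502 | 615 | 617 | 674 |
| 504 | | | | | 30 | E | 159 | 258 | 274 | 371 | 403 | 516 | 518 | 575 |
| 534 | | | | | | V | E | 228 | 244 | 341 | 373 | 486 | 488 | 545 |
| 633 | | | | | | | 30 | E | 145 | 242 | 274 | 387 | 389 | 446 |
| 663 | | | | | | | | V | D | 212 | 244 | 357 | 359 | 416 |
| 762 | | | | | | | | | 16 | I/L | 145 | 258 | 260 | 317 |
| 778 | | | | | | | | | | P | E | 242 | 244 | 301 |
| 875 | | | | | | | | | | | 32 | 145 | F | 204 |
| 907 | | | | | | | | | | | | I/L | D | 172 |
| 1020 | | | | | | | | | | | | | 2 | 59 |
| 1022 | | | | | | | | | | | | | | G |

[Six entries in the matrix are marked with an X on the slide.]
…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…
Peptide M+H = 1166
1166 - 1079 = 87 => S
SGF(I/L)EEDE(I/L)…
1166 - 1020 - 18 = 128 => K or Q
SGF(I/L)EEDE(I/L)(K/Q)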
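The annotation step, turning mass differences into residue letters, can be sketched as below. It assumes the nominal (integer) peak values from the slide and a hypothetical tolerance of 0.5 Da against the monoisotopic residue masses from the deck's table. It only labels peak pairs; it does not chain them into a sequence or handle the complications listed on the next slide.

```python
from itertools import combinations

# Monoisotopic residue masses (Da), as listed in the deck's amino acid table.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}


def residues_for(delta, tol=0.5):
    """Residues whose mass matches delta within tol (Da); I and L cannot be told apart."""
    return sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol)


def annotate_differences(peaks, tol=0.5):
    """Map each peak pair (low, high) to the residues matching its mass difference."""
    hits = {}
    for low, high in combinations(sorted(peaks), 2):
        matches = residues_for(high - low, tol)
        if matches:
            hits[(low, high)] = matches
    return hits


PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1079]
HITS = annotate_differences(PEAKS)
```

For example, `HITS[(260, 389)]` is `["E"]` and `HITS[(292, 405)]` is `["I", "L"]`, matching the annotated matrix, while `residues_for(1166 - 1020 - 18)` returns `["K", "Q"]` as in the last step of the slide.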
![Page 33: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/33.jpg)
Tandem MS – de novo Sequencing
Challenges in de novo sequencing
Neutral loss (-H2O, -NH3)
Modifications
Background peaks
Incomplete information
![Page 34: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/34.jpg)
Tandem MS – Database Search
Sample: Lysis; Fractionation; Digestion; LC-MS; MS/MS
Search: Sequence DB; Pick Protein; Pick Peptide; All Fragment Masses; Compare, Score, Test Significance; Repeat for all peptides; Repeat for all proteins
![Page 35: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/35.jpg)
Algorithms
![Page 36: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/36.jpg)
Comparing and Optimizing Algorithms
[Figure: for Algorithm 1 and Algorithm 2, score distributions of True and False identifications, each with the corresponding curve of Sensitivity vs. 1-Specificity]
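The sensitivity vs. 1-specificity curves can be computed directly from two lists of scores, one for true and one for false identifications. The sketch below is a generic ROC calculation with hypothetical function names. The area under the curve is included as one possible single-number comparison of two algorithms, which is an assumption beyond what the slide shows.

```python
def roc_points(true_scores, false_scores):
    """(1 - specificity, sensitivity) at every score threshold, highest first."""
    thresholds = sorted(set(true_scores) | set(false_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        # Sensitivity: fraction of true identifications accepted at threshold t.
        sensitivity = sum(s >= t for s in true_scores) / len(true_scores)
        # 1 - specificity: fraction of false identifications accepted at t.
        false_pos_rate = sum(s >= t for s in false_scores) / len(false_scores)
        points.append((false_pos_rate, sensitivity))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An algorithm whose true and false score distributions are well separated gives a curve that rises steeply at low 1-specificity and an area close to 1.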
![Page 37: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/37.jpg)
MS/MS - Parent Mass Error and Enzyme Specificity
[Equation garbled in the transcript; surviving symbols: I, x, n, b, y and two factorials]
Expectation values (MS/MS example):
Δm = 2, Trypsin: 2.5e-5
Δm = 100, Trypsin: 2.5e-5
Δm = 2, non-specific: 7.9e-5
Δm = 100, non-specific: 1.6e-4
![Page 38: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/38.jpg)
Sequest
Cross-correlation
![Page 39: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/39.jpg)
X! Tandem - Search Parameters
http://www.thegpm.org/
![Page 40: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/40.jpg)
X! Tandem - Search Parameters
![Page 41: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/41.jpg)
X! Tandem - Search Parameters
![Page 42: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/42.jpg)
Conventional, single stage searching
spectra + sequences -> Generic search engine -> sequences
Generic search engine: test all cleavages, modifications, & mutations for all sequences
![Page 43: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/43.jpg)
Some hard problems in MS/MS analysis in proteomics
Determining potential modifications
- e.g., oxidation, phosphorylation, deamidation
- calculation order 2^n
- NP complete
Allowing for unanticipated peptide cleavages
- e.g., chymotryptic contamination in trypsin
- calculation order ~ 200 × tryptic cleavage
- “unfortunate” coefficient
Detecting point mutations
- e.g., sequence homology
- calculation order 18^N
- NP complete
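The exponential cost of testing potential modifications is easy to see by enumeration. The sketch below assumes a single modification type (phosphorylation on S, T, or Y, as in the earlier peptide mapping slide) that is either present or absent at each candidate site, so a peptide with n sites has 2^n states. The function names are hypothetical.

```python
from itertools import product


def modification_states(peptide, sites="STY"):
    """Yield every on/off pattern of one modification over the candidate sites."""
    positions = [i for i, aa in enumerate(peptide) if aa in sites]
    for pattern in product((False, True), repeat=len(positions)):
        # Each state is the list of residue positions that carry the modification.
        yield [pos for pos, on in zip(positions, pattern) if on]


def count_states(peptide, sites="STY"):
    """Number of modification states: 2 to the power of the number of sites."""
    return sum(1 for _ in modification_states(peptide, sites))
```

A peptide with six candidate sites already has 64 states, and each additional modification type multiplies the count again.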
![Page 44: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/44.jpg)
Multi-stage searching (X! Tandem)
spectra + sequences -> Tryptic cleavage -> Modifications #1 -> Modifications #2 -> Point mutation -> sequences
![Page 45: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/45.jpg)
Search Results
![Page 46: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/46.jpg)
Search Results
![Page 47: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/47.jpg)
Sequence Annotations
![Page 48: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/48.jpg)
Search Results
![Page 49: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/49.jpg)
Search Results
![Page 50: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/50.jpg)
Identification – Spectrum Library Search
Sample: Lysis; Fractionation; Digestion; LC-MS/MS; MS/MS
Search: Spectrum Library; Pick Spectrum; Compare, Score, Test Significance; Repeat for all spectra
Result: Identified Proteins
![Page 51: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/51.jpg)
Steps in making an Annotated Spectrum Library (ASL):
1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.
2. Add the spectra together and normalize the intensity values.
3. Assign a “quality” value: the median expectation value of the 10 spectra used.
4. Record the 20 most intense peaks in the averaged spectrum, its parent ion z, m/z, sequence, protein accessions & quality.
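The four steps can be sketched as follows. Several details are assumptions, since the slide does not specify them: "best" is taken to mean lowest expectation value, peaks are combined by rounding m/z into 1 Da bins, and intensities are normalized to the most intense bin. The function name and dictionary keys are hypothetical, and the annotation fields (charge, sequence, accessions) are left out.

```python
from collections import defaultdict
from statistics import median


def build_library_entry(spectra, expectations, n_best=10, n_peaks=20, bin_width=1.0):
    """Average replicate spectra of one peptide into a single library entry.

    spectra: list of spectra, each a list of (m/z, intensity) pairs.
    expectations: expectation value of each spectrum's identification.
    """
    # Step 1: keep the n_best spectra with the lowest expectation values.
    ranked = sorted(zip(expectations, spectra), key=lambda pair: pair[0])[:n_best]
    # Step 2: add the spectra together bin by bin.
    summed = defaultdict(float)
    for _, spectrum in ranked:
        for mz, intensity in spectrum:
            summed[round(mz / bin_width)] += intensity
    base = max(summed.values())
    # Step 4 (peaks only): keep the n_peaks most intense bins, normalized to the base peak.
    top = sorted(summed.items(), key=lambda kv: kv[1], reverse=True)[:n_peaks]
    return {
        "peaks": sorted((b * bin_width, i / base) for b, i in top),
        # Step 3: quality is the median expectation value of the spectra used.
        "quality": median(e for e, _ in ranked),
    }
```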
![Page 52: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/52.jpg)
Spectrum Library Characteristics – Peptide Length
[Figure: fraction of library (%), 0 to 10, vs. peptide length, 0 to 50]
![Page 53: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/53.jpg)
Spectrum Library Characteristics – Protein Coverage
[Figure: % coverage, 0 to 50, for residues and peptides, vs. protein Mr (kDa), 10 to 190]
![Page 54: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/54.jpg)
Identification – Spectrum Library Search
Library spectrum (5:25)
Test spectrum (5:25)
Results: 4 peaks selected, 1 peak missed
![Page 55: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/55.jpg)
Identification – Spectrum Library Search
How likely is this?
Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.

| Matches | Probability |
| --- | --- |
| 1 | 0.45 |
| 2 | 0.15 |
| 3 | 0.016 |
| 4 | 0.00039 |
| 5 | 0.0000037 |
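The hypergeometric model can be evaluated with the standard library. The sketch below reads the slide's parameters as 25 possible m/z values, 5 library peaks, and 4 peaks selected by the test spectrum. With that reading it reproduces the slide's values for 1 to 4 matches (0.45, 0.15, 0.016, 0.00039). The slide's value for 5 matches is not reproduced, because 5 matches cannot occur when only 4 peaks are selected. The function name is hypothetical.

```python
from math import comb


def hypergeom_pmf(k, n_bins, n_library, n_test):
    """P(exactly k of the n_test selected m/z values hit one of n_library peaks)."""
    if k < 0 or k > min(n_library, n_test):
        return 0.0
    return (comb(n_library, k) * comb(n_bins - n_library, n_test - k)
            / comb(n_bins, n_test))


# Probabilities for 1 to 4 random matches with the slide's parameters.
PROBABILITIES = [hypergeom_pmf(k, n_bins=25, n_library=5, n_test=4) for k in range(1, 5)]
```

The same function applies to larger problems, such as 1000 possible m/z values with 20 peaks in each spectrum, although the exact figures depend on how the peaks are counted.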
![Page 56: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/56.jpg)
Identification – Spectrum Library Search
If you have 1000 possible m/z values and 20 peaks in test and library spectrum?
[Figure: p (log scale, 1.0E-14 to 1.0E+00) vs. number of matches (1 to 10)]
1 matched: p = 0.6
5 matched: p = 0.0002
10 matched: p = 0.0000000000001
![Page 57: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/57.jpg)
Identification – Spectrum Library Search
Experimental Mass Spectrum (M/Z)
Library of Assigned Mass Spectra
Best search result
![Page 58: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/58.jpg)
X! Hunter
![Page 59: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/59.jpg)
X! Hunter algorithm:
1. Use dot product to find a library spectrum that best matches a test spectrum.
2. Calculate p-value with hypergeometric distribution.
3. Use p-value to calculate expectation value, given the identification parameters.
4. If expectation value is less than the median expectation value of the library spectrum, report the median value.
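Steps 1 and 4 can be sketched as below. This is not the X! Hunter implementation: the spectra are assumed to be already binned into dictionaries of m/z bin to intensity, the dot product is normalized to a cosine similarity (the slide does not say whether it is), and all names are hypothetical. Steps 2 and 3 are omitted because the slide does not give the identification parameters.

```python
from math import sqrt


def dot_product(a, b):
    """Normalized dot product of two binned spectra (dicts: m/z bin -> intensity)."""
    shared = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return shared / norm if norm else 0.0


def best_library_match(test, library):
    """Step 1: (name, score) of the library spectrum with the highest dot product."""
    return max(((name, dot_product(test, spec)) for name, spec in library.items()),
               key=lambda pair: pair[1])


def reported_expectation(expectation, library_quality):
    """Step 4: never report a value smaller than the library spectrum's median value."""
    return max(expectation, library_quality)
```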
![Page 60: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/60.jpg)
X! Hunter Result
Query Spectrum
Library Spectrum
![Page 61: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/61.jpg)
Significance Testing
False protein identification is caused by random matching
An objective criterion for testing the significance of protein identification results is necessary.
The significance of protein identifications can be tested once the distribution of scores for false results is known.
![Page 62: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/62.jpg)
Significance Testing - Expectation Values
The majority of sequences in a collection will give a score due to random matching.
![Page 63: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/63.jpg)
Significance Testing - Expectation Values
Spectrum (M/Z) -> Database Search -> List of Candidates
Distribution of Scores for Random and False Identifications -> Extrapolate and Calculate Expectation Values -> List of Candidates With Expectation Values
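The extrapolation step can be sketched as follows. The sketch assumes that the number of random matches scoring at or above s falls off log-linearly with s, fits a straight line to log10 of that count by least squares, and reads the expectation value off the line at the top candidate's score. The log-linear form and the function name are assumptions; the slide only says that the distribution of random scores is extrapolated.

```python
from math import log10


def expectation_from_scores(random_scores, top_score):
    """Extrapolate the survival curve of random scores to top_score.

    Returns the expected number of random matches scoring at least top_score.
    Needs at least two distinct score values.
    """
    xs = sorted(set(random_scores))
    # Survival counts: how many random matches score at least x.
    ys = [log10(sum(r >= x for r in random_scores)) for x in xs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return 10 ** (intercept + slope * top_score)
```

A candidate whose score lies far beyond the bulk of the random scores gets an expectation value well below 1, meaning that a random match that good is not expected in a search of this size.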
![Page 64: Proteomics Informatics –](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681623c550346895dd26f4a/html5/thumbnails/64.jpg)