Protein Identification via Database searching Attila Kertész-Farkas [email protected] Protein Structure and Bioinformatics Group, ICGEB, Trieste


Post on 25-Feb-2016

Page 1: Protein Identification via Database searching

Protein Identification via Database searching

Attila Kertész-Farkas [email protected]

Protein Structure and Bioinformatics Group, ICGEB, Trieste

Page 2: Protein Identification via Database searching

Mass Spectra analysis

Biological sample

Results report

Page 4: Protein Identification via Database searching

Computational analysis of MS/MS

• Two approaches, plus a hybrid of the two:
  – De novo sequencing
  – Database searching
  – Hybrid

Page 5: Protein Identification via Database searching

De novo sequencing

Page 6: Protein Identification via Database searching

De novo sequencing

• Advantages:
  – Can identify new peptides and proteins
  – Able to discover (new) PTMs
  – Independent of protein databases

• Disadvantages:
  – Requires MS/MS data of good quality
  – No statistics-based validation

Page 7: Protein Identification via Database searching

Database searching-based MS/MS tandem mass spectra identification

• Pipeline

Input data → Peptide assignment → Validation → Protein inference → Quantitation → Interpretation

Page 9: Protein Identification via Database searching

Database searching-based MS/MS tandem mass spectra identification

• Pipeline

Input data → Peptide identification → Validation → Protein inference → Quantitation → Interpretation

  – Input data: data formats
  – Peptide identification: database searching
  – Validation: statistical methods for validation
  – Protein inference: protein assembling

Page 10: Protein Identification via Database searching

Input data

• Mass spectrum:
  – Histogram of the mass-to-charge ratio (m/z) of the observed fragment ions.
  – Spectrum normalization: the intensity is usually scaled to the [0, 100] interval.
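The normalization step can be illustrated with a minimal Python sketch. It assumes a spectrum is stored as a list of (m/z, intensity) pairs and that scaling is relative to the most intense peak; the function name and the example peaks are made up for illustration.

```python
def normalize_intensities(peaks, top=100.0):
    """Scale intensities so that the most intense peak equals `top`."""
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        # Degenerate spectrum: nothing to scale.
        return [(mz, 0.0) for mz, _ in peaks]
    return [(mz, top * intensity / max_intensity) for mz, intensity in peaks]


# Hypothetical raw peaks: (m/z, intensity in arbitrary detector units).
spectrum = [(175.1, 1200.0), (276.2, 300.0), (389.3, 2400.0)]
normalized = normalize_intensities(spectrum)
for mz, intensity in normalized:
    print(f"{mz:8.1f}  {intensity:6.1f}")
```

After scaling, the tallest peak sits at 100 and every other peak is expressed as a percentage of it, which matches the "intensity (%)" axis used in the spectrum plots.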

Page 11: Protein Identification via Database searching

Input data

• The most common formats are mzXML, MGF, and DAT.

Page 12: Protein Identification via Database searching

MGF file format

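As a rough illustration of the MGF layout, the sketch below reads the text form of an MGF file in Python. It assumes the usual structure of `BEGIN IONS` / `END IONS` blocks containing `KEY=value` parameter lines followed by "m/z intensity" peak lines; the function name `parse_mgf` and the example spectrum are made up, and real files may carry extra fields that this sketch ignores.

```python
def parse_mgf(text):
    """Return a list of spectra, each with its parameters and peak list."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Parameter line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity.
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Made-up example content, not taken from a real experiment.
EXAMPLE_MGF = """BEGIN IONS
TITLE=example spectrum (made up)
PEPMASS=524.76
CHARGE=2+
175.119 1200.0
276.155 300.0
END IONS
"""

spectra = parse_mgf(EXAMPLE_MGF)
print(spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
```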

Page 13: Protein Identification via Database searching

.mzXML


Page 14: Protein Identification via Database searching

Peptide assignment

Input data: experimental spectra
[Figure: experimental spectrum; axes: m/z vs. intensity (%)]

Protein sequence DB:
>IPI:IPI00000044.1|SWISS-PROT:P01127
MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

Spectra comparison:
[Figure: theoretical spectrum of a candidate peptide with ions b1 to b10 and y1 to y10; axes: m/z vs. intensity (%)]

Scores: 1. 2
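The theoretical spectrum in the figure is built from the b and y fragment ions of a candidate peptide. A minimal Python sketch of that calculation follows, assuming singly charged ions and standard monoisotopic residue masses; the function name `by_ions` is made up, and the peptide LLHGDPGEEDK (a substring of the P01127 entry with 11 residues, hence ions b1 to b10 and y1 to y10) is chosen only as an example.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # mass of H2O


def by_ions(peptide):
    """Return (b_ions, y_ions) as lists of m/z values for charge 1+."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for mass in masses[:-1]:            # b1 .. b(n-1): N-terminal prefixes
        prefix += mass
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for mass in reversed(masses[1:]):   # y1 .. y(n-1): C-terminal suffixes
        suffix += mass
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = by_ions("LLHGDPGEEDK")
for index, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{index}: {b:9.4f}   y{index}: {y:9.4f}")
```

Real search engines usually also consider higher charge states and neutral losses; this sketch leaves those out.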

Page 15: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

Spectra comparison:
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 1. 2; 2. 1

Page 16: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

Spectra comparison:
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 3. 4; 1. 2; 2. 1

Page 17: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

Spectra comparison:
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 3. 4; 1. 2; 2. 1; 4. 1

Page 18: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

Spectra comparison:
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 3. 4; 1. 2; 2. 1; 4. 1; 5. 1

Page 19: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

Spectra comparison:
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 3. 4; 1. 2; 2. 2; 2. 1; 4. 1; 5. 1

Page 20: Protein Identification via Database searching

Input data: experimental spectra

Protein sequence DB:
>IPI:IPI00000045.1|SWISS-PROT:P18510-1
MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVNLEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTTSFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE

Spectra comparison:
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 3. 4; 14. 3; 1. 2; 2. 2; 7. 2; 2. 1; 4. 1; 9. 1; 12. 1

Page 21: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000045.1|SWISS-PROT:P18510-1

Spectra comparison:
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 15. 32; 3. 4; 14. 3; 1. 2; 2. 2; 7. 2; 2. 1; 4. 1; 9. 1; 12. 1

Page 22: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 15. 32; 3. 4; 14. 3; 1. 2; 2. 2; 7. 2; 2. 1; 4. 1; 9. 1; 12. 1

Score: 32, Peptide: SHLITLLLFLFHSETICR

Page 23: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

Spectra comparison:
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 13. 4; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Page 24: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 13. 4; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR

Page 25: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 11. 3; 6. 3; 9. 3; 3. 3; 1. 3; 4. 2; 7. 2; 13. 2; 1. 1; 10. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR

Page 26: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 13. 15; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK

Page 27: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

Spectra comparison:
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 1. 2

Page 28: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 1. 2

1. Spectra comparison

Page 29: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

Scores: 1. 2

1. Spectra comparison
2. Protein sequence DB

Page 30: Protein Identification via Database searching

1. Spectra comparison

Shared Peak Count (SPC)
The number of peaks in the theoretical spectrum that are matched to peaks in the experimental spectrum.

[Figure: experimental and theoretical spectra; intensity axes 0% to 100% and 0 to 1]

Page 31: Protein Identification via Database searching

1. Spectra comparison

Shared Peak Count (SPC)
The number of peaks in the theoretical spectrum that are matched to peaks in the experimental spectrum.

[Figure: experimental and theoretical spectra; intensity axes 0% to 100% and 0 to 1]

SPC = 7
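The SPC definition can be sketched in a few lines of Python. The sketch assumes that a theoretical peak counts as matched when some experimental peak lies within a fixed m/z tolerance; the tolerance of 0.5 and the peak positions are made-up illustration values, chosen so that 7 of the 10 theoretical peaks match.

```python
def shared_peak_count(theoretical_mz, experimental_mz, tol=0.5):
    """Count theoretical peaks that have an experimental peak within `tol`."""
    return sum(
        1
        for t in theoretical_mz
        if any(abs(t - e) <= tol for e in experimental_mz)
    )


# Hypothetical peak positions (m/z).
theoretical = [100.0, 200.0, 300.0, 400.0, 500.0,
               600.0, 700.0, 800.0, 900.0, 1000.0]
experimental = [100.1, 200.2, 299.8, 450.0, 500.3,
                700.1, 800.0, 999.9, 1050.0]

spc = shared_peak_count(theoretical, experimental)
print("SPC =", spc)
```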

Page 32: Protein Identification via Database searching

1. Spectra comparison

Inner product (I)
The sum of the intensities of the peaks in the experimental spectrum that match peaks in the theoretical spectrum.

[Figure: experimental and theoretical spectra; intensity axes 0% to 100% and 0 to 1]

Page 33: Protein Identification via Database searching

1. Spectra comparison

Inner product (I)
The sum of the intensities of the peaks in the experimental spectrum that match peaks in the theoretical spectrum.

[Figure: experimental and theoretical spectra; intensity axes 0% to 100% and 0 to 1]

I = 3.5
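A minimal Python sketch of this score follows. It assumes experimental intensities scaled to the [0, 1] range and the same fixed-tolerance peak matching as for SPC; the peak list is made up so that the matched intensities add up to 3.5.

```python
def matched_intensity_sum(theoretical_mz, experimental_peaks, tol=0.5):
    """Sum the intensities of experimental peaks that match a theoretical peak."""
    return sum(
        intensity
        for mz, intensity in experimental_peaks
        if any(abs(mz - t) <= tol for t in theoretical_mz)
    )


# Hypothetical data: theoretical m/z values and (m/z, intensity) pairs.
theoretical = [100.0, 200.0, 300.0, 400.0, 500.0,
               600.0, 700.0, 800.0, 900.0, 1000.0]
experimental = [(100.1, 1.0), (200.2, 0.5), (299.8, 0.25), (450.0, 0.9),
                (500.3, 0.75), (700.1, 0.5), (800.0, 0.25), (999.9, 0.25),
                (1050.0, 0.4)]

inner = matched_intensity_sum(theoretical, experimental)
print("I =", inner)
```

Unlike SPC, this score rewards matches to intense peaks more than matches to weak ones; the two unmatched peaks (450.0 and 1050.0) contribute nothing.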

Page 34: Protein Identification via Database searching

1. Spectra comparison

Hyperscore: H = I*Nb!*Ny!
- I is the sum of the intensities of the matched peaks
- Nb (resp. Ny) is the number of matched b (resp. y) peaks in the theoretical spectrum
- ! is the factorial function

[Figure: experimental and theoretical spectra with the theoretical peaks labeled as b or y ions; intensity axes 0% to 100% and 0 to 1]

Page 35: Protein Identification via Database searching

1. Spectra comparison

Hyperscore: H = I*Nb!*Ny!
- I is the sum of the intensities of the matched peaks
- Nb (resp. Ny) is the number of matched b (resp. y) peaks in the theoretical spectrum
- ! is the factorial function

[Figure: experimental and theoretical spectra with the theoretical peaks labeled as b or y ions; intensity axes 0% to 100% and 0 to 1]

H = 3.2*3!*4! = 3.2*6*24 = 460.8
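The worked example can be reproduced with a short Python sketch of the formula exactly as stated on the slide; the function name is made up, and the inputs (I = 3.2, three matched b ions, four matched y ions) are the slide's own numbers.

```python
from math import factorial


def hyperscore(intensity_sum, n_b, n_y):
    """H = I * Nb! * Ny!, following the formula on the slide."""
    return intensity_sum * factorial(n_b) * factorial(n_y)


h = hyperscore(3.2, 3, 4)
print(f"H = {h:.1f}")
```

The factorial terms grow very quickly, so a match that explains many b and y ions is rewarded far more than one that explains only a few intense peaks.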

Page 36: Protein Identification via Database searching

1. Spectra comparison

Xcorr

$\mathrm{Xcorr}(q,t) = I(q,t) - \frac{1}{151}\sum_{i=-75}^{75} I(q,t[i])$

q is the query spectrum
t is the theoretical spectrum
t[i] is t shifted by i

[Figure: query and theoretical spectra; intensity axes 0% to 100% and 0 to 1]

Page 37: Protein Identification via Database searching

1. Spectra comparison

Xcorr

$\mathrm{Xcorr}(q,t) = I(q,t) - \frac{1}{151}\sum_{i=-75}^{75} I(q,t[i])$

q is the query spectrum
t is the theoretical spectrum
t[i] is t shifted by i

[Figure: query and theoretical spectra; intensity axes 0% to 100% and 0 to 1]

I(q,t)=3.2

Page 38: Protein Identification via Database searching

1. Spectra comparison

Xcorr

$\mathrm{Xcorr}(q,t) = I(q,t) - \frac{1}{151}\sum_{i=-75}^{75} I(q,t[i])$

q is the query spectrum
t is the theoretical spectrum
t[i] is t shifted by i

[Figure: query spectrum and shifted theoretical spectrum; intensity axes 0% to 100% and 0 to 1]

I(q,t)=3.2

I(q,t[-75])=

Page 39: Protein Identification via Database searching

1. Spectra comparison

Xcorr

$\mathrm{Xcorr}(q,t) = I(q,t) - \frac{1}{151}\sum_{i=-75}^{75} I(q,t[i])$

q is the query spectrum
t is the theoretical spectrum
t[i] is t shifted by i

[Figure: query spectrum and shifted theoretical spectrum; intensity axes 0% to 100% and 0 to 1]

I(q,t)=3.2

I(q,t[-32])=

Page 40: Protein Identification via Database searching

1. Spectra comparison

Xcorr

$\mathrm{Xcorr}(q,t) = I(q,t) - \frac{1}{151}\sum_{i=-75}^{75} I(q,t[i])$

q is the query spectrum
t is the theoretical spectrum
t[i] is t shifted by i

[Figure: query spectrum and shifted theoretical spectrum; intensity axes 0% to 100% and 0 to 1]

I(q,t)=3.2

I(q,t[0])=

Page 41: Protein Identification via Database searching

1. Spectra comparison

Xcorr

$\mathrm{Xcorr}(q,t) = I(q,t) - \frac{1}{151}\sum_{i=-75}^{75} I(q,t[i])$

q is the query spectrum
t is the theoretical spectrum
t[i] is t shifted by i

[Figure: query spectrum and shifted theoretical spectrum; intensity axes 0% to 100% and 0 to 1]

I(q,t)=3.2

I(q,t[32])=

And so on.
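The shifting procedure walked through above can be sketched in Python as follows. The sketch assumes both spectra have been binned into equal-length intensity vectors, that the inner product I is the dot product of those vectors, and that bins shifted outside the range count as zero; the vectors and function names are made up for illustration.

```python
def inner_product(q, t, shift=0):
    """Dot product of q with t shifted by `shift` bins (out-of-range bins are 0)."""
    total = 0.0
    for j, t_value in enumerate(t):
        k = j + shift
        if 0 <= k < len(q):
            total += q[k] * t_value
    return total


def xcorr(q, t, max_shift=75):
    """Unshifted inner product minus the mean inner product over all shifts."""
    window = range(-max_shift, max_shift + 1)   # 151 shifts for max_shift=75
    background = sum(inner_product(q, t, i) for i in window) / len(window)
    return inner_product(q, t) - background


# Hypothetical binned spectra with two peaks each.
q = [0.0] * 200
q[50], q[120] = 1.0, 0.5
t = [0.0] * 200
t[50], t[120] = 1.0, 1.0

score = xcorr(q, t)
print(f"I(q,t) = {inner_product(q, t):.2f}, Xcorr = {score:.4f}")
```

Subtracting the average over the shifted copies penalizes spectra that would score well against almost any alignment, for example very dense or noisy spectra. It also explains the cost: each candidate needs 151 inner products unless the computation is reorganized.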

Page 42: Protein Identification via Database searching

2. Protein sequence DB

Protein sequence databases
– Completeness:
  • Complete
  • Longer searching time
– Redundancy:
  • Sequence variations can be found
  • A redundant database can distort the statistics
– Quality of sequence annotation

Page 43: Protein Identification via Database searching

2. Protein sequence DB

• Entrez Protein DB
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein
  – Most complete, redundant
• Reference Sequence (RefSeq) and UniProt (Swiss-Prot and TrEMBL)
  – http://www.ncbi.nlm.nih.gov/RefSeq/
  – http://www.uniprot.org/
  – Well annotated, non-redundant
• International Protein Index (IPI)
  – http://www.ebi.ac.uk/IPI/IPIhelp.html
  – Represents a good balance between redundancy and completeness.
  – Contains cross-references to Ensembl, UniProt, and RefSeq.
• Sequences from a single genome
  – It is difficult to obtain good statistics on small datasets.

Page 44: Protein Identification via Database searching

2. Protein sequence DB

• Taxonomy
  – Allows searches to be limited to entries from particular species or groups of species.
  – Speeds up a search, and ensures that the hit list will only contain entries from the selected species.
  – For non-redundant databases, a single entry may represent identical sequences from multiple species. The accession string and title text from the FASTA entry, listed on the master results page, will usually describe just one of these entries. To see the equivalent entries, and to explore their taxonomy, follow the accession number link in the results list to the Protein View. If the hit is from a non-redundant database and represents multiple entries with identical sequences, the Protein View will include links to NCBI Entrez and the NCBI Taxonomy Browser for all equivalent entries.

Page 45: Protein Identification via Database searching

Run time
• A database search has to enumerate all peptides and compare them to all experimental spectra.
• This can be slow with large protein sequence databases, especially when a slow scoring function such as Xcorr is applied.
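To give a feel for what "enumerate all peptides" means, here is a minimal Python sketch of in-silico digestion. It rests on assumptions that the slides do not state: cleavage after every K or R (a simplified trypsin-like rule with no proline exception) and at most one missed cleavage. The protein fragment is the start of the P18510-1 entry shown earlier; the function name is made up.

```python
def digest(protein, missed_cleavages=1):
    """Enumerate peptides obtained by cleaving after K or R (simplified rule)."""
    # Split the sequence into fully cleaved pieces.
    pieces = []
    start = 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])

    # Join up to `missed_cleavages` extra neighboring pieces.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


# First 26 residues of IPI00000045.1 / P18510-1.
fragment = "MEICRGLRSHLITLLLFLFHSETICR"
candidates = digest(fragment)
for peptide in candidates:
    print(peptide)
```

Every candidate produced this way has to be scored against every spectrum, so the number of comparisons grows with the product of database size and spectrum count, which is why the speedup techniques that follow matter.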

Page 46: Protein Identification via Database searching

Speedup techniques
• Fast database indexing
  – Fast implementation of sequence indexing in the database
• Parent mass check
  – PTMs can be lost
• Sequest’s preliminary score
• Tag-based filtering (de novo hybrid)
  – Increases the specificity (or sensitivity)

Page 47: Protein Identification via Database searching

• Advanced database indexing
  – Better implementation of the sequence indexing
  – Better representation of protein sequences

Page 48: Protein Identification via Database searching

Parent mass check

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

Spectra comparison
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

$|PM(q) - PM(t)|$

Scores: 1. 2

Page 49: Protein Identification via Database searching

Parent mass check

Input data: experimental spectra
Protein sequence DB: entry IPI:IPI00000044.1|SWISS-PROT:P01127

Spectra comparison
[Figure: experimental spectrum compared with the theoretical spectrum (ions b1 to b10, y1 to y10) of a candidate peptide; axes: m/z vs. intensity (%)]

$|PM(q) - PM(t)|$

Scores:
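A parent mass check can be sketched as a simple pre-filter in Python. The sketch assumes that a candidate is scored only when the absolute difference between the query's parent mass PM(q) and the candidate's parent mass PM(t) stays within a tolerance; the tolerance, masses, and peptide names are made-up illustration values.

```python
def passes_parent_mass_check(query_mass, candidate_mass, tolerance=0.5):
    """True when |PM(q) - PM(t)| is within the tolerance."""
    return abs(query_mass - candidate_mass) <= tolerance


def filter_candidates(query_mass, candidates, tolerance=0.5):
    """Keep only candidates whose parent mass is close to the query's."""
    return [
        peptide
        for peptide, mass in candidates
        if passes_parent_mass_check(query_mass, mass, tolerance)
    ]


# Hypothetical candidates: (label, parent mass in Da).
candidates = [("candidate_A", 1000.2), ("candidate_B", 1050.0),
              ("candidate_C", 999.6)]
kept = filter_candidates(1000.0, candidates)
print(kept)
```

The filter is cheap because it needs one subtraction per candidate instead of a full spectrum comparison. Its drawback is the one noted under the speedup techniques: a peptide carrying an unanticipated PTM has a shifted parent mass and is discarded before it is ever scored.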

Page 50: Protein Identification via Database searching

Fast prescoring (used in SEQUEST)
The so-called Sp score:

$Sp(q,t) = \frac{I(q,t) \cdot SPC(q,t) \cdot (1 + 0.0075 \cdot R(q,t))}{SPC(t,t)}$

R(q,t) is the maximum number of consecutive matched b-y ions.

[Figure: experimental and theoretical spectra; intensity axes 0% to 100% and 0 to 1]

Sp=3.2*7*(1+0.0075*4)/10=2.3072

SEQUEST selects the top 500 scoring peptides, scored by Sp, and rescores them using Xcorr.
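The Sp example and the two-stage selection can be sketched in Python. The formula and the numbers 3.2, 7, 4, and 10 come from the slide; the function names, the candidate list, and the tiny `keep` value in the usage example are made up for illustration.

```python
def sp_score(intensity_sum, spc, longest_run, n_theoretical):
    """Sp = I * SPC * (1 + 0.0075 * R) / SPC(t, t)."""
    return intensity_sum * spc * (1 + 0.0075 * longest_run) / n_theoretical


def preselect(scored_candidates, keep=500):
    """Return the `keep` best (peptide, Sp) pairs, highest Sp first."""
    ranked = sorted(scored_candidates, key=lambda item: item[1], reverse=True)
    return ranked[:keep]


sp = sp_score(3.2, 7, 4, 10)
print(f"Sp = {sp:.4f}")

# Hypothetical candidates with precomputed Sp values.
scored = [("PEPTIDE_A", 0.4), ("PEPTIDE_B", 2.3), ("PEPTIDE_C", 1.1)]
shortlist = preselect(scored, keep=2)
print(shortlist)
```

Only the shortlist would then be rescored with the slower Xcorr, so the expensive score is computed for at most `keep` candidates per spectrum.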

Page 51: Protein Identification via Database searching

Sequence tag-based filtering
• Extract short amino acid tags from the experimental spectra.
• Use a spectrum graph, in which the nodes are the peaks and two peaks are linked by an edge when their masses differ by the mass of an amino acid.
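The spectrum graph idea can be sketched in Python as follows. The sketch assumes peaks are given as masses, that an edge is added when the difference between two peaks equals a residue mass within a tolerance, and that a tag is any path of three edges; the peak list is synthetic (prefix masses of the residues A, V, G), and isoleucine is left out because it has the same mass as leucine.

```python
# Monoisotopic residue masses in Da (I omitted: same mass as L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def build_edges(peaks, tol=0.01):
    """Link peak i to peak j when their mass difference matches a residue."""
    edges = {i: [] for i in range(len(peaks))}
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            diff = peaks[j] - peaks[i]
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges[i].append((j, aa))
    return edges


def extract_tags(peak_masses, length=3, tol=0.01):
    """Return (tag, prefix mass) pairs for every path of `length` edges."""
    peaks = sorted(peak_masses)
    edges = build_edges(peaks, tol)
    tags = []

    def walk(node, letters, start):
        if len(letters) == length:
            tags.append(("".join(letters), peaks[start]))
            return
        for nxt, aa in edges[node]:
            walk(nxt, letters + [aa], start)

    for i in range(len(peaks)):
        walk(i, [], i)
    return tags


# Synthetic peaks: 0, A, A+V, A+V+G.
peaks = [0.0]
for aa in "AVG":
    peaks.append(peaks[-1] + RESIDUE_MASS[aa])

tags = extract_tags(peaks)
print(tags)
```

With real spectra, noise peaks and near-equal mass combinations create many spurious paths, so practical tools score and rank the tags rather than accepting all of them.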

Page 52: Protein Identification via Database searching

[Figure: spectrum graph whose edges are labeled with amino acids: W, R, A, C, V, G, E, K, D, W, Q, P, T, L, T]

Page 53: Protein Identification via Database searching

[Figure: spectrum graph whose edges are labeled with amino acids: W, R, A, C, V, G, E, K, D, W, L, P, T, L, T]

TAG   Prefix Mass
AVG   0.0
WTD   120.2
PET   211.4

• Short peptide sequence tags are generated from the spectrum, and these tags are used to filter the protein sequence database.
• Tags make the database search much faster, analogous to the way that BLAST’s filter speeds up sequence search.

Page 54: Protein Identification via Database searching

Tag-based filtering

[Figure: the following peptide list is shown twice, illustrating filtering by tags]

MDHPEDESHSEK
QDDEEALARLEEIK
SIEAKLTLR
QNNLNPERPDSAYLR
LKQINEEQREGLR
FVSEAVTAICEAK
SSDIQAAVQICSLLHQR
EFSASLTQGLLK
SAEDLEADK
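The filtering step itself can be sketched in Python as a substring test. The peptide list is the one on the slide; the tag "EDL" is a made-up example, and the sketch ignores prefix masses and the I/L ambiguity that a real implementation would need to handle.

```python
def filter_by_tags(peptides, tags):
    """Keep peptides that contain at least one of the sequence tags."""
    return [
        peptide
        for peptide in peptides
        if any(tag in peptide for tag in tags)
    ]


peptides = [
    "MDHPEDESHSEK", "QDDEEALARLEEIK", "SIEAKLTLR",
    "QNNLNPERPDSAYLR", "LKQINEEQREGLR", "FVSEAVTAICEAK",
    "SSDIQAAVQICSLLHQR", "EFSASLTQGLLK", "SAEDLEADK",
]

kept = filter_by_tags(peptides, ["EDL"])  # hypothetical tag
print(kept)
```

Only the peptides that survive the filter need a full spectrum comparison, which is where the speedup comes from.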

Page 55: Protein Identification via Database searching

Summary
• Experimental spectra are compared to a protein sequence database.
• Scoring function
• Protein database
• Speedup techniques

Page 56: Protein Identification via Database searching

Validation


Page 57: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

Page 58: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 15. 32; 3. 4; 14. 3; 1. 2; 2. 2; 7. 2; 2. 1; 4. 1; 9. 1; 12. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

Page 59: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 13. 4; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

Page 60: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 11. 3; 6. 3; 9. 3; 3. 3; 1. 3; 4. 2; 7. 2; 13. 2; 1. 1; 10. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

Page 61: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 13. 15; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

Page 62: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

How can peptide assignments be approved or rejected automatically?

Why is it necessary?

Page 63: Protein Identification via Database searching

Scores: 13. 15; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Why is it necessary to do it automatically?
• Human judgment is biased and can be unreliable.
• Millions of spectra are produced per day.
• Judging an assignment by looking at the spectrum visually is very difficult.

Page 64: Protein Identification via Database searching

Scores: 13. 15; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Two computational approaches:
• Relative score
• Probability-based scoring

Page 65: Protein Identification via Database searching

Scores: 13. 15; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Relative score:
SEQUEST: delta score

$Cn = \frac{s_1 - s_2}{s_1}$

where $s_1$ is the best score and $s_2$ is the second-best score for the spectrum.

Page 66: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 15. 32; 3. 4; 14. 3; 1. 2; 2. 2; 7. 2; 2. 1; 4. 1; 9. 1; 12. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

Cn=(32-4)/32=0.875

Page 67: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 13. 4; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

Cn=(32-4)/32=0.875
Cn=(4-4)/4=0

Page 68: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 11. 3; 6. 3; 9. 3; 3. 3; 1. 3; 4. 2; 7. 2; 13. 2; 1. 1; 10. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

Cn=(32-4)/32=0.875
Cn=(4-4)/4=0
Cn=(3-3)/3=0

Page 69: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 13. 15; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

Cn=(32-4)/32=0.875
Cn=(4-4)/4=0
Cn=(3-3)/3=0
Cn=(15-4)/15=0.733

Page 70: Protein Identification via Database searching

Input data: experimental spectra
Protein sequence DB

Scores: 13. 15; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Score: 4, Peptide: AELDLNMTR
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 3, Peptide: MEICRGLR
Score: 15, Peptide: LLHGDPGEEDK
Score: 4, Peptide: MDHPEDESHSEK
Score: 5, Peptide: SAEDLEADK
Score: 3, Peptide: SIEAKLTLR

Cn=(32-4)/32=0.875
Cn=(4-4)/4=0
Cn=(3-3)/3=0
Cn=(15-4)/15=0.733

Keep the peptide assignments whose Cn exceeds a chosen threshold.
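The four Cn values can be reproduced with a short Python sketch. The score lists below contain the top scores read from the four example spectra; the function name and the threshold of 0.1 are made-up illustration choices, since the slides do not give a concrete limit.

```python
def delta_cn(scores):
    """Cn = (s1 - s2) / s1 for the best (s1) and second-best (s2) scores."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return 0.0
    return (ranked[0] - ranked[1]) / ranked[0]


# Top candidate scores for the four example spectra.
spectra_scores = {
    "SHLITLLLFLFHSETICR": [32, 4, 3, 2, 2, 2, 1, 1, 1, 1],
    "AELDLNMTR": [4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1],
    "MEICRGLR": [3, 3, 3, 3, 3, 2, 2, 2, 1, 1],
    "LLHGDPGEEDK": [15, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1],
}

THRESHOLD = 0.1  # hypothetical cutoff

accepted = []
for peptide, scores in spectra_scores.items():
    cn = delta_cn(scores)
    print(f"{peptide:20s} Cn = {cn:.3f}")
    if cn > THRESHOLD:
        accepted.append(peptide)
print("accepted:", accepted)
```

A Cn of 0 means the best and second-best candidates are indistinguishable, so the assignment is rejected regardless of the absolute score; a large Cn means the top candidate stands out clearly from the rest.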

Page 76: Protein Identification via Database searching

Probability-based peptide assignment validation:

Compute the statistical significance of the score. The statistical significance of a score s is the probability of observing a random score x that is higher than or equal to s, formally P(s <= x). This probability is called the p-value.

Three approaches:
1. Using analytical functions.
2. Fitting a distribution to a sample of random scores.
3. Non-parametric approach.

Compute the probability that the peptide assignment with the corresponding score is correct.

Page 77: Protein Identification via Database searching

Probability-based peptide assignment validation:

Very loosely speaking, the probability-based approach measures how far a score is from what random matching would produce.

Page 78: Protein Identification via Database searching

Probability-based peptide assignment validation:

A random score is the score obtained by comparing a randomly selected experimental spectrum with a randomly selected theoretical spectrum. Random scores follow a probability distribution that depends on the scoring function. This distribution serves as the null hypothesis.

Page 79: Protein Identification via Database searching

Probability-based peptide assignment validation:

[Figure: probability distribution of random scores; x-axis: score, with threshold T and hit h marked; y-axis: frequency; the p-value of hit h is indicated.]

The distribution depends on the scoring function.

Random matches are caused by matches with noise.

Page 80: Protein Identification via Database searching

Probability-based peptide assignment validation:

1. Analytical function. The form of the distribution depends on the scoring function, and its parameters are calculated from the spectra being compared.

   1. In the case of the SPC scoring function, the distribution of the random scores can be modeled with a hypergeometric distribution.
   2. In the case of the inner product scoring function, the random scores can be modeled with a normal distribution.

[Figure: probability distribution of random scores; x-axis: score, with threshold T and hit h marked; y-axis: frequency; the p-value of hit h is indicated.]
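As a rough illustration of the analytical approach for the SPC score, the sketch below computes a hypergeometric tail probability. It assumes that SPC counts shared peaks and that the m/z range is discretised into bins, with the experimental peaks falling into bins at random. The function name and all parameters are hypothetical and are not taken from the slides.

```python
from math import comb


def spc_p_value(shared, n_theoretical, n_experimental, n_bins):
    """P(X >= shared) when n_experimental peaks are drawn at random from
    n_bins bins, n_theoretical of which hold a theoretical peak."""
    total = comb(n_bins, n_experimental)
    upper = min(n_theoretical, n_experimental)
    tail = sum(
        comb(n_theoretical, k) * comb(n_bins - n_theoretical, n_experimental - k)
        for k in range(shared, upper + 1)
    )
    return tail / total


# Toy numbers: 3 theoretical peaks, 3 experimental peaks, 10 bins, all 3 shared.
print(spc_p_value(3, 3, 3, 10))
```

The parameters come from the two spectra being compared (numbers of peaks) and from the chosen bin width, which matches the remark that the parameters are calculated from the spectra.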

Page 81: Protein Identification via Database searching

Probability-based approach:

Build a histogram of the scores obtained during the comparison. Fit a known distribution function to the histogram, and use the fitted function to calculate the p-value of the top score.

[Figure: histogram of scores; x-axis: Match (0 to 25); y-axis: Frequency (0 to 0.3).]
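A minimal sketch of the fitting approach follows. It assumes a normal distribution purely for illustration, since the slide does not say which distribution is fitted. The sample of random scores and the function name are hypothetical.

```python
from math import erfc, sqrt
from statistics import mean, stdev


def fitted_normal_p_value(score, random_scores):
    """Fit a normal distribution by its sample mean and standard deviation,
    then return the upper-tail probability P(X >= score)."""
    mu = mean(random_scores)
    sigma = stdev(random_scores)
    z = (score - mu) / sigma
    return 0.5 * erfc(z / sqrt(2))  # erfc keeps precision far out in the tail


# Hypothetical sample of random scores and a top score of 15.
sample = [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]
print(fitted_normal_p_value(15, sample))
```

With such a small sample the fit is crude. In practice the distribution family should be chosen to match the scoring function.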

Page 82: Protein Identification via Database searching

Probability-based approach:

Decoy approach. Make a dummy dataset that is big enough to obtain solid statistics. A decoy dataset can be made by:
1. random shuffling,
2. Markov-chain-generated amino acid sequences,
3. more typically, simply reversing the sequences of the proteins in the database. This is sometimes called a reverse database.

No correct matches are expected from the decoy dataset, so the scores obtained on the decoy dataset give a good estimate of the random score distribution.
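Two of the decoy constructions listed above, reversing and random shuffling, can be sketched as follows. The function names, the `Decoy_` prefix and the dictionary layout of the database are assumptions made for this example.

```python
import random


def reverse_decoy(sequence):
    """Reverse database entry: the protein sequence read backwards."""
    return sequence[::-1]


def shuffled_decoy(sequence, seed=0):
    """Randomly shuffled sequence with the same amino acid composition."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def build_decoy_db(target_db, prefix="Decoy_"):
    """Map every target entry to a reversed decoy entry."""
    return {prefix + name: reverse_decoy(seq) for name, seq in target_db.items()}


print(reverse_decoy("MEICRGLR"))
print(build_decoy_db({"protein_1": "MEICRGLR"}))
```

Both constructions keep the length and amino acid composition of each protein, so the decoy database is the same size as the target database.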

Page 83: Protein Identification via Database searching

Spectra comparison:

[Figure: two spectra, each plotted as intensity (%) against m/z (100 to 1100), with fragment ions b1 to b10 and y1 to y10 labelled.]

Input data: Experimental Spectra

Scores: 13. 15; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Protein sequence DB

>IPI:IPI00000045.1|SWISS-PROT:P18510-1
MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVNLEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTTSFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE

Decoy Protein sequence DB

Page 84: Protein Identification via Database searching

Spectra comparison:

[Figure: two spectra, each plotted as intensity (%) against m/z (100 to 1100), with fragment ions b1 to b10 and y1 to y10 labelled.]

Input data: Experimental Spectra

Protein sequence DB

Scores: 13. 15; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Decoy Protein sequence DB

>Decoy_protein_sequence_1
EDEQFYFKTVMVGEDPMNTRLSVPQDAEMATCLFWGPCAASEFSTTPGSDSRIFAFRKDQKRNESLDTINVAELQLRTEDGSKVCSLCMKGGHIGLFLAHPEIPVVDIKEELNVNPGQLYGAVLQNNRLYFTKQNVDWIRFAQMKSSKRGSPRCITESHFLFLLLTILHSRLGRCIEM

Decoy Scores: 5. 4; 3. 4; 4. 4; 10. 3; 8. 3; 7. 3; 2. 2; 6. 2; 1. 2; 12. 1; 9. 1; 11. 1

Page 85: Protein Identification via Database searching

[Figure: histogram of scores; x-axis: Match (0 to 25); y-axis: Frequency (0 to 0.3).]

The decoy approach can provide a more accurate model of the random score distribution, but it doubles the execution time.

Frequently applied approach!

Page 86: Protein Identification via Database searching

Scores: 13. 15; 6. 4; 1. 4; 9. 3; 4. 3; 3. 2; 7. 2; 11. 2; 8. 1; 10. 1; 2. 1; 5. 1; 12. 1

Decoy Scores: 5. 4; 3. 4; 4. 4; 10. 3; 8. 3; 7. 3; 2. 2; 6. 2; 1. 2; 12. 1; 9. 1; 11. 1

Non-parametric approach. Instead of fitting a probability density function to the histogram, calculate the fraction of the scores on the decoy dataset that are equal to or higher than the actual top score:

$$\frac{\#\{\text{decoy score} \ge 15\}}{\#\{\text{all scores}\}} = 0.0$$
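The non-parametric estimate amounts to counting. The sketch below uses decoy scores as read off the slide (an interpretation of the flattened list) and a top score of 15. The function name is hypothetical.

```python
def empirical_p_value(score, decoy_scores):
    """Fraction of decoy scores that are equal to or higher than the given score."""
    hits = sum(1 for d in decoy_scores if d >= score)
    return hits / len(decoy_scores)


decoy_scores = [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]
print(empirical_p_value(15, decoy_scores))
print(empirical_p_value(4, decoy_scores))
```

With only a handful of decoy scores the smallest non-zero p-value that can be resolved is 1 divided by the number of decoy scores, which is one reason the decoy dataset has to be big enough.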

Page 87: Protein Identification via Database searching

[Figure: probability distributions of random scores and of correct scores; x-axis: score, with threshold T and hit h marked; y-axis: frequency; areas A and B and the p-value of hit h are indicated.]

False Positive Rate (FPR): the probability of labelling a random score significant (area B in the figure). An FPR of 0.01 means that 1% of the random scores are labelled significant.

E-value: the E-value of a query hit with score h is the expected number of database elements with a random score greater than or equal to h when searching a database of n entries. For instance, an E-value of 10^-2 means that the score h is expected to occur by chance only once in 100 independent similarity searches over the database. If the E-value is 10, then ten random hits with a score greater than or equal to h are expected within a single similarity search.
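Both quantities can be illustrated with a short sketch. It assumes the common convention that the E-value is the p-value multiplied by the database size n, which the slide does not state explicitly. The function names and the toy numbers are hypothetical.

```python
def false_positive_rate(threshold, random_scores):
    """Fraction of random scores that would be labelled significant."""
    return sum(1 for s in random_scores if s >= threshold) / len(random_scores)


def e_value(p_value, database_size):
    """Expected number of random hits at least as good as the query hit,
    assuming E = n * p."""
    return p_value * database_size


random_scores = [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]
print(false_positive_rate(4, random_scores))
print(e_value(1e-5, 1000))
```

An E-value of about 0.01, as in the second call, corresponds to one chance hit of that quality per 100 searches of the database.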

Page 88: Protein Identification via Database searching

[Figure: probability distributions of random scores and of correct scores; x-axis: score, with threshold T and hit h marked; y-axis: frequency; areas A and B and the p-value of hit h are indicated.]

False Discovery Rate (FDR): the proportion of random scores among the scores labelled significant, formally FDR = B/(A+B), where B is the area of random scores above the threshold T and A is the area of correct scores above T. An FDR of 0.01 means that 1% of the scores labelled significant are actually observed by chance. FDR is often used to control the proportion of false positives. The threshold T can be set to keep the FDR under a certain level; typical levels are 0.01 or 0.05, i.e. experimenters set thresholds to allow 1% or 5% false positives. The lower the FDR, the more true (non-random) similarity hits are lost. The decoy dataset is used to calculate the FDR.
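One common way to use the decoy dataset here is to estimate the FDR at a threshold T as the number of decoy scores at or above T divided by the number of target scores at or above T. The sketch below follows that convention, which is an assumption and not something the slide spells out. The function names are hypothetical, and the score lists are an interpretation of the lists on the slides.

```python
def estimate_fdr(threshold, target_scores, decoy_scores):
    """Decoy-based FDR estimate at a threshold, capped at 1."""
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0  # nothing is labelled significant
    return min(1.0, n_decoy / n_target)


def threshold_for_fdr(max_fdr, target_scores, decoy_scores):
    """Lowest observed target score whose estimated FDR is within max_fdr."""
    for t in sorted(set(target_scores)):
        if estimate_fdr(t, target_scores, decoy_scores) <= max_fdr:
            return t
    return None


target_scores = [15, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1]
decoy_scores = [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]
print(threshold_for_fdr(0.05, target_scores, decoy_scores))
```

On this toy data only the top hit survives a 5% FDR, which illustrates the trade-off noted above: a stricter FDR discards more hits.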

Page 89: Protein Identification via Database searching

Summary:
1. Peptide assignments have to be validated.
2. Relative scoring or probability-based scoring can be applied.
3. False positives (false assignments) can be kept under a certain level.

Page 90: Protein Identification via Database searching

Protein Inference

Page 91: Protein Identification via Database searching

Take the peptides that passed the validation.

This step infers the proteins that could have produced these peptides. The task is not trivial.

Input data: Experimental Spectra

Score: 32 Peptide: SHLITLLLFLFHSETICR

Score: 15 Peptide: LLHGDPGEEDK
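The first step of protein inference, finding which database proteins could have produced each validated peptide, can be sketched as a substring search. The database below holds only the beginning of the P18510-1 sequence from the example database entry shown earlier in the deck. The entry name and the function name are hypothetical.

```python
def proteins_containing(peptides, protein_db):
    """Map each peptide to the names of the proteins whose sequence contains it."""
    return {
        pep: [name for name, seq in protein_db.items() if pep in seq]
        for pep in peptides
    }


# Only a prefix of the sequence is used here, to keep the example short.
protein_db = {"P18510-1_prefix": "MEICRGLRSHLITLLLFLFHSETICRPSGRK"}
validated = ["SHLITLLLFLFHSETICR", "LLHGDPGEEDK"]
print(proteins_containing(validated, protein_db))
```

A peptide that maps to several proteins is what makes the task non-trivial, because the peptide alone does not say which of them was in the sample.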

Page 92: Protein Identification via Database searching

Peptides:

MDHPEDESHSEK

QDDEEALARLEEIK

SIETLR

QNNLNPERPDSAYLR

LKQINEEQREGLR

FVSEAVTAICEAK

SSDIQAAVQICSLLHQR

EFSASLTQGLLK

SAEDLEADK

Proteins:

Page 95: Protein Identification via Database searching

By Occam’s razor, Protein A should be preferred. Proteins A, B and C can be homologous proteins.
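One simple way to apply Occam's razor is a greedy parsimony rule: repeatedly pick the protein that explains the most peptides that are still unexplained. The sketch below is not the method of any particular tool. The peptide-to-protein assignment for A, B and C is made up for illustration, since the original figure is not available in this transcript.

```python
def greedy_parsimony(peptide_sets):
    """Return a small list of proteins that together explain every peptide."""
    uncovered = set().union(*peptide_sets.values())
    chosen = []
    while uncovered:
        # Sorting the names first makes ties resolve alphabetically.
        best = max(sorted(peptide_sets),
                   key=lambda name: len(peptide_sets[name] & uncovered))
        chosen.append(best)
        uncovered -= peptide_sets[best]
    return chosen


# Hypothetical mapping: A explains all three peptides, B and C only some of them.
peptide_sets = {
    "A": {"MDHPEDESHSEK", "QDDEEALARLEEIK", "SIETLR"},
    "B": {"MDHPEDESHSEK", "QDDEEALARLEEIK"},
    "C": {"SIETLR"},
}
print(greedy_parsimony(peptide_sets))
```

Greedy selection does not always find the smallest possible protein set, but it captures the idea that a single protein explaining everything is preferred over several partial explanations.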

Page 96: Protein Identification via Database searching

Many models have been developed to cope with this problem: statistical, graph-theory-based and spectral-network-based. A well-known method is ProteinProphet.

Page 97: Protein Identification via Database searching

Summary

Input data: Data formats
Peptide identification: Database searching
Validation: Statistical methods for validation
Protein inference: Protein assembling
Quantitation
Interpretation

Page 98: Protein Identification via Database searching

Database Searching

• Advantages:
– Simple and straightforward
– Has a limited search space.
– Completeness
– Statistical analysis can be carried out.

• Disadvantages:
– Has a limited search space. Limited to the database.
– Enumerating all candidates is too slow, particularly when modifications and non-tryptic peptides must be considered. (A modern instrument produces million spectra per day)