testing statistical significance scores of sequence comparison methods with structure similarity tim...

21
Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Upload: simon-dalton

Post on 12-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Testing statistical significance scores of sequence comparison methods with structure similarity

Tim Hulsen

NCMLS PhD Two-Day Conference

2006-04-27

Page 2: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Introduction

• Sequence comparison:

Important for finding similar proteins (homologs) for a protein with unknown function

• Algorithms: BLAST, FASTA, Smith-Waterman

• Statistical scores: E-value (standard), Z-value

Page 3: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

E-value or Z-value?

• Smith-Waterman sequence comparison with Z-value statistics:100 randomized shuffles to test significance of SW score

O. MFTGQEYHSV

shuffle

1. GQHMSVFTEY2. YMSHQFTVGEetc.

# seqs

SW score

rnd ori: 5*SD Z = 5

Page 4: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

E-value or Z-value?

• Z-value calculation takes much time (2x100 randomizations)

• Comet et al. (1999) and Bastien et al. (2004): Z-value is theoretically more sensitive and more selective than E-value

• BUT Advantage of Z-value has never been proven by experimental results

Page 5: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

How to compare?

• Structural comparison is better than sequence comparison

• ASTRAL SCOP: Structural Classification Of Proteins

• e.g. a.2.1.3, c.1.2.4; same number ~ same structure

• Use structural classification as benchmark for sequence comparison methods

Page 6: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

ASTRAL SCOP statisticsmax. % identity members families avg. fam. size max. fam. size families =1 families >1

10% 3631 2250 1.614 25 1655 595

20% 3968 2297 1.727 29 1605 692

25% 4357 2313 1.884 32 1530 783

30% 4821 2320 2.078 39 1435 885

35% 5301 2322 2.283 46 1333 989

40% 5674 2322 2.444 47 1269 1053

50% 6442 2324 2.772 50 1178 1146

70% 7551 2325 3.248 127 1087 1238

90% 8759 2326 3.766 405 1023 1303

95% 9498 2326 4.083 479 977 1349

Page 7: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Methods (1)

• Smith-Waterman algorithms: dynamic programming; computationally

intensive– Paracel with e-value (PA E):

• SW implementation of Paracel

– Biofacet with z-value (BF Z):• SW implementation of Gene-IT

– ParAlign with e-value (PA E):• SW implementation of Sencel

– SSEARCH with e-value (SS E):• SW implementation of FASTA (see next page)

Page 8: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Methods (2)

• Heuristic algorithms:– FASTA (FA E)

• Pearson & Lipman, 1988• Heuristic approximation; performs better than

BLAST with strongly diverged proteins– BLAST (BL E):

• Altschul et al., 1990• Heuristic approximation; stretches local alignments

(HSPs) to global alignment• Should be faster than FASTA

Page 9: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Method parameters

- all:- matrix: BLOSUM62- gap open penalty: 12- gap extension penalty: 1

- Biofacet with z-value: 100 randomizations

Page 10: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Receiver Operating Characteristic

• R.O.C.: statistical value, mostly used in clinical medicine

• Proposed by Gribskov & Robinson (1996) to be used for sequence comparison analysis

Page 11: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

ROC50 Examplequery d1c75a_ a.3.1.1  

hit # pc e    

1 d1gcya1 b.71.1.1 0.31

2 d1h32b_ a.3.1.1 0.4

3 d1gks__ a.3.1.1 0.52

4 d1a56__ a.3.1.1 0.52

5 d1kx2a_ a.3.1.1 0.67

6 d1etpa1 a.3.1.4 0.67

7 d1zpda3 c.36.1.9 0.87

8 d1eu1a2 c.81.1.1 0.87

9 d451c__ a.3.1.1 1.1

10 d1flca2 c.23.10.2 1.1

11 d1mdwa_ d.3.1.3 1.1

12 d2dvh__ a.3.1.1 1.5

13 d1shsa_ b.15.1.1 1.5

14 d1mg2d_ a.3.1.1 1.5

15 d1c53__ a.3.1.1 2.4

16 d3c2c__ a.3.1.1 2.4

17 d1bvsa1 a.5.1.1 6.8

18 d1dvva_ a.3.1.1 6.8

19 d1cyi__ a.3.1.1 6.8

20 d1dw0a_ a.3.1.1 6.8

21 d1h0ba_ b.29.1.11 6.8

22 d3pfk__ c.89.1.1 6.8

23 d1kful3 d.3.1.3 6.8

24 d1ixrc1 a.4.5.11 14

25 d1ixsb1 a.4.5.11 14

- Take 100 best hits- True positives: in same SCOP family, or false positives: not in same family- For each of first 50 false positives: calculate number of true positives higher in list (0,4,4,4,5,5,6,9,12,12,12,12,12)- Divide sum of these numbers by number of false positives (50) and by total number of possible true positives (size of family -1) = ROC50

(0,167)- Take average of ROC50 scores for all entries

Page 12: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

ROC50 results

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

0.50

pdb010 pdb020 pdb025 pdb030 pdb035 pdb040 pdb050 pdb070 pdb090 pdb095

ASTRAL SCOP set

mea

n R

OC

50

pc ebf zbl efa ess epa e

Page 13: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Coverage vs. Error

• C.V.E. = Coverage vs. Error (Brenner et al., 1998)

• E.P.Q. = selectivity indicator (how much false positives?)

• Coverage = sensitivity indicator (how much true positives of total?)

Page 14: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

CVE Examplequery d1c75a_ a.3.1.1  

hit # pc e    

1 d1gcya1 b.71.1.1 0.31

2 d1h32b_ a.3.1.1 0.4

3 d1gks__ a.3.1.1 0.52

4 d1a56__ a.3.1.1 0.52

5 d1kx2a_ a.3.1.1 0.67

6 d1etpa1 a.3.1.4 0.67

7 d1zpda3 c.36.1.9 0.87

8 d1eu1a2 c.81.1.1 0.87

9 d451c__ a.3.1.1 1.1

10 d1flca2 c.23.10.2 1.1

11 d1mdwa_ d.3.1.3 1.1

12 d2dvh__ a.3.1.1 1.5

13 d1shsa_ b.15.1.1 1.5

14 d1mg2d_ a.3.1.1 1.5

15 d1c53__ a.3.1.1 2.4

16 d3c2c__ a.3.1.1 2.4

17 d1bvsa1 a.5.1.1 6.8

18 d1dvva_ a.3.1.1 6.8

19 d1cyi__ a.3.1.1 6.8

20 d1dw0a_ a.3.1.1 6.8

21 d1h0ba_ b.29.1.11 6.8

22 d3pfk__ c.89.1.1 6.8

23 d1kful3 d.3.1.3 6.8

24 d1ixrc1 a.4.5.11 14

25 d1ixsb1 a.4.5.11 14

- Vary threshold above which a hit is seen as a positive: e.g. e=10,e=1,e=0.1,e=0.01- True positives: in same SCOP family, or false positives: not in same family- For each threshold, calculate coverage: number of true positives divided by total number of possible true positives- For each threshold, calculate errors-per-query: number of false positives divided by number of queries- Plot coverage on x-axis and errors-per-query on y-axis; right-bottom is best

Page 15: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

CVE results

(for PDB010)

-

+

Page 16: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Mean Average Precision

• A.P.: borrowed from information retrieval search (Salton, 1991)

• Recall: true positives divided by number of homologs

• Precision: true positives divided by number of hits

• A.P. = approximate integral to calculate area under recall-precision curve

Page 17: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Mean AP Examplequery d1c75a_ a.3.1.1  

hit # pc e    

1 d1gcya1 b.71.1.1 0.31

2 d1h32b_ a.3.1.1 0.4

3 d1gks__ a.3.1.1 0.52

4 d1a56__ a.3.1.1 0.52

5 d1kx2a_ a.3.1.1 0.67

6 d1etpa1 a.3.1.4 0.67

7 d1zpda3 c.36.1.9 0.87

8 d1eu1a2 c.81.1.1 0.87

9 d451c__ a.3.1.1 1.1

10 d1flca2 c.23.10.2 1.1

11 d1mdwa_ d.3.1.3 1.1

12 d2dvh__ a.3.1.1 1.5

13 d1shsa_ b.15.1.1 1.5

14 d1mg2d_ a.3.1.1 1.5

15 d1c53__ a.3.1.1 2.4

16 d3c2c__ a.3.1.1 2.4

17 d1bvsa1 a.5.1.1 6.8

18 d1dvva_ a.3.1.1 6.8

19 d1cyi__ a.3.1.1 6.8

20 d1dw0a_ a.3.1.1 6.8

21 d1h0ba_ b.29.1.11 6.8

22 d3pfk__ c.89.1.1 6.8

23 d1kful3 d.3.1.3 6.8

24 d1ixrc1 a.4.5.11 14

25 d1ixsb1 a.4.5.11 14

- Take 100 best hits- True positives: in same SCOP family, or false positives: not in same family-For each of the true positives: divide the positive rank (1,2,3,4,5,6,7,8,9,10,11,12) by the true positive rank (2,3,4,5,9,12,14,15,16,18,19,20)- Divide the sum of all of these numbers by the total number of hits (100) = AP (0.140)- Take average of AP scores for all entries = mean AP

Page 18: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Mean AP results

0.00

0.03

0.06

0.09

0.12

0.15

0.18

0.21

0.24

0.27

0.30

pdb010 pdb020 pdb025 pdb030 pdb035 pdb040 pdb050 pdb070 pdb090 pdb095

ASTRAL SCOP set

mea

n A

P

pc ebf zbl efa ess epa e

Page 19: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Time consumption

• PDB095 all-against-all comparison:– Biofacet: multiple days (Z-value calc.!)– SSEARCH: 5h49m– ParAlign: 47m– FASTA: 40m– BLAST: 15m

Page 20: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Conclusions

• e-value better than Z-value(!)• SW implementations are (more or less)

the same (SSEARCH, ParAlign and Biofacet), but SSEARCH with e-value scores best of all

• Use FASTA/BLAST only when time is important

• Larger structural comparison database needed for better analysis

Page 21: Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Credits

Peter Groenen

Wilco Fleuren

Jack Leunissen