Typing Staphylococcus aureus using the protein A gene

Phaedra Agius – January, 2008, completed at RPI in New York

in collaboration with Barry Kreiswirth, Steve Naidich, Kristin Bennett

Introduction

• What is staph?
• Typing methods and the spA gene
• The data
• Comparing sequences
• Similarities and differences
• Hierarchical clustering
• Evaluating the results
• Multidimensional scaling
• Conclusion

• Staphylococcus aureus is a bacterium that often lives on the skin or in the nose of a healthy person.

• Staph can cause a multitude of infections, from skin infections to more deadly infections such as pneumonia and meningitis.

• It can spread rapidly.

• Some strains are resistant to antibiotics (MRSA).

Typing Methods

• Multi Locus Sequence Typing (MLST) is a well-established typing method that looks at seven housekeeping genes in staph. These are genes that are always turned on.

• Our method looks at just ONE gene – the spA gene.

The spA gene

• The spA gene contains information for making Protein A.

• The protein A in staph is a virulence factor. It inhibits white blood cells from ingesting and destroying the bacteria by acting as an immunological disguise.

Preprocessed DNA sequences of the spA gene

AAA GAG GAAGACAACAACAAGCCTGGT
AAA GAAGATGGCAACAAGCCTGGT
AAA GAAGACAACAAAAAACCTGGC
AAA GAAGATGGCAACAAACCTGGT
AAA GAAGACGGCAACAAGCCTGGT
AAA GAAGATGGCAACAAGCCTGGT

X1-K1-A1-O1-M1-Q1

The spA DNA sequences can be preprocessed into a sequence of repeats, or cassettes.

Instead of dealing with the long DNA sequences, we use the shorter preprocessed spa sequences, such as X1-K1-A1-O1-M1-Q1.

Note: the first cassette has 27 bp; the others have 24 bp.
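The cassette lengths noted above are enough to cut a raw spA sequence into repeats. The following is a minimal sketch only: the function name `split_cassettes` is hypothetical, and the lookup from each cassette's DNA to its code (X1, K1, ...) is not given on the slide, so it is not attempted here.

```python
def split_cassettes(dna, first_len=27, repeat_len=24):
    """Cut a spA repeat region into cassettes: one of first_len bp, then repeat_len bp each."""
    dna = "".join(dna.split()).upper()  # drop spaces and line breaks
    if len(dna) < first_len or (len(dna) - first_len) % repeat_len != 0:
        raise ValueError("length does not fit the 27 bp + n * 24 bp layout")
    cassettes = [dna[:first_len]]
    for start in range(first_len, len(dna), repeat_len):
        cassettes.append(dna[start:start + repeat_len])
    return cassettes


# The example sequence from the slide, one cassette per string.
SLIDE_DNA = (
    "AAA GAG GAAGACAACAACAAGCCTGGT"
    "AAA GAAGATGGCAACAAGCCTGGT"
    "AAA GAAGACAACAAAAAACCTGGC"
    "AAA GAAGATGGCAACAAACCTGGT"
    "AAA GAAGACGGCAACAAGCCTGGT"
    "AAA GAAGATGGCAACAAGCCTGGT"
)

cassettes = split_cassettes(SLIDE_DNA)
for c in cassettes:
    print(len(c), c)
```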

Labeled data

• The MLST allelic profile is provided for each sequence

• 194 sequences labeled with their MLST type

DukeId   SpaMotif                       spa  MLST  arcc  aroe  glpf  gmk  pta  tpi  yqil
1075014  X1-K1-A1-M1-B3                 538  395   10    47    8     26   26   32   2
584      X1-K1-B1-B3                    541  ?     10    ?     8     26   26   32   2
1771     X1-K1-B1                       93   47    10    11    8     6    10   3    2
40       X1-K1-A1-K1-A1-O1-M1-Q1-Q1     468  30    2     2     2     2    6    3    2
1073088  X1-K1-A1-K1-A1-O1-M1-Q1-Q1-Q1  536  30    2     2     2     2    6    3    2
349      X1-K1-A1-O1-M1-Q1              390  30    2     2     2     2    6    3    2

The SpaMotif and spa columns describe the spa sequences; the MLST column and the seven allele columns (arcc to yqil) are the MLST labels.

Comparing spa sequences

• T1-J1-M1-G1-M1-K1

• T1-K1-B1-M1-D1-M1-G1-M1-K1
• T1-M1-B1-M1-D1-M1-G1-M1-K1
• T1-M1-D1-M1-G1-M1-M1-K1

• U1-J1-F1-K1-P1-E1
• T1-J1-F1-K1-B1-P1-E1
• U1-J1-G1-F1-M1-B1

These ‘preprocessed’ sequences are highly conserved.

How can we generate numbers from sequences that reflect the subtle differences and/or similarities between them?

Comparing spa sequences

– Global alignment
– Affine alignment
– BCGS: best common gap-weighted subsequence

• Weighting the sequence ends (B and E)

Using these methods each spa sequence can be represented as a vector of similarity scores between itself and all the other sequences

Global alignment

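• Costs: Gap = 1, Mismatch = 1

C L O U D Y D A Y
G * O * * A W A Y
  1 0 1 1 1 1 0

Costs are listed for the seven interior columns; the first and last columns are not scored here.

• Distance: d = 5 (sum of the costs) Similarity: s = 2 (number of matching columns)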
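A global alignment distance with these costs can be computed by standard dynamic programming. The sketch below is an illustration under assumptions: the function name is hypothetical, it returns the minimum alignment cost only (the slide's similarity count would additionally need a traceback), and the alignment drawn on the slide is not necessarily the one the minimum corresponds to.

```python
def global_alignment_distance(a, b, gap=1, mismatch=1):
    """Minimum total cost of a global alignment of sequences a and b.

    a and b may be strings or lists of cassette codes such as ["T1", "J1", "M1"].
    """
    n, m = len(a), len(b)
    prev = [j * gap for j in range(m + 1)]  # aligning an empty prefix of a
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            curr[j] = min(
                prev[j - 1] + sub,  # match or mismatch
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
        prev = curr
    return prev[m]


print(global_alignment_distance("T1-J1-M1".split("-"), "T1-M1".split("-")))
```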

Affine gap alignment

• Costs: Gap Initialization = 2, Gap = 1, Mismatch = 1

Global:
U1 J1 G1 F1 B1 B1 B1 B1 P1 B1
T1 J1 *  *  B1 B1 B1 *  *  D1
   0  3  1  0  0  0  3  1
Distance = 8, Similarity = 4

Affine:
U1 J1 G1 F1 B1 B1 B1 B1 P1 B1
T1 J1 *  *  *  *  B1 B1 B1 D1
   0  3  1  1  1  0  0  1
Distance = 7, Similarity = 3

The first position of each gap costs 2 + 1 = 3 and every further position costs 1, so one long gap is cheaper than two short ones.
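The affine costs above can be minimized with the usual three-state dynamic program (often attributed to Gotoh). This is a hedged sketch, not the talk's implementation: the function name and the state names `M`, `X`, `Y` are illustrative choices. Applied to the interior cassettes of the slide's example, it reproduces the affine distance of 7.

```python
def affine_distance(a, b, gap_open=2, gap_extend=1, mismatch=1):
    """Minimum alignment cost where a gap of length k costs gap_open + k * gap_extend."""
    INF = float("inf")
    n, m = len(a), len(b)
    # M: last column pairs a[i-1] with b[j-1]; X: a[i-1] against a gap; Y: b[j-1] against a gap
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = min(
                M[i - 1][j] + gap_open + gap_extend,  # open a new gap
                Y[i - 1][j] + gap_open + gap_extend,
                X[i - 1][j] + gap_extend,             # extend the current gap
            )
            Y[i][j] = min(
                M[i][j - 1] + gap_open + gap_extend,
                X[i][j - 1] + gap_open + gap_extend,
                Y[i][j - 1] + gap_extend,
            )
    return min(M[n][m], X[n][m], Y[n][m])


# Interior cassettes of the slide's example (first and last cassettes left out).
s1 = ["J1", "G1", "F1", "B1", "B1", "B1", "B1", "P1"]
s2 = ["J1", "B1", "B1", "B1"]
print(affine_distance(s1, s2))
```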

BCGS-Best Common Gap-weighted Subsequence

P A R T Y H A R D

P A N T * * * R Y

Common subsequences are:

S1=A,T,R, S2=AT, S3=TR, S4=ATR

Gap-weighted scores: choose a weight $0 < \lambda \le 1$

$S_1 = 1 \cdot \lambda^0 = 1$, $S_2 = 2\lambda$, $S_3 = 2\lambda^3$, $S_4 = 3\lambda^4$

If $\lambda = 1$, then S4 is the optimal choice.

If $\lambda = 0.9$, the scores are 1, 1.8, 1.46 and 1.97 respectively

If $\lambda = 0.8$, the scores are 1, 1.6, 1.02 and 1.23 respectively
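The scores above follow the pattern (subsequence length) × λ^(number of gap positions). The snippet below only re-evaluates the four candidates listed on the slide for a given λ; it is not the BCGS search itself, and the names `gap_weighted_score`, `CANDIDATES` and `best_candidate` are made up for this illustration.

```python
def gap_weighted_score(length, gaps, lam):
    """Score of a common subsequence: its length, discounted by lam for each gap position."""
    return length * lam ** gaps


# (length, gap positions) for the slide's candidates S1..S4
CANDIDATES = {"S1": (1, 0), "S2": (2, 1), "S3": (2, 3), "S4": (3, 4)}


def best_candidate(lam):
    """Return the best-scoring candidate name and the dict of all scores."""
    scores = {
        name: gap_weighted_score(length, gaps, lam)
        for name, (length, gaps) in CANDIDATES.items()
    }
    return max(scores, key=scores.get), scores


for lam in (1.0, 0.9, 0.8):
    best, scores = best_candidate(lam)
    print(lam, best, {k: round(v, 2) for k, v in scores.items()})
```

With λ = 0.8 the long subsequence S4 is penalized enough that the shorter, tighter S2 scores highest, which is the trade-off the weight is meant to control.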

Normalizing the similarity scores

• The similarity scores M are normalized as follows:

$M_{\text{norm}} = \dfrac{M}{\sqrt{n_1 n_2}}$

where $n_1$ and $n_2$ are the sequence lengths

Example:
C L O U D Y D A Y
G * O * * A W A Y

Similarity = 3, Normalized similarity = 3/√(7·4) ≈ 0.57
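The arithmetic of the example can be checked with a few lines. The function name is hypothetical; the formula (dividing by the square root of the product of the lengths) is read off the worked example.

```python
import math


def normalized_similarity(m, n1, n2):
    """Divide a raw similarity m by sqrt(n1 * n2), where n1 and n2 are the sequence lengths."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sequence lengths must be positive")
    return m / math.sqrt(n1 * n2)


print(round(normalized_similarity(3, 7, 4), 2))
```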

B and E

The cassettes at the beginning (B) and end (E) of a sequence are highly conserved within spa families

These cassettes shall be compared separately, scored as a match (1) or mismatch (0) and weighted

[Figure: a spa sequence split into B (first cassette), M (middle) and E (last cassette)]

For example, let B and E each have a weight of 20% in the overall score:

Sim score = 0.2*B + 0.6*M + 0.2*E
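The weighted combination can be written as a small function. This is a sketch: the name `combined_similarity` and its argument names are assumptions, the B and E weights are left as parameters, and the middle weight is taken to be whatever remains so that the three weights sum to 1.

```python
def combined_similarity(b_match, middle_sim, e_match, w_b=0.2, w_e=0.2):
    """Weighted similarity: B and E are match (1) / mismatch (0), middle_sim is in [0, 1]."""
    w_m = 1.0 - w_b - w_e  # weight left for the middle cassettes
    return w_b * float(b_match) + w_m * middle_sim + w_e * float(e_match)


# Same first cassette, different last cassette, middle similarity 0.5
print(combined_similarity(True, 0.5, False))
```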

Similarities → Distances

Normalized similarity scores can be transformed to distances as follows:

$D(s_1, s_2) = 1 - \mathrm{sim}(s_1, s_2)$

Each spa sequence → a vector of distances between that sequence and every other sequence in the dataset.

The set of spa sequences is now represented by a (normalized) distance matrix.
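Building the distance matrix is a double loop over the sequences. In this sketch, `toy_similarity` is only a stand-in (shared cassettes over all cassettes) so that the example runs; in the talk's setting it would be replaced by one of the normalized alignment or BCGS similarities.

```python
def distance_matrix(sequences, sim):
    """Symmetric matrix with D[i][j] = 1 - sim(sequences[i], sequences[j])."""
    n = len(sequences)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = 1.0 - sim(sequences[i], sequences[j])
    return D


def toy_similarity(a, b):
    """Placeholder similarity in [0, 1]; NOT one of the measures from the talk."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


spa = [s.split("-") for s in ("X1-K1-B1", "X1-K1-B1-B3", "T1-J1-M1-G1-M1-K1")]
D = distance_matrix(spa, toy_similarity)
for row in D:
    print([round(d, 2) for d in row])
```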

Hierarchical Clustering

Uses a distance matrix

It iteratively ‘merges’ the two nearest items/clusters

     1  2  3  4  5  6  7  8
1    0  9  4  7  8  4  5  9
2       0  6  9  6  8  5  8
3          0  6  7  1  2  9
4             0  5  4  5  3
5                0  7  5  4
6                   0  2  6
7                      0  5
8                         0

Cutoff c: this determines the number of clusters to be formed
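The merge loop can be sketched in plain Python using the example matrix above. Two assumptions are made here because the slide does not state them: clusters are compared by average linkage (mean pairwise distance), and merging stops once the nearest pair of clusters is farther apart than the cutoff c. Items are indexed from 0 in the code, so slide items 3, 6 and 7 appear as 2, 5 and 6.

```python
def agglomerative_clusters(D, cutoff):
    """Merge the two nearest clusters repeatedly until the nearest pair exceeds cutoff."""
    clusters = [[i] for i in range(len(D))]

    def linkage(c1, c2):
        # average linkage: mean of all pairwise distances between the two clusters
        return sum(D[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cutoff:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Upper triangle of the slide's 8 x 8 example matrix, row by row.
UPPER = [
    [0, 9, 4, 7, 8, 4, 5, 9],
    [0, 6, 9, 6, 8, 5, 8],
    [0, 6, 7, 1, 2, 9],
    [0, 5, 4, 5, 3],
    [0, 7, 5, 4],
    [0, 2, 6],
    [0, 5],
    [0],
]
n = len(UPPER)
SLIDE_D = [[0] * n for _ in range(n)]
for i, row in enumerate(UPPER):
    for k, value in enumerate(row):
        SLIDE_D[i][i + k] = SLIDE_D[i + k][i] = value

print(agglomerative_clusters(SLIDE_D, cutoff=2.5))
```

A larger cutoff lets more merges happen and so yields fewer clusters; a cutoff above every linkage distance leaves a single cluster.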

Training and Testing

• Split the data into two: a TRAINING set and a TEST set
• Build a model on the Training set by choosing optimal B, E and c parameters
• Assign the Test data to the nearest clusters
• Evaluate the results
• Repeat multiple times for validation

[Figure: the data divided into a Train part and a Test part]
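One random split might look like the sketch below. The 30% test fraction is borrowed from the results slide; the function name, the use of a seed per iteration, and the rounding of the test-set size are assumptions made for illustration.

```python
import random


def split_train_test(n_items, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices) for one random split of n_items points."""
    rng = random.Random(seed)  # a different seed gives a different iteration
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_test = round(n_items * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = split_train_test(194, test_fraction=0.3, seed=1)
print(len(train_idx), len(test_idx))
```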

Assigning Test sequences to the Training clusters

• We define the distance between a point and a cluster to be the mean of the distances between that point and the members of the cluster.

• IF the distance between a test point and the nearest cluster exceeds an outlier threshold t, the test point is defined to be an outlier (a novel strain of the bacteria);

ELSE the test point is assigned to the nearest cluster.

[Figure: a test point at distance > t from its nearest cluster]
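The IF/ELSE rule translates directly into code. This sketch assumes the threshold t is given in the same units as the distances; the results slides state it in standard deviations ("1.5 sd"), and how that is converted is not specified, so it is not modeled. The function and argument names are hypothetical.

```python
def assign_test_point(dists_to_train, clusters, threshold):
    """Assign one test point to its nearest training cluster, or flag it as an outlier.

    dists_to_train[i] is the distance from the test point to training point i;
    clusters is a list of lists of training indices.
    Returns (cluster_index or None for an outlier, distance to the nearest cluster).
    """
    best_cluster, best_dist = None, float("inf")
    for c, members in enumerate(clusters):
        mean_d = sum(dists_to_train[i] for i in members) / len(members)
        if mean_d < best_dist:
            best_cluster, best_dist = c, mean_d
    if best_dist > threshold:
        return None, best_dist  # farther than t from every cluster: possible novel strain
    return best_cluster, best_dist


clusters = [[0, 1], [2]]
print(assign_test_point([0.1, 0.3, 0.9], clusters, threshold=0.5))
print(assign_test_point([0.8, 0.9, 0.7], clusters, threshold=0.5))
```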

Evaluation

• Compare our clusters to the groups defined by the MLST labels via the Jaccard coefficient

• Split our data into a Training and Testing set multiple times and measure the consistency of the clusters formed via a Stability score

• Measure the Accuracy of our spa groups by comparing them to the MLST groups

Jaccard coefficient

[Figure: two clusterings of the same points, Clustering S and Clustering M]
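The slide's formula did not survive in this transcript. A common pair-counting definition of the Jaccard coefficient between two clusterings, assumed here rather than confirmed by the source, is J = a / (a + b + c), where a counts pairs of points grouped together in both clusterings and b and c count pairs grouped together in only one of them. A sketch with hypothetical names:

```python
from itertools import combinations


def jaccard_coefficient(labels_s, labels_m):
    """Pair-counting Jaccard coefficient between two clusterings of the same points."""
    if len(labels_s) != len(labels_m):
        raise ValueError("both clusterings must label the same points")
    both = only_s = only_m = 0
    for i, j in combinations(range(len(labels_s)), 2):
        same_s = labels_s[i] == labels_s[j]
        same_m = labels_m[i] == labels_m[j]
        if same_s and same_m:
            both += 1
        elif same_s:
            only_s += 1
        elif same_m:
            only_m += 1
    total = both + only_s + only_m
    return both / total if total else 1.0  # no co-clustered pairs anywhere: treat as agreement


print(jaccard_coefficient([0, 0, 1, 1], [0, 0, 0, 1]))
```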

Stability

The stability is measured over the n Training and Testing iterations.

It is defined to be the mean of the Jaccard scores measured pairwise between the spa clusterings obtained at each iteration.

[Figure: spa clusterings 1, 2 and 3 from iterations 1, 2, 3, …, with pairwise Jaccard scores J1, J2 and J3]

Stability = mean(J1, J2, J3)

Accuracy

[Figure: a spa group and the MLST groups it intersects]

The MLST label assigned to a spa group is the label of the MLST group with which the spa group has the largest intersection.

The accuracy for that spa group is defined to be the percentage of its points that are correctly labeled.

The overall accuracy of a spa clustering is defined to be the percentage of correctly labeled points over all spa groups.

In the pictured example, Accuracy = 8/11.
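The labeling rule amounts to a majority vote inside each spa group. The sketch below uses made-up data (11 points, chosen so that 8 end up correctly labeled); the group names, labels and function name are illustrative and not taken from the talk's dataset.

```python
from collections import Counter


def clustering_accuracy(spa_groups, mlst_labels):
    """Overall accuracy plus, per spa group, (assigned MLST label, group accuracy)."""
    members = {}
    for group, label in zip(spa_groups, mlst_labels):
        members.setdefault(group, []).append(label)
    correct = 0
    per_group = {}
    for group, labels in members.items():
        # the MLST label with the largest intersection with this spa group
        label, count = Counter(labels).most_common(1)[0]
        per_group[group] = (label, count / len(labels))
        correct += count
    return correct / len(mlst_labels), per_group


spa_groups = ["A"] * 5 + ["B"] * 6
mlst_labels = [30, 30, 30, 30, 5, 5, 5, 5, 5, 8, 30]
overall, per_group = clustering_accuracy(spa_groups, mlst_labels)
print(overall, per_group)
```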

Results: Jaccard scores (40 iterations, outlier threshold = 1.5 sd)

Results: Stability scores (40 iterations, outlier threshold = 1.5 sd)

Results: Accuracy scores (40 iterations, outlier threshold = 1.5 sd)

Results: Outlier detection (40 iterations, outlier threshold = 1.5 sd)

Results: Varying the outlier threshold (10 iterations, test set size = 30%)

Multidimensional Scaling (MDS)

• MDS translates a distance matrix into a set of coordinates such that the distances between the points are approximately equal to the dissimilarities.

Picture taken from Forrest W. Young’s paper ‘Multidimensional Scaling’
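One standard variant, classical (Torgerson) MDS, can be sketched with NumPy: double-center the squared distances and keep the top eigenvectors. The talk does not say which MDS variant was used, so this is an assumption, and the function name is hypothetical.

```python
import numpy as np


def classical_mds(D, k=2):
    """Embed n points in k dimensions from an n x n distance matrix D (classical MDS)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]         # indices of the k largest eigenvalues
    vals_k = np.clip(vals[order], 0.0, None)   # guard against small negative eigenvalues
    return vecs[:, order] * np.sqrt(vals_k)


# Three points on a line at positions 0, 1 and 3
D = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
X = classical_mds(D, k=1)
print(np.round(np.abs(X[:, 0] - X[0, 0]), 3))
```

For distances that come from an actual Euclidean layout, as in this toy case, the embedding reproduces them exactly; for spa distances the match is only approximate.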

[Figure: MDS with our distances. Two-dimensional scatter plot of the spa sequences (horizontal axis about -0.3 to 0.5, vertical axis about -0.4 to 0.4), with groups labeled MLST 1, MLST 5, MLST 8, MLST 15, MLST 30, MLST 45, MLST 59, MLST 109 and MLST 188]

MDS – a closer look

[Figure: zoomed region of the MDS plot (horizontal axis about -0.22 to -0.06, vertical axis 0 to 0.3), showing the following sequences]

MLST 20
• T1-G2-M1-F1-B1-B1-B1
• T1-G2-M1-F1-F1-B1-B1-B1
• U1-G2-M1-F1-B1-L1-B1
• U1-G2-M1-F1-B1-B1-L1-B1

MLST 59
• Z1-D1-M1-D1-M1-N1-K1-B1
• Z1-D1-M1-D1-M1-N1-K1-E1
• Z1-D1-M1-N1-K1-B1

Conclusion and future work

• The spa clustering method can refine groups in ways that MLST cannot
• BCGS worked best
• MDS on our spa distances clearly draws out the clusters

Future research

• More data, compare to other typing methods
• Use BCGS on other data types
• Different distance measures
• Different ways of assigning test points to clusters
• Better ways of finding the optimal parameters than a grid search

This work is published in IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 4, Issue 4, Oct.-Dec. 2007, pages 693-704.

Thanks! Questions?