

JOURNAL OF CLINICAL MICROBIOLOGY, Nov. 2010, p. 4072–4082 Vol. 48, No. 11
0095-1137/10/$12.00 doi:10.1128/JCM.00659-10
Copyright © 2010, American Society for Microbiology. All Rights Reserved.

Improved Identification of Epidemiologically Related Strains of Salmonella enterica by use of a Fusion Algorithm Based on Pulsed-Field Gel Electrophoresis and Multiple-Locus Variable-Number Tandem-Repeat Analysis†

S. L. Broschat,1,2* D. R. Call,1,2 M. A. Davis,2 D. Meng,1 S. Lockwood,1 R. Ahmed,3 and T. E. Besser2

School of Electrical Engineering and Computer Science1 and Department of Veterinary Microbiology and Pathology,2 Washington State University, Pullman, Washington 99164, and National Microbiology Laboratory, Canadian Science Centre for Human and Animal Health, Winnipeg, MB, Canada3

Received 29 March 2010/Returned for modification 18 June 2010/Accepted 17 August 2010

Pulsed-field gel electrophoresis (PFGE) and multiple-locus variable-number tandem-repeat analysis (MLVA) are used to assess genetic similarity between bacterial strains. There are cases, however, when neither of these methods quantifies genetic variation at a level of resolution that is well suited for studying the molecular epidemiology of bacterial pathogens. To improve estimates based on these methods, we propose a fusion algorithm that combines the information obtained from both PFGE and MLVA assays to assess epidemiological relationships. This involves generating distance matrices for PFGE data (Dice coefficients) and MLVA data (single-step stepwise-mutation model) and modifying the relative distances using the two different data types. We applied the algorithm to a set of Salmonella enterica serovar Typhimurium isolates collected from a wide range of sampling dates, locations, and host species. All three classification methods (PFGE only, MLVA only, and fusion) produced a similar pattern of clustering relative to groupings of common phage types, with the fusion results being slightly better. We then examined a group of serovar Newport isolates collected over a limited geographic and temporal scale and showed that the fusion of PFGE and MLVA data produced the best discrimination of isolates relative to a collection site (farm). Our analysis shows that the fusion of PFGE and MLVA data provides an improved ability to discriminate epidemiologically related isolates but provides only minor improvement in the discrimination of less related isolates.

Salmonellosis is one of the most common food-borne diseases in the United States (5). Consequently, it is important to understand how Salmonella strains disseminate within and between reservoirs and environments. Many molecular typing tools have been used for this purpose (11). Of these methods, pulsed-field gel electrophoresis (PFGE) is considered by many to be the gold standard for strain typing, and variable-number tandem-repeat (VNTR) assays are powerful alternative or complementary typing tools (3, 22). Both methods offer a high degree of genetic resolution for strain typing, depending on several factors.

PFGE involves separating chromosomal DNA macro-restriction fragments by size, and strains are discriminated based on the resulting band pattern observed after electrophoresis has been completed. It is one of the most reproducible and highly discriminatory typing techniques and has been widely and successfully used for a variety of Salmonella enterica serovars (12, 15); for many situations, PFGE is capable of discriminating between closely related strains. In addition, the use of the assay to analyze different serovars does not require a great deal of modification, as might be required with procedures that are dependent on PCR. Difficulties arise when strains are very closely related (i.e., poor discrimination [18, 27]) or when bands comigrate in the gel or identically sized bands represent completely different fragments of chromosomal DNA and thereby produce spurious matches (6). These complications are more pronounced when a large number of bands are generated by the restriction digest (4). In addition, while band patterns convey a crude degree of genetic relatedness, a large number of independent restriction digests would be needed to infer an accurate phylogenetic relationship (6).

Multiple-locus variable-number tandem-repeat analysis (MLVA) is a PCR-based technique that relies on the amplification of chromosomal or plasmid DNA that encompasses short tandem repeats of a DNA sequence. The tandem repeats are prone to higher-than-background mutation rates due to DNA strand slippage during replication (23), and thus, the amplified fragments will vary in length depending on the number of repeats harbored at a given locus. Different fragment lengths are tallied either as the total length (base pairs) or the estimated number of repeat units, and each discretely sized fragment is considered a unique “allele” for the locus under investigation. Because of the relatively high mutation rate, strains can accumulate distinctive allele patterns within a relatively short period of time (5). Furthermore, the technique can be multiplexed and automated and is conducive to rapid and relatively high-throughput strain typing. MLVA assays are relatively robust (5, 15–17) and, while not perfect, they can provide phylogenetic information even with a limited number of loci (13, 18). While access to a sequenced genome dramatically speeds the ability to establish new assays (3), it is not a requisite to assay development. The primary limitations of the technique include the potential need for a new set of loci for every species or serovar under investigation and the fact that some loci are very “unstable” and can “disappear” from some strains or lineages; this produces the equivalent of an uninformative “null” allele. Mutation rates can also vary between loci (5, 24, 25); if ignored, this factor can introduce bias into comparative analyses.

* Corresponding author. Mailing address: School of EECS, Washington State University, P.O. Box 642752, Pullman, WA 99164-2752. Phone: (509) 335-5693. Fax: (509) 335-3818. E-mail: [email protected].

† Supplemental material for this article may be found at http://jcm.asm.org/.

Published ahead of print on 25 August 2010.

Clearly, PFGE and MLVA offer different technical and interpretive advantages and disadvantages, but it is important to emphasize that the nature of the methodological and interpretive errors is independent between the techniques. For example, errors due to comigration of bands for PFGE are independent of band size estimation errors for MLVA because differences in MLVA band size are not detectable using PFGE and macro-restriction fragments are generally independent of tandem-repeat sequences. Provided that most of the experimental variation from these two methods is uncorrelated (i.e., independent), it is possible to combine results from the two methods to produce improved estimates (8), and this premise underlies the current study.

Our objective was to determine whether combining the information obtained from both PFGE and MLVA assays produces more rigorous and discriminatory analyses of bacterial pathogens, such as Salmonella. Two sets of Salmonella enterica isolates were used in this study; one set included serovar Typhimurium isolates from a wide range of sampling dates, locations, and host species while the other set included a group of serovar Newport isolates collected over a limited geographic and temporal scale for a single host species. The results of the different typing methods were assessed by comparison with those of phage-typing assays (serovar Typhimurium) and with known epidemiological relationships (serovar Newport). To interpret the MLVA data, we employed a metric that incorporates a stepwise-mutation model, and to interpret the PFGE data, we employed Dice coefficients to construct distance matrices. Our analysis shows that the fusion of the two typing methods provides an improved ability to discriminate between isolates when PFGE and MLVA separately provide partial but incomplete discrimination between strains with a high degree of probable genetic similarity.

MATERIALS AND METHODS

Salmonella strains. Two sets of isolates were used for this study. Set A included 37 S. enterica serovar Typhimurium strains that were previously collected from different animal hosts and from different locations and time periods (see the figures for strain designations and descriptors). Because these isolates were epidemiologically unrelated, we assumed that they encompassed a high degree of genetic variability. Set B included 63 S. enterica serovar Newport isolates, mostly collected from cattle in Washington State, and this set was assumed to represent less genetic diversity because of the restricted geographic representation, inclusion of multiple isolates from individual farms, and a limited temporal scale. Salmonella serovar Typhimurium strains were phage typed at the National Microbiology Laboratory, Canadian Science Center for Human and Animal Health, Winnipeg, Manitoba. Serovar Newport isolates were tested for antibiotic resistance using a disc diffusion method (2) according to Clinical and Laboratory Standards Institute (NCCLS) guidelines (19, 20). Northwestern isolates from cattle were tested for susceptibility to a panel of antimicrobial drugs that included ampicillin (10 μg), chloramphenicol (30 μg), gentamicin (10 μg), kanamycin (30 μg), streptomycin (10 μg), tetracycline (30 μg), triple-sulfa (a combination of sulfadiazine, sulfamethazine, and sulfamerazine) (250 μg), trimethoprim-sulfamethoxazole (1.25 μg to 23.75 μg), ceftazidime (30 μg), amoxicillin-clavulanic acid (20/10 μg), and nalidixic acid (30 μg) (BD Diagnostics, Sparks, Maryland). Northeastern isolates were tested with the same panel except that a sulfisoxazole disc (250 μg) was substituted for the triple-sulfa disc.

Pulsed-field gel electrophoresis. We followed a standard PFGE protocol for Salmonella enterica, using an XbaI restriction digest (21). Briefly, genomic DNA was digested in agarose plugs with the restriction enzyme XbaI, and the resulting DNA fragments were gel separated using a CHEF-DR II (Bio-Rad, Hercules, CA) apparatus. The electrophoresis conditions included an initial pulse time of 2.2 s, final pulse time of 63.8 s, running temperature of 14°C, and run time of 18 to 20 h at 6 V/cm. PFGE gels were stained with ethidium bromide and visualized using a UV transilluminator. Gel images were analyzed using Bionumerics version 4.6 (Applied Maths, Sint-Martens-Latem, Belgium). Estimated band sizes were exported from Bionumerics for the current study.

Multiple-locus variable-number tandem-repeat analysis. For S. enterica serovar Typhimurium isolates, four of five previously described VNTR loci were employed for MLVA (STTR5, STTR6, STTR9, and STTR10pl), following a published protocol (18) with the addition of a VNTR locus identified in-house (STTR11) (STTR11-Forward, GATAAGCCGTACTGTTCAGG, and STTR11-Reverse, TACTCCTTTGTGGTCTACGC). For the S. enterica serovar Newport strains, two Typhimurium loci (STTR5 and STTR6) (18) and four published Newport-specific loci were employed (NewportA, NewportB, NewportM, and NewportL) (6). The PCRs for both serovars were conducted as previously described (6, 20). Capillary electrophoresis was carried out at the Washington State University Genomics Core using a 3730 DNA analyzer with Pop-7 polymer (Applied Biosystems). The resulting electropherograms were analyzed using GeneMarker software (Softgenetics LLC, State College, PA).

Data analysis. Dice similarity coefficients were calculated using Bionumerics (Applied Maths, Austin, TX) from PFGE data to generate the distance matrix, and the unweighted-pair group method with arithmetic mean (UPGMA) algorithm was used to construct a dendrogram. For the MLVA data, the total length of the tandem repeats was divided by the estimated size of each repeat to obtain the number of tandem repeats for each locus and each strain. There were five available loci with tandem repeats for S. enterica serovar Typhimurium, but data from one locus were not used because it was from a plasmid locus and ~73% of 37 bacterial isolates were positive for this locus. For S. enterica serovar Newport, there were six loci, all of which were used. Because passage experiments indicate that VNTR mutations are usually composed of a single step (5, 24), we modeled our data using a single-step stepwise-mutation model (SMM). Based on this statistical model, we estimated the distance S (the total number of single steps) between two lineages (two different isolates) and their most recent common ancestor (MRCA) using the number of tandem repeats, X_L, of the two lineages. The distances, S, for all lineage pairs were then used to construct the distance matrix, and UPGMA was used to obtain a dendrogram.
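The two preprocessing steps described above (a Dice-based distance for PFGE band patterns and a repeat count for each MLVA locus) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the relative band-matching tolerance `tol`, and the greedy matching rule are hypothetical choices made here and are not the settings used by Bionumerics.

```python
def dice_similarity(bands_a, bands_b, tol=0.015):
    """Dice coefficient 2*m / (len(a) + len(b)) for two lists of band sizes.

    Two bands count as a match when their sizes differ by at most `tol`
    (relative to the larger band). The tolerance and the greedy matching
    over sorted band lists are assumptions made for this sketch.
    """
    a, b = sorted(bands_a), sorted(bands_b)
    if not a and not b:
        return 1.0
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol * max(a[i], b[j]):
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return 2.0 * matches / (len(a) + len(b))


def dice_distance(bands_a, bands_b, tol=0.015):
    """Distance used to fill the PFGE distance matrix: 1 - Dice similarity."""
    return 1.0 - dice_similarity(bands_a, bands_b, tol)


def repeat_count(total_length_bp, repeat_size_bp):
    """Number of tandem repeats: total repeat length divided by repeat size."""
    return round(total_length_bp / repeat_size_bp)


# Hypothetical band sizes (kb) for two isolates; two of three bands match.
isolate_1 = [100.0, 200.0, 300.0]
isolate_2 = [100.0, 200.0, 400.0]
pfge_distance = dice_distance(isolate_1, isolate_2)
```

Filling a full distance matrix is then a matter of calling `dice_distance` for every pair of isolates.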

If $\mu$ is the rate of stepwise mutations per generation and if we assume that the gain or loss of a repeat is equally probable, then the following conditional probabilities, $P$, characterize the single-step SMM: $P(X_{t+1} = i + 1 \mid X_t = i) = P(X_{t+1} = i - 1 \mid X_t = i) = \mu/2$, $P(X_{t+1} = i \mid X_t = i) = 1 - \mu$, and $P(|X_{t+1} - X_t| \ge 2 \mid X_t = i) = 0$, where $i$ denotes the number of tandem repeats at distance $t$ and $t$ is an integer number between zero and infinity. Based on these conditional probabilities, the probability of the distance $t$ is given by (26) $P(t \mid n_0, \ldots, n_k) = N(t)/D(t)$, where

$$N(t) = e^{-(\lambda + 2\mu n)t} \prod_{j=0}^{k} \left[I_j(2\mu t)\right]^{n_j}$$

and

$$D(t) = \int_0^{\infty} e^{-(\lambda + 2\mu n)t} \prod_{j=0}^{k} \left[I_j(2\mu t)\right]^{n_j} \, dt.$$

The equation for $P$ assumes that the mutation rate $\mu$ is constant for all loci, which is approximately true for our S. enterica serovar Typhimurium data; $n_j$ denotes the number of loci at which the two lineages differ by $j$ tandem repeats, where $j = 0, 1, 2, \ldots, k$ and $k$ is the maximum number of differences; $n$ is the number of loci used; $\lambda$ is a parameter associated with the distance to a MRCA (in this work, $\lambda = 0.0002$ was found to give satisfactory results [26]); and $I_j$ is the $j$th-order modified Bessel function of type 1. The distance $S$ is the value of $t$ with the maximum probability.
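As a rough illustration of how the distance S could be computed, the sketch below maximizes the numerator N(t) over a grid of t values; the denominator D(t) is a normalizing integral that does not depend on t, so it does not affect the location of the maximum. Here the mutation rate is written `mu` and the MRCA-related parameter `lam` (0.0002 in the text). The value `mu=1e-3`, the grid limits, and all function names are assumptions made for this example and are not taken from the article.

```python
import math


def bessel_i(order, x, tol=1e-15, max_terms=500):
    """Modified Bessel function of the first kind, I_order(x), by power series."""
    half = x / 2.0
    term = half ** order / math.factorial(order)  # m = 0 term of the series
    total = term
    for m in range(1, max_terms):
        term *= half * half / (m * (m + order))
        total += term
        if term <= tol * total:
            break
    return total


def log_numerator(t, diffs, mu, lam):
    """log N(t) = -(lam + 2*mu*n)*t + sum over loci of log I_d(2*mu*t)."""
    n = len(diffs)
    value = -(lam + 2.0 * mu * n) * t
    for d in diffs:
        b = bessel_i(d, 2.0 * mu * t)
        if b <= 0.0:
            return float("-inf")  # I_d(0) = 0 for d >= 1, so t = 0 is excluded
        value += math.log(b)
    return value


def smm_distance(repeats_a, repeats_b, mu=1e-3, lam=0.0002,
                 t_max=20000.0, steps=2000):
    """Grid value of t that maximizes N(t) for two repeat-count profiles."""
    diffs = [abs(a - b) for a, b in zip(repeats_a, repeats_b)]
    grid = [t_max * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: log_numerator(t, diffs, mu, lam))


# Hypothetical repeat counts at four loci for three isolates.
s_same = smm_distance([10, 7, 3, 12], [10, 7, 3, 12])
s_near = smm_distance([10, 7, 3, 12], [11, 7, 3, 12])
s_far = smm_distance([10, 7, 3, 12], [13, 9, 3, 12])
```

Summing log I_d over loci is equivalent to the product over j with exponents n_j, because each locus with difference d contributes one factor I_d. A grid search is used only for simplicity; any one-dimensional optimizer would serve the same purpose.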

Fusion algorithm. Dendrograms constructed from PFGE or MLVA data can differ substantially. Consequently, if we assume that both types of data contain useful information as well as error, it is possible that better results can be obtained by combining the data. In fact, it is known that (i) if two different algorithms used to describe a set of samples give different results, (ii) if the error for each result is less than the error associated with randomly generated results, and (iii) if the errors for both are uncorrelated, then a combination of these algorithms will give better results than either of the two algorithms alone (8). For our problem, we have different data (PFGE and MLVA) from the same sets of samples. We can safely assume that the error for both the MLVA and PFGE algorithms is less than the error for random clustering, because both PFGE and MLVA can recapitulate epidemiological relationships (6, 14). Furthermore, we can assume that the errors are uncorrelated because the PFGE and MLVA assays measure differences that arise from different genetic mechanisms. Consequently, because the two methods provide different results, it is probable that combining the PFGE and MLVA data will provide a more comprehensive picture of the underlying population structure of these strains.

FIG. 1. Generalized tree for 37 S. enterica serovar Typhimurium isolates generated from the fusion algorithm. Parameters used in this analysis included r between 0 and 1 and the threshold parameter thr between 0.05 and 1. Information includes isolate designation, source, phage type, collection date (month/day/year), and state where the isolate was collected. The scale bar represents a composite measure of genetic distance determined using the averaged values of the normalized distance matrices (see Materials and Methods).

FIG. 2. Dendrogram for 37 S. enterica serovar Typhimurium isolates constructed using UPGMA and a distance matrix obtained using a single-step stepwise mutation model for VNTR data (MLVA only). The scale bar represents a measure of genetic distance.

Several strategies can be used to combine different types of data. One strategy is to treat each data type independently and produce two independent dendrograms that are then combined to form a single dendrogram. While conceptually simple, this approach weights all sources of information equally, which may not be an optimal approach. Another strategy is to combine the data sets together before generating a dendrogram, but it may be difficult to combine the data if they differ in type (e.g., discrete and continuous), and even if this is accomplished, a suitable approach for evaluating the combined data may not exist. An alternative approach is to process each type of data using an algorithm that is appropriate for that data type and then fuse the results at some midpoint in the process; this is the approach we employed.

FIG. 3. Dendrogram for 37 S. enterica serovar Typhimurium isolates constructed using UPGMA with Dice coefficients for PFGE data (PFGE only). The scale bar represents a measure of genetic distance.

We begin with two distance matrices, one for the PFGE data set and one for the MLVA data set, as described above. One distance matrix is used to construct a dendrogram, and a threshold is selected (see below) to define distinct clusters from this dendrogram. If two strains in the second distance matrix are grouped together in one of the clusters formed from the first distance matrix, the distance between them is reduced (see below); if these two strains are not grouped together, then the distance in the second matrix is left unchanged. After all pairwise distance values are adjusted based on the clusters from the first matrix, a final dendrogram is generated. Both sets of data can be used in alternating roles: PFGE data are used to create the clusters while MLVA data are used to create a second distance matrix to be modified, and MLVA data are used to create the clusters while PFGE data are used to create a second distance matrix to be modified.

For the fusion algorithm described above, values for two parameters must be chosen. The first is the threshold value, thr, which divides the initial dendrogram into distinct clusters. The second is the degree that each distance value should be reduced in the modified distance matrix. If D is the distance matrix to be modified with elements d(i,j) and D* is the modified distance matrix, the elements of D* are given in terms of d(i,j) by the equation d*(i,j) = d(i,j) if bacterial samples i and j are not in the same cluster and by the equation d*(i,j) = r × d(i,j) if bacterial samples i and j are in the same cluster, with 0 ≤ r ≤ 1. Thus, the weight parameter r dictates how much the distance value will be reduced. The choice of these two parameters, thr and r, changes the results, and there is no obvious way of knowing what values are the best to use. Our approach was to use the entire range of values for the threshold thr, i.e., 0.05 to 1, and several ranges of values for the weight parameter r. For the former, this corresponds to a range from having each strain form its own cluster (thr = 0.05) to the opposite extreme where all strains are grouped into a single cluster (thr = 1). For each set of ranges, we created a dendrogram using UPGMA with the modified distance matrix; from the resulting set of dendrograms, we constructed a “generalized tree” using Consense from the software package PHYLIP (version 3.68) with the default parameters (10).
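The cluster-then-shrink step can be sketched as follows. This is a minimal illustration, not the authors' program: the function names are hypothetical, the clusters are obtained by stopping an average-linkage (UPGMA) agglomeration once the closest pair of clusters is farther apart than `thr`, and it is assumed that the first distance matrix is already scaled so that thresholds between 0.05 and 1 are meaningful.

```python
def upgma_clusters(dist, thr):
    """Cut a UPGMA clustering at distance thr; return one label per strain.

    Clusters are merged by average linkage until the closest pair of
    clusters is farther apart than thr (an assumed reading of "threshold").
    """
    clusters = [[i] for i in range(len(dist))]

    def average(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min(
            (average(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if d > thr:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def fuse(dist_to_modify, labels, r):
    """d*(i,j) = r * d(i,j) if i and j share a cluster, else d(i,j)."""
    n = len(dist_to_modify)
    return [
        [
            dist_to_modify[i][j] * (r if labels[i] == labels[j] else 1.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical normalized distances for three strains A, B, C.
pfge = [[0.0, 0.1, 0.8], [0.1, 0.0, 0.9], [0.8, 0.9, 0.0]]
mlva = [[0.0, 0.6, 0.5], [0.6, 0.0, 0.7], [0.5, 0.7, 0.0]]
labels = upgma_clusters(pfge, thr=0.2)   # PFGE defines the clusters
mlva_star = fuse(mlva, labels, r=0.5)    # MLVA distances are modified
```

Swapping the roles of `pfge` and `mlva` gives the second half of the procedure, and repeating both over a range of `thr` and `r` values yields the set of dendrograms that feeds the consensus step.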

The set of dendrograms used to construct the generalized tree is a combination of two sets of data, one created when PFGE clusters are used to modify the distance matrix and the other when MLVA clusters are used. When the sets are combined, a conflict will occur if they disagree completely on the relationship between two strains. For example, the PFGE data may indicate that strains A and B always occur as a pair, while the MLVA data may indicate that strains A and C always occur as a pair. If this happens, Consense (10) will construct a tree that depends on the order of the input. To prevent such an occurrence, we “break the tie” by multiplying the values of one set of data by 0.501 and the other set by 0.499.
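The effect of the 0.501/0.499 weighting can be seen in a toy calculation. The sketch below is a deliberately simplified stand-in for a weighted consensus and is not the Consense algorithm: each dendrogram is reduced to a list of clades (sets of strain names), each set of dendrograms shares its weight equally among its trees, and the support for a clade is the total weight of the trees that contain it. All names are hypothetical.

```python
from collections import defaultdict


def weighted_clade_support(tree_sets):
    """Sum, for every clade, the weights of the trees that contain it.

    tree_sets is a list of (weight, trees) pairs; each tree is a list of
    clades given as frozensets of strain names. The weight of a set is
    divided equally among its trees (an assumption of this sketch).
    """
    support = defaultdict(float)
    for weight, trees in tree_sets:
        for clades in trees:
            for clade in clades:
                support[clade] += weight / len(trees)
    return dict(support)


ab = frozenset({"A", "B"})
ac = frozenset({"A", "C"})

# One set of dendrograms always pairs A with B; the other always pairs A with C.
pfge_clustered = [[ab], [ab]]
mlva_clustered = [[ac], [ac]]

tied = weighted_clade_support([(0.5, pfge_clustered), (0.5, mlva_clustered)])
resolved = weighted_clade_support([(0.499, pfge_clustered), (0.501, mlva_clustered)])
winner = max(resolved, key=resolved.get)
```

With equal weights the two conflicting clades have identical support, so the outcome would depend on input order; with the slightly unequal weights one clade always wins.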

One other issue arises when the two different data sets are combined; their distance measures are not the same. The manner in which we resolved this was to scale the dendrogram produced by Consense using an average-distance matrix obtained by averaging the values of the normalized PFGE and MLVA distance matrices. Normalization was achieved by dividing all matrix values by the maximum distance to obtain values between 0 and 1. Then, the final distances were determined as follows. Find all leaves on a branch (e.g., branch 1 contains A and B; branch 2 contains C, D, and E; and F contains itself). Find all combinations of leaves on one branch and leaves on the other branch and obtain the distance for each combination from the average distance matrix. Use the maximum value as the distance between the two branches.
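The scaling procedure described above amounts to three small operations, sketched here with hypothetical function names and made-up input matrices: normalize each matrix by its maximum, average the two normalized matrices element by element, and take the largest cross-branch entry as the distance between two branches.

```python
def normalize(matrix):
    """Divide every entry by the largest entry so values lie in [0, 1]."""
    peak = max(max(row) for row in matrix)
    return [[value / peak for value in row] for row in matrix]


def average_matrix(matrix_1, matrix_2):
    """Element-wise mean of the two normalized distance matrices."""
    a, b = normalize(matrix_1), normalize(matrix_2)
    return [
        [(x + y) / 2.0 for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def branch_distance(leaves_1, leaves_2, avg):
    """Maximum averaged distance over all leaf pairs across two branches."""
    return max(avg[i][j] for i in leaves_1 for j in leaves_2)


# Hypothetical PFGE (Dice) and MLVA (SMM) distances for strains 0, 1, 2.
pfge = [[0.0, 0.2, 0.4], [0.2, 0.0, 0.8], [0.4, 0.8, 0.0]]
mlva = [[0.0, 10.0, 30.0], [10.0, 0.0, 40.0], [30.0, 40.0, 0.0]]

avg = average_matrix(pfge, mlva)
# Branch 1 holds strains 0 and 1; branch 2 holds strain 2.
d_branches = branch_distance([0, 1], [2], avg)
```

Using the maximum over leaf pairs means the branch distance is set by the most dissimilar pair of strains on opposite branches.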

The fusion program can be downloaded at http://www.vetmed.wsu.edu/research_vmp/MicroArrayLab/.

RESULTS AND DISCUSSION

Comparison of genetically diverse strains of S. enterica serovar Typhimurium. To compare genetically diverse strains of S. enterica serovar Typhimurium, generalized trees were constructed using the fusion, MLVA, and PFGE algorithms described above. For the fusion algorithm, we present results from weight parameter r between 0 and 1. Potential ties were broken by multiplying PFGE cluster data by a factor of 0.499 and MLVA cluster data by a factor of 0.501. This weights the analysis in favor of MLVA under the assumption that there is more phylogenetically relevant information available from MLVA data than from PFGE (7).

Assessing the validity of our analysis is complicated by the lack of a gold standard with which to compare our results. Indeed, barring a complete genome sequence for each strain and suitable algorithms for assessing genetic relationships, the only potential gold standards available are multilocus sequence typing (MLST) and phenotypic characteristics. Given the probable lack of genetic variation for intraserovar MLST comparisons (9), we chose to compare the S. enterica serovar Typhimurium strains using susceptibility to a panel of lytic phages as a measure of relatedness. The Centre for Infections of the Health Protection Agency (Colindale, London, United Kingdom) provided 38 phages (1), and 5 additional phages were developed at the National Microbiology Laboratory (Winnipeg, Canada). Our analysis assumes that strains with similar phage susceptibilities are more closely related than strains with dissimilar phage susceptibilities. All of the strains were subjected to susceptibility testing using a panel of 31 phages. Strains that were judged untypeable with this panel were subjected to testing with an additional 12 phages. Only one strain (8745) was considered untypeable using the combined panel of 43 lytic phages (see Tables S1a and b in the supplemental material).

Each dendrogram was divided into six clusters, A to F, and the strains within a cluster were compared according to their phage susceptibilities with the expectation that, on average, strains with greater genetic similarity will have fewer phage susceptibility mismatches. Because lytic phage susceptibility is unlikely to have a 1:1 correspondence with genetic similarity, we arbitrarily selected a cutoff where isolates were considered “more similar” if they had ≤7 differences in the lytic phage panel or “less similar” if they had >7 differences in lytic phage susceptibility. For this putatively diverse set of isolates, the correspondence between lytic phage results, fusion (Fig. 1), MLVA (Fig. 2), and PFGE (Fig. 3) was small but measurable (Table 1). The fusion results included one or three fewer phage susceptibility mismatches relative to the MLVA and PFGE results, respectively. We also examined the ability of the three approaches to distinctly classify unique phage types within the same clusters (Fig. 1 to 3). With a discrimination index calcu-

TABLE 1. Number of unrelated strains in each of the clusters delineated in the dendrograms of Fig. 1 to 3^a

Algorithm     No. of strains in cluster:     Total no.
              A    B    C    D    E    F
Fusion        1    2    0    0    0    1     4
MLVA only     1    0    0    3    0    1     5
PFGE only     0    2    3    2    0    0     7

^a A phage susceptibility panel (see Tables S1a and b in the supplemental material) was used as a proxy for genetic similarity in which we arbitrarily classified isolates as genetically similar if differing by ≤7 lytic phage susceptibilities and genetically distinct if differing by >7 lytic phage susceptibilities. Note that the number of strains in each cluster is not the same for each algorithm.

VOL. 48, 2010 FUSION ALGORITHM FOR EPIDEMIOLOGICAL ANALYSIS 4077

on June 10, 2020 by guesthttp://jcm

.asm.org/

Dow

nloaded from

Page 7: Improved Identification of Epidemiologically Related …erated by the restriction digest (4). In addition, while band patterns convey a crude degree of genetic relatedness, a large

FIG. 4. Generalized tree for 63 S. enterica serovar Newport isolates generated from 160 dendrograms. The dendrograms were obtained usingthe fusion algorithm with the weight parameter, r, between 0 and 1 and the threshold parameter, thr, between 0.05 and 1. Information includesisolate designation, collection date, county where the isolate was collected (Washington counties include Adams, Grant, King, Snohomish,Whatcom, and Yakima; Idaho counties include Gooding, Jerome, and Twin Falls; Clay, Clinton, and Utah Counties are in Nebraska, New York,and Utah, respectively), and antibiotic resistance phenotype (see Materials and Methods). Resistance profile abbreviations: A, ampicillin; C,chloramphenicol; K, kanamycin; Sxt, trimethoprim-sulfamethoxazole; S, streptomycin; T, tetracycline; Amc, amoxicillin-clavulanic acid; Su,triple-sulfa; Caz, ceftazidime; SUSCEPT, susceptible to all antimicrobials tested.

4078

on June 10, 2020 by guesthttp://jcm

.asm.org/

Dow

nloaded from

Page 8: Improved Identification of Epidemiologically Related …erated by the restriction digest (4). In addition, while band patterns convey a crude degree of genetic relatedness, a large

FIG. 5. Dendrogram for 63 S. enterica serovar Newport isolates constructed using UPGMA and a distance matrix obtained using a single-stepstepwise mutation model for VNTR data (MLVA only). The scale bar represents a measure of genetic distance.

VOL. 48, 2010 FUSION ALGORITHM FOR EPIDEMIOLOGICAL ANALYSIS 4079

on June 10, 2020 by guesthttp://jcm

.asm.org/

Dow

nloaded from

Page 9: Improved Identification of Epidemiologically Related …erated by the restriction digest (4). In addition, while band patterns convey a crude degree of genetic relatedness, a large

FIG. 6. Dendrogram for 63 S. enterica serovar Newport isolates constructed using UPGMA with Dice coefficients for PFGE data (PFGE only).The scale bar represents a measure of genetic distance.

4080 BROSCHAT ET AL. J. CLIN. MICROBIOL.

on June 10, 2020 by guesthttp://jcm

.asm.org/

Dow

nloaded from

Page 10: Improved Identification of Epidemiologically Related …erated by the restriction digest (4). In addition, while band patterns convey a crude degree of genetic relatedness, a large

lated as one minus the average proportion of unique phagetypes per cluster, the fusion algorithm had an improved but notstatistically significantly different discrimination index (0.34 �0.13 [average � standard error of the mean]) compared withthose of PFGE (0.25 � 0.12) and MLVA (0.20 � 0.10). Thus,for this relatively diverse set of isolates, we did not find acompelling benefit to merging PFGE and MLVA data usingthe fusion algorithm, although the fusion algorithm doesgive better statistical performance than random allocationof phage types to clusters (0.10, standard deviation � 0.03,n � 1,000).
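The discrimination index used for the serovar Typhimurium clusters, and its comparison against random allocation of phage types, can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study. It assumes that the "proportion of unique phage types" in a cluster is the number of distinct phage types divided by the number of strains in that cluster, and that random allocation means shuffling the phage-type labels across clusters of fixed size. The function names and the toy phage-type labels are hypothetical.

```python
import random
from statistics import mean, stdev


def discrimination_index(clusters):
    """One minus the mean of (distinct phage types / strains) over clusters."""
    proportions = [len(set(c)) / len(c) for c in clusters if c]
    return 1.0 - mean(proportions)


def random_allocation_baseline(clusters, n_trials=1000, seed=0):
    """Shuffle phage types across clusters while keeping cluster sizes fixed."""
    rng = random.Random(seed)
    labels = [p for c in clusters for p in c]
    sizes = [len(c) for c in clusters]
    scores = []
    for _ in range(n_trials):
        rng.shuffle(labels)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(labels[start:start + size])
            start += size
        scores.append(discrimination_index(shuffled))
    return mean(scores), stdev(scores)


# Hypothetical clusters of phage-type labels (not data from the study).
toy_clusters = [
    ["DT104", "DT104", "DT104", "U302"],
    ["DT120", "DT120"],
    ["DT193", "DT208", "DT193"],
]
observed = discrimination_index(toy_clusters)
baseline_mean, baseline_sd = random_allocation_baseline(toy_clusters)
```

With this definition, a higher index indicates that strains sharing a cluster tend to share a phage type, and the shuffled baseline gives a reference level expected when cluster membership carries no information about phage type.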

Comparison of genetically similar S. enterica serovar Newport strains. To compare genetically similar S. enterica serovar Newport strains, generalized dendrograms were constructed using the fusion algorithm, PFGE-only data, and MLVA-only data with the same parameters discussed in the previous section (Fig. 4, 5, and 6). As with the serovar Typhimurium analysis, we lacked a gold standard for assessing the performance of our results relative to the true genetic relationships, but we did have epidemiologically relevant information as inferred from the location (farm) where the isolates were collected. We assumed that isolates collected from the same farm were more likely to be genetically similar than isolates collected at different farms. Consequently, we examined all cases in the dendrogram where serovar Newport isolates were indistinguishable and calculated a discrimination index as one minus the average proportion of distinct farms within these “monophyletic” groups. The number of monophyletic groups ranged from 6 (fusion and PFGE) (Fig. 4 and 6) to 9 (MLVA) (Fig. 5), with a significantly better discrimination index for the fusion algorithm (0.76 ± 0.05) than for MLVA (0.31 ± 0.12) and PFGE (0.39 ± 0.09) (P = 0.004; analysis of variance). Based on this assessment procedure, combining the data from PFGE and MLVA analyses provided a greater degree of isolate discrimination as a function of the farm where the isolates were collected. While the antimicrobial resistance profiles cannot be used quantitatively to validate the fusion results, the clustering of the susceptible strains by the fusion algorithm (Fig. 4) relative to the clustering by the MLVA (Fig. 5) and PFGE (Fig. 6) algorithms provides further evidence that the fusion algorithm provides more biologically consistent classifications.
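The farm-based assessment described above can be sketched in two steps: collect the sets of indistinguishable isolates, then compute one minus the average proportion of distinct farms per set. The sketch below is illustrative only. It assumes that "indistinguishable" means a pairwise distance of zero, that single isolates do not form a group, and that the proportion for a group is the number of distinct farms divided by the number of isolates in the group. The isolate names, farms, distances, and function names are hypothetical.

```python
from statistics import mean


def indistinguishable_groups(isolates, distance):
    """Group isolates that are linked by zero pairwise distance (union-find)."""
    parent = {i: i for i in isolates}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distance.items():
        if d == 0:
            parent[find(a)] = find(b)

    groups = {}
    for i in isolates:
        groups.setdefault(find(i), []).append(i)
    # Assumption: a single isolate does not count as a group.
    return [sorted(g) for g in groups.values() if len(g) > 1]


def farm_discrimination_index(groups, farm_of):
    """One minus the mean of (distinct farms / isolates) over groups."""
    proportions = [len({farm_of[i] for i in g}) / len(g) for g in groups]
    return 1.0 - mean(proportions)


# Hypothetical isolates, farms, and pairwise distances (not study data).
toy_isolates = ["N1", "N2", "N3", "N4", "N5", "N6"]
toy_farms = {"N1": "farm_A", "N2": "farm_A", "N3": "farm_A",
             "N4": "farm_B", "N5": "farm_C", "N6": "farm_A"}
toy_distance = {("N1", "N2"): 0.0, ("N2", "N3"): 0.0, ("N4", "N5"): 0.0,
                ("N1", "N4"): 0.35, ("N5", "N6"): 0.10}

toy_groups = indistinguishable_groups(toy_isolates, toy_distance)
toy_index = farm_discrimination_index(toy_groups, toy_farms)
```

In this toy case the first group draws all three isolates from one farm and the second group draws its two isolates from two farms, so the index is pulled down by the second group. An index closer to 1 would mean that indistinguishable isolates mostly come from the same farm.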

Clearly, combining data from PFGE and MLVA can serve to buffer the extremes in the level of genetic variation measured by these two methods alone, but the degree of benefit may be a function of the genetic similarity among the isolates under consideration. In cases in which isolates represent a relatively diverse and epidemiologically unrelated group of bacteria, combining PFGE and MLVA data may provide only a minor benefit. In contrast, for closely related isolates, as judged by their collection from identical farm locations in this study, the combination of PFGE and MLVA data is likely to provide a higher degree of discrimination at the farm level. The advantage of increased discrimination for closely related isolates should extend to other cases of epidemiologically related strains, including situations where there is a need to trace food-borne disease outbreaks and infections in clinical settings.

ACKNOWLEDGMENTS

K. N. K. Baker provided invaluable technical assistance. Martin Wiedmann, Cornell University, kindly provided northeastern S. Typhimurium isolates.

This project has been funded in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contract no. NO1-AI-30055, and by the Agricultural Animal Health Program, College of Veterinary Medicine, Washington State University, Pullman, WA. Scholarship support for D.M. was provided by the Carl M. Hansen Foundation.

REFERENCES

1. Anderson, E. S., L. R. Ward, M. J. Saxe, and J. D. de Sa. 1977. Bacteriophage-typing designations of Salmonella typhimurium. J. Hyg. (Lond.) 78:297–300.

2. Bauer, A. W., W. M. Kirby, J. C. Sherris, and M. Turck. 1966. Antibiotic susceptibility testing by a standardized single disk method. Am. J. Clin. Pathol. 45:493–496.

3. Benson, G. 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27:573–580.

4. Call, D. R., J. G. Hallett, S. G. Mech, and M. Evans. 1998. Considerations for measuring genetic variation and population structure with multilocus fingerprinting. Mol. Ecol. 7:1337–1346.

5. Call, D. R., L. Orfe, M. A. Davis, S. Lafrentz, and M. S. Kang. 2008. Impact of compounding error on strategies for subtyping pathogenic bacteria. Foodborne Pathog. Dis. 5:505–516.

6. Davis, M. A., K. N. K. Baker, D. R. Call, L. D. Warnick, Y. Soyer, M. Wiedmann, Y. Grohn, P. L. McDonough, D. D. Hancock, and T. E. Besser. 2009. Multilocus variable-number tandem-repeat method for typing Salmonella enterica serovar Newport. J. Clin. Microbiol. 47:1934–1938.

7. Davis, M. A., D. D. Hancock, T. E. Besser, and D. R. Call. 2003. Evaluation of pulsed-field gel electrophoresis as a tool for determining the degree of genetic relatedness between strains of Escherichia coli O157:H7. J. Clin. Microbiol. 41:1843–1849.

8. Dietterich, T. G. 2000. Ensemble methods in machine learning, p. 1–15. In J. Kittler and F. Roli (ed.), Proceedings of the First International Workshop on Multiple Classifier Systems. Springer Verlag, London, United Kingdom.

9. Fakhr, M. K., L. K. Nolan, and C. M. Logue. 2005. Multilocus sequence typing lacks the discriminatory ability of pulsed-field gel electrophoresis for typing Salmonella enterica serovar Typhimurium. J. Clin. Microbiol. 43:2215–2219.

10. Felsenstein, J. 1989. PHYLIP: Phylogeny Inference Package (version 3.2). Cladistics 5:164–166.

11. Foley, S. L., S. Zhao, and R. D. Walker. 2007. Comparison of molecular typing methods for the differentiation of Salmonella foodborne pathogens. Foodborne Pathog. Dis. 4:253–276.

12. Gerner-Smidt, P., K. Hise, J. Kincaid, S. Hunter, S. Rolando, E. Hyytia-Trees, E. M. Ribot, and B. Swaminathan. 2006. PulseNet USA: a five-year update. Foodborne Pathog. Dis. 3:9–19.

13. Gulati, P., R. K. Varshney, and J. S. Virdi. 2009. Multilocus variable number tandem repeat analysis as a tool to discern genetic relationships among strains of Yersinia enterocolitica biovar 1A. J. Appl. Microbiol. 107:875–884.

14. Harbottle, H., D. G. White, P. F. McDermott, R. D. Walker, and S. Zhao. 2006. Comparison of multilocus sequence typing, pulsed-field gel electrophoresis, and antimicrobial susceptibility typing for characterization of Salmonella enterica serotype Newport isolates. J. Clin. Microbiol. 44:2449–2457.

15. Hopkins, K. L., C. Maguire, E. Best, E. Liebana, and E. J. Threlfall. 2007. Stability of multiple-locus variable-number tandem repeats in Salmonella enterica serovar Typhimurium. J. Clin. Microbiol. 45:3058–3061.

16. Lindstedt, B. A., E. Heir, E. Gjernes, and G. Kapperud. 2003. DNA fingerprinting of Salmonella enterica subsp. enterica serovar Typhimurium with emphasis on phage type DT104 based on variable number of tandem repeat loci. J. Clin. Microbiol. 41:1469–1479.

17. Lindstedt, B. A., M. Torpdahl, E. M. Nielsen, T. Vardund, L. Aas, and G. Kapperud. 2007. Harmonization of the multiple-locus variable-number tandem repeat analysis method between Denmark and Norway for typing Salmonella Typhimurium isolates and closer examination of the VNTR loci. J. Appl. Microbiol. 102:728–735.

18. Lindstedt, B. A., T. Vardund, L. Aas, and G. Kapperud. 2004. Multiple-locus variable-number tandem-repeats analysis of Salmonella enterica subsp. enterica serovar Typhimurium using PCR multiplexing and multicolor capillary electrophoresis. J. Microbiol. Methods 59:163–172.

19. NCCLS. 2003. Methods for dilution antimicrobial susceptibility tests for bacteria that grow aerobically; approved standard M7-A6. National Committee for Clinical Laboratory Standards, Wayne, PA.

20. NCCLS. 2003. Performance standards for antimicrobial susceptibility testing, 14th informational supplement, 13th ed. Approved standard M100-S13. National Committee for Clinical Laboratory Standards, Wayne, PA.

21. Ribot, E. M., M. A. Fair, R. Gautom, D. N. Cameron, S. B. Hunter, B. Swaminathan, and T. J. Barrett. 2006. Standardization of pulsed-field gel electrophoresis protocols for the subtyping of Escherichia coli O157:H7, Salmonella, and Shigella for PulseNet. Foodborne Pathog. Dis. 3:59–67.

22. Torpdahl, M., G. Sorensen, B. A. Lindstedt, and E. M. Nielsen. 2007. Tandem repeat analysis for surveillance of human Salmonella Typhimurium infections. Emerg. Infect. Dis. 13:388–395.

23. van Belkum, A., S. Scherer, L. van Alphen, and H. Verbrugh. 1998. Short-sequence DNA repeats in prokaryotic genomes. Microbiol. Mol. Biol. Rev. 62:275–293.

24. Vogler, A. J., C. Keys, Y. Nemoto, R. E. Colman, Z. Jay, and P. Keim. 2006. Effect of repeat copy number on variable-number tandem repeat mutations in Escherichia coli O157:H7. J. Bacteriol. 188:4253–4263.

25. Vogler, A. J., C. E. Keys, C. Allender, I. Bailey, J. Girard, T. Pearson, K. L. Smith, D. M. Wagner, and P. Keim. 2007. Mutations, mutation rates, and evolution at the hypervariable VNTR loci of Yersinia pestis. Mutat. Res. 616:145–158.

26. Walsh, B. 2001. Estimating the time to the most recent common ancestor for the Y chromosome or mitochondrial DNA for a pair of individuals. Genetics 158:897–912.

27. Zheng, J., C. E. Keys, S. Zhao, J. Meng, and E. W. Brown. 2007. Enhanced subtyping scheme for Salmonella enteritidis. Emerg. Infect. Dis. 13:1932–1935.
