October 1, 2013 21:5 WSPC/INSTRUCTION FILE 2013GIW_JBCB_VirusClassificationCopmarison_Final
Journal of Bioinformatics and Computational Biology
© Imperial College Press
COMPARING VIRUS CLASSIFICATION USING GENOMIC
MATERIALS ACCORDING TO DIFFERENT TAXONOMIC
LEVELS
JING-DOO WANG
Department of Computer Science and Information Engineering,
Asia University,No. 500, Lioufeng Rd. Wufeng, Taichung 41354, Taiwan.
jdwang@asia.edu.tw
In this paper, three genomic materials - DNA sequences, protein sequences and regions
(domains) - are used to compare methods of virus classification. Virus classes (categories)
are divided by taxonomic level into three datasets comprising 6 orders, 42 families
and 33 genera. To increase the robustness and comparability of the experimental results,
only classes that contain at least 10 instances are selected, and
each instance contains at least one region name. Experimental results show
that the approach using region names achieved the best accuracies, reaching 99.9%,
97.3% and 99.0% for 6 orders, 42 families and 33 genera, respectively. This paper not
only involves exhaustive experiments that compare virus classifications using different
genomic materials, but also proposes a novel approach to biological classification based
on molecular biology instead of traditional morphology.
Keywords: Virus Classification; Taxonomy; Genome Sequence; Protein Clustering; Region.
1. Introduction
Virus classification concerns the naming of viruses and the placing of viruses into a
taxonomic system. The two main systems currently used for virus classification
are the ICTV (International Committee on Taxonomy of Viruses) system4 and
the Baltimore classification system9. The former shares many features with the
system of classification of cellular organisms, such as taxon structure; the latter
places viruses into one of seven groups depending on a combination of their types
of nucleic acid (DNA or RNA), stranded-ness (single-stranded or double-stranded),
sense, and method of replication7.
Viruses are mainly classified by their phenotypes, such as morphology, type
of nucleic acid, mode of replication, host organisms, and the type of disease they
cause. Observing the phenotypes of viruses requires considerable effort on the part
of biologists (or virologists). Moreover, inconsistencies among observations
made at different laboratories or times may lead to disputes when attempts are
made to verify or classify unknown viruses. Viruses are diverse and flexible,
and many viruses exist whose taxa are still unknown and labeled "unclassified" in
the ICTV. Therefore, a novel approach for classifying viruses automatically and
precisely is sought.
A growing number of complete whole genomes are available in the NCBI 2, en-
abling research based on genome-wide comparisons. For example, some studies have
compared genomic signatures to analyze evolutionary relationships 15,28, to identify
signature genes for taxonomic characterization 16,17, to classify sequences 10,18, and
to elucidate viral phylogeny 33. Studies that involve genome-wide
sequence comparisons might address the challenge of making such comparisons
without sequence alignment 21,27.
To take advantage of available classifiers that are used in machine learning8 or
data mining32, instances (species) must be transformed into representative vectors
for virus classification in the vector space model13. To achieve the above vector
transformation precisely using genomic materials, two important issues must be
addressed. One is feature extraction, which identifies the characteristics (features)
of one class (category) of viruses that distinguish it from another. The other is
the design of a weighting method that can specify the relative importance of these
features.
Various studies of virus classification using genomic sequences have been
published34,29,30. In Ref. 34, Yu et al. proposed a natural vector approach that converted
each virus into a 12-dimensional vector according to the quantity and global distri-
bution of the nucleotides in its viral sequences, and then used the nearest neighbor
method to classify 2044 single-segment viruses at different levels of Baltimore class,
family, subfamily and genus. Their virus classification was computed quickly because
it took into account topological information about the viruses in advance. Wang 29
compared classifications of 35 virus families based on "DNA" (deoxyribonucleic acid)
and "Protein" (amino acid) sequences. To make the experiments more robust
and to extend them to different taxonomic levels, Wang 30 used 6 orders, 43 families and
33 genera for comparing virus classifications. However, the experimental results
conflicted with the original expectation that the approach based on protein
sequences should be more accurate than that based on DNA sequences. Moreover, in
the studies29,30, a group of protein sequences that were deemed to perform one biological
function were found to combine with another group with a different function,
making the functionality of the combined group ambiguous. To avoid this problem
of ambiguity, the "region" names (domains), within the notations of proteins in
NCBI, are the features that are used for virus classification in this paper.
Section 2.1 introduces the preprocessing steps for collecting and extracting these
three genomic materials.
In this study, experiments were performed to classify viruses in NCBI using existing
taxonomic levels. The experimental resources comprise three datasets: 6
orders, 42 families and 33 genera. Each class (category) in a dataset contains at
least 10 instances (species), each of which includes at least one region name.
Experimental results show that the approach that was based on
Fig. 1. The Processes of Extracting Virus Genomic Materials.
”region” name achieved the best accuracies of 99.9%, 97.3% and 99.0% with the
three datasets of 6 orders, 42 families and 33 genera, respectively. In summary, this
paper provides a novel approach for analyzing taxonomy using genomic materials
in the field of molecular biology, instead of using phenotypes.
The remainder of this paper is organized as follows. Section 2 describes the
method of transforming virus instances into representative vectors for the three genomic
materials. Section 3 presents the experimental results. Section 4 presents
discussions and possible avenues for future work. Section 5 draws conclusions.
2. Method
This paper presents two main processes for classifying viruses using genomic materials
in the vector space model22. One is to gather whole genomes of viruses and
extract genomic materials. The other is to transform each of the virus instances into
a representative vector using these genomic materials. Figure 1 and Fig. 2 present
these two processes. Section 2.1 and Section 2.2 describe the processes in detail.
2.1. Genomic Materials Extraction
As shown in Fig. 1, the compressed file "all.gbk.tar.gz" for virus genomes was first
downloaded from the NCBI FTP site2, and then the genomic materials, including
virus taxonomy, DNA sequences, protein sequences and each protein's "GI" number,
were extracted from the "GenBank flat file format" files that were derived from
"all.gbk.tar.gz". For example, as shown in Fig. 3, the genomic materials of the
virus "Bovine adenovirus A" were extracted from the file "NC_006324". Figure 3
presents the family and genus of the virus as "Adenoviridae" and "Mastadenovirus",
respectively.
The bottom of the figure displays the protein annotated with "CDS", its sequence,
labeled with the tag "/translation=", and DNA sequences. As presented in
Fig. 4, the region name "Adeno_E1A", for example, was extracted from the notation
of "YP_094027", which was downloaded automatically via a web agent11 by querying
with the number "GI:52801680" via the Entrez Programming Utilities (E-utilities)1.
Fig. 2. The Processes of Virus Classification via Genomic Materials.
Fig. 3. Genomic Materials extracted from the "NC_006324.gbk".
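The extraction step described in Section 2.1 can be sketched as follows. This is a minimal illustration rather than the author's extraction program: the record `TOY_RECORD` is an abbreviated, made-up stand-in for a GenBank flat file (only the organism, family, genus, accession and GI number are taken from the text; the coordinates and sequences are placeholders), and the function name `parse_genbank` is hypothetical.

```python
import re

# Abbreviated, made-up stand-in for a GenBank flat file; sequences are placeholders.
TOY_RECORD = '''LOCUS       NC_006324
  ORGANISM  Bovine adenovirus A
            Viruses; dsDNA viruses, no RNA stage; Adenoviridae;
            Mastadenovirus.
FEATURES             Location/Qualifiers
     CDS             1..12
                     /db_xref="GI:52801680"
                     /translation="MKTL"
ORIGIN
        1 atgaaaacgc tg
//
'''

def parse_genbank(text):
    """Pull taxonomy, protein GI numbers, protein and DNA sequences from one record."""
    organism = re.search(r"^\s+ORGANISM\s+(.+)$", text, re.M).group(1).strip()
    # The lineage lines sit between the ORGANISM line and the FEATURES header.
    lineage_block = re.search(r"ORGANISM.*?\n(.*?)\nFEATURES", text, re.S).group(1)
    lineage = [t.strip() for t in lineage_block.replace("\n", " ").rstrip(". ").split(";")]
    gi_numbers = re.findall(r'/db_xref="GI:(\d+)"', text)
    # Translations may wrap over several lines, so internal whitespace is removed.
    proteins = [re.sub(r"\s+", "", t) for t in re.findall(r'/translation="([^"]+)"', text)]
    origin = re.search(r"^ORIGIN(.*?)^//", text, re.S | re.M).group(1)
    dna = re.sub(r"[^a-z]", "", origin.lower())  # drop position numbers and spaces
    return {"organism": organism, "lineage": lineage, "gi": gi_numbers,
            "proteins": proteins, "dna": dna}

record = parse_genbank(TOY_RECORD)
print(record["lineage"][-2:], record["gi"])
```

The GI numbers collected this way are the keys with which the protein notations, and hence the region names, would then be requested from E-utilities; that network step is left out of the sketch.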
Fig. 4. Region name "Adeno_E1A" extracted from "YP_094027" (GI:52801680).
Table 1. The outline of processing vector transformation
                                        Vector Transformation
Genomic Materials    Feature Extraction     Feature Weighting    Vector Dimension (m)
DNA sequences        K-mers                 tf*(1/Entropy)       4^k
Protein sequences    sequence clustering    tf*idf               # of clusters
Regions (Domains)    region name            tf*idf               # of region names
2.2. Vector Transformation for Instances
With regard to the transformation into representative vectors, some practical
issues, such as feature extraction and weighting 14, should be considered. Table 1
gives an overview of the approaches to vector transformation based on the three genomic
materials.
As shown in Fig. 2, after the three genomic materials - DNA sequences, protein
sequences and region names - are extracted, virus instances must be transformed into representative
vectors using proper weighting methods such that each vector represents its original
instance precisely. After vector transformation, as shown in Fig. 2, LIBSVM12
was used to perform virus classification. In the following, Section 2.2.2, Section 2.2.3
and Section 2.2.4 describe the vector transformations of the three genomic materials.
Notably, the methods for transforming DNA sequences and protein sequences into
vectors were adopted from previous works29,30.
2.2.1. Notations
Let $\{C_1, C_2, \ldots, C_c\}$ be an actual partition of a data set $X$:
\[
X = \{\, x_{1,1}, x_{1,2}, \ldots, x_{1,n_1},\; x_{2,1}, x_{2,2}, \ldots, x_{2,n_2},\; \ldots,\; x_{c,1}, x_{c,2}, \ldots, x_{c,n_c} \,\}, \tag{1}
\]
where $x_{i,l} \in \mathbb{R}^m$ is the $l$th instance of the class $C_i$, $i = 1, 2, \ldots, c$; $l = 1, 2, \ldots, n_i$;
$N = \sum_{i=1}^{c} n_i$; $C_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}\}$; $\mathbb{R}$ denotes the set of real numbers; $m$ is the
number of dimensions in the vector model, and $c$ is the number of classes.
2.2.2. DNA Sequences vs. k-mer Approach
The k-mer approach is a well-known method for transforming sequences (strings) into
vectors20. Let $P_d$ be the $d$th pattern of k-mers. Let the Pattern Frequencies $\mathrm{PF}(P_d, C_i)$
and $\mathrm{PF}(P_d, x_{i,l})$ be the number of times the pattern $P_d$ appears in the class $C_i$ and in the
instance $x_{i,l}$, respectively. Let
\[
\mathrm{Prob}(P_d, C_i) = \frac{\mathrm{PF}(P_d, C_i)}{\sum_{j=1}^{c} \mathrm{PF}(P_d, C_j)}
\]
be the probability that $P_d$ is in class $C_i$. The Shannon entropy24 $\mathrm{Entropy}(P_d)$ of pattern $P_d$ across
$c$ classes is given by Eq. 2.
\[
\mathrm{Entropy}(P_d) = -\sum_{i=1}^{c} \mathrm{Prob}(P_d, C_i) \log\bigl(\mathrm{Prob}(P_d, C_i)\bigr). \tag{2}
\]
Given a value $k$ for the k-mer transformation of DNA sequences whose alphabet
contains 4 symbols, "A", "C", "G" and "T", one instance $x_{i,l}$ was
transformed herein into a $4^k$-dimensional vector as in Eq. 3.
\[
\langle x_{i,l} \rangle = \langle x^{1}_{i,l}, x^{2}_{i,l}, \ldots, x^{d}_{i,l}, \ldots, x^{4^k}_{i,l} \rangle, \tag{3}
\]
where $x^{d}_{i,l} = \mathrm{PF}(P_d, x_{i,l}) \cdot \frac{1}{\mathrm{Entropy}(P_d)}$, $1 \le d \le m = 4^k$. Notably, the well-known
weighting method tf*idf22 cannot be applied because when $k$ is small, such
as $k = 5$, the k-mers might appear in all of the sequences, possibly causing the idf
values of all k-mer patterns to be the same.
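The k-mer transformation with inverse-entropy weighting can be sketched as below. This is an illustrative sketch, not the implementation used in the experiments. One point is an assumption: the text does not say how a k-mer that occurs in only one class (Entropy = 0) is handled, so the hypothetical parameter `eps` is added to the entropy to keep the weight finite. The function names and the toy sequences are likewise made up for illustration.

```python
import math
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """PF(P_d, x): count every overlapping k-mer in one DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def inverse_entropy_weights(class_seqs, k, eps=1e-3):
    """Return {k-mer: 1 / (Entropy + eps)}, with Entropy taken across classes (Eq. 2).

    eps is an assumption of this sketch: it keeps the weight finite when a
    k-mer occurs in a single class (Entropy = 0).
    """
    # PF(P_d, C_i): pool the k-mer counts of all sequences in each class.
    class_pf = {label: sum((kmer_counts(s, k) for s in seqs), Counter())
                for label, seqs in class_seqs.items()}
    weights = {}
    for pattern in map("".join, product("ACGT", repeat=k)):
        total = sum(pf[pattern] for pf in class_pf.values())
        if total == 0:
            weights[pattern] = 0.0  # never observed, so its PF is 0 everywhere
            continue
        probs = [pf[pattern] / total for pf in class_pf.values() if pf[pattern] > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        weights[pattern] = 1.0 / (entropy + eps)
    return weights

def kmer_vector(seq, k, weights):
    """Eq. 3: one 4^k-dimensional vector, PF * (1 / Entropy) per k-mer."""
    counts = kmer_counts(seq, k)
    return [counts[p] * weights[p] for p in sorted(weights)]

# Toy sequences, made up for illustration only.
toy_classes = {"classA": ["AAAA"], "classB": ["AATT"]}
toy_weights = inverse_entropy_weights(toy_classes, k=2)
print(len(kmer_vector("AATT", 2, toy_weights)))
```

A k-mer spread evenly over all classes has the highest entropy and therefore the lowest weight, which is the behaviour the 1/Entropy factor is meant to provide.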
2.2.3. Protein Sequences vs. Clustering
The approach to clustering protein sequences, adopted from the previous work29, is
used in the rest of this paper. To transform viruses into vectors via protein sequences,
the protein sequences were clustered into groups under the simplifying assumption
that similar protein sequences had similar functionalities. The similarity
between two protein sequences was measured using the E value, written as $e^{-E}$, determined
using the "pblast" program 20; two protein sequences were put into the same group
if their E value $e^{-E}$ was smaller than a given threshold $e^{-T}$ (that is, $E > T$). To determine the
best value of the threshold $T$, several candidate values of $T$ are used in
experiments and the one that yields the highest accuracy is selected as the final
threshold value. After the protein sequences were clustered into $m$ groups, these
groups could be used to represent each virus as one $m$-dimensional
vector; meanwhile, the weighting for each dimension of that vector is determined
according to a weighting method that is similar to the tf*idf weighting approach22.
Let $CDS(x_{i,l})$ be the set of protein sequences that belong to $x_{i,l}$ and
let $|CDS(x_{i,l})|$ be the number of protein sequences in $CDS(x_{i,l})$. Let
$S = \bigcup_{1 \le i \le c,\, 1 \le l \le n_i} CDS(x_{i,l}) = \{s_1, s_2, s_3, \ldots, s_{|\#ofProteins|}\}$ be the set of all protein
sequences in $X$ and $|\#ofProteins|$ be the number of protein sequences in $S$.
First, all of the protein sequences in $S$ are mapped into $m$ distinct groups
$GID_1, GID_2, \ldots, GID_m$, in which the sequences in one group, such as $GID_d$,
$1 \le d \le m$, have similar functions. The similarity between two protein sequences,
for example, $s_p$ and $s_q$, is measured using the "pblast" program20, and $s_p$ and $s_q$
are clustered into the same group $GID_d$, $1 \le d \le m$, if the E value of $s_p$
relative to $s_q$ is under the given threshold $e^{-T}$ (T-value $T$), e.g. $T = 3$.
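The grouping rule above amounts to single-linkage clustering over pairwise E values, which can be sketched with a union-find structure. This is a sketch under stated assumptions: the pairwise E values are supplied as a plain dictionary (in the paper they come from the "pblast" program), and the function name, sequence labels and E values below are all hypothetical.

```python
import math

def single_linkage_groups(names, evalues, T):
    """Group sequences: any pair with E value below e^-T ends up in one group."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    threshold = math.exp(-T)
    for (p, q), e in evalues.items():
        if e < threshold:
            parent[find(p)] = find(q)  # join the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)

# Hypothetical E values for four sequences; s2 and s3 are not similar.
toy_evalues = {("s1", "s2"): 1e-10, ("s3", "s4"): 1e-8, ("s2", "s3"): 0.5}
print(single_linkage_groups(["s1", "s2", "s3", "s4"], toy_evalues, T=3))
```

Because single linkage joins groups through any one similar pair, a single sequence that is close to members of two groups is enough to fuse them into one.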
In this study, a weighting method similar to tf*idf 14 was adopted.
Let $CDS(x_{i,l})_{GID_d} = |CDS(x_{i,l}) \cap GID_d|$ be the number of protein sequences in $CDS(x_{i,l})$ that are
mapped to group $GID_d$. Let the Group Frequency of $GID_d$, $GF(GID_d)$, be the
number of instances that contain a CDS mapped to $GID_d$, and let
$IGF(GID_d)$ be the Inverse Group Frequency (IGF), $\log \frac{N}{GF(GID_d)}$. After all CDS
in $S$ are mapped to $m$ distinct groups $GID_d$, $1 \le d \le m$, each instance $x_{i,l}$ can
be represented as one vector $\langle x_{i,l} \rangle$ using Eq. 4.
\[
\langle x_{i,l} \rangle = \langle x^{1}_{i,l}, x^{2}_{i,l}, \ldots, x^{d}_{i,l}, \ldots, x^{|\#ofGroups|}_{i,l} \rangle, \tag{4}
\]
where $x^{d}_{i,l} = CDS(x_{i,l})_{GID_d} \cdot IGF(GID_d)$, $1 \le d \le m = |\#ofGroups|$.
2.2.4. Region Names from Protein Notation
The regions (domains) within one protein are well known to support a particular function of
that protein. After the region names are extracted and collected from the notation
of the proteins, as shown in Section 2.1, the "tf*idf" weighting method14 is applied
to transform vectors, where one region name is used as one term and one virus is
treated as one document. Accordingly, the term frequency (tf) of a region name for
one virus instance is the number of times that region name appears in the notations
of the proteins that belong to that virus; the document frequency (df) of one
region name is the number of viruses that contain that region name.
Let $r_d$ be the $d$th region name in the set of region names and let $tf(x_{i,l}, r_d)$ be the number of
times $r_d$ appears in the instance $x_{i,l}$. Let $df(r_d)$ be the number of instances in which
the notations for the proteins contain the region $r_d$ and let the inverse document
frequency $idf(r_d)$ be $\log\frac{N}{df(r_d)}$. One instance $x_{i,l}$ is then transformed into
a vector as follows.
\[
\langle x_{i,l} \rangle = \langle x^{1}_{i,l}, x^{2}_{i,l}, \ldots, x^{d}_{i,l}, \ldots, x^{|\#ofRegion|}_{i,l} \rangle, \tag{5}
\]
where $x^{d}_{i,l} = tf(x_{i,l}, r_d) \cdot idf(r_d)$, $1 \le d \le m = |\#ofRegion|$.
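The tf*idf transformation of Eq. 5 can be illustrated with a short sketch in which each virus is a document and each region name is a term. The function name and the toy input are hypothetical; apart from "Adeno_E1A", which appears in the text, the region names are placeholders.

```python
import math
from collections import Counter

def tfidf_vectors(virus_regions):
    """virus_regions: {virus: list of region names, one entry per occurrence}."""
    n = len(virus_regions)  # N, the number of instances
    vocab = sorted({r for regions in virus_regions.values() for r in regions})
    # df(r_d): number of viruses whose protein notations contain the region.
    df = Counter(r for regions in virus_regions.values() for r in set(regions))
    idf = {r: math.log(n / df[r]) for r in vocab}
    vectors = {}
    for virus, regions in virus_regions.items():
        tf = Counter(regions)  # tf(x, r_d): occurrences within this virus
        vectors[virus] = [tf[r] * idf[r] for r in vocab]
    return vocab, vectors

# Toy input; "RegionX" and "RegionY" are placeholder names.
toy = {"v1": ["Adeno_E1A", "Adeno_E1A", "RegionX"], "v2": ["RegionX"], "v3": ["RegionY"]}
vocab, vectors = tfidf_vectors(toy)
print(vocab, vectors["v1"])
```

The same computation covers the protein-cluster weighting of Eq. 4 if group identifiers are passed in place of region names, since the Inverse Group Frequency has the same form as idf.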
Table 2. The statistics of virus taxonomy
                 ICTV     NCBI     Selected (#ofSpecies)
# of Orders      7        6        6 (812)
# of Families    96       85       42 (1,922)
# of Genera      420      326      33 (693)
# of Species     2,618    2,406
Table 3. The statistics of six virus orders.
                                                                          Average (Per Virus)
Ci  Order            #ofViruses  DNA Length(bp)  #ofProteins  #ofRegions  DNA Length(bp)  #ofProteins  #ofRegions
1   Caudovirales     446         3628439253      39312        22158       8135514.0       88.1         49.7
2   Herpesvirales    47          856390145       4672         4518        18221066.9      99.4         96.1
3   Mononegavirales  64          7394800         493          544         115543.8        7.7          8.5
4   Nidovirales      33          8177344         293          818         247798.3        8.9          24.8
5   Picornavirales   114         1616712         171          780         14181.7         1.5          6.8
6   Tymovirales      108         3903885         520          926         36147.1         4.8          8.6
    Total            812         4505922139      45461        29744
3. Experimental Results
In this paper, the "easy.py" program from LIBSVM12 was used as the SVM classifier
for virus classification; meanwhile, 10-fold cross-validation was adopted to avoid the
over-fitting problem19. Notably, SVM is a well-known classifier in machine learning8
and LIBSVM supports multi-class classification. In the following, Section 3.1 gives
statistics of viruses in ICTV and NCBI, and of the viruses in 6 orders, 42 families
and 33 genera that were selected for the experiments. Section 3.2 compares the
accuracies of classification according to these three genomic materials.
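The 10-fold cross-validation protocol can be sketched independently of the classifier. In the sketch below the classifier is passed in as a callback named `train_and_predict`, a hypothetical interface; in the paper this role is played by LIBSVM. Spreading each class evenly over the folds (stratification) and the fixed random seed are assumptions of this sketch, not details taken from the text.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign every instance index to one fold, spreading each class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % n_folds].append(idx)
    return folds

def cross_validate(vectors, labels, train_and_predict, n_folds=10):
    """Hold out each fold once, train on the rest, and return overall accuracy."""
    correct = 0
    for held_out in stratified_folds(labels, n_folds):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        predictions = train_and_predict([vectors[i] for i in train],
                                        [labels[i] for i in train],
                                        [vectors[i] for i in held_out])
        correct += sum(p == labels[i] for p, i in zip(predictions, held_out))
    return correct / len(labels)
```

Every instance is used for testing exactly once, so the reported accuracy is never measured on data that the classifier saw during training.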
3.1. The Statistics of viruses
To provide a comprehensive understanding of the existing virus taxonomy in ICTV
(International Committee on Taxonomy of Viruses)4 and the viral genomes available
in NCBI (National Center for Biotechnology Information) 5, Table 2 gives the
statistics concerning virus taxonomy. Based on the official ICTV 2012 taxonomy25,
a total of 2,618 virus species belonged to 7 orders, 96 families and 420 genera. Based
on the whole virus genomes that were extracted from NCBI's FTP site2 when this
study started (2012-6-21), 2,406 virus species belonged to 6 orders, 85 families and 326
genera.
To ensure the robustness of experimental results and to provide three types of
comparable genomic materials for virus classification, 6 orders, 42 families and 33
genera were selected for experiments. Each of the classes (orders, families or genera)
contained at least 10 species and each of these species had at least one region name
that was tagged in the notation for the corresponding protein, as illustrated in Fig. 4.
Table 3, Table 4 and Table 5 provide details of the statistics of the DNA sequences, the
number of proteins, the number of region names and their corresponding averages
per species by the order, family and genus of the viruses, respectively.
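The selection rule can be written as a small filter. This sketch assumes one reading of the rule: species without any region name are discarded first, and a class is kept only if at least 10 species remain. The function name and the input layout are hypothetical.

```python
def select_classes(instances, min_instances=10):
    """instances: iterable of (species, class_label, region_names) tuples."""
    by_class = {}
    for species, label, regions in instances:
        if regions:  # keep only species with at least one region name
            by_class.setdefault(label, []).append(species)
    # Keep only classes that still have enough species.
    return {label: members for label, members in by_class.items()
            if len(members) >= min_instances}
```

Applied once per taxonomic level (order, family, genus), a filter of this kind would yield the three datasets summarized in Table 2.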
Table 4. The statistics of 42 virus families.
                                                                            Average (Per Virus)
Ci  Family             #ofViruses  DNA Length(bp)  #ofProteins  #ofRegions  DNA Length(bp)  #ofProteins  #ofRegions
1   Adenoviridae       25          27030426        794          776         1081217.0       31.8         31.0
2   Alphaflexiviridae  39          1452534         211          308         37244.5         5.4          7.9
3   Anelloviridae      34          360257          108          102         10595.8         3.2          3.0
4   Arenaviridae       25          1052628         100          124         42105.1         4.0          5.0
5   Astroviridae       11          209256          31           51          19023.3         2.8          4.6
6   Baculoviridae      51          960584193       7149         6328        18834984.2      140.2        124.1
7   Betaflexiviridae   45          1990938         238          476         44243.1         5.3          10.6
8   Bromoviridae       29          1115486         132          178         38465.0         4.6          6.1
9   Bunyaviridae       25          1406346         99           153         56253.8         4.0          6.1
10  Caliciviridae      19          369837          48           119         19465.1         2.5          6.3
11  Caulimoviridae     33          1193484         152          222         36166.2         4.6          6.7
12  Circoviridae       14          87332           44           48          6238.0          3.1          3.4
13  Closteroviridae    23          3856475         233          204         167672.8        10.1         8.9
14  Coronaviridae      29          7633320         259          765         263217.9        8.9          26.4
15  Dicistroviridae    14          278978          30           93          19927.0         2.1          6.6
16  Flaviviridae       52          612284          57           720         11774.7         1.1          13.8
17  Geminiviridae      254         6043944         1597         1907        23795.1         6.3          7.5
18  Herpesviridae      41          674477523       3885         4246        16450671.3      94.8         103.6
19  Inoviridae         26          2127893         287          213         81842.0         11.0         8.2
20  Luteoviridae       21          723719          126          165         34462.8         6.0          7.9
21  Microviridae       14          728653          140          126         52046.6         10.0         9.0
22  Myoviridae         104         2412036372      15976        8347        23192657.4      153.6        80.3
23  Nodaviridae        12          175085          39           25          14590.4         3.3          2.1
24  Papillomaviridae   67          3685566         479          594         55008.4         7.1          8.9
25  Paramyxoviridae    33          4401661         277          349         133383.7        8.4          10.6
26  Partitiviridae     18          169081          39           18          9393.4          2.2          1.0
27  Parvoviridae       52          1073344         209          246         20641.2         4.0          4.7
28  Picornaviridae     55          451983          59           465         8217.9          1.1          8.5
29  Podoviridae        88          236689280       4944         3163        2689650.9       56.2         35.9
30  Polyomaviridae     22          642091          125          221         29186.0         5.7          10.0
31  Potyviridae        82          1655075         168          753         20183.8         2.0          9.2
32  Poxviridae         27          929391315       4685         6871        34421900.6      173.5        254.5
33  Reoviridae         32          8989522         363          231         280922.6        11.3         7.2
34  Retroviridae       56          2283443         252          690         40775.8         4.5          12.3
35  Rhabdoviridae      25          2162899         169          128         86516.0         6.8          5.1
36  Secoviridae        32          726507          65           147         22703.3         2.0          4.6
37  Siphoviridae       248         959712930       17974        10412       3869810.2       72.5         42.0
38  Togaviridae        17          460911          40           180         27112.4         2.4          10.6
39  Tombusviridae      43          961398          227          184         22358.1         5.3          4.3
40  Totiviridae        26          301678          55           48          11603.0         2.1          1.8
41  Tymoviridae        22          413049          64           133         18775.0         2.9          6.0
42  Virgaviridae       37          1718589         196          339         46448.4         5.3          9.2
    Total              1922        6261437285      62125        50868
3.2. Comparison of Accuracies of Classification and Numbers of Dimensions of Vectors
Figure 5 and Fig. 6 present accuracies of virus classification by SVM classifiers for
two types of genomic materials, DNA and protein sequences, respectively. The values
of "k" and "T" in the experiments ranged from 1 to 8 and from 3 to 75,
respectively. As shown in Fig. 5 (Fig. 6), the best accuracies were 99.5% (98.0%),
93.7% (91.5%) and 98.1% (94.5%) when k=5 (T=30), k=4 (T=21) and k=6 (T=12)
were set with the three virus datasets of 6 orders, 42 families and 33 genera, respectively.
As shown in Table 6, the classification accuracies obtained using "region" names
were 99.9%, 97.3% and 99.0%, respectively. In this study, as shown in Table 6,
"Region" achieved the best accuracy. The numbers of dimensions of the vectors
for 42 families, for example, were 256 for "DNA" when k=4, 28,136 for "Protein"
when T=21, and 4,538 for "Region". Section 4.1 explains why the use of "Region"
yielded the best accuracy.
Table 5. The statistics of 33 virus genera.
                                                                                 Average (Per Virus)
Ci  Genus                   #ofViruses  DNA Length(bp)  #ofProteins  #ofRegions  DNA Length(bp)  #ofProteins  #ofRegions
1   Alphabaculovirus        35          700041223       5081         4826        20001177.8      145.2        137.9
2   Alphapapillomavirus     14          789935          100          137         56423.9         7.1          9.8
3   Alphatorquevirus        16          195467          52           55          12216.7         3.3          3.4
4   Alphavirus              16          441387          38           172         27586.7         2.4          10.8
5   Badnavirus              18          494642          65           111         27480.1         3.6          6.2
6   Begomovirus             123         3253448         813          965         26450.8         6.6          7.8
7   Begomovirus*            13          17843           13           24          1372.5          1.0          1.8
8   Betabaculovirus         12          225849175       1687         1298        18820764.6      140.6        108.2
9   Betacoronavirus         10          3013335         98           297         301333.5        9.8          29.7
10  Carlavirus              27          1403493         164          326         51981.2         6.1          12.1
11  Carmovirus              13          295267          74           51          22712.8         5.7          3.9
12  Circovirus              11          64075           33           38          5825.0          3.0          3.5
13  Crinivirus              11          2035824         122          93          185074.9        11.1         8.5
14  Dependovirus            15          191112          40           72          12740.8         2.7          4.8
15  Enterovirus             13          95345           13           130         7334.2          1.0          10.0
16  Flavivirus              37          442757          41           551         11966.4         1.1          14.9
17  Gammaretrovirus         13          278712          40           120         21439.4         3.1          9.2
18  Ilarvirus               14          559013          66           73          39929.5         4.7          5.2
19  Inovirus                14          1056331         147          137         75452.2         10.5         9.8
20  Mastadenovirus          16          17663233        519          561         1103952.1       32.4         35.1
21  Mastrevirus             13          137527          51           63          10579.0         3.9          4.8
22  Nepovirus               10          245938          20           54          24593.8         2.0          5.4
23  New world arenaviruses  18          756288          72           90          42016.0         4.0          5.0
24  Partitivirus            11          101890          23           11          9262.7          2.1          1.0
25  Parvovirus              11          243618          49           70          22147.1         4.5          6.4
26  Polerovirus             13          449983          78           106         34614.1         6.0          8.2
27  Polyomavirus            22          642091          125          221         29186.0         5.7          10.0
28  Potexvirus              31          1054674         162          236         34021.7         5.2          7.6
29  Potyvirus               64          1251743         128          600         19558.5         2.0          9.4
30  Sobemovirus             12          218780          52           43          18231.7         4.3          3.6
31  Tobamovirus             22          579754          90           168         26352.5         4.1          7.6
32  Tombusvirus             10          236700          50           60          23670.0         5.0          6.0
33  Tymovirus               15          282903          45           94          18860.2         3.0          6.3
    Total                   693         964383506       10151        11853
* Begomovirus-associated alphasatellites
Fig. 5. Accuracy Comparison using DNA sequences using various k values.
Table 6. Comparison of Classification Accuracy and Numbers of Dimensions of Vectors
                     DNA                  Protein                 Region
6 Virus Orders       99.5%, 1024 (k=5)    98.0%, 26942 (T=30)     99.9%, 2783
42 Virus Families    93.7%, 256 (k=4)     91.5%, 28136 (T=21)     97.3%, 4538
33 Virus Genera      98.1%, 4096 (k=6)    94.5%, 2223 (T=12)      99.0%, 2939
Fig. 6. Accuracy Comparison using protein sequences using various T values.
4. Discussions
4.1. Why ”region” yielded the best accuracy
Table 6 shows that "region" provided the best classification accuracy. The reasons
are discussed below. First, the frequency distributions of k-mers that were derived
from DNA sequences were used for virus classification. Generally, longer k-mers
present more specific features. However, two characteristics of viruses - their rapid
evolution and diversity - cause the frequency distribution of k-mers to be too sparse
to be used for classification purposes when the DNA sequences are short but the value
of k is large. Figure 5 shows that the accuracy decreases as the value of k increases
beyond 7.
Second, in this study, the protein sequences within the same group after protein
clustering were assumed to have similar functions. This assumption was used as a distinguishing
feature for further vector transformation processing. However, the protein
clustering approach used the "pblast" program to measure the similarity
between two protein sequences, and the single-linkage method was used to join
two groups into one. This approach might generate impure protein
groups, such that one group may exhibit two functions. For example, Fig. 7 presents
7 protein sequences, S1, S2, ..., S7, in two distinct groups (functions), GID1 and
GID2, determined by the single-linkage method with a threshold value T. The two
groups, GID1 and GID2, are formed due to regions R1 and R2, respectively, and
are disjoint because all of the distances between the nodes of GID1 and those of
GID2 are larger than $e^{-T}$. However, the appearance of S8, containing both R1 and
R2, results in the merging of GID1 and GID2 into GID3.
The region name is a distinguishing feature for classification in this study. With
respect to the distribution of class frequency (CF) of region names across 42 families,
which is shown in Fig. 8, the majority of region names had "CF=1" (78.45%);
that is, most of the region names appeared in only one
class. Notably, about 20% of the regions had "CF=2" (18.85%) or "CF=3"
(5.61%), and these regions, as in the case of S8 described above, may have caused the impurity
of protein groups.
Fig. 7. Two distinct groups, GID1 and GID2, are merged as the group GID3 due to the sequence S8 that contains R1 and R2.
Fig. 8. The distribution of class frequency (CF) of the "Region" derived from 42 families.
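The class frequency (CF) of a region name, that is, the number of classes in which it appears, and the resulting distribution can be computed as in the following sketch. The function name and input layout are hypothetical, and the code does not reproduce the figures reported above.

```python
from collections import Counter

def class_frequency_distribution(class_regions):
    """class_regions: {class_label: iterable of region names seen in that class}.

    Returns {CF value: fraction of region names with that CF}.
    """
    # CF(r): in how many classes does region name r appear at least once?
    cf = Counter(r for regions in class_regions.values() for r in set(regions))
    histogram = Counter(cf.values())
    total = len(cf)
    return {value: histogram[value] / total for value in sorted(histogram)}
```

A region name with CF=1 identifies its class on its own, which is consistent with the explanation given here for the high accuracy of the region-based vectors.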
4.2. Drawbacks of ”region” name annotation
As shown in Table 2, ICTV and NCBI contained 420 and 326 genera, respectively.
However, only 33 genera (693 viruses) were selected in the experiments owing to
the requirement that each class should contain at least ten instances that included
at least one region name. Hence, the majority of the viruses were not used in
the classification experiments, so the experimental results in this study are not robust
enough.
The "region" names were usually assigned manually, while related sequences
were aligned using RPS-BLAST against the CDD (Conserved Domain
Database) 23. The way in which a region name is assigned may have the side effect
that the related sequences are highly specific to some viral family, for example.
Therefore, the region names may carry information about the label of the
original family, which may leak the class label into classification experiments.
To avoid such a situation, region names should be annotated automatically without
knowledge of the class label, using the HMMER33 against PFAM6 or other automated
domain annotation tools. Doing so would make the proposed approach more
practical and provide more convincing experimental classification in the future.
4.3. Verifying fitness of class structure within existing virus taxonomy
The mis-classified instances are examined using a confusion matrix 26 to identify the
implicit relationship between two classes. The experimental results thus obtained
are not shown herein owing to the limitation on the number of pages. However,
analyzing the ambiguities among classes is a favorable way to evaluate the fitness of an
existing class structure31. After a feasible type of genomic material is selected from
existing genomic materials for classification, existing class structures of biological
taxonomy can be verified via molecular biology instead of traditional morphology.
Such work may provide clues for biologists or taxonomists to reinspect and adjust
existing class structures when they work with taxonomy in the future.
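A confusion matrix and the class pairs that are most often confused can be derived from true and predicted labels as sketched below. The function names are hypothetical, and no results from the paper are reproduced.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Rows are true classes, columns are predicted classes."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in classes] for t in classes]
    return classes, matrix

def most_confused_pairs(classes, matrix):
    """Off-diagonal cells, largest first: (count, true class, predicted class)."""
    pairs = [(matrix[i][j], classes[i], classes[j])
             for i in range(len(classes)) for j in range(len(classes))
             if i != j and matrix[i][j] > 0]
    return sorted(pairs, reverse=True)
```

Large off-diagonal counts point to pairs of classes that the chosen genomic material cannot separate, which are the candidates for the re-inspection of class structure discussed in this section.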
5. Conclusion
In this study, three genomic materials are used to compare methods of virus classification;
these are DNA sequences, protein sequences and region names. The first
two materials are extracted directly from virus genomes, and the last is obtained
from the annotation of the protein. The resources that are used in the experiments
are collected from taxonomic levels and include 6 orders, 42 families and 33
genera. Experimental results show that using "region" to classify viruses yielded
the best classification accuracy when the SVM classifier from LIBSVM was used.
The obtained accuracies were 99.9%, 97.3% and 99.0% for the three datasets that
comprised 6 orders, 42 families and 33 genera, respectively. This paper provides a
novel approach to classifying viruses based on molecular biology, instead of
the use of morphology. This approach, using genomic materials, can be applied
to classify other creatures (organisms). This work opens up a new way to determine
whether the existing taxonomic structure is suitable from the point of view of
molecular biology31.
Acknowledgment
This study is supported by Asia University, Taiwan under project 101-asia-30. The
author thanks the reviewers for their valuable comments and suggestions.
References
1. Entrez Programming Utilities Help, http://www.ncbi.nlm.nih.gov/books/NBK25501/.
2. FTP Site for Genomes in NCBI, ftp://ftp.ncbi.nih.gov/genomes.
3. HMMER, http://hmmer.janelia.org/.
4. International Committee on Taxonomy of Viruses (ICTV), http://www.ncbi.nlm.nih.gov/ICTVdb/.
5. National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/.
6. Pfam database, http://pfam.sanger.ac.uk/.
7. Wikipedia: Virus Classification, http://en.wikipedia.org/wiki/Virus_classification.
8. Alpaydin E, Introduction to Machine Learning, The MIT Press, 2004.
9. Baltimore D, Animal Virology, no. 4, Elsevier Science, 1976. ISBN 9780323142281.
10. Bazinet A, Cummings M, A comparative evaluation of sequence classification programs, BMC Bioinformatics 13(1):92+, 2012.
11. Burke SM, Torkington N, Aas G, Perl and LWP. Fetching Web Page, Parsing HTML, Writing Spiders and More, O'Reilly, Beijing, 2002.
12. Chang CC, Lin CJ, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
13. Croft B, Metzler D, Strohman T, Search Engines Information Retrieval in Practice, 1st ed., Addison Wesley, 2009.
14. Croft B, Metzler D, Strohman T, Search Engines: Information Retrieval in Practice, Addison-Wesley Publishing Company, USA, 2009. ISBN 0136072240, 9780136072249.
15. Deschavanne P, DuBow M, Regeard C, The use of genomic signature distance between bacteriophages and their hosts displays evolutionary relationships and phage growth cycle determination, Virology Journal 7, 2010.
16. Dutilh BE, He Y, Hekkelman ML, Huynen MA, Signature, a web server for taxonomic characterization of sequence samples using signature genes, Nucleic Acids Research (suppl 2):W470-W474.
17. Dutilh BE, Snel B, Ettema TJ, Huynen MA, Molecular biology and evolution 25, 2008.
18. Exarchos TP, Tsipouras MG, Papaloukas C, Fotiadis DI, A two-stage methodology for sequence classification based on sequential pattern mining and optimization, Data Knowl Eng 66(3):467-487, 2008.
19. Han J, Kamber M, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2007.
20. Jones NC, Pevzner PA, An Introduction to Bioinformatics Algorithms, MIT Press, 2004. ISBN 0-262-10106-8.
21. Jun SR, Sims GE, Wu GA, Kim SH, Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution, Proceedings of the National Academy of Sciences 107(1):133-138, 2010.
22. Manning CD, Raghavan P, Schütze H, Introduction to Information Retrieval, Cambridge University Press.
23. Marchler-Bauer A, Zheng C, Chitsaz F, Derbyshire MK, Geer LY, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Lanczycki CJ, Lu F, Lu S, Marchler GH, Song JS, Thanki N, Yamashita RA, Zhang D, Bryant SH, CDD: conserved domains and protein three-dimensional structure, Nucleic Acids Research 41(Database issue):D348-D352, 2013.
24. Mitchell TM, Machine Learning, The McGraw-Hill Companies, Inc, 1997.
25. International Committee on Taxonomy of Viruses, King A, Adams M, Lefkowitz E, Carstens E, Virus Taxonomy: IXth Report of the International Committee on Taxonomy of Viruses, Immunology and microbiology, Academic Press, 2011. ISBN 9780123846846.
26. Roiger R, Geatz MW, Data Mining: A Tutorial Based Primer, Addison Wesley, 2003.
27. Trifonov V, Rabadan R, Frequency analysis techniques for identification of viral genetic data, mBio 1(3), July/August 2010.
28. van Passel M, Kuramae E, Luyf A, Bart A, Boekhout T, The reach of the genome signature in prokaryotes, BMC Evolutionary Biology 6(1):84, 2006.
29. Wang JD, A Comparison study of Virus Classification by Genome Sequences, The 11th IEEE International Conference on Bioinformatics and Bioengineering, pp. 270-273, 2011.
30. Wang JD, Virus Classification via Genomic Sequences From Different Taxonomic Level, The 23rd International Conference on Genome Informatics, p. 76, 2012.
31. Wang JD, Liu HC, An Approach to Evaluate the Fitness of One Class Structure via Dynamic Centroids, Expert Systems with Applications 38(11):13764-13772, 2011.
32. Witten IH, Frank E, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Elsevier, 2011. ISBN 0120884070.
33. Wu GA, Jun SR, Sims GE, Kim SH, Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method, Proceedings of the National Academy of Sciences 106(31):12826-12831, 2009.
34. Yu C, Hernandez T, Zheng H, Yau SC, Huang HH, He RL, Yang J, Yau SST, Real time classification of viruses in 12 dimensions, PLOS ONE 8(5), 2013.
Jing-Doo Wang received his BS degree in Computer Science
and Information Engineering from the University of Tatung (formerly
Tatung Institute of Technology) in 1989, and his M.S. and
Ph.D. degrees in Computer Science and Information Engineering
from the University of Chung Cheng in 1993 and 2002, respectively.
He has been with Asia University (formerly Taichung Healthcare and Management
University) since spring 2003, where he is currently an assistant professor in the
Department of Computer Science and Information Engineering. He also holds a joint
appointment with the Department of Biomedical Informatics. His research interests
are in the areas of bioinformatics, text mining for trend analysis and the extraction
of maximal repeats via cloud computing.