
October 1, 2013 21:5 WSPC/INSTRUCTION FILE2013GIW˙JBCB˙VirusClassificationCopmarison˙Final

Journal of Bioinformatics and Computational Biology © Imperial College Press

COMPARING VIRUS CLASSIFICATION USING GENOMIC MATERIALS ACCORDING TO DIFFERENT TAXONOMIC LEVELS

JING-DOO WANG

Department of Computer Science and Information Engineering,

Asia University, No. 500, Lioufeng Rd., Wufeng, Taichung 41354, Taiwan.

[email protected]

In this paper, three genomic materials - DNA sequences, protein sequences and regions (domains) - are used to compare methods of virus classification. Virus classes (categories) are divided by taxonomic level into three datasets of 6 orders, 42 families and 33 genera. To increase the robustness and comparability of the experimental results, only classes that contain at least 10 instances are selected, and each instance contains at least one region name. Experimental results show that the approach using region names achieved the best accuracies, reaching 99.9%, 97.3% and 99.0% for 6 orders, 42 families and 33 genera, respectively. This paper not only involves exhaustive experiments that compare virus classifications using different genomic materials, but also proposes a novel approach to biological classification based on molecular biology instead of traditional morphology.

Keywords: Virus Classification; Taxonomy; Genome Sequence; Protein Clustering; Region.

1. Introduction

Virus classification concerns the naming of viruses and the placing of viruses into a

taxonomic system. The two main systems currently used for virus classification

are the ICTV (International Committee on Taxonomy of Viruses) system4 and

the Baltimore classification system9. The former shares many features with the

system of classification of cellular organisms, such as taxon structure; the latter

places viruses into one of seven groups depending on a combination of their types

of nucleic acid (DNA or RNA), stranded-ness (single-stranded or double-stranded),

sense, and method of replication7.

Viruses are mainly classified by their phenotypes, such as morphology, type

of nucleic acid, mode of replication, host organisms, and the type of disease they

cause. Observing the phenotypes of viruses requires considerable effort on the part

of biologists (or virologists). Moreover, the inconsistencies of their observations,

made at various laboratories or times, may lead to arguments when attempts are

made to verify or classify some unknown viruses. Viruses are diverse and flexible,


and many viruses exist whose taxa are still unknown and labeled "unclassified" in

the ICTV. Therefore, a novel approach for classifying viruses automatically and

precisely is sought.

A growing number of complete whole genomes are available in the NCBI 2, en-

abling research based on genome-wide comparisons. For example, some studies have

compared genomic signatures to analyze evolutionary relationships 15,28, to identify

signature genes for taxonomic characterization 16,17, to classify sequences 10,18, and

to elucidate viral phylogeny 33. Studies that involve genome-wide sequence comparisons might address the challenge of making such comparisons without sequence alignment 21,27.

To take advantage of available classifiers that are used in machine learning8 or

data mining32, instances (species) must be transformed into representative vectors

for virus classification in the vector space model13. To achieve the above vector

transformation precisely using genomic materials, two important issues must be

addressed. One is feature extraction, which identifies the characteristics (features)

of one class (category) of viruses that distinguish it from another. The other is

the design of a weighting method that can specify the relative importance of these

features.

Various studies of virus classification using genomic sequences have been

published34,29,30. In 34, Yu et al. proposed a natural vector approach that converted

each virus into a 12-dimensional vector according to the quantity and global distri-

bution of the nucleotides in its viral sequences, and then used the nearest neighbor

method to classify 2044 single-segment viruses at different levels of Baltimore class,

family, subfamily and genus. Their virus classification was computed quickly because

it took into account topological information about the viruses in advance. Wang 29

compared classifications of 35 virus families based on "DNA" (deoxyribonucleic acid) and "Protein" (amino acid) sequences. To make the experiments more robust and to extend them to different taxonomic levels, Wang 30 used 6 orders, 43 families and 33 genera for comparing virus classifications. However, the experimental results conflicted with the original expectation that the approach based on protein sequences would be more accurate than that based on DNA sequences. In those studies29,30, a group of protein sequences that were deemed to perform one biological function was found to combine with another group with a different function, making the functionality of the combined group ambiguous. To avoid this ambiguity, the "region" names (domains), within the notations of proteins in NCBI, are the features used for virus classification in this paper.

Section 2.1 introduces the preprocessing steps for collecting and extracting these three genomic materials. In this study, experiments were performed to classify viruses in NCBI using existing taxonomic levels. The experimental resources comprise three datasets: 6 orders, 42 families and 33 genera. Each class (category) in a dataset contains at least 10 instances (species), each of which includes at least one region name. Experimental results show that the approach based on "region" names achieved the best accuracies of 99.9%, 97.3% and 99.0% with the three datasets of 6 orders, 42 families and 33 genera, respectively. In summary, this paper provides a novel approach for analyzing taxonomy using genomic materials in the field of molecular biology, instead of using phenotypes.

Fig. 1. The Processes of Extracting Virus Genomic Materials.

The remainder of this paper is organized as follows. Section 2 describes the

method of transforming virus instances into representative vectors for three genomic materials. Section 3 presents the experimental results. Section 4 presents

discussions and possible avenues for future work. Section 5 draws conclusions.

2. Method

This paper presents two main processes for classifying viruses using genomic materials in the vector space model22. One is to gather whole genomes of viruses and extract genomic materials. The other is to transform each virus instance into a representative vector using these genomic materials. Fig.1 and Fig.2 present these two processes, and Section 2.1 and Section 2.2 describe them in detail.

2.1. Genomic Materials Extraction

As shown in Fig.1, the compressed file "all.gbk.tar.gz" for virus genomes was first downloaded from the NCBI FTP site2, and then the genomic materials, including virus taxonomy, DNA sequences, protein sequences and each protein's "GI" number, were extracted from the "GenBank flat file format" files contained in "all.gbk.tar.gz". For example, as shown in Fig.3, the genomic materials of the virus "Bovine adenovirus A" were extracted from the file "NC_006324". Fig.3 presents the family and genus of the virus as "Adenoviridae" and "Mastadenovirus", respectively.

The bottom of the figure displays the protein annotated with "CDS", its sequence, labeled with the tag "/translation=", and DNA sequences. As presented in Fig.4, the region name "Adeno_E1A", for example, was extracted from the notation of "YP_094027", which was downloaded automatically via a web agent11 by querying with the number "GI:52801680" via the Entrez Programming Utilities (E-utilities)1.

Fig. 2. The Processes of Virus Classification via Genomic Materials.

Fig. 3. Genomic Materials extracted from the "NC_006324.gbk".

Fig. 4. Region name "Adeno_E1A" extracted from the "YP_094027" (GI:52801680).
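The region-name extraction step can be illustrated with a minimal sketch. It assumes that the downloaded protein notation is plain text in which each region appears as a `/region_name="..."` qualifier; the sample record below is a hypothetical, heavily abbreviated excerpt (the feature location and the underscore in the region name are assumptions), and the function name is illustrative rather than taken from the paper.

```python
import re
from collections import Counter

# Hypothetical, abbreviated excerpt of a protein notation; real records are much longer.
SAMPLE_RECORD = '''\
FEATURES             Location/Qualifiers
     Region          1..100
                     /region_name="Adeno_E1A"
'''

# Matches the value of every /region_name qualifier in the text.
REGION_NAME = re.compile(r'/region_name="([^"]+)"')


def extract_region_names(record_text):
    """Return a Counter mapping each region name to its number of occurrences."""
    return Counter(REGION_NAME.findall(record_text))


region_counts = extract_region_names(SAMPLE_RECORD)
print(region_counts)
```

Summing such counters over all proteins of one virus would give the per-virus region counts that the weighting step in Section 2.2.4 starts from.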

Table 1. The outline of processing vector transformation

                                        Vector Transformation
Genomic Materials     Feature Extraction     Feature Weighting     Vector Dimension (m)
DNA sequences         K-mers                 tf*(1/Entropy)        4^k
Protein sequences     sequence clustering    tf*idf                # of clusters
Regions (Domains)     region name            tf*idf                # of region names

2.2. Vector Transformation for Instances

With regard to the processes of representative vector transformation, some practical

issues, such as feature extraction and weighting 14, should be considered. Table 1

gives an overview of approaches to vector transformation based on three genomic

materials.

As shown in Fig.2, after the three virus genomic materials - DNA sequences, protein sequences and region names - are extracted, virus instances must be transformed into representative vectors using proper weighting methods such that each vector represents its original instance precisely. After vector transformation, LIBSVM12 was used to perform virus classification. In the following, Section 2.2.2, Section 2.2.3 and Section 2.2.4 describe the vector transformations of the three genomic materials. Notably, the methods for transferring DNA sequences and protein sequences into vectors were adopted from previous works29,30.



2.2.1. Notations

Let $\{C_1, C_2, \ldots, C_c\}$ be an actual partition of a data set $X$:

\[
X = \{\, x_{1,1}, x_{1,2}, \ldots, x_{1,n_1},\; x_{2,1}, x_{2,2}, \ldots, x_{2,n_2},\; \ldots,\; x_{c,1}, x_{c,2}, \ldots, x_{c,n_c} \,\} \tag{1}
\]

where $x_{i,l} \in R^m$ is the $l$th instance of the class $C_i$, $i = 1, 2, \ldots, c$; $l = 1, 2, \ldots, n_i$; $N = \sum_{i=1}^{c} n_i$; $x_{i,1}, x_{i,2}, \ldots, x_{i,n_i} \in C_i$; $R$ represents the real numbers; $m$ is the number of dimensions in the vector model, and $c$ is the number of classes.

2.2.2. DNA Sequences vs. k-mer Approach

The k-mer approach is a well-known method for transferring sequences (strings) into vectors20. Let $P_d$ be the $d$th pattern of k-mers. Let the Pattern Frequencies $PF(P_d, C_i)$ and $PF(P_d, x_{i,l})$ be the numbers of times that pattern $P_d$ appears in the class $C_i$ and in the instance $x_{i,l}$, respectively. Let

\[
Prob(P_d, C_i) = \frac{PF(P_d, C_i)}{\sum_{j=1}^{c} PF(P_d, C_j)}
\]

be the probability that $P_d$ is in class $C_i$. The Shannon entropy24 $Entropy(P_d)$ of pattern $P_d$ across the $c$ classes is given by Eq.2.

\[
Entropy(P_d) = -\sum_{i=1}^{c} Prob(P_d, C_i) * \log(Prob(P_d, C_i)). \tag{2}
\]

Given a value $k$ for the k-mer transformation of DNA sequences whose alphabet contains 4 symbols, "A", "C", "G" and "T", one instance $x_{i,l}$ was transferred herein into a $4^k$-dimensional vector as in Eq.3.

\[
\langle x_{i,l} \rangle = \langle x^{1}_{i,l}, x^{2}_{i,l}, \ldots, x^{d}_{i,l}, \ldots, x^{4^k}_{i,l} \rangle, \tag{3}
\]

where $x^{d}_{i,l} = PF(P_d, x_{i,l}) * \frac{1}{Entropy(P_d)}$, $1 \le d \le m = 4^k$. Notably, the well-known weighting method tf*idf22 cannot be applied because when $k$ is small, such as $k = 5$, the k-mers might appear in all of the sequences, possibly causing the idf values of all k-mer patterns to be the same.
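A minimal sketch of the k-mer transformation with inverse-entropy weighting is given here for illustration. The function names, the use of the natural logarithm, and the small constant `eps` that guards against division by zero when a pattern occurs in only one class are assumptions of this sketch, not details stated in the text.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers that consist only of A, C, G and T."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts


def entropy_weights(class_seqs, k, eps=1e-9):
    """class_seqs maps a class label to its list of sequences.

    Returns the ordered pattern list and the weight 1 / (Entropy(P_d) + eps)
    for each pattern; eps is an assumption of this sketch.
    """
    patterns = ["".join(p) for p in product(ALPHABET, repeat=k)]
    pf_class = {label: Counter() for label in class_seqs}
    for label, seqs in class_seqs.items():
        for seq in seqs:
            pf_class[label].update(kmer_counts(seq, k))
    weights = {}
    for pattern in patterns:
        total = sum(pf_class[label][pattern] for label in class_seqs)
        entropy = 0.0
        for label in class_seqs:
            if total and pf_class[label][pattern]:
                prob = pf_class[label][pattern] / total
                entropy -= prob * math.log(prob)
        weights[pattern] = 1.0 / (entropy + eps)
    return patterns, weights


def to_vector(seq, k, patterns, weights):
    """Transform one sequence into a 4^k-dimensional weighted vector."""
    counts = kmer_counts(seq, k)
    return [counts[p] * weights[p] for p in patterns]


# Toy data: two classes with one short sequence each.
toy_classes = {"C1": ["AAAA"], "C2": ["AATT"]}
toy_patterns, toy_weights = entropy_weights(toy_classes, k=1)
toy_vector = to_vector("AAAA", 1, toy_patterns, toy_weights)
print(len(toy_vector))
```

Patterns concentrated in a single class have entropy zero and therefore receive a very large weight in this sketch; how such patterns were handled in the original experiments is not specified.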

2.2.3. Protein Sequences vs. Clustering

The approach to clustering protein sequences, adopted from the previous work29, is used in the rest of this paper. To transfer viruses into vectors via protein sequences, the protein sequences were clustered into groups under the simplifying assumption that similar protein sequences had similar functionalities. The similarity between two protein sequences was measured using the E value, written as $e^{-E}$ and determined using the "pblast" program 20; two protein sequences were put into the same group if the exponent $E$ of their E value was greater than a given threshold $T$, that is, if the E value was below $e^{-T}$. To determine the best value of the threshold $T$, several candidate values of $T$ are used in experiments and the one that yields the highest accuracy is selected as the final threshold value. After the protein sequences were clustered into $m$ groups, these groups could be used to represent each virus as one $m$-dimensional vector; meanwhile, the weight for each dimension of that vector is determined according to a weighting method that is similar to the tf*idf weighting approach22.

Let $CDS(x_{i,l})$ be the set of protein sequences that belong to $x_{i,l}$ and let $|CDS(x_{i,l})|$ be the number of protein sequences in $CDS(x_{i,l})$. Let $S = \bigcup_{1 \le i \le c,\, 1 \le l \le n_i} CDS(x_{i,l}) = \{s_1, s_2, s_3, \ldots, s_{|\#ofProteins|}\}$ be the set of all protein sequences in $X$ and $|\#ofProteins|$ be the number of protein sequences in $S$. First, all of the protein sequences in $S$ are mapped into $m$ distinct groups, $GID_1, GID_2, \ldots, GID_m$, in which the sequences in one group, such as $GID_d$, $1 \le d \le m$, are assumed to have similar functions. The similarity between two protein sequences, for example $s_p$ and $s_q$, is measured using the "pblast" program20, and $s_p$ and $s_q$ are clustered into the same group $GID_d$, $1 \le d \le m$, if the similarity (E value) of $s_p$ relative to $s_q$ is under the given threshold T-value ($T$), $e^{-T}$, e.g. $T = 3$.
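The grouping step can be sketched as single-linkage clustering over pairwise hits, the joining rule that Section 4.1 names. The sketch assumes that each hit is given as a triple `(p, q, exponent)`, where the E value of the pair is $e^{-exponent}$, and that a pair is linked when the exponent exceeds $T$; the function name and input layout are illustrative, and a union-find structure is one possible way to implement the joining.

```python
def single_linkage_groups(n_sequences, hits, threshold_t):
    """Group sequence indices 0..n_sequences-1 by single linkage.

    hits is an iterable of (p, q, exponent); a pair is linked when its
    exponent is greater than threshold_t (an assumption of this sketch).
    """
    parent = list(range(n_sequences))

    def find(a):
        # Follow parent links to the root, compressing the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for p, q, exponent in hits:
        if exponent > threshold_t:
            parent[find(p)] = find(q)

    groups = {}
    for s in range(n_sequences):
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())


# Toy hits: the weak hit between sequences 2 and 3 only links them for small T.
toy_hits = [(0, 1, 10), (1, 2, 8), (3, 4, 12), (2, 3, 2)]
print(single_linkage_groups(5, toy_hits, threshold_t=3))
print(single_linkage_groups(5, toy_hits, threshold_t=1))
```

Because one sufficiently strong hit is enough to join two groups, lowering the threshold can merge groups that were otherwise separate, which is why several candidate values of $T$ are compared.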

In this study, a weighting method similar to tf*idf 14 was adopted. Let $CDS(x_{i,l})_{GID_d} = |CDS(x_{i,l}) \cap GID_d|$ be the number of sequences in $CDS(x_{i,l})$ that are mapped to group $GID_d$. Let the Group Frequency of $GID_d$, $GF(GID_d)$, be the number of instances that contain a CDS mapped to $GID_d$, and let $IGF(GID_d)$ be the Inverse Group Frequency (IGF), $\log \frac{N}{GF(GID_d)}$. After all CDS in $S$ are mapped to $m$ distinct groups $GID_d$, $1 \le d \le m$, each instance $x_{i,l}$ can be represented as one vector $\langle x_{i,l} \rangle$ using Eq.4.

\[
\langle x_{i,l} \rangle = (x^{1}_{i,l}, x^{2}_{i,l}, \ldots, x^{d}_{i,l}, \ldots, x^{|\#ofGroups|}_{i,l}) \tag{4}
\]

where $x^{d}_{i,l} = CDS(x_{i,l})_{GID_d} * IGF(GID_d)$, $1 \le d \le m = |\#ofGroups|$.

2.2.4. Region Names from Protein Notation

The regions (domains) within one protein are well known to support a particular function of that protein. After the region names are extracted and collected from the notation of the proteins, as described in Section 2.1, the "tf*idf" weighting method14 is applied to transform vectors, where one region name is used as one term and one virus is treated as one document. Accordingly, the term frequency (tf) of a region name for one virus instance is the number of times that region name appears in the notations of the proteins that belong to that virus; the document frequency (df) of one region name is the number of viruses that contain that region name.

Let $r_d$ be the $d$th region name in the set of region names and let $tf(x_{i,l}, r_d)$ be the number of times that $r_d$ appears in the instance $x_{i,l}$. Let $df(r_d)$ be the number of instances in which the notations of the proteins contain the region $r_d$, and let the inverse document frequency $idf(r_d)$ be $\log(\frac{N}{df(r_d)})$. One instance $x_{i,l}$ is then transformed into a vector as follows.

\[
\langle x_{i,l} \rangle = \langle x^{1}_{i,l}, x^{2}_{i,l}, \ldots, x^{d}_{i,l}, \ldots, x^{|\#ofRegion|}_{i,l} \rangle, \tag{5}
\]

where $x^{d}_{i,l} = tf(x_{i,l}, r_d) * idf(r_d)$, $1 \le d \le m = |\#ofRegion|$.
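The tf*idf transformation of region names can be sketched as follows. The input layout (one list of region names per virus), the function name, the placeholder region names and the natural logarithm are assumptions of this sketch; the text does not state the base of the logarithm.

```python
import math
from collections import Counter


def tfidf_vectors(instances):
    """instances is a list with one list of region names per virus.

    Returns the sorted vocabulary of region names and one tf*idf vector
    per virus, with idf(r) = log(N / df(r)).
    """
    n = len(instances)
    tfs = [Counter(regions) for regions in instances]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each virus counts once per region name
    vocabulary = sorted(df)
    idf = {r: math.log(n / df[r]) for r in vocabulary}
    vectors = [[tf[r] * idf[r] for r in vocabulary] for tf in tfs]
    return vocabulary, vectors


# Placeholder region names R1, R2, R3 for three hypothetical viruses.
toy_instances = [["R1", "R1", "R2"], ["R2"], ["R3"]]
toy_vocabulary, toy_vectors = tfidf_vectors(toy_instances)
print(toy_vocabulary)
print(toy_vectors[0])
```

A region name that occurs in every virus gets idf equal to zero and thus contributes nothing to any vector, while a name confined to a few viruses is weighted up.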



Table 2. The statistics of virus taxonomy

                  ICTV     NCBI    Selected (#ofSpecies)
# of Orders          7        6     6 (812)
# of Families       96       85    42 (1,922)
# of Genera        420      326    33 (693)
# of Species     2,618    2,406

Table 3. The statistics of six virus orders.

                                                                                 Average (Per Virus)
Ci  Order                 #ofViruses  DNA Length(bp)  #ofProteins  #ofRegions  DNA Length(bp)  #ofProteins  #ofRegions
 1  Caudovirales                 446      3628439253        39312       22158       8135514.0         88.1        49.7
 2  Herpesvirales                 47       856390145         4672        4518      18221066.9         99.4        96.1
 3  Mononegavirales               64         7394800          493         544        115543.8          7.7         8.5
 4  Nidovirales                   33         8177344          293         818        247798.3          8.9        24.8
 5  Picornavirales               114         1616712          171         780         14181.7          1.5         6.8
 6  Tymovirales                  108         3903885          520         926         36147.1          4.8         8.6
    Total                        812      4505922139        45461       29744

3. Experimental Results

In this paper, the ”easy.py” program from LIBSVM12 was used as the SVM classifier

for virus classification; meanwhile 10-fold cross-validation was adopted to avoid the

over-fitting problem19. Notably, SVM is a well-known classifier in machine learning8

and LIBSVM supports multi-class classification. In the following, Section 3.1 gives statistics of viruses in ICTV and NCBI, and of the viruses in the 6 orders, 42 families and 33 genera that were selected for the experiments. Section 3.2 compares the accuracies of classification according to these three genomic materials.

3.1. The Statistics of viruses

To provide a comprehensive understanding of existing virus taxonomy in ICTV

(International Committee on Taxonomy of Viruses)4 and the viral genomes avail-

able in NCBI (National Center for Biotechnology Information) 5, Table 2 gives the

statistics concerning virus taxonomy. Based on the official ICTV 2012 taxonomy25, a total of 2,618 virus species belonged to 7 orders, 96 families and 420 genera. Based on the whole virus genomes that were extracted from NCBI's FTP site2 when this study started (2012-6-21), 2,406 virus species belonged to 6 orders, 85 families and 326 genera.

To ensure the robustness of experimental results and to provide three types of

comparable genomic materials for virus classification, 6 orders, 42 families and 33

genera were selected for experiments. Each of the classes (orders, families or genera)

contained at least 10 species and each of these species had at least one region name

that was tagged in notation for the corresponding protein, as described in Fig.4. Ta-

ble 3, Table 4 and Table 5 provide details of the statistics of the DNA sequences, the

number of proteins, the number of region names and their corresponding averages

per species by the order, family and genus of the viruses, respectively.
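The selection rule described above can be sketched in a few lines. The sketch assumes that each species is available as a triple of name, class label and list of region names; this layout, the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def select_classes(species, min_instances=10):
    """Keep classes with at least min_instances species that have a region name.

    species is an iterable of (name, class_label, region_names).
    """
    by_class = defaultdict(list)
    for name, label, regions in species:
        if regions:  # the species must carry at least one region name
            by_class[label].append(name)
    return {label: names for label, names in by_class.items()
            if len(names) >= min_instances}


# Toy data: class "A" has 10 annotated species; class "B" has 12 species,
# but only 9 of them carry a region name, so "B" is dropped.
toy_species = [("a%d" % i, "A", ["R1"]) for i in range(10)]
toy_species += [("b%d" % i, "B", ["R2"] if i < 9 else []) for i in range(12)]
selected = select_classes(toy_species)
print(sorted(selected))
```

Applying the same filter separately at the order, family and genus levels would yield three datasets of the kind summarized in Table 2.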



Table 4. The statistics of 42 virus families.

                                                                                 Average (Per Virus)
Ci  Family                #ofViruses  DNA Length(bp)  #ofProteins  #ofRegions  DNA Length(bp)  #ofProteins  #ofRegions
 1  Adenoviridae                  25        27030426          794         776       1081217.0         31.8        31.0
 2  Alphaflexiviridae             39         1452534          211         308         37244.5          5.4         7.9
 3  Anelloviridae                 34          360257          108         102         10595.8          3.2         3.0
 4  Arenaviridae                  25         1052628          100         124         42105.1          4.0         5.0
 5  Astroviridae                  11          209256           31          51         19023.3          2.8         4.6
 6  Baculoviridae                 51       960584193         7149        6328      18834984.2        140.2       124.1
 7  Betaflexiviridae              45         1990938          238         476         44243.1          5.3        10.6
 8  Bromoviridae                  29         1115486          132         178         38465.0          4.6         6.1
 9  Bunyaviridae                  25         1406346           99         153         56253.8          4.0         6.1
10  Caliciviridae                 19          369837           48         119         19465.1          2.5         6.3
11  Caulimoviridae                33         1193484          152         222         36166.2          4.6         6.7
12  Circoviridae                  14           87332           44          48          6238.0          3.1         3.4
13  Closteroviridae               23         3856475          233         204        167672.8         10.1         8.9
14  Coronaviridae                 29         7633320          259         765        263217.9          8.9        26.4
15  Dicistroviridae               14          278978           30          93         19927.0          2.1         6.6
16  Flaviviridae                  52          612284           57         720         11774.7          1.1        13.8
17  Geminiviridae                254         6043944         1597        1907         23795.1          6.3         7.5
18  Herpesviridae                 41       674477523         3885        4246      16450671.3         94.8       103.6
19  Inoviridae                    26         2127893          287         213         81842.0         11.0         8.2
20  Luteoviridae                  21          723719          126         165         34462.8          6.0         7.9
21  Microviridae                  14          728653          140         126         52046.6         10.0         9.0
22  Myoviridae                   104      2412036372        15976        8347      23192657.4        153.6        80.3
23  Nodaviridae                   12          175085           39          25         14590.4          3.3         2.1
24  Papillomaviridae              67         3685566          479         594         55008.4          7.1         8.9
25  Paramyxoviridae               33         4401661          277         349        133383.7          8.4        10.6
26  Partitiviridae                18          169081           39          18          9393.4          2.2         1.0
27  Parvoviridae                  52         1073344          209         246         20641.2          4.0         4.7
28  Picornaviridae                55          451983           59         465          8217.9          1.1         8.5
29  Podoviridae                   88       236689280         4944        3163       2689650.9         56.2        35.9
30  Polyomaviridae                22          642091          125         221         29186.0          5.7        10.0
31  Potyviridae                   82         1655075          168         753         20183.8          2.0         9.2
32  Poxviridae                    27       929391315         4685        6871      34421900.6        173.5       254.5
33  Reoviridae                    32         8989522          363         231        280922.6         11.3         7.2
34  Retroviridae                  56         2283443          252         690         40775.8          4.5        12.3
35  Rhabdoviridae                 25         2162899          169         128         86516.0          6.8         5.1
36  Secoviridae                   32          726507           65         147         22703.3          2.0         4.6
37  Siphoviridae                 248       959712930        17974       10412       3869810.2         72.5        42.0
38  Togaviridae                   17          460911           40         180         27112.4          2.4        10.6
39  Tombusviridae                 43          961398          227         184         22358.1          5.3         4.3
40  Totiviridae                   26          301678           55          48         11603.0          2.1         1.8
41  Tymoviridae                   22          413049           64         133         18775.0          2.9         6.0
42  Virgaviridae                  37         1718589          196         339         46448.4          5.3         9.2
    Total                       1922      6261437285        62125       50868

3.2. Comparison of Accuracies of Classification and Numbers of

Dimensions of Vectors

Figure 5 and Fig.6 present accuracies of virus classification by SVM classifiers for

two types of genomic materials, DNA and protein sequences, respectively. The val-

ues of ”k” and ”T” in the experiments ranged from 1 to 8 and from 3 to 75,

respectively. As shown in Fig.5 (Fig.6), the best accuracies were 99.5% (98.0%), 93.7% (91.5%) and 98.1% (94.5%) when k=5 (T=30), k=4 (T=21) and k=6 (T=12) were set with the three virus datasets of 6 orders, 42 families and 33 genera, respectively.

As shown in Table 6, the classification accuracies obtained using "region" names were 99.9%, 97.3% and 99.0%, respectively; thus "Region" achieved the best accuracy in this study. The numbers of dimensions of the vectors for the 42 families, for example, were 256 for "DNA" when k=4, 28,136 for "Protein" when T=21, and 4,538 for "Region". Section 4.1 explains why the use of "Region" yielded the best accuracy.



Table 5. The statistics of 33 virus genera.

                                                                                 Average (Per Virus)
Ci  Genus                 #ofViruses  DNA Length(bp)  #ofProteins  #ofRegions  DNA Length(bp)  #ofProteins  #ofRegions
 1  Alphabaculovirus              35       700041223         5081        4826      20001177.8        145.2       137.9
 2  Alphapapillomavirus           14          789935          100         137         56423.9          7.1         9.8
 3  Alphatorquevirus              16          195467           52          55         12216.7          3.3         3.4
 4  Alphavirus                    16          441387           38         172         27586.7          2.4        10.8
 5  Badnavirus                    18          494642           65         111         27480.1          3.6         6.2
 6  Begomovirus                  123         3253448          813         965         26450.8          6.6         7.8
 7  Begomovirus*                  13           17843           13          24          1372.5          1.0         1.8
 8  Betabaculovirus               12       225849175         1687        1298      18820764.6        140.6       108.2
 9  Betacoronavirus               10         3013335           98         297        301333.5          9.8        29.7
10  Carlavirus                    27         1403493          164         326         51981.2          6.1        12.1
11  Carmovirus                    13          295267           74          51         22712.8          5.7         3.9
12  Circovirus                    11           64075           33          38          5825.0          3.0         3.5
13  Crinivirus                    11         2035824          122          93        185074.9         11.1         8.5
14  Dependovirus                  15          191112           40          72         12740.8          2.7         4.8
15  Enterovirus                   13           95345           13         130          7334.2          1.0        10.0
16  Flavivirus                    37          442757           41         551         11966.4          1.1        14.9
17  Gammaretrovirus               13          278712           40         120         21439.4          3.1         9.2
18  Ilarvirus                     14          559013           66          73         39929.5          4.7         5.2
19  Inovirus                      14         1056331          147         137         75452.2         10.5         9.8
20  Mastadenovirus                16        17663233          519         561       1103952.1         32.4        35.1
21  Mastrevirus                   13          137527           51          63         10579.0          3.9         4.8
22  Nepovirus                     10          245938           20          54         24593.8          2.0         5.4
23  New world arenaviruses        18          756288           72          90         42016.0          4.0         5.0
24  Partitivirus                  11          101890           23          11          9262.7          2.1         1.0
25  Parvovirus                    11          243618           49          70         22147.1          4.5         6.4
26  Polerovirus                   13          449983           78         106         34614.1          6.0         8.2
27  Polyomavirus                  22          642091          125         221         29186.0          5.7        10.0
28  Potexvirus                    31         1054674          162         236         34021.7          5.2         7.6
29  Potyvirus                     64         1251743          128         600         19558.5          2.0         9.4
30  Sobemovirus                   12          218780           52          43         18231.7          4.3         3.6
31  Tobamovirus                   22          579754           90         168         26352.5          4.1         7.6
32  Tombusvirus                   10          236700           50          60         23670.0          5.0         6.0
33  Tymovirus                     15          282903           45          94         18860.2          3.0         6.3
    Total                        693       964383506        10151       11853

* Begomovirus-associated alphasatellites

Fig. 5. Accuracy Comparison using DNA sequences using various k values.

Table 6. Comparison of Classification Accuracy and Numbers of Dimensions of Vectors

                      DNA                   Protein                 Region
6 Virus Orders        99.5%, 1024 (k=5)     98.0%, 26942 (T=30)     99.9%, 2783
42 Virus Families     93.7%, 256 (k=4)      91.5%, 28136 (T=21)     97.3%, 4538
33 Virus Genera       98.1%, 4096 (k=6)     94.5%, 2223 (T=12)      99.0%, 2939



Fig. 6. Accuracy Comparison using protein sequences using various T values.

4. Discussions

4.1. Why ”region” yielded the best accuracy

Table 6 shows that "region" provided the best classification accuracy. The reasons are discussed below. First, the frequency distributions of k-mers that were derived from DNA sequences were used for virus classification. Generally, longer k-mers present more specific features. However, two characteristics of viruses, their rapid evolution and diversity, cause the frequency distribution of k-mers to be too sparse to be used for classification when the DNA sequences are short but the value of k is large. Fig.5 shows that the accuracy decreases as the value of k increases beyond 7.

Second, in this study, the protein sequences within the same group after protein clustering were assumed to have similar functions. This assumption was used as a distinguishing feature for further vector transformation processing. However, the protein clustering approach used the "pblast" program to measure the similarity between two protein sequences and the single-linkage method to join two groups into one. This approach might generate impure protein groups, in which one group may exhibit two functions. For example, Fig.7 presents 7 protein sequences, $S_1, S_2, \ldots, S_7$, in two distinct groups (functions), $GID_1$ and $GID_2$, determined by the single-link method with a threshold value $T$. The two groups, $GID_1$ and $GID_2$, are formed due to regions $R_1$ and $R_2$, respectively, and are disjoint because all of the distances between the nodes of $GID_1$ and those of $GID_2$ are larger than $e^{-T}$. However, the appearance of $S_8$, containing both $R_1$ and $R_2$, results in the merging of $GID_1$ and $GID_2$ into $GID_3$.

The region name is a distinguishing feature for classification in this study. With respect to the distribution of class frequency (CF) of region names across 42 families, shown in Fig.8, the majority of the region names had "CF=1" (78.45%), that is, most of the region names appeared in only one class. Notably, about 20% of the regions had "CF=2" (18.85%) or "CF=3" (5.61%), and these regions, like those in S8 described above, may have caused the impurity of protein groups.

Fig. 7. Two distinct groups, GID1 and GID2, are merged as the group GID3 due to the sequence S8 that contains R1 and R2.

Fig. 8. The distribution of class frequency (CF) of the "Region" derived from 42 families.
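The class frequency (CF) of a region name, that is, the number of classes in which the name occurs, can be tallied with a short sketch. The input layout (pairs of class label and region names), the function name and the placeholder names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def class_frequency_distribution(instances):
    """instances is an iterable of (class_label, region_names).

    Returns a Counter mapping each CF value to the number of region names
    that occur in exactly that many classes.
    """
    classes_of = defaultdict(set)
    for label, regions in instances:
        for region in regions:
            classes_of[region].add(label)
    return Counter(len(labels) for labels in classes_of.values())


# Placeholder data: R1 and R3 occur in one family each, R2 in two families.
toy_instances = [("F1", ["R1", "R2"]), ("F2", ["R2", "R3"]), ("F1", ["R1"])]
cf_distribution = class_frequency_distribution(toy_instances)
print(cf_distribution)
```

Dividing each count by the total number of region names gives the percentage form used in the discussion of Fig.8.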

4.2. Drawbacks of ”region” name annotation

As shown in Table 2, ICTV and NCBI contained 420 and 326 genera, respectively. However, only 33 genera (693 viruses) were selected in the experiments owing to the requirement that each class should contain at least ten instances that included at least one region name. Hence, the majority of the viruses were not used in the classification experiments, so the experimental results in this study are not robust enough.

The ”region” names usually were given or assigned manually while related se-

quences were aligned using the RPS-Blast against the CDD (Conserved Domain

Database) 23. The way in which a region name is assigned may have the side effect

that related sequences might be highly specific to some viral family, for example.

Therefore, the region names may contain some metadata about the label of the

original family, which may provide a way of cheating in classification experiments.

To avoid such a situation, region names must be annotated automatically without

knowledge of the class label using the HMMER33 against PFAM6 or other auto-

mated domain annotation tools. Doing so would make the proposed approach more

practical and provide more convincing experimental classification in the future.

4.3. Verifying fitness of class structure within existing virus

taxonomy

The misclassified instances are examined using a confusion matrix 26 to identify the implicit relationship between two classes. The experimental results thus obtained are not shown herein owing to the limit on the number of pages. However, analyzing the ambiguities among classes is a favored way to evaluate the fitness of an existing class structure31. After a feasible type of genomic material is selected from existing genomic materials for classification, existing class structures of biological taxonomy can be verified via molecular biology instead of traditional morphology. Such work may provide clues for biologists or taxonomists to reinspect and adjust existing class structures when they work with taxonomy in the future.
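As an illustration of how misclassified instances can point to related classes, the following sketch builds a confusion matrix from true and predicted labels and reports the most frequently confused pair. The function names and the toy labels are hypothetical; the paper's own confusion-matrix results are not shown, so nothing here reproduces them.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Return the sorted label list and a matrix with rows = true, columns = predicted."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    pairs = Counter(zip(true_labels, predicted_labels))
    matrix = [[pairs[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def most_confused(labels, matrix):
    """Return (true, predicted, count) for the largest off-diagonal cell, or None."""
    best = None
    for i, t in enumerate(labels):
        for j, p in enumerate(labels):
            if i != j and matrix[i][j] > 0:
                if best is None or matrix[i][j] > best[2]:
                    best = (t, p, matrix[i][j])
    return best


# Toy labels for five hypothetical instances.
toy_true = ["A", "A", "B", "B", "C"]
toy_pred = ["A", "B", "B", "B", "A"]
toy_labels, toy_matrix = confusion_matrix(toy_true, toy_pred)
print(toy_matrix)
print(most_confused(toy_labels, toy_matrix))
```

Large off-diagonal cells mark pairs of classes that the classifier cannot separate well, which are the candidates a taxonomist might reinspect.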

5. Conclusion

In this study, three genomic materials are used to compare methods of virus classification: DNA sequences, protein sequences and region names. The first two materials are extracted directly from virus genomes, and the last is obtained from the annotation of the proteins. The resources that are used in the experiments are collected from three taxonomic levels and include 6 orders, 42 families and 33 genera. Experimental results show that using "region" to classify viruses yielded the best classification accuracy when the SVM classifier from LIBSVM was used. The obtained accuracies were 99.9%, 97.3% and 99.0% for the three datasets that comprised 6 orders, 42 families and 33 genera, respectively. This paper provides a novel approach to classifying viruses based on molecular biology instead of morphology. This approach, using genomic materials, can be applied to classify other creatures (organisms). This work opens up a new way to determine whether the existing taxonomic structure is suitable from the point of view of molecular biology31.



Acknowledgment

This study is supported by Asia University, Taiwan under project 101-asia-30. The

author thanks the reviewers for their valuable comments and suggestions.

References

1. Entrez Programming Utilities Help, http://www.ncbi.nlm.nih.gov/books/NBK25501/.
2. FTP Site for Genomes in NCBI, ftp://ftp.ncbi.nih.gov/genomes.
3. HMMER, http://hmmer.janelia.org/.
4. International Committee on Taxonomy of Viruses (ICTV), http://www.ncbi.nlm.nih.gov/ICTVdb/.
5. National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/.
6. Pfam database, http://pfam.sanger.ac.uk/.
7. Wikipedia: Virus Classification, http://en.wikipedia.org/wiki/Virus_classification.
8. Alpaydin E, Introduction to Machine Learning, The MIT Press, 2004.
9. Baltimore D, Animal Virology, no. 4, Elsevier Science, 1976. ISBN 9780323142281.
10. Bazinet A, Cummings M, A comparative evaluation of sequence classification programs, BMC Bioinformatics 13(1):92+, 2012.
11. Burke SM, Torkington N, Aas G, Perl and LWP. Fetching Web Page, Parsing HTML, Writing Spiders and More, O'Reilly, Beijing, 2002.
12. Chang CC, Lin CJ, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
13. Croft B, Metzler D, Strohman T, Search Engines Information Retrieval in Practice, 1st ed., Addison Wesley, 2009.
14. Croft B, Metzler D, Strohman T, Search Engines: Information Retrieval in Practice, Addison-Wesley Publishing Company, USA, 2009. ISBN 0136072240, 9780136072249.
15. Deschavanne P, DuBow M, Regeard C, The use of genomic signature distance between bacteriophages and their hosts displays evolutionary relationships and phage growth cycle determination, Virology Journal 7, 2010.
16. Dutilh BE, He Y, Hekkelman ML, Huynen MA, Signature, a web server for taxonomic characterization of sequence samples using signature genes, Nucleic Acids Research (suppl 2):W470–W474.
17. Dutilh BE, Snel B, Ettema TJ, Huynen MA, Molecular biology and evolution 25, 2008.
18. Exarchos TP, Tsipouras MG, Papaloukas C, Fotiadis DI, A two-stage methodology for sequence classification based on sequential pattern mining and optimization, Data Knowl Eng 66(3):467–487, 2008.
19. Han J, Kamber M, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2007.
20. Jones NC, Pevzner PA, An Introduction to Bioinformatics Algorithms, MIT Press, 2004. ISBN 0-262-10106-8.
21. Jun SR, Sims GE, Wu GA, Kim SH, Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution, Proceedings of the National Academy of Sciences 107(1):133–138, 2010.
22. Manning CD, Raghavan P, Schütze H, Introduction to Information Retrieval, Cambridge University Press.
23. Marchler-Bauer A, Zheng C, Chitsaz F, Derbyshire MK, Geer LY, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Lanczycki CJ, Lu F, Lu S, Marchler GH, Song JS, Thanki N, Yamashita RA, Zhang D, Bryant SH, CDD: conserved domains and protein three-dimensional structure, Nucleic Acids Research 41(Database issue):D348–D352, 2013.
24. Mitchell TM, Machine Learning, The McGraw-Hill Companies, Inc, 1997.
25. International Committee on Taxonomy of Viruses, King A, Adams M, Lefkowitz E, Carstens E, Virus Taxonomy: IXth Report of the International Committee on Taxonomy of Viruses, Immunology and microbiology, Academic Press, 2011. ISBN 9780123846846.
26. Roiger R, Geatz MW, Data Mining: A Tutorial Based Primer, Addison Wesley, 2003.
27. Trifonov V, Rabadan R, Frequency analysis techniques for identification of viral genetic data, mBio 1(3), July/August 2010.
28. van Passel M, Kuramae E, Luyf A, Bart A, Boekhout T, The reach of the genome signature in prokaryotes, BMC Evolutionary Biology 6(1):84, 2006.
29. Wang JD, A Comparison study of Virus Classification by Genome Sequences, The 11th IEEE International Conference on Bioinformatics and Bioengineering, pp. 270–273, 2011.
30. Wang JD, Virus Classification via Genomic Sequences From Different Taxonomic Level, The 23rd International Conference on Genome Informatics, p. 76, 2012.
31. Wang JD, Liu HC, An Approach to Evaluate the Fitness of One Class Structure via Dynamic Centroids, Expert Systems with Applications 38(11):13764–13772, 2011.
32. Witten IH, Frank E, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Elsevier, 2011. ISBN 0120884070.
33. Wu GA, Jun SR, Sims GE, Kim SH, Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method, Proceedings of the National Academy of Sciences 106(31):12826–12831, 2009.
34. Yu C, Hernandez T, Zheng H, Yau SC, Huang HH, He RL, Yang J, Yau SST, Real time classification of viruses in 12 dimensions, PLOS ONE 8(5), 2013.

Jing-Doo Wang received his BS degree in Computer Science

and Information Engineering from the University of Tatung (for-

merly Tatung Institute of Technology) in 1989, and his M.S. and

Ph.D. degrees in Computer Science and Information Engineering

from the University of Chung Cheng in 1993 and 2002, respectively.
He has been with Asia University (formerly Taichung Healthcare and Management

University) since spring 2003, where he is currently an assistant professor in the De-

partment of Computer Science and Information Engineering. He also holds a joint

appointment with the Department of Biomedical Informatics. His research interests

are in the areas of bioinformatics, text mining for trend analysis and the extraction

of maximal repeats via cloud computing.
