Post on 21-Dec-2015
DNA Barcode Data Analysis: Boosting Accuracy by Combining Simple Classification Methods
CSE 377 - Bioinformatics - Spring 2006
Sotirios Kentros, Bogdan Paşaniuc
Univ. of Connecticut
Outline
Motivation
Problem Definition
The Methods
  Hamming Distance and Minimum Hamming Distance
  Amino Acid Similarity and Minimum Amino Acid Similarity
  Dinucleotide Distance
  Trinucleotide Distance
  Nucleotide Frequency Similarity
Combining the Methods
Results
  Species Classification
  New Species Recognition
Conclusion
Future Work
Motivation
“DNA barcoding” was proposed as a tool for differentiating biological species
Goal: to make a “fingerprint” for species, using a short sequence of DNA
Assumption: mitochondrial DNA evolves at a higher rate than nuclear DNA
Mitochondrial DNA: high interspecies variability while retaining low intraspecies sequence variability
The chosen barcode is the mitochondrial cytochrome c oxidase subunit 1 region ("COI", 648 base pairs long).
Problem definition
The goal of our project was to explore whether combining simple classification methods can increase classification accuracy.
We address two problems:
  Classification of individuals given a training set of species.
  Identification of individuals that belong to new species.
All the sequences are aligned.
Problem definition
Species differentiation:
INPUT: a set S of aligned DNA sequences whose species are known, and a new sequence x
OUTPUT: the species of x, given that S contains sequences of the same species as x
Problem definition
Species differentiation & new species determination:
INPUT: a set S of aligned DNA sequences whose species are known, and a new sequence x
OUTPUT: the species of x if S contains at least one sequence of the same species; otherwise, report that x belongs to a new species.
Methods Used
Hamming Distance and Minimum Hamming Distance
Amino Acid Similarity and Minimum Amino Acid Similarity
Dinucleotide Distance
Trinucleotide Distance
Nucleotide Frequency Similarity
Methods
[Figure: a new sequence x is compared with each species S1, S2, ..., Sn, giving the distances d(x,S1), d(x,S2), ..., d(x,Sn)]
1. d(x,Si) = Minimum{ d(x,y) | sequence y belongs to species Si }
   • Notation: Minimum “Method” Classifier
2. d(x,Si) = Average{ d(x,y) | sequence y belongs to species Si }
   • Notation: “Method” Classifier
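The two ways of turning a sequence-to-sequence distance into a sequence-to-species distance can be sketched as below. This is a minimal Python illustration, not the authors' code: the function names, the dictionary layout of the training set, and the toy length-based distance are all assumptions made for the example.

```python
from statistics import mean

def species_distance(x, members, d, mode="min"):
    """d(x, Si): minimum or average of d(x, y) over the sequences y of one species."""
    dists = [d(x, y) for y in members]
    return min(dists) if mode == "min" else mean(dists)

def classify(x, training, d, mode="min"):
    """Return the species with the smallest distance to x.

    training maps a species name to the list of its sequences (hypothetical layout).
    """
    return min(training, key=lambda s: species_distance(x, training[s], d, mode))

# Toy distance used only for this example: absolute difference in length.
def toy_d(a, b):
    return abs(len(a) - len(b))

training = {"S1": ["AAAA", "AAAAAAAAAA"], "S2": ["AAAAAAA", "AAAAAAAA"]}
by_min = classify("AAAAA", training, toy_d, mode="min")   # nearest single sequence wins
by_avg = classify("AAAAA", training, toy_d, mode="avg")   # nearest species on average wins
print(by_min, by_avg)
```

On this toy data the two rules disagree: S1 contains the single closest sequence, while S2 is closer on average.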
Hamming Distance
Average:
  Given a new sequence x, find the species S such that the average Hamming distance from x to y (y in S) is minimized
  Assign S to x
Minimum:
  Given a new sequence x, find the sequence y such that the Hamming distance from x to y is minimized
  Assign species(y) to x
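Both Hamming-based classifiers fit in a few lines. The sketch below assumes the training set is a list of (sequence, species) pairs and that a gap character counts as an ordinary mismatch; neither detail is stated on the slides.

```python
from collections import defaultdict

def hamming(x, y):
    """Number of aligned positions at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y))

def min_hamming_classify(x, labeled):
    """Species of the single training sequence closest to x."""
    _, species = min(labeled, key=lambda pair: hamming(x, pair[0]))
    return species

def avg_hamming_classify(x, labeled):
    """Species whose sequences are closest to x on average."""
    dists = defaultdict(list)
    for seq, species in labeled:
        dists[species].append(hamming(x, seq))
    return min(dists, key=lambda s: sum(dists[s]) / len(dists[s]))

# Toy data, invented for illustration.
labeled = [("ACGTAC", "sp1"), ("TTTTTT", "sp1"), ("ACGAAT", "sp2"), ("ACGAAC", "sp2")]
x = "ACGTAA"
print(min_hamming_classify(x, labeled), avg_hamming_classify(x, labeled))
```

Here sp1 holds the nearest sequence (distance 1), but its outlier raises the average to 3, so the average rule picks sp2 (average 2).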
Amino Acid Similarity
Genetic code: rules that map DNA sequences to proteins
Codon: trinucleotide unit that encodes one amino acid
Divide the DNA sequence into codons and substitute each one by its corresponding amino acid
BLOSUM62 (BLOck SUbstitution Matrix): 20x20 matrix that gives a score for each pair of amino acids based on amino acid properties
The higher the score, the more likely the substitution causes no functional change in the protein
Amino Acid Similarity
Distance(x,y):
  DNA sequences x, y -> amino acid sequences x’, y’ (using the codon-to-amino-acid translation)
  Using the BLOSUM62 amino acid substitution matrix, get the score of the alignment of x’ and y’
Average: find the species with maximum average similarity
Minimum: find the sequence with maximum similarity
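The translate-then-score step might look like the sketch below. Several things here are assumptions: the slides do not say which genetic code table was used (the standard code is shown, although a mitochondrial table may be more appropriate for COI), codons containing gaps are mapped to a placeholder 'X' that scores 0, and `toy_matrix` holds only four BLOSUM62 entries for illustration where a real run would load the full 20x20 matrix.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order (assumed; not stated on the slides).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate codon by codon; codons with gaps or unknown symbols become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def similarity(x, y, matrix, default=0):
    """Sum of substitution scores over the aligned amino acid sequences."""
    xa, ya = translate(x), translate(y)
    return sum(matrix.get((a, b), matrix.get((b, a), default)) for a, b in zip(xa, ya))

# A few BLOSUM62 entries only, for illustration.
toy_matrix = {("M", "M"): 5, ("A", "A"): 4, ("S", "S"): 4, ("A", "S"): 1}

x, y = "ATGGCC", "ATGTCC"            # translate to "MA" and "MS"
print(translate(x), translate(y), similarity(x, y, toy_matrix))
```

Because this is a similarity rather than a distance, the classifiers pick the maximum score instead of the minimum.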
Dinucleotide Distance
For each species, find the frequency with which each dinucleotide appears.
Compute the frequency of each dinucleotide in the unclassified sequence.
Find the species with the minimum mean square distance to the new unclassified sequence.
For new species: after classifying the individual, find the average intraspecies mean square distance for the candidate species. If the individual is close enough, assign it to that species; otherwise it belongs to a new species.
Indels are ignored.
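A minimal sketch of the frequency-based classification follows. It assumes overlapping dinucleotide windows, skips any window containing a non-ACGT symbol (one way to ignore indels), and takes "mean square distance" to mean the average squared difference between the two frequency vectors; the slides do not pin down these details.

```python
from collections import Counter
from itertools import product

def kmer_freq(seqs, k=2):
    """Relative frequency of every k-mer over one or more sequences."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            window = s[i:i + k]
            if set(window) <= set("ACGT"):   # skip windows touching an indel
                counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in kmers]

def mean_square(p, q):
    """Average squared difference between two frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)

def classify(x, training, k=2):
    """Species whose pooled k-mer frequencies are closest to those of x."""
    fx = kmer_freq([x], k)
    return min(training, key=lambda s: mean_square(fx, kmer_freq(training[s], k)))

# Toy data, invented for illustration.
training = {"sp1": ["ACACACAC", "ACACAC-C"], "sp2": ["GGTTGGTT", "GGTTGGTA"]}
print(classify("ACACACGT", training))
```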
Trinucleotide Distance
For each species, find the frequency with which each trinucleotide appears.
Compute the frequency of each trinucleotide in the unclassified sequence.
Find the species with the minimum mean square distance to the new unclassified sequence.
For new species: after classifying the individual, find the average intraspecies mean square distance for the candidate species. If the individual is close enough, assign it to that species; otherwise it belongs to a new species.
Indels are ignored.
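The new species test can be sketched as a threshold on the intraspecies distance. The slides do not define "close enough", so the rule below is an assumption: x is accepted if its distance to the species profile is at most `slack` times the average distance of the species' own members to that profile, where `slack` is a hypothetical tuning parameter.

```python
from collections import Counter
from itertools import product

def kmer_freq(seqs, k=3):
    """Relative k-mer frequencies; windows containing indels are skipped."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            window = s[i:i + k]
            if set(window) <= set("ACGT"):
                counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in kmers]

def mean_square(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)

def is_new_species(x, members, k=3, slack=2.0):
    """True if x is farther from the candidate species than the assumed threshold."""
    profile = kmer_freq(members, k)
    intra = sum(mean_square(kmer_freq([m], k), profile) for m in members) / len(members)
    return mean_square(kmer_freq([x], k), profile) > slack * intra

# Toy candidate species, invented for illustration.
members = ["ACGACGACGACG", "ACGACGACGACT"]
print(is_new_species("ACGACGACGACG", members), is_new_species("TTTTTTTTTTTT", members))
```

With a single training sequence the intraspecies average is 0, so under this rule any sequence with different frequencies would be flagged as new; a practical threshold would need a floor.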
Nucleotide Frequency Similarity
For each position in the DNA, find the frequency with which each nucleotide appears among the individuals of the species. The frequency of indels is included.
For an unclassified individual, compute the log of the probability that the individual belongs to each species, and assign it to the species for which the probability is maximum.
For new species, we compute the minimum probability over the individuals belonging to the candidate species and compare it with the probability of the unclassified individual, in order to determine whether it belongs to the species or not.
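A position-by-position profile and its log-probability score might be computed as below. The add-one smoothing (`pseudo`) is an assumption added so that unseen symbols do not give log(0), and the sketch assumes sequences contain only A, C, G, T and '-' for indels.

```python
import math

ALPHABET = "ACGT-"   # '-' stands for an indel

def profile(members, pseudo=1.0):
    """Per-position symbol frequencies for one species (smoothing is assumed)."""
    prof = []
    for i in range(len(members[0])):
        column = [m[i] for m in members]
        denom = len(column) + pseudo * len(ALPHABET)
        prof.append({c: (column.count(c) + pseudo) / denom for c in ALPHABET})
    return prof

def log_prob(x, prof):
    """Log of the probability of x under the profile, positions taken as independent."""
    return sum(math.log(p[c]) for c, p in zip(x, prof))

def classify(x, training):
    """Species under which x has the highest log-probability."""
    return max(training, key=lambda s: log_prob(x, profile(training[s])))

def is_new_species(x, members):
    """True if x scores below the lowest-scoring member of the candidate species."""
    prof = profile(members)
    return log_prob(x, prof) < min(log_prob(m, prof) for m in members)

# Toy data, invented for illustration.
training = {"sp1": ["ACGT", "ACGA", "ACG-"], "sp2": ["TTGT", "TTGA", "TTGT"]}
print(classify("ACGT", training), is_new_species("TTTT", training["sp1"]))
```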
Combining the Methods
The species on which most classifiers agree is returned
Simple Voting:
  Every classifier’s returned species has a weight of 1
  Output the species with the most votes
Weighted Voting:
  Every classifier has a different weight based on the accuracy of each independent method
  Output the species with the largest total weight
As expected, weighted voting yields higher accuracy, and thus in our results the combined method uses weighted voting
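Both voting schemes reduce to summing weights per species. In the sketch below the classifier names are hypothetical, and using each method's raw accuracy as its weight is an assumption: the slides say only that weights are based on accuracy.

```python
from collections import defaultdict

def vote(predictions, weights=None):
    """Combine classifier outputs.

    predictions: classifier name -> predicted species.
    weights: classifier name -> weight; None means simple voting (weight 1 each).
    Ties go to the species that was voted for first.
    """
    totals = defaultdict(float)
    for name, species in predictions.items():
        totals[species] += 1.0 if weights is None else weights[name]
    return max(totals, key=totals.get)

# Hypothetical outputs of three classifiers for one sequence.
predictions = {"min_hamming": "sp1", "nucleotide_freq": "sp2", "dinucleotide": "sp2"}
# Example weights: the methods' accuracies at 20% missing sequences.
weights = {"min_hamming": 98.2, "nucleotide_freq": 49.6, "dinucleotide": 42.2}

print(vote(predictions), vote(predictions, weights))
```

Simple voting follows the two weak classifiers, while weighted voting lets the single accurate classifier outweigh them (98.2 against 91.8).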
Datasets (1)
We used the dataset provided at http://dimacs.rutgers.edu/workshops/BarcodeResearchchallanges, consisting of 1623 aligned sequences classified into 150 species, with each sequence consisting of 590 nucleotides on average.
We randomly deleted from each species 10 to 50 percent of the sequences
  Deleted sequences -> test set
  Remaining sequences -> training set
We made sure that every species retains at least one sequence
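The split can be sketched as follows. The function name, the data layout, the rounding rule, and the fixed seed are assumptions for illustration; the only constraints taken from the slide are the per-species random deletion and keeping at least one training sequence per species.

```python
import random

def split(dataset, fraction, seed=0):
    """Move about `fraction` of each species' sequences to the test set.

    dataset: species name -> list of sequences (hypothetical layout).
    At least one sequence per species always stays in the training set.
    """
    rng = random.Random(seed)
    train, test = {}, []
    for species, seqs in dataset.items():
        shuffled = seqs[:]
        rng.shuffle(shuffled)
        n_test = min(round(fraction * len(shuffled)), len(shuffled) - 1)
        test += [(s, species) for s in shuffled[:n_test]]
        train[species] = shuffled[n_test:]
    return train, test

# Toy data, invented for illustration.
dataset = {"sp1": ["AAAA", "AAAT", "AATT", "ATTT"], "sp2": ["CCCC"]}
train, test = split(dataset, 0.5)
print({s: len(v) for s, v in train.items()}, len(test))
```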
Species Recovery Accuracy (in %), no new species
Columns: percent of sequences missing from each species (%)

Method                        10     20     30     40     50
Amino Acid Similarity         95.1   94.8   94.7   94.3   93
Min. Amino Acid Similarity    99.3   99.2   98.7   98.1   97.3
Hamming Dist.                 97.9   97.4   96.7   96.5   96.5
Min. Hamming Dist.            98.8   98.2   97.5   97.1   96.4
Nucleotide Freq. Sim.         56.2   49.6   44.2   44.6   38.2
Dinucleotide Freq. Dist.      44.9   42.2   41.6   41.5   39.3
Trinucleotide Freq. Dist.     70.9   68.1   68     66.7   64.2
Combination                   99.2   99.2   98.8   98.3   97.7
Datasets (2)
In order to test the accuracy of new species detection and classification, we devised a leave-one-out procedure:
  Delete a whole species
  Randomly delete from each remaining species 0 to 50 percent of the sequences
  Deleted sequences -> test set
  Remaining sequences -> training set
The following table gives average accuracy results over 150 x 6 different test cases
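One test case of this procedure might be built as below; repeating it for every held-out species and every deletion rate gives the grid of test cases. The `NEW` label, the function names, and the truncating rounding are assumptions for illustration, not details from the slides.

```python
import random

NEW = "NEW"   # hypothetical label for sequences of the held-out species

def leave_one_species_out(dataset, held_out, fraction, seed=0):
    """Build one test case: hold out a whole species, thin out the others."""
    rng = random.Random(seed)
    train = {}
    test = [(s, NEW) for s in dataset[held_out]]   # correct answer is "new species"
    for species, seqs in dataset.items():
        if species == held_out:
            continue
        shuffled = seqs[:]
        rng.shuffle(shuffled)
        n_test = min(int(fraction * len(shuffled)), len(shuffled) - 1)
        test += [(s, species) for s in shuffled[:n_test]]
        train[species] = shuffled[n_test:]
    return train, test

def accuracy(classifier, train, test):
    """Fraction of test sequences for which classifier(x, train) returns the right label."""
    return sum(classifier(x, train) == label for x, label in test) / len(test)

# Toy data, invented for illustration.
dataset = {"sp1": ["AAAA", "AAAT"], "sp2": ["CCCC", "CCCG"], "sp3": ["GGGG"]}
train, test = leave_one_species_out(dataset, "sp3", 0.5)
always_new = lambda x, train: NEW      # trivial baseline classifier
print(sorted(train), len(test), accuracy(always_new, train, test))
```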
Leave-One-Out Accuracy (in %)
Columns: percent of sequences missing from each remaining species (%)

Method                        0      10     20     30     40     50
Amino Acid Similarity         65.1   49.2   43.6   42.0   41.0   37.4
Min. Amino Acid Similarity    72.6   61.0   56.2   56.4   52.6   51.0
Hamming Dist.                 55.0   91.4   90.2   90.4   88.0   88.6
Min. Hamming Dist.            73.1   85.4   79.6   78.6   75.0   74.4
Dinucleotide Freq. Dist.      51.0   50.4   48.2   48.2   45.2   43.4
Trinucleotide Freq. Dist.     56.5   63.6   61.8   63.0   59.2   57.4
Nucleotide Freq. Sim.         73.0   56.2   49.6   44.2   44.0   38.2
Combination                   80.5   93.2   91.6   91.6   88.4   88.6
Conclusions (1)
Every method shows a tradeoff between new species detection and classification accuracy
Hamming distance performs very well when no new species are present, but its accuracy is low for new species detection
The combined method yields better accuracy on both new species detection and sequence classification.
The runtimes of all methods are within the same order of magnitude
Conclusions (2)
By combining simple classification methods, we managed to boost the accuracy both for classifying individuals into known species and for detecting new species
As expected, the results imply a tradeoff between classification and new species detection: the higher the classification accuracy, the lower the detection accuracy
Hamming distance is a very good metric for the training dataset provided
Future Work
New species clustering: determining the different new species present
Further investigate threshold selection and weighting schemes.
Ignoring parts of the given sequences could possibly improve accuracy. Are there redundant/noisy regions?
Use independent weighting schemes for new species detection and for classification into known species.