Application of Latent Semantic Analysis to Protein Remote Homology Detection. Wu Dongyin, 4/13/2015
TRANSCRIPT
Application of latent semantic analysis to protein remote homology detection
Wu Dongyin, 4/13/2015
ABSTRACT
LSA
Related Work on Remote Homology Detection
LSA-based SVM and Data set
Result and Discussion
CONCLUSION
Motivation: remote homology detection is a central problem in computational biology, namely the classification of proteins into functional and structural classes given their amino acid sequences.
Results

Discriminative methods such as SVM are among the most effective.

Explicit feature sets are usually large and may introduce noisy data, which leads to the peaking phenomenon.

LSA, an efficient feature extraction technique from NLP, is introduced. The LSA model significantly improves the performance of remote homology detection in comparison with the basic formalisms; its performance is comparable with complex kernel methods such as SVM-LA and better than other sequence-based methods.
ABSTRACT
Related Work on Remote Homology Detection
Pairwise sequence comparison algorithms (dynamic programming): BLAST, FASTA, PSI-BLAST, etc.

Generative models: HMM, etc.

Protein families and discriminative classifiers: SVM, SVM-Fisher, SVM-k-spectrum, mismatch-SVM, SVM-pairwise, SVM-I-sites, SVM-LA, SVM-SW, etc.
Structure is more conserved than sequence, so detecting very subtle sequence similarities, i.e. remote homology, is important.

Most methods can detect homology at a high level of similarity, while remote homologs are often difficult to separate from pairs of proteins that share similarities owing to chance -- the 'twilight zone'.
The success of a SVM classification method depends on the choice of the feature set to describe each protein. Most of these research efforts focus on finding useful representations of protein sequence data for SVM training by using either explicit feature vector representations or kernel functions.
LSA
Latent semantic analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning by statistical computations applied to a large corpus of text.
LSA analyzes the relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text.
LSA
|           | c1 | c2 | c3 | c4 | c5 | m1 | m2 | m3 | m4 |
|-----------|----|----|----|----|----|----|----|----|----|
| human     | 1  |    |    | 1  |    |    |    |    |    |
| interface | 1  |    | 1  |    |    |    |    |    |    |
| computer  | 1  | 1  |    |    |    |    |    |    |    |
| user      |    | 1  | 1  |    | 1  |    |    |    |    |
| system    |    | 1  | 1  | 2  |    |    |    |    |    |
| response  |    | 1  |    |    | 1  |    |    |    |    |
| time      |    | 1  |    |    | 1  |    |    |    |    |
| EPS       |    |    | 1  | 1  |    |    |    |    |    |
| survey    |    | 1  |    |    |    |    |    |    | 1  |
| tree      |    |    |    |    |    | 1  | 1  | 1  |    |
| graph     |    |    |    |    |    |    | 1  | 1  | 1  |
| minor     |    |    |    |    |    |    |    | 1  | 1  |
Bag-of-words model: with N documents and M words in total, this gives a word-document matrix W of size M × N.

(This representation does not recognize synonymous or related words, and its dimensionality is too large.)
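The bag-of-words step can be sketched in a few lines; the toy documents below are hypothetical stand-ins for the slide's example corpus:

```python
import numpy as np

# Hypothetical toy corpus standing in for the slide's example documents.
docs = [
    "human machine interface for computer applications",
    "survey of user opinion of computer system response time",
    "the EPS user interface management system",
]

# Vocabulary: every distinct word, in first-seen order.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# Word-document matrix W (M words x N documents), raw counts.
W = np.zeros((len(vocab), len(docs)), dtype=int)
for j, doc in enumerate(docs):
    for word in doc.split():
        W[vocab.index(word), j] += 1

print(W.shape)  # (M, N)
```

Each column is the raw count vector of one document; no word order is kept, which is exactly the bag-of-words assumption.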
LSA

LSA applies singular value decomposition (SVD) to the word-document matrix:

W ≈ U S V^T

where U is M × R, S is R × R (diagonal, holding the singular values), V^T is R × N, and R ≤ min(M, N).
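The truncated SVD can be sketched with NumPy; the matrix below is a random stand-in, and R = 2 is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in word-document count matrix W (M = 12 words, N = 9 documents).
W = rng.integers(0, 3, size=(12, 9)).astype(float)

# Full SVD: W = U S V^T.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep only the R largest singular values (R <= min(M, N)).
R = 2
U_R, S_R, Vt_R = U[:, :R], np.diag(s[:R]), Vt[:R, :]

# Rank-R approximation of W; each document now lives in the
# R-dimensional latent space.
W_R = U_R @ S_R @ Vt_R
doc_vectors = (S_R @ Vt_R).T  # one R-dimensional vector per document

print(doc_vectors.shape)  # (9, 2)
```

Dropping the small singular values is what removes noise and shrinks the representation from M dimensions down to R.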
LSA

The same word-document framework (the matrix shown on the previous slide) is applied to proteins: sequences of proteins are treated as documents, and the building blocks extracted from them are treated as words.
LSA

For a new document (sequence) that is not in the training set, one would have to add the unseen document (sequence) to the original training set and recompute the LSA model. Instead, the new vector t can be approximated as

t = dU

where d is the raw count vector of the new document, similar to the columns of the matrix W.
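This folding-in step can be sketched as follows; a random training matrix stands in for the real one, and t = dU is computed as a matrix-vector product against the truncated U:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in training word-document matrix (M = 12 words, N = 9 documents).
W = rng.integers(0, 3, size=(12, 9)).astype(float)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
R = 2
U_R = U[:, :R]  # M x R left singular vectors from the training SVD

# New (unseen) document: raw count vector d, shaped like a column of W.
d = rng.integers(0, 3, size=12).astype(float)

# Fold-in: project d onto the latent space (the slide's t = dU).
t = d @ U_R  # R-dimensional representation of the new document

print(t.shape)  # (2,)
```

The point is that no retraining is needed: the new sequence is mapped into the existing R-dimensional space using the already-computed U.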
LSA-based SVM and Data set

Structural Classification of Proteins (SCOP) 1.53, with sequences from the ASTRAL database: 54 families, 4352 distinct sequences. Remote homology is simulated by holding out all members of a target SCOP 1.53 family from a given superfamily.

Three basic building blocks of proteins serve as words:

N-grams: N = 3, giving 20^3 = 8000 words.

Patterns: over the alphabet Σ ∪ {'.'}, where Σ is the set of the 20 amino acids and '.' can be any amino acid; χ² selection yields 8000 patterns.

Motifs: limited, highly conserved regions of proteins; 3231 motifs.
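Extracting N-gram words from a sequence is a simple sliding window; a minimal sketch (the sequence shown is hypothetical):

```python
def ngrams(sequence, n=3):
    """Slide a window of length n over a protein sequence
    to produce its 'words'."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

seq = "MKVLAT"  # hypothetical short protein sequence
print(ngrams(seq))  # ['MKV', 'KVL', 'VLA', 'LAT']
```

With 20 amino acids and n = 3, the vocabulary has at most 20^3 = 8000 possible words, matching the figure above.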
Result and Discussion
Two metrics are used to evaluate the experimental results:

The receiver operating characteristic (ROC) score.

The median rate of false positives (M-RFP): the fraction of negative test sequences that score as high as or better than the median score of the positive sequences.
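Both metrics can be computed directly from classifier scores; a sketch following the definitions above (the score lists are made-up illustrations):

```python
import numpy as np

def roc_score(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

def m_rfp(pos_scores, neg_scores):
    """Median rate of false positives: fraction of negative test
    sequences scoring as high as or better than the median positive."""
    median_pos = np.median(np.asarray(pos_scores, dtype=float))
    neg = np.asarray(neg_scores, dtype=float)
    return float(np.mean(neg >= median_pos))

pos = [0.9, 0.8, 0.4]       # hypothetical classifier scores, positives
neg = [0.7, 0.3, 0.2, 0.1]  # hypothetical classifier scores, negatives
print(roc_score(pos, neg))  # 1.0 would mean perfect separation
print(m_rfp(pos, neg))      # 0.0 would mean no false positives at the median
```

Higher is better for ROC, lower is better for M-RFP.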
Result and Discussion
Result and Discussion
When a family falls in the upper-left area, the method labeled on the y-axis outperforms the method labeled on the x-axis on that family.
Result and Discussion

[SCOP hierarchy diagram: fold1 contains superfamily1.1 (families 1.1.1, 1.1.2, 1.1.3) and superfamily1.2 (families 1.2.1, 1.2.2); fold2 contains superfamily2.1 (family 2.1.1)]

1. Family level: positive train 20, positive test 13; negative train & negative test: 3033 & 1137.
Result and Discussion

[SCOP hierarchy diagram: fold1 contains superfamily1.1 (families 1.1.1, 1.1.2, 1.1.3) and superfamily1.2 (families 1.2.1, 1.2.2); fold2 contains superfamily2.1 (family 2.1.1)]

2. Superfamily level: positive train 88, positive test 33; negative train & negative test: 3033 & 1137.
Result and Discussion

[SCOP hierarchy diagram: fold1 contains superfamily1.1 (families 1.1.1, 1.1.2, 1.1.3) and superfamily1.2 (families 1.2.1, 1.2.2); fold2 contains superfamily2.1 (family 2.1.1)]

3. Fold level: positive train 61, positive test 33; negative train & negative test: 3033 & 1137.
Result and Discussion
Result and Discussion

Computational efficiency: the LSA-based methods are more efficient than SVM-pairwise and SVM-LA, but less efficient than the corresponding methods without LSA and than PSI-BLAST.

| method          | vectorization step | optimization step |
|-----------------|--------------------|-------------------|
| SVM-pairwise    | O(n^2 l^2)         | O(n^3)            |
| SVM-LA          | O(n^2 l^2)         | O(n^2 p)          |
| SVM-Ngram       | O(nml)             | O(n^2 m)          |
| SVM-Pattern     | O(nml)             | O(n^2 m)          |
| SVM-Motif       | O(nml)             | O(n^2 m)          |
| SVM-Ngram-LSA   | O(nmt)             | O(n^2 R)          |
| SVM-Pattern-LSA | O(nmt)             | O(n^2 R)          |
| SVM-Motif-LSA   | O(nmt)             | O(n^2 R)          |

n: the number of training examples; l: the length of the longest training sequence; m: the total number of words; t: min(m, n); p: the length of the representation vector (p = n in SVM-pairwise, p = m in the methods without LSA, p = R in the LSA methods).
CONCLUSION
In this paper, the LSA model from natural language processing is successfully applied to protein remote homology detection, and improved performance has been achieved in comparison with the basic formalisms.
Each document is represented as a linear combination of hidden abstract concepts, which arise automatically from the SVD mechanism.
LSA defines a transformation between high-dimensional discrete entities (the vocabulary) and a low-dimensional continuous vector space S, the R-dimensional space spanned by the columns of U, leading to noise removal and an efficient representation of the protein sequence.
As a result, the LSA model achieves better performance than the methods without LSA.