1
Transcription Factor-DNA binding prediction
Tahmina Ahmed, Prosunjit Biswas, Iffat Sharmin Chowdhury, Badri Sampath
2
Motivation
• Label unlabeled DNA sequences using a model built from labeled DNA sequences, and gain experience with real-world machine learning problems.
3
Approaches
• K-mer based
- Fixed-length K-mer
- K-mer with mismatches
- Using regular expressions
• PWM based
- MEME and MAST
• Combined model
- Unite both models
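To make the two K-mer variants concrete, here is a minimal Python sketch of fixed-length K-mer counting and K-mer matching with mismatches. The function names, the uppercase A/C/G/T alphabet, and the use of Hamming distance as the mismatch measure are assumptions made for illustration; the slides do not say how the project implemented these steps.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_count(seq, kmer, max_mismatches=1):
    """Count windows of seq that differ from kmer in at most max_mismatches positions."""
    k = len(kmer)
    seq = seq.upper()
    return sum(
        hamming(seq[i:i + k], kmer) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


def feature_vector(seq, k):
    """Fixed-order vector of counts over all 4**k possible k-mers (assumed ACGT alphabet)."""
    counts = kmer_counts(seq, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]
```

Each sequence then becomes one row of 4**k attributes, which is the kind of table a classifier toolkit can consume.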
K-mer Approach Based on Regular Expression
Motivation
2-mers occur most frequently in the sequences, so we focus mainly on 2-mers.
Strategy
- For any two 2-mers X and Y, generate the regular expressions X(.*)Y and Y(.*)X.
- Use these regular expressions as candidate attributes.
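The strategy above can be sketched in Python as follows. Two details are assumptions, not taken from the slides: X and Y are treated as distinct 2-mers, and each regular expression becomes a binary attribute (1 if it matches anywhere in the sequence). The function names are hypothetical.

```python
import re
from itertools import combinations, product


def two_mer_regexes(alphabet="ACGT"):
    """Build X(.*)Y and Y(.*)X for every unordered pair of distinct 2-mers."""
    two_mers = ["".join(p) for p in product(alphabet, repeat=2)]
    patterns = []
    for x, y in combinations(two_mers, 2):
        patterns.append(f"{x}(.*){y}")
        patterns.append(f"{y}(.*){x}")
    return patterns


def regex_features(seq, patterns):
    """Binary attribute vector: 1 if the pattern occurs anywhere in seq, else 0."""
    return [1 if re.search(p, seq) else 0 for p in patterns]
```

With 16 possible 2-mers this gives 240 candidate attributes, so some form of attribute selection is likely needed before training.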
5
Classifier Selection
Fig : 9 classifiers applied to the TF data sets
Algorithms are numbered as follows -
(1)Logistic (2)SMO (3)NaiveBayes (4)BayesianLogisticRegression (5)Kstar (6)Bagging (7)LogitBoost (8)RandomForest (9)J48
Summary -
* 9 classifiers were applied to 10 data sets; 3 of them are shown
* choosing a single best classifier is not a trivial task
* same classifier behaves differently on different data sets
6
Change in Accuracy due to Different Classifiers
Fig : The performance of different types of classifiers (Logistic, J48, RandomForest, NaiveBayes) on the TF_3 data set
Fig : The performance of different types of classifiers (Logistic, J48, RandomForest, NaiveBayes) on the TF_5 data set
Summary -
* the choice of classifier has a large effect on accuracy
* classifiers have to be chosen carefully
7
Change in Accuracy due to Different K-mer Length
Fig : The performance of different K-mer lengths (4-mer, 5-mer, 6-mer) on the TF_3 data set
Summary -
* K-mer length also affects accuracy
* finding the single best length is not trivial
8
Attribute Space Selection
Fig : The performance for different numbers of selected K-mers on the TF_4 data set
Summary -
* the number of attributes considered also affects accuracy
* accuracy increases as more attributes are considered, but beyond a saturation point it decreases.
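One way to vary the number of attributes is to rank candidate K-mers and keep only the top n. The slides do not state which ranking criterion was used, so the sketch below assumes a simple one: the absolute difference in presence rate between bound (positive) and unbound (negative) sequences. All names are hypothetical.

```python
from collections import Counter


def presence_counts(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def top_n_kmers(pos_seqs, neg_seqs, k, n):
    """Keep the n k-mers whose presence rate differs most between the two classes."""
    pos = presence_counts(pos_seqs, k)
    neg = presence_counts(neg_seqs, k)

    def score(kmer):
        # Assumed criterion: gap between per-class presence rates.
        return abs(pos[kmer] / len(pos_seqs) - neg[kmer] / len(neg_seqs))

    candidates = sorted(set(pos) | set(neg))
    return sorted(candidates, key=score, reverse=True)[:n]
```

Sweeping n and re-measuring accuracy at each value is how a saturation point like the one described above would show up.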
9
PWM based Analysis on Accuracy(TF_1 data set)
Fig : J48, minW 6 - maxW 15, no. of sites 10
Fig : J48, minW 6 - maxW 15, no. of motifs 5
Summary -
* accuracy increases with more motifs when the no. of sites is fixed
* accuracy increases with more sites when the no. of motifs is fixed
* what happens when we increase both?
PWM based Analysis
Fig : Accuracy varies with the no. of motifs and the no. of sites
* the 1st bar shows the no. of sites
* the 2nd bar shows the no. of motifs
* the 3rd bar shows the accuracy
* the point is that accuracy decreases when we increase both the no. of motifs and the no. of sites.
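For readers unfamiliar with PWMs, the following is a minimal Python sketch of building a log-odds position weight matrix from aligned binding sites and scanning a sequence for its best-scoring window. It illustrates the general idea only and is not the algorithm used by MEME or MAST; the pseudocount, the uniform background, and the function names are assumptions.

```python
import math


def build_pwm(sites, pseudocount=1.0, alphabet="ACGT"):
    """Log-odds PWM (one dict per column) from equal-length aligned sites."""
    width = len(sites[0])
    background = 1.0 / len(alphabet)  # assumed uniform background
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for j in range(width):
        column = [s[j] for s in sites]
        pwm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in alphabet
        })
    return pwm


def best_score(seq, pwm):
    """Return (score, offset) of the highest-scoring window of seq."""
    width = len(pwm)
    scores = [
        (sum(pwm[j][seq[i + j]] for j in range(width)), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(scores)
```

The best window score per motif can serve as one attribute per sequence, so the number of motifs determines the number of PWM-derived attributes.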
Extra Work for TF_20
Fig : Flow diagram of Building New Model for TF-20
Summary -
* we have done some extra work for TF_20
Diagram labels: K-mer + PWM; Sequences identified differently; Sequences identified by both models; Biased 2-mer Model; Newly Labeled Sequences; The New Model for TF-20
12
AUC based on the Feedback (bonus model)
Fig : AUC of 10 data sets based on last submission
* accuracy improved compared with the first submission
* the PWM model did not give good results
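As a reminder of what AUC measures, here is a small Python sketch that computes it as the fraction of (positive, negative) sequence pairs in which the positive receives the higher score, with ties counted as half. This is a generic definition written for illustration; it is not the evaluation script behind the figure, and the function name is hypothetical.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 (bound) or 0 (unbound); ties between a positive and a
    negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why it is less sensitive to the classification threshold than plain accuracy.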
13
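Participation
Badri Sampath
- Background Study: DNA, RNA, protein, motif
- Working with Tools: AlignAce, MEME, MAST
- Working with Models: PWM
- Parameter Tuning: K-mer
- Automation: Arff Writer, MAST output writer
Iffat Sharmin Chowdhury
- Background Study: Protein, Motif, Transcription
- Working with Tools: Weka, AlignAce, ScanAce
- Working with Models: K-mer
- Parameter Tuning: PWM
- Automation: Script for FASTA, Weka
Prosunjit Biswas
- Background Study: DNA, Transcription, K-mer
- Working with Tools: MEME, MAST
- Working with Models: K-mer
- Parameter Tuning: PWM
- Automation: Script for RE, for new model
Tahmina Ahmed
- Background Study: MEME, MAST, PWM
- Working with Tools: MEME, MAST, Weka
- Working with Models: PWM
- Parameter Tuning: K-mer
- Automation: Script for MEME, MAST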
14
Acknowledgment
Questions?