Final project: transcription factor DNA binding prediction

1

Transcription Factor-DNA binding prediction

Tahmina Ahmed, Prosunjit Biswas, Iffat Sharmin Chowdhury, Badri Sampath


Motivation

• Label unlabeled DNA sequences with a model built from labeled DNA sequences, and gain experience with real-world machine learning problems.


Approaches

• K-mer based

- Fixed-length K-mer

- K-mer with mismatches

- Using regular expressions

• PWM based

- MEME and MAST

• Combined model

- Unite both models
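
To make the two simplest K-mer approaches concrete, here is a minimal Python sketch of fixed-length K-mer counting and K-mer counting with mismatches. The function names, the ACGT alphabet, and the default of one allowed mismatch are assumptions made for illustration; the slides do not say how the project's features were computed.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_count(seq, kmer, max_mismatches=1):
    """Count windows of seq that differ from kmer in at most max_mismatches positions."""
    k = len(kmer)
    return sum(
        hamming(seq[i:i + k], kmer) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


def feature_vector(seq, k, alphabet="ACGT"):
    """One count per possible k-mer, in a fixed lexicographic order."""
    counts = kmer_counts(seq, k)
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]
```

With this layout, every sequence maps to a vector of length 4^k, so one row per sequence can be handed to any classifier.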

K-mer Approach Based on Regular Expression

Motivation

2-mers occur most frequently in the sequences, so this approach focuses mainly on 2-mers.

Strategy

- For any two 2-mers X & Y, generate the regular expressions X(.*)Y and Y(.*)X.

- Use these regular expressions as candidate attributes.
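
The strategy above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's script: the function names are hypothetical, X and Y are taken to be distinct 2-mers over the ACGT alphabet, and each attribute is a 0/1 flag for whether the pattern matches anywhere in the sequence.

```python
import re
from itertools import combinations, product


def two_mers(alphabet="ACGT"):
    """All 16 possible 2-mers over the alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=2)]


def candidate_patterns(kmers):
    """For every unordered pair of distinct k-mers X, Y build X(.*)Y and Y(.*)X."""
    patterns = []
    for x, y in combinations(kmers, 2):
        patterns.append(f"{x}(.*){y}")
        patterns.append(f"{y}(.*){x}")
    return patterns


def regex_features(seq, patterns):
    """Binary attribute per pattern: 1 if it matches somewhere in seq, else 0."""
    return [1 if re.search(p, seq) else 0 for p in patterns]
```

For the four-letter alphabet this yields 240 candidate attributes (120 pairs, two orderings each), which is small enough to filter afterwards with an attribute selection step.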


Classifier Selection

Fig : 9 classifiers applied to a TF data set

Algorithms are numbered as follows -

(1) Logistic (2) SMO (3) NaiveBayes (4) BayesianLogisticRegression (5) KStar (6) Bagging (7) LogitBoost (8) RandomForest (9) J48

Summary -

* 9 classifiers were applied to 10 data sets; 3 of them are shown

* choosing a single best classifier is not a trivial task

* same classifier behaves differently on different data sets


Change in Accuracy due to Different Classifiers

Fig : The performance of different types of classifiers (Logistic, J48, RandomForest, NaiveBayes) on the TF_3 data set

Fig : The performance of different types of classifiers (Logistic, J48, RandomForest, NaiveBayes) on the TF_5 data set

Summary -

* the choice of classifier has a large effect on accuracy

* classifiers have to be chosen carefully


Change in Accuracy due to Different K-mer Length

Fig : The performance of different K-mer lengths (4-mer, 5-mer, 6-mer) on the TF_3 data set

Summary -

* K-mer length also affects accuracy

* finding the single best length is not trivial


Attribute Space Selection

Fig : The performance for different numbers of selected k-mers on the TF_4 data set

Summary -

* the number of attributes considered also affects accuracy

* accuracy increases as more attributes are considered, but decreases beyond a saturation point
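
One way to vary the number of attributes is to rank candidate k-mers and keep only the top n. The sketch below is an assumption for illustration only: the slides do not state which selection criterion was used, and the scoring rule here (absolute difference between the fraction of positive and of negative sequences that contain the k-mer) and the function names are hypothetical.

```python
def presence_rate(kmer, seqs):
    """Fraction of sequences that contain kmer at least once."""
    return sum(kmer in s for s in seqs) / len(seqs)


def top_kmers(pos, neg, k, n):
    """Return the n k-mers whose presence differs most between pos and neg.

    Assumed criterion (not from the slides): absolute difference in
    presence rate; ties are broken alphabetically for reproducibility.
    """
    kmers = set()
    for s in pos + neg:
        kmers.update(s[i:i + k] for i in range(len(s) - k + 1))
    ranked = sorted(
        kmers,
        key=lambda m: (-abs(presence_rate(m, pos) - presence_rate(m, neg)), m),
    )
    return ranked[:n]
```

Sweeping n and re-evaluating the classifier at each value is what would trace out a rise-then-fall curve of the kind summarized above.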


PWM based Analysis of Accuracy (TF_1 data set)

Fig : J48, minW 6 - maxW 15, no. of sites 10

Fig : J48, minW 6 - maxW 15, no. of motifs 5

Summary -

* accuracy increases with more motifs when the no. of sites is fixed

* accuracy increases with more sites when the no. of motifs is fixed

* what happens when we increase both?

PWM based Analysis

Fig : Accuracy varies with no. of motifs and no. of sites

* 1st bar shows the no. of sites

* 2nd bar shows the no. of motifs

* 3rd bar shows the accuracy

* the point is that accuracy decreases when we increase both the no. of motifs and the no. of sites.
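
For readers unfamiliar with PWMs, the following Python sketch shows the general idea of building a log-odds matrix from aligned sites and scanning a sequence with it. It is not MEME or MAST and does not reproduce their algorithms; the pseudocount, the uniform 0.25 background, the example sites, and the function names are all assumptions for illustration.

```python
import math


def log_odds_pwm(sites, pseudocount=1.0, background=0.25, alphabet="ACGT"):
    """Build a log2-odds matrix (one dict per column) from equal-length sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for j in range(width):
        column = [s[j] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in alphabet
        })
    return pwm


def best_hit(seq, pwm):
    """Return (score, start) of the highest-scoring window.

    Assumes len(seq) is at least the motif width.
    """
    width = len(pwm)
    scores = [
        (sum(pwm[j][seq[i + j]] for j in range(width)), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(scores)
```

A best-hit score per motif could serve as one attribute per sequence, so the number of motifs directly sets the number of PWM-derived attributes.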

Extra Work for TF_20

Fig : Flow diagram for building the new model for TF_20

Summary -

* we have done some extra work for TF_20

Diagram boxes : K-mer + PWM; Sequences identified differently; Sequences identified by both models; Biased 2-mer Model; Newly Labeled Sequences; The New Model for TF_20


AUC based on the Feedback (bonus model)

Fig : AUC of 10 data sets based on last submission

* accuracy improved over the first submission

* the PWM model did not give good results
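
Since the feedback is reported as AUC, a short reminder of what that number measures may help. The sketch below computes AUC as the fraction of (positive, negative) pairs that the model ranks correctly, counting ties as half. The function name and the 0/1 label encoding are assumptions; this is not the scoring code used for the submissions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for bound (positive), 0 for unbound (negative).
    scores: higher means more likely positive.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC can tell a different story from plain accuracy on unbalanced data sets.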


Participation

Member | Background Study | Working with Tools | Working with Models | Parameter Tuning | Automation

Badri Sampath | DNA, RNA, protein, motif | AlignAce, MEME, MAST | PWM | K-mer | Arff writer, MAST output writer

Iffat Sharmin Chowdhury | Protein, motif, transcription | Weka, AlignAce, ScanAce | K-mer | PWM | Script for FASTA, Weka

Prosunjit Biswas | DNA, transcription, K-mer | MEME, MAST | K-mer | PWM | Script for RE, for new model

Tahmina Ahmed | MEME, MAST, PWM | MEME, MAST, Weka | PWM | K-mer | Script for MEME, MAST


Acknowledgment

Questions ???
