
A Quantitative Modeling of Protein-DNA Interaction for Improved Energy-Based Motif Finding Algorithm

Junguk Hur

School of Informatics

April 25, 2005

L529 – Term Project

BACKGROUND

Motif Finding: an important challenge in computational biology.

Current Algorithms: many stochastic or combinatorial algorithms find motifs for a given set of sequences (MEME, Gibbs, CONSENSUS, etc.)
  No quantitative data

High-throughput genome-wide quantitative data are available
  ChIP-on-Chip: Chromatin ImmunoPrecipitation on Microarray (in vivo)
  PBM: Protein-Binding Microarray (in vitro)

EBMF (Energy-Based Motif Finding) Algorithm
  Ratio → Binding Affinity → Energy

ChIP-on-Chip (Ren et al.)

Array of intergenic sequences from the whole genome

Energy-Based Motif Finding (EBMF)

Chin et al. 2004

Let $e_i$ be the average binding energy between the TF and sequence $s_i$; then $e_i = -\ln(K_e)$

$K_e = [\mathrm{TF} \cdot s_i] \,/\, ([\mathrm{TF}][s_i])$

The color intensity ratio represents the value of $K_e$
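The relation between the measured ratio and the binding energy can be sketched in a few lines. This is only an illustration: the ratio values and sequence names below are hypothetical, and the intensity ratio is taken directly as $K_e$, as the slide states.

```python
import math

def binding_energy(ratio):
    """Return e_i = -ln(K_e), treating the intensity ratio as K_e."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return -math.log(ratio)

# Hypothetical intensity ratios for three intergenic sequences.
ratios = {"seq1": 4.0, "seq2": 1.0, "seq3": 0.5}
energies = {name: binding_energy(r) for name, r in ratios.items()}
```

A ratio above 1 (enrichment) gives a negative energy, a ratio below 1 gives a positive one.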

Problem Definition

Solve A*X = B (A: matrix to be decomposed; B: total energy; X: new energy at each position, to be calculated)

Minimize the prediction error

Iteratively improve the candidate matrix M

A 4 x l energy matrix M represents the motif (l = motif length)
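One way to read A*X = B is sketched below. The slides do not say how A is assembled, so this assumes one binding site per sequence and a 0/1 (one-hot) row per site; the matrix values, the sites, and all function names are hypothetical.

```python
BASES = "ACGT"

def site_energy(M, site):
    """Sum per-position energies of one l-mer; M[b][j] is base b at position j."""
    return sum(M[BASES.index(base)][j] for j, base in enumerate(site))

def one_hot_row(site):
    """Flatten an l-mer into a 0/1 row of length 4*l (position-major)."""
    row = []
    for base in site:
        row.extend(1.0 if base == b else 0.0 for b in BASES)
    return row

def flatten(M):
    """Lay out the 4 x l matrix in the same position-major order as one_hot_row."""
    l = len(M[0])
    return [M[b][j] for j in range(l) for b in range(4)]

def prediction_error(A, x, b):
    """Sum of squared residuals between predicted (A x) and observed (b) energies."""
    total = 0.0
    for row, target in zip(A, b):
        predicted = sum(a * v for a, v in zip(row, x))
        total += (predicted - target) ** 2
    return total

# Hypothetical 4 x 3 energy matrix (rows A, C, G, T) and three sites.
M = [
    [-1.0, 0.5, 0.0],
    [0.5, -2.0, 0.5],
    [0.0, 0.5, -1.5],
    [0.5, 1.0, 1.0],
]
sites = ["ACG", "TTT", "CCA"]
A = [one_hot_row(s) for s in sites]
x = flatten(M)
B = [site_energy(M, s) for s in sites]
error = prediction_error(A, x, B)  # zero here, because B was generated from M
```

With real data B would come from the measured energies, and the task is to find the x (that is, M) that makes this error as small as possible.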

Goals and Methods

Ultimately, to build a better model representing the local and non-local correlation between nucleotides
  Based on the EBMF algorithm
  Utilizing a quantitative measure of DNA-protein interaction
  Potentially more accurate than Position Weight Matrices (PWMs)

Implementation of EBMF first
  Solving linear equations
    Matrix solution: QR-decomposition / LR-decomposition
    Least-squares method: Downhill Simplex Method

Programming language: Perl

Data set: yeast ChIP-on-Chip data (GAL4, GCN4, RAP1)
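The downhill simplex (Nelder-Mead) approach to the least-squares problem can be sketched as follows. This is a simplified textbook variant written in Python for illustration, not the Perl code used in the project; the coefficients, the step size, and the tiny 3 x 2 system are assumptions chosen so the example has a known solution.

```python
def nelder_mead(f, x0, step=0.5, max_iter=2000, tol=1e-12):
    """Minimize f with a simplified downhill simplex search."""
    n = len(x0)
    # Initial simplex: x0 plus one vertex displaced along each axis.
    simplex = [list(x0)]
    for i in range(n):
        v = list(x0)
        v[i] += step
        simplex.append(v)
    values = [f(v) for v in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda k: values[k])
        simplex = [simplex[k] for k in order]
        values = [values[k] for k in order]
        if values[-1] - values[0] < tol:
            break
        # Centroid of all vertices except the worst one.
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]
        worst = simplex[-1]

        def along(t):
            # Point on the line through the worst vertex and the centroid.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        fr = f(reflected)
        if fr < values[0]:
            expanded = along(2.0)
            fe = f(expanded)
            if fe < fr:
                simplex[-1], values[-1] = expanded, fe
            else:
                simplex[-1], values[-1] = reflected, fr
        elif fr < values[-2]:
            simplex[-1], values[-1] = reflected, fr
        else:
            contracted = along(-0.5)
            fc = f(contracted)
            if fc < values[-1]:
                simplex[-1], values[-1] = contracted, fc
            else:
                # Shrink every vertex toward the best one.
                best = simplex[0]
                for k in range(1, n + 1):
                    simplex[k] = [b + 0.5 * (p - b) for b, p in zip(best, simplex[k])]
                    values[k] = f(simplex[k])
    k = min(range(n + 1), key=lambda j: values[j])
    return simplex[k], values[k]

# Hypothetical overdetermined system with exact solution x = (1, 2).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [1.0, 2.0, 3.0]

def squared_error(x):
    return sum((sum(a * v for a, v in zip(row, x)) - b) ** 2 for row, b in zip(A, B))

solution, err = nelder_mead(squared_error, [0.0, 0.0])
```

The search needs only function evaluations, so it still runs when A is singular, but the number of evaluations grows quickly with the 4 x l unknowns of a real motif.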

Results

The implemented EBMF failed to find the motif for any of the TFs, even when the initial matrix was taken from the TRANSFAC PSSM.
  QR/LR-decomposition: resulted in Infinity
    Due to a near-singular matrix (singular up to machine precision)
  Downhill Simplex Method: too slow, and the result still deviated from the TRANSFAC result
  MATLAB: same as QR

Tried to modify the matrix
  Add a small non-zero number to zero elements
  Limit to only one TFBS per promoter
  Worked for random sets of short length but still did not work for the yeast TFs.
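One possible contributor to the singular matrix can be checked directly, under the assumption (not stated in the slides) that each row of A is a one-hot encoding of a site. In that encoding the four columns of every position sum to the same constant vector, so A cannot have full column rank, and replacing the zeros with a small constant does not change this. The sketch below uses exact rational arithmetic to avoid floating-point effects; all names and the value 1/1000 are hypothetical.

```python
from fractions import Fraction
import itertools

BASES = "ACGT"

def design_row(site, zero_value=Fraction(0)):
    """One-hot row for a site; zero_value replaces the zero entries."""
    row = []
    for base in site:
        row.extend(Fraction(1) if base == b else zero_value for b in BASES)
    return row

def rank(rows):
    """Exact matrix rank by Gaussian elimination over the rationals."""
    m = [list(r) for r in rows]
    rk = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            if m[i][col] != 0:
                factor = m[i][col] / m[rk][col]
                m[i] = [a - factor * p for a, p in zip(m[i], m[rk])]
        rk += 1
    return rk

l = 3
sites = ["".join(s) for s in itertools.product(BASES, repeat=l)]  # all 64 3-mers
rank_plain = rank([design_row(s) for s in sites])
rank_eps = rank([design_row(s, Fraction(1, 1000)) for s in sites])
```

Even with every possible 3-mer as a row, both matrices have rank 3*l + 1 = 10 rather than 4*l = 12. If the project's matrix had this form, the system would be singular by construction and not only because of the yeast data.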

Discussion

Are the data singular? Is there another way around this?
Try other data sets.
Other directions for using quantitative protein-DNA binding data
  Possible correlation among TFs

Acknowledgement

I deeply thank Dr. Haixu Tang
