A Quantitative Modeling of Protein-DNA Interaction for
Improved Energy-Based Motif Finding
Algorithm
Junguk Hur
School of Informatics
April 25, 2005
L529 – Term Project
BACKGROUND
Motif finding: an important challenge in computational biology.
Current algorithms:
  Many stochastic or combinatorial algorithms find motifs for a given set of sequences: MEME, Gibbs, CONSENSUS, etc.
  They use no quantitative data
High-throughput genome-wide quantitative data are available
  ChIP-on-Chip: Chromatin ImmunoPrecipitation on Microarray (in vivo)
  PBM: Protein-Binding Microarray (in vitro)
EBMF (Energy-Based Motif Finding) algorithm
  Ratio -> Binding Affinity -> Energy
ChIP-on-Chip (Ren et al.)
Array of intergenic sequences from the whole genome
Energy-Based Motif Finding (EBMF)
Chin et al. 2004
Let $e_i$ be the average binding energy between the TF and sequence $s_i$; then
  $e_i = -\ln(K_e)$
  $K_e = [TF \cdot s_i] / ([TF][s_i])$
The color intensity ratio represents the value of $K_e$
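The conversion from intensity ratio to energy can be sketched in a few lines. This is a minimal illustration in Python (not the project's Perl), assuming the measured color intensity ratio is used directly as K_e; the function name `binding_energy` and the example ratios are hypothetical and are not measured data.

```python
import math

def binding_energy(ratio):
    """Return e_i = -ln(K_e), treating the intensity ratio as K_e (assumption)."""
    if ratio <= 0:
        raise ValueError("intensity ratio must be positive")
    return -math.log(ratio)

# Hypothetical ratios for three intergenic sequences (illustrative only).
ratios = {"seq1": 4.0, "seq2": 1.0, "seq3": 0.5}
energies = {name: binding_energy(r) for name, r in ratios.items()}

for name in sorted(energies):
    print(name, round(energies[name], 3))
```

A higher ratio (stronger binding) maps to a lower, more favourable energy, and a ratio of 1 maps to zero energy.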
Problem Definition
Solve A*X = B
  A: matrix to be decomposed
  B: total energy
  X: new energy at each position (to be calculated)
Minimize the prediction error
Iteratively improve candidate matrix M
4 x l energy matrix M represents the motif (l = motif length)
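How a 4 x l energy matrix M might score a sequence, and what "prediction error" could mean, is sketched below. The sketch assumes an additive model (one energy term per position) and takes the single lowest-energy window as the predicted energy of a sequence; that window rule, the function names, and the toy matrix are assumptions of this example, not details given in the slides.

```python
def site_energy(M, site):
    """Sum the per-position energies of one site of length l.

    M maps each base to a list of l energies (a 4 x l matrix).
    """
    return sum(M[base][pos] for pos, base in enumerate(site))

def best_site_energy(M, seq):
    """Lowest (most favourable) energy over all length-l windows of seq."""
    l = len(M["A"])
    return min(site_energy(M, seq[i:i + l]) for i in range(len(seq) - l + 1))

def prediction_error(M, seqs, observed):
    """Sum of squared differences between predicted and observed energies."""
    return sum((best_site_energy(M, s) - e) ** 2 for s, e in zip(seqs, observed))

# Toy matrix with l = 2 that favours the site "AG" (hypothetical values).
M = {"A": [-1.0, 0.0], "C": [0.0, 0.0], "G": [0.0, -2.0], "T": [0.0, 0.0]}
seqs = ["TTAGC", "CCCC"]
observed = [-3.0, 0.0]
error = prediction_error(M, seqs, observed)
```

An iterative scheme would adjust the entries of M to reduce `error` on the measured energies.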
Goals and Methods
Ultimately, to build a better model representing the local and non-local correlation between nucleotides
  Based on the EBMF algorithm
  Utilizing a quantitative measure of DNA-protein interaction
  Potentially more accurate than Position Weight Matrices (PWMs)
Implementation of EBMF first
  Solving linear equations
    Matrix solution: QR-decomposition / LR-decomposition
    Least squares method: Downhill Simplex Method
  Programming language: Perl
  Data set: Yeast ChIP-on-Chip data (GAL4, GCN4, RAP1)
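The matrix-solution step can be illustrated with a small QR-based least-squares solver. This is a minimal sketch in Python rather than the project's Perl, using modified Gram-Schmidt; the function name `qr_least_squares`, the tolerance `tol`, and the toy system are hypothetical and not taken from the slides.

```python
def qr_least_squares(A, b, tol=1e-12):
    """Solve min ||A x - b|| via modified Gram-Schmidt QR.

    A is a list of rows. Raises ValueError if a column is numerically
    dependent on earlier columns (rank-deficient matrix).
    """
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    Q = []                               # orthonormal columns, stored as lists
    R = [[0.0] * n for _ in range(n)]    # upper-triangular factor
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            R[k][j] = sum(Q[k][i] * v[i] for i in range(m))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(m)]
        norm = sum(x * x for x in v) ** 0.5
        if norm < tol:
            raise ValueError("matrix is numerically rank deficient")
        R[j][j] = norm
        Q.append([x / norm for x in v])
    # Back-substitution on R x = Q^T b.
    qtb = [sum(Q[k][i] * b[i] for i in range(m)) for k in range(n)]
    x = [0.0] * n
    for j in range(n - 1, -1, -1):
        s = qtb[j] - sum(R[j][k] * x[k] for k in range(j + 1, n))
        x[j] = s / R[j][j]
    return x

# Toy overdetermined system whose exact solution is x = [1, 2].
x = qr_least_squares([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0])
```

The explicit check on `norm` is a design choice of this sketch: without it, a dependent column leads to division by a value near zero in the back-substitution.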
Results
The implemented EBMF failed to find the motif for any of the TFs, even when the initial matrix was taken from the TRANSFAC PSSM.
  QR/LR-decomposition: resulted in Infinity
    Due to a singular-like matrix (up to the precision of the machine)
  Downhill Simplex Method: too slow and still deviated from the TRANSFAC result
  MATLAB: same as QR
Tried to modify the matrix
  Add a small non-zero number to zero elements
  Limit to only one TFBS per promoter
  Worked for short random sets but still did not work for the yeast TFs.
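One possible contributor to a singular-like matrix can be shown on a toy case. The slides do not say how A was constructed; the sketch below assumes each row of A is a one-hot encoding of a site (four indicator columns per position). Under that assumption the four columns of every position sum to the all-ones vector, so the columns are linearly dependent as soon as l >= 2. The function names are hypothetical.

```python
BASES = "ACGT"

def one_hot_row(site):
    """Encode a site of length l as 4*l indicator values, position by position."""
    return [1.0 if base == b else 0.0 for base in site for b in BASES]

def matrix_rank(rows, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    M = [r[:] for r in rows]
    n_rows, n_cols = len(M), len(M[0])
    rank = 0
    for c in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(M[r][c]))
        if abs(M[pivot][c]) < tol:
            continue  # no usable pivot in this column
        M[rank], M[pivot] = M[pivot], M[rank]
        for r in range(rank + 1, n_rows):
            f = M[r][c] / M[rank][c]
            M[r] = [a - f * p for a, p in zip(M[r], M[rank])]
        rank += 1
    return rank

# All 16 dinucleotides (l = 2): A has 8 columns but only rank 7.
sites = [a + b for a in BASES for b in BASES]
A = [one_hot_row(s) for s in sites]
rank = matrix_rank(A)
print(len(A[0]), rank)
```

With this encoding a direct decomposition of A cannot give a unique X, whatever the data; fixing a reference energy per position or adding a constraint would be one way to remove the dependency.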
Discussion
Are the data singular? Is there another trick to handle this?
Try another data set.
Other directions for using quantitative protein-DNA binding data
  Possible correlation among TFs
Acknowledgement
I deeply thank Dr. Haixu Tang