![Page 1: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/1.jpg)
Modeling Dependencies in Protein-DNA Binding Sites
Yoseph Barash 1, Gal Elidan 1, Nir Friedman 1, Tommy Kaplan 1,2
1 School of Computer Science & Engineering
2 Hadassah Medical School
The Hebrew University, Jerusalem, Israel
![Page 2: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/2.jpg)
Dependent positions in binding sites
[Figure: a gene with its promoter and a binding site; example site A?C?T with undetermined positions]
Most approaches assume position independence
To model or not to model dependencies? [Man & Stormo, 2001; Bulyk et al., 2002; Benos et al., 2002]
Pros: Biology suggests dependencies
  A single amino acid interacts with two nucleotides
  Change in conformation of protein or DNA
Cons: Modeling dependencies is harder
  Additional parameters
  Requires more data, not as robust
![Page 3: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/3.jpg)
Data driven approach
Can we learn dependencies from available genomic data? Yes
Do dependency models perform better? Yes
Outline
  Flexible models of dependencies
  Learning from (un)aligned sequences
  Systematic evaluation
  Biological insights
![Page 4: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/4.jpg)
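How to model binding sites?
Goal: represent a distribution of binding sites, $P(X_1, X_2, X_3, X_4, X_5) = \;?$
Profile (independence model): $P(X_1,\dots,X_5) = P(X_1)\,P(X_2)\,P(X_3)\,P(X_4)\,P(X_5)$
Tree (direct dependencies): $P(X_1,\dots,X_5) = P(X_1)\,P(X_2 \mid X_3)\,P(X_3 \mid X_1)\,P(X_4)\,P(X_5 \mid X_3)$
Mixture of Profiles (global dependencies): $P(X_1,\dots,X_5) = \sum_T P(T)\,P(X_1 \mid T)\,P(X_2 \mid T)\,P(X_3 \mid T)\,P(X_4 \mid T)\,P(X_5 \mid T)$
Mixture of Trees (both types of dependencies): $P(X_1,\dots,X_5) = \sum_T P(T)\,P(X_1 \mid T)\,P(X_2 \mid X_3, T)\,P(X_3 \mid X_1, T)\,P(X_4 \mid T)\,P(X_5 \mid X_3, T)$
[Figure: graphical model of each of the four model types over positions X1 to X5, with a hidden mixture variable T in the two mixture models]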
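The factorizations on this slide can be made concrete with a small scoring sketch. Everything numeric below is a hypothetical toy parameterization, not a model from the talk: uniform marginals, a "child copies its parent with probability 0.7" conditional table, and an illustrative tree in which X3 depends on X1 while X2 and X5 depend on X3. Log-probabilities are taken in base 2.

```python
import math

BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}  # hypothetical toy marginal

def profile_logp(site, profile):
    """log2 P(site) under the independence model: sum of per-position terms."""
    return sum(math.log2(profile[i][x]) for i, x in enumerate(site))

def tree_logp(site, parents, marginals, conditionals):
    """log2 P(site) under a tree: roots use P(X_i), others use P(X_i | X_parent)."""
    total = 0.0
    for i, x in enumerate(site):
        p = parents[i]
        if p is None:
            total += math.log2(marginals[i][x])
        else:
            total += math.log2(conditionals[i][site[p]][x])
    return total

def mixture_logp(site, weights, component_logps):
    """log2 of sum_T P(T) P(site | T), given one scoring function per component."""
    return math.log2(sum(w * 2.0 ** f(site)
                         for w, f in zip(weights, component_logps)))

# Toy parameters (assumptions for illustration only).
profile = [UNIFORM] * 5
parents = [None, 2, 0, None, 2]          # 0-based: X2|X3, X3|X1, X5|X3
marginals = [UNIFORM] * 5
copy_cpd = {a: {b: (0.7 if a == b else 0.1) for b in BASES} for a in BASES}
conditionals = [None, copy_cpd, copy_cpd, None, copy_cpd]

site = "GGGAG"
lp_profile = profile_logp(site, profile)
lp_tree = tree_logp(site, parents, marginals, conditionals)
lp_mix = mixture_logp(site, [0.5, 0.5],
                      [lambda s: profile_logp(s, profile)] * 2)
print(lp_profile, lp_tree, lp_mix)
```

A mixture of trees would use the same `mixture_logp` with one `tree_logp` closure per component, which is why the mixture scorer takes arbitrary component functions.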
![Page 5: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/5.jpg)
Learning models: Aligned binding sites
Learning based on methods for probabilistic graphical models (Bayesian networks)
Aligned binding sites: GCGGGGCCGGGC TGGGGGCGGGGT AGGGGGCGGGGG TAGGGGCCGGGC TGGGGGCGGGGT AAAGGGCCGGGC GGGAGGCCGGGA GCGGGGCGGGGC GAGGGGACGAGT CCGGGGCGGTCC ATGGGGCGGGGC
[Figure: the aligned binding sites are fed to the learning machinery, which outputs a model of each type (profile, tree, mixture of profiles, mixture of trees)]
Select the maximum likelihood model
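The slide does not say which procedure the learning machinery uses. As one possible illustration, the classical Chow-Liu approach finds a maximum likelihood tree structure by computing pairwise mutual information between positions and taking a maximum-weight spanning tree. The sketch below assumes that approach; the function names and the toy sites are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(sites, i, j):
    """Empirical mutual information (in bits) between columns i and j."""
    n = len(sites)
    ci = Counter(s[i] for s in sites)
    cj = Counter(s[j] for s in sites)
    cij = Counter((s[i], s[j]) for s in sites)
    return sum((c / n) * math.log2(c * n / (ci[a] * cj[b]))
               for (a, b), c in cij.items())

def chow_liu_edges(sites):
    """Maximum-weight spanning tree over positions (Kruskal with union-find)."""
    length = len(sites[0])
    scored = sorted(((mutual_information(sites, i, j), i, j)
                     for i, j in combinations(range(length), 2)),
                    reverse=True)
    root = list(range(length))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path compression
            x = root[x]
        return x

    edges = []
    for mi, i, j in scored:
        ri, rj = find(i), find(j)
        if ri != rj:              # adding this edge keeps the graph acyclic
            root[ri] = rj
            edges.append((i, j, mi))
    return edges

# Toy aligned sites: columns 0 and 1 always agree, column 2 is unrelated.
toy_sites = ["AAC", "AAG", "TTC", "TTG"]
edges = chow_liu_edges(toy_sites)
print(edges)
```

In practice, edges with near-zero mutual information could be dropped so that unrelated positions stay independent, giving a forest rather than a full tree.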
![Page 6: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/6.jpg)
Evaluation using aligned data
Estimate generalization of each model:
Test: how probable is the site given the model?
Cross-validation: [Figure: the data set is split into a training set and a test set, and each test-set site is scored by its log-likelihood under the model learned from the training set]
Sites shown: ATGGGGCGGGGC GTGGGGCGGGGC ATGGGGCGGGGC GTGGGGCGGGGC GCGGGGCGGGGC GAGGGGACGAGT CCGGGGCGGTCC ATGGGGCGGGGC GCGGGGCCGGGC TGGGGGCGGGGT AGGGGGCGGGGG TAGGGGCCGGGC TGGGGGCGGGGT TGGGGGCCGGGC
Test log-likelihood per site: -20.34, -23.03, -21.31, -19.10, -18.42, -19.70, -22.39, -23.54, -22.39, -23.54, -18.07, -19.18, -18.31, -21.43
Test avg. LL = -20.77
Data: 95 TFs with ≥ 20 binding sites from TRANSFAC database [Wingender et al., 2001]
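The reported average can be checked directly from the fourteen per-site test log-likelihoods shown on this slide. The sketch below does that, and adds a generic k-fold helper; the helper's name, its interleaved fold assignment, and the `fit` / `score` callables are assumptions for illustration, since the slide does not specify how folds were formed.

```python
def cross_validated_ll(sites, k, fit, score):
    """Average held-out log-likelihood: fit on k-1 folds, score the remaining one."""
    lls = []
    for fold in range(k):
        test = sites[fold::k]                                   # every k-th site
        train = [s for idx, s in enumerate(sites) if idx % k != fold]
        model = fit(train)
        lls.extend(score(model, s) for s in test)
    return sum(lls) / len(lls)

# Per-site test log-likelihoods as listed on the slide.
test_ll = [-20.34, -23.03, -21.31, -19.10, -18.42, -19.70, -22.39,
           -23.54, -22.39, -23.54, -18.07, -19.18, -18.31, -21.43]
avg_ll = sum(test_ll) / len(test_ll)
print(round(avg_ll, 2))  # -20.77
```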
![Page 7: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/7.jpg)
Example: Arabidopsis ABA binding factor 1
Profile: test LL per instance -19.93
Mixture of Profiles (components weighted 76% and 24%): test LL per instance -18.70 (+1.23; improvement in likelihood > 2-fold)
Tree: test LL per instance -18.47 (+1.46; improvement in likelihood > 2.5-fold)
[Figure: learned tree over positions X4 to X12]
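The fold-change figures quoted here are consistent with log-likelihoods measured in base 2, so that a gain of d units corresponds to a 2^d-fold increase in likelihood. The base is an inference from the numbers on the slide, not something the slide states. A quick check under that assumption:

```python
def fold_change(delta_ll_bits):
    """Likelihood ratio implied by a log2-likelihood difference."""
    return 2.0 ** delta_ll_bits

profile_ll = -19.93
mixture_fold = fold_change(-18.70 - profile_ll)  # gain of +1.23
tree_fold = fold_change(-18.47 - profile_ll)     # gain of +1.46
print(round(mixture_fold, 2), round(tree_fold, 2))
```

Both values land above the "> 2-fold" and "> 2.5-fold" thresholds quoted on the slide; with natural logarithms the same gains would correspond to noticeably larger ratios.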
![Page 8: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/8.jpg)
Likelihood improvement over profiles
[Figure: fold-change in likelihood (log scale, 0.5 to 128) for each of the 95 TRANSFAC aligned data sets (x-axis, 10 to 90), marked as significant or not significant by a paired t-test]
Data often exhibits dependencies
Significant improvement in generalization
![Page 9: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/9.jpg)
Evaluation for unaligned data
Motif finding problem
Input: A set of potentially co-regulated genes
Output: A common motif in their promoters
Sources of data:
  Gene annotation (e.g. Hughes et al., 2000)
  Gene expression (e.g. Spellman et al., 1998; Tavazoie et al., 2000)
  ChIP (e.g. Simon et al., 2001; Lee et al., 2002)
![Page 10: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/10.jpg)
Learning models: unaligned data
Use the EM algorithm to simultaneously:
  Identify binding site positions
  Learn a dependency model
[Figure: EM loop over the unaligned data, alternating between identifying binding sites and learning a model; the model can be any of the four types (profile, tree, mixture of profiles, mixture of trees)]
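The alternation between locating sites and re-estimating the model can be sketched for the simplest case. The code below is a minimal illustration, not the authors' implementation: it assumes a profile motif, exactly one site per sequence, a uniform background (which then cancels out of the posterior), and pseudocounts in the M-step. The function name, defaults, and toy sequences are hypothetical.

```python
import math
import random

BASES = "ACGT"

def em_motif(seqs, width, iters=50, pseudo=0.5, seed=0):
    """EM for a profile motif with one site per sequence (uniform background)."""
    rng = random.Random(seed)
    # Random initial profile: one probability row per motif position.
    profile = []
    for _ in range(width):
        row = [rng.random() + 0.5 for _ in BASES]
        total = sum(row)
        profile.append({b: v / total for b, v in zip(BASES, row)})

    posteriors = []
    for _ in range(iters):
        # E-step: posterior over the site's start position in each sequence.
        posteriors = []
        for s in seqs:
            weights = [math.prod(profile[i][s[z + i]] for i in range(width))
                       for z in range(len(s) - width + 1)]
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # M-step: expected base counts per motif position, plus pseudocounts.
        counts = [{b: pseudo for b in BASES} for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for z, p in enumerate(post):
                for i in range(width):
                    counts[i][s[z + i]] += p
        profile = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
    # Posteriors come from the last E-step, i.e. before the final M-step.
    return profile, posteriors

toy_seqs = ["ACGTGGGCGGATA", "TTGGGCGGCATCA", "CATAGGGCGGTTT"]
learned_profile, site_posteriors = em_motif(toy_seqs, width=7, iters=20)
```

Replacing the profile with a tree or mixture model would change only the per-window likelihood in the E-step and the parameter update in the M-step. EM can converge to a local optimum, so several random restarts are commonly used.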
![Page 11: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/11.jpg)
ChIP location analysis [Lee et al., 2002]
Yeast genome-wide location experiments
Target genes for 106 TFs in 146 experiments
[Figure: table of ~6000 yeast genes (YAL001C, YAL002W, YAL003W, YAL005C, ..., YAL010C, YAL012C, YAL013W, YPR201W), each marked + or – as a target of each TF, e.g. ABF1 targets, ZAP1 targets]
![Page 12: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/12.jpg)
Example: Models learned for ABF1 (YPD)
Autonomously replicating sequence-binding factor 1
[Figure: the known profile (from TRANSFAC), the learned profile, and the learned mixture of profiles, whose two components are labeled 43 and 492]
![Page 13: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/13.jpg)
Evaluating Performance
Detect target genes on a genomic scale:
[Figure: a promoter sequence spanning positions -1000 to 0 (ACGTAT ... AGGGATGCGAGC), with position -473 marked]
![Page 14: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/14.jpg)
Evaluating Performance
Detect target genes on a genomic scale:
[Figure: p-value (log scale, 10^-8 to 10^-1) plotted against positions -180 to -60 for the Profile and Mix of Trees models, with the threshold "Bonferroni corrected p-value ≤ 0.01" and a biologically verified site marked]
Gal4 regulates Gal80
![Page 15: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/15.jpg)
Evaluation using ChIP location data [Lee et al., 2002]
Evaluate using a 5-fold cross-validation test:
[Figure: the data set of genes (YAL001C, YAL002W, YAL003W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL012C, YAL013W, YPR201W) with +/– target labels; part of it is held out as the test set, and each test-set gene receives a +/– prediction]
![Page 16: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/16.jpg)
Evaluation using ChIP location data [Lee et al., 2002]
Evaluate using a 5-fold cross-validation test:
[Figure: the predictions are compared with the true labels gene by gene: √ √ √ √ FN √ √ √ FP √ √, where √ marks a correct prediction, FN a false negative and FP a false positive]
![Page 17: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/17.jpg)
Example: ROC curve of HSF1
[Figure: ROC curves for Profile, Tree, Mixture of Profiles and Mixture of Trees; x-axis: False Positive Rate (0% to 5%); y-axis: True Positive Rate (Sensitivity) (0% to 90%); annotation: ~60 FP]
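An ROC curve like this one is obtained by sweeping a score threshold over all genes and recording the false positive rate and true positive rate at each step. The sketch below shows that computation; the function name and the toy scores are hypothetical, and tied scores are not treated specially.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the threshold sweeps from the highest score down."""
    pos = sum(labels)            # number of true targets
    neg = len(labels) - pos      # number of non-targets
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_target in sorted(zip(scores, labels), key=lambda t: -t[0]):
        if is_target:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Toy example: four genes with model scores and 1/0 target labels.
points = roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(points)
```

Because non-targets vastly outnumber targets on a genomic scale, only the low false-positive end of the curve is of practical interest, which is presumably why the plot stops at 5%.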
![Page 18: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/18.jpg)
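Improvement in sensitivity & specificity
Tree vs. Profile
Sensitivity = TP / True
Specificity = TP / Predicted
[Figure: diagram of the True and Predicted gene sets with their overlap TP; scatter plot of Δ specificity (y-axis, -25 to 20) against Δ sensitivity (x-axis, -20 to 60), annotated with the counts 30, 615, 3 (as extracted; adjacent numbers may be fused)]
105 unaligned data sets from Lee et al.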
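The two measures defined on this slide can be computed from the set of true targets and the set of predicted targets. Note that "specificity" as defined here (TP / Predicted) is the quantity often called precision or positive predictive value. The gene sets below are made up for illustration.

```python
def sensitivity_specificity(true_targets, predicted_targets):
    """Sensitivity = TP / True, specificity = TP / Predicted (slide definitions)."""
    true_targets = set(true_targets)
    predicted_targets = set(predicted_targets)
    tp = len(true_targets & predicted_targets)
    sensitivity = tp / len(true_targets) if true_targets else 0.0
    specificity = tp / len(predicted_targets) if predicted_targets else 0.0
    return sensitivity, specificity

# Hypothetical target sets, using gene names only as labels.
true_set = {"YAL001C", "YAL003W", "YAL010C"}
predicted_set = {"YAL001C", "YAL010C", "YAL012C", "YAL013W"}
sens, spec = sensitivity_specificity(true_set, predicted_set)
print(sens, spec)
```

The Δ values in the scatter plots are then differences of these quantities between a dependency model and the profile model on the same data set.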
![Page 19: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/19.jpg)
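Improvement in sensitivity & specificity
Mixture of Profiles vs. Profile
Sensitivity = TP / True
Specificity = TP / Predicted
[Figure: diagram of the True and Predicted gene sets with their overlap TP; scatter plot of Δ specificity (y-axis, -25 to 20) against Δ sensitivity (x-axis, -20 to 60), annotated with the counts 52, 1718, 0 (as extracted; adjacent numbers may be fused)]
105 unaligned data sets from Lee et al.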
![Page 20: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/20.jpg)
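Improvement in sensitivity & specificity
Mixture of Trees vs. Profile
Sensitivity = TP / True
Specificity = TP / Predicted
[Figure: diagram of the True and Predicted gene sets with their overlap TP; scatter plot of Δ specificity (y-axis, -25 to 20) against Δ sensitivity (x-axis, -20 to 60), annotated with the counts 84, 162, 1 (as extracted; adjacent numbers may be fused)]
105 unaligned data sets from Lee et al.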
![Page 21: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/21.jpg)
“Is it worthwhile to model dependencies?”
Evaluation clearly supports this
What about the underlying biology?
(with Prof. Hanah Margalit, Hadassah Medical School)
![Page 22: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/22.jpg)
Distance between dependent positions
Tree models learned from the aligned data sets
[Figure: histogram of the number of dependencies (y-axis, 0 to 50) by distance between the dependent positions (x-axis, 1 to 11), broken down by strength: Weak (< 0.3 bits), Medium (< 0.7 bits), Strong; annotation: < 1/3 of the dependencies]
![Page 23: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/23.jpg)
Structural families
Dependency models vs. Profile on aligned data sets
[Figure: fold-change in likelihood (log scale, 0.5 to 128) across the aligned data sets (x-axis, 10 to 90), marked as significant or not significant by a paired t-test, and grouped by structural family: Zinc finger, bZIP, bHLH, Helix Turn Helix, β Sheet, others ???]
![Page 24: Modeling Dependencies in Protein-DNA Binding Sites](https://reader036.vdocuments.us/reader036/viewer/2022070406/568141f7550346895dadd43e/html5/thumbnails/24.jpg)
Conclusions
  Flexible framework for learning dependencies
  Dependencies are found in many cases
  It is worthwhile to model them: better learning and binding site prediction
http://compbio.cs.huji.ac.il/TFBN
Future work
  Link to the underlying structural biology
  Incorporate as part of other regulatory mechanism models