Evaluation of count scores for weight matrix motifs

Project Presentation for CS598SS

Hong Cheng and Qiaozhu Mei

Problem Background

• Understand the mechanism of gene regulation and predict it.

• Need a quantitative measure of the strength of a transcription factor (TF) binding site's presence in a gene sequence.

• This measure can be used as an important feature in the study of gene regulation.

Problem Background (cont.)

• There are no standard criteria for such a measure.

• But we expect a good measure can model
  – The quality of a binding site.
  – The occurrence of a binding site in the sequence.

• There can be many choices for such a measure, but which one is better…?

Project Overview

• Project Goal

• Possible scoring measures

• Evaluate a score: Constraint Analysis

• Experiments and results

• Current Status and Future Work

Goal of the project

• Three steps:
  – Formalize the problem of computing count scores for weight matrix motifs and propose an evaluation mechanism.
  – Evaluate the existing scoring methods for weight matrix motifs.
  – Either suggest a good motif counting method or propose a new score that is better than the existing scores.

Possible scoring measures

• Simple Counting
  – Match or not match

• Likelihood Sum
  – Data likelihood that a site is generated by a motif, summed over all possible sites

• Model based scores

• Free Energy

• Normalization of existing scores

Simple counting

• Simple Counting (match or not match)
  – Doesn’t work for fuzzy motifs
  – Variation: count a motif occurrence if, for a subsequence s, P(s|w) is above a threshold.

• Likelihood sum score
  – A soft version of simple counting
  – Ad hoc; doesn’t have a sound probabilistic interpretation
  – $S = \sum_{s \in \text{all\_possible\_sites}} p(s \mid M) = \sum_{i=1}^{|\text{all\_possible\_sites}|} p(s_i \mid M)$
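The two scores on this slide can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the PWM layout (one list of A, C, G, T probabilities per motif position), the toy motif, the toy sequence and the threshold value are all assumptions made for the example.

```python
from math import prod

BASES = "ACGT"  # assumed column order of every PWM row

def site_prob(site, pwm):
    """P(s | w): product of the per-position base probabilities."""
    return prod(col[BASES.index(b)] for b, col in zip(site, pwm))

def windows(seq, length):
    """All contiguous subsequences (candidate sites) of the given length."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]

def simple_count(seq, pwm, threshold):
    """Thresholded simple counting: a site counts if P(s | w) is above the threshold."""
    return sum(1 for s in windows(seq, len(pwm)) if site_prob(s, pwm) > threshold)

def likelihood_sum(seq, pwm):
    """Soft version of simple counting: sum P(s | w) over all possible sites."""
    return sum(site_prob(s, pwm) for s in windows(seq, len(pwm)))

# Hypothetical toy motif that prefers "ACG" (illustration only).
toy_pwm = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
toy_seq = "TTACGTTACGAA"

print(simple_count(toy_seq, toy_pwm, threshold=0.3))
print(likelihood_sum(toy_seq, toy_pwm))
```

The hard count only sees sites above the threshold, while the likelihood sum also collects small contributions from every weak site, which is why it behaves like a soft count.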

Model Based Scores

• Consider the sequence to be generated by a model involving a set of motifs

• HMM model: Stubb (Sinha et al. 2003)
  – Count of a motif as the average number of times the motif is planted in the sequence

• Two options:
  – With fixed transition probabilities
  – Fitting transition probabilities by unsupervised learning
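To make the idea of an "average number of times the motif is planted" concrete, here is a simplified single-motif sketch with a fixed transition probability. It is not Stubb itself: the generative model (at each step, plant the motif with probability p or emit one background base with probability 1 - p), the uniform background and the toy inputs are assumptions for illustration. The forward and backward tables are unscaled, so it is only suitable for short sequences.

```python
from math import prod

BASES = "ACGT"  # assumed column order of every PWM row

def site_prob(site, pwm):
    """P(s | w): product of the per-position base probabilities."""
    return prod(col[BASES.index(b)] for b, col in zip(site, pwm))

def expected_motif_count(seq, pwm, p, background=(0.25, 0.25, 0.25, 0.25)):
    """Posterior expected number of planted motifs, for 0 < p < 1."""
    n, l = len(seq), len(pwm)
    bg = [background[BASES.index(b)] for b in seq]
    # emit[i] = P(seq[i:i+l] | w) for every window start i
    emit = [site_prob(seq[i:i + l], pwm) for i in range(n - l + 1)]

    # fwd[i]: probability of generating the first i bases
    fwd = [0.0] * (n + 1)
    fwd[0] = 1.0
    for i in range(1, n + 1):
        fwd[i] = fwd[i - 1] * (1 - p) * bg[i - 1]
        if i >= l:
            fwd[i] += fwd[i - l] * p * emit[i - l]

    # bwd[i]: probability of generating the bases from position i to the end
    bwd = [0.0] * (n + 1)
    bwd[n] = 1.0
    for i in range(n - 1, -1, -1):
        bwd[i] = (1 - p) * bg[i] * bwd[i + 1]
        if i + l <= n:
            bwd[i] += p * emit[i] * bwd[i + l]

    # Sum over window starts of the posterior probability that a motif is planted there.
    planted = sum(fwd[i] * p * emit[i] * bwd[i + l] for i in range(n - l + 1))
    return planted / fwd[n]

# Hypothetical toy inputs (illustration only).
toy_pwm = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
toy_seq = "TTACGTTACGAA"
for p in (0.001, 0.01, 0.1):
    print(p, expected_motif_count(toy_seq, toy_pwm, p))
```

In this simplified model the expected count grows with p, which corresponds to the fixed-transition-probability option; the second option would instead fit p to the sequence.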

Other Possible Scores

• Free Energy
  – $\theta$: a set of motifs M and model parameters; $\theta_b$: model parameters with background only
  – $F(S, \theta) = \log \frac{\Pr(S \mid \theta)}{\Pr(S \mid \theta_b)}$
  – Models the score of a sequence and a set of motifs; cannot give the score of a specific motif unless the computation is run for only one motif

• Normalization of Existing Scores
  – Estimate P(C >= x) instead of #sequences
  – Use well-known normalization methods to normalize the actual counts
  – Min-Max; Z-Score (Z = (N - E)/S)

• Question: What makes a good score for weight matrix motif?
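The normalization options listed above can be sketched as follows. This is an illustrative sketch only: it assumes that E and S in the Z-score are the mean and standard deviation of a set of raw scores, and that P(C >= x) is estimated empirically from scores obtained on reference (e.g. random) sequences. The function names are hypothetical.

```python
from statistics import mean, pstdev

def min_max(scores):
    """Rescale scores linearly to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(x - lo) / (hi - lo) for x in scores]

def z_scores(scores):
    """Z = (N - E) / S, with E and S taken as the mean and std. dev. of the scores."""
    e, s = mean(scores), pstdev(scores)
    if s == 0:
        return [0.0 for _ in scores]
    return [(x - e) / s for x in scores]

def tail_probability(x, reference_scores):
    """Empirical estimate of P(C >= x) from scores on reference sequences."""
    return sum(1 for c in reference_scores if c >= x) / len(reference_scores)

raw = [2, 4, 6]
print(min_max(raw))
print(z_scores(raw))
print(tail_probability(3, [1, 2, 3, 4]))
```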

Evaluation of scores

• Empirical evaluation with lab experiments
  – Compare the score with lab experiments to see its effectiveness
  – ChIP-chip studies
  – Problem:
    • Lab experiment data are not easy to get
    • Performance may vary across species (and thus may be biased)

• Analytical evaluation: heuristic constraints

Analytical Evaluation with Heuristic Constraints

• There are many heuristic constraints which we expect a good score to satisfy

• The effectiveness of a score can be judged by how well it satisfies the constraints

• Whether a score satisfies a constraint can be studied analytically or with experiments on random data

• Combined with empirical evaluation, constraint analysis can tell us why a score is better than others, and help us define a new score.

Heuristic Constraints (I)

• Formalization
  – Motif PWMs: w, M
  – Sequences: S
  – Possible binding sites: s
  – Score of the nth run: Cn(S, w)

• Motif Quality Constraint
  – Focus on the quality of sites
  – Contribution of a motif w with length l on a site s: $I(s) = \log \frac{\Pr(s \mid w)}{\Pr(s \mid b)}$
  – For one motif w, two sequences S1 and S2, and a site position [i, i + l - 1]: S1[i, i + l - 1] != S2[i, i + l - 1], and all other positions are the same.
  – If I(S1[i, i + l - 1]) <= I(S2[i, i + l - 1]), then C(S1, w) <= C(S2, w)
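The site quality I(s) is a log-likelihood ratio between the motif and the background, and can be computed position by position. A minimal sketch, assuming a PWM stored as rows of A, C, G, T probabilities with strictly positive entries (a motif with zero entries would need pseudocounts first) and a uniform background b; the toy motif is hypothetical.

```python
from math import log

BASES = "ACGT"  # assumed column order of every PWM row

def site_quality(site, pwm, background=(0.25, 0.25, 0.25, 0.25)):
    """I(s) = log(Pr(s | w) / Pr(s | b)), computed as a sum of per-position log ratios."""
    return sum(
        log(col[BASES.index(b)] / background[BASES.index(b)])
        for b, col in zip(site, pwm)
    )

# Hypothetical toy motif that prefers "ACG" (illustration only).
toy_pwm = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
for site in ("ACG", "ACT", "TTT"):
    print(site, site_quality(site, toy_pwm))
```

A positive value means the site looks more like the motif than like background; a negative value means the opposite.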

Heuristic Constraints (II)

• Motif Length Constraint
  – For two motifs w1 and w2, length(w1) = length(w2) + 1. For any position i <= length(w2), the multinomial vector w1(i) = w2(i).
  – Compute the scores of w1 and w2 on one sequence S independently
  – C(S, w1) <= C(S, w2)

• Motif Sharpness Constraint
  – For two motifs w1 and w2, length(w1) = length(w2), if for any position i and j = 0, 1, 2, w1(i, j) < w2(i, j) and w1(i, 3) > w2(i, 3) (w1 is sharper than w2)
  – Compute the scores of w1 and w2 on a large number of sequences independently
  – Expectation[C(w1)] <= Expectation[C(w2)]
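As an illustration of how a constraint can be checked on random data, the sketch below tests the motif length constraint for the likelihood sum score. Everything here is an assumption made for the example: the random motif generator, the sequence length, the seed and the helper names. For the likelihood sum the constraint also follows analytically, since extending the motif by one position multiplies every site probability by a value of at most 1 and removes one candidate site.

```python
import random
from math import prod

BASES = "ACGT"  # assumed column order of every PWM row

def likelihood_sum(seq, pwm):
    """Sum of P(s | w) over all windows of the motif's length."""
    l = len(pwm)
    return sum(
        prod(col[BASES.index(b)] for b, col in zip(seq[i:i + l], pwm))
        for i in range(len(seq) - l + 1)
    )

def random_column(rng):
    """One random multinomial vector over A, C, G, T."""
    raw = [rng.random() + 1e-9 for _ in BASES]
    total = sum(raw)
    return [x / total for x in raw]

def length_constraint_holds(seq, w2, extra_column):
    """Build w1 = w2 plus one extra position and test C(S, w1) <= C(S, w2)."""
    w1 = w2 + [extra_column]
    return likelihood_sum(seq, w1) <= likelihood_sum(seq, w2)

rng = random.Random(0)  # fixed seed so the run is repeatable
seq = "".join(rng.choice(BASES) for _ in range(500))
w2 = [random_column(rng) for _ in range(6)]
holds = length_constraint_holds(seq, w2, random_column(rng))
print(holds)
```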

Heuristic Constraints (III)

• Motif Probability Constraint
  – For one motif w and one sequence S, compute the score C(S, w) twice, giving a higher probability to w in the second run (e.g. transition probability or prior probability in the HMM)
  – E.g. p1 < p2
  – C1(S, w) <= C2(S, w)

• Motif Competition Constraint
  – For two motifs w1 and w2 and one sequence S: first compute the score for w1 only, then compute it again considering the co-occurrence of w1 and w2.
  – C1(S, w1) >= C2(S, w1)

Heuristic Constraints (IV)

• Deterministic Constraint
  – For one motif w and one sequence S, if we compute the score of w twice with no parameter change,
  – C1(S, w) = C2(S, w)

• Upper Bound Constraint
  – Given an existing set of motifs M and a sequence S, if we add a new motif wn and compute the scores for M and wn again,
  – $\sum_{w \in M} C_1(S, w) \le \sum_{w \in M \cup \{w_n\}} C_2(S, w)$
  – But $\sum_{w \in M \cup \{w_n\}} C_2(S, w)$ cannot exceed an upper bound (e.g. the length of S)

A summary of constraints

• The heuristic constraints can allow us to analyze the effectiveness of a score without doing experiments.

• If experiments show that one score is better than others, the heuristic constraints can indicate why it is better.

• Difficult to find a closed set of constraints

• Some constraints are closely related (maybe not orthogonal, though not redundant)

Experiment Design

• Regular (comparing distributions):
  – Method
    • Stubb with learnt p
    • Simple Count
  – Data
    • Real motifs, real sequence data
    • Real motifs, randomly generated very long sequences (say, 10k~100k)
    • Random motifs, including long, short, fuzzy and sharp combinations; random long sequences

Experiment Design

• Stubb with Fixed Prior Probability
  – Vary prior probability p: 0.0001, …, 0.001, …, 0.01, …
  – Data
    • Real motifs, randomly generated long sequence
    • Random motifs, including long, short, fuzzy and sharp combinations; randomly generated long sequence
  – See score distribution

Experiment Design: Constraints

• Motif Length:
  – Randomly generated motifs (uniform, varying length), randomly generated long sequence.
  – Randomly generated motifs (uniform, varying length), real sequences

• Motif sharpness:
  – Randomly generated motifs (varying sharpness, equal length), randomly generated long sequence (100k)
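One possible way to produce the random inputs for these tests is sketched below. The sharpness parameterization is an assumption made for illustration, not the generator used in the project: each motif position mixes a uniform distribution with a one-hot distribution on a randomly chosen base, so sharpness 0 gives a uniform (fuzzy) motif and sharpness 1 gives a motif with a single non-zero entry per position.

```python
import random

BASES = "ACGT"  # assumed column order of every PWM row

def random_motif(length, sharpness, rng):
    """PWM whose rows mix a uniform distribution with a one-hot preferred base.

    sharpness is a hypothetical knob in [0, 1]: 0 is uniform, 1 is fully sharp.
    """
    motif = []
    for _ in range(length):
        preferred = rng.randrange(4)
        row = [(1.0 - sharpness) / 4.0] * 4
        row[preferred] += sharpness
        motif.append(row)
    return motif

def random_sequence(length, rng):
    """Random sequence with uniformly distributed bases."""
    return "".join(rng.choice(BASES) for _ in range(length))

rng = random.Random(42)  # fixed seed so the experiment is repeatable
# Equal length, varying sharpness, as in the sharpness test.
sharpness_motifs = {s: random_motif(10, s, rng) for s in (0.0, 0.25, 0.5, 0.75, 1.0)}
# Uniform motifs of varying length, as in the length test.
length_motifs = {l: random_motif(l, 0.0, rng) for l in range(1, 11)}
seq = random_sequence(100_000, rng)
print(len(seq), sorted(sharpness_motifs), sorted(length_motifs))
```

Fixing the seed keeps the generated data identical across runs, so any change in a score between runs comes from the scoring method rather than from the inputs.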

Experiment Design: Constraints

• Motif Competition
  – Real motifs, real sequence/random sequence data
  – Several runs:
    • 1st run: only motif M1
    • 2nd run: M1 and M2
    • 3rd run: M1, M2 and M3
    • …
  – Plot the distribution of M1 in the several runs.

Experiment Design (cont.)

• Deterministic constraints:
  – Real motifs, real sequences; run it several times and plot the distributions of Motif 1 to see whether it changes a lot.

• Normalization:
  – Z-Score only; Min-Max only; P(C>=N) only; P(C>=N) + Z-Score; P(C>=N) + Min-Max

Experiment Result(1)

• Stubb on real sequences against real motifs

• Simple count on real sequences against real motifs

• Four motifs
  – Bicoid, length 11, medium sharp
  – Kruppel, length 9, medium sharp
  – Gt, length 12, a bit sharper
  – Hkb, length 7, sharpest, every row has one non-zero count and three 0s

Experiment Result (1)-Stubb

Experiment Result (1)-Simple Count

Experiment Result (1) – Normalization P(x>=N)

Result (1) – Normalization z-score on motif score


Experiment Result(2)

• Stubb on random sequences against random motifs

• Simple count on random sequences against random motifs

• Four motifs
  – Long_fuzzy, length 20, uniform
  – Long_sharp, length 20, sharp
  – Short_fuzzy, length 5, uniform
  – Short_sharp, length 5, sharp

Experiment Result(2)-Stubb

Experiment Result (2)-Simple Count


Experiment Result(3)

• Stubb with Fixed Prior Probability, varying p from 0.0001 to 0.05

• Four real motifs
  – Bicoid
  – Kruppel
  – Hkb
  – Gt

• Four random motifs
  – Long_fuzzy
  – Long_sharp
  – Short_sharp
  – Short_fuzzy

Experiment Result(3)-Bicoid

Experiment Result(3)-Hkb

Experiment Result(3)-Long_fuzzy

Experiment Result(3)-Short_sharp

Experiment Result(4)-Constraint Motif Length

• Test on this heuristic
  – Stubb
  – Simple Count

• Generate 10 random motifs, uniform, vary length from 1 to 10

Experiment Result(4)-Simple count

Experiment Result(5)-Constraint Motif Sharpness

• Test on this heuristic
  – Stubb
  – Simple Count

• Generate 10 random motifs, length 10, vary sharpness

Experiment Result(6)-Motif Competition

• Test on this constraint
  – Stubb
  – Simple Count

• 1st run: using bicoid only

• 2nd run: using bicoid and other five motifs

• 3rd run: using bicoid and other nine motifs

• Monitor the bicoid score

Summary

Constraints            | Stubb     | Stubb_FixedP | Likelihood Sum
Probability Constraint | N/A       | Yes          | N/A
Motif Length           | Yes       | Yes          | Yes
Motif Sharpness        | Yes       | Yes          | No
Motif Competition      | Not clear | Not clear    | Not clear
Deterministic          | No        | Yes          | Yes
Upper Bound            | Yes       | Yes          | No
Site Quality           | To be done

Future Work

• Finish constraint tests

• Evaluate more scores (e.g. Free Energy)

• Define and formalize more constraints

• Compare with ChIP-chip experiment results; study the effectiveness of the scores and their relation to the constraints
