cis/tf discovery for arabidopsis aristotelis tsirigos email: tsirigos@cs.nyu.edu nyu computer...

Cis/TF discovery for Arabidopsis

Aristotelis Tsirigos

email: tsirigos@cs.nyu.edu

NYU Computer Science

Outline

• Input data

• The proposed model

• Results on yeast

• Results on arabidopsis

• Unsupervised pattern discovery

Input data

[Figure: input data. Expression profiles (25 points) alongside 1,500 bp of upstream sequence (gctaagc...)]

Normalization

[Figure: expression profiles (25 points) with columns normalized (mean=0), alongside 1,500 bp of upstream sequence (gctaagc...)]

Filtering

[Figure: expression profiles (25 points) with columns normalized (mean=0, stdev=1) and low-variance data filtered out; the 1,500 bp upstream sequence (gctaagc...) is converted to a motif bitmap (001011…)]
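The three preprocessing slides can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names, the toy data and the `min_stdev` threshold are hypothetical, and it assumes that the low-variance filter acts on gene profiles and that the bitmap holds one bit per gene (1 if the motif occurs in that gene's upstream sequence).

```python
from statistics import mean, pstdev

def normalize_columns(matrix):
    """Scale every column (condition) to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    stats = [(mean(c), pstdev(c)) for c in columns]
    # A constant column has stdev 0; map it to 0.0 instead of dividing by zero.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, (m, s) in zip(row, stats)]
        for row in matrix
    ]

def filter_low_variance(genes, matrix, min_stdev):
    """Keep genes whose profile varies by at least min_stdev (assumed rule)."""
    kept = [(g, row) for g, row in zip(genes, matrix) if pstdev(row) >= min_stdev]
    return [g for g, _ in kept], [row for _, row in kept]

def motif_bitmap(genes, upstream, motif):
    """One bit per gene: 1 if the motif occurs in its upstream sequence."""
    return [1 if motif in upstream[g] else 0 for g in genes]

# Toy data: 3 genes x 3 conditions (the slides use 25 points per gene).
genes = ["g1", "g2", "g3"]
expr = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
upstream = {"g1": "gctaagcgactcg", "g2": "ttttaaaa", "g3": "gactcg"}

normalized = normalize_columns(expr)
kept_genes, kept_expr = filter_low_variance(genes, normalized, min_stdev=0.5)
bitmap = motif_bitmap(kept_genes, upstream, "gactcg")
print(kept_genes)  # ['g1', 'g2']
print(bitmap)      # [1, 0]
```

The flat profile of g3 is removed by the filter, so the bitmap only covers the two remaining genes.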

The proposed model

Assumption 1

A single TF binds to a single cis element (motif)

Source: U.S. Department of Energy Genomics (http://doegenomestolife.org)

Assumption 2

TFs regulate genes sharing a motif only under a subset of conditions

[Figure: four panels plotting expression against condition: TF & regulated genes (group #1), TF & regulated genes (group #2), Expression pattern #1, Expression pattern #2]

Assumption 2 (cont’d)

TFs regulate genes sharing a motif only under a subset of conditions

Assumption 3

The TF expression correlates with the sum of the partially correlating expression patterns

[Figure: sum of genes plotted against condition]

Objective

• For each cis element (motif):

– discover groups of co-regulated genes

– compute aggregate motif expression

• For each TF:

– find best correlating motifs

The algorithm – steps 1 to 4

step 1: clustering (25 points)

step 2: for any motif, compute its gene set

step 3: compute the distribution of its genes into the clusters

step 4: determine overrepresented clusters using t-test
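Steps 2 to 4 can be sketched as follows, under stated assumptions. The slides do not say which clustering method is used or how the t-test is formulated, so this sketch takes the cluster labels from step 1 as given and substitutes a simple binomial z-score for the t-test. All function names, the toy data and the `z_cutoff` value are hypothetical.

```python
from math import sqrt

def motif_gene_set(upstream, motif):
    """Step 2: genes whose upstream sequence contains the motif."""
    return {g for g, seq in upstream.items() if motif in seq}

def cluster_distribution(gene_set, cluster_of, n_clusters):
    """Step 3: number of the motif's genes falling in each cluster."""
    counts = [0] * n_clusters
    for g in gene_set:
        counts[cluster_of[g]] += 1
    return counts

def overrepresented_clusters(counts, cluster_sizes, z_cutoff=2.0):
    """Step 4 (stand-in for the t-test): clusters holding more of the motif's
    genes than expected if those genes were spread in proportion to size."""
    total_genes = sum(cluster_sizes)
    n = sum(counts)
    hits = []
    for k, (observed, size) in enumerate(zip(counts, cluster_sizes)):
        p = size / total_genes
        expected = n * p
        sd = sqrt(n * p * (1 - p))
        if sd > 0 and (observed - expected) / sd >= z_cutoff:
            hits.append(k)
    return hits

# Toy data: 30 genes in 3 clusters of 10 (labels assumed to come from step 1).
cluster_of = {f"g{i}": i // 10 for i in range(30)}
upstream = {g: "aaaaaaaa" for g in cluster_of}
for i in [0, 1, 2, 3, 4, 5, 6, 10, 20]:
    upstream[f"g{i}"] = "ttgactcgtt"

gene_set = motif_gene_set(upstream, "gactcg")
counts = cluster_distribution(gene_set, cluster_of, n_clusters=3)
hits = overrepresented_clusters(counts, cluster_sizes=[10, 10, 10])
print(counts)  # [7, 1, 1]
print(hits)    # [0]
```

Seven of the nine motif-bearing genes sit in cluster 0, well above the three expected by chance, so only that cluster is flagged.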

The algorithm – final step

final step: compute motif aggregate expression (25 points)
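A minimal sketch of the final step and of the per-TF ranking from the Objective slide. It assumes that the aggregate is the per-condition sum over the motif's selected genes (following Assumption 3) and that "best correlating" means highest Pearson correlation; the slides confirm neither detail. Function names and data are hypothetical.

```python
from math import sqrt

def aggregate_expression(profiles):
    """Per-condition sum of the expression profiles of a motif's genes."""
    return [sum(col) for col in zip(*profiles)]

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def rank_motifs(tf_profile, motif_profiles):
    """Sort motifs by correlation of their aggregate expression with the TF."""
    scored = [
        (pearson(tf_profile, aggregate_expression(profiles)), motif)
        for motif, profiles in motif_profiles.items()
    ]
    return sorted(scored, reverse=True)

# Toy data: 4 conditions instead of 25.
tf_profile = [0.0, 1.0, 2.0, 1.0]
motif_profiles = {
    "gactcg": [[0.0, 1.0, 2.0, 1.0], [0.0, 2.0, 4.0, 2.0]],
    "tttttt": [[2.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 2.0]],
}
ranking = rank_motifs(tf_profile, motif_profiles)
for corr, motif in ranking:
    print(motif, round(corr, 2))  # gactcg ranks first
```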

Example TF: BAS1

Using cis/TF version 1:

RANK  MOTIF   OCCUR  corr    score
1     gactcg  46     0.6446  66
2     cgagtc  46     0.6446  16
3     gactaa  163    0.6381  66
4     ttagtc  163    0.6381  33
5     tcggct  87     0.6374  33
...
12    gctagt  110    0.6268  33
13    agtcac  137    0.6262  83     p-value=0.079
...
27    gagtca  136    0.6192  100    p-value=0.004

Example TF: BAS1

Using cis/TF version 2:

RANK  MOTIF   OCCUR  signf  corr  score
1     ctgact  122    0.62   0.66  33
2     agtcag  122    0.62   0.66  83
3     ggttta  187    0.62   0.63  50
4     taaacc  187    0.62   0.63  33
5     gagtca  136    0.68   0.63  100    p-value=0.002
6     tgactc  136    0.68   0.63  33
7     atttga  378    0.64   0.63  33
8     tcaaat  378    0.64   0.63  50
9     agtggc  126    0.66   0.61  50
10    gccact  126    0.66   0.61  50

[Figure: BAS1 expression compared with individual clusters. Cluster #1: correlation = 0.02; Cluster #2: correlation = -0.05; Cluster #4: correlation = -0.35. Plot labels: BAS1#0, BAS1#2, BAS1#4]

Conclusions

Advantages of version 2:

• gives the ability to focus on the gene cluster that correlates best with a given TF

• thus increases overall correlation and motif rank

• offers a measure of motif significance

• can be extended to pairs of TFs/motifs

Arabidopsis

Procedure

• Permute gene cluster assignment

• Compile list of putative motifs

• Compute significance score of known motifs

• Repeat 1000 times

• Compute p-value of the score
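The procedure above is a permutation test, which can be sketched generically. The slides do not define the significance score, so `score_fn` below is a hypothetical placeholder (here: the largest number of a known motif's genes found in any single cluster), and the motif list is reduced to one known motif. The default of 1000 permutations follows the slide; the rest is assumption.

```python
import random

def permutation_p_value(cluster_of, score_fn, n_permutations=1000, seed=0):
    """Shuffle cluster labels across genes and count how often the permuted
    score reaches the observed one."""
    rng = random.Random(seed)
    observed = score_fn(cluster_of)
    genes = list(cluster_of)
    labels = [cluster_of[g] for g in genes]
    at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # cluster sizes are preserved, membership is not
        permuted = dict(zip(genes, labels))
        if score_fn(permuted) >= observed:
            at_least += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least + 1) / (n_permutations + 1)

# Toy data: 30 genes in 3 clusters; 9 genes carry a known motif.
cluster_of = {f"g{i}": i // 10 for i in range(30)}
motif_genes = {f"g{i}" for i in [0, 1, 2, 3, 4, 5, 6, 10, 20]}

def max_cluster_count(assignment):
    """Hypothetical score: most motif genes found in any single cluster."""
    counts = {}
    for g in motif_genes:
        counts[assignment[g]] = counts.get(assignment[g], 0) + 1
    return max(counts.values())

observed, p_value = permutation_p_value(cluster_of, max_cluster_count)
print(observed, p_value)
```

Shuffling labels rather than redrawing them keeps the cluster sizes fixed, so the null distribution reflects only the loss of association between motif and cluster.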

[Figure: ranking score plot; p-val = 0.006]

TF discovery?

Need data for training!

(TFs and their associated binding sites)

Parameters to be estimated:

• number of clusters

• motif size & degeneracy

Pattern discovery

TF-driven pattern discovery

• Unsupervised pattern discovery

• Find groups of genes partially correlating with TF

• Apply statistical filter

• Look for over-represented motifs in genes’ upstream regions

• Data for validation?
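One way to make "partially correlating" concrete is to search for the stretch of conditions over which a gene tracks the TF most closely. The sketch below assumes that the subset of conditions is a contiguous window and that Pearson correlation is the measure; neither is stated in the slides, and the function names, the `min_len` parameter and the data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def best_window_correlation(tf, gene, min_len=5):
    """Return (r, start, end) for the contiguous window of at least min_len
    conditions with the highest TF/gene correlation (end is exclusive)."""
    best = (0.0, 0, 0)
    n = len(tf)
    for start in range(n - min_len + 1):
        for end in range(start + min_len, n + 1):
            r = pearson(tf[start:end], gene[start:end])
            if r > best[0]:
                best = (r, start, end)
    return best

# The gene follows the TF over the first six conditions only.
tf = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 0.0, 1.0, 0.0]
gene = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 0.0, 1.0, 0.0, 1.0]

r, start, end = best_window_correlation(tf, gene)
full = pearson(tf, gene)
print(round(r, 3), start, end, round(full, 3))
```

Short windows reach high correlations by chance far more easily than long ones, which is one reason a statistical filter is needed before looking for over-represented motifs.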

Pattern discovery example

[Figure: expression profiles of AT1G73230 (TF), AT1G53290 and AT5G59880 over conditions 1 to 22; panel title: TF & regulated genes (group #2); x-axis: condition]

“Predicting Gene Expression from Sequence”, Beer & Tavazoie, Cell 2004

• Group genes in 49 clusters

• Predict gene cluster using motifs discovered in its upstream region

[Figure: correlation (axis from -1 to 1) for 2,500 genes; label: PAC & RRPE]

Conclusions

Two options:

• Supervised training:

– uses background knowledge to construct model

– needs more training data

• Unsupervised pattern discovery:

– minimal model bias (no prior knowledge)

– needs more ‘expert’ help to filter results
