

Report on the CYRENE Project: A cis-Lexicon containing the regulatory architecture of 586 regulatory genes experimentally validated using the “Davidson Criteria”

Ryan Tarpine, James Hart, Timothy Johnstone, Derek Aguiar, Sorin Istrail

Center for Computational Molecular Biology, Brown University

All correspondence, including requests for the cisGRN Browser, should be addressed to the Istrail Lab, Center for Computational Molecular Biology and Department of Computer Science, Brown University, [email protected]

cisGRN Browser

The CYRENE cis-Lexicon presently contains the regulatory architecture of 393 transcription-factor-encoding genes and 194 other regulatory genes in eight species: human, mouse, fruit fly, sea urchin, nematode, rat, chicken, and zebrafish, with higher priority given to the first five species. The regulatory architecture of each CYRENE gene is validated using the “Davidson Criteria”: sites must be shown to physically bind proteins and must be functionally confirmed by in vivo disruption. The cis-Lexicon annotations include confirmed transcription factor binding sites, the cis-Regulatory Module (CRM) boundaries, the spatial and temporal functionality of the CRM, and the molecular function and classification of the encoded protein. Included is an update on the CLOSE System (cis-Lexicon Ontology Search Engine), a set of algorithmic strategies for automated literature extraction of cis-regulation articles, which is used to speed up the identification of new CYRENE genes in the literature and to estimate the “completeness” of the CYRENE transcription factor universe. We also discuss the newly released CYRENE cisGRN-Browser, a full genome browser dedicated to cis-regulatory genomics. This work has been done jointly with Eric Davidson of the Division of Biology at the California Institute of Technology.
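The annotation fields and the two-part validation rule described above can be pictured as a small record structure. The following Python sketch is illustrative only: the class names, field names, coordinates, and the example gene and factors are assumptions made for this example and do not reflect the actual CYRENE data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BindingSite:
    """One transcription factor binding site (hypothetical record layout)."""
    factor: str
    start: int
    end: int
    binds_protein: bool       # physical binding has been shown
    disrupted_in_vivo: bool   # function confirmed by in vivo disruption

    def meets_davidson_criteria(self) -> bool:
        # Both kinds of evidence are required.
        return self.binds_protein and self.disrupted_in_vivo


@dataclass
class CisRegulatoryModule:
    """A CRM with its boundaries and spatial/temporal functionality."""
    start: int
    end: int
    spatial_function: str
    temporal_function: str
    sites: List[BindingSite] = field(default_factory=list)

    def validated_sites(self) -> List[BindingSite]:
        return [s for s in self.sites if s.meets_davidson_criteria()]


@dataclass
class LexiconGene:
    """A regulatory gene together with its annotated CRMs."""
    symbol: str
    species: str
    molecular_function: str
    modules: List[CisRegulatoryModule] = field(default_factory=list)


# Made-up example values, for illustration only.
site_a = BindingSite("tfA", 120, 132, binds_protein=True, disrupted_in_vivo=True)
site_b = BindingSite("tfB", 200, 210, binds_protein=True, disrupted_in_vivo=False)
crm = CisRegulatoryModule(100, 400, "example territory", "example stage",
                          sites=[site_a, site_b])
gene = LexiconGene("geneX", "sea urchin", "transcription factor", modules=[crm])

print([s.factor for s in crm.validated_sites()])
```

In this sketch only `tfA` passes, because `tfB` has binding evidence but no in vivo disruption.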

Davidson and de-Leon, 2010

cis-Lexicon

cis-Lexicon Connectivity Map (D. melanogaster)

Future Direction: Cross-Platform Integration

Virtual Sea Urchin’s view of the Strongylocentrotus purpuratus embryo at 0, 1, 2, 3, and 6 hours. The VSU distinguishes cell types by color.

Virtual Sea Urchin

The Virtual Sea Urchin (VSU) uses spatial models and a graphics engine to simulate the 4-dimensional sea urchin embryo, allowing the researcher to probe the GRN at various levels of granularity, from the multicellular embryo to the gene regulatory network of an individual cell type. The VSU currently provides models for the S. purpuratus embryo at 6h (shown), 10h, 15h, 20h, and 24h, which were created by extrapolating cross-sectional, color-coded tracings from photomicrographs to three dimensions (Eric H. Davidson. The Regulatory Genome: Gene Regulatory Networks In Development And Evolution. Academic Press, May 2006).

The computational and data model for the VSU was recently rebuilt from scratch in Java, using JOGL bindings, to accommodate animation and integration with the cis-Browser. The development of an embryo can now be modeled using flat text files. The computational modeling of embryonic development will eventually feature realistic cell models and dynamics simulators. We also plan to combine the cis-regulatory sequence analysis capabilities of CYRENE and the network building, visualization, and simulation capabilities of BioTapestry with the temporal and spatial analysis of the 4D Virtual Sea Urchin to obtain a complete characterization of the S. purpuratus GRN.

cis-Lexicon Ontology Search Engine (CLOSE)

The CLOSE algorithm combines human-curated knowledge of biological nomenclature with combinatorial optimization to home in on the few thousand papers that are relevant to the CYRENE Project out of the millions in PubMed. The algorithm begins with a set of synonym lists, each carefully designed by biologists to capture the various ways that one concept can be described in the literature. Each list represents a particular aspect of cis-regulatory analysis that, when recognized in a title or abstract, is evidence that the paper is relevant to the CYRENE Project. The algorithm adapts itself to match as many known relevant papers as possible while minimizing the number of predictions that it makes, aiming to maximize both sensitivity and specificity. Within minutes, it determines a set of rules that match 95% of our known cis-regulatory papers while discarding 95% of our starting set, which consists of papers downloaded from journals that publish cis-regulatory analyses along with other biological research.
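The idea of learning rules from synonym lists can be sketched as follows. The poster does not spell out the optimization CLOSE uses, so this example substitutes a simple greedy selection as a stand-in. The synonym lists, the rule form (a set of concepts that must all appear in the text), the function names, and the toy titles are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical synonym lists: one concept name mapped to its phrasings.
SYNONYMS = {
    "binding": ["binding site", "footprint", "gel shift"],
    "disruption": ["mutation", "deletion", "mutagenesis"],
    "regulation": ["enhancer", "promoter", "cis-regulatory"],
}


def concepts_in(text):
    """Return the set of concepts whose synonyms occur in the text."""
    text = text.lower()
    return frozenset(name for name, terms in SYNONYMS.items()
                     if any(term in text for term in terms))


def learn_rules(relevant, background, target_sensitivity=0.95, max_size=2):
    """Greedily pick rules that cover known relevant papers while
    matching few background papers (a stand-in for the real optimizer)."""
    rel = [concepts_in(t) for t in relevant]
    bg = [concepts_in(t) for t in background]
    candidates = [frozenset(c) for k in range(1, max_size + 1)
                  for c in combinations(sorted(SYNONYMS), k)]
    chosen, covered = [], set()
    needed = target_sensitivity * len(rel)
    while len(covered) < needed:
        best, best_score, best_gain = None, 0.0, set()
        for rule in candidates:
            gain = {i for i, c in enumerate(rel) if rule <= c} - covered
            if not gain:
                continue
            cost = sum(1 for c in bg if rule <= c)
            score = len(gain) / (1 + cost)  # new hits per background match
            if score > best_score:
                best, best_score, best_gain = rule, score, gain
        if best is None:  # no remaining rule covers anything new
            break
        chosen.append(best)
        covered |= best_gain
    return chosen


def matches(rules, text):
    """A paper is predicted relevant if any rule is fully satisfied."""
    found = concepts_in(text)
    return any(rule <= found for rule in rules)


# Toy titles, invented for this example.
relevant = ["Mutagenesis of the binding site in the enhancer abolishes expression",
            "Deletion of a footprint region in the promoter"]
background = ["A new enhancer trap screen",
              "Protein binding site prediction",
              "Crystal structure of a kinase"]
rules = learn_rules(relevant, background)
print(rules)
```

Scoring each candidate by newly covered relevant papers per background match is one simple way to trade sensitivity against the number of predictions; other objectives would fit the same description.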

All PubMed Literature (>1,000,000)

CLOSE Dataset (~40,000)

Davidson Criteria cis-regulation papers (~1,000)

Distribution of cis-Lexicon transcription factors by TF superfamily

Distribution of cis-Lexicon transcription factors by Species

Pura rarg Kni Bcl6 Sna Stat5a rara Foxa2 E74 Hb Nr4a1 e(spl) CI h TEF-1 (TEAD-1) Myf6 Myf6 Hoxb2 Nkx2-1 ac HOXD4 Cebpa Pit-1 (Pou1f1) Tll En-2 HoxA4 Foxa1 EN Ubx E2F2 Foxa3 IRF1 WT1 tin etv4 otx2 RORA rara gsc POU3F2 Pgr Kr TCN2 Ahr Gcm EGR-3 HSF1 bcd elk1 Nkx2-1 Fos POU4F1 HLHmgamma TCF7 Nupr1 Cdx-2 IRF8 gata6 Abd-A Mitfa PDX-1 Nkx2-1 Sox2 Gsb SMAD7 Nkx6-1 a-myb pax4 Pb NR0B2 (SHP) Tp53 Ankrd1 HNF1A Ddit3 six2 Sox14 Pax3 mafg Hoxa5 irf5 ilf2 esrra ppard elf4 Sox9 Dac Repo Tlx1 Lmo2 Plagl1 Rhox5 Pcna E2f6 Trp53 Mxd4 Lhx3 Tgfb1 Gabpa Rhox5 Tbx1 Giot1 Trp63 Sall1 Ush Hoxd4 znf268 car Nrl Aire Sall4 Snai2 Nr2c1 Gata4 Lyl1 Gbx2 C15 Smad6 Creb3 Nr3c1 Hif3a Ikzf3 Otx2 chrebp Srebf1 Hmga1 Zeb1 Pou4f3 nr1h2 HNF1b tp73 Runx1 hes6 usf2 GATA1 car SREBF1 mxd1 hmx1 tbx20 neurog2 foxp3 couptf2 klf10 Nr4a1 Ptf1a Ddit3 Hlh-6 ATF3 Sox10 Ebf1 Osr1 Snai1 Prox1 Nr4a1 Fos Foxf1a Foxl1 Jun Nkx3-2 ChREBP Mipu1 (Znf667) chrebp Id3 MYC CYP27B1 C11orf31 Fox3p Foxa Sfpi1 EGR-1 NFkBIA dref HOXB4 nr1d1 Runx2 Pax6 mec-3 RELB Msx2 TFAP2c Ar NR0B2 (SHP) Pax2 otx GLI1 Mef2c cad IRF4 ndn IRF7 NFATc1 arntl Pit-1 HOXA10 E2F6 Srebf1 ppard GFI1B tal1 Tp63 Neurod1 Sp7 Bcl3 Nr4a3 Runx2 Ebf1 ase NR0B2 (SHP) Hif1a Ovol1 Hoxc8 Esr1 Rb1 Nr4a1 Myc nr1h4 Hmga1 hoxd9 SOX3 Gata1 nr5a1 (ad4bp. sf-1) pxr pokemon nr0b1 Prdm1 eve Hic1 Dfd EGR-1 rarb MYB (c-Myb) Hes1 REL Pou5f1 MYBl2 (b-myb) HNF-1a Hoxc8 Foxa2 ascl1 tsh SRY KLF-1 Nr5a1 pparg pparg Pgr Nr3c1 Ato bap blimp1/Krox brachyury brk bs (dSRF) CEBPA Cebpa Cebpb Cebpb CEBPD Cebpd ceh-36 che-1 cog-1 Dll E2F (dE2F) EDF-1 eve fGf4 FOS (c-fos) Fosl1 ftz gataE hand Hand-1 HNF1B Hoxa2 Hoxb2 hoxb2 Hoxb3 jing kn Krox20 lim-6 lz MafA Mafk mef2 Msx1 Myf5 Myf5 Myod1 MYOG nanog Nfe2 Nfe2l2 nfkb1 Nkx2-5 oc (otd) otp pax4 pax6b PDX-1 POU5F1 pros ptf1a Rb1 Rbl1 salm slp1 SMAD7 so Sox2 SP3 Srf STAT1 Stat3 STAT4 svp tcfap2a TFAP2a TFAP2c (AP-2gamma) TLX1(Hox11) vvl ybx1 zen Zfp106 gcm Zbtb7 Ahr Gcm EGR-3 ppard GFI1B tal1

Cellular function of cis-Lexicon genes

Transcription factor coverage by species