introduction: - pennsylvania state universityross/share/chenggata/enhancerco… · web viewapril...


Page 1: Introduction: - Pennsylvania State Universityross/share/ChengGATA/EnhancerCo… · Web viewApril 01, 2008 Strong transcriptional enhancers show constraint on binding factor motifs

April 01, 2008

Strong transcriptional enhancers show constraint on binding factor motifs

Yong Cheng, David C. King, Lou Dore, Xinmin Zhang, Yuepin Zhou, Ying Zhang, Christine Dorman, Demesew Abebe, Swathi Kumar, Francesca Chiaromonte, Webb Miller, Roland Green, Mitchell Weiss, Ross C. Hardison

ABSTRACT:

Regulation of gene transcription is critical to the proper development and function of an organism, but the severity of purifying selection on the DNA sequences required for this regulation remains controversial. Although many studies have supported an association of deep conservation in noncoding DNA sequences with gene regulatory function, other studies reveal extensive turnover in binding site motifs and rapid evolution. Furthermore, DNA segments with biochemical signatures of gene regulatory regions are only slightly enriched for sequences under intense purifying selection. We examined the correlation of sequence conservation and enhancer activity dependent on the transcription factor GATA-1 in mouse erythroid cells. The sites occupied by GATA-1 in erythroid cells were mapped along 66 million bp of mouse chromosome 7, which contains almost 30 genes that are regulated by this transcription factor, using chromatin immunoprecipitation followed by hybridization to high density tiling arrays and full validation by quantitative PCR. Each occupied site was examined for the phylogenetic extent of conservation among other vertebrate species, both for the whole interval and for the GATA-1 binding site motif. The occupied sites were also tested for enhancer activity in gain-of-function assays in transfected erythroid cells. Although the occupied DNA intervals were conserved in multiple mammalian lineages far more frequently than randomly chosen intervals, only 45% of the binding site motifs are conserved to this extent. This large subset of occupied intervals with binding site motifs conserved in multiple mammalian lineages tends to be highly active in enhancer assays. This suggests that strong purifying selection is working on binding site motifs for enhancers that are capable of activating genes independently of other, nonpromoter regulatory regions.

INTRODUCTION

Most of the genomic DNA sequence that is conserved among two or more mammalian orders falls outside the sequences coding for protein, but its function, if any, is not well understood [Waterston, 2002, 0][Miller, 2004 #1828][Dermitzakis, 2005 #2002]. Many DNA sequences required for regulation of gene expression, i.e. cis-regulatory modules (CRMs) such as enhancers and promoters, are conserved across mammalian lineages [Emorine, 1983 #134][Li, 1990 #31][Aparicio, 1995 #381][Woolfe, 2005 #1893], and this led to the use of conservation of noncoding sequences to predict CRMs from aligned genomic DNA sequences [Hardison, 2000 #1072][Pennacchio, 2001, 21147229]. Indeed, gain-of-function assays show that almost half of the noncoding sequences conserved across vertebrate lineages (from mammals to fish) [Pennacchio, 2006 #2205] or almost invariant in multiple mammalian orders [Visel, 2008 #2342] are active as enhancers during early development.

Conservation of occupied sites and binding motifs

Many studies have focused on conserved noncoding sequences as potential gene regulatory regions [Miller, 2004 #1828][Dermitzakis, 2005 #2002]. By taking an approach to identifying sites occupied by GATA-1 that is agnostic to sequence alignments, we can evaluate how frequently these sites are more conserved than the background genomic DNA (unbound, non-exonic, non-repetitive DNA). A substantial majority of the occupied DNA segments (almost 80%) are conserved in multiple mammalian orders or beyond; this frequency of conservation to these distances is considerably higher than observed in the background genomic DNA. Thus the occupied sites are under constraint - they are changing more slowly than is expected from the observed changes in other DNA. However, very few of these intervals overlap with the most constrained subset of the human genome. Only ____ overlap with the “most conserved” intervals found by phastCons {we should look at that - David, do you want to do it?}. Other studies also have found a signal for constraint in regulatory regions deduced from ChIP-chip experiments, but little overlap with the most severely constrained regions [King, 2007 #2235][Birney, 2007 #2187], in agreement with the observations here. This shows that the occupied intervals are subject to purifying selection, and this is a feature that can be used in finding and analyzing CRMs. However, restricting a search to only deeply conserved noncoding sequences will miss a large majority of the CRMs.

In contrast, barely half the occupied sites had WGATAR motifs that are conserved in multiple mammalian orders, but this provides the basis for important functional distinctions. The presence of a motif conserved outside rodents is tightly associated with strong enhancement activity in transiently transfected cells. Of the 16 occupied sites that confer high enhancement, 81% have a WGATAR motif conserved outside rodents. For this subset of occupied sites, the constraint on the motif is severe - no changes from the consensus binding site are tolerated in multiple mammalian orders, despite diverging 60 to 100 million years ago. This is consistent with a recent analysis of enhancer activity and evolution for muscle genes in Ciona species [Brown, 2007 #2340]. In this study, individual binding site motifs within known enhancers were mutated and the effects on expression were measured. Motifs that contribute significantly to the activity of an enhancer were almost invariably found to be conserved, as shown by clear orthologs that have sustained many fewer substitutions than expected between distant Ciona species. Motifs that contribute weakly to enhancer activity sustain substantially more substitutions, such that motif activity is correlated with percent identity.

Our studies, comparing the level of enhancer activity between occupied sites with and without conservation of motifs, and those of Brown et al. [, 2007 #2340] both show that high enhancer activity is associated with strong purifying selection on the binding site motifs. Other studies emphasize turnover [Dermitzakis, 2002 #1536; Moses, 2006 #2211] and compensatory changes [Ludwig, 2000 #1698; Ludwig, 1995 #1699] in binding site motifs when orthologous enhancers are compared. Both conservation and turnover of binding site motifs have been observed in multiple comparative studies in a range of species [e.g., \Moses, 2003 #2294; Dermitzakis, 2003 #1680].

Within our set of segments occupied by GATA-1 in erythroid cells, those showing patterns of motif conservation consistent with turnover are the majority. However, the subset showing strong constraint in the binding site motif tends to have the strongest enhancer activity when assayed by gene transfer experiments. This suggests that categorizing protein-bound segments by the phylogenetic depth of conservation both of the bound interval and the binding site motifs may have good predictive value for some activities. The occupied segments with deep conservation of the binding site motifs are the product of purifying selection on those motifs. The association between activity and constraint indicates that the basis for this purifying selection could be the loss of critical enhancement when the motifs are altered by mutation.

Occupied sites whose binding site motifs are not deeply conserved still show evidence of constraint for the bound DNA segment, but the position of the motif is more flexible over evolutionary time. These bound segments are not as strong in enhancement assays, and many are not active. In this case the selection against changes in the binding site motif is not expected to be as severe as for enhancers of high activity. Mutations in the binding site motif are more likely to persist in a population, especially if mutations at other nucleotides in the vicinity gave a motif closer to the preferred binding site. A few events like this could lead to a motif “moving” from one position to another in the vicinity, but only if the effects of the alterations were nearly neutral. One possibility is that these occupied sites with nonconserved motifs could function to modulate the activity of enhancers that are capable of acting independently (and hence would be positive in our gene-transfer assays).

In other cases, the lack of conservation of motifs could reflect lineage-specific differences in activity. The changes in motif patterns in eve stripe 2 enhancer relate to important differences in activity between Drosophila species [Ludwig, 2005 #2341]. Some erythroid enhancers are found in primates but not mice [Bodine, 1987 #834][Valverde-Garduno, 2004 #1959], presumably playing a role in lineage-specific differences in expression. Further comparisons of the expression patterns for the likely target genes for the occupied sites in rodents and primates could lend insight into this issue.

Effect on predictions of cis-regulatory modules

The large scale analysis of ChIP material to identify sites occupied by a nuclear protein is a strong complement to methods that predict cis-regulatory modules from aligned genomic DNA sequences. One such method computes regulatory potential (RP), which is a score derived from statistical modeling to find patterns in multiple sequence alignments that distinguish alignments in known regulatory regions from those in neutral DNA [Taylor, 2006 #2006]. In combination with a conserved match to a GATA-1 consensus binding motif in loci that are regulated by GATA-1, this is an effective predictor of erythroid enhancers. Over half of the predicted regions are active as enhancers in gene-transfer experiments in erythroid cells [Wang, 2006 #2007]. When we apply similar criteria to the sites on mouse chromosome 7 that are occupied by GATA-1, we find that only 12 (19%) of the 63 fully validated GHPs overlap with intervals predicted to be erythroid CRMs in the Wang et al. study. Interestingly, most of these (9 of 11 or 82%) enhance gene expression by a factor of almost 2-fold or more. This fits with the association of high activity with occupied sites that also have a conserved binding motif. It is possible that this also reflects a tendency for high RP regions to be active as enhancers. This could be investigated by testing the activity of occupied sites with conserved WGATAR motifs but an RP<0.05 {we should already have these data}.

These results show that the data for protein occupancy reveal many bound sites that are missed by the RP plus conserved binding site method. We need to test whether the latter method finds some enhancers that are missed by the occupancy data, i.e. by qPCR analysis of predicted CRMs that do not have a positive signal in the ChIP-chip analysis. The intersection of the two approaches is a small subset of bound sites, but it does predict enhancers with considerable accuracy. Further investigation is needed to assess the utility of using the union of the two methods.
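The comparison between occupancy data and alignment-based predictions comes down to splitting two sets of genomic intervals into those found by both methods and those found by only one. The following is a minimal sketch of that bookkeeping, assuming half-open (start, end) coordinates on a single chromosome; the function names and the toy coordinates are hypothetical and are not taken from the study.

```python
# Illustrative only: intervals are (start, end) pairs, end exclusive,
# all on the same chromosome. Coordinates below are invented.

def overlaps(a, b):
    """True if the two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def split_by_overlap(bound, predicted):
    """Partition intervals into: bound sites that overlap a prediction,
    bound sites with no prediction, and predictions with no bound site."""
    both = [s for s in bound if any(overlaps(s, p) for p in predicted)]
    bound_only = [s for s in bound if s not in both]
    predicted_only = [p for p in predicted
                      if not any(overlaps(p, s) for s in bound)]
    return both, bound_only, predicted_only

bound_sites = [(100, 600), (2000, 2500), (9000, 9500)]
predicted_crms = [(550, 800), (5000, 5300)]

both, bound_only, predicted_only = split_by_overlap(bound_sites, predicted_crms)
print(len(both), len(bound_only), len(predicted_only))
```

The third list corresponds to the predicted CRMs without a ChIP-chip signal, which are the candidates that would need testing by qPCR.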

The mean RP score for each long (500-700bp) interval that is bound by GATA-1 was computed, and the distribution of these values compared for the sets of bound sites that are or are not active in enhancer assays. No significant difference was observed. Likewise the distribution of mean RP scores is not significantly different between validated and non-validated GHPs. However, this approach has limited ability to use RP, or any alignment-based method, to distinguish functional classes. The critical regions for binding are likely to be much shorter than 500bp, but this analysis combines the RP score of the critical region with that of flanking DNA that may be neutral. A better approach would use a more localized measure, but we do not know which shorter region to focus on.

The transcription factor GATA-1 is required for normal development of erythrocytes, megakaryocytes, mast cells and eosinophils and probably dendritic cells [Pevny, 1991 #361; Simon, 1992 #2165; Pevny, 1995 #736; Weiss, 1995 #737] [Shivdasani, 1997 #2248] precursors. This protein regulates most of the genes that define the mature erythroid phenotype [e.g. \Weiss, 1995 #1438; Blobel, 2001 #1314; Welch, 2004 #1829] and many genes in megakaryocytes [Orkin, 1998 #2247; Wang, 2002 #1446]. Mutations in the GATA1 gene in humans can result in hematologic disorders [Nichols, 2000 #2240; Ahmed, 2004 #1961]. The protein contains two zinc fingers [Omichinski, 1993 #433] and binds to the motif WGATAR in DNA [Wall, 1988 #384; Mignotte, 1989 #744; Martin, 1990 #751; Merika, 1993 #194; Ko, 1993 #195; Whyatt, 1993 #435]. GATA-1 activates expression of some genes but represses expression of other genes during erythroid maturation [Welch, 2004 #1829].

Given its critical role in differentiation and maturation of numerous hematopoietic lineages, it is important to find the DNA sequences occupied by this protein and to establish the determinants of occupancy. Although the binding site motif WGATAR is strongly associated with currently known sites occupied by GATA-1, only a fraction of all such motifs are bound [Grass, 2003 #2111; Im, 2005 #2256; Grass, 2006 #2109]. Moreover, noncanonical GATA-1 binding sites have been identified by in vitro site selection assays and their biological significance is unknown. Conservation of binding site motifs in other mammalian orders has been used to predict occupied sites with good but imperfect success [Blanchette, 2006 #2003; Wang, 2006 #2007], but high throughput determination of many occupied sites shows that many bound sites are not under strong evolutionary constraint [Birney, 2007 #2187; Margulies, 2007 #2236; King, 2007 #2235]. The effectiveness of interspecies conservation to distinguish bound from unbound motifs varies with the transcription factor, and the GATA-1 binding site motifs recorded in TRANSFAC are among the least conserved [Sauer, 2006 #2277]. Thus the determinants of occupancy and the effectiveness of sequence conservation in predicting occupied sites remain open questions.

These issues can be addressed more completely by chromatin immunoprecipitation assays that are analyzed using high throughput methods such as hybridization to high density tiling arrays of genomic DNA sequences [ChIP-chip assays; \Ren, 2000 #2014; Birney, 2007 #2187]. The cell line G1E is a particularly useful one for this assay because it is derived by immortalizing embryonic stem cells induced to erythroid differentiation from a Gata1 null mouse strain, and thus it provides a model for an erythroid-committed cell frozen at the progenitor stage [Weiss, 1994 #429]. Transduction of this line with a virus producing a hybrid of GATA-1 with the ligand-binding domain of the estrogen receptor protein (GATA-1ER) restores GATA-1 activity in an estradiol-inducible manner in the resulting G1E-ER4 cells. After induction, the cells proceed through most stages of late erythroid maturation [Weiss, 1997 #1439; Welch, 2004 #1829]. In this system, chromatin associated with GATA-1 can be immunoprecipitated from the rescued cells (induced G1E-ER4) and compared with the background in Gata1 null G1E cells.

We have chosen to examine occupancy by GATA-1 on a large segment (66 million base pairs, or Mb) of mouse chromosome 7, which contains the intensively studied Hbb gene cluster that encodes beta-like globin genes expressed in erythroid cells. The active chromatin domain containing these genes and sub-domains that are expressed at different developmental stages have been established [Bulger, 2000 #1175; Forsberg, 2000 #2166; Schubeler, 2001, 21457241; Bulger, 2003 #1483], and binding sites for GATA-1 and other proteins have been determined [Johnson, 2002 #1481; Letting, 2003 #1435; Im, 2005 #2256]. This region also contains the erythroid expressed Ahsp gene whose product stabilizes nascent alpha-globin polypeptide chains [Kihm, 2002 #1433] and 26 other genes whose expression is up- or down-regulated during late erythroid maturation [Welch, 2004 #1829]. Thus this large segment provides a positive reference set of GATA-1 occupied sites around the Hbb gene cluster to establish sensitivity of the high throughput methods.

Having established a good threshold for sensitivity using known GATA-bound reference motifs, we then predicted all the sites occupied by the GATA-1 protein by analysis of the ChIP-chip results and validated all the stringent hits by quantitative PCR. Rather than covering the mouse genome broadly, we used the ChIP-chip data as an initial screen on a large chromosomal interval followed by extensive validation by quantitative PCR and further functional tests to establish a high-quality dataset for occupancy and function. The project is sufficiently large in scope, covering about twice the amount of DNA compared to the ENCODE pilot project [Birney, 2007 #2187], to address important issues about GATA-1 occupancy. With this dataset, we examined the distribution of sites occupied by GATA-1 and their roles in gene expression. We searched for determinants of occupancy both in the DNA sequence and in the patterns of conservation among homologous sequences in vertebrates.

Our results confirm that WGATAR is required for occupancy in vivo, but only about 1 in 1000 intervals containing this motif are actually occupied in this erythroid cell line. Multiple such motifs in an interval help distinguish bound from unbound sites, as does the presence of motifs matching the binding sites for CP2 and Krüppel-like factors such as Sp1 and EKLF. These results support previous observations about GATA-1 interacting with these factors [Merika, 1995 #438; Bose, 2006 #2264], and our data show that these additional motifs can account for about half the occupied sites with a specificity of about 60%.

While most sites currently known to be occupied by GATA-1 (or any transcription factor) were discovered by in depth analysis of promoters, enhancers and silencers, it is not clear whether the sites found to be occupied using high throughput methods are playing distinct roles in cellular physiology. We find that the sites occupied by GATA-1 in this long chromosomal interval in G1E-ER4 cells tend to be biologically active. Many of them (54%) are within enhancers, as demonstrated by increased reporter gene expression in transiently transfected K562 cells, or within promoters, as determined by proximity to transcription start sites. A few of the sites occupied by GATA-1 are far from GATA-1-responsive genes (as much as 4 Mb), but most are significantly closer to target genes than to genes that do not respond to GATA-1.

RESULTS

ChIP-chip and quantitative PCR data for GATA-1 binding on mouse chromosome 7

DNA in close proximity to GATA-1 in mouse erythroid cells was isolated by chromatin immunoprecipitation (ChIP), using antibody against the estrogen receptor domain of the hybrid protein, GATA-1ER, used to rescue the maturation phenotype in G1E-ER4 cells. GATA-1 ChIP material was isolated from three sources: the Gata1 null cell line G1E, the rescued G1E-ER4 cells prior to induction with estradiol (with much of the GATA-1ER protein in an inactive state), and the rescued G1E-ER4 cells after induction with estradiol (with the hybrid protein fully active). First, the GATA-1 ChIP DNA was screened in a high throughput technique using hybridization to a NimbleGen high density tiling array [Nuwaysir, 2002 #2279], a process called ChIP-chip [Ren, 2000 #2014]. In this method, the GATA-1 ChIP DNA was amplified by ligation-mediated PCR and then hybridized together with the input DNA. The microarray covered 66 Mb of mouse chromosome 7 (positions 63331168 to 129534093) at 100 bp resolution (i.e. the beginning of each 50 nucleotide probe is 100 bp away from the start of the adjacent probe in chromosomal coordinates). The ChIP-chip data included two replicates of the GATA-1 ChIP DNA from induced G1E-ER4 cells and from the G1E cells, which were hybridized to the microarrays resulting in 387,540 datapoints (for each replicate) for enrichment of GATA-1 under the two conditions.

Peaks in the ChIP-chip data from induced G1E-ER4 cells were determined using two different programs. Mpeak [Kim, 2005 #1941] searches for strings of consecutive probes with enrichment values above a user-defined threshold that also show a progressive decline on each side of the peak, producing a “hill” shape in the data profile. TAMALPAIS [Xu, 2007 #2280] searches for a certain number of consecutive probes above an enrichment value without regard to the shape of the enrichment profile (exact parameters are in the Methods). Overlapping regions called as peaks in both replicates of the induced G1E-ER4 ChIP material were retained both for high stringency (3 standard deviations for Mpeak, L1 or L2 for TAMALPAIS) and low stringency (1 standard deviation for Mpeak, L3 or L4 for TAMALPAIS) thresholds. The number of peaks found consistently in both replicates is given in Table 1 for each method. Peaks found in only one replicate were discarded. Similar numbers of peaks were found in the G1E GATA-1 ChIP material; these peaks found in the Gata1 null line must reflect noise in the assay. As expected, none overlapped with those found for the induced G1E-ER4 cells, which have active GATA-1ER present. Most of the peaks determined by the two methods overlapped, but some were found consistently by only one method (Table 1). The union of the sets (319 peaks) was examined further, retaining both those common to the two methods and unique to each. We separated them into those passing the high stringency filters (81 high stringency peaks) and those passing the low but not the high stringency filters (238 low stringency peaks). These 319 ChIP-chip peaks are referred to as GHPs (GATA-1 hit positive) followed by a numerical indicator.
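The idea of calling a peak from a run of consecutive probes above an enrichment value, and then keeping only peaks that overlap between replicates, can be illustrated with a minimal sketch. This is not the Mpeak or TAMALPAIS algorithm; the threshold, the minimum run length, the function names and the toy enrichment values are all hypothetical, and the sketch assumes the probes are listed in chromosomal order with no gaps.

```python
# Illustrative only: a simplified run-of-probes peak caller.
# Threshold and min_probes are placeholder values, not the study's parameters.

def call_runs(positions, values, threshold, min_probes, probe_len=50):
    """Return (start, end) intervals covered by at least min_probes
    consecutive probes whose enrichment exceeds threshold."""
    peaks, run = [], []
    for pos, val in zip(positions, values):
        if val > threshold:
            run.append(pos)
        else:
            if len(run) >= min_probes:
                peaks.append((run[0], run[-1] + probe_len))
            run = []
    if len(run) >= min_probes:  # close a run that reaches the last probe
        peaks.append((run[0], run[-1] + probe_len))
    return peaks

def consistent_peaks(peaks_a, peaks_b):
    """Keep peaks from replicate A that overlap some peak in replicate B."""
    return [a for a in peaks_a
            if any(a[0] < b[1] and b[0] < a[1] for b in peaks_b)]

# Ten probes whose starts are 100 bp apart, as on the tiling array.
positions = list(range(0, 1000, 100))
rep1 = [0.1, 0.2, 2.5, 3.0, 2.8, 0.3, 0.1, 2.6, 0.2, 0.1]
rep2 = [0.0, 0.3, 0.4, 2.2, 2.9, 2.4, 0.2, 0.1, 0.1, 0.2]

peaks1 = call_runs(positions, rep1, threshold=2.0, min_probes=3)
peaks2 = call_runs(positions, rep2, threshold=2.0, min_probes=3)
print(consistent_peaks(peaks1, peaks2))
```

In this toy example the single elevated probe at position 700 in the first replicate is not called, because it is neither part of a run nor reproduced in the second replicate.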

A reference set of sites occupied by GATA-1 in induced G1E-ER4 cells was then developed for the Hbb gene cluster. Previous studies have shown GATA-1 binding in mouse erythroid cells to six segments, viz. the promoter for the major adult beta globin gene Hbb-b1, locus control region DNase hypersensitive sites HS1, HS2, HS3 and HS4, and an upstream site HS-60.6 [Johnson, 2002 #1481; Letting, 2003 #1435; Im, 2005 #2256], which are within the 140kb active chromatin domain [Bulger, 2003 #1483].

These six sites were re-examined for occupancy by GATA-1 by using a quantitative PCR assay of the GATA-1 ChIP material prepared from induced G1E-ER4 cells that had not been amplified. Our results confirmed the occupancy of five of the sites in the line of G1E-ER4 cells used in our studies (Fig. 1). We do not observe occupancy of LCR HS4 in these cells, although it is clearly occupied in MEL cells (Supplemental Fig. 1).

The five sites in the Hbb gene cluster demonstrated by quantitative PCR to be occupied in induced G1E-ER4 cells constitute the reference set of occupied sites. Sensitivity can be evaluated as the ability to find these sites, and we find that the peaks determined from the ChIP-chip data at high stringency include 3 of the 5 sites, while those in the low stringency set include 4 of the 5 sites. Even the set of low stringency peaks does not include LCR HS1, despite the fact that considerable signal above the background is present (Fig. 1). We expect that this somewhat low sensitivity of 60-80% can be increased with improvements to the peak-calling software. The specificity within the 120kb domain was excellent; no other GHPs were found in this region, even for the low stringency set.

The ChIP-chip peaks identified outside of the beta-globin locus were then tested for validation of occupancy using quantitative PCR on unamplified GATA-1 ChIP material. These tests were run on a total of 135 GHPs, including all 81 from the high stringency set and 54 of the 238 low stringency GHPs. The ChIP material came from all three cell lines, i.e. G1E cells and G1E-ER4 before and after estradiol induction. The level of enrichment observed varied considerably (Fig. 2), with some sites showing high occupancy (as much as 140-fold over the G1E background) and others showing low occupancy (the quarter of the sites with the lowest occupancy have enrichment values ranging from 4 to 8-fold over background). The different levels of occupancy reflect both the number of GATA-1 protein molecules bound per site as well as the fraction of cells in culture in which the site is actually bound. Some low occupancy sites may have transient interactions with GATA-1. Other GHP sites show no enrichment for GATA-1-associated DNA; these are ChIP-chip hits that fail to validate. These could result from occasional biases in the amplification of the ChIP material prior to hybridization. The GATA-1 ChIP DNA from uninduced G1E-ER4 cells frequently shows a significant amount of binding by GATA-1ER. This suggests that some fraction of the hybrid protein molecules are capable of binding to DNA without estradiol, but in all cases the amount of binding increases upon treatment with estradiol.

Two criteria were employed to establish a threshold for validation of the ChIP-chip GHPs by quantitative PCR. One is the same as employed by Kim et al. [, 2007 #2083], which requires a signal from induced G1E-ER4 cells that exceeds three standard deviations above the mean enrichment for a set of 9 negative controls (sites that show no evidence for occupancy by ChIP-chip). Using this criterion 74 of the 81 high stringency GHPs (91%) are validated, which is similar to previous results [Kim, 2007 #2083] for CTCF occupancy (Fig. 2B). Another 21 of the 54 tested low stringency GHPs (39%) are also validated by this criterion, for a total of 95 validated GHPs. However, in the G1E system, we can also take advantage of the Gata1 null background. Some of the GHPs validated by the first criterion show very little enrichment in the rescued G1E-ER4 cells compared to the G1E null background. Thus we added a second criterion for full validation, requiring that the signal for the ChIP from induced G1E-ER4 cells be at least 4-fold greater than for the ChIP from G1E cells. This second criterion reduces the number of fully validated GHPs to 63 (Table 1), 54 from the high stringency set (67% validation rate) and 9 from the low stringency set (17% validation rate). It is important to note that of these 63, 11 are uniquely identified by Mpeak and 5 are uniquely identified by TAMALPAIS. Thus combining the results of different peak-finding programs is beneficial.
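The two validation criteria can be written as a short decision rule. The following is a minimal sketch, assuming that each site has one enrichment value from induced G1E-ER4 cells and one from G1E cells, and that the standard deviation of the negative controls is the sample standard deviation (the text does not specify which estimator was used). The function name and all numeric values are invented for illustration.

```python
# Illustrative only: the enrichment values below are invented, not study data.
from statistics import mean, stdev

def validate(induced, g1e, neg_controls, n_sd=3.0, min_fold=4.0):
    """Return (passes_first, fully_validated) for one candidate site.

    First criterion: induced signal exceeds mean + n_sd * SD of the
    negative controls. Second criterion: induced signal is at least
    min_fold times the signal from the Gata1 null G1E cells.
    """
    cutoff = mean(neg_controls) + n_sd * stdev(neg_controls)
    passes_first = induced > cutoff
    fully_validated = passes_first and induced >= min_fold * g1e
    return passes_first, fully_validated

# Nine hypothetical negative-control enrichment values.
neg = [0.8, 1.0, 1.2, 0.9, 1.1, 1.0, 0.7, 1.3, 1.0]

print(validate(induced=12.0, g1e=1.5, neg_controls=neg))  # strong, specific signal
print(validate(induced=3.0, g1e=2.0, neg_controls=neg))   # above controls, but also high in G1E
print(validate(induced=1.2, g1e=1.0, neg_controls=neg))   # within the range of the controls
```

The second case shows why the Gata1 null background is useful: a site can clear the negative-control cutoff and still show little enrichment relative to cells that lack GATA-1.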

The set of 63 fully validated GHPs is a high quality dataset for occupancy by GATA-1, with most of them resulting from applying high stringency thresholds for ChIP-chip peak-calling (86%) and all of them showing a strong enrichment in unamplified ChIP signal compared both to negative controls (no ChIP-chip peak) and to material from the Gata1 null cell line. However, it is unlikely to be complete. Given a 17% validation rate for the 54 low stringency GHPs tested, application of this validation rate to all 238 leads to an expectation that an additional 31 (40 total from the low stringency GHPs minus the 9 currently identified) would be validated if all were tested. Thus the full set of sites occupied by GATA-1 in this 66 Mb region may be about 94 sites. Note that this would not include any additional sites like LCR HS1, which did not meet even the low stringency threshold for ChIP-chip peaks but which is occupied.
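The extrapolation to about 94 occupied sites follows directly from the counts given above. A short sketch of the arithmetic, using only numbers stated in the text (the variable names are ours):

```python
# Counts taken from the text; variable names are illustrative.
validated_high = 54   # fully validated GHPs from the high stringency set
validated_low = 9     # fully validated GHPs among the low stringency GHPs tested
tested_low = 54       # low stringency GHPs tested by quantitative PCR
total_low = 238       # all low stringency GHPs

low_rate = validated_low / tested_low            # about 0.17
expected_low = round(low_rate * total_low)       # expected validated if all 238 were tested
additional = expected_low - validated_low        # not yet identified
estimated_total = validated_high + validated_low + additional

print(f"{low_rate:.2f}", expected_low, additional, estimated_total)
```

The estimate assumes that the 54 tested low stringency GHPs are representative of all 238.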

Specificity of GATA-1 occupied sites along mouse chromosome 7

With this estimate of 94 sites occupied by GATA-1, we can evaluate the frequency with which GATA-1 binds to its cognate binding site motif in this 66 Mb region of mouse chromosome 7. The GHPs from the peak-calling programs are about 500bp on average, so we determined the number of 500bp intervals in the 66 Mb that contain a WGATAR motif. Within the nonrepetitive DNA (which is the only DNA present on the tiling array) and excluding exons, there are 176,527 WGATAR motifs and 90,413 intervals of 500bp with at least one WGATAR. Using the latter as the number of potential binding sites, our results show that 94 out of 90,413 sites are occupied, or slightly greater than 1 in 1000 sites. Thus the protein GATA-1 is an exquisite discriminator among available motifs. The fact that 90,319 out of 90,413 potential binding sites are not occupied indicates that the ChIP data are highly specific, supporting the conclusion from examining the Hbb gene cluster domain.
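Counting WGATAR motifs and the intervals that contain them can be sketched with a regular expression, where W is A or T and R is A or G. The sketch below makes several assumptions that the text does not state: motifs are counted on both strands (so the reverse complement YTATCW is also matched), overlapping matches are counted, intervals are non-overlapping 500 bp windows, and a motif is assigned to the window in which it starts. The function names and the toy sequence are hypothetical.

```python
# Illustrative only: window definition and strand handling are assumptions.
import re

# Lookahead so that overlapping matches are all reported.
# [AT]GATA[AG] is WGATAR; [CT]TATC[AT] is its reverse complement.
MOTIF = re.compile(r"(?=([AT]GATA[AG]|[CT]TATC[AT]))")

def motif_starts(seq):
    """0-based start positions of every WGATAR match on either strand."""
    return [m.start() for m in MOTIF.finditer(seq.upper())]

def windows_with_motif(seq, window=500):
    """Number of non-overlapping windows containing at least one motif start."""
    return len({start // window for start in motif_starts(seq)})

# Toy sequence: one forward-strand motif at 100, one reverse-strand motif at 606.
seq = "C" * 100 + "AGATAA" + "C" * 500 + "TTATCT" + "C" * 400
print(motif_starts(seq), windows_with_motif(seq))

# Occupancy frequency from the counts given in the text.
occupied_fraction = 94 / 90413
print(f"{occupied_fraction:.5f}")
```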

Sequence determinants of occupancy by GATA-1

The set of 63 fully validated GHPs was evaluated for determinants of occupancy by GATA-1. As expected from the previous biochemical studies, the motif WGATAR is almost always found in sites occupied by GATA-1, being present in 60 (95%) of the 63 sites. This motif is quite common and is found in 77% of randomly sampled 500 bp intervals from the 66 Mb region. However, its frequency in the occupied sites is higher than in randomly sampled, unoccupied 500bp intervals for 994 trials out of 1000 (p=0.006). This result contrasts strikingly with those for other high throughput studies of occupancy by Sp1, c-Myc and p53 [Cawley, 2004 #2078] and E2F1 [Xu, 2007 #2280], which show only a minority of occupied sites with the canonical binding site motif. Perhaps the biochemically-defined binding site is a stronger contributor to occupancy by some transcription factors than others.
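The empirical p-value quoted for the resampling comparison follows from counting how often a random sample matches or exceeds the observed value. A minimal sketch, assuming the "greater than or equal" convention (the text does not state how ties were treated) and using made-up resampled frequencies:

```python
def empirical_p_value(observed, resampled):
    """Fraction of resampled values that are at least as large as observed."""
    hits = sum(1 for value in resampled if value >= observed)
    return hits / len(resampled)

# Hypothetical resampling outcome shaped like the one described in the text:
# 994 of 1000 random sets fall below the observed motif frequency, 6 do not.
observed = 0.95
resampled = [0.77] * 994 + [0.96] * 6
print(empirical_p_value(observed, resampled))
```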

However, the WGATAR motif is not sufficient for specific binding in vivo, and other factors must be determinants of binding to distinguish the occupied sites from the thousand-fold excess of segments with the motif that are not occupied. We searched for motifs that distinguish occupied from unbound DNA segments containing WGATAR motifs, using hexamer enumeration, MEME, AlignACE and CLOVER to find motifs with discriminatory power.

Words (strings of nucleotides) that are over-represented in the occupied intervals compared to unoccupied DNA intervals were determined using direct enumeration.

These over-represented words may be parts of binding site motifs that distinguish the bound from unbound sites. In order to focus on motifs other than a single WGATAR, we constructed a negative control set of 245 intervals from the 66 Mb region on mouse chromosome 7 that have at least one match to WGATAR, are 500 bp in size, and are located within 10 kb of the 63 sites occupied by GATA-1. Thus these negative controls have the same primary motif and are about the same length as the intervals in the positive set, but they do not bind GATA-1. We developed a custom program for word counts (see Methods). Because the canonical binding consensus of GATA-1 is a hexamer, the program counted hexamers in the initial steps.

There are 2080 possible unique hexamers (after merging hexamers with their reverse complements), and their frequencies were counted in the sequences of the occupied sites and in the sequences of the negative control dataset. Three different methods were used to find over-represented hexamers based on these counts: a one-sided Fisher’s Exact Test, Wilcoxon rank-order tests, and comparison with frequencies obtained after 1000 random re-partitionings after pooling the positive and control sets. The p-values estimated by each method were corrected for multiple testing using a false discovery rate of 0.05 or less (see Methods). After intersecting the sets of hexamers computed to be enriched at these FDR thresholds, we focused on a set of 60 hexamers common to the results of all three approaches. These (with their counts and enrichment statistics) are listed in Supplementary Table YZ1. In order to see if any hexamers were parts of longer words, the frequencies of nucleotides in the three positions before and after each hexamer were determined, based on all occurrences of each hexamer in the positive dataset. This approach showed evidence of longer words in only two cases, both of which indicate an extension of the WGATAR motif. One adds a G on the 5’ end (GAGATA) and the other adds a G on the 3’ end (WGATAAG).
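The figure of 2080 unique hexamers can be checked by enumeration: of the 4096 possible hexamers, 64 are their own reverse complement, and the remaining 4032 pair up into 2016 classes. The sketch below is not the custom program described in Methods; the function names are hypothetical, and choosing the alphabetically smaller word as the representative of each pair is our assumption.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]

def canonical(word):
    """Represent a word and its reverse complement by the smaller of the two."""
    return min(word, reverse_complement(word))

def canonical_kmers(k=6):
    """All distinct k-mers after merging each word with its reverse complement."""
    return {canonical("".join(p)) for p in product("ACGT", repeat=k)}

def count_canonical_kmers(seq, k=6):
    """Count k-mers in seq, pooling each word with its reverse complement."""
    counts = {}
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):   # skip words containing N or other symbols
            key = canonical(word)
            counts[key] = counts.get(key, 0) + 1
    return counts

print(len(canonical_kmers(6)))
```

Counts produced this way for the occupied and control sets would be the input to the enrichment tests listed above.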

We then searched for candidate proteins that may bind to these over-represented words by comparing the sequence of each word to a set of nonredundant motifs [Xie, 2005 #2289] compiled from the TRANSFAC library [Wingender, 2001, 20574815], the set of motifs in Jaspar [Sandelin, 2004 #1839] and custom motifs for binding sites for erythroid transcription factors (see Methods). Twenty previously described motifs for factor binding sites matched one or more over-represented words (Table 2). These are candidates for transcription factor binding site motifs that help distinguish sites bound by GATA-1 from unbound sites.

Words that match the binding site for GATA family of transcription factors were seen frequently, even though this motif was included in the control unbound set of sequences. This suggested that multiple instances of the WGATAR motif were distinctive for bound sites, and indeed this is the case. The average number of WGATAR motifs in the bound intervals is 2.7 whereas it is only 1.6 for unoccupied intervals with a WGATAR, and the fraction of bound intervals with multiple WGATAR motifs (78%) is considerably higher than that for unbound intervals containing WGATAR (46%).

Four of the motifs are similar to the binding sites for transcription factors previously shown to be associated with GATA-1: the Krüppel-like zinc finger proteins Sp1 and EKLF [Merika, 1995 #438; Gregory, 1996, 96202501], CP2 [Bose, 2006 #2264] and PU.1, an ETS-family protein [Klemsz, 1990 #189]. The protein PU.1 interacts directly with GATA-1 and represses transcriptional activation by GATA-1 [Rekhtman, 1999 #1450]. A related protein, FLI-1, has been shown to interact synergistically with GATA-1 to increase binding and activation of target genes [Eisbacher, 2003 #2245]. Additional proteins listed in Table YZ3, or relatives of them that share the binding site, are candidates for proteins that associate with GATA-1 at the sites of occupancy, and thus may help distinguish bound from unbound sites.

Three probabilistic methods were also employed to find motifs significantly associated with the intervals bound by GATA-1. MEME [Bailey, 1995 #1787] uses an expectation maximization strategy applied to a set of sequences of bound intervals, AlignACE [Hughes, 2000 #1496] is a Gibbs sampling method also applied to a set of sequences of bound intervals, and CLOVER [Frith, 2004 #2297] compares frequencies of motifs in bound intervals with the frequencies in a background distribution. Because these are probabilistic methods, we reduced the search space by focusing on shorter DNA intervals (average length of 120 bp) centered on the WGATAR motif. As expected, these methods also found WGATAR to be over-represented (Table 3). In addition, the half-site for CP2 binding was found by both MEME and AlignACE, and motifs that match binding sites for EKLF and PU.1 were also found by AlignACE. Thus these probabilistic methods provide additional evidence that these binding sites help distinguish bound from unbound intervals.

The combined results from all the approaches implicate at least five motif-based discriminators of occupied from unoccupied sites: multiple motifs for GATA-1 binding and binding site motifs for EKLF, Sp1, CP2, and PU.1 (or related proteins with similar binding site motifs). The frequency of each of these motifs in the GATA-1 occupied intervals and in randomly sampled intervals (1000 iterations of picking 63 intervals) in the 66 Mb region was then computed to assess significance of the enrichments and to evaluate the contributions of the motifs singly and in combination to sensitivity and specificity of binding by GATA-1 (Table 4). We also included in the analysis E-boxes, which are the binding sites for basic helix-loop-helix proteins, within 20 bp of a WGATAR. One basic helix-loop-helix protein, TAL1, can form a complex with E47, GATA-1, LDB1 and LMO2 in erythroid cells [Wadman, 1997 #720].

The most effective discriminator for GATA-1 binding is the presence of multiple copies of WGATAR, which captures 78% of the occupied sites but with a specificity of only 54%. No other single motif has a high sensitivity, indicating that no single additional protein or binding site will account for the majority of bound sites. However, each additional motif does capture a significant number of bound intervals with very high specificity, and they tend to capture different sets of intervals. Thus the combination of multiple WGATARs plus any of the other over-represented motifs gives a substantial increase in specificity (to almost 70%) while retaining fairly high (67%) sensitivity. The binding site motifs for EKLF, Sp1, and CP2 have the highest specificity, and combinations of the WGATAR motif (single or multiple) with those motifs give high specificity (80% to 85%) with moderate sensitivity (40% to 50%). Similar results were obtained when the motif frequencies were compared between bound intervals and negative control intervals that have a WGATAR but are not occupied (Supplementary Table YZ5).

Conservation of the DNA segments bound by GATA-1

The 63 fully validated sites occupied by GATA-1 show a wide range of DNA sequence conservation, with 12 (19%) deeply conserved in vertebrates including fish, 6 (9%) conserved in mammals including marsupials (but no further), 33 (52%) conserved in placental mammals, 7 (11%) conserved in boreoeutherians (including primates, rodents and carnivores), 2 (3%) conserved in euarchontoglires (primates and rodents) and 3 (5%) only found in rodents (Fig. 4A). As a group, the occupied sites tend to be conserved over a deeper phylogenetic distance than randomly sampled DNA. Conservation of the bound DNA segments out to clades as distant as other boreoeutherians or further occurs more frequently in the DNA segments occupied by GATA-1 than in randomly sampled DNA segments from the target region of mouse chromosome 7 (Fig. 4A). Conservation to these more distant clades occurs at as much as twice the frequency seen in random DNA segments. The most common level of conservation is to placental mammals, and the 1.3-fold increase in conservation to this distance is significant. It is far beyond the 75th percentile of comparable evaluations in random DNA segments. In contrast, conservation limited to the rodents or even boreoeutherians is substantially less frequent than is observed for random segments of DNA. {David, please describe the methods in detail in the Methods. E.g. are the random segments limited to noncoding nonrepetitive? Also, we need a figure legend.}

Conservation of the WGATAR motif in DNA segments bound by GATA-1

Despite the tendency of the DNA segments occupied by GATA-1 to be conserved in multiple eutherian orders or beyond, the WGATAR binding site motif is conserved to that phylogenetic distance much less frequently. The WGATAR motif is conserved in nonrodent mammals (or more deeply) for only 27 (45%) of the 60 occupied sites that have the motif (Fig. 4B). Conservation of WGATAR motifs outside rodents is highly significant; randomly sampled DNA intervals with this motif (but unbound) rarely show conservation of that motif outside rodents (Fig. 4B). Thus a subset of the sites occupied by GATA-1 have WGATAR motifs significantly more conserved than is observed for non-occupied sites. Also, the number of conserved WGATAR motifs is considerably higher in sites occupied by GATA-1 than in randomly sampled intervals (Fig. 4C). This property is a good predictor of a subset of occupied sites, but requiring such conservation of the motif for prediction will miss many occupied sites.

The majority of mouse DNA segments bound by GATA-1 have WGATAR motifs that are not conserved outside rodents (32 or 33 sites, 53% of the 60 sites with a WGATAR motif). In some cases this is because no DNA sequence in the comparison species aligns to the mouse WGATAR motif, and in other cases the DNA from comparison species aligns, but the inferred homolog to the mouse WGATAR motif has changed so that it no longer matches the consensus.

The lack of conservation of the WGATAR motif in this subset of occupied DNA intervals contrasts with the enrichment for conservation of the whole interval in multiple mammalian lineages. If the conservation of the DNA homologous to the occupied intervals in other species represents a constraint because of transcription factor occupancy in those species, then one may expect that the WGATAR motif should be present in the other species, but not in the six positions inferred to be homologous to the WGATAR in mouse. These would be examples of evolutionary turnover of motifs in DNA segments occupied by GATA-1 in multiple mammalian lineages. For each of these 39 occupied DNA intervals that have rodent-specific WGATAR motifs, we searched the homologous DNA interval (500bp to 700bp) in each comparison species for other instances of WGATAR. We found 25 cases with a WGATAR motif within the homologous DNA interval in at least one nonrodent mammal. In 13 of the cases, the other instance of the motif was conserved within a clear clade, such as primates. The latter are particularly good candidates for turnover events. {David - now that the number occupied segments with rodent-only WGATAR motifs has changed, did the numbers for this analysis change?}

Distribution of GATA-1 occupied sites along mouse chromosome 7

The distribution of the set of 63 fully validated GHPs along the 66 Mb of mouse chromosome 7 is not uniform; rather, the occupied sites appear to be clustered (Fig. 5). If the clusters are cis-regulatory modules (CRMs) that are regulating genes within these regions, then we should find erythroid-regulated genes in most of the regions with clusters of GATA-1 occupied sites. Alternatively, the occupied sites could be in CRMs that are very far away from previously characterized erythroid loci such as the Hbb gene complex, in which case the occupied sites would not show an association with erythroid-regulated genes.

Using the expression microarray data from G1E-ER4 cells [Welch, 2004 #1829], in the 66 Mb region we find 29 genes whose transcript levels change at least two-fold (either increased or decreased) after restoration of GATA-1 activity. The GHPs tend to be closer to the transcription start sites (TSSs) of these GATA-1-responsive genes than to randomly selected positions in the nonrepetitive DNA (Fig. 6a). In 300 randomized samplings, the cumulative distributions exceeded that for the distance between GHPs and TSSs of responsive genes only once (p-value of 0.003) at 400kb and three times (p-value of 0.01) at 500kb (Fig. 6a). As expected, the distributions of distances from GHPs are no different for responsive genes or randomly sampled points when the distances exceed 500 kb (inset in Fig. 6a).

However, validated GHPs tend to be closer to the TSSs of genes than to randomly selected points (Fig. 6b), so we also compared the distances of validated GHPs to either GATA-1-responsive genes or to all known genes. Again, the distances to the responsive genes tend to be shorter than the distances to any gene (Fig. 6c). For example, the cumulative distribution curves for samplings of all genes passed that for GATA-1-responsive genes at 400 kb in only 15 out of 300 trials, for an empirical p-value of 0.05.

Biological function of the sites occupied by GATA-1

The set of 63 fully validated GHPs was then evaluated for biological function as erythroid enhancers and/or promoters. We tested 55 of the 63 validated sites occupied by GATA-1 for their ability to enhance expression of luciferase from an HBG promoter in transiently transfected K562 cells. Twelve DNA segments that do not contain any WGATAR motif and do not overlap with the GATA-1 binding hits were used as negative controls. An additional 51 GHPs that did not pass the qPCR validation were also tested for enhancer activity (labeled “not-occupied” in Fig. 7a). GHP intervals that caused at least a two-fold increase in activity compared to that of the parental reporter gene plasmid (corresponding to the mean of the negative controls plus 3 standard deviations) in at least two repeats of the assay are considered to be active as enhancers.
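The activity call described here can be expressed as a short rule. The sketch below uses made-up readings rather than data from this study; the function names are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def control_threshold(control_values, n_sd=3):
    """Mean of the negative controls plus n_sd sample standard deviations."""
    return mean(control_values) + n_sd * stdev(control_values)

def is_enhancer(fold_changes, threshold=2.0, min_repeats=2):
    """True if at least min_repeats repeats reach the fold-change threshold."""
    return sum(1 for fc in fold_changes if fc >= threshold) >= min_repeats

# Made-up fold changes relative to the parental plasmid, for illustration only.
controls = [1.0, 1.1, 0.9, 1.2, 0.8]
print(round(control_threshold(controls), 2))
print(is_enhancer([2.5, 1.2, 2.1]))   # two repeats at or above 2-fold
print(is_enhancer([2.5, 1.2, 1.0]))   # only one repeat at or above 2-fold
```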

Among the 59 validated GATA-1 occupied regions tested, 30 (51%) are enhancers in this assay (Fig. 7a, Table 5). Some are very strong enhancers, with eight causing greater than a 4-fold increase in luciferase activity. The strongest enhancer among this group is LCR HS2, which is known to be a particularly powerful cis-regulatory module. In contrast, ChIP-chip hits that are not validated by qPCR are rarely enhancers.

The sites occupied by GATA-1 that also have at least one WGATAR motif conserved beyond rodents are strongly associated with enhancer activity. As shown in Table 6, the occupied sites with conserved WGATAR motifs are positive in an enhancer assay at over twice the frequency of those with nonconserved WGATAR motifs (76% versus 33%, p-value < 0.004 by a chi-squared test). Also, GHPs with conserved WGATAR motifs show significantly higher activity in enhancer assays than do GHPs with non-conserved motifs (Fig. 7b). Almost all the occupied sites with high enhancer activity (>=3-fold change) contain at least one WGATAR motif conserved outside rodents; 13 of the 16 high activity occupied sites (81%) have this property (Fig. 7c). In contrast, fewer than half of the lower activity enhancers (between 2- and 3-fold effects) have a conserved WGATAR, and only 30% of the occupied sites with no activity have a conserved motif. {Yong - check these numbers and make sure they fit with Fig. 7c. Shouldn’t the light and dark gray columns add up to 1.0 for each activity category? It does not look like they do that. -rh}
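A chi-squared comparison of enhancer frequency between sites with conserved and nonconserved motifs can be sketched for a 2x2 table as follows. The counts in the usage lines are placeholders and are NOT the values in Table 6; omitting a continuity correction is also an assumption, since the text does not say which variant was used.

```python
from math import erfc, sqrt

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs a nonzero total")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # For one degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    return statistic, erfc(sqrt(statistic / 2))

# Placeholder counts, not taken from Table 6:
# rows = conserved / nonconserved motif, columns = active / inactive.
stat, p = chi_squared_2x2(15, 5, 8, 12)
print(f"chi-squared = {stat:.2f}, p = {p:.4f}")
```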

Many erythroid promoters contain binding sites for GATA-1. In order to determine what fraction of the occupied sites in this study are candidates for a role in promoting transcription, we searched for occupied sites close to transcription start sites (TSSs) of genes. Three different datasets of TSSs were examined, viz. RefSeq genes [Pruitt, 2001, 20574773], UCSC Known Genes [Hsu, 2006 #2023], and CAGE-tag data from the RIKEN-FANTOM consortium [Hayashizaki, 2006 #2155]. A 500 bp region centered on each recorded TSS was considered a candidate promoter region. The numbers of TSSs differ considerably among the datasets, with the CAGE-tag data giving at least 13 times more TSSs than those from the gene models (after merging overlapping 500bp intervals with TSSs; Table YPZ). These additional promoters are thought to represent alternative start sites for protein-coding genes and start sites for noncoding transcripts.

The promoters deduced from the gene models overlap with 13% to 17% of the DNA segments occupied by GATA-1 (Table 7). The RefSeq and Known Genes datasets are predominantly protein-coding genes with substantial support from mRNAs. Thus the promoters derived from these sets that overlap the sites occupied by GATA-1 are strong candidates for promoters directly regulated by this transcription factor. These are listed in Supplementary Table YPZ_2.

Even more (40%) of the validated GHPs overlap with intervals containing CAGE-tags (Table 7). However, with so many 500 bp intervals containing CAGE-tags (covering about 8% of the 66 Mb), it was important to show that this represents an enrichment over expectation. In 1000 rounds of resampling 10,476 regions of 500 bp from the non-repetitive, non-exonic portion of the 66 Mb region, only once did the number of intervals overlapping the GHPs meet that seen for the CAGE-tags (i.e. 25 overlapping intervals; the number of overlaps from resampling never exceeded 25 and the mean was 10). Thus the enrichment of validated GHPs for CAGE-tag intervals is highly significant (empirical p-value = 0.001, see Supplemental Figure YC2). These results show that sites occupied by GATA-1 are strongly associated with transcription start points, and many (about 15%) are close to, and likely part of, the control regions of promoters driving transcription of protein-coding genes. The CAGE-tag clusters associated with other occupied sites could be start sites for transcripts involved in regulatory mechanisms [Kong, 1997 #1976].

Many of the DNA segments occupied by GATA-1 that are close to TSSs were also tested for enhancer activity by transient transfection of K562 cells. About half (43% to 56% depending on the source of TSS information) of the occupied sites implicated as promoters were also active as enhancers (Table YPZ). Thus current information can assign a likely activity to 42 (67%) of these 63 sites occupied by GATA-1: 25 are candidate promoters (based on overlap with intervals containing TSSs defined by CAGE-tags) and 29 are validated as enhancers, but 12 are candidate promoters and have enhancer activity (25+29-12= 42). Of the remaining 21, 13 are not enhancers or promoters (check this for CAGE-tags) and 8 have not been tested for enhancement.

Discussion

We have combined large-scale ChIP-chip data followed by independent validation of occupancy and experimental tests on the occupied sites to learn important new information about GATA-1 and its interaction with genomic DNA in vivo. In contrast to recent results for several transcription factors thought to be sequence-specific regulators, such as Myc, Sp1 and E2F1, the transcription factor GATA-1 does require a specific sequence motif (WGATAR) for occupancy. However, only a tiny fraction of the instances of this motif are occupied, about one in a thousand. Multiple WGATAR motifs in combination with motifs for binding sites for EKLF, Sp1, CP2 and PU.1 are characteristic of a large majority of the occupied sites, but they are far from sufficient to explain the exquisite discrimination by GATA-1. The occupied sites are closer to genes that are regulated by GATA-1 than to non-target genes, and we suggest that silencing of long chromosomal intervals through chromatin modifications prevents binding by GATA-1 and thus contributes to the specificity of occupancy. Most of the occupied sites are conserved to deeper phylogenetic clades than are randomly chosen intervals, showing evidence of evolutionary constraint on these sequences. However, slightly less than half of the WGATAR binding motifs found in the mouse DNA intervals are conserved outside rodents. The occupied sites with WGATAR binding motifs conserved in eutherian mammals or more deeply are strongly enriched for high enhancer activity. The occupied sites with rodent-specific WGATAR motifs tend to have WGATAR motifs at other locations that are conserved in specific clades, consistent with evolutionary turnover in the motifs.

Quality of the dataset of occupied sites

In our study design to find GATA-1 occupied sites on mouse chromosome 7, we utilized high throughput ChIP-chip data as an initial predictor of occupancy, followed by a large number of validation assays using quantitative PCR. The sensitivity of our results is 60-80% when evaluated using a reference set of sites around the Hbb gene complex known to be occupied by GATA-1. The specificity, defined as true negatives/(true negatives plus false positives), is excellent, because the number of true negatives (about 90,300) is enormous compared to the number of false positives (the GHPs that fail the quantitative PCR evaluation, which is 27 or 72 depending on the stringency of the threshold for testing). Although most large-scale studies of occupancy using ChIP-chip data include estimates of sensitivity based on previously determined occupied sites, analysis of specificity, i.e. the ability to reject unoccupied sites, often is lacking. A recent study from the ENCODE Pilot Project [Johnson, 2008 #2307] addressed this issue by constructing a simulated ChIP preparation of cloned DNA fragments that were added as spiked samples to genomic DNA. In studies performed in several labs on three different platforms using multiple methods for peak-calling, the sensitivity obtained with low abundance samples that required amplification (as do those in our study) ranged from about 20% to 47% at a false discovery ratio (false positives/total number of true positives) of about 5%. Applying the same measures to our high stringency GHPs, we obtain a sensitivity of 55% at a false discovery ratio of 20% (Supplementary Figure SvFDR). In contrast to the results of the Johnson et al. study, we see a substantial increase in sensitivity as the false discovery ratio is increased. One aspect that contributes to our higher sensitivity is combining the results of different peak-calling methods. Some of our fully validated bound sites were found by only one of the two methods used.

ChIP-chip is an excellent technology for large scale surveys of sites occupied by proteins, but our results also show that current analysis methods are not as sensitive as may be desired. Neither of the two peak-calling methods we employed found HS1 of the beta-globin LCR, despite the fact that the ChIP-chip arrays did reveal specific hybridization to this interval. Using conventional high-stringency thresholds on peak calling revealed 54 of the 63 bound sites discovered, and we estimate that an additional 31 sites remain to be identified in the larger number of low stringency hits. It is likely that novel approaches to analyzing the ChIP-chip data will improve performance. Directly incorporating a comparison with the ChIP-chip results from the Gata1 null cell line and developing training data based on validated samples could potentially reduce the false positives while recovering more of the occupied sites that are currently missed (false negatives). However, it is likely that comprehensive identification of sites occupied at a low level will remain a challenge.
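The specificity defined earlier in this section (true negatives divided by the sum of true negatives and false positives) can be checked with the counts quoted there. This is a sketch of the arithmetic only; the function name is ours, and the true-negative count is the approximate value given in the text.

```python
def specificity(true_negatives, false_positives):
    """True negatives divided by (true negatives + false positives)."""
    return true_negatives / (true_negatives + false_positives)

TRUE_NEGATIVES = 90_300   # "about 90,300" unoccupied intervals with a WGATAR

for false_positives in (27, 72):   # GHPs failing qPCR at the two thresholds
    value = specificity(TRUE_NEGATIVES, false_positives)
    print(false_positives, f"{value:.4f}")
```

With either count of false positives the specificity stays above 0.999, which is why it is described as excellent.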

Determinants of occupancy

The protein GATA-1 has a high affinity in solution for sequences containing the motif WGATAR, but previous studies left it unclear whether this was true for sites occupied in vivo. Our results show that the WGATAR motif is significantly associated with occupancy: 95% of the occupied sites have the motif and unoccupied sites have substantially fewer of these motifs. The three DNA segments bound by GATA-1 but lacking a WGATAR motif do have a match to a position specific weight matrix for the binding site. Thus non-consensus GATA-1 binding sites in those segments could be recognized by the protein. Certainly, sites with small deviations from the consensus motif [Merika, 1993 #194][Ko, 1993 #195][Newton, 2001 #2324][Molete, 2002, 21856495] are bound by GATA-1 in vitro. Another contributor to the occupied sites lacking a consensus motif could be indirect interactions with GATA-1. Although some binding to sequences without the canonical motifs occurs, it is a small minority of the sites. Thus requiring a match to the consensus motif for predictions of occupancy makes the predictions considerably more specific while losing only a small amount in sensitivity. Many mutagenesis studies have also shown the importance of the WGATAR motif for regulated expression of reporter genes [e.g. Gong, 1991 #79; Walters, 1992 #750; Abruzzo, 1994 #413; Vyas, 1999 #1451; Wang, 2006 #2007]. These results, combined with the in vitro affinity and our demonstration of the motifs in the preponderance of sites occupied in vivo, make a strong case for the WGATAR motif as a critical determinant of in vivo binding by GATA-1.

However, the presence of the WGATAR motif is not sufficient to determine occupancy by GATA-1 in erythroid cells; in fact, only about 1 in 1000 potential sites is occupied. Some of the other sites may be occupied by GATA-1 in myeloid lineages, not in erythroid cells, presumably regulating genes that are specifically activated or repressed by GATA-1 in those other cell types. Thus some of the specificity could be determined by combinatorial actions of different transcription factors with GATA-1. Our investigation of motifs associated with occupancy is consistent with this. We find that motifs similar to the binding sites of several transcription factors active in erythroid cells are enriched in the sites occupied by GATA-1 in G1E-ER4 cells. The proteins EKLF, Sp1, CP2 and ETS-family proteins have been shown to interact with GATA-1 [Merika, 1995 #438; Gregory, 1996, 96202501; Bose, 2006 #2264; Rekhtman, 1999 #1450; Eisbacher, 2003 #2245], and motifs for one or more of each of these factors in combination with the WGATAR motif contribute to specific occupancy. One ETS-family protein, PU.1, is an antagonist of GATA-1 function [Rekhtman, 1999 #1450], but a related protein, FLI-1, works synergistically with GATA-1 and thus it may be a more likely candidate for the protein that appears to be recognized by GATA-1 at some of these sites. Further ChIP experiments utilizing antibodies against these candidate proteins will test these hypotheses.

Another major determinant of occupancy by GATA-1 is the presence of multiple WGATAR motifs. This suggests that multiple molecules of GATA-1 may be bound to such sites in vivo. GATA-1 is known to self-associate through its zinc finger domains [Crossley, 1995 #2327], and a mutant GATA-1 defective in self-association activity can only partially rescue the Gata1-0.5 mutant mouse [Shimizu, 2007 #2328]. This shows that the specific interaction of multiple GATA-1 protein molecules is needed for the regulatory activity of GATA-1. The fact that most of the occupied sites also have multiple binding motifs suggests that this self-association in vivo may be driven both by multiple binding sites and by specific protein-protein interaction domains.

The motifs in our study are not able to explain all the exquisite discrimination among potential binding sites by GATA-1. The occupied sites in erythroid cells tend to cluster in the broad vicinity of genes whose expression is regulated in response to GATA-1 in erythroid cells. It is possible that some of the large segments of unoccupied sites are regions of the chromosome that are silenced in erythroid cells. If so, then these regions may be in an inaccessible chromatin conformation, and one would expect to find chromatin marks associated with silencing in these regions. Further ChIP experiments directed against chromatin modifications would test this hypothesis.

Activity of the occupied sites

A majority of the 63 sites occupied by GATA-1 are active in gene regulation and expression. At least 30 of them are active as enhancers in transiently transfected K562 cells, and 25 overlap with a CAGE-tag cluster indicative of promoter activity. After accounting for the occupied sites that are active in both assays, we find that 43 (68%) show evidence of promoter and/or enhancer activity. The other 20 occupied sites could also play roles in regulation that are more difficult to assay. For example, negative cis-regulatory modules such as silencers are difficult to detect in the transfection assays employed here, because decreases in activity that are significantly different from those obtained with some predicted neutral DNAs are rarely observed. Our assays would also miss enhancers that need transcription factors that are not present in K562 cells. Some of the occupied sites may be cis-regulatory modules that are playing some other role in regulation. Consider HS-60.6 in the Hbb gene cluster. It is a strong DNase hypersensitive site located about 20 kb from the boundary of the DNase sensitive domain in erythroid cells. Our results confirm previous observations that it is occupied by GATA-1 [Im, 2005 #2256], but its role is unknown.


Methods / Supplementary Data and Analysis

Methods

Cell culture

Cells were cultured as described previously (2004 blood). Briefly, G1E cells were grown in IMDM medium with 15% fetal calf serum, 2 U/ml Epo, and 50 ng/ml kit ligand. To induce activation of the conditional GATA-1, cells were cultured in the presence of 10^-7 mol/L beta-estradiol for 24 hr.

Chromatin immunoprecipitation

ChIP assays were conducted as described previously (Wang et al). Three cell populations were used in this assay: G1E knockout cells, and G1E-ER4 cells before and after induction of the GATA-1-ER hybrid with estradiol.

For the ChIP-chip assay, ChIP DNA was amplified by LM-PCR. (need the information from Lou)

For conventional ChIP assays, ChIP DNA was quantified by SYBR Green real-time PCR on an ABI 7300 PCR machine. Real-time PCR primers were designed with the PrimerQuest tool on IDTdna.com, with amplicon lengths between 120 and 150 bp. PCR product signals were referenced to a dilution series of the relevant input.

Peak finding

Both the Mpeak and TAMALPAIS programs were applied to identify the GATA-1 binding hits. For Mpeak, the two duplicate ChIP-chip results from the induced G1E-ER4 cells were used as input separately. Only the peaks identified in the outputs of both duplicates were selected as GATA-1 binding hits. Different pre-filtering thresholds (mean+1sd, mean+2sd, mean+2.5sd and mean+3sd) were chosen, and all other parameters were left at their defaults. The “Peak only” output, which contains only 50 bp regions, was used; we then extended those regions by 250 bp on both sides to obtain the final GATA-1 binding hits.

TAMALPAIS requires triplicates as input, so we generated two inputs. One combined two copies of the hybridization 1 result with the hybridization 2 result; the other combined two copies of the hybridization 2 result with the hybridization 1 result. As with Mpeak, only the peaks identified in both outputs were selected as the final binding hits. We then used merge and union operations to combine the results from the two programs, and lifted the results over from mm6 to mm8.
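The interval logic described above (keep peaks found in both replicates, extend by 250 bp, then merge the calls from the two programs) can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names and the half-open coordinate convention are assumptions, and it treats a single chromosome only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed half-open [start, end) on one chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def shared_peaks(rep1: List[Interval], rep2: List[Interval]) -> List[Interval]:
    """Keep the peaks of rep1 that overlap at least one peak of rep2."""
    return [p for p in rep1 if any(overlaps(p, q) for q in rep2)]


def extend(peaks: List[Interval], flank: int = 250) -> List[Interval]:
    """Extend each peak by `flank` bp on both sides (clipped at zero)."""
    return [(max(0, s - flank), e + flank) for s, e in peaks]


def merge_union(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Union of two peak lists, merging overlapping or touching intervals."""
    merged: List[Interval] = []
    for s, e in sorted(a + b):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


# Hypothetical 50 bp "peak only" calls from two replicate hybridizations
rep1 = [(1000, 1050), (5000, 5050)]
rep2 = [(1020, 1070), (9000, 9050)]
mpeak_hits = extend(shared_peaks(rep1, rep2))          # [(750, 1300)]
other_hits = [(1200, 1600), (20000, 20400)]            # hypothetical second caller
print(merge_union(mpeak_hits, other_hits))             # [(750, 1600), (20000, 20400)]
```

The lift-over between assemblies is not shown, since it depends on external chain files.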

Direct enumeration of hexamers

Each of the 4,096 possible hexamers of nucleotides was counted in the DNA sequences of the GATA-1 occupied segments and in the sequences of the control, unbound DNA segments. Counts for hexamers that are reverse complements of each other were combined, which reduces the number of possible hexamers to 2080. The program kmercenary is available from _________.
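As an illustration of the counting scheme, the sketch below collapses each hexamer with its reverse complement and confirms that 4,096 hexamers reduce to 2080 classes (64 hexamers are their own reverse complement, so (4096 + 64) / 2 = 2080). It is not the kmercenary program; the function names and the choice of the alphabetically smaller word as the class representative are assumptions made for this example.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    return word.translate(COMPLEMENT)[::-1]


def canonical(word: str) -> str:
    """Represent a word and its reverse complement by the alphabetically smaller one."""
    return min(word, reverse_complement(word))


def count_hexamers(seq: str, k: int = 6) -> Counter:
    """Count every k-mer in seq, combining reverse-complement pairs."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # skip windows with N or other symbols
            counts[canonical(word)] += 1
    return counts


all_classes = {canonical("".join(p)) for p in product("ACGT", repeat=6)}
print(len(all_classes))  # 2080
```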


Identification of enriched hexamers

To identify motifs in addition to WGATAR that may be important in determining GATA-1 occupancy, a set of control, unbound DNA segments containing WGATAR motifs was selected. We selected a set of 245 DNA intervals that met these criteria: each contains at least one WGATAR motif within its sequence, the average length is 500 bp, and each resides within 10 kb of a bound site.

Each of the 2080 hexamers was tested for enrichment in the occupied DNA segments compared to the controls by three methods. In the first, a two-by-two contingency table was constructed with columns giving the number of DNA intervals with the hexamer present or absent and rows separating intervals bound by GATA-1 from control intervals. Each of the 2080 contingency tables was evaluated using a one-sided Fisher's exact test. In the second method, a one-sided Wilcoxon rank order test was used to compare distributions of hexamer presence in bound and control DNA intervals. In the third method, all bound and control intervals were pooled together, followed by random partitioning of the pool 1000 times. The enrichment for each hexamer was computed as the ratio of fraction of (pseudo)bound sites with the hexamer over fraction of (pseudo)control regions with the hexamer, giving 1000 values for the enrichment of each of the 2080 hexamers. Comparison of the enrichment observed in the real bound intervals versus control intervals with the enrichment values from the random partitioning gives an empirical p-value.
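Two of these tests can be sketched with the standard library alone. The block below computes a one-sided Fisher's exact p-value from the hypergeometric distribution and an empirical p-value by random partitioning of the pooled intervals. It is a sketch under stated assumptions, not the analysis code: the function names, the fixed random seed, and the handling of a zero control fraction are choices made for this example, and the Wilcoxon test is omitted.

```python
import random
from math import comb
from typing import Sequence


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(count >= a) for the table [[a, b], [c, d]].

    Rows: bound / control intervals. Columns: hexamer present / absent.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    return sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1)) / total


def enrichment(bound: Sequence[int], control: Sequence[int]) -> float:
    """Fraction of bound intervals with the hexamer over the control fraction."""
    fb = sum(bound) / len(bound)
    fc = sum(control) / len(control)
    if fc == 0:  # assumption: treat as infinite enrichment unless both are zero
        return float("inf") if fb > 0 else 0.0
    return fb / fc


def empirical_p(bound: Sequence[int], control: Sequence[int],
                n_perm: int = 1000, seed: int = 0) -> float:
    """Share of random partitions whose enrichment reaches the observed value."""
    rng = random.Random(seed)
    observed = enrichment(bound, control)
    pool = list(bound) + list(control)
    n_bound = len(bound)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        if enrichment(pool[:n_bound], pool[n_bound:]) >= observed:
            hits += 1
    return hits / n_perm


print(fisher_one_sided(3, 0, 0, 3))  # 0.05
```

Here each interval is represented by 1 (hexamer present) or 0 (absent), so `bound` and `control` are presence flags rather than counts.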

The p-values computed for each of the three methods were corrected for multiple testing using an FDR package provided by the R statistical package {REF}. For Fisher’s Exact Test, 172 enriched hexamers passed an FDR q-value threshold of 0.05. For the Wilcoxon test, 116 enriched hexamers passed an FDR q-value threshold of 0.01. For the empirical method (random partitioning of the pool), 110-140 enriched hexamers had an empirical FDR q-value of 0 (5 separate runs of the q-value computation). A set of 60 enriched hexamers was common to these seven sets that pass FDR thresholds, and these were the focus for further study.

To test for the presence of longer words that contain the hexamers, each instance of a hexamer along with three nucleotides on each side was extracted from the input sequences. The frequencies of nucleotides at each position in the hexamer and the extensions on each side were computed and summarized as logos {REF}.
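The extraction step can be illustrated with a short sketch that pulls out each hexamer instance with three flanking nucleotides and tabulates per-position nucleotide frequencies, which is the input a logo would summarize. The function names are hypothetical, only the forward strand is scanned, and instances too close to a sequence end to have full flanks are skipped; these are simplifying assumptions.

```python
from collections import Counter
from typing import Dict, List


def extended_instances(seq: str, hexamer: str, flank: int = 3) -> List[str]:
    """Return each occurrence of hexamer with `flank` nucleotides on each side."""
    seq = seq.upper()
    k = len(hexamer)
    found = []
    start = seq.find(hexamer)
    while start != -1:
        if start >= flank and start + k + flank <= len(seq):
            found.append(seq[start - flank:start + k + flank])
        start = seq.find(hexamer, start + 1)
    return found


def position_frequencies(words: List[str]) -> List[Dict[str, float]]:
    """Nucleotide frequencies at each position of equal-length words."""
    freqs = []
    for column in zip(*words):
        counts = Counter(column)
        freqs.append({nt: counts[nt] / len(column) for nt in "ACGT"})
    return freqs


words = extended_instances("CCCAGATAAGGGTTTAGATAATTT", "AGATAA")
print(words)  # ['CCCAGATAAGGG', 'TTTAGATAATTT']
```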

Identify known transcription factor binding sites by matching the enriched words to a non-redundant TFBS library

The enriched hexamers were compared to the consensus sequences for known binding sites for transcription factors, using a nonredundant set (Xie et al.) derived from TRANSFAC {REF}, from the Jaspar library {REF}, and custom consensus sequences for binding sites for three erythroid transcription factors: EKLF BS (CCNCACCCW), GATA-1 BS (WGATAR), CP2 BS (CCWG half site). We used a string-matching program scoring exact matches as 1, mismatches as -1, and partial positives for matching a degenerate position other than N. The following scoring matrix was used.

       A     T     G     C     R     Y     M     K     S     W     H     B     V     D     N
A      1    -1    -1    -1   0.5    -1   0.5    -1    -1   0.5  0.33    -1  0.33  0.33     0
T     -1     1    -1    -1    -1   0.5    -1   0.5    -1   0.5  0.33  0.33    -1  0.33     0
G     -1    -1     1    -1   0.5    -1    -1   0.5   0.5    -1    -1  0.33  0.33  0.33     0
C     -1    -1    -1     1    -1   0.5   0.5    -1   0.5    -1  0.33  0.33  0.33    -1     0

The similarity score is the sum of all comparison scores over the overlapped positions, normalized for the average size of the compared strings. Each of the 60 common enriched hexamers was compared to each consensus sequence from the three data sources for factor binding sites. The known factor binding site with the highest similarity score to a hexamer is considered the best match. Analysis of the rank order of similarity scores for all factor binding sites was used as a measure of robustness of a match; similarity scores that are strikingly higher than the bulk are considered robust. {Ying - do you have a number or some threshold for this?}
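A minimal sketch of this scoring follows. The per-position scores reproduce the matrix above from the IUPAC definitions (1 for an exact match, 0.5 or 0.33 for a degenerate position that includes the base, 0 for N, -1 otherwise). How the two strings were aligned is not stated, so the sketch assumes ungapped alignment at every offset on the forward strand and keeps the best offset; the function names are hypothetical.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def position_score(base: str, code: str) -> float:
    """Score one hexamer base against one (possibly degenerate) consensus symbol."""
    if code == "N":
        return 0.0
    allowed = IUPAC[code]
    if base not in allowed:
        return -1.0
    return round(1 / len(allowed), 2)  # 1, 0.5 or 0.33


def similarity(hexamer: str, consensus: str) -> float:
    """Best ungapped alignment score, normalized by the mean string length."""
    mean_len = (len(hexamer) + len(consensus)) / 2
    best = float("-inf")
    for offset in range(-len(hexamer) + 1, len(consensus)):
        total = 0.0
        for i, base in enumerate(hexamer):
            j = offset + i
            if 0 <= j < len(consensus):  # only overlapped positions contribute
                total += position_score(base, consensus[j])
        best = max(best, total / mean_len)
    return best


def best_match(hexamer: str, library: dict) -> str:
    """Name of the consensus with the highest similarity to the hexamer."""
    return max(library, key=lambda name: similarity(hexamer, library[name]))


# The three custom erythroid consensus sequences given in the text
custom = {"EKLF": "CCNCACCCW", "GATA-1": "WGATAR", "CP2": "CCWG"}
print(best_match("AGATAA", custom))  # GATA-1
```

Scanning the reverse complement of each hexamer as well would be a natural extension, since the hexamer classes combine both strands.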

Distance between GATA1 response genes and GATA1 occupancy sites

Information on GATA-1 response genes was obtained from Affymetrix expression profiles of G1E cells after GATA-1 restoration, with a time course of 3, 7, 14, 21, and 30 hr. The raw data were normalized by the RMA method using a Bioconductor package. Genes showing at least a 2-fold change compared to 0 hr at any point in the time course were defined as GATA-1 response genes.

To check whether the validated GHPs are randomly distributed or tend to lie close to the GATA-1 responsive genes, we generated 300 data sets, each containing 29 randomly chosen positions in the 70 Mb ChIP-chip region, to mimic the genes whose expression changed. We compared the distances between validated GHPs and the transcription start sites of the nearest GATA-1 response genes with those between GHPs and the random sites, repeating this process 300 times.

We then tested whether GHPs are located near genes in general. A total of 1031 known genes in the 76 Mb region were downloaded from the UCSC browser. The random sites were 1031 randomly chosen sites in the region, and the sampling was repeated 300 times.

To determine whether the GHPs are specifically close to GATA-1 responsive genes rather than to genes in general, we compared the distances between validated GHPs and the transcription start sites of the nearest GATA-1 response genes with those between GHPs and randomly chosen known genes. Each time we randomly chose 29 known genes, and this was repeated 300 times.

Analysis of over-representation used programs in the R statistical package.
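The distance comparison can be sketched as below: for each GHP, find the distance to the nearest target position, then compare the cumulative fraction of GHPs within a given distance against the same quantity for random site sets. The function names, the uniform sampling of random positions, and the fixed seed are assumptions made for illustration; coordinates are taken to lie on a single chromosome.

```python
import random
from bisect import bisect_left
from typing import List


def nearest_distance(pos: int, sorted_targets: List[int]) -> int:
    """Distance from pos to the closest position in a sorted target list."""
    i = bisect_left(sorted_targets, pos)
    candidates = sorted_targets[max(0, i - 1):i + 1]
    return min(abs(pos - t) for t in candidates)


def cumulative_fraction(distances: List[int], cutoff: int) -> float:
    """Fraction of distances at or below the cutoff (one point of the cumulative curve)."""
    return sum(d <= cutoff for d in distances) / len(distances)


def random_site_sets(region_start: int, region_end: int,
                     n_sites: int = 29, n_sets: int = 300,
                     seed: int = 0) -> List[List[int]]:
    """Draw n_sets sets of n_sites uniformly random positions in the region."""
    rng = random.Random(seed)
    return [sorted(rng.randrange(region_start, region_end) for _ in range(n_sites))
            for _ in range(n_sets)]


# Hypothetical positions, for illustration only
ghps = [100, 5_000, 90_000]
tss_of_response_genes = sorted([150, 60_000])
observed = [nearest_distance(g, tss_of_response_genes) for g in ghps]
print(observed)  # [50, 4850, 30000]
```

Plotting `cumulative_fraction` over a range of cutoffs for the observed distances and for each random set gives curves of the kind compared in this analysis.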

Enhancer assays by transient transfection

The enhancer assays were the same as described previously (Wang et al 2007), with the following modifications. Briefly, the genomic regions of the GATA-1 occupancy sites (average size about 1 kb) were amplified from mouse genomic DNA and inserted into a luciferase reporter gene driven by the HBG1 gene promoter. The plasmid DNAs were transiently transfected into K562 cells in a 48-well format using 0.4 µg of plasmid containing the firefly luciferase reporter and 0.0001 µg of cotransfection control plasmid expressing Renilla luciferase per well in OptiMEM (Invitrogen), with 0.4 µl of PLUS Reagent (Invitrogen) and 0.6 µl of Lipofectamine LTX per well. K562 cells were plated at 8 × 10^4 cells per well, and plasmids were tested in quadruplicate. Each plasmid was tested at least twice. Two days after transfection, cell extracts were subjected to a dual luciferase assay following the manufacturer’s protocol (Promega). For each of the triplicate samples, the firefly luciferase activity of the test plasmid (divided by the Renilla luciferase activity of the cotransfection control) was normalized by the firefly luciferase activity from the parental MCSluc (divided by the Renilla luciferase activity of the cotransfection control) to obtain a fold change.
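The normalization amounts to a ratio of ratios, which the following sketch makes explicit. The function name and the luminometer readings are hypothetical; how replicate wells were averaged is not stated in the text and is left out.

```python
def fold_change(test_firefly: float, test_renilla: float,
                parent_firefly: float, parent_renilla: float) -> float:
    """Renilla-normalized firefly signal of the test plasmid relative to the parental plasmid."""
    test_ratio = test_firefly / test_renilla
    parent_ratio = parent_firefly / parent_renilla
    return test_ratio / parent_ratio


# Hypothetical readings from one well of each plasmid
print(fold_change(600.0, 2.0, 100.0, 1.0))  # 3.0
```

A value above 1 indicates that the inserted segment increased reporter expression relative to the parental plasmid in that well.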

Overlap of GHPs with candidate promoters

Yong: RefGene and KnownGene tracks were downloaded from the UCSC browser. For genes on the plus strand, the TSS is defined by “txStart”. For genes on the minus strand, the TSS is defined by “txEnd”. The CAGE data were downloaded from the Fantom website and the UCSC genome browser and lifted over from mm5 to mm8. The genome interval operations were conducted using Galaxy tools from http://main.g2.bx.psu.edu/ .

Yuepin: The analysis was done for the interval chr7:60000000-140000000 (mm8).

1. Definition of Transcription Start Site (TSS)

Three sources of TSS datasets:

1) Defined by RefGene TSS. Download the RefGene track from the UCSC browser; in total, 830 genes lie in this interval. Get the 5 kb interval centered on the TSS. For genes on the plus strand, the TSS is defined by “txStart”. For genes on the minus strand, the TSS is defined by “txEnd”.

2) Defined by KnownGene TSS. Download the KnownGene track from the UCSC browser; in total, 1170 genes lie in this interval. Get the 5 kb interval centered on the TSS. For genes on the plus strand, the TSS is defined by “txStart”. For genes on the minus strand, the TSS is defined by “txEnd”.

3) Defined by CAGE tags from the RIKEN group. Two CAGE data sources were used:

a. From the Fantom website. Download the CAGE tag data "cage.rep_tag.chr7.gff.gz" (based on mm5) from Fantom: http://gerg01.gsc.riken.jp/cage/download/mm5/chromosomes/ Lift them over from mm5 to mm8, and extend to 5 kb intervals. In total, 67,376 intervals were successfully lifted over within chr7:60000000-140000000 (mm8).

b. From the UCSC genome track. Download the "Riken CAGE TC" track from the UCSC mm5 assembly. Lift them over from mm5 to mm8, and extend to 5 kb intervals. In total, 20,431 intervals were successfully lifted over within chr7:60000000-140000000 (mm8).
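The strand-dependent TSS definition and the 5 kb windows shared by these sources can be sketched as follows. This is an illustration only: the function names are hypothetical, the window is taken as 2,500 bp on each side of the center, and the overlap test assumes half-open coordinates on one chromosome.

```python
from typing import Tuple


def tss(tx_start: int, tx_end: int, strand: str) -> int:
    """TSS is txStart for plus-strand genes and txEnd for minus-strand genes."""
    if strand == "+":
        return tx_start
    if strand == "-":
        return tx_end
    raise ValueError(f"unexpected strand: {strand!r}")


def centered_window(center: int, width: int = 5000) -> Tuple[int, int]:
    """Interval of the given width centered on a position (clipped at zero)."""
    half = width // 2
    return max(0, center - half), center + half


def overlaps(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    return a[0] < b[1] and b[0] < a[1]


# Hypothetical minus-strand gene and GHP, for illustration only
window = centered_window(tss(100_000, 120_000, "-"))
print(window)                                # (117500, 122500)
print(overlaps(window, (122_000, 122_600)))  # True
```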

Analysis of over-representation used programs in the R statistical package.


Figure Legends

Fig1 GATA-1 ChIP-chip data in Hbb gene cluster

The top line shows the location of genes and DNase hypersensitive sites in the Hbb gene cluster. In the middle panel, the first two tracks are the logarithmic ratio of hybridization intensities between ChIP DNA from the rescued G1E-ER4 cell line and the input DNA. The third track is the hybridization signal from the G1E Gata1-null cells. The next track shows the high-stringency and low-stringency peaks identified from the ChIP-chip data. The bottom panel shows the quantitative PCR results for the previously identified sites occupied by GATA-1. The two bars are the qPCR results with ChIP material from G1E cells and from rescued G1E-ER4 cells. The mean of two determinations is plotted, and the error bars are half of the range.

Fig2 Independent validation of the GATA-1 ChIP-chip result by qPCR

Panel a illustrates GHPs with high, low and no occupancy by GATA-1. Amplicons for each GHP were assayed in GATA1-ER ChIP material from G1E knockout cells (first bar in each set) and from G1E-ER4 cells before and after induction of the GATA-1-ER hybrid with estradiol (second and third bars). The mean of the two determinations is graphed, with half the range shown as error bars. Relative enrichment is the ratio between the amount of a given DNA sequence pulled down by GATA-1 and the amount of the same DNA sequence in the input material. Line A is drawn at the mean relative enrichment of the negative controls plus three standard deviations.

Panel b shows the relative enrichment in ChIP material from induced G1E-ER4 cells for the 81 high-stringency hits tested by qPCR. The black bars are the GHPs that not only exceed the mean plus three standard deviations of the negative control set but also show at least a four-fold increase in enrichment compared to the signals from the Gata1-null cells. The grey bars are the GHPs that did not pass one or both of these thresholds. Line A is the same as in the top panel. The inset shows the qPCR validation results from Kim et al (2007 Cell) for CTCF binding sites, and line B is the threshold in that experiment, based on the mean of the negative controls plus three standard deviations.

Fig. Conservation

Conservation of the GATA-1 binding site motif in occupied sites. The classification of the 63 GATA-1 occupied sites by phylogenetic conservation of sequence (a) and WGATAR motif (b), and (c) the number of motifs conserved beyond rodents.  The grey bars show the number of GHPs (out of 63) in each group, the asterisked number in italics shows the number relative to the average of the same classification found in 1000 random samples, shown as box distributions; box line is median, box width is interquartile range, whiskers extend to the most extreme data point that is no more than 1.5 times the interquartile range.  Conservation of the binding site is much greater (b) and more frequent (c) than conservation of sequence (a). {I don’t understand the conclusion. - rh}

Fig.5 Distribution of GATA-1 occupied sites along 66 Mb of mouse chromosome 7

This figure shows the tracks of different features in the 66 Mb ChIP-chip region. In the expression change track, blue marks genes whose expression decreased and red marks genes whose expression increased.


Fig.3.b Relationship between GHP enhancer activity and the number of WGATARs within GHPs

Panel a shows that in the GHPs that do not contain conserved WGATARs, there is almost no correlation between the number of WGATAR motifs and the enhancer activity. The bar plot in the panel shows the distribution of the number of WGATAR motifs within those GHPs, and the box plot in each bin shows the enhancer activity of GHPs that have the same number of WGATAR motifs. The y-axis on the left side is the count of GHPs with a certain number of WGATAR motifs. The y-axis on the right side is the enhancer activity of GHPs, expressed as the fold change compared to the parental plasmid.

Panel b shows that the enhancer activity of the GHPs increases with the number of conserved WGATAR motifs. Of the GHPs that contain at least one conserved WGATAR motif, the majority show at least two-fold enhancement in enhancer activity. In contrast, most of the GHPs that do not contain conserved WGATAR motifs did not pass the 2-fold threshold in the enhancer assay. The bar plot here shows the distribution of conserved WGATAR motifs, and the box plot is as described for panel a. Both y-axes are the same as in panel a.

Fig.7 GHPs tend to be closer to the Transcription Start Sites of the GATA-1 responsive genes

Panel a shows the comparison of the distances between GHPs and GATA-1 responsive genes with those between GHPs and random sites, within 500 kb. The solid curve is the cumulative curve of the distances between GHPs and the nearest transcription start sites of the GATA-1 responsive genes. The dotted lines are the cumulative curves for GHPs and the nearest random sites. 300 random sets, each containing 29 random regions, are shown here. The x-axis is the distance between GHPs and the nearest target regions (GATA-1 responsive genes or the random sites); the y-axis is the percentage of GHPs whose distance to the nearest target region is below the length shown on the x-axis. The inset is the same comparison for the whole region.

Panel b shows the comparison of the distances between GHPs and GATA-1 responsive genes with those between GHPs and the known genes. The solid curve is the cumulative curve of the distances between GHPs and the nearest transcription start sites of the GATA-1 responsive genes. The dotted lines are the cumulative curves for GHPs and the nearest transcription start sites of 29 randomly chosen known genes. 300 randomly chosen known-gene sets are shown here. The x- and y-axes are as in panel a, and the inset is the comparison for the whole region.

Panel c shows the comparison of the distances between GHPs and known genes with those between GHPs and random sites. The solid curve is the cumulative curve for GHPs and the nearest transcription start sites of the known genes. The dotted lines are the cumulative curves for GHPs and random sites. 300 random sets, each containing 1031 random sites, are shown here. The x- and y-axes are as described above, and again the inset is the same graph extended to the whole region.


Fig.7 Activities of GHPs in enhancer assays

Panel a shows the distribution of all measurements in the enhancer assay. The x-axis is the fold change in the enhancer assay. The y-axis is the count of measurements that have a given fold change. Panel b. The box plot shows the distribution of enhancer activity for GHPs with conserved WGATAR motifs and for those without conserved WGATAR motifs. The total numbers of GHPs in each group are labeled under the panel. Panel c. GHPs are divided into three groups according to their enhancer activity (less than 2 fold, greater than 2 but less than 3 fold, greater than 3 fold). The black bars represent the percentage of total GHPs in each group. The grey bars show the ratio of the GHPs with conserved WGATAR motifs to all the GHPs in each group.
