Other applications of second-generation sequencing

Source: web1.sph.emory.edu/users/hwu30/teaching/bioc/otherSeqs.pdf

Review

• We have covered for second-generation sequencing:
– Overview of the technologies.
– Data and statistical issues.
– RNA-seq, ChIP-seq and their analysis strategies.

• Today we will introduce some other applications of sequencing, mainly:
– For DNA methylation: bisulfite sequencing (BS-seq).
– Hi-C for 3-dimensional chromatin structures.

Epigenetics

Heritable mechanisms that control gene expression without being encoded in the DNA sequence. Examples: DNA methylation, histone modifications.

(Image source: Wikipedia)

DNA methylation

• An epigenetic modification of the DNA sequence.
• Involves adding a methyl group to cytosine.
• Primarily happens at CpG sites (where a C is immediately followed by a G), although non-CG methylation exists.
• Mostly detected in higher organisms:
– In the human genome, most CpG sites are fully methylated (over 90%), except at CpG islands, where the methylation level is minimal.
– Methylation is detected in some plants, insects and bacteria, but the levels are low.


http://www.delawareneuroscience.org/images/Investigator/DNA%20methylation_small.jpg

http://www.bio.miami.edu/dana/pix/cytosine.bmp

Function of DNA methylation

• Important in gene regulation: methylation at the TSS suppresses gene expression.
• Plays a crucial role in development and differentiation: helps cells establish identity.
• Believed to interact with environmental exposures, so it is used to explain GxE (gene-by-environment) interactions.
• Often referred to as the “5th base”.
• Recent research has found different types of methylation, e.g., hydroxymethylation.

DNA methylation regulates gene expression

http://www.spandidos-publications.com/article_images/or/31/2/OR-31-02-0523-g00.jpg

Detecting DNA methylation

• Capture based: MeDIP-seq (methylated DNA immunoprecipitation followed by sequencing).
– Same as ChIP-seq, but uses an antibody against methylated DNA.
– Analysis methods are the same as for ChIP-seq.
– Resolution is low: can only roughly quantify the amount of DNA methylation within a few hundred bp.

• Bisulfite sequencing (BS-seq): bisulfite conversion of DNA followed by sequencing.
– Base-pair resolution: measures the methylation status of each individual cytosine.

Bisulfite sequencing

• Technology in a nutshell:
– First treat the DNA with bisulfite. As a result:
  • Unmethylated C is converted to U, which is read as T after amplification.
  • Methylated C is protected and remains C.
  • Other bases are unchanged.
– Amplify, then sequence the treated DNA segments.
• The C-to-T mismatches at a C position measure the methylation level: the fewer reads showing T, the more methylated the site.
• Raw data: sequence reads that no longer match the reference genome exactly.
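The conversion rule above can be illustrated with a small plain-Python sketch. It is only a toy model under stated assumptions: the reference fragment and the set of methylated positions are hypothetical, only the forward strand is simulated, and conversion is assumed to be complete and error-free.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on a forward-strand sequence.

    Unmethylated C is read as T; methylated C stays C; other bases are unchanged.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(read, reference):
    """Map each reference C position to True (read shows C) or False (read shows T)."""
    return {
        i: read_base == "C"
        for i, (read_base, ref_base) in enumerate(zip(read, reference))
        if ref_base == "C" and read_base in "CT"
    }


reference = "ATACTTCGATG"   # hypothetical fragment with C at positions 3 and 6
methylated = {6}            # assumption: only the CpG cytosine at position 6 is methylated
read = bisulfite_convert(reference, methylated)
print(read)                               # ATATTTCGATG
print(call_methylation(read, reference))  # {3: False, 6: True}
```

Comparing the converted read with the reference at each C recovers the methylation status, which is why the read no longer matches the reference exactly.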

Bisulfite Sequencing

http://www.ecseq.com/services/EPIseq.html

Alignment of BS-seq

• Reads from BS-seq cannot be aligned directly to the reference genome.
– There are four different strands after bisulfite treatment and PCR.
– A T in a read could align to either a T or a C in the reference.
– The search space for alignment is bigger.

Excerpt from BMC Bioinformatics 2009, 10:232 (http://www.biomedcentral.com/1471-2105/10/232):

... and is still lacking in current short read alignment software.

A common approach to overcome these issues is to convert all Cs to Ts and map the converted reads to the converted reference; then, the alignment results are post-processed to count false-positive bisulfite C/T alignments as mismatches, where a C in the BS-read is aligned to a T in the reference [2]. Although this all-inclusive C/T conversion is effective for reads derived from the C-poor strands, it is not appropriate for reads derived from the G-poor strands, where all the Cs are actually transcribed from Gs by PCR amplification and thus could not be converted to Ts during bisulfite treatment. During shotgun sequencing, however, a bisulfite read is almost equally likely to be derived from either the C-poor or the G-poor strands. There is no precise way to determine the original strand a bisulfite read is derived from. Furthermore, by ignoring the C/T mapping asymmetry, this strategy generates a large number of false-positive bisulfite mappings and greatly increases the computational load in a quadratic manner with an increase in the size of the reference sequence. In order to accurately extract the true bisulfite mappings in the post-processing stage, all mapping locations have to be recorded, even the non-unique mappings. Therefore, this approach is only practical for small reference sequences, where only the C-poor strands are sequenced. For example, Meissner et al. used this mapping strategy for reduced representation bisulfite sequencing (RRBS) [2], where the genomic DNA was digested by the MspI restriction enzyme and 40–220 bp segments were selected for sequencing. The reference sequence (~27 M nt) is only about 1% of the whole mouse genome, covering 4.8% of the total CpG dinucleotides.

Figure 2. Mapping of bisulfite reads. 1) Increased search space due to the cytosine-thymine conversion in the bisulfite treatment. 2) Mapping asymmetry: thymines in bisulfite reads can be aligned with cytosines in the reference (illustrated in blue) but not the reverse.

[Figure not reproduced. Panel 1, "Multiple Mapping": the bisulfite read ATTTCG maps to three locations in the reference ATACTTCGATGATCTCGCAAGACTCCGGC. Panel 2, "Mapping Asymmetry": a T in the bisulfite read may align to a C in the reference, but a C in the read may not align to a T in the reference.]

Alignment strategy

• Use existing alignment software (e.g., bowtie) as is:
– Problem: C-T mismatches prevent some reads from being aligned.

• Naïve method: convert all C's to T's in both the reference and the reads, then align.
– Problem: creates spurious alignments, e.g., a C in a read matched to a T in the reference.

• Better ideas:
– Consider the methylation status during alignment: create multiple versions of the reference “seed” (there will be four sets of references at each location containing a C).

• Clever implementations are needed.
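The naïve convert-then-align idea, plus the post-filter for mapping asymmetry, can be sketched in a few lines of plain Python. This is an illustration only, not how any real aligner is implemented: it uses exact string matching, handles only reads from the forward (C-poor) strand, and the function names are made up for this example. The read and reference are the ones shown in the figure from the BMC Bioinformatics excerpt.

```python
def c_to_t(seq):
    """In-silico conversion: replace every C with T."""
    return seq.replace("C", "T")


def bisulfite_hits(read, reference):
    """Return 0-based start positions where the read maps to the reference.

    Step 1: exact search of the C->T converted read in the C->T converted reference.
    Step 2: keep a hit only if every mismatch is a read T over a reference C;
            a read C over a reference T is rejected (mapping asymmetry).
    """
    conv_read, conv_ref = c_to_t(read), c_to_t(reference)
    hits = []
    start = conv_ref.find(conv_read)
    while start != -1:
        window = reference[start:start + len(read)]
        if all(r == g or (r == "T" and g == "C") for r, g in zip(read, window)):
            hits.append(start)
        start = conv_ref.find(conv_read, start + 1)
    return hits


reference = "ATACTTCGATGATCTCGCAAGACTCCGGC"
print(bisulfite_hits("ATTTCG", reference))  # [2, 11, 21]: larger search space, three mappings
print(bisulfite_hits("ACTTCG", reference))  # [2, 21]: position 11 would pair a read C with a reference T
```

The first call shows the multiple-mapping problem; the second shows why converted alignments have to be checked against the unconverted sequences afterwards.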

Alignment tools

• See a list of available BS-seq aligners at http://www.mi.fu-berlin.de/w/ABI/ExistingBisulfiteMappers.

• Performance-wise, they are usually slower than standard aligners: on the order of a few hundred reads per second.

Data after alignments

• Special software is needed to process the alignment file.
• At each C position, report the total number of reads covering that site, and the number of reads with T:

chr    position   total reads   reads with T
chr1   3010874    22            18
chr1   3010894    31            27
chr1   3010922    12            10
chr1   3010957    7             6
chr1   3010971    6             6
chr1   3011025    7             5

• These are usually the inputs for downstream BS-seq analysis.

BS-seq data analysis

• Compared with ChIP-seq and RNA-seq, still in a relatively early stage.

• Questions include:
– Single dataset analysis:
  • Segment genome according to methylation status.
– Comparison of multiple datasets:
  • Differential methylation (DM) analysis.

Single BS-seq dataset analysis

• Detecting methylated loci/regions:
– Estimate the “methylation density” (the percentage of cells that are methylated) at each C position, which is simply #methylated/#total at each CpG site, but:
  • Background error rates need to be considered.
  • Spatial correlation among nearby CpG sites can be used to improve estimation.
– Methylated regions (or states) can be determined by smoothing-based methods (e.g., moving average, HMM) using the estimated percentages as input.
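As a rough illustration of the two steps above, the following plain-Python sketch computes the per-site proportion and a simple moving average that pools counts from neighbouring sites. The counts are hypothetical, the window is defined by site index rather than genomic distance, and background error rates are ignored, so this only shows the idea, not any published method.

```python
def methylation_levels(counts):
    """Per-site estimate #methylated/#total; None where a site has no coverage."""
    return [m / n if n > 0 else None for m, n in counts]


def moving_average(counts, half_window=1):
    """Pool counts over neighbouring sites (by index), then take the ratio.

    Pooling counts is a coverage-weighted average, so a site covered by
    one or two reads borrows strength from its better-covered neighbours.
    """
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half_window)
        hi = min(len(counts), i + half_window + 1)
        meth = sum(m for m, _ in counts[lo:hi])
        total = sum(n for _, n in counts[lo:hi])
        smoothed.append(meth / total if total > 0 else None)
    return smoothed


# Hypothetical (methylated, total) read counts at five consecutive CpG sites.
counts = [(18, 20), (1, 2), (27, 30), (0, 1), (9, 10)]
print(methylation_levels(counts))  # [0.9, 0.5, 0.9, 0.0, 0.9]
print([round(x, 3) for x in moving_average(counts)])
```

The raw estimates at the two low-coverage sites (0.5 and 0.0) are pulled towards their neighbours after pooling, which is the motivation for exploiting spatial correlation.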

An HMM approach

• Stadler et al. (2011) Nature:
– Use the estimated percentages as input to fit a 3-state HMM: fully methylated regions (FMR), low-methylated regions (LMR) and unmethylated regions (UMR).

Excerpt from the paper:

... DNase-I-hypersensitive sites (DHS), a unique chromatin state that depends on DNA-binding factors [10–12]. In fact, at least 80% of LMRs and 90% of UMRs overlap with DHS (Fig. 2 and Supplementary Fig. 2). LMRs are unlikely novel promoters as we find only weak signal for RNA polymerase II (Fig. 2 and Supplementary Fig. 3) and no RNA signal above what we observe at methylated regions even when using a strand-specific protocol that does not require polyadenylation (Supplementary Fig. 3). Next, we explored if LMRs could represent distal regulatory regions, such as enhancers. Indeed, LMRs are strongly enriched for chromatin features such as high H3K4 monomethylation (H3K4me1) signal relative to H3K4 trimethylation (H3K4me3) and the presence of p300 histone acetyltransferase, which are predictive features of enhancers [13] (Fig. 2). This indicates that a subset of LMRs are enhancers that, in light of the absence of H3K27me3 and the presence of H3K27ac, are presumably active [14] (Fig. 2b). Transgenic assays further show that individual LMRs increase the activity of a linked promoter and experimentally function as enhancers (Supplementary Fig. 4). We thus conclude that many LMRs, identified solely by their DNA methylation pattern, represent active regulatory regions.

To investigate LMR features further, we combined newly generated and published data sets for several DNA-binding factors and additional histone modifications (Supplementary Table 1, Fig. 2b and Supplementary Figs 5 and 6). LMRs and UMRs are depleted for the heterochromatic histone modification H3K9me2 in agreement with the absence of this mark at active chromatin [6]. Most DNA-binding factors show enrichment not only at UMRs, which are mostly promoters, but also at LMRs. Factors enriched at LMRs in stem cells include pluripotency transcription factors such as Nanog, Oct4 and Klf4, but also structural DNA-binding factors such as the insulator protein CTCF [15] and members of the cohesin complex (Fig. 2b and Supplementary Fig. 5), both of which bind promoters and distal regulatory regions [16]. Notably, not all factors occupy distal and proximal regulatory regions with equal preferences. Smad1 binds to neither LMRs nor UMRs, whereas some bind primarily at UMRs, such as KDM2A and Zfx, and others such as Nanog and Esrrb show higher enrichment at LMRs (Fig. 2b and Supplementary Fig. 5). In summary, several lines of evidence including genomic position, conservation, chromatin state, regulatory activity and transcription factor occupancy support the hypothesis that LMRs are indeed active distal regulatory regions.

Interestingly LMRs show a strong presence of 5-hydroxymethylcytosine (5hmC), consistent with recent reports of 5hmC presence at enhancer regions [17–19]. One candidate protein responsible for catalysing 5hmC, Tet1 (refs 20, 21), is enriched at both UMRs and LMRs (Fig. 2b).

To ask if LMRs are also present in other mammals we performed HMM segmentation of a human stem cell methylome [3], which also identifies LMRs with similar features, indicating that these are a general characteristic of mammalian methylomes (Supplementary Fig. 7).

Transcription factor binding creates LMRs

To determine how LMRs are formed, we investigated the DNA-binding protein CTCF, which binds to regulatory regions including promoters, enhancers and insulators [22,23]. We determined the genome-wide binding of CTCF by chromatin immunoprecipitation followed by sequencing (ChIP-seq) (Supplementary Fig. 8), revealing high occupancy at both UMRs and LMRs (Fig. 2b and Supplementary Fig. 5). A composite view of DNA methylation shows an average methylation of 20% at CTCF binding sites with increasing methylation adjacent to it (Supplementary Fig. 9), in line with a previous report in primates [24]. If reduced methylation is a general feature of CTCF-occupied sites, inclusion of DNA methylation data should improve prediction of CTCF binding.

Figure 2 | General features of LMRs. Composite profiles 3 kb around segment midpoints. a, Evolutionary conservation based on multi-species alignments (upper left). Enrichment of DNase I tags (lower left). Chromatin features that predict enhancer function are enriched at LMRs (middle and right). b, Heat map of methylation levels, histone modifications and protein binding (H3K4me1 signal rescaled for visibility).

[Figure not reproduced. Panel a: mean conservation and enrichment (log2) for DNase I, H3K4me3, Pol II, H3K4me1 and p300, plotted against position around segment middle (kb) for FMR, UMR and LMR. Panel b: heat map with columns FMR, UMR, LMR and rows Tet1, 5hmC.GLIB, 5hmC.CMS, Smad1, STAT3, n-Myc, Zfx, KDM2A, E2f1, Esrrb, Klf4, Nanog, Oct4, Smc3, Smc1, Nipbl, CTCF, H3K27ac, H3K27me3, H3K9me2, p300, Pol II, H3K4me3, H3K4me2, H3K4me1, DNase I, Methylation.]

Figure 1 | Features of the mouse ES cell methylome. a, Distribution of CpG methylation frequency for all CpGs with at least tenfold coverage. Of all cytosines, 4.1% show intermediate methylation levels. b, Representative genomic region. Computational segmentation identifies UMRs (blue pentagons), LMRs (red triangles) and FMRs (unmarked). Each dot represents one CpG (CpG islands marked in green). Included is an independently verified LMR upstream of Tbx3. Mbp, million base pairs. c, Composite profile of CpG methylation for all three groups. kb, kilobases. d, Distances to TSS. e, f, Distribution of all three classes among genome features. e, A small percentage of LMRs overlap with CpG islands. Numbers indicate observed percentage of overlaps per group (expected percentage in parentheses). f, Distribution of the regions throughout the genome.

[Figure not reproduced. Recoverable labels: panel a, fraction of CpGs by methylation (%) in 10% bins, annotated 6.5%, 4.1% and 89.4%; panel b, methylation (%) along chr5 (120 to 120.15 Mbp) with tracks for CGI, genes (Tbx3) and LMR; panel c, methylation (%) against position around middle (kb); panel d, density against distance to TSS (log2 nt); panel e, CpG islands (n = 15,974); panel f, genome features (promoter, exon, intron, repeat, intergenic) for FMR (2,485.0 Mbp), UMR (27.9 Mbp) and LMR (12.0 Mbp).]

Nature, Vol 480, 22/29 December 2011, p. 491. ©2012 Macmillan Publishers Limited. All rights reserved.

Smoothing method

• One can smooth the percentages directly, but that ignores the uncertainty in the percentage estimates.

• A better approach: the BSmooth model (Hansen et al. 2012, Genome Biology).
– Assumes the true methylation level is a smooth function of the genomic coordinate.
– The observed counts follow a binomial distribution.

BSmooth smoothing

• Notation at position j:
– N_j, M_j: total/methylated reads.
– π_j: underlying true methylation level.
– l_j: location.

• Model:

$$M_j \sim \mathrm{Bin}(N_j, \pi_j)$$

$$\log\left(\frac{\pi_j}{1-\pi_j}\right) = \beta_0 + \beta_1 l_j + \beta_2 l_j^2$$

• Fitting: weighted glm in each 2kb window, where the weights depend on the variances of the estimated π_j.

BSmooth Bioconductor package: bsseq

• Mainly provides functions for smoothing and some visualization.

• Supports parallel computing to speed up the calculation.

library(bsseq)
# Toy example: 3 loci x 3 samples of methylated counts (M) and coverage (Cov)
M <- matrix(0:8, 3, 3)
Cov <- matrix(1:9, 3, 3)
BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1, 2, 3),
             M = M, Cov = Cov, sampleNames = c("A", "B", "C"))
BS1 <- BSmooth(BS1)

Differential methylation analysis

• Comparing methylation profiles under different biological conditions is of great interest.
– The results of such an analysis are differentially methylated loci (DML) or regions (DMR).

• Strategy to detect DML:
– Hypothesis testing at each CpG site.

• Strategy to detect DMR:
– Need to combine data from nearby CpG sites because of the spatial correlation.

DML detection based on 2x2 table

• At each CpG site, summarize the counts from two samples into a 2x2 table:

           Total   Methylated
Sample 1   40      2
Sample 2   25      19

• Chi-square or Fisher's exact test can be applied.
• bsseq has the function fisherTests for this:

fisherTests(BSobj, group1, group2)
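To make the test concrete, here is a minimal plain-Python sketch of a two-sided Fisher's exact test applied to the example table (2 of 40 reads methylated in sample 1, 19 of 25 in sample 2). It is a hand-rolled illustration with a made-up function name, not the code used by bsseq; the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(meth1, total1, meth2, total2):
    """Two-sided Fisher's exact test for methylated vs. unmethylated reads in two samples."""
    k = meth1 + meth2    # methylated reads over both samples
    n = total1 + total2  # all reads over both samples

    def prob(x):
        # Hypergeometric probability that sample 1 holds x of the k methylated reads.
        return comb(k, x) * comb(n - k, total1 - x) / comb(n, total1)

    lo = max(0, total1 - (n - k))
    hi = min(k, total1)
    p_obs = prob(meth1)
    # Sum over tables at least as extreme; the small factor guards against float ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


p_value = fisher_exact_two_sided(2, 40, 19, 25)
print(p_value)  # a very small p-value: strong evidence of differential methylation at this site
```

Because the test treats each read as an independent draw, it cannot account for biological variation between replicates, which motivates the Wald-test approaches on the following slides.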

Wald-test based

• Can handle data with replicates.
• The key is to estimate within-group variances.
• BSmooth approach (for two-group comparison):
– Denote the group assignment for the i-th sample by X_i.
– The numbers of replicates in the two groups are n_1 and n_2.
– Put the estimated values of π_ij into a two-group testing framework:

$$\pi_{ij} = a(l_j) + b(l_j) X_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma_j^2)$$

– Use a SAM-like method to estimate σ_j^2, then do a Wald test.

Shrinkage-based method (Feng et al. 2014, NAR)

• As in RNA-seq DE analysis, the BS-seq data can be modeled with a beta-binomial distribution.

• The beta distribution is parameterized by mean and dispersion, and a log-normal prior is imposed on the dispersions.

• A Wald test procedure can be derived.

Excerpt from the manuscript:

Materials and methods

The Bayesian hierarchical model

To characterize the data, we propose the following Bayesian hierarchical model, based on the beta-binomial distribution. Notation for our model is as follows: at the ith CpG site, jth group and kth replicate, $X_{ijk}$ is the number of reads that show methylation, $N_{ijk}$ is the total number of reads that cover this position, and $p_{ijk}$ is the underlying “true” methylation proportion.

$$X_{ijk} \mid p_{ijk}, N_{ijk} \sim \mathrm{Binomial}(N_{ijk}, p_{ijk})$$

$$p_{ijk} \sim \mathrm{Beta}(\mu_{ij}, \phi_{ij})$$

Since the process of sequencing is a random sampling process from a statistical perspective, the model assumes that $X_{ijk} \mid p_{ijk}, N_{ijk}$ follows a binomial distribution. Since the true methylation proportions among replicates can be anywhere between 0 and 1, it is a natural choice to assume that the proportions follow a beta distribution, as it is the most flexible distribution with support on the interval between 0 and 1 and is applicable to a wide variety of disciplines. Here the beta distribution is parameterized by mean (denoted by $\mu_{ij}$) and dispersion (denoted by $\phi_{ij}$). Compared with the traditional parameterization of the Beta($\alpha$, $\beta$) distribution, the parameters have the following relationship:

$$\mu = \frac{\alpha}{\alpha + \beta}, \qquad \phi = \frac{1}{\alpha + \beta + 1}$$

Here, the biological variation among replicates is captured by the beta distribution and the variation due to the random sampling of DNA segments during sequencing is captured by the binomial distribution. The dispersion parameter $\phi_{ij}$ captures the variation of a CpG site's methylation proportion relative to the group mean. Each CpG site within a single condition (e.g. within cases, or controls) is assumed to have its own dispersion. It is a flexible assumption because it allows either different or common dispersions for both conditions.

To combine information across all CpG sites, based on the observed distribution of dispersion from a publicly available RRBS dataset on mouse embryogenesis (27), we assumed the following prior on $\phi_{ij}$:

$$\phi_{ij} \sim \text{log-normal}(m_{0j}, r_{0j}^2)$$

where $m_{0j}$ and $r_{0j}^2$ are mean and variance parameters that can be estimated from the data. For each CpG site in this dataset, we applied a MOM estimator to estimate the dispersion parameters. As shown in Figure 6, the genome-wide distribution of logarithm dispersion parameter estimates is approximately Gaussian with mean = -3.39 and SD = 1.08, suggesting that the dispersion parameters can be well described by a log-normal distribution. Notably, simulations in which the dispersions come from different distributions show that our proposed method is robust against violation of the log-normal assumption (Supplementary Figure 2).
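The mean/dispersion parameterization can be made concrete with a short plain-Python sketch. It converts (μ, φ) back to the usual Beta(α, β) shape parameters using the relationship μ = α/(α+β), φ = 1/(α+β+1) quoted above, then simulates replicate counts from the beta-binomial hierarchy. The function names, parameter values and coverages are hypothetical; this is not the DSS implementation.

```python
import random


def beta_shape_params(mu, phi):
    """Convert mean/dispersion (mu, phi) to Beta shape parameters (alpha, beta).

    Inverts mu = alpha / (alpha + beta) and phi = 1 / (alpha + beta + 1).
    """
    if not (0 < mu < 1 and 0 < phi < 1):
        raise ValueError("mu and phi must lie strictly between 0 and 1")
    s = 1.0 / phi - 1.0  # s = alpha + beta
    return mu * s, (1.0 - mu) * s


def simulate_methylated_counts(mu, phi, coverages, rng):
    """Draw one methylated-read count per replicate from the beta-binomial model."""
    alpha, beta = beta_shape_params(mu, phi)
    counts = []
    for n in coverages:
        p = rng.betavariate(alpha, beta)                       # biological variation
        counts.append(sum(rng.random() < p for _ in range(n)))  # sequencing sampling
    return counts


alpha, beta = beta_shape_params(mu=0.8, phi=0.05)
print(alpha, beta)  # approximately 15.2 and 3.8
print(simulate_methylated_counts(0.8, 0.05, [20, 35, 12], random.Random(1)))
```

Under this parameterization the variance of the true proportion is μ(1-μ)φ, so a larger φ means replicates at the same site disagree more than binomial sampling alone would explain.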


Simulation results

• The Wald test with shrunk dispersion performs favorably compared with other methods.

[Figure not reproduced: two panels plotting the percentage of top-ranked CpG sites that are true DM (y-axis, 30 to 90) against the number of top-ranked CpG sites (x-axis, 200 to 1000) for five methods: 1 = t-test, 2 = Fisher, 3 = Adj. Chisq, 4 = Wald test, naive dispersion, 5 = Wald test, shrunk dispersion.]

Things to consider in DMR calling

• Coverage depth:
– Should one filter out sites with shallow coverage?

• With biological replicates:
– CpG-specific biological variances.
– Small-sample estimates of the variance.

• Spatial correlation of methylation levels among nearby CpG sites:
– Is smoothing appropriate?
– What if the data have low spatial correlation, as with 5hmC?

Existing methods for DML/DMR detection

• BSmooth (Hansen et al. 2012, GB):
– Smoothing, then take the smoothed values and run a two-group t-test.

• methylKit (Akalin et al. 2012, GB):
– Logistic regression or Fisher's exact test.
– Recently implemented the DSS Wald test approach.

• BiSeq (Hebestreit et al. 2013, Bioinformatics):
– Smoothing, then take the smoothed values and run a beta GLM.

• DSS (Feng et al. 2014, NAR):
– Based on a beta-binomial model. Empirical Bayes estimates of dispersions, and Wald test.
– Spatial correlations are ignored.

• MOABS (Sun et al. 2014, GB):
– Based on a beta-binomial model to define CDIF, the lower bound of the CI for the methylation difference between two groups.
– Spatial correlations are ignored.

• methylSig (Park et al. 2014, Bioinformatics):
– Based on a beta-binomial model. MLE-based method to estimate dispersion.
– Likelihood ratio test.

• DSS-single (Wu et al. 2015, NAR):
– Works for data without replicates; uses nearby CpG sites as “pseudo-replicates”.

• RADMeth (Dolzhenko et al. 2014, BMC Bioinformatics):
– Based on a beta-binomial GLM; works for multiple-factor designs.

Useful bioc packages - bsseq

• First create BSseq objects.
• Use the BSmooth function to smooth.
• fisherTests performs Fisher's exact test when there are no replicates.
• BSmooth.tstat performs t-tests with replicates.
• dmrFinder calls DMRs based on BSmooth.tstat results.

library(bsseq)
BSobj <- BSmooth(BSobj)
# Fisher's exact test at each CpG (intended for data without replicates)
dmlTest <- fisherTests(BSobj, group1 = c("C1", "C2", "C3"),
                       group2 = c("N1", "N2", "N3"))
# With replicates: per-CpG t-statistics, which dmrFinder takes as input
tstat <- BSmooth.tstat(BSobj, group1 = c("C1", "C2", "C3"),
                       group2 = c("N1", "N2", "N3"))
dmr <- dmrFinder(tstat)

Useful bioc packages - DSS

• Input data has the same format as bsseq.
• DMLtest performs a Wald test at each CpG.
• callDML/callDMR calls DMLs or DMRs.
• More options in DML/DMR calling.

library(DSS)
dmlTest <- DMLtest(BSobj, group1 = c("C1", "C2", "C3"),
                   group2 = c("N1", "N2", "N3"),
                   smoothing = TRUE, smoothing.span = 500)
dmrs <- callDMR(dmlTest)

Another paradigm – single read BS-seq analysis

• So far we have focused on “marginal” methylation levels (information aggregated over all reads).

• Sometimes the data from each single read provide additional information.

• Useful reading:
– Xie et al. (2011) NAR.
– Landan et al. (2012) Nat. Genetics.

Single read information

• Methylation entropy or polymorphism.

linkers. Bisulfite modification of genomic DNA was per-formed with EZ DNA Methylation Gold kit (ZymoResearch, Orange, CA, USA) according to the manufac-turer’s instructions.

PCR cloning, sequencing and multiple sequence aligments

PCR reactions were performed with Qiagen Hotstart PCRmaster kit (Qiagen). For each reaction, a 50 ml PCRmixture was prepared with 2 ml (100 ng) bisulfite treatedDNA, 50 pmol each forward and reverse primers. Theprimers used in the PCR runs for genomic locus 1(chr9:139174924-139175041) are 50-GGT TAT TTT TTTTTT AGT TTT GGT TTA GAT ATG A-30 and 50-TTTCTC CAA TCT TAA CTT AAA CAT AAT TCC-30. Theprimers used in the PCR runs for genomic locus 2(chr10:134480046-134480230) are 50-AAA TAT AATTTA GAA GGT ATT GTA GAT GTA AAT G-30 and50-CAT AAC TTA AAA AAT ATT ACA AAT ATAAAT ACC AAC-30. The PCR products with appropriatesize were gel-purified and cloned with TOPO vectors(Invitrogen). Sequencing reactions for colonies were con-ducted at the Sequencing Core Facility of the Children’sMemorial Research Center of Northwestern University’sFeinberg School of Medicine. To ensure an accurate cal-culation of the fidelity of inheritance of DNA methylation,the sequence reads contain unconverted cytosine atnon-CpG sites, due to the incomplete bisulfite conversion,were discarded. After the removal of vector and primersequences, the sequence reads obtained were subjected tomultiple alignments together with a reference sequence forcorresponding genomic locus. Multiple sequence align-ments were performed with clustal W (24).

Statistical analysis of the association between methylationentropies and DNA related attributes

The statistical analyses were conducted as previouslydescribed (22). Briefly, we compiled a comprehensive listof attributes that can be linked directly to the genomicregions of interest. The data for most of these attributeswere calculated based on the UCSC Genome AnnotationDatabase (25). The attributes for DNA sequence featureswere directly calculated based on the DNA sequence ex-tracted from the human genome. All the attributes areeither in the numerical form or boolean form (such aspresent in gene or not). The non-parametric Wilcoxonranksum test and chi-square test statistical tests were per-formed for each attribute in numerical form or booleanform, respectively. Significance thresholds were adjustedfor multiple testing using the highly conservativeBonferroni method, and the family-wise error rate wasset to be <1%.

RESULTS

The definition and statistical assessment of methylation entropy

Traditionally, DNA methylation data analysis is based on the determination of the average methylation level (the percentage of methylated CpG) of one or more contiguous CpG sites. This conventional approach cannot dissect DNA methylation patterns, which are herein defined as the combination of methylation statuses of contiguous CpG dinucleotides in a DNA strand. In order to better decode epigenetic data, we defined 'methylation entropy' and exploited it to assess the variability of DNA methylation patterns that might be observed for a given genomic locus in a cell population. The concept of entropy was first introduced by Rudolf Clausius as a thermodynamic property and later modified as Shannon entropy in information theory to measure the degree of uncertainty associated with a stochastic event (26).

Entropy: $H(X) = -\sum_{x} P(x) \log_2 P(x)$

An important variable in the entropy equation is the probability P(x) of a given event x. A frequently used example for interpreting Shannon entropy is tossing a coin, which has two possible outcomes. Since it is a random event, the probability of heads or tails is 0.5. Similarly, the methylation status (methylated or unmethylated) of a CpG dinucleotide could be considered as heads or tails, but may not be random. Thus, the concept of entropy could be modified to quantitatively assess the variation in DNA methylation patterns.

To calculate methylation entropy, the following parameters were introduced to the original entropy formula: (i) number of CpG sites in a given genomic locus; (ii) number of sequence reads generated for a genomic locus and (iii) frequency of each distinct DNA methylation pattern observed in a genomic locus, calculated based upon the sequence reads that were generated for the locus (Figure 1A). The probability of a given event in the Shannon entropy equation was replaced with the frequency of a

$$ME = \frac{e}{b} \sum_{i} \left( -\frac{n_i}{N} \log \frac{n_i}{N} \right)$$

ME: methylation entropy; e: entropy for code bit; b: number of CpG sites; n_i: observed occurrence of methylation pattern i; N: total number of sequence reads generated.

Figure 1. The formula of methylation entropy and examples of genomic loci with various methylation entropies in a cell population. (A) The formula of methylation entropy. The determination of methylation entropy requires three parameters: the number of CpG sites, the total number of sequence reads generated and the occurrence of each methylation pattern. (B–E) Genomic loci with various methylation entropies (panel labels: ME = 0, ME = 0, ME = 0.1875, ME = 1).
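The methylation entropy formula can be sketched in a few lines of Python. This is a minimal sketch under two assumptions that are not stated in the excerpt: the logarithm is taken to base 2 and e is set to 1, so that ME ranges from 0 (one pattern only) to 1 (all 2^b patterns equally frequent). Reads are represented as strings of '1' (methylated) and '0' (unmethylated), which is a convention chosen here for illustration.

```python
import itertools
import math
from collections import Counter


def methylation_entropy(reads, e=1.0):
    """ME = (e / b) * sum_i( -(n_i / N) * log2(n_i / N) ).

    reads: list of equal-length strings, one per sequence read,
           with '1' for a methylated CpG and '0' for an unmethylated CpG.
    e:     scaling constant (assumed to be 1 in this sketch).
    """
    if not reads:
        raise ValueError("need at least one read")
    b = len(reads[0])                      # number of CpG sites in the locus
    if b == 0 or any(len(r) != b for r in reads):
        raise ValueError("all reads must cover the same b > 0 CpG sites")
    n_total = len(reads)                   # N: total number of reads
    counts = Counter(reads)                # n_i: occurrence of each pattern
    h = -sum((n / n_total) * math.log2(n / n_total) for n in counts.values())
    return (e / b) * h


# One pattern only: no variability in the cell population.
uniform_locus = ["1111"] * 16
# Every one of the 2^4 patterns observed once: maximal variability.
all_patterns = ["".join(p) for p in itertools.product("01", repeat=4)]
# Two patterns at equal frequency.
two_patterns = ["1111"] * 8 + ["0000"] * 8

for locus in (uniform_locus, all_patterns, two_patterns):
    print(abs(methylation_entropy(locus)))
```

Note that the two-pattern locus has an average methylation level of 50%, the same as the locus with all 16 patterns, yet its entropy is much lower. That difference is exactly what the average methylation level cannot capture.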

Nucleic Acids Research, 2011 3


Xie et al. (2011) NAR


What a single read tells us

• Comparison of methyl-entropy/polymorphism among different samples.

• Sample deconvolution
– Zheng et al. (2014) GB: MethylPurify
– Estimate the proportion of cell types in a mixed sample (such as cancer), as well as calling DMRs.


MethylPurify

with the smallest parameter variance in the 50 sampling and uses the mode of their α1 estimate as the α1 for the whole tumor sample (Figure 1e,f). With the sample α1, a few EM iterations in each bin could quickly converge on the m1 and m2 estimates and read assignment across the genome. To avoid local maxima of EM, MethylPurify starts from two distinct initial values of m1 and m2 in each bin, representing the α1 component being hyper- and hypo-methylated, and the convergence point with higher likelihood is selected as the final prediction (see Methods section for details).

The output of MethylPurify will report the mixing ratio of the two components (α1 : 1 − α1) in the whole sample and the methylation level of each component (m1 and m2) in each qualifying bin across the genome. MethylPurify could also detect differentially methylated regions (DMRs) as consecutive differentially methylated bins (DMBs).
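To make the EM idea concrete, here is a simplified sketch of a two-component mixture fitted within one bin. It is not the MethylPurify implementation: it assumes each read is summarized as (k, n), that is, k methylated CpGs out of n CpGs on the read, that CpGs on a read are independent given the component, and it estimates α1, m1 and m2 jointly in a single bin. All function names, starting values and simulation settings are hypothetical.

```python
import math
import random


def em_two_component(reads, alpha, m1, m2, n_iter=200, eps=1e-6):
    """EM for a two-component mixture; reads is a list of (k, n) pairs."""
    def clamp(p):
        return min(max(p, eps), 1.0 - eps)

    def lik(k, n):
        l1 = alpha * m1 ** k * (1.0 - m1) ** (n - k)
        l2 = (1.0 - alpha) * m2 ** k * (1.0 - m2) ** (n - k)
        return l1, l2

    for _ in range(n_iter):
        # E step: posterior probability that each read is from component 1.
        resp = []
        for k, n in reads:
            l1, l2 = lik(k, n)
            resp.append(l1 / (l1 + l2))
        # M step: update the mixing ratio and both methylation levels.
        alpha = clamp(sum(resp) / len(reads))
        m1 = clamp(sum(r * k for r, (k, n) in zip(resp, reads))
                   / sum(r * n for r, (k, n) in zip(resp, reads)))
        m2 = clamp(sum((1 - r) * k for r, (k, n) in zip(resp, reads))
                   / sum((1 - r) * n for r, (k, n) in zip(resp, reads)))
    loglik = sum(math.log(sum(lik(k, n))) for k, n in reads)
    return alpha, m1, m2, loglik


def fit_bin(reads):
    """Run EM from a hyper- and a hypo-methylated start; keep the better fit."""
    fits = [em_two_component(reads, 0.3, 0.8, 0.2),
            em_two_component(reads, 0.3, 0.2, 0.8)]
    alpha, m1, m2, _ = max(fits, key=lambda f: f[3])
    if alpha > 0.5:                      # keep component 1 as the minor one
        alpha, m1, m2 = 1.0 - alpha, m2, m1
    return alpha, m1, m2


def simulate_reads(n_reads, alpha, m1, m2, cpgs=5, seed=0):
    """Simulate (k, n) reads from a known mixture, for checking the fit."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        m = m1 if rng.random() < alpha else m2
        reads.append((sum(rng.random() < m for _ in range(cpgs)), cpgs))
    return reads


reads = simulate_reads(3000, alpha=0.3, m1=0.9, m2=0.1)
print(fit_bin(reads))
```

Constraining the minor component to α1 ≤ 0.5 mirrors the convention described in the excerpt and resolves the label ambiguity between the two components.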

Inference of mixing ratio from simulated mixture of bisulfite reads from tumor and normal cell lines

To validate MethylPurify in estimating the mixing ratio, we used a simulated mixture of whole genome bisulfite sequencing data from two separate breast cell lines [22]. The HCC1954 cell line (hereafter referred to as HCC) is derived from an estrogen receptor (ER)/progesterone receptor (PR) negative and ERBB2 positive breast tumor, and the human mammary epithelial cell line (HMEC) is immortalized from normal breast epithelial cells. The bisulfite sequencing data for the two cell lines have slightly different read lengths (approximately 70 to 100 bp) and sequencing coverage (27-fold and 20-fold, respectively). We randomly sampled bisulfite reads from the two cell lines at 20-fold total coverage with varying mixing ratios from 0:1 (all HMEC) to 1:0 (all HCC) with a step of 0.05.

We first examined how the parameter estimation varies with changing inputs. At different mixing ratios, the average variance (of all qualifying bins by bootstrapping) of the minor component percentage α1 is very small and stable (Figure 2a). The variance of α1 initially increases with the mean of α1, but is suppressed as α1 approaches 0.5 since α1 is designated as the minor component to be always ≤0.5 in our model. In contrast, the estimated methylation level of the minor component m1 is the most variable. This is reasonable because at low α1 (close to 0), the minor component has very little read coverage; at high α1 (close to 0.5), it is sometimes difficult to determine which component is minor, so m1 could fluctuate depending on whether MethylPurify assigns the methylated or unmethylated reads to the minor component.

Since m1 is the most variable among the three parameters and dominates the sum of the variances, MethylPurify later only uses the standard deviation (stdev) of m1 from bootstrapping to rank all qualifying bins. Indeed,

Figure 1 Overview of MethylPurify. (a) A differentially methylated region (DMR) between tumor and normal cells. Solid and hollow red circles represent methylated and unmethylated cytosines, respectively. (b) Short reads from two cell populations after bisulfite treatment and sonication. (c) A library of bisulfite reads in a mixture of two cell populations. (d) EM algorithm iteratively estimates three parameters: the minor composition (α1) and the methylation level of each population (m1, m2) in M step, and assigns reads to each population in E step. (e) Among all 300 bp bins, the parameters estimated from informative bins converge on a final mixing ratio estimate. (f) Top, density plot of predicted minor component from selected informative bins. Bottom, separated methylation level of tumor and normal cells based on the predicted mixing ratio, and DMRs are detected as consecutive differentially methylated bins (DMBs).

Zheng et al. Genome Biology 2014, 15:419. http://genomebiology.com/2014/15/8/419


Conclusion on BS-seq analyses

• Be careful with alignments.
• Data modeling differs from ChIP-seq/RNA-seq: Poisson/negative binomial (NB) models there vs. binomial models for BS-seq.
• DMR calling needs to consider spatial correlation, coverage and biological variance.
• Single-read analysis could be very useful.
• A lot of room for method development.


Detecting long-range interactions

• So far we have assumed the genome is a long line.
• In reality, chromosomes fold into complicated structures in the nucleus. Implications:
– Genomic loci far apart on a chromosome could be close spatially due to chromosome folding.
– This is important for studying gene regulatory mechanisms, e.g., detecting enhancers.
• Traditional lower-throughput methods:
– 3C: Chromosome Conformation Capture.
– 5C: Carbon-Copy Chromosome Conformation Capture.
• High-throughput: Hi-C


Hi-C experimental procedures

(12, 13). Interestingly, chromosome 18, which is small but gene-poor, does not interact frequently with the other small chromosomes; this agrees with FISH studies showing that chromosome 18 tends to be located near the nuclear periphery (14).

We then zoomed in on individual chromosomes to explore whether there are chromosomal regions that preferentially associate with each other. Because sequence proximity strongly influences contact probability, we defined a normalized contact matrix M* by dividing each entry in the contact matrix by the genome-wide average contact probability for loci at that genomic distance (10). The normalized matrix shows many large blocks of enriched and depleted interactions, generating a plaid pattern (Fig. 3B). If two loci (here 1-Mb regions) are nearby in space, we reasoned that they will share neighbors and have correlated interaction profiles. We therefore defined a correlation matrix C in which c_ij is the Pearson correlation between the ith row and jth column of M*. This process dramatically sharpened the plaid pattern (Fig. 3C); 71% of the resulting matrix entries represent statistically significant correlations (P ≤ 0.05).
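The two steps described above, distance normalization followed by a row/column correlation, can be sketched as follows. This is a simplified illustration rather than the authors' code: the expected contact at each distance is taken as the mean of the corresponding diagonal of a single intrachromosomal matrix (the paper uses a genome-wide average), and the function names are hypothetical.

```python
import math


def observed_over_expected(m):
    """Divide each entry by the mean contact count at its genomic distance."""
    n = len(m)
    expected = []
    for d in range(n):
        diagonal = [m[i][i + d] for i in range(n - d)]
        expected.append(sum(diagonal) / len(diagonal))
    return [[m[i][j] / expected[abs(i - j)] if expected[abs(i - j)] > 0 else 0.0
             for j in range(n)] for i in range(n)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(m_star):
    """c_ij = Pearson correlation between row i and column j of M*."""
    n = len(m_star)
    columns = [[m_star[i][j] for i in range(n)] for j in range(n)]
    return [[pearson(m_star[i], columns[j]) for j in range(n)]
            for i in range(n)]


# Toy contact matrix whose counts depend only on distance: O/E is flat.
decay_only = [[8.0 / (1 + abs(i - j)) for j in range(5)] for i in range(5)]
flat = observed_over_expected(decay_only)

# Toy normalized matrix with two sets of loci (a "plaid" pattern).
plaid = [[2.0, 2.0, 0.5, 0.5],
         [2.0, 2.0, 0.5, 0.5],
         [0.5, 0.5, 2.0, 2.0],
         [0.5, 0.5, 2.0, 2.0]]
c = correlation_matrix(plaid)
```

In the toy plaid matrix, loci in the same set have perfectly correlated profiles and loci in different sets have perfectly anticorrelated profiles, which is the sharpening effect the text describes.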

The plaid pattern suggests that each chromosome can be decomposed into two sets of loci (arbitrarily labeled A and B) such that contacts within each set are enriched and contacts between sets are depleted. We partitioned each chromosome

Fig. 1. Overview of Hi-C. (A) Cells are cross-linked with formaldehyde, resulting in covalent links between spatially adjacent chromatin segments (DNA fragments shown in dark blue, red; proteins, which can mediate such interactions, are shown in light blue and cyan). Chromatin is digested with a restriction enzyme (here, HindIII; restriction site marked by dashed line; see inset), and the resulting sticky ends are filled in with nucleotides, one of which is biotinylated (purple dot). Ligation is performed under extremely dilute conditions to create chimeric molecules; the HindIII site is lost and an NheI site is created (inset). DNA is purified and sheared. Biotinylated junctions are isolated with streptavidin beads and identified by paired-end sequencing. (B) Hi-C produces a genome-wide contact matrix. The submatrix shown here corresponds to intrachromosomal interactions on chromosome 14. (Chromosome 14 is acrocentric; the short arm is not shown.) Each pixel represents all interactions between a 1-Mb locus and another 1-Mb locus; intensity corresponds to the total number of reads (0 to 50). Tick marks appear every 10 Mb. (C and D) We compared the original experiment with results from a biological repeat using the same restriction enzyme [(C), range from 0 to 50 reads] and with results using a different restriction enzyme [(D), NcoI, range from 0 to 100 reads].

Fig. 2. The presence and organization of chromosome territories. (A) Probability of contact decreases as a function of genomic distance on chromosome 1, eventually reaching a plateau at ~90 Mb (blue). The level of interchromosomal contact (black dashes) differs for different pairs of chromosomes; loci on chromosome 1 are most likely to interact with loci on chromosome 10 (green dashes) and least likely to interact with loci on chromosome 21 (red dashes). Interchromosomal interactions are depleted relative to intrachromosomal interactions. (B) Observed/expected number of interchromosomal contacts between all pairs of chromosomes. Red indicates enrichment, and blue indicates depletion (range from 0.5 to 2). Small, gene-rich chromosomes tend to interact more with one another, suggesting that they cluster together in the nucleus.

9 OCTOBER 2009 VOL 326 SCIENCE www.sciencemag.org 290


Hi-C data

• Paired-end sequencing; each read pair corresponds to a pair of interacting regions.

• The counts are usually summarized into a 2D matrix:
– First cut the genome into N equal-sized bins (the bin size depends on sequencing depth).
– Summarize the read counts into an N×N matrix. Element (i, j) is the number of pairs with one end in the ith bin and the other end in the jth bin.
– The counts represent the strength of interaction.
– Usually the counts on the diagonal are larger.
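The binning step above can be sketched as follows. This is a minimal illustration for a single chromosome, assuming 0-based mapped positions for the two ends of each pair; the function name and the example coordinates are hypothetical.

```python
def contact_matrix(pairs, chrom_length, bin_size):
    """Count read pairs per pair of bins for one chromosome.

    pairs: iterable of (pos1, pos2), the 0-based positions of the two ends.
    Returns a symmetric N x N list of lists of counts.
    """
    n_bins = -(-chrom_length // bin_size)        # ceiling division
    m = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        m[i][j] += 1
        if i != j:                                # keep the matrix symmetric
            m[j][i] += 1
    return m


# Four hypothetical read pairs on a 3,000-bp toy chromosome, 1,000-bp bins.
pairs = [(100, 250), (1500, 120), (2100, 2900), (50, 60)]
m = contact_matrix(pairs, chrom_length=3000, bin_size=1000)
for row in m:
    print(row)
```

Pairs whose ends fall in the same bin land on the diagonal, which is one reason the diagonal counts are usually the largest.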


Visualize Hi-C data in a heatmap

in this way by using principal component analysis. For all but two chromosomes, the first principal component (PC) clearly corresponded to the plaid pattern (positive values defining one set, negative values the other) (fig. S1). For chromosomes 4 and 5, the first PC corresponded to the two chromosome arms, but the second PC corresponded to the plaid pattern. The entries of the PC vector reflected the sharp transitions from compartment to compartment observed within the plaid heatmaps. Moreover, the plaid patterns within each chromosome were consistent across chromosomes: the labels (A and B) could be assigned on each chromosome so that sets on different chromosomes carrying the same label had correlated contact profiles, and those carrying different labels had anticorrelated contact profiles (Fig. 3D). These results imply that the entire genome can be partitioned into two spatial compartments such that greater interaction occurs within each compartment rather than across compartments.
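The partition by the sign of the leading component can be illustrated with a short sketch. As a stand-in for a full principal component analysis, it uses power iteration to find the leading eigenvector of a correlation matrix and splits loci by the sign of its entries. The function names, the starting vector and the toy matrices are assumptions made here for illustration; as in the text, the labels A and B are arbitrary.

```python
import math


def leading_eigenvector(c, n_iter=500):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    n = len(c)
    v = [1.0 / (k + 1) for k in range(n)]        # arbitrary non-zero start
    for _ in range(n_iter):
        w = [sum(c[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def assign_compartments(c):
    """Label each locus by the sign of its entry in the leading eigenvector."""
    return ["A" if x >= 0 else "B" for x in leading_eigenvector(c)]


# Toy correlation matrix: loci 0-1 in one set, loci 2-3 in the other.
blocks = [[1, 1, -1, -1],
          [1, 1, -1, -1],
          [-1, -1, 1, 1],
          [-1, -1, 1, 1]]
# Toy correlation matrix for loci that alternate between the two sets.
alternating = [[1 if (i - j) % 2 == 0 else -1 for j in range(4)]
               for i in range(4)]

print(assign_compartments(blocks))
print(assign_compartments(alternating))
```

The alternating toy case corresponds to loci that lie consecutively on the chromosome but switch compartments, so that linear neighbors need not share a label.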

The Hi-C data imply that regions tend to be closer in space if they belong to the same compartment (A versus B) than if they do not. We tested this by using 3D-FISH to probe four loci (L1, L2, L3, and L4) on chromosome 14 that alternate between the two compartments (L1 and L3 in compartment A; L2 and L4 in compartment B) (Fig. 3, E and F). 3D-FISH showed that L3 tends to be closer to L1 than to L2, despite the fact that L2 lies between L1 and L3 in the linear genome sequence (Fig. 3E). Similarly, we found that L2 is closer to L4 than to L3 (Fig. 3F). Comparable results were obtained for four consecutive loci on chromosome 22 (fig. S2, A and B). Taken together, these observations confirm the spatial compartmentalization

Fig. 3. The nucleus is segregated into two compartments corresponding to open and closed chromatin. (A) Map of chromosome 14 at a resolution of 1 Mb exhibits substructure in the form of an intense diagonal and a constellation of large blocks (three experiments combined; range from 0 to 200 reads). Tick marks appear every 10 Mb. (B) The observed/expected matrix shows loci with either more (red) or less (blue) interactions than would be expected, given their genomic distance (range from 0.2 to 5). (C) Correlation matrix illustrates the correlation [range from –1 (blue) to +1 (red)] between the intrachromosomal interaction profiles of every pair of 1-Mb loci along chromosome 14. The plaid pattern indicates the presence of two compartments within the chromosome. (D) Interchromosomal correlation map for chromosome 14 and chromosome 20 [range from –0.25 (blue) to 0.25 (red)]. The unalignable region around the centromere of chromosome 20 is indicated in gray. Each compartment on chromosome 14 has a counterpart on chromosome 20 with a very similar genome-wide interaction pattern. (E and F) We designed probes for four loci (L1, L2, L3, and L4) that lie consecutively along chromosome 14 but alternate between the two compartments [L1 and L3 in (compartment A); L2 and L4 in (compartment B)]. (E) L3 (blue) was consistently closer to L1 (green) than to L2 (red), despite the fact that L2 lies between L1 and L3 in the primary sequence of the genome. This was confirmed visually and by plotting the cumulative distribution. (F) L2 (green) was consistently closer to L4 (red) than to L3 (blue). (G) Correlation map of chromosome 14 at a resolution of 100 kb. The PC (eigenvector) correlates with the distribution of genes and with features of open chromatin. (H) A 31-Mb window from chromosome 14 is shown; the indicated region (yellow dashes) alternates between the open and the closed compartments in GM06990 (top, eigenvector and heatmap) but is predominantly open in K562 (bottom, eigenvector and heatmap). The change in compartmentalization corresponds to a shift in chromatin state (DNAseI).


Overlay with other 1-D data

share this feature of classical insulators. A classical boundary element is also known to stop the spread of heterochromatin. Therefore, we examined the distribution of the heterochromatin mark H3K9me3 in humans and mice in relation to the topological domains (refs 12, 13). Indeed, we observe a clear segregation of H3K9me3 at the boundary regions that occurs predominately in differentiated cells (Fig. 2d, e and Supplementary Fig. 11). As the boundaries that we analysed in Fig. 2d are present in both pluripotent cells and their differentiated progeny, the topological domains and boundaries appear to pre-mark the end points of heterochromatic spreading. Therefore, the domains do not seem to be a consequence of the formation of heterochromatin. Taken together, the above observations strongly suggest that the topological domain boundaries correlate with regions of the genome displaying classical insulator and barrier element activity, thus revealing a

Figure 1 | Topological domains in the mouse ES cell genome. a, Normalized Hi-C interaction frequencies displayed as a two-dimensional heat map overlayed on ChIP-seq data (from Y. Shen et al., manuscript in preparation), directionality index (DI), HMM bias state calls, and domains. For both directionality index and HMM state calls, downstream bias (red) and upstream bias (green) are indicated. b, Schematic illustrating topological domains and resulting directional bias. c, Distribution of the directionality index (absolute value, in blue) compared to random (red). d, Mean interaction frequencies at all genomic distances between 40 kb to 2 Mb. Above 40 kb, the intra- versus inter-domain interaction frequencies are significantly different (P < 0.005, Wilcoxon test). e, Box plot of all interaction frequencies at 80-kb distance. Intra-domain interactions are enriched for high-frequency interactions. f–i, Diagram of intra-domain (f) and inter-domain FISH probes (g) and the genomic distance between pairs (h). i, Bar chart of the squared inter-probe distance (from ref. 6) FISH probe pairs. mESC, mouse ES cell. Error bars indicate standard error (n = 100 for each probe pair).

Figure 2 | Topological boundaries demonstrate classical insulator or barrier element features. a, Two-dimensional heat map surrounding the Hoxa locus and CS5 insulator in IMR90 cells. b, Enrichment of CTCF at boundary regions. c, The portion of CTCF binding sites that are considered 'associated' with a boundary (a ±20-kb window is used as the expected uncertainty due to 40-kb binning). d, Heat maps of H3K9me3 at boundary sites in human and mouse. e, UCSC Genome Browser shot showing heterochromatin spreading in the human ES cells (hESC) and IMR90 cells. The two-dimensional heat map shows the interaction frequency in human ES cells. f, Heat map of LADs (from ref. 14) surrounding the boundary regions. Scale is the log2 ratio of DNA adenosine methylation (Dam)–lamin B1 fusion over Dam alone (Dam–laminB1/Dam).

Panel c labels: all CTCF sites, 31,968; boundary associated, 4,846; non-boundary associated, 27,122.

RESEARCH LETTER | NATURE | 2012 | ©2012 Macmillan Publishers Limited. All rights reserved.


Data analysis

• Normalization.
• An easier one: defining domains (regions with a higher level of self-interaction).
• A harder one: finding long-range interactions.
• Others: inferring 3D structures.
• Barely touched: comparison (differential domains).


Normalization

• Consider the distance between read pairs, GC content, mappability, etc. to create a baseline of counts (the expected number of reads in each element of the matrix).

• Subtract the baseline from the observed counts (or divide the observed counts by it) to get the signal.

• A couple of approaches:
– Yaffe et al. (2011) Nature Genetics: likelihood based.
– Imakaev et al. (2012) Nature Methods: assume equal visibility at all loci and apply a median-polish type of correction (iteratively divide by the row/column sums).

• Results: usually improved correlation among replicates.
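The "iteratively divide by the row/column sums" idea can be sketched as a simple matrix-balancing loop. This is a minimal illustration of the equal-visibility assumption, not the published implementation: it assumes a symmetric matrix with no empty bins, and the function name, toy matrix and visibility values are hypothetical.

```python
def iterative_correction(m, n_iter=100):
    """Rescale a symmetric contact matrix so that all row sums are equal.

    Each iteration divides entry (i, j) by delta_i * delta_j, where delta_i
    is the row sum of bin i relative to the mean row sum. Returns the
    corrected matrix and the accumulated per-bin bias.
    """
    n = len(m)
    w = [[float(x) for x in row] for row in m]
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in w]
        mean = sum(sums) / n
        # Bins with no reads are left unchanged (delta = 1).
        delta = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                w[i][j] /= delta[i] * delta[j]
        bias = [b * d for b, d in zip(bias, delta)]
    return w, bias


# Toy example: a "true" matrix with equal row sums, distorted by a
# hypothetical per-bin visibility factor.
n = 6
truth = [[1.0 / (1 + min(abs(i - j), n - abs(i - j))) for j in range(n)]
         for i in range(n)]
visibility = [0.5, 1.0, 2.0, 1.5, 0.8, 1.2]
observed = [[visibility[i] * visibility[j] * truth[i][j] for j in range(n)]
            for i in range(n)]

corrected, bias = iterative_correction(observed)
print([round(sum(row), 6) for row in corrected])
```

In this toy case the corrected matrix is proportional to the undistorted one, and the estimated bias is proportional to the visibility factors, which is the sense in which the correction removes locus-specific visibility.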


Domain detection

• The genome is organized into different "domains".
• They can be seen as blocks on the diagonal of the heatmap.

• To detect them, use the fact that interactions are more frequent within a domain and less frequent across domains.

• Still an open statistical problem.



Domain  detection  by  HMM  (Dixon  et  al.  2012,  Nature)

• Compute  directionality  index  (DI).

• Run  2-­‐state  HMM  on  DI  assuming  Gaussian  emission.    • Define  domains  based  on  HMM  results:  a  domain  starts  from  

the  beginning  of  a  “up”  region,  and  ends  at  the  end  of  its  next  “down”  region.  
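The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline of Dixon et al.: the function names, the window size, and the fixed HMM parameters (means, standard deviations, self-transition probability) are all hypothetical, and a real analysis would estimate the HMM parameters from the data. The DI formula uses A (contacts with upstream bins), B (contacts with downstream bins) and E = (A + B) / 2, and "up" is taken here to mean positive DI.

```python
import numpy as np

def directionality_index(contacts, window):
    """Per-bin DI from a symmetric contact matrix.

    A = contacts with the `window` bins upstream, B = contacts with the
    `window` bins downstream, E = (A + B) / 2, and
    DI = sign(B - A) * ((A - E)^2 / E + (B - E)^2 / E).
    """
    n = contacts.shape[0]
    di = np.zeros(n)
    for i in range(n):
        a = contacts[i, max(0, i - window):i].sum()
        b = contacts[i, i + 1:i + 1 + window].sum()
        e = (a + b) / 2.0
        if e > 0 and a != b:
            di[i] = np.sign(b - a) * ((a - e) ** 2 + (b - e) ** 2) / e
    return di

def viterbi_two_state(di, means, sds, stay=0.9):
    """Most likely state path of a 2-state Gaussian HMM.

    State 0 = "down" (negative DI), state 1 = "up" (positive DI).
    All parameters are fixed, assumed values (no Baum-Welch fitting here).
    """
    means, sds = np.asarray(means, float), np.asarray(sds, float)
    log_trans = np.log(np.array([[stay, 1 - stay], [1 - stay, stay]]))
    # Log Gaussian emission densities, shape (n_bins, 2).
    log_emit = (-0.5 * ((di[:, None] - means) / sds) ** 2
                - np.log(sds) - 0.5 * np.log(2 * np.pi))
    n = len(di)
    score = np.zeros((n, 2))
    back = np.zeros((n, 2), dtype=int)
    score[0] = np.log(0.5) + log_emit[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans  # cand[previous, current]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    path = np.zeros(n, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

def call_domains(path):
    """A domain runs from the start of an "up" run to the end of the next "down" run."""
    domains, start = [], None
    n = len(path)
    for t in range(n):
        if path[t] == 1 and start is None:
            start = t  # beginning of an "up" region opens a domain
        ends_down_run = path[t] == 0 and (t == n - 1 or path[t + 1] == 1)
        if ends_down_run and start is not None:
            domains.append((start, t))  # inclusive bin indices
            start = None
    return domains
```

On a toy block-diagonal contact matrix with two 10-bin blocks, the DI is positive in the first half of each block and negative in the second half, so the sketch returns one domain per block. Bins before the first "up" run and a trailing "up" run with no "down" run after it are left unassigned.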

Supplementary Figure 12. Comparison of Topological Domains with Lamina Associated Domains (LADs). a, Histogram showing the size distribution of the topological domains and the LADs (x-axis: domain size in bp; y-axis: frequency; medians ~454 kb for LADs and ~880 kb for topological domains). Generally, LADs are smaller in size than topological domains. b, Genome browser shot showing a region on chromosome 12 (chr12: 101,000,000–104,000,000) with multiple topological domains, one of which appears to be entirely lamina-associated, while the remainder are non-lamina-associated. Tracks: normalized interacting counts, directionality index, and log(Lamin B1 DamID / DamID). (Supplementary Information, www.nature.com/nature.)

Page 45

Detecting long-range interactions

• The interactions can be seen on the heatmap as bright, off-diagonal spots.

• A harder problem, partly because there are not enough reads.

• Still an open statistical problem. A simple method is a Poisson test, with the baseline rates computed from all data.
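One way such a Poisson test could look is sketched below. The slide does not say how the baseline rates are computed, so this sketch assumes a distance-stratified baseline: the expected count for a pair of bins is the mean count over all pairs at the same genomic separation. The function names are hypothetical, and no multiple-testing correction is applied.

```python
import numpy as np
from scipy.stats import poisson

def distance_baseline(counts):
    """Assumed baseline: mean count on each diagonal (same genomic separation)."""
    n = counts.shape[0]
    return np.array([np.diagonal(counts, offset=d).mean() for d in range(n)])

def poisson_interaction_pvalues(counts):
    """Upper-tail Poisson p-value for every off-diagonal entry of a symmetric matrix."""
    n = counts.shape[0]
    expected = distance_baseline(counts)
    pvals = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mu = expected[j - i]
            # P(X >= observed) under Poisson(mu); sf(k - 1) equals P(X > k - 1).
            p = poisson.sf(counts[i, j] - 1, mu) if mu > 0 else 1.0
            pvals[i, j] = pvals[j, i] = p
    return pvals
```

Entries with small p-values are candidate long-range interactions. Because one test is run per bin pair, a real analysis would also need to adjust for multiple testing and for the low read counts at large separations.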

of the genome inferred from Hi-C. More generally, a strong correlation was observed between the number of Hi-C reads m_ij and the 3D distance between locus i and locus j as measured by FISH [Spearman's ρ = –0.916, P = 0.00003 (fig. S3)], suggesting that Hi-C read count may serve as a proxy for distance.

Upon close examination of the Hi-C data, we noted that pairs of loci in compartment B showed a consistently higher interaction frequency at a given genomic distance than pairs of loci in compartment A (fig. S4). This suggests that compartment B is more densely packed (15). The FISH data are consistent with this observation; loci in compartment B exhibited a stronger tendency for close spatial localization.

To explore whether the two spatial compartments correspond to known features of the genome, we compared the compartments identified in our 1-Mb correlation maps with known genetic and epigenetic features. Compartment A correlates strongly with the presence of genes (Spearman's ρ = 0.431, P < 10^–137), higher expression [via genome-wide mRNA expression, Spearman's ρ = 0.476, P < 10^–145 (fig. S5)], and accessible chromatin [as measured by deoxyribonuclease I (DNAseI) sensitivity, Spearman's ρ = 0.651, P negligible] (16, 17). Compartment A also shows enrichment for both activating (H3K36 trimethylation, Spearman's ρ = 0.601, P < 10^–296) and repressive (H3K27 trimethylation, Spearman's ρ = 0.282, P < 10^–56) chromatin marks (18).

We repeated the above analysis at a resolution of 100 kb (Fig. 3G) and saw that, although the correlation of compartment A with all other genomic and epigenetic features remained strong (Spearman's ρ > 0.4, P negligible), the correlation with the sole repressive mark, H3K27 trimethylation, was dramatically attenuated (Spearman's ρ = 0.046, P < 10^–15). On the basis of these results we concluded that compartment A is more closely associated with open, accessible, actively transcribed chromatin.

We repeated our experiment with K562 cells, an erythroleukemia cell line with an aberrant karyotype (19). We again observed two compartments; these were similar in composition to those observed in GM06990 cells [Pearson's r = 0.732,

Fig. 4. The local packing of chromatin is consistent with the behavior of a fractal globule. (A) Contact probability as a function of genomic distance averaged across the genome (blue) shows a power law scaling between 500 kb and 7 Mb (shaded region) with a slope of –1.08 (fit shown in cyan). (B) Simulation results for contact probability as a function of distance (1 monomer ~ 6 nucleosomes ~ 1200 base pairs) (10) for equilibrium (red) and fractal (blue) globules. The slope for a fractal globule is very nearly –1 (cyan), confirming our prediction (10). The slope for an equilibrium globule is –3/2, matching prior theoretical expectations. The slope for the fractal globule closely resembles the slope we observed in the genome. (C) (Top) An unfolded polymer chain, 4000 monomers (4.8 Mb) long. Coloration corresponds to distance from one endpoint, ranging from blue to cyan, green, yellow, orange, and red. (Middle) An equilibrium globule. The structure is highly entangled; loci that are nearby along the contour (similar color) need not be nearby in 3D. (Bottom) A fractal globule. Nearby loci along the contour tend to be nearby in 3D, leading to monochromatic blocks both on the surface and in cross section. The structure lacks knots. (D) Genome architecture at three scales. (Top) Two compartments, corresponding to open and closed chromatin, spatially partition the genome. Chromosomes (blue, cyan, green) occupy distinct territories. (Middle) Individual chromosomes weave back and forth between the open and closed chromatin compartments. (Bottom) At the scale of single megabases, the chromosome consists of a series of fractal globules.
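The power-law scaling in panel (A) can be estimated from a contact matrix by a straight-line fit in log-log space. The sketch below is only an illustration under assumptions: the function names are hypothetical, distance is measured in bins, and the fit uses every separation rather than only the 500 kb to 7 Mb range used in the figure.

```python
import numpy as np

def contact_probability_by_distance(counts):
    """Mean contact count at each bin separation, rescaled to sum to 1."""
    n = counts.shape[0]
    separations = np.arange(1, n)
    mean_counts = np.array([np.diagonal(counts, offset=k).mean() for k in separations])
    return separations, mean_counts / mean_counts.sum()

def power_law_slope(separations, probs):
    """Slope of log10(probability) against log10(separation), by least squares."""
    keep = probs > 0  # log is undefined for empty diagonals
    slope, _intercept = np.polyfit(np.log10(separations[keep]), np.log10(probs[keep]), 1)
    return slope
```

A slope near –1 would match the fractal globule in the caption, and a slope near –3/2 the equilibrium globule. In practice the fit should be restricted to the range of separations where the scaling is expected to hold.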

Excerpt source: Science, vol. 326, 9 October 2009, p. 292 (www.sciencemag.org).

Page 46

Comparison, e.g., differential interaction

• Barely touched (people still struggle with domains and interactions).

• Conceptually, one wants to compare the interactions between different samples, e.g., locus A interacts with locus B in normal cells but not in cancer cells.

• For an element in the matrix, can we take the counts and then use RNA-seq DE test methods?
– No! Because the backgrounds could be different. This is similar to the ChIP-seq differential binding problem.
– Also, neighboring elements in the matrix need to be combined to make inference (as in ChIP-seq, but combined in 2-D), so some (kernel) smoothing is needed.
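The two points above (different backgrounds, and borrowing strength from neighboring matrix elements) can be illustrated with a rough sketch. This is not an established method. It assumes that the background is the mean count at each genomic separation, uses a Gaussian kernel for the 2-D smoothing, and returns a log2 ratio map rather than a formal test. The function names, the pseudo-count and the kernel width `sigma` are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def observed_over_expected(counts, pseudo=1.0):
    """Divide each entry by the mean count at the same separation (assumed background)."""
    n = counts.shape[0]
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    expected = np.array([np.diagonal(counts, offset=d).mean() for d in range(n)])
    # The pseudo-count keeps the ratio finite where the expected count is zero.
    return (counts + pseudo) / (expected[dist] + pseudo)

def smoothed_log_ratio(counts_a, counts_b, sigma=1.0):
    """log2 ratio of 2-D kernel-smoothed, background-corrected contact maps."""
    oe_a = gaussian_filter(observed_over_expected(counts_a), sigma=sigma)
    oe_b = gaussian_filter(observed_over_expected(counts_b), sigma=sigma)
    return np.log2(oe_a / oe_b)
```

Large positive or negative values flag candidate differential interactions. The background correction is what keeps a difference in sequencing depth between the two samples from showing up as a difference in interaction, and turning the map into p-values would still need a variance model, ideally with replicates.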

Page 47

Construct 3D structure

• BACH (Bayesian 3D constructor for Hi-C data), Hu et al. (2013), PLoS CB
– The read counts reflect the physical distances between pairs of loci on the genome (higher counts indicate closer loci).
– Given these distances, the 3D structure can be estimated.
– Based on a Poisson model, and with some constraints, the 3D coordinates of each bin on the genome can be estimated.
– Estimation procedure is based on MCMC.
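The idea behind this kind of model can be sketched as follows. This is a simplified illustration, not the BACH implementation: it assumes a Poisson mean that is log-linear in the log of the 3D distance, log(mean_ij) = beta0 + beta1 * log(d_ij), treats `beta0` and `beta1` as known, puts a flat prior on the coordinates, and uses a plain random-walk Metropolis update. All names and default values are hypothetical.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance between every pair of 3D points."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def poisson_loglik(coords, counts, beta0, beta1):
    """Poisson log-likelihood (up to a constant) over all bin pairs i < j."""
    iu = np.triu_indices(len(coords), k=1)
    log_mu = beta0 + beta1 * np.log(pairwise_distances(coords)[iu])
    return float(np.sum(counts[iu] * log_mu - np.exp(log_mu)))

def metropolis(counts, beta0, beta1, n_steps=2000, step=0.1, seed=0):
    """Random-walk Metropolis over bin coordinates with fixed betas and a flat prior."""
    rng = np.random.default_rng(seed)
    n = counts.shape[0]
    coords = rng.normal(size=(n, 3))  # random starting structure
    current = poisson_loglik(coords, counts, beta0, beta1)
    trace = [current]
    for _ in range(n_steps):
        proposal = coords.copy()
        k = rng.integers(n)  # perturb one bin at a time
        proposal[k] += rng.normal(scale=step, size=3)
        cand = poisson_loglik(proposal, counts, beta0, beta1)
        if np.log(rng.uniform()) < cand - current:  # Metropolis acceptance rule
            coords, current = proposal, cand
        trace.append(current)
    return coords, np.array(trace)
```

Because the likelihood depends only on pairwise distances, any reconstructed structure is determined only up to rotation, translation and reflection, so estimated structures have to be aligned before they can be compared.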

Page 48

Conclusion on Hi-C data

• Technology to detect chromosomal interactions using sequencing.

• Usually requires more reads.

• Analysis methods are still in their infancy. A lot of room for development.

Page 49

A grand overview of the class

• The technologies and statistical methods for:
– Gene expression microarrays and a little bit of ChIP-chip.
– Second-generation sequencing: ChIP-seq and RNA-seq.

• Bioconductor tools for analyzing genomic data, including:
– Biostrings, BSgenome, GenomicRanges, GenomicFeatures for general genomic data.
– A little bit of Rsamtools for sequencing data.
– Several Bioconductor packages for DE/DM analyses in:
  • microarray: siggenes, limma.
  • RNA-seq: DESeq, edgeR, DSS.
  • BS-seq: bsseq, DSS.

• Some software tools for analyzing sequence data:
– bowtie: alignment.
– samtools: for manipulating SAM/BAM files.