enzymatic methyl-seq: next generation methylomes
TRANSCRIPT
Fiona J. Stewart1, V K Chaithanya Ponnaluri1, Brittany S. Sexton1, Louise Williams1, Bradley W. Langhorst1, Romualdas Vaisvila1, Yvonne Helbert2, Lei Zhang2, Timothy Harkins2, Kevin J. McKernan2, Eileen T. Dimalanta1, Theodore B. Davis1
1 New England Biolabs, Inc., Ipswich, MA 01938 2 Medicinal Genomics Corp., Woburn, MA 01801
Enzymatic Methyl-seq: Next Generation Methylomes
DNA methylation is important for gene regulation. The ability to accurately identify 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) gives us greater insight into potential gene regulatory mechanisms. Bisulfite sequencing (BS) is traditionally used to detect methylated Cs, however, BS does have its drawbacks. DNA is commonly damaged and degraded by the chemical bisulfite reaction resulting in libraries that demonstrate high GC bias and are enriched for methylated regions. To overcome these limitations, we developed an enzymatic approach, NEBNextÒEnzymatic Methyl-seq (EM-seqÔ), for methylation detection that minimizes DNA damage, resulting in longer fragments and minimal GC bias.
Illumina libraries were prepared using bisulfite and EM-seq methods with 50 ng DNA from Arabidopsis thaliana and Cannabis sativa DNA. Libraries were sequenced using Illumina’s NextSeq500 (2x75). Reads were aligned using BWAMeth 0.2. and methylation information was extracted from the alignments using MethyDackel. Total 5mC levels were compared between the sequencing data from EM-seq and WGBS libraries and LCMS (Liquid Chromatography Mass Spectrometry). 5mC levels determined by EM-seq are close to those from LCMS, whereas, WGBS results in an overestimation of 5mC. Additionally, EM-seq libraries produce higher quality sequencing metrics such as longer inserts, lower duplication rates, a higher percentage of mapped reads and less GC bias compared to bisulfite converted libraries. We conclude that EM-seq is superior to WGBS and delivers higher library yields, more accurate methylation information, reduced DNA damage, increased sequencing length, and decreased GC-bias.
Sample Preparation
Data Analysis
Methyl KitBWAmeth
Methyl Extraction
Align(BWAmeth) Deduplicate Picard
• Reads were aligned to Jamaican Lion reference genome (August 2018 assembly) or theArabidopsis reference genome (TAIR10) (for Jamaican Lion, four miscellaneous contigswere removed from methylation analysis)
• Data were analyzed using the tools in above flowchart.
• Two plant DNAs were used to make EM-seq libraries• Cannabis sativa genomic DNA (Jamaican Lion): female clones (leaf, seeded and
unseeded flowers) & male sibling (flowers) plants• Arabidopsis thaliana
• Libraries were made using 50 ng genomic DNA, spiked with control DNA (unmethylatedlambda & CpG-methylated pUC19)
• Libraries were sequenced using an Illumina NextSeq 500, 2x76 base paired reads. 5caC issequenced as C and deaminated C as T.
• Bisulfite conversion was performed using Zymo Research EZ DNA Methylation-GoldTM kit
Cannabis sativa: Higher Quality Sequencing Metrics with EM-seq compared to WGBS Cannabis sativa female leaf EM-seq libraries are superior to WGBS
EM-seq libraries outperform WGBS for Cannabis sativa input DNA. (A)Fewer PCR cycles are required for EM-seq. (B) EM-seq libraries havelower duplication than WGBS. (C) EM-seq libraries have larger libraryinsert sizes than WGBS. In addition, the EM-seq protocol has beenoptimized for both standard and large insert libraries. (D) GC distributionis more even for EM-seq than WGBS. (E) EM-seq has higher coveragedepths than WGBS. 50 million 2 x 76 base reads were used for analysis(Figures A-D) and 130 million 2 x 76 base reads were used for Figure E.
0
5
10
15
20
25
Female Leaf EM-seq-1 Female Leaf EM-seq-2 Female Leaf WGBS-1 Female Leaf WGBS-2
Yiel
d (n
M)
PCR
Yie
ld (n
M)
EM-seq-1 EM-seq-2 WGBS-1 WGBS-20
5
10
15
20
25
6 PCR cycles 8 PCR cycles
0
0.5
1
1.5
2
2.5
3
3.5
Female Leaf EM-seq-1 Female Leaf EM-seq-2 Female Leaf WGBS-1 Female Leaf WGBS-2
Perc
ent D
uplic
atio
n
0.5
1.0
1.5
2.0
2.5
3.0
3.5
EM-seq-1 EM-seq-2 WGBS-1 WGBS-2
% D
uplic
atio
n
0
EM-seq can be used to investigate plant genomic DNA• analysis of the Cannabis sativa methylome identified genes involved in seed and THC production • the Arabidopsis methylome was successfully probed
Acknowledgements Thank you to Laurie Mazzola and Danielle Fuchs for their technical support.
• Higher library yields with fewer PCR cycles• Lower percent duplication• More even base coverage
• Larger library insert sizes• Less GC bias• Similar percentage methylation as LC-MS
Differential CpG methylation identified between Cannabis flower tissues
Differential methylation across Cannabis sativa flower tissues using EM-seq data. Female flower and female seeded flower (clones) as well as the male flower (sibling)were studied. (A) Percent cytosine methylation in the CpG, CHG, and CHH contexts. Female methylation levels for both flower and seeded flower were higher than for themale flower indicating methylation patterns are potentially determined by sex. Control DNA methylation levels were <0.5% for unmethylated lambda and >97.5% for CpGmethylated pUC19 (data not shown). (B) Volcano plot of the significant (q<0.01) differential methylation calls for the CpG context between (1) female flower andfemale seeded flower (2) female flower and male flower. The differential methylation calls for female flower and female seeded flower identified 30 hypomethylatedCpGs but no hypermethylated CpGs. The differential methylation calls for female and male flower identified >50,000 hypomethylated CpGs and >11,000hypermethylated CpGs. (C) Comparison of the hypomethylated CpGs with differential methylation between the female flower samples & female and male flowers.
B
MethylatedUnmethylated
Genomic Location (504 bp)
MethylatedUnmethylated
Differential methylation analysis comparingthe female flower with the seeded flower(clones) and the male flower (sibling).Significant differential methylation calls(q<0.01) for CpG context (1x coverageminimum) between female flower with femaleseeded flower and/or male flower wereidentified. (A) Region 1 BLASTs to the edestingene family of the Cannabis sativa genome.These genes are linked to the positiveregulation of seed production. The femaleflower is CpG methylated in this region,suggesting the expression of the genes areturned off, while the female seeded flower isunmethylated at this CpG. (B) Region 2BLASTs to the THC acid synthase (THCA)gene of the Cannabis sativa genome and thisgene is linked to the positive regulation ofTHCA production. The female flower is CpGunmethylated in this region, suggesting theexpression of the gene is turned on, while themale flower is methylated at this CpG,suggesting the expression of the gene isturned off. Male flowers produce log scaleless THCA.
(A) Comparison of the total percentage of 5mC determined using LC-MS, EM-seq and WGBS for female leaf. The percentage ofmethylated cytosines using LC-MS is calculated by determining theamount of methylated cytosines and the total cytosines ((5mC/(totalC)x100). 5mC percentages for EM-seq and WGBS were determinedby combining 5mC in the three cytosine contexts (CpG, CHH, CHG).EM-seq cytosine methylation numbers are closer to LCMSmethylation values than WGBS. (B) Cytosine methylation in the CpG,CHG, and CHH contexts for EM-seq and WGBS for female leaf.Control cytosine methylation for unmethylated lambda DNA was<0.5% for EM-seq and <2% for WGBS, and for CpG methylatedpUC19 was >97.5% for both EM-seq and WGBS (data not shown).(C) CpG context correlations of EM-seq and WGBS libraries (1xminimum coverage). Both EM-seq and WGBS libraries were highlycorrelated between replicates and methods (CHG and CHH contextdata not shown). 50 million, 2 x 76 base reads were used formethylation analysis.
Female & Female Seeded Flower
Female & Male Flower
C
B
Methylation profiles identified genes involved in seed & cannabinoid production
Genomic Location (992 bp)
WGBS0
10
20
30
40
LC-MS Female Leaf EM-seq Female Leaf WGBS Female Leaf
Tota
l % 5
mC
A 40
30
20
10
0EM-seqLC-MS
A
0
25
50
75
100
CpG CHG CHH CpG CHG CHH CpG CHG CHH
100
75
50
25
0
% 5
mC
per
C c
onte
xt
Female Flower Female Seeded Flower
Male Flower
C
Hypo Hyper
0 100
q-va
lue
Methylation Difference
0.0000
0.0025
0.0050
0.0075
0.0100
−100 −50 0 50 100meth.diff
qvalue factor(factor)
hypo
0.0000
0.0025
0.0050
0.0075
0.0100
−100 −50 0 50 100meth.diff
qvalue
factor(factor)hypo
hyper
0.0100
0.0075
0.0050
0.0025
0.0000
-100 0 100Methylation Difference
-100
0.0100
0.0075
0.0050
0.0025
0.0000
q-va
lue
Female & Female Seeded Flower Female & Male Flower
50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800
Insert Size
JL
0.00%
0.02%
0.04%
0.06%
0.08%
0.10%
0.12%
0.14%
0.16%
0.18%
0.20%
0.22%
Avg. %
of in
sert
s
% o
f Rea
ds
EM-seq (standard)EM-seq (large insert)WGBS
0
4
8
12
16
20
50 100 200 300 400 500 600 700Insert Size (bp)
Nor
mal
ized
Cov
erag
e
Percent GC0 10 20 30 40 50 60 70 80 90 100
EM-seqWGBS
00.20.40.60.81.01.21.41.61.82.02.2
A B
C D
Coverage Depth
% o
f Bas
es a
t Cov
erag
e D
epth
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Coverage Depth
Pro
ject
Medic
inalG
enom
ics_Leaf_
WG
BS
_E
Mseq_C
om
bin
edB
am
s
0.0%
1.0%
2.0%
3.0%
4.0%
5.0%
6.0%
7.0%
8.0%
9.0%
10.0%
Avg. F
rac B
ases
10
0
4
2
6
8
1
EM-seqWGBS
305 10 15 20 25
E
0
25
50
75
100
CpG CHG CHH CpG CHG CHH CpG CHG CHH CpG CHG CHH
EM-seq Female Leaf WGBS Female Leaf
B 100
75
50
25
0
% 5
mC
per
C c
onte
xt
CpG
CH
GC
HH
CpG
CH
GC
HH
CpG
CH
GC
HH
CpG
CH
GC
HH
C
0
0.05
0.1
0.15
0.2
0.25
0.3
0 200 400 600
Cou
ntM
illion
s
Insert Size (bp)
Library Insert Size WGBS-1WGBS-2EM-seq-1EM-seq-2
Histogram of CpG coverage
log10 of read coverage per base
Freq
uenc
y
0.0 0.5 1.0 1.5 2.0 2.5 3.0
050
0000
1000
000
1500
000
2000
000
0.30 00.30.200.50.30.31.42.1
7.3
21
41.1
20.4
4.2
0.20 0 0 0 00.10.10 0 0 0 00.10 0 0
EM−seq−1
Histogram of CpG coverage
log10 of read coverage per base
Freq
uenc
y
0.0 0.5 1.0 1.5 2.0 2.5 3.0
020
0000
4000
0060
0000
8000
0010
0000
012
0000
0
1.20 0
2.2
3.6
0
11.7
7.57.8
21.4
10.510.5
7.8
5.8
32.2
1.71.10.80.50.30.20.10 0 0 0 0 0 0 0 0 0
WGBS−1
Histogram of % CpG methylation
% methylation per base
Freq
uenc
y
0 20 40 60 80 100
050
0000
1500
000
2500
000
3500
000 64.5
1.8 0.7 0.6 0.5 0.5 0.5 0.5 0.4 0.7 0.4 0.7 0.8 1.2 1.7 2.3 3.24.9 5.9
8.1
EM−seq−1
Histogram of % CpG methylation
% methylation per base
Freq
uenc
y
0 20 40 60 80 100
050
0000
1500
000
2500
000
3500
000
63.6
2 1.2 1 0.6 0.4 0.6 0.5 0.4 1 0.2 0.9 0.6 1.4 1.9 2.1 2.44.5 4.1
10.7
WGBS−1
0
1
2
3
4
5
6
1 11 21 31 41
CpG
s co
vere
dM
illion
s
Minimum Coverage Depth
CpG coverage distribution
WGBS-1WGBS-2EM-seq-1EM-seq-2
EM-seq and WGBS libraries were made using 50 ng Arabidopsis thaliana genomic DNA. Libraries were sequenced on an Illumina NextSeq 500 (2 x 75 bases). 125 million paired end reads for each library were aligned to TAIR10 using bwa-meth 0.2.2. CpG, CHH and CHG sites on both strands were counted independently. (A) EM-seq libraries have larger insert sizes. (B) Cytosine methylation in the CpG and CHN contexts for EM-seq and WGBS. (C) GC Bias plot for EM-seq and WGBS libraries. EM-seq libraries show less bias across the GC content.
Histogram of CHG coverage
log10 of read coverage per base
Freq
uenc
y
0.0 0.5 1.0 1.5 2.0 2.5 3.0
050
0000
1000
000
1500
000
2000
000
2500
000
0.30 00.20.200.40.20.21.21.8
6.5
19.9
41.6
22
4.7
0.20 0 0 0 00.10.10 0 0 0 00.10 0 0
EM−seq−1
Histogram of CHG coverage
log10 of read coverage per base
Freq
uenc
y
0.0 0.5 1.0 1.5 2.0 2.5 3.0
020
0000
4000
0060
0000
8000
0012
0000
0
1.10 0
2.1
3.5
0
11.8
7.78.2
22.5
1110.6
7.5
5.3
2.61.91.40.90.60.40.30.20.10 0 0 0 0 0 0 0 0 0
WGBS−1
Histogram of % CHG methylation
% methylation per base
Freq
uenc
y
0 20 40 60 80 100
0e+0
01e
+06
2e+0
63e
+06
4e+0
65e
+06
80.7
1.8 1 1 1 0.9 1 1.1 0.9 1.4 0.8 1.2 1 1.2 1.2 1 0.9 0.8 0.5 0.7
EM−seq−1
Histogram of % CHG methylation
% methylation per base
Freq
uenc
y
0 20 40 60 80 100
0e+0
01e
+06
2e+0
63e
+06
4e+0
6
78.7
2.5 1.5 1.3 1 0.7 1 1 0.8 1.6 0.5 1.2 0.8 1.3 1.3 1.1 0.9 1 0.6 1.5
WGBS−1
Histogram of CHH coverage
log10 of read coverage per base
Freq
uenc
y
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.0e
+00
2.0e
+06
4.0e
+06
6.0e
+06
8.0e
+06
1.0e
+07
1.2e
+07
0.30 00.30.200.40.20.31.31.9
6.8
20.3
41.4
21.4
4.6
0.20 0 0 0 00.10.10 0 0 0 00.10 0 0
EM−seq−1
Histogram of CHH coverage
log10 of read coverage per base
Freq
uenc
y
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0e+0
01e
+06
2e+0
63e
+06
4e+0
65e
+06
6e+0
6
1.20 0
23.4
0
11.2
7.27.7
21.4
10.710.8
8.1
6
3.12.3
1.71.10.80.50.30.20.10 0 0 0 0 0 0 0 0 0
WGBS−1
Histogram of % CHH methylation
% methylation per base
Freq
uenc
y
0 20 40 60 80 100
0.0e
+00
5.0e
+06
1.0e
+07
1.5e
+07
2.0e
+07
2.5e
+07
88.9
3.5 1.6 1.3 0.9 0.6 0.6 0.5 0.4 0.5 0.2 0.3 0.2 0.2 0.1 0.1 0 0 0 0.1
EM−seq−1
Histogram of % CHH methylation
% methylation per base
Freq
uenc
y
0 20 40 60 80 100
0.0e
+00
5.0e
+06
1.0e
+07
1.5e
+07
2.0e
+07
2.5e
+07 84.8
4.5 2.5 1.8 1.1 0.7 0.8 0.6 0.4 0.7 0.2 0.4 0.2 0.3 0.2 0.1 0.1 0.1 0 0.2
WGBS−1
01234567
1 11 21 31 41
CH
Gs
cove
red
Milli
ons
Minimum Coverage Depth
CHG coverage distribution
WGBS-1WGBS-2EM-seq-1EM-seq-2
05
101520253035
1 21 41
CH
H c
over
edM
illion
s
Minimum Coverage Depth
CHH coverage distribution
WGBS-1WGBS-2EM-seq-1EM-seq-2
www.NEBNext.com
0%
10%
20%
30%
40%
EM-seq WGBS
% 5
mC
in C
pG c
onte
xt
CpG
EM-seq WGBS
0%
2%
4%
6%
8%
10%
EM-seq WGBS
% 5
mC
in C
HN
con
text
CHN
EM-seq WGBS
EM-seq libraries cover more cytosines to greater minimum depths than WGBS. EM-seq identifies more CpGs, CHHs and CHGs, at higher coverage depth compared to WGBS, resulting in more usable information.
Cytosine correlations of EM-seq and WGBS libraries (1x minimum coverage). Both EM-seq and WGBS libraries were highly correlated between replicates and methods). 125 million, 2 x75 base reads were used for methylation analysis. Correlations in the CpG, CHG and CHH contexts are shown
Cannabis sativa
Arabidopsis thaliana
Introduction Methods
Methods
CpG
CH
G
CH
H
CpG
CH
G
CH
H
CpG
CH
G
CH
H
C
AFemale Flower-1
Female Flower-2
Female Seeded Flower-1Female Seeded Flower-2
Male Flower-1
Male Flower-2
Female Flower-1
Female Flower-2
Female Seeded Flower-1Female Seeded Flower-2
Male Flower-1
Male Flower-2
012345678
0 20 40 60 80 100
Nor
mal
ized
cov
erag
e
Percent GC
GC-bias plotWGBS-1WGBS-2EM-seq-1EM-seq-2
A
C
C
Conclusion
Histograms showing CpG,CHG and CHH coverage ofEM-seq (A) and WGBS (C)libraries were plotted usingMethylKit. The histogram isshifted right for EM-seqcompared to WGBS indicatingthat more CpGs are detected athigher coverage for EM-seqlibraries compared to WGBS.Percent methylation forlibraries made using EM-seq(B) and WGBS (D) are shownin the CpG, CHG and CHHcontexts. EM-seq and WGBSdata both show that mostcytosines in the CHG and CHHcontexts are not methylated.Methylation is found in the CpGcontext for both EM-seq andWGBS libraries.
A
B
C
D
EM-seq libraries compared to WGBS libraries had: