Measuring methylation: from arrays to sequencing
Jovana Maksimovic, [email protected]
@JovMaksimovic
github.com/JovMaksimovic
Bioinformatics Winter School, 3 July 2017
Talk outline
• Epigenetics
• DNA methylation
• Measuring DNA methylation
• Methylation arrays
  • How do they work?
  • What do they measure?
  • Example analysis
• Methylation sequencing
  • What are the challenges?
  • How does it work?
  • Suggested analysis pipeline
• Summary
I work at MCRI
Me!
I mostly work on human development & disease…
I write software for analysing methylation array data
missMethyl
I analyse a lot of epigenetic data…
ChIPseq, ATACseq, BSseq, Microarrays
…and a lot of gene expression data
RNAseq, Microarrays
…sometimes using mice or other models
What is epigenetics?
• Epigenetics refers to stable heritable traits not explained by changes in DNA sequence
• Greek prefix “epi” means “on top of” genetics
• Chromosome modifications that affect gene expression
  • Histones, DNA methylation
  • “Anything” that isn’t DNA!
• Essential for normal development
• Can be modified by environment
• Can be disrupted in disease
Epigenetics brings DNA to life!
Figure: cell lineage diagram (sperm, egg, zygote, blastocyst, embryogenesis, embryonic stem cells, and differentiated cell types including B cell, T cell, red blood cell, haematopoietic stem cell, fat cell, sperm cell, skin cell, muscle cell, gland cell, hormone-secreting cell, germ cell, neuron, astrocyte, neuronal progenitor cell, lung cell, kidney cell, intestine cell)
Identical DNA in every cell, different epigenetic patterns
Modified from https://biology.mit.edu/research/stemcell_epigenetics
• Important in all species
Epigenetics is CRAZY complicated!
• New sequencing & microarray technologies are enabling us to learn A LOT more about epigenetics
• Different data types need different analysis
• Today I’m only focussing on DNA methylation
Roy et al. (2010), Science
What is DNA methylation?
DNA methylation primarily occurs at CpG dinucleotides
Patterson et al. 2011, J Vis Exp
DNA methylation in the genome
• The human genome contains ~30,000,000 CpGs (~1%)
• VERY different between different species
• CpGs are not evenly spaced across the genome
• Tend to be present in clusters called CpG islands
• CpG methylation is spatially correlated
Eckhardt et al. 2007, Nature Genetics
~500bp
Methylation correlation with distance
Methylation can regulate gene expression
Plot from Peter Hickey
http://meeting.dxy.cn/oemethylation2012/article/i18782.html
Methylation at a single CpG vs. gene expression
Each point is one sample
Methylation changes coat colour of Agouti mice
Dolinoy 2008, Nutr Rev.
This gene controls coat colour in Agouti mice
These CpG sites in the promoter change PS1A expression depending on methylation
These mice are genetically identical
Hypomethylated Hypermethylated
Coat colour different due to different maternal diet i.e. environment!
Cridge et al. 2015, Nutrients
Methylation makes worker bees!
These larvae are genetically identical
Hypomethylated Hypermethylated
Methylation is cool
What do we usually want to know about it?
Finding methylation differences can tell us a lot
• Methylation is critical in determining cell type
  • Regulatory T-cell vs. Naïve T-cell
• Methylation can be disrupted in disease
  • Cancer vs. Normal
• Methylation is affected by the environment
  • Smokers vs. Non-smokers
Collect appropriate samples
Extract DNA and measure methylation
Statistical analysis
Normal
Cancer
Epigenome-wide association studies (EWAS)
• Similar to GWAS
• Compare lots of cases to lots of controls
• Often looking for small effects e.g. complex disease or environmental effects
• Need lots of samples
  • 100s or 1000s of cases & controls
https://en.wikipedia.org/wiki/Epigenome-wide_association_study_(EWAS)
How do we measure methylation?
• Bisulphite conversion
  • Create “SNPs”
  • Single nucleotide resolution
    • Array
    • Sequencing
• Enrichment of methylated DNA
  • Restriction enzymes
  • Affinity
  • Regional resolution
    • Array
    • Sequencing
What is bisulphite conversion?
• Chemical process
• Unmethylated Cs get converted to Us, which are read as Ts after PCR
• Methylated Cs are protected
• Creates “SNP”
  • Used to call methylation
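A minimal sketch of what bisulphite conversion does to a sequence, assuming we already know which Cs are methylated. The function name `bisulphite_convert` and the 0-based position convention are illustrative choices, and the example sequence is borrowed from the genome reference shown on the mapping slides later in the talk.

```python
def bisulphite_convert(seq, methylated_positions):
    """Return the sequence as read after bisulphite conversion and PCR.

    seq: top-strand DNA sequence (A/C/G/T).
    methylated_positions: set of 0-based indices of methylated Cs
    (a hypothetical input; in real data this is what we want to infer).
    """
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            converted.append("T")   # unmethylated C is read as T after PCR
        else:
            converted.append(base)  # methylated C is protected; A/G/T unchanged
    return "".join(converted)


reference = "CCGGCATGTTTAAACGCT"

# No methylation at all: every C is read as T.
fully_unmethylated = bisulphite_convert(reference, set())

# Only the C of the first CpG (index 1) is methylated, so it stays a C.
first_cpg_methylated = bisulphite_convert(reference, {1})

print(fully_unmethylated)
print(first_cpg_methylated)
```

Comparing the converted sequence back to the unconverted reference is what turns each C into a C/T “SNP” that can be used to call methylation.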
Methylation arrays
What are they and how do they work?
Illumina Infinium HumanMethylation BeadChips
• Human only
• Gene biased; selected to be relevant to human development & disease
  • e.g. TSS, promoters, CpG islands, enhancers, ...
27k array (2009): >27,000 unique CpG sites measured in each sample; 1 chip = 12 samples
450k array (2011): >450,000 unique CpG sites measured in each sample; 1 chip = 12 samples
850k array (2015): >850,000 unique CpG sites measured in each sample; 1 chip = 8 samples
Modified slide from Belinda Phipson
Methylation arrays are based on SNP array technology
• Methylation array “SNPs” (C/T) are created by bisulphite conversion
• Comparing the intensity of C/T gives the proportion of methylation at a single CpG
What is this base?
Measure fluorescence intensity
What methylation values can we get?
• On an array, we measure methylation in a population of cells
• An individual cell can be either 0, 0.5 or 1 at one CpG (neither, one or both copies methylated)
• Across a population we get a continuous measurement in the range [0, 1]
Figure: CH3 marks at one CpG in a single cell give 0, 0.5 or 1; a sample contains many cells
Measures of methylation
• Arrays measure both methylated (C) and unmethylated (T) signal to get proportion of methylation at a CpG
Beta value
$$\beta = \frac{Meth}{Meth + Unmeth}$$
Intuitive, easy to interpret, great for visualisation
M value
$$M = \log_2\left(\frac{Meth}{Unmeth}\right)$$
Better statistical properties, recommended for statistical testing
Can convert between them via a logit transformation
$$M = \log_2\left(\frac{\beta}{1 - \beta}\right)$$
Du et al. 2011, BMC Bioinformatics
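The two measures and the logit conversion between them can be sketched in a few lines of Python. This is an illustration of the formulas on this slide, not code from any array package; the function names are made up here. Real pipelines often add a small offset to the intensities to avoid dividing by zero, which this sketch leaves out.

```python
import math


def beta_value(meth, unmeth):
    """Proportion of methylation: Meth / (Meth + Unmeth)."""
    return meth / (meth + unmeth)


def m_value(meth, unmeth):
    """Log2 ratio of methylated to unmethylated signal."""
    return math.log2(meth / unmeth)


def beta_to_m(beta):
    """Logit (base 2) transformation from Beta value to M value."""
    return math.log2(beta / (1 - beta))


def m_to_beta(m):
    """Inverse of beta_to_m."""
    return 2 ** m / (2 ** m + 1)


# Hypothetical intensities: 8 units methylated, 2 units unmethylated.
beta = beta_value(8, 2)
m = m_value(8, 2)
print(beta, m)
print(beta_to_m(beta), m_to_beta(m))
```

A Beta value of 0.5 corresponds to an M value of 0, and M values grow without bound as Beta approaches 0 or 1.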
What does the data look like?
Table of M-values (rows: CpG sites; columns: samples)
Sample A1  Sample A2  Sample A3  Sample B1  Sample B2  Sample B3
    0.213      0.221      0.311      0.123      0.216      0.198
   -0.011      0.001     -0.016      2.011      2.002      2.702
    2.213      2.256      2.698      0.052      0.101      0.238
    4.567      5.231      4.982      4.152      6.216      4.698
   -4.723     -3.459      -5.36     -5.763     -5.122     -4.998
   -5.567     -4.666     -4.845     -4.522     -4.111     -3.245
    3.421      5.467      5.554      5.445      5.298      4.514
    2.981      3.345      3.512     -3.534     -4.311     -3.889
    3.792      2.987      3.324     -0.231     -0.066     -0.001
        …        ...        ...
Array analysis pipeline
QC: β density plots, control probes, MDS/clustering plots, …
  Remove bad samples and poor performing probes (CpGs)
  Software: minfi, methylumi, limma
Normalization: within and between arrays
  Transform data to remove unwanted variation
  Software: minfi, missMethyl, wateRmelon
Statistical testing for differential methylation, CpGs & regions
  Estimate means and variances and borrow information across probes
  Software: limma, bumphunter, DMRcate
Annotation to genes, gene set testing, visualization, …
  Think about biological interpretation
  Software: missMethyl, Gviz
Combine with other data types
  e.g. gene expression
  Software: GenomicRanges
After QC, data exploration is your friend!
MDS plots showing largest sources of variation in the data (Dimension 1 vs. Dimension 2; Dimension 3 vs. Dimension 4)
Clustering by individual (M28, M29, M30) and cell type (naive, activated naive, activated rTreg, rTreg)
Statistical testing:
Look for differences at single CpGs
Differential methylation
Phipson & Oshlack 2015, Genome Biology
$$\text{moderated } t = \frac{|\bar{y}_{can} - \bar{y}_{norm}|}{\tilde{s}\sqrt{v}}$$
$\tilde{s}^2$ is the empirical Bayes variance
Linear model:
$$y = X\beta + \varepsilon$$
(here $\beta$ denotes the regression coefficients, not the methylation Beta value)
Smyth, 2004
Adjust the p-values using Benjamini and Hochberg’s FDR
Can take into account any other covariates
One test per CpG!
Modified slide from Belinda Phipson & Alicia Oshlack
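With one test per CpG there are hundreds of thousands of p-values, which is why the FDR adjustment matters. Below is a minimal pure-Python sketch of the Benjamini and Hochberg adjustment, written for illustration only; in practice the analysis packages named in this talk do this for you. The function name and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four CpGs.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print(adj)
```

CpGs whose adjusted p-value falls below the chosen FDR cut-off are reported as differentially methylated.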
Lots of differences between immune cell types!
Statistical testing:
Differences across CpG dense region
• Recall: CpG methylation is spatially correlated
• Can we find consistent group-average level differences between CpGs that are close together?
• More functionally relevant than differences at individual CpGs?
Aryee et al. 2014, Bioinformatics
Lots of DMRs between immune cell types!
You can do other cool stuff!
• Unmethylated regions in rTreg compared to naïve cells enriched for FOXP3 binding motifs!
Forkhead-binding motif
Consensus motif from DMR seqs.
DMR consensus motif matches Forkhead-binding motif
Differences in cell types controlled by FOXP3!
Modified slide from Alicia Oshlack
Methylation array analysis is very mature: lots of methods!
https://www.bioconductor.org/
https://f1000research.com/articles/5-1281/v3
Methylation sequencing
AKA bisulphite sequencing: the good, the bad and the ugly
Two main types of bisulphite sequencing
• Whole-genome bisulphite sequencing (BS-seq)
  • Gold standard
  • Genome-wide (~30,000,000 CpGs in human)
  • Expensive but covers almost everything
  • Need high (10-30x) coverage to reliably call methylation
• Targeted BS-seq
  • Only sequence regions of interest
  • Reduced representation BS-seq (restriction enzyme)
  • Capture BS-seq (similar principle to exome)
  • Cheaper but can miss a lot of stuff
  • Can usually do higher (20-60x) coverage
What was bisulphite conversion again?
DNA fragment
All four of these can be sequenced!
What are the challenges?
• Like calling SNPs, methylation in BS-seq is inferred by comparison to an unconverted reference sequence
• Correct alignment is critical
• More challenging than usual!
  • Aligned sequences do not exactly match reference
• Complexity of libraries is reduced
  • Many Cs become Ts, so less info for mapping!
• Methylation is not symmetrical
  • Two strands of DNA in the reference genome must be considered separately
Mapping (Bismark)
DNA fragment
BS conversion & PCR
Read: TCGGTATGTTTAAACGTT
In silico read conversion
  C-to-T: TTGGTATGTTTAAATGTT
  G-to-A: TCAATATATTTAAACATT
Align to in silico bisulphite converted genome
  Fwd strand C-to-T converted genome: …TTGGTATGTTTAAATGTT…
    Reverse complement: …AACCATACAAATTTACAA…
  Fwd strand G-to-A converted genome: …CCAACATATTTAAACACT…
    Reverse complement: …GGTTGTATAAATTTGTGA…
Figure: each converted read aligned to each of the four converted genome strands, with mismatches marked x
Read all alignment outputs simultaneously to determine if sequence can be mapped uniquely
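The in silico conversion step can be sketched with plain string replacement. This is only an illustration of the idea on this slide, using the read shown here and the genome reference segment from the “Calling methylation” slide; a real mapper such as Bismark hands the converted sequences to a proper aligner rather than comparing strings position by position. The helper names are made up.

```python
def c_to_t(seq):
    """In silico C-to-T conversion."""
    return seq.replace("C", "T")


def g_to_a(seq):
    """In silico G-to-A conversion."""
    return seq.replace("G", "A")


def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


read = "TCGGTATGTTTAAACGTT"    # bisulphite-converted read from the slide
genome = "CCGGCATGTTTAAACGCT"  # unconverted genome reference segment

# The raw read does not exactly match the reference...
raw_mismatches = count_mismatches(read, genome)

# ...but after converting both, the C-to-T versions match perfectly,
# while the G-to-A versions do not.
ct_mismatches = count_mismatches(c_to_t(read), c_to_t(genome))
ga_mismatches = count_mismatches(g_to_a(read), g_to_a(genome))

print(c_to_t(read), g_to_a(read))
print(raw_mismatches, ct_mismatches, ga_mismatches)
```

Once the best alignment is found, the original (unconverted) read is compared with the original reference to see which Cs survived conversion.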
Calling methylation
Genome reference: …CCGGCATGTTTAAACGCT…
Aligned reads:
TCGGTATGTTTAAATGTT
TATGTTTAAATGTT
…TCGGTATGTTTAAAT
…TCGGTATGTT AAACGTT…
…TCGGTATGTTTAAATGTT
GTT…
…TTG
…TCGGTATGTTT
…TCGGTATGTTTAAATGTT…
ATGTT…
…TCGGTATGTTTAAAT TT…
…TTGGTATGTTTA ATGTT…
…TCGGTATGTTTAAACGT
First CpG: $\frac{8}{10} \times 100 = 80\%$
Second CpG: $\frac{2}{10} \times 100 = 20\%$
Calling methylation
Genome reference: …CCGGCATGTTTAAACGCT…
Aligned reads:
TCGGTATGTTTAAATGTT
TATGTTTAAATGTT
…TCGGTATGTTTAAATGTT…
…TTG
Good coverage is very important for reliable methylation calls!
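Counting Cs and Ts at each CpG is simple once the reads are aligned. The sketch below uses the reads from the first “Calling methylation” slide; the 0-based offsets and the use of `-` for an uncovered base are my reconstruction of the slide layout, and `call_methylation` is a hypothetical helper, not a function from any methylation caller. It only handles reads from the forward strand.

```python
reference = "CCGGCATGTTTAAACGCT"

# (offset into reference, read sequence); "-" marks a base not covered.
aligned_reads = [
    (0, "TCGGTATGTTTAAATGTT"),
    (4, "TATGTTTAAATGTT"),
    (0, "TCGGTATGTTTAAAT"),
    (0, "TCGGTATGTT-AAACGTT"),
    (0, "TCGGTATGTTTAAATGTT"),
    (15, "GTT"),
    (0, "TTG"),
    (0, "TCGGTATGTTT"),
    (0, "TCGGTATGTTTAAATGTT"),
    (13, "ATGTT"),
    (0, "TCGGTATGTTTAAAT-TT"),
    (0, "TTGGTATGTTTA-ATGTT"),
    (0, "TCGGTATGTTTAAACGT"),
]


def call_methylation(reference, aligned_reads):
    """Return {CpG index: (methylated, unmethylated, percent)}."""
    cpg_sites = [i for i in range(len(reference) - 1)
                 if reference[i:i + 2] == "CG"]
    calls = {}
    for site in cpg_sites:
        meth = unmeth = 0
        for offset, read in aligned_reads:
            j = site - offset
            if 0 <= j < len(read):
                if read[j] == "C":    # C survived conversion: methylated
                    meth += 1
                elif read[j] == "T":  # C read as T: unmethylated
                    unmeth += 1
        total = meth + unmeth
        percent = 100 * meth / total if total else None
        calls[site] = (meth, unmeth, percent)
    return calls


calls = call_methylation(reference, aligned_reads)
print(calls)
```

With only a handful of reads per site, a single read changes the percentage a lot, which is why coverage matters so much.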
Some real BS-seq mapping results
https://software.broadinstitute.org/software/igv/interpreting_bisulfite_mode
Methylation calling output
chr1 753479 753479 50 1 1
chr1 753492 753492 66.67 2 1
chr1 753540 753540 100 1 0
chr1 753541 753541 50 1 1
chr1 753667 753667 25 1 3
chr1 753724 753724 66.67 2 1
chr1 753763 753763 0 0 2
chr1 753785 753785 0 0 1
chr1 759932 759932 100 1 0
chr1 760913 760913 0 0 1
chr1 761299 761299 100 2 0
chr1 761371 761371 80 8 2
chr1 761377 761377 100 10 0
chr1 761446 761446 92.86 13 1
chr1 761460 761460 53.85 7 6
chr1 762005 762005 100 1 0
chr1 762114 762114 0 0 5
chr1 762176 762176 0 0 7
chr1 762180 762180 0 0 8
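Columns: chromosome; position of C in genome (start, end); % methylation; no. methylated reads; no. unmethylated reads
Sum the methylated and unmethylated read counts for total coverage
$$80\% = \frac{8}{8 + 2} \times 100$$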
This is what we work with!
Krueger et al. 2012, Nature Methods
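A small sketch of reading this kind of output, using a few rows copied from the table above. The column layout follows the slide (chromosome, start, end, % methylation, methylated reads, unmethylated reads); the parser name and the `MIN_COVERAGE` threshold of 10 are illustrative assumptions, not recommendations from the talk.

```python
coverage_text = """\
chr1 761371 761371 80 8 2
chr1 761377 761377 100 10 0
chr1 761446 761446 92.86 13 1
chr1 762005 762005 100 1 0
"""

MIN_COVERAGE = 10  # hypothetical cut-off for a "reliable" call


def parse_coverage(text):
    """Parse whitespace-separated methylation calls into dictionaries."""
    sites = []
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, start, _end, _pct, meth, unmeth = line.split()
        meth, unmeth = int(meth), int(unmeth)
        coverage = meth + unmeth  # sites in this output always have >= 1 read
        sites.append({
            "chrom": chrom,
            "pos": int(start),
            "meth": meth,
            "unmeth": unmeth,
            "coverage": coverage,
            "percent": 100 * meth / coverage,  # recomputed from the counts
        })
    return sites


sites = parse_coverage(coverage_text)
well_covered = [s for s in sites if s["coverage"] >= MIN_COVERAGE]

for s in well_covered:
    print(s["chrom"], s["pos"], round(s["percent"], 2), s["coverage"])
```

Keeping the read counts, not just the percentage, lets downstream statistics account for how well each site was covered.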
Analysis pipeline
Thorough QC is VERY important for BS-seq
Need to be brutal with trimming off poor quality bases…
…and adapters
As with SNP calling, removing PCR duplicates is a good idea for better methylation calling
Other stuff to find cool biology!
Summary
• Methylation arrays very popular
  • Only for human
  • Great for EWAS
  • Analysis very mature
  • Bioconductor is the place to go!
• BS-seq best option for genome-wide single nucleotide resolution
  • Only option for species other than human
  • Pre-processing, mapping, etc. pretty good
  • Statistical analysis still developing
  • Bioconductor is a valuable resource
  • Downstream analysis dependent on biological question
• Methylation is interesting & we know how to measure it
  • Best technology for the job depends on what you want to know!
Acknowledgments
Murdoch Childrens Research Institute
• Alicia Oshlack
• Belinda Phipson
• MCRI Bioinformatics group!
Johns Hopkins University
• Peter Hickey
missMethyl
https://www.bioconductor.org/packages/release/bioc/html/missMethyl.html
[email protected]
@JovMaksimovic
github.com/JovMaksimovic
https://f1000research.com/articles/5-1281/v3