dna methylation analysis in r

57
Introduction R Basics Genomics and R RRBS analysis with R package methylKit Analyzing RRBS methylation data with R Basic analysis with R package methylKit Altuna Akalın [email protected] EpiWorkshop 2013 Weill Cornell Medical College Institute for Computational Biomedicine Akalın, Altuna Analyzing RRBS methylation data with R

Upload: altuna-akalin

Post on 11-May-2015

9.763 views

Category:

Documents


13 download

DESCRIPTION

Tutorial for analyzing DNA methylation data from bisulfite sequencing with R

TRANSCRIPT

Page 1: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Analyzing RRBS methylation data with RBasic analysis with R package methylKit

Altuna Akalınala2027medcornelledu

EpiWorkshop 2013Weill Cornell Medical College

Institute for Computational Biomedicine

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Outline

1 Introduction

2 R Basics

3 Genomics and R

4 RRBS analysis with R package methylKitWorking with annotated methylation events

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Introduction

This tutorial will show how to analyze methylation data obtained fromReduced Representation Bisulfite Sequencing (RRBS) experimentsFirst we will introduce

R basicsGenomics and R Bioconductor packages that will be usefulwhen working with genomic intervalsHow to use the package methylKit to analyze methylationdata

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Before we start

Download and unzip the data folderhttpmethylkitgooglecodecomfilesmethylKitTutorialData_feb2012zip

Make sure that you installed R 2152 and RStudiofor Mac OSX httpcranr-projectorgbinmacosxoldR-2152pkgfor windows httpcranr-projectorgbinwindowsbaseold2152For linux httpcranr-projectorgsrcbaseR-2R-2152targzhttpwwwrstudiocomidedownload

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

R Basics

Here are some basic R operations and data structures that will begood to know if you do not have prior experience with R

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Computations in RR can be used as an ordinary calculator There are a few examples

2 + 3 5 Note the order of operations

[1] 17

log(10) Natural logarithm with base e=2718282

[1] 2303

4^2 4 raised to the second power

[1] 16

32 Division

[1] 15

sqrt(16) Square root

[1] 4

abs(3 - 7) Absolute value of 3-7

[1] 4

pi The mysterious number

[1] 3142

exp(2) exponential function

[1] 7389

This is a comment line

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

VectorsR handles vectors easily and intuitively

x lt- c(1 3 2 10 5) create a vector x with 5 componentsx

[1] 1 3 2 10 5

y lt- 15 create a vector of consecutive integersy

[1] 1 2 3 4 5

y + 2 scalar addition

[1] 3 4 5 6 7

2 y scalar multiplication

[1] 2 4 6 8 10

y^2 raise each component to the second power

[1] 1 4 9 16 25

2^y raise 2 to the first through fifth power

[1] 2 4 8 16 32

y y itself has not been unchanged

[1] 1 2 3 4 5

y lt- y 2y it is now changed

[1] 2 4 6 8 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind

x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1

x y [1] 1 4 [2] 2 5 [3] 3 6

t(m1) transpose of m1

[1] [2] [3] x 1 2 3 y 4 5 6

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

dim(m1) 2 by 3 matrix

[1] 3 2

You can also directly list the elements and specify the matrix

m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2

[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates

chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 2: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Outline

1 Introduction

2 R Basics

3 Genomics and R

4 RRBS analysis with R package methylKitWorking with annotated methylation events

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Introduction

This tutorial will show how to analyze methylation data obtained fromReduced Representation Bisulfite Sequencing (RRBS) experimentsFirst we will introduce

R basicsGenomics and R Bioconductor packages that will be usefulwhen working with genomic intervalsHow to use the package methylKit to analyze methylationdata

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Before we start

Download and unzip the data folderhttpmethylkitgooglecodecomfilesmethylKitTutorialData_feb2012zip

Make sure that you installed R 2152 and RStudiofor Mac OSX httpcranr-projectorgbinmacosxoldR-2152pkgfor windows httpcranr-projectorgbinwindowsbaseold2152For linux httpcranr-projectorgsrcbaseR-2R-2152targzhttpwwwrstudiocomidedownload

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

R Basics

Here are some basic R operations and data structures that will begood to know if you do not have prior experience with R

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Computations in RR can be used as an ordinary calculator There are a few examples

2 + 3 5 Note the order of operations

[1] 17

log(10) Natural logarithm with base e=2718282

[1] 2303

4^2 4 raised to the second power

[1] 16

32 Division

[1] 15

sqrt(16) Square root

[1] 4

abs(3 - 7) Absolute value of 3-7

[1] 4

pi The mysterious number

[1] 3142

exp(2) exponential function

[1] 7389

This is a comment line

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

VectorsR handles vectors easily and intuitively

x lt- c(1 3 2 10 5) create a vector x with 5 componentsx

[1] 1 3 2 10 5

y lt- 15 create a vector of consecutive integersy

[1] 1 2 3 4 5

y + 2 scalar addition

[1] 3 4 5 6 7

2 y scalar multiplication

[1] 2 4 6 8 10

y^2 raise each component to the second power

[1] 1 4 9 16 25

2^y raise 2 to the first through fifth power

[1] 2 4 8 16 32

y y itself has not been unchanged

[1] 1 2 3 4 5

y lt- y 2y it is now changed

[1] 2 4 6 8 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind

x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1

x y [1] 1 4 [2] 2 5 [3] 3 6

t(m1) transpose of m1

[1] [2] [3] x 1 2 3 y 4 5 6

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

dim(m1) 2 by 3 matrix

[1] 3 2

You can also directly list the elements and specify the matrix

m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2

[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates

chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 3: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Introduction

This tutorial will show how to analyze methylation data obtained fromReduced Representation Bisulfite Sequencing (RRBS) experimentsFirst we will introduce

R basicsGenomics and R Bioconductor packages that will be usefulwhen working with genomic intervalsHow to use the package methylKit to analyze methylationdata

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Before we start

Download and unzip the data folderhttpmethylkitgooglecodecomfilesmethylKitTutorialData_feb2012zip

Make sure that you installed R 2152 and RStudiofor Mac OSX httpcranr-projectorgbinmacosxoldR-2152pkgfor windows httpcranr-projectorgbinwindowsbaseold2152For linux httpcranr-projectorgsrcbaseR-2R-2152targzhttpwwwrstudiocomidedownload

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

R Basics

Here are some basic R operations and data structures that will begood to know if you do not have prior experience with R

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Computations in RR can be used as an ordinary calculator There are a few examples

2 + 3 5 Note the order of operations

[1] 17

log(10) Natural logarithm with base e=2718282

[1] 2303

4^2 4 raised to the second power

[1] 16

32 Division

[1] 15

sqrt(16) Square root

[1] 4

abs(3 - 7) Absolute value of 3-7

[1] 4

pi The mysterious number

[1] 3142

exp(2) exponential function

[1] 7389

This is a comment line

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

VectorsR handles vectors easily and intuitively

x lt- c(1 3 2 10 5) create a vector x with 5 componentsx

[1] 1 3 2 10 5

y lt- 15 create a vector of consecutive integersy

[1] 1 2 3 4 5

y + 2 scalar addition

[1] 3 4 5 6 7

2 y scalar multiplication

[1] 2 4 6 8 10

y^2 raise each component to the second power

[1] 1 4 9 16 25

2^y raise 2 to the first through fifth power

[1] 2 4 8 16 32

y y itself has not been unchanged

[1] 1 2 3 4 5

y lt- y 2y it is now changed

[1] 2 4 6 8 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind

x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1

x y [1] 1 4 [2] 2 5 [3] 3 6

t(m1) transpose of m1

[1] [2] [3] x 1 2 3 y 4 5 6

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

dim(m1) 2 by 3 matrix

[1] 3 2

You can also directly list the elements and specify the matrix

m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2

[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates

chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 4: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Before we start

Download and unzip the data folderhttpmethylkitgooglecodecomfilesmethylKitTutorialData_feb2012zip

Make sure that you installed R 2152 and RStudiofor Mac OSX httpcranr-projectorgbinmacosxoldR-2152pkgfor windows httpcranr-projectorgbinwindowsbaseold2152For linux httpcranr-projectorgsrcbaseR-2R-2152targzhttpwwwrstudiocomidedownload

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

R Basics

Here are some basic R operations and data structures that will begood to know if you do not have prior experience with R

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Computations in RR can be used as an ordinary calculator There are a few examples

2 + 3 5 Note the order of operations

[1] 17

log(10) Natural logarithm with base e=2718282

[1] 2303

4^2 4 raised to the second power

[1] 16

32 Division

[1] 15

sqrt(16) Square root

[1] 4

abs(3 - 7) Absolute value of 3-7

[1] 4

pi The mysterious number

[1] 3142

exp(2) exponential function

[1] 7389

This is a comment line

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

VectorsR handles vectors easily and intuitively

x lt- c(1 3 2 10 5) create a vector x with 5 componentsx

[1] 1 3 2 10 5

y lt- 15 create a vector of consecutive integersy

[1] 1 2 3 4 5

y + 2 scalar addition

[1] 3 4 5 6 7

2 y scalar multiplication

[1] 2 4 6 8 10

y^2 raise each component to the second power

[1] 1 4 9 16 25

2^y raise 2 to the first through fifth power

[1] 2 4 8 16 32

y y itself has not been unchanged

[1] 1 2 3 4 5

y lt- y 2y it is now changed

[1] 2 4 6 8 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind

x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1

x y [1] 1 4 [2] 2 5 [3] 3 6

t(m1) transpose of m1

[1] [2] [3] x 1 2 3 y 4 5 6

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

dim(m1) 2 by 3 matrix

[1] 3 2

You can also directly list the elements and specify the matrix

m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2

[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates

chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 5: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

R Basics

Here are some basic R operations and data structures that will begood to know if you do not have prior experience with R

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Computations in RR can be used as an ordinary calculator There are a few examples

2 + 3 5 Note the order of operations

[1] 17

log(10) Natural logarithm with base e=2718282

[1] 2303

4^2 4 raised to the second power

[1] 16

32 Division

[1] 15

sqrt(16) Square root

[1] 4

abs(3 - 7) Absolute value of 3-7

[1] 4

pi The mysterious number

[1] 3142

exp(2) exponential function

[1] 7389

This is a comment line

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

VectorsR handles vectors easily and intuitively

x lt- c(1 3 2 10 5) create a vector x with 5 componentsx

[1] 1 3 2 10 5

y lt- 15 create a vector of consecutive integersy

[1] 1 2 3 4 5

y + 2 scalar addition

[1] 3 4 5 6 7

2 y scalar multiplication

[1] 2 4 6 8 10

y^2 raise each component to the second power

[1] 1 4 9 16 25

2^y raise 2 to the first through fifth power

[1] 2 4 8 16 32

y y itself has not been unchanged

[1] 1 2 3 4 5

y lt- y 2y it is now changed

[1] 2 4 6 8 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind

x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1

x y [1] 1 4 [2] 2 5 [3] 3 6

t(m1) transpose of m1

[1] [2] [3] x 1 2 3 y 4 5 6

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

dim(m1) 2 by 3 matrix

[1] 3 2

You can also directly list the elements and specify the matrix

m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2

[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates

chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 6: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Computations in RR can be used as an ordinary calculator There are a few examples

2 + 3 5 Note the order of operations

[1] 17

log(10) Natural logarithm with base e=2718282

[1] 2303

4^2 4 raised to the second power

[1] 16

32 Division

[1] 15

sqrt(16) Square root

[1] 4

abs(3 - 7) Absolute value of 3-7

[1] 4

pi The mysterious number

[1] 3142

exp(2) exponential function

[1] 7389

This is a comment line

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

VectorsR handles vectors easily and intuitively

x lt- c(1 3 2 10 5) create a vector x with 5 componentsx

[1] 1 3 2 10 5

y lt- 15 create a vector of consecutive integersy

[1] 1 2 3 4 5

y + 2 scalar addition

[1] 3 4 5 6 7

2 y scalar multiplication

[1] 2 4 6 8 10

y^2 raise each component to the second power

[1] 1 4 9 16 25

2^y raise 2 to the first through fifth power

[1] 2 4 8 16 32

y y itself has not been unchanged

[1] 1 2 3 4 5

y lt- y 2y it is now changed

[1] 2 4 6 8 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind

x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1

x y [1] 1 4 [2] 2 5 [3] 3 6

t(m1) transpose of m1

[1] [2] [3] x 1 2 3 y 4 5 6

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

dim(m1) 2 by 3 matrix

[1] 3 2

You can also directly list the elements and specify the matrix

m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2

[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates

chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 7: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

VectorsR handles vectors easily and intuitively

x lt- c(1 3 2 10 5) create a vector x with 5 componentsx

[1] 1 3 2 10 5

y lt- 15 create a vector of consecutive integersy

[1] 1 2 3 4 5

y + 2 scalar addition

[1] 3 4 5 6 7

2 y scalar multiplication

[1] 2 4 6 8 10

y^2 raise each component to the second power

[1] 1 4 9 16 25

2^y raise 2 to the first through fifth power

[1] 2 4 8 16 32

y y itself has not been unchanged

[1] 1 2 3 4 5

y lt- y 2y it is now changed

[1] 2 4 6 8 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind

x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1

x y [1] 1 4 [2] 2 5 [3] 3 6

t(m1) transpose of m1

[1] [2] [3] x 1 2 3 y 4 5 6

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

dim(m1) 2 by 3 matrix

[1] 3 2

You can also directly list the elements and specify the matrix

m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2

[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates

chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 8: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind

x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1

x y [1] 1 4 [2] 2 5 [3] 3 6

t(m1) transpose of m1

[1] [2] [3] x 1 2 3 y 4 5 6

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

dim(m1) 2 by 3 matrix

[1] 3 2

You can also directly list the elements and specify the matrix

m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2

[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates

chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 9: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Matrices

dim(m1) 2 by 3 matrix

[1] 3 2

You can also directly list the elements and specify the matrix

m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2

[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates

chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 10: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates

chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 11: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

names(mydata) lt- c(chr start end strand) variable namesmydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

OR this will work toomydata lt- dataframe(chr = chr start = start

end = end strand = strand)mydata

chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 12: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

There are a variety of ways to identify the elements of a data frame

mydata[24] columns 234 of data frame

start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +

columns chr and start from data framemydata[c(chr start)]

chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 13: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Data Frames

mydata$start variable start in the data frame

[1] 200 4000 100 400

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 14: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name

example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)

mymatrix = matrix(14 ncol = 2) age = 53)w

$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 15: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Lists

You can extract elements of a list using the [[]] convention

w[[3]] 3rd component of the list

[1] [2] [1] 1 3 [2] 2 4

w[[mynumbers]] component named mynumbers in list

[1] 1 2 3

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 16: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

R has great support for plotting and customizing plots We will showonly a few below

x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 17: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Graphics in R

Saving plots

pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR

plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 18: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting help on R functionscommands

Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 19: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing packages

Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom

How to install packagesinstallpackages(devtools)

source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)

Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 20: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Genomics and R

Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 21: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()

read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt

header = FALSE)

write data frames to text fileswritetable(cpgidf exampleouttxt)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 22: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the genomics data

check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 23: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 24: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1

ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1

ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )

find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 25: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Using GenomicRanges package for operations ongenomic intervals

number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))

Histogram of width(cpgi)

width(cpgi)

Fre

quen

cy

0 2000 6000 10000

050

010

00

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 26: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 27: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Installing methylKit

In R terminal install the package and the dependencies by typing thefollowing commands

source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)

seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 28: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Flowchart for capabilities of methylKit

Aligned reads to the genome

Methylation per base

Sample merge

Differential methylation

Descriptive statistics

Correlation

Clustering and PCA

Filtering by coverage

Windowregion based methylation

scores

VisualizationAnnotation Coercion to bioconductor and R

objectsFurther analysis by

user

Input option 1

Input option 2

readbismark()

read()

filterByCoverage()

getMethylationStats()getCoverageStats()

regionCounts() tileMethylCounts()

unite()

getCorrelation()

clusterSamples()

PCASamples()

annotateWithFeature()and similar functions

calculateDiffMeth()

bedgraph() diffMethPerChr()

as() getData()

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 29: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this

chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 30: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows

library(methylKit)filelist = list(test1CpGtxt test2CpGtxt

ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1

test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 31: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Reading the methylation data

Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 32: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2

getMethylationStats(myobj[[2]] plot = F bothstrands = F)

methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 33: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends

getMethylationStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG methylation

methylation per base

Fre

quen

cy

0 20 40 60 80 100

020

000

4000

060

000

8000

0

529

2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38

58

134

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 34: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Quality check and basic features of the data

We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram

getCoverageStats(myobj[[2]] plot = T bothstrands = F)

Histogram of CpG coverage

log10 of read coverage per base

Fre

quen

cy

10 15 20 25 30 35 40 45

050

0010

000

1500

020

000

2500

030

000

3500

0 222

177

125

139

168

144

2

03 0 0 0 0 0 0 0 0 0 0

test2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 35: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Filtering samples based on read coverage

It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample

filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 36: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationUniting data sets

In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples

meth = unite(myobj destrand = FALSE) meth is a methylBase object

Let us take a look at the data content of methylBase object

head(meth 2)

id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 37: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix

getCorrelation(meth plot = T)

test1

00 04 08

092 091

00 04 08

00

04

08

091

00

04

08

00 04 08

00

04

08

x

y

test2

089 089

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

ctrl1

00

04

08

097

00 04 08

00

04

08

00 04 08

00

04

08

x

y

00 04 08

00

04

08

x

y

00 04 0800 04 08

00

04

08

x

y

ctrl2

CpG base correlation

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 38: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

methylKit can do hiearchical clustering

clusterSamples(meth dist = correlation method = wardplot = TRUE)

000

005

010

015

020

025

030

035

CpG methylation clustering

Distance method correlation Clustering method wardSamples

Hei

ght

ctrl1

ctrl2

test

1

test

2

Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 39: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofiles

Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster

PCASamples(meth)

minus100 minus50 0 50 100

minus10

0minus

500

5010

015

0

CpG methylation PCA Analysis

PC1

PC

2

test1

test2

ctrl1ctrl2

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 40: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Clustering your samples based on the methylationprofilesclustering multiple conditions

It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option

filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)

clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)

newMeth = unite(clist)clusterSamples(newMeth)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 41: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values

myDiff = calculateDiffMeth(meth)

calculateDiffMeth() can also use multiple cores for fastercalculation

myDiff = calculateDiffMeth(meth numcores = 2)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 42: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated bases

Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25

get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25

qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25

qvalue = 001)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 43: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Getting differentially methylated regions

methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis

you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000

stepsize = 1000)

head(tiles[[1]] 2)

id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 44: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function

if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001

methcutoff = 25)

if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001

methcutoff = 25)

chr20

chr21

chr22

of hyper amp hypo methylated regions per chromsome

(percentage)

0 2 4 6 8 10

qvaluelt001 amp methylation diff gt=25

hyperhypo

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 45: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor

geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)

summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 46: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them

read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt

featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p

cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 47: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Annotating differential methylation events

We can also read any BED file and annotate our differentiallymethylated basesregions with them

read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)

diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 48: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function

diffAnn = annotateWithGenicParts(myDiff25p geneobj)

targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)

targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 49: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation events

It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters

getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)

promoter exon intron intergenic 1443 1411 3289 3857

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 50: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters

plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)

14

14

33

39

differential methylation annotation

promoterexonintronintergenic

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 51: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions

plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)

22

23

55

differential methylation annotation

CpGishoresother

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 52: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Regional analysis

We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput

promoters = regionCounts(myobj geneobj$promoters)

head(promoters[[1]] 3)

id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 53: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses

class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 54: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function

select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 55: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

methylKit convenience functions

Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects

get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 56: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html

Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq

Some of the aligners and pipelines

Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark

AMP pipeline httpcodegooglecompamp-errbs

BSMAP httpcodegooglecompbsmap

other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html

Akalın Altuna Analyzing RRBS methylation data with R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit
Page 57: DNA methylation analysis in R

IntroductionR Basics

Genomics and RRRBS analysis with R package methylKit

Further information

More info on methylKit

The webpage for the package ishttpmethylkitgooglecodecom

Genome Biology paper for the package is athttpgenomebiologycom20121310R87

Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit

The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf

The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf

Akalın Altuna Analyzing RRBS methylation data with R

  • Introduction
  • R Basics
  • Genomics and R
  • RRBS analysis with R package methylKit