
ChIPseq analysis using deepTools and MACS

Devon Ryan
Twitter: @dpryan79
Biostars: dpryan79

Who are we?

Deeptools.ie-freiburg.mpg.de

Topics to be covered

Assess ChIP quality/IP strength
Understand the difference between BAM files and bigWig files
Understand why, when and how one needs to normalize ChIP-seq data
Know the basics of peak calling (why and how is it done?)
Be able to work with the output from MACS2 (e.g., filtering of peaks, visualization in a Genome Browser)
Know how to generate coverage plots, e.g., heatmaps

ChIPseq

[Szalkowski & Schmid, Brief Bioinf, 2011]

Outline

ChIP efficiency
Coverage Files
  Depth normalization
  Input normalization
Peak calling
  Why? How?
  Types of peaks: Sharp, broad, mixed
Downstream processing
  This is actually the interesting part

BAM fingerprints

Calculating the fingerprint

Reads are counted in bins along the genome; the bin counts are then sorted and scaled:

bins:    2 3 4 2 3 3 4 4 5 7 8
sorting: 2 2 3 3 3 4 4 4 5 7 8
scaling: 0.25 0.25 0.37 0.37 0.37 0.5 0.5 0.5 0.62 0.87 1
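The sort-and-scale steps above can be sketched in a few lines of Python. This is an illustration only, not deepTools code: both function names are hypothetical. The first function mirrors the slide (divide by the largest bin). The second assumes that the plotted fingerprint is the running fraction of all reads across the sorted bins, which is an assumption about plotFingerprint rather than something stated on the slide.

```python
def sort_and_scale(bin_counts):
    """Sort per-bin read counts and scale them to 0..1 (slide version)."""
    ordered = sorted(bin_counts)
    if not ordered or ordered[-1] == 0:
        raise ValueError("need at least one bin with reads")
    top = ordered[-1]
    return [count / top for count in ordered]


def cumulative_fraction(bin_counts):
    """Running fraction of all reads across the sorted bins (assumed plot version)."""
    ordered = sorted(bin_counts)
    total = sum(ordered)
    if total == 0:
        raise ValueError("need at least one bin with reads")
    running = 0
    fractions = []
    for count in ordered:
        running += count
        fractions.append(running / total)
    return fractions


bins = [2, 3, 4, 2, 3, 3, 4, 4, 5, 7, 8]  # example bin counts from the slide
scaled = sort_and_scale(bins)
curve = cumulative_fraction(bins)
print([round(value, 2) for value in scaled])
```

If reads were spread perfectly evenly, the cumulative curve would follow the diagonal; the more reads are concentrated in a few bins, the longer the curve stays low before rising at the end.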

IP Strength

[Figure: fingerprint plots comparing Input and ChIP samples. Panels: High IP enrichment; Similar genome coverage (ca. 90%); Insufficient genome coverage; Input deviates from straight line (uniformity); Weak IP enrichment over input]

Practical I: plotFingerprint


NGS file formats ~ analysis steps

FASTQ
sequenced DNA fragments (reads): DNA sequence only
↓ read alignment
SAM / BAM
reads: DNA sequence + genomic localization
↓ counting reads, normalizing for sequencing depth etc.
bedGraph / bigWig
coverage files: read numbers per genomic region
↓
DOWNSTREAM ANALYSES

FASTQ example:
@Read1
GATTTGGGGTTCAAAGCAGTAT
+
@Read2
CGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

SAM example:
r1 163 chr1 7 30 8M2I4M1D3M = 37 39 TTAGATGATTG *
r2 0 chr1 9 30 3S6M1P1I4M * 0 0 AAAAGGATA *

bedGraph example:
chr1 10 20 1.5
chr1 20 30 1.7
chr1 30 40 2.0
chr1 40 50 1.8

Computing coverage

aim: reduce the vast amount of information from the BAM file to the simple information: How many reads do I have (per bp/genomic bin/…)?

BAM record:
39V34V1:38:C0RLHACXX:4:1216:16137:31969 163 chr1 3000307 42 51M = 3000408 152 CTGTAGTTACTGTTTGCTTACCTAGATTCTTCTTTTCCAGAATTCTCTTAG CCCFFFFFHHHGHIIJIJJJJIIGHFGIGIJIIJJJHIHEHIGIIIIJJGF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:0 YT:Z:CP

coverage:
chr2 100100 100120 5
chr2 100121 100141 3.2
chr2 100142 100163 13.8

size reduction leads to many advantages of bigWig files over BAMs:
• data storage
• data sharing
• intuitive visualization via genome browsers
• more efficient for downstream analyses
• …

[Figure: genome divided into bins (e.g. 50 bp) with reads and fragments; bin counts: 1 2 2 5 5 6 6 6 4 46]
DNA → Sonicated to ~200bp frags. → 50-100 base reads
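Counting reads per bin can be sketched as follows. This is a simplified illustration, not how bamCoverage is implemented: the function name and the toy read positions are made up, only forward-strand reads are handled, and each read is extended from its start to an assumed fragment length before counting.

```python
def binned_coverage(read_starts, read_length, fragment_length, genome_length, bin_size=50):
    """Count how many (extended) reads overlap each fixed-size bin.

    Positions are 0-based; each read covers the half-open interval
    [start, start + max(read_length, fragment_length)).
    """
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    span = max(read_length, fragment_length)
    for start in read_starts:
        end = min(start + span, genome_length)
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        for index in range(first_bin, last_bin + 1):
            counts[index] += 1
    return counts


reads = [0, 10, 40, 120, 130, 260]  # hypothetical read start positions
as_reads = binned_coverage(reads, read_length=50, fragment_length=50, genome_length=500)
as_fragments = binned_coverage(reads, read_length=50, fragment_length=200, genome_length=500)
print(as_reads)
print(as_fragments)
```

Extending reads to the fragment length fills the gap between where a read starts and where the sonicated fragment actually ended, so the fragment-based track is smoother than the read-based one.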

Depth normalization

[Figure: genome divided into bins (e.g. 50 bp) with reads and fragments; bin counts: 1 2 2 5 5 6 6 6 4 46]
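Depth normalization can be sketched as rescaling bin counts by library size, so that samples sequenced to different depths become comparable. The example below uses counts per million mapped reads; that choice, the function name, and the numbers are assumptions for illustration and not necessarily the method used in the practical.

```python
def counts_per_million(bin_counts, total_mapped_reads):
    """Scale raw bin counts to counts per million mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    scale = 1_000_000 / total_mapped_reads
    return [count * scale for count in bin_counts]


# Two hypothetical samples with the same profile, one sequenced twice as deep.
shallow = counts_per_million([10, 20, 40], total_mapped_reads=10_000_000)
deep = counts_per_million([20, 40, 80], total_mapped_reads=20_000_000)
print(shallow, deep)
```

After scaling, the two samples give the same values even though the raw counts differ by a factor of two.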

Practical II: bamCoverage


Input samples – They’re important!!!

Input controls should be treated exactly the same* as ChIP samples except for the antibody treatment!

*same cell type, same shearing, same PCRs, same experimenter, …

(not only) gene-rich regions = bias-rich regions (especially applicable to old sequencing data)


Comparative coverage of BAM files

typical application: input-normalization for a ChIP sample
aim: diminish the background signal from the ChIP signal based on the input
main caveat: the same genomic region is never covered exactly the same way, which is neither the input’s nor the ChIP’s fault

Normalization by total read count: very straightforward, perhaps too simple?

Normalization by SES (Diaz et al., 2012): more sophisticated, based on bamFingerprint (greatest distance between input and ChIP), not recommended for broad marks due to weaker enrichment

Which normalization method?

ratio, log2(ratio), difference, sum, reciprocal_ratio, SES

Base your decision on the kind of questions you would like to answer!
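The per-bin operations listed above can be illustrated with a small Python function. This is a sketch under stated assumptions, not bamCompare itself: the function name is hypothetical, the pseudocount of 1 is an assumed way to avoid division by zero, and the reciprocal_ratio definition (ratios below 1 reported as the negative inverse) is an assumption about how that option behaves. SES is left out because it estimates a scaling factor between the samples rather than combining two bin values.

```python
import math


def compare_bins(chip, control, operation="log2", pseudocount=1.0):
    """Combine one ChIP bin value with the matching input bin value."""
    if operation == "difference":
        return chip - control
    if operation == "sum":
        return chip + control
    ratio = (chip + pseudocount) / (control + pseudocount)
    if operation == "ratio":
        return ratio
    if operation == "log2":
        return math.log2(ratio)
    if operation == "reciprocal_ratio":
        # Assumed definition: depletion is shown as a negative fold change.
        return ratio if ratio >= 1 else -1.0 / ratio
    raise ValueError(f"unknown operation: {operation}")


for name in ("ratio", "log2", "difference", "sum", "reciprocal_ratio"):
    print(name, compare_bins(7, 1, name), compare_bins(1, 7, name))
```

The log2 ratio is symmetric around zero for enrichment and depletion, while the difference keeps the original units; which one is useful depends on the question being asked.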

Practical III: bamCompare


Peak calling

From DNA reads to protein binding sites

3 main tasks of peak calling programs:
1. identify original fragments
2. identify enriched regions
3. assign significance

> 30 different programs, different solutions for each step

[Figure: sequenced reads (CATCGA…, ATCGCTG…, GCATTG…, CTACGGT…) around a protein bound to DNA]

Which peak caller to choose?

Again: base your decision on the kind of questions you would like to answer!

Table from Wilbanks & Facciotti, 2010

Why do we focus on MACS2?

One of the most widely used peak callers, also used by big consortia, e.g., (mod)ENCODE, Blueprint, NIH Roadmap (reproducibility! comparability!)
Under active development
Can be used for sharp and mixed signals

Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137

Peak Calling with MACS

1. Identify regions with enrichment, i.e., a large number of mapped reads (modeling by read shifting)

2. Determine peaks based on enrichments passing a p-value* threshold

3. Estimate the false discovery rate (FDR*) for each detected peak

Figure from Park, Nature Reviews Genetics, 2009

*p-value: probability of seeing an enrichment at least this strong if the null hypothesis were true (small p-value = PEAK!); null hypothesis: reads are randomly distributed throughout the genome following a Poisson distribution (input is used to parameterize the statistical model)

*q-value/FDR: how likely is it that the peak is not really a peak (false positive) given that testing is done genome-wide?
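The Poisson p-value described above can be worked through in plain Python. This is a minimal sketch, not the MACS implementation: the function name is hypothetical and the expected count is simply passed in, whereas a peak caller would estimate it from the input sample.

```python
import math


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    term = math.exp(-expected)  # P(X = 0)
    below = 0.0
    for k in range(observed):
        below += term                # add P(X = k)
        term *= expected / (k + 1)   # turn P(X = k) into P(X = k + 1)
    return max(0.0, 1.0 - below)


# 20 reads in a window where the background predicts 5.
p_value = poisson_upper_tail(20, 5.0)
print(p_value, -math.log10(p_value))
```

Subtracting from 1 loses precision for extremely small p-values, so a real implementation would sum the upper tail directly or work in log space. The sketch is only meant to show what the number means.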

How does MACS2 work?

● Fragment length modeling
– Sliding window (2x bandwidth)
– >M-fold enrichment over background (effective genome size!)
– Filter “peaks” (>5x, <50x enrichment)

● Peak calling
– Reads extended
– Samples scaled (need input)
– Sliding window (2x frag. len.)
– Enrichment according to Poisson variance
  ● Use local noise for filtering
  ● Use input variance for filtering
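The "local noise" idea can be sketched as follows, based on the description in the MACS paper (Zhang et al., 2008), where the expected count for a candidate peak is the largest of a genome-wide background rate and rates estimated from 1 kb, 5 kb and 10 kb windows in the control. The function name, argument names and numbers are hypothetical, and the scaling is simplified.

```python
def local_lambda(genome_rate, control_counts_by_window, scan_width):
    """Pick the most conservative expected read count for one scan window.

    genome_rate: expected reads per scan window from the genome-wide background
    control_counts_by_window: {window size in bp: control reads in that window}
    scan_width: width of the scan window in bp
    """
    candidates = [genome_rate]
    for window_size, count in control_counts_by_window.items():
        # Rescale the count from the larger window to the scan window width.
        candidates.append(count * scan_width / window_size)
    return max(candidates)


# Hypothetical control read counts around one candidate peak.
expected = local_lambda(
    genome_rate=2.0,
    control_counts_by_window={1000: 30, 5000: 60, 10000: 90},
    scan_width=400,
)
print(expected)
```

Taking the maximum means that a region with locally elevated input signal needs more ChIP reads before it is called as a peak.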

Properties influencing the peak calling
Know thy data!

Library complexity (How many duplicates? Overrepresented regions?)
Enrichment strength (IP success)
Width and nature of the enriched regions (narrow vs. broad vs. mixed)
No. of occupied sites
Range of the ChIP signal intensities

ALWAYS inspect your data visually and manually!

It’s not the peak caller that’s making sense of your data, it’s you!

FastQC
plotFingerprint
Genome Browser
computeGCBias

Important MACS parameters

Specify effective/mappable genome size (-g)
Fragment size: might be set manually, especially for paired-end data (for which fragment size can be determined separately, e.g., by Picard CollectInsertSizeMetrics)!
Broad peak calling (--broad) should be turned on for basically all histone marks except H3K4me3 and perhaps H3K27ac

Think about stringent filtering criteria on the peak lists computed by MACS!

If you’re not satisfied, play with the parameters!

You could make a workflow for the QC of peaks
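Filtering a peak list can be as simple as thresholding columns of the narrowPeak file. The sketch below assumes the standard narrowPeak layout, in which column 7 is the signal value (fold enrichment in MACS2 output) and column 9 is the -log10 q-value. The function name, the thresholds and the two example peaks are made up for illustration; sensible cutoffs depend on the data.

```python
def filter_narrowpeak(lines, min_fold=5.0, min_neg_log10_q=2.0):
    """Keep peaks that pass both a fold-enrichment and a q-value cutoff."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track")):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        fold = float(fields[6])          # column 7: signalValue
        neg_log10_q = float(fields[8])   # column 9: -log10(q-value)
        if fold >= min_fold and neg_log10_q >= min_neg_log10_q:
            kept.append(fields)
    return kept


example = [
    "chr1\t100\t300\tpeak_1\t250\t.\t8.5\t12.1\t9.8\t90",
    "chr1\t900\t1100\tpeak_2\t30\t.\t2.1\t3.0\t1.2\t75",
]
strong = filter_narrowpeak(example)
print([fields[3] for fields in strong])
```

In practice the lines would come from an open file handle on the MACS2 output, and the surviving peaks should still be checked in a genome browser.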

Practical IV: Peak Calling


Powerful visualization: heatmaps

Requirements:
bigWig file
bed file
question in mind!

use deepTools: computeMatrix, plotHeatmap, plotProfile

Possible questions:
What kinds of signal distributions do I see in my peaks?
How does my signal look around the TSS/TES/my favorite region?
How does my signal look when I assume the same size for all genes?

[Figure: computeMatrix → plotHeatmap, plotProfile]
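What such a matrix contains can be shown with a toy example. This is a conceptual sketch, not the deepTools code: the function names and the binned signal are hypothetical. Each row holds the signal around one region's reference point; the rows are what a heatmap displays, and the column means give the average profile.

```python
def reference_point_matrix(signal, centers, flank_bins):
    """One row per region: binned signal from flank_bins before to after each center."""
    rows = []
    for center in centers:
        row = []
        for offset in range(-flank_bins, flank_bins + 1):
            index = center + offset
            inside = 0 <= index < len(signal)
            row.append(signal[index] if inside else float("nan"))
        rows.append(row)
    return rows


def column_means(matrix):
    """Average each column, ignoring missing (NaN) values."""
    means = []
    for column in zip(*matrix):
        values = [value for value in column if value == value]  # NaN != NaN
        means.append(sum(values) / len(values) if values else float("nan"))
    return means


signal = [0, 1, 4, 9, 4, 1, 0, 2, 6, 2]  # hypothetical binned coverage
matrix = reference_point_matrix(signal, centers=[3, 8], flank_bins=2)
profile = column_means(matrix)
print(matrix)
print(profile)
```

The second region runs past the end of the signal, so its last bin is missing; handling such edge bins explicitly keeps them from distorting the average.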

Powerful visualization: PCA/etc.

Practical V: Visualization with deepTools

Advanced topics: GC Bias

use deepTools: computeGCBias, correctGCBias

Advanced Topics: Blacklisted Regions

Peaks in the same place
Regardless of ChIP type
Regardless of cell type
Regardless of experiment

These are false positives – known sites to ignore
https://sites.google.com/site/anshulkundaje/projects/blacklists

These regions can screw up scaling!

DeepTools: can ignore blacklisted sites

NGS utils: “BAM filter”

Bedtools intersect (with peaks to remove overlaps)
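Removing peaks that overlap blacklisted regions comes down to an interval overlap test. The sketch below is an assumption-laden illustration of that idea in plain Python, similar in spirit to what `bedtools intersect -v` reports; the function names and coordinates are made up, and intervals are treated as half-open, as in BED files.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def remove_blacklisted(peaks, blacklist):
    """Drop every peak that overlaps any blacklisted region on the same chromosome."""
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in peaks:
        regions = by_chrom.get(chrom, [])
        if not any(overlaps(start, end, b_start, b_end) for b_start, b_end in regions):
            kept.append((chrom, start, end))
    return kept


peaks = [("chr1", 100, 300), ("chr1", 5000, 5200), ("chr2", 100, 300)]
blacklist = [("chr1", 250, 400)]  # hypothetical blacklisted region
clean = remove_blacklisted(peaks, blacklist)
print(clean)
```

This linear scan is fine for a few thousand peaks; for larger sets a sorted or indexed lookup, or a dedicated tool, is the better choice.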

Now you can...

Assess ChIP quality/IP strength
Understand the difference between BAM files and bigWig files
Understand why, when and how one needs to normalize ChIP-seq data
Know the basics of peak calling (why and how is it done?)
Be able to work with the output from MACS2 (e.g., filtering of peaks, visualization in a Genome Browser)
Know how to generate coverage plots, e.g., heatmaps

Where to go for help

DeepTools:
https://groups.google.com/forum/#!forum/deeptools
http://deeptools.readthedocs.org

MACS google group:
https://groups.google.com/forum/#!forum/macs-announcement
https://github.com/taoliu/MACS/wiki

Biostars: www.biostars.org

Galaxy support: http://biostar.usegalaxy.org

Slides will be posted to GCC2016 website!
Slides/datasets/pages will be @ http://deeptools.ie-freiburg.mpg.de