chipseq analysis using deeptools and...
Topics to be covered
• Assess ChIP quality/IP strength.
• Understand the difference between BAM files and bigWig files.
• Understand why, when and how one needs to normalize ChIP-seq data.
• Know the basics of peak calling (why and how is it done?).
• Be able to work with the output from MACS2 (e.g., filtering of peaks, visualization in a Genome Browser).
• Know how to generate coverage plots, e.g., heatmaps.
Outline
• ChIP efficiency
• Coverage files
  – Depth normalization
  – Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part
Calculating the fingerprint
bins along the genome (reads per bin): 2 3 4 2 3 3 4 4 5 7 8
sorting: 2 2 3 3 3 4 4 4 5 7 8
scaling (each value divided by the maximum, 8): 0.25 0.25 0.375 0.375 0.375 0.5 0.5 0.5 0.625 0.875 1
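The sorting and scaling steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plotFingerprint implementation; the function name `fingerprint` is hypothetical, and the cumulative fraction it also returns is an assumption about what is usually drawn on the y-axis of a fingerprint plot.

```python
def fingerprint(bin_counts):
    """Sort per-bin read counts, scale them, and accumulate them.

    Returns (sorted counts, counts scaled by the maximum,
    cumulative fraction of all reads). Illustrative sketch only.
    """
    if not bin_counts or max(bin_counts) <= 0:
        raise ValueError("need at least one bin with reads")
    ordered = sorted(bin_counts)
    top = ordered[-1]
    scaled = [count / top for count in ordered]

    total = sum(ordered)
    running = 0
    cumulative = []
    for count in ordered:
        running += count
        cumulative.append(running / total)
    return ordered, scaled, cumulative


ordered, scaled, cumulative = fingerprint([2, 3, 4, 2, 3, 3, 4, 4, 5, 7, 8])
print(ordered)
print([round(value, 3) for value in scaled])
print([round(value, 3) for value in cumulative])
```

A perfectly uniform sample would give a straight diagonal for the cumulative curve; the more reads pile up in a few bins, the more the curve bends toward the lower right.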
IP Strength
[Figure: fingerprint plots, each with an Input and a ChIP curve. Panel annotations:]
• High IP enrichment
• Similar genome coverage (ca. 90%)
• Insufficient genome coverage
• Input deviates from straight line (uniformity)
Outline
• ChIP efficiency
• Coverage files
  – Depth normalization
  – Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part
NGS file formats ~ analysis steps
FASTQ
sequenced DNA fragments (reads): DNA sequence only
@Read1
GATTTGGGGTTCAAAGCAGTAT
+
@Read2
CGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
→ read alignment
SAM, BAM
reads: DNA sequence + genomic localization
r1 163 chr1 7 30 8M2I4M1D3M = 37 39 TTAGATGATTG *
r2 0 chr1 9 30 3S6M1P1I4M * 0 0 AAAAGGATA *
→ counting reads, normalizing for sequencing depth etc.
bedGraph, bigWig
coverage files: read numbers per genomic region
chr1 10 20 1.5
chr1 20 30 1.7
chr1 30 40 2.0
chr1 40 50 1.8
→ DOWNSTREAM ANALYSES
Computing coverage
chr2 100100 100120 5
chr2 100121 100141 3.2
chr2 100142 100163 13.8
39V34V1:38:C0RLHACXX:4:1216:16137:31969 163 chr1 3000307 42 51M = 3000408 152 CTGTAGTTACTGTTTGCTTACCTAGATTCTTCTTTTCCAGAATTCTCTTAG CCCFFFFFHHHGHIIJIJJJJIIGHFGIGIJIIJJJHIHEHIGIIIIJJGF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YS:i:0 YT:Z:CP
size reduction leads to many advantages of bigWig files over BAMs:
• data storage
• data sharing
• intuitive visualization via genome browsers
• more efficient for downstream analyses
• …
aim: reduce the vast amount of information from the BAM file to the simple information: How many reads do I have (per bp/genomic bin/…)?
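[Figure: reads along the genome are counted in bins (e.g. 50 bp); per-bin read counts as extracted from the slide: 1 2 2 5 5 6 6 6 4 46; reads are shown next to the fragments they come from]
DNA → sonicated to ~200 bp fragments → 50-100 base reads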
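To make the binning idea concrete, here is a minimal sketch that counts reads per fixed-size bin and optionally extends each read to an assumed fragment length. It works on plain tuples instead of a BAM file and is not how bamCoverage is implemented; the function name `bin_coverage`, the tuple layout and the example coordinates are all made up for illustration.

```python
def bin_coverage(reads, region_length, bin_size=50, extend_to=None):
    """Count how many reads overlap each bin of a region.

    reads: iterable of (start, end, strand) with 0-based, half-open
           coordinates and strand "+" or "-".
    extend_to: if given, reads shorter than this are extended in their
               sequencing direction to approximate the original fragment.
    """
    n_bins = (region_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start, end, strand in reads:
        if extend_to is not None and end - start < extend_to:
            if strand == "+":
                end = start + extend_to
            else:
                start = end - extend_to
        # Clip to the region so extended reads cannot run off the ends.
        start = max(start, 0)
        end = min(end, region_length)
        if start >= end:
            continue
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        for index in range(first_bin, last_bin + 1):
            counts[index] += 1
    return counts


example_reads = [(0, 50, "+"), (40, 90, "+"), (120, 170, "-")]
print(bin_coverage(example_reads, region_length=200))
print(bin_coverage(example_reads, region_length=200, extend_to=100))
```

Extending reads toward the fragment length smooths the signal, because a 50-100 base read only covers part of the ~200 bp fragment that was actually pulled down.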
Outline
• Coverage files
  – Depth normalization (bamCoverage)
  – Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part
Input samples – They’re important!!!
Input controls should be treated exactly the same* as ChIP samples except for the antibody treatment!
*same cell type, same shearing, same PCRs, same experimenter, …
(not only) gene-rich regions = bias-rich regions (especially applicable to old sequencing data)
Comparative coverage of BAM files
typical application: input normalization for a ChIP sample
aim: reduce the background signal in the ChIP signal, based on the input
main caveat: the same genomic region is never covered exactly the same way, which is neither the input’s nor the ChIP’s fault
Normalization by total read count: very straightforward, perhaps too simple?
Normalization by SES (Diaz et al., 2012): more sophisticated, based on bamFingerprint (greatest distance between input and ChIP), not recommended for broad marks due to weaker enrichment
Which normalization method?
ratio, log2(ratio), difference, sum, reciprocal_ratio, SES
Base your decision on the kind of questions you would like to answer!
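As an illustration of normalization by total read count followed by a log2 ratio, here is a small sketch on per-bin counts. It assumes both libraries are scaled down to the size of the smaller one and that a pseudocount of 1 is added to avoid division by zero; both choices, and all function names, are assumptions for this example rather than a description of what bamCompare does internally.

```python
import math


def scale_to_smaller_library(chip_total, input_total):
    """Return (chip_factor, input_factor) that equalize sequencing depth."""
    smaller = min(chip_total, input_total)
    return smaller / chip_total, smaller / input_total


def log2_ratio_per_bin(chip_bins, input_bins, chip_total, input_total,
                       pseudocount=1.0):
    """Depth-normalize both samples, then compute log2(ChIP / input) per bin."""
    chip_factor, input_factor = scale_to_smaller_library(chip_total, input_total)
    ratios = []
    for chip, control in zip(chip_bins, input_bins):
        numerator = chip * chip_factor + pseudocount
        denominator = control * input_factor + pseudocount
        ratios.append(math.log2(numerator / denominator))
    return ratios


# Hypothetical numbers: the input was sequenced twice as deeply as the ChIP.
ratios = log2_ratio_per_bin([10, 40], [20, 20], chip_total=1000, input_total=2000)
print(ratios)
```

In this toy case the first bin comes out at 0 (no enrichment once depth is accounted for), while the second bin stays clearly positive.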
Outline
• ChIP efficiency
• Coverage files
  – Depth normalization
  – Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part
From DNA reads to protein binding sites
3 main tasks of peak calling programs:
1. identify original fragments
2. identify enriched regions
3. assign significance
> 30 different programs, different solutions for each step
[Figure: sequenced reads (CATCGA…, ATCGCTG…, GCATTG…, CTACGGT…) and the DNA-bound protein]
Which peak caller to choose?
Again: base your decision on the kind of questions you would like to answer!
Table from Wilbanks & Facciotti, 2010
Why do we focus on MACS2?
• One of the most widely used peak callers, also used by big consortia, e.g., (mod)ENCODE, Blueprint, NIH Roadmap (reproducibility! comparability!)
• Under active development
• Can be used for sharp and mixed signals
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137
Peak Calling with MACS
1. Identify regions with enrichment, i.e., a large no. of mapped reads (modeling by read shifting)
2. Determine peaks based on enrichments passing p-value* threshold
3. Estimate false discovery rate (FDR*) for each detected peak
Figure from Park, Nature Reviews Genetics, 2009
*p-value: probability of seeing an enrichment at least this strong if the null hypothesis were true; a very small p-value means the enrichment is stronger than expected (= PEAK!); null hypothesis: reads are randomly distributed throughout the genome following a Poisson distribution (input is used to parameterize the statistical model)
*q-value/FDR: how likely is it that the peak is not really a peak (false positive) given that testing is done genome-wide?
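The two footnotes can be illustrated with a short sketch: a Poisson tail probability for the read count in a window, and a multiple-testing correction across windows. The Benjamini-Hochberg procedure is used here as one common way to obtain q-values; the slides do not name the method, so treat that choice, the function names and the example numbers as assumptions.

```python
import math


def poisson_sf(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected).

    Simple summation; numerically crude for very small p-values.
    """
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        qvalues[index] = running_min
    return qvalues


# Hypothetical windows: observed ChIP reads vs. reads expected from background.
windows = [(12, 2.0), (3, 2.0), (25, 4.0)]
pvals = [poisson_sf(obs, exp) for obs, exp in windows]
qvals = benjamini_hochberg(pvals)
for (obs, exp), p, q in zip(windows, pvals, qvals):
    print(obs, exp, p, q)
```

The q-value of a window is never smaller than its p-value, which reflects the extra caution needed when many windows are tested genome-wide.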
How does MACS2 work?
● Fragment length modeling
  – Sliding window (2x bandwidth)
  – >M-fold enrichment over background (effective genome size!)
  – Filter “peaks” (>5x, <50x enrichment)
● Peak calling
  – Reads extended
  – Samples scaled (need input)
  – Sliding window (2x frag. len.)
  – Enrichment according to Poisson variance
    ● Use local noise for filtering
    ● Use input variance for filtering
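One way to picture "use local noise for filtering" is a background rate that is the maximum of the genome-wide rate and the rates seen in the control within larger windows around the candidate region, in the spirit of the local lambda described in the MACS publication. The sketch below is a simplified illustration under that assumption; the function name, the window widths and the counts are hypothetical and it is not MACS2 code.

```python
def local_lambda(control_counts, window, genome_lambda):
    """Pick the most conservative expected read count for a candidate window.

    control_counts: dict mapping a surrounding region width (bp) to the
                    number of control reads counted in that region.
    window:         width (bp) of the candidate window being tested.
    genome_lambda:  expected reads per window from the genome-wide average.
    """
    rates = [genome_lambda]
    for width, count in control_counts.items():
        # Rescale the count from the larger region to the window size.
        rates.append(count * window / width)
    return max(rates)


# Hypothetical control read counts in 1 kb, 5 kb and 10 kb around a candidate.
expected = local_lambda({1000: 10, 5000: 30, 10000: 40},
                        window=200, genome_lambda=1.5)
print(expected)
```

Taking the maximum means a region that is already noisy in the input needs more ChIP reads before it is called a peak.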
Properties influencing the peak calling
Know thy data!
• Library complexity (How many duplicates? Overrepresented regions?)
• Enrichment strength (IP success)
• Width and nature of the enriched regions (narrow vs. broad vs. mixed)
• No. of occupied sites
• Range of the ChIP signal intensities
ALWAYS inspect your data visually and manually!
It’s not the peak caller that’s making sense of your data, it’s you!
Tools shown: FastQC, plotFingerprint, Genome Browser, computeGCbias
Important MACS parameters
• Specify effective/mappable genome size (-g)
• Fragment size: might be set manually, especially for paired-end data (for which fragment size can be determined separately, e.g., by Picard CollectInsertSizeMetrics)!
• Broad peak calling (--broad) should be turned on for basically all histone marks except H3K4me3 and perhaps H3K27ac
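The parameters above map onto a `macs2 callpeak` command line. The sketch below only assembles the argument list so the choices are explicit; it does not run MACS2. The helper name and the file names are hypothetical, and fixing the fragment size via `--nomodel --extsize` is one possible way to set it manually.

```python
def macs2_callpeak_args(treatment, control, name, genome_size="hs",
                        broad=False, fragment_size=None):
    """Assemble a macs2 callpeak argument list (nothing is executed)."""
    args = ["macs2", "callpeak",
            "-t", treatment,          # ChIP sample
            "-c", control,            # input control
            "-n", name,               # prefix for output files
            "-g", str(genome_size)]   # effective genome size
    if broad:
        args.append("--broad")
    if fragment_size is not None:
        # Skip MACS2's own fragment size model and use a fixed extension.
        args += ["--nomodel", "--extsize", str(fragment_size)]
    return args


# Hypothetical file names for a broad histone mark.
command = macs2_callpeak_args("chip.bam", "input.bam", "H3K27me3",
                              broad=True, fragment_size=200)
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine where MACS2 is installed.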
Think about stringent filtering criteria on the peak lists computed by MACS!
If you’re not satisfied, play with the parameters!
You could make a workflow for the QC of peaks
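A minimal sketch of such a filter on a narrowPeak-style peak list follows. It assumes the usual tab-separated narrowPeak layout, in which column 7 holds the fold enrichment and column 9 the -log10 q-value; the thresholds, the function name and the two example peaks are made up and should be adapted to the data at hand.

```python
import math


def filter_narrowpeak(lines, min_fold=5.0, max_q=0.01):
    """Keep peaks with enough fold enrichment and a small enough q-value."""
    min_neglog_q = -math.log10(max_q)
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        fold_enrichment = float(fields[6])  # column 7: signal value
        neglog_q = float(fields[8])         # column 9: -log10(q-value)
        if fold_enrichment >= min_fold and neglog_q >= min_neglog_q:
            kept.append(line)
    return kept


# Two hypothetical peaks: one strong, one weak.
peaks = [
    "chr1\t100\t300\tpeak_1\t250\t.\t8.5\t12.3\t9.8\t90",
    "chr1\t900\t1100\tpeak_2\t30\t.\t2.1\t3.0\t1.2\t75",
]
strong = filter_narrowpeak(peaks)
print(strong)
```

Whatever thresholds are chosen, the surviving peaks should still be checked in a Genome Browser.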
Outline
• ChIP efficiency
• Coverage files
  – Depth normalization
  – Input normalization
• Peak calling
  – Why? How?
  – Types of peaks: sharp, broad, mixed
• Downstream processing
  – This is the actually interesting part
Powerful visualization: heatmaps
Requirements:
• bigWig file
• bed file
• question in mind!
use deepTools: computeMatrix, then plotHeatmap or plotProfile
Possible questions:
• What kinds of signal distributions do I see in my peaks?
• How does my signal look around the TSS/TES/my favorite region?
• How does my signal look when I assume the same size for all genes?
• …
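Conceptually, the matrix behind a heatmap has one row per region and one column per position relative to a reference point; the profile is the column-wise mean. The sketch below shows that idea on a plain list of binned coverage values. It is not the deepTools implementation, and the function names and numbers are illustrative assumptions.

```python
def signal_matrix(coverage, centers, flank):
    """Collect coverage values around each center (in bin units).

    Each row holds 2 * flank + 1 values; regions that would run past
    either end of the coverage vector are skipped.
    """
    rows = []
    for center in centers:
        if center - flank < 0 or center + flank >= len(coverage):
            continue
        rows.append(coverage[center - flank:center + flank + 1])
    return rows


def mean_profile(matrix):
    """Average the matrix column by column (what a profile plot shows)."""
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]


# Hypothetical binned coverage with two peak summits at bins 2 and 7.
coverage = [0, 1, 5, 1, 0, 0, 2, 6, 2, 0]
matrix = signal_matrix(coverage, centers=[2, 7], flank=1)
print(matrix)
print(mean_profile(matrix))
```

Sorting the rows, for example by their total signal, before plotting is what gives heatmaps their characteristic ordered look.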
Advanced topics: GC Bias
use deepTools: computeGCbias, correctGCbias
Advanced Topics: Blacklisted Regions
• Peaks in the same place
  – Regardless of ChIP type
  – Regardless of cell type
  – Regardless of experiment
• These are false positives – known sites to ignore: https://sites.google.com/site/anshulkundaje/projects/blacklists
• These regions can screw up scaling!
• deepTools: can ignore blacklisted sites
• NGS utils: “BAM filter”
• Bedtools intersect (with peaks to remove overlaps)
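Removing peaks that overlap blacklisted regions boils down to an interval overlap test per chromosome. The following sketch shows the logic on plain tuples; for real data a dedicated tool such as bedtools is the practical choice. The function names and all coordinates are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def remove_blacklisted(peaks, blacklist):
    """Drop every peak that overlaps a blacklisted interval.

    peaks, blacklist: lists of (chrom, start, end) tuples.
    """
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in peaks:
        hits = by_chrom.get(chrom, [])
        if not any(overlaps(start, end, b_start, b_end) for b_start, b_end in hits):
            kept.append((chrom, start, end))
    return kept


# Hypothetical peaks and one blacklisted region on chr1.
peaks = [("chr1", 100, 300), ("chr1", 5000, 5200), ("chr2", 100, 300)]
blacklist = [("chr1", 250, 1000)]
print(remove_blacklisted(peaks, blacklist))
```

The linear scan is fine for a few thousand peaks; sorted intervals or an interval tree would be the next step for larger lists.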
Now you can...
• Assess ChIP quality/IP strength.
• Understand the difference between BAM files and bigWig files.
• Understand why, when and how one needs to normalize ChIP-seq data.
• Know the basics of peak calling (why and how is it done?).
• Be able to work with the output from MACS2 (e.g., filtering of peaks, visualization in a Genome Browser).
• Know how to generate coverage plots, e.g., heatmaps.
Where to go for help
deepTools: https://groups.google.com/forum/#!forum/deeptools and http://deeptools.readthedocs.org
MACS Google group: https://groups.google.com/forum/#!forum/macs-announcement and https://github.com/taoliu/MACS/wiki
Biostars: www.biostars.org
Galaxy support: http://biostar.usegalaxy.org
Slides will be posted to the GCC2016 website!
Slides/datasets/pages will be @ http://deeptools.ie-freiburg.mpg.de