bioconductor · created date: 7/31/2014 3:24:58 pm
TRANSCRIPT
![Page 1: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/1.jpg)
Visualisation and assessment ofChIP-seq quality
Thomas Carroll
Head of Bioinformatics,MRC Clinical Sciences Centre,
Imperial College London
BioC 2014
![Page 2: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/2.jpg)
ChIP-seq is noisy
● ChIP-seq/ChiP-exo/DNA-seq/MNase-seq isnoisy.
● Experimental biases:● Fragmentation/digestion.● IP strength/efficiency and specificity.● PCR Bias (Overamplification from low starting material)
● Highly variable patterns of enrichment betweenChIPs.
● Transcription factors may show sharp/narrow peaks.● Polymerase II will show mix of sharp/narrow and
dispersed/broad peaks
![Page 3: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/3.jpg)
Always visualise your data
● Coverage graphs.● Wigs (Okay)● bedGraphs (Okay)● BigWigs (Great)
● Allows for quickassessment of data...
..but dependent on user'sinterpretation/experience.
![Page 4: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/4.jpg)
High-thoughput ChIP-seq qualitycontrol with ChIPQC
● Need methods to quantify informativecharacteristics about your ChIP-seq data.
● ChIPQC – Tom Carroll and Rory Stark (Diffbind).
● ChIPQC provides workflow to generate metricsper sample/experiment.
![Page 5: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/5.jpg)
ChIP-seq metrics
● Distribution of Signal● Clustering of Watson/Crick reads.● Duplication Rate.
![Page 6: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/6.jpg)
Distribution of Signal
– Within enriched regions
– Within/across expected annotation
– Across the genome
– Within known artefact regions
![Page 7: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/7.jpg)
Signal in Peaks (FRIP)
● The simplestassessment ofenrichment.– Call enriched
regions over input
– Measure fractionof reads in peaks(FRIP)
– Good quality TF >5%
– Good quality Pol-II> 30%
![Page 8: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/8.jpg)
Relative Enrichment in Genomic Intervals (REGI).
● Expected enrichmentin genomic regions
● Plot relativeenrichment of reads inannotated regions.
![Page 9: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/9.jpg)
Signal in Blacklists (FRIBL)● Work from Encode (Kudaje A) has produced curated
list of conserved high signal artefact regions.● Available for many species including human, mouse
and drosophila genomes.● Represent around 0.5% of genome.
● Can account for high proportion of total signal (> 10%).
![Page 10: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/10.jpg)
Why worry about blacklists?
● Can affect -– Normalisation
between samples.
– Fragment lengthestimation.
– Quality metrics forChIP-seq.
Carroll et al 2014
![Page 11: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/11.jpg)
Global signal profile
● A simple method toreview global distributionis as histograms.
● More enriched samplesshow higher number ofbases at greater depths
● Input samples showhigher number of bases atlow depths
![Page 12: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/12.jpg)
Global Signal Profile
● Presence of stretchof high signal depth
● Identify anomaloussignal region ascandidate forblacklisting.
![Page 13: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/13.jpg)
Metric of Global Signal Profile -SSD
● SSD developed in htseqtools package.● Normalised standard deviation of coverage.● Provides measure of pile-up across genome
● Sample with regions of high signal (High SSD score)● Sample with low signal across genome (Low SSD score)
● Provides no measure of signal structure.
![Page 14: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/14.jpg)
SSD and Blacklists
● SSD is very sensitivehigh signal artefactregions.
● Input SSD scoresreduced afterBlacklisting
● Sample SSD scoresremain higher.
Carroll et al 2014
![Page 15: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/15.jpg)
Clustering of Watson/Crickreads.
![Page 16: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/16.jpg)
Watson and Crick reads clusteraround epigenetic marks
● ChIP-seq is typicallysingle ended.
● ChIP-seq watson andcrick reads clusteraround bindingevents.
● For transcriptionfactors the extent ofthis clustering relatedto ChIP-seq quality.
![Page 17: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/17.jpg)
Assessing W/C read clustering● Convert total coverage to cross-coverage scores to
allow for comparison between samples (and regions)
● Cross-Coverage Score =(Coverage0 – Coverage
n)/Coverage
0
● Frag_CC = Cross-coverage score atfragment length.
![Page 18: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/18.jpg)
Assessing W/C read clustering
● Slide Watson reads along binding site (5' to 3').
● Total area covered by signal will reduce after shiftingWatson reads by fragment length
![Page 19: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/19.jpg)
Assessing W/C read clustering
● Applied across genome.
● Expect reduction at fragment length.
![Page 20: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/20.jpg)
Read-length cross-coverage peak
● Blacklisted regionsstrongly contributeto read lengthcross-coveragepeak
● Rel_CC = Frag_CC/read length cross-coverage score.
Carroll et al 2014
![Page 21: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/21.jpg)
Duplication Rate
![Page 22: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/22.jpg)
Duplicate FAQ
● Typically ChIP-seq is singleend sequenced– Reads with same start
position consideredduplicates
● Removing duplicatessaturates dynamic range ofsignal.– Maximum signal at base is
2*read length
![Page 23: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/23.jpg)
Why worry about duplicates
● “Read duplicates arise from experimentalartefacts”– Is true
● “All read duplicates arise from experimentalartefacts”– Is false.
● So we need to consider that duplicates may beenriched for artefacts..
● ..but contribute to genuine ChIP-signal
![Page 24: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/24.jpg)
Duplicates (the bad kind)
● Low starting material.– If initial starting material is low this can lead to
overamplification of this material.
– Biases in PCR will compound this problem.
– Can lead to artificially enriched regions.
![Page 25: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/25.jpg)
Duplicates (bad kind 2)
● Blacklists with ultra high signal are high induplicates.
● Masking blacklisted regions prior to analysisremoves this problem
![Page 26: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/26.jpg)
Duplicates (The Good andMisunderstood)
● Duplicates will also exist within highly efficient(or even inefficient ChIP) when deeplysequenced ChIP.
● Removal of duplicates can lead to asaturation and so underestimation of ChIP-signal!
![Page 27: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/27.jpg)
Duplicates
● Consider enrichment efficiency and sequencingdepth.
● Remove duplicates prior to peak calling.● Retain duplicates for differential binding
analysis.
![Page 28: Bioconductor · Created Date: 7/31/2014 3:24:58 PM](https://reader036.vdocuments.us/reader036/viewer/2022071006/5fc37bf70815655d6b3fa479/html5/thumbnails/28.jpg)
Practical.
● All data is /data/ChIPQC/● Handout and R code in /data/ChIPQC/ or on
Bioc2014 materials page.● We will work through first examples.● Few questions using what we learnt.