ChIP-seq data: filtering and mapping reads
D. Puthier, C. Rioualen, J. van Helden
Galaxy Workshop — Cuernavaca, 2017
● Estrogen-receptor (ESR1) is a key factor in breast cancer development.
● Goal of the study: understand the dependency of ESR1 binding on presence of cofactors, in particular GATA3, which is mutated in breast cancers.
● Approaches: GATA3 silencing (siRNA), ChIP-seq on ESR1 in WT vs. siGATA3 conditions, chromatin profiling.
Dataset used
Theodorou,V., Stark,R., Menon,S. and Carroll,J.S. (2013) GATA3 acts upstream of FOXA1 in mediating ESR1 binding by shaping enhancer accessibility. Genome Res, 23, 12–22.
● Sequence Read Archive (SRA): https://www.ncbi.nlm.nih.gov/sra
○ Provides access to unaligned reads in sra format
○ SRA read files need to be converted to fastq (see later).
○ Linked to Gene Expression Omnibus (GEO)
■ https://www.ncbi.nlm.nih.gov/geo/
● European Nucleotide Archive (ENA): http://www.ebi.ac.uk/ena
○ The European database of short read sequences.
○ Provides direct access to raw reads in fastq format.
○ Linked to ArrayExpress
■ https://www.ebi.ac.uk/arrayexpress/
Read archives
Protocol
● Go to GEO web site (https://www.ncbi.nlm.nih.gov/geo/).
● Choose "Search" and paste GSE40129 (GSE stands for GEO Series Experiment). Click "GO" to get information about this experiment.
● In the "sample section" (middle of the page), click on "More" to visualize all sample names. Click on GSM986059 hyperlink (GSM stands for GEO SaMple) to get information about this sample.
● In the "relations" section, select SRX176856 hyperlink to open the SRA page corresponding to this sample.
● Click on the SRR link (bottom left) to access the record of the run.
NB: You can also get sequence data from the website of the European Nucleotide Archive (ENA): https://www.ebi.ac.uk/ena
Getting information about the study
Exercises
● Q1: What is the HTS platform used to sequence this sample?
● Q2: Is this experiment single-end or paired-end sequencing?
● Q3: How many runs (i.e. lanes) are associated with this sample?
● Q4: How many reads were produced (# of Spots)?
● Q5: Select the hyperlink to the run SRR540188. What is the sequence of the first read?
Getting information about the study
Raw data
The raw data are provided in fastq format
■ Header
■ Sequence
■ + (optional header)
■ Quality (Sanger quality score or other format)
@QSEQ32.249996 HWUSI-EAS1691:3:1:17036:13000#0/1 PF=0 length=36
GGGGGTCATCATCATTTGATCTGGGAAAGGCTACTG
+
=.+5:<<<<>AA?0A>;A*A################
@QSEQ32.249997 HWUSI-EAS1691:3:1:17257:12994#0/1 PF=1 length=36
TGTACAACAACAACCTGAATGGCATACTGGTTGCTG
+
DDDD<BDBDB??BB*DD:D#################
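The four-line record layout described above (header, sequence, "+" line, quality) can be read with a few lines of Python. This is a minimal sketch, assuming each record occupies exactly four lines (no wrapped sequences); the names `FastqRecord` and `parse_fastq` are hypothetical and not part of any tool used in this course.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str    # text after the leading "@"
    sequence: str  # base calls
    quality: str   # one ASCII-encoded quality character per base


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly 4 lines per record."""
    lines = [line.rstrip("\r\n") for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of 4 lines")
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"record header must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"third line must start with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality must have the same length")
        yield FastqRecord(header[1:], sequence, quality)
```

Applied to the two reads shown on the slide, the parser would return two records whose sequence and quality strings both have 36 characters, matching `length=36` in the headers.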
The Sanger quality score
● Sanger quality score (Phred quality score): measures the quality of each base call
○ Based on p, the error probability (the probability that the corresponding base call is incorrect)
○ Qsanger = -10*log10(p)
○ p = 0.01 <=> Qsanger = 20
● Quality scores are encoded as ASCII characters with an offset of 33
● Note that SRA has adopted the Sanger quality score, although original fastq files may use a different quality score (see: http://en.wikipedia.org/wiki/FASTQ_format)
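The relation Qsanger = -10*log10(p) can be checked numerically in both directions. The sketch below is only an illustration of the formula; the function names are hypothetical.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred/Sanger quality score into an error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) into a Phred/Sanger score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)
```

For instance, a score of 20 corresponds to one expected error in 100 base calls, and a score of 30 to one in 1000.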
ASCII 33 encoding
● Storing PHRED scores as single characters gives a simple and space-efficient encoding:
○ Range 0-40
○ ! is 0
○ " is 1
○ # is 2
○ $ is 3
○ …
○ I is 40
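The ASCII 33 encoding amounts to adding or subtracting 33 from the character code. Here is a minimal sketch, assuming the Sanger offset of 33; `decode_quality` and `encode_quality` are hypothetical helper names.

```python
PHRED_OFFSET = 33  # Sanger / ASCII 33 encoding


def decode_quality(quality_string: str, offset: int = PHRED_OFFSET) -> list[int]:
    """Turn a FASTQ quality string into a list of Phred scores."""
    scores = [ord(char) - offset for char in quality_string]
    if any(score < 0 for score in scores):
        raise ValueError("character below the encoding offset")
    return scores


def encode_quality(scores: list[int], offset: int = PHRED_OFFSET) -> str:
    """Turn a list of Phred scores into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)
```

Decoding `=.+5:` with this sketch gives the scores 28, 13, 10, 20 and 25.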
Exercises
● In the short read below, the first 5 residues are Gs, but they are associated with different quality scores. Compute the error probability (p) associated with each of them:
Exercise: Qsanger quality score conversion
@QSEQ32.249996 HWUSI-EAS1691:3:1:17036:13000#0/1 PF=0 length=36
GGGGGTCATCATCATTTGATCTGGGAAAGGCTACTG
+
=.+5:<<<<>AA?0A>;A*A################
Symbol  ASCII  Qsanger = -10*log10(p)  p
.       46     13                      0.05
+       43     10                      0.1
5       53     20                      0.01
:       58     25                      3.2e-3
=       61     28                      1.6e-3
A       65     32                      6.3e-4
#       35     2                       0.63
Protocol
● Open a connection to the TIB2017 server (http://132.248.220.36/). Two solutions:
a. Enter the login and password you have received prior to the beginning of the school
b. Register in the “User” Menu
Connecting to the galaxy server
About Galaxy…
Protocol
1. In the upper right corner, click on Unnamed history and rename this workspace to ChIP-Seq_mapping.
2. Use Shared Data > Data libraries > Theodorou > FASTQ
3. Select siNT_ER_E2_r1_SRX176856_chr1.fq. Click on "to History".
4. Set Select history to ChIP-Seq_mapping. Click import.
5. Go to your history and use the pencil icon associated with the dataset to rename it to “ESR1_chr1.fq”.
Getting a dataset from the shared library
Protocol
1. Select FastQC from the toolbox.
2. Select ESR1_chr1.fq as input dataset. Press execute.
3. Display the data for the corresponding result in your history (right panel).
Q: Carefully inspect all the statistics. What do you think of the overall quality of the sequencing ?
NB: Here is a comprehensive documentation on how to interpret FastQC results: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/
Quality Control with FastQC
Quality control for high-throughput sequence data
● First step of analysis
○ Quality control
○ Ensure proper quality of selected reads
Quality control with FastQC program
[Figure: FastQC plots. Axes shown: quality vs. position in read; number of reads vs. mean Phred score.]
Look also at over-represented sequences
● A pre-processing step
○ Read ends with poor quality values are trimmed (most generally the right end)
○ May be a crucial step when working with aligners that perform global alignments
■ Lots of reads may be unmapped
● Several software tools for read trimming
○ Sickle (sliding window-based trimming)
○ FASTX-Toolkit (cut a defined number of nucleotides)
○ Trimmomatic
○ Cutadapt (delete ends using bwa algorithm)...
Read Trimming
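To give an idea of what sliding window-based trimming means, here is a simplified sketch. It is not Sickle's actual algorithm: it only trims the right end, cutting the read at the start of the first window whose mean quality falls below the threshold, and it discards reads that end up too short. The window size of 5 is an arbitrary assumption; the default thresholds (20 and 25) are the values used in the protocol of this course.

```python
from typing import Optional


def sliding_window_trim(
    sequence: str,
    qualities: list[int],
    threshold: float = 20,
    window: int = 5,
    min_length: int = 25,
) -> Optional[tuple[str, list[int]]]:
    """Trim the 3' end of a read; return None if the read becomes too short."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    if not sequence:
        return None
    cut = len(sequence)
    # Slide the window from the 5' end towards the 3' end.
    for start in range(max(len(sequence) - window + 1, 1)):
        scores = qualities[start:start + window]
        if sum(scores) / len(scores) < threshold:
            cut = start  # everything from this window onwards is removed
            break
    if cut < min_length:
        return None
    return sequence[:cut], qualities[:cut]
```

With a 36-base read whose last 6 bases have a score of 2 and all others a score of 30, this sketch keeps the first 27 bases; a read of uniformly low quality is discarded.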
Protocol
1. Search for the Sickle tool using the Galaxy search engine (upper left corner).
2. Set Single-End or Paired-End reads to Single-end.
3. Select the file ESR1_chr1.fq.
4. Set Quality Threshold to 20 and Length Threshold to 25.
5. Execute.
6. Rename Single-End output of Sickle to ESR1_chr1_sickle.fq.
7. Perform a new FastQC analysis using the trimmed reads as input.
Q: How many reads do you retrieve after trimming? How does it compare with the input fastq file?
Trimming
Protocol
1. Select FastQC from the toolbox.
2. Select ESR1_chr1_sickle.fq as input dataset. Press execute.
3. Rename the new fastQC result ESR1_chr1_sickle_fastQC.
4. Display the corresponding result by clicking on the eye of the fastQC Webpage.
Q: Carefully inspect all the statistics. What do you think of the overall quality of the sequencing ? Compare the number of reads before and after trimming.
Quality Control on trimmed data
Mapping
Mapping
• Find out the position of the reads within the genome
[Figure: reads aligned to the reference genome.]
• One position in the genome
• Many possible positions (repeat regions, duplicated regions, pseudogenes…)
Seed and extend
● A seed is mapped to several positions
○ Check whether flanking bases are compatible with the read sequence
● An index has to be produced before the mapping to store the coordinates of seeds (k-mers).
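The seed-and-extend idea can be sketched on a toy genome. This is a simplified illustration, not the algorithm of any specific aligner: the seed is assumed to be the first k bases of the read, only the forward strand is searched, and the extension step only counts mismatches (no insertions or deletions). All function names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int) -> dict[str, list[int]]:
    """Index step: store the coordinates of every k-mer (seed) of the genome."""
    index = defaultdict(list)
    for position in range(len(genome) - k + 1):
        index[genome[position:position + k]].append(position)
    return index


def seed_and_extend(
    read: str, genome: str, index: dict[str, list[int]], k: int,
    max_mismatches: int = 1,
) -> list[tuple[int, int]]:
    """Return (position, mismatches) for each seed hit that survives extension."""
    hits = []
    for position in index.get(read[:k], []):
        candidate = genome[position:position + len(read)]
        if len(candidate) < len(read):
            continue  # the read would run past the end of the genome
        # Extension: check whether the flanking bases agree with the read.
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            hits.append((position, mismatches))
    return hits
```

In the toy genome `ACGTACGTTAGCACGTTT`, the seed `ACGT` is found at three positions, but the read `ACGTTA` extends with at most one mismatch at only two of them.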
Protocol
1. From the tool panel, select Bowtie2.
2. Set Is this single or paired library to Single-end.
3. Set FASTQ file to ESR1_chr1_sickle.fq.
4. Set Will you select a reference genome from your history or use a built-in index to Use a built-in genome index.
a. Set Select the reference genome to hg19 (chr1 only).
5. Set Do you want to use presets to default.
6. Set Save the bowtie2 mapping statistics to the history to Yes.
7. Press Execute. Rename the output to ESR1_chr1.bam
Q: What is the format of the resulting dataset? What should it contain?
Mapping ChIP-seq reads with Bowtie
Aligner output: SAM/BAM files
● SAM: ‘Sequence Alignment/Map’
● BAM: binary/compressed version of SAM
● Store information related to alignments
○ Read alignment coordinates
○ Mapping quality
○ CIGAR string
○ Bitwise FLAG
■ read paired, read mapped in proper pair, read unmapped, ...
○ ...
Bitwise flag
● Several pieces of information are encoded in the 2nd column (FLAG) of the SAM/BAM file:
○ read paired
○ read mapped in proper pair
○ read unmapped
○ mate unmapped
○ read reverse strand
○ mate reverse strand
○ first in pair
○ second in pair
○ not primary alignment
○ ...
This binary information is encoded in a single column
● 00000000001 → 2^0 = 1 (read paired)
● 00000000010 → 2^1 = 2 (read mapped in proper pair)
● 00000000100 → 2^2 = 4 (read unmapped)
● 00000001000 → 2^3 = 8 (mate unmapped)
● 00000010000 → 2^4 = 16 (read reverse strand)
● 00000001001 → 2^0 + 2^3 = 9 (read paired, mate unmapped)
● 00000001101 → 2^0 + 2^2 + 2^3 = 13 (read paired, read unmapped, mate unmapped)
● ...
Bitwise flag
https://broadinstitute.github.io/picard/explain-flags.html
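Decoding a flag value amounts to testing each bit in turn. The sketch below lists the bits defined in the SAM specification; the helper name `decode_flag` is hypothetical.

```python
# Bit values and their meaning, as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> list[str]:
    """Return the list of properties whose bit is set in the flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]
```

For example, `decode_flag(9)` returns the two properties "read paired" and "mate unmapped", since 9 = 1 + 8.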
● Examples of CIGAR operations:
○ M alignment “match” (can be a sequence match or mismatch!)
○ I insertion to the reference
○ D deletion from the reference
● http://samtools.sourceforge.net/SAM1.pdf
The extended CIGAR string
Reference: ATTCAGATGCAGTA
Read:      ATTCA--TGCAGTA
CIGAR:     5M2D7M
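A CIGAR string can be split into (length, operation) pairs with a regular expression. The following sketch uses the operation letters of the SAM specification; the function names are hypothetical. It also computes how many reference bases and how many read bases an alignment covers.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    operations = [(int(n), op) for n, op in CIGAR_PATTERN.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{n}{op}" for n, op in operations) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return operations


def reference_length(cigar: str) -> int:
    """Number of reference bases covered (M, D, N, = and X consume the reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar: str) -> int:
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")
```

For the example `5M2D7M`, the alignment spans 14 reference bases while the read itself is 12 bases long.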
Filtering
Repeats and low complexity regions
● Some regions may contain repeats
● Some regions may be of low complexity
○ E.g. AT-rich, GC-rich
● Reads falling into these regions may be ambiguous
● The mappability depends on read size.
○ As reads get longer, they become less ambiguous
○ Mappability can be used as a measure of uniqueness
Mappability
● Mappability (a): inverse of the number of times (n) a read of a given length, starting at a given position, can align to the genome.
○ a=1 (read aligns once)
○ a=1/n (read aligns n times)
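On a toy genome, mappability can be computed by counting how often each k-mer occurs. This is a simplified sketch under strong assumptions: exact matches only, forward strand only, and a hypothetical function name.

```python
from collections import Counter


def mappability(genome: str, read_length: int) -> list[float]:
    """Return a = 1/n for every start position, n being the k-mer's occurrence count."""
    kmers = [
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    ]
    counts = Counter(kmers)
    return [1 / counts[kmer] for kmer in kmers]
```

In the toy genome `ACGTACGA`, reads of length 3 starting at the two `ACG` positions get a = 0.5, whereas with reads of length 5 every position gets a = 1, which illustrates why longer reads are less ambiguous.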
Unireads? Multireads?
● The first aligners defined the notions of unireads and multireads.
● A uniread is thought to map to a single position on the genome.
● A multiread is thought to map to several positions on the genome.
○ Which position/gene produced the signal?
[Figure: a uniread and a multiread aligned to a genome carrying genes G1, G2, G3 and G4.]
How to deal with ‘multireads’
● Keep 1 position randomly
● Keep all possible positions
● Keep none
● Uniqueness has no meaning, as we don’t know the true sequence.
○ Indeed each nucleotide has an associated quality/probability of error.
● The notion has been superseded by the mapping quality score.
○ The mapping quality score is computed from the probability that the alignment is wrong:
■ takes mappability and sequence quality into account
■ MAPQ = -10*log10(Prob(alignment is wrong))
● p=0.01 -> MAPQ: 20
● p=0.001 -> MAPQ: 30
● p=0.0001 -> MAPQ: 40
● ...
Filtering for Mapping Quality (MAPQ)
Protocol
1. Select “Filter BAM datasets on a variety of attributes” from the toolbox.
2. Apply filter to ESR1_chr1.bam
3. Set Select BAM property to filter on to mapQuality (selected by default).
4. Set Filter on read mapping quality (phred scale) to “>=30”.
5. Click Execute.
6. Rename the two result files as follows
a. ESR1_chr1_filtering_parameters_txt
b. ESR1_chr1_filtered.bam
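What a MAPQ filter does can be illustrated on SAM text records, where the mapping quality is the 5th tab-separated column. This is only a sketch of the principle, not the implementation of the Galaxy tool; the function name is hypothetical and the input is assumed to be uncompressed SAM lines rather than BAM.

```python
from typing import Iterable, Iterator


def filter_sam_by_mapq(sam_lines: Iterable[str], min_mapq: int = 30) -> Iterator[str]:
    """Keep header lines and alignments whose MAPQ is >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are passed through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # 5th column of a SAM record
        if mapq >= min_mapq:
            yield line
```

A threshold of 30 keeps alignments whose estimated probability of being wrong is at most 0.001.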
Protocol
Q1: Check that the overall mean of MAPQ is improved on ESR1 chip after filtering.
● Use sam-stats from the toolbox to compute statistics on ESR1_chr1.bam and ESR1_chr1_filtered.bam.
● Use default parameters.
● Mean MAPQ should change from ~33 to ~40.
NB: You could also check that the number of alignments decreased after filtering using flagstat.
● PCR duplicates
○ Related to poor library complexity
○ The same set of fragments are amplified
■ Indicates that immuno-precipitation failed
○ Tools to check for
■ FastQC report (duplicate diagram)
■ PCR bottleneck metric (ENCODE)
Filtering for PCR duplicates
QC: PBC (PCR Bottleneck Coefficient)
● An approximate measure of library complexity
● PBC = N1/Nd
○ N1 = number of genomic positions with exactly 1 read aligned
○ Nd = number of genomic positions with ≥ 1 read aligned
● Value:
○ 0-0.5: severe bottlenecking
○ 0.5-0.8: moderate bottlenecking
○ 0.8-0.9: mild bottlenecking
○ 0.9-1.0: no bottlenecking
https://genome.ucsc.edu/ENCODE/qualityMetrics.html
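The PBC ratio can be computed from the list of alignment positions. The sketch below assumes that each read is summarized by a (chromosome, start, strand) tuple, which is an illustrative choice; the function name is hypothetical.

```python
from collections import Counter


def pcr_bottleneck_coefficient(read_positions: list[tuple[str, int, str]]) -> float:
    """PBC = N1 / Nd, computed from (chromosome, start, strand) tuples."""
    counts = Counter(read_positions)
    if not counts:
        raise ValueError("at least one aligned read is required")
    n1 = sum(1 for n in counts.values() if n == 1)  # positions with exactly 1 read
    nd = len(counts)                                # positions with >= 1 read
    return n1 / nd
```

With five reads of which two share the same position, there are 4 distinct positions and 3 of them carry a single read, giving PBC = 0.75 (moderate bottlenecking on the scale above).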
Protocol
Remove duplicates from the filtered aligned reads.
1. Select the tool MarkDuplicates from the toolbox.
2. SAM/BAM dataset or dataset collection :ESR1_chr1_filtered.bam
3. Set If true do not write duplicates to YES.
4. Set The scoring strategy for choosing the non-duplicate to SUM_OF_BASE_QUALITIES (default)
5. Rename the two output files
a. ESR1_chr1_filtered_nodup.bam
b. ESR1_chr1_filtered_nodup_metrics
6. Run sam-stats on the duplicate-filtered bam file.
Q: What is the percentage of duplicates?
Visualization
Integrative Genomics Viewer (IGV)
Protocol
1. Download the filtered dataset
a. Take care to download both the bam file and its associated index.
2. Start IGV
a. Select hg19 as a genome (Menu > Genomes > Upload from server)
b. Load the bam file using File > Load from file
c. Go to chr1
d. Check out gene “RNF223” for instance
3. The bam file contains exhaustive information. Only a fraction of a bam dataset is loaded into memory at once. We will compute a lightweight file (a tdf) that will contain only coverage information.
a. In IGV, select menu Tools > Run IGV tools > count. Browse to the bam file (not the *.bai!). Press Run.
b. Close the igvtools window.
c. Load the tdf file.
Make sure you select the correct genome version!
Rainbow colors on coverage tracks correspond to mismatches!
● ACTB (chr5) mm9 vs mm10 in IGV (Integrative Genomics Viewer)
Customize the visualization parameters...
Bam files are fat
● BAM files are fat as they contain exhaustive information about read alignments.
○ Memory issues (can only visualize a fraction of the BAM).
● Need a more lightweight file format containing only genomic coverage information:
○ ❌ Wig (not compressed, not indexed)
○ ✅ TDF (compressed, indexed)
○ ✅ BigWig (compressed, indexed)
● BAM files do not contain fragment locations but read locations
● We need to extend reads to compute fragment coordinates before coverage analysis
● Not required for paired-end (PE) data
Coverage file and read extension
[Figure: coverage counted in consecutive windows w_i, w_i+1, w_i+2, w_i+3, w_i+4.]
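Read extension and window-based coverage can be sketched as follows. The fragment length (200 bp in the usage note) is an arbitrary assumption, coordinates are taken as 0-based half-open intervals, and the function names are hypothetical; coverage tools may count overlaps differently.

```python
def extend_read(start: int, read_length: int, strand: str, fragment_length: int) -> tuple[int, int]:
    """Extend a read to the assumed fragment length, in its 5' to 3' direction."""
    if strand == "+":
        return start, start + fragment_length
    end = start + read_length  # on the reverse strand the 5' end is on the right
    return max(end - fragment_length, 0), end


def binned_coverage(fragments: list[tuple[int, int]], region_length: int, bin_size: int) -> list[int]:
    """Count, for each window, the number of fragments overlapping it."""
    n_bins = -(-region_length // bin_size)  # ceiling division
    coverage = [0] * n_bins
    for start, end in fragments:
        start, end = max(start, 0), min(end, region_length)
        if start >= end:
            continue
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            coverage[b] += 1
    return coverage
```

For example, a 36 bp forward read starting at 100 becomes the fragment (100, 300), while a 36 bp reverse read starting at 400 becomes (236, 436); counted in 100 bp windows over a 500 bp region, these two fragments give the coverage values 0, 1, 2, 1, 1.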
● Signal needs to be normalized
○ A simplistic approach: normalize coverage to 1x
○ Beware: popular but not optimal
Library size normalization
● ChIP 1 (10 reads): ✅ already normalized to 1x coverage
● ChIP 2 (20 reads): ✅ should be decreased 2-fold to get 1x coverage
● ChIP 3 (20 reads, with a peak): ❌ decreasing 2-fold would underestimate the peak signal. Problem...
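The idea behind 1x normalization can be written as a single scaling factor: the coverage is multiplied so that the average depth over the effective genome becomes 1. The sketch below is an assumption about the principle, not the exact computation performed by any coverage tool; the function names and the toy numbers in the usage note are hypothetical.

```python
def scale_factor_1x(n_reads: int, fragment_length: int, effective_genome_size: int) -> float:
    """Factor bringing the mean coverage to 1x over the effective genome."""
    sequenced_bases = n_reads * fragment_length
    if sequenced_bases <= 0:
        raise ValueError("n_reads and fragment_length must be positive")
    return effective_genome_size / sequenced_bases


def normalize_coverage(coverage: list[float], factor: float) -> list[float]:
    """Apply the scaling factor to every coverage value."""
    return [value * factor for value in coverage]
```

On a toy genome of 1000 bp with 100 bp fragments, 10 reads already give 1x coverage (factor 1.0) and 20 reads need a factor of 0.5. Because the same factor is applied everywhere, a sample whose extra reads are concentrated in a peak is scaled down in the peak as well, which is the problem pointed out for ChIP 3.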
Protocol
1. Find the bamCoverage tool
2. Set BAM files:
a. ESR1_chr1_filtered_nodup.bam
b. input_chr1_filtered_nodup.bam
3. Set Bin size in bp to 25
4. Set Scaling/Normalization method to Normalize to 1x
5. Set Effective genome size to user specified and enter 199400000 in Effective genome size (this is because we restricted the analysis to reads belonging to chromosome 1).
6. Region of the genome to limit the operation to: chr1
7. Execute, and rename the output to ESR1_chr1_filtered_nodup.bw and input_chr1_filtered_nodup.bw
8. Download the resulting files (bigwig and BAM) and open them in the IGV browser.
9. In IGV, right click on the left panel : select set data range, and set Max Value to 100.
Extracting a workflow
Our workflow
[Figure: workflow diagram. Raw data (fastq) → Trimmed reads (fastq) → Mapped reads (bam) → Filtered reads (bam) → Coverage file (bigWig). Quality control with fastqc (html) appears twice in the diagram. Other boxes: visualization (bigwig), peak calling (bed), annotation, clustering, motif discovery.]
Protocol
1. In the history menu, select history options.
2. Click on Extract workflow.
3. Set the name of the new workflow to ChIP-Seq_mapping.
4. Using the menu go to workflow > ChIP-Seq_mapping > edit.
5. Move the boxes in order to optimize the readability of the workflow.
6. Rename the input elements according to their connections.
Extracting a workflow
Protocol
1. Create a new history: History > Create new.
2. Rename this workspace: INPUT.
3. Select Shared Data > Data Libraries > Theodorou > FASTQ > MCF_input_r3_SRX176888_chr1.fq
4. Use to history to import the dataset into the INPUT history.
5. Click on Galaxy (top left) and go to INPUT history.
6. Rename the dataset to input_chr1.
7. Select workflow > ChIP-Seq_mapping > run. Set the proper input files.
8. Click Run workflow at the bottom of the page.
9. Rename the datasets.
10. Load the results into IGV.
a. Create a .tdf file for the input bam file.
b. Compare it with the previous .tdf, corresponding to the ChIP sample.
c. Beware of the data scale!
d. For readability you can rename the tracks by right-clicking on them
Comparison between the input and the chip samples
Why we use an input...
Thank you
Bowtie: a fast and very popular aligner
● Burrows-Wheeler Transform-based algorithm. Two phases: “seed and extend”.
● The Burrows-Wheeler Transform of a text T, BWT(T), can be constructed as follows:
○ The character $ is appended to T, where $ is a character not in T that is lexicographically less than all characters in T.
○ The Burrows-Wheeler Matrix of T, BWM(T), is obtained by computing the matrix whose rows comprise all cyclic rotations of T sorted lexicographically.
Example with T = acaacg$:
Rotations of T      Sorted rotations, BWM(T)
1  acaacg$          7  $acaacg
2  caacg$a          3  aacg$ac
3  aacg$ac          1  acaacg$
4  acg$aca          4  acg$aca
5  cg$acaa          2  caacg$a
6  g$acaac          5  cg$acaa
7  $acaacg          6  g$acaac
BWT(T) = gc$aaac (last column of the sorted matrix)
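The construction described above can be reproduced naively in a few lines. This sketch builds and sorts all rotations explicitly, which is fine for a short string but is not how an aligner builds its index for a whole genome; the function names are hypothetical.

```python
def bwt_matrix(text: str) -> list[str]:
    """Return the Burrows-Wheeler Matrix: all rotations of text + '$', sorted."""
    if "$" in text:
        raise ValueError("the text must not contain the terminator '$'")
    terminated = text + "$"
    rotations = [terminated[i:] + terminated[:i] for i in range(len(terminated))]
    return sorted(rotations)  # '$' sorts before letters in ASCII order


def bwt(text: str) -> str:
    """Return BWT(text): the last column of the Burrows-Wheeler Matrix."""
    return "".join(row[-1] for row in bwt_matrix(text))


def rotation_order(text: str) -> list[int]:
    """Return the 1-based start position of each sorted rotation."""
    terminated = text + "$"
    order = sorted(
        range(len(terminated)),
        key=lambda i: terminated[i:] + terminated[:i],
    )
    return [i + 1 for i in order]
```

For the text `acaacg`, this gives the transform `gc$aaac` and the rotation order 7, 3, 1, 4, 2, 5, 6, as on the slide.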
Bowtie principle
● Burrows-Wheeler Matrices have a property called the Last First (LF) Mapping:
○ The ith occurrence of character C in the last column corresponds to the same text character as the ith occurrence of C in the first column
○ Example: searching “AAC” in ACAACG
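The LF mapping is what makes it possible to search a pattern using only the transformed string, reading the pattern from its last character to its first. The sketch below is a naive illustration of this backward search (it recounts characters at each step instead of using precomputed tables, as a real index would); the function name is hypothetical and the text is written in lowercase, as in the matrix example.

```python
from collections import Counter


def backward_search(pattern: str, bwt_string: str) -> tuple[int, int]:
    """Return the half-open range [lo, hi) of sorted rotations starting with pattern."""
    # Row of the first column where each character starts.
    counts = Counter(bwt_string)
    first_row = {}
    total = 0
    for char in sorted(counts):
        first_row[char] = total
        total += counts[char]

    lo, hi = 0, len(bwt_string)
    for char in reversed(pattern):
        if char not in first_row:
            return (0, 0)
        # LF mapping: the i-th occurrence of char in the last column
        # is the i-th occurrence of char in the first column.
        lo = first_row[char] + bwt_string[:lo].count(char)
        hi = first_row[char] + bwt_string[:hi].count(char)
        if lo >= hi:
            return (0, 0)  # the pattern does not occur
    return (lo, hi)
```

Searching `aac` in the transform `gc$aaac` narrows the range to the rows starting with `c`, then `ac`, then `aac`, and ends on a single row (row index 1, i.e. the rotation `aacg$ac`), so the pattern occurs once in `acaacg`.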