Aligning SGS reads and SNP finding
Philippe Bardou & Olivier Rué
2
Organisation
Morning (9h00-12h00) :
- Sequence quality: theory + exercises
- Read mapping: theory + exercises
Afternoon (13h30-17h00) :
- SAM format: theory + exercises
- Visualisation: theory + exercises
- SNP calling: theory + exercises
3
Where are we?
[Diagram. Labels: Sequencing; De novo assembly; Alignment; Genome / Transcriptome (repeated under several boxes); SNP calling; ChIP-Seq; Transcriptomics; Methylation.]
4
What are you going to learn?
● To extract reads and reference genome from the NCBI
● To verify the read quality
● To format reference sequence
● To align the reads on the reference genome
● To index the reference sequence and the aligned reads
● To visualise the alignments and variations
● To improve alignment and to recalibrate SGS data
● To call SNPs
5
What should you already know?
● How to connect to a remote unix server (putty)
● What a unix command looks like
● How to move around the unix environment
● How to edit a file
6
The pieces of software
● FastQC : quality control
● BWA : alignment
● Samtools & Picard-tools : manipulation of BAM files
● IGV : visualisation
● GATK : analysis of SGS data
7
The 1000 genomes project
● Joint project NCBI / EBI
● Common data formats :
– fastq
– SAM (Sequence Alignment/Map)
8
Overview
[Workflow diagram. Tools: FastQC; BWA (aln / bwasw / mem); samtools (view/merge/sort/...); rmdup; IGV; GATK and Picard tools (realignment, recalibration, variant calling). Data: SRA, ENA, [...]; fastq; SAM; BAM; VCF.]
9
SGS platforms
● Two platforms :
– Illumina Solexa
– Roche 454
10
SGS reads
Platform (2013)        Throughput      Read length
HiSeq 2000/2500        600 Gb / run    2 x 100 bp / read
Titanium XL+           700 Mb / run    700 bp / read
11
Sequencing bias bibliography
12
Sequencing bias
● Platform related
● Roche 454 (data from Jean-Marc Aury CNS)
– 99.9% mapped reads
– Mean error rate : 0.55%
– 37% deletions, 53% insertions, 10% substitutions
– Homopolymer errors
– emPCR duplications
● Solexa (data from Jean-Marc Aury CNS)
– 98.5% mapped reads
– Mean error rate : 0.38%
– 3% deletions, 2% insertions, 95% substitutions
– Low coverage of A/T-rich regions
13
What data will we use?
● The needed data :
– A reference sequence :
   ● Genome
   ● Parts of the genome
   ● Transcriptome
– Short/Long reads
14
Where to get a reference genome?
● Assemble your own
● Use a public assembly :
– NCBI : Genbank
– EMBL
15
Where to get short reads?
● Produce your own sequences :
– CNS
– Local platform
– Private company
● Use public data :
– SRA : NCBI Sequence Read Archive
– ENA : EMBL/EBI European Nucleotide Archive
16
NCBI SRA?
17
EBI ENA
18
Meta data
● Meta data structure :
– Experiment
– Sample
– Study
– Run
– Data file
19
What is a fastq file
20
fastq file formats
21
Sequence quality
● Phred : base calling
What is Phred Quality?
Traditionally, Phred quality is defined on base calls. Each base call is an estimate of the true nucleotide. It is a random variable and can be wrong. The probability that a base call is wrong is called error probability.
Explanation of the quality values (source) : http://maq.sourceforge.net/qual.shtml
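The relation between a Phred quality Q and the error probability P can be sketched as below. The slide does not spell out the formula; the standard definition Q = -10 * log10(P) is used here. The ASCII offset of 33 (Sanger-style fastq) is an assumption, since some older Illumina files use an offset of 64, and the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred quality Q, using Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality Q for an error probability P."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode a fastq quality string into a list of Phred values.

    offset=33 is an assumption (Sanger encoding); use 64 for old Illumina files.
    """
    return [ord(c) - offset for c in qual]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
    print(decode_quality_string("I5!"))
```

With this definition, Q20 means one wrong base call in 100 and Q30 one in 1000.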
22
Which reads should I keep?
● All
● Some : what criteria and threshold should I use
– Composition (number of Ns, complexity,...),
– Quality,
– Alignment based criteria
● Should I trim the reads using :
– Composition
– Quality
23
Basic reads statistics
● Number of reads
● Length histogram
● Number of Ns in the reads
● Reads quality
● Reads redundancy
● Reads complexity
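Some of the statistics listed above can be computed with a few lines of code. The sketch below is not what FastQC does internally; it assumes a simple 4-line-per-record fastq file (no wrapped sequences), and the tiny in-memory data set is invented for illustration.

```python
import io
from collections import Counter


def fastq_records(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def basic_stats(handle):
    """Number of reads, length histogram, number of Ns and exact duplicates."""
    n_reads = 0
    lengths = Counter()
    n_bases = 0
    seen = Counter()
    for _name, seq, _qual in fastq_records(handle):
        n_reads += 1
        lengths[len(seq)] += 1
        n_bases += seq.upper().count("N")
        seen[seq] += 1
    duplicates = sum(count - 1 for count in seen.values())
    return {
        "reads": n_reads,
        "length_histogram": dict(lengths),
        "n_bases": n_bases,
        "duplicate_reads": duplicates,
    }


# Invented example data: three reads, two of them identical.
EXAMPLE = "@r1\nACGTN\n+\nIIIII\n@r2\nACGTN\n+\nIIIII\n@r3\nACG\n+\nIII\n"

if __name__ == "__main__":
    print(basic_stats(io.StringIO(EXAMPLE)))
```

For a real file, pass an open file handle instead of the `io.StringIO` object.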
24
Sequence quality analysis
● FastQC :
25
Sequence quality analysis
● FastQC :
26
NG6 - Home
27
NG6 - Runs
28
NG6 - Run
29
NG6 - Stats
30
NG6 - Stats
K-mers
31
NG6 - Stats
Sequence content across all bases
32
NG6 - Stats
N content across all bases
33
Exercises / set 1
● Connecting to the genotoul server
● Short read retrieval (wget)
● Read statistics (fastqc)
● Data sets :
– SRX002048
– ERR003037
– ERR000017
Login      Password
anemone    f1o2r3!
aster      f1o2r3!
bleuet     f1o2r3!
iris       ...
Other logins : lotus, muguet, narcisse, pensee, pervenche, rose, tulipe, violette
34
Read alignment
● The different software generations :
– Smith-Waterman / Needleman-Wunsch (1970)
– BLAST (1990)
– MAQ (2008)
– BWA (2009)
35
BWA
● Fast and moderate memory footprint (<4GB)
● SAM output by default
● Gapped alignment for both SE and PE reads
● Effective pairing to achieve high alignment accuracy; suboptimal hits considered in pairing.
● Non-unique read is placed randomly with a mapping quality 0
● Limited number of errors (2 for 32bp, 4 for 100 bp, ...)
● The default configuration works for most typical input.
– Automatically adjust parameters based on read lengths and error rates.
– Estimate the insert size distribution on the fly
http://bio-bwa.sourceforge.net/
36
Burrows-Wheeler transform
● Original text = “googol”
● Append ‘$’ to mark the end = X = “googol$”
● Sort all rotations of the text in lexicographic order
● Take the last column.
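The four steps above can be sketched directly in code. This is the naive construction (sorting all rotations), suitable for a short word such as "googol" only; real aligners build the transform from a suffix array instead. The function name is illustrative.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, take the last column.

    The terminator must sort before every character of the text
    ('$' does for letters in ASCII order).
    """
    x = text + terminator
    rotations = sorted(x[i:] + x[:i] for i in range(len(x)))
    last_column = "".join(rotation[-1] for rotation in rotations)
    return last_column, rotations


if __name__ == "__main__":
    transformed, rotations = bwt("googol")
    for rotation in rotations:
        print(rotation)
    print("BWT:", transformed)  # BWT: lo$oogg
```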
37
BWA prefix trie
● Word 'googol'
● ^ = start character
● --- Search of 'lol' with one error
The prefix trie is compressed to fit in memory in most cases (1 GB for the human genome).
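The search of 'lol' with one error in 'googol' can be illustrated with an uncompressed trie. The sketch below uses a plain suffix trie built from nested dictionaries as a stand-in; it is not BWA's compressed structure, and it only allows mismatches (no gaps). Function names are illustrative.

```python
def build_suffix_trie(text):
    """Trie (nested dicts) in which every substring of text is a path from the root."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def search(node, pattern, max_mismatches, prefix=""):
    """Return the substrings matching pattern with at most max_mismatches substitutions."""
    if not pattern:
        return [prefix]
    hits = []
    for ch, child in node.items():
        cost = 0 if ch == pattern[0] else 1
        if cost <= max_mismatches:
            hits.extend(
                search(child, pattern[1:], max_mismatches - cost, prefix + ch)
            )
    return hits


if __name__ == "__main__":
    trie = build_suffix_trie("googol")
    print(search(trie, "lol", 0))  # []
    print(search(trie, "lol", 1))  # ['gol']
```

With no error allowed the search fails; with one mismatch it finds 'gol', the last three letters of 'googol'.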
38
http://bio-bwa.sourceforge.net/bwa.shtml
46
BWA : 3 algorithms
                   mem                     aln                       bwasw
Read length        >70bp => few Mb         Short reads (~ <100bp)    Long reads (>100bp)
Genome length      More than 4Gb           < 4Gb                     < 4Gb
Time processing    +++                     ++                        -
Paired             Yes (+++)               Yes (++)                  Yes (-)
Algorithm          Auto: end-to-end/SW     End-to-end                SW
Commands
index (all)            bwa index ref.fa
alignment (mem)        bwa mem ref.fa r1.fq [r2.fq] > o.sam
alignment (aln)        bwa aln ref.fa r1.fq > o1.sai
                       bwa aln ref.fa r2.fq > o2.sai
alignment (bwasw)      bwa bwasw ref.fa r1.fq [r2.fq] > o.sam
postprocess (aln)      bwa samse ref.fa o1.sai r1.fq > o.sam
                       bwa sampe ref.fa o1.sai o2.sai r1.fq r2.fq > o.sam
48
Exercises / set 2
● Retrieving the reference sequence in fasta format :
– gi|224581838|ref|NC_012125.1| Salmonella enterica subsp. enterica serovar Paratyphi C strain RKS4594, complete genome
● Indexing the reference sequence
● Aligning the reads (fastq format)
● Formatting the alignment in SAM
49
Sequence Alignment/Map (SAM) format
➢ Data sharing was a major issue with the 1000 genomes
➢ Capture all of the critical information about NGS data in a single indexed and compressed file
➢ Sharing : data across and tools
➢ Generic alignment format
➢ Supports short and long reads (454 – Solexa – Solid)
➢ Flexible in style, compact in size, efficient in random access
Website : http://samtools.sourceforge.net
Paper :Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
50
Sequence Alignment/Map (SAM) format
51
SAM format : Header section
➢ Header lines start with @ followed by a two-letter TAG
➢ Header fields are TYPE:VALUE pairs
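A header line can be split into its record type and its fields as sketched below. The two example lines (reference name `ref1`, length 1000) are invented for illustration; comment lines (`@CO`) carry free text rather than pairs and are handled separately.

```python
def parse_sam_header_line(line):
    """Split one SAM header line into (record type, {field: value})."""
    fields = line.rstrip("\n").split("\t")
    if not fields[0].startswith("@"):
        raise ValueError("header lines start with '@'")
    record_type = fields[0][1:]
    if record_type == "CO":
        # Comment lines hold free text, not field:value pairs.
        return record_type, {"comment": "\t".join(fields[1:])}
    pairs = dict(field.split(":", 1) for field in fields[1:])
    return record_type, pairs


# Hypothetical header lines, for illustration only.
EXAMPLE_HEADER = ["@HD\tVN:1.0\tSO:coordinate", "@SQ\tSN:ref1\tLN:1000"]

if __name__ == "__main__":
    for header_line in EXAMPLE_HEADER:
        print(parse_sam_header_line(header_line))
```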
52
SAM format : Alignment section
➢ 11 mandatory fields
➢ Variable number of optional fields
➢ Fields are tab delimited
53
SAM format : Full example
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]
Header
Alignment
Tags (<TAG>) :
X? : Reserved for end users
NM : Number of nucleotide differences
MD : String for mismatching positions
RG : Read group
[...]
Value types (<VTYPE>) :
A : Printable character
i : Signed 32-bit integer
f : Single-precision float number
Z : Printable string
H : Hex string (high nybble first)
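The field layout above can be turned into a small parser. This is a minimal sketch, not the samtools implementation: the alignment line is hypothetical (read name, reference name, mapping quality and tag values are invented), and only the `i` and `f` value types are converted.

```python
MANDATORY = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL",
]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_alignment_line(line):
    """Parse one tab-delimited SAM alignment line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(MANDATORY):
        raise ValueError("expected at least 11 tab-delimited fields")
    record = {}
    for name, value in zip(MANDATORY, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    tags = {}
    for field in cols[len(MANDATORY):]:
        tag, vtype, value = field.split(":", 2)
        if vtype == "i":
            value = int(value)
        elif vtype == "f":
            value = float(value)
        tags[tag] = value
    record["TAGS"] = tags
    return record


# Hypothetical alignment line, for illustration only.
EXAMPLE_LINE = (
    "read1\t0\tref1\t5\t30\t2S4M2D6M3S\t*\t0\t0\t"
    "CCTCAGGCATTAGTG\t*\tNM:i:3\tRG:Z:grp1"
)

if __name__ == "__main__":
    print(parse_alignment_line(EXAMPLE_LINE))
```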
54
SAM format : Flag field
http://picard.sourceforge.net/explain-flags.html
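The flag is a bit field: each bit answers one yes/no question about the read. The sketch below decodes a flag the way the page linked above does; the bit values come from the SAM specification, while the wording of the labels is ours.

```python
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the list of properties encoded in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    for value in (0, 4, 16, 99, 147):
        print(value, explain_flag(value))
```

For example, 99 = 1 + 2 + 32 + 64: a paired read, properly paired, whose mate is on the reverse strand, and which is the first read of the pair.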
55
SAM format : Extended CIGAR format
Ref: GCATTCAGATGCAGTACGC
Read: ccTCAGGCATTAgtg
POS CIGAR
5 2S4M2D6M3S
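The example above can be replayed in code: the CIGAR string is split into operations, and each operation consumes bases of the reference, of the read, or of both. This is a minimal sketch using the reference, read, POS and CIGAR of the slide; function names are illustrative.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]


def render_alignment(ref, read, pos, cigar):
    """Return (aligned reference, aligned read, 1-based end position on the reference)."""
    ref_i = pos - 1  # POS is 1-based
    read_i = 0
    ref_row, read_row = [], []
    for length, op in parse_cigar(cigar):
        if op in "M=X":  # match or mismatch: consumes reference and read
            ref_row.append(ref[ref_i:ref_i + length])
            read_row.append(read[read_i:read_i + length])
            ref_i += length
            read_i += length
        elif op in "DN":  # deletion / skipped region: consumes reference only
            ref_row.append(ref[ref_i:ref_i + length])
            read_row.append("-" * length)
            ref_i += length
        elif op == "I":  # insertion: consumes read only
            ref_row.append("-" * length)
            read_row.append(read[read_i:read_i + length])
            read_i += length
        elif op == "S":  # soft clip: bases present in the read, not aligned
            read_i += length
        # H (hard clip) and P (padding) consume neither sequence
    return "".join(ref_row), "".join(read_row), ref_i


if __name__ == "__main__":
    ref_row, read_row, end = render_alignment(
        "GCATTCAGATGCAGTACGC", "ccTCAGGCATTAgtg", 5, "2S4M2D6M3S"
    )
    print(ref_row)   # TCAGATGCAGTA
    print(read_row)  # TCAG--GCATTA
    print(end)       # 16
```

Note that M covers both matches and mismatches: the last 6M block contains one mismatch (G in the reference, T in the read).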
56
SAM format : Extended CIGAR format
57
BAM format
➢ Binary representation of SAM
➢ Compressed by BGZF library
➢ Greatly reduces storage space requirements to about 27% of original SAM
58
SAMtools
➢ Library and software package
➢ Creating sorted and indexed BAM files from SAM files
➢ Removing PCR duplicates
➢ Merging alignments
➢ Visualization of alignments from BAM files
➢ SNP calling
➢ Short indel detection
http://samtools.sourceforge.net/samtools.shtml
60
SAMtools : Example usage
➢ Create BAM from SAM
samtools view -bS aln.sam -o aln.bam
➢ Sort BAM file
samtools sort example.bam sortedExample
➢ Merge sorted BAM files
samtools merge sortedMerge.bam sorted1.bam sorted2.bam
➢ Index BAM file
samtools index sortedExample.bam
➢ Visualize BAM file
samtools tview sortedExample.bam reference.fa
61
Picard
➢ A SAMtools complementary package
➢ More format conversion than SAMtools
➢ Visualization of alignments not available
➢ SNP calling & short indel detection not available
http://picard.sourceforge.net/
62
Picard : Example usage
java -Xmx2g -jar PicardCmd.jar OPTION1=value1 OPTION2=value2...
➢ ValidateSamFile
➢ SortSam
➢ MarkDuplicates
➢ EstimateLibraryComplexity
➢ MergeSamFiles
➢ ViewSam
➢ ReplaceSamHeader
➢ SamToFastq
➢ FastqToSam
➢ SamFormatConverter
➢ CreateSequenceDictionary
➢ CleanSam
➢ CompareSAMs
63
Exercise / set 3
64
Visualizing the alignment : SAMtools tview
samtools tview aln.sorted.bam ref.fasta
65
Visualizing the alignment : IGV
➢ IGV : Integrative Genomics Viewer
➢ Website : http://www.broadinstitute.org/igv
66
Visualizing the alignment : IGV
➢ High-performance visualization tool
➢ Interactive exploration of large, integrated datasets
➢ Supports a wide variety of data types
➢ Documentation
➢ Developed at the Broad Institute of MIT and Harvard
67
Visualizing the alignment : IGV
68
Visualizing the alignment : IGV - Loading the reference
70
Visualizing the alignment : IGV - Loading the bam file
71
Visualizing the alignment : IGV - Loading the bam file
72
Visualizing the alignment : IGV - Zoom
74
Visualizing the alignment : IGV - Loading an annotation file
75
Visualizing the alignment : IGV - Coverage
76
Exercise / set 4
77
Variant calling
➢ A new file format :
✔ VCF (Variant Call Format)
➢ Data pre-processing
➢ Variant calling
➢ IGV visualisation
78
The VCF format
✗ Tab-delimited text file
##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
...
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
...
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...]
##contig=<ID=scaffold376,length=1000>
##reference=file:///work/banks/genome.fasta
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2
scaffold376 101 . G A 247.82 . AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=1.905;QD=10.77;ReadPosRankSum=0.495;SB=85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0
scaffold376 121 . G A 1443.70 . AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=0.519;SB=486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,0 0/1:17,11:28:99:228,0,371
http://vcftools.sourceforge.net/specs.html
http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk
79
The VCF format
✗ Tab-delimited text file
VCF version
80
The VCF format
✗ Tab-delimited text file
##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...]
##contig=<ID=scaffold376,length=1000>
##reference=file:///work/banks/genome.fasta
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2
scaffold376 101 . G A 247.82 . AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=1.905;QD=10.77;ReadPosRankSum=0.495;SB=85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0
scaffold376 121 . G A 1443.70 . AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=0.519;SB=486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,0 0/1:17,11:28:99:228,0,371
Fields description
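Each `##INFO` and `##FORMAT` line describes one field as a list of key=value attributes between angle brackets. A minimal sketch of how such a line can be read is given below; it is not the parser used by GATK or vcftools, and the regular expressions are our own. Quoted descriptions may contain commas, so the attributes cannot simply be split on ",".

```python
import re

META_RE = re.compile(r"^##(\w+)=<(.*)>$")
# A value is either a quoted string (which may contain commas) or runs up to the next comma.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_meta_line(line):
    """Return (kind, attributes) for a structured meta line, or None otherwise."""
    match = META_RE.match(line.strip())
    if match is None:
        return None  # e.g. ##fileformat=VCFv4.1 has no <...> body
    kind, body = match.groups()
    attributes = {key: value.strip('"') for key, value in PAIR_RE.findall(body)}
    return kind, attributes


EXAMPLE_META = (
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in '
    'genotypes, for each ALT allele, in the same order as listed">'
)

if __name__ == "__main__":
    print(parse_meta_line(EXAMPLE_META))
    print(parse_meta_line("##fileformat=VCFv4.1"))
```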
81
The VCF format
✗ Tab-delimited text file
Tools & options used
82
The VCF format
✗ Tab-delimited text file
Genome information
83
The VCF format
✗ Tab-delimited text file
Header line
84
The VCF format
✗ Header line
CHROM : chromosome ID
POS : position (sorted)
ID : variant ID
REF : reference base(s)
ALT : alternative allele(s)
QUAL : phred-scaled variant quality
FILTER : [ . | PASS | type of filter ]
INFO : list of additional information
FORMAT : genotype fields format
l1 : 1st genotype
l2 : 2nd genotype
...
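The columns described above can be read with a few lines of code. The sketch below parses the first variant of the example file (position 101); its INFO column is shortened here to keep the line readable, and the function name is illustrative. It is not a complete VCF parser (multi-allelic sites, typed values and missing fields are not handled).

```python
def parse_variant_line(line, sample_names):
    """Parse one tab-delimited VCF variant line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, flt, info, fmt = cols[:9]
    info_fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_fields[key] = value if sep else True  # flags have no value
    keys = fmt.split(":")
    genotypes = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, cols[9:])
    }
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": variant_id,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": float(qual),
        "FILTER": flt,
        "INFO": info_fields,
        "GENOTYPES": genotypes,
    }


# First variant of the example file, with a shortened INFO column.
EXAMPLE_VARIANT = (
    "scaffold376\t101\t.\tG\tA\t247.82\t.\t"
    "AC=5;AF=0.417;AN=12;DP=13;MQ=45.38\t"
    "GT:AD:DP:GQ:PL\t./.\t1/1:0,13:13:9.01:87,9,0"
)

if __name__ == "__main__":
    variant = parse_variant_line(EXAMPLE_VARIANT, ["l1", "l2"])
    print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"])
    print(variant["GENOTYPES"])
```

In this line, sample l1 has no call (./.) and sample l2 is homozygous for the alternative allele (1/1), with 0 reads supporting G and 13 supporting A.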
85
The VCF format
✗ Tab-delimited text file
Variant lines
http://vcftools.sourceforge.net/specs.html
http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk
86
The VCF format
✗ How to interpret a variant line?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2
scaffold376 101 . G A 247.82 . AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=1.905;QD=10.77;ReadPosRankSum=0.495;SB=85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0
Variant at position 101 on scaffold376
Not annotated (ID = .)
Base G in the reference genome / A in the reads
SNP quality: 247.82
No filters applied (FILTER = .)
Other information (INFO column)
GT: genotype
  ./. → not defined (sample l1)
  0/0 → homozygous reference
  0/1 → heterozygous
  1/1 → homozygous alternative (sample l2)
AD (sample l2): allelic depth for REF → 0
  allelic depth for ALT → 13
DP: read depth → 13
GQ: genotype quality → 9.01
PL: Phred-scaled genotype likelihoods (0 = most likely genotype)
  genotype 0/0 likelihood → 87
  genotype 0/1 likelihood → 9
  genotype 1/1 likelihood → 0
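The reading of the variant line above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `parse_variant_line` is hypothetical, the INFO column is abbreviated for readability, and real VCF files are tab-separated (the slides show spaces).

```python
# Minimal sketch: split one VCF data line into its fixed columns
# and its per-sample fields. Names below are illustrative only.

# Variant line from the slide (INFO abbreviated, columns tab-separated).
VARIANT_LINE = (
    "scaffold376\t101\t.\tG\tA\t247.82\t.\t"
    "AC=5;AF=0.417;AN=12;DP=13;MQ=45.38\t"
    "GT:AD:DP:GQ:PL\t./.\t1/1:0,13:13:9.01:87,9,0"
)
SAMPLE_NAMES = ["l1", "l2"]  # sample columns of the #CHROM header line


def parse_variant_line(line, sample_names):
    cols = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info, fmt = cols[:9]

    # INFO is a ';'-separated list of KEY=VALUE pairs (flags have no '=').
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True

    # FORMAT gives the order of the ':'-separated fields of each sample.
    keys = fmt.split(":")
    samples = {}
    for name, field in zip(sample_names, cols[9:]):
        # An uncalled sample such as './.' may carry only the GT field.
        samples[name] = dict(zip(keys, field.split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": float(qual),
        "filter": flt,
        "info": info_dict,
        "samples": samples,
    }


record = parse_variant_line(VARIANT_LINE, SAMPLE_NAMES)
print(record["chrom"], record["pos"], record["ref"], record["alt"])
for name, fields in record["samples"].items():
    print(name, fields)
```

For real data a dedicated VCF library is preferable; the sketch only shows how the FORMAT column drives the interpretation of each sample column.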
87
The VCF format
✗ Small INDELs
Deletion: scaffold376 500 . CTT C 424.60 ....
Insertion: scaffold376 500 . C CT 434.60 ....
✗ Multi-allelic variants
scaffold376 577 . C A,T 2303.19 .... GT:AD:DP:GQ:PL 1/2:0,10,6:16:99:394,145,118,249,0,234 1/2:0,20,6:26:99:658,160,106,498,0,480
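The two cases above can be checked with a small Python sketch. The helper names (`variant_type`, `genotype_order`, `most_likely_genotype`) are hypothetical. The genotype ordering follows the VCF specification for diploid samples, where genotype j/k (with j ≤ k) sits at index k(k+1)/2 + j of the PL list.

```python
# Minimal sketch: classify REF/ALT pairs and decode a multi-allelic PL field.

def variant_type(ref, alt):
    """Classify a single REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) > len(alt):
        return "deletion"
    if len(ref) < len(alt):
        return "insertion"
    return "other"


def genotype_order(n_alleles):
    """Diploid genotype order used by PL: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ..."""
    return [f"{j}/{k}" for k in range(n_alleles) for j in range(k + 1)]


def most_likely_genotype(pl_values, n_alleles):
    """The most likely genotype is the one with the smallest PL (normalised to 0)."""
    order = genotype_order(n_alleles)
    return order[pl_values.index(min(pl_values))]


print(variant_type("CTT", "C"))   # deletion line of the slide
print(variant_type("C", "CT"))    # insertion line of the slide

# Multi-allelic site C -> A,T: three alleles (REF=0, A=1, T=2).
pl_first_sample = [394, 145, 118, 249, 0, 234]
print(genotype_order(3))
print(most_likely_genotype(pl_first_sample, 3))
```

For the first sample of the multi-allelic line, the PL value 0 is in fifth position, which corresponds to genotype 1/2, in agreement with its GT field.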
88
THE GATK best practices
90
Why GATK?
➢ New sequencing technologies struggle to provide accurate Phred-scaled base qualities
➢ A wrong base quality has a strong impact on SNP calling
➢ Illustration with 5 variant calling methods [figure]
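As a reminder of what "Phred-scaled" means in the points above: a quality Q encodes an error probability P through Q = -10 log10(P). The Python sketch below (function names are illustrative) shows why an over-optimistic base quality matters so much.

```python
import math

# Minimal sketch of the Phred scale: Q = -10 * log10(P_error).

def phred_to_error_prob(q):
    """Error probability encoded by a Phred quality."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality corresponding to an error probability."""
    return -10 * math.log10(p)


# A base reported as Q30 claims 1 error in 1000 ...
reported = phred_to_error_prob(30)
# ... but if its real quality is Q20, the true rate is 1 error in 100.
actual = phred_to_error_prob(20)

print(f"reported error rate: {reported:.4f}")
print(f"actual error rate:   {actual:.4f}")
print(f"errors underestimated by a factor of {actual / reported:.0f}")
```

A 10-point overestimate of the quality therefore hides ten times more sequencing errors, each of which can look like a variant to the caller.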
91
GATK pipeline
[Diagram: BAM → Picard tools AddOrReplaceReadGroups → BAM → Picard tools MarkDuplicates → BAM → Samtools index → GATK RealignerTargetCreator → GATK IndelRealigner → BAM → GATK BaseRecalibrator → GATK PrintReads → BAM → GATK UnifiedGenotyper → VCF]
92
Warning! Huge temporary data
[Diagram: the GATK pipeline, with an intermediate SAM/BAM file written at each step]
* HIGH CPU AND MEMORY REQUIREMENTS
* LONG PROCESSING TIME
94
Add Read Groups
➢ GATK needs an @RG field in the BAM header
Aim: keep meta-information ("Where do the sequences come from?")
Useful to recalibrate base qualities (covariates) and to call variants
java -Xmx4G -jar AddOrReplaceReadGroups.jar I=file.bam O=out.bam ID=[val] RGLB=[val] PL=[val] PU=[val] SM=[val]
The @RG tag can also be added directly with "bwa samse":
bwa samse -r '@RG\tID:[val]\tPL:[val]\tPU:[val]\tLB:[val]\tSM:[val]' ref.fasta file.sai file.fastq > file.sam
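To make the structure of the @RG line explicit, here is a minimal Python sketch that builds and parses such a header line. The function names and the example values (L1, ILLUMINA, unit1, lib1, sample1) are hypothetical. In the SAM header the fields are separated by real tab characters, whereas on the bwa command line they are written as the two characters `\t`.

```python
# Minimal sketch: build and parse an @RG header line as stored in a SAM header.

def build_rg_line(rg_id, platform, platform_unit, library, sample):
    """Return an @RG line with the five tags used in the slides."""
    fields = [
        ("ID", rg_id),          # read group identifier
        ("PL", platform),       # sequencing platform
        ("PU", platform_unit),  # platform unit (e.g. flowcell/lane)
        ("LB", library),        # library
        ("SM", sample),         # sample name
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


def parse_rg_line(line):
    """Return the TAG:VALUE pairs of an @RG line as a dictionary."""
    parts = line.rstrip("\n").split("\t")
    if parts[0] != "@RG":
        raise ValueError("not an @RG header line")
    return dict(part.split(":", 1) for part in parts[1:])


# Hypothetical values, for illustration only.
rg_line = build_rg_line("L1", "ILLUMINA", "unit1", "lib1", "sample1")
print(rg_line)
print(parse_rg_line(rg_line)["SM"])
```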
97
Duplicate Marking
➢ Duplicates can propagate sequencing errors
Credits: GATK website
java -Xmx4G -jar MarkDuplicates.jar I=file.bam O=file_rmdup.bam REMOVE_DUPLICATES=[true|false] M=file.metrics
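The idea behind duplicate marking can be sketched in Python as follows. This is a deliberately simplified, single-end illustration and not Picard's actual implementation (which also handles read pairs, clipping and libraries): reads starting at the same position on the same strand are grouped, and all but the one with the highest summed base quality are flagged. The read dictionaries and the function name are hypothetical.

```python
from collections import defaultdict

# Simplified sketch of duplicate marking for single-end reads.

def mark_duplicates(reads):
    """Return the names of reads considered duplicates.

    Each read is a dict with: name, chrom, pos (5' position),
    strand ('+' or '-') and base_quals (list of Phred qualities).
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest sum of base qualities.
        best = max(group, key=lambda r: sum(r["base_quals"]))
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


# Hypothetical reads, for illustration only.
reads = [
    {"name": "r1", "chrom": "scaffold376", "pos": 100, "strand": "+", "base_quals": [20, 20, 20]},
    {"name": "r2", "chrom": "scaffold376", "pos": 100, "strand": "+", "base_quals": [30, 30, 30]},
    {"name": "r3", "chrom": "scaffold376", "pos": 100, "strand": "-", "base_quals": [10, 10, 10]},
    {"name": "r4", "chrom": "scaffold376", "pos": 250, "strand": "+", "base_quals": [30, 30, 30]},
]
print(sorted(mark_duplicates(reads)))
```

Here only r1 is flagged: it shares its position and strand with r2 but has lower qualities, while r3 (other strand) and r4 (other position) are kept.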
98
Local Realignment
Raw BAM: CIGAR 101M, 2 mismatches
Realigned BAM: CIGAR 98M25D3M, 1 mismatch
Read concerned
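The two CIGAR strings of this example can be compared with a short Python sketch (the function name `cigar_lengths` is illustrative). It counts how many read bases and how many reference bases each alignment covers, following the SAM convention that M, I, S, = and X consume the read while M, D, N, = and X consume the reference.

```python
import re

# Minimal sketch: read length and reference span implied by a CIGAR string.

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Return (read_length, reference_span) for a CIGAR string."""
    read_len = 0
    ref_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MIS=X":   # operations that consume read bases
            read_len += n
        if op in "MDN=X":   # operations that consume reference bases
            ref_len += n
    return read_len, ref_len


for cigar in ("101M", "98M25D3M"):
    read_len, ref_len = cigar_lengths(cigar)
    print(f"{cigar}: read length {read_len}, reference span {ref_len}")
```

Both alignments describe the same 101-base read. After realignment the read spans 126 reference bases because a 25-base deletion is opened near its end, which replaces the mismatches that the ungapped alignment had to accept.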
100
Base recalibration
➢ The base qualities provided by the sequencers are too inaccurate to be kept, so they are recomputed.
➢ Recalibration requires a set of known (real) SNPs
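The principle can be sketched in Python under strong simplifying assumptions; this is not GATK's algorithm. GATK bins bases by several covariates, whereas the sketch uses the reported quality only, and the +1/+2 smoothing is an assumption made here to avoid a zero error rate. Mismatches at known SNP positions are ignored because they are real variation rather than sequencing errors, which is why a set of known SNPs is required.

```python
import math
from collections import defaultdict

# Simplified sketch of base quality recalibration (reported quality only).

def empirical_qualities(observations, known_snp_positions):
    """Map each reported quality to an empirical Phred quality.

    observations: iterable of (position, reported_quality, is_mismatch).
    known_snp_positions: set of positions to skip (real variants).
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, total]
    for pos, reported_q, is_mismatch in observations:
        if pos in known_snp_positions:
            continue  # a mismatch here is a true SNP, not an error
        counts[reported_q][1] += 1
        if is_mismatch:
            counts[reported_q][0] += 1

    # Smoothed error rate (assumption of this sketch): (mismatches+1)/(total+2).
    return {
        q: -10 * math.log10((mismatches + 1) / (total + 2))
        for q, (mismatches, total) in counts.items()
    }


# Hypothetical data: 998 bases reported as Q30, 9 of them mismatching,
# plus one mismatch at a known SNP position that must be ignored.
observations = (
    [(i, 30, False) for i in range(989)]
    + [(1000 + i, 30, True) for i in range(9)]
    + [(5000, 30, True)]
)
known_snps = {5000}

recalibrated = empirical_qualities(observations, known_snps)
print(f"reported Q30 -> empirical Q{recalibrated[30]:.1f}")
```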
101
Variant calling information
102
IGV: visualization of variants
103
Synthesis
[Diagram: SRA / ENA / [...] → fastq → FastQC; fastq → BWA (aln / bwsw / mem) → SAM → samtools (view/merge/sort/...) → BAM → rmdup → BAM → realignment, recalibration and variant calling (GATK, Picard tools) → VCF; BAM and VCF → IGV]
104
Exercise / set 5