
Aligning SGS reads and SNP finding

Philippe Bardou & Olivier Rué

2

Organisation

Morning (9h00-12h00):

- Sequence quality: theory + exercises

- Read mapping: theory + exercises

Afternoon (13h30-17h00):

- SAM format: theory + exercises

- Visualisation: theory + exercises

- SNP calling: theory + exercises

3

Where are we?

[Diagram: workflow map covering Sequencing, De Novo Assembly, Alignment, SNP Calling, ChIP-Seq, Transcriptomics and Methylation, applied to genome and/or transcriptome data.]

4

What are you going to learn?

● To extract reads and reference genome from the NCBI

● To verify the read quality

● To format reference sequence

● To align the reads on the reference genome

● To index the reference sequence and the aligned reads

● To visualise the alignments and variations

● To improve alignment and to recalibrate SGS data

● To call SNPs

5

What should you already know?

● How to connect to a remote unix server (putty)

● What a unix command looks like

● How to move around the unix environment

● How to edit a file

6

The pieces of software

● FastQC : quality control

● BWA : alignment

● Samtools & Picard-tools : manipulation of BAM files

● IGV : visualisation

● GATK : analysis of SGS data

7

The 1000 genomes project

● Joint project NCBI / EBI

● Common data formats :

– fastq

– SAM (Sequence Alignment/Map)

8

Overview

[Diagram: pipeline overview. Reads (fastq) come from SRA / ENA / [...] and are checked with FastQC. BWA (aln / bwasw / mem) aligns them and produces SAM. samtools (view / merge / sort / ...) produces BAM, which is viewed in IGV. rmdup, realignment, recalibration and variant calling (GATK, Picard tools) work on BAM files and produce VCF.]

9

SGS platforms

● Two platforms :

– Illumina Solexa

– Roche 454

10

SGS reads

Platform                  Throughput     Read length
HiSeq 2000/2500 (2013)    600 Gb / run   2 x 100 bp / read
Titanium XL+ (2013)       700 Mb / run   700 bp / read

11

Sequencing bias bibliography

12

Sequencing bias

● Platform related

● Roche 454 (data from Jean-Marc Aury CNS)

– 99.9% mapped reads

– Mean error rate : 0.55%

– 37% deletions, 53% insertions, 10% substitutions

– Homopolymer errors

– emPCR duplications

● Solexa (data from Jean-Marc Aury CNS)

– 98.5% mapped reads

– Mean error rate : 0.38%

– 3% deletions, 2% insertions, 95% substitutions

– Low coverage of A/T-rich regions

13

What data will we use?

● The needed data :

– A reference sequence :

  ● Genome

  ● Parts of the genome

  ● Transcriptome

– Short/Long reads

14

Where to get a reference genome?

● Assemble your own

● Use a public assembly :

– NCBI : Genbank

– EMBL

15

Where to get short reads?

● Produce your own sequences :

– CNS

– Local platform

– Private company

● Use public data :

– SRA : NCBI Sequence Read Archive

– ENA : EMBL/EBI European Nucleotide Archive

16

NCBI SRA?

17

EBI ENA

18

Meta data

● Meta data structure :

– Experiment

– Sample

– Study

– Run

– Data file

19

What is a fastq file

20

fastq file formats

21

Sequence quality

● Phred : base calling

What is Phred Quality?

Traditionally, Phred quality is defined on base calls. Each base call is an estimate of the true nucleotide. It is a random variable and can be wrong. The probability that a base call is wrong is called error probability.

Explanation of the quality values (source : http://maq.sourceforge.net/qual.shtml)
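A Phred quality Q and the error probability P of a base call are linked by Q = -10 log10(P), so Q20 means 1 error in 100 and Q30 means 1 error in 1000. The sketch below illustrates this conversion and the decoding of a fastq quality string. It assumes the Sanger-style encoding (ASCII offset 33); older Illumina/Solexa files may use another offset. The function names are only illustrative.

```python
import math

def phred_to_error_prob(q):
    """Error probability of a base call with Phred quality q: P = 10^(-q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality of a base call with error probability p: Q = -10 log10(p)."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Turn a fastq quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / recent Illumina encoding).
    """
    return [ord(char) - offset for char in qual]

if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
    print(decode_quality_string("I5!"))
```

With offset 33, the character "I" decodes to 40, "5" to 20 and "!" to 0.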

22

Which reads should I keep?

● All

● Some : what criteria and threshold should I use

– Composition (number of Ns, complexity,...),

– Quality,

– Alignment-based criteria

● Should I trim the reads using :

– Composition

– Quality

23

Basic reads statistics

● Number of reads

● Length histogram

● Number of Ns in the reads

● Reads quality

● Reads redundancy

● Reads complexity
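As an illustration of these statistics, here is a minimal Python sketch that reads a fastq stream and reports the number of reads, the length histogram, the number of Ns, the mean base quality and the number of exactly duplicated sequences. It assumes four lines per record and a quality offset of 33; it is not what FastQC does internally, and all names are hypothetical.

```python
import io
from collections import Counter

def fastq_records(handle):
    """Yield (name, sequence, quality) from a 4-lines-per-record fastq stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual

def basic_stats(handle, offset=33):
    """Collect simple per-file statistics (offset=33 is an assumption)."""
    n_reads, n_count, qual_sum, base_count = 0, 0, 0, 0
    lengths, seen = Counter(), Counter()
    for _, seq, qual in fastq_records(handle):
        n_reads += 1
        lengths[len(seq)] += 1
        n_count += seq.upper().count("N")
        qual_sum += sum(ord(c) - offset for c in qual)
        base_count += len(qual)
        seen[seq] += 1
    return {
        "reads": n_reads,
        "length_histogram": dict(lengths),
        "n_bases": n_count,
        "mean_quality": qual_sum / base_count if base_count else 0.0,
        # reads whose sequence was already seen (exact redundancy only)
        "duplicated_reads": sum(count - 1 for count in seen.values()),
    }

# Tiny made-up data set, only for demonstration.
DEMO = "@r1\nACGTN\n+\nIIII!\n@r2\nACG\n+\nII5\n@r3\nACGTN\n+\nIIII!\n"

if __name__ == "__main__":
    print(basic_stats(io.StringIO(DEMO)))
```

For a real file, pass an open file handle instead of the `io.StringIO` demo (for example `open("reads.fastq")`, a hypothetical file name).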

24

Sequence quality analysis

● FastQC :

25

Sequence quality analysis

● FastQC :

26

NG6 - Home

27

NG6 - Runs

28

NG6 - Run

29

NG6 - Stats

30

NG6 - Stats

K-mers

31

NG6 - Stats

Sequence content across all bases

32

NG6 - Stats

N content across all bases

33

Exercises / set 1

● Connection to the genotoul server

● Short read retrieval (wget)

● Read statistics (fastqc)

● Data sets :

– SRX002048

– ERR003037

– ERR000017

Login      Password
anemone    f1o2r3!
aster      f1o2r3!
bleuet     f1o2r3!
iris       ...

Other logins : lotus, muguet, narcisse, pensee, pervenche, rose, tulipe, violette

34

Read alignment

● The different software generations :

– Needleman-Wunsch (1970) / Smith-Waterman (1981)

– BLAST (1990)

– MAQ (2008)

– BWA (2009)

35

BWA

● Fast and moderate memory footprint (<4GB)

● SAM output by default

● Gapped alignment for both SE and PE reads

● Effective pairing to achieve high alignment accuracy; suboptimal hits considered in pairing

● A non-unique read is placed randomly with a mapping quality of 0

● Limited number of errors (2 for 32 bp, 4 for 100 bp, ...)

● The default configuration works for most typical input :

– Automatically adjusts parameters based on read lengths and error rates

– Estimates the insert size distribution on the fly

http://bio-bwa.sourceforge.net/

36

Burrows-Wheeler transform

● Original text = “googol”

● Append ‘$’ to mark the end = X = “googol$”

● Sort all rotations of the text in lexicographic order

● Take the last column.
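The four steps above can be sketched in a few lines of Python. This is a naive version that builds and sorts every rotation, which is fine for a word like "googol" but not for a genome; real aligners build the transform through a suffix array. The function names are illustrative, and the code assumes '$' does not occur in the text and sorts before every letter (true in ASCII).

```python
def bwt(text, end="$"):
    """Burrows-Wheeler transform by sorting all rotations of text + end."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, end="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]

if __name__ == "__main__":
    transformed = bwt("googol")
    print(transformed)               # lo$oogg
    print(inverse_bwt(transformed))  # googol
```

The transform is reversible, which is why the original sequence does not need to be stored alongside it.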

37

BWA prefix trie

● Word 'googol'

● ^ = start character

● --- Search of 'lol' with one error

The prefix trie is compressed to fit in memory in most cases (1 GB for the human genome).
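Searching the prefix trie corresponds to a backward search on the Burrows-Wheeler transform. The sketch below counts exact occurrences of a pattern only; the search with errors shown on the slide (which BWA performs with bounded backtracking) is not implemented here. It is a didactic version with naive suffix sorting and naive occurrence counting, and the function name is hypothetical.

```python
def count_occurrences(text, pattern, end="$"):
    """Count exact matches of pattern in text by backward search on the BWT."""
    s = text + end
    # Suffix array by naive sorting (acceptable for short demonstration texts).
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (s[-1] is the end marker).
    bwt = "".join(s[i - 1] for i in suffix_array)
    # first_row[c] = number of characters in s that are smaller than c.
    first_row, total = {}, 0
    for char in sorted(set(s)):
        first_row[char] = total
        total += s.count(char)

    def occ(char, i):
        """Number of occurrences of char in bwt[:i]."""
        return bwt[:i].count(char)

    low, high = 0, len(s)  # current interval of suffix array rows
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        low = first_row[char] + occ(char, low)
        high = first_row[char] + occ(char, high)
        if low >= high:
            return 0
    return high - low

if __name__ == "__main__":
    for word in ("go", "o", "gol", "lol"):
        print(word, count_occurrences("googol", word))
```

The pattern is consumed from its last character to its first, and the interval [low, high) shrinks at each step; an empty interval means no exact match, as for 'lol' in 'googol'.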

38

http://bio-bwa.sourceforge.net/bwa.shtml

46

BWA – 3 algorithms

                  mem                       aln                        bwasw
Read length       >70 bp up to a few Mb     Short reads (~ <100 bp)    Long reads (>100 bp)
Genome length     More than 4 Gb            < 4 Gb                     < 4 Gb
Processing time   +++                       ++                         -
Paired            Yes (+++)                 Yes (++)                   Yes (-)
Algorithm         Auto: end-to-end/SW       End-to-end                 SW

Commands

index (all algorithms) :
bwa index ref.fasta

alignment, mem :
bwa mem ref.fa r1.fq [r2.fq] > o.sam

alignment, aln :
bwa aln ref.fa r1.fq > o1.sai
bwa aln ref.fa r2.fq > o2.sai

alignment, bwasw :
bwa bwasw ref.fa r1.fq [r2.fq] > o.sam

postprocess (aln only), single-end :
bwa samse ref.fa o1.sai r1.fq > o.sam

postprocess (aln only), paired-end :
bwa sampe ref.fa o1.sai o2.sai r1.fq r2.fq > o.sam

48

Exercises / set 2

● Retrieving the reference sequence in fasta format :

– gi|224581838|ref|NC_012125.1| Salmonella enterica subsp. enterica serovar Paratyphi C strain RKS4594, complete genome

● Indexing the reference sequence

● Aligning the reads (fastq format)

● Formatting the alignment in SAM

49

Sequence Alignment/Map (SAM) format

➢ Data sharing was a major issue with the 1000 genomes

➢ Capture all of the critical information about NGS data in a single indexed and compressed file

➢ Sharing : data across tools

➢ Generic alignment format

➢ Supports short and long reads (454 – Solexa – SOLiD)

➢ Flexible in style, compact in size, efficient in random access

Website : http://samtools.sourceforge.net

Paper : Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

50

Sequence Alignment/Map (SAM) format

51

SAM format - Header section

➢ Header lines start with @ followed by a two-letter TAG

➢ Header fields are TYPE:VALUE pairs

52

SAM format - Alignment section

➢ 11 mandatory fields

➢ Variable number of optional fields

➢ Fields are tab delimited
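To make the layout concrete, here is a small Python sketch that splits one alignment line into its 11 mandatory fields and its optional TAG:VTYPE:VALUE fields. The field names follow the SAM 1.0 naming (QNAME ... QUAL); the example read is made up for illustration and the function name is hypothetical.

```python
MANDATORY_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                    "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}

def parse_sam_line(line):
    """Split a tab-delimited SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY_FIELDS):
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    record = {}
    for name, value in zip(MANDATORY_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    tags = {}
    for item in columns[len(MANDATORY_FIELDS):]:
        tag, value_type, value = item.split(":", 2)
        if value_type == "i":      # integer
            value = int(value)
        elif value_type == "f":    # float
            value = float(value)
        tags[tag] = value          # other types are kept as strings
    record["TAGS"] = tags
    return record

# Made-up alignment line, for demonstration only.
EXAMPLE = "\t".join([
    "read1", "0", "NC_012125.1", "5", "37", "2S4M2D6M3S",
    "*", "0", "0", "CCTCAGGCATTAGTG", "I" * 15,
    "NM:i:3", "RG:Z:lane1",
])

if __name__ == "__main__":
    for key, value in parse_sam_line(EXAMPLE).items():
        print(key, value)
```

Header lines (starting with @) are not handled by this function and would have to be filtered out first.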

53

SAM format - Full example

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL>

[<TAG>:<VTYPE>:<VALUE> [...]]

[Figure: example SAM file with its header section and its alignment section.]

Tags (TAG) :

X? : Reserved for end users
NM : Number of nucleotide differences
MD : String for mismatching positions
RG : Read group
[...]

Value types (VTYPE) :

A : Printable character
i : Signed 32-bit integer
f : Single-precision float number
Z : Printable string
H : Hex string (high nybble first)

54

SAM format - Flag field

http://picard.sourceforge.net/explain-flags.html
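The FLAG value is a sum of bits, each bit answering one yes/no question about the read. The Python sketch below does the same kind of decoding as the web page above; the bit values are those of the SAM specification (0x800 was added in later versions of the specification), while the short names are my own labels.

```python
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary_alignment"),
    (0x200, "fails_quality_checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary_alignment"),
]

def decode_flag(flag):
    """Return the list of properties encoded in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]

if __name__ == "__main__":
    for value in (0, 4, 99, 147):
        print(value, decode_flag(value))
```

For example 99 = 1 + 2 + 32 + 64: a paired read, properly paired, whose mate is on the reverse strand, and which is the first read of the pair.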

55

SAM format - Extended CIGAR format

Ref: GCATTCAGATGCAGTACGC

Read: ccTCAGGCATTAgtg

POS CIGAR

5 2S4M2D6M3S
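In this example the read starts with 2 soft-clipped bases (cc), then 4 bases are aligned from reference position 5 (TCAG), 2 reference bases are deleted, 6 more bases are aligned and the last 3 bases are soft-clipped (gtg). The Python sketch below parses such a CIGAR string and derives the read length and the reference interval it covers; the function names are illustrative.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    operations = [(int(n), op) for n, op in CIGAR_OPERATION.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in operations) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return operations

def query_length(operations):
    """Number of read bases: M, I, S, = and X consume the read."""
    return sum(n for n, op in operations if op in "MIS=X")

def reference_span(operations):
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in operations if op in "MDN=X")

def end_position(pos, operations):
    """Last reference position (1-based, inclusive) covered by the alignment."""
    return pos + reference_span(operations) - 1

if __name__ == "__main__":
    ops = parse_cigar("2S4M2D6M3S")
    print(ops)
    print("read length:", query_length(ops))
    print("reference span:", reference_span(ops))
    print("alignment covers positions 5 to", end_position(5, ops))
```

The read length computed from the CIGAR (15) matches the length of the read ccTCAGGCATTAgtg, which is a useful sanity check on any SAM record.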

56

SAM format - Extended CIGAR format

57

BAM format

➢ Binary representation of SAM

➢ Compressed by BGZF library

➢ Greatly reduces storage space requirements to about 27% of original SAM

http://samtools.sourceforge.net/index.shtml

58

SAMtools

➢ Library and software package

➢ Creating sorted and indexed BAM files from SAM files

➢ Removing PCR duplicates

➢ Merging alignments

➢ Visualization of alignments from BAM files

➢ SNP calling

➢ Short indel detection

http://samtools.sourceforge.net/samtools.shtml

59

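SAMtools - Example usage

http://samtools.sourceforge.net/SAM1.pdf

60

SAMtools - Example usage

➢ Create BAM from SAM

samtools view -bS aln.sam -o aln.bam

➢ Sort BAM file

samtools sort example.bam sortedExample

(samtools 0.1.x syntax with an output prefix; newer versions use : samtools sort -o sortedExample.bam example.bam)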

➢ Merge sorted BAM files

samtools merge sortedMerge.bam sorted1.bam sorted2.bam

➢ Index BAM file

samtools index sortedExample.bam

➢ Visualize BAM file

samtools tview sortedExample.bam reference.fa

61

Picard

➢ A SAMtools complementary package

➢ More format conversion than SAMtools

➢ Visualization of alignments not available

➢ SNP calling & short indel detection not available

http://picard.sourceforge.net/

62

Picard - Example usage

java -Xmx2g -jar PicardCmd.jar OPTION1=value1 OPTION2=value2...

➢ ValidateSamFile
➢ SortSam
➢ MarkDuplicates
➢ EstimateLibraryComplexity
➢ MergeSamFiles
➢ ViewSam
➢ ReplaceSamHeader
➢ SamToFastq
➢ FastqToSam
➢ SamFormatConverter
➢ CreateSequenceDictionary
➢ CleanSam
➢ CompareSAMs

http://picard.sourceforge.net/explain-flags.html

63

Exercise / set 3

64

Visualizing the alignment - SAMtools : tview

samtools tview aln.sorted.bam ref.fasta

65

Visualizing the alignment - IGV

➢ IGV : Integrative Genomics Viewer

➢ Website : http://www.broadinstitute.org/igv

66


➢ High-performance visualization tool

➢ Interactive exploration of large, integrated datasets

➢ Supports a wide variety of data types

➢ Documentation

➢ Developed at the Broad Institute of MIT and Harvard

67


68

Visualizing the alignment - IGV - Loading the reference

69

Visualizing the alignment - IGV - Loading the reference

http://picard.sourceforge.net/

70

Visualizing the alignment - IGV - Loading the bam file

71

Visualizing the alignment - IGV - Loading the bam file

72

Visualizing the alignment - IGV - Zoom

73

Visualizing the alignment - IGV - Zoom

http://www.broadinstitute.org/igv

74

Visualizing the alignment - IGV - Loading an annotation file

75

Visualizing the alignment - IGV - Coverage

76

Exercise / set 4

77

Variant calling

➢ A new file format :

✔ VCF (Variant Call Format)

➢ Data pre-processing

➢ Variant calling

➢ IGV visualisation

78

The VCF format

✗ Tab-delimited text file

##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
...
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
...
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...]
##contig=<ID=scaffold376,length=1000>
##reference=file:///work/banks/genome.fasta
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2
scaffold376 101 . G A 247.82 . AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=1.905;QD=10.77;ReadPosRankSum=0.495;SB=85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0
scaffold376 121 . G A 1443.70 . AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=0.519;SB=486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,0 0/1:17,11:28:99:228,0,371

http://vcftools.sourceforge.net/specs.html

http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk

79

The VCF format


##fileformat=VCFv4.1

VCF version



80

The VCF format


##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...]
##contig=<ID=scaffold376,length=1000>
##reference=file:///work/banks/genome.fasta
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2
scaffold376 101 . G A 247.82 . AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=1.905;QD=10.77;ReadPosRankSum=0.495;SB=85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0
scaffold376 121 . G A 1443.70 . AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=0.519;SB=486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,0 0/1:17,11:28:99:228,0,371

Fields description



81

The VCF format



Tools & options used



82

The VCF format



Genome information



83

The VCF format



Header line



84

The VCF format

✗ Header line


CHROM : chromosome ID
POS : position (sorted)
ID : variant ID
REF : reference base(s)
ALT : alternative allele(s)
QUAL : phred-scaled variant quality
FILTER : [ . | PASS | type of filter ]
INFO : list of additional information
FORMAT : genotype fields format
l1 : 1st genotype
l2 : 2nd genotype
...

85

The VCF format



Variant lines



86

The VCF format

✗ How to interpret a variant line ?


Variant at position 101 on scaffold376
Not annotated
Base G in reference genome / A in the sequences
SNP quality : 247.82
No filters applied
Other information

GT :
./. → not defined (l1)
0/0 → homozygous reference
0/1 → heterozygous
1/1 → homozygous alternative (l2)

AD :
allelic depth for REF → 0
allelic depth for ALT → 13

DP : read depth → 13

GQ : genotype quality → 9.01

PL :
genotype 0/0 likelihood → 87
genotype 0/1 likelihood → 9
genotype 1/1 likelihood → 0
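The same reading can be automated. The Python sketch below splits a variant line into its fixed columns, its INFO key/value pairs and one dictionary per sample, then translates the GT value into words. It is a simplified illustration: the INFO column of the example is shortened, columns are split on any whitespace (a real VCF file is tab-delimited), and dedicated libraries exist for production use. All function names are hypothetical.

```python
GENOTYPE_LABELS = {
    "./.": "not defined",
    "0/0": "homozygous reference",
    "0/1": "heterozygous",
    "1/1": "homozygous alternative",
}

def parse_vcf_record(line, sample_names):
    """Parse one VCF variant line into a dictionary."""
    columns = line.split()
    chrom, pos, variant_id, ref, alt, qual, flt, info, fmt = columns[:9]
    info_fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_fields[key] = value if value else True  # flags have no value
    format_keys = fmt.split(":")
    samples = {}
    for name, field in zip(sample_names, columns[9:]):
        # A missing genotype such as ./. only provides the GT value.
        samples[name] = dict(zip(format_keys, field.split(":")))
    return {
        "CHROM": chrom, "POS": int(pos), "ID": variant_id,
        "REF": ref, "ALT": alt.split(","), "QUAL": float(qual),
        "FILTER": flt, "INFO": info_fields, "SAMPLES": samples,
    }

def describe_genotype(sample):
    """Translate the GT value of a sample into words."""
    return GENOTYPE_LABELS.get(sample["GT"], "other genotype")

# First variant of the slide, with a shortened INFO column.
EXAMPLE = ("scaffold376 101 . G A 247.82 . AC=5;AF=0.417;AN=12;DP=13 "
           "GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0")

if __name__ == "__main__":
    record = parse_vcf_record(EXAMPLE, ["l1", "l2"])
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
    for name, sample in record["SAMPLES"].items():
        print(name, describe_genotype(sample), sample)
```

PL values are Phred-scaled and normalised so that the most likely genotype has PL = 0; here 1/1 is the most likely genotype for l2.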

87

The VCF format

✗ Small INDELs

Deletion :
scaffold376 500 . CTT C 424.60 ....

Insertion :
scaffold376 500 . C CT 434.60 ....

✗ Multi-allelic variants

scaffold376 577 . C A,T 2303.19 .... GT:AD:DP:GQ:PL 1/2:0,10,6:16:99:394,145,118,249,0,234 1/2:0,20,6:26:99:658,160,106,498,0,480

88

THE GATK best practices

89

THE GATK best practices

90

Why GATK ?

➢ New technologies struggle to provide accurate phred-scaled base qualities

➢ A wrong base quality has a strong impact on SNP calling

➢ Illustration with 5 variant calling methods :

91

GATK pipeline

[Diagram: GATK pipeline. Tools shown, each reading and writing a SAM/BAM file: Picard tools AddOrReplaceReadGroups, Picard tools MarkDuplicates, Samtools index, GATK RealignerTargetCreator, GATK IndelRealigner, GATK BaseRecalibrator, GATK PrintReads, GATK UnifiedGenotyper. The final output is a VCF file.]

92

Warning ! Huge temporary data

[Diagram: the GATK pipeline with its intermediate SAM/BAM files and the final VCF highlighted.]

93

Warning ! Huge temporary data

[Diagram: the GATK pipeline with its intermediate SAM/BAM files and the final VCF highlighted.]

* High CPU and memory requirements
* Long processing time

94

Add Read Groups

➢ GATK needs @RG field in the BAM header

Aim : keep meta-information ("Where do the sequences come from?"). Useful to recalibrate base qualities (covariates) and to call variants.

java -Xmx4G -jar AddOrReplaceReadGroups.jar I=file.bam O=out.bam ID=[val] RGLB=[val] PL=[val] PU=[val] SM=[val]

95

Add Read Groups

➢ GATK needs @RG field in the BAM file to run

96

Add Read Groups

➢ GATK needs @RG field in the BAM file to run

The @RG tag can also be added directly with "bwa samse" :

bwa samse -r '@RG\tID:[val]\tPL:[val]\tPU:[val]\tLB:[val]\tSM:[val]' ref.fasta file.sai file.fastq > file.sam

97

Duplicate Marking

➢ Duplicates can propagate sequencing errors

Credits : GATK website

java -Xmx4G -jar MarkDuplicates.jar I=file.bam O=file_rmdup.bam REMOVE_DUPLICATES=boolean M=file.metrics
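The idea behind duplicate marking can be sketched as follows: reads that start at the same position on the same strand are assumed to come from the same original fragment, and only the best one is kept. This Python sketch is a deliberately simplified illustration for single-end reads, not the actual Picard algorithm (which also deals with clipping, mate positions and libraries); the read dictionaries and their keys are hypothetical.

```python
from collections import defaultdict

def find_duplicates(reads):
    """Return the names of reads considered duplicates.

    reads: list of dicts with keys 'name', 'chrom', 'pos', 'strand' and
    'quals' (list of Phred scores). Within each group of reads sharing
    chromosome, start position and strand, the read with the highest sum
    of base qualities is kept and the others are reported as duplicates.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)
    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda read: sum(read["quals"]))
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates

# Made-up reads, for demonstration only.
READS = [
    {"name": "r1", "chrom": "scaffold376", "pos": 100, "strand": "+", "quals": [30, 30, 30]},
    {"name": "r2", "chrom": "scaffold376", "pos": 100, "strand": "+", "quals": [40, 40, 40]},
    {"name": "r3", "chrom": "scaffold376", "pos": 100, "strand": "-", "quals": [20, 20, 20]},
    {"name": "r4", "chrom": "scaffold376", "pos": 250, "strand": "+", "quals": [35, 35, 35]},
]

if __name__ == "__main__":
    print(sorted(find_duplicates(READS)))
```

Here r1 and r2 share the same start and strand, so the lower-quality r1 is reported; r3 is on the other strand and r4 starts elsewhere, so both are kept.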

98

Local Realignment

Read concerned

99

Local Realignment

Raw BAM : CIGAR 101M, 2 mismatches

Realigned BAM : CIGAR 98M25D3M, 1 mismatch

Read concerned

100

Base recalibration

➢ The base qualities provided by the sequencers are too inaccurate to be kept as they are, so they are recomputed.

➢ Recalibration requires a set of known (real) SNPs
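The principle can be illustrated with a very reduced Python sketch: at positions that are not known SNPs, every mismatch with the reference is counted as a sequencing error, and an empirical quality is recomputed for each reported quality value. This is only a toy model with a single covariate (the reported quality) and a simple +1/+2 smoothing of my own choosing; GATK uses several covariates and a more elaborate model. All names are hypothetical.

```python
import math
from collections import defaultdict

def empirical_qualities(observations, known_snps):
    """Recompute one empirical Phred quality per reported quality value.

    observations: iterable of (chrom, pos, reported_quality, is_mismatch).
    known_snps: set of (chrom, pos) excluded from the error counts, because
    a mismatch there may be a real variant rather than a sequencing error.
    """
    counts = defaultdict(lambda: [0, 0])  # reported quality -> [bases, errors]
    for chrom, pos, reported, is_mismatch in observations:
        if (chrom, pos) in known_snps:
            continue
        counts[reported][0] += 1
        counts[reported][1] += int(is_mismatch)
    recalibrated = {}
    for reported, (bases, errors) in counts.items():
        error_rate = (errors + 1) / (bases + 2)  # smoothed error rate
        recalibrated[reported] = -10 * math.log10(error_rate)
    return recalibrated

if __name__ == "__main__":
    # Made-up data: 998 bases reported as Q30, 9 of them mismatching,
    # plus one mismatch at a known SNP that must be ignored.
    data = [("chr1", i, 30, i < 9) for i in range(998)]
    data.append(("chr1", 5000, 30, True))
    print(empirical_qualities(data, known_snps={("chr1", 5000)}))
```

In this made-up case the bases reported as Q30 behave like Q20 bases, which shows why the set of known SNPs matters: without it, real variants would be counted as errors and the qualities would be lowered too much.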

101

Variant calling information

102

IGV : visualization of variants

103

Synthesis

[Diagram: pipeline synthesis. Reads (fastq) come from SRA / ENA / [...] and are checked with FastQC. BWA (aln / bwasw / mem) aligns them and produces SAM. samtools (view / merge / sort / ...) produces BAM, which is viewed in IGV. rmdup, realignment, recalibration and variant calling (GATK, Picard tools) work on BAM files and produce VCF.]

104

Exercise / set 5
