Bioinformatics and genome analysis
docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo...
Background
• Bachelor's degree (Laurea Triennale): Scienze Biologiche, Università degli Studi di Ferrara, Dr. Silvia Fuselli;
• Master's degree (Laurea Specialistica): Scienze Biomolecolari e Cellulari, Università degli Studi di Ferrara, Dr. Silvia Fuselli;
• PhD in Genetics, University of Leicester, Prof. Mark A. Jobling;
• Post-doctoral fellow, EPHE-MNHN, Paris, Dr. Stefano Mona.
Cusco, March 2009
Muséum national d'Histoire naturelle ‐ Paris
Practical information
• Theory + practice;
• Software and tools;
• Files;
• Slides on the website;
• New topics / topics already covered.
Syllabus
• next-generation sequencing (NGS): how, when, why?
• an example of NGS data management and analysis:
  • type of data;
  • files and formats;
  • programs;
  • interpretation of the results;
  • error estimation;
  • when to stop?
• Applications and/or projects on different organisms.
NGS: how, when, why?
[Figure: overview of NGS applications and analysis steps. DNA-seq: whole genome; capture (exome/custom/cancer); amplicon sequencing. RNA-seq: whole transcriptome; amplicon sequencing (fixed/custom). Steps: sequencing; unaligned reads QC; reads trimming; mapping to a reference genome or de novo assembly; mapping refinement; mapping QC / assembly QC; filtering; validation.]
Question: when?
Answer: when it makes sense!
• A 400 bp amplicon in 100 individuals? → Sanger sequencing
• 50 amplicons in 100 individuals? → NGS + target capture
• Gene conversion, repeated elements, recombination breakpoints? → NGS + Sanger sequencing
Question: why?
Answer: your idea for a project!
An example of NGS data management and analysis
Sequencing platforms:
• Nanopore MinION/GridION
• Pacific Biosciences (PacBio)
• Ion Torrent PGM/Proton
• Roche 454
• Illumina MiSeq/HiSeq
• project
• project:application (whole genomes? exomes? target capture? amplicon sequencing?)
• project:application:aim (SNPs, indels, repeated elements, CNVs…)
• project:application:aim:coverage (SNPs, indels, repeated elements, CNVs…)
Project:
• Saccharomyces cerevisiae;
• Genome: 16 chromosomes, ~12.5Mb, ~6200 genes;
• Whole genome sequencing;
• Illumina platform;
• Paired‐end reads, 1 library, 2 lanes.
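Since coverage is part of the project design, a rough first estimate can be obtained from the expected mean depth, C = N × L / G (number of reads × read length / genome size). The sketch below only illustrates that arithmetic: the genome size is the ~12.5 Mb quoted above, while the read count and read length are made-up values, not figures from the course data set.

```python
def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth of coverage: C = N * L / G."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


GENOME_SIZE = 12_500_000  # ~12.5 Mb, from the project description
# Hypothetical values, chosen only for illustration:
N_READS = 5_000_000       # total number of reads (R1 + R2, both lanes)
READ_LENGTH = 100         # bases per read

print(f"expected coverage: {expected_coverage(N_READS, READ_LENGTH, GENOME_SIZE):.1f}x")
```

This is a mean over the whole genome; real coverage is uneven, which is why the coverage distribution is checked again after mapping.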
Single-end (SE) or paired-end (PE) sequencing.
fragment                  ========================================
fragment + adaptors    ~~~========================================~~~
SE read                   --------->
PE reads                R1--------->                    <---------R2
unknown gap                         ....................
Analysis workflow: raw reads (.fastq) → aligned reads (.sam/.bam) → raw variants (.vcf) → ready-to-use variants (.vcf)
1. Fastq quality control + trimming: adapters? low quality bases?
2. Alignment to a reference genome: close reference or time limited? → bwa; distant reference? → stampy
3. bam refinement: duplicate removal (picard), local realignment (GATK), base recalibration (GATK)
4. bam check and visualization: duplicate metrics (picard), flagstat (samtools), coverage distribution (GATK)
5. Variant calling (samtools): SNPs/indels, single/multi-sample
6. Variant filtering and validation: in silico vs in vitro validation; vcftools; variant score recalibration (big datasets, known SNPs/indels)
File formats:
.fa/.fasta     sequences
.fastq         read data
.sam (.sai)    mapped reads
.bam (.bai)    mapped reads (binary)
.vcf           variant information
raw reads (.fastq)
gedit s-6-1.fastq
OR
Terminal: more s-6-1.fastq OR head s-6-1.fastq
@IL29_4505:7:24:8932:6562#2/1
TAACGGTGGGTGAGTGGTAGTAAGTAGAGGGATGGATGGTGGTTCGGAGTGGTATGGTTGAATGGGACAGGGTAACGAGTGGAGAGTAGGGTAATGGAGGGTAAGTTC
+
CDDCDDABBBABABABB@BCACBDABCBBAB@BBCABBBABB?CBCCABABBABBBBABA?ACBAAAAA?BB;BCAABA7AA?B?A??AAA>?A:AA?AA?%?AA@=9
Fields of the read name @IL29_4505:7:24:8932:6562#2/1:
• IL29_4505: Instrument ID;
• 7: Lane;
• 24: Tile;
• 8932:6562: coordinates of the cluster;
• #2: Index number;
• /1: First mate in the pair (paired-end reads).
The sequence after the read name is the read; the string after the + sign holds the quality values for each nucleotide.
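The read name can also be split programmatically. The following is a minimal sketch, assuming the older Illumina naming scheme seen in this file (`@instrument:lane:tile:x:y#index/mate`); the class and function names are made up for illustration, and newer Illumina read names use a different layout that this pattern would reject.

```python
import re
from typing import NamedTuple


class ReadName(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    mate: int


# Assumed layout: @instrument:lane:tile:x:y#index/mate
_READ_NAME = re.compile(r"^@([^:]+):(\d+):(\d+):(\d+):(\d+)#([^/]+)/([12])$")


def parse_read_name(header: str) -> ReadName:
    """Split a FASTQ read name into its fields, or raise ValueError."""
    match = _READ_NAME.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised read name: {header!r}")
    instrument, lane, tile, x, y, index, mate = match.groups()
    return ReadName(instrument, int(lane), int(tile), int(x), int(y), index, int(mate))


name = parse_read_name("@IL29_4505:7:24:8932:6562#2/1")
print(name.lane, name.tile, name.mate)
```

Reading the lane field this way from the first record of each file is one way to approach the question of which lane a FASTQ file comes from.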
Quality values for each nucleotide (base quality score)
ASCII characters from lowest (33) to highest (126), with the Illumina 1.8+ Phred+33 scale below (raw reads typically 0 to 41):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
0.2......................26...31........41
Phred-scale value:
$Q = -10 \log_{10} P \;\rightarrow\; P = 10^{-Q/10}$
Phred quality score (Q)    Probability of incorrect base call (P)    Base call accuracy
10                         1 in 10                                   90%
20                         1 in 100                                  99%
30                         1 in 1000                                 99.9%
40                         1 in 10000                                99.99%
50                         1 in 100000                               99.999%
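To make the encoding concrete, the sketch below decodes a Phred+33 quality string into scores and converts a score into an error probability with P = 10^(-Q/10). The function names are illustrative, and the offset of 33 is an assumption that holds for this file but not for older Phred+64 data.

```python
def decode_qualities(quality_string: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def phred_to_error_prob(q: float) -> float:
    """Probability that the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Last 8 quality characters of the example read.
tail = "?%?AA@=9"
for char, q in zip(tail, decode_qualities(tail)):
    print(f"{char!r}: Q={q:2d}  P={phred_to_error_prob(q):.4f}")
```

The `%` character decodes to Q=4, that is an error probability of about 0.4, which illustrates why the 3' end of a read is the usual target of trimming.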
raw reads (.fastq)
• Move into the folder lane2;
• Open s-7-1.fastq:
gedit s-7-1.fastq
OR
Terminal: more s-7-1.fastq OR head s-7-1.fastq
• Do s-6-1.fastq and s-7-1.fastq come from two different lanes?
1‐ Fastq quality control + trimming
FastQC: quality control of the raw data coming out of the sequencer
• Evaluation of the quality of the generated data;
• Basic summary statistics of the raw data;
• Several modules to evaluate different features (e.g. adapters, base quality, etc.);
• Feedback (green, orange, red): do not rely on it blindly; think about what it means!
Per base sequence quality: warning
What can we do to improve the quality at the end of the reads?
Read trimming: removal of the 3' ends with low quality scores.
[Figure: per base sequence quality plots; labelled read positions: 95-99 bp, 90-94 bp]
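The idea behind read trimming can be sketched in a few lines: walk back from the 3' end and drop bases while their quality is below a threshold. This is a deliberately simple illustration, not the algorithm of any specific trimming program (those typically use sliding windows or running sums); the threshold of Q20 and the Phred+33 offset are assumptions.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Remove trailing bases whose Phred quality is below min_q.

    Returns the trimmed (sequence, quality) pair. Low-quality bases
    inside the read are left untouched: only the 3' end is cut.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Toy read: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
print(trim_3prime("ACGTACGT", "IIIII#I#"))
```

After trimming, reads have different lengths, so it is worth re-running the quality control to check the new sequence length distribution.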
Per sequence quality score: pass
Sequence length: pass
Adapters removal
Failed
Warning
Pass
Overrepresented sequences
Removal of overrepresented sequences (PCR primers).
FastQC references:
• Software website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Manual: https://insidedna.me/tool_page_assets/pdf_manual/fastqc.pdf
2‐ Alignment to a reference genome
Alignment: the process of determining the most likely location within the genome for the observed DNA read.
[Figure: raw reads aligned to the reference genome]
Trade-off: speed vs sensitivity; the higher the accuracy, the longer the alignment run.
Two classes of methods:
Burrows-Wheeler
• fast;
• less robust at high divergence with the reference genome;
• e.g. bwa.
Hashing
• slow (needs more memory);
• robust at high divergence with the reference genome;
• e.g. stampy.
The shorter the read, the harder it is to find its location in the genome.
Big amounts of data: computationally challenging for memory and speed.
What if there are several possible places to align your sequencing read?
This may be due to:
‐ Repeated elements in the genome
‐ Low complexity sequences
‐ Reference errors and gaps
MQ (mapping quality) is a Phred-scaled score of the quality of the alignment:
• high MQ: a single match;
• low MQ: the probability of mapping to different locations is high, but there are no perfect multiple matches;
• MQ0: a perfect multiple match.
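Because MQ is Phred-scaled, it can be read as the probability that the read is placed at the wrong position. The sketch below pulls MQ from an alignment line (column 5 of the SAM format) and sorts it into the three classes above. The SAM line is a made-up example, and the cut-off of 20 between "low" and "high" is an arbitrary assumption, not a value from the course.

```python
def mapq_of(sam_line: str) -> int:
    """Return the mapping quality (5th tab-separated column) of a SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[4])


def mismap_probability(mapq: int) -> float:
    """Phred-scaled MQ to probability of a wrong placement: 10^(-MQ/10)."""
    return 10 ** (-mapq / 10)


def classify(mapq: int, low_threshold: int = 20) -> str:
    """Label an MQ value; low_threshold is an illustrative cut-off."""
    if mapq == 0:
        return "MQ0"
    return "low MQ" if mapq < low_threshold else "high MQ"


# Hypothetical alignment line, for illustration only.
line = "read1\t99\tI\t1000\t60\t8M\t=\t1200\t208\tACGTACGT\tIIIIIIII"
mq = mapq_of(line)
print(mq, classify(mq), mismap_probability(mq))
```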
[Figure: reads from Sample_1 aligned to a reference sequence containing repeated elements (Element 1, Element 2). Three cases are shown:]
• Reference sequence and Sample_1 both carry 1 copy of each element;
• the reference sequence carries 2 copies of Element 1 and Sample_1 carries 1 copy: perfect multiple matches → MQ0; not a perfect match → low MQ;
• the reference sequence carries 1 copy of Element 1 and Sample_1 carries 2 copies: false heterozygous calls, cluster of heterozygotes.
[Figure: example of a repeated element, AluSg7]
2‐ Alignment to a reference genome: reference sequence
Create the index of the reference genome (for bwa, samtools and picard).
bwa index: this is an FM-index, specific to the algorithm behind this aligner
bwa index -a is Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa
index .fai
samtools faidx Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa
The index file stores, for each sequence, the sequence identifier, its length, the offset of the first sequence character in the file, the number of characters per line, and the number of bytes per line.
[Figure: example .fai file; annotations: 50 characters, 60 characters]
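The five columns of the .fai file can be understood by computing them by hand. The following is a simplified re-implementation for illustration only, not the samtools code: it assumes a well-formed FASTA in which all lines of a sequence (except possibly the last) have the same length, and the example FASTA is made up.

```python
def build_fai(fasta_bytes: bytes) -> list:
    """Return one (name, length, offset, linebases, linewidth) tuple per sequence.

    offset    = byte position of the first base of the sequence in the file
    linebases = number of bases on each sequence line
    linewidth = number of bytes on each sequence line, newline included
    """
    records = []
    name = None
    length = offset = linebases = linewidth = 0
    pos = 0  # byte position of the start of the current line
    for line in fasta_bytes.splitlines(keepends=True):
        stripped = line.rstrip(b"\r\n")
        if line.startswith(b">"):
            if name is not None:
                records.append((name, length, offset, linebases, linewidth))
            name = stripped[1:].split()[0].decode()  # identifier up to first space
            length = linebases = linewidth = 0
            offset = pos + len(line)  # sequence starts right after the header line
        elif name is not None and stripped:
            if linebases == 0:  # first sequence line defines the line layout
                linebases, linewidth = len(stripped), len(line)
            length += len(stripped)
        pos += len(line)
    if name is not None:
        records.append((name, length, offset, linebases, linewidth))
    return records


# Toy FASTA with 10 bases per line (hypothetical content).
EXAMPLE_FASTA = b">chrA description\nACGTACGTAC\nACGTACGTAC\nACGT\n>chrB\nGGCC\n"
for record in build_fai(EXAMPLE_FASTA):
    print("\t".join(str(value) for value in record))
```

With these five numbers a program can jump directly to any base of any sequence without reading the whole file, which is what makes random access to a large reference fast.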
Create the dictionary of the reference genome (for samtools, gatk and picard).
dictionary .dict: list of contigs included in the fasta file of the reference genome
java -jar picard.jar CreateSequenceDictionary REFERENCE=Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa OUTPUT=Saccharomyces_cerevisiae.EF4.68.dna.toplevel.dict
Keep the index and dictionary files in the same directory as the reference file!
Each record of the .dict file reports:
• Sequence name (SN);
• Sequence length (LN);
• Path (UR);
• MD5 checksum (M5).
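A .dict file uses the SAM header layout, with one tab-separated `@SQ` line per contig. The sketch below parses such a line into its tags and shows how an MD5 checksum of a sequence can be computed. The example line and its values are illustrative, and the normalisation used before hashing (whitespace removed, upper case) is a simplified reading of the SAM specification, so checksums should be compared against Picard's output before relying on them.

```python
import hashlib


def parse_sq_line(line: str) -> dict:
    """Parse an @SQ header line into a {tag: value} dictionary."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@SQ":
        raise ValueError("not an @SQ line")
    # Split on the first colon only: the UR value itself contains colons.
    return dict(field.split(":", 1) for field in fields[1:])


def sequence_md5(sequence: str) -> str:
    """MD5 of the sequence after removing whitespace and upper-casing."""
    cleaned = "".join(sequence.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


# Hypothetical record, for illustration only.
example = "@SQ\tSN:I\tLN:230218\tUR:file:/data/ref.fa"
record = parse_sq_line(example)
print(record["SN"], record["LN"], record["UR"])
print(sequence_md5("acgt\nACGT\n"))
```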
2‐ Alignment to a reference genome: mapping with bwa-mem
Three different algorithms:
1. BWA-backtrack: for Illumina reads up to 100 bp;
2. BWA-SW: long read support, split alignment;
3. BWA-MEM: long read support, split alignment, faster, more accurate.
Split read: [Figure from Karakoc E et al., 2012]
• paired-end alignment (lane1);
• it uses the reference genome (.fa) and the reads (.fastq) to create a SAM file;
• option (-M) to mark shorter split hits as secondary (not supplementary).
bwa mem [options] [RefSeq] [lane1_fastq1] [lane1_fastq2] > lane1.sam
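When the same mapping has to be repeated for several lanes, the command line can be assembled in a script. The sketch below only builds and prints the command; it does not run bwa. The helper name and the FASTQ file names are assumptions made for illustration (in particular the name of the second mate file), while `-M` and `-t` are standard `bwa mem` options.

```python
import shlex


def bwa_mem_command(reference, fastq1, fastq2, mark_secondary=True, threads=1):
    """Build the argument list for a paired-end bwa mem run."""
    cmd = ["bwa", "mem"]
    if mark_secondary:
        cmd.append("-M")  # mark shorter split hits as secondary
    cmd += ["-t", str(threads), reference, fastq1, fastq2]
    return cmd


REFERENCE = "Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa"
# Hypothetical file names for the two mates of lane 1.
cmd = bwa_mem_command(REFERENCE, "lane1/s-6-1.fastq", "lane1/s-6-2.fastq")
print(shlex.join(cmd), "> lane1.sam")
```

To actually launch the alignment, the list can be passed to `subprocess.run` with `stdout` redirected to the output SAM file, provided bwa is installed and the reference has already been indexed.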