bioinformatica e analisi dei genomidocente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... ·...

Bioinformatics and Genome Analysis

Academic year 2015/2016

Pierpaolo Maisano Delser

email: [email protected]

Background

• Bachelor's degree (Laurea Triennale): Biological Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;

• Master's degree (Laurea Specialistica): Biomolecular and Cellular Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;

• PhD in Genetics, University of Leicester, Prof. Mark A. Jobling;

• Post-doctoral fellow, EPHE-MNHN, Paris, Dr. Stefano Mona.


Cusco, March 2009

Muséum national d'Histoire naturelle ‐ Paris

Practical information

• Theory + practice;

• Software and tools;

• Files;

• Slides on the website;

• New topics / topics already covered.

Programme

• next-generation sequencing (NGS): how, when, why?

• an example of NGS data management and analysis:

    • type of data;
    • files and formats;
    • programs;
    • interpretation of the results;
    • error estimation;
    • when to stop?

• Applications and/or projects on different organisms.

NGS: how, when, why?

[Figure: overview of NGS workflows. DNA-seq applications: whole genome; capture (exome/custom/cancer); amplicon sequencing. RNA-seq applications: whole transcriptome; amplicon sequencing (fixed/custom). Processing steps shown: sequencing, unaligned reads QC, reads trimming, mapping to a reference genome or de novo assembly, mapping QC or assembly QC, mapping refinement, filtering, validation.]


Question: when?

Answer: when it makes sense!

• A 400 bp amplicon in 100 individuals? → Sanger sequencing

• 50 amplicons in 100 individuals? → NGS + target capture

• Gene conversion, repeated elements, recombination breakpoints? → NGS + Sanger sequencing

Question: why?

Answer: your idea for a project!


An example of NGS data management and analysis

Sequencing platforms:

• Illumina MiSeq/HiSeq

• Roche 454

• Ion Torrent PGM/Proton

• Pacific Biosciences (PacBio)

• Nanopore MinION/GridION

[Figure: the NGS workflow overview (DNA-seq and RNA-seq applications, sequencing, QC, trimming, mapping or de novo assembly, refinement, filtering, validation), shown together with the sequencing platforms.]



• project

• project:application (whole genomes? exomes? target capture? amplicon sequencing?)

• project:application:aim (SNPs, indels, repeated elements, CNVs…)

• project:application:aim:coverage (SNPs, indels, repeated elements, CNVs…)

Project:

• Saccharomyces cerevisiae;

• Genome: 16 chromosomes, ~12.5Mb, ~6200 genes;

• Whole genome sequencing;

• Illumina platform;

• Paired‐end reads, 1 library, 2 lanes.
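Since coverage is one of the planning parameters listed above, a rough estimate can be sketched with the usual average-coverage relation C = N × L / G (number of reads × read length / genome size). The genome size below comes from the slide (~12.5 Mb); the read length and the number of reads are hypothetical values chosen only for illustration, and the function names are not from any tool.

```python
import math


def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: C = N * L / G."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 12_500_000   # S. cerevisiae, ~12.5 Mb (from the slide)
READ_LENGTH = 100          # hypothetical read length in bp
N_READS = 5_000_000        # hypothetical total reads (both mates, both lanes)

if __name__ == "__main__":
    print(expected_coverage(N_READS, READ_LENGTH, GENOME_SIZE))
    print(reads_needed(30, READ_LENGTH, GENOME_SIZE))
```

This is an average only: real coverage varies along the genome, which is why the coverage distribution is checked after mapping.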

fragment              ========================================
fragment + adaptors   ~~~========================================~~~
SE read               --------->
PE reads              R1--------->                  <---------R2
unknown gap                       ..................

Single‐end (SE) or paired‐end (PE) sequencing.

Analysis pipeline:

1. Fastq quality control + trimming
   adapters? low quality bases?
   raw reads (.fastq)

2. Alignment to a reference genome
   close reference? time limited? → bwa
   distant reference? → stampy
   aligned reads (.sam/.bam)

3. bam refinement
   duplicate removal (picard)
   local realignment (GATK)
   base recalibration (GATK)

4. bam check, visualization
   duplicate metrics (picard)
   flagstat (samtools)
   coverage distribution (GATK)

5. Variant calling (samtools)
   SNPs/indels
   single/multi-sample
   raw variants (.vcf)

6. Variant filtering and validation
   vcftools
   variant score recalibration (big datasets, known SNPs/indels)
   in silico vs in vitro validation
   ready-to-use variants (.vcf)

File formats:

.fa/.fasta      sequences
.fastq          read data
.sam (.sai)     mapped reads
.bam (.bai)     mapped reads (binary)
.vcf            variant information

@IL29_4505:7:24:8932:6562#2/1
TAACGGTGGGTGAGTGGTAGTAAGTAGAGGGATGGATGGTGGTTCGGAGTGGTATGGTTGAATGGGACAGGGTAACGAGTGGAGAGTAGGGTAATGGAGGGTAAGTTC
+
CDDCDDABBBABABABB@BCACBDABCBBAB@BBCABBBABB?CBCCABABBABBBBABA?ACBAAAAA?BB;BCAABA7AA?B?A??AAA>?A:AA?AA?%?AA@=9

raw reads (.fastq)

To open the file:

gedit s-6-1.fastq

OR

Terminal: more s-6-1.fastq OR head s-6-1.fastq

Each record has four lines. The header line @IL29_4505:7:24:8932:6562#2/1 reads as follows:

• IL29_4505: instrument ID

• 7: lane

• 24: tile

• 8932:6562: coordinates of the cluster

• #2: index number

• /1: first mate in the pair (paired-end reads)

The second line is the read; the fourth line holds the quality values for each nucleotide (base quality score).

Quality characters, from lowest to highest (ASCII 33 to 126):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Illumina 1.8+ uses Phred+33; raw reads typically have quality values between 0 and 41 (characters ! to J).
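The header fields described above can be pulled apart programmatically. The following is a minimal sketch, assuming the older Illumina naming scheme seen in the example record (instrument:lane:tile:x:y#index/mate); the `ReadName` type and `parse_header` function are illustrative names, and other header layouts would need a different pattern.

```python
import re
from typing import NamedTuple


class ReadName(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int          # x coordinate of the cluster on the tile
    y: int          # y coordinate of the cluster on the tile
    index: str      # index (barcode) number
    mate: int       # 1 = first mate, 2 = second mate of the pair


# Pattern for headers such as @IL29_4505:7:24:8932:6562#2/1
_HEADER = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<mate>[12])$"
)


def parse_header(line: str) -> ReadName:
    """Split a FASTQ header line into its fields."""
    m = _HEADER.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised FASTQ header: {line!r}")
    return ReadName(
        m["instrument"], int(m["lane"]), int(m["tile"]),
        int(m["x"]), int(m["y"]), m["index"], int(m["mate"]),
    )


if __name__ == "__main__":
    print(parse_header("@IL29_4505:7:24:8932:6562#2/1"))
```

Comparing the `lane` field of the first header in two files is one quick way to check whether they come from different lanes.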


Phred-scale value:

$Q = -10 \log_{10} P \;\Rightarrow\; P = 10^{-Q/10}$

Phred quality score (Q)    Probability of incorrect base call (P)    Base call accuracy
10                         1 in 10                                   90%
20                         1 in 100                                  99%
30                         1 in 1000                                 99.9%
40                         1 in 10000                                99.99%
50                         1 in 100000                               99.999%
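The conversion between Q and P, and the decoding of a quality string, can be sketched in a few lines. This assumes the Phred+33 encoding mentioned for Illumina 1.8+; the function names are illustrative.

```python
import math


def phred_to_prob(q: float) -> float:
    """Probability of an incorrect base call: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Phred score from an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Turn a FASTQ quality string into Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    # First and last characters of the quality line in the example record
    for ch, q in zip("CD%", decode_quality("CD%")):
        print(ch, q, phred_to_prob(q))
```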

raw reads (.fastq)

• Move into folder lane2;

• Open s-7-1.fastq:

gedit s-7-1.fastq

OR

Terminal: more s-7-1.fastq OR head s-7-1.fastq

• Are s-6-1.fastq and s-7-1.fastq coming from two different lanes?



1‐ Fastq quality control + trimming

FastQC: quality control of the raw data coming out of the sequencer

• Evaluation of the quality of the generated data;

• Basic summary statistics of the raw data;

• Several modules to evaluate different features (e.g. adapters, base quality, etc.);

• Feedback (green, orange, red): do not rely on it blindly, think about what it means!


Per base sequence quality: warning


What can we do to improve the quality at the end of the reads?

Read trimming: removal of the 3' ends with low quality scores
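The idea behind 3' quality trimming can be sketched as follows. This is a deliberately simple rule (cut bases from the 3' end until one reaches the threshold), not the algorithm of any particular trimming program; the threshold of 20 and the function name are assumptions made for the example, and Phred+33 encoding is assumed.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Remove bases from the 3' end while their Phred score is below min_q."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have the same length")
    end = len(seq)
    # Walk backwards from the 3' end until a base passes the threshold
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' is Q40 and '#' is Q2 under Phred+33
    print(trim_3prime("ACGTACGT", "IIIIII##"))
```

Low-quality bases in the middle of a read are left untouched by this rule; window-based approaches handle those differently.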






Per sequence quality score: pass


Sequence length: pass

Adapter removal

[Figure: FastQC module statuses shown on these slides: Failed, Warning, Pass]

Overrepresented sequences

Removal of overrepresented sequences (PCR primers).

FastQC references:

• Software website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

• Manual: https://insidedna.me/tool_page_assets/pdf_manual/fastqc.pdf



2- Alignment to a reference genome

Alignment: the process of determining the most likely location within the genome for the observed DNA read

[Figure: raw reads aligned to the reference genome]

Trade-off: speed vs sensitivity. The higher the accuracy, the longer the alignment run.

Two classes of methods:

Burrows-Wheeler

• fast
• less robust at high divergence with the reference genome
• e.g. bwa

Hashing

• slow (needs more memory)
• robust at high divergence with the reference genome
• e.g. stampy

The shorter the read, the harder it is to find its location in the genome.

Large amounts of data: computationally challenging for memory and speed.
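To give a feel for the Burrows-Wheeler class of methods, here is a toy sketch of a Burrows-Wheeler transform with backward search on a tiny reference. It only finds exact matches and uses naive data structures; it is not how bwa is implemented (real aligners use compressed, sampled indexes and allow mismatches and gaps), and all names are illustrative.

```python
def bwt_index(text: str):
    """Build a suffix array, the BWT string and the C table for text + '$'."""
    text = text + "$"                      # '$' sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    counts: dict[str, int] = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch] = number of characters in the text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return sa, bwt, c_table


def backward_search(pattern: str, sa, bwt: str, c_table) -> list[int]:
    """Return the sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):           # process the read right to left
        if ch not in c_table:
            return []
        lo = c_table[ch] + bwt.count(ch, 0, lo)   # naive Occ(ch, lo)
        hi = c_table[ch] + bwt.count(ch, 0, hi)   # naive Occ(ch, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, bwt, c_table = bwt_index("ACAC")
    print(bwt)
    print(backward_search("AC", sa, bwt, c_table))
```

The search cost depends on the read length rather than on the genome size, which is where the speed of this class of methods comes from.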


What if there are several possible places to align your sequencing read?

This may be due to:
- Repeated elements in the genome
- Low complexity sequences
- Reference errors and gaps

MQ (mapping quality) is a Phred score of the quality of the alignment:

• high MQ: a single match

• low MQ: the probability of mapping to different locations is high, but no perfect multiple matches

• MQ0: a perfect multiple match
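A toy example can make the multiple-placement problem concrete. The sketch below looks for exact occurrences of a read in a small made-up reference that contains the same element twice; the sequences, function names and labels are invented for illustration. Real aligners derive MQ from alignment scores rather than from a simple count of exact matches; only the Phred relation between MQ and the probability of a wrong placement is taken from the slide.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every start position where the read matches the reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def placement(reference: str, read: str) -> str:
    """Label a read by the number of exact placements (toy rule)."""
    n = len(find_all(reference, read))
    if n == 0:
        return "no exact match"
    if n == 1:
        return "single match (high MQ)"
    return "perfect multiple match (MQ0)"


def mapq_to_error_prob(mq: float) -> float:
    """Phred relation: probability that the reported position is wrong."""
    return 10 ** (-mq / 10)


# Made-up reference: the same element at both ends, a unique region in between
ELEMENT = "ACGTTGCA"
REFERENCE = ELEMENT + "TTT" + "GGATCCTA" + "CCC" + ELEMENT

if __name__ == "__main__":
    print(placement(REFERENCE, "GTTGC"))   # read from the repeated element
    print(placement(REFERENCE, "GATCC"))   # read from the unique region
    print(mapq_to_error_prob(0))
```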




[Figure series: reads from Sample_1 mapped onto a reference sequence containing Element 1 and Element 2. Three cases are shown:]

• Reference and Sample_1 each carry 1 copy of Element 1 and 1 copy of Element 2.

• The reference carries 2 copies of Element 1, Sample_1 carries 1 copy: perfect multiple matches → MQ0; not a perfect match → low MQ.

• Sample_1 carries 2 copies of Element 1, the reference carries 1 copy: false heterozygous call, cluster of heterozygotes.

AluSg7

2- Alignment to a reference genome: reference sequence

create the index of the reference genome (for bwa, samtools and picard)

bwa index: this is an FM-index, specific to the algorithm behind this aligner

bwa index -a is Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa

index .fai

samtools faidx Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa

The index file stores records of sequence identifier, length, the offset of the first sequence character in the file, the number of characters per line, and the number of bytes per line.
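The five columns described above can be reproduced on a small example to see what they mean. The sketch below computes them from FASTA content held in memory and shows how they give direct access to any base; it is an illustration of the file layout, not a replacement for samtools faidx, and the function names and the tiny FASTA are made up.

```python
def fai_records(fasta: bytes) -> list[tuple[str, int, int, int, int]]:
    """Compute (name, length, offset, bases per line, bytes per line) per sequence."""
    records, current, pos = [], None, 0
    for raw in fasta.splitlines(keepends=True):
        line = raw.rstrip(b"\r\n")
        if raw.startswith(b">"):
            if current:
                records.append(tuple(current))
            name = line[1:].split()[0].decode()
            # The sequence starts right after the header line
            current = [name, 0, pos + len(raw), 0, 0]
        elif current is not None and line:
            if current[3] == 0:            # first sequence line sets the line widths
                current[3] = len(line)     # characters (bases) per line
                current[4] = len(raw)      # bytes per line, newline included
            current[1] += len(line)
        pos += len(raw)
    if current:
        records.append(tuple(current))
    return records


def base_offset(record, i: int) -> int:
    """Byte offset in the FASTA file of base i (0-based) of a sequence."""
    _, _, offset, linebases, linewidth = record
    return offset + (i // linebases) * linewidth + i % linebases


if __name__ == "__main__":
    fasta = b">chrI test\nACGTACGTAC\nACGTACGTAC\nACGT\n>chrII\nGGCC\n"
    for rec in fai_records(fasta):
        print(*rec, sep="\t")
```

Because every line of a sequence has the same width, the position of any base can be computed without reading the whole file.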



[Figure: contents of the .fai index of the reference, with the annotations "50 characters" and "60 characters"]

create the dictionary of the reference genome (for samtools, gatk and picard)

dictionary .dict: list of the contigs included in the fasta file of the reference genome

java -jar picard.jar CreateSequenceDictionary REFERENCE=Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa OUTPUT=Saccharomyces_cerevisiae.EF4.68.dna.toplevel.dict

keep the index and dictionary files in the same directory as the reference file!



Fields of each entry in the .dict file:

• SequenceName

• SequenceLength

• Path

• MD5 checksum
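The four fields listed above correspond to the SN, LN, UR and M5 tags of an @SQ header line in SAM-style headers. The following is a sketch of how such a line could be assembled for one contig, assuming the MD5 is taken over the uppercased sequence with whitespace removed (as described in the SAM specification); the function name, the contig and the path are hypothetical, and the tag order simply follows the slide.

```python
import hashlib


def sq_line(name: str, sequence: str, uri: str) -> str:
    """Build one @SQ dictionary entry: name, length, path and MD5 checksum."""
    seq = "".join(sequence.split()).upper()      # drop line breaks, uppercase
    md5 = hashlib.md5(seq.encode("ascii")).hexdigest()
    return f"@SQ\tSN:{name}\tLN:{len(seq)}\tUR:{uri}\tM5:{md5}"


if __name__ == "__main__":
    # Hypothetical contig and path, for illustration only
    print(sq_line("chrI", "acgt\nACGT", "file:/path/to/reference.fa"))
```

The checksum lets downstream tools verify that a BAM file and a reference refer to exactly the same sequences.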

2‐ Alignment to a reference genome: mapping with bwa‐mem

Three different algorithms:

1. BWA-backtrack: for Illumina reads up to 100 bp;

2. BWA-SW: long read support, split alignment;

3. BWA-MEM: long read support, split alignment, faster, more accurate.

• paired-end alignment (lane1);

• it uses the reference genome (.fa) and the reads (.fastq) to create a SAM file;

• option to mark shorter split hits as secondary (not supplementary).

Split read: [Figure from Karakoc E et al., 2012]







bwa mem [options] [RefSeq] [lane1_fastq1] [lane1_fastq2] > lane1.sam
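The command template above can be filled in from a script. The sketch below only assembles the argument list and, in a separate function, shows how the output could be redirected to a SAM file; it assumes bwa is installed and on the PATH. The options used are `-t` (number of threads) and `-M` (mark shorter split hits as secondary); the FASTQ file names are hypothetical placeholders, and the function names are illustrative.

```python
import subprocess


def bwa_mem_command(ref: str, fastq1: str, fastq2: str,
                    threads: int = 1, mark_secondary: bool = True) -> list[str]:
    """Assemble the argument list for a paired-end bwa mem run."""
    cmd = ["bwa", "mem", "-t", str(threads)]
    if mark_secondary:
        cmd.append("-M")                   # mark shorter split hits as secondary
    cmd += [ref, fastq1, fastq2]
    return cmd


def run_to_sam(cmd: list[str], sam_path: str) -> None:
    """Run the command and write its standard output to a SAM file."""
    with open(sam_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    cmd = bwa_mem_command(
        "Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa",
        "lane1_1.fastq",   # hypothetical file name for the first mates
        "lane1_2.fastq",   # hypothetical file name for the second mates
        threads=4,
    )
    print(" ".join(cmd), "> lane1.sam")
    # run_to_sam(cmd, "lane1.sam")  # uncomment where bwa and the files exist
```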

