529053 Evolutionary Genomics Ari Löytynoja / [email protected] 14 March, 2016: Introduction to Genomics




Genome


Genome within Ensembl browser

http://www.ensembl.org/Homo_sapiens/Location/View?db=core;g=ENSG00000139618;r=13:32315474-32400266



Genome

Genes

Variation

Repeats

Genes

- https://en.wikipedia.org/wiki/Gene

Variation

- SNP or SNV (single-nucleotide polymorphism/variation)
- indels (insertions and deletions)
- structural variation
  - CNV (copy-number variation)
  - inversions
  - translocations

https://en.wikipedia.org/wiki/Structural_variation
https://en.wikipedia.org/wiki/Chromosome_abnormality

Variation

- caused by mutations
- visible in DNA sequence
- proportion of variable sites depends on evolutionary distance
  - within species: little
  - between species: lots

seq1 CGATGCGCGATACATCGACGTGCA
seq2 CGATGCGCGGTACATCGACGTGCA
seq3 CGATGCGCGATACATCGACGTGCA
seq4 CGATGCGCGATACATCGACGTGCA
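As a minimal sketch of how the variable site in the four sequences above could be found programmatically, the Python below compares the alignment column by column. The function name `variable_sites` is made up for this illustration, and the sketch assumes the sequences are already aligned and of equal length.

```python
# The four aligned sequences from the slide.
alignment = {
    "seq1": "CGATGCGCGATACATCGACGTGCA",
    "seq2": "CGATGCGCGGTACATCGACGTGCA",
    "seq3": "CGATGCGCGATACATCGACGTGCA",
    "seq4": "CGATGCGCGATACATCGACGTGCA",
}


def variable_sites(sequences):
    """Return 1-based positions where the aligned sequences differ."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*sequences) walks through the alignment one column at a time.
    return [
        position
        for position, column in enumerate(zip(*sequences), start=1)
        if len(set(column)) > 1
    ]


sequences = list(alignment.values())
sites = variable_sites(sequences)
proportion = len(sites) / len(sequences[0])

print("variable positions:", sites)
print("proportion of variable sites:", round(proportion, 3))
```

Here only seq2 differs from the others, at a single position, so the proportion of variable sites is small, as expected for samples from one species.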

Repeats

Many types

- https://en.wikipedia.org/wiki/Tandem_repeat
- https://en.wikipedia.org/wiki/Retrotransposon
- https://en.wikipedia.org/wiki/Transposable_element

The Alu element is the most abundant transposable element in the human genome

- ~300 bases long, ~1 million copies → makes up ~11% of the human genome
- repeat copies are similar and cause trouble in genome assembly and short-read mapping
- often useless and ignored where possible
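The Alu figures can be checked with back-of-envelope arithmetic. The sketch below uses the rounded values from this slide together with the human genome size of about 3 x 10^9 bp quoted elsewhere in the deck; the variable names are illustrative only.

```python
# Rounded values from the slides (approximate, not exact counts).
alu_length_bp = 300
alu_copies = 1_000_000
human_genome_bp = 3_000_000_000

alu_total_bp = alu_length_bp * alu_copies
alu_fraction = alu_total_bp / human_genome_bp

print(f"Alu elements cover roughly {alu_fraction:.0%} of the genome")
```

With these rounded inputs the result is about 10%, in line with the ~11% stated on the slide.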

Genome

- Genes
- Variation
- Repeats

→ correlations!

Variation data

- mainly interested in SNPs and indels observed between samples
- distribution of variation across sites
- distribution among samples
- here, the reference sequence is known
- sample data are multiple genomes
- to us, the data come from a magic box
- data: millions to billions of ~100 bp fragments


Sample genome: millions to billions of bp long (human ~3 x 10^9 bp)
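To see why the data consist of millions to billions of fragments, here is a rough sketch of the arithmetic. It assumes the usual definition of average sequencing depth (total sequenced bases divided by genome size); the function name `reads_needed` and the 30x depth in the example are illustrative assumptions, not values from the slides.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, depth):
    """Reads required so that total sequenced bases equal depth x genome size."""
    return math.ceil(depth * genome_size_bp / read_length_bp)


human_genome_bp = 3_000_000_000  # ~3 x 10^9 bp, from the slide
read_length_bp = 100             # ~100 bp fragments, from the slide

# Covering every base once on average already takes tens of millions of reads.
print(reads_needed(human_genome_bp, read_length_bp, depth=1))

# A hypothetical 30x depth pushes this towards a billion reads.
print(reads_needed(human_genome_bp, read_length_bp, depth=30))
```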


[Figure: DNA fragmentation yields fragments of 400-1500 bp; DNA sequencing reads 100 bp from each end of a fragment (known), leaving 200-1000 bp in the middle unknown.]

[Figure labels: Short read mapping; Genomic analyses]

Variation data

- mainly interested in SNPs and indels observed between samples
  1. detection of CNVs (and other structural variation) using short-read data is tricky
  2. evolution of CNVs is unclear
  → population genetics theory is best developed for SNP data

Illumina sequencing

- most common sequencing machines
- Illumina reads have systematic errors
- some errors can be accounted for

Illumina sequencing

- data output in fastq format
- per-base sequence content shows contamination and fragment enrichment
- base call qualities reflect sample DNA quality and issues in the sequencing run
- base qualities are taken into account in later analyses
- typically no need for clean-up

[Figure panels: good data, bad data]

@NS500198:34:H1708BGXX:1:11101:4647:1058 1:N:0:4
ACAACNCGCCCGTGNTGCAGGACTGGGTCACGGCCACTGACATCCGCGTGGCCTTCCGCCGCCTGCACACGTTCGGTGACGAGAACGAGGCCGACTCCGAGCTGGCGCGCGCCTCGTACTTCTACGCCGTGTCCGACCT
+
7A7AA#FFFFFFFF#AFFFFF7FFFFFFF7FFFFF)FFF.F.FFFFFFFFFF.FFF.<FFFFFAFAAF.<AFF<FFFA<FAF<<)FA.AF<FA7FFFF<.AFFFFFAFAA<.AAFA<AA.FFF.F<FAAAFFF<7.FAA
@NS500198:34:H1708BGXX:1:11101:20099:1059 1:N:0:4
GCCACNAAAATTTANAACTAGAGCTGCCCTATGCCCCAGCAATTGCACTCCTGGGTATTTACCCCAAAGACACAGATGTAGTGAAAAGAAGGGCCATATTCACCCCAACGTTCATAGCAGCAAAGTCCACTATAGCCAA
+
<AA.A#A.7A.F<F#FFFF.AAFFA.FFFAFFF7FFFFFFFFFF.7)FFFFF.AFFAAAF.F)FFF<AFAF7FAF.FAFFFFFFFFA..)FFF.FF7FF)A...FF..<.<FF7)<FFFF<7F.)F.FFF.7FFFFFF7
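To illustrate how base qualities are stored in fastq data, the sketch below reads four-line records and decodes the quality string. It assumes the Phred+33 encoding (quality = ASCII code minus 33) and uses only the first 15 bases of the first read above to keep the example short; the function names are made up for this illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality string) from four-line fastq records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed fastq record: " + header)
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string, assuming Phred+33 encoding."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# First 15 bases of the first read on the slide (truncated for brevity).
example = (
    "@NS500198:34:H1708BGXX:1:11101:4647:1058 1:N:0:4\n"
    "ACAACNCGCCCGTGN\n"
    "+\n"
    "7A7AA#FFFFFFFF#\n"
)

name, sequence, quality = next(read_fastq(io.StringIO(example)))
scores = phred_scores(quality)
low_quality = [i for i, q in enumerate(scores, start=1) if q < 10]

print(scores)
print("low-quality positions:", low_quality)
print("bases there:", [sequence[i - 1] for i in low_quality])
```

In this excerpt the two low-quality positions are exactly the uncalled bases (N), which shows how the quality string flags unreliable calls for later analyses.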

Overview of resequencing data analysis

[Figure: workflow diagram. For each sample: fastq data → mapping → bam data; then variant calling → vcf data → summary statistics and analysis.]

vcf data

18917 C A 0/0 0/0 0/0 0/0
18969 G T 0/0 0/0 0/0 0/0
19022 G T 0/1 1/1 1/1 1/1
19030 T A 0/1 1/1 1/1 1/1
19163 A G 0/0 0/0 0/0 0/0
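As a sketch of the kind of summary statistic that can be computed from such genotype data, the Python below takes the five rows of the vcf excerpt and calculates the alternative-allele frequency per site. It assumes the columns are position, reference allele, alternative allele and one diploid genotype per sample, which is a simplification of the full vcf format; the function names are illustrative.

```python
# Rows from the slide: position, reference allele, alternative allele,
# then one genotype per sample (a simplified view, not full vcf records).
ROWS = """\
18917 C A 0/0 0/0 0/0 0/0
18969 G T 0/0 0/0 0/0 0/0
19022 G T 0/1 1/1 1/1 1/1
19030 T A 0/1 1/1 1/1 1/1
19163 A G 0/0 0/0 0/0 0/0
"""


def parse_row(line):
    """Split one row into position, ref allele, alt allele and genotypes."""
    position, ref, alt, *genotypes = line.split()
    return int(position), ref, alt, genotypes


def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are not the reference allele (0)."""
    alleles = [a for gt in genotypes for a in gt.replace("|", "/").split("/")]
    called = [a for a in alleles if a != "."]
    if not called:
        return None
    return sum(a != "0" for a in called) / len(called)


frequencies = {}
for line in ROWS.splitlines():
    position, ref, alt, genotypes = parse_row(line)
    frequencies[position] = alt_allele_frequency(genotypes)

# Sites where both alleles are observed among the samples.
polymorphic = [
    pos for pos, freq in frequencies.items()
    if freq is not None and 0 < freq < 1
]

for position, freq in frequencies.items():
    print(position, freq)
print("polymorphic among samples:", polymorphic)
```

Only two of the five sites carry the alternative allele, and at both of them one sample is heterozygous (0/1) while the other three are homozygous for the alternative allele (1/1).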