10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics

Upload: sheena-thomas

Post on 04-Jan-2016

Page 1: 10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics

10 Billion Piece Jigsaw Puzzles

John Cleary

Real Time Genomics

Page 2
Page 3
Page 4
Page 5

Genome

Exome

Transcriptome

Metagenome

Page 6

Differences between …

• Individuals in populations

• Child and parents

• Cancer and host genome

• Large pedigrees of animals

• Bacterial populations inside individuals

• Bacterial populations in the world

Page 7

Real world problems …

• What is wrong with this newborn child?

• Why are these cells cancerous and what should we do about it?

• We have 6,000 individuals in 1,500 families with cleft-palate – what causes this?

Page 8

Real world problems …

• There is a hard-to-treat infectious disease in a hospital ward – where did it come from, and is it the same as the one at another hospital?

• Is this water safe to drink?

• …

Page 9

Human Genome

3 billion

nucleotides

Exome

30 million

nucleotides

Page 10

Shapes of the Jigsaw Pieces

Page 11

Differences between human genomes - SNPs

A C G T T A G T G A

A C G T T A G T G A

A C G T T C G T G A

A C G T T G G T G A

~ 1 / 1,000

3,000,000 nt

Page 12

Differences between human genomes - MNPs

A C G T T A G T G A

A C G T T A G T G A

A C G T T C A G A

A C G T T G T G A

Page 13

Differences between human genomes - indels

A C G T T A G T G A

A C G T T A G T G A

A C G T T G T G A

A C G T T G G T G A

~ 1 / 10,000

300,000
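The totals quoted on the SNP and indel slides follow directly from the per-nucleotide rates. A minimal sketch of that arithmetic, assuming the 3 billion nucleotide genome size quoted earlier in the talk (the function name is illustrative only):

```python
GENOME_SIZE_NT = 3_000_000_000  # human genome size quoted in the talk


def expected_variants(genome_size_nt, one_in):
    """Expected number of variant sites at a rate of 1 per `one_in` nucleotides."""
    return genome_size_nt // one_in


snps = expected_variants(GENOME_SIZE_NT, 1_000)     # SNPs: ~1 / 1,000
indels = expected_variants(GENOME_SIZE_NT, 10_000)  # indels: ~1 / 10,000
print(snps, indels)
```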

Page 14

Differences between human genomes - inserts

A C G T T A G T G A

A C G T T A G T G A

Up to 1,000,000 nt

total 3,000,000 nt

T T A G G A C C C A

Page 15

REF: aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
SIM: T AAGAAT
SIM: T AAGAAT
CALL: T G
CALL: T T
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: TTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
READ: TCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
READ: CTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
READ: AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
READ: GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
READ: TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
READ: TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
READ: GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
READ: CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
READ: _ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
READ: TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
READ: GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
READ: TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
READ: GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
READ: GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
READ: TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
READ: CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
READ: TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
READ: CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
READ: TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGT
READ: AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCG
READ: AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTT
READ: ______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTTCG

Page 16

Solving the Jigsaw

• Indexing

• Alignment

• SNP/MNP/Indel calling

Mapping

Page 17

Indexing

A C G T T A G T G A A G

A C G T T C G T G A A G

A C G T   T C G T   G A A G

A C G T   T A G T   G A A G

4.5 billion
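One common way to index sequences like those on this slide is a hash table from fixed-length words to positions. The sketch below is an illustration under that assumption; the talk does not state the word size or the data structure actually used, and `build_index`, `candidate_positions` and `k=4` are hypothetical choices made for the example.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-long word of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Count, per implied read start, how many of the read's words hit the index."""
    votes = defaultdict(int)
    # Cut the read into non-overlapping k-long words.
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return dict(votes)


reference = "ACGTTAGTGAAG"  # first sequence on the slide
read = "ACGTTCGTGAAG"       # second sequence, one nucleotide different
index = build_index(reference, k=4)
hits = candidate_positions(read, index, k=4)
print(hits)
```

Two of the read's three words (ACGT and GAAG) still match despite the single difference, so position 0 is kept as a candidate for the alignment step.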

Page 18

Aligning

A C G T T A G T G A A G

A C G T T C G T G A A G

1.6 billion
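For the two sequences on this slide, alignment at a candidate position reduces to counting differences. A minimal ungapped sketch (an assumption for illustration; a production aligner also has to handle inserts and deletions, and `mismatches` is a hypothetical helper):

```python
def mismatches(reference, read, start):
    """Read-relative positions where the read differs from the reference."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    return [i for i, (r, q) in enumerate(zip(window, read)) if r != q]


reference = "ACGTTAGTGAAG"
read = "ACGTTCGTGAAG"
diffs = mismatches(reference, read, start=0)
print(diffs)
```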

Page 19

Cutting Edge Run

• Human genome (3 billion nt)

• 1 billion reads of 100 nt, coverage of 30

• Indexing + Aligning in 27 minutes
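The figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above; the raw depth comes out slightly above the rounded "coverage of 30", and the function names are illustrative.

```python
def coverage(num_reads, read_length_nt, genome_size_nt):
    """Average number of reads covering each nucleotide."""
    return num_reads * read_length_nt / genome_size_nt


def reads_per_second(num_reads, minutes):
    """Throughput needed to process all reads in the given time."""
    return num_reads / (minutes * 60)


depth = coverage(1_000_000_000, 100, 3_000_000_000)  # roughly 33x raw depth
rate = reads_per_second(1_000_000_000, 27)           # roughly 617,000 reads/s
print(round(depth, 1), round(rate))
```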

Page 20

i7 Quad Core

Page 21

2 sockets X 4 cores X 2 hyperthreads = 16

48 GB RAM

10 computers

1 TB disk/genome = 500GB + 200GB + 200GB + 0.3GB

X thousands of genomes

Page 22

Shapes of the Jigsaw Pieces

Page 23

Paired End Reads

100 nt

100 nt

100 - 1,000 nt

Index

Align

Index

Align

Match
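The "Match" step can be pictured as keeping only those pairs of hits whose spacing is plausible. The sketch below assumes that the "100 - 1,000 nt" figure is the gap between the two 100 nt reads and ignores strand orientation; both are simplifications made for illustration, and `consistent_pairs` is a hypothetical name.

```python
def consistent_pairs(left_hits, right_hits, read_length=100,
                     min_gap=100, max_gap=1000):
    """Return (left, right) start positions whose gap fits the expected range."""
    pairs = []
    for left in left_hits:
        for right in right_hits:
            # Gap between the end of the left read and the start of the right read.
            gap = right - (left + read_length)
            if min_gap <= gap <= max_gap:
                pairs.append((left, right))
    return pairs


# Each read of the pair was indexed and aligned separately and has two candidate hits.
pairs = consistent_pairs(left_hits=[1000, 50000], right_hits=[1500, 90000])
print(pairs)
```

Only one combination survives, which is how a pair can resolve a read that maps ambiguously on its own.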

Page 24

Solving the Jigsaw without the picture

• Indexing

• Alignment

Assembly

Page 25

Assembly

T A G T G A A G A A T T

A C G T T C G T G A A G

A C G T   T C G T   G A A G

T A G T   G A A G   A A T T

A C G T T ? G T G A A G A A T T
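The merged sequence on this slide can be reproduced by overlapping the two reads and marking the one position where they disagree. This is a brute-force sketch for illustration only, not the assembler described in the talk; `merge_reads` and its thresholds are assumptions.

```python
def merge_reads(left, right, min_overlap=4, max_mismatches=1):
    """Merge two reads on the longest suffix/prefix overlap, marking conflicts with '?'."""
    for offset in range(len(left) - min_overlap + 1):
        suffix = left[offset:]
        if len(suffix) > len(right):
            continue
        prefix = right[:len(suffix)]
        if sum(a != b for a, b in zip(suffix, prefix)) <= max_mismatches:
            consensus = "".join(a if a == b else "?" for a, b in zip(suffix, prefix))
            return left[:offset] + consensus + right[len(suffix):]
    return None  # no acceptable overlap found


merged = merge_reads("ACGTTCGTGAAG", "TAGTGAAGAATT")
print(merged)
```

The two reads share the word GAAG, so an index lookup would find the same overlap without trying every offset.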

Page 26

SNP calling

15A 13C: AC heterozygous SNP

15A 4C

5A 2C

1A 2C

Bayesian statistics (SNPs 1/1,000)

31A 42C: Throw it out
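The counts on this slide can be turned into genotype posteriors with a small Bayesian model. The sketch below is an illustration, not the caller described in the talk: the 1% read error rate, the even spread of errors over the other three bases, and the way the 1/1,000 prior is shared among non-reference genotypes are all assumptions.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_log_posteriors(counts, ref, error=0.01, snp_rate=1e-3):
    """Log posterior of each diploid genotype given per-base read counts."""
    genotypes = list(combinations_with_replacement(BASES, 2))
    scores = {}
    for genotype in genotypes:
        # Assumed prior: 1 - snp_rate on homozygous reference, rest shared equally.
        is_ref = genotype == (ref, ref)
        prior = (1 - snp_rate) if is_ref else snp_rate / (len(genotypes) - 1)
        score = math.log(prior)
        for base, n in counts.items():
            # A read base comes from either allele with probability 1/2.
            p = sum((1 - error) if base == allele else error / 3
                    for allele in genotype) / 2
            score += n * math.log(p)
        scores[":".join(genotype)] = score
    # Normalise in log space so the posteriors sum to one.
    top = max(scores.values())
    log_total = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {g: s - log_total for g, s in scores.items()}


def call(counts, ref):
    """Most probable genotype under the model above."""
    posteriors = genotype_log_posteriors(counts, ref)
    return max(posteriors, key=posteriors.get)


print(call({"A": 15, "C": 13}, ref="A"))
```

With balanced counts such as 15A 13C the heterozygous genotype dominates; for the lopsided or low-count cases on the slide the answer depends on the assumed error rate and prior.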

Page 27

(The read pileup from Page 15 is shown again here: the REF, SIM and CALL rows followed by the aligned READ rows.)

Page 28

Lane

Multiple technologies and read lengths

SAM

Calibration

Mapping

SNP calling

VCF

SNPs, MNPs, indels

Filtering

Complex regions

Page 29

SNP calling - Diploid Bayesian

SAM

Genome statistics

Calibration

Error model

Priors

Bayesian Model

A C G T A:C A:G A:T C:G C:T G:T

23.1 43.2 …

log posteriors

Counts filter

Ambiguity filter

VCF

Simple isolated SNP

insert

Adjacent SNPs, inserts

Complex region calling

SNPs, indels, MNPs

Page 30

Complex Region Calling

Genome

Aligned Reads

Modified Genome

Probabilistic realignment through all paths for each read against each modified genome

Page 31

Comparing twins

3,000,000 SNPs

Do any of them differ between the twins?

15A 4C 3A 10C 3G

Page 32
Page 33

DNA

mRNA

protein

Gene

Page 34
Page 35

Cancer comparison

Page 36

Copy Number Variants

• Varying levels of extraction of reads across genome (use differences)

• Locate boundaries (as accurately as possible)

• Extract number of variants

• Use in combination with calling SNPs
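The first three bullets can be sketched with read depth per window. This is an illustration under several assumptions not stated in the talk: depth is measured in fixed windows, a known baseline depth corresponds to two copies, and boundaries are only located to window resolution. The function name is hypothetical.

```python
def copy_number_segments(depths, baseline_depth, normal_copies=2):
    """Group consecutive windows with the same estimated copy number.

    Returns (first_window, end_window_exclusive, copies) tuples.
    """
    segments = []
    for i, depth in enumerate(depths):
        copies = round(normal_copies * depth / baseline_depth)
        if segments and segments[-1][2] == copies:
            # Same copy number as the previous window: extend that segment.
            segments[-1] = (segments[-1][0], i + 1, copies)
        else:
            # Copy number changed: a boundary lies between windows i-1 and i.
            segments.append((i, i + 1, copies))
    return segments


depths = [30, 31, 29, 46, 44, 45, 30, 15, 14]  # made-up depths per window
segments = copy_number_segments(depths, baseline_depth=30)
print(segments)
```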

Page 37

Large pedigrees

Page 38
Page 39

Chlorocebus pygerythrus

Page 40
Page 41
Page 42
Page 43
Page 44

Metagenomics or what is living on you

• Mapping reads back onto a database of known bacteria/viruses

• Many are ambiguous

• Many don’t map at all

• Estimate frequency of each species

• Remove human “contamination”

Page 45

TS1
0.389 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.183 gi|187734516|ref|NC_010655.1| Akkermansia muciniphila ATCC BAA-835
0.145 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.037 gi|119025018|ref|NC_008618.1| Bifidobacterium adolescentis ATCC 15703

TS4
0.428 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.210 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.149 gi|60650141|ref|NC_006873.1| Bacteroides fragilis NCTC 9343 plasmid pBF9343
0.037 gi|121999251|ref|NC_008790.1| Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.036 gi|238922432|ref|NC_012781.1| Eubacterium rectale ATCC 33656

TS25
0.752 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.073 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.041 gi|121999251|ref|NC_008790.1| Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.020 gi|58036264|ref|NC_004307.2| Bifidobacterium longum NCC2705
0.018 gi|189438863|ref|NC_010816.1| Bifidobacterium longum DJO10A

Page 46

Metagenomics

• Map reads to database

• Estimate most likely frequencies: a hill climbing estimation problem

• Can anything be done about unmapped reads?
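One standard way to climb towards the most likely frequencies when reads map ambiguously is expectation-maximisation, where each step never decreases the likelihood. The talk does not say which procedure was used, so the sketch below is an assumption; it also ignores genome length and leaves unmapped reads out entirely.

```python
def estimate_frequencies(read_hits, iterations=100):
    """Estimate species frequencies from reads that may map to several species.

    `read_hits` holds one non-empty set of candidate species per mapped read.
    """
    species = sorted({s for hits in read_hits for s in hits})
    freq = {s: 1.0 / len(species) for s in species}  # start from a uniform guess
    for _ in range(iterations):
        totals = dict.fromkeys(species, 0.0)
        for hits in read_hits:
            # Share each read among its candidates in proportion to current estimates.
            weight = sum(freq[s] for s in hits)
            for s in hits:
                totals[s] += freq[s] / weight
        freq = {s: totals[s] / len(read_hits) for s in species}
    return freq


# Made-up example: 6 reads unique to A, 2 unique to B, 2 ambiguous between them.
reads = [{"A"}] * 6 + [{"B"}] * 2 + [{"A", "B"}] * 2
freq = estimate_frequencies(reads)
print(freq)
```

The ambiguous reads end up split 3:1 in favour of A, matching the ratio of the unambiguous evidence.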

Page 47

How do we get there?

• Software engineering (500,000 lines of code)

• Algorithms

• Bayesian statistics

• Testing: calibration/simulation/analysis

Page 48
Page 49

How do we get there?

• Performance optimization: algorithms, disk I/O and compression, parallel execution, optimization for memory size, optimization for cache size, targeted code optimization

Page 50