10 billion piece jigsaw puzzles

10 Billion Piece Jigsaw Puzzles John Cleary Netvalue Ltd. Real Time Genomics


Post on 18-Feb-2016



Page 1: 10 Billion Piece Jigsaw Puzzles

10 Billion Piece Jigsaw Puzzles

John Cleary

Netvalue Ltd.

Real Time Genomics

Page 2: 10 Billion Piece Jigsaw Puzzles
Page 3: 10 Billion Piece Jigsaw Puzzles

hundred

thousand

10 thousand

100 thousand

million

10 million

100 million

billion

10 billion

100 billion

Page 4: 10 Billion Piece Jigsaw Puzzles

Genome

Transcriptome

Cancer

Page 5: 10 Billion Piece Jigsaw Puzzles

Genomes of …

• human

• reference species: mouse, chimp, arabidopsis …

• agricultural species: cattle, sheep, pig, … rice, wheat, grape …

• bacterial: disease, human “ecosystem”

Page 6: 10 Billion Piece Jigsaw Puzzles

Differences between …

• Individuals

• Populations: disease and “quantitative traits”

• Somatic and tumor genomes

• Transcriptome of child and parents

• Bacterial populations of individuals

Page 7: 10 Billion Piece Jigsaw Puzzles

Human Genome

3 billion

Nucleotides

Page 8: 10 Billion Piece Jigsaw Puzzles

Shapes of the Jigsaw Pieces

Company               Lengths (nt)
454                   15 - 700
Illumina              36 - 150
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100*

Page 9: 10 Billion Piece Jigsaw Puzzles

Differences between genomes - SNPs

A C G T T A G T G A

A C G T T A G T G A

A C G T T C G T G A

A C G T T G G T G A

~ 1 / 1,000

3,000,000 nt

Page 10: 10 Billion Piece Jigsaw Puzzles

REF: aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
SIM: T AAGAAT
SIM: T AAGAAT
CALL: T G
CALL: T T
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: TTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
READ: TCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
READ: CTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
READ: AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
READ: GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
READ: TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
READ: TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
READ: GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
READ: CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
READ: _ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
READ: TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
READ: GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
READ: TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
READ: GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
READ: GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
READ: TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
READ: CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
READ: TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
READ: CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
READ: TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGT
READ: AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCG
READ: AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTT
READ: ______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTTCG

Page 11: 10 Billion Piece Jigsaw Puzzles
Page 12: 10 Billion Piece Jigsaw Puzzles

Differences between human genomes - MNPs

A C G T T A G T G A

A C G T T A G T G A

A C G T T C A G A

A C G T T G T G A

Page 13: 10 Billion Piece Jigsaw Puzzles

Differences between human genomes - indels

A C G T T A G T G A

A C G T T A G T G A

A C G T T G T G A

A C G T T G G T G A

~ 1 / 10,000 300,000

Page 14: 10 Billion Piece Jigsaw Puzzles

Differences between genomes - inserts

A C G T T A G T G A

A C G T T A G T G A

Up to 1,000,000 nt total 3,000,000 nt

T T A G G A C C C A

Page 15: 10 Billion Piece Jigsaw Puzzles

Differences between genomes – structural variants

Tandem Repeat

Inversion

Copy

Page 16: 10 Billion Piece Jigsaw Puzzles

Solving the Jigsaw

• Indexing

• Alignment

• SNP/MNP/Indel/SV calling

Mapping

Page 17: 10 Billion Piece Jigsaw Puzzles

Indexing

A C G T T A G T G A A G

A C G T T C G T G A A G

A C G T   T C G T   G A A G

A C G T   T A G T   G A A G

4.5 billion
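
The indexing slide cuts a sequence into fixed-length words. A minimal sketch of how such words could be used, assuming a hash table from every 4-letter word of the reference to its positions; the names `build_index` and `candidate_positions` and the toy reference are hypothetical and not taken from the talk.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-letter word in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_positions(read, index, k):
    """Vote for reference offsets where the read could start.

    The read is cut into non-overlapping k-letter words, as on the slide;
    a word found at reference position p votes for read start p - offset.
    """
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return dict(votes)

reference = "TTACGTTAGTGAAGCC"   # toy reference, made up for illustration
read = "ACGTTCGTGAAG"            # read from the slide, one letter off the reference
index = build_index(reference, 4)
hits = candidate_positions(read, index, 4)
```

Two of the three words (ACGT and GAAG) agree on the same start position even though the middle word contains the differing letter, which is why a read with a SNP can still be placed.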

Page 18: 10 Billion Piece Jigsaw Puzzles

Aligning

A C G T T A G T G A A G

A C G T T C G T G A A G

1.6 billion
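
Once indexing has proposed a position, aligning checks the read letter by letter against the reference there. A minimal sketch that only handles substitutions (no indels), assuming the first row on the slide is the reference and the second is the read; `mismatches` is a hypothetical helper.

```python
def mismatches(read, reference, start):
    """Return the read positions that differ from the reference at `start`."""
    window = reference[start:start + len(read)]
    if len(window) < len(read):
        raise ValueError("read runs off the end of the reference")
    return [i for i, (r, g) in enumerate(zip(read, window)) if r != g]

reference_fragment = "ACGTTAGTGAAG"  # first row on the slide
read = "ACGTTCGTGAAG"                # second row on the slide
diff = mismatches(read, reference_fragment, 0)
```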

Page 19: 10 Billion Piece Jigsaw Puzzles

Cutting Edge Run

• Human genome (3 billion nt)

• 1 billion reads of 100 nt, coverage of 30

• Indexing + Aligning in 27 minutes
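
The figures on this slide can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones quoted above, and the computed coverage of about 33 is what the slide rounds to 30.

```python
genome_nt = 3_000_000_000   # human genome size from the slide
reads = 1_000_000_000       # number of reads
read_len = 100              # nucleotides per read
minutes = 27                # indexing + aligning time

# Average number of reads covering each position of the genome.
coverage = reads * read_len / genome_nt

# Throughput implied by the quoted run time.
reads_per_second = reads / (minutes * 60)

print(f"coverage ~ {coverage:.1f}, throughput ~ {reads_per_second:,.0f} reads/s")
```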

Page 20: 10 Billion Piece Jigsaw Puzzles

i7 Quad Core

Page 21: 10 Billion Piece Jigsaw Puzzles

2 sockets X 4 cores X 2 hyperthreads = 16

48 GB RAM

10 computers

1 TB disk/genome = 500GB + 200GB + 200GB + 0.3GB

X thousands of genomes

Page 22: 10 Billion Piece Jigsaw Puzzles

Shapes of the Jigsaw Pieces

Company               Lengths (nt)
454                   15 - 700
Illumina              36 - 150
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100*

Page 23: 10 Billion Piece Jigsaw Puzzles

Paired End Reads

100 nt   100 nt   100 - 1,000 nt

Index, Align

Index, Align

Match

100 nt

Page 24: 10 Billion Piece Jigsaw Puzzles

Solving the Jigsaw without the picture

• Indexing

• Alignment

Assembly

Page 25: 10 Billion Piece Jigsaw Puzzles

Assembly

T A G T G A A G A A T T

A C G T T C G T G A A G

A C G T   T C G T   G A A G

T A G T   G A A G   A A T T

A C G T T ? G T G A A G A A T T
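
The merge shown on this slide can be reproduced with a small overlap search. This is a sketch only, assuming a greedy pairwise merge on the longest suffix/prefix overlap with at most one disagreement; real assemblers work on many reads at once, and `merge_reads` with its thresholds is hypothetical.

```python
def merge_reads(left, right, min_overlap=4, max_mismatches=1):
    """Merge two reads on their longest suffix/prefix overlap.

    Positions where the reads disagree inside the overlap are written as '?'.
    Returns None if no acceptable overlap exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        tail, head = left[-size:], right[:size]
        bad = sum(a != b for a, b in zip(tail, head))
        if bad <= max_mismatches:
            consensus = "".join(a if a == b else "?" for a, b in zip(tail, head))
            return left[:-size] + consensus + right[size:]
    return None

# The two reads from the slide; they overlap by 8 letters with one disagreement.
merged = merge_reads("ACGTTCGTGAAG", "TAGTGAAGAATT")
print(merged)
```

The single '?' marks the position where one read has C and the other has A, matching the last row of the slide.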

Page 26: 10 Billion Piece Jigsaw Puzzles

SNP calling

15A 13C AC heterozygous SNP

15A 4C

5A 2C

1A 2C

Bayesian statistics (SNPs 1/1,000)

31A 42C Throw it out
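
A minimal sketch of how allele counts like these could be turned into a call with Bayes' rule. Everything beyond the counts and the 1/1,000 prior is an assumption made for illustration: A is taken as the reference allele, reads are independent with a fixed 1% error rate, and a site is thrown out when its depth exceeds twice the nominal coverage of 30 (one plausible reading of the "31A 42C" case). `call_genotype` is a hypothetical name.

```python
from math import comb

def call_genotype(n_a, n_c, error=0.01, snp_prior=0.001, max_depth=60):
    """Return (genotype, posterior) for a site with n_a 'A' and n_c 'C' reads."""
    depth = n_a + n_c
    if depth > max_depth:
        return ("rejected", 0.0)  # pile-up too deep, e.g. reads from a repeat
    # Probability that a single read shows A under each genotype.
    p_a = {"AA": 1 - error, "AC": 0.5, "CC": error}
    # Prior: non-reference genotypes are rare (assumed 1/1,000 each).
    prior = {"AA": 1 - 2 * snp_prior, "AC": snp_prior, "CC": snp_prior}
    ways = comb(depth, n_a)
    # Binomial likelihood times prior for each genotype.
    joint = {g: prior[g] * ways * p_a[g] ** n_a * (1 - p_a[g]) ** n_c for g in p_a}
    best = max(joint, key=joint.get)
    return (best, joint[best] / sum(joint.values()))

het = call_genotype(15, 13)    # first case on the slide
deep = call_genotype(31, 42)   # last case on the slide
```

With these assumptions the 15A 13C site is called heterozygous AC, and the 31A 42C site is rejected on depth alone before any likelihood is computed.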

Page 27: 10 Billion Piece Jigsaw Puzzles

REF: aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
SIM: T AAGAAT
SIM: T AAGAAT
CALL: T G
CALL: T T
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: TTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
READ: TCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
READ: CTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
READ: AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
READ: GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
READ: TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
READ: TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
READ: GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
READ: CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
READ: _ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
READ: TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
READ: GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
READ: TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
READ: GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
READ: GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
READ: TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
READ: CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
READ: TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
READ: CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
READ: TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGT
READ: AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCG
READ: AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTT
READ: ______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTTCG

Page 28: 10 Billion Piece Jigsaw Puzzles

Comparing twins

3,000,000 SNPs

Do any of them differ between the twins?

15A 4C 3A 10C 3G

Page 29: 10 Billion Piece Jigsaw Puzzles
Page 30: 10 Billion Piece Jigsaw Puzzles

DNA

mRNA

protein

Gene

Page 31: 10 Billion Piece Jigsaw Puzzles
Page 32: 10 Billion Piece Jigsaw Puzzles

Cancer comparison

Page 33: 10 Billion Piece Jigsaw Puzzles

Copy Number Variants

• Varying levels of extraction of reads across genome (use differences)

• Locate boundaries (as accurately as possible)

• Extract number of variants

• Use SNPs
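
The first three steps on this slide can be sketched with read depth per fixed-size window. This is an illustration under stated assumptions, not the method from the talk: it uses the sample/control depth ratio rather than differences, rounds to whole copies, ignores the SNP evidence mentioned in the last bullet, and all depths are made up. `copy_number_segments` is a hypothetical name.

```python
def copy_number_segments(sample_depth, control_depth, normal_copies=2):
    """Turn per-window read depths into (start, end, copies) segments.

    Each window's copy number is the sample/control depth ratio scaled by
    the normal copy count and rounded. Adjacent windows with the same
    estimate are merged, so segment edges mark the boundaries.
    """
    segments = []
    for i, (s, c) in enumerate(zip(sample_depth, control_depth)):
        copies = round(normal_copies * s / c) if c else None
        if segments and segments[-1][2] == copies:
            segments[-1] = (segments[-1][0], i + 1, copies)
        else:
            segments.append((i, i + 1, copies))
    return segments

tumor = [30, 31, 29, 46, 44, 45, 15, 16]    # made-up window depths
normal = [30, 30, 30, 30, 30, 30, 30, 30]   # made-up control depths
segments = copy_number_segments(tumor, normal)
```

Here windows 3 to 5 come out as a gain (3 copies) and windows 6 to 7 as a loss (1 copy); boundary accuracy is limited by the window size.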

Page 34: 10 Billion Piece Jigsaw Puzzles
Page 35: 10 Billion Piece Jigsaw Puzzles
Page 36: 10 Billion Piece Jigsaw Puzzles

Metagenomics or what is living on you

• Mapping reads back onto a database of known bacteria/viruses

• Many are ambiguous

• Many don’t map at all

• Estimate frequency of each species

• Remove human “contamination”

Page 37: 10 Billion Piece Jigsaw Puzzles

TS1
0.389 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.183 gi|187734516|ref|NC_010655.1| Akkermansia muciniphila ATCC BAA-835
0.145 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.037 gi|119025018|ref|NC_008618.1| Bifidobacterium adolescentis ATCC 15703

TS4
0.428 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.210 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.149 gi|60650141|ref|NC_006873.1| Bacteroides fragilis NCTC 9343 plasmid pBF9343
0.037 gi|121999251|ref|NC_008790.1| Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.036 gi|238922432|ref|NC_012781.1| Eubacterium rectale ATCC 33656

TS25
0.752 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.073 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.041 gi|121999251|ref|NC_008790.1| Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.020 gi|58036264|ref|NC_004307.2| Bifidobacterium longum NCC2705
0.018 gi|189438863|ref|NC_010816.1| Bifidobacterium longum DJO10A

Page 38: 10 Billion Piece Jigsaw Puzzles

Metagenomics

• Map reads to database

• Estimate most likely frequencies: a hill climbing estimation problem

• Can anything be done about unmapped reads?
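
One way to treat frequency estimation as a hill-climbing problem is expectation-maximisation, where every pass can only keep or raise the likelihood. The talk does not say which method was used, so this is a sketch under that assumption; `estimate_frequencies`, the species labels and the read counts are all made up, and unmapped reads are simply skipped.

```python
def estimate_frequencies(read_hits, species, iterations=50):
    """Estimate species frequencies from reads that may hit several species.

    read_hits: list of sets, each holding the species one read maps to.
    Each pass shares every read among its candidate species in proportion
    to the current frequencies, then re-estimates the frequencies from
    those shares.
    """
    freq = {s: 1.0 / len(species) for s in species}
    for _ in range(iterations):
        counts = {s: 0.0 for s in species}
        for hits in read_hits:
            weight = sum(freq[s] for s in hits)
            if weight == 0:
                continue  # unmapped read, or only hits species at frequency 0
            for s in hits:
                counts[s] += freq[s] / weight
        total = sum(counts.values())
        if total == 0:
            return freq
        freq = {s: counts[s] / total for s in species}
    return freq

# Made-up reads: 6 unique to A, 2 unique to B, 4 ambiguous, 1 unmapped.
reads = [{"A"}] * 6 + [{"B"}] * 2 + [{"A", "B"}] * 4 + [set()]
freq = estimate_frequencies(reads, ["A", "B"])
```

In this toy case the ambiguous reads end up split 3:1, the same ratio as the uniquely mapped reads, giving frequencies of 0.75 and 0.25.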

Page 39: 10 Billion Piece Jigsaw Puzzles
Page 40: 10 Billion Piece Jigsaw Puzzles
Page 41: 10 Billion Piece Jigsaw Puzzles