Transcript
Page 1: 10 Billion Piece Jigsaw Puzzles

10 Billion Piece Jigsaw Puzzles

John Cleary
Netvalue Ltd.

Real Time Genomics

Page 2: 10 Billion Piece Jigsaw Puzzles
Page 3: 10 Billion Piece Jigsaw Puzzles

[Figure: scale of numbers: hundred, thousand, 10 thousand, 100 thousand, million, 10 million, 100 million, billion, 10 billion, 100 billion]

Page 4: 10 Billion Piece Jigsaw Puzzles

Genome

Transcriptome

Cancer

Page 5: 10 Billion Piece Jigsaw Puzzles

Genomes of …

• human
• reference species: mouse, chimp, arabidopsis…
• agricultural species: cattle, sheep, pig, … rice, wheat, grape …
• bacterial: disease, human “ecosystem”

Page 6: 10 Billion Piece Jigsaw Puzzles

Differences between …

• Individuals
• Populations: disease and “quantitative traits”
• Somatic and tumor genomes
• Transcriptome of child and parents
• Bacterial populations of individuals

Page 7: 10 Billion Piece Jigsaw Puzzles

Human Genome

3 billion

Nucleotides

Page 8: 10 Billion Piece Jigsaw Puzzles

Shapes of the Jigsaw Pieces

Company               Lengths (nt)
454                   15 - 700
Illumina              36 - 150
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100*

Page 9: 10 Billion Piece Jigsaw Puzzles

Differences between genomes - SNPs

A C G T T A G T G A

A C G T T A G T G A

A C G T T C G T G A

A C G T T G G T G A

~ 1 / 1,000

3,000,000 nt
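A minimal sketch of the two ideas on this slide: spotting single-nucleotide differences between equal-length sequences, and the arithmetic behind the 3,000,000 figure. The function name `snp_positions` is made up for illustration, and the sequences are the ten-letter toy examples from the slide.

```python
def snp_positions(a: str, b: str) -> list[int]:
    """Return 0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Toy sequences taken from the slide.
reference = "ACGTTAGTGA"
sample_1 = "ACGTTCGTGA"
sample_2 = "ACGTTGGTGA"

diff_1 = snp_positions(reference, sample_1)
diff_2 = snp_positions(reference, sample_2)

# About 1 SNP per 1,000 nt across a 3 billion nt genome.
GENOME_SIZE = 3_000_000_000
NT_PER_SNP = 1_000
expected_snps = GENOME_SIZE // NT_PER_SNP
```

Both toy samples differ from the reference at the same single position, and the expected count works out to three million.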

Page 10: 10 Billion Piece Jigsaw Puzzles

REF: aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
SIM: T AAGAAT
SIM: T AAGAAT
CALL: T G
CALL: T T
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: TTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
READ: TCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
READ: CTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
READ: AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
READ: GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
READ: TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
READ: TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
READ: GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
READ: CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
READ: _ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
READ: TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
READ: GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
READ: TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
READ: GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
READ: GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
READ: TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
READ: CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
READ: TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
READ: CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
READ: TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGT
READ: AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCG
READ: AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTT
READ: ______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTTCG

Page 11: 10 Billion Piece Jigsaw Puzzles
Page 12: 10 Billion Piece Jigsaw Puzzles

Differences between human genomes - MNPs

A C G T T A G T G A

A C G T T A G T G A

A C G T T C A G A

A C G T T G T G A

Page 13: 10 Billion Piece Jigsaw Puzzles

Differences between human genomes - indels

A C G T T A G T G A

A C G T T A G T G A

A C G T T G T G A

A C G T T G G T G A

~ 1 / 10,000 300,000

Page 14: 10 Billion Piece Jigsaw Puzzles

Differences between genomes - inserts

A C G T T A G T G A

A C G T T A G T G A

Up to 1,000,000 nt total 3,000,000 nt

T T A G G A C C C A

Page 15: 10 Billion Piece Jigsaw Puzzles

Differences between genomes – structural variants

Tandem Repeat

Inversion

Copy

Page 16: 10 Billion Piece Jigsaw Puzzles

Solving the Jigsaw

• Indexing

• Alignment

• SNP/MNP/Indel/SV calling

Mapping

Page 17: 10 Billion Piece Jigsaw Puzzles

Indexing

A C G T T A G T G A A G

A C G T T C G T G A A G

A C G T   T C G T   G A A G

A C G T   T A G T   G A A G

4.5 billion
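One way to read this slide is as a k-mer lookup: cut the read into 4-letter words and look each word up in an index of the reference. The sketch below assumes a plain hash-table index and non-overlapping words. The slides do not say how the real index is built, so `build_index` and `candidate_positions` are illustrative names, not the speaker's implementation.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read: str, index: dict[str, list[int]], k: int) -> dict[int, int]:
    """Count, for each implied read start, how many non-overlapping k-mers agree on it."""
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], []):
            votes[hit - offset] += 1
    return dict(votes)


# Toy sequences from the slide: the read differs from the reference at one letter.
reference = "ACGTTAGTGAAG"
read = "ACGTTCGTGAAG"

index = build_index(reference, k=4)
votes = candidate_positions(read, index, k=4)
```

Here the first and last words of the read (ACGT and GAAG) are found in the index and both vote for start position 0. The middle word (TCGT) contains the differing letter and finds nothing.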

Page 18: 10 Billion Piece Jigsaw Puzzles

Aligning

A C G T T A G T G A A G

A C G T T C G T G A A G

1.6 billion

Page 19: 10 Billion Piece Jigsaw Puzzles

Cutting Edge Run

• Human genome (3 billion nt)

• 1 billion reads of 100 nt, coverage of 30

• Indexing + Aligning in 27 minutes
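The numbers on this slide can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch; the function name `coverage` is chosen here for illustration.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each position of the genome."""
    return num_reads * read_length / genome_size


NUM_READS = 1_000_000_000
READ_LENGTH = 100
GENOME_SIZE = 3_000_000_000
RUN_SECONDS = 27 * 60

depth = coverage(NUM_READS, READ_LENGTH, GENOME_SIZE)
reads_per_second = NUM_READS / RUN_SECONDS
```

The nominal depth comes out a little above 33, in line with the slide's round figure of 30, and a 27 minute run corresponds to roughly 600,000 reads indexed and aligned per second.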

Page 20: 10 Billion Piece Jigsaw Puzzles

i7 Quad Core

Page 21: 10 Billion Piece Jigsaw Puzzles

2 sockets X 4 cores X 2 hyperthreads = 16

48 GB RAM

10 computers

1 TB disk/genome = 500GB + 200GB + 200GB + 0.3GB

X thousands of genomes

Page 22: 10 Billion Piece Jigsaw Puzzles

Shapes of the Jigsaw Pieces

Company               Lengths (nt)
454                   15 - 700
Illumina              36 - 150
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100*

Page 23: 10 Billion Piece Jigsaw Puzzles

Paired End Reads

100 nt    100 nt

100 - 1,000 nt

Index, Align

Index, Align

Match

100 nt
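The "Match" step can be pictured as keeping only those placements of the two ends that sit a plausible distance apart. The sketch below assumes the 100 - 1,000 nt figure is the gap between the two 100 nt reads and ignores strand orientation. Both are simplifying assumptions, and `consistent_pairs` is a hypothetical name.

```python
def consistent_pairs(
    left_hits: list[int],
    right_hits: list[int],
    read_len: int = 100,
    min_gap: int = 100,
    max_gap: int = 1_000,
) -> list[tuple[int, int]]:
    """Keep (left, right) start positions whose gap is within the expected range."""
    pairs = []
    for left in left_hits:
        for right in right_hits:
            gap = right - (left + read_len)
            if min_gap <= gap <= max_gap:
                pairs.append((left, right))
    return pairs


# Made-up candidate positions: each end maps to two places on its own.
left_hits = [5_000, 250_000]
right_hits = [5_400, 900_000]

pairs = consistent_pairs(left_hits, right_hits)
```

Each end is ambiguous alone, but only one combination leaves a gap in the allowed range, so the pair resolves to a single placement.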

Page 24: 10 Billion Piece Jigsaw Puzzles

Solving the Jigsaw without the picture

• Indexing

• Alignment

Assembly

Page 25: 10 Billion Piece Jigsaw Puzzles

Assembly

T A G T G A A G A A T T

A C G T T C G T G A A G

A C G T   T C G T   G A A G

T A G T   G A A G   A A T T

A C G T T ? G T G A A G A A T T
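The merge shown on this slide can be reproduced with a small overlap search: find the longest suffix of one read that matches a prefix of the other with at most one disagreement, then join them and mark the disagreement with "?". This is a toy sketch with assumed thresholds (`min_overlap`, `max_mismatches`), not a description of a real assembler.

```python
def best_overlap(left: str, right: str, min_overlap: int = 4, max_mismatches: int = 1) -> int:
    """Length of the longest suffix of `left` matching a prefix of `right`, or 0."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        suffix, prefix = left[-length:], right[:length]
        if sum(a != b for a, b in zip(suffix, prefix)) <= max_mismatches:
            return length
    return 0


def merge(left: str, right: str, overlap: int) -> str:
    """Join two reads, writing '?' where the overlapping letters disagree."""
    split = len(left) - overlap
    middle = "".join(a if a == b else "?" for a, b in zip(left[split:], right[:overlap]))
    return left[:split] + middle + right[overlap:]


# The two reads from the slide.
read_a = "ACGTTCGTGAAG"
read_b = "TAGTGAAGAATT"

overlap = best_overlap(read_a, read_b)
contig = merge(read_a, read_b, overlap)
```

With the slide's two reads the overlap is 8 letters with one disagreement (C against A), and the merged sequence matches the slide's result, including the "?".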

Page 26: 10 Billion Piece Jigsaw Puzzles

SNP calling

15A 13C: AC heterozygous SNP

15A 4C

5A 2C

1A 2C

Bayesian statistics (SNPs 1/1,000)

31A 42C: throw it out

Page 27: 10 Billion Piece Jigsaw Puzzles

[Same read pileup as on Page 10]

Page 28: 10 Billion Piece Jigsaw Puzzles

Comparing twins

3,000,000 SNPs

Do any of them differ between the twins?

15A 4C 3A 10C 3G

Page 29: 10 Billion Piece Jigsaw Puzzles
Page 30: 10 Billion Piece Jigsaw Puzzles

DNA

mRNA

protein

Gene

Page 31: 10 Billion Piece Jigsaw Puzzles
Page 32: 10 Billion Piece Jigsaw Puzzles

Cancer comparison

Page 33: 10 Billion Piece Jigsaw Puzzles

Copy Number Variants

• Varying levels of extraction of reads across genome (use differences)

• Locate boundaries (as accurately as possible)

• Extract number of variants

• Use SNPs
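One plausible reading of these bullets is to compare read depth between a tumor sample and a normal sample window by window, look for jumps in the ratio, and convert the ratio into a copy number. The sketch below assumes a diploid baseline (two copies in the normal sample) and a hand-picked jump threshold. The depth values are made up, and the SNP-based refinement mentioned on the slide is not covered.

```python
def copy_ratio(tumor_depth: list[float], normal_depth: list[float]) -> list[float]:
    """Per-window ratio of tumor depth to normal depth."""
    if any(n <= 0 for n in normal_depth):
        raise ValueError("normal depth must be positive in every window")
    return [t / n for t, n in zip(tumor_depth, normal_depth)]


def boundaries(ratios: list[float], threshold: float = 0.4) -> list[int]:
    """Window indices where the ratio jumps by more than `threshold`."""
    return [i for i in range(1, len(ratios)) if abs(ratios[i] - ratios[i - 1]) > threshold]


def copy_numbers(ratios: list[float], baseline: int = 2) -> list[int]:
    """Estimated copies per window, assuming `baseline` copies in the normal sample."""
    return [round(baseline * r) for r in ratios]


# Made-up depths for eight consecutive windows.
normal = [30, 30, 30, 30, 30, 30, 30, 30]
tumor = [30, 31, 29, 45, 46, 44, 30, 29]

ratios = copy_ratio(tumor, normal)
edges = boundaries(ratios)
copies = copy_numbers(ratios)
```

In this toy data the depth rises by about half in windows 3 to 5, which the sketch reports as a three-copy segment bounded at windows 3 and 6.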

Page 34: 10 Billion Piece Jigsaw Puzzles
Page 35: 10 Billion Piece Jigsaw Puzzles
Page 36: 10 Billion Piece Jigsaw Puzzles

Metagenomics or what is living on you

• Mapping reads back onto a database of known bacteria/viruses

• Many are ambiguous

• Many don’t map at all

• Estimate frequency of each species

• Remove human “contamination”

Page 37: 10 Billion Piece Jigsaw Puzzles

TS1
0.389 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.183 gi|187734516|ref|NC_010655.1| Akkermansia muciniphila ATCC BAA-835
0.145 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.037 gi|119025018|ref|NC_008618.1| Bifidobacterium adolescentis ATCC 15703

TS4
0.428 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.210 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.149 gi|60650141|ref|NC_006873.1| Bacteroides fragilis NCTC 9343 plasmid pBF9343
0.037 gi|121999251|ref|NC_008790.1| Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.036 gi|238922432|ref|NC_012781.1| Eubacterium rectale ATCC 33656

TS25
0.752 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.073 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.041 gi|121999251|ref|NC_008790.1| Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.020 gi|58036264|ref|NC_004307.2| Bifidobacterium longum NCC2705
0.018 gi|189438863|ref|NC_010816.1| Bifidobacterium longum DJO10A

Page 38: 10 Billion Piece Jigsaw Puzzles

Metagenomics

• Map reads to database

• Estimate most likely frequencies: a hill climbing estimation problem

• Can anything be done about unmapped reads?
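The slides do not name the estimation method. One standard way to climb towards the most likely frequencies when reads map ambiguously is expectation-maximisation, sketched below as an assumption. Each read is represented by the set of species it maps to, ambiguous reads are shared out in proportion to the current estimate, and the estimate is updated from the shared counts. Genome-length normalisation is left out, unmapped reads are simply skipped, and the species labels and read counts are made up.

```python
def estimate_frequencies(
    read_hits: list[set[str]], species: list[str], iterations: int = 100
) -> dict[str, float]:
    """Estimate species frequencies from reads that may map to several species."""
    freq = {s: 1 / len(species) for s in species}
    for _ in range(iterations):
        counts = {s: 0.0 for s in species}
        for hits in read_hits:
            total = sum(freq[s] for s in hits)
            if total == 0:
                continue  # unmapped read, or all its species currently at zero
            for s in hits:
                counts[s] += freq[s] / total  # share the read between candidates
        assigned = sum(counts.values())
        if assigned == 0:
            break
        freq = {s: c / assigned for s, c in counts.items()}
    return freq


species = ["species_a", "species_b"]
read_hits = (
    [{"species_a"}] * 6                 # reads unique to species_a
    + [{"species_b"}] * 2               # reads unique to species_b
    + [{"species_a", "species_b"}] * 4  # ambiguous reads
    + [set()] * 3                       # reads that map to nothing
)

freq = estimate_frequencies(read_hits, species)
```

For this toy input the estimate settles at three quarters species_a and one quarter species_b, the same split as the uniquely mapped reads, with the ambiguous reads shared accordingly.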

Page 39: 10 Billion Piece Jigsaw Puzzles
Page 40: 10 Billion Piece Jigsaw Puzzles
Page 41: 10 Billion Piece Jigsaw Puzzles
