a dna-based archival storage system - snia · 2019-12-21 · a dna-based archival storage system...

36
A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research joint work with Karin Strauss, Doug Carmean, Georg Seelig, James Bornholt, Randolph Lopez, Lee Organick, Yuan Chen, Chris Takahashi, Bichlien Nguyen, Sergey Yekhanin, Siena Dumas Ang.

Upload: others

Post on 24-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

A DNA-BasedArchival Storage System

Luis Ceze

University of Washington Microsoft Research

joint work with Karin Strauss, Doug Carmean, Georg Seelig, James Bornholt, Randolph Lopez, Lee Organick, Yuan Chen, Chris Takahashi, Bichlien Nguyen, Sergey Yekhanin, Siena Dumas Ang.

Page 2: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Life evolved a fantastic storage medium…

But why?

Gene

Protein

Function

DNA

100101010

Page 3: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

DNA molecules for digital data

Extremely dense Theory: 1 exabyte in 1 mm3

Extremely durable Half life > 500 years

And readers never become obsolete!

Page 4: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Comparing Storage Technologies

Would require:

4,000,000tapecartridges

Stack80kmhigh

1.0E+00

1.0E+01

1.0E+02

1.0E+03

1.0E+04

1.0E+05

1.0E+06

1.0E+07

1.0E+08

1.0E+09

1.0E+10

ThinFilm(MEMS)

Tape Optical Flash(Chip) DNA

Gb/m

m^3

Today Projection Limit

redundancy

spacing

addressing

Potential of DNA:>107 Improvement

Page 5: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

The ultimate hierarchy

FlashHDDTape

DNA-based Archival

~5 yrs

centuries

Access Timems

10s msminutes

hrs

Durability

~5 yrs ~15-30 yrs

Page 6: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

A DNA-based archival storage system

Store

Write Read

Redundancy and density

Efficient retrieval

Wet lab experiments

[ASPLOS’16]

11010101… 11010101…

ACATCG…

Page 7: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

DNA moleculesFour nucleotides:

A

C

G

T

Adenine

Cytosine

Guanine

Thymine

DNA strand (oligonucleotide) is a linear sequence of these nucleotides

G A C A C C TG A C A C C T

Page 8: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

DNA moleculesFour nucleotides:

A

C

G

T

Adenine

Cytosine

Guanine

Thymine

DNA strand (oligonucleotide) is a linear sequence of these nucleotides

G A C A C C T

Two strands can bind to each otherif they are complementary:

C T G T G G A

G A C A C C T

C, G arecomplementary

A, T arecomplementary

T

Partial errors allowed

Page 9: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

9

DNA data storage at 30,000 feet

11011101 11011101

Encoding

Decoding

AGCTATCAG AGCTATCAGSynthesis

Sequencing

DNA Manipulation

write path read path

Page 10: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

10

DNA Manipulation: Synthesis

AGCTATCAG AGCTATCAG

Synthesis

Sequencing

DNA Synthesis: manufacturing DNA strands

GACACCT G A C A C C T

• Synthesis process appends one letter at a time• Maximum practical sequence length of ~100s of letters• Produces millions of copies of each sequence• Can make many different sequences in parallel• Normally used for genomics and genetic engineering

Page 11: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

11

DNA Manipulation: Sequencing

AGCTATCAG AGCTATCAG

Synthesis

Sequencing

DNA Sequencing: reading DNA strandsGACACCTG A C A C C T

• Produces many reads of a strand• Normally used for research and diagnostics• Currently much higher throughput than synthesis

Page 12: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

12

DNA data storage at 30,000 feet

11011101 11011101

Encoding

Decoding

AGCTATCAG AGCTATCAGSynthesis

Sequencing

write path read path

Page 13: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Writing digital data to DNAThe easy way: convert base 2 to base 4

But this approach isn’t feasible for more than a few bytes

G G A T G C A

P[Attach] = 99%

99% 98% 97% 96.1% 95.1% 94.2%

100 nts

36.6%

200 nts

13.4%

… …

G C A C

10100011 10010001 11100111 11000101 10010100 10111101

2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1

G G A T T G C T T A C C G C C A G T T C

Page 14: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

G C A C

Chunking dataBreak binary data into chunks stored in separate strands10100011 10010001 11100111 11000101 10010100 10111101

2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1

G G A T T G C T T A C C G C C A G T T C

Page 15: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

C A T C C

G C A C

Chunking dataBreak binary data into chunks stored in separate strands10100011 10010001 11100111 11000101 10010100 10111101

2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1

G G A T

T G C T T A C C

G C C A G T T C

A A A A

A A A C

A A A G

C A T C C

C A T C C

A T G T T

A T G T T

A T G T TAddresses within the file

File identifiers(“primers”)

Page 16: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

1 of N

2 of N

3 of N

~ 10 bytes per DNA strand.

C A T C C

G C A C G G A T

T G C T T A C C

G C C A G T T C

A A A A

A A A C

A A A G

C A T C C

C A T C C

A T G T T

A T G T T

A T G T TAddresses within the file

File identifiers(“primers”)

Page 17: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Other encoding considerations

G C C TA AAHomopolymers are bad…

G A C A C C T G

C T G T G C A CA T

G

TCA

T

A

G

C

Secondary structures are problematic …

Page 18: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

18

DNA data storage at 30,000 feet

11011101 11011101

Encoding

Decoding

AGCTATCAG AGCTATCAGSynthesis

Sequencing

write path read path

Page 19: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Random access?

?

Page 20: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

C A T C C

Random access!

PCR

Selectively amplify strands based on their primer

SampleStrands with 3 different primers

Almost all strands have desired primer

Reads are destructive, so replenish when necessary

T A T C T

G C A C G G A T T G C T T A C C

G C C A G T T C

A A A A A A A C

A A A G

G A T A C A T G T T C C A C T

A T G T T T A C A A A T A G A

Page 21: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Going back to digital: Decoding

Sequencing

Clustering reads

…110110101101…

C A T C C

G C A C G G A T

T G C T T A C C

G C C A G T T C

A A A A

A A A C

A C A G

C A T C C

C A T G C

A T G T T

A T G T T

A T G T T

C A T C C

G C A C G G A T

T G C T T A C C

G C C A G T T C

A A G A

C A A C

A A A G

C A T C C

C A T C C

A T G T T

A T G T T

A T G T T

……..

C A T C C

G C A C G G A T

T G C T T A C C

G C C A G T T C

A A G A

C A A C

A A A G

C A T C C

C A T C C

A T G T T

A T G T T

A T G T T

Error correction

Page 22: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Error correctionBoth synthesis and sequencing are error prone:

G G A T A G C

G G A T G C A

G G A T G A

G G A T C C A

Insertions

Deletions

Substitutions

A

Aggregate error rates ~1% per position

Page 23: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Logical redundancy

C A T C C

G C A C G G A T

T G C T T A C C

G C C A G T T C

A A A A

A A A C

A A A G

C A T C C

C A T C C

A T G T T

A T G T T

A T G T T

Addresses within the value

Key identifiers(“primers”)

Redundant data in additional DNA strands. Many possibilities: parity, Reed Solomon, LDPC, … C A T C C

T C G C T A C G

T G C A A T T C

C A A C

CA A G

C A T C C A T G T T

A T G T T C A T C C

Page 24: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

SAS / SATA

Putting it all together

Page 25: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Putting it all together

DNA storage (physical) library

~100TB

Data address/key specifies physical location and primer for random access.

keyfoo.mp4

encode andselect primers seqs

manufacture DNA

keyfoo.mp4

PCR

amplification

primers

sequencing

decoding

1101000100…

Page 26: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Wet lab results

Page 27: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Experiments

catcatgg

catcatgc

ThroughputMBs/week

Page 28: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Photo: Tara Brown / UW

Page 29: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

DecodingEncoded and synthesized 3 files (151 kB):

Selected and PCRed one file for random access (42 kB):

Sequenced and decoded the resulting amplified pool:

Recovered every bit despite errors in synthesis and sequencing

Page 30: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

The importance of redundancy

C A T C C

G C A C G G A T

T G C T T A C C

G C C A G T T C

A A A A

A A A C

A A A G

C A T C C

C A T C C

A T G T T

A T G T T

A T G T T

Addresses within the value

Key identifiers(“primers”)

Redundant data in additional DNA strands. Many possibilities: parity, Reed Solomon, LDPC, … C A T C C

T C G C T A C G

T G C A A T T C

C A A C

CA A G

C A T C C A T G T T

A T G T T C A T C C

0

25

50

75

0 2500 5000 7500Number of copies

Freq

uenc

y Some strands are missing entirely

If we ignore redundancy data, we cannot recover the file.

Page 31: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Error Analysis: Synthesis and SequencingBoth synthesis and sequencing are error prone, but how much?

G G A T G C A

Array synthesis Precise synthesis

$$$$

Sequencing

Synthesis error+

Sequencing error Sequencing error

Synthesis

Sequencing

0.0%

0.5%

1.0%

1.5%

0 25 50 75 100Location

Errors

SynthesisArraySingle

Most error is due to sequencing, not synthesis

G G A T G C A

Page 32: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Durability

G G A T G C A

G G A T G C A

G G A T G C A

WaterRadiation…

Page 33: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Durability

10

100

1000

1 10 100 1000Time (years)

Cop

ies

Req

uire

d

Reliability99.99%99.9%99%

Half-life in bone: ~521 years.

DNAsyntheticfossilssurvive:

Source:Grassetal.RobustChemicalPreservationofDigitalInformationonDNAinSilicawithError-CorrectingCodes

Time Temperature1week 70˚C2,000 years 10˚C2,000,000 years -18˚C

Page 34: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

MBs/week GBs/second

Page 35: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

DNA manipulation productivity is growing

102

104

106

108

1010

1970 1980 1990 2000 2010Year

Prod

uctiv

ity

Transistors on ChipReading DNAWriting DNA

Source: Robert Carlson

And cost is decreasing…

Page 36: A DNA-Based Archival Storage System - SNIA · 2019-12-21 · A DNA-Based Archival Storage System Luis Ceze University of Washington Microsoft Research ... DNA data storage at 30,000

Molecular Information Systems Lab

Thanks!