
Improved Models and Algorithms for Universal DNA Tag Systems

Tejas Iyer, Georgia Tech

David Cash, Georgia Tech

Outline of Part 1: Exposition

Motivation: The bio problem and applications

Formalization: The math problem

Analysis: Bounding the best possible solution

Part 2 (Tejas) is the original contribution

Motivation (1): DNA computing

• Methods that exploit the massively parallel, self-assembling nature of DNA to solve hard computational problems



• Step 1 of DNA computing: encode the problem

ACTGTTTCATTAAGCGCGTT

⠇

GGTAATTAAC




• A trivial (ignored) step in most models of computation, e.g. Turing machines, circuit families, random access machines

• But the thermodynamics of DNA gets in the way. Hybridization? Secondary structures? More...

Motivation (2): SNP microarrays

• Single Nucleotide Polymorphism (SNP) Genotyping

• Detecting variation at a single locus (base) within a population

• Several important applications in medicine: helps explain how single bases affect our reaction to diseases and drugs

• Testing several SNPs is expensive or impossible if done individually

• One solution: SNP microarrays

• Main technical component is mass produced to reduce cost.

• Allows one to run hundreds of thousands of SNPs simultaneously

Tags:

TGGATTAAC GTAATCCAA GGGTTACAC TATGACCAG

Anti-tags:

ACCTAATTG CATTAGGTT CCCAATGTG ATACTGGTC



AnF‐Tags:


G T

A C



AnF‐Tags:


Microarray:

ACCTAATTG

CATTAGGTT

CCCAATGTG

ATACTGGTCG T

A C

MoFvaFon (2): SNP microarraysSNPs to genotype:Tags:


AnF‐Tags:


Microarray:

ACCTAATTG

CATTAGGTT

CCCAATGTG

ATACTGGTC

?TGAA

?TGAA

?TGAA

G T

A C

?TGAA


ACTT


AnF‐Tags:


Microarray:

ACCTAATTG

CATTAGGTT

CCCAATGTG

ATACTGGTC

?TGAA

?TGAA

?TGAATGGATTAAC

G T

A

ACTT GTAATCCAA

C

ACTT?TGAAACTTGGGTTACAC TATGACCAG


ACTT


AnF‐Tags:


Microarray:

ACCTAATTG

CATTAGGTT

CCCAATGTG

ATACTGGTC

?TGAA

?TGAA

?TGAATGGATTAACA

G T

A

ACTT GTAATCCAA

C

ACTT?TGAAACTTGGGTTACAC TATGACCAGTG

C



AnF‐Tags:


Microarray:

ACCTAATTG

CATTAGGTT

CCCAATGTG

ATACTGGTC

?TGAA

?TGAA

?TGAA

G T

A

ACTTTGGATT

AACA

ACTT GTAATCCAA

C


C



AnF‐Tags:


Microarray:

ACCTAATTG

CATTAGGTT

CCCAATGTG

ATACTGGTC

?TGAA

?TGAA

?TGAA

G T

A

ACTTTGGATT

AACA

C


ACTT

CGTAATCCAA



AnF‐Tags:


Microarray:

ACCTAATTG

CATTAGGTT

CCCAATGTG

ATACTGGTC

?TGAA

?TGAA

?TGAA

G T

A

ACTTTGGATT

AACA

C

?TGAA

ACTT

CGTAATCCAA

ACTT

TTATGA

CCAG

GGGTTACACACTT

G



AnF‐Tags:


Microarray:

ACCTAATTG

CATTAGGTT

CCCAATGTG

ATACTGGTC

?TGAA

?TGAA

?TGAA

G T

A

ACTTTGGATT

AACA

C

?TGAA

ACTT

CGTAATCCAA

ACTT

TTATGA

CCAG

GGGTTACACACTT

GObserve


• The same tags and anti-tags are mass produced and used for SNPs as needed, analogous to general-purpose computer hardware.



• Our project focus: choosing tags so that they always “find” the correct anti-tag




[Figure: animation of a tagged strand (ACTT joined to TATGACCAG) approaching the anti-tags ACCTAATTG, CATTAGGTT, CCCAATGTG and ATACTGGTC on the microarray.]

Choosing tags/anti-tags

• Want to avoid mishybridizations and have as many tags as possible.

• But as we add tags, some will eventually be “too similar” and start hybridizing.

• One approach: choose tags to have high Hamming distance

• i.e. few matches when aligned

• Use techniques from error correcting codes

• Limited success...

• Other ad hoc approaches suggested
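
The Hamming-distance criterion can be sketched in a few lines of Python. This is only an illustration, not the method of any work cited in these slides; the function names `hamming_distance` and `min_pairwise_distance` are hypothetical, and the sample tags are the four 9-base tags from the microarray example.

```python
from itertools import combinations


def hamming_distance(u: str, v: str) -> int:
    """Number of aligned positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(u, v))


def min_pairwise_distance(tags: list[str]) -> int:
    """Smallest Hamming distance over all pairs of tags (needs >= 2 tags)."""
    return min(hamming_distance(u, v) for u, v in combinations(tags, 2))


# Sample tags taken from the microarray example (an illustrative choice).
tags = ["TGGATTAAC", "GTAATCCAA", "GGGTTACAC", "TATGACCAG"]
print(hamming_distance(tags[0], tags[1]))  # 6 of the 9 positions differ
print(min_pairwise_distance(tags))
```

A tag set designed this way would aim to keep `min_pairwise_distance` large, which is the "few matches when aligned" condition.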

Later approach to tag design

• Coding was developed for communications theory; it ignores the thermodynamic properties of DNA that determine hybridization



• Ben-Dor et al. and Brenner suggested that we assume:

Mishybridization only occurs when two tags contain long common substrings.



• Still very simple and unrealistic, but allows one to formalize the problem and get provably good results for tag sets.

• But how good is it in practice?

• Not addressed in current work!

DNA Thermodynamics (Review)

• Melting temperature $T_M(U,V)$: the temperature at which 50% of U, V are in duplex

• Higher implies stronger bond

• Calculating melting temperature:

1. 2-4 Rule: $T_M(U,V)$ proportional to 2(# A-T bonds) + 4(# G-C bonds)

2. Nearest neighbor: look up interactions between adjacent bases in an experimental table.

3. Wetmur's equation: applies to longer strings only.
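
As a rough illustration of the 2-4 rule in item 1, the sketch below scores a strand by counting its bases. It assumes the strand is fully paired with its complement, so that every A or T forms an A-T bond and every G or C forms a G-C bond. The name `two_four_score` is hypothetical, and the value is only the quantity the rule says the melting temperature is proportional to, not a calibrated temperature.

```python
def two_four_score(strand: str) -> int:
    """2 * (# A-T bonds) + 4 * (# G-C bonds) for a fully paired strand."""
    at_bonds = sum(strand.count(base) for base in "AT")
    gc_bonds = sum(strand.count(base) for base in "GC")
    if at_bonds + gc_bonds != len(strand):
        raise ValueError("strand may contain only the bases A, C, G, T")
    return 2 * at_bonds + 4 * gc_bonds


# TGGATTAAC has 6 A/T bases and 3 G/C bases.
print(two_four_score("TGGATTAAC"))  # 2*6 + 4*3 = 24
```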

A model for tag design

• Formalized by Ben-Dor et al.

• Define the weight of a string s as w(s) = (#A/T) + 2(#G/C)

The application fixes temperatures h, c. An (h,c)-code satisfies two conditions:

1. Each tag t must have w(t) ≥ h.

2. Any string s such that w(s) ≥ c appears in at most one tag.

• (1) ensures that each tag hybridizes strongly with its anti-tag.

• (2) is meant to ensure that tags do not bond with the wrong anti-tag, but it is more subtle.

The Ben-Dor et al. Model (cont)

2. Any string s of weight ≥ c appears in at most one tag.

Not allowed: ACGCTGTA and TCTGTAATGAC (they share the substring CTGTA, of weight 7, so this pair is disallowed whenever c ≤ 7)

• Reflects the original assumption that mishybridization occurs only if tags share a long substring.

• Also incorporates the 2-4 Rule: more G/C bases imply a stronger bond

• Allows them to prove an upper bound on the number of tags in an allowed system.
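
The two conditions of an (h,c)-code can be checked directly on a small tag set. The sketch below is a brute-force illustration, not an algorithm from Ben-Dor et al.; the names `weight`, `heavy_substrings` and `is_hc_code` are hypothetical. It reads condition 2 literally as "in at most one tag", so a heavy substring repeated inside a single tag is not rejected here.

```python
WEIGHT = {"A": 1, "T": 1, "G": 2, "C": 2}


def weight(s: str) -> int:
    """w(s) = (#A/T) + 2(#G/C)."""
    return sum(WEIGHT[ch] for ch in s)


def heavy_substrings(tag: str, c: int) -> set[str]:
    """All substrings of tag whose weight is at least c."""
    found = set()
    for start in range(len(tag)):
        total = 0
        for end in range(start, len(tag)):
            total += WEIGHT[tag[end]]
            if total >= c:
                found.add(tag[start:end + 1])
    return found


def is_hc_code(tags: list[str], h: int, c: int) -> bool:
    """True if every tag has weight >= h and no heavy substring is shared."""
    owner: dict[str, int] = {}
    for index, tag in enumerate(tags):
        if weight(tag) < h:  # condition 1
            return False
        for sub in heavy_substrings(tag, c):
            if owner.setdefault(sub, index) != index:  # condition 2
                return False
    return True


pair = ["ACGCTGTA", "TCTGTAATGAC"]
print(is_hc_code(pair, h=10, c=7))  # False: both tags contain CTGTA (weight 7)
print(is_hc_code(pair, h=10, c=8))  # True: no shared substring reaches weight 8
```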

Upper Bound of Ben-Dor et al.

Let $G_n$ be the number of strings of weight $n$ (proportional to $(1 + \sqrt{3})^n$ by a standard recurrence relation).

Theorem: For any c and h, an (h,c)-code may contain at most

$$\frac{2 \cdot G_{c-1} + 6 \cdot G_{c-2} + 8 \cdot G_{c-3}}{h - c + 1}$$

tags.

Remark: Still exponential in c, so it allows for quite large codes.
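
The bound is easy to evaluate numerically. The sketch below assumes the recurrence $G_n = 2 G_{n-1} + 2 G_{n-2}$ with $G_0 = 1$, which is one natural reading of the "standard recurrence": a string of weight n ends either in A/T (weight 1) or in G/C (weight 2). The names `count_strings` and `tag_upper_bound` are hypothetical, and the result is rounded down because the number of tags is an integer.

```python
def count_strings(n: int) -> int:
    """G_n: number of strings over {A, C, G, T} with weight exactly n."""
    if n < 0:
        return 0
    prev, curr = 0, 1  # G_{-1} = 0, G_0 = 1 (the empty string)
    for _ in range(n):
        prev, curr = curr, 2 * curr + 2 * prev
    return curr


def tag_upper_bound(h: int, c: int) -> int:
    """Floor of (2 G_{c-1} + 6 G_{c-2} + 8 G_{c-3}) / (h - c + 1)."""
    if h < c:
        raise ValueError("expected h >= c")
    total_tail_weight = (2 * count_strings(c - 1)
                         + 6 * count_strings(c - 2)
                         + 8 * count_strings(c - 3))
    return total_tail_weight // (h - c + 1)


print([count_strings(n) for n in range(6)])  # [1, 2, 6, 16, 44, 120]
print(tag_upper_bound(h=10, c=4))            # (32 + 36 + 16) // 7 = 12
```

The ratio of consecutive values, for example 120 / 44 ≈ 2.73, is already close to $1 + \sqrt{3} \approx 2.732$.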

Upper Bound: Proof

Definitions:

Let a c-token be a string of weight ≥ c that contains no proper suffix of weight ≥ c.

Condition 2 then reads: any c-token s appears in at most one tag.

The tail weight of a c-token is the weight of its last character.

The tail weight of a tag is the sum of the tail weights of all of the c-tokens it contains.

Strategy:

1. Show that each tag has tail weight ≥ h − c + 1.

2. Show that an (h,c)-code can have total tail weight at most $2 \cdot G_{c-1} + 6 \cdot G_{c-2} + 8 \cdot G_{c-3}$.

Upper Bound Proof, Part 1

Claim 1: Each tag has tail weight ≥ h − c + 1

Example: c = 4

Tag: G A C C A A T

c-tokens      Tail Wt
G A C         2
C C           2
C C A         1
C A A         1
C A A T       1

Observation: every character gets counted, except for at most (c − 1) weight at the beginning of the tag.
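
The example can be reproduced mechanically. The sketch below assumes a c-token is a string of weight at least c whose proper suffixes all have weight below c, so exactly one c-token can end at each position: the shortest substring ending there that reaches weight c. The names `c_tokens` and `tail_weight` are hypothetical.

```python
WEIGHT = {"A": 1, "T": 1, "G": 2, "C": 2}


def weight(s: str) -> int:
    """w(s) = (#A/T) + 2(#G/C)."""
    return sum(WEIGHT[ch] for ch in s)


def c_tokens(tag: str, c: int) -> list[str]:
    """The c-tokens contained in tag, ordered by the position where they end."""
    tokens = []
    for end in range(1, len(tag) + 1):
        total = 0
        # Grow leftwards from the end position until the weight reaches c.
        for start in range(end - 1, -1, -1):
            total += WEIGHT[tag[start]]
            if total >= c:
                tokens.append(tag[start:end])
                break
    return tokens


def tail_weight(tag: str, c: int) -> int:
    """Sum of the weights of the last characters of the c-tokens in tag."""
    return sum(WEIGHT[token[-1]] for token in c_tokens(tag, c))


tag = "GACCAAT"
print(c_tokens(tag, 4))     # ['GAC', 'CC', 'CCA', 'CAA', 'CAAT']
print(tail_weight(tag, 4))  # 2 + 2 + 1 + 1 + 1 = 7
print(weight(tag) - 4 + 1)  # 10 - 4 + 1 = 7, so Claim 1 is tight here
```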


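Claim 2: Total tail weight ≤ $2 \cdot G_{c-1} + 6 \cdot G_{c-2} + 8 \cdot G_{c-3}$

Note: There are at most $G_c$ c-tokens by definition, so $2 \cdot G_c$ is trivial.

For this bound, divide them into classes (S is a G or C, W is an A or T, and <k> is any string of weight k):

Class       Occurrences          Total Tail Wt.
<c-2>S      $2 \cdot G_{c-2}$    $4 \cdot G_{c-2}$
S<c-3>S     $4 \cdot G_{c-3}$    $8 \cdot G_{c-3}$
<c-1>W      $2 \cdot G_{c-1}$    $2 \cdot G_{c-1}$
S<c-2>W     $4 \cdot G_{c-2}$    $2 \cdot G_{c-2}$

Note on the last row: the number of occurrences is actually $2 \cdot G_{c-2}$, presumably because the prefix S<c-2> already has weight c, so condition 2 restricts how often it can occur. This gives the total tail weight $2 \cdot G_{c-2}$ shown, and the last column sums to $2 \cdot G_{c-1} + 6 \cdot G_{c-2} + 8 \cdot G_{c-3}$.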
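
The raw Occurrences column can be checked by brute force for small c. The sketch below enumerates every possible c-token, ignoring the code constraint, so it reproduces $4 \cdot G_{c-2}$ for the last class rather than the refined $2 \cdot G_{c-2}$. It assumes a c-token has weight at least c while dropping its first character leaves weight below c; the names `is_c_token` and `class_counts` are hypothetical.

```python
from collections import Counter
from itertools import product

WEIGHT = {"A": 1, "T": 1, "G": 2, "C": 2}


def weight(s: str) -> int:
    return sum(WEIGHT[ch] for ch in s)


def count_strings(n: int) -> int:
    """G_n via G_n = 2 G_{n-1} + 2 G_{n-2}, with G_0 = 1."""
    if n < 0:
        return 0
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, 2 * curr + 2 * prev
    return curr


def is_c_token(s: str, c: int) -> bool:
    """Weight >= c, and the longest proper suffix has weight < c."""
    return weight(s) >= c and weight(s[1:]) < c


def class_counts(c: int) -> Counter:
    """Count all possible c-tokens, grouped by the four classes."""
    counts = Counter()
    # A c-token has weight c or c + 1, so its length is at most c + 1.
    for length in range(1, c + 2):
        for letters in product("ACGT", repeat=length):
            s = "".join(letters)
            if not is_c_token(s, c):
                continue
            ends_strong = s[-1] in "GC"
            if weight(s) == c:
                name = "<c-2>S" if ends_strong else "<c-1>W"
            else:
                name = "S<c-3>S" if ends_strong else "S<c-2>W"
            counts[name] += 1
    return counts


c = 4
counts = class_counts(c)
print(counts["<c-2>S"], 2 * count_strings(c - 2))   # 12 12
print(counts["S<c-3>S"], 4 * count_strings(c - 3))  # 8 8
print(counts["<c-1>W"], 2 * count_strings(c - 1))   # 32 32
print(counts["S<c-2>W"], 4 * count_strings(c - 2))  # 24 24
```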

Part 2
