Improved Models and Algorithms for Universal DNA Tag Systems
Tejas Iyer, Georgia Tech
David Cash, Georgia Tech
Outline of Part 1: Exposition
Motivation: The bio problem and applications
Formalization: The math problem
Analysis: Bounding the best possible solution
Part 2 (Tejas) is original contribution
Motivation (1): DNA computing
• Methods that exploit the massively parallel, self-assembling nature of DNA to solve hard computational problems
• Step 1 of DNA computing: encode the problem
• A trivial (ignored) step in most models of computation, e.g. Turing machines, circuit families, random access machines
• But the thermodynamics of DNA gets in the way. Hybridization? Secondary structures? More...
ACTGTTTCATTAAGCGCGTT
⋮
GGTAATTAAC
Motivation (2): SNP microarrays
• Single Nucleotide Polymorphism (SNP) Genotyping
• Detecting variation at a single locus (base) within a population
• Several important applications in medicine: helps explain how single bases affect our reaction to diseases and drugs
• Testing several SNPs is expensive or impossible if done individually
• One solution: SNP microarrays
• Main technical component is mass produced to reduce cost.
• Allows hundreds of thousands of SNPs to be tested simultaneously
Motivation (2): SNP microarrays
Tags: TGGATTAAC GTAATCCAA GGGTTACAC TATGACCAG
Anti-tags: ACCTAATTG CATTAGGTT CCCAATGTG ATACTGGTC
Microarray: ACCTAATTG | CATTAGGTT | CCCAATGTG | ATACTGGTC
SNPs to genotype: ?TGAA (one per tag)
[Figure, animated over several slides: tag strands carrying the primer ACTT and the bases A, C, G, T meet the ?TGAA SNP sites, then attach to the anti-tags on the microarray. Final step: Observe.]
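The tag and anti-tag strings on this slide are base-by-base complements of each other. A minimal sketch of that relationship follows; the function name `anti_tag` is hypothetical, and the code assumes the slide's convention of writing both strands left to right (a plain complement, not a reverse complement).

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def anti_tag(tag: str) -> str:
    """Return the base-by-base complement of a tag.

    Assumption: both strands are written in the same left-to-right
    order, as on the slide, so no reversal is applied.
    """
    return tag.translate(COMPLEMENT)


# The four tags shown on the slide.
tags = ["TGGATTAAC", "GTAATCCAA", "GGGTTACAC", "TATGACCAG"]
anti_tags = [anti_tag(t) for t in tags]
print(anti_tags)
```

Run on the slide's four tags, this reproduces the four anti-tags printed on the microarray.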
Motivation (2): SNP microarrays
• Same tags and anti-tags are mass produced and used for SNPs as needed, analogous to general-purpose computer hardware.
• Our project focus: choosing tags so that they always “find” the correct anti-tag
[Figure, animated: a strand carrying the primer ACTT and the tag TATGACCAG approaches the microarray anti-tags ACCTAATTG, CATTAGGTT, CCCAATGTG, ATACTGGTC.]
Choosing tags/anti-tags
• Want to avoid mishybridizations and have as many tags as possible.
• But as we add tags, some will eventually be “too similar” and start hybridizing.
• One approach: choose tags to have high Hamming distance
• i.e. few matches when aligned
• Use techniques from error correcting codes
• Limited success...
• Other ad hoc approaches suggested
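To make "few matches when aligned" concrete, here is a small sketch of the Hamming-distance criterion. It assumes all tags have the same length; the helper names are illustrative and not taken from the talk.

```python
from itertools import combinations


def hamming_distance(u: str, v: str) -> int:
    """Number of aligned positions at which u and v differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(u, v))


def min_pairwise_distance(tags: list[str]) -> int:
    """Smallest Hamming distance over all pairs of tags in the set."""
    return min(hamming_distance(u, v) for u, v in combinations(tags, 2))


# Two of the tags from the microarray example differ in 6 of 9 positions.
print(hamming_distance("TGGATTAAC", "GTAATCCAA"))
```

A tag set designed this way would aim to keep `min_pairwise_distance` above some chosen threshold. Note that the measure only compares positions at the same index.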
Later approach to tag design
• Coding theory was developed for communications; it ignores the thermodynamic properties of DNA that determine hybridization
• Ben-Dor et al. and Brenner suggested that we assume:
Mishybridization only occurs when two tags contain long common substrings.
• Still very simple and unrealistic, but allows one to formalize the problem and get provably good results for tag sets.
• But how good is it in practice?
• Not addressed in current work!
DNA Thermodynamics (Review)
• Melting temperature T_M(U,V): the temperature at which 50% of U,V are in duplex
• Higher implies stronger bond
• Calculating melting temperature:
1. 2-4 Rule: T_M(U,V) proportional to 2(# A-T bonds) + 4(# G-C bonds)
2. Nearest neighbor: look up interactions between adjacent bases in an experimental table.
3. Wetmur's equation: applies to longer strings only.
A model for tag design
• Formalized by Ben-Dor et al.
• Define the weight of a string s as w(s) = (#A/T) + 2(#G/C)
• The application fixes temperatures h, c. An (h,c)-code satisfies two conditions:
1. Each tag t must have w(t) ≥ h.
2. Any string s such that w(s) ≥ c appears in at most one tag.
• (1) ensures that each tag hybridizes strongly with its anti-tag.
• (2) is meant to ensure that tags do not bond with the wrong anti-tag, but it is more subtle.
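The two conditions can be checked directly by brute force. The sketch below is one possible reading, not the authors' implementation: it interprets condition 2 as "no substring of weight ≥ c is shared between two different tags in the set", and the function names are made up for illustration.

```python
WEIGHT = {"A": 1, "T": 1, "G": 2, "C": 2}


def weight(s: str) -> int:
    """w(s) = (#A/T) + 2(#G/C)."""
    return sum(WEIGHT[base] for base in s)


def heavy_substrings(tag: str, c: int) -> set[str]:
    """All distinct substrings of tag whose weight is at least c."""
    found = set()
    for i in range(len(tag)):
        w = 0
        for j in range(i, len(tag)):
            w += WEIGHT[tag[j]]
            if w >= c:
                found.add(tag[i:j + 1])
    return found


def is_hc_code(tags: list[str], h: int, c: int) -> bool:
    """Check conditions 1 and 2 for a list of tags.

    Assumption: condition 2 is violated only when two different tags
    share a substring of weight >= c.
    """
    seen: set[str] = set()
    for tag in tags:
        if weight(tag) < h:          # condition 1 fails
            return False
        heavy = heavy_substrings(tag, c)
        if heavy & seen:             # condition 2 fails
            return False
        seen |= heavy
    return True


pair = ["ACGCTGTA", "TCTGTAATGAC"]   # both contain CTGTA, weight 7
print(is_hc_code(pair, h=10, c=7))   # shared CTGTA reaches the threshold
print(is_hc_code(pair, h=10, c=8))   # no shared substring is heavy enough
```

The check is quadratic in tag length per tag, which is fine for short tags; a production tool would likely use a suffix structure instead.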
The Ben-Dor et al. Model (cont)
2. Any string s of weight ≥ c appears in at most one tag.
Not allowed: ACGCTGTA and TCTGTAATGAC (both contain CTGTA)
• Reflects the original assumption that mishybridization occurs only if two tags share a long substring.
• Also incorporates the 2-4 Rule: more G/C bases imply a stronger bond
• Allows them to prove an upper bound on the number of tags in an allowed system.
Upper Bound of Ben-Dor et al.
Let $G_n$ be the number of strings of weight $n$ (proportional to $(1+\sqrt{3})^n$ by a standard recurrence relation).
Theorem: For any c and h, an (h,c)-code may contain at most
$$\frac{2\,G_{c-1} + 6\,G_{c-2} + 8\,G_{c-3}}{h - c + 1}$$
tags.
Remark: Still exponential in c, so it allows for quite large codes.
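The bound is easy to evaluate numerically. The sketch below assumes the recurrence G_n = 2·G_{n-1} + 2·G_{n-2} with G_0 = 1 (a string of weight n starts with A/T followed by weight n-1, or with G/C followed by weight n-2); the slide only says "standard recurrence", so this form is an inference, though its growth rate 1 + √3 matches the slide. Function names are illustrative.

```python
def count_by_weight(n_max: int) -> list[int]:
    """G[n] = number of DNA strings of weight exactly n, for n = 0..n_max.

    Assumed recurrence: G[n] = 2*G[n-1] + 2*G[n-2], with G[0] = 1
    (the empty string) and G[n] = 0 for negative n.
    """
    G = [0] * (n_max + 1)
    G[0] = 1
    for n in range(1, n_max + 1):
        G[n] = 2 * G[n - 1] + (2 * G[n - 2] if n >= 2 else 0)
    return G


def tag_upper_bound(h: int, c: int) -> int:
    """Evaluate (2*G[c-1] + 6*G[c-2] + 8*G[c-3]) / (h - c + 1), rounded down.

    Rounding down is safe because the number of tags is an integer.
    """
    if c < 3 or h < c:
        raise ValueError("expects c >= 3 and h >= c")
    G = count_by_weight(c)
    numerator = 2 * G[c - 1] + 6 * G[c - 2] + 8 * G[c - 3]
    return numerator // (h - c + 1)


print(count_by_weight(4))
print(tag_upper_bound(h=10, c=4))
```

For larger n the ratio G[n+1] / G[n] approaches 1 + √3 ≈ 2.732, which is the exponential growth the remark refers to.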
Upper Bound: Proof
Definitions:
Let a c-token be a string of weight ≥ c that contains no proper suffix of weight ≥ c.
The tail weight of a c-token is the weight of its last character.
The tail weight of a tag is the sum of tail weights of all of the c-tokens it contains.
Condition 2 becomes: any c-token s (weight ≥ c) appears in at most one tag.
Strategy:
1. Show that each tag has tail weight ≥ h - c + 1
2. Show that an (h,c)-code can have total tail weight at most $2\,G_{c-1} + 6\,G_{c-2} + 8\,G_{c-3}$
Upper Bound Proof, Part 1
Claim 1: Each tag has tail weight ≥ h - c + 1
Example: c = 4
Tag:      G A C C A A T    Tail Wt
c-tokens: G A C            2
              C C          2
              C C A        1
                C A A      1
                C A A T    1
Observation: every character's weight gets counted, except for a prefix of weight at most c - 1.
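The example can be reproduced mechanically. This sketch assumes a c-token is a substring of weight ≥ c whose proper suffixes all have weight < c, so there is at most one token ending at each position of the tag: the shortest substring ending there that reaches weight c. The function names are hypothetical.

```python
WEIGHT = {"A": 1, "T": 1, "G": 2, "C": 2}


def c_tokens(tag: str, c: int) -> list[str]:
    """List the c-tokens of a tag, one per end position where one exists.

    For each end position, extend leftwards until the weight first
    reaches c; that shortest substring is the token ending there.
    """
    tokens = []
    for end in range(len(tag)):
        w = 0
        for start in range(end, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:
                tokens.append(tag[start:end + 1])
                break
    return tokens


def tail_weight(tag: str, c: int) -> int:
    """Sum of the weights of the last characters of the tag's c-tokens."""
    return sum(WEIGHT[token[-1]] for token in c_tokens(tag, c))


tag, c = "GACCAAT", 4
print(c_tokens(tag, c))
print(tail_weight(tag, c))
```

For this tag the weight is 10, so with h = 10 the claimed lower bound h - c + 1 = 7 is met with equality: only the leading "GA" (weight 3 = c - 1) goes uncounted.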
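Upper Bound Proof, Part 2
Claim 2: Total tail weight ≤ $2\,G_{c-1} + 6\,G_{c-2} + 8\,G_{c-3}$
Note: There are at most $G_c$ c-tokens by definition, so $2 \cdot G_c$ is trivial.
For this bound, divide them into classes:
Class      Occurrences        Total Tail Wt.
<c-2>S     $2 \cdot G_{c-2}$  $4 \cdot G_{c-2}$
S<c-3>S    $4 \cdot G_{c-3}$  $8 \cdot G_{c-3}$
<c-1>W     $2 \cdot G_{c-1}$  $2 \cdot G_{c-1}$
S<c-2>W    $4 \cdot G_{c-2}$  $2 \cdot G_{c-2}$
(Here S presumably denotes a strong base G/C of weight 2, W a weak base A/T of weight 1, and <k> any string of weight k.)
Last row: the total is actually $2 \cdot G_{c-2}$, not the $4 \cdot G_{c-2}$ that occurrences × tail weight would give.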
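The "Occurrences" column can be sanity-checked by enumeration for small c. The sketch below assumes S means G/C (weight 2), W means A/T (weight 1), and that a c-token has weight ≥ c with every proper suffix of weight < c; under that reading a token has weight c, or weight c + 1 with a leading S. None of these names come from the talk.

```python
WEIGHT = {"A": 1, "T": 1, "G": 2, "C": 2}


def strings_of_weight(n: int):
    """Yield every DNA string whose weight is exactly n."""
    if n == 0:
        yield ""
        return
    for base, w in WEIGHT.items():
        if w <= n:
            for rest in strings_of_weight(n - w):
                yield base + rest


def class_counts(c: int) -> dict[str, int]:
    """Count c-tokens in each of the four classes from the table."""
    if c < 3:
        raise ValueError("expects c >= 3")
    counts = {"<c-2>S": 0, "S<c-3>S": 0, "<c-1>W": 0, "S<c-2>W": 0}
    # Weight exactly c: every proper suffix is lighter than c.
    for s in strings_of_weight(c):
        counts["<c-2>S" if WEIGHT[s[-1]] == 2 else "<c-1>W"] += 1
    # Weight c + 1: a token only if dropping the first base loses weight 2.
    for s in strings_of_weight(c + 1):
        if WEIGHT[s[0]] != 2:
            continue
        counts["S<c-3>S" if WEIGHT[s[-1]] == 2 else "S<c-2>W"] += 1
    return counts


c = 4
G = [sum(1 for _ in strings_of_weight(n)) for n in range(c + 1)]
print(G)
print(class_counts(c))
```

For c = 4 the counts come out as 2·G_2, 4·G_1, 2·G_3 and 4·G_2, matching the table's Occurrences column; the enumeration is exponential in c and only meant for small values.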
Part 2