Post on 21-Dec-2015
Algorithms for Biochip Design and Optimization
Ion Mandoiu
Computer Science & Engineering Department
University of Connecticut
Overview
• Physical design of DNA arrays
• DNA tag set design
• Digital microfluidic biochip testing
• Conclusions
Driver Biochip Applications
• Driver applications – Gene expression (transcription analysis)
– SNP genotyping
– CNP analysis
– Genomic-based microorganism identification
– Point-of-care diagnosis
– healthcare, forensics, environmental monitoring,…
• As focus shifts from basic research to clinical applications, there are increasingly stringent design requirements on sensitivity, specificity, cost
– Assay design and optimization become critical
Single Nucleotide Polymorphisms
• Human genome: ≈ 3 × 10^9 base pairs
• Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
• Total #SNPs: ≈ 1 × 10^7
• Difference between any two individuals: ≈ 3 × 10^6 SNPs (≈ 0.1% of entire genome)
… ataggtccCtatttcgcgcCgtatacacgggTctata …
… ataggtccGtatttcgcgcAgtatacacgggActata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …
Watson-Crick Complementarity
• Four nucleotide types: A,C,T,G
• A’s paired with T’s (2 hydrogen bonds)
• C’s paired with G’s (3 hydrogen bonds)
Hybridization
SNP genotyping via direct hybridization
• SNP1 with alleles T/G
• SNP2 with alleles A/G
[Figure: array with 2 probes/SNP hybridized with the labeled sample; optical scanning is used to identify the alleles present in the sample]
In-Place Probe Synthesis
Probes to be synthesized: CG, AC, CG, AC, ACG, AG, G, AG, C
[Figure, three animation steps: the probes are grown in place on the array by successive deposition steps. The A step extends 5 sites, the C step extends 6 sites, and the G step extends 6 sites.]
Simplified DNA Array Flow
Design:
  Probe Selection → Physical Design: Probe Placement & Embedding
Manufacturing:
  Mask Manufacturing → Array Manufacturing
End User:
  Hybridization Experiment → Analysis of Hybridization Intensities → Gene expression levels, SNP genotypes,…
Unwanted Illumination Effect
• Unintended illumination during manufacturing ⇒ synthesis of erroneous probes
• Effect gets worse with technology scaling
Border Length Minimization Objective
• Effects of unintended illumination ∝ border length
[Figure: masks for the A, C, and G deposition steps of the probe array (CG, AC, CG, AC, ACG, AG, G, AG, C), with the mask border marked]
Synchronous Synthesis
• Periodic deposition sequence, e.g., (ACTG)^k
– Each probe grown by one nucleotide in each period
• # border conflicts between adjacent probes = 2 × Hamming distance
[Figure: deposition sequence divided into periods, with two adjacent probes each grown by one nucleotide per period]
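The relation between border conflicts and Hamming distance can be checked with a short Python sketch. It assumes the deposition order ACTG within each period and counts a conflict for every deposition step in which exactly one of two adjacent probes is exposed; the function names are illustrative, not taken from the talk.

```python
PERIOD = "ACTG"  # assumed deposition order within one period, as in (ACTG)^k

def hamming(p, q):
    """Number of positions at which two equal-length probes differ."""
    assert len(p) == len(q)
    return sum(a != b for a, b in zip(p, q))

def exposure_steps(probe):
    """Deposition steps in which `probe` receives a nucleotide under
    synchronous synthesis (the i-th nucleotide is added in the i-th period)."""
    return {len(PERIOD) * i + PERIOD.index(ch) for i, ch in enumerate(probe)}

def border_conflicts(p, q):
    """Steps in which exactly one of two adjacent probes is exposed."""
    return len(exposure_steps(p) ^ exposure_steps(q))

p, q = "ACG", "ATG"
print(hamming(p, q), border_conflicts(p, q))  # prints "1 2"
```

Each mismatched position contributes two conflicting steps within its period, one per probe, which is where the factor of 2 comes from.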
2D Placement Problem
[Figure: Hamming graph on the probes; edge cost = 2 × Hamming distance]
• Find minimum cost mapping of the Hamming graph onto the grid graph
• Special case of the Quadratic Assignment Problem
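As a small illustration of the objective, the following Python sketch evaluates the cost of a given placement by summing 2 × Hamming distance over all horizontally and vertically adjacent cells. The grid representation (a list of rows of equal-length probe strings) is an assumption made for the example.

```python
def hamming(p, q):
    """Number of positions at which two equal-length probes differ."""
    return sum(a != b for a, b in zip(p, q))

def placement_cost(grid):
    """Total border cost of a 2D placement: 2 x Hamming distance summed
    over all pairs of horizontally or vertically adjacent probes."""
    rows, cols = len(grid), len(grid[0])
    cost = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                cost += 2 * hamming(grid[r][c], grid[r][c + 1])
            if r + 1 < rows:
                cost += 2 * hamming(grid[r][c], grid[r + 1][c])
    return cost

# Two placements of the same four probes on a 2x2 grid.
grid_a = [["AC", "AG"], ["TC", "TG"]]
grid_b = [["AC", "TG"], ["AG", "TC"]]
print(placement_cost(grid_a), placement_cost(grid_b))  # prints "8 12"
```

The same probes give different costs depending on which probes end up adjacent, which is what the placement algorithms try to exploit.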
2D Placement: Sliding-Window Matching
• Select mutually nonadjacent probes from small window
• Re-assign optimally
• Slide window over entire chip
• Repeat for a fixed # of iterations (O(N) time for fixed window size), or until improvement drops below a certain threshold
[Figure: probes in a small window before and after optimal re-assignment]
• Proposed by [Doll et al. ‘94] in VLSI context
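A minimal Python sketch of the sliding-window idea follows. It makes two simplifying assumptions that are not from the talk: the mutually nonadjacent cells in a window are chosen as the cells of one checkerboard colour, and the optimal re-assignment is found by brute force over permutations (feasible only for small windows) instead of a matching algorithm.

```python
from itertools import permutations

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def neighbors(grid, r, c):
    """Probes in the cells horizontally or vertically adjacent to (r, c)."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= r + dr < len(grid) and 0 <= c + dc < len(grid[0]):
            yield grid[r + dr][c + dc]

def placement_cost(grid):
    """Every adjacent pair is seen from both sides, so the plain sum of
    Hamming distances already equals the sum of 2 x Hamming per pair."""
    return sum(hamming(grid[r][c], n)
               for r in range(len(grid)) for c in range(len(grid[0]))
               for n in neighbors(grid, r, c))

def improve_window(grid, r0, c0, size):
    """Optimally re-assign probes among mutually nonadjacent window cells."""
    cells = [(r, c)
             for r in range(r0, min(r0 + size, len(grid)))
             for c in range(c0, min(c0 + size, len(grid[0])))
             if (r + c) % 2 == 0]          # one checkerboard colour
    probes = [grid[r][c] for r, c in cells]
    # Neighbours of the selected cells stay fixed, so costs are separable.
    def cost(perm):
        return sum(2 * hamming(p, n)
                   for (r, c), p in zip(cells, perm)
                   for n in neighbors(grid, r, c))
    best = min(permutations(probes), key=cost)
    for (r, c), p in zip(cells, best):
        grid[r][c] = p

def sliding_window_matching(grid, size=3, iterations=2):
    for _ in range(iterations):
        for r0 in range(len(grid) - 1):
            for c0 in range(len(grid[0]) - 1):
                improve_window(grid, r0, c0, size)
    return grid
```

Because the identity permutation is always among the candidates, a window step can never increase the total border cost.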
2D Placement: Epitaxial Growth
• Proposed by [PreasL’88, ShahookarM’91] in VLSI context
• Simulates crystal growth
• Efficient “row” implementation
– Use lexicographical sorting for initial ordering of probes
– Fill cells row-by-row
– Bound number of candidate probes considered when filling each cell
– Constant # of lookahead rows ⇒ O(N^{3/2}) runtime, N = #probes
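The row-by-row filling can be sketched in Python as below. This is a simplified variant written for illustration: the candidate set for each cell is the next `lookahead` probes in sorted order (a hypothetical parameter standing in for the lookahead rows), and each cell takes the candidate with the smallest Hamming distance to its already placed left and upper neighbours.

```python
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def row_epitaxial(probes, cols, lookahead=8):
    """Simplified row-by-row greedy placement (illustrative sketch)."""
    remaining = sorted(probes)        # lexicographic initial ordering
    grid, row = [], []
    while remaining:
        left = row[-1] if row else None
        up = grid[-1][len(row)] if grid else None
        placed = [n for n in (left, up) if n is not None]
        # Only a bounded number of candidates is examined per cell.
        candidates = remaining[:lookahead]
        best = min(range(len(candidates)),
                   key=lambda i: sum(hamming(candidates[i], n) for n in placed))
        row.append(remaining.pop(best))
        if len(row) == cols:          # row complete, start the next one
            grid.append(row)
            row = []
    if row:
        grid.append(row)
    return grid

probes = [a + b + c for a in "ACGT" for b in "ACGT" for c in "AC"]
for placed_row in row_epitaxial(probes, cols=8):
    print(" ".join(placed_row))
```

Bounding the candidate set keeps the work per cell constant in this sketch; the runtime quoted on the slide refers to the authors' implementation, not to this code.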
2D Placement: Recursive Partitioning
• Very effective in VLSI placement [AlpertK’95, Caldwell et al.’00]
• 4-way partition using linear time clustering
• Repeat until Row-Epitaxial can be applied
Asynchronous Synthesis
[Figure: asynchronous synthesis example with probes ACTG, AGT, and GTG]
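[Figure with panels “Synchronous Embedding” and “ASAP Embedding”: probes aligned against the deposition sequence]
Optimal Single-Probe Re-Embedding
[Figure: grid graph from Source to Sink, with the probe (GACT) along one axis and the deposition sequence (ACGTACGT) along the other]
• Efficient solution by dynamic programming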
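The two embedding notions can be illustrated with a Python sketch. The conflict model here is an assumption made for the example: one conflict is counted for every deposition step and every neighbour where exactly one of the probe and the neighbour is exposed. An embedding is represented as the list of deposition steps used by the probe.

```python
def asap_embedding(probe, deposition):
    """Place each nucleotide at the earliest possible deposition step.
    Assumes the deposition sequence is long enough to hold the probe."""
    steps, j = [], 0
    for ch in probe:
        while deposition[j] != ch:
            j += 1
        steps.append(j)
        j += 1
    return steps

def optimal_reembedding(probe, deposition, neighbor_embeddings):
    """Minimum-conflict embedding of `probe`, neighbours' embeddings fixed."""
    d, T, m = len(neighbor_embeddings), len(deposition), len(probe)
    # u[j] = number of neighbours exposed in step j
    u = [sum(j in emb for emb in neighbor_embeddings) for j in range(T)]
    INF = float("inf")
    # cost[i][j]: fewest conflicts in steps < j with i nucleotides embedded
    cost = [[INF] * (T + 1) for _ in range(m + 1)]
    used = [[False] * (T + 1) for _ in range(m + 1)]
    cost[0][0] = 0
    for j in range(T):
        for i in range(m + 1):
            if cost[i][j] == INF:
                continue
            # Skip step j: conflicts with every neighbour exposed in step j.
            if cost[i][j] + u[j] < cost[i][j + 1]:
                cost[i][j + 1], used[i][j + 1] = cost[i][j] + u[j], False
            # Use step j: conflicts with every neighbour masked in step j.
            if i < m and deposition[j] == probe[i]:
                c = cost[i][j] + d - u[j]
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], used[i + 1][j + 1] = c, True
    steps, i = [], m
    for j in range(T, 0, -1):         # trace the optimal choices back
        if used[i][j]:
            steps.append(j - 1)
            i -= 1
    return cost[m][T], steps[::-1]

deposition = "ACGTACGT"
print(asap_embedding("AC", deposition))                         # [0, 1]
print(optimal_reembedding("AC", deposition, [[4, 5], [4, 5]]))  # (0, [4, 5])
```

In the usage example both neighbours are embedded in the second period, so shifting the probe from its ASAP embedding to steps 4 and 5 removes all conflicts.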
In-Place Re-Embedding Algorithms
• 2D placement fixed, allow only probe embeddings to change
– Greedy: optimally re-embed probe with largest gain
– Chessboard: alternate re-embedding of black/white cells
– Sequential: re-embed probes row-by-row

Chip size | Greedy         | Chessboard     | Sequential
          | %LB     CPU    | %LB     CPU    | %LB     CPU
100       | 125.7   40     | 120.5   54     | 119.9   64
500       | 127.1   943    | 121.4   1423   | 120.9   1535
Integration with Probe Selection
Probe Selection → Probe Pools → Physical Design: Placement & Embedding

Pool Row-Epitaxial, chip size 100x100
Pool size | % Improv. | CPU sec.
1         | -         | 217
2         | 4.3       | 1040
4         | 8.2       | 1796
8         | 11.8      | 3645
16        | 15.2      | 7515
Overview
• Physical design of DNA arrays
• DNA tag set design
• Digital microfluidic biochip testing
• Conclusions
Universal Tag Arrays
• Brenner 97, Morris et al. 98
– Array consisting of application-independent tags
– Two-part “reporter” probes: application-specific primers ligated to antitags
– Detection carried out by a sequence of reactions separately involving the primer and the antitag part of reporter probes
Universal Tag Array Advantages
• Cost effective
– Same tag array used for different analyses ⇒ can be mass-produced
– Only need to synthesize new set of reporter probes
• More reliable!
– Solution phase hybridization better understood than hybridization on solid support
SNP Genotyping with Tag Arrays
1. Mix reporter probes with unlabeled genomic DNA
2. Solution phase hybridization
3. Single-Base Extension (SBE)
4. Solid phase hybridization
[Figure: tags on the array and reporter probes (primer + antitag) at each of the four steps]
Tag Set Design Problem
(H1) Tags hybridize strongly to complementary antitags
(H2) No tag hybridizes to a non-complementary antitag
[Figure: duplexes formed by tags t1, t2 and their antitags, illustrating (H1) and (H2)]
Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H2)
Hybridization Models
• Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state
• 2-4 rule
Tm = 2 · #(As and Ts) + 4 · #(Cs and Gs)
• More accurate models exist, e.g., the nearest-neighbor model
Hybridization Models (contd.)
• Hamming distance model, e.g., [Marathe et al. 01]
– Models rigid DNA strands
• LCS/edit distance model, e.g., [Torney et al. 03]
– Models infinitely elastic DNA strands
• c-token model [Ben-Dor et al. 00]:
– Duplex formation requires formation of nucleation complex between perfectly complementary substrings
– Nucleation complex must have weight ≥ c, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (2-4 rule)
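The 2-4 rule and the nucleotide weights used by the c-token model are simple enough to state in a few lines of Python. The function names are illustrative; the weights are those given on the slide.

```python
# Weights from the c-token model: wt(A) = wt(T) = 1, wt(C) = wt(G) = 2.
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def weight(seq):
    """Total weight of a DNA string."""
    return sum(WEIGHT[ch] for ch in seq)

def tm_2_4(seq):
    """Melting temperature estimate from the 2-4 rule:
    2 per A or T plus 4 per C or G, i.e. twice the weight."""
    return 2 * weight(seq)

print(weight("CCAGATT"), tm_2_4("CCAGATT"))  # prints "10 20"
```

Because the 2-4 estimate is exactly twice the weight, a bound on weight is equivalent to a bound on the estimated melting temperature.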
c-h Code Problem
• c-token: left-minimal DNA string of weight ≥ c, i.e.,
– w(x) ≥ c
– w(x’) < c for every proper suffix x’ of x
• A set of tags is a c-h code if
(C1) Every tag has weight ≥ h
(C2) Every c-token is used at most once
c-h Code Problem [Ben-Dor et al.00]
Given c and h, find maximum cardinality c-h code
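The definitions can be made concrete with a Python sketch that lists the c-tokens of a tag and checks conditions (C1) and (C2). It assumes that the thresholds are "weight at least c" and "weight at least h"; the function names are illustrative.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def c_tokens(tag, c):
    """c-tokens of `tag`, one per end position that has one: the shortest
    substring ending there whose weight is at least c."""
    tokens = []
    for end in range(1, len(tag) + 1):
        w, start = 0, end
        while start > 0 and w < c:      # grow leftwards until heavy enough
            start -= 1
            w += WEIGHT[tag[start]]
        if w >= c:
            tokens.append(tag[start:end])
    return tokens

def is_c_h_code(tags, c, h):
    """(C1) every tag has weight >= h; (C2) no c-token is used twice."""
    heavy = all(sum(WEIGHT[ch] for ch in t) >= h for t in tags)
    tokens = [tok for t in tags for tok in c_tokens(t, c)]
    return heavy and len(tokens) == len(set(tokens))

print(c_tokens("CCAGATT", 4))
# ['CC', 'CCA', 'CAG', 'AGA', 'GAT', 'GATT']
```

The output for CCAGATT with c = 4 agrees with the token sequence shown on the "Token Content of a Tag" slide.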
Algorithms for c-h Code Problem
• [Ben-Dor et al.00] approximation algorithm based on DeBruijn sequences
• Alphabetic tree search algorithm
- Enumerate candidate tags in lexicographic order, save tags whose c-tokens are not used by previously selected tags
- Easily modified to handle various combinations of constraints
• [MT 05, 06] Optimum c-h codes can be computed in practical time for small values of c by using integer programming
• Practical runtime using Garg-Konemann approximation and LP-rounding
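The alphabetic tree search can be sketched as a depth-first enumeration in Python. This is one reading of the description above, under the assumptions that tags are extended until their weight first reaches h and that a c-token may appear at most once overall; it is not the authors' implementation.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def last_token(s, c):
    """c-token ending at the last position of s, or None if s is too light."""
    w, start = 0, len(s)
    while start > 0 and w < c:
        start -= 1
        w += WEIGHT[s[start]]
    return s[start:] if w >= c else None

def alphabetic_tree_search(c, h):
    used, tags = set(), []      # c-tokens already taken, selected tags

    def extend(prefix, w, toks):
        if w >= h:              # heavy enough: accept and reserve its tokens
            tags.append(prefix)
            used.update(toks)
            return
        for ch in "ACGT":       # lexicographic order
            if toks & used:     # an accepted tag now owns a token of prefix
                return
            tok = last_token(prefix + ch, c)
            if tok is not None and (tok in used or tok in toks):
                continue        # extension would reuse a c-token
            extend(prefix + ch, w + WEIGHT[ch],
                   toks | {tok} if tok else toks)

    extend("", 0, frozenset())
    return tags

tags = alphabetic_tree_search(c=4, h=8)
print(len(tags), tags[:3])
```

Additional constraints can be handled in this scheme by adding further pruning tests next to the token check.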
Token Content of a Tag
c = 4, tag CCAGATT ⇒ sequence of c-tokens
End pos:  2    3    4    5    6    7
c-token:  CC   CCA  CAG  AGA  GAT  GATT
Layered c-token graph for length-l tags
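[Figure: layered graph with source s and sink t; each layer contains the c-tokens c1, …, cN, with layers indexed by end position c/2, (c/2)+1, …, l-1, l]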
Integer Program Formulation [MPT05]
• Maximum integer flow problem w/ set capacity constraints
• O(hN) constraints & variables, where N = #c-tokens
Packing LP Formulation
• 0/1 variable x_p for every s-t path p
Garg-Konemann Algorithm
1. x ← 0; y ← δ // y_i are variables of the dual LP
2. Find min weight s-t path p, where weight(v) = y_i for every v ∈ V_i
3. While weight(p) < 1 do
M ← max_i |p ∩ V_i|
x_p ← x_p + 1/M
For every i, y_i ← y_i (1 + ε · |p ∩ V_i| / M)
Find min weight s-t path p, where weight(v) = y_i for v ∈ V_i
4. For every p, x_p ← x_p / (1 − log_{1+ε} δ)
[GK98] The algorithm computes a factor (1 − ε)^2 approximation to the optimal LP solution with (N/ε) · log_{1+ε} N shortest path computations
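The steps above can be sketched in Python for a small layered graph. The graph encoding (`layers`, `succ`, `cls`) and the choice δ = (1+ε)((1+ε)n)^(−1/ε), with n the number of sets V_i, are assumptions made for this example; source and sink are implicit (any path from the first to the last layer).

```python
import math

def garg_konemann(layers, succ, cls, eps=0.1):
    """layers: list of vertex lists; succ[v]: successors of v in the next
    layer; cls[v]: index i of the set V_i that contains vertex v."""
    classes = set(cls.values())
    delta = (1 + eps) * ((1 + eps) * len(classes)) ** (-1 / eps)
    y = {i: delta for i in classes}     # dual variables, one per V_i
    x = {}                              # primal variables, one per path

    def min_weight_path():
        dist = {v: y[cls[v]] for v in layers[0]}
        prev = {}
        for layer in layers[:-1]:       # shortest paths, layer by layer
            for v in layer:
                if v not in dist:
                    continue
                for w in succ.get(v, ()):
                    d = dist[v] + y[cls[w]]
                    if d < dist.get(w, math.inf):
                        dist[w], prev[w] = d, v
        end = min((v for v in layers[-1] if v in dist), key=dist.get)
        path = [end]
        while path[-1] in prev:
            path.append(prev[path[-1]])
        return dist[end], tuple(reversed(path))

    w, p = min_weight_path()
    while w < 1:
        count = {}                      # |p ∩ V_i| for the sets hit by p
        for v in p:
            count[cls[v]] = count.get(cls[v], 0) + 1
        M = max(count.values())
        x[p] = x.get(p, 0) + 1 / M
        for i, k in count.items():
            y[i] *= 1 + eps * k / M
        w, p = min_weight_path()
    scale = 1 - math.log(delta, 1 + eps)
    return {p: v / scale for p, v in x.items()}

# Toy instance: two layers over tokens a, b, c; each token usable once.
layers = [[(0, "a"), (0, "b")], [(1, "a"), (1, "b"), (1, "c")]]
succ = {(0, "a"): [(1, "b"), (1, "c")], (0, "b"): [(1, "a"), (1, "c")]}
cls = {v: v[1] for layer in layers for v in layer}
x = garg_konemann(layers, succ, cls)
print(round(sum(x.values()), 3))
```

The final scaling in step 4 is what makes the fractional solution feasible: after it, the total load on every set V_i is at most 1.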
LP Based Tag Set Design
1. Run Garg-Konemann and store the minimum weight paths in a list
2. Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags
3. Mark used c-tokens and run the alphabetic tree search algorithm to select additional tags
Periodic Tags [MT05]
• Key observation: c-token uniqueness constraint in c-h code formulation is too strong
– A c-token should not appear in two different tags, but can be repeated in a tag
• A tag t is called periodic if it is the prefix of (α)^∞ for some “period” α
– Periodic strings make best use of c-tokens
c-token factor graph, c=4 (incomplete)
[Figure: part of the c-token factor graph for c=4, showing the c-tokens CC, AAG, AAC, AAAA, AAAT]
Cycle Packing Algorithm
1. Construct c-token factor graph G
2. T ← {}
3. For all cycles C defining periodic tags, in increasing order of cycle length,
• Add to T the tag defined by C
• Remove C from G
4. Perform an alphabetic tree search and add to T tags consisting of unused c-tokens
5. Return T
Experimental Results
[Figure: plot of experimental results; only the axis label h is recoverable]
More Hybridization Constraints…
• Enforced during tag assignment by
- Leaving some tags unassigned and distributing primers across multiple arrays [Ben-Dor et al. 03]
- Exploiting availability of multiple primer candidates [MPT05]
[Figure: hybridization example involving tags t1 and t2]
Herpes B Gene Expression Assay
Tm | # pools | Pool size | 500 tags           | 1000 tags          | 2000 tags
   |         |           | # arrays   % Util. | # arrays   % Util. | # arrays   % Util.
60 | 1446    | 1         | 4          94.06   | 2          97.20   | 1          72.30
   |         | 5         | 4          96.13   | 2          100.00  | 1          72.30
67 | 1560    | 1         | 4          96.53   | 2          98.70   | 1          78.00
   |         | 5         | 4          98.00   | 2          99.90   | 1          78.00
70 | 1522    | 1         | 4          96.73   | 2          98.90   | 1          76.10
   |         | 5         | 4          97.80   | 2          99.80   | 1          76.10

Tm | # pools | Pool size | 500 tags           | 1000 tags          | 2000 tags
   |         |           | # arrays   % Util. | # arrays   % Util. | # arrays   % Util.
60 | 1446    | 1         | 4          82.26   | 3          65.35   | 2          57.05
   |         | 5         | 4          88.26   | 3          70.95   | 2          63.55
67 | 1560    | 1         | 4          86.33   | 3          69.70   | 2          61.15
   |         | 5         | 4          91.86   | 3          76.00   | 2          67.20
70 | 1522    | 1         | 4          88.46   | 3          73.65   | 2          65.40
   |         | 5         | 4          92.26   | 2          91.10   | 2          70.30
GenFlex Tags
Periodic Tags
Overview
• Physical design of DNA arrays
• DNA tag set design
• Digital microfluidic biochip testing
• Conclusions
Digital Microfluidic Biochips
[Figures from [Srinivasan et al. 04] and [Su&Chakrabarty 06]: digital microfluidic biochip with an array of cells and I/O ports]
• Electrodes typically arranged in rectangular grid
• Droplets moved by applying voltage to adjacent cell
• Can be used for analyses of DNA, proteins, metabolites…
Design Challenges
• Testing
– High electrode failure rate, but can re-configure around
– Performed both after manufacturing and concurrent with chip operation
– Main objective is minimization of completion time
• Module placement
– Assay operations (mixing, amplification, etc.) can be mapped to overlapping areas of the chip if performed at different times
• Droplet routing
– When multiple droplets are routed simultaneously, must prevent accidental droplet merging or interference
[Figure: examples of droplet merging and droplet interference]
Concurrent Testing Problem
• GIVEN:
– Input/Output cells
– Position of obstacles (cells in use by ongoing reactions)
• FIND:
– Trajectories for test droplets such that
• Every non-blocked cell is visited by at least one test droplet
• Droplet trajectories meet non-merging and non-interference constraints
• Completion time is minimized
Defect model: test droplet gets stuck at defective electrode
[Su et al. 04] ILP-based solution for single test droplet case & heuristic for multiple input-output pairs with single test droplet/pair
ILP Formulation for Unconstrained Number of Droplets
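• Each cell (i,j) visited at least once:
$\sum_{t} \sum_{(k,l) \in N(i,j)} x^{t}_{(k,l)(i,j)} \ge 1$
• Droplet conservation:
$\sum_{(k,l) \in N(i,j)} x^{t}_{(k,l)(i,j)} - \sum_{(k,l) \in N(i,j)} x^{t+1}_{(i,j)(k,l)} = 0$
• No droplet merging:
$\sum_{(k,l) \in N(i,j)} x^{t}_{(k,l)(i,j)} + \sum_{(k,l) \in N(i,j)} \sum_{(k',l') \in N(k,l)} x^{t}_{(k',l')(k,l)} \le 1$
• No droplet interference:
$\sum_{(k,l) \in N(i,j)} x^{t}_{(i,j)(k,l)} + \sum_{(i',j') \in N(i,j)} \sum_{(k',l') \in N(i',j')} x^{t}_{(k',l')(i',j')} \le 1$
• Minimize completion time:
Minimize $z$ subject to $z - t \cdot x^{t}_{(k,l)(i,j)} \ge 0$ for every $(i,j) \in O$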
Special Case
• NxN Chip
• I/O cells in Opposite Corners
• No Obstacles
Single droplet solution needs N^2 cycles
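Lower Bound
Claim: Completion time is at least 4N – 4 cycles
Proof: In each cycle, each of the k droplets places 1 dollar in its current cell
• 3k(k-1)/2 dollars paid waiting to depart
• 3k(k-1)/2 dollars paid waiting for last droplet
• k dollars in each diagonal
• 1 dollar in each cell
The last two items together require at least N^2 + k(k-1) dollars, since the diagonals shorter than k need k dollars each. Dividing the total by the k dollars paid per cycle gives
$T_{opt} \ge \frac{N^2 + k(k-1) + 2 \cdot \frac{3k(k-1)}{2}}{k} = \frac{N^2}{k} + 4(k-1) \ge 4N - 4$
where the last step uses $\frac{N^2}{k} + 4k \ge 4N$.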
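The claim can be checked numerically with a few lines of Python. The sketch assumes the dollar accounting adds up to N^2 + 4k(k-1) dollars for k droplets (N^2 + k(k-1) for cells and diagonals, plus 3k(k-1) for the two waiting terms), paid at a rate of k dollars per cycle.

```python
def lower_bound(N, k):
    """Cycles needed to pay N^2 + 4k(k-1) dollars at k dollars per cycle."""
    return (N * N + 4 * k * (k - 1)) / k

def best_lower_bound(N):
    """Weakest bound over all droplet counts k; this is what holds
    regardless of how many droplets an algorithm uses."""
    return min(lower_bound(N, k) for k in range(1, N * N + 1))

N = 120
print(best_lower_bound(N), 4 * N - 4)  # prints "476.0 476"
```

For N = 120 the minimum is attained at k = N/2 = 60 droplets and equals 4N − 4 exactly.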
Stripe Algorithm with N/3 Droplets
Stripe algorithm has approximation factor of ≈ 5/4 (completion time of about 5N cycles against the lower bound of 4N – 4)
Stripe Algorithm with Obstacles of width Q
• Divide array into vertical stripes of width Q+1
• Use one droplet per stripe
• All droplets visit cells in assigned stripes in parallel
• In case of interference droplet on left stripe waits for droplet in right stripe
Results for 120x120 Chip, 2x2 Obstacles
~20x decrease in completion time by using multiple droplets
Average completion time (cycles)
Obstacle area | k=1   | k=12 | k=20   | k=30  | k=40  | k=40 vs. k=1 speed-up
0%            | 14400 | 1412 | 944    | 710   | 593   | 24x
1%            | 14256 | 1420 | 953.4  | 715.2 | 598.8 | 24x
5%            | 13680 | 1473 | 982.8  | 725   | 596.2 | 23x
10%           | 12960 | 1490 | 1010.8 | 734.8 | 592.6 | 22x
15%           | 12240 | 1501 | 1025.8 | 730.8 | 588.2 | 21x
20%           | 11520 | 1501 | 1046.8 | 738.4 | 580.8 | 20x
25%           | 10800 | 1501 | 1071   | 736.6 | 570   | 19x
Overview
• Physical design of DNA arrays
• DNA tag set design
• Digital microfluidic biochip testing
• Conclusions
Conclusions
• Biochip design is a fertile area of applications
– Combinatorial optimization techniques can yield significant improvements in assay quality/throughput
• Very dynamic area, driver applications and underlying technologies change rapidly
Acknowledgments
• Physical design of DNA arrays: A.B. Kahng, P. Pevzner, S. Reda, X. Xu, A. Zelikovsky
• Tag set design: D. Trinca
• Testing of digital microfluidic biochips: R. Garfinkel, B. Pasaniuc, A. Zelikovsky
• Financial support: UCONN Research Foundation, NSF awards 0546457 and 0543365
Questions?