algorithms for biochip design and optimization ion mandoiu computer science & engineering...

Algorithms for Biochip Design and Optimization

Ion MandoiuComputer Science & Engineering Department

University of Connecticut

Overview

Physical design of DNA arrays DNA tag set design Digital microfluidic biochip testing Conclusions

Driver Biochip Applications

• Driver applications – Gene expression (transcription analysis)

– SNP genotyping

– CNP analysis

– Genomic-based microorganism identification

– Point-of-care diagnosis

– healthcare, forensics, environmental monitoring,…

• As focus shifts from basic research to clinical applications, there are increasingly stringent design requirements on sensitivity, specificity, cost– Assay design and optimization become critical

• Human Genome 3 109 base pairs

• Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)

• Total #SNPs 1 107

• Difference b/w any two individuals 3 106 SNPs ( 0.1% of entire genome)

Single Nucleotide Polymorphisms

… ataggtccCtatttcgcgcCgtatacacgggTctata …… ataggtccGtatttcgcgcAgtatacacgggActata …… ataggtccCtatttcgcgcCgtatacacgggTctata …

Watson-Crick Complementarity

• Four nucleotide types: A,C,T,G

• A’s paired with T’s (2 hydrogen bonds)

• C’s paired with G’s (3 hydrogen bonds)

Hybridization

SNP genotyping via direct hybridization

• SNP1 with alleles T/G• SNP2 with alleles A/G

A CTC

G

A

A CTCG A

Array with 2 probes/SNP

Optical scanning used to identify

alleles present in the sample

Labeled sample

In-Place Probe Synthesis

CG

AC

CG

AC

ACG

AG

G

AG

C

Probes to be synthesized

A

A

A

A

A


CG

AC

CG

AC

ACG

AG

G

AG

C


A

A

A

A

A

C

C

C

C

C

C


CG

AC

CG

AC

ACG

AG

G

AG

C


A

A

A

A

A

C

C

C

C

C

C

G G

G G

G G

Simplified DNA Array Flow

Probe Selection

Array Manufacturing

Hybridization Experiment

Gene expression levels, SNP genotypes,…

Analysis of Hybridization Intensities

Mask Manufacturing

Physical Design:

Probe Placement & Embedding

Design

Manufacturing

End User

Unwanted Illumination Effect

• Unintended illumination during manufacturing synthesis of erroneous probes

• Effect gets worse with technology scaling

Border Length Minimization Objective

A

A

A

A

A

C

C

C

C

C

C

G G

G G

G G

border

• Effects of unintended illumination border length

CG

AC

CG

AC

ACG

AG

G

AG

C

Synchronous Synthesis• Periodic deposition sequence, e.g., (ACTG)k

– Each probe grown by one nucleotide in each period

T

GC

A

T

G

T

G

C

A

…

C

A

period

# border conflicts b/w adjacent probes = 2 x Hamming distance

C

T

A

C

G

T

2D Placement Problem

probe

Edge cost = 2 x Hamming distance

• Find minimum cost mapping of the Hamming graph onto the grid graph

• Special case of the Quadratic Assignment Problem

2D Placement: Sliding-Window Matching

1

3

2

5

4

Select mutually nonadjacent probes from small window

2

2

3

1

4

Re-assign optimally

• Slide window over entire chip• Repeat fixed # of iterations ( O(N) time for fixed window size), or until improvement drops below certain threshold

• Proposed by [Doll et al. ‘94] in VLSI context

2D Placement: Epitaxial Growth

• Proposed by [PreasL’88, ShahookarM’91] in VLSI context

• Simulates crystal growth

• Efficient “row” implementation • Use lexicographical sorting for

initial ordering of probes

• Fill cells row-by-row

• Bound number of candidate probes considered when filling each cell

• Constant # of lookahead rows O(N3/2) runtime, N = #probes

2D Placement: Recursive Partitioning• Very effective in VLSI placement [AlpertK’95,Caldwell et al.’00]• 4-way partition using linear time clustering• Repeat until Row-Epitaxial can be applied

Asynchronous Synthesis

A

A

A

C

C

C

T

T

TG

G

G

ACTG

AGT

GTG

A A

Synchronous Embedding

A

G

T

A

G

G

T

A

Dep

ositi

on S

eque

nce

Probes

G

A

A

G

T

A

G

T

ASAP Embedding

G

A

C

T

A C G T A C G TSource

Sink

• Efficient solution by dynamic programming

Optimal Single-Probe Re-Embedding

In-Place Re-Embedding Algorithms

• 2D placement fixed, allow only probe embeddings to change– Greedy: optimally re-embed probe with largest gain– Chessboard: alternate re-embedding of black/white cells– Sequential: re-embed probes row-by-row

Chip

size

Greedy Chessboard Sequential

%LB CPU %LB CPU %LB CPU

100 125.7 40 120.5 54 119.9 64500 127.1 943 121.4 1423 120.9 1535

Integration with Probe Selection

Probe Selection

Physical Design:

Placement & Embedding

Probe Pools

Pool

Size

Pool Row-Epitaxial

% Improv CPU sec.

1 - 217

2 4.3 1040

4 8.2 1796

8 11.8 3645

16 15.2 7515

Chip size 100x100

Overview


Universal Tag Arrays

• Brenner 97, Morris et al. 98– Array consisting of application independent tags– Two-part “reporter” probes: aplication specific

primers ligated to antitags – Detection carried by a sequence of reactions

separately involving the primer and the antitag part of reporter probes

Universal Tag Array Advantages

• Cost effective– Same tag array used for different analyses

can be mass-produced– Only need to synthesize new set of reporter probes

• More reliable!– Solution phase hybridization better understood than

hybridization on solid support

SNP Genotyping with Tag ArraysTag

+

PrimerG

A

G

G

A

G

G

A

G

TG CA

T

C

Cantitag

TC C

1. Mix reporter probes with unlabeled genomic DNA

2. Solution phase hybridization

3. Single-Base Extension (SBE)4. Solid phase hybridization

Tag Set Design Problem

(H1) Tags hybridize strongly to complementary antitags

(H2) No tag hybridizes to a non-complementary antitag

t1t1 t2t2 t1 t2t1

Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H2)

Hybridization Models

• Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state

• 2-4 rule

Tm = 2 #(As and Ts) + 4 #(Cs and Gs)

• More accurate models exist, e.g., the near-neighbor model

• Hamming distance model, e.g., [Marathe et al. 01]– Models rigid DNA strands

• LCS/edit distance model, e.g., [Torney et al. 03] – Models infinitely elastic DNA strands

• c-token model [Ben-Dor et al. 00]:– Duplex formation requires formation of nucleation complex

between perfectly complementary substrings

– Nucleation complex must have weight c, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (2-4 rule)

Hybridization Models (contd.)

c-h Code Problem• c-token: left-minimal DNA string of weight c, i.e.,

– w(x) c

– w(x’) < c for every proper suffix x’ of x

• A set of tags is a c-h code if(C1) Every tag has weight h

(C2) Every c-token is used at most once

c-h Code Problem [Ben-Dor et al.00]

Given c and h, find maximum cardinality c-h code

Algorithms for c-h Code Problem

• [Ben-Dor et al.00] approximation algorithm based on DeBruijn

sequences

• Alphabetic tree search algorithm

- Enumerate candidate tags in lexicographic order, save tags whose c-tokens are not used by previously selected tags

- Easily modified to handle various combinations of constraints

• [MT 05, 06] Optimum c-h codes can be computed in practical time for small values of c by using integer programming

• Practical runtime using Garg-Koneman approximation and LP-rounding

Token Content of a Tag

c=4 CCAGATT CC CCA CAG AGA GAT GATT

Tag sequence of c-tokensEnd pos: 2 3 4 5 6 7 c-token: CCCCACAGAGAGATGATT

Layered c-token graph for length-l tags

st

c1

cN

ll-1c/2 (c/2)+1 …

Integer Program Formulation [MPT05]

• Maximum integer flow problem w/ set capacity constraints

•O(hN) constraints & variables, where N = #c-tokens

Packing LP Formulation path every for variable0/1 ps-txp

Garg-Konemann Algorithm1. x 0; y // yi are variables of the dual LP

2. Find min weight s-t path p, where weight(v) = yi for every vVi

3. While weight(p) < 1 do

M maxi |p Vi|

xp xp + 1/M

For every i, yi yi( 1 + * |p Vi|/M )

Find min weight s-t path p, where weight(v) = yi for vVi

4. For every p, xp xp / (1 - log1+)

[GK98] The algorithm computes a factor (1- )2 approximation to the optimal LP solution with (N/)* log1+N shortest path computations

LP Based Tag Set Design1. Run Garg-Konemann and store the minimum weight

paths in a list

2. Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags

3. Mark used c-tokens and run the alphabetic tree search algorithm to select additional tags

Periodic Tags [MT05]

• Key observation: c-token uniqueness constraint in c-h code formulation is too strong– A c-token should not appear in two different tags, but

can be repeated in a tag

• A tag t is called periodic if it is the prefix of () for some “period” – Periodic strings make best use of c-tokens

c-token factor graph, c=4 (incomplete)

CC

AAG AAC

AAAA

AAAT

Cycle Packing Algorithm

1. Construct c-token factor graph G

2. T{}

3. For all cycles C defining periodic tags, in increasing order of

cycle length,

• Add to T the tag defined by C

• Remove C from G

4. Perform an alphabetic tree search and add to T tags consisting

of unused c-tokens

5. Return T

Experimental Results

h

More Hybridization Constraints…

• Enforced during tag assignment by- Leaving some tags unassigned and distributing primers across multiple arrays [Ben-Dor et al. 03]

- Exploiting availability of multiple primer candidates [MPT05]

t1 t2t1

Herpes B Gene Expression Assay

Tm # poolsPool size

500 tags 1000 tags 2000 tags

# arrays % Util. # arrays % Util. # arrays % Util.

60 14461 4 94.06 2 97.20 1 72.30

5 4 96.13 2 100.00 1 72.30

67 15601 4 96.53 2 98.70 1 78.00

5 4 98.00 2 99.90 1 78.00

70 15221 4 96.73 2 98.90 1 76.10

5 4 97.80 2 99.80 1 76.10

Tm # poolsPool size

500 tags 1000 tags 2000 tags

# arrays % Util. # arrays % Util. # arrays % Util.

60 14461 4 82.26 3 65.35 2 57.05

5 4 88.26 3 70.95 2 63.55

67 15601 4 86.33 3 69.70 2 61.15

5 4 91.86 3 76.00 2 67.20

70 15221 4 88.46 3 73.65 2 65.40

5 4 92.26 2 91.10 2 70.30

GenFlex Tags

Periodic Tags

Overview


Digital Microfluidic Biochips

[Srinivasan et al. 04]

[Su&Chakrabarty 06]

I/O I/O

Cell

• Electrodes typically arranged in rectangular grid

• Droplets moved by applying voltage to adjacent cell

• Can be used for analyses of DNA, proteins, metabolites…

Design Challenges• Testing

– High electrode failure rate, but can re-configure around– Performed both after manufacturing and concurrent with chip

operation– Main objective is minimization of completion time

• Module placement– Assay operations (mixing, amplification, etc.) can be mapped to

overlapping areas of the chip if performed at different times

• Droplet routing– When multiple droplets are routed simultaneously must prevent

accidental droplet merging or interference

Merging Interference

Concurrent Testing Problem• GIVEN:

– Input/Output cells

– Position of obstacles (cells in use by ongoing reactions)

• FIND:– Trajectories for test droplets such that

• Every non-blocked cell is visited by at least one test droplet

• Droplet trajectories meet non-merging and non-interference constraints

• Completion time is minimized

Defect model: test droplet gets stuck at defective electrode

[Su et al. 04] ILP-based solution for single test droplet case & heuristic for multiple input-output pairs with single test droplet/pair

ILP Formulation for Unconstrained Number of Droplets

• Each cell (i,j) visited at least once:

• Droplet conservation:

• No droplet merging:

• No droplet interference:

• Minimize completion time:

1),(),(

),)(,( t jiNlk

tjilkx

0),(),(

1),)(,(

),(),(),)(,(

jiNlk

tlkji

jiNlk

tjilk xx

1),(),( ),()'',(

)'',)(,(),(),(

),)(,( jiNlk lkNlk

tlklk

jiNlk

tjilk xx

1 ),()'',( )'',()'',(

)'',)('',(),(),(

),)(,( jiNji jiNlk

tjilk

jiNlk

tlkji xx

Ojizxt

zt

jilk ),(every for ,0

Minimize

),)(,(

Special Case

• NxN Chip

• I/O cells in Opposite Corners

• No Obstacles

Single droplet solution needs N2 cycles

Lower BoundClaim: Completion time is at least 4N – 4 cycles

Proof: In each cycle, each of the k droplets place 1 dollar in current cell

3k(k-1)/2 dollars paid waiting to depart

3k(k-1)/2 dollars paid waiting for last droplet

k dollars in each diagonal

1 dollar in each cell

44 1213

T222

opt

kk

N

k

)k(kN k ) k(k-

Stripe Algorithm with N/3 Droplets

Stripe algorithm has approximation factor of 4

5

44

65

N

N

Stripe Algorithm with Obstacles of width Q

• Divide array into vertical stripes of width Q+1• Use one droplet per stripe• All droplets visit cells in assigned stripes in parallel

• In case of interference droplet on left stripe waits for droplet in right stripe

Results for 120x120 Chip, 2x2 Obstacles

~20x decrease in completion time by using multiple droplets

Obstacle Area

Average completion time (cycles)

k=40 vs. k=1 speed-upk=1 k=12 k=20 k=30 k=40

0% 14400 1412 944 710 593 24x1% 14256 1420 953.4 715.2 598.8 24x5% 13680 1473 982.8 725 596.2 23x

10% 12960 1490 1010.8 734.8 592.6 22x15% 12240 1501 1025.8 730.8 588.2 21x20% 11520 1501 1046.8 738.4 580.8 20x25% 10800 1501 1071 736.6 570 19x

Overview


Conclusions

• Biochip design is a fertile area of applications– Combinatorial optimization techniques can yield

significant improvements in assay quality/throughput

• Very dynamic area, driver applications and underlying technologies change rapidly

Acknowledgments

• Physical design of DNA arrays: A.B. Kahng, P. Pevzner, S. Reda, X. Xu, A. Zelikovsky

• Tag set design: D. Trinca

• Testing of digital microfluidic biochips: R. Garfinkel, B. Pasaniuc, A. Zelikovsky

• Financial support: UCONN Research Foundation, NSF awards 0546457 and 0543365

Questions?

algorithms for biochip design and optimization ion mandoiu computer science & engineering...

Documents

c c c c c c slide

c c c c c c g g g g

c g t slide

t g t g c

period t g c

hamming distance c t

sample slide

critical slide