Engineering a Scalable Placement Heuristic for DNA Probe Arrays

A.B. Kahng, I.I. Mandoiu, P. Pevzner, S. Reda (all UCSD), A. Zelikovsky (GSU)

Outline

• DNA probe arrays and unwanted illumination
• Synchronous array design (2-D placement)
• Asynchronous array design (3-D placement)
• Experimental results
• Extensions
• Conclusions

DNA Probe Arrays

• Used in a wide range of genomic analyses
– Gene expression monitoring, SNP mapping, sequencing by hybridization, …

• Arrays with up to 1000x1000 probes in commercial use, 10^8 probes envisioned for next-generation arrays
– Highly scalable algorithms required for array design

Simplified DNA Array Flow

[Flow diagram. Stages are grouped into a Soft/Computational Domain and a Hard/Biochemistry Domain.]

Input: gene sequences, position of SNPs, etc.

1. Probe Selection
2. Probe Placement (this talk)
3. Probe Alignment (Mask Design) (this talk)
4. Mask Manufacturing
5. Array Manufacturing
6. Hybridization Experiment
7. Analysis of Hybridization Intensities

Array Manufacturing Process

Very Large-Scale Immobilized Polymer Synthesis:

1. Treat substrate with chemically protected “linker” molecules, creating rectangular array

– Site size = approx. 10x10 microns

2. Selectively expose array sites to light

– Light deprotects exposed molecules, activating further synthesis

3. Flush chip surface with solution of protected A,C,G,T

– Binding occurs at previously deprotected sites

4. Repeat steps 2&3 until desired probes are synthesized
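The four steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name `synthesize`, the boolean mask layout, and the 2x2 example are hypothetical and not taken from the slides.

```python
def synthesize(masks, deposition):
    """Simulate light-directed synthesis.

    masks[t][r][c] is True when site (r, c) is exposed to light at step t
    (hypothetical representation); deposition[t] is the nucleotide flushed
    over the chip at step t.
    """
    rows, cols = len(masks[0]), len(masks[0][0])
    chip = [["" for _ in range(cols)] for _ in range(rows)]
    for mask, nucleotide in zip(masks, deposition):
        for r in range(rows):
            for c in range(cols):
                if mask[r][c]:                # step 2: light deprotects the site
                    chip[r][c] += nucleotide  # step 3: nucleotide binds there
    return chip


deposition = "ACG"
masks = [
    [[False, True], [True, True]],    # M1: sites that receive A
    [[True, True], [False, False]],   # M2: sites that receive C
    [[True, False], [True, False]],   # M3: sites that receive G
]
chip = synthesize(masks, deposition)
print(chip)  # [['CG', 'AC'], ['AG', 'A']]
```

Each probe is therefore determined by the subset of deposition steps at which its site is exposed.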

Photo-Deprotection Step

Our concern: diffraction → unwanted illumination → yield decrease

Probe Synthesis

[Figure: nine placed probes (CG, AC, CG, AC, ACG, AG, G, AG, C) synthesized with the nucleotide deposition sequence ACG. Mask M1 deposits A, mask M2 deposits C, mask M3 deposits G.]

Measuring Unwanted Illumination

[Figure: the same placed probes and masks M1 (A), M2 (C), M3 (G) for the nucleotide deposition sequence ACG, with the border of each mask highlighted.]

Unwanted illumination is measured by the border length of the masks.
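As a rough sketch of this measure, the border length of a mask can be counted as the number of adjacent site pairs where one site is exposed and the other is masked. The function names and the 2x2 masks are hypothetical, and counting only borders between sites (not the outer boundary of the array) is an assumption.

```python
def mask_border_length(mask):
    """Count adjacent site pairs with different exposure in one mask."""
    rows, cols = len(mask), len(mask[0])
    border = 0
    for r in range(rows):
        for c in range(cols):
            # Compare with the right neighbor.
            if c + 1 < cols and mask[r][c] != mask[r][c + 1]:
                border += 1
            # Compare with the neighbor below.
            if r + 1 < rows and mask[r][c] != mask[r + 1][c]:
                border += 1
    return border


def total_border_length(masks):
    """Sum the border lengths over all masks of the synthesis."""
    return sum(mask_border_length(m) for m in masks)


masks = [
    [[False, True], [True, True]],    # mask for A
    [[True, True], [False, False]],   # mask for C
    [[True, False], [True, False]],   # mask for G
]
print(total_border_length(masks))  # 6
```

Summed over all masks, this equals the sum of Hamming distances between the embeddings of adjacent probes, which is why placement can be optimized on probe distances alone.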

Synchronous vs. Asynchronous Synthesis

[Figure, four panels:]

(a) Periodic deposition sequence (each period of four nucleotides forms a 4-group)

(b) Synchronous embedding of CTG

(c) Asynchronous leftmost embedding of CTG

(d) Another asynchronous embedding
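The difference between the two kinds of embedding can be sketched in a few lines. The period "ACGT", the function names, and the representation of an embedding as a list of deposition-step indices are assumptions made for illustration.

```python
def synchronous_embedding(probe, period="ACGT"):
    """The i-th nucleotide of the probe is deposited within the i-th period."""
    return [i * len(period) + period.index(base) for i, base in enumerate(probe)]


def leftmost_embedding(probe, deposition):
    """Each nucleotide is deposited at the earliest step still available."""
    steps, t = [], 0
    for base in probe:
        # Earliest step >= t that deposits `base`; raises ValueError if none.
        t = deposition.index(base, t)
        steps.append(t)
        t += 1
    return steps


period = "ACGT"
deposition = period * 3
sync = synchronous_embedding("CTG", period)
left = leftmost_embedding("CTG", deposition)
print(sync)  # [1, 7, 10]
print(left)  # [1, 3, 6]
```

Any increasing list of steps whose deposited nucleotides spell the probe is a valid asynchronous embedding; the leftmost one is just one choice.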


Problem Formulation (Synchronous Case)

Synchronous Array Design (2-D Placement) Problem:

• Minimize placement cost of Hamming graph H (vertices = probes, distance = Hamming)

• On 2-dimensional grid graph G2 (N x N array, edges between distance-1 neighbors)

[Figure: each probe (vertex of H) is assigned to a site of G2.]
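A minimal sketch of the objective, assuming equal-length probes stored in a list of lists: the placement cost is the sum of Hamming distances over all pairs of horizontally or vertically adjacent sites. The names `hamming` and `placement_cost` are hypothetical.

```python
def hamming(p, q):
    """Number of positions where two equal-length probes differ."""
    return sum(a != b for a, b in zip(p, q))


def placement_cost(grid):
    """Sum of Hamming distances over all adjacent site pairs of the grid."""
    rows, cols = len(grid), len(grid[0])
    cost = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                cost += hamming(grid[r][c], grid[r][c + 1])
            if r + 1 < rows:
                cost += hamming(grid[r][c], grid[r + 1][c])
    return cost


grid = [["ACG", "ACT"],
        ["TCG", "TTT"]]
print(placement_cost(grid))  # 6
```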

2-D Placement Lower Bound

• Sum of Hamming distances to 4 closest neighbors minus weight of 4N heaviest arcs


TSP+1-Threading Placement

Hubbell, 1990s

• Find TSP tour/path over given probes w.r.t. Hamming distance
• Thread TSP path in the grid row by row

Hannenhalli, Hubbell, Lipshutz, Pevzner '02

• Place the probes according to 1-Threading
• Further decreases total border by 20%

Lexicographical Sorting +1-Threading

1. Radix-sort the probes in lexicographical order

2. Thread on the chip

[Figure: example probes sorted lexicographically and threaded onto the chip.]
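A simplified sketch of the sort-then-thread idea follows. Two substitutions are made and should be read as assumptions: Python's built-in `sorted` stands in for the radix sort, and the threading is a plain serpentine (row by row, alternating direction), not the 1-threading variant named on the slide. The function name is hypothetical.

```python
def sort_and_thread(probes, n):
    """Sort probes lexicographically and thread them onto an n x n grid.

    Rows are filled alternately left-to-right and right-to-left so that
    consecutive probes in the sorted order stay adjacent on the chip.
    """
    assert len(probes) == n * n
    ordered = sorted(probes)  # stand-in for radix sort
    grid = []
    for r in range(n):
        row = ordered[r * n:(r + 1) * n]
        if r % 2 == 1:
            row.reverse()  # odd rows run right-to-left
        grid.append(row)
    return grid


grid = sort_and_thread(["TG", "AC", "GA", "AA", "CT", "AT", "CC", "GG", "TT"], 3)
for row in grid:
    print(row)
# ['AA', 'AC', 'AT']
# ['GA', 'CT', 'CC']
# ['GG', 'TG', 'TT']
```

Sorting places probes with long common prefixes next to each other, which keeps Hamming distances along the thread small at very low runtime.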

Matching Based Probe Placement

• Select an independent (mutually nonadjacent) set of placed probes

• Re-assign these probes to the selected sites using an optimal (minimum-cost) perfect matching

[Figure: five numbered probes before and after re-assignment.]

• Total cost can only decrease or remain the same

• Runtime: roughly proportional to the square of the independent set size
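The re-assignment step can be sketched as follows. Because the selected sites are mutually nonadjacent, the cost of putting a probe on a site depends only on that site's fixed neighbors, so the problem is an assignment problem. For brevity this sketch enumerates all permutations, a brute-force stand-in for a min-cost perfect matching solver that is only practical for tiny sets; all function names are hypothetical.

```python
from itertools import permutations


def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def neighbors(grid, r, c):
    """Probes on the sites adjacent to (r, c)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < len(grid) and 0 <= cc < len(grid[0]):
            yield grid[rr][cc]


def site_cost(grid, site, probe):
    """Conflicts between `probe` and the neighbors of `site`."""
    return sum(hamming(probe, q) for q in neighbors(grid, *site))


def rematch(grid, sites):
    """Optimally reassign the probes on mutually nonadjacent `sites` (in place)."""
    probes = [grid[r][c] for r, c in sites]
    best = min(
        permutations(probes),
        key=lambda perm: sum(site_cost(grid, s, p) for s, p in zip(sites, perm)),
    )
    for (r, c), p in zip(sites, best):
        grid[r][c] = p
    return grid


grid = [["CC", "AA", "AC", "CC", "AA"]]
result = rematch(grid, [(0, 0), (0, 2), (0, 4)])
print(result)  # [['AA', 'AA', 'AC', 'CC', 'CC']]
```

The current assignment is always one of the candidates, which is why the total cost cannot increase.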

Sliding Window Matching

There is a trade-off between solution quality and size/overlap of windows

Iterate SlidingWindowMatching over the chip until improvement drops below 0.1%

Effect of Window Size on Solution Quality

Increased window size/overlap decreases number of conflicts, but increases runtime

Epitaxial Placement Algorithm

• Simulates crystal-growth

• Start with arbitrary probe placed at center

• Maintain a best probe candidate (i.e., a probe with the minimum number of conflicts with the already placed neighbors) for each border site

• Iteratively fill the border site with minimum increase in border length

– Give priority to sites with more neighbors filled
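The growth process above might be sketched as below. This is a deliberately naive version: it recomputes the best (site, probe) pair in every iteration instead of maintaining candidates per border site, and it uses the number of filled neighbors only as a tie-breaker, which is an assumption about how the priority is applied. The function name is hypothetical.

```python
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def epitaxial_placement(probes, n):
    """Grow a placement outward from the center of an n x n grid."""
    assert len(probes) == n * n
    grid = [[None] * n for _ in range(n)]
    unplaced = list(probes)
    center = n // 2
    grid[center][center] = unplaced.pop(0)  # arbitrary seed probe at the center

    def placed_neighbors(r, c):
        out = []
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n and grid[rr][cc] is not None:
                out.append(grid[rr][cc])
        return out

    while unplaced:
        best = None  # (added border, -filled neighbors, row, col, probe index)
        for r in range(n):
            for c in range(n):
                if grid[r][c] is not None:
                    continue
                nbrs = placed_neighbors(r, c)
                if not nbrs:
                    continue  # not on the border of the placed region
                for i, p in enumerate(unplaced):
                    cost = sum(hamming(p, q) for q in nbrs)
                    key = (cost, -len(nbrs), r, c, i)
                    if best is None or key < best:
                        best = key
        _, _, r, c, i = best
        grid[r][c] = unplaced.pop(i)
    return grid


probes = ["AC", "AA", "CC", "AT", "GG", "GC", "TT", "TA", "CA"]
grid = epitaxial_placement(probes, 3)
```

The exhaustive scan makes the cost of this sketch grow quickly with chip size, which is consistent with the motivation for the tile- and row-based variants.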

Tile- and Row- Epitaxial

• Tile-epitaxial
– Divide array into 100x100 tiles
– Run Epitaxial within each tile
– Take into account border of already placed tiles

• Row-epitaxial
– Place probes by a fast method, e.g., sort + 1-thread
– Re-place probes row by row, sequentially filling sites within a row
– Assign to each site a probe with the minimum number of conflicts among the unplaced probes from the following K rows
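A small sketch of the row-epitaxial pass, under the assumption that "unplaced probes from the following K rows" means the probes on not-yet-finalized sites in the current row and the next K rows, and that the chosen probe is swapped into the current site. Conflicts are counted against the already finalized left and upper neighbors. The function name is hypothetical.

```python
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def row_epitaxial(initial, k):
    """Re-place probes row by row, looking ahead at most k rows."""
    grid = [row[:] for row in initial]  # work on a copy
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            # Neighbors that are already finalized: left and above.
            fixed = []
            if c > 0:
                fixed.append(grid[r][c - 1])
            if r > 0:
                fixed.append(grid[r - 1][c])
            best_cost, best_site = None, None
            for rr in range(r, min(r + k, rows - 1) + 1):
                start = c if rr == r else 0
                for cc in range(start, cols):
                    cost = sum(hamming(grid[rr][cc], q) for q in fixed)
                    if best_cost is None or cost < best_cost:
                        best_cost, best_site = cost, (rr, cc)
            rr, cc = best_site
            # Swap the best candidate into the current site.
            grid[r][c], grid[rr][cc] = grid[rr][cc], grid[r][c]
    return grid


initial = [["AA", "CC"],
           ["AC", "AA"]]
result = row_epitaxial(initial, 1)
print(result)  # [['AA', 'AA'], ['AC', 'CC']]
```

Limiting the lookahead to K rows bounds the work per site, trading solution quality for runtime.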

2-D Placement Algorithm Comparison:

[Chart: Border Conflict vs. chip size (100, 200, 300, 500) for LB, Row-EPTX, EPTX, Tile-EPTX, TSP+1Thr, and SWM 6x6.]

[Chart: Runtime (logarithmic axis) vs. chip size (100, 200, 300, 500, 1000) for TSP+1Thr, Row-EPTX, EPTX, Tile-EPTX, and SWM.]


Problem Formulation (Asynchronous Case)

• Asynchronous synthesis:
– Periodic nucleotide deposition sequence, e.g., (ACTG)^p
– Every probe grows asynchronously
– Border length = Hamming distance between embedded probes

• Asynchronous Array Design (3-D Placement) Problem:
– Minimize placement cost of embedded-probe Hamming graph H (vertices = probes, distance = Hamming between embedded probes)
– On 2-dimensional grid graph G2 (N x N array, edges between neighbors)

[Figure: each probe (vertex of H) is assigned to a site of G2.]

Lower Bound

• Sum of distances to 4 closest neighbors minus weight of 4N heaviest arcs
– Distance between two probes of length p = 2(p - |LCS|), where LCS is their longest common subsequence

• Non-tight bound: example with LB = 8 and best placement cost = 10

[Figure: example with probes AC, CT, TG, and GA, and their optimum placement under the nucleotide deposition sequence S = ACTGA with five masks, one per deposition step.]
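A minimal sketch of the pairwise distance, assuming each adjacent pair is counted once: every nucleotide outside a longest common subsequence must be deposited while the other probe is masked, so two probes of length p conflict in at least 2(p - |LCS|) steps. The function names are hypothetical. For the four probes of the example, this gives 2 for each of the four adjacent pairs of the best 2x2 arrangement, in agreement with LB = 8.

```python
def lcs_length(p, q):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(q) + 1)
    for a in p:
        cur = [0]
        for j, b in enumerate(q, 1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def min_conflicts(p, q):
    """Fewest conflicts between any embeddings of p and q."""
    return len(p) + len(q) - 2 * lcs_length(p, q)


cycle = ["AC", "CT", "TG", "GA"]  # probes in order around the 2x2 array
bound = sum(min_conflicts(cycle[i], cycle[(i + 1) % 4]) for i in range(4))
print(bound)  # 8
```

The bound is not tight because the four pairwise-optimal embeddings cannot all be realized at once in a single deposition sequence.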

Optimal Probe Alignment

[Figure: dynamic-programming grid for a probe (rows) against the deposition sequence A C G T A C G T (columns), with Source and Sink nodes.]

• Find best alignment of a probe with respect to its embedded neighbors

• Dynamic Programming:
– Source-sink paths correspond to feasible embeddings
– Runtime O((probe length) x (deposition sequence length))

• Can be extended to simultaneous alignment of two adjacent probes (2x1), with runtime increased by a factor of O(probe length)
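The dynamic program can be sketched as follows, assuming each neighbor's embedding is given as a set of deposition-step indices. At each step the probe is either masked (it conflicts with every neighbor synthesized at that step) or exposed (it conflicts with every neighbor not synthesized at that step). The function name and data layout are hypothetical.

```python
def optimal_embedding(probe, deposition, neighbor_embeddings):
    """Return (min conflicts, steps) for embedding `probe` in `deposition`."""
    T, p, k = len(deposition), len(probe), len(neighbor_embeddings)
    INF = float("inf")
    # busy[t] = number of neighbors that receive a nucleotide at step t
    busy = [sum(t in emb for emb in neighbor_embeddings) for t in range(T)]
    # cost[t][i] = min conflicts over steps t..T-1 when i nucleotides are placed
    cost = [[INF] * (p + 1) for _ in range(T + 1)]
    cost[T][p] = 0
    for t in range(T - 1, -1, -1):
        for i in range(p + 1):
            skip = cost[t + 1][i] + busy[t]  # probe masked at step t
            use = INF
            if i < p and probe[i] == deposition[t]:
                use = cost[t + 1][i + 1] + (k - busy[t])  # probe exposed at step t
            cost[t][i] = min(skip, use)
    if cost[0][0] == INF:
        raise ValueError("probe cannot be embedded in the deposition sequence")
    # Trace one optimal source-sink path, preferring earlier steps on ties.
    steps, i = [], 0
    for t in range(T):
        if (i < p and probe[i] == deposition[t]
                and cost[t][i] == cost[t + 1][i + 1] + (k - busy[t])):
            steps.append(t)
            i += 1
    return cost[0][0], steps


print(optimal_embedding("CT", "ACGTACGT", [{1, 7}]))  # (0, [1, 7])
print(optimal_embedding("CT", "ACGTACGT", [{1, 3}]))  # (0, [1, 3])
```

The table has (deposition length + 1) x (probe length + 1) entries with constant work each, matching the stated runtime once the per-step neighbor counts are precomputed.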

3-D Placement Flows

• Simultaneous placement and alignment
– Asynchronous epitaxial (slow and low quality)

• Synchronous placement followed by in-place probe alignment (analogous to the standard flow partition in VLSI)
– Uses the probe alignment DP for in-place probe alignment

• Synchronous placement followed by probe alignment with reshuffling (analogous to feedback loops in VLSI flows)
– Asynchronous sliding window matching

Algorithms for In-Place Probe Alignment

• Asynchronous re-embedding after 2-dim placement

– Greedy Algorithm
  • While there exist probes to re-embed with gain
    – Optimally re-embed the probe with the largest gain

– Batched greedy: speed-up by avoiding recalculations

– Chessboard Algorithm
  • While there is gain
    – Re-embed probes in green sites
    – Re-embed probes in red sites

Comparison of In-Place Probe Alignments

Chip size   LB      TSP+1Thr   Greedy          Chessboard      2x1 Chessboard
            %LB     %LB        %LB     CPU     %LB     CPU     %LB     CPU
100         100     152.0      125.7   40      120.5   54      119.4   480
200         100     150.2      126.3   154     120.9   221     119.7   1915
300         100     149.1      126.7   357     121.5   522     121.6   4349
500         100     147.9      127.1   943     121.4   1423    120.2   15990

• Post-placement LB = sum of distances to adjacent probes
– Distance between two probes of length p = 2(p - |LCS|)
– Useful for assessing quality of algorithms that change probe embeddings but do not change probe placement


3-D vs. 2-D Placement Results

Chip size   TSP+1Thr    TSP+1Thr+Chessboard   Epitaxial+Chessboard   SyncSWM+Chessboard   AsyncSWM
            Cost        Cost       CPU        Cost      CPU          Cost       CPU       Cost       CPU
100         554849      439829     113        419069    274          433274     1         417890     875
200         2140903     1723352    1901       1624988   4441         1693658    46        1636658    3676
300         4667882     3801765    12028      ---       ---          3746722    112       3615282    8406
500         12702474    10426237   109648     ---       ---          10049442   302       9686918    22351
1000        ---         ---        ---        ---       ---          38898792   1307      38005039   54501


[Chart: Border Conflict vs. chip size (100, 200, 300, 500, 1000) for TSP+1Thr, TSP+1Thr+Chess, RowEPTX+Chess, EPTX+Chess, TileEPTX+Chess, SyncSWM+Chess, and AsyncSWM+Chess.]

[Chart: Runtime (logarithmic axis) vs. chip size (100, 200, 300, 500, 1000) for TSP+1Thr+Chess, RowEPTX+Chess, EPTX+Chess, TileEPTX+Chess, SyncSWM+Chess, and AsyncSWM+Chess.]


Practical Extensions

• Distance-dependent border conflict weights

Take into account conflicts between 2- and 3-hop neighbors rather than only immediate neighbors

• Position-dependent border conflict weights

In the alignment DP for two sequences, take into account the importance of conflicts in the middle of probes: the alignment cost carries weights on conflicts that depend on the conflict position

• Polymorphic probes

The chip contains SNPs, e.g., pairs of probes that differ in a single position; they should be placed together, and the alignment DP should align them simultaneously


Alignment DP for 2-SNPs

Optimal Embedding of A{C,T}T

Summary

• Contributions:
– Epitaxial placement reduces border length by an extra 10% over the previously best known method
– Asynchronous placement problem formulation
– Post-placement improvement by an extra 15.5-21.8%
– Lower bounds
– Scalable placements (1000x1000 in 20 min)

• Ongoing work:
– Comparison on industrial benchmarks
– Experiments with algorithms for extended formulations (SNPs, distance-dependent weights, etc.)

Thank you!
