universal dna arrays ion mandoiu computer science & engineering department

56
Universal DNA Arrays Ion Mandoiu Computer Science & Engineering Department

Post on 20-Dec-2015

223 views

Category:

Documents


2 download

TRANSCRIPT

Universal DNA Arrays

Ion MandoiuComputer Science & Engineering Department

High-Throughput Genomic Assays

• Growing number of applications– Focus shifting from basic research to healthcare, forensics,

environmental monitoring,…

• Increasingly stringent requirements– Reproducibility, sensitivity, cost

• Assay design and optimization become critical– Source of challenging combinatorial problems

• Multiplex PCR primer set selection

• Probe selection

• Fidelity probes

• Mask design

• …

– This talk: design and optimization of universal tag arrays

Overview

Background on DNA Microarrays Universal Tag Arrays Tag Set Design Problem Tag Assignment Problem Conclusions

Watson-Crick Complementarity

• Four nucleotide types: A,C,T,G

• A’s paired with T’s (2 hydrogen bonds)

• C’s paired with G’s (3 hydrogen bonds)

DNA Microarrays

Images courtesy of Affymetrix.

Labeled DNA/RNA mixture flushed over array of probes

Laser activation of fluorescent labels

Optical scanning used to identify

probes with complements in the

mixture

Applications

• Gene expression (transcription analysis)

• Genomic-based microorganism identification

• Single Nucleotide Polymorphism (SNP) genotyping

Gene Expression• Cells express different subsets of genes

under different environments

Gene (DNA)

mRNAProtein

Transcription Translation

Two-Color Technique• Sample labeled RED• Control labeled GREEN• YELLOWYELLOW probes hybridize to both sample and control•BLACK probes hybridize to neither

Cy3Cy3Cy3

Cy5Cy5Cy5

cell type 2

cell type 1

RNA 2

RNA 1

target 1

target 2

Microarray Technologies

• Arrays of cDNAs– Obtained by reverse transcription from

Expressed Sequence Tags (ESTs)

• Oligonucleotide arrays– Short (20-60bp) synthetic DNA strands

Ink jet Technology

Pin Technology Quill Pen Technology

Pin Ring Technology

Robotic cDNA Arrayers

VLSIPS Array Synthesis

CG

AC

CG

AC

ACG

AG

G

AG

C

Probes to be synthesized

A

A

A

A

A

VLSIPS Array Synthesis

CG

AC

CG

AC

ACG

AG

G

AG

C

Probes to be synthesized

A

A

A

A

A

C

C

C

C

C

C

VLSIPS Array Synthesis

CG

AC

CG

AC

ACG

AG

G

AG

C

Probes to be synthesized

A

A

A

A

A

C

C

C

C

C

C

G G

G G

G G

Oligonucleotide Arrays

• Pros– Highly versatile– Very high integration (VLSIPS ~106 probes/cm2)

• Cons– Higher cost

• Significant problem for low production volumes

Overview

Background on DNA Microarrays Universal Tag Arrays Tag Set Design Problem Tag Assignment Problem Conclusions

Universal Arrays

• Key idea– Array consisting of application independent

tags– Two-part “reporter” probes: aplication specific

primers ligated to antitags – Detection carried by a sequence of reactions

separately involving the primer and the antitag part of reporter probes

• Brenner 97, Morris et al. 98

Universal Array Experiment

+

Universal Array Advantages

• Cost effective– Same array type used for large number of

experiments can be produced in large quantities

• Rapid design– Only need to synthesize new set of reporter probes

• Reliable– Solution phase hybridization better understood than

hybridization on solid support

Overview

Background on DNA Microarrays Universal Tag Arrays Tag Set Design Problem Tag Assignment Problem Conclusions

Tag Set Requirements• Hybridization constraints

(H1) Antitags hybridize strongly to complementary tags

(H2) No antitag hybridezes to a non-complementary tag

(H3) Antitags do not cross-hybridize to each other

t1t1 t2t2 t1 t2t1

Hybridization Model

• Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state

• 2-4 rule

Tm = 2 #(As and Ts) + 4 #(Cs and Gs)

• More accurate models exist, e.g., the near-neighbor model

Tag Set Design Problem

• [Ben-Dor et al. 00] Conservative formalization of (H1)+(H2) based on nucleation complex theory and 2-4 rule

(C1) Every tag must have total weight h

(C2) No 2 tags share a common substring of weight c

Where – w(A)=w(T)=1, w(C)=w(G)=2– c, h are given constant

c-h codes

• c-token: left-minimal DNA string of weight c, i.e.,– w(x) c

– w(x’) < c for every proper suffix of x

• [Ben-Dor et al. 00] A set of tags is called c-h code if(C1) Every tag has weight h

(C2*) Every c-token is used at most once

c-h code problem: given c and h, find largest c-h code

c-h code problem

• [Ben-Dor et al. 00] give a constructive upper-bound on largest c-h code size and a near-optimal algorithm based on DeBruijn sequences

• [MT05] c-h code problem can be formulated as a maximum integer flow problem with capacity constraints on (disjoint) sets of vertices– Solvable in practical time for small values of c

Token Content of a Tag

c=4 CCAGATT CC CCA CAG AGA GAT GATT

Tag of length l sequence of c-tokens

c=4, l=7

tag = CCAGATT

CCCCACAGAGAGATGATT

Layered c-token graph

s t

c1

cN

ll-1c/2 (c/2)+1 …

Integer Program Formulation

O(hN) constraints and variables, where N = #c-tokens

ILP Results

Number of c-tokens

• W=A or T, S=C or G

• Gn = #strings of weight n

G1 = 2; G2 = 6; Gn = 2Gn-2 + 2Gn-1

Token type Num tokens

<c-2>S 2 Gc-2

S<c-3>S 4 Gc-3

<c-1>W 2 Gc-1

S<c-2>W 4 Gc-2

Total Gc + 2 Gc-1

Number of c-tokensc Num c-tokens

5 208

6 568

7 1552

8 4240

9 11584

10 31648

Periodic Tags

• If multiple c-token copies are not allowed in a tag, every tag uses roughly one c-token per letter (except for first up to c-1 letters)

• A tag t is periodic if it is the prefix of () for some string – t periodic with period t uses at most || c-tokens– Advantageous to use periodic tags with short period

Portion of c-token factor graph, c=4

CC

AAG AAC

AAAA

AAAT

Tag Set Design Algorithm

1. Construct c-token factor graph G

2. T{}

3. For all cycles C defining periodic tags, in increasing order of cycle length, do

– If C has no c-tokens in common with T, then• Add a tag defined by C to T

• Remove C from G

4. Perform a pruned alphabetic tree search and add to T tags with no c-tokens in common with T

5. Return T

Vertex-disjoint Cycle Packing Problem

• Given directed graph G, find maximum number of vertex disjoint directed cycles in G

• [MT 05] APX-hard even for regular directed graphs with in-degree and out-degree 2– h-c/2+1 approximation factor for tag set design problem

• [Salavatipour and Verstraete 05] – Quasi-NP-hard to approximate within (log1- n)

– O(n1/2) approximation algorithm

Experimental Results

Antitag-to-Antitag Hybridization

• Additional practical constraint: antitags do not cross-hybridize (including self)– Ignored by Ben-Dor et al

• Formalization in c-token hybridization model:(C3) No two (anti)tags contain complementary

substrings of weight c

• Cycle packing and tree search extend easily

Results w/ Extended Constraints

Overview

Background on DNA Microarrays Universal Tag Arrays Tag Set Design Problem Tag Assignment Problem Conclusions

More Possible Mis-Hybridizations

Degrees of freedom: partition of primers on multiple arrays, tag assignment within an array (may be forced to leave tags unassigned)

Here we focus on avoiding case (a), primer-to-tag hybridization

Constraints on tag assignment

• If primer p hybridizes with tag t, then either p or t must be left un-assigned, unless p is assigned to t

p

t

t’

p’

Characterization of Assignable Sets

• [Ben-Dor 04] Set P is assignable to T iff X+Y |P|, where, in the hybridization graph induced by P+T – X = number of primers incident to a degree 1 tag

– Y = number of degree 0 tags

Y=2

X=1

MAPS Problem

• Maximum Assignable Primer Set (MAPS) Problem: given primer set P and tag set T, find maximum size assignable subset of P

• [Ben-Dor 04] Greedy deletion heuristic: repeatedly delete primer of maximum weight from P until it becomes assignable, where– Potential of tag t is 2-|P(t)|

– Potential of primer p is sum of potentials of conflicting tags

Multiplexing Problem

• Universal Tag Array Multiplexing Problem: given primer set P and tag set T, find partition of P into minimum number of assignable sets

• [Ben-Dor 04] Repeatedly find approximate MAP using greedy deletion

Integration with Probe Selection

• In practice, several primer candidates with equivalent functionality– In SNP genotyping, can pick primer from either forward and

reverse strand

– In gene expression/identification applications, many primers have desired length, Tm, etc.

Pooled Array Multiplexing Problem

• [MPT 05] Given set of primer pools P and tag set T, find a primer from each pool and a partition of selected primers into minimum number of assignable sets

X+Y Characterization no Longer Holds

Pooled Multiplexing Algorithms

1. Primer-Del = greedy deletion for pools similar to [Ben-Dor et al 04]

2. Primer-Del+ = same but never delete last primer from pool unless no other choice

3. Min-Pot = select primer with min potential from each pool, then run Primer-Del

4. Min-Deg = select primer with min conflict degree, then run Primer-Del

Results: GenFlex Tags, c=7

Results: GenFlex Tags, c=8

GenFlex vs. PerTags, c=8 (|T|=213)

Overview

Background on DNA Microarrays Universal Tag Arrays Tag Set Design Problem Tag Assignment Problem Conclusions

Conclusions

• Novel combinatorial techniques for tag set design and tag assignment lead to significant improvement in overall assay throughput

• Other applications of DNA tags: – Lab-on-chip, DNA-directed assembly of

nanostructures (e.g., carbon nano-tubes), DNA computing [Brenneman&Condon 02]

• Other applications of assignment techniques:– Genotyping by mass-spectroscopy [Aumann et al 05]– Genotyping using l-mer arrays

Ongoing Work• Special type of universal arrays: l-mer arrays

– Initially introduced for sequencing by hybridization, but proved impractical

– Currently investigated for use in resequencing by hybridization

– More promising application: SNP genotyping

– Isothermal prefix sets instead of l-mer arrays?

• Deeper integration between tag assignment and probe selection, e.g., in string barcoding

• Extend algorithms to more accurate near-neighbor hybridization model– monotonic Tm c-tokens successor graph

Open Problems• Settle approximation complexity of (vertex) disjoint cycle

packing ([Salavatipour and Verstraete 05] show approximation preserving reductions between edge and vertex disjoint versions)

• Establish better approximation bound for special instances arising in tag set design

• Improved approximation algorithms for maximum assignable tag set and the tag assignment problems

Acknowledgments

• Claudia Prajescu and Dragos Trinca

• UCONN Research Foundation