
A high-performance image processing

pipeline for Polony DNA re-sequencing

ECE1747 Project Report

December 21, 2011

Author:

Francesco Iorio

Abstract

DNA sequencing and re-sequencing are two fundamental tools in biological research,

and are applied to numerous areas, such as genome mapping and the study of genetic diseases related to DNA

mutation.

Polony DNA re-sequencing [POR06] is a modern high-throughput technique, which has

been implemented in numerous systems, and the baseline software this project is based

on is the image processing pipeline originally written for the Polonator G.007 machine

[POL], which gathers relatively low-resolution images and extracts the relevant data to

perform DNA re-sequencing operations.

Due to the ongoing quest towards the cost reduction of re-sequencing large amounts of

DNA material, for example the human genome, a new high-throughput parallel image

processing pipeline has been designed to support a different selection of algorithms and

to exploit SMP parallel computing systems, using a combination of pipeline, data-parallel

and task-based parallelism patterns.

The new software outperforms the original implementation in terms of total running time

on a reference SMP system, while operating on 5.5x higher resolution images, as a result

of both pipelining of the different processing stages on different processors and assigning

multiple processors to the most computationally intensive stages, while overall load

balancing amongst the stages is managed by a task-based task-stealing scheduler.


Table of Contents

Abstract

Table of Contents

Table of Tables

Table of Figures

1 Introduction

2 Background

2.1 Polony DNA re-sequencing

2.2 Polonator G.007 and its image processing pipeline
2.2.1 Pre-cycle: image segmentation stage
2.2.2 Registration stage
2.2.3 Extraction stage
2.2.4 Base calling stage

2.3 New Polonator “H” system

2.4 Shortcomings of the old software on the new system

3 A new image processing pipeline

3.1 Pipeline stages
3.1.1 Ingestion stage
3.1.2 Phase Detection stage
3.1.3 Registration stage
3.1.4 Extraction stage
3.1.5 Base calling stage

4 Experimental setup

4.1 Synthetic image generator

4.2 Test image processing system

5 Results

5.1 Test 1: old vs. new pipeline

5.2 Test 2: new pipeline using high-resolution images and fine grid pitch

6 Conclusions and future work

References


Table of Tables

Table 1 Polonator G.007 system details
Table 2 Polonator H system details
Table 3 Experimental Platform
Table 4 Test 1 image generator configuration parameters
Table 5 Test 2 image generator configuration parameters


Table of Figures

Figure 1 – Polonator G.007: beads uniformly distributed over the substrate
Figure 2 – Polonator G.007 image processing pipeline
Figure 3 – Polonator “H”: beads distributed over the substrate in grid layout
Figure 4 – Parallel image processing pipeline
Figure 5 – Partial Projection Profile technique for phase detection
Figure 6 – 2D DFT power spectrum cross-correlation registration


1 Introduction

DNA is one of the fundamental building blocks of life and its accurate analysis

constitutes a major technical and scientific challenge.

Modern high-throughput re-sequencing techniques provide fast, accurate detection and

low cost, although the ever-increasing demand in the area of DNA research requires

constantly reducing the overall turnaround time of the sequencing results and the cost per

sequenced base.

Polony DNA re-sequencing [POR06] is one of the modern techniques designed to both

drastically reduce the overall cost of the re-sequencing process and significantly

decrease the time to result, and as such offers a very competitive price/performance point

of $0.11/Kbase, while being capable of analyzing around 0.4 Gbases per hour (see Table 1).

Scanning very large DNA sequences such as the entire human genome, which consists of

approximately 6 billion base pairs, still requires several days of processing, therefore

limiting the amount of research that can be performed in a given time.

2 Background

DNA strands are composed of two long polymers connected in a double helix spatial

configuration. The two polymers are composed of sequences of smaller molecules,

namely nucleotides (Adenine, Cytosine, Guanine, Thymine).

In order to study DNA-based living organisms and their interaction, knowledge of the

exact sequence of nucleotides that compose DNA strands is of extreme importance.

One of the largest and most expensive projects in DNA research has been the sequencing

of the entire human genome [HGP], which is the complete DNA material contained in

human genes; the project was completed in 2003 at a combined cost of over $3 billion.

Re-sequencing is a variant of the sequencing process that does not attempt to sequence

unknown DNA strands, but instead measures the differences between a previously known

strand and other unknown strands that are expected to be very similar.

The purpose of re-sequencing is mostly to study DNA alterations and mutations, and to

measure the differences between DNA of different life forms and species.

In recent years a number of techniques have been developed to reduce the cost of

sequencing, with the goal of increasing the amount of research conducted on DNA

mutation and the effects of alterations in genes on the development of cellular lifecycles.

2.1 Polony DNA re-sequencing

In order to reduce the time and cost involved in sequencing individual bases in large

DNA strands, the Polony re-sequencing process revolves around a technique of large-

scale multiplexing of the base-detection operations.

The process consists of several interleaved chemical reactions and image processing

stages: the original large DNA strand to be re-sequenced is initially split into thousands

of small DNA strips, each consisting of 26 bases, to form a library; the strips are then

attached to a substrate and sequenced in parallel using a set of image processing

techniques that detect bases in multiple individual strips at once.


The complete setup procedure involves the following stages:

1. Split DNA strand into 26-base long DNA strips (templates).

2. Attach a specially synthesized reference (primer) strand to all templates.

3. Amplification process (strand replication) to increase the density of the templates.

4. Emulsion process to separate different templates and attach templates to magnetic

objects (beads), in order to generate objects large enough to be visible using a

regular fluoroscopy microscope, each having thousands of identical templates

attached to it.

5. A single layer of beads is spread uniformly over a panel (substrate), placed in a

container where all chemical processes take place (flow cell). This results in a

random distribution of beads over the substrate, as displayed in Figure 1.

Figure 1 – Polonator G.007: beads uniformly distributed over the substrate

After the initial setup a pre-cycle is performed to detect object positions on the substrate.

In order to capture an image of the total substrate area at a resolution high enough to

allow distinguishing individual objects, the area is subdivided into a number of imaging

locations. For each imaging location the following operations are performed:

1. The microscope is moved to the imaging location.

2. A white-light image is taken and this allows the detection of all the objects at

once.

3. Image segmentation is performed on the image to detect the 2D spatial

coordinates of individual objects.

4. All the objects coordinates are stored to disk.


Following the setup procedure, 26 cycles of chemical reactions and image processing are

required to extract the required information for all 26 bases at each object location. For

each of the 26 cycles the following operations are performed:

1. A set of four kinds of pre-prepared polymers are flushed onto the substrate, each

kind containing a fluorophore segment which is visible when illuminated with a

light source of the appropriate color; the polymers are synthesized so they

chemically bond with templates that contain a specific nucleotide in the position

currently being sequenced.

2. For each imaging location the following operations are performed:

a. The microscope is moved to the imaging location used in the pre-cycle

four separate times, one pass per color.

b. In each pass a light of a different color is shone on the substrate and an

image is taken.

c. Image registration is performed to align the original objects coordinates

detected in the pre-cycle with the new image, due to the mechanical offset

introduced by the robotic arm visiting the same location multiple times.

d. Extraction of the objects values is performed using the original

coordinates detected in the pre-cycle and the offset detected in the

registration phase.

After the 26 cycles are performed, information about all the templates is processed to

generate base sequences for each template using the color intensities extracted at each

location to assign individual bases (nucleotides) to individual locations in each template

(base calling).

The resulting 26-base sequences are then individually aligned to the original reference

DNA strand using error metrics when perfect alignments are not found, resulting in a full

sequence of the new strand together with a complete list of differences between the new

strand and the reference strand.

2.2 Polonator G.007 and its image processing pipeline

The first implementation of the Polony DNA re-sequencing process in a commercially

available system is the Polonator G.007 system [POL], which is the result of the

collaboration between the Harvard Wyss Institute and Dover Systems; Table 1 details the

machine’s full specifications.


Table 1 Polonator G.007 system details

Parameter                       Value
Imaging resolution              1000x1000 pixels
Imaging interval                120 ms
Total imaging area              2000 mm²
Imaging throughput              22 MB/s
Sequencing throughput           ~0.4 Gbases/hr
Flow cell sequencing time       ~45 hours
Human genome sequencing time    ~710 hours

Figure 2 – Polonator G.007 image processing pipeline

Figure 2 shows the overall Polonator G.007 image processing software pipeline structure.

In the pipeline each stage is executed sequentially and serializes all its state to disk prior

to moving to the subsequent stage.

All the imaging stages move the microscope over a number of imaging locations in order

to have images that cover the full substrate. The full set of images taken at all imaging

locations represents a full scan of the flow cell.

2.2.1 Pre-cycle: image segmentation stage

As previously discussed the pre-cycle performs a full scan of the flow cell, taking a

white-light image at each imaging location.

Objects are randomly placed on the substrate; therefore, in order to detect individual

object positions, image segmentation is performed on the white-light images and the

coordinates of each object centroid are stored to disk.

The segmentation algorithm uses a threshold value to distinguish pixels that constitute the

image background from the objects themselves; the threshold value is computed per


image by adding the standard deviation of all pixel values in the image to the mean of all

pixel values.

Pixels above the threshold value represent objects, while the other pixels represent the

background. Connected component labeling is performed using the pixels above the

threshold and centroids are computed for all the detected objects.
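The segmentation steps described above can be sketched as follows. This is a minimal illustration and not the Polonator G.007 code: the function name `segment`, the use of the population standard deviation and the choice of 4-connectivity for the labeling are all assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def segment(image):
    """Return (threshold, centroids) for a 2D list of pixel intensities."""
    pixels = [value for row in image for value in row]
    # Threshold as described in the text: mean plus one standard deviation.
    # Assumption: population standard deviation.
    threshold = mean(pixels) + pstdev(pixels)

    height, width = len(image), len(image[0])
    visited = [[False] * width for _ in range(height)]
    centroids = []
    for y in range(height):
        for x in range(width):
            if visited[y][x] or image[y][x] <= threshold:
                continue
            # Flood fill one connected component (assumption: 4-connectivity).
            visited[y][x] = True
            stack = [(x, y)]
            members = []
            while stack:
                cx, cy = stack.pop()
                members.append((cx, cy))
                for nx, ny in ((cx - 1, cy), (cx + 1, cy), (cx, cy - 1), (cx, cy + 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not visited[ny][nx] and image[ny][nx] > threshold):
                        visited[ny][nx] = True
                        stack.append((nx, ny))
            # Centroid as the unweighted mean of the member pixel coordinates.
            centroids.append((sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members)))
    return threshold, centroids
```

Each component is visited sequentially and grows from a seed pixel, which gives an intuition for why this step is awkward to spread over several processors.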

In order to facilitate the subsequent registration stage, coordinates for a subset of the

detected objects are also stored to disk as a registration array.

2.2.2 Registration stage

Due to the very high level of image magnification and mechanical tolerances, two images

taken at the same imaging location at two different times can exhibit a small offset,

which experiments have quantified as ±20 pixels in both the X and Y directions.

Image registration is therefore required to return a 2D vector representing the offset

between an image taken at a specific imaging location in a cycle and the reference, white-

light image taken at the same imaging location.

Registration is performed individually on all four color images taken at each imaging

location, as they are all taken at different times.

The registration process involves reading pixel values in a window centered at the

object coordinates specified in the registration array, and finding the offset at which the

sum of the intensity values is highest. The highest value represents the point of maximum

correlation.
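A minimal sketch of this intensity-sum search is given below, assuming the image is a 2D list of intensities and the registration array is a list of (x, y) coordinates. The function name `register_by_intensity` and the handling of samples that fall outside the image (they are skipped) are assumptions made for illustration.

```python
def register_by_intensity(image, registration_array, max_offset=20):
    """Return the (dx, dy) offset that maximizes the summed intensity
    sampled at the registration coordinates."""
    height, width = len(image), len(image[0])
    best_offset, best_score = (0, 0), float("-inf")
    # Exhaustive search over the +-max_offset window in both directions.
    for dy in range(-max_offset, max_offset + 1):
        for dx in range(-max_offset, max_offset + 1):
            score = 0
            for x, y in registration_array:
                sx, sy = x + dx, y + dy
                # Assumption: samples outside the image contribute nothing.
                if 0 <= sx < width and 0 <= sy < height:
                    score += image[sy][sx]
            if score > best_score:
                best_offset, best_score = (dx, dy), score
    return best_offset
```

The search only ever returns whole-pixel offsets, because candidate offsets are enumerated on the integer pixel grid.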

2.2.3 Extraction stage

The offset resulting from the registration stage is added to the 2D coordinates of all

objects detected at each imaging location to extract the individual objects values.

Individual intensity values are extracted from each of the four color images and stored to

disk for processing by the base calling stage.

2.2.4 Base calling stage

Base calling reads the four intensity values for all detected objects from disk and assigns

to each object a nucleotide base depending on the highest intensity value out of the four.

The exact details of this process are presented in section 3.1.5, as it was largely

unchanged between the old and the new software.

As base calling is not very computationally intensive relative to the image

processing stages, none of the experiments on the old software pipeline include it in the

timing results.

2.3 New Polonator “H” system

In order to increase the overall image processing throughput, with the ultimate goal of

increasing the sequencing rate (and simultaneously reducing the cost), a new system was

designed: the Polonator “H”.

The new system was designed to use one or more cameras, up to four, to increase the

level of multiplexing by sampling multiple imaging locations at a time. Furthermore, high

resolution cameras and fast frame grabbers are used, which decrease the interval between

images to 60ms.


Another improvement over the old system is in the design of the flow cell substrate:

while in the old system beads were uniformly spread, therefore assuming random 2D

coordinates, the new system uses a different chemical process that produces DNA

nanoballs and attaches them to an etched-silicon grid.

This results in the objects assuming positions on a uniform grid of predetermined size and

pitch, which is an important property that can be exploited by the image processing

software.

Figure 3 displays the regular grid layout exhibited by the images generated by the new system.

Figure 3 – Polonator “H”: beads distributed over the substrate in grid layout

The full system specification is detailed in Table 2.

Table 2 Polonator H system details

Parameter                            Value
Imaging resolution                   2560x2160 pixels
Imaging interval                     60 ms
Total imaging area                   5000 mm²
Imaging throughput                   ~184 MB/s per camera
Peak sequencing throughput           1 camera: ~8.7 Gbases/hr; 4 cameras: ~34.9 Gbases/hr
Peak flow cell sequencing time       1 camera: ~14 hours; 4 cameras: ~3.5 hours
Peak human genome sequencing time    1 camera: ~34.2 hours; 4 cameras: ~8.5 hours
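The per-camera imaging throughput in Table 2 and the "5.5x higher resolution" figure quoted in the abstract can be reproduced with a short calculation. The sketch below assumes 16-bit (2-byte) pixels, consistent with the 16-bit intensity values mentioned for the extraction stage, and decimal megabytes; the function names are illustrative only.

```python
def imaging_throughput_mb_per_s(width, height, interval_s, bytes_per_pixel=2):
    """Data rate of one camera in decimal MB/s (assumption: 16-bit pixels)."""
    return width * height * bytes_per_pixel / interval_s / 1e6


def resolution_ratio(new_size, old_size):
    """Ratio of pixel counts between two (width, height) image sizes."""
    return (new_size[0] * new_size[1]) / (old_size[0] * old_size[1])


# Polonator "H": 2560x2160 pixels every 60 ms (Table 2).
print(round(imaging_throughput_mb_per_s(2560, 2160, 0.060), 1))
# Pixel count relative to the 1000x1000 images of the G.007 (Table 1).
print(round(resolution_ratio((2560, 2160), (1000, 1000)), 2))
```

Under these assumptions the first value agrees with the ~184 MB/s listed in Table 2 and the second with the 5.5x resolution increase.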


2.4 Shortcomings of the old software on the new system

As the new system was being designed it was immediately apparent that the original

software was not suitable, as it has a number of shortcomings which prevent it from

working properly under the new operating conditions:

1. While full image segmentation works for object detection, the new grid-shaped

layout allows for more efficient processing thanks to its predetermined geometry.

2. Due to the very fine grid pitch used by the new machine (800nm), the microscope

zoom level (20x) and the pixel size (6.5um), the objects are very small, of

the order of ~1-2 pixels in diameter, and the distance between the centers of

consecutive objects on the grid is only ~2.5 pixels; neighboring objects can therefore

partially overlap, which causes anomalies in the segmentation procedure used.

3. The registration phase uses a set of known object coordinates and samples an

image window around each presumed object location, using the cumulative peak

intensity level from the four color images to determine the most likely offset;

while this technique works well with randomly distributed objects, it can fail in

the presence of a highly filled regular grid, as it can find multiple equally valid

registrations, each spaced by the grid pitch.

4. The registration phase finds integer offsets, which are not sufficiently precise in

the presence of a fine grid.

5. Connected component labeling, which is used by the segmentation, is known

to be hard to parallelize, and does not scale well on multiple processors.
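The ~2.5 pixel spacing quoted in the list above follows directly from the optical parameters. The short calculation below is only a worked check; the function name is illustrative and the model (sensor pixel size divided by magnification) is the usual simplification, assumed here rather than taken from the system specification.

```python
def grid_pitch_in_pixels(pitch_nm, sensor_pixel_nm, magnification):
    """Grid pitch expressed in image pixels."""
    # Size of one sensor pixel projected onto the substrate.
    effective_pixel_nm = sensor_pixel_nm / magnification
    return pitch_nm / effective_pixel_nm


# 800 nm pitch, 6.5 um sensor pixels, 20x zoom: 325 nm per pixel on the substrate.
print(grid_pitch_in_pixels(800, 6500, 20))
```

With objects only 1-2 pixels wide and centers under 2.5 pixels apart, a registration error of a single pixel is already a large fraction of the pitch.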

After analyzing the aforementioned shortcomings the decision was made to rewrite the

software pipeline completely, to use more appropriate algorithms and to exploit parallel

computing to increase the overall throughput.

3 A new image processing pipeline

In order to sustain the high data throughput the new machine generates while performing

the imaging cycles, the new image processing software pipeline was designed to exploit

multiple CPU cores, commonly available in modern microprocessors.

As explained above, the new system attaches objects to a regular grid, which can simplify

detection using techniques previously used in microarray analysis; the new pipeline

aggressively exploits the inherent knowledge of the grid geometry to accelerate its stages.

In his report Peter Bajcsy [BAJ06] examines several methods commonly applied in

detection and extraction of object data from DNA microarrays, which are structures that

produce images very similar to the new Polonator system, and we thus designed the new

software pipeline to reuse part of the already developed knowledge and apply it to our

scenario.

Figure 4 shows the pipeline design and illustrates the included processing stages.


Figure 4 – Parallel image processing pipeline

All the pipeline stages are concurrently active, except for Phase Detection and

Registration, which are active in different imaging cycles; the stages use lightweight

events to communicate data items between them.

The pipeline uses task-based parallelism throughout, and employs Intel’s Threading

Building Blocks [INT] as the main task-stealing scheduler to automatically balance the

load over all the available CPU threads at runtime; it maintains separate task queues

per thread, therefore minimizing the overhead of context switching in the presence of dynamic

assignment of load to threads.

Data is input sequentially to the ingestion stage by one or more cameras (up to four), and

is subsequently forwarded through the pipeline for processing, generating as output a

string of bases per cycle, corresponding to one base sequenced per object per imaging

location.

3.1 Pipeline stages

3.1.1 Ingestion stage

The ingestion stage temporarily caches images coming from one or more cameras to form

image sets composed of the four color images that represent a unique imaging location.

The frame grabber transfers images to main memory using DMA into a pre-allocated

circular buffer and generates an event when an image is ready, essentially creating a

producer-consumer queue operating on the circular buffer.

Once all the four images for a specific imaging location are ready the ingestion stage

creates a small memory structure to contain all the information pertinent to that imaging

location (image set) and forwards it to the next stage in the pipeline.
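The grouping of incoming frames into image sets can be sketched as follows. This is a simplified, single-threaded illustration: the class name `IngestionStage`, the method `on_image_ready` and the dictionary layout of an image set are hypothetical, and the DMA circular buffer and event mechanism are reduced to a plain method call.

```python
NUM_COLORS = 4


class IngestionStage:
    """Caches frames until all four color images of a location have arrived."""

    def __init__(self):
        # Partially filled image sets, keyed by imaging location.
        self._pending = {}

    def on_image_ready(self, location, color, pixels):
        """Called once per frame; returns a complete image set or None."""
        slot = self._pending.setdefault(location, {})
        slot[color] = pixels
        if len(slot) < NUM_COLORS:
            return None
        # All four colors are present: emit the set and release the cache entry.
        del self._pending[location]
        return {"location": location,
                "images": [slot[c] for c in range(NUM_COLORS)]}
```

Because frames are keyed by location, the sketch tolerates frames from different locations arriving interleaved, as could happen with several cameras.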


The decision to have a dedicated stage for image ingestion facilitates future extensions of

the pipeline, or the reordering of the imaging sequence by encapsulating the caching and

reordering to feed the other pipeline stages with a consistent stream of image sets.

3.1.2 Phase Detection stage

The phase detection stage is dedicated to positioning a 2D frame of reference for the

objects grid, with the goal of having the ability to calculate all remaining objects

coordinates starting from the reference frame and using the grid pitch information that is

known in advance.

This dramatically reduces the processing complexity that regular image segmentation

presents.

As previously described by Bajcsy [BAJ06], Deepa [DE09] and Siswantoro [SIS10],

horizontal and vertical projection profiles are useful in determining the grid layout in

images containing microarray data.

The technique used for this project is a refinement that uses specific knowledge about the

images produced by the Polonator machine. More specifically, the key difference between

traditional DNA microarrays and our machine is that images taken from DNA

microarrays generally contain the object grid in one or more sections of the image, and

the grid very rarely spans the entire image, whereas in our scenario the grid completely

fills the image, as the substrate we are imaging is much larger than the microscope

imaging area due to the required resolution and zoom level.

Using this knowledge, combined with experimentally derived knowledge of the average and

maximum fill rate in our grids (80% and 95% respectively), we can avoid

sampling the full image to generate the horizontal and vertical projection profiles,

sampling instead a small vertical slice and a small horizontal slice, with their size determined

statistically to ensure 99% fill rate.

After the projection profiles have been obtained we do not directly detect intensity peaks

on the profiles like the regular projection profile technique, but instead we perform

intensity binning using the grid pitch as the binning frequency, to obtain a compact

profile that exhibits a peak corresponding to the desired phase value.

We named this technique “Partial Projection Profile”; Figure 5 depicts the process.


Figure 5 – Partial Projection Profile technique for phase detection
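The idea behind the Partial Projection Profile can be sketched in a few lines for the horizontal direction (the vertical direction is symmetric). This is an illustrative reconstruction under stated assumptions, not the project's implementation: the function names, the number of bins per pitch and the choice of the first rows as the slice are all hypothetical.

```python
def projection_profile(image, slice_rows):
    """Sum only the first `slice_rows` rows into a per-column profile."""
    profile = [0.0] * len(image[0])
    for row in image[:slice_rows]:
        for x, value in enumerate(row):
            profile[x] += value
    return profile


def detect_phase(profile, pitch, bins_per_pitch=4):
    """Fold the profile at the grid pitch and return the phase (in pixels)
    of the bin that accumulates the most intensity."""
    folded = [0.0] * bins_per_pitch
    for x, value in enumerate(profile):
        # Position of this column within one grid period, in [0, 1).
        fraction = (x % pitch) / pitch
        index = min(int(fraction * bins_per_pitch), bins_per_pitch - 1)
        folded[index] += value
    peak = max(range(bins_per_pitch), key=lambda i: folded[i])
    return peak * pitch / bins_per_pitch
```

Folding at the pitch accumulates the contribution of every grid column into the same bin, so a few empty cells in the slice do not move the peak.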

3.1.3 Registration stage

The purpose of the registration stage is to align images taken at the same imaging

location in different cycles. The reason this is necessary is the combination of

temperature variations and mechanical drift in the robotic arm movement, which causes

images taken at the same imaging location after the mechanical arm moved away from

the location and then returned to it to have a small offset.

Image registration is not a technique commonly used in DNA microarrays processing, as

normally each acquired image represents a different experiment and therefore requires

separate grid placement and values extraction, whereas in our case we not only have prior

information on the grid phase and pitch, we also need to preserve the spatial

characteristics of the grid, by matching all object coordinates at every imaging location,

so that intensity values for an individual object in different cycles can be concatenated to

form the sequence of bases present in that object.

General image registration is a widely researched problem, especially in medical

imaging. Wisetphanichkij and Dejhan [WIS05] describe a robust image registration

procedure that includes coarse and fine affine transformations registration, while

NessAiver, Subhasish Biswas [NES00] present a DFT-based registration of rotation and

translation.

Full image DFT cross-correlation would allow us to have precise registration between the

images acquired in the first cycle and the images acquired in the subsequent cycles, but

the high resolution of the images would make its computation very slow and beyond our

time constraints.

Using random regions inside the image is not guaranteed to succeed due to the very

regular grid layout of our images, which in the worst case scenarios could be completely


empty (no object present on a grid region) or completely populated (all cells in the grid

region contain an object). In the first case registration is obviously impossible, while in

the second case every offset which is a multiple of the grid pitch would be an equally

valid registration.

We therefore adopted a combination of machine-specific heuristics and a DFT-based

power spectrum cross-correlation registration for finding sub-pixel registration offsets

faster than using full image DFT cross correlation.

The process chooses a number of regions inside each image captured at all imaging

locations in the first cycle (templates) based on the presence of objects in these regions

forming discernible patterns, and uses the obtained region coordinates to perform cross

correlation to find the relative offsets in all the images captured in the subsequent cycles.

The technique is implemented by splitting the registration operations for each imaging

location between two separate stages.

The phase detection stage performs the patterns search by detecting absence of objects on

the grid, then stores the selected regions coordinates and computes the 2D forward DFT

of those regions.

The registration stage uses the information generated in the phase detection stage to

extract regions at the same coordinates, then computes the 2D forward DFT of the regions,

computes the power spectrum cross-correlation in frequency space, and then performs

an inverse 2D DFT to generate a correlation intensity image. The correlation intensity

image is searched for its peak value, which marks the highest correlation level, and the position of the peak is used

as the registration offset.

The inverse 2D DFT is performed on a 4x larger domain, in order to obtain 0.25 pixel

accuracy for the registration offsets; Figure 6 shows the results of the inverse 2D DFT.

Processing of the forward and inverse 2D DFT is performed in parallel on the different

regions in each image by creating individual tasks for each DFT operation and assigning

it to the Intel Threading Building Blocks task-stealing scheduler, which arbitrates

between inter-stage pipeline parallelism and intra-stage task parallelism.

Figure 6 – 2D DFT power spectrum cross-correlation registration
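The DFT-based cross-correlation with a 4x larger inverse domain can be illustrated in one dimension. The sketch below is a deliberately simplified stand-in: it works on 1D signals rather than 2D regions, uses a naive O(n²) DFT to stay self-contained (a real implementation would use an FFT library), and the function names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive forward DFT of a real-valued sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def register_1d(template, image, upsample=4):
    """Return the shift of `image` relative to `template`, in steps of
    1/upsample pixels, from the peak of the cross-correlation."""
    n = len(template)
    spectrum_t, spectrum_i = dft(template), dft(image)
    # Cross-power spectrum in frequency space.
    cross = [si * st.conjugate() for st, si in zip(spectrum_t, spectrum_i)]
    samples = n * upsample
    best_m, best_value = 0, float("-inf")
    # Inverse DFT evaluated on an `upsample` times finer grid.
    for m in range(samples):
        value = 0.0
        for k, c in enumerate(cross):
            frequency = k if k <= n // 2 else k - n  # signed frequency
            value += (c * cmath.exp(2j * cmath.pi * frequency * m / samples)).real
        if value > best_value:
            best_m, best_value = m, value
    shift = best_m / upsample
    # Shifts beyond half the window wrap around to negative offsets.
    return shift if shift <= n / 2 else shift - n
```

For example, a template shifted circularly by three samples yields an offset of 3.0, and one shifted back by two samples yields -2.0. With `upsample=4` the correlation is sampled every quarter pixel, which is where the 0.25 pixel accuracy comes from.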


3.1.4 Extraction stage

The extraction stage uses the previously calculated phase (from the phase detection stage)

and offset (from the registration stage) values to position the grid frame precisely over the

image, and then proceeds to iteratively calculate the center coordinates of all the objects and to

sample the four color images at those coordinates.

For each object position the output is therefore a set of four 16-bit values, one per color.

Normalization of the intensity values is performed here, as well as compensation for

the different response rates of the different fluorescent polymers.

The process is very memory intensive, as it involves reading values from the large images

at dynamically computed coordinates, causing a large number of cache misses.
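A minimal sketch of the coordinate generation and sampling is shown below. It assumes nearest-pixel sampling and omits normalization and response compensation; the function names and the (x, y) tuple layout for phase and offset are hypothetical.

```python
import math


def axis_centers(size, pitch, phase, offset):
    """Pixel indices of the grid centers along one axis of length `size`."""
    centers = []
    k = 0
    while True:
        # Center of the k-th grid cell, rounded to the nearest pixel.
        pixel = math.floor(phase + offset + k * pitch + 0.5)
        if pixel >= size:
            return centers
        if pixel >= 0:
            centers.append(pixel)
        k += 1


def extract(color_images, pitch, phase_xy, offset_xy):
    """Return one tuple of four intensities per grid position, row by row."""
    height, width = len(color_images[0]), len(color_images[0][0])
    xs = axis_centers(width, pitch, phase_xy[0], offset_xy[0])
    ys = axis_centers(height, pitch, phase_xy[1], offset_xy[1])
    return [tuple(image[y][x] for image in color_images)
            for y in ys for x in xs]
```

Every sample touches four separate images at the same computed position, which matches the scattered access pattern described above.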

3.1.5 Base calling stage

Base calling involves determining the base present at each object location in a specific

imaging cycle.

The four color intensities extracted in the previous stage contain information about the

base present at each object location.

The sets of intensity values for each imaging location are treated as an array of four-

dimensional vectors, which are processed using the following operations:

Principal component analysis is performed on each vector to determine an initial

estimation of the most likely base assignment each vector represents and vectors

are marked as representing the corresponding base.

Clustering of the vectors is then performed using the initial assignments, to form

four 4-dimensional clusters, one for each base type.

All vectors are then re-assigned to base types according to their Euclidean

distance from the cluster centroids; the base type selected is that of the cluster whose

centroid is closest to the vector.
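The cluster-and-reassign steps above can be sketched as follows. Note the simplification: the initial estimate here uses the brightest channel instead of principal component analysis, purely to keep the example short. The mapping of channel index to base letter and the function name `call_bases` are also assumptions.

```python
BASES = "ACGT"  # assumption: color channel i corresponds to BASES[i]


def call_bases(vectors):
    """Assign one base letter to each four-dimensional intensity vector."""
    # Initial estimate (simplification: brightest channel, not PCA).
    labels = [max(range(4), key=lambda c: v[c]) for v in vectors]

    # One centroid per base type, built from the initial assignments.
    centroids = {}
    for base in range(4):
        members = [v for v, label in zip(vectors, labels) if label == base]
        if members:
            centroids[base] = [sum(channel) / len(members) for channel in zip(*members)]

    def squared_distance(vector, centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))

    # Re-assign every vector to the cluster with the nearest centroid.
    calls = [min(centroids, key=lambda base: squared_distance(v, centroids[base]))
             for v in vectors]
    return "".join(BASES[c] for c in calls)
```

Comparing squared distances avoids the square root while giving the same nearest centroid.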

Assignments are then converted into 4bit values; while 2bit values would normally be

sufficient to encode four values, additional information is required to mark invalid

(empty) object locations, therefore a 4bit value has been selected as it is a good

compromise between values packing and potential future extensions.

The total output generated by the base calling stage is then a file containing 26

consecutive arrays of bases, one per cycle, with each array containing one base value per

object per imaging location.

Using the high-resolution images generated by the new system the output is

approximately 1023*863 4bit values = ~441KB per imaging location per cycle, for a total

of ~11.7GB of data for a complete flow cell.
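The 4-bit packing and the per-location output size can be checked with a short sketch. The specific code values (0 to 3 for the four bases, 0xF for an invalid location) and the nibble order are hypothetical; the text only states that 4-bit values are used.

```python
INVALID = 0xF  # hypothetical marker for an empty object location


def pack_bases(codes):
    """Pack a sequence of 4-bit codes two per byte, first code in the high nibble."""
    packed = bytearray((len(codes) + 1) // 2)
    for i, code in enumerate(codes):
        shift = 4 if i % 2 == 0 else 0
        packed[i // 2] |= (code & 0xF) << shift
    return bytes(packed)


# Size of one imaging location in one cycle: 1023 * 863 objects.
print(len(pack_bases([0] * (1023 * 863))))
```

The resulting size is 441,425 bytes, in line with the ~441KB per imaging location per cycle quoted above.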

4 Experimental setup

The Polonator “H” system hardware has been designed, but had not been built at the time

of writing; therefore all the experiments were performed using a combination of a

synthetic image generator and a hardware platform that represents as faithfully as possible

the operating conditions of the new machine, as reported in Table 3.

In order to perform fair performance comparisons, experiments involving the old

software system have been designed to operate under conditions as similar as possible to


the new pipeline, and used the same hardware and software environment used to develop

and test the new system.

Table 3 Experimental Platform

Component    Value
Model        HP Z800 workstation
Processor    2 x Intel Xeon E5630 2.53GHz
Memory       12 GB
OS           Windows 7 Enterprise 64-bit
Compiler     Intel C++ Compiler v12.1

4.1 Synthetic image generator

As noted earlier the physical system was not available to conduct tests, therefore in order

to create repeatable performance and validity experiments a synthetic image generator

was designed and developed.

The image generator takes as input several parameters and creates sets of images

reproducing the standard operating conditions of both the old and the new machines.

Input parameters to the image generator are the following:

Image dimensions in pixels

Grid pitch in nm

Objects size in nm

Pixel size in nm

Phase in pixels

Offset in pixels

Rotation in degrees

Fill rate as percentage

4.2 Test image processing system

A full set of unit tests was designed to verify the validity of the image processing stages

before performance tests were performed; all the results reported below include data from

verification tests as well as performance tests.

All the tests involve pre-loading the set of images into main memory and then measuring

the peak image processing throughput; this choice was made to replicate the behavior of a

frame grabber that can DMA images directly into main memory.

5 Results

Two different sets of tests were performed. The first measures the performance

differential between the old and the new software using an image library generated by the

synthetic generator with parameters representing the old system operating conditions. The

second measures the performance of the new pipeline using an image library generated

by the synthetic generator with parameters representing the new system operating

conditions.

5.1 Test 1: old vs. new pipeline

Test configuration:


256 imaging locations

3 Sequencing cycles

No sequence alignment

Image generator parameters in Table 4

Table 4 Test 1 image generator configuration parameters

Image dimensions: 1000x1000 pixels

Grid pitch: 1500nm

Objects size: 550nm

Pixel size: 320nm

Phase (avg.): 1.5 pixels

Offset (avg.): 0.9 pixels

Rotation: 0.001 degrees

Fill rate: 95%

Old software stages performance (without base calling stage):

Pre-cycle - image segmentation stage: 176ms/image set

Cycles 1-26 - image registration stage: 16ms/image

Cycles 1-26 - objects extraction stage: 9ms/image

Total average: 39.66ms/image

New pipeline stages performance (1 thread – with base calling stage):

Cycle 1 - image phase detection stage: 1.1ms/image set

Cycles 2-26 - image registration: 18.9ms/image set

Cycles 1-26 - objects extraction stage: 1.6ms/image set


New pipeline stages performance (4 threads – with base calling stage):





Total running time/throughput:

Old software: 142.2 sec. => 41.2 MB/s

New pipeline (1 thread): 13.03 sec. => 450 MB/s

New pipeline (4 threads): 4.53 sec. => 1292 MB/s

The results show that the new pipeline exhibits a large performance benefit compared

with the old software. Most of the advantage is already visible in the single-threaded

implementation of the new pipeline, as it benefits greatly from the new algorithms that use

inherent knowledge of the object location distribution in the images.

The parallel implementation further increased the performance differential by both

reducing the latency of the most computationally intensive stages and at the same time


operating on multiple stages simultaneously, therefore maximizing the utilization of the

system’s resources.

5.2 Test 2: new pipeline using high-resolution images and fine grid pitch

Test configuration:

64 imaging locations

3 Sequencing cycles

Base calling included

No sequence alignment

Image generator parameters in Table 5

Table 5 Test 2 image generator configuration parameters

Image dimensions: 2560x2160 pixels

Grid pitch: 800nm

Objects size: 400nm

Pixel size: 320nm

Phase (avg.): 0.4 pixels

Offset (avg.): 0.2 pixels

Rotation: 0.001 degrees

Fill rate: 95%

Stages performance (1 thread):





Stages performance (4 threads):





Total running time/throughput:

1 thread: 25.6 sec. => 315 MB/s

4 threads: 10.9 sec. => 741 MB/s
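The reported throughput figures for both tests can be cross-checked from the test configurations and running times. The sketch below rests on three assumptions that are not stated in this section: four images are acquired per imaging location per cycle, each pixel occupies 2 bytes, and "MB" means 2^20 bytes. Under those assumptions the computed values agree with the reported ones to within about 1%.

```python
MIB = 2 ** 20  # assumption: "MB" in the report means 2**20 bytes


def dataset_mib(locations, cycles, width, height,
                images_per_cycle=4, bytes_per_pixel=2):
    """Total image data in MiB; both keyword defaults are assumptions."""
    total_bytes = (locations * cycles * images_per_cycle
                   * width * height * bytes_per_pixel)
    return total_bytes / MIB


test1 = dataset_mib(256, 3, 1000, 1000)  # Test 1 configuration
test2 = dataset_mib(64, 3, 2560, 2160)   # Test 2 configuration

# (label, data set size in MiB, running time in s, reported MB/s)
runs = [
    ("Test 1, old software", test1, 142.2, 41.2),
    ("Test 1, new pipeline, 1 thread", test1, 13.03, 450),
    ("Test 1, new pipeline, 4 threads", test1, 4.53, 1292),
    ("Test 2, 1 thread", test2, 25.6, 315),
    ("Test 2, 4 threads", test2, 10.9, 741),
]

for label, size, seconds, reported in runs:
    print(f"{label}: {size / seconds:.1f} MB/s (reported {reported})")
```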

The results show that the extraction stage has a much more significant impact on the

overall performance of the stages, making the multithreaded stages perform only slightly

better than their single-threaded equivalents; pipeline parallelism partially compensates for

this by aggressively overlapping the stages, thereby masking the additional latency the

extraction stage requires.
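The overlap between stages can be illustrated with a minimal sketch. It does not reproduce the scheduler used by the pipeline: it swaps in one plain thread per stage, connected by bounded queues, and the three stage functions are hypothetical placeholders for real image processing work.

```python
import queue
import threading

_STOP = object()  # sentinel that flows through the pipeline to shut it down


def run_stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)
            return
        outbox.put(func(item))


def run_pipeline(items, stages, depth=4):
    """Run items through the stages; different stages work concurrently."""
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=run_stage,
                         args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    results = []

    def collect():
        while True:
            item = queues[-1].get()
            if item is _STOP:
                return
            results.append(item)

    collector = threading.Thread(target=collect)
    for thread in workers + [collector]:
        thread.start()
    for item in items:
        queues[0].put(item)  # blocks when the first stage falls behind
    queues[0].put(_STOP)
    for thread in workers + [collector]:
        thread.join()
    return results


# Hypothetical placeholder stages.
def register(x):
    return x + 1


def extract(x):
    return x * 2


def call_bases(x):
    return str(x)


print(run_pipeline(range(5), [register, extract, call_bases]))
# ['2', '4', '6', '8', '10']
```

While one item is in the slowest stage, the other stages keep working on earlier and later items, so the slow stage's latency is partly hidden. Total throughput is still bounded by that slowest stage.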


6 Conclusions and future work

The new pipeline is ~10x faster than the old software due to a choice of different

algorithms that use system-specific knowledge and lower overall disk I/O.

The new pipeline also scales well on a modern commodity multicore system, running up to

~20x faster than the original software.

Furthermore, the new pipeline addresses validity issues present in the old software when

applied to the new hardware system, which would otherwise produce erroneous results.

While the current peak performance level supports 4 high-resolution cameras,

performance is barely sufficient to handle the desired throughput and therefore additional

improvements are required to guarantee sustained levels of performance for extensive

continuous usage.

In particular, more work is required to improve cache utilization, especially in the object

extraction stage, which is the critical bottleneck when processing high-resolution images.

Applying parallel execution and memory access tiling to the extraction stage should give

good results provided the system’s memory bandwidth limitations are not exceeded.
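As a rough illustration of the decomposition this would involve, the sketch below splits an image into rectangular tiles and processes them with a pool of workers. It is not part of the current pipeline: the function names are hypothetical, and the per-tile work (a plain pixel sum) is only a stand-in for the extraction computation.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_bounds(width, height, tile):
    """Yield (x0, y0, x1, y1) rectangles that cover the image exactly once."""
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            yield x0, y0, min(x0 + tile, width), min(y0 + tile, height)


def process_tile(image, bounds):
    """Stand-in for per-tile extraction work: sum the pixels in the tile."""
    x0, y0, x1, y1 = bounds
    return sum(sum(row[x0:x1]) for row in image[y0:y1])


def process_tiled(image, tile=4, workers=4):
    """Process all tiles with a worker pool and combine the partial results."""
    height, width = len(image), len(image[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda b: process_tile(image, b),
                           tile_bounds(width, height, tile))
        return sum(partial)


image = [[x + y for x in range(10)] for y in range(6)]
print(process_tiled(image) == sum(map(sum, image)))  # True
```

In a native implementation the tile size would be chosen so that a tile's working set fits in cache. Python threads do not speed up CPU-bound work of this kind, so the sketch shows only how the work is partitioned.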

Grid rotation is not currently detected automatically, and it needs to be input as a global

parameter; a method based on the Radon Transform can be applied to perform an initial

calibration and then used throughout one full experiment, as the grid rotation value is

dependent only on the camera CCD sensor alignment relative to the flow cell, and is

therefore not expected to vary within the same experiment.

The most computationally intensive algorithms within the stages (DFT/iDFT/Extraction)

can be moved to a GPU; while this would increase the overall cost of the system, it would

provide additional performance, which can be translated into more accurate processing.

References

[BAJ04] Peter Bajcsy. GRIDLINE: automatic grid alignment in DNA microarray

scans. In IEEE Transactions on Image Processing, Vol. 13, Issue 1, Page

15, January 2004.

[BAJ06] Peter Bajcsy. An Overview of DNA Microarray Grid Alignment and

Foreground Separation Approaches. In EURASIP Journal on Applied

Signal Processing, Pages 1–13, 2006.

[DE09] Deepa J, Tessamma T. Automatic Gridding of DNA Microarray Images

using Optimum Subimage. In International Journal of Recent Trends in

Engineering, Vol. 1, No. 4, May 2009.

[HGP] Human Genome Project.

http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml

[INT] Intel Corporation. Intel Threading Building Blocks Tutorial.

http://threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Tutorial.pdf. 2003.

[KAT02] M. Katzer, F. Kummert and G. Sagerer. Robust Automatic Microarray

Image Analysis. In Proceedings of the International Conference on

Bioinformatics: North-South Networking, Bangkok, 2002.


[LAR08] Monica G. Larese, Juan C. Gomez. Automatic Spot Addressing in cDNA

Microarray Images. In Journal of Computer Science and Technology

(JCS&T), Vol. 8 No. 2, July 2008.

[NES00] Moriel S. NessAiver, Subhasish Biswas. Image Registration Using a

Discrete Fourier Transform Implementation Of the Decoupled Automated

Rotation and Translation Algorithm (DFT-DART). In Proceedings of

ISMRM, Denver, CO, 2000, 586.

[MO00] S. K. Moore. Understanding The Human Genome. In IEEE Spectrum,

Pages 33-42, November 2000.

[POL] Polonator G.007 system and software. http://www.polonator.org/

[POR06] Gregory J. Porreca, Jay Shendure, George M. Church, Polony DNA

Sequencing. In Current Protocols in Molecular Biology. Unit Number:

UNIT 7.8. Harvard Medical School, Boston, Massachusetts. 2006.

[SIS10] Joko Siswantoro, Automatic Gridding for DNA Microarray Image Using

Image Projection Profile. In Proceedings of the 6th IMT-GT Conference

on Mathematics, Statistics and its Applications (ICMSA2010). Universiti

Tunku Abdul Rahman, Kuala Lumpur, Malaysia. 2010.

[WIS05] Sompong Wisetphanichkij, Kobchai Dejhan. Fast Fourier Transform

Technique and Affine Transform Estimation-Based High Precision Image

Registration Method. In GESTS Int’l Trans. Computer Science and Engr.

Journal, Vol.20, No.1, 2005.

[WA05] Yu Wang, Frank Y. Shih, Marc Q. Ma. Precise Gridding of Microarray

Images by Detecting and Correcting Rotations in Subarrays. In

Proceedings of the 8th Joint Conference on Information Sciences, 2005.

[YA00] Y. H. Yang, M. J. Buckley, S. Dudoit and T. P. Speed. Comparison of

Methods for Image Analysis on cDNA Microarray Data. Technical Report

#584, Department of Statistics, University of California at Berkeley,

November 2000.
