Design Patterns for Parallel Vision Applications

Sanjay S. Kadam

A thesis submitted for the degree of
Doctor of Philosophy
in the University of London

UCL
University College London
Department of Computer Science

June 1998
Abstract

Computer vision is a challenging application for high performance computing. To meet its computational demands, a number of SIMD and MIMD based parallel machines have been proposed and developed. However, due to high costs and long design times, these machines have not been widely used. Recently, network based environments, such as a cluster of workstations, have provided effective and economical platforms for high performance computing. But developing parallel applications on such machines involves complex decisions about distribution of processes over the processors, scheduling of processor time between competing processes, communication patterns, etc. Writing explicit code to control these decisions increases program complexity and reduces program reliability and code re-usability.

We propose a design methodology based on design patterns which is intended to support parallelization of vision applications on a cluster of workstations. We identify common algorithmic forms occurring repeatedly in parallel vision algorithms and formulate these as design patterns. We specify various aspects of the parallel behaviour of a design pattern, such as process placement or communication patterns, in its definition or separately as issues to be addressed explicitly during its implementation. Design patterns promote program reliability and code re-usability since they capture the essence of working designs in a form that makes them usable in different situations and in future work.

The research work is concerned with presenting a catalogue of design patterns to implement various forms of parallelism in vision applications on a cluster of workstations. Using relevant design patterns, we implement representative vision algorithms in low, intermediate and high level vision tasks. The majority of these implementations show promising results. For example, given a 512x512 image, the image restoration algorithm based on the Markov random field model can be completed in less than 45 seconds on a network of 16 workstations (Sun SPARCstation 5). The same task takes more than 10 minutes on a single such workstation.
Acknowledgements

I thank my supervisors Dr. Graham Roberts and Prof. Bernard Buxton for providing invaluable suggestions, moral support and a cordial atmosphere for conducting this research work at the Department of Computer Science, University College London.

I am obliged to the Association of Commonwealth Universities for financing my research work through the British Council in the form of a Commonwealth Scholarship. I also thank Dr. Vijay P. Bhatkar, Executive Director, C-DAC (Centre for Development of Advanced Computing), Pune, India, for approving the required study leave (from my current employment) for completion of my Ph.D. work.

I am grateful to many colleagues and friends with whom I had both academic and non-academic interactions. My special thanks go to Jonathan Poole for introducing me to the concepts in parallel and distributed computing using UC++ (a concurrent extension of C++). I also thank Dr. Julia Schnabel for providing me with an application in medical imaging which served as an excellent example for parallelization as discussed in chapter 6. I am also thankful to Dr. Niladri Chatterjee for reviewing the early drafts of this thesis and providing numerous suggestions towards enhancing the technical quality of the material presented. Many thanks to Kamalendu Pal, Arif Iqbal, Adil Qureshi, Ihsan Khan, Giannis Koufakis, Dr. Bill Langdon and other researchers in the department for several discussions on both technical and social aspects of student life in London.

I express utmost gratitude to my parents and relatives for all their good wishes and blessings. I am also indebted to my wife Dr. Pratima for her continuous support, encouragement and concern for the completion of this research work. Her adjustments to the seclusion and frequent disruptions to family life during the course of this study were highly appreciated. And finally, thanks to our little sons KirtiRaj and Pranav who, with their ever cheerful appearance and innocent but invigorating smiles, helped me to relieve myself from the tensions and hardships of student life.
Contents

1 Introduction 16
  1.1 Overview 16
  1.2 Aims of this Research Work 20
  1.3 Contributions of the Dissertation 21
  1.4 Organization of the thesis 22

2 Parallelism in Computer Vision 24
  2.1 Parallel Computing 25
    2.1.1 Parallel computing systems 25
    2.1.2 Algorithmic classes 27
    2.1.3 Performance of parallel programs 28
  2.2 An Overview of Computer Vision 29
    2.2.1 Object recognition in 2D scenes 30
    2.2.2 Feature Detection 31
    2.2.3 Segmentation 33
    2.2.4 Resegmentation 35
    2.2.5 Properties and Relations 35
    2.2.6 Object Recognition 36
  2.3 Computational Characteristics 36
    2.3.1 Low level processing 37
    2.3.2 Intermediate level processing 37
    2.3.3 High level processing 38
  2.4 Parallel systems for vision 39
    2.4.1 Mesh connected systems 39
    2.4.2 Pyramids 40
    2.4.3 Hypercubes 40
    2.4.4 Shared memory machines 41
    2.4.5 Pipelined Systems and Systolic arrays 42
    2.4.6 Partitionable Systems 42
    2.4.7 General purpose parallel systems 43
  2.5 Computing on workstation clusters 44
    2.5.1 Cluster Configuration 44
    2.5.2 Advantages of workstation clusters 46
    2.5.3 Use of clusters 47
    2.5.4 Parallel computing using Clusters 48
  2.6 Parallelization using Design Patterns 50
    2.6.1 Design patterns 50
    2.6.2 Forms of Parallelism in Vision 52
    2.6.3 Design patterns for parallel vision 55
  2.7 Related work 56
  2.8 Summary 58

3 Design patterns for parallelizing vision applications 60
  3.1 Organization of patterns 61
  3.2 Description of design patterns - a template 63
  3.3 Farmer-Worker Pattern 65
  3.4 Master-Worker Pattern 70
  3.5 Controller-Worker Pattern 76
  3.6 Divide-and-Conquer Pattern 82
  3.7 Temporal Multiplexing Pattern 87
  3.8 Pipeline Pattern 91
  3.9 Composite Pipeline Pattern 98
  3.10 Summary 104

4 Low level algorithms 106
  4.1 Parallelization of low level algorithms 109
  4.2 Partitioning the image data 110
  4.3 Grey scale transformations 112
  4.4 Image filtering 113
    4.4.1 Convolution 114
    4.4.2 Rank filtering 119
    4.4.3 Spatial filters 120
  4.5 Fast Fourier transforms 122
  4.6 Image restoration 124
    4.6.1 Markov random field models for image recovery 124
  4.7 Summary 128

5 Intermediate level processing 130
  5.1 Region based segmentation 132
  5.2 Parallel Region-based segmentation 134
  5.3 Segmentation using Perceptual Organization 138
    5.3.1 Sequential Line grouping algorithm 139
    5.3.2 Parallel Line grouping algorithm 143
  5.4 Summary 145

6 High level processing 147
  6.1 Sequential geometric hashing algorithm 148
    6.1.1 Preprocessing Phase 149
    6.1.2 Recognition phase 150
  6.2 Parallel geometric hashing algorithm 152
  6.3 Multi-scale active shape description - an application 157
    6.3.1 An overview of the shape description process 158
  6.4 Parallelization of the shape description process 161
    6.4.1 Parallelization using Temporal Multiplexing pattern 162
    6.4.2 Parallelization using Pipeline pattern 164
    6.4.3 Parallelization using Composite Pipeline pattern 166
  6.5 Summary 168

7 Conclusion 170
  7.1 Aims and Motivation 170
  7.2 Research Review 172
  7.3 Contributions of the Research work 175
  7.4 Comparison with related work 176
  7.5 Future work 177

A Notation 180
  A.1 Pattern Diagram 180
  A.2 Object Interaction Charts 181

Bibliography 182
List of Figures

2.1 An overview of a typical vision based application 31
2.2 Processing levels in a typical vision based application 37
2.3 A 4-connected mesh, pyramid and a 3-dimensional hypercube of processing elements 40
2.4 Shared memory machines (interconnected by a bus and switching network) and systolic/pipeline systems 43
2.5 Common cluster configurations: bus, star and a ring 45
3.1 Farmer-Worker Pattern 66
3.2 Object Interaction in the Farmer-Worker Pattern 67
3.3 Master-Worker Pattern 72
3.4 Object Interaction in the Master-Worker Pattern 73
3.5 Controller-Worker Pattern 77
3.6 Object Interaction in the Controller-Worker Pattern 78
3.7 Convolution masks for finding a) horizontal edges and b) vertical edges 82
3.8 DC Pattern 83
3.9 Object Interaction in the DC Pattern 84
3.10 TM Pattern 88
3.11 Object Interaction in the TM Pattern 89
3.12 Vehicle identification system 92
3.13 Pipeline Pattern 93
3.14 Object Interaction in the Pipeline Pattern 94
3.15 Vehicle identification system 99
3.16 Composite Pipeline Pattern 100
3.17 Object Interaction in the Composite Pipeline Pattern 101
4.1 Partitioning of an image. a) Row partitioning b) Row partitioning with data that is to be overlapped and/or communicated 111
4.2 Performance of histogram equalization 113
4.3 Performance of the convolution operation using a 3x3 window 115
4.4 Performance of the convolution operation using a 15x15 window 116
4.5 Performance of the convolution operation on a 1Kx1K image 117
4.6 Performance of the Farmer-Worker pattern in the convolution operation on varying the processor load and number of subtasks (window size 15x15) 118
4.7 Performance of the sharpening operation using spatial filters (window size 11x11) 121
4.8 The data blocks needed to transpose the intermediate results 123
4.9 Performance of the image restoration algorithm using the MRF model (window size 3x3) 126
4.10 Performance of the Master-Worker pattern (in the image recovery operation using the MRF model on a 512x512 image) subject to the external load and load distribution 128
5.1 a) Partitioned image b) Corresponding quadtree 133
5.2 a) Distribution of subimages b) Merging of subimages 135
5.3 Performance of the parallel split and merge segmentation algorithm 136
5.4 Line Grouping 139
5.5 Relational constraints in the line grouping algorithm a) proximity b) collinearity and c) continuation 141
5.6 Indexing technique used in the line grouping process. a) search area for the base line b) the index array 142
6.1 Preprocessing phase in the geometric hashing algorithm a) Orthogonal coordinate system defined by the basis set b) Adding (model, basis) pairs in the hash table 150
6.2 Recognition phase in the geometric hashing algorithm a) Orthogonal coordinate system defined by the basis set b) Accessing and collecting (model, basis) pairs from the hash bins in the hash table 151
6.3 Hash table data structure a) symmetric indexing in the hash table b) hash entries in a normal hash table c) reduction in hash entries using symmetries 154
6.4 Performance of the geometric hashing algorithm for object recognition 156
6.5 Multi-scale shape description process a) propagation step applied on a set of five image slices b) multi-scale shape stack of an image slice computed in the shape focusing step (Figure (b) adapted from (Schnabel, 1997)). 160
6.6 Shape focusing performed at different scales in the image scale-space of an image slice using active contour models: (a) σ = 8 (b) σ = 4 (c) σ = 2 (d) σ = 1. Image (a) also contains the initial contour superimposed in black. All images are taken from (Schnabel, 1997). 160
6.7 Visualization of the stack contours (those displayed in Figure 6.6) stacked using triangulation. Image taken from (Schnabel, 1997). 161
6.8 Parallelization of the shape description process using a Pipeline pattern. The integer values denote sequential time (in seconds) required for executing corresponding components of the Pipeline pattern. 164
6.9 Parallelization of the multi-scale shape description process using a Composite Pipeline pattern 167
List of Tables

4.1 Execution time in (min:sec) for histogram equalization 113
4.2 Execution time in (min:sec) for the convolution operation 114
4.3 Performance of the Farmer-Worker pattern on varying the external load and number of subtasks. The execution times (min:sec) displayed are for the convolution operation (window size 15x15). 118
4.4 Execution time in (min:sec) for the rank filtering operation 120
4.5 Execution time in (min:sec) for the sharpening operation 121
4.6 Execution time in (min:sec) for the FFT operation 123
4.7 Execution time in (min:sec) for image restoration using the MRF model 125
4.8 Performance of the Master-Worker pattern when subjected to the external load. The execution times (min:sec) displayed are for the image restoration operation using the MRF model on a 512x512 image. 127
5.1 Execution time in (min:sec) for the parallel split and merge segmentation algorithm 136
5.2 Execution time in (min:sec) for various operations in the parallel split and merge segmentation algorithm applied on a 512x512 image 137
5.3 Execution time in (min:sec) for the line grouping process 144
6.1 Execution time in (min:sec) for the geometric hashing algorithm 156
6.2 Execution times in (seconds) for different implementations and individual steps of the shape description process 168
Chapter 1
Introduction
1.1 Overview
Computer Vision deals with the principles and techniques to extract and interpret useful information in a scene by capturing and analyzing images of that scene. It has applications in several areas such as remote sensing, autonomous vehicle guidance, industrial inspection, and medical imaging. Some of these applications, such as autonomous vehicle guidance, are real-time and involve algorithms which must complete their computations within a fraction of a second. Some applications requiring human interaction are interactive and must complete within a few seconds or less, depending on the type of interaction required. Other applications are batch applications which can tolerate a maximum latency of a few hours or even days. The nature of the algorithms involved in these applications is thus varied. But most of these algorithms are computationally intensive and require enormous computing power for their practical implementation.
Computer vision uses a broad spectrum of algorithms covering different areas such as image and signal processing, graph theory, mathematics, and artificial intelligence. From a computational perspective, vision processing is conveniently classified into three levels: low, intermediate, and high. Low level processing involves pixel-based transformations where uniform computations are applied at each pixel or a neighborhood around each pixel in the image data. These computations are mainly numeric and well structured. Intermediate level processing involves both numeric and symbolic computations. It comprises algorithms to form regions of interest in the image data, such as grouping of low level features (e.g. edges) into lines, arcs, or rectangular borders of an object. High level processing involves symbolic computations where data provided by the low and intermediate level algorithms is used for testing and generating hypotheses for object recognition. A typical vision application comprises low, intermediate and high level vision tasks/algorithms and thus involves both numeric and symbolic computations. Therefore, although vision has been identified as a grand challenge application for high performance computing, the computational characteristics of vision applications are different from the structured number crunching computations arising in most other grand challenge applications (Wang et al., 1996).
To meet the computational demands of vision tasks, several efforts have been directed towards providing high performance computing support for their practical implementation. A brief survey of the research efforts in high performance parallel computing for vision can be found in (Webb, 1994). These efforts can broadly be grouped into the following categories, based on the type of computing platforms they utilize: special-purpose hardware chips, SIMD based machines, specialized vision systems and general purpose parallel machines. Special-purpose hardware chips serve as accelerators for specific vision algorithms since they implement the computations in hardware. However, they are suitable only for specific well-structured low level vision algorithms, such as image convolution.
SIMD based multiprocessor machines such as meshes, array processors, hypercubes, and pyramids consist of simple processing elements connected by a communication network. These machines perform well for implementing most of the low level vision algorithms. But they are not well suited for high level vision algorithms since these algorithms involve nonuniform processing and complex data structures. Specialized vision systems are special purpose parallel machines designed to suit the requirements of vision tasks. They are capable of being partitioned into one or more independent SIMD and MIMD subsystems to match the computational characteristics of vision algorithms at various levels. For example, the image understanding architecture (IUA) (Weems et al., 1989) has three hierarchical levels of computing platforms to support processing of low, intermediate and high level vision tasks. Specialized vision systems, however, have complex architectures which involve significant design and development effort. The need to develop new system software for such machines results in huge system development costs.
General purpose parallel machines such as the IBM SP-2, Meiko CS-2, Intel Paragon, Cray T3D, and SGI Power Challenge have been used successfully for a variety of high performance computing applications. These commercial machines have not been developed for any specific applications, but are meant to be general purpose systems. Most of them have a similar architecture consisting of processors interconnected by a high speed network. These processors are those that are used in large uniprocessor workstations. These machines are typically organized as a single box that contains all the processor and memory modules interconnected by a special purpose interconnection network. Although there have been some attempts to use these machines for parallel vision applications (Wang, 1995), they are still not very popular with many organizational setups.
Recently, network-based computing environments, such as a cluster of workstations, have provided effective and economical platforms for high performance computing. A cluster of workstations offers several advantages for parallelizing and executing large applications on a relatively low-priced and readily available pool of machines. It provides multiple CPUs for parallel computing and dramatically improves virtual memory and file system performance. It can approach or exceed supercomputer performance for some applications and can easily be tuned to advances in processor and network technology (Anderson et al., 1995), (Turcotte, 1996). A cluster of workstations can incorporate heterogeneous architectures, so applications can select the most suitable computing resources for each computation.
But developing parallel applications on such machines involves complex decisions about distribution of processes over the processors, process synchronization, scheduling of processor time between competing processes, communication patterns, etc. Writing explicit code to control these decisions increases program complexity and reduces program reliability and code reusability. Also, the available machines and their capabilities can vary from one execution to another, and high communication costs can degrade the performance in many applications. Moreover, developers do not wish to spend time in low level parallel
programming in order to gain the advantages of potential parallelism in an application. Most of them use or modify existing parallel code to implement parallelism for their applications. In fact, some recent surveys of experienced parallel programmers have shown that about 69% modify existing programs or compose programs from existing blocks of code. The remaining 31% who start from scratch are typically computer scientists and applied mathematicians (Pancake, 1996).
The main goal of this thesis is to present a design methodology based on design patterns intended to support parallelization of vision applications on a cluster of workstations. Most of the parallel algorithms used in implementing vision tasks repeatedly use only a finite set of algorithmic forms. We identify these common algorithmic forms and formulate these as design patterns. We specify various aspects of the parallel behavior of a design pattern, such as process placement and communication patterns, in its definition or separately as issues to be addressed explicitly during its implementation. Design patterns ensure program reliability and code reusability since they capture the essence of working designs in a form that makes them usable in different situations and in future work (Coplien & Schmidt, 1995). The use of the design patterns would enable development of distributed software quickly, economically and reliably. Using a cluster of workstations, researchers can use the design patterns to implement many interactive and batch applications in computer vision.
A cluster of workstations is characterized by high communication costs and a variation in the speed factors of individual machines in the network. We need to address these issues while formulating the design patterns. One factor that minimizes the effect of high communication costs on performance is granularity. The granularity of an algorithm describes the amount of work associated with each task relative to communication. An algorithm that exchanges data between its processes after a small number of computations is called fine-grained, while an algorithm where the computations continue for a long time before communication is required is termed coarse-grained. Since a cluster of workstations is inherently coarse-grained, we need to formulate design patterns so that they implement coarse-grained parallelism. Also, the design patterns should distribute the work load according to the speed factors of individual machines in the network.
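The load-distribution idea can be sketched as follows. This is an illustrative Python fragment, not code from the thesis: it splits the rows of an image among machines in proportion to assumed, pre-measured speed factors (the function and its names are hypothetical).

```python
# Hypothetical sketch of speed-factor-based load partitioning: split n_rows
# image rows among machines in proportion to their speed factors, so that
# faster machines receive more work.

def partition_rows(n_rows, speed_factors):
    """Return per-machine row counts proportional to the speed factors."""
    total = sum(speed_factors)
    # Initial proportional shares, rounded down.
    shares = [int(n_rows * s / total) for s in speed_factors]
    # Hand the leftover rows to the fastest machines first.
    remainder = n_rows - sum(shares)
    order = sorted(range(len(speed_factors)),
                   key=lambda i: speed_factors[i], reverse=True)
    for i in range(remainder):
        shares[order[i % len(order)]] += 1
    return shares

if __name__ == "__main__":
    # Three workstations, one twice as fast as the others, 512-row image.
    print(partition_rows(512, [2.0, 1.0, 1.0]))  # → [256, 128, 128]
```

The same proportional split applies to any divisible work unit (image tiles, probe sets, etc.), which is how the patterns described later can absorb heterogeneity in the cluster.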
We begin our work by analyzing the computation and communication characteristics
of vision algorithms. We identify various forms of parallelism in vision algorithms and formulate design patterns to implement them. Each design pattern captures common designs used by developers to parallelize their applications. We present a catalogue of design patterns to implement various forms of parallelism in vision applications on a cluster of workstations. Using relevant design patterns, we implement representative vision algorithms in low, intermediate and high level vision tasks, and present the experimental results of the corresponding parallel implementations.
At the low level, we implement algorithms such as histogram equalization, convolution, image filtering using spatial filters, and image restoration using Markov random field models. At the intermediate level, we implement a region-based split and merge segmentation algorithm and a line grouping algorithm based on principles of perceptual grouping. At the high level, we implement a geometric hashing algorithm for object recognition. We also discuss parallelization of an application in medical imaging, namely, multi-scale active shape description of MR (magnetic resonance) brain images using active contour models.
1.2 Aims of this Research Work
The focus of the work in this thesis is to develop methodologies to support parallelization of vision applications on a cluster of workstations. The main goals of this thesis work are:
• To analyze computational characteristics of vision tasks and identify common algorithmic structures in their parallel implementations.
• To capture and articulate these algorithmic structures as design patterns in a form that makes them usable in different situations and in future work.
• To use these design patterns for implementing some representative vision algorithms in low, intermediate and high level vision processing.
• To evaluate the viability of using a cluster of workstations to parallelize vision applications.
1.3 Contributions of the Dissertation
The contributions of this dissertation are threefold. Firstly, we propose a design methodology based on design patterns intended to support parallelization of vision applications on a cluster of workstations. Secondly, we present coarse-grained parallel algorithms for some representative vision algorithms in low, intermediate and high level vision processing. Thirdly, we use relevant design patterns to implement these parallel algorithms on workstation clusters. These contributions are summarized as follows:
• Design patterns: We identify common algorithmic structures occurring repeatedly in parallel vision tasks/applications and formulate these as design patterns. We describe each design pattern using a template which outlines the intent, motivation, structure, interaction amongst the components and applicability of the design pattern. This description enables selection and use of a design pattern in different situations and in future work.
• Coarse-grained parallel algorithms: We present coarse-grained parallel algorithms and implementations for several vision tasks such as convolution, image filtering, image restoration, region-based segmentation, line grouping, and a geometric hashing algorithm for object recognition. We also present different parallel implementations of the multi-scale active shape description process (an application in medical imaging) using different design patterns.
• Implementation on a cluster of workstations: Using relevant design patterns, we perform parallel implementations of the selected representative vision tasks stated above. The results of these implementations enable critical assessment of the design patterns for achieving improvements in application performance. They also enable evaluation of the viability of using workstation clusters for implementing parallel vision applications.
1.4 Organization of the thesis
The remainder of the thesis is organized as follows:
• Chapter 2 reviews concepts and methods in several different areas related to parallel vision systems. We begin with a brief introduction to parallel computing systems and parallel algorithms. We then describe general principles and methods used in the field of computer vision, with specific emphasis on applications involving analysis of 2D scenes. We also describe the computational characteristics of vision algorithms and outline SIMD and MIMD based parallel machines used for parallelizing these algorithms. We then describe parallel computing on workstation clusters and discuss their advantages over conventional parallel machines. We present various forms of parallelism in vision algorithms and introduce the concept of design patterns intended to support parallelization of vision applications on a cluster of workstations. Finally, we outline some of the leading research efforts related to the work presented in this thesis.
• Chapter 3 presents a detailed description of each design pattern. We use a template to specify various aspects of the parallel behavior (such as process placement and communication patterns) of each design pattern. The templates outline the intent, motivation, structure, interaction amongst the components and applicability of the design patterns.
• Chapter 4 discusses parallelization of some low level vision algorithms such as histogram equalization, convolution, image sharpening using spatial filters, fast Fourier transforms, and image restoration using Markov random field models. Each algorithm is parallelized by using either the Farmer-Worker, Master-Worker or Controller-Worker pattern.
• Chapter 5 presents results of parallelization of some intermediate level algorithms such as region-based segmentation, and a line grouping algorithm based on the principles of perceptual organization. We use the Divide-and-Conquer pattern for implementing the parallel region-based segmentation algorithm. The line grouping algorithm is parallelized by using the Controller-Worker pattern.
• Chapter 6 presents results of parallelization of a high level vision algorithm, namely, geometric hashing for object recognition. We use a Farmer-Worker pattern to perform multiple matching operations (probes) for identifying each object in an image. In the last section of this chapter, we discuss parallelization of an application in medical imaging, namely, multi-scale active shape description of MR (magnetic resonance) brain images using active contour models. We discuss three different approaches to parallelizing the shape description process. Each approach uses a different design pattern, namely, Temporal Multiplexing, Pipeline or Composite Pipeline.
• Finally, chapter 7 presents concluding remarks and directions for future research.
Chapter 2
Parallelism in Computer Vision
Computer vision is a challenging application for high performance computing. Many vision applications are computationally intensive and involve complex processing. For a practical and real-time implementation of vision applications, high-performance computing support is essential. Over the past several years, parallel processing has been perceived to be an attractive and economical way to achieve the required level of performance in vision applications. Computational demands and real-time constraints associated with vision applications have induced several research efforts to explore the use of parallel computing resources for parallelizing vision applications (Webb, 1994). Most vision applications consist of image preprocessing followed by object identification. Although both these tasks involve a large number of computations, they embody different computational paradigms. As a result, several special and general purpose parallel machines have been proposed, developed and used in implementing parallel solutions to many vision algorithms.
This chapter gives an overview of the algorithms in computer vision and presents parallel systems and methodologies used in parallelizing vision applications. The chapter is organized as follows. Section 2.1 introduces some concepts in parallel computing. Section 2.2 gives an overview of the principles and methods involved in the field of computer vision. Section 2.3 discusses computational characteristics of vision applications and their classification into three levels, low, intermediate and high. Section 2.4 outlines different parallel systems used for parallelizing vision applications. Section 2.5 describes parallel
computing on a cluster of workstations. Section 2.6 proposes a methodology, based on design patterns, which can be used to parallelize a majority of the vision applications on network-based machines, such as a cluster of workstations. We also describe various forms of parallelism that can be applied to parallelize vision applications. Finally, section 2.7 outlines some of the leading research efforts which have been inspirational to the work presented in this thesis.
2.1 Parallel Computing
Parallel computing is concerned with applying multiple processors to solve a single computational problem for achieving better performance. This section begins with an introduction to parallel computing systems. It is followed by a description of abstract algorithmic classes characterizing different parallel algorithms. These classes are useful when discussing algorithms at a higher level.
2.1.1 Parallel computing systems
A parallel computer is a collection of processors and memory connected by some type of communication network. Parallel computing systems include a full spectrum of sizes and prices, from a collection of workstations attached to a local-area network, to an expensive high-performance machine with thousands of processors connected by high-speed switches (Duncan, 1992).
The architectures of the computing systems are commonly organized in terms of instruction streams and data streams (Flynn, 1972). The three cases that have become familiar terms to the parallel programmer are SISD (single instruction, single data), SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data). SISD computers are the traditional von Neumann computers that have a single instruction stream and a single data stream. All operations on these computers are logically sequential. In a SIMD parallel computer a single instruction stream is applied to multiple data streams. SIMD-based machines usually consist of a large number of simple processors
connected by an interconnection network. The MIMD model is the most general model of a parallel computer. A MIMD computer has multiple processing elements, each of which is a complete computer in its own right.
Although SIMD systems are easy to program, optimizing SIMD programs to yield acceptable performance is very difficult. As a result, SIMD computers have not been very popular for scientific computing. This makes MIMD systems the overwhelming majority of parallel systems, especially when a cluster of workstations is viewed as a single MIMD computer. A MIMD computer consists of processors and memory. The memory can be either shared or distributed among the processors. We can therefore consider two distinct programming models: shared memory MIMD and distributed memory MIMD. However, since the same issues of data locality and concurrency arise in both cases, we can view a MIMD computer in terms of a common programming model. One such model is the coordination model (Mattson, 1996), (Foster, 1995), where a parallel computation is viewed as a collection of distinct processes which interact at discrete points through a coordination operation. The term coordination refers to the basic operations used to control a parallel computer. It includes coordination operations for information exchange, process synchronization and process management. These coordination operations may vary in speed and structure; however, the overall model is essentially the same.
Describing parallel and distributed computers in terms of a coordination model is not as universally accepted as the von Neumann model (Mattson, 1996). However, such a model can be stated and used for programming parallel computers within a universal programming model. Although the computer systems differ, the difference is granularity (the ratio of computation to communication), and not the fundamental programming model (Mattson, 1996).
The programming model, in order to be useful, must be implemented as a programming environment. There are several programming environments supporting various incarnations of the coordination model which run well on parallel computers as well as on a cluster of workstations (Turcotte, 1993), (Cheng, 1993). One can develop parallel code using some high level language designed specifically to support parallel and distributed computing. Alternatively, one can use a sequential language combined with a coordination library
(often called a message-passing library), such as PVM (Sunderam, 1990).
Programs written for parallel MIMD systems fall into two categories: SPMD (single program multiple data) and MPMD (multiple program multiple data). For SPMD programs, each processor executes the same object code. The SPMD style of programming is easy to code since the programmer needs to maintain only a single source code. In contrast, MPMD programs allow each processor to have a distinct executable code. A programmer can split the program into different modules which can be developed and debugged independently or reused as components of other programs. An MPMD program requires less memory compared to its equivalent SPMD version (Mattson, 1996).
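The SPMD style can be sketched in a few lines. The following Python fragment is illustrative only (real implementations would use a message-passing library such as PVM or MPI): every simulated processor runs the same function, and its rank alone selects the slice of data it owns.

```python
# Hedged sketch of the SPMD style: one function stands in for the single
# object code executed by every processor; the processor's rank determines
# which portion of the data it works on. Ranks and the final reduction are
# simulated sequentially here.

def spmd_worker(rank, n_procs, data):
    """Same code on every processor; rank picks the local portion."""
    local = data[rank::n_procs]   # cyclic ownership of elements by rank
    return sum(local)             # purely local computation

def run_spmd(data, n_procs):
    # Run the n_procs identical "processes" and combine their partial
    # results (the reduction step a message-passing library would provide).
    return sum(spmd_worker(r, n_procs, data) for r in range(n_procs))

if __name__ == "__main__":
    print(run_spmd(list(range(10)), n_procs=3))  # same answer as sum(range(10))
```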
2.1.2 Algorithmic classes
Most of the parallel algorithms can be classified in terms of the regularity of the underlying data structures (space) and the synchronization required as these data elements are updated (time) (Angus et al., 1989), (Mattson, 1996). Based on this classification scheme there are four general classes of parallel algorithms:
1. Synchronous
Synchronous algorithms are those in which regular data elements are updated at regular intervals of time. They are regular in space and regular in time. They involve tightly coupled manipulation of identical data elements. Synchronous algorithms can be expressed in terms of a single instruction stream, and are therefore easily mapped onto SIMD computers. The parallelism is usually expressed in terms of the decomposition of the data. In fact, the data drives the parallelism, hence the name data parallelism. However, data parallelism is more general than SIMD parallelism, since data parallelism does not insist on a single instruction stream.
2. Loosely synchronous
A loosely synchronous algorithm synchronously updates data elements which differ from one processor to another. Loosely synchronous algorithms are regular in time but irregular in space. They have tight coupling between the tasks as in the synchronous case. However, due to variation in the data elements across the processors, the work loads can vary from processor to processor. Hence, loosely synchronous algorithms need some mechanism to balance the computational load among the processors of the parallel computer.
3. Asynchronous
Asynchronous algorithms do not have regular data updates, so the system proceeds with nonuniform and sometimes random synchronization. These algorithms are irregular in time and usually irregular in space, with unpredictable or nonexistent coupling between the tasks. This class of problems, other than the embarrassingly parallel subset described next, is the most rare. This is because programs for implementing asynchronous algorithms are difficult to construct. While synchronous and loosely synchronous algorithms are usually parallelized by focusing on data decomposition, asynchronous algorithms are usually parallelized by decomposition of the control, which is referred to as functional or control parallelism.
4. Embarrassingly parallel
Embarrassingly parallel algorithms are those asynchronous algorithms for which the tasks are completely independent and uncoupled. The parallelism in this case is trivial and the programs are among the simplest parallel programs to construct. Problems in this class are very common in parallel computing since their computations easily map into this model. In fact, any program consisting of a loop with compute-intensive and independent iterations can be parallelized using this model. Embarrassingly parallel programs usually utilize an SPMD style of programming combined with some mechanism for load balancing. Load balancing schemes can either be static or dynamic.
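As a rough illustration of the embarrassingly parallel model (not code from the thesis), the following Python sketch farms the independent iterations of a loop out to a pool of workers; on a workstation cluster each task would instead be shipped to a separate machine, but the program structure is the same. A thread pool is used purely to keep the sketch self-contained.

```python
# Minimal sketch of an embarrassingly parallel loop: the iterations are
# independent and uncoupled, so they can be handed to a worker pool in any
# order. On a cluster, each call to process_tile would run on a different
# workstation (e.g. dispatched via a message-passing library such as PVM).

from multiprocessing.pool import ThreadPool

def process_tile(tile):
    """One independent, compute-intensive iteration: sum of squares here."""
    return sum(x * x for x in tile)

def run_embarrassingly_parallel(tiles, n_workers=4):
    with ThreadPool(n_workers) as pool:
        # map() preserves order, so results line up with the input tiles.
        return pool.map(process_tile, tiles)

if __name__ == "__main__":
    tiles = [list(range(i, i + 4)) for i in range(0, 16, 4)]
    print(run_embarrassingly_parallel(tiles))
```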
2.1.3 Performance of parallel programs
The main goal of parallelism is to reduce the execution time of the whole program in order to produce the results faster. The performance estimates of a parallel program are based on the timings of its complete sequential code. The sequential program typically comprises two distinct sections of code: inherently sequential code and potentially parallel code.
The parallel content p of the program is defined as the ratio of the time taken to execute the potentially parallel code to the time taken to execute the whole code. The maximum theoretical speedup that can be achieved for a given program is a function of the parallel content p and the number of processors that will be used (N). It is given by Amdahl's law (Amdahl, 1988), which is stated as follows:

Theoretical speedup = 1 / ((1 - p) + p/N)    (2.1)
The theoretical speedup is lower than the ideal speedup, which reflects the ideal case that applying N processors to a program should cause it to complete N times faster. The size of the gap between the ideal and theoretical speedup is a function of the serial content of the program. This suggests that the amount of speedup that can be achieved for every program is limited beyond a certain number of processors. The gap between the theoretical and ideal speedup may change due to an increase in problem size (e.g. when the number of iterations is increased in a simulation). The gap narrows when the parallel content of the program increases due to an increase in problem size, while the gap may actually widen if the length of the serial bottlenecks also increases with problem size. However, the theoretical speedup is rarely achievable by a parallel application. There will actually be an observed speedup which is much lower than the theoretical speedup, reflecting the effect of external overhead on the total execution. This overhead comes from two sources: (a) the additional processor cycles expended in simply managing the parallelism, and (b) wasted time spent waiting for I/O, communication among processors, and competition from the operating system and other users (Pancake, 1996). Theoretical speedup does not take these factors into account.
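Equation (2.1) can be evaluated directly to see how the serial content caps the speedup. The following short Python sketch (illustrative, not from the thesis) tabulates the theoretical speedup for a program with 90% parallel content:

```python
# Numeric illustration of equation (2.1): theoretical speedup as a function
# of the parallel content p and the processor count N.

def theoretical_speedup(p, n):
    """Amdahl's law: 1 / ((1 - p) + p / N)."""
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    # With 90% parallel content, speedup saturates well below the ideal N:
    for n in (2, 8, 64):
        print(n, round(theoretical_speedup(0.9, n), 2))
```

Even with unlimited processors the speedup cannot exceed 1/(1 - p), i.e. 10 in this example, which is exactly the "gap" discussed above.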
2.2 An Overview of Computer Vision
The basic input in computer vision is a set of one or more image(s) of some scene, while the output is a description of the objects in that scene. An image, captured by a sensor, is an array of numbers called pixels that represent average brightness (gray level) or color values at discrete grid points in the scene. A gray level is usually represented as an 8-bit
integer having 256 distinct values, while each color value is represented by an n-valued tuple measuring brightness in a set of n spectral bands (e.g., red, blue and green).
We can view image processing as a prelude to computer vision. Image processing algorithms operate on images to extract and represent scene information. Higher level vision algorithms use scene information for object recognition and scene interpretation. Computer vision therefore encompasses processing from sensing to scene interpretation. The main areas of image processing include image enhancement and restoration (to improve the appearance of an image or to undo the effects of image degradations such as blurring or noise), image compression (to reduce an image to smaller sets of data which can be used for reconstruction of an acceptable approximation to the original image), and image reconstruction from projections (to construct images of cross-sections of an object by analyzing a set of projections taken from different directions, as in tomography).
Since the majority of applications in computer vision involve two dimensional (2D) scenes and the general goal is to recognize objects of interest in the images of these scenes, we will restrict our discussion to the analysis of 2D scenes. The following subsections outline general techniques involved in 2D object recognition. A detailed discussion dealing primarily with 2D vision can be found in (Ballard & Brown, 1982), (Rosenfeld & Kak, 1982), (Sonka et al., 1993), while an outline of both 2D and 3D vision is given in (Rosenfeld, 1988).
2.2.1 Object recognition in 2D scenes
Some examples of applications involving 2D scenes are: recognition of alphanumeric characters from an image of a document, recognition of blood cells from an image of a specimen seen through a microscope, and identification of houses and roads from high altitude aerial photographs. A general framework describing major techniques used in object recognition is shown in Figure 2.1. Feature detection techniques are used for detecting local features such as edges (at which the gray level changes abruptly), lines, curves, spots, and corners. Segmentation partitions the image pixels into homogeneous regions. Both segmentation and feature detection assign labels to the image pixels which indicate the classes to which
the pixels belong.
[Figure 2.1 shows the processing stages as a hierarchy, from bottom to top: Illumination; Real-World scene; Imaging device; Digitized Image of the Scene; Image enhancement/restoration; Feature Detection; Scene Features; Property Measurement; Segmentation/Resegmentation; Relational structure; Model Matching/Object Recognition; Recognition/Generic Description.]
Figure 2.1: An overview of a typical vision based application
Resegmentation techniques group the segmented regions in the image into groups or parts that satisfy certain geometric constraints. Property measurement algorithms compute various properties, such as area, perimeter, and average gray level, for such parts. Model matching or object recognition is then regarded as identification of image parts that correspond to the object parts and satisfy the appropriate constraints.
2.2.2 Feature Detection
We describe basic feature detection techniques used for detecting various local features in the image.
1. Templating
A subimage of a local feature that is to be detected is regarded as a template and matched at every possible position in the image for best fit. The degree of match or mismatch identifies the feature at the corresponding pixels. Thus if f(i, j) and t(i, j) represent pixel intensities in the image and the template, respectively, a measure of the mismatch between them can be expressed by (Rosenfeld, 1988) D(a, b) = Σ_{i,j} (t(i + a, j + b) - f(i, j))², where (a, b) is the displacement of the origin of t relative to that of f. The value of (a, b) that minimizes D represents the most likely position of the template in the image. This method is computationally intensive and does not give correct results if the image intensity varies significantly over areas of the size of the template.
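The mismatch measure above can be written down directly. The following Python sketch (a deliberately naive illustration, not the thesis's implementation) evaluates D at every displacement (a, b) and returns the minimizing position:

```python
# Naive template matching sketch: slide template t over image f, compute the
# sum-of-squared-differences mismatch D at every displacement (a, b), and
# return the displacement that minimizes it. Images are plain lists of rows.

def best_match(image, template):
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for a in range(ih - th + 1):          # row displacement
        for b in range(iw - tw + 1):      # column displacement
            d = sum((template[i][j] - image[a + i][b + j]) ** 2
                    for i in range(th) for j in range(tw))
            if best is None or d < best[0]:
                best = (d, a, b)
    return best  # (mismatch, row offset, column offset)

if __name__ == "__main__":
    f = [[0, 0, 0, 0],
         [0, 9, 8, 0],
         [0, 7, 9, 0],
         [0, 0, 0, 0]]
    t = [[9, 8],
         [7, 9]]
    print(best_match(f, t))  # exact match at displacement (1, 1): (0, 1, 1)
```

The nested loops make the cost clear: every displacement requires a full pass over the template, which is why the text calls the method computationally intensive.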
2. Edge detection
Edge detection techniques attempt to find pixels that lie on the borders between different objects in the image. Some standard approaches used are (Rosenfeld, 1988):
• Mask matching: A template representing ideal edges in various orientations is matched in the neighborhood of each pixel in the image. A pixel is classified as an edge pixel if the degree of such a match is sufficiently high. Sharp matches are obtained by using masks which are second differences of ideal step (or ramp) edges. This technique is also used for detecting lines, curves, spots and corners.
• Gradient magnitude: If Δx and Δy denote the first differences of the image gray level in the x and y directions, then the direction of maximum rate of change of gray level is tan⁻¹(Δy/Δx) and the gradient magnitude of this maximum rate of change is √(Δx² + Δy²). A pixel lies on an edge if the gradient magnitude at that pixel is sufficiently high. The differences Δx and Δy can be regarded as good approximations to the partial derivatives, and the digital image as a good approximation to a smoothly varying brightness function. The gradient magnitude approach has several refinements (e.g. local maximum selection, differences of averages, etc.) to overcome the effect of noise in the image. A detailed description of these can be found in (Rosenfeld, 1988).
• Laplacian Zero-crossing
In this approach, the Laplacian of the image gray level, i.e. the sum of the second differences in the x and y directions in the neighborhood of a given pixel, is computed. This sum is positive on one side of an edge and negative on the other; hence its zero-crossings define the location of the edges.
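The Laplacian and its zero-crossings can be sketched as follows (an illustrative formulation; border pixels are simply left at zero here):

```python
import numpy as np

def laplacian(img):
    """Sum of the second differences in x and y (zero at the border)."""
    img = np.asarray(img, dtype=float)
    lap = np.zeros_like(img)
    lap[1:-1, 1:-1] = (img[1:-1, 2:] + img[1:-1, :-2] +
                       img[2:, 1:-1] + img[:-2, 1:-1] -
                       4.0 * img[1:-1, 1:-1])
    return lap

def zero_crossings(lap):
    """Pixels where the Laplacian changes sign against a neighbor."""
    zc = np.zeros(lap.shape, dtype=bool)
    zc[:, :-1] |= (lap[:, :-1] * lap[:, 1:]) < 0   # sign change along x
    zc[:-1, :] |= (lap[:-1, :] * lap[1:, :]) < 0   # sign change along y
    return zc
```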
• Hough transforms
The Hough transform attempts to detect features, such as lines, circles, or curves, that have equations of a particular type by working in a suitable parameter space. For example, to detect arbitrary straight lines, a local curve detection process is applied to the image to get the edge pixels. A straight line is characterized by a slope θ and a distance r from the origin in the (r, θ) parametric space (Duda & Hart, 1972). If P is an edge pixel and it lies on a straight line, we can compute (r, θ) for P and mark the position (r, θ) in a discrete (r, θ) array. When this process is done for all edge pixels, and if the image contains many collinear P's, then there will be a position in the (r, θ) array that has a high count of marks.
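A minimal accumulator-array sketch of this voting process, using the (r, θ) normal parameterization r = x cos θ + y sin θ of Duda & Hart (the discretization choices below are illustrative, not taken from the thesis):

```python
import numpy as np

def hough_lines(edge_pixels, shape, n_theta=180):
    """Accumulate (r, theta) votes for each edge pixel; every line
    through a pixel (x, y) satisfies r = x*cos(theta) + y*sin(theta)."""
    h, w = shape
    r_max = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * r_max + 1, n_theta), dtype=int)
    for (x, y) in edge_pixels:
        for t_idx, theta in enumerate(thetas):
            r = int(round(x * np.cos(theta) + y * np.sin(theta)))
            acc[r + r_max, t_idx] += 1   # shift so negative r fits
    return acc, thetas, r_max
```

Collinear edge pixels pile their votes into the same cell, so peaks in the accumulator correspond to likely lines.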
2.2.3 Segmentation
Segmentation techniques are used for identifying pixels that form homogeneous regions in the image. Feature detection is a special form of segmentation since it identifies special types of pixels which have specific local properties. The common techniques used in segmentation (Rosenfeld, 1988) are described below.
1. Gray level thresholding
Regions in this segmentation are assumed to have an approximately constant gray level across the pixels constituting them. A plot of the frequency of each gray level in the image (called the image histogram) gives various peaks (surrounded by valleys) which represent ideal gray levels of the corresponding regions. The image can be segmented into regions by dividing the gray scale into intervals each containing a single peak. This method of segmentation is known as (multi-)thresholding; the points separating the intervals on the gray scale are called thresholds. Thresholding produces good results only if the peaks are well separated. Various refinements to this basic technique can be applied when the peaks overlap or are widely separated.
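Assuming thresholds have already been chosen from the histogram valleys, the segmentation step itself is a simple interval assignment (an illustrative NumPy sketch, not the thesis author's code):

```python
import numpy as np

def histogram(img, n_levels=256):
    """Frequency of each gray level (the image histogram)."""
    return np.bincount(np.asarray(img).ravel(), minlength=n_levels)

def threshold_segment(img, thresholds):
    """Assign each pixel the index of the gray-scale interval it
    falls in; interval boundaries are the given thresholds."""
    return np.digitize(np.asarray(img), sorted(thresholds))
```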
2. Relaxation techniques
These are iterative techniques used for obtaining a stable solution from an initial approximation. In the context of segmentation, each pixel is initially classified independently (with certain probabilities). These pixels are then reclassified iteratively to make the classification more consistent. The consistency criterion in segmenting the image into regions means that if a majority of the neighbors of a pixel P belong to a given class, so should P. If the goal is to detect edges or curves, the consistency criterion means that if P lies on an edge or a curve having a given slope at P, its neighbors in that direction should have a similar slope.
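A discrete caricature of such relaxation for region labels (probabilities replaced by hard labels and a majority vote over the 4-neighborhood; purely illustrative, not the probabilistic scheme of the literature):

```python
import numpy as np

def relax_labels(labels, iterations=5):
    """Iteratively reassign each pixel to the majority class among
    its 4-neighbors and itself (a simple discrete relaxation)."""
    labels = np.asarray(labels).copy()
    h, w = labels.shape
    for _ in range(iterations):
        out = labels.copy()
        for i in range(h):
            for j in range(w):
                votes = [labels[i, j]]
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        votes.append(labels[ni, nj])
                out[i, j] = max(set(votes), key=votes.count)
        labels = out
    return labels
```

A lone pixel misclassified inside a uniform region is absorbed by its neighborhood after one iteration.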
3. Global Homogeneity
In this approach, an entire region or curve is required to be a good fit (e.g., in the least squares sense) to some standard function. For example, an edge or curve may be required to be a good fit to a straight line or to a polynomial of higher degree. A split-and-merge approach can then be used for segmenting an image or a curve into globally homogeneous parts. In this approach, an entire image or curve is split (e.g., into quadrants or arcs) if the measure of the fit is not good enough. The splitting process is repeated for each part until the entire image or curve is partitioned into parts each of which has a good fit and no two adjacent parts can be merged to yield a good fit.
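The splitting half of split-and-merge can be sketched recursively, with "good fit" simplified to a small standard deviation from a constant-intensity model (an assumption made here for brevity; the thesis text allows more general model fits):

```python
import numpy as np

def split_blocks(img, tol, r0=0, c0=0):
    """Recursively split an image into quadrants until every part fits
    a constant-intensity model (std <= tol); returns (row, col, h, w)
    blocks. Single rows/columns are accepted as-is for simplicity."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    if img.std() <= tol or h <= 1 or w <= 1:
        return [(r0, c0, h, w)]
    hm, wm = h // 2, w // 2
    blocks = []
    for (r, c, hh, ww) in ((0, 0, hm, wm), (0, wm, hm, w - wm),
                           (hm, 0, h - hm, wm), (hm, wm, h - hm, w - wm)):
        blocks += split_blocks(img[r:r + hh, c:c + ww], tol, r0 + r, c0 + c)
    return blocks
```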
4. Region Growing; Edge or Curve Tracking
In region growing, a region is built by starting with a set of one or more 'similar' pixels (e.g. 'similar' by pixel difference) and gradually extending this set by repeatedly adding new pixels or connected sets which resemble pixels already in the set. The resemblance is usually governed by some homogeneity criterion (based on either gray tone or texture) that must be satisfied by the new pixels for inclusion in the region. The procedure for edges or curves is analogous. One starts with strong edge/curve pixels and extends them by adding neighboring edge pixels that continue the edge smoothly or preserve the good global fit. The main disadvantage of this approach is that the results of segmentation are order-dependent. They depend on the choice of the starting point and the order in which the pixels are examined for possible incorporation into the region, edge or curve.
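A sketch of region growing from a single seed, with the homogeneity criterion simplified to a gray-level tolerance around the seed value (the criterion and function name are illustrative, not from the thesis):

```python
from collections import deque
import numpy as np

def grow_region(img, seed, tol):
    """Grow a region from a seed pixel, adding 4-connected neighbors
    whose gray level is within tol of the seed's gray level."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    region = {seed}
    frontier = deque([seed])
    ref = img[seed]
    while frontier:
        i, j = frontier.popleft()
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if (0 <= ni < h and 0 <= nj < w and (ni, nj) not in region
                    and abs(img[ni, nj] - ref) <= tol):
                region.add((ni, nj))
                frontier.append((ni, nj))
    return region
```

The FIFO frontier makes the order of examination explicit, which is exactly the source of the order-dependence noted above.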
5. Hierarchical Techniques
Here one applies a local feature detection technique to a reduced-resolution image to detect 'coarse features' of various sizes (edges between large regions, thick curves, large spots, etc.). The finer image features can then be located by examining successively higher-resolution versions of the image in the vicinity of the detected features. This process requires only a succession of local searches and thereby reduces the cost of global search.
2.2.4 Resegmentation
Resegmentation methods are used for forming meaningful entities or parts by segmenting or grouping regions, edges or curves using certain geometric criteria. Examples of such entities are (Rosenfeld, 1988):

1. Connected components and holes: Segmentation of an image often results in many disconnected fragments. Resegmentation methods applied to such fragments result in maximal connected sets of pixels called connected components. Holes are regions surrounded by pixels of a connected component.

2. Borders, Arcs and Curves: Edges obtained in segmentation may be grouped together to form borders of objects or to form arcs and curves in the image. An arc may be further segmented into smoothly curved subarcs which may meet at corners.

3. Thinning, Shrinking and Expanding: These techniques are used for forming a skeleton of given objects or to dilate a given object in the image.
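Connected component labeling of a binary segmentation mask, item 1 above, can be sketched with a flood fill over 4-connected neighbors (illustrative code, not from the thesis):

```python
def connected_components(mask):
    """Label maximal 4-connected sets of foreground pixels; returns
    a label array (0 = background) and the number of components."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and labels[i][j] == 0:
                count += 1                 # found a new component
                stack = [(i, j)]
                labels[i][j] = count
                while stack:               # flood fill its pixels
                    a, b = stack.pop()
                    for na, nb in ((a - 1, b), (a + 1, b),
                                   (a, b - 1), (a, b + 1)):
                        if (0 <= na < h and 0 <= nb < w and mask[na][nb]
                                and labels[na][nb] == 0):
                            labels[na][nb] = count
                            stack.append((na, nb))
    return labels, count
```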
2.2.5 Properties and Relations
After the resegmentation process, many useful properties of the image parts can be measured by applying various techniques. Examples of such properties are: number of connected components or holes, area (number of pixels in the image part), perimeter, compactness (area/perimeter²) and elongatedness (area/thickness²). Many types of relations between image parts are important for object recognition, especially when these are between parts of objects. Most of these relations are defined in terms of relative property values such as lightness/darkness, size, positional reference (e.g. near, far, above, below, etc.), and orientation (parallel, etc.) (Rosenfeld, 1988).
2.2.6 Object Recognition
Object recognition may be achieved in several ways. In the graph-based approach, the objects are assumed to consist of parts having certain properties and relationships. They are represented as labeled graphs, with nodes representing parts, labeled with property values, and arcs representing relations, labeled with relation values. Two such graphs are created, one for the expected class of objects (called the object graph) and the other for the actual observed object classes in the image (the scene graph). Recognition is then achieved by finding subgraphs of the scene graph that are close matches to the object graph. The main limitation of this approach is that the observed image parts may not correspond to the expected object parts. This may be due to segmentation errors, where a single node may split into several nodes or several nodes may merge into a single node. Also, it is sometimes difficult to characterize objects as labeled graphs.
In another approach, although applicable only in some special cases, the objects are characterized by a set of ideal (global) property values or constraints on these values. Recognition then consists of matching an observed list with the ideal list. In certain cases an entire object is treated as a template and matched for optimal fit in the image. The graph-based approach, however, appears to be more general and is applicable in the majority of cases (Rosenfeld, 1988).
2.3 Computational Characteristics
Investigation of parallel processing solutions to vision applications necessitates understanding the nature of the computations involved. A typical vision application involves several stages of processing with a varying mix of symbolic and numeric processing. Vision applications are conveniently classified into three levels (Weems et al., 1989): low level, intermediate level, and high level, as shown in Figure 2.2. The low level processing involves well-structured local computations on the image data, while the other levels involve symbolic computations with irregular communication patterns.
Figure 2.2: Processing levels in a typical vision based application. (The figure shows a pipeline from the digitized image of the scene, through low level image-to-image processing (image enhancement/restoration, feature detection) yielding scene features, and intermediate level image-to-data-structures processing (segmentation/resegmentation, property measurement) yielding a relational structure, to high level data-structures-to-data-structures processing (model matching/object recognition) yielding a recognition or generic description.)
2.3.1 Low level processing
Low level processing involves image processing techniques such as image enhancement and restoration, and computer vision techniques of feature extraction and edge detection. Low level processing consists of pixel-to-pixel transformations, where uniform computations are applied at each pixel or at a neighborhood around each pixel in the image. The computations are numeric, regular and well suited to spatial parallelism. The communication pattern is local and processing across the image is identical. Although the computations required at low level are quite straightforward, the sheer volume of data to be processed demands enormous computing power.
2.3.2 Intermediate level processing
At the intermediate level, the basic unit of information is a description of low level image features such as edges, curves, and intensity regions. The algorithms in this category consist of both symbolic and numeric computations. The symbolic computations involve grouping of the low level features into meaningful entities such as sets of parallel lines, rectangular borders of an object, or planes. The algorithms at this level attempt to output descriptions of possible objects in the image data. The grouping operations (e.g. merging and splitting of regions, or linking and reorganizing of lines) involve a large amount of non-local communication. Fragments of lines require matching and merging across a large fraction of the image. Similarly, regions need to be merged and compared with others from possibly non-contiguous areas during the segmentation process. The communication pattern is thus data dependent and irregular.
2.3.3 High level processing
High level applications generate and test hypotheses for object recognition based on data provided by the low and intermediate levels of processing. The applications at this level attempt to recognize objects in the image using either graph-based or rule-based techniques on the object descriptions generated at the intermediate level. Processing at this level is very irregular and may involve dynamic scheduling of the computations.

The volume of data analyzed as the processing progresses from low levels to high levels is substantially reduced. However, the information content of the data is much higher. For example, where pixel values in low level processing represent brightness values in the image data, relevant data in high level processing may represent relative sizes or shapes of the objects. The data types shift from primarily numeric to primarily symbolic (Yalamanchilli & Aggarwal, 1994). Hence, the computations involving these data structures are complex (e.g. object recognition, automatic vehicle guidance). The source of computational burden shifts from large volumes of data to complex numerical and inferencing operations as the processing progresses from low to high level.
Low level algorithms are usually highly structured, repetitive and composed of fixed sets of operations with relatively few data-dependent branches. It is therefore possible to obtain relatively accurate estimates of the operation counts. But high level algorithms are highly data-dependent, and processing requirements can vary widely based on the application domain. For example, it is very difficult to estimate the number of features or objects to be processed, and even more difficult to estimate the amount of computation involved. It is therefore very difficult to establish the processing requirements and the source of parallelism (e.g. data/functional parallelism) in high level vision algorithms. Hence, the nature of the algorithmic characteristics changes as processing evolves from low to high levels. These characteristics have influenced the design of the several different parallel architectures discussed in the next section.
2.4 Parallel systems for vision
Many applications in computer vision have enormous data throughput and processing requirements which have far exceeded the capabilities of existing uniprocessor architectures. Parallel processing has been perceived as a necessary solution, and this has led to the conception, design, and subsequent analysis of a number of parallel systems for computer vision, which are described below (Weems et al., 1989), (Choudhary & Patel, 1990). The literature on parallel systems for computer vision is vast; however, most of the material can be found in (Duff & Levialdi, 1982), (Kendall & Uhr, 1982), (Uhr, 1987), (Page, 1988), (Prasanna Kumar, 1991), (Narayan et al., 1992), (Siegel et al., 1992).
2.4.1 Mesh connected systems
Mesh connected machines consist of a large number of simple processing elements arranged in a two-dimensional array, with each processing element connected to its four, six, or eight neighbors (Figure 2.3). The processing elements execute instructions broadcast by a central controller in SIMD mode. The organization of these machines matches the structure of the image data, which makes them suitable for low level image processing operations involving computations on individual pixels or small neighborhoods of pixels. They are, however, not suitable for intermediate and high level processing due to the simplicity and SIMD nature of the processing elements. Also, communication of information across long distances in the communication network is very time consuming. Some examples of mesh connected machines are (Choudhary & Patel, 1990), (Yalamanchilli & Aggarwal, 1994) the Massively Parallel Processor, the Binary Array Processor, the Distributed Array Processor (DAP) and the Cellular Logic Image Processor (CLIP) series of machines at University College London, the state of the art in the series being the CLIP7 processor array.
2.4.2 Pyramids
Pyramid machines consist of a large number of simple processing elements arranged in layers of mesh-connected arrays. With the exception of the array at the lowest layer, each array in the pyramid is one fourth as large as the array below it, and each processing element is connected to four processors in the array below it (Figure 2.3). Pyramid machines attempt to minimize the communication delays over large distances present in the mesh connected systems. However, due to the SIMD nature of the processing elements, these machines can be used to improve the speed of mostly low level algorithms, especially those which depend upon communication between pixels that are spatially distant in an image. Some examples of pyramid machines are (Choudhary & Patel, 1990) Non-Von, the intracomputer, PAPIA, and the MPP Pyramid.
Figure 2.3: A 4-connected mesh, a pyramid and a 3-dimensional hypercube of processing elements
2.4.3 Hypercubes
Hypercube machines consist of 2^n processors connected by a communication network that resembles an n-dimensional cube. Each processor is connected to n other processors and can communicate with any other processor using at most n communication links (Figure 2.3). Hypercube machines can be built to operate in both SIMD and MIMD mode.
These machines provide efficient communication between all the processors because the network has a small diameter. They can be used for most low level algorithms and some intermediate and high level applications. However, the algorithms need to be tuned to the underlying topology. Also, larger hypercubes are costly to build since they require many links to be added to each processor. An example of a SIMD hypercube is the Connection Machine CM-2, while the Intel Hypercube, NCube and Cosmic Cube are examples of MIMD hypercube machines (Choudhary & Patel, 1990).
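The addressing property described above (n neighbors, at most n links between any two of the 2^n processors) follows from bit manipulation of processor addresses; the small sketch below is illustrative, not part of the thesis:

```python
def hypercube_neighbors(node, n):
    """Neighbors of a processor in an n-dimensional hypercube:
    flip each of the n address bits in turn."""
    return [node ^ (1 << bit) for bit in range(n)]

def hop_distance(a, b):
    """Minimum number of links between two processors: the Hamming
    distance of their binary addresses (never more than n)."""
    return bin(a ^ b).count("1")
```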
2.4.4 Shared memory machines
Shared memory systems are usually MIMD machines consisting of several general purpose processors which have access to a large global memory through an interconnection network. In some cases the processors may also have a small amount of local memory. The interconnection network may be bus-based or may involve the use of a multistage switching network. The former involves a high-speed bus that connects the processors and the memory, while the latter provides links between processors and memory on a demand basis (Figure 2.4). The bus-based machines have limited scalability, since the common bus used in communication limits the number of processors that can be added to the system. Scalability is much better in the multistage switching network machines, but the interconnection networks are complex to build.

Shared memory machines are suitable for high level vision applications due to ease of programming and a uniform view of the system. The control of information and synchronization is much easier compared to that in distributed memory machines. However, due to slow access to global memory and the time penalty in process synchronization, such systems are efficient only for coarse-grained parallelism. Examples of shared memory machines (Choudhary & Patel, 1990) that use a bus architecture are Sequent Balance and Encore Multimax, and those which use multistage networks are BBN Butterfly, IBM RP3 and Cedar.
2.4.5 Pipelined Systems and Systolic arrays
The machines in this category consist of a pipeline of processing elements where data is fed in at one end of the pipeline. This data then passes through the processing elements in a serial fashion, and the results are obtained at the other end of the pipeline (Figure 2.4). These systems are used for performing a sequence of operations on a stream of input data. Such systems are useful in morphological operations where long sequences of local operations are performed on given image data. Examples of machines in this category are Cytocomputers and the systolic arrays (e.g. SLAP or Scan Line Array Processor) (Yalamanchilli & Aggarwal, 1994).

Several solutions have been developed for low level image processing algorithms using systolic arrays (Uhr et al., 1986). These solutions, called systolic solutions, are realized by organizing the flow of data streams through such arrays. Systolic solutions have been obtained for a variety of problems such as edge detection, connected component labeling, and fast Fourier transforms. However, the difficult problem in using these machines is to determine whether a systolic solution exists for a certain problem and, if so, to derive this solution. A representative of the state of the art in this effort is the CMU Warp project. The CMU Warp, a linear systolic array of 10 Warp cells or processing elements, was designed to provide high-speed operations for a number of low level image processing applications. But its flexibility makes it possible to program a variety of other applications as well. The array can operate as a purely systolic array or as a set of processors on a bus in SIMD or MIMD mode (Yalamanchilli & Aggarwal, 1994).
2.4.6 Partitionable Systems
Due to the varied nature of vision applications there were many efforts to design and develop architectures that supported both SIMD and MIMD types of processing. Such hybrid systems addressed the issues of flexibility, partitionability and reconfigurability needed in low, intermediate and high level vision applications. Some examples of such systems include PM4, PASM, REPLICA, Disputer, WISARD, VisTA, the Image Understanding Architecture (IUA) and NETRA. A brief description of all these systems can be found in
(Yalamanchilli & Aggarwal, 1994), (Choudhary & Patel, 1990), and (Prasanna Kumar, 1991). The common characteristic of these machines is that they consist of a large number of processing units which can be partitioned into groups that can operate in SIMD and MIMD mode. The architecture of the IUA (Weems et al., 1989), for example, has three different layers of processing units suitable for low, intermediate and high level vision algorithms. However, such systems involve considerable design and development costs due to their specialized and complex architecture.

Figure 2.4: Shared memory machines (interconnected by a bus and a switching network) and systolic/pipeline systems
2.4.7 General purpose parallel systems
General purpose parallel systems are the current high-performance parallel machines such as the IBM SP-2, Meiko CS-2, Intel Paragon, Cray T3D, and PARAM 10000 (developed by C-DAC, the Center for Development of Advanced Computing, Pune, India). Since they are based on workstation microprocessor technology, these systems are versatile and cost-effective compared to the specialized vision systems described earlier. These systems mainly consist of processing units, each with a local memory, and a high speed interconnection network. They are mostly tightly-coupled, i.e. the interconnects are system-specific with point-to-point links between the processors. Their major disadvantage is that it is difficult for a parallel application to use the resources efficiently. Also, the system-specific interconnects do not provide the flexibility of adding existing machines as hosts. They
cannot incorporate heterogeneous architectures, hence applications cannot select the most suitable computing resources for each computation. Therefore, although tightly-coupled systems always support faster communication, their advantage is likely to shrink over time (Steenkiste, 1996).
2.5 Computing on workstation clusters
During the past several years, network-based computing environments, such as a cluster of workstations, have proved to be an attractive alternative for high-performance computing over the conventional parallel machines. This is due to rapid advances in microprocessor technology and the emergence of high-speed networks having a network bandwidth of the order of a gigabit per second (Boden et al., 1995), (Steenkiste, 1996). A cluster of workstations offers several advantages for implementing high-performance computing solutions. It provides multiple CPUs, large memory, stable software, and heterogeneous computing environments for developing high-performance computing solutions to many computation-intensive problems. It is believed that future computing environments will slowly migrate towards the concept that 'the network is the computer' (Turcotte, 1996).
2.5.1 Cluster Configuration
A workstation cluster is basically a collection of workstations connected by a commodity network, such as Ethernet or ATM. The three common network topologies employed with workstation clusters are shown in Figure 2.5. The Ethernet or bus is the most commonly implemented network for clusters. Switch based interconnects are typically configured in a star arrangement, and are used exclusively with dedicated clusters. There are also hierarchical designs in which multiple types of interconnects are utilized.
The workstations in a cluster communicate with each other by exchanging messages or data packets transmitted using either the transmission control protocol (TCP) or the user datagram protocol (UDP). The former processes streams of data such that the reliability of message delivery is assured. The latter sends data packets whose delivery is attempted but not assured (Turcotte, 1996). Two software methods are used for communicating the messages: message passing and distributed shared memory. Message passing involves explicit transmission of messages between the systems. Distributed shared memory (DSM), which is usually implemented using message passing, involves accessing data without concern for its physical location.
Figure 2.5: Common cluster configurations: bus, star and a ring
W orkstation clusters have one obvious lim itation due to the use of relatively slow
network interconnection. T he interconnects have a low banduridth and a high latency.,
where, bandw idth refers to the speed a t which message d a ta is tran sm itted and latency is
the tim e spent in in itiating the transm ission of a message. E th ern e t, the m ost com m only
im plem ented netw ork for clusters, t ransm its inform ation a t lO M b/s and has a message la
tency of 4/rs. T here have been several efforts to design expensive high-speed in terconnects
to overcome the lim itations induced by the speed of E th ern e t. Typical exam ples include,
ED DI (100 M b /s), lliP P I (800 M b/s), VM E Bit3 (20 M b/s) and ATM OC-12 (622 M b /s)
(T urco tte , 1996).
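These two parameters are commonly combined in a first-order cost model, transfer time = latency + message size / bandwidth (a standard approximation, not a formula from the thesis):

```python
def transfer_time(message_bits, latency_s, bandwidth_bps):
    """First-order model of message delivery: start-up latency plus
    message size divided by bandwidth."""
    return latency_s + message_bits / bandwidth_bps
```

For example, a 10 Mb message on 10 Mb/s Ethernet takes about a second of transmission time, so the start-up latency is negligible; for short messages the latency term dominates instead.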
The need to m aximize the network perform ance (high bandw idth and low latency),
particu larly for parallel applications, has yielded unique solutions. A recent exam ple of one
such network is M yrinet (Boden et ah, 1995). It consists of a collection of w orkstations,
the netw ork com prising links and switches to route the d a ta , and the network interface
between the w orkstations and the network links. T he netw ork interface consisting of
a special processor can transfer blocks of d a ta to allow for the overlap of com putation
and com m unication. One way message latencies of 100 ps and bandw idths of 255 M b /s
have been observed in a M yrinet-based system interconnecting several Sparc w orkstations.
(Boden e t ah, 1995).
Workstation clusters are simple to configure. However, it is important to identify the categories of applications which can be implemented on these systems most effectively. The applications which require the computational capabilities of high-performance computing systems can be categorized as follows (Turcotte, 1996):

• Capability demand, which includes megaproblems that require all the computational capabilities of any available system, including memory and CPU. Grand Challenge applications which require massive parallel processing fall into this category.

• Capacity demand, which includes applications requiring substantial, but far from ultimate, performance and making moderate demands on memory. These jobs are ideal candidates for workstation clusters.

Workstation clusters provide a practical and cost-effective computing solution for the capacity demand problems. They are complementary to, rather than practical replacements for, the general-purpose parallel computing machines.
2.5.2 Advantages of workstation clusters
Workstation clusters offer several advantages over the traditional parallel computing environments (Turcotte, 1996), as described below:

• Workstation clusters provide a simple, inexpensive and readily accessible computing platform to design, develop and implement parallel solutions to a wide range of applications. They offer excellent price/performance benefits in comparison with the traditional parallel computing solutions.

• Workstations provide large, cost-effective memory which is not available in most traditional parallel computers, and as problems continue to grow in complexity and detail, the availability of a large memory is as important as the processor speed.

• Workstation clusters offer stable software environments compared to dedicated parallel machines. Software environments such as operating systems, compilers, libraries, and software tools are yet to develop to a point of general acceptability for dedicated parallel machines.

• Clusters provide a cost-effective environment to study topics related to heterogeneous computing. It is generally believed that future high-performance computing systems will achieve maximum performance capabilities only by exploiting the benefits of heterogeneous computing environments.

• Clusters degrade gracefully. The entire cluster is not lost due to the failure of a single system in the cluster. Also, since clusters are created using commodity components, maintenance costs are usually much lower than for an equivalent investment in a dedicated parallel computer.
2.5.3 Use of clusters
Clusters can be used as enterprise clusters or dedicated clusters (Turcotte, 1996). Enterprise clusters are configured with workstations that are owned by different individuals or groups. The machines in this type of cluster are normally heterogeneous (multivendor), and are almost exclusively connected via Ethernet. This type of clustering relies on individual owners contributing their unused computing cycles to a shared pool. The individual owners expect to receive more resources than they contribute. Enterprise clusters are controlled and managed by management software. This software enables effective use of the collective idle time available on most workstations. This idle time can be used to process jobs of several different users in the group. The management software ensures that the systems of individual owners are not saturated when they try to use their own systems. The individual owners can specify how their system will participate in the resource pool.

Several papers have proposed different schemes for sharing resources in enterprise clusters, where the main idea is to identify idle machines in the network and schedule background jobs on them with minimum disruption to the individual owners of the machines. When the owner resumes activity at a workstation, the job is either suspended, terminated, or moved to another machine in the cluster. These efforts have resulted either in speeding up individual jobs or programs by locating idle resources (Alonso & Cova, 1988), (Mutka & Livny, 1987), or in simply achieving higher levels of machine utilization through load balancing or load sharing (Theimer & Lantz, 1988), (Litzkow et al., 1988), (Tandiary et al., 1996), (Clark & McMillin, 1992).
Dedicated clusters are installed as substitutes or replacements for traditional parallel
computing systems. They consist of individual workstations managed by a single group
which administers the cluster like a central mainframe. They are usually interconnected
by high-speed networks such as FDDI, SOCC, and HiPPI (Turcotte, 1996). Dedicated
clusters usually have a control workstation which manages the job queue and acts as
an interface to the rest of the cluster. The control system can be used to dynamically
partition the cluster to execute interactive jobs (e.g. code development, graphics, etc.),
serial batch jobs and jobs that have been parallelized.
2.5.4 Parallel computing using clusters
Workstation clusters, both enterprise and dedicated, can be used as parallel computing
environments for implementing parallel solutions to a wide range of applications. There
have been several papers which have addressed the issues involved in solving a single
problem on a collection of workstations. Silverman and Stuart (Silverman & Stuart, 1989)
have used the cluster as a loosely coupled message passing parallel computer to solve some
asynchronous algorithms in numerical analysis. Magee and Cheung (Magee & Cheung,
1991) have proposed a supervisor-worker programming model to distribute computations
over a set of workstations.
Atallah et al. (Atallah et al., 1992) have proposed a resource management technique
called coscheduling or gang scheduling. It involves dividing a large task into subtasks which
are then scheduled to execute concurrently on a set of workstations. The subtasks need to
coordinate their execution by starting at the same time and computing at the same pace.
Wang and Blum (Wang & Blum, 1996) have developed a small message-passing library to
implement iterative numerical algorithms which require synchronization at the end of each
iteration (synchronous algorithms). Finally, there have been attempts to demonstrate the
capability of workstation clusters to solve some grand challenge problems (Beguelin et al.,
1991), (Nakanishi & Sunderam, 1992).
Two commonly used approaches to parallelize applications using clusters are:
• Extension of existing sequential languages (e.g. C++, FORTRAN) to handle nec-
essary communications and synchronization (see (Wilson & Lu, 1996) for several
concurrent C++ extensions).
• Defining new programming languages or an environment based on object-oriented,
functional or logical paradigms.
There are several software systems, such as Express, Linda, p4, PVM, and MPI, which
are used for creating parallel applications on workstation clusters. A comprehensive review
of these systems is contained in (Turcotte, 1993). This section briefly describes character-
istics of the Parallel Virtual Machine system, which is used as a programming environment
in this thesis.
Parallel Virtual Machine (PVM) (Beguelin et al., 1992) was developed at Oak Ridge
National Laboratory, Tennessee, and is the most popular system for developing parallel
applications on workstation clusters (Turcotte, 1993). PVM is a software library which
allows utilization of a heterogeneous network of parallel and serial computers as a single
computing resource. It is based on the message passing model (the coordination model dis-
cussed in Section 2.1.1). An application in PVM consists of multiple components, each of
which implements a particular functional process. There are four categories of components
in PVM: process management, interprocess communication, synchronization and service
(status checking, buffer manipulation, etc.). The PVM model is based on asynchronous
processes which are typically executed as individual programs (e.g. heavyweight Unix
processes) on each system in the cluster. The communication between the processes
occurs via explicit message passing.
2.6 Parallelization using Design Patterns
Most parallel programs are coded in terms of high level constructs where
the functions for communication, synchronization, and sometimes even computation are
rolled into a single routine. This style of parallel code development increases program
complexity and reduces program reliability and code reusability. Writing explicit parallel
code for parallelizing various applications on a cluster of workstations has some additional
problems too. The available machines and their capabilities can vary from one execution
to another, which can sometimes lead to a significant reduction in parallel performance.
Also, about 69% of parallel programmers (Pancake, 1996) modify or use existing blocks
of code to compose new programs. Since most parallel programs, especially those
in vision, utilize a rather small set of recurring algorithmic structures, it is meaningful to
identify and formulate these algorithmic structures as design patterns. Such decoupling
would reduce program complexity and increase code reusability in different situations and
in future software development.
2.6.1 Design patterns
The concept of a design pattern was introduced by architect Christopher Alexander, who
described the recurring themes in architecture as design patterns (Alexander, 1979).
A pattern represents a replicated similarity in a design, and in particular a similarity that
can be customized and tuned to human needs and comforts. Thus, an arch on every
door and window of a room is a pattern, yet it does not specifically imply the size of the
arches, their height from the floor nor their framing. The idea introduced by Christopher
Alexander has inspired software designers over the past decade to discover (and rediscover)
software architectural patterns in the software people develop. In software, design patterns
are software abstractions that occur repeatedly while developing software solutions for
problems in a particular domain such as business data processing, telecommunications,
distributed communication software, and parallel vision processing (Gamma et al., 1994).
Design patterns capture the static and dynamic structures of the solutions that occur
repeatedly when developing applications in a particular domain (Coplien & Schmidt,
1995), (Buschmann et al., 1996). They articulate proven design techniques for developing
software solutions in a particular context. Capturing and articulating key design patterns
helps to enhance software quality by addressing basic challenges in software development.
These challenges include communication of designs among the developers; accommodating
new design paradigms or styles; resolving reusability and portability issues; and avoiding
development traps and pitfalls that are usually learned only by costly trial and error
(Coplien & Schmidt, 1995).
Design patterns serve as a good communication medium. When several software de-
velopers are discussing various potential solutions to a problem, they can use the pattern
names as a precise and concise way to communicate complex concepts effectively. Design
patterns are extracted from working designs. They capture the essential parts of a design in
a compact form, including specifics about the context that makes the patterns applicable or
not. This compact representation helps developers and maintainers understand the archi-
tecture of a system, which allows more effective software development (Beck et al., 1996).
Patterns promote design reuse, where routine solutions with well-understood properties
can be reapplied to new problems with confidence (Monroe et al., 1997). Encouraging
and recording the reuse of best practices can lead to significant code reuse. A collection
of design patterns would help developers produce good designs faster and would provide
alternatives when applied to particular situations.
The design patterns in parallel vision systems, implemented on network-based
machines (such as a cluster of workstations), are the software components which distribute
and execute computations of various vision applications on these machines. Developing
a parallel implementation for an application in such an environment usually involves a
sequence of steps. These steps include a) partitioning the application into different tasks,
b) using a suitable parallel programming language or tool to concurrently implement
(map) these tasks on a given number of workstations, and c) managing the low level
programming details such as marshalling data, sending and receiving messages, and process
(task) synchronization. The partitioning, mapping and communication structure of the
parallelization process of an application is a parallel programming paradigm that can be
used to parallelize any other application with a similar computational structure. The
design patterns essentially capture these parallel programming paradigms and relieve the
user from tedious parallelization details.
The main advantages in using design patterns for parallelizing vision applications on
a cluster of workstations are:
1. The design patterns can be developed to utilize a readily available pool of work-
stations which, for some applications, can approach or exceed the performance of
the fastest dedicated machines, which may not be locally available.
2. A design pattern decouples the details of parallel implementations from the user.
3. A design pattern can be reutilized to parallelize any application with a similar
computational structure as implemented by that pattern.
2.6.2 Forms of Parallelism in Vision
Many vision applications can be parallelized by using various forms of parallelism. Each
form of parallelism is a simple organizational technique that can be used for designing
and developing parallel algorithms for a certain class of problems. Identifying various
forms of parallelism in vision applications would help in capturing and articulating key
design patterns in parallel vision systems. Many of these forms are variants of the class of
algorithms described in section 2.1.2.
Data Partitioning
In this form of parallelism, the image array is partitioned into adjacent regions or subim-
ages and each subimage is processed in parallel by a different processor. This type of
parallelism is suitable for low level processing operations, such as image filtering and
convolution. The regions may overlap at the boundaries of the subdivisions to enable
processing of the pixels at the region boundaries.
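The overlap idea can be sketched as follows. This is a minimal illustration, not code from the thesis (which targets PVM on workstation clusters): Python threads stand in for the worker processors, a list of rows stands in for the image array, and a 3-point vertical mean filter stands in for a real low-level operator; the helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def smooth_rows(image, lo, hi):
    """Apply a 3-point vertical mean filter to rows lo..hi-1.

    Rows lo-1 and hi (the overlap with neighbouring regions) are read
    but never written, mirroring the boundary overlap described above.
    """
    h, w = len(image), len(image[0])
    out = []
    for r in range(lo, hi):
        row = []
        for c in range(w):
            above = image[max(r - 1, 0)][c]   # replicate at image border
            below = image[min(r + 1, h - 1)][c]
            row.append((above + image[r][c] + below) / 3.0)
        out.append(row)
    return out

def parallel_smooth(image, workers=4):
    """Partition the image into horizontal strips, one per worker."""
    h = len(image)
    bounds = [(i * h // workers, (i + 1) * h // workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: smooth_rows(image, *b), bounds)
    result = []
    for part in parts:
        result.extend(part)
    return result
```

Because no worker writes into another worker's region, the subimages can be processed fully independently.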
Synchronous Iteration
In the case of Synchronous Iteration, each processor performs the same iterative computation
on a different region of image data. The processors, however, must be synchronized
at the end of each iteration, and hence no processor can start the next iteration until
all the processors have finished the previous iteration. The need for synchronization is
due to the fact that data produced by a given processor during the i-th iteration is used by
other processors during the (i+1)-th iteration. This form of parallelism is suitable for iterative
smoothing and sharpening operations on the image data.
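A small sketch of this scheme, under the same simplifying assumptions as before (threads in place of processors, a 1-D signal in place of image data, hypothetical names): a `threading.Barrier` plays the role of the end-of-iteration synchronization, with a second barrier separating each iteration's read and write phases so no worker reads a value its neighbour has already overwritten.

```python
import threading

def synchronous_smooth(data, iterations, workers=2):
    """Each worker repeatedly averages its slice with its neighbours.

    No worker starts iteration i+1 before all workers have finished
    iteration i, because neighbours' iteration-i values are read in
    iteration i+1.
    """
    n = len(data)
    barrier = threading.Barrier(workers)
    bounds = [(i * n // workers, (i + 1) * n // workers) for i in range(workers)]

    def worker(lo, hi):
        for _ in range(iterations):
            # Read phase: uses neighbours' values from the previous iteration.
            new = [(data[max(j - 1, 0)] + data[j] + data[min(j + 1, n - 1)]) / 3.0
                   for j in range(lo, hi)]
            barrier.wait()      # all workers have finished reading
            data[lo:hi] = new   # write phase
            barrier.wait()      # all workers have finished writing
    threads = [threading.Thread(target=worker, args=b) for b in bounds]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return data
```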
Algorithmic Parallelism
In algorithmic parallelism, the algorithm is partitioned into several independent parts
and each part is processed by a separate processor concurrently. Each processor works
independently and requires no explicit synchronization or communication with the other
processors. For example, the two convolutions in Sobel edge detection can be executed
concurrently on separate processors (Downton et al., 1996).
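The Sobel example can be sketched as follows (again a toy illustration with threads in place of processors; the standard Sobel kernels are used, and the |Gx|+|Gy| magnitude approximation is one common convention, not necessarily the one used in the cited work):

```python
from concurrent.futures import ThreadPoolExecutor

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def convolve(image, kernel):
    """3x3 convolution over the interior pixels; the border stays zero."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            out[r][c] = sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
                            for i in range(3) for j in range(3))
    return out

def sobel_magnitude(image):
    # The two convolutions are independent, so each runs on its own worker
    # with no synchronization until the final combination step.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fx = pool.submit(convolve, image, SOBEL_X)
        fy = pool.submit(convolve, image, SOBEL_Y)
        gx, gy = fx.result(), fy.result()
    return [[abs(a) + abs(b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]
```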
Temporal Multiplexing
In this form of parallelism, instead of splitting individual image data sets, complete image
data sets are processed in parallel by different processors. This form of parallelism is also
identified as processor farming (Downton et al., 1996). However, the temporal multiplexing
form of parallelism is sometimes also associated with operator parallelism in low level
image processing. Low level operators, such as erosion and dilation in image morphology,
can be cascaded into several stages (Pitas, 1993). Each stage, implemented on a separate
processor, processes a complete image data set. The output of any stage is the input of the
subsequent stage. For example, if F is an operator operating on image I, then F can be
cascaded into several stages as follows:

    O = F(I) = F_n(F_{n-1}(. . . (F_2(F_1(I))) . . .))        (2.2)
But the cascaded implementation of an operator/algorithm also represents the pipeline
form of parallelism (described later). In this thesis we do not associate this form of
parallelism with temporal multiplexing. We identify temporal multiplexing with a type
of processing that involves implementation of an algorithm/operator as a single program
unit, operating on complete image data sets.
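Temporal multiplexing in this sense can be sketched in a few lines (a toy illustration: threads stand in for processors, a whole-frame threshold stands in for the single program unit, and the names are hypothetical). Note that the frames are distributed whole, in contrast to the subimage splitting of data partitioning.

```python
from concurrent.futures import ThreadPoolExecutor

def threshold_frame(frame, t=128):
    """The complete-image 'program unit': binarize one whole frame."""
    return [[1 if p >= t else 0 for p in row] for row in frame]

def multiplex(frames, workers=3):
    """Each worker executes the same program on complete frames."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(threshold_frame, frames))
```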
Workpool
In the workpool mode of parallelism, a central pool of similar computational tasks is
maintained. A large number of workers repeatedly retrieve tasks from the pool, perform
the required computations, and possibly add new tasks to the pool. The computation termi-
nates when the task pool is empty. This technique is used for implementing solutions to
combinatorial problems in high level vision such as tree or graph searches. A large number
of tasks are generated dynamically which can be picked up by any worker process.
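A minimal workpool sketch (again threads for workers, and a synthetic tree-counting task standing in for a real search problem): a shared `queue.Queue` is the central pool, workers push newly generated subtasks back into it, and `Queue.join()` detects the pool-empty termination condition.

```python
import queue
import threading

def workpool_tree_size(branching, depth, workers=4):
    """Count the nodes of a uniform tree via a dynamic task pool.

    Expanding a node below the depth limit pushes its children back into
    the pool, mimicking the dynamic task generation of tree search.
    """
    pool = queue.Queue()
    pool.put(0)                       # root task, at level 0
    count = [0]
    lock = threading.Lock()

    def worker():
        while True:
            level = pool.get()
            if level is None:         # sentinel: pool drained, shut down
                pool.task_done()
                return
            with lock:
                count[0] += 1
            if level < depth:         # generate new tasks dynamically
                for _ in range(branching):
                    pool.put(level + 1)
            pool.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    pool.join()                       # all real tasks processed
    for _ in range(workers):
        pool.put(None)
    for t in threads:
        t.join()
    return count[0]
```

Children are enqueued before `task_done()` is called, so `pool.join()` cannot return while dynamically generated work is still outstanding.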
Pipeline
In pipelining, the application algorithm is sequentially subdivided into various components
arranged in a pipeline. Each component is processed by a different processor and performs a
certain phase of the overall computation. The data flows through the entire pipeline structure
via the neighboring component processors.
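A sketch of this structure under the usual toy assumptions (threads for component processors, queues for the inter-stage data streams, hypothetical names):

```python
import queue
import threading

def run_pipeline(stages, inputs):
    """Connect `stages` (one function per pipeline component) with queues.

    Each stage runs on its own thread, consuming the output stream of its
    predecessor; a sentinel propagated down the pipeline marks end of data.
    """
    qs = [queue.Queue() for _ in range(len(stages) + 1)]
    STOP = object()

    def stage_runner(fn, src, dst):
        while True:
            item = src.get()
            if item is STOP:
                dst.put(STOP)     # pass end-of-stream to the next stage
                return
            dst.put(fn(item))

    threads = [threading.Thread(target=stage_runner, args=(f, qs[i], qs[i + 1]))
               for i, f in enumerate(stages)]
    for t in threads:
        t.start()
    for x in inputs:
        qs[0].put(x)
    qs[0].put(STOP)
    out = []
    while True:
        item = qs[-1].get()
        if item is STOP:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out
```

With several items in flight, the stages overlap in time: while stage 2 processes item 1, stage 1 is already processing item 2.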
Pipeline Processor Farm
The Pipeline Processor Farm (Downton et al., 1996) is a generalized form of pipeline parallelism
where each component in the pipeline may be parallelized by the various parallel programming
techniques described earlier.
2.6.3 Design patterns for parallel vision
Based on the various forms of parallelism discussed in section 2.6.2, we present the following
design patterns to parallelize vision applications on a cluster of workstations. A detailed
description of each pattern is presented in chapter 3.
• Farmer-Worker pattern: This pattern consists of a farmer process (or component)
which is continuously polled for computational work by a set of independent
worker components. It is mainly used for implementing data parallelism, where the
image data is divided into different subimages which are processed independently by
different workers. There is no communication between the worker components.
• Master-Worker pattern: This pattern consists of a master component which dis-
tributes the work to various worker components. Each worker component commu-
nicates with neighboring worker components to exchange intermediate results.
This pattern is used for parallelizing synchronous data parallel algorithms.
• Controller-Worker pattern: This pattern is similar to the Master-Worker pattern
described above, except that each worker may communicate with every other worker
in the pattern. It is used for parallelizing a class of problems in which each object
or subtask of the problem needs to interact with every other object or subtask.
• Divide-and-Conquer pattern: This pattern is used for structuring applications in
which either the data or the application algorithm is divided into several subtasks.
Each subtask may be executed on a single processor or may be further divided (recur-
sively) into smaller subtasks.
• Temporal Multiplexing pattern: This pattern is used for processing several data sets
or a sequence of image frames on multiple processors. Each processor processes a
complete data set and executes the same program code.
• Pipeline pattern: This pattern consists of a pipeline of components executed con-
currently in a specified order. It is used in situations where a vision application can
be divided into components which are by themselves independent, and interact with
each other only by using the output data stream of one component as an input data
stream to another.
• Composite Pipeline pattern: Structurally, this pattern is similar to the pipeline
pattern. The only difference is that each component of the pipeline can itself be
parallelized using any of the design patterns stated above.
2.7 Related work
In this section, we outline some of the leading research efforts which have been inspirational
to the work presented in this thesis. Although the concept of design patterns is new,
the idea of identifying and capturing common forms as software abstractions in parallel
software systems is a decade old.
Zimran et al. (Zimran et al., 1990) have proposed a set of implementation machines
used for parallel implementation of various applications on shared and distributed mem-
ory parallel machines. A layer of implementation machines (IM) is introduced between the
application and the physical machine. The implementation machines consist of common
parallel programming paradigms such as master/slave, pipeline, and pyramids. Each
implementation machine is associated with a mathematical representation that can predict
the performance bounds for distributed computations. An application is developed in
terms of one or more implementation machines which are then implemented efficiently on
the underlying hardware. The IMs are made available in the form of modifiable templates
which implement the relevant communication and synchronization functions. However, the
set of implementation machines presented does not address issues related to domain-specific
problems. They represent only the general forms of parallel programming paradigms.
Magee and Cheung (Magee & Cheung, 1991) have described the supervisor-worker
paradigm to distribute the computations of an application on a network of workstations.
They have discussed the robustness and load balancing properties of this paradigm and
have applied simple formulae to predict the performance of an algorithm implemented
using this paradigm. The supervisor-worker paradigm consists of a supervisor process
that distributes the computational work to a number of worker processes, each working
independently of the other. However, only the embarrassingly parallel class of applications
can be parallelized using this paradigm.
Singh et al. (Singh et al., 1991) developed a system called FrameWorks which uses
templates to generate distributed applications on a network of workstations. Programs
are written as sequential procedures enclosed in templates. The templates hide the low
level parallelization details, such as communication and synchronization. A user selects
appropriate templates (e.g. pipeline, contractor, input/output) to describe the behavior
of a parallel program. The system then generates the code for implementing the com-
munication and synchronization between the processes. The concepts of the FrameWorks
system were later used to create another such system called Enterprise.
The Enterprise system, like FrameWorks, has a graphical interface by which the users
can create parallel applications using assets such as pipeline, master/slave, and divide-and-
conquer (Schaeffer et al., 1993). This system automatically inserts the necessary code for
communication and synchronization, relieving the users from low level programming de-
tails which include marshalling data, sending/receiving messages and synchronization.
However, both the FrameWorks and Enterprise systems do not support data parallelism or
complex synchronization, communication, and scheduling structures. Most of the par-
allelism that can be achieved in an application is obtained through pipelining and temporal
multiplexing. In these forms of parallelism the processors operate only on complete images.
Darlington et al. (Darlington et al., 1993) have proposed a set of higher-order parallel
forms called skeletons as the basic building blocks of a parallel program. They have also
provided program transformations which convert between skeletons, giving portability
across several different machines. A skeleton captures an algorithmic form common to a
range of programming applications. Each skeleton is associated with a set of architectures
on which efficient realizations of the skeleton are known to exist. The skeletons are also
associated with performance models which can be used to predict the performance of a
parallel program implemented using these skeletons. A set of transformations is used for
transforming one skeleton to another in order to suit the architectural requirements of dif-
ferent machines. However, the skeletons represent a general class of parallel programming
paradigms. They are not domain-specific and therefore need to be tuned and extended in
order to reflect the characteristics and control structures associated with the problems in
a given domain.
Downton et al. (Downton et al., 1996) have proposed a design methodology based on
a pipeline of processor farms (PPF) for parallelizing vision applications on MIMD machines.
Their design method enables parallelization of complete vision systems (with continuous
input/output) in a top-down fashion, where parallel implementations of individual algo-
rithms are treated as components in the design model. However, this design methodology is
implicit, i.e. it does not present a detailed description of the methods or designs used in par-
allelization of individual algorithms. For example, their paper identifies 'data parallelism'
as one of several methods for parallelizing vision algorithms. But 'data parallelism' can
be applied to both synchronous and embarrassingly parallel algorithms. Our work in this
thesis aims to make the design information in designs/methods for parallel vision systems
explicit. We abstract and document the design information in their design methodology
in the form of the Composite Pipeline pattern in this thesis.
2.8 Summary
In this chapter we have reviewed concepts and methods in several different areas related to
parallel vision systems. We began with a brief introduction to parallel computing sys-
tems and their classification as SISD, SIMD and MIMD machines, based on their instruction
streams and data streams. This was followed by a discussion on parallel algorithms and
their classification in terms of different algorithmic classes such as synchronous, loosely
synchronous, asynchronous, and embarrassingly parallel. These classes are useful when
discussing computations at a higher level. We have also given a brief introduction to
measuring performance in parallel programs.
We then described general principles and methods used in the field of computer vision.
Our primary concern has been vision applications involving analysis of 2D scenes. We
presented different techniques and algorithms for feature detection, segmentation, reseg-
mentation and object recognition used in 2D vision. We also described the computational
characteristics of these algorithms and their classification into three levels: low, inter-
mediate and high. Low level algorithms are usually highly structured, repetitive and
composed of fixed sets of operations. Higher level algorithms, on the other hand, are very
irregular and may involve dynamic scheduling of the computations. The distinctive nature
of their characteristics has influenced the design and development of several different
parallel architectures in computer vision. Several such architectures comprising either
SIMD, MIMD or both SIMD and MIMD (partitionable) dedicated parallel machines have
been described.
We described parallel computing using workstation clusters and discussed their advan-
tages over conventional parallel machines. This was followed by an introduction to the
concept of design patterns. Design patterns are software abstractions that occur repeatedly
while developing software solutions for problems in a particular domain. Various forms
of parallelism in vision applications were identified in order to capture and articulate key
design patterns in parallel vision systems. Finally, we have outlined some of the leading
research efforts that have been inspirational to the work presented in this thesis.
Chapter 3
Design patterns for parallelizing
vision applications
Design patterns for parallel vision applications (introduced in section 2.6.3) represent
designs or methods used for implementing these applications on various parallel archi-
tectures. Some of these patterns, such as Farmer-Worker and Master-Worker, represent
common methods which can be used for parallelizing algorithms not only in vision but
also in other computing disciplines. But other patterns, such as Temporal Multiplexing
and Composite Pipeline, are suitable only for parallelizing applications in vision (for an
example, see (Downton et al., 1996)).
There have been several efforts in the past to present different design methods for
parallelizing vision algorithms/applications on various parallel architectures (Downton
et al., 1996), (Stout, 1987). However, there have been no attempts to abstract and
document the design information in these design methods. This chapter attempts to
fill this gap by capturing and documenting this design information in the form of design
patterns. These design patterns have been formulated to represent common algorithmic
structures in various parallel vision algorithms/applications described in (Kendall & Uhr,
1982), (Uhr, 1987), (Stout, 1987), (Page, 1988), (Prasanna Kumar, 1991), (Hussain, 1991),
(Pitas, 1993), (Wang et al., 1996), (Downton et al., 1996). A documentation or catalogue
of key design patterns for parallel vision applications would give standard names and
definitions to the techniques used in parallelization of these applications. By making
design knowledge explicit in the form of design patterns, experienced and novice designers
would be able to reuse the designs in different situations (Coplien & Schmidt, 1995).
Design patterns are useful in turning an analysis model into an implementation model
(Beck et al., 1996).
This chapter describes a system of design patterns used for parallelizing the majority
of vision applications on coarse-grained machines, such as a cluster of workstations. A
system of patterns for parallel vision applications consists of many different patterns used
in different situations. In order to facilitate their effective use and to help developers
in selecting and implementing the right patterns for a given situation, it is necessary to
describe the patterns in a uniform way. Such a description must address all the aspects
relevant to a pattern's characterization, detailed description, implementation, selection
and comparison with other patterns. A system of patterns should address issues con-
cerning the composition of patterns into more complex and heterogeneous structures. A
comprehensive and well-defined system of patterns forms a uniquely powerful and flexible
vehicle for expressing software systems (Buschmann & Meunier, 1995).
This chapter is organized as follows. Section 3.1 describes different classification
schemes used in classifying the patterns at various levels of abstraction. Section 3.2
outlines the template used for describing the design patterns. The remaining sections
describe different design patterns used in parallelizing various vision applications on a
cluster of workstations. The patterns in these sections have also been published in (Kadam
et al., 1997), (Kadam et al., 1996).
3.1 Organization of patterns
Design patterns vary in their level of abstraction and are usually organized into different
categories based on some classification scheme. Such a classification scheme is believed to
provide a guide when selecting a pattern for a particular design situation. Gamma et al.
(Gamma et al., 1994) classify design patterns according to their functionality. The design
patterns can have either a creational, structural, or behavioral purpose.
Chapter 3. Design patterns for parallelizing vision applications 62
Creational patterns concern the process of object creation. The Singleton pattern
(Gamma et al., 1994) is a creational pattern used to ensure that a class or a component of
some design pattern has only one instance. Structural patterns deal with the composition
of classes or objects. The Proxy pattern (Gamma et al., 1994) is a structural pattern
which makes the clients or users of a component communicate with a representative
rather than with the component itself. Behavioral patterns characterize the ways in which
classes or objects interact and distribute responsibility. The Iterator pattern (Gamma
et al., 1994) is a behavioral pattern which provides a way to access the elements of an
aggregate object sequentially without exposing its underlying representation.
The classification scheme proposed by Gamma et al. has certain limitations. The
classes of functionality in this classification scheme are general in nature rather than
being specific to any application domain. Hence, it is difficult to select appropriate
patterns for solving or structuring problems in a given application domain. Buschmann
and Meunier (Buschmann & Meunier, 1995) therefore proposed a classification scheme
which classifies patterns into different classes based on different levels of abstraction in
software systems. They identified three different classes of patterns, namely architectural
frameworks, design patterns and idioms. This classification scheme was later used to
formally describe a system of patterns for software architecture (Buschmann et al., 1996).
An architectural framework expresses a fundamental paradigm for structuring software
systems. It provides a set of predefined subsystems and includes rules and guidelines for
organizing the relationships between them. For example, the Pipeline pattern described in
section 3.8 can be considered an architectural pattern when it is used for structuring a
vision application that can be divided into a sequence of independent subsystems, executed
in a specified order. Each subsystem interacts with its neighboring subsystems only by
exchanging streams of data. An application structured using a Pipeline pattern may be
parallelized by executing the application subsystems concurrently. The execution and
the interactions of the application subsystems are implemented by the corresponding
components of the Pipeline pattern.
An architectural framework consists of several smaller units called design patterns.
Design patterns describe the basic scheme for structuring subsystems and components
of a software system, as well as the relationships between them. Design patterns are
medium-level patterns, smaller in scale than the architectural patterns. The Master-Worker
pattern described in section 3.4 is an example of a design pattern which can be used for
distributing the computations of an application to identical worker components. Idioms, on the
other hand, are low-level patterns which are specific to some programming language. An
idiom describes the aspects of both design and implementation of the specific components
in a pattern by using the features of a given language. The Singleton pattern described
earlier is an example of an idiom.
The classification scheme based on different levels of abstraction in software systems
(also termed system granularity) can sometimes be ambiguous. A pattern can be used
to structure either a complete software system or just a single component or subsystem.
A Pipeline pattern, for example, can be part of a larger system. Its classification as an
architectural pattern or a design pattern therefore depends on the context. Similarly, the
boundary between design patterns and idioms is imprecise. In fact, Buschmann and
Meunier (Buschmann & Meunier, 1995) acknowledged this ambiguity when they proposed their
classification scheme. Nevertheless, this classification scheme provides a reasonable hierarchy
for describing most of the patterns in software systems.
We do not follow any strict classification scheme, but rather use it as a general guide
to specify the type of patterns we propose and describe in this thesis. Using the
classification scheme formally used by Buschmann et al. to classify the patterns in their book
(Buschmann et al., 1996), we describe a system of patterns for parallel vision applications
at the level of architectural frameworks and design patterns. If not stated otherwise, we
use the term design patterns to represent all the patterns at various levels of abstraction.
Also, we use the terms pattern and design pattern as synonyms.
3.2 Description of design patterns - a template
We use a template to describe all the design patterns presented in this thesis. The template
provides a description of how each pattern works, where it should be applied and what
the tradeoffs are in its use. This description scheme for the patterns is closely related
to the ones proposed by Gamma et al. (Gamma et al., 1994) and Buschmann et al.
(Buschmann et al., 1996). Its intention is to support the understanding, comparison,
selection, and implementation of patterns within a given design situation. The template
used for describing each pattern is given below:
1. Pattern name
The name of the pattern, which conveys the essence of that pattern.
2. Intent
A short statement about the main functionality of the pattern and the problems
that it addresses.
3. Motivation
An example illustrating a concrete instance of the pattern. The motivational example
relates the pattern to its practical usage.
4. Structure
The structure of the pattern in terms of objects or components, described in both
textual and graphic representation. We use a variant of the object model (described
in Appendix A) to display the structure of the pattern.
5. Interaction
The interactions between the components of the pattern and with the outside
world are depicted. We adapt the object message sequence chart notation (described
in Appendix A) to describe the interactions between the components of a pattern.
6. Implementation
The general guidelines for implementing the pattern. These are, however, only
suggestions which should be suitably modified depending upon the needs of a given
problem.
7. Consequences
The consequences and trade-offs of using a pattern, and the parameters that can be
varied independently by using the design pattern. We describe the benefits and
potential liabilities of a pattern.
8. Applicability
The set of conditions and requirements that indicate when the pattern may be
applicable.
9. Known Uses
We provide examples of the use of the pattern in different situations. We also cite
some related efforts in using the pattern or its variants.
3.3 Farmer-Worker Pattern
Intent
The Farmer-Worker pattern, which provides dynamic load balancing, is used for
implementing embarrassingly parallel algorithms. The farmer component divides the problem
task into a collection of independent subtasks. The worker components grab individual
subtasks and perform identical operations on the data, before returning the transformed
values to the farmer for collating.
Motivation
Averaging is a simple image enhancement technique used for removing random noise
from an image. It uses linear local window operations to change
the pixel intensities in the corrupted image using the equation

f'(a, b) = (1/N) Σ_{(i,j)∈W} f(i, j)    (3.1)

where f is the noisy image, f' is the filtered image, and W is a set of N neighboring pixel points
around a point (a, b) in the image (Sonka et al., 1993). The averaging operation, using
the Farmer-Worker pattern, can be parallelized by dividing the image into subimages and
averaging these subimages concurrently on different processors.
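As an illustration only (not taken from the thesis), the sequential averaging operation of equation 3.1 can be sketched in Python; the function name and the list-of-lists image representation are assumptions:

```python
def average_filter(image, radius=1):
    """Replace each pixel by the mean of its (2*radius+1)^2 neighbourhood.

    `image` is a list of lists of grey values; border pixels use only the
    neighbours that fall inside the image (a common boundary convention,
    assumed here rather than specified by the thesis).
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for a in range(rows):
        for b in range(cols):
            total, n = 0, 0
            # sum the N pixels of the window W centred at (a, b)
            for i in range(max(0, a - radius), min(rows, a + radius + 1)):
                for j in range(max(0, b - radius), min(cols, b + radius + 1)):
                    total += image[i][j]
                    n += 1
            out[a][b] = total / n
    return out
```

A farmer would apply this same function to each subimage independently, which is what makes the operation embarrassingly parallel.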
[Figure: a Farmer component (SplitWork, SendSubtask, CollateResults, SendFinalResults) connected to p identical Worker components, each with RequestSubtask, ProcessSubtask and SendResults operations.]
Figure 3.1: Farmer-Worker Pattern
Structure
The Farmer-Worker pattern consists of a farmer component and several independent
but identical worker components or processes, as shown in Figure 3.1. The client interacts
with the farmer component to parallelize a certain application. The farmer component
is responsible for partitioning the application into several independent subtasks, starting
the worker components to process these subtasks, collecting the partial results from the
worker components, and finally returning the collected results to the client. The worker
components are responsible for processing the individual subtasks created and assigned by
the farmer. The Farmer-Worker pattern consists of one farmer and at least two workers.
Interaction
The interactions between the components of the Farmer-Worker pattern are shown in
Figure 3.2.
• The client requests the farmer to parallelize a given application.
• The farmer component divides the application into different subtasks and starts
several worker components to process these subtasks.
[Figure: message sequence between Client, Farmer and Workers (1..p): CallToParallelize, SplitWork, then repeated SendSubtask, ProcessSubtask, SendResults (RequestSubtask) exchanges, followed by CollateResults and SendFinalResults.]
Figure 3.2: Object interaction in the Farmer-Worker Pattern
• Each worker repeatedly requests a subtask, performs the specified computation on the
data in the subtask, and returns the results back to the farmer. This continues until
a termination condition is encountered.
• The termination condition occurs when there are no more tasks to be processed.
The farmer detects this condition and signals the worker components to terminate.
• The farmer collates the results returned by the workers for a given application. The
farmer returns the collated result to the client.
Implementation
The Farmer-Worker pattern can be implemented by following the steps described below:
1. Partition the work. Specify how the problem task can be divided into a collection of
independent subtasks. For the averaging operation, we could partition the image
into either horizontal or vertical blocks of subimages. Each subimage represents a subtask to be
processed. The subimages must also include the required pixel values at the boundaries
of the partition.
2. Combine the results. Specify how the final results should be collated from the partial
results obtained from the worker components. In the averaging example, the farmer
component simply collates the averaged subimages onto the output image without any
change.
3. Specify the interaction between the farmer and the workers. This interaction can be
implemented in at least three different ways: a) Each worker receives a subtask from the
farmer at the beginning. When a worker returns the partial results to the farmer, the
farmer collates these results and sends another subtask to the worker. b) A separate
component called a gatherer is created. While the farmer distributes the subtasks to the
workers, the gatherer collects the partial results from each worker. The gatherer then
returns the final collected result to the farmer. c) If the operation of collecting the partial
results is trivial or easily delayed to the end of the computation, the farmer can turn into
a worker after setting up the collection of subtasks in a common repository, such as a
subtask queue. The workers then fetch the subtasks from the subtask queue. However,
this implementation needs a shared counter to manage the subtask queue. In all three cases,
when there are no more subtasks to be processed, the farmer sends a termination message
to each worker (and to the gatherer in (b)). In the averaging example, as the farmer simply
collects the results returned by the workers, we use the first method to implement the
interaction between the farmer and the workers.
4. Implement the farmer and the worker components according to the specifications
outlined in the previous steps.
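The steps above can be sketched with a shared subtask queue (interaction variant (c)), here simulated with Python threads on a single machine rather than processes on a cluster; the function names are illustrative, not from the thesis:

```python
import queue
import threading

def farmer_worker(subtasks, process, n_workers=4):
    """Minimal Farmer-Worker sketch: workers grab subtasks from a shared
    queue at their own pace (dynamic load balancing)."""
    tasks = queue.Queue()
    for idx, sub in enumerate(subtasks):       # SplitWork
        tasks.put((idx, sub))
    results = [None] * len(subtasks)

    def worker():
        while True:
            try:
                idx, sub = tasks.get_nowait()  # grab the next subtask
            except queue.Empty:
                return                         # termination: no subtasks left
            results[idx] = process(sub)        # ProcessSubtask

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results                             # farmer collates, in order
```

The thread-safe queue plays the role of the shared counter mentioned in step 3(c); a faster worker simply completes more `get_nowait` calls than a slower one.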
Consequences
The Farmer-Worker pattern provides several benefits:
Dynamic load balancing: The Farmer-Worker pattern provides an even distribution of
the load when the computational requirements of the individual subtasks and the speed
of different processors in the parallel system vary significantly and unpredictably. The
worker components in a Farmer-Worker pattern grab the subtasks and process them
at their own pace. A faster processor or node of the parallel system grabs and
processes more subtasks than the slower nodes. Hence, the number of subtasks processed by
each worker is proportional to the speed of its corresponding node or processor. The
Farmer-Worker pattern therefore provides dynamic load balancing of the subtasks during
its execution.
Scalability and flexibility: It is possible to add new workers or change the existing algorithms
in the workers without major changes to the farmer. The client is not affected by these
changes. Similarly, it is possible to change the algorithms for partitioning the work or
coordinating the workers in the farmer component without affecting the client.
The Farmer-Worker pattern suffers from the following liabilities:
Feasibility: The Farmer-Worker pattern may not always be feasible. The activities of
partitioning the work, starting and controlling the workers, delegating the work amongst
the workers, and collecting the final results consume processing time. The pattern is
effective only when the time spent in these activities is significantly lower than the time
required to perform the computations in a given application.
Effectiveness: The Farmer-Worker pattern is effective only when there are more subtasks
than the number of processors. The parallelism in this pattern is expressed in terms
of the number of subtasks. When all the subtasks are processed, no further parallelism
is available in the application. On the other hand, too many subtasks with a relatively
low compute-to-communication ratio may lead to poor performance. A proper balance
between the granularity and the number of subtasks created is therefore critical for the
effectiveness of this pattern.
Applicability
The Farmer-Worker pattern represents a parallel programming paradigm for
implementing embarrassingly parallel algorithms. It can be used to parallelize any vision
application in which
• the data can be partitioned into several independent data sets
• each data set can be processed concurrently by different workers
• the processing of each data set does not require interaction between the worker
components to exchange intermediate results
Known Uses
The Farmer-Worker pattern has applications at various levels of vision processing. In
low-level processing, it can be used for parallelizing local window-based operations such
as convolution, edge detection, linear and non-linear (e.g. median) filtering, and image
thinning. At the intermediate level, it can be used to extract features of individual objects
concurrently. In high-level processing, it can be used for processing several features or
objects concurrently for object recognition. The algorithmic structure/motif represented
by the Farmer-Worker pattern is described in (Mattson, 1996).
3.4 Master-Worker Pattern
Intent
The Master-Worker pattern is used for parallelizing a class of problems which exhibit a
synchronous form of parallelism. The master component divides the problem into several
subtasks and distributes them to identical worker components. Each worker component
performs computations on its assigned subtask iteratively, and communicates the intermediate
results to its neighboring workers at the end of each iteration. The master component
collates the final results returned by the worker components after a fixed number of such
iterations.
Motivation
An extremum filter is a window-based non-linear operator which sharpens the blurred
edges in an image back to the original step edges (Kramer & Bruckner, 1975). The extremum
filter replaces the central pixel value within a filter window by the nearest extreme pixel
value occurring within the window. It can be expressed using the following equation

f'(a, b) = max{f(i, j)}  if max{f(i, j)} − f(a, b) ≤ f(a, b) − min{f(i, j)}
           min{f(i, j)}  otherwise    (3.2)

where f'(a, b) represents the new pixel value, and max{f(i, j)} and min{f(i, j)} represent
the maximum and minimum values (extreme values) occurring within a window centered
at the point (a, b). The extremum filter is applied iteratively so that the blurred edges converge
to the original step edges. Kramer et al. (Kramer & Bruckner, 1975) have reported
that at least 20-50 iterations were required to observe a complete convergence in a 27x33
image. The execution time required for operating on larger images can therefore be quite
significant. In fact, it can be seen that the computational complexity of this operator with
M iterations, operating on an n x n image and using an m x m window, is O(2Mm²n²). The
extremum filter operator can be parallelized by dividing the image into several subimages,
and filtering these subimages concurrently using different worker components. Each worker
component communicates the required boundary information to its neighboring workers
after every iteration. By using a set of P processors, the computational complexity of the
extremum filter operator can be reduced to O(2Mm²n²/P), subject to the communication
overheads.
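A minimal sequential sketch of the iterated extremum filter of equation 3.2 follows; the 3x3 window, list-of-lists image representation, unchanged border pixels and function name are all assumptions made for illustration, not details from the thesis:

```python
def extremum_filter(image, iterations=1):
    """Apply the Kramer-Bruckner extremum filter (equation 3.2) one or
    more times: each interior pixel is replaced by whichever of the window
    maximum or minimum is nearer to its current value."""
    rows, cols = len(image), len(image[0])
    for _ in range(iterations):
        out = [row[:] for row in image]        # borders carried over unchanged
        for a in range(1, rows - 1):
            for b in range(1, cols - 1):
                win = [image[i][j]
                       for i in (a - 1, a, a + 1)
                       for j in (b - 1, b, b + 1)]
                hi, lo = max(win), min(win)
                # pick the nearer extreme value (ties go to the maximum)
                out[a][b] = hi if hi - image[a][b] <= image[a][b] - lo else lo
        image = out
    return image
```

In the parallel version each worker would run the inner double loop on its subimage and exchange one-pixel-wide boundary strips with its neighbours between iterations.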
Structure
The Master-Worker pattern consists of a master component and several identical
worker components or processes, as shown in Figure 3.3. The worker components are
spatially arranged in a pipeline to reflect the communication structure of the partitioned
problem which the pattern implements. The client interacts with the master component
to parallelize a certain application. The master component is responsible for partitioning
the application into several subtasks, starting the worker components to process these
subtasks, collecting the results returned by the workers, and finally returning the collected
results to the client. The worker components are responsible for repeatedly performing the
computations on their assigned subtasks, and communicating the intermediate results to
their neighboring workers after every iteration. The Master-Worker pattern consists of
one master and at least two workers.
[Figure: a Master component (SplitWork, SendSubtasks, CollateResults, SendFinalResults) connected to p identical Worker components, each with DoCalculation, ExchangeData and SendResults operations.]
Figure 3.3: Master-Worker Pattern
Interaction
The interactions between the components of the Master-Worker pattern are shown in
Figure 3.4.
• The client requests the master to parallelize a given application.
• The master component divides the application into several subtasks and starts the
worker components to process these subtasks. The number of subtasks created is
equal to the number of processors available.
• Each worker performs a fixed number of compute-communicate cycles. A compute-
communicate cycle denotes an operation in which the workers compute on the data
in their assigned subtasks and then communicate the intermediate results to their
neighboring worker components. The workers return the computed results back to
the master after performing a fixed number of these compute-communicate cycles.
• The master collates the results returned by the workers for the given application.
The master returns the collated result to the client.
[Figure: message sequence between Client, Master and Workers (1..p): CallToParallelize, SplitWork, SendSubtask to each worker, repeated DoCalculation and ExchangeResults cycles between workers, SendResults to the master, then CollateResults and SendFinalResults.]
Figure 3.4: Object interaction in the Master-Worker Pattern
Implementation
The Master-Worker pattern can be implemented by following the steps described below:
1. Partition the work. Specify how the problem task can be divided into a collection of
subtasks. The number of subtasks created should be equal to the number of processors
or machines available in the parallel system. Also, the amount of computational work
in each subtask should be proportional to the speed factors of the individual machines used
in parallelization. For the filtering operation, one can partition the image into either
horizontal or vertical blocks of subimages. Each subimage represents a subtask to be
processed.
2. Combine the results. Specify how the final results should be collated from the results
returned by the worker components. In the filtering example, the master component
simply collates the filtered subimages onto the output image without any change.
3. Specify the interaction between the master and the workers. This interaction can be
specified as follows. The master starts the worker components and distributes a single
subtask to each worker component. The master then waits for the workers to return the
computed results. When all the workers communicate their computed results, the master
terminates all the worker components. The master collects and returns the final result to
the client.
4. Specify the interaction between the worker components. This interaction can be specified
as follows. When a worker completes its computation in any compute-communicate
cycle, it communicates the required intermediate results to its neighboring workers
asynchronously. It then suspends its activities and waits to receive the intermediate results
from its neighboring workers. Note that when a process sends a message asynchronously,
it does not wait for the destination process to receive it. This implementation therefore
does not lead to a deadlock condition.
5. Implement the master and the worker components according to the specifications
outlined in the previous steps.
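The compute-communicate cycle of the steps above can be simulated sequentially. This sketch (not from the thesis) uses a 1-D block partition and a three-point moving average as the stand-in computation; the halo tuples play the role of the asynchronous boundary exchange between neighbouring workers, and all names are illustrative:

```python
def master_worker(data, p, iterations):
    """Simulate p workers doing synchronous compute-communicate cycles on
    a 1-D array; block edges replicate their own value at the array border."""
    size = len(data) // p
    blocks = [data[k * size:(k + 1) * size] for k in range(p)]   # SplitWork

    def step(block, left, right):                                # DoCalculation
        ext = [left] + block + [right]
        return [(ext[i - 1] + ext[i] + ext[i + 1]) / 3
                for i in range(1, len(ext) - 1)]

    for _ in range(iterations):
        # ExchangeData: each worker obtains halo values from its neighbours;
        # workers at the array boundary replicate their own edge value.
        halos = [(blocks[k - 1][-1] if k > 0 else blocks[k][0],
                  blocks[k + 1][0] if k < p - 1 else blocks[k][-1])
                 for k in range(p)]
        blocks = [step(blocks[k], *halos[k]) for k in range(p)]
    return [x for blk in blocks for x in blk]                    # CollateResults
```

Because every worker exchanges halos before the next cycle, the blocked result is identical to running the same filter on the whole array, which is the property the boundary exchange in step 4 is meant to preserve.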
Consequences
The Master-Worker pattern provides several benefits:
Scalability and flexibility: The Master-Worker pattern is scalable with respect to the
addition of new workers. It is also flexible with respect to changing the existing algorithms
in the workers without involving major changes to the master. The client is not affected
by such changes. Similarly, it is possible to change the algorithms for partitioning the
work or coordinating the workers in the master component without affecting the client.
Separation of concerns and efficiency: The Master-Worker pattern separates the client
code from the code for splitting the work, delegating the work to different workers,
managing interactions between the workers, collecting the results from the workers, and
handling worker failures. The Master-Worker pattern can speed up computation time
in many applications. However, it may not always be feasible to parallelize an application,
due to the overheads in parallelization (see below).
The Master-Worker pattern suffers from the following liabilities:
Feasibility: The Master-Worker pattern may not always be feasible. The activities of
partitioning the work, starting and controlling the workers, delegating the work to the
workers, managing the worker-worker communication, and collecting the final results are
time consuming. This pattern is effective only when the time spent in these activities
is significantly lower than the computing time required to execute a given application.
Load balancing: The Master-Worker pattern can suffer from serious load imbalances during
its execution. This can happen when it is implemented on non-dedicated parallel systems,
such as enterprise clusters (see section 2.5.3). Each worker in the Master-Worker pattern
depends on the other workers to perform computations on its assigned subtask. A machine
in an enterprise cluster can reduce the performance of this pattern when it is
time-shared by other users while executing some worker component of the pattern. A static
load distribution based on the speed factors of the individual machines used in parallelization
is effective only on dedicated parallel systems.
Error recovery: It is hard to devise mechanisms to handle a failure in some worker
component during the execution of this pattern. Since each worker is dependent on the other
workers for performing its computations, such a failure can lead to a deadlock condition.
It is also difficult to deal with a failure of communication between the master and the
workers or between different workers.
Applicability
The Master-Worker pattern represents a parallel programming model for implementing
synchronous parallel algorithms. It can be used to parallelize any vision application in
which
• the data can be partitioned into several data sets
• each data set can be processed concurrently by different workers
• the processing of each data set requires interaction between the worker
components to exchange intermediate results
Known Uses
The Master-Worker pattern has applications mostly at the low level of vision processing.
The higher levels do not exhibit regularity in data structures and computation. In low-
level processing, it can be used for parallelizing iterative window-based operations such
as spatial non-linear filters, and iterative relaxation algorithms used for image restoration
and segmentation. The algorithmic structure/motif represented by the Master-Worker
pattern is described in (Mattson, 1996).
3.5 Controller-Worker Pattern
Intent
The Controller-Worker pattern is used for parallelizing a class of problems in which
each object or subtask of the problem can potentially interact with any other object
or subtask. The controller component divides the problem into several subtasks and
distributes them to identical worker components. Each worker performs calculations on
its assigned subtask, and communicates the intermediate results to some or all other worker
components. The controller component collates the final results returned by the worker
components.
Motivation
Histogram equalization is a popular grey scale transformation which is used for enhancing
the contrast in an image. It aims to transform the image to have equally distributed
brightness levels over the whole of the brightness scale. A histogram H of an image is a
probability density function of the grey values in the image. If n_k represents the number
of pixels at a grey level k and N denotes the total number of pixels in the image, then
the histogram H is defined as H(i) = n_i / N. Histogram equalization maps the original
pixel values from a scale [a, b] to new values on a scale [c, d] such that the desired
output histogram is uniform over the whole new brightness scale [c, d]. The transformation
function is monotonically increasing and is given by (Sonka et al., 1993)

f'(i, j) = ((d − c)/N) Σ_{k=a}^{f(i,j)} H(k) + c    (3.3)

where f and f' represent the original and transformed image functions, respectively.
[Figure: a Controller component (SplitWork, SendSubtasks, CollateResults, SendFinalResults) connected to p identical Worker components, each with DoCalculation, ExchangeData and SendResults operations.]
Figure 3.5: Controller-Worker Pattern
The histogram equalization algorithm can be parallelized using the Controller-Worker
pattern. The controller divides the image into several subimages and sends each subimage
to a different worker. Each worker computes the partial histogram of its subimage and
communicates it to all other workers. Each worker then combines these partial histograms
to form the complete histogram of the entire image. The workers perform histogram
equalization on their subimages (using equation 3.3) and return the transformed subimages to
the controller.
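The scheme above can be sketched sequentially: the summation of partial histograms stands in for the all-to-all exchange between workers, after which each worker transforms its own subimage against the global histogram. The function names, the 256 grey levels and the output scale [0, 255] are assumptions made for illustration, not details from the thesis:

```python
def partial_histogram(subimage, levels=256):
    """Count of pixels at each grey level in one worker's subimage."""
    h = [0] * levels
    for row in subimage:
        for v in row:
            h[v] += 1
    return h

def equalize(subimage, global_hist, total_pixels, c=0, d=255):
    """Map grey values onto [c, d] using the cumulative global histogram,
    in the spirit of equation 3.3."""
    cdf, run = [], 0
    for count in global_hist:
        run += count
        cdf.append(run / total_pixels)
    return [[round(c + (d - c) * cdf[v]) for v in row] for row in subimage]

def controller_worker(subimages):
    partials = [partial_histogram(s) for s in subimages]       # each worker
    global_hist = [sum(ph[k] for ph in partials)               # all-to-all sum
                   for k in range(256)]
    n = sum(global_hist)
    return [equalize(s, global_hist, n) for s in subimages]    # transform
```

The key point the pattern captures is that only the small histogram arrays are exchanged between workers, never the subimages themselves.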
Structure
The Controller-Worker pattern consists of a controller component and several identical
worker components or processes, as shown in Figure 3.5. The client interacts with
the controller to parallelize a certain application. The controller component is responsible
for partitioning the application into several subtasks, starting the worker components to
process these subtasks, collecting the results returned by the workers, and finally returning
the collected results to the client. The worker components are responsible for performing
the computations on their assigned subtasks. Each worker may exchange intermediate
results with some or all other worker components during the computation. The Controller-
Worker pattern consists of one controller and at least two workers.
Interaction

The interactions between the components of the Controller-Worker pattern are shown in Figure 3.6.
[Diagram: the Client issues CallToParallelize to the Controller, which performs SplitWork and SendSubtask to each Worker; the Workers perform DoCalculation and ExchangeResults, then SendResults back; the Controller performs CollateResults and SendFinalResults.]

Figure 3.6: Object Interaction in the Controller-Worker Pattern
• The client requests the controller to parallelize a given application.

• The controller divides the application into several subtasks and starts the worker components to process these subtasks. The number of subtasks created is equal to the number of processors available.

• Each worker performs computations on its assigned subtask and communicates the intermediate results to one or more worker components. The workers return the computed results back to the controller.

• The controller collates the results returned by the workers, and returns the collated result to the client.
Implementation

The Controller-Worker pattern can be implemented by following the steps described below:

1. Partition the work. Specify how the problem task can be divided into a collection of subtasks. The number of subtasks created should be equal to the number of processors or machines available in the parallel system. Also, the amount of computation in each subtask should be proportional to the speed factors of the individual machines used in parallelization. For the histogram equalization operation, we could partition the image into either horizontal or vertical blocks of subimages. Each subimage represents a subtask to be processed.
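The proportional split described in this step can be sketched as follows; the row-wise decomposition and the function name are illustrative assumptions, not the thesis's code.

```python
def split_rows(num_rows, speeds):
    """Divide num_rows image rows into horizontal strips whose sizes
    are proportional to per-machine speed factors, so faster machines
    receive larger subimages.  Returns (start, stop) row ranges, one
    per worker.  A sketch only."""
    total = sum(speeds)
    bounds, start = [], 0
    for i, s in enumerate(speeds):
        # the last worker absorbs any rounding remainder
        stop = num_rows if i == len(speeds) - 1 else start + round(num_rows * s / total)
        bounds.append((start, stop))
        start = stop
    return bounds
```

For example, three machines with speed factors 1, 1 and 2 would receive 25, 25 and 50 rows of a 100-row image.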
2. Combine the results. Specify how the final results should be collated from the results returned by the worker components. In the histogram equalization example, the controller component simply collates the transformed subimages onto the output image, without any change.

3. Specify the interaction between the controller and the workers. This interaction can be specified as follows. The controller starts the worker components and distributes a single subtask to each worker component. The controller then waits for the workers to return the computed results. When all the workers communicate their computed results, the controller signals the workers to terminate their processing. The controller collects and returns the final result to the client.

4. Specify the interaction between the worker components. Each worker may communicate (asynchronously) the intermediate results to some or all other worker components, and may wait to receive the same from some or every other worker component. Thus, this interaction may sometimes involve global broadcasting of messages from each worker to all other workers.
5. Implement the controller and the worker components according to the specifications outlined in the previous steps.
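The steps above can be sketched in miniature as below, with threads standing in for worker processes and a barrier standing in for the all-to-all ExchangeData step. The names are illustrative assumptions, and a real implementation would use message passing between machines rather than shared memory.

```python
import threading

def controller_worker(subimages, levels=16):
    """Minimal Controller-Worker sketch for histogram equalization:
    each worker computes a partial histogram of its subimage, exchanges
    it with all other workers, equalizes its subimage against the
    combined histogram, and returns the result.  Illustrative only."""
    p = len(subimages)
    partials = [None] * p            # "mailboxes" for the exchange step
    barrier = threading.Barrier(p)   # wait until all partials are posted
    results = [None] * p

    def worker(rank):
        sub = subimages[rank]
        hist = [0] * levels          # partial histogram of this subimage
        for v in sub:
            hist[v] += 1
        partials[rank] = hist
        barrier.wait()               # all-to-all ExchangeData point
        # combine the partial histograms into the global histogram
        glob = [sum(h[k] for h in partials) for k in range(levels)]
        n = sum(glob)
        cum, lut = 0, []
        for k in range(levels):
            cum += glob[k]
            lut.append(round((levels - 1) * cum / n))
        results[rank] = [lut[v] for v in sub]   # SendResults

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(p)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results                   # CollateResults, in subimage order
```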
Consequences

The Controller-Worker pattern provides several benefits:

Scalability and flexibility: The Controller-Worker pattern is scalable with respect to the addition of new worker components. Increasing the number of worker components does not result in major changes to the controller or to the client program. Also, it is easy to change the program code in all worker components to realize different implementations.

Separation of concerns and efficiency: The Controller-Worker pattern separates the client code from the code for splitting the work, delegating the work to different workers, managing interactions between the workers, and collecting the results from the workers. The Controller-Worker pattern can speed up the execution time of many computationally intensive applications. However, it may not always be feasible to parallelize a given application due to overheads in parallelization (see below).
The Controller-Worker pattern suffers from the following liabilities:

Feasibility: The Controller-Worker pattern may not always be feasible. The activities of partitioning the work, starting and controlling the workers, delegating the work to the workers, managing the worker-worker communication, and collecting the final results are time consuming. In fact, significant delays can occur in the worker-worker interactions, especially when they involve global broadcasting of messages from each worker to all other workers.

Load balancing: The Controller-Worker pattern can suffer from serious load imbalances during its execution. This can happen when it is implemented on non-dedicated parallel systems, such as enterprise clusters (see section 2.5.3). Each worker in the Controller-Worker pattern may depend on the other workers to perform the computations on its assigned subtask. A machine in an enterprise cluster can reduce the performance of this pattern when it is time-shared by other users during the execution of some worker component within the pattern. A static load distribution based on the speed factors of individual machines used in parallelization is effective only on dedicated parallel systems.
Error Recovery: It is hard to devise mechanisms to handle a failure in some worker component during the implementation of this pattern. If each worker depends on the other workers for performing its computations, such a failure can lead to a deadlock condition. It is also difficult to deal with the failure of communication between the controller and the workers, or between different workers.
Applicability

The Controller-Worker pattern can be used to parallelize any vision application in which

• the data can be partitioned into several data sets

• each data set can be processed concurrently by different workers

• the processing of each data set requires an interaction between some or all the worker components, to exchange intermediate results.
Known Uses

The Controller-Worker pattern has applications mostly at low and intermediate level processing. In low level processing, it can be used for parallelizing two-dimensional Fast Fourier Transforms. At the intermediate level, it can be used for parallelizing Hough transforms and connected component labeling algorithms.

An iterative variant of the Controller-Worker pattern can be realized by performing the compute-communicate cycles iteratively. Each worker component performs computations on its assigned subtask iteratively, and communicates the intermediate results to some or all other worker components at the end of every iteration. However, a parallel implementation using this iterative variant involves large communication costs, and therefore may not result in any significant performance gains in many applications.
3.6 Divide-and-Conquer Pattern

Intent

The Divide-and-Conquer (DC) pattern is used for structuring applications in which either the data or the application algorithm is divided into several subtasks. Each subtask may be executed on a single processor or may be further divided (recursively) into smaller subtasks. The subtasks are executed independently and concurrently, producing several partial results. A set of combining functions is then applied to these partial results to produce the main result.
Motivation

An edge, a local boundary of some object in an image, represents a sharp discontinuity in the image function f(x, y). It is described by a gradient that points in the direction of the largest growth of the image function. An edge has both a magnitude and a direction, which are calculated using the gradient. The gradient is approximated by first-order differences and expressed as a gradient operator Δf(x, y) = (Δx f(x, y), Δy f(x, y)). A popular gradient operator is the Sobel edge detector, which is represented by two convolution masks for finding edges in the horizontal (Δx) and the vertical (Δy) directions, as shown below.
    -1  0  1        1  2  1
    -2  0  2        0  0  0
    -1  0  1       -1 -2 -1

      (a)             (b)

Figure 3.7: Convolution masks for finding a) horizontal edges and b) vertical edges
The direction of the edge at a point (x, y) in the image is given by tan⁻¹(Δy/Δx), while the edge magnitude is expressed as √(Δx² + Δy²). The Sobel edge detector can be parallelized using the DC pattern by computing the horizontal and vertical gradients concurrently. The horizontal and the vertical gradients can then be combined to compute the edge direction and the edge magnitude, using the expressions given above.
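Under the mask orientation shown in Figure 3.7, the gradient, magnitude and direction at a single pixel can be sketched as below; the function name and the cross-correlation convention are assumptions for illustration.

```python
import math

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # mask (a): Δx
GY = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]   # mask (b): Δy

def sobel(img, x, y):
    """Compute Δx and Δy at interior pixel (x, y) by applying the two
    3x3 masks (as cross-correlation), then return the edge magnitude
    sqrt(Δx² + Δy²) and direction atan2(Δy, Δx).  A sketch only."""
    dx = dy = 0
    for i in range(3):
        for j in range(3):
            v = img[x + i - 1][y + j - 1]
            dx += GX[i][j] * v
            dy += GY[i][j] * v
    return math.hypot(dx, dy), math.atan2(dy, dx)
```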
Structure

The DC pattern consists of a manager component and several distinct worker components or processes, as shown in Figure 3.8. The manager component creates a set of worker components to process each subtask. Each worker may perform computations on its assigned subtask or may recursively divide it further into smaller subtasks for executing them on a different set of processor nodes.
[Diagram: a Manager component (SendData, CollateResults, SendFinalResults) connected to p Worker components, each with ReceiveData, Compute/Parallelize and SendResults operations.]

Figure 3.8: DC Pattern
Interaction

The interactions between the components of the DC pattern are shown in Figure 3.9.

• The client requests the manager to parallelize a given application.

• The manager starts the worker components and distributes the subtasks to different worker components.
[Diagram: the Client issues CallToParallelize to the Manager, which performs SendData to each Worker; the Workers perform Compute/Parallelize and SendResults; the Manager performs CollateResults and SendFinalResults.]

Figure 3.9: Object Interaction in the DC Pattern
• Each worker component performs computation on its assigned subtask and returns the partial results to the manager. Alternatively, a worker may recursively divide its assigned subtask into smaller subtasks and execute them concurrently on a different set of processor nodes. A worker, in this case, acts as a manager for parallelizing its assigned subtask.

• The manager computes the main result from the results returned by the worker components.

• The manager returns the main result to the client.
Implementation

The DC pattern can be implemented by following the steps described below:

1. Design the manager component. The manager controls the worker components. It creates and schedules the worker components during the processing of the subtasks. If the DC pattern is used for implementing data parallelism, specify the dividing function which partitions the data into subtasks. However, if the DC pattern is used for implementing algorithmic parallelism, divide the application algorithm manually into distinct program units. The manager should create worker components to execute these program units. In both cases, specify the combining function which combines the partial results returned by the worker components. In the Sobel edge detection example, a combining function in the manager combines the edge data returned by the worker components, in order to compute the edge direction and edge magnitude.
2. Design the worker component. Each worker may simply apply a computing function to its assigned subtask. Alternatively, a worker may serve as a manager for parallelizing its assigned subtask using a different set of processor nodes. Each worker should return the partial results (of the assigned subtask) to its corresponding manager. In the Sobel edge detection example, the worker components compute the edge data in the horizontal (Δx) and the vertical (Δy) directions, concurrently.

3. Specify the interaction between the manager and the workers. This interaction can be specified as follows. The manager starts the worker components and distributes a single subtask to each worker component. The manager then waits for the workers to return the computed results. When a worker communicates its result, the manager signals the worker to terminate its processing. In the Sobel edge detection example, the manager communicates the complete image data to each worker and waits to receive the edge data from all the workers.
4. Implement the manager and the worker components according to the specifications outlined in the previous steps.
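A minimal sketch of this pattern for the edge detection example is given below: a thread pool acts as the set of workers, and simple first-order differences stand in for the full Sobel masks (an illustrative simplification, not the thesis's code).

```python
import math
from concurrent.futures import ThreadPoolExecutor

def grad_x(img):
    """Worker subtask 1: horizontal first-order differences (Δx)."""
    return [[img[r][c + 1] - img[r][c] for c in range(len(img[0]) - 1)]
            for r in range(len(img) - 1)]

def grad_y(img):
    """Worker subtask 2: vertical first-order differences (Δy)."""
    return [[img[r + 1][c] - img[r][c] for c in range(len(img[0]) - 1)]
            for r in range(len(img) - 1)]

def edge_map(img):
    """DC sketch: the manager runs the two gradient subtasks
    concurrently, then a combining function merges the partial
    results into edge magnitudes sqrt(Δx² + Δy²)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fx = pool.submit(grad_x, img)     # subtask 1
        fy = pool.submit(grad_y, img)     # subtask 2
        dx, dy = fx.result(), fy.result()
    # combining function applied by the manager
    return [[math.hypot(dx[r][c], dy[r][c]) for c in range(len(dx[0]))]
            for r in range(len(dx))]
```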
Consequences

The DC pattern provides several benefits:

Separation of concerns: The manager component separates the client code from the code in worker components used for performing the actual computations in the subtasks. Also, the code for creating and controlling the worker components is encapsulated in the manager component, separate from the client.

Efficiency: The DC pattern provides a simple strategy for parallelizing an application. It can be used to achieve improved performance in many applications which can be divided (recursively) into smaller but independent computational units.

Error Recovery: It is relatively easy to devise mechanisms to handle a failure in some worker component during the execution of this pattern. This is due to the fact that all the worker components process their subtasks independently.
The DC pattern suffers from the following liabilities:

Scalability: The scalability of the DC pattern, when used for implementing algorithmic parallelism, is constrained by the amount of parallelism that can be achieved in the algorithm. In fact, the algorithm dictates the parallelism.

Load imbalances: The DC pattern may lead to load imbalances when used for implementing data parallelism. For example, equal distribution of the image data in the connected component labeling algorithm may lead to unequal load distribution when the connected components span only a small region of the image.
Applicability

The Divide-and-Conquer pattern can be used for parallelizing any vision application in which

• the data or the algorithm can be divided into several subtasks

• each subtask can be executed on a single processor or may recursively be parallelized using the divide-and-conquer principle

• all subtasks created can be processed concurrently on different processors without explicit communication between the processors.
Known Uses

The divide-and-conquer parallel programming model has been used for parallelizing a number of vision algorithms. Stout (Stout, 1987) has proposed several divide-and-conquer algorithms for image processing. Sunwoo et al. (Sunwoo et al., 1987) have used divide-and-conquer techniques to segment an image into different regions. Choudhary and Thakur have parallelized connected component labeling algorithms on coarse grained machines using the divide-and-conquer principle (Choudhary & Thakur, 1994). Hameed et al. (Hameed et al., 1997) have employed different divide-and-conquer approaches to parallelize a contour ranking algorithm on coarse grained machines.
3.7 Temporal Multiplexing Pattern

Intent

The Temporal Multiplexing (TM) pattern is used for processing several data sets or a sequence of image frames on multiple processors. Each processor processes a complete data set and executes the same program code.
Motivation

A computer-assisted sperm motility system enables studying the motion of sperm in living organisms (Irvine, 1995). In human beings it is used for estimating the degree of male fertility. In a sperm motility system, a sequence of image frames of the sperm movement is captured over a given time frame. These image frames are then analyzed to find the sperm and motion characteristics, such as the sperm density, the size and shape of the sperm heads, the velocity of the sperm, and the shape of the motion trajectory. A sperm motility system involves a set of common preprocessing and feature extraction operations on the individual image frames. The module to compute the velocity of individual sperm, for example, involves simple operations such as image thresholding, noise suppression, removal of thin lines (sperm tails) or contaminating particles, segmentation, and finally region merging for extracting the sperm heads/cells. The processed image frames are then combined (superimposed) for tracking the motion trajectories of individual sperm and computing the sperm velocities.
Since the preprocessing and feature extraction operations on individual image frames are independent of each other, the TM pattern can be used to process each image frame concurrently. Performing data parallelism on individual image frames in such cases may not improve performance, due to communication overheads and the simplicity of the operations.
[Diagram: a Manager component (ReceiveData, SendDataSet) connected to p Worker components, each with ReceiveDataSet, DoCalculation and SendResults operations.]

Figure 3.10: TM Pattern
Structure

The TM pattern consists of a manager component and several identical worker components or processes, as shown in Figure 3.10. The manager creates, controls and schedules the worker components to process the data sets. It receives the data sets from an external component called the data source. The worker components are responsible for performing computation on individual data sets, and for returning the processed values to an external component called the data sink. The TM pattern consists of one manager and at least two workers.
Interaction

The interactions between the components of the TM pattern are shown in Figure 3.11.

• The external data source supplies a sequence of data sets to the manager.

• The manager assigns individual data sets to available workers. If all the workers are busy, the manager suspends its activities until some worker is free to process a data set.

• Each worker processes its assigned data set, sends the processed values to a data sink component, and interacts with the manager for a new data set.

• The above two steps are repeated until there are no more data sets to be processed.
[Diagram: the Data Source performs SendData to the Manager, which performs SendDataSet to each Worker; each Worker performs DoCalculation and SendResults to the Data Sink.]

Figure 3.11: Object Interaction in the TM Pattern
Implementation

The TM pattern can be implemented by following the steps described below:

1. Design the manager component. The manager controls the worker components. It creates and schedules the worker components for processing the data sets. The manager component maintains a queue of available worker components. When a worker requests a new data set, the manager adds it to the end of this queue. If the queue of available workers is not empty, the manager reads a data set from the data source and assigns it to the first available worker in this queue. However, when the queue is empty (all the workers are busy), the manager suspends its activities until at least one worker is ready to process a data set. In the sperm motility system, the manager assigns each image frame to a separate worker. The manager, in this case, can also serve as a data source. It therefore maintains a repository of all the image frames to be processed by the worker components.
2. Design the worker component. Each worker should be designed to process the assigned data set, send the processed values to the data sink, and request a new data set from the manager. In the sperm motility example, each worker performs a complete set of preprocessing and feature extraction operations on its assigned image frame. Each worker sends the processed image frames to the data sink component.

3. Implement the manager and the worker components according to the specifications outlined in the previous steps.
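The steps above can be sketched as below, with threads standing in for the worker processes and an in-process queue playing the manager's role of handing complete frames to free workers. The function names and the shared queues are illustrative assumptions; a real implementation would distribute frames across machines.

```python
import threading, queue

def temporal_multiplex(frames, process, p=3):
    """TM sketch: a manager feeds whole frames to p identical workers
    through a queue; each worker applies the same `process` function to
    complete frames and posts results to a data sink queue.  Frames
    carry their index so the sink can restore the original order."""
    work, sink = queue.Queue(), queue.Queue()
    for item in enumerate(frames):
        work.put(item)                    # (index, frame) from the data source
    for _ in range(p):
        work.put(None)                    # one end-of-stream marker per worker

    def worker():
        while True:
            item = work.get()
            if item is None:              # no more data sets: terminate
                return
            i, frame = item
            sink.put((i, process(frame))) # SendResults to the data sink

    threads = [threading.Thread(target=worker) for _ in range(p)]
    for t in threads: t.start()
    for t in threads: t.join()
    out = [None] * len(frames)            # data sink reorders by index
    for _ in frames:
        i, r = sink.get()
        out[i] = r
    return out
```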
Consequences

The TM pattern provides several benefits:

Scalability and flexibility: New worker components can easily be added without performing major changes to the manager component. Also, it is easy to change the program code in all worker components to realize different implementations.

Efficiency: The use of the TM pattern enables scaling of the throughput to process the individual data sets in direct proportion to the number of processors used.

Dynamic load balancing: The TM pattern, like the Farmer-Worker pattern, provides an even distribution of the load while processing the data sets. The number of data sets processed by each worker is proportional to the speed of its corresponding node or processor.
The TM pattern suffers from the following liabilities:

Effectiveness: The TM pattern is effective only when there are more data sets/image frames than the number of processors. The parallelism in this pattern is expressed in terms of the number of data sets/image frames processed. When all the data sets/image frames are processed, no further parallelism is available in the application.

Latency: The use of the TM pattern does not improve the latency to process individual data sets; it remains unchanged in this pattern.
Applicability

The TM pattern can be used to parallelize any vision application in which

• it is required to process a collection/sequence of image frames or image data sets

• the processing of each image uses the same program code

• the images can be processed concurrently on different processors without explicit communication between the processors.
Known Uses

The TM pattern is used for parallelizing complete data sets. Downton et al. (Downton et al., 1996) have used temporal multiplexing techniques in a postcode recognition system. They have used it for verifying the validity of postulated postcodes by matching them with the entries in a database of valid postcodes.
3.8 Pipeline Pattern

Intent

The Pipeline pattern is used for parallelizing applications which process a stream of data, and which can be divided into a sequence (pipeline) of several independent subtasks that are executed in a determined order. The data stream in the pattern is provided by a data source component. The processed results are collected by the data sink component. Each subtask is implemented by a worker component which reads a stream of data, processes it, and passes the processed results to another worker (or the data sink) in the pattern.
Motivation
A vehicle identification system involves analyzing images of vehicles in order to identify their owners. Such a system, for example, can be used for tracking the identity of vehicles which break a specified speed limit on a motor highway or city roads. A high speed camera captures the images of the speeding vehicles, which are then analyzed at a certain time of the day. A typical vehicle identification system consists of at least four distinct modules (subtasks), as shown in Figure 3.12.
Input Images → Preprocessing → Feature Extraction → Classification → Database Search → Owner Identification

Figure 3.12: Vehicle identification system
The preprocessing module extracts the region in the image that surrounds the number plate. It then applies thresholding, edge detection and thinning operations on the extracted region in order to recover and skeletonize the characters in the number plate. The output of this module serves as an input to the feature extraction module, which extracts a number of features for each character. The feature vectors of all the characters in the number plate are then presented to the classification module. The classification module compares the feature vector of each character with a set of pre-stored exemplar feature vectors. A set of possible characters for each character in the number plate is then presented to the database search module.

The database search module searches a database of valid vehicle registration numbers for each complete set of characters that may potentially represent a number plate. The ones that match the database entries with the highest probabilities are then considered as recognized number plates. The database search module then outputs the identity of the vehicle from the database entry. For a given number plate image, if the system outputs more than one potential number plate entry, some verification (either manual or automated) needs to be devised to resolve the ambiguity.
The distinct modules of the vehicle identification system can easily be structured using the Pipeline pattern. Each module can run concurrently on a different processor and interact with its neighboring modules only by exchanging streams of data.
Structure

The Pipeline pattern consists of a data source, a data sink, and several worker components, as shown in Figure 3.13. The data source provides a sequence of input values (having the same structure or data type) into the pipeline. The data sink collects the processed values from the end of the pipeline. Each worker component is responsible for receiving the data from its preceding worker (or the data source), processing this data, and sending the processed results to the following worker (or the data sink). The first and the last worker components communicate with the data source and the data sink components, respectively. The intermediate worker components communicate only with their immediate neighbors. Note that the Pipeline pattern does not provide for dividing the application into different subtasks. It provides only a structure for an application that is divided manually into different subtasks. The client is responsible for creating, starting and terminating the components in the Pipeline pattern.
[Diagram: the Client oversees a Data Source (ReadData, SendData), Workers 1 to p (ReceiveData, DoCalculation, SendResults) and a Data Sink (CollectResults, SendFinalResults), connected in a chain.]

Figure 3.13: Pipeline Pattern
Interaction
The interactions between the components of the Pipeline pattern are shown in Figure 3.14.
[Diagram: the Client issues CallToReadData to the Data Source, which performs SendData to Worker 1; each Worker performs DoCalculation and SendResults to its successor; the Data Sink performs CollectResults and SendFinalResults.]

Figure 3.14: Object Interaction in the Pipeline Pattern
• The client calls the data source component to read the data sets.

• The data source component reads and attempts to send a new data set to the first worker. If the first worker is busy processing a previous data set, the data source component suspends itself until the worker is ready to receive the current data set.

• Each intermediate worker (not shown in the figure for brevity) retrieves (pulls) a data set from its preceding worker, processes it, and sends (pushes) the processed data to its successor. A worker may suspend its activities temporarily if the data from the preceding worker is not available, or if the worker immediately following it is not waiting for the data.

• The last worker sends the processed data set to the data sink and waits for a new data set from its predecessor.

• The last three processing steps are repeated until there are no more data sets to be processed in the pipeline.

• The data sink sends the processed data sets to the client.
Implementation

The Pipeline pattern can be implemented by following the steps described below:

1. Divide the application. The application should be manually divided into a sequence of functional units or subtasks. The processing in each subtask must depend only on the output of its direct predecessor. The computational load in each subtask should be proportional to the speed factors of the individual processors available for parallelizing the application. In the vehicle identification system, the application can be divided into four distinct functional units, namely preprocessing, feature extraction, classification and database search.
2. Design the data source and data sink components. These can be designed in two different ways: a) Both the data source and the data sink are designed as separate components which are executed concurrently with respect to the client. The client calls the data source component to read and output the data stream into the pipeline, and waits for the data sink to return the final results collected during the execution of the pipeline. b) Alternatively, the client functions as a data source (or data sink) and creates a separate component for the data sink (or data source). The client should not perform both these tasks by itself, since doing so would not result in any performance gain from using this pattern. In the vehicle identification example, the data source may be designed as a separate component which reads vehicle images from specified files and presents them to the preprocessing module. The data sink component may simply store the details of each number plate and its potential owner(s) in a specified file.
3. Design the worker components. Each worker component should repeatedly receive a data set from its predecessor, process it, and output the processed data set to its successor. Each worker should be implemented as a separate program unit that performs the required computation on its data set. In the vehicle identification example, each worker performs specified operations on its input data and passes its output to the neighboring worker or the data sink.
4. Specify the interaction between different components in the pattern. This interaction can be specified by using inter-process communication calls supported by a message-passing library (section 2.1.1). Note that each worker should format the results in order to pass them to its successor in the pipeline.
5. Implement the components and start the pipeline. The components in the pattern can be implemented according to the specifications given in the previous steps. The client starts each component as a separate thread or process. The processing in the pipeline starts when the data source outputs the data sets to the first worker in the pipeline. Each data set is transformed by the different worker components in the pipeline and is finally collected by the data sink. When there are no more data sets to be processed, the client terminates all the components of the pattern, after collecting the processed results from the data sink.
Consequences

The Pipeline pattern provides several benefits:
Flexibility: Since the worker components in the Pipeline pattern are independent and interact only by exchanging streams of data, they can easily be replaced by more efficient components having the same functionality. The worker components can be reused in different situations. Also, new worker components can easily be added to refine the functionality of the existing pipeline.
Efficiency: The Pipeline pattern helps to increase the system throughput and reduce the latency in applications which process long streams of data. However, using the Pipeline pattern to improve application performance is feasible only when the granularity of each worker is sufficiently high: the time required to transfer the data between the worker components should be much lower than the time required to perform the computations on each worker component.
The Pipeline pattern suffers from the following liabilities:

Sharing global information: Sharing global information between different components in the Pipeline pattern is inefficient and undermines the full benefits of the pattern.

Load balancing: Like the Master-Worker pattern, the Pipeline pattern can suffer from serious load imbalances during its execution on enterprise clusters (section 2.5.3). Throughput and latency are limited by the speed of the slowest worker component in the pattern.
Error Recovery: It is difficult to handle failures in the worker components during the execution of this pattern. Each worker is dependent on other workers for performing its computations. Consequently, a failure in any worker component can lead to a significant loss of processing time. In many cases, the application may need to be restarted from the beginning.
Scalability: An application parallelized using a Pipeline pattern is usually not scalable with respect to the addition of processors, because the number of worker components in a Pipeline pattern is determined by the number of subtasks comprising the application.
Applicability

The Pipeline pattern can be used to parallelize applications in which
• it is necessary to process a long stream of data values
• the application is composed of a sequence of independent functional units which process the data stream independently, but in a determined order
• the functional units communicate with each other only by exchanging streams of data
Known Uses

The Pipeline pattern has applications at all levels of vision processing. At the low level, it can be used for parallelizing the Canny edge detector (Sonka et al., 1993) when applied to a sequence of image frames. The Canny edge detector is composed of several independent functional units and is therefore easily implemented using a Pipeline pattern (Rulf, 1988). Note that the scalability of a Pipeline pattern may be increased by employing two or more Pipeline patterns to parallelize a single application. Each Pipeline pattern can concurrently process a part of the data stream (if feasible) in the application. Using two or more Pipeline patterns to parallelize a single application can be considered a variant of the Pipeline pattern; we call this variant the Multiple Pipeline pattern. Another variant of the Pipeline pattern (used in (Downton et al., 1996)) can be realized by making the pipeline communications 'both ways'. This enables the output of one or more Pipeline components to be used as an input (feedback) to the relevant component(s) in the Pipeline.
3.9 Composite Pipeline Pattern

Intent

The Composite Pipeline pattern consists of a pipeline of design patterns and/or sequential components which together parallelize a complete vision application processing a continuous stream of data. It provides a structure for applications that can be parallelized by dividing them into several independent functional units that communicate with each other only by exchanging streams of data. Each functional unit in turn may be parallelized by using relevant design patterns or may be implemented as a sequential component.
Motivation

Consider the vehicle identification system as outlined in section 3.8. Since the input to each module depends on the output of the previous module, the performance of the overall system depends on the speed of the slowest module. The use of a Composite Pipeline pattern in this situation can lead to improved system performance compared to a simple pipeline implementation. Each module in this system (see Figure 3.15) may be parallelized by dividing the data set within each module into subtasks and processing these subtasks concurrently (data parallelism). Alternatively, each data set may be processed on a different processor without data partitioning (temporal multiplexing).
Figure 3.15: Vehicle identification system (input images → preprocessing → feature extraction → classification → owner identification)
For example, the preprocessing operations on each image may be performed concurrently on different processors. Similarly, the searches for database entries for different number plates may be executed on different processors. Both these modules exhibit the temporal multiplexing form of parallelism. In the feature extraction and classification modules, each character in an image frame may be processed on a separate processor (data parallelism). However, such parallelism may not always be feasible if the communication overheads are too high. In such cases, temporal multiplexing alone may be used to increase the system performance.
Structure

The structure of the Composite Pipeline pattern is shown in Figure 3.16. It is similar to the Pipeline pattern. It has a data source which provides the inputs, a data sink which collects the outputs, and a sequence of design pattern and/or sequential worker components that process the input stream of data. We shall refer to the design patterns and the sequential worker components as functional components of the pattern. Each functional component is responsible for receiving the data from its predecessor, processing this data, and sending the processed results to its successor. Note that the Composite Pipeline pattern, like the Pipeline pattern, does not provide for dividing the application into different subtasks. It only provides a structure for an application that is divided manually into different functional components. The client is responsible for creating, starting and terminating the components in the Composite Pipeline pattern.
Figure 3.16: Composite Pipeline Pattern (client; data source with ReadData/SendData; functional components Pattern(1) ... Pattern(p), e.g. a Farmer-Worker pattern and a TM pattern, each with ReceiveData/DoCalculation/SendResults; data sink with ReceiveData/CollectResults/SendFinalResults)
Interaction

The interactions between the components of the Composite Pipeline pattern are shown in Figure 3.17.
• The client calls the data source component to read the data sets.
• The data source component reads and attempts to send a new data set to the first functional component. If the first functional component is busy processing a previous data set, the data source component suspends itself until the component is ready to receive the current data set.
• Each intermediate functional component (not shown in the figure for brevity) retrieves (pulls) a data set from its predecessor, processes it, and sends (pushes) the processed data to its successor. A functional component may suspend its activities temporarily if the data from the preceding component is not available, or if the following component is not ready to receive the data.
• The last functional component sends the processed data set to the data sink and waits for a new data set from its predecessor.
• The last three processing steps are repeated until there are no more data sets to be processed in the pipeline.
• The data sink sends the processed data sets to the client.

Figure 3.17: Object Interaction in the Composite Pipeline Pattern (client, data source, Pattern (1), Pattern (2), data sink)
Implementation

The Composite Pipeline pattern can be implemented by following the steps described below:
1. Divide the application. The application should be manually divided into a sequence of functional units. The processing in each functional unit must depend only on the output of its direct predecessor. For example, in the vehicle identification system, the application is divided into preprocessing, feature extraction, classification and database search modules.
2. Design the data source and data sink components. These components, as in the Pipeline pattern, can be designed as two separate components distinct from the client. Alternatively, the client can function as a data source (or data sink) and create a separate component for the data sink (or data source).
3. Design the functional components. Design each functional component as an independent program unit which runs sequentially or which can be parallelized by using a relevant design pattern. Each functional component must repeatedly retrieve a data set from its predecessor, process it, and output the processed results to its successor. In the vehicle identification system, some or all of the modules may be designed to implement either data parallelism or temporal multiplexing on their assigned data sets.
4. Specify the interaction between different components in the pattern. This interaction can be specified by using inter-process communication calls supported by a message-passing library (section 2.1.1).
5. Implement the components and start the pipeline. The components in the pattern are implemented according to the specifications given in the previous steps. The processing in the pipeline starts when the data source outputs the data sets to the first functional component in the pipeline. Each data set is transformed by the different functional components in the pipeline and is finally collected by the data sink. When there are no more data sets to be processed, the client terminates all the components of the pattern, after collecting the processed results from the data sink.
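The distinguishing point of these steps — that a functional component may itself be internally parallel — can be sketched as follows. In this illustrative sketch the "parallel" stage is modelled sequentially: each element of its data set is an independent subtask that, in the real pattern, would be farmed out to workers (e.g. by a Farmer-Worker component). The names `classify_stage`, `sum_stage` and `run_composite` are introduced here for illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// A data-parallel functional component: each element of the data set is an
// independent subtask (each could run on a separate worker processor).
std::vector<int> classify_stage(const std::vector<int>& data) {
    std::vector<int> out(data.size());
    for (size_t i = 0; i < data.size(); ++i) out[i] = data[i] * 2;
    return out;
}

// A plain sequential functional component following it in the pipeline.
int sum_stage(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0);
}

// The composite pipeline: a parallel stage feeding a sequential stage.
int run_composite(const std::vector<int>& input) {
    return sum_stage(classify_stage(input));
}

int main() {
    assert(run_composite({1, 2, 3}) == 12);  // (2 + 4 + 6)
    return 0;
}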
Consequences

The Composite Pipeline pattern provides several benefits:
Flexibility: Since the functional components in the Composite Pipeline pattern are independent and interact only by exchanging streams of data, they can easily be replaced by more efficient components having the same functionality. For example, a slow sequential worker component may be replaced by an equivalent parallel functional component. The functional components can be reused in different situations. Also, new functional components can easily be added to refine the functionality of the existing pipeline.
Efficiency: The Composite Pipeline pattern can achieve better performance than a plain Pipeline implementation: a slow worker component in the plain Pipeline implementation can be identified and possibly reimplemented as a parallel functional component. However, the use of the Composite Pipeline pattern is effective only when the granularity of each functional component is sufficiently high.
The Composite Pipeline pattern suffers from the following liabilities:

Load balancing: The Composite Pipeline pattern, like the Master-Worker and the Pipeline patterns, can suffer from serious load imbalances during its execution on enterprise clusters (section 2.5.3). However, these load imbalances can possibly be reduced by using the Farmer-Worker or the Temporal Multiplexing pattern to parallelize the relevant functional components; both these patterns have the dynamic load balancing property.
Error Recovery: It is difficult to handle failures in functional components during the execution of this pattern. Each functional component is dependent on the other components for performing its computations. Consequently, a failure in any functional component can lead to a significant loss of processing time.
Applicability

The Composite Pipeline pattern can be used to parallelize applications in which
• it is necessary to process a long stream of data values.
• the application is composed of a sequence of independent functional units which process the data stream independently, but in a determined order.
• the functional units communicate with each other only by exchanging streams of data.
• each functional unit may be implemented as a sequential component or may in turn be parallelized using a relevant design pattern.
Known Uses

The Composite Pipeline pattern is an architectural pattern which is used for parallelizing complete vision systems. Singh (Singh et al., 1991) and Schaeffer (Schaeffer et al., 1993) have used the composite pipeline principle to parallelize an image rendering application. They used temporal multiplexing to speed up individual stages of the pipeline. Downton et al. (Downton et al., 1996) later proposed the principle of the composite pipeline as a design methodology for parallelizing embedded image processing applications, and applied it to parallelize image coding and postcode recognition applications. They proposed both data and algorithmic parallelism (in addition to temporal multiplexing) to speed up individual stages of the pipeline. A variant of the Composite Pipeline pattern (used in (Downton et al., 1996)) can be realized by making the pipeline communications 'both ways'. This enables the output of one or more Composite Pipeline components to be used as an input (feedback) to the relevant preceding component(s) in the pattern.
3.10 Summary

Design patterns for parallel vision applications represent designs or methods used for parallelizing these applications on various parallel architectures. Although the literature on the parallelization of vision algorithms is vast, there have been no previous efforts to abstract and document the design information in these parallel implementations. In this chapter we have attempted to capture and document this design information in the form of design patterns. These design patterns can be used for implementing parallel solutions to many vision algorithms/applications on coarse-grained parallel machines, such as a cluster of workstations. Each pattern has been described in a uniform way using a template. The template describes how each pattern works, where it should be applied and what the trade-offs are in its use.
The design patterns presented in this chapter include Farmer-Worker, Master-Worker, Controller-Worker, Divide-and-Conquer (DC), Temporal Multiplexing, Pipeline, and Composite Pipeline. The Farmer-Worker pattern is used for parallelizing embarrassingly parallel algorithms, while the Master-Worker and Controller-Worker patterns are used for parallelizing problems exhibiting the synchronous form of parallelism. The Divide-and-Conquer pattern is used for parallelizing algorithms that use a recursive strategy to split a problem into smaller subproblems and merge the solutions to these subproblems into the final solution. The Temporal Multiplexing pattern is used for processing several data sets or image frames on multiple processors. Finally, the Pipeline and Composite Pipeline patterns are used for parallelizing applications which can be divided into a sequence (pipeline) of several independent subtasks that are executed in a determined order. In the Composite Pipeline pattern, each subtask may be further parallelized using other relevant design patterns.
Chapter 4

Low level algorithms
The design patterns described in the previous chapter can be used for parallelizing a majority of vision algorithms on coarse-grained parallel machines, such as workstation clusters. In the remaining part of this thesis, we use and evaluate the applicability of these patterns for parallelizing some representative vision algorithms on a cluster of workstations. There are two different ways in which this can be done: a) for a given design pattern, one can describe a set of vision algorithms which can be parallelized using this pattern; alternatively, b) for a given vision algorithm, one can describe a set of one or more design patterns which can be used to parallelize this algorithm.
We follow the second approach by grouping algorithms in some order (e.g. low level, intermediate level, and high level in computer vision), and describing the various design patterns that can be used to parallelize these algorithms. This approach ensures logical consistency in describing algorithms or techniques used in a given domain, such as computer vision. This chapter therefore discusses the parallelization of some representative low level vision algorithms using the appropriate design patterns. Chapter 5 discusses the parallelization of some intermediate level algorithms, while chapter 6 discusses the parallelization of some representative high level algorithms/applications. We begin this chapter by describing the characteristics of low level algorithms.
Low level algorithms aim at improving the image data by suppressing noise or unwanted distortions, and enhancing some image features important for further processing and/or for human interpretation. The input and output of these algorithms are pixel based intensity images. The computations involved in these algorithms are pixel based image transformations which use a large number of simple mathematical operations on the pixel values in an input image to compute a new set of pixel values in the output image. This chapter discusses the parallelization of some representative low level vision algorithms using the design patterns described in Chapter 3.
Low level vision algorithms can be broadly classified into two categories depending on the size of the pixel neighborhood used for calculating the new pixel value.
• Local algorithms: In local algorithms, the value of a processed pixel depends only on the values of the pixels placed in its local neighborhood (window). The size of the neighborhood in local algorithms may be fixed, as in Sobel edge detection and thresholding operations, or may vary, as in convolution and filtering operations. We also place point operations in this category, where the value of the new pixel depends only on the original value of that pixel (e.g. brightness correction). Local algorithms can be further classified as iterative and non-iterative. An example of an iterative local algorithm is the extremum filter described in section 3.4, while the edge detection algorithm using the Sobel edge operator is an example of a non-iterative local algorithm.
• Global algorithms: In global algorithms, the value of a processed pixel may depend on the values of all pixels covering large neighborhoods or even the entire image. The algorithms in this category are further classified as global fixed and global varying. In global fixed algorithms, the value of a processed pixel depends on the values of all pixels in the input image. Some examples of global fixed algorithms are histogram equalization and the two dimensional discrete Fourier transform. In global varying algorithms, the value of a new pixel may depend on the pixels in the entire input image, or on the pixels in a small region of the input image. For example, in a connected component labeling algorithm, a connected component may span only a small region or it may be spread over the entire image. The amount of computation in global fixed algorithms therefore depends only on the size of the input image, while the amount of computation in global varying algorithms depends on both the size and the contents of the input image.
T he classification scheme described above was used by C houdhary and P a te l (C houd-
hary & P a te l, 1990) to provide an insight in to th e perform ance of an algorithm based on
its com m unication requirem ents. We have ex tended it fu r th e r to in troduce th e iterative
and non-itera tive class of local algorithm s. T he extended classification schem e enables
identification of relevant design p a tte rn s which can be used for parallelizing th e low level
algorithm s.
The rest of the chapter is organized as follows. Section 4.1 outlines the methods which can be used to parallelize most of the low level algorithms. Section 4.2 describes the scheme that is used in partitioning the image data. The remaining sections present the experimental results of parallelizing various representative low level vision algorithms. Section 4.3 presents the parallelization of a histogram equalization algorithm, which is a global algorithm used for contrast enhancement. Section 4.4 discusses various filtering operations and their parallel implementations. Section 4.5 presents results of the parallelization of a two-dimensional Fourier transform. Finally, section 4.6 discusses the parallelization of an image restoration algorithm using Markov random field models.
The algorithms presented in this chapter (and those in the two chapters immediately following) have been implemented on a network of up to sixteen workstations. Each workstation is a Sun SPARCstation 5 machine with 32 Mbytes of local memory and a clock speed of 170 MHz. All workstations thus have the same speed factors (a workstation with a speed factor of 2 is twice as fast as a workstation with a speed factor of 1). The program code for implementing the various parallel algorithms using the corresponding design patterns has been written in C++ and the PVM message-passing kernel (Sunderam, 1990). The performance of the corresponding parallel implementations has been measured in terms of execution times and program speedups. The speedup of a parallel program is defined as

    speedup = (execution time on one workstation) / (execution time on p workstations)    (4.1)
4.1 Parallelization of low level algorithms
Most of the low level vision algorithms are parallelized by partitioning the image into subimages, and processing these subimages concurrently using different processors. Using this strategy, Siegel et al. (Siegel et al., 1992) parallelized a local convolution algorithm using two distinct approaches, namely, complete sums and partial sums. In the 'complete sums' approach, all the data needed by a processor to process its subimage is transferred to it before the computation. The processors then work independently, without interacting with each other during the computation. With the 'partial sums' approach, each processor performs computation on its subimage and interacts with other processors to exchange intermediate results during the computation. We extend these two approaches to parallelize most of the low level algorithms.
The local non-iterative algorithms can be parallelized using the 'complete sums' approach. They can be implemented by using the Farmer-Worker pattern (section 3.3). The local iterative and the global low level algorithms can be parallelized using the 'partial sums' approach. However, the algorithms within these classes exhibit different communication patterns. In a local iterative algorithm, each processor communicates with its neighbors after every iteration. These communications are regular and can be determined before the start of the computation. Local iterative algorithms can therefore be parallelized using the Master-Worker pattern (section 3.4). The global algorithms usually involve all-to-all processor communications. In certain cases, these communications may be determined before the start of the computation, as in the computation of a two dimensional fast Fourier transform of an image. But in other cases, they are determined dynamically, only after the start of the computation, as in the connected component labeling algorithm. The global algorithms are therefore parallelized using the Controller-Worker pattern (section 3.5).
Another important consideration in the parallelization of the low level algorithms is the number of image partitions or subimages created for concurrent execution. The number of subimages created in the local non-iterative algorithms should be about two to three times the number of processors (workers) used in parallelization. This maximizes the degree of parallelism achievable in an application and results in better performance, as described in section 4.4.1. The number of subimages created in the local iterative and global low level algorithms should, however, be equal to the number of processors available, because each worker is required to interact with other workers to exchange intermediate results during the computation. The computational workloads in the subimages, if measurable, should be proportional to the effective speed factors of the corresponding processors used in parallelization.
The effective speed factor of a machine at any instant of time is the fraction of its CPU time that is dedicated to processing the subimage. The effective speed factor of a machine can vary over time depending on the workload (of external processes) on that machine. Note that this strategy of using the workloads to divide the image into subimages ensures only static load distribution. It is effective only when the application is parallelized on a dedicated workstation cluster (section 2.5.3), where the speed factors are always constant.
4.2 Partitioning the image data
The performance of a low level algorithm parallelized on a cluster of workstations depends on the partitioning of the image into subimages, and the corresponding communication overheads. The communication overheads are directly related to the way the image is partitioned. They arise due to the distribution of subimages to the worker processors, the exchange of intermediate results (if applicable), and the collection of final results from the worker processors. There are many different methods to partition a given image into subimages. We use a simple row partitioning method in which an image is horizontally divided into a given number of subimages, as shown in Figure 4.1. The row partitioning method allows one to divide a given image into any number of subimages of appropriate sizes. Thus each processor can be assigned a proportional workload based on its speed factor (Angus et al., 1989).
Figure 4.1 (a) shows the row partitioning of an image into distinct (non-overlapping) subimages for the global algorithms. Such algorithms do not need pixel values from other subimages in order to perform computations on the boundary pixels of any subimage.

Figure 4.1: Partitioning of an image (subimages P(1) ... P(n)). a) Row partitioning b) Row partitioning with data that is to be overlapped and/or communicated

Figure 4.1 (b) shows the row partitioning scheme for parallelizing the local low level algorithms. In a local low level algorithm (except point operations), the value of a boundary pixel in any given subimage may depend on the values of the pixels present in other subimage(s). Therefore, each subimage also has an additional number of overlapping rows belonging to its neighboring subimages, as shown in Figure 4.1 (b). In local iterative algorithms, these overlapping rows are communicated between the neighboring workers after every iteration.
Other methods for partitioning an image are column, diagonal, cross and heuristic. Row and column partition methods are similar, hence either of them could be used for partitioning the image. The diagonal partitioning method involves dividing the image into diagonal strips. This method is, however, difficult to implement and becomes extremely complicated when parallelizing local iterative algorithms. The cross partition method involves dividing the image in both horizontal and vertical directions. The number of subimages created using this method is always a square number. This places a restriction on the number of processors that can be used in parallelization, especially in the algorithms parallelized using the 'partial sums' approach.
The heuristic partitioning method was proposed and used by Lee and Hamdi (Lee & Hamdi, 1995) to parallelize the local convolution operation on a network of workstations. Their algorithm can partition the image into any number of subimages using both horizontal and vertical partitioning directions. However, both the heuristic and cross partitioning methods produce rectangular subimages. In local iterative algorithms, many worker processes may then be required to exchange their intermediate results with eight other worker processes. In row or column partitioning, each worker process is required to interact with at most two other worker processes. Therefore, the row partitioning method has a number of advantages compared to the other partitioning methods.
4.3 Grey scale transformations

Grey scale transformations modify the brightness of the pixels in an image based on the properties of the pixels themselves. They are used to enhance the contrast and improve the appearance of an image so that it can be easily interpreted by a human observer. The most common grey scale transform for contrast enhancement is histogram equalization, which was described in section 3.5.
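For reference, a sequential histogram equalization can be sketched as below. This assumes the standard CDF-remapping formulation; the exact variant of section 3.5 may differ. In the parallel Controller-Worker version, each worker would compute `hist` on its own subimage, the partial histograms would be summed across workers (the all-to-all step), and each worker would then apply the lookup table locally.

```python
import numpy as np

def equalize(image, levels=256):
    """Histogram equalization: remap grey levels through the scaled
    cumulative histogram so the output histogram is roughly flat."""
    hist = np.bincount(image.ravel(), minlength=levels)
    cdf = hist.cumsum()
    # Lookup table mapping each input level to its equalized level.
    lut = np.round((levels - 1) * cdf / cdf[-1]).astype(image.dtype)
    return lut[image]
```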
Histogram equalization is a global low level algorithm. In this section, we present the experimental results of parallelizing this algorithm (as outlined in section 3.5) using the Controller-Worker pattern. The execution times for the histogram equalization algorithm parallelized using different numbers of workstations are displayed in Table 4.1. A plot of these execution times and the speedups achieved for this algorithm are shown in Figure 4.2.

The execution time for the histogram equalization algorithm on a single workstation is of the order of a few seconds. However, the time spent in all-to-all worker communications is relatively large compared to the time spent in the actual computation. Hence, the execution time of the parallel algorithm increases significantly with an increase in the number of workstations, even for 512x512 and 1Kx1K images. For a 2Kx2K image there is a slight improvement in execution time up to about five to six workstations (Figure 4.2), due to the increase in computation time. However, the execution time increases for seven or more workstations. Hence, global algorithms involving all-to-all worker communications, but relatively low execution times, should preferably be executed on a single workstation.
Table 4.1: Execution time in (min:sec) for histogram equalization

Image Size   Number of Workstations
             1     2     4     6     8     10    12    14    16
512x512      0:01  0:01  0:02  0:02  0:02  0:03  0:04  0:04  0:04
1Kx1K        0:02  0:04  0:04  0:04  0:05  0:06  0:06  0:07  0:09
2Kx2K        0:16  0:14  0:13  0:15  0:16  0:16  0:17  0:22  0:23
Figure 4.2: Performance of histogram equalization (left: execution time (sec) vs. processors for 2Kx2K, 1Kx1K and 512x512 images; right: speedup vs. processors against the ideal)
4.4 Image filtering

Image filtering algorithms are image transforms that use a local neighborhood of a pixel in the input image to produce a new pixel value in the output image. A filter may be classified as linear or nonlinear. Linear filters calculate the new pixel value f'(i,j) as a linear combination of the pixel values in a local neighborhood N of the pixel f(i,j) in the input image. A common class of linear filters are the convolution-based filters, which are described in the next section. Linear filters, when used for removing noise in an image, blur sharp edges in that image. Nagao (Nagao & Matsuyama, 1979) and Lee (Lee, 1983) therefore suggested edge-preserving non-linear filters, which not only remove noise but also preserve sharp edges in a given image. Non-linear filters are discussed in sections 4.4.2 and 4.4.3.
4.4.1 Convolution
Convolution is a fundamental operation in image processing. It is used in image smoothing, edge or line detection (Sonka et al., 1993), feature extraction, and template matching (Ranka & Sahni, 1990). If N is a set of neighboring points around a point (a, b) in the image, and if h is an m x m convolution mask of coefficients, the convolution f'(a, b) at (a, b) is given by

f'(a, b) = \sum_{(c,d) \in \mathcal{N}} h(c, d)\, f(a - c, b - d)    (4.2)

where (c, d) is the displacement of the origin of h relative to that of f. On a sequential machine, the computational complexity of performing the convolution operation on an image of size n x n is O(n^2 m^2). This operation can be very time consuming when the size of the image and/or the size of the convolution mask is large. The execution time of this operation can be reduced by dividing the image into subimages, and convolving these subimages concurrently using different processors. By using a set of P processors, the computational complexity of the convolution operation can be reduced to O(n^2 m^2 / P).
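A direct implementation makes the O(n^2 m^2) cost visible: every output pixel touches all m^2 mask coefficients. This sketch is ours, not the thesis code; it zero-pads the border (the thesis does not specify border handling) and computes the correlation form (flip the mask for true convolution).

```python
import numpy as np

def convolve(image, mask):
    """Direct 2-D neighborhood filtering of `image` with an m x m `mask`
    (correlation form), accumulating one shifted copy per coefficient."""
    m = mask.shape[0]
    pad = m // 2
    padded = np.pad(image.astype(float), pad)  # zero-padded border
    out = np.zeros(image.shape, dtype=float)
    for c in range(m):
        for d in range(m):
            # Add mask coefficient times the correspondingly shifted image.
            out += mask[c, d] * padded[c:c + image.shape[0], d:d + image.shape[1]]
    return out
```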
Table 4.2: Execution time in (min:sec) for the convolution operation

Image Size  Window Size   Number of Workstations
                          1      2      4     6     8     10    12    14    16
512x512     3x3           0:05   0:03   0:02  0:02  0:02  0:03  0:03  0:04  0:04
            7x7           0:19   0:10   0:05  0:04  0:03  0:03  0:03  0:04  0:04
            11x11         0:44   0:23   0:11  0:08  0:07  0:06  0:06  0:06  0:06
            15x15         1:22   0:41   0:21  0:15  0:11  0:10  0:10  0:08  0:08
1Kx1K       3x3           0:19   0:11   0:06  0:05  0:05  0:07  0:07  0:08  0:09
            7x7           1:18   0:40   0:20  0:15  0:11  0:11  0:10  0:11  0:11
            11x11         3:04   1:33   0:46  0:32  0:24  0:22  0:19  0:17  0:16
            15x15         5:28   2:49   1:22  0:56  0:41  0:40  0:32  0:31  0:26
2Kx2K       3x3           1:29   0:45   0:23  0:17  0:14  0:15  0:16  0:17  0:20
            7x7           5:25   2:43   1:22  0:55  0:41  0:36  0:31  0:29  0:28
            11x11         12:28  6:14   3:07  2:05  1:34  1:15  1:03  0:59  0:47
            15x15         22:55  11:28  5:44  3:50  2:52  2:22  2:04  1:46  1:30
Figure 4.3: Performance of the convolution operation using a 3x3 window (left: execution time (sec) vs. processors for 2Kx2K, 1Kx1K and 512x512 images; right: speedup vs. processors against the ideal)
The convolution operation can be parallelized using the Farmer-Worker pattern. Table 4.2 shows the execution times of the parallel convolution operation obtained by varying parameters such as the window size, the image size, and the number of workstations used in parallelization. The entries in this table enable us to study the influence of these parameters on the execution time and the speedup of the parallel convolution operation. We can make two different observations from this table. Firstly, by keeping the window size fixed, we can observe the performance results while varying both the image size and the number of workstations used in parallelization. Secondly, by keeping the image size fixed, we can observe the performance results while varying the window size and the number of workstations used in parallelization.

The execution times and the speedups achieved for the parallel convolution operation using a 3x3 and a 15x15 window (window size fixed), for example, are shown in Figure 4.3 and Figure 4.4, respectively. Figure 4.3 shows that for a small window, the execution time decreases as the number of workstations used in parallelization increases. However, the execution time gradually increases when the number of workstations is increased beyond seven or eight. The corresponding speedup curves show a similar behavior. The increase in the execution times, or the decline in the corresponding speedups, after using eight or more workstations is due to the increase in the proportion of communication time relative to the corresponding computation time. However, when the window
size is larger (Figure 4.4), more computations are needed at each pixel in the convolution operation. Since the communication time in a 15x15 convolution operation is nearly the same as that in a 3x3 operation, the ratio of the computation time to the communication time is dominated by the computation time. This results in relatively greater speedups as the number of workstations used in parallelization increases.

We can observe similar results by keeping the image size fixed, but varying the window size and the number of workstations used in parallelization. Figure 4.5 shows the performance results of the convolution operation on a 1Kx1K image. The observed speedups increase as the window size is increased. As in the above case, a larger window size implies more computations in the convolution operation. Therefore, as the time spent in communicating the subimages and the results is almost the same across windows of different sizes, the ratio of the computation time to the communication time increases. Hence, higher speedups can be obtained with an increase in the window size and the number of workstations used in parallelization.
Figure 4.4: Performance of the convolution operation using a 15x15 window (left: execution time (min) vs. processors for 2Kx2K, 1Kx1K and 512x512 images; right: speedup vs. processors against the ideal)
Figure 4.5: Performance of the convolution operation on a 1Kx1K image (left: execution time (min) vs. processors for 15x15, 11x11, 7x7 and 3x3 windows; right: speedup vs. processors against the ideal)
The convolution algorithm was parallelized by Lee and Hamdi (Lee & Hamdi, 1995) on a network of SUN Sparc IPC workstations. They used a heuristic partitioning method (section 4.2) to partition the image into several subimages of the same size. The number of subimages created was equal to the number of workstations used in parallelization. However, this partitioning scheme can reduce the performance of the parallel convolution algorithm in some cases, as explained below:

• Firstly, the machines in a cluster of workstations may have different processing speeds or speed factors. Assigning subimages of the same size to such machines will lead to load imbalances. The size of a subimage assigned to any workstation should therefore be proportional to its effective speed factor.

• Secondly, even if all the workstations used in parallelization have the same speed factors, it is difficult to distribute these subimages to all workstations at the same time. There is always some delay before the last workstation gets its subimage and starts processing. This can cause some reduction in the overall performance of the parallel implementation.
• Finally, the performance of a parallel convolution algorithm implemented on an enterprise cluster (section 2.5.3) will degrade significantly if a participating machine is time-shared to run other processes. Each machine in an enterprise cluster is time-shared between different users.

Hence, the heuristic partitioning method can sometimes result in a significant reduction in the overall performance of the parallel convolution algorithm.
Table 4.3: Performance of the Farmer-Worker pattern on varying the external load and the number of subtasks. The execution times (min:sec) displayed are for the convolution operation (window size 15x15).

Row No.  External Load (Y/N)   Number of Workstations
                               1     2     4     6     8     10    12    14    16
1 (o)    N                     5:28  2:49  1:22  0:56  0:41  0:40  0:32  0:31  0:26
2 (•)    Y                     5:28  2:52  1:25  1:01  0:42  0:42  0:34  0:35  0:27
3 (*)    Y                     5:28  5:10  2:24  1:36  1:11  0:58  0:48  0:41  0:36

(o) subtasks >> processors & no external load; (•) subtasks >> processors & external load; (*) subtasks = processors & external load
Figure 4.6: Performance of the Farmer-Worker pattern in the convolution operation on varying the processor load and the number of subtasks (window size 15x15; left: execution time (min) vs. processors; right: speedup vs. processors against the ideal)
To overcome these limitations, we use the row partitioning method (in the Farmer-Worker pattern) to partition the image into several subimages of the same size. However, the number of subimages created is at least two times the number of workstations used in parallelization. Each machine therefore processes a number of subimages proportional to its speed factor. Table 4.3 shows the performance results of the convolution operation, parallelized using two different methods. The convolution operation was performed on a 1Kx1K image using a 15x15 window. The entries in the first row of the table display the execution times of the parallel convolution algorithm using the Farmer-Worker pattern. The workstations used in parallelization had the same speed factors.
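The dynamic load balancing that makes this work can be sketched as a shared task queue: a slow (externally loaded) worker simply pulls fewer subtasks. This is our own minimal illustration with threads standing in for workstations; the thesis's actual pattern runs worker processes on separate machines.

```python
from queue import Queue, Empty
from threading import Thread

def farm(subtasks, n_workers, work):
    """Farmer-Worker sketch: workers repeatedly pull subtasks from a
    shared queue until it is empty, so faster workers automatically
    process proportionally more subtasks."""
    q = Queue()
    for t in subtasks:
        q.put(t)
    results = []  # list.append is thread-safe in CPython
    def worker():
        while True:
            try:
                t = q.get_nowait()
            except Empty:
                return  # no subtasks left
            results.append(work(t))
    threads = [Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```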
We then reduced the speed of one workstation by executing a computation-intensive, non-terminating external process. We implemented the parallel convolution algorithm using different numbers of workstations, but always included the workstation executing the external process. The effective speed factor of the workstation executing the external process was nearly halved, since it was time-shared between the external process and a worker component of the Farmer-Worker pattern. The entries in the second row of Table 4.3 show the results of this parallelization. The performance results are similar to the previous results (i.e. the entries in the first row of the table) since most of the subimages are now processed by the other workstations. There is not much reduction in the overall performance, as can be seen from Figure 4.6.

However, if we partition the image into several subimages of the same size, and the number of subimages created is equal to the number of workstations used in parallelization, the performance of the parallel convolution operation degrades significantly. This can be seen from the entries in the third row of Table 4.3. The execution time is dominated by the slow workstation executing the external process. Since the slow workstation has the same workload as the other workstations, it takes more time to process its subimage. This reduces the overall performance of the parallel convolution algorithm. Hence, the Farmer-Worker pattern, which has an inherent dynamic load balancing property, can be used to achieve improved performance over the conventional methods used for parallelizing an application.
4.4.2 Rank filtering

Rank filters are non-linear filters which are used for reducing the variance in an image. They eliminate salt-and-pepper noise but, unlike the linear filters, they preserve sharp edges. A rank filter transforms an image by changing each pixel value to a specified value in the neighborhood of that pixel point. If N represents the set of pixel values in the neighborhood of some pixel point (i,j), and if the elements in N are sorted in ascending order, then a rank filter R_i of ith order assigns the ith element in N to the pixel point (i,j). Three special rank filters are R_min, R_max and R_median, which respectively assign the minimum, maximum and median pixel values to the pixel point (i,j). A review of rank filters and their properties is given in (Hodgson et al., 1985).
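The definition above translates directly into code: sort the neighborhood and pick the element of the requested rank. A sketch only (names ours); border pixels keep their original values here, since the thesis does not specify border handling.

```python
import numpy as np

def rank_filter(image, size, rank):
    """Rank filter: replace each interior pixel by the element of the
    given rank in its sorted size x size neighborhood. rank=0 gives
    R_min, rank=size*size-1 gives R_max, the middle rank gives R_median."""
    h, w = image.shape
    pad = size // 2
    out = image.copy()
    for i in range(pad, h - pad):
        for j in range(pad, w - pad):
            window = image[i - pad:i + pad + 1, j - pad:j + pad + 1]
            # axis=None sorts the flattened neighborhood in ascending order.
            out[i, j] = np.sort(window, axis=None)[rank]
    return out
```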
Rank filters can be parallelized using the Farmer-Worker pattern. The execution times for the rank filtering operation parallelized using different numbers of workstations are displayed in Table 4.4. The performance results of the rank filtering operation are similar to those of the convolution operation.
Table 4.4: Execution time in (min:sec) for the rank filtering operation

Image Size  Window Size   Number of Workstations
                          1     2     4     6     8     10    12    14    16
128x128     3x3           0:01  0:02  0:03  0:01  0:01  0:01  0:01  0:01  0:01
            11x11         0:09  0:05  0:03  0:03  0:02  0:01  0:02  0:02  0:02
256x256     3x3           0:02  0:02  0:02  0:01  0:01  0:01  0:02  0:02  0:02
            11x11         0:33  0:20  0:11  0:09  0:07  0:06  0:05  0:05  0:04
512x512     3x3           0:08  0:04  0:03  0:02  0:02  0:02  0:03  0:04  0:04
            11x11         2:14  1:13  0:38  0:30  0:23  0:22  0:18  0:18  0:16
1Kx1K       3x3           0:30  0:17  0:09  0:07  0:07  0:06  0:07  0:07  0:08
            11x11         9:14  4:39  2:20  1:45  1:14  1:14  0:59  0:57  0:43
4.4.3 Spatial filters

A combination of the R_min and R_max rank filters forms a family of spatial filters. Spatial filters can be used as approximations to the true low-pass and high-pass filters. A spatial low-pass filter, for example, can be defined as R_max^(n)(R_min^(n)(F)), where R_min^(n) (or R_max^(n)) denotes applying R_min (or R_max) n times to the image F. The cut-off frequency is determined by n: the larger the value of n, the lower the cut-off frequency. Other definitions of spatial low-pass filters can be found in (Hussain, 1991). A high-pass filtered image is obtained by subtracting the original image F from the low-pass filtered image. A high-pass filter sharpens details in an image.
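A sketch of the construction, under our reading of the (partly garbled) definition: the low-pass image is n applications of R_min followed by n applications of R_max with a 3x3 window, and the high-pass image is the difference against the original, using the sign convention stated in the text. All names are ours; edge replication at the border is our assumption.

```python
import numpy as np

def _rank3(img, reduce_fn):
    """3x3 min/max filter via nine shifted copies, edges replicated."""
    p = np.pad(img, 1, mode='edge')
    h, w = img.shape
    stack = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return reduce_fn(stack, axis=0)

def sharpen(image, n=5):
    """Spatial high-pass: low-pass = R_max^(n)(R_min^(n)(F)), then
    subtract the original from the low-pass result (text's convention)."""
    low = image.astype(int)
    for _ in range(n):
        low = _rank3(low, np.min)   # R_min applied n times
    for _ in range(n):
        low = _rank3(low, np.max)   # then R_max applied n times
    return low - image.astype(int)
```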
Table 4.5: Execution time in (min:sec) for the sharpening operation

Image Size  Window Size   Number of Workstations
                          1      2      4     6     8     10    12    14    16
128x128     3x3           0:04   0:02   0:01  0:02  0:02  0:02  0:02  0:02  0:02
            11x11         1:09   0:35   0:20  0:16  0:14  0:13  0:11  0:10  0:08
256x256     3x3           0:15   0:09   0:05  0:05  0:03  0:03  0:03  0:03  0:02
            11x11         4:51   2:27   1:23  1:07  0:54  0:53  0:46  0:41  0:34
512x512     3x3           0:59   0:30   0:15  0:11  0:09  0:08  0:07  0:07  0:06
            11x11         20:58  10:50  5:15  3:31  2:46  2:25  2:24  2:09  1:40
Figure 4.7: Performance of the sharpening operation using spatial filters (window size 11x11; left: execution time (min) vs. processors for 512x512, 256x256 and 128x128 images; right: speedup vs. processors against the ideal)
Spatial filters are iterative operations; hence, they can be parallelized using the Master-Worker pattern. The execution times for the sharpening operation (high-pass filtering) parallelized using different numbers of workstations are displayed in Table 4.5. A plot of these execution times and the speedups achieved for this operation are shown in Figure 4.7. The low-pass filtering was performed with the value of n equal to 5.

Since this operation involves a low-pass filtering of the image, there is a need to communicate the boundary information after each iteration. Each iteration involves a rank filtering operation on all the subimages. However, the time required to perform the rank filtering operation on each subimage is much higher than the time required to exchange the boundary information. Therefore, the time spent in worker-worker communications does not cause a significant degradation in the overall performance.
4.5 Fast Fourier transforms

A two-dimensional fast Fourier transform (2D-FFT) of an image is a global algorithm in which the value of each pixel depends on the values of all pixels in the image. The Fourier transform of an image enables image filtering in the frequency domain. The two-dimensional Fourier transform F(u, v) of an image f(x, y) is given by (Sonka et al., 1993)

F(u, v) = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n) \exp\left[-2\pi i \left(\frac{mu}{M} + \frac{nv}{N}\right)\right]    (4.3)

        = \frac{1}{M} \sum_{m=0}^{M-1} \left[\frac{1}{N} \sum_{n=0}^{N-1} f(m, n) \exp\left(\frac{-2\pi i\, nv}{N}\right)\right] \exp\left(\frac{-2\pi i\, mu}{M}\right)    (4.4)

where u = 0, 1, ..., M-1, v = 0, 1, ..., N-1, and i = \sqrt{-1}. A 2D-FFT is separable and can therefore be expressed as two one-dimensional fast Fourier transforms: a one-dimensional FFT along the rows followed by a one-dimensional FFT of the intermediate results along the columns, or vice versa. The term in square brackets in equation 4.4, for example, corresponds to the one-dimensional Fourier transform of the mth row.
A 2D-FFT can be parallelized by computing the one-dimensional FFT along the rows (or columns), transposing the intermediate results, and finally computing a one-dimensional FFT along the columns (or rows) (Choudhary & Patel, 1990). We use the Controller-Worker pattern to implement this form of parallelism. Each processor or workstation is assigned a set of contiguous rows of the input image. The number of rows assigned to each processor is proportional to its speed factor. Each processor computes the 1D-FFT along its rows (using the 1D-FFT algorithm given in (Press et al., 1992)). The processors then communicate with each other to transpose the intermediate results (the row FFTs) as shown in Figure 4.8.
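The row-transpose-row decomposition can be checked on a single machine; this sketch (ours, using numpy rather than the Press et al. routine) is the sequential skeleton of the parallel scheme, where the transpose becomes the all-to-all block exchange of Figure 4.8. Note that numpy's FFT omits the 1/(MN) normalization of equation 4.3.

```python
import numpy as np

def fft2_by_rows(image):
    """2-D FFT as: 1-D FFTs along rows, a transpose, then 1-D FFTs
    along rows again (i.e. along the original columns)."""
    rows = np.fft.fft(image, axis=1)    # each processor transforms its rows
    cols = np.fft.fft(rows.T, axis=1)   # transpose (all-to-all), row FFTs again
    return cols.T
```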
Figure 4.8: The data blocks needed to transpose the intermediate results (an N x N array of intermediate results divided among processors P0-P3; each pair of exchanged blocks is drawn with the same shade/pattern)
Each processor needs to communicate and exchange a block with every other processor. A pair of data blocks exchanged by any two processors is shown with the same shades/patterns (Figure 4.8). After exchanging the row FFTs as specified above, each processor computes the 1D-FFT along the columns. Finally, each processor sends the computed results to the Controller.
Table 4.6: Execution time in (min:sec) for the FFT operation

Image Size   Number of Workstations
             1     2     4     6     8     10    12    14    16
256x256      0:01  0:03  0:03  0:03  0:04  0:04  0:05  0:05  0:05
512x512      0:04  0:07  0:08  0:09  0:11  0:12  0:15  0:17  0:20
1Kx1K        0:20  0:28  0:29  0:31  0:45  0:51  0:52  0:57  1:01
The execution times for the FFT operation parallelized using different numbers of workstations are displayed in Table 4.6. From Table 4.6 we can observe that the communication overheads dominate the performance of the FFT operation. The computational time for this operation on a single workstation is of the order of a few seconds. However, the time spent in all-to-all worker communications and in communicating the final results to the controller is relatively large compared to the time spent in the computation. Moreover, the worker-worker communications involve costly floating point exchanges. It is therefore difficult to achieve any significant performance gains in the parallelization of the 2D-FFT operation in a workstation environment.
4.6 Image restoration

4.6.1 Markov random field models for image recovery

Markov random field (MRF) models and Bayesian methods are stochastic techniques used in image restoration, image segmentation and image interpretation. In an MRF model, the problem is formulated as an optimization problem [the maximum a posteriori (MAP) estimation rule] by representing the local characteristics of the image pixels by a Markov random field and its associated Gibbs distribution. An iterative optimization method, such as simulated annealing, is applied to generate a sequence of images which converge in an appropriate sense to the optimal MAP estimate. The algorithms based on this stochastic technique are computationally intensive and highly parallel. The algorithm used for image restoration is presented here in a nutshell. A detailed discussion and various other algorithms based on this technique are presented in (Mardia & Kanji, 1993).
If f is the observed image and Ω denotes the set of all possible interpretations of f, then the MAP estimate of f is the one which maximizes the probability of the interpretation g given the observed image f, i.e. we seek

\max_{\omega \in \Omega} \left[ P(g = \omega \mid f) \right]    (4.5)

After rigorous mathematical analysis and simplification, this ultimately leads to the minimization of an energy function, which is given by (Buxton et al., 1986)

E(g) = \sum_{(a,b)} \sum_{(i,j) \in \mathcal{N}} V[g(a,b), g(i,j)] + \sum_{(a,b)} \frac{(f(a,b) - g(a,b))^2}{2\sigma^2}    (4.6)

where (a, b) is any point in the image and (i, j) ∈ N, which is a set of neighboring points around the point (a, b). The parameter σ denotes the standard deviation of the additive Gaussian noise (with zero mean) in the degraded image. The real-valued function V[g(a,b), g(i,j)] adds a value to the energy function which is inversely proportional to the degree of similarity between the pixel intensities of the image points (a, b) and (i, j).
The energy function given by equation 4.6 is minimized using the simulated annealing process described below (Buxton et al., 1986), (Kapoor et al., 1994).

1. Initialize the starting temperature T

2. For each point (a, b) in the image do

   • compute the energy at point (a, b)

   • generate a trial pixel value and, using this value, compute the trial energy at (a, b). Compute the change in energy ΔE = trial energy − energy

   • if (ΔE < 0) then accept the state change, i.e. assign the trial value to the point (a, b); otherwise assign the trial value to the point (a, b) only when exp(−ΔE / T) > random[0, 1)

3. Repeat step 2 N_inner times

4. Lower the temperature to C / log(k + C), where k is the total number of iteration cycles (complete raster scans of the image) and C is a constant, independent of k

5. Repeat steps 2 to 4 N_outer times
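The acceptance rule in step 2 is the standard Metropolis step and can be sketched in isolation. All names here are ours, for illustration; `energy` stands for the local energy contribution of a candidate pixel value.

```python
import math
import random

def anneal_pixel(energy, current, trial, T, rng=random.random):
    """One acceptance step of the annealing loop: accept the trial value
    if it lowers the energy, otherwise accept it with probability
    exp(-dE/T); at low T, uphill moves become vanishingly rare."""
    dE = energy(trial) - energy(current)
    if dE < 0 or math.exp(-dE / T) > rng():
        return trial
    return current
```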
Table 4.7: Execution time in (min:sec) for image restoration using the MRF model

Image Size   Number of Workstations
             1      2     4     6     8     10    12    14    16
128x128      0:35   0:20  0:11  0:08  0:07  0:07  0:06  0:05  0:05
256x256      2:23   1:19  0:40  0:29  0:21  0:20  0:17  0:15  0:12
512x512      10:19  5:15  2:42  1:46  1:22  1:06  1:03  0:56  0:43
The MRF algorithm can be parallelized using the Master-Worker pattern. The execution times for the MRF algorithm parallelized using different numbers of workstations
Figure 4.9: Performance of the image restoration algorithm using the MRF model (window size 3x3; left: execution time (min) vs. processors for 512x512, 256x256 and 128x128 images; right: speedup vs. processors against the ideal)
are displayed in Table 4.7. A plot of these execution times and the speedups achieved for this operation are shown in Figure 4.9. The algorithm was executed with the values N_inner = N_outer = 10. The MRF algorithm is communication intensive. The workers communicate the boundary information after each raster scan of their assigned subimages. The total number of worker-worker communications in this example is therefore very high (each worker communicating 100 times with every other worker). But since the computing time between successive communications within each worker is relatively large, the observed speedups are quite close to the ideal speedups.

Note that, unlike the Farmer-Worker pattern, an application parallelized using the Master-Worker pattern does not have an inherent load balancing property. This can result in serious load imbalances when the pattern is implemented on an enterprise cluster (section 2.5.3). Each worker component in the Master-Worker pattern depends on the other workers to perform the computations on its assigned subtask (subimage). A machine executing a worker component of the Master-Worker pattern can delay the processing in the other worker components when it is also time-shared to run external processes. This can lead to a significant reduction in the overall performance of the corresponding application that is parallelized using this pattern.
We can observe the effect of executing an external process (external load) on the performance of the Master-Worker pattern by conducting a simple experiment. As an example application, we parallelize the image restoration operation based on the MRF model using the Master-Worker pattern. The performance results of the parallel implementation, using a 512x512 image, are shown in Table 4.8. The entries in the first row of the table display the execution times without any external load or processes on the machines executing the pattern. The amount of work distributed to the worker components is proportional to the effective speed factors of their corresponding machines. The entries in the second row display the execution times when one of the machines is time-shared to run an external process during the execution of a worker component of the Master-Worker pattern. The effective speed factor of such a machine is therefore halved with respect to the rest of the machines. Hence, the corresponding worker component takes longer to perform its computations. All workers in the Master-Worker pattern exchange intermediate results with their neighbors after every iteration. The presence of a slow worker component therefore results in increased waiting time for the remaining worker components when exchanging their intermediate results. This reduces the overall performance of the application, as can be seen from the entries in the second row of Table 4.8.

A potential solution to the load imbalance problem in the Master-Worker pattern is to dynamically reschedule the worker components after every fixed number of iterations. However, the time required to schedule the worker components on other idle machines should be significantly lower than the overall computation time of the application.
Table 4.8: Performance of the Master-Worker pattern when subjected to external load. The execution times (min:sec) displayed are for the image restoration operation using the MRF model on a 512x512 image.

Row No.  External Load (Y/N)   Number of Workstations
                               1      2     4     6     8     10    12    14    16
1 (o)    N                     10:19  5:15  2:42  1:46  1:22  1:06  1:03  0:56  0:43
2 (•)    Y                     10:19  9:43  4:42  3:09  2:22  1:53  1:43  1:19  1:11
Figure 4.10: Performance of the Master-Worker pattern (in the image recovery operation using the MRF model on a 512x512 image) subject to external load and load distribution (left: execution time (min) vs. processors, with and without external load; right: speedup vs. processors against the ideal)
4.7 Summary

In this chapter we have presented parallel implementations of some representative low
level vision algorithms on a cluster of workstations. Each algorithm has been parallelized
using appropriate design patterns such as Farmer-Worker, Master-Worker, and Controller-
Worker. The algorithms which have been parallelized include histogram equalization,
convolution, rank filtering, image sharpening using spatial filters, 2D-FFT (of an image),
and image restoration using MRF models. Some of these algorithms parallelized using the
Controller-Worker pattern (e.g. histogram equalization and 2D-FFT) do not result in any
significant speedups. This is because the time spent in all-to-all worker communications in
the Controller-Worker pattern is relatively high compared to the time spent in the actual
computation. These algorithms therefore do not represent ideal candidates for parallel
implementation on workstation clusters.
Parallel implementations of other low level algorithms have, however, shown promising
results. The convolution and rank filtering operations parallelized using the Farmer-
Worker pattern have resulted in significant performance gains. We have also illustrated
the advantage of using a Farmer-Worker pattern to achieve improved performance over the
conventional methods of parallelizing these algorithms. The image sharpening and image
restoration algorithms, representing a synchronous form of parallelism, have been
parallelized using the Master-Worker pattern. Although these algorithms are communication
intensive, the computing time between successive communications at each worker is
relatively high. The observed speedups in these algorithms are therefore reasonably close
to the ideal speedups. Finally, we also illustrated the problem of load imbalances that
can occur in the Master-Worker pattern when implemented on enterprise clusters
(section 2.5.3). These load imbalances, caused by external processes, can lead to a
significant reduction in the overall performance of the corresponding algorithm
parallelized using this pattern.
Chapter 5

Intermediate level processing
In this chapter we discuss the parallelization of some representative intermediate level
algorithms in computer vision. Intermediate level processing forms a bridge between the low
level and the high level processing operations in computer vision. It comprises algorithms
which reduce the visual information produced by the low level operations to a form suitable
for the recognition step in high level processing. The basic unit of information processed
by these algorithms is a token, which can represent a line, an intensity, color or
texture based region, or a surface. The processing step involves grouping these tokens
into generic entities such as sets of parallel lines, rectangles or polygons, homogeneous and
contiguous regions, or plane surfaces. Hence, the operations involved at the intermediate
level are mainly partitioning and merging, which transform the tokens into more
useful and meaningful structures for further processing.
However, unlike in low level processing, the operations or computations at the
intermediate level are not very regular. The form of parallelism in the algorithms at
this level is therefore not immediately evident. For example, even the most sophisticated
low level algorithms for detecting edges and lines in an image can generate a significant
number of line fragments across the image. A grouping algorithm used for linking and
reorganizing the line fragments into meaningful structures may need to match and merge
fragments of lines across large fractions of the image. In the parallel implementation, this
may lead to a large amount of non-local and irregular communication between
a significant number of processors. Hence, developing parallel solutions for intermediate
level algorithms is relatively difficult. During the past several years, many parallel algorithms
for intermediate level operations have been suggested and are constantly being improved.
However, most of these algorithms have been designed for a specific class of parallel
architectures (Chaudhary & Aggarwal, 1990). In this chapter, we discuss the parallelization of
some representative intermediate level algorithms on coarse-grained machines, such as a
cluster of workstations.
Segmentation is one of the most important intermediate level operations in computer
vision. It involves the extraction of features or objects from an image, which are used in
subsequent processing, namely, object description and recognition. The main objective of
segmentation is to partition the image into meaningful regions which constitute a part or
the whole of the objects in an image. There are two main approaches to segmentation, namely,
region-based and edge or pixel-based (Gonzalez & Woods, 1993), (Awcock & Thomas,
1995). Region-based segmentation aims at creating homogeneous regions by grouping
together pixels which share common features. Pixel-based segmentation aims to detect
and enhance edges in an image, and then link them to create a boundary which encloses
a region of uniformity. Region-based segmentation is identified as a similarity method,
since the image regions require some similarity criterion for their creation. In contrast,
pixel-based segmentation is termed a discontinuity method, since the creation of regions
involves the detection of edges, that is, abrupt discontinuities in pixel grey-level values.
This chapter is organized as follows. In section 5.1 we discuss the parallelization of a
region-based segmentation algorithm. In section 5.3 we discuss the parallel implementation of
a perceptual grouping algorithm used for grouping line tokens into meaningful entities such
as straight lines, junctions, and rectangles or polygons. Perceptual grouping algorithms
constitute the pixel-based approach to image segmentation. In each of these sections, we
present a sequential algorithm followed by its corresponding parallel implementation.
These implementations have been designed and developed for parallel execution on a
cluster of workstations, using the relevant design patterns.
5.1 Region-based segmentation
Region-based segmentation can be formally defined as follows (Gonzalez & Woods, 1993).
A region R of an image X is defined as a connected homogeneous subset of the image with
respect to some 'similarity criterion' such as gray tone, or texture. Let P denote a logical
predicate which assigns the value true (1) or false (0) to R, depending only on the
properties of the pixels in R. For example, P(R) = true if the difference between the maximum
and minimum pixel value in R is less than some threshold. A region-based segmentation
of an image is a partition of X into several homogeneous regions R_i, i = 1, 2, ..., n such that

\bigcup_{i=1}^{n} R_i = X    (5.1)

R_i \cap R_j = \emptyset \text{ for all } i \neq j    (5.2)

P(R_i) = \text{true for } i = 1, 2, \ldots, n    (5.3)

P(R_i \cup R_j) = \text{false for adjacent } R_i, R_j,\ i \neq j    (5.4)

Condition (5.1) indicates that every pixel must be in a region. Condition (5.2)
indicates that the regions must be disjoint (their intersection must be the empty set).
Condition (5.3) deals with the properties that must be satisfied by the pixels in the regions.
Finally, condition (5.4) indicates that the adjacent regions R_i and R_j are different in the
sense of the predicate P.
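Conditions (5.1)-(5.4) can be checked mechanically on a small labelled image. The following Python sketch is illustrative only (it is not the thesis implementation); it assumes the max-min predicate described above and 4-connectivity for region adjacency:

```python
# Sketch: verifying the region-partition conditions (5.1)-(5.4) on a tiny
# labelled image. The predicate P and the 4-connectivity choice are
# illustrative assumptions, not fixed by the definition itself.

def P(image, region, threshold=10):
    """P(R) = true iff max-min pixel value within R is below threshold."""
    values = [image[r][c] for (r, c) in region]
    return max(values) - min(values) < threshold

def check_partition(image, regions, threshold=10):
    h, w = len(image), len(image[0])
    all_pixels = set().union(*regions)
    covers = all_pixels == {(r, c) for r in range(h) for c in range(w)}   # (5.1)
    disjoint = sum(len(R) for R in regions) == len(all_pixels)           # (5.2)
    homogeneous = all(P(image, R, threshold) for R in regions)           # (5.3)
    # (5.4): adjacent regions must not satisfy P when united
    def adjacent(R1, R2):
        return any((r + dr, c + dc) in R2
                   for (r, c) in R1
                   for dr, dc in [(0, 1), (1, 0), (0, -1), (-1, 0)])
    separated = all(not P(image, R1 | R2, threshold)
                    for i, R1 in enumerate(regions)
                    for R2 in regions[i + 1:] if adjacent(R1, R2))
    return covers and disjoint and homogeneous and separated

image = [[10, 10, 60, 60],
         [10, 10, 60, 60]]
left  = {(r, c) for r in range(2) for c in range(2)}
right = {(r, c) for r in range(2) for c in range(2, 4)}
print(check_partition(image, [left, right]))   # a valid segmentation
```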
Region-based segmentation algorithms can be classified into three categories:

1. Region growing: In region growing an image is divided into an arbitrary number of
elementary regions, often starting at the level of individual pixels. These elementary
regions are then merged to form larger regions on the basis of a certain homogeneity
criterion. The region growing algorithm starts with an image partition that satisfies
condition (5.3) and proceeds to fulfill condition (5.4). The merging process
terminates when no two adjacent regions are similar.

2. Region splitting: In contrast, region splitting views the entire image as a single region.
Each region is then recursively subdivided into smaller subregions if the region is
not homogeneous enough. The processing starts in a condition satisfying (5.4) and
proceeds to fulfill condition (5.3). The measure of homogeneity is similar to that
used in region growing.

3. Region splitting and merging: This scheme combines both the split and merge
operations in one algorithm (Horowitz & Pavlidis, 1974), in order to exhibit the advantages
of both methods. The image is initially subdivided into an arbitrary set of
disjoint regions which are then merged and/or split in an attempt to satisfy the
conditions stated in equations 5.1-5.4. A split and merge algorithm begins with
neither of the two conditions (5.3) and (5.4) satisfied and ends up satisfying
both (5.3) and (5.4).
A simple realization of the split and merge technique is to represent the entire image
as one region initially, and then recursively divide into smaller and smaller
quadrant regions, in a quadtree fashion (Figure 5.1), any region R_i for which P(R_i) =
false (Gonzalez & Woods, 1993). Also, merge the adjacent regions R_i and R_j for which
P(R_i \cup R_j) = true. The algorithm stops when no further splitting or merging is possible.
The root of the tree in Figure 5.1 corresponds to the entire image while the leaves of the
tree correspond to individual pixels. Each intermediate node corresponds to a subdivision.
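The split phase of this scheme can be sketched as follows. This is an illustrative Python fragment (the thesis implementation is not in Python), using the max-min predicate from the definition above and assuming a square, power-of-two image; the complementary merge pass over adjacent leaves is omitted.

```python
# Sketch of the quadtree split phase: a square image is recursively split
# into quadrants until every leaf region satisfies P(R) (max-min pixel
# value below a threshold). Regions are (top, left, size) triples.

def P(image, top, left, size, threshold=10):
    vals = [image[r][c] for r in range(top, top + size)
                        for c in range(left, left + size)]
    return max(vals) - min(vals) < threshold

def split(image, top=0, left=0, size=None, threshold=10):
    if size is None:
        size = len(image)
    if size == 1 or P(image, top, left, size, threshold):
        return [(top, left, size)]              # homogeneous leaf region
    half = size // 2
    regions = []
    for dt, dl in [(0, 0), (0, half), (half, 0), (half, half)]:
        regions += split(image, top + dt, left + dl, half, threshold)
    return regions

image = [[10, 10, 60, 60],
         [10, 10, 60, 61],
         [10, 10, 10, 10],
         [10, 10, 10, 10]]
leaves = split(image)
print(leaves)   # four homogeneous 2x2 quadrants
```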
[Figure: (a) an image partitioned into regions R1 (subdivided into R11, R12, R13, R14),
R2, R3 and R4; (b) the corresponding quadtree with root R.]
Figure 5.1: a) Partitioned image b) Corresponding quadtree
5.2 Parallel Region-based segmentation
Region-based segmentation can be computationally expensive for images of complex scenes.
Hence, recent work in region-based segmentation has concentrated mainly on developing
efficient parallel algorithms (Copty et al., 1989), (Choudhary & Thakur, 1994),
(Hambrusch et al., 1994), (Alnuweiri & Prasanna, 1992), (Willebeek-LeMair & Reeves, 1990),
(Haralick & Shapiro, 1985). The effectiveness of a particular algorithm depends on the
application area, the input image, and the type of parallel architecture. In this section, we
focus on an experimental evaluation of the parallel split and merge segmentation algorithm
applied to gray-scale images and implemented on coarse-grained machines, such as a
cluster of workstations.

The region-based split and merge segmentation algorithm is well suited for parallel
implementation using the divide and conquer principle. Divide and conquer algorithms
(Stout, 1987) use a recursive strategy to split a problem into smaller subproblems and
merge the solutions to these subproblems into the final solution. Divide and conquer
strategies appear to provide a natural and efficient parallel solution to many problems on
coarse-grained machines. Several divide and conquer algorithms have been proposed for
image processing (Chaudhary & Aggarwal, 1991), (Stout, 1987), (Sunwoo et al., 1987).
The first phase in the parallel split and merge segmentation algorithm involves splitting
the image into several subimages such that each processor or workstation has its own
subimage associated with it. We describe the splitting and distribution process later. In
the next phase, each workstation applies a sequential region growing algorithm to segment
its associated subimage. The region growing algorithm defines individual pixels as initial
elementary regions. It then adds adjacent pixels to a region if the difference between their
grey values and the average pixel value of the current pixels in the region is less than a
threshold. After completing the segmentation process, the final phase involves merging
the segmented subimages at the boundaries of subdivision. The merging process occurs
in phases, in a binary tree fashion as shown in Figure 5.2 (b), and takes log P steps for a
given number of processors P. The segmented regions of the entire image are in the root
processor after the merging step.
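The region-growing step run on each subimage can be sketched as follows. This Python fragment is illustrative only: the scan order, 4-connectivity, and breadth-first growth are implementation choices assumed here, not details fixed by the text; only the running-average threshold test comes from the description above.

```python
from collections import deque

# Illustrative sketch of sequential region growing: a pixel joins a region
# when its grey value differs from the region's running average by less
# than a threshold.

def region_grow(image, threshold=15):
    h, w = len(image), len(image[0])
    label = [[None] * w for _ in range(h)]
    sizes = []
    for sr in range(h):
        for sc in range(w):
            if label[sr][sc] is not None:
                continue
            rid = len(sizes)                     # start a new region here
            total, count = image[sr][sc], 1
            label[sr][sc] = rid
            queue = deque([(sr, sc)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]:
                    if (0 <= nr < h and 0 <= nc < w
                            and label[nr][nc] is None
                            and abs(image[nr][nc] - total / count) < threshold):
                        label[nr][nc] = rid      # grow the region
                        total += image[nr][nc]
                        count += 1
                        queue.append((nr, nc))
            sizes.append(count)
    return label, sizes

image = [[10, 12, 90, 91],
         [11, 10, 92, 90]]
label, sizes = region_grow(image)
print(label)   # two regions: the dark left half and the bright right half
```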
Merging of the segmented subimages is performed at the boundary of subdivision.
While merging along any boundary, the intensity values of the two neighboring pixels at
the boundary are compared. If they satisfy the homogeneity criterion (the difference in
their values is less than some threshold) the two regions across the boundary are merged.
The value of each pixel in the merged region is set to the average of all the pixel values
within this region. If the two neighboring pixels already belong to the same region, or do
not satisfy the homogeneity criterion, the regions are kept unchanged.
The splitting and distribution of the subimages is the inverse of the merging process
performed at different levels of the binary tree. The processor at the root of the binary
tree divides the image into two subimages and sends them to the two processors at the
lower level. Each intermediate processor in the tree subdivides its assigned subimage into
three parts (Figure 5.2 (a)). It retains one part for itself (for segmenting), and sends the
other two parts to its left and right children in the binary tree. If there is no right child (as
in node 3, Figure 5.2), the subimage is subdivided into only two parts. The leaf processors
do not perform any subdivision on their assigned subimage. At the end of the splitting and
distribution process, each processor (workstation) has an associated subimage. The size
of this subimage is proportional to the speed factor of the underlying workstation.
[Figure: (a) distribution of subimages down the binary tree of processors; (b) merging of
segmented subimages back up the tree.]
Figure 5.2: a) Distribution of subimages b) Merging of subimages
The split and merge segmentation algorithm can be parallelized using the Divide-and-
Conquer (DC) pattern (section 3.6). The execution times for the parallel segmentation
algorithm implemented on a varying number of workstations are displayed in Table 5.1. A
plot of these execution times and the speedups achieved for this operation are shown in
Figure 5.3. The value of the threshold used for adding adjacent pixels to the regions in
the corresponding subimages was 15. From Figure 5.3 it can be seen that although the
execution time of the parallel segmentation algorithm initially decreases, it does not show
any significant improvement when the number of workstations used in parallelization is
increased beyond six. The corresponding speedup curves show similar behavior. The drop
in the scalability of the parallel segmentation algorithm is due to the time complexity of
the merging processes.
Table 5.1: Execution time (min:sec) for the parallel split and merge segmentation
algorithm

Image Size  No. of Regions               Number of Workstations
                           1     2     4     6     8     10    12    14    16
256x256     1385           2:12  1:15  0:52  0:38  0:35  0:33  0:32  0:29  0:32
512x512     2023           3:10  1:55  1:27  1:02  1:01  0:58  0:49  0:50  0:50
1Kx1K       2423           4:09  2:54  2:35  2:01  1:59  1:53  1:51  1:52  1:52
[Figure: two plots — execution time (mins) vs. processors for the 256x256, 512x512 and
1Kx1K images, and speedup vs. processors against the ideal speedup.]
Figure 5.3: Performance of the parallel split and merge segmentation algorithm
Table 5.2 displays the execution times for performing various operations in the parallel
segmentation algorithm applied to a 512x512 image. The percentage figures for each
operation in a column are computed with respect to the total parallel execution time
(displayed in the last row of the column) required to segment an image on a given number
of workstations. The experimental results presented in this table show that the time
spent in the merging operation increases with the number of workstations used
in parallelization. In certain cases, it exceeds the total time required for segmenting the
individual subimages. The influence of the communication time on the overall performance
of the parallel segmentation algorithm is relatively insignificant, as can be seen from the
percentage figures of the corresponding execution times displayed in Table 5.2.
Table 5.2: Execution time (min:sec) for various operations in the parallel split and
merge segmentation algorithm applied to a 512x512 image

Operation                          Number of Workstations
                 2        4        6        8        10       12       14       16
Region Growing   1:46     1:07     0:42     0:36     0:30     0:25     0:25     0:20
                 (92.2%)  (77.0%)  (67.7%)  (59.0%)  (51.7%)  (51.0%)  (50.0%)  (40.0%)
Merging          0:08     0:19     0:19     0:23     0:26     0:22     0:22     0:27
                 (7.0%)   (21.8%)  (30.7%)  (37.7%)  (44.8%)  (44.9%)  (44.0%)  (54.0%)
Communication    0:01     0:01     0:01     0:02     0:02     0:02     0:03     0:03
                 (0.8%)   (1.2%)   (1.6%)   (3.3%)   (3.5%)   (4.1%)   (6.0%)   (6.0%)
Total time       1:55     1:27     1:02     1:01     0:58     0:49     0:50     0:50
Hence, if communication time is not a dominant factor, the performance of a parallel
algorithm implemented using a Divide-and-Conquer pattern is mainly influenced by the
time complexity of the merge operation. Note that the regions produced by a parallel
segmentation algorithm may sometimes differ from those produced by an equivalent
sequential algorithm due to different starting pixel points. This can happen when the
contrast between the regions in the image is low. The majority of the previous
implementations of parallel segmentation algorithms have either used binary images or grey-
level images containing artificial regions which have a high degree of contrast with each
other.
5.3 Segmentation using Perceptual Organization
An edge or pixel based segmentation involves the detection of edge points representing
discontinuities in pixel intensities in an image, and linking these edge points into chains of
contiguous curves. However, this method often results in a fragmented segmentation in
which the curves produced do not correspond to complete object boundaries in images of
complex environments. Two approaches have been proposed to deal with this problem.
One is suitable for applications in restricted domains and makes use of model-based
techniques (Chin & Dyer, 1986). Model-based techniques rely on prior knowledge
of the objects in a scene, and predict their appearance in the low level descriptions that
can be extracted from the fragmented segmentation. The other approach, which has become
popular in recent years and which appears promising even in complex environments, is that
of perceptual organization (Lowe, 1985).
Perceptual organization hierarchically organizes low level image features into higher level
structures: edge points into lines, lines into parallels, rectangles and polygons, and
rectangles and polygons into object descriptions. Perceptual organization is formally
defined as the ability of the human visual system to derive relevant groupings or structures
from the input images without any prior knowledge about their contents (Lowe, 1985). The
grouping process follows the laws of perceptual grouping such as proximity (closer elements
are grouped together), similarity (similar elements are grouped together), continuation
(elements lying on a common line or curve are grouped together), closure (curves tend to
be completed to enclose a region), and symmetry (elements symmetric about some axis
are grouped together). The human visual system is very good at detecting geometric
relationships such as collinearity, parallelism, connectivity, and repetitive patterns in an
otherwise randomly distributed set of image elements, and it can usually see shapes in
arrangements of poor machine-generated edge outputs of even complex scenes (Lowe,
1985).
Perceptual organization has recently been applied to solve a number of practical computer
vision problems. It has proved to be effective for the extraction of straight lines (Boldt
et al., 1989), the extraction of curves (Dolan & Weiss, 1993), the detection of buildings in
aerial images (Huertas et al., 1993), (Mohan & Nevatia, 1989), searching for geometric
structures in natural scene images (Reynolds & Beveridge, 1987), and the detection of large
man-made objects in non-urban scenes (Lu & Aggarwal, 1992). In this section, we discuss
the parallel implementation of the perceptual grouping steps as outlined in (Lu & Aggarwal,
1992), with specific emphasis on the line grouping process. The following section presents
the sequential line grouping process as described in (Boldt et al., 1989), (Lu & Aggarwal,
1992), while the section following it presents its parallel implementation.
5.3.1 Sequential Line grouping algorithm
The input to the line grouping process is a set of fragmented line segments which are
extracted using existing edge detection, edge linking and linear approximation
techniques. The output is a set of straight lines which represent linear structures at a higher
level of granularity, as shown in Figure 5.4. There are several existing techniques that
could be used for extracting the initial line fragments in an image. We use the techniques
described in the Scerpo vision system (Lowe, 1985) to perform the edge detection and
linear approximation of the edge contours by piecewise linear segments. These operations
constitute a prerequisite step to the line grouping process. We describe these operations
briefly for the sake of completeness.
Figure 5.4: Line Grouping
We use two algorithms based on the Laplacian of Gaussian and the Sobel edge operator
to select the initial edge locations as described in (Lowe, 1985). We convolve the image
with a Laplacian of Gaussian operator and assign to each pixel in the convolved image
a gray value proportional to the absolute value of the result of the convolution. We then
apply a Sobel gradient operator to the convolved image and select as edge locations only
those zero crossing pixels that are above a given threshold in the Sobel gradient image. We
then perform edge thinning on the resultant image and link the edge points on the basis
of connectivity to form the edge contours. We use a simple recursive endpoint subdivision
method to approximate the edge contours by piecewise line segments as in (Lowe, 1985).
In this method, a line segment joining the endpoints of an edge contour is recursively
subdivided at the point of maximum deviation. This subdivision continues and eventually
returns a set consisting of one or more line segments such that the maximum deviation
of any point on the edge contour from its corresponding line segment is less than some
threshold value.
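The recursive endpoint subdivision method can be sketched as follows. This Python fragment is an illustrative reconstruction (not the thesis code); the contour representation and tolerance value are assumptions made for the example.

```python
import math

# Sketch of recursive endpoint subdivision: approximate an edge contour
# by line segments, splitting at the point of maximum deviation until
# every contour point lies within `tol` of its segment.

def point_line_distance(p, a, b):
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length

def subdivide(contour, tol=1.0):
    if len(contour) < 3:
        return [(contour[0], contour[-1])]
    a, b = contour[0], contour[-1]
    dists = [point_line_distance(p, a, b) for p in contour[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[worst - 1] <= tol:
        return [(a, b)]                       # one segment is close enough
    return (subdivide(contour[:worst + 1], tol) +
            subdivide(contour[worst:], tol))

# An L-shaped contour collapses to two segments meeting at the corner:
contour = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
print(subdivide(contour))   # → [((0, 0), (3, 0)), ((3, 0), (3, 2))]
```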
The line segments extracted using the techniques described above are often fragmented
and do not reflect the linear structures in the image well. A post-processing method based
on the principles of perceptual grouping is needed to obtain the required linear structures.
The line grouping process performs a repeated grouping of lines into longer lines using the
principles, or relational constraints, of perceptual grouping. We use three basic relational
constraints of perceptual grouping, namely, proximity, collinearity, and continuation, to
implement the line grouping algorithm. The details of other, finer constraints are given
in (Boldt et al., 1989). Consider an arbitrary ungrouped line in the image. We call such
a line a base line. A set of previously ungrouped lines are grouped with the base line if
they satisfy the following relational constraints:
• Proximity: The end points of the lines should fall in the neighborhood of the base
line. The size and shape of the neighborhood is controlled by the corresponding
parameters. Figure 5.5 (a) shows a circular neighborhood drawn at the end points
of the base line.

• Collinearity: The lines should be approximately collinear to the base line. The
difference in the orientation of the base line and any other line in its proximity
should be less than a threshold (Figure 5.5 (b)).
• Continuation or Overlap: The lines within the proximity of the base line must not
overlap too much. The distance between the point Q1 of the base line and the
projection of point P2 on l1 must be smaller than a threshold (Figure 5.5 (c)),
where l2 is any line within the proximity of the base line l1.
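The three constraints can be sketched for a candidate line l2 against a base line l1 as follows. This Python fragment is illustrative: the threshold values are invented for the example, and the continuation test uses the closest endpoint gap as a simplified stand-in for the projection-based overlap test described above.

```python
import math

# Sketch of the three relational constraints. Lines are endpoint pairs;
# radius, max_angle and max_gap are illustrative parameters.

def orientation(line):
    (x1, y1), (x2, y2) = line
    return math.atan2(y2 - y1, x2 - x1) % math.pi   # undirected angle

def groupable(l1, l2, radius=5.0, max_angle=0.1, max_gap=3.0):
    # proximity: some endpoint of l2 near an endpoint of l1
    near = any(math.dist(p, q) <= radius for p in l1 for q in l2)
    # collinearity: orientations agree modulo pi
    diff = abs(orientation(l1) - orientation(l2))
    collinear = min(diff, math.pi - diff) <= max_angle
    # continuation (simplified): small gap between the closest endpoints
    gap = min(math.dist(p, q) for p in l1 for q in l2)
    return near and collinear and gap <= max_gap

base = ((0, 0), (10, 0))
print(groupable(base, ((12, 0), (20, 0))))   # collinear, 2 px gap → True
print(groupable(base, ((12, 8), (20, 8))))   # parallel but far → False
```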
[Figure: diagrams of the proximity, collinearity and continuation constraints for a base
line l1 and a candidate line l2 (with endpoints P2, Q2).]
Figure 5.5: Relational constraints in the line grouping algorithm a) proximity b)
collinearity and c) continuation
The line grouping algorithm searches the neighborhoods of the end points of each base
line in order to find all lines within its proximity. Each line within the proximity of the
base line needs to satisfy the other two conditions in order to be considered for grouping with
the base line. We call the set of lines L that satisfy the conditions stated above with
respect to the base line l1, with l1 ∈ L, a token group.
After finding a token group L with respect to the base line l1, a representative line
l of L is computed. Line l passes through the point that is the geometric center of the line
segments in L (Lu & Aggarwal, 1992). The orientation of l is the length-weighted average
of the orientations of the lines in L. The endpoints of line l are determined by orthogonally
projecting the line segments in L onto l. The two furthest apart projection points are
the end points of l. The line l replaces the lines in L (see Figure 5.4). The line grouping
process continues until no more merging is possible. It always terminates after a finite
number of iterations as there are only a finite number of lines in the image and their number
declines in each iteration.
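The representative-line computation can be sketched as follows. This Python fragment is an illustrative reconstruction: it weights the centre by segment length and averages orientations directly, which is only reasonable when the group's orientations are already close (as the collinearity constraint guarantees); the exact weighting in the thesis implementation may differ.

```python
import math

# Sketch: representative line l of a token group — centre at the
# length-weighted centre of the segments, orientation the length-weighted
# average orientation, endpoints the two furthest-apart projections of
# the segment endpoints onto l.

def representative_line(segments):
    lengths = [math.dist(a, b) for a, b in segments]
    total = sum(lengths)
    cx = sum(w * (a[0] + b[0]) / 2 for w, (a, b) in zip(lengths, segments)) / total
    cy = sum(w * (a[1] + b[1]) / 2 for w, (a, b) in zip(lengths, segments)) / total
    theta = sum(w * (math.atan2(b[1] - a[1], b[0] - a[0]) % math.pi)
                for w, (a, b) in zip(lengths, segments)) / total
    ux, uy = math.cos(theta), math.sin(theta)      # unit direction of l
    # scalar projections of every endpoint onto the line through (cx, cy)
    ts = [(p[0] - cx) * ux + (p[1] - cy) * uy
          for seg in segments for p in seg]
    t0, t1 = min(ts), max(ts)
    return ((cx + t0 * ux, cy + t0 * uy), (cx + t1 * ux, cy + t1 * uy))

# two collinear fragments on the x-axis merge into one long segment
segments = [((0, 0), (4, 0)), ((6, 0), (10, 0))]
print(representative_line(segments))   # → ((0.0, 0.0), (10.0, 0.0))
```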
Note that in order to reduce the search space, a line segment is represented by its two
end points and is indexed by the image pixels corresponding to the end points (Figure 5.6).
Hence, an index array of the size of the original image is constructed prior to the grouping
process. When searching for lines close to a base line, the neighborhood of the end points
of the base line in the index array is searched. Only those lines whose end points fall into
this neighborhood are examined, as shown in Figure 5.6(b).
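The endpoint index can be sketched as follows. This Python fragment uses a dictionary as a stand-in for the image-sized array, and a square neighborhood rather than the circular one of Figure 5.5(a); both are simplifications made for the example.

```python
# Sketch of the endpoint index: map each endpoint pixel to the tokens
# indexed there, so a proximity search touches only nearby pixels
# instead of scanning every token.

def build_index(tokens):
    index = {}
    for tid, (p, q) in enumerate(tokens):
        for (x, y) in (p, q):
            index.setdefault((int(x), int(y)), []).append(tid)
    return index

def tokens_near(index, point, radius):
    px, py = point
    hits = set()
    for x in range(int(px) - radius, int(px) + radius + 1):
        for y in range(int(py) - radius, int(py) + radius + 1):
            hits.update(index.get((x, y), []))
    return hits

tokens = [((0, 0), (4, 0)), ((6, 0), (10, 0)), ((50, 50), (60, 50))]
index = build_index(tokens)
print(tokens_near(index, (4, 0), 3))   # → {0, 1}
```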
[Figure: (a) the circular search area around the end points of a base line B; (b) the index
array, whose pixel entries point to the line tokens (B, p, q, r) indexed there.]
Figure 5.6: Indexing technique used in the line grouping process. a) search area for the
base line b) the index array
5.3.2 Parallel Line grouping algorithm
In the parallel implementation, we assume that the fragmented line segments or line tokens
have been extracted from the input image using the existing methods of edge detection,
edge linking and linear approximation. The input to the parallel perceptual grouping
algorithm is therefore a set of line tokens which are communicated to each processor or
workstation before starting the line grouping process. Each processor has a complete set
of the token data consisting of all input tokens in the image. Each processor constructs
an index array and uses it to partition the token data into a set of token groups. A load
balancing procedure is then employed to assign each processor a finite number of token
groups, in proportion to its corresponding speed factor. The token groups assigned to
the processors are then processed in parallel. Each token group consisting of two or more
line segments is replaced by a representative line to form a new token, using the merging
procedure described in section 5.3.1.
After completion of the merging process, each processor communicates its tokens (those
processed by it) to all other processors. Again, each processor then has a complete set of
the new token data. This process is repeated for a fixed number of iterations or until no
more tokens can be grouped and merged into representative line tokens. The parallel line
grouping algorithm can be summarized as follows:
1. Broadcast token data from each workstation to every other

2. Form token groups at each processor

3. Assign a distinct set of token groups to each processor (for merging)

4. Perform merging of the token groups at each processor

5. Repeat steps 1 to 4 for a fixed number of iterations or until no more merging is
possible
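The iteration structure of these five steps can be sketched serially as follows. This Python fragment is a deliberately simplified stand-in: tokens are one-dimensional intervals (x0, x1) rather than 2D line segments, a "group" is a base token plus any token starting within `gap` of its right end, and the broadcast and per-processor assignment of steps 1 and 3 are replaced by local operations.

```python
# Serial sketch of the iterative grouping loop: form groups, merge each
# group into one token, and repeat until a fixpoint or an iteration
# bound is reached (step 5).

def iterate_grouping(tokens, gap=2, max_iters=10):
    for _ in range(max_iters):                  # step 5: bounded loop
        tokens = sorted(tokens)
        merged, used = [], set()
        for i, (a0, a1) in enumerate(tokens):   # step 2: form groups
            if i in used:
                continue
            group_end = a1
            for j in range(i + 1, len(tokens)): # step 4: merge the group
                b0, b1 = tokens[j]
                if j not in used and 0 <= b0 - group_end <= gap:
                    group_end = max(group_end, b1)
                    used.add(j)
            merged.append((a0, group_end))
        if len(merged) == len(tokens):          # no more merging possible
            return merged
        tokens = merged
    return tokens

print(iterate_grouping([(0, 3), (4, 6), (20, 25), (8, 9)]))
# → [(0, 9), (20, 25)]
```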
Note that the potential parallelism that can be exploited in the line grouping process
lies mainly in the merging process. After partitioning the input token data into token
groups, the replacement of each token group by a representative line is essentially a
local process that can be performed in parallel. A spatial partitioning of the index array
(either horizontally or vertically) in order to parallelize the line grouping process may not
always be feasible. For example, when the line segments in a token group span large portions
of the image space, it is extremely difficult, if not impossible, to partition the index array
spatially in order to realize a parallel implementation. These line segments would spread
themselves across several index array partitions.
The parallel line grouping algorithm presented in this section is similar to an earlier
implementation proposed by Prasanna et al. (Prasanna & Wang, 1996). However, we use
a different load balancing scheme which is based on the distribution of the token groups.
The load balancing method used in (Prasanna & Wang, 1996) is based on the total search
area of the input tokens. This method may not always lead to an even distribution of
load, since many base line tokens may span large portions of the image area and may
not require grouping or merging with other line tokens (their token groups consist of only
the base line). Also, the parallel line grouping algorithm presented in (Prasanna & Wang,
1996) is non-hierarchical, that is, it does not group the line tokens iteratively into higher
levels of granularity.
The structure of the parallel line grouping algorithm is similar to that implemented by
an iterative variant of the Controller-Worker pattern (section 3.5). Hence, the parallel line
grouping algorithm can be parallelized using an iterative variant of the Controller-Worker
pattern. The execution times for the parallel line grouping algorithm implemented on
a varying number of workstations are displayed in Table 5.3. From Table 5.3, it can be
seen that the execution time of the parallel line grouping algorithm does not show any
improvement over its corresponding sequential implementation.
Table 5.3: Execution time (min:sec) for the line grouping process

Image Size  No. of Tokens               Number of Workstations
                          1     2     4     6     8     10    12    14    16
256x256     855           0:01  0:01  0:02  0:02  0:03  0:03  0:04  0:05  0:05
512x512     1454          0:02  0:03  0:04  0:04  0:04  0:05  0:05  0:05  0:06
1Kx1K       7921          0:07  0:17  0:18  0:18  0:21  0:24  0:25  0:26  0:30
Chapter 5. Intermediate level processing 145

The poor performance of the parallel algorithm is mainly due to the inherent sequential nature of the line grouping process and the communication overheads in its parallel implementation. The only parallelism that can be exploited in this algorithm is during the merging operation, where different token groups are replaced by their corresponding representative line tokens concurrently. The time spent in the merging operation is, however, significantly lower than the time spent in communicating the newly formed tokens between different workstations during each iteration. Also, when the number of line tokens to be processed increases, the communication overheads dominate the overall execution time, as can be seen from the entries in the third row of Table 5.3. The communication overheads include the time spent in packing and unpacking the line tokens into data packets, and the time spent in communicating these data packets between different workstations.
Nevertheless, the line grouping algorithm based on the principles of perceptual organization serves as a typical example of an intermediate-level operation in computer vision. It illustrates the problems and difficulties encountered while parallelizing such algorithms, particularly on a cluster of workstations. Such algorithms are more suitable for sequential implementations in workstation environments.
5.4 Summary
In this chapter, we have presented parallel implementations of two intermediate level vision algorithms, namely, the region-based split and merge segmentation algorithm, and the line grouping algorithm based on the principles of perceptual organization. The segmentation algorithm has been parallelized using the Divide-and-Conquer (DC) pattern. The performance of this algorithm does not show a scalable improvement with an increase in the number of workstations used in the parallelization beyond a certain limit. This is due to the corresponding increase in the time needed to merge the segmented subimages in the merging operation. The influence of the communication time on the overall performance of the parallel segmentation algorithm is relatively insignificant. Hence, if communication time is not a dominant factor, the performance of an algorithm parallelized using a DC pattern is influenced mainly by the time complexity of the merging operation.
The line grouping algorithm has been parallelized using an iterative variant of the Controller-Worker pattern. Since this algorithm is inherently sequential in nature, the only parallelism that can be exploited in it is during the replacement of token groups by their corresponding representative line tokens. The time spent in this operation is, however, significantly lower than the time spent in the all-to-all worker communications in the Controller-Worker pattern. The performance of the parallel line grouping algorithm therefore does not show any improvement over its corresponding sequential implementation. This example illustrates the problems and difficulties encountered while parallelizing a typical intermediate level algorithm on coarse-grained machines, such as a cluster of workstations. It also shows the limitations of the use of the Controller-Worker pattern for parallelizing such applications on these machines.
Chapter 6

High level processing
In this chapter we discuss the parallelization of a high level vision algorithm for object recognition using a Farmer-Worker pattern. We also discuss the parallelization of an application in medical imaging using three different design patterns, namely, Temporal Multiplexing, Pipeline, and Composite Pipeline. High level processing in computer vision involves recognition of objects in a scene based on the knowledge acquired by the lower level processes from the image(s) of that scene. The tasks at this level are usually top-down or model-directed, and involve mainly symbolic and/or knowledge processing.

An example of a high level vision task is model-based object recognition. Given a database of object models, model-based object recognition involves finding instances of these objects in a given scene. A model-based vision system extracts scene features, such as edges and points, from an image of a scene, and compares them with a database of object models in order to identify objects within that scene. Most model-based object recognition systems are based on hypothesizing matches between the scene and model features, predicting new matches, and verifying or changing the hypotheses through a search process (Grimson, 1990), (Lowe, 1985). The task becomes more complex if the objects are overlapped or occluded in the scene. A review of the methods used in model-based object recognition in computer vision can be found in (Chin & Dyer, 1986), (Grimson & Huttenlocher, 1991).
In recent years, a new method based on geometric hashing has been proposed for model-based recognition of objects (Lamdan & Wolfson, 1988). This method offers a different and more parallelizable paradigm for model matching. The geometric hashing algorithm used for model matching consists of two phases: preprocessing and recognition. The preprocessing phase uses a collection of object models to build a hash table (described later) data structure. This data structure encodes the model information in a highly redundant and multiple-viewpoint way. In the recognition phase, the properties of the extracted features in the scene image are used to index the hash table data structure for a possible match to candidate object models. Although geometric hashing still requires a search over the features in a scene, it obviates a search over the models and the model features. Hence, the recognition phase is computationally efficient and highly amenable to parallel implementation (Rigoutsos & Hummel, 1992).
In this chapter, we discuss the parallel implementation of the recognition phase in the geometric hashing algorithm used for model matching. Section 6.1 describes the sequential algorithm for performing geometric hashing, while section 6.2 discusses its parallel implementation. We end this chapter with a section that discusses the parallelization of an application in medical imaging, namely, multi-scale shape description of MR brain images in epileptic patients. We use three different approaches (based on the Temporal Multiplexing, Pipeline, and Composite Pipeline patterns) to discuss the parallelization of different modules in this application.
6.1 Sequential geometric hashing algorithm
We assume that the database has M object models and each model is represented by n feature points. The preprocessing and recognition phases of the geometric hashing algorithm work as follows:
6.1.1 Preprocessing Phase
In the preprocessing phase a hash table is created from the M models in the database. For each model, two arbitrary feature points, referred to as the basis set, are used to define an orthogonal coordinate system as shown in Figure 6.1(a). Using this coordinate system, a new set of transformed coordinates of the remaining feature points in the model is computed using simple transformation equations in analytic geometry (Efimov, 1966). These new coordinates are then used to hash or generate entries into a hash table. Each entry in the hash table consists of a (model, basis) pair, representing the model number and the basis set. This process is repeated for all possible basis sets in a given model, and for all models in the database. As a result, the hash bins in the hash table will receive more than one entry. The final hash table contains a list of (model, basis) entries in each bin, as shown in Figure 6.1. The preprocessing procedure is executed off-line and only once. The steps in the preprocessing phase are outlined below:
1. Extract a set of n feature points from a given model m.

2. Select as the basis set a pair of two distinct feature points (i, j).

3. Compute the coordinates of the remaining feature points in the model with respect to the coordinate system defined by this basis set (i, j).

4. Compute the hash bin locations using a hash function h (described later) applied on the transformed coordinates in step 3.

5. Add the (model, basis) pair, i.e. (m, (i, j)), to the list of entries in the corresponding hash bin locations computed in step 4.

6. Repeat steps 2-5 for all possible basis sets in model m.

7. Repeat steps 1-6 for all models m in the database.
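The steps above can be sketched as follows. This is an illustrative sketch only: the basis transform is a simple change of frame and the bin quantization is a toy placeholder (the full implementation uses the rehashing function of equation 6.1, described later); all function names are our own.

```python
# Sketch of the preprocessing phase (steps 1-7 above); not thesis code.
from collections import defaultdict

def basis_coords(p, b0, b1):
    """Coordinates of point p in the frame defined by the basis (b0, b1)."""
    ox, oy = b0
    ax, ay = (b1[0] - ox, b1[1] - oy)            # basis x-axis
    norm2 = ax * ax + ay * ay
    px, py = (p[0] - ox, p[1] - oy)
    u = (px * ax + py * ay) / norm2              # component along the axis
    v = (-px * ay + py * ax) / norm2             # component along its normal
    return u, v

def build_hash_table(models, bins=100):
    """models: {model_id: [feature points]} -> {bin: [(model, (i, j)), ...]}."""
    table = defaultdict(list)
    for m, pts in models.items():
        n = len(pts)
        for i in range(n):                       # step 2: all ordered basis sets
            for j in range(n):
                if i == j:
                    continue
                for k in range(n):               # step 3: remaining points
                    if k in (i, j):
                        continue
                    u, v = basis_coords(pts[k], pts[i], pts[j])
                    # Step 4: toy quantization standing in for hash function h.
                    b = (int(u * 10) % bins, int(v * 10) % bins)
                    table[b].append((m, (i, j)))  # step 5
    return table
```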
Figure 6.1: Preprocessing phase in the geometric hashing algorithm a) Orthogonal coordinate system defined by the basis set b) Adding (model, basis) pairs in the hash table
6.1.2 Recognition Phase
In the recognition phase, an arbitrary pair of feature points from the scene image is chosen as a basis set. The transformed coordinates of the remaining points in the scene are calculated relative to the coordinate system defined by this basis set. Each new coordinate is mapped to the hash table (the same as that in the preprocessing phase), and the entries in the corresponding bin receive a vote. The (model, basis) pairs which receive sufficient votes (i.e. above a certain threshold value) are taken as potential matching candidate models. These are then passed to a verification module, which verifies the presence of the matching models against the scene features.

The main goal of the voting scheme is to reduce the number of candidates used in the verification step. The execution of the recognition phase corresponding to a basis set is termed a probe. The steps in the recognition phase are outlined below:
1. Extract a set S of feature points from the scene.

2. Select as the basis set an arbitrary pair of feature points (i, j) from S.
Figure 6.2: Recognition phase in the geometric hashing algorithm a) Orthogonal coordinate system defined by the basis set b) Accessing and collecting (model, basis) pairs from the hash bins in the hash table
3. Perform a probe using the following sequence of steps:

• Compute the transformed coordinates of the remaining feature points in S with respect to the coordinate system defined by this basis set.

• Compute the hash bin locations in the hash table using a hash function h (described later) applied on the transformed coordinates.

• Form a list of all the (model, basis) pairs stored in the corresponding hash bin locations computed in the previous step.

• Select the (model, basis) pairs (winning models) receiving a count of votes above a given threshold value (if any).

4. Repeat from step 2 until some winning (model, basis) pairs are found or until completion of some specified number of iterations.

5. Verify the potential models found in step 3 (if any) against the set S of features in the scene.

6. Remove the feature points of the matching model(s) from the scene (if applicable) and repeat steps 2-6 until some specified condition holds or for a fixed number of iterations.
The selection of the (model, basis) pairs receiving the maximum votes in step 3 may be performed by histogramming (i.e. counting) these entries using corresponding (model, basis) counters. Alternatively, the (model, basis) pairs may be sorted in order to find the winning models having a count above a given threshold value.
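A single probe with vote histogramming (step 3) can be sketched as follows, assuming the hash table and hash function built in the preprocessing phase are supplied as parameters; the names are illustrative, not from the thesis.

```python
# Sketch of one probe step with histogram-based vote counting.
from collections import Counter

def probe(scene_pts, basis, hash_table, hash_fn, threshold):
    """Transform the scene points, index the table, and histogram the votes."""
    i, j = basis
    votes = Counter()
    for k, p in enumerate(scene_pts):
        if k in (i, j):                           # skip the basis points
            continue
        b = hash_fn(p, scene_pts[i], scene_pts[j])
        for entry in hash_table.get(b, []):
            votes[entry] += 1                     # one vote per (model, basis) entry
    # Winning (model, basis) pairs: vote count at or above the threshold.
    return [e for e, c in votes.items() if c >= threshold]
```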
6.2 Parallel geometric hashing algorithm
In this section, we present a parallel implementation of the recognition phase of the geometric hashing algorithm. The preprocessing phase is a one-time process and can be carried out off-line. The parallel implementation of the recognition phase may be realized by either a) performing the operations of a single probe across several processors, concurrently, or b) performing multiple probes on several processors, concurrently. In the latter case, each probe may in turn be implemented on a set of one or more processors. The suitability of each method depends on the size of the hash table and the amount of memory available on each processor of the underlying parallel architecture.
There have been several prior efforts in parallelizing the recognition phase of the geometric hashing algorithm. Bourdon et al. (Bourdon & Medioni, 1988) and Rigoutsos et al. (Rigoutsos & Hummel, 1992) have proposed parallel implementations of a probe step (method (a)) across several processors on SIMD hypercube-based machines. Their implementations employ a large number of processors in proportion to the size of the model database. Wang et al. (Wang et al., 1994) have proposed several parallel implementations of the recognition phase on the CM-5 and MP-1, using both methods (a) and (b). Each implementation uses a different strategy for distributing the hash table entries.
Each implementation uses either a histogramming or a sorting method to compute the winning (model, basis) pairs receiving the maximum number of votes during the recognition phase. Their implementations are independent of the size of the model database and achieve improved performance over earlier efforts. They have achieved a single probe time of about 200 millisecs on a 32-node CM-5 connection machine, while Rigoutsos et al. (Rigoutsos & Hummel, 1992) have reported a single probe time of 1.52 sec on an 8K processor connection machine. Both have used a synthesized model database containing 1024 models (each model consisting of 16 feature points or dot patterns) and a scene consisting of approximately 200-256 feature points.
Due to limited local memory on individual processors, all the above implementations involve distribution of the hash table entries across several processors and parallelizing the operations of a single probe on these processors. In a workstation environment, the time required for performing a probe step on a single workstation is of the order of 1.2 to 1.3 secs. Parallelizing the operations of a single probe across several workstations as in the previous approaches would not lead to any significant improvement in performance due to high communication costs. Parallelizing a probe step involves computing the local (model, basis) winning pairs and communicating these winning pairs between different processors in order to find the global (model, basis) winning pairs. Since an object requires around 100-250 probes for recognition (Rigoutsos & Hummel, 1992), we perform multiple probes on various workstations, concurrently. The operations of each probe are, however, performed on a single workstation.
We now discuss the actual parallelization of the recognition phase on a cluster of workstations. As in (Rigoutsos & Hummel, 1992), we use a synthesized model database containing 1024 models. Each model consists of 16 randomly generated feature points (dot patterns). These model points are generated using a Gaussian distribution with zero mean and unit standard deviation. Similarly, we construct a scene consisting of 200 scene points using a normal distribution. In order to make the recognition process as efficient as possible, we apply two enhancements as mentioned in (Rigoutsos & Hummel, 1992). Firstly, we apply a rehashing function to the transformed coordinates (step 3 of the recognition phase described in section 6.1.2) so that the expected list lengths of the entries in the hash bins become as even as possible. For each transformed coordinate (u, v), the
following hash function is applied:

    f(u, v) = ( 1 − exp( −(u² + v²) / (2σ²) ), atan2(v, u) )        (6.1)
where σ represents the standard deviation of the model points. The values of the two coordinates in equation 6.1 lie in the intervals (0, 1) and (−π, π), respectively. These coordinate values can be quantized into a two-dimensional hash array as shown in Figure 6.3(a). Each hash location contains a pointer to a list or bin of (m, (i, j)) entries.
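The rehashing and quantization steps can be sketched as follows, assuming a 100 x 100 hash array (consistent with the mirror locations (x, 99 − y) discussed next); function names and the default parameters are our own.

```python
# Sketch of the rehashing function of equation 6.1 and its quantization
# into a bins x bins hash array. Illustrative names, not thesis code.
import math

def rehash(u, v, sigma=1.0):
    """Equation 6.1: map (u, v) into (0, 1) x (-pi, pi]."""
    r = 1.0 - math.exp(-(u * u + v * v) / (2.0 * sigma * sigma))
    return r, math.atan2(v, u)

def quantize(u, v, sigma=1.0, bins=100):
    """Quantize the rehashed coordinates to a hash array location (x, y)."""
    r, theta = rehash(u, v, sigma)
    x = min(int(r * bins), bins - 1)
    y = min(int((theta + math.pi) / (2.0 * math.pi) * bins), bins - 1)
    return x, y
```

Note that negating v leaves the first (radial) coordinate unchanged and negates the angle, which is what produces the mirror location (x, 99 − y) exploited below.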
Secondly, we use certain symmetries in the hash table to reduce the number of entries in the hash lists. If an entry of the form (m, (i, j)) hashes to a location (x, y) in the hash table, then there will be a mirror-entry of the form (m, (j, i)) in location (x, 99 − y) as shown in Figure 6.3(b) - (c). We can therefore store only those (m, (i, j)) entries in the hash table for which i < j. This will reduce the number of entries in the hash table by nearly half, thereby halving the memory required to store the hash table during the recognition phase. For such a hash table, if f(u, v) hashes to location (x, y) in the probe step of the recognition phase, the entries in location (x, y) and the mirror-entries in location (x, 99 − y) are collected in a list, in order to compute the winning (model, basis) pairs in subsequent processing. The mirror-entry of (m, (i, j)) is (m, (j, i)).
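Collecting the entries and mirror-entries for a probe bin from the halved table can be sketched as follows (a hypothetical helper, with the bins stored in a dictionary keyed by location):

```python
# Sketch of the lookup against the halved hash table described above.
def collect_entries(halved_table, x, y):
    """Candidate entries for bin (x, y) when only i < j entries are stored:
    take bin (x, y) directly and reconstruct mirrors from bin (x, 99 - y)."""
    entries = list(halved_table.get((x, y), []))
    for m, (i, j) in halved_table.get((x, 99 - y), []):
        entries.append((m, (j, i)))               # mirror-entry (m, (j, i))
    return entries
```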
Figure 6.3: Hash table data structure a) symmetric indexing in the hash table b) hash entries in the normal hash table c) reduction in hash entries using symmetries
For a database consisting of 1024 models, with each model containing 16 feature points, the size of the normal hash table would be about 20 Mbytes, assuming 6 bytes for each (m, (i, j)) entry. Using the symmetries mentioned above, the size of the hash table is reduced to about 10 Mbytes. The workstations (SUN SPARCstation 5) that we used in implementing the parallel geometric hashing algorithm have 32 Mbytes of local memory. Hence, unlike in the previous approaches, each processor (a workstation in this implementation) can store a separate but complete copy of the hash table during the parallel execution of the recognition phase. Note that although we have used a synthesized model database, the size of this database is nearly the same as the size of a typical model database used in state-of-the-art image understanding techniques employing geometric hashing (Wang et al., 1994).
The algorithm for performing multiple probes in the recognition phase can easily be parallelized using the Farmer-Worker pattern. Each worker workstation in the Farmer-Worker pattern has a copy of the hash table and a set of scene features in its local memory prior to the start of the recognition phase. These are loaded from a file created during the preprocessing phase. The Farmer generates arbitrary basis sets and assigns each to a different worker for processing. Each worker performs the corresponding probe step using its assigned basis set. Each worker communicates the winning (model, basis) pair(s) (if any) to the Farmer controlling the whole process. When no winning (model, basis) pairs are found, each worker is assigned another basis set to perform a new probe. This process continues until winning (model, basis) pairs are found or for a fixed number of iterations. The algorithm is outlined below:
1. Generate basis sets and assign each to a different workstation (worker).

2. Perform the probe step using the assigned basis set on each worker.

3. Select the (model, basis) pairs that receive a count of votes above a certain threshold value (if any). If no such (model, basis) pairs exist, repeat the procedure from step 1 for a certain number of iterations or until some specified condition.

4. Verify the potential models found in step 3 (if any) against the set S of features in the scene.
5. Remove the feature points of the matching model(s) (if applicable) from the scene and repeat steps 1-5 until some specified condition.
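The Farmer-Worker organisation of steps 1-5 can be sketched as follows. This is an illustrative sketch only: a thread pool stands in for the workstation cluster, and the per-worker probe routine of section 6.1.2 is taken as a parameter; it is not the thesis code.

```python
# Sketch of the Farmer loop: one basis set per worker per round, stopping
# at the first round that produces winning (model, basis) pairs.
from concurrent.futures import ThreadPoolExecutor

def farmer(probe_fn, basis_sets, n_workers, max_rounds=10):
    """Return the first batch of winners found, or [] after max_rounds."""
    it = iter(basis_sets)
    with ThreadPoolExecutor(n_workers) as pool:
        for _ in range(max_rounds):
            batch = [b for _, b in zip(range(n_workers), it)]
            if not batch:
                break                             # no basis sets left to try
            winners = [w for ws in pool.map(probe_fn, batch) for w in ws]
            if winners:
                return winners                    # candidates go to verification
    return []
```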
Table 6.1: Execution time in (min:sec) for the geometric hashing algorithm

  No. of Probes                  Number of Workstations
                    1     2     4     6     8    10    12    14    16
        50        1:03  0:36  0:21  0:17  0:13  0:12  0:11  0:10  0:12
       100        2:00  1:05  0:37  0:27  0:22  0:18  0:17  0:15  0:15
       150        3:06  1:37  0:50  0:36  0:29  0:24  0:23  0:21  0:18
       200        4:00  2:05  1:06  0:47  0:37  0:30  0:28  0:24  0:23
       250        5:03  2:36  1:21  0:56  0:44  0:37  0:32  0:28  0:27
Figure 6.4: Performance of the geometric hashing algorithm for object recognition (left: execution time (min) v/s processors; right: speedup v/s processors, showing the ideal speedup and the curves for 50, 150, and 250 probes)
The execution times for the recognition phase of the geometric hashing algorithm parallelized using different numbers of workstations are shown in Table 6.1. A plot of these execution times and the speedups achieved for this phase are shown in Figure 6.4. Since the communication time in the algorithm is negligible compared to the computation time, the observed speedups are quite close to the ideal speedups, as can be seen from Figure 6.4.
For comparison, we compute the time required to perform 200 probes in the previous implementations. Using an 8K processor connection machine, the time required for performing 200 probe steps is approximately 5 mins (based on the 1.52 secs/probe time reported in (Rigoutsos & Hummel, 1992)). The time required to perform the same number of probes on 32 nodes of a CM-5 connection machine would be around 38 secs (assuming the minimum time of 188 millisecs/probe as reported in (Wang et al., 1994)). Using 512 processors (the maximum on the CM-5) and performing multiple probes concurrently (each probe implemented on a partition of 32 processors), the time required to perform 200 probes may be reduced to 2 to 3 secs. However, the latter implementation (assuming such an implementation is possible) may need significant programming effort in order to exploit the hardware of the underlying parallel machine.
From Table 6.1, it can be seen that the time required to perform 200 probes using 16 workstations is only 23 secs. Hence, as this example illustrates, a workstation environment can provide a reasonable or sometimes even better performance compared to the conventional dedicated parallel machines. Note that the earlier implementations of the parallel geometric hashing algorithm are fine-grained. In contrast, the parallel implementation of the geometric hashing algorithm presented in this section is relatively coarse-grained.
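The speedups behind Figure 6.4 can be reproduced directly from the 200-probe row of Table 6.1 (times converted to seconds):

```python
# Speedup and parallel efficiency implied by the 200-probe row of Table 6.1.
times_200 = {1: 240, 2: 125, 4: 66, 6: 47, 8: 37, 10: 30, 12: 28, 14: 24, 16: 23}
speedup = {p: times_200[1] / t for p, t in times_200.items()}
efficiency = {p: speedup[p] / p for p in times_200}   # fraction of the ideal
```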
6.3 Multi-scale active shape description - an application
All our previous discussions in this thesis have so far concentrated upon parallelizing individual vision algorithms using corresponding design patterns. In this section, we discuss the parallelization of complete modules, each comprising a collection of several algorithms, at an application level. We take an application from the field of medical imaging, namely, multi-scale active shape description of MR (magnetic resonance) brain images using active contour models. This application forms a part of the research work carried out in the Department of Computer Science, University College London, UK (Schnabel, 1997). We present a brief overview of this application and discuss the parallelization of some of its modules.
6.3.1 An overview of the shape description process
Detecting and describing brain deformations in certain brain diseases (e.g. epilepsy) is a major task in MR imaging. A conventional method of detecting and describing these deformations is to first segment the cross-sectional images (image slices) of the brain into different regions. These regions correspond to different parts of the brain. After identifying the relevant region(s), a set of shape measurements (e.g. area, perimeter, etc.) is applied to these region(s) in order to detect and describe appropriate shape deformations. This task is performed manually by expert clinicians. The conventional method of finding and describing shape deformations is, however, time consuming and tedious. It usually involves processing large volumes of volumetric brain data. Also, due to a shortage of expert clinicians, it is difficult to diagnose each patient within a given time constraint. As a result, there is great demand to automate the shape description process in order to produce meaningful shape descriptions reliably and quickly.
The research work in (Schnabel, 1997) aims to automate the shape description process, and attempts to present it as a shape description tool for diagnosis. The shape description tool enables both quantitative and qualitative shape analysis at different levels of image resolution (or scale). The shape analysis process uses concepts in multi-scale image processing (Marr, 1982), (Witkin, 1983) to describe shape changes across several scales. These concepts are based on the fact that global shape features of objects in an image can be visualized at coarser levels of image resolution (higher scale), but finer shape features of these objects can be observed only at finer levels of image resolution (lower scale). The shape description tool enables description of the shape characteristics and shape changes across several different scales, starting from either end of the scale. The actual shape extraction from the image slices is performed by using active contour models. Active contour models or snakes are energy-minimizing spline contours used for image segmentation (Kass et al., 1987).
The main steps/modules in the shape description process are: a) preprocessing b) propagation c) shape focusing and d) shape analysis. The preprocessing step involves the application of simple image processing techniques such as thresholding, histogram equalization, and morphological operations (opening), on each image slice in the volumetric brain data.
These operations are applied to enhance the objects of interest in each image slice. The propagation step computes shape contours for each image slice as shown in Figure 6.5(a). The process begins by first computing a shape contour for some intermediate image slice. An intermediate image slice is the one at the center or near the center of the set of all image slices. The shape contour for the intermediate image slice is computed by applying an optimization procedure (Williams & Shah, 1992) on a given initial contour (usually a circle), superimposed on a Gaussian-blurred output of the image slice. The optimized shape contour of the intermediate image slice is then propagated to both its neighboring image slices as shown in Figure 6.5(a).

Using the optimized shape contours as initial contours (superimposed on the Gaussian outputs of the corresponding image slices), the shape contours of both neighboring image slices are computed by applying the same optimization procedure. This process is repeated by propagating the shape contours to both sides of the brain volume (i.e. the two image slice partitions defined by the intermediate image slice) as shown in Figure 6.5(a). At the end of the propagation process, each image slice has an associated initial shape contour which is used as an input in the subsequent shape focusing step.
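The propagation order described above can be sketched as follows, with the snake optimization of (Williams & Shah, 1992) abstracted as a function parameter; the names are illustrative, not from the thesis.

```python
# Sketch of contour propagation from the intermediate slice outwards to
# both halves of the volume. optimize(slice, contour) stands in for the
# Gaussian blurring plus snake optimization of each slice.
def propagate(slices, optimize, initial_contour):
    """Return one initial contour per slice, propagated from the middle."""
    mid = len(slices) // 2
    contours = [None] * len(slices)
    contours[mid] = optimize(slices[mid], initial_contour)
    for s in range(mid + 1, len(slices)):         # towards the last slice
        contours[s] = optimize(slices[s], contours[s - 1])
    for s in range(mid - 1, -1, -1):              # towards the first slice
        contours[s] = optimize(slices[s], contours[s + 1])
    return contours
```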
The shape focusing step operates on each image slice separately. It begins with the construction of a scale-space for each image slice. A scale-space of an image slice consists of a set of images obtained by convolving the image slice with a Gaussian function using increasing values of σ, where σ represents the scale or width of the scaling operator. Using the multi-scale active contour model (Schnabel, 1997), the shape focusing process extracts a shape of interest from the various images in the image scale-space of each image slice. This is performed by propagating the initial shape contour (computed in the propagation step) through the various images in the image scale-space (starting at the lowest resolution or highest scale), and regularizing the active contour model's energy function with respect to the scale. The initial, intermediate, and final shape focusing results form a multi-scale shape stack as shown in Figure 6.5(b). An illustration of the shape focusing process applied on four different images (scales) in the image scale-space of an image slice is shown in Figure 6.6.
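Scale-space construction can be sketched as follows. For brevity the sketch blurs a 1D signal; an image slice would use the same kernel in a separable 2D convolution. All names are our own.

```python
# Sketch of Gaussian scale-space construction (1D for brevity).
import math

def gaussian_kernel(sigma):
    """Normalized Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    k = [math.exp(-(x * x) / (2 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def blur(signal, sigma):
    """Convolve with the Gaussian kernel, replicating the border values."""
    k = gaussian_kernel(sigma)
    r, n = len(k) // 2, len(signal)
    return [sum(k[r + d] * signal[min(max(i + d, 0), n - 1)]
                for d in range(-r, r + 1)) for i in range(n)]

def scale_space(signal, sigmas=(1, 2, 4, 8)):
    """One blurred copy per scale; larger sigma means coarser detail."""
    return [blur(signal, s) for s in sigmas]
```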
In the final shape analysis step, each multi-scale shape stack is analyzed using classic
Figure 6.5: Multi-scale shape description process a) propagation step applied on a set of five image slices b) multi-scale shape stack of an image slice computed in the shape focusing step (Figure (b) adapted from (Schnabel, 1997))
Figure 6.6: Shape focusing performed at different scales in the image scale-space of an image slice using active contour models: (a) σ = 8 (b) σ = 4 (c) σ = 2 (d) σ = 1. Image (a) also contains the initial contour superimposed in black. All images are taken from (Schnabel, 1997).
shape descriptors in order to find the global and local changes in the shape. The shape contour at each layer or scale of the multi-scale shape stack is used to compute the mean and slope measurements for finding shape changes between the layers. These shape
Figure 6.7: Visualization of the stack contours (those displayed in Figure 6.6) stacked using triangulation. Image taken from (Schnabel, 1997).

contours are also stacked and visualized (volume visualization) for qualitative inspection (Figure 6.7). Also, for each scale, the corresponding shape contours across all the multi-scale shape stacks are stacked and visualized for global inspection.
6.4 Parallelization of the shape description process
In this section we discuss parallelization of some of the modules in the multi-scale shape description process applied to the volumetric brain data of epileptic patients. The task is to obtain shape descriptions of the grey matter/cortical interface of the brain in order to enable the study of its structural abnormalities (cortical dysgenesis) related to the symptoms of epilepsy. The number of image slices in the volumetric brain data involved in this application is 124 (for each patient), of which only 96 image slices contain the image of the actual grey matter. Each image slice is of size 256×256 pixels (with slice thickness 5 mm, and pixel size 0.9375 mm²).
We discuss three different approaches for parallelizing the shape description process. Each approach uses a different design pattern, namely, Temporal Multiplexing, Pipeline, or Composite Pipeline. Of the three approaches, we provide experimental results only for the first approach. For the remaining two approaches, we provide estimates of the corresponding parallel execution times. These estimates are reasonable approximations because the components in the Pipeline/Composite Pipeline implementations use existing sequential codes. Using the sequential execution time of each component, it is easy to estimate the overall parallel
execution time in these implementations (ignoring the communication overheads). The communication overheads are relatively negligible and can therefore be safely ignored (they involve communication of 256×256 images and/or simple data structures (e.g. contours, etc.)). Also, in all three approaches, we do not discuss parallelization of the final shape analysis step. The shape analysis step requires data from the multi-scale shape stacks of all image slices, and therefore can only be performed sequentially.
Using the sequential code developed in (Schnabel, 1997), the time required to perform the preprocessing step on each image slice is 3 secs, while the time required to perform the corresponding propagation step is 16 secs. The shape focusing step comprises a sequence of operations such as Gaussian smoothing, computing of image potentials ('Compute Potential'), and optimization. These operations are applied to each image (total 16) in the image scale-space of a given image slice. The Gaussian smoothing operation produces a smoothed image, while the 'Compute Potential' operation extracts certain image features, such as the magnitude and direction of the image gradient, the image curvature, and distance-transformed ridges of the gradient magnitude, from the smoothed image. The 'Compute Potential' operation stores these image features in a data structure called 'Potential', which, along with the smoothed image, is used for computing the shape contour of the image during the optimization operation.
The three shape focusing operations require average processing times of 2 secs, 7 secs, and 7 secs, respectively. Therefore the total time required to perform the shape focusing step on a single image slice is 256 secs (16*16). Hence, for a set of 96 image slices, the total sequential time required to perform the preprocessing, propagation and shape focusing steps is 26400 secs (7 hrs, 20 mins). In order to maintain consistency with earlier discussions, we assume availability of at most 16 workstations for the parallelization process.
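The sequential timing arithmetic above can be checked with a short script (a sketch; the per-operation times are those quoted in the text):

```python
# Sequential time estimate for the shape description process,
# using the per-operation times quoted above (in seconds).
PREPROCESS = 3          # preprocessing per image slice
PROPAGATE = 16          # propagation per image slice
FOCUS_OPS = (2, 7, 7)   # Gaussian smoothing, Compute Potential, optimization
SCALES = 16             # images in the scale-space of one slice
SLICES = 96             # slices containing grey matter

focus_per_slice = sum(FOCUS_OPS) * SCALES                    # 16 * 16 = 256 secs
total = SLICES * (PREPROCESS + PROPAGATE + focus_per_slice)  # 26400 secs
print(focus_per_slice, total)
```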
6.4.1 Parallelization using Temporal Multiplexing pattern
The simplest form of parallelism that can be implemented without major changes to the existing sequential code is realized by using the Temporal Multiplexing pattern. In
this approach, we assume that the preprocessing and propagation steps are performed sequentially. We parallelize only the shape focusing step by processing the image slices on different workstations concurrently. The sequential algorithm to perform the shape focusing process is outlined below (starting at the coarsest level of scale):
1. Generate an image in the image scale-space for the current image slice, using a Gaussian smoothing function.

2. Using the Gaussian image generated in the previous step, compute image potentials (i.e. relevant image features) required for the optimization operation in the next step.

3. Taking the active contour model from the previous image in the image scale-space as an initial shape contour, optimize the shape contour for the current image using the fast local optimization method (Williams & Shah, 1992). Note that for the first image in the image scale-space, the initial shape contour is the one computed in the propagation step.

4. Repeat steps 1-3 for all scales in the image scale-space of the current image slice.

5. Repeat the process from step 1 for all image slices, starting from the coarsest level of scale.
Since the computations of each image slice in the shape focusing step are independent of each other, they can be performed in parallel. Using a set of 16 workstations and a Temporal Multiplexing pattern to process each image slice concurrently, the observed parallel execution time required for processing all the image slices in the shape focusing step is 1656 secs (Table 6.2). The total time required to perform the preprocessing (sequential implementation), propagation (sequential implementation) and shape focusing (parallel implementation) steps is 3480 secs (58 mins). Hence, concurrent processing of the image slices in the shape focusing step leads to a significant reduction in overall application time, although we parallelized only part of the application.
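In code, the Temporal Multiplexing pattern amounts to farming whole image slices out to a pool of workers; a minimal sketch (the `focus_slice` function is a hypothetical stand-in for the sequential shape focusing code, and the thread pool stands in for the 16 workstations):

```python
from concurrent.futures import ThreadPoolExecutor

def focus_slice(slice_id):
    # Hypothetical stand-in for the shape focusing step: in the real
    # application this would run Gaussian smoothing, Compute Potential
    # and optimization over all 16 scales of one image slice.
    return slice_id, "shape-stack-%d" % slice_id

# Each of the 96 image slices is an independent task, so the pool can
# process them concurrently, as in the Temporal Multiplexing pattern.
with ThreadPoolExecutor(max_workers=16) as pool:
    stacks = dict(pool.map(focus_slice, range(96)))
print(len(stacks))  # 96 multi-scale shape stacks
```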
6.4.2 Parallelization using Pipeline pattern
Although the Temporal Multiplexing pattern may also be used for parallelizing the preprocessing step, the propagation step does not enable concurrent processing of image slices for computing the initial shape contours. In the propagation step, the output shape contour of an image slice serves as an input for the computation of the final shape contour of either one or both of its neighboring image slices (Figure 6.5(a)). In such situations, a Pipeline pattern may be used to exploit potential parallelism in an application. One possible implementation of the shape description process using a Pipeline pattern is shown in Figure 6.8. We assume that each component of the Pipeline pattern is implemented on a separate workstation.
[Figure 6.8 diagram: Image Slices → Preprocessing → Propagation → shape focusing components (Gaussian Smoothing → Compute Potential → Optimization, repeated 16 times for each image slice) → Shape Analysis.]
Figure 6.8: Parallelization of the shape description process using a Pipeline pattern. The integer values denote sequential time (in seconds) required for executing corresponding components of the Pipeline pattern.
The processing in the Pipeline pattern begins by passing the intermediate image slice through the preprocessing component, followed by the adjacent image slices in either of the two image slice partitions shown in Figure 6.5(a). The preprocessing component processes a given image slice (called the current image slice) and passes it to the propagation component. The propagation component optimizes the shape contour of the current image slice. It stores this shape contour for use as an input during processing of the subsequent image slice. The propagation component passes the optimized shape contour and the current image slice to the 'Gaussian Smoothing' component in the shape focusing step.

The shape focusing step computes a multi-scale shape stack for the current image slice as follows. The 'Gaussian Smoothing' component of the Pipeline pattern generates
Gaussian-blurred images of the current image slice, using decreasing values of sigma (total 16 sigma values). These Gaussian-blurred images are then sequentially passed from the 'Compute Potential' component to the 'Optimization' component. The 'Optimization' component optimizes the shape contour of the current Gaussian-blurred image, and stores it for use as an input for computing the shape contour of the subsequent Gaussian-blurred image. These operations in the shape focusing step are repeated for 16 different sigma values. After completion of the shape focusing step on the current image slice, the resulting multi-scale shape stack of the current image slice is passed to the shape analysis step. The shape analysis step can be performed separately and is therefore enclosed in a dotted box.
A single Pipeline pattern may be used to process both image slice partitions (defined by the intermediate image slice) one after the other. Alternatively, two Pipeline patterns (Multiple Pipelines) may be used for processing each partition concurrently. We use the second approach since it reduces the overall execution time by almost half. Assuming the 48th image slice as the intermediate image slice, we divide the set of 96 image slices into two partitions, containing 48 and 49 image slices, respectively (both partitions contain the intermediate image slice for propagating the initial shape contour). We estimate the time required to process image slices in the larger of the two image slice partitions (i.e. the one containing 49 image slices). This estimate also represents the total time required for processing all 96 image slices, since the smaller image slice partition can be processed concurrently along with the larger one.
The preprocessing and propagation component operations can be overlapped with the operations in the shape focusing step (except for the first image slice). We therefore concentrate on the shape focusing step. Assuming overlap of computations in the three different operations of the shape focusing step, the time required to perform the shape focusing step on 49 image slices is 5497 secs (2 (latency) + 7 (latency) + (7*16)*49). The latency terms in the expression represent the execution times required for performing the corresponding operations (i.e. Gaussian Smoothing and Compute Potential) for the first Gaussian-blurred image of the first image slice. The term '(7*16)' denotes the time required for performing the 'Optimization' operation on all Gaussian-blurred images of the current image slice. This also represents the time required for performing the shape focusing step on the current image slice (except for the first image slice), since other operations in
the shape focusing step are executed concurrently. Hence, the total time required for performing the preprocessing, propagation, and shape focusing steps is 5516 secs (1 hr, 31 mins, 56 secs). Note that the times for the preprocessing and propagation steps (as shown in Table 6.2) in the Pipeline implementation represent the execution times required to perform these steps only on the first image slice.
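The estimate above follows the usual pipeline rule (fill latency plus the slowest stage times the number of items); the arithmetic can be sketched as:

```python
# Pipeline estimate for the larger partition: once the pipe is full,
# throughput is set by the slowest stage ('Optimization').
SMOOTH, POTENTIAL, OPTIMIZE = 2, 7, 7  # secs per Gaussian-blurred image
SCALES, SLICES = 16, 49                # scales per slice; larger partition

fill_latency = SMOOTH + POTENTIAL                    # first image reaching 'Optimization'
focusing = fill_latency + OPTIMIZE * SCALES * SLICES # 5497 secs
total = 3 + 16 + focusing                            # + preprocessing/propagation of first slice
print(focusing, total)
```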
The total execution time of the application parallelized using two simple Pipeline patterns is considerably higher than the execution time in the previous parallel implementation. This drop in performance is due to the time-complexity of the shape focusing step and the inability to use additional workstations in the parallelization process. As there are only five components in a Pipeline, the two Pipeline patterns together can use only 10 workstations. The parallel implementation using a Temporal Multiplexing pattern can utilize all 16 workstations. Hence, although the percentage of the application code parallelized using a Pipeline pattern is higher than in the previous approach, the inability to scale the number of workstations used in parallelization means there is no improvement in the overall performance of the application over the earlier approach.
6.4.3 Parallelization using Composite Pipeline pattern
The limitations in both the Temporal Multiplexing and Pipeline patterns can be resolved by using a Composite Pipeline pattern. The main bottleneck in the simple Pipeline pattern is the shape focusing step, which requires a parallel execution time of 112 secs (7*16), or approximately 2 mins, for computing a multi-scale shape stack for each image slice (assuming overlapping of computations of individual operations in the shape focusing step). The performance or throughput of a simple Pipeline pattern depends on the speed of its slowest component. Hence, by using a Temporal Multiplexing pattern at the shape focusing step, significant performance gains can be achieved in the simple Pipeline pattern. The resulting pattern constitutes a Composite Pipeline pattern, as shown in Figure 6.9.

In the Composite Pipeline pattern, we implement the preprocessing and propagation steps on a single workstation. The remaining 15 workstations can be used for parallelizing the shape focusing step using a Temporal Multiplexing pattern.

[Figure 6.9 diagram: Image Slices → Preprocessing and Propagation → Shape Focusing (TM pattern, with multiple Shape Focusing workers) → Shape Analysis.]

Figure 6.9: Parallelization of the multi-scale shape description process using a Composite Pipeline pattern.

The processing of each image slice in the Composite Pipeline pattern begins with the execution of the preprocessing and propagation steps. Each image slice that passes through the first stage (preprocessing and propagation) can immediately use a free workstation to perform the shape focusing step. This is because the time required to process each image slice in the first stage is 19 secs. The shape focusing step requires 256 secs (sequential time) to process each image slice. As there are 15 workstations in the second stage of the Composite Pipeline pattern, the average time required to perform the shape focusing step on each image slice is approximately 17 secs (256/15), which is lower than the time spent in the first stage. Therefore, any image slice that passes through the first stage can use some free workstation that has completed processing its previous image slice (if applicable).
Also, with the exception of the last image slice, the operations of the shape focusing step can be completely overlapped with the operations of the preprocessing and propagation steps. The shape focusing time shown in Table 6.2 for the Composite Pipeline implementation therefore represents the time required to process only the last image slice. The time required for the preprocessing and propagation steps in this implementation is the same as that in the sequential version. Hence, the total execution time of the application parallelized using the Composite Pipeline pattern is 2080 secs (34 mins, 40 secs), which represents a significant improvement over the earlier approaches.
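The Composite Pipeline estimate can likewise be sketched: the first stage is the steady-state bottleneck, and only the last slice's shape focusing is left unhidden:

```python
# Composite Pipeline estimate: one workstation runs preprocessing and
# propagation for all slices; 15 workers overlap the shape focusing step.
SLICES = 96
FIRST_STAGE = 3 + 16    # preprocessing + propagation per slice (19 secs)
FOCUS = 256             # sequential shape focusing time per slice
WORKERS = 15

# The 15 workers keep up with the first stage (256/15 ≈ 17 < 19 secs),
# so only the last slice's focusing time is not overlapped.
assert FOCUS / WORKERS < FIRST_STAGE
total = SLICES * FIRST_STAGE + FOCUS
print(total)  # 2080 secs (34 mins, 40 secs)
```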
The shape description example illustrates that, using simple design patterns and most of the existing sequential code, the workstation environment can offer significant benefits
Table 6.2: Execution times (in seconds) for different implementations and individual steps of the shape description process

Implementation          Preprocessing   Propagation   Shape Focusing   Total Time
Sequential              288             1536          24576            26400 (7 hrs, 20 mins)
Temporal Multiplexing   288             1536          1656             3480 (58 mins)
Multiple Pipelines      3               16            5497             5516 (1 hr, 31 mins, 56 secs)
Composite Pipeline      288             1536          256              2080 (34 mins, 40 secs)
of parallelizing many vision applications. Although workstation clusters may or may not be used in the final system implementation, they can provide significant support for developing and prototyping applications requiring a large amount of computing time, in many research and other organizational setups which do not have dedicated parallel computing facilities.
6.5 Summary
In this chapter, we discussed the parallel implementation of the recognition phase of the geometric hashing algorithm used for object recognition. We also discussed parallelization of the multi-scale active shape description process using three different patterns, namely, Temporal Multiplexing, Pipeline, and Composite Pipeline. The recognition phase of the geometric hashing algorithm performs several probe steps for identifying an object in a scene image. Each probe step (associated with a basis set) comprises a sequence of operations for finding potential models that match the scene features. We have developed a coarse-grained parallel algorithm for the recognition phase. This algorithm performs multiple probes on different workstations concurrently. The operations of each probe are, however, performed on a single workstation. The performance of this parallel algorithm, parallelized using the Farmer-Worker pattern, has shown encouraging results. The performance results are sometimes even better than those in earlier implementations performed
on dedicated parallel machines.
The parallelization of the multi-scale active shape description process for MR brain images of epileptic patients has also shown promising results. The sequential execution time required to process 96 image slices is 7 hrs, 20 mins. This includes the time required for performing the preprocessing, propagation, and shape focusing steps in the shape description process. The corresponding observed/estimated parallel execution times using the Temporal Multiplexing, Pipeline (Multiple Pipelines), and Composite Pipeline patterns are 58 mins; 1 hr, 31 mins, 56 secs; and 34 mins, 40 secs, respectively. Of the three patterns, Temporal Multiplexing is the simplest to implement. However, not all modules can be parallelized using this pattern alone. The Pipeline pattern has limited scalability with respect to an increase in the number of workstations used in parallelization. Using Multiple Pipelines solves this problem partially but not completely. The Composite Pipeline pattern resolves the limitations in both the Temporal Multiplexing and Pipeline patterns, and therefore achieves better performance results in comparison with the other two patterns.
The examples in this chapter illustrate that, using simple design patterns and most of the existing sequential code, the workstation environment can offer significant benefits for parallelizing many high level vision algorithms and/or applications. They can provide significant support for developing and prototyping applications requiring a large amount of computing time, in many research and other organizational setups which do not have dedicated parallel computing facilities.
Chapter 7
Conclusion
7.1 Aims and Motivation
The research work in this thesis is aimed at presenting and evaluating a set of design patterns intended to support the parallelization of vision applications on coarse-grained parallel machines, such as a cluster of workstations. Workstation environments have recently proved to be effective and economical platforms for high performance computing compared to conventional parallel machines. They offer several advantages for parallelizing and executing large applications on a relatively low-priced and readily available pool of machines. However, developing parallel applications on such machines involves complex decisions such as dividing the applications into several processes, distribution of these processes over various processors, scheduling of processor time between competing processes, and synchronization of the communication between different processes.

Developing parallel programs to control these decisions usually involves writing explicit program code for process scheduling, process communication, and sometimes even computation in a single routine. This style of parallel code development increases program complexity, and reduces program reliability and code reusability. Writing explicit parallel code for parallelizing various applications on a cluster of workstations has some additional problems. The available machines and their capabilities can vary dynamically during
program execution or from one execution to another. This can sometimes lead to a significant reduction in the overall performance of an application. Also, most developers do not wish to spend time on low level programming details in order to gain the advantages of potential parallelism in an application. About 69% of parallel programmers (Pancake, 1996) modify or use existing blocks of code to compose new programs. Moreover, the modification or partial reuse of existing code or program design is often restricted to individual developers. There is very little sharing of design knowledge among developers.
The parallel programs used for implementing the majority of vision tasks utilize a finite set of recurring algorithmic structures or parallel programming models. Our research has aimed at capturing and articulating the design information in these algorithmic structures in the form of design patterns. We have specified various aspects of the parallel behavior of each design pattern (e.g. structure, process placement, communication patterns, etc.) in its definition, or separately as issues to be addressed explicitly during its implementation. Design patterns decouple the code for implementing low level parallel programming details (i.e. process scheduling, communication, etc.) from the code for managing the actual computation. Such decoupling ensures program reliability and code reusability. Design patterns capture design information in a form that makes them usable in different situations and in future work. The design patterns presented in this thesis would enable researchers and developers to implement many interactive and batch applications in computer vision on workstation clusters.
A cluster of workstations is characterized by high communication costs and a variation in the speed factors of the individual machines in the network. A key factor that minimizes the effect of high communication costs on performance is the 'granularity' (section 1.1) of a parallel algorithm, which describes the amount of work associated with each process/task relative to the communication. A cluster of workstations is inherently coarse-grained. We have formulated the design patterns so that they implement coarse-grained parallelism. Also, due to the variation in speed factors of individual machines, an application parallelized on such machines needs to include proper load balancing strategies in order to obtain maximum performance gains. The design patterns presented in this thesis attempt to distribute the work load according to the speed factors of the individual machines in the network. This load balancing is performed either statically (i.e. before the start of the computation), or
dynamically (during the computation).
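The static variant described above can be sketched as a proportional split of work items by machine speed factor (an illustrative sketch; the speed factors and the rounding scheme are assumptions, not the thesis's exact strategy):

```python
def static_partition(n_items, speeds):
    """Split n_items among workers in proportion to their speed factors."""
    total_speed = sum(speeds)
    shares = [n_items * s // total_speed for s in speeds]
    # Hand out any remainder left by rounding down to the fastest workers.
    for i in sorted(range(len(speeds)), key=lambda k: -speeds[k]):
        if sum(shares) == n_items:
            break
        shares[i] += 1
    return shares

# e.g. 96 image slices over four machines with speed factors 1, 1, 2, 4:
print(static_partition(96, [1, 1, 2, 4]))  # [12, 12, 24, 48]
```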
We began our work by analyzing the computation and communication characteristics of vision tasks. We identified various forms of parallelism in vision tasks and formulated design patterns to implement these tasks. Each design pattern captures common designs used by developers to parallelize their tasks. We presented a catalogue of design patterns to implement various forms of parallelism in vision tasks on a cluster of workstations.
Our next goal in this thesis has been to evaluate the use of these design patterns for parallelizing vision tasks on a cluster of workstations. We have implemented representative vision algorithms in low, intermediate and high level vision processing, and presented the experimental results of the corresponding parallel implementations. The results of these implementations have helped us to critically assess the use of design patterns for achieving performance gains in various algorithms. They have also enabled us to evaluate the viability of using workstation clusters for implementing parallel vision applications.
7.2 Research Review
The literature on parallelization of vision algorithms/applications is vast, but there have been no previous efforts to abstract and document the design information from their corresponding parallel implementations. In chapter 3, we have attempted to capture and document this design information in the form of design patterns so that they can be used for parallelizing many vision algorithms/applications on coarse-grained parallel machines, such as a cluster of workstations. A catalogue of key design patterns for parallel vision applications would give standard names and definitions to the techniques used in the parallelization of these applications. Each pattern has been described in a uniform way using a template which provides a description of how each pattern works, where it should be applied, and what the trade-offs in its use are.

The design patterns presented in chapter 3 include Farmer-Worker, Master-Worker, Controller-Worker, Divide-and-Conquer, Temporal Multiplexing, Pipeline, and Composite Pipeline. The Farmer-Worker pattern is used for implementing data parallel algorithms which require no communication during computation. Both the Master-Worker and Controller-Worker patterns are used for parallelizing problems exhibiting data parallelism, but which require communication of intermediate results during processing. The Divide-and-Conquer pattern is used for parallelizing algorithms that use a recursive strategy to split a problem into smaller subproblems and merge the solutions to these subproblems into a final solution. The Temporal Multiplexing pattern is used for processing several data sets or image frames on multiple processors. Finally, the Pipeline and Composite Pipeline patterns are used for parallelizing applications that can be divided into a sequence (pipeline) of several independent subproblems which are executed in a determined order. In the Composite Pipeline pattern, each subproblem may be further parallelized using other relevant design patterns.
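As a minimal illustration of the simplest of these patterns, a Farmer-Worker computation (data parallelism with no inter-worker communication) can be sketched as follows; the squaring task and helper names are illustrative, not taken from the thesis's implementations:

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    # Purely local, data-parallel computation: no communication with
    # other workers is needed during processing.
    return [x * x for x in chunk]

def farmer(data, n_workers=4):
    # The farmer splits the data into contiguous chunks, hands one
    # chunk to each worker, and gathers the partial results in order.
    size = -(-len(data) // n_workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(worker, chunks)
    return [y for part in parts for y in part]

print(farmer([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 4, 9, 16, 25, 36, 49, 64]
```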
After presenting a catalogue of design patterns, our next task in this thesis has been to evaluate the use of these patterns for parallelizing vision algorithms/applications on a cluster of workstations. We have implemented various representative algorithms in low, intermediate and high level vision processing, and presented the experimental results. In chapter 4, we presented parallel implementations of some representative low level vision algorithms. Low level algorithms parallelized using the Controller-Worker pattern (e.g. histogram equalization and 2D-FFT) do not result in any significant speedups due to the time complexity of all-to-all worker communications in this pattern. But other low level algorithms parallelized using the Farmer-Worker pattern (e.g. convolution and rank filtering) and the Master-Worker pattern (e.g. 'iterative' image sharpening and image restoration) have shown encouraging results. However, applications parallelized using the Master-Worker pattern on enterprise clusters (section 2.5.3) may result in dynamic load imbalances and subsequently a reduction in the overall performance of the application.
In chapter 5, we presented parallel implementations of two intermediate level vision algorithms, namely, the region-based split and merge segmentation algorithm, and the line grouping algorithm based on the principles of perceptual organization. The segmentation algorithm parallelized using the Divide-and-Conquer (DC) pattern does not exhibit performance scalability, owing to the increase in the corresponding time required for merging the segmented subimages. If communication time is not a dominant factor, the performance of an algorithm parallelized using a DC pattern is in fact influenced mainly by the time complexity of the merging operation. The line grouping algorithm has been parallelized using an 'iterative' variant of the Controller-Worker pattern. The performance of the parallel line grouping algorithm, however, does not show any improvement over its corresponding sequential implementation. The time spent in actual computation is significantly lower than the time spent in all-to-all worker communications in the Controller-Worker pattern. In fact, it is very difficult to achieve any significant performance gains using the Controller-Worker pattern, especially when it involves frequent all-to-all worker communications.
In chapter 6, we discussed the parallel implementation of the recognition phase of the geometric hashing algorithm used for object recognition. The recognition phase performs several probe steps for identifying an object from a scene image. Each probe step (associated with a basis set) comprises a sequence of operations for finding potential models that match the scene features. We developed a coarse-grained parallel algorithm for the recognition phase by performing multiple probes on different workstations concurrently. The operations of each probe are, however, performed on a single workstation. The parallel implementation of the recognition phase (using the Farmer-Worker pattern) has in certain cases achieved better results than earlier implementations performed on dedicated parallel machines.
We also discussed the parallelization of the multi-scale active shape description process
using three different patterns: Temporal Multiplexing, Pipeline, and Composite Pipeline.
All three implementations have shown promising results. Of the three patterns, Temporal
Multiplexing is the simplest to implement since it allows most of the existing sequential
code to be reused in the parallel implementation. However, not all modules of this
application can be parallelized using the Temporal Multiplexing pattern alone. Use of the
Pipeline pattern increases the degree of parallelization, but this pattern has limited
scalability. The Composite Pipeline pattern resolves the limitations of both the Temporal
Multiplexing and Pipeline patterns, and therefore achieves better application performance
than the other two patterns.
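The idea behind the Composite Pipeline can be sketched in a few lines. In this
hypothetical Python sketch, each stage of a two-stage pipeline is itself a small farm of
worker processes, so a slow stage can be widened independently instead of capping the
throughput of the whole pipeline; the stage functions are illustrative stand-ins for the
shape-description modules.

```python
from concurrent.futures import ProcessPoolExecutor

def smooth(x):
    # Stage 1 of a hypothetical shape-description pipeline.
    return x * 0.5

def describe(x):
    # Stage 2; imagine this is the slower, analysis-heavy stage.
    return x + 1.0

def composite_pipeline(frames, width1=2, width2=2):
    # Composite Pipeline: each stage is a farm whose width can be
    # chosen per stage (width1, width2).
    with ProcessPoolExecutor(width1) as s1, ProcessPoolExecutor(width2) as s2:
        stage1 = s1.map(smooth, frames)        # frames stream into stage 1
        return list(s2.map(describe, stage1))  # and on through stage 2
```

For example, `composite_pipeline([2.0, 4.0, 6.0])` returns `[2.0, 3.0, 4.0]`; a plain
Pipeline corresponds to fixing both widths at one worker per stage.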
To summarize, the examples in this thesis have shown that for most low level and high
level vision algorithms the workstation environment offers reasonable, and sometimes
significant, benefits from parallelization. Intermediate level algorithms, however, do
not represent ideal candidates for parallel implementation on workstation clusters due to
their ‘communication-intensive’ nature. Most of the applications parallelized using the
Farmer-Worker, Temporal Multiplexing, and Composite Pipeline patterns have shown
encouraging results. Applications parallelized using the Master-Worker,
Divide-and-Conquer, and Pipeline patterns have shown satisfactory results. Applications
parallelized using the Controller-Worker pattern have, however, not yielded any
significant performance gains. Also, the medical imaging application in chapter 6
illustrates that workstation environments can provide significant support for developing
and prototyping applications requiring large amounts of computing time, in many research
and other organizational setups which do not have dedicated parallel computing
facilities.
7.3 Contributions of the research work
The contributions of this dissertation can be evaluated in terms of: a catalogue of
design patterns for parallel vision systems, coarse-grained parallel algorithms for
representative vision applications, and a critical assessment of the use of design
patterns in implementing these applications on workstation clusters. We summarize these
contributions as follows:
• Catalogue of design patterns: We presented a catalogue of design patterns for parallel
vision systems, describing each pattern in terms of its intent, motivation, structure,
interaction amongst the components, and applicability. This description enables the
selection and use of a design pattern in different situations and in future work.
• Coarse-grained parallel algorithms: We presented coarse-grained parallel algorithms and
implementations for several vision tasks such as convolution, image filtering, image
restoration, region-based segmentation, line grouping, and the geometric hashing
algorithm for object recognition. We also presented different parallel implementations of
the multi-scale active shape description process (an application in medical imaging)
using different design patterns.
• Implementation on a cluster of workstations: Using relevant design patterns, we
performed parallel implementations of the selected representative vision tasks stated
above. The results of these implementations enable a critical assessment of the design
patterns for achieving improvements in application performance. They also enable an
evaluation of the viability of using workstation clusters for implementing parallel
vision applications.
7.4 Comparison with related work
Although the concept of abstracting common parallel programming designs in the form of
design patterns is new, there have been several prior efforts to identify and capture
general parallel programming designs/models (Chandy & Kesselman, 1991), (Kung, 1989) as
software components (e.g. implementation machines (Zimran et al., 1990), templates
(Singh et al., 1991), assets (Schaeffer et al., 1993), and skeletons (Darlington et al.,
1993)). These software components comprise ‘ready-to-use’ software routines for
implementing low level programming details (e.g. process scheduling, communication, etc.)
in the corresponding parallel programming models. Systems based on these components
allow programmers to write their parallel programs in terms of the components; the
systems then automatically insert the necessary code for process scheduling and
communication in order to realize the corresponding parallel implementation.
However, these systems do not choose the type of parallelism to apply; this choice is
left to the developer, who judges and selects the best form of parallelism for a
particular application. Also, most of these systems have limited applicability. For
example, the Enterprise system (Schaeffer et al., 1993) does not support data
parallelism, one of the most important forms of parallelism in computer vision. Most of
these systems do not support complex and/or domain-specific parallel programming models
(e.g. the parallelism represented by the Composite Pipeline pattern in vision).
Our research work of presenting design patterns for parallel vision systems differs from
these approaches. We do not present ‘ready-to-use’ program code that can simply be
inserted as a software routine in a parallel implementation. Instead, we identify and
explicitly document the various parallel programming models commonly occurring in
parallel solutions of problems in a certain domain, such as computer vision. The
‘intent’, ‘motivation’, and ‘applicability’ aspects of the design pattern descriptions
enable the user to select appropriate design pattern(s) for parallelizing a given
application. The other aspects of the descriptions provide guidelines for the actual
implementation of the patterns for a particular problem.
A design methodology for parallelizing complete vision systems has also been presented by
Downton et al. (Downton et al., 1996). Their design method, based on a pipeline of
processor farms (PPF), enables the parallelization of complete vision systems (with
continuous input/output) on MIMD parallel machines. The parallelization process in their
design model is performed in a top-down fashion, where parallel implementations of
individual algorithms are treated as components in the design model. While the design
methodology in (Downton et al., 1996) has been implicit, our work has concentrated on
making it explicit: we have documented the PPF design method in the form of the Composite
Pipeline pattern in this thesis. Also, the design method in (Downton et al., 1996)
discusses parallelization mostly at the application level, whereas our work has attempted
to discuss parallelization at both the algorithmic and application levels in vision.
The main disadvantage of design patterns is that they do not provide a detailed solution.
A pattern provides a generic scheme for solving a class of problems, rather than a
‘ready-to-use’ software module that can be inserted into a program; the user must
implement this scheme according to the requirements of the given problem. A pattern thus
provides guidance for solving problems, but not complete solutions.
7.5 Future work
The research work in this thesis has aimed at presenting a set of design patterns
intended to support the parallelization of vision applications on a cluster of
workstations. Using these design patterns we have also parallelized representative vision
algorithms in order to demonstrate their usefulness in implementing these algorithms on
workstation clusters. The research work, however, raises further questions and opens up
topics in a number of research areas, such as:
• Fault tolerance: The available resources in workstation environments (especially in
enterprise clusters) can change dynamically during the parallel execution of an
application. A workstation may become overloaded, or may be powered off for maintenance
purposes or, in the worst case, may crash. The first two cases may be predicted or known
in advance; the third is unexpected and may result in a significant loss of processing
time. Common methods, such as checkpointing and error detection and recovery, have high
overheads. An alternative is to include fault tolerance mechanisms in each design
pattern. Some such attempts (for workstation environments) have been explored in the
‘processor farm’ (Clematis, 1994) and the ‘supervisor-worker’ (Magee & Cheung, 1991)
models (both models represent the Farmer-Worker form of computation).
Detecting a failure in some worker component of a Farmer-Worker pattern is relatively
easy: the Farmer component can detect (and rectify) such a failure when a worker
component does not respond within a certain time limit. Other strategies, for detecting
failures in either the Farmer component or the process communication, may be devised
similarly. Detecting and rectifying failures in other patterns (e.g. Master-Worker and
Pipeline) is, however, complicated. Each worker component in these patterns sends
messages to and receives messages from other worker components, so a failure in any
worker component can lead to deadlock. Devising mechanisms for handling such situations
is a challenging task.
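The timeout-based detection just described can be sketched as follows. This is a
hypothetical Python sketch in which local threads stand in for remote workers (a real
Farmer would use the cluster's message layer): any task whose worker does not reply
within a deadline is assumed lost and is re-issued.

```python
import queue
import threading
import time

def farm_with_retry(tasks, worker, timeout=0.5):
    # Farmer-side failure detection: wait up to `timeout` seconds
    # for each task's reply; on silence, assume failure and retry.
    results = {}
    for task in tasks:
        while task not in results:
            reply = queue.Queue()
            t = threading.Thread(
                target=lambda: reply.put(worker(task)), daemon=True)
            t.start()
            try:
                results[task] = reply.get(timeout=timeout)
            except queue.Empty:
                continue  # no reply in time: re-issue the task
    return results

calls = {"n": 0}

def flaky_square(x):
    # Illustrative worker whose first invocation "crashes" (hangs).
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(5)
    return x * x
```

With this flaky worker, `farm_with_retry([3, 4], flaky_square, timeout=0.2)` still
completes, yielding `{3: 9, 4: 16}` after one silent failure and one retry.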
• Load balancing: The Farmer-Worker and Temporal Multiplexing patterns have an inherent
load balancing property. Other patterns, however, may suffer from load imbalances during
their execution, especially when implemented on enterprise clusters. There is a need for
mechanisms that minimize the effect of load imbalances in these patterns. Load balancing
schemes may be incorporated in the pattern itself. For example, when a workstation
executing a worker component of the Master-Worker pattern is overloaded with external
processes, the worker component may be transferred to another free workstation.
Overloaded worker components may be detected after every cycle of a fixed number of
iterations, until the completion of the computation. The code for performing load
balancing operations may be included in the pattern implementation, or may be part of a
separate design pattern implementation.
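A deliberately minimal migration policy of the kind suggested above might look like the
following hypothetical sketch: after each cycle of iterations, one unit of work moves
from the slowest workstation to the fastest, based on measured per-worker throughput.

```python
def rebalance(assignment, rate):
    # `assignment` maps worker -> units of work; `rate` maps
    # worker -> throughput measured over the last cycle.
    slow = min(rate, key=rate.get)
    fast = max(rate, key=rate.get)
    # Move one unit from the slowest to the fastest worker.
    if slow != fast and assignment[slow] > 0:
        assignment[slow] -= 1
        assignment[fast] += 1
    return assignment
```

Calling this once per cycle gradually shifts work away from an overloaded machine without
any global coordination; a real scheme would also bound the migration cost.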
• Performance prediction: Designing practical models for predicting the parallel
execution time of an application implemented on an enterprise cluster has been a
challenging research area (Yan et al., 1996). We intend to study the feasibility of
designing such models for the design patterns in parallel computer vision. Each design
pattern may include a performance prediction model, as in skeletons (Darlington et al.,
1993) or implementation machines (Zimran et al., 1990). The complexity of the prediction
model depends on the structure of the underlying design pattern. For example, using the
sequential time of an algorithm, it is relatively easy to predict the approximate
parallel execution time of the Farmer-Worker and Temporal Multiplexing implementations.
Similarly, if the sequential execution time of each component in the Pipeline and
Composite Pipeline patterns is known, it is relatively easy to predict the parallel
execution time of the corresponding application. Predicting performance in the
Master-Worker or Divide-and-Conquer patterns is, however, relatively difficult.
Some important factors to consider when designing performance prediction models for each
design pattern (implemented on a workstation cluster) include the computational
complexity of the problem, the number of workstations used, the relative speed factors of
the individual machines, and the network bandwidth. The complexity of the prediction
models is also influenced by the nature of the vision algorithms: while it is relatively
easy to predict performance for well-structured low level vision algorithms, predicting
performance for intermediate and high level vision algorithms is relatively difficult due
to uncertainties in their computations.
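The "relatively easy" cases above admit first-order estimates of the kind sketched below.
These are illustrative textbook-style formulas under idealised assumptions (perfect load
balance, lumped overhead), not the prediction models the thesis proposes.

```python
def predict_farm_time(t_seq, speeds, overhead=0.0):
    # Farmer-Worker / Temporal Multiplexing on a heterogeneous
    # cluster: work spreads in proportion to relative machine
    # speeds, plus a lumped communication-overhead term.
    return t_seq / sum(speeds) + overhead

def predict_pipeline_time(stage_times, n_items):
    # Pipeline estimate from per-stage sequential times: one
    # fill-through of all stages, then the slowest stage paces
    # every remaining item.
    return sum(stage_times) + (n_items - 1) * max(stage_times)
```

For instance, `predict_farm_time(100.0, [1.0, 1.0, 2.0])` gives `25.0` for a job that
takes 100 time units sequentially, spread over one double-speed and two unit-speed
machines. The uncertain computation times of intermediate and high level algorithms make
`t_seq` and `stage_times` themselves hard to estimate, which is the difficulty noted
above.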
Appendix A
Notation
A.1 Pattern Diagram
We use a variant of the object model to describe the components and their relationships
in a design pattern (Buschmann et al., 1996).
[Figure: pattern diagram of the Master-Worker pattern. A Master component (procedures:
SplitWork, SendSubtasks, CollateResults, SendFinalResults) is connected to Worker (1)
through Worker (p), each with procedures DoCalculation, ExchangeData, and SendResults.]
The components are shown as rectangular boxes, denoting the name of the component and the
procedures associated with it. A line connecting two components denotes an association.
A.2 Object Interaction Charts
We adapt the Object Message Sequence Chart (OMSC) notation given in (Buschmann et al.,
1996) to describe the object interactions among the components of a pattern.
[Figure: Object Message Sequence Chart for the Master-Worker pattern, with components
Client, Master, Worker (1), and Worker (2). The Client's CallToParallelize triggers
SplitWork in the Master, which issues SendSubtask to each worker; the workers loop over
DoCalculation and send back their results, which the Master collates (CollateResults)
before returning them via SendFinalResults.]
The components in a pattern are drawn as rectangular boxes, labeled with their
corresponding names. The activities of the components are denoted by vertical bars
attached to the bottom of each box (activity lines). Messages between the components are
denoted by horizontal arrows. Elapsed time runs from top to bottom; the time axis,
however, is not to scale. An iterative computation is shown by an upward arrow, while a
procedure call within a pattern component is shown by a small downward arrow.
Bibliography
Alexander, C. (1979), The timeless way of building, Oxford University Press, New York, US.
Alnuweiri, H. M. & Prasanna, V. K. (1992), “Parallel architectures and algorithms for image component labeling”, IEEE Transactions on Pattern Analysis and Machine Intelligence 14(10), 1014-1034.
Alonso, R. & Cova, L. L. (1988), Sharing jobs among independently owned processors, in “Proceedings of the 8th International Conference on Distributed Computing Systems”, IEEE Computer Society Press, pp. 282-288.
Amdahl, G. M. (1988), “Limits of expectation”, International Journal of Supercomputer Applications 2(1), 88-94.
Anderson, T. E., Culler, D. E., Patterson, D. A. et al. (1995), “A case for NOW (Networks of Workstations)”, IEEE Micro Feb, 54-64.
Angus, I., Fox, G. C., Kim, J. S. & Walker, D. W. (1989), Solving Problems on Concurrent Processors, Prentice-Hall, Englewood Cliffs, New Jersey, US.
Atallah, M. J., Black, C. L., Marinescu, D. C. et al. (1992), “Models and algorithms for coscheduling compute-intensive tasks on a network of workstations”, Journal of Parallel and Distributed Computing 16, 319-327.
Awcock, G. J. & Thomas, R. (1995), Applied Image Processing, Macmillan, Basingstoke, England.
Ballard, D. H. & Brown, C. M. (1982), Computer Vision, Prentice-Hall, Englewood Cliffs, New Jersey, US.
Beck, K., Coplien, J. O., Crocker, R., Dominick, L. et al. (1996), “Industrial experience with design patterns”, IEEE Proceedings of ICSE-18, pp. 103-113.
Beguelin, A., Dongarra, J., Geist, A., Jiang, W., Manchek, R. & Sunderam, V. S. (1992), PVM 3 User's Guide and Reference Manual, ornl/tm-12187 edition, Oak Ridge National Laboratory, Oak Ridge, Tennessee, US.
Beguelin, A., Dongarra, J., Geist, A., Manchek, R. & Sunderam, V. S. (1991), “Solving computational grand challenges using a network of heterogeneous supercomputers”, Proceedings of Fifth SIAM Conference on Parallel Processing.
Boden, N. J., Cohen, D., Felderman, R. E. et al. (1995), “Myrinet: A gigabit-per-second local area network”, IEEE Micro Feb, 29-35.
Boldt, M., Weiss, R. & Riseman, E. (1989), “Token-based extraction of straight lines”, IEEE Transactions on Systems, Man, and Cybernetics 19(6), 1581-1594.
Bourdon, O. & Medioni, G. (1988), “Object recognition using geometric hashing on the connection machine”, International Conference on Pattern Recognition, pp. 596-600.
Buschmann, F. & Meunier, R. (1995), A System of Patterns, in J. O. Coplien & D. C. Schmidt (eds.), “Pattern Languages of Program Design”, Addison-Wesley, Reading, MA, US, pp. 325-343.
Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P. & Stal, M. (eds.) (1996), Pattern-Oriented Software Architecture - A System of Patterns, Wiley and Sons, Chichester, UK.
Buxton, H. et al. (1986), “A parallel approach to the picture restoration algorithm of Geman and Geman on an SIMD machine”, Image and Vision Computing, pp. 133-142.
Chandy, K. M. & Kesselman, C. (1991), “Parallel Programming in 2001”, IEEE Software Nov, 11-20.
Chaudhary, V. & Aggarwal, J. K. (1990), Parallelism in computer vision: a review, in V. Kumar, P. S. Gopalakrisnan & L. N. Kanal (eds.), “Parallel Algorithms for Machine Intelligence and Vision”, Springer Verlag, pp. 271-309.
Chaudhary, V. & Aggarwal, J. K. (1991), “On the complexity of parallel image component labeling”, International Conference on Parallel Processing III, 183-187.
Cheng, D. Y. (1993), A survey of parallel programming languages and tools, Technical Report RND-93-005, NASA Ames Research Center.
Chin, R. T. & Dyer, C. R. (1986), “Model-based recognition in robot vision”, ACM Computing Surveys 18(1), 67-108.
Choudhary, A. & Thakur, R. (1994), “Connected component labelling on coarse-grain parallel computers - an experimental study”, Journal of Parallel and Distributed Computing 20(1), 79-83.
Choudhary, A. N. & Patel, J. H. (1990), Parallel Architectures and Parallel Algorithms for Integrated Vision Systems, Kluwer Academic Publishers, Boston, USA.
Clark, H. & McMillin, B. (1992), “DAWGS - a distributed compute server utilizing idle workstations”, Journal of Parallel and Distributed Computing 14(2) Feb, 175-186.
Clematis, A. (1994), “Fault tolerant programming for network based parallel computing”, Microprocessing and Microprogramming 40, 765-768.
Coplien, J. O. & Schmidt, D. C. (eds.) (1995), Pattern Languages of Program Design, Addison-Wesley, Reading, MA, US.
Copty, N., Ranka, S., Fox, G. & Shankar, R. V. (1989), “A data parallel algorithm for solving the region growing problem on the connection machine”, Journal of Parallel and Distributed Computing 21(1), 160-168.
Darlington, J., Field, A. J., Harrison, P. G., Kelly, P. H. J. et al. (1993), Parallel programming using skeleton functions, Technical Report DoC 93/6, Imperial College, London, UK.
Dolan, J. & Weiss, R. (1993), “Perceptual grouping of curved lines”, Proceedings of the DARPA Image Understanding Workshop, pp. 1135-1145.
Downton, A., Tregidgo, R. W. S. & Cuhadar, A. (1996), Generalized parallelism for embedded vision applications, in A. Y. H. Zomaya (ed.), “Parallel Computing: Paradigms and Applications”, International Thomson Computer Press, London, UK, pp. 553-577.
Duda, R. O. & Hart, P. E. (1972), “Use of the Hough transformation to detect lines and curves in pictures”, Communications of the ACM, pp. 11-15.
Duff, M. & Levialdi, S. (eds.) (1982), Languages and Architectures for Image Processing, Academic Press, 24/28 Oval Road, London NW1 7DX, UK.
Duncan, R. (1992), “Parallel computer architectures”, Advances in Computers 34, 113-157.
Efimov, N. V. (1966), An elementary course in analytical geometry, Pergamon Press, Oxford.
Flynn, M. J. (1972), “Some computer organizations and their effectiveness”, IEEE Transactions on Computers C-21(9).
Foster, I. T. (1995), Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley, Reading, MA; Wokingham.
Gamma, E., Helm, R., Johnson, R. & Vlissides, J. (1994), Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Reading, MA, US.
Gonzalez, R. C. & Woods, R. E. (1993), Digital Image Processing, Addison-Wesley, Reading, MA, US.
Grimson, W. (1990), Object Recognition by Computer: The Role of Geometric Constraints, MIT Press.
Grimson, W. E. L. & Huttenlocher, D. P. (1991), “On the verification of hypothesized matches in model-based recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence 13(12), 1201-1213.
Hambrusch, S., He, X. & Miller, R. (1994), “Parallel algorithms for gray-scale digitized picture component labeling on a mesh-connected computer”, Journal of Parallel and Distributed Computing 20(1), 56-68.
Hameed, F., Hambrusch, S. E., Khokhar, A. A. & Patel, J. N. (1997), “Contour ranking on coarse grained machines: A case study for low-level vision computations”, Concurrency: Practice and Experience 9(3), 203-221.
Haralick, R. M. & Shapiro, L. G. (1985), “Image segmentation techniques”, Computer Vision, Graphics and Image Processing 29, 100-132.
Hodgson, R. M., Bailey, D. G., Naylor, M. J., Ng, A. L. M. & McNeill, S. J. (1985), “Properties, implementations and applications of rank filters”, Image and Vision Computing 3, 3-14.
Horowitz, S. L. & Pavlidis, T. (1974), “Picture segmentation by a directed split-and-merge procedure”, Proceedings of the 2nd International Joint Conference on Pattern Recognition, pp. 424-433.
Huertas, A., Lin, C. & Nevatia, R. (1993), “Detection of buildings from monocular views of aerial scenes using perceptual grouping and shadows”, Proceedings of the DARPA Image Understanding Workshop, pp. 253-260.
Hussain, Z. (1991), Digital Image Processing, Practical Applications of Parallel Processing Techniques, Ellis Horwood, Chichester, West Sussex, UK.
Irvine, D. S. (1995), “Computer-assisted semen analysis systems - Sperm motility assessment”, Human Reproduction 10(S1), 53-59.
Kadam, S., Roberts, G. & Buxton, B. (1996), Parallelizing vision-related applications on network of workstations using design patterns, Technical Report RN/96/25, Department of Computer Science, University College, London, UK.
Kadam, S., Roberts, G. & Buxton, B. (1997), “Design patterns for parallelizing vision-related applications on network of workstations”, The 11th Annual International Symposium on High Performance Computing Systems, HPCS'97 Jul, 569-583.
Kapoor, S. et al. (1994), “Depth and Image Recovery Using a MRF Model”, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1117-1122.
Kass, M., Witkin, A. & Terzopoulos, D. (1987), “Snakes: Active contour models”, Proceedings of the 1st International Conference of Computer Vision, pp. 259-268.
Kendall, P. & Uhr, L. (eds.) (1982), Multicomputers and Image Processing, Algorithms and Programs, Academic Press, 111 Fifth Avenue, NY 10003, USA.
Kramer, H. P. & Bruckner, J. B. (1975), “Iterations of a non-linear transformation for enhancement of digital images”, Pattern Recognition 7, 53-58.
Kung, H. T. (1989), Computational models of parallel computers, in R. J. Elliot & C. A. R. Hoare (eds.), “Scientific Applications of Multiprocessors”, Prentice Hall.
Lamdan, Y. & Wolfson, H. (1988), “Geometric hashing: a general and efficient model based recognition scheme”, International Conference on Computer Vision, pp. 238-249.
Lee, C. K. & Hamdi, M. (1995), “Parallel image processing applications on a network of workstations”, Parallel Computing 21(1), 137-160.
Lee, J. S. (1983), “Digital image smoothing and sigma filter”, Computer Vision, Graphics and Image Processing 24, 255-269.
Litzkow, M. J., Livny, M. & Mutka, M. W. (1988), Condor - A hunter of idle workstations, in “Proceedings of the 8th International Conference on Distributed Computing Systems”, IEEE Computer Society Press, pp. 104-111.
Lowe, D. G. (1985), Perceptual Organization and Visual Recognition, Kluwer Academic Press, Hingham, MA, US.
Lu, H. Q. & Aggarwal, J. K. (1992), “Applying perceptual organization to the detection of man-made objects in non-urban scenes”, Pattern Recognition 25(8), 835-853.
Magee, J. N. & Cheung, S. C. (1991), “Parallel algorithm design for workstation clusters”, Software-Practice and Experience 21(3) Mar, 235-250.
Mardia, K. V. & Kanji, G. K. (eds.) (1993), Statistics and Images, Vol. 1 of Advances in Applied Statistics Series, Carfax Publishing Company, PO Box 25, Abingdon, Oxfordshire OX14 3UE, UK. A Supplement to Journal of Applied Statistics Volume 20 Nos 5/6 1993.
Marr, D. (1982), Vision: A computational investigation into the human representation and processing of visual information, W. H. Freeman, San Francisco.
Mattson, T. G. (1996), Scientific computation, in A. Y. H. Zomaya (ed.), “Parallel and Distributed Computing Handbook”, McGraw Hill, McGraw Hill series on Computer Engineering, pp. 981-1002.
Mohan, R. & Nevatia, R. (1989), “Using perceptual organization to extract 3-D structures”, IEEE Transactions on Pattern Analysis and Machine Intelligence 11(11), 1121-1139.
Monroe, R. T., Kompanek, A., Melton, R. & Garlan, D. (1997), “Architectural styles, design patterns, and objects”, IEEE Software 14(1), 43-52.
Mutka, M. W. & Livny, M. (1987), Scheduling remote processing capacity in a workstation-processor bank network, in “Proceedings of the 7th International Conference on Distributed Computing Systems”, IEEE Computer Society Press, pp. 2-9.
Nagao, M. & Matsuyama, T. (1979), “Edge preserving smoothing”, Computer Graphics and Image Processing 9, 394-407.
Nakanishi, H. & Sunderam, V. S. (1992), “Superconcurrent simulation of polymer chains on heterogeneous networks”, Proceedings of IEEE Supercomputing Symposium.
Narayan, P., Chen, L. & Davis, L. (1992), “Effective use of SIMD parallelism in low- and intermediate-level vision”, IEEE Computer 25 Feb, 68-73.
Page, I. (ed.) (1988), Parallel Architectures and Computer Vision, Oxford University Press.
Pancake, C. (1996), “What computer scientists and engineers should know about parallelism and performance”, Computer Applications in Engineering Education 4(2), 145-160.
Pitas, I. (1993), Digital Image Processing Algorithms, Prentice Hall, New York, US.
Prasanna Kumar, V. (ed.) (1991), Parallel Architectures and Algorithms for Image Understanding, Academic Press, 1250 Sixth Avenue, San Diego, CA 92101.
Prasanna, V. K. & Wang, C. L. (1996), Parallelism for Image Understanding, in A. Y. H. Zomaya (ed.), “Parallel and Distributed Computing Handbook”, McGraw Hill, McGraw Hill series on Computer Engineering, pp. 1042-1070.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (1992), Numerical Recipes in C, Cambridge University Press, New York, US.
Ranka, S. & Sahni, S. (1990), “Image template matching on MIMD hypercube multicomputers”, Journal of Parallel and Distributed Computing 10, 79-84.
Reynolds, G. & Beveridge, J. R. (1987), “Searching for geometric structure in images of natural scenes”, Proceedings of the DARPA Image Understanding Workshop, pp. 257-271.
Rigoutsos, I. & Hummel, R. (1992), “Massively parallel model matching: geometric hashing on the connection machine”, IEEE Computer, pp. 33-42.
Rosenfeld, A. (1988), “Computer Vision”, Advances in Computers 27, 265-308.
Rosenfeld, A. & Kak, A. C. (1982), Digital Picture Processing, Academic Press, New York, US.
Ruff, B. P. D. (1988), A pipelined architecture for a video rate canny operator used at the initial stage of a stereo image analysis system, in I. Page (ed.), “Parallel Architectures and Computer Vision”, Oxford University Press.
Schaeffer, J., Szafron, D., Lobe, G. & Parsons, I. (1993), “The Enterprise model for developing distributed applications”, IEEE Parallel and Distributed Technology Aug, 85-96.
Schnabel, J. A. (1997), Multi-Scale Active Shape Description in Medical Imaging, PhD thesis, University College London, London, UK.
Siegel, H., Armstrong, J. B. & Watson, D. (1992), “Mapping computer vision related tasks onto reconfigurable parallel processing systems”, IEEE Computer 25 Feb, 54-63.
Silverman, R. D. & Stuart, S. J. (1989), “A distributed batching system for parallel processing”, Software-Practice and Experience 19(12) Dec, 1163-1174.
Singh, A., Schaeffer, J. & Green, M. (1991), “A template-based approach to the generation of distributed applications using a network of workstations”, IEEE Transactions on Parallel and Distributed Systems 2(1) Jan, 52-67.
Sonka, M., Hlavac, V. & Boyle, R. (1993), Image Processing, Analysis and Machine Vision, Chapman and Hall, London, UK.
Steenkiste, P. (1996), “Network-based multicomputers: a practical supercomputer architecture”, IEEE Transactions on Parallel and Distributed Systems 7(8) Aug, 861-875.
Stout, Q. F. (1987), “Supporting divide-and-conquer algorithms for image processing”, Journal of Parallel and Distributed Computing 4(1), 95-115.
Sunderam, V. (1990), “PVM: a framework for parallel distributed computing”, Concurrency: Practice and Experience 2, 315-339.
Sunwoo, M. H., Baroody, B. S. & Aggarwal, J. K. (1987), “A parallel algorithm for region labeling”, Proceedings of the IEEE Workshop on Computer Architecture for Pattern Analysis and Machine Intelligence, pp. 27-34.
Tandiary, P., Kothari, S. C., Dixit, A. & Anderson, E. W. (1996), “Batrun: utilizing idle workstations for large-scale computing”, IEEE Parallel and Distributed Technology Summer, 41-48.
Theimer, M. M. & Lantz, K. A. (1988), Finding idle machines in a workstation-based distributed system, in “Proceedings of the 8th International Conference on Distributed Computing Systems”, IEEE Computer Society Press, pp. 112-122.
Turcotte, L. (1993), A survey of software environments for exploiting networked computing resources, Technical Report MSM-EIRS-ERC-93-2, Mississippi State University.
Turcotte, L. H. (1996), Cluster computing, in A. Y. H. Zomaya (ed.), “Parallel and Distributed Computing Handbook”, McGraw Hill, McGraw Hill series on Computer Engineering, pp. 762-779.
Uhr, L. (ed.) (1987), Parallel Computer Vision, Academic Press, Boston, USA.
Uhr, L., Preston, K., Levialdi, S. & Duff, M. J. B. (eds.) (1986), Evaluation of Multicomputers for Image Processing, Academic Press, New York, USA.
Wang, C. L. (1995), High performance computing for vision on distributed memory machines, PhD thesis, University of Southern California, USA.
Wang, C. L., Bhat, P. B. & Prasanna, V. K. (1996), "High-performance computing for vision", Proceedings of the IEEE 84(7) Jul, 931-946.
Wang, C. L., Prasanna, V. K., Kim, H. J. & Khokhar, A. A. (1994), "Scalable data-parallel implementations of object recognition using geometric hashing", Journal of Parallel and Distributed Computing 21(1), 96-109.
Wang, X. & Blum, E. K. (1996), "Parallel execution of iterative computations on workstation clusters", Journal of Parallel and Distributed Computing 34, 218-226.
Webb, J. (1994), "High performance computing in image processing and computer vision", International Conference on Pattern Recognition Sep, 218-222.
Weems, C. C., Levitan, S. P., Hanson, A. R., Riseman, E. M. et al. (1989), "The image understanding architecture", International Journal of Computer Vision 2(3), 251-282.
Willebeek-LeMair, M. & Reeves, A. P. (1990), "Solving non-uniform problems on SIMD computers: case study on region growing", Journal of Parallel and Distributed Computing 8(2), 135-149.
Williams, D. & Shah, M. (1992), "A fast algorithm for active contours and curvature estimation", CVGIP: Image Understanding 55(1), 14-26.
Wilson, G. V. & Lu, P. (eds.) (1996), Parallel Programming Using C++, The MIT Press, Cambridge, Massachusetts, London, UK.
Witkin, A. (1983), "Scale-space filtering", International Joint Conference on Artificial Intelligence, pp. 1019-1022.
Yalamanchili, S. & Aggarwal, J. K. (1994), "Parallel processing methodologies for image processing and computer vision", Advances in Electronics and Electron Physics 87, 259-300.
Yan, Y., Zhang, X. & Song, Y. (1996), "An effective and practical performance prediction model for parallel computing on nondedicated heterogeneous NOW", Journal of Parallel and Distributed Computing 38, 63-80.
Zimran, E., Rao, M. & Segall, Z. (1990), "Performance efficient mapping of applications to parallel and distributed architectures", International Conference on Parallel Processing II, 147-154.