Design Patterns for Parallel Vision Applications

Sanjay S. Kadam

A thesis submitted for the degree of
Doctor of Philosophy
in the University of London

UCL
University College London
Department of Computer Science

June 1998
Abstract

Computer vision is a challenging application for high performance computing. To meet its computational demands, a number of SIMD and MIMD based parallel machines have been proposed and developed. However, due to high costs and long design times, these machines have not been widely used. Recently, network based environments, such as a cluster of workstations, have provided effective and economical platforms for high performance computing. But developing parallel applications on such machines involves complex decisions about distribution of processes over the processors, scheduling of processor time between competing processes, communication patterns, etc. Writing explicit code to control these decisions increases program complexity and reduces program reliability and code re-usability.

We propose a design methodology based on design patterns which is intended to support parallelization of vision applications on a cluster of workstations. We identify common algorithmic forms occurring repeatedly in parallel vision algorithms and formulate these as design patterns. We specify various aspects of the parallel behaviour of a design pattern, such as process placement or communication patterns, in its definition or separately as issues to be addressed explicitly during its implementation. Design patterns promote program reliability and code re-usability since they capture the essence of working designs in a form that makes them usable in different situations and in future work.

The research work is concerned with presenting a catalogue of design patterns to implement various forms of parallelism in vision applications on a cluster of workstations. Using relevant design patterns, we implement representative vision algorithms in low, intermediate and high level vision tasks. The majority of these implementations show promising results. For example, given a 512x512 image, the image restoration algorithm based on the Markov random field model can be completed in less than 45 seconds on a network of 16 workstations (Sun SPARCstation 5). The same task takes more than 10 minutes on a single such workstation.
Acknowledgements

I thank my supervisors Dr. Graham Roberts and Prof. Bernard Buxton for providing invaluable suggestions, moral support and a cordial atmosphere for conducting this research work at the Department of Computer Science, University College London.

I am obliged to the Association of Commonwealth Universities for financing my research work through the British Council in the form of a Commonwealth Scholarship. I also thank Dr. Vijay P. Bhatkar, Executive Director, C-DAC (Centre for Development of Advanced Computing), Pune, India, for approving the required study leave (from my current employment) for completion of my Ph.D. work.

I am grateful to many colleagues and friends with whom I had both academic and non-academic interactions. My special thanks go to Jonathan Poole for introducing me to the concepts in parallel and distributed computing using UC++ (a concurrent extension of C++). I also thank Dr. Julia Schnabel for providing me with an application in medical imaging which served as an excellent example for parallelization as discussed in chapter 6. I am also thankful to Dr. Niladri Chatterjee for reviewing the early drafts of this thesis and providing numerous suggestions towards enhancing the technical quality of the material presented. Many thanks to Kamalendu Pal, Arif Iqbal, Adil Qureshi, Ihsan Khan, Giannis Koufakis, Dr. Bill Langdon and other researchers in the department for several discussions on both technical and social aspects of student life in London.

I express utmost gratitude to my parents and relatives for all their good wishes and blessings. I am also indebted to my wife Dr. Pratima for her continuous support, encouragement and concern for the completion of this research work. Her adjustments to the seclusion and frequent disruptions to family life during the course of this study were highly appreciated. And finally, thanks to our little sons KirtiRaj and Pranav who, with their ever cheerful appearance and innocent but invigorating smiles, helped me to relieve myself from the tensions and hardships of student life.
Contents

1 Introduction 16
  1.1 Overview 16
  1.2 Aims of this Research Work 20
  1.3 Contributions of the Dissertation 21
  1.4 Organization of the thesis 22

2 Parallelism in Computer Vision 24
  2.1 Parallel Computing 25
    2.1.1 Parallel computing systems 25
    2.1.2 Algorithmic classes 27
    2.1.3 Performance of parallel programs 28
  2.2 An Overview of Computer Vision 29
    2.2.1 Object recognition in 2D scenes 30
    2.2.2 Feature Detection 31
    2.2.3 Segmentation 33
    2.2.4 Resegmentation 35
    2.2.5 Properties and Relations 35
    2.2.6 Object Recognition 36
  2.3 Computational Characteristics 36
    2.3.1 Low level processing 37
    2.3.2 Intermediate level processing 37
    2.3.3 High level processing 38
  2.4 Parallel systems for vision 39
    2.4.1 Mesh connected systems 39
    2.4.2 Pyramids 40
    2.4.3 Hypercubes 40
    2.4.4 Shared memory machines 41
    2.4.5 Pipelined Systems and Systolic arrays 42
    2.4.6 Partitionable Systems 42
    2.4.7 General purpose parallel systems 43
  2.5 Computing on workstation clusters 44
    2.5.1 Cluster Configuration 44
    2.5.2 Advantages of workstation clusters 46
    2.5.3 Use of clusters 47
    2.5.4 Parallel computing using Clusters 48
  2.6 Parallelization using Design Patterns 50
    2.6.1 Design patterns 50
    2.6.2 Forms of Parallelism in Vision 52
    2.6.3 Design patterns for parallel vision 55
  2.7 Related work 56
  2.8 Summary 58

3 Design patterns for parallelizing vision applications 60
  3.1 Organization of patterns 61
  3.2 Description of design patterns - a template 63
  3.3 Farmer-Worker Pattern 65
  3.4 Master-Worker Pattern 70
  3.5 Controller-Worker Pattern 76
  3.6 Divide-and-Conquer Pattern 82
  3.7 Temporal Multiplexing Pattern 87
  3.8 Pipeline Pattern 91
  3.9 Composite Pipeline Pattern 98
  3.10 Summary 104

4 Low level algorithms 106
  4.1 Parallelization of low level algorithms 109
  4.2 Partitioning the image data 110
  4.3 Grey scale transformations 112
  4.4 Image filtering 113
    4.4.1 Convolution 114
    4.4.2 Rank filtering 119
    4.4.3 Spatial filters 120
  4.5 Fast Fourier transforms 122
  4.6 Image restoration 124
    4.6.1 Markov random field models for image recovery 124
  4.7 Summary 128

5 Intermediate level processing 130
  5.1 Region based segmentation 132
  5.2 Parallel Region-based segmentation 134
  5.3 Segmentation using Perceptual Organization 138
    5.3.1 Sequential Line grouping algorithm 139
    5.3.2 Parallel Line grouping algorithm 143
  5.4 Summary 145

6 High level processing 147
  6.1 Sequential geometric hashing algorithm 148
    6.1.1 Preprocessing Phase 149
    6.1.2 Recognition phase 150
  6.2 Parallel geometric hashing algorithm 152
  6.3 Multi-scale active shape description - an application 157
    6.3.1 An overview of the shape description process 158
  6.4 Parallelization of the shape description process 161
    6.4.1 Parallelization using Temporal Multiplexing pattern 162
    6.4.2 Parallelization using Pipeline pattern 164
    6.4.3 Parallelization using Composite Pipeline pattern 166
  6.5 Summary 168

7 Conclusion 170
  7.1 Aims and Motivation 170
  7.2 Research Review 172
  7.3 Contributions of the Research work 175
  7.4 Comparison with related work 176
  7.5 Future work 177

A Notation 180
  A.1 Pattern Diagram 180
  A.2 Object Interaction Charts 181

Bibliography 182
List of Figures

2.1 An overview of a typical vision based application 31
2.2 Processing levels in a typical vision based application 37
2.3 A 4-connected mesh, pyramid and a 3-dimensional hypercube of processing elements 40
2.4 Shared memory machines (interconnected by a bus and switching network) and systolic/pipeline systems 43
2.5 Common cluster configurations: bus, star and a ring 45
3.1 Farmer-Worker Pattern 66
3.2 Object Interaction in the Farmer-Worker Pattern 67
3.3 Master-Worker Pattern 72
3.4 Object Interaction in the Master-Worker Pattern 73
3.5 Controller-Worker Pattern 77
3.6 Object Interaction in the Controller-Worker Pattern 78
3.7 Convolution masks for finding a) horizontal edges and b) vertical edges 82
3.8 DC Pattern 83
3.9 Object Interaction in the DC Pattern 84
3.10 TM Pattern 88
3.11 Object Interaction in the TM Pattern 89
3.12 Vehicle identification system 92
3.13 Pipeline Pattern 93
3.14 Object Interaction in the Pipeline Pattern 94
3.15 Vehicle identification system 99
3.16 Composite Pipeline Pattern 100
3.17 Object Interaction in the Composite Pipeline Pattern 101
4.1 Partitioning of an image. a) Row partitioning b) Row partitioning with data that is to be overlapped and/or communicated 111
4.2 Performance of histogram equalization 113
4.3 Performance of the convolution operation using a 3x3 window 115
4.4 Performance of the convolution operation using a 15x15 window 116
4.5 Performance of the convolution operation on a 1Kx1K image 117
4.6 Performance of the Farmer-Worker pattern in the convolution operation on varying the processor load and number of subtasks (window size 15x15) 118
4.7 Performance of the sharpening operation using spatial filters (window size 11x11) 121
4.8 The data blocks needed to transpose the intermediate results 123
4.9 Performance of the image restoration algorithm using the MRF model (window size 3x3) 126
4.10 Performance of the Master-Worker pattern (in the image recovery operation using the MRF model on a 512x512 image) subject to the external load and load distribution 128
5.1 a) Partitioned image b) Corresponding quadtree 133
5.2 a) Distribution of subimages b) Merging of subimages 135
5.3 Performance of the parallel split and merge segmentation algorithm 136
5.4 Line Grouping 139
5.5 Relational constraints in the line grouping algorithm a) proximity b) collinearity and c) continuation 141
5.6 Indexing technique used in the line grouping process. a) search area for the base line b) the index array 142
6.1 Preprocessing phase in the geometric hashing algorithm a) Orthogonal coordinate system defined by the basis set b) Adding (model, basis) pairs in the hash table 150
6.2 Recognition phase in the geometric hashing algorithm a) Orthogonal coordinate system defined by the basis set b) Accessing and collecting (model, basis) pairs from the hash bins in the hash table 151
6.3 Hash table data structure a) symmetric indexing in the hash table b) hash entries in a normal hash table c) reduction in hash entries using symmetries 154
6.4 Performance of the geometric hashing algorithm for object recognition 156
6.5 Multi-scale shape description process a) propagation step applied on a set of five image slices b) multi-scale shape stack of an image slice computed in the shape focusing step (Figure (b) adapted from (Schnabel, 1997)). 160
6.6 Shape focusing performed at different scales in the image scale-space of an image slice using active contour models: (a) σ = 8 (b) σ = 4 (c) σ = 2 (d) σ = 1. Image (a) also contains the initial contour superimposed in black. All images are taken from (Schnabel, 1997). 160
6.7 Visualization of the stack contours (those displayed in Figure 6.6) stacked using triangulation. Image taken from (Schnabel, 1997). 161
6.8 Parallelization of the shape description process using a Pipeline pattern. The integer values denote sequential time (in seconds) required for executing corresponding components of the Pipeline pattern. 164
6.9 Parallelization of the multi-scale shape description process using a Composite Pipeline pattern 167
List of Tables

4.1 Execution time in (min:sec) for histogram equalization 113
4.2 Execution time in (min:sec) for the convolution operation 114
4.3 Performance of the Farmer-Worker pattern on varying the external load and number of subtasks. The execution times (min:sec) displayed are for the convolution operation (window size 15x15). 118
4.4 Execution time in (min:sec) for the rank filtering operation 120
4.5 Execution time in (min:sec) for the sharpening operation 121
4.6 Execution time in (min:sec) for the FFT operation 123
4.7 Execution time in (min:sec) for image restoration using the MRF model 125
4.8 Performance of the Master-Worker pattern when subjected to the external load. The execution times (min:sec) displayed are for the image restoration operation using the MRF model on a 512x512 image. 127
5.1 Execution time in (min:sec) for the parallel split and merge segmentation algorithm 136
5.2 Execution time in (min:sec) for various operations in the parallel split and merge segmentation algorithm applied on a 512x512 image 137
5.3 Execution time in (min:sec) for the line grouping process 144
6.1 Execution time in (min:sec) for the geometric hashing algorithm 156
6.2 Execution times in (seconds) for different implementations and individual steps of the shape description process 168
Chapter 1
Introduction
1.1 Overview
Computer Vision deals with the principles and techniques to extract and interpret useful information in a scene by capturing and analyzing images of that scene. It has applications in several areas such as remote sensing, autonomous vehicle guidance, industrial inspection, and medical imaging. Some of these applications, such as autonomous vehicle guidance, are real-time and involve algorithms which must complete their computations within a fraction of a second. Some applications requiring human interaction are interactive and must complete within a few seconds or less, depending on the type of interaction required. Other applications are batch applications which can tolerate a maximum latency of a few hours or even days. The nature of the algorithms involved in these applications is thus varied. But most of these algorithms are computationally intensive and require enormous computing power for their practical implementation.
Computer vision uses a broad spectrum of algorithms covering different areas such as image and signal processing, graph theory, mathematics, and artificial intelligence. From a computational perspective, vision processing is conveniently classified into three levels: low, intermediate, and high. Low level processing involves pixel-based transformations where uniform computations are applied at each pixel or a neighborhood around each pixel in the image data. These computations are mainly numeric and well structured. Intermediate level processing involves both numeric and symbolic computations. It comprises algorithms to form regions of interest in the image data, such as grouping of low level features (e.g. edges) into lines, arcs, or rectangular borders of an object. High level processing involves symbolic computations where data provided by the low and intermediate level algorithms is used for testing and generating hypotheses for object recognition. A typical vision application comprises low, intermediate and high level vision tasks/algorithms and thus involves both numeric and symbolic computations. Therefore, although vision has been identified as a grand challenge application for high performance computing, the computational characteristics of vision applications are different from the structured number crunching computations arising in most other grand challenge applications (Wang et al., 1996).
To meet the computational demands of vision tasks, several efforts have been directed towards providing high performance computing support for their practical implementation. A brief survey of the research efforts in high performance parallel computing for vision can be found in (Webb, 1994). These efforts can broadly be grouped into the following categories, based on the type of computing platforms they utilize: special-purpose hardware chips, SIMD based machines, specialized vision systems and general purpose parallel machines. Special-purpose hardware chips serve as accelerators for specific vision algorithms since they implement the computations in hardware. However, they are suitable only for specific well-structured low level vision algorithms, such as image convolution.
SIMD based multiprocessor machines such as meshes, array processors, hypercubes, and pyramids consist of simple processing elements connected by a communication network. These machines perform well for implementing most of the low level vision algorithms. But they are not well suited for high level vision algorithms since these algorithms involve nonuniform processing and complex data structures. Specialized vision systems are special purpose parallel machines designed to suit the requirements of vision tasks. They are capable of being partitioned into one or more independent SIMD and MIMD subsystems to match the computational characteristics of vision algorithms at various levels. For example, the image understanding architecture (IUA) (Weems et al., 1989) has three hierarchical levels of computing platforms to support processing of low, intermediate and high level vision tasks. Specialized vision systems, however, have complex architectures which involve significant design and development effort. The need to develop new system software for such machines results in huge system development costs.
General purpose parallel machines such as the IBM SP-2, Meiko CS-2, Intel Paragon, Cray T3D, and SGI Power Challenge have been used successfully for a variety of high performance computing applications. These commercial machines have not been developed for any specific applications, but are meant to be general purpose systems. Most of them have a similar architecture consisting of processors interconnected by a high speed network. These processors are those that are used in large uniprocessor workstations. These machines are typically organized as a single box that contains all the processor and memory modules interconnected by a special purpose interconnection network. Although there have been some attempts to use these machines for parallel vision applications (Wang, 1995), they are still not very popular with many organizational setups.
Recently, network-based computing environments, such as a cluster of workstations, have provided effective and economical platforms for high performance computing. A cluster of workstations offers several advantages for parallelizing and executing large applications on a relatively low-priced and readily available pool of machines. It provides multiple CPUs for parallel computing and dramatically improves virtual memory and file system performance. It can approach or exceed supercomputer performance for some applications and can easily be tuned to advances in processor and network technology (Anderson et al., 1995), (Turcotte, 1996). A cluster of workstations can incorporate heterogeneous architectures, so applications can select the most suitable computing resources for each computation.
But developing parallel applications on such machines involves complex decisions about distribution of processes over the processors, process synchronization, scheduling of processor time between competing processes, communication patterns, etc. Writing explicit code to control these decisions increases program complexity and reduces program reliability and code reusability. Also, the available machines and their capabilities can vary from one execution to another, and high communication costs can degrade the performance in many applications. Moreover, developers do not wish to spend time in low level parallel
programming in order to gain the advantages of potential parallelism in an application. Most of them use or modify existing parallel code to implement parallelism for their applications. In fact, some recent surveys of experienced parallel programmers have shown that about 69% modify existing programs or compose programs from existing blocks of code. The remaining 31% who start from scratch are typically computer scientists and applied mathematicians (Pancake, 1996).
The main goal of this thesis is to present a design methodology based on design patterns intended to support parallelization of vision applications on a cluster of workstations. Most of the parallel algorithms used in implementing vision tasks repeatedly use only a finite set of algorithmic forms. We identify these common algorithmic forms and formulate these as design patterns. We specify various aspects of the parallel behavior of a design pattern, such as process placement and communication patterns, in its definition or separately as issues to be addressed explicitly during its implementation. Design patterns ensure program reliability and code reusability since they capture the essence of working designs in a form that makes them usable in different situations and in future work (Coplien & Schmidt, 1995). The use of the design patterns would enable development of distributed software quickly, economically and reliably. Using a cluster of workstations, researchers can use the design patterns to implement many interactive and batch applications in computer vision.
A cluster of workstations is characterized by high communication costs and a variation in the speed factors of individual machines in the network. We need to address these issues while formulating the design patterns. One factor that minimizes the effect of high communication costs on performance is granularity. The granularity of an algorithm describes the amount of work associated with each task relative to communication. An algorithm that exchanges data between its processes after a small number of computations is called fine-grained, while an algorithm where the computations continue for a long time before communication is required is termed coarse-grained. Since a cluster of workstations is inherently coarse-grained, we need to formulate design patterns so that they implement coarse-grained parallelism. Also, the design patterns should distribute the work load according to the speed factors of individual machines in the network.
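The load-distribution idea can be sketched as follows. This is an illustrative Python fragment, not code from the thesis: it splits the rows of an image among machines in proportion to assumed, pre-measured speed factors (the function and its names are hypothetical).

```python
# Hypothetical sketch of speed-factor-based load partitioning: split n_rows
# image rows among machines in proportion to their speed factors, so that
# faster machines receive more work.

def partition_rows(n_rows, speed_factors):
    """Return per-machine row counts proportional to the speed factors."""
    total = sum(speed_factors)
    # Initial proportional shares, rounded down.
    shares = [int(n_rows * s / total) for s in speed_factors]
    # Hand the leftover rows to the fastest machines first.
    remainder = n_rows - sum(shares)
    order = sorted(range(len(speed_factors)),
                   key=lambda i: speed_factors[i], reverse=True)
    for i in range(remainder):
        shares[order[i % len(order)]] += 1
    return shares

if __name__ == "__main__":
    # Three workstations, one twice as fast as the others, 512-row image.
    print(partition_rows(512, [2.0, 1.0, 1.0]))  # → [256, 128, 128]
```

The same proportional split applies to any divisible work unit (image tiles, probe sets, etc.), which is how the patterns described later can absorb heterogeneity in the cluster.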
We begin our work by analyzing the computation and communication characteristics
of vision algorithms. We identify various forms of parallelism in vision algorithms and formulate design patterns to implement them. Each design pattern captures common designs used by developers to parallelize their applications. We present a catalogue of design patterns to implement various forms of parallelism in vision applications on a cluster of workstations. Using relevant design patterns, we implement representative vision algorithms in low, intermediate and high level vision tasks, and present the experimental results of the corresponding parallel implementations.
At the low level, we implement algorithms such as histogram equalization, convolution, image filtering using spatial filters, and image restoration using Markov random field models. At the intermediate level, we implement a region-based split and merge segmentation algorithm and a line grouping algorithm based on principles of perceptual grouping. At the high level, we implement a geometric hashing algorithm for object recognition. We also discuss parallelization of an application in medical imaging, namely, multi-scale active shape description of MR (magnetic resonance) brain images using active contour models.
1.2 Aims of this Research Work
The focus of the work in this thesis is to develop methodologies to support parallelization of vision applications on a cluster of workstations. The main goals of this thesis work are:
• To analyze computational characteristics of vision tasks and identify common algorithmic structures in their parallel implementations.
• To capture and articulate these algorithmic structures as design patterns in a form that makes them usable in different situations and in future work.
• To use these design patterns for implementing some representative vision algorithms in low, intermediate and high level vision processing.
• To evaluate the viability of using a cluster of workstations to parallelize vision applications.
1.3 Contributions of the Dissertation
The contributions of this dissertation are threefold. Firstly, we propose a design methodology based on design patterns intended to support parallelization of vision applications on a cluster of workstations. Secondly, we present coarse-grained parallel algorithms for some representative vision algorithms in low, intermediate and high level vision processing. Thirdly, we use relevant design patterns to implement these parallel algorithms on workstation clusters. These contributions are summarized as follows:
• Design patterns: We identify common algorithmic structures occurring repeatedly in parallel vision tasks/applications and formulate these as design patterns. We describe each design pattern using a template which outlines the intent, motivation, structure, interaction amongst the components and applicability of the design pattern. This description enables selection and use of a design pattern in different situations and in future work.
• Coarse-grained parallel algorithms: We present coarse-grained parallel algorithms and implementations for several vision tasks such as convolution, image filtering, image restoration, region-based segmentation, line grouping, and a geometric hashing algorithm for object recognition. We also present different parallel implementations of the multi-scale active shape description process (an application in medical imaging) using different design patterns.
• Implementation on a cluster of workstations: Using relevant design patterns, we perform parallel implementations of the selected representative vision tasks stated above. The results of these implementations enable critical assessment of the design patterns for achieving improvements in application performance. They also enable evaluation of the viability of using workstation clusters for implementing parallel vision applications.
1.4 Organization of the thesis
The remainder of the thesis is organized as follows:
• Chapter 2 reviews concepts and methods in several different areas related to parallel vision systems. We begin with a brief introduction to parallel computing systems and parallel algorithms. We then describe general principles and methods used in the field of computer vision, with specific emphasis on applications involving analysis of 2D scenes. We also describe the computational characteristics of vision algorithms and outline SIMD and MIMD based parallel machines used for parallelizing these algorithms. We then describe parallel computing on workstation clusters and discuss their advantages over conventional parallel machines. We present various forms of parallelism in vision algorithms and introduce the concept of design patterns intended to support parallelization of vision applications on a cluster of workstations. Finally, we outline some of the leading research efforts related to the work presented in this thesis.
• Chapter 3 presents a detailed description of each design pattern. We use a template to specify various aspects of the parallel behavior (such as process placement and communication patterns) of each design pattern. The templates outline the intent, motivation, structure, interaction amongst the components and applicability of the design patterns.
• Chapter 4 discusses parallelization of some low level vision algorithms such as histogram equalization, convolution, image sharpening using spatial filters, fast Fourier transforms, and image restoration using Markov random field models. Each algorithm is parallelized by using either the Farmer-Worker, Master-Worker or Controller-Worker pattern.
• Chapter 5 presents results of parallelization of some intermediate level algorithms such as region-based segmentation, and a line grouping algorithm based on the principles of perceptual organization. We use the Divide-and-Conquer pattern for implementing the parallel region-based segmentation algorithm. The line grouping algorithm is parallelized by using the Controller-Worker pattern.
• Chapter 6 presents results of parallelization of a high level vision algorithm, namely, geometric hashing for object recognition. We use a Farmer-Worker pattern to perform multiple matching operations (probes) for identifying each object in an image. In the last section of this chapter, we discuss parallelization of an application in medical imaging, namely, multi-scale active shape description of MR (magnetic resonance) brain images using active contour models. We discuss three different approaches to parallelizing the shape description process. Each approach uses a different design pattern, namely, Temporal Multiplexing, Pipeline or Composite Pipeline.
• Finally, chapter 7 presents concluding remarks and directions for future research.
Chapter 2
Parallelism in Computer Vision
Computer vision is a challenging application for high performance computing. Many vision applications are computationally intensive and involve complex processing. For a practical and real-time implementation of vision applications, high-performance computing support is essential. Over the past several years, parallel processing has been perceived to be an attractive and economical way to achieve the required level of performance in vision applications. Computational demands and real-time constraints associated with vision applications have induced several research efforts to explore the use of parallel computing resources for parallelizing vision applications (Webb, 1994). Most vision applications consist of image preprocessing followed by object identification. Although both these tasks involve a large number of computations, they embody different computational paradigms. As a result, several special and general purpose parallel machines have been proposed, developed and used in implementing parallel solutions to many vision algorithms.
This chapter gives an overview of the algorithms in computer vision and presents parallel systems and methodologies used in parallelizing vision applications. The chapter is organized as follows. Section 2.1 introduces some concepts in parallel computing. Section 2.2 gives an overview of the principles and methods involved in the field of computer vision. Section 2.3 discusses computational characteristics of vision applications and their classification into three levels, low, intermediate and high. Section 2.4 outlines different parallel systems used for parallelizing vision applications. Section 2.5 describes parallel
computing on a cluster of workstations. Section 2.6 proposes a methodology, based on design patterns, which can be used to parallelize a majority of the vision applications on network-based machines, such as a cluster of workstations. We also describe various forms of parallelism that can be applied to parallelize vision applications. Finally, section 2.7 outlines some of the leading research efforts which have been inspirational to the work presented in this thesis.
2.1 Parallel Computing
Parallel computing is concerned with applying multiple processors to solve a single computational problem for achieving better performance. This section begins with an introduction to parallel computing systems. It is followed by a description of abstract algorithmic classes characterizing different parallel algorithms. These classes are useful when discussing algorithms at a higher level.
2.1.1 Parallel computing systems
A parallel computer is a collection of processors and memory connected by some type of communication network. Parallel computing systems include a full spectrum of sizes and prices, from a collection of workstations attached to a local-area network, to an expensive high-performance machine with thousands of processors connected by high-speed switches (Duncan, 1992).
The architectures of the computing systems are commonly organized in terms of instruction streams and data streams (Flynn, 1972). The three cases that have become familiar terms to the parallel programmer are SISD (single instruction, single data), SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data). SISD computers are the traditional von Neumann computers that have a single instruction stream and a single data stream. All operations on these computers are logically sequential. In a SIMD parallel computer a single instruction stream is applied to multiple data streams. SIMD-based machines usually consist of a large number of simple processors
connected by an interconnection network. The MIMD model is the most general model of a parallel computer. A MIMD computer has multiple processing elements, each of which is a complete computer in its own right.
Although SIMD systems are easy to program, optimizing SIMD programs to yield acceptable performance is very difficult. As a result, SIMD computers have not been very popular for scientific computing. This makes MIMD systems the overwhelming majority of parallel systems, especially when a cluster of workstations is viewed as a single MIMD computer. A MIMD computer consists of processors and memory. The memory can be either shared or distributed among the processors. We can therefore consider two distinct programming models: shared memory MIMD and distributed memory MIMD. However, since the same issues of data locality and concurrency arise in both cases, we can view a MIMD computer in terms of a common programming model. One such model is the coordination model (Mattson, 1996), (Foster, 1995), where a parallel computation is viewed as a collection of distinct processes which interact at discrete points through a coordination operation. The term coordination refers to the basic operations used to control a parallel computer. It includes coordination operations for information exchange, process synchronization and process management. These coordination operations may vary in speed and structure; however, the overall model is essentially the same.
Describing parallel and distributed computers in terms of a coordination model is not as universally accepted as the von Neumann model (Mattson, 1996). However, such a model can be stated and used for programming parallel computers within a universal programming model. Although the computer systems differ, the difference is granularity (the ratio of computation to communication), and not the fundamental programming model (Mattson, 1996).
The programming model, in order to be useful, must be implemented as a programming environment. There are several programming environments supporting various incarnations of the coordination model which run well on parallel computers as well as on a cluster of workstations (Turcotte, 1993), (Cheng, 1993). One can develop parallel code using some high level language designed specifically to support parallel and distributed computing. Alternatively, one can use a sequential language combined with a coordination library
(often called a message-passing library), such as PVM (Sunderam, 1990).
Programs written for parallel MIMD systems fall into two categories: SPMD (single program multiple data) and MPMD (multiple program multiple data). For SPMD programs, each processor executes the same object code. The SPMD style of programming is easy to code since the programmer needs to maintain only a single source code. In contrast, MPMD programs allow each processor to have a distinct executable code. A programmer can split the program into different modules which can be developed and debugged independently or reused as components of other programs. An MPMD program requires less memory compared to its equivalent SPMD version (Mattson, 1996).
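The SPMD style can be sketched in a few lines. The following Python fragment is illustrative only (real implementations would use a message-passing library such as PVM or MPI): every simulated processor runs the same function, and its rank alone selects the slice of data it owns.

```python
# Hedged sketch of the SPMD style: one function stands in for the single
# object code executed by every processor; the processor's rank determines
# which portion of the data it works on. Ranks and the final reduction are
# simulated sequentially here.

def spmd_worker(rank, n_procs, data):
    """Same code on every processor; rank picks the local portion."""
    local = data[rank::n_procs]   # cyclic ownership of elements by rank
    return sum(local)             # purely local computation

def run_spmd(data, n_procs):
    # Run the n_procs identical "processes" and combine their partial
    # results (the reduction step a message-passing library would provide).
    return sum(spmd_worker(r, n_procs, data) for r in range(n_procs))

if __name__ == "__main__":
    print(run_spmd(list(range(10)), n_procs=3))  # same answer as sum(range(10))
```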
2.1.2 Algorithmic classes
Most of the parallel algorithms can be classified in terms of the regularity of the underlying data structures (space) and the synchronization required as these data elements are updated (time) (Angus et al., 1989), (Mattson, 1996). Based on this classification scheme there are four general classes of parallel algorithms:
1. Synchronous
Synchronous algorithms are those in which regular data elements are updated at regular intervals of time. They are regular in space and regular in time. They involve tightly coupled manipulation of identical data elements. Synchronous algorithms can be expressed in terms of a single instruction stream, and are therefore easily mapped onto SIMD computers. The parallelism is usually expressed in terms of the decomposition of the data. In fact, the data drives the parallelism, hence the name data parallelism. However, data parallelism is more general than SIMD parallelism, since data parallelism does not insist on a single instruction stream.
2. Loosely synchronous
A loosely synchronous algorithm synchronously updates data elements which differ from one processor to another. Loosely synchronous algorithms are regular in time but irregular in space. They have tight coupling between the tasks as in the synchronous case. However, due to variation in the data elements across the processors, the work loads can vary from processor to processor. Hence, loosely synchronous algorithms need some mechanism to balance the computational load among the processors of the parallel computer.
3. Asynchronous
Asynchronous algorithms do not have regular data updates, so the system proceeds with nonuniform and sometimes random synchronization. These algorithms are irregular in time and usually irregular in space, with unpredictable or nonexistent coupling between the tasks. This class of problems, other than the embarrassingly parallel subset described next, is the most rare. This is because programs for implementing asynchronous algorithms are difficult to construct. While synchronous and loosely synchronous algorithms are usually parallelized by focusing on data decomposition, asynchronous algorithms are usually parallelized by decomposition of the control, which is referred to as functional or control parallelism.
4. Embarrassingly parallel
Embarrassingly parallel algorithms are those asynchronous algorithms for which the tasks are completely independent and uncoupled. The parallelism in this case is trivial and the programs are among the simplest parallel programs to construct. Problems in this class are very common in parallel computing since their computations easily map into this model. In fact, any program consisting of a loop with compute-intensive and independent iterations can be parallelized using this model. Embarrassingly parallel programs usually utilize an SPMD style of programming combined with some mechanism for load balancing. Load balancing schemes can either be static or dynamic.
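As a rough illustration of the embarrassingly parallel model (not code from the thesis), the following Python sketch farms the independent iterations of a loop out to a pool of workers; on a workstation cluster each task would instead be shipped to a separate machine, but the program structure is the same. A thread pool is used purely to keep the sketch self-contained.

```python
# Minimal sketch of an embarrassingly parallel loop: the iterations are
# independent and uncoupled, so they can be handed to a worker pool in any
# order. On a cluster, each call to process_tile would run on a different
# workstation (e.g. dispatched via a message-passing library such as PVM).

from multiprocessing.pool import ThreadPool

def process_tile(tile):
    """One independent, compute-intensive iteration: sum of squares here."""
    return sum(x * x for x in tile)

def run_embarrassingly_parallel(tiles, n_workers=4):
    with ThreadPool(n_workers) as pool:
        # map() preserves order, so results line up with the input tiles.
        return pool.map(process_tile, tiles)

if __name__ == "__main__":
    tiles = [list(range(i, i + 4)) for i in range(0, 16, 4)]
    print(run_embarrassingly_parallel(tiles))
```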
2.1.3 Performance of parallel programs
The main goal of parallelism is to reduce the execution time of the whole program in order to produce the results faster. The performance estimates of a parallel program are based on the timings of its complete sequential code. The sequential program typically comprises two distinct sections of code: inherently sequential code and potentially parallel code.
The parallel content p of the program is defined as the ratio of the time taken to execute the potentially parallel code to the time taken to execute the whole code. The maximum theoretical speedup that can be achieved for a given program is a function of the parallel content p and the number of processors that will be used (N). It is given by Amdahl's law (Amdahl, 1988), which is stated as follows:

Theoretical speedup = 1 / ((1 - p) + p/N)    (2.1)
The theoretical speedup is lower than the ideal speedup, which reflects the ideal case that applying N processors to a program should cause it to complete N times faster. The size of the gap between the ideal and theoretical speedup is a function of the serial content of the program. This suggests that the amount of speedup that can be achieved for every program is limited beyond a certain number of processors. The gap between the theoretical and ideal speedup may change due to an increase in problem size (e.g. when the number of iterations is increased in a simulation). The gap narrows when the parallel content of the program increases due to an increase in problem size, while the gap may actually widen if the length of the serial bottlenecks also increases with problem size. However, the theoretical speedup is rarely achievable by a parallel application. There will actually be an observed speedup which is much lower than the theoretical speedup, reflecting the effect of external overhead on the total execution. This overhead comes from two sources: (a) the additional processor cycles expended in simply managing the parallelism, and (b) wasted time spent waiting for I/O, communication among processors, and competition from the operating system and other users (Pancake, 1996). Theoretical speedup does not take these factors into account.
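Equation (2.1) can be evaluated directly to see how the serial content caps the speedup. The following short Python sketch (illustrative, not from the thesis) tabulates the theoretical speedup for a program with 90% parallel content:

```python
# Numeric illustration of equation (2.1): theoretical speedup as a function
# of the parallel content p and the processor count N.

def theoretical_speedup(p, n):
    """Amdahl's law: 1 / ((1 - p) + p / N)."""
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    # With 90% parallel content, speedup saturates well below the ideal N:
    for n in (2, 8, 64):
        print(n, round(theoretical_speedup(0.9, n), 2))
```

Even with unlimited processors the speedup cannot exceed 1/(1 - p), i.e. 10 in this example, which is exactly the "gap" discussed above.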
2.2 An Overview of Computer Vision
The basic input in computer vision is a set of one or more image(s) of some scene, while the output is a description of the objects in that scene. An image, captured by a sensor, is an array of numbers called pixels that represent average brightness (gray level) or color values at discrete grid points in the scene. A gray level is usually represented as an 8-bit
integer having 256 distinct values, while each color value is represented by an n-valued tuple measuring brightness in a set of n spectral bands (e.g., red, blue and green).
We can view image processing as a prelude to computer vision. Image processing algorithms operate on images to extract and represent scene information. Higher level vision algorithms use scene information for object recognition and scene interpretation. Computer vision therefore encompasses processing from sensing to scene interpretation. The main areas of image processing include image enhancement and restoration (to improve the appearance of an image or to undo the effects of image degradations such as blurring or noise), image compression (to reduce an image to smaller sets of data which can be used for reconstruction of an acceptable approximation to the original image), and image reconstruction from projections (to construct images of cross-sections of an object by analyzing a set of projections taken from different directions, as in tomography).
Since the majority of applications in computer vision involve two dimensional (2D) scenes and the general goal is to recognize objects of interest in the images of these scenes, we will restrict our discussion to the analysis of 2D scenes. The following subsections outline general techniques involved in 2D object recognition. A detailed discussion dealing primarily with 2D vision can be found in (Ballard & Brown, 1982), (Rosenfeld & Kak, 1982), (Sonka et al., 1993), while an outline of both 2D and 3D vision is given in (Rosenfeld, 1988).
2.2.1 Object recognition in 2D scenes
Some examples of applications involving 2D scenes are: recognition of alphanumeric characters from an image of a document, recognition of blood cells from an image of a specimen seen through a microscope, and identification of houses and roads from high altitude aerial photographs. A general framework describing major techniques used in object recognition is shown in Figure 2.1. Feature detection techniques are used for detecting local features such as edges (at which the gray level changes abruptly), lines, curves, spots, and corners. Segmentation partitions the image pixels into homogeneous regions. Both segmentation and feature detection assign labels to the image pixels which indicate the classes to which
the pixels belong.
[Figure 2.1 shows the processing stages as a hierarchy, from bottom to top: Illumination; Real-World scene; Imaging device; Digitized Image of the Scene; Image enhancement/restoration; Feature Detection; Scene Features; Property Measurement; Segmentation/Resegmentation; Relational structure; Model Matching/Object Recognition; Recognition/Generic Description.]
Figure 2.1: An overview of a typical vision based application
Resegmentation techniques group the segmented regions in the image into groups or parts that satisfy certain geometric constraints. Property measurement algorithms compute various properties, such as area, perimeter, and average gray level, for such parts. Model matching or object recognition is then regarded as identification of image parts that correspond to the object parts and satisfy the appropriate constraints.
2.2.2 Feature Detection
We describe basic feature detection techniques used for detecting various local features in the image.
1. Templating
A subimage of a local feature that is to be detected is regarded as a template and matched at every possible position in the image for best fit. The degree of match or mismatch identifies the feature at the corresponding pixels. Thus if f(i, j) and t(i, j) represent pixel intensities in the image and the template, respectively, a measure of the mismatch between them can be expressed by (Rosenfeld, 1988) D(a, b) = Σ_{i,j} (t(i + a, j + b) - f(i, j))², where (a, b) is the displacement of the origin of t relative to that of f. The value of (a, b) that minimizes D represents the most likely position of the template in the image. This method is computationally intensive and does not give correct results if the image intensity varies significantly over areas of the size of the template.
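The mismatch measure above can be written down directly. The following Python sketch (a deliberately naive illustration, not the thesis's implementation) evaluates D at every displacement (a, b) and returns the minimizing position:

```python
# Naive template matching sketch: slide template t over image f, compute the
# sum-of-squared-differences mismatch D at every displacement (a, b), and
# return the displacement that minimizes it. Images are plain lists of rows.

def best_match(image, template):
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for a in range(ih - th + 1):          # row displacement
        for b in range(iw - tw + 1):      # column displacement
            d = sum((template[i][j] - image[a + i][b + j]) ** 2
                    for i in range(th) for j in range(tw))
            if best is None or d < best[0]:
                best = (d, a, b)
    return best  # (mismatch, row offset, column offset)

if __name__ == "__main__":
    f = [[0, 0, 0, 0],
         [0, 9, 8, 0],
         [0, 7, 9, 0],
         [0, 0, 0, 0]]
    t = [[9, 8],
         [7, 9]]
    print(best_match(f, t))  # exact match at displacement (1, 1): (0, 1, 1)
```

The nested loops make the cost clear: every displacement requires a full pass over the template, which is why the text calls the method computationally intensive.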
2. Edge detection
Edge detection techniques attempt to find pixels that lie on the borders between different objects in the image. Some standard approaches used are (Rosenfeld, 1988):
• Mask matching: A template representing ideal edges in various orientations is matched in the neighborhood of each pixel in the image. A pixel is classified as an edge pixel if the degree of such a match is sufficiently high. Sharp matches are obtained by using masks which are second differences of ideal step (or ramp) edges. This technique is also used for detecting lines, curves, spots and corners.
• Gradient magnitude: If Δx and Δy denote the first differences of the image gray level in the x and y directions, then the direction of maximum rate of change of gray level is tan⁻¹(Δy/Δx) and the gradient magnitude of this maximum rate of change is √(Δx² + Δy²). A pixel lies on an edge if the gradient magnitude at that pixel is sufficiently high. The differences Δx and Δy can be regarded as good approximations to the partial derivatives, and the digital image as a good approximation to a smoothly varying brightness function. The gradient magnitude approach has several refinements (e.g. local maximum selection, differences of averages, etc.) to overcome the effect of noise in the image. A detailed description of these can be found in (Rosenfeld, 1988).
• Laplacian Zero-crossing
In this approach, the Laplacian of the image gray level, i.e. the sum of the second differences in the x and y directions in the neighborhood of a given pixel, is computed. This sum is positive on one side of an edge and negative on the other; hence its zero-crossings define the location of the edges.
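The Laplacian and its zero-crossings can be sketched as follows (an illustrative formulation; border pixels are simply left at zero here):

```python
import numpy as np

def laplacian(img):
    """Sum of the second differences in x and y (zero at the border)."""
    img = np.asarray(img, dtype=float)
    lap = np.zeros_like(img)
    lap[1:-1, 1:-1] = (img[1:-1, 2:] + img[1:-1, :-2] +
                       img[2:, 1:-1] + img[:-2, 1:-1] -
                       4.0 * img[1:-1, 1:-1])
    return lap

def zero_crossings(lap):
    """Pixels where the Laplacian changes sign against a neighbor."""
    zc = np.zeros(lap.shape, dtype=bool)
    zc[:, :-1] |= (lap[:, :-1] * lap[:, 1:]) < 0   # sign change along x
    zc[:-1, :] |= (lap[:-1, :] * lap[1:, :]) < 0   # sign change along y
    return zc
```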
• Hough transforms
The Hough transform attempts to detect features, such as lines, circles, or curves, that have equations of a particular type by working in a suitable parameter space. For example, to detect arbitrary straight lines, a local curve detection process is applied to the image to get the edge pixels. A straight line is characterized by a slope θ and a distance r from the origin in the (r, θ) parametric space (Duda & Hart, 1972). If P is an edge pixel and it lies on a straight line, we can compute (r, θ) for P and mark the position (r, θ) in a discrete (r, θ) array. When this process is done for all edge pixels, and if the image contains many collinear P's, then there will be a position in the (r, θ) array that has a high count of marks.
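A minimal accumulator-array sketch of this voting process, using the (r, θ) normal parameterization r = x cos θ + y sin θ of Duda & Hart (the discretization choices below are illustrative, not taken from the thesis):

```python
import numpy as np

def hough_lines(edge_pixels, shape, n_theta=180):
    """Accumulate (r, theta) votes for each edge pixel; every line
    through a pixel (x, y) satisfies r = x*cos(theta) + y*sin(theta)."""
    h, w = shape
    r_max = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * r_max + 1, n_theta), dtype=int)
    for (x, y) in edge_pixels:
        for t_idx, theta in enumerate(thetas):
            r = int(round(x * np.cos(theta) + y * np.sin(theta)))
            acc[r + r_max, t_idx] += 1   # shift so negative r fits
    return acc, thetas, r_max
```

Collinear edge pixels pile their votes into the same cell, so peaks in the accumulator correspond to likely lines.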
2.2.3 Segmentation
Segmentation techniques are used for identifying pixels that form homogeneous regions in the image. Feature detection is a special form of segmentation since it identifies special types of pixels which have specific local properties. The common techniques used in segmentation (Rosenfeld, 1988) are described below.
1. Gray level thresholding
Regions in this segmentation are assumed to have an approximately constant gray level across the pixels constituting them. A plot of the frequency of each gray level in the image (called the image histogram) gives various peaks (surrounded by valleys) which represent ideal gray levels of the corresponding regions. The image can be segmented into regions by dividing the gray scale into intervals each containing a single peak. This method of segmentation is known as (multi-)thresholding; the points separating the intervals on the gray scale are called thresholds. Thresholding produces good results only if the peaks are well separated. Various refinements to this basic technique can be applied when the peaks overlap or are widely separated.
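Assuming thresholds have already been chosen from the histogram valleys, the segmentation step itself is a simple interval assignment (an illustrative NumPy sketch, not the thesis author's code):

```python
import numpy as np

def histogram(img, n_levels=256):
    """Frequency of each gray level (the image histogram)."""
    return np.bincount(np.asarray(img).ravel(), minlength=n_levels)

def threshold_segment(img, thresholds):
    """Assign each pixel the index of the gray-scale interval it
    falls in; interval boundaries are the given thresholds."""
    return np.digitize(np.asarray(img), sorted(thresholds))
```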
2. Relaxation techniques
These are iterative techniques used for obtaining a stable solution from an initial approximation. In the context of segmentation, each pixel is initially classified independently (with certain probabilities). These pixels are then reclassified iteratively to make the classification more consistent. The consistency criterion in segmenting the image into regions means that if a majority of the neighbors of a pixel P belong to a given class, so should P. If the goal is to detect edges or curves, the consistency criterion means that if P lies on an edge or a curve having a given slope at P, its neighbors in that direction should have a similar slope.
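A discrete caricature of such relaxation for region labels (probabilities replaced by hard labels and a majority vote over the 4-neighborhood; purely illustrative, not the probabilistic scheme of the literature):

```python
import numpy as np

def relax_labels(labels, iterations=5):
    """Iteratively reassign each pixel to the majority class among
    its 4-neighbors and itself (a simple discrete relaxation)."""
    labels = np.asarray(labels).copy()
    h, w = labels.shape
    for _ in range(iterations):
        out = labels.copy()
        for i in range(h):
            for j in range(w):
                votes = [labels[i, j]]
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        votes.append(labels[ni, nj])
                out[i, j] = max(set(votes), key=votes.count)
        labels = out
    return labels
```

A lone pixel misclassified inside a uniform region is absorbed by its neighborhood after one iteration.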
3. Global Homogeneity
In this approach, an entire region or curve is required to be a good fit (e.g., in the least squares sense) to some standard function. For example, an edge or curve may be required to be a good fit to a straight line or to a polynomial of higher degree. A split-and-merge approach can then be used for segmenting an image or a curve into globally homogeneous parts. In this approach, an entire image or curve is split (e.g., into quadrants or arcs) if the measure of the fit is not good enough. The splitting process is repeated for each part until the entire image or curve is partitioned into parts each of which has a good fit and no two adjacent parts can be merged to yield a good fit.
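The splitting half of split-and-merge can be sketched recursively, with "good fit" simplified to a small standard deviation from a constant-intensity model (an assumption made here for brevity; the thesis text allows more general model fits):

```python
import numpy as np

def split_blocks(img, tol, r0=0, c0=0):
    """Recursively split an image into quadrants until every part fits
    a constant-intensity model (std <= tol); returns (row, col, h, w)
    blocks. Single rows/columns are accepted as-is for simplicity."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    if img.std() <= tol or h <= 1 or w <= 1:
        return [(r0, c0, h, w)]
    hm, wm = h // 2, w // 2
    blocks = []
    for (r, c, hh, ww) in ((0, 0, hm, wm), (0, wm, hm, w - wm),
                           (hm, 0, h - hm, wm), (hm, wm, h - hm, w - wm)):
        blocks += split_blocks(img[r:r + hh, c:c + ww], tol, r0 + r, c0 + c)
    return blocks
```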
4. Region Growing; Edge or Curve Tracking
In region growing, a region is built by starting with a set of one or more 'similar' pixels (e.g. 'similar' by pixel difference) and gradually extending this set by repeatedly adding new pixels or connected sets which resemble pixels already in the set. The resemblance is usually governed by some homogeneity criterion (based on either gray tone or texture) that must be satisfied by the new pixels for inclusion in the region. The procedure for edges or curves is analogous. One starts with strong edge/curve pixels and extends them by adding neighboring edge pixels that continue the edge smoothly or preserve the good global fit. The main disadvantage of this approach is that the results of segmentation are order-dependent. They depend on the choice of the starting point and the order in which the pixels are examined for possible incorporation into the region, edge or curve.
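A sketch of region growing from a single seed, with the homogeneity criterion simplified to a gray-level tolerance around the seed value (the criterion and function name are illustrative, not from the thesis):

```python
from collections import deque
import numpy as np

def grow_region(img, seed, tol):
    """Grow a region from a seed pixel, adding 4-connected neighbors
    whose gray level is within tol of the seed's gray level."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    region = {seed}
    frontier = deque([seed])
    ref = img[seed]
    while frontier:
        i, j = frontier.popleft()
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if (0 <= ni < h and 0 <= nj < w and (ni, nj) not in region
                    and abs(img[ni, nj] - ref) <= tol):
                region.add((ni, nj))
                frontier.append((ni, nj))
    return region
```

The FIFO frontier makes the order of examination explicit, which is exactly the source of the order-dependence noted above.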
5. Hierarchical Techniques
Here one applies a local feature detection technique to a reduced-resolution image to detect 'coarse features' of various sizes (edges between large regions, thick curves, large spots, etc.). The finer image features can then be located by examining successively higher-resolution versions of the image in the vicinity of the detected features. This process requires only a succession of local searches and thereby reduces the cost of global search.
2.2.4 Resegmentation
Resegmentation methods are used for forming meaningful entities or parts by segmenting or grouping regions, edges or curves using certain geometric criteria. Examples of such entities are (Rosenfeld, 1988):

1. Connected components and holes: Segmentation of an image often results in many disconnected fragments. Resegmentation methods applied to such fragments result in maximal connected sets of pixels called connected components. Holes are regions surrounded by pixels of a connected component.

2. Borders, Arcs and Curves: Edges obtained in segmentation may be grouped together to form borders of objects or to form arcs and curves in the image. An arc may be further segmented into smoothly curved subarcs which may meet at corners.

3. Thinning, Shrinking and Expanding: These techniques are used for forming a skeleton of given objects or to dilate a given object in the image.
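Connected component labeling of a binary segmentation mask, item 1 above, can be sketched with a flood fill over 4-connected neighbors (illustrative code, not from the thesis):

```python
def connected_components(mask):
    """Label maximal 4-connected sets of foreground pixels; returns
    a label array (0 = background) and the number of components."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and labels[i][j] == 0:
                count += 1                 # found a new component
                stack = [(i, j)]
                labels[i][j] = count
                while stack:               # flood fill its pixels
                    a, b = stack.pop()
                    for na, nb in ((a - 1, b), (a + 1, b),
                                   (a, b - 1), (a, b + 1)):
                        if (0 <= na < h and 0 <= nb < w and mask[na][nb]
                                and labels[na][nb] == 0):
                            labels[na][nb] = count
                            stack.append((na, nb))
    return labels, count
```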
2.2.5 Properties and Relations
After the resegmentation process, many useful properties of the image parts can be measured by applying various techniques. Examples of such properties are: number of connected components or holes, area (number of pixels in the image part), perimeter, compactness (area/perimeter²) and elongatedness (area/thickness²). Many types of relations between image parts are important for object recognition, especially when these are between parts of objects. Most of these relations are defined in terms of relative property values such as lightness/darkness, size, positional reference (e.g. near, far, above, below, etc.), and orientation (parallel, etc.) (Rosenfeld, 1988).
2.2.6 Object Recognition
Object recognition may be achieved in several ways. In the graph-based approach, the objects are assumed to consist of parts having certain properties and relationships. They are represented as labeled graphs, with nodes representing parts, labeled with property values, and arcs representing relations, labeled with relation values. Two such graphs are created, one for the expected class of objects (called the object graph) and the other for the actual observed object classes in the image (the scene graph). Recognition is then achieved by finding subgraphs of the scene graph that are close matches to the object graph. The main limitation of this approach is that the observed image parts may not correspond to the expected object parts. This may be due to segmentation errors, where a single node may split into several nodes or several nodes may merge into a single node. Also, it is sometimes difficult to characterize objects as labeled graphs.
In another approach, although applicable only in some special cases, the objects are characterized by a set of ideal (global) property values or constraints on these values. Recognition then consists of matching an observed list with the ideal list. In certain cases an entire object is treated as a template and matched for optimal fit in the image. The graph-based approach, however, appears to be more general and is applicable in the majority of cases (Rosenfeld, 1988).
2.3 Computational Characteristics
Investigation of parallel processing solutions to vision applications necessitates understanding the nature of the computations involved. A typical vision application involves several stages of processing with a varying mix of symbolic and numeric processing. Vision applications are conveniently classified into three levels (Weems et al., 1989): low level, intermediate level, and high level, as shown in Figure 2.2. The low level processing involves well-structured local computations on the image data, while the other levels involve symbolic computations with irregular communication patterns.
Figure 2.2: Processing levels in a typical vision based application. (The figure shows a pipeline from the digitized image of the scene, through low level image-to-image processing (image enhancement/restoration, feature detection) yielding scene features, and intermediate level image-to-data-structures processing (segmentation/resegmentation, property measurement) yielding a relational structure, to high level data-structures-to-data-structures processing (model matching/object recognition) yielding a recognition or generic description.)
2.3.1 Low level processing
Low level processing involves image processing techniques such as image enhancement and restoration, and computer vision techniques of feature extraction and edge detection. Low level processing consists of pixel-to-pixel transformations, where uniform computations are applied at each pixel or at a neighborhood around each pixel in the image. The computations are numeric, regular and well suited to spatial parallelism. The communication pattern is local and processing across the image is identical. Although the computations required at low level are quite straightforward, the sheer volume of data to be processed demands enormous computing power.
2.3.2 Intermediate level processing
At the intermediate level, the basic unit of information is a description of low level image features such as edges, curves, and intensity regions. The algorithms in this category consist of both symbolic and numeric computations. The symbolic computations involve grouping of the low level features into meaningful entities such as sets of parallel lines, rectangular borders of an object, or planes. The algorithms at this level attempt to output descriptions of possible objects in the image data. The grouping operations (e.g. merging and splitting of regions, or linking and reorganizing of lines) involve a large amount of non-local communication. Fragments of lines require matching and merging across a large fraction of the image. Similarly, regions need to be merged and compared with others from possibly non-contiguous areas during the segmentation process. The communication pattern is thus data dependent and irregular.
2.3.3 High level processing
High level applications generate and test hypotheses for object recognition based on data provided by the low and intermediate levels of processing. The applications at this level attempt to recognize objects in the image using either graph-based or rule-based techniques on the object descriptions generated at the intermediate level. Processing at this level is very irregular and may involve dynamic scheduling of the computations.

The volume of data analyzed as the processing progresses from low levels to high levels is substantially reduced. However, the information content of the data is much higher. For example, where pixel values in low level processing represent brightness values in the image data, relevant data in high level processing may represent relative sizes or shapes of the objects. The data types shift from primarily numeric to primarily symbolic (Yalamanchilli & Aggarwal, 1994). Hence, the computations involving these data structures are complex (e.g. object recognition, automatic vehicle guidance). The source of computational burden shifts from large volumes of data to complex numerical and inferencing operations as the processing progresses from low to high level.
Low level algorithms are usually highly structured, repetitive and composed of fixed sets of operations with relatively few data-dependent branches. It is therefore possible to obtain relatively accurate estimates of the operation counts. But high level algorithms are highly data-dependent, and processing requirements can vary widely based on the application domain. For example, it is very difficult to estimate the number of features or objects to be processed, and even more difficult to estimate the amount of computation involved. It is therefore very difficult to establish the processing requirements and the source of parallelism (e.g. data/functional parallelism) in high level vision algorithms. Hence, the nature of the algorithmic characteristics changes as processing evolves from low to high levels. These characteristics have influenced the design of the several different parallel architectures discussed in the next section.
2.4 Parallel systems for vision
Many applications in computer vision have enormous data throughput and processing requirements which have far exceeded the capabilities of existing uniprocessor architectures. Parallel processing has been perceived as a necessary solution, and this has led to the conception, design, and subsequent analysis of a number of parallel systems for computer vision, which are described below (Weems et al., 1989), (Choudhary & Patel, 1990). The literature on parallel systems for computer vision is vast; however, most of the material can be found in (Duff & Levialdi, 1982), (Kendall & Uhr, 1982), (Uhr, 1987), (Page, 1988), (Prasanna Kumar, 1991), (Narayan et al., 1992), (Siegel et al., 1992).
2.4.1 Mesh connected systems
Mesh connected machines consist of a large number of simple processing elements arranged in a two-dimensional array, with each processing element connected to its four, six, or eight neighbors (Figure 2.3). The processing elements execute instructions broadcast by a central controller in SIMD mode. The organization of these machines matches the structure of the image data, which makes them suitable for low level image processing operations involving computations on individual pixels or small neighborhoods of pixels. They are, however, not suitable for intermediate and high level processing due to the simplicity and SIMD nature of the processing elements. Also, communication of information across long distances in the communication network is very time consuming. Some examples of mesh connected machines are (Choudhary & Patel, 1990), (Yalamanchilli & Aggarwal, 1994) the Massively Parallel Processor, the Binary Array Processor, the Distributed Array Processor (DAP) and the Cellular Logic Image Processor (CLIP) series of machines at University College London, the state of the art in the series being the CLIP7 processor array.
2.4.2 Pyramids
Pyramid machines consist of a large number of simple processing elements arranged in layers of mesh-connected arrays. With the exception of the array at the lowest layer, each array in the pyramid is one fourth as large as the array below it, and each processing element is connected to four processors in the array below it (Figure 2.3). Pyramid machines attempt to minimize the communication delays over large distances present in the mesh connected systems. However, due to the SIMD nature of the processing elements, these machines can be used to improve the speed of mostly low level algorithms, especially those which depend upon communication between pixels that are spatially distant in an image. Some examples of pyramid machines are (Choudhary & Patel, 1990) Non-Von, the intracomputer, PAPIA, and the MPP Pyramid.
Figure 2.3: A 4-connected mesh, a pyramid and a 3-dimensional hypercube of processing elements
2.4.3 Hypercubes
Hypercube machines consist of 2^n processors connected by a communication network that resembles an n-dimensional cube. Each processor is connected to n other processors and can communicate with any other processor using at most n communication links (Figure 2.3). Hypercube machines can be built to operate in both SIMD and MIMD mode.
These machines provide efficient communication between all the processors because the network has a small diameter. They can be used for most low level algorithms and some intermediate and high level applications. However, the algorithms need to be tuned to the underlying topology. Also, larger hypercubes are costly to build since they require many links to be added to each processor. An example of a SIMD hypercube is the Connection Machine CM-2, while the Intel Hypercube, NCube and Cosmic Cube are examples of MIMD hypercube machines (Choudhary & Patel, 1990).
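The addressing property described above (n neighbors, at most n links between any two of the 2^n processors) follows from bit manipulation of processor addresses; the small sketch below is illustrative, not part of the thesis:

```python
def hypercube_neighbors(node, n):
    """Neighbors of a processor in an n-dimensional hypercube:
    flip each of the n address bits in turn."""
    return [node ^ (1 << bit) for bit in range(n)]

def hop_distance(a, b):
    """Minimum number of links between two processors: the Hamming
    distance of their binary addresses (never more than n)."""
    return bin(a ^ b).count("1")
```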
2.4.4 Shared memory machines
Shared memory systems are usually MIMD machines consisting of several general purpose processors which have access to a large global memory through an interconnection network. In some cases the processors may also have a small amount of local memory. The interconnection network may be bus-based or may involve the use of a multistage switching network. The former involves a high-speed bus that connects the processors and the memory, while the latter provides links between processors and memory on a demand basis (Figure 2.4). The bus-based machines have limited scalability, since the common bus used in communication limits the number of processors that can be added to the system. Scalability is much better in the multistage switching network machines, but the interconnection networks are complex to build.

Shared memory machines are suitable for high level vision applications due to ease of programming and a uniform view of the system. The control of information and synchronization is much easier compared to that in distributed memory machines. However, due to slow access to global memory and the time penalty in process synchronization, such systems are efficient only for coarse-grained parallelism. Examples of shared memory machines (Choudhary & Patel, 1990) that use a bus architecture are Sequent Balance and Encore Multimax, and those which use multistage networks are BBN Butterfly, IBM RP3 and Cedar.
2.4.5 Pipelined Systems and Systolic arrays
The machines in this category consist of a pipeline of processing elements where data is fed in at one end of the pipeline. This data then passes through the processing elements in a serial fashion, and the results are obtained at the other end of the pipeline (Figure 2.4). These systems are used for performing a sequence of operations on a stream of input data. Such systems are useful in morphological operations where long sequences of local operations are performed on given image data. Examples of machines in this category are Cytocomputers and the systolic arrays (e.g. SLAP or Scan Line Array Processor) (Yalamanchilli & Aggarwal, 1994).

Several solutions have been developed for low level image processing algorithms using systolic arrays (Uhr et al., 1986). These solutions, called systolic solutions, are realized by organizing the flow of data streams through such arrays. Systolic solutions have been obtained for a variety of problems such as edge detection, connected component labeling, and fast Fourier transforms. However, the difficult problem in using these machines is to determine whether a systolic solution exists for a certain problem and, if so, to derive this solution. A representative of the state of the art in this effort is the CMU Warp project. The CMU Warp, a linear systolic array of 10 Warp cells or processing elements, was designed to provide high-speed operations for a number of low level image processing applications. But its flexibility makes it possible to program a variety of other applications as well. The array can operate as a purely systolic array or as a set of processors on a bus in SIMD or MIMD mode (Yalamanchilli & Aggarwal, 1994).
2.4.6 Partitionable Systems
Due to the varied nature of vision applications there were many efforts to design and develop architectures that supported both SIMD and MIMD types of processing. Such hybrid systems addressed the issues of flexibility, partitionability and reconfigurability needed in low, intermediate and high level vision applications. Some examples of such systems include PM4, PASM, REPLICA, Disputer, WISARD, VisTA, the Image Understanding Architecture (IUA) and NETRA. A brief description of all these systems can be found in
(Yalamanchilli & Aggarwal, 1994), (Choudhary & Patel, 1990), and (Prasanna Kumar, 1991). The common characteristic of these machines is that they consist of a large number of processing units which can be partitioned into groups that can operate in SIMD and MIMD mode. The architecture of the IUA (Weems et al., 1989), for example, has three different layers of processing units suitable for low, intermediate and high level vision algorithms. However, such systems involve considerable design and development costs due to their specialized and complex architecture.

Figure 2.4: Shared memory machines (interconnected by a bus and a switching network) and systolic/pipeline systems
2.4.7 General purpose parallel systems
General purpose parallel systems are the current high-performance parallel machines such as the IBM SP-2, Meiko CS-2, Intel Paragon, Cray T3D, and PARAM 10000 (developed by C-DAC, the Center for Development of Advanced Computing, Pune, India). Since they are based on workstation microprocessor technology, these systems are versatile and cost-effective compared to the specialized vision systems described earlier. These systems mainly consist of processing units, each with a local memory, and a high speed interconnection network. They are mostly tightly-coupled, i.e. the interconnects are system-specific with point-to-point links between the processors. Their major disadvantage is that it is difficult for a parallel application to use the resources efficiently. Also, the system-specific interconnects do not provide the flexibility of adding existing machines as hosts. They
cannot incorporate heterogeneous architectures, hence applications cannot select the most suitable computing resources for each computation. Therefore, although tightly-coupled systems always support faster communication, their advantage is likely to shrink over time (Steenkiste, 1996).
2.5 Computing on workstation clusters
During the past several years, network-based computing environments, such as a cluster of workstations, have proved to be an attractive alternative for high-performance computing over the conventional parallel machines. This is due to rapid advances in microprocessor technology and the emergence of high-speed networks having a network bandwidth of the order of a gigabit per second (Boden et al., 1995), (Steenkiste, 1996). A cluster of workstations offers several advantages for implementing high-performance computing solutions. It provides multiple CPUs, large memory, stable software, and heterogeneous computing environments for developing high-performance computing solutions to many computation-intensive problems. It is believed that future computing environments will slowly migrate towards the concept that 'the network is the computer' (Turcotte, 1996).
2.5.1 Cluster Configuration
A workstation cluster is basically a collection of workstations connected by a commodity network, such as Ethernet or ATM. The three common network topologies employed with workstation clusters are shown in Figure 2.5. The Ethernet or bus is the most commonly implemented network for clusters. Switch based interconnects are typically configured in a star arrangement, and are used exclusively with dedicated clusters. There are also hierarchical designs in which multiple types of interconnects are utilized.
The workstations in a cluster communicate with each other by exchanging messages or data packets transmitted using either the transmission control protocol (TCP) or the user datagram protocol (UDP). The former processes streams of data such that the reliability of message delivery is assured. The latter sends data packets whose delivery is attempted but not assured (Turcotte, 1996). Two software methods are used for communicating the messages: message passing and distributed shared memory. Message passing involves explicit transmission of messages between the systems. Distributed shared memory (DSM), which is usually implemented using message passing, involves accessing data without concern for its physical location.
Figure 2.5: Common cluster configurations: bus, star and a ring
W orkstation clusters have one obvious lim itation due to the use of relatively slow
network interconnection. T he interconnects have a low banduridth and a high latency.,
where, bandw idth refers to the speed a t which message d a ta is tran sm itted and latency is
the tim e spent in in itiating the transm ission of a message. E th ern e t, the m ost com m only
im plem ented netw ork for clusters, t ransm its inform ation a t lO M b/s and has a message la
tency of 4/rs. T here have been several efforts to design expensive high-speed in terconnects
to overcome the lim itations induced by the speed of E th ern e t. Typical exam ples include,
ED DI (100 M b /s), lliP P I (800 M b/s), VM E Bit3 (20 M b/s) and ATM OC-12 (622 M b /s)
(T urco tte , 1996).
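These two parameters are commonly combined in a first-order cost model, transfer time = latency + message size / bandwidth (a standard approximation, not a formula from the thesis):

```python
def transfer_time(message_bits, latency_s, bandwidth_bps):
    """First-order model of message delivery: start-up latency plus
    message size divided by bandwidth."""
    return latency_s + message_bits / bandwidth_bps
```

For example, a 10 Mb message on 10 Mb/s Ethernet takes about a second of transmission time, so the start-up latency is negligible; for short messages the latency term dominates instead.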
The need to m aximize the network perform ance (high bandw idth and low latency),
particu larly for parallel applications, has yielded unique solutions. A recent exam ple of one
such network is M yrinet (Boden et ah, 1995). It consists of a collection of w orkstations,
the netw ork com prising links and switches to route the d a ta , and the network interface
between the w orkstations and the network links. T he netw ork interface consisting of
a special processor can transfer blocks of d a ta to allow for the overlap of com putation
and com m unication. One way message latencies of 100 ps and bandw idths of 255 M b /s
have been observed in a M yrinet-based system interconnecting several Sparc w orkstations.
(Boden e t ah, 1995).
Workstation clusters are simple to configure. However, it is important to identify the categories of applications which can be implemented on these systems most effectively. The applications which require the computational capabilities of high-performance computing systems can be categorized as follows (Turcotte, 1996):

• Capability demand, which includes megaproblems that require all the computational capabilities of any available system, including memory and CPU. Grand Challenge applications which require massive parallel processing fall into this category.

• Capacity demand, which includes applications requiring substantial, but far from ultimate, performance and making moderate demands on memory. These jobs are ideal candidates for workstation clusters.

Workstation clusters provide a practical and cost-effective computing solution for the capacity demand problems. They are complementary to, rather than practical replacements for, the general-purpose parallel computing machines.
2.5.2 Advantages of workstation clusters
Workstation clusters offer several advantages over the traditional parallel computing environments (Turcotte, 1996), as described below:

• Workstation clusters provide a simple, inexpensive and readily accessible computing platform to design, develop and implement parallel solutions to a wide range of applications. They offer excellent price/performance benefits in comparison with the traditional parallel computing solutions.

• Workstations provide large, cost-effective memory which is not available in most traditional parallel computers, and as problems continue to grow in complexity and detail, the availability of a large memory is as important as the processor speed.

• Workstation clusters offer stable software environments compared to dedicated parallel machines. Software environments such as operating systems, compilers, libraries, and software tools are yet to develop to a point of general acceptability for dedicated parallel machines.

• Clusters provide a cost-effective environment to study topics related to heterogeneous computing. It is generally believed that future high-performance computing systems will achieve maximum performance capabilities only by exploiting the benefits of heterogeneous computing environments.

• Clusters degrade gracefully. The entire cluster is not lost due to the failure of a single system in the cluster. Also, since clusters are created using commodity components, maintenance costs are usually much lower than for an equivalent investment in a dedicated parallel computer.
2.5.3 Use of clusters
Clusters can be used as enterprise clusters or dedicated clusters (Turcotte, 1996). Enterprise clusters are configured with workstations that are owned by different individuals or groups. The machines in this type of cluster are normally heterogeneous (multivendor), and are almost exclusively connected via Ethernet. This type of clustering relies on individual owners contributing their unused computing cycles to a shared pool. The individual owners expect to receive more resources than they contribute. Enterprise clusters are controlled and managed by management software. This software enables effective use of the collective idle time available on most workstations. This idle time can be used to process jobs of several different users in the group. The management software ensures that the systems of individual owners are not saturated when they try to use their own systems. The individual owners can specify how their system will participate in the resource pool.

Several papers have proposed different schemes for sharing resources in enterprise clusters, where the main idea is to identify idle machines in the network and schedule background jobs on them with minimum disruption to the individual owners of the machines. When the owner resumes activity at a workstation, the job is either suspended, terminated, or moved to another machine in the cluster. These efforts have resulted either in speeding up individual jobs or programs by locating idle resources (Alonso & Cova, 1988), (Mutka & Livny, 1987), or in simply achieving higher levels of machine utilization through load balancing or load sharing (Theimer & Lantz, 1988), (Litzkow et al., 1988), (Tandiary et al., 1996), (Clark & McMillin, 1992).
Dedicated clusters are installed as substitutes or replacements for traditional parallel
computing systems. They consist of individual workstations managed by a single group
which administers the cluster like a central mainframe. They are usually interconnected
by high-speed networks such as FDDI, SOCC, and HiPPI (Turcotte, 1996). Dedicated
clusters usually have a control workstation which manages the job queue and acts as
an interface to the rest of the cluster. The control system can be used to dynamically
partition the cluster to execute interactive jobs (e.g. code development, graphics, etc.),
serial batch jobs and jobs that have been parallelized.
2.5.4 Parallel computing using clusters
Workstation clusters, both enterprise and dedicated, can be used as parallel computing
environments for implementing parallel solutions to a wide range of applications. There
have been several papers which have addressed the issues involved in solving a single
problem on a collection of workstations. Silverman and Stuart (Silverman & Stuart, 1989)
have used the cluster as a loosely coupled message passing parallel computer to solve some
asynchronous algorithms in numerical analysis. Magee and Cheung (Magee & Cheung,
1991) have proposed a supervisor-worker programming model to distribute computations
over a set of workstations.
Atallah et al. (Atallah et al., 1992) have proposed a resource management technique
called coscheduling or gang scheduling. It involves dividing a large task into subtasks which
are then scheduled to execute concurrently on a set of workstations. The subtasks need to
coordinate their execution by starting at the same time and computing at the same pace.
Wang and Blum (Wang & Blum, 1996) have developed a small message-passing library to
implement iterative numerical algorithms which require synchronization at the end of each
iteration (synchronous algorithms). Finally, there have been attempts to demonstrate the
capability of workstation clusters to solve some grand challenge problems (Beguelin et al.,
1991), (Nakanishi & Sunderam, 1992).
Two commonly used approaches to parallelize applications using clusters are:
• Extension of existing sequential languages (e.g. C++, FORTRAN) to handle nec-
essary communications and synchronization (see (Wilson & Lu, 1996) for several
concurrent C++ extensions).
• Defining new programming languages or an environment based on object-oriented,
functional or logical paradigms.
There are several software systems, such as Express, Linda, p4, PVM, and MPI, which
are used for creating parallel applications on workstation clusters. A comprehensive review
of these systems is contained in (Turcotte, 1993). This section briefly describes character-
istics of the Parallel Virtual Machine system, which is used as a programming environment
in this thesis.
Parallel Virtual Machine (PVM) (Beguelin et al., 1992) was developed at Oak Ridge
National Laboratory, Tennessee, and is the most popular system for developing parallel
applications on workstation clusters (Turcotte, 1993). PVM is a software library which
allows utilization of a heterogeneous network of parallel and serial computers as a single
computing resource. It is based on the message passing model (the coordination model dis-
cussed in Section 2.1.1). An application in PVM consists of multiple components, each of
which implements a particular functional process. There are four categories of components
in PVM: process management, interprocess communication, synchronization and service
(status checking, buffer manipulation, etc.). The PVM model is based on asynchronous
processes which are typically executed as individual programs (e.g. heavyweight Unix
processes) on each system in the cluster. The communication between the processes
occurs via explicit message passing.
2.6 Parallelization using Design Patterns
Most parallel programs are coded in terms of high level constructs where
the functions for communication, synchronization, and sometimes even computation are
rolled into a single routine. This style of parallel code development increases program
complexity and reduces program reliability and code reusability. Writing explicit parallel
code for parallelizing various applications on a cluster of workstations has some additional
problems too. The available machines and their capabilities can vary from one execution
to another, which can sometimes lead to a significant reduction in parallel performance.
Also, about 69% of parallel programmers (Pancake, 1996) modify or use existing blocks
of code to compose new programs. Since most parallel programs, especially those
in vision, utilize a rather small set of recurring algorithmic structures, it is meaningful to
identify and formulate these algorithmic structures as design patterns. Such decoupling
would reduce program complexity and increase code reusability in different situations and
in future software development.
2.6.1 Design patterns
The concept of a design pattern was introduced by architect Christopher Alexander, who
described the recurring themes in architecture as design patterns (Alexander, 1979).
A pattern represents a replicated similarity in a design, and in particular a similarity that
can be customized and tuned to human needs and comforts. Thus, an arch on every
door and window of a room is a pattern, yet it does not specifically imply the size of the
arches, their height from the floor nor their framing. The idea introduced by Christopher
Alexander has inspired software designers over the past decade to discover (and rediscover)
software architectural patterns in the software people develop. In software, design patterns
are software abstractions that occur repeatedly while developing software solutions for
problems in a particular domain such as business data processing, telecommunications,
distributed communication software, and parallel vision processing (Gamma et al., 1994).
Design patterns capture the static and dynamic structures of the solutions that occur
repeatedly when developing applications in a particular domain (Coplien & Schmidt,
1995), (Buschmann et al., 1996). They articulate proven design techniques for developing
software solutions in a particular context. Capturing and articulating key design patterns
helps to enhance software quality by addressing basic challenges in software development.
These challenges include communication of designs among the developers; accommodating
new design paradigms or styles; resolving reusability and portability issues; and avoiding
development traps and pitfalls that are usually learned only by costly trial and error
(Coplien & Schmidt, 1995).
Design patterns serve as a good communication medium. When several software de-
velopers are discussing various potential solutions to a problem, they can use the pattern
names as a precise and concise way to communicate complex concepts effectively. Design
patterns are extracted from working designs. They capture the essential parts of a design in
a compact form, including specifics about the context that makes the patterns applicable or
not. This compact representation helps developers and maintainers understand the archi-
tecture of a system, which allows more effective software development (Beck et al., 1996).
Patterns promote design reuse, where routine solutions with well-understood properties
can be reapplied to new problems with confidence (Monroe et al., 1997). Encouraging
and recording the reuse of best practices can lead to significant code reuse. A collection
of design patterns would help developers produce good designs faster and would provide
alternatives when applied to particular situations.
The design patterns in parallel vision systems, implemented on network-based
machines (such as a cluster of workstations), are the software components which distribute
and execute computations of various vision applications on these machines. Developing
a parallel implementation for an application in such an environment usually involves a
sequence of steps. These steps include a) partitioning the application into different tasks,
b) using a suitable parallel programming language or tool to concurrently implement
(map) these tasks on a given number of workstations, and c) managing the low level
programming details such as marshalling data, sending and receiving messages, and process
(task) synchronization. The partitioning, mapping and communication structure of the
parallelization process of an application is a parallel programming paradigm that can be
used to parallelize any other application with a similar computational structure. The
design patterns essentially capture these parallel programming paradigms and relieve the
user from tedious parallelization details.
The main advantages in using design patterns for parallelizing vision applications on
a cluster of workstations are:
1. The design patterns can be developed to utilize a readily available pool of work-
stations which, for some applications, can approach or exceed the performance of
the fastest dedicated machines, which may not be locally available.
2. A design pattern decouples the details of parallel implementations from the user.
3. A design pattern can be reutilized to parallelize any application with a similar
computational structure as implemented by that pattern.
2.6.2 Forms of Parallelism in Vision
Many vision applications can be parallelized by using various forms of parallelism. Each
form of parallelism is a simple organizational technique that can be used for designing
and developing parallel algorithms for a certain class of problems. Identifying various
forms of parallelism in vision applications would help in capturing and articulating key
design patterns in parallel vision systems. Many of these forms are variants of the class of
algorithms described in section 2.1.2.
Data Partitioning
In this form of parallelism, the image array is partitioned into adjacent regions or subim-
ages and each subimage is processed in parallel by a different processor. This type of
parallelism is suitable for low level processing operations, such as image filtering and
convolution. The regions may overlap at the boundaries of the subdivisions to enable
processing of the pixels at the region boundaries.
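The overlap idea can be sketched as follows. This is a minimal illustration, not code from the thesis (which targets PVM on workstation clusters): Python threads stand in for the worker processors, a list of rows stands in for the image array, and a 3-point vertical mean filter stands in for a real low-level operator; the helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def smooth_rows(image, lo, hi):
    """Apply a 3-point vertical mean filter to rows lo..hi-1.

    Rows lo-1 and hi (the overlap with neighbouring regions) are read
    but never written, mirroring the boundary overlap described above.
    """
    h, w = len(image), len(image[0])
    out = []
    for r in range(lo, hi):
        row = []
        for c in range(w):
            above = image[max(r - 1, 0)][c]   # replicate at image border
            below = image[min(r + 1, h - 1)][c]
            row.append((above + image[r][c] + below) / 3.0)
        out.append(row)
    return out

def parallel_smooth(image, workers=4):
    """Partition the image into horizontal strips, one per worker."""
    h = len(image)
    bounds = [(i * h // workers, (i + 1) * h // workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: smooth_rows(image, *b), bounds)
    result = []
    for part in parts:
        result.extend(part)
    return result
```

Because no worker writes into another worker's region, the subimages can be processed fully independently.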
Synchronous Iteration
In the case of Synchronous Iteration, each processor performs the same iterative computation
on a different region of image data. The processors, however, must be synchronized
at the end of each iteration, and hence no processor can start the next iteration until
all the processors have finished the previous iteration. The need for synchronization is
due to the fact that data produced by a given processor during the i-th iteration is used by
other processors during the (i+1)-th iteration. This form of parallelism is suitable for iterative
smoothing and sharpening operations on the image data.
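A small sketch of this scheme, under the same simplifying assumptions as before (threads in place of processors, a 1-D signal in place of image data, hypothetical names): a `threading.Barrier` plays the role of the end-of-iteration synchronization, with a second barrier separating each iteration's read and write phases so no worker reads a value its neighbour has already overwritten.

```python
import threading

def synchronous_smooth(data, iterations, workers=2):
    """Each worker repeatedly averages its slice with its neighbours.

    No worker starts iteration i+1 before all workers have finished
    iteration i, because neighbours' iteration-i values are read in
    iteration i+1.
    """
    n = len(data)
    barrier = threading.Barrier(workers)
    bounds = [(i * n // workers, (i + 1) * n // workers) for i in range(workers)]

    def worker(lo, hi):
        for _ in range(iterations):
            # Read phase: uses neighbours' values from the previous iteration.
            new = [(data[max(j - 1, 0)] + data[j] + data[min(j + 1, n - 1)]) / 3.0
                   for j in range(lo, hi)]
            barrier.wait()      # all workers have finished reading
            data[lo:hi] = new   # write phase
            barrier.wait()      # all workers have finished writing
    threads = [threading.Thread(target=worker, args=b) for b in bounds]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return data
```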
Algorithmic Parallelism
In algorithmic parallelism, the algorithm is partitioned into several independent parts
and each part is processed by a separate processor concurrently. Each processor works
independently and requires no explicit synchronization or communication with the other
processors. For example, the two convolutions in Sobel edge detection can be executed
concurrently on separate processors (Downton et al., 1996).
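The Sobel example can be sketched as follows (again a toy illustration with threads in place of processors; the standard Sobel kernels are used, and the |Gx|+|Gy| magnitude approximation is one common convention, not necessarily the one used in the cited work):

```python
from concurrent.futures import ThreadPoolExecutor

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def convolve(image, kernel):
    """3x3 convolution over the interior pixels; the border stays zero."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            out[r][c] = sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
                            for i in range(3) for j in range(3))
    return out

def sobel_magnitude(image):
    # The two convolutions are independent, so each runs on its own worker
    # with no synchronization until the final combination step.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fx = pool.submit(convolve, image, SOBEL_X)
        fy = pool.submit(convolve, image, SOBEL_Y)
        gx, gy = fx.result(), fy.result()
    return [[abs(a) + abs(b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]
```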
Temporal Multiplexing
In this form of parallelism, instead of splitting individual image data sets, complete image
data sets are processed in parallel by different processors. This form of parallelism is also
identified as processor farming (Downton et al., 1996). However, the temporal multiplexing
form of parallelism is sometimes also associated with operator parallelism in low level
image processing. Low level operators, such as erosion and dilation in image morphology,
can be cascaded into several stages (Pitas, 1993). Each stage, implemented on a separate
processor, processes a complete image data set. The output of any stage is the input of the
subsequent stage. For example, if F is an operator operating on image I, then F can be
cascaded into several stages as follows:

    O = F(I) = F_n(F_{n-1}(. . . (F_2(F_1(I))) . . .))        (2.2)
But the cascaded implementation of an operator/algorithm also represents the pipeline
form of parallelism (described later). In this thesis we do not associate this form of
parallelism with temporal multiplexing. We identify temporal multiplexing with a type
of processing that involves implementation of an algorithm/operator as a single program
unit, operating on complete image data sets.
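Temporal multiplexing in this sense can be sketched in a few lines (a toy illustration: threads stand in for processors, a whole-frame threshold stands in for the single program unit, and the names are hypothetical). Note that the frames are distributed whole, in contrast to the subimage splitting of data partitioning.

```python
from concurrent.futures import ThreadPoolExecutor

def threshold_frame(frame, t=128):
    """The complete-image 'program unit': binarize one whole frame."""
    return [[1 if p >= t else 0 for p in row] for row in frame]

def multiplex(frames, workers=3):
    """Each worker executes the same program on complete frames."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(threshold_frame, frames))
```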
Workpool
In the workpool mode of parallelism, a central pool of similar computational tasks is
maintained. A large number of workers repeatedly retrieve tasks from the pool, perform
the required computations, and possibly add new tasks to the pool. The computation termi-
nates when the task pool is empty. This technique is used for implementing solutions to
combinatorial problems in high level vision such as tree or graph searches. A large number
of tasks are generated dynamically which can be picked up by any worker process.
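A minimal workpool sketch (again threads for workers, and a synthetic tree-counting task standing in for a real search problem): a shared `queue.Queue` is the central pool, workers push newly generated subtasks back into it, and `Queue.join()` detects the pool-empty termination condition.

```python
import queue
import threading

def workpool_tree_size(branching, depth, workers=4):
    """Count the nodes of a uniform tree via a dynamic task pool.

    Expanding a node below the depth limit pushes its children back into
    the pool, mimicking the dynamic task generation of tree search.
    """
    pool = queue.Queue()
    pool.put(0)                       # root task, at level 0
    count = [0]
    lock = threading.Lock()

    def worker():
        while True:
            level = pool.get()
            if level is None:         # sentinel: pool drained, shut down
                pool.task_done()
                return
            with lock:
                count[0] += 1
            if level < depth:         # generate new tasks dynamically
                for _ in range(branching):
                    pool.put(level + 1)
            pool.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    pool.join()                       # all real tasks processed
    for _ in range(workers):
        pool.put(None)
    for t in threads:
        t.join()
    return count[0]
```

Children are enqueued before `task_done()` is called, so `pool.join()` cannot return while dynamically generated work is still outstanding.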
Pipeline
In pipelining, the application algorithm is sequentially subdivided into various components
arranged in a pipeline. Each component is processed by a different processor and performs a
certain phase of the overall computation. The data flows through the entire pipeline structure
via the neighboring component processors.
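A sketch of this structure under the usual toy assumptions (threads for component processors, queues for the inter-stage data streams, hypothetical names):

```python
import queue
import threading

def run_pipeline(stages, inputs):
    """Connect `stages` (one function per pipeline component) with queues.

    Each stage runs on its own thread, consuming the output stream of its
    predecessor; a sentinel propagated down the pipeline marks end of data.
    """
    qs = [queue.Queue() for _ in range(len(stages) + 1)]
    STOP = object()

    def stage_runner(fn, src, dst):
        while True:
            item = src.get()
            if item is STOP:
                dst.put(STOP)     # pass end-of-stream to the next stage
                return
            dst.put(fn(item))

    threads = [threading.Thread(target=stage_runner, args=(f, qs[i], qs[i + 1]))
               for i, f in enumerate(stages)]
    for t in threads:
        t.start()
    for x in inputs:
        qs[0].put(x)
    qs[0].put(STOP)
    out = []
    while True:
        item = qs[-1].get()
        if item is STOP:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out
```

With several items in flight, the stages overlap in time: while stage 2 processes item 1, stage 1 is already processing item 2.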
Pipeline Processor Farm
The Pipeline Processor Farm (Downton et al., 1996) is a generalized form of pipeline parallelism
where each component in the pipeline may be parallelized by the various parallel programming
techniques described earlier.
2.6.3 Design patterns for parallel vision
Based on the various forms of parallelism discussed in section 2.6.2, we present the following
design patterns to parallelize vision applications on a cluster of workstations. A detailed
description of each pattern is presented in chapter 3.
• Farmer-Worker pattern: This pattern consists of a farmer process (or component)
which is continuously polled for computational work by a set of independent
worker components. It is mainly used for implementing data parallelism, where the
image data is divided into different subimages which are processed independently by
different workers. There is no communication between the worker components.
• Master-Worker pattern: This pattern consists of a master component which dis-
tributes the work to various worker components. Each worker component commu-
nicates with neighboring worker components to exchange intermediate results.
This pattern is used for parallelizing synchronous data parallel algorithms.
• Controller-Worker pattern: This pattern is similar to the Master-Worker pattern
described above, except that each worker may communicate with every other worker
in the pattern. It is used for parallelizing a class of problems in which each object
or subtask of the problem needs to interact with every other object or subtask.
• Divide-and-Conquer pattern: This pattern is used for structuring applications in
which either the data or the application algorithm is divided into several subtasks.
Each subtask may be executed on a single processor or may be further divided (recur-
sively) into smaller subtasks.
• Temporal Multiplexing pattern: This pattern is used for processing several data sets
or a sequence of image frames on multiple processors. Each processor processes a
complete data set and executes the same program code.
• Pipeline pattern: This pattern consists of a pipeline of components executed con-
currently in a specified order. It is used in situations where a vision application can
be divided into components which are by themselves independent, and interact with
each other only by using the output data stream of one component as an input data
stream to another.
• Composite Pipeline pattern: Structurally, this pattern is similar to the pipeline
pattern. The only difference is that each component of the pipeline can itself be
parallelized using any of the design patterns stated above.
2.7 Related work
In this section, we outline some of the leading research efforts which have been inspirational
to the work presented in this thesis. Although the concept of design patterns is new,
the idea of identifying and capturing common forms as software abstractions in parallel
software systems is a decade old.
Zimran et al. (Zimran et al., 1990) have proposed a set of implementation machines
used for parallel implementation of various applications on shared and distributed mem-
ory parallel machines. A layer of implementation machines (IM) is introduced between the
application and the physical machine. The implementation machines consist of common
parallel programming paradigms such as master/slave, pipeline, and pyramids. Each
implementation machine is associated with a mathematical representation that can predict
the performance bounds for distributed computations. An application is developed in
terms of one or more implementation machines which are then implemented efficiently on
the underlying hardware. The IMs are made available in the form of modifiable templates
which implement the relevant communication and synchronization functions. However, the
set of implementation machines presented does not address issues related to domain-specific
problems. They represent only the general forms of parallel programming paradigms.
Magee and Cheung (Magee & Cheung, 1991) have described the supervisor-worker
paradigm to distribute the computations of an application on a network of workstations.
They have discussed the robustness and load balancing properties of this paradigm and
have applied simple formulae to predict the performance of an algorithm implemented
using this paradigm. The supervisor-worker paradigm consists of a supervisor process
that distributes the computational work to a number of worker processes, each working
independently of the other. However, only the embarrassingly parallel class of applications
can be parallelized using this paradigm.
Singh et al. (Singh et al., 1991) developed a system called FrameWorks which uses
templates to generate distributed applications on a network of workstations. Programs
are written as sequential procedures enclosed in templates. The templates hide the low
level parallelization details, such as communication and synchronization. A user selects
appropriate templates (e.g. pipeline, contractor, input/output) to describe the behavior
of a parallel program. The system then generates the code for implementing the com-
munication and synchronization between the processes. The concepts of the FrameWorks
system were later used to create another such system called Enterprise.
The Enterprise system, like FrameWorks, has a graphical interface by which the users
can create parallel applications using assets such as pipeline, master/slave, and divide-and-
conquer (Schaeffer et al., 1993). This system automatically inserts the necessary code for
communication and synchronization, relieving the users from low level programming de-
tails which include marshalling data, sending/receiving messages and synchronization.
However, both the FrameWorks and Enterprise systems do not support data parallelism or
complex synchronization, communication, and scheduling structures. Most of the par-
allelism that can be achieved in an application is obtained through pipelining and temporal
multiplexing. In these forms of parallelism the processors operate only on complete images.
Darlington et al. (Darlington et al., 1993) have proposed a set of higher-order parallel
forms called skeletons as the basic building blocks of a parallel program. They have also
provided program transformations which convert between skeletons, giving portability
across several different machines. A skeleton captures an algorithmic form common to a
range of programming applications. Each skeleton is associated with a set of architectures
on which efficient realizations of the skeleton are known to exist. The skeletons are also
associated with performance models which can be used to predict the performance of a
parallel program implemented using these skeletons. A set of transformations is used for
transforming one skeleton to another in order to suit the architectural requirements of dif-
ferent machines. However, the skeletons represent a general class of parallel programming
paradigms. They are not domain-specific and therefore need to be tuned and extended in
order to reflect the characteristics and control structures associated with the problems in
a given domain.
Downton et al. (Downton et al., 1996) have proposed a design methodology based on
a pipeline of processor farms (PPF) for parallelizing vision applications on MIMD machines.
Their design method enables parallelization of complete vision systems (with continuous
input/output) in a top-down fashion, where parallel implementations of individual algo-
rithms are treated as components in the design model. However, this design methodology is
implicit, i.e. it does not present a detailed description of the methods or designs used in par-
allelization of individual algorithms. For example, their paper identifies 'data parallelism'
as one of several methods for parallelizing vision algorithms. But 'data parallelism' can
be applied to both synchronous and embarrassingly parallel algorithms. Our work in this
thesis aims to make the design information in designs/methods for parallel vision systems
explicit. We abstract and document the design information in their design methodology
in the form of the Composite Pipeline pattern in this thesis.
2.8 Summary
In this chapter we have reviewed concepts and methods in several different areas related to
parallel vision systems. We began with a brief introduction to parallel computing sys-
tems and their classification as SISD, SIMD and MIMD machines, based on their instruction
streams and data streams. This was followed by a discussion on parallel algorithms and
their classification in terms of different algorithmic classes such as synchronous, loosely
synchronous, asynchronous, and embarrassingly parallel. These classes are useful when
discussing computations at a higher level. We have also given a brief introduction to
measuring performance in parallel programs.
We then described general principles and methods used in the field of computer vision.
Our primary concern has been vision applications involving analysis of 2D scenes. We
presented different techniques and algorithms for feature detection, segmentation, reseg-
mentation and object recognition used in 2D vision. We also described the computational
characteristics of these algorithms and their classification into three levels: low, inter-
mediate and high. Low level algorithms are usually highly structured, repetitive and
composed of fixed sets of operations. Higher level algorithms, on the other hand, are very
irregular and may involve dynamic scheduling of the computations. The distinctive nature
of their characteristics has influenced the design and development of several different
parallel architectures in computer vision. Several such architectures comprising either
SIMD, MIMD or both SIMD and MIMD (partitionable) dedicated parallel machines have
been described.
We described parallel computing using workstation clusters and discussed their advan-
tages over conventional parallel machines. This was followed by an introduction to the
concept of design patterns. Design patterns are software abstractions that occur repeatedly
while developing software solutions for problems in a particular domain. Various forms
of parallelism in vision applications were identified in order to capture and articulate key
design patterns in parallel vision systems. Finally, we have outlined some of the leading
research efforts that have been inspirational to the work presented in this thesis.
Chapter 3
Design patterns for parallelizing
vision applications
Design patterns for parallel vision applications (introduced in section 2.6.3) represent
designs or methods used for implementing these applications on various parallel archi-
tectures. Some of these patterns, such as Farmer-Worker and Master-Worker, represent
common methods which can be used for parallelizing algorithms not only in vision but
also in other computing disciplines. But other patterns, such as Temporal Multiplexing
and Composite Pipeline, are suitable only for parallelizing applications in vision (for an
example, see (Downton et al., 1996)).
There have been several efforts in the past to present different design methods for
parallelizing vision algorithms/applications on various parallel architectures (Downton
et al., 1996), (Stout, 1987). However, there have been no attempts to abstract and
document the design information in these design methods. This chapter attempts to
fill this gap by capturing and documenting this design information in the form of design
patterns. These design patterns have been formulated to represent common algorithmic
structures in various parallel vision algorithms/applications described in (Kendall & Uhr,
1982), (Uhr, 1987), (Stout, 1987), (Page, 1988), (Prasanna Kumar, 1991), (Hussain, 1991),
(Pitas, 1993), (Wang et al., 1996), (Downton et al., 1996). A documentation or catalogue
of key design patterns for parallel vision applications would give standard names and
definitions to the techniques used in parallelization of these applications. By making
design knowledge explicit in the form of design patterns, experienced and novice designers
would be able to reuse the designs in different situations (Coplien & Schmidt, 1995).
Design patterns are useful in turning an analysis model into an implementation model
(Beck et al., 1996).
This chapter describes a system of design patterns used for parallelizing the majority
of vision applications on coarse-grained machines, such as a cluster of workstations. A
system of patterns for parallel vision applications consists of many different patterns used
in different situations. In order to facilitate their effective use and to help developers
in selecting and implementing the right patterns for a given situation, it is necessary to
describe the patterns in a uniform way. Such a description must address all the aspects
relevant to a pattern's characterization, detailed description, implementation, selection
and comparison with other patterns. A system of patterns should address issues con-
cerning the composition of patterns into more complex and heterogeneous structures. A
comprehensive and well-defined system of patterns forms a uniquely powerful and flexible
vehicle for expressing software systems (Buschmann & Meunier, 1995).
This chapter is organized as follows. Section 3.1 describes different classification
schemes used in classifying the patterns at various levels of abstraction. Section 3.2
outlines the template used for describing the design patterns. The remaining sections
describe different design patterns used in parallelizing various vision applications on a
cluster of workstations. The patterns in these sections have also been published in (Kadam
et al., 1997), (Kadam et al., 1996).
3.1 Organization of patterns
Design patterns vary in their level of abstraction and are usually organized into different
categories based on some classification scheme. Such a classification scheme is believed to
provide a guide when selecting a pattern for a particular design situation. Gamma et al.
(Gamma et al., 1994) classify design patterns according to their functionality. The design
patterns can have either a creational, structural, or behavioral purpose.
Chapter 3. Design patterns for parallelizing vision applications 62
Creational patterns concern the process of object creation. The Singleton pattern
(Gamma et al., 1994) is a creational pattern used to ensure that a class or a component of
some design pattern has only one instance. Structural patterns deal with the composition
of classes or objects. The Proxy pattern (Gamma et al., 1994) is a structural pattern
which makes the clients or users of a component communicate with a representative
rather than with the component itself. Behavioral patterns characterize the ways in which
classes or objects interact and distribute responsibility. The Iterator pattern (Gamma
et al., 1994) is a behavioral pattern which provides a way to access the elements of an
aggregate object sequentially without exposing its underlying representation.
The classification scheme proposed by Gamma et al. has certain limitations. The
classes of functionality in this classification scheme are general in nature rather than
being specific to any application domain. Hence, it is difficult to select appropriate
patterns for solving or structuring problems in a given application domain. Buschmann
and Meunier (Buschmann & Meunier, 1995) therefore proposed a classification scheme
which classifies patterns into different classes based on different levels of abstraction in
software systems. They identified three different classes of patterns, namely architectural
frameworks, design patterns and idioms. This classification scheme was later used to
formally describe a system of patterns for software architecture (Buschmann et al., 1996).
An architectural framework expresses a fundamental paradigm for structuring software
systems. It provides a set of predefined subsystems and includes rules and guidelines for
organizing the relationships between them. For example, the Pipeline pattern described in
section 3.8 can be considered an architectural pattern when it is used for structuring a
vision application that can be divided into a sequence of independent subsystems, executed
in a specified order. Each subsystem interacts with its neighboring subsystems only by
exchanging streams of data. An application structured using a Pipeline pattern may be
parallelized by executing the application subsystems concurrently. The execution and
the interactions of the application subsystems are implemented by the corresponding
components of the Pipeline pattern.
An architectural framework consists of several smaller units called design patterns.
Design patterns describe the basic scheme for structuring subsystems and components
of a software system, as well as the relationships between them. Design patterns are
medium-level patterns, smaller in scale than the architectural patterns. The Master-Worker
pattern described in section 3.4 is an example of a design pattern which can be used for
distributing the computations of an application to identical worker components. Idioms, on the
other hand, are low-level patterns which are specific to some programming language. An
idiom describes the aspects of both design and implementation of the specific components
in a pattern by using the features of a given language. The Singleton pattern described
earlier is an example of an idiom.
The classification scheme based on different levels of abstraction in software systems
(also termed system granularity) can sometimes be ambiguous. A pattern can be used
to structure either a complete software system or just a single component or subsystem.
A Pipeline pattern, for example, can be part of a larger system. Its classification as an
architectural pattern or a design pattern therefore depends on the context. Similarly, the
boundary between design patterns and idioms is imprecise. In fact, Buschmann and
Meunier (Buschmann & Meunier, 1995) acknowledged this ambiguity when they proposed their
classification scheme. Nevertheless, this classification scheme provides a reasonable hierarchy
for describing most of the patterns in software systems.
We do not follow any strict classification scheme, but rather use it as a general guide
to specify the type of patterns we propose and describe in this thesis. Using the
classification scheme formally used by Buschmann et al. to classify the patterns in their book
(Buschmann et al., 1996), we describe a system of patterns for parallel vision applications
at the level of architectural frameworks and design patterns. If not stated otherwise, we
use the term design patterns to represent all the patterns at various levels of abstraction.
Also, we use the terms pattern and design pattern as synonyms.
3.2 Description of design patterns - a template
We use a template to describe all the design patterns presented in this thesis. The template
provides a description of how each pattern works, where it should be applied and what
the tradeoffs are in its use. This description scheme for the patterns is closely related
to the ones proposed by Gamma et al. (Gamma et al., 1994) and Buschmann et al.
(Buschmann et al., 1996). Its intention is to support the understanding, comparison,
selection, and implementation of patterns within a given design situation. The template
used for describing each pattern is given below:
1. Pattern name
The name of the pattern, which conveys the essence of that pattern.
2. Intent
A short statement about the main functionality of the pattern and the problems
that it addresses.
3. Motivation
An example illustrating a concrete instance of the pattern. The motivational example
relates the pattern to its practical usage.
4. Structure
The structure of the pattern in terms of objects or components, described in both
textual and graphic representation. We use a variant of the object model (described
in Appendix A) to display the structure of the pattern.
5. Interaction
The interactions between the components of the pattern and with the outside
world are depicted. We adapt the object message sequence chart notation (described
in Appendix A) to describe the interactions between the components of a pattern.
6. Implementation
The general guidelines for implementing the pattern. These are, however, only
suggestions which should be suitably modified depending upon the needs of a given
problem.
7. Consequences
The consequences and trade-offs of using a pattern, and the parameters that can be
varied independently by using the design pattern. We describe the benefits and
potential liabilities of a pattern.
8. Applicability
The set of conditions and requirements that indicate when the pattern may be
applicable.
9. Known Uses
We provide examples of the use of the pattern in different situations. We also cite
some related efforts in using the pattern or its variants.
3.3 Farmer-Worker Pattern
Intent
The Farmer-Worker pattern, which provides dynamic load balancing, is used for
implementing embarrassingly parallel algorithms. The farmer component divides the problem
task into a collection of independent subtasks. The worker components grab individual
subtasks and perform identical operations on the data, before returning the transformed
values to the farmer for collating.
Motivation
Averaging is a simple image enhancement technique used for removing random noise
from an image. It uses linear local window operations to change
the pixel intensities in the corrupted image using the equation

f'(a, b) = (1/N) Σ_{(i,j)∈W} f(i, j)    (3.1)

where f is the noisy image, f' is the filtered image, and W is a set of N neighboring pixel points
around a point (a, b) in the image (Sonka et al., 1993). The averaging operation, using
the Farmer-Worker pattern, can be parallelized by dividing the image into subimages and
averaging these subimages concurrently on different processors.
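As an illustration only (not taken from the thesis), the sequential averaging operation of equation 3.1 can be sketched in Python; the function name and the list-of-lists image representation are assumptions:

```python
def average_filter(image, radius=1):
    """Replace each pixel by the mean of its (2*radius+1)^2 neighbourhood.

    `image` is a list of lists of grey values; border pixels use only the
    neighbours that fall inside the image (a common boundary convention,
    assumed here rather than specified by the thesis).
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for a in range(rows):
        for b in range(cols):
            total, n = 0, 0
            # sum the N pixels of the window W centred at (a, b)
            for i in range(max(0, a - radius), min(rows, a + radius + 1)):
                for j in range(max(0, b - radius), min(cols, b + radius + 1)):
                    total += image[i][j]
                    n += 1
            out[a][b] = total / n
    return out
```

A farmer would apply this same function to each subimage independently, which is what makes the operation embarrassingly parallel.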
[Figure: a Farmer component (SplitWork, SendSubtask, CollateResults, SendFinalResults) connected to p identical Worker components, each with RequestSubtask, ProcessSubtask and SendResults operations.]
Figure 3.1: Farmer-Worker Pattern
Structure
The Farmer-Worker pattern consists of a farmer component and several independent
but identical worker components or processes, as shown in Figure 3.1. The client interacts
with the farmer component to parallelize a certain application. The farmer component
is responsible for partitioning the application into several independent subtasks, starting
the worker components to process these subtasks, collecting the partial results from the
worker components, and finally returning the collected results to the client. The worker
components are responsible for processing the individual subtasks created and assigned by
the farmer. The Farmer-Worker pattern consists of one farmer and at least two workers.
Interaction
The interactions between the components of the Farmer-Worker pattern are shown in
Figure 3.2.
• The client requests the farmer to parallelize a given application.
• The farmer component divides the application into different subtasks and starts
several worker components to process these subtasks.
[Figure: message sequence between Client, Farmer and Workers (1..p): CallToParallelize, SplitWork, then repeated SendSubtask, ProcessSubtask, SendResults (RequestSubtask) exchanges, followed by CollateResults and SendFinalResults.]
Figure 3.2: Object interaction in the Farmer-Worker Pattern
• Each worker repeatedly requests a subtask, performs the specified computation on the
data in the subtask, and returns the results back to the farmer. This continues until
a termination condition is encountered.
• The termination condition occurs when there are no more tasks to be processed.
The farmer detects this condition and signals the worker components to terminate.
• The farmer collates the results returned by the workers for a given application. The
farmer returns the collated result to the client.
Implementation
The Farmer-Worker pattern can be implemented by following the steps described below:
1. Partition the work. Specify how the problem task can be divided into a collection of
independent subtasks. For the averaging operation, we could partition the image
into either horizontal or vertical blocks of subimages. Each subimage represents a subtask to be
processed. The subimages must also include the required pixel values at the boundaries
of the partition.
2. Combine the results. Specify how the final results should be collated from the partial
results obtained from the worker components. In the averaging example, the farmer
component simply collates the averaged subimages onto the output image without any
change.
3. Specify the interaction between the farmer and the workers. This interaction can be
implemented in at least three different ways: a) Each worker receives a subtask from the
farmer at the beginning. When a worker returns the partial results to the farmer, the
farmer collates these results and sends another subtask to the worker. b) A separate
component called a gatherer is created. While the farmer distributes the subtasks to the
workers, the gatherer collects the partial results from each worker. The gatherer then
returns the final collected result to the farmer. c) If the operation of collecting the partial
results is trivial or easily delayed to the end of the computation, the farmer can turn into
a worker after setting up the collection of subtasks in a common repository, such as a
subtask queue. The workers then fetch the subtasks from the subtask queue. However,
this implementation needs a shared counter to manage the subtask queue. In all three cases,
when there are no more subtasks to be processed, the farmer sends a termination message
to each worker (and to the gatherer in (b)). In the averaging example, as the farmer simply
collects the results returned by the workers, we use the first method to implement the
interaction between the farmer and the workers.
4. Implement the farmer and the worker components according to the specifications
outlined in the previous steps.
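The steps above can be sketched with a shared subtask queue (interaction variant (c)), here simulated with Python threads on a single machine rather than processes on a cluster; the function names are illustrative, not from the thesis:

```python
import queue
import threading

def farmer_worker(subtasks, process, n_workers=4):
    """Minimal Farmer-Worker sketch: workers grab subtasks from a shared
    queue at their own pace (dynamic load balancing)."""
    tasks = queue.Queue()
    for idx, sub in enumerate(subtasks):       # SplitWork
        tasks.put((idx, sub))
    results = [None] * len(subtasks)

    def worker():
        while True:
            try:
                idx, sub = tasks.get_nowait()  # grab the next subtask
            except queue.Empty:
                return                         # termination: no subtasks left
            results[idx] = process(sub)        # ProcessSubtask

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results                             # farmer collates, in order
```

The thread-safe queue plays the role of the shared counter mentioned in step 3(c); a faster worker simply completes more `get_nowait` calls than a slower one.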
Consequences
The Farmer-Worker pattern provides several benefits:
Dynamic load balancing: The Farmer-Worker pattern provides an even distribution of
the load when the computational requirements of the individual subtasks and the speed
of different processors in the parallel system vary significantly and unpredictably. The
worker components in a Farmer-Worker pattern grab the subtasks and process them
at their own pace. A faster processor or node of the parallel system grabs and
processes more subtasks than the slower nodes. Hence, the number of subtasks processed by
each worker is proportional to the speed of its corresponding node or processor. The
Farmer-Worker pattern therefore provides dynamic load balancing of the subtasks during
its execution.
Scalability and flexibility: It is possible to add new workers or change the existing algorithms
in the workers without major changes to the farmer. The client is not affected by these
changes. Similarly, it is possible to change the algorithms for partitioning the work or
coordinating the workers in the farmer component without affecting the client.
The Farmer-Worker pattern suffers from the following liabilities:
Feasibility: The Farmer-Worker pattern may not always be feasible. The activities of
partitioning the work, starting and controlling the workers, delegating the work amongst
the workers, and collecting the final results consume processing time. The pattern is
effective only when the time spent in these activities is significantly lower than the time
required to perform the computations in a given application.
Effectiveness: The Farmer-Worker pattern is effective only when there are more subtasks
than the number of processors. The parallelism in this pattern is expressed in terms
of the number of subtasks. When all the subtasks are processed, no further parallelism
is available in the application. On the other hand, too many subtasks with a relatively
low compute-to-communication ratio may lead to poor performance. A proper balance
between the granularity and the number of subtasks created is therefore critical for the
effectiveness of this pattern.
Applicability
The Farmer-Worker pattern represents a parallel programming paradigm for
implementing embarrassingly parallel algorithms. It can be used to parallelize any vision
application in which
• the data can be partitioned into several independent data sets
• each data set can be processed concurrently by different workers
• the processing of each data set does not require interaction between the worker
components to exchange intermediate results
Known Uses
The Farmer-Worker pattern has applications at various levels of vision processing. In
low-level processing, it can be used for parallelizing local window-based operations such
as convolution, edge detection, linear and non-linear (e.g. median) filtering, and image
thinning. At the intermediate level, it can be used to extract features of individual objects
concurrently. In high-level processing, it can be used for processing several features or
objects concurrently for object recognition. The algorithmic structure/motif represented
by the Farmer-Worker pattern is described in (Mattson, 1996).
3.4 Master-Worker Pattern
Intent
The Master-Worker pattern is used for parallelizing a class of problems which exhibit a
synchronous form of parallelism. The master component divides the problem into several
subtasks and distributes them to identical worker components. Each worker component
performs computations on its assigned subtask iteratively, and communicates the intermediate
results to its neighboring workers at the end of each iteration. The master component
collates the final results returned by the worker components after a fixed number of such
iterations.
Motivation
An extremum filter is a window-based non-linear operator which sharpens the blurred
edges in an image back to the original step edges (Kramer & Bruckner, 1975). The extremum
filter replaces the central pixel value within a filter window by the nearest extreme pixel
value occurring within the window. It can be expressed using the following equation

f'(a, b) = max{f(i, j)}  if max{f(i, j)} − f(a, b) ≤ f(a, b) − min{f(i, j)}
           min{f(i, j)}  otherwise    (3.2)

where f'(a, b) represents the new pixel value, and max{f(i, j)} and min{f(i, j)} represent
the maximum and minimum values (extreme values) occurring within a window centered
at the point (a, b). The extremum filter is applied iteratively so that the blurred edges converge
to the original step edges. Kramer et al. (Kramer & Bruckner, 1975) have reported
that at least 20-50 iterations were required to observe a complete convergence in a 27x33
image. The execution time required for operating on larger images can therefore be quite
significant. In fact, it can be seen that the computational complexity of this operator with
M iterations, operating on an n x n image and using an m x m window, is O(2Mm²n²). The
extremum filter operator can be parallelized by dividing the image into several subimages,
and filtering these subimages concurrently using different worker components. Each worker
component communicates the required boundary information to its neighboring workers
after every iteration. By using a set of P processors, the computational complexity of the
extremum filter operator can be reduced to O(2Mm²n²/P), subject to the communication
overheads.
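A minimal sequential sketch of the iterated extremum filter of equation 3.2 follows; the 3x3 window, list-of-lists image representation, unchanged border pixels and function name are all assumptions made for illustration, not details from the thesis:

```python
def extremum_filter(image, iterations=1):
    """Apply the Kramer-Bruckner extremum filter (equation 3.2) one or
    more times: each interior pixel is replaced by whichever of the window
    maximum or minimum is nearer to its current value."""
    rows, cols = len(image), len(image[0])
    for _ in range(iterations):
        out = [row[:] for row in image]        # borders carried over unchanged
        for a in range(1, rows - 1):
            for b in range(1, cols - 1):
                win = [image[i][j]
                       for i in (a - 1, a, a + 1)
                       for j in (b - 1, b, b + 1)]
                hi, lo = max(win), min(win)
                # pick the nearer extreme value (ties go to the maximum)
                out[a][b] = hi if hi - image[a][b] <= image[a][b] - lo else lo
        image = out
    return image
```

In the parallel version each worker would run the inner double loop on its subimage and exchange one-pixel-wide boundary strips with its neighbours between iterations.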
Structure
The Master-Worker pattern consists of a master component and several identical
worker components or processes, as shown in Figure 3.3. The worker components are
spatially arranged in a pipeline to reflect the communication structure of the partitioned
problem which the pattern implements. The client interacts with the master component
to parallelize a certain application. The master component is responsible for partitioning
the application into several subtasks, starting the worker components to process these
subtasks, collecting the results returned by the workers, and finally returning the collected
results to the client. The worker components are responsible for repeatedly performing the
computations on their assigned subtasks, and communicating the intermediate results to
their neighboring workers after every iteration. The Master-Worker pattern consists of
one master and at least two workers.
[Figure: a Master component (SplitWork, SendSubtasks, CollateResults, SendFinalResults) connected to p identical Worker components, each with DoCalculation, ExchangeData and SendResults operations.]
Figure 3.3: Master-Worker Pattern
Interaction
The interactions between the components of the Master-Worker pattern are shown in
Figure 3.4.
• The client requests the master to parallelize a given application.
• The master component divides the application into several subtasks and starts the
worker components to process these subtasks. The number of subtasks created is
equal to the number of processors available.
• Each worker performs a fixed number of compute-communicate cycles. A compute-
communicate cycle denotes an operation in which the workers compute on the data
in their assigned subtasks and then communicate the intermediate results to their
neighboring worker components. The workers return the computed results back to
the master after performing a fixed number of these compute-communicate cycles.
• The master collates the results returned by the workers for the given application.
The master returns the collated result to the client.
[Figure: message sequence between Client, Master and Workers (1..p): CallToParallelize, SplitWork, SendSubtask to each worker, repeated DoCalculation and ExchangeResults cycles between workers, SendResults to the master, then CollateResults and SendFinalResults.]
Figure 3.4: Object interaction in the Master-Worker Pattern
Implementation
The Master-Worker pattern can be implemented by following the steps described below:
1. Partition the work. Specify how the problem task can be divided into a collection of
subtasks. The number of subtasks created should be equal to the number of processors
or machines available in the parallel system. Also, the amount of computational work
in each subtask should be proportional to the speed factors of the individual machines used
in parallelization. For the filtering operation, one can partition the image into either
horizontal or vertical blocks of subimages. Each subimage represents a subtask to be
processed.
2. Combine the results. Specify how the final results should be collated from the results
returned by the worker components. In the filtering example, the master component
simply collates the filtered subimages onto the output image without any change.
3. Specify the interaction between the master and the workers. This interaction can be
specified as follows. The master starts the worker components and distributes a single
subtask to each worker component. The master then waits for the workers to return the
computed results. When all the workers communicate their computed results, the master
terminates all the worker components. The master collects and returns the final result to
the client.
4. Specify the interaction between the worker components. This interaction can be specified
as follows. When a worker completes its computation in any compute-communicate
cycle, it communicates the required intermediate results to its neighboring workers
asynchronously. It then suspends its activities and waits to receive the intermediate results
from its neighboring workers. Note that when a process sends a message asynchronously,
it does not wait for the destination process to receive it. This implementation therefore
does not lead to a deadlock condition.
5. Implement the master and the worker components according to the specifications
outlined in the previous steps.
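The compute-communicate cycle of the steps above can be simulated sequentially. This sketch (not from the thesis) uses a 1-D block partition and a three-point moving average as the stand-in computation; the halo tuples play the role of the asynchronous boundary exchange between neighbouring workers, and all names are illustrative:

```python
def master_worker(data, p, iterations):
    """Simulate p workers doing synchronous compute-communicate cycles on
    a 1-D array; block edges replicate their own value at the array border."""
    size = len(data) // p
    blocks = [data[k * size:(k + 1) * size] for k in range(p)]   # SplitWork

    def step(block, left, right):                                # DoCalculation
        ext = [left] + block + [right]
        return [(ext[i - 1] + ext[i] + ext[i + 1]) / 3
                for i in range(1, len(ext) - 1)]

    for _ in range(iterations):
        # ExchangeData: each worker obtains halo values from its neighbours;
        # workers at the array boundary replicate their own edge value.
        halos = [(blocks[k - 1][-1] if k > 0 else blocks[k][0],
                  blocks[k + 1][0] if k < p - 1 else blocks[k][-1])
                 for k in range(p)]
        blocks = [step(blocks[k], *halos[k]) for k in range(p)]
    return [x for blk in blocks for x in blk]                    # CollateResults
```

Because every worker exchanges halos before the next cycle, the blocked result is identical to running the same filter on the whole array, which is the property the boundary exchange in step 4 is meant to preserve.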
Consequences
The Master-Worker pattern provides several benefits:
Scalability and flexibility: The Master-Worker pattern is scalable with respect to the
addition of new workers. It is also flexible with respect to changing the existing algorithms
in the workers without involving major changes to the master. The client is not affected
by such changes. Similarly, it is possible to change the algorithms for partitioning the
work or coordinating the workers in the master component without affecting the client.
Separation of concerns and efficiency: The Master-Worker pattern separates the client
code from the code for splitting the work, delegating the work to different workers,
managing interactions between the workers, collecting the results from the workers, and
handling worker failures. The Master-Worker pattern can speed up computation time
in many applications. However, it may not always be feasible to parallelize an application,
due to the overheads in parallelization (see below).
The Master-Worker pattern suffers from the following liabilities:
Feasibility: The Master-Worker pattern may not always be feasible. The activities of
partitioning the work, starting and controlling the workers, delegating the work to the
workers, managing the worker-worker communication, and collecting the final results are
time consuming. This pattern is effective only when the time spent in these activities
is significantly lower than the computing time required to execute a given application.
Load balancing: The Master-Worker pattern can suffer from serious load imbalances during
its execution. This can happen when it is implemented on non-dedicated parallel systems,
such as enterprise clusters (see section 2.5.3). Each worker in the Master-Worker pattern
depends on the other workers to perform computations on its assigned subtask. A machine
in an enterprise cluster can reduce the performance of this pattern when it is
time-shared by other users while executing some worker component of the pattern. A static
load distribution based on the speed factors of the individual machines used in parallelization
is effective only on dedicated parallel systems.
Error recovery: It is hard to devise mechanisms to handle a failure in some worker
component during the execution of this pattern. Since each worker is dependent on the other
workers for performing its computations, such a failure can lead to a deadlock condition.
It is also difficult to deal with a failure of communication between the master and the
workers or between different workers.
Applicability
The Master-Worker pattern represents a parallel programming model for implementing
synchronous parallel algorithms. It can be used to parallelize any vision application in
which
• the data can be partitioned into several data sets
• each data set can be processed concurrently by different workers
• the processing of each data set requires interaction between the worker
components to exchange intermediate results
Known Uses
The Master-Worker pattern has applications mostly at the low level of vision processing.
The higher levels do not exhibit regularity in data structures and computation. In low-
level processing, it can be used for parallelizing iterative window-based operations such
as spatial non-linear filters, and iterative relaxation algorithms used for image restoration
and segmentation. The algorithmic structure/motif represented by the Master-Worker
pattern is described in (Mattson, 1996).
3.5 Controller-Worker Pattern
Intent
The Controller-Worker pattern is used for parallelizing a class of problems in which
each object or subtask of the problem can potentially interact with any other object
or subtask. The controller component divides the problem into several subtasks and
distributes them to identical worker components. Each worker performs calculations on
its assigned subtask, and communicates the intermediate results to some or all other worker
components. The controller component collates the final results returned by the worker
components.
Motivation
Histogram equalization is a popular grey scale transformation which is used for enhancing
the contrast in an image. It aims to transform the image to have equally distributed
brightness levels over the whole of the brightness scale. A histogram H of an image is a
probability density function of the grey values in the image. If n_k represents the number
of pixels at a grey level k and N denotes the total number of pixels in the image, then
the histogram H is defined as H(i) = n_i / N. Histogram equalization maps the original
pixel values from a scale [a, b] to new values on a scale [c, d] such that the desired
output histogram is uniform over the whole new brightness scale [c, d]. The transformation
function is monotonically increasing and is given by (Sonka et al., 1993)

f'(i, j) = ((d − c)/N) Σ_{k=a}^{f(i,j)} H(k) + c    (3.3)

where f and f' represent the original and transformed image functions, respectively.
[Figure: a Controller component (SplitWork, SendSubtasks, CollateResults, SendFinalResults) connected to p identical Worker components, each with DoCalculation, ExchangeData and SendResults operations.]
Figure 3.5: Controller-Worker Pattern
The histogram equalization algorithm can be parallelized using the Controller-Worker
pattern. The controller divides the image into several subimages and sends each subimage
to a different worker. Each worker computes the partial histogram of its subimage and
communicates it to all other workers. Each worker then combines these partial histograms
to form the complete histogram of the entire image. The workers perform histogram
equalization on their subimages (using equation 3.3) and return the transformed subimages to
the controller.
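The scheme above can be sketched sequentially: the summation of partial histograms stands in for the all-to-all exchange between workers, after which each worker transforms its own subimage against the global histogram. The function names, the 256 grey levels and the output scale [0, 255] are assumptions made for illustration, not details from the thesis:

```python
def partial_histogram(subimage, levels=256):
    """Count of pixels at each grey level in one worker's subimage."""
    h = [0] * levels
    for row in subimage:
        for v in row:
            h[v] += 1
    return h

def equalize(subimage, global_hist, total_pixels, c=0, d=255):
    """Map grey values onto [c, d] using the cumulative global histogram,
    in the spirit of equation 3.3."""
    cdf, run = [], 0
    for count in global_hist:
        run += count
        cdf.append(run / total_pixels)
    return [[round(c + (d - c) * cdf[v]) for v in row] for row in subimage]

def controller_worker(subimages):
    partials = [partial_histogram(s) for s in subimages]       # each worker
    global_hist = [sum(ph[k] for ph in partials)               # all-to-all sum
                   for k in range(256)]
    n = sum(global_hist)
    return [equalize(s, global_hist, n) for s in subimages]    # transform
```

The key point the pattern captures is that only the small histogram arrays are exchanged between workers, never the subimages themselves.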
Structure
The Controller-Worker pattern consists of a controller component and several identical
worker components or processes, as shown in Figure 3.5. The client interacts with
the controller to parallelize a certain application. The controller component is responsible
for partitioning the application into several subtasks, starting the worker components to
process these subtasks, collecting the results returned by the workers, and finally returning
the collected results to the client. The worker components are responsible for performing
the computations on their assigned subtasks. Each worker may exchange intermediate
results with some or all other worker components during the computation. The Controller-
Worker pattern consists of one controller and at least two workers.
Interaction

The interactions between the components of the Controller-Worker pattern are shown in Figure 3.6.
[Diagram: the Client issues CallToParallelize to the Controller, which performs SplitWork and SendSubtask to each Worker; the Workers perform DoCalculation and ExchangeResults, then SendResults back; the Controller performs CollateResults and SendFinalResults.]

Figure 3.6: Object Interaction in the Controller-Worker Pattern
• The client requests the controller to parallelize a given application.

• The controller divides the application into several subtasks and starts the worker components to process these subtasks. The number of subtasks created is equal to the number of processors available.

• Each worker performs computations on its assigned subtask and communicates the intermediate results to one or more worker components. The workers return the computed results back to the controller.

• The controller collates the results returned by the workers, and returns the collated result to the client.
Implementation

The Controller-Worker pattern can be implemented by following the steps described below:

1. Partition the work. Specify how the problem task can be divided into a collection of subtasks. The number of subtasks created should be equal to the number of processors or machines available in the parallel system. Also, the amount of computation in each subtask should be proportional to the speed factors of the individual machines used in parallelization. For the histogram equalization operation, we could partition the image into either horizontal or vertical blocks of subimages. Each subimage represents a subtask to be processed.
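The proportional split described in this step can be sketched as follows; the row-wise decomposition and the function name are illustrative assumptions, not the thesis's code.

```python
def split_rows(num_rows, speeds):
    """Divide num_rows image rows into horizontal strips whose sizes
    are proportional to per-machine speed factors, so faster machines
    receive larger subimages.  Returns (start, stop) row ranges, one
    per worker.  A sketch only."""
    total = sum(speeds)
    bounds, start = [], 0
    for i, s in enumerate(speeds):
        # the last worker absorbs any rounding remainder
        stop = num_rows if i == len(speeds) - 1 else start + round(num_rows * s / total)
        bounds.append((start, stop))
        start = stop
    return bounds
```

For example, three machines with speed factors 1, 1 and 2 would receive 25, 25 and 50 rows of a 100-row image.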
2. Combine the results. Specify how the final results should be collated from the results returned by the worker components. In the histogram equalization example, the controller component simply collates the transformed subimages onto the output image, without any change.

3. Specify the interaction between the controller and the workers. This interaction can be specified as follows. The controller starts the worker components and distributes a single subtask to each worker component. The controller then waits for the workers to return the computed results. When all the workers communicate their computed results, the controller signals the workers to terminate their processing. The controller collects and returns the final result to the client.

4. Specify the interaction between the worker components. Each worker may communicate (asynchronously) the intermediate results to some or all other worker components, and may wait to receive the same from some or every other worker component. Thus, this interaction may sometimes involve global broadcasting of messages from each worker to all other workers.
5. Implement the controller and the worker components according to the specifications outlined in the previous steps.
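The steps above can be sketched in miniature as below, with threads standing in for worker processes and a barrier standing in for the all-to-all ExchangeData step. The names are illustrative assumptions, and a real implementation would use message passing between machines rather than shared memory.

```python
import threading

def controller_worker(subimages, levels=16):
    """Minimal Controller-Worker sketch for histogram equalization:
    each worker computes a partial histogram of its subimage, exchanges
    it with all other workers, equalizes its subimage against the
    combined histogram, and returns the result.  Illustrative only."""
    p = len(subimages)
    partials = [None] * p            # "mailboxes" for the exchange step
    barrier = threading.Barrier(p)   # wait until all partials are posted
    results = [None] * p

    def worker(rank):
        sub = subimages[rank]
        hist = [0] * levels          # partial histogram of this subimage
        for v in sub:
            hist[v] += 1
        partials[rank] = hist
        barrier.wait()               # all-to-all ExchangeData point
        # combine the partial histograms into the global histogram
        glob = [sum(h[k] for h in partials) for k in range(levels)]
        n = sum(glob)
        cum, lut = 0, []
        for k in range(levels):
            cum += glob[k]
            lut.append(round((levels - 1) * cum / n))
        results[rank] = [lut[v] for v in sub]   # SendResults

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(p)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results                   # CollateResults, in subimage order
```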
Consequences

The Controller-Worker pattern provides several benefits:

Scalability and flexibility: The Controller-Worker pattern is scalable with respect to the addition of new worker components. Increasing the number of worker components does not result in major changes to the controller or to the client program. Also, it is easy to change the program code in all worker components to realize different implementations.

Separation of concerns and efficiency: The Controller-Worker pattern separates the client code from the code for splitting the work, delegating the work to different workers, managing interactions between the workers, and collecting the results from the workers. The Controller-Worker pattern can speed up the execution time of many computationally intensive applications. However, it may not always be feasible to parallelize a given application due to overheads in parallelization (see below).
The Controller-Worker pattern suffers from the following liabilities:

Feasibility: The Controller-Worker pattern may not always be feasible. The activities of partitioning the work, starting and controlling the workers, delegating the work to the workers, managing the worker-worker communication, and collecting the final results are time consuming. In fact, significant delays can occur in the worker-worker interactions, especially when they involve global broadcasting of messages from each worker to all other workers.

Load balancing: The Controller-Worker pattern can suffer from serious load imbalances during its execution. This can happen when it is implemented on non-dedicated parallel systems, such as enterprise clusters (see section 2.5.3). Each worker in the Controller-Worker pattern may depend on the other workers to perform the computations on its assigned subtask. A machine in an enterprise cluster can reduce the performance of this pattern when it is time-shared by other users during the execution of some worker component within the pattern. A static load distribution based on the speed factors of individual machines used in parallelization is effective only on dedicated parallel systems.
Error Recovery: It is hard to devise mechanisms to handle a failure in some worker component during the implementation of this pattern. If each worker depends on the other workers for performing its computations, such a failure can lead to a deadlock condition. It is also difficult to deal with the failure of communication between the controller and the workers, or between different workers.
Applicability

The Controller-Worker pattern can be used to parallelize any vision application in which

• the data can be partitioned into several data sets

• each data set can be processed concurrently by different workers

• the processing of each data set requires an interaction between some or all the worker components, to exchange intermediate results.
Known Uses

The Controller-Worker pattern has applications mostly at low and intermediate level processing. In low level processing, it can be used for parallelizing two-dimensional Fast Fourier Transforms. At the intermediate level, it can be used for parallelizing Hough transforms and connected component labeling algorithms.

An iterative variant of the Controller-Worker pattern can be realized by performing the compute-communicate cycles iteratively. Each worker component performs computations on its assigned subtask iteratively, and communicates the intermediate results to some or all other worker components at the end of every iteration. However, a parallel implementation using this iterative variant involves large communication costs, and therefore may not result in any significant performance gains in many applications.
3.6 Divide-and-Conquer Pattern

Intent

The Divide-and-Conquer (DC) pattern is used for structuring applications in which either the data or the application algorithm is divided into several subtasks. Each subtask may be executed on a single processor or may be further divided (recursively) into smaller subtasks. The subtasks are executed independently and concurrently, producing several partial results. A set of combining functions is then applied to these partial results to produce the main result.
Motivation

An edge, a local boundary of some object in an image, represents a sharp discontinuity in the image function f(x, y). It is described by a gradient that points in the direction of the largest growth of the image function. An edge has both a magnitude and a direction, which are calculated using the gradient. The gradient is approximated by first-order differences and expressed as a gradient operator Δf(x, y) = (Δx f(x, y), Δy f(x, y)). A popular gradient operator is the Sobel edge detector, which is represented by two convolution masks for finding edges in the horizontal (Δx) and the vertical (Δy) directions, as shown below.
    -1  0  1        1  2  1
    -2  0  2        0  0  0
    -1  0  1       -1 -2 -1

      (a)             (b)

Figure 3.7: Convolution masks for finding a) horizontal edges and b) vertical edges
The direction of the edge at a point (x, y) in the image is given by tan⁻¹(Δy/Δx), while the edge magnitude is expressed as √(Δx² + Δy²). The Sobel edge detector can be parallelized using the DC pattern by computing the horizontal and vertical gradients concurrently. The horizontal and the vertical gradients can then be combined to compute the edge direction and the edge magnitude, using the expressions given above.
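Under the mask orientation shown in Figure 3.7, the gradient, magnitude and direction at a single pixel can be sketched as below; the function name and the cross-correlation convention are assumptions for illustration.

```python
import math

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # mask (a): Δx
GY = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]   # mask (b): Δy

def sobel(img, x, y):
    """Compute Δx and Δy at interior pixel (x, y) by applying the two
    3x3 masks (as cross-correlation), then return the edge magnitude
    sqrt(Δx² + Δy²) and direction atan2(Δy, Δx).  A sketch only."""
    dx = dy = 0
    for i in range(3):
        for j in range(3):
            v = img[x + i - 1][y + j - 1]
            dx += GX[i][j] * v
            dy += GY[i][j] * v
    return math.hypot(dx, dy), math.atan2(dy, dx)
```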
Structure

The DC pattern consists of a manager component and several distinct worker components or processes, as shown in Figure 3.8. The manager component creates a set of worker components to process each subtask. Each worker may perform computations on its assigned subtask or may recursively divide it further into smaller subtasks for executing them on a different set of processor nodes.
[Diagram: a Manager component (SendData, CollateResults, SendFinalResults) connected to p Worker components, each with ReceiveData, Compute/Parallelize and SendResults operations.]

Figure 3.8: DC Pattern
Interaction

The interactions between the components of the DC pattern are shown in Figure 3.9.

• The client requests the manager to parallelize a given application.

• The manager starts the worker components and distributes the subtasks to different worker components.
[Diagram: the Client issues CallToParallelize to the Manager, which performs SendData to each Worker; the Workers perform Compute/Parallelize and SendResults; the Manager performs CollateResults and SendFinalResults.]

Figure 3.9: Object Interaction in the DC Pattern
• Each worker component performs computation on its assigned subtask and returns the partial results to the manager. Alternatively, a worker may recursively divide its assigned subtask into smaller subtasks and execute them concurrently on a different set of processor nodes. A worker, in this case, acts as a manager for parallelizing its assigned subtask.

• The manager computes the main result from the results returned by the worker components.

• The manager returns the main result to the client.
Implementation

The DC pattern can be implemented by following the steps described below:

1. Design the manager component. The manager controls the worker components. It creates and schedules the worker components during the processing of the subtasks. If the DC pattern is used for implementing data parallelism, specify the dividing function which partitions the data into subtasks. However, if the DC pattern is used for implementing algorithmic parallelism, divide the application algorithm manually into distinct program units. The manager should create worker components to execute these program units. In both cases, specify the combining function which combines the partial results returned by the worker components. In the Sobel edge detection example, a combining function in the manager combines the edge data returned by the worker components, in order to compute the edge direction and edge magnitude.
2. Design the worker component. Each worker may simply apply a computing function to its assigned subtask. Alternatively, a worker may serve as a manager for parallelizing its assigned subtask using a different set of processor nodes. Each worker should return the partial results (of the assigned subtask) to its corresponding manager. In the Sobel edge detection example, the worker components compute the edge data in the horizontal (Δx) and the vertical (Δy) directions, concurrently.

3. Specify the interaction between the manager and the workers. This interaction can be specified as follows. The manager starts the worker components and distributes a single subtask to each worker component. The manager then waits for the workers to return the computed results. When a worker communicates its result, the manager signals the worker to terminate its processing. In the Sobel edge detection example, the manager communicates the complete image data to each worker and waits to receive the edge data from all the workers.
4. Implement the manager and the worker components according to the specifications outlined in the previous steps.
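A minimal sketch of this pattern for the edge detection example is given below: a thread pool acts as the set of workers, and simple first-order differences stand in for the full Sobel masks (an illustrative simplification, not the thesis's code).

```python
import math
from concurrent.futures import ThreadPoolExecutor

def grad_x(img):
    """Worker subtask 1: horizontal first-order differences (Δx)."""
    return [[img[r][c + 1] - img[r][c] for c in range(len(img[0]) - 1)]
            for r in range(len(img) - 1)]

def grad_y(img):
    """Worker subtask 2: vertical first-order differences (Δy)."""
    return [[img[r + 1][c] - img[r][c] for c in range(len(img[0]) - 1)]
            for r in range(len(img) - 1)]

def edge_map(img):
    """DC sketch: the manager runs the two gradient subtasks
    concurrently, then a combining function merges the partial
    results into edge magnitudes sqrt(Δx² + Δy²)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fx = pool.submit(grad_x, img)     # subtask 1
        fy = pool.submit(grad_y, img)     # subtask 2
        dx, dy = fx.result(), fy.result()
    # combining function applied by the manager
    return [[math.hypot(dx[r][c], dy[r][c]) for c in range(len(dx[0]))]
            for r in range(len(dx))]
```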
Consequences

The DC pattern provides several benefits:

Separation of concerns: The manager component separates the client code from the code in worker components used for performing the actual computations in the subtasks. Also, the code for creating and controlling the worker components is encapsulated in the manager component, separate from the client.

Efficiency: The DC pattern provides a simple strategy for parallelizing an application. It can be used to achieve improved performance in many applications which can be divided (recursively) into smaller but independent computational units.

Error Recovery: It is relatively easy to devise mechanisms to handle a failure in some worker component during the execution of this pattern. This is due to the fact that all the worker components process their subtasks independently.
The DC pattern suffers from the following liabilities:

Scalability: The scalability of the DC pattern, when used for implementing algorithmic parallelism, is constrained by the amount of parallelism that can be achieved in the algorithm. In fact, the algorithm dictates the parallelism.

Load imbalances: The DC pattern may lead to load imbalances when used for implementing data parallelism. For example, equal distribution of the image data in the connected component labeling algorithm may lead to unequal load distribution when the connected components span only a small region of the image.
Applicability

The Divide-and-Conquer pattern can be used for parallelizing any vision application in which

• the data or the algorithm can be divided into several subtasks

• each subtask can be executed on a single processor or may recursively be parallelized using the divide-and-conquer principle

• all subtasks created can be processed concurrently on different processors without explicit communication between the processors.
Known Uses

The divide-and-conquer parallel programming model has been used for parallelizing a number of vision algorithms. Stout (Stout, 1987) has proposed several divide-and-conquer algorithms for image processing. Sunwoo et al. (Sunwoo et al., 1987) have used divide-and-conquer techniques to segment an image into different regions. Choudhary and Thakur have parallelized connected component labeling algorithms on coarse grained machines using the divide-and-conquer principle (Choudhary & Thakur, 1994). Hameed et al. (Hameed et al., 1997) have employed different divide-and-conquer approaches to parallelize a contour ranking algorithm on coarse grained machines.
3.7 Temporal Multiplexing Pattern

Intent

The Temporal Multiplexing (TM) pattern is used for processing several data sets or a sequence of image frames on multiple processors. Each processor processes a complete data set and executes the same program code.
Motivation

A computer-assisted sperm motility system enables studying the motion of sperm in living organisms (Irvine, 1995). In human beings it is used for estimating the degree of male fertility. In a sperm motility system, a sequence of image frames of the sperm movement is captured over a given time frame. These image frames are then analyzed to find the sperm and motion characteristics, such as the sperm density, the size and shape of the sperm heads, the velocity of the sperm, and the shape of the motion trajectory. A sperm motility system involves a set of common preprocessing and feature extraction operations on the individual image frames. The module to compute the velocity of individual sperm, for example, involves simple operations such as image thresholding, noise suppression, removal of thin lines (sperm tails) or contaminating particles, segmentation, and finally region merging for extracting the sperm heads/cells. The processed image frames are then combined (superimposed) for tracking the motion trajectories of individual sperm and computing the sperm velocities.
Since the preprocessing and feature extraction operations on individual image frames are independent of each other, the TM pattern can be used to process each image frame concurrently. Performing data parallelism on individual image frames in such cases may not improve performance, due to communication overheads and the simplicity of the operations.
[Diagram: a Manager component (ReceiveData, SendDataSet) connected to p Worker components, each with ReceiveDataSet, DoCalculation and SendResults operations.]

Figure 3.10: TM Pattern
Structure

The TM pattern consists of a manager component and several identical worker components or processes, as shown in Figure 3.10. The manager creates, controls and schedules the worker components to process the data sets. It receives the data sets from an external component called the data source. The worker components are responsible for performing computation on individual data sets, and for returning the processed values to an external component called the data sink. The TM pattern consists of one manager and at least two workers.
Interaction

The interactions between the components of the TM pattern are shown in Figure 3.11.

• The external data source supplies a sequence of data sets to the manager.

• The manager assigns individual data sets to available workers. If all the workers are busy, the manager suspends its activities until some worker is free to process a data set.

• Each worker processes its assigned data set, sends the processed values to a data sink component, and interacts with the manager for a new data set.

• The above two steps are repeated until there are no more data sets to be processed.
[Diagram: the Data Source performs SendData to the Manager, which performs SendDataSet to each Worker; each Worker performs DoCalculation and SendResults to the Data Sink.]

Figure 3.11: Object Interaction in the TM Pattern
Implementation

The TM pattern can be implemented by following the steps described below:

1. Design the manager component. The manager controls the worker components. It creates and schedules the worker components for processing the data sets. The manager component maintains a queue of available worker components. When a worker requests a new data set, the manager adds it to the end of this queue. If the queue of available workers is not empty, the manager reads a data set from the data source and assigns it to the first available worker in this queue. However, when the queue is empty (all the workers are busy), the manager suspends its activities until at least one worker is ready to process a data set. In the sperm motility system, the manager assigns each image frame to a separate worker. The manager, in this case, can also serve as a data source. It therefore maintains a repository of all the image frames to be processed by the worker components.
2. Design the worker component. Each worker should be designed to process the assigned data set, send the processed values to the data sink, and request a new data set from the manager. In the sperm motility example, each worker performs a complete set of preprocessing and feature extraction operations on its assigned image frame. Each worker sends the processed image frames to the data sink component.

3. Implement the manager and the worker components according to the specifications outlined in the previous steps.
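The steps above can be sketched as below, with threads standing in for the worker processes and an in-process queue playing the manager's role of handing complete frames to free workers. The function names and the shared queues are illustrative assumptions; a real implementation would distribute frames across machines.

```python
import threading, queue

def temporal_multiplex(frames, process, p=3):
    """TM sketch: a manager feeds whole frames to p identical workers
    through a queue; each worker applies the same `process` function to
    complete frames and posts results to a data sink queue.  Frames
    carry their index so the sink can restore the original order."""
    work, sink = queue.Queue(), queue.Queue()
    for item in enumerate(frames):
        work.put(item)                    # (index, frame) from the data source
    for _ in range(p):
        work.put(None)                    # one end-of-stream marker per worker

    def worker():
        while True:
            item = work.get()
            if item is None:              # no more data sets: terminate
                return
            i, frame = item
            sink.put((i, process(frame))) # SendResults to the data sink

    threads = [threading.Thread(target=worker) for _ in range(p)]
    for t in threads: t.start()
    for t in threads: t.join()
    out = [None] * len(frames)            # data sink reorders by index
    for _ in frames:
        i, r = sink.get()
        out[i] = r
    return out
```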
Consequences

The TM pattern provides several benefits:

Scalability and flexibility: New worker components can easily be added without performing major changes to the manager component. Also, it is easy to change the program code in all worker components to realize different implementations.

Efficiency: The use of the TM pattern enables scaling of the throughput to process the individual data sets in direct proportion to the number of processors used.

Dynamic load balancing: The TM pattern, like the Farmer-Worker pattern, provides an even distribution of the load while processing the data sets. The number of data sets processed by each worker is proportional to the speed of its corresponding node or processor.
The TM pattern suffers from the following liabilities:

Effectiveness: The TM pattern is effective only when there are more data sets/image frames than the number of processors. The parallelism in this pattern is expressed in terms of the number of data sets/image frames processed. When all the data sets/image frames are processed, no further parallelism is available in the application.

Latency: The use of the TM pattern does not improve the latency to process individual data sets; it remains unchanged in this pattern.
Applicability

The TM pattern can be used to parallelize any vision application in which

• it is required to process a collection/sequence of image frames or image data sets

• the processing of each image uses the same program code

• the images can be processed concurrently on different processors without explicit communication between the processors.
Known Uses

The TM pattern is used for parallelizing complete data sets. Downton et al. (Downton et al., 1996) have used temporal multiplexing techniques in a postcode recognition system. They have used it for verifying the validity of postulated postcodes by matching them with the entries in a database of valid postcodes.
3.8 Pipeline Pattern

Intent

The Pipeline pattern is used for parallelizing applications which process a stream of data, and which can be divided into a sequence (pipeline) of several independent subtasks that are executed in a determined order. The data stream in the pattern is provided by a data source component. The processed results are collected by the data sink component. Each subtask is implemented by a worker component which reads a stream of data, processes it, and passes the processed results to another worker (or the data sink) in the pattern.
Motivation
A vehicle identification system involves analyzing images of vehicles in order to identify their owners. Such a system, for example, can be used for tracking the identity of vehicles which break a specified speed limit on a motor highway or city roads. A high speed camera captures the images of the speeding vehicles, which are then analyzed at a certain time of the day. A typical vehicle identification system consists of at least four distinct modules (subtasks), as shown in Figure 3.12.
Input Images → Preprocessing → Feature Extraction → Classification → Database Search → Owner Identification

Figure 3.12: Vehicle identification system
The preprocessing module extracts the region in the image that surrounds the number plate. It then applies thresholding, edge detection and thinning operations on the extracted region in order to recover and skeletonize the characters in the number plate. The output of this module serves as an input to the feature extraction module, which extracts a number of features for each character. The feature vectors of all the characters in the number plate are then presented to the classification module. The classification module compares the feature vector of each character with a set of pre-stored exemplar feature vectors. A set of possible characters for each character in the number plate is then presented to the database search module.

The database search module searches a database of valid vehicle registration numbers for each complete set of characters that may potentially represent a number plate. The ones that match the database entries with the highest probabilities are then considered as recognized number plates. The database search module then outputs the identity of the vehicle from the database entry. For a given number plate image, if the system outputs more than one potential number plate entry, some verification (either manual or automated) needs to be devised to resolve the ambiguity.
The distinct modules of the vehicle identification system can easily be structured using the Pipeline pattern. Each module can run concurrently on a different processor and interact with its neighboring modules only by exchanging streams of data.
Structure

The Pipeline pattern consists of a data source, a data sink, and several worker components, as shown in Figure 3.13. The data source provides a sequence of input values (having the same structure or data type) into the pipeline. The data sink collects the processed values from the end of the pipeline. Each worker component is responsible for receiving the data from its preceding worker (or the data source), processing this data, and sending the processed results to the following worker (or the data sink). The first and the last worker components communicate with the data source and the data sink components, respectively. The intermediate worker components communicate only with their immediate neighbors. Note that the Pipeline pattern does not provide for dividing the application into different subtasks. It provides only a structure for an application that is divided manually into different subtasks. The client is responsible for creating, starting and terminating the components in the Pipeline pattern.
[Diagram: the Client oversees a Data Source (ReadData, SendData), Workers 1 to p (ReceiveData, DoCalculation, SendResults) and a Data Sink (CollectResults, SendFinalResults), connected in a chain.]

Figure 3.13: Pipeline Pattern
Interaction
The interactions between the components of the Pipeline pattern are shown in Figure 3.14.
[Diagram: the Client issues CallToReadData to the Data Source, which performs SendData to Worker 1; each Worker performs DoCalculation and SendResults to its successor; the Data Sink performs CollectResults and SendFinalResults.]

Figure 3.14: Object Interaction in the Pipeline Pattern
• The client calls the data source component to read the data sets.

• The data source component reads and attempts to send a new data set to the first worker. If the first worker is busy processing a previous data set, the data source component suspends itself until the worker is ready to receive the current data set.

• Each intermediate worker (not shown in the figure for brevity) retrieves (pulls) a data set from its preceding worker, processes it, and sends (pushes) the processed data to its successor. A worker may suspend its activities temporarily if the data from the preceding worker is not available, or if the worker immediately following it is not waiting for the data.

• The last worker sends the processed data set to the data sink and waits for a new data set from its predecessor.

• The last three processing steps are repeated until there are no more data sets to be processed in the pipeline.

• The data sink sends the processed data sets to the client.
Implementation

The Pipeline pattern can be implemented by following the steps described below:

1. Divide the application. The application should be manually divided into a sequence of functional units or subtasks. The processing in each subtask must depend only on the output of its direct predecessor. The computational load in each subtask should be proportional to the speed factors of the individual processors available for parallelizing the application. In the vehicle identification system, the application can be divided into four distinct functional units, namely preprocessing, feature extraction, classification and database search.
2. Design the data source and data sink components. These can be designed in two different ways: a) Both the data source and the data sink are designed as separate components which are executed concurrently with respect to the client. The client calls the data source component to read and output the data stream into the pipeline, and waits for the data sink to return the final results collected during the execution of the pipeline. b) Alternatively, the client functions as a data source (or data sink) and creates a separate component for the data sink (or data source). The client should not perform both these tasks by itself, since doing so would not result in any performance gain from using this pattern. In the vehicle identification example, the data source may be designed as a separate component which reads vehicle images from specified files and presents them to the preprocessing module. The data sink component may simply store the details of each number plate and its potential owner(s) in a specified file.
3. Design the worker components. Each worker component should repeatedly receive a data set from its predecessor, process it, and output the processed data set to its successor. Each worker should be implemented as a separate program unit that performs the required computation on its data set. In the vehicle identification example, each worker performs specified operations on its input data and passes its output to the neighboring worker or the data sink.
4. Specify the interaction between different components in the pattern. This interaction can be specified by using inter-process communication calls supported by a message-passing library (section 2.1.1). Note that each worker should format the results in order to pass them to its successor in the pipeline.
5. Implement the components and start the pipeline. The components in the pattern can be implemented according to the specifications given in the previous steps. The client starts each component as a separate thread or process. The processing in the pipeline starts when the data source outputs the data sets to the first worker in the pipeline. Each data set is transformed by the different worker components in the pipeline and is finally collected by the data sink. When there are no more data sets to be processed, the client terminates all the components of the pattern, after collecting the processed results from the data sink.
Consequences

The Pipeline pattern provides several benefits:
Flexibility: Since the worker components in the Pipeline pattern are independent and interact only by exchanging streams of data, they can easily be replaced by more efficient components having the same functionality. The worker components can be reused in different situations. Also, new worker components can easily be added to refine the functionality of the existing pipeline.
Efficiency: The Pipeline pattern helps to increase the system throughput and reduce the latency in applications which process long streams of data. However, using the Pipeline pattern to improve application performance is feasible only when the granularity of each worker is sufficiently high: the time required to transfer the data between the worker components should be much lower than the time required to perform the computations on each worker component.
The Pipeline pattern suffers from the following liabilities:

Sharing global information: Sharing global information between different components in the Pipeline pattern is inefficient and undermines the full benefits of the pattern.

Load balancing: Like the Master-Worker pattern, the Pipeline pattern can suffer from serious load imbalances during its execution on enterprise clusters (section 2.5.3). Throughput and latency are limited by the speed of the slowest worker component in the pattern.
Error Recovery: It is difficult to handle failures in the worker components during the execution of this pattern. Each worker is dependent on other workers for performing its computations. Consequently, a failure in any worker component can lead to a significant loss of processing time. In many cases, the application may need to be restarted from the beginning.
Scalability: An application parallelized using a Pipeline pattern is usually not scalable with respect to the addition of processors, because the number of worker components in a Pipeline pattern is determined by the number of subtasks comprising the application.
Applicability

The Pipeline pattern can be used to parallelize applications in which
• it is necessary to process a long stream of data values
• the application is composed of a sequence of independent functional units which process the data stream independently, but in a determined order
• the functional units communicate with each other only by exchanging streams of data
Known Uses

The Pipeline pattern has applications at all levels of vision processing. At the low level, it can be used for parallelizing the Canny edge detector (Sonka et al., 1993) when applied to a sequence of image frames. The Canny edge detector is composed of several independent functional units and is therefore easily implemented using a Pipeline pattern (Rulf, 1988). Note that the scalability of a Pipeline pattern may be increased by employing two or more Pipeline patterns to parallelize a single application. Each Pipeline pattern can concurrently process a part of the data stream (if feasible) in the application. Using two or more Pipeline patterns to parallelize a single application can be considered a variant of the Pipeline pattern; we call this variant the Multiple Pipeline pattern. Another variant of the Pipeline pattern (used in (Downton et al., 1996)) can be realized by making the pipeline communications 'both ways'. This enables the output of one or more Pipeline components to be used as an input (feedback) to the relevant component(s) in the Pipeline.
3.9 Composite Pipeline Pattern

Intent

The Composite Pipeline pattern consists of a pipeline of design patterns and/or sequential components which together parallelize a complete vision application processing a continuous stream of data. It provides a structure for applications that can be parallelized by dividing them into several independent functional units that communicate with each other only by exchanging streams of data. Each functional unit in turn may be parallelized by using relevant design patterns or may be implemented as a sequential component.
Motivation

Consider the vehicle identification system as outlined in section 3.8. Since the input to each module depends on the output of the previous module, the performance of the overall system depends on the speed of the slowest module. The use of a Composite Pipeline pattern in this situation can lead to improved system performance compared to a simple pipeline implementation. Each module in this system (see Figure 3.15) may be parallelized by dividing the data set within each module into subtasks and processing these subtasks concurrently (data parallelism). Alternatively, each data set may be processed on a different processor without data partitioning (temporal multiplexing).
Figure 3.15: Vehicle identification system (input images → preprocessing → feature extraction → classification → owner identification)
For example, the preprocessing operations on each image may be performed concurrently on different processors. Similarly, the searches for database entries for different number plates may be executed on different processors. Both these modules exhibit the temporal multiplexing form of parallelism. In the feature extraction and classification modules, each character in an image frame may be processed on a separate processor (data parallelism). However, such parallelism may not always be feasible if the communication overheads are too high. In such cases, temporal multiplexing alone may be used to increase the system performance.
Structure

The structure of the Composite Pipeline pattern is shown in Figure 3.16. It is similar to the Pipeline pattern. It has a data source which provides the inputs, a data sink which collects the outputs, and a sequence of design pattern and/or sequential worker components that process the input stream of data. We shall refer to the design patterns and the sequential worker components as functional components of the pattern. Each functional component is responsible for receiving the data from its predecessor, processing this data, and sending the processed results to its successor. Note that the Composite Pipeline pattern, like the Pipeline pattern, does not provide for dividing the application into different subtasks. It only provides a structure for an application that is divided manually into different functional components. The client is responsible for creating, starting and terminating the components in the Composite Pipeline pattern.
Figure 3.16: Composite Pipeline Pattern (client; data source with ReadData/SendData; functional components Pattern(1) ... Pattern(p), e.g. a Farmer-Worker pattern and a TM pattern, each with ReceiveData/DoCalculation/SendResults; data sink with ReceiveData/CollectResults/SendFinalResults)
Interaction

The interactions between the components of the Composite Pipeline pattern are shown in Figure 3.17.
• The client calls the data source component to read the data sets.
• The data source component reads and attempts to send a new data set to the first functional component. If the first functional component is busy processing a previous data set, the data source component suspends itself until the component is ready to receive the current data set.
• Each intermediate functional component (not shown in the figure for brevity) retrieves (pulls) a data set from its predecessor, processes it, and sends (pushes) the processed data to its successor. A functional component may suspend its activities temporarily if the data from the preceding component is not available, or if the following component is not ready to receive the data.
• The last functional component sends the processed data set to the data sink and waits for a new data set from its predecessor.
• The last three processing steps are repeated until there are no more data sets to be processed in the pipeline.
• The data sink sends the processed data sets to the client.

Figure 3.17: Object Interaction in the Composite Pipeline Pattern (client, data source, Pattern (1), Pattern (2), data sink)
Implementation

The Composite Pipeline pattern can be implemented by following the steps described below:
1. Divide the application. The application should be manually divided into a sequence of functional units. The processing in each functional unit must depend only on the output of its direct predecessor. For example, in the vehicle identification system, the application is divided into preprocessing, feature extraction, classification and database search modules.
2. Design the data source and data sink components. These components, as in the Pipeline pattern, can be designed as two separate components distinct from the client. Alternatively, the client can function as a data source (or data sink) and create a separate component for the data sink (or data source).
3. Design the functional components. Design each functional component as an independent program unit which runs sequentially or which can be parallelized by using a relevant design pattern. Each functional component must repeatedly retrieve a data set from its predecessor, process it, and output the processed results to its successor. In the vehicle identification system, some or all of the modules may be designed to implement either data parallelism or temporal multiplexing on their assigned data sets.
4. Specify the interaction between different components in the pattern. This interaction can be specified by using inter-process communication calls supported by a message-passing library (section 2.1.1).
5. Implement the components and start the pipeline. The components in the pattern are implemented according to the specifications given in the previous steps. The processing in the pipeline starts when the data source outputs the data sets to the first functional component in the pipeline. Each data set is transformed by the different functional components in the pipeline and is finally collected by the data sink. When there are no more data sets to be processed, the client terminates all the components of the pattern, after collecting the processed results from the data sink.
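The distinguishing point of these steps — that a functional component may itself be internally parallel — can be sketched as follows. In this illustrative sketch the "parallel" stage is modelled sequentially: each element of its data set is an independent subtask that, in the real pattern, would be farmed out to workers (e.g. by a Farmer-Worker component). The names `classify_stage`, `sum_stage` and `run_composite` are introduced here for illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// A data-parallel functional component: each element of the data set is an
// independent subtask (each could run on a separate worker processor).
std::vector<int> classify_stage(const std::vector<int>& data) {
    std::vector<int> out(data.size());
    for (size_t i = 0; i < data.size(); ++i) out[i] = data[i] * 2;
    return out;
}

// A plain sequential functional component following it in the pipeline.
int sum_stage(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0);
}

// The composite pipeline: a parallel stage feeding a sequential stage.
int run_composite(const std::vector<int>& input) {
    return sum_stage(classify_stage(input));
}

int main() {
    assert(run_composite({1, 2, 3}) == 12);  // (2 + 4 + 6)
    return 0;
}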
Consequences

The Composite Pipeline pattern provides several benefits:
Flexibility: Since the functional components in the Composite Pipeline pattern are independent and interact only by exchanging streams of data, they can easily be replaced by more efficient components having the same functionality. For example, a slow sequential worker component may be replaced by an equivalent parallel functional component. The functional components can be reused in different situations. Also, new functional components can easily be added to refine the functionality of the existing pipeline.
Efficiency: The Composite Pipeline pattern can achieve better performance than a plain Pipeline implementation: a slow worker component in the plain Pipeline implementation can be identified and possibly reimplemented as a parallel functional component. However, the use of the Composite Pipeline pattern is effective only when the granularity of each functional component is sufficiently high.
The Composite Pipeline pattern suffers from the following liabilities:

Load balancing: The Composite Pipeline pattern, like the Master-Worker and the Pipeline patterns, can suffer from serious load imbalances during its execution on enterprise clusters (section 2.5.3). However, these load imbalances can possibly be reduced by using the Farmer-Worker or the Temporal Multiplexing pattern to parallelize the relevant functional components; both these patterns have the dynamic load balancing property.
Error Recovery: It is difficult to handle failures in functional components during the execution of this pattern. Each functional component is dependent on the other components for performing its computations. Consequently, a failure in any functional component can lead to a significant loss of processing time.
Applicability

The Composite Pipeline pattern can be used to parallelize applications in which
• it is necessary to process a long stream of data values.
• the application is composed of a sequence of independent functional units which process the data stream independently, but in a determined order.
• the functional units communicate with each other only by exchanging streams of data.
• each functional unit may be implemented as a sequential component or may in turn be parallelized using a relevant design pattern.
Known Uses

The Composite Pipeline pattern is an architectural pattern which is used for parallelizing complete vision systems. Singh (Singh et al., 1991) and Schaeffer (Schaeffer et al., 1993) have used the composite pipeline principle to parallelize an image rendering application. They used temporal multiplexing to speed up individual stages of the pipeline. Downton et al. (Downton et al., 1996) later proposed the principle of the composite pipeline as a design methodology for parallelizing embedded image processing applications, and applied it to parallelize image coding and postcode recognition applications. They proposed both data and algorithmic parallelism (in addition to temporal multiplexing) to speed up individual stages of the pipeline. A variant of the Composite Pipeline pattern (used in (Downton et al., 1996)) can be realized by making the pipeline communications 'both ways'. This enables the output of one or more Composite Pipeline components to be used as an input (feedback) to the relevant preceding component(s) in the pattern.
3.10 Summary

Design patterns for parallel vision applications represent designs or methods used for parallelizing these applications on various parallel architectures. Although the literature on the parallelization of vision algorithms is vast, there have been no previous efforts to abstract and document the design information in these parallel implementations. In this chapter we have attempted to capture and document this design information in the form of design patterns. These design patterns can be used for implementing parallel solutions to many vision algorithms/applications on coarse-grained parallel machines, such as a cluster of workstations. Each pattern has been described in a uniform way using a template. The template describes how each pattern works, where it should be applied and what the trade-offs are in its use.
The design patterns presented in this chapter include Farmer-Worker, Master-Worker, Controller-Worker, Divide-and-Conquer (DC), Temporal Multiplexing, Pipeline, and Composite Pipeline. The Farmer-Worker pattern is used for parallelizing embarrassingly parallel algorithms, while the Master-Worker and Controller-Worker patterns are used for parallelizing problems exhibiting the synchronous form of parallelism. The Divide-and-Conquer pattern is used for parallelizing algorithms that use a recursive strategy to split a problem into smaller subproblems and merge the solutions to these subproblems into the final solution. The Temporal Multiplexing pattern is used for processing several data sets or image frames on multiple processors. Finally, the Pipeline and Composite Pipeline patterns are used for parallelizing applications which can be divided into a sequence (pipeline) of several independent subtasks that are executed in a determined order. In the Composite Pipeline pattern, each subtask may be further parallelized using other relevant design patterns.
Chapter 4

Low level algorithms
The design patterns described in the previous chapter can be used for parallelizing a majority of vision algorithms on coarse-grained parallel machines, such as workstation clusters. In the remaining part of this thesis, we use and evaluate the applicability of these patterns for parallelizing some representative vision algorithms on a cluster of workstations. There are two different ways in which this can be done: a) for a given design pattern, one can describe a set of vision algorithms which can be parallelized using this pattern; alternatively, b) for a given vision algorithm, one can describe a set of one or more design patterns which can be used to parallelize this algorithm.
We follow the second approach by grouping algorithms in some order (e.g. low level, intermediate level, and high level in computer vision), and describing the various design patterns that can be used to parallelize these algorithms. This approach ensures logical consistency in describing algorithms or techniques used in a given domain, such as computer vision. This chapter therefore discusses the parallelization of some representative low level vision algorithms using the appropriate design patterns. Chapter 5 discusses the parallelization of some intermediate level algorithms, while chapter 6 discusses the parallelization of some representative high level algorithms/applications. We begin this chapter by describing the characteristics of low level algorithms.
Low level algorithms aim at improving the image data by suppressing noise or unwanted distortions, and enhancing some image features important for further processing and/or for human interpretation. The input and output of these algorithms are pixel based intensity images. The computations involved in these algorithms are pixel based image transformations which use a large number of simple mathematical operations on the pixel values in an input image to compute a new set of pixel values in the output image. This chapter discusses the parallelization of some representative low level vision algorithms using the design patterns described in Chapter 3.
Low level vision algorithms can be broadly classified into two categories depending on the size of the pixel neighborhood used for calculating the new pixel value.
• Local algorithms: In local algorithms, the value of a processed pixel depends only on the values of the pixels placed in its local neighborhood (window). The size of the neighborhood in local algorithms may be fixed, as in Sobel edge detection and thresholding operations, or may vary, as in convolution and filtering operations. We also place point operations in this category, where the value of the new pixel depends only on the original value of that pixel (e.g. brightness correction). Local algorithms can be further classified as iterative and non-iterative. An example of an iterative local algorithm is the extremum filter described in section 3.4, while the edge detection algorithm using the Sobel edge operator is an example of a non-iterative local algorithm.
• Global algorithms: In global algorithms, the value of a processed pixel may depend on the values of all pixels covering large neighborhoods or even the entire image. The algorithms in this category are further classified as global fixed and global varying. In global fixed algorithms, the value of a processed pixel depends on the values of all pixels in the input image. Some examples of global fixed algorithms are histogram equalization and the two dimensional discrete Fourier transform. In global varying algorithms, the value of a new pixel may depend on the pixels in the entire input image, or on the pixels in a small region of the input image. For example, in a connected component labeling algorithm, a connected component may span only a small region or it may be spread over the entire image. The amount of computation in global fixed algorithms therefore depends only on the size of the input image, while the amount of computation in global varying algorithms depends on both the size and the contents of the input image.
T he classification scheme described above was used by C houdhary and P a te l (C houd-
hary & P a te l, 1990) to provide an insight in to th e perform ance of an algorithm based on
its com m unication requirem ents. We have ex tended it fu r th e r to in troduce th e iterative
and non-itera tive class of local algorithm s. T he extended classification schem e enables
identification of relevant design p a tte rn s which can be used for parallelizing th e low level
algorithm s.
The rest of the chapter is organized as follows. Section 4.1 outlines the methods which can be used to parallelize most of the low level algorithms. Section 4.2 describes the scheme that is used in partitioning the image data. The remaining sections present the experimental results of parallelizing various representative low level vision algorithms. Section 4.3 presents the parallelization of a histogram equalization algorithm, which is a global algorithm used for contrast enhancement. Section 4.4 discusses various filtering operations and their parallel implementations. Section 4.5 presents results of the parallelization of a two-dimensional Fourier transform. Finally, section 4.6 discusses the parallelization of an image restoration algorithm using Markov random field models.
The algorithms presented in this chapter (and those in the two chapters immediately following) have been implemented on a network of up to sixteen workstations. Each workstation is a Sun SPARCstation 5 machine with 32 Mbytes of local memory and a clock speed of 170 MHz. All workstations thus have the same speed factors (a workstation with a speed factor of 2 is twice as fast as a workstation with a speed factor of 1). The program code for implementing the various parallel algorithms using the corresponding design patterns has been written in C++ and the PVM message-passing kernel (Sunderam, 1990). The performance of the corresponding parallel implementations has been measured in terms of execution times and program speedups. The speedup of a parallel program is defined as

    speedup = (execution time on one workstation) / (execution time on p workstations)    (4.1)
4.1 Parallelization of low level algorithms
Most of the low level vision algorithms are parallelized by partitioning the image into subimages, and processing these subimages concurrently using different processors. Using this strategy, Siegel et al. (Siegel et al., 1992) parallelized a local convolution algorithm using two distinct approaches, namely, complete sums and partial sums. In the 'complete sums' approach, all the data needed by a processor to process its subimage is transferred to it before the computation. The processors then work independently, without interacting with each other during the computation. With the 'partial sums' approach, each processor performs computation on its subimage and interacts with other processors to exchange intermediate results during the computation. We extend these two approaches to parallelize most of the low level algorithms.
The local non-iterative algorithms can be parallelized using the 'complete sums' approach. They can be implemented by using the Farmer-Worker pattern (section 3.3). The local iterative and the global low level algorithms can be parallelized using the 'partial sums' approach. However, the algorithms within these classes exhibit different communication patterns. In a local iterative algorithm, each processor communicates with its neighbors after every iteration. These communications are regular and can be determined before the start of the computation. Local iterative algorithms can therefore be parallelized using the Master-Worker pattern (section 3.4). The global algorithms usually involve all-to-all processor communications. In certain cases, these communications may be determined before the start of the computation, as in the computation of a two dimensional fast Fourier transform of an image. But in other cases, they are determined dynamically, only after the start of the computation, as in the connected component labeling algorithm. The global algorithms are therefore parallelized using the Controller-Worker pattern (section 3.5).
Another important consideration in the parallelization of the low level algorithms is the number of image partitions or subimages created for concurrent execution. The number of subimages created in the local non-iterative algorithms should be about two to three times the number of processors (workers) used in parallelization. This maximizes the degree of parallelism achievable in an application and results in better performance, as described in section 4.4.1. The number of subimages created in the local iterative and global low level algorithms should, however, be equal to the number of processors available, because each worker is required to interact with other workers to exchange intermediate results during the computation. The computational workloads in the subimages, if measurable, should be proportional to the effective speed factors of the corresponding processors used in parallelization.
The effective speed factor of a machine at any instant of time is the fraction of its CPU time that is dedicated to processing the subimage. The effective speed factor of a machine can vary over time depending on the workload (of external processes) on that machine. Note that this strategy of using the workloads to divide the image into subimages ensures only static load distribution. It is effective only when the application is parallelized on a dedicated workstation cluster (section 2.5.3), where the speed factors are always constant.
4.2 Partitioning the image data
The performance of a low level algorithm parallelized on a cluster of workstations depends on the partitioning of the image into subimages, and the corresponding communication overheads. The communication overheads are directly related to the way the image is partitioned. They arise due to the distribution of subimages to the worker processors, the exchange of intermediate results (if applicable), and the collection of final results from the worker processors. There are many different methods to partition a given image into subimages. We use a simple row partitioning method in which an image is horizontally divided into a given number of subimages, as shown in Figure 4.1. The row partitioning method allows one to divide a given image into any number of subimages of appropriate sizes. Thus each processor can be assigned a proportional workload based on its speed factor (Angus et al., 1989).
Figure 4.1 (a) shows the row partitioning of an image into distinct (non-overlapping) subimages for the global algorithms. Such algorithms do not need pixel values from other subimages in order to perform computations on the boundary pixels of any subimage.

Figure 4.1: Partitioning of an image (subimages P(1) ... P(n)). a) Row partitioning b) Row partitioning with data that is to be overlapped and/or communicated

Figure 4.1 (b) shows the row partitioning scheme for parallelizing the local low level algorithms. In a local low level algorithm (except point operations), the value of a boundary pixel in any given subimage may depend on the values of the pixels present in other subimage(s). Therefore, each subimage also has an additional number of overlapping rows belonging to its neighboring subimages, as shown in Figure 4.1 (b). In local iterative algorithms, these overlapping rows are communicated between the neighboring workers after every iteration.
Other methods for partitioning an image are column, diagonal, cross and heuristic. Row and column partition methods are similar, hence either of them could be used for partitioning the image. The diagonal partitioning method involves dividing the image into diagonal strips. This method is, however, difficult to implement and becomes extremely complicated when parallelizing local iterative algorithms. The cross partition method involves dividing the image in both horizontal and vertical directions. The number of subimages created using this method is always a square number. This places a restriction on the number of processors that can be used in parallelization, especially in the algorithms parallelized using the 'partial sums' approach.
The heuristic partitioning method was proposed and used by Lee and Hamdi (Lee & Hamdi, 1995) to parallelize the local convolution operation on a network of workstations. Their algorithm can partition the image into any number of subimages using both horizontal and vertical partitioning directions. However, both the heuristic and cross partitioning methods produce rectangular subimages. In local iterative algorithms, many worker processes may then be required to exchange their intermediate results with eight other worker processes. In row or column partitioning, each worker process is required to interact with at most two other worker processes. Therefore, the row partitioning method has a number of advantages compared to the other partitioning methods.
4.3 Grey scale transformations

Grey scale transformations modify the brightness of the pixels in an image based on the properties of the pixels themselves. They are used to enhance the contrast and improve the appearance of an image so that it can be easily interpreted by a human observer. The most common grey scale transform for contrast enhancement is histogram equalization, which was described in section 3.5.
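For reference, a sequential histogram equalization can be sketched as below. This assumes the standard CDF-remapping formulation; the exact variant of section 3.5 may differ. In the parallel Controller-Worker version, each worker would compute `hist` on its own subimage, the partial histograms would be summed across workers (the all-to-all step), and each worker would then apply the lookup table locally.

```python
import numpy as np

def equalize(image, levels=256):
    """Histogram equalization: remap grey levels through the scaled
    cumulative histogram so the output histogram is roughly flat."""
    hist = np.bincount(image.ravel(), minlength=levels)
    cdf = hist.cumsum()
    # Lookup table mapping each input level to its equalized level.
    lut = np.round((levels - 1) * cdf / cdf[-1]).astype(image.dtype)
    return lut[image]
```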
Histogram equalization is a global low level algorithm. In this section, we present the experimental results of parallelizing this algorithm (as outlined in section 3.5) using the Controller-Worker pattern. The execution times for the histogram equalization algorithm parallelized using different numbers of workstations are displayed in Table 4.1. A plot of these execution times and the speedups achieved for this algorithm are shown in Figure 4.2.

The execution time for the histogram equalization algorithm on a single workstation is of the order of a few seconds. However, the time spent in all-to-all worker communications is relatively large compared to the time spent in the actual computation. Hence, the execution time of the parallel algorithm increases significantly with an increase in the number of workstations, even for 512x512 and 1Kx1K images. For a 2Kx2K image there is a slight improvement in execution time up to about five to six workstations (Figure 4.2), due to the increase in computation time. However, the execution time increases for seven or more workstations. Hence, global algorithms involving all-to-all worker communications, but relatively low execution times, should preferably be executed on a single workstation.
Table 4.1: Execution time in (min:sec) for histogram equalization

Image Size   Number of Workstations
             1     2     4     6     8     10    12    14    16
512x512      0:01  0:01  0:02  0:02  0:02  0:03  0:04  0:04  0:04
1Kx1K        0:02  0:04  0:04  0:04  0:05  0:06  0:06  0:07  0:09
2Kx2K        0:16  0:14  0:13  0:15  0:16  0:16  0:17  0:22  0:23
Figure 4.2: Performance of histogram equalization (left: execution time (sec) vs. processors for 2Kx2K, 1Kx1K and 512x512 images; right: speedup vs. processors against the ideal)
4.4 Image filtering

Image filtering algorithms are image transforms that use a local neighborhood of a pixel in the input image to produce a new pixel value in the output image. A filter may be classified as linear or nonlinear. Linear filters calculate the new pixel value f'(i,j) as a linear combination of the pixel values in a local neighborhood N of the pixel f(i,j) in the input image. A common class of linear filters are the convolution-based filters, which are described in the next section. Linear filters, when used for removing noise in an image, blur sharp edges in that image. Nagao (Nagao & Matsuyama, 1979) and Lee (Lee, 1983) therefore suggested edge-preserving non-linear filters, which not only remove noise but also preserve sharp edges in a given image. Non-linear filters are discussed in sections 4.4.2 and 4.4.3.
4.4.1 Convolution
Convolution is a fundamental operation in image processing. It is used in image smoothing, edge or line detection (Sonka et al., 1993), feature extraction, and template matching (Ranka & Sahni, 1990). If N is a set of neighboring points around a point (a, b) in the image, and if h is an m x m convolution mask of coefficients, the convolution f'(a, b) at (a, b) is given by

f'(a, b) = \sum_{(c,d) \in \mathcal{N}} h(c, d)\, f(a - c, b - d)    (4.2)

where (c, d) is the displacement of the origin of h relative to that of f. On a sequential machine, the computational complexity of performing the convolution operation on an image of size n x n is O(n^2 m^2). This operation can be very time consuming when the size of the image and/or the size of the convolution mask is large. The execution time of this operation can be reduced by dividing the image into subimages, and convolving these subimages concurrently using different processors. By using a set of P processors, the computational complexity of the convolution operation can be reduced to O(n^2 m^2 / P).
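A direct implementation makes the O(n^2 m^2) cost visible: every output pixel touches all m^2 mask coefficients. This sketch is ours, not the thesis code; it zero-pads the border (the thesis does not specify border handling) and computes the correlation form (flip the mask for true convolution).

```python
import numpy as np

def convolve(image, mask):
    """Direct 2-D neighborhood filtering of `image` with an m x m `mask`
    (correlation form), accumulating one shifted copy per coefficient."""
    m = mask.shape[0]
    pad = m // 2
    padded = np.pad(image.astype(float), pad)  # zero-padded border
    out = np.zeros(image.shape, dtype=float)
    for c in range(m):
        for d in range(m):
            # Add mask coefficient times the correspondingly shifted image.
            out += mask[c, d] * padded[c:c + image.shape[0], d:d + image.shape[1]]
    return out
```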
Table 4.2: Execution time in (min:sec) for the convolution operation

Image Size  Window Size   Number of Workstations
                          1      2      4     6     8     10    12    14    16
512x512     3x3           0:05   0:03   0:02  0:02  0:02  0:03  0:03  0:04  0:04
            7x7           0:19   0:10   0:05  0:04  0:03  0:03  0:03  0:04  0:04
            11x11         0:44   0:23   0:11  0:08  0:07  0:06  0:06  0:06  0:06
            15x15         1:22   0:41   0:21  0:15  0:11  0:10  0:10  0:08  0:08
1Kx1K       3x3           0:19   0:11   0:06  0:05  0:05  0:07  0:07  0:08  0:09
            7x7           1:18   0:40   0:20  0:15  0:11  0:11  0:10  0:11  0:11
            11x11         3:04   1:33   0:46  0:32  0:24  0:22  0:19  0:17  0:16
            15x15         5:28   2:49   1:22  0:56  0:41  0:40  0:32  0:31  0:26
2Kx2K       3x3           1:29   0:45   0:23  0:17  0:14  0:15  0:16  0:17  0:20
            7x7           5:25   2:43   1:22  0:55  0:41  0:36  0:31  0:29  0:28
            11x11         12:28  6:14   3:07  2:05  1:34  1:15  1:03  0:59  0:47
            15x15         22:55  11:28  5:44  3:50  2:52  2:22  2:04  1:46  1:30
Figure 4.3: Performance of the convolution operation using a 3x3 window (left: execution time (sec) vs. processors for 2Kx2K, 1Kx1K and 512x512 images; right: speedup vs. processors against the ideal)
The convolution operation can be parallelized using the Farmer-Worker pattern. Table 4.2 shows the execution times of the parallel convolution operation obtained by varying parameters such as the window size, the image size, and the number of workstations used in parallelization. The entries in this table enable us to study the influence of these parameters on the execution time and the speedup of the parallel convolution operation. We can make two different observations from this table. Firstly, by keeping the window size fixed, we can observe the performance results while varying both the image size and the number of workstations used in parallelization. Secondly, by keeping the image size fixed, we can observe the performance results while varying the window size and the number of workstations used in parallelization.

The execution times and the speedups achieved for the parallel convolution operation using a 3x3 and a 15x15 window (window size fixed), for example, are shown in Figure 4.3 and Figure 4.4, respectively. Figure 4.3 shows that for a small window, the execution time decreases as the number of workstations used in parallelization increases. However, the execution time gradually increases when the number of workstations is increased beyond seven or eight. The corresponding speedup curves show a similar behavior. The increase in the execution times, or the decline in the corresponding speedups, after using eight or more workstations is due to the increase in the proportion of communication time relative to the corresponding computation time. However, when the window
size is larger (Figure 4.4), more computations are needed at each pixel in the convolution operation. Since the communication time in a 15x15 convolution operation is nearly the same as that in a 3x3 operation, the ratio of the computation time to the communication time is dominated by the computation time. This results in relatively greater speedups as the number of workstations used in parallelization increases.

We can observe similar results by keeping the image size fixed, but varying the window size and the number of workstations used in parallelization. Figure 4.5 shows the performance results of the convolution operation on a 1Kx1K image. The observed speedups increase as the window size is increased. As in the above case, a larger window size implies more computations in the convolution operation. Therefore, as the time spent in communicating the subimages and the results is almost the same across windows of different sizes, the ratio of the computation time to the communication time increases. Hence, higher speedups can be obtained with an increase in the window size and the number of workstations used in parallelization.
Figure 4.4: Performance of the convolution operation using a 15x15 window (left: execution time (min) vs. processors for 2Kx2K, 1Kx1K and 512x512 images; right: speedup vs. processors against the ideal)
Figure 4.5: Performance of the convolution operation on a 1Kx1K image (left: execution time (min) vs. processors for 15x15, 11x11, 7x7 and 3x3 windows; right: speedup vs. processors against the ideal)
The convolution algorithm was parallelized by Lee and Hamdi (Lee & Hamdi, 1995) on a network of SUN Sparc IPC workstations. They used a heuristic partitioning method (section 4.2) to partition the image into several subimages of the same size. The number of subimages created was equal to the number of workstations used in parallelization. However, this partitioning scheme can reduce the performance of the parallel convolution algorithm in some cases, as explained below:

• Firstly, the machines in a cluster of workstations may have different processing speeds or speed factors. Assigning subimages of the same size to such machines will lead to load imbalances. The size of a subimage assigned to any workstation should therefore be proportional to its effective speed factor.

• Secondly, even if all the workstations used in parallelization have the same speed factors, it is difficult to distribute these subimages to all workstations at the same time. There is always some delay before the last workstation gets its subimage and starts processing. This can cause some reduction in the overall performance of the parallel implementation.
• Finally, the performance of a parallel convolution algorithm implemented on an enterprise cluster (section 2.5.3) will degrade significantly if a participating machine is time-shared to run other processes. Each machine in an enterprise cluster is time-shared between different users.

Hence, the heuristic partitioning method can sometimes result in a significant reduction in the overall performance of the parallel convolution algorithm.
Table 4.3: Performance of the Farmer-Worker pattern on varying the external load and the number of subtasks. The execution times (min:sec) displayed are for the convolution operation (window size 15x15).

Row No.  External Load (Y/N)   Number of Workstations
                               1     2     4     6     8     10    12    14    16
1 (o)    N                     5:28  2:49  1:22  0:56  0:41  0:40  0:32  0:31  0:26
2 (•)    Y                     5:28  2:52  1:25  1:01  0:42  0:42  0:34  0:35  0:27
3 (*)    Y                     5:28  5:10  2:24  1:36  1:11  0:58  0:48  0:41  0:36

(o) subtasks >> processors & no external load; (•) subtasks >> processors & external load; (*) subtasks = processors & external load
Figure 4.6: Performance of the Farmer-Worker pattern in the convolution operation on varying the processor load and the number of subtasks (window size 15x15; left: execution time (min) vs. processors; right: speedup vs. processors against the ideal)
To overcome these limitations, we use the row partitioning method (in the Farmer-Worker pattern) to partition the image into several subimages of the same size. However, the number of subimages created is at least two times the number of workstations used in parallelization. Each machine therefore processes a number of subimages proportional to its speed factor. Table 4.3 shows the performance results of the convolution operation, parallelized using two different methods. The convolution operation was performed on a 1Kx1K image using a 15x15 window. The entries in the first row of the table display the execution times of the parallel convolution algorithm using the Farmer-Worker pattern. The workstations used in parallelization had the same speed factors.
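The dynamic load balancing that makes this work can be sketched as a shared task queue: a slow (externally loaded) worker simply pulls fewer subtasks. This is our own minimal illustration with threads standing in for workstations; the thesis's actual pattern runs worker processes on separate machines.

```python
from queue import Queue, Empty
from threading import Thread

def farm(subtasks, n_workers, work):
    """Farmer-Worker sketch: workers repeatedly pull subtasks from a
    shared queue until it is empty, so faster workers automatically
    process proportionally more subtasks."""
    q = Queue()
    for t in subtasks:
        q.put(t)
    results = []  # list.append is thread-safe in CPython
    def worker():
        while True:
            try:
                t = q.get_nowait()
            except Empty:
                return  # no subtasks left
            results.append(work(t))
    threads = [Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```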
We then reduced the speed of one workstation by executing a computation-intensive, non-terminating external process. We implemented the parallel convolution algorithm using different numbers of workstations, but always included the workstation executing the external process. The effective speed factor of the workstation executing the external process was nearly halved, since it was time-shared between the external process and a worker component of the Farmer-Worker pattern. The entries in the second row of Table 4.3 show the results of this parallelization. The performance results are similar to the previous results (i.e. the entries in the first row of the table) since most of the subimages are now processed by the other workstations. There is not much reduction in the overall performance, as can be seen from Figure 4.6.

However, if we partition the image into several subimages of the same size, and the number of subimages created is equal to the number of workstations used in parallelization, the performance of the parallel convolution operation degrades significantly. This can be seen from the entries in the third row of Table 4.3. The execution time is dominated by the slow workstation executing the external process. Since the slow workstation has the same workload as the other workstations, it takes more time to process its subimage. This reduces the overall performance of the parallel convolution algorithm. Hence, the Farmer-Worker pattern, which has an inherent dynamic load balancing property, can be used to achieve improved performance over the conventional methods used for parallelizing an application.
4.4.2 Rank filtering

Rank filters are non-linear filters which are used for reducing the variance in an image. They eliminate salt-and-pepper noise but, unlike the linear filters, they preserve sharp edges. A rank filter transforms an image by changing each pixel value to a specified value in the neighborhood of that pixel point. If N represents the set of pixel values in the neighborhood of some pixel point (i,j), and if the elements in N are sorted in ascending order, then a rank filter R_i of ith order assigns the ith element in N to the pixel point (i,j). Three special rank filters are R_min, R_max and R_median, which respectively assign the minimum, maximum and median pixel values to the pixel point (i,j). A review of rank filters and their properties is given in (Hodgson et al., 1985).
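The definition above translates directly into code: sort the neighborhood and pick the element of the requested rank. A sketch only (names ours); border pixels keep their original values here, since the thesis does not specify border handling.

```python
import numpy as np

def rank_filter(image, size, rank):
    """Rank filter: replace each interior pixel by the element of the
    given rank in its sorted size x size neighborhood. rank=0 gives
    R_min, rank=size*size-1 gives R_max, the middle rank gives R_median."""
    h, w = image.shape
    pad = size // 2
    out = image.copy()
    for i in range(pad, h - pad):
        for j in range(pad, w - pad):
            window = image[i - pad:i + pad + 1, j - pad:j + pad + 1]
            # axis=None sorts the flattened neighborhood in ascending order.
            out[i, j] = np.sort(window, axis=None)[rank]
    return out
```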
Rank filters can be parallelized using the Farmer-Worker pattern. The execution times for the rank filtering operation parallelized using different numbers of workstations are displayed in Table 4.4. The performance results of the rank filtering operation are similar to those of the convolution operation.
Table 4.4: Execution time in (min:sec) for the rank filtering operation

Image Size  Window Size   Number of Workstations
                          1     2     4     6     8     10    12    14    16
128x128     3x3           0:01  0:02  0:03  0:01  0:01  0:01  0:01  0:01  0:01
            11x11         0:09  0:05  0:03  0:03  0:02  0:01  0:02  0:02  0:02
256x256     3x3           0:02  0:02  0:02  0:01  0:01  0:01  0:02  0:02  0:02
            11x11         0:33  0:20  0:11  0:09  0:07  0:06  0:05  0:05  0:04
512x512     3x3           0:08  0:04  0:03  0:02  0:02  0:02  0:03  0:04  0:04
            11x11         2:14  1:13  0:38  0:30  0:23  0:22  0:18  0:18  0:16
1Kx1K       3x3           0:30  0:17  0:09  0:07  0:07  0:06  0:07  0:07  0:08
            11x11         9:14  4:39  2:20  1:45  1:14  1:14  0:59  0:57  0:43
4.4.3 Spatial filters

A combination of the R_min and R_max rank filters forms a family of spatial filters. Spatial filters can be used as approximations to the true low-pass and high-pass filters. A spatial low-pass filter, for example, can be defined as R_max^(n)(R_min^(n)(F)), where R_min^(n) (or R_max^(n)) denotes applying R_min (or R_max) n times to the image F. The cut-off frequency is determined by n: the larger the value of n, the lower the cut-off frequency. Other definitions of spatial low-pass filters can be found in (Hussain, 1991). A high-pass filtered image is obtained by subtracting the original image F from the low-pass filtered image. A high-pass filter sharpens details in an image.
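A sketch of the construction, under our reading of the (partly garbled) definition: the low-pass image is n applications of R_min followed by n applications of R_max with a 3x3 window, and the high-pass image is the difference against the original, using the sign convention stated in the text. All names are ours; edge replication at the border is our assumption.

```python
import numpy as np

def _rank3(img, reduce_fn):
    """3x3 min/max filter via nine shifted copies, edges replicated."""
    p = np.pad(img, 1, mode='edge')
    h, w = img.shape
    stack = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return reduce_fn(stack, axis=0)

def sharpen(image, n=5):
    """Spatial high-pass: low-pass = R_max^(n)(R_min^(n)(F)), then
    subtract the original from the low-pass result (text's convention)."""
    low = image.astype(int)
    for _ in range(n):
        low = _rank3(low, np.min)   # R_min applied n times
    for _ in range(n):
        low = _rank3(low, np.max)   # then R_max applied n times
    return low - image.astype(int)
```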
Table 4.5: Execution time in (min:sec) for the sharpening operation

Image Size  Window Size   Number of Workstations
                          1      2      4     6     8     10    12    14    16
128x128     3x3           0:04   0:02   0:01  0:02  0:02  0:02  0:02  0:02  0:02
            11x11         1:09   0:35   0:20  0:16  0:14  0:13  0:11  0:10  0:08
256x256     3x3           0:15   0:09   0:05  0:05  0:03  0:03  0:03  0:03  0:02
            11x11         4:51   2:27   1:23  1:07  0:54  0:53  0:46  0:41  0:34
512x512     3x3           0:59   0:30   0:15  0:11  0:09  0:08  0:07  0:07  0:06
            11x11         20:58  10:50  5:15  3:31  2:46  2:25  2:24  2:09  1:40
Figure 4.7: Performance of the sharpening operation using spatial filters (window size 11x11; left: execution time (min) vs. processors for 512x512, 256x256 and 128x128 images; right: speedup vs. processors against the ideal)
Spatial filters are iterative operations; hence, they can be parallelized using the Master-Worker pattern. The execution times for the sharpening operation (high-pass filtering) parallelized using different numbers of workstations are displayed in Table 4.5. A plot of these execution times and the speedups achieved for this operation are shown in Figure 4.7. The low-pass filtering was performed with the value of n equal to 5.

Since this operation involves a low-pass filtering of the image, there is a need to communicate the boundary information after each iteration. Each iteration involves a rank filtering operation on all the subimages. However, the time required to perform the rank filtering operation on each subimage is much higher than the time required to exchange the boundary information. Therefore, the time spent in worker-worker communications does not cause a significant degradation in the overall performance.
4.5 Fast Fourier transforms

A two-dimensional fast Fourier transform (2D-FFT) of an image is a global algorithm in which the value of each pixel depends on the values of all pixels in the image. The Fourier transform of an image enables image filtering in the frequency domain. The two-dimensional Fourier transform F(u, v) of an image f(x, y) is given by (Sonka et al., 1993)

F(u, v) = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n) \exp\left[-2\pi i \left(\frac{mu}{M} + \frac{nv}{N}\right)\right]    (4.3)

        = \frac{1}{M} \sum_{m=0}^{M-1} \left[\frac{1}{N} \sum_{n=0}^{N-1} f(m, n) \exp\left(\frac{-2\pi i\, nv}{N}\right)\right] \exp\left(\frac{-2\pi i\, mu}{M}\right)    (4.4)

where u = 0, 1, ..., M-1, v = 0, 1, ..., N-1, and i = \sqrt{-1}. A 2D-FFT is separable and can therefore be expressed as two one-dimensional fast Fourier transforms: a one-dimensional FFT along the rows followed by a one-dimensional FFT of the intermediate results along the columns, or vice versa. The term in square brackets in equation 4.4, for example, corresponds to the one-dimensional Fourier transform of the mth row.
A 2D-FFT can be parallelized by computing the one-dimensional FFT along the rows (or columns), transposing the intermediate results, and finally computing a one-dimensional FFT along the columns (or rows) (Choudhary & Patel, 1990). We use the Controller-Worker pattern to implement this form of parallelism. Each processor or workstation is assigned a set of contiguous rows of the input image. The number of rows assigned to each processor is proportional to its speed factor. Each processor computes the 1D-FFT along its rows (using the 1D-FFT algorithm given in (Press et al., 1992)). The processors then communicate with each other to transpose the intermediate results (the row FFTs) as shown in Figure 4.8.
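The row-transpose-row decomposition can be checked on a single machine; this sketch (ours, using numpy rather than the Press et al. routine) is the sequential skeleton of the parallel scheme, where the transpose becomes the all-to-all block exchange of Figure 4.8. Note that numpy's FFT omits the 1/(MN) normalization of equation 4.3.

```python
import numpy as np

def fft2_by_rows(image):
    """2-D FFT as: 1-D FFTs along rows, a transpose, then 1-D FFTs
    along rows again (i.e. along the original columns)."""
    rows = np.fft.fft(image, axis=1)    # each processor transforms its rows
    cols = np.fft.fft(rows.T, axis=1)   # transpose (all-to-all), row FFTs again
    return cols.T
```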
Figure 4.8: The data blocks needed to transpose the intermediate results (an N x N array of intermediate results divided among processors P0-P3; each pair of exchanged blocks is drawn with the same shade/pattern)
Each processor needs to communicate and exchange a block with every other processor. A pair of data blocks exchanged by any two processors is shown with the same shades/patterns (Figure 4.8). After exchanging the row FFTs as specified above, each processor computes the 1D-FFT along the columns. Finally, each processor sends the computed results to the Controller.
Table 4.6: Execution time in (min:sec) for the FFT operation

Image Size   Number of Workstations
             1     2     4     6     8     10    12    14    16
256x256      0:01  0:03  0:03  0:03  0:04  0:04  0:05  0:05  0:05
512x512      0:04  0:07  0:08  0:09  0:11  0:12  0:15  0:17  0:20
1Kx1K        0:20  0:28  0:29  0:31  0:45  0:51  0:52  0:57  1:01
The execution times for the FFT operation parallelized using different numbers of workstations are displayed in Table 4.6. From Table 4.6 we can observe that the communication overheads dominate the performance of the FFT operation. The computational time for this operation on a single workstation is of the order of a few seconds. However, the time spent in all-to-all worker communications and in communicating the final results to the controller is relatively large compared to the time spent in the computation. Moreover, the worker-worker communications involve costly floating point exchanges. It is therefore difficult to achieve any significant performance gains in the parallelization of the 2D-FFT operation in a workstation environment.
4.6 Image restoration

4.6.1 Markov random field models for image recovery

Markov random field (MRF) models and Bayesian methods are stochastic techniques used in image restoration, image segmentation and image interpretation. In an MRF model, the problem is formulated as an optimization problem [the maximum a posteriori (MAP) estimation rule] by representing the local characteristics of the image pixels by a Markov random field and its associated Gibbs distribution. An iterative optimization method, such as simulated annealing, is applied to generate a sequence of images which converge in an appropriate sense to the optimal MAP estimate. The algorithms based on this stochastic technique are computationally intensive and highly parallel. The algorithm used for image restoration is presented here in a nutshell. A detailed discussion and various other algorithms based on this technique are presented in (Mardia & Kanji, 1993).
If f is the observed image and Ω denotes the set of all possible interpretations of f, then the MAP estimate of f is the one which maximizes the probability of the interpretation g given the observed image f, i.e. we seek

\max_{\omega \in \Omega} \left[ P(g = \omega \mid f) \right]    (4.5)

After rigorous mathematical analysis and simplification, this ultimately leads to the minimization of an energy function, which is given by (Buxton et al., 1986)

E(g) = \sum_{(a,b)} \sum_{(i,j) \in \mathcal{N}} V[g(a,b), g(i,j)] + \sum_{(a,b)} \frac{(f(a,b) - g(a,b))^2}{2\sigma^2}    (4.6)

where (a, b) is any point in the image and (i, j) ∈ N, which is a set of neighboring points around the point (a, b). The parameter σ denotes the standard deviation of the additive Gaussian noise (with zero mean) in the degraded image. The real-valued function V[g(a,b), g(i,j)] adds a value to the energy function which is inversely proportional to the degree of similarity between the pixel intensities of the image points (a, b) and (i, j).
The energy function given by equation 4.6 is minimized using the simulated annealing process described below (Buxton et al., 1986), (Kapoor et al., 1994).

1. Initialize the starting temperature T

2. For each point (a, b) in the image do

   • compute the energy at point (a, b)

   • generate a trial pixel value and, using this value, compute the trial energy at (a, b). Compute the change in energy ΔE = trial energy − energy

   • if (ΔE < 0) then accept the state change, i.e. assign the trial value to the point (a, b); otherwise assign the trial value to the point (a, b) only when exp(−ΔE / T) > random[0, 1)

3. Repeat step 2 N_inner times

4. Lower the temperature to C / log(k + C), where k is the total number of iteration cycles (complete raster scans of the image) and C is a constant, independent of k

5. Repeat steps 2 to 4 N_outer times
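The acceptance rule in step 2 is the standard Metropolis step and can be sketched in isolation. All names here are ours, for illustration; `energy` stands for the local energy contribution of a candidate pixel value.

```python
import math
import random

def anneal_pixel(energy, current, trial, T, rng=random.random):
    """One acceptance step of the annealing loop: accept the trial value
    if it lowers the energy, otherwise accept it with probability
    exp(-dE/T); at low T, uphill moves become vanishingly rare."""
    dE = energy(trial) - energy(current)
    if dE < 0 or math.exp(-dE / T) > rng():
        return trial
    return current
```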
Table 4.7: Execution time in (min:sec) for image restoration using the MRF model

Image Size   Number of Workstations
             1      2     4     6     8     10    12    14    16
128x128      0:35   0:20  0:11  0:08  0:07  0:07  0:06  0:05  0:05
256x256      2:23   1:19  0:40  0:29  0:21  0:20  0:17  0:15  0:12
512x512      10:19  5:15  2:42  1:46  1:22  1:06  1:03  0:56  0:43
The MRF algorithm can be parallelized using the Master-Worker pattern. The execution times for the MRF algorithm parallelized using different numbers of workstations
Figure 4.9: Performance of the image restoration algorithm using the MRF model (window size 3x3; left: execution time (min) vs. processors for 512x512, 256x256 and 128x128 images; right: speedup vs. processors against the ideal)
are displayed in Table 4.7. A plot of these execution times and the speedups achieved for this operation are shown in Figure 4.9. The algorithm was executed with the values N_inner = N_outer = 10. The MRF algorithm is communication intensive. The workers communicate the boundary information after each raster scan of their assigned subimages. The total number of worker-worker communications in this example is therefore very high (each worker communicating 100 times with every other worker). But since the computing time between successive communications within each worker is relatively large, the observed speedups are quite close to the ideal speedups.

Note that, unlike the Farmer-Worker pattern, an application parallelized using the Master-Worker pattern does not have an inherent load balancing property. This can result in serious load imbalances when the pattern is implemented on an enterprise cluster (section 2.5.3). Each worker component in the Master-Worker pattern depends on the other workers to perform the computations on its assigned subtask (subimage). A machine executing a worker component of the Master-Worker pattern can delay the processing in the other worker components when it is also time-shared to run external processes. This can lead to a significant reduction in the overall performance of the corresponding application that is parallelized using this pattern.
We can observe the effect of executing an external process (external load) on the performance of the Master-Worker pattern by conducting a simple experiment. As an example application, we parallelize the image restoration operation based on the MRF model using the Master-Worker pattern. The performance results of the parallel implementation, using a 512x512 image, are shown in Table 4.8. The entries in the first row of the table display the execution times without any external load or processes on the machines executing the pattern. The amount of work distributed to the worker components is proportional to the effective speed factors of their corresponding machines. The entries in the second row display the execution times when one of the machines is time-shared to run an external process during the execution of a worker component of the Master-Worker pattern. The effective speed factor of such a machine is therefore halved with respect to the rest of the machines. Hence, the corresponding worker component takes longer to perform its computations. All workers in the Master-Worker pattern exchange intermediate results with their neighbors after every iteration. The presence of a slow worker component therefore results in increased waiting time for the remaining worker components when exchanging their intermediate results. This reduces the overall performance of the application, as can be seen from the entries in the second row of Table 4.8.

A potential solution to the load imbalance problem in the Master-Worker pattern is to dynamically reschedule the worker components after every fixed number of iterations. However, the time required to schedule the worker components on other idle machines should be significantly lower than the overall computation time of the application.
Table 4.8: Performance of the Master-Worker pattern when subjected to external load. The execution times (min:sec) displayed are for the image restoration operation using the MRF model on a 512x512 image.

Row No.  External Load (Y/N)   Number of Workstations
                               1      2     4     6     8     10    12    14    16
1 (o)    N                     10:19  5:15  2:42  1:46  1:22  1:06  1:03  0:56  0:43
2 (•)    Y                     10:19  9:43  4:42  3:09  2:22  1:53  1:43  1:19  1:11
Figure 4.10: Performance of the Master-Worker pattern (in the image recovery operation using the MRF model on a 512x512 image) subject to external load and load distribution (left: execution time (min) vs. processors, with and without external load; right: speedup vs. processors against the ideal)
4.7 Summary

In this chapter we have presented parallel implementations of some representative low
level vision algorithms on a cluster of workstations. Each algorithm has been parallelized
using appropriate design patterns such as Farmer-Worker, Master-Worker, and Controller-
Worker. The algorithms which have been parallelized include histogram equalization,
convolution, rank filtering, image sharpening using spatial filters, 2D-FFT (of an image),
and image restoration using MRF models. Some of these algorithms parallelized using the
Controller-Worker pattern (e.g. histogram equalization and 2D-FFT) do not result in any
significant speedups. This is because the time spent in all-to-all worker communications in
the Controller-Worker pattern is relatively high compared to the time spent in the actual
computation. These algorithms therefore do not represent ideal candidates for parallel
implementation on workstation clusters.
Parallel implementations of other low level algorithms have, however, shown promising
results. The convolution and rank filtering operations parallelized using the Farmer-
Worker pattern have resulted in significant performance gains. We have also illustrated
the advantage of using a Farmer-Worker pattern to achieve improved performance over the
conventional methods of parallelizing these algorithms. The image sharpening and image
restoration algorithms, representing a synchronous form of parallelism, have been
parallelized using the Master-Worker pattern. Although these algorithms are communication
intensive, the computing time between successive communications at each worker is
relatively high. The observed speedups in these algorithms are therefore reasonably close
to the ideal speedups. Finally, we also illustrated the problem of load imbalances that
can occur in the Master-Worker pattern when implemented on enterprise clusters
(section 2.5.3). These load imbalances, caused by external processes, can lead to a
significant reduction in the overall performance of the corresponding algorithm
parallelized using this pattern.
Chapter 5

Intermediate level processing
In this chapter we discuss the parallelization of some representative intermediate level
algorithms in computer vision. Intermediate level processing forms a bridge between the low
level and the high level processing operations in computer vision. It comprises algorithms
which reduce the visual information produced by the low level operations to a form suitable
for the recognition step in high level processing. The basic unit of information processed
by these algorithms is a token, which can represent a line, an intensity, color or
texture based region, or a surface. The processing step involves grouping these tokens
into generic entities such as sets of parallel lines, rectangles or polygons, homogeneous and
contiguous regions, or plane surfaces. Hence, the operations involved at the intermediate
level are mainly partitioning and merging, which transform the tokens into more
useful and meaningful structures for further processing.
However, unlike in low level processing, the operations or computations at the
intermediate level are not very regular. The form of parallelism in the algorithms at
this level is therefore not immediately evident. For example, even the most sophisticated
low level algorithms for detecting edges and lines in an image can generate a significant
number of line fragments across the image. A grouping algorithm used for linking and
reorganizing the line fragments into meaningful structures may need to match and merge
fragments of lines across large fractions of the image. In the parallel implementation, this
may lead to a large amount of non-local and irregular communication between
a significant number of processors. Hence, developing parallel solutions for intermediate
level algorithms is relatively difficult. During the past several years, many parallel algorithms
for intermediate level operations have been suggested and are constantly being improved.
However, most of these algorithms have been designed for a specific class of parallel
architectures (Chaudhary & Aggarwal, 1990). In this chapter, we discuss the parallelization of
some representative intermediate level algorithms on coarse-grained machines, such as a
cluster of workstations.
Segmentation is one of the most important intermediate level operations in computer
vision. It involves the extraction of features or objects from an image, which are used in
subsequent processing, namely, object description and recognition. The main objective of
segmentation is to partition the image into meaningful regions which constitute a part or
the whole of the objects in an image. There are two main approaches to segmentation, namely,
region-based and edge or pixel-based (Gonzalez & Woods, 1993), (Awcock & Thomas,
1995). Region-based segmentation aims at creating homogeneous regions by grouping
together pixels which share common features. Pixel-based segmentation aims to detect
and enhance edges in an image, and then link them to create a boundary which encloses
a region of uniformity. Region-based segmentation is identified as a similarity method,
since the image regions require some similarity criterion for their creation. In contrast,
pixel-based segmentation is termed a discontinuity method, since the creation of regions
involves the detection of edges, that is, abrupt discontinuities in pixel grey-level values.
This chapter is organized as follows. In section 5.1 we discuss the parallelization of a
region-based segmentation algorithm. In section 5.3 we discuss the parallel implementation of
a perceptual grouping algorithm used for grouping line tokens into meaningful entities such
as straight lines, junctions, and rectangles or polygons. Perceptual grouping algorithms
constitute the pixel-based approach to image segmentation. In each of these sections, we
present a sequential algorithm followed by its corresponding parallel implementation.
These implementations have been designed and developed for parallel execution on a
cluster of workstations, using the relevant design patterns.
5.1 Region-based segmentation
Region-based segmentation can be formally defined as follows (Gonzalez & Woods, 1993).
A region R of an image X is defined as a connected homogeneous subset of the image with
respect to some 'similarity criterion' such as gray tone, or texture. Let P denote a logical
predicate which assigns the value true (1) or false (0) to R, depending only on the
properties of the pixels in R. For example, P(R) = true if the difference between the maximum
and minimum pixel value in R is less than some threshold. A region-based segmentation
of an image is a partition of X into several homogeneous regions R_i, i = 1, 2, ..., n such that

\bigcup_{i=1}^{n} R_i = X    (5.1)

R_i \cap R_j = \emptyset \text{ for all } i \neq j    (5.2)

P(R_i) = \text{true for } i = 1, 2, \ldots, n    (5.3)

P(R_i \cup R_j) = \text{false for adjacent } R_i, R_j,\ i \neq j    (5.4)

Condition (5.1) indicates that every pixel must be in a region. Condition (5.2)
indicates that the regions must be disjoint (their intersection must be the empty set).
Condition (5.3) deals with the properties that must be satisfied by the pixels in the regions.
Finally, condition (5.4) indicates that the adjacent regions R_i and R_j are different in the
sense of the predicate P.
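Conditions (5.1)-(5.4) can be checked mechanically on a small labelled image. The following Python sketch is illustrative only (it is not the thesis implementation); it assumes the max-min predicate described above and 4-connectivity for region adjacency:

```python
# Sketch: verifying the region-partition conditions (5.1)-(5.4) on a tiny
# labelled image. The predicate P and the 4-connectivity choice are
# illustrative assumptions, not fixed by the definition itself.

def P(image, region, threshold=10):
    """P(R) = true iff max-min pixel value within R is below threshold."""
    values = [image[r][c] for (r, c) in region]
    return max(values) - min(values) < threshold

def check_partition(image, regions, threshold=10):
    h, w = len(image), len(image[0])
    all_pixels = set().union(*regions)
    covers = all_pixels == {(r, c) for r in range(h) for c in range(w)}   # (5.1)
    disjoint = sum(len(R) for R in regions) == len(all_pixels)           # (5.2)
    homogeneous = all(P(image, R, threshold) for R in regions)           # (5.3)
    # (5.4): adjacent regions must not satisfy P when united
    def adjacent(R1, R2):
        return any((r + dr, c + dc) in R2
                   for (r, c) in R1
                   for dr, dc in [(0, 1), (1, 0), (0, -1), (-1, 0)])
    separated = all(not P(image, R1 | R2, threshold)
                    for i, R1 in enumerate(regions)
                    for R2 in regions[i + 1:] if adjacent(R1, R2))
    return covers and disjoint and homogeneous and separated

image = [[10, 10, 60, 60],
         [10, 10, 60, 60]]
left  = {(r, c) for r in range(2) for c in range(2)}
right = {(r, c) for r in range(2) for c in range(2, 4)}
print(check_partition(image, [left, right]))   # a valid segmentation
```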
Region-based segmentation algorithms can be classified into three categories:

1. Region growing: In region growing an image is divided into an arbitrary number of
elementary regions, often starting at the level of individual pixels. These elementary
regions are then merged to form larger regions on the basis of a certain homogeneity
criterion. The region growing algorithm starts with an image partition that satisfies
condition (5.3) and proceeds to fulfill condition (5.4). The merging process
terminates when no two adjacent regions are similar.

2. Region splitting: In contrast, region splitting views the entire image as a single region.
Each region is then recursively subdivided into smaller subregions if the region is
not homogeneous enough. The processing starts in a condition satisfying (5.4) and
proceeds to fulfill condition (5.3). The measure of homogeneity is similar to that
used in region growing.

3. Region splitting and merging: This scheme combines both the split and merge
operations in one algorithm (Horowitz & Pavlidis, 1974), in order to exhibit the advantages
of both methods. The image is initially subdivided into an arbitrary set of
disjoint regions which are then merged and/or split in an attempt to satisfy the
conditions stated in equations 5.1-5.4. A split and merge algorithm begins with
neither of the two conditions (5.3) and (5.4) satisfied and ends up satisfying
both (5.3) and (5.4).
A simple realization of the split and merge technique is to represent the entire image
as one region initially, and then recursively divide into smaller and smaller
quadrant regions, in a quadtree fashion (Figure 5.1), any region R_i for which P(R_i) =
false (Gonzalez & Woods, 1993). Also, merge the adjacent regions R_i and R_j for which
P(R_i \cup R_j) = true. The algorithm stops when no further splitting or merging is possible.
The root of the tree in Figure 5.1 corresponds to the entire image while the leaves of the
tree correspond to individual pixels. Each intermediate node corresponds to a subdivision.
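The split phase of this scheme can be sketched as follows. This is an illustrative Python fragment (the thesis implementation is not in Python), using the max-min predicate from the definition above and assuming a square, power-of-two image; the complementary merge pass over adjacent leaves is omitted.

```python
# Sketch of the quadtree split phase: a square image is recursively split
# into quadrants until every leaf region satisfies P(R) (max-min pixel
# value below a threshold). Regions are (top, left, size) triples.

def P(image, top, left, size, threshold=10):
    vals = [image[r][c] for r in range(top, top + size)
                        for c in range(left, left + size)]
    return max(vals) - min(vals) < threshold

def split(image, top=0, left=0, size=None, threshold=10):
    if size is None:
        size = len(image)
    if size == 1 or P(image, top, left, size, threshold):
        return [(top, left, size)]              # homogeneous leaf region
    half = size // 2
    regions = []
    for dt, dl in [(0, 0), (0, half), (half, 0), (half, half)]:
        regions += split(image, top + dt, left + dl, half, threshold)
    return regions

image = [[10, 10, 60, 60],
         [10, 10, 60, 61],
         [10, 10, 10, 10],
         [10, 10, 10, 10]]
leaves = split(image)
print(leaves)   # four homogeneous 2x2 quadrants
```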
[Figure: (a) an image partitioned into regions R1 (subdivided into R11, R12, R13, R14),
R2, R3 and R4; (b) the corresponding quadtree with root R.]
Figure 5.1: a) Partitioned image b) Corresponding quadtree
5.2 Parallel Region-based segmentation
Region-based segmentation can be computationally expensive for images of complex scenes.
Hence, recent work in region-based segmentation has concentrated mainly on developing
efficient parallel algorithms (Copty et al., 1989), (Choudhary & Thakur, 1994),
(Hambrusch et al., 1994), (Alnuweiri & Prasanna, 1992), (Willebeek-LeMair & Reeves, 1990),
(Haralick & Shapiro, 1985). The effectiveness of a particular algorithm depends on the
application area, the input image, and the type of parallel architecture. In this section, we
focus on an experimental evaluation of the parallel split and merge segmentation algorithm
applied to gray-scale images and implemented on coarse-grained machines, such as a
cluster of workstations.

The region-based split and merge segmentation algorithm is well suited for parallel
implementation using the divide and conquer principle. Divide and conquer algorithms
(Stout, 1987) use a recursive strategy to split a problem into smaller subproblems and
merge the solutions to these subproblems into the final solution. Divide and conquer
strategies appear to provide a natural and efficient parallel solution to many problems on
coarse-grained machines. Several divide and conquer algorithms have been proposed for
image processing (Chaudhary & Aggarwal, 1991), (Stout, 1987), (Sunwoo et al., 1987).
The first phase in the parallel split and merge segmentation algorithm involves splitting
the image into several subimages such that each processor or workstation has its own
subimage associated with it. We describe the splitting and distribution process later. In
the next phase, each workstation applies a sequential region growing algorithm to segment
its associated subimage. The region growing algorithm defines individual pixels as initial
elementary regions. It then adds adjacent pixels to a region if the difference between their
grey values and the average pixel value of the current pixels in the region is less than a
threshold. After completing the segmentation process, the final phase involves merging
the segmented subimages at the boundaries of subdivision. The merging process occurs
in phases, in a binary tree fashion as shown in Figure 5.2 (b), and takes log P steps for a
given number of processors P. The segmented regions of the entire image are in the root
processor after the merging step.
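The region-growing step run on each subimage can be sketched as follows. This Python fragment is illustrative only: the scan order, 4-connectivity, and breadth-first growth are implementation choices assumed here, not details fixed by the text; only the running-average threshold test comes from the description above.

```python
from collections import deque

# Illustrative sketch of sequential region growing: a pixel joins a region
# when its grey value differs from the region's running average by less
# than a threshold.

def region_grow(image, threshold=15):
    h, w = len(image), len(image[0])
    label = [[None] * w for _ in range(h)]
    sizes = []
    for sr in range(h):
        for sc in range(w):
            if label[sr][sc] is not None:
                continue
            rid = len(sizes)                     # start a new region here
            total, count = image[sr][sc], 1
            label[sr][sc] = rid
            queue = deque([(sr, sc)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]:
                    if (0 <= nr < h and 0 <= nc < w
                            and label[nr][nc] is None
                            and abs(image[nr][nc] - total / count) < threshold):
                        label[nr][nc] = rid      # grow the region
                        total += image[nr][nc]
                        count += 1
                        queue.append((nr, nc))
            sizes.append(count)
    return label, sizes

image = [[10, 12, 90, 91],
         [11, 10, 92, 90]]
label, sizes = region_grow(image)
print(label)   # two regions: the dark left half and the bright right half
```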
Merging of the segmented subimages is performed at the boundary of subdivision.
While merging along any boundary, the intensity values of the two neighboring pixels at
the boundary are compared. If they satisfy the homogeneity criterion (the difference in
their values is less than some threshold) the two regions across the boundary are merged.
The value of each pixel in the merged region is set to the average of all the pixel values
within this region. If the two neighboring pixels already belong to the same region, or do
not satisfy the homogeneity criterion, the regions are kept unchanged.
The splitting and distribution of the subimages is the inverse of the merging process
performed at different levels of the binary tree. The processor at the root of the binary
tree divides the image into two subimages and sends them to the two processors at the
lower level. Each intermediate processor in the tree subdivides its assigned subimage into
three parts (Figure 5.2 (a)). It retains one part for itself (for segmenting), and sends the
other two parts to its left and right children in the binary tree. If there is no right child (as
in node 3, Figure 5.2), the subimage is subdivided into only two parts. The leaf processors
do not perform any subdivision on their assigned subimage. At the end of the splitting and
distribution process, each processor (workstation) has an associated subimage. The size
of this subimage is proportional to the speed factor of the underlying workstation.
[Figure: (a) distribution of subimages down the binary tree of processors; (b) merging of
segmented subimages back up the tree.]
Figure 5.2: a) Distribution of subimages b) Merging of subimages
The split and merge segmentation algorithm can be parallelized using the Divide-and-
Conquer (DC) pattern (section 3.6). The execution times for the parallel segmentation
algorithm implemented on a varying number of workstations are displayed in Table 5.1. A
plot of these execution times and the speedups achieved for this operation are shown in
Figure 5.3. The value of the threshold used for adding adjacent pixels to the regions in
the corresponding subimages was 15. From Figure 5.3 it can be seen that although the
execution time of the parallel segmentation algorithm initially decreases, it does not show
any significant improvement when the number of workstations used in parallelization is
increased beyond six. The corresponding speedup curves show similar behavior. The drop
in the scalability of the parallel segmentation algorithm is due to the time complexity of
the merging processes.
Table 5.1: Execution time (min:sec) for the parallel split and merge segmentation
algorithm

Image Size  No. of Regions               Number of Workstations
                           1     2     4     6     8     10    12    14    16
256x256     1385           2:12  1:15  0:52  0:38  0:35  0:33  0:32  0:29  0:32
512x512     2023           3:10  1:55  1:27  1:02  1:01  0:58  0:49  0:50  0:50
1Kx1K       2423           4:09  2:54  2:35  2:01  1:59  1:53  1:51  1:52  1:52
[Figure: two plots — execution time (mins) vs. processors for the 256x256, 512x512 and
1Kx1K images, and speedup vs. processors against the ideal speedup.]
Figure 5.3: Performance of the parallel split and merge segmentation algorithm
Table 5.2 displays the execution times for performing various operations in the parallel
segmentation algorithm applied to a 512x512 image. The percentage figures for each
operation in a column are computed with respect to the total parallel execution time
(displayed in the last row of the column) required to segment an image on a given number
of workstations. The experimental results presented in this table show that the time
spent in the merging operation increases with the number of workstations used
in parallelization. In certain cases, it exceeds the total time required for segmenting the
individual subimages. The influence of the communication time on the overall performance
of the parallel segmentation algorithm is relatively insignificant, as can be seen from the
percentage figures of the corresponding execution times displayed in Table 5.2.
Table 5.2: Execution time (min:sec) for various operations in the parallel split and
merge segmentation algorithm applied to a 512x512 image

Operation                          Number of Workstations
                 2        4        6        8        10       12       14       16
Region Growing   1:46     1:07     0:42     0:36     0:30     0:25     0:25     0:20
                 (92.2%)  (77.0%)  (67.7%)  (59.0%)  (51.7%)  (51.0%)  (50.0%)  (40.0%)
Merging          0:08     0:19     0:19     0:23     0:26     0:22     0:22     0:27
                 (7.0%)   (21.8%)  (30.7%)  (37.7%)  (44.8%)  (44.9%)  (44.0%)  (54.0%)
Communication    0:01     0:01     0:01     0:02     0:02     0:02     0:03     0:03
                 (0.8%)   (1.2%)   (1.6%)   (3.3%)   (3.5%)   (4.1%)   (6.0%)   (6.0%)
Total time       1:55     1:27     1:02     1:01     0:58     0:49     0:50     0:50
Hence, if communication time is not a dominant factor, the performance of a parallel
algorithm implemented using a Divide-and-Conquer pattern is mainly influenced by the
time complexity of the merge operation. Note that the regions produced by a parallel
segmentation algorithm may sometimes differ from those produced by an equivalent
sequential algorithm due to different starting pixel points. This can happen when the
contrast between the regions in the image is low. The majority of the previous
implementations of parallel segmentation algorithms have either used binary images or grey-
level images containing artificial regions which have a high degree of contrast with each
other.
5.3 Segmentation using Perceptual Organization
An edge or pixel based segmentation involves the detection of edge points representing
discontinuities in pixel intensities in an image, and linking these edge points into chains of
contiguous curves. However, this method often results in a fragmented segmentation in
which the curves produced do not correspond to complete object boundaries in images of
complex environments. Two approaches have been proposed to deal with this problem.
One is suitable for applications in restricted domains and makes use of model-based
techniques (Chin & Dyer, 1986). Model-based techniques rely on prior knowledge
of the objects in a scene, and predict their appearance in the low level descriptions that
can be extracted from the fragmented segmentation. The other approach, which has become
popular in recent years and which appears promising even in complex environments, is that
of perceptual organization (Lowe, 1985).
Perceptual organization hierarchically organizes low level image features into higher level
structures: edge points into lines, lines into parallels, rectangles and polygons, and
rectangles and polygons into object descriptions. Perceptual organization is formally
defined as the ability of the human visual system to derive relevant groupings or structures
from the input images without any prior knowledge about their contents (Lowe, 1985). The
grouping process follows the laws of perceptual grouping such as proximity (closer elements
are grouped together), similarity (similar elements are grouped together), continuation
(elements lying on a common line or curve are grouped together), closure (curves tend to
be completed to enclose a region), and symmetry (elements symmetric about some axis
are grouped together). The human visual system is very good at detecting geometric
relationships such as collinearity, parallelism, connectivity, and repetitive patterns in an
otherwise randomly distributed set of image elements, and it can usually see shapes in
arrangements of poor machine-generated edge outputs of even complex scenes (Lowe,
1985).
Perceptual organization has recently been applied to solve a number of practical computer
vision problems. It has proved to be effective for the extraction of straight lines (Boldt
et al., 1989), the extraction of curves (Dolan & Weiss, 1993), the detection of buildings in
aerial images (Huertas et al., 1993), (Mohan & Nevatia, 1989), searching for geometric
structures in natural scene images (Reynolds & Beveridge, 1987), and the detection of large
man-made objects in non-urban scenes (Lu & Aggarwal, 1992). In this section, we discuss
the parallel implementation of the perceptual grouping steps as outlined in (Lu & Aggarwal,
1992), with specific emphasis on the line grouping process. The following section presents
the sequential line grouping process as described in (Boldt et al., 1989), (Lu & Aggarwal,
1992), while the section following it presents its parallel implementation.
5.3.1 Sequential Line grouping algorithm
The input to the line grouping process is a set of fragmented line segments which are
extracted using existing edge detection, edge linking and linear approximation
techniques. The output is a set of straight lines which represent linear structures at a higher
level of granularity, as shown in Figure 5.4. There are several existing techniques that
could be used for extracting the initial line fragments in an image. We use the techniques
described in the Scerpo vision system (Lowe, 1985) to perform the edge detection and
linear approximation of the edge contours by piecewise linear segments. These operations
constitute a prerequisite step to the line grouping process. We describe these operations
briefly for the sake of completeness.
Figure 5.4: Line Grouping
We use two algorithms based on the Laplacian of Gaussian and the Sobel edge operator
to select the initial edge locations as described in (Lowe, 1985). We convolve the image
with a Laplacian of Gaussian operator and assign to each pixel in the convolved image
a gray value proportional to the absolute value of the result of the convolution. We then
apply a Sobel gradient operator to the convolved image and select as edge locations only
those zero crossing pixels that are above a given threshold in the Sobel gradient image. We
then perform edge thinning on the resultant image and link the edge points on the basis
of connectivity to form the edge contours. We use a simple recursive endpoint subdivision
method to approximate the edge contours by piecewise line segments as in (Lowe, 1985).
In this method, a line segment joining the endpoints of an edge contour is recursively
subdivided at the point of maximum deviation. This subdivision continues and eventually
returns a set consisting of one or more line segments such that the maximum deviation
of any point on the edge contour from its corresponding line segment is less than some
threshold value.
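The recursive endpoint subdivision method can be sketched as follows. This Python fragment is an illustrative reconstruction (not the thesis code); the contour representation and tolerance value are assumptions made for the example.

```python
import math

# Sketch of recursive endpoint subdivision: approximate an edge contour
# by line segments, splitting at the point of maximum deviation until
# every contour point lies within `tol` of its segment.

def point_line_distance(p, a, b):
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length

def subdivide(contour, tol=1.0):
    if len(contour) < 3:
        return [(contour[0], contour[-1])]
    a, b = contour[0], contour[-1]
    dists = [point_line_distance(p, a, b) for p in contour[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[worst - 1] <= tol:
        return [(a, b)]                       # one segment is close enough
    return (subdivide(contour[:worst + 1], tol) +
            subdivide(contour[worst:], tol))

# An L-shaped contour collapses to two segments meeting at the corner:
contour = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
print(subdivide(contour))   # → [((0, 0), (3, 0)), ((3, 0), (3, 2))]
```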
The line segments extracted using the techniques described above are often fragmented
and do not reflect the linear structures in the image well. A post-processing method based
on the principles of perceptual grouping is needed to obtain the required linear structures.
The line grouping process performs a repeated grouping of lines into longer lines using the
principles, or relational constraints, of perceptual grouping. We use three basic relational
constraints of perceptual grouping, namely, proximity, collinearity, and continuation, to
implement the line grouping algorithm. The details of other, finer constraints are given
in (Boldt et al., 1989). Consider an arbitrary ungrouped line in the image. We call such
a line a base line. A set of previously ungrouped lines are grouped with the base line if
they satisfy the following relational constraints:
• Proximity: The end points of the lines should fall in the neighborhood of the base
line. The size and shape of the neighborhood is controlled by the corresponding
parameters. Figure 5.5 (a) shows a circular neighborhood drawn at the end points
of the base line.

• Collinearity: The lines should be approximately collinear to the base line. The
difference in the orientation of the base line and any other line in its proximity
should be less than a threshold (Figure 5.5 (b)).
• Continuation or Overlap: The lines within the proximity of the base line must not
overlap too much. The distance between the point Q1 of the base line and the
projection of point P2 on l1 must be smaller than a threshold (Figure 5.5 (c)),
where l2 is any line within the proximity of the base line l1.
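The three constraints can be sketched for a candidate line l2 against a base line l1 as follows. This Python fragment is illustrative: the threshold values are invented for the example, and the continuation test uses the closest endpoint gap as a simplified stand-in for the projection-based overlap test described above.

```python
import math

# Sketch of the three relational constraints. Lines are endpoint pairs;
# radius, max_angle and max_gap are illustrative parameters.

def orientation(line):
    (x1, y1), (x2, y2) = line
    return math.atan2(y2 - y1, x2 - x1) % math.pi   # undirected angle

def groupable(l1, l2, radius=5.0, max_angle=0.1, max_gap=3.0):
    # proximity: some endpoint of l2 near an endpoint of l1
    near = any(math.dist(p, q) <= radius for p in l1 for q in l2)
    # collinearity: orientations agree modulo pi
    diff = abs(orientation(l1) - orientation(l2))
    collinear = min(diff, math.pi - diff) <= max_angle
    # continuation (simplified): small gap between the closest endpoints
    gap = min(math.dist(p, q) for p in l1 for q in l2)
    return near and collinear and gap <= max_gap

base = ((0, 0), (10, 0))
print(groupable(base, ((12, 0), (20, 0))))   # collinear, 2 px gap → True
print(groupable(base, ((12, 8), (20, 8))))   # parallel but far → False
```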
[Figure: diagrams of the proximity, collinearity and continuation constraints for a base
line l1 and a candidate line l2 (with endpoints P2, Q2).]
Figure 5.5: Relational constraints in the line grouping algorithm a) proximity b)
collinearity and c) continuation
The line grouping algorithm searches the neighborhoods of the end points of each base
line in order to find all lines within its proximity. Each line within the proximity of the
base line needs to satisfy the other two conditions in order to be considered for grouping with
the base line. We call the set of lines L that satisfy the conditions stated above with
respect to the base line l1, with l1 ∈ L, a token group.
After finding a token group L with respect to the base line l1, a representative line
l of L is computed. Line l passes through the point that is the geometric center of the line
segments in L (Lu & Aggarwal, 1992). The orientation of l is the length-weighted average
of the orientations of the lines in L. The endpoints of line l are determined by orthogonally
projecting the line segments in L onto l. The two furthest apart projection points are
the end points of l. The line l replaces the lines in L (see Figure 5.4). The line grouping
process continues until no more merging is possible. It always terminates after a finite
number of iterations as there are only a finite number of lines in the image and their number
declines in each iteration.
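The representative-line computation can be sketched as follows. This Python fragment is an illustrative reconstruction: it weights the centre by segment length and averages orientations directly, which is only reasonable when the group's orientations are already close (as the collinearity constraint guarantees); the exact weighting in the thesis implementation may differ.

```python
import math

# Sketch: representative line l of a token group — centre at the
# length-weighted centre of the segments, orientation the length-weighted
# average orientation, endpoints the two furthest-apart projections of
# the segment endpoints onto l.

def representative_line(segments):
    lengths = [math.dist(a, b) for a, b in segments]
    total = sum(lengths)
    cx = sum(w * (a[0] + b[0]) / 2 for w, (a, b) in zip(lengths, segments)) / total
    cy = sum(w * (a[1] + b[1]) / 2 for w, (a, b) in zip(lengths, segments)) / total
    theta = sum(w * (math.atan2(b[1] - a[1], b[0] - a[0]) % math.pi)
                for w, (a, b) in zip(lengths, segments)) / total
    ux, uy = math.cos(theta), math.sin(theta)      # unit direction of l
    # scalar projections of every endpoint onto the line through (cx, cy)
    ts = [(p[0] - cx) * ux + (p[1] - cy) * uy
          for seg in segments for p in seg]
    t0, t1 = min(ts), max(ts)
    return ((cx + t0 * ux, cy + t0 * uy), (cx + t1 * ux, cy + t1 * uy))

# two collinear fragments on the x-axis merge into one long segment
segments = [((0, 0), (4, 0)), ((6, 0), (10, 0))]
print(representative_line(segments))   # → ((0.0, 0.0), (10.0, 0.0))
```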
Note that in order to reduce the search space, a line segment is represented by its two
end points and is indexed by the image pixels corresponding to the end points (Figure 5.6).
Hence, an index array of the size of the original image is constructed prior to the grouping
process. When searching for lines close to a base line, the neighborhood of the end points
of the base line in the index array is searched. Only those lines whose end points fall into
this neighborhood are examined, as shown in Figure 5.6(b).
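The endpoint index can be sketched as follows. This Python fragment uses a dictionary as a stand-in for the image-sized array, and a square neighborhood rather than the circular one of Figure 5.5(a); both are simplifications made for the example.

```python
# Sketch of the endpoint index: map each endpoint pixel to the tokens
# indexed there, so a proximity search touches only nearby pixels
# instead of scanning every token.

def build_index(tokens):
    index = {}
    for tid, (p, q) in enumerate(tokens):
        for (x, y) in (p, q):
            index.setdefault((int(x), int(y)), []).append(tid)
    return index

def tokens_near(index, point, radius):
    px, py = point
    hits = set()
    for x in range(int(px) - radius, int(px) + radius + 1):
        for y in range(int(py) - radius, int(py) + radius + 1):
            hits.update(index.get((x, y), []))
    return hits

tokens = [((0, 0), (4, 0)), ((6, 0), (10, 0)), ((50, 50), (60, 50))]
index = build_index(tokens)
print(tokens_near(index, (4, 0), 3))   # → {0, 1}
```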
[Figure: (a) the circular search area around the end points of a base line B; (b) the index
array, whose pixel entries point to the line tokens (B, p, q, r) indexed there.]
Figure 5.6: Indexing technique used in the line grouping process. a) search area for the
base line b) the index array
5.3.2 Parallel Line grouping algorithm
In the parallel implementation, we assume that the fragmented line segments or line tokens
have been extracted from the input image using the existing methods of edge detection,
edge linking and linear approximation. The input to the parallel perceptual grouping
algorithm is therefore a set of line tokens which are communicated to each processor or
workstation before starting the line grouping process. Each processor has a complete set
of the token data consisting of all input tokens in the image. Each processor constructs
an index array and uses it to partition the token data into a set of token groups. A load
balancing procedure is then employed to assign each processor a finite number of token
groups, in proportion to its corresponding speed factor. The token groups assigned to
the processors are then processed in parallel. Each token group consisting of two or more
line segments is replaced by a representative line to form a new token, using the merging
procedure described in section 5.3.1.
After completion of the merging process, each processor communicates its tokens (those
processed by it) to all other processors. Again, each processor then has a complete set of
the new token data. This process is repeated for a fixed number of iterations or until no
more tokens can be grouped and merged into representative line tokens. The parallel line
grouping algorithm can be summarized as follows:
1. Broadcast token data from each workstation to every other

2. Form token groups at each processor

3. Assign a distinct set of token groups to each processor (for merging)

4. Perform merging of the token groups at each processor

5. Repeat steps 1 to 4 for a fixed number of iterations or until no more merging is
possible
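The iteration structure of these five steps can be sketched serially as follows. This Python fragment is a deliberately simplified stand-in: tokens are one-dimensional intervals (x0, x1) rather than 2D line segments, a "group" is a base token plus any token starting within `gap` of its right end, and the broadcast and per-processor assignment of steps 1 and 3 are replaced by local operations.

```python
# Serial sketch of the iterative grouping loop: form groups, merge each
# group into one token, and repeat until a fixpoint or an iteration
# bound is reached (step 5).

def iterate_grouping(tokens, gap=2, max_iters=10):
    for _ in range(max_iters):                  # step 5: bounded loop
        tokens = sorted(tokens)
        merged, used = [], set()
        for i, (a0, a1) in enumerate(tokens):   # step 2: form groups
            if i in used:
                continue
            group_end = a1
            for j in range(i + 1, len(tokens)): # step 4: merge the group
                b0, b1 = tokens[j]
                if j not in used and 0 <= b0 - group_end <= gap:
                    group_end = max(group_end, b1)
                    used.add(j)
            merged.append((a0, group_end))
        if len(merged) == len(tokens):          # no more merging possible
            return merged
        tokens = merged
    return tokens

print(iterate_grouping([(0, 3), (4, 6), (20, 25), (8, 9)]))
# → [(0, 9), (20, 25)]
```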
Note that the potential parallelism that can be exploited in the line grouping process
lies mainly in the merging process. After partitioning the input token data into token
groups, the replacement of each token group by a representative line is essentially a
local process that can be performed in parallel. A spatial partitioning of the index array
(either horizontally or vertically) in order to parallelize the line grouping process may not
always be feasible. For example, when the line segments in a token group span large portions
of the image space, it is extremely difficult, if not impossible, to partition the index array
spatially in order to realize a parallel implementation. These line segments would spread
themselves across several index array partitions.
The parallel line grouping algorithm presented in this section is similar to an earlier
implementation proposed by Prasanna et al. (Prasanna & Wang, 1996). However, we use
a different load balancing scheme which is based on the distribution of the token groups.
The load balancing method used in (Prasanna & Wang, 1996) is based on the total search
area of the input tokens. This method may not always lead to an even distribution of
load, since many base line tokens may span large portions of the image area and may
not require grouping or merging with other line tokens (their token groups consist of only
the base line). Also, the parallel line grouping algorithm presented in (Prasanna & Wang,
1996) is non-hierarchical, that is, it does not group the line tokens iteratively into higher
levels of granularity.
The structure of the parallel line grouping algorithm is similar to that implemented by
an iterative variant of the Controller-Worker pattern (section 3.5). Hence, the parallel line
grouping algorithm can be parallelized using an iterative variant of the Controller-Worker
pattern. The execution times for the parallel line grouping algorithm implemented on
a varying number of workstations are displayed in Table 5.3. From Table 5.3, it can be
seen that the execution time of the parallel line grouping algorithm does not show any
improvement over its corresponding sequential implementation.
Table 5.3: Execution time (min:sec) for the line grouping process

Image Size  No. of Tokens               Number of Workstations
                          1     2     4     6     8     10    12    14    16
256x256     855           0:01  0:01  0:02  0:02  0:03  0:03  0:04  0:05  0:05
512x512     1454          0:02  0:03  0:04  0:04  0:04  0:05  0:05  0:05  0:06
1Kx1K       7921          0:07  0:17  0:18  0:18  0:21  0:24  0:25  0:26  0:30
Chapter 5. Intermediate level processing 145

The poor performance of the parallel algorithm is mainly due to the inherent sequential nature of the line grouping process and the communication overheads in its parallel implementation. The only parallelism that can be exploited in this algorithm is during the merging operation, where different token groups are replaced by their corresponding representative line tokens concurrently. The time spent in the merging operation is, however, significantly lower than the time spent in communicating the newly formed tokens between different workstations during each iteration. Also, when the number of line tokens to be processed increases, the communication overheads dominate the overall execution time, as can be seen from the entries in the third row of Table 5.3. The communication overheads include the time spent in packing and unpacking the line tokens into data packets, and the time spent in communicating these data packets between different workstations.
Nevertheless, the line grouping algorithm based on the principles of perceptual organization serves as a typical example of an intermediate-level operation in computer vision. It illustrates the problems and difficulties encountered while parallelizing such algorithms, particularly on a cluster of workstations. Such algorithms are more suitable for sequential implementations in workstation environments.
5.4 Summary
In this chapter, we have presented parallel implementations of two intermediate level vision algorithms, namely, the region-based split and merge segmentation algorithm, and the line grouping algorithm based on the principles of perceptual organization. The segmentation algorithm has been parallelized using the Divide-and-Conquer (DC) pattern. The performance of this algorithm does not show a scalable improvement with an increase in the number of workstations used in the parallelization beyond a certain limit. This is due to the corresponding increase in the time needed to merge the segmented subimages in the merging operation. The influence of the communication time on the overall performance of the parallel segmentation algorithm is relatively insignificant. Hence, if communication time is not a dominant factor, the performance of an algorithm parallelized using a DC pattern is influenced mainly by the time complexity of the merging operation.
The line grouping algorithm has been parallelized using an iterative variant of the Controller-Worker pattern. Since this algorithm is inherently sequential in nature, the only parallelism that can be exploited in it is during the replacement of token groups by their corresponding representative line tokens. The time spent in this operation is, however, significantly lower than the time spent in the all-to-all worker communications in the Controller-Worker pattern. The performance of the parallel line grouping algorithm therefore does not show any improvement over its corresponding sequential implementation. This example illustrates the problems and difficulties encountered while parallelizing a typical intermediate level algorithm on coarse-grained machines, such as a cluster of workstations. It also shows the limitations of the use of the Controller-Worker pattern for parallelizing such applications on these machines.
Chapter 6

High level processing
In this chapter we discuss the parallelization of a high level vision algorithm for object recognition using a Farmer-Worker pattern. We also discuss the parallelization of an application in medical imaging using three different design patterns, namely, Temporal Multiplexing, Pipeline, and Composite Pipeline. High level processing in computer vision involves recognition of objects in a scene based on the knowledge acquired by the lower level processes from the image(s) of that scene. The tasks at this level are usually top-down or model-directed, and involve mainly symbolic and/or knowledge processing.

An example of a high level vision task is model-based object recognition. Given a database of object models, model-based object recognition involves finding instances of these objects in a given scene. A model-based vision system extracts scene features, such as edges and points, from an image of a scene, and compares them with a database of object models in order to identify objects within that scene. Most model-based object recognition systems are based on hypothesizing matches between the scene and model features, predicting new matches, and verifying or changing the hypotheses through a search process (Grimson, 1990), (Lowe, 1985). The task becomes more complex if the objects are overlapped or occluded in the scene. A review of the methods used in model-based object recognition in computer vision can be found in (Chin & Dyer, 1986), (Grimson & Huttenlocher, 1991).
In recent years, a new method based on geometric hashing has been proposed for model-based recognition of objects (Lamdan & Wolfson, 1988). This method offers a different and more parallelizable paradigm for model matching. The geometric hashing algorithm used for model matching consists of two phases: preprocessing and recognition. The preprocessing phase uses a collection of object models to build a hash table (described later) data structure. This data structure encodes the model information in a highly redundant and multiple-viewpoint way. In the recognition phase, the properties of the extracted features in the scene image are used to index the hash table data structure for a possible match to candidate object models. Although geometric hashing still requires a search over the features in a scene, it obviates a search over the models and the model features. Hence, the recognition phase is computationally efficient and highly amenable to parallel implementation (Rigoutsos & Hummel, 1992).
In this chapter, we discuss the parallel implementation of the recognition phase in the geometric hashing algorithm used for model matching. Section 6.1 describes the sequential algorithm for performing geometric hashing, while section 6.2 discusses its parallel implementation. We end this chapter with a section that discusses the parallelization of an application in medical imaging, namely, multi-scale shape description of MR brain images in epileptic patients. We use three different approaches (based on the Temporal Multiplexing, Pipeline, and Composite Pipeline patterns) to discuss the parallelization of different modules in this application.
6.1 Sequential geometric hashing algorithm
We assume that the database has M object models and each model is represented by n feature points. The preprocessing and recognition phases of the geometric hashing algorithm work as follows:
6.1.1 Preprocessing Phase
In the preprocessing phase a hash table is created from the M models in the database. For each model, two arbitrary feature points, referred to as the basis set, are used to define an orthogonal coordinate system as shown in Figure 6.1(a). Using this coordinate system, a new set of transformed coordinates of the remaining feature points in the model is computed using simple transformation equations in analytic geometry (Efimov, 1966). These new coordinates are then used to hash or generate entries into a hash table. Each entry in the hash table consists of a (model, basis) pair, representing the model number and the basis set. This process is repeated for all possible basis sets in a given model, and for all models in the database. As a result, the hash bins in the hash table will receive more than one entry. The final hash table contains a list of (model, basis) entries in each bin, as shown in Figure 6.1. The preprocessing procedure is executed off-line and only once. The steps in the preprocessing phase are outlined below:
1. Extract a set of n feature points from a given model m.

2. Select as the basis set a pair of two distinct feature points (i, j).

3. Compute the coordinates of the remaining feature points in the model with respect to the coordinate system defined by this basis set (i, j).

4. Compute the hash bin locations using a hash function h (described later) applied on the transformed coordinates in step 3.

5. Add the (model, basis) pair, i.e. (m, (i, j)), to the list of entries in the corresponding hash bin locations computed in step 4.

6. Repeat steps 2-5 for all possible basis sets in model m.

7. Repeat steps 1-6 for all models m in the database.
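The steps above can be sketched as follows. This is an illustrative sketch only: the basis transform is a simple change of frame and the bin quantization is a toy placeholder (the full implementation uses the rehashing function of equation 6.1, described later); all function names are our own.

```python
# Sketch of the preprocessing phase (steps 1-7 above); not thesis code.
from collections import defaultdict

def basis_coords(p, b0, b1):
    """Coordinates of point p in the frame defined by the basis (b0, b1)."""
    ox, oy = b0
    ax, ay = (b1[0] - ox, b1[1] - oy)            # basis x-axis
    norm2 = ax * ax + ay * ay
    px, py = (p[0] - ox, p[1] - oy)
    u = (px * ax + py * ay) / norm2              # component along the axis
    v = (-px * ay + py * ax) / norm2             # component along its normal
    return u, v

def build_hash_table(models, bins=100):
    """models: {model_id: [feature points]} -> {bin: [(model, (i, j)), ...]}."""
    table = defaultdict(list)
    for m, pts in models.items():
        n = len(pts)
        for i in range(n):                       # step 2: all ordered basis sets
            for j in range(n):
                if i == j:
                    continue
                for k in range(n):               # step 3: remaining points
                    if k in (i, j):
                        continue
                    u, v = basis_coords(pts[k], pts[i], pts[j])
                    # Step 4: toy quantization standing in for hash function h.
                    b = (int(u * 10) % bins, int(v * 10) % bins)
                    table[b].append((m, (i, j)))  # step 5
    return table
```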
Figure 6.1: Preprocessing phase in the geometric hashing algorithm a) Orthogonal coordinate system defined by the basis set b) Adding (model, basis) pairs in the hash table
6.1.2 Recognition Phase
In the recognition phase, an arbitrary pair of feature points from the scene image is chosen as a basis set. The transformed coordinates of the remaining points in the scene are calculated relative to the coordinate system defined by this basis set. Each new coordinate is mapped to the hash table (the same as that in the preprocessing phase), and the entries in the corresponding bin receive a vote. The (model, basis) pairs which receive sufficient votes (i.e. above a certain threshold value) are taken as potential matching candidate models. These are then passed to a verification module, which verifies the presence of the matching models against the scene features.

The main goal of the voting scheme is to reduce the number of candidates used in the verification step. The execution of the recognition phase corresponding to a basis set is termed a probe. The steps in the recognition phase are outlined below:
1. Extract a set S of feature points from the scene.

2. Select as the basis set an arbitrary pair of feature points (i, j) from S.
Figure 6.2: Recognition phase in the geometric hashing algorithm a) Orthogonal coordinate system defined by the basis set b) Accessing and collecting (model, basis) pairs from the hash bins in the hash table
3. Perform a probe using the following sequence of steps:

• Compute the transformed coordinates of the remaining feature points in S with respect to the coordinate system defined by this basis set.

• Compute the hash bin locations in the hash table using a hash function h (described later) applied on the transformed coordinates.

• Form a list of all the (model, basis) pairs stored in the corresponding hash bin locations computed in the previous step.

• Select the (model, basis) pairs (winning models) receiving a count of votes above a given threshold value (if any).

4. Repeat from step 2 until some winning (model, basis) pairs are found or until completion of some specified number of iterations.

5. Verify the potential models found in step 3 (if any) against the set S of features in the scene.

6. Remove the feature points of the matching model(s) from the scene (if applicable) and repeat steps 2-6 until some specified condition holds or for a fixed number of iterations.
The selection of the (model, basis) pairs receiving the maximum votes in step 3 may be performed by histogramming (i.e. counting) these entries using corresponding (model, basis) counters. Alternatively, the (model, basis) pairs may be sorted in order to find the winning models having a count above a given threshold value.
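A single probe with vote histogramming (step 3) can be sketched as follows, assuming the hash table and hash function built in the preprocessing phase are supplied as parameters; the names are illustrative, not from the thesis.

```python
# Sketch of one probe step with histogram-based vote counting.
from collections import Counter

def probe(scene_pts, basis, hash_table, hash_fn, threshold):
    """Transform the scene points, index the table, and histogram the votes."""
    i, j = basis
    votes = Counter()
    for k, p in enumerate(scene_pts):
        if k in (i, j):                           # skip the basis points
            continue
        b = hash_fn(p, scene_pts[i], scene_pts[j])
        for entry in hash_table.get(b, []):
            votes[entry] += 1                     # one vote per (model, basis) entry
    # Winning (model, basis) pairs: vote count at or above the threshold.
    return [e for e, c in votes.items() if c >= threshold]
```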
6.2 Parallel geometric hashing algorithm
In this section, we present a parallel implementation of the recognition phase of the geometric hashing algorithm. The preprocessing phase is a one-time process and can be carried out off-line. The parallel implementation of the recognition phase may be realized by either a) performing the operations of a single probe across several processors, concurrently, or b) performing multiple probes on several processors, concurrently. In the latter case, each probe may in turn be implemented on a set of one or more processors. The suitability of each method depends on the size of the hash table and the amount of memory available on each processor of the underlying parallel architecture.
There have been several prior efforts in parallelizing the recognition phase of the geometric hashing algorithm. Bourdon et al. (Bourdon & Medioni, 1988) and Rigoutsos et al. (Rigoutsos & Hummel, 1992) have proposed parallel implementations of a probe step (method (a)) across several processors on SIMD hypercube-based machines. Their implementations employ a large number of processors in proportion to the size of the model database. Wang et al. (Wang et al., 1994) have proposed several parallel implementations of the recognition phase on the CM-5 and MP-1, using both methods (a) and (b). Each implementation uses a different strategy for distributing the hash table entries.
Each implementation uses either a histogramming or a sorting method to compute the winning (model, basis) pairs receiving the maximum number of votes during the recognition phase. Their implementations are independent of the size of the model database and achieve improved performance over earlier efforts. They have achieved a single probe time of about 200 millisecs on a 32-node CM-5 connection machine, while Rigoutsos et al. (Rigoutsos & Hummel, 1992) have reported a single probe time of 1.52 sec on an 8K processor connection machine. Both have used a synthesized model database containing 1024 models (each model consisting of 16 feature points or dot patterns) and a scene consisting of approximately 200-256 feature points.
Due to limited local memory on individual processors, all the above implementations involve distribution of the hash table entries across several processors and parallelizing the operations of a single probe on these processors. In a workstation environment, the time required for performing a probe step on a single workstation is of the order of 1.2 to 1.3 secs. Parallelizing the operations of a single probe across several workstations as in the previous approaches would not lead to any significant improvement in performance due to high communication costs. Parallelizing a probe step involves computing the local (model, basis) winning pairs and communicating these winning pairs between different processors in order to find the global (model, basis) winning pairs. Since an object requires around 100-250 probes for recognition (Rigoutsos & Hummel, 1992), we perform multiple probes on various workstations, concurrently. The operations of each probe are, however, performed on a single workstation.
We now discuss the actual parallelization of the recognition phase on a cluster of workstations. As in (Rigoutsos & Hummel, 1992), we use a synthesized model database containing 1024 models. Each model consists of 16 randomly generated feature points (dot patterns). These model points are generated using a Gaussian distribution with zero mean and unit standard deviation. Similarly, we construct a scene consisting of 200 scene points using a normal distribution. In order to make the recognition process as efficient as possible, we apply two enhancements as mentioned in (Rigoutsos & Hummel, 1992). Firstly, we apply a rehashing function to the transformed coordinates (step 3 of the recognition phase described in section 6.1.2) so that the expected list lengths of the entries in the hash bins become as even as possible. For each transformed coordinate (u, v), the
following hash function is applied:

    f(u, v) = ( 1 − exp( −(u² + v²) / (2σ²) ), atan2(v, u) )        (6.1)
where σ represents the standard deviation of the model points. The values of the two coordinates in equation 6.1 lie in the intervals (0, 1) and (−π, π), respectively. These coordinate values can be quantized into a two-dimensional hash array as shown in Figure 6.3(a). Each hash location contains a pointer to a list or bin of (m, (i, j)) entries.
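The rehashing and quantization steps can be sketched as follows, assuming a 100 x 100 hash array (consistent with the mirror locations (x, 99 − y) discussed next); function names and the default parameters are our own.

```python
# Sketch of the rehashing function of equation 6.1 and its quantization
# into a bins x bins hash array. Illustrative names, not thesis code.
import math

def rehash(u, v, sigma=1.0):
    """Equation 6.1: map (u, v) into (0, 1) x (-pi, pi]."""
    r = 1.0 - math.exp(-(u * u + v * v) / (2.0 * sigma * sigma))
    return r, math.atan2(v, u)

def quantize(u, v, sigma=1.0, bins=100):
    """Quantize the rehashed coordinates to a hash array location (x, y)."""
    r, theta = rehash(u, v, sigma)
    x = min(int(r * bins), bins - 1)
    y = min(int((theta + math.pi) / (2.0 * math.pi) * bins), bins - 1)
    return x, y
```

Note that negating v leaves the first (radial) coordinate unchanged and negates the angle, which is what produces the mirror location (x, 99 − y) exploited below.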
Secondly, we use certain symmetries in the hash table to reduce the number of entries in the hash lists. If an entry of the form (m, (i, j)) hashes to a location (x, y) in the hash table, then there will be a mirror-entry of the form (m, (j, i)) in location (x, 99 − y) as shown in Figure 6.3(b) - (c). We can therefore store only those (m, (i, j)) entries in the hash table for which i < j. This will reduce the number of entries in the hash table by nearly half, thereby halving the memory required to store the hash table during the recognition phase. For such a hash table, if f(u, v) hashes to location (x, y) in the probe step of the recognition phase, the entries in location (x, y) and the mirror-entries in location (x, 99 − y) are collected in a list, in order to compute the winning (model, basis) pairs in subsequent processing. The mirror-entry of (m, (i, j)) is (m, (j, i)).
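Collecting the entries and mirror-entries for a probe bin from the halved table can be sketched as follows (a hypothetical helper, with the bins stored in a dictionary keyed by location):

```python
# Sketch of the lookup against the halved hash table described above.
def collect_entries(halved_table, x, y):
    """Candidate entries for bin (x, y) when only i < j entries are stored:
    take bin (x, y) directly and reconstruct mirrors from bin (x, 99 - y)."""
    entries = list(halved_table.get((x, y), []))
    for m, (i, j) in halved_table.get((x, 99 - y), []):
        entries.append((m, (j, i)))               # mirror-entry (m, (j, i))
    return entries
```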
Figure 6.3: Hash table data structure a) symmetric indexing in the hash table b) hash entries in the normal hash table c) reduction in hash entries using symmetries
For a database consisting of 1024 models, with each model containing 16 feature points, the size of the normal hash table would be about 20 Mbytes, assuming 6 bytes for each (m, (i, j)) entry. Using the symmetries mentioned above, the size of the hash table is reduced to about 10 Mbytes. The workstations (SUN SPARCstation 5) that we used in implementing the parallel geometric hashing algorithm have 32 Mbytes of local memory. Hence, unlike in the previous approaches, each processor (a workstation in this implementation) can store a separate but complete copy of the hash table during the parallel execution of the recognition phase. Note that although we have used a synthesized model database, the size of this database is nearly the same as the size of a typical model database used in state-of-the-art image understanding techniques employing geometric hashing (Wang et al., 1994).
The algorithm for performing multiple probes in the recognition phase can easily be parallelized using the Farmer-Worker pattern. Each worker workstation in the Farmer-Worker pattern has a copy of the hash table and a set of scene features in its local memory prior to the start of the recognition phase. These are loaded from a file created during the preprocessing phase. The Farmer generates arbitrary basis sets and assigns each to a different worker for processing. Each worker performs the corresponding probe step using its assigned basis set. Each worker communicates the winning (model, basis) pair(s) (if any) to the Farmer controlling the whole process. When no winning (model, basis) pairs are found, each worker is assigned another basis set to perform a new probe. This process continues until winning (model, basis) pairs are found or for a fixed number of iterations. The algorithm is outlined below:
1. Generate basis sets and assign each to a different workstation (worker).

2. Perform the probe step using the assigned basis set on each worker.

3. Select the (model, basis) pairs that receive a count of votes above a certain threshold value (if any). If no such (model, basis) pairs exist, repeat the procedure from step 1 for a certain number of iterations or until some specified condition.

4. Verify the potential models found in step 3 (if any) against the set S of features in the scene.
5. Remove the feature points of the matching model(s) (if applicable) from the scene and repeat steps 1-5 until some specified condition.
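The Farmer-Worker organisation of steps 1-5 can be sketched as follows. This is an illustrative sketch only: a thread pool stands in for the workstation cluster, and the per-worker probe routine of section 6.1.2 is taken as a parameter; it is not the thesis code.

```python
# Sketch of the Farmer loop: one basis set per worker per round, stopping
# at the first round that produces winning (model, basis) pairs.
from concurrent.futures import ThreadPoolExecutor

def farmer(probe_fn, basis_sets, n_workers, max_rounds=10):
    """Return the first batch of winners found, or [] after max_rounds."""
    it = iter(basis_sets)
    with ThreadPoolExecutor(n_workers) as pool:
        for _ in range(max_rounds):
            batch = [b for _, b in zip(range(n_workers), it)]
            if not batch:
                break                             # no basis sets left to try
            winners = [w for ws in pool.map(probe_fn, batch) for w in ws]
            if winners:
                return winners                    # candidates go to verification
    return []
```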
Table 6.1: Execution time in (min:sec) for the geometric hashing algorithm

  No. of Probes                  Number of Workstations
                    1     2     4     6     8    10    12    14    16
        50        1:03  0:36  0:21  0:17  0:13  0:12  0:11  0:10  0:12
       100        2:00  1:05  0:37  0:27  0:22  0:18  0:17  0:15  0:15
       150        3:06  1:37  0:50  0:36  0:29  0:24  0:23  0:21  0:18
       200        4:00  2:05  1:06  0:47  0:37  0:30  0:28  0:24  0:23
       250        5:03  2:36  1:21  0:56  0:44  0:37  0:32  0:28  0:27
Figure 6.4: Performance of the geometric hashing algorithm for object recognition (left: execution time (min) v/s processors; right: speedup v/s processors, showing the ideal speedup and the curves for 50, 150, and 250 probes)
The execution times for the recognition phase of the geometric hashing algorithm parallelized using different numbers of workstations are shown in Table 6.1. A plot of these execution times and the speedups achieved for this phase are shown in Figure 6.4. Since the communication time in the algorithm is negligible compared to the computation time, the observed speedups are quite close to the ideal speedups, as can be seen from Figure 6.4.
For comparison, we compute the time required to perform 200 probes in the previous implementations. Using an 8K processor connection machine, the time required for performing 200 probe steps is approximately 5 mins (based on the 1.52 secs/probe time reported in (Rigoutsos & Hummel, 1992)). The time required to perform the same number of probes on 32 nodes of a CM-5 connection machine would be around 38 secs (assuming the minimum time of 188 millisecs/probe as reported in (Wang et al., 1994)). Using 512 processors (the maximum on the CM-5) and performing multiple probes concurrently (each probe implemented on a partition of 32 processors), the time required to perform 200 probes may be reduced to 2 to 3 secs. However, the latter implementation (assuming such an implementation is possible) may need significant programming effort in order to exploit the hardware of the underlying parallel machine.
From Table 6.1, it can be seen that the time required to perform 200 probes using 16 workstations is only 23 secs. Hence, as this example illustrates, a workstation environment can provide a reasonable or sometimes even better performance compared to the conventional dedicated parallel machines. Note that the earlier implementations of the parallel geometric hashing algorithm are fine-grained. In contrast, the parallel implementation of the geometric hashing algorithm presented in this section is relatively coarse-grained.
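The speedups behind Figure 6.4 can be reproduced directly from the 200-probe row of Table 6.1 (times converted to seconds):

```python
# Speedup and parallel efficiency implied by the 200-probe row of Table 6.1.
times_200 = {1: 240, 2: 125, 4: 66, 6: 47, 8: 37, 10: 30, 12: 28, 14: 24, 16: 23}
speedup = {p: times_200[1] / t for p, t in times_200.items()}
efficiency = {p: speedup[p] / p for p in times_200}   # fraction of the ideal
```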
6.3 Multi-scale active shape description - an application
All our previous discussions in this thesis have so far concentrated upon parallelizing individual vision algorithms using corresponding design patterns. In this section, we discuss the parallelization of complete modules, each comprising a collection of several algorithms, at an application level. We take an application from the field of medical imaging, namely, multi-scale active shape description of MR (magnetic resonance) brain images using active contour models. This application forms a part of the research work carried out in the Department of Computer Science, University College London, UK (Schnabel, 1997). We present a brief overview of this application and discuss the parallelization of some of its modules.
6.3.1 An overview of the shape description process
Detecting and describing brain deformations in certain brain diseases (e.g. epilepsy) is a major task in MR imaging. A conventional method of detecting and describing these deformations is to first segment the cross-sectional images (image slices) of the brain into different regions. These regions correspond to different parts of the brain. After identifying the relevant region(s), a set of shape measurements (e.g. area, perimeter, etc.) is applied to these region(s) in order to detect and describe appropriate shape deformations. This task is performed manually by expert clinicians. The conventional method of finding and describing shape deformations is, however, time consuming and tedious. It usually involves processing large volumes of volumetric brain data. Also, due to a shortage of expert clinicians, it is difficult to diagnose each patient within a given time constraint. As a result, there is great demand to automate the shape description process in order to produce meaningful shape descriptions reliably and quickly.
The research work in (Schnabel, 1997) aims to automate the shape description process, and attempts to present it as a shape description tool for diagnosis. The shape description tool enables both quantitative and qualitative shape analysis at different levels of image resolution (or scale). The shape analysis process uses concepts in multi-scale image processing (Marr, 1982), (Witkin, 1983) to describe shape changes across several scales. These concepts are based on the fact that global shape features of objects in an image can be visualized at coarser levels of image resolution (higher scale), but finer shape features of these objects can be observed only at finer levels of image resolution (lower scale). The shape description tool enables description of the shape characteristics and shape changes across several different scales, starting from either end of the scale. The actual shape extraction from the image slices is performed by using active contour models. Active contour models or snakes are energy-minimizing spline contours used for image segmentation (Kass et al., 1987).
The main steps/modules in the shape description process are: a) preprocessing b) propagation c) shape focusing and d) shape analysis. The preprocessing step involves the application of simple image processing techniques such as thresholding, histogram equalization, and morphological operations (opening), on each image slice in the volumetric brain data.
These operations are applied to enhance the objects of interest in each image slice. The propagation step computes shape contours for each image slice as shown in Figure 6.5(a). The process begins by first computing a shape contour for some intermediate image slice. An intermediate image slice is the one at the center or near the center of the set of all image slices. The shape contour for the intermediate image slice is computed by applying an optimization procedure (Williams & Shah, 1992) on a given initial contour (usually a circle), superimposed on a Gaussian-blurred output of the image slice. The optimized shape contour of the intermediate image slice is then propagated to both its neighboring image slices as shown in Figure 6.5(a).

Using the optimized shape contours as initial contours (superimposed on the Gaussian outputs of the corresponding image slices), the shape contours of both neighboring image slices are computed by applying the same optimization procedure. This process is repeated by propagating the shape contours to both sides of the brain volume (i.e. the two image slice partitions defined by the intermediate image slice) as shown in Figure 6.5(a). At the end of the propagation process, each image slice has an associated initial shape contour which is used as an input in the subsequent shape focusing step.
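The propagation order described above can be sketched as follows, with the snake optimization of (Williams & Shah, 1992) abstracted as a function parameter; the names are illustrative, not from the thesis.

```python
# Sketch of contour propagation from the intermediate slice outwards to
# both halves of the volume. optimize(slice, contour) stands in for the
# Gaussian blurring plus snake optimization of each slice.
def propagate(slices, optimize, initial_contour):
    """Return one initial contour per slice, propagated from the middle."""
    mid = len(slices) // 2
    contours = [None] * len(slices)
    contours[mid] = optimize(slices[mid], initial_contour)
    for s in range(mid + 1, len(slices)):         # towards the last slice
        contours[s] = optimize(slices[s], contours[s - 1])
    for s in range(mid - 1, -1, -1):              # towards the first slice
        contours[s] = optimize(slices[s], contours[s + 1])
    return contours
```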
The shape focusing step operates on each image slice separately. It begins with the construction of a scale-space for each image slice. A scale-space of an image slice consists of a set of images obtained by convolving the image slice with a Gaussian function using increasing values of σ, where σ represents the scale or width of the scaling operator. Using the multi-scale active contour model (Schnabel, 1997), the shape focusing process extracts a shape of interest from the various images in the image scale-space of each image slice. This is performed by propagating the initial shape contour (computed in the propagation step) through the various images in the image scale-space (starting at the lowest resolution or highest scale), and regularizing the active contour model's energy function with respect to the scale. The initial, intermediate, and final shape focusing results form a multi-scale shape stack as shown in Figure 6.5(b). An illustration of the shape focusing process applied on four different images (scales) in the image scale-space of an image slice is shown in Figure 6.6.
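Scale-space construction can be sketched as follows. For brevity the sketch blurs a 1D signal; an image slice would use the same kernel in a separable 2D convolution. All names are our own.

```python
# Sketch of Gaussian scale-space construction (1D for brevity).
import math

def gaussian_kernel(sigma):
    """Normalized Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    k = [math.exp(-(x * x) / (2 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def blur(signal, sigma):
    """Convolve with the Gaussian kernel, replicating the border values."""
    k = gaussian_kernel(sigma)
    r, n = len(k) // 2, len(signal)
    return [sum(k[r + d] * signal[min(max(i + d, 0), n - 1)]
                for d in range(-r, r + 1)) for i in range(n)]

def scale_space(signal, sigmas=(1, 2, 4, 8)):
    """One blurred copy per scale; larger sigma means coarser detail."""
    return [blur(signal, s) for s in sigmas]
```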
In the final shape analysis step, each multi-scale shape stack is analyzed using classic
Figure 6.5: Multi-scale shape description process a) propagation step applied on a set of five image slices b) multi-scale shape stack of an image slice computed in the shape focusing step (Figure (b) adapted from (Schnabel, 1997))
Figure 6.6: Shape focusing performed at different scales in the image scale-space of an image slice using active contour models: (a) σ = 8 (b) σ = 4 (c) σ = 2 (d) σ = 1. Image (a) also contains the initial contour superimposed in black. All images are taken from (Schnabel, 1997).
shape descriptors in order to find the global and local changes in the shape. The shape contour at each layer or scale of the multi-scale shape stack is used to compute the mean and slope measurements for finding shape changes between the layers. These shape
Figure 6.7: Visualization of the stack contours (those displayed in Figure 6.6) stacked using triangulation. Image taken from (Schnabel, 1997).

contours are also stacked and visualized (volume visualization) for qualitative inspection (Figure 6.7). Also, for each scale, the corresponding shape contours across all the multi-scale shape stacks are stacked and visualized for global inspection.
6.4 Parallelization of the shape description process
In this section we discuss parallelization of some of the modules in the multi-scale shape description process applied to the volumetric brain data of epileptic patients. The task is to obtain shape descriptions of the grey matter/cortical interface of the brain in order to enable the study of its structural abnormalities (cortical dysgenesis) related to the symptoms of epilepsy. The number of image slices in the volumetric brain data involved in this application is 124 (for each patient), of which only 96 image slices contain the image of the actual grey matter. Each image slice is of size 256×256 pixels (with slice thickness 5 mm, and pixel size 0.9375 mm²).
We discuss three different approaches for parallelizing the shape description process. Each approach uses a different design pattern, namely, Temporal Multiplexing, Pipeline, or Composite Pipeline. Of the three approaches, we provide experimental results only for the first approach. For the remaining two approaches, we provide estimates of the corresponding parallel execution times. These estimates are reasonable approximations because the components in the Pipeline/Composite Pipeline implementations use existing sequential codes. Using the sequential execution time of each component, it is easy to estimate the overall parallel
execution time in these implementations (ignoring the communication overheads). The communication overheads are relatively negligible and can therefore be safely ignored (they involve communication of 256×256 images and/or simple data structures (e.g. contours, etc.)). Also, in all three approaches, we do not discuss parallelization of the final shape analysis step. The shape analysis step requires data from the multi-scale shape stacks of all image slices, and therefore can only be performed sequentially.
Using the sequential code developed in (Schnabel, 1997), the time required to perform the preprocessing step on each image slice is 3 secs, while the time required to perform the corresponding propagation step is 16 secs. The shape focusing step comprises a sequence of operations such as Gaussian smoothing, computing of image potentials ('Compute Potential'), and optimization. These operations are applied to each image (total 16) in the image scale-space of a given image slice. The Gaussian smoothing operation produces a smoothed image, while the 'Compute Potential' operation extracts certain image features, such as the magnitude and direction of the image gradient, the image curvature, and distance-transformed ridges of the gradient magnitude, from the smoothed image. The 'Compute Potential' operation stores these image features in a data structure called 'Potential', which, along with the smoothed image, is used for computing the shape contour of the image during the optimization operation.
The three shape focusing operations require average processing times of 2 secs, 7 secs, and 7 secs, respectively. Therefore the total time required to perform the shape focusing step on a single image slice is 256 secs (16*16). Hence, for a set of 96 image slices, the total sequential time required to perform the preprocessing, propagation and shape focusing steps is 26400 secs (7 hrs, 20 mins). In order to maintain consistency with earlier discussions, we assume availability of at most 16 workstations for the parallelization process.
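The sequential timing arithmetic above can be checked with a short script (a sketch; the per-operation times are those quoted in the text):

```python
# Sequential time estimate for the shape description process,
# using the per-operation times quoted above (in seconds).
PREPROCESS = 3          # preprocessing per image slice
PROPAGATE = 16          # propagation per image slice
FOCUS_OPS = (2, 7, 7)   # Gaussian smoothing, Compute Potential, optimization
SCALES = 16             # images in the scale-space of one slice
SLICES = 96             # slices containing grey matter

focus_per_slice = sum(FOCUS_OPS) * SCALES                    # 16 * 16 = 256 secs
total = SLICES * (PREPROCESS + PROPAGATE + focus_per_slice)  # 26400 secs
print(focus_per_slice, total)
```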
6.4.1 Parallelization using Temporal Multiplexing pattern
The simplest form of parallelism that can be implemented without major changes to the existing sequential code is realized by using the Temporal Multiplexing pattern. In
this approach, we assume that the preprocessing and propagation steps are performed sequentially. We parallelize only the shape focusing step by processing the image slices on different workstations concurrently. The sequential algorithm to perform the shape focusing process is outlined below (starting at the coarsest level of scale):
1. Generate an image in the image scale-space for the current image slice, using a Gaussian smoothing function.

2. Using the Gaussian image generated in the previous step, compute image potentials (i.e. relevant image features) required for the optimization operation in the next step.

3. Taking the active contour model from the previous image in the image scale-space as an initial shape contour, optimize the shape contour for the current image using the fast local optimization method (Williams & Shah, 1992). Note that for the first image in the image scale-space, the initial shape contour is the one computed in the propagation step.

4. Repeat steps 1-3 for all scales in the image scale-space of the current image slice.

5. Repeat the process from step 1 for all image slices, starting from the coarsest level of scale.
Since the computations of each image slice in the shape focusing step are independent of each other, they can be performed in parallel. Using a set of 16 workstations and a Temporal Multiplexing pattern to process each image slice concurrently, the observed parallel execution time required for processing all the image slices in the shape focusing step is 1656 secs (Table 6.2). The total time required to perform the preprocessing (sequential implementation), propagation (sequential implementation) and shape focusing (parallel implementation) steps is 3480 secs (58 mins). Hence, concurrent processing of the image slices in the shape focusing step leads to a significant reduction in overall application time, although we parallelized only part of the application.
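In code, the Temporal Multiplexing pattern amounts to farming whole image slices out to a pool of workers; a minimal sketch (the `focus_slice` function is a hypothetical stand-in for the sequential shape focusing code, and the thread pool stands in for the 16 workstations):

```python
from concurrent.futures import ThreadPoolExecutor

def focus_slice(slice_id):
    # Hypothetical stand-in for the shape focusing step: in the real
    # application this would run Gaussian smoothing, Compute Potential
    # and optimization over all 16 scales of one image slice.
    return slice_id, "shape-stack-%d" % slice_id

# Each of the 96 image slices is an independent task, so the pool can
# process them concurrently, as in the Temporal Multiplexing pattern.
with ThreadPoolExecutor(max_workers=16) as pool:
    stacks = dict(pool.map(focus_slice, range(96)))
print(len(stacks))  # 96 multi-scale shape stacks
```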
6.4.2 Parallelization using Pipeline pattern
Although the Temporal Multiplexing pattern may also be used for parallelizing the preprocessing step, the propagation step does not enable concurrent processing of image slices for computing the initial shape contours. In the propagation step, the output shape contour of an image slice serves as an input for the computation of the final shape contour of either one or both of its neighboring image slices (Figure 6.5(a)). In such situations, a Pipeline pattern may be used to exploit potential parallelism in an application. One possible implementation of the shape description process using a Pipeline pattern is shown in Figure 6.8. We assume that each component of the Pipeline pattern is implemented on a separate workstation.
[Figure 6.8 diagram: Image Slices → Preprocessing → Propagation → shape focusing components (Gaussian Smoothing → Compute Potential → Optimization, repeated 16 times for each image slice) → Shape Analysis.]
Figure 6.8: Parallelization of the shape description process using a Pipeline pattern. The integer values denote sequential time (in seconds) required for executing corresponding components of the Pipeline pattern.
The processing in the Pipeline pattern begins by passing the intermediate image slice through the preprocessing component, followed by the adjacent image slices in either of the two image slice partitions shown in Figure 6.5(a). The preprocessing component processes a given image slice (called the current image slice) and passes it to the propagation component. The propagation component optimizes the shape contour of the current image slice. It stores this shape contour for use as an input during processing of the subsequent image slice. The propagation component passes the optimized shape contour and the current image slice to the 'Gaussian Smoothing' component in the shape focusing step.

The shape focusing step computes a multi-scale shape stack for the current image slice as follows. The 'Gaussian Smoothing' component of the Pipeline pattern generates
Gaussian-blurred images of the current image slice, using decreasing values of sigma (total 16 sigma values). These Gaussian-blurred images are then sequentially passed from the 'Compute Potential' component to the 'Optimization' component. The 'Optimization' component optimizes the shape contour of the current Gaussian-blurred image, and stores it for use as an input for computing the shape contour of the subsequent Gaussian-blurred image. These operations in the shape focusing step are repeated for 16 different sigma values. After completion of the shape focusing step on the current image slice, the resulting multi-scale shape stack of the current image slice is passed to the shape analysis step. The shape analysis step can be performed separately and is therefore enclosed in a dotted box.
A single Pipeline pattern may be used to process both image slice partitions (defined by the intermediate image slice) one after the other. Alternatively, two Pipeline patterns (Multiple Pipelines) may be used for processing each partition concurrently. We use the second approach since it reduces the overall execution time by almost half. Assuming the 48th image slice as the intermediate image slice, we divide the set of 96 image slices into two partitions, containing 48 and 49 image slices, respectively (both partitions contain the intermediate image slice for propagating the initial shape contour). We estimate the time required to process image slices in the larger of the two image slice partitions (i.e. the one containing 49 image slices). This estimate also represents the total time required for processing all 96 image slices, since the smaller image slice partition can be processed concurrently along with the larger one.
The preprocessing and propagation component operations can be overlapped with the operations in the shape focusing step (except for the first image slice). We therefore concentrate on the shape focusing step. Assuming overlap of computations in the three different operations of the shape focusing step, the time required to perform the shape focusing step on 49 image slices is 5497 secs (2 (latency) + 7 (latency) + (7*16)*49). The latency terms in the expression represent the execution times required for performing the corresponding operations (i.e. Gaussian Smoothing and Compute Potential) for the first Gaussian-blurred image of the first image slice. The term '(7*16)' denotes the time required for performing the 'Optimization' operation on all Gaussian-blurred images of the current image slice. This also represents the time required for performing the shape focusing step on the current image slice (except for the first image slice), since other operations in
the shape focusing step are executed concurrently. Hence, the total time required for performing the preprocessing, propagation, and shape focusing steps is 5516 secs (1 hr, 31 mins, 56 secs). Note that the times for the preprocessing and propagation steps (as shown in Table 6.2) in the Pipeline implementation represent the execution times required to perform these steps only on the first image slice.
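The estimate above follows the usual pipeline rule (fill latency plus the slowest stage times the number of items); the arithmetic can be sketched as:

```python
# Pipeline estimate for the larger partition: once the pipe is full,
# throughput is set by the slowest stage ('Optimization').
SMOOTH, POTENTIAL, OPTIMIZE = 2, 7, 7  # secs per Gaussian-blurred image
SCALES, SLICES = 16, 49                # scales per slice; larger partition

fill_latency = SMOOTH + POTENTIAL                    # first image reaching 'Optimization'
focusing = fill_latency + OPTIMIZE * SCALES * SLICES # 5497 secs
total = 3 + 16 + focusing                            # + preprocessing/propagation of first slice
print(focusing, total)
```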
The total execution time of the application parallelized using two simple Pipeline patterns is considerably higher than the execution time in the previous parallel implementation. This drop in performance is due to the time-complexity of the shape focusing step and the inability to use additional workstations in the parallelization process. As there are only five components in a Pipeline, the two Pipeline patterns together can use only 10 workstations. The parallel implementation using a Temporal Multiplexing pattern can utilize all 16 workstations. Hence, although the percentage of the application code parallelized using a Pipeline pattern is higher than in the previous approach, the inability to scale the number of workstations used in parallelization means there is no improvement in the overall performance of the application over the earlier approach.
6.4.3 Parallelization using Composite Pipeline pattern
The limitations in both the Temporal Multiplexing and Pipeline patterns can be resolved by using a Composite Pipeline pattern. The main bottleneck in the simple Pipeline pattern is the shape focusing step, which requires a parallel execution time of 112 secs (7*16), or approximately 2 mins, for computing a multi-scale shape stack for each image slice (assuming overlapping of computations of individual operations in the shape focusing step). The performance or throughput of a simple Pipeline pattern depends on the speed of its slowest component. Hence, by using a Temporal Multiplexing pattern at the shape focusing step, significant performance gains can be achieved in the simple Pipeline pattern. The resulting pattern constitutes a Composite Pipeline pattern, as shown in Figure 6.9.

In the Composite Pipeline pattern, we implement the preprocessing and propagation steps on a single workstation. The remaining 15 workstations can be used for parallelizing the shape focusing step using a Temporal Multiplexing pattern.

[Figure 6.9 diagram: Image Slices → Preprocessing and Propagation → Shape Focusing (TM pattern, with multiple Shape Focusing workers) → Shape Analysis.]

Figure 6.9: Parallelization of the multi-scale shape description process using a Composite Pipeline pattern.

The processing of each image slice in the Composite Pipeline pattern begins with the execution of the preprocessing and propagation steps. Each image slice that passes through the first stage (preprocessing and propagation) can immediately use a free workstation to perform the shape focusing step. This is because the time required to process each image slice in the first stage is 19 secs. The shape focusing step requires 256 secs (sequential time) to process each image slice. As there are 15 workstations in the second stage of the Composite Pipeline pattern, the average time required to perform the shape focusing step on each image slice is approximately 17 secs (256/15), which is lower than the time spent in the first stage. Therefore, any image slice that passes through the first stage can use some free workstation that has completed processing its previous image slice (if applicable).
Also, with the exception of the last image slice, the operations of the shape focusing step can be completely overlapped with the operations of the preprocessing and propagation steps. The shape focusing time shown in Table 6.2 for the Composite Pipeline implementation therefore represents the time required to process only the last image slice. The time required for the preprocessing and propagation steps in this implementation is the same as that in the sequential version. Hence, the total execution time of the application parallelized using the Composite Pipeline pattern is 2080 secs (34 mins, 40 secs), which represents a significant improvement over the earlier approaches.
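The Composite Pipeline estimate can likewise be sketched: the first stage is the steady-state bottleneck, and only the last slice's shape focusing is left unhidden:

```python
# Composite Pipeline estimate: one workstation runs preprocessing and
# propagation for all slices; 15 workers overlap the shape focusing step.
SLICES = 96
FIRST_STAGE = 3 + 16    # preprocessing + propagation per slice (19 secs)
FOCUS = 256             # sequential shape focusing time per slice
WORKERS = 15

# The 15 workers keep up with the first stage (256/15 ≈ 17 < 19 secs),
# so only the last slice's focusing time is not overlapped.
assert FOCUS / WORKERS < FIRST_STAGE
total = SLICES * FIRST_STAGE + FOCUS
print(total)  # 2080 secs (34 mins, 40 secs)
```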
The shape description example illustrates that, using simple design patterns and most of the existing sequential code, the workstation environment can offer significant benefits
Table 6.2: Execution times (in seconds) for different implementations and individual steps of the shape description process

Implementation          Preprocessing   Propagation   Shape Focusing   Total Time
Sequential              288             1536          24576            26400 (7 hrs, 20 mins)
Temporal Multiplexing   288             1536          1656             3480 (58 mins)
Multiple Pipelines      3               16            5497             5516 (1 hr, 31 mins, 56 secs)
Composite Pipeline      288             1536          256              2080 (34 mins, 40 secs)
of parallelizing many vision applications. Although workstation clusters may or may not be used in the final system implementation, they can provide significant support for developing and prototyping applications requiring a large amount of computing time, in many research and other organizational setups which do not have dedicated parallel computing facilities.
6.5 Summary
In this chapter, we discussed the parallel implementation of the recognition phase of the geometric hashing algorithm used for object recognition. We also discussed parallelization of the multi-scale active shape description process using three different patterns, namely, Temporal Multiplexing, Pipeline, and Composite Pipeline. The recognition phase of the geometric hashing algorithm performs several probe steps for identifying an object in a scene image. Each probe step (associated with a basis set) comprises a sequence of operations for finding potential models that match the scene features. We have developed a coarse-grained parallel algorithm for the recognition phase. This algorithm performs multiple probes on different workstations concurrently. The operations of each probe are, however, performed on a single workstation. The performance of this parallel algorithm, parallelized using the Farmer-Worker pattern, has shown encouraging results. The performance results are sometimes even better than those in earlier implementations performed
on dedicated parallel machines.
The parallelization of the multi-scale active shape description process for MR brain images of epileptic patients has also shown promising results. The sequential execution time required to process 96 image slices is 7 hrs, 20 mins. This includes the time required for performing the preprocessing, propagation, and shape focusing steps in the shape description process. The corresponding observed/estimated parallel execution times using the Temporal Multiplexing, Pipeline (Multiple Pipelines), and Composite Pipeline patterns are 58 mins; 1 hr, 31 mins, 56 secs; and 34 mins, 40 secs, respectively. Of the three patterns, Temporal Multiplexing is the simplest to implement. However, not all modules can be parallelized using this pattern alone. The Pipeline pattern has limited scalability with respect to an increase in the number of workstations used in parallelization. Using Multiple Pipelines solves this problem partially but not completely. The Composite Pipeline pattern resolves the limitations in both the Temporal Multiplexing and Pipeline patterns, and therefore achieves better performance results in comparison with the other two patterns.
The examples in this chapter illustrate that, using simple design patterns and most of the existing sequential code, the workstation environment can offer significant benefits for parallelizing many high level vision algorithms and/or applications. They can provide significant support for developing and prototyping applications requiring a large amount of computing time, in many research and other organizational setups which do not have dedicated parallel computing facilities.
Chapter 7
Conclusion
7.1 Aims and Motivation
The research work in this thesis is aimed at presenting and evaluating a set of design patterns intended to support the parallelization of vision applications on coarse-grained parallel machines, such as a cluster of workstations. Workstation environments have recently proved to be effective and economical platforms for high performance computing compared to conventional parallel machines. They offer several advantages for parallelizing and executing large applications on a relatively low-priced and readily available pool of machines. However, developing parallel applications on such machines involves complex decisions such as dividing the applications into several processes, distribution of these processes over various processors, scheduling of processor time between competing processes, and synchronization of the communication between different processes.

Developing parallel programs to control these decisions usually involves writing explicit program code for process scheduling, process communication, and sometimes even computation in a single routine. This style of parallel code development increases program complexity, and reduces program reliability and code reusability. Writing explicit parallel code for parallelizing various applications on a cluster of workstations has some additional problems. The available machines and their capabilities can vary dynamically during
program execution or from one execution to another. This can sometimes lead to a significant reduction in the overall performance of an application. Also, most developers do not wish to spend time on low level programming details in order to gain the advantages of potential parallelism in an application. About 69% of parallel programmers (Pancake, 1996) modify or use existing blocks of code to compose new programs. Moreover, the modification or partial reuse of existing code or program design is often restricted to individual developers. There is very little sharing of design knowledge among developers.
The parallel programs used for implementing the majority of vision tasks utilize a finite set of recurring algorithmic structures or parallel programming models. Our research has aimed at capturing and articulating the design information in these algorithmic structures in the form of design patterns. We have specified various aspects of the parallel behavior of each design pattern (e.g. structure, process placement, communication patterns, etc.) in its definition, or separately as issues to be addressed explicitly during its implementation. Design patterns decouple the code for implementing low level parallel programming details (i.e. process scheduling, communication, etc.) from the code for managing the actual computation. Such decoupling ensures program reliability and code reusability. Design patterns capture design information in a form that makes them usable in different situations and in future work. The design patterns presented in this thesis would enable researchers and developers to implement many interactive and batch applications in computer vision on workstation clusters.
A cluster of workstations is characterized by high communication costs and a variation in the speed factors of the individual machines in the network. A key factor that minimizes the effect of high communication costs on performance is the 'granularity' (section 1.1) of a parallel algorithm, which describes the amount of work associated with each process/task relative to the communication. A cluster of workstations is inherently coarse-grained. We have formulated the design patterns so that they implement coarse-grained parallelism. Also, due to the variation in speed factors of individual machines, an application parallelized on such machines needs to include proper load balancing strategies in order to obtain maximum performance gains. The design patterns presented in this thesis attempt to distribute the work load according to the speed factors of the individual machines in the network. This load balancing is performed either statically (i.e. before the start of the computation), or
dynamically (during the computation).
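The static variant described above can be sketched as a proportional split of work items by machine speed factor (an illustrative sketch; the speed factors and the rounding scheme are assumptions, not the thesis's exact strategy):

```python
def static_partition(n_items, speeds):
    """Split n_items among workers in proportion to their speed factors."""
    total_speed = sum(speeds)
    shares = [n_items * s // total_speed for s in speeds]
    # Hand out any remainder left by rounding down to the fastest workers.
    for i in sorted(range(len(speeds)), key=lambda k: -speeds[k]):
        if sum(shares) == n_items:
            break
        shares[i] += 1
    return shares

# e.g. 96 image slices over four machines with speed factors 1, 1, 2, 4:
print(static_partition(96, [1, 1, 2, 4]))  # [12, 12, 24, 48]
```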
We began our work by analyzing the computation and communication characteristics of vision tasks. We identified various forms of parallelism in vision tasks and formulated design patterns to implement these tasks. Each design pattern captures common designs used by developers to parallelize their tasks. We presented a catalogue of design patterns to implement various forms of parallelism in vision tasks on a cluster of workstations.
Our next goal in this thesis has been to evaluate the use of these design patterns for parallelizing vision tasks on a cluster of workstations. We have implemented representative vision algorithms in low, intermediate and high level vision processing, and presented the experimental results of the corresponding parallel implementations. The results of these implementations have helped us to critically assess the use of design patterns for achieving performance gains in various algorithms. They have also enabled us to evaluate the viability of using workstation clusters for implementing parallel vision applications.
7.2 Research Review
The literature on parallelization of vision algorithms/applications is vast, but there have been no previous efforts to abstract and document the design information from their corresponding parallel implementations. In chapter 3, we have attempted to capture and document this design information in the form of design patterns so that they can be used for parallelizing many vision algorithms/applications on coarse-grained parallel machines, such as a cluster of workstations. A catalogue of key design patterns for parallel vision applications would give standard names and definitions to the techniques used in the parallelization of these applications. Each pattern has been described in a uniform way using a template which provides a description of how each pattern works, where it should be applied, and what the trade-offs in its use are.

The design patterns presented in chapter 3 include Farmer-Worker, Master-Worker, Controller-Worker, Divide-and-Conquer, Temporal Multiplexing, Pipeline, and Composite Pipeline. The Farmer-Worker pattern is used for implementing data parallel algorithms which require no communication during computation. Both the Master-Worker and Controller-Worker patterns are used for parallelizing problems exhibiting data parallelism, but which require communication of intermediate results during processing. The Divide-and-Conquer pattern is used for parallelizing algorithms that use a recursive strategy to split a problem into smaller subproblems and merge the solutions to these subproblems into a final solution. The Temporal Multiplexing pattern is used for processing several data sets or image frames on multiple processors. Finally, the Pipeline and Composite Pipeline patterns are used for parallelizing applications that can be divided into a sequence (pipeline) of several independent subproblems which are executed in a determined order. In the Composite Pipeline pattern, each subproblem may be further parallelized using other relevant design patterns.
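As a minimal illustration of the simplest of these patterns, a Farmer-Worker computation (data parallelism with no inter-worker communication) can be sketched as follows; the squaring task and helper names are illustrative, not taken from the thesis's implementations:

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    # Purely local, data-parallel computation: no communication with
    # other workers is needed during processing.
    return [x * x for x in chunk]

def farmer(data, n_workers=4):
    # The farmer splits the data into contiguous chunks, hands one
    # chunk to each worker, and gathers the partial results in order.
    size = -(-len(data) // n_workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(worker, chunks)
    return [y for part in parts for y in part]

print(farmer([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 4, 9, 16, 25, 36, 49, 64]
```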
After presenting a catalogue of design patterns, our next task in this thesis has been to evaluate the use of these patterns for parallelizing vision algorithms/applications on a cluster of workstations. We have implemented various representative algorithms in low, intermediate and high level vision processing, and presented the experimental results. In chapter 4, we presented parallel implementations of some representative low level vision algorithms. Low level algorithms parallelized using the Controller-Worker pattern (e.g. histogram equalization and 2D-FFT) do not result in any significant speedups due to the time complexity of all-to-all worker communications in this pattern. But other low level algorithms parallelized using the Farmer-Worker pattern (e.g. convolution and rank filtering) and the Master-Worker pattern (e.g. 'iterative' image sharpening and image restoration) have shown encouraging results. However, applications parallelized using the Master-Worker pattern on enterprise clusters (section 2.5.3) may result in dynamic load imbalances and subsequently a reduction in the overall performance of the application.
In chapter 5, we presented parallel implementations of two intermediate level vision algorithms, namely, the region-based split and merge segmentation algorithm, and the line grouping algorithm based on the principles of perceptual organization. The segmentation algorithm parallelized using the Divide-and-Conquer (DC) pattern does not exhibit performance scalability, owing to the increase in the corresponding time required for merging the segmented subimages. If communication time is not a dominant factor, the performance of an algorithm parallelized using a DC pattern is in fact influenced mainly by the time complexity of the merging operation. The line grouping algorithm has been parallelized using an 'iterative' variant of the Controller-Worker pattern. The performance of the parallel line grouping algorithm, however, does not show any improvement over its corresponding sequential implementation. The time spent in actual computation is significantly lower than the time spent in all-to-all worker communications in the Controller-Worker pattern. In fact, it is very difficult to achieve any significant performance gains using the Controller-Worker pattern, especially when it involves frequent all-to-all worker communications.
In chapter 6, we discussed the parallel implementation of the recognition phase of the geometric hashing algorithm used for object recognition. The recognition phase performs several probe steps for identifying an object from a scene image. Each probe step (associated with a basis set) comprises a sequence of operations for finding potential models that match the scene features. We developed a coarse-grained parallel algorithm for the recognition phase by performing multiple probes on different workstations concurrently. The operations of each probe are, however, performed on a single workstation. The parallel implementation of the recognition phase (using the Farmer-Worker pattern) has in certain cases achieved better results than earlier implementations performed on dedicated parallel machines.
We also discussed the parallelization of the multi-scale active shape description process
using three different patterns: Temporal Multiplexing, Pipeline, and Composite Pipeline.
All three implementations have shown promising results. Of the three patterns, Temporal
Multiplexing is the simplest to implement since it allows most of the existing sequential
code to be reused in the parallel implementation. However, not all modules of this
application can be parallelized using the Temporal Multiplexing pattern alone. Use of the
Pipeline pattern increases the degree of parallelization, but this pattern has limited
scalability. The Composite Pipeline pattern resolves the limitations of both the Temporal
Multiplexing and Pipeline patterns, and therefore achieves better application performance
than the other two patterns.
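The idea behind the Composite Pipeline can be sketched in a few lines. In this
hypothetical Python sketch, each stage of a two-stage pipeline is itself a small farm of
worker processes, so a slow stage can be widened independently instead of capping the
throughput of the whole pipeline; the stage functions are illustrative stand-ins for the
shape-description modules.

```python
from concurrent.futures import ProcessPoolExecutor

def smooth(x):
    # Stage 1 of a hypothetical shape-description pipeline.
    return x * 0.5

def describe(x):
    # Stage 2; imagine this is the slower, analysis-heavy stage.
    return x + 1.0

def composite_pipeline(frames, width1=2, width2=2):
    # Composite Pipeline: each stage is a farm whose width can be
    # chosen per stage (width1, width2).
    with ProcessPoolExecutor(width1) as s1, ProcessPoolExecutor(width2) as s2:
        stage1 = s1.map(smooth, frames)        # frames stream into stage 1
        return list(s2.map(describe, stage1))  # and on through stage 2
```

For example, `composite_pipeline([2.0, 4.0, 6.0])` returns `[2.0, 3.0, 4.0]`; a plain
Pipeline corresponds to fixing both widths at one worker per stage.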
To summarize, the examples in this thesis have shown that for most low level and high
level vision algorithms the workstation environment offers reasonable, and sometimes
significant, benefits from parallelization. Intermediate level algorithms, however, do
not represent ideal candidates for parallel implementation on workstation clusters due to
their ‘communication-intensive’ nature. Most of the applications parallelized using the
Farmer-Worker, Temporal Multiplexing, and Composite Pipeline patterns have shown
encouraging results. Applications parallelized using the Master-Worker,
Divide-and-Conquer, and Pipeline patterns have shown satisfactory results. Applications
parallelized using the Controller-Worker pattern have, however, not yielded any
significant performance gains. Also, the medical imaging application in chapter 6
illustrates that workstation environments can provide significant support for developing
and prototyping applications requiring large amounts of computing time, in many research
and other organizational setups which do not have dedicated parallel computing
facilities.
7.3 Contributions of the research work
The contributions of this dissertation can be evaluated in terms of: a catalogue of
design patterns for parallel vision systems, coarse-grained parallel algorithms for
representative vision applications, and a critical assessment of the use of design
patterns in implementing these applications on workstation clusters. We summarize these
contributions as follows:
• Catalogue of design patterns: We presented a catalogue of design patterns for parallel
vision systems, describing each pattern in terms of its intent, motivation, structure,
interaction amongst the components, and applicability. This description enables the
selection and use of a design pattern in different situations and in future work.
• Coarse-grained parallel algorithms: We presented coarse-grained parallel algorithms and
implementations for several vision tasks such as convolution, image filtering, image
restoration, region-based segmentation, line grouping, and the geometric hashing
algorithm for object recognition. We also presented different parallel implementations of
the multi-scale active shape description process (an application in medical imaging)
using different design patterns.
• Implementation on a cluster of workstations: Using relevant design patterns, we
performed parallel implementations of the selected representative vision tasks stated
above. The results of these implementations enable a critical assessment of the design
patterns for achieving improvements in application performance. They also enable an
evaluation of the viability of using workstation clusters for implementing parallel
vision applications.
7.4 Comparison with related work
Although the concept of abstracting common parallel programming designs in the form of
design patterns is new, there have been several prior efforts to identify and capture
general parallel programming designs/models (Chandy & Kesselman, 1991), (Kung, 1989) as
software components (e.g. implementation machines (Zimran et al., 1990), templates
(Singh et al., 1991), assets (Schaeffer et al., 1993), and skeletons (Darlington et al.,
1993)). These software components comprise ‘ready-to-use’ software routines for
implementing low level programming details (e.g. process scheduling, communication, etc.)
in the corresponding parallel programming models. Systems based on these components
allow programmers to write their parallel programs in terms of the components; the
systems then automatically insert the necessary code for process scheduling and
communication in order to realize the corresponding parallel implementation.
However, these systems do not choose the type of parallelism to apply; this choice is
left to the developer, who judges and selects the best form of parallelism for a
particular application. Also, most of these systems have limited applicability. For
example, the Enterprise system (Schaeffer et al., 1993) does not support data
parallelism, one of the most important forms of parallelism in computer vision. Most of
these systems do not support complex and/or domain-specific parallel programming models
(e.g. the parallelism represented by the Composite Pipeline pattern in vision).
Our research work of presenting design patterns for parallel vision systems differs from
these approaches. We do not present ‘ready-to-use’ program code that can simply be
inserted as a software routine in a parallel implementation. Instead, we identify and
explicitly document the various parallel programming models commonly occurring in
parallel solutions of problems in a certain domain, such as computer vision. The
‘intent’, ‘motivation’, and ‘applicability’ aspects of the design pattern descriptions
enable the user to select appropriate design pattern(s) for parallelizing a given
application. The other aspects of the descriptions provide guidelines for the actual
implementation of the patterns for a particular problem.
A design methodology for parallelizing complete vision systems has also been presented by
Downton et al. (Downton et al., 1996). Their design method, based on a pipeline of
processor farms (PPF), enables the parallelization of complete vision systems (with
continuous input/output) on MIMD parallel machines. The parallelization process in their
design model is performed in a top-down fashion, where parallel implementations of
individual algorithms are treated as components in the design model. While the design
methodology in (Downton et al., 1996) has been implicit, our work has concentrated on
making it explicit: we have documented the PPF design method in the form of the Composite
Pipeline pattern in this thesis. Also, the design method in (Downton et al., 1996)
discusses parallelization mostly at the application level, whereas our work has attempted
to discuss parallelization at both the algorithmic and application levels in vision.
The main disadvantage of design patterns is that they do not provide a detailed solution.
A pattern provides a generic scheme for solving a class of problems, rather than a
‘ready-to-use’ software module that can be inserted into a program; the user must
implement this scheme according to the requirements of the given problem. A pattern thus
provides guidance for solving problems, but not complete solutions.
7.5 Future work
The research work in this thesis has aimed at presenting a set of design patterns
intended to support the parallelization of vision applications on a cluster of
workstations. Using these design patterns we have also parallelized representative vision
algorithms in order to demonstrate their usefulness in implementing these algorithms on
workstation clusters. The research work, however, raises further questions and opens up
topics in a number of research areas, such as:
• Fault tolerance: The available resources in workstation environments (especially in
enterprise clusters) can change dynamically during the parallel execution of an
application. A workstation may become overloaded, or may be powered off for maintenance
purposes or, in the worst case, may crash. The first two cases may be predicted or known
in advance; the third is unexpected and may result in a significant loss of processing
time. Common methods, such as checkpointing and error detection and recovery, have high
overheads. An alternative is to include fault tolerance mechanisms in each design
pattern. Some such attempts (for workstation environments) have been explored in the
‘processor farm’ (Clematis, 1994) and the ‘supervisor-worker’ (Magee & Cheung, 1991)
models (both models represent the Farmer-Worker form of computation).
Detecting a failure in some worker component of a Farmer-Worker pattern is relatively
easy: the Farmer component can detect (and rectify) such a failure when a worker
component does not respond within a certain time limit. Other strategies, for detecting
failures in either the Farmer component or the process communication, may be devised
similarly. Detecting and rectifying failures in other patterns (e.g. Master-Worker and
Pipeline) is, however, complicated. Each worker component in these patterns sends
messages to and receives messages from other worker components, so a failure in any
worker component can lead to deadlock. Devising mechanisms for handling such situations
is a challenging task.
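The timeout-based detection just described can be sketched as follows. This is a
hypothetical Python sketch in which local threads stand in for remote workers (a real
Farmer would use the cluster's message layer): any task whose worker does not reply
within a deadline is assumed lost and is re-issued.

```python
import queue
import threading
import time

def farm_with_retry(tasks, worker, timeout=0.5):
    # Farmer-side failure detection: wait up to `timeout` seconds
    # for each task's reply; on silence, assume failure and retry.
    results = {}
    for task in tasks:
        while task not in results:
            reply = queue.Queue()
            t = threading.Thread(
                target=lambda: reply.put(worker(task)), daemon=True)
            t.start()
            try:
                results[task] = reply.get(timeout=timeout)
            except queue.Empty:
                continue  # no reply in time: re-issue the task
    return results

calls = {"n": 0}

def flaky_square(x):
    # Illustrative worker whose first invocation "crashes" (hangs).
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(5)
    return x * x
```

With this flaky worker, `farm_with_retry([3, 4], flaky_square, timeout=0.2)` still
completes, yielding `{3: 9, 4: 16}` after one silent failure and one retry.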
• Load balancing: The Farmer-Worker and Temporal Multiplexing patterns have an inherent
load balancing property. Other patterns, however, may suffer from load imbalances during
their execution, especially when implemented on enterprise clusters. There is a need for
mechanisms that minimize the effect of load imbalances in these patterns. Load balancing
schemes may be incorporated in the pattern itself. For example, when a workstation
executing a worker component of the Master-Worker pattern is overloaded with external
processes, the worker component may be transferred to another free workstation.
Overloaded worker components may be detected after every cycle of a fixed number of
iterations, until the completion of the computation. The code for performing load
balancing operations may be included in the pattern implementation, or may be part of a
separate design pattern implementation.
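A deliberately minimal migration policy of the kind suggested above might look like the
following hypothetical sketch: after each cycle of iterations, one unit of work moves
from the slowest workstation to the fastest, based on measured per-worker throughput.

```python
def rebalance(assignment, rate):
    # `assignment` maps worker -> units of work; `rate` maps
    # worker -> throughput measured over the last cycle.
    slow = min(rate, key=rate.get)
    fast = max(rate, key=rate.get)
    # Move one unit from the slowest to the fastest worker.
    if slow != fast and assignment[slow] > 0:
        assignment[slow] -= 1
        assignment[fast] += 1
    return assignment
```

Calling this once per cycle gradually shifts work away from an overloaded machine without
any global coordination; a real scheme would also bound the migration cost.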
• Performance prediction: Designing practical models for predicting the parallel
execution time of an application implemented on an enterprise cluster has been a
challenging research area (Yan et al., 1996). We intend to study the feasibility of
designing such models for the design patterns in parallel computer vision. Each design
pattern may include a performance prediction model, as in skeletons (Darlington et al.,
1993) or implementation machines (Zimran et al., 1990). The complexity of the prediction
model depends on the structure of the underlying design pattern. For example, using the
sequential time of an algorithm, it is relatively easy to predict the approximate
parallel execution time of the Farmer-Worker and Temporal Multiplexing implementations.
Similarly, if the sequential execution time of each component in the Pipeline and
Composite Pipeline patterns is known, it is relatively easy to predict the parallel
execution time of the corresponding application. Predicting performance in the
Master-Worker or Divide-and-Conquer patterns is, however, relatively difficult.
Some important factors to consider when designing performance prediction models for each
design pattern (implemented on a workstation cluster) include the computational
complexity of the problem, the number of workstations used, the relative speed factors of
the individual machines, and the network bandwidth. The complexity of the prediction
models is also influenced by the nature of the vision algorithms: while it is relatively
easy to predict performance for well-structured low level vision algorithms, predicting
performance for intermediate and high level vision algorithms is relatively difficult due
to uncertainties in their computations.
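The "relatively easy" cases above admit first-order estimates of the kind sketched below.
These are illustrative textbook-style formulas under idealised assumptions (perfect load
balance, lumped overhead), not the prediction models the thesis proposes.

```python
def predict_farm_time(t_seq, speeds, overhead=0.0):
    # Farmer-Worker / Temporal Multiplexing on a heterogeneous
    # cluster: work spreads in proportion to relative machine
    # speeds, plus a lumped communication-overhead term.
    return t_seq / sum(speeds) + overhead

def predict_pipeline_time(stage_times, n_items):
    # Pipeline estimate from per-stage sequential times: one
    # fill-through of all stages, then the slowest stage paces
    # every remaining item.
    return sum(stage_times) + (n_items - 1) * max(stage_times)
```

For instance, `predict_farm_time(100.0, [1.0, 1.0, 2.0])` gives `25.0` for a job that
takes 100 time units sequentially, spread over one double-speed and two unit-speed
machines. The uncertain computation times of intermediate and high level algorithms make
`t_seq` and `stage_times` themselves hard to estimate, which is the difficulty noted
above.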
Appendix A
Notation
A.1 Pattern Diagram
We use a variant of the object model to describe the components and their relationships
in a design pattern (Buschmann et al., 1996).
[Figure: pattern diagram of the Master-Worker pattern. A Master component (procedures:
SplitWork, SendSubtasks, CollateResults, SendFinalResults) is connected to Worker (1)
through Worker (p), each with procedures DoCalculation, ExchangeData, and SendResults.]
The components are shown as rectangular boxes, denoting the name of the component and the
procedures associated with it. A line connecting two components denotes an association.
A.2 Object Interaction Charts
We adapt the Object Message Sequence Chart (OMSC) notation given in (Buschmann et al.,
1996) to describe the object interactions among the components of a pattern.
[Figure: Object Message Sequence Chart for the Master-Worker pattern, with components
Client, Master, Worker (1), and Worker (2). The Client's CallToParallelize triggers
SplitWork in the Master, which issues SendSubtask to each worker; the workers loop over
DoCalculation and send back their results, which the Master collates (CollateResults)
before returning them via SendFinalResults.]
The components in a pattern are drawn as rectangular boxes, labeled with their
corresponding names. The activities of the components are denoted by vertical bars
attached to the bottom of each box (activity lines). Messages between the components are
denoted by horizontal arrows. Elapsed time runs from top to bottom; the time axis,
however, is not to scale. An iterative computation is shown by an upward arrow, while a
procedure call within a pattern component is shown by a small downward arrow.
Bibliography
Alexander, C. (1979), The timeless way of building, Oxford University Press, New York, US.
Alnuweiri, H. M. & Prasanna, V. K. (1992), “Parallel architectures and algorithms for image component labeling”, IEEE Transactions on Pattern Analysis and Machine Intelligence 14(10), 1014-1034.
Alonso, R. & Cova, L. L. (1988), Sharing jobs among independently owned processors, in “Proceedings of the 8th International Conference on Distributed Computing Systems”, IEEE Computer Society Press, pp. 282-288.
Amdahl, G. M. (1988), “Limits of expectation”, International Journal of Supercomputer Applications 2(1), 88-94.
Anderson, T. E., Culler, D. E., Patterson, D. A. et al. (1995), “A case for NOW (Networks of Workstations)”, IEEE Micro Feb, 54-64.
Angus, I., Fox, G. C., Kim, J. S. & Walker, D. W. (1989), Solving Problems on Concurrent Processors, Prentice-Hall, Englewood Cliffs, New Jersey, US.
Atallah, M. J., Black, C. L., Marinescu, D. C. et al. (1992), “Models and algorithms for coscheduling compute-intensive tasks on a network of workstations”, Journal of Parallel and Distributed Computing 16, 319-327.
Awcock, G. J. & Thomas, R. (1995), Applied Image Processing, Macmillan, Basingstoke, England.
Ballard, D. H. & Brown, C. M. (1982), Computer Vision, Prentice-Hall, Englewood Cliffs, New Jersey, US.
Beck, K., Coplien, J. O., Crocker, R., Dominick, L. et al. (1996), “Industrial experience with design patterns”, IEEE Proceedings of ICSE-18, pp. 103-113.
Beguelin, A., Dongarra, J., Geist, A., Jiang, W., Manchek, R. & Sunderam, V. S. (1992), PVM 3 User's Guide and Reference Manual, ornl/tm-12187 edition, Oak Ridge National Laboratory, Oak Ridge, Tennessee, US.
Beguelin, A., Dongarra, J., Geist, A., Manchek, R. & Sunderam, V. S. (1991), “Solving computational grand challenges using a network of heterogeneous supercomputers”, Proceedings of Fifth SIAM Conference on Parallel Processing.
Boden, N. J., Cohen, D., Felderman, R. E. et al. (1995), “Myrinet: A gigabit-per-second local area network”, IEEE Micro Feb, 29-35.
Boldt, M., Weiss, R. & Riseman, E. (1989), “Token-based extraction of straight lines”, IEEE Transactions on Systems, Man, and Cybernetics 19(6), 1581-1594.
Bourdon, O. & Medioni, G. (1988), “Object recognition using geometric hashing on the connection machine”, International Conference on Pattern Recognition, pp. 596-600.
Buschmann, F. & Meunier, R. (1995), A System of Patterns, in J. O. Coplien & D. C. Schmidt (eds.), “Pattern Languages of Program Design”, Addison-Wesley, Reading, MA, US, pp. 325-343.
Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P. & Stal, M. (eds.) (1996), Pattern-Oriented Software Architecture - A System of Patterns, Wiley and Sons, Chichester, UK.
Buxton, H. et al. (1986), “A parallel approach to the picture restoration algorithm of Geman and Geman on an SIMD machine”, Image and Vision Computing, pp. 133-142.
Chandy, K. M. & Kesselman, C. (1991), “Parallel Programming in 2001”, IEEE Software Nov, 11-20.
Chaudhary, V. & Aggarwal, J. K. (1990), Parallelism in computer vision: a review, in V. Kumar, P. S. Gopalakrisnan & L. N. Kanal (eds.), “Parallel Algorithms for Machine Intelligence and Vision”, Springer Verlag, pp. 271-309.
Chaudhary, V. & Aggarwal, J. K. (1991), “On the complexity of parallel image component labeling”, International Conference on Parallel Processing III, 183-187.
Cheng, D. Y. (1993), A survey of parallel programming languages and tools, Technical Report RND-93-005, NASA Ames Research Center.
Chin, R. T. & Dyer, C. R. (1986), “Model-based recognition in robot vision”, ACM Computing Surveys 18(1), 67-108.
Choudhary, A. & Thakur, R. (1994), “Connected component labelling on coarse-grain parallel computers - an experimental study”, Journal of Parallel and Distributed Computing 20(1), 79-83.
Choudhary, A. N. & Patel, J. H. (1990), Parallel Architectures and Parallel Algorithms for Integrated Vision Systems, Kluwer Academic Publishers, Boston, USA.
Clark, H. & McMillin, B. (1992), “DAWGS - a distributed compute server utilizing idle workstations”, Journal of Parallel and Distributed Computing 14(2) Feb, 175-186.
Clematis, A. (1994), “Fault tolerant programming for network based parallel computing”, Microprocessing and Microprogramming 40, 765-768.
Coplien, J. O. & Schmidt, D. C. (eds.) (1995), Pattern Languages of Program Design, Addison-Wesley, Reading, MA, US.
Copty, N., Ranka, S., Fox, G. & Shankar, R. V. (1989), “A data parallel algorithm for solving the region growing problem on the connection machine”, Journal of Parallel and Distributed Computing 21(1), 160-168.
Darlington, J., Field, A. J., Harrison, P. G., Kelly, P. H. J. et al. (1993), Parallel programming using skeleton functions, Technical Report DoC 93/6, Imperial College, London, UK.
Dolan, J. & Weiss, R. (1993), “Perceptual grouping of curved lines”, Proceedings of the DARPA Image Understanding Workshop, pp. 1135-1145.
Downton, A., Tregidgo, R. W. S. & Cuhadar, A. (1996), Generalized parallelism for embedded vision applications, in A. Y. H. Zomaya (ed.), “Parallel Computing: Paradigms and Applications”, International Thomson Computer Press, London, UK, pp. 553-577.
Duda, R. O. & Hart, P. E. (1972), “Use of the Hough transformation to detect lines and curves in pictures”, Communications of the ACM, pp. 11-15.
Duff, M. & Levialdi, S. (eds.) (1982), Languages and Architectures for Image Processing, Academic Press, 24/28 Oval Road, London NW1 7DX, UK.
Duncan, R. (1992), “Parallel computer architectures”, Advances in Computers 34, 113-157.
Efimov, N. V. (1966), An elementary course in analytical geometry, Pergamon Press, Oxford.
Flynn, M. J. (1972), “Some computer organizations and their effectiveness”, IEEE Transactions on Computers C-21(9).
Foster, I. T. (1995), Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley, Reading, MA; Wokingham.
Gamma, E., Helm, R., Johnson, R. & Vlissides, J. (1994), Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Reading, MA, US.
Gonzalez, R. C. & Woods, R. E. (1993), Digital Image Processing, Addison-Wesley, Reading, MA, US.
Grimson, W. (1990), Object Recognition by Computer: The Role of Geometric Constraints, MIT Press.
Grimson, W. E. L. & Huttenlocher, D. P. (1991), “On the verification of hypothesized matches in model-based recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence 13(12), 1201-1213.
Hambrusch, S., He, X. & Miller, R. (1994), “Parallel algorithms for gray-scale digitized picture component labeling on a mesh-connected computer”, Journal of Parallel and Distributed Computing 20(1), 56-68.
Hameed, F., Hambrusch, S. E., Khokhar, A. A. & Patel, J. N. (1997), “Contour ranking on coarse grained machines: A case study for low-level vision computations”, Concurrency: Practice and Experience 9(3), 203-221.
Haralick, R. M. & Shapiro, L. G. (1985), “Image segmentation techniques”, Computer Vision, Graphics and Image Processing 29, 100-132.
Hodgson, R. M., Bailey, D. G., Naylor, M. J., Ng, A. L. M. & McNeill, S. J. (1985), “Properties, implementations and applications of rank filters”, Image and Vision Computing 3, 3-14.
Horowitz, S. L. & Pavlidis, T. (1974), “Picture segmentation by a directed split-and-merge procedure”, Proceedings of the 2nd International Joint Conference on Pattern Recognition, pp. 424-433.
Huertas, A., Lin, C. & Nevatia, R. (1993), “Detection of buildings from monocular views of aerial scenes using perceptual grouping and shadows”, Proceedings of the DARPA Image Understanding Workshop, pp. 253-260.
Hussain, Z. (1991), Digital Image Processing, Practical Applications of Parallel Processing Techniques, Ellis Horwood, Chichester, West Sussex, UK.
Irvine, D. S. (1995), “Computer-assisted semen analysis systems - Sperm motility assessment”, Human Reproduction 10(S1), 53-59.
Kadam, S., Roberts, G. & Buxton, B. (1996), Parallelizing vision-related applications on network of workstations using design patterns, Technical Report RN/96/25, Department of Computer Science, University College, London, UK.
Kadam, S., Roberts, G. & Buxton, B. (1997), “Design patterns for parallelizing vision-related applications on network of workstations”, The 11th Annual International Symposium on High Performance Computing Systems, HPCS'97 Jul, 569-583.
Kapoor, S. et al. (1994), “Depth and Image Recovery Using a MRF Model”, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1117-1122.
Kass, M., Witkin, A. & Terzopoulos, D. (1987), “Snakes: Active contour models”, Proceedings of the 1st International Conference of Computer Vision, pp. 259-268.
Kendall, P. & Uhr, L. (eds.) (1982), Multicomputers and Image Processing, Algorithms and Programs, Academic Press, 111 Fifth Avenue, NY 10003, USA.
Kramer, H. P. & Bruckner, J. B. (1975), “Iterations of a non-linear transformation for enhancement of digital images”, Pattern Recognition 7, 53-58.
Kung, H. T. (1989), Computational models of parallel computers, in R. J. Elliot & C. A. R. Hoare (eds.), “Scientific Applications of Multiprocessors”, Prentice Hall.
Lamdan, Y. & Wolfson, H. (1988), “Geometric hashing: a general and efficient model based recognition scheme”, International Conference on Computer Vision, pp. 238-249.
Lee, C. K. & Hamdi, M. (1995), “Parallel image processing applications on a network of workstations”, Parallel Computing 21(1), 137-160.
Lee, J. S. (1983), “Digital image smoothing and sigma filter”, Computer Vision, Graphics and Image Processing 24, 255-269.
Litzkow, M. J., Livny, M. & Mutka, M. W. (1988), Condor - A hunter of idle workstations, in “Proceedings of the 8th International Conference on Distributed Computing Systems”, IEEE Computer Society Press, pp. 104-111.
Lowe, D. G. (1985), Perceptual Organization and Visual Recognition, Kluwer Academic Press, Hingham, MA, US.
Lu, H. Q. & Aggarwal, J. K. (1992), “Applying perceptual organization to the detection of man-made objects in non-urban scenes”, Pattern Recognition 25(8), 835-853.
Magee, J. N. & Cheung, S. C. (1991), “Parallel algorithm design for workstation clusters”, Software-Practice and Experience 21(3) Mar, 235-250.
Mardia, K. V. & Kanji, G. K. (eds.) (1993), Statistics and Images, Vol. 1 of Advances in Applied Statistics Series, Carfax Publishing Company, PO Box 25, Abingdon, Oxfordshire OX14 3UE, UK. A Supplement to Journal of Applied Statistics Volume 20 Nos 5/6 1993.
Marr, D. (1982), Vision: A computational investigation into the human representation and processing of visual information, W. H. Freeman, San Francisco.
Mattson, T. G. (1996), Scientific computation, in A. Y. H. Zomaya (ed.), “Parallel and Distributed Computing Handbook”, McGraw Hill, McGraw Hill series on Computer Engineering, pp. 981-1002.
Mohan, R. & Nevatia, R. (1989), “Using perceptual organization to extract 3-D structures”, IEEE Transactions on Pattern Analysis and Machine Intelligence 11(11), 1121-1139.
Monroe, R. T., Kompanek, A., Melton, R. & Garlan, D. (1997), “Architectural styles, design patterns, and objects”, IEEE Software 14(1), 43-52.
Mutka, M. W. & Livny, M. (1987), Scheduling remote processing capacity in a workstation-processor bank network, in “Proceedings of the 7th International Conference on Distributed Computing Systems”, IEEE Computer Society Press, pp. 2-9.
Nagao, M. & Matsuyama, T. (1979), “Edge preserving smoothing”, Computer Graphics and Image Processing 9, 394-407.
Nakanishi, H. & Sunderam, V. S. (1992), “Superconcurrent simulation of polymer chains on heterogeneous networks”, Proceedings of IEEE Supercomputing Symposium.
Narayan, P., Chen, L. & Davis, L. (1992), “Effective use of SIMD parallelism in low- and intermediate-level vision”, IEEE Computer 25 Feb, 68-73.
Page, I. (ed.) (1988), Parallel Architectures and Computer Vision, Oxford University Press.
Pancake, C. (1996), “What computer scientists and engineers should know about parallelism and performance”, Computer Applications in Engineering Education 4(2), 145-160.
Pitas, I. (1993), Digital Image Processing Algorithms, Prentice Hall, New York, US.
Prasanna Kumar, V. (ed.) (1991), Parallel Architectures and Algorithms for Image Understanding, Academic Press, 1250 Sixth Avenue, San Diego, CA 92101.
Prasanna, V. K. & Wang, C. L. (1996), Parallelism for Image Understanding, in A. Y. H. Zomaya (ed.), “Parallel and Distributed Computing Handbook”, McGraw Hill, McGraw Hill series on Computer Engineering, pp. 1042-1070.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (1992), Numerical Recipes in C, Cambridge University Press, New York, US.
Ranka, S. & Sahni, S. (1990), “Image template matching on MIMD hypercube multicomputers”, Journal of Parallel and Distributed Computing 10, 79-84.
Reynolds, G. & Beveridge, J. R. (1987), “Searching for geometric structure in images of natural scenes”, Proceedings of the DARPA Image Understanding Workshop, pp. 257-271.
Rigoutsos, I. & Hummel, R. (1992), “Massively parallel model matching: geometric hashing on the connection machine”, IEEE Computer, pp. 33-42.
Rosenfeld, A. (1988), “Computer Vision”, Advances in Computers 27, 265-308.
Rosenfeld, A. & Kak, A. C. (1982), Digital Picture Processing, Academic Press, New York, US.
Ruff, B. P. D. (1988), A pipelined architecture for a video rate canny operator used at the initial stage of a stereo image analysis system, in I. Page (ed.), “Parallel Architectures and Computer Vision”, Oxford University Press.
Schaeffer, J., Szafron, D., Lobe, G. & Parsons, I. (1993), “The Enterprise model for developing distributed applications”, IEEE Parallel and Distributed Technology Aug, 85-96.
Schnabel, J. A. (1997), Multi-Scale Active Shape Description in Medical Imaging, PhD thesis, University College London, London, UK.
Siegel, H., Armstrong, J. B. & Watson, D. (1992), “Mapping computer vision related tasks onto reconfigurable parallel processing systems”, IEEE Computer 25 Feb, 54-63.
Silverman, R. D. & Stuart, S. J. (1989), “A distributed batching system for parallel processing”, Software-Practice and Experience 19(12) Dec, 1163-1174.
Singh, A., Schaeffer, J. & Green, M. (1991), “A template-based approach to the generation of distributed applications using a network of workstations”, IEEE Transactions on Parallel and Distributed Systems 2(1) Jan, 52-67.
Sonka, M., Hlavac, V. & Boyle, R. (1993), Image Processing, Analysis and Machine Vision, Chapman and Hall, London, UK.
Steenkiste, P. (1996), “Network-based multicomputers: a practical supercomputer architecture”, IEEE Transactions on Parallel and Distributed Systems 7(8) Aug, 861-875.
Stout, Q. F. (1987), “Supporting divide-and-conquer algorithms for image processing”, Journal of Parallel and Distributed Computing 4(1), 95-115.
Sunderam, V. (1990), “PVM: a framework for parallel distributed computing”, Concurrency: Practice and Experience 2, 315-339.
Sunwoo, M. H., Baroody, B. S. & Aggarwal, J. K. (1987), “A parallel algorithm for region labeling”, Proceedings of the IEEE Workshop on Computer Architecture for Pattern Analysis and Machine Intelligence, pp. 27-34.
Tandiary, P., Kothari, S. C., Dixit, A. & Anderson, E. W. (1996), “Batrun: utilizing idle workstations for large-scale computing”, IEEE Parallel and Distributed Technology Summer, 41-48.
Theimer, M. M. & Lantz, K. A. (1988), Finding idle machines in a workstation-based distributed system, in “Proceedings of the 8th International Conference on Distributed Computing Systems”, IEEE Computer Society Press, pp. 112-122.
Turcotte, L. (1993), A survey of software environments for exploiting networked computing resources, Technical Report MSM-EIRS-ERC-93-2, Mississippi State University.
Turcotte, L. H. (1996), Cluster computing, in A. Y. H. Zomaya (ed.), “Parallel and Distributed Computing Handbook”, McGraw Hill, McGraw Hill series on Computer Engineering, pp. 762-779.
Uhr, L. (ed.) (1987), Parallel Computer Vision, Academic Press, Boston, USA.
Uhr, L., Preston, K., Levialdi, S. & Duff, M. J. B. (eds.) (1986), Evaluation of Multicomputers for Image Processing, Academic Press, New York, USA.
Wang, C. L. (1995), High performance computing for vision on distributed memory machines, PhD thesis, University of Southern California, USA.
Wang, C. L., Bhat, P. B. & Prasanna, V. K. (1996), "High-performance computing for vision", Proceedings of the IEEE 84(7) Jul, 931-946.
Wang, C. L., Prasanna, V. K., Kim, H. J. & Khokhar, A. A. (1994), "Scalable data-parallel implementations of object recognition using geometric hashing", Journal of Parallel and Distributed Computing 21(1), 96-109.
Wang, X. & Blum, E. K. (1996), "Parallel execution of iterative computations on workstation clusters", Journal of Parallel and Distributed Computing 34, 218-226.
Webb, J. (1994), "High performance computing in image processing and computer vision", International Conference on Pattern Recognition Sep, 218-222.
Weems, C. C., Levitan, S. P., Hanson, A. R., Riseman, E. M. et al. (1989), "The image understanding architecture", International Journal of Computer Vision 2(3), 251-282.
Willebeek-LeMair, M. & Reeves, A. P. (1990), "Solving non-uniform problems on SIMD computers: case study on region growing", Journal of Parallel and Distributed Computing 8(2), 135-149.
Williams, D. & Shah, M. (1992), "A fast algorithm for active contours and curvature estimation", CVGIP: Image Understanding 55(1), 14-26.
Wilson, G. V. & Lu, P. (eds.) (1996), Parallel Programming Using C++, The MIT Press, Cambridge, Massachusetts, London, UK.
Witkin, A. (1983), "Scale-space filtering", International Joint Conference on Artificial Intelligence, pp. 1019-1022.
Yalamanchili, S. & Aggarwal, J. K. (1994), "Parallel processing methodologies for image processing and computer vision", Advances in Electronics and Electron Physics 87, 259-300.
Yan, Y., Zhang, X. & Song, Y. (1996), "An effective and practical performance prediction model for parallel computing on nondedicated heterogeneous NOW", Journal of Parallel and Distributed Computing 38, 63-80.
Zimran, E., Rao, M. & Segall, Z. (1990), "Performance efficient mapping of applications to parallel and distributed architectures", International Conference on Parallel Processing II, 147-154.