INFORMATION THEORETIC MEASURES AND THEIR APPLICATIONS TO IMAGE REGISTRATION AND SEGMENTATION

By

FEI WANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2006

Copyright 2006

by

Fei Wang

For my wife, Lin, and my parents.

ACKNOWLEDGMENTS

I would like to first thank my advisor, Dr. Baba C. Vemuri, for everything he has

done for me during my doctoral study. This dissertation would not have taken shape

without his invaluable input. Dr. Vemuri introduced me to the field of medical image

analysis. His insight and experience have guided me throughout my research during

which time he provided numerous invaluable suggestions. It was a great pleasure

for me to conduct this dissertation under his supervision. I would also like to thank

Dr. Anand Rangarajan, Dr. Sartaj Sahni, Dr. Arunava Banerjee and Dr. Tan Wong for their

willingness to serve on my committee. In addition, special thanks go to Dr. Jorg Peters

for attending my PhD oral examination.

My doctoral research is a happy cooperation with many people. Dr. Vemuri has been

involved with the whole process, Dr. Rangarajan has guided me a lot in the groupwise

point registration part, Dr. Ilona Schmalfuss and Dr. Stephan Eisenschenk have kindly

provided the data for hippocampal segmentation and taught me what little I know of

Neuroscience. I have also benefitted from Dr. Thomas E. Davis’s guidance when I first

joined the lab. I would also like to thank Dr. Banerjee for stimulating debates, and Dr.

Jeffrey Ho for his professional advice and philosophical discussions. Thanks also go to Drs. Murali Rao and Yunmei Chen, my co-authors on a set of papers that introduced the concept of entropy based on probability distributions and several of its properties.

Needless to say, I am grateful for the support of my colleagues and friends at the

Computer and Information Science and Engineering Department at the University of

Florida. Dr. Zhizhou Wang, Dr. Jundong Liu, Dr. Tim Mcgraw, Dr. Eric Spellman, Bing

Jian, Santhosh Kodipaka, Nicholas Lord, Neeti Vohra, Angelos Barmpoutis, Seniha Esen

Yuksel, Ozlem Subakan, Ritwik Kumar, Evren Ozarslan, Ajit Rajwade, Adrian Peter, Dr.

Jie Zhang and Dr. Hongyu Guo all deserve thanks.

And finally, most importantly, I thank my family. I thank my mother and father

for everything and my brother, too. And of course I thank my dearest Lin for her

understanding and love during the past few years. Their support and encouragement are

my source of strength.

This research was supported in part by the grants NIH RO1-NS42075 and NIH

R01-NS046812. I would also like to acknowledge travel support (for attending various

conferences to present research papers) from the IEEE Computer Society, the Department

of Computer and Information Science and Engineering and the College of Engineering of

the University of Florida.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Image and Point-set Registration
    1.1.1 Image Registration
    1.1.2 Groupwise Point-sets Registration
  1.2 Image Segmentation
  1.3 Outline of Remainder

2 ENTROPY AND RELATED MEASURES
  2.1 Shannon Entropy and Related Measures
  2.2 Cumulative Residual Entropy: A New Measure of Information
  2.3 Properties of CRE
    2.3.1 CRE and Empirical CRE
    2.3.2 Robustness of CRE

3 APPLICATIONS TO MULTIMODALITY IMAGE REGISTRATION
  3.1 Related Work
  3.2 Multimodal Image Registration using CCRE
    3.2.1 Transformation Model for Non-rigid Motion
    3.2.2 Measure Optimization
    3.2.3 Computation of P(i > λ, k; µ) and ∂P(i > λ, k; µ)/∂µ
    3.2.4 Algorithm Summary
  3.3 Implementation Results
    3.3.1 Synthetic Motion Experiments
      3.3.1.1 Convergence speed
      3.3.1.2 Registration accuracy
      3.3.1.3 Noise immunity
      3.3.1.4 Partial overlap
    3.3.2 Real Data Experiments

4 DIVERGENCE MEASURES FOR GROUPWISE POINT-SETS REGISTRATION
  4.1 Previous Work
  4.2 Divergence Measures
    4.2.1 Jensen-Shannon Divergence
    4.2.2 CDF-JS Divergence
  4.3 Methodology
    4.3.1 Energy Function for Groupwise Point-sets Registration
    4.3.2 JS Divergence in a Hypothesis Testing Framework
    4.3.3 Unbiasness Property of the Divergence Measures
    4.3.4 Estimating JS and its Derivative
      4.3.4.1 Finite mixture models
      4.3.4.2 Optimizing the JS divergence
    4.3.5 Estimating CDF-JS and its Derivative
      4.3.5.1 Optimizing the CDF-JS divergence
  4.4 Experiment Results
    4.4.1 JS Divergence Results
      4.4.1.1 Alignment results
      4.4.1.2 Atlas construction results
    4.4.2 CDF-JS Divergence Results

5 APPLICATIONS TO IMAGE SEGMENTATION
  5.1 Related Work
  5.2 Registration+Segmentation Model
    5.2.1 Gradient flows
    5.2.2 Algorithm Summary
  5.3 Results

6 CONCLUSIONS AND FUTURE WORK
  6.1 Contributions of the Dissertation
  6.2 Image and Point-sets Registration
    6.2.1 Non-rigid Image Registration
    6.2.2 Groupwise Point-sets Registration
  6.3 Image Segmentation

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

Table

3–1 Comparison of the registration results between CCRE and MI for a fixed synthetic deformation field.

3–2 Comparison of total time taken to achieve registration by the CCRE with MI.

3–3 Comparison of the value S of several brain structures for CCRE and MI.

5–1 Statistics of the error in estimated non-rigid deformation.

LIST OF FIGURES

Figure

1–1 Illustration of groupwise registration of corpus callosum point-sets manually extracted from the outer contours of the brain images.

3–1 CCRE, MI and NMI traces plotted for the misaligned MR & CT image pair.

3–2 Comparison of convergence speed between CCRE and MI.

3–3 Plot demonstrating the change of Mean Deformation Error for CCRE and MI registration results with time.

3–4 Results of application of our algorithm to synthetic data (see text for details).

3–5 Registration results of MR T1 and T2 image slice with large non-overlap.

3–6 Registration results of different subjects of MR & CT brain data with real non-rigid motion (see text for details).

4–1 Illustration of corpus callosum point-sets represented as density functions.

4–2 Results of rigid registration in noiseless case. 'o' and '+' indicate the model and scene points respectively.

4–3 Non-rigid registration of the corpus callosum point-sets.

4–4 Experiment results on seven 2D corpus callosum point sets.

4–5 Robustness to outliers in the presence of large noise.

4–6 Robustness test on 3D swan data.

4–7 Atlas construction from four 3D hippocampal point sets.

5–1 Model Illustration.

5–2 Illustration of the various terms in the evolution of the level set function φ.

5–3 Results of application of our algorithm to synthetic data.

5–4 Results of application of our algorithm to a pair of slices from human brain MRIs.

5–5 Corpus Callosum segmentation on a pair of corresponding slices from distinct subjects.

5–6 Hippocampal segmentation using our algorithm on a pair of brain scans from distinct subjects.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

INFORMATION THEORETIC MEASURES AND THEIR APPLICATIONS TO IMAGE REGISTRATION AND SEGMENTATION

By

Fei Wang

August 2006

Chair: Baba C. Vemuri

Major Department: Computer and Information Sciences and Engineering

Information theory has played a fundamental role in many fields of science and

engineering including computer vision and medical imaging. This dissertation presents several information theoretic measures and applies them to three important problems in medical imaging, namely image registration, point-set registration and image segmentation.

To measure the information content in a random variable, we first present a novel

measure based on its cumulative distribution that is dubbed Cumulative Residual Entropy

(CRE). This measure parallels the well-known Shannon entropy but has the following

advantages: (1) it is more general than the Shannon entropy as its definition is valid

in the discrete and continuous domains, (2) it possesses more general mathematical

properties and (3) it can be easily computed from sample data and these computations

asymptotically converge to the true values. Based on CRE, we define the cross-CRE

(CCRE) between two random variables, and apply it to solve the image alignment

problem for parameterized transformations. The key strengths of CCRE over the now popular Mutual Information (based on Shannon's entropy) between the images being aligned are that CCRE has significantly larger tolerance to noise and a much larger convergence range over the field of parameterized transformations.

Jensen-Shannon (JS) divergence has long been known as a measure of cohesion

between multiple probability densities. Similar to the idea of defining an entropy measure based on distributions, we derive a JS divergence based on cumulative probability distributions and dub it the CDF-JS divergence. We then apply the JS and the CDF-JS divergences to

the groupwise point-set registration problem, which involves simultaneously registering

multiple shapes (represented as point-sets) for constructing an atlas. Estimating a

meaningful average or mean shape from a set of shapes represented by unlabeled point-

sets is a challenging problem, since this usually involves solving for point correspondence

under a non-rigid motion setting. The novel and robust algorithm we propose avoids the

correspondence problem by minimizing the CDF-JS/JS divergence between the point-sets

represented as probability distribution/density functions. The cost functions are fully

symmetric with no bias toward any of the given shapes to be registered and whose mean

is being sought. We empirically show that CDF-JS is more robust to noise and outliers

than JS divergence. Our algorithm can be especially useful for creating atlases of various

shapes present in images as well as for simultaneously registering 3D range data sets

without having to establish any correspondence.

In the context of image segmentation, we develop a novel model-based segmentation technique that segments new 3D image data by non-rigidly registering an atlas to it. The key contribution is a novel variational formulation of the registration assisted image segmentation task, which leads to a coupled set of nonlinear PDEs that are solved using efficient numerical schemes. Our segmentation algorithm is a departure from earlier methods in that we have a unified variational principle wherein non-rigid registration and segmentation are simultaneously achieved; unlike previous solutions to this problem, our algorithm can accommodate image pairs with very distinct intensity distributions.

CHAPTER 1
INTRODUCTION

In 1948, motivated by the problem of efficiently transmitting information over

a noisy communication channel, Claude Shannon introduced a revolutionary new

probabilistic way of thinking about communication and simultaneously created the first

truly mathematical theory of entropy. His ideas created a sensation and were rapidly

developed to create the field of information theory, which employs probability and

ergodic theory to study the statistical characteristics of data and communication systems.

Since then, information theory has played a fundamental role in many fields of science

and engineering including computer vision and medical imaging. In this dissertation, we endeavor to develop novel information theoretic methods with applications to medical image analysis.

We examine two applications in particular, image (point-set) registration and

image segmentation. In the first of these applications, we follow a promising avenue of

work in using a probability density or distribution function as the signature of a given

“object” (image or point-set). Then by optimizing certain information theoretic measures

between these functions, we achieve the desired registration. In the segmentation

application, we consider an atlas based approach, in which segmentation and registration

are simultaneously achieved by solving a novel variational principle.

1.1 Image and Point-set Registration

We start with the image registration problem and then move on to the point-set

registration.

1.1.1 Image Registration

The image registration problem is defined as follows: Given a pair of images $I_1(x, y)$ and $I_2(x', y')$, where $(x', y')^t = T(x, y)^t$ and $T$ is the matrix corresponding to the unknown parameterized transformation to be determined, define a match metric $M(I_1(x, y), I_2(x', y'))$ and optimize $M$ over all $T$.

The fundamental characteristic of any image registration technique is the type of

spatial transformation or mapping used to properly overlay two images. The transformation can be classified into global and local transformations. A global transformation is

given by a single equation which maps the entire image. Local transformations map the

image differently depending on the spatial location and are thus more difficult to express

succinctly. The most common global transformations are rigid, affine and projective

transformations.

A transformation is called rigid if the distance between points in the image being

transformed is preserved. A rigid transformation can be expressed as

$$
\begin{aligned}
u(x, y) &= (\cos(\phi)\,x - \sin(\phi)\,y + d_x) - x \\
v(x, y) &= (\sin(\phi)\,x + \cos(\phi)\,y + d_y) - y
\end{aligned}
\tag{1–1}
$$

where $u(x, y)$ and $v(x, y)$ denote the displacement at point $(x, y)$ along the $X$ and $Y$ directions; $\phi$ is the rotation angle, and $(d_x, d_y)$ the translation vector.
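The distance-preserving property can be checked numerically. The following is a minimal Python sketch of Eq. (1–1); the function names, the two test points and the parameter values are illustrative assumptions, not part of the original text.

```python
import math

def rigid_displacement(x, y, phi, dx, dy):
    """Displacement (u, v) of the point (x, y) under the rigid model of Eq. (1-1)."""
    u = (math.cos(phi) * x - math.sin(phi) * y + dx) - x
    v = (math.sin(phi) * x + math.cos(phi) * y + dy) - y
    return u, v

def transform(point, phi, dx, dy):
    """Apply the displacement to a point to obtain its new position."""
    x, y = point
    u, v = rigid_displacement(x, y, phi, dx, dy)
    return x + u, y + v

# Hypothetical parameters, chosen only for illustration.
phi, dx, dy = math.radians(30.0), 2.0, -1.0
p, q = (1.0, 2.0), (4.0, 6.0)
p_new, q_new = transform(p, phi, dx, dy), transform(q, phi, dx, dy)

# A rigid transformation should leave the distance between p and q unchanged.
dist_before = math.dist(p, q)
dist_after = math.dist(p_new, q_new)
print(dist_before, dist_after)
```

The two printed distances agree up to floating-point rounding, which is what the definition of a rigid transformation requires.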

A transformation is called affine when any straight line in the first image is mapped

onto a straight line in the second image with parallelism being preserved. In 2D, the

affine transformation can be expressed as

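$$
\begin{aligned}
u(x, y) &= (a_{11}x + a_{12}y + d_x) - x \\
v(x, y) &= (a_{21}x + a_{22}y + d_y) - y
\end{aligned}
\tag{1–2}
$$

where $\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$ denotes an arbitrary real-valued matrix. The scaling transformation, which has the transformation matrix $\begin{pmatrix} s_1 & 0 \\ 0 & s_2 \end{pmatrix}$, and the shearing transformation, which has the matrix $\begin{pmatrix} 1 & s_3 \\ 0 & 1 \end{pmatrix}$, are two examples of affine transformations, where $s_1$, $s_2$ and $s_3$ are positive real numbers.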

A more interesting case, in general, is that of a planar surface in motion viewed through a pinhole camera. This motion can be described as a 2D projective transformation of the plane:

$$
\begin{aligned}
u(x, y) &= \frac{a_0 x + a_1 y + a_2}{a_6 x + a_7 y + 1} - x \\
v(x, y) &= \frac{a_3 x + a_4 y + a_5}{a_6 x + a_7 y + 1} - y
\end{aligned}
\tag{1–3}
$$

where $a_0, ..., a_7$ are the global parameters.
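As a small illustration of Eq. (1–3), the sketch below evaluates the projective displacement in Python. The function name and the two parameter vectors are hypothetical; note that setting $a_6 = a_7 = 0$ makes the denominator equal to 1, so the model reduces to an affine one.

```python
def projective_displacement(x, y, a):
    """Displacement (u, v) of (x, y) under Eq. (1-3); a = (a0, ..., a7)."""
    a0, a1, a2, a3, a4, a5, a6, a7 = a
    w = a6 * x + a7 * y + 1.0  # denominator shared by both components
    if w == 0.0:
        raise ZeroDivisionError("point is mapped to infinity by this transformation")
    u = (a0 * x + a1 * y + a2) / w - x
    v = (a3 * x + a4 * y + a5) / w - y
    return u, v

# Hypothetical parameter vectors for illustration.
identity = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)      # no motion at all
translation = (1.0, 0.0, 3.0, 0.0, 1.0, -2.0, 0.0, 0.0)  # pure shift by (3, -2)

print(projective_displacement(5.0, 7.0, identity))
print(projective_displacement(5.0, 7.0, translation))
```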

When a global transformation does not adequately explain the relationship of a

pair of input images, a local transformation may be necessary. Registering an image

pair obtained at different times with some portion of the body experiencing growth,

or registering two images from different patients, fall into this local transformation

registration category. In such problems, a motion (displacement) field is usually used to describe the spatially varying local transformation.

1.1.2 Groupwise Point-sets Registration

Point-set representations of image data, e.g., feature points, are commonly used

in many applications and the problem of registering them frequently arises in a variety

of these application domains. Extensive studies on the point set registration and related

problems can be found in a rich literature covering both theoretical and practical issues

relating to computer vision and pattern recognition.

Given $N$ point-sets, denoted by $\{X^p, p \in \{1, ..., N\}\}$, each point-set $X^p$ consists of points $\{x_i^p \in \mathbb{R}^D, i \in \{1, ..., n_p\}\}$, where $n_p$ is the number of points contained in point-set $X^p$. The task of multiple point pattern matching or point-set registration is

either to establish a consistent point-to-point correspondence between these point-sets or

to recover the spatial transformation which yields the best alignment. For example, suppose we are given a group of corpus callosum point-sets extracted from brain image scans, shown in the left column of Figure 1–1. All the point-sets are registered simultaneously to the point-sets shown in the right column in a symmetric manner, meaning that the registration result is not biased towards any of the original point-sets. We will discuss these issues in greater detail in Chapter 4.

Figure 1–1: Illustration of groupwise registration of corpus callosum point-sets manually extracted from the outer contours of the brain images.

1.2 Image Segmentation

Image segmentation plays a crucial role in many medical imaging applications by

automating or facilitating the delineation of anatomical structures. The segmentation of

structure from 2D and 3D images is an important first step in analyzing medical data. For

example, it is necessary to segment the brain in an MR image, before it can be rendered in

3D for visualization purposes. Segmentation can also be used to automatically detect the

head and abdomen of a fetus from an ultrasound image. The boundaries can then be used

to get quantitative estimates of organ sizes and provide aid in any necessary diagnoses.

Another important application is registration. It may be easier, or at least less error-prone, to segment objects in multiple images prior to registration. This is especially true for images from different modalities such as CT and MRI.

Image-guided surgery is another important application that needs image segmentation. Recent advances in technology have made it possible to acquire images of the patient while the surgery is in progress. The goal is then to segment relevant regions of

interest and overlay them on an image of the patient to help guide the surgeon in his/her

work.

Segmentation is therefore a very important task in medical imaging. However, manual segmentation is not only a tedious and time-consuming process, but is also inaccurate. Segmentation by experts has been shown to vary by up to 20%. It is therefore

desirable to use algorithms that are accurate and require as little user interaction as

possible.

1.3 Outline of Remainder

In the next chapter, we rigorously define a novel measure of information in a random

variable based on its cumulative distribution that we dub as cumulative residual entropy

(CRE). We also connect the measure to the mean residual life function in reliability

engineering. Thereafter follows the chapter on using the measure for multimodal image

registration. In Chapter 4, we present a simultaneous groupwise point-sets registration

and atlas construction algorithm, in which we minimize the proposed divergence

measures between point sets represented as probability densities or distributions. Based

on these new measures, we propose a novel variational principle in Chapter 5 for

solving the registration assisted image segmentation problem. Lastly, we end with some

concluding points and thoughts for future work.

CHAPTER 2
ENTROPY AND RELATED MEASURES

2.1 Shannon Entropy and Related Measures

The concept of entropy is central to the field of information theory and was originally introduced by Shannon in his seminal paper [1] in the context of communication theory. The entropy Shannon proposed is a measure of uncertainty in a discrete distribution based on the Boltzmann entropy of classical statistical mechanics. The Shannon entropy of a discrete distribution $F$ with outcome probabilities $p_i$ is defined by

$$H(F) = -\sum_i p_i \log p_i. \tag{2–1}$$

Since then, this concept and variants thereof have been extensively utilized in numerous applications of science and engineering. To date, the fields that have benefited most widely include financial analysis [2], data compression [3], statistics [4], and information theory [5].
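For concreteness, Eq. (2–1) can be evaluated in a few lines of Python. This is only a sketch; the function name and the two example distributions are illustrative assumptions, and natural logarithms are used so the result is in nats.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution, as in Eq. (2-1)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p = 0 are skipped, since p * log(p) tends to 0 as p -> 0.
    return -sum(p * math.log(p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])  # maximal uncertainty for two outcomes
certain = shannon_entropy([1.0, 0.0])    # a certain event carries no uncertainty

print(fair_coin, certain)
```

The fair coin attains $\log 2$, while the certain event has zero entropy.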

This measure of uncertainty has many important properties which agree with our intuitive notion of randomness. We mention three: (1) It is always non-negative. (2) It vanishes if and only if the random variable is a certain event. (3) Entropy is increased by the addition of an independent component, and decreased by conditioning. However, extension of this notion to continuous distributions poses some challenges. A straightforward extension of the discrete case to continuous distributions $F$ with density $f$, called differential entropy, reads

$$H(F) = -\int f(x) \log f(x)\,dx \tag{2–2}$$

However, this definition raises the following concerns: (1) It is defined in terms of the density of the random variable, which in general may or may not exist, e.g., when the cumulative distribution function (cdf) is not differentiable. It is not possible to define the entropy of a random variable for which the density function is undefined. (2) The Shannon entropy of a discrete distribution is always non-negative, while the differential entropy of a continuous variable may take any value on the extended real line. (3) Shannon entropy computed from samples of a random variable lacks the property of convergence to the differential entropy, i.e., even when the sample size goes to infinity, the Shannon entropy estimated from these samples will not converge to the differential entropy [5]. Consequently, it is impossible, in general, to approximate the differential entropy of a continuous variable using the entropy of empirical distributions. (4) Consider the following situation: Suppose $X$ and $Y$ are two discrete random variables representing the height of a group of people, with $X$ taking on values $\{5.1, 5.2, 5.3, 5.4, 5.5\}$, each with probability $1/5$, and $Y$ taking on values $\{5.1, 5.2, 5.3, 5.4, 7.5\}$ (with Yao Ming in this group), again each with probability $1/5$. The information content measured in these two random variables using Shannon entropy is the same, i.e., Shannon entropy does not bring out any differences between these two cases. However, if the two random variables represented the winning chances in a basketball game, the information content in the two random variables would be considered dramatically different. Nevertheless, Shannon entropy fails to make any distinction whatsoever between them. For additional discussion on some of these issues the reader is referred to [6].

In this work we propose an alternative measure of uncertainty in a random variable $X$ and call it the Cumulative Residual Entropy (CRE) of $X$. The main objective of our study is to extend Shannon entropy to random variables with continuous distributions. The concept we propose overcomes the problems mentioned above, while retaining many of the important properties of Shannon entropy. For instance, both are decreased by conditioning and increased by independent addition, and both obey the data processing inequality. However, the differential entropy does not have the following important properties of CRE.

1. CRE has consistent definitions in both the continuous and discrete domains;

2. CRE is always non-negative;

3. CRE can be easily computed from sample data and these computations asymptotically converge to the true values.

The basic idea in our definition is to replace the density function with the cumulative distribution in Shannon's definition 2–1. The distribution function is more regular than the density function, because the density is computed as the derivative of the distribution. Moreover, in practice what is of interest and/or measurable is the distribution function. For example, if the random variable is the life span of a machine, then the event of interest is not whether the life span equals $t$, but rather whether the life span exceeds $t$. Our definition also preserves the well established principle that the logarithm of the probability of an event should represent the information content in the event. We trust that the properties of CRE discussed in the next few sections are convincing enough to motivate further development of the concept.

2.2 Cumulative Residual Entropy: A New Measure of Information

In this section, we define an alternate measure of uncertainty in a random variable

and then derive some properties about this new measurement. We do not delve into the

proofs but refer the reader to a more comprehensive mathematical treatment in [7].

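Definition: Let $X = (X_1, X_2, ..., X_N)$ be a random vector in $\mathbb{R}^N$. $F(\lambda) := P(|X| > \lambda)$ is the cumulative residual distribution, where $\lambda = (\lambda_1, ..., \lambda_N)$ and $|X| > \lambda$ means $|X_i| > \lambda_i$ for each $i$. $F(\lambda)$ is also called the survival function in the Reliability Engineering literature. We define the cumulative residual entropy (CRE) of $X$, denoted $\mathcal{E}(X)$, by

$$\mathcal{E}(X) = -\int_{\mathbb{R}^N_+} F(\lambda) \log F(\lambda)\,d\lambda \tag{2–3}$$

where $\mathbb{R}^N_+ = \{x \in \mathbb{R}^N : x_i \geq 0\}$.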
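To see how this definition treats the height example of Section 2.1, the following Python sketch computes both the Shannon entropy and the CRE of the two discrete variables. It assumes equally likely, non-negative scalar values, for which the survival function is piecewise constant and the integral in Eq. (2–3) becomes a finite sum; the function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (nats) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def discrete_cre(values):
    """CRE of a non-negative discrete variable with equally likely `values`.

    P(X > t) is constant between consecutive sorted values, so Eq. (2-3)
    reduces to a sum of (interval length) * (-S log S) terms. The interval
    below the smallest value has S = 1 and contributes nothing.
    """
    xs = sorted(values)
    n = len(xs)
    total = 0.0
    for i in range(n - 1):
        survival = (n - 1 - i) / n  # P(X > t) for xs[i] <= t < xs[i + 1]
        total -= (xs[i + 1] - xs[i]) * survival * math.log(survival)
    return total

X = [5.1, 5.2, 5.3, 5.4, 5.5]
Y = [5.1, 5.2, 5.3, 5.4, 7.5]  # one value replaced by a much larger one

print("Shannon:", shannon_entropy(X), shannon_entropy(Y))
print("CRE:    ", discrete_cre(X), discrete_cre(Y))
```

Shannon entropy depends only on the probabilities and is therefore identical for the two variables, whereas CRE also depends on the values themselves and is noticeably larger for $Y$.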

CRE can be related to the well-known concept of mean residual life function in

Reliability Engineering which is defined as:

$$m_F(t) = E(X - t \mid X \geq t) = \frac{\int_t^\infty F(x)\,dx}{F(t)} \tag{2–4}$$

The function $m_F(t)$ is of fundamental importance in Reliability Engineering and is often used to measure departure from exponentiality. CRE can be shown to be the expectation of $m_F$ [8], i.e.

$$\mathcal{E}(X) = E(m_F(X)) \tag{2–5}$$
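The relation between CRE and the mean residual life can be checked numerically. The sketch below does so for a uniform variable on $[0, a]$ using a simple midpoint rule; the value of `a`, the helper names and the grid sizes are assumptions made for illustration only.

```python
import math

def integrate(f, lo, hi, n=2000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

a = 2.0  # hypothetical upper end of the uniform distribution on [0, a]

def survival(x):
    """P(X > x) for X uniform on [0, a]."""
    return max(0.0, 1.0 - x / a)

def density(x):
    """Density of X on [0, a]."""
    return 1.0 / a

def mean_residual_life(t):
    """m_F(t) from Eq. (2-4): integral of the survival function over [t, a], over F(t)."""
    return integrate(survival, t, a) / survival(t)

def cre_integrand(x):
    s = survival(x)
    return -s * math.log(s) if s > 0.0 else 0.0

cre = integrate(cre_integrand, 0.0, a)  # left-hand side of Eq. (2-5)
expected_mrl = integrate(lambda t: mean_residual_life(t) * density(t), 0.0, a, n=400)

print(cre, expected_mrl)
```

Both quantities come out close to $a/4$, in agreement with the relation above.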

Now we give a few examples.

• Example 1: (CRE of the uniform distribution)

Consider a general uniform distribution with the density function:

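$$
p(x) =
\begin{cases}
\dfrac{1}{a} & 0 \leq x \leq a \\
0 & \text{otherwise}
\end{cases}
\tag{2–6}
$$

Then its CRE is computed as follows

$$
\begin{aligned}
\mathcal{E}(X) &= -\int_0^a P(|X| > x) \log P(|X| > x)\,dx \\
&= -\int_0^a \left(1 - \frac{x}{a}\right) \log\left(1 - \frac{x}{a}\right) dx \\
&= \frac{1}{4}a
\end{aligned}
\tag{2–7}
$$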
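The last step of Example 1 follows from a single change of variable. A short derivation, with $u = 1 - x/a$ as the substituted variable, is:

```latex
\begin{aligned}
-\int_0^a \Bigl(1-\tfrac{x}{a}\Bigr)\log\Bigl(1-\tfrac{x}{a}\Bigr)\,dx
  &= -a\int_0^1 u\log u\,du
     && (u = 1 - x/a,\ dx = -a\,du) \\
  &= -a\Bigl[\tfrac{u^2}{2}\log u - \tfrac{u^2}{4}\Bigr]_0^1 \\
  &= -a\Bigl(-\tfrac{1}{4}\Bigr) = \frac{a}{4}.
\end{aligned}
```

The boundary term at $u = 0$ vanishes because $u^2 \log u \to 0$ as $u \to 0$.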

• Example 2: (CRE of the exponential distribution)

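The exponential distribution with mean $1/\lambda$ has the density function:

$$p(x) = \lambda e^{-\lambda x}, \quad x \geq 0 \tag{2–8}$$

Correspondingly, the CRE of the exponential distribution is

$$
\begin{aligned}
\mathcal{E}(X) &= -\int_0^\infty e^{-\lambda t} \log e^{-\lambda t}\,dt \\
&= \int_0^\infty \lambda t e^{-\lambda t}\,dt \\
&= \frac{1}{\lambda}
\end{aligned}
\tag{2–9}
$$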

• Example 3: (CRE of the Gaussian Distribution)

The Gaussian probability density function is

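$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-m)^2}{2\sigma^2}\right], \tag{2–10}$$

where $m$ is the mean and $\sigma^2$ is the variance.

The cumulative distribution function is:

$$F(x) = 1 - \mathrm{erfc}\left(\frac{x-m}{\sigma}\right), \tag{2–11}$$

where $\mathrm{erfc}$ denotes the complementary error function, normalized here as:

$$\mathrm{erfc}(x) = \frac{1}{\sqrt{2\pi}} \int_x^\infty \exp(-t^2/2)\,dt.$$

Then the CRE of the Gaussian distribution is:

$$\mathcal{E}(X) = -\int_0^\infty \mathrm{erfc}\left(\frac{x-m}{\sigma}\right) \log\left[\mathrm{erfc}\left(\frac{x-m}{\sigma}\right)\right] dx \tag{2–12}$$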
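The Gaussian case has no elementary closed form here, but the integral is straightforward to approximate. The sketch below is one possible numerical evaluation in Python, assuming a midpoint rule and a truncation of the upper limit at ten standard deviations; the function names, grid size and truncation width are illustrative choices. Note that Python's `math.erfc` uses the standard normalization, so the tail probability used in this section equals `0.5 * math.erfc(z / sqrt(2))`.

```python
import math

def gaussian_tail(z):
    """P(Z > z) for a standard normal Z, i.e. the erfc as normalized in this section."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def gaussian_cre(m, sigma, n=20000, width=10.0):
    """Midpoint-rule approximation of the Gaussian CRE integral over [0, upper].

    The upper limit is truncated at max(m, 0) + width * sigma, beyond which
    the integrand is negligible.
    """
    upper = max(m, 0.0) + width * sigma
    h = upper / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        s = gaussian_tail((x - m) / sigma)
        if s > 0.0:
            total -= s * math.log(s) * h
    return total

cre_unit = gaussian_cre(0.0, 1.0)
cre_double = gaussian_cre(0.0, 2.0)
print(cre_unit, cre_double)
```

For zero mean the integral scales linearly with $\sigma$, so doubling the standard deviation doubles the computed value.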

We now state some important properties that are relevant to the application of CRE to image registration. For a complete list of properties, we refer the reader to the more comprehensive treatment in [7].

2.3 Properties of CRE

The traditional Shannon entropy of a sum of independent variables is larger than that

of either. We have analogously the following theorem:

Theorem 1 For any non-negative and independent variables $X$ and $Y$,

$$\max(\mathcal{E}(X), \mathcal{E}(Y)) \leq \mathcal{E}(X + Y)$$

Proof: For a proof, see [7].

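Similar to the case of Shannon's entropy, if $X$ and $Y$ are independent random variables, $\mathcal{E}(X, Y) = E(|Y|)\,\mathcal{E}(X) + E(|X|)\,\mathcal{E}(Y)$, where $E$ denotes expectation. More generally,

Proposition 1 If the $X_i$ are independent, then

$$\mathcal{E}(X) = \sum_i \Bigl(\prod_{j \neq i} E(|X_j|)\Bigr)\,\mathcal{E}(X_i)$$

For a proof, see [7].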
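As a quick sanity check of Proposition 1, consider two independent uniform variables, $X_1$ on $[0, a]$ and $X_2$ on $[0, b]$ (an example chosen here for illustration). Using the uniform CRE from Example 1 and the fact that the joint survival function factorizes under independence:

```latex
\begin{aligned}
\text{Proposition 1:}\quad
  & E(|X_2|)\,\mathcal{E}(X_1) + E(|X_1|)\,\mathcal{E}(X_2)
    = \frac{b}{2}\cdot\frac{a}{4} + \frac{a}{2}\cdot\frac{b}{4}
    = \frac{ab}{4}, \\[4pt]
\text{Direct:}\quad
  & -\int_0^a\!\!\int_0^b
      \Bigl(1-\tfrac{\lambda_1}{a}\Bigr)\Bigl(1-\tfrac{\lambda_2}{b}\Bigr)
      \log\Bigl[\Bigl(1-\tfrac{\lambda_1}{a}\Bigr)\Bigl(1-\tfrac{\lambda_2}{b}\Bigr)\Bigr]
      \,d\lambda_2\,d\lambda_1 \\
  &\quad = -ab\int_0^1\!\!\int_0^1 uv\,(\log u + \log v)\,dv\,du
    = -ab\Bigl[\Bigl(-\tfrac14\Bigr)\tfrac12 + \tfrac12\Bigl(-\tfrac14\Bigr)\Bigr]
    = \frac{ab}{4}.
\end{aligned}
```

Both routes give $ab/4$, as the proposition predicts.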

Conditional entropy is a fundamental concept in information theory. We now define

the concept of conditioning in the context of CRE.

Definition: Given random vectors $X$ and $Y \in \mathbb{R}^N$, we define the conditional CRE $\mathcal{E}(X|Y)$ by:

$$\mathcal{E}(X|Y) = -\int_{\mathbb{R}^N_+} P(|X| > x \mid Y) \log P(|X| > x \mid Y)\,dx \tag{2–13}$$

As in the Shannon entropy case, conditioning reduces CRE.

Proposition 2 For any $X$ and $Y$,

$$E[\mathcal{E}(X|Y)] \leq \mathcal{E}(X) \tag{2–14}$$

Equality holds iff $X$ is independent of $Y$.

2.3.1 CRE and Empirical CRE

The next theorem shows one of the salient features of CRE. In the discrete case Shannon entropy is always non-negative, and equals zero if and only if the random variable is a certain event. However, this is not valid for the Shannon entropy in the continuous case as defined in Eqn. 2–2. In contrast, in this regard CRE does not differentiate between discrete and continuous cases, as shown by the following theorem:

Theorem 2 $\mathcal{E}(X) \geq 0$, and equality holds if and only if $P[|X| = \lambda] = 1$ for some vector $\lambda$, i.e., $|X_i| = \lambda_i$ with probability 1.

Shannon entropy computed from samples of a random variable lacks the property

of convergence to the differential entropy (see Eqn.2–2for a definition). In contrast, the

CRE,E(x) computed from the samples converges to the continuous counterpart. This is

summarized in the following theorem.

Proposition 3 (Weak Convergence). Let the random vectors $X_k$ converge in distribution to the random vector $X$; by this we mean

$$\lim_{k \to \infty} E[\varphi(X_k)] = E[\varphi(X)] \tag{2–15}$$

for all bounded continuous functions $\varphi$ on $\mathbb{R}^N$. If all the $X_k$ are bounded in $L^p$ for some $p > N$, then

$$\lim_{k \to \infty} \mathcal{E}(X_k) = \mathcal{E}(X) \tag{2–16}$$

Proof: Refer to [7] for the proof.

This is a powerful property: as a consequence, the CRE computed from samples of a random variable converges to the true CRE of that random variable. Note that the $X_k$ can be samples of a continuous random variable.
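The convergence property above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the estimator used in [7]: the function name `empirical_cre` is hypothetical, the data are scalar, and a regular grid on (0, 1) stands in for random draws from Uniform(0, 1), whose CRE is 1/4 by direct integration of $-u \log u$.

```python
import math


def empirical_cre(samples):
    """Empirical CRE of scalar samples via the empirical survival function.

    For sorted magnitudes x_(1) <= ... <= x_(n), the empirical estimate of
    P(|X| > x) equals (n - i) / n on [x_(i), x_(i+1)), so the integral
    -∫ P log P dx becomes a finite sum over the gaps between samples.
    """
    xs = sorted(abs(s) for s in samples)
    n = len(xs)
    total = 0.0
    for i in range(1, n):
        surv = (n - i) / n  # survival value on [xs[i-1], xs[i])
        total -= (xs[i] - xs[i - 1]) * surv * math.log(surv)
    return total


# Uniform(0, 1) has CRE 1/4; a regular grid stands in for random draws.
n = 2000
grid = [(i + 0.5) / n for i in range(n)]
cre_uniform = empirical_cre(grid)

# A constant sample has zero CRE, in line with Theorem 2.
cre_constant = empirical_cre([3.0] * 50)

print(cre_uniform, cre_constant)
```

Increasing `n` moves `cre_uniform` closer to 0.25, which is the behaviour Proposition 3 leads one to expect for sample-based estimates.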

2.3.2 Robustness of CRE

We now investigate the robustness (or the lack thereof) of differential entropy and

prove that while differential entropy is not robust with respect to small perturbations,

CRE on the contrary is quite robust. This property plays a key role in demonstrating the

noise immunity of CCRE over MI depicted in the experiments in the next Chapter.

Theorem 3 Let $X$ be a discrete R.V. taking values $(x_1, x_2, \ldots, x_N)$ with probabilities $p_1, p_2, \ldots, p_N$:

$$p(X = x_i) = p_i, \quad 1 \le i \le N \tag{2–17}$$

$X$ has Shannon entropy $H(X) = -\sum_i p_i \log p_i$. Let $Y_n$ have density $f_n$ and be independent of $X$. Then $Z_n = X + Y_n$ is no longer discrete, and has a density. With $X$ as in (2–17) and $Y_n$ as above, suppose $Y_n \to 0$ in probability. Then

$$h(X + Y_n) \to -\infty \tag{2–18}$$

Theorem 4 For $X$ and $Y_n$ as defined in Theorem 3,

$$\lim_{Y_n \to 0} \mathcal{E}(X + Y_n) = \mathcal{E}(X) \tag{2–19}$$

Proof: This is a direct consequence of Proposition 3.

Theorems 3 and 4 are important because together they show that CRE is robust to noise, which is not the case for differential entropy. Intuitively, the robustness of CRE may be attributed to the use of a CDF as opposed to a PDF in its definition, i.e., an integral formulation as opposed to a differential formulation, and it is well known that the former is more robust than the latter.
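The contrast between Theorems 3 and 4 can be checked numerically on a toy case. The sketch below is an illustration under assumptions that are not part of the original text: $X$ takes the values 1 and 2 with probability 1/2 each, and the perturbation $Y$ is Uniform(0, eps) with eps smaller than the spacing of the atoms, so that $h(X + Y) = H(X) + \log(\text{eps})$ in closed form. The CRE of the sum is obtained with a simple midpoint rule; all function names are hypothetical.

```python
import math


def survival_z(z, atoms, probs, eps):
    """P(X + Y > z) for discrete X and independent Y ~ Uniform(0, eps)."""
    s = 0.0
    for x, p in zip(atoms, probs):
        frac = 1.0 - (z - x) / eps  # P(Y > z - x) before clipping to [0, 1]
        s += p * min(1.0, max(0.0, frac))
    return s


def cre_of_sum(atoms, probs, eps, steps=100000):
    """CRE of Z = X + Y via a midpoint rule on -∫ S(z) log S(z) dz."""
    upper = max(atoms) + eps
    dz = upper / steps
    total = 0.0
    for i in range(steps):
        s = survival_z((i + 0.5) * dz, atoms, probs, eps)
        if 0.0 < s < 1.0:  # s = 0 and s = 1 contribute nothing
            total -= s * math.log(s) * dz
    return total


def differential_entropy_of_sum(probs, eps):
    """h(X + Y) = H(X) + log(eps); valid when eps is below the atom spacing."""
    shannon = -sum(p * math.log(p) for p in probs)
    return shannon + math.log(eps)


atoms, probs = [1.0, 2.0], [0.5, 0.5]
cre_x = 0.5 * math.log(2.0)  # CRE of X itself: survival is 1/2 on [1, 2)
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, cre_of_sum(atoms, probs, eps), differential_entropy_of_sum(probs, eps))
```

As eps shrinks, the printed CRE values approach `cre_x` while the differential entropy decreases without bound, which mirrors Eqns. (2–18) and (2–19).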

CHAPTER 3
APPLICATIONS TO MULTIMODALITY IMAGE REGISTRATION

Matching two or more images under varying conditions – illumination, pose,

acquisition parameters etc. – is ubiquitous in Computer Vision, medical imaging,

geographical information systems etc. In the past several years, information theoretic

measures have been very widely used in defining cost functions to be optimized in

achieving a match. An example problem common to all the aforementioned areas is the

image registration problem. In the following, we will review the literature on existing

computational algorithms that have been reported for achieving multimodality image

registration, with the focus on the non-rigid registration methods. We will point out their

limitations and hence motivate the need for a new and efficient computational algorithm

for achieving our goal.

3.1 Related Work

Non-rigid image registration methods in literature to date may be classified into

feature-based and “direct” methods. Most feature-based methods are limited to determin-

ing the registration at the feature locations and require an interpolation at other locations.

If, however, the transformation between the images is global (e.g., rigid or affine), there is no need for an interpolation step. In the non-rigid case, however, interpolation is required. Also, the accuracy of the registration depends on the accuracy of the feature detector.

Several feature-based methods involve detecting surface landmarks [9, 10, 11, 12],

edges, ridges, etc. Most of these assume a known correspondence with the exception

of the work in Chui et al.[9], Jian and Vemuri [13], Wang et al.[14] and Guo et al. [15].

Work reported in Irani and Anandan [16] uses the energy (squared magnitude) in the

directional derivative image as a representation scheme for matching achieved using the


SSD cost function. Recently, Liu et al. [17] reported the use of local frequency in a robust

statistical framework using the integral squared error, a.k.a. L2E. The primary advantage

of L2E over other robust estimators in literature is that there are no tuning parameters in

it. The idea of using local phase was also exploited by Mellor and Brady [18], who used

mutual information (MI) to match local-phase representation of images and estimated

the non-rigid registration between them. However, robustness to significant non-overlap

in the field of view (FOV) of the scanners was not addressed. For more on feature-based

methods, we refer the reader to the recent survey by Zitova and Flusser [19].

In the context of “direct” methods, the primary matching techniques for intra-

modality registration involve the use of normalized cross-correlation, modified SSD,

and (normalized) mutual information (MI). Ruiz-Alzola et al.[20] presented a unified

framework for non-rigid registration of scalar, vector and tensor data based on template

matching. For scalar images, the cost function is the extension of modified SSD using a

different definition of inner products. However this model can only be used on images

from the same modality as it assumes similar intensity values between images. In [21,

22], a level-set based image registration algorithm was introduced that was designed to

non-rigidly register two 3D volumes from the same modality of imaging. This algorithm

was computationally efficient and was used to achieve atlas-based segmentation. Direct

methods based on the optical-flow estimation form a large class for solving the non-rigid

registration problem. Hellier et al.[23] proposed a registration method based on a dense

robust 3-D estimation of the optical flow with a piecewise parametric description of

the deformation field. Their algorithm is unsuitable for multi-modal image registration

due to the brightness constancy assumption. Variants of optical flow-based registration that accommodate varying illumination may be used for inter-modality registration, and we refer the reader to [24, 25] for such methods. Guimond et al. [26] reported a

multi-modal brain warping technique that uses Thirion’s Demons algorithm [27] with

an adaptive intensity correction. The technique however was not tested for robustness


with respect to significant non-overlap in the FOVs. More recently, Cuzol et al. [28]

introduced a new non-rigid image registration technique that involves a Helmholtz decomposition of the flow field, which is then embedded into the brightness constancy model of optical flow. The Helmholtz decomposition allows one to compute large displacements when the data contains such displacements. This technique is an innovation for accommodating large displacements, not one that allows for inter-modality non-rigid registration. For more on intra-modality methods, we refer the

reader to the comprehensive surveys [29, 19].

A popular framework for “direct” methods is based on information theoretic measures [30]. Among them, mutual information (MI), pioneered by Viola and Wells [31] and Collignon et al. [32] and modified in Studholme et al. [33], has been effective in the application of image registration. Reported registration experiments in these works

are quite impressive for the case of rigid motion. The problem of being able to handle

non-rigid deformations in the MI framework is a very active area of research and some

recent papers reporting results on this problem are [18, 34, 35, 36, 37, 38, 39, 40, 41,

42]. In [34], Mattes et al., and in [35], Rueckert et al., presented mutual information

based schemes for matching multi-modal image pairs using B-Splines to represent the

deformation field on a regular grid. Guetter [43] recently incorporated a learned joint

intensity distribution into the mutual information formulation, in which the registration

is achieved by simultaneously minimizing the KL divergence between the observed

and learned intensity distributions and maximizing the mutual information between

the reference and alignment images. Recently, D’Agostino et al., [44] presented an

information theoretic approach wherein tissue class probabilities of each image being

registered are used to match over the space of transformations using a divergence measure

between the ideal case (where tissue class labels between images at corresponding voxels

are similar) and actual joint class distributions of both images. This work expects a

segmentation of either one of the images being registered. Computational efficiency and


accuracy (in the event of significant non-overlaps) are issues of concern in most if not all

the MI-based non-rigid registration methods.

Finally, some registration methods under the direct approach are inspired by

models from mechanics, either from elasticity [45, 46], or fluid mechanics [47, 48].

Fluid mechanics-based models accommodate large deformations, but are generally computationally expensive. Christensen [49] recently developed an interesting version

of these methods, where the direct deformation field and the inverse deformation field

are jointly estimated to guarantee the symmetry of the deformation with respect to

permutation of input images. A more general and mathematically rigorous treatment

of the non-rigid registration which subsumes the fluid-flow methods was presented in

Trouve [50]. All these methods however are primarily applicable to intra-modality and

not inter-modality registration.

3.2 Multimodal Image Registration using CCRE

Based on CRE, cross-CRE (CCRE) between two random variables was defined and applied to solve the image alignment problem, which is defined as follows: given a pair of images $I_1(x)$ and $I_2(x')$, where $(x')^t = T\,(x)^t$ and $T$ is the matrix corresponding to the unknown parameterized transformation to be determined, define a match metric $M(I_1(x), I_2(x'))$ and maximize/minimize $M$ over all $T$. The class of transformations can be rigid, affine, projective or non-rigid transformations. Several matching criteria have been proposed in the past, some of which were reviewed earlier. Amongst them, mutual information is very popular and is defined as follows for the continuous random variable case,

$$MI(X, Y) = h(X) + h(Y) - h(X, Y) \tag{3–1}$$

where $h(X)$ is the differential entropy of the random variable $X$ and is given by $h(X) = -\int_{-\infty}^{\infty} p(x) \ln p(x)\, dx$, where $p(x)$ is the probability density function and can be estimated from the image data using any of the parametric and nonparametric methods. The reason for defining MI in terms of differential entropy as opposed to Shannon entropy is to facilitate the optimization of MI with respect to the registration parameters using any of the gradient based optimization methods. Note that MI defined using Shannon’s entropy in discrete form will not converge to the continuous case defined here, because Shannon’s entropy does not converge to the differential entropy (see [5]).
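For reference, the entropy decomposition in Eqn. (3–1) has a direct discrete analogue that can be evaluated on a joint histogram. The sketch below is only that discrete analogue, written for a small joint probability table; it is not the differentiable density-based estimator discussed in the text, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; 0 log 0 := 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information(joint):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y) for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    pxy = [p for row in joint for p in row]   # flattened joint
    return entropy(px) + entropy(py) - entropy(pxy)


independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
mi_independent = mutual_information(independent)  # no shared information
mi_dependent = mutual_information(dependent)      # equals log 2

print(mi_independent, mi_dependent)
```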

We now define the cross-CRE (CCRE) using CRE as defined in Eqn. 2–3:

$$\mathcal{C}(X, Y) = \mathcal{E}(X) - E[\mathcal{E}(X \mid Y)] \tag{3–2}$$

We will use this quantity as a matching criterion in the image alignment problem. More specifically, let $I_T(x)$ be a test image that we want to register to a reference image $I_R(x)$. The transformation $g(x; \mu)$ describes the deformation from $V_T$ to $V_R$, where $V_T$ and $V_R$ are the continuous domains on which $I_T$ and $I_R$ are defined, and $\mu$ is the set of transformation parameters to be determined. We pose the task of image registration as an optimization problem. To align the reference image $I_R(x)$ with the transformed test image $I_T(g(x; \mu))$, we seek the set of transformation parameters $\mu$ that maximizes $\mathcal{C}(I_T, I_R)$ over the space of smooth transformations, i.e.,

$$\mu = \arg\max_{\mu}\, \mathcal{C}\big(I_T \circ g(x; \mu), I_R\big) \tag{3–3}$$

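The computation of CCRE requires estimates of the marginal and joint probability distributions of the intensity values of the reference and test images. We denote by $p(l, k; \mu)$ the joint probability of $\big(I_T \circ g(x; \mu), I_R\big)$. Let $p_T(l; \mu)$ and $p_R(k)$ represent the marginal probabilities for the test and reference images respectively, and let $L_T$ and $L_R$ be the discrete sets of intensities associated with the test image and reference image respectively. Then, we can rewrite the CCRE $\mathcal{C}\big(I_T \circ g(x; \mu), I_R\big)$ as follows:

\begin{align*}
\mathcal{C}\big(I_T \circ g(x; \mu), I_R\big) &= \mathcal{E}(I_T) - E\big[\mathcal{E}(I_T \circ g(x; \mu) \mid I_R)\big] \\
&= -\sum_{\lambda \in L_T} \int_{\lambda}^{\infty} p_T(l; \mu)\, dl \, \log\Big[ \int_{\lambda}^{\infty} p_T(l; \mu)\, dl \Big] \\
&\quad + \sum_{k \in L_R} p_R(k) \sum_{\lambda \in L_T} \int_{\lambda}^{\infty} \frac{p(l, k; \mu)}{p_R(k)}\, dl \, \log\Big[ \int_{\lambda}^{\infty} \frac{p(l, k; \mu)}{p_R(k)}\, dl \Big] \tag{3–4}
\end{align*}

Let $P(i > \lambda; \mu) = \int_{\lambda}^{\infty} p_T(l; \mu)\, dl$ and $P(i > \lambda, k; \mu) = \int_{\lambda}^{\infty} p(l, k; \mu)\, dl$. Using the fact that $p_T(l; \mu) = \sum_{k \in L_R} p(l, k; \mu)$, we have $P(i > \lambda; \mu) = \sum_{k \in L_R} P(i > \lambda, k; \mu)$. Eqn. (3–4) can be further simplified, which leads to

\begin{align*}
\mathcal{C}\big(I_T \circ g(x; \mu), I_R\big)
&= -\sum_{\lambda \in L_T} P(i > \lambda; \mu) \log P(i > \lambda; \mu) + \sum_{k \in L_R} \sum_{\lambda \in L_T} P(i > \lambda, k; \mu) \log \frac{P(i > \lambda, k; \mu)}{p_R(k)} \\
&= -\sum_{\lambda \in L_T} \sum_{k \in L_R} P(i > \lambda, k; \mu) \log P(i > \lambda; \mu) + \sum_{k \in L_R} \sum_{\lambda \in L_T} P(i > \lambda, k; \mu) \log \frac{P(i > \lambda, k; \mu)}{p_R(k)} \tag{3–5} \\
&= \sum_{\lambda \in L_T} \sum_{k \in L_R} P(i > \lambda, k; \mu) \Big[ \log \frac{P(i > \lambda, k; \mu)}{p_R(k)} - \log P(i > \lambda; \mu) \Big] \\
&= \sum_{\lambda \in L_T} \sum_{k \in L_R} P(i > \lambda, k; \mu) \log \frac{P(i > \lambda, k; \mu)}{p_R(k)\, P(i > \lambda; \mu)} \tag{3–6}
\end{align*}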
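The simplification leading to Eqn. (3–6) can be checked on a small joint table. The sketch below is an illustration under assumptions not taken from the text: intensities are bin indices with unit bin width, `joint[l][k]` holds $p(l, k)$ with $l$ indexing test-image bins and $k$ reference-image bins, and all function names are hypothetical. It evaluates CCRE both in the compact form of Eqn. (3–6) and directly as CRE minus expected conditional CRE.

```python
import math


def survival_joint(joint):
    """P(i > λ, k): for each bin λ, sum joint[l][k] over l > λ."""
    n_l, n_k = len(joint), len(joint[0])
    surv = [[0.0] * n_k for _ in range(n_l)]
    for lam in range(n_l - 2, -1, -1):
        for k in range(n_k):
            surv[lam][k] = surv[lam + 1][k] + joint[lam + 1][k]
    return surv


def ccre(joint):
    """CCRE in the compact form of Eqn. (3–6), with unit-width bins."""
    p_r = [sum(col) for col in zip(*joint)]  # marginal of the reference image
    total = 0.0
    for row in survival_joint(joint):
        p_lam = sum(row)                     # P(i > λ)
        for k, p in enumerate(row):
            if p > 0.0:                      # 0 log 0 := 0
                total += p * math.log(p / (p_r[k] * p_lam))
    return total


def cre(surv_values):
    """Discrete CRE from a list of survival values."""
    return -sum(s * math.log(s) for s in surv_values if s > 0.0)


def ccre_from_definition(joint):
    """CRE(I_T) minus the expected conditional CRE, as in Eqn. (3–2)."""
    p_r = [sum(col) for col in zip(*joint)]
    surv = survival_joint(joint)
    marginal = cre([sum(row) for row in surv])
    conditional = sum(
        p_r[k] * cre([row[k] / p_r[k] for row in surv])
        for k in range(len(p_r)) if p_r[k] > 0.0
    )
    return marginal - conditional


joint = [[0.30, 0.05], [0.10, 0.15], [0.05, 0.35]]
independent = [[0.08, 0.12], [0.20, 0.30], [0.12, 0.18]]
print(ccre(joint), ccre_from_definition(joint), ccre(independent))
```

For the independent table every log term vanishes, so the CCRE is zero up to rounding, while the dependent table gives a positive value in both forms.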

To illustrate the difference between CCRE and the now popular information theoretic cost functions such as MI and NMI, we plot these functions against one parameter of the transformation, here the rotation angle. The image pair used is an MR and CT pair that was originally aligned; the MR and CT intensities range from 0 to 255, with means of 55.6 and 60.6 respectively. The cost functions are computed over the rotation angle that was applied to the CT image to misalign it with respect to the MR image. In each plot of Figure 3–1, the X-axis shows the 3D rotation angle about the Z axis, while the Y-axis shows the values of CCRE, MI and NMI computed from the misaligned (by a rotation) image pairs. The second row shows a zoomed-in view of the plots over a smaller region, so as to give a detailed view of the cost function. The following observations are made from this plot:

Figure 3–1: CCRE, MI and NMI traces plotted for the misaligned MR & CT image pair, where misalignment is generated by a rotation of the CT image. First row: over the range −40° to 40°. Second row: zoomed-in view between −0.5° and 0.5°, where the arrows in the first row signify the position. Note that all three cost functions are implemented with tri-linear interpolation. Third row: the three cost functions implemented with partial volume interpolation [32].

1. Similar to MI and NMI, the maximum of CCRE occurs at 0° of rotation, which confirms that our new information measure needs to be maximized in order to find the optimum transformation between two misaligned images.

2. CCRE shows a much larger range of values than MI and NMI. This feature plays an important role in the numerical optimization, since it leads to a more stable numerical implementation by avoiding cancellation, round-off, etc., which often plague arithmetic operations with smaller numerical values.

3. Upon closer inspection, we observe that CCRE is much smoother than MI and NMI for the MR and CT data pair, which indicates that CCRE is more regular than the other information theoretic measures.

3.2.1 Transformation Model for Non-rigid Motion

We model the non-rigid deformation field between two 3D images using a cubic B-spline basis in 3D. B-splines have a number of desirable properties for use in modeling the deformation field: (1) splines provide inherent control of smoothness (degree of continuity); (2) B-splines are separable in multiple dimensions, which provides computational efficiency. Another feature of B-splines that is useful in a non-rigid registration system is “local control”: changing the location of a single control point modifies only a local neighborhood of that control point.

The basic idea of the cubic B-spline deformation is to deform an object by manipulating an underlying mesh of control points $\gamma_j$. The deformation $g$ is defined by a sparse regular control point grid. In the 3D case, the deformation at any point $\mathbf{x} = [x, y, z]^T$ in the test image can be interpolated with a linear combination of cubic B-spline convolution kernels:

$$g(\mathbf{x}) = \sum_j \delta_j\, \beta^{(3)}\Big( \frac{\mathbf{x} - \gamma_j}{\Delta\rho} \Big) \tag{3–7}$$

where $\beta^{(3)}(\mathbf{x}) = \beta^{(3)}(x)\,\beta^{(3)}(y)\,\beta^{(3)}(z)$ and $\Delta\rho$ is the spacing of the control grid. The $\delta_j$ are the B-spline expansion coefficients computed from the sample values of the image. For the implementation details, we refer the reader to Forsey [51] and Mattes [34].
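The kernel in Eqn. (3–7) and the local control property can be sketched in one dimension. The code below is a minimal illustration, not the implementation of [34] or [51]: it assumes the standard cubic B-spline kernel, control points placed at $\gamma_j = j\,\Delta\rho$, and scalar coefficients; the function names are hypothetical.

```python
def bspline3(u):
    """Cubic B-spline kernel β^(3)(u), supported on (-2, 2)."""
    a = abs(u)
    if a < 1.0:
        return 2.0 / 3.0 - a * a + 0.5 * a ** 3
    if a < 2.0:
        return (2.0 - a) ** 3 / 6.0
    return 0.0


def bspline3_3d(x, y, z):
    """Separable 3D kernel: product of three 1D kernels."""
    return bspline3(x) * bspline3(y) * bspline3(z)


def deformation_1d(x, coeffs, spacing):
    """g(x) = Σ_j δ_j β^(3)((x - γ_j) / Δρ), with γ_j = j * Δρ (assumed grid)."""
    return sum(d * bspline3((x - j * spacing) / spacing)
               for j, d in enumerate(coeffs))


# Local control: moving one coefficient changes g only within two grid
# spacings of the corresponding control point.
spacing = 16.0
coeffs = [0.0] * 10
coeffs[5] = 1.0
at_control_point = deformation_1d(5 * spacing, coeffs, spacing)
far_away = deformation_1d(5 * spacing + 2.5 * spacing, coeffs, spacing)
print(at_control_point, far_away)
```

The separability noted in the text shows up in `bspline3_3d`, which needs only three 1D evaluations per control point.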

3.2.2 Measure Optimization

Calculation of the gradient of the energy function is necessary for its efficient and

robust maximization. The gradient of CCRE is given as,

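$$\nabla \mathcal{C} = \Big[ \frac{\partial \mathcal{C}}{\partial \mu_1}, \frac{\partial \mathcal{C}}{\partial \mu_2}, \ldots, \frac{\partial \mathcal{C}}{\partial \mu_n} \Big] \tag{3–8}$$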


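Each component of the gradient can be found by differentiating Eqn. (3–4) with respect to a transformation parameter. We consider the two terms in Eqn. (3–4) separately when computing the derivative. For the first term in Eqn. (3–4), we have

\begin{align*}
\frac{\partial \mathcal{E}(I_T)}{\partial \mu} &= \frac{\partial}{\partial \mu} \Big[ -\sum_{\lambda \in L_T} \int_{\lambda}^{\infty} p_T(l; \mu)\, dl \times \log\Big( \int_{\lambda}^{\infty} p_T(l; \mu)\, dl \Big) \Big] \\
&= -\sum_{\lambda \in L_T} \big[ \log P(i > \lambda; \mu) + 1 \big] \times \frac{\partial P(i > \lambda; \mu)}{\partial \mu} \tag{3–9}
\end{align*}

where $P(i > \lambda; \mu) = \int_{\lambda}^{\infty} p_T(l; \mu)\, dl$, and

$$\frac{\partial P(i > \lambda; \mu)}{\partial \mu} = \int_{\lambda}^{\infty} \frac{\partial p_T(l; \mu)}{\partial \mu}\, dl \tag{3–10}$$

The derivative of the second term in Eqn. (3–4), i.e., of $-E[\mathcal{E}(I_T \circ g(x; \mu) \mid I_R)]$, is given by

\begin{align*}
-\frac{\partial E\big[\mathcal{E}(I_T \circ g(x; \mu) \mid I_R)\big]}{\partial \mu}
&= \frac{\partial}{\partial \mu} \Big[ \sum_{k \in L_R} p_R(k) \sum_{\lambda \in L_T} \int_{\lambda}^{\infty} \frac{p(l, k; \mu)}{p_R(k)}\, dl \times \log\Big( \int_{\lambda}^{\infty} \frac{p(l, k; \mu)}{p_R(k)}\, dl \Big) \Big] \\
&= \sum_{k \in L_R} \sum_{\lambda \in L_T} \Big( \log \frac{P(i > \lambda, k; \mu)}{p_R(k)} + 1 \Big) \frac{\partial P(i > \lambda, k; \mu)}{\partial \mu} \tag{3–11}
\end{align*}

where $P(i > \lambda, k; \mu) = \int_{\lambda}^{\infty} p(l, k; \mu)\, dl$, and

$$\frac{\partial P(i > \lambda, k; \mu)}{\partial \mu} = \int_{\lambda}^{\infty} \frac{\partial p(l, k; \mu)}{\partial \mu}\, dl \tag{3–12}$$

Combining the derivatives of the two terms together, and using the fact that

$$\frac{\partial p_T(l; \mu)}{\partial \mu} = \frac{\partial \sum_{k \in L_R} p(l, k; \mu)}{\partial \mu} \tag{3–13}$$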

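we have the analytic gradient of CCRE,

\begin{align*}
\frac{\partial \mathcal{C}\big(I_T \circ g(x; \mu), I_R\big)}{\partial \mu}
&= -\sum_{\lambda \in L_T} \big[ \log P(i > \lambda; \mu) + 1 \big] \frac{\partial \sum_{k \in L_R} P(i > \lambda, k; \mu)}{\partial \mu} \\
&\quad + \sum_{k \in L_R} \sum_{\lambda \in L_T} \Big[ \log \frac{P(i > \lambda, k; \mu)}{p_R(k)} + 1 \Big] \frac{\partial P(i > \lambda, k; \mu)}{\partial \mu} \tag{3–14} \\
&= -\sum_{\lambda \in L_T} \sum_{k \in L_R} \big[ \log P(i > \lambda; \mu) + 1 \big] \frac{\partial P(i > \lambda, k; \mu)}{\partial \mu} \\
&\quad + \sum_{\lambda \in L_T} \sum_{k \in L_R} \Big[ \log \frac{P(i > \lambda, k; \mu)}{p_R(k)} + 1 \Big] \frac{\partial P(i > \lambda, k; \mu)}{\partial \mu} \\
&= \sum_{\lambda \in L_T} \sum_{k \in L_R} \Big[ \log \frac{P(i > \lambda, k; \mu)}{p_R(k)} - \log P(i > \lambda; \mu) \Big] \times \frac{\partial P(i > \lambda, k; \mu)}{\partial \mu} \\
&= \sum_{\lambda \in L_T} \sum_{k \in L_R} \Big[ \log \frac{P(i > \lambda, k; \mu)}{p_R(k)\, P(i > \lambda; \mu)} \Big] \times \frac{\partial P(i > \lambda, k; \mu)}{\partial \mu}
\end{align*}

Note that in the derivation we use the fact that $P(i > \lambda; \mu) = \sum_{k \in L_R} P(i > \lambda, k; \mu)$. Comparing the expressions for CCRE and the derivative of CCRE,

\begin{align*}
\mathcal{C}\big(I_T \circ g(x; \mu), I_R\big) &= \sum_{\lambda \in L_T} \sum_{k \in L_R} \log \frac{P(i > \lambda, k; \mu)}{p_R(k)\, P(i > \lambda; \mu)} \times P(i > \lambda, k; \mu) \\
\frac{\partial \mathcal{C}\big(I_T \circ g(x; \mu), I_R\big)}{\partial \mu} &= \sum_{\lambda \in L_T} \sum_{k \in L_R} \log \frac{P(i > \lambda, k; \mu)}{p_R(k)\, P(i > \lambda; \mu)} \times \frac{\partial P(i > \lambda, k; \mu)}{\partial \mu} \tag{3–15}
\end{align*}

we note that the two formulas in (3–15) are similar to each other and share the common term $\log \frac{P(i > \lambda, k; \mu)}{p_R(k) \times P(i > \lambda; \mu)}$. From a computational viewpoint, this is quite beneficial, since computing the common term once not only saves memory but also makes the calculation of the gradient more efficient. From the formulation, we can also see that the calculation of CCRE and its derivative requires a method to estimate $P(i > \lambda, k; \mu)$ and $\frac{\partial P(i > \lambda, k; \mu)}{\partial \mu}$. We will address the computation of these terms in the next subsection.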
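The shared log term in Eqn. (3–15) can be exercised on a toy problem and the resulting gradient compared against a finite difference. This sketch rests on assumptions that are not in the text: a hypothetical one-parameter family $p(l, k; \mu) = (1 - \mu) A + \mu B$, where the tables `A` and `B` have the same reference marginal $p_R(k)$ (so that $p_R$ does not depend on $\mu$, as the derivation requires), unit-width bins, and hypothetical function names.

```python
import math


def survival(table):
    """Sum table[l][k] over l > λ for each λ (works for p and for ∂p/∂μ)."""
    n_l, n_k = len(table), len(table[0])
    surv = [[0.0] * n_k for _ in range(n_l)]
    for lam in range(n_l - 2, -1, -1):
        for k in range(n_k):
            surv[lam][k] = surv[lam + 1][k] + table[lam + 1][k]
    return surv


def ccre_and_gradient(joint, d_joint):
    """Return (C, dC/dμ) from p(l, k; μ) and ∂p(l, k; μ)/∂μ, per Eqn. (3–15)."""
    p_r = [sum(col) for col in zip(*joint)]
    value = grad = 0.0
    for row, d_row in zip(survival(joint), survival(d_joint)):
        p_lam = sum(row)                                  # P(i > λ)
        for k, p in enumerate(row):
            if p > 0.0:
                common = math.log(p / (p_r[k] * p_lam))   # shared term
                value += common * p
                grad += common * d_row[k]
    return value, grad


# Hypothetical family of joint tables with a fixed reference marginal.
A = [[0.30, 0.05], [0.10, 0.15], [0.05, 0.35]]
B = [[0.10, 0.25], [0.15, 0.20], [0.20, 0.10]]
D = [[b - a for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]   # ∂p/∂μ


def joint_at(mu):
    return [[(1.0 - mu) * a + mu * b for a, b in zip(ra, rb)]
            for ra, rb in zip(A, B)]


value, grad = ccre_and_gradient(joint_at(0.3), D)
print(value, grad)
```

The log term is computed once per $(\lambda, k)$ pair and reused for both the cost and the gradient, which is the saving described above.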

3.2.3 Computation of $P(i > \lambda, k; \mu)$ and $\partial P(i > \lambda, k; \mu) / \partial \mu$

We will use the Parzen window technique to estimate the cumulative distribution function and its derivative. The calculation of $P(i > \lambda, k; \mu)$ requires estimates of the cumulative probability distributions of the intensity values of the reference and test images. Let $\beta^{(0)}$ be a zero-order spline Parzen window (centered unit pulse) and $\beta^{(3)}$ be a cubic spline Parzen window. The smoothed joint probability of $(I_R, I_T \circ g)$ is given by

$$p(l, k; \mu) = \alpha \sum_{x \in V} \beta^{(0)}\Big( k - \frac{I_R(x) - f_R^0}{\Delta b_R} \Big)\, \beta^{(3)}\Big( l - \frac{I_T(g(x; \mu)) - f_T^0}{\Delta b_T} \Big) \tag{3–16}$$

where $\alpha$ is a normalization factor that ensures $\sum p(l, k) = 1$, and $I_R(x)$ and $I_T(g(x; \mu))$ are samples of the reference and interpolated test images respectively, which are normalized by the minimum intensity values, $f_R^0$ and $f_T^0$, and the intensity range of each bin, $\Delta b_R$ and $\Delta b_T$.

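Since $P(i > \lambda, k; \mu) = \int_{\lambda}^{\infty} p(l, k; \mu)\, dl$, we have the following,

\begin{align*}
P(i > \lambda, k; \mu) &= \int_{\lambda}^{\infty} p(l, k; \mu)\, dl \\
&= \alpha \sum_{x \in V} \beta^{(0)}\Big( k - \frac{I_R(x) - f_R^0}{\Delta b_R} \Big) \int_{\lambda}^{\infty} \beta^{(3)}\Big( l - \frac{I_T(g(x; \mu)) - f_T^0}{\Delta b_T} \Big)\, dl \\
&= \alpha \sum_{x \in V} \beta^{(0)}\Big( k - \frac{I_R(x) - f_R^0}{\Delta b_R} \Big)\, \Phi\Big( \lambda - \frac{I_T(g(x; \mu)) - f_T^0}{\Delta b_T} \Big) \tag{3–17}
\end{align*}

where $\Phi(\cdot)$ is the cumulative residual function of the cubic spline kernel, defined as follows,

$$\Phi(v) = \int_{v}^{\infty} \beta^{(3)}(u)\, du =
\begin{cases}
1.0 & v < -2 \\
1.0 - \frac{(v + 2)^4}{24} & -2 \le v < -1 \\
\frac{1}{2} - \frac{2}{3} v + \frac{v^3}{3} + \frac{v^4}{8} & -1 \le v < 0 \\
\frac{1}{2} - \frac{2}{3} v + \frac{v^3}{3} - \frac{v^4}{8} & 0 \le v < 1 \\
\frac{(v - 2)^4}{24} & 1 \le v < 2 \\
0 & v \ge 2
\end{cases} \tag{3–18}$$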
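The piecewise form in Eqn. (3–18) can be checked against the kernel it integrates. The sketch below assumes the standard cubic B-spline kernel for $\beta^{(3)}$ and uses hypothetical function names; it compares a central finite difference of $\Phi$ with $-\beta^{(3)}$ at a few points.

```python
def bspline3(u):
    """Cubic B-spline kernel β^(3)(u), supported on (-2, 2)."""
    a = abs(u)
    if a < 1.0:
        return 2.0 / 3.0 - a * a + 0.5 * a ** 3
    if a < 2.0:
        return (2.0 - a) ** 3 / 6.0
    return 0.0


def cubic_spline_survival(v):
    """Φ(v) = ∫_v^∞ β^(3)(u) du, following the cases of Eqn. (3–18)."""
    if v < -2.0:
        return 1.0
    if v < -1.0:
        return 1.0 - (v + 2.0) ** 4 / 24.0
    if v < 0.0:
        return 0.5 - 2.0 * v / 3.0 + v ** 3 / 3.0 + v ** 4 / 8.0
    if v < 1.0:
        return 0.5 - 2.0 * v / 3.0 + v ** 3 / 3.0 - v ** 4 / 8.0
    if v < 2.0:
        return (v - 2.0) ** 4 / 24.0
    return 0.0


def derivative_gap(v, h=1e-5):
    """|Φ'(v) + β^(3)(v)| using a central difference for Φ'."""
    slope = (cubic_spline_survival(v + h) - cubic_spline_survival(v - h)) / (2.0 * h)
    return abs(slope + bspline3(v))


for v in (-1.5, -0.5, 0.25, 1.5):
    print(v, cubic_spline_survival(v), derivative_gap(v))
```

The gaps printed in the last column are at the level of finite-difference error, consistent with $d\Phi(u)/du = -\beta^{(3)}(u)$.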


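Note that $\frac{d\Phi(u)}{du} = -\beta^{(3)}(u)$. We can then take the derivative of Eqn. 3–17 with respect to $\mu$, and we get

\begin{align*}
\frac{\partial P(i > \lambda, k; \mu)}{\partial \mu}
&= \frac{\alpha}{\Delta b_T} \sum_{x \in V} \beta^{(0)}\Big( k - \frac{I_R(x) - f_R^0}{\Delta b_R} \Big)\, \Phi'\Big( \lambda - \frac{I_T(g(x; \mu)) - f_T^0}{\Delta b_T} \Big) \times \Big( -\frac{\partial I_T(t)}{\partial t} \Big|_{t = g(x; \mu)} \Big) \frac{\partial g(x; \mu)}{\partial \mu} \\
&= \frac{\alpha}{\Delta b_T} \sum_{x \in V} \beta^{(0)}\Big( k - \frac{I_R(x) - f_R^0}{\Delta b_R} \Big)\, \beta^{(3)}\Big( \lambda - \frac{I_T(g(x; \mu)) - f_T^0}{\Delta b_T} \Big) \times \Big( \frac{\partial I_T(t)}{\partial t} \Big|_{t = g(x; \mu)} \Big) \frac{\partial g(x; \mu)}{\partial \mu} \tag{3–19}
\end{align*}

where $\frac{\partial I_T(t)}{\partial t}$ is the image gradient.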
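The estimate in Eqn. (3–17) can be sketched for a handful of paired intensity samples. This is a minimal illustration under assumptions not stated in the text: the samples are given as two equal-length lists (reference intensity and already-interpolated test intensity at the same voxel), the normalization factor is taken as $\alpha = 1/\text{(number of samples)}$, and the function and parameter names are hypothetical.

```python
def box(u):
    """Zero-order spline β^(0): centered unit pulse."""
    return 1.0 if -0.5 <= u < 0.5 else 0.0


def cubic_spline_survival(v):
    """Φ(v) = ∫_v^∞ β^(3)(u) du, following the cases of Eqn. (3–18)."""
    if v < -2.0:
        return 1.0
    if v < -1.0:
        return 1.0 - (v + 2.0) ** 4 / 24.0
    if v < 0.0:
        return 0.5 - 2.0 * v / 3.0 + v ** 3 / 3.0 + v ** 4 / 8.0
    if v < 1.0:
        return 0.5 - 2.0 * v / 3.0 + v ** 3 / 3.0 - v ** 4 / 8.0
    if v < 2.0:
        return (v - 2.0) ** 4 / 24.0
    return 0.0


def joint_survival(ref, test, bins_r, bins_t,
                   f0_r=0.0, f0_t=0.0, db_r=1.0, db_t=1.0):
    """Estimate P(i > λ, k) for λ in range(bins_t) and k in range(bins_r)."""
    alpha = 1.0 / len(ref)  # assumed normalization: unit mass per sample
    surv = [[0.0] * bins_r for _ in range(bins_t)]
    for i_r, i_t in zip(ref, test):
        r = (i_r - f0_r) / db_r  # reference intensity in bin units
        t = (i_t - f0_t) / db_t  # test intensity in bin units
        for k in range(bins_r):
            w = box(k - r)
            if w == 0.0:
                continue
            for lam in range(bins_t):
                surv[lam][k] += alpha * w * cubic_spline_survival(lam - t)
    return surv


ref = [0.0, 0.0, 1.0, 1.0]
test = [5.0, 5.0, 5.0, 5.0]
P = joint_survival(ref, test, bins_r=2, bins_t=8)
print(P[0], P[5], P[7])
```

Because $\Phi$ is smooth in its argument, the resulting estimate varies smoothly with the test intensities, which is what makes the derivative in Eqn. (3–19) available in closed form.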

3.2.4 Algorithm Summary

The registration algorithm can be summarized as follows,

1. For the current deformation field, interpolate the test image by $I_T \circ g(x; \mu)$. Calculate $P(i > \lambda, k; \mu)$ and $\frac{\partial P(i > \lambda, k; \mu)}{\partial \mu}$ using Eqn. (3–17) and Eqn. (3–19) respectively.

2. Compute $P(i > \lambda; \mu)$ as $\sum_{k \in L_R} P(i > \lambda, k; \mu)$, which is used to calculate the common term in both CCRE and the gradient of CCRE, i.e., $\log \frac{P(i > \lambda, k; \mu)}{p_R(k) \times P(i > \lambda; \mu)}$.

3. Compute the energy function and its gradient using the formulas given in Eqn. (3–15); a quasi-Newton method can then be used to numerically solve the optimization problem.

4. Update the deformation field $g(x; \mu)$. Stop the registration process if the difference between consecutive iterates is less than $\varepsilon = 0.01$, a pre-chosen tolerance; otherwise go to Step 1.
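The control flow of the four steps above can be sketched as a generic loop. This is only a skeleton under loud assumptions: a fixed-step gradient ascent stands in for the quasi-Newton update named in Step 3, the cost is a toy concave function rather than CCRE, and the names `maximize` and `toy_cost` are hypothetical.

```python
import math


def maximize(value_and_gradient, mu0, step=0.25, tol=0.01, max_iter=1000):
    """Iterate until consecutive parameter vectors differ by less than tol."""
    mu = list(mu0)
    for _ in range(max_iter):
        _, grad = value_and_gradient(mu)  # Steps 1-3: cost and gradient
        new_mu = [m + step * g for m, g in zip(mu, grad)]  # Step 4: update
        change = math.dist(new_mu, mu)
        mu = new_mu
        if change < tol:  # stopping rule with ε = tol
            break
    return mu


def toy_cost(mu):
    """Toy concave cost with its maximum at μ = (2, 2); not CCRE."""
    value = -sum((m - 2.0) ** 2 for m in mu)
    grad = [-2.0 * (m - 2.0) for m in mu]
    return value, grad


mu_hat = maximize(toy_cost, [0.0, 0.0])
print(mu_hat)
```

In an actual registration, `value_and_gradient` would wrap the computations of Eqns. (3–15), (3–17) and (3–19), and the update would come from the quasi-Newton scheme.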

3.3 Implementation Results

In this section, we present the results of applying our non-rigid registration algorithm to several data sets. The results are presented for synthetic as well as real data. The first set of experiments was done with synthetic motion. We show the advantage of using the CCRE measure in comparison to other information theoretic registration methods. We show that CCRE is not only more robust, but also converges faster than the others. We begin by applying CCRE to register image pairs for which the ground truth was available.

3.3.1 Synthetic Motion Experiments

In this section, we demonstrate the robustness property of CCRE and make a case for its use over mutual information in the alignment problem. The case will be made via experiments depicting faster convergence and superior performance under noisy inputs in matching image pairs misaligned by a synthetic non-rigid motion. Additionally, we will depict a larger capture range than MI-based methods in the estimation of the motion parameters.

The data we use for this experiment are corresponding slices from an MR T1 and T2 image pair from the BrainWeb site at the Montreal Neurological Institute [52]. They are originally aligned with each other. The two images are defined on a 1 mm isotropic voxel grid in the Talairach space, with dimension $256 \times 256$. We then apply a known non-rigid transformation to the T2 image, and the goal is to recover this deformation by applying our registration method. The mutual information scheme with which we compare was originally reported in [34, 53], where explicit gradient forms are presented, thus allowing for the application of gradient based optimization methods.

3.3.1.1 Convergence speed

In order to compare the convergence speed of CCRE versus MI, we design the experiment as follows: with the MR T1 and T2 image pair as our data, we choose the MR T1 image as the source; the target image is obtained by applying a known, procedurally generated, smooth non-rigid transformation to the MR T2 image. Notice the significant difference between the intensity profiles of the source and target images. For comparison purposes, we use the same gradient descent optimization scheme, let the two registration methods run for the same amount of time, and show the registration results visually and quantitatively.

Figure 3–2: Upper left, MR T1 image as source image; upper right, deformed MR T2 image as target image; lower left and right, results of estimated transformations using CCRE and MI applied to the source, respectively. Both algorithms run for 30 seconds using the same gradient descent technique.

The source and target image pair, along with the results of applying the transformations estimated using CCRE and MI to the source, are shown in Figure 3–2. Visually, the result generated by CCRE is more similar in shape to the target image than the one produced by MI.

Quantitative assessment of the accuracy of the registration is presented in Figure 3–3, where we plot the change of mean deformation error (MDE) obtained for the CCRE-based algorithm and the MI-based algorithm respectively. MDE is defined as $d_m = \frac{1}{\mathrm{card}(R)} \sum_{x_i \in R} \| g_0(x_i) - g(x_i) \|$, where $g_0(x_i)$ and $g(x_i)$ are the ground truth and estimated displacements respectively at voxel $x_i$, $\| \cdot \|$ denotes the Euclidean norm, and $R$ is the volume of the region of interest. In both cases the mean deformation error decreases with time, but the solid line (CCRE) decreases faster than the dotted line (MI). For example, it takes about 5 minutes for MI to bring the error below 1.2, while CCRE requires only about half of that time to reach the same level. This empirically validates the faster convergence of the CCRE-based algorithm over the MI-based algorithm.

Figure 3–3: Plot demonstrating the change of mean deformation error (vertical axis) for the CCRE and MI registration results with time in minutes (horizontal axis). The solid line shows the MDE for the CCRE registration result, while the dotted line shows the MDE for the MI result.
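The mean deformation error used in this comparison is simple to compute once the displacement fields are available. The sketch below assumes the ground truth and estimated displacements are given as two equal-length lists of vectors over the region of interest; the function name is hypothetical.

```python
import math


def mean_deformation_error(truth, estimate):
    """d_m = (1 / card(R)) Σ ||g0(x_i) - g(x_i)|| over paired displacements."""
    if len(truth) != len(estimate) or not truth:
        raise ValueError("need two non-empty lists of equal length")
    return sum(math.dist(g0, g) for g0, g in zip(truth, estimate)) / len(truth)


truth = [(1.0, 0.0), (0.0, 2.0)]
estimate = [(1.0, 0.0), (3.0, 6.0)]
mde = mean_deformation_error(truth, estimate)
print(mde)
```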

3.3.1.2 Registration accuracy

Using the same experimental setting as in the previous experiment, we present the registration error for our algorithm in the estimated non-rigid deformation field as an indicator of the accuracy of the estimated deformations. Figure 3–4 depicts the results obtained for this image pair. It is organized as follows, from left to right: the first row depicts the source image with the target image segmentation superposed to depict the amount of misalignment, the registered source image obtained using our algorithm superposed with the target segmentation, followed by the target image; the second row depicts the ground truth deformation field that we used to generate the target image from the MR T2 image, the estimated non-rigid deformation field, followed by a histogram of the estimated magnitude error. Note that the error distribution is mostly concentrated in the small error range, indicating the accuracy of our method. As a measure of the accuracy of our method, we also estimated the average, $\mu$, and the standard deviation, $\sigma$, of the error in the estimated non-rigid deformation field. The error was estimated as the angle between the ground truth and estimated displacement vectors. The average and standard deviation are 1.5139 and 4.3211 (in degrees) respectively, which is quite accurate.

Figure 3–4: Results of application of our algorithm to synthetic data (see text for details).
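The angular error statistic described above can be sketched as follows. The code makes assumptions that the text leaves open: vectors of zero length are skipped because the angle is undefined for them, and the population standard deviation is reported (the text does not say whether the population or sample form was used). The function name is hypothetical.

```python
import math
import statistics


def angular_errors_deg(truth, estimate):
    """Angle in degrees between each ground truth and estimated displacement."""
    angles = []
    for g0, g in zip(truth, estimate):
        n0, n1 = math.hypot(*g0), math.hypot(*g)
        if n0 == 0.0 or n1 == 0.0:
            continue  # angle undefined for a zero displacement
        cosine = sum(a * b for a, b in zip(g0, g)) / (n0 * n1)
        cosine = max(-1.0, min(1.0, cosine))  # guard against rounding
        angles.append(math.degrees(math.acos(cosine)))
    return angles


truth = [(1.0, 0.0), (0.0, 1.0)]
estimate = [(1.0, 0.0), (1.0, 1.0)]
angles = angular_errors_deg(truth, estimate)
mean_angle = statistics.mean(angles)
std_angle = statistics.pstdev(angles)
print(angles, mean_angle, std_angle)
```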

3.3.1.3 Noise immunity

In the next experiment, we compare the robustness of the two methods (CCRE and MI) in the presence of noise. Still using the MR T1 image slice from the previous experiment as our source image, we generate the target image by applying a fixed smooth synthetic deformation field. We conduct this experiment by varying the amount of Gaussian noise added and, for each instance of the added noise, registering the two images using the two techniques. We expect both schemes to fail at some level of noise (“failed” here means that the optimization algorithm diverged). By comparing the noise magnitude at the failure point, we can show the degree to which these methods are tolerant. The numerical schemes we used to implement these registrations are all based on the BFGS quasi-Newton algorithm.

The mean magnitude of the synthetic motion is 4.37 pixels, with a standard deviation of 1.8852. Table 3–1 shows the registration results for the two schemes. From the table, we observe that MI fails when the standard deviation of the noise is increased to 40, while CCRE is tolerant until 66, a significant difference when compared to MI.

Table 3–1: Comparison of the registration results between CCRE and MI for a fixed synthetic deformation field.

        CCRE                            MI
σ       MDE      Standard Deviation     MDE      Standard Deviation
10      1.0816   0.9345                 1.3884   1.4538
19      1.1381   1.1702                 1.4871   1.5052
30      1.1975   1.3484                 1.5204   1.5615
40      1.3373   1.6609                 FAIL
60      1.3791   1.9072
66      FAIL

This experiment shows that CCRE has greater noise immunity than MI when dealing with non-rigid motion.

3.3.1.4 Partial overlap

Figure 3–5 depicts an example of registration of the MR T1 and T2 data sets with large non-overlap. The left image of the figure depicts the MR T1 brain scan as the source image, and the right image shows the MR T2 data as the target. Note that the FOVs of the data sets are significantly non-overlapping. The non-overlap was simulated by cutting 66% of the MR T1 image (source image). The middle column depicts the transformed source image along with an edge map of the target (deformed MR T2 image) superimposed on the transformed source. As is evident, the registration is visually quite accurate.

Figure 3–5: Registration results of MR T1 and T2 image slices with large non-overlap. (Left) MR T1 source image before registration; (right) deformed T2 target image; (middle) the transformed MR image superimposed with the edge map from the target image.

3.3.2 Real Data Experiments

In this section, we present the performance of our method on a series of CT and MR data containing real non-rigid misalignments. For the purpose of comparison, we also apply traditional MI, implemented as presented in Mattes et al. [34], to these same data sets. The CT image is of size $(512, 512, 120)$ while the MR image size is $(512, 512, 142)$, and the voxel dimensions are $(0.46, 0.46, 1.5)$ mm and $(0.68, 0.68, 1.05)$ mm for CT and MR respectively. The registration was performed on reduced volumes ($210 \times 210 \times 120$) with the control knots placed every $16 \times 16 \times 16$ voxels. The program was written in the C++ programming language, and all experiments were run on a 2.6 GHz Pentium PC.

Table 3–2: Comparison of the total time taken to achieve registration by CCRE and MI.

Data set        1      2      3      4      5      6      7      8
CCRE Time (s)   4827   3452   4345   4038   3910   4510   5470   3721
MI Time (s)     9235   6344   10122  17812  12157  11782  13157  10057

We used a set of eight CT volumes, and the task was to register these eight volumes to the MR data chosen as the target image for all registrations, using both the CCRE and MI algorithms. Note that all CT and MR volumes are from different subjects and thus contain real non-rigid motion. The parameters used with both algorithms were identical. For both algorithms, the optimization of the cost function was halted when improvements of at least 0.0001 in the cost function could not be detected. The time required for registering all data sets with our algorithm as well as with the MI method is given in Table 3–2. This table shows that, on average, our CCRE algorithm is about 2.5 times faster than the traditional MI approach for this set of experiments. For brevity, we only show one registration result in Figure 3–6. Here, one slice of the volume is shown in the first row, with the source CT image at left and the reference image at right. The middle image shows the transformed CT image slice superimposed with the edge map from the target image. In the second row, the source image superimposed with the edge map from the target image is shown on the left, while the middle and right show the surfaces reconstructed from the transformed source using the CCRE method and from the target MR image respectively. From this figure, we can see that the source and target images depict considerable non-rigid changes in shape; nevertheless, our method was able to register these two images quite accurately. To validate the conformity of the two reconstructed surfaces, we randomly sample 30 points from the surface of the transformed source using CCRE, and then estimate the distances of these points to the surface of the target MR volume. The average of these distances is about 0.47 mm, which indicates very good agreement between the two surfaces. The resemblance of the shapes reconstructed from the transformed source to the target indicates that our CCRE algorithm succeeded in matching the source CT volume to the target MR image.

Figure 3–6: Registration results of different subjects of MR & CT brain data with real non-rigid motion (see text for details).

The accuracy of the information-theoretic algorithm for non-rigid registration problems was assessed quantitatively by means of a region-based segmentation task [54]. ROIs (whole brain, eyes) were segmented automatically in the eight CT data sets used as source images, and binary masks were created. The deformation fields between the CT and MR volumes were computed and used to project the masks from each CT volume to the MR volume. Contours were manually drawn on a few slices chosen at random in the MR volume (four slices per volume). Manual contours on MR and contours obtained automatically were then compared using an accepted similarity index, defined as two times the number of pixels in the intersection of the contours divided by the sum of the number of pixels within each contour [41]. This index varies between zero (complete disagreement) and one (complete agreement) and is sensitive to both displacement and differences in size and shape. Table 3–3 lists mean values of the similarity index for each structure. It is customarily accepted that a value of the similarity index above 0.80 indicates very good agreement between contours. Our results are well above this value. For comparison, we also computed the same index for the MI method. We can conclude from the table that our CCRE achieves better registration accuracy than MI for the task of non-rigid registration of real multi-modal images.
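The similarity index described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `similarity_index` and the representation of masks as equal-shaped nested lists of 0/1 values are hypothetical choices, not taken from the implementation used in the experiments.

```python
def similarity_index(mask_a, mask_b):
    """Similarity index S = 2 |A ∩ B| / (|A| + |B|) for two binary masks.

    Masks are equal-shaped nested lists of 0/1 values (an assumption made
    for this sketch; any boolean image representation would do).
    """
    inter = size_a = size_b = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            size_a += 1 if a else 0
            size_b += 1 if b else 0
            inter += 1 if (a and b) else 0
    if size_a + size_b == 0:
        return 1.0  # two empty masks agree trivially (a convention chosen here)
    return 2.0 * inter / (size_a + size_b)


# Toy masks: a manually drawn region and a projected region shifted by one pixel.
manual = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
projected = [[0, 0, 1, 1],
             [0, 0, 1, 1],
             [0, 0, 0, 0]]

# 2 overlapping pixels, 4 pixels in each mask: S = 2 * 2 / (4 + 4) = 0.5
print(similarity_index(manual, projected))
```

The one-pixel shift in the toy example halves the index, which illustrates why the measure is sensitive to displacement as well as to differences in size and shape.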

Table 3–3: Comparison of the value S of several brain structures for CCRE and MI.

Method  Structure     1      2      3      4      5      6      7      8
CCRE    Whole Brain   0.987  0.996  0.974  0.962  0.975  0.967  0.988  0.981
CCRE    Left Eye      0.925  0.935  0.925  0.907  0.875  0.890  0.834  0.871
CCRE    Right Eye     0.840  0.940  0.891  0.872  0.851  0.829  0.910  0.921
MI      Whole Brain   0.986  0.981  0.976  0.96   0.950  0.961  0.942  0.952
MI      Left Eye      0.911  0.893  0.904  0.791  0.853  0.810  0.851  0.853
MI      Right Eye     0.854  0.917  0.889  0.814  0.849  0.844  0.897  0.854

CHAPTER 4

DIVERGENCE MEASURES FOR GROUPWISE POINT-SETS REGISTRATION

Matching point patterns is ubiquitous in many fields of engineering and science, e.g., medical imaging, sports science, archaeology, and others. Point sets are widely used in computer vision to represent boundary points of shapes contained in images or any other salient features of objects contained in images. Given two or more images represented using the salient features contained therein, more often than not one is interested in matching these (feature) point patterns to determine a linear or a nonlinear transformation between the coordinates of the feature point sets. Such transformations capture the changes in the pattern geometry characterized by the given feature point set.

The primary technical challenge in using point-set representations of shapes is

the correspondence problem. Typically correspondences can be estimated once the

point-sets are properly aligned with appropriate spatial transformations. If the objects

at hand are deformable, the adequate transformation would obviously be a non-rigid

spatial mapping. Solving for non-rigid deformations between point-sets with unknown

correspondence is a hard problem. In fact, many current methods only attempt to solve

for an affine transformation for the alignment [55]. Furthermore, we also encounter a

bias problem in groupwise point-sets registration. If one arbitrarily chooses any one

of the given data sets as a reference, the estimated registration transformation would be

biased toward this chosen reference and it would be desirable to avoid such a bias. The

question that arises is: How do we align all the point-sets in a symmetric manner so that

there is no bias toward any particular point-set?

To overcome these problems, we present a novel approach to simultaneously register multiple point-sets and construct the atlas. The idea is to model each point set by a kernel probability density or distribution, then quantify the distance between these probability densities or distributions using information-theoretic measures. Figure 4–1 illustrates this idea, where the right column of the figure shows the density functions corresponding to the corpus callosum point-sets shown on the left. The distance is optimized over a space of coordinate transformations, yielding the desired registrations. Once all the point sets are deformed into the same shape, the distance measure between these distributions is minimized, since all the distributions are identical to each other. We impose regularization on each deformation field to prevent over-deformation of the point-sets (e.g., all the point-sets may deform into a single data point).

Figure 4–1: Illustration of corpus callosum point-sets represented as density functions.

The rest of the chapter is organized as follows: we begin by reviewing the related literature, followed by a description of the divergence measures we use to quantify the distance between densities or distributions. We then present the details of our energy function and the empirical way of estimating the cost functions and their derivatives. Finally, we show the experimental results at the end of this chapter.

4.1 Previous Work

Extensive studies on atlas construction for deformable shapes can be found in the literature, covering both theoretical and practical issues relating to computer vision and pattern recognition. According to the shape representation, they can be classified into two distinct categories. The first comprises methods dealing with shapes represented by feature point-sets. The second comprises everything else, including methods in which shapes are represented as curves and surfaces of the shape boundary; these curves and surfaces may be either intrinsically or extrinsically parameterized (e.g., using point locations and spline coefficients).

The work presented in [56] is a representative method using an intrinsic curve parameterization to analyze deformable shapes. Shapes are represented as elements of infinite-dimensional spaces, and their pairwise differences are quantified using the lengths of the geodesics connecting them on these spaces. The intrinsic mean (Karcher mean) can be computed as the point on the manifold of shapes that minimizes the sum of squared geodesic distances between this unknown point and each individual shape on the manifold. However, the curves are limited to closed curves, and the method has not been extended to 3D surface shapes. For methods using intrinsic curve or surface representations [56, 57, 58], further statistical analysis on these representations is much more difficult than analysis on the point representation, but the reward may be higher due to the use of an intrinsic higher-order representation.

Among the methods using a point-set parameterization, the idea of using non-rigid spatial mapping functions, specifically thin-plate splines [59, 60, 61], to analyze deformable shapes has been widely adopted. Bookstein’s work [59] initiated the research efforts on the use of thin-plate splines to model the deformation of shapes. This method is landmark-based; it avoids the correspondence problem since the placement of corresponding points is driven by the visual perception of experts. However, it suffers from the typical problems besetting landmark methods, e.g., inconsistency. Several significant articles on robust and non-rigid point set matching using thin-plate splines have been published by Rangarajan and collaborators [62, 60, 63]. In their recent work [60], they extend their approach to the construction of a mean shape from a set of unlabeled shapes represented by unlabeled point-sets. The main strength of their work is the ability to jointly determine the correspondences and the non-rigid transformation between each point set and the emerging mean shape using deterministic annealing and soft-assign. However, in their work, the stability of the registration result is not guaranteed in the case of data with outliers, and hence a good stopping criterion is required. Unlike their approach, we do not need to first solve a correspondence problem in order to subsequently solve a non-rigid registration problem.

The active shape model proposed in [64] utilized points to represent deformable shapes. This work pioneered the efforts in building point distribution models to understand deformable shapes [64, 65]. Objects are represented as carefully defined landmark points, and the variation of shapes is modeled using principal component analysis. These landmark points are acquired through a more or less manual landmarking process in which an expert goes through all the samples to mark corresponding points on each sample. It is a rather tedious process and its accuracy is limited. In recent work [66], the authors attempt to overcome this limitation by automatically solving for the correspondences in a non-rigid setting. The resulting algorithm is very similar to the earlier work in [58] and is restricted to curves. The work in [55] also uses 2D points to learn shape statistics; it is quite similar to the active shape model method except that more attention is paid to the process of generating sample point-sets from the shape. Unlike in our method, the transformations between curves are limited to rigid mappings, and the process is not symmetric.


Several papers in the point-sets alignment literature bear a close relation to the research reported here. For instance, Tsin and Kanade [67] proposed a kernel correlation based point set registration approach in which the cost function is proportional to the correlation of two kernel density estimates. It is similar to our work since we too model each of the point sets by a kernel density function and then quantify the (dis)similarity between them using an information-theoretic measure, followed by an optimization of the (dis)similarity function over a space of coordinate transformations yielding the desired transformation. The difference lies in the fact that the divergence measures used in our work are a lot more general than the information-theoretic measure used in [67], and can easily be extended to multiple point-sets. More recently, in [68], Glaunes et al. convert the point matching problem into an image matching problem by treating points as delta functions. They then "lift" these delta functions and diffeomorphically match them. The main problem with this technique is that it requires a 3D spatial integral which must be computed numerically, while we do not need this due to the empirical computation of the divergence measures. We will show in the experimental results that our method, when applied to match point-sets, achieves very good performance in terms of both robustness and accuracy.

4.2 Divergence Measures

In probability theory and information theory, divergence measures are measures that quantify the "distance" between probability distributions. If there are multiple distributions, the divergence serves as a measure of cohesion between these distributions. Since we are dealing with groupwise point-sets, which will be represented as multiple probability densities/distributions, we focus on divergence measures between multiple distributions.

4.2.1 Jensen-Shannon Divergence

Many information and divergence measures exist in the literature on information theory and statistics. The most famous among them is the Kullback-Leibler (KL) divergence. The KL divergence (also known as the relative entropy) between two densities $p$ and $q$ is defined as

$$D_{KL}(p\|q) = \int p(x) \log \frac{p(x)}{q(x)}\,dx$$

It is convex in $p$, non-negative (though not necessarily finite), and is zero if and only if $p = q$. In information theory, it has an interpretation in terms of the length of encoded messages from a source which emits symbols according to a probability density function. While the familiar Shannon entropy gives a lower bound on the average length per symbol a code can achieve, the KL divergence between $p$ and $q$ gives the penalty (in length per symbol) incurred by encoding a source with density $p$ under the assumption that it really has density $q$; this penalty is commonly called redundancy.

To illustrate this, consider the Morse code, designed to send messages in English. The Morse code encodes the letter "E" with a single dot and the letter "Q" with a sequence of four dots and dashes. Because "E" is used frequently in English and "Q" seldom, this makes for efficient transmission. However, if one wanted to use the Morse code to send messages in Chinese pinyin, which might use "Q" more frequently, one would find the code less efficient. If we assume counterfactually that the Morse code is optimal for English, this difference in efficiency is the redundancy.
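The coding interpretation can be checked numerically on a toy alphabet. In the sketch below, the two four-symbol frequency tables are invented for illustration (they are not real English or pinyin statistics), and the function names are hypothetical. The redundancy is computed as the average code length under the mismatched code minus the entropy of the actual source, and compared with the KL divergence.

```python
import math


def entropy_bits(p):
    """Shannon entropy in bits: the best achievable average code length."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy_bits(p, q):
    """Average code length when symbols follow p but the code is ideal for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_bits(p, q):
    """KL divergence D(p || q) in bits for discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical symbol frequencies (illustrative only).
english_like = [0.5, 0.25, 0.125, 0.125]  # the code is designed for this source
pinyin_like = [0.125, 0.125, 0.25, 0.5]   # the source actually being encoded

redundancy = cross_entropy_bits(pinyin_like, english_like) - entropy_bits(pinyin_like)
print(redundancy)                          # extra bits per symbol
print(kl_bits(pinyin_like, english_like))  # the same quantity, as a KL divergence
```

For these tables the mismatched code spends 2.625 bits per symbol against an entropy of 1.75 bits, so the redundancy is 0.875 bits per symbol.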

Notice that the KL divergence is not symmetric; a popular way to symmetrize it is

$$J(p, q) = \frac{1}{2}\left(D_{KL}(p\|q) + D_{KL}(q\|p)\right)$$

which is called the J-divergence. The Jensen-Shannon (JS) divergence, first introduced in [69], serves as a measure of cohesion between multiple probability distributions. It has been used by some researchers as a dissimilarity measure for image registration and retrieval applications [70, 71] with very good results. It has many desirable properties, to name a few: 1) the square root of the JS-divergence (in the case when its parameter is fixed to 0.5) is a metric [72]; 2) the JS-divergence relates to other information-theoretic functionals, such as the relative entropy or the Kullback divergence, and hence it shares their mathematical properties as well as their intuitive appeal; 3) the distributions compared using the JS-divergence can be weighted, which allows one to take into account the different sizes of the point set samples from which the probability distributions are computed; 4) the JS-divergence also allows us to have a different number of cluster centers in each point-set. There is no requirement that the cluster centers be in correspondence, as is required by Chui et al. [73]. Given $n$ probability density functions $p_i$, $i \in \{1, \dots, n\}$, the JS-divergence of the $p_i$ is defined by

$$JS_\pi(p_1, p_2, \dots, p_n) = H\left(\sum_i \pi_i p_i\right) - \sum_i \pi_i H(p_i) \tag{4–1}$$

where $\pi = \{\pi_1, \pi_2, \dots, \pi_n \mid \pi_i > 0, \sum_i \pi_i = 1\}$ are the weights of the probability density functions $p_i$ and $H(p_i)$ is the Shannon entropy. The two terms on the right-hand side of Equation (4–1) are the entropy of $p := \sum_i \pi_i p_i$ (the $\pi$-convex combination of the $p_i$) and the same convex combination of the respective entropies. We can show that the JS-divergence can be derived from the KL divergence:

$$JS_\alpha(p_1, p_2) = \alpha D_{KL}\big(p_1 \,\|\, \alpha p_1 + (1-\alpha) p_2\big) + (1-\alpha) D_{KL}\big(p_2 \,\|\, \alpha p_1 + (1-\alpha) p_2\big) \tag{4–2}$$

where $\alpha \in (0, 1)$ is a fixed parameter; we will also consider its straightforward generalization to $n$ distributions.
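The agreement between the entropy form (4–1) and the KL form (4–2) can be verified numerically for discrete distributions. The sketch below is illustrative only; the function names and the example distributions are assumptions introduced here, and the KL form is written for $n$ weighted distributions.

```python
import math


def entropy(p):
    """Shannon entropy (natural logarithm) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q):
    """KL divergence D(p || q) for discrete distributions."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def mixture(dists, weights):
    """Convex combination sum_i w_i p_i, computed bin by bin."""
    return [sum(w * d[k] for w, d in zip(weights, dists))
            for k in range(len(dists[0]))]


def js_entropy_form(dists, weights):
    """JS divergence as in Eq. (4-1): H(mixture) - sum_i w_i H(p_i)."""
    m = mixture(dists, weights)
    return entropy(m) - sum(w * entropy(d) for w, d in zip(weights, dists))


def js_kl_form(dists, weights):
    """JS divergence as a weighted sum of KL divergences to the mixture."""
    m = mixture(dists, weights)
    return sum(w * kl(d, m) for w, d in zip(weights, dists))


p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]
weights = [0.5, 0.5]
print(js_entropy_form([p1, p2], weights))
print(js_kl_form([p1, p2], weights))
```

Both forms return the same value; for two distributions with disjoint support and equal weights the divergence reaches its maximum, $\ln 2$.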

4.2.2 CDF-JS Divergence

In Chapter 2, we defined an entropy measure which is based on the probability distribution instead of the density function. The distribution function is more regular because it is defined in an integral form, unlike the density function, which is the derivative of the distribution. The definition of Cumulative Residual Entropy (CRE) also preserves the well-established principle that the logarithm of the probability of an event should represent the information content in the event. CRE has been shown to be more immune to noise and outliers. Based on this idea, we can define a KL-divergence measure between Cumulative Distribution Functions (CDFs).

Definition: Let $\Pr(X_1 > x)$ and $\Pr(X_2 > x)$ be the cumulative residual distributions of two random variables $X_1$ and $X_2$ respectively. We define the CDF-KL divergence by

$$KD(P_1, P_2) = \int \Pr(X_1 > x) \ln \frac{\Pr(X_1 > x)}{\Pr(X_2 > x)}\,dx \tag{4–3}$$

Following the same relationship as between the Jensen-Shannon divergence and the KL divergence, we can derive the so-called CDF-JS divergence (denoted by $\mathcal{J}$) from the definition of the CDF-KL divergence; the result is stated in the following theorem.

Theorem 5 Given $n$ probability distributions $P_i$, $i \in \{1, \dots, n\}$, the CDF-JS divergence of the $P_i$ is given by

$$\mathcal{J}(P_1, P_2, \dots, P_n) = \mathcal{E}\left(\sum_i \pi_i P_i\right) - \sum_i \pi_i \mathcal{E}(P_i) \tag{4–4}$$

where $\pi = \{\pi_1, \pi_2, \dots, \pi_n \mid \pi_i > 0, \sum_i \pi_i = 1\}$ are the weights of the probability distributions $P_i$ and $\mathcal{E}$ is the Cumulative Residual Entropy defined in Eqn. (2–3) of Chapter 2.


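Proof: Without loss of generality, we prove the result for the case of two random variables, for which the CDF-JS divergence can be written as follows:

$$\begin{aligned}
\mathcal{J}(P_1, P_2) &= \alpha\, KD(P_1, P) + (1-\alpha)\, KD(P_2, P) \\
&= \alpha \int \Pr(X_1 > x) \ln \frac{\Pr(X_1 > x)}{\Pr(X > x)}\,dx + (1-\alpha) \int \Pr(X_2 > x) \ln \frac{\Pr(X_2 > x)}{\Pr(X > x)}\,dx \\
&= \alpha \int \Pr(X_1 > x) \ln \Pr(X_1 > x)\,dx + (1-\alpha) \int \Pr(X_2 > x) \ln \Pr(X_2 > x)\,dx \\
&\quad - \int \big[\alpha \Pr(X_1 > x) + (1-\alpha) \Pr(X_2 > x)\big] \ln \Pr(X > x)\,dx \\
&= -\alpha\, \mathcal{E}(P_1) - (1-\alpha)\, \mathcal{E}(P_2) \\
&\quad - \int \big[\alpha \Pr(X_1 > x) + (1-\alpha) \Pr(X_2 > x)\big] \ln \Pr(X > x)\,dx
\end{aligned} \tag{4–5}$$

where $P$ is the distribution function corresponding to the density function $p = \alpha p_1 + (1-\alpha) p_2$, the convex combination of the two probability densities. Therefore

$$\begin{aligned}
\Pr(X > x) &= \int_x^\infty p(u)\,du \\
&= \int_x^\infty \big[\alpha p_1(u) + (1-\alpha) p_2(u)\big]\,du \\
&= \alpha \Pr(X_1 > x) + (1-\alpha) \Pr(X_2 > x)
\end{aligned} \tag{4–6}$$

Consequently, the CDF-JS divergence for two random variables can be rewritten as

$$\begin{aligned}
\mathcal{J}(P_1, P_2) &= -\alpha\, \mathcal{E}(P_1) - (1-\alpha)\, \mathcal{E}(P_2) - \int \Pr(X > x) \ln \Pr(X > x)\,dx \\
&= \mathcal{E}(P) - \alpha\, \mathcal{E}(P_1) - (1-\alpha)\, \mathcal{E}(P_2)
\end{aligned} \tag{4–7}$$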
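The two-variable identity can be checked numerically. The sketch below is an illustration under assumptions introduced here: the survival functions are those of two exponential random variables (rates 1 and 2), sampled on a uniform grid truncated at $x = 60$, and all integrals are approximated by the trapezoidal rule. For an exponential variable with rate $\lambda$ the cumulative residual entropy is $1/\lambda$, which gives an independent check of the `cre` helper.

```python
import math


def trapezoid(values, xs):
    """Trapezoidal rule for samples `values` taken at grid points `xs`."""
    return sum(0.5 * (values[k] + values[k + 1]) * (xs[k + 1] - xs[k])
               for k in range(len(xs) - 1))


def cre(survival, xs):
    """Cumulative residual entropy: -∫ S(x) ln S(x) dx for a sampled S(x) = Pr(X > x)."""
    return trapezoid([-s * math.log(s) if s > 0.0 else 0.0 for s in survival], xs)


def cdf_kl(s1, s2, xs):
    """CDF-KL divergence of Eq. (4-3) between two sampled survival functions."""
    return trapezoid([a * math.log(a / b) if a > 0.0 else 0.0
                      for a, b in zip(s1, s2)], xs)


def cdf_js(s1, s2, alpha, xs):
    """CDF-JS divergence in the entropy form of Eq. (4-7)."""
    mix = [alpha * a + (1.0 - alpha) * b for a, b in zip(s1, s2)]
    return cre(mix, xs) - alpha * cre(s1, xs) - (1.0 - alpha) * cre(s2, xs)


# Survival functions of exponential variables with rates 1 and 2 (illustrative).
xs = [0.01 * k for k in range(6001)]
s1 = [math.exp(-x) for x in xs]
s2 = [math.exp(-2.0 * x) for x in xs]
alpha = 0.5
mix = [alpha * a + (1.0 - alpha) * b for a, b in zip(s1, s2)]

entropy_form = cdf_js(s1, s2, alpha, xs)
kl_form = alpha * cdf_kl(s1, mix, xs) + (1.0 - alpha) * cdf_kl(s2, mix, xs)
print(entropy_form, kl_form)
```

The two printed values agree up to rounding error, because the integrands of the two forms coincide at every grid point.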

4.3 Methodology

In this section, we present the details of the proposed simultaneous atlas construction and

non-rigid registration method. Note that atlas construction normally requires the task of


non-rigid registration following which an atlas is constructed from the registered data.

However, in our work, the atlas emerges as a byproduct of the non-rigid registration. The

basic idea is to model each point set by a probability density or distribution function, then

quantify the distance between these functions using an information-theoretic measure.

The distance measure is optimized over a space of coordinate transformations yielding the

desired transformations. We will begin by presenting the energy function for solving the

groupwise point-sets registration problem.

4.3.1 Energy Function for Groupwise Point-sets Registration

We use the following notation: the data point-sets are denoted by $\{X^p, p \in \{1, \dots, N\}\}$. Each point-set $X^p$ consists of points $\{x_i^p \in \mathbb{R}^D, i \in \{1, \dots, n_p\}\}$. The atlas point-set is denoted by $Z$. Assume that each point set $X^p$ is related to $Z$ via a function $f^p$, and that $\mu^p$ is the set of transformation parameters associated with each function $f^p$. To compute the mean shape from these point-sets and register them to the emerging mean shape, we need to recover these transformation parameters. This problem can be modeled as an optimization problem with the objective function being the JS-divergence or CDF-JS divergence between the distributions of the deformed point-sets, represented as $P_i = p(f^i(X^i))$. The atlas construction problem can now be formulated as

$$\min_{\mu^i} D(P_1, P_2, \dots, P_N) + \lambda \sum_{i=1}^{N} \|L f^i\|^2 \tag{4–8}$$

In (4–8), $D$ is the divergence measure for multiple distributions, for which we propose to use either the JS divergence or the CDF-JS divergence. The weight parameter $\lambda$ is a positive constant, and the operator $L$ determines the kind of regularization imposed. For example, $L$ could correspond to a thin-plate spline, a Gaussian radial basis function, etc. Each choice of $L$ is in turn related to a kernel and a metric of the deformation from and to $Z$.

Following the approach in [73], we choose the thin-plate spline (TPS) to represent the non-rigid deformation. Given $n$ control points $x_1, \dots, x_n$ in $\mathbb{R}^d$, a general non-rigid mapping $f: \mathbb{R}^d \to \mathbb{R}^d$ represented by a thin-plate spline can be written analytically as

$$f(x) = W U(x) + A x + t$$

Here $Ax + t$ is the linear part of $f$. The nonlinear part is determined by a $d \times n$ matrix $W$, and $U(x)$ is an $n \times 1$ vector consisting of $n$ basis functions $U_i(x) = U(x, x_i) = U(\|x - x_i\|)$, where $U(r)$ is the kernel function of the thin-plate spline. For example, if the dimension is 2 ($d = 2$) and the regularization functional is defined on the second derivatives of $f$, we have $U(r) = \frac{1}{8\pi} r^2 \ln(r)$. Therefore, the cost function for non-rigid registration can be formulated as an energy functional in a regularization framework, where the regularization term in Equation (4–8) is governed by the bending energy of the thin-plate spline warping and can be explicitly given by $\mathrm{trace}(W K W^T)$, where $K = (K_{ij})$, $K_{ij} = U(p_i, p_j)$, describes the internal structure of the control point sets. Note that the linear part can be obtained by an initial affine registration; an optimization can then be performed to find the parameter $W$.
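The thin-plate spline mapping and its bending energy can be written out directly from the formulas above. The following is a minimal sketch for the 2D case, not the implementation used in this work: the function names, the list-based matrix layout, and the convention $U(0) = 0$ are assumptions made for illustration.

```python
import math


def tps_kernel(r):
    """2D thin-plate spline kernel U(r) = r^2 ln(r) / (8 pi), taking U(0) = 0."""
    return 0.0 if r == 0.0 else (r * r * math.log(r)) / (8.0 * math.pi)


def tps_map(x, control_points, W, A, t):
    """Evaluate f(x) = W U(x) + A x + t.

    W is a d x n matrix (list of d rows), A is d x d, and t has length d,
    where n is the number of control points and d the dimension of x.
    """
    u = [tps_kernel(math.dist(x, c)) for c in control_points]
    d = len(x)
    return [
        sum(W[r][k] * u[k] for k in range(len(u)))   # nonlinear part
        + sum(A[r][c] * x[c] for c in range(d))      # linear part
        + t[r]                                       # translation
        for r in range(d)
    ]


def bending_energy(W, control_points):
    """trace(W K W^T) with K_ij = U(||c_i - c_j||)."""
    n = len(control_points)
    K = [[tps_kernel(math.dist(control_points[i], control_points[j]))
          for j in range(n)] for i in range(n)]
    return sum(row[i] * K[i][j] * row[j]
               for row in W for i in range(n) for j in range(n))


control_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
W_zero = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # no nonlinear warping
A_identity = [[1.0, 0.0], [0.0, 1.0]]
t_shift = [0.5, -0.25]

# With W = 0 the map reduces to its affine part: (2, 3) -> (2.5, 2.75).
print(tps_map((2.0, 3.0), control_points, W_zero, A_identity, t_shift))
print(bending_energy(W_zero, control_points))  # 0.0: affine maps cost nothing
```

Setting $W = 0$ leaves only the affine part, which has zero bending energy; this matches the strategy of fixing the linear part by an initial affine registration and then optimizing over $W$ alone.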

4.3.2 JS Divergence in a Hypothesis Testing Framework

In this section we show that the Jensen-Shannon divergence can be interpreted in the framework of statistical hypothesis testing. To see this, we construct a likelihood ratio between i.i.d. samples drawn from a mixture ($\sum_a \pi_a p_a$) and i.i.d. samples drawn from a heterogeneous collection of densities ($p_1, p_2, \dots, p_N$), with the samples being indexed by the specific member distribution in the family from which they are drawn. Assume that $n_1$ samples are drawn from $p_1$, $n_2$ from $p_2$, etc. Let the total number of pooled samples be defined as $M \stackrel{\mathrm{def}}{=} \sum_{a=1}^{N} n_a$. The likelihood ratio is then

$$\Lambda = \frac{\prod_{k=1}^{M} \sum_{a=1}^{N} \pi_a p_a(x_k)}{\prod_{a=1}^{N} \prod_{k_a=1}^{n_a} p_a(x^a_{k_a})} \tag{4–9}$$

where $\{x_k\}$ consists of the points $\{x_i^a, i \in \{1, \dots, n_a\}, a \in \{1, \dots, N\}\}$, i.e., the pooled data of all the samples. In contrast to the typical statistical test relative to a threshold, we seek the maximum of the likelihood ratio in Eqn. (4–9). The following theorem shows the relationship between the Jensen-Shannon divergence and the likelihood ratio.


Theorem 6 Given $N$ probability density functions $p_a$, $a \in \{1, \dots, N\}$, maximizing the likelihood ratio in Eqn. (4–9) is equivalent to minimizing the Jensen-Shannon divergence between the $N$ probability densities $p_a$, $a \in \{1, \dots, N\}$.

Proof: The proof follows by taking the logarithm of the likelihood ratio and then using the weak law of large numbers to show that the log-likelihood ratio, normalized by the number of samples, is the negative of the Jensen-Shannon divergence.

We seek to maximize the probability that the samples are drawn from the mixture rather than from separate members of the family ($p_1, p_2, \dots, p_N$). In the context of groupwise matching of point-sets, this makes eminent sense since maximizing the above ratio is tantamount to increasing the chance that all of the observed point-sets are warped versions of the same underlying warped and pooled data model. The notion of the pooled data model is defined as follows. In our process of groupwise registration, the warping does not have a fixed target data set. Instead, the warping is between the input data sets and an evolving target which we call the pooled model. The target evolves to a fully registered pooled data set at the end of the optimization process. The pooled model then consists of input data sets which have undergone groupwise matching and are now fully registered with each other. The connection to the JS-divergence arises from the fact that the negative logarithm of the above ratio (Eqn. 4–9) asymptotically converges to the JS-divergence when the samples are assumed to be drawn from the mixture $\sum_a \pi_a p_a$.
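The asymptotic connection between the likelihood ratio and the JS-divergence can be illustrated by simulation. The sketch below makes several simplifying assumptions that are not part of the original derivation: discrete probability mass functions stand in for densities, the sample counts $n_a$ are chosen proportional to the weights $\pi_a$, and the negative log-likelihood ratio is divided by the total number of samples $M$ before being compared with the exact JS-divergence.

```python
import math
import random


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def js_divergence(dists, weights):
    """Exact JS divergence: H(sum_a pi_a p_a) - sum_a pi_a H(p_a)."""
    mix = [sum(w * d[k] for w, d in zip(weights, dists))
           for k in range(len(dists[0]))]
    return entropy(mix) - sum(w * entropy(d) for w, d in zip(weights, dists))


def neg_log_ratio_per_sample(samples_by_dist, dists, weights):
    """-(1/M) log(Lambda) for the ratio in Eq. (4-9), using pmfs as densities."""
    total, count = 0.0, 0
    for a, samples in enumerate(samples_by_dist):
        for x in samples:
            mix = sum(w * d[x] for w, d in zip(weights, dists))
            total += math.log(dists[a][x]) - math.log(mix)
            count += 1
    return total / count


dists = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # illustrative pmfs
weights = [0.5, 0.5]
counts = [5000, 5000]                        # n_a proportional to pi_a

rng = random.Random(0)                       # fixed seed for repeatability
samples_by_dist = [rng.choices(range(3), weights=d, k=n)
                   for d, n in zip(dists, counts)]

estimate = neg_log_ratio_per_sample(samples_by_dist, dists, weights)
exact = js_divergence(dists, weights)
print(estimate, exact)
```

With ten thousand pooled samples the two numbers typically agree to about two decimal places, and the gap shrinks as the sample size grows.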

4.3.3 Unbiasedness Property of the Divergence Measures

Typically we are required to construct an atlas from a very large number of point-sets, and this process usually takes a long time since the computational complexity grows polynomially with the number of point-sets ($N$) that we want to register. However, the following hierarchical method significantly reduces the computational complexity.

Assume that we are given $N$ point-sets from which we are going to construct the atlas. We can divide the $N$ point-sets into $m$ subsets (generally $m \ll N$) and construct $m$ atlases, one from each subset, using our algorithm, so that all the point-sets inside each subset are registered. Then we can either construct a single atlas from these $m$ atlas point-sets, or further divide the $m$ atlas point-sets into even smaller subsets and follow the same process until a single atlas is constructed. The remaining question is whether the atlas thus obtained is biased or not. The following theorem leads us to the answer.

Theorem 7 Given $N$ probability distributions $P_a$, $a \in \{1, \dots, N\}$, each having a weight $\pi_a$ in the JS divergence or CDF-JS divergence, suppose we further divide the $N$ distributions into $m$ subsets, such that the $i$th subset contains $n_i$ distributions $\{P_a, a \in \{k_1^{(i)}, k_2^{(i)}, \dots, k_{n_i}^{(i)}\}\}$, and $\sum_i n_i = N$. Assume $S_i$ is the convex combination of all the distributions in the $i$th subset, with the weights $\pi_{k^{(i)}} / \beta_i$, where $\beta_i = \sum_j \pi_{k_j^{(i)}}$, i.e. $S_i = \sum_{j=1}^{n_i} \pi_{k_j^{(i)}} P_{k_j^{(i)}} / \beta_i$. We then have the following relationship between the JS divergence of the $P_a$s and the divergence of the $S_i$s:

$$D_\pi(P_1, P_2, \cdots, P_N) - D_\beta(S_1, S_2, \cdots, S_m) = \sum_{i=1}^{m} \beta_i\, D_{\pi_{k^{(i)}}/\beta_i}\left(P_{k_1^{(i)}}, P_{k_2^{(i)}}, \cdots, P_{k_{n_i}^{(i)}}\right) \tag{4–10}$$

Proof: It is trivial to derive the relationship in Eqn. (4–10) by simple algebraic

operations.
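The algebra behind Eqn. (4–10) can be confirmed numerically for the JS divergence. In the sketch below, the four discrete distributions, their weights, and the partition into two subsets are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def mixture(dists, weights):
    return [sum(w * d[k] for w, d in zip(weights, dists))
            for k in range(len(dists[0]))]


def js(dists, weights):
    """JS divergence: H(weighted mixture) - weighted sum of entropies."""
    return entropy(mixture(dists, weights)) - sum(
        w * entropy(d) for w, d in zip(weights, dists))


# Four distributions P_a with weights pi_a, split into m = 2 subsets.
P = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.3, 0.3, 0.4], [0.25, 0.5, 0.25]]
pi = [0.1, 0.2, 0.3, 0.4]
subsets = [[0, 1], [2, 3]]

# beta_i is the total weight of subset i; S_i is the subset's renormalised mixture.
beta = [sum(pi[k] for k in s) for s in subsets]
S = [mixture([P[k] for k in s], [pi[k] / b for k in s])
     for s, b in zip(subsets, beta)]

lhs = js(P, pi) - js(S, beta)
rhs = sum(b * js([P[k] for k in s], [pi[k] / b for k in s])
          for s, b in zip(subsets, beta))
print(lhs, rhs)
```

The identity relies only on the fact that the $\beta$-weighted mixture of the $S_i$ equals the $\pi$-weighted mixture of the $P_a$, so the same check would go through with the Shannon entropy replaced by another entropy functional such as the cumulative residual entropy.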

In our registration algorithm, all the point-sets are represented as probability distributions, and the atlas thus constructed can be considered as a convex combination of these distributions. Therefore, we can treat the $P_a$s and $S_i$s as the distributions corresponding to the point-sets and to the atlases constructed from the subsets, respectively. From Theorem 7, we know that the relationship in Eqn. (4–10) holds between the JS divergence of the $P_a$s and that of the $S_i$s. Notice that the right-hand side of Eqn. (4–10) is the weighted sum of the JS/CDF-JS divergences of the distributions in all the subsets, which are minimized in each step of the hierarchical method we proposed. Intuitively, if these point sets are aligned properly, the corresponding distribution functions should be statistically similar. Therefore the divergences of all the subsets should be zero or very close to zero, which means the right-hand side of Eqn. (4–10) is approximately zero. Consequently, the JS/CDF-JS divergence of the $P_a$s and the divergence of the $S_i$s are approximately equal, so minimizing the JS/CDF-JS divergence of all the resultant atlas point-sets is equivalent to minimizing the divergence of the original point-sets, implying that there is no bias toward any particular partitioning of the point-sets.

Having introduced the cost function and the transformation model, the task now is to design an efficient way to estimate empirical divergence measures between multiple densities or distributions and to derive the analytic gradient of the estimated divergence in order to achieve the optimal solution efficiently. We design two completely different approaches for estimating the JS divergence and the CDF-JS divergence. We use a finite mixture model for estimating the JS divergence and the Parzen window technique for the CDF-JS divergence, the details of which are introduced next.

4.3.4 Estimating JS and its Derivative

4.3.4.1 Finite mixture models

Considering the point set as a collection of Dirac delta functions, it is natural to think of a finite mixture model as a representation of a point set. The most frequently used mixture model, the Gaussian mixture [74], is defined as a convex combination of Gaussian component densities.

To model each point-set as a Gaussian mixture, we define a set of cluster centers, one for each point-set, to serve as the Gaussian mixture centers. Since the feature point-sets are usually highly structured, we can expect them to cluster well. Furthermore, we can greatly improve the efficiency of the algorithm by using a limited number of clusters. Note that we can choose the cluster centers to be the point-set itself if the point-set is quite small. The cluster center point-sets are denoted by $\{V^p, p \in \{1, \dots, N\}\}$. Each point-set $V^p$ consists of points $\{v_i^p \in \mathbb{R}^D, i \in \{1, \dots, K_p\}\}$. Note that there are $K_p$ points in each $V^p$, and the number of clusters for each point-set may be different (in our implementation, the number of clusters was usually chosen to be proportional to the size of the point-set). The cluster centers are estimated by applying a clustering process to the original sample points $\{x_i^p\}$, and we only need to do this once, before the process of joint atlas estimation and point-sets registration. In our implementation, we utilize a deterministic annealing (DA) procedure, with its proven benefit of robustness in clustering [75]. We begin by specifying the density function of each point set.

$$p_p(x) = \sum_{a=1}^{K_p} \alpha_a^p\, p(x \mid v_a^p) \tag{4–11}$$

In Equation (4–11), the occupancy probability, which is different for each data point-set, is denoted by $\alpha^p$. The component densities $p(x \mid v_a^p)$ are

$$p(x \mid v_a^p) = \frac{1}{(2\pi)^{D/2}\, \Sigma_a^{1/2}} \exp\left(-\frac{1}{2} (x - v_a^p)^T \Sigma_a^{-1} (x - v_a^p)\right) \tag{4–12}$$

The probability of the point set $X^p$ coming from this mixture is then

$$\Pr(X^p \mid V^p, \alpha^p) = \prod_{i=1}^{n_p} p_p(x_i^p) = \prod_{i=1}^{n_p} \sum_{a=1}^{K_p} \alpha_a^p\, p(x_i^p \mid v_a^p) \tag{4–13}$$

Later, we set the occupancy probability to be uniform and make the covariance matrices $\Sigma_a$ proportional to the identity matrix in order to simplify the atlas estimation procedure.
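The mixture density and the point-set likelihood can be sketched for the simplified setting described above (uniform occupancy probabilities and covariances $\sigma^2 I_D$). This is an illustration, not the implementation used in this work. The function names are hypothetical, and the Gaussian is normalised with the usual factor $(2\pi\sigma^2)^{D/2}$, which is an assumption about how the $\Sigma_a^{1/2}$ term in Eqn. (4–12) is to be read.

```python
import math


def isotropic_gaussian(x, center, sigma):
    """Gaussian density with covariance sigma^2 * I, evaluated at point x."""
    d = len(x)
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / (
        (2.0 * math.pi * sigma ** 2) ** (d / 2.0))


def mixture_density(x, centers, sigma):
    """Mixture density of Eq. (4-11) with uniform occupancy probabilities 1/K."""
    return sum(isotropic_gaussian(x, c, sigma) for c in centers) / len(centers)


def log_likelihood(points, centers, sigma):
    """Logarithm of the product in Eq. (4-13) for one point-set."""
    return sum(math.log(mixture_density(x, centers, sigma)) for x in points)


# A 1D toy example: two cluster centers and three sample points.
centers = [[-1.0], [2.0]]
points = [[-1.2], [1.8], [2.1]]
print(mixture_density([0.0], centers, 0.5))
print(log_likelihood(points, centers, 0.5))
```

Working with the log-likelihood rather than the raw product in Eqn. (4–13) avoids numerical underflow when a point-set contains many points.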

For simplicity, we choose $\beta_i = \frac{1}{N}$, $\forall i \in \{1, 2, \dots, N\}$. Let $Q_p^{x_i^j} := \sum_{a=1}^{K} \alpha_a^p\, p(f^j(x_i^j) \mid f^p(v_a^p))$ be a mixture model containing component densities $p(f^j(x_i^j) \mid f^p(v_a^p))$,

$$p(f^j(x_i^j) \mid f^p(v_a^p)) = \frac{1}{(2\pi)^{D/2}\, \Sigma_a^{1/2}} \exp\left(-\frac{1}{2} \big(f^j(x_i^j) - f^p(v_a^p)\big)^T \Sigma_a^{-1} \big(f^j(x_i^j) - f^p(v_a^p)\big)\right) \tag{4–14}$$

where $\{\Sigma_a, a \in \{1, \dots, K\}\}$ is the set of cluster covariance matrices. For the sake of simplicity and ease of implementation, we assume that the occupancy probabilities are uniform ($\alpha_a^p = \frac{1}{K}$) and the covariance matrices $\Sigma_a$ are isotropic, diagonal, and identical ($\Sigma_a = \sigma^2 I_D$). Having specified the density function of the data, we can then rewrite the divergence term in Equation (4–8) as follows,

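$$JS_\beta(P_1, P_2, \dots, P_N) = \frac{1}{N} \left\{ \Big[H\Big(\sum_i \tfrac{1}{N} P_i\Big) - H(P_1)\Big] + \Big[H\Big(\sum_i \tfrac{1}{N} P_i\Big) - H(P_2)\Big] + \cdots + \Big[H\Big(\sum_i \tfrac{1}{N} P_i\Big) - H(P_N)\Big] \right\} \tag{4–15}$$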

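For each term in the equation, we can estimate the entropy using the weak law of large numbers, which gives

$$\begin{aligned}
H\Big(\sum_i \tfrac{1}{N} P_i\Big) - H(P_j) &= -\frac{1}{n_j} \sum_{i=1}^{n_j} \log \frac{Q_1^{x_i^j} + Q_2^{x_i^j} + \cdots + Q_N^{x_i^j}}{N} + \frac{1}{n_j} \sum_{i=1}^{n_j} \log Q_j^{x_i^j} \\
&= \frac{1}{n_j} \sum_{i=1}^{n_j} \log \frac{N Q_j^{x_i^j}}{Q_1^{x_i^j} + Q_2^{x_i^j} + \cdots + Q_N^{x_i^j}}
\end{aligned}$$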

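Combining these terms we have,

$$\begin{aligned}
JS(P_1, P_2, \dots, P_N) ={}& \frac{1}{n_1} \sum_{i=1}^{n_1} \log \frac{N Q_1^{x_i^1}}{Q_1^{x_i^1} + Q_2^{x_i^1} + \cdots + Q_N^{x_i^1}} + \frac{1}{n_2} \sum_{i=1}^{n_2} \log \frac{N Q_2^{x_i^2}}{Q_1^{x_i^2} + Q_2^{x_i^2} + \cdots + Q_N^{x_i^2}} \\
& + \cdots + \frac{1}{n_N} \sum_{i=1}^{n_N} \log \frac{N Q_N^{x_i^N}}{Q_1^{x_i^N} + Q_2^{x_i^N} + \cdots + Q_N^{x_i^N}}
\end{aligned} \tag{4–16}$$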
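The empirical estimate can be sketched as follows. This is a minimal illustration under several assumptions introduced here: the point-sets and cluster centers passed in are taken to be already warped by their transformations, the Gaussians use the usual $(2\pi\sigma^2)^{D/2}$ normalisation, and the result includes the constant factor $1/N$ from Eqn. (4–15). The function names are hypothetical.

```python
import math


def q_value(x, centers, sigma):
    """Uniform-weight isotropic Gaussian mixture of one center set, evaluated at x."""
    d = len(x)
    norm = (2.0 * math.pi * sigma ** 2) ** (d / 2.0) * len(centers)
    return sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2.0 * sigma ** 2))
        for c in centers) / norm


def empirical_js(point_sets, center_sets, sigma):
    """Sample-based JS estimate: average over sets j of mean_i log(N Q_j / sum_p Q_p)."""
    n_sets = len(point_sets)
    total = 0.0
    for j, points in enumerate(point_sets):
        acc = 0.0
        for x in points:
            q = [q_value(x, centers, sigma) for centers in center_sets]
            acc += math.log(n_sets * q[j] / sum(q))
        total += acc / len(points)
    return total / n_sets


# Toy 2D shapes; here each point-set also serves as its own set of centers.
shape = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted = [[x + 0.3, y] for x, y in shape]
far_away = [[x + 100.0, y] for x, y in shape]

print(empirical_js([shape, shape], [shape, shape], 0.5))        # aligned sets
print(empirical_js([shape, shifted], [shape, shifted], 0.5))    # small misalignment
print(empirical_js([shape, far_away], [shape, far_away], 0.5))  # disjoint sets
```

The estimate vanishes for perfectly aligned sets, grows with misalignment, and saturates at $\ln N$ when the sets do not overlap at all. The saturation is one reason a reasonable initial alignment, such as the initial affine registration mentioned earlier, is helpful before gradient-based optimization.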

4.3.4.2 Optimizing the JS divergence

Computation of the gradient of the energy function is necessary in the minimization

process when employing a gradient-based scheme. If this can be done in analytical form,

it leads to an efficient optimization method. We now present the analytic form of the

gradient of the JS-divergence (our cost function):

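\[ \nabla JS = \left[ \frac{\partial JS}{\partial \mu^1}, \frac{\partial JS}{\partial \mu^2}, \ldots, \frac{\partial JS}{\partial \mu^N} \right] \tag{4–17} \]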

Each component of the gradient may be found by differentiating Eqn. (4–16) with respect to the transformation parameters. In order to compute this gradient, we first calculate the derivative of $Q_p^{x_i^j}$ with respect to $\mu^l$:

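\[
\frac{\partial Q_p^{x_i^j}}{\partial \mu^l} =
\begin{cases}
\dfrac{1}{(2\pi)^{D/2} \sigma^3 K} \sum_{a=1}^{K} -\exp\left( -\dfrac{1}{2\sigma^2} |F_{jp}|^2 \right) \left( F_{jp} \cdot \dfrac{\partial f^j(x_i^j)}{\partial \mu^l} \right) & \text{if } l = j \neq p \\[2ex]
\dfrac{1}{(2\pi)^{D/2} \sigma^3 K} \sum_{a=1}^{K} \exp\left( -\dfrac{1}{2\sigma^2} |F_{jp}|^2 \right) \left( F_{jp} \cdot \dfrac{\partial f^p(v_a^p)}{\partial \mu^l} \right) & \text{if } l = p \neq j \\[2ex]
\dfrac{1}{(2\pi)^{D/2} \sigma^3 K} \sum_{a=1}^{K} \exp\left( -\dfrac{1}{2\sigma^2} |F_{jp}|^2 \right) \left( F_{jp} \cdot \left[ \dfrac{\partial f^p(v_a^p)}{\partial \mu^l} - \dfrac{\partial f^j(x_i^j)}{\partial \mu^l} \right] \right) & \text{if } l = p = j
\end{cases}
\tag{4–18}
\]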

where $F_{jp} := f^j(x_i^j) - f^p(v_a^p)$. Based on this, it is straightforward to derive the gradient of the JS-divergence with respect to the transformation parameters $\mu^l$, which is given by

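\[
\begin{aligned}
\frac{\partial JS}{\partial \mu^l} = {} & -\frac{1}{n_1} \sum_{i=1}^{n_1} \left( \frac{1}{Q_1^{x_i^1} + Q_2^{x_i^1} + \cdots + Q_N^{x_i^1}} \right) \frac{\partial Q_l^{x_i^1}}{\partial \mu^l} - \frac{1}{n_2} \sum_{i=1}^{n_2} \left( \frac{1}{Q_1^{x_i^2} + Q_2^{x_i^2} + \cdots + Q_N^{x_i^2}} \right) \frac{\partial Q_l^{x_i^2}}{\partial \mu^l} + \cdots \\
& -\frac{1}{n_l} \sum_{i=1}^{n_l} \left( \frac{1}{Q_1^{x_i^l} + Q_2^{x_i^l} + \cdots + Q_N^{x_i^l}} \right) \left[ \frac{\partial Q_1^{x_i^l}}{\partial \mu^l} + \cdots + \frac{\partial Q_N^{x_i^l}}{\partial \mu^l} \right] + \frac{1}{n_l} \sum_{i=1}^{n_l} \left( \frac{1}{Q_l^{x_i^l}} \right) \frac{\partial Q_l^{x_i^l}}{\partial \mu^l} \\
& - \cdots - \frac{1}{n_N} \sum_{i=1}^{n_N} \left( \frac{1}{Q_1^{x_i^N} + Q_2^{x_i^N} + \cdots + Q_N^{x_i^N}} \right) \frac{\partial Q_l^{x_i^N}}{\partial \mu^l}
\end{aligned}
\tag{4–19}
\]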

4.3.5 Estimating CDF-JS and its Derivative

We use the Parzen window technique to estimate the cumulative distribution function and its derivative. The calculation of the CDF-JS divergence requires the estimation of the cumulative probability distribution of each point-set. Without loss of generality, we only discuss the derivation for the 2D case, which can be easily extended to the 3D case. Each point $\mathbf{x}_i^a$ with coordinates $[x_i^a, y_i^a]$ in the point set $X^a$ is transformed by the function $f^a$ to $f^a(\mathbf{x}_i^a, \mu^a) = [f^a(x_i^a, \mu^a), f^a(y_i^a, \mu^a)]$. Let $\beta^{(3)}$ be a cubic spline Parzen window. The smoothed probability density function $p^a(l, k; \mu^a)$ of the point-set $X^a$, $a \in \{1, \ldots, N\}$, is given by

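\[ p^a(l, k; \mu^a) = \alpha_a \sum_{i}^{n_a} \beta^{(3)}\left( l - \frac{f^a(x_i^a, \mu^a) - x_0}{\Delta b_X} \right) \beta^{(3)}\left( k - \frac{f^a(y_i^a, \mu^a) - y_0}{\Delta b_Y} \right) \tag{4–20} \]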

where $\alpha_a$ is a normalization factor that ensures $\iint p(l, k)\,dl\,dk = 1$, $[l, k]$ are the coordinate values along the X and Y axes respectively, and the transformed point coordinates $[f^a(x_i^a, \mu^a), f^a(y_i^a, \mu^a)]$ are normalized by the minimum coordinate values, $x_0, y_0$, and the range of each bin, $\Delta b_X, \Delta b_Y$. From the density function, we can calculate the cumulative residual distribution function by the formula $P^a(l > \lambda, k > \gamma; \mu^a) = \int_\lambda^\infty \int_\gamma^\infty p^a(l, k; \mu^a)\,dl\,dk$, where $\lambda \in L_x$, $\gamma \in L_y$, and $L_x, L_y$ are discrete sets of coordinate values along the X and Y axes respectively. In further detail, we have

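\[
\begin{aligned}
P^a(\lambda, \gamma; \mu^a) &= \alpha_a \sum_{i}^{n_a} \int_\lambda^\infty \beta^{(3)}\left( l - \frac{f^a(x_i^a, \mu^a) - x_0}{\Delta b_X} \right) dl \int_\gamma^\infty \beta^{(3)}\left( k - \frac{f^a(y_i^a, \mu^a) - y_0}{\Delta b_Y} \right) dk \\
&= \alpha_a \sum_{i}^{n_a} \Phi\left( \lambda - \frac{f^a(x_i^a, \mu^a) - x_0}{\Delta b_X} \right) \Phi\left( \gamma - \frac{f^a(y_i^a, \mu^a) - y_0}{\Delta b_Y} \right)
\end{aligned}
\]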

where $\Phi(\cdot)$ is the cumulative residual function of the cubic spline kernel, which is defined as follows:

\[ \Phi(v) = \int_v^\infty \beta^{(3)}(u)\,du \]

Note that $\frac{d\Phi(u)}{du} = -\beta^{(3)}(u)$.
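The kernel and its cumulative residual function admit a closed form, sketched below. We assume here that $\beta^{(3)}$ is the standard centered cubic B-spline with support $[-2, 2]$; the text does not spell out the kernel, so this choice and the names `cubic_bspline` and `cumulative_residual` are assumptions made for illustration.

```python
def cubic_bspline(u):
    """Standard centered cubic B-spline (assumed form of beta^(3))."""
    a = abs(u)
    if a < 1.0:
        return (4.0 - 6.0 * a * a + 3.0 * a ** 3) / 6.0
    if a < 2.0:
        return (2.0 - a) ** 3 / 6.0
    return 0.0

def cumulative_residual(v):
    """Phi(v): integral of the cubic B-spline from v to infinity."""
    if v < 0.0:
        # The kernel is symmetric and integrates to one.
        return 1.0 - cumulative_residual(-v)
    if v >= 2.0:
        return 0.0
    if v >= 1.0:
        return (2.0 - v) ** 4 / 24.0
    # 0 <= v < 1: integrate the central piece up to 1, then add the tail.
    antiderivative = 4.0 * v - 2.0 * v ** 3 + 0.75 * v ** 4
    return (2.75 - antiderivative) / 6.0 + 1.0 / 24.0
```

A central finite difference of `cumulative_residual` reproduces `-cubic_bspline`, in agreement with the identity $d\Phi(u)/du = -\beta^{(3)}(u)$ noted above.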

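Having specified the distribution function of the data, we can then rewrite Eqn. (4–8) as follows (for simplicity, we choose $\pi_a = \frac{1}{N}$, $\forall a = 1, 2, \ldots, N$):

\[
\begin{aligned}
\mathcal{J}(P_1, P_2, \ldots, P_N) &= \mathcal{E}\left( \sum_{a=1}^{N} \pi_a P_a \right) - \sum_{a=1}^{N} \pi_a \mathcal{E}(P_a) \\
&= -\sum_\lambda \sum_\gamma P \log P + \frac{1}{N} \sum_a \sum_\lambda \sum_\gamma P^a \log P^a
\end{aligned}
\tag{4–21}
\]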

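where $P$ is the cumulative residual distribution function for the density function $\frac{1}{N} \sum_a p^a(l, k; \mu^a)$, which can be expressed as

\[ P(\lambda, \gamma; \mu^a) = \frac{1}{N} \sum_{a=1}^{N} \alpha_a \sum_{i}^{n_a} \Phi\left( \lambda - \frac{f^a(x_i^a, \mu^a) - x_0}{\Delta b_X} \right) \Phi\left( \gamma - \frac{f^a(y_i^a, \mu^a) - y_0}{\Delta b_Y} \right) \tag{4–22} \]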
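To make Eq. (4–21) concrete, the following sketch evaluates the CDF-JS divergence for 1D point sets on a discrete grid. It is a simplification for illustration only: the survival functions are plain empirical ones (a step kernel instead of the cubic spline Parzen window), and the names `survival`, `xlogx` and `cdf_js` are hypothetical.

```python
import math

def survival(points, grid):
    """Empirical cumulative residual P(x > lambda) at each grid value."""
    n = len(points)
    return [sum(1 for x in points if x > lam) / n for lam in grid]

def xlogx(p):
    """p * log(p), with the usual convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0.0 else 0.0

def cdf_js(point_sets, grid):
    """Discrete form of Eq. (4-21) with uniform weights pi_a = 1/N."""
    n_sets = len(point_sets)
    surv = [survival(ps, grid) for ps in point_sets]
    # Survival function of the pooled (mixture) distribution.
    pooled = [sum(s[k] for s in surv) / n_sets for k in range(len(grid))]
    first = -sum(xlogx(p) for p in pooled)
    second = sum(xlogx(p) for s in surv for p in s) / n_sets
    return first + second
```

Because $-P \log P$ is concave, the value is non-negative and vanishes when all survival functions agree on the grid.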

4.3.5.1 Optimizing the CDF-JS divergence

We now present the analytic form of the gradient of the CDF-JS divergence (our cost

function):

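\[ \nabla \mathcal{J} = \left[ \frac{\partial \mathcal{J}}{\partial \mu^1}, \frac{\partial \mathcal{J}}{\partial \mu^2}, \ldots, \frac{\partial \mathcal{J}}{\partial \mu^N} \right] \tag{4–23} \]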

Each component of the gradient may be found by differentiating Eqn. (4–21) with respect to the transformation parameters. It can be easily shown that $\frac{\partial P(\lambda, \gamma; \mu^a)}{\partial \mu^a} = \frac{1}{N} \frac{\partial P^a(\lambda, \gamma; \mu^a)}{\partial \mu^a}$. Based on these facts, it is straightforward to derive the gradient of the CDF-JS divergence with respect to the transformation parameters $\mu^a$, which is given by

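\[
\begin{aligned}
\frac{\partial \mathcal{J}}{\partial \mu^a} &= -\sum_\lambda \sum_\gamma [1 + \log P] \frac{\partial P(\lambda, \gamma; \mu^a)}{\partial \mu^a} + \frac{1}{N} \sum_\lambda \sum_\gamma [1 + \log P^a] \frac{\partial P^a(\lambda, \gamma; \mu^a)}{\partial \mu^a} \\
&= \frac{1}{N} \sum_\lambda \sum_\gamma \frac{\partial P^a(\lambda, \gamma; \mu^a)}{\partial \mu^a} \log \frac{P^a}{P}
\end{aligned}
\tag{4–24}
\]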

As a byproduct of the groupwise registration using the CDF-JS divergence, we get the atlas of the given population of data sets, which is obtained simply by substituting the estimated transformation functions $f^a$ into the formula for the atlas, $p(A) = \sum_{a=1}^{N} \pi_a P_a$. Note that our algorithm can be applied to yield a biased registration in situations that demand such a solution. This is achieved by fixing one of the data sets (say the reference) and estimating the transformation from this to the novel scene data. We will present experimental results on point-set alignment between two given point-sets as well as atlas construction from multiple point-sets in the next section.

4.4 Experiment Results

We now present experimental results using the JS divergence and the CDF-JS divergence for point-set registration. The results are demonstrated on both synthetic and real data sets.

4.4.1 JS Divergence Results

To demonstrate the robustness and accuracy of our algorithm, we first show the alignment results obtained by applying the JS-divergence to the point-set matching problem. We then present the atlas construction results in the second part.

4.4.1.1 Alignment results

First, to test the validity of our approach, we perform a set of exact rigid registration experiments on both synthetic and real data sets without noise and outliers. Some examples are shown in Figure 4–2. The top row shows the registration result for a 2D real range data set of a road (which was also used in Tsin and Kanade's experiments [67]). The figure depicts the real data and the registered data (using rigid motion). The top left frame contains two unregistered point sets superposed on each other. The top right frame contains the same point sets after registration using our algorithm. A 3D helix example is presented in the second row (with the same arrangement as the top row). We also tested our method against the KC method [67] and the ICP methods. As expected, our method and the KC method exhibit a much wider convergence basin/range than ICP, and both achieve very high accuracy in the noiseless case.

[Figure 4–2 panels: "Initial setup" and "After registration" for the 2D road data (top row) and for the 3D helix (second row).]

Figure 4–2: Results of rigid registration in the noiseless case. 'o' and '+' indicate the model and scene points respectively.

We also applied our algorithm to non-rigidly register medical datasets (2D point-sets). Figure 4–3 depicts some results of our registration method applied to a set of 2D corpus callosum slices with feature points manually extracted by human experts. The registration result is shown in the left column, with the warping of the 2D grid under the recovered motion shown in the middle column. Our non-rigid alignment performs well in the presence of noise and outliers (Figure 4–3, right column). For the purpose of comparison, we also tested the TPS-RPM program provided in [62] on this data set, and found that TPS-RPM can correctly register the pair without outliers (Figure 4–3, top left) but failed to match the corrupted pair (Figure 4–3, top right).

[Figure 4–3 panels: "Initial Setup", "After registration", "Original point set", "Deformed point set", "Initial Setup", "After registration".]

Figure 4–3: Non-rigid registration of the corpus callosum data. Left column: two manually segmented corpus callosum slices before and after registration; Middle column: warping of the 2D grid using the recovered motion; Top right: same slices with one corrupted by noise and outliers, before and after registration.

4.4.1.2 Atlas construction results

In this section, we begin with a simple but demonstrative example of our algorithm for 2D atlas estimation. The structure of interest in this experiment is the corpus callosum as it appears in MR brain images. Constructing an atlas for the corpus callosum and subsequently analyzing the individual shape variation from "normal" anatomy has been regarded as potentially valuable for the study of brain diseases such as agenesis of the corpus callosum (ACC) and fetal alcohol syndrome (FAS).

[Figure 4–4 panels: "Point-sets Before Registration", "Point Set 1" through "Point Set 7", "Deformed Point-sets".]

Figure 4–4: Experimental results on seven 2D corpus callosum point sets. The first two rows and the left image in the third row show the deformation of each point-set to the atlas, superimposed with the initial point set (shown in 'o') and the deformed point-set (shown in '*'). Middle image in the third row: the estimated atlas is shown superimposed over all the point-sets. Right: the estimated atlas is shown superimposed over all the deformed point-sets.

We manually extracted points on the outer contour of the corpus callosum from seven normal subjects (as shown in Figure 4–4, indicated by "o"). The recovered deformations between each point-set and the mean shape are superimposed on the first two rows in Figure 4–4. The resulting atlas (mean point-set) is shown in the third row of Figure 4–4, and is superimposed over all the point-sets. As we described earlier, all these results are computed simultaneously and automatically. This example clearly demonstrates that our joint matching and atlas construction algorithm can simultaneously align multiple shapes (modeled by sample point-sets) and compute a meaningful atlas/mean shape.

4.4.2 CDF-JS Divergence Results

First, to see how the CDF-JS method behaves in the presence of noise and outliers, we designed the following procedure to generate a corrupted template point set from a model set. For a model set with $n$ points, we control the degree of corruption by (1) discarding a subset of size $(1 - \rho)n$ from the model point set, (2) applying a rigid transformation $(R, t)$ to the template, (3) perturbing the points of the template with noise (of strength $\varepsilon$), and (4) adding $(\tau - \rho)n$ spurious, uniformly distributed points to the template. Thus, after corruption, a template point set will have a total of $\tau n$ points, of which only $\rho n$ correspond to points in the model set. Since ICP is known to be sensitive to outliers, we only compare our method with the more robust Jensen-Shannon divergence method in terms of sensitivity to noise and outliers. The comparison is done via a set of 2D experiments. At each of several noise levels and outlier strengths, we generate five models and six corrupted templates from each model, for a total of 30 pairs at each noise and outlier strength setting. For each pair, we use our algorithm and the JS method to estimate the known rigid transformation which was partially responsible for the corruption. The results show that when the noise level is low, both JS and CDF-JS have strong resistance to outliers. However, when the noise level is high, the CDF-JS method exhibits stronger resistance to outliers than the JS method, as shown in Figure 4–5, which confirms that CDF-JS is indeed more robust in the presence of high noise and outlier levels. A 3D example is also presented in Figure 4–6.
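The four-step corruption procedure described above can be sketched for 2D point sets as follows. Several details are assumptions made for illustration and are not specified in the text: the noise is taken to be zero-mean Gaussian with standard deviation $\varepsilon$, the spurious points are drawn uniformly inside the bounding box of the transformed points, the rigid transformation is given as a rotation angle `theta` and a translation `t`, and the function name `corrupt_template` is hypothetical.

```python
import math
import random

def corrupt_template(model, rho, tau, theta, t, eps, seed=0):
    """Generate a corrupted 2D template from a model point set."""
    rng = random.Random(seed)
    n = len(model)
    n_keep = round(rho * n)              # points that survive step (1)
    n_spurious = round((tau - rho) * n)  # outliers added in step (4)

    # (1) Discard a subset of size (1 - rho) * n.
    kept = rng.sample(model, n_keep)

    # (2) Apply the rigid transformation (R, t).
    c, s = math.cos(theta), math.sin(theta)
    moved = [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in kept]

    # (3) Perturb with noise of strength eps (Gaussian, by assumption).
    noisy = [(x + rng.gauss(0.0, eps), y + rng.gauss(0.0, eps)) for x, y in moved]

    # (4) Add spurious points, uniform over the bounding box (assumption).
    xs = [p[0] for p in noisy]
    ys = [p[1] for p in noisy]
    spurious = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
                for _ in range(n_spurious)]
    return noisy + spurious
```

For example, with $n = 20$, $\rho = 0.8$ and $\tau = 1.2$ the template contains 24 points, of which 16 correspond to model points.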

Figure 4–5: Robustness to outliers in the presence of large noise. Errors in estimated rigid transform vs. proportion of outliers ($(\tau - \rho)/\rho$) for both our method and KC method.

[Figure 4–6 panels: "Initial setup", "After registration".]

Figure 4–6: Robustness test on 3D swan data. 'o' and '+' indicate the model and scene points respectively. Note that the scene point-set is corrupted by noise and outliers.

Next, we present groupwise registration results on 3D hippocampal point-sets. Four 3D point-sets were extracted from epilepsy patients with left anterior temporal lobe foci identified with EEG. An interactive segmentation tool was used to segment the hippocampus from the 3D brain MRI scans of the 4 subjects. The point-sets differ in shape, with 450, 421, 376, and 307 points respectively. In the first four images of Figure 4–7, the recovered non-rigid deformation between each hippocampal point-set and the atlas is shown along with a superimposition on all of the original data sets. In the second row of Figure 4–7, we also show the scatter plot of the original point-sets along with all the point-sets after the non-rigid warping. An examination of the two scatter plots clearly shows the efficacy of our recovered non-rigid warping. Note that validation of what an atlas shape ought to be in the real data case is a difficult problem and we relegate its resolution to a future paper.

[Figure 4–7 panels: "Pointset 1", "Pointset 2", "Pointset 3", "Pointset 4", "Pooled Pointsets", "Deformed Pointsets".]

Figure 4–7: Atlas construction from four 3D hippocampal point sets. The first row and the left image in the second row show the deformation of each point-set to the atlas (represented as cluster centers), superimposed with the initial point set (shown in green 'o') and the deformed point-set (shown in red '+'). Middle image in the second row: scatter plot of the original four hippocampal point-sets. Right: scatter plot of all the warped point-sets.

CHAPTER 5
APPLICATIONS TO IMAGE SEGMENTATION

In medical imaging applications, segmentation can be a daunting task due to possibly large inhomogeneities in image intensities across an image, e.g., in MR images. These inhomogeneities, combined with volume averaging during imaging and a possible lack of precisely defined shape boundaries for certain anatomical structures, complicate the segmentation problem immensely. One possible solution for such situations is atlas-based segmentation. The atlas, once constructed, can be used as a template and can be registered non-rigidly to the image being segmented (henceforth called the target image), thereby achieving the desired segmentation. Many of the methods that achieve atlas-based segmentation are based on a two-stage process involving (i) estimating the non-rigid deformation field between the atlas image and the target image and then (ii) applying the estimated deformation field to the desired shape/atlas to achieve the segmentation of the corresponding structure(s) in the target image. In this chapter, we develop a novel technique that simultaneously achieves the non-rigid registration and segmentation. There is a vast body of literature on the tasks of registration and segmentation independently; however, methods that combine them into one algorithm are few and far between. In the following, we briefly review the few existing methods that attempt to achieve simultaneous registration and segmentation.

5.1 Related Work

In one of the earliest attempts at joint registration and segmentation, Bansal et al. [76] developed a minmax entropy framework to rigidly register and segment portal and CT data sets. In [77], Yezzi et al. present a variational principle for achieving simultaneous registration and segmentation; however, the registration part is limited to rigid motions. A similar limitation applies to the technique presented by Noble et al. in [78]. A variational principle in a level-set based formulation was presented by Paragios et al. [79] for segmentation and registration of cardiac MRI data. Their formulation was again limited to rigid motion and the experiments were limited to 2D images. In Fischl et al. [80], a Bayesian method is presented that simultaneously estimates a linear registration and the segmentation of a novel image. Note that linear registration does not involve non-rigid deformations. The case of joint registration and segmentation with non-rigid registration has not been addressed adequately in the literature, with the exception of the recent work reported in Soatto and Yezzi [81] and Vemuri et al. [82]. However, these methods can only work with image pairs that are from the same modality or whose intensity profiles are not too disparate.

In this chapter, we present a unified variational principle that simultaneously registers the atlas shape (contour/surface) to the novel brain image and segments the desired shape (contour/surface) in the novel image. In this work, the atlas serves as a prior in the segmentation process, and the registration of this prior to the novel brain scan assists in segmenting it. Another key strength of our proposed registration+segmentation scheme is that it accommodates image pairs having very distinct intensity distributions, as in multimodality data sets. More details on this are presented in Section 5.2.

5.2 Registration+Segmentation Model

We now present our formulation of the joint registration and segmentation model. Let $I_1$ be the atlas image containing the atlas shape $C$, $I_2$ the novel image that needs to be segmented, and $v$ the vector field from $I_2$ to $I_1$ (i.e., the transformation is centered in $I_2$) defining the non-rigid deformation between the two images. The variational principle describing our formulation of the registration-assisted segmentation problem is given by:

\[ \min E(v, C) = \mathrm{Seg}(I_2, C) + \mathrm{dist}(v(C), C) + \mathrm{Reg}(I_1, I_2, v). \tag{5–1} \]

Here, the first term denotes the segmentation functional, where $C$ is the boundary contour (surface in 3D) of the desired anatomical shape in $I_2$. The second term measures the distance between the transformed atlas $v(C)$ and the current segmentation $C$ in the novel brain image (i.e., the target image), and the third term denotes the non-rigid registration functional between the two images. Our joint registration and segmentation model is illustrated in Figure 5–1.

Figure 5–1: Model Illustration

For the segmentation functional, we use a piecewise constant Mumford-Shah model,

which is one of the well-known variational models for image segmentation, wherein it is

assumed that the image to be segmented can be modeled by piece-wise constant regions,

as was done in [54]. This assumption simplifies our presentation but our model itself can

be easily extended to the piecewise smooth regions case. Additionally, since we are only

interested in segmenting a desired anatomical shape (e.g., the hippocampus, the corpus

callosum, etc.), we will only be concerned with a binary segmentation i.e., two classes

namely, voxels inside the desired shape and those that are outside it. These assumptions

can be easily relaxed if necessary but at the cost of making the energy functional more

complicated and hence computationally more challenging. The segmentation functional

takes the following form:

\[ \mathrm{Seg}(I_2, C) = \int_\Omega (I_2 - u)^2 \, dx + \alpha \oint_C ds \tag{5–2} \]

where $\Omega$ is the image domain and $\alpha$ is a regularization parameter. Here $u = u_i$ if $x \in C_{in}$ and $u = u_o$ if $x \in C_{out}$, where $C_{in}$ and $C_{out}$ denote the regions inside and outside of the curve $C$ representing the desired shape boundaries in $I_2$.
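A discrete version of the functional in Eq. (5–2) can be sketched on a pixel grid as follows. This is only an illustration under simplifying assumptions: the region inside the curve is given as a boolean mask, $u_i$ and $u_o$ are taken to be the mean intensities inside and outside, and the contour length is approximated by the number of 4-neighbor pixel pairs that straddle the boundary. The name `piecewise_constant_energy` is hypothetical.

```python
def piecewise_constant_energy(image, inside, alpha):
    """Discrete two-region energy: data fidelity plus alpha * contour length."""
    rows, cols = len(image), len(image[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    in_vals = [image[r][c] for r, c in cells if inside[r][c]]
    out_vals = [image[r][c] for r, c in cells if not inside[r][c]]

    # Region means u_i and u_o (zero for an empty region).
    u_i = sum(in_vals) / len(in_vals) if in_vals else 0.0
    u_o = sum(out_vals) / len(out_vals) if out_vals else 0.0
    fidelity = (sum((v - u_i) ** 2 for v in in_vals)
                + sum((v - u_o) ** 2 for v in out_vals))

    # Contour length: count neighboring pixel pairs with different labels.
    length = 0
    for r, c in cells:
        if c + 1 < cols and inside[r][c] != inside[r][c + 1]:
            length += 1
        if r + 1 < rows and inside[r][c] != inside[r + 1][c]:
            length += 1
    return fidelity + alpha * length
```

For a binary image, the mask that coincides with the bright region has zero fidelity cost, so its energy reduces to $\alpha$ times the boundary length.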

For the non-rigid registration term in the energy function, we use an information-theoretic criterion, the cross cumulative residual entropy (CCRE), which we introduced in Chapter 2. CCRE was shown to outperform mutual information (MI) based registration in the context of noise immunity and convergence range, motivating us to pick this criterion over the MI-based cost function. The new registration functional is defined by

\[ \mathrm{Reg}(I_1, I_2, v) = -\mathcal{C}(I_1(v(x)), I_2(x)) + \mu \int_\Omega \|\nabla v(x)\|^2 \, dx \tag{5–3} \]

where the cross-CRE $\mathcal{C}(I_1, I_2)$ is given by

\[ \mathcal{C}(I_1, I_2) = \mathcal{E}(I_1) - E[\mathcal{E}(I_1 / I_2)] \tag{5–4} \]

with $\mathcal{E}(I_1) = -\int_{\mathbb{R}_+} P(|I_1| > \lambda) \log P(|I_1| > \lambda) \, d\lambda$ and $\mathbb{R}_+ = \{x \in \mathbb{R};\ x \geq 0\}$. Here $v(x)$ is as before, $\mu$ is the regularization parameter, and $\|\cdot\|$ denotes the Frobenius norm. Using a B-spline representation of the non-rigid deformation, one need only compute this field at the control points of the B-splines and interpolate elsewhere, thus accruing computational advantages. Using this representation, we have derived analytic expressions for the gradient of the energy with respect to the registration parameters. This in turn makes our optimization more robust and efficient.
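The cumulative residual entropy $-\int_{\mathbb{R}_+} P(|X| > \lambda) \log P(|X| > \lambda)\,d\lambda$ that underlies CCRE can be evaluated exactly for a finite sample, because the empirical survival function is piecewise constant between sorted values. The sketch below covers only this single-variable term, not the full cross-CRE, and the name `empirical_cre` is hypothetical.

```python
import math

def empirical_cre(samples):
    """Cumulative residual entropy of the empirical distribution of |x|."""
    xs = sorted(abs(x) for x in samples)
    n = len(xs)
    total = 0.0
    for k in range(1, n):
        # On the interval [xs[k-1], xs[k]) exactly n - k samples exceed lambda.
        surv = (n - k) / n
        total -= (xs[k] - xs[k - 1]) * surv * math.log(surv)
    return total
```

For samples spread uniformly over $[0, 1]$ the value approaches $1/4$, which is $-\int_0^1 (1 - \lambda) \log(1 - \lambda)\,d\lambda$.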

In order for the registration and the segmentation terms to “talk” to each other, we need a

connection term and that is given by

\[ \mathrm{dist}(v(C), C) = \int_R \phi_{v(C)}(x) \, dx \tag{5–5} \]

where $R$ is the region enclosed by $C$, and $\phi_{v(C)}(x)$ is the embedding signed distance function of the contour $v(C)$, which can be used to measure the distance between $v(C)$ and $C$. The level-set function $\phi : \mathbb{R}^2 \to \mathbb{R}$ is chosen so that its zero level-set corresponds to the transformed template curve $v(C)$. Letting $E_{dist} := \mathrm{dist}(v(C), C)$, one can show that $\frac{\partial E_{dist}}{\partial C} = \phi_{v(C)}(C) N$, where $N$ is the normal to $C$. The corresponding curve evolution equation given by gradient descent is then

\[ \frac{\partial C}{\partial t} = -\phi_{v(C)}(C) N \tag{5–6} \]

Not only does the signed distance function representation make it easier for us to convert the curve evolution problem to the level-set framework, it also facilitates the matching of the evolving curve $C$ and the transformed template curve $v(C)$, and yet does not rely on a parametric specification of either $C$ or the transformed template curve. Note that since $\mathrm{dist}(v(C), C)$ is a function of the unknown registration $v$ and the unknown segmentation $C$, it plays the crucial role of connecting the registration and the segmentation terms.

Combining these three functionals together, we get the following variational principle for

the simultaneous registration+segmentation problem:

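\[
\begin{aligned}
\min E(C, v, u_o, u_i) = {} & \int_\Omega (I_2 - u)^2 \, dx + \alpha_1 \oint_C ds + \alpha_2 \, \mathrm{dist}(v(C), C) \\
& - \alpha_3 \, \mathcal{C}(I_1(v(x)), I_2(x)) + \alpha_4 \int_\Omega \|\nabla v(x)\|^2 \, dx.
\end{aligned}
\tag{5–7}
\]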

The $\alpha_i$ are weights controlling the contribution of each term to the overall energy function; they can be treated as unknown constants and either set empirically or estimated during the optimization process. This energy function is quite distinct from those used in existing methods because it achieves a Mumford-Shah type of segmentation in an active contour framework jointly with non-rigid registration and shape distance terms. We are now ready to discuss the level-set formulation of the energy function in the following section.

5.2.1 Gradient flows

The level set method has been used extensively for implementing curve evolution based segmentation, primarily due to its many advantages over competing approaches. These include the ability to elegantly handle changes in the topology of the curve (splits and merges), the ability to deal with the formation of cusps and corners, which are extremely common in curve evolution, and the numerical stability and efficiency afforded in its implementation. For our model, where the equation for the unknown curve $C$ is coupled with the equations for $v(x)$, $u_o$, $u_i$, it is convenient to use the level set approach as proposed in [54].

Taking the variation of $E(\cdot)$ with respect to $C$ and writing down the gradient descent leads to the following curve evolution equation:

\[ \frac{\partial C}{\partial t} = -\left[ -(I_2 - u_i)^2 + (I_2 - u_o)^2 + \alpha_1 \kappa + \alpha_2 \phi_{v(C)}(C) \right] N \tag{5–8} \]

Note that equation (5–6) is used in the derivation. Equation (5–8) in the level-set

framework is given by:

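\[ \frac{\partial \phi}{\partial t} = \left[ -(I_2 - u_i)^2 + (I_2 - u_o)^2 + \alpha_1 \nabla \cdot \frac{\nabla \phi}{|\nabla \phi|} + \alpha_2 \phi_{v(C)}(C) \right] |\nabla \phi| \tag{5–9} \]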

where $u_i$ and $u_o$ are the mean values inside and outside of the curve $C$ in the image $I_2$. To drive the curve towards the template's level-set function $\phi_{v(C)}$ more efficiently, rather than just having the zero level-sets match, we can add another term $\phi(C)$ into the level-set evolution equation, giving us

\[ \frac{\partial \phi}{\partial t} = \left[ -(I_2 - u_i)^2 + (I_2 - u_o)^2 + \alpha_1 \nabla \cdot \frac{\nabla \phi}{|\nabla \phi|} + \alpha_2 \left( \phi_{v(C)}(C) - \phi(C) \right) \right] |\nabla \phi|. \tag{5–10} \]

As illustrated in Figure 5–2, the two parameters $\alpha_1$ and $\alpha_2$ are used to balance the influence of the shape distance model and the region-based model. Since $\phi(C) = 0$ at any location on the curve by the definition of the level-set function $\phi$, this added term does not affect the curve evolution equation [83].

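As mentioned before, we use a B-spline basis to represent the displacement vector field $v(x, \mu)$, where $\mu$ denotes the transformation parameters of the B-spline basis. The gradient of the energy with respect to $\mu$ is

\[ \frac{\partial E}{\partial \mu} = \alpha_2 \frac{\partial \int_R \phi_{v(C)}(x) \, dx}{\partial \mu} - \alpha_3 \frac{\partial \mathcal{C}(I_1(v(x)), I_2(x))}{\partial \mu} + \alpha_4 \frac{\partial \int_\Omega \|\nabla v(x)\|^2 \, dx}{\partial \mu} \tag{5–11} \]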

Figure 5–2: Illustration of the various terms in the evolution of the level set function $\phi$. To update $\phi$, we combine the standard region-based update term $S$ and the level set function corresponding to the shape distance term.

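The first term of equation (5–11) can be rewritten as follows:

\[
\begin{aligned}
\frac{\partial \int_R \phi_{v(C)}(x) \, dx}{\partial \mu} &= \int_R \frac{\partial \phi_{v(C)}(x)}{\partial \mu} \, dx \\
&= \int_R \left. \frac{\partial \phi_{v(C)}}{\partial v} \right|_{v = v(x, \mu)} \cdot \frac{\partial v(x, \mu)}{\partial \mu} \, dx
\end{aligned}
\tag{5–12}
\]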

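where $\frac{\partial \phi_{v(C)}}{\partial v}$ is the directional derivative in the direction of $v(x, \mu)$. The second term of Eqn. (5–11) has been derived in Eqn. (3–15) of Chapter 3. We simply state the result here without the derivation for the sake of brevity:

\[ \frac{\partial \mathcal{C}(I_2, I_1 \circ v(x; \mu))}{\partial \mu} = \sum_{\lambda \in I_1} \sum_{k \in I_2} \log \frac{P(i > \lambda, k; \mu)}{p_{I_2}(k) P(i > \lambda; \mu)} \cdot \frac{\partial P(i > \lambda, k; \mu)}{\partial \mu} \tag{5–13} \]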

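where $P(i > \lambda, k; \mu)$ and $P(i > \lambda; \mu)$ are the joint and marginal cumulative residual distributions respectively, and $p_{I_2}(k)$ is the density function of image $I_2$. The last term of Eqn. (5–11) leads to

\[ \frac{\partial \int_\Omega \|\nabla v(x)\|^2 \, dx}{\partial \mu} = 2 \int_\Omega \nabla v \cdot \frac{\partial v}{\partial \mu} \, dx \tag{5–14} \]

where both the matrices $\nabla v$ and $\frac{\partial v}{\partial \mu}$ are vectorized before the dot product is computed.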


Substituting equations (5–12), (5–13) and (5–14) respectively back into the equation

(5–11), we get the analytical gradient of our energy function with respect to the B-spline

transformation parametersµ. We then solve for the stationary point of this nonlinear

equation numerically using a quasi-Newton method.

5.2.2 Algorithm Summary

Given the atlas image $I_1$ and the unknown subject's brain scan $I_2$, we want the segmentation result $C$ in $I_2$. Initialize $C$ in $I_2$ to the atlas shape and set the initial displacement field to zero.

1. For fixed $C$, update the deformation field using a gradient-based numerical method for one step.
2. For fixed deformation field $v$, evolve $\phi$ in $I_2$ and thereby update $C$ as the zero level-set of $\phi$.
3. Stop the registration process if the difference in consecutive iterates is less than $\varepsilon = 0.01$, a pre-chosen tolerance; else go to Step 1.
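The three steps above form an alternating scheme whose control flow can be sketched as follows. The sketch is deliberately generic: `update_v`, `update_c` and `change` are hypothetical callables standing in for the one-step deformation update, the level-set evolution, and the measure of difference between consecutive iterates, none of which is implemented here, and `max_iter` is an added safeguard that is not part of the summary above.

```python
def alternate(v0, c0, update_v, update_c, change, tol=0.01, max_iter=100):
    """Alternate between registration and segmentation updates."""
    v, c = v0, c0
    iterations = 0
    while iterations < max_iter:
        iterations += 1
        v_new = update_v(v, c)       # Step 1: one update of the deformation, C fixed
        c_new = update_c(v_new, c)   # Step 2: update the segmentation, v fixed
        delta = change(v, c, v_new, c_new)
        v, c = v_new, c_new
        if delta < tol:              # Step 3: stop when iterates barely change
            break
    return v, c, iterations
```

As a toy usage example, with scalar stand-ins `update_v = lambda v, c: 0.5 * (v + c)` and `update_c = lambda v, c: 0.5 * (c + 2.0)`, both iterates approach the common fixed point 2 before the tolerance test stops the loop.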

5.3 Results

In this section, we present several example results from an application of our algorithm. The results are presented for synthetic as well as real data. The first three experiments were performed in 2D, while the fourth one was performed in 3D. Note that the image pairs used in all these experiments have significantly different intensity profiles, unlike those used by any of the previous joint registration and segmentation methods reported in the literature. The synthetic data example contains a pair of MR T1 and T2 weighted images from the MNI brainweb site [52]. They were originally aligned with each other. We use the MR T1 image as the source image, and the target image was generated from the MR T2 image by applying a known non-rigid transformation that was procedurally generated using kernel-based spline representations (cubic B-spline). The possible values of each direction in the deformation vary from −15 to 15 pixels. In this case, we present the error in the non-rigid deformation field estimated by our algorithm as an indicator of the accuracy of the estimated deformations.

Figure 5–3: Results of application of our algorithm to synthetic data (see text for details).

Figure 5–3 depicts the results obtained for this image pair. With the MR T1 image as the source image, the target was obtained by applying a synthetically generated non-rigid deformation field to the MR T2 image. Notice the significant difference between the intensity profiles of the source and target images. Figure 5–3 is organized as follows, from left to right: the first row depicts the source image with the atlas-segmentation superposed in red, the registered source image obtained using our algorithm, and the target image with the unregistered atlas-segmentation superposed to depict the amount of mis-alignment; the second row depicts the ground truth deformation field which we used to generate the target image from the MR T2 image, followed by the estimated non-rigid deformation field and finally the segmented target. As is evident from visual inspection, the registration and segmentation are quite accurate. As a measure of accuracy of our method, we estimated the average, $\mu$, and the standard deviation, $\sigma$, of the error in the estimated non-rigid deformation field. The error was estimated as the angle between the ground truth and estimated displacement vectors. The average and standard deviation are 1.5139 and 4.3211 (in degrees) respectively, which is quite accurate.
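The angular error measure described above can be sketched as follows. The sketch assumes that the displacement fields are given as lists of vectors, that locations where either vector is zero are skipped (the angle is undefined there), and that the population standard deviation is intended; the name `angular_error_stats` is hypothetical.

```python
import math
import statistics

def angular_error_stats(truth, estimate):
    """Mean and standard deviation (degrees) of the angle between vectors."""
    errors = []
    for g, e in zip(truth, estimate):
        norm_g = math.sqrt(sum(a * a for a in g))
        norm_e = math.sqrt(sum(a * a for a in e))
        if norm_g == 0.0 or norm_e == 0.0:
            continue  # angle undefined for a zero displacement (assumption: skip)
        cos_angle = sum(a * b for a, b in zip(g, e)) / (norm_g * norm_e)
        cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
        errors.append(math.degrees(math.acos(cos_angle)))
    return statistics.mean(errors), statistics.pstdev(errors)
```

Note that this measure is insensitive to errors in the magnitude of the displacement, which is why a magnitude-based error is also useful.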

Table 5–1 depicts statistics of the error in the estimated non-rigid deformation when compared to the ground truth. For the mean ground truth deformation (magnitude of the displacement vector) in Column 1 of each row, 5 distinct deformation fields with this mean are generated and applied to the target image of the given source-target pair to synthesize 5 pairs of distinct data sets. These pairs (one at a time) are input to our algorithm, and the mean ($\mu$) of the mean deformation error (MDE) is computed over the five pairs and reported in Column 2 of the table. MDE is defined as $d_m = \frac{1}{\mathrm{card}(R)} \sum_{x_i \in R} \|v_0(x_i) - v(x_i)\|$, where $v_0(x_i)$ and $v(x_i)$ are the ground truth and estimated displacements respectively at voxel $x_i$, $\|\cdot\|$ denotes the Euclidean norm, and $R$ is the volume of the region of interest. Column 3 depicts the standard deviation of the MDE for the five pairs of data in each row. As is evident, the mean and the standard deviation of the error are reasonably small, indicating the accuracy of our joint registration + segmentation algorithm. Note that this testing was done on a total of 20 image pairs (40 images), as there are 5 pairs of images per row.

Table 5–1: Statistics of the error in estimated non-rigid deformation.

µg     µ of MDE    σ of MDE
2.4    0.5822      0.0464
3.3    0.6344      0.0923
4.5    0.7629      0.0253
5.5    0.7812      0.0714
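The mean deformation error defined above translates directly into code. In this sketch the region of interest is represented simply by the lists of ground truth and estimated displacement vectors at its voxels, and the name `mean_deformation_error` is hypothetical.

```python
import math

def mean_deformation_error(truth, estimate):
    """MDE: mean Euclidean distance between true and estimated displacements."""
    total = 0.0
    for g, e in zip(truth, estimate):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(g, e)))
    return total / len(truth)
```

For instance, if one of two voxels is recovered exactly and the other is off by the vector (3, 4), the MDE is 2.5.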

For the first real data experiment, we selected two image slices from two different modalities of brain scans. The two slices depict considerable changes in the shape of the ventricles, the region of interest in the data sets. One of the two slices was arbitrarily selected as the source, and segmentation of the ventricle in the source was achieved using an active contour model. The goal was then to automatically find the ventricle in the target image using our algorithm, given the input data along with the segmented ventricles in the source image. Figure 5–4 is organized as follows, from left to right: the first row depicts the source image with the atlas-segmentation superposed in black, followed by the target image with the unregistered atlas-segmentation superposed to depict the amount of mis-alignment; the second row depicts the estimated non-rigid vector field and finally the segmented target. As is evident from Figure 5–4, the accuracy of the achieved registration+segmentation is visually very good. Note that the non-rigid deformation between the two images in these two examples is quite large, and our method was able to simultaneously register and segment the target data sets quite accurately.

Figure 5–4: Results of application of our algorithm to a pair of slices from human brain MRIs (see text for details).

The second real data example is obtained from two brain MRIs of different subjects and

modalities, the segmentation of the cerebellum in the source image is given. We selected

two “corresponding” slices from these volume data sets to conduct the experiment. Note

that even though the number of slices for the two data sets is the same, the slices may

not correspond to each other from an anatomical point of view. However, for the purposes

of illustration of our algorithm, this is not very critical. We use the corresponding slice of

the 3D segmentation of the source as our atlas-segmentation. The results of an application

of our algorithm are organized as before in Figure 5–5. Once again, as evident, the visual

quality of the segmentation and registration is very high.

Figure 5–5: Corpus Callosum segmentation on a pair of corresponding slices from distinct subjects.

Finally, we present a 3D real data experiment. In this experiment, the input is a pair of 3D

brain scans with the segmentation of the hippocampus in one of the two images (labeled

the atlas image) being obtained using the well-known PCA on several training data

sets. Each data set contains 19 slices of size 256x256. The goal was then to automatically

find the hippocampus in the target image given the input. Figure 5–6 depicts the results

obtained for this image pair. From left to right, the first image shows the given (atlas)

hippocampus surface followed by one cross-section of this surface overlaid on the source

image slice; the third image shows the segmented hippocampus surface from the target

image using our algorithm and finally the cross-section of the segmented surface overlaid

on the target image slice. To validate the accuracy of the segmentation result, we

randomly sampled 120 points from the segmented surface and computed the average

distance from these points to the ground truth hand segmented hippocampal surface in the

target image. The hand segmentation was performed by an expert neuroanatomist. The

average and standard deviation of the error in the aforementioned distance in estimated

hippocampal shape are 0.8190 and 0.5121 voxels, respectively, i.e., a mean error of less than one voxel.

Figure 5–6: Hippocampal segmentation using our algorithm on a pair of brain scans from distinct subjects (see text for details).
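The validation step described above can be sketched in a few lines. This is only an illustration and not the code used to produce the reported numbers: it assumes that both surfaces are available as lists of 3D points in voxel coordinates, and it approximates the point-to-surface distance by the distance to the nearest reference point. The function names and the toy data are hypothetical.

```python
import math
import random
import statistics


def nearest_distance(point, reference_points):
    """Distance from one point to the closest point of a reference set."""
    return min(math.dist(point, q) for q in reference_points)


def surface_error(segmented_points, reference_points, n_samples=120, seed=0):
    """Mean and (population) standard deviation of the distance from randomly
    sampled segmented-surface points to the reference surface."""
    rng = random.Random(seed)
    n = min(n_samples, len(segmented_points))
    sample = rng.sample(segmented_points, n)
    distances = [nearest_distance(p, reference_points) for p in sample]
    return statistics.mean(distances), statistics.pstdev(distances)


# Toy data (assumption): a reference "surface" on the plane z = 0 and a
# segmented surface shifted by exactly one voxel along z.
reference = [(float(x), float(y), 0.0) for x in range(10) for y in range(10)]
segmented = [(float(x), float(y), 1.0) for x in range(10) for y in range(10)]
mean_err, std_err = surface_error(segmented, reference, n_samples=50)
```

With a densely sampled reference surface, the nearest-point distance is a reasonable stand-in for the true point-to-surface distance; a sparse reference would overestimate the error.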

CHAPTER 6
CONCLUSIONS AND FUTURE WORK

6.1 Contributions of the Dissertation

We have introduced a variety of information theoretic measures and shown various

applications. The novel information measures we presented in this dissertation include

• Entropy defined on distributions, Cumulative Residual Entropy (CRE)

• Cross-Cumulative Residual Entropy (CCRE)

• CDF based Kullback-Leibler (KL) divergence

• CDF based Jensen-Shannon (JS) divergence

We demonstrated their applications to the following medical image analysis problems:

• Non-rigid image registration

• Simultaneous groupwise point-sets registration and atlas construction

• Atlas based image segmentation

Our contributions to each of these topics are summarized in the following sections.

6.2 Image and Point-sets Registration

6.2.1 Non-rigid Image Registration

For non-rigid image registration, we presented a novel way to register multi-modal

datasets using a matching criterion called cross cumulative residual entropy (CCRE)

[84], which measures the similarity between two images. The matching measure is defined

based on a new information measure, namely cumulative residual entropy (CRE), which

is defined on probability distributions instead of probability densities; therefore,

CCRE is valid for both discrete and continuous domains. Furthermore, CCRE also inherits

the robustness property of the CRE measure. In [84], we presented results of rigid and

affine registration under a variety of noise levels and showed significantly superior

performance over MI-based methods.
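To make the distinction between densities and distributions concrete, recall that [7] defines the CRE of a random variable X as E(X) = -∫ P(|X| > λ) log P(|X| > λ) dλ, with the integral taken over λ ≥ 0. Only the survival function P(|X| > λ) appears, so the quantity can be evaluated from a finite sample without any density estimation. The sketch below is an illustrative estimator under that definition, using the empirical survival function of the sample; the function name is hypothetical and this is not the implementation used in the experiments.

```python
import math


def empirical_cre(samples):
    """Cumulative residual entropy of the empirical distribution of |x|.

    The empirical survival function is piecewise constant: between the i-th
    and (i+1)-th sorted values it equals (n - i) / n, so the integral reduces
    to a finite sum over the gaps between consecutive sorted values.
    """
    xs = sorted(abs(x) for x in samples)
    n = len(xs)
    total = 0.0
    for i in range(1, n):
        survival = (n - i) / n          # P(|X| > lambda) on [xs[i-1], xs[i])
        total -= (xs[i] - xs[i - 1]) * survival * math.log(survival)
    return total


# Usage: samples on a fine regular grid over [0, 1]. The result is close to
# 0.25, the CRE of a uniform distribution on [0, 1].
cre_uniform = empirical_cre([i / 1000 for i in range(1001)])
```

The estimator also reflects the scaling behavior of the CRE: multiplying all samples by a positive constant multiplies the result by the same constant.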


The CCRE between the two images to be registered is maximized over the space of

smooth, unknown non-rigid transformations, which are represented by tri-cubic

B-splines placed on a regular grid. The analytic gradient of this matching measure was

derived in this dissertation to achieve efficient and accurate non-rigid registration. It turns out

that the gradient of the CCRE has a formulation similar to that of the cost function, which

greatly saves memory space in the optimization process. The matching criterion is

optimized using a quasi-Newton method to recover the transformation parameters.
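As an illustration of the transformation model, the following sketch evaluates a one-dimensional uniform cubic B-spline displacement from control-point coefficients on a regular grid. It is a simplified analogue, not the dissertation's implementation: the tri-cubic case uses the tensor product of three such bases, and the indexing convention (control_points[k] sits at grid position (k - 1) * spacing) and all names are assumptions made for this example.

```python
import math


def cubic_bspline_basis(u):
    """The four uniform cubic B-spline basis values for u in [0, 1)."""
    return (
        (1 - u) ** 3 / 6.0,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
        u ** 3 / 6.0,
    )


def displacement_1d(x, control_points, spacing):
    """Displacement at position x from control-point coefficients.

    Assumed convention: control_points[k] is attached to grid position
    (k - 1) * spacing, so x must satisfy
    0 <= x < (len(control_points) - 3) * spacing.
    """
    cell = math.floor(x / spacing)      # index of the grid cell containing x
    u = x / spacing - cell              # local coordinate inside the cell
    weights = cubic_bspline_basis(u)
    return sum(w * control_points[cell + l] for l, w in enumerate(weights))


# Usage: a constant coefficient field gives a constant displacement because
# the four basis functions sum to one.
shift = displacement_1d(2.5, [2.0] * 8, spacing=1.0)
```

Each evaluation touches only four coefficients per dimension, which is what keeps the B-spline parameterization local and its gradient sparse.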

The key strengths of our proposed non-rigid registration scheme were demonstrated

through the registration of synthetic as well as real data sets from multi-modality (MR

T1 and T2 weighted, MR & CT) imaging sources. We showed that CCRE not only

accommodates images of varying contrast and brightness, but is also

robust in the presence of noise. CCRE also converges faster than other

information theory-based registration methods. Finally, we showed that CCRE is well

suited for situations where the source and the target images have FOVs with large

non-overlapping regions (which is quite common in practice). Comparisons were made

between CCRE and traditional MI [34, 51], which is defined using the Shannon

entropy. All the experiments showed significantly better performance of CCRE over the

MI-based methods currently used in the literature.

Our future work will focus on extending the transformation model to one that permits

the spatial adaptation of the transformation’s compliance, which will allow us to reduce

the number of degrees of freedom in the overall transformation. Validation of non-rigid

registration on real data with the aid of segmentations and landmarks obtained manually

from a group of trained anatomists is the goal of our ongoing work.

6.2.2 Groupwise Point-sets Registration

We presented a novel and robust algorithm for the groupwise non-rigid registration of

multiple unlabeled point-sets with no bias toward any of the given point-sets. To quantify

the divergence between multiple probability distributions estimated from the given

point-sets, we proposed several divergence measures, the first of which is the Jensen-Shannon

divergence. Since it lacks robustness, we developed a novel measure based on the

cumulative distribution functions of these distributions, which we dub the CDF-JS divergence. The measure

parallels the well known Jensen-Shannon divergence (defined for probability density

functions) but is more regular than the JS divergence since its definition is based on CDFs

as opposed to density functions. As a consequence, CDF-JS is more immune to noise and

statistically more robust than the JS.
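The relationship between the two measures can be illustrated on discrete distributions. The sketch below computes the Jensen-Shannon divergence of several distributions in the standard form given in [69], JS = H(Σ π_i P_i) − Σ π_i H(P_i), and a CDF-based analogue in which the Shannon entropy H is replaced by a cumulative residual entropy computed from the survival function. The CDF-based form shown here is an assumption made for illustration (it mirrors the JS construction) and may differ in detail from the definition used earlier in this dissertation; all function names are hypothetical.

```python
import math


def shannon_entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(v * math.log(v) for v in p if v > 0)


def cumulative_residual_entropy(p, bin_width=1.0):
    """Discrete CRE: minus the sum of S_k * log(S_k) * bin_width, where
    S_k = P(X > x_k) is the survival function after bin k."""
    total, survival = 0.0, 1.0
    for v in p:
        survival -= v
        if survival > 1e-12:            # guards against rounding below zero
            total -= bin_width * survival * math.log(survival)
    return total


def mixture(distributions, weights):
    """Weighted average of distributions defined on the same bins."""
    n_bins = len(distributions[0])
    return [sum(w * p[k] for w, p in zip(weights, distributions))
            for k in range(n_bins)]


def divergence(distributions, entropy, weights=None):
    """entropy(mixture) minus the weighted sum of the individual entropies."""
    n = len(distributions)
    if weights is None:
        weights = [1.0 / n] * n
    mixed = mixture(distributions, weights)
    return entropy(mixed) - sum(w * entropy(p)
                                for w, p in zip(weights, distributions))


# Usage: two distributions concentrated on different bins.
p1, p2 = [1.0, 0.0], [0.0, 1.0]
js = divergence([p1, p2], shannon_entropy)
cdf_js = divergence([p1, p2], cumulative_residual_entropy)
```

Both quantities are non-negative because the underlying entropy functionals are concave, and both vanish when all input distributions are identical. The CDF-based version depends on the distributions only through their cumulative sums, which is the source of the regularity noted above.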

Our proposed methods do not require any knowledge of correspondence between the

input point-sets, and therefore these point-sets need not have the same cardinality. One

other salient feature of our proposed algorithms is that we get a probabilistic atlas as a

byproduct of the registration process. Our algorithm can be especially useful for creating

atlases of various shapes present in images as well as for simultaneously (rigidly or

non-rigidly) registering 3D range data sets without having to establish any

correspondence.

Our future work will focus on using maximum likelihood estimation (MLE) to

automatically determine the weighting coefficients in the divergence measures and the smoothing

term. We are also attempting to extend our techniques to diffeomorphic point-sets

matching.

6.3 Image Segmentation

For image segmentation, we presented a novel variational formulation of the

joint (non-rigid) registration and segmentation problem, which requires the solution to a

coupled set of nonlinear PDEs that are solved using efficient numerical schemes. Our

work is a departure from earlier methods in that we presented a unified variational

principle wherein non-rigid registration and segmentation are simultaneously achieved.

Unlike earlier methods presented in the literature, a key feature of our algorithm is that it can

accommodate image pairs having distinct intensity distributions. We presented several

examples on synthetic (twenty) and real (three) data sets along with quantitative accuracy

estimates of the registration in the synthetic data case. The accuracy as evident in these

experiments is quite satisfactory. Our future efforts will focus on adapting our

algorithm and software for clinical use.

REFERENCES

[1] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, pp. 379–423 and 623–656, 1948.

[2] W. F. Sharpe, Investments. London: Prentice Hall, 1985.

[3] D. Salomon, Data Compression. New York: Springer, 1998.

[4] S. Kullback, Information Theory and Statistics. New York: Wiley, 1959.

[5] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[6] G. Jumarie, Relative Information. New York: Springer, 1990.

[7] M. Rao, Y. Chen, B. C. Vemuri, and F. Wang, “Cumulative residual entropy, a new measure of information,” IEEE Transactions on Information Theory, vol. 50, no. 6, pp. 1220–1228, June 2004.

[8] M. Asadi and Y. Zohrevand, “On the dynamic cumulative residual entropy,” Unpublished Manuscript, 2006.

[9] H. Chui, L. Win, R. Schultz, J. Duncan, and A. Rangarajan, “A unified non-rigid feature registration method for brain mapping,” Medical Image Analysis, vol. 7, no. 2, pp. 112–130, 2003.

[10] N. Paragios, M. Rousson, and V. Ramesh, “Non-rigid registration using distance functions,” Comput. Vis. Image Underst., vol. 89, no. 2-3, pp. 142–165, 2003.

[11] M. A. Audette, K. Siddiqi, F. P. Ferrie, and T. M. Peters, “An integrated range-sensing, segmentation and registration framework for the characterization of intra-surgical brain deformations in image-guided surgery,” Comput. Vis. Image Underst., vol. 89, no. 2-3, pp. 226–251, 2003.

[12] A. Leow, P. M. Thompson, H. Protas, and S.-C. Huang, “Brain warping with implicit representations,” in International Symposium on Biomedical Imaging, 2004, pp. 603–606.

[13] B. Jian and B. C. Vemuri, “A robust algorithm for point set registration using mixture of gaussians,” in IEEE International Conference on Computer Vision, 2005, pp. 1246–1251.


[14] F. Wang, B. C. Vemuri, A. Rangarajan, I. M. Schmalfuss, and S. J. Eisenschenk, “Simultaneous nonrigid registration of multiple point sets and atlas construction,” in European Conference on Computer Vision, 2006, pp. 551–563.

[15] S. J. H. Guo, A. Rangarajan, “A new joint clustering and diffeomorphism estimation algorithm for non-rigid shape matching,” in IEEE Computer Vision and Pattern Recognition, 2004, pp. 16–22.

[16] M. Irani and P. Anandan, “Robust Multi-sensor Image Alignment,” in International Conference on Computer Vision, Bombay, India, 1998, pp. 959–965.

[17] J. Liu, B. C. Vemuri, and J. L. Marroquin, “Local frequency representations for robust multimodal image registration,” IEEE Transactions on Medical Imaging, vol. 21, no. 5, pp. 462–469, 2002.

[18] M. Mellor and M. Brady, “Non-rigid multimodal image registration using local phase,” in Medical Image Computing and Computer-Assisted Intervention, Saint-Malo, France, Sep 2004, pp. 789–796.

[19] B. Zitova and J. Flusser, “Image registration methods: a survey,” Image Vision Comput., vol. 21, no. 11, pp. 977–1000, 2003.

[20] J. Ruiz-Alzola, C.-F. Westin, S. K. Warfield, A. Nabavi, and R. Kikinis, “Nonrigid registration of 3d scalar vector and tensor medical data,” in Third International Conference on Medical Image Computing and Computer-Assisted Intervention, A. M. DiGioia and S. Delp, Eds., Pittsburgh, October 11–14 2000, pp. 541–550.

[21] L. Marroquin, B. Vemuri, S. Botello, F. Calderon, and A. Fernandez-Bouzas, “An accurate and efficient bayesian method for automatic segmentation of brain mri,” in IEEE Transactions on Medical Imaging, 2002, pp. 934–945.

[22] B. C. Vemuri, J. Ye, Y. Chen, and C. M. Leonard, “A level-set based approach to image registration,” in IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, 2000, pp. 86–93.

[23] P. Hellier, C. Barillot, E. Mémin, and P. Pérez, “Hierarchical estimation of a dense deformation field for 3d robust registration,” IEEE Transactions on Medical Imaging, vol. 20, no. 5, pp. 388–402, May 2001.

[24] R. Szeliski and J. Coughlan, “Spline-based image registration,” Int. J. Comput. Vision, vol. 22, no. 3, pp. 199–218, March 1997.

[25] S. H. Lai and M. Fang, “Robust and efficient image alignment with spatially-varying illumination models,” in IEEE Conference on Computer Vision and Pattern Recognition, 1999, pp. II: 167–172.

[26] A. Guimond, A. Roche, N. Ayache, and J. Meunier, “Three-Dimensional Multimodal Brain Warping Using the Demons Algorithm and Adaptive Intensity Corrections,” IEEE Transactions on Medical Imaging, vol. 20, no. 1, pp. 58–69, 2001.

[27] J.-P. Thirion, “Image matching as a diffusion process: an analogy with maxwell’s demons,” Medical Image Analysis, vol. 2, no. 3, pp. 243–260, 1998.

[28] A. Cuzol, P. Hellier, and E. Memin, “A novel parametric method for non-rigid image registration,” in Proc. Information Processing in Medical Imaging (IPMI’05), ser. LNCS, G. Christensen and M. Sonka, Eds., no. 3565, Glenwood Springs, Colorado, USA, July 2005, pp. 456–467.

[29] A. W. Toga and P. M. Thompson, “The role of image registration in brain mapping,” Image Vision Comput., vol. 19, no. 1-2, pp. 3–24, 2001.

[30] E. D’Agostino, F. Maes, D. Vandermeulen, and P. Suetens, “Non-rigid atlas-to-image registration by minimization of class-conditional image entropy,” in Medical Image Computing and Computer-Assisted Intervention, 2004, pp. 745–753.

[31] P. A. Viola and W. M. Wells, “Alignment by maximization of mutual information,” in IEEE International Conference on Computer Vision, MIT, Cambridge, 1995.

[32] A. Collignon, F. Maes, D. Delaere, D. Vandermeulen, P. Suetens, and G. Marchal, “Automated multimodality image registration based on information theory,” Proc. Information Processing in Medical Imaging, pp. 263–274, 1995.

[33] C. Studholme, D. Hill, and D. J. Hawkes, “Automated 3D registration of MR and CT images in the head,” Medical Image Analysis, vol. 1, no. 2, pp. 163–175, 1996.

[34] D. Mattes, D. R. Haynor, H. Vesselle, T. K. Lewellen, and W. Eubank, “Pet-ct image registration in the chest using free-form deformations,” IEEE Transactions on Medical Imaging, vol. 22, no. 1, pp. 120–128, 2003.

[35] D. Rueckert, A. F. Frangi, and J. A. Schnabel, “Automatic construction of 3d statistical deformation models of the brain using non-rigid registration,” IEEE Transactions on Medical Imaging, vol. 22, no. 8, pp. 1014–1025, 2003.

[36] G. Hermosillo, C. Chefd’hotel, and O. Faugeras, “Variational methods for multimodal image matching,” Int. J. Comput. Vision, vol. 50, no. 3, pp. 329–343, 2002.

[37] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. G. Hill, M. O. Leach, and D. J. Hawkes, “Nonrigid registration using free-form deformations: Application to breast mr images,” IEEE Transactions on Medical Imaging, vol. 18, no. 8, pp. 712–721, August 1999.

[38] M. E. Leventon and W. E. L. Grimson, “Multimodal volume registration using joint intensity distributions,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), Cambridge, MA, 1998, pp. 1057–1066.


[39] T. Gaens, F. Maes, D. Vandermeulen, and P. Suetens, “Non-rigid multimodal image registration using mutual information,” in Proc. Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 1998, pp. 1099–1106.

[40] D. Loeckx, F. Maes, D. Vandermeulen, and P. Suetens, “Nonrigid image registration using free-form deformations with a local rigidity constraint,” in Medical Image Computing and Computer-Assisted Intervention, 2004, pp. 639–646.

[41] G. K. Rohde, A. Aldroubi, and B. M. Dawant, “The adaptive bases algorithm for intensity based nonrigid image registration,” IEEE Transactions on Medical Imaging, vol. 22, no. 11, pp. 1470–1479, 2003.

[42] V. Duay, P.-F. D’Haese, R. Li, and B. M. Dawant, “Non-rigid registration algorithm with spatially varying stiffness properties,” in International Symposium on Biomedical Imaging, 2004, pp. 408–411.

[43] C. Guetter, C. Xu, F. Sauer, and J. Hornegger, “Learning based non-rigid multi-modal image registration using kullback-leibler divergence,” in Medical Image Computing and Computer-Assisted Intervention, 2005, pp. 255–262.

[44] E. D’Agostino, F. Maes, D. Vandermeulen, and P. Suetens, “An information theoretic approach for non-rigid image registration using voxel class probabilities,” Medical Image Analysis, vol. 10, no. 3, pp. 413–431, 2006.

[45] C. Davatzikos, “Spatial transformation and registration of brain images using elastically deformable models,” Comput. Vis. Image Underst., vol. 66, no. 2, pp. 207–222, 1997.

[46] J. C. Gee, M. Reivich, and R. Bajcsy, “Elastically deforming 3d atlas to match anatomical brain images,” J. Comput. Assist. Tomogr., vol. 17, no. 2, pp. 225–236, 1993.

[47] M. Bro-Nielsen and C. Gramkow, “Fast fluid registration of medical images,” in Proc. of the 4th International Conference on Visualization in Biomedical Computing. London, UK: Springer-Verlag, 1996, pp. 267–276.

[48] G. E. Christensen, R. D. Rabbitt, and M. I. Miller, “Deformable templates using large deformation kinematics,” IEEE Transactions On Image Processing, vol. 5, no. 10, pp. 1435–1447, October 1996.

[49] X. Geng, D. Kumar, and G. E. Christensen, “Transitive inverse-consistent manifold registration,” in Proc. Information Processing in Medical Imaging, 2005, pp. 468–479.

[50] A. Trouve, “Diffeomorphisms groups and pattern matching in image analysis,” Int. J. Comput. Vision, vol. 28, no. 3, pp. 213–221, 1998.

[51] D. R. Forsey and R. H. Bartels, “Hierarchical b-spline refinement,” Computer Graphics, vol. 22, no. 4, pp. 205–212, 1988.


[52] C. Cocosco, V. Kollokian, R.-S. Kwan, and A. Evans, “Brainweb: online interface to a 3-d mri simulated brain database,” 1997, last accessed: July 2005. [Online]. Available: http://www.bic.mni.mcgill.ca/brainweb/

[53] P. Thevenaz and M. Unser, “Optimization of mutual information for multiresolution image registration,” IEEE Transactions on Image Processing, vol. 9, no. 12, pp. 2083–2099, December 2000.

[54] T. Chan and L. Vese, “An active contour model without edges,” in Intl. Conf. on Scale-space Theories in Computer Vision, 1999, pp. 266–277.

[55] N. Duta, A. K. Jain, and M.-P. Dubuisson-Jolly, “Automatic construction of 2d shape models,” IEEE Transactions Pattern Anal. Mach. Intell., vol. 23, no. 5, pp. 433–446, 2001.

[56] E. Klassen, A. Srivastava, W. Mio, and S. H. Joshi, “Analysis of planar shapes using geodesic paths on shape spaces,” IEEE Transactions Pattern Anal. Mach. Intell., vol. 26, no. 3, pp. 372–383, 2003.

[57] T. B. Sebastian, P. N. Klein, B. B. Kimia, and J. J. Crisco, “Constructing 2d curve atlases,” in IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, Washington, DC, USA, 2000, pp. 70–77.

[58] H. Tagare, “Shape-based nonrigid correspondence with application to heart motion analysis,” IEEE Transactions on Medical Imaging, vol. 18, no. 7, pp. 570–579, 1999.

[59] F. L. Bookstein, “Principal warps: Thin-plate splines and the decomposition of deformations,” IEEE Transactions Pattern Anal. Mach. Intell., vol. 11, no. 6, pp. 567–585, 1989.

[60] H. Chui, A. Rangarajan, J. Zhang, and C. M. Leonard, “Unsupervised learning of an atlas from unlabeled point-sets,” IEEE Transactions Pattern Anal. Mach. Intell., vol. 26, no. 2, pp. 160–172, 2004.

[61] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Transactions Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, 2002.

[62] H. Chui and A. Rangarajan, “A new algorithm for non-rigid point matching,” in IEEE Computer Vision and Pattern Recognition, 2000, pp. 2044–2051.

[63] H. Guo, A. Rangarajan, S. Joshi, and L. Younes, “Non-rigid registration of shapes via diffeomorphic point matching,” in International Symposium on Biomedical Imaging, 2004, pp. 924–927.

[64] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active shape models: their training and application,” Comput. Vis. Image Underst., vol. 61, no. 1, pp. 38–59, 1995.


[65] Y. Wang and L. H. Staib, “Boundary finding with prior shape and smoothness models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 7, pp. 738–743, 2000.

[66] A. Hill, C. J. Taylor, and A. D. Brett, “A framework for automatic landmark identification using a new method of nonrigid correspondence,” IEEE Transactions Pattern Anal. Mach. Intell., vol. 22, no. 3, pp. 241–251, 2000.

[67] Y. Tsin and T. Kanade, “A correlation-based approach to robust point set registration,” in European Conference on Computer Vision, 2004, pp. 558–569.

[68] J. Glaunes, A. Trouve, and L. Younes, “Diffeomorphic matching of distributions: A new approach for unlabelled point-sets and sub-manifolds matching,” in IEEE Computer Vision and Pattern Recognition, 2004, pp. 712–718.

[69] J. Lin, “Divergence measures based on the shannon entropy,” IEEE Transactions Information Theory, vol. 37, pp. 145–151, 1991.

[70] A. Hero, O. M. B. Ma, and J. Gorman, “Applications of entropic spanning graphs,” IEEE Transactions Signal Processing, vol. 19, pp. 85–95, 2002.

[71] Y. He, A. Ben-Hamza, and H. Krim, “A generalized divergence measure for robust image registration,” IEEE Transactions Signal Processing, vol. 51, pp. 1211–1220, 2003.

[72] D. M. Endres and J. E. Schindelin, “A new metric for probability distributions,” IEEE Transactions Information Theory, vol. 49, pp. 1858–60, 2003.

[73] H. Chui and A. Rangarajan, “A new point matching algorithm for non-rigid registration,” Computer Vision and Image Understanding (CVIU), vol. 89, pp. 114–141, 2003.

[74] G. McLachlan and K. Basford, Mixture Model: Inference and Applications to Clustering. New York: Marcel Dekker, 1988.

[75] A. L. Yuille, P. Stolorz, and J. Utans, “Statistical physics, mixtures of distributions, and the em algorithm,” Neural Comput., vol. 6, no. 2, pp. 334–340, 1994.

[76] R. Bansal, L. Staib, Z. Chen, A. Rangarajan, J. Knisely, R. Nath, and J. Duncan, “Entropy-based, multiple-portal-to-3d ct registration for prostate radiotherapy using iteratively estimated segmentation,” in Medical Image Computing and Computer-Assisted Intervention, 1999, pp. 567–578.

[77] A. Yezzi, L. Zollei, and T. Kapur, “A variational framework for joint segmentation and registration,” in IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, 2001, pp. 388–400.

[78] P. Wyatt and J. Noble, “Mrf-map joint segmentation and registration,” in Medical Image Computing and Computer-Assisted Intervention, 2002, pp. 580–587.


[79] N. Paragios, M. Rousson, and V. Ramesh, “Knowledge-based registration & segmentation of the left ventricle: A level set approach,” in WACV, 2002, pp. 37–42.

[80] B. Fischl, D. Salat, E. Buena, and M. A. et al., “Whole brain segmentation: Automated labeling of the neuroanatomical structures in the human brain,” in Neuron, vol. 33, 2002, pp. 341–355.

[81] S. Soatto and A. J. Yezzi, “Deformotion: Deforming motion, shape average and the joint registration and segmentation of images,” in European Conference on Computer Vision, 2002, pp. 32–57.

[82] B. C. Vemuri, Y. Chen, and Z. Wang, “Registration assisted image smoothing and segmentation,” in European Conference on Computer Vision, 2002, pp. 546–559.

[83] T. Zhang and D. Freedman, “Tracking objects using density matching and shape priors,” in IEEE International Conference on Computer Vision, 2003, pp. 1056–1062.

[84] F. Wang, B. C. Vemuri, M. Rao, and Y. Chen, “A new & robust information theoretic measure and its application to image alignment,” in Proc. Information Processing in Medical Imaging, 2003, pp. 388–400.

BIOGRAPHICAL SKETCH

Fei Wang was born in Yan Cheng, JiangSu, P. R. China. He received his Bachelor of

Science degree from the University of Science and Technology of China, P. R. China, in

2001. He earned his Master of Science and Doctor of Philosophy degrees from the

University of Florida in December 2002 and August 2006, respectively. His research

interests include medical imaging, computer vision, pattern recognition, computer

graphics and shape modeling.
